[Arch] SparkContext and its components

When you work with Spark or read documents about Spark, you will definitely run into SparkContext, which lives inside the driver on the client side. This made me confused and curious when I first heard about it, so I decided to dig into it. To summarize it in a few words, I would say that SparkContext is, in general, the brain of Spark. It sits on the client side, but in some sense it acts like a master: submitting jobs, scheduling jobs, handling failed jobs, working closely with the executors…
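
To make this concrete, below is a minimal Scala sketch (Spark 1.x style; the app name and the local master URL are placeholders I made up) of a driver program. Everything described in this post gets wired up inside the single new SparkContext(conf) call.

    import org.apache.spark.{SparkConf, SparkContext}

    object SparkContextSketch {
      def main(args: Array[String]): Unit = {
        // Placeholder app name and a local master; on a real cluster this would be
        // a standalone / YARN / Mesos master URL instead.
        val conf = new SparkConf()
          .setAppName("spark-context-demo")
          .setMaster("local[2]")

        // Creating the SparkContext builds the DAGScheduler, TaskScheduler,
        // SchedulerBackend, SparkEnv, listener bus, etc. described below.
        val sc = new SparkContext(conf)

        // Every action goes through that machinery: sc is the single entry point.
        val sum = sc.parallelize(1 to 100).reduce(_ + _)
        println(s"sum = $sum")

        sc.stop()
      }
    }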

I provide here a figure describing the important components inside a SparkContext. SparkContext lives in the spark-core package; refer to this post to see Spark’s packages. Descriptions of the components (some taken from the Spark documentation, some in my own words) follow the figure.

(Figure: spark-context-components)

  • listenerBus: an asynchronous listener bus for Spark events. SparkContext lives inside the driver, which sits on the client machine, so it is easy to understand that it needs a channel to communicate with the cluster manager and the executors.
    • SparkListener: Interface for listening to events from the Spark scheduler.
    • JobProgressListener: Tracks task-level information to be displayed in the UI.
  • RDD Graph: We have the concept of a DAG of RDDs in Spark. You can imagine it as a linked list of RDDs. Each RDD holds a reference to the SparkContext it belongs to and to its dependencies. The benefits of building the list of RDDs as a DAG: it is easy to maintain and to reconstruct RDD transformations when data is lost, and easy to split the graph into stages. More information about RDDs can be found in Matei’s thesis. A small lineage sketch follows this list.
  • DAGScheduler: in charge of splitting the DAG into stages and submitting each stage when it is ready.
    • DAGScheduler keeps several HashMaps and HashSets to track the status of jobs and stages.
    • DAGScheduler also has an event loop to handle events: DAGSchedulerEventProcessLoop, which processes DAGSchedulerEvent messages. The DAGScheduler uses an event queue architecture where any thread can post an event, but there is a single “logic” thread that reads these events and takes decisions. This greatly simplifies synchronization.
  • TaskScheduler: Low-level task scheduler interface. This interface allows plugging in different task schedulers. Each TaskScheduler schedules tasks for a single SparkContext. These schedulers get sets of tasks submitted to them from the DAGScheduler for each stage, and are responsible for sending the tasks to the cluster, running them, retrying if there are failures, and mitigating stragglers. They return events to the DAGScheduler.
  • TaskSchedulerImpl: the main implementation of TaskScheduler. It schedules tasks for multiple types of clusters by acting through a SchedulerBackend. It handles common logic, like determining a scheduling order across jobs, waking up to launch speculative tasks…
  • SchedulerBackend: A backend interface for scheduling systems that allows plugging in different ones under TaskSchedulerImpl.
  • Backend Impl: a concrete implementation of a backend. It holds the executors and is in charge of launching tasks on them. I will come back to it in the next post about the job submission process and the life of a Spark job. At the moment, Spark provides these types of backend:
    • LocalBackend
    • SparkDeploySchedulerBackend
    • YarnClusterSchedulerBackend
    • YarnClientSchedulerBackend
    • CoarseMesosSchedulerBackend
    • SimrSchedulerBackend
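
To illustrate the RDD graph mentioned above, here is a tiny lineage sketch, written as you would type it into the spark-shell (which already provides sc); the names are mine, just for illustration. Each transformation adds a node that remembers its parents, and the DAGScheduler cuts this graph into stages at the shuffle boundary (reduceByKey here):

    // In the spark-shell, `sc` is the SparkContext created for you.
    val words  = sc.parallelize(Seq("a", "b", "a", "c"))   // leaf RDD, no parents
    val pairs  = words.map(w => (w, 1))                    // narrow dependency on words
    val counts = pairs.reduceByKey(_ + _)                  // wide (shuffle) dependency on pairs

    // toDebugString prints the lineage: the chain of RDDs and their dependencies
    // that Spark walks to recompute lost partitions. The shuffle at reduceByKey
    // is where the DAGScheduler splits this small DAG into two stages.
    println(counts.toDebugString)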

* SparkEnv contains many “important” components on the client side. In this post I only briefly explain those components, so that you can get an idea of their functions.

  • SparkEnv: Holds all the runtime environment objects for a running Spark instance (either master or worker), including the serializer, Akka actor system, block manager, map output tracker, etc. Currently Spark code finds the SparkEnv through a global variable, so all the threads can access the same SparkEnv.
    • MetricSystem: the Spark metrics system. It is created per “instance” and combines sources and sinks: it periodically polls metrics data from the sources and pushes it to the sink destinations. This can be configured through the metrics.properties file. Got confused? I post here some lines from the metrics.properties file:
        #Enable ConsoleSink for all instances by class name
        *.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink

        #Polling period for ConsoleSink
        *.sink.console.period=10
        *.sink.console.unit=seconds

        #Master instance overlap polling period
        master.sink.console.period=15
        master.sink.console.unit=seconds

Still confused? Go directly to that file and read it 😉

    • Serializer: provides a thread-safe library for serialization.
    • CacheManager: responsible for passing RDDs partition contents to the BlockManager and making sure a node doesn’t load two copies of an RDD at once.
    • BlockManager: Manager running on every node (driver and executors) which provides interfaces for putting and retrieving blocks both locally and remotely into various stores (memory, disk, and off-heap).
    • MapOutputTracker: keeps track of the location of the map output of a stage. We have two types of tracker: master and worker. The difference is in the way they use HashMaps to keep track of the information.
    • ShuffleManager: Pluggable interface for shuffle systems. Hash and Sort are two techniques currently used in shuffle phase of Spark.
    • BroadcastManager: manages broadcast objects (a small usage sketch follows this list).
    • ShuffleMemoryManager: Allocates a pool of memory to task threads for use in shuffle operations.
    • OutputCommitCoordinator: Authority that decides whether tasks can commit output to HDFS. Uses a “first committer wins” policy. It is instantiated on both the driver side and the worker side.
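
To give a feeling for how a few of these SparkEnv components show up in user code, here is a rough sketch (my own toy example, assuming an existing SparkContext named sc): broadcast() goes through the BroadcastManager and the BlockManager, while persist() asks the CacheManager/BlockManager to keep the computed partitions around.

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    def sparkEnvSketch(sc: SparkContext): Unit = {
      // Shipped once to each executor via the BroadcastManager / BlockManager.
      val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

      val data = sc.parallelize(Seq("a", "b", "c"))
        .map(k => lookup.value.getOrElse(k, 0))
        .persist(StorageLevel.MEMORY_ONLY)   // cached blocks are stored by the BlockManager

      println(data.reduce(_ + _))   // first action computes and caches the partitions
      println(data.reduce(_ + _))   // second action reads the cached blocks back
    }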

* Some thoughts:

  • The component that surprises me the most is SparkEnv. Its name sounds insignificant, but it includes a lot of cool stuff.
  • DAGScheduler also keeps track of failed stages and is in charge of resubmitting them (to the TaskScheduler).
  • Spark allows multiple SparkContexts inside one driver by configuring “spark.driver.allowMultipleContexts”!! By the way, I haven’t tested it yet; results will be posted as soon as possible (an untested sketch follows this list).
  • One user can run more than one SparkContext at the same time, in different processes (but only one Spark Web UI can be created, due to port binding problems).
  • I am still curious about the job scheduling process. I did understand the flow of the job submission process, but I will surely run into many questions about scheduling, so I will definitely come back to it.
  • Actually, as far as I know, the shuffle phase of Spark is a big improvement compared to Hadoop’s. It’s another story and (hopefully) will be in the next post.
  • Again, and also to conclude, I would like to repeat that SparkContext is the brain of Spark, where all the intelligent stuff happens.
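
Regarding the multiple-contexts point above, this is the sketch I have in mind (untested, as I said; the app name is just a placeholder). By default Spark refuses to create a second SparkContext in the same JVM unless this flag is set.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("multi-context-experiment")              // placeholder name
      .setMaster("local[2]")
      .set("spark.driver.allowMultipleContexts", "true")

    val sc1 = new SparkContext(conf)
    val sc2 = new SparkContext(conf)   // without the flag, this second context throws an exception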

[Update]

  • I found a very nice document about the shuffle phase of Spark. The only improvement I can figure out at the moment is that Spark lets the user choose whether or not to sort in the shuffle phase. This is useful because many computations do not need a sort in the shuffle phase (Hadoop always sorts the map output in its shuffle phase). You can check the document on my GitHub or here. A one-line configuration sketch follows this list.
  • As I promised, the post about Spark job submission.
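
For reference, choosing the shuffle implementation is just a configuration switch. This is a minimal sketch assuming Spark 1.x, where spark.shuffle.manager accepts "hash" or "sort" (sort has been the default since 1.2); the app name is a placeholder.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("shuffle-choice-demo")         // placeholder name
      .set("spark.shuffle.manager", "hash")      // hash-based shuffle: map output is not sorted
      // .set("spark.shuffle.manager", "sort")   // sort-based shuffle (the default since Spark 1.2)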