[Sysdeg] Worksharing framework and its design – Part 1

As I mentioned in the previous post, my internship will mostly focus on designing and implementing a work-sharing framework for GROUPING SETS on Apache Spark. In this post, I will briefly discuss the concept of the framework and its design.

Actually, many sharing frameworks (scan sharing, computation sharing) have been proposed, both in traditional database systems and in distributed systems. As far as I know, most of the work on distributed systems targets Apache Hadoop. There are some well-known frameworks on this topic: Nova, MR-Share, Co-Scan… You can also find documents about these frameworks on my GitHub.

I did a lot of searching, but I didn’t find anything about work-sharing on Apache Spark. Maybe this will be the first work-sharing framework on Apache Spark (hopefully). OK, let’s start tackling the problems.

What do I need to deal with in a work-sharing framework for Apache Spark?

  • Each Spark program has its own RDD DAG. How can we centralize all these DAGs to perform the optimization? (See the sketch after this list.)
  • Spark modes: standalone mode, cluster mode, and Spark’s behaviour in each mode. To be clearer: how does a user submit a Spark program? What do the submission and scheduling processes look like?
  • Scheduling: How many jobs from users can be optimized at once?
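
To make the first point concrete, here is a minimal sketch (in Scala, with hypothetical application names and HDFS paths) of two independent Spark applications whose RDD DAGs share a common prefix: the same scan and the same filter. Because each application has its own SparkContext, Spark sees two unrelated DAGs and executes the shared scan twice; this is exactly the kind of overlap a work-sharing framework would like to detect and execute only once.

```scala
// Two independent applications with overlapping RDD lineages.
// Application names and HDFS paths below are only illustrative.
import org.apache.spark.{SparkConf, SparkContext}

object AppA {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AppA"))
    val logs   = sc.textFile("hdfs:///data/logs")           // shared scan
    val errors = logs.filter(_.contains("ERROR"))           // shared filter
    println(errors.count())                                  // part of the DAG unique to App A
    sc.stop()
  }
}

object AppB {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AppB"))
    val logs   = sc.textFile("hdfs:///data/logs")           // same scan, executed again
    val errors = logs.filter(_.contains("ERROR"))           // same filter, executed again
    errors.map(_.length).saveAsTextFile("hdfs:///out/appB") // part of the DAG unique to App B
    sc.stop()
  }
}
```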

A little bit about the Spark submission process

Spark Scheduling and Submission Process

  • Each Spark application has its own SparkContext (the driver, to put it simply), which functions as the DAGScheduler, the task manager, the TaskScheduler, and so on…
  • In both standalone mode and cluster mode, the “master” node always functions as a resource manager. Everything else is done by the SparkContext.
  • A user can submit a Spark application in two ways (a minimal example follows this list):
    • use spark-shell for interactive work;
    • use spark-submit to submit a jar file to the master.
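
Below is a minimal sketch of an application that could be packaged into a jar and handed to spark-submit; the class name, jar name, master URL, and HDFS paths are all hypothetical. The same logic could also be typed line by line into spark-shell, which creates the SparkContext for you.

```scala
// Submit (hypothetical values):
//   spark-submit --class example.WordCount --master spark://master:7077 wordcount.jar
package example

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    sc.textFile("hdfs:///data/input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs:///data/output")
    sc.stop()
  }
}
```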

Based on the problems listed above, we came up with a design for this framework. I have quickly sketched the design, as you can see in the figure below.

[Figure: sys-design – sketch of the work-sharing framework design]

  • Centralization: we need a “Spark Server” to centralize all the DAGs from the submitted Spark applications. The SparkContexts at the “Spark clients” are now only used to submit their DAGs to the Spark Server. At the Spark Server, a new DAG is created, and all the optimizations are performed on this new DAG. The Spark Server is also in charge of scheduling and submitting the new jobs to the master.
  • Scheduling: we need a queue to hold the submitted Spark jobs, and also a deadline to decide when we should perform the “work-sharing” on these jobs (a rough sketch of this loop follows this list). I will discuss this problem in more detail later.
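
The following is only a rough sketch of the server loop described above, not a real implementation; every name in it (SubmittedDag, SparkServer, mergeAndOptimize, and so on) is hypothetical. Clients submit their DAGs, the server queues them, and when the deadline expires the queued DAGs are merged into one combined DAG, optimized for work sharing, and submitted to the master as a single batch.

```scala
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Placeholder for whatever representation of a client's DAG the server receives.
case class SubmittedDag(appName: String, dag: String)

class SparkServer(batchWindowMs: Long) {
  private val queue = new LinkedBlockingQueue[SubmittedDag]()

  // Called by the "Spark clients" instead of executing the DAG themselves.
  def submit(dag: SubmittedDag): Unit = queue.put(dag)

  // Main loop: collect DAGs until the deadline, then optimize whatever has arrived.
  def run(): Unit = {
    while (true) {
      val batch     = ArrayBuffer[SubmittedDag]()
      val deadline  = System.currentTimeMillis() + batchWindowMs
      var remaining = batchWindowMs
      while (remaining > 0) {
        val dag = queue.poll(remaining, TimeUnit.MILLISECONDS)
        if (dag != null) batch += dag
        remaining = deadline - System.currentTimeMillis()
      }
      if (batch.nonEmpty) {
        val merged = mergeAndOptimize(batch.toSeq)  // build the new, shared DAG
        submitToMaster(merged)                      // schedule it on the cluster
      }
    }
  }

  // The actual work-sharing optimization; left abstract in this sketch.
  private def mergeAndOptimize(batch: Seq[SubmittedDag]): String =
    batch.map(_.dag).mkString(" ++ ")

  private def submitToMaster(mergedDag: String): Unit =
    println(s"Submitting merged DAG to master: $mergedDag")
}
```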

Note that this design is still in an alpha phase, and it can change whenever I find a better solution. In the next part, I will go deeper into the technical details of this design.
