As I mentioned in the previous post, my internship will mostly focus on designing and implementing a work-sharing framework for GROUPING SETS on Apache Spark. In this post, I will briefly discuss the concept of the framework and its design.
Many sharing frameworks (for scans and computation) have been proposed, both in traditional database systems and in distributed systems. As far as I know, most of the work on distributed systems targets Apache Hadoop. There are some well-known frameworks on this topic: Nova, MR-Share, Co-Scan… You can also find the documents about these frameworks on my GitHub.
I did a lot of searching, but I found nothing about work-sharing on Apache Spark. This may be the first work-sharing framework on Apache Spark (hopefully). OK, let's start tackling the problems.
What do I need to deal with in a work-sharing framework for Apache Spark?
- Each Spark program has its own RDD DAG. How can we centralize all these DAGs to perform the optimization?
- Spark modes: standalone mode and cluster mode, and Spark's behaviour in each mode. To be clearer: how does a user submit a Spark program? What do the submission and scheduling processes look like?
- Scheduling: how many jobs from users can be optimized at once?
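To make the first problem concrete, here is a toy sketch (plain Python, not Spark code; all names and the plan representation are hypothetical) of what centralizing DAGs buys us: two jobs that scan the same input can be merged into one DAG with a single shared scan node.

```python
# Each job's plan is a list of (operator, argument) steps, root last.
# Two hypothetical jobs that both scan the same dataset:
job_a = [("scan", "sales.parquet"), ("groupby", "region")]
job_b = [("scan", "sales.parquet"), ("groupby", "product")]

def merge_plans(plans):
    """Merge several linear plans into one DAG, deduplicating identical scans."""
    shared = {}    # (op, arg) -> node id, so identical scans map to one node
    edges = []     # (child_id, parent_id) pairs of the merged DAG
    roots = []     # the final operator of each original job
    next_id = 0
    for plan in plans:
        prev = None
        for step in plan:
            if step[0] == "scan" and step in shared:
                node = shared[step]          # reuse the already-planned scan
            else:
                node = next_id
                next_id += 1
                if step[0] == "scan":
                    shared[step] = node
            if prev is not None:
                edges.append((prev, node))
            prev = node
        roots.append(prev)
    return edges, roots

edges, roots = merge_plans([job_a, job_b])
# Both group-bys now hang off a single shared scan node (id 0).
print(edges)   # [(0, 1), (0, 2)]
```

The real optimization on Spark would of course operate on RDD lineage graphs rather than these toy tuples, but the dedup-by-common-prefix idea is the same.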
A little bit about the Spark submission process
- Each Spark application has its own SparkContext (think of it as the driver, to make it easier to understand), which functions as a DAGScheduler, a TaskManager, a TaskScheduler, and so on.
- In both standalone mode and cluster mode, the "master" node always functions as a resource manager. Everything else is handled by the SparkContext.
- A user can submit a Spark application in two ways:
  - Use spark-shell for interactive sessions
  - Use spark-submit to submit jar files to the master
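For reference, these two submission paths look roughly like the following against a standalone master (the host, port, class name, and jar name below are placeholders):

```shell
# Interactive: start a shell whose SparkContext connects to the master.
spark-shell --master spark://master-host:7077

# Batch: submit a packaged application (jar) to the same master.
spark-submit \
  --master spark://master-host:7077 \
  --class com.example.MyApp \
  my-app.jar
```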
Based on the problems I proposed above, we came up with a design for this framework. I quickly sketched the design, as you can see in the figure below.
- Centralization: we need a "Spark Server" to centralize all the DAGs from the submitted Spark applications. The SparkContexts at the "Spark clients" are now only used to submit their DAGs to the Spark Server. At the Spark Server, a new DAG is created, and all the optimizations are performed on this new DAG. The server is also in charge of scheduling and submitting the new jobs to the master.
- Scheduling: we need a queue for incoming Spark jobs, and also a deadline to decide when we should apply the work-sharing to these jobs. I will discuss this problem later.
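The scheduling idea can be sketched as a small deadline-based batching queue (again a hypothetical sketch in plain Python, not part of Spark: `SharingQueue`, `deadline_secs`, and the string "DAGs" are all illustrative names): jobs accumulate in the queue, and once the deadline measured from the first queued job expires, the whole batch is handed to the optimizer together instead of one job at a time.

```python
import time
from collections import deque

class SharingQueue:
    """Collect submitted DAGs and release them as one batch after a deadline."""

    def __init__(self, deadline_secs, clock=time.monotonic):
        self.deadline_secs = deadline_secs
        self.clock = clock            # injectable clock, handy for testing
        self.jobs = deque()
        self.batch_started = None

    def submit(self, dag):
        if not self.jobs:
            self.batch_started = self.clock()  # deadline starts at first job
        self.jobs.append(dag)

    def maybe_flush(self):
        """Return the batch to co-optimize if the deadline has passed, else None."""
        if self.jobs and self.clock() - self.batch_started >= self.deadline_secs:
            batch = list(self.jobs)
            self.jobs.clear()
            return batch
        return None

# Usage with a fake clock so the example is deterministic.
t = [0.0]
q = SharingQueue(deadline_secs=5.0, clock=lambda: t[0])
q.submit("dag-1")
q.submit("dag-2")
t[0] = 6.0                       # pretend 6 seconds have passed
print(q.maybe_flush())           # ['dag-1', 'dag-2']
```

The trade-off the deadline encodes: a longer deadline lets more jobs accumulate and share work, but delays each individual job's start.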
Note that this design is still in an alpha phase, and it can change whenever I find a better solution. In the next part, I will go deeper into the technical details of this design.