After 6 months working, I’ve just defensed my master thesis at the middle of September. The system works properly and efficiently.
So sorry since there wasn’t any posts during July to August. As the previous posts, I had done with the design of my system, the rest that I need to do is implementing them.
The implementation of my system seems not really too difficult, except the implementation of Simultaneous Pipeline technique of MRShare.
All the implementation of my system is available of the Github of our group. In this post, I just want to discuss about Simultanous Pipeline technique.
In general, this technique is used when merging multiple jobs read the same input file into one meta job. The technique has been used in Multi Query Optimization of Apache Pig. In MRShare, they proposed this technique on Hadoop MapReduce as an automatic module. On Apache Spark, to the best of our knowledges, all the query optimizations are currently focusing on single query, not multi query. We reintroduce the simultaneous pipeline technique and implement it on Apache Spark.
There are 4 new RDDs that we create to implement the technique.
- MuxRDD: The MuxRDD is used to bu↵er the input record and multiplex it into multiple pipelines, which represent multiple jobs.
- LabellingRDD: The LabellingRDD is used to attach the label into each tuple before it is shu✏ed over the network. The label is an integer which is attached to the key of the tuple. So, the old tuple [Key, Value] becomes [(Label, Key), Value].
- DispatchRDD: The DispatchRDD is used to route the tuple to the right pipeline by using the label attached inside it.
- PullRDD: Because Spark is based on Pull Mechanism, which means the child RDD asks its parent RDDs to give it the tuple. In default, the saveAsTextFile action has the puller to trigger the job. To keep Spark at it is, we create our new puller called PullRDD.
We also modify some parts of the shuffle component on Apache Spark:
- ShuffledRDD: the ShuffledRDD contains only one Aggregator, the object holds the aggregate function. In our case, to merge multiple jobs which have different Aggregators, we create a list of Aggregators to hold their Aggregators.
- ExternalSorter: before writing to files and shuffling them over the network, Spark does a local aggregation (it is a combiner in Hadoop MapReduce), so we also need to apply the correct aggregate function to the tuple by checking the label attached into it.
- AppendOnlyMap: when the aggregation happens in the other stage after getting the data over the network (in Hadoop MapReduce, it is the Reduce phase), again, we need to apply the correct aggregate function to the list of tuple by checking the label attached into it.
We provide two examples to demonstrate how simultaneous pipeline technique works. Two jobs are described at two figures below. For each figure, on the left are two original DAGs and on the right is the meta DAG which is merged from those two DAGs.
- Two jobs, each job has only one ShuffleRDD From figure 1, we can see the metajob is a DAG which is mainly composed from the two original DAGs. The MuxRDD is attached right after the scan RDD, it does the multiplexing the input into two pipelines.
- Before going to the ShuffleRDD, a LabellingRDD is inserted right before the ShuffleRDD to attach the label into each tuple. In this example, we have two jobs, so the label has two values which is 0 and 1. The ShuffledRDD is modified so it contains two aggregate functions. The aggregation before and after shuffling happens on the correct tuple with the correct aggregate function due to the label attached into each tuple. The DispatchRDD is attached after the ShuffledRDD and routes the correct tuple to the associated pipeline by the label attached into each tuple. The PullRDD is inserted at last to do the pull and trigger the whole job, then saves the output to disk storage.
- Two jobs, one job has two ShuffleRDDs The part of DAG1 from the beginning up to the second ShuffledRDD and DAG2 are merged as the same way as the case above. The difference is that the PullRDD just saves a part of data comes from the pipeline of DAG2, PullRDD is also followed by the rest part of DAG1 and saves to disk storage when finishes. The process is described in figure 2.
This technique is embedded into Spark Source Code and is an automatic feature of SparkSQL Server. In those two examples, we just discuss about two simple cases that can happen when using MRShare and simultaneous pipeline technique, but the more complex DAGs can also be merged into one meta job automatically.