Here’s the situation: one piece of our work needs to be benchmarked. It is still the SparkSQL Server that I designed and developed, but a new sharing technique has been implemented and integrated into it. So, we decided to use Databricks’ spark-sql-perf to benchmark our work, and the benchmark queries we used are from TPC-DS.
After a lot of discussion, we decided to change the design of the system a little, so it can be more general and extensible. The new design is described in the figure above. The WorkSharing Detector remains the same as in the old design. Its goal is to generate bags of DAGs which are labeled with … Continue reading [SysDeg] Worksharing Framework and its design: Some modifications
Long time no see! After a month of playing with caching in Spark, I learned many valuable lessons (which will be covered in other blog posts, about the Cache Manager and Block Manager of Spark). Our team came back to the design of the system, the SparkSQL server. To be honest, I spent too much time … Continue reading [SysDeg] Worksharing Framework and its design – Part 3: Prototype for the first version
After gaining a basic understanding of Spark and SparkSQL, I came back to my system. The high-level design of the system remains the same as I described two months ago. It is a client-server model, but the server has changed from the Spark server to the SparkSQL server. I spent roughly two weeks on some coding … Continue reading [SysDeg] Worksharing Framework and its design – Part 2: Communication method
The SparkSQL paper provides a very nice figure of the SparkSQL data flow. I have had experience with Apache Pig for more than a year, so I realized it would be better to put them all together. I created a new figure that includes the data flows of Hive, Pig, and SparkSQL. I know a little … Continue reading [Arch] SparkSQL Internals – Part 2: SparkSQL Data Flow