The SparkSQL paper provides a very nice figure of the SparkSQL data flow. I have worked with Apache Pig for more than a year, so I realized it would be better to put them all together: I created a new figure that covers the data flows of Hive, Pig, and SparkSQL. I know a little … Continue reading [Arch] SparkSQL Internals – Part 2: SparkSQL Data Flow
[Arch] SparkSQL Internals – Part 1: SQLContext
I assume that you’ve already read these documents about SparkSQL. Things that you should keep in your mind: DataFrame API: where relational processing meets procedural processing. Catalyst: extensible query optimizer which works on trees and rules, provides lazy optimization and is easy to extend/add a new rule. In this post, I will introduce to you … Continue reading [Arch] SparkSQL Internals – Part 1: SQLContext
[Sysdeg] Moving to SparkSQL, why not?
Maybe you still remember the draft design of the system I proposed here. The reason I delayed posting part 2, which focuses mostly on technical details, is that Spark is new to me, so I needed time to dig deeper into it. However, I don't think the design will change much. Come back to … Continue reading [Sysdeg] Moving to SparkSQL, why not?
[Arch] Spark job submission breakdown
I originally wrote about the Spark deploy modes in this post too, but I realized it would be too long, so I decided to split it into two posts. I suggest going through that one first to get a good, general view of Spark in the ecosystem. In this post, I will focus on … Continue reading [Arch] Spark job submission breakdown
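As a rough illustration of what a "job submission" is here, the sketch below assumes a Spark 1.x word count; input.txt is a hypothetical path, and the comments mark where the driver actually submits a job:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JobSubmissionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("job-submission-demo").setMaster("local[2]"))

    // Transformations only build the lineage graph; nothing runs yet.
    val counts = sc.textFile("input.txt")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                   // shuffle => stage boundary

    // The action is what submits a job: the driver's DAGScheduler cuts
    // the lineage into stages at the shuffle, and the TaskScheduler
    // ships one task per partition to the executors.
    counts.collect().foreach(println)

    sc.stop()
  }
}
```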
[Overview] Spark deploy modes
Spark has several deploy modes, and they affect the way the Spark driver communicates with the executors, so I want to say a little about them. From the Spark deploy mode documentation, we know that Spark has three deploy modes: Standalone: without YARN or Mesos, you can run your own cluster by starting a master … Continue reading [Overview] Spark deploy modes
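To make the modes concrete, here is a hedged Scala sketch of how the master URL selects a mode. In practice the master is usually supplied via spark-submit rather than hard-coded, and the yarn-client/yarn-cluster strings are the Spark 1.x forms:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DeployModeDemo {
  def main(args: Array[String]): Unit = {
    // The master URL picks the deploy mode and therefore where the
    // driver lives relative to the executors:
    //   "local[*]"           - everything in one JVM, for testing
    //   "spark://host:7077"  - standalone cluster (own master/workers)
    //   "yarn-client"        - driver on the submitting machine (Spark 1.x)
    //   "yarn-cluster"       - driver inside the YARN cluster (Spark 1.x)
    //   "mesos://host:5050"  - Mesos-managed cluster
    val conf = new SparkConf()
      .setAppName("deploy-mode-demo")
      .setMaster("local[*]")  // swap in one of the URLs above

    val sc = new SparkContext(conf)
    println(s"Running with master: ${sc.master}")
    sc.stop()
  }
}
```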