[Arch] SparkSQL Internals – Part 2: SparkSQL Data Flow

The SparkSQL paper provides a very nice figure of the SparkSQL data flow. I have worked with Apache Pig for more than a year, so I realized that it would be better to put them all together. I created a new figure that includes the data flows of Hive, Pig, and SparkSQL. I know a little …

[Arch] SparkSQL Internals – Part 1: SQLContext

I assume that you've already read these documents about SparkSQL. Things that you should keep in mind: DataFrame API: where relational processing meets procedural processing. Catalyst: an extensible query optimizer that works on trees and rules, provides lazy optimization, and is easy to extend with new rules. In this post, I will introduce to you …

[Arch] SparkContext and its components

When you work with Spark or read documentation about Spark, you will definitely come across SparkContext, which lives inside the driver on the client side. It confused me and made me curious when I first heard about it, so I decided to dig into it. To summarize in a few words, I would say that SparkContext, in general, is …

[Arch] An overview of Spark components and their dependencies

I sketch here the components inside Spark and their dependencies so that you can get a general overview of Spark. Each component is in charge of a particular function. Most of the components and their functions are straightforward to understand, so I only explain some components that are "not easy" to understand. - repl: the interactive shell …