I assume that you’ve already read these documents about SparkSQL. Things that you should keep in mind:
- DataFrame API: where relational processing meets procedural processing.
- Catalyst: an extensible query optimizer that works on trees and rules; it optimizes lazily and is easy to extend with new rules.
In this post, I will introduce the SQLContext. If SparkContext is the brain of Spark, then SQLContext is the brain of SparkSQL.
SparkSQL is a new component of Spark. Inside the sql package of Spark, we have core, catalyst and hive. The two components that I will focus on are core and catalyst.
- sparksql-core is in charge of providing the programming interface for users, translating the languages it supports into a DAG that will be executed by spark-core.
- catalyst, which we can guess what it does from its name, acts as a query optimizer.
- What is SQLContext and what are its functions?
- How do those two components, core and catalyst, look in the big picture?
- What exactly is a DataFrame?
SQLContext is the entry point of SparkSQL if you want to play with it. It allows the creation of DataFrame objects and the execution of SQL queries. As in the post about SparkContext, I provide a figure here to show you the main components of SQLContext and their functions (and of course the descriptions, from the code and my own words, are below).
Notice: the yellow circles are lazy vals (the difference between a val and a lazy val in Scala is that a val is evaluated when it is defined, while a lazy val is evaluated the first time it is accessed. Refer to here for more details.)
* Components related to analysis, optimization and plans come from the catalyst package; the rest come from the core package.
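The val vs. lazy val difference can be seen in a few lines of plain Scala (a minimal sketch of my own; `LazyValDemo` and its log messages are just an illustration, not Spark code):

```scala
// Minimal sketch: a val runs at definition time, a lazy val on first access.
object LazyValDemo {
  val log = scala.collection.mutable.ListBuffer[String]()

  val eager: Int = { log += "eager evaluated"; 1 }        // runs when the object initializes
  lazy val deferred: Int = { log += "lazy evaluated"; 2 } // runs only on first access

  def main(args: Array[String]): Unit = {
    log += "before access"
    println(deferred) // this first access triggers the lazy val's body
  }
}
```

Running it, "eager evaluated" is logged before "before access", while "lazy evaluated" only appears after `deferred` is first read. This is exactly why the plans in SQLContext cost nothing until something forces them.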
- SparkContext: refer to this post. SparkSQL is a component of Spark, so we need a SparkContext as the entry point of Spark before creating the SQLContext.
- SQLConf: A class that enables the setting and getting of mutable config parameters/hints.
- Catalog: an interface for looking up relations by name, used by the Analyzer. There are three kinds of catalog; the one implemented in SQLContext is SimpleCatalog, which manages relations with a HashMap.
- CacheManager: provides support in a SQLContext for caching query results and automatically using these cached results when subsequent queries are executed. Data is cached using byte buffers stored in an InMemoryRelation. This relation is automatically substituted into query plans that return the `sameResult` as the originally cached query.
- Analyzer: Provides a logical query plan analyzer, which translates UnresolvedAttributes and UnresolvedRelations into fully typed objects using information in a schema Catalog and a FunctionRegistry.
- Optimizer: the query optimizer. This is a major feature of SparkSQL; details will come in the next posts.
- SparkSQL supports SQL, so of course it needs a parser. We have two parsers here:
- ddlParser: a data definition parser, for foreign DDL commands
- sqlParser: The top level Spark SQL parser. This parser recognizes syntaxes that are available for all SQL dialects supported by Spark SQL, and delegates all the other syntaxes to the `fallback` parser.
- SparkPlanner: SparkSQL generates as many physical plans as possible; after going through the cost model, it outputs only one PhysicalPlan, which will be executed.
- prepareForExecution: Prepares a planned SparkPlan for execution by inserting shuffle operations as needed.
- CheckAnalysis: Throws user facing errors when passed invalid queries that fail to analyze.
- QueryExecution: the primary workflow for executing relational queries using Spark. This is, in my opinion, the most important component of SparkSQL. It holds several plans that are generated during execution. All those plans are lazily evaluated, so execution only happens when the user calls an action. Note that analyzed, withCachedData and optimizedPlan are logical plans, while the remaining plans are physical plans. Details will come in the next post.
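To give a rough feel for how those lazy plans chain together, here is a hypothetical mini QueryExecution in plain Scala (the stage names mimic Spark's, but the class itself is entirely made up by me):

```scala
// Hypothetical mini QueryExecution: each stage is a lazy val, so nothing
// runs until the final "action" forces the last stage of the pipeline.
class MiniQueryExecution(sql: String) {
  val trace = scala.collection.mutable.ListBuffer[String]()

  lazy val logical: String = { trace += "parse"; s"Logical($sql)" }
  lazy val analyzed: String = {
    val in = logical; trace += "analyze"; s"Analyzed($in)"
  }
  lazy val optimizedPlan: String = {
    val in = analyzed; trace += "optimize"; s"Optimized($in)"
  }
  lazy val sparkPlan: String = {
    val in = optimizedPlan; trace += "plan"; s"Physical($in)"
  }

  // The "action": forcing sparkPlan transitively forces every earlier stage.
  def execute(): String = sparkPlan
}
```

Creating a `MiniQueryExecution` does no work at all; only calling `execute()` walks the whole parse → analyze → optimize → plan chain, in that order.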
The SparkSQL paper has a nice section about DataFrame. I will just mention some key points here; technical details will come in the next posts.
- DataFrame is a distributed collection of data organized into named columns.
- A DataFrame is equivalent to a relational table in Spark SQL.
- It can also be manipulated in ways similar to RDDs.
- It keeps track of its schema and supports many relational operations.
- Each DataFrame object has a ‘lazy’ logical plan that computes a dataset, which gives the optimizer a better chance to work.
What is inside a DataFrame?
- A LogicalPlan
- A QueryExecution
- A SQLContext
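To make the "lazy logical plan" idea concrete, here is a toy DataFrame-like wrapper in plain Scala (hypothetical, not the real API): each transformation only grows the plan, and the data is only touched when `collect()` is called.

```scala
// Toy DataFrame: transformations build up a logical plan without computing
// anything; the collect() action finally executes the whole chain.
class MiniDataFrame(val logicalPlan: List[String], compute: () => Seq[Int]) {
  def filter(p: Int => Boolean): MiniDataFrame =
    new MiniDataFrame("Filter" :: logicalPlan, () => compute().filter(p))

  def select(f: Int => Int): MiniDataFrame =
    new MiniDataFrame("Project" :: logicalPlan, () => compute().map(f))

  def collect(): Seq[Int] = compute() // the action that triggers execution
}

object MiniDataFrame {
  def apply(data: Seq[Int]): MiniDataFrame =
    new MiniDataFrame(List("Scan"), () => data)
}
```

For example, `MiniDataFrame(Seq(1, 2, 3, 4)).filter(_ % 2 == 0).select(_ * 10)` only records the plan `Project <- Filter <- Scan`; calling `collect()` then yields `Seq(20, 40)`. The real DataFrame works on the same principle, except the plan is a proper tree that Catalyst can optimize before execution.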
*Some thoughts to share:
- From SQLConf, I know that SparkSQL supports two dialects: sql and hive. By the way, the codegen feature is disabled by default.
- Inside CacheManager, there is a case class CachedData, which holds a cached logical plan and its data.
- The main data type in Catalyst is a tree composed of node objects.
- Analyzer and Optimizer work by using pattern matching, or to be more precise, by using rules to transform one tree into another tree.
- Rules may need to be executed multiple times, so Catalyst groups rules into batches and executes each batch until it reaches a fixed point or hits the maximum number of iterations (max_iterations).
- QueryExecution is where Catalyst does most of its work: all the analysis, optimization and translation happen there.
- Cost-based optimization is only used to select the join algorithm, and the metric is table size. This is very similar to Calcite in Hive.
- SQLContext contains many ‘lazy’ plans, so be careful when you debug a SparkSQL program because this can drive you crazy.
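The tree-plus-rules mechanism described above can be sketched in a few lines of Scala (a toy constant-folding rule plus a fixed-point batch executor; names like `ToyOptimizer` are my own, not Catalyst's):

```scala
// A toy Catalyst-style optimizer: an expression tree rewritten by a
// pattern-matching rule, applied repeatedly until a fixed point is reached
// (or the maximum number of iterations is hit).
sealed trait Expr
case class Lit(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

object ToyOptimizer {
  // One rule: constant folding, expressed with pattern matching on the tree.
  def constantFold(e: Expr): Expr = e match {
    case Add(Lit(a), Lit(b)) => Lit(a + b)                       // fold two literals
    case Add(l, r)           => Add(constantFold(l), constantFold(r)) // recurse
    case other               => other
  }

  // Execute a batch of rules until the tree stops changing or we hit the cap.
  def fixedPoint(e: Expr, rules: Seq[Expr => Expr], maxIterations: Int = 100): Expr = {
    var current = e
    var iter = 0
    var changed = true
    while (changed && iter < maxIterations) {
      val next = rules.foldLeft(current)((tree, rule) => rule(tree))
      changed = next != current
      current = next
      iter += 1
    }
    current
  }
}
```

For instance, `ToyOptimizer.fixedPoint(Add(Add(Lit(1), Lit(2)), Lit(3)), Seq(ToyOptimizer.constantFold))` needs two passes to collapse the tree to `Lit(6)`, which is exactly why Catalyst runs each batch to a fixed point rather than just once.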
This post only gives you the overall picture of SparkSQL, and it is still very brief. In the next posts, I will focus on the data flow of SparkSQL and its internals.