You may remember the draft design of the system I proposed here. I delayed posting part 2, which focuses mostly on technical details, because Spark is new to me and I needed time to dig into it. I don’t expect the design to change much, though.
Back to Spark: I implemented a simple client and a simple server to send and receive the DAG. However, the DAG that the client sent to the Spark server did not contain the information I needed. From an abstract point of view, users can write the same program in many different ways, which makes it difficult to optimize their jobs (that is, to find the similarities among them and pack them together). From a technical point of view, some parts of the DAG were missing when it arrived at the Spark server because they cannot be serialized, so I never received the full DAG. Users can also have many anonymous functions inside their programs, and I don’t think sending anonymous functions over the network is a good idea: we would need to ship the whole class of each anonymous function to the Spark server, or we would get a “ClassNotFound” exception. Even if we could get all of them, we would still face the problem of finding similarities that I described above.
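To make the serialization problem concrete, here is a minimal sketch in plain Java (no Spark involved). The `Node` class is a hypothetical stand-in for one DAG node, not Spark's actual representation. It only illustrates the general JVM behavior: an object graph that holds an ordinary anonymous function cannot be written to an object stream, while one explicitly marked `Serializable` can.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class DagSerializationSketch {

    // Hypothetical stand-in for a DAG node: a name plus the user function it applies.
    public static class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final Function<Integer, Integer> fn;

        public Node(String name, Function<Integer, Integer> fn) {
            this.name = name;
            this.fn = fn;
        }
    }

    // Returns true if standard Java serialization can write the whole object graph.
    public static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            // Some object reachable from 'o' does not implement Serializable.
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // An ordinary lambda is not Serializable, so the node cannot be sent.
        Node plain = new Node("map", x -> x + 1);

        // The intersection cast makes the compiler generate a serializable lambda.
        Node marked = new Node("map", (Function<Integer, Integer> & Serializable) x -> x + 1);

        System.out.println(canSerialize(plain));   // false
        System.out.println(canSerialize(marked));  // true
    }
}
```

Even in the second case, the receiving JVM still needs the class that defines the function on its classpath in order to deserialize it, which is in line with the “ClassNotFound” issue described above.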
After many discussions with my team, we decided to move to SparkSQL, a new module that integrates relational processing with Spark’s functional programming API. Why?
- As far as I know, it generates several plans (logical, optimized logical, physical…), which could be a better way to represent a job at the logical level.
- SparkSQL already provides standard syntax for GROUPING SETS, CUBE, and ROLLUP, so it should be easier for me to find the similarities among jobs.
- Its flow is quite similar to Pig’s, which I already have some experience with.
In the next posts, I will cover some fundamentals and, of course, the internals of SparkSQL.
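As a rough illustration of why a standard grouping syntax helps, the sketch below expands ROLLUP and CUBE into their equivalent grouping sets (following standard SQL semantics) and intersects them to see what two jobs have in common. It is plain Java, not SparkSQL code, and the column names `country` and `city` as well as the `shared` helper are my own assumptions for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class GroupingSetsSketch {

    // ROLLUP(c1, ..., cn) expands to every prefix of the column list, down to the empty set.
    public static Set<List<String>> rollup(List<String> cols) {
        Set<List<String>> sets = new LinkedHashSet<>();
        for (int n = cols.size(); n >= 0; n--) {
            sets.add(new ArrayList<>(cols.subList(0, n)));
        }
        return sets;
    }

    // CUBE(c1, ..., cn) expands to every subset of the column list (2^n grouping sets).
    public static Set<List<String>> cube(List<String> cols) {
        Set<List<String>> sets = new LinkedHashSet<>();
        int total = 1 << cols.size();
        for (int mask = total - 1; mask >= 0; mask--) {
            List<String> set = new ArrayList<>();
            for (int i = 0; i < cols.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    set.add(cols.get(i));
                }
            }
            sets.add(set);
        }
        return sets;
    }

    // Grouping sets that two jobs have in common: candidates for shared work.
    public static Set<List<String>> shared(Set<List<String>> a, Set<List<String>> b) {
        Set<List<String>> out = new LinkedHashSet<>(a);
        out.retainAll(b);
        return out;
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("country", "city");

        Set<List<String>> jobA = rollup(cols);
        Set<List<String>> jobB = cube(cols);

        System.out.println(jobA);               // [[country, city], [country], []]
        System.out.println(jobB);               // [[country, city], [city], [country], []]
        System.out.println(shared(jobA, jobB)); // [[country, city], [country], []]
    }
}
```

Once every job is reduced to a set of grouping sets like this, comparing jobs becomes a set comparison rather than a comparison of arbitrary user code.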
For my internship, what do I need to know about SparkSQL?
- SparkSQL components.
- SparkSQL flow.
- Plans generated during SparkSQL runtime (this is the most important thing I need to figure out).
- GROUPING SETS/CUBE/ROLLUP in SparkSQL.
For my curiosity, what do I need to know about SparkSQL? Well, the rest.
* If you have any thoughts on the problems I mentioned above, please let me know.