I first heard what my internship would be about 6 months ago, right after I finished my summer internship, to be precise. The plan was to design and build a work-sharing framework (sharing scans and computation) for Apache Pig queries running on Hadoop MapReduce, focusing mostly on the GROUPING SETS operation.
6 months later, my internship is still built around that idea, but the platform has changed! Instead of Hadoop MapReduce and Apache Pig, I will work on a framework that is completely new to me: Apache Spark. That means I need to learn everything from scratch, come up with the design of the system, make it real, and re-implement (or port) everything I had on Apache Pig and Hadoop to Apache Spark in the next 6 months.
To be honest, this will bring me many difficulties, but learning something new is also attractive to me, especially when that something is Spark.
Before diving into the main purpose of the project, which is the sharing framework, understanding Spark deeply is a MUST for me. Spark internals are exactly what I am looking for; unfortunately, there is still little documentation about them. After roughly one week, I found that there are a few things I need to understand well about Apache Spark:
- Resilient Distributed Dataset (RDD): the heart of Apache Spark, which makes it outperform Hadoop and become more and more popular.
- Job submission and scheduling process: how does this process differ between Apache Spark and Apache Hadoop?
- What SparkContext is and what its purposes are. One important note is that SparkContext, in a way, acts like a “master”/“manager” in each Spark application.
I don’t want to talk about things you can easily find out by yourselves. It’s better to read the Spark paper to understand RDDs. There are also many documents that describe Spark’s job submission process. All of this information can easily be found on the Internet, or you can visit my GitHub to get the ones that are, in my opinion, the most important.
In the next posts, I will discuss my framework in more detail and how I came up with its design, from both a high-level and a technical point of view.