My internship and some documents on Apache Spark

I first heard about what I would do in my internship six months ago. Well, to be precise, it was right after I finished my summer internship. The topic is designing and building a work-sharing framework (for scans and computation) for Pig queries on Hadoop MapReduce, focusing mostly on the GROUPING SETS operation.

Six months later, the internship is still built around that idea, but the platform has changed! Instead of working on Hadoop MapReduce and Apache Pig, I will work on a framework that is completely new to me: Apache Spark. That means that in the next six months I need to learn everything from zero, come up with the design of the system, make it real, and re-implement (or port) everything I had on Apache Pig and Hadoop to Apache Spark.
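To give a rough idea of what scan sharing for GROUPING SETS could look like on Spark, here is a minimal sketch in Scala. The dataset and all the names in it are invented for illustration; this is not the framework itself. The point is simply that several groupings can reuse one cached scan of the input instead of each re-reading the data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedScanSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("shared-scan-sketch").setMaster("local[*]"))

    // Hypothetical sales records: (country, city, amount)
    val sales = sc.parallelize(Seq(
      ("FR", "Paris", 10.0),
      ("FR", "Lyon",   5.0),
      ("IT", "Rome",   7.0)
    ))

    // Share one scan of the input across several groupings by caching it,
    // instead of re-reading the data once per GROUP BY.
    sales.cache()

    // Grouping set 1: aggregate by (country, city)
    val byCountryCity = sales
      .map { case (country, city, amt) => ((country, city), amt) }
      .reduceByKey(_ + _)

    // Grouping set 2: aggregate by country only
    val byCountry = sales
      .map { case (country, _, amt) => (country, amt) }
      .reduceByKey(_ + _)

    byCountryCity.collect().foreach(println)
    byCountry.collect().foreach(println)

    sc.stop()
  }
}
```

Caching is only the crudest form of sharing, of course; a real framework would presumably merge the scans and partial aggregates more cleverly, which is what the design posts will be about.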

[Image: Apache Spark vs Apache Hadoop]

To be honest, this will bring me many difficulties, but learning something new also attracts me, especially when that something is Spark.

Before diving into the main purpose of the project, the sharing framework, deeply understanding Spark is a MUST for me. Spark's internals are exactly what I am looking for; unfortunately, there are still few documents about them. After roughly one week, I identified the things I need to understand well about Apache Spark:

  • Resilient Distributed Dataset (RDD): the heart of Apache Spark, the abstraction that makes it outperform Hadoop and has made it more and more popular.
  • Job submission and scheduling: how does this process differ between Apache Spark and Apache Hadoop?
  • What SparkContext is and what its purposes are. One important note is that SparkContext in some ways acts like a “master”/“manager” of each Spark application (see the sketch after this list).
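To make the SparkContext point concrete, here is a minimal, self-contained sketch (the application name and the toy data are mine, not from any real project). Each Spark application creates exactly one SparkContext, which connects to the cluster manager and tracks that application's RDDs and jobs; transformations only build a lineage graph, and a job is submitted to the scheduler only when an action runs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextDemo {
  def main(args: Array[String]): Unit = {
    // One SparkContext per application: it negotiates resources with the
    // cluster manager and acts as the application's "master"/"manager".
    val sc = new SparkContext(
      new SparkConf().setAppName("sc-demo").setMaster("local[*]"))

    // Transformations are lazy: these lines only build the RDD lineage.
    val words  = sc.parallelize(Seq("spark", "hadoop", "pig", "spark"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // The action is what submits a job: the scheduler splits the lineage
    // into stages at shuffle boundaries and runs them as tasks.
    counts.collect().foreach(println)

    sc.stop()
  }
}
```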

I don’t want to talk about things you can easily find out by yourselves. To understand RDDs, it’s better to read the Spark paper. There are also many documents showing how Spark’s submission process works. All this information can be found easily on the Internet, or you can visit my GitHub to get the ones I consider most important.

In the next posts, I will discuss my framework in more detail: how I came up with its design, both from a high-level point of view and from a technical point of view.


