I have worked on Apache open source projects for a year and a half, mostly on Apache Hadoop and especially on Apache Pig. This post is not about detailed techniques or the internal architecture of Pig. Instead, it offers some reminders and suggestions for anyone who is starting from scratch and needs to modify Pig’s source code. Let’s start.
When you work with Apache, you have a big community behind you. In general, every Apache project has a good guide for its contributors and a mailing list for developers. So read the guide carefully, subscribe to the mailing list, and don’t hesitate to email the list for help when you run into trouble.
Successfully compiling the Pig source code and running it is a big win. Yes, absolutely true. Working with open source projects means you will face a lot of “missing dependencies” errors when you compile them, so you should have basic knowledge of a build automation tool (Maven or Ant). Depending on the Hadoop distro your cluster uses, your Pig build may or may not run on it because of incompatibilities. Sometimes your last resort is to send an email to the mailing list and wait for an answer.
Get familiar with your IDE. There are a ton of tips on the Internet that help you work productively with it. Choose an IDE into which you can easily import your project. Mine is Eclipse 🙂
I have implemented many algorithms on Apache Pig. If you are going to implement an algorithm, make sure your implementation works well on plain Hadoop MapReduce first, before moving it to Pig.
Pig’s local mode can sometimes be your foe. Do NOT trust it 100%. Pig has a very nice local mode, and it helped me a lot with debugging, but it also made me waste a lot of time. The reason is that some features don’t work correctly in local mode even though they do work in MapReduce mode. If your Pig code behaves differently in the two modes, debug it in MapReduce mode first before concluding that you implemented it wrong.
You are working with Big Data: your input can be hundreds of gigabytes or even terabytes. So always keep your mind in “Big Data” mode. Be careful even with things that look harmless, such as a simple loop, a variable, or an if condition. Anything that runs once per record is repeated across the whole input, and it can make your Pig job much slower than you imagine.
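To make this a bit more concrete, here is a minimal sketch in plain Java (no Pig or Hadoop API involved). The class `PerRecordCost` and its methods are hypothetical names I made up for illustration. The only point is that the regular expression is compiled once and reused, instead of being compiled again inside the code that runs for every record.

```java
import java.util.List;
import java.util.regex.Pattern;

public class PerRecordCost {
    // Compiled once when the class is loaded and reused for every record.
    // Calling Pattern.compile inside countFields would repeat this work per record.
    private static final Pattern TAB = Pattern.compile("\t");

    // Hypothetical per-record step: count the tab-separated fields of one line.
    // The -1 limit keeps trailing empty fields.
    static int countFields(String line) {
        return TAB.split(line, -1).length;
    }

    // Stand-in for the framework calling the per-record step on each input line.
    static long totalFields(List<String> lines) {
        long total = 0;
        for (String line : lines) {
            total += countFields(line);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> sample = List.of("a\tb\tc", "d\te", "f");
        System.out.println(totalFields(sample)); // prints 6
    }
}
```

On three lines the difference is invisible. On terabytes of input, the same kind of small per-record cost is what you end up waiting for.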
Above I mentioned debugging your Pig code in MapReduce mode. What exactly does that mean? Well, small inputs, print statements, and logs are your friends. Small input files, even just a few kilobytes, help you keep track of what is going on. Printing values at each line of code that you suspect of causing bugs is also a simple and efficient technique. Lastly, read the logs and the counters, I mean everything the framework gives you.
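Here is a rough sketch of that habit in plain Java. It does not use the Pig or Hadoop APIs: the `DebugCounters` class, its counter names, and the tiny input are all hypothetical stand-ins, and in a real job the counters and logs come from the framework itself. It only shows the pattern of a tiny input, a print for each suspicious record, and counters you can check afterwards.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DebugCounters {
    // Stand-in for framework counters: counter name -> count.
    private final Map<String, Long> counters = new LinkedHashMap<>();

    void increment(String name) {
        counters.merge(name, 1L, Long::sum);
    }

    long get(String name) {
        return counters.getOrDefault(name, 0L);
    }

    // Hypothetical per-record step: parse one line as an integer.
    // Returns null for a malformed line, after counting and printing it.
    Integer parse(String line) {
        increment("records_seen");
        try {
            return Integer.valueOf(line.trim());
        } catch (NumberFormatException e) {
            increment("records_malformed");
            System.err.println("malformed record: '" + line + "'");
            return null;
        }
    }

    public static void main(String[] args) {
        DebugCounters job = new DebugCounters();
        // A tiny input makes it easy to check every record by hand.
        List<String> tinyInput = List.of("1", "2", "x3", " 4 ");
        long sum = 0;
        for (String line : tinyInput) {
            Integer value = job.parse(line);
            if (value != null) {
                sum += value;
            }
        }
        System.out.println("sum=" + sum + " counters=" + job.counters);
    }
}
```

With an input this small you can verify the sum and both counters in your head, which is exactly what makes a wrong number easy to spot.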
Follow the how-to-contribute guide if you want Apache to accept your contribution. It took me three months to get my patch committed, simply because I did not follow Apache’s conventions. All of these conventions are described in the how-to-contribute guide and take just a minute to set up in your IDE.
Working with big open source projects like Apache Pig and Apache Hadoop can sometimes drive you crazy. You may face a lot of trouble at the beginning, but when you get past it and look back, maybe you will think like me: “Well, it’s not that hard, huh!” And one last thing you should always keep in mind: “No pain, no gain.” 🙂