I set up Apache Giraph – Now what?

After setting up Giraph, I tried (or at least wanted) to make my first jump into it by running the Shortest Paths example. Well, my first try was an epic fail: I dug straight into the code and got lost inside packages and classes with no clue of what was going on. (This was my first attempt to work on an open-source project.)

What I needed to do first (before getting lost) was to observe and understand how things work in an open-source project, and specifically in Giraph. So, here’s what you need to know before even touching the keyboard:

  1. (Lack of) Documentation. Documentation usually belongs in the category of endangered species, and you will notice this in many open-source projects. Giraph was no exception: I could not find articles, posts or blogs that answered most of my initial questions.
  2. (Powerful) Mailing Lists (and archives of mailing lists). I call them time-savers. There are three mailing lists:
    • The User List: users of Giraph ask their questions here.
    • The Developer List: Giraph developers discuss development and offer their valuable contributions here.
    • The Commits List: for those who want to receive an e-mail whenever a commit is made.

    How to use a mailing list? First, define your role regarding Giraph (will you just use it, do you want to help develop the system, or both?). And then:

    • Subscribe to receive all future e-mails and directly ask questions. The community is small but active!
    • Search the archives of the mailing lists. I found it extremely useful to dig into the archives through “The Mail Archive”, as it offers a search button. 😉 Here are the links for the user-list, dev-list and commits-list.
  3. Main Objectives of the Pregel model. Theory becomes practice in the shape of Giraph. You won’t easily understand Giraph if you haven’t understood the basics of Pregel first. A Pregel-based algorithm:
    • Is vertex-centric & iterative: every vertex executes the same function until the halt condition is met.
    • Receives an input file with the graph.
    • Creates an output file with the result.

    Here’s the news: each of the three bullets corresponds to a different Java class in Giraph (the names I give below are just placeholders):

    • MainCode.java is the actual algorithm. It has a compute() method which is called for each vertex.
    • InputFormat.java specifies the format of the input file it expects to receive and loads the values into the corresponding fields of the main code. The format includes (i) the format of the lines given in the input file (JSON, text), (ii) the type of the object (vertex or edge) and (iii) the types of the object’s attributes (int, float, double).
    • OutputFormat.java specifies the format of the output file to be generated. This format includes (i) the format of the lines to be generated in the output file (JSON, text) and (ii) the attributes to be printed in the output file.

    All three Java classes are necessary to run a Giraph-based application.
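
To make the vertex-centric idea above a bit more concrete, here is a toy sketch of the Pregel loop in plain Java, using single-source shortest paths as the algorithm. It does not use Giraph at all: PregelSketch, shortestPaths, compute and the little EDGES graph are all names I made up for illustration, and everything runs in a simple loop on one machine instead of being distributed. Take it as a rough picture of the model under those assumptions, not as the way Giraph implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy, single-machine imitation of the Pregel loop. This is NOT Giraph code:
// every name in this class is hypothetical and exists only for illustration.
class PregelSketch {

    // A small directed graph: EDGES[v] lists {target, weight} pairs of vertex v.
    static final int[][][] EDGES = {
        {{1, 4}, {2, 1}},   // vertex 0
        {{3, 1}},           // vertex 1
        {{1, 2}, {3, 5}},   // vertex 2
        {}                  // vertex 3 has no outgoing edges
    };

    // Runs supersteps until no messages are in flight (the halt condition).
    static int[] shortestPaths(int[][][] edges, int source) {
        int[] value = new int[edges.length];
        Arrays.fill(value, Integer.MAX_VALUE);   // "infinity" until reached

        // Messages to be delivered in the next superstep, keyed by target vertex.
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        inbox.put(source, new ArrayList<>(Arrays.asList(0)));

        while (!inbox.isEmpty()) {
            Map<Integer, List<Integer>> outbox = new HashMap<>();
            // One superstep: every vertex with messages runs the same function.
            for (Map.Entry<Integer, List<Integer>> entry : inbox.entrySet()) {
                compute(entry.getKey(), entry.getValue(), value, edges, outbox);
            }
            inbox = outbox;
        }
        return value;
    }

    // The per-vertex function: keep the smallest distance seen so far and,
    // if it improved, tell the neighbors about it.
    static void compute(int id, List<Integer> messages, int[] value,
                        int[][][] edges, Map<Integer, List<Integer>> outbox) {
        int best = value[id];
        for (int message : messages) {
            best = Math.min(best, message);
        }
        if (best < value[id]) {
            value[id] = best;
            for (int[] edge : edges[id]) {
                outbox.computeIfAbsent(edge[0], k -> new ArrayList<>())
                      .add(best + edge[1]);
            }
        }
        // No improvement: the vertex sends nothing, i.e. it "votes to halt".
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(shortestPaths(EDGES, 0)));
    }
}
```

With the sample graph and vertex 0 as the source, main prints [0, 3, 1, 4]. Reading the graph and printing the result are hard-coded here; in a real application those two jobs belong to the input and output format classes.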

And now we are ready! Dig into the code! Don’t get scared! You can start by just having a look at it. Personally, I use Eclipse to work with Giraph. Once you import the project into Eclipse, you can see that the code is divided into packages by functionality (duuh!). Below I give a list of packages you should check out first.

The directory giraph-core/src/main/java holds most of the code that is essential to us.

  • org.apache.giraph.graph → Contains Vertex.java, which is the basic class for writing the main code for computation. Devote some time to reading this file carefully. A vertex comes with four attributes: (1) the vertex id, (2) the vertex value, (3) the vertex’s list of edges and (4) the messages to be sent to neighbors. When creating an application, your class should extend Vertex, and you can choose to leave some of the attributes unused depending on your algorithm.
  • org.apache.giraph.io.formats → Contains Java classes for reading the input files and writing the output files.
  • org.apache.giraph.utils → Contains Java classes that make programming in Giraph easier, e.g. ArrayListWritable, which is an ArrayList that is also Writable. Why do we need this class instead of using ArrayList directly? Because Writable is Hadoop’s serialization interface: a Writable object knows how to write itself to a byte stream and read itself back, so Giraph can send it between workers or save it in a checkpoint. That is why, after a failure during the execution of a Giraph algorithm, the values of Writable objects can be retrieved (they are not totally lost). (If someone knows more or can explain this better, please throw a comment! It will be very much appreciated :)). Another class worth mentioning is IntPair, which creates an object with two int values; it can be used to represent a pair of vertices.
  • org.apache.giraph.master → Contains Java classes for adding a master computation that performs centralized computation between supersteps.
  • org.apache.giraph.aggregators → Contains Java classes for adding aggregators to an application. Aggregators combine values contributed by the vertices, and the results are available between supersteps. Examples of aggregators given here compute the sum/product of vertex values, find the maximum/minimum value, etc.
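
Coming back to the Writable question from the utils package: the sketch below tries to show what "being Writable" means in practice. So that it runs without Hadoop on the classpath, it uses a stand-in interface, SimpleWritable, which I made up but which declares the same two methods as Hadoop’s org.apache.hadoop.io.Writable (write and readFields). VertexPair, toBytes and roundTrip are hypothetical names too. The idea is only that an object turns itself into bytes and can be rebuilt from them, which is what lets a framework ship it over the network or store it on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustration only: none of these names come from Giraph or Hadoop.
class WritableSketch {

    // Stand-in with the same two methods as Hadoop's Writable interface.
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical pair of vertex ids that knows how to (de)serialize itself.
    static class VertexPair implements SimpleWritable {
        int first;
        int second;

        VertexPair() { }                      // empty object, filled by readFields

        VertexPair(int first, int second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(first);
            out.writeInt(second);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            first = in.readInt();             // same order as in write()
            second = in.readInt();
        }
    }

    // Serializes an object into a byte array, as if sending or checkpointing it.
    static byte[] toBytes(SimpleWritable writable) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writable.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuilds a fresh copy from the bytes, as if recovering after a failure.
    static VertexPair roundTrip(VertexPair original) {
        try {
            VertexPair copy = new VertexPair();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(toBytes(original))));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        VertexPair copy = roundTrip(new VertexPair(7, 42));
        System.out.println(copy.first + " " + copy.second);
    }
}
```

A plain ArrayList has no write/readFields methods, so a framework built on this interface would not know how to turn it into bytes; a Writable list adds exactly that ability.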

I could keep extending this list, but I’m already moving beyond the scope of this post. 🙂 I will only mention the directory with simple Giraph examples, which is giraph-examples/src/main/java. Some of the examples given there are Shortest Paths and PageRank.

And now, we are – somehow – ready to code in Giraph 😉

