When planning to run a code in Giraph, I ask myself some questions. When I answer to all my questions, I move to actually implement and run the code. (so I kinda discuss a lot with myself :p). Let’s have a look to this inner discussion – while running the Shortest Paths problem.
~~~~~ Q#1: What’s the Shortest Path problem?
Problem Description: Find the shortest path between 2 vertices in a graph, so that the sum of weights of the edges in the path is minimized. The example given in Giraph finds the shortest path from each vertex to the source-vertex.
~~~~~ Q#2: How can this be implemented in Giraph?
Think “Pregely”: Since in Pregel the same code is executed in all vertices at the same time, we need to think as we are a vertex.
So, I am a vertex. I should receive messages, make some computation and send messages. If I am the source-vertex, then no edges are needed to reach myself and therefore the shortest path to myself is weighted to zero. Otherwise, I want to find the shortest path to the given source vertex. I do not know any information other than: (i) my Id (not so useful in this example), (ii) my value (which is initialized to a maximum constant at the beginning), (iii) the values on the edges going out from me (read from input file) and (iv) the messages I receive from neighbors (computed during this phase). Since I do not know what’s going on between me and the source (how many other vertices exist, how much they weight), I should receive messages with the sum of edges-weights from the source up to me. From all the messages I receive, I choose the minimum sum (this is the shortest path) and make it my value only if it’s smaller than my current value. Then I prepare the messages to be sent to my neighbors; for each of my edges, I add the edge weight to my value and send the new sum to the destination vertex of this edge.
~~~~~ Q#3: How should the main code of Shortest Paths be?
The above computation of the vertex is the compute() method of the algorithm. This algorithm is one of the Giraph examples. Directory: giraph-examples/scr/main/java/. Package: org.apache.giraph.examples. Name: SimpleShortestPathsVertex.java
~~~~~ Q#4: What is the input file? What information do I need in order to run the algorithm and in what format am I gonna receive this information? Is any of the existing java files for I/O capable to read my input file?
- The Input File should contain vertices and edges between them with weights on the edges. Let’s say, each line should have the SourceVertexId, and tubles of DestinationVertexId and EdgeWeight.
- The VertexId and EdgeWeight should be of some type: int, double, whatever.
- Yes, there is an existing input format, i.e. JsonLongDoubleFloatDoubleVertexInputFormat. This file is a Vertex Reader. It expects to receive information for a vertex; a Long Vertex Id, a Double Vertex Value, a Float Edge Value and expects the messages sent/received to transfer Double values. This takes lines in Json format. A very good explanation of the Json format is given in the comments of the file: “The files should be in the following JSON format:
JSONArray(, , JSONArray(JSONArray(, ), …)).”
Here is a simple input file I have created. I give 0 to all vertices values, because anyway the code does not take them into consideration, it initializes them to a MAX constant.
[0,0,[[1,1],[3,3]]] [1,0,[[0,1],[2,2],[3,1]]] [2,0,[[1,2],[4,4]]] [3,0,[[0,3],[1,1],[4,4]]] [4,0,[[3,4],[2,4]]]
~~~~~ Q#5: What is the output file? What do I want to print in the output file? Is any of the existing java files for I/O capable to generate the desired output file?
- The Output File should contain all vertices with the value of their shortest path to the source-vertex. Let’s say, each line should have the VertexId and the the Sum of the weights for its shortest path.
- The VertexId and the Sum should be of some type, int, double, whatever.
- Yes, there is an existing output format, i.e. IdWithValueTextOutputFormat. This writes out vertices’ IDs and values.
~~~~~ Q#6: How do I run the algorithm?
Start the MapReduce and run the Giraph command line.
1. In the hadoop directory run:
2. Create the input folder in the HDFS and move the input file there.
hadoop fs -mkdir /in hadoop fs -put /local-directory-to-input-file/input_file /in/
3. Run the command, in which you should include: (i) the jar file generated when installing giraph, (ii) the path to the main code, (iii) the path to the code for reading the input file and the path to the input file, (iv) the path to the code for generating the output file and the path to the output file, (v) the number of workers. Below I give all these parameters in the same order.
hadoop jar /directory-to-giraph/giraph-core/target/giraph-0.2-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsVertex -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /in/input-json -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /outShortest -w 1
~~~~~ Q#7: What results should I expect?
If it runs successfully, then an output folder named outShortest is created with the output file. Open the file:
hadoop fs -cat /outShortest/*
The result should be this:
0 1.0 2 2.0 1 0.0 3 1.0 4 5.0
Here the source vertex was the Vertex Id 1. This was given as a constant at the beginning of the code and can be of course changed.
Observation: I have noticed that this algorithm works only when the input file gives an undirected graph, i.e. when both edges a → b and b → a are included in the input file.
Why? At Superstep 0, the vertices get initialized with a MAX value. If a vertex does not exist at superstep 0, when it gets created because of receiving a message, its value will not be initialized to MAX like the others (since we are not at superstep 0 anymore). Therefore, its behavior get a bit random with no value initialization.
What should we do? I submitted my first patch!
And the conclusion is…? I am learning new stuff! One of the Giraph members suggested to use the VertexValueFactory.java* to fix the problem. We are in the process of fixing it and in the meanwhile I’m learning more cool stuff 😀
* The VertexValueFactory is responsible to initialize the vertex value when it is created either by reading the input file or by receiving a message.