Latest Update: Check Questions: 4, 5, 7, 8 for changes. Instead of using the internal InputFormat and OutputFormat that SimplePageRankComputation has (which are currently buggy), I use others to make it work!
I’ve noticed an increase of the views for the Shortest Paths example, so I decided to post my fairytale with PageRank as well. Please! Any suggestions, improvements, positive/negative feedback about these posts are more than welcome! I will respect you till the end of time 😉
So, let’s ask ourselves.
~~~~~ Q#1: What’s the PageRank problem?
Problem Description: Assign a weight to each node of a graph which represents its relative importance inside the graph. PageRank usually refers to a set of webpages, and tries to measure which ones are the most important in comparison with the rest from the set. The importance of a webpage is measured by the number of incoming links, i.e. references it receives from other webpages.
Understand the details: PageRank represents the likelihood that a person reaches a particular page by randomly clicking on links. Initially, a person has to click to one of the webpages in the set. There is an equal probability to click to any webpage at that moment, i.e. 1/num_of_webpages (we use 0.15 in this example). After clicking to one webpage the first time, the next possible webpage to be clicked can be one of the webpages that the first webpage links to. That means, the probability to click to the second webpage is the probability of the referrer-webpage divided by the number of webpages it links to. Imagine the second webpage has many referrers. Then the probability of clicking to this webpage is: the probability of clicking to the first webpage (we use 0.15 in this case) PLUS the rest of the probability (1-0.15=0.85) * the summation of the probabilities of its referrers divided by the number of webpages they link to. Read it again. Think it twice. You can see the iteration – at least somehow. Please, have a look at the great explanation of PageRank in Wiki for better understanding.
~~~~~ Q#2: How can this be implemented in Giraph?
Think “Pregely”: I am a vertex and I represent a node, let’s say a webpage. I should receive messages, make some computation and send messages. My importance depends on the number of incoming edges. Thus, I expect my neighbors, i.e. the vertices connected to me, to send me a message via the edges, in order to compute my importance correctly. For each message I receive, I add it to a sum parameter (which is initialized to 0 before starting receiving messages). This sum is then used in the computation of my importance which will be saved as my value. I compute my value with the following equation: myValue = (0.15 / totalNumberVertices) + (0.85 * sum). Then, I prepare the messages to be sent to my neighbors; my importance-weight must be divided among the vertices I link to. Thus, I divide my value by the number of edges and send the result to all my neighbors. Note: In the first iteration, I will am not receiving any message, thus I will proceed directly to the last step of preparing the message (like explained just now) and send it to my neighbors.
~~~~~ Q#3: How should the main code of PageRank be?
The code exists in the Giraph examples. Directory: giraph-examples/scr/main/java/. Package: org.apache.giraph.examples. Name: SimplePageRankComputation.java
~~~~~ Q#4: What is the input file? What information do I need in order to run the algorithm and in what format am I gonna receive this information? Is any of the existing java files for I/O capable to read my input file?
- The Input File should contain vertices and edges between them with weights on the edges. Let’s say, each line should have the SourceVertexId, SourceVertexValue and tubles of DestinationVertexId and EdgeWeight.
- The VertexId, VertexValue and EdgeWeight should be of some type: int, double, whatever.
- I use an existing input format, the JsonLongDoubleFloatDoubleVertexInputFormat. This class is a Vertex Reader. It expects to receive information for a vertex; a Long Vertex Id, a Double Vertex Value, a Float Edge Value.
- Here is a simple input file I have created. It’s the same I used in a previous post about running the Shortest Paths example. I give 0 to all vertices values, because anyway the code does not initially take them into consideration.
[0,0,[[1,1],[3,3]]] [1,0,[[0,1],[2,2],[3,1]]] [2,0,[[1,2],[4,4]]] [3,0,[[0,3],[1,1],[4,4]]] [4,0,[[3,4],[2,4]]]
~~~~~ Q#5: What is the output file? What do I want to print in the output file? Is any of the existing java files for I/O capable to generate the desired output file?
- The Output File should contain all vertices with their value which represents their relative importance in the set. Let’s say, each line should have the VertexId and VertexValue.
- The VertexId can be of any type (Int, Long, Double) and the VertexValue should be of either Long or Double since it holds a result from computations with real numbers.
- I use an existing output format, the IdWithValueTextOutputFormat. This writes out vertices’ IDs and values.
~~~~~ Q#6: Is there anything new? Any additional knowledge by studying this example?
Yes there is! 😉 Let’s introduce Master Compute! (applause)
The Master Compute is an additional feature offered by Giraph. It takes the role of the master (duh) and can execute some computation between the iterations. This is a way to introduce centralization into our algorithms. In each iteration, the Master Compute runs first before the workers.
What’s more? Aggregators! (again applause)
The Aggregators are also additional features to the Pregel model. They allow global computation, checking global conditions and keeping statistics. During a superstep, vertices can send values to an aggregator. At the end of the superstep, the aggregator aggregates these values to produce a global result (sum, max, min, etc.), thus vertices can receive the aggregated global value in the next superstep.
How to use both Master Compute and Aggregators into my code?
SimplePageRankComputation does it already. Since, the Master Compute runs first before any worker, it is very convenient that the master compute should be responsible to register the aggregators needed. Once the aggregators are registered – at the very beginning of the superstep 0, vertices can send values at any point and receive the result in the next iteration. There are 3 aggregators in the code for calculating: (i) the maximum vertex value, (ii) the minimum vertex value, (iii) sum of values.
Thirsty for more details? Read the awesome article about Aggregators and Master Compute from the website of Giraph.
~~~~~ Q#7: How do I run the algorithm?
Start the MapReduce and run the Giraph command line.
1. In the hadoop directory run:
2. Create the input folder in the HDFS and move the input file there.
hadoop fs -mkdir /in hadoop fs -put /local-directory-to-input-file/input_file /in/
3. Run the command, in which you should include: (i) the jar file generated when installing giraph, (ii) the path to the main code, (iii) the path to the code for reading the input file and the path to the input file, (iv) the path to the code for generating the output file and the path to the output file, (v) the number of workers, (vi) the master compute class. Below I give all these parameters in the same order. Notice that the master compute class is part of the main code SimplePageRankComputation. Thus, we call this class through the main code using the symbol \$.
hadoop jar /home/marsty5/Programming/EclipseProjects/giraph/giraph-core/target/giraph-0.2-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimplePageRankComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /in/input_file -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /outPageRank -w 2 -mc org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute
~~~~~ Q#8: What results should I expect?
If it runs successfully, then an output folder named outPageRank is created with the output file. Open the file:
hadoop fs -cat /out/outPageRank/* 0 0.16682289373110673 4 0.17098446073203238 2 0.17098446073203238 1 0.24178880797750443 3 0.24178880797750443
A vertex value shows the probability to reach that vertex. Since the highest probability is 1, if you sum up the vertices values, the result should be ~1 😉
So, did everything work right for you?