Run example in Giraph: PageRank

Latest Update: Check Questions 4, 5, 7 and 8 for changes. Instead of using the internal InputFormat and OutputFormat that SimplePageRankComputation comes with (which are currently buggy), I use other ones to make it work!

I’ve noticed an increase in the views of the Shortest Paths example, so I decided to post my fairytale with PageRank as well. Please! Any suggestions, improvements, positive/negative feedback about these posts are more than welcome! I will respect you till the end of time 😉

So, let’s ask ourselves.

~~~~~ Q#1: What’s the PageRank problem?

Problem Description: Assign a weight to each node of a graph that represents its relative importance inside the graph. PageRank usually refers to a set of webpages and tries to measure which ones are the most important compared to the rest of the set. The importance of a webpage is measured by the number of incoming links, i.e. the references it receives from other webpages.

Understand the details: PageRank represents the likelihood that a person reaches a particular page by randomly clicking on links. In every step, this random surfer either jumps to a completely random webpage of the set (with a small probability, 0.15 in this example) or follows one of the links of the page he is currently on (with the remaining probability, 1-0.15=0.85). The random jump is equally likely to land on any webpage, so each webpage gets 0.15/num_of_webpages from it. The link-following part depends on the referrers: the chance of arriving at a webpage through a link is the importance of the referrer divided by the number of webpages the referrer links to, since each outgoing link of the referrer is equally likely to be clicked. So the importance of a webpage is: 0.15/num_of_webpages PLUS the rest of the probability (0.85) * the summation of the importances of its referrers divided by the number of webpages they link to. Read it again. Think it twice. You can see the iteration – at least somehow. Please, have a look at the great explanation of PageRank on Wikipedia for better understanding.
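
To have the whole update in one line, this is the formula every webpage ends up computing in each round (my own summary, using the constants from the Giraph example; outlinks(r) is the number of webpages referrer r links to):

myImportance = 0.15 / num_of_webpages + 0.85 * ( importance(referrer_1)/outlinks(referrer_1) + importance(referrer_2)/outlinks(referrer_2) + ... )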

~~~~~ Q#2: How can this be implemented in Giraph?

Think “Pregely”: I am a vertex and I represent a node, let’s say a webpage. I should receive messages, make some computation and send messages. My importance depends on the number of incoming edges. Thus, I expect my neighbors, i.e. the vertices connected to me, to send me a message via the edges, in order to compute my importance correctly. For each message I receive, I add it to a sum parameter (which is initialized to 0 before I start receiving messages). This sum is then used in the computation of my importance, which will be saved as my value. I compute my value with the following equation: myValue = (0.15 / totalNumberVertices) + (0.85 * sum). Then, I prepare the messages to be sent to my neighbors; my importance-weight must be divided among the vertices I link to. Thus, I divide my value by the number of my outgoing edges and send the result to all my neighbors. Note: In the first superstep, I am not receiving any messages, so I proceed directly to the last step of preparing the message (as explained just now) and send it to my neighbors.
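
Here is a rough sketch of this vertex logic in Java. It is not the exact SimplePageRankComputation source, just the same idea written against the Giraph BasicComputation API as I understand it (the class name and the MAX_SUPERSTEPS constant are my own):

import java.io.IOException;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class PageRankSketch extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  /** How many supersteps to run before halting (my own choice for this sketch). */
  private static final int MAX_SUPERSTEPS = 30;

  @Override
  public void compute(
      Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() >= 1) {
      // Sum up the importance my in-neighbors sent me.
      double sum = 0;
      for (DoubleWritable message : messages) {
        sum += message.get();
      }
      // myValue = (0.15 / totalNumberVertices) + (0.85 * sum)
      vertex.setValue(new DoubleWritable(
          0.15 / getTotalNumVertices() + 0.85 * sum));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
      // Split my importance evenly among the vertices I link to.
      sendMessageToAllEdges(vertex,
          new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
    } else {
      vertex.voteToHalt();
    }
  }
}

The real example additionally feeds some aggregators from inside this method (more on those in Q#6); I left that out to keep the sketch short.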


~~~~~ Q#3: What should the main code of PageRank look like?

The code exists in the Giraph examples. Directory: giraph-examples/src/main/java/. Package: org.apache.giraph.examples. Name: SimplePageRankComputation.java


~~~~~ Q#4: What is the input file? What information do I need in order to run the algorithm and in what format am I gonna receive this information? Are any of the existing Java I/O classes capable of reading my input file?

  • The Input File should contain vertices and the edges between them, with weights on the edges. Let’s say, each line should have the SourceVertexId, the SourceVertexValue and tuples of DestinationVertexId and EdgeWeight.
  • The VertexId, VertexValue and EdgeWeight should be of some type: int, double, whatever.
  • I use an existing input format, the JsonLongDoubleFloatDoubleVertexInputFormat. This class is a Vertex Reader. It expects to receive information for a vertex: a Long Vertex Id, a Double Vertex Value and a Float Edge Value.
  • Here is a simple input file I have created. It’s the same one I used in a previous post about running the Shortest Paths example. I give 0 as the value of all vertices, because the code does not take the initial values into consideration anyway. (I explain the line format right after the listing.)
[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]
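
In case the JSON lines above are not self-explanatory, each of them follows this pattern (my own annotation of what JsonLongDoubleFloatDoubleVertexInputFormat expects):

[source_id, source_value, [[target_id, edge_value], [target_id, edge_value], ...]]

So the first line says: vertex 0 has value 0 and has an edge to vertex 1 with weight 1 and an edge to vertex 3 with weight 3.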


~~~~~ Q#5: What is the output file? What do I want to print in the output file? Are any of the existing Java I/O classes capable of generating the desired output file?

  • The Output File should contain all vertices with their values, which represent their relative importance in the set. Let’s say, each line should have the VertexId and the VertexValue.
  • The VertexId can be of any type (Int, Long, Double) and the VertexValue should be a Double (or Float), since it holds the result of computations with real numbers.
  • I use an existing output format, the IdWithValueTextOutputFormat. This writes out each vertex’s ID and value.


~~~~~ Q#6: Is there anything new? Any additional knowledge to gain by studying this example?

Yes there is! 😉 Let’s introduce Master Compute! (applause)

The Master Compute is an additional feature offered by Giraph. It takes the role of the master (duh) and can execute some computation between the iterations. This is a way to introduce centralization into our algorithms. In each iteration, the Master Compute runs before the workers do.

What’s more? Aggregators! (again applause)

The Aggregators are another addition to the Pregel model. They allow global computation: checking global conditions and keeping statistics. During a superstep, vertices can send values to an aggregator. At the end of the superstep, the aggregator combines these values into a global result (sum, max, min, etc.), so vertices can read the aggregated global value in the next superstep.

How to use both Master Compute and Aggregators in my code?

SimplePageRankComputation does it already. Since the Master Compute runs before any worker, it is very convenient to make it responsible for registering the aggregators that are needed. Once the aggregators are registered – at the very beginning of superstep 0 – vertices can send values to them at any point and receive the aggregated result in the next superstep. There are 3 aggregators in the code, calculating: (i) the maximum vertex value, (ii) the minimum vertex value, (iii) the sum of values.
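
Below is a minimal sketch of what that looks like, assuming the Giraph aggregator API I am familiar with (the class name and the aggregator names are my own, not necessarily the ones used in SimplePageRankComputation):

import org.apache.giraph.aggregators.DoubleMaxAggregator;
import org.apache.giraph.aggregators.DoubleMinAggregator;
import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;

public class PageRankMasterComputeSketch extends DefaultMasterCompute {
  // Aggregator names are arbitrary strings chosen by me.
  public static final String MAX_AGG = "max";
  public static final String MIN_AGG = "min";
  public static final String SUM_AGG = "sum";

  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    // Called before superstep 0, so the aggregators exist from the very start.
    registerAggregator(MAX_AGG, DoubleMaxAggregator.class);
    registerAggregator(MIN_AGG, DoubleMinAggregator.class);
    // A persistent aggregator keeps its value across supersteps.
    registerPersistentAggregator(SUM_AGG, LongSumAggregator.class);
  }
}

On the worker side, a vertex contributes with aggregate(MAX_AGG, vertex.getValue()) inside its compute() method and can read the previous superstep’s global result with getAggregatedValue(MAX_AGG).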

Thirsty for more details? Read the awesome article about Aggregators and Master Compute on the Giraph website.


~~~~~ Q#7: How do I run the algorithm?

Start MapReduce and run the Giraph command line.

1. In the Hadoop directory, run:

bin/start-dfs.sh
bin/start-mapred.sh

2. Create the input folder in the HDFS and move the input file there.

hadoop fs -mkdir /in
hadoop fs -put /local-directory-to-input-file/input_file /in/

3. Run the command, in which you should include: (i) the jar file generated when installing Giraph, (ii) the main class, (iii) the class for reading the input file and the path to the input file, (iv) the class for generating the output file and the path to the output file, (v) the number of workers, (vi) the master compute class. Below I give all these parameters in the same order. Notice that the master compute class is an inner class of the main code SimplePageRankComputation. Thus, we reference it through the main class using the symbol \$.

hadoop jar /home/marsty5/Programming/EclipseProjects/giraph/giraph-core/target/giraph-0.2-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimplePageRankComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /in/input_file -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /outPageRank -w 2 -mc org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute 


~~~~~ Q#8: What results should I expect?

If it runs successfully, then an output folder named outPageRank is created with the output file. Open the file:

hadoop fs -cat /outPageRank/*
0 0.16682289373110673
4 0.17098446073203238
2 0.17098446073203238
1 0.24178880797750443
3 0.24178880797750443

A vertex value shows the probability of reaching that vertex. Since all the probabilities together must add up to 1, if you sum up the vertex values the result should be ~1 (the five values above add up to roughly 0.99) 😉

So, did everything work right for you?



18 thoughts on “Run example in Giraph: PageRank”

  1. Hi,
    I have a question about the Master Compute part. I could not find either SimplePageRankCompute or SimplePageRankVertexMasterCompute classes in the latest code, or in the code archives. So which Master Compute class is this example using?

    • Search for SimplePageRankComputation. I just noticed in some lines I wrote it wrong 🙂
      The SimplePageRankMasterCompute class lies inside the java file SimplePageRankComputation.java

      • Got it! Thanks 🙂 It would be great if you could write about worker context too. I read something about it in another blog, but it wasn’t too clear.

      • Yeah, I want to write about worker context as well 🙂 I need to fully understand it first and then I’ll post about it! (after my thesis defense!)

  2. Hey,

    You have done a great job 🙂

    I have one query.
    How can I compile and run .java written by me. I have so far tried examples only.

    Thanks in advance

    • Hi Harsh,

      Thank you 🙂

      Hmm, the easiest and quickest way is to download the Giraph code from Git into the Eclipse (or any other platform you are familiar with), create a package inside Giraph code and put your code there. So, whenever you compile giraph, your code will be compiled as well. This is also the worst way to do it, because (I think) whenever you update to the latest code of Giraph, you may lose your code (? not sure though).

      Definitely, it’s better to keep your code separate from Giraph code. You can create your own maven project and in the pom.xml of your project you can include the giraph and hadoop versions you are using. Surely, if you are not familiar with maven it will cost you some time.

      If you want a fast/naive solution, go for the 1st way. If you want a more permanent/correct solution, go for the 2nd.

      And please! Make this question in the mailing list! I’m far from calling myself expert, so from the mailing list you can get more complete answers!

      When you do it, keep me updated! 🙂
      (I should post something about this :D)
      Cheers

    • Hey, thanks! Well, a vertex holds 4 values: the VertexID, the VertexValue, the EdgeValue and the MessageValue. You can set the type/class of its ID to be ‘Text’. Also, you need to change the type of the ID to ‘Text’ in the InputFormat and OutputFormat, in order to read from the file and write to a file correctly!

  3. Hi,
    it’s really helpful to me as a freshman in Giraph. It has taken me plenty of time to make it work, and thanks for sharing your brilliant job, which helps me save a lot of time. Now one question left: in the command, what is the function of the parameter -mc? or another saying : -mc org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute is for what? Wish your answer. Thanks again! ^-^

    • -mc stands for Master Compute
      As mentioned above:
      ” Master Compute runs first before any worker, it is very convenient that the master compute should be responsible to register the aggregators needed. Once the aggregators are registered – at the very beginning of the superstep 0, vertices can send values at any point and receive the result in the next iteration. There are 3 aggregators in the code for calculating: (i) the maximum vertex value, (ii) the minimum vertex value, (iii) sum of values.”

      and also that:
      “the master compute class is part of the main code SimplePageRankComputation. Thus, we call this class through the main code using the symbol \$.”

      I guess it’s self explanatory now on wards….(Lol… I am a year late to reply but I just started off 🙂

  4. In the PageRank algorithm, you can select how many iterations of PageRank to calculate before deciding to finally stop. How would specify the number of iterations here?

  5. Hey Maria, based on your post I could run the PageRank example code now on my C5 cluster. Thanks for the nice post.

  6. Hi,

    Great set of articles! Really very helpful for getting started. Thanks a lot.

    I was wondering if you have some time to write about WorkerContext .. I am unable to find any resources (like the one about aggregators) which explains what they are for.

    Thanks again,
    John

  7. Hi i am facing this issue when i am running the example.plz help

    hduser@bigtapp:/usr/local/giraph$ hadoop jar /usr/local/giraph/giraph-examples/target/giraph-examples-1.2.0-SNAPSHOT-for-hadoop-2.7.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimplePageRankComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/input/tiny.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/output/SimplePageRank -w 1 -ca giraph.SplitMasterWorker=false

    15/05/07 11:28:38 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
    15/05/07 11:28:38 INFO utils.ConfigurationUtils: No edge output format specified. Ensure your OutputFormat does not require one.
    15/05/07 11:28:38 INFO utils.ConfigurationUtils: Setting custom argument [giraph.SplitMasterWorker] to [false] in GiraphConfiguration
    15/05/07 11:28:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    15/05/07 11:28:39 INFO Configuration.deprecation: mapreduce.job.counters.limit is deprecated. Instead, use mapreduce.job.counters.max
    15/05/07 11:28:39 INFO Configuration.deprecation: mapred.job.map.memory.mb is deprecated. Instead, use mapreduce.map.memory.mb
    15/05/07 11:28:39 INFO Configuration.deprecation: mapred.job.reduce.memory.mb is deprecated. Instead, use mapreduce.reduce.memory.mb
    15/05/07 11:28:39 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
    15/05/07 11:28:39 INFO Configuration.deprecation: mapreduce.user.classpath.first is deprecated. Instead, use mapreduce.job.user.classpath.first
    15/05/07 11:28:39 INFO Configuration.deprecation: mapred.map.max.attempts is deprecated. Instead, use mapreduce.map.maxattempts
    15/05/07 11:28:39 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 1, old value = 4)
    15/05/07 11:28:39 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
    15/05/07 11:28:39 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    15/05/07 11:28:42 INFO mapreduce.JobSubmitter: number of splits:1
    15/05/07 11:28:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1430974889906_0002
    15/05/07 11:28:42 INFO impl.YarnClientImpl: Submitted application application_1430974889906_0002
    15/05/07 11:28:42 INFO mapreduce.Job: The url to track the job: http://bigtapp:8088/proxy/application_1430974889906_0002/
    15/05/07 11:28:42 INFO job.GiraphJob: Tracking URL: http://bigtapp:8088/proxy/application_1430974889906_0002/
    15/05/07 11:28:42 INFO job.GiraphJob: Waiting for resources… Job will start only when it gets all 2 mappers
    15/05/07 11:28:55 INFO mapreduce.Job: Running job: job_1430974889906_0002
    15/05/07 11:28:55 INFO mapreduce.Job: Job job_1430974889906_0002 running in uber mode : false
    15/05/07 11:28:55 INFO mapreduce.Job: map 100% reduce 0%
    15/05/07 11:28:55 INFO mapreduce.Job: Job job_1430974889906_0002 failed with state FAILED due to: Task failed task_1430974889906_0002_m_000000
    Job failed as tasks failed. failedMaps:1 failedReduces:0

    15/05/07 11:28:55 INFO mapreduce.Job: Counters: 8
    Job Counters
    Failed map tasks=1
    Launched map tasks=1
    Other local map tasks=1
    Total time spent by all maps in occupied slots (ms)=5785
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=5785
    Total vcore-seconds taken by all map tasks=5785
    Total megabyte-seconds taken by all map tasks=5923840

    when i check in output file nothing is there..

  8. Hi,

    I’ve recently builded the giraph on my local machine. and run the SimpleShortestPathsComputation and SimplePageRankComputation. After changing something(adding simple print statement etc) in those java file I’m again building giraph. But that change is not reflecting when I’m re-running those classes. Please guide me to resolve this

