Correcting report… Check.
Preparing slides… Check.
Getting feedback from my supervisors in Telefónica… Check.
Getting feedback from my supervisor in UPC… Check.
Correcting slides… In Progress.
Making 3495736254 rehearsals… In Progress.
Anxious mode… ON!
“Let’s spread some Pregel and Giraph love and knowledge” mode… ON!
See you on Monday! 😉
Right now, I should not be writing here, but only in my report :p But hey! I will be fast 😀
The day I was waiting for so long is approaching! 4 days till delivering the final thesis report. (teeth grinding, tears rolling, and a secret smile waits to give its huge finale to this 6-month performance)
I have so many words, definitions and numbers going around my head. And all this “jungle bubble” is taking structure in a form of sentences but getting restricted and limited in some lines of – somehow – academic writing.
I implemented 3 algorithms; all of them are Pregel-based, implemented on top of Giraph. They are iterative, vertex-centric and scalable.
Latest Update: Check Questions: 4, 5, 7, 8 for changes. Instead of using the internal InputFormat and OutputFormat that SimplePageRankComputation has (which are currently buggy), I use others to make it work!
I’ve noticed an increase of the views for the Shortest Paths example, so I decided to post my fairytale with PageRank as well. Please! Any suggestions, improvements, positive/negative feedback about these posts are more than welcome! I will respect you till the end of time 😉
So, let’s ask ourselves.
~~~~~ Q#1: What’s the PageRank problem?
Problem Description: Assign a weight to each node of a graph which represents its relative importance inside the graph. PageRank usually refers to a set of webpages, and tries to measure which ones are the most important in comparison with the rest from the set. The importance of a webpage is measured by the number of incoming links, i.e. references it receives from other webpages.
When planning to run a code in Giraph, I ask myself some questions. When I answer to all my questions, I move to actually implement and run the code. (so I kinda discuss a lot with myself :p). Let’s have a look to this inner discussion – while running the Shortest Paths problem.
~~~~~ Q#1: What’s the Shortest Path problem?
Problem Description: Find the shortest path between 2 vertices in a graph, so that the sum of weights of the edges in the path is minimized. The example given in Giraph finds the shortest path from each vertex to the source-vertex.
~~~~~ Q#2: How can this be implemented in Giraph?
Think “Pregely”: Since in Pregel the same code is executed in all vertices at the same time, we need to think as we are a vertex. Continue reading
After setting up Giraph, I did – at least I wanted to – my first jump into it by running the Shortest Path example. Well, my first try was an epic fail by digging into code, getting lost inside packages and classes with no clue of what’s going on (This was my first attempt to work on an open-source project).
What I needed to do first (before getting lost), was to observe and understand how things work in an open-source project and specifically in Giraph. So, here’s what you need to know before even touching the keyboard: Continue reading
<< Please leave a comment whether you found this guide useful, consistent, accurate or even deficient and crap 🙂 >>
Setting up Giraph can be a bit tricky and time consuming, unless you follow word by word a (subjectively) good guide, like the one I am giving here :p (and still there is a high probability that something else will go wrong).
Giraph has a few prerequisites that need to be satisfied before running successfully. These are:
- Install Oracle Java.
- Set up Apache Hadoop.
- Set up Apache Maven (optional but strongly recommended).
After completing these steps, you can happily proceed with setting up Giraph!
Below, I am trying to give a clear guide to install all of them. The guide is based on the steps I followed to install them in machines with Ubuntu 64-bit 12.04 and Ubuntu 64-bit 12.10.
So, here it goes!
As I said in my previous post, I am interacting a lot with Apache Giraph. Before beginning with Giraph, I needed to understand the concept behind it. Giraph is an open-source implementation of the Pregel model, proposed by Google in 2010.
What is Pregel?
A model or framework designed for processing large-scale graphs.
How large can these large-scale graphs be? Well, let’s say billions is a good number. 🙂
What makes Pregel special?
Or shall I ask: Why not MapReduce (M/R)?