As I said in my previous post, I am interacting a lot with Apache Giraph. Before beginning with Giraph, I needed to understand the concept behind it. Giraph is an open-source implementation of the Pregel model, proposed by Google in 2010.
What is Pregel?
How large can these large-scale graphs be? Well, let’s say billions is a good number. 🙂
What makes Pregel special?
Or shall I ask: Why not MapReduce (M/R)?
Well, M/R and graph processing are not the best buddies. In M/R it’s quite hard to express graph algorithms, there is no option of calculating per node/vertex. Moreover, in M/R we have to pass the entire state of the graph from one phase to the other causing too much I/O traffic and watching the performance to go down.
Pregel comes to the rescue and offers what M/R cannot! The programmer can make calculations per vertex – thus it’s very expressive – and there is no necessity for passing the graph state – less I/O, better performance. Pregel makes the graph processing while preserving fault tolerance, scalability and being distributed.
How does Pregel work?
Initialization: To start, we dedicate several machines (or cores) for the Pregel computation. Among them one is elected as the master for coordinating the rest, the workers. After inserting as an input a directed graph, the graph gets divided into partitions, each one including a set of vertices and their outgoing edges.
Execution: The master assigns these partitions to the workers and then a loop of superstepsbegins! In each superstep, all vertices (in each worker) pass through 3 steps: (i) receive messages from other vertices, (ii) execute the same user-defined function, (iii) send messages to other vertices. At the end of each superstep, a global synchronization occurs among the workers and messages are exchanged.
Termination: This wonderful loop terminates when all vertices become inactive and there are no messages in transit (no messages to be sent to other vertices).
For more details – including strong and weak points of the paper, have a look at the review I made as part of ID2220 – Advance Topics in Distributed Systems, at KTH.