Okapi ML is in the house!

It’s been almost a month since grafos.ml officially launched at Telefónica R&D.

Grafos started as a side project with the main objective of developing tools and algorithms for large-scale graph mining and graph analytics. Grafos comes with two components:

  • The Okapi ML Library: the first library with algorithms built on top of Apache Giraph.
  • The Real-Time Giraph: a graph analytics engine that makes the analysis of dynamic graphs easy.

I have already received some questions about Okapi, and here’s what I answered.

Why Okapi? Okapi is the missing piece in the puzzle of graph analytics. Graph processing systems vary. Models, paradigms and techniques diverge. However, ready-made graph processing algorithms are an endangered species. Well, not for long 🙂 Okapi provides ready-to-use algorithms for analysing large-scale graphs efficiently.

Why Okapi instead of Mahout? Mahout is a well-established ML library that follows the MapReduce paradigm. Okapi aims to be an ML and Graph Analytics library built on Giraph. (Friendly reminder: MapReduce becomes suboptimal when it comes to iterative algorithms [yes, really], while Giraph embraces iterations and nails it with large-scale graphs.)

What does Okapi provide? So far, Okapi addresses two areas of interest: Collaborative Filtering and Graph Analytics. Documentation is provided along with the growing codebase. Here’s a direct link to Okapi. More algorithms are yet to come – and you can be part of it!

Is Real-Time Giraph possible? The limit is one’s imagination. Luckily, researchers at Telefónica set the limit far away and dare to innovate. As Claudio aptly explains, this modified version of the Giraph runtime aims to run “real-time computations of graph algorithms for event-based workloads. Without changing your existing code for offline analysis.” This component is yet to be published. Wait for it – or do not wait and become a tester by dropping a line!

The reactions and responses from institutions, researchers, and colleagues have been motivating. Okapi is open-source and anyone can download it or clone it. Give it a try! In the Okapi Mailing List you can post questions, find answers and provide any kind of feedback (yes, negative as well).

We are celebrating the launch of Grafos with the 1st Okapi Hackathon! The Hackathon starts on Wednesday, March 26th and ends on Thursday evening, March 27th. Location: Telefónica, in the amazing city of Barcelona. I wish I could be there for these two days. We hope and believe that this Hackathon will be a great opportunity for brainstorming, growing and improving the Okapi codebase, and welcoming new friends and collaborators.

We are super happy to welcome our first users and we are looking forward to growing the Okapi family!

Okapi is in the house – and it plans to stay 🙂

 

3 days before presenting

Correcting report… Check.

Preparing slides… Check.

Getting feedback from my supervisors at Telefónica… Check.

Getting feedback from my supervisor at UPC… Check.

Correcting slides… In Progress.

Making 3495736254 rehearsals… In Progress.

Anxious mode… ON!

“Let’s spread some Pregel and Giraph love and knowledge” mode… ON!

See you on Monday! 😉

 

4 days before delivering the Thesis Report

Right now, I should not be writing here, but only in my report :p But hey! I will be fast 😀

The day I have been waiting for so long is approaching! 4 days till I deliver the final thesis report. (teeth grinding, tears rolling, and a secret smile waiting to give its huge finale to this 6-month performance)

I have so many words, definitions and numbers going around in my head. And this whole “jungle bubble” is taking shape in the form of sentences, while being squeezed into a few lines of – somehow – academic writing.

I implemented 3 algorithms; all of them are Pregel-based, implemented on top of Giraph. They are iterative, vertex-centric and scalable.


Run example in Giraph: PageRank

Latest Update: Check Questions 4, 5, 7 and 8 for changes. Instead of using the internal InputFormat and OutputFormat that SimplePageRankComputation has (which are currently buggy), I use others to make it work!

I’ve noticed an increase in views for the Shortest Paths example, so I decided to post my fairytale with PageRank as well. Please! Any suggestions, improvements, or positive/negative feedback about these posts are more than welcome! I will respect you till the end of time 😉

So, let’s ask ourselves.

~~~~~ Q#1: What’s the PageRank problem?

Problem Description: Assign a weight to each node of a graph which represents its relative importance inside the graph. PageRank is usually applied to a set of webpages, and tries to measure which ones are the most important in comparison with the rest of the set. The importance of a webpage is measured by its incoming links, i.e. the references it receives from other webpages, where a link from an important page counts for more than a link from an unimportant one.
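To make the problem a bit more tangible, here is a minimal single-machine sketch in plain Python. It is not Giraph code and not the SimplePageRankComputation class; the function name `pagerank`, the toy graph `web`, the damping factor of 0.85 and the fixed number of iterations are all assumptions I picked for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Iterative PageRank over a dict {page: [pages it links to]}.

    The damping factor and iteration count are illustrative assumptions.
    """
    # Collect every page, including those that only appear as link targets.
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    n = len(pages)

    # Start with the same rank everywhere.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages without outgoing links is shared by everyone.
        dangling = sum(rank[p] for p in pages if not out_links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Every page splits its current rank evenly among the pages it links to.
        for page, targets in out_links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "c" is referenced by three pages, "d" by none.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most incoming links ends up on top, and the page nobody links to ends up at the bottom, which is the intuition behind the problem description above.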


Run Example in Giraph: Shortest Paths

When planning to run code in Giraph, I ask myself some questions. Once I have answered all of them, I move on to actually implementing and running the code. (so I kinda discuss a lot with myself :p). Let’s have a look at this inner discussion – while running the Shortest Paths problem.

~~~~~ Q#1: What’s the Shortest Path problem?

Problem Description: Find the shortest path between two vertices in a graph, so that the sum of the weights of the edges in the path is minimized. The example given in Giraph finds the shortest path from a single source vertex to every other vertex.

~~~~~ Q#2: How can this be implemented in Giraph?

Think “Pregely”: Since in Pregel the same code is executed on all vertices at the same time, we need to think as if we were a vertex.
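Here is a rough sketch of what thinking as a vertex could look like, simulated on one machine in plain Python. It is not the Giraph API and not the code of the Giraph example; the function name `pregel_shortest_paths`, the `inbox`/`outbox` dictionaries and the toy graph are assumptions made up for illustration, and edge weights are assumed to be non-negative.

```python
import math


def pregel_shortest_paths(edges, source):
    """Simulate vertex-centric supersteps for single-source shortest paths.

    edges: {vertex: {neighbour: weight}} with non-negative weights (assumed).
    Returns {vertex: distance from source}, math.inf if unreachable.
    """
    vertices = set(edges)
    for targets in edges.values():
        vertices.update(targets)

    # Every vertex starts out believing it is infinitely far from the source.
    distance = {v: math.inf for v in vertices}

    # Superstep 0: only the source has news (it is 0 away from itself).
    inbox = {source: [0.0]}

    # The computation halts once no messages are in flight.
    while inbox:
        outbox = {}
        for vertex, messages in inbox.items():
            # "As a vertex": pick the best offer I received in this superstep.
            best = min(messages)
            if best < distance[vertex]:
                distance[vertex] = best
                # Tell my neighbours how far they are if they go through me.
                for neighbour, weight in edges.get(vertex, {}).items():
                    outbox.setdefault(neighbour, []).append(best + weight)
        inbox = outbox
    return distance


# Hypothetical toy graph; "z" cannot be reached from "s".
graph = {
    "s": {"a": 1, "b": 4},
    "a": {"b": 2, "c": 6},
    "b": {"c": 1},
    "c": {},
    "z": {},
}
distances = pregel_shortest_paths(graph, "s")
print(distances)
```

Each pass of the `while` loop plays the role of one superstep: a vertex only looks at its own messages and its own outgoing edges, never at the whole graph.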

I set up Apache Giraph – Now what?

After setting up Giraph, I took – or at least wanted to take – my first jump into it by running the Shortest Path example. Well, my first try was an epic fail: I dug into the code and got lost inside packages and classes with no clue of what was going on (this was my first attempt to work on an open-source project).

What I needed to do first (before getting lost) was to observe and understand how things work in an open-source project, and specifically in Giraph. So, here’s what you need to know before even touching the keyboard:

How to Set up Apache Giraph

<< Please leave a comment whether you found this guide useful, consistent, accurate or even deficient and crap 🙂 >>

Setting up Giraph can be a bit tricky and time-consuming, unless you follow, word by word, a (subjectively) good guide, like the one I am giving here :p (and still there is a high probability that something else will go wrong).

Giraph has a few prerequisites that need to be satisfied before it can run successfully. These are:

  1. Install Oracle Java.
  2. Set up Apache Hadoop.
  3. Set up Apache Maven (optional but strongly recommended).

After completing these steps, you can happily proceed with setting up Giraph!
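If you want a quick sanity check that the three prerequisites are reachable from your shell, a small script along these lines may help. It is only a sketch: it assumes the executables are called `java`, `hadoop` and `mvn` and that they are on your PATH, which depends on how you installed them, and it does not check versions.

```python
import shutil

# Assumed executable names; adjust them if your installation differs.
REQUIRED_TOOLS = ("java", "hadoop", "mvn")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All prerequisites found on PATH.")
```

Remember that Maven is optional, so a missing `mvn` is not necessarily a showstopper.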

Below, I try to give a clear guide to installing all of them. The guide is based on the steps I followed to install them on machines with Ubuntu 64-bit 12.04 and Ubuntu 64-bit 12.10.

So, here it goes!
