How to Set up Apache Giraph

<< Please leave a comment whether you found this guide useful, consistent, accurate or even deficient and crap 🙂 >>

Setting up Giraph can be a bit tricky and time consuming, unless you follow word by word a (subjectively) good guide, like the one I am giving here :p (and still there is a high probability that something else will go wrong).

Giraph has a few prerequisites that need to be satisfied before running successfully. These are:

  1. Install Oracle Java.
  2. Set up Apache Hadoop.
  3. Set up Apache Maven (optional but strongly recommended).

After completing these steps, you can happily proceed with setting up Giraph!

Below, I try to give a clear guide for installing all of them. The guide is based on the steps I followed on machines running Ubuntu 64-bit 12.04 and Ubuntu 64-bit 12.10.

So, here it goes!

1. Install Oracle Java 7

(mostly via terminal)

Step 1: Go to the Oracle website. Press the button to download the Java Platform (JDK) 7uXX (where XX is the update of Java 7 you want to install). Select “Accept the License Agreement” and click on jdk-7uXX-linux-x64.tar.gz. Once it’s downloaded, move the tarball into a folder of your choice. Let’s call it directory-to-java-folder.

Step 2: Unpack the tarball and install the JDK with the command:

directory-to-java-folder$ tar -xvf jdk-7uXX-linux-x64.tar.gz

It creates a folder with the name jdk1.7.0_XX. Let’s call it java-folder.

Step 3: Open the bashrc (~/.bashrc) and add the path leading to the java-folder. Note that JAVA_HOME must point to the java-folder itself, not to the java binary inside it:

# Set JAVA environment variable 
export JAVA_HOME=/directory-to-java-folder/java-folder
export PATH=$PATH:$JAVA_HOME/bin

Reload the bashrc so that the new content of the file takes effect in the current terminal:

source ~/.bashrc

Any other terminals that are already open need to be closed and reopened to pick up the changes.
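Tools such as Hadoop and Maven expect JAVA_HOME to be the top-level JDK folder (the one that contains bin/). Here is a small sketch of a sanity check for that. The helper name check_java_home is made up for this guide and is not part of the JDK.

```shell
# check_java_home is a hypothetical helper, not a JDK tool.
# It succeeds only if the given folder looks like a JDK root,
# i.e. it contains executable bin/java and bin/javac files.
check_java_home() {
    local home="$1"
    if [ -z "$home" ]; then
        echo "JAVA_HOME is empty"
        return 1
    fi
    if [ ! -x "$home/bin/java" ] || [ ! -x "$home/bin/javac" ]; then
        echo "no JDK found under: $home"
        return 1
    fi
    echo "JAVA_HOME looks fine: $home"
}
```

After pasting the function into a terminal, run it as: check_java_home "$JAVA_HOME". If JAVA_HOME points at the java binary instead of the JDK folder, the check fails.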

Step 4: CHECK that JAVA JDK is set up correctly:

java -version

It should print something like this:

java version "1.7.0_XX"
Java(TM) SE Runtime Environment (build 1.7.0_XX-b21)
Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)

Note #1: The Java JRE is used for running Java programs. The JDK is suitable for developing (and also running) Java programs. Thus, by installing the JDK, there is no need to care about the JRE.

Note #2: I strongly recommend reading the Ubuntu Documentation about Java if you have any doubts or questions (it also has installation guides for other operating systems).

If you have reached this point without any problems, then Java works like a charm! Congrats! 😉 (You may delete the tarball now, since it’s not needed anymore)


2. Set up Apache Hadoop

Before even starting with Hadoop, you should install ssh and rsync (if you don’t have them already):

sudo apt-get install ssh 
sudo apt-get install rsync

Now make sure that you can ssh to localhost without giving a passphrase:

ssh localhost

If it asks for a passphrase, stop the command and run the following:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa 
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Try ‘ssh localhost’ again. It should print a welcome note upon successful connection to localhost, without asking for a passphrase.
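If ssh still asks for a password after the key is in place, one common cause is file permissions: with its default StrictModes setting, sshd ignores authorized_keys when the file or its folder is writable by other users. A minimal sketch of a fix follows. The helper name tighten_ssh_perms is made up for this guide.

```shell
# tighten_ssh_perms is a hypothetical helper name.
# It restricts the ssh folder (default: ~/.ssh) and its
# authorized_keys file to the owner only.
tighten_ssh_perms() {
    local dir="${1:-$HOME/.ssh}"
    if [ ! -f "$dir/authorized_keys" ]; then
        echo "no authorized_keys in $dir"
        return 1
    fi
    chmod 700 "$dir"
    chmod 600 "$dir/authorized_keys"
    echo "permissions tightened in $dir"
}
```

Run tighten_ssh_perms with no arguments to act on ~/.ssh. To test the login without any prompt at all, ssh -o BatchMode=yes localhost true should return silently on success.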

The next preliminary step is to decide which version of Apache Giraph you would like to have. Depending on the Giraph version, different Hadoop versions are recommended/compatible. In my case, I wanted to work on Giraph-X.X, and from their website I found out that “Apache Hadoop 0.20.203.0” is a suitable secure Hadoop version.

Step 1: Go to the Hadoop website and click on Download a release now!. Click on the first link, which is the mirror recommended for your location. If you cannot see the Hadoop version you want, click on the archives link and find and download the version you want. Once it’s downloaded, move the tarball into a folder of your choice. Let’s call it directory-to-hadoop-folder.

Step 2: Unpack the tarball and install Hadoop (below I show the command for Hadoop 0.20.203.0):

directory-to-hadoop-folder$ tar -xvf hadoop-0.20.203.0rc1.tar.gz

It creates a folder with the name hadoop-0.20.203.0. Let’s call it hadoop-folder.

Step 3: Open the bashrc file and add the path leading to the hadoop-folder:

# Set Hadoop environment variable
export HADOOP_HOME=/directory-to-hadoop-folder/hadoop-folder
export PATH=$PATH:$HADOOP_HOME/bin

Reload the bashrc so that the new content of the file takes effect in the current terminal:

source ~/.bashrc

Any other terminals that are already open need to be closed and reopened to pick up the changes.

Step 4: CHECK that Hadoop is set up correctly – DO NOT SKIP THIS STEP – CRITICALLY IMPORTANT:

which hadoop

It should print something like this:

/directory-to-hadoop-folder/hadoop-folder/bin/hadoop
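The same check can be scripted, which is handy when a new terminal suddenly reports that hadoop is not found. The sketch below compares what the shell finds on PATH with $HADOOP_HOME/bin/hadoop. The name check_hadoop_path is made up for this guide.

```shell
# check_hadoop_path is a hypothetical helper name.
# It verifies that the hadoop command found on PATH is the one
# that lives under $HADOOP_HOME/bin.
check_hadoop_path() {
    local found
    if ! found="$(command -v hadoop)"; then
        echo "hadoop is not on PATH"
        return 1
    fi
    if [ "$found" != "$HADOOP_HOME/bin/hadoop" ]; then
        echo "PATH resolves hadoop to $found, not to $HADOOP_HOME/bin/hadoop"
        return 1
    fi
    echo "hadoop resolves to $found"
}
```

If this works in one terminal but fails in a freshly opened one, the export lines are probably in a file that the new terminal does not read.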

Since we are on a single machine, let’s make Hadoop work in pseudo-distributed mode (the following instructions are taken from Single Node Setup).

Go into /directory-to-hadoop-folder/hadoop-folder/.

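Open conf/core-site.xml and make its <configuration> element read:

    <configuration>
         <property>
             <name>fs.default.name</name>
             <value>hdfs://localhost:9000</value>
         </property>
    </configuration>

Open conf/hdfs-site.xml and make its <configuration> element read:

    <configuration>
         <property>
             <name>dfs.replication</name>
             <value>1</value>
         </property>
    </configuration>

Open conf/mapred-site.xml and make its <configuration> element read:

    <configuration>
         <property>
             <name>mapred.job.tracker</name>
             <value>localhost:9001</value>
         </property>
    </configuration>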
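If you prefer not to edit the three files by hand, they can also be generated. The following is only a sketch: write_site_xml is a helper name made up for this guide, and it overwrites the target file with a single-property configuration, so do not use it on a file that already holds other settings.

```shell
# write_site_xml is a hypothetical helper name.
# It OVERWRITES the given file with a Hadoop configuration
# that contains exactly one property.
write_site_xml() {
    local file="$1" name="$2" value="$3"
    cat > "$file" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$name</name>
    <value>$value</value>
  </property>
</configuration>
EOF
}

# Usage, from inside the hadoop-folder:
#   write_site_xml conf/core-site.xml   fs.default.name    hdfs://localhost:9000
#   write_site_xml conf/hdfs-site.xml   dfs.replication    1
#   write_site_xml conf/mapred-site.xml mapred.job.tracker localhost:9001
```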

Now, let’s run a short example – to make sure that Hadoop works.

1. Format a new distributed filesystem and start the Hadoop daemons (HDFS and MapReduce)

bin/hadoop namenode -format
bin/start-all.sh

(Sometimes the last command does not run successfully, for me at least. In that case, instead of running start-all.sh, I run bin/start-dfs.sh followed by bin/start-mapred.sh.)
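To see whether the daemons really came up, the jps tool that ships with the JDK lists the running Java processes. In pseudo-distributed mode on this Hadoop line I would expect five of them: NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker. The sketch below reads jps output and prints whichever of these is missing. The helper name missing_daemons is made up for this guide.

```shell
# missing_daemons is a hypothetical helper name.
# It reads `jps` output ("<pid> <name>" per line) from stdin and
# prints each expected Hadoop daemon that is not listed.
missing_daemons() {
    local jps_output daemon
    jps_output="$(cat)"
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # Compare the second column exactly, so that "NameNode"
        # is not confused with "SecondaryNameNode".
        printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
            echo "$daemon"
    done
}
```

Use it as: jps | missing_daemons. No output means that all five daemons are running. If one is missing, its log file under the logs/ folder of the hadoop-folder is the place to look.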

2. Copy the input files into the distributed filesystem

bin/hadoop fs -put conf input

3. Run some of the examples provided

bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

4. View the output

bin/hadoop fs -cat output/*

5. Finally, stop the daemons

bin/stop-all.sh

Note #1: Have a look at the Single Node Setup and Cluster Setup guides from Apache for more information!

If you have reached this point without any problems, then Hadoop works like a charm! Congrats! 😉

(You may delete the tarball now, since it’s not needed anymore)


3. Set up Apache Maven

(mostly via terminal)

Step 1: Go to the Maven website. Download the latest tarball, apache-maven-3.X.X-bin.tar.gz, where X.X are the numbers for the latest version. Move the tarball into a folder of your choice. Let’s call it directory-to-maven-folder.

Step 2: Unpack the tarball and install Maven:

directory-to-maven-folder$ tar -xvf apache-maven-3.X.X-bin.tar.gz

It creates a folder with the name apache-maven-3.X.X. Let’s call it maven-folder.

Step 3: Open bashrc (with the command: vim ~/.bashrc) and add the environment variables:

export M2_HOME=/directory-to-maven-folder/maven-folder
export M2=$M2_HOME/bin
export PATH=$M2:$PATH

Reload the bashrc so that the new content of the file takes effect in the current terminal:

source ~/.bashrc

Any other terminals that are already open need to be closed and reopened to pick up the changes.

Step 4: CHECK that Maven is set up correctly:

mvn --version

It should print something like this (the Maven home should match the M2_HOME path specified above in the bashrc):

Apache Maven 3.0.4 (r1232337; 2012-01-17 09:44:56+0100)
Maven home: /directory-to-maven-folder/maven-folder
Java version: 1.7.0_11, vendor: Oracle Corporation
Java home: /usr/lib/jvm/jdk1.7.0_11/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.2.0-40-generic", arch: "amd64", family: "unix"
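The "Java home" line in that output is worth a second look, because it tells you which Java installation Maven will actually build with. Here is a small sketch that extracts it and compares it with the JDK folder you expect. Both function names are made up for this guide.

```shell
# maven_java_home and maven_uses_jdk are hypothetical helper names.

# Prints the path from the "Java home: ..." line of
# `mvn --version` output read from stdin.
maven_java_home() {
    sed -n 's/^Java home: //p'
}

# Succeeds if the Java home reported on stdin is the given JDK
# folder or lies inside it (for example its jre/ subfolder).
maven_uses_jdk() {
    local jdk="$1" reported
    reported="$(maven_java_home)"
    case "$reported" in
        "$jdk"|"$jdk"/*)
            echo "Maven uses $reported"
            return 0 ;;
        *)
            echo "Maven uses $reported, expected something under $jdk"
            return 1 ;;
    esac
}
```

Use it as: mvn --version | maven_uses_jdk "$JAVA_HOME". A mismatch usually means that Maven picked up a different Java installation than the one set up in step 1.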

If you have reached this point without any problems, then Maven works like a charm! Congrats! 😉

(You may delete the tarball now, since it’s not needed anymore)


4. Set up Apache Giraph

(via terminal)

Step 1: Create a folder into which you would like to download the Giraph project. Let’s call it giraph-folder.

Download the code from the online source repository:

giraph-folder$ git clone https://git-wip-us.apache.org/repos/asf/giraph.git

Step 2: Compile the code using Maven (this will take quite a long time). The clone command creates a subfolder named giraph inside giraph-folder, so run Maven from there:

giraph-folder/giraph$ mvn install

If the last message printed looks like this: [INFO] BUILD SUCCESS

it means that the build and installation succeeded! Congrats! Inside giraph-folder/giraph/giraph-core/, a folder named target/ is created, which contains a jar file with a name similar to this: giraph-0.2-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar
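Since the exact jar name depends on the Giraph and Hadoop versions, it can be easier to search for it. The sketch below assumes the jar name ends in -jar-with-dependencies.jar, as in the example name above. The helper name find_giraph_jar is made up for this guide.

```shell
# find_giraph_jar is a hypothetical helper name.
# Given the folder that contains giraph-core/ (default: the current
# folder), it prints the path of the jar-with-dependencies file.
find_giraph_jar() {
    local root="${1:-.}"
    find "$root/giraph-core/target" -maxdepth 1 -type f \
        -name 'giraph-*-jar-with-dependencies.jar' 2>/dev/null
}
```

Run it from the folder that contains giraph-core/. An empty result means that the build did not get as far as packaging the jar.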

If the build is not successful, then.. bad luck 😛 Leave a message and we can look into it… (though I’m not an expert, just a warning :p)

Finally, if you have reached this point without any problems, then everything works like a charm! Enjoy! 😉

Note #1: In my next few posts I will describe how to run examples from the Giraph library. Stay tuned!


— THE END —


8 thoughts on “How to Set up Apache Giraph”

  1. Pingback: How to setup Apache Giraph on Mac OS X | A great adventure in Computer Science

    • Hey! Thanks for the ping!
      So I guess my guide is only for Ubuntu and not for mac 🙂
      By having a very quick view at your post, it’s quite similar! I like that you also include the PageRank example and setting up Zookeeper. Keep it up!

  2. I got java, hadoop, openssh, maven, giraph right. All thanks to your blog.
    These are the outputs I’m currently getting
    —————————————————————————————————
    $java -version
    java version “1.7.0_55”
    OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1~0.13.10.1)
    OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

    $mvn --version
    Apache Maven 3.0.4
    Maven home: /usr/share/maven
    Java version: 1.7.0_55, vendor: Oracle Corporation
    Java home: /usr/lib/jvm/java-7-openjdk-amd64/jre
    Default locale: en_IN, platform encoding: UTF-8
    OS name: “linux”, version: “3.11.0-12-generic”, arch: “amd64”, family: “unix”

    $hadoop version
    Hadoop 2.4.0
    Subversion http://svn.apache.org/repos/asf/hadoop/common -r 1583262
    Compiled by jenkins on 2014-03-31T08:29Z
    Compiled with protoc 2.5.0
    From source with checksum 375b2832a6641759c6eaf6e3e998147
    This command was run using /home/gonephishing/Downloads/hadoop-2.4.0/share/hadoop/common/hadoop-common-2.4.0.jar

    ——————————————————————————————————
    But now when I open a new terminal, everything except $hadoop version is working. When I go for $hadoop version, I get an error message saying command not found.

    So I did $ssh localhost
    and then $hadoop version
    and this time i got the same output. I really don’t know why it’s happening like this and where does $ssh localhost fit in the complete picture.
    Please help

