Monday, February 20, 2012

Accumulo on Mac OSX

Quick point of interest for anyone wishing to run Apache Accumulo on a Mac. You need three Apache projects (Accumulo, Hadoop, ZooKeeper) installed for this to work, and luckily they all work flawlessly on a Mac.

Download the Accumulo incubating source here and build it using Maven as directed in the README. Move the distro someplace you will run everything from: /usr/local, /opt, or wherever you prefer. A hadoop user is usually created to run all the services, but I'm just using my own account and directories.
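For reference, the build boils down to something like the following; the version number here is an assumption from the incubating releases of the time, and the README has the authoritative Maven goals:

$ tar xzf accumulo-1.3.5-incubating-src.tar.gz
$ cd accumulo-1.3.5-incubating
$ mvn package          # add -DskipTests if the test run gives you trouble
$ cd ..
$ sudo mv accumulo-1.3.5-incubating /usr/local/accumulo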


Next, get Hadoop version 0.20.2, as also recommended by the Accumulo team in the README.

For the user you select to run Hadoop, ZooKeeper, and Accumulo, you will set the following common variable in the shell used to start all the services. There is also an option to add it to each service's config, so you need to know its value in any event.

Add this line (or its output) to the file $HADOOP_HOME/conf/hadoop-env.sh:


export JAVA_HOME=$(/usr/libexec/java_home)

I chose to run in pseudo-distributed mode as it somewhat mirrors a cluster setup (a cluster of 1). For that you will need to modify three files in the Hadoop config directory $HADOOP_HOME/conf.

*NOTE: This caused a race condition on my machine that ran the processor at 90%; if this occurs for you, I suggest using the non-distributed (standalone) approach instead.


core-site.xml
======================

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<!-- Put site-specific property overrides in this file. -->


<configuration>
   <property>
     <name>fs.default.name</name>
     <value>hdfs://localhost:9000</value>
   </property>
</configuration>


mapred-site.xml
=======================

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<!-- Put site-specific property overrides in this file. -->


<configuration>
  <property>
     <name>mapred.job.tracker</name>
     <value>localhost:9001</value>
  </property>
</configuration>



hdfs-site.xml
========================

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<!-- Put site-specific property overrides in this file. -->


<configuration>
   <property>
      <name>dfs.replication</name>
       <value>1</value>
   </property>


</configuration>



Make sure both the masters and slaves files list localhost as the only value.

You will need password-less ssh configured for your "nodes" to talk to one another. Since the only node is the same host, we enable it here only. Open System Preferences and enable Remote Login under Sharing. I am running all this as myself for dev purposes, but normally there is a hadoop user as previously mentioned; if you went that route, set up ssh for that user instead.
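If you prefer the Terminal, the same Remote Login toggle can be flipped with the stock OS X systemsetup tool (not Hadoop-specific, just a convenience):

$ sudo systemsetup -setremotelogin on
$ sudo systemsetup -getremotelogin    # should report: Remote Login: On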



With that done, create the keys for your user in the user's home dir and add the public key to authorized_keys as follows.


$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Generating public/private dsa key pair.
Your identification has been saved in /Users/cwyse/.ssh/id_dsa.
Your public key has been saved in /Users/cwyse/.ssh/id_dsa.pub.
The key fingerprint is:
c6:60:39:90:d7:d4:08:43:7e:d0:f1:5e:00:e4:1d:f5 cwyse@Chris-Wyses-MacBook-Pro-2.local
The key's randomart image is:
+--[ DSA 1024]----+
|    .o=*==o..    |
|    .o.=+o.o .   |
|     .* o o . E  |
|     . = . .     |
|        S .      |
|       .         |
|                 |
|                 |
|                 |
+-----------------+

$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys



Now when you run $ ssh localhost you will be connected without being prompted for a password.
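If you want to verify that without risking an interactive prompt, BatchMode makes ssh fail rather than ask for a password (just a convenience check):

$ ssh -o BatchMode=yes localhost true && echo "password-less ssh OK"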

Next, format the namenode using the following command from $HADOOP_HOME:

$ $HADOOP_HOME/bin/hadoop namenode -format


You'll see some positive output listing your storage directory followed by a shutdown message; mine was:


Storage directory /tmp/hadoop-cwyse/dfs/name  
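Note that the default storage directory lives under /tmp, which the OS may clear on reboot (taking HDFS with it). If that bothers you, a property like the one below in core-site.xml relocates the data; the path here is only an example:

<property>
   <name>hadoop.tmp.dir</name>
   <value>/usr/local/hadoop-data</value>
</property>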


Now run:

$ $HADOOP_HOME/bin/start-all.sh


Some more happy output follows; then go to http://localhost:50070/ and you will see the NameNode status page.
Congrats, we are halfway there!
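Before moving on, it's worth confirming that all five Hadoop daemons came up, using jps (which ships with the JDK). For pseudo-distributed 0.20.2 I'd expect output along these lines (your pids will differ):

$ jps
12001 NameNode
12084 DataNode
12171 SecondaryNameNode
12254 JobTracker
12338 TaskTracker
12412 Jps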


Now you will need ZooKeeper, version 3.3.0 or newer.




Installing ZooKeeper out of the box is easy. I'm sure there are a ton of configuration options in both Hadoop and ZooKeeper, but this is just to get up and running with an Accumulo shell so that you can begin basic development on a cluster-like setup.


In $ZOOKEEPER_HOME/conf there is a sample config file (zoo_sample.cfg); I copied it to zoo.cfg and ran with that, no changes.
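For reference, that copy is:

$ cp $ZOOKEEPER_HOME/conf/zoo_sample.cfg $ZOOKEEPER_HOME/conf/zoo.cfg

and the stock sample settings look roughly like this (note that dataDir also defaults to /tmp):

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/tmp/zookeeper
clientPort=2181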


Start the service and quickly connect to verify it is up, then exit the client.

$ZOOKEEPER_HOME/bin/zkServer.sh start
$ZOOKEEPER_HOME/bin/zkCli.sh -server 127.0.0.1:2181

ZooKeeper only needs to run in standalone mode.
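Once zkCli connects, a quick sanity check with standard ZooKeeper shell commands looks like this; a fresh install shows only the internal /zookeeper node:

[zk: 127.0.0.1:2181(CONNECTED) 0] ls /
[zookeeper]
[zk: 127.0.0.1:2181(CONNECTED) 1] quit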

The Accumulo setup itself is straight out of the README.
Create $ACCUMULO_HOME/conf/accumulo-env.sh by copying the *.example file and changing the env variables to your JAVA_HOME and the locations you installed Hadoop and ZooKeeper in. Copy the accumulo-site.xml, masters, and slaves examples into place as well. Make sure Hadoop and ZooKeeper are both running as described above.
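As a sketch, those copies and edits amount to the following; the .example names and the install paths are assumptions, so match them to what your conf directory and filesystem actually contain:

$ cd $ACCUMULO_HOME/conf
$ cp accumulo-env.sh.example accumulo-env.sh
$ cp accumulo-site.xml.example accumulo-site.xml
$ cp masters.example masters
$ cp slaves.example slaves

Then in accumulo-env.sh set:

export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HOME=/usr/local/hadoop        # wherever you put Hadoop
export ZOOKEEPER_HOME=/usr/local/zookeeper  # wherever you put ZooKeeper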

Run

$ $ACCUMULO_HOME/bin/accumulo init

to initialize the Accumulo HDFS structure and set up the instance name and credentials. Then start everything; output on startup should look like this:

$ $ACCUMULO_HOME/bin/start-all.sh
Starting tablet servers and loggers .... done
Starting tablet server on localhost
Starting logger on localhost
Starting master on localhost
Starting garbage collector on localhost
Starting monitor on localhost
Starting tracer on localhost
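From here a quick smoke test in the Accumulo shell might look like this; createtable, insert, and scan are standard shell commands, while the instance name and root password are whatever you chose during init:

$ $ACCUMULO_HOME/bin/accumulo shell -u root
Password: ******
root@myinstance> createtable test
root@myinstance test> insert row1 cf cq value1
root@myinstance test> scan
row1 cf:cq []    value1
root@myinstance test> exit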
I've yet to play with this much beyond writing this out, so please let me know if there is something amiss.

Credit really goes to the Accumulo, Hadoop, and ZooKeeper documenters, as well as Chuck Lam's excellent "Hadoop in Action" from Manning.


2 comments:

  1. Awesome post, thanks! Was able to follow it easily and had Accumulo up and running in 30 minutes.

    The only thing is trying to find a distribution of Hadoop 0.20.2. Ended up getting it from the Cloudera site. Probably could have used another version but not sure how far I could go.


  2. Ran into issues when running $ACCUMULO_HOME/bin/accumulo init.

    Had an error:
    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/accumulo/start/Platform
    Caused by: java.lang.ClassNotFoundException: org.apache.accumulo.start.Platform
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/accumulo/start/Main
    Caused by: java.lang.ClassNotFoundException: org.apache.accumulo.start.Main
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

    Not sure if you have encountered this or not.
