Tuesday, February 21, 2012

Configuring Apache Mahout and a Clustering Example

Mahout is an open source machine learning library from Apache. At the moment, it primarily implements recommender engines (collaborative filtering), clustering, and classification algorithms. It is also scalable across machines: Mahout aims to be the machine learning tool of choice when the data set to be processed is very large, perhaps far too large for a single machine.

Apache Mahout supports several approaches to machine learning. Supervised and unsupervised learning are the two main ones.

Supervised learning is tasked with learning a function from labeled training data in order to predict the value of any valid input. A common example is classifying e-mail messages as spam or not spam.
Unsupervised learning is tasked with making sense of data without any examples of what is correct or incorrect. It is most commonly used for clustering similar input into logical groups.

This post explains how to set up Apache Mahout (CDH3 distribution). It is pretty simple; to set up Apache Mahout, follow the steps below.
As mentioned in the overview, Mahout is an open source, scalable machine learning library. It is recommended that Mahout run on top of Hadoop when processing large amounts of data. It is sufficient to install Mahout only on the Hadoop master node.
Before installing Mahout, ensure that Hadoop is installed in one of its modes (standalone, pseudo-distributed, or cluster). For the purposes of this blog, install Hadoop in cluster mode. Run these commands in the order shown:
user1@ubuntu-server:~$ apt-get install maven2
user1@ubuntu-server:~$ cd /opt
user1@ubuntu-server:~$ svn co http://svn.apache.org/repos/asf/mahout/trunk
user1@ubuntu-server:~$ mv trunk mahout_trunk
user1@ubuntu-server:~$ ln -s mahout_trunk/ mahout
user1@ubuntu-server:~$ cd mahout
user1@ubuntu-server:~$ mvn install
P.S.: Some tests occasionally fail while building Mahout from source. In that case, skip the tests:
user1@ubuntu-server:~$ mvn -DskipTests install
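The "try the full build, then fall back to skipping tests" advice can be wrapped in a small helper. This is only a sketch: build_mahout is a hypothetical function name (not part of Mahout or Maven), and it assumes mvn is on the PATH and that the checkout lives in the directory passed as the first argument.

```shell
# Sketch: build Mahout, retrying without tests only if the full build fails.
# build_mahout is a hypothetical helper name, not a Mahout or Maven command.
# The body runs in a subshell, so the caller's working directory is unchanged.
build_mahout() (
  src_dir="${1:-/opt/mahout}"   # checkout location used in this post
  cd "$src_dir" || exit 1
  if mvn install; then
    echo "built with tests"
  elif mvn -DskipTests install; then
    echo "built with tests skipped"
  else
    echo "build failed" >&2
    exit 1
  fi
)

# Usage (uncomment to run against a real checkout):
# build_mahout /opt/mahout
```

Skipping tests only hides the failures, so it is worth reading the first build's output before relying on the fallback.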
Edit .bash_profile to add entries for $MAHOUT_HOME and $HADOOP_CONF_DIR, and to update $PATH:
vim ~/.bash_profile
  • export MAHOUT_HOME=/opt/mahout
Log out and log back in for the changes to take effect. After logging in, typing echo $MAHOUT_HOME should print /opt/mahout on the console.
That’s it! Mahout is installed. Let’s check it with a sample clustering algorithm:
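For reference, here is one way the profile entries might be added without duplicating them on repeated runs. It is a sketch under assumptions: add_mahout_env is a hypothetical helper, the HADOOP_CONF_DIR value of $HADOOP_HOME/conf is a guess that depends on how Hadoop was installed, and appending $MAHOUT_HOME/bin to PATH is one possible reading of "change $PATH".

```shell
# Sketch: append Mahout-related exports to a profile file, only once.
# add_mahout_env is a hypothetical helper. The HADOOP_CONF_DIR value is an
# assumption; point it at wherever your Hadoop configuration actually lives.
add_mahout_env() {
  local profile="${1:-$HOME/.bash_profile}"
  local marker="# mahout environment"
  # Skip if a previous run already added the block.
  if [ -f "$profile" ] && grep -qF "$marker" "$profile"; then
    return 0
  fi
  cat >> "$profile" <<'EOF'
# mahout environment
export MAHOUT_HOME=/opt/mahout
export HADOOP_CONF_DIR="$HADOOP_HOME/conf"
export PATH="$PATH:$MAHOUT_HOME/bin"
EOF
}

# Usage (uncomment to modify your real profile):
# add_mahout_env ~/.bash_profile
```

Running `. ~/.bash_profile` in the current shell is an alternative to logging out and back in.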
Apache Mahout Clustering Example

Clustering techniques attempt to group a large number of things together into clusters that share some similarity. Clustering is a way to discover hierarchy and order in a large or hard-to-understand data set, and in that way reveal interesting patterns or make the data set easier to comprehend. As an example, Google News groups news articles by topic using clustering techniques, in order to present news grouped by logical story, rather than presenting a raw listing of all articles.
A simple clustering example
This example demonstrates clustering of control charts, each of which is a time series. The goal is to group the charts into close-knit clusters. The data set contains six different classes (Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward shift). Given these trends in the input data set, the Mahout clustering algorithms should place each chart in the bucket for its class.
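A quick sanity check on the input file can catch a truncated or wrong download before any job runs. The sketch below assumes the data file is plain text with whitespace-separated values, one chart per line; dataset_shape is a hypothetical helper name, not part of Mahout.

```shell
# Sketch: print "<rows> <min columns> <max columns>" for a text data file.
# dataset_shape is a hypothetical helper; blank lines are ignored.
dataset_shape() {
  awk 'NF > 0 {
         rows++
         if (min == "" || NF < min) min = NF   # fewest values seen on a line
         if (NF > max) max = NF                # most values seen on a line
       }
       END { print rows + 0, min + 0, max + 0 }' "$1"
}

# Usage (after downloading the data set):
# dataset_shape synthetic_control.data
```

The UCI description of this data set lists 600 charts of 60 values each, so a complete download would be expected to report 600 rows with 60 as both the minimum and maximum column count.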
Follow these steps -
  • If you have not started Hadoop yet, now is the right time.
user1@ubuntu-server:~$ $HADOOP_HOME/bin/start-all.sh
  • Download the input data set
user1@ubuntu-server:~$ wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
  •  Copy the downloaded data set into HDFS
user1@ubuntu-server:~$ $HADOOP_HOME/bin/hadoop fs -mkdir testdata
user1@ubuntu-server:~$ $HADOOP_HOME/bin/hadoop fs -put /PATH/TO/synthetic_control.data testdata
(P.S.: The example jobs read their input from an HDFS directory named “testdata”, so keep that name.)
  • The mahout-examples-$MAHOUT_VERSION.job file does the actual clustering, so it must be built first. (P.S.: If the job file is already present under the $MAHOUT_HOME/examples/target/ folder, skip the following two commands.)
user1@ubuntu-server:~$ cd $MAHOUT_HOME
user1@ubuntu-server:~$ mvn clean install
The job file will be generated in $MAHOUT_HOME/examples/target/ and its name will contain the $MAHOUT_VERSION number.
  • Perform clustering
user1@ubuntu-server:~$ $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
user1@ubuntu-server:~$ $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
The clustering output is in SequenceFile format, which is not human readable. To read it, use the clusterdump utility provided by Mahout.
  • Copy the cluster output from HDFS onto your local file system
user1@ubuntu-server:~$ $HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples
  • Run clusterdump to convert output into human readable form
user1@ubuntu-server:~$ $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-10 --pointsDir output/clusteredPoints --output $MAHOUT_HOME/examples/output/clusteranalyze.txt
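“clusteranalyze.txt” is the human readable output. Go have a look! (If clusterdump rejects --seqFileDir, as a commenter below reports for mahout-0.7-snapshot, the error message suggests using --input instead.)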
Points to remember
  • Make sure that Hadoop is installed successfully and that the daemons are running before running Mahout jobs
  • The HDFS output directory is cleared when a new run starts, so retrieve the results before starting a new run
  • Computed clusters are contained in output/clusters-i, where i is the iteration number
  • All result clustered points are placed into output/clusteredPoints
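Since the clusters-i directory name depends on how many iterations the run took, hard-coding a number such as clusters-10 can miss the final result. The sketch below picks the highest-numbered clusters-i directory from a local copy of the output. It assumes the directories are named exactly clusters-<number>; latest_clusters_dir is a hypothetical helper, not a Mahout utility.

```shell
# Sketch: print the highest-numbered clusters-<i> directory under a local
# copy of the output directory. latest_clusters_dir is a hypothetical helper.
latest_clusters_dir() {
  local out_dir="${1:-output}" best="" best_n=-1 d n
  for d in "$out_dir"/clusters-*; do
    [ -d "$d" ] || continue
    n="${d##*/clusters-}"          # text after the "clusters-" prefix
    case "$n" in
      ''|*[!0-9]*) continue ;;     # ignore names that are not purely numeric
    esac
    if [ "$n" -gt "$best_n" ]; then
      best_n="$n"
      best="$d"
    fi
  done
  # Print the match; return non-zero when no clusters-<i> directory exists.
  [ -n "$best" ] && printf '%s\n' "$best"
}

# Usage (after copying the output from HDFS):
# latest_clusters_dir "$MAHOUT_HOME/examples/output"
```

The numeric comparison matters here: a plain alphabetical sort would rank clusters-2 above clusters-10.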



Anonymous said...

I keep getting a weird error that --seqfiledir is not a supported flag, and instead to use --input. I'm using mahout-0.7-snapshot.

Any ideas?


Sandip said...

Thanks for the information about setting the Mahout path in the profile file. I was stuck there for nearly three days.
