Cloud, Big Data and Mobile: Configuring Apache Mahout and a Clustering Example

Mahout is an open source machine learning library from Apache. At the moment, it primarily implements recommender engines (collaborative filtering), clustering, and classification algorithms. It’s also scalable across machines. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine.

Mahout supports several approaches to machine learning. The two main ones are supervised and unsupervised learning.

Supervised learning is tasked with learning a function from labeled training data in order to predict the value of any valid input. A common example is classifying e-mail messages as spam or not spam.
Unsupervised learning is tasked with making sense of data without any examples of what is correct or incorrect. It is most commonly used for clustering similar input into logical groups.

This post explains how to set up Apache Mahout (CDH3 distribution). It’s pretty simple to install Mahout using the CDH3 distribution. To set up Apache Mahout, follow the steps below:

Installation

As mentioned in the overview, Mahout is an open source scalable machine learning library. It is recommended that Mahout runs on top of Hadoop when processing large amounts of data. It is sufficient to install Mahout only on the Hadoop master node.

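Before installing Mahout, ensure that Hadoop is installed in one of its modes (standalone, pseudo-distributed, or cluster). For the purposes of this post, install Hadoop in cluster mode. Then run these commands in the order shown (installing packages and writing to /opt usually require root privileges, so prefix the commands with sudo where needed):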

user1@ubuntu-server:~$ apt-get install maven2

user1@ubuntu-server:~$ cd /opt

user1@ubuntu-server:~$ svn co http://svn.apache.org/repos/asf/mahout/trunk

user1@ubuntu-server:~$ mv trunk mahout_trunk

user1@ubuntu-server:~$ ln -s mahout_trunk/ mahout

user1@ubuntu-server:~$ cd mahout

user1@ubuntu-server:~$ mvn install

P.S.: Some tests occasionally fail while building Mahout from source. In that case, skip the tests:

user1@ubuntu-server:~$ mvn -DskipTests install

Edit .bash_profile to add entries for $MAHOUT_HOME and $HADOOP_CONF_DIR, and to update $PATH:

vim ~/.bash_profile

export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export MAHOUT_HOME=/opt/mahout
export PATH=$PATH:$MAHOUT_HOME/bin

Log out and log back in for the changes to take effect. After logging in, typing echo $MAHOUT_HOME should print /opt/mahout on the console.
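As a quick check of the profile changes, here is a minimal sketch that reports whether each variable is set. The helper name check_env_var is made up for this post and is not part of Mahout or Hadoop.

```shell
#!/bin/sh
# check_env_var NAME: report whether the variable called NAME is set and non-empty.
# (Hypothetical helper, not a Mahout or Hadoop command.)
check_env_var() {
    name=$1
    # Indirect lookup via eval keeps this portable across POSIX shells.
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
        echo "$name=$value"
    else
        echo "$name is not set"
        return 1
    fi
}

# Report on the variables this post relies on.
for var in HADOOP_HOME HADOOP_CONF_DIR MAHOUT_HOME; do
    check_env_var "$var" || true
done
```

If any line says "is not set", the profile was probably not re-read; logging in again or running `. ~/.bash_profile` in the current shell should pick it up.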

That’s it! Mahout is now installed. Let’s check it with a sample clustering algorithm:

Apache Mahout Clustering Example

Clustering

These techniques attempt to group a large number of things together into clusters that share some similarity. It’s a way to discover hierarchy and order in a large or hard-to-understand data set, and in that way reveal interesting patterns or make the data set easier to comprehend. As an example, Google News groups news articles by topic using clustering techniques, in order to present news grouped by logical story, rather than presenting a raw listing of all articles.

A simple clustering example

This example demonstrates clustering of control charts, each of which is a time series. The goal is to group the charts into close-knit clusters. The data set contains six classes (Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward shift). Given these trends in the input data set, the Mahout clustering algorithm should group the series into clusters that correspond to their classes.

Follow these steps:

If you have not started Hadoop yet, now is the right time.

user1@ubuntu-server:~$ $HADOOP_HOME/bin/start-all.sh

Download the input data set

user1@ubuntu-server:~$ wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
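Before copying the file anywhere, it can help to confirm the download is not truncated or an HTML error page. The sketch below counts rows and values per row with awk; describe_dataset is a made-up helper name, and it assumes the file holds one whitespace-separated series per line.

```shell
# describe_dataset FILE: print the number of non-empty rows and the smallest and
# largest count of whitespace-separated values found on any row.
# (Hypothetical helper; assumes one series per line.)
describe_dataset() {
    awk '
        NF > 0 {
            rows++
            if (min == 0 || NF < min) min = NF
            if (NF > max) max = NF
        }
        END { printf "rows=%d min_cols=%d max_cols=%d\n", rows, min, max }
    ' "$1"
}
```

Run it as `describe_dataset synthetic_control.data`. If the file matches the data set’s published description (600 series of 60 values each), min_cols and max_cols should both be 60 and rows should be 600; anything else suggests a bad download.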

Copy the downloaded data set into HDFS

user1@ubuntu-server:~$ $HADOOP_HOME/bin/hadoop fs -mkdir testdata

user1@ubuntu-server:~$ $HADOOP_HOME/bin/hadoop fs -put /PATH/TO/synthetic_control.data testdata

(P.S.: The HDFS input directory must be named “testdata”, since that is where the example jobs look for their input.)

The actual clustering is performed by Mahout’s mahout-examples-$MAHOUT_VERSION.job file, so it needs to be built first. (P.S.: If the job file is already present in the $MAHOUT_HOME/examples/target/ folder, skip the following two commands.)

user1@ubuntu-server:~$ cd $MAHOUT_HOME

user1@ubuntu-server:~$ mvn clean install

The job file will be generated in $MAHOUT_HOME/examples/target/ and its name will contain the $MAHOUT_VERSION number.
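To decide whether the build step can be skipped, a small sketch like the following can look for the job file. find_job_file is a made-up helper name. The .job suffix follows the naming used in this post; as an assumption, the sketch also checks for a name ending in -job.jar, which some Mahout versions appear to use instead.

```shell
# find_job_file DIR: print the path of the first Mahout examples job file in DIR,
# or return non-zero if none is found.
# (Hypothetical helper. The "-job.jar" pattern is an assumption about other
# Mahout versions; the ".job" pattern matches the naming used in this post.)
find_job_file() {
    for candidate in "$1"/mahout-examples-*.job "$1"/mahout-examples-*-job.jar; do
        [ -f "$candidate" ] || continue
        echo "$candidate"
        return 0
    done
    return 1
}

# Example: rebuild only when the job file is missing.
# if ! find_job_file "$MAHOUT_HOME/examples/target"; then
#     (cd "$MAHOUT_HOME" && mvn clean install)
# fi
```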

Perform clustering

For canopy clustering

user1@ubuntu-server:~$ $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job

For k-means clustering

user1@ubuntu-server:~$ $MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

The clustering output is in SequenceFile format which is not human readable. Hence, we need to use the clusterdump utility provided by Mahout.

Copy the cluster output from HDFS onto your local file system

user1@ubuntu-server:~$ $HADOOP_HOME/bin/hadoop fs -get output $MAHOUT_HOME/examples

Run clusterdump to convert the output into human-readable form

user1@ubuntu-server:~$ $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-10 --pointsDir output/clusteredPoints --output $MAHOUT_HOME/examples/output/clusteranalyze.txt

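“clusteranalyze.txt” is the human-readable output. Go have a look! (On newer Mahout builds the --seqFileDir flag may be rejected in favor of --input, as a reader reports in the comments below for 0.7-SNAPSHOT.)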

Points to remember

Make sure that Hadoop is installed successfully and that the daemons are running before running Mahout jobs
The HDFS output directory is cleared when a new run starts, so the results must be retrieved before starting another run
Computed clusters are contained in output/clusters-i, where i is the iteration number
All result clustered points are placed into output/clusteredPoints
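Since each new run clears the HDFS output directory, one way to avoid losing results is to copy them to a labeled local folder after every run. The sketch below wraps the same hadoop fs -get command used earlier; save_run_output, HADOOP_CMD, and the ~/mahout-runs default are names invented for this example.

```shell
# Command used to talk to HDFS. HADOOP_CMD is a made-up override so the
# function can be tried without a cluster.
HADOOP_CMD=${HADOOP_CMD:-${HADOOP_HOME:-}/bin/hadoop}

# save_run_output LABEL [DEST_ROOT]: copy the HDFS "output" directory to
# DEST_ROOT/LABEL on the local file system and print the destination path.
# (Hypothetical helper; DEST_ROOT defaults to ~/mahout-runs.)
save_run_output() {
    label=$1
    dest_root=${2:-$HOME/mahout-runs}
    dest="$dest_root/$label"
    # Never overwrite an earlier run's results.
    if [ -e "$dest" ]; then
        echo "refusing to overwrite $dest" >&2
        return 1
    fi
    mkdir -p "$dest_root" || return 1
    "$HADOOP_CMD" fs -get output "$dest" || return 1
    echo "$dest"
}
```

For example, calling `save_run_output canopy-run1` after the canopy job and `save_run_output kmeans-run1` after the k-means job keeps both result sets side by side.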

Links:

https://cwiki.apache.org/confluence/display/MAHOUT/BuildingMahout

https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data

https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper

2 comments:

Anonymous said...: I keep getting a weird error that --seqfiledir is not a supported flag, and instead to use --input. I'm using mahout-0.7-snapshot.

Any ideas?

Thanks,
Ravish; May 4, 2012 at 12:39 AM
Sandip said...: Thanks for giving information about setting mahout path in profile file I stuck there nearly three days; March 2, 2015 at 4:34 AM

Tuesday, February 21, 2012
