Mahout is an open source machine learning library from Apache. At the moment, it primarily implements recommender engines (collaborative filtering), clustering, and classification algorithms. It’s also scalable across machines. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine.
Several approaches to machine learning can be used to solve problems with Apache Mahout; supervised and unsupervised learning are the main ones it supports.
Supervised learning is tasked with learning a function from labeled training data in order to predict the value of any valid input. A common example of supervised learning is classifying e-mail messages as spam.
Unsupervised learning is tasked with making sense of data without any examples of what is correct or incorrect. It is most commonly used for clustering similar input into logical groups.
This post explains how to set up Apache Mahout (CDH3 distribution). It's fairly simple to install Mahout with the CDH3 distribution. To set up Apache Mahout, follow the steps below:
Installation
As mentioned in the overview, Mahout is an open source scalable machine learning library. It is recommended that Mahout runs on top of Hadoop when processing large amounts of data. It is sufficient to install Mahout only on the Hadoop master node.
Before installing Mahout, ensure that Hadoop is installed in one of its supported modes. For the purposes of this post, Hadoop should be installed in cluster mode. Proceed with these commands in the same order -
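The command listing seems to have been lost from the original post; a minimal sketch of a source build into /opt/mahout (the $MAHOUT_HOME used below) might look like the following. The SVN URL and target directory are assumptions based on the era's Mahout project layout, not commands confirmed by the original.

```shell
# Assumption: building Mahout from source into /opt/mahout.
# The exact source URL/tag depends on the Mahout release bundled with CDH3.
cd /opt
sudo svn checkout http://svn.apache.org/repos/asf/mahout/trunk mahout
sudo chown -R $USER /opt/mahout
cd /opt/mahout

# Build and install the Mahout core and examples
mvn install
```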
P.S.: Sometimes a few tests fail while building Mahout from source. In such cases, skip them: user1@ubuntu-server:~$ mvn -DskipTests install
Edit .bash_profile to add entries for $MAHOUT_HOME and $HADOOP_CONF_DIR, and update $PATH:
- export HADOOP_CONF_DIR=$HADOOP_HOME/conf
- export MAHOUT_HOME=/opt/mahout
- export PATH=$PATH:$MAHOUT_HOME/bin
Log out and log back in for the changes to take effect. After a successful login, typing echo $MAHOUT_HOME should print /opt/mahout on the console.
That’s it! Mahout is installed successfully. Let’s check it with a clustering algorithm sample:
Apache Mahout Clustering Example
Clustering
These techniques attempt to group a large number of things together into clusters that share some similarity. It’s a way to discover hierarchy and order in a large or hard-to-understand data set, and in that way reveal interesting patterns or make the data set easier to comprehend. As an example, Google News groups news articles by topic using clustering techniques, in order to present news grouped by logical story, rather than presenting a raw listing of all articles.
A simple clustering example
This example demonstrates clustering of control charts that exhibit time-series behavior. The control charts need to be clustered into close-knit groups. The data set contains six different classes (Normal, Cyclic, Increasing trend, Decreasing trend, Upward shift, Downward shift). Given these trends in the input data set, the Mahout clustering algorithm will cluster the data into the corresponding class buckets.
Follow these steps -
- If you have not started Hadoop yet, now is the right time.
- Download the input data set
- Copy the downloaded data set into HDFS
(P.S.: the HDFS input directory must be named “testdata”)
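The first three steps might look like the following sketch. The data set URL is an assumption: the UCI Machine Learning Repository copy of the synthetic control chart data is the input customarily used with this Mahout example.

```shell
# Start the Hadoop daemons if they are not already running
$HADOOP_HOME/bin/start-all.sh

# Download the synthetic control chart data set
# (assumed URL: the UCI Machine Learning Repository copy)
wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data

# The example job expects its HDFS input directory to be named "testdata"
hadoop fs -mkdir testdata
hadoop fs -put synthetic_control.data testdata/
```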
- Mahout’s mahout-examples-$MAHOUT_VERSION.job does the actual clustering work, so it needs to be built. (P.S.: if the job file is already present under the $MAHOUT_HOME/examples/target/ folder, skip the following 2 commands.)
The job file will be generated in $MAHOUT_HOME/examples/target/ and its name will contain the $MAHOUT_VERSION number.
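The two build commands referred to above are presumably along these lines; -DskipTests is optional but avoids the flaky tests mentioned earlier:

```shell
cd $MAHOUT_HOME
mvn clean install -DskipTests

# Verify that the examples job file was produced
ls examples/target/mahout-examples-*.job
```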
- Perform clustering
- For canopy clustering
- For k-means clustering
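With the job file built and the data in HDFS, the clustering runs can be launched through the mahout driver script. The class names below come from the mahout-examples synthetic control package; confirm them against your Mahout version before running.

```shell
# Canopy clustering over HDFS "testdata"; results are written to HDFS "output"
mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job

# k-means clustering (writes to the same "output" directory)
mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
```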
The clustering output is in SequenceFile format which is not human readable. Hence, we need to use the clusterdump utility provided by Mahout.
- Copy the cluster output from HDFS onto your local file system
- Run clusterdump to convert output into human readable form
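A sketch of those two steps, assuming a k-means run whose final iteration wrote clusters-10 (the exact directory name depends on how many iterations actually ran). Note that the clusterdump option names changed across Mahout releases: older versions take --seqFileDir, while 0.7 and later renamed it to --input.

```shell
# Copy the clustering output from HDFS to the local file system
hadoop fs -get output $MAHOUT_HOME/output

# Convert the SequenceFile output into human readable text
mahout clusterdump \
    --seqFileDir $MAHOUT_HOME/output/clusters-10 \
    --pointsDir $MAHOUT_HOME/output/clusteredPoints \
    --output $HOME/clusteranalyze.txt
```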
“clusteranalyze.txt” is the human readable output. Go have a look!
Points to remember
- Make sure that Hadoop is installed successfully and that the daemons are running before running Mahout jobs
- The HDFS output directory is cleared when a new run starts so the results must be retrieved before a new run
- Computed clusters are contained in output/clusters-i
- All result clustered points are placed into output/clusteredPoints