Friday, February 17, 2012

Machine Learning "The Brain of Hadoop"

Web, social, and mobile applications are growing at such a lightning pace that new frameworks such as Apache Hadoop have had to be designed specifically to process large volumes of data reliably and cost-effectively.

Increasingly, the success of companies in the information age depends on how quickly and efficiently they turn vast amounts of data into actionable information. Whether it’s for processing hundreds or thousands of personal e-mail messages a day or recommending the right products to buy, the need for tools that can organize and enhance data has never been greater. Therein lies the premise and the promise of the field of machine learning.
Machine learning is used in a variety of use cases, from game playing to fraud detection to stock-market analysis to e-commerce predictions. Some of them are listed below:
Netflix and Amazon use it to recommend products to users based on their past purchases, and it powers systems that find all of the similar news articles on a given day. Google uses it to categorize Web pages automatically according to genre (sports, economy, war, and so on), and Gmail uses it to mark e-mail messages as spam. Target and Walmart use it to predict customer buying behavior and to send customers the right discounts and coupons.
Now the question is: how are such recommendations and predictions made in a Big Data context?

Welcome Apache Mahout 

Mahout is an open source machine learning library from Apache. At the moment, it primarily implements recommender engines (collaborative filtering), clustering, and classification algorithms. It’s also scalable across machines. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine.

Apache Mahout can be used to solve problems with several approaches to machine learning. Supervised and unsupervised learning are the main ones it supports.

Supervised learning is tasked with learning a function from labeled training data in order to predict the value of any valid input. Common examples of supervised learning include classifying e-mail messages as spam. 
Unsupervised learning is tasked with making sense of data without any examples of what is correct or incorrect. It is most commonly used for clustering similar input into logical groups.
Let us briefly look at the primary machine learning themes in Mahout:

Recommender Engines
They are the most immediately recognizable machine learning technique in use today. You’ll have seen services or sites that attempt to recommend books or movies or articles based on your past actions. They try to infer tastes and preferences and identify unknown items that are of interest. As an example, Amazon is perhaps the most famous e-commerce site to deploy recommendations. Based on purchases and site activity, Amazon recommends books and other items likely to be of interest.
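To make the idea concrete, here is a minimal sketch of user-based collaborative filtering in plain Python. It is not Mahout's API and not how Amazon does it; the users, items, ratings, and the function names `cosine_similarity` and `recommend` are all made up for illustration. The sketch finds users whose ratings look like yours and suggests items they liked that you have not seen yet.

```python
from math import sqrt

# Hypothetical ratings (assumption): user -> {item: rating on a 1-5 scale}
ratings = {
    "alice": {"book_a": 5, "book_b": 4, "book_c": 1},
    "bob":   {"book_a": 5, "book_b": 5, "book_d": 4},
    "carol": {"book_a": 1, "book_c": 5, "book_d": 2, "book_e": 3},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=2):
    """Rank items the user has not rated by the similarity-weighted
    average rating given by the other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [item for _, item in ranked[:top_n]]


print(recommend("alice", ratings))  # prints ['book_d', 'book_e']
```

Here bob's tastes are much closer to alice's than carol's are, so bob's high rating of book_d counts for more than carol's low one. On real data sets the same idea has to be spread across many machines, which is the kind of scale Mahout targets.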

Clustering
These techniques attempt to group a large number of things together into clusters that share some similarity. It’s a way to discover hierarchy and order in a large or hard-to-understand data set, and in that way reveal interesting patterns or make the data set easier to comprehend. As a theoretical example, we may want to group all news about earthquakes, tsunamis, famines, floods, storms, and so on under Natural Calamities. As a real example, Google News groups news articles by topic using clustering techniques, in order to present news grouped by logical story, rather than presenting a raw listing of all articles.
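As a rough illustration of grouping similar things together, here is a small k-means sketch in plain Python. K-means is only one of many clustering techniques, this is not Mahout code, and the points are invented: think of each one as a tiny feature vector for an article (say, counts of two keywords). To keep the run repeatable, the sketch assumes the first k points serve as the starting centroids, where a real implementation would usually pick them more carefully.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on tuples of floats.

    Assumption for repeatability: the first k points are the initial centroids.
    """
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D feature vectors (assumption), e.g. two keyword counts per article
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

Note that nobody told the algorithm which group each point belongs to. It finds the two groups from the data alone, which is what makes clustering an unsupervised technique.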

Classification
These techniques decide how much a thing is or isn’t part of some type or category, or how much it does or doesn’t have some attribute. Classification, like clustering, is ubiquitous, but it’s even more behind the scenes. Often these systems learn by reviewing many instances of items in the categories in order to deduce classification rules. As an example, Yahoo! Mail decides whether or not incoming messages are spam based on prior emails and spam reports from users, as well as on characteristics of the email itself.
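The "learn from many labeled instances" idea can be sketched with a tiny naive Bayes text classifier in plain Python. This is an illustration only: it is not Mahout's API, it says nothing about how Yahoo! Mail actually works, and the training messages and the function names `train` and `classify` are made up. The classifier counts how often each word appears in spam and in non-spam ("ham") messages, then picks the label that makes a new message most probable.

```python
from collections import Counter
from math import log


def train(examples):
    """examples: list of (text, label) pairs.
    Returns per-label word counts and per-label document counts."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the highest log-probability for the text."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        # Start from the prior: how common this label is in the training data.
        score = log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled training messages (assumption)
examples = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
word_counts, doc_counts = train(examples)
print(classify("win a cash prize", word_counts, doc_counts))               # prints spam
print(classify("agenda for the monday meeting", word_counts, doc_counts))  # prints ham
```

Because the training messages come with labels, this is supervised learning, in contrast to clustering, where no labels are given.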

