Technology | Nov 19, 2013

Mahout Explained in 5 Minutes or Less

Josh Gertzen

Among big data tools, Apache Mahout is a machine-learning engine that falls into the data mining category. It is one of the more interesting tools in the big data toolbox because it lets you extract actionable results from a large data set. What do we mean by actionable results? Things such as purchase recommendations based on similar customers’ buying habits, or deciding whether a user comment is spam based on the word clusters it contains.

Common Use Cases

– Recommendations – Analyze user behavior and find items the user has a high probability of being interested in. This kind of application is often used to provide purchase recommendations on ecommerce sites.

– Automatic Document Classification – Based on how existing documents have been categorized, look at new documents and determine the best categories for them. A typical use is to auto-organize new content or to flag potential spam comments.

– Content Clustering – Group documents, web pages, and articles based on the topics they contain and the documents they relate to. The most common use is in search engines, which cluster pages based on keywords, page links, etc.

– Item Sets – Look at a set of items and identify which items usually appear together. Similar to recommendations, but not directly behavior based. A common use case is the “Did you forget this?” feature on ecommerce sites.
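To make the "Item Sets" idea concrete, here is a minimal sketch in plain Python. It simply counts how often pairs of items share a basket and then suggests frequent companions that are missing from a cart. This is an illustration only: it does not use Mahout, it is far simpler than the frequent-pattern mining Mahout provides, and the function names, the `min_support` threshold, and the sample baskets are all made up for the example.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how often each unordered pair of items shares a basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def did_you_forget(cart, counts, min_support=2):
    """Suggest items that often appear with the cart's items but are not in it."""
    scores = Counter()
    for (a, b), n in counts.items():
        if n < min_support:
            continue  # ignore pairs seen too rarely to be meaningful
        if a in cart and b not in cart:
            scores[b] += n
        elif b in cart and a not in cart:
            scores[a] += n
    return [item for item, _ in scores.most_common()]


# Hypothetical purchase history, one list per order
baskets = [
    ["chips", "salsa", "soda"],
    ["chips", "salsa"],
    ["chips", "soda"],
    ["bread", "butter"],
    ["bread", "butter", "jam"],
]

counts = pair_counts(baskets)
suggestions = did_you_forget({"chips"}, counts)
print(suggestions)
```

Note that the suggestions come from what appears together in orders, not from any individual user's behavior, which is the distinction the Item Sets use case draws against recommendations.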

Key Features

– Proven Algorithms Included – The included algorithms address common problems encountered in many industries. Generally you don’t need a statistician or data scientist to get highly useful results.

– Scalable to Large Data Sets – Designed to distribute across large data center clusters that run Apache Hadoop and apply the map/reduce paradigm. For smaller use cases, the foundation is highly optimized to allow for solid performance in non-distributed environments.

– Active & Open Community – The community is vibrant, responsive, and diverse, so there are often discussions in the community forums that address your problem space. Additionally, the platform is distributed under the commercially friendly Apache Software License.
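For readers new to the map/reduce paradigm mentioned in the scalability point above, the following single-process Python sketch shows the shape of such a job: a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step combines each group. It is a toy illustration, not Hadoop or Mahout code; on a real cluster the framework runs the map and reduce steps in parallel across machines. All names and the sample records are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (item, 1) for every item a user interacted with."""
    _user, items = record
    for item in items:
        yield item, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single total."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence within one process."""
    emitted = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(emitted)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical interaction log: (user, items viewed)
records = [
    ("alice", ["book", "lamp"]),
    ("bob", ["book"]),
    ("carol", ["lamp", "desk", "book"]),
]

totals = run_job(records)
print(totals)
```

Because each record is mapped independently and each key is reduced independently, the work can be split across many machines, which is what lets this style of job scale to large data sets.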

Companies Using Mahout

– Yahoo! Mail – Uses Mahout’s Frequent Pattern Set Mining to detect spam patterns and filter unwanted email.

– LinkedIn – Has recently started experimenting with Mahout for model training and is evaluating broader deployments.

– NAVTEQ Media Solutions – Uses Mahout to process user interactions with advertisements to optimize ad selection.

– NewsCred – Uses Mahout to generate clusters of news articles and to surface the important stories of the day.

Take Away

– Each day companies collect huge amounts of data about their internal systems, employees, customers, and generally every aspect of their business.

– Most of this data is simply collected and stored, and is rarely used.

– Mahout comes out of the box with dozens of recommendation, clustering, and classification routines that can extract insights from big data sets.

– Use cases for Mahout continue to grow, and with a little effort its capabilities can be applied to many big data analytics problems.

– Any business or organization can start taking advantage of its benefits thanks to its well-thought-out, approachable design.