Quick Overview: Apache Hadoop
Hadoop, an open-source Apache project, is a framework for processing data in a distributed environment using a simple programming model called MapReduce. It is also scalable and fault-tolerant. In the realm of Big Data, Hadoop falls primarily into the distributed processing category, but it also has a powerful storage capability. The core components of Hadoop are:
1. Hadoop YARN – A resource management and job scheduling system that allocates resources across a cluster of machines.
2. Hadoop MapReduce – A programming model for distributed, parallel processing of large data sets. Input is split into small chunks and fed to mappers, which extract the targeted data as key-value pairs; reducers then aggregate the mapper output into results. MapReduce jobs written to process megabytes of data can scale to terabytes of data and beyond.
3. Hadoop Distributed File System (HDFS) – A self-healing, high-bandwidth clustered file store that is optimized for high-throughput access to data. It can store any type of data, structured or complex, from any number of sources, in its original format.
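The mapper and reducer roles described above can be illustrated with a small word count. This is only a single-process sketch in plain Python, not Hadoop's API: the function names (`map_words`, `shuffle`, `reduce_counts`, `run_job`) and the sample chunks are hypothetical, and the grouping step that a real cluster performs across machines is simulated here with a dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_words(chunk):
    """Mapper: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reducer: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(chunks):
    """Run map, shuffle, and reduce over independent chunks (illustrative only)."""
    mapped = chain.from_iterable(map_words(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Each string stands in for one chunk of a much larger input file.
chunks = ["big data big cluster", "data node data"]
word_counts = run_job(chunks)
print(word_counts)
```

Because each mapper call sees only its own chunk, the same two functions would work unchanged whether the input is split into two chunks or two million, which is the property that lets real MapReduce jobs scale.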
Common Use Cases
– Search – Hadoop is used in e-commerce to analyze and index data in order to deliver more relevant and useful search results to customers.
– Flexible Analytics – Since data stored within a Hadoop system can have very high fidelity, analysts have more flexibility than ever to go back and ask new questions, look for new relationships in data and use techniques like machine learning to gain new insights.
– Point-of-Sale Transaction Analysis – Retailers can use large quantities of recent and historical sales data from PoS systems combined with other data about the customers obtained in stores and online to forecast demand and increase sales.
– Threat Analysis – Online businesses use Hadoop to detect threats and fraudulent activity. To do this, companies capture, store, and analyze the content and patterns of messages that flow through the network to distinguish legitimate transactions from fraudulent ones.
See Cloudera’s white paper for more use cases.
– High Availability – MapReduce, a YARN-based system, provides efficient load balancing. Tasks run and fail independently of one another, and failed tasks are restarted automatically.
– Scalability of Storage/Compute – Using the MapReduce model, applications can scale from a single node to hundreds of nodes without having to re-architect the system. Scalability is built into the model because data is chunked and distributed as independent units of computation.
– Controlling Cost – Adding or retiring nodes based on the evolution of storage and analytic requirements is easy with Hadoop. You don’t have to commit to more storage or processing power ahead of time and can scale only when required, thus controlling your costs.
– Agility and Innovation – Since data is stored in its original format and there is no predefined schema, it is easy to apply new and evolving analytic techniques to this data using MapReduce.
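The last point, storing data in its original format and deciding on a schema only when a question is asked, can be sketched as follows. This is a plain-Python illustration under stated assumptions, not a Hadoop or HDFS API: the sample records, the field names, and the helper `read_with_schema` are all hypothetical.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (no schema enforced on write).
RAW_EVENTS = [
    '{"user": "a", "amount": "19.99", "channel": "web"}',
    '{"user": "b", "amount": "5.00"}',
    '{"user": "a", "amount": "7.50", "channel": "store"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select fields, convert types, use None for missing fields."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }


# First question: total spend per user.
spend = {}
for row in read_with_schema(RAW_EVENTS, {"user": str, "amount": float}):
    spend[row["user"]] = spend.get(row["user"], 0.0) + row["amount"]

# A later question over the same untouched data: number of events per channel.
channels = {}
for row in read_with_schema(RAW_EVENTS, {"channel": str}):
    key = row["channel"] or "unknown"
    channels[key] = channels.get(key, 0) + 1

print(spend)
print(channels)
```

The raw records are never rewritten; each new question simply brings its own schema, so a field that nobody planned for at load time (here, `channel`) is still available later.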
Companies Using Hadoop
– Skybox Imaging – Uses Hadoop to process high-definition images obtained from satellites to identify changes on the surface of the Earth, monitor global infrastructure, etc.
– Zions Bancorporation – Uses Hadoop to detect criminal behavior by analyzing massive amounts of data in web logs, server logs, and customer transactions.
– Opower – Uses Hadoop to find ways their customers can save money on home energy consumption.
– New York Times – Used Hadoop on Amazon EC2 to convert scanned images of over 11 million articles from 1851 to 1922 into PDFs. Each article is composed of numerous small images that needed to be scaled, “glued” together, and then converted to PDF.
See these articles in GIGAOM and Apache PoweredBy for more examples of companies that are using Hadoop.
– The variety, volume, and velocity of the data that companies must capture, analyze, and store keep growing.
– Hadoop can handle all three effectively.
– MapReduce, the most widely used component of Hadoop, makes it easy to build a scalable, highly available, extremely resilient, and yet cost-effective system with its parallel, distributed programming model.
– MapReduce also lets developers focus on addressing business needs rather than handling low-level details of distributed applications.
– The HDFS component of Hadoop stores the data in its native format without forcing any transformation.
– Data analysts can decide when and how to do these transformations depending on the kind of questions they are trying to answer at that point. Access to the highest fidelity data can boost innovation.
– It is worth noting that MapReduce is optimized for batch processing of data, not for real-time processing. Tools like Twitter Storm should be considered for real-time processing. However, that’s a topic for another blog post, which we will publish in the near future.