
Technology · Jun 04, 2014

Apache HBase Explained in 5 Minutes or Less

Philip Shon

QUICK OVERVIEW

Apache HBase is a distributed NoSQL database that scales to store large amounts of sparse data. In the Big Data landscape it fits into the storage category, as an alternative or additional data store. It is a column-oriented, key-value store modeled after Google’s BigTable. Because it is built on top of Apache Hadoop, an HBase deployment inherits many of the advantages, and many of the complexities, of a Hadoop cluster deployment. It inherits Hadoop’s high-availability characteristics and integrates tightly with the Hadoop ecosystem, which includes tools like Apache Hive, Apache Pig, and Apache Sqoop.

While the Hadoop dependency provides these valuable benefits, it is not without downsides. Whereas other NoSQL databases, such as Cassandra and MongoDB, are designed and written as single standalone executables, HBase depends heavily on outside processes such as Apache ZooKeeper, the Hadoop Distributed File System (HDFS), and Hadoop’s MapReduce. These dependencies, each an open-source project maintained outside of HBase, can present challenges when troubleshooting an HBase cluster.
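HBase’s column-oriented, key-value layout can be pictured as a sorted, multi-dimensional map: row key → column family → column qualifier → timestamp → value. A minimal Python sketch of that logical model follows; the class and method names are illustrative stand-ins, not the HBase client API.

```python
import time

class SparseTable:
    """Toy model of HBase's logical layout: a map of
    row key -> column family -> qualifier -> {timestamp: value}.
    Cells that were never written simply do not exist, which is
    what makes the stored data sparse."""

    def __init__(self):
        self.rows = {}

    def put(self, row, family, qualifier, value, ts=None):
        # Every write is versioned by a timestamp, as in HBase.
        ts = ts if ts is not None else time.time_ns()
        cell = (self.rows.setdefault(row, {})
                         .setdefault(family, {})
                         .setdefault(qualifier, {}))
        cell[ts] = value

    def get(self, row, family, qualifier):
        # Return the most recent version of the cell, or None if absent.
        versions = self.rows.get(row, {}).get(family, {}).get(qualifier, {})
        if not versions:
            return None
        return versions[max(versions)]

t = SparseTable()
t.put("user#42", "info", "name", "Ada", ts=1)
t.put("user#42", "info", "name", "Ada L.", ts=2)  # newer version wins on read
print(t.get("user#42", "info", "name"))           # -> Ada L.
print(t.get("user#99", "info", "name"))           # -> None (sparse: no cell)
```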

COMMON USE CASES

–  Log Analytics – Thanks to the ecosystem developed by the Hadoop project, HBase comes with out-of-the-box integration with Hadoop MapReduce. HBase is also optimized for sequential reads and scans, which makes it well suited for batch analysis of log data.

–  Capturing Metrics – Organizations have turned to HBase to serve as their operational data store, capturing various real-time metrics from their applications and servers. HBase’s column-oriented data model makes HBase a natural fit for persisting these values.
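Both use cases lean on the fact that HBase keeps rows sorted by key, so a time-ordered row-key design lets a scan read one contiguous range instead of many random lookups. A hypothetical sketch, with a plain sorted list standing in for an HBase table and scan:

```python
import bisect

# Rows kept sorted by key, as HBase stores them. Keys embed a zero-padded
# timestamp so log lines / metric samples for one host and time window
# end up physically adjacent.
rows = sorted([
    ("web01#000900#cpu", 0.41),
    ("web01#000905#cpu", 0.48),
    ("web01#000910#cpu", 0.97),
    ("web02#000900#cpu", 0.12),
])

def scan(start_key, stop_key):
    """Sequential range scan: binary-search to the start key,
    then read rows in order until the stop key."""
    keys = [k for k, _ in rows]
    i = bisect.bisect_left(keys, start_key)
    out = []
    while i < len(rows) and rows[i][0] < stop_key:
        out.append(rows[i])
        i += 1
    return out

# All cpu samples from host web01 between 00:09:00 and 00:09:10 (exclusive):
print(scan("web01#000900", "web01#000910"))
```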

KEY FEATURES

–  Deep Integration with Apache Hadoop – Since HBase has been built on top of Hadoop, it supports parallelized processing via MapReduce. HBase can be used as both a source and output for MapReduce jobs. Integration with Apache Hive allows users to query HBase tables using the Hive Query Language, which is similar to SQL.
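Using HBase as a MapReduce source means the job’s mappers iterate over table rows and the reducers aggregate the emitted pairs. A toy end-to-end sketch in plain Python (the table, column names, and phase functions are illustrative, not the Hadoop API):

```python
from collections import Counter

# Toy HBase table as a dict: row key -> {"family:qualifier": value}.
# A MapReduce job with HBase as its source reads the table's rows as input.
table = {
    "req#001": {"cf:page": "/home"},
    "req#002": {"cf:page": "/about"},
    "req#003": {"cf:page": "/home"},
}

def map_phase(rows):
    # Emit (page, 1) for every row, as a mapper scanning HBase would.
    for _, cells in rows.items():
        yield cells["cf:page"], 1

def reduce_phase(pairs):
    # Sum counts per key; the shuffle/sort between phases is implicit here.
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

print(reduce_phase(map_phase(table)))  # -> {'/home': 2, '/about': 1}
```

The equivalent Hive query over a mapped HBase table would be an aggregate like `SELECT page, COUNT(*) FROM requests GROUP BY page`.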

–  Strong Consistency – The HBase project has made strong consistency of reads and writes a core design tenet. Each subset of the data (a region) is served by exactly one server in the cluster at a time, and with atomic row operations, HBase is able to ensure consistency.
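The atomic row operations mentioned above can be sketched as a check-and-put: the read and the conditional write happen under one row-level lock, so no other writer can interleave between them. A minimal Python sketch of the idea, not the HBase client API:

```python
import threading

class Row:
    """One row's cells plus a per-row lock; HBase similarly serializes
    mutations to a single row, so each row is read and written atomically."""

    def __init__(self):
        self.cells = {}
        self.lock = threading.Lock()

    def check_and_put(self, qualifier, expected, new_value):
        """Write new_value only if the cell currently holds `expected`.
        Returns True if the write happened."""
        with self.lock:  # read + compare + write as one atomic step
            if self.cells.get(qualifier) == expected:
                self.cells[qualifier] = new_value
                return True
            return False

row = Row()
row.check_and_put("balance", None, 100)       # create: succeeds
ok = row.check_and_put("balance", 100, 150)   # expectation matches: succeeds
stale = row.check_and_put("balance", 100, 0)  # stale expectation: rejected
print(ok, stale, row.cells["balance"])        # -> True False 150
```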

–  Failure Detection – When a region server fails, HBase replays the write-ahead log to recover writes in progress and edits that had not yet been flushed, then reassigns the failed server’s regions to other nodes in the cluster.

–  Real-time Queries – HBase provides random, real-time access to its data by using bloom filters, block caches, and Log-Structured Merge (LSM) trees to efficiently store and query data.
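The read path these structures enable can be shown in miniature: writes land in an in-memory store that is periodically flushed to immutable segments, and a cheap per-segment membership check (standing in for a bloom filter) lets reads skip segments that cannot contain the key. All names below are illustrative, not HBase internals:

```python
class MiniLSM:
    """Toy Log-Structured Merge store: an in-memory memtable plus
    immutable flushed segments, searched newest-first on read."""

    def __init__(self, flush_threshold=2):
        self.memtable = {}
        self.segments = []  # (key summary, segment data), oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            # Flush: freeze the memtable into an immutable segment. The key
            # set plays the role of HBase's bloom filter: a fast
            # "definitely not here" check before touching the segment.
            self.segments.append((frozenset(self.memtable), dict(self.memtable)))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:       # newest data first
            return self.memtable[key]
        for summary, data in reversed(self.segments):
            if key not in summary:     # bloom-filter-style segment skip
                continue
            return data[key]
        return None

db = MiniLSM()
db.put("a", 1)
db.put("b", 2)      # second write triggers a flush
db.put("a", 99)     # newer value shadows the flushed one
print(db.get("a"))  # -> 99
print(db.get("b"))  # -> 2
print(db.get("z"))  # -> None
```

In HBase proper, flushed segments are HFiles on HDFS, and background compactions merge them so reads touch fewer files over time.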

COMPANIES USING HBASE

–  Facebook – When Facebook decided to re-launch their messaging platform in 2010, they moved away from their MySQL-based solution to HBase. HBase’s simpler consistency model compared to Cassandra, solid performance, and in-house experience with Hadoop and HDFS all contributed to the decision. Facebook has written publicly about how HBase is used throughout the company.

–  Pinterest – Pinterest utilizes HBase to provide users with personalized feeds, capture data for Rich Pins, and power their recommendation pipeline. Pinterest has presented a recorded talk on how they use HBase.

–  Explorys – Explorys uses HBase to capture billions of anonymized clinical, operational, and financial data points. Explorys uses this platform to help their customers drive quality care, minimize costs, and mitigate risk. Explorys has published slides showing how they take advantage of HBase.

TAKEAWAYS

–  HBase is an easy addition for organizations that are already considering a Hadoop deployment.

–  HBase takes advantage of key components of the Hadoop ecosystem (HDFS, ZooKeeper, and MapReduce) to provide horizontal scalability and fault tolerance.

–  The deep integration with Hadoop allows users to be creative with how large sets of data are analyzed.

–  The design tenet of strong consistency helps reduce complexity in systems built on HBase, as the Facebook team noted.
