
Technology | Nov 26, 2013

Cassandra Explained in 5 Minutes or Less

Philip Shon

Apache Cassandra is a massively scalable, column-family NoSQL database that lets users store large amounts of structured and unstructured data. In the big data landscape, it fits into the structured storage category and serves as an alternative or additional data store. Originally open-sourced by Facebook in 2008, Cassandra combines the distributed systems techniques of Amazon’s Dynamo key-value store with the column-based data model of Google’s BigTable. Because Cassandra’s architecture is decentralized, a cluster has no single point of failure, and its performance scales linearly as nodes are added.

Cassandra’s ability to scale easily across multiple data centers is why many companies turn to it to store their data in the cloud. Companies like Netflix have taken advantage of Cassandra’s decentralized model and replication strategy to span their deployments across multiple Amazon Web Services availability zones, making their data more resilient against outages such as the AWS outage on Oct. 22, 2012.

Common Use Cases

– Time Series Data Model – Thanks to its data model and log-structured storage engine, Cassandra offers high-performing write operations that make it well suited to storing and analyzing sequentially captured metrics (e.g. measurements from sensors, application logs, etc.). These models take advantage of the fact that the columns in a row are determined by the application, not by a predefined schema. Each row in a column family can contain a different number of columns, and there is no requirement for the column names to match.

– Storing Key-Value Data With High Availability – Sites like Reddit and Digg have chosen to use Cassandra as a persistent cache for their data. As their traffic increases, Cassandra is able to scale linearly and without downtime as they add new nodes to their cluster.
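
The time series model described above can be sketched in a few lines of Python. This is only a toy in-memory illustration, not Cassandra's storage engine or client API: the WideRowStore class, its method names, and the sensor row keys are all hypothetical. It assumes each row keeps its columns sorted by name, so a timestamp used as a column name allows a cheap range read, and it shows that rows in the same column family need not share column names or counts.

```python
import bisect
from collections import defaultdict


class WideRowStore:
    """Toy model of a column family: row key -> columns kept sorted by name."""

    def __init__(self):
        self._names = defaultdict(list)   # row key -> sorted column names
        self._values = defaultdict(dict)  # row key -> {column name: value}

    def insert(self, row_key, column_name, value):
        # A new column is created on write; no schema declares it in advance.
        if column_name not in self._values[row_key]:
            bisect.insort(self._names[row_key], column_name)
        self._values[row_key][column_name] = value

    def slice(self, row_key, start, end):
        """Return (name, value) pairs with start <= name < end, in column order."""
        names = self._names.get(row_key, [])
        lo = bisect.bisect_left(names, start)
        hi = bisect.bisect_left(names, end)
        values = self._values.get(row_key, {})
        return [(name, values[name]) for name in names[lo:hi]]

    def column_count(self, row_key):
        return len(self._names.get(row_key, []))


# Hypothetical sensor readings: one row per sensor per day,
# one column per reading, named by its Unix timestamp.
store = WideRowStore()
store.insert("sensor-1:2013-11-26", 1385424060, 21.7)  # written out of order
store.insert("sensor-1:2013-11-26", 1385424000, 21.5)
store.insert("sensor-1:2013-11-26", 1385424120, 21.6)
store.insert("sensor-2:2013-11-26", 1385424030, 19.0)  # different column name

first_two_minutes = store.slice("sensor-1:2013-11-26", 1385424000, 1385424120)
```

Here first_two_minutes holds the two earliest readings for sensor-1 in timestamp order, even though they were written out of order, and the two rows end up with three columns and one column respectively.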

Key Features

– Flexible Data Model – The Dynamo and BigTable concepts built into Cassandra allow for complex data structures that would be difficult to model in traditional relational databases. The model works with a wide array of data modeling use cases (e.g. Netflix Instant streaming queues, user preferences, application logging, etc.).

– No Single Point of Failure – In a cluster, all nodes are created equal. Data is distributed across the cluster and each node is capable of handling read and write requests. When configured with the proper data replication strategy, individual node failures can be resolved with no downtime.

– Linear Scalability – By default, all data stored in a cluster is distributed across the entire cluster. As new nodes are added, each node becomes responsible for a smaller share of the data, which reduces the load on each node.

– Hadoop Integration – Cassandra has supported integration with Hadoop since 2010. Users can perform MapReduce jobs that read and write data to a Cassandra column family, and Apache Pig support is available to query and store data into a column family as well.
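
The "No Single Point of Failure" and "Linear Scalability" features above both follow from how data is placed on a ring of nodes. The Python sketch below is a simplified illustration of Dynamo-style consistent hashing with replication, not Cassandra's actual partitioner or replication code. The Ring class, the node names, the replication factor of 3, the 64 ring positions per node, and the use of MD5 as the hash are all assumptions chosen for the example.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a ring position (MD5 is used here only for determinism)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring; every node is a peer, there is no master."""

    def __init__(self, replication_factor=3, positions_per_node=64):
        self.replication_factor = replication_factor
        self.positions_per_node = positions_per_node
        self._tokens = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        self.down = set()  # nodes currently simulated as failed

    def add_node(self, node):
        for i in range(self.positions_per_node):
            t = token(f"{node}#{i}")
            self._owner[t] = node
            bisect.insort(self._tokens, t)

    def replicas(self, key):
        """Walk clockwise from the key's token, collecting distinct nodes."""
        start = bisect.bisect_left(self._tokens, token(key))
        found = []
        for step in range(len(self._tokens)):
            position = self._tokens[(start + step) % len(self._tokens)]
            node = self._owner[position]
            if node not in found:
                found.append(node)
                if len(found) == self.replication_factor:
                    break
        return found

    def live_replicas(self, key):
        return [n for n in self.replicas(key) if n not in self.down]


ring = Ring()
for name in ["node-a", "node-b", "node-c", "node-d"]:
    ring.add_node(name)

keys = [f"user:{i}" for i in range(1000)]
primary_before = {k: ring.replicas(k)[0] for k in keys}

# Simulate one node failing: every key still has a live replica.
ring.down.add("node-b")
all_readable = all(ring.live_replicas(k) for k in keys)
ring.down.clear()

# Add a fifth node: only keys taken over by the new node change primary owner.
ring.add_node("node-e")
primary_after = {k: ring.replicas(k)[0] for k in keys}
moved = [k for k in keys if primary_before[k] != primary_after[k]]
```

In this sketch, every key in moved now has node-e as its primary owner. The existing nodes hand off part of their ranges to the new node and never trade data among themselves, which is why adding a node lowers the load on the others without reshuffling the whole cluster.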

Companies Using Cassandra

– Netflix – Uses Cassandra to store customer viewing data and support their streaming API. They have also contributed multiple open source projects that grew out of their experience with Cassandra, such as a Java-based client (Astyanax), a cluster management tool (Priam), and a JMeter plugin used to load test a cluster. More on Netflix using Cassandra.

– eBay – Takes advantage of the wide column-based data model to store time-series data that feeds into their fraud detection system. Cassandra is also a key component of eBay’s “Social Signals” product, which allows users to like/own/want items listed on eBay. Read more about Cassandra at eBay or view the slides.

– Expedia – Uses Cassandra to cache full prices for a stay at a given hotel, which increases the performance of searches for hotels based on price. View slides about how Expedia uses Cassandra.

– Constant Contact – Migrated away from an existing DB2 relational database in order to capture at least two years of historical data to support new social media products. Check out Constant Contact’s slides on using Cassandra.

Take Away

– Cassandra’s architecture scales well with the demands of a business’s growing datasets.

– The decentralized architecture of a Cassandra cluster makes it well suited for high-availability cloud deployments. As a cluster’s capacity requirements increase, read and write throughput will increase linearly as new nodes are added to the cluster.

– Core design principles are based on lessons learned from Google’s BigTable data model for structured data and Amazon’s high-availability key-value store, Dynamo.

– Cassandra’s flexible data model makes it well suited for write-heavy applications that can take advantage of the fact that rows in a column family are not required to have the same number of columns.

For more technology insights, connect with Credera on LinkedIn.