Over the past 30 years, the data storage landscape has changed dramatically. Data warehouses have been the most common form of repository for complex reporting and analytical needs, but high volumes of data, new data types, and new user personas (data scientists, business analysts, etc.) have forced the industry to innovate and offer alternative solutions.
In my client engagements, I am often asked about data lakes and how they can be leveraged in a modern data architecture. To best answer this question, I have found that a comparison of data lakes to the more common data warehouse can be helpful. In this blog post, I will provide a general overview of data lakes and describe six core areas where they differ from data warehouses in an enterprise data solution.
Lakes vs. Warehouse Basics
Where a data warehouse is a well-structured, schema-first repository with known data access patterns, a data lake is a large storage area that holds data as-received with minimal transformations. Additionally, data lakes ingest and store data at scale in loosely defined business domain areas. They have become synonymous with big data solutions because they excel at the “3 Vs” of big data: volume, variety, and velocity.
1. Data Types: Structured vs Unstructured
Since their primary purpose is analysis and reporting, data warehouses require structured and cleansed data. Analytical and reporting queries demand significant processing power because of the aggregations they perform, and if your data isn’t structured or clean, every query must compensate for it. Requiring cleansed, structured data up front is what allows data warehouses to deliver fast analytical and aggregation queries for reporting.
On the other hand, data lakes allow for data of any structure to be stored, including transactional data, CSVs, images, videos, byte sequences, and sensor data. Data lakes are indiscriminate of structure, meaning they can exist as a repository for various data sources without any cleansing before the data dump. That being said, there is some contextual work to be performed downstream to make use of this data. This will be discussed in the upcoming sections.
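To make this concrete, here is a minimal sketch of lake-style ingestion. The directory layout, domain names, and file contents are all hypothetical; the point is that structured CSV, semi-structured JSON, and raw byte sequences all land side by side with no cleansing before the dump.

```python
import json
import os
import tempfile

# Hypothetical sketch: the "lake" is just a directory tree, and files of any
# structure land in loosely defined business-domain folders as-received.
lake_root = tempfile.mkdtemp()

def ingest(domain, filename, payload):
    """Store raw bytes as-received under a business-domain folder."""
    path = os.path.join(lake_root, domain, filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(payload)
    return path

# Structured, semi-structured, and binary data are all accepted indiscriminately.
ingest("sales", "orders.csv", b"order_id,amount\n1,19.99\n")
ingest("sales", "clicks.json", json.dumps({"user": 42, "page": "/home"}).encode())
ingest("support", "call.wav", b"\x52\x49\x46\x46")  # arbitrary byte sequence
```

The contextual work of interpreting these files is deferred entirely to whoever reads them later.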
2. Schema: Schema-on-Write vs Schema-on-Read
Before storing data in a data warehouse, the data model must be clearly defined. Table structure, relationships, and indexes must all be considered. This type of data model is commonly known as a schema-on-write. Schema-on-write models often have a higher upfront investment and place the burden of data interpretation on the producer of the data. However, this investment pays off in the long term as data can be easily queried and analyzed going forward.
Data lakes are designed to be indiscriminate of data and allow for a schema-on-read data model. A schema-on-read model is highly flexible and affords data lakes unparalleled data ingestion speeds. The burden of data interpretation is shifted from the producer to the consumer as no schema exists until the consumer attempts to extract the data. In schema-on-read models, it is important to remember that the lack of a data model does not mean a lack of data governance. If the data is not properly governed, the integrity of the data lake can deteriorate and become what is known as a “data swamp.”
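The contrast can be sketched in a few lines. The table, field names, and records below are illustrative: schema-on-write declares the structure before any data loads, while schema-on-read stores raw lines as-is and imposes a schema only when a consumer projects the fields it cares about.

```python
import json
import sqlite3

# Schema-on-write (warehouse-style): the structure must exist before any data
# is stored, so the producer carries the burden of interpretation.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
db.execute("INSERT INTO orders VALUES (1, 19.99)")

# Schema-on-read (lake-style): raw lines are stored untouched; no schema
# exists until a consumer extracts the data. An unexpected extra field is fine.
raw_lines = [
    '{"order_id": 1, "amount": 19.99}',
    '{"order_id": 2, "amount": 5.00, "coupon": "SAVE10"}',
]

def read_with_schema(lines, fields):
    """Project each raw record onto only the fields this consumer cares about."""
    return [{f: json.loads(line).get(f) for f in fields} for line in lines]

orders = read_with_schema(raw_lines, ["order_id", "amount"])
```

Note how the second raw record carried a field the consumer never asked for; the schema applied at read time simply ignores it, which is exactly the flexibility (and the governance risk) described above.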
3. Extensibility: Poor vs High
Extensibility is the measure of how well a repository handles adding new features or data sources. This is an area of weakness for data warehouses, as a proper data model is required for every new ingestion source. New sources can also require significant remodeling of the data warehouse’s existing schema, as well as changes to any ETL processes that populate it. This can be both costly and time intensive.
The lack of a schema allows data lakes to be highly extensible at a low cost. Adding a new data source will neither break your data lake nor require a substantial time investment. Unlike with a data warehouse, you can quickly load the data and figure out how to consume it once it’s needed. However, it is important to consider how the data may be consumed in the future: too much indiscriminate data dumping can lower data integrity and morph your data lake into a data swamp, as touched on in the previous section.
Note: Extensibility should not be confused with scalability. Scalability is the ability of a repository to grow and handle larger datasets. Both data warehouses and data lakes are highly scalable, by adding processing power or more HDFS nodes, respectively.
4. Security: Mature vs Immature
What data warehouses lack in extensibility, they make up for in security; this is an area where they excel compared to data lakes. With their row-and-column structure, schema-on-write data model, and decades of use, data warehouses have developed a mature security model that gives administrators granular control, restricting access down to individual tables, procedures, functions, and views.
By comparison, data lakes are relatively new and have a security model that is still maturing. The lack of a structured schema can provide difficulties when locking down all aspects of the data lake. This does not prevent you from creating a secure data lake, but the road may be more arduous. Advancements can be made in this area by committing your data lake to a cloud-based ecosystem such as AWS or Azure. These solutions have not entirely closed the gap, but they are a great way to quickly improve security.
5. Granularity: Cleansed vs Raw
Granularity is one aspect where data lakes and data warehouses share similar characteristics. A data warehouse typically retains full granularity down to the individual record. Since the data being ingested has been cleansed and structured, the data warehouse is often considered the “master” data source. The business can then spin off individual data marts: subsets of the data warehouse where the data has been summarized by subject area.
Similarly, data lakes typically contain data at full granularity. The main difference is that the data lake holds the raw form of the data, rather than the cleansed form that exists in the data warehouse. In many cases this distinction may not matter, but when it comes to highly unstructured data, such as metadata, images, or emails, having the raw input for these records can be very important for archival and/or deep analytic purposes.
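A small sketch of the mart-building step described above, using made-up record-level sales data: the full-granularity records are preserved, and a data mart is simply a summarization of them by subject area.

```python
from collections import defaultdict

# Hypothetical record-level sales data: full granularity, one row per order.
raw_orders = [
    {"region": "east", "amount": 19.99},
    {"region": "east", "amount": 5.00},
    {"region": "west", "amount": 12.50},
]

def build_mart(orders):
    """Summarize record-level data by subject area (here, sales per region)."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["region"]] += order["amount"]
    return dict(totals)

# The mart holds only the summary; raw_orders keeps the full granularity.
sales_mart = build_mart(raw_orders)
```

Because the raw records survive alongside the summary, a new question (say, sales by order size) can always be answered by re-aggregating from full granularity.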
6. Tools: RDBMS vs Hadoop
With data warehouses requiring so much structure, the most common tools used for management are relational database management systems, often referred to as RDBMS. These systems will leverage the SQL programming language to access the data. Some common examples of RDBMS platforms are Microsoft SQL Server, Oracle, Teradata, and Amazon’s Redshift.
Data lakes are most often associated with the Hadoop ecosystem. The tools commonly associated with Hadoop, and consequently data lakes, are HDFS, MapReduce, Hive, and Spark. Microsoft’s Azure and Amazon’s AWS each contain their own tools for managing data lakes from within the cloud that mirror the tools seen in the Hadoop ecosystem.
Tying It All Together
Data lakes have a lot of value that may translate to your organization, but they are likely not a viable replacement for your entire data warehouse. Instead, a data lake could complement your data warehouse in areas that it is currently lacking. Often, this will lead to a hybrid data lake-data warehouse solution that grants the extensibility and quick ingestion capabilities of the data lake as well as the optimized analytic capabilities of the data warehouse.
Data Warehouse vs. Data Lake at a Glance
- Data Types: Structured Data vs. Structured and Unstructured Data
- Schema: Schema-on-Write vs. Schema-on-Read
- Extensibility: Potential High Cost vs. Low Cost
- Security: Mature Security Model vs. Evolving Security Model
- Granularity: Full Granularity and Summarized Data vs. Full Granularity (Raw)
- Tools: RDBMS (SQL Server, Oracle, Redshift) vs. Hadoop (HDFS, Hive, Spark)
For example, an ecommerce platform could leverage their data warehouse to store structured sales data, customer information, inventory data, marketing campaigns, and customer segmentation. Within the same business, the data lake could be leveraged to contain all the sales data, customer data (behavior, social, call center info, emails, etc.), clickstream data, and the website analytics in a raw form. The data lake would offer a source for data scientists and analysts to mine the data, extract customer intelligence, develop patterns and AI models, and then send the aggregate of these results to the data warehouse for use in reporting and analytics. The data warehouse in this scenario offers structured and predefined decision support, while the data lake offers deep analytics.
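The hybrid flow in that example can be sketched end to end. Everything here is illustrative (the event records, the `page_view_summary` table name, and the stand-in "analysis" are all invented): raw clickstream events sit in the lake, an aggregate is derived from them, and only the summarized result is loaded into the warehouse for reporting.

```python
import sqlite3
from collections import Counter

# Hypothetical raw clickstream events, as they might sit in the lake.
raw_clicks = [
    {"user": 1, "page": "/home"},
    {"user": 1, "page": "/product/42"},
    {"user": 2, "page": "/home"},
]

# Stand-in for the deep-analytics step: derive page-view counts from raw events.
page_views = Counter(event["page"] for event in raw_clicks)

# Load only the aggregate into the warehouse for structured decision support.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE page_view_summary (page TEXT, views INTEGER)")
wh.executemany("INSERT INTO page_view_summary VALUES (?, ?)", page_views.items())

home_views = wh.execute(
    "SELECT views FROM page_view_summary WHERE page = '/home'"
).fetchone()[0]
```

The raw events remain in the lake for data scientists to re-mine, while the warehouse serves the predefined, report-friendly summary.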
Of course, there is no standard solution for every company. Cloud advancements are improving both data warehouses and data lakes, allowing each to be a viable solution where its counterpart previously prevailed. Architectures vary and technologies must be evaluated on a case-by-case basis, but it is critical to understand what options are available when investing in an enterprise data platform. If you have questions about data warehouses, data lakes, or big data, feel free to reach out to me directly at email@example.com or get in touch with Credera at firstname.lastname@example.org. We would be happy to help you think through your current challenges.