Data engineering has recently pulled in two opposing directions: processing power and simplicity. Tools optimized for processing power are often too specialized and complex, while those that prioritize simplicity sacrifice computational efficiency.
In this blog post, we walk through the benefits and drawbacks of each approach and highlight a new framework that incorporates the best of both. With the advent of Delta Live Tables, data teams can utilize the immense processing power of Databricks while simultaneously maintaining the ease of use of the Modern Data Stack. We also share what this means for the future of data engineering and next steps for your organization to consider.
Today’s Data Engineering: Power or Simplicity, Pick One
Power Through Apache Spark
In the power direction, data engineers have harnessed the power of distributed computing with Apache Spark. This framework offers the ability to perform transformations and aggregations on large, complex datasets in an efficient manner by partitioning them across nodes in a cluster. It also provides flexibility by supporting both streaming and batch data processing.
As a result of its scalability, Spark has been the leader in big data engineering in recent years. Its compatibility with the lakehouse architecture enables the use of SQL at the end of pipelines as well.
Despite its impressive performance and suitability for big data tasks, Spark has typically been seen as overly complex. Because the framework is distributed, Spark DataFrames do not behave like typical single-machine dataframes. Engineers must pay close attention to how the data is partitioned across the cluster to minimize shuffling during transformations. Machine learning models also need extra work before they can run on this partitioned data.
Furthermore, before an engineer can even begin working with these tools, they must go through a relatively complicated infrastructure setup process to configure the appropriate software and hardware to be used in the environment.
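To make the partitioning concern above concrete, here is a minimal sketch in plain Python rather than Spark itself. It routes records to partitions by a stable hash of the key, aggregates each partition independently, and merges the results. The function names and the sample sales data are hypothetical and chosen only for illustration; a real Spark job would rely on the engine's own partitioners.

```python
import zlib
from collections import defaultdict


def partition_by_key(records, num_partitions):
    """Route each (key, value) record to a partition chosen by a stable hash of the key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


def sum_within_partition(partition):
    """Aggregate one partition on its own, as a single worker would."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


def sum_by_key(records, num_partitions=4):
    """Combine per-partition results into one dictionary of totals."""
    result = {}
    for partition in partition_by_key(records, num_partitions):
        # Every record for a given key lives in one partition, so keys never collide here.
        result.update(sum_within_partition(partition))
    return result


# Hypothetical sample data: (region, sale amount)
sales = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
totals = sum_by_key(sales)
```

Because the records are already grouped by key, the final merge needs no further data movement. When data is partitioned on a different column than the one being aggregated, records must first be redistributed between workers, which is the shuffle cost that engineers try to keep low.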
Simplicity With SQL-centric Tools
In the other direction, firms have opted for simplicity over processing power, utilizing SQL-centric tools, many of which fall under the blanket term “Modern Data Stack.” This ecosystem has built up around the rapid rise of the data build tool (dbt), which enables analysts to build data pipelines using only SQL SELECT statements that reference one another. The tool delegates the execution of that SQL to the cloud data warehouse of the team's choice (with Amazon Redshift, Snowflake, Google BigQuery, and Azure Synapse as the common choices). This has led to the rise of a new role, the analytics engineer, which combines a data analyst's ability to ask the right questions with a data engineer's rigor in building repeatable pipelines.
Interestingly, by also building in a data testing framework and providing the ability to document data during development, dbt often leads to better DataOps processes than the traditional tooling used by data teams. However, dbt restricts the types of processing allowed (batch-only, SQL-only), which can be confining, and many of its advantages depend on it being used end-to-end.
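The idea of SELECT statements that reference one another can be sketched in a few lines of Python. dbt models refer to each other through the `ref` function inside Jinja braces, and the tool derives a run order from those references. The sketch below is only an approximation of that behavior: the model names and SQL are hypothetical, the regular expression handles only single-quoted names, and the ordering uses Python's standard `graphlib` module rather than anything from dbt.

```python
import re
from graphlib import TopologicalSorter

# Matches references such as {{ ref('stg_orders') }} (single-quoted names only).
REF_PATTERN = re.compile(r"\{\{\s*ref\('(\w+)'\)\s*\}\}")

# Hypothetical models: each is one SELECT statement keyed by model name.
models = {
    "stg_orders": "SELECT id, customer_id, amount FROM raw.orders",
    "stg_customers": "SELECT id, region FROM raw.customers",
    "orders_by_region": (
        "SELECT c.region, SUM(o.amount) AS total "
        "FROM {{ ref('stg_orders') }} o "
        "JOIN {{ ref('stg_customers') }} c ON o.customer_id = c.id "
        "GROUP BY c.region"
    ),
}


def build_run_order(models):
    """Return model names ordered so that every model runs after the models it references."""
    # Map each model to the set of models it depends on.
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


run_order = build_run_order(models)
```

Here `orders_by_region` always comes after both staging models, while the relative order of the two staging models is left to the sorter because neither depends on the other.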
Delta Live Tables Unifies These Approaches
Databricks, founded by the creators of Spark, offers a unified and highly scalable Spark platform and has recently announced the general availability of Delta Live Tables (DLT) to unify these two approaches. DLT essentially offers a native implementation of the Modern Data Stack on Databricks. It combines the efficiency and robustness of the Spark framework with the ease of use and software best practices of the Modern Data Stack to help data engineers build reliable and manageable data pipelines.
Overview of Delta Live Tables
The main unit of execution for DLT is a pipeline. Unlike a typical extract, transform, load (ETL) pipeline, which is defined as a series of Spark operations, a DLT pipeline consists of queries. Queries allow users to simply set the data source and the target schema while DLT manages the orchestration of the intermediate data transformations, thus simplifying development.
Data quality can also be enforced using expectations, which allow users to set sanity checks throughout the pipeline and specify error handling when those checks fail. Other benefits of DLT include support for both streaming and batch data processing with a single API, monitoring of pipelines through the DLT user interface, and chained dependencies between datasets, so that any update automatically propagates downstream through the pipeline. Users are also able to drop back to PySpark when necessary for more complex operations. Here is a feature comparison between DLT and dbt.
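Before moving on, the expectation mechanism described above can be illustrated with a small plain-Python sketch. This is not the DLT API. It only mimics the idea of attaching named checks to a dataset and choosing what happens when a row fails one: record the violation and keep the row, drop the row, or stop the run. The class, function, policy names, and sample rows are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


@dataclass
class Expectation:
    name: str
    check: Callable[[Row], bool]
    on_failure: str  # hypothetical policies: "warn", "drop", or "fail"


class ExpectationFailed(Exception):
    """Raised when a row violates an expectation whose policy is "fail"."""


def apply_expectations(
    rows: List[Row], expectations: List[Expectation]
) -> Tuple[List[Row], Dict[str, int]]:
    """Return the rows that survive all checks plus a violation count per expectation."""
    kept = []
    violations = {e.name: 0 for e in expectations}
    for row in rows:
        keep = True
        for e in expectations:
            if e.check(row):
                continue
            violations[e.name] += 1
            if e.on_failure == "fail":
                raise ExpectationFailed(f"{e.name} failed for row {row}")
            if e.on_failure == "drop":
                keep = False
            # "warn" only records the violation and keeps the row.
        if keep:
            kept.append(row)
    return kept, violations


def amount_is_valid(row: Row) -> bool:
    return row["amount"] >= 0


def region_is_present(row: Row) -> bool:
    return row["region"] is not None


# Hypothetical sample rows.
rows = [
    {"id": 1, "amount": 10.0, "region": "east"},
    {"id": 2, "amount": -5.0, "region": "west"},
    {"id": 3, "amount": 7.5, "region": None},
]

clean_rows, violation_counts = apply_expectations(
    rows,
    [
        Expectation("valid_amount", amount_is_valid, "drop"),
        Expectation("has_region", region_is_present, "warn"),
    ],
)
```

In this run, the row with the negative amount is dropped, the row with the missing region is kept but counted, and both counts are available for monitoring. Keeping the policy next to the check is what lets a pipeline author decide, per rule, whether bad data should be tolerated, filtered, or treated as a hard error.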
Current Limitations With Delta Live Tables
One current drawback with DLT is its lack of compatibility with the dbt ecosystem. Because of dbt's open nature, simplicity, and widespread adoption, many other vendors have integrated their tools into the dbt workflow. This ecosystem lets a chief data officer at a company just getting started with data quickly assemble the technology footprint needed for data democratization, a topic we recently explored in our redefining data governance whitepaper.
Additionally, with integrations into tools like Mode Analytics for data visualization, the dbt codebase can be utilized to define the full end-to-end pipeline. Finally, the data observability and quality community (Bigeye, Datafold, Monte Carlo) has built many integrations that ensure data quality is a first-class citizen in dbt implementations. We expect the Databricks team will find a way to make DLT compatible with these tools soon.
What Does This Mean for the Future of Data Engineering?
The key achievement in Delta Live Tables is the unification of data engineering, analytics engineering, and data science onto a common platform. When combined with the rest of the Databricks platform (Notebooks, on-demand compute clusters, MLflow), it enables end-to-end collaboration across these three personas without giving up the power needed to tackle the most complex workloads. Each persona can work in a preferred interface, with data engineers writing Spark code in Notebooks or their preferred integrated development environment, analytics engineers writing SQL in a native interface, and data scientists writing Python in integrated Notebooks. The Databricks platform then provides automated orchestration and performance optimizations by executing all of this work on a unified platform.
The value to data teams comes in the form of easy collaboration, which results in quicker delivery. Teams working together can move things into production much faster. In many organizations, the timeline for taking advanced machine learning-based data products to production is nearly two months. Most of this time is spent integrating components and deploying.
With a unified pipeline running on Databricks, this timeline is reduced to days or hours: teams simply rework the pipeline parameters to point to production sources, load test and potentially tune, and then go live. With the Databricks Repos integration, teams can apply DataOps principles and use multiple environments as the standard way of working, improving quality while maintaining velocity and ultimately reducing the go-live time to a matter of hours.
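The step of reworking pipeline parameters to point to production sources can be pictured with a short Python sketch. It keeps one pipeline definition and swaps only the parameters per environment. Everything named here, including the paths, schema names, table names, and the `build_pipeline_plan` helper, is hypothetical and is not a Databricks configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineParams:
    source_path: str
    target_schema: str


# Hypothetical environments: only the parameters differ, not the pipeline logic.
ENVIRONMENTS = {
    "dev": PipelineParams("/mnt/dev/raw/orders", "analytics_dev"),
    "prod": PipelineParams("/mnt/prod/raw/orders", "analytics"),
}


def build_pipeline_plan(env):
    """Describe the same pipeline steps for whichever environment is requested."""
    params = ENVIRONMENTS[env]
    return [
        f"read {params.source_path}",
        f"write {params.target_schema}.orders_clean",
        f"write {params.target_schema}.orders_by_region",
    ]


dev_plan = build_pipeline_plan("dev")
prod_plan = build_pipeline_plan("prod")
```

Because both plans come from the same function, promoting the pipeline means changing a single argument, which is the property that makes multiple environments cheap to maintain.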
While these changes point to a brighter, more collaborative future within data teams, the key reason for building this functionality is to build data products, and adoption of these requires trust. Therefore, a likely next step is integrating DLT into the broader ecosystem to enable what is becoming known as data reliability engineering and to build trust in data. This will involve changes like integration into the Databricks Unity Catalog for data governance and discoverability, better lineage and reporting integration into the MLflow stack, and Databricks SQL integration with visualization tools like Mode Analytics.
Data teams no longer have to choose between power and simplicity when implementing a modern data ecosystem. Delta Live Tables, combined with the broader Databricks platform, provides a unique opportunity to boost the speed of developing data products while still adhering to software best practices.
As your organization considers incorporating DLT, try to identify specific teams or pipelines that could benefit from a more unified approach. Which hand-offs between team members or pipeline steps are the most time consuming and cause the most severe bottlenecks? To learn how to best leverage DLT for your data products, reach out to us at firstname.lastname@example.org.