Data•Dec 11, 2020
Analytics Workbench Part 2: Unpacking the Technical Underpinnings of Advanced Analytic Platforms
In the first part of this series, we walked through the “what” and “why” of the eight building blocks of an advanced analytics platform. Throughout this article, we will cover the “how,” specifically through Azure, AWS, and Google Cloud. We will unpack the technical underpinnings of each building block so you are equipped to lay the technology foundation of a scalable and successful analytics platform.
As a reminder, our eight building blocks are the following:
Flattening the data
1. Data Extraction
Extracting data from an outside source and storing it in your data environment is the foundational step that allows for analysis of that data downstream. Your data architecture can include either periodic “batch” data ingestion, streaming data ingestion, or both methods.
Using a function as a service (FaaS) will allow you to use serverless technology to automate data ingestion for your environment. This can include API calls or SFTP functionality to gather data and can generally be programmed in various programming languages, such as Python, Java, Node.js, and C#. Using a serverless tech is best for automating the data pull on a desired frequency and is also the most cost effective as it eliminates setup/maintenance of the cloud infrastructure and the cost of idle time.
Streaming Data Platform
If your upstream data source is an application that is constantly generating data, like a point of sale or internet of things (IoT) device, you can create data streaming platforms to act as an ingestion endpoint for this data. This allows for near-real-time ingestion of data into your platform. Typically, these implementations will be built either using Apache Kafka or one of the major cloud provider’s implementations of Kafka.
2. Data Storage
The data storage layer enables the structuring of inbound data files into a data lake, as well as storage of binary media files like images and videos that are needed for analysis.
Each cloud provider has a cloud file storage option that provides the flexibility to configure different levels of service. These configuration options will include access frequency, geo-redundancy, and partition styles, allowing for your implementation to increase or decrease performance to meet your use case and ensure you only pay for what you need.
3. Identity Resolution
An identity resolution layer allows organizations to better leverage their data, stitching together a more comprehensive view of each customer. While there are various cloud-provider-based identity resolution solutions, we would recommend a third-party solution such as LiveRamp or Neustar. Using a third-party solution provides a unique advantage of being able to utilize customer data outside of the scope of your existing datasets, providing a more enriched result for analysis.
4. Data Modeling
Data modeling provides the opportunity for your organization to plan out what data structure will best support analysis in later stages.
Exploratory Data Analysis (EDA)
The EDA aspect of data modeling requires your team to analyze each data source and table within those sources to gain a thorough understanding of the data at hand. EDA is a crucial part of data modeling, and there are multiple tools that allow for the quick querying and visualization of data to simplify your analysis.
Once you build the data model, it is important to track the inputs and outputs. Cataloging your data means tracking all incoming pieces of data and documenting what transformations are occurring, resulting in the eventual final data model for analysis. Each cloud provider has a platform for storing this metadata and will allow you to track lineage of each data source through your environment.
5. Data Transformation
A data transformation tool will allow you to convert raw data into the intended data model in a consistent manner. This will include a broad set of capabilities, like filtering, joining, or cleaning your data. While there are hundreds of extract, transform, and load (ETL) tools available on the market, your use case will help narrow down to the tool of choice within your preferred cloud provider.
Set-Based Processing ETL Tools
Spark is a widely popular tool used for big data ETL use cases. Each cloud provider has its own serverless functionality to implement Spark code, which will be the fastest way for you to develop powerful, fast pipelines for data transformations.
Set and Stream Processing ETL Tools
If your use case needs a solution for stream-based data, Apache Beam and its cloud-specific implementations are the best tools for the job. Beam and Google’s Cloud Dataflow will allow you to develop identical logic for processing and transforming data in both batch and stream workflows. While the performance in this option is not quite as powerful as Spark, the cross-functionality of being able to cross over into streaming use cases may make it a viable option for your architecture.
File and API ETL Tools
The final class of ETL Tools is a bit broader and provides a wider set of capabilities. These tools excel with picking up individual files or connecting to APIs, as well as performing highly complex transformations on data. The tools will also introduce a lower learning curve for implementation but the underlying technology used will often times result in poorer performance.
6. Data Warehousing
A data warehouse allows for the organized storage of large amounts of data, as well as access to query massive amounts of data quickly. This is typically where you would store the previously created data model by using ETL transformations to load the data into your data warehouse. Each of these options listed will provide capabilities beyond classic SQL databases, as they enable the parallel processing of data, which increases performance. Modern data warehouses also provide the ability to separate compute from storage, turning the query engine into a pay-as-you-query service, rather than an always-on server.
7. Flattening the Data
This layer allows for summarization of complex data structures and simplification of data visualization. Data flattening can be accomplished in each data warehouse option listed above, and a good option would be to implement the summarization of data using a materialized view. The main goal with flattening data is speed—speeding up dashboard development and speeding up data query times.
The visualization layer makes data universally available and understandable, leading to real action. All of the leading solutions allow the end user to design their own dashboards and select the key performance indicators (KPIs) they want to monitor. While each cloud provider has their own business intelligence (BI) software, it is important to note that you can cross providers from databases and BI software.
Bringing It All Together
No matter the cloud platform you choose, the building blocks for building your analytics workbench will be available. Below are sample diagrams for each cloud platform, including the building blocks for a workbench and the technologies you could potentially implement.
Advanced analytics platforms are a considerable investment and require thoughtful planning. Before embarking on this journey, carefully consider your objectives. Finally, while all the technologies mentioned in this article are industry-leading solutions, the orchestration between the layers is just as important as the technology itself and often requires custom integration. Please feel free to reach out to firstname.lastname@example.org to learn more and discuss how to build your own advanced analytics platform.