If there’s one constant I’ve encountered in my 27 years in the IT industry it’s that nothing remains the same. The rapid pace of technological innovation and evolving business needs means I must constantly re-evaluate my approach to technical challenges. One such technical item is the generation of surrogate keys in the context of Business Intelligence (BI) environments (ODSs, EDWs, data marts, etc.).
Assuming the basics of surrogate keys and the justifications for their use is well understood, I will focus on addressing the following questions:
What is the best approach to generate surrogate keys in today’s environment?
Will the approach to generate surrogate keys vary based on technical platform or can it be consistent across platforms?
typical approach to surrogates in a classic environment
The approach to the generation of surrogate keys that I’ve used for the past few decades is to utilize some type of sequence generator. This may be a database component (e.g., an Oracle Sequence or Microsoft SQL Server Identity Column), an integration tool component (e.g., an Informatica Sequence Generator), or even target based logic to increment against the current maximum value in the target table. These approaches work on the principle of generating a new and thus unique sequential numerical key value whenever a new natural/business key value is encountered. This approach worked well when BI environments resided in a single environment with monthly or weekly batch updates and had large batch windows in which to perform data loads. It was easy to understand, could be consistently applied, and was battle tested in a wide range of technical environments and modeling approaches. It adapted to a normalized, hybrid, or dimensional model with equal facility.
advent of the business intelligence ecosystem presents challenges
With the introduction of significant extensions to BI environments such as Big Data, NO-SQL, and real-time update requirements, I have found this long-standing approach to the generation of surrogate keys is becoming a hindrance to the population of a consistent and integrated data environment. As organizations move away from monolithic platforms/architectures with a fixed schedule to more of a best-of-breed ecosystem with dynamic requirements, maintaining the consistency and integrity of surrogate keys generated through sequences becomes a bottleneck while increasing integrity concerns. As BI environments continue to evolve into complex ecosystems, our classic approach of using sequence-based surrogate keys becomes problematic in two primary areas:
Scheduling dependencies in a disappearing batch window
Cross-platform dependencies and consistency
SCHEDULING DEPENDENCIES IN A DISAPPEARING BATCH WINDOW
In order to show the nature of the problem, I present a small data model based on the Data Vault modeling pattern in Figure 1 below. While a Data Vault is shown here, this same issue would exist in a normalized (3NF) or dimensional model as well. This is a simple subset of a model showing four hubs (many of which have multiple satellites) plus two links with the corresponding satellites (one modeled as a dependent link). All of the satellites as shown are modeled to track full historical change states.
Figure 1 – Sample Data Vault Model
Assuming sequence generators are utilized to generate the surrogate keys, the ETL process dependencies to populate the model section is shown below in Figure 2 with sample run times in HH:MM:SS.
Figure 2- ETL Dependencies Using Sequence Based Surrogate Keys
This shows that even in this small subset of a model, I start to have a fairly complex dependency chain, which requires four distinct layers of processing that must be performed sequentially in order to properly load the model. As shown, this particular batch cycle will average 1:10:30 to complete (H2+L1+L2+S1L2). As requirements start to arise for real-time or near real-time reporting, this approach becomes more and more unworkable in order to fit into my limited load time frames.
One work around that has been commonly used to get around at least part of this in the Data Vault environment is to remodel any dependent links as shown above in a logical model to be independent links with direct hub links in the physical model. Even with this, the above model and ETL stream would still show three load layers, with Link 2 being dependent on Hub 1 and Hub 2 directly vs. through Link 1, and runtime would be reduced 30% (to 0:50:30). However, this is still too long in many of today’s environments.
Cross-Platform Dependencies and Consistency
In addition to the time constraints introduced by the utilization of sequences for the generation of surrogate keys, there is a further constraint in that these surrogate keys are available only in the originating/target environment. As BI Ecosystems become more the norm, I need to be able to relate data across various platforms and databases in a consistent manner. If I use a sequence generator approach to tie things together, I must define one sequence as the master for each entity, then expose the resulting surrogate key and natural key sets to all consuming environments. This will require either cross-database or cross-platform views, if feasible, or the introduction of a replication strategy to all consuming environments to ensure every system stays in sync.
As the need for faster data loads increases and as environments that must be kept in sync proliferate, I see an increasing need to parallelize my processing in order to meet delivery commitments. Additionally, I see a need to be able to generate surrogate keys independently in the various environments while ensuring they are still consistent across environments. This need to speed delivery combined with the growth of disparate platforms in an integrated BI Ecosystem suggests a different approach to the generation of surrogate keys needs to be considered.
surrogate key generation requirements in a bi ecosystem
With the advent of the era of the BI Ecosystem, it is now becoming necessary to deliver solutions with some or all of the following benefits:
Full parallelization of all load processes, which results in the full batch cycle time equating to the longest running single process. In my example above this is 30 minutes, assuming run times remain consistent throughout. In all likelihood, however, cycle times will shorten due to reduced ETL complexity
Ability to support real-time/near real-time data updates with minimal cycle dependencies
Consistent approach across all architectural components in the BI Ecosystem
Allow dependent links in Data Vault without requiring multiple level lookups or multiple layers to load. This same benefit would also apply to the multiple dependent tables in a Snowflake or Star schema
Allow consistent key values across deployment environments (Development, Integrated Testing, UAT, Production, etc.)
Simplified elimination of duplicates
Consistent single column joins for natural/business keys with multiple nodes and variable column formats
Consistent data type optimized for index creation and join support
Stability and persistence of keys over time
hash based surrogate keys to the rescue
My preferred approach to the generation of surrogate keys in a complex BI Ecosystem has become to replace sequence logic with hash logic. One essential consideration to keep in mind is the technique used must consistently produce the same output result hash value given the same input, even if the process is run on multiple distinct toolsets or environments.
Most database platforms and integration tools now support the ability to generate hash values based on many of the common hash algorithms, either directly or via user-defined functions. The most commonly supported algorithm is MD5 (128-bit), though SHA1 (160-bit) and SHA2 (256-bit) are also commonly supported.
For those unfamiliar with a hash generator, it is a mathematical algorithm that given a specific string input will always produce the same output, usually in the form of a fixed length hexadecimal string. This will allow the same natural key value to always produce the identical output hash value when using the same algorithm, regardless of database platform or integration tool. Additionally, they are built such that different strings, even with just the difference of a single byte, will produce distinct hash values. One thing to keep in mind with hash algorithms is that they are a one-way encryption algorithm. The hash will always create the same output given the same input, but the output cannot be reversed to derive the original input. This means any natural key values must still be stored if there is a desire to expose them for reporting purposes, essentially the same as we had to do for sequence based surrogates.
hash based surrogate keys are not perfect
The main drawback to the use of hash values for the generation of surrogate keys has been the potential for creating collisions. For clarity, a collision is defined as the same hash value being generated given two distinct input values. With that said, I have yet to encounter an instance where a collision occurred in the definition of an entity that could serve as a hub or dimension. I have seen hash used for multiple years (change data capture purposes) allowing comparison of multiple (often 50 or more) columns at a time by simply comparing MD5 based hash values. Further, I have yet to see a collision in a much broader and higher volume population (e.g., dimensions as large as 300 million members using MD5 hash algorithms to check for data changes).
Given the limit of the target hash size, it is not possible to guarantee that the same hash will never be created for two different input strings; however, the likelihood of such a collision occurring can be calculated. The likelihood of a collision will vary with the length of the target hash value. In order to calculate the likelihood of a collision in a given population, one needs only to go back to the most basic problem addressed in elementary statistics—namely, the likelihood of any two people in a random population having the same birthday (as a refresher, the likelihood of a common birthday passes 50% once a population reaches 23). While the particular hash algorithm used may lead to a collision likelihood that differs somewhat from the theoretical likelihood, it is representative enough to be used to evaluate the safety of using hash as a surrogate key generator.
For simplification, the following calculation can be used to approximate the likelihood of uniqueness in a given population. The complement of this calculation is then the likelihood of a collision in a given population.
Where k is the population, N is the possible combinations supported by the hash, and e is the base of the natural logarithm.
The question now becomes, “Are the advantages to be gained by the introduction of hash surrogate keys enough to offset the theoretical possibility of a collision?” To help answer this, Figure 3 below shows the likelihood of a collision with various population volumes, using the size of the three most likely candidate hash algorithms to use for a BI Ecosystem.
Figure 3- Hash Uniqueness and Likelihood of Collisions
Using these three common algorithms, a .1% possibility of a collision is exceeded for each only once the possible member population exceeds 1018, 1023, or 1038 for the MD5, SHA1, and SHA2 algorithms respectively. Stated another way, using MD5 you would need a population of nearly one billion billion in order to exceed a .1% likelihood of encountering a duplicate. While the possibility of a collision cannot be altogether discounted, the likelihood of a collision in these common algorithms is such that the use of a hash for surrogate key generation can be safely used for most purposes to be found in a BI Ecosystem.
summary of hash based surrogate key approach
Using a hash algorithm in data integration processes for the calculation of a surrogate key will provide the following advantages:
All data integration processes can be executed in parallel as there is no need to process parent tables before child tables, enabling the potential for significantly shortening the time required to complete load processes
The same hash values will exist in all environments (DEV, INT, UAT, PROD), which can help simplify testing procedures and defect resolution
The same hash values can be created independently on various technology stacks with confidence that they will always match. This allows multiple ETL tools, RDBMSs, NO-SQL databases, and Hadoop variants to be used without regard for originating a master key set on just one platform
The simplified data integration dependencies allow simpler implementation of real-time processing streams by the removal of parent/child dependencies
Join keys continue to be fixed-length keys, character instead of integer, but still fairly optimal on most database platforms
ETL processes are simplified and optimized by replacing often large lookup processes with direct hash calculations that can normally be done in memory
The use of a hash algorithm for surrogate key generation does introduce a few complications not present with sequence based surrogates:
Possibility of collisions may require some staging layer duplicate elimination processes
Referential integrity cannot be implemented in the database, which may require procedural processes to be introduced to maintain integrity. If, however, the ecosystem crosses multiple database platforms, this will likely be needed in any event since integrity constraints cannot be enforced directly across database platforms
While I cannot say that I will exclusively use hash as a way of generating surrogate keys in my projects going forward, I can say I will now start with that as my default approach and only vary from that approach when sufficient justification can be provided. It is also worth mentioning that hash surrogates are now becoming more common in the EDW industry. Some standards, most notably the Data Vault 2.0 standards recently released, now require hash keys be used exclusively for Hub and Link surrogate keys in order to be compliant with the standards.