If you’ve ever tried to Google, “What is data modeling?” you might have seen a result that says data modeling is “the process of creating a data model.” While that definition isn’t very useful, I hope this blog post will provide a helpful introduction to the concept of data modeling. At Credera, we help our clients become more data-driven, and often that starts with cleaning and modeling data.
Data modeling occurs at three levels—physical, logical, and conceptual.
A physical model is a schema or framework for how data is physically stored in a database.
A conceptual model identifies the high-level, user view of data.
A logical data model sits between the physical and conceptual levels and allows for the logical representation of data to be separate from its physical storage.
This blog post will primarily discuss logical data modeling.
data models describe business entities and relationships
Data models are made up of entities, which are the objects or concepts we want to track data about, and they become the tables in a database. Products, vendors, and customers are all examples of potential entities in a data model. Entities have attributes, which are details we want to track about entities—you can think of attributes as the columns in a table. If we have a product entity, the product name could be an attribute.
Entities don’t exist in isolation; they’re connected to each other. The connections between entities in a data model are called relationships, and relationships reflect business rules. Relationships between entities can be one-to-one, one-to-many, or many-to-many.
The relationship between products and vendors can illustrate a one-to-many relationship. For example, if Vendor A makes Widget 1 and Widget 2, that means a single vendor makes many products. If you think about this relationship in the other direction, it still makes sense—I can purchase multiple products from the same vendor. One-to-many relationships are by far the most common type, but it’s important to know how to deal with the other two kinds of relationships.
One-to-one relationships usually occur when a business needs to use entities that are supertypes and subtypes, also called parents and children. For example, a product could be a supertype, and a specific kind of product, like a book, could be the subtype. In some situations, there are good reasons for storing general product data in a separate table from more specific data about books, and this can help in representing hierarchical relationships among entities. In this example, a single book title should be associated with a single product ID number, and vice versa, making this a one-to-one relationship between products and books.
The third kind of relationship, a many-to-many relationship, happens when multiple instances of one entity are related to multiple instances of another entity. Books and authors are a great example of this relationship—one author can write multiple books, and a single book can be written by multiple authors. Although many-to-many relationships exist, you don’t actually see them in data models. Instead, we use associative entities, which are tables that break the many-to-many relationship into many-to-one relationships. In this case, you might have a books table, an authors table, and an author-book table to link the two.
data modeling should enforce data integrity
When we talk with clients about leveraging their data, data integrity is a crucial prerequisite. Before companies can start using their data to make decisions, they need to be able to trust that the data sets are accurate and reliable. In data modeling, there are two kinds of rules that are foundational to maintaining data integrity—entity integrity and referential integrity.
Entity integrity means that the data within a single entity or table are reliable. The use of primary keys is an essential step toward entity integrity. Primary keys are unique identifiers, such as product ID numbers, that serve the purpose of identifying a particular record and preventing data duplication. There are three parts to the entity integrity rule:
All entities should have a primary key.
The values of all primary keys must be unique.
The value of a primary key cannot be null.
Referential integrity means that the relationship between two entities or tables is reliable. The use of foreign keys is an essential step toward referential integrity. Foreign keys are the primary keys of one table that appear in a different table. The rule of referential integrity says that for any foreign key value in one table, there must be a matching primary key value in the referenced table.
data modeling involves normalization
An introduction to data modeling would not be complete without mentioning the concept of normalization. Normalization is the process by which anomalies are avoided and redundancy is eliminated, and the concept was first suggested by a data scientist named Edgar Codd. According to Codd, normalizing a data model means structuring data so that each entity only has one theme or topic. In more technical terms, we refer to this as removing partial dependencies and transitive dependencies.
Let’s say I want to store data about customer purchases. I’ll want to know information like the dollar amount purchased, the customer name, and the store where the purchase took place. I could store all of these data points in a single table, but that would lead to problems. For example, a customer who has made hundreds of purchases may get married and change her name. To reflect that change in my records, I would have to go through my table and change her name for every single purchase she’s ever made. Instead of dealing with hundreds of changes, a normalized data model would allow me to make only one change. In this situation, I should store customer data in a customer table, product data in a product table, store data in a store table, and so on. By using primary keys and foreign keys, I can link these tables together and access all the information I need. Better yet, when I need to change a customer’s name, I only have to make the change in one place. The diagram shown here is a very simple illustration—most data models are much larger and more complex.
There are different levels of normalization, starting at first normal form and going up to sixth normal form. However, third normal form, abbreviated as 3NF, is most common. Data models in 3NF have tables with unique records in which all partial and transitive dependencies have been removed. I like to remember 3NF like this: Every non-key column is based on “the key, the whole key and nothing but the key, so help me Codd!” – William Kent
different models have different purposes
When your organization starts data modeling, it’s important to have discussions about the purpose of the data model. There are two types of data models—relational models and dimensional models—and they serve very different purposes.
Relational data modeling results in tables that are in 3NF, and it is used for building transactional and operational systems, such as point of sale systems. Relational models are designed to be great at getting data into a database in a way that maintains integrity and stores every piece of non-key data only once.
Dimensional data modeling can result in a design called a star schema, which has denormalized tables, and it is used for building reporting and analytical systems. Dimensional models are designed to be great at getting data out of a data warehouse and into the hands of business users. By storing data in a less normalized form, dimensional models make it much easier to query across many different tables. It’s worth noting that there are reasons for using a 3NF design, rather than a star schema, in a data warehouse. In modern data architecture, business intelligence tools often bridge the gap between multiple levels of normalization.
star schema structure
data modeling matters
Data modeling can feel rather abstract, but it’s a concept that deserves attention from both IT and business stakeholders. Data modeling reflects business rules, and when a data model is implemented in a system, it solidifies and reinforces those rules. Creating the entities and relationships in a data model is akin to defining the objects and operations in an organization. The implication here is that the people who create the data model are responsible for getting those definitions right. Data modeling should not occur in isolation, and it should include both business and technical experts.
moving forward with data modeling
I hope this post has helped you gain a high-level understanding of data modeling and why it’s important for your organization. If you have questions about data modeling, or you’re looking to become more data-driven, Credera can help—reach out to us at email@example.com. Happy data modeling!