A database with no data is as useful as a foodbank with no food. When you generate a new environment or a local database sandbox for a developer to work in, doing so without data is not helpful.
If your process does not yet support easily generating a new database environment, please see our earlier post introducing a great option for building and maintaining your application’s database dependencies, Closer to Continuous: Database Source Control.
Now that your database schema is being treated as a first-class citizen within the process, we can start treating database content the same way. Specifically, there are two classifications of data that should be managed within the deployment process: MVP Data and Test Data.
MVP Data (Minimum Viable Product Data)
This category encapsulates any data your application requires to function properly. Common examples include a table of user roles or a table of lookup values (like a list of states and provinces you might display to fill out an address form). Essentially, this is the data you’ll need to deploy along with your code when you release to production.
Test data is multi-faceted, but each of those facets share the goal of creating a known state that can be tested. We’ve broken these facets into the following subcategories:
Functional Test Data – Much like the mock data you might use in your unit tests, you’ll want to seed your database with data that enables you to move through a project and exercise all of the happy paths. For example, if you’re building an ecommerce website, then you’ll need a sample product catalog to work with before you can test the checkout workflow. To get started building this data you can trace through the test plans for your project to identify the required data. There are some similarities between configuration data and functional test data in that you’ll need both to run the application from a developer’s work station, but the key differentiator is that we’ll expect our configuration data to be set up in production. Our functional test data is similar to production data, but it’s not data we plan to package with the deployment.
Edge-Case Test Data – Again, this is similar to the mock data you would set up to facilitate testing, but it’s specifically targeted toward edge-case scenarios. For the standard ecommerce scenario, good test data would include product data with HTML tags, apostrophes, and other anomalies (such as extra-long product descriptions and titles) to test expected behavior.
Bulk Test Data – This scenario closes the gap between your sparsely populated testbed database and your production environment, which might have millions of records. This sort of data helps you validate that development has accounted for these ‘large result-set’ scenarios that your application will face in production.
Seeing It in Action
Let’s take a look at some more examples by discussing a database project that could be used for a fantasy football website. Here are the tables we’ll use in the next sections:
The top three tables of our example will require some data for us to build out the site. The number of teams, divisions, and conferences in the NFL doesn’t change during the season and doesn’t often change between seasons. These three tables represent a good example of MVP configuration data. Here’s an example of how those three tables might be populated.
Now that those configuration tables are populated, we can start to think about filling out the rest of the database with test data that we’ll need to be able to test our application. The next step is to populate the Players table, and this is where we’ll get into known-state data. Right off the bat, we’ll want our known-state data to begin to represent how our application will behave in production. There are 32 teams in the NFL and every year they have at most 53 players on their roster. So it would seem the next obvious step is to populate the database with more than 1600 sample players.
However, there is a risk that this dataset inaccurately represents our production data. To mitigate that risk, we’ll expand upon the data we’re seeding with data that helps us proactively identify common trouble spots.
Edge-Case Test Data
Bulk Test Data
Players with shorter than average names.
Players with longer than average names.
Players with special characters in their names.
32 teams each with a full roster for the 2016 season to represent the current year of data.
A full set of players to ensure that when someone builds their fantasy football team they can find the player they’re looking for.
A history of rosters that will accrue over the lifetime of the site so that people can compare statistics across multiple NFL seasons.
To accomplish this, we’ll start by seeding with the edge-case test data using this script.
We recommend starting with the edge-case test data because you want to make these scenarios easy to find. You want to plan for them early and often so that as you design your application you are not surprised by runtime defects found by your QA team, or worse, your customers. So in scenarios where you might show pages of data, we want these records to appear on the first page.
In the project download below we have included a post deployment script that will execute all of the data scripts in the proper order. A post deployment script is a script file that Database Projects will recognize and execute as part of the deployment workflow. Here we’ve used the file to focus on execution flow and deferred the implementation to the scripts we linked above.
With our database structure and a working set of sample data under source control, we’ve reached the second step on the road to a more mature approach to database lifecycle management. We now have our database structure and data under source control. This is a tremendous asset for the development team that:
Reduces the cost of new developers joining an active project by minimizing the cost for setting up a new environment and informing them, through scripted documentation, of the edge cases they should think of when building new features.
Reduces the cost of new feature development and maintenance for brown-field projects because developers can start where the last developer finished (no database restore required).
Reduces the chance that customers will be the ones to identify issues in production.
Reduces wasteful back-and-forth cycles between the development team and the testing team.
You can download the whole project here.