Most big data and artificial intelligence (AI) projects fail or at least don’t live up to the hype and expectations. In the first part of this blog series, we identified four reasons and provided some guidance on dealing with the first one (a rush to do something).
A rush to “do something” (without understanding why).
Poor data quality/governance.
Immature data engineering practices.
In the second post we discussed the need for data governance and a more lightweight approach. But how exactly do we do this? How do we make sure the data is “good enough” for the particular use case? And how do we deal with data changing over time? Those are things we’ll cover in this blog post.
avoiding the heavyweight solution
Let’s assume you have some bad data preventing you from achieving your goal. You inevitably will. As we called out in the second part of this series, the first knee-jerk reaction from technologists is to establish some very expensive, heavyweight solutions. It can often look like this:
Create a master data management (MDM) solution.
Clean all existing data based on the MDM.
Integrate MDM into all applications to ensure bad data doesn’t reenter.
Update validation rules in all transactional systems.
Force everyone to migrate to a new source system (and of course rewrite all reports and dashboards).
And so on…
MDM solutions aren’t intrinsically bad, and they may have their place in some organizations. But often the quest for “perfect” data prevents “good enough” data that can still achieve the business purpose. We suggest a light touch on the wheel of these efforts: Implement just enough to achieve the goal and iterate. It is far easier to build momentum from quick wins that drive the business forward incrementally than to burn all your trust and social capital while you try to boil the ocean.
time for a vacation
Let’s assume you have identified the business process you are trying to improve. You have crafted the correct metrics and defined the minimum set of data to calculate those. You then curated and cleansed that data and assigned an ongoing data owner to manage the pull requests. It is almost time to treat yourself to that beer and vacation. But first we must deal with the rapid pace of data changes. A recent Reachforce study estimates that in the next hour 59 business addresses will change, 11 companies will change their name, and 41 new businesses will open!
In short, all your hard work is going to be erased by the unrelenting pace of change. Customers move jobs, houses, and phone carriers. Businesses open new locations and close old ones. Your own company will reorganize and change business processes in order to eke out more efficiency. But don’t despair. This is not your father’s governance and you are not bound by your father’s technology.
At the core of all this change is one simple question: How do I know when something has changed? Once you know something has changed, the rest of the machinery you just established will whirr into gear and drive the necessary updates. The problem is that no single person in an organization knows when all these changes take place. In theory, so long as each data set has an owner and each owner works the system, it will all be OK. In a perfect world, that would be true. Unfortunately, we have never seen this perfect world where all the constituents in an organization prioritize the good of the whole over what they need to get done now. In short, you need to have someone watching the watchers.
who is watching the watchers?
For mission-critical data, most organizations already have these watchers. In the simplest and perhaps most ubiquitous case, organization-level KPIs are typically viewed at least daily by a variety of executives. Any deviation, positive or negative, often triggers an e-mail from that executive, which promptly sets off a fire drill for at least that department to dig into what just happened. Not only does this reactive fire drill instill many less-than-ideal norms, it does not scale. For any organization of meaningful size, there will be dozens if not hundreds of core KPIs that would need to be monitored, each of which could likely be sliced by half a dozen or more dimensions (geography, version, user segment, etc.). What is needed is a superhuman who tirelessly looks at every metric 24/7.
Fortunately, computers are terrific at this task. All you need to do is point a computer at those metrics and ask it to tell you when anything looks strange. Did you get more users signing up today than normal? Is the average cart size smaller? Is a location not sending you any data? Does your latest email campaign have a higher than normal unsubscribe rate? Do two tables suddenly show different values for the same user? These are exactly the types of questions that machine learning is amazing at answering.
Algorithms can not only make customers’ lives better through recommendations and personalization, but they can and should make a company’s internal data better as well.
Rather than asking what data do I need to make my models better, you can finally ask what models do I need to make my data better.
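To make the idea of pointing a computer at a metric and asking what looks strange a little more concrete, here is a minimal sketch. It is not a method prescribed in this series: it assumes a simple robust statistic (the modified z-score, built on the median and the median absolute deviation) in place of a trained model, and the function name `is_anomalous`, the 3.5 cutoff, and the sign-up numbers are all hypothetical.

```python
from statistics import median


def is_anomalous(history, value, threshold=3.5):
    """Flag `value` if it sits far from the median of `history`.

    Uses the modified z-score: 0.6745 * (value - median) / MAD, where MAD is
    the median absolute deviation. The 3.5 cutoff is a common rule of thumb,
    an assumption for this sketch rather than a value from the article.
    """
    center = median(history)
    mad = median([abs(x - center) for x in history])
    if mad == 0:
        # At least half the history equals the median, so treat any
        # departure from it as unusual.
        return value != center
    score = 0.6745 * (value - center) / mad
    return abs(score) > threshold


# Hypothetical daily sign-up counts for the previous two weeks.
signups = [120, 131, 118, 125, 140, 122, 119, 127, 133, 121, 126, 138, 124, 129]

print(is_anomalous(signups, 128))  # an ordinary day
print(is_anomalous(signups, 310))  # a spike worth a look
print(is_anomalous(signups, 0))    # a location that stopped sending data
```

The median and MAD are used here instead of the mean and standard deviation so that one earlier outlier in the history does not mask the next one. A production system would likely also need to account for seasonality and trend, which this sketch ignores.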
beware of alert fatigue
A word to the wise: Be careful as you go down this path. Precisely because machine learning is so good at this task and so much data flows through an organization, it’s easy to get carried away. Avoiding this trap is the difference between a good data science team and a great one.
We have seen organizations spend months of effort building hundreds of algorithms to monitor every aspect of their data, only to have everyone go from exuberance in the early days to completely ignoring the alerts by the end. The trouble is the world is noisy. The more you monitor, the more things you are going to find. Simply put, people don’t want to be told that you sold one fewer widget in some region than expected unless it makes a difference to the business. Watch for alert fatigue and only alert when you really need to. If you aren’t sure when to stop, try this simple litmus test:
When you stop seeing business improvements in the application of data, stop improving the data.
It goes back to a principle we use for bug fixing: If finding and fixing a bug that happens on one in a million requests doesn’t improve revenue, cut cost, or affect customer satisfaction, why spend the money to deal with it? In short, let the business case drive your level of effort.
Figure 1. Uber’s Data Quality Monitoring
Metric-level anomalies (right) are too noisy for everyday use and induce alert fatigue. If, however, we set an appropriate threshold for table-level anomalies (red line, left), we can minimize the number of alerts generated and focus on only the most destructive issues.
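As a rough illustration of alerting at the table level rather than on every metric, the sketch below rolls per-metric anomaly flags up into one score per table and only raises an alert when that score crosses a threshold. This is not Uber’s implementation; the scoring rule (share of flagged metrics), the 0.5 threshold, and the table and metric names are all assumptions made for the example.

```python
def table_anomaly_score(metric_flags, weights=None):
    """Weighted share of a table's metrics currently flagged as anomalous."""
    if weights is None:
        # Assumption: every metric matters equally unless told otherwise.
        weights = {name: 1.0 for name in metric_flags}
    total = sum(weights[name] for name in metric_flags)
    flagged = sum(weights[name] for name, bad in metric_flags.items() if bad)
    return flagged / total if total else 0.0


def tables_to_alert(flags_by_table, threshold=0.5):
    """Return (table, score) pairs at or above the threshold, worst first."""
    scored = [(t, table_anomaly_score(f)) for t, f in flags_by_table.items()]
    alerts = [(t, s) for t, s in scored if s >= threshold]
    return sorted(alerts, key=lambda pair: pair[1], reverse=True)


# Hypothetical metric-level results for three tables.
flags_by_table = {
    "orders": {"row_count": False, "null_rate": True, "avg_cart": False, "freshness": False},
    "signups": {"row_count": True, "null_rate": True, "freshness": True, "dupes": False},
    "email_events": {"row_count": False, "unsubscribe_rate": False},
}

for table, score in tables_to_alert(flags_by_table):
    print(f"ALERT {table}: {score:.0%} of metrics look anomalous")
```

With these made-up inputs, the single flagged metric on `orders` stays quiet while `signups`, where three of four metrics look wrong, raises one alert. Where to set the threshold is a business decision: per the litmus test above, tune it until the alerts that fire are the ones worth acting on.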
Figure 2 shows the differences between this approach and the traditional “Big G” approach:
Figure 2. Credera’s Data Governance Approach
governance without bureaucracy
Governance is necessary to ensure consistency in a dynamic environment (systems, data changes, etc.). But it’s also tricky because it can lead to bureaucracy. For some larger organizations, the biggest initial hurdle can be getting buy-in and scoping the first problems tightly enough. It’s tempting to say yes to everyone in order to get your initiative funded. This is where bringing in a third party to help manage scope and set appropriate expectations, so that your first project overdelivers, can be helpful. Additionally, there are some clever approaches to having the metric- and row-level machine learning models feed into a table-level or overall business data quality score that we’d be happy to share with you.
Just remember, a good policy is open and easy to use, starts small, and incrementally makes improvements that drive business value.
In the next blog post we’ll tackle the third challenge to effective data: immature data engineering practices.
We at Credera would love to help organizations understand how to leverage their data. Please reach out to us at email@example.com to start a conversation.