AI•Aug 10, 2020
Mind the Gap – It’s Not AI/ML Unless It’s in Production: Data Strategy Series Part 4
While artificial intelligence (AI) and machine learning (ML) can offer great leaps forward, they can also create an astounding amount of technical debt. In fact, Google researchers call machine learning “the high-interest credit card of technical debt.” In this blog post, we’ll focus on these challenges and a solution, drawing an analogy to the traditional software development lifecycle (SDLC).
This is the fourth installment in our series on improving your use of data. Just to recap, in part one we discussed how many data efforts didn’t achieve value because they didn’t have a specific goal or business purpose they were trying to achieve. We suggested defining a small objective, working through that, then iterating. This approach achieves value sooner and validates lessons that can be used to course correct.
In parts two and three, we covered the need for accurate data to achieve the purpose. This includes putting the governance in place around definitions of data, understanding the flow through systems, and managing the quality of data to make sure it is “good enough” to achieve the intended goals. We also cautioned against large efforts that try to achieve “perfect” data across the enterprise. Instead, we suggested using machine learning to monitor your data and drive actionable alerts when data quality degrades.
There is no doubt that the biggest opportunity to drive business outcomes with data is through the effective use of AI/ML (fed by good data as discussed in blog posts two and three in this series). For example:
GE has saved its industrial customers over $1.6 billion through the use of ML-powered predictive maintenance.
Highmark Inc. saved over $260 million in 2019 by leveraging ML to detect fraud, waste, and abuse in their health care insurance group.
Amazon’s ML-powered recommendation engine accounts for more than 35% of its total revenue.
But even when the right problem is tackled and the data quality issues are addressed, AI/ML solutions still struggle to live up to the hype. We believe this is because of a lack of maturity in the end-to-end processes and tools for running AI/ML solutions at production scale.
Following the original release of this article, Google published an AI/ML article making the same point: "Creating an ML model is the easy part—operationalizing and managing the lifecycle of ML models, data, and experiments is where it gets complicated."
To illustrate this, we’ll make an analogy using general software development as a basis for comparison.
Software Development 101
When there is a need for a software application, developers do not just sit down and start developing code that runs on their laptops and, voila, the problem is solved. It’s much more involved.
As software development has matured, we have created best practices to make sure we are building the right things, building them correctly, testing them, and monitoring and supporting them in production. Virtually all software development organizations have some variation of a process that includes all of these elements.
These processes, whether they be lean, agile, DevOps, ITIL, etc., have been maturing for decades. The tools and methods we use, whether they be design sprints, integrated development environments, test frameworks, containers and schedulers, or continuous integration (CI)/continuous delivery (CD) pipelines, have also been around for years and have continued to evolve.
More importantly, these tools and processes have been developed to model the actual production environment in which they will eventually run. To steal a concept from electrical engineering, there is very little “impedance mismatch” between what is developed and what is run. A container running software on a developer laptop is the same (or very, very similar based on configuration parameters we control) in both the testing and the production environments. This, we will see below, is a significant issue for AI/ML-based solutions in the real world.
Aside: To be sure, some of these “best practices” like governance and change control can run amok and become a hindrance to development. But as the Latin proverb states, “abusus non tollit usum” or “abuse is not an argument against proper use.” We are assuming “proper” use here.
However, when using data to generate AI/ML solutions, there are very few analogs to these best practices. Let’s take, for example, these elements of the typical lifecycle:
As we laid out in part one, there is usually not enough thought given to what we are trying to accomplish; specific use cases should drive the development of the model. Without that grounding, AI/ML efforts are often expensive and even detrimental: modeling assumptions are ignored, spurious correlations are treated as insight, and the business is led down the wrong path.
AI/ML work is often done by data scientists on laptops using analysis tools like R, Jupyter, Zeppelin, or even Excel, and often some combination of all of them. Models are predominantly built in Python with libraries like scikit-learn or TensorFlow. While these are excellent tools, there can be a large “impedance mismatch” between them and the actual production systems, which leads to great “laptop” models being thrown away because they don’t work in production.
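One way to shrink that mismatch is to hand production the exact trained artifact, preprocessing included, rather than a description of it to re-implement. The following is a minimal sketch under the assumption that the model is a scikit-learn pipeline persisted with joblib; the file names are illustrative:

```python
# A minimal sketch of shrinking the "impedance mismatch": persist the
# entire preprocessing + model pipeline as one artifact, so production
# scores data exactly as the notebook did. Assumes scikit-learn and
# joblib; the file name is illustrative.
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Bundle preprocessing with the model so the two cannot drift apart.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
]).fit(X, y)

artifact = os.path.join(tempfile.mkdtemp(), "model-v1.joblib")
joblib.dump(pipeline, artifact)  # the artifact handed to production

# In the serving environment: load the same artifact, no re-implementation.
served = joblib.load(artifact)
assert (served.predict(X) == pipeline.predict(X)).all()
```

Shipping one artifact does not solve every gap (Python versions and library versions must still match), but it removes the most common failure mode: a hand-translated model that silently disagrees with the notebook.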
How do we actually test AI/ML systems? It’s easy to test the mechanisms we use: for example, this model was trained on some data and returned some result. But that doesn’t tell us whether we are achieving the desired business outcomes. After all, a model will always give you an answer; if it is not designed properly, it just won’t be the right answer. And even when the model gives a correct answer, it may not be the right answer for the user.
For example, imagine the autopilot model that drives Teslas. Developers can test the mechanisms that run the model and make the recommendation, but does it actually stay in its lane and stop at stoplights? That must be measured and monitored after deployment. Only then do we know if it works or needs tweaks.
Finally, there is an element in testing AI/ML solutions that doesn’t have a straightforward analog in traditional development: time. Take our Tesla autopilot example: one week it may be working fine; the next, it may cause a collision. Why? The stoplight could have been damaged by a windstorm, it could be covered with snow, perhaps the sun caused severe glare on the camera lens at exactly the wrong moment, a shopping bag caught in the wind obstructed the view, or countless other reasons (literally countless; that’s the point of having an AI/ML model). Notice, too, that this example involves physical infrastructure, which changes relatively slowly; models that predict human behavior face even more fickle conditions. So we must constantly monitor the solutions to know when the models must be updated.
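One common way to catch "the world changed" before outcomes degrade is to compare a live window of a model input against its training distribution. The sketch below uses a two-sample Kolmogorov-Smirnov test from SciPy; the Gaussian features, window sizes, and alert threshold are all illustrative assumptions:

```python
# A minimal sketch of time-based monitoring: compare a live window of a
# model input against its training distribution with a two-sample
# Kolmogorov-Smirnov test. Assumes scipy; the Gaussian features, window
# size, and alpha threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)

def drifted(live_window, reference, alpha=0.01):
    """Flag drift when the KS test rejects 'same distribution'."""
    _stat, p_value = ks_2samp(reference, live_window)
    return p_value < alpha

# A healthy week: live data looks like training data (usually no alert).
healthy_week = drifted(rng.normal(0.0, 1.0, 500), training_feature)

# The "snow on the stoplight" week: the input distribution has shifted.
assert drifted(rng.normal(1.5, 1.0, 500), training_feature)
```

A drift alert does not by itself say the model is wrong, only that it is now operating on data unlike what it was trained on, which is exactly the cue to investigate and possibly retrain.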
How do we deploy something like an AI model at scale? Where does it need to deploy to (on the edge, cell phone, car, etc., or in the cloud, web application, etc.)? Are these devices guaranteed to have connectivity or do they need to work offline? How do we distribute updates? How will you detect regression? In the software development ecosystem, there have been many tools developed to help deploy at scale, everything from script-based solutions, configuration management tools (Chef, Ansible, etc.) to containers and schedulers.
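Whatever the deployment target, the model artifact itself needs versioning and verification so fleets of devices can report what they are running and bad builds can be rolled back. Here is a small sketch of that idea using only the standard library; the file layout and manifest fields are hypothetical, not any particular tool's format:

```python
# A minimal sketch of model artifact versioning: store each build with a
# version and a SHA-256 checksum so deployed copies can be identified,
# verified, and rolled back. The file layout and manifest fields are
# illustrative assumptions, not a real tool's format.
import hashlib
import json
import os
import tempfile

def package_model(artifact_bytes, version, out_dir):
    """Write the artifact plus a manifest recording its version and hash."""
    digest = hashlib.sha256(artifact_bytes).hexdigest()
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, f"model-{version}.bin"), "wb") as f:
        f.write(artifact_bytes)
    manifest = {"version": version, "sha256": digest}
    with open(os.path.join(out_dir, f"model-{version}.json"), "w") as f:
        json.dump(manifest, f)
    return manifest

def verify_model(out_dir, version):
    """A device (or deploy script) checks its copy against the manifest."""
    with open(os.path.join(out_dir, f"model-{version}.json")) as f:
        manifest = json.load(f)
    with open(os.path.join(out_dir, f"model-{version}.bin"), "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    return actual == manifest["sha256"]

repo = tempfile.mkdtemp()
package_model(b"fake-model-weights", "1.0.0", repo)
assert verify_model(repo, "1.0.0")  # confirm the deployed copy is intact
```

In practice a registry or configuration management tool provides this, but the requirement is the same: every deployed model must be identifiable, verifiable, and replaceable.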
As the Tesla example shows, companies must constantly monitor the use of data in their systems and its effect on business outcomes. Many companies have only basic monitoring that covers availability and performance metrics like “is the site up?” and “how fast are the pages loading?” More business-specific metrics like “today’s online sales” are usually relegated to reporting tools that update nightly or weekly. To make the most effective use of AI/ML, we must monitor for situations like the one above so we can start reacting sooner.
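The difference between a nightly report and a production monitor is the feedback loop: watch the business metric in a rolling window and alert as soon as it drops well below baseline. A minimal sketch, with an invented conversion-rate metric and illustrative thresholds:

```python
# A minimal sketch of near-real-time business-metric monitoring: track a
# conversion rate in a rolling window and alert when it falls well below
# an expected baseline. The metric, window, and thresholds are invented
# for illustration.
from collections import deque

class MetricMonitor:
    def __init__(self, window=100, baseline=0.05, drop_ratio=0.5):
        self.events = deque(maxlen=window)  # 1 = converted, 0 = did not
        self.baseline = baseline            # expected conversion rate
        self.drop_ratio = drop_ratio        # alert below 50% of baseline

    def record(self, converted):
        self.events.append(1 if converted else 0)

    def should_alert(self):
        if len(self.events) < self.events.maxlen:
            return False  # not enough data to judge yet
        rate = sum(self.events) / len(self.events)
        return rate < self.baseline * self.drop_ratio

monitor = MetricMonitor(window=100, baseline=0.05)
for _ in range(100):
    monitor.record(False)      # a bad stretch: zero conversions
assert monitor.should_alert()  # react now, not in tomorrow's report
```

The point is not the arithmetic, which is trivial, but where it runs: inside the production feedback loop rather than a batch reporting job.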
A Holistic Approach
At Credera, when we develop solutions that incorporate AI/ML, we try to take all these production factors into consideration. Rather than focusing simply on the details of the model (which isolated data scientists are prone to do), we look across the system to understand questions such as these:
Where and how will this be deployed? For example, is it deployed in a data center or locally on a mobile application or embedded system?
How do these components interact with the rest of the system? Do we have the data and the quality to be successful (see part two)?
How is this data fed to the model? Can we ensure changes to models are allowed without changes to other parts of the system?
How do we get feedback from production performance of the system to the data scientists?
How will latency and connectivity affect performance?
How will performance of the models affect the overall system?
Can we handle failures gracefully?
How does this system scale? Can the impacts of scale be reliably and predictably assessed?
How will we package, deploy, and update the models given the architecture?
If they are distributed, how do we track and monitor different versions? Can we force upgrades?
What automation (CI/CD) can we put in place to make changes to the models? How does this interact with updates of the overall system?
How will testing systems know what the correct answer should be? How can we validate that in testing?
How do changes to the model affect the application and vice versa?
Can we put tests in place to ensure reasonable results via the build system?
Who is responsible for debugging issues?
Do we have the appropriate traceability? That is, can we see what changes are made, by whom they are made, and what impacts they are having?
Can we roll back changes that are causing issues?
Can we decipher if the degradation is caused by the input data, the model, the infrastructure supporting that, or the application itself?
Monitor and Improve:
What are the key business outcomes driven by the model?
How will we measure outcomes, not mechanisms (the models execute and provide any answer vs. the models give a good outcome that drives business value)?
How do we incorporate techniques such as A/B testing or multi-variate testing to validate changes?
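The last question above, validating a model change with an A/B test, can be sketched as a simple significance check: split traffic between the current and candidate models, then test whether the observed lift is bigger than noise. Below is a two-proportion z-test with made-up conversion counts:

```python
# A minimal sketch of validating a model change via A/B testing: a
# two-proportion z-test on conversion counts from the control model (A)
# and the candidate model (B). The sample numbers are made up.
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z-score for the difference between two conversion rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Control model A: 500/10,000 conversions; candidate B: 600/10,000.
z = two_proportion_z(500, 10_000, 600, 10_000)
significant = abs(z) > 1.96  # roughly 95% confidence, two-sided
assert significant           # the candidate's lift is unlikely to be noise
```

Multi-variate testing generalizes the same idea to several candidates at once; either way, the decision to promote a model rests on measured business outcomes, not on offline accuracy alone.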
These production factors cannot be overlooked in the rush to get something working. The failures we have seen are usually driven by missing these production needs, and there are not many best practices available to fall back on. Furthermore, most of these answers lie outside the actual AI/ML components (e.g., monitoring). You need good collaboration between architects, developers, support personnel, and data scientists to ensure good solutions.
As with any emerging technology, the space is improving constantly. For example, tools like TensorFlow Extended (TFX) and Amazon SageMaker are helping to close the gaps between data, models, and their use in production. But until these tools mature, we must add the necessary components and processes to bridge the gaps ourselves.
In our fifth and final installment in this series, we will look at the impact of these solutions on the people in your organization and what needs to be done to address those changes.