Technology • May 15, 2020
A Tale of Enterprise DevOps Evolution: Taking a Step Back to Move Forward
DevOps, it seems, has finally taken root in enterprises. Studies show that companies are adopting the principles and practices of DevOps and are seeing significant improvement in technology delivery and company performance. At Credera, we are big proponents of DevOps, and we love to see this uptick in adoption. But sometimes we see clients stumble because they try to adhere dogmatically to “best practices.” These practices are not absolute canon. Sometimes companies have specific situations or constraints that require a deviation from best practices—while still keeping DevOps principles in mind. We ran into one example recently and thought it would be helpful to share what we learned.
devops best practices gone wrong
DevOps best practices state that deploying small batches of changes more frequently reduces the risk of introducing changes into our production systems. This makes sense, since a small batch of changes can be validated easily and identifying and correcting a problem with a small batch of changes is far simpler than identifying the offending change in a large collection of changes.
However, sometimes things can be too small and too frequent for the rest of the organization to handle effectively. We recently served a client who was quickly deploying small batches of changes into their integrated test environment, but they were seeing an increase in costly defects in production. Upon investigation, the defects were traced back to code that hadn’t been tested in the integrated testing environment. The code wasn’t tested because the integrated testing environment was deemed “too unstable” for testing. The instability of the test environment was caused largely by a constant stream of deployments from each of the development teams, where each deployment caused a temporary service outage while the deployed service was restarted. With dependent services constantly being restarted and new code being deployed, any failures discovered during testing were dismissed as “environmental instability.”
To resolve these issues, we proposed a “timed gate” that would limit deployments into the integrated test environment at specified intervals once or twice a day. This would slow the churn and allow all changes to be validated before deployment to production.
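In practice, a timed gate like this can be a small check at the front of the deployment pipeline. The sketch below is illustrative only; the window times and function names are our assumptions, not the client's actual implementation. It allows deployments to the integrated test environment only during fixed daily windows:

```python
from datetime import datetime, time

# Illustrative deployment windows (an assumption, not the client's actual
# schedule): two one-hour windows per day, expressed in UTC.
DEPLOY_WINDOWS = [
    (time(6, 0), time(7, 0)),    # morning window
    (time(18, 0), time(19, 0)),  # evening window
]

def deployment_allowed(now: datetime) -> bool:
    """Return True only when `now` falls inside a configured window.

    A CI pipeline step can call this and fail fast outside the window,
    queuing the change for the next scheduled deployment instead.
    """
    current = now.time()
    return any(start <= current < end for start, end in DEPLOY_WINDOWS)
```

Outside the windows, changes simply batch up; at the next window, everything queued deploys together, and the environment stays stable in between so testers can finish validating.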
Our client was upset by this recommendation. Wouldn’t this be a step in the wrong direction? Aren’t agile and DevOps about continuous delivery, on-demand deployments, and “individuals and interactions over processes and tools”? How could we recommend something that seemed to be moving backward?
We empathized with their concerns and even shared their desire to achieve more frequent deployments. But even though frequent, small-batch deployment is a foundational pillar of good DevOps (and remained our goal), in this case it wasn’t their best next step. Instead, they needed to:
Understand the constraints of the current system.
Focus on the vision and take small steps toward success.
Recognize the needs of the enterprise instead of just one team.
understand your constraints
Our client felt that with the proper controls and automated tests in place, their teams could deploy their changes to the integrated testing environment without bothering other teams. However, they had a constraint that made this impractical. Their core business processes run on a mainframe. That mainframe is a large piece of shared infrastructure that wasn’t set up to support many teams running tests simultaneously on different parts of the system—everyone must test in the same environment, and changes to one part of the system impact users of another part. Therefore, by deploying code, teams would disrupt or invalidate testing that other teams were running.
Additionally, their mainframe hosts a massive shared database, which means many of their systems are tightly coupled together. It is very challenging or impossible to test a change made to a tightly coupled system without testing the whole system, because it is unclear what processes might be impacted by the change. In order to test changes to their system, our client was forced to run entire end-to-end scenarios, starting with offers made to customers and finishing with invoicing the customers and processing payments. As unfortunate as these massive tests were, they were necessary to protect the main revenue streams of the business from breaking changes. And again, any deployment of code during one of these tests would disrupt or invalidate the test, forcing it to start over.
It is worth noting that our client was in the middle of decoupling and rearchitecting their systems to be independently deployable and testable. But that is not a short process. In the meantime, the business must continue to run and generate revenue, which means it needs to keep building, testing, and deploying new features within its current architecture.
It’s kind of like competitive swimming. If you work out regularly, you will increase your lung capacity and be able to hold your breath longer. But until then, you need to keep coming up for air.
focus on the vision and take small steps toward success
Our client was discouraged by the recommendation to put in a timed gate because they felt teams should be deploying more frequently with fewer centralized controls, and a timed gate was the opposite of that. Yet deployments to production were still happening only once a month, even though developers were deploying their changes to the integrated test environment whenever they wanted.
With changes constantly flowing into the integrated test environment, the teams had to declare a “code freeze” for the week before each planned production deployment. Only during this week-long freeze could tests finally be completed and the changes verified. It took a full week because an entire month’s worth of untested changes had built up.
In contrast, a timed gate actually moved them toward deploying to production more frequently, even though deployments to the integrated test environment would be less frequent. By stabilizing the integrated test environment, the changes would have enough time to be fully validated. Once the changes were validated, they could be deployed to production immediately. This created the opportunity to go to production once a day, if there were no other organizational constraints.
recognize the needs of the enterprise
While our client’s team may have had sufficient controls and automated tests in place to ensure high-quality deployments, this was not uniformly the case across the enterprise:
Many other teams did not have sufficient automated tests.
The test results were not used to control the deployments.
The lack of sufficient automated tests was the root cause of deployments that took so long yet continued to fail. Without automated tests, changes cannot be validated quickly and therefore can’t be deployed to production. To achieve on-demand deployments, the development teams needed to create more automated tests, and additional deployment automation was required to use the test results as a “quality gate” (instead of a timed gate) for deploying into the integrated test environment. Stabilizing the integrated test environment helped here as well, because it gave the development teams a stable target environment in which to develop and validate their automated tests.
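The difference between the two gates can be made concrete with a small sketch. The commands and function below are hypothetical, not the client's actual tooling; the point is that with a quality gate, the test suite's exit status, rather than the clock, decides whether a change moves forward:

```python
import subprocess

def quality_gate_deploy(test_command, deploy_command):
    """Run the automated test suite; deploy only if every test passes.

    `test_command` and `deploy_command` are argument lists for
    subprocess.run (e.g. a test-runner invocation and a deploy script).
    Returns True if the deployment ran, False if the gate blocked it.
    """
    tests = subprocess.run(test_command)
    if tests.returncode != 0:
        print("Tests failed; deployment blocked by quality gate.")
        return False
    # Tests passed: the change is safe to promote immediately.
    subprocess.run(deploy_command, check=True)
    return True
```

Once enough automated tests exist to make this gate trustworthy, the timed gate can be retired and teams can return to deploying on demand.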
In summary, we believe strongly in the principles of DevOps, but existing organizational and technical constraints sometimes require a step back before you can take the “best next step” forward.
what’s your best next step?
Do you need help creating a DevOps roadmap that focuses on your company’s long-term objectives while taking small steps toward success? Reach out to our DevOps experts at firstname.lastname@example.org.