In 2019, 82% of enterprise system downtime cost companies over $200,000 per hour. The cost of avoidable mistakes isn’t confined to lost productivity or transactions; such mistakes can also do long-term damage to a company’s reputation in the marketplace. This highlights the importance of examining the impact of any configuration change to production databases, virtual machines, networks, DNS, storage, or any other system that serves a critical business function before the change is made.
In part one of this blog series, we talked about major public cloud outages and how critical geo-redundancy is to business continuity. In this post we are going to talk about lessons learned from cloud catastrophes caused by failures to properly manage important production changes, and how to avoid repeating these common mistakes. Let’s take a look at some cautionary tales in which the impact of a change was not examined first.
Cautionary Cloud Catastrophe Tales
All organizations that rely on technology struggle with change management at one time or another, even large well-known companies like Salesforce, Microsoft, and AWS. Each of them has made mistakes we can all learn from.
Manual Production Changes Are Dangerous
Last year, thousands of Salesforce Marketing Cloud users were locked out of the service for 15 hours after a faulty database script was run against its Pardot marketing automation service. The script allowed users to see and edit all of their company’s Pardot data regardless of permission settings. Salesforce quickly responded by cutting off access to all current and past Pardot customers as it worked to undo the database permission changes.
The irony is that Salesforce’s marketing automation system was felled by a manual change made to a database maintenance script. This is a great example of what happens when safe change and configuration management practices are bypassed or not followed, and when the impact of a change is not tested before it goes into production.
Process Automation Is Not a Silver Bullet
Last September, people across the globe were unable to log in to Office 365 and other related services, including Microsoft Teams, Office.com, Power Platform, and Dynamics 365, for five hours because of a bug in Microsoft’s Safe Deployment Process (SDP) for Azure AD. SDP is designed to test Azure AD service updates by initially targeting a validation ring that contains no customer data, followed by an inner ring that contains Microsoft-only users, and lastly the production environment. These changes are normally deployed in phases across five rings over several days. However, in this case an SDP bug impacted its ability to read deployment metadata, so a defective update was deployed to all five rings simultaneously. The issue was compounded by the fact that Microsoft’s SDP rollback method was also compromised, so the company was forced to undertake a longer manual rollback process.
The moral of this story is that process automation is not a panacea. Even a well-designed automated change management process can fail spectacularly. In this example SDP’s error handling caused it to “fail open” instead of failing closed and shutting down.
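To make the "fail open" versus "fail closed" distinction concrete, here is a minimal Python sketch of a ring-by-ring rollout that refuses to deploy anywhere when it cannot read its deployment metadata. It is an illustration only, not Microsoft's SDP: RingPlan, staged_rollout, DeploymentHalted, and the load_plan, deploy, and healthy callbacks are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


class DeploymentHalted(Exception):
    """Raised when the rollout cannot prove it is safe to continue."""


@dataclass(frozen=True)
class RingPlan:
    # Hypothetical metadata: the ordered rings an update may be deployed to.
    rings: Sequence[str]


def staged_rollout(
    load_plan: Callable[[], Optional[RingPlan]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy one ring at a time and stop at the first sign of trouble."""
    try:
        plan = load_plan()
    except Exception as exc:
        # Fail closed: unreadable metadata means deploy nowhere,
        # never "deploy everywhere".
        raise DeploymentHalted("could not read deployment metadata") from exc

    if plan is None or not plan.rings:
        raise DeploymentHalted("deployment metadata is missing or empty")

    completed: list[str] = []
    for ring in plan.rings:
        deploy(ring)
        if not healthy(ring):
            # Later rings are never touched once a ring looks unhealthy.
            raise DeploymentHalted(f"health check failed in ring {ring!r}")
        completed.append(ring)
    return completed
```

The design choice worth noticing is that every ambiguous state (an exception, a missing plan, an empty ring list) leads to the same outcome: nothing further is deployed and a human has to look at it.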
Testing Changes Is Critical for Reliability
On the day before Thanksgiving, AWS added capacity to its front-end fleet of Kinesis servers in the flagship US-East-1 region, not suspecting that scaling out would trigger previously unknown thread count limitations in its architecture. This resulted in 17 hours of outages across both AWS services and a number of external companies such as Adobe Spark, 1Password, iRobot, and The Washington Post. One service that AWS lost during this time was Cognito, which is used to update the AWS Service Health Dashboard. The downtime was prolonged by the fact that the fleet of Kinesis front-end servers required a staggered reboot to avoid flooding the network with traffic. AWS stated it would work around the OS thread count limit by scaling servers up with more CPU and memory rather than scaling out with more servers.
Is it fair to say AWS should have discovered this dependency in load or elasticity testing? Probably not. It would be nearly impossible to simulate operations at that scale, but the incident does underscore the importance of those testing activities. It is also another example of why geo-redundancy in the cloud is necessary for critical services.
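The arithmetic behind a limit like this can be sketched in a few lines of Python. The sketch assumes, as a simplification, that every front-end server keeps one OS thread per peer in the fleet, so the per-server thread count grows with fleet size. The function names and all numbers are hypothetical and are not AWS's actual figures.

```python
def threads_per_server(fleet_size: int, base_threads: int) -> int:
    """Threads one server needs: a fixed baseline plus one per peer (assumed model)."""
    return base_threads + (fleet_size - 1)


def max_fleet_size(thread_limit: int, base_threads: int) -> int:
    """Largest fleet in which no server exceeds its OS thread limit."""
    return thread_limit - base_threads + 1


# Hypothetical numbers, for illustration only.
LIMIT = 10_000
BASE = 200

print(threads_per_server(5_000, BASE))   # 5199, comfortably under the limit
print(max_fleet_size(LIMIT, BASE))       # 9801 servers is the ceiling
print(threads_per_server(9_802, BASE))   # 10001, one server too many
```

Under this model, adding servers makes every existing server more expensive to run, which is why scaling up (fewer, larger servers) sidesteps the limit while scaling out eventually hits it.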
Lessons Learned from Cloud Catastrophes
Business-critical service outages are costly in terms of lost productivity, business transactions, and reputation. Organizations take great pains to plan geo-redundancy for disasters and high availability for system failures, but change management processes deserve the same level of due diligence because, if mismanaged, they are just as likely to cause downtime.
From the examples above we can ascertain three key points:
Getting control of production access and change management is the first priority.
Automating changes to reduce human error is also an effective practice, but it’s not a cure-all.
It is important to understand application and system dependencies and to apply sound DevOps principles to minimize those dependencies, so that a change to one system or application does not disrupt others.
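As a small illustration of the third point, the Python sketch below walks a dependency map to list everything that could be affected by a change to one system. The blast_radius function, the shape of the map, and the system names are all hypothetical; a real inventory would come from a CMDB, service catalog, or tracing data.

```python
from collections import deque


def blast_radius(dependents: dict[str, list[str]], changed: str) -> set[str]:
    """Return every system that directly or transitively depends on `changed`.

    `dependents` maps a system to the systems that rely on it.
    """
    affected: set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for system in dependents.get(current, []):
            if system != changed and system not in affected:
                affected.add(system)
                queue.append(system)
    return affected


# Hypothetical dependency map, for illustration only.
dependents = {
    "stream-ingest": ["auth"],
    "auth": ["status-dashboard"],
    "billing": [],
}

print(sorted(blast_radius(dependents, "stream-ingest")))
# ['auth', 'status-dashboard']
```

Running a check like this before a change is approved makes hidden coupling visible, such as a status dashboard that quietly depends on the very service being modified.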
After the fact, the most important thing a technology organization can do is learn from its failures. This comes from diligent root cause analysis and blameless post-mortems. Organizations like Google and Etsy have found performance improvements by fostering a culture of learning: they assume that every team member acted with the best intentions based on the information available at the time, and they create an atmosphere of psychological safety.
Do you need help with cloud reliability, scalability, or operations excellence? Credera has experience helping organizations in public, private, and hybrid cloud scenarios. Reach out to us at firstname.lastname@example.org.