AWS outage post-mortem fingers DNS as the culprit that took out a chunk of the internet and services for days — automation systems race and crash
The recent Amazon Web Services outage that took out a significant portion of the internet, games, and even smart home devices for days was extensively covered in the news. Cloud services' distributed architecture is supposed to protect customers from failures like this one, so what went wrong? Amazon published a detailed technical post-mortem of the failure, and as the famous haiku goes: "It's not DNS. / There's no way it's DNS. / It was DNS."
As a rough analogy, consider what happens when there's a car crash: a traffic jam stretches for miles, in an accordion-like effect that lingers well after the accident scene has been cleared. The initial problem was fixed relatively quickly, with a roughly three-hour outage from 11:48 PM on October 19 until 2:40 AM on October 20. However, as in the traffic jam analogy, dependent services kept breaking and didn't fully come back online until much later.
The root cause was reportedly that a broken DNS configuration for DynamoDB (Amazon's database service) was published to Route 53 (its DNS service). In turn, parts of EC2 (the virtual machine service) also went down, as its automated management systems rely on DynamoDB. Amazon's Network Load Balancer naturally depends on DNS as well, so it too ran into trouble.
It's worth noting that DynamoDB failing across the entire US-East-1 region is, by itself, enough to bring down what are probably millions of websites and services. However, not being able to bring up EC2 instances was extra bad, and load balancing being affected was diamond-badge bad.
The specific technical issue behind the DNS failure was every programmer's "favorite" bug: a race condition, in which the outcome depends on the unpredictable timing of two processes working on the same data, so they keep re-doing or undoing each other's work. The famous GIF of Bugs Bunny and Daffy Duck with the poster is illustrative.
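To see why the bug is so "beloved," here's a minimal Python sketch (our own illustration, nothing to do with Amazon's actual code) of two workers updating a shared counter without any coordination. Depending on how the threads interleave, some of the updates can simply vanish:

```python
import threading

counter = 0

def worker(n: int) -> None:
    global counter
    for _ in range(n):
        # Read-modify-write with no lock: both threads can read the same value,
        # each add one, and one of the two updates is silently lost.
        counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# We expect 200000, but the unsynchronized updates may come up short.
print(counter)
```

The fix here is boring and well known (take a lock, or make the update atomic); the hard part is spotting every place in a sprawling distributed system where two automated actors can interleave like this.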
DynamoDB's DNS management uses two components. A DNS Planner, as the name implies, periodically issues a new Plan that takes system load and availability into account. DNS Enactors, whenever they see a new Plan, apply it to Route 53 as a transaction, meaning a plan either applies fully or not at all. So far, so good.
What happened was that one DNS Enactor took its sweet time applying what we'll call the Old Plan. While it dawdled, newer Plans came in, and another Enactor picked one up and applied it. Route 53 now held good, up-to-date data, so a clean-up of outdated plans (the Old Plan included) was kicked off, just as the first Enactor was finally getting around to applying the Old Plan.
Because the clean-up removed the Old Plan at the same moment it was finally being applied, the plan's contents came up empty, and that emptiness is what got written, effectively wiping all of DynamoDB's DNS entries. This gigantic "oops" then had to be fixed manually, given the now-inconsistent state of the DNS records.
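For the curious, here's a toy Python model of that sequence of events. It's our own back-of-the-napkin reconstruction based on Amazon's description, with made-up names, addresses, and timings, not the actual Planner/Enactor code:

```python
import threading
import time

# Stand-ins for the real systems: a store of generated DNS plans, and a
# dictionary playing the role of the Route 53 record for the DynamoDB endpoint.
plan_store = {
    "plan-41": ["10.0.1.10", "10.0.1.11"],   # the Old Plan
    "plan-42": ["10.0.2.10", "10.0.2.11"],   # the New Plan
}
dns_record = {"dynamodb.endpoint.example": list(plan_store["plan-41"])}

def slow_enactor():
    # Still working on the Old Plan; by the time it reads plan-41's contents
    # to apply them, the clean-up below has already emptied them. There is no
    # check that plan-41 is still the newest plan before writing.
    time.sleep(0.2)  # simulated "taking its sweet time"
    dns_record["dynamodb.endpoint.example"] = plan_store.get("plan-41", [])

def fast_enactor_with_cleanup():
    # Applies the New Plan, then garbage-collects plans it considers obsolete.
    dns_record["dynamodb.endpoint.example"] = plan_store["plan-42"]
    time.sleep(0.1)
    plan_store["plan-41"] = []  # clean-up guts the Old Plan while it's still in flight

slow = threading.Thread(target=slow_enactor)
fast = threading.Thread(target=fast_enactor_with_cleanup)
slow.start(); fast.start(); slow.join(); fast.join()

# The endpoint ends up with no addresses at all: the stale Enactor applied a
# plan the clean-up had already emptied.
print(dns_record)
```

A version check before writing (refusing to apply a plan older than the one already live) would have stopped the empty update, and that's presumably the sort of protection Amazon is now adding.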
Right on cue, the EC2 service started breaking, as its automated management systems depend on DynamoDB. New EC2 instances couldn't be created, which piled up a huge backlog of launch requests to work through. The stampede forced Amazon's engineers to throttle EC2 instance creation for quite a while, meaning many services and websites stayed down until their instances could finally come back online.
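Amazon hasn't spelled out how that throttling worked, but the generic technique is a rate limiter sitting in front of the backlog, something like this token-bucket sketch (entirely our own illustration, with made-up numbers):

```python
import time

class TokenBucket:
    """Lets through a limited number of requests per second, with a small burst."""
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Work through a pile of hypothetical launch requests at a controlled pace,
# instead of letting the stampede flatten the still-recovering back end.
bucket = TokenBucket(rate_per_sec=5, burst=5)
backlog = [f"launch-request-{i}" for i in range(20)]
while backlog:
    if bucket.allow():
        print("processing", backlog.pop(0))
    else:
        time.sleep(0.05)
```

The downside, as anyone who spent that day waiting for their instances can attest, is that a throttled backlog takes a long time to drain.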
As the cherry on top (technically, below), Amazon's Network Load Balancer (NLB) also took a hit: delays in fixing the DNS entries and propagating the changes meant new EC2 instances came up with inconsistent network state behind the load balancer. Because of that, many NLB health checks failed even though the underlying infrastructure was actually running fine. Those failing health checks caused NLB nodes and backend targets to (guess what!) be removed from DNS, only to return later.
In the aftermath, Amazon's engineers have disabled DynamoDB's DNS Planner and DNS Enactor automation until fixes for the race condition are in place and more protections are added. Likewise, EC2 is gaining a new test suite for unusual operating conditions, and the Network Load Balancer will get control mechanisms that limit how much capacity can be removed when health checks fail.
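The post-mortem doesn't detail what those NLB control mechanisms will look like, but the usual shape of such a guardrail is a cap on how much of the fleet automation may pull out of service at once. A rough sketch of the idea (names and numbers are ours):

```python
# Guardrail sketch: never remove more than a fixed fraction of targets in one
# health-check sweep. If "everything" looks unhealthy at once, the more likely
# culprit is the health check itself (or its dependencies), not the fleet.
MAX_REMOVAL_FRACTION = 0.2

def targets_to_remove(results: dict[str, bool]) -> list[str]:
    """results maps a target ID to its latest health-check outcome."""
    unhealthy = [tid for tid, healthy in results.items() if not healthy]
    limit = int(len(results) * MAX_REMOVAL_FRACTION)
    return unhealthy[:limit]  # leave the rest serving traffic despite the failures

fleet = {f"target-{i}": (i % 2 == 0) for i in range(10)}  # half report unhealthy
print(targets_to_remove(fleet))  # pulls at most 2 targets, not all 5
```

The trade-off is a little routing precision in exchange for not amplifying a monitoring hiccup into a full outage.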
Automated control systems are a necessity in cloud computing, or in any minimally serious enterprise, but as just illustrated, they require extremely careful programming and decentralization, and that's where Amazon apparently fell short. Always remember: "the cloud is just someone else's computer."

Bruno Ferreira is a contributing writer for Tom's Hardware. He has decades of experience with PC hardware and assorted sundries, alongside a career as a developer. He's obsessed with detail and has a tendency to ramble on the topics he loves. When not doing that, he's usually playing games, or at live music shows and festivals.