AWS outage post-mortem fingers DNS as the culprit that took out a chunk of the internet and services for days — automation systems race and crash
The recent Amazon Web Services outage that took out a significant portion of the internet, games, and even smart home devices for days was extensively covered in the news. Cloud services' distributed architecture is supposed to protect customers from failures like this one, so what went wrong? Amazon published a detailed technical post-mortem of the failure, and as the famous haiku goes: "It's not DNS. / There's no way it's DNS. / It was DNS."
As a rough analogy, consider what happens when there's a car crash: a traffic jam stretches for miles, in an accordion-like effect that lingers well after the accident scene has been cleared. The initial problem was fixed relatively quickly, with a roughly three-hour outage from 11:48 PM on October 19 until 2:40 AM on October 20. However, as in the traffic jam analogy, dependent services kept breaking and didn't fully come back online until much later.
The root cause was reportedly that a broken DNS configuration for DynamoDB (Amazon's database service) was published to Route 53 (its DNS service). In turn, parts of EC2 (the virtual machine service) also went down, as its automated management systems rely on DynamoDB. Amazon's Network Load Balancer naturally depends on DNS as well, so it too ran into trouble.
It's worth noting that DynamoDB failing across the entire US-East-1 region is, by itself, enough to bring down what are probably millions of websites and services. However, not being able to bring up EC2 instances was extra bad, and load balancing being affected was diamond-badge bad.
The specific technical issue behind the DNS failure was every programmer's "favorite" bug: a race condition, in which the outcome depends on the unpredictable timing of two processes working on the same data, so they keep re-doing or undoing each other's work. The famous GIF of Bugs Bunny and Daffy Duck with the poster is illustrative.
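To see why the bug is so "beloved," here's a minimal Python sketch (our own illustration, nothing to do with Amazon's actual code) of two workers updating a shared counter without any coordination. Depending on how the threads interleave, some of the updates can simply vanish:

```python
import threading

counter = 0

def worker(n: int) -> None:
    global counter
    for _ in range(n):
        # Read-modify-write with no lock: both threads can read the same value,
        # each add one, and one of the two updates is silently lost.
        counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# We expect 200000, but the unsynchronized updates may come up short.
print(counter)
```

The fix here is boring and well known (take a lock, or make the update atomic); the hard part is spotting every place in a sprawling distributed system where two automated actors can interleave like this.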
DynamoDB's DNS management uses two components. A DNS Planner, as the name implies, periodically issues a new Plan that takes system load and availability into account. DNS Enactors, whenever they see a new Plan, apply it to Route 53 as a transaction, meaning a plan either applies fully or not at all. So far, so good.
What happened was that one DNS Enactor took its sweet time applying what we'll call the Old Plan. While it dawdled, newer Plans came in, and another Enactor picked one up and applied it. Route 53 now held good, up-to-date data, so a clean-up of outdated plans (the Old Plan included) was kicked off, just as the first Enactor was finally getting around to applying the Old Plan.
Because the clean-up removed the Old Plan at the same moment it was finally being applied, the plan's contents came up empty, and that emptiness is what got written, effectively wiping all of DynamoDB's DNS entries. This gigantic "oops" then had to be fixed manually, given the now-inconsistent state of the DNS records.
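For the curious, here's a toy Python model of that sequence of events. It's our own back-of-the-napkin reconstruction based on Amazon's description, with made-up names, addresses, and timings, not the actual Planner/Enactor code:

```python
import threading
import time

# Stand-ins for the real systems: a store of generated DNS plans, and a
# dictionary playing the role of the Route 53 record for the DynamoDB endpoint.
plan_store = {
    "plan-41": ["10.0.1.10", "10.0.1.11"],   # the Old Plan
    "plan-42": ["10.0.2.10", "10.0.2.11"],   # the New Plan
}
dns_record = {"dynamodb.endpoint.example": list(plan_store["plan-41"])}

def slow_enactor():
    # Still working on the Old Plan; by the time it reads plan-41's contents
    # to apply them, the clean-up below has already emptied them. There is no
    # check that plan-41 is still the newest plan before writing.
    time.sleep(0.2)  # simulated "taking its sweet time"
    dns_record["dynamodb.endpoint.example"] = plan_store.get("plan-41", [])

def fast_enactor_with_cleanup():
    # Applies the New Plan, then garbage-collects plans it considers obsolete.
    dns_record["dynamodb.endpoint.example"] = plan_store["plan-42"]
    time.sleep(0.1)
    plan_store["plan-41"] = []  # clean-up guts the Old Plan while it's still in flight

slow = threading.Thread(target=slow_enactor)
fast = threading.Thread(target=fast_enactor_with_cleanup)
slow.start(); fast.start(); slow.join(); fast.join()

# The endpoint ends up with no addresses at all: the stale Enactor applied a
# plan the clean-up had already emptied.
print(dns_record)
```

A version check before writing (refusing to apply a plan older than the one already live) would have stopped the empty update, and that's presumably the sort of protection Amazon is now adding.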
Right on cue, the EC2 service started breaking, as its automated management systems depend on DynamoDB. New EC2 instances couldn't be created, which piled up a huge backlog of launch requests to work through. The stampede forced Amazon's engineers to throttle EC2 instance creation for quite a while, meaning many services and websites stayed down until their instances could finally come back online.
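Amazon hasn't spelled out how that throttling worked, but the generic technique is a rate limiter sitting in front of the backlog, something like this token-bucket sketch (entirely our own illustration, with made-up numbers):

```python
import time

class TokenBucket:
    """Lets through a limited number of requests per second, with a small burst."""
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Work through a pile of hypothetical launch requests at a controlled pace,
# instead of letting the stampede flatten the still-recovering back end.
bucket = TokenBucket(rate_per_sec=5, burst=5)
backlog = [f"launch-request-{i}" for i in range(20)]
while backlog:
    if bucket.allow():
        print("processing", backlog.pop(0))
    else:
        time.sleep(0.05)
```

The downside, as anyone who spent that day waiting for their instances can attest, is that a throttled backlog takes a long time to drain.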
As the cherry on top (technically, below), Amazon's Network Load Balancer (NLB) also took a hit: delays in fixing the DNS entries and propagating the changes meant new EC2 instances came up with inconsistent network state behind the load balancer. Because of that, many NLB health checks failed even though the underlying infrastructure was actually running fine. Those failing health checks caused NLB nodes and backend targets to (guess what!) be removed from DNS, only to return later.
In the aftermath, Amazon's engineers have disabled DynamoDB's DNS Planner and DNS Enactor automation until fixes for the race condition are in place and more protections are added. Likewise, EC2 is gaining a new test suite for unusual operating conditions, and the Network Load Balancer will get control mechanisms that limit how much capacity can be removed when health checks fail.
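The post-mortem doesn't detail what those NLB control mechanisms will look like, but the usual shape of such a guardrail is a cap on how much of the fleet automation may pull out of service at once. A rough sketch of the idea (names and numbers are ours):

```python
# Guardrail sketch: never remove more than a fixed fraction of targets in one
# health-check sweep. If "everything" looks unhealthy at once, the more likely
# culprit is the health check itself (or its dependencies), not the fleet.
MAX_REMOVAL_FRACTION = 0.2

def targets_to_remove(results: dict[str, bool]) -> list[str]:
    """results maps a target ID to its latest health-check outcome."""
    unhealthy = [tid for tid, healthy in results.items() if not healthy]
    limit = int(len(results) * MAX_REMOVAL_FRACTION)
    return unhealthy[:limit]  # leave the rest serving traffic despite the failures

fleet = {f"target-{i}": (i % 2 == 0) for i in range(10)}  # half report unhealthy
print(targets_to_remove(fleet))  # pulls at most 2 targets, not all 5
```

The trade-off is a little routing precision in exchange for not amplifying a monitoring hiccup into a full outage.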
Automated control systems are a necessity in cloud computing, or in any minimally serious enterprise, but as just illustrated, they require extremely careful programming and decentralization, and that's where Amazon apparently fell short. Always remember: "the cloud is just someone else's computer."

Bruno Ferreira is a contributing writer for Tom's Hardware. He has decades of experience with PC hardware and assorted sundries, alongside a career as a developer. He's obsessed with detail and has a tendency to ramble on the topics he loves. When not doing that, he's usually playing games, or at live music shows and festivals.