AWS outage post-mortem fingers DNS as the culprit that took out a chunk of the internet and services for days — automation systems race and crash


The recent Amazon Web Services outage, which took out a significant portion of the internet, games, and even smart home devices for days, was extensively covered in the news. The distributed architecture of cloud services is supposed to protect customers from failures like this one, so what went wrong? Amazon published a detailed technical post-mortem of the failure, and as the famous haiku goes: "It's not DNS. / There's no way it's DNS. / It was DNS."

As a rough analogy, consider a car crash: the resulting traffic jam stretches for miles and, in an accordion-like effect, lasts well after the accident scene has been cleared. The initial problem was fixed relatively quickly, in an outage of roughly three hours that ran from October 19 at 11:48 PM until October 20 at 2:40 AM. However, as in the traffic jam, dependent services had started breaking by then, and they didn't fully come back online until much later.

The specific technical issue behind the DNS failure was a programmer's "favorite" bug: a race condition, in which the outcome depends on the unpredictable timing of concurrent operations, so that two repeating processes can keep redoing or undoing each other's work. The famous GIF of Bugs Bunny and Daffy Duck with the poster is illustrative.
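
To make the idea of a race condition concrete, here is a minimal Python sketch. It is not Amazon's code, and every name in it (`Record`, `apply_unchecked`, `apply_checked`, the plan versions and values) is hypothetical. It replays one unlucky ordering by hand: a delayed worker holding an older update finishes after a faster worker has already applied a newer one.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Shared state that two updaters write to (a hypothetical stand-in for a DNS record)."""
    version: int = 0
    value: str = ""


def apply_unchecked(record: Record, version: int, value: str) -> bool:
    # Last writer wins, regardless of which update is actually newer.
    record.version = version
    record.value = value
    return True


def apply_checked(record: Record, version: int, value: str) -> bool:
    # Refuse any update that is not newer than what is already in place.
    if version <= record.version:
        return False
    record.version = version
    record.value = value
    return True


def run(apply) -> Record:
    record = Record()
    # Worker B is fast and applies update 2 first.
    apply(record, 2, "new-endpoints")
    # Worker A was delayed and only now applies the older update 1.
    apply(record, 1, "old-endpoints")
    return record


unchecked = run(apply_unchecked)
checked = run(apply_checked)
print(unchecked)  # Record(version=1, value='old-endpoints')
print(checked)    # Record(version=2, value='new-endpoints')
```

Without the version check, the stale update silently replaces the fresh one. The check in `apply_checked` only works here because the two writes are replayed one after the other; with truly concurrent workers, the compare and the write would also have to happen atomically, for instance under a lock.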


Bruno Ferreira
Contributor

Bruno Ferreira is a contributing writer for Tom's Hardware. He has decades of experience with PC hardware and assorted sundries, alongside a career as a developer. He's obsessed with detail and has a tendency to ramble on the topics he loves. When not doing that, he's usually playing games, or at live music shows and festivals.

  • oldsysarch
    As described, it actually wasn't DNS per se; it was a failure to account for potential race conditions in update processes, i.e. poor software design. This is typical inexperienced coding that does not properly deal with edge conditions. If you want robust code, you need to properly account for potential failure modes, and that's what experience teaches you: all the "what might happen".
  • Sam Hobbs
    Saying that it was DNS implies it was a problem with the design of DNS. The problem sounds like a complicated mess, but it seems to me that it was a problem with the DNS processing, not with the design of DNS.
  • pjmelect
    Couldn't they just turn it off and then back on again as in the South Park episode:D
  • antrozous
    pjmelect said:
    Couldn't they just turn it off and then back on again as in the South Park episode:D
    Over logging (that's the title of the episode 😉)