Yesterday's global internet outage caused by a single file on Cloudflare servers — an unexpectedly large file triggered a catastrophic error, knocking out several major websites

Cloudflare banner
(Image credit: Getty / Bloomberg)

Cloudflare, one of the biggest DDoS-protection and security providers on the internet, suffered a major outage yesterday, knocking out several major websites, including X, OpenAI, and even some McDonald’s branches across the globe. Its chief technology officer has since apologized for the massive error, and co-founder Matthew Prince has published the details of the cause of the outage on the company blog.

Since Cloudflare is a web security outfit that protects a big chunk of the internet from DDoS and similar network intrusions, the company's first suspicion was that it was under attack. In fact, Microsoft reported a record-breaking DDoS attack against its own servers on the same day the Cloudflare issue happened. After further investigation, however, Cloudflare determined that the outage was actually caused by a configuration error.

“The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems’ permissions, which caused the database to output multiple entries into a 'feature file' used by our Bot Management system,” Prince wrote in the blog. “That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.”
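In rough terms, the mechanism Prince describes can be sketched as follows (a hypothetical Rust sketch with made-up names, not Cloudflare's actual code): a query that starts returning each row twice roughly doubles the generated feature file.

```rust
// Hypothetical sketch of the failure mode: a permissions change makes a
// metadata query return every feature row twice, so the generated
// "feature file" doubles in size before being propagated to each machine.

#[derive(Clone)]
struct FeatureRow {
    name: String,
}

// Stand-in for the database query whose changed permissions caused
// duplicate rows (all names here are assumptions for illustration).
fn query_feature_rows(duplicated: bool) -> Vec<FeatureRow> {
    let base: Vec<FeatureRow> = (0..60)
        .map(|i| FeatureRow { name: format!("feature_{i}") })
        .collect();
    if duplicated {
        // The bad case: every row appears twice in the result set.
        base.iter().cloned().chain(base.iter().cloned()).collect()
    } else {
        base
    }
}

// Serialize the rows into the file that gets pushed network-wide.
fn generate_feature_file(rows: &[FeatureRow]) -> String {
    rows.iter()
        .map(|r| r.name.as_str())
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let good = generate_feature_file(&query_feature_rows(false));
    let bad = generate_feature_file(&query_feature_rows(true));
    // The duplicated query roughly doubles the propagated file.
    println!("good: {} bytes, bad: {} bytes", good.len(), bad.len());
}
```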


Jowi Morales
Contributing Writer

Jowi Morales is a tech enthusiast with years of experience working in the industry. He has been writing for several tech publications since 2021, covering tech hardware and consumer electronics.

  • blitzkrieg316
    One file can take down the global internet... and not the first time...

    We need to seriously rethink our entire internet dependence from systems acting like SKYNET...
    Reply
  • Imaletyoufinish
    So that's what happened. Had a bunch of Cloudflare server popups when I was looking up CUs in some of AMD's APUs (Strix Halo for example) and Tech Power Up, Notebook Check and other sites with that information (all in a row using Duckduckgo search) were not loading up and had the same Cloudflare server error. After a while I switched VPN locations and things seemed back to normal so didn't know how big of an issue this was until reading this.
    Reply
  • valthuer
    blitzkrieg316 said:
    One file can take down the global internet... and not the first time...

    We need to seriously rethink our entire internet dependence from systems acting like SKYNET...

    There's no going back, I'm afraid.

    If it's any consolation, the fact that so many people around the world (including companies like OpenAI) rely on Cloudflare can only mean one thing: any major issues will be addressed as quickly as possible.
    Reply
  • TechieTwo
    It just confirms for hackers what networks to attack. :mad:
    Reply
  • derekullo
    TechieTwo said:
    It just confirms for hackers what networks to attack. :mad:
    That's like refusing to have doors or exterior walls on your house because that's the first spot thieves will try to attack to break in :P
    https://upload.wikimedia.org/wikipedia/en/7/79/Roll_Safe_meme.jpg
    Cloudflare's original goal was to prevent DDOS attacks ... Project Honey Pot
    They have since become a Content Delivery Network that caches data in multiple data centers so that it can be served quickly to users.

    Without Cloudflare the internet would be much slower due to much more successful and prevalent DDOS attacks and having to wait for data to come from a single server versus being cached in multiple places in multiple countries.
    Reply
  • ezst036
    I read somewhere (I think it was a Substack) that this configuration file pairs up with some brand-new code that was deployed using the Rust language, and there wasn't nearly as much testing around it as there should have been.

    That's interesting to me considering the sterling reputation that Rust seems to (otherwise) have.
    Reply
  • Sam Hobbs
    Could the failure have been mitigated with better error checking? In other words, when the unexpected size caused a problem, the program should have exited gracefully with a useful error message. Was that done? If not then their entire system needs to be evaluated and better error checking added where appropriate.
    Reply
  • Sam Hobbs
    ezst036 said:
    there wasn't nearly as much testing around it as there should have been.
    I was going to say something about testing but it is not always possible to know what to test for. In this case if they knew of the possibility of a larger file size then it would not have been unexpected and they would have coded for the possibility. If the program did not exit gracefully upon the error then that is what it should have done as I said previously.
    Reply
  • StevenW1969
    Decentralizing is the only answer to a centralized system. Put all your eggs in one basket, and they may all get cracked.
    Reply
  • snemarch
    ezst036 said:
    I read somewhere (I think it was a Substack) that this configuration file pairs up with some brand-new code that was deployed using the Rust language, and there wasn't nearly as much testing around it as there should have been.

    That's interesting to me considering the sterling reputation that Rust seems to (otherwise) have.
    The core issues were:
    1) for performance reasons, infrastructure code preallocates memory for a fixed maximum number of "features" (which had a healthy amount of buffer compared to current/expected feature size - 200 max entries vs. current ~60 entries).
    2) A bug in data generation caused configuration files to explode size-wise.

    The code assumed the number of features wouldn't exceed the maximum, but it "panicked safely" instead of corrupting memory, which would have been worse.

    I'm not sure if there are good alternatives to the main logic - you *probably* don't want to just disable the infrastructure security features if the file is too big. Just loading "up to the maximum" doesn't seem like a good idea. And not putting bounds on feature size could lead to even worse outages, because other parts of the infrastructure services could get OOM-killed.

    There's definitely something to learn wrt. controlled roll-outs and ability to do fast fallback to last-known-good-version :)
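    Roughly, the trade-off looks like this (a minimal hypothetical sketch, not Cloudflare's actual code): a preallocated table with a hard cap, where an oversized input is rejected up front rather than corrupting memory. Whether that rejection becomes a clean fallback or a panic depends on what the caller does with the error.

```rust
// Minimal sketch (hypothetical, not Cloudflare's actual code) of a
// preallocated feature table with a hard cap on entries.

const MAX_FEATURES: usize = 200; // healthy headroom over the ~60 in use

fn load_features(entries: &[&str]) -> Result<Vec<String>, String> {
    if entries.len() > MAX_FEATURES {
        // Returning an error lets the caller fall back to a
        // last-known-good file; unwrap()-ing this Result at the call
        // site is what turns it into a process-wide panic instead.
        return Err(format!(
            "feature count {} exceeds preallocated max {}",
            entries.len(),
            MAX_FEATURES
        ));
    }
    let mut table = Vec::with_capacity(MAX_FEATURES);
    table.extend(entries.iter().map(|e| e.to_string()));
    Ok(table)
}

fn main() {
    let normal: Vec<&str> = vec!["f"; 60];
    let doubled: Vec<&str> = vec!["f"; 240]; // duplicated-entries case
    assert!(load_features(&normal).is_ok());
    assert!(load_features(&doubled).is_err()); // rejected, not memory-corrupted
}
```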
    Reply