Gfail: Gmail Goes Down for Nearly 2 Hours

Google took to its Gmail Blog yesterday evening to discuss an outage the company itself dubbed a "Big Deal." Attributing the outage to a miscalculation of capacity, Ben Treynor, VP Engineering and Site Reliability Czar, wrote that the company had already thoroughly investigated what happened and was compiling a list of things it intended to fix or improve as a result.

It's nice to know the team has things under control, but what actually went wrong? Treynor says that yesterday morning the team took a small fraction of Gmail's servers offline to perform routine upgrades, a procedure that normally goes off without a hitch. He went on to explain that this time the team underestimated the load that some recent changes (ironically, some designed to improve service availability) placed on the request routers. At approximately 12:30, a few of the request routers became overloaded, and their load was transferred onto the remaining request routers. That extra traffic overloaded more of them, and the failure cascaded until they were all down.
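
For readers curious how a handful of overloaded routers can take down the whole pool, here is a minimal sketch of that kind of cascade. It is an illustration only, not Google's code: the function name, the capacity figures, and the assumption that traffic is split evenly across the live routers are all hypothetical.

```python
def simulate_cascade(total_load, capacities):
    """Return the number of live routers after each round of failures.

    Assumption (not from Google's post): traffic is split evenly across
    the live routers, and a router drops out once its share exceeds its
    capacity.
    """
    live = list(capacities)
    history = [len(live)]
    while live:
        share = total_load / len(live)            # load each live router must carry
        survivors = [c for c in live if c >= share]
        if len(survivors) == len(live):           # nobody overloaded: pool is stable
            break
        live = survivors                          # failed routers hand off their load
        history.append(len(live))
    return history


# Hypothetical pool: eight routers that can handle 100 units, two that handle 90.
pool = [100] * 8 + [90] * 2

print(simulate_cascade(880, pool))  # [10]: 88 units each, everyone copes
print(simulate_cascade(920, pool))  # [10, 8, 0]: two fail, then the rest follow
```

In this toy model a modest rise in load, from 880 to 920 units, is the difference between a healthy pool and a total outage: the two weakest routers fail first, and the survivors then face 115 units apiece, more than any of them can handle.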

To ensure the same thing doesn't happen again, Google is increasing request router capacity and figuring out a way to make sure problems in datacenter A don't affect datacenter B.