Google Cloud and Seagate have offered a peek at their efforts to use machine learning, a type of AI, to predict when data center hard disk drives (HDDs), which store many terabytes of data, might start to fail, so they can plan around those disruptions to their systems.
Right now there’s no getting around the fact that HDDs fail. They’re less reliable than SSDs—assuming those drives aren’t being pushed to the limit while they mine Chia—but they also offer higher capacities at lower prices. That’s an important factor for companies like Google Cloud that need to be able to handle massive amounts of data, either in support of their own projects or on behalf of their customers.
“At Google Cloud, we know first-hand how critical it is to manage HDDs in operations and preemptively identify potential failures,” the company said in a recent blog post detailing those efforts. “We are responsible for running some of the largest data centers in the world—any misses in identifying these failures at the right time can potentially cause serious outages across our many products and services.”
The problem was that manually identifying a failing drive, which Google Cloud defined as an HDD “that fails or has experienced three or more problems in 30 days,” is a time-consuming process that requires physical access to the device. Google Cloud and Seagate wanted to use machine learning to reduce the amount of time engineers would have to spend testing drives to determine their risk of failure.
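Google Cloud’s labeling rule—a drive “that fails or has experienced three or more problems in 30 days”—can be sketched as a simple function. This is a hypothetical illustration of the rule as stated in the article, not Google Cloud’s actual pipeline; the function and parameter names are assumptions:

```python
from datetime import datetime, timedelta

def label_drive(failed: bool, problem_dates: list[datetime], now: datetime) -> bool:
    """Label a drive as failing per the article's rule: it has failed
    outright, or it has logged three or more problems in the last 30 days."""
    if failed:
        return True
    window_start = now - timedelta(days=30)
    recent_problems = [d for d in problem_dates if d >= window_start]
    return len(recent_problems) >= 3
```

For example, a drive with only two problems inside the 30-day window would not be labeled as failing, while a third recent problem would tip it over the threshold.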
Google Cloud said that it has “millions of disks deployed in operation that generate terabytes (TBs) of raw telemetry data,” including “billions of rows of hourly SMART (Self-Monitoring, Analysis and Reporting Technology) data and host metadata, such as repair logs, Online Vendor Diagnostics (OVD) or Field Accessible Reliability Metrics (FARM) logs, and manufacturing data about each disk drive.”
That means the company has a staggering number of HDDs that all generate “hundreds of parameters and factors that must be tracked and monitored.” This being Google Cloud, however, the sheer amount of available information was also beneficial. Working with Seagate and Accenture, Google Cloud could put that data to use in a machine learning model capable of predicting a drive’s chances of failing.
The companies tested two models: one based on AutoML Tables and one custom-developed for this project. The former won out with “a precision of 98% with a recall of 35% compared to precision of 70-80% and recall of 20-25% from [the] custom ML model,” which means the experiment also served the dual purpose of demonstrating the benefits of using AutoML instead of a custom solution.
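To put those numbers in context: precision is the fraction of drives the model flags that really do fail, and recall is the fraction of all failing drives that get flagged. A minimal sketch with illustrative counts (the specific numbers below are made up to match the reported percentages, not figures from the project):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision = TP / (TP + FP) and recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts: the model flags 357 drives, of which 350 truly
# fail (7 false positives), while another 650 failing drives go
# unflagged. That yields roughly 98% precision and exactly 35% recall.
p, r = precision_recall(tp=350, fp=7, fn=650)
```

High precision with modest recall means that when the AutoML model raises an alarm it is almost always right, even though it misses a majority of the drives that will eventually fail—a reasonable trade-off when each alarm triggers costly manual inspection.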
Google Cloud said that it plans “to expand the system to support all Seagate drives—and we can’t wait to see how this will benefit our OEMs and our customers!” More information about the project is available via the company’s blog post.
I presume they empty the bad drive first, so that when repairs are complete it’s blank and doesn’t reintroduce old, out-of-date data into Google’s overall storage.

So they fix the drive without worrying about its contents, as if reformatting it blank, rather than recovering any data from it.

But why drain it of all data if the data is redundant anyway? Does Google really ever keep some data in only one place, on only one drive?

And why not simply swap in a new drive and service the old one offline, for reuse later if appropriate, with no interruption of service while testing? Are the hard drives hard-wired and soldered into place inside Google’s servers?