Google doubts hard drives fail because of excessive temperature, usage

Mountain View (CA) - As a company with one of the world's largest IT infrastructures, Google has an opportunity to do more than just search the Internet. From time to time, the company publishes the results of internal research. The most recent project one is sure to spark interest in exploring how and under what circumstances hard drives work - or not.

There is a rule of thumb for replacing hard drives, which taught customers to move data from one drive to another at least every five years. But especially the mechanical nature of hard drives makes these mass storage devices prone to error and some drives may fail and die long before that five-year-mark is reached. Traditionally, extreme environmental conditions are cited as the main reasons for hard drive failure, extreme temperatures and excessive activity being the most prominent ones.

Google said that it is collecting "vital information" about all of its systems every few minutes and stores the data for further analysis. For example, this information includes environmental factors (such as temperatures), activity levels and SMART parameters (Self-Monitoring Analysis and Reporting Technology) that are commonly considered to be good indicators to describe the health of disk drives.

Breaking out different levels of utilization, the Google study shows an interesting result. Only drives with an age of six months or younger show a decidedly higher probability of failure when put into a high activity environment. Once the drive survives its first months, the probability of failure due to high usage decreases in year 1, 2, 3 and 4 - and increases significantly in year 5. Google's temperature research found an equally surprising result: "Failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend," the authors of the study found.

In contrast the company discovered that certain SMART parameters apparently do have an effect drive failures. For example, drives typically scan the disk surface in the background and report errors as they discover them. Significant scan errors can hint to surface errors and Google reports that fewer than 2% of its drives show scan errors. However, drives with scan errors turned out to be ten times more likely to fail than drives without scan errors. About 70% of Google's drives with scan errors survived the first eight months after the first scan error was reported.

Similarly, reallocation counts, a number that results from the remapping of faulty sectors to a new physical sector, can have a dramatic impact on a hard drive's life: Google said that drives with one or more reallocations fail more often than those with none. The observed average impact on the average fail rate came in at a factor of 3-6, while about 85% of the drives survive past eight months after the first reallocation.

TOPICS

Tom's Hardware is the leading destination for hardcore computer enthusiasts. We cover everything from processors to 3D printers, single-board computers, SSDs and high-end gaming rigs, empowering readers to make the most of the tech they love, keep up on the latest developments and buy the right gear. Our staff has more than 100 years of combined experience covering news, solving tech problems and reviewing components and systems.