
What Do We Know About Storage?

Investigation: Is Your SSD More Reliable Than A Hard Drive?

SSDs are a relatively new technology (at least compared to hard drives, which are almost 60 years old). It’s understandable that we would compare the new kid on the block against the tried and true. But what do we really know about hard drives? Two important studies shed some light. Back in 2007, Google published a study on the reliability of 100 000 consumer PATA and SATA drives used in its data centers. Similarly, Dr. Bianca Schroeder and her adviser, Dr. Garth Gibson, calculated the replacement rates of over 100 000 drives used at some of the largest national labs. The difference is that their study also covers enterprise SCSI, SATA, and Fibre Channel drives.

If you haven’t read either paper, we highly recommend at least reading the second study, which won Best Paper at the File and Storage Technologies (FAST ’07) conference. For those not interested in poring over academic papers, we’ll also summarize the findings.

MTTF Rating 

You remember what MTBF means (here's a hint: we covered it on page four of OCZ's Vertex 3: Second-Generation SandForce For The Masses), right? Let’s use the Seagate Barracuda 7200.7 as an example. It has a 600 000-hour MTBF rating. In a large population of these drives, we'd expect half to fail within the first 600 000 hours of operation. Assuming failures are evenly distributed over that period, a population of 600 000 drives would see roughly one failure per hour. This works out to an annualized failure rate (AFR) of 1.44%.
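The MTBF-to-AFR conversion above can be sketched in a few lines. This assumes the common exponential (constant-hazard) failure model that datasheet math relies on, which gives a value very close to the 1.44% figure:

```python
import math

def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annualized failure rate implied by an MTBF rating.

    Assumes an exponential (constant-hazard) failure model,
    the usual datasheet simplification: the probability of
    failing within one year of operation.
    """
    hours_per_year = 8760
    return 1 - math.exp(-hours_per_year / mtbf_hours)

# Seagate Barracuda 7200.7: 600 000-hour MTBF rating
print(f"{afr_from_mtbf(600_000):.2%}")  # → 1.45%
```

As the rest of this page shows, observed replacement rates in the field can exceed this figure by a wide margin, so treat the number as a vendor model, not a prediction.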

But that’s not what Google or Dr. Schroeder found, because failures do not necessarily equal disk replacements. That is why Dr. Schroeder measured the annualized replacement rate (ARR). This is based on the number of actual disks replaced, according to service logs.

While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs range from 0.5% to as high as 13.5%. That is, depending on the data set and drive type, the observed ARRs are up to a factor of 15 higher than the datasheet AFRs.

Drive makers define failures differently than we do, and it’s no surprise that their definition overstates drive reliability. Typically, an MTBF rating is based on accelerated life testing, returned-unit data, or a pool of tested drives. Vendor return data is highly suspect, though. As Google states, “we have observed… situations where a drive tester consistently ‘green lights’ a unit that invariably fails in the field.”

Drive Failure Over Time

Most people assume that the failure rate of a hard drive follows a bathtub curve: many drives fail early due to a phenomenon referred to as infant mortality, failure rates stay low through midlife, and then rise steadily as drives finally wear out. Neither study found that assumption to be true. Instead, both found that drive failure rates steadily increase with age.

Enterprise Drive Reliability

When you compare the two studies, you realize that the Cheetah drive rated at a 1 000 000-hour MTBF behaves more like a drive with a datasheet MTBF of 300 000 hours. This means that “enterprise” and “consumer” drives have pretty much the same annualized failure rate, especially when you compare similar capacities. According to Val Bercovici, director of technical strategy at NetApp, "…how storage arrays handle the respective drive type failures is what continues to perpetuate the customer perception that more expensive drives should be more reliable. One of the storage industry’s dirty secrets is that most enterprise and consumer drives are made up of largely the same components. However, their external interfaces (FC, SCSI, SAS, or SATA), and most importantly their respective firmware design priorities/resulting goals play a huge role in determining enterprise versus consumer drive behavior in the real world."

Data Safety and RAID

Dr. Schroeder’s study covers enterprise drives used in large RAID systems at some of the biggest high-performance computing labs. Typically, we assume that data is safer in properly chosen RAID modes, but the study found something quite surprising:

The distribution of time between disk replacements exhibits decreasing hazard rates, that is, the expected remaining time until the next disk was replaced grows with the time it has been since the last disk replacement.

This means that the failure of one drive in an array increases the likelihood of another drive failing soon after, and that the more time passes since the last failure, the more time is expected to pass until the next one. Of course, this has implications for the RAID reconstruction process. After the first failure, a second drive is four times more likely to fail within the same hour; within 10 hours, a subsequent failure is still two times more likely than the baseline rate would suggest.
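The "decreasing hazard rate" finding can be illustrated with a Weibull distribution, the kind of model Dr. Schroeder fit to the replacement data. The shape and scale values below are purely illustrative, not fitted values from the study: with a shape parameter under 1, the instantaneous failure rate falls as the time since the last replacement grows.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure (hazard) rate of a Weibull distribution.

    shape < 1 gives a decreasing hazard rate: the longer it has been
    since the last replacement, the lower the instantaneous risk of
    the next one -- the opposite of the memoryless exponential model.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)

# Illustrative parameters only (not values from the Schroeder paper)
shape, scale = 0.7, 10_000.0
for t in (1, 10, 100, 1_000):
    print(f"t={t:>5} h since last replacement: "
          f"hazard={weibull_hazard(t, shape, scale):.6f} per hour")
```

An exponential model (shape = 1) would print the same hazard at every `t`; the declining numbers here are what "failures cluster in time" looks like statistically.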

Temperature

One of the stranger conclusions comes from Google’s paper. The researchers took temperature readings from SMART—the self-monitoring, analysis, and reporting technology built into most hard drives—and they found that a higher operating temperature did not correlate with a higher failure rate. Temperature does seem to affect older drives, but the effect is minor.

Is SMART Really Smart?

The short answer is no. SMART was designed to catch disk errors early enough for you to back up your data. But according to Google, more than one-third of all failed drives did not trigger a SMART alert. This isn't a huge surprise; many industry insiders have suspected as much for years. It turns out that SMART is really optimized to catch mechanical failures, yet much of a disk is electronic. That's why behavioral and situational problems like power failures go unnoticed, while data integrity issues are caught. If you're relying on SMART to warn you of an impending failure, you need to plan for an additional layer of redundancy if you want to ensure the safety of your data.

Now let's see how SSDs stack up against hard drives.

Other Comments
  • 3 Hide
    hardcore_gamer , July 29, 2011 4:50 AM
    Endurance of floating gate transistor used in flash memories is low. The gate oxide wears out due to the tunnelling of electrons across it. Hopefully phase change memory can change things around since it offers 10^6 times more endurance for technology nodes
  • 17 Hide
    acku , July 29, 2011 4:52 AM
    Quote:
    Endurance of floating gate transistor used in flash memories is low. The gate oxide wears out due to the tunnelling of electrons across it. Hopefully phase change memory can change things around since it offers 10^6 times more endurance for technology nodes


    As we explained in the article, write endurance is a spec'ed failure. That won't happen in the first year, even at enterprise level use. That has nothing to do with our data. We're interested in random failures. The stuff people have been complaining about... BSODs with OCZ drives, LPM stuff with m4s, the SSD 320 problem that makes capacity disappear... etc... Mostly "soft" errors. Any hard error that occurs is subject to the "defective parts per million" problem that any electrical component also suffers from.

    Cheers,
    Andrew Ku
    TomsHardware.com
  • 16 Hide
    slicedtoad , July 29, 2011 5:18 AM
    hacker groups like lulsec should do something useful and get this kind of internal data from major companies.
  • -1 Hide
    jobz000 , July 29, 2011 5:31 AM
    Great article. Personally, I find myself spending more and more time on a smartphone and/or tablet, so I feel ambivalent about spending so much on an SSD so I can boot 1 sec faster.
  • 23 Hide
    Anonymous , July 29, 2011 5:52 AM
    You guys do the most comprehensive research I have ever seen. If I ever have a question about anything computer related, this is the first place I go to. Without a doubt the most knowledgeable site out there. Excellent article and keep up the good work.
  • 10 Hide
    acku , July 29, 2011 5:55 AM
    slicedtoadhacker groups like lulsec should do something useful and get this kind of internal data from major companies.


    All of the data is so fragmented... I doubt that would help. You'd still need to go over it with a fine-tooth comb to figure out how the numbers were calculated.

    gpm23You guys do the most comprehensive research I have ever seen. If I ever have a question about anything computer related, this is the first place I go to. Without a doubt the most knowledgeable site out there. Excellent article and keep up the good work.


    Thank you. I personally love these types of articles... very reminiscent of academia. :)

    Cheers,
    Andrew Ku
    TomsHardware.com
  • 9 Hide
    cangelini , July 29, 2011 6:51 AM


    To the contrary! We noticed that readers were looking to see OWC's drives in our round-ups. I made sure they were invited to our most recent 120 GB SF-2200-based story, and they chose not to participate (this after their rep jumped on the public forums to ask why OWC wasn't being covered; go figure).

    They will continue to receive invites for our stories, and hopefully we can do more with OWC in the future!

    Best,
    Chris Angelini
  • 8 Hide
    ikyung , July 29, 2011 7:03 AM
    Once you go SSD, you can't go back. I jumped on the SSD wagon about a year ago and I just can't seem to go back to HDD computers =[. Of course I only use SSDs for certain programs and HDD for storage.
  • 6 Hide
    Pyree , July 29, 2011 7:37 AM
    SSD may not be more reliable but I think they are still physically more durable. I have yet to see a mechanical drive that will survive free fall onto a hard surface.
  • -2 Hide
    iamtheking123 , July 29, 2011 7:43 AM
    PyreeSSD may not be more reliable but I think they are still physically more durable. I have yet to see a mechanical drive that will survive free fall onto a hard surface.

    I've yet to see a hard drive free fall, period. It's just a "feature" people spout to justify paying waaaaayyyy too much per gigabyte.
  • 2 Hide
    Anonymous , July 29, 2011 7:45 AM
    Nice to see an article like this, but if I had made it, I would have tested 2.5" normal hard drives too. At least when comparing the return rates, since those kinds of drives are normally fitted into laptops, which are moved around a lot, putting them in more danger of faults.
    I work in a PC store, and we see a great deal more 2.5" drives from laptops with bad sectors than we do 3.5" drives from stationary PCs.
  • 4 Hide
    whysobluepandabear , July 29, 2011 8:37 AM
    I could've summarized this entire article in one page. I actually would've preferred it that way.


    A.) They (SSD and HDD companies) lie to us, and the figures and statistics are not reliable, nor paint an accurate picture of their reliability/performance.

    B.) The slowest SSD rapes the fastest HDD by a significant margin.

    C.) SSDs are no more reliable than HDDs - lack of moving parts does not necessarily mean lack of failure.

    D.) Failure is a bit misused, as it's a term used to describe the progressive failing of a drive, and not the sudden.

    E.) Rather than performance, many companies (and consumers) are more concerned about reliability, as like said, even the slowest SSD is MUCH faster than the fastest HDD.



    And that's it in a nutshell.
  • 0 Hide
    flong , July 29, 2011 9:47 AM
    Superb article - this kind of review has long been overdue.

    For those posters too impatient to actually read the article - it is not disjointed; you have to carefully put all the pieces together.

    The simple conclusion is that AFR (annual failure rates) are slightly greater for SSDs than HDDs for the first 3 years or so. After that, based on the graph provided, HDD failure appears to be much greater than SSD. That is based on the slopes of the failure rates, but this conclusion is only as reliable as the data points on that graph.

    Neither failure rate of either SSDs or HDDs categorically proves one is more reliable than the other. They are roughly equal in reliability.

    SSD reliability appears to be relegated to the manufacturer's quality control. In the various reports from industry users, Intel SSDs virtually did not fail - while other manufacturers appear to be having quality control issues (OCZ perhaps, but it is hard to tell without reliable data).

    It appears that Intel has mastered the art of making a reliable SSD. Now, when their next generation of SSDs comes out and can actually compete speed-wise with the third-generation SandForce drives, we may have a real winner, depending on cost. Their next set of SSDs is rumored to push the 1 GB/s threshold (though that would have to be over a PCIe slot, as SATA 3 only allows 600 MB/s).

    SSDs are inherently more reliable than HDDs and as demand rises, it is likely that SSD reliability will surpass HDD reliability. They are only separated by 1-2% now in this article.

    It is likely that in 10 years, HDDs may not exist in their present form and SSDs will dominate the storage market.
  • 21 Hide
    Device Unknown , July 29, 2011 10:26 AM
    K-zonI will say that i didn't read the article word for word. But of it seems that when someone would change over from hard drive to SSD, those numbers might be of interest. Of the sealed issue of return, if by the time you check that you had been using something different and something said something else different, what you bought that was different might not be of useful use of the same thing. Otherwise just ideas of working with more are hard said for what not to be using that was used before. Yes? But for alot of interest into it maybe is still that of rather for the performance is there anything of actual use of it, yes? To say the smaller amounts of information lost to say for the use of SSDs if so, makes a difference as probably are found. But of Writing order in which i think they might work with at times given them the benefit of use for it. Since they seem to be faster. Or are. Temperature doesn't seem to be much help for many things are times for some reason. For ideas of SSDs, finding probably ones that are of use that reduce the issues is hard from what was in use before. When things get better for use of products is hard placed maybe. But to say there are issues is speculative, yes? Especially me not reading the whole article. But of investments and use of say "means" an idea of waste and less use for it, even if its on lesser note , is waste. In many senses to say of it though. Otherwise some ideas, within computing may be better of use with the drives to say. Of what, who knows... Otherwise again, it will be more of operation place of instances of use. Which i think will fall into order of acccess with storage, rather information is grouped or not grouped to say as well. But still. they should be usually useful without too many issues, but still maybe ideas of timiing without some places not used as much in some ways.


    Please tell me English is your 3rd language. I couldn't understand anything you said lol
  • 3 Hide
    dimar , July 29, 2011 11:14 AM
    My 5-month-old Corsair Performance 3 failed a few weeks ago. I did all I could not to abuse it: temp, swap, program files, and user profile files set up on the HDD. Performance was still super fast. Then I started getting BSODs, and the system wouldn't start. Checking the SSD using checkdisk would freeze. Corsair exchanged it for a new one with new firmware, which they haven't posted online yet. I lost my Windows data. This time I made a backup partition on the HDD, where I back up the Windows SSD partition every week.
  • 3 Hide
    tehcheeze , July 29, 2011 11:53 AM
    I am now on my 3rd OCZ Agility 2 120 GB SSD. The problem I've had with it both times is that after a while it just would not be recognized by the OS or BIOS. That, and when waking from sleep, the system was really unstable/slow.
  • 4 Hide
    acku , July 29, 2011 11:56 AM
    flongSuperb article - this kind of review has long been overdue. For those posters to impatient to actually read the article - it is not disjointed, you have to carefully put all the pieces together. The simple conclusion is that AFR (annual failure rates) are slightly greater for SSDs than HDDs for the first 3 years or so. After that, based on the graph provided, HDD failure appears to be much greater than SSD. That is based on the slopes of the failure rates, but this conclusion is only as reliable as the data points on that graph.Neither failure rate of either SSDs or HDDs categorically proves one is more reliable than the other. They are roughly equal in reliability. SSD reliability appears to be relegated to the manufacturer's quality control. In the various reports from industry users, Intel SSDs virtually did not fail - while other manufacturers appear to be having quality control issues (OCZ perhaps, but it is hard to tell without reliable data). It appears that Intel has mastered the art of making a reliable SSD. Now when their next generation of SSDs come out that actually can compete speedwise with the third generation Sandforce drives, we may have a real winner depending on cost. Their next set of SSDs coming out is rumored to push the 1 GB/s threshold (though that would have to be over a PCIE slot as SATA 3 only allows 600 mb/s).SSDs are inherently more reliable than HDDs and as demand rises, it is likely that SSD reliability will surpass HDD reliability. They are only separated by 1-2% now in this article.It is likely that in 10 years, HDDs may not exist in their present form and SSDs will dominate the storage market.


    Actually, that SSD 320 problem would have counted as a failure. When you can't access data, that's a big no-no.

    Thanks for the kudos. But a few corrections: there is no data to suggest that HDD failures are greater than SSDs'. The projections in the graph assume a constant failure rate, which never occurs. I just put it in so that people could see how it relates to an AFR of 1%. For the moment, it's unclear if SSDs are more reliable. The initial two-year data suggests otherwise.
  • 4 Hide
    acku , July 29, 2011 12:03 PM
    Quote:
    SSD may not be more reliable but I think they are still physically more durable. I have yet to see a mechanical drive that will survive free fall onto a hard surface.

    I completely agree. But think that information on reliability is important in the face of all the marketing that suggests otherwise.