Our data center survey covers Intel SSD failure rates exclusively because those are the drives that big businesses currently trust the most. Given how difficult it is to assess SSD reliability at a high level, we're not trying to put every vendor under a magnifying glass and determine who sells the most reliable solid-state drives. Still, brand does matter.
Google's research team writes the following on hard drives: "Failure rates are known to be highly correlated with drive models, manufacturers, and vintages. Our results do not contradict this fact. Most age-related results are impacted by drive vintages."
The experiences reported by data centers imply that the same holds true for SSDs. One executive we spoke with off the record said that while prices on OCZ's Vertex 2 were great, its reliability was awful. Late last year, his company was trying out some new gear and cracked open a case of 200 Vertex 2 Pros, only to find about 20 of them dead on arrival. And he isn't the first person to pass on a story like that.
What Does This Really Mean For SSDs?
Let's put everything we've explored into some rational perspective. Here is what we know about hard drives from the two cited studies.
- MTBF tells you nothing about reliability.
- The annualized failure rate (AFR) is higher than what manufacturers claim.
- Drives do not tend to fail during the first year of use; failure rates steadily increase with age.
- SMART is not a reliable warning system for impending failure detection.
- The failure rates of "enterprise" and "consumer" drives are very similar.
- The failure of one drive in an array increases the likelihood of another drive failure.
- Temperature has a minor effect on failure rates.
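The first point in that list is worth making concrete. An MTBF spec implies an annualized failure rate (AFR) far below what field studies observe. The sketch below assumes a constant (exponential) failure rate and uses a hypothetical 1.2-million-hour spec-sheet MTBF for illustration; neither figure comes from the cited studies.

```python
import math

HOURS_PER_YEAR = 8760

def implied_afr(mtbf_hours):
    """Annualized failure rate implied by an MTBF spec,
    assuming a constant (exponential) failure rate."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

# A hypothetical 1.2-million-hour MTBF implies an AFR under 1%...
spec_afr = implied_afr(1_200_000)
print(f"AFR implied by 1.2M-hour MTBF: {spec_afr:.2%}")
# ...while the field studies report AFRs several times higher
# for drives a few years old.
```

That gap between spec-sheet math and observed behavior is exactly why MTBF tells you nothing useful about reliability.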
Thanks to Softlayer's 5000+-drive operation, we know that some of those points also apply to SSDs. As we saw in the published studies, hard drive failure rates are affected by controllers, firmware, and interfaces (SAS versus SATA). If it's true that write endurance doesn't play a role in random drive failures and that vendors use compute-quality NAND in MLC- and SLC-based products, then we'd expect enterprise-class SSDs to be no more reliable than consumer offerings.
Higher Reliability Through Fewer Devices
Of course, enterprise needs encompass both reliability and performance. In order to push the highest I/O throughput in storage-bound applications using hard drives, IT professionals deploy arrays of short-stroked 15,000 RPM disks in RAID. Scaling up sometimes requires cabinets of servers loaded with mechanical storage. Given the superior random I/O characteristics of SSDs, just a handful of drives can not only simplify that configuration, but also cut power and cooling requirements.
Fewer devices installed means fewer devices to fail, too. Since one solid-state drive replaces multiple hard drives, consolidation ends up benefiting the business adopting flash-based storage. If the swap were a 1:1 ratio, that argument wouldn't work. But at 1:4 or more, you're really cutting into the number of disks that would eventually fail, and that point can't be over-emphasized.
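The consolidation arithmetic can be sketched directly. The AFR and drive counts below are hypothetical round numbers, chosen only to show that the device count, not a per-drive reliability edge, drives the expected yearly failure total.

```python
def expected_failures(drive_count, afr):
    """Expected number of failed drives per year for a fleet,
    assuming independent failures at a fixed annualized rate."""
    return drive_count * afr

# Hypothetical deployment: 400 short-stroked hard drives replaced
# by 100 SSDs at a 4:1 consolidation ratio, with the same 3% AFR
# assumed for both drive types.
hdd_failures = expected_failures(400, 0.03)
ssd_failures = expected_failures(100, 0.03)
print(f"{hdd_failures:.0f} vs {ssd_failures:.0f} expected failures per year")
```

Even if per-drive failure rates turn out to be identical, fielding a quarter of the devices cuts the expected number of yearly failures by the same factor.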
From there, it's really up to you to be smart about the way you deploy storage in order to get the most value from solid-state and hard drives. Of course you can't entirely replace mechanical disks with SSDs; they're too expensive. So rather than trying to protect data from some of the issues currently affecting SSDs by creating redundant (expensive) arrays, just make sure the information exists in multiple places. As Robin Harris at StorageMojo writes, "Forget RAID, just replicate the data three times." Data redundancy with SSDs doesn't have to be high-cost. If you're in a medium-sized or large business, you really only need one copy of performance-sensitive information on flash that's continuously backed up by hard drives.
The idea that you can spend less money and still get a substantial performance increase should be very attractive. And it's not a new concept; Google has been doing this for years with its hard drive-based servers. Translating it to a world with SSDs yields extremely high I/O, high reliability, and data redundancy at the low cost of simple cluster storage file replication.
Bringing It Back To The Desktop
Ahem, sorry. We went off on an enterprise tangent there. Blame it on the data centers we've been talking to. When it comes to enthusiasts, we really can't make the assumption that an SSD is more reliable than a hard drive. If anything, the recent flurry of recalls and firmware bugs should be proof enough that write endurance isn't our biggest enemy in the battle to demonstrate the maturity of solid-state technology.
At the end of the day, a piece of hardware is a piece of hardware, and it'll have its own idiosyncrasies, regardless of whether it plays host to any moving parts. So why doesn't the absence of mechanical components automatically make SSDs more reliable? We took the question to the folks at the Center for Magnetic Recording Research. Don't let the name fool you; CMRR does a lot of solid-state research, and it's the hub for much of the exhaustive storage research done worldwide. Dr. Gordon Hughes, one of the principal creators of S.M.A.R.T. and Secure Erase, points out that both the solid-state and hard drive industries are pushing the boundaries of their respective technologies. And when they do that, they're not trying to create the most reliable products. As Dr. Steve Swanson, who researches NAND, adds, "It's not like manufacturers make drives as reliable as they possibly can. They make them as reliable as economically feasible." The market will only bear a certain cost for any given component. So although NAND vendors could keep selling 50 nm flash in order to promise higher write endurance than memory etched at 3x or 25 nm, going back to paying $7 or $8 per gigabyte doesn't sound like much fun either.
Perhaps the most frustrating part of this challenging exploration is knowing that each vendor selling hard drives and SSDs alike has the reliability data we'd all like to see. They build millions of devices each year (IDC says there were 11 million SSDs sold in 2009) and track every return. No doubt, failure rates depend on quality assurance, shipping, and ultimately how a customer uses the product, which is out of the manufacturer's control. But even under the best of conditions, hard drive annualized failure rates typically top out at about 3% by the fifth year. Suffice it to say, the researchers at CMRR are adamant that today's SSDs aren't an order of magnitude more reliable than hard drives.
Reliability is a sensitive subject, and we've spent many hours on the phone with multiple vendors and their customers trying to conduct our own research based on the SSDs that are currently being used en masse. The only definitive conclusion we can reach right now is that you should take any claim of reliability from an SSD vendor with a grain of salt.
Giving credit where it is due, many of the IT managers we interviewed reiterated that Intel's SLC-based SSDs are the shining standard by which others are measured. But according to Dr. Hughes, there's nothing to suggest that its products are significantly more reliable than the best hard drive solutions. We don't have failure rates beyond two years of use for SSDs, so it's possible that this story will change. Should you be deterred from adopting a solid-state solution? So long as you protect your data through regular backups, which is imperative regardless of your preferred storage technology, then we don't see any reason to shy away from SSDs. To the contrary, we're running them in all of our test beds and most of our personal workstations. Rather, our purpose here is to question the idea that SSDs are definitely more reliable than hard drives, given the limited evidence currently backing such a claim.
Hard drives are well-documented in massive studies because they've been around for so long. We'll undoubtedly learn more about SSDs as time goes on. We leave a standing invitation to Intel, Samsung, OCZ, Micron, Crucial, Kingston, Corsair, Mushkin, SandForce, and Marvell to provide us with internal data demonstrating reliability rates for a more comprehensive investigation.
A special thanks goes out to Softlayer, RackSpace, NetApp, CMRR, Los Alamos National Labs, Pittsburgh Supercomputing Center, San Diego Supercomputer Center, ZT Systems, and all of the unnamed data centers who responded to our calls for information. Some of the data that we have cannot be published due to confidentiality, but we appreciate everyone who took the time to chat with us on the subject.
As we explained in the article, write endurance is a spec'd failure mode. That won't happen in the first year, even under enterprise-level use, so it has nothing to do with our data. We're interested in random failures. The stuff people have been complaining about... BSODs with OCZ drives, LPM issues with m4s, the SSD 320 problem that makes capacity disappear... etc... Mostly "soft" errors. Any hard error that occurs is subject to the same "defective parts per million" problem that every electrical component suffers from.
All of the data is so fragmented... I doubt that would help. You'd still need to take a fine-tooth comb to figure out how the numbers were calculated.
gpm23: You guys do the most comprehensive research I have ever seen. If I ever have a question about anything computer related, this is the first place I go to. Without a doubt the most knowledgeable site out there. Excellent article and keep up the good work.
Thank you. I personally love these types of articles... very reminiscent of academia. :)
They will continue to receive invites for our stories, and hopefully we can do more with OWC in the future!