Decrypting Failure Statistics: ZT Systems (~155 000 SSDs)
Hard drives and SSDs are affected by different factors as a result of their respective architectures. When we think about reasons to worry about hard drive health, we gravitate to the fact that they're based on mechanical, moving parts that, although designed to very specific tolerances, are destined to wear out over time.
We also know that SSDs don't have those issues. Their solid-state nature alleviates the fear of damage due to crashing heads or a worn motor. But because SSDs are virtualized (in that you can't physically map out a static LBA space as you can on a hard drive), there are other variables in play that tend to affect reliability. Firmware is the most significant, and we see its impact almost every time an SSD problem is reported.
Over the past three years, Intel's SSD-related bugs have all been stomped out with newer firmware. Crucial's issues with link power management on the m4 were solved with a firmware update. And we've seen SandForce's most well-known partner, OCZ, address a number of customer complaints with various firmware versions. In fact, the SandForce situation is unusual. Because drive vendors are able to tweak the controller's firmware as a means of differentiation, SandForce-based drives from any given company could conceivably have different issues. That certainly complicates the question of reliability (or at least consistency).
Specific issues aside, this discussion feeds into our point that there's a need to compare failure numbers across brands. The problem with this is that the way each vendor, reseller, and customer does the math is slightly different, making a true comparison almost impossible.
For example, we were extremely impressed by Intel's reliability presentation at IDF 2011. But in discussions with ZT Systems, the company Intel cited, we discovered that the 0.26% AFR figure doesn't actually take age into account and only covers validated errors. Frankly, if you're an IT manager, you care about unvalidated errors, too. There are situations where you send a defective product back and the manufacturer claims there's no error. This doesn't mean that the drive is problem-free, because it could be suffering from a configuration- or application-oriented issue. We've seen enough real-world examples to know that this continues to be a problem.
Unvalidated failure rates are typically 2x to 3x higher than validated ones. Indeed, ZT Systems claims an unvalidated failure rate of 0.43% for 155 000 X25-Ms, but again this figure isn't normalized for age, as drives were deployed in groups. According to the company's CTO, Casey Cerretani, the final numbers are still being tabulated, but early estimates peg the unvalidated AFR for the first year closer to 0.7%. Of course, this still doesn't take long-term reliability into account, which is one reason it's difficult to fairly compare solid-state and hard disk drives.
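To see why normalizing for age matters when drives are deployed in groups, consider a minimal sketch in Python. The cohort sizes, service times, and failure counts below are invented purely for illustration and are not ZT Systems' data. The Cohort type and both function names are our own hypothetical helpers. The sketch also assumes the simplest definition of AFR (failures divided by drive-years of service) and ignores the fact that a failed drive stops accumulating service time.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    """One deployment group (hypothetical structure for illustration)."""
    drives: int               # drives deployed in this group
    months_in_service: float  # how long the group has been running
    failures: int             # failures observed so far


def naive_failure_rate(cohorts):
    """Failures divided by drives shipped, ignoring how long each ran."""
    total_drives = sum(c.drives for c in cohorts)
    total_failures = sum(c.failures for c in cohorts)
    return total_failures / total_drives


def annualized_failure_rate(cohorts):
    """Failures divided by drive-years of service (age-normalized)."""
    drive_years = sum(c.drives * c.months_in_service / 12 for c in cohorts)
    total_failures = sum(c.failures for c in cohorts)
    return total_failures / drive_years


# Made-up numbers: three equal groups deployed at different times,
# all failing at the same underlying rate of 0.6% per year.
cohorts = [
    Cohort(drives=50_000, months_in_service=12, failures=300),
    Cohort(drives=50_000, months_in_service=6, failures=150),
    Cohort(drives=50_000, months_in_service=3, failures=75),
]

print(f"naive rate:     {naive_failure_rate(cohorts):.2%}")       # 0.35%
print(f"age-normalized: {annualized_failure_rate(cohorts):.2%}")  # 0.60%
```

With these invented inputs, the same population looks almost twice as reliable when the younger groups are counted as if they had already run a full year. That is the kind of gap that can open up between a raw failure percentage and a first-year AFR.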
The bottom line is that we now know different reporting methods easily affect the perception of reliability when it comes to digesting vendor data. Moreover, only time will tell if the SSD reliability story will hold up against hard drives. And now you know why the two aren't as directly comparable as some of the information presented might suggest.