Hard Drive Reliability & RAID-0

zenmaster

Splendid
Feb 21, 2006
3,867
0
22,790
http://www.tgdaily.com/2007/02/16/google_hard_drives/

1,000 HDDs bought new.

After 1 year, 1.7% fail. (17 drives)
That leaves 983 working HDDs.

In the second year, 8% of the 983 fail. (78.64 -> call it 79, leaving 904)

In the third year, 8.6% of the 904 fail. (77.74 -> call it 78, leaving 826)

This means that roughly 17.3% of all drives will have failed within 3 years (multiplying the survival rates: 0.983 x 0.92 x 0.914 = 0.827, so about 17.3% fail).

For a RAID-0 setup with two drives, this means that only about 68.4% of the RAID-0 setups will avoid a total array failure over those 3 years.
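
If anyone wants to check the arithmetic, here is a quick sketch in Python (my own, not from the article; it simply assumes the three yearly rates above apply independently to every drive):

# Yearly failure rates reported in the article (year 1, 2, 3).
yearly_failure = [0.017, 0.08, 0.086]

survival = 1.0
for rate in yearly_failure:
    survival *= 1 - rate            # chance one drive makes it through the year

print("Single drive, 3-year survival: %.1f%%" % (survival * 100))        # ~82.7%
print("Single drive, 3-year failure:  %.1f%%" % ((1 - survival) * 100))  # ~17.3%

# A two-drive RAID-0 array is lost if EITHER drive dies,
# so the array survives only if both drives survive.
array_survival = survival ** 2
print("2-drive RAID-0, 3-year survival: %.1f%%" % (array_survival * 100))  # ~68.3%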

Personally, I was quite shocked at the high drive failure rates after year 1.
I expected the failure rate for years 2 & 3 to be close to year one, if not even lower. (On the thought that if something is going to fail, it will fail very soon.)

I'm sure a stats major may quibble with my exact methodology, but the point remains the same.
 

dt

Distinguished
Aug 10, 2004
520
0
18,980
Well, come on now, it's RAID 0... you expect speed and performance, not redundancy...

Those statistics really don't mean anything, because it all varies with what type of drives you use, what kind of activity they are under, etc.

And if companies knew that a drive would die in 3 years, why do they put a 5-year warranty on it? Because they know some drives are bad and some drives are good. It all depends...
 

zenmaster

Splendid
Feb 21, 2006
3,867
0
22,790
Actually, drive usage appears to have little impact... (see below).
Yes, some die and some do not.

The point of the post is not to prove that RAID-0 is a reliable, safe method.
The point is to show that the failure rate over a 3-year period is probably FAR higher than most expect.

Many posters keep quoting 500,000+ hour MTBF ratings and talking about the odds of a drive failing within a few years as being almost statistically impossible.
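
For perspective on those ratings, here is a quick back-of-the-envelope sketch in Python (mine, not the study's; it uses the usual rough conversion of MTBF to an annualized failure rate, which assumes a constant failure rate across a large population):

HOURS_PER_YEAR = 24 * 365   # 8760

def implied_afr(mtbf_hours):
    # Annualized failure rate ~= hours per year / MTBF (rough approximation).
    return HOURS_PER_YEAR / mtbf_hours

print("  500,000 h MTBF -> %.2f%% per year" % (implied_afr(500000) * 100))    # ~1.75%
print("1,000,000 h MTBF -> %.2f%% per year" % (implied_afr(1000000) * 100))   # ~0.88%

Even taken at face value, the quoted MTBF figures do not make failures "almost impossible" - and the study's observed rates for years 2 and 3 are several times higher still.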

This study, which is likely the most comprehensive to date, reports a shockingly high rate of drive failure.

---------------------------------------------------------------------
The study uses failure data from several large scale deployments,
including a large number of SATA drives. They
report a significant overestimation of mean time to failure
by manufacturers and a lack of infant mortality effects.
-----------------------------------------------------------------------
In this study we report on the failure characteristics of
consumer-grade disk drives. To our knowledge, the
study is unprecedented in that it uses a much larger
population size than has been previously reported and
presents a comprehensive analysis of the correlation between
failures and several parameters that are believed to
affect disk lifetime. Such analysis is made possible by
a new highly parallel health data collection and analysis
infrastructure, and by the sheer size of our computing
deployment.
One of our key findings has been the lack of a consistent
pattern of higher failure rates for higher temperature
drives or for those drives at higher utilization levels.
Such correlations have been repeatedly highlighted
by previous studies, but we are unable to confirm them
by observing our population. Although our data do not
allow us to conclude that there is no such correlation,
it provides strong evidence to suggest that other effects
may be more prominent in affecting disk drive reliability
in the context of a professionally managed data center
deployment.
--------------------------------------------------------------
 

WizardOZ

Distinguished
Sep 23, 2006
250
0
18,780

Points to ponder:

1) MTBF is a statistical probability construct. Way back in high school, introductory statistics taught that combined probability is not additive but MULTIPLICATIVE. In short, the probability of two independent events both occurring equals X times Y, NOT X plus Y, where X and Y are the probabilities of the individual events. In our case, that governs how likely two (or more) individual hard drives are to fail.

2) MTBF is calculated on a relatively small sample size of a particular manufacturing run. Smaller sample size means greater uncertainty. And, as we also learned in High School, statistics can be manipulated.

3) On what basis do you assume the manufacturer is taking a truly random sample of product from a particular run?

4) On what basis do you assume that the manufacturer of a particular line is not fudging the data? There have been cases of "over-optimistic" specs for assorted audio equipment over the years. In some cases, investigation demonstrated (at best) dubious methodology. Keep in mind that everyone was scratching each other's backs. Even US-based manufacturers like McIntosh and European-based manufacturers like Bang & Olufsen used components manufactured in Japan. And the Japanese electronics manufacturers were, and remain, an incestuous bunch. The acronym "CYA" comes to mind.

5) The article specifically notes that "consumer level" devices were under study. About 2 years ago, all of the major "first-tier" hard drive manufacturers changed their warranty policies: the warranties on consumer products were dropped from 3 years to 1. This was discussed at some length here at THG.

6) The failure rate statistics quoted in the article are instructive on a number of levels. Firstly, they confirm one of the most serious concerns raised in the THG article - that the quality of the product has been severely compromised in order to reduce costs. Secondly, since Google most likely uses product from first-tier manufacturers, this doesn't say anything positive about the reliability of current product. Thirdly, remember that the large increases in capacity over the past 18 months, along with the release of two revs of a new interface that has essentially replaced a technology in use for almost 20 years (IDE/ATA), and along with entirely new recording techniques (perpendicular vs. horizontal) used to increase capacity, have been accompanied by very significant price decreases.

Such price decreases are abnormal in any other market, especially for really new, bleeding/leading-edge equipment. How did you think the manufacturers were managing to stay in business? Something has to give under these circumstances. You'd better believe that quality, reliability and very low product prices are mutually exclusive.

If you think I am wrong, why is it that there are so few hard drive manufacturers left out there? Do the corporate names Quantum, Conner Peripherals, Fujitsu and Maxtor ring any bells? Conner was absorbed by Seagate, Quantum's drive business went to Maxtor, and Maxtor itself has since been bought out by Seagate. Fujitsu got out of the IDE/consumer market about 18 months ago. I don't think IBM is manufacturing consumer hard drives anymore, having sold its drive business to Hitachi. Samsung is not considered a first-tier manufacturer, but they still offer a 3-year warranty and have some of the best prices around. Draw your own conclusions.

7) What passes for RAID controllers on assorted MoBos is a very bad joke, in extremely poor taste. These clowns are unable to ensure compatibility of their onboard RAID controllers between revs of the same model of MoBo, never mind between different generations - and while using the same controller chip at that. What's wrong with this picture?

Given that these are such trivial, stupidly inconsequential and jejune concerns, I am amazed that there are so few RAID 0 systems set up out there, especially in mission-critical applications. Even more amazing is that, despite the fact that a RAID 0 setup won't actually deliver a real, noticeable performance improvement in 90% of the cases where it is implemented, there are still so few RAID 0 setups out there.

The good readership will need to bear with my sarcasm. I am not one of those morons who is willing to pay a several-hundred-dollar premium for the latest graphics card that only gives a 1 to 3 FPS "improvement" in performance over the previous generation. Even a gain of 10 to 50 FPS on top of anything over 60 FPS cannot be seen (why do you think movies are still shot at 24 FPS?). Tragically, I don't have more money than brains, and what little brains I do have, I tend (for some very strange reasons) to use very carefully.

But then, what do I know?

Since it is your time, money and work at risk here, you are free to make your own decisions and deal with any negative consequences. Just don't whine and ask for help when (not if) your system crashes and burns. Deal with your own, informed, screw-ups.
 
I suspect that Google's workload is a server type, characterised by many short random reads. Continuous, vicious actuator movement with many queued requests would certainly expose any weaknesses in a drive. On the other hand, typical desktop single-user patterns would be more sequential in nature, with pauses in between. I am not too worried.
 

zenmaster

Splendid
Feb 21, 2006
3,867
0
22,790
Why do you suspect that?

The study specifically found that usage did not matter significantly.
It took such things into account and compared the results of different usage levels against reliability.

I'm not trying to make you concerned.
I'm simply pointing out an important scientific study.

Folks can choose to read it and understand it, or choose to claim it's baseless and/or wrong without reading it. The latter I can do nothing to assist; it is the former whom I am trying to help.
 

Codesmith

Distinguished
Jul 6, 2003
1,375
0
19,280
To find the probability of any drive failing, you first find the probability of all drives not failing.

For RAID 0 with n drives, each with an x chance of failure, the chance of array failure = 1 - (1 - x)^n.

For RAID 1 you find the probability of all drives failing, so it's x^n.

With the 17.3% figure and two drives, that works out to about 3.0% for RAID 1 - but that assumes you refuse to replace a drive after it fails!

Let's (falsely) assume a drive is as likely to die on day one as at the end of three years.

The chance of the 2nd drive failing within a week of the first would then be roughly 17.3% * (17.3% / (3*52)) ≈ 0.019%.
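
For anyone who wants to plug in numbers, here is the same arithmetic as a quick Python sketch (my own illustration, using the 17.3% three-year figure from earlier in the thread and the same flat-over-time simplification):

x = 0.173   # assumed 3-year failure probability of a single drive
n = 2       # number of drives in the array

raid0_failure = 1 - (1 - x) ** n   # array lost if ANY drive fails      -> ~31.6%
raid1_failure = x ** n             # array lost only if ALL drives fail -> ~3.0%

# Crude "second drive dies within a week of the first" estimate,
# spreading the failure chance evenly over 3 years = 156 weeks.
second_within_week = x * (x / (3 * 52))   # -> ~0.019%

print(raid0_failure, raid1_failure, second_within_week)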

Anyway, that's the best I can do given that I took my statistics courses 10-12 years ago.

PS: I personally prefer at least one largish two-drive RAID 1 array somewhere on my home network for automatic backup purposes. I also verify that the drives are still 100% readable by non-RAID controllers before using them for storage.
 

WizardOZ

Distinguished
Sep 23, 2006
250
0
18,780

Umm

1) You neglected to include the detail that different drives have different probabilities of failure. This will change the equation somewhat, and the output will be a worse probability. Mind you, I am very pleased that you used the correct initial equation.

2) Your assumption and result for RAID 1 are not necessarily reasonable.

3) Why are you saying that assuming a drive will fail on day one is false? MTBF provides zero indication of when a unit will actually fail. More critically, you fail to either acknowledge, include, or address many of the points about statistics, QC or warranty issues I raised earlier. I have actually had hard drives show up DOA, along with other components. On what basis do you say that this can't happen?

4) Given the error in your previous point, your calculation is overly optimistic. Try something closer to 25%.

I trust these clarifications are of use.
 

Codesmith

Distinguished
Jul 6, 2003
1,375
0
19,280
The likelihood of two hard drives failing within a week of each other can only be determined in a simple manner if you assume that, over the 3 years, a drive's chance of dying on any particular day is the same.

I never implied anything about the relative likelihood of a drive failing on day one vs the end of three years.

I was merely acknowledging that it is false to assume they are equal.

What I was trying to say is that my equation assumed that on any given day the drive was as likely to fail as on any other.

This simplification makes the graph of the instantaneous probability of failure over time completely flat, and greatly simplifies the math.

My assumption was obviously false, because the real curve wouldn't be flat; given the available information it should instead be estimated using a best-fit curve through the available data points.

Then you have to write an equation that represents the chance of a drive failing at a particular instant, and multiply that by the chance of the 2nd drive failing from that instant through the following 7 days.

It can be done, but it is messy and would take me about 2-5 hours because I haven't done anything like that in a while.

I think the results are a close estimate.

---

If you want better results, you can create a best-fit curve, determine the highest daily failure risk, and then use that rate for the whole period.

Much easier to calculate, and it places an upper bound on the actual probability.

That is, you could say with confidence that the probability is less than x%.
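
For what it's worth, the "messy" version isn't too bad to brute-force numerically. Here is a rough Python sketch (my own; a crude piecewise-constant daily failure curve built from the study's yearly rates stands in for a proper best-fit curve):

# Yearly failure rates (year 1, 2, 3) turned into a rough per-day failure probability.
yearly_rates = [0.017, 0.08, 0.086]

pmf = []          # pmf[d] = unconditional probability the drive dies on day d
alive = 1.0
for rate in yearly_rates:
    daily = rate / 365.0
    for _ in range(365):
        pmf.append(alive * daily)
        alive *= 1 - daily

days = len(pmf)   # 1095 days in the 3-year window
window = 7        # "within a week of each other"

# Sum over all pairs of failure days no more than 7 days apart.
prob = 0.0
for a in range(days):
    lo, hi = max(0, a - window), min(days, a + window + 1)
    prob += pmf[a] * sum(pmf[lo:hi])

print("P(both drives fail within a week of each other): %.4f%%" % (prob * 100))

With these made-up inputs it prints roughly 0.05% - still a tiny number, in the same ballpark as the flat-line estimate (a bit higher, since this counts either drive failing first and the failures bunch up in years 2 and 3).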

---
That the statistics are for the set of all drives used, and that there are potentially identifiable subsets with different failure statistics, in no way invalidates the statistics or the equations.

With all statistics there will be potential subsets that end up getting generalized. Sometimes a more detailed study will decrease the margin of error; sometimes it will increase it, if the subsets end up being too small.

The entire point of statistics is to generalize and to predict based on those generalizations.

Get too specific and, instead of finding that x% of 2007 model Y cars have defective anti-lock brakes, you get nothing but a list of people whose cars have been examined.

---
 

WizardOZ

Distinguished
Sep 23, 2006
250
0
18,780
I didn't say your stats were invalid. I said that your result was over-optimistic. The issue of within-brand and between-brand reliability is a reasonable concern. Yes, it does make the calculations messier and more difficult. That's how it goes.

There is nothing technically wrong with using simplifications - as long as you make it clear that you have done so, specify what they are, and state the overall impact of those simplifications. In most cases the rough results from simplified calculations are pretty close to the fussier calculations, especially as sample size goes up.