Assessing HDD reliability

July 18, 2009 1:18:22 PM

I have been looking into HDD reliability and I am stumped. Where are the facts?

Each drive has a rated mean time between failures (MTBF). But is it accurate? How can I check?

I am too much of a mathematician to accept anecdotal evidence from people who have had drives fail and say "Never buy <brand x>".

www.storagereview.com has a reliability survey. But they don't have any of the drives that I care about.

If Newegg were my friend, they would publish their reliability numbers (number shipped vs. number RMA'ed). If they do, I haven't seen them. I would also want to know if a particular batch was bad.

How are you assessing HDD reliability?

July 20, 2009 12:26:44 AM

For me it's trial and error, buddy, but that left me with a dead HD, trying to recover the lost data. Wish I had a mitigation plan like you do!
July 20, 2009 2:42:37 AM

suduma,
I hate it when it comes down to rolling the dice.

I decided on a Western Digital Caviar Black WD1001FALS 1TB 7200 RPM 32MB Cache SATA II. It comes with a 5-year warranty. Given Seagate's poor handling of the 7200.11 firmware issue I have decided to stay away from Seagate until they start caring about their customers again.

My mitigation plan is:
1. Download Data Lifeguard Tools 11.2 for DOS (CD) and run it to check the health of the drive before I place any data on it

2. Frequent backups

3. Use SMART to monitor HDD health and status

4. Use a high airflow case to keep the HDD temperature within operating constraints
July 20, 2009 3:32:14 AM

With any drive, you should always do frequent backups. I have seen no tendency for one drive maker's drives to outlast another's, with the exception of Maxtor; those seemed to have a high failure rate.

Google put out some data a while back stating that drives do not actually fail more at higher temperatures unless the temperatures are very high.

Quote:
We first look at the correlation between average temperature during the observation period and failure. Figure 4 shows the distribution of drives with average temperature in increments of one degree and the corresponding annualized failure rates. The figure shows that failures do not increase when the average temperature increases. In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend.

http://labs.google.com/papers/disk_failures.pdf
July 20, 2009 10:43:46 AM

Nukemaster,
Thank you. I came across that write up in my research. I too became less concerned about temperature. Later, I read this:
http://www.storagereview.com/guide/issuesCooling.html

which made me think again that I should pay attention to the temperature ranges specified by the manufacturer.

This is one of the reasons that I started this thread. I am finding disk drive reliability to be clear as mud. I think that I must be missing something.

I guess that what I really want is to understand the probability of getting a lemon for the various drive choices under consideration. I have never won the lottery so probability theory has worked so far. ;) 
July 20, 2009 1:46:59 PM

One measure of hard drive reliability is mean time between failures (MTBF). Unfortunately, it is based on some sort of statistical calculation (that I have no freakin' clue about :(  ).

Case in point: the WD 1 TB RE3 drive has a 1,200,000 hour MTBF. That's about 137 years of continuous operation. I am willing to bet that WD did not design it in the 1870s, build ten of them, run them continuously, have five fail by this year, and so declare a 1.2 million hour MTBF.

The new Seagate .12's have really nice specs - particularly the 500 GB platters on their larger drives. But like Mike, I remember the .11 firmware problems, and I have seen indications (anecdotal, admittedly) of problems with some of the .12 models.

So like Mike, I am avoiding Seagate and sticking with WD.

About temperature:
I have seen material posted that indicates everything from "We do not need to worry unduly about temperatures." to "Temperatures, especially high ones, are bad, bad, bad." including "Temperatures have an increasingly bad effect as drives age."

My more than 40 years of electronics experience indicates that temperature increases above "normal" are bad.

My media and gaming machine has three hard drives, a WD 640 GB Black and two 1 TB Greens, in the lower drive bay of an Antec 900. The highest hard drive temp I have seen so far is 31 C. They usually run around 27 - 29 C.
July 20, 2009 4:09:04 PM

MTBF, as used by the HDD industry, applies to "aggregate analysis of a large numbers of drives". It tells you nothing about any individual drive. One way to think of it is that if you took a large number of drives with a service life of five years and every five years swapped out the old drive for a new version of the same model, a drive with an MTBF of 500,000 hours should last, on average, 57 years before failing.

Tell that to the people who open their box and find a drive that is DOA.

Hmm, 57 / 5 = 11 plus a remainder. Does that mean that every 12th drive is a dud? Nah.

See, clear as mud.
January 8, 2010 5:54:54 PM

MikeJRamsey said:
MTBF, as used by the HDD industry, applies to "aggregate analysis of a large numbers of drives". It tells you nothing about any individual drive. One way to think of it is that if you took a large number of drives with a service life of five years and every five years swapped out the old drive for a new version of the same model, a drive with a MTBF of 500,000 hours should last 57 years before failing.

Tell that to the people who open their box and find a drive that is DOA.

Hmm, 57 / 5 = 11 plus a remainder. Does that mean that every 12th drive is a dud? Nah.

See, clear as mud.



Late, I know, but... MTBF is actually the inverse (sort of) of the failure rate of drives over a large population. To estimate the fraction of drives that will fail in any given year (or the probability YOUR drive will fail in any given year), you divide the number of power-on hours by the MTBF. Traditionally, companies use 8760 hours (365 days times 24 hours). So if a drive is rated at 1.2M hours, the probability of failure in any given year is 8760 divided by 1.2M, or ~0.73%.

I hope this helps...
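
The division described above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the second function assumes a constant failure rate (an exponential lifetime model), which is a common simplification and not something the drive makers in this thread have confirmed they use.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours, as in the post above


def afr_linear(mtbf_hours, power_on_hours=HOURS_PER_YEAR):
    """Approximate failure probability: power-on hours divided by MTBF."""
    return power_on_hours / mtbf_hours


def failure_probability(mtbf_hours, power_on_hours=HOURS_PER_YEAR):
    """Failure probability under an ASSUMED constant failure rate
    (exponential model): 1 - exp(-t / MTBF)."""
    return 1.0 - math.exp(-power_on_hours / mtbf_hours)


if __name__ == "__main__":
    # The two MTBF ratings mentioned in this thread.
    for mtbf in (1_200_000, 500_000):
        one_year = afr_linear(mtbf)
        one_year_exp = failure_probability(mtbf)
        five_years = failure_probability(mtbf, 5 * HOURS_PER_YEAR)
        print(f"MTBF {mtbf:,} h: linear {one_year:.2%}, "
              f"exponential {one_year_exp:.2%}, 5-year {five_years:.2%}")
```

For small ratios the two estimates nearly coincide, which is why the simple division is usually good enough for a one-year figure. Neither number says anything about whether a rated MTBF matches what happens in the field.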
April 30, 2010 2:32:25 AM

As an engineer, the training courses on reliability that I have taken say that (for electric motors) for each 10 degrees C above the design operating temperature, the life (MTBF) is reduced by 50%. This reduction in life is due to the increased rate of degradation of the electrical insulation inside the motor. I imagine that this 10 degree C "rule of thumb" for equipment life is also the case with electronics.

Kemo
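
The 10 degree C rule of thumb above is easy to put into numbers. The sketch below is only the arithmetic of that rule as stated for electric motors; the function name, the example figures, and the choice to apply no change below the design temperature are assumptions for illustration. Whether the rule carries over to hard drives is the poster's guess, and the Google study quoted earlier in the thread did not see that pattern.

```python
def derated_life(base_life_hours, temp_c, design_temp_c, halving_step_c=10.0):
    """Life after applying the rule of thumb: halved for every
    `halving_step_c` degrees C above the design operating temperature.

    ASSUMPTION: at or below the design temperature, life is left unchanged.
    """
    excess = max(0.0, temp_c - design_temp_c)
    return base_life_hours * 0.5 ** (excess / halving_step_c)


if __name__ == "__main__":
    # Hypothetical numbers: 100,000 hour life at a 40 C design temperature.
    for temp in (30, 40, 45, 50, 60):
        print(f"{temp} C: {derated_life(100_000, temp, 40):,.0f} hours")
```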