
2nd HDD failure in 3 months

January 20, 2010 1:05:02 PM

Running Windows 2003 Server with an Intel ICH7R/DH controller, four drives in RAID 5.

One of the original Samsungs that came with the system failed, and I replaced it with a WD Caviar Black (1TB) in early November. A couple of weeks ago the new WD failed, and I took it back to the brick-and-mortar store, where I was given a 1:1 swap for a new one. This morning I found the message that the replacement drive has failed as well. The other drives are running just fine.

Any ideas what might be killing the WDs?

January 20, 2010 1:34:54 PM

Hi newcomer, and welcome to the Tom's Hardware forum.

Check the power, current and voltage of your PSU.
January 20, 2010 2:01:40 PM

Shall do. Is the Coolmax PS-228 a decent tester?
January 20, 2010 2:16:11 PM

Yeah, that looks good. Or test with a multimeter.
January 20, 2010 2:22:12 PM

It looks like the Coolmax will test the voltage on the SATA line as well as on the ATX connector - might as well test all possible points of failure.
January 20, 2010 9:45:27 PM

All voltages were spot-on normal. Something else must be going on.
January 20, 2010 10:10:40 PM

A faulty PSU would not necessarily show up all the time. It could be a spike on the line that appears every now and then and kills the drives. It could also be static, overheating, or just pure coincidence (can't rule that out). Try running that drive on a different rail if possible, and change the SATA cable as well. (It could also be a faulty SATA port on the motherboard/RAID card.)
January 20, 2010 10:31:10 PM

I've moved the (replacement) drive from the cage to the floor of the tower at the front of the case (just inside the vents). I also added a video cooler fan to hopefully increase airflow out the back. Here's hoping that it doesn't die again.
January 20, 2010 11:21:26 PM

Check with SMART if possible every now and then to see if the drive experiences any problems during its lifetime. Cross your fingers!
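
If it helps, here's a minimal sketch (my own, not something from this thread) of how one might automate that periodic SMART check with smartmontools. It assumes smartctl is installed and on the PATH, and the device path is just a placeholder - on the Windows build of smartctl it might be something like /dev/sda or /dev/pd0:

import subprocess

# SMART attributes worth watching on a drive that keeps getting kicked out of an array.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable", "UDMA_CRC_Error_Count"}

def check_smart(device="/dev/sda"):
    # "smartctl -A" prints the vendor attribute table; we just pull out the raw values.
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    for line in out.splitlines():
        parts = line.split()
        # Table rows look like: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(parts) >= 10 and parts[1] in WATCHED:
            print(parts[1], "raw value:", parts[9])

if __name__ == "__main__":
    check_smart()

Rising raw values on any of those attributes are the usual early warning that a drive is on its way out.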
January 20, 2010 11:42:10 PM

Are you diagnosing these as disk failures just because the RAID controller declared them as dead? Have you actually tried using the drives outside the RAID set to see if they're really and truly kaput?

RAID controllers often place a time limit on how long they'll wait for a drive to respond, and some drives may exceed the timeout if they're trying to read a block that has ECC errors. This is probably made worse by "Green" drives, which can have an additional delay if they're accessed after spinning down because they haven't been active for a while. (I understand you're not using "Green" drives; I'm just mentioning it for the benefit of lurkers.)

This is why the HDD manufacturers make special "RAID" versions of their drives (e.g., the WD "RE3" and "RE4" series) - the "RAID" versions have different firmware that ensures they will respond quickly to every request. They'll actually give up faster on a block that has ECC errors, on the theory that in a redundant RAID set the data can be recovered from another drive anyway.

So I'd be a little cautious about assuming a drive is "really" dead just because a RAID controller kicks it out of a RAID set.
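
For anyone who wants to check this on their own drives, here's a rough sketch (not sminlal's procedure, just an illustration) that queries - and optionally sets - the SCT Error Recovery Control timeout via smartmontools. It assumes the drive actually supports SCT ERC (many desktop drives don't, or lock it out in firmware), and the device path is just a placeholder:

import subprocess

def show_erc(device="/dev/sda"):
    # Report the drive's current read/write error-recovery timeouts (in 100 ms units).
    subprocess.run(["smartctl", "-l", "scterc", device], check=False)

def set_erc(device="/dev/sda", deciseconds=70):
    # Ask the drive to give up on an unreadable sector after 7 seconds, so the RAID
    # controller can rebuild the data from parity instead of dropping the drive.
    # Note: this setting usually resets on power cycle and only sticks if the
    # firmware allows it.
    subprocess.run(["smartctl", "-l", f"scterc,{deciseconds},{deciseconds}", device],
                   check=False)

if __name__ == "__main__":
    show_erc()

The "RE"-style drives mentioned above ship with a short error-recovery timeout (around 7 seconds) out of the box; on plain desktop drives the feature is often disabled entirely.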
January 21, 2010 12:19:29 AM

Interesting question, sminlal - once the RAID controller said the drive was dead I tried running the WD diagnostics and couldn't get any signs of life out of it, but I didn't try putting the drive in another machine to see if anything would happen.

I think part of the problem is this Intel RAID controller, which has generated many timeout errors throughout the life of the machine - Intel never bothered to do anything about it, and it hasn't really caused any problems for me (until now, maybe).

January 21, 2010 12:58:56 AM

Excellent point from sminlal! Although, as you said, you did try the WD diagnostics and they reported the drive as dead, so there's not much hope there.

On another point, timeout errors would not kill a hard drive, although a corrupt BIOS (of the RAID controller) could actually make a hard drive work so badly that in the long run it would damage it (physically, not its electronics).
January 27, 2010 11:13:40 PM

Moot question at this point... I've picked up some real server hardware (a PowerEdge T310) with better networking (dual gigabit Ethernet) and a dedicated PERC 6/i RAID controller (as opposed to the on-board controller in the Precision, which is going to be repurposed as a workstation with a non-RAID configuration). If the PSU is indeed losing capacity, then the machine can't be relied upon as the main server anymore.
January 28, 2010 1:57:49 AM

ldsskier said:
Moot question at this point... I've picked up some real server hardware (a PowerEdge T310) with better networking (dual gigabit Ethernet) and a dedicated PERC 6/i RAID controller (as opposed to the on-board controller in the Precision, which is going to be repurposed as a workstation with a non-RAID configuration). If the PSU is indeed losing capacity, then the machine can't be relied upon as the main server anymore.


True... what brand and model is your PSU? And what sort of hardware do you have in there?
January 28, 2010 12:28:46 PM

The old machine was a Dell - a Precision 380, with a proprietary PSU that isn't easy to find. From day one I was getting frequent timeout errors in iastor.sys that nobody could ever resolve, so I'm guessing it was simply choking on the throughput.

New machine:

X3430 Xeon Processor, 2.4 GHz 8M Cache, Turbo
4GB Memory (2x2GB), 1333MHz Dual Ranked UDIMM
PERC6i SAS RAID Controller
250GB 7.2k RPM Serial ATA (x3 in RAID 5)
onboard dual gigabit ethernet
January 28, 2010 10:04:50 PM

ldsskier said:
The old machine was a Dell - a Precision 380, with a proprietary PSU that isn't easy to find. From day one I was getting frequent timeout errors in iastor.sys that nobody could ever resolve, so I'm guessing it was simply choking on the throughput.

New machine:

X3430 Xeon Processor, 2.4 GHz 8M Cache, Turbo
4GB Memory (2x2GB), 1333MHz Dual Ranked UDIMM
PERC6i SAS RAID Controller
250GB 7.2k RPM Serial ATA (x3 in RAID 5)
onboard dual gigabit ethernet



The PSU you are describing does not sound good enough to me. I would get a good-brand 500W unit as a minimum.
This could very well be your problem.
January 29, 2010 8:23:16 AM

I had a similar problem with repeated disk failures in my NAS. Eventually Seagate admitted you have to buy 24/7-rated disks to guarantee operation in such an environment. At first they claimed it was an "unsupported configuration", but when I asked where I could have learned that, they backed off and replaced my drives free of charge! Since then I have not had any problems.