2nd HDD failure in 3 months

ldsskier

Distinguished
Jan 20, 2010
13
0
18,510
Running Windows 2003 Server with an Intel ICH7R/DH controller, four drives in RAID 5.

One of the original Samsungs that came with the system failed, and I replaced it with a WD Caviar Black (1 TB) in early November. A couple of weeks ago the new WD failed, so I took it back to the brick-and-mortar store and was given a 1:1 swap for a new one. This morning I found a message saying the replacement drive has failed as well. The other drives are running just fine.

Any ideas what might be killing the WDs?
 

darkguset

Distinguished
Aug 17, 2006
1,140
0
19,460
A faulty PSU would not necessarily show up all the time. It could be a spike on the line that appears every now and then and kills the drives. It could also be static, overheating, or just pure coincidence (can't rule that out). Try running that drive on a different rail if possible, and change the SATA cable as well. (It could also be a faulty SATA port on the motherboard/RAID card.)
 

ldsskier

I've moved the (replacement) drive from the cage to the floor of the tower at the front of the case (just inside the vents). I also added a video cooler fan to hopefully increase airflow out the back. Here's hoping that it doesn't die again.
 
sminlal

Are you diagnosing these as disk failures just because the RAID controller declared them dead? Have you actually tried using the drives outside the RAID set to see if they're really and truly kaput?

RAID controllers often place a time limit on how long they'll wait for a drive to respond and some drives may exceed the timeout if they're trying to read a block that has ECC errors. This is probably made worse by "Green" drives which can have an additional delay if they're accessed after spinning down because they haven't been active for a while. (I understand you're not using "Green" drives, just mentioning it for the benefit of lurkers).

This is why the HDD manufacturers make special "RAID" versions of their drives (e.g., the WD "RE3" and "RE4" series). The "RAID" versions have different firmware that ensures they respond quickly to every request. They'll actually give up faster on a block that has ECC errors, on the strategy that in a redundant RAID set the data can be recovered from another drive anyway.

So I'd be a little cautious about assuming a drive is "really" dead just because a RAID controller kicks it out of a RAID set.
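
For lurkers, here is a minimal Python sketch of that timeout mismatch. It is only a toy model, not how any real controller or firmware is implemented. The `Drive` class, the `controller_drops` helper, and all of the timing numbers are assumptions made up for illustration, not measured values for the ICH7R or any WD drive.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    # Hypothetical worst-case time (seconds) the drive may spend
    # retrying a block with ECC errors before it answers the host.
    max_recovery_s: float


def controller_drops(drive: Drive, controller_timeout_s: float) -> bool:
    """Return True if a worst-case retry outlasts the controller's patience."""
    return drive.max_recovery_s > controller_timeout_s


# Assumed value: how long the RAID controller waits for a reply.
CONTROLLER_TIMEOUT_S = 8.0

drives = [
    # Assumed: desktop firmware keeps retrying for a long time.
    Drive("desktop drive", max_recovery_s=60.0),
    # Assumed: RAID-edition firmware gives up quickly and reports the error.
    Drive("RAID-edition drive", max_recovery_s=7.0),
]

for d in drives:
    if controller_drops(d, CONTROLLER_TIMEOUT_S):
        verdict = "kicked out of the array"
    else:
        verdict = "stays in the array"
    print(f"{d.name}: {verdict}")
```

With these made-up numbers the desktop drive gets dropped even though it is not dead, while the RAID-edition drive reports the bad block in time and the controller can rebuild that block from the other drives.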
 

ldsskier

Interesting question, sminlal. Once the RAID controller said the drive was dead I tried running the WD diagnostics and couldn't get any signs of life out of it, but I didn't try putting the drive in another machine to see if anything would happen.

I think part of the problem is this Intel RAID controller, which has generated many timeout errors throughout the life of the machine. Intel never bothered to do anything about it, and it hasn't really caused any problems for me (until now, maybe).

 

darkguset

Excellent point from sminlal! Although, as you said, you did try the WD diagnostics and they showed the drive as dead, so there's not much hope there.

On another point, timeout errors would not kill a hard drive, although a corrupt BIOS (of the RAID controller) could actually make a hard drive work so badly that in the long run it would damage it (physically, not its electronics).
 

ldsskier

Moot question at this point... I've picked up some real server hardware (PowerEdge T310) with better networking (dual gigabit Ethernet) and a dedicated PERC 6/i RAID controller, as opposed to the on-board controller in the Precision, which is going to be repurposed as a workstation with a non-RAID configuration. If the PSU is indeed losing capacity, then the machine can't be relied upon as the main server anymore.
 

darkguset

True... what brand and model is your PSU? And what sort of hardware do you have in there?
 

ldsskier

The old machine was a Dell Precision 380. It has a proprietary PSU that isn't easy to find. From day one I was getting frequent timeout errors in iastor.sys that nobody could ever resolve, so I'm guessing it was simply choking on the throughput.

New machine:

X3430 Xeon Processor, 2.4 GHz 8M Cache, Turbo
4GB Memory (2x2GB), 1333MHz Dual Ranked UDIMM
PERC6i SAS RAID Controller
250GB 7.2k RPM Serial ATA (x3 in RAID 5)
Onboard dual gigabit Ethernet
 

darkguset

The PSU you are describing does not sound good enough to me. I would get a good-brand 500W unit as a minimum.
This could very well be your problem.
 

bengtner

Distinguished
Jan 28, 2010
3
0
18,510
I had a similar problem with repeated disk failures in my NAS. Seagate finally admitted that you have to buy 24/7-rated disks to guarantee operation in such an environment. At first they said this was an "unsupported configuration", but when I asked where I could have learned this, they backed off and replaced my drives free of charge! Since then I have not had any problems.