2nd HDD failure in 3 months

Running Windows 2003 Server with an Intel ICH7R/DH controller, four drives in RAID 5.

One of the original Samsungs that came with the system failed and I replaced it with a WD Caviar Black (1 TB) in early November. A couple of weeks ago the new WD failed, so I took it back to the brick-and-mortar store and was given a 1:1 swap for a new one. This morning I found a message that the replacement drive has failed as well. The other drives are running just fine.

Any ideas what might be killing the WDs?
  1. Hi newcomer, and welcome to the Tom's Hardware forum.

    Check the power, current and voltage of your PSU.
  2. Shall do. Coolmax PS-228 a decent tester?
  3. Yeah, that looks good. A multimeter would work too.
  4. It looks like the Coolmax will test the voltage on the SATA line as well as on the ATX connector - might as well test all possible points of failure.
  5. All voltages were spot-on normal. Something else must be going on.
  6. A faulty PSU would not necessarily show up all the time. It could be a spike on the line that appears every now and then and kills the drives. It could also be static, overheating, or just pure coincidence (can't rule that out). Try running that drive on a different rail if possible, and change the SATA cable as well. (It could also be a faulty SATA port on the motherboard/RAID card.)
  7. I've moved the (replacement) drive from the cage to the floor of the tower at the front of the case (just inside the vents). I also added a video cooler fan to hopefully increase airflow out the back. Here's hoping that it doesn't die again.
  8. Check with SMART if possible every now and then to see if the drive experiences any problems during its lifetime. Cross your fingers!
  9. Are you diagnosing these as disk failures just because the RAID controller declared them as dead? Have you actually tried using the drives outside the RAID set to see if they're really and truly kaput?

    RAID controllers often place a time limit on how long they'll wait for a drive to respond and some drives may exceed the timeout if they're trying to read a block that has ECC errors. This is probably made worse by "Green" drives which can have an additional delay if they're accessed after spinning down because they haven't been active for a while. (I understand you're not using "Green" drives, just mentioning it for the benefit of lurkers).

    This is why the HDD manufacturers make special "RAID" versions of their drives (e.g., the WD "RE3" and "RE4" series drives) - the "RAID" versions have different firmware that ensures they will respond quickly to every request. They'll actually give up faster on a block that has ECC errors, based on the strategy that in a redundant RAID set the data can be recovered from another drive anyway.

    So I'd be a little cautious about assuming a drive is "really" dead just because a RAID controller kicks it out of a RAID set.
  10. Interesting question, sminlal - once the RAID controller said the drive was dead I tried running the WD diagnostics and couldn't get any signs of life out of it, but I didn't try putting the drive in another machine to see if anything would happen.

    I think part of the problem is this Intel RAID controller, which has generated many timeout errors throughout the life of the machine. Intel never did anything about it, and it hasn't really caused any problems for me (until now, maybe).
  11. Excellent point from sminlal! Although, as you said, you did try the WD diagnostics and they reported the drive as dead, so there's not much hope there.

    On another point, timeout errors would not kill a hard drive, although a corrupt BIOS (of the RAID controller) could actually make a hard drive work so badly that in the long run it would damage it (physically, not its electronics).
  12. Moot question at this point... I've picked up some real server hardware (PowerEdge T310) with better networking (dual gigabit Ethernet) and a dedicated PERC 6/i RAID controller (as opposed to the on-board controller in the Precision, which is going to be repurposed as a workstation with a non-RAID configuration). If the PSU is indeed losing capacity then the machine can't be relied upon as the main server anymore.
  13. True... what brand and model is your PSU? And what sort of hardware do you have in there?
  14. The old machine was a Dell - a Precision 380. It has a proprietary PSU that isn't easy to find. From day one I was getting frequent timeout errors in iastor.sys that nobody could ever resolve, so I'm guessing it was simply choking on the throughput.

    New machine:

    X3430 Xeon Processor, 2.4 GHz 8M Cache, Turbo
    4GB Memory (2x2GB), 1333MHz Dual Ranked UDIMM
    PERC6i SAS RAID Controller
    250GB 7.2k RPM Serial ATA (x3 in RAID 5)
    onboard dual gigabit ethernet
  15. The PSU you are describing does not sound good enough to me. I would get a good-brand 500W unit as a minimum.
    This could well be your problem.
  16. I had a similar problem with repeated disk failures in my NAS. Seagate finally admitted that you have to buy 24/7-rated disks to guarantee operation in such an environment. At first they said this was an "unsupported configuration", but when I asked where I could have learned this they backed off and replaced my drives free of charge! Since then I have not had any problems.
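
The timeout behaviour described in answer 9 can be sketched as a toy model. This is only an illustration in Python, not anything a controller or drive firmware actually runs: the `Drive` class, both helper functions, and every number (the 8-second controller timeout and the per-drive recovery and spin-up times) are hypothetical values chosen to show the comparison, not figures from the thread or from any data sheet.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    # Hypothetical worst-case time the drive spends retrying a block with ECC errors.
    error_recovery_s: float
    # Hypothetical extra delay if the drive had spun down before the request.
    spin_up_s: float = 0.0


def worst_case_response_s(drive: Drive) -> float:
    """Longest time the drive may take to answer a single request."""
    return drive.error_recovery_s + drive.spin_up_s


def controller_drops(drive: Drive, controller_timeout_s: float) -> bool:
    """True if the drive can exceed the controller's wait limit and be marked failed."""
    return worst_case_response_s(drive) > controller_timeout_s


if __name__ == "__main__":
    timeout_s = 8.0  # assumed controller limit, for illustration only
    drives = [
        Drive("desktop drive", error_recovery_s=30.0),
        Drive("RAID-edition drive", error_recovery_s=7.0),
        Drive("green drive after spin-down", error_recovery_s=7.0, spin_up_s=5.0),
    ]
    for d in drives:
        verdict = "dropped from array" if controller_drops(d, timeout_s) else "stays in array"
        print(f"{d.name}: up to {worst_case_response_s(d):.0f}s -> {verdict}")
```

With these made-up figures, only the drive with the short recovery limit stays in the array. A drive dropped this way may still be healthy, which is why it is worth testing it outside the RAID set before writing it off.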