RAID 0 Failure magically Fixes itself (repeatedly)

I have been running two WD 320 GB disks in RAID 0 (ICH10) for a few months. Periodically the system crashes completely.

After a soft reset the system will not boot at all, reporting "Error Occurred" on the first disk in the array. However, after power cycling the machine the system boots with no problems at all, and the fault can be cleared using the Intel manager.

My first thoughts were that the disk had an intermittent fault and was on its way out. I replaced the disk and all was well for a few weeks. The fault has now returned on the new disk. It can be cured the same way, power off/power on.

Any thoughts?

System is Win 7 x64 Home edition
Q6700, 4 GB Kingston RAM
  1. Perhaps some pending (bad) sectors on the HDD that can not be swapped until written to.

    Please try to fetch the SMART data of the RAID member disks; Intel has a Windows-based utility for that, if I recall correctly.

    If one of your disks has pending sectors, you should repair that immediately, for example with a re-write command in Linux. The alternative is to zero-write the entire drive and re-create the RAID array, meaning you need to back up all existing data on the current array.
  2. Recreating the array proved ineffective: when I replaced the faulty drive and restored the backup of the drives, the fault reappeared.
    Will have a look at the SMART data for pending sectors, but I am sceptical this is the problem.
  3. Why are you sceptical? If you have a pending sector, you WILL run into problems if Windows tries to read that sector. With the low-quality fakeRAID drivers available in Windows, this means you get a broken array very soon; most drivers actually split the array into one array with the 'good' disk and a 'lost' array with the failed disk. This failure is actually minor and can be corrected manually or using utilities like SpinRite.

    Re-creating the array does nothing to resolve the problem; it only rewrites the configuration data on the last sector of each disk. But that's not the problem; you probably have a surface media error on one of the member disks.

    Once a 'pending sector' is fixed, it is swapped with a reserve sector, so the operating system only sees the 'good' sectors and the 'bad' sectors are hidden from it. But this swapping cannot happen until new data has been written to the bad sector; that allows the HDD to forfeit the unknown contents of the bad sector and simply stop using it by mapping it to a reserve sector.

    To me, this seems like the most likely scenario for your freezes/failures.
  4. I was sceptical because you said the alternative was to zero-write the entire drive, whereas I replaced the drive, which to me was effectively the same thing, and this did not fix the problem.
    The solution you offer implies that this is a problem I may not be able to fix permanently and that it is likely to recur?
    Will try to find a program to fetch the SMART data tonight.
  5. So you believe it's not a disk problem at all and your disks are 100% okay?

    Can't hurt to check some additional stuff, like:
    - disk read tests using Ubuntu
    - memory test using memtest86
    - check SMART logs
    - measure the +5V and +12V power levels when running and in the soft-OFF state (soft-OFF means you turn the system off with the power button, but the mainboard still gets "standby power")
    - do you run with Intel 'write caching' enabled or disabled? I recommend disabling it for now

    You can do the first two tests easily: download any Ubuntu CD image, burn it, boot from it (live CD option), open a terminal and execute:

    sudo dd if=/dev/sda of=/dev/null bs=1M
    sudo dd if=/dev/sdb of=/dev/null bs=1M

    Do not make errors with this command! If properly run, it will only read from /dev/sda from sector 1 to the last sector. If your disks are okay, it will run for 2 hours, and then give some output of the speed. If it contains bad sectors, it will run for some time and then output "I/O error" or something like that. Note that until it finishes or finds an error, this command does not output anything. So it may appear to do nothing for some time, just be patient.
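Since the plain dd command stays silent until it ends, a small wrapper can make the read test easier to follow. The sketch below is only an illustration: the function name read_test is made up here, and it assumes a GNU dd that may or may not support status=progress (it probes for that and falls back to the silent behaviour). It still only reads from the disk and writes to /dev/null.

```shell
#!/bin/sh
# Read-only scan of one disk (or any file); nothing is written to it.
# read_test is a hypothetical helper name, not part of any tool.
# Returns 0 if everything was readable, 1 on a read error, 2 if the
# device cannot be opened for reading at all.
read_test() {
    dev="$1"
    if [ ! -r "$dev" ]; then
        echo "cannot read $dev" >&2
        return 2
    fi
    # Probe whether this dd understands status=progress (newer GNU dd);
    # older versions reject it, so fall back to no progress output.
    progress=""
    if dd if=/dev/null of=/dev/null status=progress 2>/dev/null; then
        progress="status=progress"
    fi
    # $progress is left unquoted on purpose so an empty value vanishes.
    if dd if="$dev" of=/dev/null bs=1M $progress; then
        echo "$dev: read OK"
    else
        echo "$dev: read FAILED (see dd error above)"
        return 1
    fi
}
```

From a root shell with the function loaded, `read_test /dev/sda` followed by `read_test /dev/sdb` would run the same two read tests as above, one disk at a time.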

    As you have two disks, I'm assuming they are named /dev/sda and /dev/sdb. Since this is a read test, there is no risk if you choose the wrong disk; for example, your SATA CD/DVD drive may be /dev/sda too. To verify what disks you have:

    dmesg | grep sd
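If the dmesg output is too noisy, lsblk can give a shorter list, assuming the live CD is recent enough to ship it (older releases may not). The filter below is a sketch with a made-up function name, only_disks; it keeps whole disks and drops optical drives and loop devices from the lsblk listing.

```shell
#!/bin/sh
# Filter for the output of: lsblk -d -n -o NAME,TYPE,SIZE
# Keeps rows whose TYPE column is "disk" and prints the device path
# and size. Optical drives show up as TYPE "rom" and are skipped.
# only_disks is a hypothetical helper name.
only_disks() {
    awk '$2 == "disk" { print "/dev/" $1, $3 }'
}

# Usage on the live CD (not run here):
#   lsblk -d -n -o NAME,TYPE,SIZE | only_disks
```

With two 320 GB RAID members attached, this should leave two lines, one per member disk, which are the names to pass to dd.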

    The memory test is important too; if you have not done it already, I strongly advise you do, just to be certain the error is not there. If you downloaded Ubuntu, the boot menu also has a "Memory test" option which runs memtest86+. Once selected, simply wait a few hours to let it make a full pass.
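The SMART check for pending sectors mentioned earlier in the thread can also be done from the same live CD. This is a sketch that assumes the smartmontools package is installed (it may need to be installed on the live CD first) and that the drive reports the usual Current_Pending_Sector attribute with the raw value in the tenth column of `smartctl -A`; the function name pending_sectors is made up for illustration.

```shell
#!/bin/sh
# Extract the raw Current_Pending_Sector count from `smartctl -A` output
# read on stdin. Prints the raw value (10th column of the attribute row).
# Exits non-zero if the attribute is not present in the input.
# pending_sectors is a hypothetical helper name.
pending_sectors() {
    awk '$2 == "Current_Pending_Sector" { print $10; found = 1 }
         END { if (!found) exit 1 }'
}

# Usage (not run here):
#   sudo smartctl -A /dev/sda | pending_sectors
#   sudo smartctl -A /dev/sdb | pending_sectors
```

A value of 0 on both disks would argue against the pending-sector theory; anything above 0 means that disk has sectors waiting to be rewritten or remapped.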