General question about RAID recovery

siler

Reputable
Oct 16, 2014
I've worked with a variety of RAID setups - including software RAID on Linux and BSD, as well as hardware RAID controllers such as 3ware.

One question I have is: how exactly does a RAID controller (either software or hardware) really know that a drive has failed?

In my experience, most of the time when a single non-RAID drive fails, it doesn't simply stop working all of a sudden. It keeps kicking along - the BIOS continues to detect it, and it mostly continues to work. But random I/O errors start happening. You know your drive is toast when you type ls in the shell and get an I/O error. But sometimes the problems are more subtle, such as random, weird behavior - like everything works fine except you get an "out of disk space" error even though df -h reports that space is available, or I/O operations work but are extremely slow.

The point is - the symptoms of a bad drive range from subtle errors to outright failures (where the drive isn't even detected by the BIOS). How can a RAID setup compensate for this whole range of potential behavior?
 

popatim

Titan
Moderator
Motherboard and software RAID typically rely on SMART and the hard drive's own ability to correct errors, whereas a good hardware RAID controller also uses SMART and, in addition, constantly scans the drives' sectors for errors during periods of low usage. As for correcting errors, a good card would rather the drive try for only a limited time, then give up and let the controller get the data from the parity info. This is called TLER (Time-Limited Error Recovery). Many desktop drives do not have it, and a drive without it can keep retrying a bad sector for so long that the controller stops waiting and treats the whole drive as failed. That is why such drives can cause issues when the controller is designed with TLER in mind and a drive hits something like an unrecoverable sector.
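
To make the "get the data from the parity info" step concrete, here is a minimal Python sketch. It assumes single XOR parity as used in RAID 5; the block contents and the names xor_blocks and rebuild_block are made up for illustration and are not taken from any real controller, which would do this on whole stripes in firmware or in the kernel.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(stripe, parity, missing_index):
    """Rebuild one unreadable data block from the surviving blocks plus parity.

    stripe        -- list of data blocks; the entry at missing_index is ignored
    parity        -- XOR of all data blocks in the stripe
    missing_index -- position of the block the drive gave up on
    """
    survivors = [blk for i, blk in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors + [parity])


# Hypothetical stripe spread across three data drives.
stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(stripe)

# Pretend the second drive reported an unreadable sector after its time limit.
damaged = [stripe[0], None, stripe[2]]
recovered = rebuild_block(damaged, parity, missing_index=1)
print(recovered == stripe[1])
```

The point of the sketch is only that the controller does not need the failing drive to succeed: once the drive reports the error quickly, the missing block can be recomputed from the other drives.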
 