I'm using an Areca 1210 controller with four 500GB Western Digital RE2 drives in a RAID5. The box was running Debian with Xen, with most of the storage allocated to Xen VMs via LVM. The setup had been running great for months; then last night one of the drives failed, followed by another drive failure later in the night. The server was up and running and so were three of the VMs. The VM that was down was the NFS server with the largest LVM volume (800GB). Xen was kicking back some errors as I was trying to restart the NFS VM, so I went to reboot. The server failed to boot (no disk error) and the controller reported a degraded raidset.
I understand that a two-drive failure in a RAID5 is unrecoverable, so that's not what the question is about.
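For anyone following along who wants the reason spelled out, here is a minimal sketch of single-parity XOR, which is the idea RAID5 is built on. The block contents are made up, and real RAID5 rotates parity across the disks and works on much larger stripes, so treat this as an illustration only, not how the Areca firmware does it.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks of one stripe (made-up contents); parity lives on a fourth disk.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# One disk lost: XOR the survivors with parity to get the missing block back.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]

# Two disks lost: the survivors only yield the XOR of the two missing blocks,
# which is one equation with two unknowns, so neither block can be recovered.
xor_of_missing = xor_blocks([data[2], parity])
assert xor_of_missing == xor_blocks([data[0], data[1]])
```

With one member gone the array can still answer every read; with two gone, the parity no longer pins down either missing block.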
The weird thing is that SMART reports the drives as healthy and so does the quick scan in Western Digital's Data Lifeguard Diagnostic (running the full scan now, takes like two hours). Also, I find it strange that the server was up and running with two "drive failures". The other thing is that it seems highly unlikely that two Raid Edition drives would fail so close to each other, considering they're rated at 1.2 million hours MTBF.
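To put a rough number on "highly unlikely", here is a back-of-the-envelope sketch. It assumes independent drives with a constant failure rate (an exponential model) and a 12-hour gap between the two failures; both are assumptions of mine, not something the spec sheet or the controller log says.

```python
import math

MTBF_HOURS = 1.2e6          # rated MTBF quoted above
HOURS_PER_YEAR = 24 * 365
REMAINING_DRIVES = 3        # drives left after the first failure
WINDOW_HOURS = 12           # assumed gap between the two failures

def failure_probability(hours, drives=1, mtbf=MTBF_HOURS):
    """P(at least one of `drives` fails within `hours`), assuming
    independent drives with a constant failure rate of 1/mtbf."""
    return 1.0 - math.exp(-drives * hours / mtbf)

annual = failure_probability(HOURS_PER_YEAR)
second = failure_probability(WINDOW_HOURS, drives=REMAINING_DRIVES)

print(f"annual failure probability per drive: {annual:.2%}")
print(f"chance of a second failure within {WINDOW_HOURS}h: {second:.4%}")
```

Under those assumptions that is roughly 0.7% per drive per year, and about a 1 in 33,000 chance of a second independent failure in the window. The model says nothing about correlated causes (rebuild stress, a timeout on the controller side, a shared power or cabling problem), which is exactly why two "failures" this close together look suspicious.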
Even the Areca controller BIOS doesn't report the drives as "failed". It just says the volume and raidsets are degraded, with no option to rebuild. I replaced one of the "failed" drives and assigned a new drive as a hot spare, hoping it would kick off the rebuild process, but no such luck.
So right now I'm running the full (extended) Western Digital Data Lifeguard Diagnostic scan against one of the drives; there's about an hour left. If it comes back as "healthy", what would you try next?
Well, the extended scan against both "failed" drives using the Western Digital Data Lifeguard Diagnostic utility reported NO errors. Out of the tools I've tried so far (smartctl, WDC diagnostics), only the Areca controller feels the drives are bad.
Areca replied basically saying I'm ****. I know Areca makes highly regarded controllers so I'm not going to single them out for bashing, especially considering that many other highly regarded controllers have crapped themselves when a drive in a RAID5 set dies and the controller tries to do an autorebuild which stresses the other drives and boosts the chances that another might croak (which I'm guessing happened in my case). I do wanna give props to Areca for replying to my query through their web form. I exchanged a total of three emails with Kevin from Areca and even though I didn't get a solution, I want to recognize them for trying.
One option would be to go to RAID6, which can survive two drive failures. But I think I'll give this Areca controller a rest, so last night I set up a new Debian Xen box with two small disks in Linux software RAID1 for the Xen host OS and two big disks in software RAID1 + LVM for the VMs. A disk failure with this setup should leave me with a mountable/accessible/working second drive (not the garbage this RAID5 failure left me with).
RAID1 is expensive, so I'll give Solaris + ZFS a try on the storage box with NFS4 and Samba for Linux/Windoze clients. I've been hearing great things about ZFS, let's see if it lives up to the hype.
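To make the cost trade-off concrete, here is a small sketch of nominal usable capacity for the levels mentioned so far, using four 500GB disks like the ones from the dead array. It ignores metadata and filesystem overhead, and it treats "raid1" on four disks as two mirrored pairs, which is my simplification.

```python
def usable_gb(level, disks, size_gb):
    """Nominal usable capacity in GB, ignoring metadata and filesystem overhead."""
    if level == "raid1":        # treated here as mirrored pairs
        return (disks // 2) * size_gb
    if level == "raid5":        # one disk's worth of parity
        return (disks - 1) * size_gb
    if level == "raid6":        # two disks' worth of parity
        return (disks - 2) * size_gb
    raise ValueError(f"unknown level: {level}")

DISKS, SIZE_GB = 4, 500

for level in ("raid1", "raid5", "raid6"):
    gb = usable_gb(level, DISKS, SIZE_GB)
    print(f"{level}: {gb} GB usable of {DISKS * SIZE_GB} GB raw")
```

With only four disks, mirroring and RAID6 both end up at half the raw space, and RAID5 buys one extra disk's worth at the price of surviving only a single failure.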
The other reason to go with Linux software RAID is that it's well understood (beauty of Open Source) and widely used. Running it on Debian Stable means a rock-solid Linux distribution supported by a very strong community. From what I see, the Solaris community is strong as well.
Anyway, I'm off to trying to salvage some data off the four disks the failed RAID5 left me with. Anyone know of good tools to do this?
There are quite a few benefits to going with a software raid as opposed to a hardware raid, specifically not being tied to your controller if it should happen to be the faulty part. You are also less likely to experience data corruption issues due to a bad controller (which you won't even notice until you find your files are no longer able to run).
Linux raids (specifically using mdadm) are bulletproof, and depending on your hardware, can even outperform some hardware raids, especially if you are looking at controller cards in the sub $600 range.
In fact, Tom's had an article not too long back about VST Pro 2008, which appears to have some nice functionality for a reasonable price, as well as being able to expand in the future should you need to.
As for how to recover data from your array... I hope you made backups; otherwise you are going to be laying out a bit of money to have someone go in and rebuild parity with two members being hosed. One of the beauties of running the VMs is that you essentially have to back up only the host environment in order to recover all the VMs, and then back up any volumes that were shared as opposed to mounted in the VM. If nothing else, a couple of external SATA terabyte drives would be all you need. I know, you are running a RAID array, why should you have to back up? For exactly this reason.
Yeah, I got some backups (thank god), but some of the latest stuff didn't get backed up yet. It's definitely a major pain, but not the end of the world. I had a proper nightly routine backing up to my off-site servers at a colo, but I had to get my servers out of there at the beginning of June and hadn't rebuilt my process to back up to another destination. Just unfortunate timing, I guess.