Recover RAID data with 2 partially failed drives

raypp2

Oct 6, 2010
What is the best way to pull off data from this situation:

1) The RAID array alerts me that Drive 8 has failed
2) I check the drive, find no physical errors, and initiate a rebuild onto the same drive
3) During the rebuild, the process stalls at 10%, citing read errors on Drive 10
4) I clone both Drive 10 and Drive 8 to keep possible physical errors from causing further data loss. During the cloning process, Drive 10 shows lots of physical errors.

Now what? I'm inclined to force Drive 8 back into the array and NOT rebuild Drive 10 (the clone) until I've pulled off as much data as possible. If I try to continue a rebuild using Drive 10, which has the read errors, I assume I'd end up with data corruption.

If this is the correct approach, does anyone know the procedure to do this on my Areca card?

I reached out to Areca technical support and received this back from Kevin Wang:
Generally, force-adding a previously failed drive back into an array is possible, but it only works under certain conditions and may result in data corruption if the data on that drive differs from the others. That is why I asked whether the system kept accessing the array after Drive 8 failed.
Currently there are two drives available to add back to this array, Drive 8 and Drive 10. Drive 8 failed first, and Drive 10 has read errors.
Drive 8 may contain corrupted data if the system kept accessing the array after it failed,
and Drive 10 may contain corrupted data because of its multiple read errors.
Which drive is the better choice depends on your situation.

To force-add a drive into a raidset, you create a new volume without initializing it, but this may not work if the new volume configuration does not match the original.
Creating a volume without initialization is a newly added feature; you will have to update your controller firmware to get this initialization mode when creating a volume.

Another e-mail:

Sorry for my confusing reply. The controller will not rebuild the VOLUME because there are not enough drives.

The RAIDSET rebuild is used to make the raidset status become free again, which allows you to create a new VOLUME to recover the original data inside it.

My broken English is a little rusty these days. Can anyone help? Thanks in advance.

My Setup:
Areca Model: ARC-1130ML PCI-X (12 drives)
Firmware Version=V1.41 2006-5-24
Disk Vendor=Seagate
Disk module=ST3500630AS
Disk Firmware Version=3.AAK
 

sub mesa

Did you check each disk's SMART information?

Do any of the disks have a non-zero value for:

UDMA CRC Error Count
Current Pending Sector

?

Your data should still be recoverable, but be careful about what you try. You may want to consider doing the RAID recovery under Linux/BSD with software RAID tools instead.
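From any Linux live CD with smartmontools installed, pulling those attributes looks roughly like this; /dev/sdX is a placeholder for each member disk, and drives that are still sitting behind the Areca controller may need smartctl's -d areca,N option instead of a plain device name:

```bash
# Overall health verdict plus the full SMART attribute table for one disk
smartctl -H -A /dev/sdX

# Just the two attributes mentioned above
smartctl -A /dev/sdX | grep -E 'UDMA_CRC_Error_Count|Current_Pending_Sector'
```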
 

raypp2

Emerald, I'm using RAID 5. All 12 of the 12 drive slots are in use, and each drive is 500GB. An important piece of information :)

sub mesa, I'm not near the computer right now but I can post the SMART data later tonight. I did check it previously and didn't notice any red flags even on the drive that had tons of read errors. Weird, I know.... Maybe I'm just not reading it correctly?

Great suggestion for trying the software route. Do you recommend any particular application or live boot CD for the Areca chipset? I'm not a super advanced Linux user, so something well-documented is preferred. The other problem I would run into is how to connect 12 SATA drives to the same motherboard. I suppose I'd need a PCI card for this?

Here is a procedure that was just sent back from Kevin at Areca:
I am sorry, we do not have a document for this. Array recovery generally varies with conditions; we have to look at the entire situation to work out the correct response.

Your procedure should be:
1. Take a screenshot of the entire RaidSet Hierarchy page and the Volume Information page for future reference.
2. Delete the listed volume.
3. Assign Drive 8 as a hot spare disk; the controller will add it into the array immediately.
4. Put another empty drive in to rebuild the array; the array status should become Normal after this drive has been added.
5. Create a new volume, and the initialization mode must be "no init" for rescue.
6. Remove the empty drive to be able to access the data from the rest of the drives.

Can anyone vet this? My understanding/translation of the steps is as follows:

[References are to the Areca SATA RAID HBA manual]

1. Upgrade firmware to 1.48
2. Screenshot my raidset hierarchy page and the volume information page
3. Delete Volume Set [3.7.3.2 p69]
4. Put my clone of Drive 8 into the Drive 8 slot and Create a Hot Spare [3.7.2.5 p59]
5. Replace Drive 10 (read errors) with a completely NEW blank drive of the same size. (What does it mean for the array to become normal?)
6. Create Volume Set [3.7.3.1 p62] Use the default settings. When prompted with the "Initialization" screen, choose "No Init (The Rescue Volume)"
7. Remove the blank drive from the Drive 10 slot
8. Restart the system (I assume it will beep an alert because a drive is missing?) and the drives should be accessible through the OS

***** Other Steps
9. Mount the array with Linux in a read-only state (rough sketch of this and step 10 below)
10. Copy off all the data
11. Check the drives for integrity (any suggestions on software for this? dd?)
12. Replace any bad drives and initialize the array from scratch if needed
13. Copy back all the data if needed
14. Determine why my SMART readings are not showing drive errors
15. Keep better backups :)
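
For steps 9 and 10, this is roughly what I have in mind from a Linux live CD; /dev/sdb1 (the Areca volume) and /dev/sdc1 (the backup target) are placeholders for whatever the devices actually enumerate as on my system:

```bash
# Step 9: mount the Areca volume strictly read-only
# (/dev/sdb1 is a placeholder -- check dmesg / fdisk -l for the real device)
mkdir -p /mnt/raid /mnt/backup
mount -o ro /dev/sdb1 /mnt/raid

# Step 10: copy everything off to a separate backup disk mounted read-write
mount /dev/sdc1 /mnt/backup        # placeholder for the backup target
rsync -a /mnt/raid/ /mnt/backup/
```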

How does this sound to you guys?
 

raypp2

The verdict: almost all of this became moot when I completed the clone of the original failed drive (#8) and the drive with read errors (#10). ddrescue running on Knoppix was able to recover all but a few megabytes of the data. I put the cloned drives into the array in place of the originals, let the rebuild run, and the array is now working properly with all data intact :bounce:
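
For anyone who finds this thread later, the clones were made with something along these lines; /dev/sdc (the failing source) and /dev/sdd (the blank destination) are placeholders, so double-check your own device names first, because swapping them would overwrite the drive you're trying to save:

```bash
# Clone the failing drive to a blank drive of the same (or larger) size.
# /dev/sdc = failing source, /dev/sdd = blank destination -- placeholders!

# First pass: grab everything that reads cleanly, skip bad areas quickly
ddrescue -n /dev/sdc /dev/sdd rescue.log

# Second pass: go back and retry the bad areas a few times
ddrescue -r3 /dev/sdc /dev/sdd rescue.log
```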

Of course, now I'm concerned about why multiple drives would fail at the same time in the first place. Especially concerning is the lack of warning from SMART. I ran Drive 10 through SeaTools, and it also doesn't see any SMART errors, but it does see plenty of sector errors. When that drive was being cloned with ddrescue, I was getting about 10 megabytes/sec; a drive-to-drive clone should run at around 100 megabytes/sec, and I've seen that throughput on other drives of the same model. Now I'd like to test all of the other drives.

What tool(s) or method(s) would you recommend for doing a thorough test of each drive?


The drives are a mix of ST3500630AS and ST3500320AS models.
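
Unless someone has a better idea, my tentative plan for each drive is the built-in SMART extended self-test plus a full read-only surface scan; /dev/sdX is a placeholder for each drive in turn:

```bash
# Kick off the drive's built-in extended (long) self-test
smartctl -t long /dev/sdX

# Once it finishes (can take a couple of hours on a 500GB drive),
# check the self-test log and the attribute table
smartctl -l selftest -A /dev/sdX

# Non-destructive full-surface read scan; reports any unreadable blocks
badblocks -sv /dev/sdX
```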