weird RAID 5 problem

Tags:
  • NAS / RAID
  • Green
  • Storage
August 15, 2007 6:28:59 AM

Hi everyone

I'm experiencing a weird RAID problem. The system runs Windows Server 2003 Enterprise, CPU = Athlon X2 4000+, RAM = 1 GB x 2 (dual channel), and 4x WD 250 GB SATA II HDDs; 3 are RAID 5 members and 1 is marked as a spare. The motherboard is a GA-965P-DS3P (rev. 3.3) (Intel P965 + ICH8R chipset), and the Intel Matrix Storage Manager option ROM version is 6.0.0.1022.

The usual RAID status is:

Volume 0   RAID 5   stripe = 64 KB   status = healthy   bootable = yes

Port status:
0  member disk (green)
1  member disk (green)
2  member disk (green)
3  spare (yellow)

Here's the story: the other day I was remote-controlling the server from home (using Windows Remote Desktop) and accidentally disabled the WAN adapter, which ended my remote session. The next day I went to the office to re-enable the Ethernet adapter. Surprisingly, the screen displayed "boot disk failure, please insert system disk and press enter". (This isn't supposed to happen, because I didn't shut down or restart the system.) So I pressed reset; it booted up to the point of the Windows Server 2003 Enterprise progress bar, then just restarted itself. The next time it got past the progress bar, up to the point where it said "Preparing network connections", then it restarted again. This time it didn't even make it to the Windows progress bar, but displayed "boot disk failure, please insert system disk and press enter", so I pressed reset again. Now the RAID status said:

Volume 0   RAID 5   stripe = 64 KB   status = failed   bootable = no

Port status:
0  unknown error (red)
1  member disk (green)
2  unknown disk (red)
3  member disk (green)

[Notice that port 3 is now a member disk; before, it was marked as spare. Does this mean that one drive had already failed?]


After this, I switched off the power and checked all the cable connectors (unplugging and replugging them in the same order), then switched the power on again.

The status now said:


Volume 0   RAID 5   stripe = 64 KB   status = failed   bootable = no

Port status:
0  offline member (red)
1  member disk (green)
2  unknown disk (red)
3  member disk (green)

So I pressed Ctrl+I (to get into the RAID manager).
It said "RAID volume failure detected, rebuild volume (y/n)?", so I pressed "y".

Now the status said:

Volume 0   RAID 5   stripe = 64 KB   status = degraded   bootable = yes

Port status:
0  offline member (red)
1  member disk (green)
2  unknown disk (red)
3  member disk (green)

But this time it said "read disk error, press Ctrl+Alt+Del to restart".
So I pressed it and it showed the same thing again. I did this about 5 times, and once or twice it showed "boot disk failure, please insert system disk and press enter" instead.

I did try unplugging port 0 and port 2 so that it was left with only the 2 good member disks. That didn't help either.

The weird part: would disabling the Ethernet adapter over Remote Desktop really cause all this? Is all my data lost, or is there a way to fix this?

Any help will be greatly appreciated. Sorry for the long post, and please excuse my English.
Thank you


August 15, 2007 7:55:28 AM

WOW. This doesn't sound good at all! I'm going to give you the cut-and-dried answer before I explain.

Is your data recoverable? Probably not :(

Here's some info for you, since you clearly had a good idea of what you wanted/needed when you built it. There's a lot more to RAID-5 than people might think. I have a setup almost like yours, but mine is 8x250GB on RAID-5. I remote desktop into my Windows 2003 Server machine too, but mine's at home because I'm a computer geek and I want to feel 'the force' with 1.75 TB of drive space.

In this scenario, it would at first seem that your disabling a network card caused a hard drive failure. There's NO chance that was the cause. I'm sure you realize this, given that you set up your server very well from the info provided.

Before I go into my theory of what I think happened from the info you provided: is there ANY way ANYONE was playing around with your computer, accidentally reset it, a power spike caused a reset, etc.? This just doesn't seem like something that could happen without some 'human intervention'. I'll explain in a sec.

So, here's what it looks like happened. There are so many possible ways this COULD have gone, but I'll try to give you the most reasonable one.

1. One of your drives (probably 0 or 2) died; the controller realized this and brought the spare online. It began rebuilding the array (if it finished, your data might still be recoverable, because disks 1 and 3 would hold it).

2. You were able to get partial reboots, but nothing good enough to actually log into. Finally, after a few reboots, it's suddenly not bootable. It would appear here that the rebuild in step 1 was not complete.

3. Now the other drive (0 or 2) has decided to crap out on you. Things are starting to get hairy.

4. You power down and check your cables, hopefully keeping a cool head so far. You boot your system back up and nothing has changed. But now you can 'rebuild'. You click 'yes'. It rebuilds.

5. Now it's done rebuilding and it says bootable. You think you're safe and click OK, but no worky worky. You unplug the bad hard drives and it still won't work.

Now, at the beginning of your post you say:

Quote:
The next day I went to the office to re-enable the Ethernet adapter. Surprisingly, the screen displayed "boot disk failure, please insert system disk and press enter". (This isn't supposed to happen, because I didn't shut down or restart the system.)


When I've seen hard drives fail catastrophically, the machine blue-screens and then reboots if it's set to reboot. I disable the automatic reboot because I want to see the blue screen so I can look it up on the Internet. What are the odds that 2 drives would fail at exactly the same time? Not much, IMO. I'm thinking perhaps someone thought they could fix it and made things worse, either by resetting the computer while it was rebuilding automatically before you came to work the next morning, or by interacting with it in some other way. Don't go on a manhunt, because this COULD have happened on its own, but I think it's unlikely.

Now, obviously, if drive 0 fails, drive 3 begins rebuilding automatically. IF it finishes, then 1, 2, 3 is the new RAID-5 (though then you probably wouldn't have had the random reboots). Then 2 fails, and 1 & 3 would still have your data and you're golden. However, if it didn't finish the auto-rebuild but instead rebooted, hard drive 3 might still be labeled 'green' because it holds rebuilt data (even if only a few sectors; once a rebuild starts, some controllers consider the drive valid from the user's view). This would be a VERY bad thing, because you actually have only partial data on drive 3. It would cause all sorts of operating-system havoc and wouldn't actually work if you needed data past the 'rebuild' point.
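To make the parity math concrete, here's a tiny Python sketch of the idea. It's simplified to a dedicated parity disk (a real RAID-5 rotates parity across all members), and the drive contents are made up, but it shows why a half-finished rebuild leaves the spare with only partial data:

```python
# Toy illustration of RAID-5 rebuild math, NOT the controller's actual code.
# Parity is just the XOR of the data blocks, so a missing drive can be
# recomputed from the survivors -- but only for stripes the rebuild reached.

def xor_blocks(*blocks):
    """XOR equal-length byte strings together (this is all RAID-5 parity is)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Two data members plus a parity member; each entry stands in for one 64 KB stripe unit.
disk0 = [b'AAAA', b'BBBB', b'CCCC']                        # this one fails
disk1 = [b'1111', b'2222', b'3333']
parity = [xor_blocks(a, b) for a, b in zip(disk0, disk1)]  # what the third member holds

# The spare starts rebuilding disk0 stripe by stripe, but a reboot interrupts it.
spare = []
for row in range(len(disk1)):
    spare.append(xor_blocks(disk1[row], parity[row]))
    if row == 0:        # interrupted after the first stripe
        break

print(spare)            # [b'AAAA'] -- correct up to the interruption, nothing after
# The controller may still flag the spare 'green', but stripes past the
# interruption point only ever existed on the dead drive.
```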

Now, the last thing (which might have ruined your chance of recovery) is that you did a manual rebuild as well. Looking at your post, between the reboot before and the reboot where you actually did the manual rebuild, nothing changed, so why did it suddenly want to rebuild? I NEVER rebuild an array when things go awry until I (BIG "I") have control of the situation. You appeared to have 2 bad drives. I would have taken the server offline and actually tested the 'bad' hard drives to see whether they really are bad, or what the story is with them. The RAID controller has already decided they are not in the array, and you can't fix that. BUT what if they are good?

As a desperate act of saving your data when things are grim, sometimes you can recover some or most of it with a software tool. I've used R-Studio before on a Windows software RAID-5 setup where 1 drive died and the boot drive was on the same physical drive as the RAID-5, just on a separate partition. Depending on how important your data is, R-Studio might be able to help you recover it. I would try putting disks 0, 1 and 2 on a separate NON-RAID controller and running R-Studio to see what you can get. Worst case, you see your data but really recover trash. Best case, you recover all of it.

Now, onto a few other notes.

You were disabling the network card, which makes me wonder if this was a relatively new server setup. If it was, then it would seem that a drive (or possibly 2) was already bad and you just hadn't discovered it yet. Drives can have bad media from the manufacturer, and the data written there is complete trash when you try to read it later. I had a new hard drive added to my RAID-5 a while back, and a small section of the new drive was bad. Guess where it was? Where the MFT was. A chkdsk fixed it, but not without a few hundred orphaned files to sort through. That bad drive is where I learned the difference between a retail drive and a RAID drive.

One problem I have seen myself: if a hard drive has 1 bad sector, the drive will immediately go into a 'recovery mode' and try to read that sector (the clicking sound some people hear from the drive seeking and re-seeking a sector). This is VERY bad for a RAID-5 setup, because recovery mode locks the drive out from exchanging any data with the RAID controller. The RAID controller will think 'gosh... must not be a hard drive here', drop the drive from the array, and life goes on. 1 bad sector can ruin it. I'm not sure how many manufacturers behave like this, but I know I saw it once. It sucks, because you pretty much can't use the drive for RAID-5 anymore, but it's not broken enough to RMA. This is why some hard drives are sold as 'RAID drives'. They lack the 'recovery mode' and instead just tell the RAID controller 'hey... I'm broken, what do you want me to do about it?'. Then the RAID controller handles the recovery using RAID-5 parity data and remaps the bad sector, and you never see anything. The bad sector is remapped, the array is still healthy, and all is well.

Another thing to think about: how likely is a 2-hard-disk failure on a random day? I'd say slim to none, unless your building is on fire. Personally, I NEVER put my RAID-5 on a setup where it will auto-rebuild the array on failure. If drive 0 dies, I want to know what the heck is going on before anything else happens. That way, if something goes terribly wrong, I know exactly what happened. I want to be responsible for the loss of my precious data, not a controller that decided to rebuild on its own and made things worse. I may set up a spare on the controller, but only if I can stop the controller from rebuilding automatically.

Also, does your controller let you do array verifications? I do an array verification once a week at 12:01 am on Sunday morning. Basically it compares your data and parity data to make sure they match; if they don't, something is wrong. My 8x250GB array does it in about 4 hours.
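If you've never looked at what a verify pass actually checks, here's a rough Python sketch of the idea (again assuming a simple dedicated-parity layout; real controllers rotate parity from stripe to stripe):

```python
# Sketch of an array-verify pass: recompute parity from the data members and
# compare it with what is actually stored. Any mismatch means silent corruption.

from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def verify_stripe(data_blocks, stored_parity):
    """True if the stored parity matches the XOR of the data blocks."""
    return reduce(xor_bytes, data_blocks) == stored_parity

# One healthy stripe and one where a data block has silently changed.
print(verify_stripe([b'\x10\x20', b'\x01\x02'], b'\x11\x22'))  # True
print(verify_stripe([b'\x10\x20', b'\x01\x03'], b'\x11\x22'))  # False -> would be flagged
```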

I also ALWAYS do a test-pattern WRITE test on a new hard drive before I do anything with it. If the hard drive can't store a test pattern, it won't store your valuable data. I use SpinRite myself. I'm sure there are other tools out there, but I find SpinRite to be a valuable asset for hard drive recovery too. I've used it twice for friends who weren't smart enough to do backups, retrieving most of their data.

So, to briefly cover my few pointers:
1. Disable automatic rebuilds if you can. You probably know better than the controller!
2. Always do a test-pattern write test on a hard drive before you actually use it.
3. Set up an array verification. Weekly is probably the shortest interval I'd use; biweekly might be better for you. I do weekly only because I don't care much about the wear and tear on the drives; they are all Seagate drives with their 5-year warranty, and I'll have any failed one replaced LONG before the warranty expires. If possible, do an array verification after the array is built but before you install and copy all your data to it. I like to be sure my array is in good health.

I hope this provided enough information to help you in your RAID journey, and I hope you are able to recover your data. It really sucks when you lose it all; I had it happen to me years ago, and I'm a better person for it. Backups are a MUST. DVDs are cheap, use them! I'm sorry if this post is overly long, I just felt I'd give you as much ammunition as I could when you go up against your server 1-on-1.
August 15, 2007 12:06:59 PM

Thank you, cyberjock, for your great reply.

Here are the answers to your questions:

This server has been up and running for about 8 months.

The server is connected to an APC UPS, so I don't think a power surge or spike killed it. BUT, during the last 2 months we had a major refurbishment of the building, and the workers were idiotic enough to just unplug the power cable from the UPS, which means this server suffered improper shutdowns about 10 times. (Sorry, I forgot to mention this in the first place.) I wasn't there when they did it and only discovered it later, when I logged in to the server and the pop-up came up asking me to enter the reason for the improper shutdown. But every time an improper shutdown occurred, the Intel Matrix Storage Manager would usually initialize the array on the next boot. Sometimes I just clicked verify and repair in the manager, and it never found any problem.

After this incident we bought a server rack, so we can now put the server (and the UPS) in the rack and lock them up, which means no one can press reset before I arrive at the office.

So you think the fourth drive (port 3) is incomplete, right? So I shouldn't bother getting a new drive to replace it?

But I still don't get it: why did all this happen right after I disabled the adapter? I know it's probably totally unrelated, but I just can't stop wondering.
August 17, 2007 5:40:35 AM

cyberjock said:
One problem I have seen myself: if a hard drive has 1 bad sector, the drive will immediately go into a 'recovery mode' and try to read that sector (the clicking sound some people hear from the drive seeking and re-seeking a sector). This is VERY bad for a RAID-5 setup, because recovery mode locks the drive out from exchanging any data with the RAID controller. The RAID controller will think 'gosh... must not be a hard drive here', drop the drive from the array, and life goes on. 1 bad sector can ruin it. I'm not sure how many manufacturers behave like this, but I know I saw it once. It sucks, because you pretty much can't use the drive for RAID-5 anymore, but it's not broken enough to RMA. This is why some hard drives are sold as 'RAID drives'. They lack the 'recovery mode' and instead just tell the RAID controller 'hey... I'm broken, what do you want me to do about it?'. Then the RAID controller handles the recovery using RAID-5 parity data and remaps the bad sector.

As far as I know, most RAID 5 implementations will kick out a disk that has a bad sector, regardless of the TLER (Time-Limited Error Recovery) feature you are talking about. TLER just means you don't get a 60-second I/O stall, which would be very undesirable for production servers. I highly doubt it is of great benefit for home users; they don't lose revenue if the server doesn't respond for a short while. Either way, after a timeout the disk will be kicked out and the array will run in degraded mode. I can confirm that at least Areca controllers behave this way.
August 17, 2007 4:40:19 PM

You seem to have had two drive failures plus intermittent read errors. When the array only has 2 of its 3 drives, it is effectively running as a stripe with no redundancy, so if one of the remaining drives has a read error there is no parity left to check the data against; you are in the same position as a single drive with read errors, hence the failure to boot. One cause could be a failing PSU; fluctuations on the power rail to a drive can cause drive failures and read errors.

Mike.
August 17, 2007 5:49:31 PM

Spinrite!! Hooray for Gibson Research :) 

I agree that the NIC disable didn't cause anything except a sudden disconnect. Can't really add any more to the RAID discussion itself, mostly because I agree that you've got 2 bad disks now. I've run across several RAID systems where the spare was bad and nobody knew it until the RAID itself crashed and the spare was needed. The rebuild starts... and dies a terrible death when the 'spare' doesn't work.

EDIT: NO AUTOMATIC REBUILDS!! If it's that important, mirror the array or cluster the machines.
August 21, 2007 7:53:05 AM

Thank you, everyone, for your replies. I haven't checked the drives for errors yet, but I'm wondering: when the RAID controller says that a drive is bad (failed), does this mean the drive is physically broken (bad sector, MBR, head crash, etc.), or could it just be that the data on it is corrupted?
August 21, 2007 9:46:08 AM

The controller only knows that the drive gave a read or write error; there is no diagnosis carried out. You need to test the drive before setting a rebuild in motion.

Mike.
August 21, 2007 11:05:08 AM

The controller saying a drive has failed means a hardware failure. The controller has no way of knowing whether the data on the drive is actually data or complete garbage. More than likely the 2 drives that your controller says are 'bad' are either bad or disconnected from the array. Some controllers will keep listing the 'bad' drives if they are removed, so that the administrator knows which drives should have been there.

This could be a good example of hard drives failing unnoticed because regular surface scans weren't done until it was too late. When you rebuild the array, you are basically doing a scan of all of the disks, and 1 bad sector anywhere can ruin the rebuild.

As a thought I came up with a few minutes ago: there's a program called R-Studio that I have used to rebuild a software RAID-5 array with 1 drive bad and the OS toast. It worked VERY well. You might want to try running that program with your original 3 disks to see how lucky/unlucky you are with the recovery. I think this is your last savior before you decide to wipe them, figure out which drives are good or bad, and build a new array. Granted, I'd expect some files to be corrupt because your original disks went out of sync for some period of time, but the number of files should be small unless you were doing a defrag or something like that. I'd try R-Studio in your situation, expecting to recover at least 80% of your data. You just need to find more storage space to recover to.

http://www.r-studio.com
August 23, 2007 1:21:23 PM

Thank you, Mike99 and cyberjock, for replying.

To cyberjock: you said "You might want to try running that program with your original 3 disks".

By the 3 original disks, you mean the disks on ports 0, 1 and 2, right? Even though disk 0 and disk 2 are dead?

Also, let's say I can recover most of the data and then test the HDDs for errors. Is there any possibility that the HDDs contain no errors at all?

By the way, does anyone have an explanation linking the failed HDDs and the disabled NIC? At the moment I have nothing to report to my boss, and he will probably think I screwed it up by disabling the NIC.
Thanks again for any help.
September 2, 2007 3:03:27 PM

Yesterday I tried using the original disks and managed to recover files with R-Studio. Surprisingly, I was able to recover 100% of the files I need [about 80 GB]. NOTE: I didn't recover everything, only the files that I need. I haven't tested the HDDs for errors yet, but I'll do that soon. Luckily my boss believes that I didn't screw it up :-). I'll post the results when I've done it.
September 3, 2007 12:58:07 PM

Excellent thread. May I continue it?
I have (had) a RAID 5 set of 3 HDDs. In an hour it went from ALL green to only 1 member remaining. RAID: DEAD.
On examining the disks I can see that 2 have some partition structure but one has none. I presume this was a failed auto-rebuild. Now I know better, having read this thread, but the problem still remains for me.

R-Studio 3 did not help. It found files and folders, but the recovered data does not work.

I guess the "empty" disk is no help to me here. How about the one that the RAID reports as failed but which still has structures on it? If a single bad sector brought it to its demise, I don't mind, as long as I can retrieve the rest. But how? R-Studio reports that there is data on both disks that have partitions, and finds absolutely zilch on the one that doesn't.

My current last-resort option is to pay £2000 + VAT for an attempted recovery by a data recovery company. Can anyone suggest any more things I can do? I'm ready to buy extra HDDs and make a RAW copy of the original set if that lets me do some experimenting.

Thanks to anyone for any help!
September 3, 2007 4:53:59 PM

In addition to the above info: the array is a 64 KB stripe. On closer examination, the RAID menu on the PCI card reports that only 1 HDD is part of the RAID 5 array. The other 2 are reported as single; one has a status of BOOT, while the other has a status of NONE. I don't remember if the array was configured as bootable, but the disk reported as bootable was member #3. It is also the one that has some partition structure on it. The empty one is #2.

What really pisses me off is that these 2 disks are less than a year old, and the one that still works is well over 2 years old!!!

I guess buying for RAID needs to be done across different makes to ensure the drives don't all fail at the same time...
September 4, 2007 4:28:21 AM

OK, so I've got the results: HDDs 1, 2 and 3 contain no errors (I used SpinRite 6 at operation level 4). The interesting bit is HDD 0: SpinRite couldn't find the drive even though the BIOS detects that it's there (SpinRite got stuck at "Discovering System's Mass Storage Devices..."). BUT when I put the HDD in another PC (as slave) and boot into Windows, Windows detects the HDD and finds a partition, which I managed to format from Windows. But when I booted with Partition Magic and tried to format it again, it said "partition table error #105 found". I tried to format it anyway, but once it finished there was a pop-up saying "error #4 bad argument/parameter" and it went back to the partition table error again, so I couldn't format it with Partition Magic. I also tried installing Windows on it, but the installer couldn't find any HDD. (All this was done on another PC, not the one I've been having problems with.) So I couldn't run a test on HDD 0. Does this mean that HDD 0 is broken? If yes, then I'll go and RMA it.
In addition to the above: while HDDs 1, 2 and 3 were being tested with SpinRite, SpinRite couldn't retrieve the SMART status from any of the drives (SMART is enabled in the BIOS).

Now I have a new problem: even though HDDs 1, 2 and 3 contain no errors, I still can't format them with Partition Magic. When I boot with PM, it says "error #48 sector not found".
I'm going to put HDDs 1, 2 and 3 back in the server and see if I can use the Intel Matrix Storage Manager to format them or set up a new array.

Also, after testing HDD 3 (the spare) with SpinRite and looking at the raw data snapshot, I can confirm that cyberjock's prediction is correct: HDD 3 only contains data in the first few sectors, and the rest is free space.
September 4, 2007 4:34:45 AM

To magelan_2000:

With R-Studio, you have to choose the correct RAID block order to be able to recover data. For example, my setting is "Left Synchronous (Continuous)" with a RAID block size of 64 KB; the default is "Left Synchronous (standard)".

Try changing the setting and see if you get any luck.
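In case it helps to see why the block order matters so much, here's a rough Python sketch of how a left-symmetric ("left synchronous") RAID 5 lays out its blocks. This is just my understanding of the common layout, not anything taken from R-Studio itself:

```python
# Guess at a left-symmetric RAID-5 layout (parity rotates toward the first
# disk; data continues on the disk right after parity). Illustration only.

STRIPE_UNIT = 64 * 1024   # 64 KB blocks, as on my array
N_DISKS = 3

def locate(logical_block, n=N_DISKS):
    """Map a logical block number to (disk index, stripe row)."""
    data_per_row = n - 1
    row = logical_block // data_per_row
    offset = logical_block % data_per_row
    parity_disk = (n - 1) - (row % n)       # parity walks left each row
    disk = (parity_disk + 1 + offset) % n   # data starts just after parity and wraps
    return disk, row

for block in range(6):
    print(block, locate(block))
# If the recovery tool assumes a different rotation or block size, every 64 KB
# chunk gets read from the wrong place, so files "recover" but contain garbage.
```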
September 8, 2007 6:20:19 AM

@ Pheonixchin

I wouldn't trust drive 0 if SpinRite is having problems with it. It could be flaky enough that SpinRite has problems but the OS isn't thorough enough to see them. Drive 0 would make a good backup/dumping ground for something, but definitely don't use it in your main array.

@ magelan_2000

The model and brand of the hard drives doesn't matter for the array. I like to stick to 1 brand and model if possible, but the performance hit if you mix a Seagate and a WD in the same array is minimal. It's normal for 1 or 2 disks to show some partition structure, but in reality only the first drive in the array will actually hold the partition table; the rest just hold parts of the stripes. You can't examine the hard drives individually to find the partition table, boot partition and so on, because it's an array. Member 3 shouldn't have been bootable, only member 0 (or 1, whichever is the first one).