Dying raid array

I've got an onboard Nvidia RAID controller with my boot drives in RAID 0 and my storage drives in RAID 1. I suddenly started having problems with at least one of my hard drives, and I'm hoping someone can help shed some light on what is going on. I can't paste specific error messages at the moment but will be able to later.

So far the symptoms are thus:

Occasionally the OS (Ubuntu 9.04) decides that the mirror is read-only. When I tried to reboot I got I/O errors that listed a sector. What is strange is that the I/O errors were listed for /dev/sda, which is one of the striped drives. I installed smartmontools and checked the logs for the drives. The two drives in the RAID 0 have a few read errors but nothing obviously wrong (that I saw). The mirrored drives both have an extremely high read error count, and if I recall correctly the counts were identical. What I'm wondering is whether this means there's something wrong with the actual array, something wrong with the RAID controller, or whether I'm unfortunate enough to have two drives in a mirror die at once. The drives in the mirror are identical 1TB Samsung drives, purchased together, which is why I'm afraid I may be losing both. Any ideas/suggestions/help would be appreciated.

Also, the RAID controller lists both arrays as "Healthy".

Thanks in advance!
  1. Before you do anything else, have you backed up your data? How old are the drives? From what you've said, it seems like your array is failing at the hardware level.
  2. I haven't had a chance to back up the data; needless to say, the hard drives are currently unplugged. I'm waiting on getting a hard drive big enough to fit all my crap on. The drives are about 6-9 months old, although they run almost continuously. The boot drives (about the same age) have about 5000 hours on them.
  3. A little early for drive death, but certainly not unheard of. I haven't really ever used a Samsung drive. After you successfully back up your data and pull those drives, I would test both independently, because I find it unlikely both are dying at once. Since it's a mirror, it's too hard to tell without separate testing. Simply because of the young age of the drives, I would guess it's the controller rather than the drives.
  4. The annoying thing is that the controller is about the same age as the drives.

    Since this was the first mirror I'd ever set up, I'm wondering how it functions precisely... For example, can I take one of the drives, stick it in another computer and read the data off it normally? Or would the fact that it was part of a mirror cause another computer to have difficulty reading it?
  5. Yeah, if drives are in a mirror they MUST be used together. The info is scrambled, so you couldn't pull one of the two drives and use it elsewhere. You would have to format it before using it. So it is imperative you back up the data prior to pulling the drives. The drives have to be used with each other; whether or not the two drives can be used with a different controller I am not sure. I don't have enough experience with RAID, i.e., I've moved my drives. Maybe someone else on the forum can answer whether you can use the 2 drives with a different controller.
  6. Did you update any drivers lately? I've had similar problems with my Samsung F1 750GB and 1TB drives. They are incompatible with some controller drivers. It took some time for me to figure this out because sometimes it would work and sometimes it would say it was a read-only drive. Because I used 4 drives in RAID-5 and RAID-10, the last thing I would think of was dying disks. I've asked Samsung for more recent firmware, but they do not answer my questions regarding firmware. Drives with later firmware don't seem to have the same problem. I use more than 15 Samsung drives, so I was able to test it.

    I suggest you try a newer/older storage driver. A BIOS update can also help if the driver update/downgrade does not work.
  7. OK... I discovered while working on the computer that I actually have Seagate drives; I must have given the Samsungs to someone else. I currently run Linux, more specifically Ubuntu, as my primary OS and use the package manager to install updates, so I have no idea what packages it installed/updated that might be associated with the RAID controller. I may look into a BIOS update just to cover everything.

    My boot drives check out OK using smartctl (all error values are within expected ranges); however, my OS doesn't seem to be running correctly. I also discovered while working on my computer that the fan that cools the HDDs crapped out, which lends weight to the idea that the drives might be dying. The log below shows that a drive may have overheated at some point. The log also shows the read error rate going up, but the number of recovered errors stays equal to it. Now that I've backed up the data I'm running Extended Offline tests using SMART in the hope that they might shed some light on the problem.

    Is it likely that this is software related...? It seems to me that it is more likely hardware.

    Here's the pertinent information from the SMART logs (hopefully it renders correctly when posted):

    ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f 109   099   006    Pre-fail Always  -           21056358
      7 Seek_Error_Rate         0x000f 071   060   030    Pre-fail Always  -           14125335
    190 Airflow_Temperature_Cel 0x0022 071   039   045    Old_age  Always  In_the_past 29 (0 45 36 24)
    195 Hardware_ECC_Recovered  0x001a 039   030   000    Old_age  Always  -           21056358
  8. First of all, don't save anything on the disks when you connect them to the station; it may erase files, etc. You can contact www.raid-recovery-online.com for a free diagnostic and they will tell you what the problem is without any obligation.
  9. Mostly commentary: I've run Seagates for many, many years and never had one fail. I mostly use their ES series, which has well-known reliability, but still... which makes me think it's the controller even more than originally.
  10. Because it is a RAID set, it isn't always easy to see whether the problem is the controller's or the disks' fault. I still think it's a driver that is incompatible with the controller. That would explain why you have the problem only occasionally. If a drive were failing you would be able to see it in the drive's SMART data; because you can't, it points to the controller. I have never heard of failing drives that occasionally prevent writes by reporting that they are read-only. Normally the system would try to write and, after a certain number of attempts, return an error.

    Because I had the same problems, although with several Samsung drives, and the solution was to use a more recent controller driver, I am guessing it would solve your problems too.

    How long have you been having these kinds of problems? How long has Ubuntu been on your system? Did the problems begin after installing Ubuntu? Have you ever installed another OS on the same machine without problems? Have you tried unplugging one drive, since a RAID-1 set must allow the removal of 1 drive?

    Be careful with that last option. After you plug the second drive back in, the controller will examine the RAID set by checking the data block by block. This will take a long time.
  11. OK...

    The computer decided to start having a myriad of problems and the most recent symptoms answered most of the questions here... the mobo is toast. I plugged the computer in at one point and attempted to start it up and was rewarded with the lights/fans turning on briefly, then everything turning off again. No, it's not the PSU... the computer is now running perfectly with a different mobo/CPU/RAM set in it.

    I also got a 1.5TB drive and backed everything up. Before the mobo died, I plugged this brand-new drive into one of the ports the old mirror had been plugged into and its read error count skyrocketed. Both of the 1TB drives passed their SMART self-tests (extended offline tests) with flying colors, and once I plugged everything into a different motherboard the error counts stopped climbing.

    So the real problem appears to have been a combination of at least one bad SATA port and a motherboard that was on its last legs.
  12. So, the onboard RAID controller took a dump, that is what we are saying, right?
  13. Yup... that was my hope and my suspicion but with the symptoms it was hard to be sure.

    Again, thanks to everyone for the input.
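
For anyone trying to read the SMART lines quoted in reply 7, here is a minimal Python sketch. It assumes the usual smartctl -A column order (ID, name, flag, value, worst, threshold, type, updated, when_failed, raw). The names SmartAttr, parse_line and status are made up for illustration and are not part of smartmontools. It only compares the normalized value and worst columns against the threshold, which is roughly why the temperature attribute shows "In_the_past" while the read and seek error attributes do not.

```python
from dataclasses import dataclass

# The four attribute lines quoted in reply 7.
LINES = [
    "1 Raw_Read_Error_Rate 0x000f 109 099 006 Pre-fail Always - 21056358",
    "7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 14125335",
    "190 Airflow_Temperature_Cel 0x0022 071 039 045 Old_age Always In_the_past 29 (0 45 36 24)",
    "195 Hardware_ECC_Recovered 0x001a 039 030 000 Old_age Always - 21056358",
]


@dataclass
class SmartAttr:
    attr_id: int
    name: str
    value: int        # current normalized value (higher is better)
    worst: int        # lowest normalized value seen so far
    thresh: int       # vendor threshold; 0 means no threshold is set
    when_failed: str  # smartctl's own verdict column
    raw: str          # vendor-specific raw value, kept as text


def parse_line(line: str) -> SmartAttr:
    # Assumed layout: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
    # The raw column may contain spaces, so split at most 9 times.
    p = line.split(None, 9)
    return SmartAttr(int(p[0]), p[1], int(p[3]), int(p[4]), int(p[5]), p[8], p[9].strip())


def status(a: SmartAttr) -> str:
    # Simplified rule: a normalized value at or below a non-zero threshold counts as failed.
    if a.thresh > 0 and a.value <= a.thresh:
        return "failing now"
    if a.thresh > 0 and a.worst <= a.thresh:
        return "failed in the past"
    return "ok"


for attr in map(parse_line, LINES):
    print(f"{attr.attr_id:>3} {attr.name:<24} {status(attr):<18} raw={attr.raw}")
```

On these four lines only attribute 190 gets flagged, which fits the dead fan. The large raw numbers on attributes 1 and 195 are vendor-specific, so the normalized columns are the safer thing to compare against the threshold.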