Windows 2008 Server internal RAID6 Volume going offline during data restore

nynj

Honorable
Jan 12, 2013
Greetings,
Forgive the long post; the detail is necessary to explain the problem we're having.
Basically, when we try to restore data from backup across the LAN to our attached RAID6 volume, the volume disappears from Windows and the restore fails.
The only way to get the RAID volume to reappear in Windows is to hard reboot the server.

Hardware/Software (this is a new build):
Windows Server 2008 R2 x64 Enterprise (All Updates & Patches Installed)
Chenbro Rm31616 Chassis with 16 3.5" SATA backplane connected via Mini SAS 36-Pin to Mini SAS Cable (Host)
Areca ARC-1264IL Raid Card (latest firmware, etc)
16 Seagate 3.5" SATA ST3000NC002
ASUS Z9PE-D16 SSI EEB Server Motherboard Dual LGA 2011 DDR3 1600
64GB RAM
Dual SSD RAID1 Boot Drives using Intel RST from motherboard
2 x Intel Xeon E5-2630 v2 Ivy Bridge-EP 2.6GHz LGA 2011 80W Six-Core Server Processor BX80635E52630V2

Backup Server: Separate Chassis
Archiware PresStore 4.4.10
ADIC Scalar 50 LTO4 Library

So this is a new build, it's our backup domain controller and we're restoring data from a backup using Archiware's PresStore v4.4.10 to the local RAID6.
When we start the restore it will report that the volume is no longer available and the Areca Console reports that several drives have TimeOuts.
I stop the restore and then the RAID Volume is no longer accessible when I try to browse that volume.
The only way to bring it back up is to restart the system, and it hangs at shutting down, so I have to hard reboot it.
Upon reboot the RAID volume appears, and I run a volume check on the RAID and a check disk on the boot drives and they come back normal.

Troubleshooting: After each part was replaced, we tried the restore again.
Replaced Areca ARC-1264IL RAID card with an identical new one, issue remains.
Replaced the RAID card to backplane cables, issue remains.
Replaced Chenbro backplane, issue remains.
Rebuilt the RAID6 from scratch, issue remains.
Replaced the power supplies (grasping now), issue remains.
Tried using only 4 RAM modules, issue remains.
Took 2 brand new Hitachi HUA723030ALA640 drives and set up a RAID 0 using the cables that came with the RAID card, BYPASSING the backplane and connecting the RAID card directly to the new drives, issue remains.

I've contacted Archiware and was informed that their software is not capable of taking a RAID offline.
Contacted Seagate and was informed that the firmware is the latest version and they have no incompatibilities with our setup.
Contacted Areca; they have been working with me but have no solution. I can copy from other network drives to the RAID without issues, though not the same volume of data.
I've changed the TimeOut settings, etc. on the card as instructed by Areca, same issue.
Contacted Chenbro and they have no solution, all parts are compatible.

I've replaced everything associated with the RAID, including setting up a new one off the backplane.
What remains constant is that I cannot restore data from my PresStore backups to the RAID volume.

The Windows logs report only the following: Event 15, Disk: The device, \Device\Harddisk2\DR2, is not ready for access yet.

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Disk" />
<EventID Qualifiers="49156">15</EventID>
<Level>2</Level>
<Task>0</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2013-10-15T19:41:41.318518600Z" />
<EventRecordID>32761357</EventRecordID>
<Channel>System</Channel>
<Computer>ServerName</Computer>
<Security />
</System>
<EventData>
<Data>\Device\Harddisk2\DR2</Data>
<Binary>0400800001000000000000000F0004C0040100009D0000C000000000000000000000000000000000E11E0A0000000000FFFFFFFF010000005800000A00000000F820101282032040001000005A00000000000000000000001861B73D80FAFFFF00000000000000001063B73D80FAFFFF000000000000000068B8B200000000008A000000000000B2B868000000080000000000000000000000000000000000000000000000000000</Binary>
</EventData>
</Event>
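
For anyone sifting through many of these events, the fields that matter (provider, event ID, timestamp, device path) can be pulled out of the exported XML with a few lines of Python. This is only a minimal sketch: `SAMPLE` is a trimmed copy of the event above with the `<Binary>` payload omitted, and `summarize_event` is a hypothetical helper name, not part of any Windows tooling.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <Event> element in the export above.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the event from this post (Binary payload left out).
SAMPLE = r"""<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Disk" />
<EventID Qualifiers="49156">15</EventID>
<Level>2</Level>
<TimeCreated SystemTime="2013-10-15T19:41:41.318518600Z" />
<Computer>ServerName</Computer>
</System>
<EventData>
<Data>\Device\Harddisk2\DR2</Data>
</EventData>
</Event>"""


def summarize_event(xml_text):
    """Return the key fields of one exported Windows event as a dict."""
    root = ET.fromstring(xml_text)
    system = root.find("e:System", NS)
    return {
        "provider": system.find("e:Provider", NS).get("Name"),
        "event_id": int(system.find("e:EventID", NS).text),
        "level": int(system.find("e:Level", NS).text),
        "time": system.find("e:TimeCreated", NS).get("SystemTime"),
        "device": root.find("e:EventData/e:Data", NS).text,
    }


if __name__ == "__main__":
    for key, value in summarize_event(SAMPLE).items():
        print(f"{key}: {value}")
```

Lining up the `time` values of the Disk events against the TimeOut entries in the Areca console would show whether Windows loses the device at the same moment the controller reports the drive timeouts.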

Event 10, WMI: Error 0x80041003. Events cannot be delivered through this filter until the problem is corrected.

Any insight is greatly appreciated.
Archiware says it's an Areca RAID issue, but if so then why can I copy data from a mounted share without issue?
Areca says if you can copy files through Windows, then the RAID is working properly.

My next step: I have ordered an LSI MegaRAID SAS LSI00210 (9280-16i4e) SATA/SAS 6Gb/s PCIe 2.0 w/ 512MB onboard memory controller card
LJS

**UPDATE 10-17-13**
I took 2 drives and plugged them directly into the Areca RAID card using the cables supplied by Areca, bypassing the backplane and cables in the original setup.
I set up a RAID0 and tried to restore the data, same issue.

I then applied this MS fix as I was getting that WMI error listed above: http://support.microsoft.com/default.aspx?scid=kb;en-US;2545227

I was then successful in restoring over 500GB of data at approx 180GB per hr.
I then put everything back the way it was and tried to restore again. The restore failed after a little over 300GB, but the RAID volume did not go offline.
So I must have more than one issue going on here: the one addressed by the MS fix mentioned above, and now it seems it's either the backplane or the cables (both of which have been replaced).

--> Hey popatim, what's your background in storage?

**UPDATE 10-22-13**
So at the recommendation of the backplane manufacturer, I deleted the RAID6 volume and did a full format (not a quick format via Windows).
This took several days, and once it was completed I tried to restore data and within an hour 3 drives failed!
This rendered the RAID volume useless.

I then replaced the cables I was using with Areca branded cables to match the RAID Card.
I removed all the drives and added 4 new drives, one on each channel of the card and initialized as RAID6.
I then tried another restore, and I continued to get TimeOuts.

So the only things that had not been replaced were the 16 drives, so I ordered new Seagate drives. I went with the Seagate Constellation ES.3 ST3000NM0033 3 TB 3.5" internal hard drive. I took out the 4 drives I had just tested and replaced them with these Seagate drives, using the same slots on the RAID backplane. I initialized the 4 drives as RAID6 and then attempted to restore my data.
After 3 hours it successfully restored over 450GB of data, without any errors.

Now I'm adding 16 of the Seagate Constellation ES.3 ST3000NM0033 3 TB 3.5" internal hard drives and will set up RAID6 and test when initialization is completed.

Outlook is promising.
How could 3 out of 16 drives have failed?
Bad drives... Mess You Up!
LJS
 

FireWire2

Distinguished
Just for testing
Don't use the backup software to restore.
Just simply copy the backup files to your RAID volume.
If you can, then there is an issue with your backup software's restore over the network.
If you cannot, then there is an issue with your network; try to access it via NFS instead.
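
The plain-copy test suggested above could be scripted so that every failure is recorded rather than noticed by eye. What follows is a minimal sketch using only the Python standard library; `copy_tree_timed` is a hypothetical helper, and the source and destination paths are whatever share and RAID volume you are testing.

```python
import os
import shutil
import time


def copy_tree_timed(src, dst):
    """Copy every file under src into dst.

    Returns (bytes_copied, elapsed_seconds, errors), where errors is a
    list of (source_path, message) for files that could not be copied.
    """
    copied = 0
    errors = []
    start = time.monotonic()
    for root, _dirs, files in os.walk(src):
        # Recreate the same relative directory layout under dst.
        target_dir = os.path.join(dst, os.path.relpath(root, src))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            source_path = os.path.join(root, name)
            try:
                shutil.copy2(source_path, os.path.join(target_dir, name))
                copied += os.path.getsize(source_path)
            except OSError as exc:
                # Keep going so one bad file does not hide later failures.
                errors.append((source_path, str(exc)))
    return copied, time.monotonic() - start, errors
```

To make the comparison fair, the copy should move roughly the same amount of data as the failing restore. A clean run with an empty `errors` list would point toward the backup software; errors, or the volume dropping out part way through, would point toward the network or the storage path.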
 

popatim

Titan
Moderator
You're pretty much going to fail every time with drives that size and this spec: Nonrecoverable read errors: 1 per 10^14 bits read, max.

I don't mean to be crude, but get better drives. These value drives aren't good for large arrays; they pretty much guarantee failure just like you're seeing. Who's building this, and why didn't they know this already?

edit - there's also the possibility that one or more of your drives is having an issue such as bad clusters. Check the SMART data on the drives.
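
The point about the read-error spec can be put into rough numbers. The sketch below is a simplified model, not a vendor formula: it assumes every bit read fails independently at exactly the quoted maximum rate, and it assumes a full read of 16 x 3 TB (for example during a rebuild or verify). The 10^15 figure is included only for comparison, as a rate often quoted for enterprise-class drives; it is an assumption, not something stated in this thread.

```python
import math


def p_at_least_one_ure(bytes_read, bits_per_error):
    """Probability of at least one unrecoverable read error (URE).

    Model: each bit read fails independently with probability
    1 / bits_per_error, so P(no error) = (1 - 1/bits_per_error) ** bits.
    log1p/expm1 keep the result accurate for such tiny per-bit rates.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_error))


if __name__ == "__main__":
    TB = 1e12  # decimal terabyte, as used on drive labels
    array_bytes = 16 * 3 * TB  # assumed: one full pass over all 16 drives
    for rate in (1e14, 1e15):
        p = p_at_least_one_ure(array_bytes, rate)
        print(f"1 per {rate:.0e} bits: P(at least one URE) = {p:.2f}")
```

Under these assumptions, one full pass over 48 TB comes out at roughly a 98% chance of at least one unrecoverable read at 1 per 10^14, versus roughly 32% at 1 per 10^15. The spec is a stated maximum, so real drives may do better, but it illustrates why the error-rate class matters at this array size.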
 
Solution