Server RAID errors repeating every several days - paging operation, disk or RAID controller?

supercharn

Distinguished
Feb 4, 2011
14
0
18,510
Need some help here please.

We have a server running Windows Server 2008 R2 on a custom box with a Supermicro X9SCL/X9SCM motherboard running RAID-1 and RAID-5 arrays on the Intel RAID controller. Every few days the server will become completely unresponsive - no keyboard/mouse control, black screen, no network access. The only solution is a hard reset. When the server comes back up, the event log shows this event:

An error was detected on device \Device\Harddisk0\DR0 during a paging operation.

At first I thought this meant one of the drives was going bad on the RAID-1 volume. I changed out both drives over the course of a few weeks, but the server kept crashing every few days with this same error. I then removed the paging file from the RAID-1 volume and moved it to the RAID-5 volume. Now the server is still crashing exactly as before, but the error message has changed a little:

An error was detected on device \Device\Harddisk1\DR1 during a paging operation.

We have updated all of the motherboard drivers, changed out multiple drives, and moved the paging file around, but the problem persists. Since the server is a custom build, we don't have Dell OpenManage or HP tools to work on the RAID. We are using the Intel Rapid Storage Technology GUI to manage the arrays.

I am at a loss as to what to do here. Any assistance would be greatly appreciated.

Thank you!
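
The two event messages quoted above differ only in the device path (Harddisk0\DR0 before the paging file was moved, Harddisk1\DR1 after). One way to track which device the errors follow over time is to tally them from exported event log text. The following is a minimal sketch, assuming the log has been exported to plain text containing the same message wording as in the post; the function name `count_paging_errors` and the sample lines are hypothetical and only for illustration.

```python
import re
from collections import Counter

# Matches the message wording quoted in the post and captures the device path.
PAGING_ERROR = re.compile(
    r"An error was detected on device (\\Device\\Harddisk\d+\\DR\d+) "
    r"during a paging operation"
)


def count_paging_errors(log_text):
    """Return a Counter mapping device path -> number of paging-error messages."""
    return Counter(m.group(1) for m in PAGING_ERROR.finditer(log_text))


if __name__ == "__main__":
    # Hypothetical sample; real input would be text exported from the System log.
    sample = (
        "An error was detected on device \\Device\\Harddisk0\\DR0 "
        "during a paging operation.\n"
        "An error was detected on device \\Device\\Harddisk1\\DR1 "
        "during a paging operation.\n"
    )
    for device, count in count_paging_errors(sample).most_common():
        print(device, count)
```

If the error count simply moves to whichever volume holds the paging file, as described above, that points away from any single disk and toward something the volumes share, such as the controller, cabling, or driver.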


 

supercharn



Yes, we are running RST on the server to manage the RAID, but it doesn't have many features or options. How would I use it to run a consistency check or disable cache mode?
 

leo2kp

Distinguished
In RST it's called a Verify. You should be able to find it under Manage. I'm not at home so I can't look, but the RST Help might also be useful to find the Verify feature. Also in Manage you should be able to change the Cache mode.
 

supercharn



Okay great, I do see Verify in RST and can run that now. Do I need to make new backups, or get everyone out of the system, when I run this?

Also, "Write-back cache" is already set to Disabled for both arrays, and it doesn't seem like I can change or enable it through Manage in RST.
 

leo2kp





You can keep using the array while it is being verified, but performance may suffer a bit.

 

supercharn



Okay, I ran Verify on both arrays. On the RAID-1 volume, no errors were found or fixed. On the RAID-5 volume, it found and fixed 26312 errors. Do you happen to know if this is normal, and if not, what the next steps would be? And by the way, thank you for your help!
 

leo2kp



I would also run chkdsk /f against both volumes, but it needs to dismount the volume to perform the check, so do this during a reboot or after hours. Do not use the /r switch - it is unnecessary on a RAID and possibly harmful. Take another backup before running chkdsk.

The consistency errors could be caused by a bad controller, but I'm also curious what kind of disks are involved. Are they on a backplane, or do they use SATA cables?
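
The chkdsk advice above could be scripted along these lines. This is only a sketch: the helper name `build_chkdsk_command` and the drive letters C: and D: are assumptions, not details from the thread. It builds the command with /f only and never adds /r.

```python
def build_chkdsk_command(volume, fix=True):
    """Build the argument list for chkdsk on one volume, e.g. ["chkdsk", "D:", "/f"].

    Only /f is ever added; /r is deliberately left out, per the advice above.
    """
    # Accept only a bare drive letter followed by a colon, such as "D:".
    if not (len(volume) == 2 and volume[0].isalpha() and volume[1] == ":"):
        raise ValueError("expected a drive letter such as 'D:', got %r" % volume)
    cmd = ["chkdsk", volume.upper()]
    if fix:
        cmd.append("/f")
    return cmd


if __name__ == "__main__":
    # Hypothetical drive letters for the RAID-1 and RAID-5 volumes.
    for vol in ("C:", "D:"):
        print(" ".join(build_chkdsk_command(vol)))
```

On the server itself, each printed command would be run from an elevated prompt after hours, once a fresh backup exists.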
 

supercharn



Thanks for the suggestions. Last night I ran chkdsk /r on both volumes, then ran Verify again on both volumes through RST. The verify now shows no errors on either volume. At this point we will wait and watch to see if it happens again. If it does happen again just like before, then you are probably correct and it is the controller. If it stays okay after this, then the verify/fix/chkdsk steps are what worked.
On your question about the drives, they are connected to the motherboard via SATA cables. Is there something there I need to look at?
 

leo2kp



Do you happen to know what model of disks you're using? If you are using the wrong type of disks, then you'll have a higher chance of RAID issues.
 