RAID1 Mirror Corruption on 2008 R2 Server with Intel RSTe Controller with Intel SSD Drives

Stephen Done

Reputable
Aug 22, 2014
2
0
4,510
We are experiencing random, occasional but catastrophic array corruption on two servers that are on test before being moved to a hosting centre.

SERVER CONFIGURATION
SuperMicro SuperServer 6017R-TDLRF 1U Server
Incorporating SuperMicro X9DRD-LF Motherboard with Intel C602 Chipset with latest BIOS.
64GB ECC RAM
1x Xeon E5-2630v2 CPU
2x Intel DC S3700 800GB SSD Drives in RAID 1 (Mirror) on RSTe Hardware RAID.
Windows Server Enterprise 2008 R2, fully updated.

CORRUPTION PROBLEM
Under heavy load, after a random period of time, often when doing a Windows backup, the array corrupts and the following event log messages are generated. There are varying quantities of each message...

Event ID: 55
Description:
The file system structure on the disk is corrupt and unusable. Please run the chkdsk utility on the volume VMs.

Event ID: 12289
Description:
Volume Shadow Copy Service error: Unexpected error CreateFileW(\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy25\,0x80000000,0x00000003,...). hr = 0x800703ed, The volume does not contain a recognized file system.
Please make sure that all required file system drivers are loaded and that the volume is not corrupted.

Event ID: 136
The default transaction resource manager on volume E: encountered an error while starting and its metadata was reset. The data contains the error code.
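
Since the quantity of each message varies from one corruption to the next, it can help to tally them per incident. Below is a minimal sketch, assuming the event log has been exported to plain text with lines in the same "Event ID: N" form quoted above. The export format and the function name count_event_ids are assumptions for illustration, not part of any Windows or Intel tool.

```python
import re
from collections import Counter

# Matches lines such as "Event ID: 55" (the form quoted in this post).
EVENT_ID_PATTERN = re.compile(r"^\s*Event ID:\s*(\d+)", re.MULTILINE)


def count_event_ids(log_text):
    """Return a Counter mapping each event ID to how often it appears."""
    return Counter(int(event_id) for event_id in EVENT_ID_PATTERN.findall(log_text))
```

Comparing the tallies between runs on different driver versions would show whether the mix of events 55, 12289 and 136 changes with the driver.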

A chkdsk on a corrupted volume shows hundreds of lines of errors. I can post these too, but I do not think the exact errors are relevant, as they vary each time. They include:
...
The object id index entry in file 0x19 points to file 0x174c
but the file has no object id in it.
...
The multi-sector header signature for VCN 0x0 of index $I30
in file 0x3e is incorrect.
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
Error detected in index $I30 for file 62.
The index bitmap $I30 in file 0x3e is incorrect.
...

TESTS PERFORMED
# We have two complete servers with identical hardware. We can repeat the fault on either server. So we know there is not a fault with a specific hardware item.
# We have tested with 800GB Intel SSDs with HP firmware. We have also tested with 200GB Intel SSDs with Intel firmware. Configurations with both drive versions exhibit the fault.
# We have tested with Windows-based software RAID and the fault does not occur. Unfortunately this halves the array read performance, as we have confirmed with drive benchmarking software. Having spent £2000 per server on drives, halving the disk performance is not something we want to do. Since the software RAID works, this suggests that the drives and connectivity are not at fault, as they are used in both hardware and software RAID. Switching to software RAID uses a standard Microsoft AHCI driver instead of the Intel RSTe driver.
# Configurations with the following Intel RSTe driver versions exhibit corruption: Version 3.8.0.1113, version 4.0.0.1045 and version 4.1.0.1047.
# Configurations with Intel 'C600+/C200+ series chipset SATA RAID' RSTe driver version 3.6.0.1093 do not exhibit corruption.
# Power consumption is around 130 Watts and is well within the limits of each server's dual 500 Watt power supplies.
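
For anyone who wants to try reproducing this on their own hardware, a write-then-verify load can be sketched as follows. This is only an illustration of the idea, not the test procedure we actually used: the file names, sizes and counts are arbitrary assumptions, and the target directory is a hypothetical placeholder that should point at the volume under test.

```python
import hashlib
import os
import tempfile


def write_files(directory, count, size_bytes):
    """Write `count` files of random data; return {path: sha256 hex digest}."""
    expected = {}
    for i in range(count):
        path = os.path.join(directory, f"stress_{i:05d}.bin")
        data = os.urandom(size_bytes)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to flush the data to the device
        expected[path] = hashlib.sha256(data).hexdigest()
    return expected


def verify_files(expected):
    """Re-read every file and return the paths whose contents no longer match."""
    corrupted = []
    for path, digest in expected.items():
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() != digest:
            corrupted.append(path)
    return corrupted


if __name__ == "__main__":
    # Hypothetical target: replace with a directory on the array being tested.
    target = tempfile.mkdtemp(prefix="raid_stress_")
    expected = write_files(target, count=16, size_bytes=1 << 20)
    bad = verify_files(expected)
    print(f"{len(bad)} of {len(expected)} files failed verification")
```

Note that a read straight after a write may be served from the operating system's cache rather than the array, so a realistic run would write far more data than the 64GB of RAM, or verify after a reboot.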

OBSERVATIONS
# Once corrupted, running an array verify from the Windows Intel RAID utility often results in a blue screen.
# Once an array has corrupted, if we break the array and inspect each disk of the mirror, we find that one drive is intact and the other drive is corrupt. But this is not a fault with a drive or a cable because we have run tests on two different servers, six drives and four SATA cables.
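
The comparison of the two mirror members can be sketched as a block-by-block diff. This assumes each member disk has first been captured to an image file (how that is done is outside this sketch), and the function name compare_images and the 4096-byte block size are illustrative choices rather than anything taken from the Intel tools.

```python
def compare_images(path_a, path_b, block_size=4096):
    """Return the byte offsets of every block that differs between two images."""
    mismatches = []
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if not block_a and not block_b:
                break  # both images exhausted
            if block_a != block_b:
                mismatches.append(offset)
            offset += block_size
    return mismatches
```

Some differences may be legitimate, for example in any region the controller reserves for its own per-disk metadata, so the interesting result is where the mismatches fall inside the volume and whether they cluster.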

CONCLUSION
Based on 6 weeks of exhaustive testing, we have concluded that there is a fault in the Intel RSTe driver.
We are trying to find a way to get this bug fixed.
If others have had the same issue, that adds weight to the case.

Has anybody else experienced this behaviour? If so, have you been able to fix the problem by downgrading to RSTe driver version 3.6.0.1093?

Does anyone have any good suggestions or good contacts at Intel, so we can get this information to the right people so it gets fixed?

We don't really want the servers to go live with a very old driver version because, once live, they will effectively remain stuck at that version: it will be too risky to update them.

Any help or suggestions are appreciated.

Best regards

Stephen Done



 
This may seem unrelated, but what is the command rate on your RAM? If it is 1T, set it to 2T. I've seen corruption problems, not quite like this, but similar, that went away when I made that change.

Otherwise, the level of detail you have provided would probably be very useful to someone in Intel's tech support; and we'd love to hear what they say.
 

Stephen Done

Aug 22, 2014


Sorry, I forgot to mention that the servers were also burn-in tested for a week and have had a full set of tests on all hardware, including memory. The problem varies with driver version. It also does not occur using software RAID. So I think we have ruled out a RAM problem.

Yes, I am keeping my fingers crossed that we get a positive reply from Intel.

Best regards

Steve