I've got a server box that has been running Windows Server 2003 x64 Standard for about a year now, based on an Asus PVL-D (the 4-port 133 MHz PCI-X version) with an 8-port Promise EX8300 semi-pro RAID card installed. I originally started with 3 Seagate 500 GB ES.2-series drives (ST3500320NS) but have over time migrated the RAID 5 array to now include 6 of those 500 GB ES.2 drives.
The first 2 migration operations went without problems, but the last one has turned my RAID 5 array into a true nightmare.
For some unknown reason, my RAID 5 array has become unstable, resulting in data corruption in newly written data. I can reproduce the problem by copying a huge file from the RAID 5 array back onto itself and running "comp" afterwards.. In roughly 20% of the tests the two files are NOT identical.
I coded a quick app in C which wrote a 1 GiB block of known patterns to the beginning, middle and end of the logical volume.. this data was then read back and compared.. again, in roughly 20% of the runs, errors were present in the data.
This is very bad. The array is now in read-only mode, as I have lost faith in it.
Funnily enough (if this can be called "fun" at all), the first two ports of the controller are hooked up to two old 120 GB disks in a RAID 1 config. The old disks have an ATA-to-SATA converter box on them which provides a SATA 1 link to the controller. On this array I cannot reproduce any errors.
Based on this, a few hypotheses can be drawn:
1) The problem could be size related.. The RAID 1 is 120 GB, the RAID 5 is 2.5 TB (~2.3 TiB).. Before the last migration, the array was 1.8 TB.
2) The problem could be speed related? (SATA 1 versus SATA 2) (this is general.. the speed issue could be anywhere in the chain: memory issues, DMA burst issues, PCI bus issues, etc..)
3) Problem could be port related? (very unlikely, but hey.. maybe)
4) Problem could be HD related? (such as cache, NCQ, weird features)
5) Power problems (new HD installed)
5) Nah.. Got a 600 W PSU which draws a stable 1.5 A at the wall during read/write operations on the array. Plenty to spare.
4) This could be part of it.. although the array ran fine with 5 disks with write cache/NCQ on.. so I don't see why 6 should fail..
3) This is nonsense.. anyhow, I checked by moving disks around.. it was nonsense..
2) This is very likely.. Increased speed puts strain on all components in the chain, and even though flow control should (and must) prevent any data corruption, DMA problems and the like have been reported over and over again in other configurations..
1) This could also be it, although I don't quite see how.. GPT supports -very- large partitions, and an x64 box shouldn't have any problems beyond 2 TB.. but it's worth considering anyway..
Now, how to debug?
1) Have one or more of the disks gone bad?
Now, the controller should detect this automatically and degrade the array. Nonetheless, I ran a Media Patrol pass (a controller feature operated via the GUI), which scans the physical media sector by sector for errors. None were found on any of the six drives.
2) I have of course disabled everything cache related, but can still reproduce the errors.
3) Memory testing and a burn-in test of the CPU yielded no problems.. the ECC memory logged zero corrections after a half-day run.
4) uhmm.. what now? The BIOS on the PVL-D prevents me from tuning the FSB (or the scaling factor), so I don't really know how to slow down the system.. I guess server boards come pretty locked down with respect to tuning options?
The ES.2 drive does provide a jumper for forcing it into SATA 1 mode.. I'll try this next.. otherwise I'm kinda out of ideas.. anyone?
In the meantime, I've ordered an S3000AH (LC) board from Intel and an Adaptec 30165 controller.. Adaptec is an old hand in this business, and I fully trust them.. On the other hand, I have had no major problems with Promise controllers before, so I thought I'd at least try to narrow down the problem before dismissing them for good..
I had problems with a different Promise RAID controller. The RAID-5 setup would run fine, then data corruption would hit about every 3 months. It became more frequent, so I did many tests. The root problem was the RAM for the cache on the controller itself. It had gone bad.
When you disabled the cache, was it the individual hard drive caches or the controller cache (or both) ?
I disabled everything cache related to ensure direct-to-disk writes, but was consistently able to reproduce corruption errors.
After debating with Promise for a couple of rounds, I finally gave up and replaced the controller with an Adaptec and the motherboard with an Intel one. I still believe Promise has a fair product in the mid-range market, but in my case, with the PVL-D, some form of hardware incompatibility must have been present..