HW or SW - System Stable for 4 wks Spontaneously Corrupts

beluczywo

Distinguished
Sep 30, 2008
MOBO - GA-EP35-DS3R
Drives - 4 Western Digital 500, 2 Samsung 750, 2 Samsung 500
Memory - Corsair Ballistix (2GB)
Video - BFG nVidia 8500
CPU - Core2 Quad Q6600 Retail
PSU - CoolerMaster 750W Modular
Other - 2 Generic 4 bay hotswap drive cages

This system is set up for WinXP SP3 with all current patches, plus Tom's hack on LDM for RAID. The two 750s each have a 250G partition, and those two partitions are mirrored. The remaining 500G on each of them plus the other 6 drives gives 8 members at 500G each, in software RAID5 for a 3.5T array.
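For anyone checking the numbers, here is a minimal sketch of the capacity arithmetic for the layout above. The function names are made up for illustration, sizes are nominal GB as quoted in the post, and it ignores formatting overhead and the decimal vs. binary GB difference.

```python
def raid5_usable_gb(member_sizes_gb):
    """Usable space of a RAID5 set: one member's worth goes to parity,
    and every member is limited to the size of the smallest one."""
    if len(member_sizes_gb) < 3:
        raise ValueError("RAID5 needs at least 3 members")
    return (len(member_sizes_gb) - 1) * min(member_sizes_gb)


def mirror_usable_gb(member_sizes_gb):
    """Usable space of a mirror: the size of the smallest member."""
    return min(member_sizes_gb)


# Layout described in the post: each 750 is split into
# a 250G slice (mirror) and a 500G slice (RAID5 member).
mirror_slices = [250, 250]
raid5_members = [500] * 6 + [500] * 2  # six whole 500s + two 500G slices

print("mirror usable GB:", mirror_usable_gb(mirror_slices))
print("RAID5 usable GB:", raid5_usable_gb(raid5_members))
```

The same function can be fed any other set of members (six 500s, for instance) to compare layouts before rebuilding.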

The system will run stable for 4 weeks, then spontaneously lock. Upon reboot, Windows will start to sweep the blue bar but never boot. Reboot and try Safe Mode -- won't work. Reboot and try "Last Known Good Configuration" -- it boots, but it's not happy: rebuilding mirror, RAID5 offline, and 1 drive typically listed as foreign (never the same one or the same channel). Bring the array online, initialize the broken drive, start to resync the RAID. After about 5%, spontaneous reboot and that's all she wrote -- completely trashed -- errors about HAL, missing system32 directory, etc.

No messages in the logs when I get back in for a short time. All 8 drives test good with the Mfg drive program (I did the quick test, not the full surface scan, but have moved drives around enough to believe it's not specific to one). It has done this 3 TIMES, with minor changes in between to try to figure it out (putting different drives in different orders, turning off drive-level write caching, not doing all patches, minimizing HW drivers).

Very basic software thus far -- avast antivirus, zonealarm, gigabyte audio, nvidia control panel, dvd43 running in the background, windows autoupdate turned OFF completely after manual patching.

Any thoughts -- please? Here are my brainstorms:
Bad single disk -- don't know how it could test okay and run 4 wks and trash mirror as well.
Microsoft patch that destabilized the RAID hack
Problem with using the 6-SATA controller and the 2-SATA controller in this fashion.
MOBO problem with controller -- GigaByte was NO HELP of course.
Bad connectors on hot swap bays -- find this hard to believe -- it is a glorified wire. Literally a connector with a housing, and I would think you'd see delayed write failures if there were issues.
Bad power supply or glitch -- it's always spinning everything even after hosed, PSU is protected by a UPS, doesn't necessarily fail around storms

Options:
Try Vista Ultimate next (hate vista, but clean integrations of RAID LDM)
Give up on 3.5TB contiguous space and put 2 boot drives on purple connectors, 6 drives on orange, and have a 750x2 mirror and a 500x6 (2.5TB) RAID space -- getting the RAID off the same drives as Windows will at least keep me from losing data and having to do a complete restore.
Use HW raid instead of SW -- I hate this idea because it locks you to hardware and pretty much means I need to keep a spare MOBO for when something dies.
Take it out back and put a shotgun shell into it.
 

bilbat

Splendid
Excellent post - obviously you know exactly what you're doing, and are extremely technically competent. That said, so far, I vote for the shotgun shell.

I have learned something invaluable here, though. I read a review or post quite a ways back that apparently seemed convincing (my bad - certainly have learned not to swallow everything I read on the web!) and said that the ICH9R was limited to RAIDing only 4 drives: pick from two pair in 0 or 1, 3 or more in 5, or 4 in 10, but THAT's IT! After reading your post, I went to Intel & DL'd the actual chip specs - voila: all 6 ports! Thank you for the revelation. I am planning a new development system, and need 4 drives in two RAID0 pairs (Velociraptors for systems & swaps) and a RAID1 pair of 1TB's for data - thought I'd need, as a minimum, a MOBO w/extra JMicron controller to do this as I, too, don't really want to commit to h'ware RAID controller.

I'll mull this over for a while, re-read it a few times, and see if anything jumps out at me.

I find that my subconscious tends to do my best troubleshooting. I do industrial systems, and used to work at a place where the kid who did the wiring smoked a lot of dope, and made a lot of mistakes. Before I'd ship a control panel, I'd disconnect the internal transformer, hook up a power cord with some clip-jumpers at the end, and test the obvious. Hooked up one particular box, plugged it in, and POOF, nice flash of light, room lights out, cord plug charred, & breaker blown. Reset breaker, take off cord, look for short in box - nothing - DEAD open between 120 hot line & both common and ground. HUH! Figure I'll be days to find an invisible short, and leave to work on other stuff. Wander in every hour or two to look at it and ponder...

Return late that afternoon, go to my travel bag full of field startup gear, take out an outlet tester, SAS, line and common switched - same kid had wired the outlet, & didn't know the difference between the white & black wires! Hot (wrong) side of outlet connected to ground in receptacle, common connected to ground in cabinet - invisible short - only exists when you plug it in... Don't know, at all, where the inspiration came from, but have learned to rely on it (A LOT!)...

 

bilbat

Splendid
I took a look at the board itself, and noticed that it doesn't have the 'crazy cool' heat-pipe thingy connecting the regulators, northbridge, and southbridge. I know that if you want to push the RAM on these Gb boards, you really wanna put a fan on the northbridge, as it's really doing the majority of the work. Do you think that the southbridge, because it's 'really humping' ALL of its channels simultaneously, might be running too hot? Never have heard of it, but, like I said, never have heard of a six drive s'ware RAID... I do know that, while investigating several h'ware RAID controllers, boards running REALLY HOT seems to be a nearly universal complaint.
 
