System randomly freezing

Greetings Everyone,

My main desktop seems to be in some trouble at the moment. For the last few weeks the video has been randomly shutting off (the driver halts, then tries to recover and fails, resulting in a 0x116 bluescreen), or the system has been freezing completely: no mouse movement, no keyboard response, and it must be hard reset with the reset button. The most aggravating part of this issue is that it only happens a couple of times in one day, every 7 or 8 days. Once it happens twice, I don't see the problem again until the following week.

Specs:

Intel Core i7 920
Gigabyte GA-EX58-UD5
12GB DDR3-1600 @ 9-9-9-24
eVGA Geforce GTX 295
3 x 1.5TB Seagate 7200.11 (one is the boot drive, one is a backup volume with program installers, and the last is a buffer drive for uncompressed Blu-ray disc rips)
3 x 1TB Seagate 7200.11 (in RAID 0)
Corsair CMPSU-1000HX 1KW

I have an additional GTX 295 card here that I have swapped in on several occasions, in case the card itself was the point of failure. Alas, it does not appear to be the issue, as the system had the aforementioned problems with the second 295 card swapped in. I have upgraded and downgraded the video drivers throughout this ordeal. I am currently running the 260.xx series beta driver from Nvidia; the WHQL-certified 258.96 drivers were installed up until a few days ago, with various other versions tried along the way.

I have tested the PSU with a power supply tester. All voltages were spot-on, except the 5VSB which was off by 0.1 V (still well within the 10% tolerance). I have also run MemTest86+ on a few occasions, including one session for 12 hours. No faults were detected during any of the test runs. All 6 RAM sticks have been reseated, and the slots cleaned with short bursts of compressed air while the sticks were removed. The same was done to the first x16 PCI-E slot on my motherboard when swapping the video card. The case in general - including the power supply - was cleaned with compressed air the first time this happened (about a month ago). Temperatures were also fine, as this has never occurred while in the middle of a game (only when randomly clicking desktop items, or using Alt+Tab to get back to the desktop from a full screen game), or while running anything else CPU or GPU intensive. Just for reference, temps are below:

CPU Idle:
Diode - 35
Core 1 - 42
Core 2 - 43
Core 3 - 42
Core 4 - 44

CPU Load:
Diode - 47
Core 1 - 58
Core 2 - 59
Core 3 - 61
Core 4 - 60

GPU Idle (70% fan speed)
GPU 1 - 49
GPU 2 - 48

GPU Load (70% fan speed)
GPU 1 - 78
GPU 2 - 80
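
For anyone who wants to sanity-check the 5VSB reading mentioned above, here is a rough sketch of the arithmetic in Python. The 10% tolerance is the figure quoted in this post; the 5% limit is included as an assumption, since as far as I know that is what the ATX guidelines usually give for the standby rail. The post does not say whether the reading was high or low, so both 4.9 V and 5.1 V are checked. The function names are made up for illustration.

```python
def percent_deviation(nominal_v, measured_v):
    """Return how far a measured voltage is from nominal, as a percentage."""
    return abs(measured_v - nominal_v) / nominal_v * 100.0


def within_tolerance(nominal_v, measured_v, tolerance_pct):
    """True if the measured voltage is inside +/- tolerance_pct of nominal."""
    return percent_deviation(nominal_v, measured_v) <= tolerance_pct


if __name__ == "__main__":
    nominal_5vsb = 5.0
    # The reading was "off by 0.1"; direction unknown, so try both.
    for measured in (4.9, 5.1):
        dev = percent_deviation(nominal_5vsb, measured)
        for tol in (10.0, 5.0):  # 10% from the post, 5% as an assumed stricter limit
            ok = within_tolerance(nominal_5vsb, measured, tol)
            print(f"{measured:.1f} V: {dev:.1f}% off nominal, "
                  f"within {tol:.0f}%: {ok}")
```

Either way the deviation works out to about 2%, so the reading passes under both limits. Keep in mind that a tester only checks the rails at light load, so this says nothing about how the PSU behaves under real use.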

As a shot in the dark, I am currently running Spinrite on all of my hard drives (except the RAID array, which it won't see; I don't really want to change the controller mode and deactivate the array just to run Spinrite on a bunch of 1TB drives).

None of the parts in this system have been overclocked. Everything is at stock clocks and stock voltages.

If anyone can think of something else to try that I have missed, please let me know... I would really appreciate another brain working on this one, as I'm starting to get stumped.

Regards,
TP
 

Wamphryi

Well, I would look at the hard drives next. One of the nastiest problems I ever faced was when a system HDD developed an intermittent BIOS chip fault. The system would just freeze or fail during install, yet the HDD itself gave no indication that there was any problem.
 
Make sure you have the latest drivers for your motherboard and the latest BIOS update. Have you tried a complete reinstall? Try a new HD with a fresh install to see if the problem persists.

The only ways to rule out the PSU are to use another working PSU and see whether the problem persists, or to move your PSU to another system and see whether the problem follows it. All we really know at this time is that it's not the GPU, most likely not the memory, and possibly not your PSU. The PSU's power output could change once it has been in use for a while.

You could have a motherboard that is slowly dying. Inspect your motherboard for leaking capacitors, check your chipset temps if possible, and look around the chipset for discoloration. You may have to resort to the finger test after your system has been running a while. The chipset should be warm, but if it's burning hot, that could well be the problem. An overheating chipset could cause HD problems.

Lastly, you may want to move your computer to a different part of the house. The outlet you're using can cause swings in power. Your home wiring could be the cause, or your power company could even be having brownouts. Voltage to your PSU could be the cause, and the only way of overcoming these problems is a quality UPS. The prices on units that cover a 1KW PSU are very high. I am recommending these 2 because your system specs shouldn't require anywhere near the full 1KW. The APC brand is the better of the 2 by far.
http://www.newegg.com/Product/Product.aspx?Item=N82E16842102070
http://www.newegg.com/Product/Product.aspx?Item=N82E16842101067

From these 2 the prices start going up fast, with a full 1KW unit costing $498.99.
http://www.newegg.com/Product/Product.aspx?Item=N82E16842102104
 
I'll try to answer everyone's questions with a single post here...

So after leaving Spinrite on Level 2 overnight, no problems were detected on any of the three 1.5TB drives.

The BIOS is up to date (F12 installed). There is an F13j beta BIOS available, but I'm not quite that desperate yet.

The 1KW power supply was installed because at one time, I had both of my GTX 295s in SLI. Since the performance increase was minimal with a lot of extra power draw, I removed the second 295 to use in situations like now, where I thought (at least at one point) I had a failing video card and could swap in the other one.

I left the system frozen for roughly 2-3 minutes the first few times it happened. Sometimes the system would reboot itself via the 0x116 bluescreen I mentioned above, other times I would use the hard reset method.

I doubt the wiring is an issue, because this system is regularly at 2 houses (mine, and my mom's place). It has shown the same symptoms at both places on several occasions. Thus, I fail to see how a UPS would help at this point, though I am thinking about getting one anyway. I do at least have a surge protector in place at both houses, and the power reliability is pretty good at both houses as well.

Unfortunately I do not have another adequate power supply for use in this system to test the "failing under load" theory. Getting one would require another $120-140 that I don't really want to spend right now (ignore the fact that I bought 2 GTX 295 cards when reading that last sentence) if the cause lies somewhere else.

There was at one point a potential heat issue with the chipset, but I have placed a fan inside the system blowing on the heatsink to cool it down. I just double checked for blown capacitors on the board, and did not find any.

I'll try to figure out a way to swap in another usable power supply into the box to test that theory, but I may have to go and buy one after all.

I also realized my specs above are incomplete. I have a few additional expansion cards installed which may be causing the problem. They are:

Hauppauge Win-TV HVR-1800 TV tuner
Creative X-Fi Titanium Fatal1ty Champion Series sound card
Asus U3S6 SATA3/USB3.0 add-in card (has been removed in the last few hours)

Edit: The Asus card was not the issue. I've moved the system back to my place and it froze again. I also attempted to raise the RAM voltage slightly (1.50 to 1.54), but that appears to have had no effect either.
 
Now, the system ran all night while being tested. Any chance a sleep state could be causing your problems? Try changing the Windows settings to never use energy saving, and turn off all sleep mode settings. Rule this out as a possible cause.

OK, we have ruled out house wiring. Try running your system with each of the add-in cards removed. You just want to rule them all out. Try a fresh install with only 1 HD to rule out a bad previous install or a possible new software or driver problem. I understand cost is an issue, so do all the cheap tests first, in a way that definitively rules them out as the cause.

"There was at one point a potential heat issue with the chipset, but I have placed a fan inside the system blowing on the heatsink to cool it down. I just double checked for blown capacitors on the board, and did not find any."
Was the heat issue bad enough to damage the chipset? This has me worried about your motherboard at this point. A damaged chipset may only produce problems every once in a while.
 
Sleep and Hibernation have been completely turned off on this system (first thing I did when I installed Windows 7). I removed the add-in cards and the problem still occurred.

Cost, however, is no longer an issue. I've bought a Gigabyte EX58A-UD5 and Core i7 950 CPU to replace my current EX58-UD5 and i7 920. A family member needed a system with additional RAM support, so I saw this as a good chance to move up a bit; the fact that I was able to price match the parts for a bit cheaper than I would otherwise have gotten them for helped as well.

So this effectively rules out everything but my Windows 7 install. I'll keep an eye on it over the coming weeks and let everyone know what happens.

Edit: The system has just frozen again in the same manner as before. I'm going to clone my boot drive as is tomorrow afternoon, then do a fresh install and play with it for a week or two to see if a fresh install finally fixes this.