Hi everyone, hope someone can shed some light on this problem. Before Christmas I treated myself to a decent PC. Specs are:
Gigabyte EP45-EXTREME (I know it's for x-fire primarily but I got a great price)
Intel Quad Core Q9550 CPU
8GB Corsair 1066 7-5-5-5 RAM (4x2GB sticks)
XFX GeForce GTX 260
Corsair 950W PSU
10000rpm Velociraptor 150GB HDD
Now the PC has been perfect up until a week ago, when it started BSODing with errors relating to IRQL_NOT_LESS_OR_EQUAL (RAM I believe) and nvlddmkm.sys (GPU driver). I ran Memtest and found errors in the last pair, so I checked the timings in the BIOS and set them manually. This worked for 3 hours or so until another blue screen with nvlddmkm.sys. I decided to flash the BIOS to the latest, which was a beta version on Gigabyte's website. This set the timings and voltages properly and seemed to work for the rest of the night.
The next day I got home from work and 5 mins into Windows booting I got another BSOD with ntfs.sys. Rebooted and got another with cdrom.sys. Ran Memtest again and more errors appeared. The PC had been off all day so it was starting up from cold. I thought maybe it was the beta BIOS and flashed down to the most recent non-beta... 8 hours later I was still playing with NO problems whatsoever. This brings us to now, where I have yet to go home tonight and see what happens. I ran CPU-Z to check temp issues and all looks fine, CPU runs at 58 under load, GPU 55 under load so I can't see any temp issues.
Has anyone had anything like this? To recap I have so far tried:
1. Manually setting voltages and timings for RAM.
2. Flash BIOS to latest non-beta version.
3. Updated NVIDIA drivers to the latest after a purge with Driver Sweeper to remove everything.
4. Checked registry for issues, none whatsoever.
5. Checked HDD for errors - none that I can see.
6. Ran memtest and found RAM errors - none after latest BIOS update.
Basically I don't trust the PC and I'm open to the possibility something has gone and will need replacing. I develop software for a living so I'm technically competent but I really can't find what is causing this problem. My obvious next step if I get another BSOD is to try 1 stick of RAM at a time and trial and error it that way. Failing that I'm thinking a motherboard issue.
My first thought was that your RAM is questionable. Then after your tests/trials I was thinking that your mobo was part of the problem. I'd first try 1 RAM stick at a time and see if you can isolate the problem.
OK got home last night and tried it with 1x2GB stick at a time, no luck. Still various blue screens with each one of the sticks - although 3 passes with memtest show absolutely no RAM errors whatsoever.
No problem with POST or OS boot up as might be expected with faulty CPU / Motherboard so I'm completely at a loss what it might be. I'm even starting to wonder if the OS itself is at fault... maybe a bad driver even?
I'll get to the bottom of it eventually but this is annoying as hell considering I need to prove a part is faulty if I'm to return it for a replacement. Any of you know some good tools I can use that will kind of debug all the major components?
Have you tried checking all connections (pull and re-connect all connections)??
You could also try different PCI-e power connectors to see if that changes anything; maybe you have a faulty connection in one of those. Also, maybe your front panel connections are iffy. You could also pull all the parts out of the case, lay them out on a non-conductive surface (cardboard works well) and try powering up and running the system outside the case. Maybe there was some metallic part/shaving interfering with parts on your mobo.
Ahhh VERY good point - I didn't even consider the PSU! Something I did consider however is that because the crashes seem to occur more frequently when starting up from cold i.e. PC has been off all day, maybe the actual buses in the motherboard have been broken and they only connect after a certain amount of heat has been applied?
Are motherboards really THAT fragile or is this an unlikely situation?
Not likely. It could be a part heating up real quick at the start, but I don't think that the buses are the issue. Try running it outside the case to rule out any shorts against the case.
OK got home last night and ripped the PC apart. Literally everything out and back in again slowly... blue screen with no error name this time, only a stop code of 0x00000000001E. Took out 2 of the 4 sticks of RAM and it seemed to run OK for the remainder of the night, so I prepped the 2 sticks I took out for sending back for replacement.
Get into work this morning and decide to check the RAM on the PCs here... so far using Vista memory tester (extended mode) on a whole new system and 50+% through no errors whatsoever. Time will tell but I'm starting to wonder if the DIMM slots on the board are faulty - will probably try the 2 (apparently) working sticks in the next DIMM slots tonight.
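One way to keep the stick-versus-slot question straight while swapping things around is to log every trial and tally failures both ways. The snippet below is only a minimal sketch: the stick labels, slot numbers and outcomes are made-up placeholders, not results from this thread, and `failure_rates` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical trial log: (stick label, DIMM slot, did the run crash or show errors?)
# These entries are placeholders for illustration, not measurements from this PC.
trials = [
    ("A", 1, False),
    ("A", 3, True),
    ("B", 1, False),
    ("B", 3, True),
    ("C", 1, False),
]

def failure_rates(trials, key_index):
    """Return {key: (failures, runs)} grouped by stick (index 0) or slot (index 1)."""
    runs = Counter()
    fails = Counter()
    for trial in trials:
        key = trial[key_index]
        runs[key] += 1
        if trial[2]:
            fails[key] += 1
    return {key: (fails[key], runs[key]) for key in runs}

by_stick = failure_rates(trials, 0)
by_slot = failure_rates(trials, 1)
print("by stick:", by_stick)
print("by slot:", by_slot)
```

If the failures pile up on one slot regardless of which stick is in it, that points at the board; if they follow one stick from slot to slot, that points at the stick. With only a handful of runs per combination the tally is suggestive at best.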
This is the headache you adopt when you buy 2 separate kits of RAM. There's a reason that everyone advises against 4x2 (especially buying separate kits). Do yourself a favor and RMA both kits and pick up a single 4x2 kit, or even better a 2x4 kit. Either that or just return the kit, eat the restocking fee and be happy with 4GB.
You're going to be in the same exact boat if you just RMA a single kit. Even though you get 2 kits that are the same exact part number, you almost always end up with 2 kits that come from different binnings. Welcome to no-post and BSOD hell.
That's the thing, each 1 of the 4 RAM sticks is the EXACT same everything. They're the Corsair Dominator, each with a 5-5-5-15 timing and 2.1V requirement. I'm really starting to suspect the actual DIMM slots so will swap them tonight and see what happens.
I'll also, just as a precaution, do a full OS reinstall in case the reg has become corrupt or some driver .dll file is causing instability. I'll keep you all updated.
Just an update, after 6 hours, on the 4th of 5 passes, we see memory errors. Sending back for replacement... now I'll see tonight if we get any more blue screens with the 2 (apparently) working RAM sticks in while I wait for replacements.
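Errors that only show up on the 4th pass fit an intermittent fault, which is also why an earlier clean 3-pass run did not clear the RAM. As a rough illustration only, assume each pass independently catches the fault with some probability; the value 0.25 below is an invented number, not something measured here, and `chance_of_clean_run` is a hypothetical helper.

```python
def chance_of_clean_run(p_detect_per_pass, passes):
    """Probability that every one of `passes` independent passes misses the fault."""
    return (1.0 - p_detect_per_pass) ** passes

# Assumed per-pass detection probability (illustrative, not measured).
p = 0.25

for n in (1, 3, 5, 10):
    print(f"{n} passes: chance of a clean result = {chance_of_clean_run(p, n):.3f}")
```

Under that assumption a faulty stick would still come back clean from a 3-pass run about 42% of the time, so longer runs (overnight if possible) are worth it before trusting a "no errors" result.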