Epic System Troubleshooting

cmc242

Distinguished
Oct 20, 2007
5
0
18,510
Greetings Forum Denizens,

Like many others, I have come to Tom's Hardware to plead for help with a PC problem that's far beyond my own abilities. For two months I have worked at this PC build, which would be my first gaming machine since high school, and only now am I ready to admit that I can't fix it on my own. Not easily do I admit this, for I am a graduate of an electrical and computer engineering program at a good university. I won't bore you with tales of my past hardware exploits, I just want you to know that whoever solves this problem earns enough Tech-XP to gain two levels and gets to loot a high-level nerd corpse.

~~The Problem~~

The system has, on several occasions, been stable enough to install an OS (WinXP SP2 and Ubuntu 7.10) and get several hours of WoW or Oblivion in. After running for a few hours, I would power down the machine. On the next morning, the machine sucks and dies. The machine might suck and die on several boot attempts and then run flawlessly for as many hours as I care to leave it on, but over a few sessions the odds of "suck and die" seem to go up a lot, and it always ends in BIOS Corruption.

"Suck and die" is a somewhat general term that refers to
(1) no POST at all, possibly with no Video signal
(2) POST, but lock up before loading OS, possibly as I'm trying to configure the BIOS
(3) freeze while loading OS, possibly with WinXP Stop Error (usually IRQ related, sometimes pointing to a .dll) or a Linux kernel panic notice
(4) OS loads and I log in, but freeze or crash before all services and drivers are loaded

I've tried to boot this thing hundreds of times (by all the captains of the Enterprise, I wish I had kept a log of all the boot attempts) and the freeze really does happen at all four of those points.

Clearing CMOS will sometimes (but not always!) get me a successful boot. Flashing the BIOS has resulted in *one of the next four* boot attempts being successful on the two occasions that I flashed the BIOS.

~~The System~~

--Base--
Intel Q6600 with stock cooler & Arctic Silver (after the price drop, of course)
2x Corsair XMS2 1GB DIMMs (both recently passed two full memtests from the Ubuntu CD)
EVGA 8800GTS, 640MB, factory overclock
Antec P180 chassis
Gigabyte P35-DS4 Rev2, later Asus P5K Deluxe
Antec NeoPower650, later OCZ Gamestream 700 (yeah, I guess these are oversized for non-SLI)

--Peripheral--
2x Samsung 500GB SATA drives
Floppy Drive, Lite-On combo optical drive, USB Keys & Mouse

--BIOS--
Both boards had many attempts with the original BIOS and the most recent BIOS from the company website.
I didn't tinker much with BIOS settings, but I have tried setting Asus' AI-overclocking to 'standard' instead of 'auto' and enabling or disabling USB support.


~~The Trials~~

So, what have I been doing these last two months, none of which did even the slightest good? Glad you asked...

Actually, I'll skip most of it. Let me just say that I've tried very minimal configurations with both of those PSUs and both of the Motherboards and seen literally the same symptoms (the ones described above) in each case. Of course I'm posting on the Motherboards forum because this always turns into BIOS corruption, which is clearly happening on the motherboard. Having experienced the same pattern of boot failures with two different boards I'm starting to think that BIOS corruption is being caused by something off the main board.

~~Bench Test~~

When the heroes of Greece needed to get to the bottom of things, they carried gifts to the oracle at Delphi. When PC builders need answers, we do a bench test. In my case I took the Asus P5K out of the chassis and set it on a box with both DIMMs, the vid card, and the CPU with stock cooler still on the board. No keyboard, mouse or SATA device was attached. The chassis front panel connector was removed. I used a multimeter probe to bridge the Power and Reset pins for my testing. Before bench testing the PC was in a very bad state, not even showing the Asus logo on several boot attempts, and disconnecting power, which supposedly clears all CPU parameters, wasn't helping. CMOS was cleared by removing the battery and using the clear CMOS jumper as described in the manual prior to bench testing.

First result was the "Insert bootable media in appropriate drive" message. Encouraged, I attached the USB keys and SATA CD drive and tried to boot from the Ubuntu live CD. I deleted the 'quiet' and 'splash' parameters so that I could watch it spew out a stack trace when it sucked and died trying to configure Ubuntu on my hardware. Error messages were IRQ-related. After that I couldn't get it to POST over five attempts, so I cleared CMOS and removed the CD drive.

Out of the following 17 attempts to boot the board in this configuration only 7 resulted in the desirable "Insert bootable media and press enter" state. Here is a detailed log:

[Edit]No USB devices were attached at this time[/Edit]

(1) (pass) Insert bootable media and press enter... (reset)
(2) (pass) Insert bootable media and press enter... (reset)
(3) (pass) Insert bootable media and press enter... (cycle power)
(4) (pass) Insert bootable media and press enter... (cycle power)
(5) (fail) Freeze on "Initializing USB controllers" (reset)
(5.2) no video signal (reset)
(5.3) no POST, flashing cursor (reset with jittery fingers)
(5.4) (pass) Insert bootable media and press enter... (reset)
(6) (pass) Insert bootable media and press enter... (disconnect power 10sec)
(7) (fail) no video signal (reset)
(7.2) Freeze on "Initializing USB controllers" (reset)
(7.3) Freeze on "Initializing USB controllers" (reset)
(7.4) no video signal (reset)
(7.5) (pass) Insert bootable media and press enter... (reset)
(8) Freeze on "Initializing USB controllers" (cycle power)
(8.2) Freeze on "Initializing USB controllers" (cycle power)
(8.3) Freeze on "Initializing USB controllers" (disconnect power)
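The log above can be tallied with a few lines of Python. This is only a sketch: the `classify` helper and its outcome labels are hypothetical names introduced here, the log lines are copied from the list above, and a line is counted as a pass whenever it reached the "Insert bootable media" prompt.

```python
from collections import Counter

# The 17 bench-test attempts, copied from the log above.
BOOT_LOG = """\
(1) (pass) Insert bootable media and press enter... (reset)
(2) (pass) Insert bootable media and press enter... (reset)
(3) (pass) Insert bootable media and press enter... (cycle power)
(4) (pass) Insert bootable media and press enter... (cycle power)
(5) (fail) Freeze on "Initializing USB controllers" (reset)
(5.2) no video signal (reset)
(5.3) no POST, flashing cursor (reset with jittery fingers)
(5.4) (pass) Insert bootable media and press enter... (reset)
(6) (pass) Insert bootable media and press enter... (disconnect power 10sec)
(7) (fail) no video signal (reset)
(7.2) Freeze on "Initializing USB controllers" (reset)
(7.3) Freeze on "Initializing USB controllers" (reset)
(7.4) no video signal (reset)
(7.5) (pass) Insert bootable media and press enter... (reset)
(8) Freeze on "Initializing USB controllers" (cycle power)
(8.2) Freeze on "Initializing USB controllers" (cycle power)
(8.3) Freeze on "Initializing USB controllers" (disconnect power)
"""


def classify(line: str) -> str:
    """Return a coarse outcome label for one log line (labels are illustrative)."""
    if "Insert bootable media" in line:
        return "pass"
    if "Initializing USB controllers" in line:
        return "usb freeze"
    if "no video signal" in line:
        return "no video"
    return "other failure"


lines = [ln for ln in BOOT_LOG.splitlines() if ln.strip()]
outcomes = Counter(classify(ln) for ln in lines)
total = sum(outcomes.values())

print(f"{outcomes['pass']} of {total} attempts passed")
for label, count in outcomes.most_common():
    print(f"  {label}: {count}")
```

Grouping by failure type makes it easier to see that the USB-controller freeze dominates the failures in this particular session.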

At this point I gave up and put Freeze on "Initializing USB controllers" into Google. Not finding any useful suggestions, I decided to post here...

Note that I had *never* seen it freeze on "Initializing USB controllers" before the bench test, so fixing that problem may or may not fix the PC.
 

jZeroSeven

Distinguished
Dec 16, 2007
1
0
18,510
Hmm, not sure what is causing your issue, mostly because it is intermittent (the hardest bugs to fix). But since you have replaced most of the obvious components, it means it's either the other components or you're installing them in the most incomprehensible way.

I would try some other ram; just because it has passed Mem tests doesn't mean it works (now). Then the video card.

Let us know how exchanging these components works.

Thank you,
j07
 

roadrunner197069

Splendid
Sep 3, 2007
4,416
0
22,780
Try one stick of RAM; if it fails, try the other. If you get a POST, look in the BIOS to see if the RAM is set at factory specs. Try a different USB keyboard and/or mouse. Make sure nothing else USB is plugged in. Did you plug the wrong thing into a USB header on the mobo? Is the heatsink attached right? Check the USB settings in the BIOS. Maybe USB is set to run high speed and the mouse/keyboard isn't compatible. I've had a mouse screw a computer up before, so it's good to rule it out with a different one.
 
Entertaining narrative. It's weird. We've all been there with seemingly impossible problems.

It may yet be simple, though intermittent, and is only appearing weird. Let's assume it's a simple hardware problem and break it down from the beginning:

1. Go to barebones: PSU (don't forget the 4 pin 12v cpu aux.) mobo, cpu, video card, ONE (only 1) stick RAM (trying different sticks and different slots systematically, of course if this RAM switching solves it there you are) keyboard, all built out of the case for sure.

2. Have errors and failed POSTs been occurring on the above after a newly flashed BIOS?

3. If yes forget the other stuff and assume one of the above parts is bad.

4. One by one, switch out each part with a known good part until the errors stop. This may be tricky since it is intermittent, I know. But you may get lucky if you keep trying this.

You may have already done this but possibly in a somewhat haphazard manner or else under stress. If you think this may be possible then please try again and do it in a very calm and systematic way. Report back.
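One way to keep the RAM swapping systematic is to write the test matrix down before starting. The sketch below just enumerates every single-stick, single-slot combination; the stick and slot names and the ten-attempts-per-configuration figure are placeholders, not anything specific to this board.

```python
from itertools import product

# Placeholder names: label the physical sticks and slots however you like.
STICKS = ["stick 1", "stick 2"]
SLOTS = ["slot 1", "slot 2"]
ATTEMPTS_PER_CONFIG = 10  # arbitrary; more attempts help with intermittent faults


def single_stick_plan(sticks, slots):
    """Return every (stick, slot) pairing, one stick installed at a time."""
    return list(product(sticks, slots))


plan = single_stick_plan(STICKS, SLOTS)
for stick, slot in plan:
    print(f"{stick} in {slot}: boot {ATTEMPTS_PER_CONFIG} times, note each result")
```

Working through the list in order, and noting every result, makes it harder to skip a combination when you are tired or frustrated.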

EDIT: be sure RAM voltage is set to mfg. specs. If by chance you have RAM rated for 2.0V or higher, your mobo may be defaulting to 1.8V. This has been a common problem here lately. Causes weird problems and failed boots etc.

Also, could you please define what you mean by BIOS corruption and how this shows itself, and also tell us if this corruption has happened on two mobos? If you explained this already I am sorry, but it was a long story.
 

tlmck

Distinguished
Have you checked the AC coming from the wall? From the surge protector or whatever protection device you are using? In my experience, glitchy problems like this are almost always electrical. Since you have replaced the MB and PSU, you would have to go back towards the wall, I would think.

Other than that, it would have to be one of the components you have not changed out as mentioned above.
 

Zorg

Splendid
May 31, 2004
6,732
0
25,790
I agree, make sure you set the voltage for the RAM to the specification, I'm assuming that you did this initially. Also, XMS2 doesn't tell us what model of ram it is. I don't know the RAM test that comes with Ubuntu, so download UBCD and run memtest86+, unless it is the same. You can burn the ISO with ISO Recorder v 2 if you don't have a burning program handy. You said no USB installed, I assume that rules out the mouse and Keyboard. I'm guessing your 110V power is good, maybe slap a UPS on there to rule it out. If you have a UPS on it, then remove it. Since you have a meter, then I assume that you have measured the voltage on 12V, 5V and 3.3V on boot to look for droop. With two PSUs this is not likely, unless you have bad 110. If you get it to boot you can run Prime95 25.5 and watch to see if any of the cores fail and are dropped.
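If you do take meter readings on the rails, a quick check like the sketch below can flag droop. It assumes the commonly cited +/-5% tolerance for the +12V, +5V and +3.3V rails; treat that figure as an assumption and check your PSU's documentation. The readings shown are made up for illustration and were not measured on the system in this thread.

```python
# Assumed tolerance for the positive ATX rails (+/-5%); verify against your PSU docs.
TOLERANCE = 0.05


def rail_ok(nominal: float, measured: float, tolerance: float = TOLERANCE) -> bool:
    """True if the measured voltage is within the tolerance band around nominal."""
    return abs(measured - nominal) <= nominal * tolerance


def check_rails(readings: dict) -> dict:
    """Map each nominal rail voltage to a pass/fail flag."""
    return {nominal: rail_ok(nominal, measured) for nominal, measured in readings.items()}


# Hypothetical multimeter readings taken during boot: {nominal: measured}.
example_readings = {12.0: 11.30, 5.0: 5.08, 3.3: 3.31}

results = check_rails(example_readings)
for nominal, ok in results.items():
    status = "ok" if ok else "OUT OF RANGE"
    print(f"+{nominal}V rail reads {example_readings[nominal]}V: {status}")
```

With these made-up numbers only the 12V rail is flagged, since 11.30V falls below the 11.4V lower bound that a 5% band implies.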

I would like to know as well.
 

bberson

Distinguished
Oct 25, 2006
363
0
18,780
Since you mention IRQ-related items in the stack dump, and since the only peripheral card you mention is video, I would be inclined to swap out that video card. Drop in any cheap card just to test and see how the box behaves, before you go sending yours back to the vendor.
 

cmc242

Thanks much for the advice-

When I say "BIOS Corruption" I'm talking about a message on the POST screen that says "your BIOS is corrupt, insert [floppy USB CD] with new BIOS file". I forget exactly what it said with the Gigabyte board, but the Asus board has such a message.

j07 and others - I will try going down to one stick of RAM, even though it passes memtests. I was reluctant to do this because the Asus board has extraordinarily tight DIMM slots that make me very nervous when inserting RAM.

roadrunner - The bench test with no USB devices attached yielded repeated USB-related errors, so I think it may not be the USB keyboard or mouse. I actually have tried two different keyboards and no mouse, but this doesn't seem to help things. Forcing USB to FullSpeed instead of HighSpeed in the BIOS is probably a good idea though, I will give it a try.

notherdude & zorg - Glad you were entertained :)
Here are the full RAM specs: http://www.newegg.com/Product/Product.aspx?Item=N82E16820145566
I'll try forcing the voltage to 1.9V in the BIOS.

zorg & tlmck - I share your suspicion of the wiring in my apartment. One of the first problems I had with this build is that I could only power it up when plugged directly into a wall outlet. The CyberPower surge protector I tried to use somehow prevented me from starting the system. There are also a handful of outlets in this room where nothing runs at all, not even a desk lamp.

Looks like I have my work cut out for me. Only questions are what order to do these tests in and whether to buy booze first... Stay tuned for more results!

 

trevorblain

Distinguished
Sep 26, 2006
94
0
18,630
Interesting bit about the household wiring. The fact that the surge protector didn't like the socket it was plugged into should tell you something. Plugging the computer directly into a wall socket is never a good idea; the current is not steady enough, and that could possibly be the culprit behind the scrambled BIOS.

If possible, take your rig to a friend's house, somewhere with more reliable wiring where you can use your SP, and see if you can't get it back on its feet.

Otherwise I was going to suggest checking your PSU. Could be the rails are faulty and spiking your board.
 

cmc242

I have made some progress...

Over the last three hours I have logged 61 boot attempts while varying the wall socket, surge protector, RAM arrangement, and presence of keyboard. The results are:

~~Wall sockets & surge protectors~~
First off, I lied about running this machine with no surge protector. Turns out that I actually was using a small surge protector the whole time: a 700J traveler's unit that came with a Dell laptop in 2003. When I tried without any surge protector it wouldn't even give me a video signal. All of the RAM trials were made using a different power outlet (which probably makes no difference) and a bigger surge protector, the CyberPower unit that I bought from Newegg.

~~RAM Arrangement~~

I have two DIMMs and two yellow slots.

Yellow1 <= DIMM1; Yellow2 <= DIMM2 : The original configuration. Probably had it set up this way on the Gigabyte board too and (randomly) put the same DIMM in the same slot on the Asus board. This configuration has all the problems mentioned above.

Yellow1 <= DIMM1; Yellow2 <= Nothing : Made 10 boot attempts this way and had the same ~20% success rate as before.

Note that I cleared CMOS at this point since I had to take the vid card out anyway...

Yellow1 <= DIMM2; Yellow2 <= Nothing : Made 10 boot attempts and it did not suck and/or die even once. Eventually I had CPU Temperature Over warning from the POST screen, but I assume this is a result of running on a cardboard box and possibly doing a bad job of applying Arctic Silver to the Intel stock cooler.

Yellow1 <= DIMM2; Yellow2 <= DIMM1 : So I tried putting the probably-bad DIMM back in the machine, but not in the first socket. Strangely enough, this had the same high success rate as the previous trial. The only problem was frequent CPU Over Temperature errors after about ten consecutive boot attempts. Note that these errors did not result in a frozen machine; if I pressed F1 it would continue to operate normally (for a machine with no bootable media attached).

So either DIMM1 is the source of my troubles or the problem is elsewhere and all the good boot attempts are explained by clearing CMOS. Only time will tell...
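A rough way to judge whether ten clean boots in a row could be plain luck: if each attempt passed independently with probability p, the chance of ten straight passes is p to the tenth power. The sketch below evaluates that for the ~20% rate quoted above and for the 7-of-17 rate from the bench log. Independence is a big assumption for an intermittent fault, and this arithmetic cannot separate the DIMM swap from the CMOS clear that happened at the same time.

```python
def prob_all_pass(p_single: float, attempts: int) -> float:
    """Chance that every one of `attempts` independent boots passes."""
    return p_single ** attempts


# Baseline pass rates taken from the thread; independence is assumed.
baselines = {
    "~20% rate quoted above": 0.20,
    "7 of 17 from the bench log": 7 / 17,
}

for label, p in baselines.items():
    print(f"{label}: chance of 10/10 passes = {prob_all_pass(p, 10):.2e}")
```

Either way the chance is well under one in a thousand, so ten straight passes is unlikely to be a fluke of the old failure rate, though it still does not say which change was responsible.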
 

Zorg

OH MY GOD MAN! It's absolutely the power. There is no question in my mind and you need to look no further. It is classic bad power. The reason that I didn't think it was the culprit is because you had indicated that you had schooling, so I assumed that was the first thing that you checked. Get a 600VA or larger name brand (APC etc.) UPS. That might not even be able to overcome the bad power. Whether any components are damaged by the screwy power is debatable. As said get your landlord in there to clean up the power or move.
 

bberson

In this day and age it's pretty incredible that anyone - particularly someone who can afford a computer - could have such poor power in their residence. I've seen remote shacks in third-world countries with more reliable power than what the OP described. Wow...
 

Zorg

I was in a house that had old wiring, and I was too lazy to clean it up. I had an APC UPS on it and it would beep at least once a day, but I never had any problems with the PC. Kind of funny though, I did run a 20A dedicated circuit for a copier that I bought, and I left the PC on the crap circuit. I didn't want the motor noise from the copier to screw with the PC. :lol:

@cmc242: Try to get a UPS with over and under voltage regulation. They cost more and you probably won't be able to get them from the BEST BUY but they are worth it.
 

cmc242

Posting now from an apparently stable Windows XP install on the PC that we've been discussing.

Zorg, I was suspicious of the power in this apartment for a while, but that wasn't the problem. I'm running from one of the good wall outlets now and using a surge protector. Never had any other electrical device behave erratically.

Seems that a bad stick of RAM was the culprit. Corsair XMS2 1GB DDR2-800 is rated 4 stars on Newegg, but the bad ratings come from people who ordered this Fall and got lots and lots of DOA RAM. Some accuse Corsair of shipping returned RAM in the hopes that a high enough percentage would stick to the buyers that they could make some kind of profit. How in the hell a DIMM passes memtest twice (almost two hours of intensive testing) but still flakes out 80% of the time when I'm trying to boot is fun to theorize about...

After that it was a matter of re-attaching my CPU cooler. As an inexperienced system builder I screwed that up about five different ways. Eventually I got tired of trying different methods for spreading Arctic Silver Ceramic on the Intel stock cooler and brought out my Zalman 9500. That beast took more than an hour to install, made my fingers bleed in a dozen different places, and had to be assaulted with a needlenose pliers to get it to fit the case and Mobo, but...

IT FINALLY WORKS and now I can stop pacing around my apartment muttering "what the **** is happening to my computer" between sips of scotch. That really wasn't a healthy way to spend two months of weekend evenings...

Thanks for all of the help and advice. For those who didn't offer advice, thanks for having the patience to read my ramblings.
 

roadrunner197069

Amazing, if you would have listened to all your advice as you were getting it. Cool. I'm happy you finally got it going.
 

Zorg

I'm happy to see you found the culprit. I had discounted that to some degree as well, because you had already done the memory test. I wanted to confirm it, which is why I suggested memtest86+; it is a more rigorous test than some. It is certainly possible that your memory was damaged by bad power. I have never heard of bad RAM causing the BIOS on the EEPROM to become corrupted and require a flash, as you described. I still believe that your problems were created by bad power. That is why I always use a UPS. Even clean power can get dirty from time to time.

I just had my UPS beep three times while typing this. It's really windy in the NE USA right now and the power is showing it.

Edit: Don't confuse a surge protector with a UPS or a power conditioner. There is nothing in a surge protector other than a metal oxide varistor, which shunts the voltage spike to ground. They are better than nothing but not by much. He!!, sometimes they don't even catch the spike, like when there is a nearby lightning strike. That's why I unplug all sensitive devices when there is a thunderstorm.
 

Zorg

It depends on the brand and specs, and the load of the PC and peripherals e.g., router etc. If you have an LCD and you want to also protect that, then you need to include the additional power draw in your calculations as well. IMO a good UPS is extremely important.
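As a rough illustration of that calculation, the sketch below adds up the loads and converts watts to a VA rating. Every number in it is an assumption: the wattages are invented, the 0.6 watts-per-VA ratio is only a figure often seen on consumer UPS spec sheets, and the 25% headroom is arbitrary. Use the figures from your own equipment and the UPS datasheet.

```python
def required_va(loads_watts: dict, watts_per_va: float = 0.6, headroom: float = 1.25) -> float:
    """Estimate a UPS VA rating from the total load in watts.

    watts_per_va: assumed ratio of the UPS's watt rating to its VA rating.
    headroom: assumed safety margin on top of the measured load.
    """
    total_watts = sum(loads_watts.values())
    return total_watts / watts_per_va * headroom


# Hypothetical loads in watts; measure or look up your own.
loads = {"PC": 350, "LCD monitor": 40, "router": 10}

print(f"Total load: {sum(loads.values())} W")
print(f"Suggested UPS size: about {required_va(loads):.0f} VA")
```

Adding the monitor and router to the dictionary is the point: anything you want kept alive through a dip has to be counted in the load.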
 

croc

Distinguished
BANNED
Sep 14, 2005
3,038
1
20,810


It should also be mentioned that a good UPS totally isolates your PC from the wall power, provides high voltage protection as well as fill-in power for under voltage situations. I'd rather lose my UPS during a lightning storm than my PC. But it needs to be a good one to really get all of the above benefits.
 

Zorg

Yup, that's why I posted that he should try to get a UPS with over and under voltage regulation. The cheap ones can be sadly lacking, but still significantly better than a surge protector.

The best UPSs run off of the batteries all the time by running the inverter full time. They are truly isolated, but unfortunately they are also obscenely expensive.
 

bberson

On the used market they're not expensive at all, particularly if you're patient enough to find one you can pick up locally since the shipping can be as costly as the UPS itself. Sometimes businesses just throw these things out when the batteries go, meaning you can have one for the price of keeping a keen eye on bulk refuse pickup day and purchasing a set of batteries.

Many years ago I picked up a 2200VA APC Smart-UPS via eBay for less than a hundred dollars and a half a gallon of gasoline, and its batteries were still in unexpectedly great shape too. It needs a new set of batteries every three to four years (I just put in a new set last year) but that's cheap insurance and this monster can easily power my entire home office with two servers, a couple of workstations, peripherals, networking equipment, even a laser printer which is really a no-no because of the fuser surges. When the batteries do go, smart shoppers buy Panasonic high-capacity batteries from the likes of Digi-Key (no affiliation, yada yada) instead of APC's pricey replacement packs, saving even more money and gaining at least 15% more run time if/when the power does go away.

-B
 

Zorg

Kudos on the good score.

All of my UPSs are free throw-aways due to dead batteries; unfortunately they are switching ones. I just run down to Battery Warehouse and pick up what's handy.