
"occasional" computer failure

October 18, 2009 12:47:48 PM

Hello.

I've had several problems over the past couple of weeks with my new PC.

notes: I am running Linux. The issue must be hardware-related seeing as I've tested many different setups, yet all had these issues.

specs:

XFX 512 GTS 250
Intel Q9550 775 2833 BOX1333 12M
Corsair CMPSU-520HX 520W ATX2
Creative X-Fi Titanium PCIe
Asus P5Q-Pro P45
Corsair 4GB 1066-555 Dual Channel
2x Western Digital 320 GB SAT2 WD3200AAKS
Lite-On DH-16D2S DVD drive

For the first few weeks everything was fine but then I started having freezes during boot sometimes.
That issue was solved by adjusting the clock and the voltage of my RAM (though it should've worked with my original settings; this is the thread: http://www.tomshardware.co.uk/forum/page-271449_13_0.html#t1995603).
Following that I formatted, and only a day later my PC froze while watching a video (I'd watched videos before that, even HD, without problems). I hit reset but it wouldn't boot, saying one of my partitions was damaged.

I formatted again and installed Windows XP to run Prime95 and FurMark to see whether my hardware would fail.
I ran them for hours and tried different settings as well. Both ran smoothly and I encountered no problems.

Again, I formatted, installed Linux again and for more than a week it ran as it's supposed to, then I got a kernel panic on boot, once. I hit reset and it booted normally. This did not happen again after that.

Then, yesterday, my PC completely froze while browsing the web. I hit reset but it wouldn't boot.
POST looked normal but once it started booting it would only display the first few lines of the normal booting procedure and then the screen would go black and it would reboot.
I let it continue to try to boot and after about 10 attempts it did boot, displaying some errors during boot (and the subsequent reboots). Once it failed to load some device (I couldn't see which one) and another time it failed to load my network.

Today, it booted normally and so far I haven't encountered any more errors.

Something must be half-broken since it can run without errors for more than a week, yet at some point it always fails seemingly at random.

Any ideas as to how I could figure out what's causing this?
October 18, 2009 2:00:22 PM

Pull all unnecessary parts. Use the onboard sound and pull the X-Fi. If you don't need both HDDs, unplug one of them. You might have to move all your data to one drive; if so, back the data up first rather than risk putting everything on one bad drive. If it's a RAM issue, we might be able to find it by pulling out one stick and running on the other. If the issues continue, swap the sticks and see. I'd start with the RAM myself.

If it's a PSU issue, it will be difficult to find.
October 18, 2009 2:47:46 PM

Thanks for your reply!

The problem with stripping the PC of all non-essential parts is that it won't yield any results.
I have rebooted a dozen times now but the errors have vanished. Everything is working normally and I can't reproduce the errors. So stripping my PC doesn't do me any good, which is my main problem. I just can't find out what's causing it because the errors appear seemingly at random and then vanish again, usually for more than a week, sometimes even longer.
October 18, 2009 3:20:20 PM

What's the chance it's linked to an environmental issue? Near a window, a moisture source like the bathroom, global warming?

I guess the best way to start would be to stress the RAM. Try memtest overnight.
http://www.memtest.org/
October 18, 2009 3:54:57 PM

It is semi-close to a window and the radiator, yet my old PC was in the same spot and never showed any errors.
Unlikely, I think.

The temperatures are all normal. I've checked them multiple times with multiple utilities on multiple OSs.

I've run memtest86+ for a combined total of about 15 hours, no errors. The longest single run was only 3 hours, though. Would running it overnight make a difference? If so, I'll try that.

Like I said, I ran Prime95 for hours as well and nothing happened, which strikes me as very odd, just like the errors seemingly vanishing for a period of time.
October 18, 2009 4:13:51 PM

Sounds like the hardware is fine; it may be a quirky PSU. Do you have a spare lying around?
October 18, 2009 7:08:15 PM

I do have another one in another PC but I am unsure whether that will be powerful enough (I'll have to check).

Though, my PC just froze again and this time it killed everything.
Here are some pictures of what happened:

http://img32.imageshack.us/img32/9501/dsc00060aj.jpg
I couldn't type anything here. After a while it displayed the kernel panic screen.

http://img148.imageshack.us/img148/227/dsc00064s.jpg
Same as above.

http://img40.imageshack.us/img40/3720/dsc00065bq.jpg
I found an option to check the LAN cable in the BIOS after last night's freeze and enabled it because I figured it wouldn't hurt. Then this showed up when I tried to boot, every time starting from reboot #3 after today's freeze.

http://img40.imageshack.us/img40/4828/dsc00066yz.jpg
Extended POST. No idea whether something is off, seems fine to me.

http://img40.imageshack.us/img40/7666/dsc00067xa.jpg
Errors during boot. Fsck would keep running, keyboard wouldn't work and at some point the kernel panic screen would be displayed.

http://img237.imageshack.us/img237/1307/dsc00072h.jpg
Another boot, keyboard input worked, stuck on starting network. Kernel Panic screen after a while.
No idea what happened to fsck there.

http://img42.imageshack.us/img42/7372/dsc00073p.jpg
Fsck passed as opposed to before. Stuck on starting network again.
Because of the LAN cable error during POST I pulled the cable, which made no difference, though.

http://img40.imageshack.us/img40/4318/dsc00075um.jpg
Here it finished booting and put me into a virtual terminal where I could log in.
It had done that the boot prior to this one too, yet it told me my username didn't exist (/home/eyescream was gone) so I rebooted and ended up in the virtual terminal again where I changed to my normal user and entered startx. The picture is the result.

http://img508.imageshack.us/img508/3790/dsc00076wf.jpg
After a few reboots I was able to log in _and_ have files in /var (sometimes it would be empty). Nano would work but using the arrow keys/page up/down wouldn't work. Used cat to try to find some helpful lines in some of the logs.

http://img267.imageshack.us/img267/1007/dsc00077i.jpg
Same as the one above.

http://img19.imageshack.us/img19/4884/dsc00079qu.jpg
This was one of the last boots. All partitions passed fsck again but the keyboard didn't work.

It seems to have killed my filesystems, or not. Sometimes they pass the check, sometimes they don't. Either way, I can only get into the terminal and not even there everything is working. Basically the last freeze destroyed my whole system. I don't think the pictures give any clue as to what's causing it but I figured I'll try my luck anyway.
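
The manual search through the logs with cat, described above, can be narrowed with a small filter. What follows is only a minimal sketch in Python: the keyword list, the function name `suspicious_lines`, and the idea that the logs are plain text are all assumptions, not something taken from this thread. It simply keeps the lines that mention typical kernel or filesystem trouble.

```python
import re
from typing import Iterable, List

# Hypothetical keyword list (an assumption, not from the thread);
# adjust it to whatever the kernel actually prints on the affected machine.
ERROR_PATTERN = re.compile(
    r"panic|oops|i/o error|ext[234]-fs error|machine check|segfault",
    re.IGNORECASE,
)


def suspicious_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that match any of the error keywords."""
    return [line.rstrip("\n") for line in lines if ERROR_PATTERN.search(line)]
```

To use it, open a log file (the exact path depends on the distribution) and pass the file object to `suspicious_lines`, then print the returned lines. Collecting these after every freeze makes it easier to see whether the same device or subsystem keeps showing up.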
October 18, 2009 7:41:56 PM

I had a hunch and turned off my PC for a while. 30 minutes later I turned it back on and as I'd thought no errors. It booted, showed the login. I logged in and my desktop appeared as usual and all my files are still there.
Now I am even more clueless as to what could possibly cause this.
October 19, 2009 12:45:01 AM

This is one of those cases where it can be just about anything. I'll offer up some ideas on how to test each part to see if we can isolate the issue. With those issues, I'm now leaning towards a trick HDD.

HDD - Put the drive into the other system. Should be easy enough since it's Linux and not Windows. See if the problems follow the drive.

RAM - Try just 1 stick at a time.

PSU - If the other PSU has 300 W, it should be fine for 2D testing. The GTS 250 shouldn't come close to drawing significant power until it's stressed.

CPU - If it's Prime95-stable, it's very unlikely to be the issue.

Mobo - Very real possibility here, but hard to determine until the other parts are ruled out.

I know it's time-intensive to troubleshoot a problem like this. Also, if you're thinking it might be a heat issue, pull the side off the case and see if it continues.
October 19, 2009 5:00:22 PM

Thanks for all the suggestions!

I've tested the RAM now. I did the following:
I removed the second stick and booted with the RAM settings that always made my system freeze several times (I also tried the other settings). It worked and didn't freeze.
Next, I put the same stick in the other slot and did the same as above. No freezes.
I tried the exact same with the second stick alone and didn't encounter any freezes either.
Then I put the first stick in the slot of the second stick and vice versa. No freezes.
Wondering why that would be, I put them back into their old slots and booted. No freezes.
This does not make sense to me at all. Yesterday (or the day before) after the freeze I changed my RAM settings to see whether it would still freeze and it did. Now it doesn't freeze with any of the settings I tried, even the setting that would ALWAYS (not only on cold boots) freeze my system.
I didn't try my other two slots (the unused ones) because there's no real point in that now.

Taking that into account (and the now also vanished freezes) I don't think it can be a hard drive issue. I've checked both my hard drives twice for bad sectors but they're fine. SMART checks out as well. My RAM seems to be fine too, as reported by memtest.
As you said, the CPU is very unlikely, as are heat issues. I've checked the temperatures multiple times; they were way up during the stress tests, but even that didn't crash or freeze the system.

I still have to check on the PSU but considering the RAM issues I don't think that's the case.
It looks like it's a mainboard issue and a very weird one at that. I'd hoped the RAM tests would yield at least some results as opposed to apparently nothing.