System lock-ups, most likely a broken GPU, but not sure.

emilemil1

Honorable
Sep 5, 2013
4
0
10,510
Summary: I've experienced increasingly frequent lock-ups in the last few days, and after some internet searches and tests I'm fairly certain that it's a problem with my GPU, but I'd like some opinions before ordering a new one, in case it might be something else. I'm not hoping to find a solution, just confirmation of which part needs replacing.

The full run-through of about three days of testing is below, in one glorious wall of text.


---

First of all, specs (bought and installed in January 2011):

MOBO - MSI P67A-GD55 REV B3
CPU - Core i5 2500K w/ standard cooler
GPU - ATI Gigabyte HD6950 2GB
RAM - Corsair 4GB (2x2) 1600/CL9/1.65V/XMS3
PSU - Corsair TX 750w
OS - Windows 7 Ultimate 64 bit (Primary)/Ubuntu 10.04 (Wubi) (Secondary)
+ a generic 1TB HDD and some dust

I have never overclocked or touched anything that would affect the hardware, aside from the common 6950 -> 6970 shader unlock through a BIOS flash, which has worked flawlessly for over two and a half years, at the cost of around 1-2 degrees of maximum load temperature.


---

Ok, so I'll go through what I've experienced and done so far.

First lock-up:

The problem first started two days ago when the system began locking up while playing Saints Row IV (after like 5-6 hours), without any previous hint of errors. The monitor went black, but stayed connected to the system (according to the status light), or in other words it didn't go into stand-by. After another second or two, the audio started looping, at which point the system completely stopped responding. My motherboard has indicator lights that light up/flash when components are working, and when this lock-up happened, every single light was constantly lit, as well as the light on my case (HAF 912) that displays drive usage.

I tried to make Windows shut down with the power button, but it obviously didn't work. Next I tried the button on my case that makes the system reboot, but doesn't turn off the power completely. This had the odd effect of making the system reboot successfully with Windows login sounds and everything, except the monitor stayed dark (but still connected/not in stand-by, just like during the lock-up). Finally I did a hard reset, which made the monitor function normally again.


Second lock-up:

I've had strange bugs and crashes happen before and then never happen again, so I assumed this was no different. I logged back on and started up League of Legends. A few minutes later I was in a game, and then the same lock-up happened again, this time maybe 15-20 minutes after logging in.

Since the problem repeated itself it can't be an oddity, and since I haven't installed any new hardware or drivers in the last month I assumed (hoped) that it was some kind of data corruption. So I tried booting into Ubuntu instead, which didn't work at all. The system locked up before I even got to the login screen.

Next I tried safe-mode, which I could use without issues for well over 3 hours. Ubuntu's equivalent safe-mode worked fine as well, which meant that the problem must be in either hardware or drivers. I wanted to confirm that the problem existed one last time, so I rebooted and logged into Windows again, but didn't do anything besides staring at the desktop. Nothing happened for 30 minutes of idling, so I started a Skype call and watched some YouTube, which I managed to do for about 8 minutes before the lock-up occurred again. I tested one more time with just Skype, thinking that it might be the video, but it locked up eventually anyway.


Re-install:

Rather than eternal troubleshooting, I opted to do a clean install of Windows. After doing so I set up a profile and logged in, and found that everything worked fine, though the resolution was low/stretched since I was running standard VGA drivers. There were no issues whatsoever when browsing, watching video or making Skype calls for extended periods of time.

So next I let Windows install every available recommended update (including SP1), rebooted, and everything still worked fine. Next I installed the latest non-beta video driver (Catalyst 13.4), and everything still worked... Until I turned on Aero, which spawned the lock-up yet again. After that I did some more testing and found that the same lock-up happens eventually regardless of which version of Catalyst I use, and it doesn't matter if I opt to not install CCC either. I've tried 13.4 (latest non-beta), 13.8 (latest beta), and a few 10.x-12.x drivers, but it made no significant difference. It's also not limited to Aero, as it would eventually happen anyway while doing regular tasks like browsing, looking through directories, etc. Activating Aero just spawns it instantly. Also, the lock-up seems to only occur when actually doing things, as it never happens while idling on the desktop.

Still I can do all this in safe-mode, or with the drivers uninstalled (running on standard VGA drivers instead) without issues. All of this is what makes me strongly believe that the 6950 is the culprit, but I'm still a bit hesitant. It's difficult to find cases online of the exact same issue, as many things can cause lock-ups and looped audio. I've found cases where the issue has been solved by a new video card, but also cases where replacing it did nothing, and instead it was the MOBO or RAM that was the issue (or it went unsolved).


More tests:

I tried some other things after that.

GPU-Z showed that somehow my video card's BIOS had been reverted from the flash I had done over two years ago. It was the 6950 -> 6970 "unlocking" flash that bumped the 6950 shader count from 1408 to 1536, but GPU-Z displayed that it was now reverted to 1408. Thinking that this might be the issue, I used the BIOS switch on the card to turn on the back-up BIOS, but all this did was to make the lock-up occur right after the Windows welcome screen, but without looping the audio (the login sound played fully even after the screen went black). Anyway, I switched it back again after that.

Then I ran the Windows Memory Diagnostic tool from boot, which didn't show any issues. Does that mean I can rule out RAM? (I have no idea how reliable that test is)

I also checked that all fans were working properly, and that there was no overheating going on, which there wasn't.

I tried to get some crash dumps for WhoCrashed to analyze, but apparently this lock-up doesn't create crash dumps.

The Event Viewer shows no errors, but sometimes it does show information entries right before the lock-ups. The entries vary, but for the last two lock-ups (spawned by trying to load webpages) it has been 6-9 of these right before it happened:

Source - amdkmdag
Category - DVD_OV
ID - 62464
Description - UVD Information

Not really informative, and I've read that these can spawn in thousands for no reason, without causing any issues whatsoever. It also seems that these only spawn when using older video drivers, which I am using at the moment (I think 10.1).

Lock-ups from trying to activate Aero, or trying to get past the welcome screen with the back-up BIOS, or simply looking through directories, don't seem to cause any remarkable entries, only various services starting/stopping such as Windows Update.
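
For anyone who wants to repeat the Event Viewer check, here is a minimal sketch of how the entries logged shortly before a lock-up could be pulled out of an exported log. It assumes the log has been saved as CSV with columns named time, source and event_id; those column names, the entries_before helper and the sample rows (including the timestamps) are made up for illustration and are not from my actual log.

```python
import csv
import io
from datetime import datetime, timedelta


def entries_before(csv_text, lockup_time, window_seconds=60):
    """Return (timestamp, source, event_id) for rows logged in the window before lockup_time."""
    window_start = lockup_time - timedelta(seconds=window_seconds)
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed timestamp format; adjust to whatever the export really uses.
        logged = datetime.strptime(row["time"], "%Y-%m-%d %H:%M:%S")
        if window_start <= logged <= lockup_time:
            hits.append((logged, row["source"], row["event_id"]))
    return hits


# Invented sample rows, only to show the shape of the input.
sample = """time,source,event_id
2013-09-05 20:10:00,Service Control Manager,7036
2013-09-05 20:14:31,amdkmdag,62464
2013-09-05 20:14:33,amdkmdag,62464
"""

# Hypothetical time of the lock-up.
lockup = datetime(2013, 9, 5, 20, 14, 40)

for logged, source, event_id in entries_before(sample, lockup):
    print(logged, source, event_id)
```

Widening window_seconds would show whether the same source keeps appearing before every lock-up or only before some of them.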


---

And that's about it.

Before I go to order new parts, I figured I'd ask here for some advice. Maybe there is something I could try that I haven't thought of, aside from temporarily replacing GPU/RAM/etc to see if the problem goes away, which I will of course also do once I find a kind friend who will let me borrow his parts/let me put my parts in his computer. I'll make sure to update here when I do.

It's almost certainly malfunctioning hardware, right? As a Windows clean install would have solved the issue otherwise, would it not? I'm expecting to buy a new part (or many, if it's the motherboard), but some help in nailing down which part that is would be nice. My guess is GPU, but as stated above, I've read about people with nearly identical issues who have replaced their GPU and still had the issue, and I don't feel like making that expensive mistake myself. Though hopefully testing with someone else's card will rule out that possibility.

Any input is welcome, and thanks beforehand!
/Emil
 

emilemil1



Yeah, I was planning on doing that when I find someone with a rig I can borrow. The people I've asked so far have been hesitant; they think my faulty component may damage their components, or corrupt their OS because of the lock-up.

I'll convince someone eventually, I hope...
 

emilemil1

Ok, so tomorrow I'll be able to test the GPU in someone else's computer. In the meantime I've installed the 13.4 driver and done some more investigation, just because I've got nothing else to do.

It will not lock up while on the desktop, or while browsing folders/navigating the control panel, at least with this driver. I did this for around 40 minutes, so I think it's safe.

It will instantly lock up when trying to load a webpage (or an empty new tab) in any browser, as well as when trying to activate Aero. I suppose all of these use hardware acceleration, which is why they don't work. This would make sense, as all of my previous lock-ups have been while using the web, or in a game (even in the Skype-only case, the browser was running). I can even start the browser minimized, which will delay the lock-up until I maximize it again.

CCC also seems to be malfunctioning. I've re-installed it several times, both with and without first using the AMD cleanup utility, and from both safe-mode and non-safe-mode. In any case the shortcuts in the All Programs menu all have blank icons, and CCC won't start automatically on login. If I try to use the shortcut, the processes all start (or start and immediately close, depending on how I did the re-install), but the service never actually kicks in. There is no icon in the tray, and no window ever appears. The processes also seem to go inactive after a while, judging from the memory and CPU usage. In some install cases, trying to start "MOM.exe" separately gives an error saying something about side-by-side configuration being wrong.

Everything points towards GPU malfunction, but I guess it could still be the MOBO. I've pretty much ruled out the PSU since the lock-ups aren't related to system load, and also the RAM because I've run memtest (2 passes) with 0 errors, as well as tried running with each stick on its own, without any difference.
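
The elimination above can be written down as a small sketch, just to make the reasoning explicit. The mapping of each test to the part it seems to clear is my own reading of the results, not a hard rule, and the names (SUSPECTS, CLEARED_BY, remaining_suspects) are made up for illustration.

```python
# Parts (plus the OS install) that could plausibly cause the lock-ups.
SUSPECTS = {"GPU", "MOBO", "RAM", "PSU", "OS install"}

# Each observation, mapped to what I assume it clears.
CLEARED_BY = {
    "clean Windows install, lock-up persists": {"OS install"},
    "memtest, 2 passes, 0 errors": {"RAM"},
    "each RAM stick alone, no difference": {"RAM"},
    "lock-ups not related to system load": {"PSU"},
}


def remaining_suspects(suspects, cleared_by):
    """Return the suspects that no observation has cleared, sorted by name."""
    cleared = set().union(*cleared_by.values())
    return sorted(suspects - cleared)


print(remaining_suspects(SUSPECTS, CLEARED_BY))  # ['GPU', 'MOBO']
```

Nothing I've done so far separates the GPU from the MOBO, which is why swapping the card into another machine is the test that matters.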

That's that. I'll be running my GPU in my friend's PC tomorrow, which hopefully will give a result. If it doesn't, I guess I'll try to find another GPU to put in my PC, which I assume is the next step in determining if the GPU is the culprit.

Sorry for the walls'o'text, I suck at being concise.
/Emil