WHEA_UNCORRECTABLE_ERROR with Unknown Cause, Possibly Graphics?

ryry7

Prominent
May 23, 2017
3
0
510
Recently started getting semi-frequent BSoD errors on the latest version of Windows 10 with the stop code "WHEA_UNCORRECTABLE_ERROR". They seem to occur most often when playing video games.

My PC has an i7 6700K (stock clock), an ASRock Z170M Pro4S mobo with default settings, 16GB Corsair DDR4 memory, an Nvidia GTX 970 and a Corsair 650W PSU.

I don't overclock my CPU, and I've tried resetting my motherboard settings to default as well. I ran chkdsk to fix disk errors. I also ran Windows Memory Diagnostic, which found no errors, and memtest86 overnight, which also found no errors. I also did a full removal of my graphics drivers with Wagnard's removal tool; when I then tried to reinstall the drivers, the computer BSoD'd again.

A lot of people seem to suggest it's a CPU issue, but I ran the Intel Burn Test on Standard, High and Very High, and the temperature never got above 81 C during the Very High tests, nor did it ever crash. The CPU temps sit around 20-30 C at idle and 40-50 C under load, such as playing a game (and around 70-80 C during the Intel Burn Test).

I also ran the Intel Processor Diagnostic Tool and it also came back all clear.

Interestingly, this morning when I tried to boot (after the overnight memtest86 run) the PC couldn't even get past POST: no beep, just a black screen and whirring fans. I then removed the graphics card and the computer was able to boot.

I have since cleaned out the PCI-e slot and the graphics card itself and reseated the card, and the computer now boots (after I got it to boot I also reset the CMOS for good measure). However, leaving it in the Overwatch menus for half an hour resulted in a crash. Possibly important: the framerate in the main menus went from 60-70fps to about 20fps around a minute before the crash, which strikes me as strange.
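If the frame rate is logged once per second (most overlay or benchmark tools can write such a log), a drop like this can be pinpointed instead of eyeballed, which makes it easier to line up with a sensor log. A minimal sketch, assuming the log has already been read into a plain list of FPS values; the 10-second window and the 50% threshold are arbitrary choices, not anything from this thread:

```python
from statistics import mean

def find_fps_drop(fps_samples, window=10, drop_ratio=0.5):
    """Return the index of the first per-second FPS sample at which the
    trailing `window`-sample average falls below `drop_ratio` times the
    average of the first `window` samples (the baseline), or None."""
    if len(fps_samples) <= window:
        return None  # not enough data to set a baseline and compare against it
    baseline = mean(fps_samples[:window])
    for i in range(window, len(fps_samples)):
        recent = mean(fps_samples[i - window + 1 : i + 1])
        if recent < drop_ratio * baseline:
            return i
    return None

# Hypothetical log: 30 s at 65 fps, then a sudden drop to 20 fps.
samples = [65] * 30 + [20] * 30
print(find_fps_drop(samples))  # prints 37
```

Matching the returned index to the timestamps in a GPU sensor log shows what temperature, clocks and power were doing when the frame rate collapsed.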

I've also attached a minidump from a crash yesterday if any of you wish to look at it. Looking at it myself, it's just a standard WHEA error: bugcheck 0x124 with parameter 1 = 0x0, i.e. a machine-check exception. Maybe one of you will see something there that I don't, though: https://cl.ly/27151X2m013P/052317-3750-01.dmp
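For anyone decoding this kind of dump: Microsoft documents bug check 0x124 with parameter 1 = 0x0 as a machine-check exception, with parameters 3 and 4 holding the high and low 32 bits of the MCi_STATUS register of the machine-check bank that logged the error. The sketch below rebuilds that 64-bit value and decodes it following the IA32_MCi_STATUS layout in Intel's Software Developer's Manual, with only a coarse classification of the error code. The parameter values in the example are invented for illustration and are NOT from the attached dump; the real ones appear in the `!analyze -v` output from WinDbg.

```python
# Architectural flag bits of IA32_MCi_STATUS (Intel SDM Vol. 3B, machine-check architecture).
FLAG_BITS = {
    63: "VAL (register contents valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MCi_MISC valid)",
    58: "ADDRV (MCi_ADDR valid)",
    57: "PCC (processor context corrupt)",
}

def classify_mca_code(code):
    """Rough category of a compound MCA error code (bits 15:0).
    Bit 12 is the correction-report filtering bit, so it is masked out first."""
    c = code & 0xEFFF
    if (c & 0xF800) == 0x0800:
        return "bus/interconnect"
    if (c & 0xFF00) == 0x0100:
        return "cache hierarchy"
    if (c & 0xFF80) == 0x0080:
        return "memory controller"
    if (c & 0xFFF0) == 0x0010:
        return "TLB"
    if (c & 0xFFFC) == 0x000C:
        return "generic cache hierarchy"
    return "simple or other"

def decode_mci_status(param3, param4):
    """Rebuild MCi_STATUS from bug check 0x124 parameters 3 (high 32 bits)
    and 4 (low 32 bits) and decode its main fields."""
    status = ((param3 & 0xFFFFFFFF) << 32) | (param4 & 0xFFFFFFFF)
    mca_code = status & 0xFFFF  # bits 15:0, architectural error code
    return {
        "status": status,
        "flags": [name for bit, name in FLAG_BITS.items() if (status >> bit) & 1],
        "mca_code": mca_code,
        "model_code": (status >> 16) & 0xFFFF,  # bits 31:16, model-specific code
        "category": classify_mca_code(mca_code),
    }

# Made-up example parameters, not taken from the attached dump.
info = decode_mci_status(0xBE000000, 0x00000E0B)
print(hex(info["status"]), info["category"])
print(info["flags"])
```

A UC plus PCC result only says the CPU's machine-check hardware saw an uncorrectable error; the bank and error category narrow down where it was reported, not necessarily which component caused it.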

My suspicions lie with the graphics card for the reasons stated above, but it could also be coincidental timing. I'm currently getting GPU-Z to log the graphics card's sensors while I wait for another crash; generally its temps sit around 60 C under 90% load while playing. Can anyone offer me some help or advice? I'm really stumped by this one.

Update: Although CPU temps are generally okay, I just ran Prime95 for a few minutes and temps quickly hit 90-93 C on the CPU cores. Could this be an overheating problem even though temps are normally okay?
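To put those numbers against the chip's limit: Intel lists a maximum junction temperature (TjMax) of 100 C for the i7-6700K, and the CPU starts throttling itself once it gets there. A quick sketch that just does the subtraction for the ranges reported so far in this thread:

```python
TJ_MAX_C = 100  # Intel's listed maximum junction temperature for the i7-6700K

# Temperature ranges (C) reported earlier in this thread.
READINGS_C = {
    "idle": (20, 30),
    "gaming": (40, 50),
    "Intel Burn Test": (70, 81),
    "Prime95": (90, 93),
}

def headroom(readings, tj_max=TJ_MAX_C):
    """Map each workload to (smallest, largest) margin below TjMax, in C."""
    return {name: (tj_max - high, tj_max - low) for name, (low, high) in readings.items()}

for name, (least, most) in headroom(READINGS_C).items():
    print(f"{name}: {least}-{most} C below TjMax")
```

Only the Prime95 run comes within about 10 C of the limit; under the gaming loads where the crashes happened, the CPU was still 50 C or more below it.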
 

gardenman

Splendid
Moderator
Hi, 90+ does seem a bit hot for this CPU according to the following:
http://www.tomshardware.com/answers/id-2810472/cpu-temp-range-6700k.html
http://www.tomshardware.com/answers/id-2965263/6700k-normal-load-temps.html

That doesn't necessarily mean that is what is causing the crash.

I downloaded the dump file, opened it with the debugger from the Windows 10 Kit and pulled this info from it: https://pastebin.com/UnHMi2BJ (I left a couple of comments in the text).

I did see a mention of an Nvidia file in the dump.

I can't help you much more than this. Maybe someone else will come along and help.
 

ryry7

Thanks for the help though!
So I took the whole thing apart, bought some Arctic Silver 5, cleaned off the old thermal compound that came with my Cryorig cooler and made a new application. CPU temps under Prime95 are significantly lower now, by 10-15 C or more, and the paste is still in its break-in period, so I expect another 2 or so degree drop after some time.

So the blue screens seem to have stopped, even though temps generally never got high enough to cause issues. However, something really worrying has come to my attention. I mentioned extreme frame rate drops before some crashes; well, they're now happening very frequently, but without crashes. Games suddenly drop to 1-10 fps, the audio buzzes, and Windows reports the headphones disconnecting and reconnecting. The whole computer becomes extremely slow until all graphics-intensive applications are closed.

Is it possible this is a PSU issue, with the PSU not supplying enough power to the graphics card and causing all sorts of system trouble? Or even the graphics card trying to pull too much power? Any ideas how I can diagnose such a thing? I'll try to get a video of the issue and post it here.
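On the first question, a rough power budget suggests raw capacity isn't the problem. Intel rates the i7-6700K at 91 W TDP and Nvidia lists 145 W graphics card power for a reference GTX 970; the 75 W allowance for the motherboard, RAM, drives and fans and the 80% comfort margin below are assumptions, not measurements. TDP is not the same as peak draw, and a healthy total can't rule out a faulty or degraded PSU, so this only narrows things down:

```python
# Published TDP figures; "rest_of_system" is a rough assumption, not a measurement.
COMPONENT_WATTS = {
    "i7-6700K (TDP)": 91,
    "GTX 970 (reference board power)": 145,
    "rest_of_system (assumed)": 75,
}

def power_headroom(components, psu_watts, comfort=0.8):
    """Return (estimated load in W, load as a fraction of the PSU rating,
    whether that fraction is within the comfort margin)."""
    total = sum(components.values())
    fraction = total / psu_watts
    return total, fraction, fraction <= comfort

total, fraction, ok = power_headroom(COMPONENT_WATTS, 650)
print(f"~{total} W of 650 W ({fraction:.0%}), within margin: {ok}")
# prints "~311 W of 650 W (48%), within margin: True"
```

If the budget looks fine but the symptoms persist, testing with a known-good PSU is the more direct check than any calculation.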

Update: Looking at GPU-Z while it's happening and while it's *not* happening, the big difference is the graphics card's TDP reading: when the frame rate gets really low and the computer starts to struggle, the TDP is 10-20% lower than it normally is while playing games at full speed. Any ideas what an abnormally low TDP means?
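To compare the "slow" and "normal" periods more precisely than by watching the sensor tab, the GPU-Z log can be scanned for samples where the power reading sits well below its usual level. A minimal sketch, assuming a comma-separated log whose header contains a timestamp column and a power-as-%-of-TDP column; the column names below are guesses modelled on a GPU-Z-style log and should be checked against the header of your own file. The 10% threshold matches the lower end of the drop described above.

```python
import csv
import io
from statistics import median

# ASSUMED column names; check the header line of your own sensor log and adjust.
TIME_COL = "Date"
POWER_COL = "Power Consumption (%) [% TDP]"

def low_power_samples(log_text, drop=0.10, time_col=TIME_COL, power_col=POWER_COL):
    """Return (timestamp, power %) rows whose power reading sits more than
    `drop` (as a fraction) below the median power of the whole log."""
    reader = csv.reader(io.StringIO(log_text))
    header = next(reader, None)
    if header is None:
        return []  # empty log
    header = [h.strip() for h in header]
    t_idx, p_idx = header.index(time_col), header.index(power_col)
    rows = []
    for fields in reader:
        if len(fields) <= max(t_idx, p_idx):
            continue  # skip blank or truncated lines
        try:
            rows.append((fields[t_idx].strip(), float(fields[p_idx])))
        except ValueError:
            continue  # skip lines where the power field is not a number
    if not rows:
        return []
    typical = median(p for _, p in rows)
    return [(t, p) for t, p in rows if p < typical * (1 - drop)]
```

To run it on a real log, read the file into a string first (for example with `open(path, encoding="utf-8", errors="replace").read()`, where `path` is your log file). Lining the flagged timestamps up against time since boot would also show whether the episodes really cluster in the first few minutes after a cold start.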

It seems to happen after a cold boot when the PC has been shut down for some time, and the problem seems to fix itself after about 15 minutes. Any ideas?
 

gardenman

I'm glad you got the overheating and BSOD fixed.

I probably can't help much, considering I'd never even heard of TDP until this post and had to look it up. I did find a few pages which may or may not point you in the right direction.

Someone was having a similar problem. It seems others fixed it by installing a new GPU driver or reinstalling their current one: https://forums.geforce.com/default/topic/962449/geforce-1000-series/gtx-1070-periodic-fps-drops-low-gpu-usage/4/

Another page: http://www.tomshardware.com/answers/id-1730974/masive-fps-drops-core-shader-clock-drop.html
Quote: "the cards overheating and its throttling itself to prevent burning out, or your going over the power limit and again its throttling itself to keep inside a TDP limit".

Also interesting: http://www.tomshardware.com/answers/id-2658177/gtx-970-lag-spikes-fps-drop.html

Last one: http://www.tomshardware.com/answers/id-2602498/gpu-usage-drops-gtx-970.html

I'm not sure whether any of those 4 links will help. It may be a driver problem, a PSU problem or a GPU problem.

I had to RMA my EVGA Nvidia GTX 770 card a few months back because one of the fans came off. It was still under warranty, but just barely. I think it had a 3 year warranty. I'd say your 970 is likely still under warranty (if that turns out to be the problem). I'd contact the maker of the GPU and see if they can help.