WHEA_UNCORRECTABLE_ERROR / CLOCK_WATCHDOG_TIMEOUT after OC

Candan

Honorable
Jul 27, 2014
237
1
10,715
Hi all.

Specs.
Intel i7 6700k with CM Hyper 212 EVO Air cooler.
MSI Z170A Gaming M5 A mobo
G.Skill Trident Z DDR4 3200 2X8GB
Samsung SSD Evo 250g with Win 10
EVGA 4GB GTX 980

I overclocked the CPU today for the first time. Nothing out of hand: pushed it to 4.5GHz at 1.325v.

Ran ROG Realbench. Left it running for a couple of hours and it seemed OK. When I returned to the machine I had a black screen and the PC was frozen. Shortly after rebooting I got a BSOD: WHEA_UNCORRECTABLE_ERROR. Another reboot shows the same message or CLOCK_WATCHDOG_TIMEOUT.

Never exceeded 65C. Sometimes after a crash and reboot it will crash and hang during bios splash, sometimes windows would boot but it hangs after just a few minutes or seconds. Not enough time to run a sfc /scannow or grab a memdump.

Any suggestions?

Also, and really strange, the pc is connected to the Internet with a hardwired ethernet cord. As a result of this issue today if the ethernet is connected to the pc, even in bios, it cuts the Internet to the entire house on either WiFi or ethernet connections. When I shut the pc down and reboot the router everything is ok with the network again, until the pc is reconnected. How could this be?

Thanks for any assistance that can be offered!
 
When you get your bugcheck 0x124 BSOD (WHEA_UNCORRECTABLE_ERROR), check parameter 1.
If it is 0, the error was a machine check exception, i.e. the CPU called the bugcheck.
If it is 4, the bugcheck came from an uncorrectable error on the PCI/e bus.
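
To make the parameter 1 check concrete, here is a minimal Python sketch. It only covers the two values mentioned above; the names `WHEA_PARAM1` and `describe_whea_source` are made up for illustration, and other values are deliberately reported as not covered rather than guessed.

```python
# Bugcheck 0x124 (WHEA_UNCORRECTABLE_ERROR): meaning of parameter 1.
# Only the two values discussed in this thread are listed here.
WHEA_PARAM1 = {
    0x0: "machine check exception (raised by the CPU)",
    0x4: "uncorrectable PCI Express error",
}


def describe_whea_source(param1: int) -> str:
    """Return a short description of what raised bugcheck 0x124."""
    return WHEA_PARAM1.get(param1, f"value {param1:#x} not covered here")


if __name__ == "__main__":
    for value in (0x0, 0x4, 0x3):
        print(f"parameter 1 = {value:#x}: {describe_whea_source(value)}")
```

Parameter 1 is the first of the four bugcheck arguments the debugger prints for the crash.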

If you are getting a watchdog timer going off, I would expect it to be a timing issue between the PCI/e bus and the GPU.
If you have an overclocked GPU, you might need to slightly overclock the PCI/e bus (103 MHz) to get the electronic timing signals to work out.

put your minidump on a server and post a link. I can look to see what subsystem called the bugcheck.

 

Candan



Forgive my ignorance. I'll post the dump later when I'm home.

I haven't touched the memory (other than default XMP) or the GPU clocks, so I don't see why that would call the error.

Incidentally, I reduced the voltage to 1.25v and ran a single core only and had no issues. As soon as I enabled all cores again I got the BSOD immediately. I tried again, increasing vcore in small increments up to about 1.35v, and had the same issue.

I should have mentioned that I have also stopped any OC and am running at stock speeds again.

 
If you have an Nvidia card with ShadowPlay running, you will want to make sure your network driver is updated.
- Some old drivers have bugs that cause the GPU driver to lag behind.
- You might also look for an updated audio driver for your motherboard. Outdated ones can conflict with the GPU high-definition sound driver (which also causes the GPU driver to lag).

Both of the above could generate a watchdog timeout (as could any CPU or GPU overclock).
Also, the PCI/e "bus" services other hardware like the USB ports. Old USB 3 or USB 2 chipset drivers, or old drivers for devices on the ports, can cause it to be slow and mess up your PCI/e bus timing.

Generally though, if you are getting a bugcheck 0x124 you want to look at the system uptime and run the !errrec command in the Windows debugger to see why the bugcheck was called.





 

Candan



Hi johnbl. Here's my last minidump. Please have a look and see what you think and I will continue with some of your other suggestions in the meantime. Thanks!

https://onedrive.live.com/redir?resid=C234FEABB175904!16059&authkey=!AIiQ4Rldgm37OTs&ithint=file%2cdmp

 
I am getting a message that says that the onedrive server is down.
"Sorry, something went wrong
Our server is having a problem. We're working to fix it as soon as we can, so try again in a few minutes."

will try back in a little bit



 

Candan



Try dropbox... thx! https://www.dropbox.com/s/h6qud2c4e4r4drw/061316-5718-01.dmp?dl=0
 
Reset the BIOS, update all the motherboard drivers, and see below for anything that might be useful.
Then reboot and see if you can still get a bugcheck.

Note: the USB chipset drivers could also cause this. You have an updated BIOS and updated Microsoft binaries, but the motherboard drivers for the USB were old (2015). They should be updated when the BIOS is updated.
I.e. update the CPU chipset drivers as well as the ASMedia USB 3 drivers.
---------
For this type of bugcheck you have to provide a kernel memory dump; the debug info is stripped out of a minidump.
here are instructions on how to change the memory dump type (select kernel rather than full memory dump)


-------------------
here is the info on the bugcheck:
memory dump shows
BugCheck 101 CLOCK_WATCHDOG_TIMEOUT

the system thinks processor 8 is hung.

I would guess plug and play is installing some driver and the driver install is failing for some reason.
I have seen this happen with network drivers, where one core is installing the driver and another core is trying to use it, but the install on the first core fails and retries over and over. After some timeout the second core thinks the first core is not responding and calls the bugcheck.
Your system was up 24 seconds (which is fast for a timeout period).


I would reset the BIOS to defaults to remove any overclock and clear the hardware database of settings.
You will want to remove your overclock software:
MSI Afterburner NTIOLib_X64.sys

Find out why this driver is loaded:
\SystemRoot\system32\DRIVERS\ASUSSC150.sys Sun Aug 16 18:51:16 2015
It looks like an Asus motherboard driver, but you have an MSI motherboard.
Must be the driver for an Asus Strix sound card?
Check with Asus for an update. This could be the driver that fails to install. Once you have this card in your machine, be sure to reset the BIOS to defaults, or toggle any setting off and back on, to get the BIOS to rescan the hardware and rebuild the database of hardware settings that it passes to Windows. An add-in sound card, GPU sound driver and motherboard sound driver can produce a lot of conflicts that have to be worked out. Some of these cards also have firmware that has to be updated as well as the driver for the card.

You will want to update the Windows 10 drivers for your motherboard (all of them).
Your BIOS looks current.
https://www.msi.com/Motherboard/support/Z170A-GAMING-M5.html#down-driver&Win10 64

machine:
BIOS Version 1.90
BIOS Starting Address Segment f000
BIOS Release Date 05/11/2016
Manufacturer MSI
Product Name MS-7977
Version 1.0
Manufacturer MSI
Product Z170A GAMING M5 (MS-7977)
Version 1.0
Processor Version Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Processor Voltage 8ch - 1.2V
External Clock 100MHz
Max Speed 8300MHz
Current Speed 4000MHz

memory:
Bank Locator BANK 1
Memory Type 1ah - Specification Reserved
Type Detail 0080h - Synchronous
Speed 3200MHz
Manufacturer 0420
Serial Number
Asset Tag Number
Part Number F4-3200C16-8GTZB
G.SKILL Trident Z DDR4 3200 C16 2x8GB (but I only see one bank installed not two, just fyi)


 

Candan



- Reset Bios (JBAT jumper)
- Updated all drivers from MSI site.
- There are no updated Asmedia USB drivers. Had Windows look too and it said they were the newest.
- Uninstalled MSI Afterburner.
- Asus (ASUSSC150) Drivers are the Drivers for my Asus Strix SoundCard. Installed current version anyway in case of corruptions.
- Yes. Only one bank of Mem was installed as I swap matched the two sticks each to see if this was causing the issue. I'll put both sticks back in when I'm all fixed!

I'll see how all this does and report back. Thanks again johnbl
 

Candan



OK. No change. Here's my dump attached. Pls have a look.

http://

Things I'm seeing...

Using CPU single core, I can use the network driver to connect no problem.

As soon as I boot up with all cores, I lock up with the same issue. WHEA_UNCORRECTABLE_ERROR.

At this time it completely kills internet access to the entire house (both wired and wifi connections) as well as disconnecting all connections. I have to power down this PC and reboot the router to allow the others to reconnect. I feel it must be related to your point about the cores trying to wait and then timing out. Especially as it locks up seconds after reboot.

Should I try disabling the network adapter and see if that fixes it? If it does, I guess I then just use a barebone network adapter?

It still doesn't seem logical to me that I'd have this issue just after I OC'd and never had it before...

Thanks again.
 
(Kind of looks like an overclock voltage/overheating problem. System was up 1 min 3 seconds.)

The second one was a bugcheck 0x124 called by the CPU because:
Error : Internal timer (Proc 3 Bank 4)

overclocking software is still running:
NTIOLib_X64.sys

You need to remove this because it can tweak the voltages to the CPU incorrectly and can cause this particular problem.
Also, it is not good to debug software when the machine is overclocked. The settings read from the BIOS will not match whatever the driver ended up setting them to. You can add the overclock software back after you are done testing.

Also, you should see if you can make a kernel memory dump rather than a minidump.




 

Candan

Wow. I did Uninstall Afterburner. I'll go back in and remove NTIOLib_X64.sys in a day or two. I'm leaving for business for a couple of days and I'll look then.

Needless to say, the PC crashed again earlier, without the network driver installed and with the adapter turned off in the BIOS. So at least I know now it's not that.

Since the voltage is set to auto and I've not OC'd, why is it calling for a voltage surge? Can't be the PS. It's fine in single core.

But I do seem to be in a deadlock somewhere.

How do I pull a kernel dump?
 
Something in the BIOS is trying to overclock.
The overclocking software will not help. You should check its date: some of the old versions don't know about the new low-voltage processors, and even when they do not overclock they apply too much voltage to the CPU. The voltage and clocks are used to synchronize data transfer between the layers of cache memory inside the CPU. When they are incorrect you get data corruption in the cache, the memory controller detects it, and the CPU does a panic shutdown.

Also, when you have a new PC, all the BIOS vendors have to pick up microcode patches for their motherboards. Windows tries to patch the code with system32\mcupdate_GenuineIntel.dll (Thu Oct 29 19:42:26 2015). This is to fix various bugs in the CPU, but it will not fix the voltage table settings in the BIOS; the motherboard vendor would have to do that for each motherboard. They also tune it over time to get better settings over the life of the product. Initially they set aggressive timings so they can get good benchmarks, then step them down later during production so the motherboard is more stable and they do not get so many returns.


how to change memory dump type: (use kernel rather than full)
https://www.sophos.com/en-us/support/knowledgebase/111474.aspx
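
If you would rather check the current dump setting than click through the dialog, here is a small Python sketch. It assumes the setting lives in the `CrashDumpEnabled` value under `HKLM\SYSTEM\CurrentControlSet\Control\CrashControl` and that the listed numbers map to the dump types shown; treat the table as an assumption to verify on your own system. The function names are made up for illustration.

```python
import sys

# Assumed meaning of the CrashDumpEnabled registry value under
# HKLM\SYSTEM\CurrentControlSet\Control\CrashControl
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (minidump)",
    7: "automatic memory dump",
}


def describe_dump_type(value: int) -> str:
    """Translate a CrashDumpEnabled value into a readable name."""
    return DUMP_TYPES.get(value, f"unrecognised value {value}")


def read_dump_type() -> int:
    """Read the current setting from the registry (Windows only)."""
    import winreg  # standard library module, present only on Windows

    key_path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        value, _value_type = winreg.QueryValueEx(key, "CrashDumpEnabled")
    return value


if __name__ == "__main__":
    if sys.platform == "win32":
        print("Current setting:", describe_dump_type(read_dump_type()))
    else:
        print("Not Windows; example value 2 means:", describe_dump_type(2))
```

The sketch only reads the value. Changing the dump type is still best done through the dialog described in the link above.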

It is strange. I would think that after you selected single core the driver would not install and would just go into a loop, trying over and over again and failing each time. With two cores, the driver install runs on the first core and fails while the second core attempts to run code that requires the driver. It is the second core that thinks the first core stopped working, and the second core calls the bugcheck. The first core thinks everything is OK and just retries over and over, running code as fast as it can.
(Anyway, this is just a guess based on other cases I have looked at. It could just be a simple timing problem in the electronics that would be resolved by removing the two overclocks: one in the BIOS, the other in the overclock driver.)



 

Candan

Interesting. So from what I've read in the link on how to dump, I set it up and wait for the next crash? It's no good me dumping now, I need to wait for the next crash, correct? So I set it up with the PC running on a single core, then reboot with all cores, wait for the next crash and send the dump then?
 
Yep, post the next memory dump. Note that kernel memory dumps are stored in a different location and are much bigger than a minidump file. They are written to the c:\windows\memory.dmp file.

Sometimes with a bugcheck 0x124 the CPU will not have enough time to write a kernel memory dump, but I think it might work in your case.



 

Candan

Ok. That NTIOLib_X64.sys file appeared in 3 places on my C drive.

All in MSI folders.
In bios setup files (now deleted)
In MSI live update folder. (Now uninstalled and deleted)
In MSI bios flash utility (Now uninstalled and deleted)
All these folders had the 32bit version of the same file and all were deleted too.

But how would you like my memory dump, since it's 16GB? At least if I remove a stick I can get it down to 8GB. But still a massive file!
 
You can do a kernel dump, which would be smaller. Or you can zip the full memory dump; it should be mostly empty space.
You can also download and run rammap.exe and clear the working sets. That clears the list of programs that Windows preloads into memory before you actually want to use them. Then when you do a full memory dump there will be a lot fewer items in memory and the memory dump will be smaller.




 

Candan



Here's the compressed dump. https://www.dropbox.com/s/cip2uzmymcnsj1f/MEMORY.rar?dl=0

Things I did.... Removed video card and uninstalled all signs of Nvidia drivers, just to see if it would help. I thought it was working as it stayed alive for longer than it had before. Alas, it failed after just a bit longer.

Pls let me know what you think!

Thanks!
 
what does your BIOS think your CPU temps are?
---------------
looks like a problem was coming from
"PCI\VEN_8086&DEV_A131&SUBSYS_79771462&REV_31\3&11583659&0&A2"
which is listed as:
Intel(R) 100 Series/C230 Series Chipset Family Thermal subsystem
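
For anyone wondering how to read that hardware ID string, here is a minimal Python sketch that splits it into its fields. It assumes the usual `VEN_`/`DEV_`/`SUBSYS_`/`REV_` layout, with `SUBSYS_` holding the subsystem ID followed by the subsystem vendor ID; the function name is made up for illustration.

```python
import re

# Hardware ID copied from the dump analysis above.
HARDWARE_ID = r"PCI\VEN_8086&DEV_A131&SUBSYS_79771462&REV_31\3&11583659&0&A2"

PATTERN = re.compile(
    r"PCI\\VEN_(?P<vendor>[0-9A-F]{4})"
    r"&DEV_(?P<device>[0-9A-F]{4})"
    r"&SUBSYS_(?P<subsys_id>[0-9A-F]{4})(?P<subsys_vendor>[0-9A-F]{4})"
    r"&REV_(?P<revision>[0-9A-F]{2})",
    re.IGNORECASE,
)


def parse_pci_hardware_id(text: str) -> dict:
    """Split a PCI hardware ID into vendor, device, subsystem and revision."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a PCI hardware ID: {text!r}")
    return {name: value.upper() for name, value in match.groupdict().items()}


if __name__ == "__main__":
    for field, value in parse_pci_hardware_id(HARDWARE_ID).items():
        print(f"{field:14} {value}")
```

Here vendor 8086 is Intel and device A131 is the thermal subsystem named above. The subsystem fields (7977 and 1462) point at the board itself: 1462 is MSI's vendor ID, and 7977 lines up with the MS-7977 product name reported for this motherboard.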

Is the CPU being cooled, are the fans running? Does the BIOS have the correct setting for thermal shutdown?

Do you have the proper Intel chipset drivers installed? (If you can even boot.)
-----------
You have some unknown hardware on your PCI bus that does not have a driver and is attempting to start when the system crashes.
Any idea what it is? A bad card in a slot, a good card in the wrong slot?


1: kd> !pcitree
Bus 0x0 (FDO Ext ffffe0016770e370)
(d=0, f=0) 8086191f devext 0xffffe001675af360 devstack 0xffffe001675af210 0600 Bridge/HOST to PCI
(d=2, f=0) 80861912 devext 0xffffe001676d21b0 devstack 0xffffe001676d2060 0300 Display Controller/VGA
(d=8, f=0) 80861911 devext 0xffffe001676d41b0 devstack 0xffffe001676d4060 0880 Base System Device/'Other' base system device
(d=14, f=0) 8086a12f devext 0xffffe001675f81b0 devstack 0xffffe001675f8060 0c03 Serial Bus Controller/USB
(d=14, f=2) 8086a131 devext 0xffffe001675f21b0 devstack 0xffffe001675f2060 1180 Unknown Base Class/Unknown Sub Class
(d=15, f=0) 8086a160 devext 0xffffe001675f11b0 devstack 0xffffe001675f1060 1180 Unknown Base Class/Unknown Sub Class
(d=15, f=1) 8086a161 devext 0xffffe001675f01b0 devstack 0xffffe001675f0060 1180 Unknown Base Class/Unknown Sub Class
(d=16, f=0) 8086a13a devext 0xffffe001675ef1b0 devstack 0xffffe001675ef060 0780 Simple Serial Communications Controller/'Other'
(d=17, f=0) 8086a102 devext 0xffffe001675ed1b0 devstack 0xffffe001675ed060 0106 Mass Storage Controller/Unknown Sub Class
(d=1c, f=0) 8086a110 devext 0xffffe001675f91b0 devstack 0xffffe001675f9060 0604 Bridge/PCI to PCI
Bus 0x1 (FDO Ext ffffe001675de760)
(d=0, f=0) 1b211242 devext 0xffffe001675d47e0 devstack 0xffffe001675d4690 0c03 Serial Bus Controller/USB
(d=1c, f=7) 8086a117 devext 0xffffe001675ee1b0 devstack 0xffffe001675ee060 0604 Bridge/PCI to PCI
Bus 0x2 (FDO Ext ffffe001675dbc60)
(d=0, f=0) 1b211142 devext 0xffffe001675d2840 devstack 0xffffe001675d26f0 0c03 Serial Bus Controller/USB
(d=1f, f=0) 8086a145 devext 0xffffe001675e81b0 devstack 0xffffe001675e8060 0601 Bridge/PCI to ISA
(d=1f, f=2) 8086a121 devext 0xffffe001675e71b0 devstack 0xffffe001675e7060 0580 Memory Controller/'Other'
(d=1f, f=3) 8086a170 devext 0xffffe001675e41b0 devstack 0xffffe001675e4060 0403 Multimedia Device/Unknown Sub Class
(d=1f, f=4) 8086a123 devext 0xffffe001675e31b0 devstack 0xffffe001675e3060 0c05 Serial Bus Controller/Unknown Sub Class
Total PCI Root busses processed = 1
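
To pick the "Unknown Base Class" entries out of a listing like the one above, a rough Python sketch follows. The sample lines are copied from that output; the regular expression is an assumption about the line layout based only on this listing, and the function names are made up for illustration.

```python
import re

# Four lines copied from the !pcitree output above.
PCITREE_SAMPLE = """\
(d=14, f=0) 8086a12f devext 0xffffe001675f81b0 devstack 0xffffe001675f8060 0c03 Serial Bus Controller/USB
(d=14, f=2) 8086a131 devext 0xffffe001675f21b0 devstack 0xffffe001675f2060 1180 Unknown Base Class/Unknown Sub Class
(d=15, f=0) 8086a160 devext 0xffffe001675f11b0 devstack 0xffffe001675f1060 1180 Unknown Base Class/Unknown Sub Class
(d=15, f=1) 8086a161 devext 0xffffe001675f01b0 devstack 0xffffe001675f0060 1180 Unknown Base Class/Unknown Sub Class
"""

# Assumed layout: (d=X, f=Y) <vendor+device> devext <addr> devstack <addr> <class> <label>
LINE = re.compile(
    r"\(d=(?P<dev>[0-9a-f]+), f=(?P<func>[0-9a-f]+)\)\s+"
    r"(?P<vendor>[0-9a-f]{4})(?P<device>[0-9a-f]{4})\s+"
    r"devext\s+\S+\s+devstack\s+\S+\s+"
    r"(?P<class_code>[0-9a-f]{4})\s+(?P<label>.+)"
)


def parse_pcitree(text: str) -> list:
    """Return one dict per device line found in the listing."""
    rows = []
    for line in text.splitlines():
        match = LINE.search(line)
        if match:
            rows.append(match.groupdict())
    return rows


def unlabeled(rows: list) -> list:
    """Keep only the devices the debugger labels 'Unknown Base Class'."""
    return [row for row in rows if row["label"].startswith("Unknown Base Class")]


if __name__ == "__main__":
    for row in unlabeled(parse_pcitree(PCITREE_SAMPLE)):
        print(f"d={row['dev']} f={row['func']} "
              f"{row['vendor']}:{row['device']} class {row['class_code']}")
```

One thing this makes easy to spot: the entry at d=14, f=2 is 8086:a131, the same device ID as the thermal subsystem hardware ID quoted earlier. So the label may only mean the debugger has no name for class 1180, and not necessarily that a mystery card is installed.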

----------------
I would check all of the motherboard power connections and all of the power connections to the CPU and make sure you are getting proper power.

looking at all of the CPUs and what they are running
cpu 4 was in the process of crashing
cpu 2 has some problems
cpu 1 managed to bugcheck the system
cpu 0 was messed up and was trying to call a bugcheck
----------------
If you can boot, I would disable all power management functions in the BIOS and in the Windows control panel (set it to high performance).
I would suspect your sound card does not conform to the power management specs. The specs get updated and there might be a bug or mismatch between versions.

I am assuming some bug in your motherboard ACPI BIOS, or you have a device that is not conformant to the specs.


I will look around in the memory dump at the logs to see what I find. I don't expect much since the system was up for such a short time.


The system was up for 23 seconds,
then made a call to ACPI (Advanced Configuration and Power Interface),
and the system was then told to shut down by the CPU.
The reason the CPU gave was:
Error Type : Micro-Architectural Error
Error : Internal timer (Proc 1 Bank 4)
processor 1 cache bank 4
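
If you end up with several dumps, it helps to compare the "Error" lines side by side. Here is a minimal Python sketch that pulls the error name, processor and bank out of lines like the ones quoted in this thread. The regular expression is an assumption based only on those two lines, and the function name is made up for illustration.

```python
import re

# Matches lines such as "Error : Internal timer (Proc 1 Bank 4)".
ERROR_LINE = re.compile(
    r"Error\s*:\s*(?P<kind>.+?)\s*\(Proc (?P<proc>\d+) Bank (?P<bank>\d+)\)"
)


def parse_errrec_error(line: str):
    """Return (error name, processor, bank), or None if the line does not match."""
    match = ERROR_LINE.search(line)
    if match is None:
        return None
    return match["kind"], int(match["proc"]), int(match["bank"])


if __name__ == "__main__":
    # The two error lines reported for the dumps in this thread.
    lines = [
        "Error : Internal timer (Proc 3 Bank 4)",
        "Error : Internal timer (Proc 1 Bank 4)",
    ]
    for line in lines:
        print(parse_errrec_error(line))
```

In the two dumps here the processor number changes (3, then 1) while the error name and bank stay the same.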


here is the call stack (read from the bottom up for the sequence of calls)
Child-SP RetAddr Call Site
ffffd000`203ec6f8 fffff800`d284ff1f nt!KeBugCheckEx
ffffd000`203ec700 fffff800`d2a9d7d4 hal!HalBugCheckSystem+0xcf
ffffd000`203ec740 fffff800`d285040c nt!WheaReportHwError+0x258
ffffd000`203ec7a0 fffff800`d2850764 hal!HalpMcaReportError+0x50
ffffd000`203ec8f0 fffff800`d285064e hal!HalpMceHandlerCore+0xe8
ffffd000`203ec940 fffff800`d285088e hal!HalpMceHandler+0xda
ffffd000`203ec980 fffff800`d2850a10 hal!HalpMceHandlerWithRendezvous+0xce
ffffd000`203ec9b0 fffff800`d29d927b hal!HalHandleMcheck+0x40
ffffd000`203ec9e0 fffff800`d29d9031 nt!KxMcheckAbort+0x7b
ffffd000`203ecb20 fffff800`d28e2fb0 nt!KiMcheckAbort+0x171
ffffd000`209b1360 fffff800`d28908e5 nt!MiFlushTbList+0x360
ffffd000`209b1560 fffff800`d28906eb nt!MiZeroAndFlushPtes+0x12d
ffffd000`209b1790 fffff801`9c8b38df nt!MmUnmapIoSpace+0x7b
ffffd000`209b17d0 fffff801`9c8b421a ACPI!FreeNameSpaceObjects+0x1cf
ffffd000`209b1820 fffff801`9c8a2ad0 ACPI!ParseCall+0x91a
ffffd000`209b19c0 fffff801`9c8a35f3 ACPI!RunContext+0x1e0
ffffd000`209b1a30 fffff801`9c8c01ff ACPI!InsertReadyQueue+0x3a3
ffffd000`209b1a80 fffff801`9c8bfaa3 ACPI!RestartCtxtPassive+0x2f
ffffd000`209b1ab0 fffff800`d297ab65 ACPI!ACPIWorkerThread+0xe3
ffffd000`209b1b10 fffff800`d29d4926 nt!PspSystemThreadStartup+0x41
ffffd000`209b1b60 00000000`00000000 nt!KiStartSystemThread+0x16
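
Since the stack has to be read from the bottom up, here is a small Python sketch that flips it into call order. It uses a few frames copied from the stack above (not all of them) and assumes each line is "Child-SP RetAddr Call Site"; the function names are made up for illustration.

```python
# A few frames copied from the call stack above, newest call first.
STACK = """\
ffffd000`203ec6f8 fffff800`d284ff1f nt!KeBugCheckEx
ffffd000`203ec740 fffff800`d285040c nt!WheaReportHwError+0x258
ffffd000`203ec9b0 fffff800`d29d927b hal!HalHandleMcheck+0x40
ffffd000`203ecb20 fffff800`d28e2fb0 nt!KiMcheckAbort+0x171
ffffd000`209b1790 fffff801`9c8b38df nt!MmUnmapIoSpace+0x7b
ffffd000`209b17d0 fffff801`9c8b421a ACPI!FreeNameSpaceObjects+0x1cf
ffffd000`209b1ab0 fffff800`d297ab65 ACPI!ACPIWorkerThread+0xe3
ffffd000`209b1b60 00000000`00000000 nt!KiStartSystemThread+0x16
"""


def parse_frames(text: str) -> list:
    """Return (module, function) pairs in the order they appear in the dump."""
    frames = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3 or "!" not in parts[2]:
            continue  # skip headers and anything that is not a frame
        module, _, rest = parts[2].partition("!")
        function = rest.split("+")[0]  # drop the +0x... offset
        frames.append((module, function))
    return frames


def call_order(text: str) -> list:
    """Return the frames oldest call first (the dump lists newest first)."""
    return list(reversed(parse_frames(text)))


if __name__ == "__main__":
    for depth, (module, function) in enumerate(call_order(STACK)):
        print(f"{depth:2} {module}!{function}")
```

Read that way, the sequence is: a system thread starts, the ACPI worker runs, ACPI frees objects and unmaps I/O space, the machine check abort fires, and WHEA reports the hardware error and bugchecks.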


 

Candan

Thanks. Is it not fair to say that since everything works perfectly fine when operating on a single core that it can only be that the CPU is defective when this only happens when all cores are opened?

I never had a problem with any of this hardware until I tried to OC to 4.5GHz at 1.325v, where it crashed for the first time while stress testing the CPU.
 
I do think there is something wrong with the CPU. If you just got it and there are no burn marks, you can try to return it.
(Guess it could also be a bad motherboard.)
-----------
The return code for the thermal sensor was "start pending". It could have failed.
See if it is working in the BIOS.
-------------

Check the BIOS to see what the sensor says the temp is.
If the CPU overheated, it should have started to throttle down to prevent burning your CPU pads,
unless you disabled that in the BIOS.

You might pull your CPU and inspect it for physical damage (a burn spot on a pad).



 

Candan

Bios says 25c for cpu

What does start pending mean?

I suppose if I try to RMA the CPU they'll tell me it's been OC'd and won't touch it.

Is there any way to check that it's the mobo or cpu without having to buy what doesn't need replaced? Especially if there's no burn marks on the cpu pads?
 
Overall, something is wrong. It looks like the CPU, but it is strange that the temp in the BIOS is fine.
The driver's "start pending" state could also happen if the wrong driver for the chipset is installed on the machine.
It would try to start but might never complete.


------
"Start pending" means the driver was starting and attempting to access the hardware but had not completed yet. Generally this would be very fast for something inside the CPU: in the milliseconds time frame, not 20 seconds.

I am looking at the specs for this CPU, and there is a lot of thermal monitoring going on in the CPU:
memory thermal management, CPU, GPU. It is a pretty fancy processor.
The temperature of the memory can be passed to the CPU via an on-board sensor.


You might try a slower clock speed for your memory, just to see if you can get the system to boot.



 
Solution

Candan

Thanks for all your help johnbl.

I spoke with Intel today. Even though I overclocked the CPU, they are allowing me an RMA! Just goes to show that honesty is the best policy! They said, since I was honest about overclocking and didn't really push it too hard they'd swap it out for me! So happy I don't have a paperweight that I'd need to splash out for another expensive chip!

Thanks again!