BSoDs, GPU heating, and other problems. Also, how can one with 2 GPUs tell which GPU is misbehaving?

pirate.halpert

Prominent
Oct 13, 2017
So I have been BSODing with Win10 regularly, and as much fun as it may sound, I have been secretly monitoring my system for clues to make it stop.

I've begun to suspect one of my 2 GPUs is corrupting my system somehow. The BSODs were usually IRQL error messages, but less often also cache malfunction errors, and even more rarely other random errors that I think I've cleared after following advice from you guys. There used to be lots of USB and COM errors in the mix, but that's mostly stopped after changing some COM server permissions.
BSODs still happen reliably every 3-6 hrs, though back-to-back BSODs are not all that infrequent either, and are also not cool at all.

Memtests and sfc scans were all clear. But recently I started doing some processing using the Nvidia OpenCL API, and my system started to freeze to black for a few long seconds, or lag in a freeze-frame way and then pop out of it by minimizing all open windows, and vice versa. The GPUs would process in OpenCL, but it was like they really didn't want to and were throwing tantrums.

Most troubling was that the BSODs became even friendlier and would visit every hour or so. I don't know why it's gotta frown like some Microsoft e-thug. But my system has lotsa fans, and the room I keep it in is usually chilly as well.

But what I would most like to know is why this hadn't become evident from graphics processing in Premiere or during other moments of GPU time.
I also noticed these things were less severe when my case was left open, leading me to guess that overheating is relevant somehow. Is it safe to blame a faulty GPU for this, or could it be more involved?

How can I determine which GPU to RMA without pulling one out? They are not easily removed in my setup, and without the neighbor GPU's body heat I'm not sure either one will bug out solo.

Also, the BSODs became a lot more severe after putting a Win10 Creators Update on a while ago, and it hasn't gotten better at all since doing a rollback. Do I need to reset Windows before looking into hardware?
VMware Workstation also makes more BSODs.

I know I have a lot of questions, and I thank you for reading this far. I'm usually pretty good at troubleshooting but I haven't been able to crack this for months now.

Hardware: 2x MSI GTX 1080, 128GB DDR4-3000 Corsair Dominator SDRAM, Intel i7 6950X, ASUS X99-E WS, Samsung 960 NVMe M.2 card, a WiFi and a bridged Ethernet PCIe card from Intel, & between 4-6 spinny HDDs depending on the day

edit: ajit pai made this take too long to upload, but here is a memdump file
https://www.dropbox.com/s/ey4p5qbstd40vei/MEMORY.DMP?dl=0
 
pirate.halpert

Prominent
Oct 13, 2017


I like this idea cause I got a super box fan in my closet. Sorry, Walmart.
If overheating is the case, I guess my best option would be figuring out a way to mount the GPU on riser cables next to a case fan? A box fan is too loud and windy to have as a long-term solution.
 

Mark RM

Admirable
Yeah, I'm not saying it's a good solution forever; it's a very good short-term check because it will drop the temps a LOT.

Back when it was a thing to do, I built a gaming rig with three 390X 8 GB cards in TriFire for 4K for a client (it's still a beast of a machine). Instead of jamming three cards in tight together or liquid cooling them, we bought a full-sized server tower that allowed me to separate the last of the three cards entirely off the motherboard using a standard 16 to 16 pin PCIe cable about eight inches long (there has never been signal degradation because of this that affects performance or power delivery in any way). So it still mounts in the slots on the rear of the case, but it's not actually in the motherboard.

Five high static pressure fans shove quiet but effective air through it, and it has stayed calm and quiet all these years.

But if it comes right down to diagnosing which card it is, 99% of the time it's the top card, closest to the CPU, that is overheating.
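If you'd rather measure than go by the rule of thumb, here is a rough Python sketch that reads per-card temperature and load from nvidia-smi (it ships with the NVIDIA driver) and picks out the hotter card. The function names are made up for illustration, and I'm assuming nvidia-smi is on the PATH; treat it as a starting point, not a finished tool.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in the order the parser expects them.
QUERY_FIELDS = "index,name,temperature.gpu,utilization.gpu"


def read_gpu_csv():
    """Run nvidia-smi once and return its CSV text (assumes it is on the PATH)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_gpu_csv(text):
    """Turn the CSV text into one dict per GPU; hypothetical helper."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != 4:
            continue  # skip blank or unexpected lines
        idx, name, temp, util = (field.strip() for field in rec)
        rows.append({"index": int(idx), "name": name,
                     "temp_c": int(temp), "util_pct": int(util)})
    return rows


def hottest(rows):
    """Return the GPU record with the highest temperature reading."""
    return max(rows, key=lambda r: r["temp_c"])
```

Calling hottest(parse_gpu_csv(read_gpu_csv())) every few seconds while the OpenCL job runs, and writing the result to a file, should show which card is running hot in the minutes before a freeze or BSOD, without pulling either card.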
 
The bugcheck was caused by the
Intel Turbo Boost Max Technology 3.0 driver:
\SystemRoot\system32\DRIVERS\IntelNit.sys Thu Oct 27 07:43:44 2016
A new version is here: https://downloadcenter.intel.com/download/26103/Intel-Turbo-Boost-Max-Technology-3-0
I think something tried to call an unknown function. I would guess you have an old version or one that does not match your BIOS version (looks like a bug unrelated to the GPUs).

I was unable to read the BIOS version from the memory dump; this can mean the BIOS is old or in a non-standard format. Check for updates to the BIOS and motherboard drivers from the motherboard vendor.

Note: here are your GPU drivers. One was updated from the Windows DriverStore on Dec 5
(most likely by Windows Update; the other two files are most likely from NVidia. I would make sure you get a clean install of the GPU drivers from NVidia):
\SystemRoot\system32\drivers\nvhda64v.sys Thu Sep 14 02:55:42 2017 (59BA521E)
\SystemRoot\System32\DriverStore\FileRepository\nv_dispi.inf_amd64_c68c1eb90f6d242e\nvlddmkm.sys Tue Dec 5 11:28:19 2017 (5A26F353)
\SystemRoot\system32\drivers\nvvad64v.sys Tue Sep 19 01:38:04 2017 (59C0D76C)
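The hex value in parentheses after each driver is the build timestamp from the file header, stored as seconds since 1970 (UTC). A small Python sketch for decoding them, in case you want to compare driver dates yourself; the function name is just illustrative:

```python
from datetime import datetime, timezone


def decode_driver_stamp(hex_stamp):
    """Convert the hex timestamp from the driver listing into a UTC datetime."""
    return datetime.fromtimestamp(int(hex_stamp, 16), tz=timezone.utc)


# Hex stamps copied from the three NVidia driver lines above.
drivers = [
    ("nvhda64v.sys", "59BA521E"),
    ("nvlddmkm.sys", "5A26F353"),
    ("nvvad64v.sys", "59C0D76C"),
]

for name, stamp in drivers:
    print(name, decode_driver_stamp(stamp).isoformat())
```

The decoded times come out in UTC, so they land 7 or 8 hours ahead of the local times printed in the listing, but the dates make the point: nvlddmkm.sys was built months after the other two files.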

0: kd> !sysinfo cpuspeed
CPUID: "Intel(R) Core(TM) i7-6950X CPU @ 3.00GHz"
MaxSpeed: 3000
CurrentSpeed: 2998 <---- kind of an unexpected speed
0: kd> !sysinfo cpumicrocode
Initial Microcode Version: 0b00001c:00000000
Cached Microcode Version: 0b00001c:00000000
Processor Family: 06
Processor Model: 4f
Processor Stepping: 01


 