BSOD, fixed by disabling some CPU cores. Does this mean that my CPU is the culprit for errors?

Ballistic456

Reputable
Sep 5, 2014
16
0
4,510
Specs:
i5-3570K Quad Core @ 3.4GHz (Stock, not overclocked) CPU
8GB RAM
GTX560ti GPU
750W PSU
Windows 7 HP 64-bit

---

I have been wrestling on and off with my first self-built rig for the past year and a bit. I have solved major bouts of bluescreens, only to have them return in force within a matter of months. Having exhausted all the past fixes that got me out of their hard, cold grasp, I have tried diagnosing my components more specifically.
(Many reinstalls of windows 7 indicate that my issues most likely lie in BIOS or in Hardware)

Memtest86 revealed no errors after 4 passes

CHKDSK/r ran clean

I recently learned that the most common BSOD I get (STOP 0x101) could be caused by a bad core in the CPU. Cutting to the chase, just today I went into BIOS and disabled 3 cores. Windows booted up just fine and I got 20+ minutes of uptime with no BSOD (much longer than usual with the bluescreens). With 2 cores disabled, everything's fine and dandy. But with 1 core disabled (and with all 4 cores active, for that matter), I get a BSOD or a freeze on the POST screen.

This being said, can I conclude more or less for certain that my CPU is bad and needs replacing in order to get a stable 4 core system, hopefully free of BSODs?

Thank you in advance!

PS- First post here, been using Tom's Hardware a lot as a guest user. You guys seem like a great bunch to get help from :3 I was brought into the PC building by my uncle and I must say it's one of the most rewarding hobbies I could ever have. Sorry if I sound a bit nooby.
 
I believe your conclusion is probably correct. When you did your MemTest runs, did you check each stick separately? Also, did you verify that there are no broken/bent pins in your CPU dock, and that when mounting your cooler you tightened it evenly and in a star or criss-cross pattern?
 

Ballistic456



A memtest I ran a while back involved me swapping out the RAM chips, yes. (My rig uses 2x4GB... should have mentioned that.) Those ran for at least 10 passes each. I even tried each stick in different slots. No errors were found at all. My most recent one was with both sticks in; 4 passes, no errors.

I have not tried removing the CPU at all during the time of build so I'm not 100% certain on that. However, I remember checking for bent pins, etc before mounting the CPU in the first place. And yes, cooler was attached as dictated by the instructions with the criss-cross pattern you mentioned.

Is it worth pulling out my CPU and checking for sure? I get paranoid very quickly when I deal with the thermal compound in there.
 
Check your CPU temperature. More cores generally means more heat. It could be that the cooler is not cooling evenly. Could be wrong thermal paste application for example.

If you have control over specific cores, try and enable only the two cores that you suspect are faulty. If you get the errors, you have bad cores. If you don't, it's probably a heat issue.
 
"Is it worth pulling out my CPU and checking for sure? I get paranoid very quickly when I deal with the thermal compound in there."

I've pulled coolers off so many times to test different pastes and application techniques that I don't even think about it much when I'm doing it, but I'm still completely paranoid about removing and inserting chips. Having said that, at the point that you've exhausted every other diagnostic avenue and you're considering RMAing the CPU, you'll need to pull it anyway. Checking the pins costs you very little in the way of extra time or effort. It would suck to go through the trouble of exchanging your chip just to have the same problem with the new chip.

I'm assuming that you've also tried re-flashing or updating your BIOS. If so, you're basically down to a bad chip (which is rare), a bad paste job (which is common), or bent/broken pins (which is more common than a bad chip, but less so than a bad paste job). All three can be checked by: 1) pulling the cooler and checking that there was sufficient coverage and that you didn't use too much paste; and 2) checking your CPU dock for bent or broken pins. If both of those check out, then continue with the RMA process. If either of them is suspect or definitely bad, you can apply remedial action, re-install and re-test.
 

Ballistic456



Definitely not heat. I checked core temps when the PC was running on all cores. All were consistently low (approx. 38C). Although I did notice a larger difference between core temps when running only two cores (almost 5C, whereas normally there is only a difference of 1 or 2C when running 4 cores). And no, unfortunately I am not able to specifically select which cores to activate. BIOS only gives me the option to choose the number of cores.

I will try looking into heat again though to help try and rule things out. Thanks for the input!
 

Ballistic456



RAM is Kingston brand. My BIOS is set to govern voltages automatically but the stickers on the chips say 1.5V. As for model... I'm not totally sure. I have the chip in my hand right now and it has a lot of code or index numbers of sorts. One does stick out though, maybe you can make something of it.

It reads: KVR1333D3N9/4G
 

Ballistic456



I will do as you say here as soon as I get the free time I need. I should make clear that this rig has actually been running fine for almost 12 months since the last onslaught of BSODs. During this time, I ran programs such as CPUID HWMonitor to check my core temps. These were always around 35-41C depending on workload and were always within 3 degrees C of each other at the very most.

I was just thinking about this: surely a bent pin or heat problem would have meant my PC had been BSODing since the very beginning, without periods of good function? Or can these sorts of things cause intermittent errors such as the ones I have been facing?

Thanks for your continued help!
 

Your problem
http://www.intel.com/support/motherboards/desktop/sb/CS-034263.htm?wapkw=half+height+memory

By turning off cores you are limiting memory bandwidth to that required by fewer cores

buy some full height RAM sticks
 
If your rig was running fine for that long, I doubt that it's a problem with the pins - I'm just saying that if you're going to pull the chip anyway, you might as well check the paste job and pins while you're about it. You'll also want to closely inspect the mobo for any misshapen, discolored or leaking capacitors, or anything else that looks out of the ordinary like any scorch marks, melt spots, separated wires or connectors etc.
 

Ballistic456



I'm using an ASUS motherboard P8Z77-V LX. Does this still apply?
 

Ballistic456



Will do, thank you very much, volcanoscout.
 


An Intel chipset will behave the same way in an Intel board or an ASUS one
 

Ballistic456



Ah right, I see. So the issue is between the chipset and the RAM, not the MoBo and the RAM. Sorry, the linked page looked like it was just referring to Intel MoBos.

What is the difference between full height and half-height sticks (or is it as simple as it sounds)? How can I be sure I am getting full height sticks? Sorry for rambling with these questions but this half/full height RAM thing is totally new to me.

Thanks for your continued help and input!
 
I don't think the LP DRAM sticks are the issue. Although mobo QVLs are not comprehensive, if a stick is on the list, that's usually pretty reliable - the QVL for this mobo, dated this July, lists numerous low profile DIMM sets at all speeds. Also, when I searched online for issues with this board and LP memory, I was only able to find one reference, and it was about one specific model of Crucial Ballistix.

Having said all that, if you have access to some non-low profile sticks, it might be worth trying them. Can't hurt anyway.
 
edit #2: if you have any memory .dmp files, even a minidump, you can put one on a cloud server and I can take a quick look. You should also change your dump setting to a kernel dump, so that if you hit this problem again Windows saves the proper debug information in the dump file.

edit:
note: to figure out bugcheck 0x101 cause with a windows debugger you need to make a kernel dump rather than a mini dump.
-----------------
no, don't conclude that it is a CPU hardware problem. You can have software that hangs a core, and that will produce this bugcheck.

normally, you would run the various hardware tests. If they pass and you still get the problem, you update the BIOS and the CPU chipset drivers, remove as many third party programs as you can, and update the other 3rd party drivers.

if you continue to get problems, you enable Windows Driver Verifier (verifier.exe) and have it test the various 3rd party device drivers, so that it forces the system to bugcheck and name the driver. Then look at the memory .dmp file and it will name the offending driver that you have to update or remove.
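
For the dump setting mentioned above, here is a minimal Python sketch for checking what Windows is currently set to save. It assumes the setting lives in the `CrashDumpEnabled` registry value under `HKLM\SYSTEM\CurrentControlSet\Control\CrashControl`; the function names are made up for illustration, and the registry read only works on Windows.

```python
import sys

# Assumed meanings of the CrashDumpEnabled registry value under
# HKLM\SYSTEM\CurrentControlSet\Control\CrashControl
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (minidump)",
    7: "automatic memory dump (Windows 8 and later)",
}


def describe_dump_setting(value):
    """Translate a CrashDumpEnabled value into a readable name."""
    return DUMP_TYPES.get(value, f"unknown ({value})")


def is_kernel_or_complete(value):
    """True if the setting saves a kernel or complete dump rather than a minidump."""
    return value in (1, 2)


def read_dump_setting():
    """Read CrashDumpEnabled from the registry (Windows only)."""
    import winreg  # standard library, but only available on Windows

    key_path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        value, _value_type = winreg.QueryValueEx(key, "CrashDumpEnabled")
    return value


if __name__ == "__main__" and sys.platform == "win32":
    setting = read_dump_setting()
    print("Current dump setting:", describe_dump_setting(setting))
    if not is_kernel_or_complete(setting):
        print("Consider switching to a kernel dump before the next bugcheck.")
```

The same setting can be changed by hand under System Properties > Advanced > Startup and Recovery, so the script is only a quick way to confirm it.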



 

Ballistic456

I don't have any non-LP sticks on me, but I suppose this would be a good starting point since RAM is pretty cheap anyways. I'll try grabbing a kernel dump (I set my computer to do that a while back). Is it a simple case of finding it on my hard drive and uploading it?

I did get a few BSODs showing specific drivers. I removed these, removing those specific BSODs, but they were replaced with 0x101s and 0x1As (or is it 1A?) etc. Most driver software is paid for but I'm assuming the windows one is free.

Can the CPU driver be updated in device manager?
 

Ballistic456

Checked CPU dock today for bent pins and I even redid the thermal compound! No bent pins were found and I remounted the CPU with no issues, checking to make sure the cooler was attached correctly. Also had a look at the MoBo for blemishes and burn marks, none of that either. Going to invest in some full height RAM shortly.

I'll change the RAM first before I start enabling my cores with my current RAM, just to save myself the stress of seeing BSODs. I'll report back to tell you how it goes. If the RAM does not solve this issue, I'll post a kernel dump on a cloud for you to look at. If things still continue I'll look into getting a new CPU.

I'm running ok on two cores now, but there's a very distinct drop in game performance. Again, thanks everyone.
 

Ballistic456



Here's the kernel dump:
https://www.dropbox.com/s/z0f7pu7ctbpd91c/MEMORY.DMP?dl=0
 
This bugcheck was not caused by the CPU, and it was not caused by RAM.
you had two cores running USB code: one attempting to get the USB bus, the other doing network calls via USB, and together they exposed a driver bug. If you limit the cores, the calls become serialized and you do not hit the bug in the USB network driver.

BugCheck 101, {31, 0, fffff88002f65180, 2}
CLOCK_WATCHDOG_TIMEOUT
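
For anyone reading along, here is a small Python sketch that splits the bugcheck line above into its four parameters. The parameter meanings are taken from Microsoft's bug check reference for 0x101 as I understand it; the function names are just made up for illustration.

```python
import re

# Parameter meanings for bug check 0x101 (CLOCK_WATCHDOG_TIMEOUT),
# as described in Microsoft's bug check reference.
PARAM_MEANINGS_0X101 = (
    "clock interrupt time-out interval, in nominal clock ticks",
    "reserved (0)",
    "address of the PRCB of the unresponsive processor",
    "index of the hung processor",
)


def parse_bugcheck(line):
    """Split a 'BugCheck CODE, {p1, p2, p3, p4}' line into integers."""
    match = re.match(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}", line.strip())
    if match is None:
        raise ValueError(f"not a BugCheck line: {line!r}")
    code = int(match.group(1), 16)  # debugger output is hexadecimal
    params = [int(part.strip(), 16) for part in match.group(2).split(",")]
    return code, params


def describe_0x101(line):
    """Pair each parameter of a 0x101 bugcheck with its documented meaning."""
    code, params = parse_bugcheck(line)
    if code != 0x101:
        raise ValueError("expected bug check 0x101")
    return [
        f"{meaning}: {value:#x} ({value})"
        for meaning, value in zip(PARAM_MEANINGS_0X101, params)
    ]


if __name__ == "__main__":
    for text in describe_0x101("BugCheck 101, {31, 0, fffff88002f65180, 2}"):
        print(text)
```

For this dump the first parameter decodes to 0x31 (49) clock ticks and the last to processor index 2. That identifies the core that stopped responding, but not why it hung.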


from the memory .dmp it looks like a USB driver issue most likely caused by your USB network device.

I would start by updating this file
\SystemRoot\system32\DRIVERS\netr28ux.sys Wed Apr 27 23:18:18 2011

Ralink RT2870 series USB 802.11n Wireless Adapter Driver (you may have a different manufacturer)
the company was sold to MediaTek http://www.mediatek.com/ so you will have to dig around to find working drivers.

The drivers also depend on updated chipset drivers so you should also install https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=20775
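
If you want to see what other old drivers are sitting on the system, here is a rough Python sketch that lists `.sys` files last modified before a cutoff date. Treat it as an approximation: the date the debugger prints is the driver's build timestamp, while this uses the file's modification time as a stand-in, and the 2012 cutoff and the function name are arbitrary choices for illustration.

```python
import os
from datetime import datetime


def drivers_older_than(directory, cutoff):
    """Return (name, modified) pairs for .sys files last modified before cutoff."""
    results = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file() or not entry.name.lower().endswith(".sys"):
                continue
            modified = datetime.fromtimestamp(entry.stat().st_mtime)
            if modified < cutoff:
                results.append((entry.name, modified))
    # Oldest first, so the likeliest update candidates are at the top
    return sorted(results, key=lambda pair: pair[1])


if __name__ == "__main__":
    system_root = os.environ.get("SystemRoot", r"C:\Windows")
    folder = os.path.join(system_root, "System32", "drivers")
    for name, modified in drivers_older_than(folder, datetime(2012, 1, 1)):
        print(f"{modified:%Y-%m-%d}  {name}")
```

Expect plenty of Microsoft's own drivers in the output as well. The ones worth chasing are the third-party files, like netr28ux.sys here.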




machine info:
Manufacturer ASUSTeK COMPUTER INC.
Product P8Z77-V LX
BIOS Version 2303
BIOS Release Date 12/05/2013
Socket Designation LGA1155
Processor Version Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz
Processor Voltage 8ah - 1.0V
External Clock 100MHz
Max Speed 3800MHz
Current Speed 3408MHz



 

Ballistic456



Thanks for the input! I'm looking into your solution now but I need some help.

"I would start by updating this file
\SystemRoot\system32\DRIVERS\netr28ux.sys Wed Apr 27 23:18:18 2011 "

^ I'm wary of searching random sites for driver updates. Is there a safe and easy way to do this?

I installed the chipset update with no issues.
 
I remember looking at this memory dump but lost track of the thread.
if I recall correctly you had a core that was trying to install a USB driver and getting repeated failures,
and another core that was trying to use the ethernet driver that the first core had tried and failed to install.
the driver only had 30 clock cycles before it timed out and bugchecked. I think it tried 88 times before the bugcheck was called.

It is hard to find a good driver for that device now that the company no longer exists. They designed and sold the chip, and many manufacturers sell the same device with a different plastic case on it.
but do update your chipset driver as well, because one will affect the other driver.

you also want to find a copy of usbview.exe from the Microsoft DDK. it is in the standalone debugger package on their website.

also google how to make Device Manager show disconnected/hidden devices. unhide them and delete the various unused USB devices that have been removed.

- got to run



 