SOLVED: Advanced CPU Overclocking Problem (GPU crashes when system idles)

frenchy70

Distinguished
Feb 22, 2010
56
0
18,640
Hi everyone

I put "advanced" in the thread title to indicate that this isn't a "how do I overclock" thread - I am 95% of the way there and understand my BIOS settings inside out. I have read enough over the past few years to OC my CPU and GPU, to update BIOS, etc. I was also hoping to draw in a few experts to help me crack this irritating problem I've been having. My rig is old-ish, but as I haven't had it running for more than a few months over the past 3 years, I'm really not interested in upgrading at the moment.

At this stage of the game, I'd even be happy to donate $50 to a charity chosen by the person who can help me solve this. I have spent over 100 hours reading up on this and still haven't found any explanation or solution for my problem. So, it's kind of become an obsession to solve it.

Good luck and thanks in advance for any help.

So, the short version of the question is:

Why would my GPU start acting weird ONLY AT IDLE when I overclock my CPU, when I never have problems (with the GPU or anything else) at stock?

I need some help understanding where I should be looking to get my overclock stable at idle because I don't really understand what is happening with my GPU.

Further info:

Kit: i5-750, GA-P55M-UD2 mobo, Crucial 9-9-9-24 RAM, Sapphire HD5850, OCZ ModXStream 500w PSU, W7 64-bit

I can get a "static" overclock to 4 GHz stable at load AND at idle, but as soon as I enable Turbo and power saving, the overclock makes the GPU "crap out" ONLY AT IDLE. Thing is, I really want the modest 4-core overclock to 3.5 GHz and Turbo to 4 GHz, with power savings. I've seen it done (Tom's even did an article on it) but I can't get it to work.

More detailed explanation of the problem:

When idle, surfing or at the desktop, after anywhere from 2 minutes to 2 hours, I eventually get one of the following:

BSOD with error 0x116 or 0x119 - crash logs indicate it's a problem with atikmpag.sys or atikmdag.sys

OR

the GPU gives the "driver has stopped responding and has restarted" pop-up - sometimes it recovers fine, sometimes it crashes and hangs the system

The GPU never gives me any problems otherwise - it's only this, and it's only with the overclock with Turbo enabled.

With the CPU overclocked, the CPU and GPU do fine when I am stress testing them BOTH at the same time under full load.

Here are some things I have tried:

Dialled back BCLK to reasonably low values like 150 with Turbo just to get it working before increasing (bear in mind, I can sit happily at 4 GHz without Turbo, stable at idle and load)

Run with RAM timings and speed at stock and looser, to rule that out (passed hours of MemTest with RAM overclocked and at stock/loose settings)

Reinstalled ATI drivers to latest (using DriverSweeper method)

Checked that latest versions of atikmpag.sys and atikmdag.sys got installed

Naturally, I've been giving it more (or sometimes less) juice by adjusting VTT, vcore, PCH, PLL, DRAM volts, etc

Run with and without LLC

Updated mobo BIOS and drivers

Disabled the GPU's HD audio drivers (in case of a clash with the Realtek onboard audio)

Set 2D GPU idle clocks and volts higher

Upgraded fan on heatsink and reseated heatsink with new thermal paste - didn't need doing from a temps point of view but it had been 3 years since last reseating (CPU temps are great: 25-30°C idle, 60°C at load; GPU 36°C idle, 75°C at load)

Tried all Windows Power profiles (balanced, custom, etc)

Uninstalled Catalyst Control Center

Increased PCI express frequency to 101 - read somewhere that it might "lock" the value and help with stabilising overclock. 101 didn't work, so now trying it at 103. /clutching at straws

Reduced PLL - read that reducing it well below the "normal" 1.8 can help. Tried 1.6 and 1.7

Turned off hardware acceleration in Chrome browser (because my GPU was constantly jumping between 2D and 3D clocks all the time when scrolling a web page) - didn't help with the crashes

Things I haven't done yet:

Reinstalled Windows - thinking about it even though I feel it's a pain and won't help

Changed the PSU - would mean having to buy one as I don't have a friend nearby who I could borrow one from. Was wondering if that was the problem but don't understand how it could be if I can run all day at load on CPU and GPU
 
Pinhedd

The turbo mode introduced with the first generation Nehalem processors is slightly different than the implementation used with Sandy Bridge and subsequent processors.

When the processor switches from a low power state to a high power state it has to do three things. First, it has to halt execution. Second, it has to adjust its voltage level and clock multiplier. Third, it has to restart execution.

This sounds simple in principle but the technical implementation is actually quite difficult.

If my memory serves me correctly, on the Nehalem processors the turbo engine changes the voltage and clock multiplier at the same time when changing power levels. On subsequent processors, they are decoupled. This allows newer processors to change the voltage level without pausing execution, and then pause execution only to alter the clock multiplier. This greatly reduces the processor's stall time while switching power states, improves the transient response of the switch, and improves the dynamic range of the jump.

My suspicion is that your computer is failing at switching from a low power state to a high power state. I'm not sure what the i5-750's idle clock is, but I assume that it's probably in the range of 1600 MHz. It is tested and rated to jump from 1600 MHz to 3200 MHz on one core; I suspect that you're asking it to jump from 1600 MHz to 4000 MHz on all cores all at once. This huge jump can prove problematic for the PLL (it's what multiplies the base clock, 133 MHz at stock on this platform, into the 4000 MHz CPU clock) as well as your motherboard's voltage regulator.

It's entirely possible that your processor is simply unable to make that jump. That's why it's stable at 4 GHz when power saving and turbo mode are turned off, but not when they're turned on.

There are three things that you may be able to do.

First, bump the PLL voltage up a little bit. Try 1.9 volts. PLLs are complex analog circuits which are constantly playing catch-up; increasing the voltage may allow it to track better.

Second, get a real power supply. Seriously, OCZ products are absolute garbage and the ModXStream lineup is as "meh" as you can get.

Third, you're using a budget motherboard. You get what you pay for. A 50% overclock is probably a bit too ambitious on that. There's a reason why premium overclocking boards like the Asus Rampage series command a hefty price premium: they put a lot of money into components that are designed specifically for squeezing out extra performance.
 

frenchy70

@Pinhedd Thanks very much for your reply. I had an "intuition" that it could be something with the low to high power states and what you are saying totally makes sense. However, I didn't think this could be the problem because I can stop and start Intel Burn test or Prime95 over and over, while watching CPU-Z or Real Temp to see 1, 2, 3 or all 4 cores doing what they should do - going from nothing to full load. It acts just like I'd expect, when I'm stress testing. I'm just confused about why this switching from a low to high state should be a problem when the computer is just idle or I'm surfing in Chrome browser.

Just to respond to a couple of your points:

Yeah - the i5 750 is idling at a 9x multiplier, so in my case, with my BCLK at 160 instead of stock 133, CPU-Z shows it idling at 1440.

"I suspect that you're asking it to jump from 1600 MHz to 4000 MHz on all cores all at once"

Actually, I'm asking it to do 3360 (21 x 160) on 4 cores, and 3840 on 1 core (24 x 160). But yeah, it's going from 1440 to 3840 on 1 core. As I said, it's doing that fine at the desktop while stress testing with IBT or Prime, or in games, etc.
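
For anyone following the numbers, the clocks above are just BCLK times the multiplier. A minimal Python sketch of that arithmetic, using only the values quoted in this post (the function and dictionary names are made up for illustration):

```python
def core_clock_mhz(bclk_mhz, multiplier):
    """Core clock is the base clock (BCLK) times the CPU multiplier."""
    return bclk_mhz * multiplier


BCLK_MHZ = 160  # overclocked base clock quoted in the post (stock is 133)

# Multipliers quoted in the post for each state.
MULTIPLIERS = {
    "idle": 9,
    "four cores loaded": 21,
    "single-core turbo": 24,
}

clocks = {state: core_clock_mhz(BCLK_MHZ, mult) for state, mult in MULTIPLIERS.items()}

for state, mhz in clocks.items():
    print(f"{state}: {mhz} MHz")

# Size of the largest step the CPU is asked to make in one transition.
idle_to_turbo_jump = clocks["single-core turbo"] - clocks["idle"]
print(f"idle to single-core turbo jump: {idle_to_turbo_jump} MHz")
```

The same function can be used to compare other BCLK values, such as the 150 tried earlier, before changing anything in the BIOS.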

But I'm not so sure that the CPU is actually going from idle 1440 to 3840 when I'm browsing or the computer is left untouched. I know the frequency is bouncing around a bit, especially when I'm using the computer, but are you saying that when it is idle or I'm just browsing, the CPU is going to 3840?

I agree about the PSU - I don't know why I bought that, and if I thought it would help, I'd probably replace it with a quality PSU. The thing is, could it really be my crappy PSU when it has been providing enough juice when the CPU and GPU are OC'ed and working fine at load?

Or does the PSU need to be able to handle the switching between low and high power states in an IDLE scenario? Something still doesn't make sense to me about the PSU being the problem. I know that the PSU doesn't really have enough amps on the 12 V rails for what AMD specifies, but again, it has been doing fine delivering at LOAD, so why would that be a problem at IDLE?

Ditto re the mobo - it definitely wasn't the most expensive but it did get some great reviews at the time for being an overclocking champ. But I can accept that maybe it hasn't been constructed well enough to handle the switching between low and high power states. At load, it seems to do fine.

Thanks for explaining PLL in relation to this - that's something I can focus on. I'll try and bump that up to 1.9 and see what happens.

The last thing I'd say is that with regards to the CPU's ability, I accept that things might have improved with subsequent generations of CPU, but as I said, I have seen plenty of write-ups at Tom's and around the web to show that a dynamic overclock is no problem at all. So at the moment, I'd have to be looking elsewhere for the problem based on that.
 

frenchy70

UPDATE: Is it possible that the PCIe frequency is affected by overclocking the CPU? I'd been wondering about this because it's my GPU that is crashing while the CPU is overclocked, and of course, the GPU is plugged in to a PCIe slot (that's the extent of my knowledge about the PCIe slot).

Coincidentally, I ran across some interesting threads where people were talking about putting in a value of 101 for PCIe frequency when overclocking because if you left it at 100 on some boards (Gigabyte ??), then it wouldn't stay at 100, and would mess with the stability of your overclock.

So I tried 101 and it didn't BSOD for quite a few hours, but it did give me the "driver has stopped responding" message after the screen blacked out. Encouraging, because the GPU recovered, whereas previously the screen always went laggy after these events (I couldn't change tabs in the browser, the screen took ages to refresh, etc).

So, this morning, just before posting this thread, I bumped PCIe frequency to 103, and so far, the OS has been up and running for longer than it ever has while CPU is overclocked.

I'm not quite ready to mark this solved yet as I know I need to run some more tests, and even just leave it idle overnight. But so far, I'm feeling quietly confident as Bick would say. My charitable donation to the Human Fund is already in the envelope.

Thoughts ??
 

MEMOFLEX

Distinguished


I don't know how much more I can add to this thread, but when overclocking my i7 930 I had issues which seem similar to yours, in that my GPU would behave strangely, and after hours of tinkering (and considerable hair loss) I traced the issue to my PCI-E frequency being too high. I have an Asus board so this could be a completely different issue, but I eventually went back to 100 for PCI-E and everything has been rock solid since. In regards to the switching in power phasing, I cannot comment as I turned off all power saving features because I was going for a relatively aggressive overclock.
 

frenchy70

@MEMOFLEX - thanks for your reply. OK, that's interesting. So maybe I'm barking up the right tree at least.

I just had a "display driver stopped responding" event and it recovered again without crashing or freezing. So, I have dropped the PCIe frequency back to 101 (in the few hours I've tried it at that value, I haven't had any of the BSODs that were happening frequently before).

I'm now going to try increasing my 2D clocks to see if that will overcome the "driver stopped responding" problem. When I tried that before, it didn't stop the BSODs, so I had put the clocks/voltage back to default.

If that doesn't work, I'll try PCIe frequency at 105.

Edit: also saw someone mention uninstalling the manufacturer's drivers for the monitor and using the Windows VGA ones. Done, as I'm willing to try anything new.

 

MEMOFLEX

No worries buddy, just thought it would be worth mentioning.

Have been re-reading through the comments and it is a head-scratcher, but I keep coming back to the PSU. This is based solely on a hunch, but you mentioned that the 12 V rail delivers less than what is recommended, and also that the issue only occurs when you OC your CPU.

My thinking is that the PSU may be getting maxed out by your overclocked components and the graphics card, so that in some way the card is being starved of power, either through the 12 V rail directly or through the power it draws from the board. I know this does not really make sense, as there should be a lower power draw at idle than at load, but some PSUs are incredibly temperamental. Unfortunately, PSUs only get worse over time, especially the cheap ones.

Could you give the details off the PSU sticker and I will look into it and see if I can find anything.

BTW : Have seen people using PCI-E frequencies as high as 110 with stable systems so you still have a little room on that front but I still think the PSU may be the culprit.
 

frenchy70

@MEMOFLEX: Yeah - the PSU is this one, reviewed favourably by quite a few publications: http://www.hardwaresecrets.com/article/OCZ-ModXStream-Pro-500-W-Power-Supply-Review/973/1

It has 18 A on each of the two 12 V rails. I think ATI recommend slightly more, but plenty of people told me it would be fine when I bought it. Can't find the reference now, but I think I read that AMD specify 40 A on the 12 V rails and mine should deliver 36 A.
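
To put rough numbers on the rail figures above, here is a small Python sketch. It assumes the two 12 V rails can simply be added together, which is optimistic (many multi-rail PSUs have a lower combined limit), and it treats the 40 A figure as the unverified number recalled in the post, not as an official spec. All names are illustrative.

```python
RAIL_VOLTAGE_V = 12.0
RAIL_CURRENTS_A = [18.0, 18.0]  # two 12 V rails, as quoted in the post
RECALLED_SPEC_A = 40.0          # figure recalled from memory in the post, unverified

# Assumption: rail currents add up with no lower combined limit.
total_current_a = sum(RAIL_CURRENTS_A)
total_power_w = total_current_a * RAIL_VOLTAGE_V
shortfall_a = RECALLED_SPEC_A - total_current_a

print(f"Combined 12 V current: {total_current_a:.0f} A")
print(f"Combined 12 V power:   {total_power_w:.0f} W")
print(f"Shortfall vs recalled figure: {shortfall_a:.0f} A")
```

These are steady-state capacity numbers only; they say nothing about how the PSU copes with rapid changes in load.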

What I don't understand is why the PSU delivers plenty of juice when the CPU is at 4 GHz. I've even overclocked the GPU by 20% at the same time as the CPU and run them both at full load to stress test them together. I've taken the overclock off the GPU now while trying to fix the CPU overclock problem. The system doesn't behave any differently with or without the GPU overclock.

Does the rails' amperage affect the system's power requirements at idle? Personally, I've never seen a problem with the PSU, but I am more than willing to accept it could be the problem.

Thanks for your help.

 

frenchy70

Yeah - I have read that. But I've also read that plenty of people have upped it a little with no problem. As I said, I'm back at 101 - no BSOD so far. Just the "display driver stopped responding" (this happens only when I overclock, so I'm not entirely sure if it's a separate issue or not).

 

MEMOFLEX

Right. Have looked at the PSU reviews etc. and you are right, it has good reviews from numerous sites, so will put that to one side for a moment.

Have been re-reading your original post to try and fully get my head around this. From what I gather, you have no issues at stock clocks or at a static 4 GHz, but only when you re-enable power saving and Turbo. If that is the case, is it possible to enable just Turbo but not power saving, or vice versa, and then run to see if you get the crashes? Whether you do or don't, make a note, then switch over, test again and note what happens, to see if it is one setting or the other. Not very technical, I know, but it is the only method I find that can eliminate certain issues. That is, if what I have said is possible.
 

frenchy70

@MEMOFLEX - well, I did try turning off the C3/C6 C-states, but then I can't enable Turbo. I also tried turning off EIST and leaving C1E on, and then vice versa, but that didn't help. However, if I turn both of them off, I'm pretty sure I won't get any power saving and will just be running at max.

I have to say though, testing is looking encouraging with PCIe freq at 101. I left the PC on overnight and no BSOD, just one "display driver stopped responding", which again recovered fine.
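
The one-setting-at-a-time approach suggested above can be laid out as a checklist. This is only a sketch: the setting names are the BIOS options mentioned in this thread, the rule that Turbo needs C3/C6 enabled is what was observed on this board, and the helper names are hypothetical.

```python
from itertools import product

# BIOS options mentioned in this thread.
SETTINGS = ("EIST", "C1E", "C3/C6", "Turbo")


def is_selectable(config):
    """Observed on this board: Turbo cannot be enabled with C3/C6 disabled."""
    return config["C3/C6"] or not config["Turbo"]


# Every on/off combination of the four settings.
all_configs = [
    dict(zip(SETTINGS, values))
    for values in product((False, True), repeat=len(SETTINGS))
]
valid_configs = [config for config in all_configs if is_selectable(config)]

# Print a checklist to fill in by hand after each idle test run.
for number, config in enumerate(valid_configs, start=1):
    enabled = [name for name, is_on in config.items() if is_on] or ["none"]
    print(f"Run {number:2d}: enable {', '.join(enabled)}; result: ____")
```

Filling in the result column after each idle run should show whether the crashes follow one setting or only appear with a particular combination.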

My latest guess is that I have 2 problems when overclocking: the BSOD and the driver time-out error. Hopefully the BSOD is fixed by the PCIe freq bump (to lock the value down).

Maybe I can go back and re-try your idea to disable EIST or C1E to see if that helps with the GPU driver time-out error.

At the moment, I've still yet to try bumping idle 2D clocks/volts on the GPU, but before I do that, I'm testing a bump to the PLL suggested by Pinhedd (as that seemed to be related to the PCIe too, and I'm wondering if it will help with the GPU time-out error).

Thanks again and I'll update after the next few hours of testing.
 

frenchy70

SOLVED: I don't know how to mark the thread officially solved so I changed the thread title and am posting this update.

It seems that changing the PCI Express Frequency from 100 or Auto, to 101, has solved the BSOD problem. My PC has been on for 36 hours straight, idling when doing nothing, light load when surfing, and all cores ramping up and down under the various stress tests.

Weird thing is, and it's a bit embarrassing to admit it, but I already had the answer hidden in a response to an old thread I made in 2010. Bilbat replied with some suggested settings for an overclock on my Gigabyte board, and he just happened to offer PCIe freq: 101. It wasn't discussed at all, it was just there, so I didn't really pick up on the relevance of it.

Before I found that reply from Bilbat, I'd been searching for a solution everywhere and the PCIe to 101 thing is either to do with a glitch on my board (and that is the workaround) or related to some vague talk in an Anandtech article where they discuss the PCIe controller being on the processor of the i5 750 and therefore, maybe it needs to increase BCLK in multiples of 133. Anyway, very weird, but now I'm happy.

Thanks for the help I received and I hope that this might be of help to someone else (although I doubt it as everyone seems to either have i5 3570k or i5 2500k :D)
 


Glad to hear it. If you want to mark the thread as solved, just select one of the responses as the best solution.

I will caution you again though to get a better power supply than that crummy OCZ ModXStream.