Computer reboots when gaming without BSOD

Terrikus

Prominent
Apr 30, 2017
Computer specs:
i7-5930K @ 4.2 GHz
GTX 1080 Ti
16 GB Corsair Vengeance 2666 MHz (4x4 GB)
250 GB Samsung EVO SSD
ASUS X99 Deluxe motherboard
Corsair HX1000i 1000 watt PSU

I've been having an issue recently where I'd be playing a game like Overwatch or PUBG and I'd get a half second freeze and the computer would restart. Didn't happen all that often, but enough to make me want to take a look at things.

First thing I did was check Event Viewer, where I saw the standard "Event 41", which tells me nothing other than that the power went out. Then I checked BlueScreenView, and it didn't have any logs (and there was no BSOD to begin with).
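For anyone who wants to pull those entries without clicking through Event Viewer, here is a minimal sketch that shells out to Windows' built-in `wevtutil` tool. The function name `kernel_power_query` and the default count of 5 are made up for illustration, and the filter matches on event ID only, so it could in principle pick up an unrelated provider that also uses ID 41.

```python
import subprocess
import sys


def kernel_power_query(event_id: int = 41, count: int = 5) -> list[str]:
    """Build a wevtutil command listing the newest System-log events with this ID.

    Hypothetical helper for illustration; only the wevtutil flags are real.
    """
    xpath = f"*[System[(EventID={event_id})]]"
    return [
        "wevtutil", "qe", "System",
        f"/q:{xpath}",   # XPath filter on the event ID
        f"/c:{count}",   # limit the number of events returned
        "/rd:true",      # read newest events first
        "/f:text",       # human-readable text output
    ]


# Only attempt to run the query on Windows, where wevtutil exists.
if __name__ == "__main__" and sys.platform == "win32":
    result = subprocess.run(kernel_power_query(), capture_output=True, text=True)
    print(result.stdout or result.stderr)
```

The timestamps are the useful part here: lining them up with what was running at the time can hint at whether the resets follow load.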

Next thing I did was run Memtest86. It was an older version, and it ran for 6 passes and had two errors:

T8ptvRr.jpg



Well, believing that any errors in Memtest86 are bad, I thought I had my answer. I also used compressed air to spray out the case and reapplied thermal paste to the CPU.

To be sure, I ran a stress test in AIDA64 with CPU/FPU/Cache/System Memory. It failed after a couple of minutes. Then I ran a stress test with CPU only. It ran fine with no errors for two hours. Then I ran it with system memory only and it failed after two minutes.

So I opened up the case, removed two RAM sticks from one side, and ran the AIDA64 system memory test again. I got no failure after 10 minutes. I put those sticks back and took out the other two and ran the test again. No failure after 10 minutes. So I put all 4 sticks back in and ran the test again thinking I'd get a failure after 2 minutes like I did before. It ran 10 minutes with no issue.

These are the temps and voltages during the AIDA64 stress test (which I reran before this post):

5s7WJyK.png


Finally, I decided to download the latest version of Memtest86 (7.5, I believe) and set it to run for 10 passes. About 6 hours and 120 tests later, I got 0 errors.

So now I'm a little lost. Could the RAM still be faulty? Am I looking at a PSU issue? Something else? What else should I be looking at or trying?
 
Solution
Two things can cause an instant shutdown without notice...either the power supply shut down after detecting some error condition, or you had a triple hardware exception (https://en.wikipedia.org/wiki/Triple_fault).

Heat could explain either one.

When a CPU detects certain errors, it runs an error handler, and logging can occur. If an error occurs inside that error handler, logging might still occur. If yet another error occurs while handling that one, the CPU refuses to continue, because everything is so badly scrambled that carrying on could destroy things; it is safer to just stop, and the system resets or powers off with nothing logged. Some issues with RAM could cause this even if the CPU is good...it just isn't common for RAM to cause a triple exception without something else showing up prior to reaching such a sad state. I tend to believe the power supply is the more likely issue, but I've seen both.
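To make the escalation concrete, here is a toy model in Python. It is only a sketch of the idea; a real CPU does this in hardware, and the names `Fault` and `run_with_fault_escalation` are invented for this example.

```python
class Fault(Exception):
    """Stands in for a hardware exception in this toy model."""


def run_with_fault_escalation(task, fault_handler, double_fault_handler):
    """Run task; on a fault try the handler, then the double-fault handler."""
    try:
        return task()
    except Fault:
        pass  # first fault: hand over to the normal error handler
    try:
        return fault_handler()
    except Fault:
        pass  # the handler itself faulted: this is the "double fault" case
    try:
        return double_fault_handler()
    except Fault:
        # Third failure in a row: no code left that can be trusted to log anything.
        return "triple fault: reset, nothing logged"


def broken():
    raise Fault()


def working_handler():
    return "handled and logged"


print(run_with_fault_escalation(broken, working_handler, broken))
print(run_with_fault_escalation(broken, broken, broken))
```

The second call is the case that matches a reboot with no BSOD and no dump: every layer that could have written a log failed before it got the chance.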

With multichannel RAM (dual channel, or quad channel on a platform like X99), you may find each individual RAM stick works great by itself. In single channel operation, two sticks may still work fine together. When using multiple RAM sticks together you will often find they are configured to interleave...something like RAID 0, where performance is boosted by reading or writing part of the content to each stick at the same time. When the RAM goes multichannel and starts to interleave for performance, a single timing is used for all RAM sticks...the sticks must be matched and must all work together at that timing. If they differ in any way, even though each one is otherwise working, you can still get failures. If your BIOS can turn off the interleave, you might try that as a test with Memtest86. Even RAM from the same manufacturer, of the same model and same rating, might differ slightly if not sold together as a matched kit.
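A rough sketch of why interleaving ties the sticks together: consecutive chunks of memory are spread round-robin across the channels, so almost any access touches every stick. The 64-byte interleave unit and the simple modulo mapping are assumptions made for illustration; real memory controllers use more elaborate address mappings.

```python
CACHE_LINE = 64  # bytes per interleave unit (assumed for this sketch)


def channel_for(address: int, channels: int) -> int:
    """Return the channel that holds this byte address under round-robin interleave."""
    return (address // CACHE_LINE) % channels


def channels_touched(start: int, length: int, channels: int) -> set[int]:
    """Return every channel hit when reading `length` bytes starting at `start`."""
    first_line = start // CACHE_LINE
    last_line = (start + length - 1) // CACHE_LINE
    return {line % channels for line in range(first_line, last_line + 1)}


# A 256-byte read with four interleaved channels lands on all four of them.
print(channels_touched(0, 256, channels=4))
# With interleave off (modelled here as a single channel) it stays in one place.
print(channels_touched(0, 256, channels=1))
```

So with interleave on, one marginal stick or slot can corrupt data belonging to almost anything, which fits a pattern where each pair passes alone but the full set fails.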

Someone also mentioned turning up the voltage on the RAM by a tiny amount...this can add stability (along with heat) at any timing. I would first try turning off interleave, and if this helps, you could try turning up the voltage as suggested and re-enable interleave.

Loosely related to this, and not necessarily likely, are brownouts. When the voltage on the power line decreases by an amount too small to actually cause a shutdown, some voltages inside the computer will still drop slightly. One result is corrupted data in places like RAM. A brownout is somewhat analogous to purposely lowering RAM voltage. A UPS with brownout protection is very nice when testing for issues in a long test such as Memtest86.

Replacing the CPU thermal compound was a very good idea. However, there may be other places where heat is still an issue, e.g., the power supply having dust in it. It is also possible that at high load levels the power supply just can't handle that much power continuously.

You might have a combination of problems as well.
 