Frequent crashes with red health LED flashing during startup and gaming

buandboo

Prominent
Aug 10, 2017
5
0
520
Specs:
Motherboard - DL160 G6
RAM - 7x2GB UDIMM DDR3 ECC
CPUs - 2xE5620
Power supply - 460W
GPU - GTX 650 Ti BOOST

For about a week, I have been experiencing frequent crashes and errors while using my server. Almost every time I turn it on, it reboots or hangs on the Windows splash screen, and only boots properly after one or two more attempts. Moreover, while playing games such as GTA5 or, more recently, Minecraft, either the graphics driver or the game itself crashes at random moments.

The whole server often hangs at such times too and turns on the red health LED. The System Event Log either shows no error or, when it does show one, it is always an IOH_NMI_DETECT Bus Uncorrectable Error.

I suspect the GPU or the cables connecting it to the motherboard, because they are not set up in the intended way. The GPU did not fit into the PCIe cage, so I had to use a 20 cm long PCIe extender and place the GPU outside the case.

Strangely, the crashes do not occur during GPU stress tests.

I have tried reinstalling and updating the graphics driver, installing a fresh system on a different hard drive, removing one of the CPUs to rule out excessive power consumption, and placing the GPU in another server with similar specs and the same GPU setup. Nothing gets rid of the problem.

Does anyone have any ideas as to what might be the cause of these problems?
 

buandboo



As far as I know, RAM issues have their own set of error names, so RAM should not be the problem. However, I will still try replacing it with different RAM first thing tomorrow, just to rule out every possibility.

Last time I checked, the system worked fine without the GPU as well as with a different one in its place.
 

buandboo



I guess you are right. But I would like to know whether the card is dying or whether it is just an incompatibility problem. Unfortunately, I have no way to test the card in a desktop PC rather than a server, which would probably help determine which of the two is the case.

Could you tell which one is more likely to be the issue?

I still find it hard to understand how a dying card could easily handle all the more demanding stress tests, yet crash during less demanding games.
 


If the card used to work, what would it be incompatible with all of a sudden? Did you change something?
If it worked in the system before and nothing was changed, and other video cards work, I don't see how it can be anything other than a failing video card. Without testing it in another system there is no way to confirm things.
 

buandboo


Actually, I did change quite a lot.
In a short period of time I added a second CPU and four additional RAM modules, not to mention the GPU.
And I must confess that I did not thoroughly test whether things worked fine after every change I made.



Other video cards used to work, but not any more.

Since the 650 Ti Boost GPU started failing, I switched to an older card, a 9600GT, which I had previously used to confirm that other cards do not cause crashes.
Shortly after I started the system, the server crashed and the health LED flashed red, right after I had launched a game and opened an internet browser.
After restarting, the red health LED turned on again on the Windows splash screen, just as with the other card, and the server rebooted itself.

This had never happened before, and now the symptoms are exactly the same as with the 650 Ti Boost.
It surely cannot be that both of them started failing.
 


If you made other changes and now no add-in video cards work, those other changes are likely the cause: maybe a bad motherboard slot, maybe a power supply issue, maybe one of those new RAM modules or even the CPU. A new CPU would raise the power load in the system by a decent amount. If the video cards work in other systems, it's not the cards.

It's not easy to troubleshoot with incomplete information, such as not knowing about changes made to the system before the issue started. The video cards need to be tested in known good systems that can run them; past that, there are quite a few things to look at with the system itself.
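
To put rough numbers on the power-load point above, here is a minimal sketch of a power-budget estimate. Only the 460 W supply rating and the component counts come from this thread; every per-component wattage below is an assumed placeholder, and `estimate_load` is a hypothetical helper, so treat the result as a sanity check rather than a measurement.

```python
# Rough power-budget estimate for the server described in this thread.
# Only PSU_WATTS and the component counts come from the posts; every
# per-component wattage is an ASSUMPTION for illustration.
PSU_WATTS = 460

ASSUMED_DRAW_WATTS = {
    "cpu": 80,               # assumed draw per CPU under load
    "dimm": 4,               # assumed draw per RAM module
    "gpu": 134,              # assumed draw of the add-in GPU under load
    "board_and_drives": 60,  # assumed baseline for board, fans, drives
}


def estimate_load(cpus: int, dimms: int, gpus: int) -> int:
    """Return the estimated total draw in watts for a given configuration."""
    d = ASSUMED_DRAW_WATTS
    return (
        cpus * d["cpu"]
        + dimms * d["dimm"]
        + gpus * d["gpu"]
        + d["board_and_drives"]
    )


# Before the upgrades: one CPU, three modules, no add-in GPU.
before = estimate_load(cpus=1, dimms=3, gpus=0)
# After the upgrades: two CPUs, seven modules, one GPU.
after = estimate_load(cpus=2, dimms=7, gpus=1)

print(f"before: {before} W, after: {after} W, headroom: {PSU_WATTS - after} W")
```

A total under the supply rating does not rule out a power problem: it says nothing about how much current an individual slot, cable, or rail can deliver to the card.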
 

buandboo



I am sorry for the late reply, but I wanted to do some thorough testing before writing again and did not have the time until now. I removed the new components from the server and then added them back one at a time, testing after each one. Throughout all of this testing I also used a different cable to plug in the GPU: a PCIe 1x instead of a 16x. When I switched back to the PCIe 16x and started testing again, the crashes returned.

Here are some pictures which might help visualize how the two differ from each other:
PCIe 16x:
[image: 16x to 16x flexible riser card]


PCIe 1x:
[image: PCIe 1x to 16x powered USB 3.0 extender riser card adapter]


The previous GPU adapter (16x) only supplied the 12V rail to the GPU, which had to rely on 5V and 3.3V from the motherboard. The new adapter (1x) supplies all rails, 12V as well as 5V and 3.3V, directly to the GPU. Since it is unlikely that the crashes depend on the type of PCIe link used (16x vs 1x), the different source of the power rails must be the cause.

So it turns out it was a power problem after all. The motherboard was probably not supplying the GPU with enough power or the power was not stable enough. Fortunately the new adapter solved this issue.

Sorry for not giving complete information about all the changes I had made. It would have been much easier to solve the problem together if I had done so in the first place. Nevertheless, thank you for your support.
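
For anyone repeating this kind of isolation, the add-one-part-at-a-time testing described above can be sketched as a small loop. This is only an illustration: `first_unstable_addition`, `demo_is_stable`, and the component labels are hypothetical names, and in practice the stability check is a manual stress-test session rather than a function call.

```python
from typing import Callable, List, Optional, Sequence


def first_unstable_addition(
    baseline: Sequence[str],
    additions: Sequence[str],
    is_stable: Callable[[List[str]], bool],
) -> Optional[str]:
    """Add parts one at a time and return the first one that breaks stability.

    Returns None if every configuration passes the stability check.
    """
    config = list(baseline)
    if not is_stable(config):
        raise ValueError("baseline configuration is already unstable")
    for part in additions:
        config.append(part)
        if not is_stable(config):
            return part
    return None


def demo_is_stable(config: List[str]) -> bool:
    # Hypothetical stand-in for a manual test session: pretend the system
    # only crashes once the GPU is attached through the x16 riser.
    return "GPU on x16 riser" not in config


suspect = first_unstable_addition(
    baseline=["CPU 1", "original RAM"],
    additions=["CPU 2", "extra RAM", "GPU on x16 riser"],
    is_stable=demo_is_stable,
)
print(suspect)
```

Testing after every single addition is what makes the result meaningful. If several parts are added at once, a failure cannot be attributed to any one of them.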
 
Solution