VRAM problems? Any way to fix them?

Status
Not open for further replies.

Istvankszk

Commendable
Jun 2, 2016
12
0
1,510
Recently one of my friends gave me his old R9 295X2 because it was giving him BSODs. I cleaned the card properly and redid the thermal paste/pads.

After having the card for about 3 days, it crashed for the first time while loading a game.
Video of the crash: www.youtube.com/watch?v=3V6iithwzsI
After that, it kept crashing at random: under load, under light load (YouTube), and with no load.

Last night I found a program called memtestCL and ran it to see if the problem was with the card's memory. Sure enough, it reported a few errors in almost every test.
[image: unknown.png]

(the "random blocks" test always shows loads of errors on any card, so I think it's a bug)
(left card is the primary, right side is the secondary card)
(also, the test on the primary card took way longer)
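For context, a memory test of this kind can be sketched roughly as follows. This is not memtestCL's actual code: it runs on an ordinary Python buffer instead of VRAM, and `StuckBitMemory`, `pattern_test` and `run_tests` are hypothetical names made up for illustration. It only shows the general idea of writing a known pattern, reading it back and recording the offsets that differ.

```python
from typing import Dict, List


class StuckBitMemory(bytearray):
    """Hypothetical faulty memory: one bit at one offset is stuck at 1."""
    bad_offset = 10
    stuck_mask = 0x04

    def __setitem__(self, index, value):
        # Simulate the defect: the stuck bit is forced high on every write.
        if index == self.bad_offset:
            value |= self.stuck_mask
        super().__setitem__(index, value)


def pattern_test(buf, pattern: int) -> List[int]:
    """Fill the buffer with `pattern`, read it back, return mismatching offsets."""
    for i in range(len(buf)):
        buf[i] = pattern
    return [i for i in range(len(buf)) if buf[i] != pattern]


def run_tests(buf) -> Dict[int, List[int]]:
    """Run a few classic bit patterns together with their inverses."""
    return {p: pattern_test(buf, p) for p in (0x00, 0xFF, 0x55, 0xAA)}


errors = run_tests(StuckBitMemory(64))
for pattern, offsets in errors.items():
    print(f"pattern 0x{pattern:02X}: {len(offsets)} error(s) at {offsets}")
```

A bit that is stuck high only shows up under patterns that try to write 0 to it, which is why tests like this use each pattern together with its inverse.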


I think it only crashes when Windows tries to access the bad areas. Usually all 3 of my screens freeze, even the ones plugged into my onboard graphics, but sound continues to play and, as far as I can tell, everything else keeps running as well. The fan on the card goes into a "default" state (same as in the BIOS or without the drivers installed) and then switches between that and normal a few times (it's probably trying to re-initialise the drivers).

What I tried so far:

  • loosened the screws on the backplate so it doesn't put pressure on the RAM modules - no effect
  • tightened the screws - no effect
  • dumped rubbing alcohol on the whole card and let it sit/dry for a few hours - no effect
  • set the Windows timeout detection and recovery (TDR) to max (8) - no effect (it just takes longer to reboot automatically)
  • underclocked the RAM - no effect
  • overclocked the RAM - no effect (no idea what I was expecting)
  • reinstalled drivers/Windows, tried a different slot, etc. (basically the usual troubleshooting procedures)
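For anyone wanting to repeat the TDR step, here is a minimal sketch that builds a .reg file for it. It assumes the "8" above refers to the `TdrDelay` value (a DWORD, in seconds) under the `GraphicsDrivers` registry key, which the post does not state; `tdr_reg_file` is a hypothetical helper name. The script only produces the text and does not touch the registry itself.

```python
# Registry key that holds the Windows TDR settings.
TDR_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_reg_file(delay_seconds: int) -> str:
    """Return the text of a .reg file that sets TdrDelay to `delay_seconds`."""
    if delay_seconds < 0:
        raise ValueError("delay must be non-negative")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{TDR_KEY}]",
        # ASSUMPTION: the "8" in the post is TdrDelay, in seconds.
        f'"TdrDelay"=dword:{delay_seconds:08x}',
        "",
    ]
    return "\r\n".join(lines)


reg_text = tdr_reg_file(8)
print(reg_text)
```

Saving the output as a .reg file and importing it is one way to apply the value. As the list above notes, a longer delay only postpones the reset and does not address the underlying fault.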

I have a few other ideas that might work, but I don't know how I would go about any of them.

  • make Windows ignore the error and just reload the driver (everything else continues running, so maybe somehow force-reload the driver with a hotkey?)
  • switch the roles of the 2 GPUs on the 295X2 (the second one seems to be working perfectly)
  • add a second card (I have a 1030 and an RX 460), connect the monitors to that, and somehow pass the rendered frames from the 295X2 to it?
  • somehow fix the RAM itself (it doesn't seem to be a connection issue, so baking is probably useless)
  • put in my second card (a Sapphire R9 290) as primary and just hope that with 3-way CrossFire the problem gets reduced to just artifacts every now and then

The card itself runs perfectly, and there are no visual clues of the RAM failing, so I have no idea what to do with it. I can't RMA it because I got it from a friend and he bought it about 4 years ago. I still have my original 290, so I can switch back if I need to.

This is my current PC configuration, but I think this is an issue with the card itself, so it's not that relevant:
Asus H97 Pro Gamer mobo with an i5 4690 (non-K)
16 GB of 1600 MHz RAM, a 240 GB Samsung SSD and 2 HDDs for data
an EVGA 1000GQ PSU (eco mode is off atm)
MSI R9 295X2 (and currently an RX 460 so I can write this without my PC rebooting)
some random case and a bunch of cooling fans
 

Istvankszk

I know it's not good forum etiquette to reply to your own post, but here are the steps I took to (finally) fix my card.


step 1: check for actual hardware defects
I found a small SMD capacitor close to the PCI-E connector that had broken off, and I soldered it back on. Not sure if this had any effect.


step 2: set the card to PCI-E X16@2.0
This is the most important part. When PCI-E 3.0 runs at max bandwidth (e.g. loading textures into memory, running without an fps limit), it seems to crash(?) the PLX chip for some reason and hence disconnect it while running, giving you a BSOD. You shouldn't lose that much performance in games (1440p) and rendering with the card running at 2.0.
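To put the 2.0 vs 3.0 difference in numbers, here is a small back-of-the-envelope calculation of the theoretical per-direction bandwidth of an x16 link. The transfer rates and line encodings are the standard PCI-E figures (5 GT/s with 8b/10b for 2.0, 8 GT/s with 128b/130b for 3.0); the function name is made up for this sketch, and real-world throughput will be lower than these theoretical values.

```python
def pcie_bandwidth_gb_s(gt_per_s: float, payload_bits: int,
                        total_bits: int, lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth of a PCI-E link in GB/s."""
    # Usable bits per second per lane after line-encoding overhead.
    usable_gbit_per_lane = gt_per_s * payload_bits / total_bits
    # Convert bits to bytes, then scale by the lane count.
    return usable_gbit_per_lane / 8 * lanes


gen2 = pcie_bandwidth_gb_s(5.0, 8, 10)      # PCI-E 2.0, 8b/10b encoding
gen3 = pcie_bandwidth_gb_s(8.0, 128, 130)   # PCI-E 3.0, 128b/130b encoding

print(f"PCI-E 2.0 x16: {gen2:.2f} GB/s")
print(f"PCI-E 3.0 x16: {gen3:.2f} GB/s")
print(f"2.0 is {gen2 / gen3:.0%} of 3.0")
```

So dropping to 2.0 roughly halves the theoretical link bandwidth. That mostly matters during bulk transfers such as texture loading, which are the same situations that were triggering the crashes.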


Bonus things
Do not use MSI Afterburner with the latest version of Windows 10 and the AMD 2019 driver. It can cause crashes.

Get the SDK package for the PLX PEX 8747 chip, which connects the 2 GPUs to the PC. You'll want to start the program called PLX GenMon and enable logging. Some of the other programs in the package offer other useful debug functions.
[image: unknown.png]


Add a capacitor (15 V, 220 µF works fine) to the fan on the water cooler. This makes the pumps run much smoother and can stop the card from thermal throttling.

You can also use Clock Blocker and Hawaii BIOS editor if you want to stop throttling.


Well, that's what I did. As a test, I've been running Blender for 2 days now without a crash. Previously, VRAM-intensive renders would crash the card after 1-2 rendered frames.
 