Nvidia RTX 5090 virtualization reset bug prompts $1,000 reward for a fix — cards become completely unresponsive and require a host reboot; RTX PRO 6000 also affected

GeForce RTX 5090 32GB Professional
(Image credit: AFOX)

Nvidia’s new RTX 5090 and RTX PRO 6000 GPUs are reportedly plagued by a reproducible virtualization reset bug that can leave the cards completely unresponsive until the host system is physically rebooted.

CloudRift, a GPU cloud provider, published a detailed breakdown of the issue after encountering it on multiple Blackwell-equipped systems in production. The company has even issued a $1,000 public bug bounty for anyone able to identify a fix or root cause.

Reset bug bricks Blackwell

According to CloudRift’s logs, the bug occurs after a GPU has been passed through to a VM using KVM and VFIO. On guest shutdown or GPU reassignment, the host issues a PCIe function-level reset (FLR), which is a standard part of cleaning up a passthrough device. But instead of returning to a known-good state, the GPU fails to respond: “not ready 65535ms after FLR; giving up,” the kernel reports.
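
To make that sequence concrete, here is a minimal Python sketch (not CloudRift’s code) of what the host-side cleanup boils down to at the Linux sysfs level: querying which reset mechanisms the kernel believes the card supports, then requesting a function reset. The PCI address is a placeholder, and the script assumes a recent kernel and root privileges.

```python
#!/usr/bin/env python3
"""Sketch: inspect and trigger the same kind of PCI function-level reset (FLR)
a host performs when tearing down a VFIO passthrough device."""
from pathlib import Path

BDF = "0000:01:00.0"  # hypothetical bus/device/function of the GPU
dev = Path("/sys/bus/pci/devices") / BDF

# Which reset mechanisms the kernel reports for this device
# (modern GPUs typically list "flr"). Available on recent kernels.
print("reset_method:", (dev / "reset_method").read_text().strip())

# Writing "1" to the reset node asks the kernel to reset the function,
# roughly what happens on guest shutdown or GPU reassignment.
# On an affected Blackwell card, this is reportedly where the device
# stops responding until the whole host is power-cycled.
(dev / "reset").write_text("1")
print("reset issued; check dmesg for 'not ready ... after FLR' messages")
```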

At this point, the card also becomes unreadable to lspci, which throws “unknown header type 7f” errors. CloudRift notes that the only way to restore normal operation is to power-cycle the entire machine. Tiny Corp, the AI start-up behind tinygrad, brought attention to the issue by reposting CloudRift’s findings on X.com with a blunt question: “Do 5090s and RTX PRO 6000s have a hardware defect? We’ve looked into this and can’t find a fix.”
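
For those curious how that failure state shows up programmatically, the following is a minimal Python sketch (again, not CloudRift’s tooling) that reads the card’s PCI config space directly. A device that has fallen off the bus returns all 0xFF bytes, which is exactly what lspci renders as “unknown header type 7f.” The PCI address is a placeholder.

```python
#!/usr/bin/env python3
"""Sketch: detect the "fell off the bus" state by reading PCI config space."""
from pathlib import Path

BDF = "0000:01:00.0"  # hypothetical address of the affected GPU
cfg = Path("/sys/bus/pci/devices") / BDF / "config"

header = cfg.read_bytes()[:64]  # standard configuration header
if all(b == 0xFF for b in header):
    # All-ones reads mean the device is no longer responding on the bus;
    # lspci masks the header-type byte and prints "unknown header type 7f".
    print(f"{BDF}: config space reads as all 0xFF -- device unresponsive,")
    print("a full host power cycle is likely required")
else:
    vendor = int.from_bytes(header[0:2], "little")
    device = int.from_bytes(header[2:4], "little")
    print(f"{BDF}: responding, vendor=0x{vendor:04x} device=0x{device:04x}")
```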

Other users confirm similar failures

Threads across the Proxmox forums and the Level1Techs community suggest that home users and other early RTX 5090 adopters are running into the same behavior.

In one case, a user reported a complete host hang after a Windows guest was shut down, with the GPU failing to reinitialize even after an OS-level reboot. In another case, a user said, “I found my host became unresponsive. Further debugging shows that the host CPU got soft lock [sic] after a FLO timeout, which is after a shutdown of LinuxVM. No issue for my previous 4080.”

Several users confirm that toggling PCIe ASPM or ACS settings does not mitigate the failure. No issues have been reported with older cards such as the RTX 4090, suggesting that the bug may be limited to Nvidia’s Blackwell family.

FLR is a critical feature in GPU passthrough configurations, allowing a device to be safely reset and reassigned between guests. If FLR is unreliable, then multi-tenant AI workloads and home lab setups using virtualization become risky, particularly when a single card failure takes down the entire host.
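
As an illustration of why a hung FLR is so disruptive, here is a hypothetical Python sketch of the unbind / reset / rebind cycle a host typically runs through the standard Linux sysfs interface when moving a passthrough GPU to a new guest. If the reset step never completes, everything after it, and potentially the host itself, is stuck. The PCI address is once again a placeholder.

```python
#!/usr/bin/env python3
"""Sketch: driver unbind / reset / rebind cycle used when reassigning
a passthrough GPU between guests via the Linux sysfs PCI interface."""
from pathlib import Path

BDF = "0000:01:00.0"  # hypothetical GPU address
dev = Path("/sys/bus/pci/devices") / BDF

# 1. Detach the card from whatever driver currently owns it (e.g. vfio-pci).
if (dev / "driver").exists():
    (dev / "driver" / "unbind").write_text(BDF)

# 2. Reset the function so the next guest sees a clean device.
#    This is the FLR step that reportedly never completes on affected cards.
(dev / "reset").write_text("1")

# 3. Hand the card to vfio-pci for the next VM.
(dev / "driver_override").write_text("vfio-pci")
Path("/sys/bus/pci/drivers_probe").write_text(BDF)
print(f"{BDF} reassigned to vfio-pci")
```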

Nvidia has not yet officially acknowledged the issue, and there is no known mitigation at the time of writing.


Luke James
Contributor

Luke James is a freelance writer and journalist. Although his background is in law, he has a personal interest in all things tech, especially hardware and microelectronics, and anything regulatory.

  • chaz_music
    This could easily be related to the GPU power-quality issues found earlier with the power cable problem. I posted about the cable-burning problem here in this TH story:

    https://forums.tomshardware.com/threads/asrocks-40-16-pin-power-cable-has-overheating-protection-to-prevent-meltdowns-—-a-90-degree-design-ensures-worry-free-installation.3885286/
    Also, there exists a huge ground loop in the ATX PSU structure and the GPU/PCIe power solution. If the GPU cable's return conductors start going high resistance, the GPU's return current finds a path through the PCIe connector and flows back to the PSU unintentionally through the ATX power connector on the motherboard. When this happens, the high-frequency noise in the GPU power imparts common-mode noise on the PCIe signaling and can cause many logic issues on both the GPU and motherboard sides. So it could also be that the motherboard's PCIe bus is getting latched somewhere, and not the GPU logic.

    At these current levels (GPU supply currents upwards of 50 A), I could see how a high PSU di/dt event could also latch the logic by causing a negative voltage. A brief excursion below ground can trigger the protection diodes in the I/O ICs, which can in turn trigger the substrate diodes in various ICs and wreak havoc on the logic.

    If one of these is the actual issue, I could see how only a full power-down (cold reset) would get the GPU back to a known state. Power-quality problems and voltage dips can cause a brownout and leave the board and/or bus logic stuck in an improper state that requires the voltage to be removed for a clean reset. I have seen this in many embedded designs; it can also indicate an issue with the reset circuitry (brownout/blackout/system watchdog).