Date: 2025-09-07

Nvidia’s new RTX 5090 and RTX PRO 6000 GPUs are reportedly plagued by a reproducible virtualization reset bug that can leave the cards completely unresponsive until the host system is physically rebooted.

CloudRift, a GPU cloud provider, published a detailed breakdown of the issue after encountering it on multiple Blackwell-equipped systems in production. The company has even issued a $1,000 public bug bounty for anyone able to identify a fix or root cause.

Reset bug bricks Blackwell

According to CloudRift’s logs, the bug occurs after a GPU has been passed through to a VM using KVM and VFIO. On guest shutdown or GPU reassignment, the host issues a PCIe function-level reset (FLR), which is a standard part of cleaning up a passthrough device. But instead of returning to a known-good state, the GPU fails to respond: “not ready 65535ms after FLR; giving up,” the kernel reports.
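For operators who want to watch for this failure, the kernel message quoted above is straightforward to match in a log. The following is a minimal Python sketch, not CloudRift's tooling: the `vfio-pci 0000:01:00.0:` prefix in the sample line is an assumed, illustrative device address, and `find_flr_timeouts` is a hypothetical helper name.

```python
import re

# Matches the kernel warning quoted in the article. The leading PCI address
# (domain:bus:device.function) is assumed to precede the message, as in
#   "vfio-pci 0000:01:00.0: not ready 65535ms after FLR; giving up"
FLR_TIMEOUT = re.compile(
    r"(?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"not ready (?P<ms>\d+)ms after FLR; giving up"
)


def find_flr_timeouts(log_text):
    """Return (pci_address, waited_ms) for every FLR timeout found in log_text."""
    return [
        (m.group("bdf"), int(m.group("ms")))
        for m in FLR_TIMEOUT.finditer(log_text)
    ]


# Illustrative input only: the device address here is made up.
SAMPLE_LOG = "vfio-pci 0000:01:00.0: not ready 65535ms after FLR; giving up\n"

if __name__ == "__main__":
    for bdf, ms in find_flr_timeouts(SAMPLE_LOG):
        print(f"{bdf}: no response {ms} ms after FLR")
```

In practice the input would come from the host's kernel log rather than a hard-coded string.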

At this point, the card also becomes unreadable to lspci, which throws “unknown header type 7f” errors. CloudRift notes that the only way to restore normal operation is to power-cycle the entire machine. Tiny Corp, the AI start-up behind tinygrad, brought attention to the issue by reposting CloudRift’s findings on X.com with a blunt question: “Do 5090s and RTX PRO 6000s have a hardware defect? We’ve looked into this and can’t find a fix.”
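The “unknown header type 7f” message is what lspci prints when configuration-space reads come back as all ones: the header type byte reads 0xFF, and masking off the multi-function bit leaves 0x7f. The Python sketch below shows one way to test for that condition from the raw header bytes. The function names are hypothetical, and the sysfs path is the usual Linux location, assumed here rather than taken from CloudRift's report.

```python
from pathlib import Path


def looks_fallen_off_bus(config):
    """Return True if config-space bytes look like all-ones (no device response)."""
    if len(config) < 16:
        raise ValueError("need at least the first 16 bytes of config space")
    vendor_id = int.from_bytes(config[0:2], "little")
    # Header type lives at offset 0x0E; bit 7 is the multi-function flag,
    # so tools mask it off before interpreting the layout.
    header_type = config[0x0E] & 0x7F
    return vendor_id == 0xFFFF and header_type == 0x7F


def read_config(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read the standard 64-byte header for a device such as '0000:01:00.0' (Linux)."""
    with open(Path(sysfs_root) / bdf / "config", "rb") as f:
        return f.read(64)
```

A call such as `looks_fallen_off_bus(read_config("0000:01:00.0"))`, with the address replaced by the real device, would flag a card in the state described above without parsing lspci output.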

Other users confirm similar failures

Threads across the Proxmox forums and Level1Techs community suggest that home users and other early adopters of the RTX 5090 are running into similar behavior.

In one case, a user reported a complete host hang after a Windows guest was shut down, with the GPU failing to reinitialize even after an OS-level reboot. In another case, a user said, “I found my host became unresponsive. Further debugging shows that the host CPU got soft lock [sic] after a FLR timeout, which is after a shutdown of LinuxVM. No issue for my previous 4080.”

Several users confirm that toggling PCIe ASPM or ACS settings does not mitigate the failure. No issues have been reported with older cards such as the RTX 4090, suggesting that the bug may be limited to Nvidia’s Blackwell family.

FLR is a critical feature in GPU passthrough configurations, allowing a device to be safely reset and reassigned between guests. If FLR is unreliable, then multi-tenant AI workloads and home lab setups using virtualization become risky, particularly when a single card failure takes down the entire host.
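For context, a PCIe function declares FLR support through a bit in its Device Capabilities register. The Python sketch below walks the capability list in raw configuration-space bytes and reports whether that bit is set. It is an illustration under assumptions: the helper name is hypothetical, it checks only the PCI Express capability, and reading past the first 64 bytes of a device's sysfs `config` file typically requires root. Advertising the bit says nothing about whether the reset actually completes, which is the failure described in these reports.

```python
PCI_STATUS = 0x06              # status register; bit 4 means a capability list exists
PCI_STATUS_CAP_LIST = 0x10
PCI_CAP_PTR = 0x34             # offset of the first capability
PCI_CAP_ID_EXP = 0x10          # PCI Express capability ID
PCI_EXP_DEVCAP = 0x04          # Device Capabilities, relative to the capability start
PCI_EXP_DEVCAP_FLR = 1 << 28   # Function Level Reset Capability bit


def advertises_flr(config):
    """Return True if the PCIe capability in config advertises FLR support."""
    if len(config) < 64:
        return False
    status = int.from_bytes(config[PCI_STATUS:PCI_STATUS + 2], "little")
    if not status & PCI_STATUS_CAP_LIST:
        return False
    ptr = config[PCI_CAP_PTR] & 0xFC
    seen = set()  # guards against a malformed, looping capability list
    while ptr >= 0x40 and ptr + 8 <= len(config) and ptr not in seen:
        seen.add(ptr)
        if config[ptr] == PCI_CAP_ID_EXP:
            start = ptr + PCI_EXP_DEVCAP
            devcap = int.from_bytes(config[start:start + 4], "little")
            return bool(devcap & PCI_EXP_DEVCAP_FLR)
        ptr = config[ptr + 1] & 0xFC  # follow the next-capability pointer
    return False
```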

Nvidia has not yet officially acknowledged the issue, and there is no known mitigation at the time of writing.

