Massive VRAM pools on AMD Instinct accelerators drown Linux's hibernation process — 1.5 TB of memory per server creates headaches

AMD AI servers
(Image credit: Moreh)

In a Linux patch series posted today, AMD engineer Samuel Zhang highlighted an unusual issue: Linux servers are failing to hibernate because of the sheer amount of VRAM spread across the many AMD Instinct accelerators in each system. For context, Instinct accelerators are powerful AMD GPUs designed specifically for data centers handling AI, high-performance computing, scientific workloads, and other demanding tasks.

Part of what makes these GPUs so powerful is the massive amount of VRAM they carry: 192GB on some models, which might sound huge to gamers but is fairly standard for modern data center chips. The AMD-based Linux AI server in question is equipped with eight Instinct cards, which brings total VRAM to around 1.5TB (8 × 192GB = 1,536GB). However, while more VRAM is generally a good thing, in cases like this it can lead to unexpected issues.

But while VRAM capacity does play a part, the root cause of the hibernation failure isn’t the number of Instinct cards, but rather how Linux handles GPU memory during the hibernation process. When the system initiates hibernation, all GPU memory is first offloaded to system RAM, typically through the Graphics Translation Table (GTT) or shared memory (shmem). From there, the kernel creates a hibernation image by copying all system memory content, which also includes the evicted VRAM, into a second memory region before writing it to disk.

Sounds confusing? In simple terms, if your server has 1.5TB of total VRAM, this duplication can push memory usage up to 3TB, which easily exceeds the capacity of a server equipped with only 2TB of system memory. The shortfall ultimately causes the hibernation process to fail.
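
The arithmetic behind that 3TB figure can be sketched in a few lines of Python. This is a back-of-the-envelope model that follows the description above, not the kernel's actual accounting: the function names are made up for illustration, it assumes every byte of VRAM is in use and gets evicted, and it treats 1TB as 1,024GB.

```python
GB_PER_TB = 1024  # assumption: binary convention; the article does not say which it uses


def hibernation_peak_gb(num_gpus, vram_per_gpu_gb, other_ram_in_use_gb=0):
    """Rough worst-case system RAM needed while the hibernation image is built."""
    # Step 1: all VRAM is evicted into system RAM (via GTT or shmem).
    evicted_vram_gb = num_gpus * vram_per_gpu_gb
    # Step 2: the kernel copies everything in RAM, evicted VRAM included,
    # into a second memory region before writing it to disk.
    image_source_gb = evicted_vram_gb + other_ram_in_use_gb
    image_copy_gb = image_source_gb
    return image_source_gb + image_copy_gb


def fits(peak_gb, system_ram_gb):
    """True if the worst-case peak fits in the installed system RAM."""
    return peak_gb <= system_ram_gb


peak = hibernation_peak_gb(num_gpus=8, vram_per_gpu_gb=192)
print(f"peak need: {peak / GB_PER_TB:.1f} TB")             # peak need: 3.0 TB
print("fits in 2 TB of RAM:", fits(peak, 2 * GB_PER_TB))   # fits in 2 TB of RAM: False
```

Passing a non-zero `other_ram_in_use_gb` to account for memory the host itself is using only widens the gap.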

Fortunately, Zhang has been working to address this hibernation issue with a patch series built around two main changes. The first reduces the amount of system memory needed during hibernation, which allows the process to succeed. However, doing so introduces a new issue: the "thaw" stage (which, in the kernel's hibernation sequence, follows the memory snapshot and brings devices back up so the image can be written out) could take nearly an hour due to the large amount of memory involved. To fix this, a further patch in the series skips restoring the evicted buffer objects during the thaw stage, significantly reducing that delay.

Now, most high-end AI servers run continuously, so it's fair to ask why anyone would hibernate them. One common reason is to reduce power consumption during downtimes and help stabilize the electrical grid. Since large-scale data centers consume massive amounts of power, this can help lower the risk of blackouts, like the one we recently saw in Spain.

Hassam Nasir
Contributing Writer

Hassam Nasir is a die-hard hardware enthusiast with years of experience as a tech editor and writer, focusing on detailed CPU comparisons and general hardware news. When he’s not working, you’ll find him bending tubes for his ever-evolving custom water-loop gaming rig or benchmarking the latest CPUs and GPUs just for fun.

  • bit_user
    Linux isn't the only thing with hibernation problems!

    My Dell Precision laptop (for my job), running Windows 11, doesn't always come out of hibernation successfully. The laptop is still under warranty and fully supported under Windows 11, so it's new enough that this should work!
  • ezst036
    bit_user said:
    Linux isn't the only thing with hibernation problems!

    My Dell Precision laptop (for my job), running Windows 11, doesn't always come out of hibernation successfully. The laptop is still under warranty and fully supported under Windows 11, so it's new enough that this should work!
    Hibernation has always been an issue on the two major PC platforms.

    In contrast to PCs, this is one benefit Apple gets by keeping everything so tightly controlled. Less choice and fewer user allowances do have their benefits for the provider from a support and testing perspective. That in turn allows Apple to "offer" its customers the idea that "everything just works".
  • DS426
    bit_user said:
    Linux isn't the only thing with hibernation problems!

    My Dell Precision laptop (for my job), running Windows 11, doesn't always come out of hibernation successfully. The laptop is still under warranty and fully supported under Windows 11, so it's new enough that this should work!
    Yeah, Windows has pretty much always had issues resuming from hibernation and/or other sleep states, across various versions of Windows and all of the monthly build levels. The same is true pretty much regardless of make and model, though like anything, some can certainly be way more troublesome than others. It's a crapshoot, really!

    We actually disable hibernation with:

    powercfg /h off

    Using Command Prompt. This is pushed out as a GPO script. Laptops do have sleep and deep sleep (S5) enabled, even though that can cause problems as well. So our laptop users just shut down (and it's a true shutdown, not hibernation, as Windows started preferring several years ago) when the laptop isn't going to be used for some time. Anyways, not writing the hibernation file to disk also saves some SSD writes. ;)
  • Notton
    ezst036 said:
    In contrast to PCs, this is one benefit Apple gets by keeping everything so tightly controlled. Less choice and fewer user allowances do have their benefits for the provider from a support and testing perspective. That in turn allows Apple to "offer" its customers the idea that "everything just works".
    Contrary to popular belief, Macs, even the newer M-series models, are not immune to sleep breaking in some way.
    The likely cause is a third-party driver rather than the OS itself, but YMMV.
  • ejolson
    Years ago in the high-altitude deserts of the Sierra Nevada I ran a small computing cluster of about 50 nodes that would pause computations during the summer days and resume them at night when the outside temperatures were much cooler. Otherwise, the air-conditioning tended to fail.

    Although AI inference is usually performed on demand, I can well imagine a training cluster which benefits enough from lower electrical and cooling costs at night to the point that hibernation during the day is cost effective.
  • Exploding PSU
    bit_user said:
    Linux isn't the only thing with hibernation problems!

    My Dell Precision laptop (for my job), running Windows 11, doesn't always come out of hibernation successfully. The laptop is still under warranty and fully supported under Windows 11, so it's new enough that this should work!

    I have a ThinkPad X1 (also for work-related stuff, and also running the Windows 11 install it came with) that never gets hibernation right. Sometimes it works fine, sometimes it refuses to wake up and forces me to turn it back on from a cold boot, sometimes it doesn't hibernate at all and remains powered on while stored away. Sometimes it wakes up only to freeze on the lock screen, forcing me to hard-reboot it. Sometimes everything works fine but the fingerprint scanner refuses to work (or some random things stop working right, like the touchpad / touchscreen not recognising gestures)...

    Pretty much everything is still in default settings as I don't want to screw up anything on a laptop that I use for my job. I simply have accepted it as a quirk of the laptop, and to remind myself to not use the hibernate (or sleep) feature.

    So I think it's not the fault of that particular model of Dell. Maybe it's more of a Windows thing.
  • bit_user
    ejolson said:
    Although AI inference is usually performed on demand, I can well imagine a training cluster which benefits enough from lower electrical and cooling costs at night to the point that hibernation during the day is cost effective.
    The hardware and opportunity costs are so high that I'm sure they'll just plow ahead, 24/7. At the very most, they might slightly reduce clock speeds during the heat of the day and perhaps at peak demand times, for the grid.
  • Firestone
    I'm glad to see Linux continue to grow to support these common hardware configurations.
  • mitch074
    bit_user said:
    The hardware and opportunity costs are so high that I'm sure they'll just plow ahead, 24/7. At the very most, they might slightly reduce clock speeds during the heat of the day and perhaps at peak demand times, for the grid.
    True - but then it may be a grid capacity problem, because A/C may be overworked during that time too, in which case you'd need to reduce clock speeds by quite a lot for it to be effective. Since those parts already run at lower clock speeds because they rely more on pure parallelism than max clock speed, it is not unlikely that the only way to really reduce power draw is to actually shut down some racks.