Massive VRAM pools on AMD Instinct accelerators drown Linux's hibernation process — 1.5 TB of memory per server creates headaches

AMD AI servers
(Image credit: Moreh)

In a Linux patch series posted today, AMD engineer Samuel Zhang highlighted an unusual issue: Linux servers are failing to hibernate because of the sheer amount of VRAM spread across the many AMD Instinct accelerators in each system. For context, Instinct accelerators are powerful AMD GPUs designed specifically for data centers handling AI, high-performance computing, scientific workloads, and other demanding tasks.

Part of what makes these GPUs so powerful is their massive VRAM capacity, such as 192GB on some models, which might sound huge to gamers but is fairly standard for modern data center chips. In fact, the Linux-powered AMD AI server in question is equipped with eight Instinct cards, bringing its total VRAM to around 1.5TB. However, while more VRAM is generally a good thing, in cases like this it can lead to unexpected issues.
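
As a quick sanity check on that figure, the sketch below multiplies out the numbers quoted above. It assumes all eight cards carry the 192GB configuration, which the article implies but does not state outright, and the variable names are purely illustrative.

```python
# Assumption: all eight accelerators use the 192 GB configuration.
CARDS_PER_SERVER = 8
VRAM_PER_CARD_GB = 192

total_vram_gb = CARDS_PER_SERVER * VRAM_PER_CARD_GB

# Binary convention (1 TB = 1024 GB), which is how the ~1.5TB figure falls out.
total_vram_tb = total_vram_gb / 1024

print(f"{total_vram_gb} GB total, about {total_vram_tb:.1f} TB")
```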

Hassam Nasir
Contributing Writer

Hassam Nasir is a die-hard hardware enthusiast with years of experience as a tech editor and writer, focusing on detailed CPU comparisons and general hardware news. When he’s not working, you’ll find him bending tubes for his ever-evolving custom water-loop gaming rig or benchmarking the latest CPUs and GPUs just for fun.

  • bit_user
    Linux isn't the only thing with hibernation problems!

    My Dell Precision laptop (for my job), running Windows 11, doesn't always come out of hibernation successfully. The laptop is still under warranty and fully supported under Windows 11, so it's new enough that this should work!
  • ezst036
    bit_user said:
    Linux isn't the only thing with hibernation problems!

    My Dell Precision laptop (for my job), running Windows 11, doesn't always come out of hibernation successfully. The laptop is still under warranty and fully supported under Windows 11, so it's new enough that this should work!
    Hibernation has always been an issue on the two major PC platforms.

    In contrast to PCs, this is a benefit Apple gives itself by keeping everything so tightly controlled. Less choice and fewer user allowances do have their benefits for the provider from a support and testing perspective. That in turn allows Apple to "offer" its customers the concept that "everything just works".
  • DS426
    bit_user said:
    Linux isn't the only thing with hibernation problems!

    My Dell Precision laptop (for my job), running Windows 11, doesn't always come out of hibernation successfully. The laptop is still under warranty and fully supported under Windows 11, so it's new enough that this should work!
    Yeah, Windows has pretty much always had issues resuming from hibernation and/or other sleep states, across various versions of Windows and all of the monthly build levels. The same is true pretty much regardless of make and model, though like anything, some can certainly be way more troublesome than others. It's a crapshoot, really!

    We actually disable hibernation with:

    powercfg /h off

    Using Command Prompt. This is pushed out as a GPO script. Laptops do have sleep and deep sleep (S5) enabled, even though that can cause problems as well. So any of our laptop users just shut down (and it's a true shutdown, not hibernation, as Windows started preferring several years ago) when the laptop isn't going to be used for some time. Anyways, not writing the hibernation file to disk also saves some SSD writes. ;)
  • Notton
    ezst036 said:
    In contrast to PCs, this is a benefit Apple gives itself by keeping everything so tightly controlled. Less choice and fewer user allowances do have their benefits for the provider from a support and testing perspective. That in turn allows Apple to "offer" its customers the concept that "everything just works".
    Contrary to popular belief, Macs, even the newer M-series, are not immune to sleep breaking in some way.
    The likely cause is a third-party driver, rather than the OS itself, but YMMV.
  • ejolson
    Years ago in the high-altitude deserts of the Sierra Nevada I ran a small computing cluster of about 50 nodes that would pause computations during the summer days and resume them at night when the outside temperatures were much cooler. Otherwise, the air-conditioning tended to fail.

    Although AI inference is usually performed on demand, I can well imagine a training cluster which benefits enough from lower electrical and cooling costs at night to the point that hibernation during the day is cost effective.
  • Exploding PSU
    bit_user said:
    Linux isn't the only thing with hibernation problems!

    My Dell Precision laptop (for my job), running Windows 11, doesn't always come out of hibernation successfully. The laptop is still under warranty and fully supported under Windows 11, so it's new enough that this should work!

    I have a ThinkPad X1 (also for work-related stuff, and also running the Windows 11 that came with it) that never gets hibernation right. Sometimes it works fine, sometimes it refuses to wake up and forces me to turn it back on from a cold boot, sometimes it doesn't hibernate at all and remains powered on while stored away. Sometimes it wakes up only to freeze on the lock screen, forcing me to hard-reboot it. Sometimes everything works fine but the fingerprint scanner refuses to work (or some random things stop working right, like the touchpad / touchscreen not recognising gestures)...

    Pretty much everything is still at default settings, as I don't want to screw up anything on a laptop that I use for my job. I have simply accepted it as a quirk of the laptop, and remind myself not to use the hibernate (or sleep) feature.

    So I think it's not the fault of that particular model of Dell. Maybe it's more of a Windows thing.
  • bit_user
    ejolson said:
    Although AI inference is usually performed on demand, I can well imagine a training cluster which benefits enough from lower electrical and cooling costs at night to the point that hibernation during the day is cost effective.
    The hardware and opportunity costs are so high that I'm sure they'll just plow ahead, 24/7. At the very most, they might slightly reduce clock speeds during the heat of the day and perhaps at peak demand times, for the grid.
  • Firestone
    I'm glad to see Linux continue to grow to support these common hardware configurations.
  • mitch074
    bit_user said:
    The hardware and opportunity costs are so high that I'm sure they'll just plow ahead, 24/7. At the very most, they might slightly reduce clock speeds during the heat of the day and perhaps at peak demand times, for the grid.
    True - but then it may be a grid capacity problem, because A/C may be overworked during that time too, in which case you'd need to reduce clock speeds by quite a lot for it to be effective. Since those parts already run at lower clock speeds because they rely more on pure parallelism than max clock speed, it is not unlikely that the only way to really reduce power draw is to actually shut down some racks.
  • Stomx
    Linux developers claim hibernation is still an unstable feature, specifically on the Linux desktop. The desktop itself is still unstable. The default installations of Ubuntu and Linux Mint do not even offer hibernation.
    Linux may fail not only to hibernate but also to enter standby (called Suspend in Linux), with or without any VRAM.

    Just as Linux was not user friendly 30 years ago, 20 years ago, or 10 years ago, it is continuing the trend today. Too small a user pool and too many distros do Linux a disservice: there is no time or money to polish them. Plus, it looks like Torvalds does not care about the Linux desktop, always enjoying the sight of his beloved terminal and the orgasmic list of his creations that every user must permanently enjoy seeing too:
    bin
    boot
    cdrom
    dev
    etc
    lib
    lib32
    lib64
    libx32
    mnt
    opt
    proc
    root
    run
    sbin
    srv
    sys
    tmp
    usr
    var
    (sorry for the torture of exposing what no one wants to see).
    When will Linux devs learn something from Android?