U.S. Government Offers Nvidia A100 Nodes at 'Half Price'


The U.S. National Energy Research Scientific Computing Center (NERSC) is offering to 'rent' Nvidia A100-based GPU compute nodes of the Perlmutter supercomputer at a 50% discount through the end of September, as noticed by Glenn K. Lockwood, an HPC storage specialist from Microsoft. The offer comes as compute horsepower for AI training is scarce industry-wide. However, the offer is available to NERSC users only, and the discount applies to the node-hours charged against their time on the machine.

"Using your time now benefits the entire NERSC community and spreads demand more evenly throughout the year, so to encourage usage now, we are discounting all jobs run on the Perlmutter GPU nodes by 50% starting tomorrow and through the end of September," wrote Rebecca Hartman-Baker, User Engagement Group Leader at NERSC, in an email to NERSC users. "Any job (or portion of a job) that runs between midnight tonight and the very start of October 1 at midnight (Pacific time) will be charged only half the usual charges, e.g., a 3-hour job on 7 nodes, which would normally incur a charge of 21 GPU node-hours, would be charged 10.5 GPU node-hours."
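The arithmetic in the email's example can be sketched in a few lines of Python. This is only an illustration, not NERSC's actual accounting code: the function name `charged_node_hours` and its parameters are hypothetical, and it assumes a flat rate of one GPU node-hour per node per hour, as the quoted example implies. The `discounted_hours` parameter models the "portion of a job" that falls inside the discount window.

```python
def charged_node_hours(nodes, hours, discounted_hours=None, discount=0.5):
    """Return the GPU node-hours charged for a job (illustrative only).

    nodes            -- number of GPU nodes the job uses
    hours            -- total wall-clock hours the job runs
    discounted_hours -- hours that fall inside the discount window
                        (assumption: defaults to the whole job)
    discount         -- fraction taken off inside the window (0.5 = 50%)
    """
    if discounted_hours is None:
        discounted_hours = hours
    if not 0 <= discounted_hours <= hours:
        raise ValueError("discounted_hours must be between 0 and hours")

    # Hours outside the window are charged in full.
    full_price_hours = hours - discounted_hours
    # Hours inside the window are charged at the reduced rate.
    reduced_hours = discounted_hours * (1 - discount)
    return nodes * (full_price_hours + reduced_hours)


# The example from the email: a 3-hour job on 7 nodes.
print(charged_node_hours(nodes=7, hours=3))  # 10.5
```

Under the same assumptions, a job with only one of its three hours inside the window would be charged 7 × (2 + 0.5) = 17.5 GPU node-hours.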

Amid the generative AI craze, there are dozens of companies looking to rent nodes based on Nvidia compute GPUs to train their large language models. Yet commercial data centers are running at maximum capacity, and Nvidia's compute GPUs are sold out for quarters to come, according to media reports. The offering from NERSC is undoubtedly generous, and the scientific center could make some easy money if it offered its capacity commercially.

However, the catch is that NERSC offers the discount only to existing users who run scientific research on the Perlmutter supercomputer. Since these users were on summer break, they were probably not running their workloads on the supercomputer and may not do so until the end of the year. At least some of the GPU nodes have therefore been idling for some time, which raises the question of why the organization does not backfill its idle capacity with commercial workloads.


Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.

4 comments from the forums

bit_user

"Using your time now benefits the entire NERSC community and spreads demand more evenly throughout the year, so to encourage usage now, we are ..."
Sounds like it's heavily directed at grad students. I'll bet there's a huge queue for getting some runtime, toward the end of each semester.

The discount must be due to having some idle time on their A100 nodes, right now. If you have a limited budget of compute hours, you're going to save them until you need them, which is probably going to be skewed towards the latter half of the semester.

... why the organization does not backfill its idle capacity with commercial workloads.
If this HPC GPU supply crunch were an ongoing phenomenon, then you might see something like that. However, the agency is tasked with providing resources to scientific researchers, not with serving the commercial sector or trying to turn a profit.

In offering a commercial service, there could come additional risks, such as more exposure to hackers, which is an added problem when you consider those jobs are running alongside some classified government research jobs. You'd also have to do more clearance work to ensure that your commercial customers aren't from sanctioned entities, etc.

All in all, it sounds like more of a headache than it might be worth, especially if your existing users have the allocation of hours to use. Encouraging them to use their time now not only reduces node idle time, but perhaps more importantly reduces contention during those critical high-demand periods.
bit_user

BTW, we're seeing a growing trend of institutions shifting away from building their own supercomputers and towards simply renting time on cloud instances. Overall, this is a better solution, especially regarding idle resources. However, it would require the institutions to do some up-front negotiation and contract a certain number of hours, to avoid getting hit too hard during crunch time. And it still might not be suitable for certain government research, such as what NERSC supports.

https://www.nextplatform.com/2022/11/08/hpc-follows-the-enterprise-into-the-cloud/
3Cats3Boxes

Indeed the government doesn't rent server space to commercial users, as there's quite a concern about potential competition between state-sponsored facilities and private entities of any kind, whether commercial, academic, or nonprofit. But as Congress and the administration continue to flesh out strategies to assist domestic startups that might otherwise be lost in the "valley of death," the chasm between R&D funding and commercial viability, through initiatives at HHS, DOE, NSF, etc., it would be a good idea to mandate a structure to monitor and make available at minimal cost the surplus compute resources available across our many publicly funded facilities (mostly, but not exclusively, at FFRDCs).
bit_user

🐈⬛🐈⬛🐈⬛ said:
Indeed the government doesn't rent server space to commercial users, as there's quite a concern about potential competition between state-sponsored facilities and private entities of any kind, whether commercial, academic, or nonprofit. But as Congress and the administration continue to flesh out strategies to assist domestic startups that might otherwise be lost in the "valley of death," the chasm between R&D funding and commercial viability, through initiatives at HHS, DOE, NSF, etc., it would be a good idea to mandate a structure to monitor and make available at minimal cost the surplus compute resources available across our many publicly funded facilities (mostly, but not exclusively, at FFRDCs).
Good points.

Quite a username, too! Welcome!
: )