Nvidia Unveils DGX GH200 Supercomputer, Grace Hopper Superchips in Production

Nvidia Grace Hopper
(Image credit: Tom's Hardware)

Nvidia CEO Jensen Huang announced here at Computex 2023 in Taipei, Taiwan, that the company's Grace Hopper Superchips are now in full production and that the Grace platform has now earned six supercomputer wins. These chips are a fundamental building block of one of Huang's other big Computex 2023 announcements: the company's new DGX GH200 AI supercomputing platform, built for massive generative AI workloads. The DGX GH200 pairs 256 Grace Hopper Superchips to form a supercomputing powerhouse with 144TB of shared memory for the most demanding generative AI training tasks. Nvidia already has customers like Google, Meta, and Microsoft ready to receive the leading-edge systems.

Nvidia also announced its new MGX reference architectures that will help OEMs build new AI supercomputers faster, with more than 100 possible system configurations. Finally, the company also announced its new Spectrum-X Ethernet networking platform that is designed and optimized specifically for AI server and supercomputing clusters. Let's dive in.

Nvidia Grace Hopper Superchips Now in Production

(Image credit: Nvidia)

We've covered the Grace and Grace Hopper Superchips in depth in the past. These chips are central to the new systems Nvidia announced today. The Grace chip is Nvidia's own Arm CPU-only processor, and the Grace Hopper Superchip combines the 72-core Grace CPU, a Hopper GPU, 96GB of HBM3, and 512GB of LPDDR5X on the same package, all weighing in at 200 billion transistors. This combination provides astounding data bandwidth between the CPU and GPU, with up to 1 TB/s of throughput between the CPU and GPU offering a tremendous advantage for certain memory-bound workloads.
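
To put that figure in context, here is a quick, hedged sketch of how long it would take a Hopper GPU to sweep the Grace CPU's memory pool at the quoted rate versus a conventional PCIe link. The PCIe 5.0 x16 baseline (roughly 64 GB/s per direction) is an assumption for comparison, not a figure from Nvidia's announcement.

```python
# Illustrative only: time for the GPU to stream the CPU-attached LPDDR5X pool
# once, at the article's quoted NVLink-C2C rate vs. an assumed PCIe 5.0 x16 link.

def sweep_time_s(data_gb: float, link_gb_per_s: float) -> float:
    """Seconds to move data_gb gigabytes over a link sustaining link_gb_per_s."""
    return data_gb / link_gb_per_s

cpu_memory_gb = 512      # LPDDR5X attached to the Grace CPU (per the article)
nvlink_c2c_gb_s = 1000   # article's quoted CPU-to-GPU throughput
pcie5_x16_gb_s = 64      # assumed baseline: PCIe 5.0 x16, per direction

print(f"NVLink-C2C sweep:   {sweep_time_s(cpu_memory_gb, nvlink_c2c_gb_s):.2f} s")  # ~0.5 s
print(f"PCIe 5.0 x16 sweep: {sweep_time_s(cpu_memory_gb, pcie5_x16_gb_s):.2f} s")   # ~8 s
```

For workloads that spill out of the GPU's 96GB of HBM3, that order-of-magnitude gap is the whole point of the superchip packaging.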

With the Grace Hopper Superchips now in full production, we can expect systems to come from a bevy of Nvidia's systems partners, like Asus, Gigabyte, ASRock Rack, and Pegatron. More importantly, Nvidia is rolling out its own systems based on the new chips and is issuing reference design architectures for OxMs and hyperscalers, which we'll cover below.

Nvidia DGX GH200 Supercomputer

Nvidia's DGX systems are its go-to system and reference architecture for the most demanding AI and HPC workloads, but the current DGX A100 systems are limited to eight A100 GPUs working in tandem as one cohesive unit. Given the explosion of generative AI, Nvidia's customers are eager for much larger systems with much more performance. The DGX GH200 is designed to offer the ultimate in throughput and massive scalability for the largest workloads, like generative AI training, large language models, recommender systems, and data analytics, by sidestepping the limitations of standard cluster connectivity options, like InfiniBand and Ethernet, with Nvidia's custom NVLink Switch silicon.

Details are still scarce on the finer aspects of the new DGX GH200 AI supercomputer, but we do know that Nvidia uses a new NVLink Switch System with 36 NVLink switches to tie together 256 GH200 Grace Hopper chips and 144 TB of shared memory into one cohesive unit that looks and acts like one massive GPU. The new NVLink Switch System is based on Nvidia's NVLink Switch silicon, now in its third generation.

The DGX GH200 comes with 256 total Grace Hopper CPU+GPU Superchips, easily outstripping Nvidia's previous largest NVLink-connected DGX arrangement with eight GPUs, and the 144TB of shared memory is nearly 500X more than the DGX A100 systems that offer a 'mere' 320GB of shared memory between eight A100 GPUs. Additionally, expanding the DGX A100 system to clusters with more than eight GPUs requires employing InfiniBand as the interconnect between systems, which incurs performance penalties. In contrast, the DGX GH200 marks the first time Nvidia has built an entire supercomputer cluster around the NVLink Switch topology, which Nvidia says provides up to 10X the GPU-to-GPU and 7X the CPU-to-GPU bandwidth of its previous-gen system. It's also designed to provide 5X the interconnect power efficiency (likely measured in pJ/bit) of competing interconnects, and up to 128 TB/s of bisection bandwidth.
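
For a rough sense of the scale, the sketch below works through the memory comparison using only the figures in this article; the binary-terabyte reading of "144TB" is an assumption on our part.

```python
# Rough arithmetic behind the shared-memory claims above (figures from this
# article; the binary-TB interpretation of "144TB" is an assumption).
superchips = 256
gh200_shared_gb = 144 * 1024        # 144TB shared memory pool, in GB
dgx_a100_shared_gb = 320            # eight A100 GPUs in a DGX A100

per_chip_gb = gh200_shared_gb / superchips
ratio = gh200_shared_gb / dgx_a100_shared_gb

print(f"Shared memory per Grace Hopper Superchip: {per_chip_gb:.0f} GB")  # ~576 GB
print(f"DGX GH200 vs. DGX A100 shared memory: {ratio:.0f}x")              # ~461x, i.e. 'nearly 500X'

# Note: ~576 GB per superchip is a bit less than the 96GB of HBM3 plus 512GB of
# LPDDR5X quoted earlier, which suggests the shared pool doesn't count all of
# the CPU-attached memory -- an inference on our part, not an Nvidia statement.
```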

The system has 150 miles of optical fiber and weighs 40,000 lbs, but presents itself as one single GPU. Nvidia says the 256 Grace Hopper Superchips propel the DGX GH200 to one exaflop of 'AI performance,' meaning that value is measured with smaller data types that are more relevant to AI workloads than the FP64 measurements used in HPC and supercomputing. This performance comes courtesy of 900 GB/s of GPU-to-GPU bandwidth, which is quite impressive scalability given that Grace Hopper tops out at 1 TB/s of throughput between the Grace CPU and Hopper GPU when they're connected directly on the same package via the NVLink-C2C interconnect.
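
Dividing the claimed aggregate figures by the superchip count gives a rough per-node view; this is back-of-the-envelope arithmetic based on the numbers above, not anything from Nvidia's spec sheets.

```python
# Per-superchip view of the headline DGX GH200 figures quoted above
# (back-of-the-envelope arithmetic, not Nvidia spec-sheet numbers).
superchips = 256
ai_exaflops = 1.0        # 'AI performance' with small data types
weight_lbs = 40_000
fiber_miles = 150

print(f"AI compute per superchip:    {ai_exaflops * 1000 / superchips:.1f} PFLOPS")  # ~3.9 PFLOPS
print(f"Weight per superchip:        {weight_lbs / superchips:.0f} lbs")             # ~156 lbs
print(f"Optical fiber per superchip: {fiber_miles * 5280 / superchips:.0f} ft")      # ~3,100 ft
```

The roughly 4 PFLOPS per superchip that falls out of this math lines up with Hopper's quoted FP8 throughput, which supports the small-data-type reading of the 'AI performance' figure.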

Nvidia provided projected benchmarks of the DGX GH200 with the NVLink Switch System going head-to-head with a DGX H100 cluster tied together with InfiniBand. Nvidia used varying numbers of GPUs for these workload projections, ranging from 32 to 256, but each system employed the same number of GPUs for each test. Nvidia expects the explosive gains in interconnect performance to unlock anywhere from 2.2X to 6.3X more performance.

Nvidia will provide the DGX GH200 reference blueprints to its leading customers, Google, Meta, and Microsoft, before the end of 2023, and will also provide the system as a reference architecture design for cloud service providers and hyperscalers.

Nvidia is eating its own dogfood, too; the company will deploy a new Nvidia Helios supercomputer comprising four DGX GH200 systems that it will use for its own research and development work. The four systems, which total 1,024 Grace Hopper Superchips, will be tied together with Nvidia's Quantum-2 InfiniBand 400 Gb/s networking.

Nvidia MGX Systems Reference Architectures

While DGX serves the highest-end systems and Nvidia's HGX systems serve hyperscalers, the new MGX systems slot in as the middle point between the two. DGX and HGX will continue to co-exist with the new MGX systems.

Nvidia's OxM partners face new challenges with AI-centric server designs, which slows design and deployment. Nvidia's new MGX reference architectures are designed to speed that process with 100+ reference designs. The MGX systems comprise modular designs that span the gamut of Nvidia's portfolio of CPUs, GPUs, DPUs, and networking systems, but also include designs based on the common x86 and Arm-based processors found in today's servers. Nvidia also provides options for both air- and liquid-cooled designs, thus giving OxMs different design points for a wide range of applications.

Naturally, Nvidia points out that the lead systems from QCT and Supermicro will be powered by its Grace and Grace Hopper Superchips, but we expect x86 flavors to gain a wider array of available systems over time. Asus, Gigabyte, ASRock Rack, and Pegatron will all use MGX reference architectures for systems that will come to market later this year and into early next year.

The MGX reference designs could be the sleeper announcement of Nvidia's Computex press blast. These are the systems that mainstream data centers and enterprises will eventually deploy to infuse AI-centric architectures into their infrastructure, and they will ship in far greater numbers than the somewhat exotic and more costly DGX systems; MGX is the volume mover. Nvidia is still finalizing the spec, which will be public, and will release a whitepaper soon.

Nvidia Spectrum-X Networking Platform

Nvidia’s purchase of Mellanox has turned out to be a pivotal move for the company, as it can now optimize and tune networking componentry and software for its AI-centric needs. The new Spectrum-X networking platform is perhaps the perfect example of those capabilities, as Nvidia touts it as the ‘world’s first high-performance Ethernet for AI’ networking platform.

One of the key points here is that Nvidia is pivoting to Ethernet as an interconnect option for high-performance AI platforms, as opposed to the InfiniBand connections often found in high-performance systems. The Spectrum-X design employs Nvidia's 51 Tb/s Spectrum-4 400 GbE Ethernet switches and Nvidia BlueField-3 DPUs paired with software and SDKs that allow developers to tune systems for the unique needs of AI workloads. In contrast to other Ethernet-based systems, Nvidia says Spectrum-X is lossless, thus providing superior QoS and latency. It also has new adaptive routing tech, which is particularly helpful in multi-tenancy environments.

The Spectrum-X networking platform is a foundational aspect of Nvidia's portfolio, as it brings high-performance AI cluster capabilities to Ethernet-based networking, offering new options for wider deployments of AI into hyperscale infrastructure. The Spectrum-X platform is also fully interoperable with existing Ethernet-based stacks and offers impressive scalability with up to 256 ports of 200 Gb/s Ethernet on a single switch, or 16,000 ports in a two-tier leaf-spine topology.
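
The port counts follow from the switch's aggregate bandwidth. The quick check below uses the figures above; the leaf-spine split is an assumed illustration on our part, not a published Nvidia topology.

```python
# Quick consistency check on the Spectrum-4 figures above.
switch_capacity_gbps = 51_200        # ~51 Tb/s aggregate switching capacity

for port_speed in (200, 400):        # GbE port speeds
    print(f"{port_speed} GbE ports per switch: {switch_capacity_gbps // port_speed}")  # 256 and 128

# Assumed illustration: if each leaf switch splits its 256 x 200GbE ports evenly
# between hosts and spine uplinks, 128 leaves expose 128 * 128 = 16,384
# host-facing ports -- in line with the ~16,000-port two-tier figure above.
leaves, host_ports_per_leaf = 128, 128
print(f"Two-tier leaf-spine host ports: {leaves * host_ports_per_leaf}")
```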

The Nvidia Spectrum-X platform and its associated components, including 400G LinkX optics, are available now.

Nvidia Grace and Grace Hopper Superchip Supercomputing Wins

Nvidia's first Arm CPUs (Grace) are already in production and have made an impact with three recent supercomputer wins, including the newly announced Taiwania 4, which will be built by computing vendor ASUS for Taiwan's National Center for High-Performance Computing. This system will feature 44 Grace CPU nodes, and Nvidia claims it will rank among the most energy-efficient supercomputers in Asia when deployed. The supercomputer will be used to model climate change issues.

Nvidia also shared details of its new Taipei 1 supercomputer that will be based in Taiwan. This system will have 64 DGX H100 AI supercomputers and 64 Nvidia OVX systems tied together with the company's networking kit. This system will be used to further unspecified local R&D workloads when it is finished later this year.

Paul Alcorn
Managing Editor: News and Emerging Tech

Paul Alcorn is the Managing Editor: News and Emerging Tech for Tom's Hardware US. He also writes news and reviews on CPUs, storage, and enterprise hardware.

  • bit_user
    Thanks for the writeup!

    up to 1 TB/s of throughput between the CPU and GPU offering a tremendous advantage for certain memory-bound workloads.
    The way they have a dual-CPU configuration option can lead to some confusion about whether the stats are referring to a single CPU or dual-CPU config. As you can see from their developer blog post, the 1 TB/s is the memory bandwidth across 2 Grace CPUs. The Chip-to-Chip link is only 900 GB/s.

When a GPU is paired with a CPU, the GPU would be bottlenecked by the CPU's 500 GB/s memory interface. Then again, I'd bet Nvidia is arriving at 900 GB/s by summing throughput in each direction, in which case it could only read or write CPU memory at a peak of ~450 GB/s.
    Source: https://developer.nvidia.com/blog/nvidia-grace-cpu-superchip-architecture-in-depth/
    The system has 150 miles of optical fiber and weighs 40,000 lbs, but presents itself as one single GPU. Nvidia says the 256 Grace Hopper Superchips ...
    Really? That's 156 pounds per "superchip". That sounds like pretty bad scaling, actually.


    @PaulAlcorn , in the 7-slide album under "Nvidia MGX Systems Reference Architectures", the last 2 slides appear to be duplicates of slides 2 & 3. Not that I really care that much about the details of MGX, but I did notice.
    ; )
    The MGX reference designs could be the sleeper announcement of Nvidia’s Computex press blast
    Yeah, but let's call it out for what it is: an end-run around OCP (Open Compute Project). The industry has a nice, open multi-vendor ecosystem that already supports modular accelerators.

    Nvidia is pivoting to Ethernet as the interconnect for high-performance AI platforms
    "Never bet against Ethernet."


    @PaulAlcorn, in the album under "Nvidia Spectrum-X Networking Platform", the second image seems to be missing?

    Nvidia's first Arm CPUs (Grace) ...
    Server CPUs, that is. They've been making ARM-based SoCs for embedded applications since about 1.5 decades ago.
  • zecoeco
    Nvidia throwing the new buzzword "AI" in every new product they announce.
    They just won't stop overhyping AI..
  • ngilbert
    zecoeco said:
    Nvidia throwing the new buzzword "AI" in every new product they announce.
    They just won't stop overhyping AI..
    Given that it is the buzzword of the month across the industry, and they have a massive lead in that type of workload, they would be foolish not to push it as far as they can.
  • elforeign
    Awesome write up, can someone illustrate just what the platform enables? I understand whatever is possible with today in terms of language models, etc but when they say 2.2-6.3x more performance, where does that take us in the next year?

    How soon can I plug into the matrix?
  • bit_user
    elforeign said:
    Awesome write up, can someone illustrate just what the platform enables? I understand whatever is possible with today in terms of language models, etc but when they say 2.2-6.3x more performance, where does that take us in the next year?
    The details are (mostly) in the slides. And the devil is in the details. It shouldn't surprise you to hear that Nvidia loves to overhype their products, so they're going to highlight the best case scenario, much more than the typical speed up. Don't get these two confused!
    They claimed a 2.3x speedup in (training?) a 65B parameter Large Language model (e.g. ChatGPT), a 5x speedup in a 500 GB Deep Learning Recommendation Model, and a 5.5x speedup in a 400 GB Vector DB workload (think lookups in a large-population facial recognition database, among other examples that come to mind).

    There's a common theme, here: they're all big. That's because they didn't announce a new GPU! They announced a new system architecture, based around their "superchip" modules and next-gen interconnect technology. So, it's not the computation that got faster, but rather the data movement, locality, and latencies that improved.

    elforeign said:
    How soon can I plug into the matrix?
    Depends on how you define it. If you just wanted VR without a headset, then the bottleneck would seem to be the brain-machine interface. Elon Musk's Neuralink just got approval to start human trials. I haven't followed their stuff, but I'm guessing it'd be several generations before they're ready to tap into the optic nerve. And it's anybody's guess if/when we can do that without having to sever the connection to your organic retinas.
  • elforeign
    bit_user said:
    The details are (mostly) in the slides. And the devil is in the details. It shouldn't surprise you to hear that Nvidia loves to overhype their products, so they're going to highlight the best case scenario, much more than the typical speed up. Don't get these two confused!
    They claimed a 2.3x speedup in (training?) a 65B parameter Large Language model (e.g. ChatGPT), a 5x speedup in a 500 GB Deep Learning Recommendation Model, and a 5.5x speedup in a 400 GB Vector DB workload (think lookups in a large-population facial recognition database, among other examples that come to mind).

    There's a common theme, here: they're all big. That's because they didn't announce a new GPU! They announced a new system architecture, based around their "superchip" modules and next-gen interconnect technology. So, it's not the computation that got faster, but rather the data movement, locality, and latencies that improved.


    Depends on how you define it. If you just wanted VR without a headset, then the bottleneck would seem to be the brain-machine interface. Elon Musk's Neuralink just got approval to start human trials. I haven't followed their stuff, but I'm guessing it'd be several generations before they're ready to tap into the optic nerve. And it's anybody's guess if/when we can do that without having to sever the connection to your organic retinas.
    Thanks, for some reason I thought the grace hopper chip was a new chip but if I'm understanding correctly, it's just the software stack and the interconnects that got upgraded to enable more to be daisy chained and maintain coherency.
  • bit_user
    elforeign said:
    Thanks, for some reason I thought the grace hopper chip was a new chip
    Grace is a 72-core ARM CPU and each one has up to 512 GB of memory. There are 3 options for populating their modules (daughter cards):
    Grace + Grace (2x CPU)
    Grace + Hopper (CPU + GPU)
    Hopper + Hopper (2x GPU)
    The big news is that Grace is finally shipping, after probably close to a year of delays.

    elforeign said:
    if I'm understanding correctly, it's just the software stack and the interconnects that got upgraded to enable more to be daisy chained and maintain coherency.
    Yeah, I don't follow NVLink quite closely enough to say in detail just how much the latest round of upgrades were worth, but the game-changer is that you now have these CPU nodes in the NVLink mesh, each of which has 512 GB of memory (compared with the GPU nodes, which have only up to 96 GB of memory). So, that's why they're highlighting all of these big memory workloads that were previously bottlenecked on accessing large amounts of data in pools of "host" memory.

Beyond that, they're highlighting that you can bypass InfiniBand for scaling to multi-chassis. Using NVLink scales much better at the rack level. At some point, you still need to use InfiniBand.
  • msroadkill612
    Indeed. To this inexpert, there seems a lot of hype about grace/hopper & not much fundamentally new about processors.

    The stuff from AMD's stable tho, is revolutionary.

    4 gpus on a single mi300 substrate (32 such gpuS per blade). NV's lan bandwidth boasts seem irrelevant to AI - its all about latency & efficiency via proximity of processing resources (simple laws of physics).

The most gloomy prospect for NV IMO, is that AMD is on the launch pad of gpu chiplets - not only chiplets which together comprise a gpu (e.g. 7900 XTX), as with a single core complex (CCX) Zen, but multiple gpuS on a single substrate (as w/ a multi ccx zen)

    Nvidia is constrained by the scale limitations of monolithic processors & thats far worse a problem for GPU, where scale clearly matters far more. Intel's problems competing in DC against epyc will pale by comparison in the gigantic AI rigs.

    To boast of 150 miles of lan fiber seems odd... not to mention the huge cost of all those $10k NICs

    Shared gpu/cpu memory on mi300, 192 core epycs, 3d cache epyc - a huge advance on grace hopper discrete cpu/gpu cache,

    if grace hopper is just now releasing as u say, more revolutionary mi300 seems not so far behind, with its release to el capitan.
  • bit_user
    msroadkill612 said:
    Indeed. To this inexpert, there seems a lot of hype about grace/hopper & not much fundamentally new about processors.
    Here's what I see that's new:
    First ARMv9-A server CPU (also the first to use ARM Neoverse V2 cores - Amazon Graviton 3 uses V1)
    First server CPU to use co-packaged LPDDR memory, with the only other example I know being Apple.
    Second and only current CPU to feature NVLink (the first was IBM's POWER, but that CPU is at least 2 generations old)
    Made on TSMC N4 - gives a slight advantage over EPYC at N5.
    msroadkill612 said:
    NV's lan bandwidth boasts seem irrelevant to AI
    Of course it's relevant! Big models involve thousands of training GPUs, and that means you need to scale well beyond the number that can be packed in a single chassis or rack!

    I'm incredibly interested in seeing some real-world benchmarks, purely for the sake of looking at how the latest & greatest ARM server cores compare to Intel and AMD. As for its role in building scalable AI systems, that's not terribly interesting to me, personally.

    Compared with the MI300, I can't really say too much, technically. I'd just point out that the success of that product depends on a whole lot of factors, with technical parity/superiority being only one. However, I do hope it enables AMD to finally gain some traction in the AI market. I'm not going to place any bets, either way.