Nvidia Grace Superchip loses to Intel Sapphire Rapids in HPC performance benchmarks, but promises greater efficiency

Grace CPU Superchip (Image credit: Nvidia)

The Barcelona Supercomputing Center and the State University of New York have published benchmarks showing the prowess of Nvidia's brand-new Grace Superchip, which couldn't quite match two of Intel's 48-core Sapphire Rapids CPUs. Despite not having earth-shattering performance, Grace nevertheless promises to be a competitive datacenter and HPC processor thanks to its efficiency.

Grace is Nvidia's first-ever homemade server CPU, built on the Arm architecture. A single Grace CPU comes with 72 cores and 480GB of LPDDR5X memory. Though it's not possible to buy a single Grace CPU on its own, it features alongside Hopper GPUs in Grace-Hopper processors, and Nvidia offers the Grace Superchip, which combines two Grace CPUs on a single board for a total of 144 cores and 960GB of LPDDR5X.

The benchmarks shown at the HPC Asia conference last week are perhaps the most detailed we've seen thus far, with the Barcelona and New York researchers each presenting their findings at the conference. Each group tested differently, with the Barcelona benchmarks focusing on Grace's performance relative to Skylake-SP's, and the New York tests comparing Grace to a variety of AMD, Intel, and Arm CPUs.

The Barcelona researchers tested Grace-Hopper (without the GPU part, so effectively a single Grace CPU) and the Grace Superchip against a pair of 24-core Xeon Platinum 8160s. Given that Skylake-SP turns seven years old in 2024, it wasn't surprising that the Grace Superchip in its worst showing was still 67% faster than the 48-core Skylake-SP server; the Superchip's best result saw a lead of 4.49x. The choice of CPU comparison is strange but not arbitrary, as the Barcelona Supercomputing Center is replacing its Intel-powered MareNostrum 4 with Nvidia's Grace.

The New York benchmarks are more interesting given that they include comparisons to Intel Sapphire Rapids and Ice Lake, AMD's Milan, and rival Arm-based CPUs in the form of Amazon's Graviton 3 and Fujitsu's A64FX. The Grace Superchip easily beat the Graviton 3, the A64FX, an 80-core Ice Lake setup, and even a 128-core configuration of Milan in all benchmarks. However, the Sapphire Rapids server with two 48-core Xeon Max 9468s stopped Grace's winning streak.

Grace Overall Performance (GFLOPs)
Benchmark                  | Grace | Sapphire Rapids HBM | Sapphire Rapids DDR5
Matrix Multiplication      | 4,461 | 5,392               | 4,787
LINPACK                    | 3,120 | 2,862               | 2,211
FFT                        | 134.2 | 143.1               | 129
HPCG                       | 106.5 | 197.5               | 83.6
OpenFOAM (lower is better) | 5.46  | 6.87                | 6.89
Gromacs MEM                | 171   | 206.1               | 203.64
Gromacs RIB                | 12.7  | 13.52               | 13.88
Gromacs PEP                | 0.977 | 1.2                 | 1.18

Against Sapphire Rapids in HBM mode, Grace only won in three of the eight tests, though it was able to outperform in five tests when Sapphire Rapids was in DDR5 mode. It's a surprisingly mixed bag for Nvidia considering that Grace has 50% more cores and uses TSMC's more advanced 4nm node instead of Intel's aging Intel 7 (formerly 10nm) process. It's not entirely out of left field, though: Sapphire Rapids also beat AMD's Epyc Genoa chips for a spot in an MI300X-powered Azure instance, indicating that, despite Sapphire Rapids' shortcomings, it still has plenty of potency for HPC.

On the other hand, Nvidia might have a crushing victory in efficiency. The Grace Superchip is rated for 500 watts, while the Xeon Max 9468 is rated for 350 watts, which means two would have a TDP of 700 watts. The paper doesn't detail power consumption on either chip, but if we assume each chip was running at its TDP, then the comparison becomes very favorable for Nvidia.

Grace Hypothetical Efficiency
Benchmark             | Grace  | Sapphire Rapids HBM | Sapphire Rapids DDR5
Matrix Multiplication | 130.4% | 112.6%              | 100%
LINPACK               | 197.6% | 129.4%              | 100%
FFT                   | 145.6% | 110.9%              | 100%
HPCG                  | 178.3% | 236.2%              | 100%
Gromacs MEM           | 116.2% | 101.2%              | 100%
Gromacs RIB           | 128.1% | 97.4%               | 100%
Gromacs PEP           | 115.9% | 101.7%              | 100%

Bearing in mind that this is a comparison of TDP and not actual power consumption, the data here looks very positive for Nvidia. It would seem that the Grace Superchip is only less efficient in a single benchmark compared to the Sapphire Rapids chip in HBM mode. That certainly changes Grace's outlook, especially considering that efficiency is a big deal in large deployments of server CPUs — since cooling and power usage costs can become very expensive.
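As a rough check, the hypothetical efficiency figures can be reproduced by dividing each score by an assumed power draw and normalizing to the DDR5 result. The Python sketch below is illustrative only. It makes the same assumption as the comparison above, namely that the Grace Superchip draws its rated 500 watts and the two Xeon Max 9468s draw 700 watts combined. The names `TDP_WATTS`, `SCORES`, and `relative_efficiency` are made up for this example, and only three of the benchmarks are included.

```python
# Assumed power draw in watts: rated TDP, not measured consumption.
TDP_WATTS = {"Grace": 500, "SPR HBM": 700, "SPR DDR5": 700}

# Scores copied from the performance table (higher is better for these three).
SCORES = {
    "LINPACK": {"Grace": 3120, "SPR HBM": 2862, "SPR DDR5": 2211},
    "FFT": {"Grace": 134.2, "SPR HBM": 143.1, "SPR DDR5": 129},
    "HPCG": {"Grace": 106.5, "SPR HBM": 197.5, "SPR DDR5": 83.6},
}


def relative_efficiency(scores, tdp, baseline="SPR DDR5"):
    """Return score-per-watt for each system as a percentage of the baseline."""
    base = scores[baseline] / tdp[baseline]
    return {
        name: round(100 * (score / tdp[name]) / base, 1)
        for name, score in scores.items()
    }


for benchmark, scores in SCORES.items():
    print(benchmark, relative_efficiency(scores, TDP_WATTS))
```

Swapping measured wall power into `TDP_WATTS` would give real efficiency figures. Since the paper does not report power consumption, the output here carries the same caveat as the table.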

Though not an absolute performance champion, Grace is shaping up to be one of the most efficient datacenter CPUs today. Bear in mind, however, that neither Epyc CPUs based on Zen 4 nor Intel Xeons based on Emerald Rapids were included in these benchmarks. Nvidia claims Grace will beat AMD's Genoa in efficiency, but we're going to have to wait and see if Nvidia proves to be right about that.

Matthew Connatser

Matthew Connatser is a freelance writer for Tom's Hardware US. He writes articles about CPUs, GPUs, SSDs, and computers in general.

  • cyrusfox
    Nvidia, already well in front on AI with CUDA and the best LLM performance, has at the same time created a competitive custom datacenter solution... Maybe their marketshare valuation is justified.

    Will be interesting to see an AMD Genoa comparison, as well as whether it can continue to compete. Custom Arm cores are making inroads, and these benches prove that for certain configs they are competent replacements.
  • bit_user
    Against Sapphire Rapids in HBM mode, Grace only won in three of the eight tests — though it was able to outperform in five tests when in DDR5 mode. It's a surprisingly mixed bag for Nvidia considering that Grace has 50% more cores and uses TSMC's more advanced 4nm node instead of Intel's aging Intel 7 (formerly 10nm) process.
    Some key details this statement overlooks.
    Clock speed: the Xeon Max 9468 runs at 3.5 GHz vs. 3.2 GHz for the Grace CPU tested - a 9% advantage for Intel. Allegedly, Grace is designed to clock higher, but the one tested was running at reduced clocks for some reason.
    Sapphire Rapids' Golden Cove cores feature much wider AVX-512. I'm not sure of the number of ports, but possibly a total of 1536 bits or wider. Grace uses Arm Neoverse V2 cores, which have 4x 128-bit SVE2 support = 512 bits of vector throughput, per cycle.
    Xeon Max features 1 TB/s of HBM bandwidth, while Grace only manages about half as much bandwidth from its LPDDR5X. Furthermore, the NextPlatform article indicates that the researchers' system had Grace's memory running at reduced clocks, but doesn't specify by how much.
    Basically, the main thing Grace has going for it is its 50% higher core count. Taken together, I find the result probably puts Grace in a more positive light than expected. I mean, if you really run the numbers, the Grace setup just isn't designed for throughput like a Xeon Max is. That's because Nvidia never intended Grace to do the heavy lifting. They expect you to use their H100 (and now H200) to shoulder the main compute burden.

    BTW, NextPlatform incorrectly describes Grace's 512 GB/s memory bandwidth as being aggregate for the superchip. It's actually 512 GB/s per CPU.
    Source: https://developer.nvidia.com/blog/nvidia-grace-cpu-superchip-architecture-in-depth/
  • craigss
    Well now we see a test that shows the chosen equipment winning that justifies the spend and choice which seems to run contrary to all the expectations, lots of stuff done in the background to ensure the intel product is on top, when frankly it should not be, anybody else smell a blue fish ?
  • bit_user
    craigss said:
    anybody else smell a blue fish ?
    Usually, you'd want to dig into the details of the setups and how the testing was performed, in order to spot anything which could've biased the results one way or another. Hopefully, all of those details are in the papers, themselves.

    What strikes me as weird about the whole proposition is the use of Xeon Max in HBM mode. This limits you to just 64 GiB of memory, which seems woefully insufficient. I'm glad they tested this configuration, because I'm certainly curious about the potential of HBM, but it seems unsuitable for actual usage.

    So, if they forgo HBM mode, that cuts down Xeon Max's advantage from winning 5/8 benchmarks to 3/8. There's a 3rd option, which is to use HBM as a cache. I wonder if the article just omitted those results or if the researchers hadn't tested them. Anyway, if the HBM cache mode adds little benefit over straight DDR5 mode, then I'd say they made a mistake in selecting Xeon Max and should've just gone with a standard Xeon model that supports higher clock speeds.

    Fun stuff, though.
    : )
  • bit_user
    Just gonna leave this here:
    https://www.phoronix.com/review/nvidia-gh200-gptshop-benchmark
    Sadly, no power consumption figures. However, the Geomean shows a single, 72-core Grace achieving 2175, while a single, 96-core AMD EPYC 9654 achieves 2499. So, that's 14.9% faster with 33.3% more cores (and 166.7% more threads).

    What's even more impressive is how well it does against the 128-core/256-thread Zen 4C-based Bergamo (EPYC 9754), which is only 13.1% faster!

    Not bad, Grace!