Nvidia unveiled its new 144-core Grace CPU Superchip, its first CPU-only Arm chip designed for the data center, back at GTC. Nvidia shared a benchmark against AMD's EPYC to claim a 1.5X lead, but that wasn't very useful because it was against a previous-gen model. However, we found a benchmark of Grace versus Intel's Ice Lake buried in a GTC presentation from Nvidia's vice president of its Accelerated Computing business unit, Ian Buck. This benchmark claims Grace is 2X faster and 2.3X more energy-efficient than Intel's current-gen Ice Lake in a Weather Research and Forecasting (WRF) model commonly used in HPC.
Nvidia's first benchmark claimed that Grace is 1.5X faster in the SPECrate_2017 benchmark than two previous-gen 64-core EPYC Rome 7742 processors and that it will deliver twice the power efficiency of today's server chips when it arrives in early 2023. However, those benchmarks compare against previous-gen chips: the Rome chips will be four years old when Grace arrives next year, and AMD already has its faster EPYC Milan shipping. Given the comparison to Rome, we can expect Nvidia's Grace to be roughly on par with the newer Milan in both performance and performance-per-watt. However, even that comparison doesn't really matter; AMD's EPYC Genoa will be available in 2023, and it will be faster still.
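For a rough sense of what the 1.5X claim means per core, the short Python sketch below divides the claimed speedup by the core-count ratio. It uses only the figures quoted above. The function name is ours, and the normalization ignores SMT, clock speeds, and power limits, so treat it as back-of-the-envelope arithmetic rather than a measurement.

```python
# Figures quoted in the article (vendor claims, not independent measurements).
GRACE_CORES = 144
ROME_CORES_PER_SOCKET = 64   # EPYC Rome 7742
ROME_SOCKETS = 2
CLAIMED_SPEEDUP = 1.5        # Grace vs. two EPYC 7742 in SPECrate_2017


def per_core_speedup(speedup: float, cores_new: int, cores_old: int) -> float:
    """Scale a throughput speedup by the ratio of physical core counts.

    Hypothetical helper for illustration: it assumes throughput scales
    linearly with cores, which real chips only approximate.
    """
    return speedup / (cores_new / cores_old)


rome_cores = ROME_CORES_PER_SOCKET * ROME_SOCKETS
core_ratio = GRACE_CORES / rome_cores
per_core = per_core_speedup(CLAIMED_SPEEDUP, GRACE_CORES, rome_cores)

print(f"Rome cores in the comparison: {rome_cores}")
print(f"Core-count ratio (Grace / Rome): {core_ratio:.3f}")
print(f"Implied per-core speedup: {per_core:.2f}X")
```

Under that simplification, Grace brings 12.5 percent more cores than the dual-socket Rome system, so the implied per-core advantage is smaller than the 1.5X headline figure.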
That makes Nvidia's comparison against Intel's current-gen Ice Lake a bit more interesting. Intel will have its Sapphire Rapids available by 2023, but at least this comparison gets us a generation closer. (Beware, this is a vendor-provided benchmark result and is based on a simulation of the Grace CPU, so take Nvidia's claims with a grain of salt.)
As a reminder, Nvidia's Grace CPU Superchip is an Arm v9 Neoverse (N2 Perseus) processor with 144 cores spread out over two dies fused together with Nvidia's newly branded NVLink-C2C interconnect tech that delivers 900 GB/s of throughput and memory coherency. In addition, the chip uses 1 TB of LPDDR5X ECC memory that delivers up to 1 TB/s of memory bandwidth, twice that of other data center processors that will support DDR5 memory.
And make no mistake, that enhanced memory throughput plays right to the strengths of the Grace CPU Superchip in the Weather Research and Forecasting (WRF) model. Nvidia says that its simulations of the 144-core Grace chip show that it will be 2X faster and provide 2.3X the power efficiency of two 36-core, 72-thread Intel 'Ice Lake' Xeon Platinum 8360Y processors in the WRF simulation. That means we're seeing 144 Arm threads (each on a physical core) facing off with 144 x86 threads (two threads per physical core, 72 cores in total).
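Those two ratios also imply something about power draw. The Python sketch below works through the arithmetic, assuming that "power efficiency" means performance per watt (Nvidia's exact definition is not given here) and using only the figures quoted above. The variable names are ours.

```python
# Figures quoted in the article (Nvidia's simulated estimates).
PERF_RATIO = 2.0         # Grace vs. two Xeon Platinum 8360Y in WRF
EFFICIENCY_RATIO = 2.3   # assumed to mean performance per watt

# Thread counts on each side of the comparison.
GRACE_THREADS = 144                    # one thread per physical core
XEON_SOCKETS = 2
XEON_CORES_PER_SOCKET = 36
XEON_THREADS_PER_CORE = 2
xeon_threads = XEON_SOCKETS * XEON_CORES_PER_SOCKET * XEON_THREADS_PER_CORE

# efficiency = performance / power, so
# power_ratio = perf_ratio / efficiency_ratio.
power_ratio = PERF_RATIO / EFFICIENCY_RATIO

print(f"Grace threads: {GRACE_THREADS}, Xeon threads: {xeon_threads}")
print(f"Implied Grace power relative to the dual-Xeon system: {power_ratio:.2f}")
```

If that reading of the claim is right, Grace would deliver twice the WRF throughput while drawing roughly 13 percent less power than the dual-socket Ice Lake system. Whether the comparison covers chip power alone or includes memory is not stated.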
The various permutations of WRF are real-world workloads commonly used for benchmarking, and many of the modules have been ported over for GPU acceleration with CUDA. We followed up with Nvidia about this specific benchmark, and the company says this module hasn't yet been ported over to GPUs, so it is CPU-centric. Additionally, it is very sensitive to memory bandwidth, giving Grace a leg up in both performance and efficiency. Nvidia's estimates are "based on standard NCAR WRF, version 3.9.1.1 ported to Arm, for the IB4 model (a 4km regional forecast of the Iberian peninsula)."
Grace's tremendous memory throughput will pay dividends in performance and also in energy efficiency because the increased throughput reduces the number of inactive cycles by keeping the greedy cores fed with data. The chips also use lower-power LPDDR5X compared to Ice Lake's DDR4.
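To put the bandwidth gap in rough numbers, the Python sketch below compares theoretical peak memory bandwidth for the two systems in the WRF comparison. Grace's 1 TB/s figure comes from Nvidia. The Ice Lake side is our assumption, not a number from Nvidia's slide: eight channels of DDR4-3200 per socket, 8 bytes per transfer, and decimal units. Sustained bandwidth in real workloads will be lower on both sides.

```python
# Grace figure quoted by Nvidia: up to 1 TB/s (taken as 1000 GB/s).
GRACE_BW_GBPS = 1000.0
GRACE_CORES = 144

# Assumptions for the dual Xeon Platinum 8360Y system (not from Nvidia's slide).
XEON_SOCKETS = 2
XEON_CORES_PER_SOCKET = 36
DDR4_CHANNELS_PER_SOCKET = 8
DDR4_MTS = 3200              # mega-transfers per second per channel
BYTES_PER_TRANSFER = 8       # 64-bit data path per channel


def peak_bandwidth_gbps(channels: int, mts: int, bytes_per_transfer: int) -> float:
    """Theoretical peak bandwidth in GB/s for one socket."""
    return channels * mts * bytes_per_transfer / 1000


xeon_socket_bw = peak_bandwidth_gbps(
    DDR4_CHANNELS_PER_SOCKET, DDR4_MTS, BYTES_PER_TRANSFER
)
xeon_total_bw = xeon_socket_bw * XEON_SOCKETS
xeon_cores = XEON_SOCKETS * XEON_CORES_PER_SOCKET

total_ratio = GRACE_BW_GBPS / xeon_total_bw
grace_per_core = GRACE_BW_GBPS / GRACE_CORES
xeon_per_core = xeon_total_bw / xeon_cores

print(f"Dual Ice Lake peak: {xeon_total_bw:.1f} GB/s")
print(f"Grace / Ice Lake total bandwidth: {total_ratio:.2f}X")
print(f"Per core: Grace {grace_per_core:.2f} GB/s, Ice Lake {xeon_per_core:.2f} GB/s")
```

Under these assumptions, Grace has well over twice the aggregate bandwidth of the dual-socket Ice Lake system, but because it also has twice as many physical cores, the per-core advantage is much smaller.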
However, Grace likely won't have as much of an advantage against Intel's upcoming Sapphire Rapids — these chips support DDR5 memory and also have variants with HBM memory that could help counter Grace's strengths in some memory-bandwidth-starved applications. AMD also has its Milan-X with 3D-stacked L3 cache (3D V-Cache) that benefits some workloads, and we expect the company will make similar SKUs for the EPYC Genoa family.
It's telling that Nvidia used benchmarks showing a 1.5X gain over AMD's prior-gen EPYC Rome for its headline benchmark comparisons at GTC and in its press releases rather than its larger 2X gain over Intel's current-gen Ice Lake, which it buried in a GTC presentation. Given that AMD is the leader in the data center, perhaps Nvidia felt that even managing to beat up on its previous-gen chips was more impressive than taking down Intel's current-gen finest.
Either way, Nvidia still has a use for Intel's silicon. For example, Nvidia's Jensen Huang told us during a recent roundtable that "[...] If not for Intel's CPUs in our Omniverse computers that are coming up, we wouldn't be able to do the digital twin simulations that rely so deeply on the single-threaded performance that they're really good at."
In fact, those very Nvidia OVX servers use two of Intel's 32-core Ice Lake 8362 processors apiece, and they're obviously selected because they are more agile in single-threaded work than AMD's EPYC, at least for this specific use case. Interestingly, Nvidia has yet to share any projections of Grace's prowess in single-threaded work, instead preferring to show off its sheer threaded heft for now.
There will certainly be interesting times ahead as a new and very serious contender enters the data center CPU race, this time with a specialized Arm design that's tightly integrated with what is fast becoming the most important number cruncher of all in the data center: the GPU.
Overall, Nvidia claims the Grace CPU Superchip will be the fastest processor on the market when it ships in early 2023 for a wide range of applications, like hyperscale computing, data analytics, and scientific computing. Regardless of how well Nvidia's Grace CPU Superchip performs relative to the other data center chips in 2023, there will certainly be plenty of choice in the years ahead, specifically for the many HPC workloads that already run on Arm. Given the recent explosion of new Arm-based chips in the data center, we expect the list of those workloads to grow quickly.