Nvidia announced today that it had submitted its first benchmark results for its Grace Hopper CPU+GPU Superchip and its L4 GPU accelerators to the newest version of MLPerf, an industry-standard AI benchmark that's designed to offer a level playing field for measuring AI performance in different workloads. Today's round of benchmark results marks two notable firsts for the MLPerf benchmark: the addition of a new large language model (LLM) GPT-J inference benchmark and a revamped recommendation model. Nvidia claims the Grace Hopper Superchip delivers up to 17% more inference performance than one of its market-leading H100 GPUs in the GPT-J benchmark and that its L4 GPUs deliver up to 6X the performance of Intel's Xeon CPUs.
The industry is moving at breakneck speed as it rapidly evolves to newer AI models and more powerful implementations. In the same vein, the MLPerf benchmark, managed by the MLCommons body, is constantly evolving to better reflect the changing nature of the AI landscape with its new v3.1 revision.
GPT-J 6B, a summarization model for text used in real-world workloads since 2021, is now used within the MLPerf suite as a benchmark for measuring inference performance. The GPT-J 6-billion-parameter LLM is rather lightweight compared to some of the more advanced AI models, like the 175-billion-parameter GPT-3, but it fits the role of an inference benchmark nicely. This model summarizes blocks of text and operates in both online mode, which is latency sensitive, and offline mode, which is throughput-intensive. The MLPerf suite also now employs a larger DLRM-DCNv2 recommendation model with twice the parameter count, a larger multi-hot dataset, and a cross-layer algorithm that better represents real-world environments.
With that background, we can see some of Nvidia's performance claims here. Be aware that Nvidia itself submits these benchmarks to MLCommons, so they likely represent highly tuned best-case scenarios. Nvidia also likes to point out that it is the only company to submit benchmarks for every AI model used in the MLPerf suite, which is an objectively true statement. Some companies are wholly absent, like AMD, or only submit a few chosen benchmarks, like Intel with Habana and Google with its TPU. Reasons for the lack of submissions vary by company, but seeing more competitors step into the MLPerf ring would be nice.
Nvidia submitted its first GH200 Grace Hopper Superchip MLPerf result, highlighting that the CPU+GPU combo delivers 17% more performance than a single H100 GPU. On the surface, that's surprising, given that the GH200 uses the same silicon as the H100 GPU, but we'll explain why below. Naturally, Nvidia's systems armed with eight H100s outperformed the Grace Hopper Superchip, taking the lead in every inference test.
As a reminder, the Grace Hopper Superchip combines a Hopper GPU and the Grace CPU on the same board, connected by a C2C link with 900GB/s of throughput between the two units. That is 7x the bandwidth of a typical PCIe connection for CPU-to-GPU data transfers, which boosts the GH200's accessible memory bandwidth. The link is augmented by a coherent memory pool that includes 96GB of HBM3 memory and 4TB/s of GPU memory bandwidth. In contrast, the comparison H100 tested in the HGX has only 80GB of HBM3 (next-gen Grace Hopper models will have 144GB of 1.7X faster HBM3e in Q2 2024).
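The 7x figure can be sanity-checked with some quick arithmetic. The sketch below is only a back-of-the-envelope check: the article does not say which PCIe link Nvidia uses as its baseline, so the 128GB/s value (roughly a PCIe Gen5 x16 link counted in both directions) is an assumption, as is the 12GB payload, which approximates a 6-billion-parameter model stored at two bytes per parameter. The function names are illustrative.

```python
# Back-of-the-envelope check of the "7x PCIe" bandwidth claim.
# Assumption: the PCIe baseline is a Gen5 x16 link at roughly 128 GB/s
# bidirectional (about 64 GB/s per direction). The article does not name it.

C2C_GBPS = 900.0            # C2C throughput quoted in the article
PCIE_GEN5_X16_GBPS = 128.0  # assumed baseline, not stated in the article


def bandwidth_ratio(link_gbps: float, baseline_gbps: float) -> float:
    """Return how many times faster `link_gbps` is than `baseline_gbps`."""
    if baseline_gbps <= 0:
        raise ValueError("baseline bandwidth must be positive")
    return link_gbps / baseline_gbps


def transfer_time_ms(size_gb: float, link_gbps: float) -> float:
    """Idealized time to move `size_gb` over a link, ignoring latency and overhead."""
    if link_gbps <= 0:
        raise ValueError("link bandwidth must be positive")
    return size_gb / link_gbps * 1000.0


ratio = bandwidth_ratio(C2C_GBPS, PCIE_GEN5_X16_GBPS)
print(f"C2C vs assumed PCIe baseline: {ratio:.2f}x")

# Hypothetical payload: ~12 GB, about 6 billion parameters at 2 bytes each.
payload_gb = 12.0
print(f"C2C:  {transfer_time_ms(payload_gb, C2C_GBPS):.1f} ms")
print(f"PCIe: {transfer_time_ms(payload_gb, PCIE_GEN5_X16_GBPS):.1f} ms")
```

Under that assumed baseline the ratio lands just above 7, in line with Nvidia's claim. Real transfers will be slower than these idealized numbers because protocol overhead and latency are ignored.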
Nvidia also touts a power-shifting tech called Automatic Power Steering that dynamically balances the power budget between the CPU and GPU, steering the spillover budget to whichever unit is under the most load. This tech is used in many competing modern CPU+GPU combos, so it isn't new, but it does allow the GPU onboard the Grace Hopper Superchip to enjoy a higher power delivery budget than it does in the HGX because power can be shifted from the Grace CPU, which isn't possible in a standard server. The full CPU+GPU system ran at a 1000W TDP.
Most inference continues to execute on CPUs, which could change in the future as larger models become more common. Displacing CPUs for these workloads with small low-power GPUs like the L4 is paramount for Nvidia because it would fuel high-volume sales. This round of MLPerf submissions also includes the first results for Nvidia's L4 GPUs, with the inference-optimized card delivering 6X the performance of a single Xeon 9480 in the GPT-J inference benchmark, despite pulling a mere 72W in a slim form factor card that doesn't require an auxiliary power connection.
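To put the 6X claim and the 72W figure together, here is a rough performance-per-watt estimate. It is a sketch under stated assumptions rather than a measured result: the 350W value is the Xeon 9480's rated TDP, which the article does not quote, and rated power is not the same as power actually drawn during the benchmark. The function name is illustrative.

```python
# Rough performance-per-watt comparison from the article's L4 figures.
# Assumption: the Xeon 9480 is charged at its 350 W rated TDP. The article
# does not give CPU power, and TDP is not measured power draw.

L4_SPEEDUP = 6.0         # L4 vs. one Xeon 9480 in GPT-J, per Nvidia
L4_POWER_W = 72.0        # L4 board power quoted in the article
XEON_9480_TDP_W = 350.0  # assumed: rated TDP, not from the article


def perf_per_watt_advantage(speedup: float, accel_watts: float, cpu_watts: float) -> float:
    """Ratio of accelerator perf/W to CPU perf/W, with CPU performance normalized to 1."""
    if accel_watts <= 0 or cpu_watts <= 0:
        raise ValueError("power values must be positive")
    return (speedup / accel_watts) / (1.0 / cpu_watts)


advantage = perf_per_watt_advantage(L4_SPEEDUP, L4_POWER_W, XEON_9480_TDP_W)
print(f"Estimated perf/W advantage: {advantage:.1f}x")
```

The estimate ignores the host CPU, memory, and cooling that an L4 still needs inside a server, so it flatters the GPU somewhat.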
Nvidia also claims up to 120X the performance in a video+AI decode-inference-encode workload against CPUs by measuring the performance of eight L4 GPUs versus two previous-gen Xeon 8380 CPUs, which is a bit lopsided. This is probably intended to be a direct comparison of the sheer amount of computing power that can be crammed into a single chassis. Still, it's noteworthy that quad-socket servers are available despite not being the best fit for the job, and newer Xeon chips would probably perform a bit better in this test. The test configuration is in the small print at the bottom of the slide, so be sure to pay attention to those details.
Finally, Nvidia also submitted benchmarks of its Jetson Orin robotics chips, showing an 84% gain in inference throughput that's largely driven by improvements to software.
It's important to remember that, in the real world, each AI model runs as part of a longer series of models that execute in an AI pipeline to accomplish a specific job or task. Nvidia's illustration captures this nicely, with eight different AI models being run on one query before completion, and it isn't unheard of for these types of AI pipelines to extend up to 15 networks to satisfy a single query. This is important context, as the throughput-oriented benchmarks above tend to focus on running a single AI model at high utilization, as opposed to the real-world pipeline that requires quite a bit more versatility with multiple AI models running serially to complete a given task.
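The gap between a single-model benchmark and a serial pipeline can be illustrated with a toy calculation. Everything in this sketch is hypothetical: the eight stage latencies are invented for illustration and do not come from Nvidia or MLPerf, and the model assumes stages run strictly one after another with no batching.

```python
# Toy model of a serial AI pipeline: one query passes through every stage in turn.
# All latencies below are hypothetical, chosen only to illustrate the arithmetic.

stage_latency_ms = [4.0, 12.0, 7.5, 3.0, 20.0, 6.0, 9.5, 5.0]  # eight made-up models


def end_to_end_latency_ms(stages: list[float]) -> float:
    """Latency of one query when the stages run one after another."""
    return sum(stages)


def pipelined_throughput_qps(stages: list[float]) -> float:
    """Best-case queries per second if each stage has its own hardware:
    the slowest stage sets the pace for the whole pipeline."""
    return 1000.0 / max(stages)


total_ms = end_to_end_latency_ms(stage_latency_ms)
bottleneck_ms = max(stage_latency_ms)

print(f"End-to-end latency: {total_ms:.1f} ms")
print(f"Slowest stage: {bottleneck_ms:.1f} ms ({bottleneck_ms / total_ms:.0%} of the total)")
print(f"Best-case pipelined throughput: {pipelined_throughput_qps(stage_latency_ms):.1f} queries/s")
```

In this toy case, speeding up any single model only trims its own slice of the total, which is why a strong score on one benchmark does not translate directly into faster end-to-end queries.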
Nvidia also announced last week that its TensorRT-LLM software for generative AI workloads delivers optimized performance in inference workloads, overall delivering more than double the performance when used on its H100 GPUs, at no added cost. Nvidia recently provided details about this software and points out that it didn't have this inference-boosting software ready for this round of results; MLCommons requires a 30-day lead time for MLPerf submissions, and TensorRT-LLM wasn't available at the time. That means the results from this inaugural round should improve substantially in the next round of submissions.