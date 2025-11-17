As Nvidia ships millions of Grace CPUs and Blackwell AI GPUs to data centers worldwide, the company is hard at work bringing up its next-generation AI and HPC platform, Vera Rubin, which is expected to set a new standard for performance and efficiency. Nvidia's Vera Rubin comprises not one or two, but nine separate processors, each tailored for a particular workload, creating one of the most complex data center platforms ever.

While Nvidia will be disclosing more details about Vera Rubin over the coming year before it officially launches in late 2026, let's recap what we already know about the platform, as the company has revealed a fair few details.

At a glance

On the hardware side, Nvidia's Vera Rubin platform is its next-generation rack-scale AI compute architecture built around a tightly integrated set of components. These include the following: an 88-core Vera CPU, Rubin GPU with 288 GB HBM4 memory, Rubin CPX GPU with 128 GB of GDDR7, NVLink 6.0 switch ASIC for scale-up rack-scale connectivity, BlueField-4 DPU with integrated SSD to store key-value cache, Spectrum-6 Photonics Ethernet and Quantum-CX9 1.6 Tb/s Photonics InfiniBand NICs, as well as Spectrum-X Photonics Ethernet and Quantum-CX9 Photonics InfiniBand switching silicon for scale-out connectivity.

(Image credit: Nvidia/YouTube)

A full NVL144 rack integrates 144 Rubin GPUs (in 72 packages) with 20,736 GB (roughly 20.7 TB) of HBM4 memory and 36 Vera CPUs to deliver up to 3.6 NVFP4 ExaFLOPS for inference and up to 1.2 FP8 ExaFLOPS for training. By comparison, NVL144 CPX achieves almost 8 NVFP4 ExaFLOPS for inference by adding Rubin CPX accelerators, providing even greater compute density.
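The rack-level totals follow from the per-package figures quoted in this article. The short check below is a sketch that assumes decimal units (1 TB = 1,000 GB) and treats the "~16 FP8 PetaFLOPS" figure as exactly 16; the variable names are illustrative only.

```python
# Rack-level totals derived from the per-package figures quoted in the article.
GPU_PACKAGES = 72             # Rubin packages per NVL144 rack
DIES_PER_PACKAGE = 2          # Nvidia counts each compute die as one "GPU"
HBM4_GB_PER_PACKAGE = 288     # HBM4 capacity per package
FP4_PFLOPS_PER_PACKAGE = 50   # NVFP4 inference throughput per package
FP8_PFLOPS_PER_PACKAGE = 16   # assumption: "~16" treated as exactly 16

gpus = GPU_PACKAGES * DIES_PER_PACKAGE
hbm4_gb = GPU_PACKAGES * HBM4_GB_PER_PACKAGE
fp4_exaflops = GPU_PACKAGES * FP4_PFLOPS_PER_PACKAGE / 1000
fp8_exaflops = GPU_PACKAGES * FP8_PFLOPS_PER_PACKAGE / 1000

print(f"GPUs (dies):  {gpus}")
print(f"HBM4:         {hbm4_gb:,} GB (~{hbm4_gb / 1000:.1f} TB)")
print(f"NVFP4:        {fp4_exaflops:.2f} ExaFLOPS")
print(f"FP8:          {fp8_exaflops:.2f} ExaFLOPS")
```

The FP8 result lands slightly below the quoted 1.2 ExaFLOPS, which is consistent with the per-package number being an approximation.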

On the software side, the Rubin generation is optimized for FP4/FP6 precision, million-token context inference, and multi-modal generative workloads. The CPX systems will come with Nvidia's Dynamo inference orchestrator built atop CUDA 13, which is designed to intelligently manage and split inference workloads across different types of GPUs in a disaggregated system.

Additionally, Nvidia's Smart Router and GPU Planner will dynamically balance prefill and decode workloads across Mixture-of-Experts (MoE) replicas to improve utilization and response time. Also, Nvidia's Interconnect Extension Layer (NIXL) enables zero-copy data transfers between GPUs and NICs through InfiniBand GPUDirect Async (IBGDA) to reduce latency and CPU overhead. Meanwhile, NVMe key-value cache offload is said to achieve 50% – 60% hit rates, enabling multi-turn conversational context to persist efficiently. Finally, the new NCCL 2.24 library is expected to reduce small-message latency by 4x, enabling the scaling of trillion-parameter agentic AI models with much faster inter-GPU communication.

Truth be told, these features are not specific to the Vera Rubin platform, but Rubin-class systems benefit the most from them, as the platform was designed explicitly to exploit them at scale. But what is so special about the Vera Rubin platform? Let's dig a little deeper.

The Vera CPU

Nvidia's Vera Rubin NVL144 and Rubin Ultra NVL576 platforms use the company's custom Vera processors, which are designed specifically for data center-grade AI infrastructure and promise twice the performance of their predecessor, Grace.

(Image credit: Nvidia/YouTube)

The CPU packs 88 proprietary Armv9-class cores (a departure from Grace, which uses Arm Neoverse V2 cores) with 2-way simultaneous multithreading (SMT), enabling up to 176 threads to run simultaneously. These new Armv9.2 cores, internally called Olympus, rely on a wide out-of-order pipeline and support a broad set of optional extensions (SVE2, crypto, FP8/BF16, tagging, RNG, LS64, etc.). Nvidia's documents indicate that SMT affects per-thread performance: with two threads active, most pipelines effectively halve per-thread throughput, except for a few that are dedicated to each thread. Developers should therefore decide whether to use SMT for a given workload or keep one thread per core.
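To illustrate the SMT tradeoff, here is a deliberately simplified toy model, not Nvidia's own methodology. It assumes every pipeline is shared between the two threads and that a single thread on its own keeps some fraction of the core busy (`single_thread_utilization`, a hypothetical parameter). Under those assumptions, SMT only pays off when one thread leaves the core underused.

```python
CORES = 88      # Vera core count, per the article
SMT_WAYS = 2    # 2-way simultaneous multithreading


def hardware_threads(cores: int = CORES, smt_ways: int = SMT_WAYS) -> int:
    """Number of hardware threads exposed when SMT is enabled."""
    return cores * smt_ways


def smt_speedup(single_thread_utilization: float) -> float:
    """Aggregate per-core throughput with two threads, relative to one thread.

    Toy model (an assumption, not a measured figure): one thread alone uses
    `single_thread_utilization` of the core's shared pipelines; two threads
    together can use at most 100% of them.
    """
    u = single_thread_utilization
    if not 0 < u <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return min(1.0, SMT_WAYS * u) / u


print(hardware_threads())   # 176
print(smt_speedup(1.0))     # 1.0: core already saturated, SMT adds nothing
print(smt_speedup(0.5))     # 2.0: second thread fills the idle half
```

In this model a compute-bound kernel that already saturates the pipelines gains nothing from the second thread and halves its per-thread throughput, whereas a stall-heavy workload can approach a twofold gain per core.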

Nvidia continues to use its Scalable Coherency Fabric (SCF) within the CPU to tie cores and memory controllers together, but this time, the CPU's memory bandwidth reaches 1.2 TB/s, 20% higher than Grace. As for system memory, Vera continues to use LPDDR5X, but now uses SOCAMM2 modules for extra density.

Vera uses NVLink-C2C as the coherent CPU-to-GPU link, the same technology as in Grace Blackwell, but with higher bandwidth. Grace offers 900 GB/s of bidirectional bandwidth, whereas on the Vera Rubin platform the bandwidth will double to around 1.8 TB/s per CPU.

The recently released images of the Vera processor show that the CPU does not appear to feature a monolithic design but a multi-chiplet design, as it has visible internal seams. One image shows that the Vera CPU has a distinct I/O chiplet located next to it. Also, the image shows green features emanating from the I/O pads of the CPU die; their purpose is unknown. Perhaps some of Vera's I/O capabilities are enabled by external chiplets beneath the CPU, but this is merely speculation.

Publicly, there are still big gaps in information about Nvidia's Vera CPU. There are no official clock speeds, per-core cache sizes, exact L2/L3 topology, or TDP. We also have limited information on NUMA/socket configurations outside the NVL144/NVL576 rack context.

The Rubin GPU

The Rubin GPU is, without any doubt, the heart (or hearts, as there are two of them per board) of Nvidia's Vera Rubin platform. The first Rubin GPU (let us call it R200) features two near-reticle-sized compute tiles manufactured on a 3nm-class TSMC process technology, a pair of dedicated I/O dies, and 288 GB of 6.4 GT/s HBM4 memory arranged in eight stacks, offering roughly 13 TB/s of aggregate bandwidth. Note that starting from R200, Nvidia will count GPU dies, not GPU packages, as 'GPUs'. Thus, although the NVL144 platform carries 72 GPU packages, Nvidia now counts them as 144 GPUs.
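The quoted bandwidth of roughly 13 TB/s can be reproduced from the stack count and pin speed. The sketch below assumes the 2,048-bit per-stack interface defined by the JEDEC HBM4 standard, a value this article does not state, and uses decimal units.

```python
STACKS = 8               # HBM4 stacks per R200 package, per the article
PIN_RATE_GTPS = 6.4      # data rate per pin in GT/s, per the article
INTERFACE_BITS = 2048    # assumption: JEDEC HBM4 interface width per stack
CAPACITY_GB = 288        # total HBM4 capacity per package, per the article

# GT/s x bits per transfer / 8 bits per byte = GB/s per stack
per_stack_gbs = PIN_RATE_GTPS * INTERFACE_BITS / 8
total_tbs = STACKS * per_stack_gbs / 1000
per_stack_capacity_gb = CAPACITY_GB / STACKS

print(f"Per-stack bandwidth: {per_stack_gbs:.1f} GB/s")
print(f"Aggregate bandwidth: {total_tbs:.1f} TB/s")
print(f"Per-stack capacity:  {per_stack_capacity_gb:.0f} GB")
```

This works out to about 1.6 TB/s and 36 GB per stack, or about 13.1 TB/s per package.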

(Image credit: Nvidia)

Rubin GPUs are designed to push low-precision AI throughput for inference and agentic AI even further while also boosting training performance significantly compared to Blackwell Ultra. Nvidia promises 50 FP4 PetaFLOPS and ~16 FP8 PetaFLOPS of performance per R200 GPU, which is 3.3 and 1.6 times higher than Blackwell Ultra, respectively. Nvidia has not yet outlined performance for higher-precision formats, but substantial generational gains are naturally anticipated.
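Working backward from these ratios gives the Blackwell Ultra baseline they imply. The numbers below are derived purely from the figures in this article, with "~16" treated as exactly 16; they are not official Blackwell Ultra specifications.

```python
# Rubin (R200) per-package throughput and quoted generational ratios.
R200_FP4_PFLOPS = 50
R200_FP8_PFLOPS = 16     # assumption: "~16" treated as exactly 16
FP4_RATIO = 3.3          # R200 vs. Blackwell Ultra, FP4
FP8_RATIO = 1.6          # R200 vs. Blackwell Ultra, FP8

# Implied baseline = new throughput / generational ratio
implied_bu_fp4 = R200_FP4_PFLOPS / FP4_RATIO
implied_bu_fp8 = R200_FP8_PFLOPS / FP8_RATIO

print(f"Implied Blackwell Ultra FP4: ~{implied_bu_fp4:.1f} PetaFLOPS")
print(f"Implied Blackwell Ultra FP8: ~{implied_bu_fp8:.1f} PetaFLOPS")
```

The implied baseline is roughly 15 FP4 PetaFLOPS and 10 FP8 PetaFLOPS per package, which shows that the larger gain is concentrated in the lowest-precision format.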

Performance improvements will come with a clear tradeoff: power draw. Current guidance points to roughly 1.8 kW per GPU, which raises both infrastructure and cooling demands for large clusters. Yet, a 0.4 kW per GPU increase seems insignificant when 1.6X – 3.3X performance gains are present. Nvidia's Vera Rubi