Nvidia launches Vera Rubin NVL72 AI supercomputer at CES — promises up to 5x greater inference performance and 10x lower cost per token than Blackwell, coming 2H 2026
Where we’re going, you’re gonna need way more tokens
AI is everywhere at CES 2026, and Nvidia GPUs are at the center of the expanding AI universe. Today, during his CES keynote, CEO Jensen Huang shared his plans for how the company will remain at the forefront of the AI revolution as the technology reaches far beyond chatbots into robotics, autonomous vehicles, and the broader physical world.
First up, Huang officially launched Vera Rubin, Nvidia's next-gen AI data center rack-scale architecture. Rubin is the result of what the company calls "extreme co-design" across six types of chips: the Vera CPU, the Rubin GPU, the NVLink 6 switch, the ConnectX-9 SuperNIC, the BlueField-4 data processing unit, and the Spectrum-6 Ethernet switch. Those building blocks all come together to create the Vera Rubin NVL72 rack.
Demand for AI compute is insatiable, and each Rubin GPU promises much more of it for this generation: 50 PFLOPS of inference performance with the NVFP4 data type, 5x that of Blackwell GB200, and 35 PFLOPS of NVFP4 training performance, 3.5x that of Blackwell. To feed those compute resources, each Rubin GPU package has eight stacks of HBM4 memory delivering 288GB of capacity and 22 TB/s of bandwidth.
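To put those per-GPU figures in rack-scale terms (our back-of-the-envelope math, not Nvidia's): 72 Rubin GPUs x 50 PFLOPS works out to 3,600 PFLOPS, or 3.6 exaFLOPS, of NVFP4 inference compute; 72 x 35 PFLOPS is roughly 2.5 exaFLOPS for training; and 72 x 288GB of HBM4 comes to about 20.7 TB of memory with roughly 1.6 PB/s of aggregate bandwidth. Those figures line up with the rack-level totals Nvidia quotes further down.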
Per-GPU compute is just one building block in the AI data center. As leading large language models have shifted from dense architectures, which activate every parameter to produce a given output token, to mixture-of-experts (MoE) architectures, which only activate a portion of the available parameters per token, it has become possible to scale up those models relatively efficiently. However, communication among the experts within those models requires vast amounts of inter-node bandwidth.
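For a simplified illustration of why that matters (our hypothetical numbers, not Nvidia's): a 1-trillion-parameter MoE model that routes each token through, say, two of 64 experts only does arithmetic on a few tens of billions of parameters per token, a fraction of the compute a dense model of the same size would need. The catch is that the experts a given token needs may live on different GPUs, so generating each token can kick off traffic across the rack's fabric, which is why Nvidia keeps pushing scale-up bandwidth so hard.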
Vera Rubin introduces NVLink 6 for scale-up networking, which boosts per-GPU fabric bandwidth to 3.6 TB/s (bi-directional). Each NVLink 6 switch boasts 28 TB/s of bandwidth, and each Vera Rubin NVL72 rack has nine of these switches for 260 TB/s of total scale-up bandwidth.
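Those numbers hang together arithmetically: 72 GPUs x 3.6 TB/s comes to 259.2 TB/s, which rounds to the 260 TB/s rack total, and dividing that across nine switches works out to just under 29 TB/s per switch, consistent with the quoted per-switch figure once rounding is accounted for.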
The Nvidia Vera CPU implements 88 custom Olympus Arm cores with what Nvidia calls "spatial multi-threading," for up to 176 threads in flight. The NVLink C2C interconnect used to coherently connect the Vera CPU to the Rubin GPUs has doubled in bandwidth, to 1.8 TB/s. Each Vera CPU can address up to 1.5 TB of SOCAMM LPDDR5X memory with up to 1.2 TB/s of memory bandwidth.
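Assuming the NVL72 keeps the familiar arrangement of two GPUs per CPU, that's 36 Vera CPUs per rack, and 36 x 1.5 TB of LPDDR5X lands on the 54 TB of CPU-attached memory Nvidia cites for the full system (our arithmetic, not an Nvidia-confirmed configuration).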
To scale out Vera Rubin NVL72 racks into DGX SuperPods of eight racks each, Nvidia is introducing a pair of Spectrum-X Ethernet switches with co-packaged optics, both built around its Spectrum-6 switch chip. Each Spectrum-6 chip offers 102.4 Tb/s of bandwidth.
The SN688 boasts 409.6 Tb/s of bandwidth for 512 ports of 800G Ethernet or 2,048 ports of 200G. The SN6810 offers 102.4 Tb/s of bandwidth that can be channeled into 128 ports of 800G or 512 ports of 200G Ethernet. Both switches are liquid-cooled, and Nvidia claims better power efficiency, reliability, and uptime, presumably compared with switches that lack silicon photonics.
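The port counts fall straight out of the bandwidth math: 409.6 Tb/s divided by 800 Gb/s gives the SN688's 512 ports (or 2,048 ports at 200 Gb/s), while 102.4 Tb/s divided by 800 Gb/s gives the SN6810's 128 ports (or 512 at 200 Gb/s). The 4:1 ratio between the two also suggests the SN688 aggregates four Spectrum-6 chips to the SN6810's one, though that's our inference from the numbers rather than a confirmed detail.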
As context windows grow to millions of tokens, Nvidia says that operations on the key-value cache that holds the history of interactions with an AI model become the bottleneck for inference performance. To break through that bottleneck, Nvidia is using its next-gen BlueField 4 DPUs to create what it calls a new tier of memory: the Inference Context Memory Storage Platform.
The company says this tier of storage is meant to enable efficient sharing and reuse of key-value cache data across AI infrastructure, which it claims will improve responsiveness and throughput while letting agentic AI deployments scale predictably and power-efficiently.
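To get a feel for why the KV cache becomes a storage problem at all, consider a rough, hypothetical sizing exercise (our illustration, not Nvidia's): the cache grows linearly with context, at roughly 2 (keys and values) x layers x KV heads x head dimension x bytes per value for each token. A model with 64 layers, eight KV heads, a head dimension of 128, and FP8 values would accumulate about 128KB per token, or on the order of 130GB for a million-token context, per conversation. At that scale, keeping hot cache data in GPU memory alone stops being practical, which is the gap this DPU-managed storage tier is meant to fill.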
For the first time, Vera Rubin also extends Nvidia's trusted execution environment to the entire rack, securing data at the chip, fabric, and network levels, which Nvidia says is key to keeping frontier AI labs' precious state-of-the-art models secret and secure.
All told, each Vera Rubin NVL72 rack offers 3.6 exaFLOPS of NVFP4 inference performance, 2.5 exaFLOPS of NVFP4 training performance, 54 TB of LPDDR5X memory connected to the Vera CPUs, and 20.7 TB of HBM4 offering 1.6 PB/s of bandwidth.
To keep those racks productive, Nvidia highlighted several reliability, availability, and serviceability (RAS) improvements at the rack level, such as a cable-free modular tray design that enables much quicker swapping of components versus prior NVL72 racks, improved NVLink resiliency that allows for zero-downtime maintenance, and a second-generation RAS engine that allows for zero-downtime health checks.
All of this raw compute and bandwidth is impressive on its face, but the total cost of ownership picture is likely most important to Nvidia's partners as they ponder massive investments in future capacity. With Vera Rubin, Nvidia says it takes only one-quarter as many GPUs to train MoE models as it does with Blackwell, and that Rubin can cut the cost per token for MoE inference by as much as 10x across a broad range of models. Inverting those figures, a given number of Rubin GPUs should deliver roughly 4x the training throughput of the same number of Blackwell GPUs, and the same rack space should serve vastly more tokens for the same spend.
Nvidia says it's gotten all six of the chips it needs to build Vera Rubin NVL72 systems back from the fabs and that it's pleased with the performance of the workloads it's running on them. The company expects that it will ramp into volume production of Vera Rubin NVL72 systems in the second half of 2026, which remains consistent with its past projections regarding Rubin availability.

As the Senior Analyst, Graphics at Tom's Hardware, Jeff Kampman covers everything to do with GPUs, gaming performance, and more. From integrated graphics processors to discrete graphics cards to the hyperscale installations powering our AI future, if it's got a GPU in it, Jeff is on it.
Stomx: What about FP32 and FP64? Will these be included in every chip, or will there be special versions of this GPU for HPC and supercomputers?
timsSOFTWARE: This is also the problem with these companies stockpiling hardware. It's a depreciating asset, and it becomes effectively worthless after five years or so because of the energy cost and space required versus replacement hardware. Nvidia is saying this new gen will be 10x more energy-efficient than the one before it. If that happens again in a few more years, what will Blackwell hardware be worth when the new stuff can fit in 5% of the space and use 1% of the power?

And the majority of the AI industry seems to be betting on having a use for continued scale, and relatively quick results, but I agree with the various industry voices saying that scale is pretty much played out already and that additional improvements are probably going to take some time; the limitation now is more about ideas than hardware.
bit_user:

timsSOFTWARE said: "Nvidia is saying this new gen will be 10x more energy efficient than the one before it."

You have to take their numbers with a grain of salt. They're pros at cherry-picking and using every available trick to make their numbers as big as possible.

timsSOFTWARE said: "If that happens again in a few more years - what will Blackwell hardware be worth, when the new stuff can fit in 5% of the space and use 1% of the power?"

They were focused on inferencing. The improvements for training will be much less. Hence, some existing hardware that's used for inferencing could be re-targeted towards training and still remain viable for perhaps a couple more generations.
That said, I agree that this hardware has a limited window for seeing a return-on-investment and there's barely enough power for datacenters as-is. It seems like they'll be forced to decommission some not-so-old hardware, in order to make room & power/cooling budgets for new stuff.
Some macroeconomic butterfly is going to flap its wings, somewhere, and the cascading effect through the economy will squeeze these guys by shutting off their funding to do more GPU buys and datacenter buildouts. Then, things are going to get interesting - especially if some of the biggest AI companies are unable even to service their debts.
alan.campbell99: "Demand for AI compute is insatiable" based on what? How much debt OpenAI and Anthropic are taking on? I've seen reporting that, out of 400+ million paid 365 users, only 8 million are paying for Copilot.