Nvidia may postpone volume ramp-up of Blackwell machines: TrendForce [Updated]

Dell servers based on Nvidia GB200
(Image credit: CoreWeave)

Edit: 12/20/2024 1:00pm PT: Corrected erroneous 'overheating' reference and added expanded reference to Nvidia's latest known statements.

Amended article:  

Nvidia may have to postpone the volume ramp of next-generation AI servers based on the B200 and GB200 platforms due to high power consumption and the need to optimize interconnections, according to a TrendForce report. The market research firm believes that mass production and peak shipments of Blackwell machines will occur sometime in mid-2025, which would mean a nearly half-year delay. Nvidia has yet to confirm or deny the claims, but at its last earnings call (which predates the TrendForce report), the company said that Blackwell is in production and already in the hands of some customers.

As expected, Nvidia and its partners can ship only limited quantities of Blackwell-based servers in 2024, as the company will have to use its low-yielding B200 for them. Even so, Dell is already shipping Blackwell server racks. Although refined versions of Nvidia's B200 processors entered mass production in October and, therefore, will get to the company's hands in January, TrendForce doesn't expect shipments of Blackwell-based servers to skyrocket immediately. According to TrendForce, due to higher power consumption and requirements for higher-speed interconnects, mass production and peak shipments of B200 and GB200 will occur between the second and third quarters of 2025.

Just several months ago, it was reported that an Nvidia NVL72 rack based on the GB200 platform with 72 B200 GPUs would consume 120 kW of power, which is already significantly higher than current AI server racks (typical high-density rack power is up to 20 kW, while an H100-based rack reportedly consumes around 40 kW). TrendForce now claims that Nvidia has updated the specification of the device, and that it now consumes 140 kW, which is more than typical data centers can provide to a single rack.

The problem is that Nvidia's Blackwell GPUs were reportedly prone to overheating in servers equipped with 72 processors even at up to 120 kW per rack. This issue has reportedly forced Nvidia to repeatedly revise its server rack designs, as overheating not only reduces GPU performance but also risks hardware damage. Power consumption of 140 kW per rack means further alterations to server designs, which could result in setbacks.

Increased power consumption could mean additional cooling requirements. Liquid cooling is essential for Blackwell servers, but modern sidecar coolant distribution units (CDUs) can only handle 60 kW to 80 kW of thermal power. To that end, cooling system providers are optimizing cold plate designs and aiming to double or triple the capacity of CDUs. TrendForce expects the performance of liquid-to-liquid in-row CDUs to exceed 1.3 MW, with further advancements possible, so excessive heat dissipation will eventually cease to be a major problem.
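To put the cooling figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted in this article (120 kW and 140 kW per rack, 60 kW to 80 kW per sidecar CDU, and 1.3 MW, read as megawatts, per in-row CDU). It assumes, purely for illustration, that all rack power turns into heat that the liquid loop must remove; real deployments still move part of the load with air, so treat the results as an upper bound rather than a design figure. The function and variable names are hypothetical.

```python
import math

# Figures quoted in the article, all in kilowatts.
RACK_SPECS_KW = {"earlier spec": 120, "revised spec (per TrendForce)": 140}
SIDECAR_CDU_RANGE_KW = (60, 80)
IN_ROW_CDU_KW = 1300  # 1.3 MW, the floor TrendForce expects in-row CDUs to exceed


def cdus_per_rack(rack_kw, cdu_kw):
    """Sidecar CDUs needed per rack.

    Assumption: 100% of rack power must be removed as heat by liquid cooling.
    """
    return math.ceil(rack_kw / cdu_kw)


def racks_per_in_row_cdu(rack_kw, cdu_kw=IN_ROW_CDU_KW):
    """Whole racks a single in-row CDU could serve under the same assumption."""
    return cdu_kw // rack_kw


if __name__ == "__main__":
    low_kw, high_kw = SIDECAR_CDU_RANGE_KW
    for label, rack_kw in RACK_SPECS_KW.items():
        print(
            f"{label}: {rack_kw} kW per rack, "
            f"{cdus_per_rack(rack_kw, high_kw)} to {cdus_per_rack(rack_kw, low_kw)} "
            f"sidecar CDUs, {racks_per_in_row_cdu(rack_kw)} racks per 1.3 MW in-row CDU"
        )
```

Under these assumptions, a 140 kW rack would exceed what any single sidecar unit in the quoted range can handle, which is consistent with vendors aiming to double or triple CDU capacity.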

However, according to the report, power consumption and heat management are not the only issues that Nvidia and its partners have to solve. TrendForce claims that Nvidia has to optimize its interconnections, but it didn't elaborate on which interconnections must be optimized. 

It remains to be seen how the claimed teething problems with Nvidia's B200 and GB200 servers affect the launch timeframe and availability of B200A based on simplified Blackwell processors and the B300 and GB300 machines featuring refreshed Blackwell GPUs. B200A will likely consume considerably less power than B200/GB200. The refreshed B300-series Blackwell GPUs, by contrast, promise more memory and higher compute performance, which usually comes at higher power, so these products will likely consume even more than 140 kW per rack, necessitating even more sophisticated components and cooling.

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.

  • DS426
    That would be a bit ironic if Blackwell gaming GPUs end up ramping quicker than their AI datacenter-destined siblings. Obviously not something that NVIDIA desires, but they're at least responsible enough to avoid a catastrophe of an AI XPU situation just to chase cash and stock prices.
  • psykhon-
    Another article assuming face value of rumors from a site with very low reputation. "Consequently, TrendForce forecasts" right there, just a supposition.

    Oh boy how I miss the days when TH was a reliable source of tech news...
  • P.Amini
    Just run them with lower clocks and lower voltages, problem solved.
  • Mama Changa
    P.Amini said:
    Just run them with lower clocks and lower voltages, problem solved.
    They already need fp4 benchmarks to give the illusion that Blackwell is massively better than Lovelace. Like-for-like, it's 25% stronger at best.