NVIDIA: GPU Supply Issues Involve Packaging, Not Chip Wafers

H100 GPU
(Image credit: Nvidia)

Charlie Boyle, Nvidia's vice president and general manager of HPC-geared DGX systems, has come forward to set the record straight on where exactly the company's GPU volume issues lie. According to Boyle, the problem doesn't come from Nvidia miscalculating demand or from wafer yield issues at its manufacturing partner, TSMC.

Instead, the bottleneck in manufacturing enough GPUs to cater to both consumer and professional workloads (looking at you, AI boom) lies in the chip packaging steps that come after. Nvidia's H-class GPUs make use of TSMC's 2.5D Chip-on-Wafer-on-Substrate (CoWoS) packaging technology, a multi-step, high-precision process whose complexity limits the number of GPUs that can be assembled in a given timeframe. This can disproportionately impact supply; the gap between the number of GPUs required and those available even led Elon Musk to say they were proving "harder to acquire than drugs." We couldn't check that here at Tom's Hardware, but we'll trust Mr. Musk to know that one after Twitter/X procured as many as 10,000 of Nvidia's compute-focused GPUs.

So when people use the word GPU shortage, they're really talking about a shortage of, or a backlog of, some component on the board, not the GPU itself. It's just limited worldwide manufacturing of these things... but we forecast what people want and what the world can build.

Charlie Boyle, VP and GM of Nvidia's DGX

Multiple steps are required, from chip design through manufacturing, before a chip becomes a usable GPU, and a problem at any one of them can constrain supply. For one, oversights during the chip design stage could create a manufacturing bottleneck by bringing a design's yield down (yield being the percentage of usable chips out of a fully etched wafer). A lack of rare earth metals or other materials, such as the recently restricted gallium, would impact other steps in the long logistics chain; so would materials contamination, energy blackouts, and many other factors, as we've already seen happen throughout the years.

But this CoWoS bottleneck may be more severe than expected. TSMC itself has said that it expects it to take 1.5 years (and the completion of additional fabs and the expansion of existing facilities) to bring the packaging backlog back under control. This likely means that Nvidia will have to decide how much packaging capacity to allocate to which products, since there's not enough time and capacity to package them all.

The supply issues may come from TSMC's packaging, but in the end, Nvidia dominates the AI space through its (according to Pat Gelsinger) "incredible execution." TSMC, for its part, is one of the few players with a functional, high-performance packaging technology, which is an absolute requirement for performance scaling. There's definitely a need for more competition in the AI space (and in a good but insufficient sign, AMD gaming GPUs such as the RX 7900 XTX have also been seen heading towards AI datacenters).

But competition is also needed on the manufacturing side of the equation. There's hope that Intel Foundry Services (IFS) will bring another player into the high-performance GPU game; at the same time, eyes are also on Samsung to at least close its manufacturing tech gap relative to TSMC, so that its chips are attractive enough to put another manufacturer on the table.

Francisco Pires
Freelance News Writer

Francisco Pires is a freelance news writer for Tom's Hardware with a soft side for quantum computing.