Nvidia reportedly mulls socketed design for Blackwell B300 AI GPUs — next-gen Blackwell GPUs may be removable by the user
GPUs that could be removed from the motherboard, similarly to CPUs.
Nvidia is considering adopting a socket design for at least some of its upcoming Blackwell B300 GPUs for AI and HPC applications, according to a report from TrendForce that cites Economic Daily News and MoneyDJ. The company is said to be adopting the new socketed design for something codenamed GB300, and for now, the information looks unconvincing, to put it mildly. Yet, given that there is supply chain chatter, it is at least worth considering.
MoneyDJ reports that, given the failure rates of AI GPUs under high loads, the replacement costs of motherboards, and cooling challenges, Nvidia and other AI GPU designers might consider using socket designs for their next generation of GPUs instead of soldering GPUs to motherboards.
EDN cites Chen Shuowen, an analyst with CLSA, as saying that based on supply chain checks, Nvidia has been designing GPU sockets for its products, possibly starting with the GB200 Ultra. Chen reportedly mentioned a 4-way Nvidia GPU design with one Nvidia CPU. Neither of the reports mentions anything called GB300, so TrendForce has added this part, possibly based on some additional chatter.
Several things about the reports should be noted. Socketed designs would add to power and cooling challenges rather than help solve them, so the first report is inaccurate. The most power-hungry GPUs usually use BGA packaging.
A 4-way Blackwell GPU board with one CPU does not look extraordinary in itself, considering that with DGX servers, we see an 8-way GPU baseboard and a 2-way CPU motherboard. A socketed version of such a design, however, is hard to believe.
Nvidia's data center nomenclature distinguishes between the company's standalone GPUs (A100, H100, B100/B200) and its Grace CPU + GPU platforms (GH200, GB200). For now, GB200 platforms use BGA packaging for both the CPU and the GPU; we are not sure anything has to change with the B200 Ultra refresh, especially with the possible GB200 Ultra refresh sometime in the second half of the year.
We all love standard CPU sockets for their easy repairs and upgradeability. But in servers, they take up more space and have more power and thermal constraints than BGA packages or SXM/OAM modules. While the modules provide repairability, the process might vary depending on the specific motherboard design, and removing an OAM/SXM module requires careful handling, so they are not as convenient to service as sockets.
There is another point to make. Add-in cards, SXM, and OAM modules are hard and expensive to make, and for now, most Nvidia SXM modules are made by Foxconn. Migrating from a card or a module to a socket cuts costs but limits performance.
Blackwell hardware possibilities
Before moving on to the alleged Blackwell-based data center product (GB300, GB200 Ultra, whatever) featuring a socketed GPU, let us recall which Blackwell-based data center GPUs Nvidia has already introduced.
By now, Nvidia has formally introduced its B200 GPU (1,000W+), which will be used on GB200 boards (codenamed Bianca, with one Grace CPU and two Blackwell GPUs, and Ariel, with one Grace CPU and one Blackwell GPU) and will come in a BGA form factor. Nvidia also has Umbriel GPU boards supporting eight B200 (1,000W) or B100 (700W) GPUs in the SXM module form factor. In addition, there are GB200 platforms codenamed Miranda (which adds performance (think higher TDP), PCIe 6.0, and 800G networking) and Oberon, according to SemiAnalysis.
While there are Nvidia H100 and even H200 add-in cards (based on the Hopper architecture) with lowered performance to fit into typical power and thermal budgets provided by classic servers, Nvidia has never announced any add-in cards featuring Blackwell-based GPUs.
Yet, based on unofficial information, we know that Nvidia is prepping its codenamed B200A product based on the monolithic B102 processor with four HBM3E memory stacks connected using TSMC's CoWoS-S packaging technology. This is in contrast to dual-die B100/B200 designs that are packaged together using TSMC's CoWoS-L and then connected to eight stacks of HBM3E memory.
Given that the alleged B200A is a single-die product not designed to be a performance champion, it could adopt multiple form factors. These include an SXM modular design (especially in its China-specific B20 form) and an add-in-card form factor. Could it be a socket? Perhaps. Intel has built its socketed Xeon CPU Max 9480 'Sapphire Rapids' with HBM onboard, and it was not a success beyond a select supercomputing audience. Does Nvidia want to build something similar? We will see about that.
Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.
BoobleKooble: hardware companies, of course as everyone knows, want you locked into their "jail".
they want "hardware as a service" and they want application development to be as "locked" to their hardware as with classic proprietary viciously closed source video game consoles
that is the "wet dream" of technology executives
going in the opposite direction will never happen.
socketing accelerators (any/all bus types), if it happens, will be functionally the equivalent of SNES video game cartridge sockets - you still have to buy the licensed hardware
in this sense, socketing is merely a component of "hardware as a service" business model