China's hybrid-bonded AI accelerators could rival Nvidia's Blackwell GPUs — top semiconductor expert hints at 'fully controllable domestic solution'
Wei Shaojun, vice chairman of the China Semiconductor Industry Association and a professor at Tsinghua University, said at an industry event that AI accelerators built from 14nm logic chiplets and 18nm-class DRAM developed in China could rival Nvidia's Blackwell processors, which are made using a custom 4nm-class process technology at TSMC, reports DigiTimes.
Speaking at the ICC Global CEO Summit, Wei Shaojun said that the key to a breakthrough in performance efficiency would be the advanced 3D stacking used to build such Chinese accelerators.
Wei Shaojun, who previously said that the goals set by China's 'Made in China 2025' program were unachievable, and who later called on the country to stop using foreign AI accelerators like Nvidia's H20 in favor of domestic solutions, described a hypothetical 'fully controllable domestic solution' that would combine 14nm logic with 18nm DRAM using 3D hybrid bonding. There is no evidence that such a solution exists or could be built with technologies currently available in China, so the speech is strictly hypothetical.
According to Wei, this hypothetical configuration is intended to approach the performance of Nvidia's '4nm GPUs' despite using outdated process technologies. He believes such a solution could deliver 120 TFLOPS, though he did not specify the precision, while consuming only about 60W of power, for a performance efficiency of 2 TFLOPS per watt that he says exceeds Intel's Xeon CPUs. To put those numbers into context: Nvidia's B200 processor delivers 10,000 NVFP4 TFLOPS at 1,200W, or 8.33 NVFP4 TFLOPS per watt, while the B300 delivers 10.7 NVFP4 TFLOPS per watt, more than five times what the non-existent AI accelerator could offer.
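For readers who want to check the efficiency math, here is a back-of-the-envelope sketch using only the figures cited above. Note that the precisions differ (Wei's 120 TFLOPS figure is unspecified, while Nvidia's numbers are NVFP4), so the ratios are indicative rather than apples-to-apples:

```python
# Performance efficiency = throughput / power.
# Wei's TFLOPS figure has no stated precision; Nvidia's are NVFP4,
# so these ratios compare unlike quantities and are only indicative.

chips = {
    "Hypothetical CN accelerator": (120, 60),        # (TFLOPS, watts)
    "Nvidia B200 (NVFP4)":         (10_000, 1_200),
}

for name, (tflops, watts) in chips.items():
    print(f"{name}: {tflops / watts:.2f} TFLOPS/W")

# Hypothetical CN accelerator: 2.00 TFLOPS/W
# Nvidia B200 (NVFP4): 8.33 TFLOPS/W
```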
The key technologies meant to significantly improve the performance efficiency of a hypothetical Chinese-developed AI accelerator are 3D hybrid bonding (copper-to-copper and oxide bonding), which replaces solder bumps with direct copper interconnects at sub-10 µm pitches, and near-memory computing. Hybrid bonding at sub-10 µm pitches can enable tens to hundreds of thousands of vertical connections per mm², alongside micrometer-scale signal paths for high-bandwidth, low-latency interconnects.
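To see where those connection counts come from: on a regular grid of bond pads, density scales as the inverse square of the pitch. A minimal sketch, with pitch values chosen for illustration rather than cited by Wei:

```python
# Vertical connection density on a regular grid scales as 1/pitch^2.
# The pitch values below are illustrative; Wei did not cite specific pitches.

for pitch_um in (9.0, 5.0, 2.0):
    pads_per_mm2 = (1_000 / pitch_um) ** 2  # 1 mm = 1,000 um
    print(f"{pitch_um:>4} um pitch -> {pads_per_mm2:,.0f} connections/mm^2")

#  9.0 um pitch -> 12,346 connections/mm^2
#  5.0 um pitch -> 40,000 connections/mm^2
#  2.0 um pitch -> 250,000 connections/mm^2
```

Tightening the pitch from 9 µm to 2 µm moves the density from tens of thousands into the hundreds of thousands per mm², which is the range the claim above implies.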
One of the best-known examples of a 3D hybrid bonding design is AMD's 3D V-Cache, which delivers 2.5 TB/s of bandwidth at 0.05 pJ/bit of I/O energy, so Wei is likely targeting similar figures for his hypothetical design. 2.5 TB/s per device is considerably higher than what an HBM3E stack can deliver, so it could be a breakthrough for AI accelerators that rely on the near-memory computing concept. Wei also said the concept could theoretically scale toward ZetaFLOPS-level performance, though he did not outline when or how such levels could be reached.
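Those two numbers also imply how little power the memory interface itself would draw, which is one reason near-memory computing is attractive. A quick sketch of the arithmetic, using the V-Cache figures above as stand-ins for what a hypothetical near-memory design might target:

```python
# I/O power implied by a hybrid-bonded link: power = bandwidth * energy/bit.
# Figures are AMD's published 3D V-Cache numbers, used here as stand-ins
# for a hypothetical near-memory design; they are not Wei's own targets.

bandwidth_bytes_s = 2.5e12   # 2.5 TB/s
energy_per_bit_j = 0.05e-12  # 0.05 pJ/bit

io_power_w = bandwidth_bytes_s * 8 * energy_per_bit_j
print(f"Interface power at full bandwidth: {io_power_w:.2f} W")  # ~1.00 W
```

Roughly 1W for 2.5 TB/s of memory traffic would leave nearly all of the claimed 60W budget for compute, which is the core of the near-memory argument.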
Wei identified Nvidia's CUDA platform as a key risk not only for the hypothetical alternative he described, but for all non-Nvidia hardware platforms: once software, models, and hardware converge on a single proprietary platform, alternative processors become difficult to deploy. And since he envisioned near-memory computing as the way to significantly increase the competitiveness of AI hardware developed in China, any alternative platform that does not rely on the concept, including Chinese AI accelerators like Huawei's Ascend series or Biren's GPUs, may be a problem in his view.

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and the latest fab tools to high-tech industry trends.