Chinese Firm Biren's New GPUs Have 77 Billion Transistors, 2 PFLOPS of AI Performance

(Image credit: Biren Technology)

Biren Technology has formally introduced its first GPUs designed primarily for artificial intelligence (AI) and high-performance computing (HPC). According to the company, the top-of-the-range BR100 GPU can challenge Nvidia's A100 and even H100 chips in certain workloads, while its complexity is comparable to that of Nvidia's H100 compute GPU.

Biren's initial family of compute GPUs includes two chips. The BR100 promises up to 256 FP32 TFLOPS or 2 PetaOPS of INT8 performance, whereas the BR104 is rated for up to 128 FP32 TFLOPS or 1 PetaOPS of INT8 performance.

The top-of-the-range BR100 comes with 64GB of HBM2E memory on a 4096-bit interface (1.64 TB/s), while the midrange BR104 gets 32GB of HBM2E memory on a 2048-bit interface (819 GB/s).
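For readers who want to sanity-check the quoted bandwidth figures, here is a minimal back-of-the-envelope sketch, assuming a 3.2 Gbps per-pin data rate (a common HBM2E speed grade; Biren has not published the exact figure):

```python
# Back-of-the-envelope check of the quoted HBM2E bandwidth figures.
# Peak bandwidth (GB/s) = bus width (bits) x per-pin data rate (Gbps) / 8.
# The 3.2 Gbps per-pin rate is an assumption, not a Biren disclosure.

def hbm_bandwidth_gbs(bus_width_bits: int, pin_rate_gbps: float = 3.2) -> float:
    """Peak memory bandwidth in GB/s for a given bus width and per-pin rate."""
    return bus_width_bits * pin_rate_gbps / 8

print(hbm_bandwidth_gbs(4096))  # BR100: 1638.4 GB/s, i.e. ~1.64 TB/s
print(hbm_bandwidth_gbs(2048))  # BR104: 819.2 GB/s
```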

|                     | Biren BR104 | Biren BR100 | Nvidia A100 | Nvidia H100 |
|---------------------|-------------|-------------|-------------|-------------|
| Form Factor         | FHFL Card   | OAM Module  | SXM4        | SXM5        |
| Transistor Count    | ?           | 77 billion  | 54.2 billion| 80 billion  |
| Node                | N7          | N7          | N7          | 4N          |
| Power               | 300W        | 550W        | 400W        | 700W        |
| FP32 TFLOPS         | 128         | 256         | 19.5        | 60          |
| TF32+ TFLOPS        | 256         | 512         | ?           | ?           |
| TF32 TFLOPS         | ?           | ?           | 156/312*    | 500/1000*   |
| FP16 TFLOPS         | ?           | ?           | 78          | 120         |
| FP16 TFLOPS Tensor  | ?           | ?           | 312/624*    | 1000/2000*  |
| BF16 TFLOPS         | 512         | 1024        | 39          | 120         |
| BF16 TFLOPS Tensor  | ?           | ?           | 312/624*    | 1000/2000*  |
| INT8 TOPS           | 1024        | 2048        | ?           | ?           |
| INT8 TOPS Tensor    | ?           | ?           | 624/1248*   | 2000/4000*  |

* With sparsity

Both chips support the INT8, FP16, BF16, FP32, and TF32+ data formats, so we are not talking about supercomputing formats such as FP64, even though Biren says its TF32+ format provides higher data precision than traditional TF32. Meanwhile, the BR100 and BR104 offer rather formidable peak performance numbers. In fact, if the company had incorporated GPU-specific functionality (texture units, render back ends, and so on) into its compute GPUs and designed proper drivers, these chips would have made rather incredible graphics cards (at least the BR104, which is presumably a single-chip design).
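For context, here is a small sketch of the bit layouts of the publicly documented formats mentioned above. Biren has not disclosed how TF32+ is laid out, only that it carries more precision than Nvidia's 19-bit TF32, so the sketch covers the standard formats only:

```python
# Bit allocation (sign, exponent, mantissa) of the publicly documented
# formats named above. The layout of Biren's TF32+ is not disclosed; the
# company only claims it offers more precision than Nvidia's TF32.

FORMATS = {
    "FP32": (1, 8, 23),  # IEEE single precision
    "TF32": (1, 8, 10),  # Nvidia's 19-bit tensor format: FP32 range, FP16 precision
    "BF16": (1, 8, 7),   # FP32 range, reduced precision
    "FP16": (1, 5, 10),  # IEEE half precision
}

for name, (sign, exp, mantissa) in FORMATS.items():
    print(f"{name}: {sign + exp + mantissa} bits total, {mantissa}-bit mantissa")
```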

In addition to their compute capabilities, Biren's GPUs also support H.264 video encoding and decoding.

(Image credit: Biren Technology)

Biren's BR100 will be available in an OAM form factor and consume up to 550W of power. The chip supports the company's proprietary 8-way BLink technology, which allows the installation of up to eight BR100 GPUs per system. By contrast, the 300W BR104 will ship as an FHFL dual-wide PCIe card and support up to 3-way multi-GPU configurations. Both chips use a PCIe 5.0 x16 interface with the CXL protocol for accelerators on top, reports EETrend (via VideoCardz).
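For reference, a PCIe 5.0 x16 link tops out at roughly 63 GB/s per direction. The short sketch below derives that figure from the standard spec numbers (32 GT/s per lane, 128b/130b encoding); it is not a Biren-specific disclosure, and the bandwidth of the BLink interconnect itself has not been published:

```python
# Usable one-direction bandwidth of a PCIe 5.0 link: 32 GT/s per lane
# with 128b/130b encoding. These are standard spec figures, not numbers
# published by Biren.

def pcie5_bandwidth_gbs(lanes: int) -> float:
    """Approximate per-direction PCIe 5.0 bandwidth in GB/s."""
    return lanes * 32 * (128 / 130) / 8

print(round(pcie5_bandwidth_gbs(16), 1))  # ~63.0 GB/s per direction
```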

(Image credit: Biren Technology)

Biren says that both of its chips are made using TSMC's 7nm-class fabrication process (without elaborating whether it is N7, N7+, or N7P). The larger BR100 packs 77 billion transistors, outweighing the 54.2 billion of Nvidia's A100, which is also made on one of TSMC's N7 nodes. The company also says that, to overcome the limitations imposed by TSMC's reticle size, it had to use a chiplet design and the foundry's CoWoS 2.5D packaging technology. That is completely logical: Nvidia's A100 was already approaching the size of a reticle, and the BR100 is supposed to be even larger given its higher transistor count.

Given the specs, we can speculate that the BR100 essentially combines two BR104s, though the developer has not formally confirmed that.
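The arithmetic behind that speculation is straightforward: every published BR100 figure is exactly double the corresponding BR104 figure, with power being the exception (550W versus 2 x 300W). A quick sketch using the numbers from the table above:

```python
# Every published BR100 spec is exactly double the BR104's, which is what
# the two-chiplet speculation rests on. Figures are taken from the spec table.

BR104 = {"FP32 TFLOPS": 128, "TF32+ TFLOPS": 256, "BF16 TFLOPS": 512,
         "INT8 TOPS": 1024, "HBM2E (GB)": 32, "Memory bus (bits)": 2048}
BR100 = {"FP32 TFLOPS": 256, "TF32+ TFLOPS": 512, "BF16 TFLOPS": 1024,
         "INT8 TOPS": 2048, "HBM2E (GB)": 64, "Memory bus (bits)": 4096}

for spec, value in BR104.items():
    assert 2 * value == BR100[spec], spec
    print(f"{spec}: 2 x BR104 = {2 * value} = BR100")
```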

To commercialize its BR100 OAM accelerator, Biren worked with Inspur on an 8-way AI server that will start sampling in Q4 2022. Baidu and China Mobile will be among the first customers to use Biren's compute GPUs.

(Image credit: Biren Technology)
Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.

  • bit_user
    Well, here we are. I think Nvidia doesn't really need to worry quite yet. The new hardware, and its supporting software stack, are probably still very rough around the edges, and will continue to be so for some time. But Nvidia is finally on-notice.

    Nvidia, AMD, and others have long done some of their chip design work in China (since the mid-2000s, I've heard), so it can hardly come as a surprise that China is finally becoming a formidable competitor.
    Reply
  • Dadata
    Has any of this been independently verified? I find it hard to believe that a chip with 3 billion fewer transistors is able to do 4x more FP32 TFLOPS and consume less power than Nvidia's chip.
    Reply
  • bit_user
    Dadata said:
    Has any of this been independently verified?
    It's hard to tell if this is just a product announcement, or if it's actually now shipping. In any case, I doubt they're available outside of China.

    Dadata said:
    I find it hard to believe that a chip with 3 billion fewer transistors is able to do 4x more FP32 TFLOPS and consume less power than Nvidia's chip.
    Perhaps that number is really more comparable to Nvidia's TF32 metric, at least insofar as it represents more of a corner case than what generic GPU shaders would really be able to achieve.

    Also, note that their big accelerator uses two chips, and they seem to presume linear scaling. On some workloads, that won't happen.
    Reply
  • Mpablo87
    Good Article! !
    Reply