Huawei's brute force AI tactic seems to be working — CloudMatrix 384 claimed to outperform Nvidia processors running DeepSeek R1

Huawei Ascend AI chip (Image credit: Huawei)

Huawei's CloudMatrix AI cluster takes a comparatively simple approach in its attempt to beat Nvidia, and the company's researchers and an outside firm claim it has worked, at least in one instance. A recent technical paper reports that a cluster of Ascend 910C chips has surpassed the performance of Nvidia's H800 chip in running DeepSeek's R1 LLM.

Huawei published a technical paper in collaboration with Chinese AI startup SiliconFlow, which finds that Huawei's CloudMatrix 384 cluster can outperform Nvidia in running DeepSeek models. The cluster's hardware and software stack was found to outpace systems using Nvidia's H800 chip, a variant of the H100 pared down for export to China, as well as the H100 itself when running DeepSeek's 671-billion-parameter R1 model.

The CloudMatrix 384 is a brute-force solution for the company, which is barred from access to the leading edge of chip production. The CloudMatrix is a rack-scale system that combines 384 dual-chiplet HiSilicon Ascend 910C NPUs with 192 CPUs across 16 server racks, using optical connections for all intra- and inter-server communications to enable blisteringly quick interconnects.

The research paper contends that Huawei's goal with the CM384 was to "reshape the foundation of AI infrastructure," with a Huawei scientist adding that the paper itself was published "to build confidence within the domestic technology ecosystem in using Chinese-developed NPUs to outperform Nvidia’s GPUs."

On paper, the CloudMatrix 384 cluster can put out more raw power than Nvidia's GB200 NVL72 system, delivering 300 PFLOPS of BF16 compute versus the NVL72's 180 BF16 PFLOPS. The Huawei cluster also has software to compete with Nvidia's for LLMs; its CloudMatrix-Infer serving solution was claimed to prefill prompts at 4.45 tokens per second per TFLOPS and generate responses at 1.29 tokens per second per TFLOPS, efficiency the paper claims outpaces Nvidia's SGLang framework.

Of course, the CloudMatrix 384 is not better than Nvidia's solutions across the board, and its major downside is power consumption. The CloudMatrix draws nearly four times the power of Nvidia's GB200 NVL72, consuming a total of 559 kW compared to the NVL72's 145 kW. Cramming more chips into one system lets it surpass Nvidia in raw compute, but at the cost of efficiency: per watt, it delivers roughly 2.3x less performance.
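Those headline figures are easy to sanity-check. The back-of-the-envelope comparison below is a minimal sketch using only the peak compute and power numbers quoted in this article (nameplate ratings, not measured benchmark results from the paper), and it shows where the roughly 1.7x raw-compute lead and the roughly 2.3x efficiency deficit come from.

```python
# Rough comparison using the peak figures quoted above.
# These are nameplate numbers, not measured benchmark results.

cloudmatrix_pflops = 300   # CloudMatrix 384, claimed BF16 compute (PFLOPS)
cloudmatrix_kw = 559       # CloudMatrix 384, total power draw (kW)

nvl72_pflops = 180         # Nvidia GB200 NVL72, BF16 compute (PFLOPS)
nvl72_kw = 145             # Nvidia GB200 NVL72, total power draw (kW)

raw_compute_ratio = cloudmatrix_pflops / nvl72_pflops   # ~1.67x more raw compute
power_ratio = cloudmatrix_kw / nvl72_kw                 # ~3.9x more power drawn

cm_per_watt = cloudmatrix_pflops / cloudmatrix_kw       # ~0.54 PFLOPS per kW
nv_per_watt = nvl72_pflops / nvl72_kw                   # ~1.24 PFLOPS per kW
efficiency_gap = nv_per_watt / cm_per_watt              # ~2.3x in Nvidia's favor

print(f"Raw compute advantage (CloudMatrix): {raw_compute_ratio:.2f}x")
print(f"Power consumption ratio:             {power_ratio:.2f}x")
print(f"Efficiency gap (Nvidia's favor):     {efficiency_gap:.2f}x")
```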

However, Chinese customers interested in the CloudMatrix are banned from accessing Nvidia-powered AI clusters, so these comparisons matter slightly less to them. Not to mention, energy is abundant in mainland China, with electricity prices in the region falling nearly 40% over the last three years.

As Nvidia boss Jensen Huang shared at France's VivaTech earlier this month, Nvidia remains solidly ahead of Huawei chip-for-chip. "Our technology is a generation ahead of theirs," Huang claimed, an assessment Huawei is quick to agree with internally. But, as Huang was also quick to add, "AI is a parallel problem, so if each one of the computers is not capable … just add more computers."

The CloudMatrix, 16 racks big and energy hungry though it is, still presents a compelling choice for Chinese customers looking for peak LLM performance, thanks to its wicked-fast interconnects and its solid software stack. For those looking for a deeper dive into the CloudMatrix 384, our article from its release gets much further into the weeds of what helps the "AI supernode" outpace Nvidia's offerings.


Sunny Grimm
Contributing Writer

Sunny Grimm is a contributing writer for Tom's Hardware. He has been building and breaking computers since 2017, serving as the resident youngster at Tom's. From APUs to RGB, Sunny has a handle on all the latest tech news.

  • setx
    The result was pretty obvious.

    Why such a biased title? Why mention "using four times the energy" without "for 1.7x speed"? It would look very different.

    Also, if you think that a proper interconnect and software stack is "brute force", go ahead and try replicating it.
  • jp7189
    Those are some cherry-picked numbers. NVL72 can do 360 FP16 PFLOPS with sparsity, and scales pretty well, with a doubling of performance for FP8 and another doubling with FP4. Point being, a different benchmark may tell a vastly different story.

    Getting 16 racks to work together is quite a feat, and the engineering in the interconnects sounds like it is carrying the show. How far can it scale? Will it double the performance with a 32-rack deployment? Nvidia is not a slouch in interconnects either. I honestly don't know the scaling of a 16-rack system, but NVL72 can be deployed in a single rack. What is the performance of 16 of those?

    Lastly, didn't I read on Tom's somewhere that the 910C was manufactured by TSMC without their knowledge? How many of these chips can be sourced going forward? Catching up by deploying 4x as many chips only works if you can get 4x as many chips.