Huawei's brute force AI tactic seems to be working — CloudMatrix 384 claimed to outperform Nvidia processors running DeepSeek R1

Huawei Ascend AI chip
(Image credit: Huawei)

Huawei's CloudMatrix AI cluster takes a comparatively simple approach in its attempt to beat Nvidia, and the company's researchers and an outside firm claim it has worked, at least in one instance. A recent technical paper contends that a cluster of Ascend 910C chips has surpassed the performance of Nvidia's H800 chip in running DeepSeek's R1 LLM.

Huawei published a technical paper in collaboration with Chinese AI startup SiliconFlow, finding that Huawei's CloudMatrix 384 cluster can outperform Nvidia hardware when running DeepSeek models. The cluster's hardware and software stack was found to outpace systems using Nvidia's H800 chip, a variant of the H100 pared down for export to China, as well as the H100 itself, when running DeepSeek's 671-billion-parameter R1 model.

The research paper states that Huawei's goal with the CM384 was to "reshape the foundation of AI infrastructure," and a Huawei scientist added that the paper itself was published "to build confidence within the domestic technology ecosystem in using Chinese-developed NPUs to outperform Nvidia's GPUs."

As Nvidia boss Jensen Huang said at France's VivaTech earlier this month, Nvidia remains solidly ahead of Huawei chip-for-chip. "Our technology is a generation ahead of theirs," Huang claimed, and Huawei reportedly agrees internally. But, as Huang was quick to add, "AI is a parallel problem, so if each one of the computers is not capable … just add more computers."
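Huang's parallel-scaling point can be illustrated with toy arithmetic. The per-chip performance, chip counts, and scaling efficiency below are illustrative assumptions for the sketch, not measured figures from the paper or from either vendor:

```python
# Toy model of cluster-level throughput: even if each chip is slower,
# deploying more of them can close (or reverse) the aggregate gap.
# All figures here are illustrative assumptions, not benchmark results.

def aggregate_throughput(per_chip_perf, num_chips, scaling_efficiency=1.0):
    """Aggregate throughput under assumed near-linear parallel scaling."""
    return per_chip_perf * num_chips * scaling_efficiency

# Hypothetical: a baseline GPU at relative performance 1.0, versus a
# chip at 0.4 the per-chip performance deployed 4x as widely with
# 90% scaling efficiency across the interconnect.
baseline = aggregate_throughput(per_chip_perf=1.0, num_chips=1)
cluster = aggregate_throughput(per_chip_perf=0.4, num_chips=4,
                               scaling_efficiency=0.9)

print(cluster / baseline)  # roughly 1.44x the baseline, despite slower chips
```

The catch, of course, is that the extra chips draw extra power and that real-world scaling efficiency falls off with cluster size, which is exactly the trade-off the commenters below debate.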

Sunny Grimm
Contributing Writer

Sunny Grimm is a contributing writer for Tom's Hardware. He has been building and breaking computers since 2017, serving as the resident youngster at Tom's. From APUs to RGB, Sunny has a handle on all the latest tech news.

  • setx
    The result was pretty obvious.

    Why such a biased title? Why mention "using four times the energy" without "for 1.7x speed"? It would look very different.

    Also, if you think that a proper interconnect and software stack is "brute force" – go ahead and try replicating it.
  • jp7189
    Those are some cherry-picked numbers. NVL72 can do 360 fp16 pflops with sparsity, and scales pretty well, with a doubling of performance for fp8 and another doubling with fp4. Point being, a different benchmark may tell a vastly different story.

    Getting 16 racks to work together is quite a feat, and the engineering in the interconnects sounds like it is carrying the show. How far can it scale? Will it double the performance with a 32-rack deployment? Nvidia is not a slouch in interconnects either. I honestly don't know the scaling of a 16-rack system, but NVL72 can be deployed in a single rack. What is the performance of 16 of those?

    Lastly, didn't I read on Tom's somewhere that the 910C was manufactured by TSMC, without their knowledge? How many of these chips can be sourced going forward? Catching up by deploying 4x as many chips only works if you can get 4x as many chips.