Huawei Ascend NPU roadmap examined — company targets 4 ZettaFLOPS FP4 performance by 2028, amid manufacturing constraints

At the Huawei Connect 2025 event, in addition to announcing its first AI cluster with 1 FP4 ZettaFLOPS of performance, Huawei revealed a detailed roadmap of its upcoming Ascend neural processing units (NPUs), which accelerate AI workloads.

The company does not have access to TSMC’s leading-edge process technologies or high-end HBM4 and GDDR7 memory from global leaders. So to boost the performance of its Ascend processors, it will need to rely on a new architecture and new types of memory, starting with the Ascend 950 series. Huawei expects its new NPUs to enable multi-ZettaFLOPS performance toward the end of the decade.

When it comes to features, Huawei’s Ascend 910-series AI accelerators have barely changed in years: The latest dual-chiplet Ascend 910C offers higher performance and optimized manufacturability compared to the original Ascend 910 from 2019. The unit uses a SIMD architecture and supports conventional formats, such as FP32, HF32, FP16, BF16, and INT8, which are good enough for AI training, but are ‘heavy’ for AI inference by modern standards.

So, the company is cooking up a lineup of NPUs (the Ascend 950PR and 950DT, Ascend 960, and Ascend 970) that use an all-new instruction set architecture and support modern data formats required for next-generation AI workloads. The new AI accelerators will also use Huawei's proprietary HBM-like memory technologies: the cheaper HiBL 1.0 and the higher-performance HiZQ 2.0.

Huawei Ascend roadmap
NPU	Targeted Release	Architecture	FP8 Performance	FP4 Performance	Memory	Memory Bandwidth	Interconnect Bandwidth	Supported Formats
Ascend 910C	2025 Q1	SIMD	–	–	128 GB	3.2 TB/s	784 GB/s	FP32, HF32, FP16, BF16, INT8
Ascend 950PR	2026 Q1	SIMD + SIMT	1 PFLOPS	2 PFLOPS	128 GB	1.6 TB/s	2.0 TB/s	FP32, HF32, FP16, BF16, FP8, MXFP8, HiF8, MXFP4
Ascend 950DT	2026 Q4	SIMD + SIMT	1 PFLOPS	2 PFLOPS	144 GB	4.0 TB/s	2.0 TB/s	FP32, HF32, FP16, BF16, FP8, MXFP8, HiF8, MXFP4
Ascend 960	2027 Q4	SIMD + SIMT	2 PFLOPS	4 PFLOPS	288 GB	9.6 TB/s	2.2 TB/s	FP32, HF32, FP16, BF16, FP8, MXFP8, HiF8, MXFP4, HiF4
Ascend 970	2028 Q4	SIMD + SIMT	4 PFLOPS	8 PFLOPS	288 GB	14.4 TB/s	4.0 TB/s	FP32, HF32, FP16, BF16, FP8, MXFP8, HiF8, MXFP4, HiF4

The Ascend 950

The next major step in the Huawei Ascend roadmap is the Ascend 950 series, comprising two variants: the Ascend 950PR, optimized for prefill and recommendation stages, and the Ascend 950DT, optimized for decoding and training.

Huawei Ascend AI chip — (Image credit: Huawei)

Both Ascend 950-series products use the same silicon, based on the company's new SIMD+SIMT architecture that weds vector-based processing and thread-level parallelism to maximize performance. They feature a GPU-like memory subsystem with reduced DRAM access granularity from 512 to 128 bytes. This reduces wasted bandwidth and improves memory efficiency. All Ascend 950-series processors will add support for FP8, MXFP8, HiF8, and MXFP4 data formats (on top of what is already offered by the Ascend 910C) to offer the right balance of performance and precision.
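
To illustrate why finer access granularity wastes less bandwidth, here is a minimal sketch. It assumes a hypothetical workload whose requests are aligned and need only 64 useful bytes each (that request size is an illustrative assumption, not a Huawei figure), and it computes what fraction of the fetched bytes is actually used at each granularity.

```python
import math


def fetch_efficiency(useful_bytes: int, granularity: int) -> float:
    """Fraction of fetched bytes that an aligned request actually needs."""
    # DRAM is read in whole blocks of `granularity` bytes, so a request
    # is rounded up to the next multiple of the granularity.
    blocks = math.ceil(useful_bytes / granularity)
    return useful_bytes / (blocks * granularity)


if __name__ == "__main__":
    REQUEST_BYTES = 64  # hypothetical request size, for illustration only
    for granularity in (512, 128):
        eff = fetch_efficiency(REQUEST_BYTES, granularity)
        print(f"{granularity}-byte granularity: {eff:.1%} of fetched bytes used")
```

Under that assumption, the same request uses one-eighth of the fetched data at 512-byte granularity and half of it at 128-byte granularity. Large, contiguous reads would see little difference.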

Both processors also offer the same compute performance (1 FP8 PFLOPS and 2 FP4 PFLOPS) and the same interconnect bandwidth (2 TB/s). The only differences between the Ascend 950PR and the Ascend 950DT are their memory subsystems and launch timeframes.

The 950PR features 128 GB of Huawei's proprietary HiBL 1.0 memory, a low-cost HBM-like solution with 1.6 TB/s of bandwidth that is optimized for compute-intensive, memory-light tasks like recommendations and prefill. The processor will be available in Q1 2026.

The Ascend 950DT for training and decoding workloads will come with 144 GB of HiZQ 2.0 memory, which offers a claimed bandwidth of 4.0 TB/s. The unit is expected to arrive in Q4 2026.
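
One way to see why the two variants target different workloads is to divide peak compute by memory bandwidth, the balance point used in roofline-style analysis. The sketch below uses only the peak FP8 and bandwidth figures quoted above; treating peak numbers as achievable is a simplifying assumption, and the function and dictionary names are made up for this example.

```python
PFLOPS = 1e15  # floating-point operations per second
TBPS = 1e12    # bytes per second

# Peak figures from Huawei's roadmap; real sustained values will be lower.
ASCEND_950 = {
    "950PR": {"fp8_flops": 1 * PFLOPS, "mem_bw": 1.6 * TBPS},
    "950DT": {"fp8_flops": 1 * PFLOPS, "mem_bw": 4.0 * TBPS},
}


def flops_per_byte(peak_flops: float, mem_bw: float) -> float:
    """Operations a kernel must perform per byte read to keep compute busy."""
    return peak_flops / mem_bw


if __name__ == "__main__":
    for name, spec in ASCEND_950.items():
        ratio = flops_per_byte(spec["fp8_flops"], spec["mem_bw"])
        print(f"Ascend {name}: about {ratio:.0f} FP8 FLOPs per byte of memory traffic")
```

The 950PR needs about 625 FP8 operations per byte to stay compute-bound, versus about 250 for the 950DT. That fits the positioning: prefill reuses each loaded weight across many tokens, while decoding reads far more data per operation and benefits from the faster HiZQ 2.0 memory.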

All AI processors from Huawei, starting from Ascend 950, will rely on the company's UnifiedBus (UB) interconnect protocol. UnifiedBus claims to offer 2.1 microsecond latency, 100 times improved optical reliability, and TB/s-scale bandwidth — critical for binding thousands of NPUs into a single logical system. Huawei supports UB deployment over standard Ethernet via UBoE (UnifiedBus over Ethernet), which reduces hardware costs and improves MTBF (Mean Time Between Failures), relative to RoCE (Remote Direct Memory Access over Converged Ethernet)-based solutions.

Huawei will use its Ascend 950-series AI accelerators for its Atlas 950 SuperPoD and Atlas 950 SuperCluster systems for large-scale AI infrastructure, as well as for other AI workloads (including AI workstations for developers).

The Ascend 960

The Ascend 960 NPU will succeed the Ascend 950 NPU sometime in Q4 2027, according to the roadmap Huawei presented at the event. The new unit will add support for HiF4, a Huawei-developed 4-bit format for AI inference, and is said to double performance, memory capacity, and memory bandwidth compared to its predecessor.

The AI accelerator is projected to deliver 2 FP8 PFLOPS and 4 FP4 PFLOPS of performance, featuring 288 GB of memory with 9.6 TB/s bandwidth.

While such doubling may imply the use of two Ascend 950 chiplets, this is not the case. The new Ascend 960 processor supports a new data format, and its interconnect bandwidth rises only to 2.2 TB/s, far from double that of its predecessor.
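
The generational comparison can be checked directly against the roadmap table. This short sketch takes the Ascend 950DT and Ascend 960 rows and computes the ratio for each metric; the dictionary keys are names chosen for this example.

```python
# Figures copied from the roadmap table above.
ASCEND_950DT = {
    "fp8_pflops": 1.0,
    "memory_gb": 144.0,
    "mem_bw_tbps": 4.0,
    "interconnect_tbps": 2.0,
}
ASCEND_960 = {
    "fp8_pflops": 2.0,
    "memory_gb": 288.0,
    "mem_bw_tbps": 9.6,
    "interconnect_tbps": 2.2,
}


def generational_ratios(new: dict, old: dict) -> dict:
    """Return new/old for every metric present in the older part."""
    return {key: new[key] / old[key] for key in old}


if __name__ == "__main__":
    for metric, ratio in generational_ratios(ASCEND_960, ASCEND_950DT).items():
        print(f"{metric}: {ratio:.2f}x")
```

Compute and memory capacity come out at 2x and memory bandwidth at 2.4x, while interconnect bandwidth grows by only 1.1x, which is the mismatch that argues against a simple two-chiplet design.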

While it is reasonable to assume that Huawei will continue to use its HiZQ memory with the Ascend 960, the company did not explicitly confirm this during the presentation.

More than a million Ascend 960 processors will form the compute backbone of the Atlas 960 SuperCluster, which is projected to offer 2 FP8 ZettaFLOPS and 4 FP4 ZettaFLOPS of performance.

The Ascend 970

Huawei plans to release the Ascend 970, which will again double the performance of its predecessor, targeting 4 FP8 PFLOPS and 8 FP4 PFLOPS. The NPU will still come with 288 GB of memory, albeit with a 14.4 TB/s bandwidth. While detailed specifications of the Ascend 970 are still in development, the chip is designed to support models scaling to 10 trillion parameters and beyond, and is expected to land as early as late 2028.
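
To give a sense of scale for a 10-trillion-parameter model, the sketch below estimates how many 288 GB NPUs are needed just to hold the weights at different precisions. It is a lower bound under loud assumptions: 1 GB is taken as 10^9 bytes, and KV cache, activations, optimizer state, and any replication are ignored. The function name is made up for this example.

```python
import math


def min_npus_for_weights(params: float, bits_per_param: int, memory_gb: float) -> int:
    """Smallest NPU count whose combined memory can hold the raw weights."""
    weight_bytes = params * bits_per_param / 8
    memory_bytes = memory_gb * 1e9  # assumes decimal gigabytes
    return math.ceil(weight_bytes / memory_bytes)


if __name__ == "__main__":
    PARAMS = 10e12     # 10 trillion parameters, per Huawei's stated target
    MEMORY_GB = 288    # per-NPU capacity quoted for the Ascend 970
    for label, bits in (("4-bit", 4), ("8-bit", 8), ("16-bit", 16)):
        count = min_npus_for_weights(PARAMS, bits, MEMORY_GB)
        print(f"{label} weights: at least {count} NPUs")
```

Even at 4 bits per weight the parameters alone span 18 NPUs, so a model of this size is inherently a multi-chip workload, and training it needs far more memory than the weights themselves.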

Huawei may announce another set of Atlas SuperPoDs and Atlas SuperClusters built around its Ascend 970 NPUs. However, the company did not disclose such systems, perhaps because the scale-up and scale-out world sizes of its next-next-generation platform are still a work in progress. One possible reason is that these systems may be able to rely only on Huawei's proprietary UBoE (UnifiedBus over Ethernet) protocol rather than on industry-standard RoCE, which is still an option for the Atlas 960 SuperCluster.

Massive scale of AI systems

Before we analyze Huawei's upcoming zettascale platforms, a reminder: this multinational giant is blacklisted by the U.S. and does not have access to the advanced manufacturing capacity of Intel Foundry, Samsung Foundry, or TSMC.

The company could come into possession of such chips through other murky means, but Huawei no longer deems this a proper long-term strategy. Although this was not disclosed specifically, the announcement of clusters featuring 0.5 million to 1 million AI accelerators points to a major change of strategy: from scaling chips in line with the cadence of Moore's Law to scaling systems up and out. As a result, we are going to see completely different performance scaling challenges, both in terms of hardware and software, from Huawei and Nvidia in the coming years.

Huawei SuperPoD and SuperClusters
System	NPUs / Chips	Performance	Cabinets / Components	Release Timeframe
Atlas 950 SuperPoD	8,192 Ascend 950DT	8 EFLOPS FP8, 16 EFLOPS FP4	160 (128 compute + 32 comm)	Q4 2026
Atlas 950 SuperCluster	~524,288 Ascend 950DT (64 SuperPoDs)	524 EFLOPS FP8, 1 ZettaFLOPS FP4	>10,000 cabinets	Q4 2026
Atlas 960 SuperPoD	15,488 Ascend 960	30 EFLOPS FP8, 60 EFLOPS FP4	220 (176 compute + 44 comm)	Q4 2027
Atlas 960 SuperCluster	>1,000,000 Ascend 960	2 ZettaFLOPS FP8, 4 ZettaFLOPS FP4	Multiple SuperPoDs	Q4 2027

Huawei's scale-up world size refers to the number of AI chips that can be integrated into a single compute domain. The Atlas 950 SuperPoD packs up to 8,192 Ascend 950DT NPUs. Meanwhile, the Atlas 960 SuperPoD will scale to 15,488 NPUs. The NPUs are connected either via Huawei's proprietary UnifiedBus (UB) interconnect, with 2.1 µs latency and up to 2 TB/s of chip-to-chip bandwidth, or via RoCE, which uses industry-standard components but delivers lower performance.
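
The system-level figures in the table follow from multiplying NPU counts by per-chip peak performance. The sketch below redoes that arithmetic; it assumes perfect linear scaling of peak numbers, takes the Atlas 960 SuperCluster at exactly 1,000,000 NPUs (Huawei says "over"), and uses names invented for this example.

```python
PFLOPS = 1e15
EFLOPS = 1e18

# (NPU count, FP8 PFLOPS per NPU, FP4 PFLOPS per NPU), from the tables above.
SYSTEMS = {
    "Atlas 950 SuperPoD": (8_192, 1, 2),
    "Atlas 950 SuperCluster": (524_288, 1, 2),
    "Atlas 960 SuperPoD": (15_488, 2, 4),
    "Atlas 960 SuperCluster": (1_000_000, 2, 4),  # lower bound on NPU count
}


def aggregate_flops(npus: int, pflops_per_npu: float) -> float:
    """Peak aggregate FLOPS, assuming every NPU contributes its full peak."""
    return npus * pflops_per_npu * PFLOPS


if __name__ == "__main__":
    for name, (npus, fp8, fp4) in SYSTEMS.items():
        fp8_total = aggregate_flops(npus, fp8) / EFLOPS
        fp4_total = aggregate_flops(npus, fp4) / EFLOPS
        print(f"{name}: {fp8_total:,.1f} EFLOPS FP8, {fp4_total:,.1f} EFLOPS FP4")
```

For the Atlas 960 SuperPoD the products come to roughly 31 and 62 EFLOPS, slightly above the 30 and 60 EFLOPS quoted, so the quoted figures appear to be rounded. Sustained throughput on real workloads will be lower than any of these peak numbers.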

These SuperPods are meant to function as one logical system, optimized for large-model training and inference, with synchronized compute, unified memory access, and token throughput scaling beyond 80 million tokens per second. At this point, we can only wonder whether this will indeed work as planned.

By contrast, Nvidia currently limits its scale-up world size to 72 GPU packages per NVL72 GB200/GB300 rack, all connected with NVLink 5.0 and NVSwitch within a single rack. For future systems like NVL144 or NVL576 (Rubin and Rubin Ultra), Nvidia also maintains a modular pod-based structure, with no extension of NVLink domains beyond one rack. Interestingly, the number of logical GPU packages remains unchanged at 72.

In terms of scale-out world size, Huawei connects dozens of SuperPods into a SuperCluster using UnifiedBus over Ethernet (UBoE) or RoCE, enabling deployments like the Atlas 950 SuperCluster with 524,288 NPUs, and the upcoming Atlas 960 SuperCluster with over 1 million NPUs. These clusters aim to operate cohesively, with improved fault tolerance, low inter-pod latency, and the ability to train multi-trillion-parameter models when interconnected using UBoE, according to Huawei.
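
The scale-out sizes can be related back to the scale-up sizes with a one-line calculation. This sketch assumes a SuperCluster is assembled from whole, fully populated SuperPoDs, which Huawei has confirmed only for the Atlas 950 (64 SuperPoDs); for the Atlas 960 it is an assumption.

```python
import math


def superpods_needed(total_npus: int, npus_per_superpod: int) -> int:
    """Whole SuperPoDs required to reach at least `total_npus`."""
    return math.ceil(total_npus / npus_per_superpod)


if __name__ == "__main__":
    print("Atlas 950 SuperCluster:", superpods_needed(524_288, 8_192), "SuperPoDs")
    # 1,000,000 is a lower bound, since Huawei quotes "over 1 million" NPUs.
    print("Atlas 960 SuperCluster:", superpods_needed(1_000_000, 15_488), "SuperPoDs")
```

The Atlas 950 figure reproduces the 64 SuperPoDs in the table, and under the stated assumption the Atlas 960 SuperCluster would need at least 65 SuperPoDs to pass the one-million mark.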

Nvidia's design offers flexibility, modularity, and ease of integration, but lacks Huawei’s end-to-end coherence and latency control at extreme scale, potentially limiting performance scaling for systems with hundreds of thousands or millions of GPUs. Then again, Nvidia and its clients may not need clusters with over a million compute GPUs (we are talking about GPU packages rather than GPU chiplets) for AI, because the Nvidia GPUs that Huawei’s NPUs will compete against in 2027 – 2028 will be inherently more powerful.

Software implications

Without any doubt, orchestrating hundreds of thousands of AI accelerators is an incredible engineering achievement. But scaling out to hundreds of thousands of NPUs complicates not only hardware development but also software development.

Nvidia's clusters are generally easier to program for because they require fewer accelerators to reach a target performance level, thanks to the high compute density of each GPU. An NVL72 pod, for example, integrates 72 Blackwell GPUs connected via NVLink 5.0 and NVSwitch.

These pods operate as single, tightly coupled domains with shared memory coherence, reducing the need for complex distributed parallelism. Many large-scale AI workloads, including multi-trillion-parameter model training, can run effectively on just a few NVL72 pods, enabling developers to work within stable, local system boundaries.

Nvidia's modular scale-out model, in which NVL72, NVL144 (Rubin), or NVL576 (Rubin Ultra) pods are combined into a cluster, makes distribution more manageable.

Software stacks like NCCL, Megatron-LM, TensorRT-LLM, and DeepSpeed can assume consistent interconnect topologies and latency domains, with limited cross-pod communication. Thanks to Nvidia's vertically integrated and mature CUDA ecosystem, developers benefit from unified tooling, extensive documentation, and robust abstractions, making it possible to scale AI workloads with minimal custom engineering.

Huawei, by contrast, aims for scale through very large monolithic systems, such as the Atlas 950 SuperPoD (8,192 NPUs) and Atlas 960 SuperPoD (15,488 NPUs), which function as single logical compute domains. These SuperPoDs use Huawei’s UnifiedBus (UB) interconnect, which offers 2.1 µs latency and up to 2 TB/s of chip-to-chip bandwidth, to tightly couple several thousand NPUs.
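
The two UnifiedBus figures can be combined in a simple latency-plus-bandwidth cost model to show which one dominates for a given message size. This is a rough sketch, not a description of how UB behaves: it assumes the 2.1 µs is a fixed per-message latency, that the full 2 TB/s is achievable, and that there is no contention. Function names are invented for this example.

```python
UB_LATENCY_S = 2.1e-6     # claimed UnifiedBus latency
UB_BANDWIDTH_BPS = 2e12   # claimed peak chip-to-chip bandwidth, bytes/s


def transfer_time_s(message_bytes: float,
                    latency_s: float = UB_LATENCY_S,
                    bandwidth_bps: float = UB_BANDWIDTH_BPS) -> float:
    """Estimated time to move one message: fixed latency plus serialization."""
    return latency_s + message_bytes / bandwidth_bps


def break_even_bytes(latency_s: float = UB_LATENCY_S,
                     bandwidth_bps: float = UB_BANDWIDTH_BPS) -> float:
    """Message size at which serialization time equals the fixed latency."""
    return latency_s * bandwidth_bps


if __name__ == "__main__":
    print(f"Break-even message size: {break_even_bytes() / 1e6:.1f} MB")
    for size in (4e3, 4e6, 4e9):
        print(f"{size:>14,.0f} bytes -> {transfer_time_s(size) * 1e6:,.1f} µs")
```

Under these assumptions, messages smaller than about 4.2 MB are dominated by latency rather than bandwidth. That is one reason collective operations across thousands of NPUs are sensitive to message sizing and synchronization, not just to headline bandwidth.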

Token throughput is projected to exceed 80 million tokens/s (for Atlas 960 SuperPods), and memory access is synchronized across the entire system. This architecture supports tightly coupled training and inference at a massive scale, but also introduces far greater complexity in synchronization, memory partitioning, and job orchestration within each node.

In the scale-out model, Huawei connects multiple SuperPods via UBoE (UnifiedBus over Ethernet) or RoCE to build SuperClusters with 524,288 NPUs (Atlas 950) or over 1 million NPUs (Atlas 960). This large-scale interconnection requires developers to write software that performs well across tens or hundreds of thousands of accelerators, even for workloads that Nvidia can handle within a few pods.

While Huawei's vertical integration and proprietary toolchain (e.g., MindSpore) offer optimization opportunities, the lack of software maturity (according to Chinese companies, which still prefer to use Nvidia hardware despite issues with availability) and the massive scale involved make distributed scheduling, failure handling, and workload decomposition significantly harder, especially for tight synchrony requirements in multi-trillion-parameter models.

The future of Huawei's AI scale-up

At Huawei Connect 2025, the company revealed its shift to system-level scaling via massive AI clusters in a bid to stay competitive in the rapidly developing AI industry. Huawei is unable to access advanced foundry nodes or HBM4 memory. But it introduced an impressive Ascend NPU roadmap that includes the 950PR, 950DT, 960, and 970, all based on a new SIMD+SIMT architecture, featuring support for modern low-precision formats (FP8, MXFP4, HiF8, HiF4), and using proprietary memory like HiBL 1.0 and HiZQ 2.0.

Starting with the Ascend 950 series (1 PFLOPS FP8), Huawei’s SuperPoDs will scale up to 15,488 NPUs per system, and its SuperClusters will scale from over half a million to more than a million NPUs. Such massive clusters enable Huawei to achieve multi-ZettaFLOPS performance levels comparable to those of clusters used by market leaders like Google, Meta, OpenAI, and xAI.

However, Huawei's large, monolithic clusters present major software scaling challenges, unlike Nvidia’s modular NVL72/NVL144/NVL576 systems, which are easier to program due to consistent pod sizes, mature tooling, and fewer nodes needed to reach the same performance targets.

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.