China's secretive Sunway Pro CPU quadruples performance over its predecessor, allowing the supercomputer to hit exaflop speeds
China continues to advance supercomputing technologies despite U.S. sanctions.
Earlier this year, the National Supercomputing Center in Wuxi (an entity blacklisted in the U.S.) launched its new supercomputer based on the enhanced China-designed Sunway SW26010 Pro processors with 384 cores. Sunway's SW26010 Pro CPU not only packs more cores than its non-Pro SW26010 predecessor, but it more than quadrupled FP64 compute throughput due to microarchitectural and system architecture improvements, according to Chips and Cheese. However, while the manycore CPU is good on paper, it has several performance bottlenecks.
The first details of the manycore Sunway SW26010 Pro CPU and supercomputers that use it emerged back in 2021. At the recent SC23 conference, the company showcased actual processors and disclosed more details about their architecture and design, which represent a significant leap in performance. The new CPU is expected to enable China to build high-performance supercomputers based entirely on domestically developed processors. Each Sunway SW26010 Pro has a maximum FP64 throughput of 13.8 TFLOPS, which is massive. For comparison, AMD's 96-core EPYC 9654 has a peak FP64 performance of around 5.4 TFLOPS.
CPU | Compute Cores | Peak FP64 | Peak FP32 |
SW26010-Pro | 384 | 13.8 TFLOPS | 27.6 TFLOPS |
SW26010 | 256 | 2.9 TFLOPS | 5.8 TFLOPS |
A64FX | 48 | 3 TFLOPS | 6 TFLOPS |
MI250X (Single GCD) | 110 | 23.9 TFLOPS | 23.9 TFLOPS (47.8 TFLOPS packed) |
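To put the table's CPU rows on a common footing, the sketch below divides each peak FP64 figure by the listed compute-core count. The numbers are copied from the table; the dictionary and function names are illustrative only, and the MI250X row is left out because its 110 units are not directly comparable to CPU cores.

```python
# Figures copied from the table above: (compute cores, peak FP64 in TFLOPS).
# Names here are illustrative, not taken from any vendor tooling.
CHIPS = {
    "SW26010-Pro": (384, 13.8),
    "SW26010": (256, 2.9),
    "A64FX": (48, 3.0),
}


def fp64_gflops_per_core(name):
    """Peak FP64 throughput per compute core, in GFLOPS."""
    cores, tflops = CHIPS[name]
    return tflops * 1000 / cores


per_core = {name: fp64_gflops_per_core(name) for name in CHIPS}

for name, gflops in per_core.items():
    # Roughly 35.9, 11.3 and 62.5 GFLOPS per core, respectively.
    print(f"{name:12s} {gflops:6.1f} GFLOPS per core")
```

By this measure each SW26010 Pro compute core delivers roughly three times the peak FP64 of its predecessor's, yet still less than a single A64FX core. These are theoretical peaks, not sustained results.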
The SW26010 Pro is an evolution of the original SW26010, so it maintains the foundational architecture of its predecessor but introduces several key enhancements. The new SW26010 Pro processor is based on an all-new proprietary 64-bit RISC architecture and packs six core groups (CG) and a protocol processing unit (PPU). Each CG integrates 64 2-wide compute processing elements (CPEs) featuring a 512-bit vector engine as well as 256 KB of fast local store (scratchpad cache) for data and 16 KB for instructions; one management processing element (MPE), which is a superscalar out-of-order core with a vector engine, 32 KB/32 KB L1 instruction/data cache, 256 KB L2 cache; and a 128-bit DDR4-3200 memory interface.
MPEs and CPEs use a directory-based protocol to enable coherent data sharing, which reduces data movement between cores and supports fine-grained interactions between different cores. This is particularly important for applications with irregular data-sharing patterns. With six CGs, each SW26010 Pro processor has 384 CPEs and six MPEs, thus 390 cores in total plus a PPU.
Not only does the SW26010 Pro run faster than its predecessor (its CPEs run at 2.25 GHz and its MPEs at 2.10 GHz, versus 1.45 GHz for both CPEs and MPEs on the SW26010), but the new 64-bit RISC microarchitecture on the SW26010 Pro CPU has been completely revamped to more than quadruple the processor's FP64 data processing throughput. To provide more memory bandwidth to the new cores, designers shifted the CPU from DDR3 to DDR4 memory controllers, which significantly increased memory bandwidth and capacity. Each CG is now equipped with 16 GB of DDR4 memory, doubling the 8 GB of DDR3 memory found in each core group of the SW26010. This enhancement increases the total memory supported by one CPU from 32 GB in the SW26010 to 96 GB in the SW26010-Pro.
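As a rough sanity check on where the FP64 gain comes from, the sketch below splits the overall speedup into a core-count factor, a clock factor, and whatever is left over per core per cycle. It uses only figures quoted in this article; the variable names are illustrative, and the residual is an inference from those figures, not a disclosed specification.

```python
# Figures quoted in the article; nothing here is measured.
cores_new, cores_old = 384, 256      # compute cores (CPEs)
clock_new, clock_old = 2.25, 1.45    # CPE clock in GHz
fp64_new, fp64_old = 13.8, 2.9       # peak FP64 in TFLOPS

total_gain = fp64_new / fp64_old     # about 4.76x overall
core_gain = cores_new / cores_old    # 1.5x from more cores
clock_gain = clock_new / clock_old   # about 1.55x from higher clocks

# Whatever is not explained by cores and clocks must come from
# more work done per core per cycle.
per_cycle_gain = total_gain / (core_gain * clock_gain)

print(f"total {total_gain:.2f}x = cores {core_gain:.2f}x "
      f"* clock {clock_gain:.2f}x * per-cycle {per_cycle_gain:.2f}x")
```

The leftover factor comes out at roughly 2x, which is consistent with each compute core doing about twice as much FP64 work per cycle as before.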
Despite these advancements, both the SW26010 and SW26010-Pro share a common limitation in their cache and memory subsystem. The SW26010-Pro attempts to address its cache issue by increasing the scratchpad capacity to 256 KB, up from the 64 KB in the SW26010. But a 256 KB scratchpad per CPE is not enough without a proper L2 cache, so both processors still have a major performance bottleneck. Meanwhile, the dual-channel DDR4-3200 (51.2 GB/s) memory subsystem in each core group is barely enough for 64 cores, each featuring a 512-bit vector FPU and capable of up to 16 FP64 FLOPS/cycle.
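The bandwidth concern can be made concrete with a back-of-the-envelope calculation. The sketch below uses only figures from this article (64 CPEs per core group, 16 FP64 FLOPS per cycle, 2.25 GHz, a 128-bit DDR4-3200 interface); the constant names are illustrative, and the peak counts CPEs only, which is an assumption that happens to reproduce the quoted 13.8 TFLOPS.

```python
# Per-core-group figures quoted in the article.
CPES_PER_CG = 64
FLOPS_PER_CYCLE = 16        # FP64 FLOPS per CPE per cycle
CPE_CLOCK_HZ = 2.25e9
CORE_GROUPS = 6
BUS_WIDTH_BITS = 128        # DDR4 interface width per core group
TRANSFERS_PER_S = 3200e6    # DDR4-3200

# Peak compute: CPEs only (assumption: MPEs are not counted).
cg_peak_flops = CPES_PER_CG * FLOPS_PER_CYCLE * CPE_CLOCK_HZ
chip_peak_tflops = CORE_GROUPS * cg_peak_flops / 1e12

# Peak DRAM bandwidth per core group, in bytes per second.
cg_bandwidth = BUS_WIDTH_BITS / 8 * TRANSFERS_PER_S

# Machine balance: DRAM bytes available per FP64 operation at peak.
bytes_per_flop = cg_bandwidth / cg_peak_flops
# FLOPs that must be done per 8-byte double fetched to stay at peak.
flops_per_double = 8 / bytes_per_flop

print(f"chip peak:      {chip_peak_tflops:.3f} TFLOPS")
print(f"CG bandwidth:   {cg_bandwidth / 1e9:.1f} GB/s")
print(f"bytes per FLOP: {bytes_per_flop:.4f}")
print(f"FLOPs needed per double loaded: {flops_per_double:.0f}")
```

Under these assumptions a core group gets about 0.022 bytes of DRAM bandwidth per peak FLOP, so a kernel would need on the order of 360 floating-point operations per double fetched from memory to avoid being bandwidth-bound. That is why data reuse in the scratchpad matters so much on this design.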
In conclusion, the SW26010 Pro represents a significant step forward from the SW26010, particularly in terms of memory capacity, compute density, and overall performance. These enhancements demonstrate China's growing prowess in supercomputing. However, the new processor has two main drawbacks: a weak caching subsystem (which can be mitigated with software optimizations, though these are costly in both time and money) and insufficient memory bandwidth. As a result, it remains to be seen whether it can be used to build systems that truly deliver ExaFLOPS-level performance on complex real-world problems.
Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.
bit_user said:
"Each Sunway SW26010 Pro has a maximum FP64 throughput of 13.8 TFLOPS, which is massive. For comparison, AMD's 96-core EPYC 9654 has a peak FP64 performance of around 5.4 TFLOPS."
It's a false comparison, though. The SW26010 Pro is a hybrid CPU/GPU. As a hybrid, it packs more raw compute than a CPU, but can't handle general-purpose computation as well. Nor does it have as much raw compute as GPUs, like AMD's MI250X (which packs 28 to 48 fp64 TFLOPS). As such, the best point of comparison is probably with something like Fujitsu's A64FX:
https://www.anandtech.com/show/13258/hot-chips-2018-fujitsu-afx64-arm-core-live-blog
Another way to think of it is sort of like 6 Cell processors on a chip. Like the Cell, the bulk of its compute lies in the 2-way in-order cores that operate via scratchpad memory. Programming these is probably a lot more like programming a GPU than a CPU. In fact, I wouldn't be surprised if they used OpenCL to utilize them in the exact same way.
"a dual-channel DDR4-3200 (51.2 GB/s) memory subsystem is barely enough for 64 cores, each featuring a 512-bit vector FPU and capable of up to 16 FP64 FLOPS/cycle."
Big mistake not to use GDDR memory for this.