Architecture of Chinese Exascale Supercomputer Proposed: 80,000 Hybrid CPUs with 512-Bit Vector Units

TaihuLight
(Image credit: South China Morning Post)

In a bid to support development of its scientific, economic, and allegedly military-bound projects, China has been building leading-edge supercomputers for about two decades. Initially, China used hardware developed in the U.S., but as tensions between the country and its main economic rival intensified, China had to build its own high-performance computing (HPC) hardware. As the era of exascale supercomputers looms, Chinese scientists are proposing various architectures for such systems.

One of the exascale supercomputer proposals involves scaling the Sunway HPC architecture as well as the Shenwei (SW) many-core hybrid CPU architecture, reports NextPlatform, citing a document from the National Research Center of Parallel Computer Engineering and Technology (NRCPC).

As part of its preparations for the exascale era, the NRCPC has conducted a study of general supercomputer trends in recent years.

(Image credit: the National Research Center of Parallel Computer Engineering and Technology)

The organization found that the slowdown of both Moore’s law and Dennard scaling has made it incredibly difficult to increase the performance of supercomputers without increasing their power consumption and, therefore, exponentially increasing the complexity of the whole system architecture.

Based on these findings, the performance of leading-edge supercomputers between 2008 and 2019 increased mainly because the number of compute cores grew 44-fold, and only to a much lesser extent because compute capability per core improved, which grew threefold. To that end, the NRCPC believes it makes sense to scale its existing Sunway supercomputer architecture and Shenwei CPU design rather than to invent something brand new. In particular, supercomputers featuring tens of millions of cores are being considered.
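
As a rough sanity check on those two figures, the sketch below combines them. It assumes that overall performance is simply core count multiplied by per-core capability, which is a simplification and not something the NRCPC document is quoted as saying; the variable names are illustrative.

```python
import math

# Growth factors quoted above for 2008-2019.
core_count_growth = 44   # increase in the number of compute cores
per_core_growth = 3      # increase in compute capability per core

# Assumption: total performance ~ number of cores x per-core capability.
implied_total_growth = core_count_growth * per_core_growth

# Share of the growth contributed by core count, measured on a log scale
# so that the two multiplicative factors can be compared additively.
core_share = math.log(core_count_growth) / math.log(implied_total_growth)

print(f"implied overall growth: ~{implied_total_growth}x")
print(f"share from core count (log scale): {core_share:.1%}")
```

Under that assumption, more than three quarters of the growth (on a log scale) comes from adding cores, which is the argument for scaling out rather than redesigning the core.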

Exploring the Shenwei SW26010 Architecture

The latest Sunway TaihuLight supercomputer, launched in 2016, uses 40,960 homegrown manycore Sunway SW26010 processors featuring a hybrid architecture. The system offers Linpack performance (Rmax) of 93,014.6 TFLOPS and peak performance (Rpeak) of 125,436 TFLOPS. The current exascale proposal involves scaling the SW26010 CPU as well as the TaihuLight system, so it makes sense to look at the CPU architecture in more detail.
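
The system-level numbers above can be broken down per processor with simple arithmetic. The sketch below uses only the figures quoted in this paragraph; the variable names are illustrative.

```python
# Figures quoted above for the Sunway TaihuLight.
num_cpus = 40_960
rmax_tflops = 93_014.6     # Linpack performance (Rmax)
rpeak_tflops = 125_436.0   # theoretical peak (Rpeak)

# Fraction of the theoretical peak that Linpack actually sustains.
linpack_efficiency = rmax_tflops / rpeak_tflops

# Average contribution of one SW26010 to each figure.
peak_per_cpu_tflops = rpeak_tflops / num_cpus
rmax_per_cpu_tflops = rmax_tflops / num_cpus

print(f"Linpack efficiency: {linpack_efficiency:.1%}")
print(f"Peak per CPU:       {peak_per_cpu_tflops:.3f} TFLOPS")
print(f"Rmax per CPU:       {rmax_per_cpu_tflops:.3f} TFLOPS")
```

This works out to roughly 74% Linpack efficiency and a little over 3 TFLOPS of peak per processor.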

(Image credit: China Academy of Sciences/Research Gate)

The SW26010 processor is based on an in-house developed 64-bit RISC architecture and features four clusters, or core groups (CGs), and a protocol processing unit (PPU). Each cluster has one management processing element (MPE), which is a superscalar out-of-order core with a 256-bit vector engine, 32 KB/32 KB L1 instruction/data caches, and a 256 KB L2 cache. It also integrates 64 compute processing elements (CPEs) featuring the same 256-bit vector engine as well as 64 KB of fast local store for data and 16 KB for instructions. CPEs are organized as an 8x8 array and are interconnected using a mesh network. It is important to note that MPEs and CPEs support coherence sharing with a directory-based protocol, which reduces data movement between cores and supports fine-grained interactions between different cores. This is particularly vital for applications with irregular data-sharing access patterns.
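
The layout described above can be captured in a small data model. This is only an illustrative sketch: the class and field names are hypothetical, and the values are the ones given in the paragraph.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CoreGroup:
    """One core group (CG) as described above."""
    mpes: int = 1                 # management processing elements
    cpe_rows: int = 8             # CPEs are laid out as an 8x8 mesh
    cpe_cols: int = 8
    cpe_local_data_kb: int = 64   # local store for data, per CPE
    cpe_local_instr_kb: int = 16  # local store for instructions, per CPE

    @property
    def cpes(self) -> int:
        return self.cpe_rows * self.cpe_cols

    @property
    def cores(self) -> int:
        return self.mpes + self.cpes


@dataclass(frozen=True)
class Processor:
    """A chip made of several identical core groups."""
    core_groups: int = 4
    group: CoreGroup = field(default_factory=CoreGroup)

    @property
    def cores(self) -> int:
        return self.core_groups * self.group.cores

    @property
    def cpe_local_data_mb(self) -> float:
        # Total CPE data local store across the whole chip.
        total_kb = self.core_groups * self.group.cpes * self.group.cpe_local_data_kb
        return total_kb / 1024


sw26010 = Processor()
print(sw26010.cores)              # total cores on the chip
print(sw26010.cpe_local_data_mb)  # total CPE data local store in MB
```

With four core groups this yields 260 cores and 16 MB of CPE data local store per chip.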

Each CG has its own DDR3 memory controller with its own address space that supports 8 GB of memory using nine memory modules for a proprietary ECC implementation. CGs are interconnected using a ringbus-like network-on-chip (NoC) link, and the processor itself connects to the rest of the system using the System Interconnect (SI) bus. The SW26010 CPU used in the Sunway TaihuLight supercomputer operates at 1.45 GHz. The NRCPC does not disclose which process technology it used to make the SW26010, but since the TaihuLight first appeared in the Top 500 list in mid-2016, it is logical to assume that its CPU was made using TSMC's 28 nm fabrication process.

(Image credit: the National Research Center of Parallel Computer Engineering and Technology)

Such a processor offers peak performance (Rpeak) of around 3.168 TFLOPS as well as memory bandwidth of approximately 136 GB/s, assuming that the Sunway TaihuLight is fully loaded and 100% efficient.
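
A peak figure like this can be approximated from the core counts, vector width, and clock. The sketch below is a back-of-the-envelope model and not the NRCPC's own calculation. It assumes that each CPE completes one 256-bit fused multiply-add per cycle (4 FP64 lanes x 2 operations = 8 FLOPs per cycle) and that each MPE delivers 16 FLOPs per cycle; neither per-cycle figure is stated in the article, and the function name is hypothetical.

```python
def peak_tflops(clock_ghz: float,
                core_groups: int = 4,
                cpes_per_group: int = 64,
                cpe_flops_per_cycle: int = 8,    # assumed: 4 FP64 lanes x 2 (FMA)
                mpe_flops_per_cycle: int = 16    # assumed: not stated in the article
                ) -> float:
    """Theoretical FP64 peak of one chip, in TFLOPS, under the assumptions above."""
    per_group = cpes_per_group * cpe_flops_per_cycle + mpe_flops_per_cycle
    flops_per_cycle = core_groups * per_group
    return flops_per_cycle * clock_ghz / 1000.0   # GFLOPS -> TFLOPS


for clock in (1.45, 1.5):
    print(f"{clock:.2f} GHz -> {peak_tflops(clock):.4f} TFLOPS")
```

Under these assumptions, a 1.5 GHz clock reproduces the roughly 3.168 TFLOPS figure, while the 1.45 GHz clock used in the TaihuLight gives about 3.06 TFLOPS per chip. That the quoted figure corresponds to the slightly higher clock is an inference from the arithmetic, not something the source confirms.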

(Image credit: the National Research Center of Parallel Computer Engineering and Technology)

The SW26010 is essentially a hybrid processor with 260 cores that share the same microarchitecture, but feature different capabilities. Since the SW26010 is a single chip that can exploit thread-level parallelism with its 256 CPE cores, it is believed to be more efficient than CPUs paired with compute accelerators (such as GPUs or FPGAs), since it does not have to perform large numbers of memory transactions between serial (MPE) and parallel (CPE) cores. Meanwhile, modern x86-based supercomputers use CPUs with more than four 'big' cores, which adds considerable flexibility.

NRCPC's Approach to Exascale: Scale Everything

From the NRCPC's point of view, it is possible to scale both the Sunway system and the Shenwei CPU architecture to build a supercomputer featuring performance of around 1 ExaFLOPS.

(Image credit: the National Research Center of Parallel Computer Engineering and Technology)

To build such a system, the NRCPC proposes to enhance the SW26010 CPU and increase the number of processors. The new Shenwei CPU for exascale machines will have eight CG clusters instead of four. The CG architecture will remain the same: one MPE and 64 CPEs. Meanwhile, CPEs will support 512-bit vector instructions (presumably the MPE will too, but the document does not state that explicitly). Based on the NRCPC's estimates, such a processor will provide over 12 FP64 TFLOPS. The exascale supercomputer will also more than double the number of CPUs per system to over 80,000.
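
To see how these numbers fit together, the sketch below works through the proposed configuration using round figures of 12 TFLOPS per CPU and 80,000 CPUs (the text says "over" in both cases, so these are lower bounds). The per-cycle throughput is an assumption: one 512-bit fused multiply-add per CPE per cycle, and an MPE with twice a CPE's throughput. The source does not give these values, and all variable names are illustrative.

```python
import math

# Configuration described above.
core_groups = 8
cpes_per_group = 64
mpes_per_group = 1
num_cpus = 80_000            # "over 80,000" in the proposal
cpu_peak_tflops = 12.0       # "over 12 FP64 TFLOPS" in the proposal

cores_per_cpu = core_groups * (mpes_per_group + cpes_per_group)
total_cores = num_cpus * cores_per_cpu

# System peak at the lower-bound figures, and CPUs needed to reach 1 ExaFLOPS.
system_peak_pflops = num_cpus * cpu_peak_tflops / 1000
cpus_for_one_exaflops = math.ceil(1_000_000 / cpu_peak_tflops)

# Assumed per-cycle throughput (not stated in the source).
cpe_flops_per_cycle = (512 // 64) * 2           # 8 FP64 lanes x 2 (FMA)
mpe_flops_per_cycle = 2 * cpe_flops_per_cycle   # hypothetical
flops_per_cycle = core_groups * (cpes_per_group * cpe_flops_per_cycle
                                 + mpes_per_group * mpe_flops_per_cycle)

# Clock needed to hit 12 TFLOPS under those assumptions.
min_clock_ghz = cpu_peak_tflops * 1000 / flops_per_cycle

print(f"cores per CPU:        {cores_per_cpu}")
print(f"cores in the system:  {total_cores:,}")
print(f"system peak:          {system_peak_pflops:.0f} PFLOPS")
print(f"CPUs for 1 ExaFLOPS:  {cpus_for_one_exaflops:,}")
print(f"clock for 12 TFLOPS:  {min_clock_ghz:.2f} GHz")
```

At the lower-bound figures the system lands just short of 1 ExaFLOPS, which is consistent with both the CPU count and the per-CPU performance being quoted as "over" their round values. The implied clock of roughly 1.4 GHz would be close to the SW26010's, but it depends entirely on the assumed per-cycle throughput.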

The NRCPC says that an exascale Sunway supercomputer based on the next-generation Shenwei CPU architecture will offer peak performance of around 1 FP64 ExaFLOPS, 2 FP32 ExaFLOPS, and 4 FP16 ExaFLOPS. According to estimates by the organization, real-world performance of the exascale Sunway system will be around 700 PFLOPS (i.e., its efficiency will be about 70%), so it will be 7.5 times faster than the TaihuLight. In addition, the supercomputer will offer about 7 times higher memory bandwidth and about 2 times higher network bandwidth.
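
The ratios in this paragraph can be checked with a few lines of arithmetic. The sketch below uses the figures quoted here plus the TaihuLight's Rmax of 93,014.6 TFLOPS; the explanation of the FP32 and FP16 figures (more elements packed into the same 512-bit vector) is an assumption about how the NRCPC derived them.

```python
# Figures quoted for the proposed exascale system.
peak_fp64_pflops = 1000.0
sustained_pflops = 700.0

# TaihuLight Linpack result (Rmax), converted from TFLOPS to PFLOPS.
taihulight_rmax_pflops = 93_014.6 / 1000

efficiency = sustained_pflops / peak_fp64_pflops
speedup_vs_taihulight = sustained_pflops / taihulight_rmax_pflops

# Assumption: peak scales with the number of elements per 512-bit vector.
vector_bits = 512
lanes = {bits: vector_bits // bits for bits in (64, 32, 16)}
peak_by_precision = {
    f"FP{bits}": peak_fp64_pflops * lanes[bits] / lanes[64]
    for bits in (64, 32, 16)
}

print(f"efficiency: {efficiency:.0%}")
print(f"speedup over TaihuLight (Rmax): {speedup_vs_taihulight:.2f}x")
for name, pflops in peak_by_precision.items():
    print(f"{name}: {pflops / 1000:.0f} ExaFLOPS peak")
```

The lane-count model reproduces the 1, 2, and 4 ExaFLOPS figures, and the sustained estimate is indeed about 7.5 times the TaihuLight's Linpack result.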

(Image credit: the National Research Center of Parallel Computer Engineering and Technology)

The Sunway TaihuLight supercomputer consumes 15,371 kilowatts of power. By contrast, the Fugaku supercomputer, the world's most powerful machine, consumes 29,899 kW, about two times more. Frontier, which is expected to be the first system to offer ~1.5 ExaFLOPS of performance sometime later this year, is projected to consume ~30,000 kW. While the NRCPC's study gives some idea of the performance expected from the Chinese exascale supercomputer, one thing the document lacks is the expected power consumption of the system.
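
To illustrate why the missing power figure matters, the sketch below estimates what a 700 PFLOPS system would draw if its energy efficiency were no better than the TaihuLight's. This is a hypothetical scenario and not a projection from the NRCPC; the 30,000 kW budget is borrowed from the Frontier figure above purely as a reference point.

```python
# Figures quoted in the article.
taihulight_rmax_tflops = 93_014.6
taihulight_power_kw = 15_371.0
fugaku_power_kw = 29_899.0
exascale_sustained_tflops = 700_000.0   # ~700 PFLOPS estimated by the NRCPC
reference_budget_kw = 30_000.0          # Frontier's projected draw, used as a yardstick

# TFLOPS per kW is numerically the same as GFLOPS per watt.
taihulight_gflops_per_watt = taihulight_rmax_tflops / taihulight_power_kw

fugaku_vs_taihulight_power = fugaku_power_kw / taihulight_power_kw

# Hypothetical: power draw at unchanged TaihuLight efficiency.
power_at_same_efficiency_mw = (
    exascale_sustained_tflops / taihulight_gflops_per_watt / 1000
)

# Efficiency required to fit the reference budget instead.
required_gflops_per_watt = exascale_sustained_tflops / reference_budget_kw

print(f"TaihuLight efficiency:      {taihulight_gflops_per_watt:.2f} GFLOPS/W")
print(f"Fugaku vs TaihuLight power: {fugaku_vs_taihulight_power:.2f}x")
print(f"700 PFLOPS at that rate:    {power_at_same_efficiency_mw:.1f} MW")
print(f"Needed for a 30 MW budget:  {required_gflops_per_watt:.1f} GFLOPS/W")
```

At the TaihuLight's roughly 6 GFLOPS per watt, a 700 PFLOPS machine would need well over 100 MW, so the proposal implicitly depends on a severalfold improvement in energy efficiency.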

The paper acknowledges that enhancing the CPU architecture will lead to major internal redesigns of interconnections and caches, which means higher power consumption. Furthermore, the whole supercomputer will have to be redesigned to take advantage of the extra per-CPU performance as well as the larger number of CPUs. The NRCPC says that it will address the challenges of other supercomputer subsystems in upcoming documents.

New Process Technologies Needed

Building a hybrid CPU with 520 cores (8 MPEs, 512 CPEs) is possible from an engineering standpoint. Meanwhile, doubling the number of cores and increasing their complexity with 512-bit vector units that require two times faster internal interconnects will inevitably lead to a significant increase in transistor count.

(Image credit: SMIC)

Doubling transistor count is not an insurmountable challenge. At the end of the day, companies like AMD, Intel, and Nvidia know how to build large CPUs and GPUs for datacenters and supercomputers. But all of these companies have access to leading-edge process technologies and semiconductor production facilities. By contrast, since China wants to build up its technological prowess independently, it is not clear whether the NRCPC is inclined to contract TSMC or Samsung Foundry to make its hybrid supercomputer CPUs, knowing that the U.S. might add the NRCPC to the Entity List and forbid chipmakers from supplying silicon to the organization.

Without knowing exactly which process technology is used to make the SW26010 and which node the NRCPC plans to use to make its 520-core chip, we can only make guesses and speculations about the organization's exascale plans. 

At present, China-based Semiconductor Manufacturing International Corp. (SMIC) has two FinFET manufacturing technologies: its 14 nm node as well as its N+1 node for inexpensive chips. Assuming that the SW26010 is made using TSMC's 28 nm process technology, using SMIC's 14 nm process for a considerably more complex CPU makes a lot of sense. It of course remains to be seen whether SMIC can indeed mass produce fairly complex chips using its 14 nm node (which so far has only been used for mobile SoCs and other relatively small components) and hit the right yields at the right frequency. Keeping in mind that SMIC is on the U.S. Department of Commerce's Entity List and that it is increasingly hard for the company to obtain necessary chemicals and spare parts, the foundry is refocusing on mature process technologies, so it is unclear whether it is even inclined to produce any new 14 nm designs, even for 'VIP' customers like the NRCPC.

That said, it is possible that the NRCPC might have to take a risk and use TSMC's services for its next-generation supercomputer. As an added bonus, using TSMC's 7 nm node would enable the National Research Center of Parallel Computer Engineering and Technology not only to increase the transistor count of its CPU, but also to increase frequency while keeping power consumption in check.

Summary

Cray

(Image credit: HPE)

One of the first Chinese exascale supercomputers will leverage the existing Sunway supercomputer and Shenwei hybrid CPU architectures developed by the National Research Center of Parallel Computer Engineering and Technology. To achieve peak performance (Rpeak) of 1 FP64 ExaFLOPS, the NRCPC will increase the number of execution units within its processor, add support for 512-bit vector instructions, and double the number of CPUs per system.

The CPU that will power the NRCPC's proposed exascale system will feature 520 cores (8 high-performance cores and 512 simplified cores) and an all-new memory subsystem. What is unclear is whether the new Shenwei CPU will be made in China and which fabrication process will be used to produce it. On the one hand, China-based SMIC has successfully used its 14 nm node to make SoCs for Qualcomm and some other partners, but it is unclear whether the technology is good enough for highly complex supercomputer processors and whether SMIC can actually use it given that it is on the U.S. DoC's Entity List. On the other hand, while TSMC can offer the NRCPC one of its competitive N7 or N6 nodes, it is unclear whether the Chinese supercomputer specialist is inclined to use the services of a Taiwanese company.

While Chinese engineers can develop a leading-edge supercomputer, including its CPU, DRAM, NAND, and other components, the competitiveness of the proposed NRCPC exascale system will depend on the semiconductor process technologies available to its CPU designers.

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.