Chinese supercomputers have recently attracted much attention from the hardware and high-performance computing (HPC) communities following sanctions imposed by the U.S. government. Back in October, at least two Chinese supercomputers broke the so-called exascale barrier. And during the SuperComputing 21 (SC21) conference, reports alleged that another Chinese exascale supercomputer is under development. However, there seems to be a significant catch with these machines.
Three Exascale Systems
David K. Kahaner, an HPC expert and founder of the Asian Technology Information Program (ATIP), presented on modern supercomputers in China at SC21. Thankfully, parts of that presentation were published by Koji Uchikawa in a Twitter post (via ComputerBase). He revealed that China has multiple 100 – 500 PFLOPS systems online based on homegrown technologies or commercially available AMD, Intel, and Nvidia hardware. He also reiterated that two exascale-class systems exist in China and that another system in development has been delayed.
As previously reported, the highest-performing Chinese supercomputer is the Tianhe-3 system located in the National Supercomputer Center in Guangzhou, China, according to ATIP. The machine uses Armv8-based Phytium 2000+ (FTP) processors for traditional HPC workloads with full FP64 precision. It relies on Matrix 2000+ (MTP) DSP accelerators for emerging workloads like AI that do not require FP64 precision at all times. ATIP says that the system is rated at around 1300 PFLOPS (1.3 EFLOPS).
China's second highest-performing supercomputer is the Sunway Oceanlite, located in the National Research Center of Parallel Computer Engineering and Technology (NRCPC). It uses proprietary hybrid 390-core Sunway processors that derive from the Sunway SW26010 CPUs. ATIP estimates that the sustainable performance of the machine is around 1050 PFLOPS (1.05 EFLOPS).
The National Supercomputing Center in Shenzhen also proposed an EFLOPS-class system several years ago. That supercomputer was set to be designed by Sugon and was due to be delivered in 2022. However, Sugon's Hygon processor division no longer has access to AMD's technologies (including Zen CPU microarchitecture for its Dhyana processors and AMD compute GPUs for accelerators) due to restrictions from the U.S. government. So it is unclear how the company plans to deliver the system. Experts from ATIP believe that the NSCC and Sugon will need to find a new exascale-capable hardware platform to deploy the supercomputer. Meanwhile, the key message here is that China clearly wants another high-performance supercomputer.
It's All About Precision
It is worth noting that supercomputer rankings, such as the list maintained by Top500.org, measure compute performance in double-precision (64-bit) floating-point operations per second (FP64 FLOPS) using the LINPACK benchmark. While processors can execute lower-precision operations faster, the common standard for HPC performance is FP64 FLOPS achieved in LINPACK.
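To illustrate why precision matters, here is a minimal Python sketch (our own illustration, not anything taken from LINPACK or the systems discussed). It adds the same small value one million times, once in FP64 and once with every intermediate result rounded to FP32. The helper names to_fp32 and accumulate are hypothetical, and the FP32 rounding is emulated with the standard struct module rather than run on real single-precision hardware.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step: float, n: int, single: bool) -> float:
    """Add `step` to a running total `n` times.

    When `single` is True, the total is rounded to FP32 after every
    addition, which emulates an accumulator held in single precision.
    """
    total = 0.0
    for _ in range(n):
        total += step
        if single:
            total = to_fp32(total)
    return total


n = 1_000_000
exact = 100_000.0  # 0.1 * 1,000,000 in exact arithmetic

sum64 = accumulate(0.1, n, single=False)
sum32 = accumulate(to_fp32(0.1), n, single=True)

err64 = abs(sum64 - exact)
err32 = abs(sum32 - exact)

print(f"FP64 sum: {sum64!r}  (error {err64:.3e})")
print(f"FP32 sum: {sum32!r}  (error {err32:.3e})")
```

The FP32 total drifts much further from the exact answer than the FP64 total. This is a toy case, but it shows why a result obtained at reduced precision is not interchangeable with an FP64 result.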
When we reported on the two Chinese exascale systems last month, we said that both were tested using the LINPACK benchmark (which means that the results were by definition in FP64 FLOPS), just as NextPlatform described their performance. Neither supercomputing site submitted performance numbers to Top500.org, and some observers believe that they wanted to protect their suppliers from sanctions by the U.S. government.
But while the Chinese supercomputer specialists were too shy or cautious about submitting their results to the renowned supercomputer performance tracker, researchers from NRCPC submitted results of the Sunway Oceanlite machine for another major supercomputing award, the Gordon Bell prize, reports NextPlatform. For that submission, the team simulated a random quantum circuit of the kind run on the 53-qubit Sycamore processor (Google's quantum architecture introduced several years ago), and the Sunway Oceanlite did so in 304 seconds. Meanwhile, a team from Oak Ridge National Laboratory (ORNL) estimated that the Summit supercomputer (a 200 PFLOPS machine) would have taken around 10,000 years to simulate Sycamore. By contrast, the 53-qubit Sycamore machine did the task in 200 seconds.
As it turns out, to get the spectacular result, engineers from NRCPC reduced the precision of the simulation, which is called cheating in the world of PC benchmarks.
"In their Gordon Bell Prize-winning work, the Chinese researchers introduced a systematic design process that covers the algorithm, parallelization, and architecture required for the simulation," Dmitry Liakh, a developer from ORNL, told NextPlatform. "Using a new Sunway Supercomputer, the Chinese team effectively simulated a 10x10x (1+40+1) random quantum circuit (a new milestone for classical simulation of RQC). Their simulation achieved a performance of 1.2 EFLOPS (one quintillion floating-point operations per second) single-precision, or 4.4 EFLOPS mixed-precision, using over 41.9 million Sunway cores."
While rigging the Sycamore simulation is one deplorable thing, it reveals that the Sunway Oceanlite system is capable of 1.2 FP32 EFLOPS performance in this particular algorithm. For obvious reasons, we cannot compare results allegedly obtained in LINPACK and results obtained in the Sycamore simulation. However, we can only wonder how a system that supposedly hit 1.05 FP64 EFLOPS in one benchmark could only achieve 1.2 FP32 EFLOPS in another.
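To put the quoted figures side by side, here is a short Python sketch using only numbers cited in this article. The variable names are ours, and the per-core figure is a rough upper bound, since the quote says "over 41.9 million" cores. The two results come from different workloads, so the ratio is a back-of-the-envelope indication and not a like-for-like comparison.

```python
PFLOPS = 1e15
EFLOPS = 1e18

# Figures quoted in the article.
linpack_fp64 = 1050 * PFLOPS   # ATIP estimate for Sunway Oceanlite (FP64)
sim_fp32 = 1.2 * EFLOPS        # quantum circuit simulation, single precision
sim_mixed = 4.4 * EFLOPS       # quantum circuit simulation, mixed precision
cores = 41.9e6                 # "over 41.9 million Sunway cores"

# How much faster the FP32 result is than the claimed FP64 result.
fp32_vs_fp64 = sim_fp32 / linpack_fp64

# Approximate per-core throughput in the simulation.
per_core_fp32 = sim_fp32 / cores
per_core_mixed = sim_mixed / cores

print(f"FP32 simulation vs. FP64 LINPACK estimate: {fp32_vs_fp64:.2f}x")
print(f"Per-core FP32 rate:  {per_core_fp32 / 1e9:.1f} GFLOPS")
print(f"Per-core mixed rate: {per_core_mixed / 1e9:.1f} GFLOPS")
```

On these numbers, the single-precision result is only about 14% higher than the claimed double-precision one. That narrow gap is what prompts the question above.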
Such inconsistencies in performance numbers cast doubt on whether the initial LINPACK performance numbers for the Oceanlite and the Tianhe-3 supercomputers were correct.
While Chinese companies can design HPC hardware for petascale systems, it does not look like they can build an exascale machine with acceptable power consumption. Yet, China obviously wants to show its supercomputing prowess, which is why NRCPC did not shy away from allegedly rigging a quantum simulation benchmark result.
Right now, Chinese processors and accelerators may not be as fast as their competitors designed in the U.S. However, if China manages to produce them in high volumes, it can build more 100 – 500 FP64 PFLOPS machines to advance its scientific proficiency. Furthermore, if it needs exascale-like performance regardless of power consumption, it can try to scale out its existing designs to get there. Meanwhile, the problem is that both Sunway and Phytium CPU developers are on the U.S. blacklist, which makes it extremely hard for them to develop and build processors.
It is ironic that out of three proposed exascale designs, the one that could hit 1 FP64 EFLOPS performance (and which had to be canceled) was to be based on a combination of an AMD Zen-based Hygon CPU and an AMD Instinct compute GPU.