Chinese Exascale Supercomputers: Not Everything Is as It Seems

Tianhe
(Image credit: Top500.org/News.cn)

Chinese supercomputers have recently attracted much attention from the hardware and high-performance computing (HPC) communities following sanctions imposed by the U.S. government. Back in October, at least two Chinese supercomputers broke the so-called exascale barrier. And during the SuperComputing 21 (SC21) conference, reports alleged that another Chinese exascale supercomputer is under development. However, there seems to be a significant catch with these machines.

Three Exascale Systems

David K. Kahaner, an HPC expert and founder of the Asian Technology Information Program (ATIP), presented on modern supercomputers in China at SC21. Thankfully, parts of that presentation were published by Koji Uchikawa in a Twitter post (via ComputerBase). Kahaner revealed that China has multiple 100 – 500 PFLOPS systems online based on homegrown technologies or commercially available AMD, Intel, and Nvidia hardware. He also reiterated that two exascale-class systems exist in China and that another system in development has been delayed.

(Image credit: ATIP/Koji Uchikawa Twitter)

As previously reported, the highest-performing Chinese supercomputer is the Tianhe-3 system located in the National Supercomputer Center in Guangzhou, China, according to ATIP. The machine uses Armv8-based Phytium 2000+ (FTP) processors for traditional HPC workloads with full FP64 precision. It relies on Matrix 2000+ (MTP) DSP accelerators for emerging workloads like AI that do not require FP64 precision at all times. ATIP says that the system is rated at around 1300 PFLOPS (1.3 EFLOPS).  

China's second highest-performing supercomputer is the Sunway Oceanlite, located in the National Research Center of Parallel Computer Engineering and Technology (NRCPC). It uses proprietary hybrid 390-core Sunway processors that derive from the Sunway SW26010 CPUs. ATIP estimates that the sustainable performance of the machine is around 1050 PFLOPS (1.05 EFLOPS). 

(Image credit: ATIP/Koji Uchikawa Twitter)

The National Supercomputing Center in Shenzhen also proposed an EFLOPS-class system several years ago. That supercomputer was set to be designed by Sugon and was due to be delivered in 2022. However, Sugon's Hygon processor division no longer has access to AMD's technologies (including Zen CPU microarchitecture for its Dhyana processors and AMD compute GPUs for accelerators) due to restrictions from the U.S. government. So it is unclear how the company plans to deliver the system. Experts from ATIP believe that the NSCC and Sugon will need to find a new exascale-capable hardware platform to deploy the supercomputer. Meanwhile, the key message here is that China clearly wants another high-performance supercomputer.

 

It's All About Precision

Supercomputing specialists, such as Top500.org, measure the compute performance of supercomputers in double-precision (64-bit) floating-point operations per second, or FP64 FLOPS, using the LINPACK benchmark. While processors can execute lower-precision operations faster, the common standard for HPC performance is FP64 FLOPS achieved in LINPACK.
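To illustrate what is given up when precision is reduced, here is a minimal Python sketch that is not taken from any of the systems discussed. It rounds values to single precision (FP32) with the standard struct module; the helper name to_fp32 and the test values are assumptions chosen for illustration.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Python floats are IEEE 754 double precision (FP64).
big = 16_777_216.0               # 2**24: above this, FP32 can no longer count by ones
fp64_sum = big + 1.0             # 16,777,217 is exactly representable in FP64
fp32_sum = to_fp32(big + 1.0)    # in FP32 the added 1.0 is rounded away

# Relative error introduced by storing 0.1 in FP32 instead of FP64.
fp64_tenth = 0.1
fp32_tenth = to_fp32(0.1)
rel_err = abs(fp32_tenth - fp64_tenth) / fp64_tenth

print(f"FP64: {fp64_sum:.1f}  FP32: {fp32_sum:.1f}")
print(f"Relative error of 0.1 in FP32: {rel_err:.2e}")
```

FP32 carries a 24-bit significand against 53 bits for FP64, which is why results measured at different precisions are not interchangeable.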

When we reported on the two Chinese exascale systems last month, we said that both were tested using the LINPACK benchmark (which means that the results were by definition in FP64 FLOPS), just as NextPlatform described their performance. Neither supercomputing site submitted performance numbers to Top500.org, but some observers believe that they wanted to protect their suppliers from sanctions by the U.S. government.

But while the Chinese supercomputer specialists were too shy or cautious to submit their results to the renowned supercomputer performance tracker, researchers from NRCPC submitted results of the Sunway Oceanlite machine for another major supercomputing award, the Gordon Bell Prize, reports NextPlatform. For that entry, the team simulated the 53-qubit Sycamore circuit (Google's quantum architecture introduced several years ago), and the Sunway Oceanlite did so in 304 seconds. Meanwhile, a team from Oak Ridge National Laboratory (ORNL) estimated that the Summit supercomputer (a 200 PFLOPS machine) would have taken around 10,000 years to simulate Sycamore. By contrast, the 53-qubit Sycamore machine did the task in 200 seconds.
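For a sense of scale, the three run times quoted above can be compared directly. The Python sketch below is only back-of-the-envelope arithmetic on those figures; it assumes a 365.25-day year, and the variable names are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year

summit_estimate_s = 10_000 * SECONDS_PER_YEAR  # estimate for Summit quoted above
oceanlite_s = 304                              # Sunway Oceanlite simulation
sycamore_s = 200                               # Sycamore quantum processor

# How much shorter the Oceanlite run is than the Summit estimate.
speedup_vs_summit_estimate = summit_estimate_s / oceanlite_s
# How much longer the Oceanlite run is than the quantum hardware itself.
slowdown_vs_sycamore = oceanlite_s / sycamore_s

print(f"Oceanlite vs. Summit estimate: {speedup_vs_summit_estimate:.2e}x shorter")
print(f"Oceanlite vs. Sycamore: {slowdown_vs_sycamore:.2f}x longer")
```

The first ratio compares runs that differ in algorithm and precision, so it should not be read as a measure of hardware speed alone.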

As it turns out, to get the spectacular result, engineers from NRCPC reduced the precision of the simulation, which is called cheating in the world of PC benchmarks.

"In their Gordon Bell Prize-winning work, the Chinese researchers introduced a systematic design process that covers the algorithm, parallelization, and architecture required for the simulation," Dmitry Liakh, a developer from ORNL, told NextPlatform. "Using a new Sunway Supercomputer, the Chinese team effectively simulated a 10x10x (1+40+1) random quantum circuit (a new milestone for classical simulation of RQC). Their simulation achieved a performance of 1.2 EFLOPS (one quintillion floating-point operations per second) single-precision, or 4.4 EFLOPS mixed-precision, using over 41.9 million Sunway cores."

Rigging the Sycamore simulation is deplorable in itself, but it also reveals that the Sunway Oceanlite system is capable of 1.2 FP32 EFLOPS in this particular algorithm. For obvious reasons, we cannot directly compare results allegedly obtained in LINPACK with results obtained in the Sycamore simulation. Still, one has to wonder how a system that supposedly hit 1.05 FP64 EFLOPS in one benchmark could only achieve 1.2 FP32 EFLOPS in another.
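The comparison can be put in rough numbers. The Python sketch below uses only figures quoted in this article; the 2:1 ratio of FP32 to FP64 throughput is an assumption for illustration (it is common on many processors but is not confirmed for the Sunway chips), and all variable names are hypothetical.

```python
linpack_fp64_eflops = 1.05   # ATIP estimate for Sunway Oceanlite, FP64
sim_fp32_eflops = 1.2        # quantum circuit simulation, single precision
cores = 41.9e6               # Sunway cores used in the simulation

# Assumption (not from the source): FP32 runs at about twice the FP64 rate.
ASSUMED_FP32_TO_FP64_RATIO = 2.0

# FP32 rate that the LINPACK estimate would imply under that assumption.
implied_fp32_eflops = linpack_fp64_eflops * ASSUMED_FP32_TO_FP64_RATIO
# Share of that implied rate actually reached in the simulation.
fraction_achieved = sim_fp32_eflops / implied_fp32_eflops

# Per-core single-precision throughput in GFLOPS (1 EFLOPS = 1e9 GFLOPS).
gflops_per_core_fp32 = sim_fp32_eflops * 1e9 / cores

print(f"Implied FP32 rate: {implied_fp32_eflops:.2f} EFLOPS")
print(f"Fraction achieved in simulation: {fraction_achieved:.0%}")
print(f"FP32 throughput per core: {gflops_per_core_fp32:.1f} GFLOPS")
```

Under that assumption, the simulation reaches a little over half of the FP32 rate that the LINPACK estimate would imply. The two workloads differ, so the ratio is only indicative.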

Such inconsistencies in performance numbers cast doubt on whether the initial LINPACK performance numbers for the Oceanlite and the Tianhe-3 supercomputers were correct.

Summary

While Chinese companies can design HPC hardware for petascale systems, it does not look like they can build an exascale machine with acceptable power consumption. Yet, China obviously wants to show its supercomputing prowess, which is why NRCPC did not shy away from allegedly rigging a quantum simulation benchmark result.

Right now, Chinese processors and accelerators may not be as fast as their competitors designed in the U.S. However, if China manages to produce them in high volumes, it can build more 100 – 500 FP64 PFLOPS machines to advance its scientific proficiency. Furthermore, if it needs exascale-like performance regardless of power consumption, it can try to scale out its existing designs to get there. The problem, however, is that both Sunway and Phytium CPU developers are on the U.S. blacklist, which makes it extremely hard for them to develop and build processors.

(Image credit: ATIP/Koji Uchikawa Twitter)

It is ironic that out of three proposed exascale designs, the one that could hit 1 FP64 EFLOPS performance (and which had to be canceled) was to be based on a combination of AMD Zen-based Hygon CPUs and AMD Instinct compute GPUs.

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.

  • Endymio
    So China cheated on the test? That shouldn't surprise anyone...
  • setx
    Is US that desperate that they've lost/losing performance crown? China didn't even announce their supercomputer speed but so many accents in the article "it's not real, probably!"
  • Endymio
    setx said:
    China didn't even announce their supercomputer speed ...
    The PRC-funded National Research Center entered their supercomputer to win the Gordon Bell prize. How is that not announcing its speed?
  • setx
    Endymio said:
    The PRC-funded National Research Center entered their supercomputer to win the Gordon Bell prize. How is that not announcing its speed?
    That's way more about their simulation algorithm than raw speed.
  • Endymio
    setx said:
    That's way more about their simulation algorithm than raw speed.
    The prize is for achievements in supercomputing, not algorithm design. China knowingly entered it in an attempt to garner prestige and standing, and they knowingly cheated on the benchmark.
  • crostini
    setx said:
    Is US that desperate that they've lost/losing performance crown? China didn't even announce their supercomputer speed but so many accents in the article "it's not real, probably!"

    It's important to realize the amount of money China spends for a worldwide PR campaign that amounts to typical propaganda. From LeBron to John Cena to many US state governors, it includes big names all the way to the nameless. Including those posting on forums of tech sites online. In China you can even decrease your prison sentence by each post you make supporting the People's Republic.
  • setx
    crostini said:
    It's important to realize the amount of money China spends for a worldwide PR campaign that amounts to typical propaganda. From LeBron to John Cena to many US state governors, it includes big names all the way to the nameless. Including those posting on forums of tech sites online. In China you can even decrease your prison sentence by each post you make supporting the People's Republic.
    Well, maybe they spend money for propaganda in US, but definitely not where I am. Also, never heard of LeBron nor John Cena...

    Personally I welcome news about (potentially) competing new Chinese hardware purely from consumer standpoint of the more competition the better. Just compare Intel's and Chinese videocards: Intel's are pure propaganda for now but every technical site just has to regularly write lengthy articles about them, while there is almost nothing about Chinese cards. Pretty obvious who pays for what.

    Endymio said:
    The prize is for achievements in supercomputing, not algorithm design. China knowingly entered it in an attempt to garner prestige and standing, and they knowingly cheated on the benchmark.
    I can't comment on cheating as I'm not going to read it into details, but supercomputing without proper algorithms is just stupid. Computers (and supercomputers) are totally useless without corresponding algorithms.
  • Nobonita Barua
    Instead of these, China should fast-track EUV; domestic ones are scheduled for 2023. USA SCs are not of much use given they cannot really show what they are doing with those. They always come first in tests where they make the question papers.
    Next, bring on the quantum computers.
  • galftanus
    Endymio said:
    The prize is for achievements in supercomputing, not algorithm design. China knowingly entered it in an attempt to garner prestige and standing, and they knowingly cheated on the benchmark.

    No, they did not. The article itself seems to be written by somebody who is not too familiar with HPC and supercomputing.

    The fact that they achieved 1.2 Exaflop in a real scientific algorithm in no way contradicts them getting about 1 Exaflop in Linpack. Linpack is a very simple benchmark and achieves a very high percentage of peak performance. Real world algorithms typically achieve much less. Of course it does not prove it either, since the floating point accuracy is different.

    The Gordon Bell Prize is given to "recognize outstanding achievement in high-performance computing" (https://awards.acm.org/bell). Using mixed precision algorithms to achieve best performance is absolutely allowed, and a typical strategy. It is not like they hide it, read the article yourself: https://dl.acm.org/doi/pdf/10.1145/3458817.3487399