China's Loongson Unveils 32-Core CPU, Reportedly 4X Faster Than Arm Chip

Loongson 3D5000 (Image credit: MyDrivers)

Loongson, a Chinese fabless chipmaker, has launched the new 3D5000 processor for data centers and cloud computing. MyDrivers reported that Loongson claims its 32-core domestic chips deliver 4X higher performance than rival Arm processors.

The 3D5000 still leverages LoongArch, Loongson's homemade instruction set architecture (ISA) from 2020. The chipmaker was previously a firm believer in MIPS. However, Loongson eventually built LoongArch from the ground up with the sole objective of not relying on foreign technology to develop its processors. LoongArch is a RISC (reduced instruction set computer) ISA, similar to MIPS or RISC-V.

The 3D5000 arrives with 32 LA464 cores running at 2 GHz. The 32-core processor has 64MB of L3 cache, supports eight-channel DDR4-3200 ECC memory, and up to five HyperTransport (HT) 3.0 interfaces. It also supports dynamic frequency and voltage adjustments. Officially, the 3D5000 has a 300W TDP; however, Loongson stated that the conventional power consumption is around 150W. That's roughly 5W per core.
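The per-core power figure can be checked with a couple of lines of arithmetic. This is a minimal sketch using only the numbers quoted above (32 cores, 150W typical, 300W TDP); the function name is illustrative, and the calculation assumes the whole package budget is split evenly across cores, which ignores the uncore and I/O.

```python
def watts_per_core(package_watts: float, cores: int) -> float:
    """Average package power per core, assuming an even split (a simplification)."""
    return package_watts / cores

CORES = 32  # from the article

typical_per_core = watts_per_core(150, CORES)  # Loongson's "conventional" figure
tdp_per_core = watts_per_core(300, CORES)      # official TDP

print(f"typical: {typical_per_core:.2f} W/core, at TDP: {tdp_per_core:.2f} W/core")
```

At the typical 150W figure this comes to a little under 5W per core; at the full 300W TDP it is roughly double that.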

The 3D5000 uses a chiplet design: Loongson has glued together two 16-core 3C5000 processors. Loongson developed the 3C5000 server part to compete with AMD's Zen and Zen+ architectures. The latest 3D5000, which measures 75.4 x 58.5 x 7.1mm, slides into a custom LGA4129 socket.

The processor supports 2P and 4P configurations, so Loongson has launched the 7A2000 bridge chip to manage communication between the processors and other components. According to the chip designer, the 7A2000 is up to 400% faster than the previous generation. With the 7A2000, a four-socket system can scale up to 128 cores per motherboard.

According to Loongson's provided numbers, the 3D5000 scores over 425 points in SPEC CPU 2006, a deprecated benchmark that has been replaced by the newer SPEC CPU 2017. The 3D5000 also delivers over 1 TFLOPS of FP64 performance, up to 4X higher than regular Arm cores. Meanwhile, the processor's Stream bandwidth with eight channels of DDR4-3200 memory exceeds 50 GB/s.

While performance isn't the 3D5000's strong suit, security is. The 32-core processor allegedly has a custom-made mechanism to defend against vulnerabilities such as Meltdown or Spectre. The chip also has its own Trusted Platform Module (TPM), so it doesn't rely on an external solution. In addition, according to MyDrivers' report, the 3D5000 supports a secret national algorithm through an embedded security module that reportedly delivers encryption and decryption throughput higher than 5 Gbps.

In addition to the 3D5000 and 7A2000, Loongson also announced the 2K050, the company's baseboard management controller (BMC). The 2K050 features LA264 cores at 500 MHz, an integrated 2D GPU, 32-bit DDR3 support, and output at 1080p (1920x1080) resolution at 60 Hz.

Loongson's 3D5000 is no match for AMD's EPYC Genoa or Intel's Sapphire Rapids Xeon processors. It was never about beating the foreign competition but pushing for self-sufficiency. Unfortunately, with the ongoing U.S. sanctions, Chinese companies have no means to secure chipmaking tools originating from the U.S. In addition, the U.S. Department of Commerce recently blacklisted Loongson, which likely derailed some of the company's plans.

Zhiye Liu
RAM Reviewer and News Editor

Zhiye Liu is a Freelance News Writer at Tom’s Hardware US. Although he loves everything that’s hardware, he has a soft spot for CPUs, GPUs, and RAM.

  • DaveLTX
    Despite loongson claiming LSA is developed on their own, it's MIPS with their own AVX like extensions

    So yes it's modified mips
    Reply
  • bit_user
    Thanks for the continuing coverage. I think it's worth keeping tabs on China's tech industry.

    Loongson claims its 32-core domestic chips deliver 4X higher performance than rival Arm processors.
    That's immediately quite suspect, so I plugged in the MyDrivers link to Google Translate and here's what it claims they said:
    "In terms of performance, the SPEC 2006 score of Loongson 3D5000 exceeds 425. The floating point part adopts dual 256bit vector units, and the double precision floating point performance can reach 1TFLOPS (1 trillion times), which is 4 times of the typical ARM core performance." (the bold is theirs)

    That's a far more nuanced statement. For one thing, the typical ARM core has dual 128-bit vectors, which gives it an automatic 2x. I don't know where the other 2x comes from, but it's not hard to imagine theirs has dual-FMA units, whereas their basis for comparison doesn't. That still doesn't get us quite to 4x, but now we're in the ballpark.

    It's way off the mark to generalize 4x the vector fp64 throughput to an overall 4x performance increase, however.
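The peak-throughput reasoning in this comment can be sketched numerically. The core count (32) and clock (2 GHz) come from the article, and the dual 256-bit vector units come from the translated quote. Everything else is an assumption made for illustration: that each unit can issue one fused multiply-add per 64-bit lane per cycle (counted as 2 FLOPs), that the Arm baseline runs at the same clock, and the two hypothetical baseline configurations. The function name is illustrative.

```python
def peak_fp64_flops(cores, ghz, vector_units, vector_bits, flops_per_lane=2):
    """Theoretical peak FP64 FLOP/s.

    flops_per_lane=2 assumes one fused multiply-add per lane per cycle.
    """
    lanes = vector_bits // 64  # FP64 lanes per vector unit
    return cores * ghz * 1e9 * vector_units * lanes * flops_per_lane

# 3D5000: 32 cores at 2 GHz with dual 256-bit units (FMA on both is assumed).
loongson_chip = peak_fp64_flops(32, 2.0, vector_units=2, vector_bits=256)

# Per-core comparison against two hypothetical Arm baselines at the same clock.
loongson_core = peak_fp64_flops(1, 2.0, vector_units=2, vector_bits=256)
arm_dual_128 = peak_fp64_flops(1, 2.0, vector_units=2, vector_bits=128)
arm_single_128 = peak_fp64_flops(1, 2.0, vector_units=1, vector_bits=128)

print(f"3D5000 peak: {loongson_chip / 1e12:.3f} TFLOPS")
print(f"vs dual 128-bit FMA core:   {loongson_core / arm_dual_128:.0f}x")
print(f"vs single 128-bit FMA core: {loongson_core / arm_single_128:.0f}x")
```

Under these assumptions the chip-level peak lands at about 1 TFLOPS, matching the claim. Vector width alone accounts for 2x per core, and a baseline with only one FMA-capable unit would account for the full 4x. Either way, this is a peak vector FP64 figure, not a measure of overall performance.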

    According to Loongson's provided numbers, the 3D5000 scores over 425 points in SPEC CPU 2006
    That's the claimed single-CPU score. The article also mentions scores of 800 and 1500, for dual- and quad- CPU configurations.

    For comparison, the last set of SPEC2006 scores I could find for an AMD CPU are the Zen 2-based Rome Epyc:

    https://www.anandtech.com/show/14694/amd-rome-epyc-2nd-gen/10

    I think the score they're quoting is a simple average of the sub-scores, in which case the 128-core dual-CPU Rome Epyc delivered 3434. That's about 2.3x what they claimed to achieve with the same core-count.
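The comparison above is simple division, sketched here with the figures quoted in this comment (425, 800, and 1500 for one, two, and four Loongson CPUs; 3434 for the dual-socket Rome Epyc). The single-CPU score is stated as "over 425", so the scaling-efficiency numbers are approximate. The variable names are illustrative.

```python
# Claimed SPEC CPU 2006 scores by socket count, as quoted in the comment.
loongson_scores = {1: 425, 2: 800, 4: 1500}
rome_dual_socket = 3434  # 2 x 64 cores = 128 cores

# A quad-CPU 3D5000 system has 4 x 32 = 128 cores, the same as dual Rome.
loongson_128_core = loongson_scores[4]
rome_advantage = rome_dual_socket / loongson_128_core

# Multi-socket scaling efficiency relative to perfect linear scaling.
scaling = {
    sockets: score / (sockets * loongson_scores[1])
    for sockets, score in loongson_scores.items()
}

print(f"Rome vs 3D5000 at 128 cores: {rome_advantage:.2f}x")
for sockets, eff in scaling.items():
    print(f"{sockets}P scaling efficiency: {eff:.0%}")
```

The ratio comes out at roughly 2.3x, and the claimed scores imply scaling of about 94% at two sockets and 88% at four.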

    Meanwhile, the processor's stream performance with eight channels of DDR4-3200 memory crosses the 50GB mark.
    This probably refers to the OpenMP-based Stream Triad benchmark, which adds 2 streams of numbers and writes out a 3rd. That means actual memory traffic is 3-4x whatever number they're quoting, so 150 to 200 GB/s. That aligns well with a raw bandwidth of about 205 GB/s from 8x DDR4-3200. In multi-CPU configurations, it particularly stresses the inter-processor link. Presumably, the number they're quoting is just from a single-CPU config.
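The raw bandwidth figure can be reproduced as follows. This sketch assumes standard 64-bit (8-byte) DDR4 channels and takes the 3-4x traffic multiplier from this comment at face value rather than verifying how the quoted Stream number was reported. The function name is illustrative.

```python
def peak_dram_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).

    bytes_per_transfer=8 assumes a 64-bit data bus per channel.
    """
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9

peak = peak_dram_bandwidth_gbs(channels=8, mega_transfers_per_s=3200)

quoted_stream = 50  # GB/s, Loongson's claimed figure
# Traffic range implied by the comment's 3x to 4x multiplier.
traffic_low, traffic_high = quoted_stream * 3, quoted_stream * 4

print(f"raw peak: {peak:.1f} GB/s")
print(f"implied traffic: {traffic_low}-{traffic_high} GB/s "
      f"({traffic_low / peak:.0%}-{traffic_high / peak:.0%} of peak)")
```

Eight channels of DDR4-3200 give a theoretical 204.8 GB/s, so the implied 150 to 200 GB/s would sit between roughly 73% and 98% of peak.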

    It would be great to know what process node these chips were designed for.
    Reply
  • bit_user
    DaveLTX said:
    Despite loongson claiming LSA is developed on their own, it's MIPS with their own AVX like extensions
    It's certainly more than that, but I'm not an authority on the subject. My best understanding is that they borrowed much of the MIPS system architecture, while the instruction set architecture is substantially different.

    If you have a good source on the matter, please share it.
    Reply
  • The Historical Fidelity
    bit_user said:
    It would be great to know what process node these chips were designed for.
    I suspect it is SMIC 14nm since the 3A5000 was originally designed to TSMC 14nm design rules and this new cpu is just 2 3A5000’s glued together. Since SMIC 14nm is an unsanctioned copy of TSMC 14nm and loongson no longer has access to TSMC services, the cpu design would be compatible with SMIC’s node so this seems the most likely assumption.
    Reply
  • shady28
    The Historical Fidelity said:
    I suspect it is SMIC 14nm since the 3A5000 was originally designed to TSMC 14nm design rules and this new cpu is just 2 3A5000’s glued together. Since SMIC 14nm is an unsanctioned copy of TSMC 14nm and loongson no longer has access to TSMC services, the cpu design would be compatible with SMIC’s node so this seems the most likely assumption.

    Pretty sure that TSMC never had a 14nm. They went from 16->12->10 (Apple only) -> 7

    Worth noting that Intel 14nm is about 10-12% higher density than TSMC 12FFC and about 20% higher density than SMIC 14nm.

    This is basically on a SMIC node that is just a smidge better than the old 16nm node that TSMC was using in 2013.
    Reply
  • anonymousdude
    shady28 said:
    Pretty sure that TSMC never had a 14nm. They went from 16->12->10 (Apple only) -> 7

    Worth noting that Intel 14nm is about 10-12% higher density than TSMC 12FFC and about 20% higher density than SMIC 14nm.

    This is basically on a SMIC node that is just a smidge better than the old 16nm node that TSMC was using in 2013.

    You are correct. TSMC never did formally name a node 14nm. Their 16nm and 12nm constituted the "14nm-class" node.
    Reply
  • DaveLTX
    bit_user said:
    It's certainly more than that, but I'm not an authority on the subject. My best understanding is that they borrowed much of the MIPS system architecture, while the instruction set architecture is substantially different.

    If you have a good source on the matter, please share it.
    The programming manual mentioned by chips and cheese shows the lineage of the LSA clear as day and night, that and they were using MIPS recently so that gives a sign it's based off MIPS with their own additions
    Reply
  • The Historical Fidelity
    shady28 said:
    Pretty sure that TSMC never had a 14nm. They went from 16->12->10 (Apple only) -> 7

    Worth noting that Intel 14nm is about 10-12% higher density than TSMC 12FFC and about 20% higher density than SMIC 14nm.

    This is basically on a SMIC node that is just a smidge better than the old 16nm node that TSMC was using in 2013.
    Good catch, I meant TSMC 12nm
    Reply
  • thisisaname
    It is easy to make claims and I do not think any third party benchmarks are going to be run any time soon.
    Reply
  • DaveLTX
    thisisaname said:
    It is easy to make claims and I do not think any third party benchmarks are going to be run any time soon.
    https://chipsandcheese.com/2023/04/09/loongsons-3a5000-chinas-best-shot/
    Except there is?
    https://chipsandcheese.com/2023/01/29/previewing-chinas-loongson-3a5000-with-performance-counters/
    Reply