Alibaba Cloud ditches Nvidia's interconnect in favor of Ethernet — tech giant uses own High Performance Network to connect 15,000 GPUs inside data center


Alibaba Cloud engineer and researcher Ennan Zhai shared his research paper via GitHub, revealing the cloud provider's design for the data centers it uses for LLM training. The paper, titled "Alibaba HPN: A Data Center Network for Large Language Model Training," outlines how Alibaba used Ethernet to let its 15,000 GPUs communicate with one another.

General cloud computing generates consistent but small data flows with speeds lower than 10 Gbps. On the other hand, LLM training produces periodic bursts of data that can hit up to 400 Gbps. According to the paper, “This characteristic of LLM training predisposes Equal-Cost Multi-Path (ECMP), the commonly used load-balancing scheme in traditional data centers, to hash polarization, causing issues such as uneven traffic distribution.”
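
To see why a handful of huge flows sits badly with hash-based load balancing, here is a toy Python sketch (my illustration, not the paper's measurement setup): ECMP hashes each flow onto one of several equal-cost uplinks, which spreads thousands of small flows evenly but can pile a few 400 Gbps elephant flows onto the same link.

```python
import hashlib
from collections import Counter

# Toy ECMP model (illustrative only): each flow tuple is hashed onto one
# of `num_paths` equal-cost uplinks.
def ecmp_path(flow_tuple, num_paths):
    digest = hashlib.md5(str(flow_tuple).encode()).hexdigest()
    return int(digest, 16) % num_paths

NUM_PATHS = 8

# General cloud traffic: thousands of small "mice" flows spread out well.
mice = [(f"10.0.0.{i % 200}", f"10.0.1.{i % 50}", 5000 + i, 80) for i in range(10_000)]
mice_per_path = Counter(ecmp_path(f, NUM_PATHS) for f in mice)

# LLM training traffic: a handful of ~400 Gbps "elephant" flows.
elephants = [(f"10.1.0.{i}", f"10.1.1.{i}", 4791, 4791) for i in range(8)]
elephants_per_path = Counter(ecmp_path(f, NUM_PATHS) for f in elephants)

print("mice flows per path:    ", dict(mice_per_path))
print("elephant flows per path:", dict(elephants_per_path))
# The mice spread roughly evenly, but with only a handful of huge flows the
# hash usually lands two or more on the same uplink while others sit idle --
# the uneven traffic distribution the paper attributes to ECMP.
```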

To avoid this, Zhai and his team developed the High-Performance Network (HPN), which uses a "2-tier, dual-plane architecture" that reduces the number of possible ECMP occurrences and lets the system "precisely select network paths capable of holding elephant flows." HPN also uses dual top-of-rack (ToR) switches that back each other up. A ToR switch is the most common single point of failure in LLM training, which requires GPUs to complete each iteration in sync, so one failed switch can stall an entire job.
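
As a rough illustration of the contrast with ECMP hashing (this is a generic least-loaded placement sketch, not HPN's actual path-selection logic), deliberately placing each elephant flow on the uplink with the most spare capacity keeps every link usable:

```python
# Minimal sketch of flow-aware path selection (my illustration): place each
# elephant flow on the uplink with the most remaining capacity, so no link
# is overloaded while others sit idle.
LINK_CAPACITY_GBPS = 400

def place_flow(flow_gbps, link_load):
    """Pick the least-loaded uplink that can still hold the flow."""
    best = min(range(len(link_load)), key=lambda i: link_load[i])
    if link_load[best] + flow_gbps > LINK_CAPACITY_GBPS:
        raise RuntimeError("no uplink has room for this flow")
    link_load[best] += flow_gbps
    return best

loads = [0.0] * 8                              # eight equal-cost uplinks
placements = [place_flow(400, loads) for _ in range(8)]
print(placements)   # [0, 1, 2, ..., 7] -- one elephant flow per uplink
print(loads)        # every uplink carries exactly 400 Gbps, none overloaded
```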

Eight GPUs per host, 1,875 hosts per data center

Alibaba Cloud divided its data centers into hosts, with each host equipped with eight GPUs. Each GPU has its own network interface card (NIC) with two ports, and each GPU-NIC pair is called a 'rail.' The host also gets an extra NIC to connect to the backend network. Each rail then connects to two different ToR switches, ensuring that the entire host isn't affected even if one switch fails.
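
The per-host layout described above can be sketched as a small data model (the class and switch names are mine, for illustration only): eight rails, each dual-homed to two ToR switches, plus a backend NIC, so losing one ToR leaves every GPU reachable.

```python
from dataclasses import dataclass, field

# Sketch of one HPN host as described in the article: 8 GPUs, each paired
# with a dual-port NIC ("rail") wired to two different ToR switches, plus
# one extra NIC for the backend network.
@dataclass
class Rail:
    gpu_id: int
    tor_switches: tuple   # the two ToR switches this rail's NIC connects to

@dataclass
class Host:
    rails: list = field(default_factory=list)
    backend_nic: str = "backend-network"   # hypothetical label

def build_host(tor_a="ToR-A", tor_b="ToR-B"):
    return Host(rails=[Rail(gpu_id=i, tor_switches=(tor_a, tor_b)) for i in range(8)])

def reachable_rails(host, failed_tor):
    # A rail stays connected as long as at least one of its ToRs is up.
    return [r.gpu_id for r in host.rails if any(t != failed_tor for t in r.tor_switches)]

host = build_host()
print(len(host.rails), "rails per host")           # 8
print(reachable_rails(host, failed_tor="ToR-A"))   # all 8 GPUs still reachable
```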

Despite ditching NVLink for inter-host communication, Alibaba Cloud still uses Nvidia's proprietary interconnect inside each host, as communication between GPUs within a host requires far more bandwidth. Communication between rails is much lighter, however, so the "dedicated 400 Gbps of RDMA network throughput, resulting in a total bandwidth of 3.2 Tbps" per host is more than enough to make full use of the PCIe Gen5 x16 bandwidth of the graphics cards.
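
The bandwidth figures line up roughly as follows (a back-of-the-envelope check, not numbers taken from the paper):

```python
# Quick arithmetic behind the article's bandwidth figures.
GPUS_PER_HOST = 8
NIC_GBPS = 400

host_total_tbps = GPUS_PER_HOST * NIC_GBPS / 1000
print(f"per-host network bandwidth: {host_total_tbps} Tbps")   # 3.2 Tbps

# PCIe Gen5 x16: 32 GT/s per lane, 16 lanes, 128b/130b encoding overhead.
pcie_gen5_x16_gbps = 32 * 16 * (128 / 130)
print(f"PCIe Gen5 x16 throughput: ~{pcie_gen5_x16_gbps:.0f} Gbps per direction")
# ~504 Gbps -- so a 400 Gbps NIC comes close to saturating its slot, which is
# the sense in which the Ethernet fabric "maximizes" the PCIe bandwidth.
```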

Alibaba Cloud also uses a 51.2 Tb/s single-chip Ethernet ToR switch, as multi-chip switches are more prone to instability, with a failure rate four times higher than single-chip designs. However, these switches run hot, and no readily available heat sink on the market could keep them from shutting down due to overheating. So the company devised its own solution: a vapor chamber heat sink with more pillars at the center to carry thermal energy away more efficiently.
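
The "four times higher failure rate" claim is easy to sanity-check with a toy reliability model (assuming independent chip failures and a hypothetical per-chip rate; neither assumption comes from the paper):

```python
# Toy reliability comparison: a switch built from several chips fails if any
# one of them fails, so its failure rate scales roughly with chip count.
def switch_failure_prob(per_chip_annual_prob, num_chips):
    return 1 - (1 - per_chip_annual_prob) ** num_chips

p = 0.01                                   # hypothetical per-chip annual rate
single = switch_failure_prob(p, 1)
multi = switch_failure_prob(p, 4)          # e.g. a hypothetical 4-chip design
print(f"single-chip: {single:.4f}, multi-chip: {multi:.4f}, ratio ~{multi/single:.1f}x")
# With independent failures, a 4-chip switch is roughly 4x more likely to
# fail -- in line with the ratio the article cites.
```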

Ennan Zhai and his team will present their work at the SIGCOMM (Special Interest Group on Data Communications) conference in Sydney, Australia, this August. Many companies, including AMD, Intel, Google, and Microsoft, are likely to take an interest in this project, primarily because they have banded together to create Ultra Accelerator Link, an open-standard interconnect set to compete with NVLink. This is especially true as Alibaba Cloud has been using HPN in production for over eight months, meaning the technology has already been tried and tested.

However, HPN still has some drawbacks, the biggest being its complicated wiring structure. With each host having nine NICs and each GPU NIC connected to two different ToR switches, there are plenty of chances to plug a cable into the wrong port. Nevertheless, this technology is presumably more affordable than NVLink, allowing any institution setting up a data center to save a lot of money on setup costs (and perhaps even to avoid Nvidia technology entirely, especially for companies sanctioned by the U.S. in the ongoing chip war with China).
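
For a sense of the cabling scale behind that complaint, here is a rough count derived from the article's own figures (the actual HPN cabling plan may differ):

```python
# Rough host-side cable count, using the numbers quoted in the article.
HOSTS = 1875
GPU_NICS_PER_HOST = 8        # one dual-port NIC per GPU ("rail")
PORTS_PER_GPU_NIC = 2        # each rail wired to two different ToR switches
BACKEND_NICS_PER_HOST = 1

rail_links = HOSTS * GPU_NICS_PER_HOST * PORTS_PER_GPU_NIC
backend_links = HOSTS * BACKEND_NICS_PER_HOST
print(f"rail links: {rail_links}")                              # 30000
print(f"backend links: {backend_links}")                        # 1875
print(f"total host-side cables: {rail_links + backend_links}")  # 31875
# Tens of thousands of host-side cables is why plugging a jack into the
# wrong ToR port is a real operational risk.
```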

Jowi Morales
Contributing Writer

Jowi Morales is a tech enthusiast with years of experience working in the industry. He has been writing for several tech publications since 2021, covering tech hardware and consumer electronics.

  • bit_user
    FYI, Tenstorrent, Habana, and Cerebras are all using 100+ Gigabit Ethernet. In the case of the first two, they're even using it for intra-chassis communication, which was referenced in some choice remarks by Jim Keller:
    https://www.tomshardware.com/tech-industry/artificial-intelligence/jim-keller-suggests-nvidia-should-have-used-ethernet-to-stitch-together-blackwell-gpus
    I'm not saying Alibaba's paper is unoriginal, but just pointing out that its novelty is probably in the approach of mitigating switch failures and reducing their likelihood.
  • AINets
    This article is a bit confusing, as NVLink is still used inside the GPU hosts (to interconnect the 8x GPU memory and communication). NVLink was never an option for connecting hosts together, that is the domain of Ethernet, Infiniband, or a custom optical crossconnect (Google et al). Perhaps the author is referring to favoring Ethernet vs. Infiniband? In that case I agree with @bit_user that this is known territory. Meta have been very vocal publicly about using Ethernet and lightweight IP traffic engineering on the TOR to favor paths for elephant flows. This would seem to be the same or similar scheme.
  • bit_user
    AINets said:
    This article is a bit confusing, as NVLink is still used inside the GPU hosts (to interconnect the 8x GPU memory and communication).
    Correct.

    AINets said:
    NVLink was never an option for connecting hosts together,
    In the early revs, it was just for intra-machine communication. In the last couple generations, it started to expand to rack-scale and maybe a little beyond.

    From Nvidia's GTC 2024 Keynote:
    04:29PM EDT - And Nvidia is building a rack-scale offering using GB200 and the new NVLink: GB200 NVL72

    04:29PM EDT - NVLink 5 scales up to 576 GPUs

    04:48PM EDT - 5000 NVLink cables. 2 miles of cables

    04:48PM EDT - And those are all copper cables. No optical transceivers needed

    04:48PM EDT - That saved 20kW to be spent on computation
    At some point, you're right that Nvidia wants you to switch to Infiniband. I think their acquisition of Mellanox, ~5 years ago, had something to do with that.