Exacluster with 144 Nvidia H200 AI GPUs detailed by its designer: Hydra Host enters the scene

ExaAI's H200 cluster
(Image credit: Will Bryk/X)

Earlier this month, we reported on ExaAILabs's Exacluster, a cluster of 18 machines running 144 Nvidia H200 GPUs and one of the first clusters built around these processors. Since then, Hydra Host, the company that facilitated the construction of the cluster, has given us additional details about the system. The cluster is built on Lenovo servers with multiple customizations from Hydra Host, and the machine can also be rented, when not in use by its owner, through Hydra's Brokkr platform.

A Lot of Compute Power

The cluster's backbone consists of 18 Lenovo nodes equipped with 144 Nvidia H200 GPUs (eight per system) and 20TB of HBM3E memory in total, enabling compute performance of around 570 petaFLOPS of FP8 for AI. Sixteen nodes are configured and fine-tuned by Hydra Host for training, which requires massive compute and memory performance, while the remaining two serve as inference nodes. In addition, Hydra Host installed its Brokkr platform for GPU provisioning, management, and remote renting (more on this later).
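As a quick sanity check, the headline totals line up with the commonly cited per-GPU H200 figures of roughly 141GB of HBM3E and about 3.96 petaFLOPS of sparse FP8; these per-GPU numbers are our assumptions, not details supplied by Hydra Host:

```python
# Rough sanity check of the headline figures, assuming commonly cited per-GPU
# H200 numbers (~141 GB of HBM3E and ~3.96 PFLOPS of sparse FP8). These
# per-GPU values are assumptions, not figures provided by Hydra Host.
GPUS = 144
HBM_PER_GPU_GB = 141          # H200 SXM, approximate
FP8_PER_GPU_PFLOPS = 3.96     # with structured sparsity, approximate

total_hbm_tb = GPUS * HBM_PER_GPU_GB / 1000
total_fp8_pflops = GPUS * FP8_PER_GPU_PFLOPS

print(f"Total HBM3E: ~{total_hbm_tb:.1f} TB")          # ~20.3 TB
print(f"Total FP8:   ~{total_fp8_pflops:.0f} PFLOPS")  # ~570 PFLOPS
```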

Hydra Host collaborated with Computacenter to design a high-performance networking architecture tailored to the cluster's needs. The setup uses 3.2Tbps InfiniBand for east-west traffic and 400Gbps Ethernet for north-south communication, including dual 200Gbps connections per server and 400Gbps Dell Ethernet switches. Computacenter's networking engineers ensured all components aligned with Nvidia's reference architecture for seamless compatibility.
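For rough context, the bandwidth figures are consistent with one 400Gbps NDR InfiniBand link per GPU, the layout Nvidia's reference architecture typically calls for, plus the dual 200Gbps Ethernet uplinks mentioned above; the per-GPU link speed is an assumption on our part:

```python
# Rough bandwidth math for the fabric described above, assuming one 400 Gb/s
# NDR InfiniBand link per GPU (as in Nvidia's reference designs) plus the dual
# 200 Gb/s Ethernet uplinks mentioned in the article. The per-GPU link speed
# is an assumption, not a detail confirmed by Hydra Host.
GPUS_PER_NODE = 8
IB_LINK_GBPS = 400            # assumed NDR InfiniBand link per GPU
ETH_LINKS_PER_NODE = 2
ETH_LINK_GBPS = 200

east_west_tbps = GPUS_PER_NODE * IB_LINK_GBPS / 1000    # GPU-to-GPU fabric
north_south_gbps = ETH_LINKS_PER_NODE * ETH_LINK_GBPS   # storage and clients

print(f"East-west (InfiniBand): {east_west_tbps} Tb/s per node")    # 3.2
print(f"North-south (Ethernet): {north_south_gbps} Gb/s per node")  # 400
```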

"We supplied the 18 Lenovo nodes with H200 GPUs (16 interconnected and two inference nodes), designed the networking architecture in collaboration with Computacenter, and facilitated colocation through Patmos," explained Andrea Holt, a spokesperson for Hydra Host.

The cluster itself is quite powerful, even in terms of general-purpose computing. Each server features two 96-core processors (192 cores per node, or 3,456 cores in total) paired with 36TB of DDR5 memory and 270TB of NVMe solid-state storage across the cluster. There are spare bays so that storage space can be expanded easily. The supercomputer uses a network custom-built by Hydra Host.
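Dividing the quoted totals by the 18 nodes gives the per-server picture:

```python
# Per-node breakdown derived from the cluster totals quoted above
# (18 nodes, 3,456 CPU cores, 36 TB of DDR5, 270 TB of NVMe storage).
NODES = 18
TOTAL_CORES = 3456
TOTAL_DDR5_TB = 36
TOTAL_NVME_TB = 270

print(f"Cores per node: {TOTAL_CORES // NODES}")          # 192 (2 x 96-core CPUs)
print(f"DDR5 per node:  {TOTAL_DDR5_TB / NODES:.0f} TB")  # 2 TB
print(f"NVMe per node:  {TOTAL_NVME_TB / NODES:.0f} TB")  # 15 TB
```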

The company also brought in Patmos to handle colocation, providing enough power (around 100kW) and cooling for the power-hungry, hot-running machines.

Best Performance at Best Price

The Exacluster costs $5 million, or about $277,777 per machine on average, comparable to the price of a single 8-way H200 baseboard rather than a full server. Here is where it gets interesting: who facilitated that price?

On the one hand, Hydra Host is a close Nvidia partner and only offers Nvidia GPUs as a service. In addition, its Brokkr software is optimized primarily for CUDA. On the other hand, ExaAI is a company backed by Nvidia, so it can potentially get preferential pricing.

"We are best in market at getting our customers the right GPU for their needs and at the best price," said Ryan Horjus, Lead Sales Engineer at Hydra. "This cluster was supported by Nvidia from an architecture design and their Inception program. Hydra handled it for Exa, as we do for other companies."

Hydra also specializes in building custom solutions for startups and even monetizes their machines when not in use.

"Hydra has helped startups get into their own clusters for better pricing through bulk purchasing," Horjus added. "They can achieve ideal pricing through our network. They are also able to monetize the servers when not in use via the Brokkr management platform."

Speaking of Brokkr, it is GPU management and provisioning software as well as a monetization platform for GPUs. It provides datacenters and startups with a turnkey software solution for getting their hardware into customers' hands and getting paid for it, explained Ariel Deschapell, chief technology officer and co-founder of Hydra Host.

"One of its key features is automated bare metal provisioning and lifecycle management," described Deschapell. "That means the platform does all the work of configuring and managing the base server OS and firmware, setting up drivers and other supporting software, and running tests on the GPUs and other components. That speeds up and standardizes the delivery process significantly, reducing idle time on servers and GPUs. It also makes it easy to resell unused servers later to other users on the Brokkr platform looking for bare metal GPUs, if capacity needs change."

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.

  • bit_user
    The article said:
    The setup uses 3.2Tbps InfiniBand for east-west traffic and 400Gbps Ethernet for north-south communication
    In this context, what do "east-west" and "north-south" refer to? I'd guess east-west means communication among peer nodes and north-south is referring to communication with clients and storage.
  • bit_user
    The article said:
    On the one hand, Hydra Host is a close Nvidia partner and only offers Nvidia GPUs as a service. In addition, its Brokkr software is optimized primarily for CUDA. On the other hand, ExaAI is a company backed by Nvidia, so it can potentially get preferential pricing.
    LOL, it's the same hand!
    The article said:
    Hydra also specializes in building custom solutions for startups and even monetizes their machines when not in use.
    This is a nice idea, in theory. However, the concern I'd have is that AI training tends to be so data-intensive that it would take a long time to upload all of your training data to their servers and then you get to actually use the cluster for how long???

    Plus, once you get bumped, because they want to use it, what do you do? I guess you have to transfer your partially-trained model + training data to some other cluster and continue there? Sounds inefficient, to me. I guess if the cost savings are substantial vs. one of the big cloud operators, then it might be worth the downsides.
  • Stomx
    This setup is nothing special; most typical university supercomputers are like this, except that instead of H200 or B200 they still have the older A100, which is around 3x slower than the H200.

    By the way, since DeepSeek trained their model on 2,048 H800 GPUs for two months, a cluster like this would train it in roughly 9-12 months (the H200 is 2-3x faster than the H800, which appears severely restricted on FP64, which is not used for AI anyway, and on communication speed).

    The cost of electricity (100kW * 10,000 hours * 10 cents per kWh) is just $100,000 per year, which is negligible.

    So this $5M cluster is quite capable of doing the same job DeepSeek did, for about the same money.
  • Stomx
    Now I understand why they called their company DeepS...
  • George³
    bit_user said:
    In this context, what do "east-west" and "north-south" refer to? I'd guess east-west means communication among peer nodes and north-south is referring to communication with clients and storage.
    The proper naming in the case of supercomputers is vertical and horizontal network communications. The article just uses the terms common among the workers who physically install the network in supercomputers.
  • Stomx
    More interesting numbers.
    If Elon Musk's current AI cluster, consisting of 100,000 H100 GPUs (which are not substantially slower than the H200), started training the DeepSeek model at 9 am, it would finish at approximately the end of the day, around 7 pm.

    And if DeepSeek decided to submit an application requesting computer time on the most powerful supercomputer, Frontier, they would most probably get rejected for asking for too little time, below the cut-off threshold, which is around 250,000 node-hours.