Nvidia and AMD-Powered Polaris Supercomputer: Preparing for Intel's Delayed Aurora

The US Department of Energy's Argonne National Laboratory has tapped Nvidia's A100 GPUs paired with AMD's CPUs for its new Polaris supercomputer, a move widely seen as a response to delays with the Intel-powered Aurora supercomputer. Aurora has been held up by production difficulties with Intel's Sapphire Rapids server chips.

Polaris will employ four Nvidia A100s and two AMD CPUs per node, for a total of 2,240 Nvidia GPUs spread across 560 nodes. Polaris will deliver up to 44 PetaFLOPS of FP64 performance, making it a much less powerful machine than the planned Aurora exascale supercomputer, which is expected to deliver up to one ExaFLOPS of sustained performance when it arrives in 2022-2023.

Polaris will also reach up to 1.4 "AI ExaFLOPS" of performance, but that figure isn't measured with the standard FP64 workload used to quantify supercomputer performance. That means Polaris isn't an exascale-class machine, and we have yet to see a planned Nvidia-powered supercomputer that will reach that mark. However, Polaris' 44 PetaFLOPS of performance qualifies for a spot in the top ten of the Top 500 list of the fastest supercomputers in the world.
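To put the two headline numbers in perspective, here is a rough back-of-the-envelope sketch that divides each system-level figure by the GPU count. The node, GPU, and FLOPS figures come from the article; the per-GPU A100 reference values are an assumption taken from Nvidia's published peak specifications, and the article does not state which precision or sparsity mode the "AI ExaFLOPS" number is based on.

```python
# Figures quoted in the article
NODES = 560
GPUS_PER_NODE = 4
FP64_PEAK_PFLOPS = 44   # system-level FP64 figure
AI_PEAK_EFLOPS = 1.4    # system-level "AI ExaFLOPS" figure

total_gpus = NODES * GPUS_PER_NODE

# Convert both system figures to TFLOPS and divide by the GPU count
fp64_per_gpu_tflops = FP64_PEAK_PFLOPS * 1_000 / total_gpus
ai_per_gpu_tflops = AI_PEAK_EFLOPS * 1_000_000 / total_gpus

# Assumption (not from the article): Nvidia's published A100 peak figures
A100_FP64_TENSOR_TFLOPS = 19.5   # FP64 Tensor Core peak
A100_FP16_SPARSE_TFLOPS = 624    # FP16 Tensor Core peak with structured sparsity

print(f"Total GPUs:        {total_gpus}")
print(f"FP64 per GPU:      {fp64_per_gpu_tflops:.1f} TFLOPS "
      f"(A100 reference: {A100_FP64_TENSOR_TFLOPS})")
print(f"'AI' per GPU:      {ai_per_gpu_tflops:.1f} TFLOPS "
      f"(A100 reference: {A100_FP16_SPARSE_TFLOPS})")
```

The per-GPU results land very close to the A100's peak FP64 Tensor Core and sparse FP16 Tensor Core ratings, which suggests (though the article does not confirm it) that the "AI ExaFLOPS" number reflects low-precision throughput rather than anything comparable to the FP64 figure.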

As per the press release, Polaris will be used to help develop code for the forthcoming Aurora:

"The system will accelerate transformative scientific exploration, such as advancing cancer treatments, exploring clean energy and propelling particle collision research to discover new approaches to physics. And it will transport the ALCF into the era of exascale AI by enabling researchers to update their scientific workloads for Aurora, Argonne’s forthcoming exascale system."

Supercomputers Aurora Polaris (Image credit: ASCR)

As we can see above (via Dylan Patel of SemiAnalysis), Polaris is projected to consume roughly 2 MW of power at peak operation, which pales in comparison to Aurora's 60 MW. Polaris will also come with 560 nodes as opposed to Aurora's planned 9,000 nodes. The system will also use Cray's Slingshot networking fabric, meaning the system is produced by Hewlett Packard Enterprise (HPE).
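For a sense of scale, the short sketch below computes the Aurora-to-Polaris ratios from the figures quoted in this article. It is only a rough comparison: Aurora's one ExaFLOPS is described as sustained performance while Polaris' 44 PetaFLOPS is an "up to" figure, so the performance ratio in particular should be read loosely.

```python
# Figures quoted in the article (Aurora values are planned, not measured)
polaris = {"power_mw": 2, "nodes": 560, "pflops": 44}
aurora = {"power_mw": 60, "nodes": 9_000, "pflops": 1_000}  # 1 ExaFLOPS = 1,000 PetaFLOPS

# Ratio of Aurora to Polaris for each metric
power_ratio = aurora["power_mw"] / polaris["power_mw"]
node_ratio = aurora["nodes"] / polaris["nodes"]
perf_ratio = aurora["pflops"] / polaris["pflops"]

print(f"Power: Aurora draws about {power_ratio:.0f}x as much")
print(f"Nodes: Aurora has about {node_ratio:.1f}x as many")
print(f"FP64:  Aurora targets about {perf_ratio:.1f}x the performance")
```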

[EDIT:] HPC Wire reports that 32-core Epyc Rome 7532 CPUs will be used at first, then swapped in March 2022 for newer 32-core Epyc Milan 7543 chips. Polaris will also be updated later from Slingshot 10 to the Slingshot 11 fabric to match Aurora, confirming that Intel doesn't handle the networking for Aurora. Polaris uses 40 racks of Apollo Gen10 servers.


The delayed Aurora exascale system is yet another blow for Intel's recent HPC efforts. Aurora was originally planned as a much smaller 180-PetaFLOPS machine that would debut back in 2018, but Intel's Xeon Phi processors were delayed and eventually canceled. That led to a complete redesign of the Aurora system into an exascale system powered by Intel's Ponte Vecchio GPUs and Sapphire Rapids CPUs. In fact, Intel recently divulged that it developed a key aspect of its Ponte Vecchio GPUs, the Xe Tile that speeds communication between clustered compute elements, specifically at the request of the DoE for Aurora. That effort took a year of development work.

The DoE has confirmed the second revision of Aurora is now delayed from its original 2021 launch window, which isn't surprising given Intel's struggles with its 10nm Enhanced SuperFin node (now renamed Intel 7) used for the Sapphire Rapids processors. As a result, Aurora is now scheduled for deployment in the 2022-2023 timeframe, ceding the title of the world's first exascale supercomputer to the AMD-powered Frontier. AMD's machine will also be the fastest, and work continues apace.

Meanwhile, Polaris is already in the final stages of installation and will be ready for its first research work in early 2022, with broadened availability to the research community in Q2 2022.


Paul Alcorn is the Editor-in-Chief for Tom's Hardware US. He also writes news and reviews on CPUs, storage, and enterprise hardware.