Intel-powered Aurora supercomputer fails to dethrone AMD-powered Frontier on Top500 list, again — claims spot as fastest AI supercomputer with HPL-MxP benchmark instead
Aurora faces stability issues from hardware failures, cooling malfunctions, and operational errors.
Today, a slew of supercomputing entities submitted their newest benchmark test results to the Top500 committee to compete for the top spot on the list. The Intel-powered Aurora supercomputer was widely expected to take the top spot from the AMD-powered Frontier, the #1 supercomputer on the Top500 list, but it took second place instead. However, Aurora did take the top spot in the AI-centric HPL-MxP mixed-precision benchmark, allowing Intel to lay claim to powering the fastest AI supercomputer in the world with 10.6 AI Exaflops of performance.
It's noteworthy that Aurora is still not fully operational, so the entire system wasn't used for any of the benchmark submissions. Aurora remains beset by numerous issues, including hardware and cooling system failures, operational errors, and network instability, among others (details in the last section below). The continued issues are a bit surprising: the system was first announced nine years ago, the second revision was announced five years ago (the first version was canceled), and the final components were installed about ten months ago.
The system now houses the full complement of 21,248 CPUs and 63,744 GPUs spread across 10,624 compute blades, but Argonne National Laboratory (ANL), which hosts the system, was again unable to submit a full-system Linpack run for the Top500 list.
System | Cores | Rmax (PFlop/s) | Rpeak (PFlop/s) | Power (kW) |
Frontier - HPE Cray EX235a, AMD custom 3rd-Gen EPYC 64C 2GHz, AMD Instinct MI250X | 8,699,904 | 1,206.00 | 1,714.81 | 22,786 |
Aurora - HPE Cray EX - Xeon CPU Max 9470 52C 2.4GHz, Intel Data Center GPU | 9,264,128 | 1,012.00 | 1,980.01 | 38,698 |
Eagle - Microsoft NDv5, Xeon Platinum 8480C 48C 2GHz, NVIDIA H100 | 2,073,600 | 561.20 | 846.84 | ? |
Instead, Aurora placed second with 1.012 exaflops, breaking the exaflop barrier with 87% of the system active (9,234 of the full 10,624 nodes). This solidifies Aurora's second-place position: its first submission six months ago, run on only half the system, also took second place at 585.34 petaflops.
Aurora is supposed to be faster than Frontier in the High-Performance Linpack (HPL) benchmark and thus take the lead in the Top500 upon completion, but it's clear the system will need more tuning to live up to its billing. Frontier is ~19% faster than Aurora with 1.206 exaflops of performance, and, assuming linear scaling, Aurora still wouldn't win after adding the remaining 13% of nodes that weren't used for the Top500 benchmark run.
Intel has ballyhooed Aurora's theoretical peak performance of 2 exaflops (Rpeak), but supercomputers are measured by sustained performance (Rmax). Frontier delivers 70% of its peak as sustained performance in Linpack, while Aurora only delivers 51% of its peak. This should hopefully improve over time, and Aurora would easily take the top spot if it delivered a similar 70% of its peak performance (~1.4 exaflops) during sustained workloads.
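The efficiency and scaling arithmetic above can be checked with a short sketch. It uses only the Rmax and Rpeak figures from the table and the node counts reported for Aurora's run. The function names are illustrative, and the linear projection is the same simplifying assumption made above, not a prediction of what the finished system will deliver.

```python
# Rmax and Rpeak in PFlop/s, copied from the Top500 table in this article.
SYSTEMS = {
    "Frontier": {"rmax": 1206.00, "rpeak": 1714.81},
    "Aurora": {"rmax": 1012.00, "rpeak": 1980.01},
}

# Node counts reported for Aurora's Top500 run.
AURORA_ACTIVE_NODES = 9234
AURORA_TOTAL_NODES = 10624


def efficiency(rmax: float, rpeak: float) -> float:
    """Fraction of theoretical peak delivered as sustained Linpack performance."""
    return rmax / rpeak


def project_linear(rmax: float, active: int, total: int) -> float:
    """Naive full-system estimate: assume Rmax scales linearly with node count."""
    return rmax * total / active


frontier = SYSTEMS["Frontier"]
aurora = SYSTEMS["Aurora"]

frontier_eff = efficiency(frontier["rmax"], frontier["rpeak"])
aurora_eff = efficiency(aurora["rmax"], aurora["rpeak"])

# How far ahead Frontier is today, as a fraction of Aurora's result.
frontier_lead = frontier["rmax"] / aurora["rmax"] - 1

# Aurora with every node active, under the linear-scaling assumption.
aurora_full = project_linear(
    aurora["rmax"], AURORA_ACTIVE_NODES, AURORA_TOTAL_NODES
)

# Aurora's hypothetical Rmax if it matched Frontier's Linpack efficiency.
aurora_at_frontier_eff = aurora["rpeak"] * frontier_eff

print(f"Frontier efficiency: {frontier_eff:.1%}")
print(f"Aurora efficiency:   {aurora_eff:.1%}")
print(f"Frontier lead:       {frontier_lead:.1%}")
print(f"Aurora, all nodes (linear): {aurora_full:.0f} PFlop/s")
print(f"Aurora at Frontier's efficiency: {aurora_at_frontier_eff:.0f} PFlop/s")
```

With the table's figures, the linear projection lands at roughly 1,164 petaflops, still short of Frontier's 1,206, while matching Frontier's efficiency would put Aurora near 1,390 petaflops.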
I asked ANL if Aurora is expected to take the lead over Frontier in the Top500 upon completion. "There's a contractual target number that is faster than Frontier," a representative responded. "So, if we're successful in reaching that number, we'll be faster than Frontier." Notably, the statement says Aurora should beat Frontier, not that it will. We've followed up for a firm confirmation of the actual performance target.
Aurora took first place in the HPL-MxP mixed-precision benchmark with 10.6 exaflops of AI performance, with only 89% of the system active. This benchmark favors lower-precision arithmetic (FP32 and below, down to FP16) over the FP64 used by the Linpack benchmark that determines the Top500 ranking. It therefore better represents AI workloads and a growing number of other real-world applications. FP64 is largely relegated to traditional scientific computing, and some argue it is a shrinking portion of that segment, too.
HPL-MxP is becoming much more important to model real-world performance in the age of AI, but Aurora's position at the top will be hotly contested. There has yet to be a submission from a large-scale Nvidia Grace-Hopper-powered system to the leaderboard. The Alps supercomputer, which now promises 20 exaflops of AI performance, is slated to have all of its 10,752 Grace Hopper processors installed by the end of June 2024, so competition for the leadership spot is on the way.
The High Performance Conjugate Gradients (HPCG) benchmark is also designed to be more representative of real workload applications than Linpack. Aurora performed impressively in this benchmark as well, taking the #3 ranking with a mere 38.5% of the supercomputer active. Aurora also took fifth in the Graph500 benchmark, which is designed to measure performance in data-intensive applications, but ANL didn't specify how much of the system was active for this benchmark run.
Aurora hasn't placed in the Green500, a list of the most power-efficient supercomputers, and that isn't surprising. Aurora will consume up to 60 MW of peak power, slightly more than double Frontier's 29 MW, but we don't know how its final performance will look. It isn't clear if Aurora can beat Frontier in Linpack performance, but even if it does win, it will be by a small amount—certainly not enough to justify the increased power consumption for that particular workload. However, there are plenty of other applications that operate at lower precisions, and power efficiency comparisons will vary by application. Regardless, Nvidia's Grace Hopper systems now comprise five of the top ten systems on the Green500, so it appears that Nvidia has both Intel and AMD beat in the power efficiency department.
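For a rough sense of the efficiency gap, the Rmax and power figures from the Top500 table earlier in the article can be turned into gigaflops per watt. This is a back-of-the-envelope sketch: the function name is illustrative, the power values are the ones listed in that table (not the 60 MW and 29 MW peak figures), and the Green500 applies its own measurement rules, so its official numbers may differ.

```python
# Rmax (PFlop/s) and power (kW), copied from the Top500 table in this article.
TABLE = {
    "Frontier": {"rmax_pflops": 1206.00, "power_kw": 22786},
    "Aurora": {"rmax_pflops": 1012.00, "power_kw": 38698},
}


def gflops_per_watt(rmax_pflops: float, power_kw: float) -> float:
    """Convert PFlop/s and kW into GFlop/s per watt."""
    gflops = rmax_pflops * 1e6  # 1 PFlop/s = 1,000,000 GFlop/s
    watts = power_kw * 1e3      # 1 kW = 1,000 W
    return gflops / watts


frontier_gfw = gflops_per_watt(**TABLE["Frontier"])
aurora_gfw = gflops_per_watt(**TABLE["Aurora"])

print(f"Frontier: {frontier_gfw:.2f} GFlops/W")
print(f"Aurora:   {aurora_gfw:.2f} GFlops/W")
print(f"Frontier advantage: {frontier_gfw / aurora_gfw:.2f}x")
```

On these figures Frontier delivers roughly twice the Linpack performance per watt, which fits with Aurora's absence from the Green500.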
Aurora facing hardware failures, cooling system malfunctions, among other problems
Ten long months passed between the final Aurora hardware being installed and when ANL submitted its benchmarks, raising questions about the source of the continued delay in standing up the full machine. We followed up with Intel on the matter.
"[...] Since we completed the physical delivery of the last compute node at the end of June 2023 (only 10 months ago), we have been working hand-in-hand with Argonne National Laboratory and HPE to fully stabilize and tune the system, including the compute nodes, storage system, fabric, power delivery, and cooling."
"We are also actively working on addressing stability issues like hardware failures, software bugs, cooling system malfunctions, issues with power supply, networking infrastructure stability, environmental factors, and operational errors," the Intel representative said to Tom's Hardware.
Argonne National Laboratory and Intel have yet to provide a firm date for when they expect the system to be fully operational, but we do know that Aurora's window to take the lead in the Top500 is closing. The AMD-powered El Capitan, rated for two exaflops of peak throughput (not sustained), is widely expected to beat Aurora and Frontier in Linpack. Lawrence Livermore National Laboratory submitted early results for sub-scale versions of El Capitan today, and the system is expected to be completely installed by the end of 2024.
Paul Alcorn is the Managing Editor: News and Emerging Tech for Tom's Hardware US. He also writes news and reviews on CPUs, storage, and enterprise hardware.
-
aldaia
"Aurora will consume up to 60 mW of peak power, slightly more than double Frontier's 29 mW, ..."
Be careful with units:
mW = milliWatts
MW = MegaWatts
I highly doubt supercomputers consume milliWatts
-
dalek1234
"hardware failures, software bugs, cooling system malfunctions, issues with power supply, networking infrastructure stability, environmental factors, and operational errors"
And it uses twice the power compared to Frontier. And how does that translate to extra $-for-electricity?
What a clown-show Intel has become
-
jp7189
Wasn't there an article not long ago about Frontier having a high number of failures? I felt the subtext was that Intel would be producing a higher quality result when Aurora launched. Gosh, it almost sounds like building supercomputers is hard.
-
Pierce2623
jp7189 said: Wasn't there an article not long ago about frontier having a high number of failures. I felt the subtext was that Intel would be producing a higher quality result when Aurora launched. Gosh it almost sounds like building supercomputers is hard.
Definitely. It's always hard to get huge amounts of hardware to scale properly over a large area. Hell, Cray doesn't even make its own products anymore, but it's still a viable business just from having experience getting these things operable.
-
PEnns
This could also be an Intel win...it just needs to think outside the box:
Intel-powered Aurora supercomputer fails to dethrone AMD-powered Frontier on Top500 list, again — claims spot as fastest AI supercomputer with Purple-Gold-Aquamarine color!
-
helper800
dalek1234 said: "hardware failures, software bugs, cooling system malfunctions, issues with power supply, networking infrastructure stability, environmental factors, and operational errors" And it uses twice the power compared to Frontier. And how does that translate to extra $-for-electricity? What a clown-show Intel has become
The power consumption for sure is a clown show; however, ALL supercomputers have tons of everything in your quoted section, including the aforementioned AMD-powered Frontier supercomputer.
-
tamalero
PEnns said: This could also be an Intel win...it just needs to think outside the box: Intel-powered Aurora supercomputer fails to dethrone AMD-powered Frontier on Top500 list, again — claims spot as fastest AI supercomputer with Purple-Gold-Aquamarine color!
in what? record power consumption and inefficiency?
helper800 said: The power consumptions for sure is a clown show, however, ALL supercomputers have tons of everything in your quoted section including the aforementioned AMD powered Frontier supercomputer.
wait, like what?
-
helper800
tamalero said: wait, like what?
I know enough to know I don't know very much about supercomputers; however, I do know the below.
There are so many motherboards, RAM sticks, CPUs, hundreds of miles of cords, PSUs, GPUs, storage devices, et cetera that go bad on a hardware level because these supercomputers contain many hundreds if not thousands of them. On a software level, it is a massive clusterF to get all of the hardware to work and scale the task you are doing. You cannot just open a normal application on a supercomputer and expect it to work. And that is another thing: lots of times the software that is run on them is entirely unique or needs to be heavily modified. Development and implementation hell is all we can say to describe this.