Building a supercomputer is always challenging, but creating the industry's first exascale-class system means encountering wholly unexpected problems and requires extensive work on both hardware and software. Unfortunately, that appears to be the case with Oak Ridge National Laboratory's Frontier supercomputer, which reportedly can barely last a day without numerous hardware failures.
ORNL's Frontier is the industry's first system designed to deliver up to 1.685 FP64 ExaFLOPS of peak performance using AMD's 64-core EPYC Trento processors, Instinct MI250X compute GPUs, and HPE's Slingshot interconnect, all within 21 MW of power. HPE built the system on its Cray EX architecture, which is designed for scale-out applications, primarily ultra-fast supercomputers.
While the Frontier supercomputer looks exceptionally good on paper, and all of the machine's hardware has been delivered, hardware problems keep the system from coming online and becoming available to researchers who require performance of around 1 FP64 ExaFLOPS.
“We are working through issues in hardware and making sure that we understand (what they are),” said Justin Whitt, program director for the Oak Ridge Leadership Computing Facility (OLCF), in an interview with InsideHPC. “You are going to have failures at this scale. Mean time between failure on a system this size is hours, it’s not days.”
Rumors about potential hardware failures of Frontier have been floating around for quite a while now. Some said that the system experienced problems with the Slingshot interconnect, according to another InsideHPC story. In addition, others indicated that AMD's Instinct MI250X compute GPUs were not as reliable as expected. Keep in mind that the X version, which features more stream processors and higher clocks, is only available to select customers.
Mr. Whitt did not confirm that the system experiences any particular issues with Instinct or Slingshot, but he stressed that the machine suffers from numerous hardware issues.
“A lot of challenges are focused around those [GPUs], but that’s not the majority of the challenges that we are seeing,” the head of OLCF said. “It is a pretty good spread among common culprits of parts failures that have been a big part of it. I don’t think that at this point that we have a lot of concern over the AMD products.”
Oak Ridge National Laboratory's Frontier supercomputer is far from the only system to use HPE's Cray EX architecture with Slingshot interconnects, AMD's EPYC CPUs, and AMD's Instinct compute GPUs. For example, Finland's Lumi supercomputer (Cray EX, EPYC Milan, Instinct MI250X compute GPUs) delivers 550 PetaFLOPS of peak performance and is officially ranked as the world's third most powerful supercomputer. Perhaps the problem lies in the sheer scale of the machine, which uses 60 million parts in total.
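The "hours, not days" figure Whitt cites follows from simple arithmetic: assuming independent, random failures, a system's mean time between failures is roughly the per-part MTBF divided by the part count. A minimal sketch of that estimate (the per-part MTBF of 500 million hours is an assumed illustrative value, not an ORNL figure):

```python
# Back-of-envelope system MTBF estimate (illustrative assumptions, not ORNL data).
# With independent failures, system MTBF ~= per-part MTBF / number of parts.

PART_COUNT = 60_000_000       # total parts in the system (figure from the article)
PART_MTBF_HOURS = 5e8         # assumed per-part MTBF: 500 million hours (hypothetical)

system_mtbf_hours = PART_MTBF_HOURS / PART_COUNT
print(f"Estimated system MTBF: {system_mtbf_hours:.1f} hours")
```

Even with extraordinarily reliable individual components, at 60 million parts the expected interval between failures lands in single-digit hours, which is why large systems are engineered to checkpoint and recover rather than to avoid failures entirely.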
Only time will tell whether the Frontier supercomputer, which was initially promised to come online in 2022, will be available to researchers in 2023, given that it is still not officially deployed.
60 million parts !
If mean time between failure is going to be hours rather than days, are you reaching the point where bigger is not better?
Would a system half the size that could solve the same problem in double the time be a better solution?
Is that how you interpret "I don't think that at this point that we have a lot of concern over the AMD products"? Or maybe you didn't get that far reading the article.