Building a supercomputer is always challenging, but creating the industry's first exascale-class system means encountering wholly unexpected problems and requires extensive work on both hardware and software. Unfortunately, that appears to be the case with Oak Ridge National Laboratory's Frontier supercomputer, which reportedly can barely last a day without numerous hardware failures.
ORNL's Frontier is the industry's first system designed to deliver up to 1.685 FP64 ExaFLOPS of peak performance using AMD's 64-core EPYC Trento processors, Instinct MI250X compute GPUs, and HPE's Slingshot interconnect, all within 21 MW of power. HPE built the system on its Cray EX architecture, which is designed for scale-out applications, primarily ultra-fast supercomputers.
While the Frontier supercomputer looks exceptionally good on paper, and all of the machine's hardware has been delivered, hardware problems keep the system from coming online and becoming available to researchers who require performance of around 1 FP64 ExaFLOPS.
“We are working through issues in hardware and making sure that we understand (what they are),” said Justin Whitt, program director for the Oak Ridge Leadership Computing Facility (OLCF), in an interview with InsideHPC. “You are going to have failures at this scale. Mean time between failure on a system this size is hours, it’s not days.”
Rumors about potential hardware failures of Frontier have been floating around for quite a while now. Some said that the system experienced problems with the Slingshot interconnect, according to another InsideHPC story. In addition, others indicated that AMD's Instinct MI250X compute GPUs were not as reliable as expected. Keep in mind that the X version, which features more stream processors and higher clocks, is only available to select customers.
Mr. Whitt did not confirm that the system experiences any particular issues with Instinct or Slingshot, but he stressed that the machine suffers from numerous hardware issues.
“A lot of challenges are focused around those [GPUs], but that’s not the majority of the challenges that we are seeing,” the head of OLCF said. “It is a pretty good spread among common culprits of parts failures that have been a big part of it. I don’t think that at this point that we have a lot of concern over the AMD products.”
Oak Ridge National Laboratory's Frontier supercomputer is far from the only system to use HPE's Cray EX architecture with Slingshot interconnects, AMD's EPYC CPUs, and AMD's Instinct compute GPUs. For example, Finland's Lumi supercomputer (Cray EX, EPYC Milan, Instinct MI250X compute GPUs) delivers 550 PetaFLOPS of peak performance and is officially ranked as the world's third most powerful supercomputer. Perhaps the problem lies in the sheer scale of the machine, which uses 60 million parts in total.
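The "hours, not days" figure Whitt cites follows from simple arithmetic: assuming independent, random failures, a system's mean time between failures is roughly the per-part MTBF divided by the part count. A minimal sketch of that estimate (the per-part MTBF of 500 million hours is an assumed illustrative value, not an ORNL figure):

```python
# Back-of-envelope system MTBF estimate (illustrative assumptions, not ORNL data).
# With independent failures, system MTBF ~= per-part MTBF / number of parts.

PART_COUNT = 60_000_000       # total parts in the system (figure from the article)
PART_MTBF_HOURS = 5e8         # assumed per-part MTBF: 500 million hours (hypothetical)

system_mtbf_hours = PART_MTBF_HOURS / PART_COUNT
print(f"Estimated system MTBF: {system_mtbf_hours:.1f} hours")
```

Even with extraordinarily reliable individual components, at 60 million parts the expected interval between failures lands in single-digit hours, which is why large systems are engineered to checkpoint and recover rather than to avoid failures entirely.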
Only time will tell whether the Frontier supercomputer, which was initially promised to come online in 2022, will be available to researchers in 2023, given that it is still not officially deployed.
60 million parts !
If mean time between failure is going to be hours rather than days, are you reaching the point where bigger is not better?
Would a system half the size that could solve the same problem in double the time be a better solution?
Is that how you interpret "I don't think that at this point that we have a lot of concern over the AMD products"? Or maybe you didn't get that far reading the article.