World's Fastest Supercomputer Can't Run a Day Without Failure

(Image credit: OLCF)

Building a supercomputer is always challenging, but creating the industry’s first exascale-class system means running into wholly unexpected problems that demand a lot of work on both hardware and software. That appears to be the case with Oak Ridge National Laboratory’s Frontier supercomputer, which can barely last a day without numerous hardware failures.

ORNL’s Frontier is the industry’s first system designed to deliver up to 1.685 FP64 ExaFLOPS of peak performance, using AMD’s 64-core EPYC Trento processors, Instinct MI250X compute GPUs, and HPE’s Slingshot interconnect at 21 MW of power. HPE built the system on its Cray EX architecture, which is designed for scale-out applications, primarily ultra-fast supercomputers.

While the Frontier supercomputer looks exceptionally good on paper and its hardware has been delivered, hardware problems appear to keep the machine from coming online and becoming available to researchers who need performance of around 1 FP64 ExaFLOPS.

“We are working through issues in hardware and making sure that we understand (what they are),” said Justin Whitt, program director for the Oak Ridge Leadership Computing Facility (OLCF), in an interview with InsideHPC. “You are going to have failures at this scale. Mean time between failure on a system this size is hours, it’s not days.”

Rumors about potential hardware failures of Frontier have been floating around for quite a while now. Some said that the system experienced problems with the Slingshot interconnect, according to another InsideHPC story. Others indicated that AMD’s Instinct MI250X compute GPUs were not as reliable as expected this year. Remember that the X version, with a higher number of stream processors and higher clocks, is only available to select customers.

Mr. Whitt did not confirm that the system is experiencing any particular issues with Instinct or Slingshot, but he stressed that the machine suffers from numerous hardware issues.

“A lot of challenges are focused around those [GPUs], but that’s not the majority of the challenges that we are seeing,” the head of OLCF said. “It is a pretty good spread among common culprits of parts failures that have been a big part of it. I don’t think that at this point that we have a lot of concern over the AMD products.”

Oak Ridge National Laboratory’s Frontier supercomputer is far from the only system to use HPE’s Cray EX architecture with Slingshot interconnects, AMD’s EPYC CPUs, and AMD’s Instinct compute GPUs. For example, Finland’s Lumi supercomputer (Cray EX, EPYC Milan, Instinct MI250X compute GPUs) delivers 550 PetaFLOPS of peak performance and is officially ranked as the world’s third most powerful supercomputer. Perhaps the problem lies in the scale of the machine, which uses 60 million parts in total.
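To see why a part count of this size pushes mean time between failures down to hours, here is a rough back-of-the-envelope sketch. It is not OLCF's reliability model. It assumes that all parts are identical, that they fail independently at a constant rate (exponential lifetimes), and that any single part failure counts as a system failure. The 6-hour target and the function names are hypothetical, chosen only for illustration; the 60 million figure is the part count cited above.

```python
import math

HOURS_PER_YEAR = 24 * 365.25  # average year length in hours


def system_mtbf_hours(part_mtbf_hours: float, n_parts: int) -> float:
    """MTBF of a system in which any one of n_parts failing counts as a failure.

    Assumes identical, independent parts with constant failure rates,
    so the failure rates simply add up.
    """
    return part_mtbf_hours / n_parts


def required_part_mtbf_hours(target_system_mtbf_hours: float, n_parts: int) -> float:
    """Per-part MTBF needed to reach a target system MTBF (same assumptions)."""
    return target_system_mtbf_hours * n_parts


def prob_no_failure(run_hours: float, system_mtbf: float) -> float:
    """Probability that a run of run_hours sees no failure at all."""
    return math.exp(-run_hours / system_mtbf)


if __name__ == "__main__":
    n_parts = 60_000_000      # part count cited in the article
    target_hours = 6.0        # hypothetical system MTBF "in hours"

    per_part = required_part_mtbf_hours(target_hours, n_parts)
    print(f"Per-part MTBF needed: {per_part:,.0f} hours "
          f"(about {per_part / HOURS_PER_YEAR:,.0f} years)")
    print(f"Chance of a failure-free 24-hour run: "
          f"{prob_no_failure(24.0, target_hours):.1%}")
```

Under these simplifying assumptions, even parts that individually fail only once in tens of thousands of years add up to a system-level failure every few hours, which is consistent with Mr. Whitt's remark that failures at this scale are expected.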

Only time will tell whether the Frontier supercomputer that was initially promised to come online in 2022 will be available to researchers starting in 2023, given that it is still not officially deployed.

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.

  • Fix_that_Glitch
    Looks like they need to call up Finland. It can't really be the fastest supercomputer if it doesn't run. And can it run Crysis?
  • ManDaddio
    Wait I thought AMD can't do anything wrong....
  • King_V
    ManDaddio said:
    Wait I thought AMD can't do anything wrong....
    Are we trolling? Because, unless you're willing, and able, to back that statement, and show us that there's any serious claim being made that "AMD can't do anything wrong," then you're trolling.
  • 9cento
    1v1 me on Prime95 Small FFT 24h, amateurs
  • MoxNix
    Must be running Windows...
  • Alvar "Miles" Udell
    Must be AMD's famously terrible drivers...
  • UWguy
    Should have used Intel.
  • Co BIY
    At this scale, with the huge number of parts, there will always be a problem somewhere in the system, but the design should be able to work around them.

    60 million parts!

    If mean time between failure is going to be hours rather than days, are you reaching the point where bigger is not better?

    Would a system half the size that could solve the same problem in double the time be a better solution?
  • SunMaster
    ManDaddio said:
    Wait I thought AMD can't do anything wrong....

    Is that how you interpret “I don’t think that at this point that we have a lot of concern over the AMD products”? Or maybe you didn’t get that far reading the article.
  • missingxtension
    I am sure that operations per hardware failure are better than on most PCs.