Faulty Nvidia H100 GPUs and HBM3 memory caused half of the failures during Llama 3 training — one failure every three hours for Meta's 16,384-GPU training cluster

(Image credit: Nvidia)

Meta recently released a study detailing its Llama 3 405B model training run on a cluster containing 16,384 Nvidia H100 80GB GPUs. The training run took place over 54 days and the cluster encountered 419 unexpected component failures during that time, averaging one failure every three hours. In half of the failure cases, GPUs or their onboard HBM3 memory were to blame.  

As the old supercomputing adage goes, the only certainty with large-scale systems is failure. Supercomputers are extremely complex devices that use tens of thousands of processors, hundreds of thousands of other chips, and hundreds of miles of cables. In a sophisticated supercomputer, it's normal for something to break down every few hours, and the main trick for developers is to ensure that the system remains operational regardless of such local breakdowns.

The scale and synchronous nature of 16,384-GPU training make the job prone to failures: if failures aren't mitigated correctly, a single faulty GPU can disrupt the entire training run and force a restart. Despite this, the Llama 3 team maintained over 90% effective training time.
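
To put that "over 90%" figure in perspective, here is a quick back-of-the-envelope calculation using only the numbers quoted in this article (54 days, 466 total interruptions). The implied per-interruption loss is an upper bound derived here for illustration, not a figure Meta reported.

```python
# Rough implication of ">90% effective training time" over the 54-day run,
# derived only from figures quoted in this article.
DAYS = 54
TOTAL_INTERRUPTIONS = 466     # 47 planned + 419 unexpected
EFFECTIVE_FRACTION = 0.90     # "over 90%" -- treat 90% as the floor

lost_hours = DAYS * 24 * (1 - EFFECTIVE_FRACTION)              # at most ~130 hours lost
avg_loss_per_interruption_min = lost_hours * 60 / TOTAL_INTERRUPTIONS

print(f"At most ~{lost_hours:.0f} hours lost across the run")
print(f"=> on average under ~{avg_loss_per_interruption_min:.0f} minutes lost per interruption")
```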

During a 54-day pre-training snapshot, there were 466 job interruptions, with 47 planned and 419 unexpected. Planned interruptions were due to automated maintenance, while unexpected ones mostly stemmed from hardware issues. GPU problems were the largest category, accounting for 58.7% of unexpected interruptions. Only three incidents required significant manual intervention; the rest were managed by automation. 

(Image credit: Meta)

Out of 419 unexpected interruptions, 148 (30.1%) were caused by various GPU failures (including NVLink failures), while 72 (17.2%) were caused by HBM3 memory failures. That is not too surprising given that Nvidia's H100 GPUs consume around 700W and endure a lot of thermal stress. Interestingly, only two CPUs failed in 54 days.

While GPUs are the most important components and also happen to be fragile, they were not the only culprit: the remaining 41.3% of unexpected interruptions were caused by a mix of factors, including software bugs, network cables, and network adapters.

To enhance efficiency, Meta's team reduced job startup and checkpointing times and developed proprietary diagnostic tools. PyTorch's NCCL flight recorder was used extensively to quickly diagnose and resolve hangs and performance issues, particularly those related to NCCLX, Meta's own version of NCCL. The tool captures collective metadata and stack traces, aiding swift problem resolution.

NCCLX also played a crucial role in failure detection and localization, especially for NVLink- and RoCE-related issues. Its integration with PyTorch allowed communication stalls caused by NVLink failures to be monitored and automatically timed out.
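
Meta's internal tooling is not public, but recent PyTorch releases expose a comparable flight recorder and collective-timeout mechanism through environment variables and process-group options. The sketch below is a minimal illustration of that public mechanism, not Meta's setup; the variable names (TORCH_NCCL_TRACE_BUFFER_SIZE, TORCH_NCCL_DUMP_ON_TIMEOUT, TORCH_NCCL_DEBUG_INFO_TEMP_FILE, TORCH_NCCL_ASYNC_ERROR_HANDLING) follow recent PyTorch documentation and may differ between versions, so verify them against your build.

```python
# Minimal sketch: enabling PyTorch's NCCL flight recorder and bounding
# collective timeouts so a single stalled rank can't hang the whole job.
# Environment variable names are assumptions based on recent PyTorch docs.
import os
from datetime import timedelta

import torch
import torch.distributed as dist

# Flight recorder: keep a ring buffer of recent collective metadata/stack traces
# and dump it to a per-rank file when a collective times out.
os.environ.setdefault("TORCH_NCCL_TRACE_BUFFER_SIZE", "2000")
os.environ.setdefault("TORCH_NCCL_DUMP_ON_TIMEOUT", "1")
os.environ.setdefault("TORCH_NCCL_DEBUG_INFO_TEMP_FILE", "/tmp/nccl_trace_rank_")

# Tear the process group down on error/timeout instead of hanging forever,
# e.g. when one GPU stalls because of an NVLink failure.
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")

def main() -> None:
    # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK; the timeout bounds how long
    # a stalled collective can block every other rank.
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Trivial collective; in a real job this would be the gradient all-reduce.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=8 script.py`, a hang past the timeout produces a per-rank flight-recorder dump, which is the kind of collective metadata the article describes being used to localize a failing GPU or link.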

Straggling GPUs, which can slow down thousands of other GPUs, were identified using specialized in-house tools that prioritize potentially problematic communications. This made it possible to detect and resolve stragglers promptly, keeping slowdowns to a minimum and preserving overall training efficiency.
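
Meta has not published these straggler-detection tools, so the snippet below is only a generic illustration of the underlying idea: compare per-rank step times and flag outliers. The threshold and helper name are illustrative assumptions, not values from the Llama 3 report.

```python
# Illustrative straggler check (not Meta's tool): gather every rank's step
# time and flag ranks that are noticeably slower than the median.
import torch
import torch.distributed as dist

def find_stragglers(step_time_s: float, slack: float = 1.15) -> list[int]:
    """Return ranks whose last step took more than `slack` x the median time.

    `slack` = 1.15 (15% slower than the median) is an arbitrary illustrative
    threshold, not a number from Meta's report.
    """
    world = dist.get_world_size()
    times = torch.zeros(world, device="cuda")
    times[dist.get_rank()] = step_time_s
    # A sum-reduce acts as a gather here because each rank filled only its slot.
    dist.all_reduce(times)
    median = times.median()
    return [r for r in range(world) if times[r] > slack * median]

# Usage inside the training loop (sketch):
#   start = time.perf_counter()
#   ... forward / backward / optimizer step ...
#   stragglers = find_stragglers(time.perf_counter() - start)
#   if dist.get_rank() == 0 and stragglers:
#       print(f"straggling ranks this step: {stragglers}")
```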

Environmental factors also affected training performance: mid-day temperature fluctuations caused a 1-2% variation in throughput, as the GPUs' dynamic voltage and frequency scaling responded to the temperature swings. It was a measurable effect, though not a big problem.

Yet another challenge for the Llama 3 405B training team was the simultaneous change in power consumption across tens of thousands of GPUs, which stresses the data center's power grid. These fluctuations, sometimes in the tens of megawatts, stretched the grid's limits, meaning Meta has to ensure its data centers can supply enough power.

Considering the fact that a 16,384 GPU cluster experienced 419 failures in 54 days (7.76 times per 24 hours, or a failure every three hours), we can only wonder how often xAI's cluster containing 100,000 H100 GPUs, a six-fold increase in the number of components that could fail, will experience a failure.
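
As a rough sanity check on those numbers, and on the scaling question they raise, the arithmetic below simply assumes the per-GPU rate of unexpected interruptions observed on Meta's cluster stays constant at xAI's scale; real-world rates will depend on hardware batch, cooling, burn-in, and how failures are counted.

```python
# Back-of-the-envelope extrapolation, assuming the per-GPU rate of unexpected
# interruptions observed on Meta's cluster holds at larger scale.
DAYS = 54
GPUS_META = 16_384
INTERRUPTIONS = 419            # unexpected interruptions in the 54-day window
GPUS_XAI = 100_000             # reported size of xAI's H100 cluster

per_day = INTERRUPTIONS / DAYS                 # ~7.76 per 24 hours
hours_between = 24 / per_day                   # ~3.1 hours between interruptions

scale = GPUS_XAI / GPUS_META                   # ~6.1x more GPUs
per_day_xai = per_day * scale                  # ~47 interruptions per day
minutes_between_xai = 24 * 60 / per_day_xai    # roughly one every ~30 minutes

print(f"Meta: {per_day:.2f}/day, one every {hours_between:.1f} hours")
print(f"xAI at the same rate: {per_day_xai:.0f}/day, one every {minutes_between_xai:.0f} minutes")
```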

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.

  • jkhoward
I’d be pissed if two of my CPUs failed in 54 days.
    Reply
  • Flayed
Assuming they had somewhere in the region of 2,500 CPUs, 2 failures doesn't seem that bad.
    Reply
  • Steve Nord_
    High rel humblebrags and chassis adequacy are good to have. Sooo...looking at soft failures? Ah, 96 page ops report linked at top of article, better than being there.
    Reply
  • Alvar "Miles" Udell
    Considering the fact that a 16,384 GPU cluster experienced 419 failures in 54 days (7.76 times per 24 hours, or a failure every three hours), we can only wonder how often xAI's cluster containing 100,000 H100 GPUs, a six-fold increase in the number of components that could fail, will experience a failure.

    Well if a 16,384 GPU cluster had a 2.56% failure rate with 419 GPU failures, then a 100,000 GPU cluster experiencing the same rate would see 2,560 failures.

    The question is if they're covered by warranty.
    Reply
  • slimsea
    Alvar Miles Udell said:
    Well if a 16,384 GPU cluster had a 2.56% failure rate with 419 GPU failures, then a 100,000 GPU cluster experiencing the same rate would see 2,560 failures.

    The question is if they're covered by warranty.
    It'll be really interesting to see the full bathtub curve over the system lifetime.

These clusters have to run about three years, let's call that 20x 54 days. If the failure rate stayed flat over that time (was this the cluster's first run after commissioning or has it been around a while already?), that would mean 50% of GPUs fail... Ouch! Does not seem very economical, or confidence-inspiring.
    Reply
  • spongiemaster
    Alvar Miles Udell said:
    Well if a 16,384 GPU cluster had a 2.56% failure rate with 419 GPU failures, then a 100,000 GPU cluster experiencing the same rate would see 2,560 failures.

    The question is if they're covered by warranty.
    Your numbers are wrong because the title is misleading, bordering on clickbait, and because you didn't bother to read any of the article. 419 is the total number of unexpected interruptions of any kind, not just the GPU. There were 148 GPU failures (0.90% failure rate) and 72 HBM3 failures (.44%) for a total H100 failure rate of 1.34%.
    Reply
  • Alvar "Miles" Udell
    spongiemaster said:
    Your numbers are wrong because the title is misleading, bordering on clickbait, and because you didn't bother to read any of the article. 419 is the total number of unexpected interruptions of any kind, not just the GPU. There were 148 GPU failures (0.90% failure rate) and 72 HBM3 failures (.44%) for a total H100 failure rate of 1.34%.

    Actually we are both wrong, it's 270, or 1.65%, going by the category numbers.

    Faulty GPU - 148
    Faulty HBM3 - 72
    Faulty SRAM - 19
    Faulty GPU Processor - 17
    Silent data corruption - 6
    Thermal Interface Sensor - 6

    Reply
  • spongiemaster
    Alvar Miles Udell said:
    Actually we are both wrong, it's 270, or 1.65%, going by the category numbers.

    Faulty GPU - 148
    Faulty HBM3 - 72
    Faulty SRAM - 19
    Faulty GPU Processor - 17
    Silent data corruption - 6
    Thermal Interface Sensor - 6

    Not all of those issues would require a GPU replacement. Regardless, the point stands that the article title is misleading and the number is much lower than 419 failures.
    Reply
  • DS426
Even at a failure rate of "only 1.65%," this is on a rather short 54-day computing workload. Normalizing to an annualized failure rate and assuming the rate wouldn't climb (silicon degradation would accelerate the FR but let's ignore that), Meta would be looking at about an 11.15% AFR.

    I guess the upside is that the GPUs would still be under warranty, so then it just becomes an additional labor cost and a tiny slab of productivity lost on training.
    Reply
  • Daniel15
    The link to the study is a temporary CDN URL which has expired, so it doesn't work any more. Can you please update it with the original link?
    Reply