Date: 2024-01-07

Researchers at Oak Ridge National Laboratory trained a large language model (LLM) the size of ChatGPT on the Frontier supercomputer and only needed 3,072 of its 37,888 GPUs to do it. The team published a research paper that details how it pulled off the feat and the challenges it faced along the way.

The Frontier supercomputer is equipped with 9,472 Epyc 7A53 CPUs and 37,888 Radeon Instinct MI250X GPUs. However, the team used only 3,072 GPUs to train an LLM with one trillion parameters and 1,024 to train another LLM with 175 billion parameters.

The paper notes that the key challenge in training such a large LLM is the amount of memory required, which was 14 terabytes at minimum. This meant that multiple MI250X GPUs with 64GB of VRAM each needed to be used, but this introduced a new problem: parallelism. Throwing more GPUs at an LLM requires increasingly better communication to actually use more resources effectively. Otherwise, most or all of that extra GPU horsepower would be wasted.
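The memory figure above sets a hard floor on GPU count before throughput even enters the picture. Here is a rough sketch of that arithmetic in Python, using the article's figures (14 terabytes, 64GB per GPU). It assumes decimal units (1 TB = 1,000 GB) and ignores any per-GPU overhead; the function name `min_gpus_for_memory` is made up for illustration and is not from the paper.

```python
import math


def min_gpus_for_memory(required_gb: float, gb_per_gpu: float) -> int:
    """Smallest number of GPUs whose combined memory covers required_gb."""
    return math.ceil(required_gb / gb_per_gpu)


# Figures quoted in the article: 14 TB minimum, 64 GB of VRAM per GPU.
# Assumption: 1 TB = 1,000 GB, and every byte of VRAM is usable.
REQUIRED_GB = 14 * 1000
GB_PER_GPU = 64

floor_gpus = min_gpus_for_memory(REQUIRED_GB, GB_PER_GPU)
print(floor_gpus)  # 219
```

Under these assumptions, the 3,072 GPUs used for the one-trillion-parameter run come to roughly 14 times the bare memory floor, so the model has to be split across many devices that must then exchange data with each other.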

The research paper dives into the details of exactly how these computer engineers did it, but the short version is that they iterated on frameworks like Megatron-DeepSpeed and FSDP, changing things so that the training program would run more efficiently on Frontier. In the end, the results were pretty impressive: weak scaling efficiency stood at 100%, which means that as the workload grew along with the number of GPUs, the added GPUs were used with essentially no loss of efficiency.

Meanwhile, strong scaling efficiency was slightly lower at 89% for the 175 billion parameter LLM and 87% for the one trillion parameter LLM. Strong scaling refers to increasing processor count without changing the size of the workload, and this tends to be where higher core counts become less useful, according to Amdahl's law. Even 87% is a decent result, given how many GPUs they used.
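For readers who want the two scaling measures and Amdahl's law in concrete form, here is a minimal Python sketch using the standard textbook definitions. The function names and the timings in the example are hypothetical; they are not taken from the paper, which may define or measure efficiency differently.

```python
def strong_scaling_efficiency(t_base: float, n_base: int,
                              t_scaled: float, n_scaled: int) -> float:
    """Fixed workload: ideally, run time shrinks in proportion to GPU count."""
    ideal_time = t_base * n_base / n_scaled
    return ideal_time / t_scaled


def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Workload grows with GPU count: ideally, run time stays constant."""
    return t_base / t_scaled


def amdahl_speedup(parallel_fraction: float, n: int) -> float:
    """Amdahl's law: speedup on n processors when only part of the work
    can be parallelized."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


# Hypothetical timings: doubling GPUs from 1,024 to 2,048 on the same
# workload cuts a 100-second step to 57.5 seconds instead of the ideal 50.
eff = strong_scaling_efficiency(t_base=100.0, n_base=1024,
                                t_scaled=57.5, n_scaled=2048)
print(round(eff, 2))  # 0.87

# Even a tiny serial portion caps the achievable speedup.
print(amdahl_speedup(parallel_fraction=0.999, n=1024))
```

The last line illustrates why strong scaling is the harder test: with 99.9% of the work parallelizable, the speedup on 1,024 processors is still well short of 1,024.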

However, the team noted some issues achieving this efficiency on Frontier, stating "there needs to be more work exploring efficient training performance on AMD GPUs, and the ROCm platform is sparse." As the paper says, most machine learning at this scale is done within Nvidia's CUDA hardware-software ecosystem, which leaves AMD's and Intel's solutions underdeveloped by comparison. Naturally, efforts like these will foster the development of these ecosystems.

Nevertheless, the fastest supercomputer in the world continues to be Frontier, with its all-AMD hardware. In second place stands Aurora with its purely Intel hardware, including GPUs, though at the moment, only half of it has been used for benchmark submissions. Nvidia GPUs power the third fastest supercomputer, Eagle. If AMD and Intel want to keep the rankings this way, the two companies will need to catch up to Nvidia's software solutions.