Fujitsu uses Fugaku supercomputer to train LLM: 13 billion parameters

(Image credit: Getty Images)

Although Fujitsu's Fugaku supercomputer is no longer the world's fastest machine on the Top 500 supercomputer list, it remains a very capable system, and the versatility of the A64FX processor allows it to handle a variety of workloads, such as AI. This week Fujitsu released its Fugaku-LLM, a large language model with advanced Japanese language processing capabilities that is designed for both research and commercial applications. 

Fujitsu's Fugaku-LLM was trained on 380 billion tokens across 13,824 nodes of the Fugaku supercomputer, which is based on the A64FX processor supporting FP64, FP32, FP16, and INT8 modes for a variety of AI and conventional supercomputer applications. The training of Fugaku-LLM naturally took advantage of distributed parallel learning techniques optimized for the supercomputer's architecture and the Tofu interconnect D. 
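The data-parallel side of such distributed training can be illustrated with a toy sketch (this is not Fujitsu's actual Megatron-DeepSpeed code): each node computes a gradient on its own shard of the data, and the gradients are then averaged across nodes (the role of an all-reduce over the interconnect). For equal-size shards, the average exactly reproduces the full-batch gradient:

```python
import numpy as np

# Toy data-parallel gradient computation for least-squares regression.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))          # 8 samples, 3 features
y = rng.standard_normal(8)
w = rng.standard_normal(3)               # current model weights

# Full-batch gradient of the mean-squared error, computed on one machine:
full_grad = X.T @ (X @ w - y) / len(y)

# Data parallelism: split the samples across 4 "nodes"; each computes a
# local gradient, then the gradients are averaged (an all-reduce step).
shard_grads = []
for idx in np.array_split(np.arange(len(y)), 4):
    err = X[idx] @ w - y[idx]
    shard_grads.append(X[idx].T @ err / len(idx))
avg_grad = np.mean(shard_grads, axis=0)

print(np.allclose(full_grad, avg_grad))  # True
```

In a real system the averaging happens over the network rather than in a loop, and frameworks layer tensor and pipeline parallelism on top, but the core idea is the same.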

The Fugaku-LLM features 13 billion parameters, which pales in comparison to GPT-3's 175 billion, yet it is the largest LLM ever trained in Japan. Fujitsu says that its 13-billion-parameter LLM does not require vast compute resources for inference, which makes it practical for businesses and researchers in Japan. Approximately 60% of the training data was in Japanese; the remaining 40% was English, mathematics, and code.
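The claim about modest inference requirements is easy to sanity-check with back-of-the-envelope arithmetic: the model's weights alone take roughly one to four bytes per parameter depending on precision (activations and the KV cache add overhead on top):

```python
# Rough memory estimate for holding a model's weights (weights only;
# activations and KV cache are extra).
def weight_memory_gb(params_billions: float, bytes_per_param: int) -> float:
    # 1e9 params per billion and 1e9 bytes per GB cancel out
    return params_billions * bytes_per_param

for precision, nbytes in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    gb = weight_memory_gb(13, nbytes)
    print(f"{precision}: ~{gb:.0f} GB for 13B parameters")
# prints:
# FP32: ~52 GB for 13B parameters
# FP16: ~26 GB for 13B parameters
# INT8: ~13 GB for 13B parameters
```

At FP16, a 13B model fits comfortably in a single high-memory accelerator or a well-equipped server, whereas a 175B-class model does not.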

This extensive Japanese-centric training sets it apart from other Japanese models that were trained primarily on English datasets. As a result, Fugaku-LLM boasts superior proficiency in Japanese, achieving an average score of 5.5 on the Japanese MT-Bench, the top score among openly available models trained with original data from Japan. It particularly excels in humanities and social sciences, achieving an impressive benchmark score of 9.18, according to Fujitsu.

The Fugaku-LLM initiative has been driven by collaborations among leading Japanese institutions including Tokyo Institute of Technology, Tohoku University, Fujitsu Limited, RIKEN, Nagoya University, CyberAgent, and Kotoba Technologies. One reason they collaborated was a shortage of the GPUs typically used to train and run inference on AI models. Another is that the model could be used with Fujitsu's next-generation 150-core Monaka datacenter CPU, which is optimized for both AI and HPC workloads.

Fugaku-LLM is now available for both academic and commercial purposes under specified licensing terms from GitHub and Hugging Face (though Fujitsu did not provide any links). It will also be offered via the Fujitsu Research Portal starting May 10, 2024.

Anton Shilov
Freelance News Writer

Anton Shilov is a Freelance News Writer at Tom’s Hardware US. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.

  • Metal Messiah.
    The training of Fugaku-LLM naturally took advantage of distributed parallel learning techniques optimized for the supercomputer's architecture and the Tofu interconnect D.

    To put it more clearly, the 'Megatron-DeepSpeed' DL framework was actually ported to Fugaku, and the dense matrix multiplication library was accelerated for 'Transformer', so as to maximize the distributed training perf.
    Reply
  • brandonjclark
    Metal Messiah. said:
    To put it more clearly, the 'Megatron-DeepSpeed' DL framework was actually ported to Fugaku, and the dense matrix multiplication library was accelerated for 'Transformer', so as to maximize the distributed training perf.
    Let's make that more clear... ;)

    Megatron, an Nvidia-developed framework that excels at multi-GPU AI acceleration, was used together with DeepSpeed, a Microsoft-written optimization library.

    The framework ensures many GPUs can be used at once to train the model.

    The library itself helps by accelerating the model's performance during training and operation, introducing many cool things like gradient checkpointing and parallelism while still being fairly memory efficient.

    When you add on self-attention tooling like the Transformer architecture, you have a model that can pick out or highlight the more important sections of the input.

    This type of AI acceleration (all of it put together) is very good at Natural Language Processing.


    I think what I've said is true but I'm still learning.
    Reply
  • Flayed
    I wonder how much memory 13 billion parameters uses
    Reply
  • A Stoner
    If they really want to perfect AI, they should be starting small and figuring out how to get the program to 'understand' what it knows... millions of smaller AI projects will absolutely move things forward far faster than a few massive ones in the long run.
    Reply
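The self-attention mechanism discussed in the comments above can be sketched in a few lines. This is a minimal single-head scaled dot-product attention on toy-sized inputs, not the Fugaku-LLM implementation; real models add learned projections, multiple heads, and masking:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: mixes values by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    # numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                   # weighted mix of value vectors

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))          # 4 tokens, 8-dim embeddings (toy sizes)
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape)  # (4, 8)
```

Each output row is a softmax-weighted average of the value rows, which is how the model "highlights" the parts of the input most relevant to each token.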