Tesla has boosted its in-house AI supercomputer with 1,600 additional Nvidia A100 GPUs. The system had 5,760 A100 GPUs about a year ago, and that count has since risen to 7,360, an increase of about 28%.
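The percentage follows directly from the two GPU counts. As a quick check, here is a minimal Python sketch; the variable names are illustrative and do not come from Tesla.

```python
# GPU counts quoted in the article.
previous_gpus = 5_760
current_gpus = 7_360

added_gpus = current_gpus - previous_gpus
percent_increase = 100 * added_gpus / previous_gpus

print(f"{added_gpus} GPUs added, a {percent_increase:.1f}% increase")
# prints "1600 GPUs added, a 27.8% increase"
```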
According to Tesla Engineering Manager Tim Zaman, this upgrade makes the firm's AI system a top-7 supercomputer worldwide by GPU count.
An Nvidia A100 GPU is a powerful Ampere-architecture part aimed at data centers. It shares its GPU architecture with the GeForce RTX 30 series, which includes some of the best graphics cards currently available. However, the A100 has no close consumer equivalent: it comes with up to 80GB of HBM2e memory on board, offers up to 2 TB/s of memory bandwidth, and draws up to 400W of power. The A100's architecture has also been tuned to accelerate tasks common in AI, data analytics, and high-performance computing (HPC) applications.
The first system Nvidia showed wielding the A100 was the Nvidia DGX A100, which packed in eight A100 GPUs linked via six NVSwitches with 4.8 TB/s of bi-directional bandwidth, for up to 10 petaOPS of INT8 performance, 5 PFLOPS of FP16, 2.5 PFLOPS of TF32, and 156 TFLOPS of FP64 in a single node.
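To get a feel for what one GPU contributes to those node-level figures, the sketch below simply divides three of them by eight. It assumes the quoted peaks split evenly across the GPUs, which is a simplification: they are rounded vendor peak numbers, not measured results, and the dictionary names are illustrative.

```python
GPUS_PER_NODE = 8

# Node-level peak figures as quoted for the DGX A100, in tera-operations per second.
node_peak_tera = {
    "INT8": 10_000.0,  # 10 petaOPS
    "FP16": 5_000.0,   # 5 PFLOPS
    "FP64": 156.0,     # 156 TFLOPS
}

# Assumption: each GPU contributes an equal share of the node's peak.
per_gpu_tera = {fmt: peak / GPUS_PER_NODE for fmt, peak in node_peak_tera.items()}

for fmt, value in per_gpu_tera.items():
    print(f"{fmt}: about {value:g} tera-operations per second per GPU")
```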
That was eight A100 GPUs, and Tesla's AI supercomputer now has 7,360 of them. Tesla hasn't publicly benchmarked its AI supercomputer, but the similarly equipped NERSC Perlmutter, which has 6,144 Nvidia A100 GPUs, achieves 70.87 Linpack petaflops. Using this and data from other A100 GPU supercomputers as performance reference points, HPCwire estimates the Tesla AI supercomputer is capable of achieving about 100 Linpack petaflops.
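For a rough sense of how such an estimate can be approached, the sketch below scales Perlmutter's result by GPU count alone. This is a naive back-of-the-envelope method, not the one used for the published estimate; the function name is hypothetical, and it assumes perfectly linear scaling, which real systems do not achieve.

```python
def scale_linpack(ref_pflops: float, ref_gpus: int, target_gpus: int) -> float:
    """Naively scale a Linpack result by GPU count (assumes perfect linear scaling)."""
    return ref_pflops / ref_gpus * target_gpus

# Reference figures quoted in the article.
perlmutter_pflops = 70.87
perlmutter_gpus = 6_144
tesla_gpus = 7_360

estimate = scale_linpack(perlmutter_pflops, perlmutter_gpus, tesla_gpus)
print(f"Naive estimate: {estimate:.1f} Linpack petaflops")
# prints "Naive estimate: 84.9 Linpack petaflops"
```

Scaling from Perlmutter alone lands near 85 petaflops, below the roughly 100 petaflops estimate, which also draws on other A100 systems. Linpack results depend heavily on interconnect, host CPUs, and tuning, so either number is a ballpark figure.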
Tesla doesn’t intend to continue down the Nvidia GPU architecture path for its in-house AI supercomputers long-term. This machine, top-7 in the world by GPU count, is merely a precursor to the upcoming Dojo supercomputer, which was first announced by Elon Musk back in 2020. A year ago we got a look at the Tesla Dojo D1 chip, which is designed to supplant Nvidia's GPUs for “maximum performance, throughput and bandwidth at every granularity.”
The Tesla Dojo D1 is a custom ASIC (application-specific integrated circuit) built for AI training, and it is one of the first ASICs in this field. Current D1 test chips are manufactured on TSMC N7 and pack in about 50 billion transistors.
More information about the Dojo D1 chip and the Dojo system might be revealed at next week's Hot Chips Symposium: three Tesla presentations are scheduled for next Tuesday, addressing Dojo D1 chip architecture, Dojo and ML training, and enabling AI through system integration.