Nvidia Unveils Its Next-Generation 7nm Ampere A100 GPU for Data Centers, and It's Absolutely Massive

Nvidia A100
(Image credit: Nvidia)

The day has finally arrived for Nvidia to take the wraps off of its Ampere architecture. Sort of. While Ampere will inevitably make its way into some of the best graphics cards and find a place on our GPU hierarchy, today's digital GTC announcement is only about the Nvidia A100, a GPU designed primarily for the upcoming wave of exascale supercomputers and AI research. It's the descendant of Nvidia's existing line of Tesla V100 GPUs, and like Volta V100 we don't expect to see A100 silicon in any consumer GPUs. Well, maybe a Titan card—Titan A100?—but I don't even want to think about what such a card would cost, because the A100 is a behemoth of a chip.

(Image credit: Nvidia)

Let's start with what we know at a high level. First, the Nvidia A100 will pack a whopping 54 billion transistors, with a die size of 826mm square. GV100 for reference had 21.1 billion transistors in an 815mm square package, so the A100 is over 2.5 times as many transistors, while only being 1.3% larger. Nvidia basically couldn't make a larger GPU, as the maximum reticle size for current lithography is around 850mm square. The increase in transistor count comes courtesy of TSMC's 7nm FinFET process, which AMD, Apple, and others have been using for a while now. It's a welcome and necessary upgrade to the aging 12nm process behind Volta.

Along with the monster GPU itself are six stacks of HBM2 memory, which Nvidia says provide 40GB of total memory capacity. Since HBM2 stacks come in power of two size increments (i.e., 8GB per stack), we can only assume one of the stacks is for either redundancy of some form and/or future products. We specifically asked and got a vague answer saying, "We're only shipping products with five HBM2 stacks right now." Whether the sixth stack simple doesn't exist on current actual A100 boards, or is a dummy spacer, or is simply disabled ... Nvidia wouldn't say. But at some point, Nvidia plans to have A100 solutions that do support six HBM2 stacks and probably with more SMs enabled (more on that below).

For HPC (high performance computing) nodes in a super computer, Nvidia also upgraded the NVLink to 600GB/s for each GPU, and NVSwitch provides full speed connections to any other node in a server. 8-way Nvidia A100 systems already exist and have been shipped to customers including the Department of Energy. These use six NVSwitch controllers with an aggregate 4.8 TBps of bandwidth.

The Nvidia A100 isn't just a huge GPU, it's the fastest GPU Nvidia has ever created, and then some. The third generation Tensor cores in A100 provide a new hybrid FP32 format called TF32 (Tensor Float 32, which is basically like Google's bfloat16 format, with three extra bits of significant digits) that aims to provide a balanced option with the precision of FP16 and exponent sizes of FP32. For workloads that use TF32, the A100 can provide 312 TFLOPS of compute power in a single chip. That's up to 20X faster than the V100's 15.7 TFLOPS of FP32 performance, but that's not an entirely fair comparison since TF32 isn't quite the same as FP32.

(Image credit: Nvidia)

Elsewhere, the A100 delivers peak FP64 performance of 19.5 TFLOPS. That's more FP64 performance than the V100's FP32, and about 2.5 times the FP64 performance. However, Nvidia says that the third generation Tensor cores support FP64, which is where the 2.5X increase in performance comes from. The Tensor cores aren't be the same instruction set as FP64 CUDA cores, so if you're looking for CUDA FP64 performance it's 'only' 9.7 TFLOPS.

Nvidia has now posted the specs of the A100 chip, and the details are ... surprising. First, the new Tensor cores are a significant change from the previous generation. There are only four Tensor cores per SM (compared to 8 on Volta and Turing), but those four cores deliver twice the performance of the previous generation's eight cores. The image on Nvidia's site showing the new SM is called "New GA100 SM with Uber Tensor Core.png" which pretty much says it all.

Something else to note: The full A100 chip has up to 128 SMs and 8192 FP32 CUDA cores, but only 108 SMs are enabled in the initial version. That says a lot about yields on such a massive chip, as well as TSMC's N7P node. Nvidia is probably saving up 'good' die for future products, but the initial A100 will only have about 85% of the SMs enabled (compared to 95% of SMs being enabled on the V100). There must be a massive amount of redundancy built into the critical paths to allow such a chip to exist and function at all.

The full A100 GPU has 128 SMs and up to 8192 CUDA cores, but the Nvidia A100 GPU only enables 108 SMs for now. (Image credit: Nvidia)

New GA100 SM with Uber Tensor Core, plus FP64 cores but no RT cores. (Image credit: Nvidia)

One interesting tidbit that never came up in the announcement: ray tracing. The Volta V100 also skipped ray tracing, partly because it came before Turing but also because it was focused on delivering compute above all else. Given the above block diagrams, it appears the A100 follows a similar path and leaves RT cores for other Ampere GPUs. It's yet another clear indication that A100 isn't for consumers and probably won't even go into a Titan card. Or maybe we'll get a Titan A100 for entry-level compute and deep learning or whatever.

Looking at the details of the SM, we get a similar design to Volta V100: 64 FP32 cores and 64 INT32 cores, with the four large Tensor clusters. With 54 SMs enabled, the A100 has a total of 6912 FP32 CUDA cores, and the same for INT32, with half as many (3456) FP64 cores. But as noted above, the Tensor cores can also do FP64 calculations, yielding the same 19.5 TFLOPS of compute for both FP64 and FP32. Interesting to note is that FP16 performance is up to 78 TFLOPS, but BF16 is half that.

Here's the full specs table:

Swipe to scroll horizontally
Data Center GPUNVIDIA Tesla P100NVIDIA Tesla V100NVIDIA A100
GPU CodenameGP100GV100GA100
GPU ArchitectureNVIDIA PascalNVIDIA VoltaNVIDIA Ampere
GPU Board Form Factor SXMSXM2SXM4
SMs5680108
TPCs284054
FP32 Cores / SM646464
FP32 Cores / GPU358451206912
FP64 Cores / SM323232
FP64 Cores / GPU179225603456
INT32 Cores / SMNA6464
INT32 Cores / GPUNA51206912
Tensor Cores / SMNA84
Tensor Cores / GPUNA640432
GPU Boost Clock1480 MHz1530 MHz1410 MHz
Peak FP16 Tensor TFLOPS with FP16 AccumulateNA125312/624
Peak FP16 Tensor TFLOPS with FP32 AccumulateNA125312/624
Peak BF16 Tensor TFLOPS with FP32 AccumulateNANA312/624
Peak TF32 Tensor TFLOPSNANA156/312
Peak FP64 Tensor TFLOPSNANA19.5
Peak INT8 Tensor TOPSNANA624/1248
Peak INT4 Tensor TOPSNANA1248/2496
Peak FP16 TFLOPS21.231.478
Peak BF16 TFLOPSNANA39
Peak FP32 TFLOPS10.615.719.5
Peak FP64 TFLOPS5.37.89.7
Peak INT32 TOPSNA15.719.5
Texture Units224320432
Memory Interface4096-bit HBM24096-bit HBM25120-bit HBM2
Memory Size16 GB32 GB / 16 GB40 GB
Memory Data Rate703 MHz DDR877.5 MHz DDR1215 MHz DDR
Memory Bandwidth720 GB/sec900 GB/sec1.6 TB/sec
L2 Cache Size4096 KB6144 KB40960 KB
Shared Memory Size / SM64 KBConfigurable up to 96 KBConfigurable up to 164  KB
Register File Size / SM256 KB256 KB256 KB
Register File Size / GPU14336 KB20480 KB27648 KB
TDP300 Watts300 Watts400 Watts
Transistors15.3 billion21.1 billion54.2 billion
GPU Die Size610 mm²815 mm²826 mm²
TSMC Manufacturing Process16 nm FinFET+12 nm FFN7 nm N7

I've seen a few people prior to this announcement suggesting Nvidia will have Ampere GPUs running at upward of 2.0 GHz, maybe even as high as 2.5 GHz. Clearly, A100 isn't going that route, with a boost clock of 1410 MHz. All of the performance data discussed is also based on the boost clock, which the A100 may or may not maintain under heavy workloads. Nvidia tends to be conservative on its consumer boost clocks, but in the data center things aren't quite the same.

Power use for a single Nvidia A100 is 400W this round, 33% higher than the P100 and V100 iterations. The HBM2 memory is also clocked at 1215 MHz this round, which is a healthy 38% increase over the V100 and currently represents the fastest HBM2 implementation we've seen. As noted above, while the images show six HBM2 stacks, Nvidia's specs table says the HBM2 has a 5120-bit interface, which is five 1024-bit interfaces. Is an entire HBM2 stack disabled due to yields, or is it just a dummy chip for now? We suspect the former, but Nvidia did not confirm this, saying only that current products are shipping with five HBM2 stacks.

Let's reiterate that this GPU is not going into GeForce any time soon. Asked about the consumer line, Nvidia CEO Jensen Huang noted that Nvidia doesn't use HBM2 in consumer parts. We can safely assume there will be GDDR6-equipped Ampere GPUs at some point, but they won't be using the A100 silicon. Which, again, is fair enough. 826mm square and 40GB HBM2 with 1.6 TBps of bandwidth isn't something our consumer PCs really need right now, never mind the FP64 and TF32 clusters or the lack of RT cores.

[Obi-wan waves his hand: "These aren't the GPUs you're looking for."]

(Image credit: Nvidia)

The Nvidia A100 should definitely deliver the goods for the target market, though. Along with other architectural enhancements including sparse matrix optimizations where it's (up to?) twice as fast as V100, the A100 has multi-GPU instancing that allows it to be partitioned into seven separate instances. For scale-out applications, then, a single A100 can have 7X the instancing performance of a V100 GPU.

Of course, supercomputers wouldn't just want a single A100 card, and while those will exist (for inference and instancing applications, among others), the real power comes in the form of the new Nvidia DGX A100. Packing eight A100 GPUs linked via six NVSwitch with 4.8 TBps of bi-directional bandwidth, it can in effect behave as a single massive GPU with the right workloads. The eight GPUs can also provide 10 POPS (PetaOPS) of INT8 performance, 5 PFLOPS of FP16, 2.5 TFLOPS of TF32, and 156 TFLOPS of FP64 in a single node. And all this can be yours, for only $199,000—well, it could be at some point, as the waiting list is probably already quite long.

(Image credit: Nvidia)

Need even more performance? Say hello to the Nvidia DGX A100 Superpod. Housing 140 DGX A100 systems, each with eight A100 GPUs (1,120 total A100 GPUs), the A100 Superpod was built in under three weeks and delivers 700 PFLOPS of AI performance. Nvidia has added four such superpods to its Saturn V supercomputer, which previously had 1,800 DGX-1 systems with 1.8 ExaFLOPS of compute. Adding just 560 DGX A100 systems tacks on another 2.8 ExaFLOPS, for a total of 4.6 ExaFLOPS.

All of this is great news for supercomputer and HPC use, but it leaves us with very little information about Nvidia's next generation Ampere GPUs for consumer cards. We know that Nvidia crammed in 2.5 times as many transistors in roughly the same die space, which means it could certainly do the same for consumer GPUs. Remove some of the FP64 and deep learning functionality and focus on ray tracing and graphics cores, and the resulting GPU should be very potent. We'll find out just how potent in the coming days.

You can view all eight parts of Jensen's keynote here:

Part 1 - Nvidia GTC keynote, introduction to data center computing
Part 2 - Nvidia GTC keynote on RTX graphics and DLSS, and Omniverse
Part 3 - Nvidia GTC keynote on GPU accelerated Spark 3.0
Part 4 - Nvidia GTC keynote on Merlin and recommender system
Part 5 - Nvidia GTC keynote on Jarvis and conversational AI
Part 6 - Nvidia GTC keynote on the A100 and Ampere architecture (yeah, this is the one you want)
Part 7 - Nvidia GTC keynote on EGX A100 and Isaac robotics platform
Part 8 - Nvidia GTC keynote on Orin and autonomous vehicles
Part 9 - Nvidia GTC keynote conclusion

Jarred Walton

Jarred Walton is a senior editor at Tom's Hardware focusing on everything GPU. He has been working as a tech journalist since 2004, writing for AnandTech, Maximum PC, and PC Gamer. From the first S3 Virge '3D decelerators' to today's GPUs, Jarred keeps up with all the latest graphics trends and is the one to ask about game performance.

  • tummybunny
    But can it run Crysis remastered?
    Reply
  • Jimbojan
    You may need to bring the Sun to power the system
    Reply
  • InvalidError
    One step closer to building the Jupiter Brain!
    Reply
  • bit_user
    @JarredWaltonGPU , thanks for the coverage, as always.

    ...but,
    For workloads that use TF32, the A100 can provide 312 TFLOPS of compute power in a single chip. That's up to 20X faster than the V100's 15.7 TFLOPS of FP32 performance, but that's not an entirely fair comparison since TF32 isn't quite the same as FP32.
    Right, a better comparison would be with the V100's fp16 tensor TFLOPS. Comparing it with general-purpose FP32 TFLOPS is apples vs. oranges, whereas TF32 vs fp16 tensor TFLOPS is maybe apples vs. pears.

    Nvidia says elsewhere that the third generation Tensor cores support FP64, which is where the 2.5X increase in performance comes from, but details are a scarce. Assuming the FP64 figure allows similar scaling as previous Nvidia GPUs, the A100 will have twice the FP32 throughput and four times the FP16 performance
    Um... don't assume that! Tensor cores are very specialized and support only a few, specific operations. You want to look at their spec for general-purpose fp64 TFLOPS.

    It seems likely the A100 will follow a similar path and leave RT cores for other Ampere GPUs.
    Agreed. Ray tracing is a different market segment. Even talking about photorealistic rendering for movies and such, it would make sense for them to take a page out of the Turing playbook and use the gaming GPUs for that.

    the A100 can be has multi-GPU instancing that allows it to be partitioned into seven separate instances.
    rats. You almost said "can has". I'd have enjoyed that.
    ; )
    Reply
  • bit_user
    InvalidError said:
    One step closer to building the Jupiter Brain!
    I think the bigger problems to solve are going to be power and cooling.

    On a related note, I'm reading it burns 400 W!
    Reply
  • JarredWaltonGPU
    bit_user said:
    @JarredWaltonGPU , thanks for the coverage, as always.

    ...but,

    Right, a better comparison would be with the V100's fp16 tensor TFLOPS. Comparing it with general-purpose FP32 TFLOPS is apples vs. oranges, whereas TF32 vs fp16 tensor TFLOPS is maybe apples vs. pears.


    Um... don't assume that! Tensor cores are very specialized and support only a few, specific operations. You want to look at their spec for general-purpose fp64 TFLOPS.


    Agreed. Ray tracing is a different market segment. Even talking about photorealistic rendering for movies and such, it would make sense for them to take a page out of the Turing playbook and use the gaming GPUs for that.


    rats. You almost said "can has". I'd have enjoyed that.
    ; )
    Nvidia has posted additional details, which I'm currently parsing and updating the article. There's a LOT to digest this morning!
    Reply
  • JarredWaltonGPU
    bit_user said:
    I think the bigger problems to solve are going to be power and cooling.

    On a related note, I'm reading it burns 400 W!
    Yes. Official specs were posted after the initial writeup, so I've fixed some things and added more details. RT cores and RTX are definitely not present.
    Reply
  • jeremyj_83
    @JarredWaltonGPU
    Isn't ECC a native feature on HBM2? If so then why would you need those extra stacks if not for redundancy?
    Reply
  • JarredWaltonGPU
    jeremyj_83 said:
    @JarredWaltonGPUIsn't ECC a native feature on HBM2? If so then why would you need those extra stacks if not for redundancy?
    I'm not positive on this, but ECC is an optional feature for HBM2 I think. I've found reports saying Samsung's chips have it... but is it always on? That's unclear. I have to think the sixth chip is simply for redundancy, which is still pretty nuts. Like, yields are so bad that we're going to stuff on six HBM2 stacks and hope the interfaces and everything else for at least five of them work properly? I asked for clarification on the six stacks -- TWICE -- but my question somehow got missed.
    Reply
  • jeremyj_83
    JarredWaltonGPU said:
    I'm not positive on this, but ECC is an optional feature for HBM2 I think. I've found reports saying Samsung's chips have it... but is it always on? That's unclear. I have to think the sixth chip is simply for redundancy, which is still pretty nuts. Like, yields are so bad that we're going to stuff on six HBM2 stacks and hope the interfaces and everything else for at least five of them work properly? I asked for clarification on the six stacks -- TWICE -- but my question somehow got missed.
    "And, of course, as a native feature of HBM2, it offers ECC memory protection."
    https://www.anandtech.com/show/15777/amd-reveals-radeon-pro-vii (bottom of the 6th paragraph)
    I just remembered where I had read the ECC reference to HMB2.

    Crazy idea but instead of redundancy could they use a stack as a large L3/4 cache of some sorts?
    Reply