The Ampere architecture will power the GeForce RTX 3090, GeForce RTX 3080, GeForce RTX 3070, and other upcoming Nvidia GPUs. It represents the next major upgrade from Team Green and promises a massive leap in performance. Based on current details (the cards come out later this month and in October for the 3070), these GPUs should easily move to the top of our GPU hierarchy, and knock a few of the best graphics cards down a peg or two. Let's get into the details of what we know about the Ampere architecture, including specifications, features, and other performance enhancements.

The Ampere architecture marks an important inflection point for Nvidia. It's the company's first 7nm GPU, or 8nm for the consumer parts. Either way, the process shrink allows for significantly more transistors packed into a smaller area than before. It's also the second generation of consumer ray tracing and third generation deep learning hardware. The smaller process provides a great opportunity for Nvidia to radically improve on the previous RTX 20-series hardware and technologies.

We know the Ampere architecture will find its way into upcoming GeForce RTX 3090, RTX 3080, and RTX 3070 graphics cards, and we expect to see RTX 3060 and RTX 3050 next year. It's also part of the Nvidia A100 data center GPUs, which are a completely separate category of hardware. Here we'll break down both the consumer and data center variations of the Ampere architecture and dig into some of the differences.

The launch of Nvidia's Ampere GPUs feels like a blend of 2016's Pascal and 2018's Turing GPUs. Nvidia CEO Jensen Huang unveiled the data center focused A100 on May 14, giving us our first official taste of what's to come, but the A100 isn't designed for GeForce cards. It's the replacement for the Volta GV100 (which replaced the GP100). The consumer models have a different feature set, powered by separate GPUs like the GA102, GA103, and so on. The consumer cards also use GDDR6X/GDDR6, where the A100 uses HBM2.

(Image credit: Nvidia)

Besides the underlying GPU architecture, Nvidia has revamped the core graphics card design, with a heavy focus on cooling and power. As an Nvidia video notes, "Whenever we talk about GPU performance, it all comes from the more power you can give and can dissipate, the more performance you can get." A reworked cooling solution, fans, and PCB (printed circuit board) are all part of improving the overall performance story of Nvidia's Ampere GPUs. Of course, third party designs are free to deviate from Nvidia's designs.

With the shift from TSMC's 12nm FinFET node to TSMC N7 and Samsung N8, many expected Ampere to deliver better performance at lower power levels. Instead, Nvidia is taking all the extra transistors and efficiency and simply offering more, at least at the top of the product stack. GA100 for example has 54 billion transistors and an 826 mm² die size. That's a massive 156% increase in transistor count from the GV100, while the die size is only 1.3% larger. We expect similar increases for the consumer GPUs.
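The percentages above are easy to sanity-check. Here's a minimal sketch in Python; the GA100 figures come from the text, while the GV100 figures (21.1 billion transistors, 815 mm²) are assumed from Nvidia's public Volta specs and aren't stated in this article.

```python
# Transistor counts (billions) and die areas (mm^2).
# GA100 values are from the article; GV100 values are an assumption
# taken from public Volta specifications.
GA100 = {"transistors_b": 54.0, "die_mm2": 826.0}
GV100 = {"transistors_b": 21.1, "die_mm2": 815.0}


def pct_increase(new: float, old: float) -> float:
    """Percentage increase going from old to new."""
    return (new / old - 1.0) * 100.0


def density_mtr_per_mm2(chip: dict) -> float:
    """Millions of transistors per square millimeter."""
    return chip["transistors_b"] * 1000.0 / chip["die_mm2"]


transistor_gain = pct_increase(GA100["transistors_b"], GV100["transistors_b"])
area_gain = pct_increase(GA100["die_mm2"], GV100["die_mm2"])

print(f"Transistor increase: {transistor_gain:.0f}%")
print(f"Die area increase:   {area_gain:.1f}%")
print(f"GV100 density: {density_mtr_per_mm2(GV100):.1f} MTr/mm^2")
print(f"GA100 density: {density_mtr_per_mm2(GA100):.1f} MTr/mm^2")
```

With those assumed GV100 numbers, nearly all of the extra transistors come from density rather than die area, which is the process shrink at work.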

While 7nm/8nm does allow for better efficiency at the same performance, it also allows for much higher performance at the same power. Nvidia is going one step further and offering even more performance at still higher power levels. The V100 was a 300W part for the data center model, and the new Nvidia A100 pushes that to 400W. We see the same on the consumer models. GeForce RTX 2080 Ti was a 250/260W part, and the Titan RTX was a 280W part. The RTX 3090 is rumored to blast way past that and come with an all-time high TDP for a single GPU of 350W. (That doesn't count the A100, obviously.)

What does that mean to the end users? Besides potentially requiring a power supply upgrade, and the use of a 12-pin power connector on Nvidia's own models, it means a metric truckload of performance. It's the largest single generation jump in performance I can recall seeing from Nvidia. Combined with the architectural updates, which we'll get to in a moment, Nvidia says the RTX 3080 has double the performance of the RTX 2080. And if those workloads include ray tracing and/or DLSS, the gulf might be even wider.

Thankfully, pricing isn't going to be significantly worse than the previous generation GPUs, though that depends on how you want to compare. The GeForce RTX 3090 is set to debut at $1,499, which is a record for a single-GPU GeForce card, though it effectively replaces the Titan family. The RTX 3080 meanwhile costs $699, and the RTX 3070 will launch at $499, keeping the same pricing as the previous generation RTX 2080 Super and RTX 2070 Super. Does the Ampere architecture justify the pricing? We'll have to wait a bit longer to actually test the hardware ourselves, but the specs at least look extremely promising.

The Ampere GA100 dwarfs Nvidia's previous GPUs, with 2.5X as many transistors as GV100. (Image credit: Nvidia)

Nvidia Ampere Architecture Specifications

Along with the GA100 for data center use, Nvidia has at least three other Ampere GPUs slated to launch in 2020. There will potentially be as many as three additional Ampere solutions during the coming year, though those are as yet unconfirmed (and not in this table). Here's the high-level overview.

Nvidia Ampere / RTX 30-Series Specifications

| GPU | GA100 | GA102 | GA103? | GA103?? |
| --- | --- | --- | --- | --- |
| Graphics Card | Nvidia A100 | GeForce RTX 3090 | GeForce RTX 3080 | GeForce RTX 3070 |
| Process (nm) | TSMC N7 | Samsung N8 | Samsung N8 | Samsung N8 |
| Transistors (billion) | 54 | ? | 28 | 28 |
| Die Size (mm^2) | 826 | ~627? | ? | ? |
| SMs | 108 (up to 128) | 82 | 68 | 46 |
| CUDA Cores | 6912 | 10496 | 8704 | 5888 |
| RT Cores | None | 82 | 68 | 46 |
| Tensor Cores | 432 | 328 | 272 | 184 |
| Boost Clock (MHz) | 1410 | 1700 | 1710 | 1730 |
| VRAM Speed (Gbps) | 2.43 | 19.5 (GDDR6X) | 19? (GDDR6X) | 16? (GDDR6) |
| VRAM (GB) | 40 (48 max) | 24 | 10 | 8 |
| Bus Width (bits) | 5120 (6144 max) | 384 | 320 | 256 |
| ROPs | 160 (192 max) | 96 | 80 | 64 |
| TMUs | 864 | 656 | 544 | 368 |
| GFLOPS FP32 | 19492 | 35686 | 29768 | 20372 |
| RT Gigarays | N/A | 22.3? | 18.8? | 12.9? |
| Tensor TFLOPS (FP16) | 312 | 285 | 238 | 163 |
| Bandwidth (GB/s) | 1555 | 936 | 760? | 512? |
| TBP (watts) | 400 (250 PCIe) | 350 | 320 | 220 |
| Launch Date | May 2020 | September 2020 | September 2020 | October 2020 |
| Launch Price | $199K for DGX A100 (with 8xA100) | $1,499 | $699 | $499 |

The biggest and baddest GPU is the A100. It has up to 128 SMs and six HBM2 stacks of 8GB each, of which only 108 SMs and five HBM2 stacks are currently enabled in the Nvidia A100. Future variations could have the full GPU and RAM configuration. However, the GA100 isn't going to be a consumer part, just like the GP100 and GV100 before it were only for data center and workstation use. Without ray tracing hardware, the GA100 isn't remotely viable as a GeForce card, never mind the cost of the massive die, HBM2, and silicon interposer.
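To make the "full chip vs. shipping A100" difference concrete, here's a small Python sketch that derives core count, memory capacity, and bus width from the number of enabled SMs and HBM2 stacks. The 64 CUDA cores per SM follows from the spec table (6912 / 108); the 1024-bit interface per HBM2 stack is an assumption based on the HBM2 standard, not something stated above, and `ga100_variant` is just an illustrative helper name.

```python
CUDA_CORES_PER_SM = 64       # GA100: 6912 cores / 108 SMs in the spec table
HBM2_STACK_GB = 8            # from the text: stacks of 8GB each
HBM2_STACK_BUS_BITS = 1024   # assumption: standard HBM2 interface width


def ga100_variant(sms: int, hbm2_stacks: int) -> dict:
    """Derive headline specs for a GA100 configuration."""
    return {
        "cuda_cores": sms * CUDA_CORES_PER_SM,
        "vram_gb": hbm2_stacks * HBM2_STACK_GB,
        "bus_bits": hbm2_stacks * HBM2_STACK_BUS_BITS,
    }


full_ga100 = ga100_variant(sms=128, hbm2_stacks=6)  # hypothetical full chip
nvidia_a100 = ga100_variant(sms=108, hbm2_stacks=5)  # shipping A100

print("Full GA100:", full_ga100)
print("Nvidia A100:", nvidia_a100)
```

The results line up with the "(max)" values in the spec table, which suggests those entries simply describe a fully enabled chip.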

(Image credit: Nvidia)

Stepping down to the consumer models, Nvidia makes some big changes. We don't have the full skinny just yet, but Nvidia apparently doubled the number of CUDA cores per SM, which results in huge gains in shader performance. With the GA102 and RTX 3090, Nvidia likely axes two of the SM clusters relative to GA100, leaving a maximum configuration of 96 SMs. Of these, only 82 are enabled in the RTX 3090. The HBM2 and silicon interposer are also gone, replaced by 12 GDDR6X chips.

With the doubled CUDA cores per SM, that equates to 10496 CUDA cores, likely with two FP64 capable CUDA cores per SM. Nvidia strips out the remaining FP64 functionality, and in its place adds 2nd generation RT cores. There are also four 3rd gen Tensor cores per SM, each of which has four times the throughput per clock of the previous gen Turing Tensor cores. The boost clock of 1700 MHz gives a potential 35.7 TFLOPS of FP32 compute performance, and the 19.5 Gbps GDDR6X delivers 936 GBps of bandwidth. In case that's not clear, potentially the RTX 3090 will have more than double the performance of the RTX 2080 Ti.
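The FP32 figure falls straight out of the SM count and boost clock. Here's a rough sketch in Python, assuming 128 CUDA cores per SM (10496 / 82) and one fused multiply-add, counted as two floating-point operations, per core per clock. SM counts and boost clocks are taken from the spec table; real-world clocks will vary.

```python
CUDA_CORES_PER_SM = 128        # consumer Ampere: 10496 cores / 82 SMs
FLOPS_PER_CORE_PER_CLOCK = 2   # one FMA counts as a multiply plus an add

# (enabled SMs, boost clock in MHz) from the spec table
CARDS = {
    "RTX 3090": (82, 1700),
    "RTX 3080": (68, 1710),
    "RTX 3070": (46, 1730),
}


def cuda_cores(sms: int) -> int:
    """Total CUDA cores for a given number of enabled SMs."""
    return sms * CUDA_CORES_PER_SM


def fp32_tflops(sms: int, boost_mhz: float) -> float:
    """Theoretical peak FP32 throughput in TFLOPS at the boost clock."""
    flops = cuda_cores(sms) * FLOPS_PER_CORE_PER_CLOCK * boost_mhz * 1e6
    return flops / 1e12


for name, (sms, mhz) in CARDS.items():
    print(f"{name}: {cuda_cores(sms)} cores, {fp32_tflops(sms, mhz):.1f} TFLOPS")
```

These are theoretical peaks only. They say nothing about how often the doubled FP32 units can actually be kept busy in games.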

Worth noting is that there's almost a full cluster of SMs disabled right now. Could there be a future Titan card with a fully enabled GA102? Absolutely. Maybe it will get 21 Gbps memory as well — with a price tag to match, naturally. (Seriously, don't buy a Titan GPU for gaming purposes, not even if you have money to blow. The last 3-5% performance increase is never worth it.)

(Image credit: Nvidia)

The GA103 continues the trimming relative to the GA102, now with six SM clusters and a maximum of 72 SMs. The RTX 3080 uses a nearly-complete GA103, with 68 SMs and 8704 CUDA cores, while we suspect the RTX 3070 uses a harvested chip with only 46 active SMs and 5888 CUDA cores. (It might be GA104, but that's not important.) The 3080 also has 10GB of GDDR6X memory and a 320-bit bus, while the 3070 disables two channels and ends up with 8GB of GDDR6 memory on a 256-bit bus.

Unlike in previous generations, the clocks on all three RTX 30-series GPUs are relatively similar: 1700-1730MHz. In terms of theoretical performance, the RTX 3080 can do 29.8 TFLOPS and has 760 GBps of bandwidth, and Nvidia says it's twice as fast as the outgoing RTX 2080.

The RTX 3070 meanwhile delivers 20.4 TFLOPS and 512 GBps of bandwidth. Nvidia says the RTX 3070 will end up faster than the RTX 2080 Ti as well, though there might be edge cases where the 11GB vs. 8GB VRAM allows the former heavyweight champion to squeak ahead. Again, architectural enhancements will definitely help, so without further ado, let's talk about the Ampere architecture.
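The bandwidth numbers follow from the bus width and per-pin memory speed. As a quick sketch in Python (values from the spec table, where several of the memory speeds are still marked as unconfirmed):

```python
# (bus width in bits, memory speed in Gbps per pin) from the spec table
MEMORY = {
    "Nvidia A100": (5120, 2.43),
    "RTX 3090": (384, 19.5),
    "RTX 3080": (320, 19.0),
    "RTX 3070": (256, 16.0),
}


def bandwidth_gbytes(bus_bits: int, gbps_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s: pins * bits per second / 8 bits per byte."""
    return bus_bits * gbps_per_pin / 8


for name, (bus, speed) in MEMORY.items():
    print(f"{name}: {bandwidth_gbytes(bus, speed):.0f} GB/s")
```

It also shows why the A100 still leads on bandwidth despite far slower memory per pin: the HBM2 bus is more than 13 times wider than the RTX 3090's.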

The A100 is Nvidia's largest GPU ever. The various consumer chips are quite a bit smaller. (Image credit: Nvidia)

Nvidia's GA100 Ampere Architecture

With the GA100 and Nvidia A100 announcement and GeForce RTX 30-series reveals behind us, we now have a good idea of what to expect. Nvidia will continue to have two separate lines of GPUs, one focused on data centers and deep learning, and the other on graphics and gaming. Some of the changes made with the data center GA100 propagate over to the consumer line, but that doesn't extend to Tensor core enhancements for FP64. Here's what we know of the Ampere architecture, starting with the GA100.

First, GA100 packs in a lot of new stuff. At a high level, the GPU has increased from a maximum of 80 SMs / 5120 CUDA cores in GV100 to 128 SMs / 8192 CUDA cores in GA100. That's a 60% increase in core counts, and yet GA100 uses 2.56 times as many transistors. All of those extra transistors go toward architecture enhancements. If you want to dig into the full details, check out Nvidia's A100 Architecture whitepaper, which we'll briefly summarize here.

The Tensor cores in GA100 receive the most significant upgrades. The previous generation GV100 Tensor cores operated on two 4x4 FP16 matrices and could compute a 4x4x4 fused multiply-add (FMA) of the two matrices with a third matrix each cycle. That works out to 128 floating-point operations per cycle per Tensor core, and Nvidia rated the GV100 for 125 TFLOPS peak throughput for FP16. The GA100 Tensor cores by comparison can complete an 8x4x8 FMA matrix operation per clock, which is 256 FMA or 512 FP operations total per Tensor core, four times the throughput. Even with half the number of Tensor cores per SM, it's still twice the performance per SM.
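The Tensor core math can be sketched in a few lines of Python. The matrix shapes are from the text and the A100's 432 Tensor cores and 1410 MHz boost clock are from the spec table. The V100's 640 Tensor cores and 1530 MHz boost clock are assumptions from Nvidia's public Volta specs, as are the per-SM counts (8 for GV100, 4 for GA100).

```python
def flops_per_clock(m: int, n: int, k: int) -> int:
    """FP operations per clock for an m x n x k matrix FMA (2 ops per FMA)."""
    return 2 * m * n * k


def peak_tflops(tensor_cores: int, boost_mhz: float, ops_per_clock: int) -> float:
    """Theoretical peak Tensor throughput in TFLOPS."""
    return tensor_cores * ops_per_clock * boost_mhz * 1e6 / 1e12


gv100_ops = flops_per_clock(4, 4, 4)   # 64 FMA per clock
ga100_ops = flops_per_clock(8, 4, 8)   # 256 FMA per clock

# V100 core count and clock are assumptions (public Volta specs).
v100_tflops = peak_tflops(tensor_cores=640, boost_mhz=1530, ops_per_clock=gv100_ops)
a100_tflops = peak_tflops(tensor_cores=432, boost_mhz=1410, ops_per_clock=ga100_ops)

# Per-SM throughput: 8 Tensor cores per SM on GV100 vs. 4 on GA100 (assumed).
per_sm_ratio = (4 * ga100_ops) / (8 * gv100_ops)

print(f"GV100: {gv100_ops} ops/clock/core, about {v100_tflops:.0f} TFLOPS")
print(f"GA100: {ga100_ops} ops/clock/core, about {a100_tflops:.0f} TFLOPS")
print(f"Per-SM Tensor throughput ratio: {per_sm_ratio:.0f}x")
```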

GA100 also adds support for sparsity on the Tensor cores. The idea is that many deep learning operations end up with a bunch of weighted values that no longer matter, so as the training progresses these values can basically be ignored. With sparsity, the Tensor core throughput is effectively doubled. The Nvidia A100 is rated at 312 TFLOPS for FP16, but 624 TFLOPS with sparsity.

Besides the massive boost in raw throughput, the GA100 Tensor cores also add support for even lower precision INT8, INT4, and binary Tensor operations. INT8 allows for 624 TOPS, 1248 TOPS with sparsity, and INT4 doubles that to 1248 / 2496 TOPS. Binary mode doesn't support sparsity and may be of limited use, but the A100 can do 4992 TOPS in that mode.
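The lower precision rates quoted above all scale neatly from the 312 TFLOPS FP16 figure. One way to read the numbers, and this is an observation rather than anything Nvidia has spelled out here, is that throughput scales with 16 divided by the operand width, with sparsity doubling the result where it's supported. A minimal Python sketch:

```python
FP16_DENSE_TFLOPS = 312  # A100 dense FP16 Tensor rate from the text

# mode: (operand width in bits, supports sparsity)
MODES = {
    "FP16": (16, True),
    "INT8": (8, True),
    "INT4": (4, True),
    "Binary": (1, False),
}


def tensor_rate(mode: str, sparse: bool = False) -> int:
    """Peak A100 Tensor rate in TFLOPS/TOPS for a given precision mode."""
    bits, supports_sparsity = MODES[mode]
    if sparse and not supports_sparsity:
        raise ValueError(f"{mode} mode does not support sparsity")
    rate = FP16_DENSE_TFLOPS * (16 // bits)
    return rate * 2 if sparse else rate


for mode, (_, supports_sparsity) in MODES.items():
    dense = tensor_rate(mode)
    sparse = tensor_rate(mode, sparse=True) if supports_sparsity else "n/a"
    print(f"{mode}: {dense} dense, {sparse} with sparsity")
```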

On the other end of the spectrum, the Tensor cores in A100 also support FP64 instructions. The performance for FP64 is far lower than FP16 at 19.5 TFLOPS. However, for FP64 workloads that's still 2.5 times faster than the GV100's maximum FP64 throughput.

Finally, the A100 adds two new floating point formats. BF16 (Bfloat16) is already used by some other deep learning accelerators (like Google's TPUv4). It uses 16 bits, just like FP16, but shifts things around to use an 8-bit exponent and 7-bit mantissa, matching the 8-bit exponent range of FP32 while lowering the precision. This has been shown to provide better training and model accuracy than the normal FP16 format. The second format is Nvidia's own Tensor Float 32 (TF32), which keeps the 8-bit exponent but extends the mantissa to 10-bit, matching the precision of FP16 with the range of FP32. TF32 Tensor throughput on the A100 is rated at 156 TFLOPS, half the FP16 rate but eight times the 19.5 TFLOPS of standard FP32, so deep learning workloads that would otherwise run in FP32 get a large speedup at FP16-level precision.
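To see what those exponent and mantissa widths mean in practice, here's an illustrative Python sketch that computes the largest normal value and the relative precision of each format, then emulates BF16 and TF32 by chopping low mantissa bits off an FP32 value. The truncation is a simplification for illustration, since real hardware typically rounds rather than truncates. The FP32 and FP16 layouts are the standard IEEE 754 ones.

```python
import struct

# name: (exponent bits, explicit mantissa bits)
FORMATS = {
    "FP32": (8, 23),
    "TF32": (8, 10),
    "FP16": (5, 10),
    "BF16": (8, 7),
}


def max_normal(exp_bits: int, man_bits: int) -> float:
    """Largest finite value for an IEEE-style format with these widths."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2.0 ** -man_bits) * 2.0 ** bias


def epsilon(man_bits: int) -> float:
    """Spacing between 1.0 and the next representable value."""
    return 2.0 ** -man_bits


def truncate_fp32(value: float, man_bits: int) -> float:
    """Keep only the top man_bits of an FP32 mantissa (rounds toward zero)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    drop = 23 - man_bits
    bits &= ~((1 << drop) - 1)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for name, (e, m) in FORMATS.items():
    print(f"{name}: max {max_normal(e, m):.3e}, epsilon {epsilon(m):.2e}")

third = 1 / 3
print("1/3 as BF16-like:", truncate_fp32(third, 7))
print("1/3 as TF32-like:", truncate_fp32(third, 10))
```

BF16 and TF32 both reach roughly the same maximum as FP32, while FP16 tops out at 65504. That limited range is why FP16 training often needs loss scaling and the 8-bit exponent formats generally don't.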

Wow, that's a big chip with a metric buttload of transistors! (Image credit: Nvidia)

That's a lot of Tensor core enhancements, which should tell you where Nvidia's focus is for GA100. Deep learning and supercomputing workloads just got a massive boost in performance. There are some other architectural updates with GA100 as well, which we'll briefly cover here. The SM transistor count has increased by 50-60%, and all of those transistors had to go somewhere.

Multi-Instance GPU (MIG) is one new feature. This allows a single A100 to be partitioned into as many as seven separate virtual GPUs. Each of these virtual GPUs (with Tensor operations running inference workloads) potentially matches the performance of a single GV100, greatly increasing the scale-out opportunities for cloud service providers.

The A100 L1 cache per SM is 50% larger, at 192KB vs. 128KB on the V100. L2 cache has increased even more, from 6MB on the V100 to 40MB on the A100. It also has a new partitioned crossbar structure that delivers 2.3 times the read bandwidth of the GV100 L2 cache. Note that the total HBM2 memory has 'only' increased from 16GB or 32GB on the GV100 to 40GB on the GA100, but the increased L1 and L2 cache helps better optimize the memory performance.

NVLink performance has been nearly doubled as well, from 25.78 Gbps per signal pair in GV100 to 50 Gbps in GA100. A single NVLink in A100 provides 25 GBps in each direction, which is similar to GV100, but with half as many signal pairs per link. The total number of links has also been doubled to 12, giving total NVLink bandwidth of 600 GBps with A100 compared to 300 GBps with V100. PCIe Gen4 support is also present, nearly doubling the bandwidth for x16 connections (from 15.76 GBps to 31.5 GBps).
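Here's how the NVLink totals add up, as a short Python sketch. The link counts and per-pair signaling rates are from the text. The signal pair counts per direction (eight for GV100, four for GA100) are assumptions consistent with the "half as many signal pairs" statement, not figures given above.

```python
def link_gbytes_per_direction(signal_pairs: int, gbits_per_pair: float) -> float:
    """Bandwidth of one NVLink in one direction, in GB/s."""
    return signal_pairs * gbits_per_pair / 8


def total_nvlink_gbytes(links: int, gbytes_per_direction: float) -> float:
    """Aggregate bidirectional NVLink bandwidth, in GB/s."""
    return links * gbytes_per_direction * 2


# Signal pair counts per direction are assumed, not stated in the article.
v100_link = link_gbytes_per_direction(signal_pairs=8, gbits_per_pair=25.78)
a100_link = link_gbytes_per_direction(signal_pairs=4, gbits_per_pair=50.0)

print(f"V100 link: {v100_link:.2f} GB/s per direction")
print(f"A100 link: {a100_link:.2f} GB/s per direction")

# Using the rounded 25 GB/s per direction quoted for both generations:
print("V100 total:", total_nvlink_gbytes(links=6, gbytes_per_direction=25))
print("A100 total:", total_nvlink_gbytes(links=12, gbytes_per_direction=25))
```

In other words, each link delivers about the same bandwidth as before using half the wires, which is what frees up room for twice as many links.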

Finally, the A100 adds new asynchronous copy, asynchronous barrier, and task graph acceleration. Async copy improves memory bandwidth efficiency and reduces register file bandwidth, and can be done in the background while an SM is performing other work. Hardware-accelerated barriers provide more flexibility and performance for CUDA developers, and the task graph acceleration helps optimize work submissions to the GPU.

There are other architectural enhancements, like NVJPG decode that accelerates JPG decode for deep learning training of image-based algorithms. The A100 includes a 5-core hardware JPEG decode engine, which can outperform CPU-based JPEG decoding and alleviate PCIe congestion. Similarly, the A100 adds five NVDEC (Nvidia Decode) units to accelerate the decoding of common video stream formats, which helps the end-to-end throughput for deep learning and inference applications that work with video.

That's it for the GA100 and Nvidia A100 architecture, so now let's get into the Ampere architectural changes for the consumer GeForce RTX cards.

(Image credit: Nvidia)

Nvidia GA102/GA103 Ampere Architecture

There were a ton of changes made with the GA100 relative to the GV100, and the updates on the consumer side of things are just as significant. Many of the above changes to the Tensor cores carry straight over into the consumer models — likely minus the FP64 stuff, naturally. Besides support for the new GDDR6X memory from Micron instead of HBM2, the other major changes are for the ray tracing and CUDA cores.

Nvidia made a lot of noise about ray tracing in 2018 with the Turing architecture and GeForce RTX 20-series GPUs. Two years later ... well, let's be honest: Ray tracing in games hasn't really lived up to its potential. Battlefield V had better reflections, Shadow of the Tomb Raider and Call of Duty got improved shadows, Metro Exodus used RT global illumination, and in every instance performance took a nosedive for a relatively small improvement in visuals. To date, the best example of what ray tracing can do is arguably Control, a game that uses RT effects for reflections, shadows, and diffuse lighting. It looks quite nice, though as you might expect, the performance impact is still large.

How large? For an RTX 2080 Ti and Core i9-9900K, running Control at 1440p and maximum quality without ray tracing delivered performance of 80 fps. (That's in testing that we just completed for this article.) Turn on all the ray tracing extras and performance dropped to 43 fps, about 46% slower, or basically half the performance. That's a painful penalty, though you can mostly mitigate things by enabling DLSS 2.0, which in the quality mode renders at 1707x960 and upscales to 1440p. That brings performance back to 72 fps.
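For reference, here's the arithmetic behind those Control numbers as a small Python sketch. It uses the rounded frame rates quoted above, which give a drop of a little over 46%, and a 2/3 per-axis render scale for DLSS quality mode. That scale is inferred from the 1707x960 figure rather than taken from Nvidia documentation.

```python
def pct_slower(baseline_fps: float, measured_fps: float) -> float:
    """Performance lost relative to the baseline, as a percentage."""
    return (1 - measured_fps / baseline_fps) * 100


def render_resolution(width: int, height: int, scale: float) -> tuple:
    """Internal render resolution for a given per-axis scale factor."""
    return round(width * scale), round(height * scale)


BASELINE_FPS, RT_FPS, RT_DLSS_FPS = 80, 43, 72  # rounded figures from the text

print(f"RT on:        {pct_slower(BASELINE_FPS, RT_FPS):.1f}% slower")
print(f"RT + DLSS on: {pct_slower(BASELINE_FPS, RT_DLSS_FPS):.1f}% slower")

# DLSS quality mode at 1440p; the 2/3 scale is inferred from 1707x960.
w, h = render_resolution(2560, 1440, 2 / 3)
pixel_share = (w * h) / (2560 * 1440)
print(f"DLSS quality renders {w}x{h}, {pixel_share:.0%} of native pixels")
```

Rendering well under half the pixels is where DLSS gets its speedup. The Tensor cores then handle the reconstruction to the output resolution.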

(Image credit: Nvidia)

There are also demonstrations of 'full path tracing,' where the hardware is pushed even further. Take a relatively ancient and low-fi game like Quake II or Minecraft, and add full ray tracing effects for lighting, shadows, reflections, refraction, and more. Also, instead of hundreds of frames per second, you might get 60 fps — that's with an RTX 2070 Super at 1080p with DLSS enabled, at least at maximum quality.

If you think the loss in performance from ray tracing effects is too much and that Nvidia should reverse course, however, you don't know the company very well. The GeForce 256 was the first GPU (according to Nvidia) and introduced hardware transform and lighting calculations to consumer hardware. It was years before most games would come to use those features properly. The first GPUs with shaders also predated common use of the hardware by years, but today virtually every game released makes extensive use of shader technology. Nvidia sees ray tracing as a similar step.

The good news is that ray tracing performance with the Ampere architecture is getting a massive kick in the pants. Nvidia says the RTX 3080 can do 58 TFLOPS of ray tracing calculations, compared to the RTX 2080 Ti's 34 TFLOPS. Or put another way, it's 1.7 times faster at ray tracing. The 2080 Ti was rated to perform up to 11 gigarays per second for ray triangle intersection calculations, so the RTX 3080 can do about 19 gigarays per second, and the RTX 3090 will double (or more) the previous king of the hill.
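The gigarays estimate is a simple scaling exercise, sketched below in Python. It assumes ray/triangle intersection throughput scales in proportion to Nvidia's quoted ray tracing TFLOPS figures, which is our own extrapolation and not an official gigarays rating.

```python
RT_TFLOPS_3080 = 58       # Nvidia's figure for the RTX 3080
RT_TFLOPS_2080_TI = 34    # figure quoted above for the RTX 2080 Ti
GIGARAYS_2080_TI = 11     # Nvidia's Turing-era rating

# Assumption: gigarays scale linearly with the quoted RT TFLOPS.
rt_speedup = RT_TFLOPS_3080 / RT_TFLOPS_2080_TI
estimated_gigarays_3080 = GIGARAYS_2080_TI * rt_speedup

print(f"RT speedup: {rt_speedup:.1f}x")
print(f"Estimated RTX 3080 gigarays: {estimated_gigarays_3080:.1f}")
```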

What does that mean for ray tracing games? We'll find out soon enough, but based on what we're hearing from Nvidia, we'll see more game developers increasing the amount of ray tracing effects. Cyberpunk 2077 will feature ray traced reflections, shadows, ambient occlusion, and more. Potentially, a game like Control might be able to run with all of the ray tracing effects enabled and not show a significant drop in performance, or even gain in performance relative to traditional rendering once you enable DLSS.

(Image credit: CD Projekt Red)

It's not just about ray tracing, of course. Nvidia is also doubling down on DLSS, and thanks to the even more potent Tensor cores, the quality and performance should be even better than before. We're already close to the point where DLSS 2.0 in quality mode looks better than native rendering with TAA or SMAA. It's not difficult to imagine a lot of gamers choosing to enable DLSS to get a healthy performance boost.

Since Ampere has native support for 8K displays, thanks to HDMI 2.1, DLSS becomes even more important. What sort of hardware could even hope to power 8K at anything approaching decent performance levels? That's easy: Turn on DLSS and render at 4K using an RTX 3090 or RTX 3080. Is that really 8K rendering? No, but does it really matter?

8K displays of course remain prohibitively expensive, and if you're sitting on your couch there's little chance you'd actually perceive the difference between 4K and 8K. Plus, if you're like me, with aging eyesight, there's zero chance. But the marketing force is strong in the home theater realm, so we can definitely expect to see a bigger push for 8K TVs going forward — that's how consumer electronics companies are going to try and convince all the 4K HDR TV owners to upgrade.

(Image credit: Nvidia)

Nvidia Ampere Architecture: Ray Tracing Round Two

No doubt there are going to be owners of Nvidia's RTX 20-series GPUs who now feel cheated. If you didn't see our advice a few months back about waiting to buy a new GPU until Ampere launches, seeing the RTX 30-series specs and Ampere architecture probably hurts even more. The thing is, we always knew this day would come. Just like Turing replaced Pascal, which replaced Maxwell, which in turn replaced Kepler, the steady march of progress in the world of GPUs continues.

If on the other hand you've been skeptical of ray tracing in games for the past couple of years, Ampere may finally convince you to take the plunge. Well, after you kick back another month or so to see what AMD's Big Navi brings to the table. The reality is that we're going to see far more games supporting ray tracing in some form, especially with the next generation PlayStation 5 and Xbox Series X consoles slated to arrive this fall. And we'll hopefully have enough hardware muscle behind the games to make ray tracing effects viable.

One thing is certain: Ray tracing isn't going away. It's become a major part of virtually every movie, and while games aren't at the point yet where they're trying to rival 2020 Hollywood movies, they might be able to go after 2000-era Hollywood. Right now, real-time gaming is mostly looking to use just a few rays per pixel (if that) to provide a better approximation of the way light behaves in the real world. Hollywood in contrast is using potentially thousands of rays (or paths) per pixel. GPUs with ray tracing hardware are still in their early days, but if Nvidia (and AMD and Intel) can keep upgrading our GPUs, the gap between games and movies will only decrease.

There's more to come! Nvidia hasn't revealed all of the Ampere architecture changes yet, so we'll be updating this story as we learn more.