Nvidia revealed the RTX 30-series Ampere architecture on September 1, celebrating the 21st anniversary of its first GPU, the GeForce 256. The features and specifications certainly look impressive, and you can read more in our GeForce RTX 3090, GeForce RTX 3080, and GeForce RTX 3070 breakdowns. However, we ended up with quite a few questions, and Nvidia provided plenty of additional information that we're summarizing here. We'll be adding much of this to our main Ampere architecture hub, so this piece covers just the new details.
First, let's talk about the Ampere streaming multiprocessor (SM). The biggest change for gaming is likely the doubling of FP32 performance. Each SM now has two FP32 clusters, providing for up to 128 FMA (fused multiply-add) operations per cycle. Half of the cores can handle either FP32 or INT work, while the other half are FP32 only. That might sound like a potential problem, but generally speaking (particularly for gaming workloads) FP32 is the most important, INT less so. It's a balanced approach to boost overall performance without bloating the core too much.
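To put rough numbers on that tradeoff, here's a back-of-the-envelope sketch in Python. It's a simplified model of our own, not anything Nvidia supplied: it assumes 64 lanes per path (half of the 128 FMA per cycle quoted above), perfect scheduling, and no memory stalls. The function names, the dedicated-INT comparison design, and the example SM count and clock speed in the usage note are all assumptions for illustration.

```python
# Assumption: the 128 FMA/clock per SM splits into two paths of 64 lanes each.
LANES_PER_PATH = 64


def peak_fp32_tflops(sm_count, clock_ghz, fma_per_clock_per_sm=128):
    """Theoretical peak FP32 TFLOPS, counting each FMA as two operations."""
    return sm_count * fma_per_clock_per_sm * 2 * clock_ghz / 1000.0


def cycles_shared_path(fp_ops, int_ops, lanes=LANES_PER_PATH):
    """One FP32-only path plus one path that runs either FP32 or INT.

    INT work can only use the shared path, so it sets a floor of
    int_ops / lanes cycles. Otherwise all work spreads over both paths.
    """
    return max((fp_ops + int_ops) / (2 * lanes), int_ops / lanes)


def cycles_dedicated_paths(fp_ops, int_ops, lanes=LANES_PER_PATH):
    """Hypothetical comparison: one FP32-only path and one INT-only path."""
    return max(fp_ops / lanes, int_ops / lanes)


def speedup(fp_ops, int_ops):
    """How much faster the shared-path layout finishes the same mix."""
    return cycles_dedicated_paths(fp_ops, int_ops) / cycles_shared_path(fp_ops, int_ops)
```

In this model, a pure FP32 workload (`speedup(128, 0)`) finishes in half the cycles, while an even FP32/INT split (`speedup(64, 64)`) sees no gain at all, since the INT work monopolizes the shared path. Real games land somewhere in between. For the headline number, `peak_fp32_tflops(68, 1.71)` works out to just under 30 TFLOPS, assuming an RTX 3080-class part with 68 SMs at a 1.71 GHz boost clock (figures that aren't from Nvidia's briefing).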
To help feed the beast (TM!), the data path was doubled, along with L1 bandwidth. L1 capacity is also 33% larger, with twice the partition size.
Another change is that Ampere can simultaneously run work through the CUDA cores, RT cores, and Tensor cores. This allows a game to run DLSS to upscale one frame while at the same time doing the CUDA and RT calculations for the next frame, cutting down on rendering time and improving overall performance.
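A toy timing model shows why that overlap matters. This is only a sketch under generous assumptions: the render stage (CUDA plus RT work) and the upscale stage (DLSS on the Tensor cores) are treated as two pipeline stages that never compete for bandwidth, power, or anything else. The function name and the millisecond figures in the usage note are made up for illustration.

```python
def frame_times_ms(render_ms, upscale_ms, frames):
    """Return (serial_total, overlapped_total) in milliseconds.

    Serial: each frame renders, then upscales, before the next one starts.
    Overlapped: frame N upscales while frame N+1 renders, so the steady-state
    cost per frame is the slower stage, plus the faster stage paid once
    to fill or drain the two-stage pipeline.
    """
    serial = frames * (render_ms + upscale_ms)
    overlapped = frames * max(render_ms, upscale_ms) + min(render_ms, upscale_ms)
    return serial, overlapped
```

With a hypothetical 10 ms render and 2 ms upscale, `frame_times_ms(10.0, 2.0, 100)` gives 1200 ms serially versus 1002 ms overlapped. In practice the units share resources, so actual gains would be smaller than this idealized case.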
For the RT cores, Ampere also added functionality to interpolate triangle positions. This is particularly important for things like motion blur, where triangles move during the frame and each ray needs to test a triangle at its position for a specific point in time. I'm still not a huge fan of motion blur in games, even if it might be more realistic looking, but whatever. This change potentially speeds up ray traversal by 8X, so it's an important addition.
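Conceptually, interpolating a triangle's position looks something like the sketch below. Nvidia didn't detail how the hardware does it, so plain linear interpolation between a shutter-open and a shutter-close position is our assumption, and the function names are hypothetical.

```python
def lerp(a, b, t):
    """Linearly interpolate between two points given as (x, y, z) tuples."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))


def triangle_at_time(tri_start, tri_end, t):
    """Return the triangle's three vertices at ray time t.

    tri_start holds the vertices at shutter open (t = 0.0) and tri_end holds
    them at shutter close (t = 1.0). Each ray carries its own t, so different
    rays can see the same triangle at different positions.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be between 0.0 and 1.0")
    return tuple(lerp(v0, v1, t) for v0, v1 in zip(tri_start, tri_end))
```

For example, a triangle that slides two units along the X axis over the frame sits one unit along at `t = 0.5`. Doing this per ray in software is costly, which is why moving it into the RT cores can pay off.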
That's it for the truly new information. Much of the remainder consists of previously known details, but we've provided the full slide deck below for those who want to see more. There are additional details on the performance of Wolfenstein: Youngblood, as well as RTX IO (which we've covered elsewhere in our discussion of Microsoft DirectStorage and RTX IO).