AMD RDNA 3 GPU Architecture Deep Dive: The Ryzen Moment for GPUs

(Image credit: AMD)

On November 3, AMD revealed key details of its upcoming RDNA 3 GPU architecture and the Radeon RX 7900-series graphics cards. It was a public announcement that the whole world was invited to watch. Shortly after the announcement, AMD took press and analysts behind closed doors to dig a little deeper into what makes RDNA 3 tick — or is it tock? No matter.

We're allowed to talk about the additional RDNA 3 details and other briefings AMD provided now, which almost certainly has nothing to do with Nvidia's impending launch of the RTX 4080 on Wednesday. (That's sarcasm, just in case it wasn't clear. This sort of thing happens all the time with AMD and Nvidia, or AMD and Intel, or even Intel and Nvidia now that Team Blue has joined the GPU race.)

AMD's RDNA 3 architecture fundamentally changes several of the key design elements for GPUs, thanks to the use of chiplets. And that's as good a place to start as any. We've also got separate articles covering AMD's Gaming and ISV Relations, Software and Platform details, and the Radeon RX 7900 Series Graphics Cards.

RDNA 3 and GPU Chiplets

Navi 31 consists of two core pieces, the Graphics Compute Die (GCD) and the Memory Cache Dies (MCDs). There are similarities to what AMD has done with its Zen 2/3/4 CPUs, but everything has been adapted to fit the needs of the graphics world.

For Zen 2 and later CPUs, AMD uses an Input/Output Die (IOD) that connects to system memory and provides all of the necessary functionality for things like the PCI Express interface, USB ports, and more recently (Zen 4) graphics and video functionality. The IOD then connects to one or more Core Compute Dies (CCDs, or alternatively "Core Complex Dies," depending on the day of the week) via AMD's Infinity Fabric, and the CCDs contain the CPU cores, cache, and other elements.

A key point in the design is that typical general computing algorithms — the stuff that runs on the CPU cores — will mostly fit within the various L1/L2/L3 caches. Modern CPUs up through Zen 4 only have two 64-bit memory channels for system RAM (though EPYC Genoa server processors can have up to twelve DDR5 channels).

The CCDs are small, and the IOD can range from around 125mm^2 (Ryzen 3000) to as large as 416mm^2 (EPYC xxx2 generation). Most recently, the Zen 4 Ryzen 7000-series CPUs have an IOD made using TSMC N6 that measures just 122mm^2 with one or two 70mm^2 CCDs manufactured on TSMC N5, while the EPYC xxx4 generation uses the same CCDs but with a relatively massive IOD measuring 396mm^2 (still made on TSMC N6).

(Image credit: AMD)

GPUs have very different requirements. Large caches can help, but GPUs also really like having gobs of memory bandwidth to feed all the GPU cores. For example, even the beastly EPYC 9654 with a 12-channel DDR5 configuration 'only' delivers up to 460.8 GB/s of bandwidth. The fastest graphics cards like the RTX 4090 can easily double that.
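To put numbers on that comparison, here's a minimal sketch of the usual peak-bandwidth arithmetic. The DDR5-4800 speed is an assumption (it's the speed that reproduces the 460.8 GB/s figure), and the RTX 4090 values are its 384-bit bus and 21 Gbps memory speed; the function name is purely illustrative.

```python
def peak_bandwidth_gbs(channels: int, bits_per_channel: int, mega_transfers: float) -> float:
    """Peak theoretical bandwidth in GB/s (decimal): bytes per transfer * transfer rate."""
    bytes_per_transfer = channels * bits_per_channel / 8
    return bytes_per_transfer * mega_transfers / 1000

# Assumption: twelve 64-bit channels of DDR5-4800 (4800 MT/s).
epyc_9654 = peak_bandwidth_gbs(channels=12, bits_per_channel=64, mega_transfers=4800)

# RTX 4090: a single 384-bit bus at 21 Gbps per pin (21,000 MT/s).
rtx_4090 = peak_bandwidth_gbs(channels=1, bits_per_channel=384, mega_transfers=21000)

print(f"EPYC 9654: {epyc_9654:.1f} GB/s")
print(f"RTX 4090:  {rtx_4090:.1f} GB/s ({rtx_4090 / epyc_9654:.2f}x)")
```

These are theoretical peaks only; sustained bandwidth in real workloads will be lower on both sides.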

In other words, AMD needed to do something different for GPU chiplets to work effectively. The solution ends up being almost the reverse of the CPU chiplets, with memory controllers and cache being placed on multiple smaller dies while the main compute functionality resides in the central GCD chiplet.

The GCD houses all the Compute Units (CUs) along with other core functionality like video codec hardware, display interfaces, and the PCIe connection. The Navi 31 GCD has up to 96 CUs, which is where the typical graphics processing occurs. But it also has an Infinity Fabric along the top and bottom edges (linked via some sort of bus to the rest of the chip) that then connects to the MCDs.

The MCDs (Memory Cache Dies), as the name implies, primarily contain the large L3 cache blocks (Infinity Cache), plus the physical GDDR6 memory interface. They also need to contain Infinity Fabric links to connect to the GCD, which you can see in the die shot along the center facing edge of the MCDs.

The GCD uses TSMC's N5 node and packs 45.7 billion transistors into a 300mm^2 die. The MCDs, meanwhile, are built on TSMC's N6 node, each packing 2.05 billion transistors on a chip that's only 37mm^2 in size. Cache and external interfaces are some of the elements of modern processors that scale the worst, and we can see that overall the GCD averages 152.3 million transistors per mm^2, while the MCDs only average 55.4 million transistors per mm^2.
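The density figures are just transistor count divided by die area. A quick sketch using the numbers quoted above (the helper name is ours, not AMD's):

```python
def density_mtr_per_mm2(transistors_billion: float, area_mm2: float) -> float:
    """Average transistor density in millions of transistors per mm^2."""
    return transistors_billion * 1000 / area_mm2

gcd_density = density_mtr_per_mm2(45.7, 300)   # Graphics Compute Die, TSMC N5
mcd_density = density_mtr_per_mm2(2.05, 37)    # Memory Cache Die, TSMC N6

print(f"GCD: {gcd_density:.1f} MTr/mm^2")
print(f"MCD: {mcd_density:.1f} MTr/mm^2")
print(f"GCD is {gcd_density / mcd_density:.2f}x denser on average")
```

Keep in mind that the two dies differ in both process node and contents, so the ratio doesn't isolate how much of the gap comes from N5 vs. N6 and how much comes from cache and I/O scaling poorly.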

AMD's High Performance Fanout Interconnect

| Interconnect | Picojoules per Bit (pJ/b) |
| --- | --- |
| Infinity Fabric (Navi 31) | 0.4 |
| TSMC CoWoS | 0.56 |
| Bunch of Wires (BoW) | 0.5-0.7 |
| Infinity Fabric (Zen 4) | ??? |
| Infinity Fabric (Zen 3) | 1.5 (?) |

One potential concern with a chiplet approach on GPUs is how much power all of the Infinity Fabric links require — external chips almost always use more power. As an example, the Zen CPUs have an organic substrate interposer that's relatively cheap to make, but it consumes 1.5 pJ/b (Picojoules per bit). Scaling that up to a 384-bit interface would have consumed a fair amount of power, so AMD worked to refine the interface with Navi 31.

The result is what AMD calls the high performance fanout interconnect. The image above doesn't quite explain things clearly, but the larger interface on the left is the organic substrate interconnect used on Zen CPUs. To the right is the high performance fanout bridge used on Navi 31, "approximately to scale."

You can clearly see the 25 wires used for the CPUs, while the 50 wires used on the GPU equivalent are packed into a much smaller area, so you can't even see the individual wires. It's about 1/8 the height and width for the same purpose, meaning about 1/64 the total area. That, in turn, dramatically cuts power requirements, and AMD says all of the Infinity Fanout links combined deliver 3.5 TB/s of effective bandwidth while only accounting for less than 5% of the total GPU power consumption.
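As a rough sanity check on those claims, energy per bit multiplied by bits per second gives watts. The sketch below assumes the links run flat out at the quoted 3.5 TB/s, that a terabyte is 10^12 bytes, and that "total GPU power" means the RX 7900 XTX's 355W board power; none of those assumptions come from AMD's breakdown.

```python
def link_power_watts(bandwidth_tb_s: float, picojoules_per_bit: float) -> float:
    """Power drawn moving data at the given rate: (bits per second) * (joules per bit)."""
    bits_per_second = bandwidth_tb_s * 1e12 * 8
    return bits_per_second * picojoules_per_bit * 1e-12

BANDWIDTH_TB_S = 3.5   # effective bandwidth quoted for all links combined
BOARD_POWER_W = 355    # assumption: RX 7900 XTX board power stands in for "total GPU power"

fanout_w = link_power_watts(BANDWIDTH_TB_S, 0.4)    # high performance fanout (Navi 31)
organic_w = link_power_watts(BANDWIDTH_TB_S, 1.5)   # organic substrate figure cited for Zen
fanout_share = fanout_w / BOARD_POWER_W

# 1/8 the height and 1/8 the width gives 1/64 the area.
area_ratio = (1 / 8) ** 2

print(f"Fanout links:      {fanout_w:.1f} W ({fanout_share:.1%} of {BOARD_POWER_W} W)")
print(f"Organic substrate: {organic_w:.1f} W at the same bandwidth")
print(f"Area ratio:        1/{1 / area_ratio:.0f}")
```

Under those assumptions the fanout links land at around 11W, which fits comfortably within the "less than 5%" claim, while the 1.5 pJ/b approach would need roughly 42W for the same traffic.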

There's a quick interesting aside here: all the Infinity Fabric logic on both the GCD and MCDs takes up a decent amount of die space. Looking at the die shot, the six Infinity Fabric interfaces on the GCD use about 9% of the die area, while the interfaces are around 15% of the total die size on the MCDs.

Wipe out the Infinity Fabric interface and build the whole chip as a monolithic part on TSMC's N5 node, and it would probably only measure 400–425mm^2. Apparently, the cost of TSMC N5 is so much higher than N6 that it was worth taking the chiplet route, which says something about the increasing costs of smaller fabrication nodes.

Related to this, we know that certain aspects of a chip design scale better with process shrinks. External interfaces — like the GDDR6 physical interface — have almost stopped scaling. Cache also tends to scale poorly. What will be interesting to see is if AMD's next-generation GPUs (Navi 4x / RDNA 4) leverage the same MCDs as RDNA 3 while shifting the GCD presumably to the future TSMC N3 node.

RDNA 3 Architecture Upgrades

(Image credit: AMD)

That takes care of the chiplet aspect of the design, so now let's go into the architecture changes to the various parts of the GPU. These can be broadly divided into four areas: general changes to the chip design, enhancements to the GPU shaders (Stream Processors), updates to improve ray tracing performance, and improvements to the matrix operation hardware.

Looking at the raw specs, it might not seem like AMD has increased clock speeds all that much, but previously we only had the Game Clock figures. Now we can say that the boost clocks are higher, and in general use, we expect AMD's RDNA 3 GPUs will exceed even the official boost clocks — they're conservative boosts, in other words.

AMD says that RDNA 3 has been architected to reach speeds of 3 GHz. The official boost clocks on the reference 7900 XTX / XT are well below that mark, but we also feel AMD's reference designs focused more on maximizing efficiency. Third-party AIB cards could very well bump up power limits, voltages, and clock speeds quite a bit. Will we see 3 GHz out-of-factory overclocks? Perhaps, so we'll wait and see.

According to AMD, RDNA 3 GPUs can hit the same frequency as RDNA 2 GPUs while using half the power, or they can hit 1.3 times the frequency while using the same power. Of course, ultimately, AMD wants to balance frequency and power to deliver the best overall experience. Still, given we see higher power limits on the 7900 XTX, we should also expect that to come with a decent bump to clock speeds and performance.

Another point AMD makes is that it has improved silicon utilization by approximately 20%. In other words, there were functional units on RDNA 2 GPUs where parts of the chip were frequently sitting idle even when the card was under full load. Unfortunately, we don't have a good way to measure this directly, so we'll take AMD's word on this, but ultimately this should result in higher performance.

Compute Unit Enhancements

Outside of the chiplet stuff, many of the biggest changes occur within the Compute Units (CUs) and Workgroup Processors (WGPs). These include updates to the L0/L1/L2 cache sizes, more SIMD32 registers for FP32 and matrix workloads, and wider and faster interfaces between some elements.

AMD's Mike Mantor presented the above and the following slides, which are dense! He basically talked non-stop for the better part of an hour, trying to cover everything that's been done with the RDNA 3 architecture, and that wasn't nearly enough time. The above slide covers the big-picture overview, but let's step through some of the details.

RDNA 3 comes with an enhanced Compute Unit pair — the dual CUs that became the main building block for RDNA chips. A cursory look at the above might not look that different from RDNA 2, but then notice that the first block for the scheduler and Vector GPRs (general purpose registers) says "Float / INT / Matrix SIMD32" followed by a second block that says "Float / Matrix SIMD32." That second block is new for RDNA 3, and it basically means double the floating point throughput.

You can choose to look at things in one of two ways: Either each CU now has 128 Stream Processors (SPs, or GPU shaders), and you get 12,288 total shader ALUs (Arithmetic Logic Units), or you can view it as 64 "full" SPs that just happen to have double the FP32 throughput compared to the previous generation RDNA 2 CUs.

This is sort of funny because some places are saying that Navi 31 has 6,144 shaders, and others are saying 12,288 shaders, so I specifically asked AMD's Mike Mantor — the Chief GPU Architect and the main guy behind the RDNA 3 design — whether it was 6,144 or 12,288. He pulled out a calculator, punched in some numbers, and said, "Yeah, it should be 12,288." And yet, in some ways, it's not.

AMD's own slides in a different presentation (above) say 6,144 SPs and 96 CUs for the 7900 XTX, and 84 CUs with 5,376 SPs for the 7900 XT, so AMD is taking the approach of using the lower number. However, raw FP32 compute (and matrix compute) has doubled. Personally, it makes more sense to me to call it 128 SPs per CU rather than 64, and the overall design looks similar to Nvidia's Ampere and Ada Lovelace architectures. Those now have 128 FP32 CUDA cores per Streaming Multiprocessor (SM), but also 64 INT32 units.
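The two ways of counting are easy to reconcile in a few lines. This sketch simply multiplies the CU counts by 64 (AMD's official per-CU figure) or 128 (counting the second FP32 SIMD block); the dictionary and constant names are ours.

```python
CU_COUNTS = {"RX 7900 XTX": 96, "RX 7900 XT": 84}

OFFICIAL_SPS_PER_CU = 64   # how AMD's slides count Stream Processors
FP32_LANES_PER_CU = 128    # counting the extra "Float / Matrix SIMD32" block too

shader_counts = {
    card: (cus * OFFICIAL_SPS_PER_CU, cus * FP32_LANES_PER_CU)
    for card, cus in CU_COUNTS.items()
}

for card, (official, fp32_lanes) in shader_counts.items():
    print(f"{card}: {official} SPs officially, {fp32_lanes} FP32 ALUs")
```

Both numbers describe the same silicon; which one is "right" depends on whether you're counting schedulable shader lanes or peak FP32 ALUs.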

Along with the extra 32-bit floating-point compute, AMD also doubled the matrix (AI) throughput as the AI Matrix Accelerators appear to at least partially share some of the execution resources. New to the AI units is BF16 (brain-float 16-bit) support, as well as INT4 WMMA Dot4 instructions (Wave Matrix Multiply Accumulate), and as with the FP32 throughput, there's an overall 2.7x increase in matrix operation speed.

That 2.7x appears to come from the overall 17.4% increase in clock-for-clock performance, plus 20% more CUs and double the SIMD32 units per CU. (But don't quote me on that, as AMD didn't specifically break down all of the gains.)
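Multiplying those three factors together is a quick way to see how close the guess gets. This is only a sketch of the speculation above, not AMD's actual breakdown:

```python
clock_for_clock_gain = 1.174   # 17.4% clock-for-clock improvement
cu_scaling = 96 / 80           # Navi 31 vs. Navi 21 CU count, i.e. 20% more
simd_scaling = 2.0             # double the SIMD32 units per CU

estimated_uplift = clock_for_clock_gain * cu_scaling * simd_scaling
print(f"Estimated uplift: {estimated_uplift:.2f}x")
```

The product works out to about 2.8x, slightly above AMD's 2.7x figure, which suggests the real breakdown isn't quite this simple.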

Bigger and Faster Caches and Interconnects

(Image credit: AMD)

The caches, and the interfaces between the caches and the rest of the system, have all received upgrades. For example, the L0 cache is now 32KB (double RDNA 2), and the L1 caches are 256KB (double RDNA 2 again), while the L2 cache increased to 6MB (1.5x larger than RDNA 2).

The link between the main processing units and the L1 cache is now 1.5x wider, with 6144 bytes per clock throughput. Likewise, the link between the L1 and L2 cache is also 1.5x wider (3072 bytes per clock).

The L3 cache, also called the Infinity Cache, did shrink relative to Navi 21. It's now 96MB vs. 128MB. However, the L3 to L2 link is now 2.25x wider (2304 bytes per clock), so the total throughput is much higher. In fact, AMD gives a figure of 5.3 TB/s — 2304 B/clk at a speed of 2.3 GHz. The RX 6950 XT only had a 1024 B/clk link to its Infinity Cache (maximum), and RDNA 3 delivers up to 2.7x the peak interface bandwidth.

Note that these figures are only for the fully configured Navi 31 solution in the 7900 XTX. The 7900 XT has five MCDs, dropping down to a 320-bit GDDR6 interface and 1920 B/clk links to the combined 80MB of Infinity Cache. We will likely see lower-tier RDNA 3 parts that further cut back on interface width and performance, naturally.
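The Infinity Cache figures all follow from the per-clock link width and the number of active MCDs. Here's a minimal sketch, assuming each MCD contributes an equal one-sixth share of the 2304 B/clk link and of the 96MB cache (which is what the 7900 XT numbers imply):

```python
FULL_LINK_BYTES_PER_CLK = 2304   # Navi 31 with all six MCDs
FULL_CACHE_MB = 96
MCD_COUNT_FULL = 6
RDNA2_LINK_BYTES_PER_CLK = 1024  # RX 6950 XT maximum

def peak_bandwidth_tbs(bytes_per_clk: int, clock_ghz: float) -> float:
    """Peak link bandwidth in TB/s: bytes per clock * clocks per second."""
    return bytes_per_clk * clock_ghz / 1000

def scaled_config(active_mcds: int) -> tuple[int, int]:
    """Link width (B/clk) and Infinity Cache (MB) for a given number of active MCDs."""
    link = FULL_LINK_BYTES_PER_CLK // MCD_COUNT_FULL * active_mcds
    cache = FULL_CACHE_MB // MCD_COUNT_FULL * active_mcds
    return link, cache

xtx_bandwidth = peak_bandwidth_tbs(FULL_LINK_BYTES_PER_CLK, 2.3)
width_ratio = FULL_LINK_BYTES_PER_CLK / RDNA2_LINK_BYTES_PER_CLK
xt_link, xt_cache = scaled_config(active_mcds=5)

print(f"7900 XTX: {xtx_bandwidth:.1f} TB/s, link {width_ratio:.2f}x wider than RDNA 2")
print(f"7900 XT:  {xt_link} B/clk, {xt_cache} MB Infinity Cache")
```

The 2.3 GHz clock is the one AMD used for its 5.3 TB/s figure; actual fabric clocks may differ from shader clocks in practice.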

Finally, there are now up to six 64-bit GDDR6 interfaces for a combined 384-bit link to the GDDR6 memory. The VRAM also clocks at 20 Gbps (vs 18 Gbps on the later 6x50 cards and 16 Gbps on the original RDNA 2 chips) for a total bandwidth of 960 GB/s.

It's interesting how much the gap between GDDR6 and GDDR6X has narrowed with this generation, at least for shipping configurations. AMD's 960 GB/s on the RX 7900 XTX is only 5% less than the 1008 GB/s of the RTX 4090 now, whereas the RX 6900 XT was only pushing 512 GB/s compared to the RTX 3090's 936 GB/s back in 2020.
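VRAM bandwidth is simply bus width times per-pin speed, divided by eight to convert bits to bytes. A short sketch using the bus widths and memory speeds from the specifications table (the function name is illustrative):

```python
def vram_bandwidth_gbs(bus_width_bits: int, speed_gbps: float) -> float:
    """Peak VRAM bandwidth in GB/s: pins * Gbps per pin / 8 bits per byte."""
    return bus_width_bits * speed_gbps / 8

CARDS = {
    "RX 7900 XTX": (384, 20.0),
    "RX 7900 XT": (320, 20.0),
    "RX 6950 XT": (256, 18.0),
    "RTX 4090": (384, 21.0),
    "RTX 4080": (256, 22.4),
}

bandwidth = {name: vram_bandwidth_gbs(*spec) for name, spec in CARDS.items()}
gap = 1 - bandwidth["RX 7900 XTX"] / bandwidth["RTX 4090"]

for name, gbs in bandwidth.items():
    print(f"{name}: {gbs:.1f} GB/s")
print(f"7900 XTX trails the RTX 4090 by {gap:.1%}")
```

With identical 384-bit buses on the two flagships, the whole gap comes down to 20 Gbps GDDR6 versus 21 Gbps GDDR6X.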

AMD 2nd Generation Ray Tracing

Ray tracing on the RDNA 2 architecture always felt like an afterthought — something tacked on to meet the required feature checklist for DirectX 12 Ultimate. AMD's RDNA 2 GPUs lack dedicated BVH traversal hardware, opting to do some of that work via other shared units, and that's at least partially to blame for their weak performance.

RDNA 2 Ray Accelerators could do up to four ray/box intersections per clock, or one ray/triangle intersection. By way of contrast, Intel's Arc Alchemist can do up to 12 ray/box intersections per RTU per clock, while Nvidia doesn't provide a specific number but has up to two ray/triangle intersections per RT core on Ampere and up to four ray/triangle intersections per clock on Ada Lovelace.

It's not clear if RDNA 3 actually improves those figures directly or if AMD has focused on other enhancements to reduce the number of ray/box intersections performed. Perhaps both. What we do know is that RDNA 3 will have improved BVH (Bounding Volume Hierarchy) traversal that will increase ray tracing performance.

RDNA 3 also has 1.5x larger VGPRs, which means 1.5x as many rays in flight. There are other stack optimizations to reduce the number of instructions needed for BVH traversal, and specialized box sorting algorithms (closest first, largest first, closest midpoint) can be used to extract improved efficiency.

Overall, thanks to the new features, higher frequency, and increased number of Ray Accelerators, AMD says RDNA 3 should deliver up to a 1.8x performance uplift for ray tracing compared to RDNA 2. That should narrow the gap between AMD and Nvidia Ampere. Still, Nvidia also seems to have doubled down on its ray tracing hardware for Ada Lovelace, so we wouldn't count on AMD delivering equivalent performance to RTX 40-series GPUs.

Other Architectural Improvements

(Image credit: AMD)

Finally, RDNA 3 has tuned other elements of the architecture related to the command processor, geometry, and pixel pipelines. There's also a new Dual Media Engine with support for AV1 encode/decode, AI-enhanced video decoding, and the new Radiance Display Engine.

The Command Processor (CP) updates should improve performance for certain workloads while also reducing CPU bottlenecks on the driver and API side. Hardware-based culling performance is also 50% faster on the geometry side of things, and there's a 50% increase in peak rasterized pixels per clock.

That last seems to be a result of increasing the number of ROPs (Render Outputs) from 128 on Navi 21 to 192 on Navi 31. That makes sense, as there's also a 50% increase in memory channels, and AMD would want to scale other elements in step with that.

The Dual Media Engine should bring AMD up to parity with Nvidia and Intel on the video side of things, though we'll have to test to see how quality and performance compare. We know from our Arc A380 video encoding tests that Intel generally delivered the best performance and quality, Nvidia wasn't far behind, and AMD was a relatively distant third on the quality front. Unfortunately, we haven't been able to test Nvidia's AV1 support yet, but we're looking forward to checking out both of the new AMD and Nvidia AV1 implementations.

AMD also gains at least a few points for including DisplayPort 2.1 support. Intel also has DP2 support on its Arc GPUs, but it tops out at 40 Gbps (UHBR 10), while AMD can do 54 Gbps (UHBR 13.5). AMD's display outputs can drive up to 4K at 229 Hz without compression for 8-bit color depths, or 187 Hz with 10-bit color. Display Stream Compression can more than double that, allowing for 4K and 480 Hz or 8K and 165 Hz — not that we're anywhere near having displays that actually support such speeds.
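For a rough feel for those display numbers, the sketch below compares the raw link rates against the bare pixel data for the uncompressed 4K modes. It assumes four lanes and DisplayPort 2.x's 128b/132b channel coding, and it deliberately ignores blanking intervals and other protocol overhead, so the pixel figures are a lower bound on what the link actually has to carry.

```python
LANES = 4
LANE_RATE_GBPS = {"UHBR 10": 10.0, "UHBR 13.5": 13.5}
ENCODING_EFFICIENCY = 128 / 132   # 128b/132b channel coding

def raw_link_gbps(mode: str) -> float:
    """Raw link rate across all four lanes."""
    return LANES * LANE_RATE_GBPS[mode]

def payload_gbps(mode: str) -> float:
    """Link rate left after channel coding (other overheads ignored)."""
    return raw_link_gbps(mode) * ENCODING_EFFICIENCY

def pixel_data_gbps(width: int, height: int, refresh_hz: int, bits_per_channel: int) -> float:
    """Active pixel data only: three color channels, no blanking."""
    return width * height * refresh_hz * bits_per_channel * 3 / 1e9

uhbr135_payload = payload_gbps("UHBR 13.5")
mode_8bit = pixel_data_gbps(3840, 2160, 229, 8)
mode_10bit = pixel_data_gbps(3840, 2160, 187, 10)

print(f"UHBR 10 raw:   {raw_link_gbps('UHBR 10'):.0f} Gbps")
print(f"UHBR 13.5 raw: {raw_link_gbps('UHBR 13.5'):.0f} Gbps, ~{uhbr135_payload:.1f} Gbps after coding")
print(f"4K 229 Hz 8-bit:  {mode_8bit:.1f} Gbps of pixel data")
print(f"4K 187 Hz 10-bit: {mode_10bit:.1f} Gbps of pixel data")
```

Both uncompressed modes need around 46 Gbps for pixels alone, which explains why they fit on UHBR 13.5 but would be out of reach on a 40 Gbps UHBR 10 link without DSC.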

Realistically, we have to wonder how important DP2.1 UHBR 13.5 will be with the RDNA 3 graphics cards. You'll need a new monitor that supports DP2.1 first of all, and second, there's the question of how much better something like 4K 180 Hz looks with and without DSC — because DP1.4a can still handle that resolution with DSC while UHBR 13.5 could do it without DSC.

RDNA 3 Architecture Closing Thoughts

AMD and Nvidia GPU Specifications

| Graphics Card | RX 7900 XTX | RX 7900 XT | RX 6950 XT | RTX 4090 | RTX 4080 |
| --- | --- | --- | --- | --- | --- |
| Architecture | Navi 31 | Navi 31 | Navi 21 | AD102 | AD103 |
| Process Technology | TSMC N5 + N6 | TSMC N5 + N6 | TSMC N7 | TSMC 4N | TSMC 4N |
| Transistors (Billion) | 58 (45.7 + 6x 2.05) | 56 (45.7 + 5x 2.05) | 26.8 | 76.3 | 45.9 |
| Die size (mm^2) | 300 + 222 | 300 + 185 | 519 | 608.4 | 378.6 |
| CUs / SMs | 96 | 84 | 80 | 128 | 76 |
| SPs / Cores (Shaders) | 6144 (12288) | 5376 (10752) | 5120 | 16384 | 9728 |
| Tensor / Matrix Cores | ? | ? | ? | 512 | 304 |
| Ray Tracing "Cores" | 96 | 84 | 80 | 128 | 76 |
| Boost Clock (MHz) | 2500 | 2400 | 2310 | 2520 | 2505 |
| VRAM Speed (Gbps) | 20 | 20 | 18 | 21 | 22.4 |
| VRAM (GB) | 24 | 20 | 16 | 24 | 16 |
| VRAM Bus Width (bits) | 384 | 320 | 256 | 384 | 256 |
| L2 / Infinity Cache (MB) | 96 | 80 | 128 | 72 | 64 |
| TFLOPS FP32 (Boost) | 56.5 | 43 | 23.7 | 82.6 | 48.7 |
| TFLOPS FP16 (FP8) | 113 | 86 | 47.4 | 661 (1321) | 390 (780) |
| Bandwidth (GBps) | 960 | 800 | 576 | 1008 | 717 |
| TDP (watts) | 355 | 300 | 335 | 450 | 320 |
| Launch Date | Dec 13, 2022 | Dec 13, 2022 | May 2022 | Oct 12, 2022 | Nov 16, 2022 |
| Launch Price | $999 | $899 | $1,099 | $1,599 | $1,199 |

For those who want the full collection of slides on the RDNA 3 architecture, you can flip through them in the above gallery. Overall, it sounds like an impressive feat of engineering and we're eager to see how the graphics cards based on the RDNA 3 GPUs stack up.

As we've noted before, we feel like there's a good chance AMD can compete quite well against Nvidia's RTX 4080 card, which launches on November 16. On the other hand, it seems quite unlikely that AMD will be able to go head-to-head against the bigger RTX 4090 in most games.

Simple math provides plenty of food for thought. With 12,288 FP32 shaders running at 2.5 GHz vs. Nvidia's 16,384 shaders at 2.52 GHz, Nvidia clearly has the raw compute advantage: 83 teraflops vs. AMD's 61 teraflops. As noted, adding more FP32 units makes AMD's RDNA 3 seem more like Ampere and Ada Lovelace, so there's a reasonable chance that real-world gaming performance will match up more closely with the teraflops. Memory bandwidth at least looks pretty close and the difference probably shouldn't matter too much.
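That simple math is the standard peak-throughput formula: shaders times two floating-point operations per clock (one fused multiply-add) times clock speed. A minimal sketch with the shader counts and boost clocks quoted above:

```python
FLOPS_PER_SHADER_PER_CLOCK = 2   # one fused multiply-add counts as two operations

def peak_fp32_tflops(shaders: int, clock_ghz: float) -> float:
    """Theoretical peak FP32 throughput in teraflops."""
    return shaders * FLOPS_PER_SHADER_PER_CLOCK * clock_ghz / 1000

rx_7900_xtx = peak_fp32_tflops(12288, 2.5)
rtx_4090 = peak_fp32_tflops(16384, 2.52)

print(f"RX 7900 XTX: {rx_7900_xtx:.1f} TFLOPS")
print(f"RTX 4090:    {rtx_4090:.1f} TFLOPS ({rtx_4090 / rx_7900_xtx:.2f}x)")
```

These are paper numbers that assume every FP32 unit is busy every clock, which neither architecture will manage in real games.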

Beyond raw compute, we've got transistor counts and die sizes. Nvidia has built monolithic dies with its AD102, AD103, and AD104 GPUs. The largest has 76.3 billion transistors in a 608mm^2 chip. Even if AMD were doing a monolithic 522mm^2 chip with 58 billion transistors, we'd expect Nvidia to have some advantages. Still, the GPU chiplet approach means some of the area and transistors get used on things not directly related to performance.
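The Navi 31 totals are just the GCD plus however many MCDs are on the package. The sketch below adds them up and, as a rough extra comparison that isn't from AMD or Nvidia, works out the average density of each design from the published transistor counts and die sizes.

```python
GCD = {"transistors_b": 45.7, "area_mm2": 300}
MCD = {"transistors_b": 2.05, "area_mm2": 37}

def navi31_totals(mcd_count: int) -> tuple[float, int]:
    """Total transistors (billions) and total silicon area (mm^2) for GCD + N MCDs."""
    transistors = GCD["transistors_b"] + mcd_count * MCD["transistors_b"]
    area = GCD["area_mm2"] + mcd_count * MCD["area_mm2"]
    return transistors, area

def density_mtr_per_mm2(transistors_b: float, area_mm2: float) -> float:
    """Average density in millions of transistors per mm^2."""
    return transistors_b * 1000 / area_mm2

xtx_transistors, xtx_area = navi31_totals(mcd_count=6)
xt_transistors, xt_area = navi31_totals(mcd_count=5)

navi31_density = density_mtr_per_mm2(xtx_transistors, xtx_area)
ad102_density = density_mtr_per_mm2(76.3, 608.4)

print(f"RX 7900 XTX: {xtx_transistors:.1f}B transistors, {xtx_area} mm^2")
print(f"RX 7900 XT:  {xt_transistors:.2f}B transistors, {xt_area} mm^2")
print(f"Average density: Navi 31 {navi31_density:.1f} vs. AD102 {ad102_density:.1f} MTr/mm^2")
```

The averages hide a lot, since Navi 31 mixes a dense N5 die with six much less dense N6 dies, but they give a sense of how the two approaches compare in total silicon.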

Meanwhile, Nvidia's penultimate Ada chip, the AD103 used in the RTX 4080, falls on the other side of the fence. It has a 256-bit interface, 45.9 billion transistors, and a 378.6mm^2 die size, so Navi 31 should have some clear advantages, both with the RX 7900 XTX and the slightly lower tier 7900 XT. And don't even get us started on the AD104 with 35.8 billion transistors and a 294.5mm^2 die. There's no way the "unlaunched" RTX 4080 12GB was going to keep pace with an RX 7900 XT, not without DLSS3 being a major part of the story.

But there's more to performance than paper specs. Nvidia invests more transistors into features like DLSS (Tensor cores) and now DLSS3 (the Optical Flow Accelerator), and ray tracing hardware. AMD seems more willing to give up some ray tracing performance while boosting the more common use cases. We'll see how the RTX 4080 performs in just a couple of days, and then we'll need to wait until December to see AMD's RX 7900 series response.

For those who aren't interested in graphics cards costing $900 or more, 2023 will be when we get RTX 4070 and lower-tier Ada Lovelace parts, and we'll likely get RX 7800, 7700, and maybe even 7600 series offerings from AMD. Navi 32 is rumored to use the same MCDs, but with a smaller GCD, while further out, Navi 33 will supposedly be a monolithic die still built on the N6 node.

Based on what we've seen and heard so far, the future RTX 4070 and RX 7800 will likely deliver similar performance to the previous generation RTX 3090 and RX 6950 XT, hopefully at substantially lower prices and while using less power. Check back next month for our full reviews of AMD's first and fastest RDNA 3 graphics cards. 

Jarred Walton is a senior editor at Tom's Hardware focusing on everything GPU. He has been working as a tech journalist since 2004, writing for AnandTech, Maximum PC, and PC Gamer. From the first S3 Virge '3D decelerators' to today's GPUs, Jarred keeps up with all the latest graphics trends and is the one to ask about game performance.