AMD Zen 3 Ryzen 5000 Series Microarchitecture — The Quick Take
AMD's 19% IPC increase is the headline feature of the Zen 3 microarchitecture, and it is especially impressive considering that the company accomplished the feat with its existing Ryzen SoC design. In fact, the base SoC is identical to that of the Ryzen 3000 processors: Zen 3 uses the same 12nm I/O die (IOD) paired with either one or two eight-core chiplets (CCDs) in a multi-chip module (MCM) configuration. The IOD still contains the same memory controllers, PCIe lanes, and other interfaces that connect the SoC to the outside world. Just as with the Matisse chips, the IOD measures ~125mm^2 and has 2.09 billion transistors.
The chiplets have been redesigned, however, and now measure ~80.7mm^2 and have 4.15 billion transistors. That's slightly larger than Zen 2's CCDs with ~74mm^2 of silicon and 3.9 billion transistors.
Just like the previous-gen Ryzen CPUs, processors with six or eight cores come with one compute chiplet, while 12- and 16-core models come with two. AMD also reused all of the existing SoC wire routing and packaging to maintain compatibility with the AM4 socket. As shown in the album above, the Infinity Fabric connections between the IOD and the compute chiplets (CCDs) remain the same, communicating at 32B/cycle for reads and 16B/cycle for writes. Even though these links run at the same speed as in the previous-gen chips, the redesigned compute chiplets significantly reduce the amount of traffic that flows over them.
In the Zen 2 architecture (left), each compute chiplet (CCD) contained two four-core clusters (CCXes), each with access to its own isolated 16MB slice of L3 cache. So, while the chiplet as a whole contained 32MB of cache, no core had direct access to all of the cache in the chiplet.
To access the adjacent slice of L3 cache, a core had to communicate with the other quad-core cluster by sending the request across the Infinity Fabric to the I/O die, which then routed it to the cache in the second cluster, even though that cache sat within the same piece of silicon. The data then made the same trip in reverse: over the fabric to the I/O die, and then back to the requesting cluster.
With Zen 3, each chiplet contains one large unified 32MB slice of L3 cache, and all eight cores within the chiplet have full access to it. This improves not only core-to-cache latency but also core-to-core latency within the chiplet. AMD has also updated the CCD's core-to-core and cache interconnect to a ring topology.
While all eight cores can access the L3 cache within a single compute chiplet, in a dual-chiplet Zen 3 chip, there will be times that the cores will have to communicate with the other chiplet and its L3 cache. In those cases, the communications will still have to traverse the Infinity Fabric via signals routed through the I/O die.
Still, because an entire layer of external communication between the two four-core clusters inside each chiplet has been removed (as seen in the center of the chart above), the Infinity Fabric naturally carries far less traffic. That means less contention on the fabric, which simplifies scheduling and routing and could also free up bandwidth for the traffic that remains. All of these factors result in faster (i.e., lower-latency) communication between the two eight-core chiplets, and possibly remove some overhead on the I/O die, too.
These enhancements are important because games rely heavily on the memory subsystem, both on-die cache and main memory (DDR4). A larger pool of cache resources keeps more data closer to the cores, thus requiring fewer high-latency accesses to main memory. Additionally, lower cache latency reduces the time a core spends waiting on the L3 cache. This new design will tremendously benefit latency-sensitive applications like games, particularly those with a dominant thread that accesses the cache heavily (which is common).
However, the larger L3 cache does come with an increase in L3 latency to the tune of seven additional cycles (AMD cites 46 cycles of load-to-use latency, up from 39 in Zen 2).
Here's AMD's high-level bullet point list of improvements to the Zen 3 microarchitecture:
- Front-end enhancements:
- Major Design Goal: Faster fetching, especially for branchy and large-footprint code
- L1 branch target buffer (BTB) doubled to 1024 entries for better prediction latency
- Improved branch predictor bandwidth
- Faster recovery from misprediction
- "No Bubble" prediction to make back-to-back predictions faster and better handle branchy code
- Faster sequencing of op-cache fetches
- Finer granularity in switching of op-cache pipes
- Execution Engines:
- Major Design Goal: Reduce latency and enlarge to extract higher instruction-level parallelism (ILP)
- New dedicated branch and store pickers for the integer units, for 10 issues per cycle (+3 vs. Zen 2)
- Larger integer window (+32 entries vs. Zen 2)
- Reduced latency for select float and integer operations
- Floating point has increased bandwidth by +2 for a total of 6-wide dispatch and issue
- Floating point FMAC is now one cycle faster
- Load/store enhancements:
- Major Design Goal: Larger structures and better prefetching to enhance execution engine bandwidth
- Overall higher bandwidth to feed larger/faster execution resources
- Higher load and store bandwidth vs. Zen 2 by +1
- More flexibility in load/store operations
- Improved memory dependence detection
- +4 table walkers in the translation lookaside buffer (TLB)
Notably, AMD also added support for memory protection keys (MPK), added AVX2 support for the VAES/VPCLMULQDQ instructions, and made a just-in-time update to the Zen 3 microarchitecture to provide in-silicon mitigation for the Spectre vulnerability.
Naturally, performance and power efficiency improve as a function of the architectural improvements. The reduced traffic on the Infinity Fabric also contributes: it always requires more energy to move data than to process it. That brings us to IPC.
AMD Zen 3 Ryzen 5000 IPC Measurements
AMD chalks its 19% IPC improvement, the largest the company has delivered in the Zen era, up to a number of Zen 3's architectural improvements. The company calculated the figure from 25 different workloads, including gaming, which seems a curious inclusion given possible graphics-imposed bottlenecks, and some multi-threaded workloads. AMD's results show that the improvement varies by workload, from 9% at the low end of the spectrum to as much as 39% at the high end.
We tested a limited subset of single-threaded workloads to see the clock-for-clock improvements, locking all chips to a static 3.8 GHz all-core clock with the memory dialed into the officially supported transfer rate (AMD used DDR4-3600 for its tests, which is technically an overclocked configuration).
AMD's generational march forward is clear as we move from the left to the right of each chart. Overall, AMD's gen-on-gen IPC increases are exceptional, and Zen 3's IPC obviously beats Intel's Comet Lake chips with ease.