Game Modes & Architecture, Infinity Fabric Latency Testing
We've covered AMD's Zen architecture in depth, and also covered the Infinity Fabric at length. Head over to those articles for more coverage.
The Zeppelin Die Primer
Threadripper's massive package hides much complexity underneath, but we'll do our best to simplify and outline how it relates to AMD's innovative Creator and Game Mode features.
The Zen architecture employs a four-core CCX (CPU Complex) building block. AMD adorns each CCX with 8MB of L3 cache split into four slices; each core in the CCX accesses all L3 slices with the same average latency. Two CCXes come together to create an eight-core Ryzen 7 die (the large orange blocks in the second image below), and they communicate via AMD’s Infinity Fabric interconnect. The CCXes share the same dual-channel memory controller. This is basically two quad-core CPUs talking to each other over the Infinity Fabric pathway that also handles northbridge and PCIe traffic.
All Ryzen 7, 5, and 3 models feature the same single Zeppelin die. Although each core in a four-core CCX can access the local cache with the same average latency, trips to fetch data from the adjacent CCX incur a latency penalty. Communication between threads on cores located in different CCXes also suffers, which is of particular importance for gaming. Many game engines split various tasks out to different threads, but they rely on constant synchronization between them. Developers can defray some of the communication latency by tuning for the Ryzen architecture.
Building The Threadripper
The graphic below represents AMD's EPYC data center processor package, which shares Threadripper's basic design. We can see four separate Zeppelin dies connected via the Infinity Fabric, and the two CCXes inside each die. This creates a 32-core Multi-Chip Module (MCM). Of course, Threadripper is "only" a 16-core processor. To create this configuration, AMD substitutes in two 'dummy dies,' which are non-functional fillers that ensure the heat spreader's structural integrity and consistent mating with the socket's pins. Without these dark dies, the IHS would either cave in when you tighten your cooling solution, or the chip would warp and not make full contact with the pins. AMD notes that Threadripper's functional dies are always placed diagonally from each other, which makes sense considering the fabric's design.
Remember, each Zeppelin die has its own memory and PCIe controllers. That means that if a workload executing on a die needs to access data resident in the memory of the other die (remote memory), it has to traverse a much larger gap. This introduces a level of latency we haven't seen from previous Ryzen models, and its effect on gaming performance is profound. The impact isn't as severe with most professional workloads, but some do suffer.
The New Toggles
To defray the impact of remote memory access, AMD introduces a new memory access mode that you can toggle either in the BIOS or with the Ryzen Master software. The Local and Distributed settings correspond to NUMA (Non-Uniform Memory Access) and UMA (Uniform Memory Access), respectively.
UMA (distributed) is pretty simple; it allows the dies to access all of the attached memory. NUMA mode (local) attempts to keep all data for the process executing on the die confined to its directly attached memory controller. It establishes one NUMA node per die (visible in the task manager). This reduces, and even possibly eliminates, data fetches from the remote memory connected to another die, though the die can still access it if needed. NUMA has deep roots in the enterprise, but the technique works best if programs are designed specifically to utilize it. It's a rarity on the desktop, and almost no desktop applications are designed to fully support it, but even applications that aren't NUMA-aware can see performance advantages.
AMD's Threadripper introduces more cores to the desktop than we've ever seen; some programs are caught ill-prepared. In fact, a few games like Far Cry Primal and the DiRT series won't even run when the full complement of Threadripper's threads is brought to bear. That's obviously a problem, so AMD created a Legacy Compatibility mode that executes a "bcdedit /set numproc XX" command in Windows, which effectively disables half of the processor's cores. Luckily, due to the operating system's core assignments, the command disables all of the cores/threads on the second die. That has the side benefit of eliminating thread-to-thread communication between the two dies, which makes it a great solution to the constant synchronization between threads during most gaming workloads.
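To make the halving concrete, here is a minimal sketch of how such a command line could be assembled. AMD's command leaves the value as "XX," so the rule used here (half of the logical processor count) is our assumption based on the description above, and `numproc_command` is a hypothetical helper name. The sketch only builds the argument list; it does not run anything.

```python
def numproc_command(logical_processors: int) -> list[str]:
    """Build bcdedit arguments that cap Windows at half the logical processors.

    Assumption: the "XX" in the command is half of the logical processor
    count, since the mode is described as disabling half of the processor.
    """
    if logical_processors < 2 or logical_processors % 2:
        raise ValueError("expected an even logical processor count of at least 2")
    return ["bcdedit", "/set", "numproc", str(logical_processors // 2)]


if __name__ == "__main__":
    # Illustrative input: a 16-core part with SMT exposes 32 logical processors.
    print(" ".join(numproc_command(32)))
```

Because the limit is applied by the operating system's boot configuration rather than by the hardware, a change like this typically takes effect only after a restart.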
Because the change is made in software, the "disabled" die still has power fed to it, so the system can still access the memory and PCIe controllers connected to the inactive die.
Game Mode And Creator Mode
So what do you do with all these knobs? There are four separate combinations that will impact each application or game differently, so you have to cycle through them to find the best possible combination for your workload. That's a godsend to tuners looking to squeeze out every last drop of performance, but an absolute nightmare for the other 99%.
AMD decided to simplify the process by specifying two combinations, one that should work best for games and one for standard applications. Creator mode, which is the stock configuration, exposes the full might of 32 threads. It should naturally provide excellent performance for most productivity applications.
Game mode cuts half the threads via compatibility mode and reduces memory and die-to-die latency with the Local memory mode. We're going to test both configurations with our gaming suite, and try another configuration that also offers the full complement of threads.
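The four combinations can be enumerated in a few lines. This is a sketch for illustration only: the setting names mirror the toggles described above, the thread counts assume a 32-thread processor, and pairing Creator mode with the Distributed memory setting is our reading of the stock configuration rather than something the toggles themselves state.

```python
from itertools import product

MEMORY_MODES = ("Distributed", "Local")   # UMA vs. NUMA
LEGACY_COMPAT = (False, True)             # True disables half the cores

# Names for the combinations discussed in the text; the fourth has no name.
NAMED = {
    ("Distributed", False): "Creator mode",
    ("Local", True): "Game mode",
    ("Local", False): "Local/SMT",
}

TOTAL_THREADS = 32  # assumption: a 16-core part with SMT enabled


def combinations():
    """Return (memory mode, legacy compat, active threads, name) for every pairing."""
    rows = []
    for memory, legacy in product(MEMORY_MODES, LEGACY_COMPAT):
        threads = TOTAL_THREADS // 2 if legacy else TOTAL_THREADS
        name = NAMED.get((memory, legacy), "unnamed")
        rows.append((memory, legacy, threads, name))
    return rows


if __name__ == "__main__":
    for memory, legacy, threads, name in combinations():
        state = "on" if legacy else "off"
        print(f"{memory:12s} legacy {state:3s} {threads:2d} threads  {name}")
```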
Infinity Fabric Latency Testing
Die-to-die communication adds another layer of latency to Ryzen’s complicated architecture. As the table below shows, the die-to-die measurements don’t apply to the earlier single-die Ryzen models. This extra hop presents challenges to some applications, such as those with synchronized threads or frequent fetches from remote memory, but has less impact on others.
|Processor||Intra-Core Latency||Intra-CCX Core-to-Core Latency||Cross-CCX Core-to-Core Latency||Cross-CCX Average Latency||Die-to-Die Latency||Die-To-Die Average Latency||Average Transfer Bandwidth|
|TR 1950X Creator Mode DDR4-2666||13.7 - 14.1ns||39.4 - 43.2ns||157.6 - 171.3ns||168ns||180.6 - 256.7ns||238.47ns||90.26 GB/s|
|TR 1950X Creator Mode DDR4-3200||13.8 - 14.9ns||39.2 - 45.4ns||144.9 - 167.2ns||160.1ns||213.1 - 227.8ns||216.9ns||91.67 GB/s|
|TR 1950X Game Mode DDR4-2666||13.9 - 14.2ns||39.5 - 42.3ns||149.2 - 164.1ns||159.66ns||X||X||46.58 GB/s|
|TR 1950X Game Mode DDR4-3200||14.3 - 14.9ns||41.2 - 46.2ns||123 - 150.6ns||145.44ns||X||X||45.52 GB/s|
|TR 1950X Local/SMT DDR4-2666||13.9 - 14.4ns||39.6 - 43.1ns||168.7 - 175.4ns||171.48ns||232.4 - 240.8ns||235.38ns||92.7 GB/s|
|TR 1950X Local/SMT DDR4-3200||13.9 - 14.4ns||39.9 - 44.5ns||146.7 - 159.4ns||153.89ns||209.3 - 220.9ns||212.53ns||91 GB/s|
|Ryzen 7 1800X||14.8ns||40.5 - 82.8ns||120.9 - 126.2ns||122.96ns||X||X||48.1 GB/s|
|Ryzen 5 1600X||14.7 - 14.8ns||40.6 - 82.8ns||121.5 - 128.2ns||123.48ns||X||X||43.88 GB/s|
The intra-core latency measurements represent communication between two logical threads resident on the same physical core, and they're unaffected by memory speed. Intra-CCX measurements quantify latency between threads that are on the same CCX but not resident on the same core. In the past, we observed slight performance variances, but intra-CCX latency is also largely unaffected by memory speed. However, we've seen a large decrease in cross-CCX latency, which denotes latency between threads located on two separate CCXes, by increasing the memory data transfer rate from DDR4-1333 to DDR4-3200 on Ryzen 5 and 7 models.
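The measurement categories map directly onto where two threads sit in the topology. The sketch below classifies a pair of logical CPUs for a two-die, 16-core part. Note the assumption: it uses a hypothetical numbering in which SMT siblings are adjacent, cores fill a CCX in order, and CCXes fill a die in order. Real operating systems may enumerate logical processors differently, so treat the index math as illustrative.

```python
THREADS_PER_CORE = 2  # SMT
CORES_PER_CCX = 4
CCX_PER_DIE = 2


def locate(logical_cpu: int) -> tuple[int, int, int]:
    """Map a logical CPU index to (core, ccx, die) under the assumed numbering."""
    core = logical_cpu // THREADS_PER_CORE
    ccx = core // CORES_PER_CCX
    die = ccx // CCX_PER_DIE
    return core, ccx, die


def relationship(cpu_a: int, cpu_b: int) -> str:
    """Name the latency category that applies between two logical CPUs."""
    if cpu_a == cpu_b:
        raise ValueError("need two different logical CPUs")
    core_a, ccx_a, die_a = locate(cpu_a)
    core_b, ccx_b, die_b = locate(cpu_b)
    if core_a == core_b:
        return "intra-core"   # two threads on one physical core
    if ccx_a == ccx_b:
        return "intra-CCX"    # different cores, same CCX
    if die_a == die_b:
        return "cross-CCX"    # different CCXes, same die
    return "die-to-die"       # different dies


if __name__ == "__main__":
    for pair in [(0, 1), (0, 2), (0, 8), (0, 16)]:
        print(pair, relationship(*pair))
```

Under this numbering, a single-die Ryzen 7 only ever produces the first three categories, which is why the die-to-die columns are marked "X" for those chips.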
The same general trend continues with Threadripper. As we can see, toggling game mode removes the die-to-die latency for threads by effectively disabling one die, but it also reduces host processing resources. It’s an interesting feature that will benefit some workloads, but hamstring others.
We also notice that the Local/SMT combination, which consists of the Local memory setting with all cores active (legacy mode off), offers the best overall latency improvement from faster memory. We also recorded higher cross-CCX latency with the Threadripper processors than with the Ryzen 7 and Ryzen 5 models.
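The memory-speed comparison is easy to check from the average columns of the Threadripper table. The sketch below copies the DDR4-2666 and DDR4-3200 averages for each mode and computes the percentage reduction; the dictionary layout and function names are ours, chosen for illustration.

```python
# (DDR4-2666 average, DDR4-3200 average) in nanoseconds, from the table above.
AVERAGES_NS = {
    "Creator Mode": {"cross_ccx": (168.0, 160.1), "die_to_die": (238.47, 216.9)},
    "Game Mode": {"cross_ccx": (159.66, 145.44)},  # die-to-die not applicable
    "Local/SMT": {"cross_ccx": (171.48, 153.89), "die_to_die": (235.38, 212.53)},
}


def percent_reduction(slower_ns: float, faster_ns: float) -> float:
    """Percentage by which latency drops when moving to the faster memory."""
    return (slower_ns - faster_ns) / slower_ns * 100.0


def best_mode(metric: str) -> str:
    """Return the mode with the largest percentage reduction for a metric."""
    candidates = {
        mode: percent_reduction(*metrics[metric])
        for mode, metrics in AVERAGES_NS.items()
        if metric in metrics
    }
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    for mode, metrics in AVERAGES_NS.items():
        for metric, (at_2666, at_3200) in metrics.items():
            change = percent_reduction(at_2666, at_3200)
            print(f"{mode:13s} {metric:11s} {change:5.1f}% lower at DDR4-3200")
```

On these averages, Local/SMT shows the largest reduction in both the cross-CCX and die-to-die columns, at roughly 10 percent each.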
|Processor||Intra-Core Latency||Core-To-Core Latency||Core-To-Core Average Latency||Average Transfer Bandwidth|
|Core i9-7900X||14.5 - 16ns||69.3 - 82.3ns||75.56ns||83.21 GB/s|
|Core i9-7900X @ 3200 MT/s||16 - 16.1ns||76.8 - 91.3ns||83.93ns||87.31 GB/s|
|Core i7-6950X||13.5 - 15.4ns||54.5 - 70.3ns||64.64ns||65.67 GB/s|
|Core i7-7700K||14.7 - 14.9ns||36.8 - 45.1ns||42.63ns||35.84 GB/s|
We are in the midst of a broader set of tests to quantify how these modes impact memory latency and bandwidth, among other factors. Stay tuned.