Architecture, NUMA & Game Mode
It Starts With 12nm LP
AMD's Threadripper 2 processors are manufactured on GlobalFoundries' 12nm LP process technology. The ported-over design helps boost transistor performance, but does not affect die area or transistor density. As a result, the Zeppelin die's ~4.8 billion transistors and 213mm² area are essentially unchanged from first-gen Ryzen. The dual-die X-series models feature a total of 9.6 billion transistors and 426mm² of silicon, while the quad-die WX processors feature 19.2 billion transistors over 852mm².
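The package totals follow directly from the per-die figures. As a quick check, here is a minimal sketch that multiplies them out; the constant and function names are ours, chosen for illustration.

```python
# Per-die figures quoted in the text for a single Zeppelin die.
TRANSISTORS_PER_DIE_BILLION = 4.8
AREA_PER_DIE_MM2 = 213


def package_totals(active_dies):
    """Return (transistors in billions, silicon area in mm^2) for a package."""
    transistors = round(TRANSISTORS_PER_DIE_BILLION * active_dies, 1)
    area = AREA_PER_DIE_MM2 * active_dies
    return transistors, area


for name, dies in (("X-series (dual-die)", 2), ("WX-series (quad-die)", 4)):
    transistors, area = package_totals(dies)
    print(f"{name}: {transistors} billion transistors, {area} mm^2")
```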
Lower leakage current does enable 200 MHz-higher clock rates or an 80-120mV core voltage reduction at any given frequency compared to 14nm manufacturing. All told, AMD claims the 12nm design enables up to 11% less power consumption than 14nm-based Threadripper CPUs at the same clock rates, or up to 16% more performance at the same thermal design power. AMD also adds other nuanced refinements, like lower L1 (15%), L2 (9%), and L3 (8%) cache latencies, along with reduced memory latency (2%).
Threadripper 2990WX borrows from AMD's EPYC server designs and comes with four active dies. The company fused off the PCIe and memory controllers on two of the dies, leaving silicon that is only useful for compute. Meanwhile, the other two I/O-enabled dies each serve up two channels of DDR4 memory support and 32 lanes of PCIe 3.0.
Unfortunately, the compute dies suffer from increased latency on every request to main memory and PCIe-attached devices, as those requests always have to traverse the Infinity Fabric.
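To make the split concrete, the sketch below models the four dies as simple records and tallies what the package exposes. The die names and their ordering are hypothetical; only the per-die channel and lane counts come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Die:
    name: str             # hypothetical label, not AMD's numbering
    memory_channels: int  # DDR4 channels attached directly to this die
    pcie_lanes: int       # PCIe 3.0 lanes attached directly to this die

    @property
    def io_enabled(self):
        return self.memory_channels > 0 or self.pcie_lanes > 0


# Two I/O-enabled dies and two compute-only dies, as described in the text.
PACKAGE = [
    Die("die0", 2, 32),
    Die("die1", 0, 0),
    Die("die2", 2, 32),
    Die("die3", 0, 0),
]

total_channels = sum(d.memory_channels for d in PACKAGE)
total_lanes = sum(d.pcie_lanes for d in PACKAGE)
compute_only = [d.name for d in PACKAGE if not d.io_enabled]

print(f"{total_channels} memory channels, {total_lanes} PCIe lanes")
print("Dies that must cross the fabric for every memory or PCIe request:", compute_only)
```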
AMD added more Infinity Fabric channels to connect the two additional dies. Unfortunately, that has a tremendous impact on fabric bandwidth, which drops from 50 GB/s on a 16-core Threadripper 2950X to 25 GB/s in this implementation. AMD measured those figures with a 3200 MT/s memory data rate, so throughput at DDR4-2933 will be lower. Even with the benefits of tightly-controlled fabric scheduling magic, the combination of reduced bandwidth and 32 threads that must communicate over the fabric for I/O and memory requests has an impact on performance.
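For a rough idea of how much lower, the sketch below assumes fabric bandwidth scales linearly with the memory data rate. That linear relationship is our simplifying assumption, not a figure AMD has published, and the function name is ours.

```python
def scale_fabric_bandwidth(measured_bandwidth, measured_rate_mts, target_rate_mts):
    """Estimate fabric bandwidth at a different memory data rate.

    Assumes bandwidth is directly proportional to the data rate (a
    simplification). The result is in the same unit as measured_bandwidth.
    """
    return measured_bandwidth * target_rate_mts / measured_rate_mts


# Figures quoted at 3200 MT/s, rescaled to the DDR4-2933 setting.
for label, quoted in (("2950X", 50), ("2990WX", 25)):
    estimate = scale_fabric_bandwidth(quoted, 3200, 2933)
    print(f"{label}: {quoted} at 3200 MT/s -> about {estimate:.1f} at 2933 MT/s")
```

Under that assumption, the drop from 3200 MT/s to 2933 MT/s costs a little over 8% of fabric bandwidth on either chip.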
According to chip analyst David Schor at WikiChip, each request from a compute die has to pass through the Cache-Coherent Master (CCM), which then interfaces with the CAKE (Coherent AMD socKet Extender) module that encodes the request and sends it to the remote I/O die. The remote CAKE module then decodes the request, fetches the requested data via the UMC (Unified Memory Controller), and then encodes the data and transmits it back to the compute die.
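The round trip is easier to follow as an ordered list of hops. This sketch simply encodes the sequence described above and counts how often a request changes dies. The final decode step on the compute die is implied rather than stated, and the local path for a core on an I/O die is a simplified assumption added for comparison.

```python
# Each hop is (die, block, action), following the sequence described above.
REMOTE_READ_PATH = [
    ("compute", "CCM", "accept the core's request"),
    ("compute", "CAKE", "encode the request and send it across the fabric"),
    ("io", "CAKE", "decode the request"),
    ("io", "UMC", "fetch the data from DRAM"),
    ("io", "CAKE", "encode the data and send it back"),
    ("compute", "CAKE", "decode the returned data (implied, not stated)"),
]

# Simplified assumption: a core on an I/O die reaches its own UMC directly.
LOCAL_READ_PATH = [
    ("io", "CCM", "accept the core's request"),
    ("io", "UMC", "fetch the data from DRAM"),
]


def fabric_crossings(path):
    """Count how many times a request moves from one die to another."""
    return sum(1 for a, b in zip(path, path[1:]) if a[0] != b[0])


print("Remote read:", len(REMOTE_READ_PATH), "hops,",
      fabric_crossings(REMOTE_READ_PATH), "fabric crossings")
print("Local read: ", len(LOCAL_READ_PATH), "hops,",
      fabric_crossings(LOCAL_READ_PATH), "fabric crossings")
```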
Increased traffic and reduced fabric throughput will have a tangible impact on memory-hungry applications, leading to sub-par performance scaling under some conditions. Although Threadripper 2990WX is clearly aimed at the semi-professional market, configurations hosting multiple GPUs may slow down due to increased fabric latency and reduced throughput to remote PCIe lanes. That'd also affect the performance of PCIe-based M.2 storage and LAN devices connected to remote dies.
MSI's MEG Creation motherboard diagram provides a nice summary of the split connectivity between dies. Be mindful of new population rules, such as inserting the first GPU into PCIe slot four, along with custom M.2 recommendations. You need to populate all four DRAM channels or follow dual-channel population rules in order to realize maximum performance, as performance drops sharply in some dual-channel configurations due to the distributed design.
AMD carves the Threadripper 2990WX into four NUMA domains that cannot be altered. As such, the processor does not have a local memory toggle for its Game Mode feature. Instead, the processor simply flips into "1/4" mode, which disables all but one die and effectively creates an 8C/16T CPU. Ryzen Master also has "1/2" and "Off" options that expose 16 cores and 32 threads, or 32 cores and 64 threads.
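The three settings map cleanly onto active die counts. Here is a minimal sketch of that mapping, assuming eight cores per die with two threads per core; the dictionary and function names are ours, not Ryzen Master's.

```python
CORES_PER_DIE = 8
THREADS_PER_CORE = 2

# Ryzen Master setting -> number of dies left enabled on the 2990WX.
ACTIVE_DIES = {"1/4": 1, "1/2": 2, "Off": 4}


def exposed_resources(mode):
    """Return (cores, threads) visible to the OS for a given setting."""
    cores = ACTIVE_DIES[mode] * CORES_PER_DIE
    return cores, cores * THREADS_PER_CORE


for mode in ACTIVE_DIES:
    cores, threads = exposed_resources(mode)
    print(f"{mode:>3}: {cores}C/{threads}T")
```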
The company claims it could not enable the compute dies' memory and I/O controllers even partially without significantly overhauling the package's trace routing, requiring a new socket interface. AMD reps say they prioritized drop-in compatibility with the existing motherboard and cooler ecosystem, leading them to build Threadripper 2990WX the way it turned out.
AMD continues working with Microsoft to route threads to the dies with direct-attached memory first, and then spill remaining threads over to the compute dies. Unfortunately, the scheduler currently treats all dies as equal and assigns threads in round-robin fashion. As a result, even moderately-threaded applications can suffer from high memory latency and low throughput, a problem that thread migration complicates further. According to AMD, Microsoft has not committed to a timeline for updating its scheduler.
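A toy model shows why the placement policy matters. The sketch below compares a round-robin placement against a memory-first placement and counts how many threads land on compute dies. It is an illustration only, not how the Windows scheduler is implemented; the die numbering is hypothetical and thread migration is ignored.

```python
THREADS_PER_DIE = 16   # 8 cores x 2 threads
IO_DIES = (0, 2)       # hypothetical numbering: dies with direct-attached memory
COMPUTE_DIES = (1, 3)  # hypothetical numbering: compute-only dies


def round_robin(n_threads):
    """Place thread i on die i mod 4, treating all dies as equal."""
    all_dies = sorted(IO_DIES + COMPUTE_DIES)
    return [all_dies[i % len(all_dies)] for i in range(n_threads)]


def memory_first(n_threads):
    """Fill the I/O dies first, then spill the rest onto the compute dies."""
    io_capacity = THREADS_PER_DIE * len(IO_DIES)
    placement = []
    for i in range(n_threads):
        if i < io_capacity:
            placement.append(IO_DIES[i % len(IO_DIES)])
        else:
            placement.append(COMPUTE_DIES[(i - io_capacity) % len(COMPUTE_DIES)])
    return placement


def threads_on_compute_dies(placement):
    return sum(1 for die in placement if die in COMPUTE_DIES)


for n in (8, 16, 32, 48):
    rr = threads_on_compute_dies(round_robin(n))
    mf = threads_on_compute_dies(memory_first(n))
    print(f"{n:>2} threads: round-robin puts {rr} on compute dies, memory-first puts {mf}")
```

In this model, a 16-thread job ends up with half its threads on dies without local memory under round-robin placement, and none under memory-first placement.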
The Zeppelin Building Block
The Zen architecture employs a four-core CCX (CPU Complex) building block. Each CCX has 8MB of L3 cache split into four slices; each core in the CCX accesses all L3 slices with the same average latency. Two CCXes come together to create an eight-core Zeppelin die, and they communicate with each other via AMD’s Infinity Fabric. The CCXes share the same dual-channel memory controller. This is basically two quad-core CPUs talking to each other over the Infinity Fabric pathway that also handles northbridge and PCIe traffic.
Although each core in a four-core CCX can access the local cache with the same average latency, trips to fetch data in adjacent CCXes incur a latency penalty. Communication between threads on cores located in disparate CCXes also suffers.
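The hierarchy can be sketched as a small lookup that classifies how far apart two cores are. It assumes cores are numbered linearly (four per CCX, eight per die), which is a convenient simplification; the operating system may enumerate cores differently.

```python
CORES_PER_CCX = 4
CCX_PER_DIE = 2
CORES_PER_DIE = CORES_PER_CCX * CCX_PER_DIE


def locate(core_id):
    """Return (die index, CCX index within that die) for a core."""
    die = core_id // CORES_PER_DIE
    ccx = (core_id % CORES_PER_DIE) // CORES_PER_CCX
    return die, ccx


def relation(core_a, core_b):
    """Classify the path between two cores, from cheapest to most expensive."""
    die_a, ccx_a = locate(core_a)
    die_b, ccx_b = locate(core_b)
    if die_a != die_b:
        return "different die"
    if ccx_a != ccx_b:
        return "same die, different CCX"
    return "same CCX"


print(relation(0, 3))  # cores share an L3 cache
print(relation(0, 4))  # traffic crosses the Infinity Fabric on the die
print(relation(0, 8))  # traffic crosses the fabric between dies
```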
2950X Architecture & Game Mode
Threadripper 2950X mirrors the layout of AMD's first-gen Threadripper chips: two Zeppelin dies are connected via another layer of the Infinity Fabric. AMD flanks them with a pair of dummy dies that serve as non-functional fillers to ensure the heat spreader's structural integrity and consistent mating with the socket's pins.
Remember, each Zeppelin die has its own memory and PCIe controller. If a thread running on one core needs to access data resident in cache on another die, it has to traverse the fabric between those dies and incur significant latency. Naturally, the latency penalty between dies is higher than it is between CCXes in a single-die configuration. But AMD claims to have made some improvements there. The 2950X purportedly offers 64ns latency to near memory and 105ns to far memory, while the previous-gen 1950X had to wait 78ns and 133ns, respectively. As per usual, the speed of the Infinity Fabric is tied to the memory controller, so higher data rate settings are desirable. AMD measured Threadripper 2's fabric performance with a 3200 MT/s data rate, which means fabric latency at the recommended DDR4-2933 will be higher.
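Those claimed figures work out to the following generational gains. The sketch only does arithmetic on the numbers AMD quotes; the function name is ours.

```python
def improvement_pct(old_ns, new_ns):
    """Percentage reduction in latency going from old_ns to new_ns."""
    return (old_ns - new_ns) / old_ns * 100


# AMD's claimed latencies in nanoseconds: (1950X, 2950X).
CLAIMED = {"near memory": (78, 64), "far memory": (133, 105)}

for kind, (old, new) in CLAIMED.items():
    print(f"{kind}: {old}ns -> {new}ns ({improvement_pct(old, new):.1f}% lower)")

# Far-memory penalty relative to near memory on each chip.
print(f"1950X far/near ratio: {133 / 78:.2f}x")
print(f"2950X far/near ratio: {105 / 64:.2f}x")
```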
To combat the potential for performance regression as a result of its "go-wide" approach, AMD devised an interesting solution: it introduced a memory access switch that you can toggle via motherboard BIOS or the Ryzen Master software. The Local and Distributed settings flip between either NUMA (Non-Uniform Memory Access) or UMA (Uniform Memory Access), same as they did for AMD's first-gen Threadripper CPUs.
UMA (Distributed) is pretty simple; it allows both dies to access all of the attached memory. NUMA mode (Local) attempts to keep all data for the process executing on the die confined to its directly attached memory controller, establishing one NUMA node per die. The goal is to minimize requests to remote memory attached to the other die. NUMA works best if programs are designed specifically to utilize it. Even though most desktop PC software wasn't written with NUMA in mind, performance gains are still possible in non-NUMA applications.
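A simple weighted average illustrates the trade-off. The sketch below uses AMD's claimed 64ns near and 105ns far latencies for the 2950X and treats the share of local accesses as the only variable. It ignores caches, bandwidth limits, and contention, and the 50% local share used for Distributed mode is our assumption for memory interleaved evenly across both dies.

```python
def expected_latency_ns(local_fraction, near_ns=64.0, far_ns=105.0):
    """Average memory latency when local_fraction of accesses hit near memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * near_ns + (1.0 - local_fraction) * far_ns


scenarios = {
    "Distributed (assumed 50% local)": 0.5,
    "Local, NUMA-unaware app (assumed 90% local)": 0.9,
    "Local, ideal placement (100% local)": 1.0,
}

for label, fraction in scenarios.items():
    print(f"{label}: {expected_latency_ns(fraction):.1f}ns")
```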
AMD also allows you to disable cores in Legacy Compatibility mode, which disables one die via a Windows command. This allows some programs that won't function with 32 threads to execute properly, and it also eliminates cross-die communication. The system can still access I/O connected to the second die, though, so you don't lose any associated memory or attached peripherals.
AMD bundles these settings into two preset modes that generally offer the best performance in games and applications. Game Mode disables one die with the Legacy Compatibility mode, and then switches the 2950X into Local memory mode, effectively creating an 8C/16T CPU. Creator Mode uses the Distributed memory setting and disables Legacy Compatibility, providing access to Threadripper 2950X's full armament of 16 cores and 32 threads for demanding workloads.
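The two presets boil down to a pair of settings each. Here is a minimal sketch of that combination for the 2950X; the class and field names are hypothetical and do not correspond to any Ryzen Master API.

```python
from dataclasses import dataclass

TOTAL_CORES = 16       # Threadripper 2950X
THREADS_PER_CORE = 2


@dataclass(frozen=True)
class Profile:
    memory_mode: str            # "Local" (NUMA) or "Distributed" (UMA)
    legacy_compatibility: bool  # True disables one of the two dies


PROFILES = {
    "Game Mode": Profile(memory_mode="Local", legacy_compatibility=True),
    "Creator Mode": Profile(memory_mode="Distributed", legacy_compatibility=False),
}


def visible_cores_threads(profile):
    """Return (cores, threads) the OS sees under a given profile."""
    cores = TOTAL_CORES // 2 if profile.legacy_compatibility else TOTAL_CORES
    return cores, cores * THREADS_PER_CORE


for name, profile in PROFILES.items():
    cores, threads = visible_cores_threads(profile)
    print(f"{name}: {profile.memory_mode} memory, {cores}C/{threads}T")
```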