Ryzen Threadripper 2 (2990WX and 2950X) Review: AMD Unleashes 32 Cores

Architecture, NUMA & Game Mode

It Starts With 12nm LP

AMD's Threadripper 2 processors are manufactured on GlobalFoundries' 12nm LP process technology. The ported-over design helps boost transistor performance, but does not affect die area or transistor density. As a result, the Zeppelin die's ~4.8 billion transistors and 213 mm² area remain the same as in first-gen Ryzen. The dual-die X-series models feature a total of 9.6 billion transistors and 426 mm² of silicon, while the quad-die WX processors feature 19.2 billion transistors over 852 mm².
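The package totals above follow directly from the per-die figures. As a quick sanity check, here is a minimal sketch that multiplies them out; the function and constant names are purely illustrative.

```python
# Per-die figures quoted in the text for a single Zeppelin die.
TRANSISTORS_PER_DIE_BN = 4.8  # billions (approximate)
AREA_PER_DIE_MM2 = 213        # square millimeters


def package_totals(active_dies):
    """Return (transistors in billions, silicon area in mm^2) for a package."""
    transistors = round(TRANSISTORS_PER_DIE_BN * active_dies, 1)
    area = AREA_PER_DIE_MM2 * active_dies
    return transistors, area


for label, dies in (("X-series (dual-die)", 2), ("WX-series (quad-die)", 4)):
    transistors, area = package_totals(dies)
    print(f"{label}: {transistors} billion transistors, {area} mm^2")
```

The dummy dies on the dual-die models are not counted, since they contain no functional silicon.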

Lower leakage current does enable 200 MHz-higher clock rates or an 80-120mV core voltage reduction at any given frequency compared to 14nm manufacturing. All told, AMD claims the 12nm design enables up to 11% less power consumption than 14nm-based Threadripper CPUs at the same clock rates, or up to 16% more performance at the same thermal design power. AMD also adds other nuanced refinements, like lower L1 (15%), L2 (9%), and L3 (8%) cache latencies, along with reduced memory latency (2%).

2990WX Architecture

Threadripper 2990WX borrows from AMD's EPYC server designs and comes with four active dies. The company fused off the PCIe and memory controllers on two of the dies, leaving silicon that is useful only for compute. Meanwhile, the other two I/O-enabled dies serve up two channels of DDR4 memory support and 32 lanes of PCIe 3.0 each.

Unfortunately, the compute dies suffer from increased latency on every request to main memory and PCIe-attached devices, as those requests always have to traverse the Infinity Fabric.

AMD added more Infinity Fabric channels to connect two more dies. Unfortunately, that has a tremendous impact on fabric bandwidth, which drops from 50 GB/s on a 16-core Threadripper 2950X to 25 GB/s in this implementation. AMD measured those figures with a 3200 MT/s memory data rate, so throughput at DDR4-2933 will be lower. Even with the benefit of tightly controlled fabric scheduling, the combination of reduced bandwidth and 32 threads that must communicate over the fabric for I/O and memory requests has an impact on performance.
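To get a rough feel for how much lower, the sketch below scales the quoted fabric figures by the ratio of memory data rates. Linear scaling with the data rate is an assumption made here for illustration (AMD did not publish DDR4-2933 numbers in this material), and the function name is hypothetical. Results are in the same unit as the quoted figure.

```python
def scale_fabric_bandwidth(quoted, quoted_rate_mts, actual_rate_mts):
    """Estimate fabric bandwidth at a different memory data rate.

    Assumes bandwidth scales linearly with the memory data rate, which is
    a simplification; the result uses the same unit as `quoted`.
    """
    return quoted * actual_rate_mts / quoted_rate_mts


for chip, quoted in (("2950X", 50), ("2990WX", 25)):
    estimate = scale_fabric_bandwidth(quoted, 3200, 2933)
    print(f"{chip}: {quoted} at 3200 MT/s -> about {estimate:.1f} at 2933 MT/s")
```

Under that assumption, the drop from DDR4-3200 to DDR4-2933 costs a little over 8% of fabric throughput on either chip.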

According to chip analyst David Schor at WikiChip, each request from a compute die must go through the Cache-Coherent Master (CCM), which then interfaces with the CAKE (Coherent AMD socKet Extender) module that encodes the request and sends it to the remote I/O die. The remote CAKE module then decodes the request, fetches the requested data via the UMC (Unified Memory Controller), encodes the data, and transmits it back to the compute die.
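The round trip described above can be laid out as an ordered list of hops. This is only an illustrative model: the remote path follows the description in the text, while the local path is an assumed simplification added for contrast, and the helper name is hypothetical.

```python
FABRIC_HOP = "Infinity Fabric link between dies"

# Memory read issued by a core on a compute die (per the description above).
REMOTE_READ_PATH = [
    "CCM on compute die receives the request",
    "CAKE on compute die encodes the request",
    FABRIC_HOP,
    "CAKE on I/O die decodes the request",
    "UMC on I/O die fetches the data",
    "CAKE on I/O die encodes the data",
    FABRIC_HOP,
    "data arrives back at the compute die",
]

# Assumed simplification: a core on an I/O die reaches its own UMC on-die.
LOCAL_READ_PATH = [
    "CCM on I/O die receives the request",
    "UMC on I/O die fetches the data",
]


def fabric_crossings(path):
    """Count how many times a request crosses the die-to-die fabric."""
    return sum(1 for hop in path if hop == FABRIC_HOP)


print("remote read:", fabric_crossings(REMOTE_READ_PATH), "fabric crossings")
print("local read: ", fabric_crossings(LOCAL_READ_PATH), "fabric crossings")
```

Every memory access from a compute die pays for two fabric crossings plus the encode and decode steps, which is where the extra latency comes from.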

Increased traffic and reduced fabric throughput will have a tangible impact on memory-hungry applications, leading to sub-par performance scaling under some conditions. Although Threadripper 2990WX is clearly aimed at the semi-professional market, configurations hosting multiple GPUs may slow down due to increased fabric latency and reduced throughput to remote PCIe lanes. That'd also affect the performance of PCIe-based M.2 storage and LAN devices connected to remote dies.

MSI's MEG Creation motherboard diagram provides a nice summary of the split connectivity between dies. And be mindful of new population rules, such as inserting the first GPU into PCIe slot four, along with custom M.2 recommendations. You need to populate all four DRAM channels or follow dual-channel population rules in order to realize maximum performance, as performance drops sharply in some dual-channel configurations due to the distributed design.

AMD carves the Threadripper 2990WX into four NUMA domains that cannot be altered. As such, the processor does not have a local memory toggle for its Game Mode feature. Instead, the processor simply flips into "1/4" mode, which disables all but one die and effectively creates an 8C/16T CPU. Ryzen Master also has "1/2" and "Off" options that expose 16 cores and 32 threads, or 32 cores and 64 threads, respectively.
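The core and thread counts for each setting fall out of the number of active dies. The sketch below treats each option as a count of enabled dies, which is an assumption consistent with the figures in the text; the mapping and function name are illustrative and not part of Ryzen Master.

```python
CORES_PER_DIE = 8      # one Zeppelin die
THREADS_PER_CORE = 2   # SMT

# Assumed number of active dies behind each 2990WX setting.
ACTIVE_DIES = {"1/4": 1, "1/2": 2, "Off": 4}


def exposed_resources(mode):
    """Return (cores, threads) visible to the OS for a given setting."""
    cores = ACTIVE_DIES[mode] * CORES_PER_DIE
    return cores, cores * THREADS_PER_CORE


for mode in ACTIVE_DIES:
    cores, threads = exposed_resources(mode)
    print(f'"{mode}": {cores}C/{threads}T')
```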

The company claims it could not enable the compute dies' memory and I/O controllers even partially without significantly overhauling the package's trace routing, requiring a new socket interface. AMD reps say they prioritized drop-in compatibility with the existing motherboard and cooler ecosystem, leading them to build Threadripper 2990WX the way it turned out.

AMD continues working with Microsoft to route threads to the die with direct-attached memory first, and then spill remaining threads over to the compute dies. Unfortunately, the scheduler currently treats all dies as equal, operating in Round Robin mode. As a result, even moderately-threaded applications can suffer at the hands of high memory latency and low throughput. This is further complicated by thread migration. According to AMD, Microsoft has not committed to a timeline for updating its scheduler.
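To see why the placement policy matters, the toy model below compares a round-robin policy with a "memory-attached dies first" policy. It is not the Windows scheduler's algorithm. The die names, their ordering, and both placement functions are hypothetical, and thread migration is ignored.

```python
# Assumed layout: two dies with direct-attached memory, two compute-only dies.
DIES = [("io0", True), ("compute0", False), ("io1", True), ("compute1", False)]
THREADS_PER_DIE = 16  # 8 cores x 2 SMT threads per die
MAX_THREADS = THREADS_PER_DIE * len(DIES)


def _check(n_threads):
    if not 0 <= n_threads <= MAX_THREADS:
        raise ValueError(f"n_threads must be between 0 and {MAX_THREADS}")


def round_robin(n_threads):
    """Deal threads across all dies in turn, ignoring memory locality."""
    _check(n_threads)
    placement = {name: 0 for name, _ in DIES}
    for i in range(n_threads):
        placement[DIES[i % len(DIES)][0]] += 1
    return placement


def local_first(n_threads):
    """Fill memory-attached dies first, then spill onto compute dies."""
    _check(n_threads)
    placement = {name: 0 for name, _ in DIES}
    remaining = n_threads
    # sorted() is stable, so dies with memory come first in original order.
    for name, _ in sorted(DIES, key=lambda die: not die[1]):
        placement[name] = min(remaining, THREADS_PER_DIE)
        remaining -= placement[name]
    return placement


def on_compute_dies(placement):
    """Number of threads that ended up on dies without local memory."""
    return sum(placement[name] for name, has_memory in DIES if not has_memory)


for policy in (round_robin, local_first):
    print(policy.__name__, on_compute_dies(policy(16)), "of 16 threads on compute dies")
```

In this model, a 16-thread workload lands half of its threads on compute dies under round robin, while the local-first policy keeps all of them next to a memory controller.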

The Zeppelin Building Block

The Zen architecture employs a four-core CCX (CPU Complex) building block. Each CCX has 8MB of L3 cache split into four slices; each core in the CCX accesses all L3 slices with the same average latency. Two CCXes come together to create an eight-core Zeppelin die, and they communicate with each other via AMD’s Infinity Fabric. The CCXes share the same dual-channel memory controller. This is basically two quad-core CPUs talking to each other over the Infinity Fabric pathway that also handles northbridge and PCIe traffic.

Although each core in a four-core CCX can access the local cache with the same average latency, trips to fetch data in adjacent CCXes incur a latency penalty. Communication between threads on cores located in disparate CCXes also suffers.
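A small sketch of that topology makes the distinction concrete. The core numbering here (cores 0-3 in the first CCX, cores 4-7 in the second) is an assumption for illustration; actual enumeration depends on the operating system and firmware.

```python
CORES_PER_CCX = 4
CCX_PER_DIE = 2
CORES_PER_DIE = CORES_PER_CCX * CCX_PER_DIE


def ccx_of(core):
    """Return the index of the CCX that holds a core on one Zeppelin die."""
    if not 0 <= core < CORES_PER_DIE:
        raise ValueError(f"core must be between 0 and {CORES_PER_DIE - 1}")
    return core // CORES_PER_CCX


def shares_l3(core_a, core_b):
    """True if both cores sit in the same CCX and therefore share an L3."""
    return ccx_of(core_a) == ccx_of(core_b)


print(shares_l3(0, 3))  # same CCX, uniform L3 latency
print(shares_l3(3, 4))  # different CCXes, traffic crosses the Infinity Fabric
```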

2950X Architecture & Game Mode

Threadripper 2950X mirrors the layout of AMD's first-gen Threadripper chips: two Zeppelin dies are connected via another layer of the Infinity Fabric. AMD flanks them with a pair of dummy dies that serve as non-functional fillers to ensure the heat spreader's structural integrity and consistent mating with the socket's pins.

Remember, each Zeppelin die has its own memory and PCIe controller. If a thread running on one core needs to access data resident in cache on another die, it has to traverse the fabric between those dies and incur significant latency. Naturally, the latency penalty between dies is higher than it is between CCXes in a single-die configuration. But AMD claims to have made some improvements there. The 2950X purportedly offers 64ns latency to near memory and 105ns to far memory, while the previous-gen 1950X had to wait 78ns and 133ns, respectively. As usual, the speed of the Infinity Fabric is tied to the memory clock, so higher data rate settings are desirable. AMD measured Threadripper 2's fabric performance with a 3200 MT/s data rate, which means fabric latency at the recommended DDR4-2933 will be higher.
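Put in relative terms, those AMD-supplied figures work out as follows. This is a minimal sketch using only the numbers quoted above; the helper name is illustrative.

```python
# (1950X latency, 2950X latency) in nanoseconds, as claimed by AMD.
LATENCY_NS = {"near memory": (78, 64), "far memory": (133, 105)}


def reduction_pct(old, new):
    """Percentage by which `new` is lower than `old`."""
    return (old - new) / old * 100


for kind, (old, new) in LATENCY_NS.items():
    print(f"{kind}: {old} ns -> {new} ns ({reduction_pct(old, new):.1f}% lower)")
```

That is roughly an 18% reduction for near memory and a 21% reduction for far memory.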

To combat the potential for performance regression as a result of its "go-wide" approach, AMD devised an interesting solution: it introduced a memory access switch that you can toggle via motherboard BIOS or the Ryzen Master software. The Local and Distributed settings flip between either NUMA (Non-Uniform Memory Access) or UMA (Uniform Memory Access), same as they did for AMD's first-gen Threadripper CPUs.

UMA (Distributed) is pretty simple; it allows both dies to access all of the attached memory. NUMA mode (Local) attempts to keep all data for the process executing on the die confined to its directly attached memory controller, establishing one NUMA node per die. The goal is to minimize requests to remote memory attached to the other die. NUMA works best if programs are designed specifically to utilize it. Even though most desktop PC software wasn't written with NUMA in mind, performance gains are still possible in non-NUMA applications.
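A simple way to reason about the two modes is to weight AMD's claimed 2950X latencies (64ns near, 105ns far) by the share of accesses that stay local. The model below is a deliberate simplification: it ignores caches and bandwidth, and the even split used for Distributed mode is an assumption rather than a measured value.

```python
NEAR_NS = 64.0   # AMD's claimed 2950X near-memory latency
FAR_NS = 105.0   # AMD's claimed 2950X far-memory latency


def expected_latency_ns(local_fraction):
    """Average memory latency given the fraction of accesses served locally."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * NEAR_NS + (1.0 - local_fraction) * FAR_NS


# Distributed (UMA): assume accesses are spread evenly across both dies.
print("Distributed:", expected_latency_ns(0.5), "ns")
# Local (NUMA): best case, every access hits directly attached memory.
print("Local, ideal:", expected_latency_ns(1.0), "ns")
```

Real workloads in Local mode fall somewhere between those two values, depending on how much of their data ends up on the other die.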

AMD also allows you to disable cores in Legacy Compatibility mode, which disables one die via a Windows command. This allows some programs that won't function with 32 threads to execute properly, and it also eliminates cross-die communication. The system can still access I/O connected to the second die, though, so you don't lose any associated memory or attached peripherals.

AMD bundles these settings into two preset modes that generally offer the best performance in games and applications. Game Mode disables one die with the Legacy Compatibility mode, and then switches the 2950X into Local memory mode, effectively creating an 8C/16T CPU. Creator mode uses the Distributed memory setting and disables Legacy Compatibility, providing access to Threadripper 2950X's full armament of 16 cores and 32 threads for demanding workloads.
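The two presets can be summarized as a pair of settings each. The sketch below encodes them as described in the text; the class and field names are illustrative and do not correspond to any Ryzen Master or BIOS interface.

```python
from dataclasses import dataclass

DIES = 2              # active Zeppelin dies on the 2950X
CORES_PER_DIE = 8
THREADS_PER_CORE = 2  # SMT


@dataclass(frozen=True)
class Mode:
    name: str
    legacy_compatibility: bool  # True disables one die
    memory_mode: str            # "Local" (NUMA) or "Distributed" (UMA)

    def cores_threads(self):
        """Return (cores, threads) exposed to the OS in this mode."""
        active_dies = DIES - 1 if self.legacy_compatibility else DIES
        cores = active_dies * CORES_PER_DIE
        return cores, cores * THREADS_PER_CORE


GAME_MODE = Mode("Game Mode", legacy_compatibility=True, memory_mode="Local")
CREATOR_MODE = Mode("Creator Mode", legacy_compatibility=False, memory_mode="Distributed")

for mode in (GAME_MODE, CREATOR_MODE):
    cores, threads = mode.cores_threads()
    print(f"{mode.name}: {mode.memory_mode} memory, {cores}C/{threads}T")
```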

50 comments
  • Rdslw
    first table is broken 32/64 cores/threads :)
                       Ryzen Threadripper 2990WX    Ryzen Threadripper 2950X
      Socket           TR4                          TR4
      Cores / Threads  16 / 32                      16 / 32
  • bilazaurus
    AMD Ryzen Threadripper 2. Your first choice for encoding!*

    *And nothing else.
  • philipemaciel
    Wow, while the 2990WX is a bit of a letdown, the 2950X is a nice surprise. Plenty of bang for your buck!
  • TEAMSWITCHER
    Avoid the flagship, buy the $900 part. Sounds a lot like Intel.
  • alves.mvc
    Why did Tom's Hardware stop using the HPC benchmark? It was the most interesting measurement for me, since I work daily with finite differences and finite elements. Can you return to that?
  • totaldarknessincar
    Seems to me the best of both worlds continue to be Intel's 7900x which sells for $699 at microcenter. You get great gaming performance, and great multithreaded performance, and it's not 12-1800 bucks as some of these mega-threaded cards are.

    Despite all the fan-fare, it seems the 7980xe actually remains the best processor when overclocked overall.

    Lastly for gaming, it's still 8700K or 8086 as best, with the 2700x from AMD being the best when you factor gaming and some multi-threaded stuff, while being very competitive price wise.
  • feelinfroggy777
    Very surprising performance from the 2950x. Almost enough to consider parting ways with my 1950x. Maybe when the pricing comes down some from the 2950x in a few months I will consider.

    The 2990wx on the other hand is a slight let down. Too bad they could not get the scaling down between the dies like they did with Threadripper 1. But I have read that was going to be an issue. Maybe AMD did not want the 2990wx to cannibalize their Epyc market.

    With that being said, the 2990wx is still a modern marvel of technology, even more so when you consider the price. Only couple of years ago a CPU with less than a third of the cores cost just as much.

    Competition sure is grand!
  • basil.thomas
    Looks like Intel has an opportunity to bite AMD when they release their 28-core processor. I have a threadripper 2/x399 system but if I upgrade to the 2990wx, I will also upgrade the motherboard and the power supply as well. I think I may wait until the Intel 28 core comes out and see what kind of performance it delivers as I too notice running custom AI apps on the threadripper is barely faster than my old x99/6850 motherboard overclocked @ 4.3Ghz. I want max performance if I am going to pay over $1800 for the flagship which means core wars is just starting...

    MOD EDIT: watch your profanity
  • ffleader1
    109245 said:
    Seems to me the best of both worlds continue to be Intel's 7900x which sells for $699 at microcenter. You get great gaming performance, and great multithreaded performance, and it's not 12-1800 bucks as some of these mega-threaded cards are. Despite all the fan-fare, it seems the 7980xe actually remains the best processor when overclocked overall. Lastly for gaming, it's still 8700K or 8086 as best, with the 2700x from AMD being the best when you factor gaming and some multi-threaded stuff, while being very competitive price wise.

    Seems to me that you are mistaking best of both worlds for jack of all trades. No one who takes rendering seriously would want to sacrifice the performance for gaming. For that price, they may as well grab a 1950X. Sure you lose in gaming, but gain a huge jump in rendering. Also, I don't know about Microcenter, but it's still 1k on Amazon while 1950X is $850. 7900X is like a really really bad choice lol.
  • g-unit1111
    Wow, 32 cores for $1,000? I have to say very impressive. Your move, Intel!
  • akamateau
    Chess professionals and Computer Chess enthusiasts are going to eat this up.
  • timf79
    What is the estimated "street" availability of the 2950x
  • Vladimir Iliev
    I'm a bit disappointed again with Toms - why the Intel parts are overclocked where the AMD only on PBO and why this benchmark starts with gaming?!? Is this some kind of a joke or it's intel sponsored article?
  • Jo_7__
    Not sure where the guy looked when he said Intel is the best of both worlds for gaming and multithreaded work. Was he drunk while reading the graphs, or did he just jump straight to the comments?

    Don't just look at the 2990WX. The TR 2950X is the best all-around CPU. If you want to compare price to performance, in gaming the 2950X is more or less equal to the 8700K in minimum FPS and only loses 3 to 10 FPS in average FPS. In multithreaded work, the 2950X blows away anything in that price range, including the 7900X. Read the article again.
  • redgarl
    The main thing I am noticing is that the current benchmarking suite is becoming obsolete for these kinds of CPUs.

    One thing for sure is the next generation TR on 7nm will really help the 32 cores setup since it will probably use only a single Infinity Fabric Link and direct memory access, but I understand how much more expensive that CPU would have been and not really for that much more performances after all in what the intended field of work is.
  • feelinfroggy777
    2146959 said:
    I'm a bit disappointed again with Toms - why the Intel parts are overclocked where the AMD only on PBO and why this benchmark starts with gaming?!? Is this some kind of a joke or it's intel sponsored article?


    In most cases, PBO will have less than 1% difference in performance than overclocking the CPU to 4.15. They provided Intel overclock information because they already had those OCs from previous reviews. Just like they showed the OCs from 1st gen Threadripper parts.

    It takes a long time to conduct a thorough review, let alone 2 products and it is not like they have had this chip for a month. It probably came in last week when the unboxing videos were released. Then there was also the part where they said that they would have a more in depth article about overclocking performance.
  • TJ Hooker
    From page 9:
    Quote:
    The Ryzen line-up dominates the multi-core Cinebench and POV-Ray tests, but the 2990WX only provides a 35% speed over the 2950X in the POV-Ray benchmark. In light of its 100% increase in cores, that doesn’t represent the best scaling performance. We see better scaling from the 2990WX in the Cinebench test with a 66% performance improvement.

    For POV-Ray lower scores (times) are better. That means that a 100% core increase should theoretically reduce (improve) time by 50%. If we look at the actual time reduction of 35%, we get 0.35/0.5 = 70% of max theoretical scaling. In Cinebench, where higher scores are better, we would expect a 100% improvement in score but only get 66%, meaning we only get 66% of max theoretical scaling. The scaling in POV-Ray is actually slightly better than in Cinebench.
  • vortex240
    Please stop the autoplay videos on this site. Would you like if someone shoved **** in your face repeatedly?

    <Mod Edit- Watch the Language>
  • logainofhades
    I think AMD is a bit ahead of its time with the 32 core part. Software really isn't ready for that kind of horsepower, just yet. Competition is great though. :D
  • Rexer
    Wow. I'm almost finished building an 8700k and now I don't feel like completing it.
  • Giroro
    Man, I've never met anybody who uses Cinebench, nor do I have any idea what it does... But I bet whoever makes that software is super psyched flagship processors are being designed to be amazing at running it, for some reason.

    So where's this 2950x review that is supposed to compare the different modes it runs in?
  • mitch074
    I can't believe someone managed to say that a 7th gen quad core is better than any of these... Now though, the 32-core sees too much diminishing return to really be useful, but the 2950 really is the sweet spot for this platform.
  • mitch074
    If I'm not mistaken, cinebench uses the same engine as Cinema 4D - so it should be a good indicator of how good a CPU is for that software
  • Gillerer
    2146959 said:
    I'm a bit disappointed again with Toms - why the Intel parts are overclocked where the AMD only on PBO and why this benchmark starts with gaming?!? Is this some kind of a joke or it's intel sponsored article?


    Probably because manual overclocking on Zen+ based Ryzen CPUs is pointless unless all your important applications are heavily threaded.

    Since a manual overclock and voltages are decided on a fully threaded workload, it results in comparatively bad performance - lower than stock - in lightly threaded applications (on Zen+ with its advanced boosting). The Threadripper processors have such a high number of cores that the performance deficit is exacerbated.

    Instead of overclocking manually, on Zen+ you should instead enable PBO, then lower the CPU voltage using offset. This lowers temperatures and power use, allowing the CPU to boost even higher.