AMD claims RX 7900 XTX outperforms RTX 4090 in DeepSeek benchmarks

AMD RX 7900 XTX
(Image credit: AMD)

AMD has provided benchmarks of its flagship RX 7900 XTX going head to head against the Nvidia RTX 4090 and RTX 4080 Super with DeepSeek's AI model. According to David McAfee on X, the RDNA3-based GPU outperformed the RTX 4090 by up to 13% and the RTX 4080 Super by up to 34%.

AMD tested the three GPUs against multiple distilled DeepSeek R1 models at various parameter counts. The RX 7900 XTX saw its biggest victory over the RTX 4090 with DeepSeek R1 Distill Qwen 7B, where it outperformed the Ada Lovelace GPU by 13%. AMD also tested three other model configurations against the RTX 4090. The RX 7900 XTX outperformed the RTX 4090 in two of the three: it was 11% faster with Distill Llama 8B and 2% faster with Distill Qwen 14B. The RTX 4090 was 4% faster than the RX 7900 XTX in the remaining configuration, Distill Qwen 32B.

AMD tested three configurations against the RTX 4080 Super. The RX 7900 XTX led by 34% with DeepSeek R1 Distill Qwen 7B; the lead narrowed to 27% with Distill Llama 8B and 22% with Distill Qwen 14B.
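
Taking AMD's percentages at face value, the claimed advantage can be condensed into a geometric mean of relative throughput (with the Nvidia card normalized to 1.0). A quick sketch using only the figures above:

```python
import math

# AMD's claimed RX 7900 XTX advantage per model, taken from the figures
# above (positive = 7900 XTX faster, negative = Nvidia card faster).
VS_RTX_4090 = {
    "Distill Qwen 7B": +0.13,
    "Distill Llama 8B": +0.11,
    "Distill Qwen 14B": +0.02,
    "Distill Qwen 32B": -0.04,
}
VS_RTX_4080_SUPER = {
    "Distill Qwen 7B": +0.34,
    "Distill Llama 8B": +0.27,
    "Distill Qwen 14B": +0.22,
}

def geomean_advantage(deltas):
    """Geometric mean of relative throughput (Nvidia card = 1.0)."""
    ratios = [1.0 + d for d in deltas.values()]
    return math.prod(ratios) ** (1.0 / len(ratios))

print(f"vs RTX 4090:       {geomean_advantage(VS_RTX_4090):.3f}x")
print(f"vs RTX 4080 Super: {geomean_advantage(VS_RTX_4080_SUPER):.3f}x")
```

By this rollup, AMD's numbers amount to roughly a 5% average claimed lead over the RTX 4090 and roughly 28% over the RTX 4080 Super across the tested models.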

This should all be taken with a pinch of salt, of course, as we can't be sure how the Nvidia GPUs were configured for the tests (which, again, were run by AMD). Not all AI workloads take advantage of a GPU's full computational throughput. We saw this in our Stable Diffusion testing, where the software used neither FP8 calculations nor Nvidia's TensorRT code path for processing.

It's not common for the RX 7900 XTX to be used as a dedicated AI processor, but the RDNA 3 architecture it is based on is more than capable of AI workloads, supporting matrix operations in BF16 and INT8 formats. AMD officially added the "AI Accelerator" terminology to RDNA 3 to demonstrate its AI-processing prowess; the RX 7900 XTX features 192 AI Accelerators.
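
As a loose, CPU-side illustration of what INT8 support means in practice (this is plain Python, not WMMA or any AMD code path): symmetric 8-bit quantization stores weights in a quarter of the FP32 footprint while keeping the round-trip error bounded by half a quantization step.

```python
# Illustrative only: symmetric per-tensor INT8 quantization, the kind of
# low-precision format RDNA 3's matrix instructions can operate on.

def quantize_int8(values):
    """Map floats to int8 range [-127, 127] with a single scale factor."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.52, -1.27, 0.003, 0.98, -0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error is bounded by half the quantization step (scale / 2).
max_err = max(abs(w - r) for w, r in zip(weights, restored))
assert max_err <= scale / 2 + 1e-12
print(q, scale, max_err)
```

The trade-off is the usual one: lower precision shrinks memory traffic and raises achievable TOPS, at the cost of a small, bounded rounding error per weight.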

AMD recently published a tutorial on how its customers can get DeepSeek R1 to run on compatible AMD consumer-based hardware, including the RX 7900 XTX. DeepSeek R1 is a new AI model that offers performance comparable to Western leading-edge AI models, but at a fraction of the computing cost. DeepSeek R1 uses an assortment of hardware-based optimizations to make its model run 11X faster than its competitors, including using Nvidia's assembly-like PTX programming language.
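
If you do run one of these distilled models locally, a rough ceiling for single-stream token generation can be estimated from memory bandwidth alone, since decoding typically streams every weight from VRAM once per token. A back-of-envelope sketch; the bandwidth figures are approximate public specs, and the quantized model sizes are rough assumptions of ours, not AMD's test configuration:

```python
# Back-of-envelope upper bound on tokens/sec for bandwidth-bound decoding:
# each generated token requires streaming the full weight set from VRAM once.
# Bandwidths are approximate public specs; model sizes assume a ~4-bit
# quantization and are rough estimates, not AMD's actual configuration.

GPU_BANDWIDTH_GBS = {
    "RX 7900 XTX": 960,
    "RTX 4090": 1008,
}

MODEL_SIZE_GB = {  # assumed on-disk sizes at ~4-bit quantization
    "Distill Qwen 7B": 4.4,
    "Distill Llama 8B": 4.9,
    "Distill Qwen 14B": 9.0,
    "Distill Qwen 32B": 19.0,
}

def max_tokens_per_sec(gpu, model):
    """Theoretical ceiling: bandwidth divided by bytes read per token."""
    return GPU_BANDWIDTH_GBS[gpu] / MODEL_SIZE_GB[model]

for model in MODEL_SIZE_GB:
    xtx = max_tokens_per_sec("RX 7900 XTX", model)
    ada = max_tokens_per_sec("RTX 4090", model)
    print(f"{model}: ~{xtx:.0f} vs ~{ada:.0f} tok/s ceiling")
```

Because the two flagships differ by only about 5% in raw bandwidth, near-parity in bandwidth-bound inference is plausible regardless of vendor; real-world throughput depends heavily on the software stack and will sit well below these ceilings.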

Aaron Klotz
Contributing Writer

Aaron Klotz is a contributing writer for Tom’s Hardware, covering news related to computer hardware such as CPUs and graphics cards.

  • gg83
    Amd just trying to pump their stock a little.
    Reply
  • Neilbob
    Our AI thing AIs more AI than the AI thing of the other AI with this AI.

    I'm just so very weary of everything. And AI - I'm weary of that too.
    Reply
  • phxrider
    In other news, Congress begins talks on embargoing exports of the 7900XTX to China..... 😂
    Maybe this illustrates the differences when software is explicitly written for one or the other? I'm pretty sure most games are written for Nvidia seeing as they own something like 85% of the market, except there are a few AMD sponsored games that do much better on AMD. Could this be the same effect? It's not too hard to figure out the Chinese might try writing this for AMD since the 4090 is embargoed, and 7900XTX is not.
    Reply
  • Makaveli
    This should all be taken with a pinch of salt, of course, as we can't be sure how the Nvidia GPUs were configured for the tests (which, again, were run by AMD).
    I was testing this in LM Studio last week, in the LM Studio Discord with a 4090 user. No grain of salt needed; it's been verified.

    On the 7B, 8B, and 14B models the XTX is faster. The 4090 is a little faster on the 32B model, by about 4%.
    Reply
  • systemBuilder_49
    I can hear the howls of anguish as all the NVidia-buyers LOSE THEIR MINDS over this fact ...
    Reply
  • bit_user
    The article said:
    This should all be taken with a pinch of salt, of course, as we can't be sure how the Nvidia GPUs were configured for the tests (which, again, were run by AMD).
    It's plausible, since the 7900 XTX has about the same memory bandwidth as the RTX 4090 and better bandwidth from L2 and L3 caches. So, if inferencing these models is bandwidth-limited and not compute-bound, then I could believe the 7900 XTX is holding its own against that GPU.

    I didn't find an official number for how many TOPS the 7900 XTX is good for, but the number 123 did pop up. That's only 37% of Nvidia's dense TOPS (and halve that ratio again for matrices with optimal sparsity).

    Source: https://chipsandcheese.com/p/microbenchmarking-nvidias-rtx-4090

    Source: https://chipsandcheese.com/p/microbenchmarking-amds-rdna-3-graphics-architecture

    The article said:
    The RDNA 3 architecture the RX 7900 XTX is based on is capable of matrix operations, supporting BF16 and INT8.
    It turns out that the WMMA instructions in RDNA 3 are simply microcoded operations that utilize the same vector pipelines as normal shader arithmetic. So, RDNA 3 does not have something akin to Nvidia's Tensor cores in its client GPUs (the CDNA-based server chips do have dedicated Matrix units, however).
    Reply
  • Amdlova
    AMD trying to profit... US authorities will sign off on blocking the AMD GPU from the Chinese market. Good move, AMD.
    Reply
  • nogaard777
    Why does AMD do this to themselves when it can be proven false so easily?
    Reply
  • Cooe
    bit_user said:
    It's plausible, since the 7900 XTX has about the same memory bandwidth as the RTX 4090 and better bandwidth from L2 and L3 caches. So, if inferencing these models is bandwidth-limited and not compute-bound, then I could believe the 7900 XTX is holding its own against that GPU.

    I didn't find an official number indicating how many TOPS the 7900 XTX is good for, but the number 123 did pop up. This is only 37% as much as the amount of dense TOPS as Nvidia (and halve that, for matrices with optimal sparsity).

    Source: https://chipsandcheese.com/p/microbenchmarking-nvidias-rtx-4090

    Source: https://chipsandcheese.com/p/microbenchmarking-amds-rdna-3-graphics-architecture


    It turns out that the WMMA instructions in RDNA 3 are simply microcoded operations that utilize the same vector pipelines as normal shader arithmetic. So, RDNA 3 does not have something akin to Nvidia's Tensor cores in its client GPUs (the CDNA-based server chips do have dedicated Matrix units, however).
    The last bit is mostly semantics, as matrix workloads run on Nvidia's Tensor Cores don't also use the SM's standard ALUs at the same time. And AMD added dedicated ASIC hardware to the CUs themselves to run matrix operations via WMMA. It's not just using the normal FP32 pipes; it's more complicated than that. It's still dedicated AI/matrix math hardware, just not completely split apart as totally separate "cores" a la Nvidia post-Turing or CDNA.

    Aka, when matrix operations are being calculated on either architecture it's being done (at least partially) on dedicated fixed-function matrix hardware; it's just that AMD put said hardware inside their standard CU's (which ofc does necessitate less capable functionality) whereas Nvidia broke it out into its own, totally separate unit.

    AMD's method of running matrix operations on the standard CU's (basically having them pull double duty) instead of using fully dedicated matrix cores saves SIGNIFICANT die-space, but ofc at the cost of peak performance (which tbh makes perfect sense for an architecture aimed at gamers & mainstream consumers 🤷).
    Reply
  • bit_user
    Cooe said:
    The last bit is mostly semantics as matrix workloads run on Nvidia's Tensor Cores don't also use the SM's standard ALU's at the same time.
    Pretty sure that's not true. I think Tensor cores are a separate pipeline from their vector pipes, hence you should be able to overlap vector ops with Tensor ops. Confirmed here:
    https://forums.developer.nvidia.com/t/overlapping-cuda-cores-and-tensor-cores/288774
    Cooe said:
    And AMD's added dedicated ASIC hardware to the CU's themselves to run matrix operations via WMMA. It's not just using the normal FP32 pipes, it's more complicated than it.
    That's not how I read it. To me, it sounds like WMMA uses their vector pipeline, but simply short-circuits the VGPR by using dedicated storage for intermediates. This means you can't overlap WMMA and other instructions, in the same CU.
    Reply