AMD claims RX 7900 XTX outperforms RTX 4090 in DeepSeek benchmarks

AMD RX 7900 XTX
(Image credit: AMD)

AMD has provided benchmarks of its flagship RX 7900 XTX going head to head against the Nvidia RTX 4090 and RTX 4080 Super with DeepSeek's AI model. According to David McAfee on X, the RDNA3-based GPU outperformed the RTX 4090 by up to 13% and the RTX 4080 Super by up to 34%.

AMD tested the three GPUs against multiple distilled DeepSeek R1 models at various parameter counts. The RX 7900 XTX saw its biggest victory over the RTX 4090 with DeepSeek R1 Distill Qwen 7B, where it outperformed the Ada Lovelace GPU by 13%. AMD also tested three other model configurations against the RTX 4090. The RX 7900 XTX outperformed the RTX 4090 in two of the three: it was 11% faster with Distill Llama 8B and 2% faster with Distill Qwen 14B. The RTX 4090 was 4% faster than the RX 7900 XTX in the remaining configuration, Distill Qwen 32B.

AMD tested three configurations against the RTX 4080 Super. The RX 7900 XTX led by 34% with DeepSeek R1 Distill Qwen 7B; the lead narrowed to 27% with Distill Llama 8B and 22% with Distill Qwen 14B.
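
Taking AMD's percentages at face value, the claimed advantage can be condensed into a geometric mean of relative throughput (with the Nvidia card normalized to 1.0). A quick sketch using only the figures above:

```python
import math

# AMD's claimed RX 7900 XTX advantage per model, taken from the figures
# above (positive = 7900 XTX faster, negative = Nvidia card faster).
VS_RTX_4090 = {
    "Distill Qwen 7B": +0.13,
    "Distill Llama 8B": +0.11,
    "Distill Qwen 14B": +0.02,
    "Distill Qwen 32B": -0.04,
}
VS_RTX_4080_SUPER = {
    "Distill Qwen 7B": +0.34,
    "Distill Llama 8B": +0.27,
    "Distill Qwen 14B": +0.22,
}

def geomean_advantage(deltas):
    """Geometric mean of relative throughput (Nvidia card = 1.0)."""
    ratios = [1.0 + d for d in deltas.values()]
    return math.prod(ratios) ** (1.0 / len(ratios))

print(f"vs RTX 4090:       {geomean_advantage(VS_RTX_4090):.3f}x")
print(f"vs RTX 4080 Super: {geomean_advantage(VS_RTX_4080_SUPER):.3f}x")
```

By this rollup, AMD's numbers amount to roughly a 5% average claimed lead over the RTX 4090 and roughly 28% over the RTX 4080 Super across the tested models.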

This should all be taken with a pinch of salt, of course, as we can't be sure how the Nvidia GPUs were configured for the tests (which, again, were run by AMD). Not all AI workloads take advantage of a GPU's full computational throughput. We saw this in our Stable Diffusion testing, where the software used neither FP8 calculations nor Nvidia's TensorRT code path for processing.

It's not common for the RX 7900 XTX to be used as a dedicated AI processor, but the RDNA 3 architecture it is based on is more than capable of AI workloads, supporting matrix operations in BF16 and INT8 formats. AMD officially added the "AI Accelerator" terminology to RDNA 3 to demonstrate its AI-processing prowess; the RX 7900 XTX features 192 AI Accelerators.
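
As a loose, CPU-side illustration of what INT8 support means in practice (this is plain Python, not WMMA or any AMD code path): symmetric 8-bit quantization stores weights in a quarter of the FP32 footprint while keeping the round-trip error bounded by half a quantization step.

```python
# Illustrative only: symmetric per-tensor INT8 quantization, the kind of
# low-precision format RDNA 3's matrix instructions can operate on.

def quantize_int8(values):
    """Map floats to int8 range [-127, 127] with a single scale factor."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.52, -1.27, 0.003, 0.98, -0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error is bounded by half the quantization step (scale / 2).
max_err = max(abs(w - r) for w, r in zip(weights, restored))
assert max_err <= scale / 2 + 1e-12
print(q, scale, max_err)
```

The trade-off is the usual one: lower precision shrinks memory traffic and raises achievable TOPS, at the cost of a small, bounded rounding error per weight.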

AMD recently published a tutorial on how its customers can get DeepSeek R1 to run on compatible AMD consumer-based hardware, including the RX 7900 XTX. DeepSeek R1 is a new AI model that offers performance comparable to Western leading-edge AI models, but at a fraction of the computing cost. DeepSeek R1 uses an assortment of hardware-based optimizations to make its model run 11X faster than its competitors, including using Nvidia's assembly-like PTX programming language.
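
If you do run one of these distilled models locally, a rough ceiling for single-stream token generation can be estimated from memory bandwidth alone, since decoding typically streams every weight from VRAM once per token. A back-of-envelope sketch; the bandwidth figures are approximate public specs, and the quantized model sizes are rough assumptions of ours, not AMD's test configuration:

```python
# Back-of-envelope upper bound on tokens/sec for bandwidth-bound decoding:
# each generated token requires streaming the full weight set from VRAM once.
# Bandwidths are approximate public specs; model sizes assume a ~4-bit
# quantization and are rough estimates, not AMD's actual configuration.

GPU_BANDWIDTH_GBS = {
    "RX 7900 XTX": 960,
    "RTX 4090": 1008,
}

MODEL_SIZE_GB = {  # assumed on-disk sizes at ~4-bit quantization
    "Distill Qwen 7B": 4.4,
    "Distill Llama 8B": 4.9,
    "Distill Qwen 14B": 9.0,
    "Distill Qwen 32B": 19.0,
}

def max_tokens_per_sec(gpu, model):
    """Theoretical ceiling: bandwidth divided by bytes read per token."""
    return GPU_BANDWIDTH_GBS[gpu] / MODEL_SIZE_GB[model]

for model in MODEL_SIZE_GB:
    xtx = max_tokens_per_sec("RX 7900 XTX", model)
    ada = max_tokens_per_sec("RTX 4090", model)
    print(f"{model}: ~{xtx:.0f} vs ~{ada:.0f} tok/s ceiling")
```

Because the two flagships differ by only about 5% in raw bandwidth, near-parity in bandwidth-bound inference is plausible regardless of vendor; real-world throughput depends heavily on the software stack and will sit well below these ceilings.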

Aaron Klotz
Contributing Writer

Aaron Klotz is a contributing writer for Tom’s Hardware, covering news related to computer hardware such as CPUs and graphics cards.

  • gg83
    Amd just trying to pump their stock a little.
    Reply
  • Neilbob
    Our AI thing AIs more AI than the AI thing of the other AI with this AI.

    I'm just so very weary of everything. And AI - I'm weary of that too.
    Reply
  • phxrider
    In other news, Congress begins talks on embargoing exports of the 7900XTX to China..... 😂
    Maybe this illustrates the differences when software is explicitly written for one or the other? I'm pretty sure most games are written for Nvidia seeing as they own something like 85% of the market, except there are a few AMD sponsored games that do much better on AMD. Could this be the same effect? It's not too hard to figure out the Chinese might try writing this for AMD since the 4090 is embargoed, and 7900XTX is not.
    Reply
  • Makaveli
    This should all be taken with a pinch of salt, of course, as we can't be sure how the Nvidia GPUs were configured for the tests (which, again, were run by AMD).
    I was testing this in LM Studio last week, in the LM Studio Discord with a 4090 user. No grain of salt needed; it's been verified.

    On the 7B, 8B, and 14B models the XTX is faster. The 4090 is a little faster on the 32B model, by about 4%.
    Reply
  • systemBuilder_49
    I can hear the howls of anguish as all the NVidia-buyers LOSE THEIR MINDS over this fact ...
    Reply
  • bit_user
    The article said:
    This should all be taken with a pinch of salt, of course, as we can't be sure how the Nvidia GPUs were configured for the tests (which, again, were run by AMD).
    It's plausible, since the 7900 XTX has about the same memory bandwidth as the RTX 4090 and better bandwidth from L2 and L3 caches. So, if inferencing these models is bandwidth-limited and not compute-bound, then I could believe the 7900 XTX is holding its own against that GPU.

    I didn't find an official number for how many TOPS the 7900 XTX is good for, but the number 123 did pop up. That's only 37% of Nvidia's dense TOPS (and halve that ratio again for matrices with optimal sparsity).

    Source: https://chipsandcheese.com/p/microbenchmarking-nvidias-rtx-4090

    Source: https://chipsandcheese.com/p/microbenchmarking-amds-rdna-3-graphics-architecture

    The article said:
    The RDNA 3 architecture the RX 7900 XTX is based on is capable of matrix operations, supporting BF16 and INT8.
    It turns out that the WMMA instructions in RDNA 3 are simply microcoded operations that utilize the same vector pipelines as normal shader arithmetic. So, RDNA 3 does not have something akin to Nvidia's Tensor cores in its client GPUs (the CDNA-based server chips do have dedicated Matrix units, however).
    Reply
  • Amdlova
    AMD trying to profit... US authorities will sign off on blocking the AMD GPU from the Chinese market. Good move, AMD.
    Reply
  • nogaard777
    Why does AMD do this to themselves when it can be proven false so easily?
    Reply
  • Cooe
    bit_user said:
    It's plausible, since the 7900 XTX has about the same memory bandwidth as the RTX 4090 and better bandwidth from L2 and L3 caches. So, if inferencing these models is bandwidth-limited and not compute-bound, then I could believe the 7900 XTX is holding its own against that GPU.

    I didn't find an official number indicating how many TOPS the 7900 XTX is good for, but the number 123 did pop up. This is only 37% as much as the amount of dense TOPS as Nvidia (and halve that, for matrices with optimal sparsity).

    Source: https://chipsandcheese.com/p/microbenchmarking-nvidias-rtx-4090

    Source: https://chipsandcheese.com/p/microbenchmarking-amds-rdna-3-graphics-architecture


    It turns out that the WMMA instructions in RDNA 3 are simply microcoded operations that utilize the same vector pipelines as normal shader arithmetic. So, RDNA 3 does not have something akin to Nvidia's Tensor cores in its client GPUs (the CDNA-based server chips do have dedicated Matrix units, however).
    The last bit is mostly semantics, as matrix workloads run on Nvidia's Tensor Cores don't also use the SM's standard ALUs at the same time. And AMD added dedicated ASIC hardware to the CUs themselves to run matrix operations via WMMA. It's not just using the normal FP32 pipes; it's more complicated than that. It's still dedicated AI/matrix math hardware, just not completely split apart as totally separate "cores" a la Nvidia post-Turing or CDNA.

    Aka, when matrix operations are being calculated on either architecture it's being done (at least partially) on dedicated fixed-function matrix hardware; it's just that AMD put said hardware inside their standard CU's (which ofc does necessitate less capable functionality) whereas Nvidia broke it out into its own, totally separate unit.

    AMD's method of running matrix operations on the standard CU's (basically having them pull double duty) instead of using fully dedicated matrix cores saves SIGNIFICANT die-space, but ofc at the cost of peak performance (which tbh makes perfect sense for an architecture aimed at gamers & mainstream consumers 🤷).
    Reply
  • bit_user
    Cooe said:
    The last bit is mostly semantics as matrix workloads run on Nvidia's Tensor Cores don't also use the SM's standard ALU's at the same time.
    Pretty sure that's not true. I think Tensor cores are a separate pipeline from their vector pipes, hence you should be able to overlap vector ops with Tensor ops. Confirmed here:
    https://forums.developer.nvidia.com/t/overlapping-cuda-cores-and-tensor-cores/288774
    Cooe said:
    And AMD's added dedicated ASIC hardware to the CU's themselves to run matrix operations via WMMA. It's not just using the normal FP32 pipes, it's more complicated than it.
    That's not how I read it. To me, it sounds like WMMA uses their vector pipeline, but simply short-circuits the VGPR by using dedicated storage for intermediates. This means you can't overlap WMMA and other instructions, in the same CU.
    Reply