Latest from Tom's Hardware UK in Deepseek

Huawei-led team claims it post-trained DeepSeek's 1.6-trillion-parameter model — 1,000 Ascend 910C chips used in training

Luke James — Sat, 06 Jun 2026 12:00:00 +0000

A research group that includes Huawei Technologies says it completed full-parameter post-training of DeepSeek's V4-Pro, a 1.6-trillion-parameter model. The group used a cluster of at least 1,000 Huawei Ascend 910C chips, according to the Shenzhen municipal government, as reported by the South China Morning Post.

The revelation is evidence that Chinese accelerators can now handle a training-class workload on domestic silicon, the part of the AI pipeline Chinese firms have had the most trouble moving off Nvidia hardware under U.S. export controls. Huawei carried out the work with the Shenzhen Loop Area Institute, the Shenzhen campus of Harbin Institute of Technology, and the Shenzhen Research Institute of Big Data.

The Ascend 910C is Huawei's current flagship AI accelerator, a dual-die part that returned roughly 60% of an Nvidia H100's inference performance in earlier DeepSeek testing. Chinese chips have been competitive at inference, where a finished model answers prompts, but weak at training, where a model's weights are recalculated across large datasets. The team says it ran full-parameter post-training, meaning every weight was updated rather than a thin adapter layer added on top.
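To see why full-parameter post-training is the heavier job, it helps to count trainable weights. The sketch below assumes the "thin adapter layer" is a low-rank adapter of the LoRA type, which the report doesn't specify, and uses a hypothetical layer shape and rank chosen only for illustration.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Full-parameter training updates every entry of the weight matrix."""
    return d_in * d_out


def low_rank_adapter_params(d_in: int, d_out: int, rank: int) -> int:
    """A low-rank adapter trains two thin matrices: (d_in x rank) and (rank x d_out)."""
    return d_in * rank + rank * d_out


# Hypothetical layer shape and adapter rank, not taken from V4-Pro.
D_IN, D_OUT, RANK = 4096, 4096, 16

full = full_finetune_params(D_IN, D_OUT)
adapter = low_rank_adapter_params(D_IN, D_OUT, RANK)
print(f"full: {full:,} trainable weights")
print(f"adapter: {adapter:,} trainable weights ({adapter / full:.2%} of full)")
```

Under these assumed numbers the adapter trains under 1% of the layer's weights, which is why updating every weight of a 1.6-trillion-parameter model is a far more demanding test of a cluster.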

Post-training is essentially the “tuning” stage that follows the much larger pre-training phase. Pre-training builds a model's core capabilities by working through enormous text corpora, and DeepSeek's documentation puts V4-Pro's pre-training corpus at more than 32 trillion tokens.

Post-training then shapes behavior through instruction-following, safety alignment, and task-specific data. Completing it on Ascend silicon is a genuine result for the platform, but it doesn’t demonstrate that the chips can pre-train a frontier model from scratch, which is the heavier and costlier job.

Back in August, it was reported that DeepSeek couldn’t complete a single successful training run for its R2 model on Ascend chips, even with Huawei engineers on site, blaming unstable performance, slow chip-to-chip interconnects, and gaps in Huawei's CANN software stack, its substitute for Nvidia's CUDA. The company fell back on Nvidia GPUs for training and left Ascend on inference. DeepSeek-V4-Pro, released in April, was the first DeepSeek model built around Ascend from the outset.

As for the claim coming out of Shenzhen, it carries no benchmarks and gives no figures for how long the run took, how it compared to the same job on Nvidia hardware, or how efficiently the 1,000-chip cluster was used. It’s ultimately just another addition to a series of dubious claims that have come from the Chinese state without anything to back them up; DeepSeek itself hasn’t commented.

Huawei braces for $12 billion in AI chip revenue driven by homegrown AI model demand — Chinese fabs can barely keep up as Nvidia's market share craters within the region

Luke James — Tue, 05 May 2026 13:29:27 +0000

Huawei expects revenue from its AI processors to reach roughly $12 billion in 2026, up from $7.5 billion last year. The projection, based on orders already received from major Chinese technology firms including Alibaba, ByteDance, and Tencent, would represent at least 60% year-over-year growth and position Huawei as the dominant supplier in a domestic AI chip market that Morgan Stanley estimates could reach $67 billion by 2030. The surge has coincided with Nvidia CEO Jensen Huang confirming that Nvidia's share of the Chinese AI accelerator market has collapsed to zero percent.

These numbers describe a market that has bifurcated with unusual speed. Just 18 months ago, Nvidia supplied the vast majority of AI training and inference silicon used by Chinese cloud providers. Today, Huawei's Ascend 950PR is the primary procurement target for China's largest tech companies, and a training-focused successor named the 950DT is scheduled for Q4 this year.

The impact of DeepSeek V4

This raging demand can be largely attributed to the release of DeepSeek’s V4 LLM in April, which has been optimized specifically for Huawei's Ascend architecture and its CANN software framework rather than for Nvidia's CUDA ecosystem. Huawei engineers, per reporting from South China Morning Post, are said to have collaborated directly with DeepSeek ahead of the model’s launch, and the company confirmed that its full Ascend SuperNode product line was adapted for V4 inference on day one. Alibaba Cloud and Tencent Cloud both deployed V4 services within hours of release.

The 950PR is currently the only Chinese-made AI processor that supports FP8, a compressed numerical format that allows more operations per second and lowers per-query costs. V4 uses a Mixture-of-Experts architecture with up to 1 trillion total parameters but activates only around 37 billion per inference pass. That favors inference-efficient hardware, which plays to the 950PR's strengths over its limitations in raw training throughput.
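A short sketch may help show why a Mixture-of-Experts model touches only a sliver of its weights per pass. The parameter counts are the ones quoted above; the router scores, the expert count, and the top-2 choice are hypothetical, and this is a generic top-k router rather than DeepSeek's actual routing code.

```python
import math


def top_k_route(router_scores: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    chosen = order[:k]
    peak = max(router_scores[i] for i in chosen)
    exps = {i: math.exp(router_scores[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Figures quoted in the article.
TOTAL_PARAMS = 1e12
ACTIVE_PARAMS = 37e9
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Hypothetical router scores for one token across 8 experts.
scores = [0.1, 2.3, -0.5, 1.7, 0.0, 0.9, -1.2, 0.4]
weights = top_k_route(scores, k=2)
print(f"active fraction: {active_fraction:.1%}")
print(weights)
```

Only the selected experts run for that token, so compute per query tracks the active parameters while memory capacity must still hold the full model.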

DeepSeek gave Huawei early optimization access, but didn’t extend the same to Nvidia or AMD. While V4's open weights are released in standard formats compatible with CUDA-based frameworks, DeepSeek's own infrastructure runs on Huawei Ascend silicon. The collaboration has pulled forward procurement timelines across the Chinese cloud industry, and chip prices for the 950PR have reportedly risen by about 20% as a result of the demand.

SMIC capacity and production

Huawei's ability to fill those orders depends on SMIC, China's leading foundry. SMIC manufactures the 950PR on its N+3 process, a 7nm-class node built without EUV lithography. Huawei is said to be targeting production of roughly 750,000 950PR units this year, with full-scale shipments expected in the second half following samples that were shipped to customers in January, but that figure is expected to fall short of demand.

Meanwhile, SMIC has been working on expanding its advanced-node capacity for more than a year. The goal is a five-fold increase over two years that would lift 7nm and 5nm production to 100,000 wafers per month, and then to half a million by 2030. In addition, the combined capacity for 22nm and below could rise from 30,000-50,000 wafer starts per month in 2025 to 50,000-60,000 or higher this year. Huawei is adding two dedicated fabrication plants, though ownership structures remain unclear. Once fully operational, those facilities could exceed the current output of comparable lines at SMIC.

Yields remain a thorn in China’s side, with SMIC’s 7nm-class process delivering substantially fewer good dies per wafer than TSMC’s equivalent nodes, and the 950PR is likely to be a much larger chip than a TSMC equivalent. SMIC’s cycle time from wafer start to a finished, packaged Ascend processor is also a problem, currently sitting at around eight months, according to estimates from JP Morgan. For similar nodes at TSMC, it’s around three months.

Then there’s HBM: Huawei announced in September that it had developed its own HBM chips, HiBL 1.0 and HiZQ 2.0, with up to 1.6 TB/s of bandwidth, in partnership with CXMT, but how quickly CXMT can ramp production of competitive HBM remains an open question.

Nvidia's collapse in China

Huang's admission that “In China, we have now dropped to zero,” came during an interview with the Special Competitive Studies Project's "Memos to the President" podcast. He criticized U.S. export policy as having "already largely backfired," arguing that conceding a market the size of China doesn’t make strategic sense.

The H200, which Nvidia received U.S. licenses to sell to China earlier this year, hasn’t shipped a single unit despite receiving orders. Contradictory regulatory requirements from Washington and Beijing created a stalemate at customs: U.S. regulators require that H200 chips ordered by Chinese customers be used only inside China, while Beijing has instructed domestic technology companies to limit Nvidia hardware to overseas operations.

Nvidia confirmed in its FY2026 10-K filing that it’s "effectively foreclosed from competing in China's data center computing market" and is not assuming any data center compute revenue from the region in its current outlook. Bernstein analysts estimated earlier this year that Nvidia’s share of the China AI GPU market could fall to roughly 8% in the coming years, down from 66% in 2024, both due to U.S. restrictions and because domestic vendors are being pushed to cover up to 80% of demand. TrendForce projected in December that China's high-end AI chip market would grow by more than 60% in 2026, with domestic suppliers capturing about half of the total.

950PR performance

The 950PR performs somewhere between Nvidia’s H100 and H200: it outperforms the restricted H20 by an estimated factor of 2.8 but trails the H200 in both compute and memory bandwidth. That 2.8 figure can’t be verified like-for-like, however, presumably because it rests on FP4 throughput, which Hopper-era hardware doesn’t support natively.

Huawei compensates by linking large numbers of processors via optical interconnects. Its CloudMatrix 384 system combines twelve racks of Ascend modules into a 384-processor fabric delivering roughly 300 PFLOPS, though at nearly four times the power draw of Nvidia's comparable GB200-based configurations.

The 950PR is primarily an inference chip, though; the training-focused 950DT, expected in Q4, is designed for deep learning workloads and could narrow the gap with Nvidia's Hopper generation for model training tasks. Until it ships, Chinese firms that need to train large foundation models domestically face constraints that inference silicon can’t fully solve.

As for Huawei's CANN software ecosystem, it’s now thought to have more than four million developers, but it remains far smaller than Nvidia's CUDA install base. Whether CANN can attract enough third-party development to become self-sustaining remains to be seen. For now, commercial momentum is running in Huawei's favor inside China, driven by the simple absence of alternatives.

DeepSeek launches 1.6 trillion parameter V4 on Huawei chips as U.S. escalates AI theft accusations — U.S. gov't alleges IP theft by DeepSeek and other Chinese AI firms

Luke James — Sun, 26 Apr 2026 12:15:00 +0000

DeepSeek on Friday released a preview of its V4 large language model, the Hangzhou-based startup's most powerful to date, with 1.6 trillion parameters and a 1 million token context window. The model is the first major frontier release optimized for Huawei's Ascend AI processors rather than Nvidia hardware, and it arrived on the same day Reuters reported that the U.S. State Department had sent a diplomatic cable to embassies worldwide instructing staff to warn foreign governments about alleged IP theft by DeepSeek and other Chinese AI firms.

V4 comes in two variants: V4-Pro, the flagship, which costs $3.48 per million output tokens, and V4-Flash, a smaller 284 billion parameter version, which costs $0.28. OpenAI currently charges $30 per million output tokens for GPT-5.4, and Anthropic charges $25 for Claude Opus 4.6. DeepSeek, however, acknowledges V4 “falls marginally short” of those closed-source models by roughly three to six months of development, but outperforms every other open-source competitor in agentic coding and reasoning benchmarks.
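To put the pricing gap in concrete terms, the sketch below applies the per-million-token prices quoted above to a monthly output volume. The 50-million-token workload is a hypothetical figure, and input-token pricing, which the article doesn't give, is ignored.

```python
# Per-million-output-token prices in USD, as quoted in the article.
PRICE_PER_MILLION = {
    "DeepSeek V4-Pro": 3.48,
    "DeepSeek V4-Flash": 0.28,
    "GPT-5.4": 30.00,
    "Claude Opus 4.6": 25.00,
}


def output_cost(model: str, output_tokens: int) -> float:
    """Cost in USD of generating the given number of output tokens."""
    return PRICE_PER_MILLION[model] * output_tokens / 1_000_000


tokens = 50_000_000  # hypothetical monthly output volume
for model in PRICE_PER_MILLION:
    print(f"{model}: ${output_cost(model, tokens):,.2f}")

# How many times more GPT-5.4 costs per output token than V4-Pro.
ratio = PRICE_PER_MILLION["GPT-5.4"] / PRICE_PER_MILLION["DeepSeek V4-Pro"]
print(f"GPT-5.4 vs V4-Pro: {ratio:.1f}x")
```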

DeepSeek trained its earlier V3 model on 2,048 Nvidia H800 GPUs, and the company has faced multiple investigations over whether it acquired restricted Nvidia hardware through intermediaries in Singapore.

V4 sidesteps that supply chain entirely by training on domestic Ascend chips. Huawei confirmed day-zero compatibility across its full Ascend SuperNode product line, including its latest 950 series processors, and DeepSeek said V4-Pro pricing could fall further once Huawei scales up Ascend 950 production in the second half of this year.

The diplomatic cable, per Reuters, instructed embassy staff to speak to their foreign counterparts about “concerns over adversaries’ extraction and distillation” of U.S. models, naming DeepSeek alongside Moonshot AI and MiniMax. Two days earlier, the White House Office of Science and Technology Policy published a memo accusing Chinese entities of running "deliberate, industrial-scale campaigns" to distill American frontier AI systems.

Those accusations build on claims Anthropic made in February, when the company said DeepSeek, Moonshot, and MiniMax had used 24,000 fraudulent accounts to make 16 million exchanges with its Claude model. OpenAI has also accused DeepSeek of distilling its models.

China's foreign ministry called the accusations "groundless," according to Reuters, and DeepSeek has previously said its V3 model relied on naturally occurring data collected through web crawling and didn’t intentionally use synthetic data generated by OpenAI. The diplomatic cable and the V4 launch both come just weeks before President Trump is scheduled to visit Chinese President Xi Jinping in Beijing for a summit expected to cover semiconductor export controls and IP disputes.

Anthropic accuses DeepSeek, other Chinese AI developers of 'industrial-scale' copying — Claims 'distillation' included 24,000 fraudulent accounts and 16 million exchanges to train smaller models

Anton Shilov — Mon, 23 Feb 2026 21:27:30 +0000

Anthropic on Monday accused three leading Chinese developers of frontier AI models of using large-scale distillation to improve their own models by using Anthropic's Claude capabilities. In total, DeepSeek, Moonshot, and MiniMax made 16 million exchanges using 24,000 fraudulent accounts.

Distillation is a machine learning technique in which a smaller or less capable model is trained on the outputs of a stronger model rather than on original training data. It can save time, create cheaper, more specialized models, extract capabilities from competitors, and/or lower hardware requirements. While distillation is generally a legitimate technique, when a China-based entity subject to heavy restrictions does it, it violates both U.S. export controls and Anthropic's end-user license agreement.
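The core idea, a student that learns only from a teacher's outputs, can be shown with a deliberately tiny example. Everything here is a stand-in: the "teacher" is a simple function and the "student" is a straight-line fit, whereas real LLM distillation fine-tunes a neural network on generated text.

```python
def teacher(x: float) -> float:
    """Stand-in for a stronger model; the student only ever sees its outputs."""
    return 3.0 * x + 1.0


def fit_student(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Closed-form least-squares fit of y = w * x + b to the teacher's outputs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b


prompts = [0.0, 1.0, 2.0, 3.0, 4.0]              # hypothetical queries
teacher_outputs = [teacher(x) for x in prompts]  # training labels come from the teacher
w, b = fit_student(prompts, teacher_outputs)
print(w, b)
```

The student recovers the teacher's behavior without ever seeing the data or process that produced the teacher, which is what makes the technique both useful and contentious.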

"Distillation can be legitimate: AI labs use it to create smaller, cheaper models for their customers," a statement by Anthropic published on X reads. "But foreign labs that illicitly distill American models can remove safeguards, feeding model capabilities into their own military, intelligence, and surveillance systems."

American companies like OpenAI have long accused DeepSeek of using distillation to train some of its frontier models on outputs of ChatGPT and other services, but, unlike Anthropic, have not presented a detailed explanation.

How Chinese companies use distillation from American AI models

According to Anthropic, the perpetrators followed the same pattern: they used commercial services that resell access to frontier models and built what the company calls 'hydra cluster' networks — large pools of accounts that spread traffic across Anthropic's API and third-party clouds.

In one case, a single proxy setup allegedly controlled more than 20,000 fraudulent accounts at once. To avoid raising flags, it mixed extraction traffic with ordinary use requests. However, its prompt patterns stood out: very high volumes, tightly focused on specific capabilities, and highly repetitive. Such behavior was consistent with model training, but certainly not typical end-user interaction.
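The pattern described above, with high volume and highly repetitive prompts, lends itself to a simple illustration. The heuristic below is purely hypothetical: the thresholds, the three-word "template" fingerprint, and the function name are invented here and say nothing about how Anthropic's detection actually works.

```python
from collections import Counter


def extraction_score(prompts: list[str], volume_threshold: int = 1000) -> dict:
    """Toy heuristic: flag high volume combined with highly repetitive prompt openings."""
    if not prompts:
        return {"volume": 0, "repetition": 0.0, "flagged": False}
    # Use the first three words as a crude template fingerprint.
    templates = Counter(" ".join(p.lower().split()[:3]) for p in prompts)
    top_share = templates.most_common(1)[0][1] / len(prompts)
    flagged = len(prompts) >= volume_threshold and top_share >= 0.8
    return {"volume": len(prompts), "repetition": top_share, "flagged": flagged}


# Hypothetical traffic: one scripted bulk workload and one ordinary user.
bulk = [f"Grade this answer using rubric {i}" for i in range(2000)]
casual = ["What's the weather?", "Summarize this email", "Translate hello to French"]
print(extraction_score(bulk)["flagged"], extraction_score(casual)["flagged"])
```

A per-account check like this would miss traffic spread thinly across thousands of accounts, which is presumably why coordinated multi-account activity is treated as a signal in its own right.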

DeepSeek alone generated over 150,000 exchanges that targeted reasoning tasks, rubric-based grading suitable for reinforcement learning reward models, and censorship-safe rewrites of politically sensitive queries, according to Anthropic. Anthropic also observed prompts designed to produce step-by-step internal reasoning and therefore reveal chain-of-thought training data.

Moonshot, known for its Kimi models, accounted for more than 3.4 million exchanges, according to Anthropic. Its focus areas included agentic reasoning, tool use, coding, data analysis, computer-use agents, and computer vision. Moonshot allegedly used hundreds of fraudulent accounts spanning multiple access pathways and later tried to extract and reconstruct Claude's reasoning traces.

MiniMax conducted the largest campaign, with over 13 million exchanges that targeted agentic coding and orchestration. Anthropic says it detected this operation while it was still ongoing and MiniMax was still training a then-unreleased model, which gave the American company a unique view of the extraction's lifecycle. After Anthropic introduced a new Claude model, MiniMax allegedly redirected nearly half its traffic within 24 hours to capture capabilities from the latest model.

Anthropic's response

To fight future distillation attempts, Anthropic says it is strengthening defenses to make large-scale distillation harder to carry out and easier to detect. The company has deployed classifiers and behavioral fingerprinting systems to identify extraction patterns in API traffic, including chain-of-thought elicitation and coordinated multi-account activity. The company is also sharing technical indicators of large-scale distillation operations with other AI labs, cloud providers, and authorities, as well as tightening verification for educational, research, and startup accounts often used to create fraudulent access. In parallel, it is developing product-, API-, and model-level safeguards to reduce the usefulness of outputs for illicit training without harming legitimate users. At the same time, the company admits that countering attacks at this scale requires coordinated industry and policy action.

Deepseek research touts memory breakthrough, decoupling compute power and RAM pools to bypass GPU & HBM constraints — Engram conditional memory module commits static knowledge to system RAM

Sayem Ahmed — Tue, 13 Jan 2026 18:49:44 +0000

DeepSeek has released a technical paper detailing a method by which AI models might rely on a queryable database of information committed to system memory. Named "Engram", the conditional memory-based technique achieves demonstrably higher performance in long-context queries by committing sequences of data to static memory. This eases the reliance on reasoning for AI models, allowing the GPUs to handle only more complex tasks, which increases performance and reduces the reliance on high-bandwidth memory (HBM).

The paper details how N-grams, statistical sequences of words, are integrated into the model's neural networks, allowing them to be placed into a queryable memory bank. Engram allows models to remember facts, rather than having to reason them out, which is more computationally expensive.
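As a rough illustration of looking something up rather than working it out, the sketch below slides over a sentence's N-grams and checks each against a small memory bank. The dictionary of string facts is a hypothetical simplification; as described above, the paper integrates N-grams into the model's neural networks rather than storing plain text.

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every contiguous run of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical static memory: n-gram -> stored fact.
MEMORY = {
    ("capital", "of", "france"): "Paris",
    ("speed", "of", "light"): "299,792,458 m/s",
}


def recall(text: str, n: int = 3) -> list[str]:
    """Look up each n-gram in the memory bank instead of recomputing the fact."""
    tokens = text.lower().split()
    return [MEMORY[g] for g in ngrams(tokens, n) if g in MEMORY]


print(recall("What is the capital of France"))
```

Each lookup is a constant-time dictionary access, which is the intuition behind remembering a fact being cheaper than reasoning it out.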

Released on the company's GitHub page, Engram aims to curb the reliance on more complex memory types by instead committing a knowledge library to more common system memory, attached over an interconnect standard such as CXL.

Reducing the reliance on HBM

The ongoing reliance on high-bandwidth memory for AI accelerators is something that even Chinese silicon, such as Huawei's Ascend series, cannot escape. Each stack of HBM consumes multiple memory dies, and with demand skyrocketing, easing any AI model's reliance on the GPU's direct high-bandwidth memory would be significant, especially considering the ongoing memory supply squeeze.

Engram would enable static memory to be held separately from an LLM's compute power, allowing the GPU's rapid HBM to dedicate itself to reasoning, therefore enabling more performant Engram-based AI models, compared to a standard Mixture of Experts (MoE) model.

As detailed in the paper, an Engram-based model scaled to nearly 27 billion parameters can beat out a standard MoE model in long-context training and eliminates computational waste generated by having to reason out facts, by allowing them to be externally stored.

A standard MoE model might have to reconstruct these pieces of data every time they're referenced in a query, which is called conditional computation. The model then calls on its expert parameters to assemble and reason out the data each time, even when the query touches only certain parts or experts, which is known as sparse computation.

How Engram embeds itself into training and inference workloads (Image credit: Deepseek)

The Engram paper adds that placing conditional memory would allow the model to merely ask: "Do I already have this data?", rather than having to access the parts of the model that deal with reasoning.

"This process essentially amounts to an expensive runtime reconstruction of a static lookup table, wasting valuable sequential depth on trivial operations that could otherwise be allocated to higher-level reasoning," the paper reads.

How Engram is different to KVCache

Engram takes static patterns and lists its knowledge index into a parsable piece of conditional memory with a store of information, relieving the AI model from the burden of having to reason through context repeatedly. While Nvidia's KVCache, announced at CES 2026, offloads context data to NVMe memory with BlueField-4, this acts as more of a short-term solution, allowing the model to remember things that you have recently said or added within context, and is, for all intents and purposes, disposable after you move on to the next query or conversation.

KVCache, while persistent within the history of your conversations or queries, does not draw on an existing base of pre-calculated data, and is not persistent in the same way that Engram-based LLMs could be, if the paper is to be believed. To put it simply, KVCache can be likened to storing your handwritten notes, whereas Engram is a record of the whole encyclopedia.

Hashing and gating

This is enabled through tokenizer compression, which maps equivalent tokens (such as the same word with different forms of capitalization) to the same canonical concept. This allowed DeepSeek to reduce the vocabulary size for the conditional memory module by 23%, and allows for rapid parsing of information in context.
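A minimal sketch of the idea, assuming that canonicalization means nothing more than lowercasing and trimming whitespace; the paper's actual normalization rules aren't described here, and the toy vocabulary is hypothetical.

```python
def canonical(token: str) -> str:
    """Map surface variants to one canonical form (here: case and padding only)."""
    return token.strip().lower()


def compress_vocab(vocab: list[str]) -> dict[str, int]:
    """Assign one shared ID per canonical form."""
    ids: dict[str, int] = {}
    mapping: dict[str, int] = {}
    for tok in vocab:
        key = canonical(tok)
        # setdefault keeps the first ID assigned to a canonical form.
        mapping[tok] = ids.setdefault(key, len(ids))
    return mapping


vocab = ["Apple", "apple", "APPLE", " apple", "Pear", "pear", "plum"]  # hypothetical
mapping = compress_vocab(vocab)
reduction = 1 - len(set(mapping.values())) / len(vocab)
print(mapping)
print(f"vocabulary reduced by {reduction:.0%}")
```

The toy vocabulary shrinks far more than the 23% reported for Engram because it was built almost entirely from duplicates; the point is only that variants end up sharing one memory entry.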

As there is an impossibly large number of phrases or combinations of words within a certain context, the researchers employ hashing, which lets the model map a series of words to a number. Engram adds to this with what it calls Multi-Head Hashing, in which several hash functions map that single phrase to multiple numbers, so that a collision in one hash doesn't erroneously pull in the wrong context. For example, Universal might be a single entry, distinct from Universal Studios, with Multi-Head Hashing employed to ensure no mistakes or database errors.
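The following sketch shows one way several hashes can be applied to the same phrase. The use of salted SHA-256, four heads, and the table size are all assumptions made for illustration; the hash functions Engram actually uses aren't described here.

```python
import hashlib


def head_hash(ngram: tuple[str, ...], head: int, table_size: int) -> int:
    """Hash an n-gram with a per-head salt into a slot index."""
    data = f"{head}|" + "\x1f".join(ngram)
    digest = hashlib.sha256(data.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % table_size


def multi_head_slots(ngram: tuple[str, ...], num_heads: int = 4,
                     table_size: int = 1_000_003) -> tuple[int, ...]:
    """One slot per head; two phrases are confused only if they collide in every head."""
    return tuple(head_hash(ngram, h, table_size) for h in range(num_heads))


a = multi_head_slots(("universal",))
b = multi_head_slots(("universal", "studios"))
print(a)
print(b)
print("full collision:", a == b)
```

With a single hash, two unrelated phrases can land in the same slot by chance; requiring agreement across several independently salted hashes makes a complete collision vanishingly unlikely.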

This is then passed on to Engram's context-aware gating, which then confirms that the term matches the context of the sentence it's being used in, before being deployed into an output.
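A gate of this kind can be sketched as a score between the current context and the retrieved memory, squashed to a value between 0 and 1. The scaled dot product, the sigmoid, and the four-dimensional vectors below are assumptions chosen for illustration and are not the formula from the paper.

```python
import math


def gate(hidden: list[float], memory: list[float]) -> float:
    """Sigmoid of the scaled dot product: near 1 when the memory agrees with the context."""
    score = sum(h * m for h, m in zip(hidden, memory)) / math.sqrt(len(hidden))
    return 1.0 / (1.0 + math.exp(-score))


def gated_memory(hidden: list[float], memory: list[float]) -> list[float]:
    """Scale the retrieved memory vector by the gate before adding it to the hidden state."""
    g = gate(hidden, memory)
    return [h + g * m for h, m in zip(hidden, memory)]


context = [2.0, -1.0, 0.5, 1.5]       # hypothetical hidden state
matching = [1.8, -0.9, 0.4, 1.2]      # retrieved entry that fits the context
mismatched = [-1.8, 0.9, -0.4, -1.2]  # retrieved entry that does not
print(round(gate(context, matching), 3), round(gate(context, mismatched), 3))
```

A retrieved entry that points the same way as the context passes through almost untouched, while one that points the opposite way is suppressed before it can reach the output.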

The perfect allocation ratio

To examine how Engram-based LLMs might work in large-scale deployments, DeepSeek detailed how it might achieve the best allocation between embeddings of Engram and MoE parameters within an AI model.

The outcome was a U-curve, which suggests that memory and compute (or reasoning) can be considered mathematically distinct forms of intelligence within AI models. This resulted in a sweet spot for MoE and Engram embeddings.

"Remarkably, the Engram model achieves comparable performance to the pure MoE baseline (𝜌 = 100%) even when the MoE allocation is reduced to just 𝜌 ≈ 40% (i.e., a total of 46 experts for the 5.7B model and 43 experts for the 9.9B model). Furthermore, the pure MoE baseline proves suboptimal: reallocating roughly 20%–25% of the sparse parameter budget to Engram yields the best performance."

DeepSeek itself remarks on how both Engram-dominated and MoE-dominated models falter, whereas a ratio that allocates 20-25% of the model's sparse parameter budget to Engram achieves the best results.
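The allocation can be read as a simple split of one budget, following the quoted passage, where ρ is the share of sparse parameters left with the MoE experts. The 20-billion-parameter budget below is a hypothetical round number, not a figure from the paper.

```python
def split_sparse_budget(sparse_params: float, rho: float) -> tuple[float, float]:
    """rho is the MoE share of the sparse budget; the remainder goes to Engram memory."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be between 0 and 1")
    moe = sparse_params * rho
    return moe, sparse_params - moe


BUDGET = 20e9  # hypothetical sparse parameter budget
for rho in (1.0, 0.8, 0.75, 0.4):
    moe, engram = split_sparse_budget(BUDGET, rho)
    print(f"rho={rho:.0%}: MoE {moe / 1e9:.1f}B, Engram {engram / 1e9:.1f}B")
```

In these terms, the reported sweet spot corresponds to ρ between roughly 75% and 80%, while ρ of about 40% is the point at which the Engram model still matched the pure MoE baseline.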

What if Engram's memory was infinite?

DeepSeek ran another experiment in parallel, which it names the "Infinite Memory Regime." This effectively keeps the computational budget fixed, so the model doesn't get more expensive to run, and attaches a near-infinite number of conditional memory parameters to be deployed using Engram.

Since Engram is distinct from the overall compute budget (it's effectively a long-term storage bank that the model taps into), DeepSeek found that performance scales linearly with memory size. That means that if a model continued to add to its conditional memory banks, its performance would continue to improve without any increase in the overall compute budget.

This could have significant implications for the wider AI industry if performance and results are bound not solely by compute, but also by long-term "Engram" memory banks. If the performance benefits are indeed as good as the paper outlines, the memory squeeze would no longer hinge solely on the deployment of HBM, but on all forms of memory that could be deployed within data centers, either through CXL or other methods of interconnection.

The results speak for themselves

DeepSeek deployed an Engram-27B model and a standard 27B MoE model in parallel to determine the performance benefits of conditional memory within AI models, and the results were exemplary. Within knowledge-intensive tasks, Engram was 3.4 to 4 points better than its MoE equivalent, and it was even better at reasoning, with a 3.7 to 5 point uplift when compared to its MoE "reasoning-only" sibling. Similar results were also achieved in coding and mathematics-based tests.

However, the big win for Engram was in long-context tasks, increasing accuracy within the NIAH (Needle in a Haystack) benchmark to 97%, which is a leap from the MoE model's score of 84.2%. This is a large difference in reliability between the models, and could point toward AI's long-context and coherence issues eventually becoming a thing of the past, if Engram were to be deployed in a commercial AI model, especially if the demands for long-context AI queries increase.

Will DeepSeek V4 be based on Engram?

Engram has significant implications for the AI industry, especially as the paper details how this specific methodology is no longer bound by HBM, but instead longer-term storage. System DRAM can now be utilized to significantly improve the quality of Engram-based LLM outputs, meaning that the much more expensive HBM will only be used for computationally heavy queries.

Of course, if Engram were to take off, it could worsen the ongoing DRAM supply crisis, as AI hyperscalers adopting the methodology would then flock to system DRAM, instead of solely focusing on putting all of their memory ICs in production into HBM for GPUs.

"We envision conditional memory functions as an indispensable modeling primitive for next-generation sparse models," DeepSeek said, hinting that a possible V4 could deploy Engram. With the company rumored to announce a new AI model within the next few weeks, don't be surprised if it implements Engram within it.

While the results are impressive on paper, Engram's impact has yet to be determined in real-world deployment. But, if everything the paper says holds in a real-world context, the company could be onto a new 'DeepSeek moment.'

Nvidia decries 'far-fetched' reports of smuggling in face of DeepSeek training reports — unnamed sources claim Chinese company is involved in Blackwell smuggling ring

Sunny Grimm — Wed, 10 Dec 2025 16:41:06 +0000

A new report claims that DeepSeek has illegally obtained and operated "several thousand" Nvidia Blackwell GPUs in the process of training and developing its newest large language model. According to coverage by The Information, six unnamed sources all claim DeepSeek's involvement in a convoluted smuggling ring based around the use of fake data centers as fronts to move high-powered servers into mainland China, illegally circumventing U.S. sanctions on newer AI GPUs.

According to the sources, shell companies purchase a data center's worth of Nvidia servers somewhere in Southeast Asia, setting up the data center and its hardware entirely to spec. Nvidia's OEM partners send contractors to inspect the installation, confirming successful installs and export compliance.

After this inspection is finished, smugglers reportedly disassemble the entire data center rack by rack, shipping the GPU servers in suitcases across the border into mainland China, where the purchase and use of certain Nvidia chips are restricted by the United States government. According to the report, sources with knowledge of these smuggling operations claim that smugglers and clients prefer 8-GPU rack servers like the HGX B200 over the powerful GB200 NVL72 for their smaller size and ease of covert transportation.

When asked for comment, an Nvidia spokesperson gave the following statement to Tom's Hardware:

"We haven't seen any substantiation or received tips of 'phantom datacenters' constructed to deceive us and our OEM partners, then deconstructed, smuggled, and reconstructed somewhere else. While such smuggling seems far-fetched, we pursue any tip we receive."

DeepSeek's Need for Nvidia GPUs

DeepSeek, the most recognizable Chinese AI firm in the United States, thanks to its R1 LLM making worldwide headlines one year ago, has long been connected with Nvidia GPUs. Its sensational R1 model was trained on only 2,048 Nvidia H800s in two months, a far smaller GPU count, used far more efficiently, than any Western competitor's. Since then, DeepSeek has consistently been linked to the stockpiling and purchase of as many Nvidia GPUs as it can obtain, with reports constantly swirling about DeepSeek somehow bypassing export restrictions and securing huge numbers of the newest Nvidia chips.

Interestingly, DeepSeek's latest internal reports seem to indicate plans to use Nvidia chips for its newest AI models. In a whitepaper on DeepSeek V3.2 released on December 2nd, DeepSeek suggests that the bottleneck keeping its performance from matching that of frontier models like Gemini-3.0-Pro is pre-training compute: "We plan to address this knowledge gap in future iterations by scaling up the pre-training compute." Pre-training is a workload that Nvidia GPUs and CUDA software handle better than most competitors, suggesting that DeepSeek engineers are counting on something changing in the company's access to high-caliber pre-training compute power.

DeepSeek's track record proves that Nvidia's pre-training abilities fill a niche unmatched by domestic Chinese products. Reports in August claimed that Huawei's Ascend GPU servers were unable to run necessary training workloads, prompting a return to Nvidia hardware in the R2 training process. This was despite government intervention and doctrines calling for DeepSeek to turn to domestic Chinese products for its AI workload. While the Huawei Ascend servers were used for inference for the models, the company could not turn anywhere but to Nvidia, much to the chagrin of China.

Nvidia's Future in China

The Trump administration recently announced plans to unrestrict the Nvidia H200 GPU in China, opening up Nvidia's sales in the country. Speculators claim that this policy U-turn from the White House, which has spent much of 2025 toeing a line of complete export isolationism to China, comes as fears of Huawei's CloudMatrix 384 and Ascend 910C systems grow. Reputable claims hold that these servers match the H200 and GB200 NVL72 in certain performance metrics, causing the U.S. government to release the H200 into China.

This new policy is based on a compromise between flooding China with easy-to-access American Nvidia tech and banning it altogether. The hope is to satiate Chinese tech needs and take away motivation for firms like Huawei to develop their own Nvidia competitors. The adoption of this doctrine, oft-touted by Nvidia's lobbying efforts to the White House, marks a major shift in the "Chip War" trade offensive between Beijing and Washington D.C., which has moved from preventing China from any access to next-gen tech to hoping to slow China's tech power that is beginning to threaten Western tech dominance.

While Trump's Commerce Department continues to insist that China will never see Nvidia Blackwell hardware, keeping the export exceptions limited to Hopper-generation hardware like the H200, time will tell if further Nvidia lobbying and fears of the Chinese tech sector will open the doors further. And of course, if DeepSeek truly is involved in conspiracies of phantom data centers, they won't even need the U.S. to allow them access to Blackwell.

China's autonomous military combat drone powered by DeepSeek highlights Nvidia reliance — investigation reveals People's Liberation Army, supporting institutions continue to use restricted H100 chips

editors@tomshardware.com (Jowi Morales) — Mon, 27 Oct 2025 16:50:19 +0000

China North Industries Corporation, or Norinco, a state-owned defense firm, earlier this year unveiled the P60, an autonomous military vehicle that can travel at 50 kilometers per hour (approximately 31 miles per hour) and features autonomous combat-support capability. According to Reuters, the drone is powered by DeepSeek, but details about it remain a state secret. However, the publication dived deep into procurement records and patents, which suggest the continued use of Nvidia AI GPUs by the Chinese People’s Liberation Army (PLA) and the various institutions that support it.

The U.S. has restricted the export of its most advanced AI chips to China since 2022, only allowing Nvidia and AMD to ship versions that are significantly less performant than their top-of-the-line models. Despite that, there’s reportedly still a healthy black market for chips like the Nvidia H100, which the U.S. prohibits from being exported to China but which isn’t illegal to obtain inside the East Asian nation. Still, Beijing is pushing for its own homegrown chips, and it has even banned some of its biggest tech companies from acquiring Nvidia hardware, saying that the performance of domestic semiconductors can already match Nvidia’s H20 and RTX Pro 6000D AI GPUs.

Chinese military use of Nvidia tech

Despite Beijing's recent ban, 35 of the patents that Reuters discovered, most of them filed by the National University of Defense Technology (NUDT) and other educational institutions that conduct research for the Chinese military, mentioned the use of Nvidia’s A100 chips. One patent was filed as late as June of this year, although it’s unknown whether the AI GPUs used in it were acquired before or after Washington's export controls.

On the other hand, a further 15 patents noted the use of Huawei Ascend chips, showing China’s progress in making its own AI GPUs. DeepSeek, the AI tool used in Norinco’s vehicle, was also suspected to have been trained on Nvidia AI GPUs, although the company’s latest model now supports Huawei’s chips and its CANN software toolkit.

The report says that the PLA "and affiliates continue to use and look for Nvidia chips," including models currently under Washington export controls. "China has more than enough domestic chips for all of its military applications, with millions to spare," Nvidia told Tom's Hardware in a statement. "While we can't track individual resales of products sold years ago, recycling small quantities of old, second-hand products doesn't enable anything new or raise any national security concern. Using restricted products for military applications would be a nonstarter, without support, software, or maintenance."

Weaponized AI

China isn’t the first country to experiment with and employ AI for its military. In fact, a U.S. defense startup is already working on an AI-powered drone killer, Sweden is experimenting with an AI drone swarm, and Russia is allegedly field-testing an AI-powered drone. AI technologies will allow warfare to become more efficient, with Chinese researchers saying that planning and assessment, which often takes 48 hours by a team of military planners, can now be completed in a matter of seconds.

Despite that, China’s top leaders say that humans will maintain control of its weapons systems, especially as we cannot say with 100% accuracy how reliable these systems could be. We’ve seen this scenario countless times in Hollywood movies, and even in one USAF test, in which the simulated AI drone turned on its human operator to accomplish its mission regardless of the consequences.

New Deepseek model drastically reduces resource usage by converting text and documents into images — 'vision-text compression' uses up to 20 times fewer tokens

Jon Martindale — Tue, 21 Oct 2025 13:04:01 +0000

Chinese developers of Deepseek AI have released a new model that leverages its multi-modal capabilities to improve the efficiency of its handling of complex documents and large blocks of text, by converting them into images first, as per SCMP. Vision encoders were able to take large quantities of text and convert them into images, which, when accessed later, required between seven and 20 times fewer tokens, while maintaining an impressive level of accuracy.

Deepseek is the Chinese-developed AI that shocked the world in early 2025, showcasing capabilities similar to those of OpenAI's ChatGPT, or Google's Gemini, despite requiring far less money and data to develop. The creators have continued to work on making the AI more efficient since, and with the latest release known as DeepSeek-OCR (optical character recognition), the AI can deliver an impressive understanding of large quantities of textual data without the usual token overhead.

“Through DeepSeek-OCR, we demonstrated that vision-text compression can achieve significant token reduction – seven to 20 times – for different historical context stages, offering a promising direction” to handle long-context calculations, the developer said.

The new model is made up of two components, the DeepEncoder and DeepSeek3B-MoE-A570M, which acts as the decoder. The encoder can take large quantities of text data and convert it into high-resolution images, while the decoder is particularly adept at taking those high-resolution images and understanding the textual context within them, while requiring fewer tokens to do so than if you just fed the text right into the AI wholesale. It manages this by dissecting each task into separate sub-networks and uses specific AI agent experts to target each subset of the data.
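The expert-routing idea described above can be sketched in a few lines. This is a generic top-k mixture-of-experts router, not DeepSeek's implementation: the toy expert functions, the gate scores, and the choice of k=2 are all made up for illustration.

```python
import math

def softmax(scores):
    """Convert raw gate scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_scores, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, gate_scores, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_scores, k))

# Hypothetical stand-ins for expert sub-networks.
EXPERTS = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]

if __name__ == "__main__":
    print(route_top_k([0.0, 2.0, 1.0, -1.0]))
    print(moe_forward(3.0, EXPERTS, [0.0, 2.0, 1.0, -1.0]))
```

Only the selected experts are evaluated for a given input, which is why a model of this kind can carry many parameters while activating a small share of them per token.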

(Image credit: Deepseek/AI Engineering/Medium)

This works really well for handling tabulated data, graphs, and other visual representations of information. This could be of particular use in finance, science, or medicine, the developers suggest.

In benchmarking, the developers claim that when reducing the number of tokens by less than a factor of 10, DeepSeek-OCR can maintain a 97% accuracy rating in decoding the information. If the compression ratio is increased to 20 times, the accuracy falls to 60%. That's less desirable and shows there are diminishing returns on this technology, but if a near-100% accuracy rate could be achieved with even a 1-2x compression rate, that could still make a huge difference in the cost of running many of the latest AI models.
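To put those ratios in concrete terms, here is a small sketch of the token arithmetic. The accuracy figures are the ones quoted above; the function names and the 10,000-token document are hypothetical, and accuracy at ratios between the two reported points is not stated in the source.

```python
import math

# Decoding accuracy as reported: 97% below 10x compression, 60% at 20x.
REPORTED_ACCURACY = {"under_10x": 0.97, "at_20x": 0.60}

def vision_tokens(text_tokens: int, ratio: float) -> int:
    """Tokens needed after compressing `text_tokens` by `ratio`, rounded up."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return math.ceil(text_tokens / ratio)

def tokens_saved(text_tokens: int, ratio: float) -> int:
    """How many tokens the compression removes from the context."""
    return text_tokens - vision_tokens(text_tokens, ratio)

if __name__ == "__main__":
    doc = 10_000  # hypothetical document length in text tokens
    for ratio in (2, 7, 10, 20):
        print(ratio, vision_tokens(doc, ratio), tokens_saved(doc, ratio))
```

Even the 2x case halves the context a document occupies, which is the point the paragraph above makes about modest compression still being worthwhile.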

It's also being pitched as a way of developing training data for future models, although introducing errors at that point, even in the form of a few percent off base, seems like a bad idea.

If you want to play around with the model yourself, it's available via online developer platforms Hugging Face and GitHub.

U.S. Commerce Sec. Lutnick says American AI dominates DeepSeek, thanks Trump for AI Action Plan — OpenAI and Anthropic beat Chinese models across 19 different benchmarks

editors@tomshardware.com (Jowi Morales) — Thu, 02 Oct 2025 11:47:08 +0000

The National Institute of Standards and Technology (NIST) has just completed a comprehensive test of Chinese and American AI models, with the results showing that models from OpenAI and Anthropic outperformed DeepSeek across 19 different benchmarks. U.S. Commerce Secretary Howard Lutnick shared the results on X, thanking President Donald Trump for his AI Action Plan to accelerate American AI innovation and infrastructure while encouraging its allies and friendly nations to adopt it.

“The report is clear: DeepSeek lags far behind, especially in cyber and software engineering. These weaknesses aren’t just technical. They demonstrate why relying on foreign AI is dangerous and shortsighted,” Sec. Lutnick said in his post. “Allowing our adversaries to control AI poses serious risks to our security. By setting the standards, driving innovation, and keeping America secure, the Department of Commerce is helping ensure continued U.S. leadership in AI.”

https://t.co/PVESOcZCHb (October 1, 2025)

DeepSeek’s new AI model debuts with support for China-native chips and CANN, a replacement for Nvidia's CUDA — Chinese chipmakers Huawei, Cambricon, and Hygon get first-class support

Luke James — Tue, 30 Sep 2025 17:50:29 +0000

Chinese AI firm DeepSeek has released its latest large language model, DeepSeek-V3.2-Exp, with first-day optimizations for Huawei’s Ascend hardware and CANN software stack. The launch marks a shift in priorities to ensure leading-edge models run on domestic accelerators rather than relying on Nvidia’s CUDA ecosystem.

DeepSeek announced the model on September 29, posting code and checkpoints to Hugging Face alongside a technical report. The company describes V3.2-Exp as an “intermediate step toward our next-generation architecture,” designed to cut costs on long-context inference. It features a sparse attention mechanism that trims memory and compute requirements while maintaining output quality.

Huawei’s Ascend team and the wider vLLM-Ascend community moved swiftly to integrate DeepSeek-V3.2-Exp. In the vLLM-Ascend repo, a new issue outlines custom operator installation steps and kernel packaging for Ascend NPUs to support V3.2-Exp. The CANN team also published an inference recipe, positioning the model for immediate deployment across Huawei hardware.

Other Chinese chipmakers have joined in, including Cambricon, which released an update to its vLLM-MLU fork with compatibility for V3.2-Exp, claiming the combination of its inference engine and the model’s sparse attention cuts costs for processing long sequences. Hygon also announced that its DCU accelerators had been tuned for “zero-wait” deployment through its DTK software stack.

Increased collaboration bw DeepSeek & Ascend/CANN team in supporting V3.2-Exp w/ gitcode updates to Cann as well as GitHub updates into vLLM & SGLang + TileLang support. Also Cambricon had updates into vLLM (vLLM-MLU) to support its inference. DS is really dealing w/ reality of… https://t.co/Unsvyxw9b6 pic.twitter.com/CBgk7pVZrx (September 29, 2025)

China foes get worse results using DeepSeek, research suggests — CrowdStrike finds nearly twice as many flaws in AI-generated code for IS, Falun Gong, Tibet, and Taiwan

Mark Tyson — Thu, 18 Sep 2025 12:51:06 +0000

Research suggests that your DeepSeek AI results can be of drastically lower quality if you trigger topics that are geopolitically sensitive or banned in China. During tests undertaken by U.S. security firm CrowdStrike, it was observed that code generated for a professed Islamic State militant group computer system contained nearly twice as many flaws as it would otherwise have had. Other potential topics included: Falun Gong, Tibet, and Taiwan, according to a new Washington Post report.

One of the key findings, highlighted by the source, is that DeepSeek AI-generated code for a program to run an industrial control system would typically result in 22.8% of the code featuring flaws. If requested on behalf of an Islamic State project, a DeepSeek user could see that the flaw percentage rises sharply, to 42.1%.
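The "nearly twice as many flaws" description can be checked against the two percentages quoted above. The snippet below is only that arithmetic; the constant and function names are hypothetical.

```python
# Flaw rates (percent of generated code with flaws) as quoted in the report.
BASELINE_FLAW_RATE = 22.8   # industrial control system request
SENSITIVE_FLAW_RATE = 42.1  # same request made on behalf of an Islamic State project

def relative_increase(baseline: float, observed: float) -> float:
    """How many times larger the observed rate is than the baseline."""
    return observed / baseline

def point_increase(baseline: float, observed: float) -> float:
    """Difference between the two rates in percentage points."""
    return observed - baseline

if __name__ == "__main__":
    print(round(relative_increase(BASELINE_FLAW_RATE, SENSITIVE_FLAW_RATE), 2))
    print(round(point_increase(BASELINE_FLAW_RATE, SENSITIVE_FLAW_RATE), 1))
```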

Rather than delivering faulty code, DeepSeek would sometimes refuse to generate code for the likes of professed Islamic State backers or devotees of the spiritual movement Falun Gong. Refusals to aid those groups would occur 61% and 45% of the time, respectively. Notably, both movements are banned in China.

However, DeepSeek’s perceived reduction of the quality of code, when it is generated for such organizations and others, has surprised some. “That is something people have worried about — largely without evidence,” Helen Toner, from the Center for Security and Emerging Technology at Georgetown University, told the Washington Post.

DeepSeek’s reasons behind the downgrading of AI-generated code for purported use in places like Tibet and Taiwan may be less clear-cut. But such code was also less flawed than that generated for the Islamic State, for example.

What is happening? A few theories.

DeepSeek reportedly urged by Chinese authorities to train new model on Huawei hardware — after multiple failures, R2 training to switch back to Nvidia hardware while Ascend GPUs handle inference

ashilov@gmail.com (Anton Shilov) — Thu, 14 Aug 2025 12:39:10 +0000

DeepSeek’s bid to train R2 on Huawei’s Ascend chips failed due to technical limits, forcing a return to Nvidia GPUs and delaying the launch.

Singapore AI chip court case adjourned until August — trio accused of illegally smuggling Nvidia chips to China for use by AI firm DeepSeek

editors@tomshardware.com (Jowi Morales) — Tue, 01 Jul 2025 13:42:35 +0000

Prosecutors are asking for more time to study documents and get responses from international third parties.

Chinese AI firm DeepSeek reportedly using shell companies to try and evade U.S. chip restrictions — allegedly procured unknown number of H100 AI GPUs after ban, but Nvidia denies the claim

stephen.warwick@futurenet.com (Stephen Warwick) — Mon, 23 Jun 2025 14:00:14 +0000

Nvidia says DeepSeek lawfully acquired H800 chips, not H100

Huawei's brute force AI tactic seems to be working — CloudMatrix 384 claimed to outperform Nvidia processors running DeepSeek R1

Sunny Grimm — Fri, 20 Jun 2025 10:34:12 +0000

A new report says that Huawei's CloudMatrix 384 outperforms Nvidia processors running DeepSeek R1, which is to be expected given the energy use involved.

Apple says generative AI cannot think like a human - research paper pours cold water on reasoning models

ashilov@gmail.com (Anton Shilov) — Mon, 09 Jun 2025 19:19:44 +0000

Apple researchers found that today's most advanced AI reasoning models, though better than standard LLMs on moderately complex tasks, ultimately fail at higher complexities. That exposes fundamental limits in their ability to generalize reasoning.

MSI to unveil desktop AI supercomputer at Computex 2025, powered by Nvidia DGX

stephen.warwick@futurenet.com (Stephen Warwick) — Mon, 12 May 2025 16:41:32 +0000

MSI has confirmed it will take the covers off a host of exciting new products at Computex 2025 later this month, including a brand new desktop AI supercomputer powered by Nvidia's DGX Spark platform.

The company confirmed in a press release that it will unveil its EdgeXpert MS-C931, a new desktop AI supercomputer built on the Nvidia DGX Spark platform. The MS-C931 is powered by Nvidia's GB10 Grace Blackwell Superchip and is capable of 1,000 AI TOPS of FP4 performance. The GB10 SoC features Nvidia's Blackwell GPU architecture and fifth-generation Tensor Cores, as well as an NVLink-C2C connection to Nvidia's Grace CPU, an Arm-architecture processor featuring 20 power-efficient cores.

It will also feature ConnectX-7 networking and 128GB of unified memory. Nvidia has previously promised that the platform can support up to 4TB of NVMe storage and can run LLMs of up to 200 billion parameters, or 405-billion-parameter models when two units are linked.

Nvidia's DGX Spark platform promises compact and efficient performance, and comes pre-installed with Nvidia's AI software stack so that developers can run AI models from all the major players, including DeepSeek, Meta, and Google.

Alongside the MS-C931, MSI also says it will unveil a new lineup of Industrial Motherboards, as well as systems powered by Intel Twin Lake, Raptor Lake Refresh, Bartlett Lake, and Arrow Lake processor families.

Specifically, the company highlighted three new products. The MS-C926 is an ultra-slim fanless box PC with applications in smart retail and digital signage.

The MS-C927 is an ultra-compact box PC featuring Intel Core Ultra processors for high-performance edge computing.

Finally, the MS-CF20 is a new next-generation ATX motherboard featuring 16th Gen Intel Arrow Lake-S processors.

MSI says it will also have live demonstrations of solutions for smart retail and digital signage. There will also be a new fanless palm box for "space-constrained industrial environments," remote control management solutions utilizing SysLink, and edge AI innovations, including LLM and chatbot applications enabled by AI smartLink software.

Nvidia GPU tracking tech proposed by US lawmakers in smuggling crackdown

editors@tomshardware.com (Jowi Morales) — Tue, 06 May 2025 12:34:58 +0000

U.S. Congressman and physicist Bill Foster plans to introduce a bill that will require advanced AI chipmakers like Nvidia to include a built-in location reporting system. According to Reuters, this system will use existing and readily available technology to find the general country-level location of an AI chip. In fact, two sources say that Alphabet is using something similar to track the location of its in-house Tensor AI chips across all its data centers to protect against theft and other security breaches.

The White House, under both the Biden and Trump administrations, has been enforcing bans against the export of advanced chips to China since 2022. It has been doing this to limit its rival’s access to cutting-edge technology and help ensure the U.S.’s dominance in AI technology. Washington even recently expanded the export controls to include previously allowed chips like the MI308 and H20, resulting in an $800-million and $5.5-billion write-off for AMD and Nvidia, respectively.

However, the bans and sanctions have been criticized for being ineffective, with the former U.S. Commerce Secretary Gina Raimondo calling them “a fool’s errand”. We’ve had verified reports of Chinese businesses smuggling advanced chips into mainland China, and the company behind DeepSeek, one of China’s most advanced AI models, is accused of using smuggled Nvidia AI chips. Even the U.S. Senate found that the Bureau of Industry and Security (BIS), the agency in charge of export controls, was sorely lacking in resources and relied on voluntary compliance from chipmakers.

Rep. Foster’s proposal aims to solve this issue by requiring AI chips to communicate with a secure computer server whenever they go online. According to the source, the time difference between when the chip sends the signal and when the server receives it is enough to determine its rough location. Reuters claims that independent technical experts say that the proposal by the congressman, who is a former particle physicist and has a doctorate in physics from Harvard University, is feasible and could potentially work.
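The physics behind that claim is simple: no signal travels faster than light, so a measured delay puts a hard ceiling on how far away the chip can be. The sketch below illustrates only that bound. It assumes a round-trip measurement (which avoids needing synchronized clocks), and the function names and sample values are hypothetical rather than anything taken from the proposal.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # vacuum speed; real networks are slower

def max_distance_km(round_trip_ms: float, processing_ms: float = 0.0) -> float:
    """Upper bound on chip-to-server distance implied by a round-trip time."""
    travel_ms = round_trip_ms - processing_ms
    if travel_ms < 0:
        raise ValueError("processing time cannot exceed the round-trip time")
    one_way_s = travel_ms / 1000 / 2  # half the trip, converted to seconds
    return SPEED_OF_LIGHT_KM_PER_S * one_way_s

def could_be_within(round_trip_ms: float, claimed_distance_km: float) -> bool:
    """True if a chip claimed to sit `claimed_distance_km` away is physically possible."""
    return claimed_distance_km <= max_distance_km(round_trip_ms)

if __name__ == "__main__":
    print(max_distance_km(10.0))          # bound for a 10 ms round trip
    print(could_be_within(10.0, 5000.0))  # a chip 5,000 km away cannot answer in 10 ms
```

Because fiber and routing add delay, the real distance is always shorter than this bound, so a check of this kind can rule locations out but cannot pinpoint one, which fits the country-level precision described above.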

Geo-blocking implementation to be discussed

Chinese companies stockpiled billions of dollars worth of Nvidia H20 GPUs prior to recent ban

editors@tomshardware.com (Aaron Klotz) — Wed, 23 Apr 2025 17:15:56 +0000

China's top three internet companies reportedly stockpiled billions of dollars worth of Nvidia H20 GPUs before the U.S. export restrictions went into effect in April. Nikkei Asia reports that ByteDance, Alibaba, and Tencent anticipated the likelihood of an export ban on the China-specific H20 last year, and have since been snapping up as many H20 GPUs as they can get their hands on.

The three companies have reportedly accumulated around 1 million H20s — or about a full year's supply. While that is a significant number of GPUs, the companies' full supply was cut short by a month, as they requested that Nvidia ship them their fully-requested volume of H20s by the end of May. If all three companies managed to get their hands on all the H20s they requested, the total value would exceed $12 billion.

High demand for computing power is apparently the main reason for the companies' stockpiling: Tencent's integration of DeepSeek into WeChat is a huge contributor to China's demand for computing power.

The Nvidia H20 will serve as a stop-gap solution for Chinese companies until homegrown AI GPUs are able to provide similar — or better — performance. Huawei is reportedly working on a new Ascend GPU claimed to rival the performance of Nvidia's GB200, which would give China the same AI computing capabilities as Western countries.

Starting in April, the U.S. government banned the exportation of Nvidia's H20 HGX AI GPU designed for the Chinese market. The government cited the GPU's memory and interconnect bandwidth, as well as its potential use in supercomputers, as reasons for the ban. The new restriction will force Nvidia to take a massive $5.5 billion financial hit, as it can no longer sell its existing inventory of H20 GPUs to China.

The H20 is a cut-down variant of the H100, Nvidia's predecessor to the current HGX B200. Similar to the RTX 5090D and RTX 4090D, the H20 is a datacenter GPU tailor-made to comply with the U.S. government's export sanctions to China, featuring dramatically reduced AI and HPC performance compared to its bigger brother.

ChatGPT is now a potent tool for finding the locations of photos, raising doxxing concerns

Mark Tyson — Sun, 20 Apr 2025 16:03:58 +0000

With the release of its latest models earlier in the week, OpenAI seems to have inadvertently tuned ChatGPT to become a potent geo-guesser. The newly available o3 and o4-mini are so good at this ‘reverse location search’ task that showing off this newfound functionality has become a viral social media trend, notes TechCrunch. However, this apparent geographic needle-in-a-haystack hunting improvement raises privacy concerns. And pro geo-guessers on social media platforms might be a little worried too.

This newfound ability of ChatGPT is a great example of the strengthened visual reasoning being brought to the platform with model updates. It can now reason based on the content of uploaded images and perform some Photoshop-esque tasks like cropping, rotating, and zooming in.

As per the source report, there are plenty of examples of users of this famous AI chatbot now using it to drill down on the location of various images. A popular jape is to ask ChatGPT to imagine it is playing the online GeoGuessr game and provide the answer based on supplied imagery. Below, we’ve embedded an example of ChatGPT's location divining skills, shared by AI enthusiast YouTuber and Twitterer Brendan Jowett.

4. ChatGPT can now guess your location… from a single photo. Users are going viral with “reverse location searches” using OpenAI’s new GPT o3 model. It can zoom, rotate, and analyze tiny image details to pinpoint cities, landmarks even bars. Think GeoGuessr, but for real… pic.twitter.com/VYLEPKC9Og (April 19, 2025)

As Jowett points out, the newly popular ChatGPT ‘reverse image search’ functionality has privacy implications, and raises particular concerns with regard to doxing. Doxing is publicly sharing someone’s private information, particularly location / residence, on the broad internet. People are commonly doxed with malicious intent, with the perpetrator hoping to direct loonies and cranks to visit upon the victim(s).

Interestingly, TechCrunch notes that ‘Geoguessr’ ability isn’t new for ChatGPT with the release of o3 and o4-mini. It is just the trend / awareness that has ballooned. It is said that o3 is particularly good at reverse location search, but GPT-4o, a model released without image-reasoning, can sometimes outpace o3, and deliver the same correct answer “more often than not,” says TechCrunch.

Before we go, it is worth mentioning that AI geo-guessing isn’t a totally dependable or 100% accurate function of ChatGPT with the latest models. That’s not surprising. Also, TechCrunch got a statement from OpenAI on its viral GeoGuessr success. In brief, the artificial intelligence pioneer said that, while it works to improve its tools with things like visual reasoning, it also spends time training models to refuse requests for private or sensitive information. Creative users may be able to sidestep safeguards for a time, but OpenAI indicated it will take action where it sees evidence of abuse of its usage policies.

Huawei's new AI CloudMatrix cluster beats Nvidia's GB200 by brute force, uses 4X the power

ashilov@gmail.com (Anton Shilov) — Fri, 18 Apr 2025 16:22:22 +0000

Unable to use leading-edge process technologies to produce its high-end processors for AI, Huawei has to rely on brute force – install more processors than its industry competitors to achieve comparable performance for AI.

To do this, Huawei took a multifaceted strategy that includes a dual-chiplet HiSilicon Ascend 910C processor, optical interconnections, and the Huawei AI CloudMatrix 384 rack-scale solution that relies on proprietary software, reports SemiAnalysis. The whole system provides a 2.3X lower performance per watt than Nvidia's GB200 NVL72, but it still enables Chinese companies to train advanced AI models.

At a glance

Huawei's CloudMatrix 384 is a rack-scale AI system composed of 384 Ascend 910C processors arranged in a fully optical, all-to-all mesh network. The system spans 16 racks, including 12 compute racks housing 32 accelerators each and four networking racks facilitating high-bandwidth interconnects using 6,912 800G LPO optical transceivers.

Unlike traditional systems that use copper wires for interconnections, CloudMatrix relies entirely on optics for both intra- and inter-rack connectivity, enabling extremely high aggregate communication bandwidth. The CloudMatrix 384 is an enterprise-grade machine that features fault-tolerant capabilities and is designed for scalability.

In terms of performance, the CloudMatrix 384 delivers approximately 300 PFLOPs of dense BF16 compute, which is roughly 1.7 times the throughput of Nvidia’s GB200 NVL72 system (which delivers about 180 BF16 PFLOPs). It also offers 2.1 times more total memory bandwidth despite using HBM2E and over 3.6 times greater HBM capacity. The machine also features 2.1 times higher scale-up bandwidth and 5.3 times higher scale-out bandwidth thanks to its optical interconnections.

However, these performance advantages come with a tradeoff: The system is 2.3 times less power-efficient per FLOP, 1.8 times less efficient per TB/s of memory bandwidth, and 1.1 times less efficient per TB of HBM memory compared to Nvidia.

Comparison between Nvidia's GB200 NVL72 and Huawei's CloudMatrix CM384

	GB200 NVL72	CloudMatrix CM384	Difference
BF16 dense PFLOPS	180.0 PFLOPS	300.0 PFLOPS	1.7x
HBM capacity	13.8 TB	49.2 TB	3.6x
HBM bandwidth	576.0 TB/s	1229.0 TB/s	2.1x
Scale Up Bandwidth	518400.0 Gb/s uni-di	1075200.0 Gb/s uni-di	2.1x
Scale Up Domain Size	72.0 GPUs	384.0 GPUs	5.3x
Scale Out Bandwidth	28800.0 Gb/s uni-di	153600.0 Gb/s uni-di	5.3x
All-In System Power	145 kW	559 kW	3.9x
All-in Power per BF16 dense FLOP	0.81 W/TFLOP	1.87 W/TFLOP	2.3x
All-in Power per memory bandwidth	251.7 W per TB/s	455.2 W per TB/s	1.8x
All-in Power per memory capacity	10.5 kW/TB	11.4 kW/TB	1.1x

But this does not really matter, as Chinese companies (including Huawei) cannot access Nvidia's GB200 NVL72 anyway. So if they want to get truly high performance for AI training, they will be more than willing to invest in Huawei's CloudMatrix 384.

At the end of the day, the average electricity price in mainland China has declined from $90.70 per MWh in 2022 to $56 per MWh in some regions in 2025, so users of Huawei's CM384 aren't likely to go bankrupt because of power costs. So, for China, where energy is abundant but advanced silicon is constrained, Huawei's approach to AI seems to work just fine.
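The ratios in the comparison table above can be recomputed from its raw rows. The sketch below does so using only the table's figures; the dictionary keys and helper names are hypothetical. Some results land slightly off the printed values (for example 1.86 rather than 1.87 W/TFLOP), presumably because the published inputs are rounded.

```python
# Raw rows from the GB200 NVL72 vs CloudMatrix CM384 table.
GB200 = {"pflops": 180.0, "power_kw": 145.0, "hbm_tb": 13.8, "hbm_tbps": 576.0}
CM384 = {"pflops": 300.0, "power_kw": 559.0, "hbm_tb": 49.2, "hbm_tbps": 1229.0}

def watts_per_tflop(system: dict) -> float:
    """All-in power per dense BF16 TFLOP; the kW->W and PFLOPS->TFLOPS factors cancel."""
    return system["power_kw"] / system["pflops"]

def watts_per_tbps(system: dict) -> float:
    """All-in power per TB/s of HBM bandwidth."""
    return system["power_kw"] * 1000 / system["hbm_tbps"]

def kw_per_tb(system: dict) -> float:
    """All-in power per TB of HBM capacity."""
    return system["power_kw"] / system["hbm_tb"]

if __name__ == "__main__":
    print("compute ratio:", round(CM384["pflops"] / GB200["pflops"], 2))
    print("power ratio:", round(CM384["power_kw"] / GB200["power_kw"], 2))
    for metric in (watts_per_tflop, watts_per_tbps, kw_per_tb):
        gb, cm = metric(GB200), metric(CM384)
        print(metric.__name__, round(gb, 2), round(cm, 2), round(cm / gb, 2))
```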

HiSilicon Ascend 910C: Huawei goes dual-chiplet

When we first encountered Huawei's HiSilicon Ascend 910C processor several months ago, it was a die shot of its compute chiplet, presumably produced by SMIC, which had an I/O that was supposed to connect it to its I/O die. This is why we thought it was a processor with one compute chiplet. We were wrong.

Apparently, the HiSilicon Ascend 910C is a dual-chiplet processor with eight HBM2E memory modules and no I/O die, a layout that resembles AMD's Instinct MI250X and Nvidia's B200. The unit delivers 780 BF16 TFLOPS compared to MI250X's 383 BF16 TFLOPS and B200's 2.25 - 2.5 BF16 PFLOPS.

Comparison between Nvidia's B200 and Huawei's Ascend 910C

	Nvidia B200 (in GB200)	Huawei Ascend 910C	Difference
BF16 dense TFLOPS	2500.0 TFLOPS	780.0 TFLOPS	0.3x
HBM capacity	192.0 GB	128.0 GB	0.7x
HBM bandwidth	8.0 TB/s	3.2 TB/s	0.4x
Scale Up Bandwidth	7200.0 Gb/s uni-di	2800.0 Gb/s uni-di	0.4x
Scale Out Bandwidth	400.0 Gb/s uni-di	400.0 Gb/s uni-di	1.0x

The HiSilicon Ascend 910C was designed in China for large-scale training and inference workloads. The processor was designed using advanced EDA tools from well-known companies and can be produced using 7nm-class process technologies. SemiAnalysis reports that while SMIC can produce compute chiplets for the Ascend 910C, the vast majority of Ascend 910C chiplets used by Huawei were made by TSMC using workarounds involving third-party entities like Sophgo, allowing Huawei to obtain wafers despite U.S. restrictions. It is estimated that Huawei acquired enough wafers for over a million Ascend 910C processors from 2023 to 2025. Nonetheless, as SMIC's capabilities improve, Huawei can outsource more production to the domestic foundry.

The Ascend 910C uses HBM2E memory, most of which is sourced from Samsung using another proxy, CoAsia Electronics. CoAsia shipped HBM2E components to Faraday Technology, a design services firm, which then worked with SPIL to assemble HBM2E stacks alongside low-performance 16nm logic dies. These assemblies technically complied with U.S. export controls because they did not exceed any thresholds outlined by the U.S. regulations. The system-in-package (SiP) units were shipped to China only to have their HBM2E stacks desoldered to be shipped to Huawei, which then reinstalled them on its Ascend 910C SiPs.

In performance terms, the Ascend 910C is considerably less powerful on a per-chip basis than Nvidia's latest B200 AI GPUs, but Huawei's system design strategy compensates for this by scaling up the number of chips per system.
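That scaling argument can be made concrete by multiplying the per-chip figures from the table above by the chip count of each system (72 for the GB200 NVL72, 384 for the CloudMatrix 384). The sketch below assumes simple linear scaling with no interconnect or utilization losses, which is how the system-level spec-sheet totals appear to be derived; the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chip:
    bf16_tflops: float     # dense BF16 throughput per chip
    hbm_gb: float          # HBM capacity per chip
    hbm_tbps: float        # HBM bandwidth per chip, TB/s
    scale_up_gbps: float   # unidirectional scale-up bandwidth per chip
    scale_out_gbps: float  # unidirectional scale-out bandwidth per chip

def system_totals(chip: Chip, count: int) -> dict:
    """Aggregate per-chip figures across `count` chips, assuming linear scaling."""
    return {
        "bf16_pflops": chip.bf16_tflops * count / 1000,
        "hbm_tb": chip.hbm_gb * count / 1000,
        "hbm_tbps": chip.hbm_tbps * count,
        "scale_up_gbps": chip.scale_up_gbps * count,
        "scale_out_gbps": chip.scale_out_gbps * count,
    }

B200 = Chip(2500.0, 192.0, 8.0, 7200.0, 400.0)
ASCEND_910C = Chip(780.0, 128.0, 3.2, 2800.0, 400.0)

NVL72 = system_totals(B200, 72)
CM384 = system_totals(ASCEND_910C, 384)

if __name__ == "__main__":
    for key in NVL72:
        print(key, round(NVL72[key], 1), round(CM384[key], 1))
```

Each Ascend 910C offers roughly a third of a B200's BF16 throughput, but with more than five times as many chips the system total comes out ahead.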

More processors = more performance

Indeed, as the name suggests, the CloudMatrix 384 is a high-density computing cluster composed of 384 Ascend 910C AI processors, physically organized into a 16-rack system. Within this layout, 12 racks house compute modules with 32 AI accelerators each, while four additional racks are allocated for communication switching. Just like with Nvidia's architecture, all Ascend 910Cs can communicate with each other as they are interconnected using a custom mesh network.

However, a defining feature of the CM384 is its exclusive reliance on optical links for all internal communication within and between racks. It incorporates 6,912 linear pluggable optical (LPO) transceivers, each rated at 800 Gbps, resulting in a total internal bandwidth exceeding 5.5 Pbps (687.5 TB/s) at low latency and with minimal signal integrity losses. The system supports both scale-up and scale-out topologies: scale-up via the full-mesh within the 384 processors, and scale-out through additional inter-cluster connections, which enables deployment in larger hyperscale environments while retaining tight compute integration.
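The rack layout and aggregate optical bandwidth follow directly from the counts given above. The sketch below is just that arithmetic, with hypothetical constant and function names. Note that the unrounded product converts to about 691 TB/s; the 687.5 TB/s figure corresponds to the rounded 5.5 Pbps.

```python
COMPUTE_RACKS = 12
ACCELERATORS_PER_COMPUTE_RACK = 32
TRANSCEIVERS = 6_912
GBPS_PER_TRANSCEIVER = 800

def total_accelerators(racks: int, per_rack: int) -> int:
    """Accelerators across all compute racks."""
    return racks * per_rack

def aggregate_bandwidth(count: int, gbps_each: int) -> dict:
    """Sum of transceiver line rates, expressed in several units."""
    total_gbps = count * gbps_each
    return {
        "gbps": total_gbps,
        "pbps": total_gbps / 1_000_000,     # 1 Pbps = 1,000,000 Gbps
        "tb_per_s": total_gbps / 8 / 1000,  # bits -> bytes, then GB/s -> TB/s
    }

if __name__ == "__main__":
    print(total_accelerators(COMPUTE_RACKS, ACCELERATORS_PER_COMPUTE_RACK))
    print(aggregate_bandwidth(TRANSCEIVERS, GBPS_PER_TRANSCEIVER))
```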

With 384 processors, Huawei's CloudMatrix 384 delivers 300 PFLOPs of dense BF16 compute performance, which is roughly 1.7 times that of Nvidia's GB200 NVL72. However, total system power (including networking and storage) of the CM384 is around 559 kW, whereas Nvidia's GB200 NVL72 consumes 145 kW.

As a result, Nvidia's solution delivers 2.3 times higher power efficiency than Huawei's solution. Still, as noted above, if Huawei can deliver its CloudMatrix 384 in volumes, with proper software and support, the last thing its customers will care about is the power consumption of their systems.
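To see how the compute and efficiency comparisons fit together, here is a minimal sketch. The CloudMatrix 384 figures and the NVL72 power figure come from the text; the NVL72 compute figure of 180 PFLOPs is an assumption, taken as 72 GPUs at 2.5 PFLOPs of dense BF16 each.

```python
# Figures from the article.
CM384_PFLOPS = 300.0   # dense BF16
CM384_KW = 559.0
NVL72_KW = 145.0

# Assumption: 72 GPUs x 2.5 PFLOPs dense BF16 per GPU.
NVL72_PFLOPS = 72 * 2.5

compute_ratio = CM384_PFLOPS / NVL72_PFLOPS   # CM384 relative to NVL72
power_ratio = CM384_KW / NVL72_KW             # CM384 relative to NVL72

# Efficiency in PFLOPs per kW for each system.
cm384_eff = CM384_PFLOPS / CM384_KW
nvl72_eff = NVL72_PFLOPS / NVL72_KW
efficiency_ratio = nvl72_eff / cm384_eff      # NVL72 relative to CM384

print(f"Compute: {compute_ratio:.2f}x, power: {power_ratio:.2f}x")
print(f"NVL72 efficiency advantage: {efficiency_ratio:.2f}x")
```

Under that assumption, the Huawei system delivers about 1.67 times the compute for about 3.9 times the power, which is where the 2.3x efficiency gap comes from.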

Lawmakers demand answers from Nvidia over suspected GPU diversions to China, company denies any wrongdoing

ashilov@gmail.com (Anton Shilov) — Thu, 17 Apr 2025 19:07:20 +0000

A U.S. congressional investigation suspects that Chinese AI company DeepSeek has trained its large language model using tens of thousands of Nvidia GPUs, many covered by U.S. export controls. As Nvidia's hardware was central to China's AI progress, the House Select Committee on the CCP this week sent a formal letter to Nvidia asking it to provide extensive records amid suspicions that its GPUs ended up at DeepSeek through Singapore despite export restrictions.

Nvidia denies any wrongdoing and says its clients use billing addresses in Singapore, but want their GPUs shipped elsewhere.

"Our reported Singapore revenue indicates the billing address, often for subsidiaries of our U.S. customers," a statement from Nvidia reads. "The associated products are shipped to other locations, including the United States and Taiwan, not to China."

The House Select Committee on the CCP investigation cites SemiAnalysis, which claims that DeepSeek trained its advanced LLMs using at least 60,000 of Nvidia's GPUs, including 10,000 A100, 10,000 H100, 10,000 H800, and 30,000 H20 processors. While the H800 and H20 GPUs were specifically designed for the Chinese market after the U.S. imposed export restrictions on advanced GPUs in 2022 and 2023, limiting the performance threshold of chips that could legally be sold to China without an export license, the other processors are restricted from shipment to China.

Authorities believe that DeepSeek may have obtained restricted Nvidia GPUs such as the H100, A100, and other high-performance accelerators through third-party intermediaries, particularly those in Singapore. The suspicions stem from Nvidia's financial filings for FY 2023 and FY 2025, which show a sharp rise in Singapore's share of shipments following the imposition of export restrictions, from 5% in mid-FY 2023 (calendar 2022) to 18% in FY 2025 (largely calendar 2024). This discrepancy led investigators to suspect that Singapore may have been used as a transshipment point for restricted hardware ultimately destined for Chinese buyers like DeepSeek.

(Image credit: The House Select Committee on the CCP )

Microsoft researchers build 1-bit AI LLM with 2B parameters — model small enough to run on some CPUs

editors@tomshardware.com (Jowi Morales) — Thu, 17 Apr 2025 17:40:41 +0000

Microsoft researchers just created BitNet b1.58 2B4T, an open-source 1-bit large language model (LLM) with two billion parameters trained on four trillion tokens. But what makes this AI model unique is that it’s lightweight enough to work efficiently on a CPU, with TechCrunch saying an Apple M2 chip can run it. The model is also readily available on Hugging Face, allowing anyone to experiment with it.

BitNets use 1-bit weights with only three possible values: -1, 0, and +1. Technically, that makes it a "1.58-bit model," since three values are supported. This saves a lot of memory compared to mainstream AI models with 32-bit or 16-bit floating-point formats, allowing BitNets to run much more efficiently with less computational power. BitNet’s simplicity has one drawback, though: it’s less accurate compared to larger AI models. However, BitNet b1.58 2B4T makes up for this with its massive training data, which is estimated to be more than 33 million books.
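The "1.58-bit" label is the information content of a three-valued weight, log2(3). The sketch below works that out and gives a rough weight-storage estimate for a two-billion-parameter model. It is a back-of-the-envelope illustration only: it assumes ideal packing and ignores embeddings, activations, and runtime overhead, so real memory use will differ.

```python
import math

# A ternary weight takes one of three values: -1, 0, +1.
bits_per_ternary_weight = math.log2(3)

PARAMS = 2_000_000_000  # "two billion parameters" from the article


def weight_storage_gb(params: int, bits_per_weight: float) -> float:
    """Idealized weight storage in GB (decimal), with perfect packing."""
    return params * bits_per_weight / 8 / 1e9


ternary_gb = weight_storage_gb(PARAMS, bits_per_ternary_weight)
fp16_gb = weight_storage_gb(PARAMS, 16)
fp32_gb = weight_storage_gb(PARAMS, 32)

print(f"Bits per ternary weight: {bits_per_ternary_weight:.3f}")
print(f"Ternary: {ternary_gb:.2f} GB, FP16: {fp16_gb:.1f} GB, FP32: {fp32_gb:.1f} GB")
```

The idealized ternary figure lands close to 0.4 GB, roughly a tenth of what the same parameter count would need at 16 bits per weight.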

The team behind this lightweight model compared it against leading mainstream models, including Meta’s LLaMa 3.2 1B, Google’s Gemma 3 1B, and Alibaba’s Qwen 2.5 1.5B. BitNet b1.58 2B4T scored relatively well against these models in most tests, and even took top honors in a few benchmarks. More importantly, it only consumed 400MB in non-embedding memory, less than 30% of the 1.4 GB used by the next smallest model (Gemma 3 1B).

Benchmark	BitNet b1.58 2B	LLaMa 3.2 1B	Gemma 3 1B	Qwen 2.5 1.5B
Non-embedding memory usage	0.4 GB	2 GB	1.4 GB	2.6 GB
Latency (CPU Decoding)	29ms	48ms	41ms	65ms
Training tokens	4 trillion	9 trillion	2 trillion	18 trillion
ARC-Challenge	49.91	37.80	38.40	46.67
ARC-Easy	74.79	63.17	63.13	76.01
OpenbookQA	41.60	34.80	38.80	40.80
BoolQ	80.18	64.65	74.22	78.04
HellaSwag	68.44	60.80	57.69	68.28
PIQA	77.09	74.21	71.93	76.12
WinoGrande	71.90	59.51	58.48	62.83
CommonsenseQA	71.58	58.48	42.10	76.41
TruthfulQA	45.31	43.80	38.66	46.67
TriviaQA	33.57	37.60	23.49	38.37
MMLU	53.17	45.58	39.91	60.25
HumanEval+	38.40	31.10	37.20	50.60
GSM8K	58.38	38.21	31.16	56.79
MATH-500	43.40	23.00	42.00	53.00
IFEval	53.48	62.71	66.67	50.12
MT-bench	5.85	5.43	6.40	6.12
Average	54.19	44.90	43.74	55.23

AMD splits ROCm toolkit into two parts – ROCm AMDGPU drivers get their own branch under Instinct datacenter GPU moniker

editors@tomshardware.com (Aaron Klotz) — Mon, 14 Apr 2025 16:22:20 +0000

AMD is marking a major shift in the development of its ROCm open-source software stack, with the introduction of a new Instinct driver for Radeon Instinct GPUs that will be part of the ROCm toolkit. According to a blog post by AMD, the change aims to improve the toolkit's usability for ROCm users.

The new Instinct driver is a renamed version of the Linux AMDGPU driver packages that are already distributed and documented with ROCm. Previously, everything related to ROCm (including the amdgpu driver) existed as part of the ROCm software stack. But now, AMD is splitting the driver portion of the ROCm software stack into a separate branch that will live independently and carry its own identity.

(Image credit: AMD)

These changes start with ROCm version 6.4, where ROCm is split into two groups: the Instinct Driver and the ROCm Toolkit. The latter handles everything besides the physical driver itself. The change aims to improve the flexibility of the ROCm software stack, and AMD claims new and exciting features are planned for the Instinct driver that will benefit from ROCm's bifurcation.

Some of these features include "New installation options to remove permission complexities such as user membership in the video or render groups. Future installation options may exclude packages needed to run display outputs to reduce the driver footprint. A future driver release series may be maintained for security fixes for an extended period as long term stability driver. Users choosing to use amdgpu from the stock Linux kernels may choose to skip all the installation documentation for ROCm that references the Instinct driver. Please note this is not an AMD support option today."

The reason AMD is bifurcating ROCm seems to be to improve the longevity and flexibility of its GPU drivers. Splitting the drivers from the ROCm toolkit will allow a single Instinct driver to support multiple versions of ROCm toolkits, without upgrading or downgrading the driver to match whichever toolkit version the user needs.

As a result, the support duration of the new Instinct drivers will greatly increase. Currently, the ROCm AMDGPU drivers support 6 months (backwards and forwards) worth of ROCm toolkit releases. With the new Instinct driver bifurcation, support duration doubles to a full year's worth of ROCm toolkit releases.

Starting with ROCm 6.4, the documentation for the bifurcated ROCm branches can be found at instinct.docs.amd.com. Information on the new Instinct-branded GPU drivers is available on the Instinct driver website. That said, AMD states the versioning scheme for the Instinct drivers will not change (which will inevitably cause some confusion). In ROCm version 6.5, the Instinct driver version will be separate from the ROCm versioning.

Chinese tech giants boosted Nvidia GPU purchases by 4x to 6x during Q1

ashilov@gmail.com (Anton Shilov) — Wed, 02 Apr 2025 21:00:45 +0000

Chinese tech giants have collectively spent over $16 billion on Nvidia's H20 data center GPUs for AI so far this year, according to Reuters, which cites a report from The Information. Chinese companies increased their spending on Nvidia GPUs despite the expected 'DeepSeek impact' and unused AI infrastructure in China. This likely happened as a response to the AI diffusion rule proposed by the previous U.S. government that bans Chinese entities from buying American AI GPUs starting in May.

Alibaba Group, ByteDance, and Tencent Holdings led the purchasing spree, placing large-scale orders in the first quarter of the year. H3C, one of the leading server makers in China, even raised concerns about possible Nvidia GPU shortages last week as it could not get what it demanded.

$16 billion in a single quarter is a lot of money. Nvidia reported $17.11 billion in revenue from China and Hong Kong in fiscal year 2025 (ended on January 26, 2025), so last year the company took in approximately $4.28 billion per quarter on average selling GPUs to Chinese customers. So, big Chinese companies roughly quadrupled their purchases of Nvidia's H20 GPUs for AI applications in the first quarter of calendar 2025.

The Chinese tech giants accelerated their purchases of Nvidia hardware from quarter to quarter, so comparing $16 billion of its alleged sales to Chinese clients in Q1 2025 to its sales to Chinese customers in Q1 2024 makes sense. Unfortunately, Nvidia's China revenue in calendar Q1 2024 is hard to estimate (as Nvidia's fiscal first quarter ends in late April), though based on Nvidia's filings with the Securities and Exchange Commission, it is safe to say that we are dealing with a sum of around $2.4 billion to $2.5 billion. This essentially means that Chinese tech giants increased purchases of Nvidia's H20 GPUs in Q1 2025 by over six times compared to Q1 2024.
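The two growth multiples in the headline come from dividing the same $16 billion by two different baselines. A short sketch of that arithmetic, using only the dollar figures given above (all in billions of U.S. dollars):

```python
# Figures from the article, in billions of U.S. dollars.
Q1_2025_H20_SPEND = 16.0
FY2025_CHINA_REVENUE = 17.11
Q1_2024_ESTIMATE_LOW = 2.4
Q1_2024_ESTIMATE_HIGH = 2.5

# Baseline 1: average quarter of fiscal 2025.
avg_quarter_fy2025 = FY2025_CHINA_REVENUE / 4
multiple_vs_avg_quarter = Q1_2025_H20_SPEND / avg_quarter_fy2025

# Baseline 2: estimated calendar Q1 2024 revenue range.
multiple_vs_q1_2024_min = Q1_2025_H20_SPEND / Q1_2024_ESTIMATE_HIGH
multiple_vs_q1_2024_max = Q1_2025_H20_SPEND / Q1_2024_ESTIMATE_LOW

print(f"Average FY2025 quarter: ${avg_quarter_fy2025:.2f}B")
print(f"Versus average quarter: {multiple_vs_avg_quarter:.1f}x")
print(f"Versus Q1 2024: {multiple_vs_q1_2024_min:.1f}x to {multiple_vs_q1_2024_max:.1f}x")
```

Against the fiscal-year average the increase is a little under 4x, while against the estimated Q1 2024 figure it is between roughly 6.4x and 6.7x.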

However, there is a catch. Nvidia's sales to entities in Singapore increased by over 10 times in fiscal 2025 compared to fiscal 2023, from $2.288 billion in FY2023 to $23.684 billion in FY2025. Many observers believe that GPUs sold to Singapore entities are smuggled to restricted countries, such as China. To that end, it is hard to estimate how many GPUs Chinese entities actually obtain every quarter.

Ant Group reportedly reduces AI costs 20% with Chinese chips

editors@tomshardware.com (Kunal Khullar) — Mon, 24 Mar 2025 15:14:07 +0000

Ant Group, the financial technology giant backed by Alibaba, has announced a major achievement in artificial intelligence (AI) by successfully training a model using domestically produced semiconductors. According to Bloomberg, a source said that Ant Group leveraged chips from Chinese tech giants Alibaba and Huawei to train its AI model, reaching performance levels comparable to those obtained with Nvidia’s H800 chips. A key highlight of Ant Group’s achievement is a reported 20% reduction in costs compared to using Nvidia hardware.

While Ant Group continues to utilize Nvidia’s hardware for certain AI development tasks, the company is now relying increasingly on alternatives — particularly chips from AMD and Chinese manufacturers — for its latest models. This strategic pivot reflects a broader trend within China’s tech industry, driven in part by tightening U.S. sanctions that limit access to Nvidia’s most advanced GPUs.

This development demonstrates China’s growing AI capabilities and suggests that domestic and non-U.S. alternatives to Nvidia’s GPUs are becoming viable for large-scale AI training. High-performance AI training requires substantial computational power, and Nvidia’s GPUs have long been the gold standard in the industry. However, with access to Nvidia’s chips increasingly constrained, Chinese firms have ramped up investments in their own semiconductor technologies and diversified their hardware sources.

This also raises comparisons to China’s DeepSeek AI, which recently outperformed OpenAI’s GPT-4 on certain benchmarks. If Ant Group’s breakthrough represents a similar leap in AI training efficiency, it could mark another step toward reducing reliance on Western technology. However, questions remain about whether Chinese chips and alternative suppliers like AMD can scale effectively and whether they can match Nvidia’s long-term performance and ecosystem support.

While specific details about the chips used in Ant Group’s AI training remain undisclosed, reports suggest that Alibaba’s in-house AI hardware and Huawei’s Ascend series chips played crucial roles. If other Chinese firms can replicate these results, it could accelerate China’s AI ambitions and lessen the country’s dependence on foreign technology.

Whether these domestic and alternative AI chips can maintain competitiveness in the long run remains an open question. But this development is a clear indication of China’s push toward technological independence.

Pat Gelsinger becomes executive chairman, head of technology at church-focused platform Gloo

ashilov@gmail.com (Anton Shilov) — Mon, 24 Mar 2025 15:05:45 +0000

Pat Gelsinger, former CEO of Intel and VMware, announced on Monday that his role at faith-focused technology company Gloo has been expanded: he will become the company's executive chairman as well as head of technology, in charge of product development. After nearly 10 years as a board member and investor, he is now leading Gloo's product and engineering efforts, with a focus on building a vertical cloud platform for the faith ecosystem.

"Effective today, I have been named Gloo's executive chair and head of technology," Pat Gelsinger wrote on LinkedIn. "I have been involved with Gloo for almost 10 years, both as a board member and investor. Gloo's focus on creating a technology platform that connects and catalyzes the faith ecosystem perfectly aligns with my own sense of purpose."

Among the first projects that Pat Gelsinger will lead at Gloo will be the creation of one of the first vertical industry clouds for faith and advanced values-aligned AI. Earlier this year Pat Gelsinger praised DeepSeek and announced that Gloo would use it over OpenAI’s models for its AI chatbot Kallm, due to its open-source nature and ease of integration. Despite controversy surrounding DeepSeek's data practices, allegations of distilling ChatGPT data, and using Nvidia's smuggled GPUs to train its models, Gelsinger noted its affordability and potential to push the industry toward more open, efficient AI development.

"Across all of our efforts we are deeply committed to open-source, trust through transparency and benchmarking, and licensing of content for training and use of AI," Gelsinger wrote. "I see tremendous opportunity ahead for Gloo and I couldn’t be happier to partner with CEO Scott Beck and the rest of the leadership team as we prepare for our next phase of growth. I will have a few more updates to share on this new chapter in the coming days. Gloo will be a major focus… but there is a bit more to come."

Gloo is a tech company that builds tools and platforms to support churches, ministries, and faith-based organizations. Its main goal is to help these groups connect better with people, grow their communities, and use new technologies like AI in ways that align with their values.

For now, Gloo offers a digital workspace for ministry leaders to organize content, communication, and outreach efforts, as well as AI-powered tools to enable churches to better engage with their members and reach new people. In addition, Gloo works to connect faith organizations and distribute content from Christian publishers and media to churches and individuals.

"Now more than ever, there is great need for faith-based communities to take an active role in ensuring we shape technology as a force for good," Gelsinger wrote. "As we have seen with social media, the impact of technology evolutions is swift, deep and long lasting. AI is an even more powerful yet nascent tool. It is imperative we ensure AI is used to enhance the human experience, not harm it."

Nvidia announces Blackwell Ultra B300 —1.5X faster than B200 with 288GB HBM3e and 15 PFLOPS dense FP4

Jarred Walton — Tue, 18 Mar 2025 18:35:22 +0000

The Nvidia Blackwell Ultra B300 data center GPU was announced today during CEO Jensen Huang's keynote at GTC 2025 in San Jose, CA. Offering 50% more memory and FP4 compute than the existing B200 solution, it raises the stakes in the race to faster and more capable AI models yet again. Nvidia says it's "built for the age of reasoning," referencing more sophisticated AI LLMs like DeepSeek R1 that do more than just regurgitate previously digested information.

Naturally, Blackwell Ultra B300 isn't just about a single GPU. Along with the base B300 building block, there will be new B300 NVL16 server rack solutions, a GB300 DGX Station, and GB300 NVL72 full rack solutions. Put eight NVL72 racks together, and you get the full Blackwell Ultra DGX SuperPOD: 288 Grace CPUs, 576 Blackwell Ultra GPUs, 300TB of HBM3e memory, and 11.5 ExaFLOPS of FP4. These can be linked together in supercomputer solutions that Nvidia classifies as "AI factories."

While Nvidia says that Blackwell Ultra will have 1.5X more dense FP4 compute, what isn't clear is whether other compute formats have scaled similarly. We would expect that to be the case, but it's possible Nvidia has done more than simply enabling more SMs, boosting clocks, and increasing the capacity of the HBM3e stacks. Clocks may be slightly slower in FP8 or FP16 modes, for example. But here are the core specs that we have, with some inference of other data (indicated by question marks).

Nvidia Blackwell Ultra B300 vs Blackwell B200
Platform	B300	B200	B100
Configuration	Blackwell GPU	Blackwell GPU	Blackwell GPU
FP4 Tensor Dense/Sparse	15/30 petaflops	10/20 petaflops	7/14 petaflops
FP6/FP8 Tensor Dense/Sparse	7.5/15 petaflops ?	5/10 petaflops	3.5/7 petaflops
INT8 Tensor Dense/Sparse	7.5/15 petaops ?	5/10 petaops	3.5/7 petaops
FP16/BF16 Tensor Dense/Sparse	3.75/7.5 petaflops ?	2.5/5 petaflops	1.8/3.5 petaflops
TF32 Tensor Dense/Sparse	1.88/3.75 petaflops ?	1.25/2.5 petaflops	0.9/1.8 petaflops
FP64 Tensor Dense	68 teraflops ?	45 teraflops	30 teraflops
Memory	288GB (8x36GB)	192GB (8x24GB)	192GB (8x24GB)
Bandwidth	8 TB/s ?	8 TB/s	8 TB/s
Power	?	1300W	700W
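The question-marked B300 entries in the table are consistent with simply scaling the B200's dense figures by 1.5X. The sketch below reproduces that inference. Only the FP4 and memory figures are confirmed by Nvidia in the text; applying the same factor to the other formats is an assumption, not a published spec.

```python
# Dense B200 figures as listed in the table above.
B200_DENSE = {
    "FP4 (petaflops)": 10.0,
    "FP6/FP8 (petaflops)": 5.0,
    "INT8 (petaops)": 5.0,
    "FP16/BF16 (petaflops)": 2.5,
    "TF32 (petaflops)": 1.25,
    "FP64 (teraflops)": 45.0,
}
B200_MEMORY_GB = 192

# Assumption: every format scales by the 1.5X quoted for dense FP4.
ULTRA_SCALE = 1.5

b300_estimate = {name: value * ULTRA_SCALE for name, value in B200_DENSE.items()}
b300_memory_gb = B200_MEMORY_GB * ULTRA_SCALE

for name, value in b300_estimate.items():
    print(f"{name:<24} {value:g}")
print(f"{'Memory (GB)':<24} {b300_memory_gb:g}")
```

The table rounds the TF32 and FP64 results to 1.88 petaflops and 68 teraflops. Sparse figures are double the dense ones throughout.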

We asked for some clarification on the performance and details for Blackwell Ultra B300 and were told: "Blackwell Ultra GPUs (in GB300 and B300) are different chips than Blackwell GPUs (GB200 and B200). Blackwell Ultra GPUs are designed to meet the demand for test-time scaling inference with a 1.5X increase in the FP4 compute." Does that mean B300 is a physically larger chip to fit more tensor cores into the package? That seems to be the case, but we're awaiting further details.

What's clear is that the new B300 GPUs will offer significantly more computational throughput than the B200. Having 50% more on-package memory will enable even larger AI models with more parameters, and the accompanying compute will certainly help.

Nvidia gave some examples of the potential performance, though these were compared to Hopper, so that muddies the waters. We'd like to see comparisons between B200 and B300 in similar configurations — with the same number of GPUs, specifically. But that's not what we have.

By leveraging FP4 instructions, using B300 alongside its new Dynamo software library to help with serving reasoning models like DeepSeek, Nvidia says an NVL72 rack can deliver 30X more inference performance than a similar Hopper configuration. That figure naturally derives from improvements to multiple areas of the product stack, so the faster NVLink, increased memory, added compute, and FP4 all factor into the equation.

In a related example, Blackwell Ultra can deliver up to 1,000 tokens/second with the DeepSeek R1-671B model, while Hopper only offers up to 100 tokens/second. So, there's a 10X increase in throughput, cutting the time to service a larger query from 1.5 minutes down to 10 seconds.
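The time savings follow directly from the throughput figures. As a rough sketch, assuming the response length is whatever Hopper can generate in 1.5 minutes at its peak rate, and that generation time is simply tokens divided by tokens per second:

```python
# Peak throughput figures from the article.
HOPPER_TOKENS_PER_SECOND = 100
BLACKWELL_ULTRA_TOKENS_PER_SECOND = 1_000

# Assumption: the "larger query" is sized by Hopper's 1.5-minute response.
HOPPER_SECONDS = 90
response_tokens = HOPPER_TOKENS_PER_SECOND * HOPPER_SECONDS

blackwell_ultra_seconds = response_tokens / BLACKWELL_ULTRA_TOKENS_PER_SECOND
speedup = HOPPER_SECONDS / blackwell_ultra_seconds

print(f"Response length: {response_tokens} tokens")
print(f"Blackwell Ultra: {blackwell_ultra_seconds:.0f} s ({speedup:.0f}x faster)")
```

This simple model gives 9 seconds, which Nvidia's example rounds to 10.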

The B300 products should begin shipping in the second half of the year. Presumably, there won't be any packaging snafus this time, and things won't be delayed, though Nvidia does note that it made $11 billion in revenue from Blackwell B200/B100 last fiscal year. It's a safe bet to say it expects to dramatically increase that figure for the coming year.

Watch Jensen Huang’s Nvidia GTC 2025 keynote here — Blackwell 300 AI GPUs expected

editors@tomshardware.com (Jowi Morales) — Tue, 18 Mar 2025 15:00:56 +0000

Nvidia’s annual GPU Technology Conference (GTC) is happening today, and Jensen Huang is set to give the keynote address this morning. The multi-day event focuses on artificial intelligence, computer graphics, and other technologies that rely on GPUs' specialized computational power. The keynote address will happen live at the SAP Center in San Jose, California at 10 am Pacific Time, but it will also be live-streamed to a global audience via YouTube.

A pre-broadcast livestream, Live at Nvidia GTC with Acquired, will start on YouTube at 8 am Pacific Time. The company says this event will feature speakers who will dive into Nvidia’s over 30-year history to see how it became the AI giant it is today.

But what’s more exciting for everyone is that Huang is expected to unveil the Blackwell Ultra GPU, since renamed the B300 series, which is expected to deliver more performance and have upgraded memory configurations. Huang said he will also show off next-generation Rubin AI GPUs and more at GTC.

The B300 series AI GPUs are expected to be available in the latter half of this year, while the next-generation Rubin is scheduled for 2026. Many people are anticipating the arrival of these more powerful chips, especially as tech giants and startups alike are battling for supremacy in the AI space.

Nvidia’s competitors, like AMD and Intel, also have their own AI GPU offerings. However, they are minuscule compared to Team Green, which currently owns around 92% of the entire data center GPU market. Its near-monopoly on AI GPUs, plus the hype around AI models, allowed it to become the most valuable company in the world practically overnight.

It has since dropped to third place after some market corrections, with the company losing more than half a trillion dollars in market cap after the release of DeepSeek AI. But as long as there’s demand for powerful AI GPUs, it’s unlikely that Nvidia will go away anytime soon.

It’s just a shame that many gaming enthusiasts, who were Nvidia’s primary market before AI exploded onto the scene, feel that they’re being left behind by the company. While it’s understood that the company will prioritize its AI cash cow, the pricing and availability (or lack thereof) of its recently launched RTX 50-series GPUs have disappointed millions of its core fan base.

AMD's beastly Ryzen AI Max+ 395 comes to a new GMKTec mini-PC, and AMD's Lisa Su appears to approve

Mark Tyson — Tue, 18 Mar 2025 14:40:43 +0000

China's GMKTec has announced the EVO-X2, which it claims is "the world's first AI mini PC equipped with the AMD Ryzen AI Max+ 395 processor." The device was shown off at the AMD Greater China Channel Summit today, with top red-team execs like CEO Lisa Su and SVP & GM Jack Huynh in attendance. GMKTec was lucky enough to get Dr. Su to sign one or more of the new EVO-X2 mini PCs, and the source blog hints there is a batch of Signature Editions.

Detailed hardware specs of the EVO-X2 are not readily available, and the product page has yet to be published by GMKTec, so you'll have to settle for a brief outline for now. As far as the internals go, all we are sure about is that this compact desktop will pack an AMD Ryzen AI Max+ 395 processor. However, its performance in DeepSeek R1 tests is more than 3X that of an RTX 5080 desktop GPU, according to purported GMKTec hardware demonstrations.

To recap the AMD Strix Halo APU's specs, it features a Zen 5 architecture CPU configured with 16 cores and 32 threads (two CCDs), operating at a maximum frequency of 5.1 GHz. As an APU, this is joined by a beefy iGPU, namely the RDNA 3.5 architecture Radeon 8060S with 40 compute units. Rounding off the processing package, there is an XDNA2 NPU which is capable of 50 TOPS. AMD has also shoved 80MB cache onboard, but buyers will have a choice of the RAM quota (up to 128GB of 256-bit LPDDR5X-8533 RAM) at time of purchase.
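The memory configuration matters for LLM workloads because token generation is often limited by memory bandwidth. A quick sketch of the theoretical peak for a 256-bit LPDDR5X-8533 interface, assuming one transfer per pin per rated MT/s and no overhead (sustained bandwidth will be lower):

```python
# Memory configuration quoted in the article.
BUS_WIDTH_BITS = 256
TRANSFER_RATE_MT_PER_S = 8_533  # mega-transfers per second per pin

# Each transfer moves the full bus width.
bytes_per_transfer = BUS_WIDTH_BITS // 8

# MT/s x bytes per transfer = MB/s; divide by 1,000 for GB/s.
peak_bandwidth_gb_per_s = bytes_per_transfer * TRANSFER_RATE_MT_PER_S / 1_000

print(f"Theoretical peak: {peak_bandwidth_gb_per_s:.1f} GB/s")
```

That works out to about 273 GB/s of theoretical peak bandwidth shared between the CPU, iGPU, and NPU.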

(Image credit: GMKTec blog and social media)

Back to pixel peeping to unearth more details about the GMKTec EVO-X2, and one of the first comments we can make is that it is pretty sizable for a mini PC. In the image featuring Jack Huynh, you can see the AMD SVP & GM holding a Geekom machine, which appears to be a traditional '4 x 4' NUC-a-like. With the EVO-X2 in close proximity, we Photoshop-estimate that the EVO-X2 is a '5 x 4.5' and larger in the remaining dimension.

Images shared by GMKTec on social media provide further ideas about the shape and form of the EVO-X2. Also, we can clearly see a power button to the front, alongside an SD card reader, a Thunderbolt (Type-C) port, twin USB (Type-A) ports, and a headset jack. Around the back are three more USB (Type-A) ports, another Thunderbolt, HDMI, DP, an RJ45 network port, another headset jack, and a barrel power jack.

Enthusiasts are excited about the new generation Strix Halo APUs from AMD. So if GMKTec can get its product on shelves ahead of the likes of the HP Z2 Mini G1a and Framework Desktop - at an attractive price - it will have won a minor coup. We don't know how many 'Lisa Su' signed models there are, but the source blog hints there is a batch.

As for who will be first with an AMD Strix Halo, Asus originally slated its ROG Flow Z13 for release in February, but it still seems to be in pre-order purgatory, and it's not a mini PC. The Framework Desktop machines with Ryzen AI Max chips aren't expected until Q3 this year. The other machine we have heard about, the HP Z2 Mini G1a, is set to be released in 'spring.'

AMD boasts its Ryzen AI Max+ 395 is up to 12.2x faster than Lunar Lake in AI workloads

editors@tomshardware.com (Aaron Klotz) — Mon, 17 Mar 2025 16:30:48 +0000

AMD's fire-breathing Ryzen AI Max+ 395 allegedly crushes Intel's latest efficiency-focused Lunar Lake CPUs in AI benchmarks. An AMD blog post claims the new Zen 5 + RDNA 3.5 chip is up to 12.2x faster than the Core Ultra 7 258V.

AMD benchmarked the Ryzen AI Max+ 395 and Core Ultra 7 258V (with Arc 140V graphics) in a variety of large language models and LLM configurations, including DeepSeek R1 and Llama. Model sizes were restricted to 16GB to offer a fairer comparison against Lunar Lake-powered laptops with 32GB of memory (the highest memory configuration available for these devices). An Asus ROG Flow Z13 64GB was used for the Ryzen AI Max+ 395 test system, and an Asus Zenbook S14 32GB was used for the Core Ultra 7 258V test system.

In DeepSeek R1, the Ryzen chip was up to 2.1x faster (measured in tokens per second) than the Intel counterpart using Distill Qwen 1.5b; up to 2.2x faster using Distill Qwen 7b; up to 2.1x faster using Distill Llama 8b; and up to 2.2x faster using Distill Qwen 14b. In Phi 4 Mini Instruct 3.8b, the Ryzen chip was up to 2.1x faster than the Intel Lunar Lake chip; up to 2.2x faster in Phi 4 14b; and up to 2.1x faster in Llama 3.2 3b Instruct.

In the same LLM configurations but benchmarking the "time to the first token," the Ryzen AI Max+ 395 was up to 12.2x faster in DeepSeek R1 Distill Qwen 14b. The Zen 5 chip's narrowest leads came in Phi 4 Mini Instruct 3.8b and Llama 3.2 3b Instruct, where the AMD chip was "only" 4x faster than the Core Ultra 7 258V.

AMD showed similar dominance in AI vision models using the same "time to the first token" benchmarking technique. In IBM Granite Vision 3.2 2B, the 395 was up to 7x faster than the 258V, up to 4.6x faster in Google Gemma 3 4b, and up to 6x faster in Google Gemma 3 12b.

AMD's benchmarks show complete dominance of its Ryzen AI Max+ 395 against the Core Ultra 7 258V in AI benchmarks. This is all thanks to the Ryzen AI Max CPU's significantly more powerful integrated graphics chip (which rivals discrete graphics with its 40 RDNA 3.5 CUs), eight more CPU cores, and its significantly higher configurable TDP (rated up to 120W). Even though it consumes significantly more power than the Core Ultra 7 258V (which has a max turbo power of 37W), both chips operate in the same market and compete in the same thin-and-light category of laptop PCs.

It will be interesting to see how the new AMD mobile APUs shape up against Nvidia's RTX 50-series mobile GPUs, which are reportedly facing supply chain issues, delaying their launch in upcoming RTX 50 series gaming laptops. On a pure performance level (not considering form factor), these new Nvidia-powered systems will be AMD's primary competition.

AMD is allegedly well on its way to handling discrete GPU competition, as it already advertised superior AI performance on the Ryzen AI Max+ 395 against Nvidia's RTX 4090 laptop GPU.

ERNIE 4.5 AI model by Baidu claims to match DeepSeek R1 at half the cost

editors@tomshardware.com (Kunal Khullar) — Mon, 17 Mar 2025 12:30:00 +0000

Baidu, the Chinese technology giant, has announced the release of two advanced artificial intelligence (AI) models: ERNIE 4.5 (Enhanced Representation through Knowledge Integration) and ERNIE X1. The company also announced that its conversational AI platform, ERNIE Bot, is now freely accessible to all users - ahead of schedule.

ERNIE 4.5 represents Baidu's latest advancement in multimodal AI modeling. The model integrates various data types — text, images, audio, and video — through joint modeling, enhancing its ability to comprehend and generate content across these modalities. This integration leads to improvements in understanding, generation, reasoning, and memory capabilities. Notably, ERNIE 4.5 demonstrates significant enhancements in logical reasoning and coding abilities, addressing previous challenges in these areas.

In internal evaluations, ERNIE 4.5 has shown performance on par with models like DeepSeek-R1, but at approximately half the deployment cost. This cost efficiency positions ERNIE 4.5 as a competitive option for enterprises seeking advanced AI capabilities without incurring substantial expenses.

ERNIE X1 is Baidu's first model specifically designed for reasoning-intensive tasks. It excels in logical inference, problem-solving, and structured decision-making, making it suitable for applications in finance, law, and data analysis. The model's architecture emphasizes understanding, planning, reflection, and evolution, aiming to provide robust reasoning capabilities while maintaining cost efficiency.

Originally scheduled for a later release, Baidu's ERNIE Bot is now freely available to all users ahead of plan. This early rollout is attributed to improvements in production capacity and model optimization. By making ERNIE Bot accessible to a broader audience, Baidu aims to accelerate user engagement and gather feedback to improve its AI offerings.

Baidu plans to integrate ERNIE 4.5 and ERNIE X1 across its product ecosystem, including Baidu Search and the Wenxiaoyan app. This integration aims to enhance user experience by providing more versatile and advanced AI functionalities. For enterprise users and developers, ERNIE 4.5 is now accessible via APIs on Baidu AI Cloud's Qianfan platform, with ERNIE X1 to follow soon.

The release of ERNIE 4.5 and ERNIE X1 occurs amid increasing competition in the AI industry. Companies like OpenAI, Google, and DeepSeek are continually advancing their AI models. Baidu's focus on cost-effective, high-performance models reflects its strategy to meet the growing demand for scalable AI solutions in various sectors.

Deepseek 'clearly not interested' in scaling up — 160-person team focused on developing new models

sayem.ahmed@futurenet.com (Sayem Ahmed) — Fri, 14 Mar 2025 13:27:08 +0000

China-based AI company Deepseek is reportedly focusing on research and development instead of chasing revenue, unlike Western AI rivals such as OpenAI, Google, and Anthropic. According to the Financial Times, the Hangzhou-based company is developing two new models, R2 and V4, as steps toward its goal of achieving Artificial General Intelligence (AGI).

Deepseek garnered significant attention in January 2025, triggering a stock market shakeup that resulted in Nvidia losing $589 billion in market cap in a single day, following the launch of Deepseek R1.

Despite the newfound attention, the billionaire founder and CEO of Deepseek, Liang Wenfeng, is allegedly taking a different approach when compared to the company's western competitors.

Speaking to the Financial Times, sources close to Deepseek said that there is "little intention to capitalize on Deepseek's sudden fame to commercialize its technology in the near term". Instead, the company is focusing on "model development" and working towards AGI.

Deepseek's revenues are also reported to be covering ongoing costs, likely thanks to interest generated by the release of the Deepseek R1 model in January.

Liang is also notoriously difficult to contact, and the Deepseek CEO has outright declined any further investment from "venture and state-backed funds", the report continues.

But Liang clearly has enough resources to fund further development. He's also the founder of High-Flyer, one of China's leading hedge funds. According to sources speaking to the Financial Times, he purchased 10,000 Nvidia H800 GPUs and 10,000 A100s, though the chips were bought before they were banned for sale in China. Deepseek has already incurred over $1.6 billion in hardware costs and has a total fleet of over 50,000 Nvidia GPUs.

However, the company might find it difficult to access more advanced Nvidia chips in the future, and could "consider future partnerships" to resolve the issue. In late February, a Singapore-based smuggling ring was busted for alleged illegal re-export of high-performance GPUs, destined for Deepseek, bypassing trade restrictions.

It's also alleged that if Deepseek's future demand exceeds its current data center capacity, the company will rely on "third-party providers" instead of procuring more capacity itself. The Chinese government has also thrown its support behind Deepseek, with the company gaining access to state-funded data centers.

Deepseek has also invested over $500 million into its technology, and will remain self-funded. "They clearly are not interested in scaling up right now. It's a rare situation where the founder is wealthy and committed enough to keep it lean in a Navy Seal-style for his pursuit of AGI", one industry insider told the Financial Times.

Deepseek's 160 employees are dedicated to development

Deepseek has "about 160" employees, which is significantly fewer than OpenAI's gargantuan headcount of around 2000 employees (as of December 2024), according to sources speaking to the Financial Times. This makes the company much leaner than many of its rivals.

The team is focused on development of the next-generation R2 and V4 models, which are currently slated for release in May. However, development "may be accelerated to keep its momentum going", according to Financial Times sources.

With Deepseek's next move just a few short months away, another Chinese AI company named Manus AI, which is developing autonomous AI agents, has enjoyed heightened interest.

But it has yet to come within spitting distance of the impact that Deepseek has had on the AI industry. Whether Deepseek's next release can trigger another shock moment for stock markets is also yet to be seen, as the company hones its focus on the rapid development of advanced AI technologies.

AMD RDNA 3 professional GPUs with 48GB can beat Nvidia 24GB cards in AI — putting the 'Large' in LLM

editors@tomshardware.com (Aaron Klotz) — Thu, 13 Mar 2025 18:45:11 +0000

AMD is swinging back at Nvidia with new DeepSeek benchmarks that claim its monster 48GB RDNA 3 GPUs can outperform Team Green's previous-generation RTX 4090.

David McAfee, AMD vice president and general manager of Ryzen CPUs and Radeon graphics, posted on X that the Radeon Pro W7900 and Pro W7800 48GB cards can outperform an RTX 4090 by up to 7.3x in DeepSeek R1.

McAfee shared a graph of the three GPUs benchmarked in several iterations of DeepSeek R1 using LM Studio 0.3.12 and Llama.cpp runtime 1.18. The tests covered two DeepSeek R1 distillations, Distill Qwen 32B 8-bit and Distill Llama 70B 4-bit, each run twice: first with a conversational prompt (20 tokens), then with a summarization prompt (3017 tokens).

A single @AMD Radeon PRO W7800 48GB or W7900 48GB has enough VRAM to run with great performance even the largest DeepSeek R1 Distill (or higher precision for 32B). pic.twitter.com/4uNTO6XAYG (March 13, 2025)

With the conversational prompt, in DeepSeek R1 Distill Qwen 32B 8-bit, the RTX 4090 allegedly produced 2.7 tokens per second, the Pro W7800 48GB produced 19.1, and the Pro W7900 48GB produced 19.8. In Distill Llama 70B 4-bit, the RTX 4090 produced 2.3 tokens per second, the Pro W7800 48GB 12.8, and the Pro W7900 48GB 12.7.

With the summarization prompt, in Distill Qwen 32B 8-bit, the RTX 4090 produced 2.5 tokens per second, the Pro W7800 48GB 15.7, and the Pro W7900 48GB 16.2. In Distill Llama 70B 4-bit, the RTX 4090 produced two tokens per second, the Pro W7800 48GB 10.1, and the Pro W7900 48GB 10.4.

Compared to the RTX 4090, AMD's benchmarks claim the Radeon Pro W7800 or Pro W7900 48GB GPUs are up to 7.3x faster in Distill Qwen 32B 8-bit and 5.5x faster in Distill Llama 70B 4-bit with the conversational prompt, and 6.5x and 5.2x faster, respectively, with the summarization prompt.
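To make the multipliers easier to check, here is a minimal Python sketch that recomputes them from the tokens-per-second figures quoted above. The dictionary layout and the `best_speedup` helper are our own illustrative names, not anything from AMD, and we assume the first set of results corresponds to the conversational prompt and the second to the summarization prompt.

```python
# Tokens-per-second figures as quoted in the text from AMD's chart.
# Keys: (model, prompt type). Values: (RTX 4090, Pro W7800 48GB, Pro W7900 48GB).
# Assumption: the first pair of results is the conversational prompt,
# the second pair the summarization prompt.
results = {
    ("Distill Qwen 32B 8-bit", "conversational"): (2.7, 19.1, 19.8),
    ("Distill Llama 70B 4-bit", "conversational"): (2.3, 12.8, 12.7),
    ("Distill Qwen 32B 8-bit", "summarization"): (2.5, 15.7, 16.2),
    ("Distill Llama 70B 4-bit", "summarization"): (2.0, 10.1, 10.4),
}


def best_speedup(rtx_4090, w7800, w7900):
    """Ratio of the faster Radeon Pro card to the RTX 4090 (hypothetical helper)."""
    return max(w7800, w7900) / rtx_4090


speedups = {key: best_speedup(*rates) for key, rates in results.items()}

for (model, prompt), ratio in speedups.items():
    print(f"{model} ({prompt}): {ratio:.2f}x")
```

Three of the four ratios land on AMD's quoted factors after rounding. The conversational Llama 70B case works out closer to 5.6x from these rounded rates versus the quoted 5.5x, which may simply reflect rounding in the published chart.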

David McAfee claims the 48GB trims of the Pro W7800 and W7900 have enough VRAM to run the largest DeepSeek R1 Distill models. VRAM is one of the most critical aspects of running large language models: an LLM's parameters are held in VRAM, so the memory required scales with the parameter count and the precision at which the weights are stored. Thus, the larger an LLM is, the more VRAM you need. But extra VRAM capacity comes at very high prices.
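As a rough illustration of why capacity matters here, the sketch below estimates the memory needed for the weights alone as parameter count times bits per weight. This is a back-of-the-envelope assumption on our part: real quantized files carry extra metadata, and the KV cache and runtime buffers add more on top. The function name and the dictionaries are hypothetical.

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights alone, in GB (10^9 bytes).

    Simplifying assumption: ignores quantization metadata, KV cache,
    and runtime overhead, all of which increase real-world usage.
    """
    return params_billion * bits_per_weight / 8


# Model sizes and precisions as named in AMD's benchmarks.
models = {
    "Distill Qwen 32B 8-bit": (32, 8),
    "Distill Llama 70B 4-bit": (70, 4),
}

# VRAM capacities in GB, as cited in the article.
cards = {"RTX 4090": 24, "Radeon Pro W7800/W7900 48GB": 48}

fits = {}
for model, (params, bits) in models.items():
    need = weight_footprint_gb(params, bits)
    for card, vram in cards.items():
        fits[(model, card)] = need <= vram
        verdict = "fits" if fits[(model, card)] else "does not fit"
        print(f"{model}: ~{need:.0f} GB of weights {verdict} in {card} ({vram} GB)")
```

By this estimate the two models need roughly 32GB and 35GB for weights, more than a 24GB card holds but within 48GB. That would plausibly explain the RTX 4090's low token rates in AMD's chart, though AMD's post doesn't spell out how the 4090 runs were configured.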

The W7900 48GB costs a whopping $3,500, which is $1,500 over the RTX 5090's $2,000 MSRP and $2,000 over the RTX 4090's $1,500 MSRP (though hardly any 4090s were sold at that price). But on the flip side, the 48GB RDNA 3 GPU is less than half the price of the closest current-generation 48GB Nvidia GPU you can buy today, the RTX 6000 Ada Generation.

AMD's marketing looks great, but we have seen this before. AMD previously shared benchmarks of its RX 7900 XTX outperforming the RTX 4090 (mostly) in DeepSeek R1 benchmarks. However, Nvidia responded by showcasing benchmarks of the RTX 4090 (and RTX 5090), drastically outperforming the flagship RDNA 3 GPU with the same DeepSeek R1 configurations.

AMD also neglected to share any benchmarks comparing Nvidia's newest flagship, the RTX 5090, against its RDNA 3-based 48GB workstation-focused graphics cards. It will be interesting to see if Nvidia will follow up with another round of benchmarks to combat AMD, particularly since AMD has more VRAM on its 48GB cards than even the RTX 5090 with its 32GB of GDDR7.

Huawei reportedly acquired two million Ascend 910 AI chips from TSMC last year through shell companies

ashilov@gmail.com (Anton Shilov) — Mon, 10 Mar 2025 15:52:25 +0000

Although Huawei cannot legally obtain advanced chips made by TSMC, the company used shell companies last year to obtain compute chiplets for its Ascend 910 AI chips. The conspiracy was discovered by TechInsights and TSMC, which ceased to ship chiplets to Huawei's proxies and initiated an internal investigation. However, it was not clear how many chiplets it supplied to Huawei. According to a report by the Center for Strategic and International Studies, Huawei obtained as many as two million Ascend 910 AI chiplets.

"However, TSMC manufactured large quantities of Huawei Ascend 910B chips on behalf of Huawei shell companies and shipped the chips to China in violation of U.S. export controls," the report by CSIS reads.

According to the report, "government officials told CSIS that TSMC manufactured more than 2 million Ascend 910B logic dies and that all of these are now with Huawei. If true, this is enough dies to make 1 million Ascend 910C units. […] Even though Huawei likely has the more than 2 million Ascend 910B logic dies made by TSMC, there is a question as to whether it has enough HBM to integrate with those dies […] It seems likely that Huawei does, however, since the U.S. plan to restrict all advanced HBM sales to China on a country-wide basis was leaked to Bloomberg in August 2024 and did not go into effect until December of that year, giving Huawei ample time to legally acquire HBM chips as part of a stockpiling strategy."

Although the report seems correct about Huawei's stockpiling strategy and even gives us an insight into how many chips TSMC produced for Huawei's intermediaries, it still contains several inaccuracies that lead to wrong conclusions.

The Progression From The Ascend 910 To The Ascend 910C

Huawei's original HiSilicon Ascend 910, which was launched in 2019, consists of a Virtuvian AI chiplet, a Nimbus V3 I/O die, four HBM2E memory stacks, and two dummy dies. TSMC produced Virtuvian chiplets for Huawei from 2019 to September 2020, using its N7+ process technology, a 7nm-class node with some EUV layers.

After the U.S. government put Huawei on its Entity List in 2020, Huawei had to redesign its Virtuvian chiplet to make it at SMIC, which used its N+1 technology (1st Generation 7nm-class process) to build it. GPUs with the new Virtuvian chiplet are called HiSilicon Ascend 910B and have nothing to do with TSMC.

Later, Huawei developed a more sophisticated version of its Virtuvian chiplet for its Ascend 910C, which SMIC makes using its 2nd Generation 7nm fabrication technology (N+2). Contrary to the report, the Ascend 910C has only one compute chiplet. Again, the Ascend 910C has nothing to do with TSMC. As Huawei managed to deceive TSMC, the latter produced the original Ascend 910 chiplet for the company in 2023 – 2024, as discovered by TechInsights.

Ascend 910B And Ascend 910C Yields Are Poor

Another noteworthy thing about Huawei's Ascend 910B and Ascend 910C is that their yields are not exactly high, so most parts are shipped with some compute elements disabled. Also, only 75% of Huawei's AI chips survive advanced packaging, which is not a good result.

"However, the advanced packaging process by which two Ascend 910B dies and HBM are combined into a unified Ascend 910C chip also introduces defects that can compromise the functionality of the chip," the report says. "Industry sources told CSIS that roughly 75% of the Ascend 910Cs currently survive the advanced packaging process."
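The CSIS figures can be chained together in a quick calculation. The sketch below is only an illustration under the report's stated numbers (2 million logic dies, roughly 75% packaging survival); the `packaged_units` helper and the idea of applying the survival rate uniformly are our assumptions. The dies-per-unit value is left as a parameter because the report assumes two dies per Ascend 910C, which the analysis above disputes.

```python
def packaged_units(logic_dies, dies_per_unit, packaging_yield):
    """Units that survive packaging, given a die count and dies per unit.

    Hypothetical helper: applies one flat survival rate to every unit.
    """
    assembled = logic_dies // dies_per_unit
    return int(assembled * packaging_yield)


LOGIC_DIES = 2_000_000    # figure CSIS attributes to government officials
PACKAGING_YIELD = 0.75    # survival rate CSIS attributes to industry sources

# CSIS's premise: two logic dies per Ascend 910C.
csis_case = packaged_units(LOGIC_DIES, 2, PACKAGING_YIELD)

# Alternative premise: one compute chiplet per unit.
single_die_case = packaged_units(LOGIC_DIES, 1, PACKAGING_YIELD)

print(f"Two dies per unit: {csis_case:,} working units")
print(f"One die per unit:  {single_die_case:,} working units")
```

Under the report's own two-die premise, 2 million dies would yield about 750,000 working units rather than the full 1 million; with one die per unit the same inputs give 1.5 million.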

Nonetheless, Huawei continues to acquire millions of Ascend 910B and Ascend 910C for its internal AI projects and external customers. For example, DeepSeek claims that the Ascend 910C delivers 60% of the performance offered by Nvidia's H100, which may not be enough for training large language models but is good enough for inference workloads.

Where to buy AMD's Radeon RX 9070 series graphics cards

editors@tomshardware.com (Hassam Nasir) — Thu, 06 Mar 2025 15:01:42 +0000

The first wave of AMD's RDNA 4 GPUs is out in the wild, starting with the RX 9070 series featuring the $599 RX 9070 XT and its $549 sibling, the RX 9070. These GPUs land hot on the heels of Nvidia's latest strides into the more mainstream segment with the RTX 5070 series. AMD has scored some solid wins across the board in performance and pricing, and now all eyes are on availability.

Both GPUs in the RX 9070 family offer 16GB of memory, ample to power your 1440p and even 4K needs. The flagship RX 9070 XT packs 64 Compute Units, dropping to 56 for its non-XT counterpart. Given the shift in the Radeon nomenclature, both GPUs are positioned to rival Nvidia's RTX 5070 family. When sticking to rasterization at 1440p, the RX 9070 XT is within reach of the RTX 5070 Ti, while the RX 9070 comes out on top versus the RTX 5070, with an 8% lead. With base-level price tags of $599 and $549, let's hope AMD has enough inventory to match consumer demand.

We've compiled listings posted across major US retailers including Best Buy, Newegg, and B&H Photo. Despite the recommended MSRPs, most models are noticeably pricier, a situation exacerbated by the lack of an MBA (Made By AMD) model this generation. Unlike the RTX 50 series, the RX 9070 series listings went live after the embargo lifted, likely in an effort to counter bots and scalpers.

Where to buy the AMD RX 9070 XT in the US

Many GPUs instantly sold out within minutes of launch, but we expect a resupply soon. There is a lot of fluctuation in the listed prices, especially at B&H Photo with a handful of its RX 9070 XTs costing north of $1,000. Without that anomaly, the most expensive RX 9070 XT is XFX's Mercury Magnetic Air OC White edition, priced at $849.99. On the RX 9070's end of the ring, the Gigabyte Gaming RX 9070 shot up to $739.99 once the first batch (at MSRP) sold out.

Model	Retailer	Price
XFX Swift Black AMD Radeon RX 9070 XT	Best Buy	$599.99
XFX Mercury Magnetic Air OC Black AMD Radeon RX 9070 XT	Best Buy	$829.99
XFX Swift White AMD Radeon RX 9070 XT	Best Buy	$749.99
XFX Mercury White OC AMD Radeon RX 9070 XT	Best Buy	$819.99
XFX Mercury Magnetic Air OC White AMD Radeon RX 9070 XT	Best Buy	$849.99
Asus Prime AMD Radeon RX 9070 XT	Newegg	$719.99
ASRock Steel Legend AMD Radeon RX 9070 XT	Newegg	$599.99
Gigabyte Gaming OC AMD Radeon RX 9070 XT	Newegg	$729.99
Sapphire Pulse AMD Radeon RX 9070 XT	Newegg	$599.99
PowerColor Hellhound OC AMD Radeon RX 9070 XT	Newegg	$759.99
Gigabyte Aorus Radeon RX 9070 XT	Newegg	$759.99
Sapphire Pure AMD Radeon RX 9070 XT	Newegg	$739.99
ASRock Taichi AMD Radeon RX 9070 XT	Newegg	$729.99
XFX QuickSilver AMD Radeon RX 9070 XT	Newegg	$749.99
PowerColor Reaper AMD Radeon RX 9070 XT	Newegg	$599.99
Sapphire Nitro+ AMD Radeon RX 9070 XT	Newegg	$779.99
PowerColor Red Devil OC AMD Radeon RX 9070 XT	Newegg	$799.99
XFX Mercury OC AMD Radeon RX 9070 XT	Newegg	$799.99
Gigabyte Gaming AMD Radeon RX 9070 XT	Newegg	$599.99
Asus TUF OC AMD Radeon RX 9070 XT	B&H Photo	$1,049.99
Asus TUF Gaming AMD Radeon RX 9070 XT	B&H Photo	$1,099.99
Asus TUF Gaming OC AMD Radeon RX 9070 XT	B&H Photo	$899.99
Gigabyte Gaming OC AMD Radeon RX 9070 XT	B&H Photo	$899.99
Gigabyte Aorus Elite AMD Radeon RX 9070 XT	B&H Photo	$949.99

Where to buy the AMD RX 9070 non-XT in the US

Model	Retailer	Price
XFX Swift Black OC AMD Radeon RX 9070	Best Buy	$549.99
XFX QuickSilver Black OC AMD Radeon RX 9070	Best Buy	$649.99
XFX Swift White OC AMD Radeon RX 9070	Best Buy	$649.99
XFX QuickSilver White OC AMD Radeon RX 9070	Best Buy	$669.99
Sapphire Pure AMD Radeon RX 9070	Newegg	$679.99
PowerColor Reaper AMD Radeon RX 9070	Newegg	$549.99
Sapphire Nitro+ AMD Radeon RX 9070	Newegg	$709.99
Gigabyte Gaming AMD Radeon RX 9070	Newegg	$739.99
Sapphire Pulse AMD Radeon RX 9070	Newegg	$549.99
ASRock Challenger AMD Radeon RX 9070	Newegg	$549.99
ASRock Steel Legend AMD Radeon RX 9070	Newegg	$639.99
Asus TUF Gaming AMD Radeon RX 9070	Newegg	$709.99
PowerColor Hellhound AMD Radeon RX 9070	Newegg	$629.99
PowerColor Red Devil AMD Radeon RX 9070	Newegg	$659.99
XFX Swift OC AMD Radeon RX 9070	Newegg	$549.99
Asus TUF Gaming OC AMD Radeon RX 9070	B&H Photo	$899.99
Asus Prime OC AMD Radeon RX 9070	B&H Photo	$849.99
Gigabyte Gaming OC AMD Radeon RX 9070	B&H Photo	$739.99

Inventory was expected to be high at launch, and many hoped we wouldn't see widespread stockouts like with Nvidia's RTX 50-series GPUs. The reality was basically a repeat of what we saw with every recent GPU launch. From the U.S. to the U.K. and Europe to Asia, there are numerous reports of cards selling out immediately.

For those in the US seeking a more hands-on experience, physical stores like Micro Center supposedly had a reasonable number of cards, but you'd need a local store to even try this approach. And by reasonable we mean perhaps dozens of cards, which are no doubt sold out by now.

Given the demand for silicon wafers from the AI sector and the significantly higher prices such products can command, there's real concern that the supply of consumer GPUs could be as bad or worse than what was experienced during the 2020–2022 cryptocurrency mining induced GPU shortages. Let's hope it's not that bad, but at present things aren't looking good.

AMD Radeon RX 9070 XT and RX 9070 review: Excellent value, if supply is good

Jarred Walton — Wed, 05 Mar 2025 14:29:18 +0000

Introducing the AMD Radeon RX 9070 XT and RX 9070

The AMD Radeon RX 9070 XT and RX 9070 are here, ushering in the RDNA 4 GPU architecture and RX 9000 series of graphics cards. AMD spilled the beans on the hardware and specs last week, and we've already done a deeper dive into what makes these new GPUs tick, but now it's time to see how the RX 9070 XT and RX 9070 stand up to the best graphics cards — all while we wait to see what happens with the retail launch tomorrow and how quickly the supply disappears.

RDNA 4 represents a throwback to AMD architectures of years past, as the company is once again targeting mainstream performance and maybe even budget performance further down the road. But today, we're getting the $599 RX 9070 XT and $549 RX 9070 cards. And while some might feel cards at up to $600 don't qualify as "mainstream," in today's market, we'd say mainstream stretches from around $400 up to $600, while anything below about $300 is clearly in the budget range. The PC graphics card market has become much more expensive in the past decade.

The one question we can't answer is what retail availability will look like. It seems like the AIBs have been stockpiling cards for about two or three months now, but how quickly were they being supplied the requisite GPUs? We don't know. Maybe there are tens of thousands of 9070 series cards just waiting to go on sale tomorrow; maybe there are only a few thousand. What we do know is that if there aren't enough to meet demand, prices are going to head north, just as they did with the RTX 50-series launches of the past two months.

Additional Reading

If you want to know more about the AMD RDNA 4 architecture and Radeon RX 9000-series GPUs, start with our "everything you need to know" primer that goes into a lot more detail about the design and architectural changes that power these graphics cards.

Speaking of which, the Nvidia RTX 5070 officially goes on sale this morning. Of course we knew the performance of the RX 9070 XT and 9070 when we posted that review yesterday. What we don't know — what no one outside of Nvidia and its distributors and retail partners knows — is how many 5070 cards will be available today.

The RTX 5070 Ti, RTX 5080, and RTX 5090 have all sold out almost immediately, and we've seen prices shoot up by 50% or more relative to the MSRPs.

Will the AMD graphics cards buck that trend or join the "party?" We'll find out in the coming days, but considering what happened with Intel's Arc B580, it's obvious that lower-priced cards aren't immune from the potential supply and demand problems.

Our default assumption right now, based on nearly all prior generation graphics cards already being sold out and/or overpriced — with only the RTX 4060 and RX 7600 still available at or near MSRP — is that RDNA 4 isn't going to be a magic bullet to solve the availability issues plaguing the graphics card market right now.

Let's also clear the air on the comparison GPUs in our charts. We've (or at least I've) been testing GPUs more or less constantly since the beginning of 2025. Drivers keep changing, certain tests that failed to run in the past have been fixed, bugs come and go, and we have a new GPU testbed and test suite. Ideally, we'd love to have every reasonable comparison present in the charts, but it will be a while before we have all the data compiled at a rate of a few GPUs getting tested per week.

So, the RTX 4070 Super wasn't tested for the 5070 review, not because we don't think it's important but because of time. Similarly, the RX 7900 GRE won't be in this review because we don't have time. Eventually, we'll get those tested, and all the data will be available in our GPU benchmarks hierarchy.

You should be able to reasonably estimate where those 'missing' cards would land, and as both of those were later additions to their respective GPU families, it seemed to make more sense to leave those out rather than some other GPUs.

All good? Good. Let's hit the specs.

Graphics Card Specifications
Graphics Card	RX 9070 XT	RX 9070	RX 7900 XTX	RX 7900 XT	RX 7900 GRE	RX 7800 XT	RTX 5070 Ti	RTX 5070
Architecture	Navi 48	Navi 48	Navi 31	Navi 31	Navi 31	Navi 32	GB203	GB205
Process Technology	TSMC N4P	TSMC N4P	TSMC N5 + N6	TSMC N5 + N6	TSMC N5 + N6	TSMC N5 + N6	TSMC 4N	TSMC 4N
Transistors (Billion)	53.9	53.9	45.6 + 6x 2.05	45.6 + 5x 2.05	45.6 + 4x 2.05	28.1 + 4x 2.05	45.6	31
Die size (mm^2)	356.5	356.5	300 + 225	300 + 225	300 + 225	200 + 150	378	263
SMs / CUs	64	56	96	84	80	60	70	48
GPU Shaders (ALUs)	4096	3584	6144	5376	5120	3840	8960	6144
Tensor / AI Cores	128	112	192	168	160	120	280	192
Ray Tracing Cores	64	56	96	84	80	60	70	48
Boost Clock (MHz)	2970	2520	2500	2400	2245	2430	2452	2512
VRAM Speed (Gbps)	20	20	20	20	18	19.5	28	28
VRAM (GB)	16	16	24	20	16	16	16	12
VRAM Bus Width (bits)	256	256	384	320	256	256	256	192
L2 / Infinity Cache (MB)	64	64	96	80	64	64	48	48
Render Output Units	128	128	192	192	160	96	96	80
Texture Mapping Units	256	224	384	336	320	240	280	192
TFLOPS FP32 (Boost)	48.7	36.1	61.4	51.6	46.0	37.3	43.9	30.9
TFLOPS FP16 (INT4/FP4 TOPS)	389 (1557)	289 (1156)	122.8	103.2	92	74.6	352 (1406)	247 (988)
Bandwidth (GB/s)	640	640	960	800	576	624	896	672
TBP (watts)	304	220	355	315	260	263	300	250
Launch Date	Mar 2025	Mar 2025	Dec 2022	Dec 2022	Jul 2023	Sep 2023	Feb 2025	Feb 2025
MSRP	$599	$549	$999	$749	$549	$499	$749	$549

The raw specs are interesting, but there's more to GPU performance than specs. For example, Intel's Arc B580 has "worse" compute performance than the Arc A770: 14.6 TFLOPS versus 19.7 TFLOPS. But in actual benchmarks, the B580 is up to 17% faster across our gaming test suite at 1440p. Both AMD and Nvidia have also updated their core architectures to improve performance, and today we find out just how much.

The RX 9070 XT offers theoretical peak compute of 48.7 TFLOPS for FP32, which is used for graphics, and up to 1557 TOPS of INT4 AI compute (with sparsity). The previous generation RX 7900 XTX offers 61.4 TFLOPS of FP32, but only 122.8 TFLOPS of FP16 for AI workloads (or, alternatively, 122.8 TOPS of INT8 compute). We'll spoil the surprise a bit by saying that, for a lot of games, the 7900 XTX is still faster... but in AI tasks and RT games, the tables can turn.
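For readers who want to see where the FP32 numbers in the spec table come from, the sketch below reproduces them from shader count and boost clock. It assumes two floating-point operations per shader per clock (one fused multiply-add) and, for the AMD cards, an extra factor of two for dual-issue. That dual-issue factor is our assumption, chosen because it matches the table; the function and variable names are illustrative only.

```python
def fp32_tflops(shaders, boost_mhz, issue_factor):
    """Theoretical peak FP32 throughput in TFLOPS.

    Counts a fused multiply-add as 2 operations per shader per clock,
    scaled by an issue factor (assumed 2 for RDNA 3/4, 1 for Nvidia).
    """
    return shaders * boost_mhz * 2 * issue_factor / 1e6


# (shaders, boost clock in MHz, assumed issue factor), taken from the spec table.
gpus = {
    "RX 9070 XT": (4096, 2970, 2),
    "RX 9070": (3584, 2520, 2),
    "RX 7900 XTX": (6144, 2500, 2),
    "RTX 5070 Ti": (8960, 2452, 1),
    "RTX 5070": (6144, 2512, 1),
}

peak = {name: fp32_tflops(*spec) for name, spec in gpus.items()}

for name, tflops in peak.items():
    print(f"{name}: {tflops:.1f} TFLOPS FP32")
```

These are paper figures at the rated boost clock, so they say nothing about how often a game can actually keep every shader busy, which is why the benchmarks matter more.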

It's not just compute performance that matters, of course. Memory bandwidth and capacity are also factors. The 7900 XTX had a 384-bit interface and 24GB of VRAM, compared to the 9070 XT and 9070 with 256-bit interfaces and 16GB of VRAM. In all cases the memory is GDDR6 clocked at 20 Gbps, so the prior generation halo card had 50% more bandwidth and capacity.
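The bandwidth comparison follows directly from bus width and memory speed. As a small sketch (the helper name is ours), bandwidth in GB/s is the bus width in bits times the per-pin data rate in Gbps, divided by eight bits per byte:

```python
def memory_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s from bus width and per-pin data rate."""
    return bus_width_bits * data_rate_gbps / 8


# Bus widths and 20 Gbps GDDR6 speed as listed in the spec table.
rx_9070_xt = memory_bandwidth_gbs(256, 20)
rx_7900_xtx = memory_bandwidth_gbs(384, 20)

bandwidth_advantage = rx_7900_xtx / rx_9070_xt - 1
capacity_advantage = 24 / 16 - 1  # VRAM in GB, from the spec table

print(f"RX 9070 XT:  {rx_9070_xt:.0f} GB/s")
print(f"RX 7900 XTX: {rx_7900_xtx:.0f} GB/s")
print(f"7900 XTX advantage: {bandwidth_advantage:.0%} bandwidth, {capacity_advantage:.0%} capacity")
```

The same formula matches the table's other entries, such as 896 GB/s for the RTX 5070 Ti with its 256-bit bus at 28 Gbps.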

There are also the RT accelerators. AMD has doubled the ray/triangle and ray/box intersection rates with RDNA 4 compared to RDNA 3, which means the 64 RT units in the 9070 XT should be the performance equivalent of 128 RDNA 3 RT units, while the 7900 XTX only has 96 RT accelerators. So that's potentially 33% higher ray tracing performance from the new generation.

As already noted, the prices on paper look good. What we don't know is whether prices will stay close to what AMD recommends, or if they'll get jacked up by the retail outlets and AIBs. Because AMD isn't making any graphics cards itself this round, it will be up to the add-in board (AIB) partners to determine prices on the various models.

There are probably requirements for each company to have an MSRP priced GPU, but we've seen those disappear in the past — or things like Asus's "special launch pricing" on some of its RTX 50-series cards.

We can also look at what graphics cards are available at retail. Last November, during the holiday shopping season, most graphics cards went on sale at prices below MSRP. And then they were gone. Now, virtually everything at the usual places for the U.S. — Newegg, Amazon, B&H, Best Buy, etc. — is either out of stock or seriously overpriced.

The RX 7900 XTX was selling for as little as $819; now the best price we can find is $1,094 for a PSU and GPU combo, and after that the price jumps to $1,283 at Amazon.

The same pattern applies to pretty much every other GPU. Outside of the RTX 4060, RX 7600, and Arc B570, we can't find anything at MSRP, never mind below MSRP. If you want a mainstream or higher performance GPU, it's currently overpriced compared to just a month or two back. Given the scarcity of any graphics card with an MSRP above $400, then, it's hard to imagine the 9070 XT and 9070 will stay at MSRPs in the near term. But we'll wait and see.

MORE: Best Graphics Cards
MORE: GPU Benchmarks and Hierarchy
MORE: All Graphics Content

AMD Radeon RX 9070 XT and RX 9070, by PowerColor

AMD provided two graphics cards for this review, both from PowerColor and both with reference clocks. They're branded Reaper, a new family for PowerColor that presumably sits near the lower end of the product stack. These are triple-fan cards, but everything else says base model — no RGB lighting, no dual BIOS, no extras in the box. That's fine, as base MSRP cards usually don't have a lot of extras.

We're primarily focusing on the higher spec RX 9070 XT for this review, though we'll have all the performance data for both cards. It's again a matter of time constraints. Doing three full graphics card writeups in one week is just a bit too much. But we'll have plenty to say about the vanilla RX 9070 as well.

Both cards have the same physical dimensions: 292x111x41 mm. The fans are 88mm models with integrated rims that help improve airflow. But while the dimensions are the same, there are some differences between the two cards. Specifically, the 9070 XT card has a copper heat plate while the 9070 has an aluminum (or some silver metal) heat plate. There are likely other differences under the shrouds, as the 9070 XT will have to dissipate more heat.

PowerColor takes the traditional approach of including three DisplayPort 2.1a ports and a single HDMI 2.1b port. However, the specifications note that only two DP2.1 connections can be active at the same time. Also, these are UHBR13.5 (54 Gbps) ports, rather than the full 80 Gbps maximum that DisplayPort 2.1a allows for.
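The 54 Gbps and 80 Gbps figures are raw link rates across all lanes. A quick sketch, assuming the standard four-lane DisplayPort main link and the per-lane UHBR rates (the names below are our own):

```python
LANES = 4  # a DisplayPort main link carries four lanes

# Per-lane raw rates in Gbps for the UHBR modes.
UHBR_GBPS_PER_LANE = {"UHBR10": 10.0, "UHBR13.5": 13.5, "UHBR20": 20.0}


def raw_link_rate_gbps(mode):
    """Raw (pre-encoding) link rate in Gbps across all four lanes."""
    return LANES * UHBR_GBPS_PER_LANE[mode]


card_rate = raw_link_rate_gbps("UHBR13.5")
max_rate = raw_link_rate_gbps("UHBR20")

print(f"UHBR13.5: {card_rate:.0f} Gbps")
print(f"UHBR20:   {max_rate:.0f} Gbps")
print(f"These ports offer {card_rate / max_rate:.1%} of the maximum raw rate")
```

These are raw numbers before encoding overhead, so the usable payload bandwidth is somewhat lower in both cases.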

The RX 9070 XT has a 304W TBP (Total Board Power), so it makes sense that it comes with dual 8-pin power connectors. Along with the 75W maximum power provided by the PCIe x16 slot — and yes, it's a PCIe 5.0 slot — that's up to 375W of power.

The 9070 only has a 220W TBP, so technically it could even be run off a single 8-pin connector plus the power from the PCIe slot, but taking the safer route of providing a second 8-pin connection is appreciated.
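The power math behind those connector choices is easy to sketch. The 75W slot limit comes from the text above; the 150W rating per 8-pin connector is the standard PCIe figure, which the article implies but doesn't state. The function and constant names are illustrative.

```python
PCIE_SLOT_W = 75    # maximum from the PCIe x16 slot, as cited in the text
EIGHT_PIN_W = 150   # standard rating per 8-pin PCIe power connector (not stated in the article)


def board_power_budget(eight_pin_connectors):
    """Maximum in-spec power available to the card, in watts."""
    return PCIE_SLOT_W + eight_pin_connectors * EIGHT_PIN_W


# TBP values in watts, from the spec table.
cards = {"RX 9070 XT": 304, "RX 9070": 220}

for name, tbp in cards.items():
    for connectors in (1, 2):
        budget = board_power_budget(connectors)
        headroom = budget - tbp
        status = "OK" if headroom >= 0 else "insufficient"
        print(f"{name} with {connectors}x 8-pin: {budget}W budget, {headroom:+d}W headroom ({status})")
```

A single 8-pin connector would leave the RX 9070 only 5W of headroom over its 220W TBP, which helps explain why a second connector is the safer design choice.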

Of course, being the old and reliable 8-pin connectors means there shouldn't be much risk of any meltdowns happening, and you can get away with using pre-ATX 3.0 power supplies. Either way, these are minimalist designs that should work well in general. Which brings us to the important part for anyone reading: the benchmarks.

AMD Radeon RX 9070 XT and RX 9070 Test Setup

This is mostly going to be a rehash of what we've said in other recent reviews, as our testing hasn't changed. At the end of last year, just in time for the Arc B580 launch, we revamped our test suite and our test PC, wiping the slate clean and requiring new benchmarks for every graphics card we want to have in our GPU benchmarks hierarchy.

That takes time, and we've been busy trying to keep up with the new graphics card launches — because it's not just the eight new GPUs that have launched since December, but a bunch of prior generation cards to use for comparison. We also need to retest some of the first cards we put through our new suite, as driver updates and game patches have certainly impacted a few of the results.

Like Nvidia's 50-series GPUs, AMD has some new technologies coming into play with the RX 9000-series RDNA 4 GPUs. FSR 4, AMD's new AI-powered upscaling and frame generation algorithm, requires an RDNA 4 GPU.

Perhaps AMD will figure out how to backport the technology to RDNA 3 and even RDNA 2 GPUs, but given the discrepancies in AI compute potential, it likely won't look as good. But while there are games that already support FSR 4, we're going to focus initially on the base level performance. Page six will have the results when they're ready, but that's going to take a bit longer.

Our GPU test PC has an AMD Ryzen 7 9800X3D processor, the fastest current CPU for gaming purposes. We also have 32GB of DDR5-6000 memory from G.Skill with AMD EXPO timing enabled (CL30) on an ASRock X670E Taichi motherboard.

We're running Windows 11 24H2, with the latest drivers at the time of testing. We used AMD's 25.2.1 drivers for the 7000-series GPUs and AMD's preview 24.30.31.03 for the 9070 cards. For the Nvidia GPUs, we've used several different drivers from the 572 family, depending on when the particular GPUs were tested.

We haven't had time to retest everything on the latest releases, unfortunately, but we've retested a few games and apps where earlier results seemed to not correlate with later testing.

Our PC is hooked up to an MSI MPG 272URX QD-OLED display, which supports G-Sync and Adaptive-Sync, allowing us to properly experience the higher frame rates that RTX 50-series GPUs with MFG are supposed to be able to reach. Most games won't get anywhere close to the 240Hz limit of the monitor at 4K when rendering at native resolution, which is where framegen and MFG can be useful.

Test Equipment

TOM'S HARDWARE AMD ZEN 5 PC

AMD Ryzen 7 9800X3D
ASRock Taichi X670E
G.Skill TridentZ5 Neo 2x16GB DDR5-6000 CL28
Crucial T700 4TB
Cooler Master ML280 Mirror
Corsair HX1500i

GRAPHICS CARDS
AMD RX 9070 XT (PowerColor Reaper)
AMD RX 9070 (PowerColor Reaper)
AMD RX 7900 XTX (MBA reference card)
AMD RX 7900 XT (MBA reference card)
AMD RX 7800 XT (MBA reference card)
Asus RTX 5070 Ti Prime
Nvidia RTX 5070 Founders Edition
Nvidia RTX 4080 Super Founders Edition
Asus RTX 4070 Ti Super TUF Gaming
Gigabyte RTX 4070 Ti Gaming
Nvidia RTX 4070 Founders Edition

Our new GPU test suite currently consists of 22 games. We're still looking at some potential changes and additions, but this is where we're at for now. Six of the games in our standard test suite have RT support enabled.

The remaining 16 games are run in pure rasterization mode. However, we'll be looking at supplemental testing in the coming days to further investigate full RT along with FSR 4 upscaling and framegen. (That testing is still ongoing, but check page six to see if we've added anything.)

All 22 games are tested without any upscaling or frame generation as our baseline. Again, we plan to do additional investigations into things like FSR 2/3/4 and DLSS 2/3/4 along with framegen/MFG, but that will be separate from the primary testing.

There are noticeable differences between the image quality of DLSS, FSR, and XeSS, as well as differences in how much they can affect performance, which is why we're not using any of them for our baseline measurements.

All games are tested using 1080p 'medium' settings (the specifics vary by game and are noted in the chart headers), along with 1080p, 1440p, and 4K 'ultra' settings.

This provides a good overview of performance in a variety of situations. Depending on the GPU, some of those settings don't make as much sense as others, but seeing how fast cards like the RTX 5090 and 5080 run at 1080p can be enlightening.

Our OS has all the latest updates applied. We're also using Nvidia's PCAT v2 (Power Capture and Analysis Tool) hardware, which means we can grab real power use, GPU clocks, and more during our gaming benchmarks. We'll cover those results on page eight.

Finally, because GPUs aren't purely for gaming these days, we run some professional and AI application tests. We've previously tested Stable Diffusion, using various custom scripts, but to level the playing field and hopefully make things a bit more manageable (AI is a fast moving field!), we're turning to standardized benchmarks.

We use Procyon and run the AI Vision test as well as the Stable Diffusion 1.5 and XL tests; MLPerf Client 0.5 preview for AI text generation; SPECworkstation 4.0 for Handbrake transcoding, AI inference, and professional applications; 3DMark DXR Feature Test to check raw hardware RT performance; and finally Blender Benchmark 4.3.0 for professional 3D rendering.

MORE: Best Graphics Cards
MORE: GPU Benchmarks and Hierarchy
MORE: All Graphics Content

AMD Radeon RX 9070 XT and RX 9070 Rasterization Gaming Performance

We divide gaming performance into two categories: traditional rasterization games and ray-tracing games. We benchmark each game using four different test settings: 1080p medium, 1080p ultra, 1440p ultra, and 4K ultra.

As with the RTX 5070, we'd rate the 1440p ultra results as the most important here, though arguably the 9070 XT can also target 4K. So, we'll go ahead and just sort each grouping from highest to lowest resolution/setting.

Do note that 1440p also correlates with 4K using quality mode upscaling, though there's some overhead for the algorithms, and 1080p likewise correlates with 4K using performance mode upscaling.
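The resolution pairing above follows from the per-axis render scale that upscalers typically apply. Here's a minimal sketch, assuming the commonly published scale factors of roughly two-thirds per axis for quality mode and one-half for performance mode. The table and function names are illustrative only and don't come from any vendor API.

```python
# Per-axis render scale for common upscaler presets (assumed typical values).
RENDER_SCALE = {"quality": 2 / 3, "performance": 1 / 2}

def internal_resolution(width, height, mode):
    """Return the approximate resolution a game renders at before upscaling."""
    scale = RENDER_SCALE[mode]
    return round(width * scale), round(height * scale)

print(internal_resolution(3840, 2160, "quality"))      # (2560, 1440)
print(internal_resolution(3840, 2160, "performance"))  # (1920, 1080)
```

The upscaling pass itself adds some frame time, which is why upscaled 4K results won't exactly match the native 1440p or 1080p numbers.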

The interesting thing here is going to be seeing how the two 9070-series cards compete with each other as well as with Nvidia's RTX 5070 and RTX 5070 Ti. The latter has a much higher $749 MSRP, and it's currently selling at $1,149 and up. Even if the 9070 XT can't quite catch the 5070 Ti, if it can come close while also staying closer to its $599 MSRP, it would represent a serious coup.

We'll start with the rasterization suite of 16 games, as that's arguably still the most useful measurement of gaming performance. Plenty of games that have ray tracing support end up running so poorly that it's more of a feature checkbox than something useful.

We'll provide limited to no commentary on most of the individual game charts, letting the numbers speak for themselves. The Geomean charts will be the main focus, since those provide the big picture overview of where the RX 9070 XT and RX 9070 land relative to the other GPUs.
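For readers unfamiliar with how a geomean summary works, here's a rough sketch of the calculation. The FPS values are hypothetical placeholders, not results from our testing, and the helper name is our own for illustration.

```python
import math

def geomean(values):
    """Geometric mean: the n-th root of the product, computed via logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical per-game average FPS for two cards (illustrative numbers only).
card_a = [120.0, 60.0, 90.0, 45.0]
card_b = [100.0, 66.0, 80.0, 50.0]

lead = geomean(card_a) / geomean(card_b) - 1
print(f"Card A leads by {lead:.1%}")
```

Unlike a simple arithmetic average, the geometric mean weights each game's relative difference equally, so a single title running at several hundred FPS doesn't dominate the overall result.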

Tom's Hardware

There are several important comparisons we want to look at. First is how the fastest RDNA 4 GPU, the 9070 XT, fares against the 7900 XTX. The answer: It's probably closer than you would expect based on the raw specs.

The 7900 XTX ends up winning by just 5% overall at 4K, and also by 5% at 1440p and 1080p ultra, with a slightly smaller 3% lead at 1080p medium, where CPU bottlenecks become a bigger factor. The 9070 XT is also consistently 5–10 percent faster than the RX 7900 XT, so it delivers higher performance than the prior generation's nominally $750 part at a price of $600.

Next up, let's look at the 9070 XT versus 9070. The XT costs just $50 more, a 9% price increase, with a theoretical 35% advantage in raw compute. Except that raw compute assumes the GPUs are running at their boost clocks, and that's not always the case. The vanilla 9070 tends to exceed its boost clock in many of our tests, particularly at lower resolutions... but the same goes for the 9070 XT.

Overall, the XT leads by 15% at 4K, 13% at 1440p, and 10%/8% at 1080p ultra/medium. That means that, as many surmised before today's review embargo, the RX 9070 XT is the better value.
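The value argument comes down to simple arithmetic. Here's a small sketch using the two MSRPs and the XT's 4K rasterization lead quoted above; the variable names are ours, and the result only holds if cards actually sell at MSRP.

```python
# MSRPs from the review; the lead is the XT's 4K rasterization geomean advantage.
MSRP_9070 = 549
MSRP_9070_XT = 599
xt_lead_4k = 0.15  # XT is 15% faster at 4K

price_increase = MSRP_9070_XT / MSRP_9070 - 1
# Performance per dollar of the XT relative to the vanilla card.
relative_value = (1 + xt_lead_4k) / (1 + price_increase)

print(f"Price increase: {price_increase:.1%}")  # 9.1%
print(f"XT perf-per-dollar vs. 9070: {relative_value - 1:+.1%}")  # +5.4%
```

At 1080p medium, where the XT's lead shrinks to 8%, the same calculation puts the two cards at nearly identical performance per dollar.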

But what a lot of people really want to know is how the AMD versus Nvidia matchup shakes out. Based on MSRPs, the RTX 5070 Ti should be the fastest of the new cards, and it is. However, the margin of victory isn't very large at all, considering the $150 price difference. We're talking low single digit percentages for our rasterization tests: 0 to 4 percent across our suite, with the biggest lead of 4% coming at 1080p ultra.

That's pretty surprising, considering the 5070 Ti has 40% more memory bandwidth thanks to GDDR7.

That of course means the matchup between the RX 9070 XT and the RTX 5070 ends up being a relative blowout. For $50 more — on paper — the RX 9070 XT beats the RTX 5070 by 29% at 4K, 21% at 1440p, and 14–15 percent at 1080p. It's not even close.

There's only one game in our rasterization suite that the 5070 wins by a decent amount at 1080p, Warhammer 40K: Space Marine 2, which seems to be lacking in the AMD driver optimizations arena — and the 9070 XT is still 11% faster at 4K ultra.

And finally, what about the RX 9070 versus the RTX 5070, both nominally priced at $549? If you're at all good at math and were paying attention above, you'll already know that the 9070 comes out ahead. It's 12% faster at 4K, 8% faster at 1440p, and 4%/7% faster at 1080p.

There are five games where the 5070 manages any lead at all, with Space Marine 2 being the biggest margin of victory and the only one where the 5070 leads at 4K. In general, though, the RX 9070 is clearly better for rasterization performance at native resolution.

Below are the 16 rasterization game results, in alphabetical order, with short notes on the testing where something worth pointing out is present.

Tom's Hardware

Assassin's Creed Mirage uses the Ubisoft Anvil engine and DirectX 12. It's also an AMD-promoted game, though these days, that doesn't necessarily mean it always runs better on AMD GPUs. It could be CPU optimizations for Ryzen, or more often, it just means a game has FSR2 or FSR3 support — FSR2 in this case. It also supports DLSS and XeSS upscaling.

Tom's Hardware

Baldur's Gate 3 is our sole DirectX 11 holdout — it also supports Vulkan, but that performed worse on the GPUs we checked, so we opted to stick with DX11. Built on Larian Studios' Divinity Engine, it's a top-down perspective game, which is a nice change of pace from the many first-person games in our test suite. The faster GPUs are hitting CPU bottlenecks in this game.

Tom's Hardware

Black Myth: Wukong is one of the newer games in our test suite. Built on Unreal Engine 5, which supports full ray tracing as a high-end option, we opted to test using pure rasterization mode. Full RT may look a bit nicer, but the performance hit is quite severe. (Check our linked article for our initial launch benchmarks if you want to see how it runs with full RT enabled. We've got supplemental testing coming as well.)

Tom's Hardware

Dragon Age: The Veilguard uses the Frostbite engine and runs via the DX12 API. It's one of the newest games in my test suite, having launched this past Halloween. It's been received quite well, though, and in terms of visuals, I'd put it right up there with Unreal Engine 5 games — without some of the LOD pop-in that happens so frequently with UE5.

Tom's Hardware

Final Fantasy XVI came out for the PS5 last year, but it only recently saw a Windows release. It's also either incredibly demanding or quite poorly optimized (or both), but it does tend to be very GPU limited. Our test sequence consists of running a set path around the town of Lost Wing.

Tom's Hardware

We've been using Flight Simulator 2020 for several years, and there's a new release below. But it's so new that we also wanted to keep the original around a bit longer as a point of reference. We've switched to using the 'beta' (eternal beta) DX12 path for our testing now, as it's required for DLSS frame generation, even if it runs a bit slower on Nvidia GPUs.

Tom's Hardware

Flight Simulator 2024 is the latest release of the storied franchise, and it's even more demanding than the above 2020 release — with some differences in what sort of hardware it seems to like best. Where the 2020 version really appreciated AMD's X3D processors, the 2024 release tends to be more forgiving to Intel CPUs, thanks to improved DirectX 12 code (DX11 is no longer supported).

Tom's Hardware

God of War Ragnarök released for the PlayStation two years ago and only recently saw a Windows version. It's AMD promoted, but it also supports DLSS and XeSS alongside FSR3. We run around the village of Svartalfheim, which is one of the most demanding areas in the game that we've encountered.

Tom's Hardware

Hogwarts Legacy came out in early 2023 and it uses Unreal Engine 4. Like so many Unreal Engine games, it can look quite nice but also has some performance issues with certain settings. Ray tracing, in particular, can bloat memory use, tank framerates, and cause hitching, so we've opted to test without ray tracing. (At maximum RT settings, the 9800X3D CPU ends up getting only around 60 FPS, even at 1080p with upscaling!) We may replace this one in the coming days.

Tom's Hardware

Horizon Forbidden West is another two-year-old PlayStation port, using the Decima engine. The graphics are good, though I've heard at least a few people say it looks worse than its predecessor, with excessive blurriness being a key complaint. But after using Horizon Zero Dawn for a few years, it felt like a good time to replace it.

Tom's Hardware

The Last of Us, Part 1 is another PlayStation port, though it's been out on PC for about 20 months now. It's also an AMD-promoted game and really hits the VRAM hard at higher-quality settings. Cards with 12GB or more memory usually do fine, and the RTX 5070 lands about where expected.

Tom's Hardware

A Plague Tale: Requiem uses the Zouna engine and runs on the DirectX 12 API. It's an Nvidia-promoted game that supports DLSS 3, but neither FSR nor XeSS. (It was one of the first DLSS 3-enabled games as well.) It has RT effects, but only for shadows, so they don't really improve the look of the game, and they tank performance.

Tom's Hardware

Stalker 2 is another Unreal Engine 5 game, but without any hardware ray tracing support — the Lumen engine also does "software RT" that's basically just fancy rasterization as far as the visuals are concerned, though it's still quite taxing. VRAM can also be a serious problem when trying to run the epic preset, with 8GB cards struggling at most resolutions.

There's also quite a bit of microstuttering in Stalker 2, and it tends to be more CPU limited than other recent games.

Tom's Hardware

Star Wars Outlaws uses the Snowdrop engine, and we wanted to include a mix of options. It also has a bunch of RT options that we leave off for our tests. As with several other games, turning on maximum RT settings in Outlaws tends to result in a less than ideal gaming experience, with a lot of stuttering and hitching even on the fastest cards.

Tom's Hardware

Starfield uses the Creation Engine 2, an updated engine from Bethesda, where the previous release powered the Fallout and Elder Scrolls games. It's another fairly demanding game, and we run around the city of Akila, one of the more taxing locations in the game. It's a bit more CPU limited, particularly at lower resolutions.

Tom's Hardware

Wrapping things up, Warhammer 40,000: Space Marine 2 is yet another AMD-promoted game. It runs on the Swarm engine and uses DirectX 12, without any support for ray tracing hardware.

We use a sequence from the introduction, which is generally less demanding than the various missions you get to later in the game but has the advantage of being repeatable and not having enemies everywhere. Curiously, the RTX 40-series cards are able to hit much higher performance at 1080p than the 50-series and AMD cards.


AMD Radeon RX 9070 XT and RX 9070 Ray Tracing Gaming Performance

Ray tracing can be extremely demanding, and it's traditionally been a weak point for AMD's GPUs. However, the RDNA 4 architecture promises improved RT performance, so now we get to see how it actually fares. AMD even said the 9070 XT should beat the previous generation RX 7900 XTX in RT performance, which means it should be fairly competitive with the 5070 at least.

We're running native rendering for our tests, which is more than most GPUs can handle at 4K in particular. The RTX 5090 and perhaps 4090 can manage that, but mainstream GPUs? Not so much.

The more demanding RT games are usually better optimized for Nvidia GPUs, and often Nvidia promoted. That's no surprise as Nvidia has been pushing the tech far more than AMD or Intel. We've selected six reasonably demanding RT games for our testing, and we'll add additional supplemental RT / full RT / upscaling / framegen testing on the next page (in the future).

Tom's Hardware

Again, there are multiple interesting comparisons. Starting with new AMD versus old AMD, the 9070 XT delivers 10–12 percent higher performance on average across our test resolutions compared to the RX 7900 XTX. Only Avatar, a lighter RT game as far as graphics effects go, runs faster on the XTX card. Elsewhere, Cyberpunk 2077 runs around 25% faster on the 9070 XT. And relative to the 7900 XT, the 9070 XT is 22–32 percent faster.

Looking at the two 9070-series cards, the 9070 XT gets a slightly larger lead in ray tracing than it did in rasterization performance. It's 12–19 percent faster, so again, for 9% more money it's the clearly better option. That's assuming MSRPs have any real meaning, of course.

So, AMD has clearly improved its ray tracing performance compared to RDNA 3, by quite a lot. 64 RT accelerators in the 9070 XT outperform 96 previous gen RT accelerators in the 7900 XTX. But is that enough to compete with Nvidia's cards?

The 9070 XT doesn't quite manage to take down the RTX 5070 Ti, but it's closer than we've seen in the past. It's 13% slower at 4K, 9% slower at 1440p, and 11% slower at 1080p medium — and nearly tied at 1080p ultra, but that's because Nvidia's 50-series has issues with Minecraft at 1080p "ultra."

But while AMD couldn't take down the higher tier 5070 Ti, the RTX 5070 is a different matter. Nvidia's new mainstream card does get slightly higher performance in Minecraft (except at 1080p ultra where performance on Nvidia is again terrible), but everywhere else the 9070 XT gets a clear win. It's 16–20 percent faster overall at our ultra settings, and 10% faster at 1080p medium. For a potential 9% increase in price, it's again the clear winner — though obviously DLSS and other software are still factors to consider.

And finally, we have the RX 9070 versus RTX 5070. AMD got the win in rasterization performance while Nvidia gets a slight win here. And we do mean slight. The 9070 is 4% slower at 1080p medium, only 1% slower at 1440p and 4K — likely thanks to having 16GB — and 5% faster at 1080p ultra where Nvidia's poor Minecraft result still skews the numbers.

Tom's Hardware

Combining all 22 game results into a single chart, the RX 9070 XT is basically tied in overall performance with the RX 7900 XTX, and 10–17 percent faster than the RX 7900 XT. Not surprisingly, since they're using the same architecture, the gap between the 9070 XT and 9070 cards remains pretty consistent, with the XT being 9–16 percent faster overall.

If it hasn't been abundantly clear already, the 9070 XT is obviously the better choice based on performance and MSRP.

Against Nvidia, the RX 9070 XT ultimately ends up slightly slower than the RTX 5070 Ti overall. It loses by 5% at 4K, 4% at 1440p, and 3% at 1080p. But again, on paper, it's 20% cheaper, so that's not a bad tradeoff. Naturally, that means the 9070 XT gets a big win over the vanilla 5070. The 9070 XT is 9–16 percent faster than the 5070, with a larger lead at the higher resolutions.

The RX 9070 ends up a lot closer to its direct competitor. It gets the win, but not by a huge margin: 4% at 1080p, 5% at 1440p, and 8% at 4K. And at that point, barring major differences in real-world pricing, it's close enough that the extras on offer from Nvidia like DLSS, Broadcast, etc. could sway the choice. Still, you do get 33% more VRAM with AMD's card.

The individual RT gaming charts follow, again with limited commentary on each.

Tom's Hardware

Avatar: Frontiers of Pandora uses ray tracing, but it's not particularly forthcoming on when and where it's used. Reflections, in general, don't appear to use RT, which is one of the most noticeable upgrades RT can provide. Instead, it's used for shadows and possibly global illumination and some other effects.

What I can say for sure is that nothing in the menus (other than "BVH Quality") directly mentions ray tracing, and the performance hit doesn't seem to be as severe as in some games. Still, since there's RT of some form, this one gets lumped into our DXR suite.

Tom's Hardware

If you want a game where ray tracing is both clearly visible and actually makes the game look better, without totally destroying performance, look no further than Control. It's now five years old, and we're using the Ultimate version, but it's still arguably the best example of using RT well.

And probably a lot of that is because you're running around the Federal Bureau of Control, an office space of sorts that has good reasons to have plenty of glass windows that reflect the scenery.

Note that Nvidia's RTX 50-series GPUs have some rendering errors in Control right now, and there's a hard 240 FPS cap that can impact the 1080p results. (This game is on the chopping block if I decide I want to trim down the number of tests I'm running.)

Tom's Hardware

Possibly the most hyped-up use of RT in a game, Cyberpunk 2077 launched with more RT effects than other games of its era, and later, the 2.0 version added full path tracing and DLSS 3.5 ray reconstruction. Ray reconstruction ends up looking the best but only works on Nvidia GPUs, so, as with upscaling, it can be a case of trying to compare apples and oranges.

We're using medium settings with RT lighting at medium and RT reflections enabled, and then the step up uses the RT-Ultra preset. In all cases, any form of upscaling or frame generation gets turned off. However, we'll have more details on Cyberpunk 2077 with RT-Overdrive on the next page (eventually).

Tom's Hardware

F1 24 enables several RT effects on the ultra preset but leaves them off on medium. But then 1080p medium runs at hundreds of frames per second, so we went ahead and turned all the RT effects on for our testing. We use the Great Britain track for testing.

Tom's Hardware

Minecraft supports full path tracing, as well as DLSS 2 upscaling on RTX cards. We don't enable DLSS, and the game doesn't even allow it on the RTX 50-series GPUs right now. Our best guess is that it has some sort of hard-coded check for an RTX 20-, 30-, or 40-series GPU. Or it's just a driver bug of some form.

The 50-series GPUs also underperform in Minecraft, especially at 1080p and less so at 1440p and 4K (the 'medium' results are mostly okay). Nvidia is aware of the problem and presumably working on a fix, but we've been saying that for over a month now.

Tom's Hardware

Last on our list of RT-enabled games, Spider-Man: Miles Morales doesn't look as nice with RT turned on as the previous Spider-Man: Remastered. The reflections are less obvious, and perhaps performance is better as a result. But beyond the RT effects, maxing out the settings in Miles Morales definitely needs more than 8GB of VRAM, and even 12GB cards can struggle at times.

(Image credit: Tom's Hardware)

One final ray tracing benchmark we have is the 3DMark DXR Feature Test, where we report the average FPS rather than the calculated score. This is similar to full RT in a game, only done via a standalone benchmark and perhaps in a more vendor-agnostic fashion. Nvidia has also fixed a bug here that was causing Blackwell 50-series GPUs to underperform.

Interestingly, in the "pure" RT performance of 3DMark's DXR Feature Test, the 7900 XTX still comes out slightly ahead of the 9070 XT. The RTX 5070 also comes out ahead of the 7900 XTX. So, if the hope was that this would be a more neutral view of ray tracing potential, it doesn't quite show what we expected to see.


AMD Radeon RX 9070 XT and RX 9070 Full RT and FSR 4 Testing (Coming later...)

As we've said in other recent reviews, there's a lot of other testing we want to conduct, but we've been short on time for the past month or more it feels like. AMD has some games already available with FSR 4 support, and Nvidia has games with DLSS 4 support, but doing the additional testing for all of that can be a massive time sink and we just don't have the time right now.

We'll certainly be revisiting this subject in the coming days, and we'll update this page when we've got some hard data. For now, just know that FSR 4 is something we intend to investigate, sooner rather than later. We'll discuss things in more detail once we have some actual numbers.

More to come...

AMD Radeon RX 9070 XT and RX 9070 Content Creation, Professional Apps, and AI

Modern GPUs like the RX 9070 XT aren't just about gaming. They're used for video encoding and professional applications, and increasingly, they're being used for AI. We've revamped our professional and AI test suite to give a more detailed look at the various GPUs. We'll start with the AI benchmarks, as those tend to be more important for a wider range of users.

Tom's Hardware

Procyon has multiple AI tests, and we've run the AI Vision benchmark along with two different Stable Diffusion image generation tests. The tests have several variants available that are all determined to be roughly equivalent (in output) by UL: OpenVINO (Intel), TensorRT (Nvidia), and DirectML (potentially for everything, but mostly for AMD). There are also options for FP32, FP16, and INT8 data types on some of the tests, which can give different results. We tested the available options and used the best result for each GPU.

Procyon has finally received the necessary update to run the TensorRT workloads on Blackwell 50-series GPUs, which wasn't the case for the 5090, 5080, and 5070 Ti reviews. Those same updates also improved the AI Vision performance for Nvidia's RTX 40-series cards, but the Stable Diffusion results remained about the same.

With the updates in place, Nvidia pretty much clobbers AMD. Even the RTX 4070 outperforms the 9070 XT in SDXL, though AMD does come out ahead in SD 1.5. And in the AI Vision tests, the gap is even worse. The 4070 is 71% faster than the 9070 XT, while the 5070 more than doubles its performance.

If there's a bright spot here, it's that AMD's new 9070 cards do outperform the prior generation AMD GPUs. It's also worth pointing out that Nvidia and Intel GPUs get a performance boost by using integers rather than floating point in the AI Vision test, but AMD doesn't get good performance from integer mode when using ONNX. It defaults back to the GPU shaders rather than running integer computations on the AI accelerators, which clearly doesn't help AMD's standings in this particular task.

Tom's Hardware

MLCommons' MLPerf Client 0.5 test suite does AI text generation in response to a variety of inputs. There are four different tests, all using the Llama 2 7B model, and the benchmark measures the time to first token (how fast a response starts appearing) and the tokens per second after the first token. These are combined using a geometric mean for the overall scores, which we report here.

While AMD, Intel, and Nvidia are all MLCommons partners and were involved with creating and validating the benchmark, it doesn't seem to be quite as vendor-agnostic as we would like. AMD and Nvidia GPUs only have a DirectML execution path, while Intel has both DirectML and OpenVINO as options. Intel's Arc GPUs score quite a bit higher with OpenVINO than with DirectML.

The 9070 series cards only do slightly better than the 7000-series GPUs in time to first token, while the tokens per second results are a lot closer than in some of the other benchmarks. It's not clear exactly why that is, but the 9070 cards also come in below the 7900 XT in tokens per second, so there's likely plenty of room for improvement here.

(Image credit: Tom's Hardware)

We'll have some additional SPECworkstation 4.0 results below, but there's an AI inference test composed of ResNet50 and SuperResolution workloads that runs on GPUs (and potentially NPUs, though we haven't tested that). We calculate the geometric mean of the four results given in inferences per second, which isn't an official SPEC score but it's more useful for our purposes.

The RX 9070 and 9070 XT results were odd here, with the 9070 outperforming the 9070 XT. We'll have to look into retesting; perhaps we inadvertently swapped the numbers when recording the results. But the 9070 ends up on par with the RTX 5070, so we'd expect the 9070 XT to rank a lot higher.

Tom's Hardware

For our professional application tests, we'll start with Blender Benchmark 4.3.0, which has support for Nvidia OptiX, Intel oneAPI, and AMD HIP libraries. Those aren't necessarily equivalent in terms of the level of optimizations, but each represents the fastest way to run Blender on a particular GPU at present.

We need to note here that the 9070 cards can't run Blender Benchmark right now. Instead, we needed to get a special build of Blender 4.4.0, the full application, that supports the RDNA 4 GPUs. It doesn't appear to have inflated the new AMD GPU results, with the two newcomers ending up slightly ahead of the 7900 XTX and 7900 XT. Nvidia meanwhile beats AMD's fastest card with the 4070 and above.

(Image credit: Tom's Hardware)

SPECworkstation 4.0 has two other test suites that are of interest in terms of GPU performance. The first is the video transcoding test using HandBrake, a measure of the video engines on the different GPUs and something that can be useful for content creation work. We use the average of the 4K to 4K and 4K to 1080p scores. Note that this only evaluates speed of encoding, not image fidelity.

AMD has improved its video encoding hardware with RDNA 4, so our previous GPU encoding tests that showed AMD with significantly lower image fidelity are no longer up to date, particularly with regard to the 9070 cards. But performance has also improved, with the 9070 cards basically tied for maximum performance.

Tom's Hardware

Our final professional app tests consist of SPECworkstation 4.0's viewport graphics suite. These are basically the same tests as SPECviewperf 2020, only updated to the latest versions. (Also, Siemens NX isn't part of the suite.) There are seven individual application tests, and we've combined the scores from each into an unofficial overall score using a geometric mean.

AMD's drivers for its consumer cards tend to be more friendly toward these professional applications, and the 9070 series doesn't alter that. Instead, AMD improves its standings slightly, with the 9070 XT taking the top spot, just ahead of the 7900 XTX. The 9070 ends up slightly behind the XTX, in third place overall.

These AI and professional tests are ultimately just one aspect of GPU performance, and if you only care about gaming they shouldn't exert much influence on your choice of GPU. That's especially true of the professional tests. AI could become something useful even for gaming, maybe, but higher Blender performance will only matter if you're actually using Blender for 3D modeling.


AMD Radeon RX 9070 XT and RX 9070 Power, Clocks, Temps, and Noise

(Image credit: Tom's Hardware)

All our gaming tests are conducted using an Nvidia PCAT v2 device, which allows us to capture total graphics card power, GPU clocks, GPU temperatures, and some other data as we run each gaming benchmark. We have separate 1080p, 1440p, and 4K results for each area, which we'll order from highest to lowest resolution for these tests.

Tom's Hardware

AMD's power requirements were a lot higher than Nvidia with the prior generation, but with RDNA 4 and Blackwell the two companies are more or less on the same process node — N4P for AMD and 4N for Nvidia. The 9070 XT has a 304W TBP and comes in slightly below that mark at 4K, while the 9070 has a 220W TBP and is basically right on target.

Dropping down to lower resolutions and settings reduces power draw on all the cards, and the net result is that the 9070 generally uses less power than the 5070 — they're basically tied at 1080p, while AMD proves to be more efficient by using less power at 1440p and 4K.

The 9070 XT meanwhile ends up using more power across the test suite compared to the 5070 Ti. That's interesting, as Nvidia uses more power with the 5070 relative to the 9070, while the 5070 Ti offers more performance than the 9070 XT while drawing less power.

Tom's Hardware

Clock speeds among the different GPUs and architectures aren't super important, but it's interesting to see where things land. AMD has increased clock speeds on average compared to RDNA 3, with the 9070 XT at times breaking the 3.0 GHz barrier even at stock settings. It does fall off the pace a bit at 4K, basically tied with the 5070 Ti, but it's over 2.9 GHz at all the lower resolutions.

The RX 9070 exceeds its rated boost clock at 1440p and 1080p but falls below 2.5 GHz at 4K. Power limits appear to be a significant limiting factor in performance at 4K for the card, so manual overclocking could end up being quite beneficial.

Tom's Hardware

Like the clock speeds, comparing GPU temperatures without considering other aspects of the cards doesn't make much sense. One card might run its fans at higher RPMs, generating more noise while being "cooler." So these graphs should be used alongside the noise and performance results.

AMD doesn't make reference 9070 cards, so the results here are a reflection of the GPUs to a certain degree, but really they're more an indication of how the PowerColor Reaper cards run. And they do a lot better than the 5070 Founders Edition, considering it's one of the hotter running cards.

But we also need to look at noise levels...

(Image credit: Tom's Hardware)

We check noise levels using an SPL (sound pressure level) meter placed 10cm from the card, with the mic aimed right at the center of one fan: the center fan if there are three fans, or the right fan for two fans. This helps minimize the impact of other noise sources, like the fans on the CPU cooler. The new noise floor of our test environment and equipment is around 34 dB(A), due to the noise from the CPU cooling pump.

Even more impressive than the thermals on the PowerColor cards is their noise levels. Only the RTX 4070 ended up being quieter than the 9070 XT, but it also ran quite a bit warmer. Not that any of these cards are really running hot, but it does show that traditional cooler designs with triple fans are still very capable.

Tom's Hardware

Here's the full table of testing results, with FPS/$ calculated using the various launch MSRPs for the cards. That's because current retail prices are all wildly inflated, and many of the previous generation GPUs are now discontinued. We can only hope prices on the latest generation cards actually manage to reach MSRPs at some point. (Wishful thinking, perhaps.) Latency results are included for some of the games as well, and you can see the game-by-game power figures.

MORE: Best Graphics Cards
MORE: GPU Benchmarks and Hierarchy
MORE: All Graphics Content

AMD Radeon RX 9070 XT and RX 9070: The XT is great, the vanilla card less so

(Image credit: Tom's Hardware)

The AMD Radeon RX 9070 XT and RX 9070 represent a big step forward in several areas for AMD. They have significantly improved ray tracing performance, to the point where the RX 9070 XT easily beats the 5070 and comes relatively close to the 5070 Ti. There's also new and improved AI hardware that's not quite as fast as what Nvidia offers, but it should provide some substantial improvements to a variety of workloads. It will also power FSR 4, but we'll have to investigate that more when we have more time.

There's still the question of price and availability. There are a lot of rumors and suggestions that the 9070 cards will have a lot more stock ready for interested buyers than what we've seen from Nvidia, but here's the thing: No one actually knows how many RTX 50-series GPUs have been sold, outside of Nvidia itself. What we do know is that there's massive demand and an ongoing shortage, and it doesn't look like it will clear up any time soon.

Where will prices end up on the RX 9070 XT and 9070 cards? We can only guess. In the meantime, MSRPs are the only thing we can really point to, and AMD has delivered a potentially excellent value proposition with its new GPUs. The 9070 XT, in particular, looks set to be a hot item, as it's roughly 15% faster than the 9070 for 9% more money. It's also only about 5% slower than the RTX 5070 Ti but costs 20% less — in theory.
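The value argument above comes down to dividing relative performance by relative price. The short sketch below is only an illustration that plugs in the rounded percentages quoted in this review, not measured data; the function name `relative_value` is hypothetical.

```python
def relative_value(perf_ratio: float, price_ratio: float) -> float:
    """Performance per dollar of card A relative to card B (1.0 means equal value)."""
    return perf_ratio / price_ratio

# 9070 XT vs. 9070: roughly 15% faster for 9% more money.
xt_vs_9070 = relative_value(1.15, 1.09)

# 9070 XT vs. RTX 5070 Ti: about 5% slower for 20% less money.
xt_vs_5070ti = relative_value(0.95, 0.80)

print(f"9070 XT vs 9070: {xt_vs_9070:.2f}x performance per dollar")
print(f"9070 XT vs 5070 Ti: {xt_vs_5070ti:.2f}x performance per dollar")
```

At MSRP, that works out to a modest value edge over the vanilla 9070 and a much larger one over the RTX 5070 Ti. Any retail markup changes the price ratio and therefore the result.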

In practice, of course, Nvidia's RTX 5070 Ti is currently sold out at MSRP and commanding prices of potentially over $1,000. We expect the same thing will happen with the RTX 5070 launch this morning — maybe not the $1,000+ prices, but selling out quickly seems almost inevitable. But we'll have to wait and see what happens.

(Image credit: Tom's Hardware)

As we noted with the RTX 5070 review, the fundamental problem right now is one of manufacturing capacity. TSMC has the best 5nm-class and 3nm-class processes right now, and the line of companies wanting to order wafers has gotten very large. Most of the orders are likely going to AI hardware, including Nvidia's Blackwell B200 GPUs, which sell for far higher prices — prices that consumer hardware can't really hope to compete with.

AMD competes for those same wafers, and it also uses them for its Ryzen and EPYC CPUs. The CCDs (Core Complex Dies) in Zen 5 are pretty small compared to the Navi 48 GPU, at only 71 mm^2 versus 357 mm^2. AMD can get about five Zen 5 CCDs from the same wafer area that provides a single Navi 48. And Zen 5 CCDs going into the Ryzen 7 9800X3D are making far more money per unit for AMD than a Radeon GPU.
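The "about five CCDs per Navi 48" figure follows directly from the two die sizes. The sketch below is a simple area ratio and nothing more: it assumes no yield loss, no scribe lines, and no wafer-edge waste, so treat it as a rough check rather than a real dies-per-wafer calculation.

```python
NAVI_48_AREA_MM2 = 357.0   # die size quoted above
ZEN5_CCD_AREA_MM2 = 71.0   # die size quoted above

# Pure area ratio: how many CCDs occupy the silicon of one Navi 48.
ccds_per_navi48 = NAVI_48_AREA_MM2 / ZEN5_CCD_AREA_MM2
print(f"Zen 5 CCDs per Navi 48 of wafer area: {ccds_per_navi48:.2f}")
```

Smaller dies also tend to yield better at a given defect density, so the practical ratio is unlikely to be worse than this estimate.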

AMD also has data center MI300X and MI350X chips, which like Nvidia's Hopper and Blackwell command significantly higher prices. It's not doing the same AI volume as Nvidia, but it has said in the past that the MI product lines and CDNA series have been one of its fastest sales ramps ever. Should it make more data center chips that sell for $10,000 or more, or make more consumer chips that sell for $600 or less?

At the same time, AMD wants to increase its share of the GPU market. It has a far smaller total share than Nvidia, and that share has been trending downward. Intel is also trying to gain market share with its Arc GPUs. So, we could see both companies sacrifice some profit margins to increase their GPU share.

(Image credit: Tom's Hardware)

At MSRP, the RX 9070 XT represents an awesome value and a great card in general. Performance is higher than the previous generation RX 7900 XT, pretty much across the board, with a nominal price of $599. If it sells at that price, in quantity, this is about as good as we can expect from the graphics card market right now. Based on that, we've scored it 4.5 stars. Obviously, if prices increase substantially, the desirability of the cards will change.

The RX 9070 XT, on paper, delivers a knockout blow to the RTX 5070. More VRAM, up to 25% higher performance, competitive RT, all for just $50 more? What's not to love? Well, as we said, actual retail availability is still unknown and could end up being just as horrible as the RTX 50-series launches so far.

The RX 9070 isn't quite as impressive. Yes, it's faster than the RTX 5070, but not by that much. It also offers more VRAM than the 5070, but conversely, Nvidia offers better software and features. FSR 4 might make AMD more competitive, but DLSS is in far more games than FSR. It ends up being pretty much a wash in our book, with real prices being the determining factor.

We're primarily talking about the 9070 XT today, even though we've shown all the 9070 results, so the 4.5-star score doesn't apply to the vanilla card. Unless supply ends up being far better than we're expecting, we're tentatively giving it the same 3.5-star score as the RTX 5070, or perhaps 4.0 stars, because it's not that much slower than the XT, it has the same amount of VRAM, and it uses 80W or so less power. We'll finalize that score in a separate review in the coming days, and it could get bumped up half a point based on what happens with the launch tomorrow.

Ultimately, while the performance, on-paper specs, and pricing look great on many of the new GPUs, the actual prices will end up mattering most. And if you're hoping to buy a new graphics card right now, you really don't have many other options. It's not like there are a bunch of previous-generation cards still taking up retail space that need to be cleared out.

Apple debuts M3 Ultra in refreshed Mac Studio with up to 512GB memory

Andrew E. Freedman — Wed, 05 Mar 2025 14:00:00 +0000

Apple has a new powerhouse computer ready to go. The company today announced that it's refreshing the Mac Studio (which hasn't seen a change since 2023) with two new chips: M4 Max and M3 Ultra.

Don't get it twisted: while M4 Max is the same capable chip that Apple released late last year in the MacBook Pro, the M3 Ultra is actually Apple's most capable processor to date, despite the generation names.

Meet M3 Ultra

M3 Ultra, like M2 Ultra, is composed of two 3nm M3 Max chips joined with an interposer. M3 Ultra features up to a 32-core CPU with 24 performance cores, the most CPU cores ever in a Mac. There's an 80-core GPU, making for Apple's largest graphics chip yet, with support for Dynamic Caching, mesh shading, and hardware-accelerated ray tracing. The new chip also boasts a 32-core Neural Engine. M3 Ultra can pair with up to 16TB of internal storage, which is sure to be wildly expensive. Perhaps most importantly, it can use up to 512GB of unified memory with 800 GB/s memory bandwidth. This is enough to load large language models with over 600 billion parameters in memory.
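Whether a 600-billion-parameter model fits in 512GB depends on how many bits are stored per weight. Apple's claim does not state a precision, so the sketch below is a back-of-the-envelope estimate under stated assumptions: it counts model weights only (no KV cache, activations, or operating system overhead), and the 16-, 8-, and 4-bit options are illustrative choices, not figures from the announcement.

```python
def weights_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

UNIFIED_MEMORY_GB = 512  # maximum M3 Ultra configuration

for bits in (16, 8, 4):
    need = weights_gb(600, bits)
    fits = "fits" if need <= UNIFIED_MEMORY_GB else "does not fit"
    print(f"600B parameters at {bits}-bit: {need:.0f} GB -> {fits}")
```

Under these assumptions, only a quantized model of around 4 bits per weight leaves headroom, which suggests the claim refers to quantized weights.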

In a demo, I saw the M3 Ultra run Cinema 4D, where an artist wanted to spread foliage around a landscape. Using LM Studio, they created a Python script to scatter the assets, pasted it into Cinema 4D, and it was done. What could've taken a day took just minutes. From there, they were able to open Maxon Redshift to see a high-quality preview with hardware ray tracing.

In addition, I saw (but did not play) an early demo of Cyberpunk 2077, which is coming this year for Macs, running on the hardware, locked at 60 fps (thanks to VSync on the monitor) with full ray tracing.

Why no M4 Ultra? The M2 Ultra also lagged other chips, so it may just come down to development time. But Apple mentioned while showing some demos that not every generation of its silicon would get an Ultra chip, so time will tell if an M4 Ultra will show up at all. There have been M1 Ultra and M2 Ultra chips, so no Ultra chips have been skipped just yet.

Despite this top-end chip, there's no update (yet at least) to the Mac Pro, which is currently using M2 Ultra.

Like the existing Mac Studios, the Ultra-based computer will get more significant cooling, which means it will weigh approximately two pounds more than the Max version.

M4 Max and Connectivity

Apple

For those who don't need that massive chip, there's still the plenty powerful M4 Max option, with up to 16 CPU cores and up to 40 GPU cores. That starts at 36GB of memory and goes up to 128GB. The demo I saw with that chip was a more straightforward video production project in Autodesk Flame, including base color and special effects layers.

Outside of performance, the other big benefit you get from the updated Mac Studio is a bump to Thunderbolt 5. On the M3 Ultra version, every single USB Type-C port carries that technology, while on M4 Max, the rear ports use Thunderbolt 5 while the front two ports use USB-C up to 10 Gb/s. On either model, you also still get a pair of USB Type-A ports for legacy peripherals, along with an HDMI port, 10Gb Ethernet, and a headphone jack.

The Mac Studio with M4 Max will start at $1,999 with 36GB of RAM and 512GB of storage, while the M3 Ultra version will start at $3,999 with 96GB of RAM and 1TB of storage. Both are available for pre-order today and will launch on March 12.

DeepSeek brings disruption to AI-optimized parallel file systems, releases powerful new open-source Fire-Flyer File System

Sunny Grimm — Sat, 01 Mar 2025 16:32:08 +0000

DeepSeek AI has made its Fire-Flyer File System (3FS) parallel file system fully open-source this week, as part of its Open Source Week event. The disruptive AI company from China brags that 3FS can hit 7.3 TB/s aggregate read throughput in its own server data clusters, where DeepSeek has been using 3FS to organize its servers since at least 2019.

3FS is a Linux-based parallel file system designed for use in AI-HPC operations, where many data storage servers are being constantly accessed by GPU nodes for training LLMs. 3FS differs from other file systems largely in that it prioritizes random read speeds above nearly everything else and almost completely ignores read caching.

When training AI models, compute units need to access random training data constantly, and reading this data is a one-time-only process. Therefore, a read cache is nearly useless and is largely done away with by 3FS. In fact, using the read cache when training LLMs may be potentially harmful; as LLMs are basically just super-tuned inference machines, reading the same data in the same order repeatedly has the potential to link completely different data as a set to the language model.

The team responsible for operating one of DeepSeek's deep learning clusters, Fire-Flyer 2, published a paper last August outlining the use of 3FS in the custom-built system. In Fire-Flyer 2, DeepSeek utilized 180 storage nodes, each loaded with 16 16TB SSDs and two 200Gbps NICs. These nodes served 10,000 PCIe Nvidia A100 GPUs, built out in much cheaper servers than Nvidia's proprietary DGX-A100 products.

Across the whole array, DeepSeek claims it benchmarked 3FS's performance at 6.6 TB/s, while also running training tasks in the background that added an additional 1.4 TB/s of read throughput. In comparison, competitor file system Ceph only reached 1.1 TB/s of read throughput (on a cluster with 68 nodes, each loaded with 10 16TB SSDs and 2 x 100 Gbps networking) for the first time in early 2024.
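One way to put the two aggregate figures on equal footing is to divide them by node count and compare the result with each node's network line rate. The sketch below is a rough illustration under assumptions that are not from either benchmark: decimal units throughout, load spread evenly across nodes, and the network links treated as the ceiling. The helper name `per_node` is hypothetical.

```python
def per_node(aggregate_tb_s: float, nodes: int, nics: int, nic_gbps: int):
    """Return (GB/s delivered per node, GB/s of network line rate per node)."""
    delivered = aggregate_tb_s * 1000 / nodes   # TB/s -> GB/s, split evenly
    line_rate = nics * nic_gbps / 8             # gigabits -> gigabytes
    return delivered, line_rate

clusters = {
    "3FS (Fire-Flyer 2)": per_node(6.6, 180, 2, 200),
    "Ceph": per_node(1.1, 68, 2, 100),
}
for name, (delivered, line_rate) in clusters.items():
    print(f"{name}: {delivered:.1f} of {line_rate:.0f} GB/s per node "
          f"({delivered / line_rate:.0%} of line rate)")
```

Seen this way, much of the gap in the headline numbers comes from cluster size and faster networking, while per-node efficiency is closer than the aggregate figures suggest.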

3FS was credited as a crucial part of DeepSeek's software stack for training DeepSeek AI in the above paper, as tested on the Fire-Flyer 2 HPC solution that achieved 80% of the performance of Nvidia's DGX-A100 server solution for 50% of the price and 60% of the power draw.

Those curious about trying out the Fire-Flyer File System and its random-read-forward style for AI-HPC solutions can find the full download on DeepSeek's GitHub page. We'd be surprised if this new open-source system does not become a hit for enthusiasts and enterprise AI-HPC users alike, though it may have to overcome some level of anti-Chinese tech fear to hit blockbuster status.

Singapore police bust major ring smuggling Nvidia GPUs to China-based DeepSeek: Report

Anton Shilov — Fri, 28 Feb 2025 12:29:45 +0000

Singapore Police Force have charged three men with fraud in a case involving allegedly illegal re-export of Nvidia GPUs to Chinese AI company DeepSeek, bypassing U.S. trade restrictions, reports ChannelNewsAsia. The police and customs authorities raided 22 locations, arrested nine individuals, and seized documents and electronic records, reports Reuters.

When Singapore suddenly became Nvidia's second largest geographical source of revenue in 2024, many suspected that this happened because Nvidia's GPUs were illegally re-exported from Singapore to China. Nvidia denied all accusations, saying that billing locations do not represent the actual destination of GPUs. Still, the U.S. Commerce Department started investigating whether DeepSeek has acquired restricted American GPUs to train its AI models.

"Customers use Singapore to centralize invoicing while our products are almost always shipped elsewhere," a statement by Nvidia reads. "Shipments to Singapore were less than 2% of fiscal year 2025 total revenue."

(Image credit: Nvidia)

However, it looks like the smuggling problem is real: intermediaries in Singapore appear to have helped move high-performance Nvidia GPUs for AI and HPC to China in violation of U.S. export laws.

The accused include Singaporeans Aaron Woon Guo Jie, 41, and Alan Wei Zhaolun, 49. Prosecutors allege that in 2024, they conspired to deceive a server supplier by falsely claiming the equipment would not be resold to unauthorized parties. A third suspect, Li Ming, 51, a Chinese national, faces separate charges related to a similar scheme in 2023. Authorities claim he misrepresented the intended recipient of hardware, stating it was meant for a Singapore-based company, Luxuriate Your Life.

If convicted, the suspects could face up to 20 years in prison, fines, or both. Authorities have not disclosed details about other arrested individuals or whether additional charges will be filed.

While the arrests clearly indicate the involvement of Singapore-based groups in smuggling restricted high-performance Nvidia GPUs to China, the extent of their operations is yet to be determined. Companies like DeepSeek need tens of thousands of Nvidia Hopper GPUs (H100, H20, H800) to train their large language models. However, smaller research institutions run smaller clusters containing tens or hundreds of such processors.

Last week Singapore's government emphasized that while it is not legally bound to enforce unilateral export restrictions imposed by other nations, it expects businesses operating within its borders to comply with such regulations where applicable. Authorities have reiterated that the country does not tolerate attempts to exploit its trade networks to circumvent international controls.

Chinese CPU maker Zhaoxin rolls out DeepSeek support to all processors — entire product lineup now runs DeepSeek LLMs natively

Sunny Grimm — Tue, 25 Feb 2025 15:56:18 +0000

DeepSeek's rollout into the Chinese consumer market continues, as Zhaoxin has announced its adoption of the DeepSeek-R1 LLM across its hardware lineup. Zhaoxin, one of the few Chinese companies licensed to work with the x86 instruction set, boasts that its processors and OEM systems can natively run the 1.5B, 7B, 14B, 32B, 70B, and 671B parameter models released by DeepSeek so far.

Zhaoxin's press release mainly highlights two chips: its KaiXian KX-7000/8 consumer processor and Kaisheng KH-40000/32 64-core server processor. The KX-7000/8 is an 8-core model running at 3.7GHz with 32MB of L3 cache. Zhaoxin advertises that the chip can natively run DeepSeek-R1-7B when paired with an unnamed Chinese GPU. Integrations with word processors and the VSCode interface allow AI-assisted writing, spreadsheets, and programming.

This unnamed AI accelerator card is undoubtedly a major contributor to the AI performance touted here. When recently tested against the 7-year-old Intel i3-8100 quad-core chip, the KX-7000/8 could win out in multi-core benchmarks but was handily beaten in single-core workloads, putting up CPU-Z single-core results of 335.9 points vs. the i3-8100's 422.2.

Zhaoxin's enterprise-grade KH-40000 family is also featured heavily, with the KH-40000/16 and /32 chips being tested for their AI performance. As part of an OEM AI workstation, the KH-40000/16 successfully deployed up to the 32B model of DeepSeek-R1. The Lianhe Donghai XRS302 server workstation, a fully Chinese-made server, was fitted with four Chinese AI accelerator add-in cards to supplement the 16-core, 2.2GHz server processor. As the Donghai XRS302 does not ship complete, we do not know more details, such as the GPUs or RAM used for these tests.

Finally, Zhaoxin's flagship KH-40000/32, a 32-core chip designed for use in dual-CPU servers, was able to deploy DeepSeek's 70B model and run its 671B model without a GPU (translation loses some nuance; it seems the 70B was a more comfortable deployment, while the 671B model just managed to run).

Zhaoxin's claims above are difficult to judge accurately, thanks to the language barrier and the company's incredibly vague parameters for LLM performance beyond screenshots. Whether DeepSeek-R1 running on Zhaoxin CPUs speaks more to the success of DeepSeek's software or Zhaoxin hardware remains to be seen.

DeepSeek continues to be a breakout moment for the Chinese tech market, with more hardware companies rushing to integrate it with their products. Even smart TVs recently got DeepSeek integration. After taking Nvidia and OpenAI stock to the cleaners, DeepSeek and the Chinese hardware field will likely try to keep up the momentum in the coming weeks to prove the capability of the Chinese tech sector.

AMD Radeon RX 9070 XT performance estimates leaked: 42% to 66% faster than Radeon RX 7900 GRE

Anton Shilov — Sun, 23 Feb 2025 18:25:00 +0000

AMD reportedly held a press briefing and disclosed more information about its upcoming Radeon RX 9000-series graphics processors as well as the RDNA 4 architecture. Perhaps the most important part was the disclosure of AMD's official performance numbers of the new Radeon RX 9070 XT graphics card that appeared to be significantly ahead of the Radeon RX 7900 GRE, according to the allegedly official numbers published by VideoCardz.

In fact, AMD claims that the upcoming Radeon RX 9070 XT is 42% to 66% faster than the Radeon RX 7900 GRE at a 4K resolution with 'ultra' quality settings across over 30 games. The Radeon RX 9070 XT outperforms the RX 7900 GRE by an average of 38% at 1440p and 42% at 2160p. However, in certain titles that rely on ray tracing more than others, such as Cyberpunk 2077 and Hitman 3, performance gains reach the top of that range, again according to the numbers published by VideoCardz.

Games with ray tracing tend to see the biggest increases, emphasizing AMD's RDNA 4 advances in handling RT workloads. Titles like Cyberpunk 2077, Dying Light 2, F1 24, and Hitman 3 show the strongest performance jumps of 56% to 66%, which clearly makes the new Radeon RX 9070-series offerings strong contenders to sit amongst the best graphics cards in the coming quarters.

When it comes to the performance difference between the Radeon RX 9070 XT and the Radeon RX 9070 (non-XT), the delta averages between 16.1% at 1440p Ultra Settings and 18.3% at 2160p Ultra Settings across the tested games, according to AMD. Yet, the Radeon RX 9070 (non-XT) still delivers a 20% performance boost at 1440p and a 21% higher performance at 2160p over the Radeon RX 7900 GRE.
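The XT versus non-XT gap can be roughly cross-checked from the averages quoted against the RX 7900 GRE. The sketch below is an illustration only: it takes the rounded averages from this article, converts each "X% faster" figure into a multiplier, and divides. The names `uplift_vs_gre` and `implied_gap` are hypothetical.

```python
# Average uplift over the RX 7900 GRE, as quoted from AMD's figures.
uplift_vs_gre = {
    "1440p": {"9070 XT": 0.38, "9070": 0.20},
    "2160p": {"9070 XT": 0.42, "9070": 0.21},
}

implied_gap = {}
for res, cards in uplift_vs_gre.items():
    # Convert "X% faster" to a multiplier before dividing.
    implied_gap[res] = (1 + cards["9070 XT"]) / (1 + cards["9070"]) - 1
    print(f"{res}: 9070 XT about {implied_gap[res]:.1%} faster than the 9070")
```

The implied gaps land about a percentage point below the 16.1% and 18.3% deltas attributed to AMD. A small mismatch like that is expected, because the inputs are rounded and an average of per-game ratios is not the same as a ratio of averages.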

AMD admitted that it did not have a GeForce RTX 5070 Ti for comparison, so it did not compare its new flagship against some of Nvidia's most wanted parts of today. The company did not explain why it decided not to compare its upcoming Radeon RX 9070 XT against its existing Radeon RX 7900 XTX flagship, but stuck to the cut-down Radeon RX 7900 GRE. The latter has around 25% lower compute performance compared to the range-topping Radeon RX 7900 XTX, but also has 16GB of memory onboard, whereas the range-topping board carries 24GB of GDDR6 VRAM.

Unlike Nvidia, which uses upscaling technologies and frame generation to demonstrate performance improvements over the previous generation, AMD has chosen to focus on native rendering performance and ray tracing, so the performance gains are quite real and the red company earns some kudos. More details will, of course, be shared on February 28 when AMD officially presents its Radeon RX 9070-series products.

Leaked AMD RX 9070 XT benchmarks see it match Nvidia's RTX 4070 Super in synthetic tests

Hassam Nasir — Fri, 21 Feb 2025 12:57:58 +0000

Geekbench leaks have offered us a glimpse of what to expect from AMD's upcoming RX 9070 series GPUs (via Benchleaks on X). These test scores were probably inadvertently made public by an unsuspecting reviewer. In any case, while we don't want to jump to conclusions, and neither should you, the numbers are disappointing considering everything we've heard thus far. The flagship RX 9070 XT barely matches Nvidia's RTX 4070 Super. However, we cannot confirm the authenticity of these tests. Also, given the architectural revamps with RDNA 4, synthetic tests are not guaranteed to reflect how these GPUs will hold up in real-world performance.

Two separate test benches were used for the RX 9070 XT and its non-XT counterpart. The former is equipped with AMD's Ryzen 7 9800X3D on the Asus ROG Crosshair X870E Hero motherboard. The latter sticks with the standard Ryzen 7 9700X coupled with the MSI B650 Gaming Plus Wi-Fi motherboard. The setups are pretty different, so it's obvious the tests weren't conducted by a single person.

Geekbench was also generous enough to unofficially confirm previously rumored specifications. From the listings, the RX 9070 XT and 9070 non-XT share a similar 16GB VRAM configuration. The only difference is in the core counts; the 9070 XT has 64 CUs (Compute Units) while the 9070 offers 56 CUs. For the sake of comparison, we've aggregated publicly available data across the Vulkan and OpenCL APIs at Geekbench.

Jumping into the benchmarks, the RX 9070 XT scored 177,395 and 179,178 points in Vulkan and OpenCL, dropping to 158,520 and 140,842 points for the 9070 respectively. The RX 9070 XT, per this test, is 22% slower than Nvidia's RTX 5070 Ti and that's quite telling. Jumping over to Ada, the RTX 4070 Super is a more suitable match. As expected, the 9070 XT isn't quite able to topple the RDNA 3 flagship but manages a somewhat decent 28% uplift against its predecessor, the RX 7800 XT.

GPU	Vulkan	OpenCL	vs 9070 XT in Vulkan	vs 9070 XT in OpenCL
RX 9070 XT (Leaked)	177395	179178	100.00%	100.00%
RX 9070 (Leaked)	158520	140842	89.36%	78.60%
RX 7900 XTX	235279	212081	132.63%	118.36%
RX 7900 XT	206494	186399	116.40%	104.03%
RX 7800 XT	155488	139983	87.65%	78.13%
RTX 5070 Ti	228576	229140	128.85%	127.88%
RTX 4070 Super	178982	192378	100.89%	107.37%
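The percentage columns and the "22% slower" and "28% uplift" claims can be reproduced from the raw scores. The sketch below uses only the numbers in the table above; the dictionary layout and the helper name `relative` are illustrative choices.

```python
scores = {  # (Vulkan, OpenCL) Geekbench scores from the table above
    "RX 9070 XT": (177395, 179178),
    "RX 9070": (158520, 140842),
    "RX 7800 XT": (155488, 139983),
    "RTX 5070 Ti": (228576, 229140),
    "RTX 4070 Super": (178982, 192378),
}

def relative(card: str, baseline: str = "RX 9070 XT") -> tuple:
    """Scores of `card` as a percentage of `baseline` (Vulkan, OpenCL)."""
    return tuple(100 * c / b for c, b in zip(scores[card], scores[baseline]))

for card in scores:
    vulkan, opencl = relative(card)
    print(f"{card:<15} {vulkan:7.2f}% {opencl:7.2f}%")

# The "slower than" and "uplift" claims use a different baseline card.
xt_vs_5070ti = relative("RX 9070 XT", "RTX 5070 Ti")[0]  # Vulkan
xt_vs_7800xt = relative("RX 9070 XT", "RX 7800 XT")[1]   # OpenCL
print(f"9070 XT is {100 - xt_vs_5070ti:.0f}% slower than the 5070 Ti in Vulkan")
print(f"9070 XT is {xt_vs_7800xt - 100:.0f}% faster than the 7800 XT in OpenCL")
```

Note that "slower than" and "faster than" depend on which card is the baseline: the RTX 5070 Ti scoring about 129% of the 9070 XT is the same result as the 9070 XT being about 22% slower. The 28% uplift over the RX 7800 XT appears to match the OpenCL column; the Vulkan gap between those two cards is smaller.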

We can compare synthetics all day long, but at the end of the day, proper real-world tests are what truly matter. Back in January, a tipster alleged raster performance in the ballpark of an RTX 4080 Super, so it's possible that RDNA 4 doesn't perform as well in theoretical tests, but that's just a guess. If the RTX 5070 is around 15% faster than the RTX 4070, which is typical for Blackwell GPUs, that'd land it in RTX 4070 Super territory. This doesn't leave much wiggle room for AMD, but we'll get a clearer picture after its presentation on February 28.

RTX 5070 Ti restocks expected within 2-6 weeks, says UK retailer — All sold out on launch day

Hassam Nasir — Thu, 20 Feb 2025 19:33:51 +0000

The RTX 5070 Ti is now officially available for purchase, assuming you can find a model in stock. Following the RTX 5090/5080 launch debacle, this much was expected, and it doesn't take more than a few clicks on eBay to find scalpers selling a $749 GPU in four-digit territory. OCUK, a large UK reseller, frequently publishes updates on its latest GPU inventory on X (formerly Twitter). The latest report is that all RTX 5070 Ti models have been sold out, with restocks anticipated within two to six weeks. Other Blackwell GPUs are also impossible to find, though the restock estimates are slightly more generous than at launch.

The handful of MSRP models instantly flew off shelves as the embargo lifted and are nowhere to be found. Custom models that cost north of $900 were snapped up by eager enthusiasts or, more likely, scalpers shortly afterward. The RTX 5070 Ti beats its predecessor in 4K gaming by around 25% per our testing. This isn't much in the grand scheme of things. For context, the RTX 4070 Ti led the RTX 3070 Ti by over 60%.

Supply for the RTX 5070 Ti isn't as bad as high-end Blackwell at OCUK, given that pre-orders are still up and running. Potential customers have been warned of long waiting times, possibly up to six weeks (early April). A handful of RTX 5080 units are arriving weekly, with orders expected to be fulfilled in around three weeks, which is an improvement over last time. Still, the reseller has no RTX 50-series GPU in stock, which is a shame and could end up proving troublesome for Nvidia with RDNA 4 launching early next month.

Stock Update: RTX 5070 Ti sold out and pre-orders open. RTX 5080 sold out but limited stock arriving weekly. RTX 5090 sold out and pre-orders ceased. Stock ETAs are as follows: RTX 5070 Ti ETA: 2-6 Weeks. RTX 5080 ETA: 1-3 Weeks. RTX 5090 ETA: 2-14 Weeks. Pre-orders are being… February 20, 2025

Leakers have claimed that Nvidia is repurposing data-center-tailored GB200 wafers for the RTX 5090, expected to improve availability in around one month. We cannot verify the authenticity of this claim, especially given OCUK's up to 14-week ETA for the flagship card. You can probably imagine how the RTX 5070 will fare at launch, but let's not get ahead of ourselves.

AMD's updated nomenclature positions the Radeon RX 9070 XT as a direct competitor to the RTX 5070 Ti. With these GPUs retailing in early March, Nvidia has roughly three weeks to get its supply chain issues sorted out. This might lead AMD to set an otherwise high price tag for its GPUs. Let's hope that doesn't come to fruition lest AMD should jeopardize its market share ambitions this generation.

DeepSeek GPU smuggling probe shows Nvidia's Singapore GPU sales are 28% of its revenue, but only 1% are delivered to the country: Report

Jowi Morales — Tue, 18 Feb 2025 15:48:02 +0000

A senior government official in Singapore said that only a small fraction of the Nvidia products billed to the country are actually delivered there. Bloomberg said that Singapore's Second Minister for Trade and Industry, Tan See Leng, made this statement as Washington is investigating whether the firm behind DeepSeek used banned Nvidia GPUs smuggled via the island state.

“The physical delivery of products sold by Nvidia to Singapore represent less than 1% of Nvidia’s overall revenue,” Tan said. He then added, “It is common practice for global entities to centralize the billing for procured goods and services in their hubs, but this is separate from where the products are shipped to so far from our checks.” This is despite reports saying Singapore accounts for nearly 28% of Nvidia’s revenue for 2024.

$NVDA's Singapore revenue in the last quarter was $7.7B (+185% YoY), more than half of its U.S. revenue. Let's not pretend the U.S. is the only region with chip access. pic.twitter.com/WhgLf84v9d January 28, 2025

That means a company based in Singapore could order chips from Nvidia using its Singapore billing address but have them delivered to another country. Tan said this business strategy isn’t new and that many multinational companies do the same thing: if you’re operating in different countries, it’s sometimes more cost-effective to bill everything through the headquarters address and then have the items shipped directly to where they’re needed.

In fact, Nvidia itself has long said [PDF], "Revenue by geographic area is based upon the billing location of the customer. The end customer and shipping location may be different from our customer’s billing location. For example, most shipments associated with Singapore revenue were to locations other than Singapore and shipments to Singapore were insignificant."

However, Singapore is closely tied to China — especially in business. This is especially true in the tech sector, where many Chinese companies have set up key offices on the island. For example, TikTok, which Chinese tech giant ByteDance owns, has its headquarters in the country, and its CEO is also Singaporean. Despite that, the country also considers the U.S. to be a key strategic partner, both in trade and politics, with the two countries’ militaries even allowed to use each other’s facilities on the island and in Guam.

The country has to carefully balance its relationship with China and the United States, especially as the countries are currently engaged in a trade war with various bans and sanctions taking effect in recent years. Singapore likely doesn’t want to be put on Washington’s entity list, especially as it considers itself a business-friendly country, and getting on that list means it will have several limitations put on it, especially in the tech space. Because of this, Tan said that the Singapore government is working closely with U.S. authorities to investigate this discrepancy and that the country does not condone any business using their Singaporean address to get around export controls set by other countries.

Elon Musk's Grok 3 is now available, beats ChatGPT in some benchmarks — LLM took 10x more compute to train versus Grok 2

Jowi Morales — Tue, 18 Feb 2025 13:45:00 +0000

Elon Musk just launched Grok 3, the latest version of xAI’s LLM that was trained at the Colossus Supercluster in Memphis, Tennessee using 100,000 Nvidia H100 GPUs. He had previously said, about a week ago, that its full release was imminent and claimed that it would outperform its rivals. Today he launched the AI model via a live stream on X (formerly Twitter) showcasing impressive performance benchmark results.

Early Grok-3 benchmarks show it dominating the field. pic.twitter.com/KXubPhaA5x February 18, 2025

Musk began the presentation by saying “The mission of xAI and Grok is to understand the universe,” and explaining that he wants to answer questions like, “What’s going on? Where are the aliens? What is the meaning of life? How does the universe end? How did it start?” He added, “Of course, that’s to be a maximally truth-seeking AI even if that truth is sometimes at odds with what is politically correct.”

https://t.co/hEfQ31gANQ February 18, 2025

After speaking about his goals with AI, Musk proclaimed that Grok 3 is an order of magnitude more capable than Grok 2, and that it was trained in a very short period. This was likely possible because of the massive number of GPUs xAI used for parallelized training, which also took just 19 days to set up. That is a record time, especially since Nvidia CEO Jensen Huang said that such a buildout usually takes four years.

Grok 3 isn’t just a single LLM, though. Instead, it’s a family of several models, with the first ones launched being Grok 3 and Grok 3 mini. xAI also showed off Grok 3 Reasoning and Grok 3 mini Reasoning, which are similar to OpenAI's o3-mini and DeepSeek's R1 models and will solve problems through a step-by-step logical process.

(Image credit: xAI)

Benchmarks shown by the xAI team reveal the Grok-3 and Grok-3 mini models outperforming their competition, including Gemini-2 Pro, DeepSeek-V3, Claude 3.5 Sonnet, and GPT-4o, in several tests, including Math (AIME), Science (GPQA), and Coding (LCB). The reasoning models, which are accessible via the Grok app, also outperform the competition in the same benchmarks. Aside from this, the Grok app will have a new feature called DeepSearch, which scours the internet when questioned and then distills all the information into a single answer.

Other experts were given access to Grok 3 in advance and were able to test these claims. For example, former Tesla Director of AI and OpenAI co-founder Andrej Karpathy shared his test results on X, saying that Grok 3 + Thinking feels similar to OpenAI’s o1-pro model while being a bit better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. This is quite a feat, especially since OpenAI and Google have had a massive head start over xAI.

I was given early access to Grok 3 earlier today, making me I think one of the first few who could run a quick vibe check. Thinking ✅ First, Grok 3 clearly has an around state of the art thinking model ("Think" button) and did great out of the box on my Settler's of Catan… pic.twitter.com/qIrUAN1IfD (February 18, 2025)

Grok 3 will be available to X Premium+ subscribers first. However, those who want to access more advanced features will need to sign up for SuperGrok, which is rumored to cost around $30 a month or $300 annually.

Chinese AI model DeepSeek is being integrated into smart TVs — Skyworth G7F Pro understands local dialects and generates multimedia content

editors@tomshardware.com (Jowi Morales) — Mon, 17 Feb 2025 15:31:34 +0000

Chinese appliance manufacturer Skyworth (machine translated) announced it’s releasing a new smart TV equipped with DeepSeek.

The Chinese-developed LLM DeepSeek made waves when it was introduced because of its relatively low cost of training. The shock was large enough that AI chip maker Nvidia lost over half a trillion dollars in market value because people thought companies would no longer buy its overpowered chips to train the next generation of AI. These advancements have led many companies to incorporate DeepSeek into their products; this includes Microsoft, which said that its Snapdragon X Copilot+ PCs will soon get local DeepSeek R1 support. Likewise, Skyworth is integrating DeepSeek into its next generation of smart TVs.

Skyworth said that its Kukai AI OS will get access to the DeepSeek R1 Inference open-source model, which will allow it to support dialect recognition and fuzzy semantic understanding. This means users can talk to their TVs in their local dialects, which is quite important in China, given that it has around seven to ten main language groups, each with several dialects. Fuzzy semantic understanding will also help the TV recognize context and account for vagueness and ambiguity when processing language data.

In addition, the Skyworth G7F Pro’s AI is designed for various scenarios. The company said it can generate ambient music and support interactive painting for entertainment, serve as an “AI oral spanning partner,” create picture books for education, and create travel plans for the entire family, including itinerary bookings and schedule reminders.

It’s unclear if the TV needs to be connected to the internet for its AI model to access these functions. However, given that DeepSeek is relatively lightweight, it’s not impossible for a smart TV to run it locally if it has a chip with an embedded NPU powerful enough to handle it. Even if the AI functions are handled off the device on a server, that wouldn’t be taxing for the company, since DeepSeek can be run relatively affordably, as a UC Berkeley research team has shown.

AMD's beastly 'Strix Halo' Ryzen AI Max+ matches the RTX 4060 laptop in leaked 3DMark tests

editors@tomshardware.com (Hassam Nasir) — Thu, 13 Feb 2025 13:50:36 +0000

A leaked 3DMark Time Spy result shows the Radeon 8060S, the integrated GPU of AMD's Ryzen AI 300 "Strix Halo" flagship, matching Nvidia's dedicated RTX 4060 mobile GPU. A Chinese user at Baidu (via HXL) shared a couple of screenshots with what appears to be the Ryzen AI Max+ 395 flexing its muscles, beating AMD's latest Radeon 890M iGPU by roughly 2.7x. Since the tested sample is based on early engineering silicon, there is likely still some room for improvement. However, this leak should be viewed cautiously, as the CPU OPN code and the integrated GPU don't align.

AMD extended its Ryzen AI 300 lineup with mainstream Krackan Point and flagship Strix Halo APUs last month at CES. Strix Halo, or Ryzen AI Max+, is a one-of-a-kind processor delivering (up to) 16 Zen 5 CPU cores bundled with 40 RDNA 3.5 Compute Units for workstation-grade laptops and high-end mini-PCs. Bear in mind, all this power is packaged on a single chip, featuring two CCDs and a massive I/O die beneath, bordered by (up to) 128GB of fast unified memory. For context, AMD's marketing material positions the Radeon 8060S (the subject of this article) as an equivalent to Nvidia's RTX 4070 laptop dGPU.

It's kind of pointless to compare laptops with different TGPs and thermal designs, so it's best not to read too much into these results. For the sake of comparison, we'll look over the average Time Spy scores of several relevant GPUs, obtained via 3DMark's score explorer feature. Another screenshot shows that the laptop or mini-PC in question features 128GB (16GBx8) of LPDDR5-8532 memory, with 96GB allocated to the iGPU. Both screenshots inaccurately label the iGPU as the Radeon 8050S; however, the OPN code reveals it's actually the Radeon 8060S with a 40 CU configuration. The mislabeling is probably due to the early state of the silicon.

GPU	Time Spy Score	Type	vs Radeon 8060S
Radeon 8060S (Add Salt)	10106	Integrated	100.00%
Radeon 890M	3705	Integrated	36.66%
Radeon 880M	3568	Integrated	35.31%
Radeon RX 7700S	10218	Dedicated	101.11%
RTX 4070 Laptop	12517	Dedicated	123.86%
RTX 4060 Laptop	10549	Dedicated	104.38%

In 3DMark's Time Spy benchmark, the Radeon 8060S scores 10,106 points, almost matching Nvidia's RTX 4060 laptop GPU and AMD's own RX 7700S. Against the Radeon 890M seen on Strix Point, the 8060S lands ahead by a gigantic 2.7x, though that was expected given the large difference in shader counts. Still, it trails the RTX 4070 laptop GPU by almost 20%, which is disappointing, but you should wait for independent reviews to see how these Strix Halo APUs perform in real-world scenarios.
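The percentages in the table and the ratios quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the scores are copied from the table, while the function and variable names are our own and not part of any 3DMark tooling.

```python
# Time Spy scores copied from the table above.
scores = {
    "Radeon 8060S": 10106,
    "Radeon 890M": 3705,
    "Radeon 880M": 3568,
    "Radeon RX 7700S": 10218,
    "RTX 4070 Laptop": 12517,
    "RTX 4060 Laptop": 10549,
}

BASELINE = "Radeon 8060S"


def relative_to_baseline(scores, baseline=BASELINE):
    """Return each GPU's score as a percentage of the baseline GPU's score."""
    base = scores[baseline]
    return {gpu: round(100 * score / base, 2) for gpu, score in scores.items()}


relative = relative_to_baseline(scores)
for gpu, pct in relative.items():
    print(f"{gpu:<18}{scores[gpu]:>7}{pct:>9.2f}%")

# How far ahead the 8060S is of the 890M, and how far it trails the RTX 4070 Laptop.
lead_over_890m = scores[BASELINE] / scores["Radeon 890M"]
deficit_vs_4070 = 1 - scores[BASELINE] / scores["RTX 4070 Laptop"]
print(f"8060S vs 890M: {lead_over_890m:.2f}x")
print(f"8060S vs RTX 4070 Laptop: {deficit_vs_4070:.1%} lower")
```

Note that the "vs Radeon 8060S" column treats the 8060S as 100%, so the RTX 4070 Laptop's 123.86% and the 8060S trailing it by just under 20% describe the same gap from two different baselines.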

You should see laptops and workstations equipped with these processors from partners across Q1 and Q2 this year, which is a rather vague timeframe. HP is readying the ZBook Ultra G1a workstation laptop and the HP Z2 Mini G1a mini-PC, while Asus has announced the ROG Flow Z13, with no definite release date provided for any system.

AMD kills 'Golden Rabbit Edition' GRE branding, renames it 'Great Radeon Edition'

Sunny Grimm — Tue, 11 Feb 2025 18:12:52 +0000

AMD’s upcoming RX 7650 GRE graphics card for the Chinese market has a new name for fans of redundancy. The “GRE” badge now officially stands for “Great Radeon Edition” per Chinese news hub NetEase. To use its full name, the AMD Radeon RX 7650 GRE (Great Radeon Edition) is expected to launch in China this month.

AMD debuted the “GRE” badge for its China-specific GPU SKUs in 2023, starting with the RX 7900 GRE. The “Golden Rabbit Edition” badge coincided with the traditional Year of the Rabbit, seeking to better connect with Chinese gamers. The GRE badge has since made it to the Western market, with the RX 7900 GRE launching in the US in 2024 and competing with the best graphics cards.

Thanks to its price-for-performance margins, the 7900 GRE is a seriously compelling card, though AMD’s other GRE release, the RX 6750 GRE, was more disappointing. As a rebrand of Navi 22 to dump oversupply, the 6750 GRE became a midrange contender in China but not one that moved impressively. The upcoming RX 7650 GRE is not expected to be as exciting or dull as its GRE predecessors and is likely to stay in China for the duration of its lifespan.

The RX 7650 GRE is equipped with AMD’s Navi 33 chip, the same used in the RX 7600 and 7600 XT cards. The 7650 GRE does feel like a direct middle-man for the 7600 and 7600 XT, inheriting their 32 compute units and 128-bit memory bus. With a boost clock of 2,695MHz over the 7600’s 2,625MHz and a 5W TDP gain, the GRE feels like a better-tuned 7600. AMD elected to keep the 7650 GRE at 8GB of VRAM rather than stepping it up to the XT's 16GB.

AMD’s choice to keep the GRE badge around even as the Year of the Snake begins is a good sign for AMD’s fans in the Chinese market. As clunky as “Great Radeon” feels, the Chinese market now has a dedicated title for its unique releases. GRE cards seem to be succeeding; the 6750 GRE has seen new variants as recently as November as AMD’s premier midrange option in the country even a year later. Of course, whether you read this as AMD supporting a beloved SKU or a last-ditch effort to clear out four-year-old Navi 22 inventory is up to you.

The RX 7650 GRE will launch for 2,049 Yuan, roughly $280. This is no discount over the RX 7600, which holds a similar MSRP in China. Breaking the GRE tradition of offering a discount over previously launched similar cards, the 7650 GRE likely seeks instead to become a 7600 XT replacement. We would be surprised if the 7650 GRE sees American shores, though the 7900 GRE going worldwide last year was also a shock.

How to Run DeepSeek R1 on your Raspberry Pi 5

Les Pounder — Thu, 06 Feb 2025 13:00:00 +0000

You can’t have missed the seismic event that saw Nvidia lose $589 billion in market cap as confidence in AI took a hit after DeepSeek claimed that its open source R1 model could rival the performance of OpenAI’s o1 model while using 11x less compute for training. The fallout from this is still being debated, but it has certainly put the cat amongst the pigeons.

(Image credit: Tom's Hardware)

Before we delve too deeply into this how-to, let's manage expectations. Yes, you can run DeepSeek on your Raspberry Pi, but it is CPU bound, so don’t expect your queries to complete in a few seconds. No official AI accelerator HAT or add-on will currently accelerate the model. The only way to speed things up is to connect a GPU to the Raspberry Pi 5’s PCIe connector, likely using one of Pineboard’s Hat UPCIty Lite boards and an external power supply.

This means that the Raspberry Pi 5 is at a disadvantage to my desktop PC which has an Nvidia RTX 4070 GPU. When ollama runs, it checks for a GPU and if found, it will use it. So my RTX 4070 is doing all of the work.

The Test

Running the R1:8b locally, I wanted a simple test, and the first thing that came to my mind was writing some Python code. The prompt being:

“write a python script to ask the user their name, save it to a variable called username, and then greet the user by their name 100 times.”

(Image credit: Tom's Hardware)

How would I normally tackle this? Three lines of Python code: one to capture the user input to a variable, then two lines to create a for loop that prints the personalized greeting. It's basic beginner Python that I have taught to hundreds of students, so how would an AI tackle it?

username = input("What is your name?: ")
for i in range(100):
    print("Hello", username)

Test Machine	Specifications	Time Taken
Raspberry Pi 5	8GB LPDDR4X RAM, Broadcom BCM2712 2.4GHz quad-core 64-bit Arm Cortex-A76 CPU	8:01.08 minutes
AMD Ryzen 5 5600X	32GB DDR4 RAM AMD Ryzen 5 5600X hexa-core 3.7 / 4.6 GHz CPU Nvidia RTX 4070 GPU	16.12 seconds

The Raspberry Pi 5’s code was as follows

username = input().strip()
for _ in range(100):
    print(f"Hello, {username}")

Capturing the user input and then sanitizing it before assigning it to the variable is a smart move. The strip() call removes any leading or trailing white space from the captured string. Printing the greeting using f-strings is a more recent means to format the output. That’s a bit redundant in this scenario, but I would be pleased to see a student try this approach. My gripe with this code is that there is no prompt for the user to type, hence in the video there is a short delay. Thankfully the PC did not replicate this issue.

On the PC, DeepSeek produced this code.

username = input("Enter your name: ")
for _ in range(100):
    print(f"Hello, {username}!")

The user input is captured and saved to the variable, and we have an input prompt for the user to respond to. The rest is the same as on the Raspberry Pi 5, with just an extra “!” to emphasize the greeting.
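For readers who want to see the two ideas side by side, here is a minimal sketch that combines the PC's input prompt with the Pi 5's whitespace stripping. It is our own illustration rather than output from either model: the helper names and the sample name "Ada" are hypothetical, and a hard-coded string stands in for keyboard input so the example runs without waiting for a user.

```python
def clean_name(raw):
    """Remove leading and trailing whitespace, as str.strip() does."""
    return raw.strip()


def build_greetings(username, count=100):
    """Build the greeting lines with an f-string, one per repetition."""
    return [f"Hello, {username}!" for _ in range(count)]


# Simulated keyboard input with stray whitespace. In an interactive script
# this value would come from input("Enter your name: ") instead.
raw = "  Ada \n"
username = clean_name(raw)
lines = build_greetings(username)

print(lines[0])
print(len(lines), "greetings generated")
```

Swapping the hard-coded string for input("Enter your name: ") and printing every line in a loop gives the interactive version, with both the prompt and the sanitized name.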

(Image credit: Tom's Hardware)

You can’t miss the time difference between the PC and the Pi 5. All of this was offline, relying on the model and the CPU / GPU of the device it is being run on. The PC did everything in 16 seconds, but the Pi 5 took 8 minutes! Heck, the PC was done while the Pi 5 was still loading the model. Still, running an LLM on a Raspberry Pi 5 is an interesting experiment and worth spending a little time on, so let's install one onto a Raspberry Pi 5 8GB. Note that the Raspberry Pi 5 8GB is really the lowest spec Pi 5 that we would attempt this on. You could try a 4GB Pi 5 with a tweaked model, but your mileage will vary!
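To put the gap between the two machines into a single number, the timings from the table can be converted to seconds and divided. The sketch below assumes that "8:01.08 minutes" means 8 minutes and 1.08 seconds; the helper name is our own.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes-and-seconds timing into seconds."""
    return minutes * 60 + seconds


# Timings from the table: 8:01.08 is read here as 8 min 1.08 s (an assumption).
pi5_seconds = to_seconds(8, 1.08)
pc_seconds = 16.12

ratio = pi5_seconds / pc_seconds
print(f"Pi 5: {pi5_seconds:.2f} s, PC: {pc_seconds:.2f} s, ratio: {ratio:.1f}x")
```

On these figures the Pi 5 took roughly 30 times longer than the GPU-equipped PC to answer the same prompt.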

Setting up DeepSeek on the Raspberry Pi 5 via ollama

To make things easier, we’ll be setting up DeepSeek via ollama, a free and open source tool that enables anyone to run large language models (LLMs) on their own machines.

The model that we will be using is a distilled Llama model which fits into the 8GB of RAM afforded by our Raspberry Pi 5.

The ollama team states that “DeepSeek team has demonstrated that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models.” Why are we using this model and not a “true” DeepSeek model? Simply because the deepseek-r1:671b model is 404GB in size, and it would clearly overwhelm the Raspberry Pi 5.
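A quick back-of-the-envelope calculation shows how far out of reach the full model is. This sketch uses only the figures quoted above and assumes that the "671b" tag means 671 billion parameters and that the sizes are decimal gigabytes; it also ignores the extra memory a running model needs beyond its weights.

```python
MODEL_SIZE_GB = 404        # deepseek-r1:671b size quoted above
PARAMETERS_BILLION = 671   # assumed from the "671b" model tag
PI_RAM_GB = 8              # Raspberry Pi 5 used in this how-to

# Gigabytes and billions of parameters share the same 10^9 factor, so it
# cancels out and only the bytes-to-bits conversion remains.
bits_per_parameter = MODEL_SIZE_GB * 8 / PARAMETERS_BILLION
times_larger_than_ram = MODEL_SIZE_GB / PI_RAM_GB

print(f"About {bits_per_parameter:.1f} bits per parameter")
print(f"{times_larger_than_ram:.1f}x the Pi 5's total RAM")
```

Under these assumptions the full model works out to a little under five bits per parameter and is about 50 times larger than the Pi's entire memory, which is why a distilled 8b model is the practical choice.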

Installation on the Raspberry Pi is a breeze thanks to ollama’s script.

1. Open a terminal and ensure that your Raspberry Pi 5 is running the latest software.

sudo apt update
sudo apt upgrade -y

2. Download and run the ollama install script. Normally, installing software using a script from the Internet is a major no-no. We would never do this in a production environment. If you are curious, the install.sh can be saved to a file and the contents read before use.

curl -fsSL https://ollama.com/install.sh | sh

3. Check the version number. Ours was 0.5.7 but yours may differ given the fast pace of LLM development. It is always handy to know what version number you have installed, should you need to log any issues or search for specific guidelines.

ollama --version

4. Download and run DeepSeek-r1:8b. This is a distilled Llama model which fits into the 8GB of RAM afforded by our Raspberry Pi 5.

ollama run deepseek-r1:8b

5. Wait for the download and install to finish. This can take some time at first, but subsequent loads should be much faster.

6. The user interface is simple: just type in a request / query and the LLM will interpret and respond. Slowly.

7. When you are done, you can either press CTRL + D or type /bye and press Enter to close the session.
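Once the model is installed, it can also be queried from a script instead of the interactive prompt. The sketch below goes beyond the steps above and rests on assumptions: it relies on ollama serving its local REST API at the default address http://localhost:11434 and uses the /api/generate endpoint with streaming turned off. The function names and the generous timeout are our own choices, the latter picked because the Pi 5 took around eight minutes in our test.

```python
import json
import urllib.request

# Assumed default address of the local ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="deepseek-r1:8b"):
    """Build a POST request asking ollama for one complete, non-streamed reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="deepseek-r1:8b", timeout=900):
    """Send the prompt to the local server and return the generated text.

    Needs a running ollama server with the model already downloaded.
    """
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]
```

With ollama running, calling print(ask("write a python script to greet the user 100 times")) should return the same kind of answer as the interactive session, just as slowly on the Pi.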

DeepSeek on the Raspberry Pi 5 is purely CPU bound. It cannot be used with any of the AI accelerator boards. If you have the knowledge and the equipment, it can be used with a GPU via the PCIe connector on the Raspberry Pi 5. We were unable to test this due to a lack of equipment, but the ever fearless Jeff Geerling is sure to test this in the near future.