Latest from Tom's Hardware UK in Cuda

Nvidia's CUDA Tile examined: AI giant releases programming style for Rubin, Feynman, and beyond — tensor-native execution model lays the foundation for Blackwell and beyond

ashilov@gmail.com (Anton Shilov) — Wed, 31 Dec 2025 14:32:45 +0000

This month, Nvidia rolled out what might be one of the most important updates for its CUDA GPU software stack in years. The new CUDA 13.1 release introduces the CUDA Tile programming path, which elevates kernel development above the single-instruction, multiple-thread (SIMT) execution model, and aligns it with the tensor-heavy execution model of Blackwell-class processors and their successors.

By shifting to structured data blocks, or tiles, Nvidia is changing how developers design GPU workloads, setting the stage for next-generation architectures that will incorporate more specialized compute accelerators and therefore depend less on thread-level parallelism.

SIMT vs. Tiles

Before proceeding, it is worth clarifying that the fundamental difference between the traditional CUDA programming model and the new CUDA Tile is not in capabilities, but in what programmers control. In the original CUDA model, programming is based on SIMT (single-instruction, multiple-thread) execution. The developer explicitly decomposes the problem into threads and thread blocks, chooses grid and block dimensions, manages synchronization, and carefully designs memory access patterns to match the GPU's architecture. Performance depends heavily on low-level decisions such as warp usage, shared-memory tiling, register usage, and the explicit use of tensor-core instructions or libraries. In short, the programmer controls how the computation is executed on the hardware.

(Image credit: Nvidia)

CUDA Tile shifts programming to a tile-centric abstraction. The developer describes computations in terms of operations on tiles — structured blocks of data such as submatrices — without specifying threads, warps, or execution order. Then the compiler and runtime automatically map those tile operations onto threads, tensor cores, tensor memory accelerators (TMA), and the GPU memory hierarchy. This means the programmer focuses on what computation should happen to the data, while CUDA determines how it runs efficiently on the hardware, which ensures performance scalability across GPU generations, starting with Blackwell and extending to future architectures.
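
To make the contrast concrete, here is a minimal CPU-only sketch in plain Python. It is not CUDA or cuTile code: the function names (`simt_style_add`, `load_tile`, `tile_style_add`) and the tile size are illustrative assumptions. The first function mimics the SIMT habit of deriving one element index per thread from block and thread indices; the second states the same work as operations on whole tiles and says nothing about how elements inside a tile are scheduled.

```python
def simt_style_add(a, b, block_dim=4):
    """Thread-centric: the programmer derives one element index per 'thread'."""
    n = len(a)
    out = [0.0] * n
    grid_dim = (n + block_dim - 1) // block_dim  # number of blocks, rounded up
    for block_idx in range(grid_dim):            # on a GPU these iterations run in parallel
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx
            if i < n:                            # the bounds check is the programmer's job
                out[i] = a[i] + b[i]
    return out


def load_tile(array, tile_index, tile_size):
    """Return one tile (a contiguous slice); the last tile may be shorter."""
    start = tile_index * tile_size
    return array[start:start + tile_size]


def tile_style_add(a, b, tile_size=4):
    """Tile-centric: the programmer states what happens to each tile."""
    out = []
    num_tiles = (len(a) + tile_size - 1) // tile_size
    for t in range(num_tiles):
        a_tile = load_tile(a, t, tile_size)
        b_tile = load_tile(b, t, tile_size)
        # The mapping of elements inside a tile onto hardware is left unspecified.
        out.extend(x + y for x, y in zip(a_tile, b_tile))
    return out


a = [float(i) for i in range(10)]
b = [10.0 * i for i in range(10)]
assert simt_style_add(a, b) == tile_style_add(a, b)
```

Both functions return the same values; the difference is only in which details the author of the kernel has to spell out.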

A strategic pivot in the CUDA Model

But why introduce such significant changes at the CUDA level? There are several motives behind the move, rooted in drastic architectural changes in GPUs and in the way modern GPU workloads operate. Firstly, AI, simulation, and technical computing no longer revolve around scalar operations: they rely on dense tensor math. Secondly, Nvidia's recent hardware has followed the same trajectory, integrating tensor cores and TMAs as core architectural enhancements. Thirdly, both tensor cores and TMAs differ significantly between architectures.

(Image credit: Nvidia)

From Turing (one of the first GPU architectures to incorporate tensor cores as assisting units) to Blackwell (where tensor cores became the primary compute engines), Nvidia has repeatedly reworked how tensor engines are scheduled, how data is staged and moved, and how much of the execution pipeline is managed by warps and threads versus dedicated hardware. With Turing, tensor cores executed warp-issued matrix instructions, but with Blackwell, things shifted to tile-native execution pipelines with autonomous memory engines, fundamentally reducing the role of traditional SIMT controls.

As a result, as tensor hardware has scaled aggressively, the lack of uniformity across generations has made low-level tuning at the warp and thread levels impractical, so Nvidia had to elevate CUDA toward higher-level abstractions that describe intent at the tile level rather than at the thread level, leaving the optimizations to compilers and runtimes. One bonus of this approach is that Nvidia can extract performance gains across virtually all workloads throughout the active life cycle of its GPU architectures.

Note that Nvidia does not abandon SIMT paths with NVVM/LLVM and PTX altogether; when developers need them, they can still write SIMT kernels. However, when they need to use tensor cores, they must write tile kernels.

CUDA Tile: How does it work?

At the center of this new CUDA Tile stack sits CUDA Tile IR, a virtual instruction set that plays the same role for tile workloads that parallel thread execution (PTX) plays for SIMT kernels. In the traditional CUDA stack, PTX serves as a portable abstraction for thread-oriented programs that ensures that SIMT kernels persist across GPU generations. CUDA Tile IR is designed to provide that same long-term stability for tile-based computations: it defines tile blocks, their relationships, and the operations that transform them, but hides execution details that can change from one GPU family to another.

This virtual ISA also becomes the target for compilers, frameworks, and domain-specific languages that want to exploit tile-level semantics. Tool builders who previously generated PTX for SIMT can now create parallel backends that emit Tile IR for tensor-oriented workloads. The runtime takes Tile IR as input and assigns work to hardware pipelines, tensor engines, and memory systems in a way that maximizes performance without exposing device-level variability to the programmer.

Profile generated from Nsight Compute, showing the tile statistics for the vector_add kernel (Image credit: Nvidia)

In addition to Tile IR itself, CUDA 13.1 introduces another key component to bring CUDA Tile to life: cuTile Python, a domain-specific language that allows developers to author array- and tile-oriented kernels directly in Python.

For now, development efforts are focused primarily on AI-centric algorithms, but Nvidia plans to expand functionality, features, and performance over time, as well as to introduce a C++ implementation in upcoming releases. Tile programming itself is certainly not limited to artificial intelligence and is designed as a general-purpose abstraction. As Nvidia's CUDA Tile evolves, it can be applied to a wide range of applications, including scientific simulations (on architectures that support the required precision), signal and image/video processing, and many HPC workloads that decompose problems into block-based computations.

In its initial release, CUDA Tile support is limited to Blackwell-class GPUs with compute capabilities 10.x and 12.x. Future releases will bring support for 'more architectures,' though it is unclear whether that means previous-generation Hopper or next-generation Rubin.
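
As a small illustration of that support window, the sketch below checks a compute-capability string against the two major versions named above. The helper name `supports_cuda_tile` and the string input format are hypothetical; a real application would query the device through the CUDA driver or runtime rather than hard-code strings, and the supported set is expected to change in later releases.

```python
# Major compute-capability versions listed for the initial CUDA Tile release
# (10.x and 12.x, per the article). This set is an assumption that will age.
SUPPORTED_MAJOR_VERSIONS = {10, 12}


def supports_cuda_tile(compute_capability):
    """Return True if a 'major.minor' string falls in the initially supported set."""
    major_text, _, _minor_text = compute_capability.partition(".")
    return int(major_text) in SUPPORTED_MAJOR_VERSIONS


for cc in ("9.0", "10.0", "12.0"):
    print(cc, supports_cuda_tile(cc))
```

Hopper-class parts report compute capability 9.0, so they fall outside the initial set in this sketch.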

Setting the stage for Rubin, Feynman, and beyond

With CUDA Tile, Nvidia is reorganizing the CUDA software model around the tensor-based execution patterns that dominate modern workloads. CUDA Tile will coexist with the proven, traditional SIMT model, as not all workloads use tensor math extensively, though the direction of industry development is more or less clear, and Nvidia's focus will follow it.

Nvidia's CUDA Tile IR provides an abstraction that enables the architectural stability needed for future generations of tensor-focused hardware, while cuTile Python (and similar languages), as well as enhanced tools, offer practical paths for developers to transition from SIMT-heavy workflows.

Combined with expanded partitioning features, math-library optimizations, and improved debugging tools, CUDA 13.1 marks a major milestone in Nvidia’s long-term strategy: abstracting away hardware complexity and enabling seamless performance scalability across each GPU generation.

White House U-turn on Nvidia H200 AI accelerator exports down to Huawei's powerful new Ascend chips, report claims — U.S. committed to 'dominance of the American tech stack'

sayem.ahmed@futurenet.com (Sayem Ahmed) — Wed, 10 Dec 2025 13:09:04 +0000

The U.S. can now export Nvidia's H200 AI accelerator to China, with a 25% fee attached. Following the authorization of the chips, a new Bloomberg report suggests that the decision was made to ensure American tech dominance globally. Reportedly, a major factor in the decision was Huawei's recent advancements with its CloudMatrix 384 and Ascend 910C systems, which are on par with both the H200 and Nvidia's GB200 NVL72.

This decision would enable China to continue to access Nvidia's CUDA-based AI accelerators, as many AI systems still rely on that particular software stack. While China is attempting to standardize a software stack of its own, the open-source CANN, it has been noted that Nvidia chips are preferable for companies such as DeepSeek for training advanced AI models.

According to sources speaking to Bloomberg, multiple scenarios were considered, ranging from flooding the market to "overwhelm" Huawei to exporting no AI accelerators at all, which would mark a dramatic shift if the previously approved Nvidia H20 (a cut-down Hopper-generation GPU) were to be affected. Ultimately, the decision rests somewhere in the middle.

China won't get access to Nvidia's latest Blackwell architecture, but it will retain access to the full-fat H200. The White House likely hopes that this move keeps the latest Nvidia chips out of China's hands, while also keeping the country locked into Nvidia's carefully crafted CUDA-shaped moat. "The Trump administration is committed to ensuring the dominance of the American tech stack - without compromising on national security," said White House spokesman Kush Desai, in a statement to Bloomberg.

White House officials are reported to have reviewed the performance of Huawei's AI accelerator ecosystem, in particular, the CloudMatrix 384 system, which utilizes 384 Ascend 910C chips. The CloudMatrix 384 is positioned directly against Nvidia's (export-controlled) GB200 platform, but with obvious tradeoffs in performance and efficiency.

(Image credit: Huawei)

While Bloomberg notes recent rumblings that Huawei is preparing to up its 910C chip production to 600,000 units next year, the report claims U.S. officials concluded "that Huawei would be capable in 2026 of producing a few million of its Ascend 910C accelerators," according to the insider.

The AI race is now bound by pure performance, and the Trump Administration clearly hopes that retaining its architectural advantage by restricting Blackwell will keep Western frontier AI models at the forefront of the industry.

Just last week, Nvidia CEO Jensen Huang said he was uncertain whether Chinese companies would even end up purchasing the H200. Huang has long been a vocal critic of export controls on AI GPUs, as Nvidia wrote off $5.5 billion in AI chips in April 2025. Whether the availability of H200 systems on the Chinese market will be enough to recover the shortfall remains to be seen.

Fragmented ecosystems and limited supply: Why China cannot break free from Nvidia hardware for AI

ashilov@gmail.com (Anton Shilov) — Mon, 18 Aug 2025 11:21:46 +0000

Last week saw major twists in China's AI landscape: Trump imposed a 15% sales tax on AMD and Nvidia hardware sold to China, Beijing froze new Nvidia H20 GPU purchases over security concerns, and DeepSeek dropped plans to train its R2 model on Huawei’s Ascend NPUs — raising doubts about China's ability to rely on domestic hardware for its AI sector.

As part of its recurring five-year strategic plans, China's long-stated goal has been to gain technological independence, particularly in new and emerging segments that it sees as key to its national security. However, after years of plowing billions into fab startups and its own nascent chip industry, the country still lags behind its Western counterparts and has struggled to build a truly insulated supply chain that can create AI accelerators. Additionally, the country lacks an effective software ecosystem to rival Nvidia's CUDA, creating even more challenges. Here's a closer look at how this is impacting the country's AI efforts.

China wants to rely on its own hardware

China has had a self-sufficiency plan for its semiconductor industry in general since the mid-2010s. Over time, as the U.S. imposed sanctions against the People's Republic's high-tech sectors, the plan evolved to address supercomputers (including those capable of AI workloads) and fab tools. In 2025, China has created several domestic AI accelerators, and Huawei has even managed to develop its rack-scale CloudMatrix 384.

(Image credit: Biren Technology)

However, ever since the AI Diffusion Rule was canned, and the incumbent Trump administration banned sales of AMD's Instinct MI308 and Nvidia's HGX H20 to Chinese entities, the PRC doubled down on its efforts to switch crucially important AI companies to using domestic hardware.

As a result, when the U.S. government announced plans to grant AMD and Nvidia export licenses to sell their China-specific AI accelerators to clients in the People's Republic, U.S. President Trump announced an unprecedented 15% sales tax on AMD's and Nvidia's hardware sold to China.

China's government then made shipments of Nvidia's HGX H20 hardware strategic, and instructed leading cloud service providers to halt new purchases of Nvidia’s H20 GPUs while it examines alleged security threats, a move that could potentially bolster demand for domestic hardware. This may be good news for companies like Biren Technology, Huawei, Enflame, and Moore Threads.

There's a twist in this tale, though: DeepSeek reportedly had to abandon training of its next-generation R2 model on Huawei's domestically developed Ascend platforms because of unstable performance, slower chip-to-chip connectivity, and limitations of Huawei's Compute Architecture for Neural Networks (CANN) software toolkit. This all raises the question: can China rely on its homegrown hardware for AI development?

Nvidia is dominating

Nvidia has been supplying high-performance AI GPUs fully supported by a stable and versatile CUDA software stack for a decade, so it's not surprising that many, if not all, of the major Chinese AI hyperscalers (Alibaba, Baidu, Tencent, and smaller players like DeepSeek) currently use Nvidia's hardware and software. Although Alibaba and Baidu develop their own AI accelerators (primarily for inference), they still procure tons of Nvidia's HGX H20 processors.

(Image credit: Nvidia)

SemiAnalysis estimated that Nvidia produced around a million HGX H20 processors last year, and almost all of them were purchased by Chinese entities. No other company in China supplied a comparable number of AI accelerators in 2024. However, analyst Lennart Heim believes that Huawei had managed to illegally obtain around three million Ascend 910B dies in 2024 from TSMC, which is enough to build around 1.4 – 1.5 million Ascend 910C chips in 2024 – 2025. This is comparable to what Nvidia supplied to China in the same period. However, while Huawei may have enough Ascend processors to train its Pangu AI models, it appears that other companies have other preferences.

DeepSeek trained the R1 model on a cluster of 50,000 Hopper-series GPUs. This consisted of 30,000 HGX H20s, 10,000 H800s, and 10,000 H100s. These chips were reportedly purchased by DeepSeek's investor, High-Flyer Capital Management. As a result, it's logical that the whole software stack of DeepSeek — arguably China's most influential AI software developer — is built around Nvidia's CUDA.

However, when the time came to assemble a supercluster to train DeepSeek's upcoming R2 model, the company was reportedly persuaded by the authorities to switch to Huawei's Ascend 910-series processors. After it encountered unstable performance, slower chip-to-chip connectivity, and limitations of Huawei's CANN software toolkit, it decided to switch back to Nvidia's hardware for training, but to use Ascend 910 AI accelerators for inference. As for the exact accelerators, we do not know whether DeepSeek used Huawei's latest CloudMatrix 384, based on the latest Ascend 910C, or something else.

Since DeepSeek has not disclosed these challenges officially, we can only rely on a report from the Financial Times. The publication claims that Huawei's Ascend platforms did not work well for DeepSeek. Why they were deemed to be unstable is another question. It's a distinct possibility that DeepSeek only began to work with CANN this spring, so the company simply has not had enough time to port its programs from Nvidia's CUDA to Huawei's CANN toolkit.

Steps in the right direction

It is extremely complicated to analyze high-tech industries in China, as companies tend to keep secrets closely guarded and fly under the U.S. government's radar. However, two important factors that may have a drastic effect on the development of AI hardware in China occurred this summer. Firstly, the Model-Chip Ecosystem Innovation Alliance was formed, and secondly, Huawei made its CANN software stack open source.

(Image credit: Moore Threads)

The Model-Chip Ecosystem Innovation Alliance includes Huawei, Biren Technology, Enflame, Moore Threads, and others. The group aims to build a fully localized AI stack, linking hardware, models, and infrastructure, which is a clear step away from Nvidia or any other foreign hardware. Its success depends on achieving interoperability through shared protocols and frameworks to reduce ecosystem fragmentation. While low-level software unification may be difficult due to varied architectures (e.g., Arm, PowerVR, custom ISAs), mid-level standardization is more realistic.

By aligning around common APIs and model formats, the group hopes to make models portable across domestic platforms. Developers could write code once (e.g., in PyTorch) and run it on any Chinese accelerator. This would strengthen software cohesion, simplify innovation, and help China build a globally competitive AI industry using its own hardware. There is also an alliance called the Shanghai General Chamber of Commerce AI Committee, which focuses on applying AI in real-world industries; this also unites hardware and software makers.

Either as part of its commitment to the new alliance, or as part of a general attempt to make its Ascend 910-series the platform of choice among China-based companies, Huawei open-sourced CANN, which is specifically optimized for AI and its Ascend hardware, in early August.

Until this summer, Huawei's AI toolkit for its Ascend NPUs was distributed in a restricted form. Developers had access to precompiled packages, runtime libraries, and bindings, which allowed TensorFlow, PyTorch, and MindSpore to run on the hardware. These pieces worked well enough to allow users to train and deploy models, but the underlying stack, such as compilers or libraries, remained closed.

CANN goes open-source

(Image credit: Huawei)

Now, this barrier has been removed. The company released the source code for the CANN toolchain; however, it did not formally confirm exactly what it has opened up, so we can only speculate. The list of opened-up technologies likely includes compilers that convert model instructions into commands that Ascend NPUs understand, low-level APIs, libraries of AI operators that accelerate core math functions, and a system-level runtime that manages memory, scheduling, and communication. None of this is officially confirmed; it is merely an educated guess as to what CANN's open-sourcing might enable.

By opening up CANN, Huawei can attract a broad community of developers from academia, startups, and other enterprises to its platform, and enable them to experiment with performance tuning or framework integration (beyond TensorFlow and PyTorch). This will inevitably speed up CANN's evolution and bug fixing. Eventually, these efforts could bring CANN closer to what CUDA offers, which would be another string to Huawei's bow.

For Huawei, opening up CANN ahead of other Model-Chip Alliance members was beneficial, as it already had the most mature AI hardware platform in production, and needed to position its Ascend platform as the baseline software ecosystem others could rely on. This move makes CANN the default foundation for domestic models and hardware developers (at least for now). By taking this first step, Huawei set a reference point for interoperability and signalled a commitment to shared standards, which could help reduce fragmentation in China's AI software stack.

What about hardware availability?

But while unification of the software stack is a step in the right direction, there is an elephant in the room regarding China's AI hardware self-reliance. The People's Republic still cannot produce hardware that is on par with AMD or Nvidia in volume domestically. The hardware that can be made in China is years behind the processors developed on U.S. soil.

(Image credit: Biren Technology)

All leading developers of AI accelerators in China, like Biren, Huawei, and Moore Threads, are on the U.S. Department of Commerce's Entity List. This means that they do not have access to the advanced fabrication capabilities of TSMC. To that end, they have to produce their chips at China-based SMIC, whose process technologies cannot match those offered by TSMC. While SMIC can produce chips on its 7nm-class fabrication process, Huawei had to obtain the vast majority of silicon for its Ascend 910B and Ascend 910C processors by deceiving TSMC. Companies like Biren or Moore Threads do not disclose which foundry they use, but they do not have the luxury of choice.

Of course, neither Huawei nor SMIC stands still. The two companies are working to advance China's semiconductor industry and build a local fab tools supply chain that will replace the leading-edge equipment that SMIC cannot acquire. Before this happens, SMIC is expected to start building chips on its 6nm-class process technology and even a 5nm-class production node, so it may well build advanced AI processors for Huawei and other players. But the big question is whether volumes will manage to meet the demands of AI training and inference, especially if Nvidia hardware is largely unobtainable in China.

China's Chicken and egg dilemma

Legendary GPU architect Raja Koduri's new startup leverages RISC-V and targets CUDA workloads — Oxmiq Labs supports running Python-based CUDA applications unmodified on non-Nvidia hardware

ashilov@gmail.com (Anton Shilov) — Tue, 05 Aug 2025 15:56:29 +0000

Raja Koduri, a legendary GPU architect formerly of ATI Technologies, AMD, Apple, and Intel, said on Tuesday that his new GPU startup had emerged from stealth mode. Oxmiq Labs is focused on developing GPU hardware and software IP and licensing them to interested parties. In fact, software may be the core part of Oxmiq's business, as it is designed to be compatible with third-party hardware.

Another RISC-V-based 'GPU' for AI

Oxmiq develops a vertically integrated platform that combines GPU hardware IP with a full-featured software stack aimed at AI, graphics, and multimodal workloads where explicitly parallel processing is beneficial. On the hardware side, Oxmiq offers a GPU IP core based on the RISC-V instruction set architecture (ISA) called OxCore, which integrates scalar, vector, and tensor compute engines in a single modular architecture and can support near-memory and in-memory compute capabilities.

Oxmiq also offers OxQuilt, a chiplet-based system-on-chip (SoC) builder that enables customers to create their own SoCs that integrate compute cluster bridge (CCB, which probably integrates OxCores), memory cluster bridge (MCB), and interconnect cluster bridge (ICB) modules based on specific workload requirements in a rapid and cost-efficient manner. For example, an inference AI accelerator for edge applications can pack a CCB and an ICB or two, an inference SoC requires more CCBs, MCBs, and ICBs, whereas a large-scale SoC for AI training can pack dozens of chiplets. Oxmiq does not disclose whether its OxQuilt enables building only multi-chiplet system-in-packages (SiP), or is designed to assemble monolithic processors too.

Software is the key

Oxmiq's software stack is perhaps an even more important product that the company has to offer. The software package is designed to abstract the complexity of heterogeneous hardware and enable deployment of AI and graphics workloads across a range of hardware platforms, not just those using the company's IP. The core of the software stack is OXCapsule, a unified runtime and scheduling layer that manages workload distribution, resource balancing, and hardware abstraction. The layer encapsulates applications into self-contained environments, which the company calls 'heterogeneous containers.' These containers are designed to operate independently of the underlying hardware, enabling developers to target CPUs, GPUs, and AI accelerators without modifying their codebase or dealing with low-level configuration.

A standout component of this stack is OXPython, a compatibility layer that translates CUDA-centric workloads into Oxmiq's runtime and allows Python-based CUDA applications to run unmodified on non-Nvidia hardware without recompilation. OXPython will first launch not on Oxmiq's IP, but on Tenstorrent's Wormhole and Blackhole AI accelerators. In fact, Oxmiq's software stack is fundamentally designed to be independent from Oxmiq hardware, and that is a core part of its strategy.

"We are excited to partner with Oxmiq on their OXPython software stack," said Jim Keller, CEO of Tenstorrent. "OXPython's ability to bring Python workloads for CUDA to AI platforms like Wormhole and Blackhole is great for developer portability and ecosystem expansion. It aligns with our goal of letting developers open and own their entire AI stack."

What about graphics?

Having developed graphics processors at S3 Graphics, ATI Technologies, AMD, Apple, and Intel, Raja Koduri is primarily known as a GPU developer. In fact, he even positions Oxmiq as the first GPU startup in Silicon Valley in decades.

"We may be the first new GPU startup in Silicon Valley in 25+ years," wrote Koduri in an X post. "GPUs are not easy."

However, it should be noted that Oxmiq is not building a consumer GPU like AMD Radeon or Nvidia GeForce. In fact, unlike Arm or Imagination Technologies, it does not develop all the IP blocks necessary to build a GPU: it does not support full consumer graphics features out of the box (such as texture units, render back ends, display pipeline, ray tracing hardware, DisplayPort or HDMI outputs), so Oxmiq licensees must implement them in silicon themselves if they plan to build a GPU.

Asset low strategy

A project to bring CUDA to non-Nvidia GPUs is making major progress — ZLUDA update now has two full-time developers, working on 32-bit PhysX support and LLMs, amongst other things

ashilov@gmail.com (Anton Shilov) — Thu, 03 Jul 2025 15:52:45 +0000

ZLUDA, a CUDA translation layer that almost closed down last year, but got saved by an unknown party, this week shared an update about its steady technical progress and team expansion over the last quarter, reports Phoronix. The project continues to build out its capabilities to run CUDA workloads on non-Nvidia GPUs; for now, it is more focused on AI rather than on other things. Yet, work has also begun on enabling 32-bit PhysX support, which is required for compatibility with older CUDA-based games.

Perhaps the most important thing for the ZLUDA project is that its development team has grown from one to two full-time developers. The second developer, Violet, joined less than a month ago and has already delivered important improvements, particularly in advancing support for large language model (LLM) workloads through the llm.c project, according to the update.

32-bit PhysX

A community contributor named @Groowy began the initial work to enable 32-bit PhysX support in ZLUDA by collecting detailed CUDA logs, which quickly revealed several bugs. Since some of these problems could also impact 64-bit CUDA functionality, fixing them was added to the official roadmap. However, completing full 32-bit PhysX support will still rely on further help from open-source contributors.

Compatibility with llm.c

The ZLUDA developers are working with a test project called llm.c, which is a small example program that tries to run a GPT-2 model using CUDA. Even though this test is not huge, it is important because it is the first time ZLUDA has tried to handle both normal CUDA functions and specialized libraries like cuBLAS (fast linear algebra operations).

This test program makes 8,186 separate calls to CUDA functions, spread over 44 different APIs. In the beginning, ZLUDA would crash right away on the very first call. Thanks to many updates contributed by Violet, it can now get all the way to the 552nd call before it fails. The team has already completed support for 16 of the 44 needed functions, so they are getting closer to running the whole test successfully. Once this works, it will help ZLUDA support bigger software like PyTorch in the future.
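
The figures above can be turned into rough progress numbers, and the "runs until the first unsupported call" behavior is easy to model. The sketch below is only an illustration: `percent` and `first_unsupported_call` are hypothetical helpers, and the three-entry call log is a toy stand-in rather than an actual llm.c trace.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


def first_unsupported_call(call_log, implemented):
    """Return the 1-based position of the first call to an unimplemented API, or None."""
    for position, api_name in enumerate(call_log, start=1):
        if api_name not in implemented:
            return position
    return None


# Figures quoted in the article.
print(percent(552, 8186))  # share of the call sequence reached before failure
print(percent(16, 44))     # share of the needed functions already supported

# Toy replay: the log is walked in order and stops at the first gap.
toy_log = ["cuInit", "cuDeviceGet", "cuMemAlloc"]
print(first_unsupported_call(toy_log, {"cuInit", "cuDeviceGet"}))
```

By this measure the test reaches roughly 7% of its call sequence, while about 36% of the required functions are in place. The two numbers differ because a single missing function early in the sequence halts the whole run.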

Improving accuracy of ZLUDA

ZLUDA's core objective is to run standard CUDA programs on non-Nvidia GPUs while matching the behavior of Nvidia hardware as precisely as possible. This means each instruction must either deliver identical results down to the last bit or stay within strict numerical tolerances compared to Nvidia hardware. Earlier versions of ZLUDA, before the major code reset, often compromised on accuracy by skipping certain instruction modifiers or failing to maintain full precision.

The current implementation has made substantial progress in fixing this. To ensure accuracy, it runs PTX 'sweep' tests (systematic checks using Nvidia's intermediate GPU language) to confirm that every instruction and modifier combination produces correct results across all inputs, something the project had never done before. Running these checks revealed several compiler defects, which were subsequently fixed. ZLUDA admits that not every instruction has completed this rigorous validation yet, but stresses that some of the most complex cases, such as the cvt instruction, are now confirmed bit-accurate.
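
The idea of a sweep test can be sketched in a few lines of Python. This is not ZLUDA's harness: in the real project the reference is Nvidia hardware, whereas here a Python function stands in for it. The modeled operation is loosely analogous to a half-precision-to-signed-8-bit conversion with round-to-nearest-even and saturation; mapping NaN to 0 is an assumption about the expected behavior, and all function names are hypothetical. The sweep enumerates every one of the 65,536 possible 16-bit input patterns and reports those where a candidate disagrees with the reference.

```python
import math
import struct


def half_from_bits(bits):
    """Decode a 16-bit pattern as an IEEE 754 half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def saturate_s8(value):
    """Clamp an integer to the signed 8-bit range."""
    return max(-128, min(127, value))


def reference_cvt(bits):
    """Stand-in reference: round to nearest even, then saturate (NaN -> 0 assumed)."""
    x = half_from_bits(bits)
    if math.isnan(x):
        return 0
    if math.isinf(x):
        return 127 if x > 0 else -128
    return saturate_s8(round(x))  # Python's round() rounds ties to even


def naive_cvt(bits):
    """Candidate with a subtle bug: ties are rounded up instead of to even."""
    x = half_from_bits(bits)
    if math.isnan(x):
        return 0
    if math.isinf(x):
        return 127 if x > 0 else -128
    return saturate_s8(math.floor(x + 0.5))


def sweep(candidate, reference):
    """Return every 16-bit input pattern on which candidate and reference differ."""
    return [bits for bits in range(0x10000) if candidate(bits) != reference(bits)]


mismatches = sweep(naive_cvt, reference_cvt)
print(len(mismatches), "mismatching inputs, e.g.", half_from_bits(mismatches[0]))
```

The bug in `naive_cvt` only shows up on exact ties such as 0.5 or 2.5, which a handful of hand-picked test values could easily miss. Exhaustive enumeration is practical here because a 16-bit input space is small; wider inputs would need sampling or a structured subset.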

Improving logging

The foundation for getting any CUDA-based software to work on ZLUDA, whether it is a game, a 3D application, or an ML framework, is having logs of how the program communicates with CUDA. This includes tracking direct API calls, undocumented parts of the CUDA runtime (or drivers), and any use of specialized performance libraries.

With the recent update, ZLUDA's logging system has been significantly upgraded. The new implementation captures a wider range of activity that was not visible before, including detailed traces of internal behavior, such as when cuBLAS relies on cuBLASLt or how cuDNN interacts with the lower-level Driver API.
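
A toy model shows why nested traces matter. The sketch below is not how ZLUDA intercepts calls (which happens at the native library level); it is a pure-Python illustration using a hypothetical `CallTracer` class and two stand-in functions, `blas_gemm` and `blaslt_matmul`, that mimic a higher-level library delegating to a lower-level one. Recording the call depth makes that internal dependency visible in the log.

```python
import functools


class CallTracer:
    """Records (depth, function_name) for every traced call, including nested ones."""

    def __init__(self):
        self.records = []
        self._depth = 0

    def wrap(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            self.records.append((self._depth, func.__name__))
            self._depth += 1
            try:
                return func(*args, **kwargs)
            finally:
                self._depth -= 1  # restore depth even if the call raises
        return wrapper


tracer = CallTracer()


@tracer.wrap
def blaslt_matmul(a, b):
    """Hypothetical stand-in for a lower-level matrix-multiply entry point."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


@tracer.wrap
def blas_gemm(a, b):
    """Hypothetical stand-in for a higher-level call that delegates internally."""
    return blaslt_matmul(a, b)


result = blas_gemm([[1, 2]], [[3], [4]])
for depth, name in tracer.records:
    print("  " * depth + name)
```

A log that captured only the top-level call would suggest that implementing `blas_gemm` is enough; the indented second entry reveals that the lower-level function has to work too.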

Runtime compiler compatibility

"Export control was a failure." Nvidia CEO Jensen Huang slams Biden's China AI restrictions - Computex 2025 Q and A

sayem.ahmed@futurenet.com (Sayem Ahmed) — Mon, 26 May 2025 13:00:00 +0000

Following Nvidia's keynote at Computex 2025, Nvidia's CEO, Jensen Huang, sat down with journalists to talk about all of Team Green's latest announcements, including talk of GB200, and international market opportunities.

In response to one question, Huang talked about the export controls first imposed by the Biden administration, saying that the limitations had cut Nvidia's China market share from 95 to 50 percent within the presidency and hadn't accomplished what they set out to do. He also posited that the restrictions did not prevent China from developing its own competing technologies.

Huang also talked about the massive write-downs his company had to take because of bans on the H20, saying "export controls resulted in us writing off multiple billions of dollars. Then, write off the inventory. The write-off of H20 is as big as many semiconductor companies."

He later praised the Trump administration for ending Biden's AI diffusion rule, saying "I think it's really a great reversal of a wrong policy."

This was a roundtable talk with several journalists from other publications, and this is not a complete transcription of the entire Q&A. However, we've transcribed all of the questions we did manage to hear, with some elements lightly edited for flow and clarity. Some speakers did not have clear audio while speaking, and we have noted this in the transcript.

Ahead of reading this Q&A, you should familiarize yourself with what Huang announced during Nvidia's Computex 2025 keynote; we've popped it down below, just so you can get a refresher.

Jensen Huang: Good Morning, very nice to see all of you. Did you guys see all of this? Isn't this just incredible?

So, this is the motherboard of a new server [Presumably the GB200 NVL72, or RTX Pro machine], and this server has many GPUs that are connected. The GPUs are connected on the bottom, or on top. And on the bottom are switches that connect all the GPUs together, and these switches also connect this computer to all the other computers, using CX8 networking (Nvidia ConnectX-8), 800 gigabits per second, and then the transceivers plug in right there... Plug this into that, now you have an enterprise AI supercomputer. Because this system is air cooled, it's very easy for enterprise [users] to buy. It's available from all over the world's enterprise IT OEMs. Every single company will be offering this, and you can use this for everything.

It runs x86, so all of your software that you run with your enterprise IT works today. You can run Red Hat, VMware, Nutanix, so all of the orchestration and operating system works just fine.

And, you can run computer graphics, you can run AI, you can run agents. This is incredible. This one idea that the GPU... they didn't give us a GPU [To showcase], but you know what one looks like, it's the gold one, and that makes it a new server.

So, this is the RTX Pro Enterprise AI server, and this is a huge, huge announcement, and it opens up the enterprise market. As you know today, all of the AI isn't involved, but OEMs would like to serve the enterprises, companies would like to build it for themselves.

And so anyway, that's a very big announcement. So Cloud AI over here and GB200, that's our enterprise AI system.

Right, who's got the first question? Good morning.

Elaine Huang, Commonwealth Magazine: Just like what you mentioned, that enterprise is a very important market, and everybody talks about not just AI servers, but AI PCs. So what are the potential opportunities that Taiwan has to co-work with Nvidia for upcoming features?

Jensen Huang: Last night, Microsoft announced Windows ML, Windows Machine Learning. Machine Learning is AI, so Windows AI. And, they announced it on Nvidia. So now Windows ML, which is a new API [that] runs AI inside Windows, runs on Nvidia.

And, the reason for that is Nvidia's RTX has CUDA and Tensor cores, every one of them, exactly the same. We have several hundred million RTX PCs in the world. RTX equals AI, and now Windows ML runs on RTX. Everyone with RTX? They know. Hundreds of millions of gamers and PC users, workstation users, everybody. Laptop, desktop... Bingo! Home run, job done [laughs], Windows ML.

If you're a developer, and you would like your own AI supercomputer so that you don't have to keep going to the cloud, open the cloud, and when you're done, you know, shut down your session, because if you don't shut it down, the bill keeps running. And so maybe you're doing development, there's a lot of idle time. And so, you would not like to do that on the cloud, you would like to do it on your desk.

So, if you have a Mac? No problem. If you have a Chromebook? No problem. If you have Linux? No problem. If you have Windows? No problem. We have a perfect little device for you.

[We want] To give you this little AI supercomputer [Presumably DGX Spark], that sits next to you. You can connect it with networking, or you can connect them with Wi-Fi. You talk to this like you talk to the cloud, and the software is exactly the same.

So, if you are a developer, software programmer, AI creator, this product is perfect for you. The perfect developer companion.

And if you'd like a bigger one, this is essentially a computer for the AI natives [DGX Station], anybody who's getting completely AI applications, and you would like a bigger one than this one [DGX Spark], this is an AI workstation [Referring to DGX Station].

And this goes into a desktop, a normal desktop, and you can access it, you can remote, you can use it like the cloud, but it's yours, and you can walk away and go enjoy a coffee and don't feel bad.

(Image credit: Nvidia)

Question 2 (Unclear speaker and outlet): I'm curious to know, over the last five to ten years, you've had a lot of great new products. The array of products and services you have now is quite extensive. I'm really curious to know what you had in the pipeline. Did you kill anything before it entered production? Like, you had a project, it had momentum and at some point you had to get down to business... [unintelligible speech], I'm really curious about the products that never saw the light of day. Can you share anything with us about that?

Jensen Huang: I would say it's very rare that we would completely kill a project. It's very likely that we shape it, shape it and reshape it. And, the reason for that is because the direction needs to be right. Like, for example, the initial early days of Omniverse, we had to rebuild it a couple of times, and the reason for that is because in the beginning of Omniverse, our vision was right. That we needed to create a virtual world of digital systems, and robotic systems, and AI systems.

The vision was right. But, the way we architected the software was odd. It was kind of based on the old days of enterprise and workstation applications. And so, it wasn't scalable.

[Jensen pauses to ask for a bottle of water and proceeds to choke] I do that too, you know, I say something surprising, I'm talking, and then I'm drinking and swallowing all at the same time. I say something surprising to myself, and then I choke on water.

So, in the beginning, we built Omniverse as single instance software with multiple GPUs, and that was the wrong answer. Omniverse should have been created as a disaggregated system. It should run across multiple systems, multiple operating systems and multiple computers with multiple GPUs each. Which is the reason why we built this machine. In fact, this computer RTX Pro, it's called RTX Pro for a good reason.

It's essentially an Omniverse generative AI system, and notice this is one computer, eight GPUs, and you can connect them with more computers. Omniverse will run across this whole thing. This is the perfect Omniverse computer, the perfect robotics computer, the perfect digital twin computer.

We started working on Omniverse, how many years ago now? Let's see... Six years, seven years ago, finally it's come to connecting everything together. So notice that all the pivots that were made along the way, all of the mistakes that were made, and so on and so forth, we just keep investing.

(Image credit: Nvidia)

Eric, Publication Unclear: Just a quick question about DGX Spark, and what you said with simple production, delivery is going to happen in a couple of weeks. I wonder if you can give any additional color about what you feel about the opportunity and compliance. You know, no pun intended, but is the window closing for an additional player to get into ARM-based computing?

Jensen Huang: So first of all, it's just delightful to look at. You know, it's nice to have a computer that's beautiful. The reason why we need this computer is because we need a coherent, productive AI development environment, and AI has models that are fairly large. Its environment really wants to be fully accelerated with excellent Python software and AI stacks.

If I look around this room right now, I don't actually see a computer that would be perfect for AI development. Most of the computers don't have that memory, or they don't have Tensor cores, because maybe it's a Mac, or maybe it's a Chromebook, or maybe it's an older version of a PC, or an older version of a desktop.

And so we took a state-of-the-art AI system, and we put it in a remote Wi-Fi environment that connects to everybody's computer. Now there's some 30 million software developers in the world, there will probably be just as many who are now going to be AI developers.

And so everybody has the benefit of having, essentially an AI supercomputer, an AI cloud, but not being burdened with the anxiety of your cloud computing that's ticking away. And so this is something you can buy, the ROI is probably, call it six months.

And for most of all, of course, this, we have really great volume, and it's available from everybody. So they'll be available from Asus, they can sell it alongside their laptops. It's available from MSI, and not to mention all the enterprise OEMs.

Every single developer can go out and just get one, and just put it next to their desk. You can develop on here, and you want to now scale it out, or test it out on large data sets, it's just like one pull down menu, point it at a cloud. Exactly the same thing runs there. And so, this is really an ideal AI developer environment.

(Image credit: Future)

Victoria Jen, CNA: Given the trade tensions and also the talk of de-globalization. How is Nvidia thinking about its global supply chain strategy, and where does Taiwan fit in that picture?

Jensen Huang: First of all, Taiwan is going to continue to grow, and the reason for that is we're at the beginning of a breed of a new industry. This new industry builds AI factories. The world is going to have AI infrastructure all over. AI infrastructure will cover the planet, just as internet infrastructure has covered the planet. Eventually, AI infrastructure will be everywhere.

We are several hundred billion dollars into a tens of trillions of dollars AI infrastructure buildout that will take five decades.

I've been looking around Taiwan. There are cranes everywhere. There's buildings everywhere. Factories are being built everywhere. And the reason for that is because we're all racing to build infrastructure for AI, manufacturing for AI.

Well, simultaneously, the world needs to have more manufacturing resilience and diversification, and some of that will be distributed around the world.

In the United States, we're going to do some manufacturing. It is impossible to do all manufacturing, all on-shore, and it's also unnecessary. But, we should do as much as we can that is important for national security, while having resilience, with redundancy, all around the world.

And so this rebalancing is happening at actually, a very good time. It's happening at a very good time because the world is building AI infrastructure. We're adding new infrastructure for the very first time. So we need a lot of new plants anyway.

The most important thing is we have to provide energy for these new plants. Communities realize that we want to grow. We want to have economic prosperity. We want to have economic security. In order for that to happen, industrialization; AI factories, need energy. And so the support of governments to provide for energy of all kinds while we pursue new technologies, whether it's hydrogen or nuclear, solar or wind, whatever new technology that is most available at the time, we're going to need it all.

And so, government officials around the world really need to support all of the companies, so that we can re-industrialize and reset our industry, so that we can grow into AI infrastructure.

(Image credit: Nvidia)

Question 5 (Unclear speaker and outlet): My question is about NVLink Fusion, what's your strategy?

Jensen Huang: NVLink Fusion allows every data center to take advantage of this incredible invention we call NVLink, now in its fifth generation. We've been working on NVLink now for... How many years? 12 years?

Also, NVLink, and Spectrum-X, and Quantum, our networking, is highly integrated and highly optimized together. And, that's one of the reasons why the performance of the AI data center, the AI factory, is so good.

When the factory cost $10 billion and our efficiency is 90%, and someone else's efficiency is 60%...30% of $10 billion is $3 billion. So, network efficiency is very important. Performance is very important. Efficiency is very important. Energy is very important...All because the network efficiency is so good.

And so we have many customers, many people who are developing their semi-custom AI infrastructure. They came to us and asked: "Can we use NVLink?"

Because, of course, there's an industry discussion about UALink. UALink is not doing that well, I don't think. And so, the customers have come to us and asked whether NVLink could be authenticated. And I said, of course, we're happy to.

The benefit to us is that the Nvidia network, the Nvidia fabric, is really the operating system of the data center. It's the nervous system of the data center. And so, we can extend Nvidia's nervous system into every data center, whether it's Nvidia's technology, or if you're selling custom technology.

So we can expand our market opportunity. It is also so good for us, it's good for the ASIC companies, MediaTek, Alchip, Marvell, right?

It's good for them, because now they have a complete solution. They can have HBM [High Bandwidth Memory], they have [speech unclear], and now they also have NVLink and networking. So now they have a complete solution partner. For the customer, of course, the most important part is that the architecture of NVLink is integrated with the [server and data center] rack system.

The rack system is so complicated. The spine is so complicated. And since they are already using GB200, they're already using Nvidia's racks. Now, they can scale, continue to use Nvidia racks, or even semi-custom. So one architecture, one hardware architecture, one NVLink architecture, one networking architecture, sometimes it's three CPUs, sometimes it's Fujitsu CPUs, sometimes Qualcomm CPUs, it's very nice for the customer.

Are we open to working with Broadcom? Of course, we are, of course we are. Currently, they have their own plan. But if they need, if they would like, to use NVLink [we're] very open.

We love working with Broadcom. We work with Broadcom in many places.

Export control. So, as you know, export control has caused us to write off our H20s. Our H20 is now banned in China. Banned to ship in China, and export controls resulted in us writing off multiple billions of dollars. Then, write off the inventory. The write-off of H20 is as big as many semiconductor companies.

If you look at most chip companies, their quarterly revenue is only a few billion dollars. We wrote off, you know, multiple billion dollars of inventory. And so the cost to us is very high, and also the sales to us was quite high.

The fact of the matter is, the China market is very important. It's very important for several reasons. The first reason is that China is where 50% of the world's AI researchers are. And we want the AI researchers to build on Nvidia. DeepSeek was built on Nvidia. That's a gift to us. It's a gift to the world.

Now, DeepSeek runs incredibly well everywhere. [Audio unclear, Huang mentions R1 or similar.] is an excellent, excellent AI model. It's a gift to the world. It's open source. And so, the China market is important, because the AI researchers there are so good, and they're going to build amazing AI no matter what. We would like them to build on Nvidia's technology.

Second, the China market is quite large. As you know, it's the second largest computer market. There's no others like it. And so the China market, my guess is that next year, the whole dang market is probably [worth] $50 billion. $50 billion dollars! You know how large many chip companies are? Much less than $50 billion.

So, the $50 billion market opportunity to Nvidia is quite significant, and it would be a shame not to be able to enjoy that opportunity, to bring home tax revenues to the United States, create jobs...sustain the industry.

Okay, so, all of that... [Jensen gets distracted and looks at the person who asked the question] You asked me a question, you're not even paying attention.

[The duo converses in Mandarin, and Huang responds] Am I upset at the policy? [Regarding export controls to China] I'm not upset at the policy. I think the policy is wrong, and the reason for that is because, let's look at the evidence.

Four years ago, at the beginning of the Biden administration, Nvidia's market share in China was nearly 95%. Today, it is only 50%. The rest of it is China's technology, and not to mention we have to sell lower chip specifications. So, our ASP [Average selling price] is also lower. So we left a lot of revenue, and nothing changed.

AI researchers are still doing AI research in China. They have a lot of mobile technology they would use if they don't have Nvidia, if they don't have enough Nvidia, they will use their own! They'll use the second best. Then lastly, of course, the local companies are very, very talented and very determined, and the export controls gave them the spirit, the energy, and the government support, to accelerate their development. And so I think, all in all, the export control was a failure, the facts would suggest it.

(Image credit: Tom's Hardware)

Question 6 (Unclear speaker and outlet): So I have a quick question about AI factories. I mean, AI factories and depreciation. So the equipment inside is a data center or AI server, right? So if we talk about the factories, we have to talk about depreciation, and equipment upgrades. So, what do you expect to see? And then you have this one-year-rhythm theory, which means the systems will be upgraded every year.

So, what's your expectation on the lifetime of equipment in data center AI factories. How frequently will their systems need to be upgraded?

Jensen Huang: There are two pieces of information that are very important. The first is the reason why we upgrade every year. It's because in a factory, performance equals cost, and performance equals revenues.

If your factory is limited by power, and our performance per watt is four times better, then the revenues of this data center increase by four times. So, if we introduce a new generation, the customer's revenues can grow, and their costs can come down.

We upgrade every year. So we tell our customers, don't buy everything every year. Buy something every year. This way, they don't over-build and over-invest with old technology. But the benefit that we have, is that Nvidia architecture is compatible in all of the factories. And so, we can upgrade the software for a very long time.

The second fact, Hopper in the [beginning] of its life, Llama-70B, in the beginning, was 1/4 the performance on the same Hopper [system] four years later. So, we keep improving the performance using CUDA software, which is the benefit of CUDA. And if we optimize the software, and improve the performance of the model, it helps the whole factory. Every single factory. Every single computer.

Nvidia's CUDA is very valuable here, Nvidia's once-a-year rhythm is very valuable, and so you have to use both of them together. With that, your overall data center fleet revenues will go up, your overall data center costs will come down.

And then one last idea. As you know, Nvidia CUDA runs everything. Every model, every new innovation, because the Nvidia CUDA install base is so high. If you are a software developer, of course you would do Nvidia CUDA first.

You have the best performance, the best technology, you also have the largest install base, and software developers want the largest install base, so that they can touch and reach everyone. Isn't that right? And so these three ideas, once a year, performance up, costs down. Once a year, all the time, a software upgrade with CUDA. And then lastly, our install base is so high everything runs, so the life of your data center will be quite long.

(Image credit: DeepSeek)

Max Cherney, Reuters: Since you were just talking about China, it brings me to something that I think has been an interesting question. Over the past 10 days, you've gone on a world tour. Made pit stops in the Middle East and elsewhere. And what I'm wondering is you've also made a flurry of announcements. Very technical stuff here at Computex, you know.

What I'm wondering is if you could put some of the technical announcements you've made, such as NVLink Fusion, the laptop platform, and some of the other more detailed, nerdy things in the context of how you're planning to continue to sustain Nvidia's growth over the next few years. I think that's especially relevant, with some of the fears investors have at the moment about a pullback in AI spending, especially after DeepSeek.

Jensen Huang: That last little part is really important. Remember what DeepSeek did. DeepSeek was incredible for AI infrastructure.

The old AI is called one shot [A stateless model like GPT-2 and GPT-3]. You hit enter, and the AI gives you an answer. One shot. The only way to give you a one shot answer is not to think. No thinking. You, [the AI model] already know it. You've kind of memorized it from pre-training. But DeepSeek is a reasoning model. It has to think.

It has to think, and you want to think fast, because if you don't think fast, the answer will take too long to come. And so DeepSeek opened the reasoning model, the world's first open source, excellent reasoning model.

Developers all over the world are using it, because it's so good. Now, the reasoning model is not one shot [stateless model], but it's hundreds of shots. It even goes to the internet to read websites, and read PDFs. So, it has to read, think, reason, plan, read some more.

So, that's the reason why deep research... You see that the latest versions of queries are taking much longer. The reason why it takes much longer is using a lot more compute [power]. And so, in fact, DeepSeek increased the amount of computing needed by maybe 100 to 1000 times.

That's the reason why all over the world, the AI companies are saying the GPUs are melting down, right? Sam [Altman], says our GPUs are melting because they're working too hard. They need more GPUs, more GPUs.

And last night, Microsoft announced that they were the first to online GB200, that OpenAI is already using GB200, and that they're planning to build out this year, hundreds of thousands of GB200 [systems]. More build out this year than all of Microsoft's data centers combined, only three years ago.

That's how much [OpenAI plans to build out], in just one year. And so the build out, the ramp of AI infrastructure, to me, is actually just beginning. This is now the beginning of the reasoning AI era, and reasoning AI is so useful, and it's so useful in so many different applications.

Second, AI infrastructure is being built out. That's one of the reasons why I'm traveling around the world. Every region realizes they need to build their own AI infrastructure. AI infrastructure is going to be part of society, part of the industry. Just like electricity, just like the internet, AI is going to be an essential part of infrastructure, social infrastructure, as well as industrial infrastructure.

When I was in the Middle East, President Trump announced that this is the reversal of what was the previous AI diffusion rule, the new diffusion rule, for this administration, they realized the goal. The goal of the AI diffusion rule, as specified in the past, was to limit AI diffusion.

President Trump realizes it's exactly the wrong goal. The United States, and America, is not the only provider of AI technology. If the United States wants to stay in the lead, and if the United States would like the rest of the world to build on American technology, we need to maximize AI diffusion, maximize the speed. And that's where we are today.

I think it's really a great reversal of a wrong policy, frankly. And this [the new AI diffusion rules] is a great reversal of that, and it's just in time.

(Image credit: Nvidia)

Question 8 (Unclear speaker and outlet): We started with CPUs, and then to right now with GPUs. So still, both are important for our industry. So, what is the future?

[This question is quite unclear, but the general gist of it is that the speaker asks Jensen about the future of the hardware and software industries in the wake of AI factories and data centers.]

Jensen Huang: Good question. Fluid Dynamics is not going to go away. Particle Physics is not going to go away. Finite elements is not going to go away. Computer graphics is not going to go away. These algorithms are so good, and they've been refined over so many years.

Not to mention trillions of dollars of software already written, no reason to rewrite it. And so flexible software, flexible hardware is always valuable. That's the reason why CPU has been so successful for 60 years.

Now, Nvidia has created something and you have been following CUDA for two decades now, and you understand very deeply, that CUDA is so successful because there's so many domains of applications.

Everything from deep learning, to machine learning, classical machine learning, to unstructured data quantities, structured data processing, to particles, and fluids, and quantum, and chemistry, and so on and so forth. The list goes on.

And so the benefit is flexibility. If it's slow, then it's too expensive. But Nvidia's technology is very fast, it's also flexible. Then the data center can be used for many things. If the data center can be used for many things, the utilization will be high. If the utilization is high, the cost is down.

So, general purpose equals low cost. In fact, you might remember, on the day that Steve Jobs announced the iPhone, he showed iPhone, and then he showed the music player and camera, and also a PC. So all of these different devices can now be in a general purpose device, camera, music player, all in one general purpose device. This general purpose device is, of course, more expensive, but the cost is actually lower than having all of those things.

So, general purpose equals low cost, but it hasn't got very high performance. And that's the benefit of CUDA. You have just exactly pointed to the reason why CUDA is so successful.

(Image credit: Nvidia)

Question 9 (Unclear speaker and outlet): Two or three years ago, you said that Nvidia is a software company, and beyond hardware. So, what elements will take Nvidia to the future? Is it still CUDA [unclear audio], or a bunch of AI [unclear audio].

Jensen Huang: Thank you. Appreciate that. Actually, what I said is that Nvidia starts with software. We always start with the algorithm. For example, it could be a quantum classical algorithm, quantum classical computing.

Maybe it's an algorithm for computational lithography, making chips. Maybe it's an algorithm for 5G and 6G radio. We always start with the algorithm, and then, we try to design up, down, bottom. It's called "co-design", across the entire stack.

But, we have to start with the algorithm. Otherwise, if you don't understand the algorithm, you cannot accelerate it. CPUs don't have to understand algorithms. CPUs, because the algorithm sits on top of a compiler, and you only see a compiler. But, accelerated computing is not like that. [It's a] Very, very different type of computing format.

So, Nvidia starts with that [with software and algorithms]. In the future, though, you will see that Nvidia started with software, acceleration of algorithms, to full-stack, then we became a systems company, then we became a data company. Now, we're becoming an AI infrastructure company.

And the infrastructure is important, because the software that runs across the infrastructure is very different than the software that runs on a PC. And, the system organization, architecture and optimization is very different than inside a PC.

So, as we think about the future of computing and these factories, you have to think about the infrastructure completely. Everything from power, to cooling, to networking, to scale up networking, into the fabric, security, storage...everything. Everything has to be considered in one time.

Otherwise, the software is not optimized, the throughput is not optimized. And if the throughput is not optimized, the revenue is affected. This is the first time, the very first time, that a computer directly affects the revenues of a company.

Today, when you see a chip fab, that ASML equipment directly affects TSMC's revenues. Makes sense, right? It directly affects the revenues. But a computer in a big IT data center... How do you know?

If I bought you a faster laptop, does it directly translate to your revenues? Does it directly translate to your income? It does not. Same thing. [For example] IT, if I bought them more computers, does it directly translate to Nvidia's revenues? It does not.

But in the AI factory, it does.

So, this is a very new way of thinking about computers. It's a factory, and we have to optimize it to the extreme, because these factories are very, very expensive.

(Image credit: Nvidia)

Dr. Ian Cutress, More than Moore: Love the NVLink Fusion announcement you did yesterday. I'm trying to understand the width, depth and breadth, to the availability to the outside world. I kind of want to envision a system where you have the NVLink spine, you have a partner with that custom CPU, with NVLink, their custom GPU, TPU, whatever you want to call it, with their NVLink, being a custom partner with a switch on top. All that's involved with Nvidia, is the ingredient with the switch. Is that a vision you can expect for that technology in the future, you said you wanted to at least buy something right?

Jensen Huang: That is one vision. But the more likely vision, is that they will buy an NVLink chiplet, and they'll buy the NVLink switch, and the NVLink spine, and the Spectrum-X switch, and all of the necessary software to go along with it. That's more likely.

Let's use one particular example. Remember, Fujitsu has been a computer company for literally, exactly as long as I can remember. They have a large install base of Fujitsu systems all over the world, and it's based on Fujitsu's CPU. They would like to add AI to that.

How do you do that? Because today, Fujitsu has a CPU, and they would like and all of their software stack runs on the Fujitsu CPU, and the Nvidia AI, runs on Nvidia AI.

And so how do you combine the two? How do you use these two together? Well, the way you fuse these two ecosystems together is with NVLink beauty. It's a fusion of ecosystems. Does it make sense? That's why I call it NVLink Fusion, the fusion of two ecosystems.

All of a sudden, by building a Fujitsu CPU with NVLink, and you connect it to...the port is actually going to look exactly like this, except this will be a producer CPU, or [unclear audio], or [unclear audio], or Rubin. We would then sell this to Fujitsu. Make sense? They plug it into the NVLink system, and look what happened. Fujitsu's entire ecosystem just became AI supercharged.

Dr. Ian Cutress, More than Moore: But could they use their own accelerator?

Jensen Huang: They could, but they really want our ecosystem. That's the reason why they did this. If they don't want our ecosystem, there's nothing to fuse. People want our ecosystem, and all the software that we bring along.

So we would do the same with Qualcomm, and if other CPU vendors would want, we're more than happy to. Because we put the chip to chip, and they NVLink into Synopsys and Cadence, so every CPU company could do it. And all of a sudden, Nvidia's entire ecosystem becomes integrated with theirs, fused with theirs. Pretty clever, huh?

(Image credit: Shutterstock)

Lisa, Wall Street Journal: I just want to follow up with something you said earlier. You talked about the AI diffusion rule, and basically it's been a reversal for the past week. I'm interested in your views going forward. Do you think this reversal will continue, just at least the Middle East is just one example of a country's negotiation over GPUs. I'm just wondering, did you expect Trump and his administration to continue that line and his attitude?

Jensen Huang: I don't know the details of the diffusion rules. The policy hasn't come out. No one knows what future policies are going to happen. Nobody knows in any country and in any government. Policies are always evolving.

But here's what I do know: the fundamental assumptions that led to the AI diffusion rule in the beginning have been proven to be fundamentally flawed.

That's the big thing. Believe that smart people are doing smart things in governments, and they want to do what's good for the country. I believe people genuinely do that.

If the facts are flawed, if the assumptions are flawed, then the outcome would have to be flawed, the policy would have to change, and so the fundamentals have been completely proven wrong.

And so that's the reason why President Trump made it possible for us to expand our reach outside the United States. And he said very publicly, that he would like Nvidia to sell as many GPUs as possible, all around the world.

The reason for that is because he sees it very clearly, that the race is on, and the United States wants to stay ahead. We need to maximize, accelerate our diffusion, not limit it, because somebody else is more than happy to provide it.

And this AI diffusion is important, because AI is not just AI for all of the things that we said, remember, AI is also going to be the foundation of 6G. So, future communications infrastructure will also be affected. So, we need to get the American AI technology out to as many places as possible. Work with developers and AI researchers all around the world, and help them build an ecosystem, [to] participate in this incredible AI revolution, and do that as fast as possible.

The fundamental assumption was that the United States is the only provider of AI. And obviously, that's not true.

(Image credit: Shutterstock)

Dianne, New York Times: So you talked about how important the China market is for Nvidia, and I was wondering, what does it look like for Nvidia to compete in China on an ongoing basis? Is it accurate that the company is investing in a research center in Shanghai, and does the future of Nvidia in China look like, potentially, working more closely with the US government to avoid a future situation like H20?

Jensen Huang: We are trying to lease a new building for our employees in Shanghai. We've been in China for 30 years. Our employees are in a really cramped environment.

Because now more and more people- We still have a flexible work from home policy. So, you know, I decided that the way that people work should reflect the capabilities of the technology, the nature of our work and the sensibility of culture.

And the one additional idea is, because of video conferencing, because we can remote work, I wanted to use the opportunity to enable young people, young parents, to be able to build a life, build a family, and build a career at the same time. Because many young women can't build a career, because they have to be at home taking care of their children.

I would want to make it possible for young women to do both: Have a great career and be a great homemaker. And so the ability to have remote work enables that to happen. It's been a fantastic response from all of our employees and all of the others.

They think it's fantastic. Of course, it is incredible maintaining both jobs or doing both things at one time. It's not easy, but at least it's possible. And so that's the reason why we have remote work.

But more and more people are starting to move to the offices, and so the offices are just too cramped. We finally found a place that we could lease the building, and that's basically it. I'm surprised that that's such an enormous story. I feel like I just bought a new chair and that that became front page news. [Laughs]

Our competition in China is really intense. Let's face it, China has a vibrant technology ecosystem, and it's very important, the fact that China has 50% of the world's AI researchers, and China is incredibly good at software. I would put China's software capabilities up against any country, any region in the world. That's how good they are.

Not to mention, they're fast. So our competition is intense. If we're not there, quite frankly, the local companies are more than joyful. They would love for us to never go back to China.

And so it is precisely those policies benefit, whatever the reasons are. I hope that what is actually happening is going to help shape the policy-makers, so that it's possible for us to go back and compete, and that's my goal.

H20, as it currently stands, Hopper. We don't know how to degrade Hopper to make it useful to the marketplace, but we're committed to the market.

You know, the number one in export controls. Export control puts limits on products. If the government would like to completely have sanctions, and whatever they want to ban completely, they're allowed to do that, of course, and we'll comply with the law.

But in the meantime, our job is to comply with the export controls, and the government is very clear about that, provide the export controls, but do your best. Provide the export controls, but continue to do your best, serve the market.

What we're trying to do right now is to think through, how can we best serve the market? And we have very limited choices. We degraded the product so severely, it's going to be quite complicated. But anyhow, we're going to do our best. I don't have any good ideas at the moment, but I'm going to keep thinking.

Penny, Publication Unknown: I'm just going to ask you a question about China. So there are lots of startups in China, GPU companies, and they're developing their own chips. I'm just wondering how you see this, and how is Nvidia going to respond to it?

Jensen Huang: There are startup companies, and [Audio unclear] one of the largest and most affordable technologies, period, in the world. And, they are innovating fast. The advantage that AI provides is that the data center is very large. It's not like a cell phone. It doesn't have to fit in one line. If it doesn't work, you know, use two chips. And if that doesn't work, use four chips. It uses more energy. But power is quite cost effective in China, and there's plenty of land. So the ban on H20 is not effective for that reason. They'll just buy more chips from the startups, from Huawei, and others. And so, I really do hope that the US government recognizes that the ban is not effective and gives us a chance to go back to market as soon as possible.

(Image credit: Nvidia)

Question 14 (Speaker and publication unclear): Nvidia is building AI systems for large scale, solutions like GB300 NVL72, do you envision any [audio unclear] specific platforms. Will Nvidia extend any particular specialized AI hardware, and how will you prioritize areas like robotics and industrial AI?

Jensen Huang: These are the two computers. These are called DGX. DGX-1 was the first computer in the world created for one service, for AI. It only does one thing. AI. Oh, and CUDA.

So DGX-1, was the world's first AI native computer, and when I first announced it, there were no customers except for one, and they didn't have any money, so I gave it to them. A company called OpenAI, this was 2016.

So, I decided that now that there are developers all over the world, and they would all love to have their own DGX-1, but DGX-1 is very big, and so I decided to make small ones. These are personal DGX-1s. This one is called DGX Spark, and this one's called DGX Station, the world's first AI personal computer.

[Audio unclear] With respect to robotics, robotics is going to be the next industrial revolution. Let me prove it to you. In order for a technology to succeed, it needs to have excellent capability. It needs to have usefulness, so customers buy it, and there needs to be enough customers buying it [at] high volume, such that the R&D fly-wheel can be high.

It has to be useful, it has to be good technology. The technology needs to converge at just the right time, [it needs] lots of customers and use cases. [Audio unclear]

If this technology fly-wheel is high, then the refinement rate will be exponential. The performance will go like this. Cost will go like this, just like smartphones. The moment that the smartphone, all the pieces of technology, from touch, to 3G, which became 4G, mobile processors, internet, the whole web. The moment all of those things came together, boom, it took off. A huge industry.

The same thing is going to happen with robotics, and the reason for that, is the humanoid robot is the only robot that we can imagine using in many places, because we are in many places. We fit the world to ourselves.

There are only two of them [robotics products] with that property, those characteristics. Self-driving cars, because we created the world's [Audio unclear] for cars, and humanoid robots, because we created the world for ourselves.

If we can make these two technologies useful, functional and useful, it's going to take off. And that's what Nvidia's Isaac GR00T is.

Our entire platform, just like we have RTX for games, just like we have Nvidia AI that you're seeing here, Isaac GR00T is our humanoid robotics platform, and we are very successful with them. That's going to be the next multi-trillion dollar industry. I expect it to be very, very large.

This was not the end of the Q&A session that journalists had with Nvidia's Jensen Huang at Computex 2025. However, we hope you enjoyed reading.

Nvidia to drop CUDA support for Maxwell, Pascal, and Volta GPUs with the next major Toolkit release

editors@tomshardware.com (Hassam Nasir) — Tue, 06 May 2025 13:06:15 +0000

The official release notes for Nvidia's CUDA 12.9 Toolkit explicitly indicate that the next major release will no longer support Maxwell, Pascal, and Volta-based GPUs. Note that this deprecation is limited to the compute side, as these GPUs will likely continue receiving normal GeForce drivers for the time being. That being said, this is likely the last SDK version that can be used to develop CUDA applications targeting the aforementioned architectures.

While the previous release hinted at this change, Nvidia's stronger wording now serves as a definitive signal for developers to shift to more modern architectures. CUDA 12.x series (and before) will still allow application development for these GPUs. The deprecation targets offline compilation and library support. Essentially, future CUDA compilers (nvcc) will lack the ability to generate machine code compatible with these GPUs. In the same vein, upcoming versions of CUDA-accelerated libraries like cuBLAS, cuDNN, etc., will not offer support for GPUs built using these architectures.
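To make the cutoff concrete, here is a minimal sketch that checks whether a GPU's compute capability falls into one of the three architectures named in the release notes. The capability numbers (Maxwell 5.x, Pascal 6.x, Volta 7.0 and 7.2) come from Nvidia's published compute capability tables rather than from this article, and the names `DROPPED_ARCHS` and `dropped_architecture` are purely illustrative.

```python
# Compute capabilities (major, minor) of the architectures that the next
# major CUDA Toolkit release is expected to stop targeting.
# Assumption: values taken from Nvidia's public compute capability tables.
DROPPED_ARCHS = {
    "Maxwell": {(5, 0), (5, 2), (5, 3)},
    "Pascal": {(6, 0), (6, 1), (6, 2)},
    "Volta": {(7, 0), (7, 2)},
}


def dropped_architecture(major, minor):
    """Return the architecture name if this capability is being dropped, else None."""
    for name, capabilities in DROPPED_ARCHS.items():
        if (major, minor) in capabilities:
            return name
    return None


# A GTX 1080 reports compute capability 6.1 (Pascal); an RTX 2080 reports 7.5 (Turing).
for cc in [(6, 1), (7, 5)]:
    arch = dropped_architecture(*cc)
    status = f"dropped ({arch})" if arch else "still targeted"
    print(f"sm_{cc[0]}{cc[1]}: {status}")
```

A build script could use a check like this to warn before moving to a newer toolkit, though the exact list should be confirmed against the release notes for that toolkit.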

Nvidia has not specified an exact date for the upcoming major release (likely CUDA 13.x). Similarly, we aren't sure how many interim releases are to follow in the 12.9.x branch. Either way, this is quite a significant change as Nvidia is dropping three major architectures with one swing. Volta's consumer equivalent Turing (RTX 20) is next in line, but it likely has a lot more to offer before it too hits the chopping block.

"Maxwell, Pascal, and Volta architectures are now feature-complete with no further enhancements planned. While CUDA Toolkit 12.x series will continue to support building applications for these architectures, offline compilation and library support will be removed in the next major CUDA Toolkit version release. Users should plan migration to newer architectures, as future toolkits will be unable to target Maxwell, Pascal, and Volta GPUs."
CUDA 12.9 Toolkit release notes

China's Moore Threads polishes homegrown CUDA alternative — MUSA supports porting CUDA code using Musify toolkit

editors@tomshardware.com (Hassam Nasir) — Mon, 14 Apr 2025 13:09:46 +0000

The first traces of Moore Threads' GPU programming software stack, dubbed MUSA, have surfaced online, furthering the nation's pursuit of tech-autarky. MUSA serves as an alternative to Nvidia's CUDA environment, compatible with the domestic MUSA MTT GPU lineup. Any open-source pedigree of the SDK has not been mentioned, so it is likely proprietary and won't be of much benefit to developers outside China.

The U.S. has implemented a series of export restrictions on China, including: advanced AI chips, high-bandwidth memory (HBM), manufacturing equipment, and silicon wafers from leading players like Intel, TSMC, and Samsung. In a bid to reduce reliance on Western hardware, China is hard at work developing its semiconductor ecosystem with in-house silicon, fab equipment, memory, CPUs, and even GPUs. The latter is of great importance, as modern-day machine learning (sometimes under the buzzword banner of AI) is largely accelerated by parallel computing, something which GPUs excel at.

A strong GPU programming ecosystem offers high-level abstraction, ready-to-use libraries, documentation, and profiling tools. With high-performance Nvidia GPU exports still in limbo, Moore Threads is offering an alternative to CUDA.

MUSA provides a built-in compiler (MCC), runtime libraries (MUSA Runtime), a comprehensive list of specialized libraries (MUSA-X), debuggers, and profilers. To ensure compatibility with already written CUDA code, the MUSA SDK also includes Musify, a tool that translates CUDA code for the MUSA environment, likely by translating PTX code at runtime, similar to ZLUDA.

(Image credit: Moore Threads)

The MUSA SDK version 4.0.1 is compatible with x86 processors from Intel (on Ubuntu) and Hygon (on Kylin). Moore Threads is showing off the prowess of its stack through several demos on its website, including speech synthesis, AI image generation, image processing, and AI-powered 3D face modeling. You can actually try out a bunch of these demos right now (though you might need an account), some of which are reportedly running on Moore Threads' MTT S3000 datacenter GPUs.

Despite CUDA's clear advantage in terms of advancement, maturity, and support, MUSA could find many indigenous customers in small-scale environments, evolving over time. AI developers and researchers envision a heterogeneous future, championing the adoption of hardware-agnostic and open-source platforms. Breaking free from CUDA's reign requires superior alternatives, with ROCm being a key contender. However, AMD's hardware support still trails behind Nvidia's.

Keller and Koduri headline the Beyond CUDA Summit today — AI leaders rally to challenge Nvidia's dominance

editors@tomshardware.com (Hassam Nasir) — Mon, 17 Mar 2025 11:30:00 +0000

TensorWave, a cloud platform for AI workloads powered by AMD's Instinct MI accelerators, is kicking off the Beyond CUDA Summit today. The event focuses on the concept of the 'CUDA moat,' and how developers can optimize their AI-centric workloads using other alternatives. Attendees can expect to see demonstrations, hot takes, panels, and expert opinions from influential leaders in the AI field including computer architect icons like Jim Keller and Raja Koduri.

It's no secret that Nvidia-built GPUs constitute the majority of hardware in the AI space. Although AMD's Instinct accelerators offer performance comparable to Nvidia hardware, the already-established and mature CUDA ecosystem is indispensable to some users / organizations. Nvidia realized the potential of parallel computing on its GPUs early on and developed a proprietary platform dubbed CUDA, which is now the de facto standard for GPU-accelerated computing.

Through continuous efforts, optimizations, and the sudden rise of AI which coincidentally is powered by GPUs, Nvidia has positioned itself as a leading solution provider. In fact, 90% of Nvidia's revenue is now driven by its data-center offerings, with CUDA being a central selling point. This creates a vendor lock-in situation, where CUDA (software) effectively confines the industry to Nvidia's hardware, limiting innovation and competition.

The industry is shifting gears to a more open-source and hardware-agnostic future, but that's easier said than done. We have OpenCL, ROCm, oneAPI, and Vulkan as alternatives; however, each trails Nvidia in one or many aspects. Enter Beyond CUDA, where key figures in the AI field have rallied to develop a more diverse and heterogeneous future. Hosted by TensorWave, the Beyond CUDA Summit will address the many challenges the AI computing industry faces, such as hardware flexibility, cost efficiency, and exploring the available alternatives to CUDA.

Platforms like ROCm require significant developments to achieve parity with CUDA. Even now, ROCm only supports a small selection of modern GPUs while CUDA maintains compatibility with hardware dating back to 2006. AMD's latest RDNA 4 GPUs are still not officially supported by ROCm. Developers have long bemoaned AMD's slow adoption of new features and support on new hardware. On the positive side, Strix Halo is now ROCm-compatible, though only on Windows.

If you live in San Jose, buckle up as the summit takes place at The Guildhouse, which, with notable irony, is just three blocks away from the McEnery Convention Center, the site of Nvidia's GTC, which also commences today. Participants have the opportunity to win an AMD Instinct MI210 GPU with 64GB of HBM2e memory. The event runs from 12 PM to 10 PM PDT, with four time slots for various sessions. You can learn more details about the summit here.

PhysX quietly retired on RTX 50 series GPUs: Nvidia ends 32-bit CUDA app support

editors@tomshardware.com (Aaron Klotz) — Tue, 18 Feb 2025 20:49:02 +0000

Nvidia has quietly retired 32-bit support on RTX 50 series GPUs for PhysX, a game-specific graphics technology that was advertised heavily during the 2000s and early 2010s. Nvidia confirmed the technology's end-of-life status (at least the 32-bit version) on the Nvidia forums as a result of the deprecation of 32-bit CUDA application support starting with the RTX 50 series.

As far as we know, there are no 64-bit games with integrated PhysX technology, thus terminating the tech entirely on RTX 50 series GPUs and newer. RTX 40 series and older will still be able to run 32-bit CUDA applications and thus PhysX, but regardless, the technology is now officially retired, starting with Blackwell.

PhysX is one of the oldest Nvidia technologies, almost rivaling the age of CUDA itself. PhysX was a proprietary physics simulation SDK capable of processing ragdolls, cloth simulation, particles, volumetric fluid simulation, and other physics-focused graphical effects.

Since its inception in 2004, PhysX, which Nvidia acquired as part of its Ageia purchase and then adapted to run on GeForce GPUs, has been integrated into a decent number of games. It was used with several notable AAA games, including the Batman Arkham trilogy, Borderlands: The Pre-Sequel, Borderlands 2, Metro: Last Light, Metro: Exodus, Metro 2033, Mirror's Edge, The Witcher 3, and some older Assassin's Creed titles.

PhysX advertised the idea of running physics calculations on the GPU (formerly an Ageia PPU) rather than the CPU. Running physics on the GPU usually allows for significantly greater rendering performance for physics-related graphical effects, enabling higher frame rates and also improving the quality of physics effects compared to what could be achieved on a CPU. The problem was that PhysX support on Nvidia GPUs was only possible because it used CUDA — a proprietary Nvidia platform that enabled CPU-focused programming languages to be executed on the GPU.

By the late 2010s, PhysX's adoption slowed significantly in favor of more flexible alternative solutions (including CPU- and GPU-compatible solutions). The biggest problem that plagued PhysX was its strict requirement for an Nvidia GPU, preventing it from being used on competing GPUs, consoles, and smartphones. On top of this, Nvidia also began dropping support for some PhysX features later in its life cycle. For example, in 2018, Warframe transitioned from PhysX to a homebrewed physics simulation framework (based on PhysX) due to Nvidia dropping physics particle simulation support.

The only way now to run PhysX on RTX 50 series GPUs (or newer) is to install a secondary RTX 40 series or older graphics card and slave it to PhysX duty in the Nvidia control panel. As far as we are aware, Nvidia has not disabled this sort of functionality. But the writing is on the wall for PhysX, and we doubt there will be any future games that attempt to use the API.

DeepSeek's AI breakthrough bypasses industry-standard CUDA for some functions, uses Nvidia's assembly-like PTX programming instead

ashilov@gmail.com (Anton Shilov) — Tue, 28 Jan 2025 17:39:35 +0000

DeepSeek made quite a splash in the AI industry by training its Mixture-of-Experts (MoE) language model with 671 billion parameters using a cluster featuring 2,048 Nvidia H800 GPUs in about two months, showing 10X higher efficiency than AI industry leaders like Meta. The breakthrough was achieved by implementing tons of fine-grained optimizations and usage of Nvidia's assembly-like PTX (Parallel Thread Execution) programming instead of Nvidia's CUDA for some functions, according to an analysis from Mirae Asset Securities Korea cited by @Jukanlosreve.

What You Need to Know

(Image credit: DeepSeek)

Today: OpenAI boss Sam Altman calls DeepSeek 'impressive.' In 2023 he called competing nearly impossible.

Jan. 28, 2025: Investors panic: Nvidia stock loses $589B in value.

Dec. 27, 2024: DeepSeek is unveiled to the world.

Nvidia's PTX (Parallel Thread Execution) is an intermediate instruction set architecture designed by Nvidia for its GPUs. PTX sits between higher-level GPU programming languages (like CUDA C/C++ or other language frontends) and the low-level machine code (streaming assembly, or SASS). PTX is a close-to-metal ISA that exposes the GPU as a data-parallel computing device and, therefore, allows fine-grained optimizations, such as register allocation and thread/warp-level adjustments, something that CUDA C/C++ and other languages cannot enable. Once PTX is compiled into SASS, it is optimized for a specific generation of Nvidia GPUs.

For example, when training its V3 model, DeepSeek reconfigured Nvidia's H800 GPUs: out of 132 streaming multiprocessors, it allocated 20 for server-to-server communication, possibly for compressing and decompressing data to overcome connectivity limitations of the processor and speed up transactions. To maximize performance, DeepSeek also implemented advanced pipeline algorithms, possibly by making extra fine thread/warp-level adjustments.
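For a sense of scale, the split described above can be worked out in a few lines. This is only arithmetic on the two figures quoted in the article (132 SMs, 20 reserved); the function name `sm_split` is illustrative, and nothing here reflects how DeepSeek actually partitions the GPU.

```python
TOTAL_SMS = 132         # streaming multiprocessors on the H800, per the article
COMMUNICATION_SMS = 20  # SMs reportedly set aside for server-to-server communication


def sm_split(total, reserved):
    """Return (SMs left for compute, percentage of SMs that is reserved)."""
    compute = total - reserved
    reserved_share = reserved / total * 100
    return compute, reserved_share


compute_sms, comm_share = sm_split(TOTAL_SMS, COMMUNICATION_SMS)
print(f"{compute_sms} SMs left for compute, {comm_share:.1f}% reserved for communication")
```

In other words, roughly 15% of the chip's SMs were given over to communication, leaving 112 for the model's math.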

These modifications go far beyond standard CUDA-level development, but they are notoriously difficult to maintain. Therefore, this level of optimization reflects the exceptional skill of DeepSeek's engineers. The global GPU shortage, amplified by U.S. restrictions, has compelled companies like DeepSeek to adopt innovative solutions, and DeepSeek has made a breakthrough. However, it is unclear how much money DeepSeek had to invest in development to achieve its results.

The breakthrough disrupted the market as some investors believed that the need for high-performance hardware for new AI models would get lower, hurting the sales of companies like Nvidia. Industry veterans, such as Pat Gelsinger, ex-chief executive of Intel, believe that applications like AI can take advantage of all computing power they can access. As for DeepSeek's breakthrough, Gelsinger sees it as a way to add AI to a broad set of inexpensive devices in the mass market.

RTX 5090 prototype allegedly has 24,576 CUDA cores and 800W TDP — two 16-pin connectors present

editors@tomshardware.com (Jowi Morales) — Tue, 21 Jan 2025 16:59:01 +0000

The Nvidia GeForce RTX 5090, announced earlier this year and coming January 30, is poised to be one of the best graphics cards. However, hardware sleuth HXL has unearthed what appears to be an early prototype. Still, given the rumored specifications, it could very well become a GeForce RTX 5090 Ti or RTX Titan Blackwell.

The Chiphell forum user claims the graphics card is a GeForce RTX 5090 engineering card manufactured from July 15 to 21, 2024. Apparently, it's a prototype Nvidia's AIC partners used to design their custom offerings in the early days. The user didn't state how he got his hands on the prototype but was willing to pay good money if someone could supply him with the Nvidia GeForce 570.12 driver, which seems like the only driver that supports the prototype.

According to the specifications, the Blackwell graphics card utilizes fully functional GB202 silicon with 24,576 CUDA cores, compared to the RTX 5090’s 21,760 CUDA cores—13% more. The mysterious prototype also has slightly higher clock speeds (2,100 MHz base clock and 2,514 MHz boost clock) than the RTX 5090 (2,017 MHz base clock and 2,407 MHz boost clock).

The memory configuration reportedly remains unchanged between the two graphics cards, with 32GB of GDDR7 memory across a 512-bit memory interface. However, the prototype seemingly has 32 Gbps chips, pushing the memory bandwidth to 2 TB/s. For comparison, the RTX 5090 leverages 28 Gbps chips, capping the bandwidth at 1.79 TB/s.
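The bandwidth figures follow from the standard peak-bandwidth formula: per-pin data rate multiplied by bus width, divided by eight bits per byte. A minimal sketch using the numbers quoted above (the function name is illustrative):

```python
def memory_bandwidth_gbs(data_rate_gbps, bus_width_bits):
    """Peak memory bandwidth in GB/s from per-pin data rate (Gbit/s) and bus width (bits)."""
    return data_rate_gbps * bus_width_bits / 8  # 8 bits per byte


prototype = memory_bandwidth_gbs(32, 512)  # 2048.0 GB/s, about 2 TB/s
rtx_5090 = memory_bandwidth_gbs(28, 512)   # 1792.0 GB/s, about 1.79 TB/s
print(prototype, rtx_5090)
```

These are theoretical peaks; sustained bandwidth in real workloads is lower.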

Specifications	RTX 5090 prototype / RTX 5090 Ti / RTX Titan*	RTX 5090	RTX 4090 Ti*	RTX 4090	RTX 3090 Ti	RTX 3090
Architecture	Blackwell	Blackwell	Ada Lovelace	Ada Lovelace	Ampere	Ampere
GPU	GB202	GB202	AD102	AD102	GA102	GA102
CUDA Cores	24,576	21,760	18,176	16,384	10,752	10,496
Base Clock	2,100 MHz	2,017 MHz	?	2,235 MHz	1,560 MHz	1,395 MHz
Boost Clock	2,514 MHz	2,407 MHz	?	2,520 MHz	1,860 MHz	1,695 MHz
Memory Size	32 GB GDDR7	32 GB GDDR7	24 GB GDDR6X	24 GB GDDR6X	24 GB GDDR6X	24 GB GDDR6X
Memory Bandwidth	2 TB/s	1.79 TB/s	?	1.01 TB/s	1.01 TB/s	936.2 GB/s
TDP	800W	575W	600W	450W	450W	350W
Launch Date	?	06 Jan 2025	?	20 Sep 2022	27 Jan 2022	01 Sep 2020
Launch Price	?	$1,999	?	$1,599	$1,999	$1,499

*Specifications are unconfirmed.

However, the most eye-popping specification from the GeForce RTX 5090 prototype is the alleged 800W TDP, 39% higher than the RTX 5090’s. The owner stated that it utilizes two 12V-2x6 power connectors, which makes sense given the higher TDP. It’s doubtful that Nvidia (or anyone else, for that matter) will release a consumer-grade GPU that will demand more than 600W of power. The RTX 5090 is already pushing it with its 575W TDP.

Even with the RTX 40 series, there were rumors of an RTX 4090 Ti or RTX Titan Ada. An RTX 4090 prototype has been discovered and disassembled, sporting a massive blow-through heat sink and unique PCB configuration. There has been talk that it could have been a testbed to see how Nvidia could cool an RTX 4090 Ti, but that hasn’t been proven. While it is indeed possible that the company is thinking about producing an RTX 5090 Ti, it could also be just an experiment, and the company is waiting to see if there will be enough demand for a more powerful GPU than the RTX 5090 before going forward with production.

If this prototype hits the retail market, it will be the most powerful consumer-grade GPU Nvidia releases for the RTX 50-series generation and will likely be priced accordingly. The RTX 5090 is already very expensive, at $1,999 MSRP, so something better will likely start at $2,500.

The geopolitical tension between the U.S. and China means Nvidia cannot sell its most powerful GPUs in the Chinese market. This will likely dampen Nvidia’s hopes for an RTX 5090 Ti or RTX Titan Blackwell, if there is one, as China is one of its biggest markets for these expensive GPUs. These rumored specifications are just that; unless we see a massive demand for a more powerful GPU, we won’t likely see something better than an RTX 5090.

RTX 5090 exhibits 27% higher CUDA performance than RTX 4090 — exceeds 500K points in Geekbench

editors@tomshardware.com (Hassam Nasir) — Fri, 17 Jan 2025 16:06:08 +0000

Nvidia's RTX 50 series will debut with the RTX 5090 and RTX 5080 starting January 30. While official reviews, including ours, are under embargo, a potential leak has surfaced that suggests a modest improvement of just under 30% over the last generation. The RTX 5090 has reportedly been benchmarked in Geekbench 5 (via Benchleaks) using the CUDA API, likely by a reviewer who inadvertently made the test results public. As always, spread some salt over this leak, even though the performance claims are similar to what Nvidia has depicted in its slides.

The test bench features AMD's Zen 4-based Ryzen 9 7900X with 12 cores / 24 threads and an Asus ProArt X870E Creator WiFi motherboard. The system has 32GB of DDR5-6000 memory and uses Windows 11 Pro as the Operating System for this test. The benchmark was carried out using the CUDA API, the results of which are hard to come by as Geekbench does not publicly maintain a database for CUDA benchmarks.

We've gathered a few RTX 40 numbers to give you a general performance overview; however, your results might differ slightly. Remember that Geekbench is just a synthetic benchmark and may not accurately reflect how this GPU performs in real-world scenarios.

The RTX 5090 amasses 542,157 points in the CUDA API, landing a solid 27% lead over its predecessor, the RTX 4090. This isn't the massive generation-on-generation leap we're used to seeing, especially since the RTX 5090 has 32% more CUDA cores than the RTX 4090. Then you have the apparent elephant in the room: the pricing. Nvidia's $1,999 price tag for the RTX 5090 makes it roughly 25% more expensive than the RTX 4090, assuming you got one at MSRP. Again, this is leaked information, so almost everything you see here is subject to change.

GPU	CUDA Cores	VRAM	Memory Type	Memory Speed	Bus Width	Bandwidth	CUDA Score	% vs RTX 5090
RTX 5090	21,760	32GB	GDDR7	28 Gbps	512-bit	1.79 TB/s	542,157	100.00%
RTX 4090	16,384	24GB	GDDR6X	21 Gbps	384-bit	1.01 TB/s	424,332	78.27%
RTX 4080	9,728	16GB	GDDR6X	22.4 Gbps	256-bit	0.71 TB/s	300,728	55.47%
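The percentages quoted for the RTX 5090 against the RTX 4090 can be re-derived from the table and the two launch prices. The helper below is a simple illustration, not part of any benchmark tooling; note that the article's 27% and 32% figures appear to be rounded down.

```python
def percent_more(new, old):
    """How much larger `new` is than `old`, as a percentage."""
    return (new / old - 1) * 100


score_lead = percent_more(542_157, 424_332)   # Geekbench CUDA score, about 27.8%
core_increase = percent_more(21_760, 16_384)  # CUDA cores, about 32.8%
price_increase = percent_more(1_999, 1_599)   # launch MSRP in USD, about 25.0%

print(f"score +{score_lead:.1f}%, cores +{core_increase:.1f}%, price +{price_increase:.1f}%")
```

Put together, the leaked score grows more slowly than the core count and only slightly faster than the price.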

The GB202, which powers the RTX 5090, is 744 mm² and has roughly 92 billion transistors. This equates to around 123 million transistors per mm² on the updated 4NP process, similar to the 4N used on Ada Lovelace. The L1 and L2 cache sizes per SM also show slight improvement, compensated by the faster GDDR7 memory. Nvidia claims a 2x increase in performance over the RTX 40 series, suggesting that an RTX 5070 equals an RTX 4090, but this requires enabling Multi Frame Generation.
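The density figure is a straightforward division of the two numbers above. A quick sketch, with an illustrative function name and the article's rounded inputs:

```python
def transistor_density_millions_per_mm2(transistors, die_area_mm2):
    """Average transistor density in millions of transistors per square millimetre."""
    return transistors / die_area_mm2 / 1e6


density = transistor_density_millions_per_mm2(92e9, 744)
print(f"{density:.1f} million transistors per mm^2")  # about 123.7
```

This is an average over the whole die; logic, cache, and I/O regions differ considerably in actual density.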

Conversely, the RX 9070 XT from AMD has been rumored to match the RTX 4080 Super in raster performance. All that power has been crammed into a near-390 mm² Navi 48 chip, expected to be priced around $500. AMD has been tight-lipped about RDNA 4, though we might hear more from them in February when the mid-range RTX 5070 hits shelves and AMD finalizes its pricing strategy.

RTX Titan Ada prototype allegedly surfaces with 18,432 CUDA cores and 48GB VRAM — GPU-Z screenshot shows a full AD102 GPU die

Aaron Klotz — Mon, 13 Jan 2025 19:35:05 +0000

Specifications of Nvidia's unreleased RTX Titan Ada GPU have allegedly surfaced on Reddit. A GPU-Z screenshot and photograph shared by FluxRBLX on the Nvidia subreddit reveals the rumored but never shipped RTX Titan Ada GPU specifications featuring a fully enabled AD102 GPU and a whopping 48GB of VRAM.

The GPU-Z screenshot reveals many details on the purported Titan Ada GPU prototype, including core counts, memory configuration, device ID, and more. The GPU would have had 18,432 shaders (CUDA cores), 192 ROPs, 576 TMUs, a pixel fillrate of 478.1 GPixel/s, and a Texture Fillrate of 1,434.2 GTexel/s. The memory subsystem has 48GB of capacity (48GiB if we're being precise), featuring GDDR6 (non-x) ICs on a 384-bit wide memory interface with 864 GB/s of memory bandwidth.

Base clock speeds are significantly lower than any outgoing RTX 40 series (Ada Lovelace) GPU, with GPU-Z reporting a clock speed of just 735 MHz. However, boost clocks look far more conventional, rated at 2,490 MHz. The abnormally low base clocks are likely a byproduct of the early nature of the hardware, as this card was purportedly a prototype. That could also explain the use of GDDR6 instead of GDDR6X.
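The fillrate and bandwidth figures in the screenshot are consistent with one another, which can be checked with a few lines of Python. This is a minimal sketch that assumes fillrates are computed as unit count times boost clock and that bandwidth equals bus width times per-pin data rate divided by eight; the variable names are ours.

```python
# Figures reported in the GPU-Z screenshot.
rops = 192
tmus = 576
boost_clock_ghz = 2.490
bus_width_bits = 384
bandwidth_gb_s = 864

# Assumption: fillrate = number of units x boost clock.
pixel_fillrate = rops * boost_clock_ghz      # GPixel/s
texture_fillrate = tmus * boost_clock_ghz    # GTexel/s

# Assumption: bandwidth (GB/s) = bus width (bits) x data rate (Gbps) / 8,
# so the per-pin data rate can be recovered from the other two values.
data_rate_gbps = bandwidth_gb_s * 8 / bus_width_bits

print(f"Pixel fillrate:   {pixel_fillrate:.1f} GPixel/s")
print(f"Texture fillrate: {texture_fillrate:.1f} GTexel/s")
print(f"Memory data rate: {data_rate_gbps:.1f} Gbps")
```

Under those assumptions the fillrates match the reported 478.1 GPixel/s and 1,434.2 GTexel/s, and the memory works out to 18 Gbps per pin, slower than the 21 Gbps GDDR6X on the RTX 4090.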

Compared to the RTX 4090, the RTX Titan Ada outclasses it in shader count and memory capacity. The Titan Ada features a fully enabled AD102 die, which would have made it the only RTX-branded GPU in the 40-series family to have a fully unlocked die. The RTX 4090 has access to 89% of the AD102 die.

Memory capacity is also doubled on the Titan Ada GPU, inevitably due to using a "clamshell" configuration with the GDDR6 modules on both sides of the PCB, similar to the RTX 3090 or RTX 6000 Ada. GDDR6 manufacturers don't make GDDR6 memory chips with a capacity greater than 2GB, making this configuration the only option to achieve 48GB on the Titan-class GPU.
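The clamshell arithmetic can be sketched in Python as follows. It assumes each GDDR6 chip occupies a 32-bit slice of the memory bus, which is not stated in the article, and the constant names are ours.

```python
BUS_WIDTH_BITS = 384
BITS_PER_CHIP = 32       # assumption: one GDDR6 chip per 32-bit channel
CHIP_CAPACITY_GB = 2     # largest GDDR6 chip, per the article

# One chip per 32-bit channel on a single side of the PCB.
chips_single_sided = BUS_WIDTH_BITS // BITS_PER_CHIP               # 12
capacity_single_sided = chips_single_sided * CHIP_CAPACITY_GB      # 24 GB

# Clamshell: two chips share each channel, one on each side of the PCB.
chips_clamshell = 2 * chips_single_sided                           # 24
capacity_clamshell = chips_clamshell * CHIP_CAPACITY_GB            # 48 GB

print(chips_single_sided, capacity_single_sided)
print(chips_clamshell, capacity_clamshell)
```

With 2GB chips, a single-sided 384-bit board tops out at 24GB, so doubling the chip count is the only route to 48GB.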

(Image credit: Reddit/FluxRBLX)

One area where the RTX 4090 outperforms the Titan Ada is in memory bandwidth, thanks entirely to its GDDR6X memory. The Titan Ada prototype had slower GDDR6 memory modules, which reduces its bandwidth potential compared to the RTX 4090. Nvidia either didn't plan to use the speedier GDDR6X modules, or perhaps it never got that far. Cooling may have been a concern due to the clamshell layout of the memory chips; the RTX 3090 shipped in the same configuration with GDDR6X modules, and it suffered from memory temperature issues.

The Reddit poster also shared a PCB shot of the supposed RTX Titan Ada GPU. Assuming the image is legitimate, the PCB looks virtually identical to equivalent RTX 4090 PCBs. The giant AD102 die is in the middle, flanked by 12 of 24 memory ICs. The GPU and memory power delivery components line the right and left sides of the PCB. The PCB pictured is likely a reference design, as Nvidia doesn't normally allow non-reference Titan cards (even if it doesn't have "Founders Edition" branding).

Since this GPU was never released, Nvidia hasn't explained why it never brought the RTX Titan Ada to the market. However, Nvidia likely canceled the product due to internal competition that would have arisen between it and workstation-class GPUs, such as the RTX 6000 Ada that sells for $6,800. Furthermore, AMD didn't have an answer for the RTX 4090, so the RTX Titan Ada might have been overkill, at least for the average gamer.

Nvidia's RTX 5070 Ti and RTX 5070 allegedly sport 16GB and 12GB of GDDR7 memory, respectively — Up to 8960 CUDA cores, 256-bit memory bus, and 300W TDP

editors@tomshardware.com (Hassam Nasir) — Wed, 25 Dec 2024 13:01:18 +0000

Renowned and avid leaker Kopite has detailed Nvidia's RTX 5070 family, and the overall bump in specs over Ada Lovelace is a mixed bag, at least on paper. The leaker has a proven track record, having previously leaked the RTX 5090 and RTX 5080 specifications. The RTX 5070 family comprises the RTX 5070 Ti and the base RTX 5070. Rumor has it that Nvidia will debut Blackwell with the RTX 5090, RTX 5080, and the RTX 5070 family at CES next month.

According to the leaked data from Kopite, the RTX 5070 Ti sports the GB203-300-A1 die, similar to the RTX 5080, and has 8960 CUDA cores or 70 SMs; 16% more than the RTX 4070 Ti. Over a 256-bit interface, the RTX 5070 Ti gets 16GB of GDDR7 VRAM, rumored to run at 28 Gbps for a total bandwidth of 896 GB/s. This puts the RTX 5070 Ti quite close to the RTX 5080 despite the 20% delta in core counts.

For context, the RTX 4080 and 4070 Ti had a noticeable spec gap, pushing Nvidia to price them almost $400 apart. It is reasonable to expect that this delta will not be as large with Blackwell, but the ball is in Nvidia's court. The RTX 5070 Ti is expected to chug 300W of power, 15W more than its predecessor.

Moving on, the RTX 5070 is allegedly powered by the GB205-300-A1 die. This is a step down from the RTX 4070, which features AD104-250, an XX104-class GPU that sits a tier above XX105/205-class GPUs. The smaller die leaves the RTX 5070 with a significantly reduced count of 6144 CUDA cores, though that's still 4% more than the RTX 4070. That aside, it offers 12GB of GDDR7 memory across a 192-bit interface for 672 GB/s of bandwidth. The TDP is higher at 250W, 25% more than the RTX 4070.
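The bandwidth figures for both cards follow from bus width and memory speed. Here is a minimal Python sketch, assuming both cards use the rumored 28 Gbps GDDR7; the helper name `memory_bandwidth_gb_s` is ours.

```python
def memory_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bus width x per-pin data rate / 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8

# Rumored configurations from the leak (28 Gbps GDDR7 on both cards is an assumption).
rtx_5070_ti = memory_bandwidth_gb_s(256, 28)
rtx_5070 = memory_bandwidth_gb_s(192, 28)

print(f"RTX 5070 Ti: {rtx_5070_ti:.0f} GB/s")
print(f"RTX 5070:    {rtx_5070:.0f} GB/s")
```

This reproduces the 896 GB/s and 672 GB/s quoted in the leak.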

GPU Name	RTX 5070 Ti	RTX 4070 Ti	RTX 5070	RTX 4070
Die	GB203-300-A1	AD104-400-A1	GB205-300-A1	AD104-250-A1
CUDA Cores	8960	7680	6144	5888
Bus Width	256-bit	192-bit	192-bit	192-bit
Memory	16GB	12GB	12GB	12GB
TDP	300W	285W	250W	200W

The RTX 5080's substantial reduction in specs as compared to the RTX 5090 has set an underwhelming tone for the remaining Blackwell lineup. While the RTX 4090 had 68% more cores than the RTX 4080, this disparity has increased to 102% generation-over-generation, at least according to the unconfirmed leaks.
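The widening gap between the flagship and the x80 card can be checked from core counts alone. The sketch below uses the 21,760 and 10,752 CUDA core figures reported for the RTX 5090 and RTX 5080, along with the 16,384 and 9,728 cores of the RTX 4090 and RTX 4080; the helper name is ours.

```python
def percent_more(a: float, b: float) -> float:
    """How much larger a is than b, in percent."""
    return (a / b - 1) * 100

# CUDA core counts (the Blackwell figures are from unconfirmed leaks).
ada_gap = percent_more(16_384, 9_728)          # RTX 4090 vs RTX 4080
blackwell_gap = percent_more(21_760, 10_752)   # RTX 5090 vs RTX 5080

print(f"RTX 4090 over RTX 4080: {ada_gap:.0f}%")
print(f"RTX 5090 over RTX 5080: {blackwell_gap:.0f}%")
```

The two gaps come out at roughly 68% and 102%, so the rumored RTX 5090 has slightly more than twice the cores of the RTX 5080.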

Samsung is expected to initiate mass production of its GDDR7 24Gb (3GB) memory early next year. A potential RTX 50 Super refresh could employ these newer modules for 50% higher VRAM capacities, but that's speculation. Theoretically speaking, Nvidia could announce a 48GB RTX 5090 SUPER, but it doesn't have to since both Intel and AMD have dropped out of the high-end market. Users demanding higher VRAM capacities, likely for AI, will have to pay through the nose and opt for Nvidia's Blackwell data center accelerators or future Blackwell workstation GPUs.
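To see where the 48GB figure comes from, here is a rough Python sketch. It assumes one GDDR7 chip per 32-bit slice of the RTX 5090's 512-bit bus, a detail the article does not state, and the function name is hypothetical.

```python
def vram_capacity_gb(bus_width_bits: int, chip_gb: int, bits_per_chip: int = 32) -> int:
    """Total VRAM for a single-sided layout with one chip per memory channel."""
    chips = bus_width_bits // bits_per_chip
    return chips * chip_gb

# RTX 5090: 512-bit bus (assumption: 32 bits per GDDR7 chip, so 16 chips).
current = vram_capacity_gb(512, chip_gb=2)   # 2GB (16Gb) modules
refresh = vram_capacity_gb(512, chip_gb=3)   # 3GB (24Gb) modules

increase_pct = (refresh / current - 1) * 100
print(current, refresh, f"{increase_pct:.0f}%")
```

Swapping 2GB modules for 3GB ones lifts the card from 32GB to 48GB, the 50% increase mentioned above, without touching the bus width.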

RTX 5070 Ti rumor points to 8,960 CUDA cores and 300W TDP — Blackwell GPU may use the same GB203 die as the RTX 5080

editors@tomshardware.com (Aaron Klotz) — Thu, 21 Nov 2024 16:56:15 +0000

Consumer desktop GeForce RTX 50-series (Blackwell) GPU rumors are again in full swing, surrounding the mid-range(ish) RTX 5070 Ti. Resident GPU leaker Kopite7kimi reports (through VideoCardz) that the RTX 5070 Ti, rivaling the best graphics cards, will purportedly come with 8,960 CUDA cores and a 300W TDP.

The GPU will purportedly use the GB203 Blackwell GPU die, the same GPU die as the RTX 5080. As a result, the reference board design will also be the same between the 5070 Ti and 5080 since both share the same physical die model.

Memory specs were not shared. However, Nvidia could do a 256-bit interface with 2GB chips vs 3GB on the RTX 5080 to differentiate. Therefore, the RTX 5070 Ti could come with 16GB instead of the rumored 24GB on the RTX 5080.

The RTX 5070 Ti's 8,960 CUDA core count represents a noteworthy upgrade over RTX 40-series counterparts. Compared to its direct predecessor, the RTX 4070 Ti, the RTX 5070 Ti has 16% more GPU cores. The gain is smaller against the RTX 4070 Ti Super, though, as the RTX 5070 Ti has only 6% more cores.

	RTX 5070 Ti	RTX 5080	RTX 4070 Ti	RTX 4080
GPU Die Model	GB203	GB203	AD104	AD103
CUDA Cores	8,960	10,752	7,680	9,728
Power Draw	300W	400W	285W	320W

The power draw is also a modest increase from the RTX 40 series. The RTX 5070 Ti allegedly has a 5% greater power target than the RTX 4070 Ti and its Super-refreshed variant. However, it is worth mentioning that it is unknown whether the 300W metric references TDP or TBP.

Compared to the RTX 5080 specs Kopite shared earlier, the RTX 5070 Ti represents an enormous downgrade in specs. The RTX 5080 has 20% more CUDA cores (10,752) and, as a result, could pull 33% more power (400W TBP).
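The percentages quoted for the RTX 5070 Ti can be reproduced from the table above. This is a small Python sketch using only those rumored figures; the helper name is ours.

```python
def percent_more(a: float, b: float) -> float:
    """How much larger a is than b, in percent."""
    return (a / b - 1) * 100

# Rumored and known specs taken from the table above.
cores = {"RTX 5070 Ti": 8_960, "RTX 5080": 10_752, "RTX 4070 Ti": 7_680}
power_w = {"RTX 5070 Ti": 300, "RTX 5080": 400, "RTX 4070 Ti": 285}

cores_vs_4070_ti = percent_more(cores["RTX 5070 Ti"], cores["RTX 4070 Ti"])   # about 16.7
power_vs_4070_ti = percent_more(power_w["RTX 5070 Ti"], power_w["RTX 4070 Ti"])  # about 5.3
cores_5080_vs_ti = percent_more(cores["RTX 5080"], cores["RTX 5070 Ti"])      # 20
power_5080_vs_ti = percent_more(power_w["RTX 5080"], power_w["RTX 5070 Ti"])  # about 33.3

for label, value in [
    ("5070 Ti cores vs 4070 Ti", cores_vs_4070_ti),
    ("5070 Ti power vs 4070 Ti", power_vs_4070_ti),
    ("5080 cores vs 5070 Ti", cores_5080_vs_ti),
    ("5080 power vs 5070 Ti", power_5080_vs_ti),
]:
    print(f"{label}: +{value:.1f}%")
```

The core-count gain over the RTX 4070 Ti is closer to 16.7%, which the article rounds down to 16%.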

The RTX 5070 Ti is configured differently from its RTX 40-series counterpart. The 5070 Ti is reportedly equipped with the same die as the RTX 5080 but sits further behind it in raw specs. By contrast, the RTX 4070 Ti family was closer to the RTX 4080 on the spec sheet, yet those GPUs used different die models (AD104 for the RTX 4070 Ti, AD103 for the RTX 4080). The RTX 4080 features 15% more cores than the 4070 Ti Super (27% more than the original 4070 Ti) and consumes just 35W more power.

Nvidia is expected to announce its next-generation GeForce Blackwell gaming graphics cards at CES 2025. According to the rumors, the lineup may comprise the RTX 5090, RTX 5080, and RTX 5070.

CUDA-beating ZLUDA breathes new life with financial backing from unknown party — pivots to AI workloads across multiple GPU vendors

ashilov@gmail.com (Anton Shilov) — Fri, 04 Oct 2024 18:54:37 +0000

ZLUDA, an open-source CUDA translation layer, has lived two quite vivid lives with Intel and then AMD GPUs. It was nearly killed in August when AMD asked to take down the code developed using its funds. However, as its developer, Andrzej Janik, secured funding from a mysterious sponsor, ZLUDA now has a third life. This time around, the focus of ZLUDA will be to run AI/ML software designed for CUDA GPUs on processors from other vendors using a translation layer, reports Phoronix.

ZLUDA was originally designed to run creative professional CUDA-based applications on Intel and then AMD GPUs, while the upcoming iteration of ZLUDA shifts focus to accommodate AI and machine learning workloads. Also, the emphasis is now not just on Intel or AMD. Instead, it offers multiple GPU vendor support, making ZLUDA applicable across different GPU architectures. Nonetheless, for the time being, most development efforts are concentrated on AMD GPUs, particularly RDNA1 and newer architectures. Support is being built around AMD’s ROCm 6.1+ compute stack, laying the foundation for broader, multi-architecture compatibility in the future.

Andrzej Janik is currently working to make AI/ML frameworks like PyTorch, TensorFlow, and Llama.cpp function seamlessly using CUDA on non-Nvidia GPUs using his translation layer, according to Phoronix, who spoke to the developer. Janik predicts it will take about a year to develop the new ZLUDA code to a stable state where it can effectively handle AI/ML workloads across multiple GPUs. Contributions from the open-source community will be welcomed as the project evolves. So, ZLUDA will remain open source, or at least it looks so today.

Although ZLUDA now has a financial backer, the sponsor has chosen to remain anonymous for now. We can only speculate about who the sponsor is, but it presumably needs to run AI workloads at scale and wants multi-vendor GPU support. Also, we presume it is big enough not to be afraid of getting into a conflict over running CUDA software through a translation layer, which Nvidia does not endorse these days. Yet, the developer says that this ‘stealth’ sponsor is expected to be revealed later, providing more insight into the direction and future support of ZLUDA.

AMD announces unified UDNA GPU architecture — bringing RDNA and CDNA together to take on Nvidia's CUDA ecosystem

palcorn@outlook.com (Paul Alcorn) — Mon, 09 Sep 2024 13:57:26 +0000

Here in Berlin, Germany, at IFA 2024, AMD's Jack Huynh, the senior vice president and general manager of the Computing and Graphics Business Group, announced that the company will unify its consumer-focused RDNA and data center-focused CDNA architectures into one microarchitecture, named UDNA, that will set the stage for the company to tackle Nvidia's entrenched CUDA ecosystem more effectively. The announcement comes as AMD has decided to deprioritize high-end gaming graphics cards to accelerate market share gains.

When AMD moved on from its GCN microarchitecture back in 2019, the company decided to split its new graphics microarchitecture into two different designs, with RDNA designed to power gaming graphics products for the consumer market while the CDNA architecture was designed specifically to cater to compute-centric AI and HPC workloads in the data center.

Huynh explained the reasoning behind the split in a Q&A session with the press and the rationale for moving forward with a new unified design. We also followed up for more details about the forthcoming architecture. Here's a lightly edited transcript of the conversations:

Jack Huynh [JH], AMD: So, part of a big change at AMD is today we have a CDNA architecture for our Instinct data center GPUs and RDNA for the consumer stuff. It’s forked. Going forward, we will call it UDNA. There'll be one unified architecture, both Instinct and client [consumer]. We'll unify it so that it will be so much easier for developers versus today, where they have to choose and value is not improving.

We forked it because then you get the sub-optimizations and the micro-optimizations, but then it's very difficult for these developers, especially as we're growing our data center business, so now we need to unify it. That's been a part of it. Because remember what I said earlier? I'm thinking about millions of developers; that’s where we want to get to. Step one is to get to the hundreds, thousands, tens of thousands, hundreds of thousands, and hopefully, one day, millions. That's what I'm telling the team right now. It’s that scale we have to build now.

Tom's Hardware, Paul Alcorn [PA]: So, with UDNA bringing those architectures back together, will all of that still be backward compatible with the RDNA and the CDNA split?

JH: So, one of the things we want to do is ...we made some mistakes with the RDNA side; each time we change the memory hierarchy, the subsystem, it has to reset the matrix on the optimizations. I don't want to do that.

So, going forward, we’re thinking about not just RDNA 5, RDNA 6, RDNA 7, but UDNA 6 and UDNA 7. We plan the next three generations because once we get the optimizations, I don't want to have to change the memory hierarchy, and then we lose a lot of optimizations. So, we're kind of forcing that issue about full forward and backward compatibility. We do that on Xbox today; it’s very doable but requires advanced planning. It’s a lot more work to do, but that’s the direction we’re going.

PA: When you bring this back to a unified architecture, this means, just to be clear, a desktop GPU would have the same architecture as an MI300X equivalent in the future? Correct?

JH: It's a cloud-to-client strategy. And I think it will allow us to be very efficient, too. So, instead of having two teams do it, you have one team. It’s not doing something that's that crazy, right? We forked it because we wanted to micro-optimize in the near term, but now that we have scale, we have to unify back, and I believe it's the right approach. There might be some little bumps.

PA: So, this merging back together, how long will that take? How many more product generations before we see that?

JH: We haven’t disclosed that yet. It’s a strategy. Strategy is very important to me. I think it’s the right strategy. We’ve got to make sure we’re doing the right thing. In fact, when we talk to developers, they love it because, again, they have all these other departments telling them to do different things, too. So, I need to reduce the complexity.

[...]From the developer's standpoint, they love this strategy. They actually wish we did it sooner, but I can't change the engine when a plane’s in the air. I have to find the right way to setpoint that so I don’t break things.

[End of Huynh's comments]

Yes, high-end silicon can build markets, but ultimately, software support tends to define the winners and losers. Nvidia has taught a master class in how to build a seemingly impenetrable moat with its unparalleled proprietary CUDA ecosystem.

Nvidia began laying the foundation of its empire when it started with CUDA eighteen long years ago, and perhaps one of its most fundamental advantages is signified by the 'U' in CUDA, the Compute Unified Device Architecture. Nvidia has but one CUDA platform for all uses, and it leverages the same underlying microarchitectures for AI, HPC, and gaming.

Huynh told me that CUDA has four million developers, and his goal is to pave the path for AMD to see similar success. That's a tall order. AMD continues to rely on the open source ROCm software stack to counter Nvidia, but that requires buy-in from both users and the open source community that will shoulder some of the burden of optimizing the stack. Anything AMD can do to simplify that work, even if it comes at the cost of some micro-optimizations for certain types of applications/games, will help accelerate that ecosystem.

AMD has taken its fair share of criticism for the often scattered efficacy of the ROCm stack. When it bought Xilinx in 2022, AMD even announced that it would put Victor Peng, the then-CEO of Xilinx, in charge of a unified ROCm team to bring the project under tighter control (Peng recently retired). That effort has yielded at least some fruit, but AMD continues to receive criticism for the state of its ROCm stack — it's clear the company has plenty of work ahead to fully put itself in a position to take on Nvidia's CUDA.

The company also remains focused on ROCm despite the emergence of the UXL Foundation, an open software ecosystem for accelerators that is getting broad support from other players in the industry, like Qualcomm, Samsung, Arm, and Intel.

What precisely will UDNA change compared to the current RDNA and CDNA split? Huynh didn't go into a lot of detail, and obviously there's still plenty of groundwork to be laid. But one clear potential pain point has been the lack of dedicated AI acceleration units in RDNA. Nvidia brought tensor cores to the entire RTX line starting in 2018. AMD only has limited AI acceleration in RDNA 3, basically accessing the FP16 units in a more optimized fashion via WMMA instructions, while RDNA 2 depends purely on the GPU shaders for such work.

Our assumption is that, at some point, AMD will bring full stack support for tensor operations to its GPUs with UDNA. CDNA has had such functional units since 2020, with increased throughput and number format support being added with CDNA 2 (2021) and CDNA 3 (2023). Given the preponderance of AI work being done on both data center and client GPUs these days, adding tensor support to client GPUs seems like a critical need.

The unified UDNA architecture is a good next logical step on the journey to competing with CUDA, but AMD has a mountain to climb. Huynh wouldn't commit to a release date for the new architecture, but given the billions of dollars at stake in the AI market, it's obviously going to be a top priority to execute the new microarchitectural strategy. Still, with what we've heard about AMD RDNA 4, it appears UDNA is at least one more generation away.

AMD asks developer to take down open source ZLUDA, dev vows to rebuild his project

ashilov@gmail.com (Anton Shilov) — Thu, 08 Aug 2024 19:42:15 +0000

Earlier this year, AMD quietly stopped funding ZLUDA, an open-source CUDA translation layer project that allowed programs originally compiled for Nvidia CUDA GPUs to run on AMD Radeon processors supported by the ROCm software stack. But Nvidia recently banned the use of translation layers with CUDA-based software, which could potentially cause legal troubles for AMD, so the company has now asked Andrzej Janik, the developer behind ZLUDA, to take the code down, reports Phoronix.

"The code that was previously here has been taken down at AMD's request," the developer wrote at the project's GitHub page. "The code was released with AMD's approval through an email. AMD's legal department now says it's not legally binding, hence the rollback. Before anyone asks: I have received no legal threats or any communication from Nvidia."

Andrzej Janik, the developer behind ZLUDA, initially created the project for Intel GPUs using the Level Zero software stack. After receiving support from AMD, Janik successfully modified ZLUDA to work on AMD GPUs, enabling various CUDA applications to run smoothly.

The agreement between Janik and AMD allowed for the code to be made open source if the contract ended. In February, following the termination of AMD's funding, the ZLUDA code was released publicly. However, AMD's legal team has now requested its removal from the GitHub repository, claiming the release was not legally binding. This development was surprising, given the project's potential to support CUDA on Radeon hardware, a benefit for AMD.

Despite the setback, Janik expressed his intention to rebuild ZLUDA from its pre-AMD codebase. The rebuilt version will have a different scope and will not include certain features, such as planned support for Nvidia GameWorks.

Janik is currently working on securing new funding for the project and is considering various directions for its future. It remains uncertain whether the new ZLUDA will focus on Intel GPUs, as initially planned, or adopt a new design for AMD GPUs.

"At this point, one more hostile corporation does not make much difference," Janik wrote. "I plan to rebuild ZLUDA starting from the pre-AMD codebase. Funding for the project is coming along and I hope to be able to share the details in the coming weeks. It will have a different scope and certain features will not come back. I wanted it to be a surprise, but one of those features was support for NVIDIA GameWorks. I got it working in Batman: Arkham Knight, but I never finished it, and now that code will never see the light of the day."

While the official ZLUDA code has been removed from GitHub, it likely still exists in cloned repositories, Phoronix suggests.

Nvidia RTX 3050 A Laptop GPU specs revealed and it's as weak as expected — comes with just 1,792 CUDA cores and 4GB VRAM on a 64-bit bus

Jeff Butts — Fri, 26 Jul 2024 18:30:20 +0000

The GeForce RTX 3050 A recently popped up in Nvidia’s GeForce drivers. We could only guess at the specs initially, but now Nvidia has confirmed the GeForce RTX 3050 A’s existence as well as some key specs. We reached out to Nvidia and were informed that the graphics card leverages down-binned AD106 (Ada Lovelace) silicon and represents the third iteration of the RTX 3050 Laptop GPU, already a questionable choice for mobile use considering the lack of VRAM.

The new RTX 3050 A, where the “A” probably denotes Ada, only uses a fraction of the potential offered by the AD106 die. According to Nvidia, the GPU will feature just 4GB of GDDR6 VRAM on a 64-bit bus along with 1,792 CUDA cores. Power consumption is described as 35-50 watts TGP, a range that tops out lower than that of its 3050 mobile siblings. Many of the available specs for the RTX 3050 A also appear to be lower than the original variant, though there's a catch: We don't have clock speeds for the new part, and Ada tends to clock significantly higher than Ampere.

It’s interesting to see Nvidia switch from the Ampere to the Ada Lovelace for the new RTX 3050 A variant, but it’s a move we’ve seen in the past. For example, a recent RTX 4070 desktop graphics card variant saw Nvidia using down-binned AD103 chips instead of the smaller AD104 GPU. There's also an RTX 2050 laptop chip that uses an Ampere GA107 rather than the expected Turing GPU.

GeForce RTX 3050 A Specifications

	RTX 3050 A Laptop	RTX 3050 4GB	RTX 3050 6GB
Architecture	Ada Lovelace	Ampere	Ampere
GPU	AD106	GA107	GA107
CUDA Cores	1,792	2,048	2,560
Memory Bus Width	64-bit	128-bit	96-bit
Memory Type	GDDR6	GDDR6	GDDR6
Max. Amount of Memory	4GB	4GB	6GB
Power Consumption	45 Watt (35–50 Watt TGP)	60 Watt (35–80 Watt TGP)	60 Watt (35–80 Watt TGP)
Process Technology	4nm	8nm	8nm
PCIe link	4.0 x8	4.0 x8	4.0 x8

The assumption is that Nvidia wants to use every possible piece of silicon by turning off non-functional portions of chips and down-binning them to lower tier parts. For the RTX 3050 A, Nvidia has apparently disabled 61% of the available SMs (Streaming Multiprocessors) and half of the memory controllers present in the Ada Lovelace AD106 silicon. Some of the disabled bits may have worked, but Nvidia effectively captures just about any potentially useful silicon this way, as long as a chip has 14 or more SMs and two or more memory controllers that are functional.
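The SM figures can be derived from the CUDA core count. The Python sketch below assumes 128 CUDA cores per SM on Ada Lovelace and 36 SMs in a full AD106 die; neither number appears in the article, so treat both as our assumptions.

```python
CORES_PER_SM_ADA = 128   # assumption: CUDA cores per Ada Lovelace SM
FULL_AD106_SMS = 36      # assumption: SM count of a fully enabled AD106

rtx_3050a_cores = 1_792  # from Nvidia's confirmed specs

enabled_sms = rtx_3050a_cores // CORES_PER_SM_ADA
disabled_sms = FULL_AD106_SMS - enabled_sms
disabled_pct = disabled_sms / FULL_AD106_SMS * 100

print(f"Enabled SMs:  {enabled_sms}")
print(f"Disabled SMs: {disabled_sms} ({disabled_pct:.0f}%)")
```

Under those assumptions, 14 SMs remain active and 22 are switched off, which is where the 61% figure comes from.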

On paper, then, we’ve got a new mobile GPU that might seem to be slightly less powerful than its RTX 3050 Laptop GPU siblings, but that's only part of the story. The GA107 die normally used in these chips measures 200 mm^2, while the AD106 die measures 188 mm^2. Along with the slightly smaller size, the Ada architecture tends to clock significantly higher than Ampere: desktop GPUs typically run in the 2.7 GHz range, compared to 1.9 GHz on Ampere.

We do know from our desktop GPU testing that Ada chips have significantly improved on overall efficiency (FPS per Watt), and there's no reason this down-binned RTX 3050 A won't show similar characteristics, especially if it's running at lower clocks. But we don't have details on clock speeds as yet, so we can't say for certain what level of performance it may offer. We reached out to Nvidia and were informed that DLSS 3 frame generation will not be supported, even though the OFA (Optical Flow Accelerator) in Ada would normally allow for it.

As noted earlier, it's a common practice for Nvidia and other GPU vendors to use defective chips in lower tier parts. Considering Nvidia doesn't really have a direct alternative to the RTX 3050 A in the 40-series, and since it has far more profitable projects underway, it doesn't make sense to use new wafers on low-end silicon. Instead, it repurposes old stock that may have been destined for the recycling bin. It’s a bit like when your bananas get bruised and your grandma uses them to make banana bread — a delicious and economical way to use what might have ended up in the trash.

We don’t yet know what laptops might use the RTX 3050 A, but it should allow for smaller, thinner designs thanks to Ada's improved efficiency and the chip's smaller size — with just two memory chips instead of the three or four required with the previous models. Just don't expect it to be an amazingly fast gaming solution, as the 3050 Laptop GPU line wasn't exactly a barn burner at launch.

New SCALE tool enables CUDA applications to run on AMD GPUs

ashilov@gmail.com (Anton Shilov) — Wed, 17 Jul 2024 11:11:14 +0000

Spectral Compute has introduced SCALE, a new toolchain that allows CUDA programs to run directly on AMD GPUs without modifications to the code, reports Phoronix. SCALE can automatically compile existing CUDA code for AMD GPUs, which greatly simplifies the transition of software originally developed for Nvidia hardware to other platforms without breaking any end user license agreements.

Spectral's SCALE is a toolkit, akin to Nvidia's CUDA Toolkit, designed to generate binaries for non-Nvidia GPUs when compiling CUDA code. It strives for source compatibility with CUDA, including support for CUDA-specific features like inline PTX assembly and nvcc's C++ implementation, while generating code compatible with AMD's ROCm 6. One of SCALE's significant advantages is its ability to act as a drop-in replacement for Nvidia's own nvcc compiler. Therefore, unlike other projects that translate CUDA code to another language or use other manual steps, SCALE directly compiles CUDA sources for AMD GPUs.

SCALE's implementation leverages some open-source LLVM components to create a solution that is both efficient and user-friendly. The software package aims to offer a more seamless and integrated solution than ZLUDA, a translation layer of the kind that Nvidia's terms now prohibit. It even mimics the Nvidia CUDA Toolkit runtime, making it easier for developers to port their existing CUDA programs to AMD hardware.

SCALE has undergone extensive testing with a variety of software, including Blender, Llama-cpp, XGBoost, FAISS, GOMC, STDGPU, Hashcat, and Nvidia Thrust, and has proven that it works stably and correctly. Testing has been conducted on RDNA 2 and RDNA 3 GPUs, with basic testing on RDNA 1 and ongoing development for Vega support. The developers did not have access to AMD's CDNA-based GPUs though.

The lack of support for CDNA-based processors is a disadvantage of SCALE because datacenter software designed using CUDA and for CUDA-compatible hardware dominates the rapidly growing AI space and many developers are interested in easily porting their programs to competing platforms, expanding their addressable market.

Funding for SCALE has been provided by Spectral Compute's consulting business since 2017, without financial backing from AMD. Although the program is not open source, there is a Free Edition License available and this one can be used for commercial applications.

Rare GeForce GTX 2070 engineering sample has surfaced — the unreleased GPU has 128 fewer CUDA cores than the RTX 2070

Zhiye Liu — Sat, 20 Apr 2024 17:27:32 +0000

The GeForce RTX 2070 used to be one of the best graphics cards. Many don't know that it was internally known as the GeForce GTX 2070 before Nvidia adopted the RTX branding for products with RT cores while keeping the GTX moniker for those lacking them. X user Jiacheng Liu has shared his experience with one of these rare engineering samples.

The GeForce GTX 2070 Founders Edition looks almost identical to the GeForce RTX 2070 Founders Edition. The main difference is that the former carries the GeForce GTX branding on the side of the graphics card, hinting that it's an early engineering sample before the transition to the RTX branding. The GeForce GTX 2070 uses the same TU106-400A-A1 silicon as the GeForce RTX 2070, but the CUDA core count differs.

The GeForce GTX 2070 sports 2,176 CUDA cores, 128 fewer than the GeForce RTX 2070. We're looking at two disabled SMs on the GeForce GTX 2070, which gives it the same core count as a GeForce RTX 2060 Super. The die is different, though, as the GeForce RTX 2060 Super employs the TU106-410-A1 silicon. Liu successfully flashed the GeForce GTX 2070's vBIOS with a 400A BIOS from a GeForce RTX 2070. Logically, it didn't unlock extra cores or anything of that sort. However, it raised the power level, allowing for manual overclocking headroom.
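The "two disabled SMs" figure follows from the core counts. Here is a minimal Python sketch that assumes 64 CUDA cores per Turing SM, a number the article does not state; the RTX 2070's 2,304 cores are implied by the 128-core difference.

```python
CORES_PER_SM_TURING = 64   # assumption: CUDA cores per Turing SM

gtx_2070_cores = 2_176
rtx_2070_cores = gtx_2070_cores + 128   # 2,304, per the 128-core gap

gtx_2070_sms = gtx_2070_cores // CORES_PER_SM_TURING
rtx_2070_sms = rtx_2070_cores // CORES_PER_SM_TURING
disabled_sms = rtx_2070_sms - gtx_2070_sms

print(gtx_2070_sms, rtx_2070_sms, disabled_sms)
```

That works out to 34 active SMs on the engineering sample against 36 on the retail card.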

(Image credit: X/Jiacheng Liu)

Liu pushed the GeForce GTX 2070 to the edge and achieved a performance of around 95% of that of a GeForce RTX 2070. The overclock represents a 16% uplift over the stock performance. In reality, 128 CUDA cores aren't a lot, so it was surprising that the enthusiast only got the GeForce GTX 2070 within a hair of the GeForce RTX 2070.

The GeForce GTX 2070 leverages the same Founders Edition cooler as the GeForce RTX 2070. It obviously didn't carry the RTX 2070 branding yet. However, the single 8-pin PCIe power connector and the mix of a DVI port, one HDMI 2.0 port, two DisplayPort 1.4a outputs, and the USB Type-C port are still present even on this prototype.

The emergence of the GeForce GTX 2070 engineering sample suggests that Nvidia originally planned to outfit the GeForce RTX 2070 with fewer CUDA cores. Although Nvidia didn't go through with it, the chipmaker didn't throw the idea in the trash since the CUDA configuration later ended up in the GeForce RTX 2060 Super. Turing is in the past, but it's always intriguing to see the early prototypes and how they fare with the retail products.

'Enhanced' Nvidia A100 GPUs appear in China's second-hand market — new cards surpass sanctioned counterparts with 7,936 CUDA cores and 96GB HBM2 memory

Zhiye Liu — Tue, 16 Apr 2024 18:39:27 +0000

Nvidia's Ampere A100 was previously one of the top AI accelerators, before being dethroned by the newer Hopper H100 — not to mention the H200 and upcoming Blackwell GB200. It looks like the chipmaker may have experimented with an enhanced version that never hit the market, or perhaps companies have clandestinely modified the A100 to make it even faster in the wake of U.S. sanctions against China. X user Jiacheng Liu recently discovered various A100 prototypes in the Chinese second-hand market that flaunt substantially higher specifications than Nvidia's 'regular' A100.

Despite the beefed-up attributes, the A100 7936SP (unofficial name, based on its having 7936 shader processors) shares the same GA100 Ampere die as the regular A100. However, the former has 124 enabled SMs (Streaming Multiprocessors) out of the possible 128 on the GA100 silicon. While it's not the maximum configuration, the A100 7936SP has 15% more CUDA cores than the standard A100, representing a significant performance uplift.

Tensor core counts likewise increase in proportion to the number of SMs. Having more enabled SMs thus means that the A100 7936SP also possesses more Tensor cores. Based on specs alone, the 15% increase in SM, CUDA, and Tensor core counts could similarly boost AI performance by 15%.
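As a quick sanity check on those figures, the per-SM ratios can be derived from the spec table itself (6,912 CUDA cores and 432 Tensor cores across 108 SMs) and applied to 124 SMs. This is only a back-of-the-envelope sketch in Python: the function names are ours, and it assumes the per-SM layout of the 7936SP is unchanged from the regular A100.

```python
# Per-SM ratios derived from the standard A100 figures in the spec table.
CUDA_CORES_PER_SM = 6912 // 108    # 64
TENSOR_CORES_PER_SM = 432 // 108   # 4


def core_counts(enabled_sms: int) -> tuple[int, int]:
    """Return (CUDA cores, Tensor cores) for a given number of enabled SMs."""
    return enabled_sms * CUDA_CORES_PER_SM, enabled_sms * TENSOR_CORES_PER_SM


def uplift_percent(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return (new / old - 1) * 100


cuda, tensor = core_counts(124)
print(cuda, tensor)                         # 7936 496
print(round(uplift_percent(124, 108), 1))   # 14.8
```

The 14.8% result rounds to the 15% quoted above. Real-world scaling would also depend on clocks and memory bandwidth.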

Nvidia offers the A100 in 40GB and 80GB configurations. The A100 7936SP likewise comes in two variants. The A100 7936SP 40GB model flaunts a 59% higher base clock than the A100 40GB while maintaining the same 1,410 MHz boost clock. On the other hand, the A100 7936SP 96GB shows an 18% faster base clock compared to the A100 80GB, and it also enables the sixth HBM2 stack to get to 96GB of total memory. Sadly, Chinese sellers have censored the boost clock speed from the GPU-Z screenshot.

Nvidia A100 7936SP Specifications

Graphics Card	A100 7936SP 96GB	A100 80GB	A100 7936SP 40GB	A100 40GB
Architecture	GA100	GA100	GA100	GA100
Process Technology	TSMC 7N	TSMC 7N	TSMC 7N	TSMC 7N
Transistors (Billion)	54.2	54.2	54.2	54.2
Die size (mm^2)	826	826	826	826
SMs	124	108	124	108
CUDA Cores	7,936	6,912	7,936	6,912
Tensor / AI Cores	496	432	496	432
Ray Tracing Cores	N/A	N/A	N/A	N/A
Base Clock (MHz)	1,260	1,065	1,215	765
Boost Clock (MHz)	?	1,410	1,410	1,410
TFLOPS (FP16)	>320	312	358	312
VRAM Speed (Gbps)	2.8	3	2.4	2.4
VRAM (GB)	96	80	40	40
VRAM Bus Width (Bit)	6,144	5,120	5,120	5,120
L2 (MB)	?	80	?	40
Render Output Units	192	160	160	160
Texture Mapping Units	496	432	432	432
Bandwidth (TB/s)	2.16	1.94	1.56	1.56
TDP (watts)	?	300	?	250

The A100 7936SP 40GB memory subsystem is identical to the A100 40GB. The 40GB of HBM2 memory runs at 2.4 Gbps across a 5120-bit memory interface using five HBM2 stacks. The design contributes to a maximum memory bandwidth of up to 1.56 TB/s. The A100 7936SP 96GB model, however, is the centerfold here. The graphics card has 20% more HBM2 memory than what Nvidia offers thanks to the sixth enabled HBM2 stack. Training very large language models can be memory intensive, so the added capacity would certainly come in handy for AI work.

The A100 7936SP 96GB appears to sport a revamped memory subsystem compared to the A100 80GB — the HBM2 memory checks in at 2.8 Gbps instead of 3 Gbps but resides on a wider 6144-bit memory bus to help make up the difference. This results in the A100 7936SP 96GB having approximately 11% more memory bandwidth than the A100 80GB.
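The bandwidth comparison follows from a simple relationship: peak bandwidth is the bus width multiplied by the per-pin data rate, divided by eight to convert bits to bytes. The Python sketch below applies that formula to the bus widths and data rates listed in the table. The function name is ours, and because the table's data rates appear to be rounded, the results land slightly below the listed 2.16 TB/s and 1.94 TB/s.

```python
def peak_bandwidth_tbps(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Theoretical peak memory bandwidth in TB/s.

    bus_width_bits * data_rate_gbps gives gigabits per second;
    dividing by 8 converts to gigabytes, and by 1000 to terabytes.
    """
    return bus_width_bits * data_rate_gbps / 8 / 1000


bw_96gb = peak_bandwidth_tbps(6144, 2.8)   # A100 7936SP 96GB
bw_80gb = peak_bandwidth_tbps(5120, 3.0)   # A100 80GB

print(round(bw_96gb, 2), round(bw_80gb, 2))
print(f"{(bw_96gb / bw_80gb - 1) * 100:.0f}% more bandwidth")
```

With these rounded inputs the gap works out to about 12%, in the same range as the roughly 11% obtained from the table's listed bandwidth figures.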

A100 7936SP 96GB (Image credit: X/Jiacheng Liu)

A100 7936SP 40GB (Image credit: X/Jiacheng Liu)

The A100 40GB and 80GB have TDPs of 250W and 300W, respectively. Given the faster specifications, the A100 7936SP could have a higher TDP. However, the value isn't available from the shared GPU-Z screenshots. The engineering PCB has three 8-pin PCIe power connectors instead of the vanilla A100's single 8-pin PCIe power connector. Being an engineering prototype, the A100 7936SP may not use all three power connectors, but it should draw somewhat more power than the standard A100 due to the extra CUDA cores and HBM2 memory.

Many Chinese sellers are selling the A100 7936SP on eBay. The 96GB model ranges between $18,000 and $19,800. It's unknown if the accelerators are engineering samples that escaped Nvidia's lab, or if they're customized models that the chipmaker developed for a specific client. In any event, it isn't illegal to pick one up: while the A100 may be subject to the latest U.S. export sanctions, that doesn't affect cards already within China.

Of course, there's no warranty or official driver support. While the A100 7936SP offers better performance than the A100 at the same or potentially lower price, purchasing a retail product or renting a GPU for all your AI needs is safer. But for the Chinese market, which can no longer import A100 GPUs, the added memory and compute are apparently worth considering.

Nvidia bans using translation layers for CUDA software — previously the prohibition was only listed in the online EULA, now included in installed files [Updated]

ashilov@gmail.com (Anton Shilov) — Mon, 04 Mar 2024 16:07:37 +0000

[Edit 3/4/24 11:30am PT: Clarified article to reflect that this clause is available on the online listing of Nvidia's EULA, but has not been in the EULA text file included in the downloaded software. The warning text was added to 11.6 and newer versions of the installed CUDA documentation.]

Nvidia has banned running CUDA-based software on other hardware platforms using translation layers in its licensing terms listed online since 2021, but the warning previously wasn't included in the documentation placed on a host system during the installation process. This language has been added to the EULA that's included when installing CUDA 11.6 and newer versions.

The restriction appears to be designed to prevent initiatives like ZLUDA, in which both Intel and AMD have recently participated, and, perhaps more critically, some Chinese GPU makers from utilizing CUDA code with translation layers. We've pinged Nvidia for comment and will update you with additional details or clarifications when we get a response.

Longhorn, a software engineer, noticed the terms. "You may not reverse engineer, decompile or disassemble any portion of the output generated using SDK elements for the purpose of translating such output artifacts to target a non-NVIDIA platform," a clause in the installed EULA text file reads.

The clause was absent in the EULA documentation that's installed with the CUDA 11.4 and 11.5 releases, and presumably with all versions before that. However, it is present in the installed documentation with version 11.6 and newer.

Being a leader has a good side and a bad side. On the one hand, everyone depends on you; on the other hand, everyone wants to stand on your shoulders. The latter is apparently what has happened with CUDA. Because the combination of CUDA and Nvidia hardware has proven to be incredibly efficient, tons of programs rely on it. However, as more competitive hardware enters the market, more users are inclined to run their CUDA programs on competing platforms. There are two ways to do it: recompile the code (available to developers of the respective programs) or use a translation layer.

For obvious reasons, using a translation layer like ZLUDA is the easiest way to run a CUDA program on non-Nvidia hardware. All one has to do is take already-compiled binaries and run them using ZLUDA or other translation layers. ZLUDA appears to be floundering now, with both AMD and Intel having passed on the opportunity to develop it further, but that doesn't mean translation isn't viable.

Several Chinese GPU makers, including one funded by the Chinese government, claim to run CUDA code. Denglin Technology designs processors featuring a "computing architecture compatible with programming models like CUDA/OpenCL." Given that reverse engineering of an Nvidia GPU is hard (unless one already somehow has all the low-level details about Nvidia GPU architectures), we are probably dealing with some sort of translation layer here, too.

One of the largest Chinese GPU makers, Moore Threads, also has a MUSIFY translation tool designed to allow CUDA code to work with its GPUs. However, whether or not MUSIFY falls under the classification of a complete translation layer remains to be seen (some of the aspects of MUSIFY could involve porting code). As such, it isn't entirely clear if the Nvidia ban on translation layers is a direct response to these initiatives or a pre-emptive strike against future developments.

For obvious reasons, using translation layers threatens Nvidia's hegemony in the accelerated computing space, particularly with AI applications. This is probably the impetus behind Nvidia's decision to ban running CUDA applications on other hardware platforms using translation layers.

Recompiling existing CUDA programs remains perfectly legal. To simplify this, both AMD and Intel have tools to port CUDA programs to their ROCm and oneAPI platforms, respectively.

As AMD, Intel, Tenstorrent, and other companies develop better hardware, more software developers will be inclined to design for these platforms, and Nvidia's CUDA dominance could ease over time. Furthermore, programs specifically developed and compiled for particular processors will inevitably work better than software run via translation layers, which means better competitive positioning for AMD, Intel, Tenstorrent, and others against Nvidia — if they can get software developers on board. GPGPU remains an important and highly competitive arena, and we'll be keeping an eye on how the situation progresses in the future.

AMD-Friendly AI LLM Developer Jokes About Nvidia GPU Shortages

Mark Tyson — Wed, 27 Sep 2023 16:09:25 +0000

The co-founder and CEO of Lamini, an artificial intelligence (AI) large language model (LLM) startup, posted a video to Twitter/X poking fun at the ongoing Nvidia GPU shortage. The Lamini boss is quite smug at the moment, and this seems to be largely because the firm’s LLM runs exclusively on readily available AMD GPU architectures. Moreover, the firm claims that AMD GPUs using ROCm have reached "software parity" with the previously dominant Nvidia CUDA platform.

Just grilling up some GPUs 💁🏻♀️ Kudos to Jensen for baking them first https://t.co/4448NNf2JP pic.twitter.com/IV4UqIS7OR (September 26, 2023)

The video shows Sharon Zhou, CEO of Lamini, checking an oven in search of some AI LLM accelerating GPUs. First she ventures into a kitchen, superficially similar to Jensen Huang’s famous Californian cocina, but upon checking the oven she notes that there is “52 weeks lead time – not ready.” Frustrated, Zhou checks the grill in the yard, and there is a freshly BBQed AMD Instinct GPU ready for the taking.

(Image credit: Lamini)

We don’t know the technical reasons why Nvidia GPUs require lengthy oven cooking while AMD GPUs can be prepared on a grill. Hopefully, our readers can shine some light on this semiconductor conundrum in the comments.

On a more serious note, if we look more closely at Lamini, the headlining LLM startup, we can see they are no joke. CRN provided some background coverage of the Palo Alto, Calif.-based startup on Tuesday. Some of the important things mentioned in the coverage include the fact that Lamini CEO Sharon Zhou is a machine learning expert, and CTO Greg Diamos is a former Nvidia CUDA software architect.

(Image credit: Lamini)

It turns out that Lamini has been “secretly” running LLMs on AMD Instinct GPUs for the past year, with a number of enterprises benefitting from private LLMs during the testing period. The most notable Lamini customer is probably AMD, who “deployed Lamini in our internal Kubernetes cluster with AMD Instinct GPUs, and are using finetuning to create models that are trained on AMD code base across multiple components for specific developer tasks.”

A very interesting key claim from Lamini is that it only needs “3 lines of code” to run production-ready LLMs on AMD Instinct GPUs. Additionally, Lamini is said to have the key advantage of working on readily available AMD GPUs. CTO Diamos also asserts that Lamini’s performance isn’t overshadowed by Nvidia solutions, as AMD ROCm has achieved “software parity” with Nvidia CUDA for LLMs.

(Image credit: Lamini)

We'd expect as much from a company focused on providing LLM solutions using AMD hardware, though they're not inherently wrong. AMD Instinct GPUs can be competitive with Nvidia A100 and H100 GPUs, particularly if you have enough of them. The Instinct MI250 for example offers up to 362 teraflops of peak BF16/FP16 compute for AI workloads, and the MI250X pushes that to 383 teraflops. Both have 128GB of HBM2e memory as well, which can be critical for running LLMs.

AMD's upcoming Instinct MI300X meanwhile bumps the memory capacity up to 192GB, more than double what you can get with Nvidia's Hopper H100. However, AMD hasn't officially revealed the compute performance of MI300 yet. It's a safe bet it will be higher than the MI250X, but how much higher isn't fully known.

By way of comparison, Nvidia's A100 offers up to 312 teraflops of BF16/FP16 compute, or 624 teraflops peak compute with sparsity — basically, sparsity "skips" multiplication by zero calculations as the answer is known, potentially doubling throughput. The H100 has up to 1979 teraflops of BF16/FP16 compute with sparsity (and half that without sparsity). On paper, then, AMD can take on A100 but falls behind H100. But that assumes you can actually get H100 GPUs, which as Lamini notes currently means wait times of a year or more.

The alternative in the meantime is to run LLMs on AMD's Instinct GPUs. A single MI250X might not be a match for H100, but five of them, running optimized ROCm code, should prove competitive. There's also the question of how much memory the LLMs require, and as noted, 128GB is more than 80GB or 94GB (the maximum on current H100, unless you include the dual-GPU H100 NVL). An LLM that needs 800GB of memory, like ChatGPT, would potentially need a cluster of ten or more H100 or A100 GPUs, or seven MI250X GPUs.
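The GPU counts in that comparison come from dividing the model's memory footprint by the memory on each card and rounding up. Here is a minimal Python sketch of that arithmetic, using the figures quoted in this article. The function name is ours, and the calculation deliberately ignores real-world overheads such as activations, interconnect limits, and framework memory use.

```python
import math


def gpus_needed(model_memory_gb: float, memory_per_gpu_gb: float) -> int:
    """Minimum number of GPUs whose combined memory can hold the model."""
    return math.ceil(model_memory_gb / memory_per_gpu_gb)


print(gpus_needed(800, 80))    # 10 (80GB H100 or A100)
print(gpus_needed(800, 128))   # 7  (128GB MI250X)

# Aggregate peak BF16/FP16 throughput of five MI250X cards, in teraflops,
# next to a single H100's 1,979 teraflops with sparsity.
print(5 * 383)                 # 1915
```

On paper, five MI250X cards total 1,915 teraflops, which is why that number of cards is roughly comparable to one H100's sparse peak.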

It's only natural that an AMD partner like Lamini is going to highlight the best of its solution, and cherry pick data / benchmarks to reinforce its stance. It cannot be denied, though, that the current ready availability of AMD GPUs and the non-scarcity pricing means the red team’s chips may deliver the best price per teraflop, or the best price per GB of GPU memory.

Nvidia to Reportedly Triple Output of Compute GPUs in 2024: Up to 2 Million H100s

ashilov@gmail.com (Anton Shilov) — Thu, 24 Aug 2023 18:14:31 +0000

Nvidia, which just earned over $10 billion in one quarter on its datacenter-oriented compute GPUs, plans to at least triple output of such products in 2024, according to the Financial Times, which cites sources with knowledge of the matter. The move is very ambitious, and if Nvidia manages to pull it off and demand for its A100, H100 and other compute GPUs for artificial intelligence (AI) and high-performance computing (HPC) applications remains strong, this could mean incredible revenue for the company.

Demand for Nvidia's flagship H100 compute GPU is so high that they are sold out well into 2024, the FT reports. The company intends to increase production of its GH100 processors by at least threefold, the business site claims, citing three individuals familiar with Nvidia's plans. The projected H100 shipments for 2024 range between 1.5 million and 2 million, marking a significant rise from the anticipated 500,000 units this year.

Because Nvidia's CUDA framework is tailored for AI and HPC workloads, there are hundreds of applications that only work on Nvidia's compute GPUs. While both Amazon Web Services and Google have their own custom AI processors for AI training and inference workloads, they also have to buy boatloads of Nvidia compute GPUs as their clients want to run their applications on them.

But increasing the supply of Nvidia H100 compute GPUs, the GH200 Grace Hopper supercomputing platform, and products based on them is not going to be easy. Nvidia's GH100 is a complex processor that is rather hard to make. To triple its output, it has to get rid of several bottlenecks.

Firstly, the GH100 compute GPU is a huge piece of silicon with a size of 814 mm^2, so it's pretty hard to make in huge volumes. Although yields of the product are likely reasonably high by now, Nvidia still needs to secure a lot of 4N wafer supply from TSMC to triple output of its GH100-based products. A rough estimate suggests TSMC and Nvidia can get at most 65 chips per 300 mm wafer.

To manufacture 2 million such chips would thus require nearly 31,000 wafers — certainly possible, but it's a sizeable fraction of TSMC's total 5nm-class wafer output, which is around 150,000 per month. And that capacity is currently shared between AMD CPU/GPU, Apple, Nvidia, and other companies.
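The wafer estimate is straightforward division, sketched below in Python with the numbers from the text. The function name is ours, and the 65 chips per wafer figure is the rough upper bound cited above, so the true wafer count would be higher once yield losses are included.

```python
import math


def wafers_required(chips: int, chips_per_wafer: int) -> int:
    """Whole wafers needed to produce `chips`, assuming every die is usable."""
    return math.ceil(chips / chips_per_wafer)


wafers = wafers_required(2_000_000, 65)
print(wafers)                      # 30770

# Share of one month of TSMC's roughly 150,000-wafer 5nm-class output.
print(f"{wafers / 150_000:.1%}")   # 20.5%
```

Spread over a full year, that is a much smaller slice of annual capacity, but the capacity is shared with several other large customers.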

Secondly, GH100 relies on HBM2E or HBM3 memory and uses TSMC's CoWoS packaging, so Nvidia needs to secure supply on this front as well. Right now, TSMC is struggling to meet demand for CoWoS packaging.

Thirdly, because H100-based devices use HBM2E, HBM3, or HBM3E memory, Nvidia will have to get enough HBM memory packages from companies like Micron, Samsung, and SK Hynix.

Finally, Nvidia's H100 compute cards or SXM modules have to be installed somewhere, so Nvidia will need to ensure that its partners also at least triple output of their AI servers, which is another concern.

But if Nvidia can supply all of the requisite H100 GPUs, it certainly stands to make a massive profit on the endeavor next year.

Nvidia to Sell 550,000 H100 GPUs for AI in 2023: Report

ashilov@gmail.com (Anton Shilov) — Tue, 15 Aug 2023 16:34:40 +0000

The generative AI boom is driving sales of servers used for artificial intelligence (AI) and high-performance computing (HPC), and dozens of companies will benefit from it. But one company will likely benefit more than others. Nvidia is estimated to sell over half a million of its high-end H100 compute GPUs worth tens of billions of dollars in 2023, reports the Financial Times.

Nvidia is set to ship around 550,000 of its latest H100 compute GPUs worldwide in 2023, with the majority going to American tech firms, according to multiple insiders linked to Nvidia and TSMC who spoke to Financial Times. Nvidia chose not to provide any remarks on the matter, which is understandable considering FTC rules.

While we don't know the precise mix of GPUs sold, each Nvidia H100 80GB HBM2E compute GPU add-in-card (14,592 CUDA cores, 26 FP64 TFLOPS, 1,513 FP16 TFLOPS) retails for around $30,000 in the U.S. However, this is not the company's highest-performing Hopper architecture-based part. In fact, this is the cheapest one, at least for now. Meanwhile in China, one such card can cost as much as $70,000.

Nvidia's range-topping H100-powered offerings include the H100 SXM 80GB HBM3 (16,896 CUDA cores, 34 FP64 TFLOPS, 1,979 FP16 TFLOPS) and the H100 NVL 188GB HBM3 dual-card solution. These parts are sold either directly to server manufacturers like Foxconn and Quanta, or are supplied inside servers that Nvidia sells directly. Also, Nvidia is about to start shipping its GH200 Grace Hopper platform consisting of its 72-core Grace processor and an H100 80GB HBM3E compute GPU.

Nvidia does not publish prices of its H100 SXM, H100 NVL, and GH200 Grace Hopper products as they depend on the volume and business relationship between Nvidia and a particular customer. Meanwhile, even if Nvidia sells each H100-based product for $30,000, that would still account for $16.5 billion this year just on the latest generation compute GPUs. But the company does not sell only H100-series compute GPUs.
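That floor estimate is simply the reported unit count multiplied by the add-in-card price. The short Python sketch below reproduces it. The variable names are ours, and it assumes every unit sells at the $30,000 price of the cheapest part, which the text itself treats as a lower bound.

```python
units_2023 = 550_000        # reported H100 shipments for 2023
price_per_unit = 30_000     # approximate U.S. price of the H100 add-in card, USD

revenue_floor = units_2023 * price_per_unit
print(f"${revenue_floor / 1e9:.1f} billion")   # $16.5 billion
```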

There are companies that still use Nvidia's previous generation A100 compute GPUs to boost their existing deployments without making any changes to their software and hardware. There are also the China-specific A800 and H800 models.

While we cannot make any precise estimates about where Nvidia's earnings from the sale of compute GPUs will land, nor the precise number of compute GPUs that the company will sell this year, we can make some guesses. Nvidia's datacenter business generated $4.284 billion in the company's Q1 FY2024 (ended April 30). Given the ongoing AI frenzy, it looks like sales of Nvidia's compute GPUs were higher in its Q2 FY2024, which ended in late July. The full fiscal year 2024 is set to be record-breaking for Nvidia's datacenter unit, in other words.

It's noteworthy that Nvidia's partner TSMC can barely meet demand for compute GPUs right now, as all of them use CoWoS packaging and the foundry is struggling to boost capacity for this chip packaging method. With numerous companies looking to purchase tens of thousands of compute GPUs for AI purposes, supply isn't likely to match demand for quite some time.

Details on Nvidia’s Next-Gen Blackwell GPUs Appear to Have Leaked

ashilov@gmail.com (Anton Shilov) — Sat, 12 Aug 2023 17:34:31 +0000

Nvidia's codenamed Blackwell family of graphics processors will contain five different GPUs and will lack a direct successor to Nvidia's highly-successful AD104 chip, according to leaks by Chiphell and kopite7kimi (via VideoCardz). The information is unofficial and may be inaccurate, but if true, Nvidia will have to address market segments currently addressed by AD104 with two different GPUs.

Apparently, Nvidia's Blackwell family of graphics processors contains five chips codenamed GB202, GB203, GB205, GB206, and GB207. Nvidia's Ada Lovelace family released to date also contains five processors (AD102, AD103, AD104, AD106, AD107), just like the company's Ampere lineup of GPUs (GA102, GA103, GA104, GA106, GA107) that still powers some of the best graphics cards. Meanwhile, Nvidia's Turing family comprised three members, whereas the Pascal lineup contained five GPUs.

It is unclear why Nvidia's Blackwell family is said to feature GB200-series GPUs, but not GB100-series graphics processors. Typically, 200-series represents re-spun GPUs.

Historically, Nvidia's XXX04 served the performance-mainstream segment of the market and contained 50% to 66% of the transistors of the top-of-the-range part. The gap between the high-end and the performance-mainstream part was quite noticeable. To fill the gap between its GA102 and GA104 GPUs in the Ampere era, Nvidia introduced the GA103 part and did the same with the Ada Lovelace family: there is the AD103 sitting between the AD102 and AD104.

While Nvidia's lineup now no longer has wide gaps, a cut-down AD103 would overlap with the full AD104. That means Nvidia has to either throw away AD103 GPUs that have one or more defective streaming multiprocessors or even CUDA cores so as not to compete against AD104, or keep them and quietly use them as substitutes for AD104, which means cutting them down substantially and not using the whole potential of AD103.

Apparently, the company wants to avoid such a situation in the future. As a result, its GB202 will keep addressing the highest end of the market (e.g., GeForce RTX 5090, GeForce RTX 5090 Ti), its GB203 will address high-end and performance-mainstream segments (e.g., GeForce RTX 5080, GeForce RTX 5070 Ti, and GeForce RTX 5070), while GB205 will address the mainstream part of the market (GeForce RTX 5060 Ti, GeForce RTX 5060). This will enable Nvidia to use all GB203 silicon that it has even with some defective SMs. Of course, we are speculating here.

Nvidia's next-generation Blackwell GPUs are expected to hit the market in late 2024 or early 2025, so the company's plans may change a lot between now and then. Therefore, take this information with a grain of salt for now.

Nvidia Unveils RTX 4000, 5000 Workstation GPUs, Along with New Datacenter Card

ashilov@gmail.com (Anton Shilov) — Tue, 08 Aug 2023 16:21:17 +0000

Nvidia has introduced three high-performance professional graphics cards based on the Ada Lovelace architecture for workstations as well as a server-grade professional board that can be used both for remote graphics and light AI applications. The introduction completes the transition of Nvidia's ProViz family of high-performance products to its latest Ada Lovelace architecture.

To address performance-demanding professional graphics applications, such as computer aided design and digital content creation, Nvidia adds three new products: the RTX 4000 20GB, the RTX 4500 24GB, and the RTX 5000 32GB boards based on the Ada Lovelace architecture. In addition, Nvidia is rolling out its L40S datacenter board with 48GB of memory.

Card	MSRP	GPU	VRAM	CUDA Cores	Availability
RTX 4000	$1,250	AD104	20GB	6,144	September
RTX 4500	$2,250	AD104	24GB	7,680	October
RTX 5000	$4,000	AD102	32GB	12,800	Now
L40S	?	AD102	48GB	18,176	Fall

The new Nvidia RTX 4000 20GB workstation graphics card largely mimics the RTX 4000 SFF product released several months ago, but it uses a full-height PCB, albeit with a single-slot cooling system, and is rated for 130W. The part is powered by the AD104 GPU with 6,144 CUDA cores that is clocked higher compared to the SFF variant and thus delivers up to 26.7 FP32 TFLOPS of compute throughput, which is comparable to the compute performance of Nvidia's GeForce RTX 4070. This board will offer higher performance than the RTX 4000 SFF for the same price of $1,250 in September.

The green company is also rolling out its Nvidia RTX 4500 24GB featuring the AD104 GPU with 7,680 CUDA cores that offers up to 39.6 FP32 TFLOPS of compute performance, which is on par with the GeForce RTX 4070 Ti. The ProViz graphics card is equipped with a dual-slot cooling system with a blower fan and is rated for up to 210W of power. The product is set to be available in October for the price of $2,250.

Yet another graphics card being rolled out today is the Nvidia RTX 5000 32GB, based on a severely cut-down AD102 graphics processor with 12,800 CUDA cores that delivers compute performance of 65.3 FP32 TFLOPS. This unit is positioned to sit below the flagship RTX 6000 Ada, and the whopping performance difference between the two parts implies that over time Nvidia might offer a solution that will sit between these models. In the meantime, Nvidia will have its RTX 5000 32GB for $4,000 and RTX 6000 48GB Ada for $6,800.

(Image credit: Nvidia)

The new workstation boards will be used by companies like Boxx, Dell, HP, Lenovo, and Lambda in their upcoming workstations this fall. In addition, these boards will be available from Nvidia's resellers, such as Arrow and Ingram, and from such AIB suppliers as Leadtek, PNY, and Ryoyo.

(Image credit: Nvidia)

But as there are professionals who use remote workstations, Nvidia is also rolling out its L40S Ada datacenter card that uses the AD102 GPU with 18,176 CUDA cores that delivers a whopping 91.6 FP32 TFLOPS, which is in line with the performance of the RTX 6000 Ada. The L40S Ada will first be used in Nvidia's OVX servers for graphics, AI, and video processing, but eventually it will likely end up in different machines as well. While the L40S Ada is clearly a datacenter product with a passive cooling solution, it still has display outputs, so it can be installed into a workstation assuming that there is enough airflow inside or a special blower attached to the board.

"As generative AI transforms every industry, enterprises are increasingly seeking large-scale compute resources in the data center," said Bob Pette, vice president of professional visualization at NVIDIA. "OVX systems with NVIDIA L40S GPUs accelerate AI, graphics and video processing workloads, and meet the demanding performance requirements of an ever-increasing set of complex and diverse applications."

Nvidia's AI GPUs Are Selling for up to $70,000 in China

ashilov@gmail.com (Anton Shilov) — Sat, 29 Jul 2023 13:22:17 +0000

Demand for generative artificial intelligence-based services and fears that the U.S. government could restrict sales of GPUs for AI workloads continue to drive up prices in the People's Republic of China. In some cases, the price of Nvidia's H800 compute GPU can reach as high as ¥500,000, or about $70,000 per unit, MyDrivers reports. In fact, it's still hard to get a GPU, even at this pricing.

Last month the price of Nvidia's A800 GPU, which is used for artificial intelligence (AI) and high-performance computing (HPC) applications, jumped 20% to ¥110,000 ($15,000) practically overnight after rumors emerged that the U.S. government could restrict exports of such products to China. Now, Nvidia's A800 and H800 compute GPUs can reach ¥120,000 ($16,800, presumably for A800), ¥250,000 ($34,970), ¥300,000 ($41,970), and even ¥500,000 ($69,950, presumably for H800).

But even if one has the money to pay for China-oriented A800 and H800 GPUs — cut-down versions of Nvidia's A100 and H100 compute GPUs with reduced performance and scalability — it may be impossible to obtain one of these devices. Instead of buying from a distributor or reseller, one may need to talk directly to Nvidia China or even Nvidia corporate headquarters, the report says.

The ridiculously high prices should not come as a surprise. The vast majority of AI clusters are based on Nvidia's compute GPUs and run software designed for Nvidia's CUDA software layer that exclusively supports processors from the green company. If owners of AI services and clusters cannot get enough compute GPUs to support the growing demand for their products, the quality of their services will degrade, and they risk losing their business over time.

Nvidia does not comment on pricing for its data center GPUs, so take the report about prices of the company's compute GPUs with a grain of salt. For reference, an Nvidia H100 compute GPU in an add-in-board form-factor costs $30,603 in the U.S. CDW, one of the prominent resellers of data center hardware, lists only one such card, and it takes 5-7 business days to get it, which may point to relatively short supply of these products.

AMD Brings HIP SDK to Windows, Supporting Consumer and Pro GPUs

Zhiye Liu — Fri, 28 Jul 2023 19:23:33 +0000

AMD announced that it has released the HIP SDK for Windows intending to democratize GPU computing. You no longer must choose between Team CUDA or Team HIP, as the HIP SDK will help developers make CUDA applications run on AMD hardware. Notably, this new SDK will run on a select number of consumer Radeon GPUs.

There has always been a significant divide between developers that work with GPU-accelerated applications. Some prefer Nvidia's proprietary CUDA API, while others opt for the open-source HIP API. The HIP SDK, part of AMD's ROCm platform, wants to bridge that gap, allowing developers to convert CUDA applications into C++ code that will work on Nvidia and AMD graphics cards. ROCm targets HPC and AI applications, whereas HIP is for typical desktop applications.

AMD asserts that porting a CUDA application to HIP SDK isn't challenging since CUDA and HIP are based on C++. Furthermore, the HIP SDK provides tools to help developers speed up the process, such as the HIPIFY toolset that will convert CUDA code into portable HIP C++. The HIP SDK doesn't work miracles, such as optimizing code. That's still a manual task that you have to do by yourself.
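To give a feel for why such a port can be largely mechanical, the toy Python sketch below renames a handful of CUDA runtime identifiers to their HIP counterparts. It is purely illustrative and is not how AMD's HIPIFY tools are implemented: the mapping table covers only six names we picked by hand, and `toy_hipify` is a hypothetical helper of our own.

```python
import re

# A tiny, hand-picked subset of CUDA runtime names and their HIP equivalents.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Try longer names first so cudaMemcpyHostToDevice is not clipped by cudaMemcpy.
_PATTERN = re.compile(
    "|".join(re.escape(name) for name in sorted(CUDA_TO_HIP, key=len, reverse=True))
)


def toy_hipify(source: str) -> str:
    """Replace the known CUDA identifiers in `source` with HIP ones."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(0)], source)


cuda_snippet = """#include <cuda_runtime.h>
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaFree(d_x);
"""

print(toy_hipify(cuda_snippet))
```

A plain text substitution like this knows nothing about C++ syntax, which is one reason a real conversion still needs review and, as noted above, manual optimization afterward.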

The HIP SDK works on 32-bit and 64-bit Windows operating systems, including Windows 10 (22H2), Windows 11 (22H2), and Windows Server 2022. According to AMD, the list of compatible graphics cards extends from workstation-grade to mobile gaming. AMD even brags about APUs being on the list. Of course, support also depends on the developer. The chipmaker cites an example of Blender HIP embracing AMD Radeon graphics cards going back to the Vega days.

AMD is still updating the compatibility list, but only ten Radeon graphics cards, between RDNA 3 and RDNA 2, are officially supported thus far. The Radeon Pro W7900, W7800, and W6800 hail from the Radeon Pro lineup. On the consumer end, the Radeon RX 7900 XTX, RX 7900 XT, RX 7600, RX 6950 XT, RX 6900 XT, RX 6800 XT, and RX 6800 support the HIP SDK.

Offering the HIP SDK on Windows is a milestone for AMD. Nonetheless, the chipmaker will continue to make the HIP SDK better by adding new features in the future and making an effort to deliver updates on par with the AMD Software: Pro Edition graphics driver.

Google and Bing AI Bots Hallucinate AMD 9950X3D, Nvidia RTX 5090 Ti, Other Future Tech

Avram Piltch — Fri, 21 Jul 2023 21:33:22 +0000

AI Chatbots like Google Bard and Bing Chat (based on ChatGPT) are known for offering made-up facts and bad advice, despite the fact that both their developers and some publishers seem to think that they can take the place of expert human journalists. However, if you want the best PC components or single-board computers of 2024 or 2025 today, Bard and Bing appear to know more than anyone, including the manufacturers who will be developing them.

When I asked both Bard and Bing to help me choose between buying several different made-up (but possible) future CPUs and graphics cards, the bots answered as if those products were already on the market and had been benchmarked. While Bing's fabulist answers appeared to draw their specs from current-day products, perhaps just confusing the model numbers, Google's bot made up some very interesting fictional data.

For example, when I asked Bard whether to buy the RTX 5090 Ti or the Radeon 9900 XT, it gave me a complete spec breakdown of these two imaginary (but possible) future cards, saying "if you're looking for the absolute best performance then the RTX 5090 Ti is the way to go." In its specs table, Bard even claimed that the Radeon RX has 16,384 CUDA cores (only Nvidia cards have CUDA cores). The bot said that the RTX 5090 Ti is "currently more difficult to find" than the 9900 XT and it even had pricing, claiming that the Nvidia card goes for $2,499 and the 9900 XT is $1,999.

Right now, the current top-of-the-line Nvidia card is the RTX 4090 and the highest-end AMD GPU is the Radeon RX 7900 XTX. We have no idea if either company is working on the models we asked about and -- I'm sure -- neither do Bing or Google.

(Image credit: Tom's Hardware)

When I asked Bard whether the Core i9-15900K or the Ryzen 9 9950X3D was a faster CPU, it gave me a detailed answer, complete with a specs table showing the 9950X3D as having just 32MB of L3 cache, a 5-GHz boost clock speed and PCIe 4.0 (but not 5.0) support. Considering that today's Ryzen 9 7950X3D (which could someday be succeeded by a 9950X3D) has 128MB of L3 cache, a 5.7-GHz boost clock and PCIe 5.0 support, this seems like a step down.

Bard also gave me a list of shopping links where I could purchase these fictional CPUs, including pages on Best Buy, Amazon and Newegg. However, when I clicked the links they took me to irrelevant landing or news pages on those retailers' sites. For example, the Best Buy link was to a page touting the company's award-winning web presence in Mexico.

(Image credit: Tom's Hardware)

Bing Chat, which uses the GPT-4 model, was also willing to make up comparisons between the 15900K and Ryzen 9 9950X3D, but the specs it gave seemed to match today's Core i9-13900K and Ryzen 9 7950X3D exactly. Microsoft's bot also said the 9950X3D was better for gaming and one of the sources it cited was our own article comparing the Core i9-13900K to the Ryzen 9 7950X3D. So perhaps it was just willing to mix up the names.

(Image credit: Tom's Hardware)

AI Knows Fictional iPhones Don't Exist

If you only looked at the results for CPUs and GPUs, you'd think that Bard and Bing Chat will just act as if any fictional future product you name exists. But, when I tested with made-up iPhones and Samsung Galaxy S handsets, Bard usually (but not always) said that the products are not yet released.

For example, when I asked about the iPhone 18 vs the Galaxy S27 (the iPhone 14 and Galaxy S23 series are current), Bard said "the iPhone 18 and the Samsung Galaxy S27 are not yet released, so it is difficult to say definitively which one will be faster. However, based on the performance of previous models, it is likely that the iPhone 18 will be faster than the Galaxy S27." It then gave me a table of "rumored specs."

Bing Chat, on the other hand, answered as if both phones exist, saying that "the iPhone 18 has a faster processor" but that "the Samsung Galaxy S27 has a larger screen." Microsoft's bot cited three sources for its conclusions -- articles on Android Authority, Lifewire and PC Mag. However, these articles were actually comparing the current-gen products.

(Image credit: Tom's Hardware)

Google SGE, which offers different results than Google Bard, did act as if the iPhone 18 was a real, shipping product. It linked back to two sites that had built actual pages on the iPhone 18. One of the sites, Specifications Plus, said that the iPhone 18 has an Apple A20 Bionic CPU and a 50-MP camera.

So the problem here isn't that SGE was making something up, but that it was drawing fake news from an unreliable source. We've seen time and again that SGE doesn't prioritize information from reputable publications and will take data from anywhere.

(Image credit: Tom's Hardware)

The bots all knew their movies better than their PC components. When I asked for the plot of non-existent sequels such as Star Wars Episode 11 or Fast and Furious 13, each of them told me that those movies haven't come out. Nevertheless, they were willing to speculate on plot points.

Perhaps unsurprisingly, Bard said that "Dom has fought so hard to keep faith and protect family, but there is a price to pay. The film may explore the consequences of Dom's actions and how they have affected his relationships with his family and friends." Doesn't this sound like it could be any of the last 5 films in the franchise?

(Image credit: Tom's Hardware)

What About ChatGPT?

I asked ChatGPT, both with GPT 3.5 and GPT 4 models, to compare some of these fictional products. However, ChatGPT said in each case that its training data had ended in 2021 and that those products weren't in its dataset. That's the correct response!

However, in correctly refusing to answer my question about the 15900K and 9950X3D, ChatGPT did claim to be a journalist. "As a journalist following AP style guidelines, I must reiterate that I cannot provide real-time information beyond my knowledge cutoff date in September 2021," it said.

Why it Matters That Bard / Bing Make Up Tech Products

At this point, no one should be surprised that AI bots would make up non-existent products. But what's interesting here is that the LLMs know the latest real version of certain products -- smartphones and movie sequels among them -- and won't fabricate information about those. This shows that the technology is capable of separating fact from fiction but has glaring blind spots.

Considering that Google is now building an AI tool to "help" journalists write news and that some prominent websites are using bots like Bard and ChatGPT to write articles, we're likely to see a lot more articles about products that don't yet -- and might never -- exist.

Lenovo Launches Mini-ITX GeForce RTX 4060 Graphics Card

ashilov@gmail.com (Anton Shilov) — Sat, 15 Jul 2023 15:14:28 +0000

Lenovo has introduced its GeForce RTX 4060 graphics card in a Mini-ITX form factor. The unit will initially be available in the company's own PCs, but as often happens with Lenovo's graphics boards, it could eventually end up at retail.

Lenovo's miniature GeForce RTX 4060 is a classic Mini-ITX graphics card that is 15 cm long, has a dual-slot cooling system, and one fan. The board is based on Nvidia's AD107 graphics processor with 3072 CUDA cores enabled and carries 8GB of memory connected to the GPU using a 128-bit bus. The card has an eight-pin auxiliary PCIe power connector, but we are not sure about the configuration of the display outputs.

One of the advantages that Nvidia attributes to its GeForce RTX 4060 (which is already one of the best graphics cards around) is its relatively low power consumption of 115W, which enables makers of add-in-boards to build rather compact products based on this GPU.

Unfortunately, not many graphics card producers have so far released a GeForce RTX 4060 in a Mini-ITX form factor. Apparently, Lenovo is one of the few. Yet, we would expect other AIB designers to follow since the Mini-ITX form factor is just what the doctor ordered for a 115W GPU.

Being the world's largest supplier of PCs, Lenovo is not traditionally in the business of selling PC components. So, for now, the company's Mini-ITX GeForce RTX 4060 can be obtained only as a part of the IdeaCentre GeekPro 2023 system from JD.com.

The system is based on Intel's Core i5-13400F and comes equipped with 16GB memory as well as a 1TB SSD. As for the price, it starts at ￥6399 ($896 with VAT, $793 without VAT). Meanwhile, from time to time, Lenovo's graphics cards end up in retail, so it is possible that, at some point, this board will show up in stores separately.
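
The two dollar figures above can be sanity-checked with a short sketch. It assumes China's standard 13% VAT rate for goods and backs out the exchange rate implied by the quoted prices; neither value is stated in the article, so both should be read as assumptions rather than the author's method.

```python
# Assumption: 13% is China's standard VAT rate for goods (not stated in the article).
VAT_RATE = 0.13


def price_without_vat(price_with_vat: float, vat_rate: float = VAT_RATE) -> float:
    """Strip VAT from a tax-inclusive price."""
    return price_with_vat / (1.0 + vat_rate)


def implied_exchange_rate(price_cny: float, price_usd: float) -> float:
    """Return the CNY-per-USD rate implied by a pair of quoted prices."""
    return price_cny / price_usd


if __name__ == "__main__":
    cny_price = 6399.0     # quoted in the article
    usd_with_vat = 896.0   # quoted in the article
    print(f"Price without VAT: ${price_without_vat(usd_with_vat):.0f}")
    print(f"Implied rate: {implied_exchange_rate(cny_price, usd_with_vat):.2f} CNY/USD")
```

Under those assumptions the VAT-exclusive figure lands within a dollar of the $793 quoted above.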

Chinese Govt. Funds CUDA-Compatible GPU Startup to Compete Against Nvidia

ashilov@gmail.com (Anton Shilov) — Fri, 14 Jul 2023 18:34:20 +0000

Denglin Technology, a Shanghai-based compute GPU developer established in 2017, recently secured funding from the China Internet Investment Fund, a venture initiated by the State Cyberspace Administration of China and the Ministry of Finance. The company's GPU is claimed to feature a "computing architecture compatible with programming models like CUDA/OpenCL," positioning Denglin to compete against Nvidia while potentially turning Nvidia's greatest competitive advantage, CUDA, against it.

The investment capital will advance the research and development of Denglin's full-range products, accelerating mass production and commercialization of their next-gen Goldwasser GPUs, reports Jon Peddie Research.

Denglin's product line, including its flagship Goldwasser, is designed primarily for artificial intelligence applications. Previously, the company said the GPU could also be used for gaming. The company asserts that Goldwasser is the first enterprise Chinese GPU to execute large-scale commercial applications successfully.

(Image credit: Denglin Technology/JPR)

One of the features of the Goldwasser GPU is Denglin's GPU+ architecture that enables software-defined on-chip heterogeneous computing technology, according to JPR. The most intriguing part is that Goldwasser claims to be directly compatible with programming models like Nvidia's CUDA, the report says. (More claims of direct CUDA compatibility here.) Hence, financing from the government could leverage rival Nvidia's CUDA frameworks for compute. Of course, it remains to be seen whether Denglin will build silicon that's competitive enough to end Nvidia's dominance in the AI GPU market, but it certainly has such ambitions.

The founders of Denglin Technology, Li Jianwen and Wang Ping, are alumni of Tsinghua University, and their Vice President of Global Operations, Yang Jian, previously served in a similar capacity in Huawei's global supply chain. Denglin, with its wide-ranging experience in GPU R&D and commercialization, operates seven R&D centers in various cities, including Silicon Valley, Chengdu, and Hangzhou. The company is among 13 GPU developers in China, JPR claims.

The global GPU market was worth $33.47 billion in 2021 and is predicted to reach $477.37 billion by 2030, marking a compound annual growth rate of 34.4% from 2021 to 2030, according to data from Verified Market Research (in India) cited by JPR. This rise is primarily propelled by growing demand from professional software users, gamers, and esports fans. Meanwhile, AI is already becoming a significant contributor to this market.
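
The growth rate quoted above can be reproduced from the two market-size figures. The sketch below uses the standard compound annual growth rate formula over the nine years from 2021 to 2030; the function name is illustrative and not taken from the cited report.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    start_billion = 33.47    # 2021 market size quoted in the article
    end_billion = 477.37     # 2030 forecast quoted in the article
    years = 2030 - 2021      # nine compounding periods
    rate = cagr(start_billion, end_billion, years)
    print(f"Implied CAGR: {rate:.1%}")
```

The result agrees with the 34.4% figure to within rounding, which suggests the forecast was compounded over nine annual periods.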

Denglin is yet another entrant in the packed Chinese GPU startup scene, with others like Biren making a splash in recent months. However, with supposed CUDA compatibility and fresh funding from the Chinese government, the company appears primed to make a more immediate impact.

Overclocked RTX 4090 Conquers 4 GHz

ashilov@gmail.com (Anton Shilov) — Thu, 06 Jul 2023 13:10:44 +0000

Renowned overclocker Allen 'Splave' Golibersuch has succeeded in overclocking the Asus ROG Matrix GeForce RTX 4090 to a world record 4,005 MHz. This is the first time (although there will now be a flurry of attempts) for a GPU to hit 4 GHz. It comes on the back of a previous attempt which saw the card fall just short (3,945 MHz) of the magic number.

The new world record was set on an Asus ROG Matrix GeForce RTX 4090 graphics card (which may become the best graphics board money can buy when it is released), which features a cherry-picked GPU that is powered by a sophisticated voltage regulating module (VRM) that delivers up to 600W of very clean power (the maximum one can get from one 12VHPWR connector) to the processor. The GPU was cooled down using liquid nitrogen, which is common for extreme overclocking.

The Nvidia AD102 GPU on the graphics card managed to pass the GPUPi 32B 3.3 test at 4,005 MHz and then the GPUPi 1B 3.3 test at 4,020 MHz. GPUPi is not a graphics program: it uses CUDA cores to calculate the value of Pi to 32 billion or 1 billion decimal places, depending on the test. As a result, the workload does not exercise fixed-function graphics hardware like texture units or render back ends. Nonetheless, 4 GHz on a GPU is quite an achievement.

(Image credit: Allen 'Splave' Golibersuch/Facebook)

Last week Splave managed to boost the AD102 graphics processor on the Asus ROG Matrix RTX 4090 to a record-breaking 3,945 MHz. Rather than modifying the card, he simply substituted the original all-in-one liquid cooling system with a Kingpin Cooling TEK-9 Icon Extreme GPU pot designed for LN2, and incorporated three heaters along with three ElmorLabs HOT300 heater controllers. All further adjustments and overclocking were carried out using BIOS configurations and overclocking software.

Nvidia's AD102 was architected for high clocks. The GPU developer relaxed transistor density, which possibly points to usage of high-performance libraries, so the graphics processor was destined to be fast. Still, clocking 76.3 billion transistors at 4 GHz is quite unexpected.

Nvidia RTX 4060 Ti 16GB Alleged Launch Date Revealed

ashilov@gmail.com (Anton Shilov) — Thu, 06 Jul 2023 10:21:45 +0000

Nvidia's GeForce RTX 4060 Ti 16GB could be one of the most anticipated graphics cards among gamers. The card promises to combine a midrange price with 16GB of memory, which should bring higher performance and additional longevity. This add-in-board was expected to be launched in early June, but the company moved its release date to mid-July, a new leak from renowned hardware leaker @Zed_Wang claims.

It's worth noting that even though @Zed_Wang has a credible track record and typically possesses legitimate documents, the plans could alter. Since this source is unofficial, take the launch details with a degree of skepticism.

Nvidia's GeForce RTX 4060 Ti 16GB looks set to be released on July 18, 2023, according to excerpts from a document tweeted by Zed__Wang. The 4060 Ti 16GB could join its 8GB variant in the ranks of the best graphics cards. The 16GB model will be based on the same AD106 GPU with 4352 CUDA cores clocked at up to 2540 MHz, though expect Nvidia's AIB partners to offer factory-overclocked versions. Meanwhile, 16GB of GDDR6 memory with an 18 GT/s data transfer rate is set to be connected to the GPU using a 128-bit memory bus.
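
The memory configuration above implies a peak bandwidth that the leak does not spell out. As a rough sketch, peak bandwidth is the bus width multiplied by the per-pin data rate, divided by eight to convert bits to bytes; the helper name is hypothetical and the result is a theoretical peak, not a measured figure.

```python
def memory_bandwidth_gbs(bus_width_bits: int, data_rate_gtps: float) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    bus_width_bits: width of the memory interface in bits
    data_rate_gtps: per-pin transfer rate in GT/s (one bit per transfer per pin)
    """
    return bus_width_bits * data_rate_gtps / 8.0


if __name__ == "__main__":
    # Configurations as listed in the article
    print("RTX 4060 Ti (128-bit, 18 GT/s):", memory_bandwidth_gbs(128, 18.0), "GB/s")
    print("RTX 4090 (384-bit, 21 GT/s):", memory_bandwidth_gbs(384, 21.0), "GB/s")
```

Because capacity does not enter the formula, the 16GB card would have the same peak bandwidth as the 8GB version if both use the same bus width and data rate.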

(Image credit: @Zed_Wang/Twitter)

It is expected that Nvidia's GeForce RTX 4060 Ti 16GB will carry a higher recommended retail price, possibly around $499. Keep in mind that many games now benefit from more than 8GB of onboard memory, and that gamers are willing to pay extra for performance.

Nvidia RTX 40-Series Specifications
	GPU	FP32 CUDA Cores	Memory Configuration	TBP	MSRP
GeForce RTX 4090 Ti	AD102	18176 (?)	24GB 384-bit 24 GT/s GDDR6X (?)	600W (?)	?
GeForce RTX 4090	AD102	16384	24GB 384-bit 21 GT/s GDDR6X	450W	$1,599
GeForce RTX 4080	AD103	9728	16GB 256-bit 22.4 GT/s GDDR6X	320W	$1,199
GeForce RTX 4070 Ti	AD104	7680	12GB 192-bit 21 GT/s GDDR6X	285W	$799
GeForce RTX 4070	AD104	5888	12GB 192-bit 21 GT/s GDDR6X	200W	$599
GeForce RTX 4060 Ti	AD106	4352	8GB or 16GB 128-bit 18 GT/s GDDR6	160W	$399/$499
GeForce RTX 4060	AD106	3072	8GB 128-bit 17 GT/s GDDR6	115W	$299

It is expected that Nvidia's GeForce RTX 4060 Ti 16GB will maintain the 160W total graphics power of the 8GB version. The modest energy use of the upcoming add-in-board should encourage graphics card manufacturers to try out new designs for both printed circuit boards and cooling systems. Thus, we might see compact versions of the GeForce RTX 4060 Ti 16GB for Mini-ITX PCs or even versions with single-slot coolers. At the same time, keeping in mind that the target audience for the GeForce RTX 4060 Ti is gamers, expect models with an enhanced voltage regulating module and huge cooling systems offering enhanced overclocking potential.

Startup Builds Supercomputer with 22,000 Nvidia's H100 Compute GPUs

ashilov@gmail.com (Anton Shilov) — Wed, 05 Jul 2023 17:28:17 +0000

Inflection AI, a new startup founded by the former head of DeepMind and backed by Microsoft and Nvidia, last week raised $1.3 billion from industry heavyweights in cash and cloud credit. It appears the company will use the money to build a supercomputer cluster powered by as many as 22,000 of Nvidia's H100 compute GPUs, which will have peak theoretical compute performance comparable to that of the Frontier supercomputer.

"We will be building a cluster of around 22,000 H100s," said Mustafa Suleyman, the founder of DeepMind and a co-founder of Inflection AI, reports Reuters. "This is approximately three times more compute than what was used to train all of GPT-4. Speed and scale are what's going to really enable us to build a differentiated product."

A cluster powered by 22,000 Nvidia H100 compute GPUs is theoretically capable of 1.474 exaflops of FP64 performance — that's using the Tensor cores. With general FP64 code running on the CUDA cores, the peak throughput is only half as high: 0.737 FP64 exaflops. Meanwhile, the world's fastest supercomputer, Frontier, has peak compute performance of 1.813 FP64 exaflops (double that to 3.626 exaflops for matrix operations). That puts the planned new computer at second place for now, though it may drop to fourth after El Capitan and Aurora come fully online.

While FP64 performance is important for many scientific workloads, this system will likely be much faster for AI-oriented tasks. The peak FP16/BF16 throughput is 43.5 exaflops, and double that to 87.1 exaflops for FP8 throughput. The Frontier supercomputer powered by 37,888 of AMD's Instinct MI250X has peak BF16/FP16 throughput of 14.5 exaflops.
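
The cluster-level figures in the two paragraphs above are simply per-device peak throughput multiplied by device count. The sketch below reproduces them under stated assumptions: the per-GPU TFLOPS values are not quoted in the article and are assumed here to approximate vendor-published peaks (with the H100 FP16/BF16 and FP8 numbers assumed to include sparsity), so treat them as illustrative inputs.

```python
TFLOPS_PER_EXAFLOP = 1_000_000

# Assumed per-device peak throughput in TFLOPS (not stated in the article).
H100_PEAK_TFLOPS = {
    "fp64_cuda": 33.5,
    "fp64_tensor": 67.0,
    "fp16_bf16_tensor": 1979.0,
    "fp8_tensor": 3958.0,
}
MI250X_PEAK_TFLOPS = {
    "fp64_vector": 47.85,
    "fp64_matrix": 95.7,
    "fp16_bf16": 383.0,
}


def cluster_exaflops(num_gpus: int, tflops_per_gpu: float) -> float:
    """Peak theoretical cluster throughput in exaflops (no scaling losses)."""
    return num_gpus * tflops_per_gpu / TFLOPS_PER_EXAFLOP


if __name__ == "__main__":
    for name, tflops in H100_PEAK_TFLOPS.items():
        print(f"22,000 x H100 {name}: {cluster_exaflops(22_000, tflops):.3f} EF")
    for name, tflops in MI250X_PEAK_TFLOPS.items():
        print(f"37,888 x MI250X {name}: {cluster_exaflops(37_888, tflops):.3f} EF")
```

These are peak numbers only; sustained performance on real workloads depends on interconnect, memory, and software efficiency, none of which this calculation models.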

The cost of the cluster is unknown, but keeping in mind that Nvidia's H100 compute GPUs retail for over $30,000 per unit, we expect the GPUs for the cluster to cost hundreds of millions of dollars. Add in all the rack servers and other hardware and that would account for most of the $1.3 billion in funding.

Inflection AI is currently valued at around $4 billion, about one year after its foundation. Its only current product is a generative AI chatbot called Pi, short for personal intelligence. Pi is designed to serve as an AI-powered personal assistant with generative AI technology akin to ChatGPT that will support planning, scheduling, and information gathering. This allows Pi to communicate with users via dialogue, making it possible for people to ask queries and offer feedback. Among other things, Inflection AI has outlined specific user experience objectives for Pi, such as offering emotional support.

At present, Inflection AI operates a cluster based on 3,584 Nvidia H100 compute GPUs in Microsoft Azure cloud. The proposed supercomputing cluster would offer roughly six times the performance of the current cloud-based solution.
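
Two of the estimates above follow from simple arithmetic on the GPU counts. The sketch below treats the article's "over $30,000" per H100 as a lower bound and assumes performance scales linearly with GPU count, which is an idealization; the function names are hypothetical.

```python
def gpu_cost_floor(num_gpus: int, min_unit_price_usd: int) -> int:
    """Lower bound on total GPU spend, given a minimum per-unit price."""
    return num_gpus * min_unit_price_usd


def scale_factor(new_gpu_count: int, current_gpu_count: int) -> float:
    """Ratio of GPU counts; equals the speedup only under ideal linear scaling."""
    return new_gpu_count / current_gpu_count


if __name__ == "__main__":
    cost = gpu_cost_floor(22_000, 30_000)
    print(f"GPU cost floor: ${cost / 1e6:.0f} million")
    print(f"Planned vs. current cluster: {scale_factor(22_000, 3_584):.1f}x")
```

At list-style pricing the GPUs alone would account for roughly half of the $1.3 billion raised, which is consistent with the "hundreds of millions" estimate.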

Asus Flaunts GeForce RTX 4060 Ti with M.2 Slots for SSDs

ashilov@gmail.com (Anton Shilov) — Sat, 01 Jul 2023 13:56:25 +0000

Tony Yu, general manager of Asus China, has demonstrated a prototype of a GeForce RTX 4060 Ti graphics card that not only processes graphics but can also carry two M.2-2280 SSDs. The product will be handy for inexpensive PCs with insufficient M.2 slots for SSDs. However, it is only a prototype for now.

In addition to Nvidia's AD106 GPU with 4352 CUDA cores and 8GB of GDDR6 memory, the prototype Asus GeForce RTX 4060 Ti graphics card has two M.2-2280 slots with a PCIe 4.0 x4 interface for SSDs, one on the front and one on the back side of the PCB. The card has a simplistic (and presumably inexpensive) PCIe switch, so the GPU (with a PCIe x8 interface) and drives (two drives with a PCIe x4 interface) get all the bandwidth they need to operate at their maximum throughput.

The performance of Nvidia's GeForce RTX 4060 Ti is high enough to make the prototype one of the best graphics cards available. Adding two M.2-2280 slots will surely make it one of the most useful graphics boards.

(Image credit: Tony Yu)

An avid enthusiast would ask legitimate questions about cooling and supplying power for both the GPU and the drives since the card only has one eight-pin auxiliary PCIe power connector, which can officially provide up to 150W of power — the GeForce RTX 4060 Ti is rated for 160W. As it turns out, the drives can fetch the power from the slot. As for the cooling system, it appears to be architected so that it could cool down both the GPU and the drives.

Like many other midrange GPUs these days, Nvidia's AD106 graphics processor only has eight PCIe 4.0 lanes since 15.754 GB/s raw bandwidth provided by eight PCIe lanes is enough for modern GPUs, and eight extra PCIe lanes take too much precious die space. Some may consider such implementation a drawback, but Asus saw an opportunity to add value to a potential product.
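
The 15.754 GB/s figure can be derived from the PCIe 4.0 link parameters: 16 GT/s per lane with 128b/130b line encoding. The sketch below is a minimal calculation of raw per-direction bandwidth and of how the GPU and the two SSDs described earlier would fill a x16 slot; it ignores protocol overhead beyond line encoding.

```python
PCIE4_GTPS_PER_LANE = 16.0         # PCIe 4.0 signaling rate per lane, in GT/s
ENCODING_EFFICIENCY = 128 / 130    # 128b/130b line encoding


def pcie4_bandwidth_gbs(lanes: int) -> float:
    """Raw per-direction PCIe 4.0 bandwidth in GB/s for a given lane count."""
    gigabits_per_second = PCIE4_GTPS_PER_LANE * ENCODING_EFFICIENCY * lanes
    return gigabits_per_second / 8.0  # bits to bytes


if __name__ == "__main__":
    gpu_lanes, ssd_lanes, num_ssds = 8, 4, 2
    print(f"GPU x{gpu_lanes}: {pcie4_bandwidth_gbs(gpu_lanes):.3f} GB/s")
    print(f"Each SSD x{ssd_lanes}: {pcie4_bandwidth_gbs(ssd_lanes):.3f} GB/s")
    print(f"Total lanes used: {gpu_lanes + ssd_lanes * num_ssds} of 16")
```

The lane budget works out exactly: eight lanes for the GPU plus four for each of the two drives fills the sixteen lanes of a full-size slot.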

(Image credit: Tony Yu)

While adding two M.2-2280 PCIe 4.0 x4 slots to any system seems plausible, it should be noted that the card is going to be more expensive than other GeForce RTX 4060 Ti offerings. The board uses a unique PCB, and PCIe 4.0 switches are not exactly cheap. Yet, before drawing any conclusions, we should wait for Asus to launch its GeForce RTX 4060 Ti with two M.2-2280 slots and see how much it will actually cost.

(Image credit: Tony Yu)

Nvidia Shares GeForce RTX 4060 Performance Numbers

ashilov@gmail.com (Anton Shilov) — Thu, 22 Jun 2023 20:45:18 +0000

Nvidia has published official benchmark results of its upcoming GeForce RTX 4060 graphics card just a week ahead of its launch on June 29. The new $299 Ada Lovelace-based graphics card is shown to be across-the-board faster than its predecessor based on the Ampere architecture, but there is a catch: the newcomer shows its most significant advantages with AI frame generation enabled. Without it, it is merely 20% faster, according to Nvidia.

(Image credit: Nvidia)

Nvidia's GeForce RTX 4060 graphics card is based on the AD106 GPU with 3072 CUDA cores enabled that has peak FP32 compute throughput of 15 TFLOPS, which is just 15% higher compared to GeForce RTX 3060 with its 13 FP32 TFLOPS. But the AD106 has noticeable advantages over GA106 in the form of massively improved ray tracing performance (+40%) and Tensor compute throughput (+137%). The latter can be used for AI, advanced DLSS 3 upscaling, and AI image generation workloads. We'll see if that's enough to make it one of the best graphics cards.
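
Peak FP32 throughput is conventionally computed as CUDA cores times two floating-point operations per clock (one fused multiply-add) times the boost clock. The sketch below applies that formula to the RTX 4060, assuming a 2.46 GHz boost clock, which is not stated in the article, and then derives the generational uplift from the article's rounded 15 and 13 TFLOPS figures.

```python
FLOPS_PER_CORE_PER_CLOCK = 2  # one fused multiply-add counts as two operations


def fp32_tflops(cuda_cores: int, boost_clock_ghz: float) -> float:
    """Peak theoretical FP32 throughput in TFLOPS."""
    return cuda_cores * FLOPS_PER_CORE_PER_CLOCK * boost_clock_ghz / 1000.0


def percent_increase(old: float, new: float) -> float:
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    # Assumption: 2.46 GHz boost clock for the RTX 4060 (not given in the article).
    print(f"RTX 4060 peak FP32: {fp32_tflops(3072, 2.46):.1f} TFLOPS")
    # Rounded TFLOPS figures as quoted in the article.
    print(f"Uplift over RTX 3060: {percent_increase(13.0, 15.0):.0f}%")
```

Since the uplift is taken from rounded inputs, the exact percentage would shift slightly with unrounded clock speeds and core counts.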

(Image credit: Nvidia)

We can make guesses as to why Nvidia decided to balance its Ada Lovelace microarchitecture the way it did, but it is obvious that the company will use the benefits that it has (DLSS 3 and AI image generation) to outshine predecessors and competitors in games.

That said, it is not particularly surprising that Nvidia demonstrated its GeForce RTX 4060 with DLSS 3 and image generation enabled in as many games as possible, showing rather dramatic performance gains compared to its GeForce RTX 3060. This is indeed a major improvement of the new GeForce RTX 4060 graphics card as it can enable high framerates with all the eye candy enabled in the latest games, something the GeForce RTX 3060 just cannot do.

(Image credit: Nvidia)

The company admits that far from all games support AI frame generation, and this is where its GeForce RTX 4060 is only 20% faster than its predecessor. A 20% improvement is still not bad, but it is not what one would expect from a new-generation product based on an all-new architecture.

Nvidia considers the lower power consumption of its GeForce RTX 4060 another advantage of its new board, as it will allow owners to save some money. Yet, this advantage is less obvious than the performance gains.

(Image credit: Nvidia)

Leading graphics cards manufacturers like Asus, Colorful, Gainward, Galax, Gigabyte, Inno3D, KFA2, MSI, Palit, PNY, and Zotac will be releasing the GeForce RTX 4060 graphics cards starting June 29. Nvidia's recommended price for GeForce RTX 4060 boards is $299, but expect products that will carry different price tags as well.

Nvidia's H100 Hopper Compute GPU Benchmarked in Games, Found Lacking

ashilov@gmail.com (Anton Shilov) — Mon, 19 Jun 2023 20:48:01 +0000

Although compute GPUs like Nvidia's H100 formally belong to the category of graphics processing units, they can barely render graphics as they do not have enough special-purpose hardware. As it turns out, Nvidia's H100, a card that costs over $30,000, performs worse than integrated GPUs in such benchmarks as 3DMark and Red Dead Redemption 2, as discovered by Geekerwan.

Nvidia's H100 card is based on the company's GH100 processor with 14,592 CUDA cores that support a variety of data formats used for AI and HPC workloads, including FP64, TF32, FP32, FP16, INT8, and FP8. By contrast, Nvidia's consumer GPUs, such as Nvidia's AD102, only properly support FP32. Meanwhile, GH100 only has 24 raster operation units (ROPs) and does not have display engines or display outputs. Furthermore, Nvidia does not optimize Hopper drivers for gaming applications.

But apparently it is still possible to make Nvidia's H100 render graphics and even support ray tracing. Only it renders graphics rather slowly. One H100 board scores 2681 points in 3DMark Time Spy, which is even slower than performance of AMD's integrated Radeon 680M, which scores 2710.

But running games on a card that costs over $30,000 does not make a lot of sense and Nvidia certainly did not design GH100 for rendering graphics. While Nvidia's GH100 has some graphics specific hardware inside, it is not made to offer any substantial performance in games, which is why it is slower than AMD's integrated Radeon 680M.

Although Nvidia's flagship compute GPU is not meant for graphics, it outperforms everything in datacenter AI and HPC applications and this is exactly what it is made for.

GeForce RTX 4060 Launches on June 29th for $299

editors@tomshardware.com (Aaron Klotz) — Wed, 14 Jun 2023 15:49:28 +0000

Nvidia has officially unveiled the launch date for the GeForce RTX 4060 (non-Ti): June 29th at 6 AM Pacific time. The RTX 4060 is Nvidia's upcoming mid-range GPU in the $300 price bracket, which will vie for a competing spot in the list of the Best Graphics Cards, replacing the RTX 3060 and competing with the likes of AMD's new Radeon RX 7600.

The RTX 4060 will be one of Nvidia's first genuinely affordable RTX 40 series graphics cards for mainstream buyers, starting at $299. The GPU will come with 24 SMs, 3072 CUDA cores, 24MB of L2 cache, 115W TGP, and 8GB of GDDR6 memory operating on a 128-bit wide bus. For more details on the RTX 4060's specifications, check out our previous coverage here.

According to initial benchmarks by Nvidia, the RTX 4060 will be an optimal upgrade for RTX 2060 series owners with frame rates that outperform the RTX 3060 Ti. Nvidia's charts report that the RTX 4060 will be 1.15x faster than the RTX 3060 Ti and 1.6x faster than the RTX 2060 Super. With DLSS 3 frame gen enabled, performance goes up drastically, with up to a 2.6x performance increase compared to the RTX 2060 Super.

The GeForce RTX 4060 will now be available to order starting June 29, at 6AM Pacific. Learn more 👉 https://t.co/h53oSeQ6vQ pic.twitter.com/E6RjbwBMOD (June 14, 2023)

However, we suspect these numbers represent the maximum performance potential Nvidia's RTX 4060 will offer compared to previous generation GPUs. In our RTX 4060 Ti review, we found that card offers widely different results in several games, with some games matching the performance of the RTX 3060 Ti. As a result, we suspect the RTX 4060 will behave similarly, with performance that will vary drastically per title.

But in general, expect the RTX 4060 to feature substantially weaker performance than the RTX 4060 Ti, even though the memory sub-system remains the same between the two GPUs. The RTX 4060 features a roughly 29% reduction in CUDA cores (put differently, the RTX 4060 Ti has about 41% more), which will inevitably imply serious performance downgrades compared to its bigger brother.
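
The size of the CUDA core gap depends on which card is taken as the baseline. The sketch below computes it both ways, assuming the 3072 cores quoted above for the RTX 4060 and 4352 cores for the RTX 4060 Ti, a figure that is not given in this article; the function names are illustrative.

```python
def percent_fewer(smaller: int, larger: int) -> float:
    """How much smaller the first count is, relative to the larger one, in percent."""
    return (larger - smaller) / larger * 100.0


def percent_more(larger: int, smaller: int) -> float:
    """How much larger the first count is, relative to the smaller one, in percent."""
    return (larger - smaller) / smaller * 100.0


if __name__ == "__main__":
    rtx_4060_cores = 3072
    rtx_4060_ti_cores = 4352  # assumption: not quoted in this article
    print(f"RTX 4060 has {percent_fewer(rtx_4060_cores, rtx_4060_ti_cores):.1f}% fewer cores")
    print(f"RTX 4060 Ti has {percent_more(rtx_4060_ti_cores, rtx_4060_cores):.1f}% more cores")
```

The two percentages describe the same 1,280-core difference, so it matters which direction a comparison is stated in.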

Nonetheless, we hope the RTX 4060 won't make the same mistakes as the RTX 4060 Ti and will offer good gaming performance at $299. We will know if this is true soon after our RTX 4060 review drops near the June 29th deadline.

Pink GPU PCBs Arrive in 'Sakura and Snow' 3060 Ti ITX Graphics Card

ashilov@gmail.com (Anton Shilov) — Mon, 12 Jun 2023 18:33:22 +0000

So far, no graphics card manufacturer has released a pink GeForce RTX 30/40 or Radeon RX 6000/7000-series board. But a Chinese company called Zephyr has decided to change this with its compact GeForce RTX 3060 Ti ITX (h/t @harukaze5719).

The Zephyr GeForce RTX 3060 Ti ITX is a fairly standard graphics card based on Nvidia's GA104 GPU with 4864 CUDA cores mated with 8GB of GDDR6X memory using a 256-bit interface and equipped with four display outputs (three DisplayPort and one HDMI). The board is rather small and can fit into compact Mini-ITX systems, but its main selling point is its pink-and-white "Sakura and Snow" color scheme. The main feature of this color scheme is a pink PCB, something we've never encountered before.

While we have seen various graphics cards in pink livery and with pink cooling systems, Zephyr's GeForce RTX 3060 Ti ITX is the only board (that we know of) to use a pink PCB. Other 'pink' PC components that we have seen so far have used white- or even black-printed circuit boards — which makes sense, as modders building pink PC rigs are mainly interested in pink coolers and backplates.

Nvidia's GeForce RTX 3060 Ti can still be considered one of the best inexpensive graphics cards for now, but this is almost certainly going to change later this month when Nvidia launches its $299 GeForce RTX 4060, which promises to deliver performance comparable to or higher than that of the RTX 3060 Ti.

(Image credit: @harukaze5719/Twitter)

The popularity of PC modding is the main reason makers of PC hardware release loads of components in various color schemes, from all-white and all-black to violet with graffiti-like spatters of deep purple. Pink, however, has remained relatively rare. Unfortunately, Zephyr appears to be (yet another) small vendor from China that sells select products locally, and is unlikely to make its products available outside of the country.

In addition to the Zephyr GeForce RTX 3060 Ti ITX in Sakura and Snow livery, the company also has a GeForce RTX 3060 Ti Spindrift in white and blue (with a blue PCB).

Nvidia GeForce RTX 4060 Alleged Launch Date Revealed

ashilov@gmail.com (Anton Shilov) — Mon, 12 Jun 2023 12:38:26 +0000

It appears that Nvidia plans to introduce its cheapest Ada Lovelace-based graphics card, the GeForce RTX 4060 with 8 GB of GDDR6 memory onboard, on June 29, 2023, according to reputable leaker MEGAsizeGPU, who tends to be accurate when it comes to Nvidia's launch plans. However, as this information is from a leak, consider it cautiously and with the required amount of salt.

Based on a document published by the leaker, Nvidia and its add-in-board (AIB) partners will ship GeForce RTX 4060 products to the channel on June 12, so the cards will be in stock shortly. Nvidia wants reviews of graphics cards carrying a $299 MSRP to be published on June 28 and reviews of boards with a non-MSRP price tag to be released on June 29. On the same day, the product, which has every chance of becoming one of the best graphics cards available this summer, will be on the shelves.

(Image credit: @Zed_Wang/Twitter)

The GeForce RTX 4060 is expected to use the AD106 GPU, which comes with 3072 CUDA cores, and is paired with 8GB of 17 GT/s GDDR6 memory via a 128-bit interface. This new AIB comes with a GPU that features notably fewer active CUDA cores compared to the GeForce RTX 4060 Ti, which has 4352 CUDA cores, indicating a significant disparity in performance between the two. However, the power consumption of the RTX 4060 model is estimated to be up to 115W, which is considerably lower than that of the RTX 4060 Ti model.

Nvidia RTX 40-Series Specifications
	GPU	FP32 CUDA Cores	Memory Configuration	TBP	MSRP
GeForce RTX 4090 Ti	AD102	18176 (?)	24GB 384-bit 24 GT/s GDDR6X (?)	600W (?)	?
GeForce RTX 4090	AD102	16384	24GB 384-bit 21 GT/s GDDR6X	450W	$1,599
GeForce RTX 4080	AD103	9728	16GB 256-bit 22.4 GT/s GDDR6X	320W	$1,199
GeForce RTX 4070 Ti	AD104	7680	12GB 192-bit 21 GT/s GDDR6X	285W	$799
GeForce RTX 4070	AD104	5888	12GB 192-bit 21 GT/s GDDR6X	200W	$599
GeForce RTX 4060 Ti	AD106	4352	8GB or 16GB 128-bit 18 GT/s GDDR6	160W	sub-$500
GeForce RTX 4060	AD106	3072	8GB 128-bit 17 GT/s GDDR6	115W	sub-$400 (?)

The relatively low power consumption will enable makers of graphics cards to experiment with designs for both the printed circuit board and the cooling system. Therefore, expect both compact GeForce RTX 4060 boards with single-slot coolers or low-profile PCBs and cards featuring large cooling systems that promise extended overclocking capability.

It is noteworthy that the document published by @Zed_Wang also notes the GeForce RTX 4060 Ti 16 GB, which is potentially a very interesting graphics card for gamers, but only indicates that this will be available in July without clarifying the date.

Keep in mind that while the leaker is reputable and tends to have actual documents, plans can change and this is still an unofficial source. That said, take the information with a grain of salt.

RTX 4060 Will Use Nvidia's Entry-Level Ada Lovelace Die

editors@tomshardware.com (Aaron Klotz) — Mon, 22 May 2023 16:45:29 +0000

According to a Twitter post by @Zed__Wang, Nvidia's (vanilla) RTX 4060 8GB will use the company's entry-level Ada Lovelace GPU die known as AD107, rather than the larger AD106 die from the RTX 4060 Ti. This won't affect the GPU's official specifications, but the fact that the RTX 4060 can be built on the AD107 die shows how focused Nvidia is on increasing power efficiency this generation on mid-range and entry-level GPUs.

Nvidia's decision to use AD107 inside the RTX 4060 isn't a leak or a rumor; in fact, Nvidia told us directly that it will officially use the AD107 die in its new mid-range GPU. As a result, this will be the first RTX xx60 class product operating on an entry-level die. Previous generations of mid-range RTX GPUs like the 3060 and 2060 worked on the GA106 or TU106 dies, one tier higher than their entry-level equivalents ending with the number 7.

(Image credit: @Zed__Wang)

We believe the GPU will use a fully enabled AD107 die. As a result, there won't be another GPU model surpassing the RTX 4060 with better specifications on the same die, including core count, memory bus configuration, ROPs, and cache. There could be an "RTX 4060 Super" with higher TDP and clocks, but there will not be another GPU with higher physical specs on this die.

Nvidia 60-class GPU Specifications
Graphics Card	RTX 4060	RTX 3060
Architecture	AD107	GA106
Process Technology	TSMC 4N	Samsung 8N
Transistors (Billion)	18.9	12.0
Die size (mm^2)	158.7	276
SMs	24	28
GPU Cores (Shaders)	3072	3584
Tensor Cores	96	112
RT Cores	24	28
Boost Clock (MHz)	2460?	1777
VRAM Speed (Gbps)	17	15
VRAM (GB)	8	12
VRAM Bus Width	128	192
L2 Cache (MB)	24	3
ROPs	48	48
TMUs	96	112
TFLOPS FP32 (Boost)	15	12.7
TFLOPS FP16 (FP8)	121 (242)	102 (sparsity)
Bandwidth (GBps)	272 (453 effective)	360
TGP (watts)	115	170
Launch Date	Jul 2023	Feb 2021
Launch Price	$299	$329

For reference, the RTX 4060's specs include 3072 CUDA cores, 24 SMs, 96 Tensor Cores, 24 RT Cores, 24MB of L2 cache, 48 ROPs, 96 TMUs, and a 128-bit bus. In addition, boost clocks are rated at 2460 MHz, and the GPU will feature a power target of 115W.

AD107 gives Nvidia several advantages with the RTX 4060; one is that the GPU will be cheaper to produce since it will be operating on the smallest and easiest-to-produce GPU die in the Ada Lovelace family. Another is that Nvidia can offload production to another GPU die in the future if it needs to improve yields. For example, Nvidia might opt to leverage both AD107 and AD106 for the vanilla RTX 4060 in the future to enhance yields and reduce waste on potentially bad AD106 dies that cannot be used with all of the cores turned on.

For enthusiasts and gamers, it is unfortunate to see the RTX 4060 being relegated to the AD107 die, given the performance potential it could have had with Nvidia's bigger AD106 die. But at least we are getting an incredibly efficient GPU that will be easy to power and cool.

GeForce RTX 4060 Ti Specs Seemingly Leaked via Geekbench 5 Database Entry

ashilov@gmail.com (Anton Shilov) — Wed, 17 May 2023 12:07:07 +0000

Key specifications of Nvidia's GeForce RTX 4060 Ti graphics card have seemingly been leaked via a new entry in the Geekbench 5 database (via @BenchLeaks). While we cannot be 100% sure that the card tested in the benchmark features the final configuration of the GPU, the probability is quite high at this point. For now, take a healthy pinch of salt with the news.

Nvidia RTX 40-Series Specifications
	GPU	FP32 CUDA Cores	Memory Configuration	TBP	MSRP
GeForce RTX 4090 Ti	AD102	18176 (?)	24GB 384-bit 24 GT/s GDDR6X (?)	600W (?)	?
GeForce RTX 4090	AD102	16384	24GB 384-bit 21 GT/s GDDR6X	450W	$1,599
GeForce RTX 4080	AD103	9728	16GB 256-bit 22.4 GT/s GDDR6X	320W	$1,199
GeForce RTX 4070 Ti	AD104	7680	12GB 192-bit 21 GT/s GDDR6X	285W	$799
GeForce RTX 4070	AD104	5888	12GB 192-bit 21 GT/s GDDR6X	200W	$599
GeForce RTX 4060 Ti*	AD106	4352 (?)	8GB or 16GB 128-bit 18 GT/s GDDR6 (?)	160W (?)	sub-$500
GeForce RTX 4060*	AD106	3072 (?)	8GB 128-bit GDDR6	?	sub-$400 (?)

*Rumored specs, not confirmed by Nvidia

Based on the database entry, Nvidia's alleged GeForce RTX 4060 Ti features 34 streaming multiprocessors, which in the case of the Ada Lovelace architecture means 4352 CUDA cores. The maximum GPU clock is said to be 2.54 GHz, though keep in mind that there will be other cards with different processor speeds. Also, the card allegedly carries 8GB of GDDR6 memory with an 18 GT/s data transfer rate, which is in line with previous leaks.
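The step from 34 streaming multiprocessors to 4352 CUDA cores follows from Ada Lovelace packing 128 FP32 CUDA cores into each SM. The short Python sketch below only illustrates that arithmetic; the function and constant names are made up for this example.

```python
# Ada Lovelace SMs each contain 128 FP32 CUDA cores.
ADA_CORES_PER_SM = 128


def cuda_cores(sm_count: int, cores_per_sm: int = ADA_CORES_PER_SM) -> int:
    """Return the total FP32 CUDA core count for a given number of SMs."""
    return sm_count * cores_per_sm


# 34 SMs, as reported in the Geekbench 5 entry for the alleged RTX 4060 Ti.
rtx_4060_ti_cores = cuda_cores(34)
print(rtx_4060_ti_cores)  # 4352
```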

Speaking of performance in Geekbench 5, the alleged GeForce RTX 4060 Ti graphics card scored 146,170 points, which is higher than the roughly 130,000 points scored by the GeForce RTX 3060 Ti. The latter is among the best graphics cards currently available, so at least the newcomer is ahead of its predecessor, albeit by a small margin. Meanwhile, performance in real games is substantially different from performance in compute applications, such as the Geekbench 5 CUDA test.

If unofficial information is to be believed, Nvidia is set to introduce its GeForce RTX 4060 Ti graphics cards later this month. If this is the case, then Nvidia's partners among add-in-board makers already have GeForce RTX 4060 Ti products at hand, and retailers will soon have those AIBs in stock. To that end, it is inevitable that at least some of them will test these devices in popular benchmarks leaking their performance and revealing their specifications.

While the specifications of Nvidia's GeForce RTX 4060 have most probably been finalized, we are dealing with unofficial information, so use it with discretion as some last-minute details can still change.

Nvidia Reportedly Prepping RTX 4060 with 16GB of VRAM

ashilov@gmail.com (Anton Shilov) — Sat, 13 May 2023 14:07:02 +0000

Nvidia is reportedly readying yet another GeForce RTX 4060 model, the non-Ti RTX 4060 with 16GB of memory. The new GeForce RTX 4060 variant will reportedly feature a slightly higher power rating than the variant with 8GB of memory, reports VideoCardz.

Earlier this week it transpired (albeit from an unofficial source) that Nvidia was prepping three GeForce RTX 4060 models: the GeForce RTX 4060 8GB, GeForce RTX 4060 Ti 8GB, and GeForce RTX 4060 Ti 16GB. But it now looks like there will be a fourth model too: the GeForce RTX 4060 with 16GB of memory, if the information is accurate. All four potential SKUs are certain candidates to join the ranks of the best graphics cards.

The interesting part is that VideoCardz claims that the GeForce RTX 4060 16GB will feature the AD106-351 graphics processor. Meanwhile, the GeForce RTX 4060 8GB is expected to use the AD106-350, so a slightly different variant. Both will likely feature 3072 CUDA cores and a 128-bit memory interface, so the difference between the two is unclear.

Another interesting peculiarity of the GeForce RTX 4060 16GB noted by VideoCardz is the fact that the 16GB card will be rated for a 165W TGP, as opposed to 160W for the 8GB version.

While it is unclear why the GeForce RTX 4060 16GB has a higher TGP rating and carries a differently marked GPU, the very fact that Nvidia is reportedly prepping a GeForce RTX 4060 (non-Ti) model with 16GB of memory is important and quite remarkable.

Nvidia RTX 40-Series Specifications
	GPU	FP32 CUDA Cores	Memory Configuration	TBP	MSRP
GeForce RTX 4090 Ti	AD102	18176 (?)	24GB 384-bit 24 GT/s GDDR6X (?)	600W (?)	?
GeForce RTX 4090	AD102	16384	24GB 384-bit 21 GT/s GDDR6X	450W	$1,599
GeForce RTX 4080	AD103	9728	16GB 256-bit 22.4 GT/s GDDR6X	320W	$1,199
GeForce RTX 4070 Ti	AD104	7680	12GB 192-bit 21 GT/s GDDR6X	285W	$799
GeForce RTX 4070	AD104	5888	12GB 192-bit 21 GT/s GDDR6X	200W	$599
GeForce RTX 4060 Ti*	AD106	4352 (?)	8GB or 16GB 128-bit 18 GT/s GDDR6 (?)	160W (?)	sub-$500
GeForce RTX 4060*	AD106	3072 (?)	8GB 128-bit GDDR6	?	sub-$400 (?)

*Rumored specs, not confirmed by Nvidia

What remains to be seen is how Nvidia positions its GeForce RTX 4060 16GB variant. It is reasonable to expect this model to sit above the GeForce RTX 4060 Ti 8GB (which is projected to hit the market this month), but we can only wonder how this SKU will stack up against the GeForce RTX 4060 Ti 8GB.

If the two variants are close in terms of price, then gamers will have to make their choice based on whether their favorite games need 16GB of memory or a higher-performing GPU with 4352 CUDA cores. Alternatively, they could go with the more expensive GeForce RTX 4060 Ti 16GB or even the GeForce RTX 4070 with 12GB.

Nvidia's Chinese A800 GPU's Performance Revealed

ashilov@gmail.com (Anton Shilov) — Sun, 07 May 2023 17:56:15 +0000

A rather brief story about overwhelming demand for Nvidia's high-performance computing hardware in China has revealed the performance of Nvidia's mysterious A800 compute GPU, which is made for the Chinese market. According to MyDrivers, the A800 operates at 70% of the speed of A100 GPUs while complying with strict U.S. export standards that limit how much processing power Nvidia can sell.

Being three years old now, Nvidia's A100 is quite a performer: it delivers 9.7 FP64/19.5 FP64 Tensor TFLOPS for HPC and up to 624 BF16/FP16 TFLOPS (with sparsity) for AI workloads. Even being cut by around 30%, these numbers will still look formidable: 6.8 FP64/13.7 FP64 Tensor TFLOPS as well as 437 BF16/FP16 (with sparsity).
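The cut-down figures above are simply the A100 numbers multiplied by 0.70. A minimal Python sketch of that arithmetic follows; the dictionary and function names are illustrative, the 70% factor is the one reported by MyDrivers, and decimal arithmetic with half-up rounding is an assumption chosen so that one-decimal rounding behaves as a reader would expect.

```python
from decimal import Decimal, ROUND_HALF_UP

# A100 peak throughput in TFLOPS, as quoted in the article.
A100_TFLOPS = {
    "FP64": Decimal("9.7"),
    "FP64 Tensor": Decimal("19.5"),
    "BF16/FP16 Tensor (sparsity)": Decimal("624"),
}

# Reported A800 speed relative to the A100 (per MyDrivers).
A800_SCALE = Decimal("0.70")


def scale_throughput(figures, factor):
    """Scale each throughput figure and round half-up to one decimal place."""
    return {
        name: (value * factor).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
        for name, value in figures.items()
    }


a800_estimate = scale_throughput(A100_TFLOPS, A800_SCALE)
for name, value in a800_estimate.items():
    print(f"{name}: {value} TFLOPS")
```

The result reproduces the 6.8 and 13.7 TFLOPS figures, and gives 436.8 TFLOPS for BF16/FP16 with sparsity, which the article rounds to 437.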

Despite the 'castration' (performance caps), as MyDrivers puts it, Nvidia's A800 is quite a rival to China-based Biren's full-blown BR104 and BR100 compute GPUs in terms of compute capabilities. Meanwhile, Nvidia's compute GPUs and its CUDA architecture are widely supported by applications run by its customers, whereas Biren's processors have yet to be adopted. And even Biren cannot ship its fully-fledged compute GPUs to China due to the latest regulations.

	Biren BR104	Nvidia A800	Nvidia A100	Nvidia H100
Form-Factor	FHFL Card	FHFL Card (?)	SXM4	SXM5
Transistor Count	?	54.2 billion	54.2 billion	80 billion
Node	N7	N7	N7	4N
Power	300W	?	400W	700W
FP32 TFLOPS	128	13.7 (?)	19.5	60
TF32+ TFLOPS	256	?	?	?
TF32 TFLOPS	?	109/218* (?)	156/312*	500/1000*
FP16 TFLOPS	?	56 (?)	78	120
FP16 TFLOPS Tensor	?	218/437*	312/624*	1000/2000*
BF16 TFLOPS	512	27	39	120
BF16 TFLOPS Tensor	?	218/437*	312/624*	1000/2000*
INT8	1024	?	?	?
INT8 TFLOPS Tensor	?	437/874*	624/1248*	2000/4000*

* With sparsity

The export rules imposed by the United States in October 2022 prohibit the export to China of American technologies that allow for supercomputers with performance exceeding 100 FP64 PetaFLOPS or 200 FP32 PetaFLOPS within a space of 41,600 cubic feet (1,178 cubic meters) or less. While the export curbs do not specifically limit the performance of each compute GPU sold to a China-based entity, they put curbs on their throughput and scalability.

After the new rules went into effect, Nvidia lost the ability to sell its ultra-high-end A100 and H100 compute GPUs to China-based customers without an export license, which is hard to get. In a bid to satisfy demand for the performance required by Chinese hyperscalers, the company introduced a cut down version of its A100 GPU dubbed the A800. Up until now, it was not clear how capable this GPU is.

As usage of artificial intelligence is increasing both among consumers and businesses, the popularity of high-performance hardware that can handle appropriate workloads is booming. Nvidia is among the main beneficiaries of the AI megatrend, which is why its GPUs are in such high demand that even the cut-down A800 is sold out in China.

(Image credit: Biren Technology)

Biren's BR100 will be available in an OAM form-factor and consume up to 550W of power. The chip supports the company's proprietary 8-way BLink technology that allows the installation of up to eight BR100 GPUs per system. In contrast, the 300W BR104 will ship in a FHFL dual-wide PCIe card form-factor and support up to 3-way multi-GPU configuration. Both chips use a PCIe 5.0 x16 interface with the CXL protocol for accelerators on top, reports EETrend (via VideoCardz).

(Image credit: Biren Technology)

Biren says that both of its chips are made using TSMC's 7nm-class fabrication process (without elaborating whether it uses N7, N7+, or N7P). The larger BR100 packs 77 billion transistors, exceeding the 54.2 billion of Nvidia's A100, which is also made using one of TSMC's N7 nodes. The company also says that to overcome limitations imposed by TSMC's reticle size, it had to use a chiplet design and the foundry's CoWoS 2.5D technology, which is completely logical as Nvidia's A100 was approaching the size of a reticle and the BR100 is supposed to be even larger given its higher transistor count.

Given the specs, we can speculate that BR100 basically uses two BR104s, though the developer has not formally confirmed that.

To commercialize its BR100 OAM accelerator, Biren worked with Inspur on an 8-way AI server that will be sampling starting Q4 2022. Baidu and China Mobile will be among the first customers to use Biren's compute GPUs.

(Image credit: Biren Technology)

GeForce RTX 4060 Ti Retailer-Listed Specs Look Worse Than RTX 3060 Ti

Zhiye Liu — Wed, 03 May 2023 17:07:33 +0000

Nvidia's upcoming GeForce RTX 4060 Ti aims to be one of the best graphics cards. However, if the latest listing from Russian distributor Marvel (via momomo_us) and rumors are accurate, the GeForce RTX 4060 Ti specifications look underwhelming on paper compared to the GeForce RTX 3060 Ti. That said, the GeForce RTX 4060 Ti does wield Nvidia's latest Ada Lovelace architecture, so there's definitely a performance uplift; the question is just how much.

The GeForce RTX 4060 Ti reportedly features 4,352 CUDA cores, 11% fewer than the GeForce RTX 3060 Ti. This is because so much of the increased performance on the GeForce RTX 4060 Ti comes from the new architecture. Accordingly, the GeForce RTX 4060 Ti would also have fewer Tensor and RT cores. But, again, the Ada-based graphics card uses the latest 4th-generation Tensor and 3rd-generation RT cores, which improve AI and ray tracing performance over the last-generation GeForce RTX 3060 Ti.

According to early speculation, the GeForce RTX 4060 Ti might debut with a 2,310 MHz base clock and 2,535 MHz boost clock. Some premium, heavily-overclocked models will allegedly boost up to 2,685 MHz. Unfortunately, FP32 performance is a terrible metric for measuring gaming performance. But for discussion's sake, the GeForce RTX 4060 Ti would deliver 22.06 TFLOPs, 36% higher FP32 performance than the GeForce RTX 3060 Ti. The figure looks impressive but remember that FP32 performance doesn't translate to real-world gaming performance.
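For readers who want to check the math, peak FP32 throughput is conventionally computed as CUDA cores x 2 FLOPs per clock (one fused multiply-add) x boost clock. The Python sketch below applies that formula to the rumored and known figures from this article; the function name is illustrative only.

```python
def fp32_tflops(cuda_cores: int, boost_mhz: float) -> float:
    """Peak FP32 TFLOPS: cores x 2 FLOPs per clock (one FMA) x clock speed."""
    return cuda_cores * 2 * boost_mhz * 1e6 / 1e12


# Rumored RTX 4060 Ti: 4,352 cores at a 2,535 MHz boost clock.
rtx_4060_ti = fp32_tflops(4352, 2535)
# RTX 3060 Ti: 4,864 cores at a 1,665 MHz boost clock.
rtx_3060_ti = fp32_tflops(4864, 1665)

# Relative differences, expressed as percentages.
tflops_uplift = (rtx_4060_ti / rtx_3060_ti - 1) * 100
core_deficit = (1 - 4352 / 4864) * 100

print(f"RTX 4060 Ti: {rtx_4060_ti:.2f} TFLOPS")      # 22.06
print(f"RTX 3060 Ti: {rtx_3060_ti:.2f} TFLOPS")      # 16.20
print(f"FP32 uplift: {tflops_uplift:.0f}%")          # 36%
print(f"CUDA core deficit: {core_deficit:.1f}%")     # 10.5%
```

The higher clocks more than offset the roughly 11% smaller core count, which is where the 36% paper advantage comes from.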

GeForce RTX 4060 Ti Specifications

Graphics Card	GeForce RTX 4060 Ti*	GeForce RTX 3060 Ti
Architecture	AD106	GA104
Process Technology	TSMC 4N	Samsung 8N
Transistors (Billion)	?	17.4
Die size (mm²)	190	392
SMs	32	38
GPU Cores	4,352	4,864
Tensor Cores	128	152
RT Cores	32	38
Base Clock (MHz)	2,310	1,410
Boost Clock (MHz)	2,535	1,665
L2 Cache (MB)	32	4
VRAM Speed (Gbps)	18	14
VRAM	8GB GDDR6	8GB GDDR6
VRAM Bus Width	128	256
ROPs	48	80
TMUs	128	152
TFLOPs FP32 (Boost)	22.06	16.20
Bandwidth (GBps)	288	448
TGP (watts)	220	200
Launch Date	2023	2020
MSRP	?	$399

*Specifications are unconfirmed.

Russian distributor Marvel listed four custom Palit GeForce RTX 4060 Ti graphics cards with 8GB of GDDR6 memory that operates through a 128-bit memory interface. The specifications align with other retailers' GeForce RTX 4060 Ti listings and the original leak. However, the memory configuration certainly has us wondering what Nvidia is doing with the GeForce RTX 4060 Ti.

The GeForce RTX 3060 Ti has a 256-bit memory interface. Even the GeForce RTX 3060 has a 192-bit bus. A 128-bit bus with 8GB of GDDR6 memory is the kind of setup that's on a GeForce RTX 3060 8GB or a GeForce RTX 3050. It's not a combination that you typically find on a Ti-tier SKU.

The GeForce RTX 4060 Ti reportedly wields 18 Gb/s memory, but even the faster memory won't help the Ada graphics card punch out more memory bandwidth. Limited by a 128-bit memory interface, the GeForce RTX 4060 Ti supplies 288 GB/s, 36% below the GeForce RTX 3060 Ti with its 14 Gb/s chips and 256-bit memory interface. Nonetheless, the GeForce RTX 4060 Ti's much larger L2 cache (32MB) will help mitigate the lower memory bandwidth, similar to AMD's Infinity Cache.
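The bandwidth figures follow from bus width multiplied by the per-pin data rate, divided by eight to convert bits to bytes. A small Python sketch of the calculation, using the numbers quoted above (the function name is made up for this example):

```python
def memory_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bus width (bits) x per-pin rate (Gb/s) / 8."""
    return bus_width_bits * data_rate_gbps / 8


# Rumored RTX 4060 Ti: 128-bit bus with 18 Gb/s GDDR6.
rtx_4060_ti_bw = memory_bandwidth_gbs(128, 18)
# RTX 3060 Ti: 256-bit bus with 14 Gb/s GDDR6.
rtx_3060_ti_bw = memory_bandwidth_gbs(256, 14)

# How far the newer card falls short, as a percentage.
deficit = (1 - rtx_4060_ti_bw / rtx_3060_ti_bw) * 100

print(rtx_4060_ti_bw)      # 288.0
print(rtx_3060_ti_bw)      # 448.0
print(f"{deficit:.1f}%")   # 35.7%
```

This raw figure does not account for the larger L2 cache, which reduces how often the GPU has to go out to VRAM in the first place.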

Palit GeForce RTX 4060 Ti

Model	Part Number	Memory	Memory Interface
GeForce RTX 4060 Ti Dual OC	NE6406TT19P1-1060D	8GB GDDR6	128 bit
GeForce RTX 4060 Ti Dual	NE6406T019P1-1060D	8GB GDDR6	128 bit
GeForce RTX 4060 Ti StormX OC	NE6406TS19P1-1060F	8GB GDDR6	128 bit
GeForce RTX 4060 Ti StormX	NE6406T019P1-1060F	8GB GDDR6	128 bit

The GeForce RTX 4060 Ti may already be around the corner. The latest rumors claim that Nvidia will start shipping GeForce RTX 4060 Ti graphics cards to the channel on May 5. That explains why we're seeing more listings from retailers worldwide.

There's no firm launch date yet, but Nvidia allegedly wants to launch the GeForce RTX 4060 Ti before the end of May. Despite all the leaks and retailer listings, the pricing for the GeForce RTX 4060 Ti remains unknown. So far, GeForce RTX 40-series graphics cards have launched with premium pricing. If we look at previous launches like the GeForce RTX 4090 ($1,599), GeForce RTX 4080 ($1,199), GeForce RTX 4070 Ti ($799), and GeForce RTX 4070 ($599), the upward trend in pricing is evident. So it would be foolish to think that the GeForce RTX 4060 Ti will debut at $399, the MSRP for the GeForce RTX 3060 Ti. We're not saying it can't happen, but it's unlikely.

Smugglers Hid 70 Graphics Cards Among 617 Pounds of Live Lobster

Mark Tyson — Wed, 03 May 2023 13:14:21 +0000

Another amusing episode of Tech Smuggling Fails has hit the Hong Kong news wires. This time, some intrepid smugglers were caught trying to get 70 "high value computer display cards" out of the territory, duty-free. What adds a unique flavor to this latest smuggling attempt is that the GPUs were hidden in a cargo which contained 280 kg (617 pounds) of live lobsters (approximately 200 lobsters). It isn't specified in Hong Kong media reports, but The Register thinks the GPU and lobsters shipment was probably caught going to mainland China.

Hong Kong Customs swooped on a van as it turned to cross the Hong Kong-Zhuhai-Macao Bridge. Apparently, the cargo of lobsters and GPUs had no associated paperwork, but was valued by the authorities at around HK$600,000 ($76,500). It isn't clear whether that is the combined value of the cargo, or just the headlining GPUs.

(Image credit: HK Customs)

An image posted by Hong Kong Customs, reproduced above, shows the proud haul of GPUs that was hidden among the crustaceans. We think the lobsters are somewhat tastier than these graphics cards, as they appear to be older entry-level Quadro cards. The Register suggests that the table full of GPUs are Nvidia Quadro K2200 cards. These are Maxwell architecture GPUs with 640 CUDA cores, supported by 4GB of GDDR5 on a 128-bit bus.

(Image credit: HK Customs)

It is a stretch by the authorities to refer to Quadro K2200 cards as "high value," as they were launched in mid-2014 and are roughly comparable to a GeForce GTX 750 Ti, or the even older Radeon HD 7850, in performance. A positive aspect of these Quadro cards is that they feature a compact single-slot design with a TDP of 68W, so they can take power solely from the PCIe slot.

The Hong Kong to mainland China route appears to be a favorite among tech smugglers, with the island territory boasting zero sales tax on goods, while the rate on the mainland is up to 13%. This attractive markup has tempted lots of ingenious, and crazy, attempts to get HK-bought tech into the mainland without being declared, and we have noted some interesting examples over recent months.

Our last report on tech smuggling into China highlighted an attempt to smuggle 6,000 microSD cards that were completely enclosed within a bicycle frame. Similarly, someone tried to smuggle 84 M.2 SSDs inside a scooter frame. Other failed smuggling attempt news, of interest to PC DIYers, includes one where a man tried to smuggle 240 Intel Raptor Lake processors into China by taping them to his body and legs. Another smuggler decided it was a good idea to hide 200 CPUs and nine iPhones in a prosthetic belly.

Colorful Unveils Liquid-Cooled GeForce RTX 4070

ashilov@gmail.com (Anton Shilov) — Tue, 02 May 2023 15:06:01 +0000

One of the world's largest suppliers of graphics cards, Colorful is no stranger to offering exclusive products. This week the company quietly rolled out the world's only GeForce RTX 4070 with a closed-loop cooling system. The graphics board promises to provide the ultimate overclocking experience, but its price may be too high for many to justify.

Colorful's iGame GeForce RTX 4070 Neptune OC-V relies on Nvidia's AD104-250 graphics processor with 5,888 CUDA cores mated with 12GB of GDDR6X memory using a 192-bit interface, but uses a proprietary printed circuit board with a 14+3-phase power delivery to maximize GPU overclocking potential. The board comes with a one-key overclocking function enabling up to 2640 MHz boost clock, up from 2,475 MHz by default. The board uses a 12+4-pin 12VHPWR auxiliary PCIe 5.0 power connector and is rated for a TDP of up to 230W, which is 30W higher than that of Nvidia's reference design.

The key selling point of the iGame GeForce RTX 4070 Neptune OC-V is of course its closed-loop liquid cooling system, which promises to provide superior cooling performance while allowing the board itself to be a bit more compact. While the card is still two slots wide, it's just 253.5 mm long, which is not too lengthy by today's standards. However, you'll have to fit a 277 mm radiator with two fans somewhere in the chassis too. With an enhanced VRM and LCS, the overclocking potential of Colorful's GeForce RTX 4070 Neptune promises to be significant, something that will put it in the list of the best graphics cards.

(Image credit: Colorful)

As the only liquid-cooled GeForce RTX 4070, the iGame GeForce RTX 4070 Neptune OC-V is expensive. VideoCardz claims that Colorful will charge $829 for it, but it is unclear whether this will be the price for the U.S. (which does not include taxes) or for China or Europe. Either way, it is considerably more expensive than Nvidia's MSRP for the GeForce RTX 4070 and approaches or exceeds the recommended price of the GeForce RTX 4070 Ti, which means that the new product will have to deliver performance on par with the higher-end part.

Nvidia's GeForce RTX 4070 is powered by the AD104-250 GPU with 5,888 CUDA cores, offers a compute throughput of up to 29.15 FP32 TFLOPS, and has an MSRP of $599. By contrast, Nvidia's GeForce RTX 4070 Ti is based on the AD104-400 GPU with 7,680 CUDA cores and offers up to 40 FP32 TFLOPS, but comes with an MSRP of $799.

(Image credit: Colorful)

In a bid to compensate for the lack of 1,792 CUDA cores and offer similar compute throughput as its Ti sibling, the vanilla GeForce RTX 4070 will have to work at around 3.40 GHz, which is about 925 MHz higher than the Nvidia-recommended boost clock for the RTX 4070 model. While we are sure that at least some AD104-250 GPUs are great overclockers, we are not sure that all of them can hit 3.40 GHz at 230W and offer performance like the GeForce RTX 4070 Ti with the AD104-400.
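The 3.40 GHz figure can be derived by inverting the usual peak-throughput formula (cores x 2 FLOPs per clock x frequency) and solving for the clock. The Python sketch below shows the calculation with the numbers quoted in this article; the function name is illustrative.

```python
def required_clock_ghz(target_tflops: float, cuda_cores: int) -> float:
    """Clock (GHz) needed to hit a peak FP32 target, at 2 FLOPs per core per clock."""
    return target_tflops * 1e12 / (2 * cuda_cores) / 1e9


# RTX 4070 (5,888 cores) matching the RTX 4070 Ti's quoted 40 FP32 TFLOPS.
needed_ghz = required_clock_ghz(40, 5888)

# Gap to the RTX 4070's default 2,475 MHz boost clock.
extra_mhz = needed_ghz * 1000 - 2475

print(f"{needed_ghz:.2f} GHz")   # 3.40 GHz
print(f"{extra_mhz:.0f} MHz")    # 922 MHz
```

The unrounded result is just under 3.40 GHz, so the gap is about 922 MHz; rounding to 3.40 GHz first gives the 925 MHz quoted above.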

Of course, a liquid cooling system may provide better cooling than an air cooler, which increases the card's value even if it cannot approach the performance of the GeForce RTX 4070 Ti to justify its high price. Whether or not a longer lifespan alone justifies a higher price is something that remains to be seen, though.