'Clarified' US sanctions do not impact Nvidia RTX 4090D 'Dragon' or H20 GPUs [Updated]

Nvidia RTX 4090D
(Image credit: Nvidia)

Update 4/4/2024 6:15am PT: We have been notified that the refined and 'clarified' U.S. sanctions do not impact Nvidia's existing sanctions-compliant GPUs for China, specifically the H20 and RTX 4090D.

The new document includes "Corrections and Clarifications" on the export controls, and some of the language was confusing and misinterpreted, by us and other sites. Specifically, the document details "adjusted peak performance" (APP) and "weighted teraflops" (WT), with a limit of 70 TFLOPS or less. We have received additional information from Nvidia on the restrictions and clarifications, and the short summary is that the sanctions-compliant H20 and 4090D GPUs are not impacted.

The specific reasons that the 4090D isn't affected have to do with the definitions. First, the guidelines are for computer systems, not individual GPUs, and more specifically they are for systems with memory coherence. As an example, a 4-way DGX H100 system would fall under this classification.

In an email from Nvidia, it states: "Processor combinations share memory when any processor is capable of accessing any memory location in the system through the hardware transmission of cache lines or memory words, without the involvement of any software mechanism, which may be achieved using “electronic assemblies” specified in 4A003.c, z.1, or z.3."

The other important detail is that the "adjusted peak performance" applies to FP64 throughput, and it's "weighted" because the value gets scaled based on whether it's a vector processor or a scalar (non-vector) processor. In other words, FP64 done via vector units like Nvidia Tensor cores is different from FP64 done via a CPU running 64-bit calculations. (That's a simplification, as CPUs can also include vector units.)

To determine the "weighted teraflops" and "adjusted peak performance," take the aggregate FP64 throughput of the system. Then multiply by 0.9 for vector processors or by 0.3 for non-vector processors. So going back to the 4-way DGX H100 as an example, the H100 SXM variant of the GPU has 67 teraflops of vector FP64 throughput. Four of them in aggregate would deliver 268 teraflops, and multiplied by 0.9 gives 241.2 — well above the 70 weighted teraflops limit. And of course, the HGX H100 would have already been restricted even prior to the more recent updates.
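The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation as described here, not the official regulatory formula: the function and constant names are our own, and treating the 70 figure as a strict "greater than" threshold is an assumption.

```python
# Weighting factors and limit as described in the article text.
VECTOR_WEIGHT = 0.9      # applied to vector processors
NON_VECTOR_WEIGHT = 0.3  # applied to non-vector (scalar) processors
WT_LIMIT = 70.0          # weighted teraflops limit (assumed strict threshold)


def weighted_teraflops(fp64_tflops_per_processor, processor_count, is_vector):
    """Aggregate FP64 throughput across the system, then apply the weight."""
    aggregate = fp64_tflops_per_processor * processor_count
    weight = VECTOR_WEIGHT if is_vector else NON_VECTOR_WEIGHT
    return aggregate * weight


# The article's example: four H100 SXM GPUs at 67 FP64 teraflops each.
h100_system_wt = weighted_teraflops(67.0, 4, is_vector=True)
print(f"4x H100 SXM: {h100_system_wt:.1f} WT, over limit: {h100_system_wt > WT_LIMIT}")
```

The same four GPUs scored with the non-vector weight of 0.3 would come to 80.4, which shows how much the processor classification matters to the result.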

So, what has actually changed? Succinctly, not much. These are not new export controls or restrictions but rather an addendum to attempt to clarify the official "speed limits." The RTX 4090D for its part hardly offers any FP64 throughput, only 1.15 TFLOPS, though it still comes close to the 4,800 TPP limit.
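For a sense of scale, here is a rough Python sketch comparing the RTX 4090D against both figures. It scores a single GPU purely for illustration, even though the weighted teraflops rule applies to memory-coherent systems rather than individual cards. The 0.9 vector weight, the 1.15 TFLOPS FP64 figure, and the 4,707 TPP figure all come from this article, and the helper name is our own.

```python
VECTOR_WEIGHT = 0.9  # weighting for vector processors, per the article
WT_LIMIT = 70.0      # weighted teraflops limit
TPP_LIMIT = 4800     # Total Processing Power limit


def headroom_percent(value, limit):
    """Return how far below the limit a value sits, as a percentage of the limit."""
    return (limit - value) / limit * 100.0


# RTX 4090D figures quoted in the article.
rtx_4090d_wt = 1.15 * VECTOR_WEIGHT
wt_headroom = headroom_percent(rtx_4090d_wt, WT_LIMIT)
tpp_headroom = headroom_percent(4707, TPP_LIMIT)

print(f"Weighted teraflops: {rtx_4090d_wt:.3f} ({wt_headroom:.1f}% below the limit)")
print(f"TPP: 4707 ({tpp_headroom:.1f}% below the limit)")
```

By this estimate the card sits more than 98% below the weighted teraflops figure but under 2% below the TPP limit, which is why the TPP rule is the one that actually constrains it.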

Original unedited article (which misinterpreted the 'clarifications' described above):

The United States government has revised its Chinese semiconductor export restrictions to encompass more high-performance hardware. Specifically, any semiconductor chip offering over 70 "Weighted TeraFLOPS" of performance is now banned from export to China without a license. This lowered limit now includes Nvidia's Chinese-exclusive RTX 4090D "Dragon" graphics card.

The RTX 4090D was made specifically to comply with the U.S. China export bans several months back. The RTX 4090 exceeded the 4,800 Total Processing Power (TPP) limit by 10%, and so Nvidia created the 4090D to come in below that limit (it lands at 4,707 TPP). Amazingly, the new 70 TFLOPS limit is only 5% lower than the RTX 4090D's 73.5 TFLOPS performance figure.

While this change was seemingly inevitable, we have to question whether it's even meaningful. After the launch of the RTX 4090D, the U.S. government warned Nvidia that its tactics wouldn't go unnoticed, and it has now moved to ban Nvidia's China-exclusive GPU. But does a 5% reduction in the GPU 'speed limit' even matter, and if so, what happens when Nvidia makes a new GPU that comes in below that limit?

The RTX 4090D is a cut-down variant of the RTX 4090, featuring 14,592 CUDA cores and a 425W TGP. Compared to the outgoing RTX 4090, the RTX 4090D has 10.9% fewer CUDA cores and a 5.6% lower TGP. The SM, Tensor core, RT core, and TMU counts are reduced in step with the CUDA cores, while the memory configuration, L2 cache, ROPs, and boost clock remain the same between the two. The only increase is the base clock, which has been brought up slightly to 2.28 GHz from 2.235 GHz.

RTX 4090D vs 4090 Specifications
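| Specification | RTX 4090D | RTX 4090 |
| --- | --- | --- |
| SMs | 114 | 128 |
| CUDA Cores | 14,592 | 16,384 |
| Tensor Cores | 456 | 512 |
| RT Cores | 114 | 128 |
| Boost Clock | 2,520MHz | 2,520MHz |
| Base Clock | 2,280MHz | 2,235MHz |
| VRAM Speed | 21Gbps | 21Gbps |
| VRAM Capacity | 24GB GDDR6X | 24GB GDDR6X |
| VRAM Bus Width | 384-bit | 384-bit |
| VRAM Bandwidth | 1,008GB/s | 1,008GB/s |
| L2 Cache | 72MB | 72MB |
| ROPs | 176 | 176 |
| TMUs | 456 | 512 |
| TGP | 425W | 450W |
| Total Processing Power | 4,707 | 5,285 |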
Aaron Klotz
Contributing Writer

Aaron Klotz is a contributing writer for Tom’s Hardware, covering news related to computer hardware such as CPUs and graphics cards.