AMD's Instinct MI355X accelerator will reportedly consume 1,400 watts
CDNA 4 challenges Blackwell Ultra.

Mark Papermaster, chief technology officer of AMD, formally introduced the company's Instinct MI355X accelerators for AI and HPC at ISC 2025 — revealing massive performance improvements for AI inference, but also pointing to nearly doubled power consumption of the new flagship GPU compared to its predecessor from 2023, reports ComputerBase.
AMD's CDNA 4 enters the scene
AMD's Instinct MI350X-series GPUs are based on the CDNA 4 architecture, which introduces support for FP4 and FP6 precision formats alongside FP8 and FP16. These lower-precision formats have grown in relevance for AI workloads, particularly inference. AMD positions its Instinct MI350X processors primarily for inference, which makes sense, as the scale-out world size of the MI350X remains limited to eight GPUs, reducing its competitiveness against Nvidia's Blackwell GPUs. Still, Pegatron is readying a 128-way MI350X machine.
AMD's Instinct MI350X family of AI and HPC GPUs consists of two models: the default Instinct MI350X module, rated at 1,000W and designed for air cooling, and the higher-performance Instinct MI355X, which will consume up to 1,400W and is designed primarily for direct liquid cooling (though AMD believes some of its clients will be able to air-cool the MI355X as well).
Both SKUs will come with 288GB of HBM3E memory offering up to 8 TB/s of bandwidth, but the MI350X tops out at 18.45 PFLOPS of FP4/FP6 compute, whereas the MI355X is said to push peak FP4/FP6 performance to 20.1 PFLOPS. On paper, both Instinct MI350X models outperform Nvidia's B300 (Blackwell Ultra) GPU, which tops out at 15 PFLOPS of FP4, though it remains to be seen how AMD's MI350X and MI355X perform in real-world applications.
Specification | AMD Instinct MI325X GPU | AMD Instinct MI350X GPU | AMD Instinct MI350X Platform (8x OAM) | AMD Instinct MI355X GPU | AMD Instinct MI355X Platform (8x OAM) |
GPUs | Instinct MI325X OAM | Instinct MI350X OAM | 8x Instinct MI350X OAM | Instinct MI355X OAM | 8x Instinct MI355X OAM |
GPU Architecture | CDNA 3 | CDNA 4 | CDNA 4 | CDNA 4 | CDNA 4 |
Dedicated Memory Size | 256 GB HBM3E | 288 GB HBM3E | 2.3 TB HBM3E | 288 GB HBM3E | 2.3 TB HBM3E |
Memory Bandwidth | 6 TB/s | 8 TB/s | 8 TB/s per OAM | 8 TB/s | 8 TB/s per OAM |
Peak Half Precision (FP16) Performance | 2.61 PFLOPS | 4.6 PFLOPS | 36.8 PFLOPS | 5.03 PFLOPS | 40.27 PFLOPS |
Peak Eight-bit Precision (FP8) Performance | 5.22 PFLOPS | 9.228 PFLOPS | 72 PFLOPS | 10.1 PFLOPS | 80.53 PFLOPS |
Peak Six-bit Precision (FP6) Performance | - | 18.45 PFLOPS | 148 PFLOPS | 20.1 PFLOPS | 161.06 PFLOPS |
Peak Four-bit Precision (FP4) Performance | - | 18.45 PFLOPS | 148 PFLOPS | 20.1 PFLOPS | 161.06 PFLOPS |
Cooling | Air | Air | Air | DLC / Air | DLC / Air |
Typical Board Power (TBP) | 1000W Peak | 1000W Peak | 1000W Peak per OAM | 1400W Peak | 1400W Peak per OAM |
Compared with its predecessor, the FP8 compute throughput of the MI350X is listed at approximately 9.3 PFLOPS, while the faster MI355X is rated at 10.1 PFLOPS, up from 2.61/5.22 PFLOPS of FP8 (without/with structured sparsity) in the case of the Instinct MI325X, which represents a significant generational improvement. Meanwhile, the MI355X also outperforms Nvidia's B300 by 0.1 PFLOPS of FP8.
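For a quick sense of the generational uplift, here is a minimal Python sketch comparing the peak FP8 figures quoted above (vendor-quoted peaks, not measured throughput; the B300 value is inferred from the 0.1-PFLOPS gap noted above):

```python
# Vendor-quoted peak FP8 throughput in PFLOPS (figures from the article and the table above).
peak_fp8_pflops = {
    "Instinct MI325X (CDNA 3)": 5.22,   # with structured sparsity
    "Instinct MI350X (CDNA 4)": 9.228,
    "Instinct MI355X (CDNA 4)": 10.1,
    "Nvidia B300 (as quoted)": 10.0,
}

baseline = peak_fp8_pflops["Instinct MI325X (CDNA 3)"]
for name, pflops in peak_fp8_pflops.items():
    print(f"{name}: {pflops:>6.2f} PFLOPS  ({pflops / baseline:.2f}x vs. MI325X)")
# MI350X lands at ~1.77x and MI355X at ~1.93x the MI325X's peak FP8 rate.
```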
Faster GPUs incoming
Papermaster expressed confidence that the industry will continue to develop even more powerful CPUs and accelerators for supercomputers, reaching zettascale performance about a decade from now. However, that performance will come at the cost of a steep increase in power consumption, which is why a supercomputer offering ZettaFLOPS-class performance could consume 500 MW of power, half of what a nuclear power plant can produce.
At ISC 2025, AMD presented data showing that top supercomputers have consistently followed a trajectory in which compute performance doubles roughly every 1.2 years. The graph covered performance from 1990 to the present, charting peak system GFLOPS. Early growth was driven by CPU-only systems, but from around 2005, a shift to heterogeneous architectures mixing CPUs with GPUs and accelerators took over. Now, in what AMD calls the 'AI Acceleration Era,' systems like El Capitan and Frontier are pushing beyond 1 ExaFLOPS, continuing the exponential growth trend with increasingly AI-specialized hardware.
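Taken at face value, that cadence also explains the "about a decade" timeline for zettascale. A rough sanity check, assuming the 1.2-year doubling time holds and starting from today's roughly 1-ExaFLOPS systems:

```python
import math

doubling_time_years = 1.2        # doubling cadence from AMD's ISC 2025 slide
current_peak_exaflops = 1.0      # El Capitan / Frontier-class systems, ~1 ExaFLOPS
target_exaflops = 1000.0         # 1 ZettaFLOPS = 1,000 ExaFLOPS

doublings = math.log2(target_exaflops / current_peak_exaflops)
years = doublings * doubling_time_years
print(f"~{doublings:.1f} doublings, or roughly {years:.0f} years to a zettascale system")
# ~10.0 doublings, or roughly 12 years, i.e. sometime in the 2030s
```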
But performance comes at the cost of power consumption. To maintain performance growth, memory bandwidth and power scaling have become urgent challenges. AMD's slide indicated that GPU memory bandwidth must more than double every two years to preserve the ratio of bandwidth per FLOPS. This has required increasing the number of HBM stacks per GPU, which in turn results in larger and more power-hungry GPUs and modules.
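To put "bandwidth per FLOPS" into concrete terms, here is a quick ratio check using the peak numbers from the spec table above (peak HBM bandwidth divided by peak FP16 compute; real workloads land well below both peaks):

```python
# Peak HBM bandwidth (TB/s) and peak FP16 compute (PFLOPS), from the spec table above.
gpus = {
    "MI325X": {"bw_tb_s": 6.0, "fp16_pflops": 2.61},
    "MI350X": {"bw_tb_s": 8.0, "fp16_pflops": 4.6},
    "MI355X": {"bw_tb_s": 8.0, "fp16_pflops": 5.03},
}

for name, spec in gpus.items():
    bytes_per_flop = (spec["bw_tb_s"] * 1e12) / (spec["fp16_pflops"] * 1e15)
    print(f"{name}: {bytes_per_flop:.4f} bytes of HBM bandwidth per peak FP16 FLOP")
# The ratio falls from ~0.0023 (MI325X) to ~0.0016 (MI355X): peak compute is growing
# faster than memory bandwidth, which is exactly the pressure AMD's slide describes.
```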
Indeed, the power consumption of accelerators for supercomputers is increasing rapidly. While AMD's Instinct MI300X, introduced in mid-2023, consumed 750W peak, the Instinct MI355X, set to be formally unveiled this week, will feature a peak power consumption of 1,400W. Papermaster envisions 1,600W accelerators in 2026 – 2027 and then 2,000W processors later this decade. By contrast, rival Nvidia appears even more ambitious when it comes to power consumption: its Rubin Ultra GPUs, featuring four reticle-sized compute chiplets, are expected to consume as much as 3,600W.
The good news is that alongside the increase in power consumption, supercomputers and accelerators have also been gaining energy efficiency rapidly. Another of AMD's ISC 2025 keynote slides illustrated that efficiency increased from about 3.2 GFLOPS/W in 2010 to approximately 52 GFLOPS/W by the time exascale systems like Frontier arrived.
Looking ahead, maintaining this pace of performance scaling will require doubling energy efficiency every 2.2 years. A projected zettascale system delivering 1,000× exaflop-class performance would need around 500 MW of power at an efficiency level of 2,140 GFLOPS/W, a 41-fold increase from today. Without such gains, future supercomputers could demand gigawatt-scale energy, comparable to the output of an entire nuclear power plant, making them far too expensive to operate.
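The 500 MW figure follows directly from those numbers; a minimal sketch of the arithmetic, using the values quoted from AMD's slide:

```python
target_flops = 1e21                   # 1 ZettaFLOPS
projected_gflops_per_watt = 2140      # efficiency AMD projects for a zettascale system
todays_gflops_per_watt = 52           # roughly Frontier-class efficiency today

power_mw = target_flops / (projected_gflops_per_watt * 1e9) / 1e6
gain_needed = projected_gflops_per_watt / todays_gflops_per_watt
print(f"Power at {projected_gflops_per_watt} GFLOPS/W: ~{power_mw:.0f} MW")  # ~467 MW, i.e. around 500 MW
print(f"Efficiency gain required over today: ~{gain_needed:.0f}x")           # ~41x
```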
AMD believes that to dramatically increase the performance of supercomputers a decade from now, it will not only need to make a number of architectural breakthroughs, but the industry will also have to keep memory bandwidth in step with compute capabilities. Still, using nuclear reactors to power supercomputers in the 2030s looks like an increasingly realistic possibility.

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.
DS426: CDNA4 appears to be a decent improvement over CDNA3, as both the MI325X and the "default" MI350X are rated at 1000W, yet performance is up considerably on the latter. I'm sure the AI zealots also appreciate the new data formats.
Stomx: Curious about this computer slang: how can chiplets be "reticle-sized"?
Wiki: A reticle, also known as a crosshair, is a pattern of fine lines or markings built into the eyepiece of an optical device such as a telescopic sight, spotting scope, theodolite, or microscope to provide a measurement reference during visual inspection.
Stomx: Also curious: when making specialized GPUs by removing the HPC-oriented FP64 instruction hardware from a chip, how much silicon is actually saved? If it is relatively small and negligible, let it stay there. And when all this AI hardware becomes obsolete as quickly as it caught on and heads to the city dumps en masse, with FP64 it could at least still be used for HPC.
DavidC2:
Stomx said: "Curious about this computer slang: how can chiplets be 'reticle-sized'?"
A reticle in this case basically means that after the light passes through a mirror, it projects an image of the pattern onto a surface. In semiconductors, that pattern is about 850mm², so the maximum size per chiplet is about 850mm².
Stomx said: "Also curious: when making specialized GPUs by removing the HPC-oriented FP64 instruction hardware from a chip, how much silicon is actually saved?"
It's significant. FP64 is actually fairly power-intensive (only when FP64 is in use) and die-area intensive. Looking at the whole chip, it would be in the range of 10-20% of the area. At the individual SM level it's even more significant, because chips also have components that don't compute, such as memory controllers, IO connections, caches, and other accelerators. In that case it may be in the range of 20-30%.
Full FP64 compliance is set by IEEE, and it wasn't used by any AMD/Nvidia GPUs until fairly recently. It's because compliance took extra effort and there's the power and transistor cost associated with it. Then, "GPUs" actually started living up to the name and did more than just run games, and went into supercomputers.
The difference between FP64 and FP32 is precision. In High Performance Computing(HPC) where you are simulating real world stuff, accuracy is important. In games where you have millions of pixels moving rapidly and changing all the time... not so much.
AI sacrifices precision even more: FP16 and INT8, for example. I would not be surprised if AI hallucinations are caused significantly by this. The problem is, FP32 takes twice as many units as FP16, and FP64 twice the amount of FP32, so that's why they use lower precision.
The precision losses are more significant though. It's like comparing 16-bit color vs 8-bit color: one is 2 to the power of 16 colors, which is 65,536, and the other is 2 to the power of 8, which is only 256. So by going from FP64 to FP32, you might go from practically zero errors to suddenly a few percent of errors with FP32. And then if the AI is trying to relearn from finished data, you are multiplying the errors. This isn't even taking into account problems with algorithms, or the quality of the original data. It's like a printer that takes a 1200 dpi picture and prints at 300 dpi.
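As an illustration of the kind of precision loss DavidC2 describes, here is a minimal numpy sketch that forces half-precision rounding on every accumulation step; the values are purely illustrative and not tied to any particular GPU:

```python
import numpy as np

# Accumulate 10,000 copies of 0.01; the exact sum is 100.0.
increment = 0.01
acc16 = np.float16(0.0)
acc64 = np.float64(0.0)

for _ in range(10_000):
    acc16 = np.float16(acc16 + np.float16(increment))  # rounds to ~3 significant decimal digits each step
    acc64 += increment

print(f"FP64 running sum: {acc64:.4f}")         # ~100.0000
print(f"FP16 running sum: {float(acc16):.4f}")  # stalls far below 100 once each 0.01 rounds away entirely
```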
Jame5: "Still, using nuclear reactors to power supercomputers seems in the 2030s seems to be a more and more realistic possibility."
Great article. Last sentence typo is disconcerting though. Not the best way to end it.