AMD unwraps Instinct MI500 boasting 1,000X more performance versus MI300X — setting the stage for the era of YottaFLOPS data centers

AI data centers' demand for compute is set to increase dramatically, from around 100 ZettaFLOPS today to more than 10 YottaFLOPS* in the next five years (roughly a 100-fold increase), according to AMD. To stay relevant, hardware makers must therefore increase the performance of their products across the full stack every year. To that end, AMD chief executive Lisa Su used the company's CES keynote to announce the Instinct MI500X-series AI and HPC GPUs, due in 2027.

"Demand for compute is growing faster than ever," said Lisa Su, chief executive of AMD. "Meeting that demand means continuing to push the envelope on performance far beyond where we are today. MI400 was the major inflection point in terms of delivering leadership training across all workloads, inference, and scientific computing. We are not stopping there. Development of our next-generation MI500-series is well underway. With MI500, we take another major leap on performance. It is built on our next gen CDNA 6 architecture [and] manufactured on 2nm process technology and uses higher speed HBM4E memory."


A 1000X performance increase in four years is a major achievement, though we should keep in mind that between the Instinct MI300X and the Instinct MI500 there is a three-generation instruction set architecture (ISA) gap (CDNA 3 => CDNA 6), a three-generation memory gap (HBM3 => HBM4E), the addition of FP4 and other low-precision formats, faster scale-up interconnects, and possibly a PCIe 6.0 connection to the host CPU.

Nonetheless, the Instinct MI500 will be an all-new generation of AMD's AI and HPC GPUs with major architectural improvements, which probably include substantially higher tensor/matrix-compute density, tighter integration between compute and memory, and significantly improved performance-per-watt, perhaps achieved through a combination of the new ISA and TSMC's N2P fabrication process.

*One YottaFLOPS equals 1,000 ZettaFLOPS, or one million ExaFLOPS.
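
The unit arithmetic behind these figures can be checked with a short sketch. The prefixes are the standard SI powers of ten, and the two data-center figures are AMD's as quoted above. The `growth_factor` helper and the assumption of steady compound growth over exactly five years are illustrative only and do not come from AMD.

```python
# Standard SI prefixes expressed as powers of ten.
EXA = 10**18
ZETTA = 10**21
YOTTA = 10**24


def growth_factor(start_flops: int, end_flops: int) -> float:
    """Return how many times larger end_flops is than start_flops."""
    return end_flops / start_flops


today = 100 * ZETTA     # "around 100 ZettaFLOPS today" (AMD's figure)
projected = 10 * YOTTA  # "10+ YottaFLOPS" in about five years (AMD's figure)

total = growth_factor(today, projected)

# Assumption for illustration: growth compounds evenly over five years.
per_year = total ** (1 / 5)

print(YOTTA // ZETTA)  # 1000 ZettaFLOPS per YottaFLOPS
print(YOTTA // EXA)    # 1000000 ExaFLOPS per YottaFLOPS
print(total)           # 100.0
print(round(per_year, 2))
```

Under that even-growth assumption, a 100-fold increase over five years works out to roughly 2.5 times more compute every year.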


Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.

9 comments from the forums

usertests

Achieving a 1000X performance increase in four years is a major achievement, though we should keep in mind that between the Instinct MI300X and Instinct MI500 there is a three-generational instruction set architecture (ISA) gap (CDNA 3 => CDNA 6), a three generational memory gap (HBM3 => HBM4E), an addition of FP4 and other low-precision formats, faster scale-up interconnects, and possibly PCIe 6.0 interconnection to host CPU.
That's pretty unfathomable marketing. I have to imagine it's some edge case or something that couldn't run well in lower memory capacity, with lower precision added.
edzieba

AMD says that its Instinct MI500X GPUs will offer up to 1,000 times higher AI performance compared to the Instinct MI300X accelerator from late 2023, but does not exactly define comparison metrics.
Presumably Bungholiomarks. Anything else would hardly be considered a reputable performance metric!
emerth

I'm thinking 1000x the FP4 perf compared to Mi300 FP16 perf. That or AMD is implementing 2 bit FP.
bit_user

emerth said:
I'm thinking 1000x the FP4 perf compared to Mi300 FP16 perf. That or AMD is implementing 2 bit FP.
Well, AMD claims MI300X had up to 5.22 POPS of sparse int8 performance. I wonder if they're comparing theoretical MI500X performance on something like BFP4 vs. the actual achieved performance of the MI300X. Even then, 1000x seems like quite a stretch. I could probably believe 100x, though.
qwertymac93

The 1000x claim is probably for whole system performance, not just a single card. Taking into account interconnect advancements and larger addressable memory, 1000x seems possible, if unfair in the real world. You'd never use these systems beyond what they are clearly bottlenecked.
bit_user

qwertymac93 said:
The 1000x claim is probably for whole system performance, not just a single card. Taking into account interconnect advancements and larger addressable memory, 1000x seems possible,
System performance is determined by your biggest bottleneck. It really doesn't matter what else you do, as that bottleneck will be the limiting factor.

As such, improvements in memory bandwidth, compute, and I/O are never multiplicative. One of them will be the limiting factor and how much you improved the system from whatever was your previous bottleneck is what determines system performance.

qwertymac93 said:
You'd never use these systems beyond what they are clearly bottlenecked.
This statement doesn't make sense. The bottleneck fundamentally constrains actual use. You cannot push it beyond what the bottleneck allows.
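
The bottleneck argument above can be illustrated with a minimal sketch. It assumes a simplified model in which compute, memory traffic, and I/O overlap perfectly, so a step takes as long as its slowest resource. The `step_time` and `speedup` helpers and all of the numbers are hypothetical and are not drawn from any AMD material.

```python
def step_time(work: dict, rates: dict) -> float:
    """Time for one step when all resources overlap: the slowest one dominates."""
    return max(work[name] / rates[name] for name in work)


def speedup(work: dict, old_rates: dict, new_rates: dict) -> float:
    """Ratio of old step time to new step time for the same workload."""
    return step_time(work, old_rates) / step_time(work, new_rates)


# Hypothetical workload in arbitrary units, equally demanding on each resource.
work = {"compute": 100.0, "memory": 100.0, "io": 100.0}

# Hypothetical throughput per resource, before and after an upgrade.
old = {"compute": 1.0, "memory": 1.0, "io": 1.0}
new = {"compute": 10.0, "memory": 4.0, "io": 2.0}

print(speedup(work, old, new))  # 2.0, not 10 * 4 * 2 = 80
```

In this model the 10x compute and 4x memory gains are hidden behind the 2x I/O gain. If the resources did not overlap at all, the step time would be a sum rather than a maximum, and the individual gains would still not multiply.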
qwertymac93

bit_user said:
...

This statement doesn't make sense. The bottleneck fundamentally constrains actual use. You cannot push it beyond what the bottleneck allows.
Sure you can. You can run a 10GB model on an 8GB card and have part of the model paged in system RAM. And it'll run way slower since the bottleneck will have shifted. And then next year you can run the same model on a slightly faster card with 12GB of vram and claim a 100x improvement. 😏 That's what I was trying to imply here. Not that an actual customer would do that, but what customers do in the real world and what manufacturers claim aren't always aligned, are they?
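
A rough sketch of the paging scenario described above, under loudly stated assumptions: each pass streams the whole model once, the part that fits is read at VRAM bandwidth, and the overflow is read over a much slower host link. The `pass_time_s` helper and every bandwidth figure are hypothetical, chosen only to show how the bottleneck shifts.

```python
def pass_time_s(model_gb: float, vram_gb: float,
                vram_bw_gbs: float, host_bw_gbs: float) -> float:
    """Seconds to stream the whole model once, spilling any overflow to host RAM."""
    resident_gb = min(model_gb, vram_gb)
    spilled_gb = max(0.0, model_gb - vram_gb)
    return resident_gb / vram_bw_gbs + spilled_gb / host_bw_gbs


MODEL_GB = 10.0

# Hypothetical older card: 8 GB of VRAM, so 2 GB of the model is paged to host RAM.
old = pass_time_s(MODEL_GB, vram_gb=8.0, vram_bw_gbs=500.0, host_bw_gbs=25.0)

# Hypothetical newer card: 12 GB of VRAM and only 20% more VRAM bandwidth.
new = pass_time_s(MODEL_GB, vram_gb=12.0, vram_bw_gbs=600.0, host_bw_gbs=25.0)

print(f"measured speedup: {old / new:.2f}x")
print(f"raw VRAM bandwidth gain: {600.0 / 500.0:.2f}x")
```

With these made-up numbers, a card that is only 1.2 times faster on paper comes out a little under 6 times faster on this workload, purely because the model now fits in VRAM. The size of the effect depends entirely on how slow the spill path is relative to VRAM.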
bit_user

qwertymac93 said:
You can run a 10GB model on an 8GB card and have part of the model paged in system RAM. And it'll run way slower since the bottleneck will have shifted. And then next year you can run the same model on a slightly faster card with 12GB of vram and claim a 100x improvement. 😏
I'll accept that it could be something like that.

I wonder if the slide deck has been published. If so, it might contain some end notes which provide more insight into that number. Without more information, we can't really say any more.
DS426

Hmm, more die space for tensor/matrix cores makes things trickier when talking about the transition to UDNA. Looks like we won't see it any earlier than 2028, especially if RDNA5 launches in 2027H2. I'm good with that though as some ISA improvement over RDNA4 combined with a bigger die (or dies) could produce a real beast. AMD just has to figure out what memory to employ that's both performant and not overly expensive (and available at all, lol).