AMD's Instinct MI250X compute accelerator is undoubtedly one of the most impressive products the company has released in recent years. The card will power Frontier, the industry's first exascale supercomputer, as well as smaller upcoming high-performance computing (HPC) deployments. Unfortunately, very few of us will ever see this OAM board (or other compute accelerators) in real life, but Patrick Kennedy from ServeTheHome filled that gap this week with pictures of the card on display. Our rough math says that each of the two GPU dies measures ~790 mm^2, putting them among the largest GPUs ever made. The whole card is rumored to consume up to 550W of power.
All three American exascale supercomputers announced to date will use HPE's Cray Shasta architecture. Two of them, Frontier and El Capitan, will be powered by AMD's EPYC processors and Instinct accelerators, whereas the third, Aurora, will be based on Intel's Xeon Scalable CPUs and Ponte Vecchio compute GPUs. Since AMD is set to power the world's first exascale system (at least as far as official numbers are concerned), which is due to be deployed in the coming weeks or months, the company naturally demonstrated HPE's Cray EX235a node featuring its EPYC processor and Instinct MI200-series accelerators at Supercomputing 2021 (SC21).
AMD's Instinct MI250X compute GPU, codenamed Aldebaran, consists of two graphics compute dies (GCDs), each of which packs 29.1 billion transistors and is paired with 64GB of HBM2e memory connected over a 4096-bit interface (128GB of HBM2e over an 8192-bit interface in total). With 14,080 stream processors and up to 95.7 FP64 TFLOPS of matrix performance, the Instinct MI250X is the highest-performing HPC accelerator released to date. The part comes in the open accelerator module (OAM) form factor and measures 102mm x 165mm, which is pretty large.
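Just for illustration, here is a quick sketch of how the per-GCD numbers add up to the headline card-level figures. The per-GCD values come from the specs above; the ~1.7 GHz peak engine clock and the FLOPs-per-clock factors are our assumptions based on public MI200 spec sheets, not figures confirmed in the photos.

```python
# Back-of-the-envelope sketch of how the headline MI250X numbers add up.
# Per-GCD figures are from the article; the ~1.7 GHz peak engine clock and
# the FLOPs-per-clock factors are assumptions based on public MI200 specs.

GCDS_PER_CARD = 2
TRANSISTORS_PER_GCD = 29.1e9          # from the article
HBM2E_PER_GCD_GB = 64                 # from the article
HBM2E_BUS_PER_GCD_BITS = 4096         # from the article
STREAM_PROCESSORS = 14_080            # total across both GCDs (article)

PEAK_CLOCK_GHZ = 1.7                  # assumed peak engine clock
FP64_VECTOR_FLOPS_PER_SP = 2          # assumed: one FMA per SP per clock
FP64_MATRIX_FLOPS_PER_SP = 4          # assumed: matrix pipes double the FMA rate

print(f"Transistors: {GCDS_PER_CARD * TRANSISTORS_PER_GCD / 1e9:.1f} billion")
print(f"HBM2e capacity: {GCDS_PER_CARD * HBM2E_PER_GCD_GB} GB")
print(f"HBM2e bus width: {GCDS_PER_CARD * HBM2E_BUS_PER_GCD_BITS} bits")

vector_tflops = STREAM_PROCESSORS * FP64_VECTOR_FLOPS_PER_SP * PEAK_CLOCK_GHZ / 1e3
matrix_tflops = STREAM_PROCESSORS * FP64_MATRIX_FLOPS_PER_SP * PEAK_CLOCK_GHZ / 1e3
print(f"Peak FP64 vector: ~{vector_tflops:.1f} TFLOPS")
print(f"Peak FP64 matrix: ~{matrix_tflops:.1f} TFLOPS")
```

With those assumptions, the math lands at roughly 47.9 FP64 vector TFLOPS and 95.7 FP64 matrix TFLOPS per card, which matches the figure quoted above.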
Each GCD has its own set of supporting chips, including power controllers, voltage regulator modules, and firmware. We have no idea what function the huge white box in the lower-left corner of the card performs, but we will do our best to find out if we ever get to play with this card. Make sure to visit ServeTheHome for more AMD Instinct MI250X pictures.
Knowing the dimensions of the card and the dimensions of some of the chips used on it (e.g., an SOIC-8 chip located to the left of the GPU package), we can make some very rough guesses about the dimensions of the Aldebaran GCDs. Of course, this kind of napkin math is not very accurate, especially when based on images like these, but it looks like we are dealing with dies of roughly 745 mm^2 to 790 mm^2.
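For the curious, here is roughly how that scale-reference estimate works. The ~4.9 mm SOIC-8 body length is a standard package dimension, but the pixel measurements below are hypothetical placeholders for illustration only, not values taken from the actual ServeTheHome photo.

```python
# Illustrative sketch of the "napkin math" die-size estimate from a photo.
# The SOIC-8 body length (~4.9 mm) is a standard package dimension; the pixel
# measurements below are hypothetical, not taken from the real photo.

SOIC8_BODY_MM = 4.9          # typical SOIC-8 body length used as a scale reference
soic8_px = 38                # hypothetical: SOIC-8 body length measured in pixels
die_width_px = 210           # hypothetical: GCD width measured in pixels
die_height_px = 218          # hypothetical: GCD height measured in pixels

mm_per_px = SOIC8_BODY_MM / soic8_px
die_width_mm = die_width_px * mm_per_px
die_height_mm = die_height_px * mm_per_px
die_area_mm2 = die_width_mm * die_height_mm

print(f"Scale: {mm_per_px:.4f} mm/px")
print(f"Estimated GCD: {die_width_mm:.1f} mm x {die_height_mm:.1f} mm "
      f"= ~{die_area_mm2:.0f} mm^2")
```

Small errors in the pixel measurements (or perspective distortion in the photo) swing the result by tens of square millimeters, which is why we quote a 745 mm^2 to 790 mm^2 range rather than a single number.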
To put these die sizes into context, Nvidia's A100 measures 826 mm^2. Keeping in mind how many FP64-capable stream processors each Aldebaran GCD packs (7,040) and the fact that these SPs need to be fed with plenty of data, it is clear that the design is very SRAM-intensive, which is why the die size is huge (since SRAM barely scales these days).
Complex processors tend to consume a lot of power, and the OAM form factor is just what the doctor ordered for such accelerators, as it can supply up to 700W. Rumor has it that AMD's Instinct MI250X consumes up to 550W, delivered via a 26-phase voltage regulator module. To cool down such a beast, HPE plans to use liquid cooling. It remains to be seen what kind of cooling other types of systems will use.
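To get a feel for what that rumored figure means for the power delivery, here is a rough sketch. The 550W board power and 26 phases come from the rumor above; the ~0.8 V core voltage and ~90% conversion efficiency are our assumptions for illustration only.

```python
# Rough sketch of what the rumored 550W board power means for a 26-phase VRM.
# The ~0.8 V core voltage and ~90% conversion efficiency are assumptions for
# illustration only; neither figure appears in the article.

BOARD_POWER_W = 550          # rumored total board power
VRM_PHASES = 26              # rumored phase count
CORE_VOLTAGE_V = 0.8         # assumed GPU core voltage
VRM_EFFICIENCY = 0.90        # assumed conversion efficiency

power_per_phase = BOARD_POWER_W / VRM_PHASES
total_current = BOARD_POWER_W / CORE_VOLTAGE_V
current_per_phase = total_current / VRM_PHASES
input_power = BOARD_POWER_W / VRM_EFFICIENCY

print(f"~{power_per_phase:.0f} W per phase")
print(f"~{total_current:.0f} A total, ~{current_per_phase:.0f} A per phase")
print(f"~{input_power:.0f} W drawn before conversion losses")
```

Under those assumptions, each phase handles a little over 20W (roughly 26A), a comfortable load for a server-class VRM, and the OAM socket's 700W ceiling leaves plenty of headroom.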
One interesting thing about the demonstrated card is that it still carries an ES (engineering sample) marking, even though AMD has been shipping its Instinct MI200-series compute GPUs for revenue since the second quarter. So perhaps the pictured card is not final, and commercial boards will differ slightly.
Another noteworthy detail is that the card was made in Canada, where ATI Technologies, now AMD's graphics division, used to be headquartered. Apparently, AMD still has a big presence in Canada and even makes (or at least prototypes) some of its most important products there. That matters because the Instinct MI250X cards will be installed in exascale supercomputers set to handle some of the most complex computations around, including those that concern national security.