Over the past several months, Intel's Raja Koduri has been slowly revealing portions of the upcoming Xe Graphics architecture and lineup. The Xe family will span everything from low-power integrated and entry-level graphics solutions under the Xe LP Graphics brand, up through data center multi-chip solutions with die stacking. The latter is what we're talking about.
As discussed last week, Intel's Xe HP Graphics will come in three variants. The base model, which has been shown several times, has a single tile with 512 EUs (Execution Units) and most likely two HBM2e stacks. Intel hasn't confirmed exact specs, but it did show how performance on a computational workload scales across the 1-tile, 2-tile, and 4-tile variants:
The scaling from the additional tiles might look almost too perfect, but it's important to note that this is not a real-time graphics workload. Splitting work between GPUs for technologies like SLI and CrossFire is far more difficult, and adding a second GPU usually nets gamers only 50-80% more performance at best. For computational workloads, however, tasks are often independent and can thus come close to perfect scaling.
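One simple way to picture the difference is Amdahl's law. This is our own illustration, not anything Intel presented: it treats a workload as a portion that cannot be split across GPUs plus a remainder that scales perfectly. The function names and the percentages fed into them below are hypothetical, and real multi-GPU gaming losses come from more than a single "serial" portion.

```python
# Illustrative model only (Amdahl's law); not Intel's methodology or data.

def speedup(n_units, serial_fraction):
    """Amdahl's law: speedup from n_units when serial_fraction cannot be split."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_units)


def serial_fraction_for(observed_speedup, n_units):
    """Invert Amdahl's law: the non-scaling fraction implied by a measured speedup."""
    return (n_units / observed_speedup - 1.0) / (n_units - 1.0)


# Fully independent tasks: scaling tracks the tile count.
for tiles in (1, 2, 4):
    print(f"{tiles} tile(s), 0% serial: {speedup(tiles, 0.0):.2f}x")

# A second GPU that adds 50-80% in games implies a sizable non-scaling portion.
for gain in (0.5, 0.8):
    s = serial_fraction_for(1.0 + gain, 2)
    print(f"+{gain:.0%} from a second GPU -> about {s:.0%} of the work doesn't scale")
```

In this simplified model, even a modest non-scaling portion is enough to pull a dual-GPU setup well below 2x, while a workload made of independent tasks tracks the tile count.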
Lest anyone think the 4-tile GPU doesn't actually exist and is merely a publicity stunt, Raja whipped out the large package and briefly flashed it at the camera during his Hot Chips presentation. And yes, it's really big, much bigger than any other chip package we've seen.
Whether the 4-tile Xe HP will ever be put into production, or if it's merely a test product while Intel prepares Xe HPC, aka Ponte Vecchio, is a different matter.
Xe HP only uses EMIB to scale to multi-tile configurations. Xe HPC will also include a Rambo Cache tile, Foveros die stacking, and Co-EMIB with additional enhancements. Ponte Vecchio is planned for use in the upcoming Aurora supercomputer, and it was supposed to be manufactured on Intel's now-delayed 7nm lithography.
In the meantime, Intel now has 1-tile, 2-tile, and 4-tile Xe HP silicon in its labs. As you'd expect, the EMIB linking means the packages for the latter two are basically 2x and 4x the size of the base design, so each of the three GPUs requires its own socket.
The 4-tile implementation of Xe HP that Raja showed off is capable of around 42 TFLOPS of FP32 compute. However, that's not actually the maximum capability. Raja also mentioned that the 4-tile chip is capable of reaching "petaflops scale computing," or >1000 TFLOPS. That's thanks to the presence of tensor cores, though we don't know the exact configuration.
Like Nvidia's A100 and Google's TPUv4, Xe HP supports tensor cores. We assume these are capable of 128 operations per cycle, with one tensor core per EU. With 2048 EUs (four tiles of 512 EUs each), that gives us:
2048 × 128 × 2 (FMA) = 524,288 operations per clock
We're missing the clock speed. Reaching one petaflop at 524,288 operations per clock would take roughly 1.9GHz, which suggests either a clock of around 2GHz or a different tensor core arrangement that can do more than 128 ops per clock. Either way, hardware like this should make it much easier for supercomputers to reach exascale computing.
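The back-of-the-envelope math above can be sketched in a few lines of Python. Keep in mind that the 128 operations per clock and the one-tensor-core-per-EU layout are our assumptions, not confirmed Intel specs, and the constant and function names are purely illustrative.

```python
# Rough estimate only: ops per clock and one tensor core per EU are
# assumptions, not confirmed Intel specifications.

EUS_PER_TILE = 512           # single-tile Xe HP configuration
TILES = 4                    # 4-tile Xe HP package
OPS_PER_CLOCK_PER_EU = 128   # assumed tensor throughput per EU
FMA_FACTOR = 2               # a fused multiply-add counts as two operations


def tensor_ops_per_clock(eus, ops_per_eu=OPS_PER_CLOCK_PER_EU, fma=FMA_FACTOR):
    """Total tensor operations per clock across all EUs."""
    return eus * ops_per_eu * fma


def clock_ghz_for(target_flops, ops_per_clock):
    """Clock speed in GHz needed to reach target_flops at the given ops per clock."""
    return target_flops / ops_per_clock / 1e9


total_eus = EUS_PER_TILE * TILES
ops = tensor_ops_per_clock(total_eus)
needed_ghz = clock_ghz_for(1e15, ops)  # 1 petaflop = 1e15 operations per second

print(f"{total_eus} EUs -> {ops:,} ops per clock")
print(f"Clock needed for 1 petaflop: {needed_ghz:.2f} GHz")
```

Under these assumptions the required clock lands just above 1.9GHz, which is where the 2GHz ballpark comes from. Doubling the assumed per-EU throughput would halve the clock needed.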