AMD is currently the only vendor with both x86 processors and discrete graphics cards under one roof, at least until Intel's Xe graphics roll out, giving Team Red some flexibility with its interconnect technology. This tech has been particularly useful in the world of high-performance computing (HPC), as evidenced by an AMD presentation at the Rice Oil and Gas HPC conference yesterday.
AMD initially announced at its Next Horizon event in 2018 that it would extend its Infinity Fabric between the data center MI60 Radeon Instinct GPUs to enable a 100 Gbps link between GPUs, much like Nvidia's NVLink. But with its Frontier supercomputer announcement in May, AMD divulged that it would expand the approach to enable memory coherency between CPUs and GPUs.
The annual Rice Oil and Gas HPC event hasn't concluded yet, but according to a tweet from Intersect 360 Research analyst Addison Snell yesterday, AMD announced that future Epyc+Radeon generations will include shared memory/cache coherency between the GPU and CPU over the Infinity Fabric, similar to what AMD enabled in its Raven Ridge Ryzen products.
We also got a glimpse of some slides presented at Rice Oil and Gas, courtesy of a tweet from Extreme Computing Research Center senior research scientist Hatem Ltaief.
AMD's charts highlight the divide in power efficiency among various compute solutions, such as semi-custom SoCs and FPGAs, GPGPUs, and general-purpose x86 compute cores, and show FLOPS performance relative to both the power consumed and the silicon area required to deliver that performance. As we can see, general-purpose CPUs lag behind, but optimizations for vectorized code that use dedicated SIMD pathways can boost performance in both metrics. However, GPUs still hold a commanding lead in terms of both power efficiency and area consumed.
Leveraging cache coherency, like the company does with its Ryzen APUs, enables the best of both worlds and, according to the slides, unifies the data and provides a "simple on-ramp to CPU+GPU for all codes."
AMD also provided some examples of the code required to use a GPU without unified memory, showing that a unified memory architecture alleviates much of the coding burden.
AMD famously embraced the Heterogeneous System Architecture (HSA) to tie together Carrizo's fixed-function blocks, touting that feature in its marketing materials. Much like the approach of extending an Infinity Fabric link between the CPU and GPU, HSA provides a pool of cache-coherent shared virtual memory that eliminates data transfers between components to reduce latency and boost performance.
For instance, when a CPU completes a data processing task, the data may still require processing in the GPU. This requires the CPU to pass the data from its memory space to the GPU memory, after which the GPU then processes the data and returns it to the CPU. This complex process adds latency and incurs a performance penalty, but shared memory allows the GPU to access the same memory the CPU was utilizing, thus reducing and simplifying the software stack.
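To make the difference concrete, here is a minimal sketch of the two patterns in plain C++. It is a toy model, not AMD's code and not a real GPU API: a second host vector stands in for GPU memory, an ordinary loop stands in for the kernel, and every name (`TransferStats`, `scale_kernel`, `run_with_explicit_copies`, `run_with_shared_memory`) is hypothetical. It only counts how many bytes each pattern has to move.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: "device memory" is just a second host vector and the
// "kernel" is an ordinary CPU loop. All names here are hypothetical.
struct TransferStats {
    std::size_t bytes_copied = 0;  // bytes moved between the two memory spaces
};

// Stand-in for a GPU kernel: scales every element in place.
void scale_kernel(std::vector<float>& data, float factor) {
    for (float& x : data) {
        x *= factor;
    }
}

// Pattern 1: separate memory spaces. The data is staged into a "device"
// buffer, processed there, and then copied back to the host buffer.
TransferStats run_with_explicit_copies(std::vector<float>& host, float factor) {
    TransferStats stats;

    std::vector<float> device(host.begin(), host.end());  // host -> device copy
    stats.bytes_copied += host.size() * sizeof(float);
    assert(device.size() == host.size());

    scale_kernel(device, factor);  // work happens on the device copy

    host = device;  // device -> host copy
    stats.bytes_copied += device.size() * sizeof(float);
    return stats;
}

// Pattern 2: one coherent buffer visible to both sides. The "kernel" works
// directly on the data the CPU was using, so nothing is staged or copied.
TransferStats run_with_shared_memory(std::vector<float>& shared, float factor) {
    TransferStats stats;
    scale_kernel(shared, factor);
    return stats;
}
```

Both functions leave the caller's buffer with the same results, but the first reports two full copies of the data while the second reports none. On real hardware the shared path still has coherency traffic and other costs that this sketch does not model.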
Data transfers often consume more power than the computation itself, so eliminating those transfers boosts both performance and efficiency. Extending those benefits to the system level by sharing memory between discrete GPUs and CPUs gives AMD a tangible advantage over its competitors in the HPC space.
While AMD still appears to be a member of the HSA Foundation, it no longer actively promotes HSA in communications with the press. Either way, it's clear the core tenets of the open architecture live on in AMD's new proprietary implementation, which likely leans heavily on its open ROCm software ecosystem that is now enjoying the fruits of DOE sponsorship.
AMD has blazed a path in this regard and secured big wins for exascale-class systems, including the recent El Capitan supercomputer that will hit two exaflops and wields the new Infinity Fabric 3.0, but Intel is also working on its Ponte Vecchio architecture that will power the Aurora supercomputer at the U.S. Department of Energy's (DOE's) Argonne National Laboratory. Intel's approach leans heavily on its OneAPI programming model and also aims to tie together shared pools of memory between the CPU and GPU (lovingly named Rambo Cache). It will be interesting to learn more about the differences between the two approaches as more information trickles out.
Meanwhile, Nvidia might suffer in the supercomputer realm because it doesn't produce both CPUs and GPUs and, therefore, cannot enable similar functionality. Is this type of architecture, and the underlying unified programming models, required to hit exascale-class performance within acceptable power envelopes? That's an open question. Nvidia is part of the CXL consortium, which should offer coherency features, but both AMD and Intel have won exceedingly important contracts for the U.S. DOE's exascale-class supercomputers (the broader server ecosystem often adopts the winning HPC techniques). Nvidia hasn't announced any such wins, despite its dominant position in GPU-accelerated compute in the HPC and data center space.
That CXL feature conceptually should remove interaction with caches of connected processors/gpus/fpgas/nnps using biased coherency bypass.
Codeplay is porting Intel's DPC++ to run on NVDA's processors.
NVDA and AMD have both joined the CXL Consortium, and Papermaster recently made positive comments about it.
Did AMD present strong advantages for using Infinity Fabric vs PCIe 5/CXL for their heterogeneous GPU interconnect?
Oh? Doesn't the Perlmutter system qualify? It has 112 V100s in it.
Imagine a package like the 3970 or 3990 with half of those CPU chiplets being GPU chiplets. Then throw a few chiplets with GPU-CPU-unified HBM4 memory in there. That could be an amazing machine.
About two years ago I said, "You're going to see an APU with a chiplet for CPU, a chiplet for GPU, an IO die, and an HBM package that is part of unified memory dedicated to graphics calls."
My only miss was I predicted Zen 2 (Ryzen 3000) would come out of the gate this way. I was correct with chiplets, but missed the GPU chiplet on package. But I'm betting we'll see something like this soon.
While HSA is being less emphasized, one of the problems I had trouble resolving (when I drew up block diagrams) was indeed cache coherency between chiplets. I knew Infinity Fabric was the answer. But I didn't have the exact answer in terms of algorithms to keep from overloading it. That one took me a while to figure out.
It's a similar problem dealing with GPU-to-GPU chiplets. Everyone thought I was nuts. They likened it to SLI/Crossfire. But it isn't SLI/Crossfire with a unified memory architecture. It's just a matter of resolving tiles and sharing the data differences between them. Even NVIDIA posted a paper about how it wasn't practical. But they are all looking a lot closer at it now.
112 V100s is about 14 petaflops or 0.014 exaflops. The wins AMD has are for supercomputers in the 1-10 exaflop range (70x - 700x more powerful).
So I'd say, no, the Perlmutter system doesn't qualify.
No, it doesn't even compare. AMD is selling many thousands of pieces of hardware for the world's fastest supercomputer. Having 112 V100s is nothing by comparison.
AMD is also moving into data centers with both CPUs and GPUs, and since Nvidia doesn't make a CPU they can't compete in that market. Epyc is a better choice than anything Intel is offering. The price difference is huge and Epyc has better performance. So much so that VMware is changing how they charge based on core count...oddly one of the tiers ends with the exact core count of Xeon...weird