AMD is currently the only vendor with both x86 processors and discrete graphics cards under one roof, at least until Intel's Xe graphics roll out, giving Team Red some flexibility with its interconnect technology. This tech has been particularly useful in the world of high-performance computing (HPC), as evidenced by an AMD presentation at the Rice Oil and Gas HPC conference yesterday.
AMD initially announced at its Next Horizon event in 2018 that it would extend its Infinity Fabric between its data center Radeon Instinct MI60 GPUs to enable a 100 Gbps link between GPUs, much like Nvidia's NVLink. But with its Frontier supercomputer announcement in May, AMD divulged that it would expand the approach to enable memory coherency between CPUs and GPUs.
The annual Rice Oil and Gas HPC event hasn't concluded yet, but according to a tweet from Intersect 360 Research analyst Addison Snell yesterday, AMD announced that future Epyc+Radeon generations will include shared memory/cache coherency between the GPU and CPU over the Infinity Fabric, similar to what AMD enabled in its Raven Ridge Ryzen products.
We also got a glimpse of some slides presented at Rice Oil and Gas, courtesy of a tweet from Extreme Computing Research Center senior research scientist Hatem Ltaief.
AMD's charts highlight the divide in power efficiency among various compute solutions, such as semi-custom SoCs, FPGAs, GPGPUs, and general-purpose x86 compute cores, comparing FLOPS performance relative to both the power consumed and the amount of silicon area required to deliver that performance. As we can see, general-purpose CPUs lag behind, though optimizations for vectorized code that use dedicated SIMD pathways can boost performance on both metrics. However, GPUs still hold a commanding lead in both power efficiency and area consumed.
Leveraging cache coherency, like the company does with its Ryzen APUs, enables the best of both worlds and, according to the slides, unifies the data and provides a "simple on-ramp to CPU+GPU for all codes."
AMD also provided examples of the code required to use a GPU without unified memory; by contrast, a unified memory architecture alleviates much of that coding burden.
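AMD's actual slide code isn't reproduced here, but the contrast it illustrates generally looks like the following sketch, written with CUDA-style APIs purely for illustration (AMD's own stack would use the analogous HIP calls, e.g. hipMalloc/hipMemcpy and hipMallocManaged). The kernel and buffer sizes are hypothetical; the point is that a coherent shared pool removes the explicit staging copies in both directions.

```cuda
#include <cstdlib>

// Hypothetical kernel: double every element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // --- Without unified memory: explicit staging copies ---
    float *host = (float *)malloc(bytes);
    float *dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU
    scale<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU
    cudaFree(dev);
    free(host);

    // --- With unified (coherent) memory: one pointer, no copies ---
    float *shared = nullptr;
    cudaMallocManaged(&shared, bytes);   // visible to both CPU and GPU
    scale<<<(n + 255) / 256, 256>>>(shared, n);
    cudaDeviceSynchronize();             // CPU can now read results directly
    cudaFree(shared);
    return 0;
}
```

The second path is what the coherent Infinity Fabric link promises at the system level: the two round-trip copies, and the bookkeeping around them, simply disappear from the application code.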
AMD famously embraced the Heterogeneous Systems Architecture (HSA - deep dive here) to tie together Carrizo's fixed-function blocks, touting that feature in its marketing materials. Much like the approach of extending an Infinity Fabric link between the CPU and GPU, HSA provides a pool of cache-coherent shared virtual memory that eliminates data transfers between components to reduce latency and boost performance.
For instance, when a CPU completes a data processing task, the data may still require processing on the GPU. That requires the CPU to pass the data from its memory space to the GPU's memory, after which the GPU processes the data and returns it to the CPU. This complex process adds latency and incurs a performance penalty, but shared memory allows the GPU to access the same memory the CPU was using, thus reducing latency and simplifying the software stack.
Data transfers often consume more power than the actual computation itself, so eliminating those transfers boosts both performance and efficiency, and extending those benefits to the system level by sharing memory between discrete GPUs and CPUs gives AMD a tangible advantage over its competitors in the HPC space.
While AMD still appears to be a member of the HSA foundation, it no longer actively promotes HSA in communications with the press. In either case, it's clear the core tenets of the open architecture live on in AMD's new proprietary implementation, which likely leans heavily on its open ROCm software ecosystem that is now enjoying the fruits of DOE sponsorship.
AMD has blazed a path in this regard and secured big wins for exascale-class systems, including the recent El Capitan supercomputer that will hit two exaflops and wields the new Infinity Fabric 3.0, but Intel is also working on its Ponte Vecchio architecture that will power the Aurora supercomputer at the U.S. Department of Energy's (DOE's) Argonne National Laboratory. Intel's approach leans heavily on its OneAPI programming model and also aims to tie together shared pools of memory between the CPU and GPU (lovingly named Rambo Cache). It will be interesting to learn more about the differences between the two approaches as more information trickles out.
Meanwhile, Nvidia might suffer in the supercomputer realm because it doesn't produce both CPUs and GPUs and, therefore, cannot enable similar functionality. Is this type of architecture, and the underlying unified programming models, required to hit exascale-class performance within acceptable power envelopes? That's an open question. Nvidia is part of the CXL consortium, which should offer similar coherency features, but both AMD and Intel have won exceedingly important contracts for the U.S. DOE's exascale-class supercomputers (the broader server ecosystem often adopts the winning HPC techniques), while Nvidia hasn't announced any such wins, despite its dominant position in GPU-accelerated compute in the HPC and data center space.