AMD is building the world's fastest supercomputer, Frontier, which will deliver exascale-class performance for the US Oak Ridge National Laboratory (ORNL). The supercomputer brings a lot of new technologies to the table, and AMD is laying the groundwork for the software stack that will enable Frontier to run smoothly. As reported by Phoronix, that work continues in the form of newly submitted Linux kernel patches.
The Frontier supercomputer is a $600 million project that aims to provide more than 1.5 exaflops of computational power, which ORNL will use for work on various government projects. Built on next-generation EPYC processors and Radeon Instinct GPUs from AMD, the machine will combine novel memory, storage, and processing elements in one system.
According to today's Linux kernel patch submitted by AMD, "AMD is building a system architecture for the Frontier supercomputer with a coherent interconnect between CPUs and GPUs. This hardware architecture allows the CPUs to coherently access GPU device memory. We have hardware in our labs and we are working with our partner HPE on the BIOS, firmware, and software for delivery to the DOE."
That stands in contrast to Intel's Aurora, which at the time of its announcement was projected to be the first exascale supercomputer in the U.S. However, that system has now been delayed into the 2022-2023 timeframe, meaning that the AMD-powered Frontier will be not only the fastest exascale-class computer in the world but also the first.
The code work is ongoing. Back in May, AMD began the work of ensuring proper support for Frontier's leading-edge storage subsystem. Frontier is one of the first large-scale deployments with GPU-to-CPU memory coherency, which will require additional code work and qualification. As the patch notes below show, today's work advances Frontier's memory management capabilities.
"The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map. The amdgpu driver registers the memory with devmap as MEMORY_DEVICE_PUBLIC using devm_memremap_pages. This patch series adds MEMORY_DEVICE_PUBLIC, which is similar to MEMORY_DEVICE_GENERIC in that it can be mapped for CPU access, but adds support for migrating this memory similar to MEMORY_DEVICE_PRIVATE."
"We also included and updated two patches from Ralph Campbell (Nvidia), which change ZONE_DEVICE reference counting as requested in previous reviews of this patch series (see https://patchwork.freedesktop.org/series/90706/). Finally, we extended hmm_test to cover migration of MEMORY_DEVICE_PUBLIC. This work is based on HMM and our SVM memory manager, which has landed in Linux 5.14 recently."
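To make the distinction in the patch notes concrete, here is a minimal sketch of the three device memory types as the notes describe them. It is a userspace model, not kernel code: the `DeviceMemoryType` class and the `suits_coherent_vram` helper are hypothetical names invented for illustration, and the "not migratable" flag on MEMORY_DEVICE_GENERIC is an assumption inferred from the notes saying the new type "adds" migration support.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceMemoryType:
    """Illustrative model of a ZONE_DEVICE memory type (hypothetical, not a kernel API)."""
    name: str
    cpu_mappable: bool  # can the CPU map and access this memory directly?
    migratable: bool    # can pages be migrated to and from this memory?


# Properties paraphrased from the patch notes. The GENERIC "migratable=False"
# entry is an assumption inferred from the wording, not taken from kernel source.
MEMORY_TYPES = {
    t.name: t
    for t in (
        DeviceMemoryType("MEMORY_DEVICE_PRIVATE", cpu_mappable=False, migratable=True),
        DeviceMemoryType("MEMORY_DEVICE_GENERIC", cpu_mappable=True, migratable=False),
        DeviceMemoryType("MEMORY_DEVICE_PUBLIC", cpu_mappable=True, migratable=True),
    )
}


def suits_coherent_vram(type_name: str) -> bool:
    """Return True if the type offers both CPU access and page migration."""
    mem_type = MEMORY_TYPES[type_name]
    return mem_type.cpu_mappable and mem_type.migratable


if __name__ == "__main__":
    for name in MEMORY_TYPES:
        print(f"{name}: suits coherent VRAM = {suits_coherent_vram(name)}")
```

In this model only MEMORY_DEVICE_PUBLIC satisfies both requirements, which is the gap the patch series says it is filling for GPU memory that the CPU can access coherently.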