AMD Continues Frontier Exascale Supercomputer Enablement

AMD is powering Frontier, slated to be the world's fastest supercomputer, which will deliver exascale-class performance for the US Oak Ridge National Laboratory (ORNL). The supercomputer introduces several new technologies, and AMD is laying the groundwork for the software stack that will enable Frontier to run smoothly. As reported by Phoronix, that work continues in the form of newly submitted Linux kernel patches.

The Frontier supercomputer is a $600 million project that aims to provide more than 1.5 ExaFLOPs of computational power that will be used by ORNL for work on various government projects. Using next-generation EPYC processors and Radeon Instinct graphics cards from AMD, this system will bring a combination of novel memory, storage, and processing elements into one system.

Intel's Aurora, by contrast, was projected to be the U.S.'s first exascale supercomputer at the time of its announcement. However, that system has now been delayed into the 2022-2023 timeframe, meaning that the AMD-powered Frontier will not only be the fastest exascale-class computer in the world, it will also be the first.

The code work continues. Back in May, AMD began the work of ensuring proper support for Frontier's leading-edge storage subsystem. Frontier involves one of the first large-scale deployments with GPU-to-CPU memory coherency, which will require additional code work and qualification. As the patch notes below show, today's work advances Frontier's memory management capabilities.

"The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map. The amdgpu driver registers the memory with devmap as MEMORY_DEVICE_PUBLIC using devm_memremap_pages. This patch series adds MEMORY_DEVICE_PUBLIC, which is similar to MEMORY_DEVICE_GENERIC in that it can be mapped for CPU access, but adds support for migrating this memory similar to MEMORY_DEVICE_PRIVATE."

"We also included and updated two patches from Ralph Campbell (Nvidia), which change ZONE_DEVICE reference counting as requested in previous reviews of this patch series (see https://patchwork.freedesktop.org/series/90706/). Finally, we extended hmm_test to cover migration of MEMORY_DEVICE_PUBLIC. This work is based on HMM and our SVM memory manager, which has landed in Linux 5.14 recently."
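
The distinction the patch notes draw between the three device memory types can be summarized in a small toy model. This is only an illustrative sketch in Python, not kernel code: the DeviceMemoryType class, its two boolean fields, and the types_with helper are hypothetical names invented for this example, and the two properties are a simplification of what the notes describe (whether the CPU can map the memory directly, and whether pages can be migrated).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceMemoryType:
    """Toy description of a ZONE_DEVICE memory type (illustrative only)."""
    name: str
    cpu_mappable: bool  # can the CPU map and access this memory directly?
    migratable: bool    # can pages be migrated to and from this memory?


# Properties as characterized in the patch notes, simplified to two flags.
MEMORY_TYPES = [
    # Device memory the CPU cannot map directly, but which supports migration.
    DeviceMemoryType("MEMORY_DEVICE_PRIVATE", cpu_mappable=False, migratable=True),
    # CPU-mappable memory without the migration support the series adds.
    DeviceMemoryType("MEMORY_DEVICE_GENERIC", cpu_mappable=True, migratable=False),
    # The type proposed by the series: CPU-mappable and migratable.
    DeviceMemoryType("MEMORY_DEVICE_PUBLIC", cpu_mappable=True, migratable=True),
]


def types_with(cpu_mappable: bool, migratable: bool) -> list[str]:
    """Return the names of the memory types matching both properties."""
    return [
        t.name
        for t in MEMORY_TYPES
        if t.cpu_mappable == cpu_mappable and t.migratable == migratable
    ]


if __name__ == "__main__":
    for t in MEMORY_TYPES:
        print(f"{t.name}: cpu_mappable={t.cpu_mappable}, migratable={t.migratable}")
```

In this model, MEMORY_DEVICE_PUBLIC is the only type for which both flags are set, which is the combination a coherent CPU-GPU memory system like Frontier's relies on.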