Intel teased more information about its forthcoming Xe Graphics Architecture, which it will use in a tiered model to attack every facet of the graphics landscape spanning from desktop PC gaming and mobility uses to the data center. The new architecture scales to thousands of execution units (EUs) and comes with several new memory enhancements, like a new Rambo Cache and XEMF interface.
Datacenter use-cases also equate to the bleeding edge high-performance computing (HPC) and supercomputing realm, so Intel made the announcements here at its Intel HPC Developer Conference in Denver, Colorado, on the cusp of the annual Supercomputing tradeshow.
Intel also shared more information about its data center roadmap (including Sapphire Rapids processors) and its new OneAPI programming model.
We're here at the HPC Developer Conference to listen to Raja Koduri's keynote, and we'll add more information as it is revealed. We already have plenty to discuss, so let's dive in.
Intel's Xe Graphics "Ponte Vecchio" Architecture
The first iteration of the Xe Graphics Architecture comes in the form of Intel's new 7nm "Ponte Vecchio" graphics card for general-purpose compute workloads, with this model designed specifically for HPC and AI workloads. Intel bills this card as its first "exascale graphics card," but that level of compute will require multiple cards working together over a fast and flexible fabric. This new card will debut in the Aurora Supercomputer at Argonne National Laboratory in 2021, which will be the world's first exascale-class supercomputer.
Intel splits the Xe Architecture up into three designs that each address different segments: Data center, consumer graphics cards, and AI use-cases (HP); integrated graphics on its processors (LP); and the high-tier Xe HPC for high performance computing, with the latter being designed specifically for compute. The consumer version of the Xe Graphics card for gaming will lead the way in 2020, likely on the 10nm process.
Intel will fab Ponte Vecchio on the 7nm process, signaling that it is well on track to the next node beyond its notorious 10nm process. Intel's CEO Bob Swan recently announced the company had completed power-on testing of its prototype, the DG1, and that version is largely thought to be fabbed on Intel's 10nm process.
Intel isn't following the graphics industry's trend of using a single large monolithic die for its graphics card, though, as it will use an MCM (Multi-Chip Module) design that consists of multiple compute units tied together via a fabric (as you can see in the image above). Given that Intel will spread the graphics workload over multiple chips, the recent surfacing of Xe Graphics drivers that support multiple GPUs takes on a whole new meaning: These drivers likely serve as a key component for addressing the multi-die architecture within each discrete GPU.
The fact that Ponte Vecchio lands as the first 7nm product from Intel makes sense, even though this is a drastic departure from Intel's established norm. GPUs have many repeating structures that inherently tolerate defects easily, and designing in additional redundancy in some areas, like critical pathways, can further defray the risk of manufacturing the die on the early revisions of the node. It's also noteworthy that Intel uses small chiplets, which improves yields.
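The yield argument for small chiplets can be illustrated with a back-of-the-envelope calculation. The sketch below uses the simple Poisson defect model (yield = exp(-defect density × die area)), which is a common textbook approximation and not something Intel has described. The defect density and die areas are hypothetical values chosen only for illustration, and the comparison assumes chiplets are tested before packaging so that only good dies are assembled.

```python
import math


def poisson_yield(die_area_mm2, defects_per_cm2):
    """Fraction of dies expected to be defect-free under a Poisson defect model."""
    area_cm2 = die_area_mm2 / 100.0  # 100 mm^2 per cm^2
    return math.exp(-defects_per_cm2 * area_cm2)


# All values below are assumptions for illustration, not Intel figures.
DEFECT_DENSITY = 0.2      # defects per cm^2 on an immature node (hypothetical)
MONOLITHIC_AREA = 600.0   # mm^2, one large die (hypothetical)
CHIPLET_AREA = 150.0      # mm^2, one of four chiplets (hypothetical)
CHIPLETS_PER_PACKAGE = 4

monolithic_yield = poisson_yield(MONOLITHIC_AREA, DEFECT_DENSITY)
chiplet_yield = poisson_yield(CHIPLET_AREA, DEFECT_DENSITY)

# Wafer area consumed, on average, to obtain one working 600 mm^2 of compute.
area_per_good_monolithic = MONOLITHIC_AREA / monolithic_yield
area_per_good_chiplet_set = CHIPLETS_PER_PACKAGE * CHIPLET_AREA / chiplet_yield

print(f"monolithic yield: {monolithic_yield:.1%}")
print(f"chiplet yield:    {chiplet_yield:.1%}")
print(f"silicon per good monolithic die: {area_per_good_monolithic:.0f} mm^2")
print(f"silicon per good chiplet set:    {area_per_good_chiplet_set:.0f} mm^2")
```

Under these assumed numbers, a defect scraps only a small chiplet instead of the whole large die, so far less wafer area is wasted per working package. Real yield models also account for defect clustering and for the redundancy mentioned above.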
These modules employ Intel's latest packaging technologies, such as Foveros 3D chip packaging, which consists of a 3D stacked die design that allows the company to use larger process nodes for non-compute elements (like I/O) or even mix-and-match CPU, GPU or AI processors. The company will also use EMIB (Embedded Multi-die Interconnect Bridge) technology to tie the HBM packages to the compute die.
The cards can scale to thousands of execution units (EUs), though those obviously wouldn't be in a single chiplet, and Intel says that each EU offers a 40X increase in double-precision floating point performance.
The new scalable XEMF (XE Memory Fabric) ties the units (compute and memory) together with a coherent memory interface, and Intel says the fabric can scale to thousands of Xe nodes. A large unified "Rambo" cache, packaged together with Foveros technology, provides what Koduri termed massive amounts of memory bandwidth that is available to both the GPUs and CPUs simultaneously.
The cards support both SIMT (GPU) and SIMD (CPU) vector widths, and they can process SIMT and SIMD simultaneously for maximum performance. The inclusion of SIMD processing will add AVX-type processing capabilities to data center accelerators, which is likely a move to make code more portable via OneAPI, which we'll cover shortly.
Intel surely has plenty of IP for GPU SIMD processing due to its history with Larrabee, which used SIMD processing entirely. Melding the two approaches would give Intel a nice blend of the two processing techniques. The ability to run both concurrently is a nice extra benefit.
Intel isn't divulging key specifications of the graphics card yet, like the number of execution units per chiplet or clock rates, but the EMIB connection will provide the fastest possible data transfer rates, and at the best power efficiency, over the relatively short distances between the compute die and HBM memory.
Intel also divulged that the cards would use the Compute Express Link (CXL) interconnect in an Intel-branded "Xe Link" implementation that provides memory coherency between both the numerous units on the graphics card and the CPU. This is aided by a small SoC that sits at the nexus of the fabric that connects multiple compute packages, which is required for the CXL interface. Memory coherency reduces data transfers, which often consume more time and energy than the actual compute workload.
Most importantly, the CXL interface requires a PCIe 5.0 connection, so Ponte Vecchio will support the new standard. As we know from previous disclosures, CXL-infused devices can communicate over both the CXL interface and the standard PCIe interface, but it isn't yet clear how Intel will stratify that type of dual functionality, meaning we may not see the CXL interface in the consumer-level gaming graphics cards, at least not as an external interface. Instead, the cards might simply operate over the PCIe 5.0 interface.
The PCIe 5.0 interface lines up well with Intel's plans for its Sapphire Rapids data center chips that it will deploy in tandem with the cards for the Aurora supercomputer. According to leaked information, those chips will bring PCIe 5.0 and DDR5 to the table, and today Intel divulged that the chips would debut in 2021 (more on that later).
Ponte Vecchio's compute units consist of a data-parallel vector-matrix engine paired with an undisclosed variant of high bandwidth memory (HBM), and are designed to deliver fast double-precision floating point throughput. Intel has also worked in many new RAS (Reliability, Availability, and Serviceability) features inspired by its Xeon CPU lineup, like ECC for memory and caches.
Intel Xe Ponte Vecchio Graphics in Aurora Supercomputer
The DOE and Intel's announcement that the Xe Graphics architecture and Intel's host processors would power the Aurora supercomputer marked a significant win for the company, especially as it also tied in the company's "future generation" of the Optane DC Persistent Memory DIMMs that serves as one of Intel's key "data center adjacencies."
The original announcement was light on details, though we do know that Aurora will be the world's first exascale supercomputer and will be delivered in 2021. This is actually the second iteration of the Aurora design after Intel famously missed the 2018 design target with its Xeon Phi Knights Hill processors, which have now been retired.
Today Intel disclosed the node architecture, which consists of two Sapphire Rapids processors paired with six Ponte Vecchio graphics units. Each card will house 16 compute units.
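The per-node totals implied by those figures are easy to tally. The short sketch below is only arithmetic on the numbers Intel disclosed (two CPUs, six GPUs, 16 compute units per card). The `AuroraNode` class name is hypothetical, and the calculation assumes that each of the six "graphics units" is one card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuroraNode:
    """Disclosed Aurora node layout; the class itself is a hypothetical helper."""
    cpus: int = 2                     # Sapphire Rapids processors per node
    gpus: int = 6                     # Ponte Vecchio graphics units per node
    compute_units_per_gpu: int = 16   # assumes one card per graphics unit

    @property
    def compute_units(self) -> int:
        # Total Ponte Vecchio compute units in one node.
        return self.gpus * self.compute_units_per_gpu

    @property
    def gpus_per_cpu(self) -> float:
        # Ratio of accelerators to host processors.
        return self.gpus / self.cpus


node = AuroraNode()
print(f"compute units per node: {node.compute_units}")
print(f"GPUs per CPU:           {node.gpus_per_cpu:.0f}")
```

Because Intel has not disclosed the node count, this cannot be extended to a system-wide total.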
Data throughput is key here, so the design features an all-to-all architecture within the node to guarantee data delivery to the compute units via eight fabric endpoints per node. This scheme uses the CXL interface to create a unified memory pool between all tiers of memory, including the HBM in the graphics cards, the Sapphire Rapids processors, the DRAM in the system (undisclosed amount), and the Optane Persistent DIMMs.
The overall system will feature "more than" 200 racks of servers, but Intel didn't disclose whether these will hold 1U or 2U servers, so it's impossible to reverse-engineer the numbers to determine the number of server nodes. All told, the system will have 230 petabytes of storage and "more than" 10 petabytes of memory.
Intel Data Center Roadmap
Intel's lack of public-facing roadmaps has long been a sore point for its customers, as it doesn't engender much faith in Intel's future plans, especially in light of the company's failed transition to the 10nm node. The company unveiled a new roadmap here at the event, which comes as an update to the roadmap it shared in 2018.
The new roadmap confirms the Sapphire Rapids processors in 2021, but doesn't divulge many other details, like the process node. Given that Intel will use 7nm for Ponte Vecchio in 2021, it's safe to assume the data center processors will use the same node. Intel says the chips will provide an unprecedented amount of scalability for both scale-up and scale-out implementations.
As mentioned, extremely credible leaks (from Huawei) have indicated that Sapphire Rapids will support PCIe 5.0 and come with eight-channel DDR5 memory. Those chips will drop into the Eagle Stream platform.
Using this radically new architecture will require specialized coding, so Intel and the DOE are using the company's new OneAPI programming model, which Intel designed to simplify programming across its GPU, CPU, FPGA, and AI accelerators. The software goes by the tagline of "no transistor left behind," and given its goals, that's an accurate statement.
OneAPI provides unified libraries that will allow for applications to move seamlessly between Intel's different types of compute. If successful, this could be a key differentiator that other firms will not be able to match with as many forms of compute, so the DOE's support here is key to Intel's long-term goals. Interestingly, OneAPI will also work with other vendors' hardware, marking yet another milestone in Intel's changing view of fostering industry-standard interfaces and models, as opposed to its long-held tradition of using proprietary solutions.
Intel says it designed the software to provide "choice without compromising performance and eliminating the complexity of separate code bases, multiple-programming languages, and different tools and workflows. oneAPI preserves existing software investments with support for existing languages while delivering flexibility for developers to create versatile applications." The software is now in public beta on the Intel Developer Cloud. Intel has also created a migration tool that converts CUDA code to OneAPI, which is a clear shot across Nvidia's bow, as the CUDA programming language serves as Nvidia's primary moat.
If there is any company that is ready to use a fabric to weave together multiple chips using high bandwidth HBM memory, it would be AMD. But even AMD has not gone this route yet, which leads me to believe there is something that causes it to run slower as a consumer GPU.
As far as the consumer space is concerned, Xe is primarily aimed at datacenters; that's what the scalability aspects are aimed at. I doubt the consumer space will be seeing more than a single GPU die for a while, if anything.
This announcement was from the Supercomputer conference, so it has little to do with consumer graphics cards.
We've yet to hear what they'll do in the Radeon/Geforce space.
"The 10nm Ice Lake ramp-up continues in the second half of 2020 and will provide more microarchitecture and architectural features for both traditional HPC and AI, said Hazra."
They actually stated it would be 2020 for the consumer gaming card.
What I find most interesting is that, the way it's described, it feels like they are talking about Terascale again. Terascale was basically a bunch of cut down P54C chips that could run CPU or GPU tasks.
I wouldn't doubt that Intel is taking all they have learned over the years and throwing it into the best they can. Might actually give nVidia a run for its money if done right in the HPC market.
They are referring to Ice Lake server chips there, which have AVX-512 for both mobile and server chips, which is a step forward for vectorized code.
"Intel's new DL Boost suite adds support for multiple new AI features, which the company claims makes it the only CPU specifically optimized for AI workloads. Overall, Intel claims these technologies provide a 14X performance increase in AI inference workloads. Intel also added support for new VNNI (Vector Neural Network Instructions) that optimize instructions for smaller data types commonly used in machine learning and inference. VNNI instructions fuse three instructions together to boost int8 (VPDPBUSD) performance and fuse two instructions to boost int16 (VPDPWSSD) performance. These AVX-512 instructions will still operate within the normal AVX-512 voltage/frequency curve during the operations. "
more here: https://www.tomshardware.com/reviews/intel-cascade-lake-xeon-optane,6061-2.html
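For anyone wondering what the VPDPBUSD instruction named in that quote actually computes, here is a rough single-lane emulation in plain Python. It is a sketch of the arithmetic only (four unsigned 8-bit values times four signed 8-bit values, summed into a 32-bit accumulator with wraparound), not real AVX-512 code. The function names are made up for illustration, and the real instruction does this across every 32-bit lane of a vector register at once.

```python
def to_int32(value):
    """Wrap an arbitrary Python int to a signed 32-bit value (two's complement)."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


def vpdpbusd_lane(acc, unsigned_bytes, signed_bytes):
    """Emulate one 32-bit lane: acc + dot(u8[0:4], s8[0:4]), no saturation.

    Hypothetical helper for illustration; hardware handles many lanes in parallel.
    """
    if len(unsigned_bytes) != 4 or len(signed_bytes) != 4:
        raise ValueError("each lane consumes exactly four byte pairs")
    if any(not 0 <= u <= 255 for u in unsigned_bytes):
        raise ValueError("first operand must hold unsigned 8-bit values")
    if any(not -128 <= s <= 127 for s in signed_bytes):
        raise ValueError("second operand must hold signed 8-bit values")

    # Multiply pairwise, sum the four products, then accumulate into 32 bits.
    dot = sum(u * s for u, s in zip(unsigned_bytes, signed_bytes))
    return to_int32(acc + dot)


result = vpdpbusd_lane(10, [1, 2, 3, 4], [5, -6, 7, -8])
print(result)
```

The multiply, the widening sum and the accumulate are the steps the quote says get fused into one instruction, which is where the int8 inference speedup comes from.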