Intel Announces 7nm Ponte Vecchio Graphics Cards, Sapphire Rapids CPUs, and Data Center Roadmap

Intel teased more information about its forthcoming Xe Graphics Architecture, which it will use in a tiered model to attack every facet of the graphics landscape spanning from desktop PC gaming and mobility uses to the data center. The new architecture scales to thousands of execution units (EUs) and comes with several new memory enhancements, like a new Rambo Cache and XEMF interface.

Datacenter use-cases also equate to the bleeding edge high-performance computing (HPC) and supercomputing realm, so Intel made the announcements here at its Intel HPC Developer Conference in Denver, Colorado, on the cusp of the annual Supercomputing tradeshow.

We're here at the HPC Developer Conference to listen to Raja Koduri's keynote, and we'll add more information as it is revealed. We already have plenty to discuss, so let's dive in.

Intel's Xe Graphics "Ponte Vecchio" Architecture

The first iteration of the Xe Graphics Architecture comes in the form of Intel's new 7nm "Ponte Vecchio" graphics card for general-purpose compute workloads, with this model designed specifically for HPC and AI workloads. Intel bills this card as its first "exascale graphics card," but that level of compute will require multiple cards working together over a fast and flexible fabric. This new card will debut in the Aurora Supercomputer at Argonne National Laboratory in 2021, which will be the world's first exascale-class supercomputer.

Intel splits the Xe Architecture up into three designs that each address different segments: Data center, consumer graphics cards, and AI use-cases (HP); integrated graphics on its processors (LP); and the high-tier Xe HPC for high performance computing, with the latter being designed specifically for compute. The consumer version of the Xe Graphics card for gaming will lead the way in 2020, likely on the 10nm process.

Intel will fab Ponte Vecchio on the 7nm process, signaling that it is well on track to the next node beyond its notorious 10nm process. Intel's CEO Bob Swan recently announced the company had completed power-on testing of its prototype, the DG1, and that version is largely thought to be fabbed on Intel's 10nm process.

Intel isn't following the graphics industry's trend of using a single large monolithic die for its graphics card, though, as it will use an MCM (Multi-Chip Module) design that consists of multiple compute units tied together via a fabric. Given that Intel will spread the graphics workload over multiple chips, the recent surfacing of Xe Graphics drivers that support multiple GPUs takes on a whole new meaning: These drivers likely serve as a key component for addressing the multi-die architecture within each discrete GPU.

The fact that Ponte Vecchio lands as the first 7nm product from Intel makes sense, even though this is a drastic departure from Intel's established norm. GPUs have many repeating structures that inherently tolerate defects easily, and designing in additional redundancy in some areas, like critical pathways, can further defray the risk of manufacturing the die on the early revisions of the node. It's also noteworthy that Intel uses small chiplets, which improves yields.
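The yield argument for small chiplets can be illustrated with the textbook Poisson yield model, in which the share of defect-free dies falls exponentially with die area. The sketch below is only an illustration: the die areas and defect density are hypothetical placeholders, as Intel has not disclosed either figure, and real yield models are more involved.

```python
import math


def poisson_yield(die_area_mm2: float, defects_per_mm2: float) -> float:
    """Expected fraction of dies with zero defects under a simple Poisson model."""
    return math.exp(-die_area_mm2 * defects_per_mm2)


# Hypothetical values for illustration only; not Intel figures.
DEFECT_DENSITY = 0.002       # defects per mm^2 (assumed)
MONOLITHIC_AREA_MM2 = 600.0  # one large die (assumed)
CHIPLET_AREA_MM2 = 75.0      # one of eight smaller dies (assumed)

monolithic_yield = poisson_yield(MONOLITHIC_AREA_MM2, DEFECT_DENSITY)
chiplet_yield = poisson_yield(CHIPLET_AREA_MM2, DEFECT_DENSITY)

print(f"Monolithic die yield: {monolithic_yield:.1%}")
print(f"Single chiplet yield: {chiplet_yield:.1%}")
```

With these assumed inputs, roughly 30 percent of the large dies would be defect-free versus roughly 86 percent of the chiplets, which is why splitting a design into smaller dies wastes less silicon on an immature node.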

These modules employ Intel's latest packaging technologies, such as Foveros 3D chip packaging, a 3D stacked-die design that allows the company to use larger process nodes for non-compute elements (like I/O) or even mix-and-match CPU, GPU or AI processors. The company will also use EMIB (Embedded Multi-die Interconnect Bridge) technology to tie the HBM packages to the compute die.

The cards can scale to thousands of execution units (EUs), though those obviously wouldn't be in a single chiplet, and Intel says that each EU offers a 40X increase in double-precision floating point performance.

The new scalable XEMF (XE Memory Fabric) ties the units (compute and memory) together with a coherent memory interface, and Intel says the fabric can scale to thousands of Xe nodes. A large unified "Rambo" cache, packaged together with Foveros technology, provides what Koduri termed as massive amounts of memory bandwidth that is available to both the GPUs and CPUs simultaneously.

The cards support both SIMT (GPU) and SIMD (CPU) vector widths, and they can process SIMT and SIMD simultaneously for maximum performance. The inclusion of SIMD processing will add AVX-type processing capabilities to data center accelerators, which is likely a move to make code more portable via OneAPI, which we'll cover shortly.

Intel surely has plenty of IP for GPU SIMD processing due to its history with Larrabee, which used SIMD processing entirely. Melding the two approaches would give Intel a nice blend of the two processing techniques. The ability to run both concurrently is a nice extra benefit.

Intel isn't divulging key specifications of the graphics card yet, like the number of execution units per chiplet or clock rates, but the EMIB connection will provide the fastest possible data transfer rates, and at the best power efficiency, over the relatively short distances between the compute die and HBM memory.

Intel also divulged that the cards would use the Compute Express Link (CXL) interconnect in an Intel-branded "Xe Link" implementation that provides memory coherency between both the numerous units on the graphics card and the CPU. This is aided by a small SoC that sits at the nexus of the fabric that connects multiple compute packages, which is required for the CXL interface. Memory coherency reduces data transfers, which often consume more time and energy than the actual compute workload.

Most importantly, the CXL interface requires a PCIe 5.0 connection, so Ponte Vecchio will support the new standard. As we know from previous disclosures, CXL-infused devices can communicate over both the CXL interface and the standard PCIe interface, but it isn't yet clear how Intel will stratify that type of dual functionality, meaning we may not see the CXL interface in the consumer-level gaming graphics cards, at least not as an external interface. Instead, the cards might simply operate over the PCIe 5.0 interface.

The PCIe 5.0 interface lines up well with Intel's plans for its Sapphire Rapids data center chips that it will deploy in tandem with the cards for the Aurora supercomputer. According to leaked information, those chips will bring PCIe 5.0 and DDR5 to the table, and today Intel divulged that the chips would debut in 2021 (more on that later).

Ponte Vecchio's compute units consist of a data-parallel vector-matrix engine paired with an undisclosed variant of high bandwidth memory (HBM), and are designed to deliver fast double-precision floating point throughput. Intel has also worked in many new RAS (Reliability, Availability, and Serviceability) features inspired by its Xeon CPU lineup, like ECC for memory and caches.

Intel Xe Ponte Vecchio Graphics in Aurora Supercomputer

The DOE and Intel's announcement that the Xe Graphics architecture and Intel's host processors would power the Aurora supercomputer marked a significant win for the company, especially as it also tied in the company's "future generation" of the Optane DC Persistent Memory DIMMs that serves as one of Intel's key "data center adjacencies."

The original announcement was light on details, though we do know that Aurora will be the world's first exascale supercomputer and will be delivered in 2021. This is actually the second iteration of the Aurora design after Intel famously missed the 2018 design target with its Xeon Phi Knights Hill processors, which have now been retired.

Today Intel disclosed the node architecture, which consists of two Sapphire Rapids processors paired with six Ponte Vecchio graphics units. Each card will house 16 compute units.
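As a quick back-of-the-envelope check on that node layout, the sketch below multiplies out the disclosed figures. It assumes each of the six "graphics units" is one card with 16 compute units, and the node count passed to the second function is a purely hypothetical input, since Intel has not disclosed how many nodes Aurora will have.

```python
# Figures disclosed for one Aurora node.
CPUS_PER_NODE = 2            # Sapphire Rapids processors
GPUS_PER_NODE = 6            # Ponte Vecchio graphics units
COMPUTE_UNITS_PER_GPU = 16   # compute units per card


def compute_units_per_node(gpus: int = GPUS_PER_NODE,
                           units_per_gpu: int = COMPUTE_UNITS_PER_GPU) -> int:
    """Total GPU compute units in one node."""
    return gpus * units_per_gpu


def system_compute_units(node_count: int) -> int:
    """Total GPU compute units for a hypothetical number of nodes."""
    return node_count * compute_units_per_node()


print(f"Compute units per node: {compute_units_per_node()}")
print(f"GPUs per CPU: {GPUS_PER_NODE // CPUS_PER_NODE}")
```

That works out to 96 compute units per node and three GPUs per CPU; scaling further with system_compute_units() requires a node count that Intel has yet to share.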

Data throughput is key here, so the design features an all-to-all architecture within the node to guarantee data delivery to the compute units via eight fabric endpoints per node. This scheme uses the CXL interface to create a unified memory pool between all tiers of memory, including the HBM in the graphics cards, the Sapphire Rapids processors, the DRAM in the system (undisclosed amount), and the Optane Persistent DIMMs.

The overall system will feature "more than" 200 racks of servers, but Intel didn't disclose whether the racks will hold 1U or 2U servers, so it's impossible to reverse-engineer the numbers to determine the number of server nodes. All told, the system will have 230 petabytes of storage and "more than" 10 petabytes of memory.

Intel Data Center Roadmap

Intel's lack of public-facing roadmaps has long been a sore point for its customers, as it doesn't engender much faith in Intel's future plans, especially in light of the company's failed transition to the 10nm node. The company unveiled a new roadmap here at the event, which comes as an update to the roadmap it shared in 2018.

The new roadmap confirms the Sapphire Rapids processors in 2021, but doesn't divulge many other details, like the process node. Given that Intel will use 7nm for Ponte Vecchio in 2021, it's safe to assume the data center processors will use the same node. Intel says the chips will provide an unprecedented amount of scalability for both scale-up and scale-out implementations.

As mentioned, extremely credible leaks (from Huawei) have indicated that Sapphire Rapids will support PCIe 5.0 and come with eight-channel DDR5 memory. Those chips will drop into the Eagle Stream platform.

Intel OneAPI

Using this radically new architecture will require specialized coding, so Intel and the DOE are using the company's new OneAPI programming model, which Intel designed to simplify programming across its GPU, CPU, FPGA, and AI accelerators. The software goes by the tagline of "no transistor left behind," and given its goals, that's an accurate statement.

OneAPI provides unified libraries that will allow for applications to move seamlessly between Intel's different types of compute. If successful, this could be a key differentiator that other firms will not be able to match with as many forms of compute, so the DOE's support here is key to Intel's long-term goals. Interestingly, OneAPI will also work with other vendors' hardware, marking yet another milestone in Intel's changing view of fostering industry-standard interfaces and models, as opposed to its long-held tradition of using proprietary solutions.

Intel says it designed the software to provide "choice without compromising performance and eliminating the complexity of separate code bases, multiple-programming languages, and different tools and workflows. oneAPI preserves existing software investments with support for existing languages while delivering flexibility for developers to create versatile applications." The software is now in public beta on the Intel Developer Cloud. Intel has also created a migration tool that converts CUDA code to OneAPI, which is a clear shot across Nvidia's bow, as the CUDA programming model serves as its primary moat.

Paul Alcorn is the Editor-in-Chief for Tom's Hardware US. He also writes news and reviews on CPUs, storage, and enterprise hardware.

10 Comments Comment from the forums

mihen

It's interesting that their GPUs will use multiple chips. To me it signals that Intel will not initially be targeting the consumer market with their GPUs and instead target compute GPUs.
If there is any company that is ready to use a fabric to weave together multiple chips using high bandwidth HBM memory, it would be AMD. But even AMD has not gone this route yet, which leads me to believe there is something that causes it to run slower as a consumer GPU.
Reply
InvalidError

mihen said:
If there is any company that is ready to use a fabric to weave together multiple chips using high bandwidth HBM memory, it would be AMD.
I don't see why AMD would be any better-suited than Intel, Intel has a whole decade of extra experience with memory controllers in NUMA environments as AMD does from providing most supercomputer CPUs between the last viable Opteron and current day. This is fundamentally the same set of challenges as multi-socket systems shrunk down to multi-chip package level. Packaging-wise, Intel has more multi-chip packaging options than AMD does, so no advantage for AMD there either. AMD may have "more experience" with HBM but you have to keep in mind that the fundamental operating principles of DRAM haven't changed since PC60 SDRAM, so being earlier at adopting a new packaging or frequency/timing range standard with minor tweaks does not mean much.

As far as the consumer space is concerned, Xe is primarily aimed at datacenters, that's what the scalability aspects are aimed at. I doubt the consumer space will be seeing more than a single GPU die for a while, if anything.
Reply
DavidC1

mihen said:
It's interesting that their GPUs will use multiple chips. To me it signals that Intel will not initially be targeting the consumer market with their GPUs and instead target compute GPUs.

This announcement was from the Supercomputer conference, so it has little to do with consumer graphics cards.

We've yet to hear what they'll do in the Radeon/Geforce space.
Reply
JayNor

This quote from the hpcwire coverage is interesting. I wasn't expecting further changes to the ice lake core, but maybe so... Any comments on this statement by the toms' author?

"The 10nm Ice Lake ramp-up continues in the second half of 2020 and will provide more microarchitecture and architectural features for both traditional HPC and AI, said Hazra."

https://www.hpcwire.com/2019/11/17/intel-debuts-new-gpu-ponte-vecchio-and-outlines-aspirations-for-oneapi/
Reply
JamesSneed

All we keep hearing about Xe Graphics is related to the HPC space. With that said I think it would be a smart move to first make products for HPC.
Reply
InvalidError

JamesSneed said:
All we keep hearing about Xe Graphics is related to the HPC space. With that said I think it would be a smart move to first make products for HPC.
If you have finite fab capacity available, go for the higher-value markets first like tiny laptop CPUs, HPC, servers and FPGAs. That's why leaked Intel roadmaps showed no apparent plans for anything beyond 14nm in the consumer space until at least 2021. (Though Intel is now claiming it will have 10nm desktop parts later next year, the last time Intel produced desktop parts that weren't on roadmaps as such at least two years ahead of time was Broadwell which turned into little more than a paper launch - limited number of grossly overpriced SKUs, most of which unobtainable until a few months prior to Skylake's launch.)
Reply
jimmysmitty

InvalidError said:
I don't see why AMD would be any better-suited than Intel, Intel has a whole decade of extra experience with memory controllers in NUMA environments as AMD does from providing most supercomputer CPUs between the last viable Opteron and current day. This is fundamentally the same set of challenges as multi-socket systems shrunk down to multi-chip package level. Packaging-wise, Intel has more multi-chip packaging options than AMD does, so no advantage for AMD there either. AMD may have "more experience" with HBM but you have to keep in mind that the fundamental operating principles of DRAM haven't changed since PC60 SDRAM, so being earlier at adopting a new packaging or frequency/timing range standard with minor tweaks does not mean much.

As far as the consumer space is concerned, Xe is primarily aimed at datacenters, that's what the scalability aspects are aimed at. I doubt the consumer space will be seeing more than a single GPU die for a while, if anything.

The consumer version of the Xe Graphics card for gaming will lead the way in 2020, likely on the 10nm process.

They actually stated it would be 2020 for the consumer gaming card.

What I find most interesting is that the way it's described, it feels like they are talking about Terascale again. Terascale was basically a bunch of cut down P54C chips that could run CPU or GPU tasks.

I wouldn't doubt that Intel is taking all they have learned over the years and throwing it into the best they can. Might actually give nVidia a run for its money if done right in the HPC market.
Reply
Paul Alcorn

JayNor said:
This quote from the hpcwire coverage is interesting. I wasn't expecting further changes to the ice lake core, but maybe so... Any comments on this statement by the toms' author?

"The 10nm Ice Lake ramp-up continues in the second half of 2020 and will provide more microarchitecture and architectural features for both traditional HPC and AI, said Hazra."

https://www.hpcwire.com/2019/11/17/intel-debuts-new-gpu-ponte-vecchio-and-outlines-aspirations-for-oneapi/

They are referring to Ice Lake server chips there, which have AVX-512 for both mobile and server chips, which is a step forward for vectorized code.

"Intel's new DL Boost suite adds support for multiple new AI features, which the company claims makes it the only CPU specifically optimized for AI workloads. Overall, Intel claims these technologies provide a 14X performance increase in AI inference workloads. Intel also added support for new VNNI (Vector Neural Network Instructions) that optimize instructions for smaller data types commonly used in machine learning and inference. VNNI instructions fuse three instructions together to boost int8 (VPDPBUSD) performance and fuse two instructions to boost int16 (VPDPWSSD) performance. These AVX-512 instructions will still operate within the normal AVX-512 voltage/frequency curve during the operations. "

more here: https://www.tomshardware.com/reviews/intel-cascade-lake-xeon-optane,6061-2.html
Reply
JayNor

Yeah, they've already had avx512 and dlboost in Ice Lake laptop chips. I took the comment to suggest they are adding something else to the Ice Lake Server chips. It could make sense for them to add a second avx512 unit to each core, as they've done in Cascade Lake server chips, if that's what you mean, but they might also want to reserve that as a differentiation for Cooper Lake.
Reply
DZIrl

I do not see how Intel does not understand process. We have 14++++++++++++++++++ now we must have 10++++++++++++++++ before switching to 7.
Reply
