Intel Details Core Ultra ‘Meteor Lake’ Architecture, Launches December 14

Intel shared deep-dive details of its disruptive new Meteor Lake processors during its Intel Tech Tour in Malaysia. The company isn’t sharing product-level details, like the different chip models, before the December 14 launch, but it is whipping the covers off its new 3D performance hybrid architecture. That includes details about the chips’ CPU and GPU core microarchitectures, neural processing unit, Foveros 3D packaging that melds multiple chiplets into one chip, a new approach to power management, and its new low-power-island e-cores that create a third tier of CPU compute power in addition to the standard P-cores and E-cores. Intel also shared core details about its new EUV-enabled Intel 4 process node, which it says is delivering the best initial yields the company has seen in a decade.

Intel says its new design methodology results in stunning gains in power efficiency, but it hasn’t shared performance benchmarks yet. The company describes its move to Foveros 3D packaging technology as its largest architectural shift in 40 years, a fair statement given that the radical new packaging technology will pave the way to more advanced chips in the future, with CEO Pat Gelsinger even calling it the company’s next ‘Centrino moment.’ These are needed changes as Intel looks to regain the lead over its primary fab competitor, TSMC, in process node tech, and outmaneuver its primary chip competitor, AMD, with its new chiplet-based architecture. Intel also has to fend off Apple, which has now made a disruptive entrance into the laptop market with faster, more power-efficient processors.

Meteor Lake marks not only a fundamental rethinking of Intel’s processor design, but also its approach to fabricating its processors – these are the company’s first mainstream chips to use silicon from a competing fab. Intel leans on TSMC’s process node tech for three of the four active tiles on the processor, selecting two less expensive TSMC nodes for some functions, and one higher-density and higher-performance TSMC node than its own ‘Intel 4’ node that it uses for its CPU tile.

We’ll start at the top, covering the basic design elements, and then dive into the deeper details of each unit. We also have details of the overall AI implementation and software, along with the design decisions behind the new Foveros 3D-based architecture.

Intel Meteor Lake Architecture Overview

Intel refers to its die disaggregation technique as a ‘tiled’ architecture, whereas the rest of the industry refers to this as a chiplet architecture. In truth, there really isn’t much technical differentiation between the two terminologies. Intel says that a ‘tiled’ processor refers to a chip using advanced packaging, which enables parallel communication between the chip units, while standard packaging employs a serial interface that isn’t as performant or energy efficient. However, other competing processors with advanced packaging are still referred to as chiplet-based, so the terms are largely interchangeable.

Intel Meteor Lake Tile/Chiplet | Manufacturer / Node
CPU Tile | Intel / 'Intel 4'
3D Foveros Base Die | Intel / 22FFL (Intel 16)
GPU Tile (tGPU) | TSMC / N5 (5nm)
SoC Tile | TSMC / N6 (6nm)
IOE Tile | TSMC / N6 (6nm)

Meteor Lake has four disaggregated active tiles mounted atop one passive interposer: a compute (CPU) tile, graphics (GPU) tile, SoC tile, and I/O tile. All these units are Intel-designed and feature Intel microarchitectures, but external foundry TSMC will manufacture the I/O, SoC, and GPU tiles, while Intel manufactures the CPU tile on its Intel 4 process. All four of these active tiles ride on top of a single unifying Intel-produced Foveros 3D base tile that ties together the functional units with high enough bandwidth and low enough latency for the chip to function as close to one monolithic die as possible.
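As a quick illustration of how the manufacturing split works out, the tile table above can be written down as a small lookup structure. This is only a sketch for tallying the tiles: the names TILES, BASE_DIE, and tiles_by_manufacturer are our own, and the entries are transcribed from the table rather than taken from any Intel tooling.

```python
from collections import Counter

# Active tile -> (manufacturer, process node), transcribed from the table above.
TILES = {
    "CPU": ("Intel", "Intel 4"),
    "GPU": ("TSMC", "N5"),
    "SoC": ("TSMC", "N6"),
    "IOE": ("TSMC", "N6"),
}

# The Foveros base die sits underneath the active tiles, so it is tracked separately.
BASE_DIE = ("Intel", "22FFL (Intel 16)")


def tiles_by_manufacturer(tiles):
    """Count how many active tiles each manufacturer produces."""
    return Counter(maker for maker, _node in tiles.values())


counts = tiles_by_manufacturer(TILES)
print(f"{counts['TSMC']} of {len(TILES)} active tiles come from TSMC")
```

Tallied this way, three of the four active tiles come from TSMC, while the CPU tile and the base die stay in-house.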

All told, Meteor Lake has three compute units that can process AI workloads: the CPU, NPU, and GPU. AI workloads will be directed to each unit based on workload requirements, which we’ll dig into a bit further below.

Meteor Lake Compute (CPU) Tile

Intel fabs its compute (CPU) tile with the Intel 4 process, which it selected because it affords opportunities for tightly tuning its process node for the specific requirements of a high-powered CPU. We’ll dive into the details of the Intel 4 node later in the article.

As before, Intel has a mixture of P-cores and E-cores, with P-cores handling latency-sensitive single-threaded and multi-threaded work, while the E-cores step in to handle both background and heavily threaded tasks. However, these two types of cores are now augmented by two new low-power-island e-cores located on the SoC tile. These two new cores are geared for the lowest-power tasks, which we’ll cover in the SoC tile section below. Intel calls this new three-tier core hierarchy the 3D performance hybrid architecture.

The compute tile carries the Redwood Cove P-Cores and Crestmont E-cores, and surprisingly, there aren’t many IPC improvements to speak of. In fact, while the Redwood Cove cores do have some improvements under the hood, they don’t provide an improvement in instructions per clock (IPC) throughput. Intel says Redwood Cove is akin to what it has traditionally called a ‘tick,’ meaning it’s basically the same microarchitecture and IPC as found in the Golden Cove and Raptor Cove microarchitectures used with the 12th and 13th generation Alder/Raptor Lake processors.

With a tick, rather than relying upon microarchitectural IPC gains, Intel uses a proven architecture to unlock the advantages of a more refined and smaller process; in this case, Intel 4. The new Intel 4 process does provide better performance at any given point on the voltage/frequency curve than the Intel 7 node previously used in Intel’s PC chips, meaning it can either run faster at the same power level, or run at the same speed with lower power. Intel says it focused on extracting higher power efficiency with this design, so it’s clear that we shouldn’t expect radical performance gains from the P-cores. Intel does say the Intel 4 process confers a 20% improvement in power efficiency, which is impressive.
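To put the 20% figure in perspective, here is a back-of-the-envelope sketch of the two ways such a gain can be spent. It assumes the claim means 1.2x performance per watt and that the gain applies uniformly, which real silicon does not guarantee across the voltage/frequency curve. The 28W baseline and the function names are hypothetical, chosen only for illustration.

```python
def iso_performance_power(baseline_watts, efficiency_gain):
    """Power needed to deliver the same performance after a perf-per-watt gain."""
    return baseline_watts / (1.0 + efficiency_gain)


def iso_power_performance(baseline_perf, efficiency_gain):
    """Performance delivered at the same power after a perf-per-watt gain."""
    return baseline_perf * (1.0 + efficiency_gain)


GAIN = 0.20            # Intel's stated efficiency improvement for Intel 4
BASELINE_WATTS = 28.0  # hypothetical laptop power budget, not an Intel figure

# Same speed, lower power: roughly 23.33 W instead of 28 W.
print(round(iso_performance_power(BASELINE_WATTS, GAIN), 2))

# Same power, more speed: about 1.2x the normalized baseline performance.
print(round(iso_power_performance(1.0, GAIN), 2))
```

Under those assumptions, a 20% efficiency gain translates to roughly a 17% power reduction at the same performance, not a 20% reduction, which is worth keeping in mind when comparing vendor claims.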

Intel did do some plumbing work to accommodate the new tiled design, like improving the memory and cache bandwidth both on a per-core and package level, which could result in an extra bit of improvement in multi-threaded workloads. It also added enhanced telemetry data for its power management unit, which helps improve power efficiency and generate better real-time data that’s fed to the Thread Director, thus ensuring the correct workloads are placed on the correct cores at the right time.

Intel’s Crestmont E-Core microarchitecture does have a 3% IPC improvement over the previous-gen Gracemont, but much of that stems from the addition of support for Vector Neural Network Instructions (VNNI) that boost performance in AI workloads. Intel also made unspecified improvements to the branch prediction engine.

Crestmont does have one major new advance, though: This Crestmont architecture supports arranging the e-cores into either two or four-core clusters that share a 4MB L2 cache slice and 3MB of L3 cache. The previous-gen Gracemont didn’t have that capability, so Intel could only use e-cores in four-core clusters. Now Intel can carve out smaller dual-e-core clusters with twice the amount of cache per core, and that’s exactly the approach it took for the low-power-island e-cores on the SoC tile – those cores use the same Crestmont architecture as the standard e-cores on the compute die, but they are tuned for the TSMC N6 process node.
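The "twice the amount of cache per core" claim is simple division, sketched below. It assumes the shared L2 slice stays the same size whether two or four cores sit behind it, as the description above implies; the 4MB figure comes from the text, and the function name is our own.

```python
def l2_per_core_mb(slice_mb, cores_in_cluster):
    """Shared L2 capacity available per core, in MB, for one cluster."""
    if cores_in_cluster <= 0:
        raise ValueError("a cluster needs at least one core")
    return slice_mb / cores_in_cluster


L2_SLICE_MB = 4.0  # shared L2 slice size quoted in the text above

quad = l2_per_core_mb(L2_SLICE_MB, 4)  # four-core cluster
dual = l2_per_core_mb(L2_SLICE_MB, 2)  # two-core cluster

print(f"quad: {quad} MB/core, dual: {dual} MB/core, ratio: {dual / quad}x")
```

The ratio is 2x regardless of the slice size, so the conclusion holds even if the shipping slice capacity differs from the figure quoted here.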

As with prior generations, each E-core is single-threaded. Intel also doubled the L1 cache to 64KB and employs a 6-wide decode engine (dual 3-wide to improve latency and power consumption), 5-wide allocate, and 8-wide retire.

The Crestmont cores do not support AMX, AVX-512, or AVX10. [EDIT 9/22/2023: Corrected article to reflect that Meteor Lake does not support AVX10.]

Meteor Lake Graphics (GPU) Tile

The GPU tile is fabbed on the TSMC N5 process node and employs Intel’s Xe-LPG architecture, which now has many of the same features as Intel’s Xe-HPG architecture that’s found in its discrete graphics cards.

Intel claims a doubling of performance and performance-per-watt for this unit over the prior-gen, among other highlights. Intel also disaggregated the graphics tile by splitting off the Xe Media and Display engine blocks from the main engine on the GPU die and moving them onto the SoC tile, which helps with power consumption in many scenarios.

Naturally, the GPU tile is optimized for 3D performance, which includes hardware-accelerated ray tracing, mesh shading, variable rate shading, and sampler feedback. Intel also tuned the graphics engine’s voltage and frequency curve to run at much lower voltages and reach higher clock speeds. The GPU can also perform high-throughput AI operations using DP4A acceleration.
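For readers unfamiliar with DP4A, the operation is a four-element dot product of 8-bit integers that accumulates into a 32-bit result, which is the core arithmetic of quantized (INT8) inference. The sketch below models that behavior in plain Python. It assumes signed 8-bit inputs and wraparound on the accumulator, and it is a software model for illustration only, not Intel's hardware implementation or any driver API.

```python
def to_int8(x):
    """Wrap an integer into the signed 8-bit range [-128, 127]."""
    x &= 0xFF
    return x - 0x100 if x >= 0x80 else x


def to_int32(x):
    """Wrap an integer into the signed 32-bit range."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def dp4a(a, b, acc):
    """Dot product of four signed 8-bit lanes, added to a 32-bit accumulator."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("dp4a operates on exactly four lanes per operand")
    total = acc + sum(to_int8(x) * to_int8(y) for x, y in zip(a, b))
    return to_int32(total)


# 1*5 + 2*6 + 3*7 + 4*8 = 70, plus the incoming accumulator of 10.
print(dp4a([1, 2, 3, 4], [5, 6, 7, 8], 10))  # 80
```

Hardware performs this as a single operation per set of lanes, which is where the throughput advantage over separate multiplies and adds comes from.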

The slides above contain a good overview of the new design, and our resident GPU guru Jarred has written up a deeper dive on the Meteor Lake graphics unit, which you can read here.

Meteor Lake SoC Tile

The SoC tile is fabbed on the low-power TSMC N6 process and serves as the central communication point for the tiles with a new next-gen uncore. Given the SoC tile’s focus on low power usage, Intel also calls it the low power island. The SoC tile comes with two new compute clusters, the two low-power-island e-cores and Intel’s Neural Processing Unit (NPU), a block that’s used entirely for AI workloads, among many other units.

Intel moved all the media, display and imaging blocks from the GPU engine to the SoC tile, which helps maximize power efficiency by allowing those functions to operate on the SoC tile while the GPU is in a lower power state. The GPU tile is also fabbed on the more expensive TSMC N5 node, so removing these non-performance-sensitive blocks allowed Intel to better utilize the pricier transistors on the GPU tile for graphics compute. As such, the SoC tile houses the display interfaces, like HDMI 2.1, DisplayPort 2.1, and DSC 1.2a, while also supporting 8K HDR and AV1 encoding/decoding.

The SoC tile resides next to the GPU tile, and the two communicate over a die-to-die (tile-to-tile) interface on one side of the die. You can see this tile-to-tile interface in the above album. Each side of the interconnect has a primary mainband interface that provides the bandwidth required to pass data between the chips. This connection runs through the underlying Foveros 3D silicon, thus providing a much more efficient pathway than standard traces that run through organic substrates (like PCBs). Additionally, a secondary connection provides the interfaces for clock, test and debug signals, along with a dedicated Power Management Controller (PMC) interface between the tiles.

The GPU is connected to an isolated high-performance cache-coherent network on chip (NOC) that connects the NPU, low-power e-cores, and media and display engines to ensure they have efficient access to the memory bandwidth provided by the memory controllers that also reside on the same bus. This NOC also connects to the compute (CPU) tile via another tile-to-tile interface on the other side of the tile.

Intel has a second lower-power IO Fabric (not coherent) that connects to the I/O tile via another tile-to-tile interface. This IO fabric also contains other lower-priority devices, like Wi-Fi 6E and 7, Bluetooth, the security engines, Ethernet, PCIe, SATA, and the like.

That leaves Intel with two independent fabrics on the die, but they must be able to communicate with one another. Intel connected the two with an I/O Cache (IOC) that buffers traffic between the two fabrics. This does incur a latency penalty for cross-fabric communication between the two buses, but the additional latency falls within performance targets for the low-priority I/O fabric.