Intel Architecture Day 2021: Alder Lake Chips, Golden Cove and Gracemont Cores

Intel's hybrid x86 architecture began with the low-power Lakefield processors, which didn't find much success in the market: in fact, Intel has already sent them off to the retirement home. Some of the teething pains with those early hybrid x86 chips boiled down to subpar operating system support – Windows 10X was supposed to arrive with enhanced scheduling to unlock the efficiencies of the hybrid design, but Microsoft canceled the operating system.

As such, Intel's expansion of the hybrid architecture to its high-performance products is a risky move, largely because the challenge lies in ensuring that each type of workload lands on the right type of execution core. A core that excels at high-performance workloads isn't much help if those workloads consistently land on the slower cores, and thread scheduling systems based entirely on static rules (priority, foreground, background) tend to be inefficient and create software programming overhead.

That's where Intel's Thread Director technology comes in. This hardware-based technology provides enhanced telemetry data to Windows 11 to ensure that threads are scheduled to either the P or E cores in an optimized and intelligent manner, potentially easing one of the major pain points for a hybrid architecture in a standard desktop environment. It's also transparent to software.

This technology works by feeding the Windows 11 operating system with low-level telemetry data that is collected from within the processor itself, thus informing the scheduler about the state of the core, be it power, thermal or otherwise. (As we'll cover shortly, Intel has integrated a new power microcontroller in each Gracemont core, a first, that collects similar data on the order of microseconds instead of milliseconds, so it might be part of the new telemetry system.)

Thread Director can also detect the instruction mix (scalar/vector) used in any given thread at nanosecond granularity, and then communicate that data to the Windows 11 scheduler so the thread can be steered to the correct execution core, be that a high-performance P-Core or an efficient E-Core. Typically, vector/AI workloads will be prioritized to performance cores while scalar workloads and background tasks are moved to efficiency cores. However, the system is dynamic, so thread placement decisions can vary based on the mix of conditions and workloads present on the processor at any given time.

Threads can also go through various phases and instruction mixes over their lifetime, so the scheduler constantly re-adjusts to the current situation based on the real-time telemetry data. This is helpful when the number of threads designated for 'performance' outnumbers the available cores, for instance. In that case, less demanding 'performance' threads, such as a program in a spin loop, can be moved off to the efficiency cores while more deserving workloads are assigned to the performance cores.
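To make the placement idea concrete, here is a minimal toy model in Python. It is not Intel's or Microsoft's algorithm: the `ThreadSample` fields, the `P_THRESHOLD` cutoff and the scoring rule are all hypothetical, chosen only to illustrate how vector-heavy threads could win P-cores while spinning or mostly scalar threads fall back to E-cores.

```python
from dataclasses import dataclass

# Hypothetical cutoff: threads below this score never compete for a P-core.
P_THRESHOLD = 0.25


@dataclass
class ThreadSample:
    """One telemetry snapshot for a thread (all fields are illustrative)."""
    name: str
    vector_ratio: float      # share of recent instructions that were vector ops, 0..1
    spinning: bool = False   # True if the thread is busy-waiting in a spin loop


def demand(sample):
    """Score how much a thread would benefit from a P-core (made-up rule)."""
    return 0.0 if sample.spinning else sample.vector_ratio


def place_threads(samples, p_cores):
    """Give the highest-scoring threads the P-cores; everything else gets an E-core."""
    placement = {}
    p_used = 0
    for sample in sorted(samples, key=demand, reverse=True):
        if p_used < p_cores and demand(sample) >= P_THRESHOLD:
            placement[sample.name] = "P"
            p_used += 1
        else:
            placement[sample.name] = "E"
    return placement


if __name__ == "__main__":
    samples = [
        ThreadSample("render", 0.8),
        ThreadSample("spin", 0.9, spinning=True),
        ThreadSample("encode", 0.6),
        ThreadSample("indexer", 0.05),
    ]
    print(place_threads(samples, p_cores=2))
    # A later snapshot: "render" has moved into a mostly scalar phase.
    samples[0].vector_ratio = 0.1
    print(place_threads(samples, p_cores=2))
```

Calling `place_threads` again with fresh samples stands in for the constant re-adjustment described above: a thread's placement is recomputed from its latest snapshot rather than fixed when it starts.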

Previously, the operating system didn't have access to this type of telemetry data to inform scheduling decisions, instead using simple data like whether the process was a foreground or background task. This enhanced system allows the operating system and processor to work in tandem to ensure correct scheduling in real time, thus avoiding intensive software re-coding. This is a promising sign that existing code will run well on the Alder Lake processors.

If programmers want more granular control, that's there, too. The new approach also lets programmers specify how certain threads should be treated through an expansion of the PowerThrottling API, which allows developers to assign a QoS attribute to their threads. Additionally, a new EcoQoS classification tags threads that are best suited to the efficiency cores to ensure they are prioritized to execute on the E-Cores.

Microsoft says that the Edge browser and 'various' Windows 11 components now take advantage of the EcoQoS classification system.
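As a rough illustration of what tagging a thread could look like, the sketch below calls the Win32 `SetThreadInformation` function through Python's `ctypes`. The article does not show any code, so this is an assumption about usage rather than Intel's or Microsoft's sample: the constant values and the struct layout are copied from my reading of the Windows SDK headers and should be checked against them, and the helper names `eco_qos_state` and `set_current_thread_eco_qos` are hypothetical. On non-Windows systems the helper simply returns `False`.

```python
import ctypes
import sys

# Values as I read them in the Windows SDK (processthreadsapi.h); verify before relying on them.
THREAD_POWER_THROTTLING_CURRENT_VERSION = 1
THREAD_POWER_THROTTLING_EXECUTION_SPEED = 0x1
THREAD_POWER_THROTTLING_INFO_CLASS = 3  # ThreadPowerThrottling in THREAD_INFORMATION_CLASS


class ThreadPowerThrottlingState(ctypes.Structure):
    """Mirrors THREAD_POWER_THROTTLING_STATE: three 32-bit unsigned fields."""
    _fields_ = [
        ("Version", ctypes.c_uint32),
        ("ControlMask", ctypes.c_uint32),
        ("StateMask", ctypes.c_uint32),
    ]


def eco_qos_state(enable):
    """Build the request: control the execution-speed bit and turn it on or off."""
    speed = THREAD_POWER_THROTTLING_EXECUTION_SPEED
    return ThreadPowerThrottlingState(
        Version=THREAD_POWER_THROTTLING_CURRENT_VERSION,
        ControlMask=speed,
        StateMask=speed if enable else 0,  # 0 explicitly opts out of throttling
    )


def set_current_thread_eco_qos(enable=True):
    """Ask Windows to treat the calling thread as EcoQoS. Returns True on success."""
    if sys.platform != "win32":
        return False  # the API only exists on Windows
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    try:
        set_info = kernel32.SetThreadInformation
    except AttributeError:
        return False  # older Windows builds do not export this function
    kernel32.GetCurrentThread.restype = ctypes.c_void_p
    kernel32.GetCurrentThread.argtypes = []
    set_info.restype = ctypes.c_int
    set_info.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p, ctypes.c_uint32]
    state = eco_qos_state(enable)
    ok = set_info(
        kernel32.GetCurrentThread(),
        THREAD_POWER_THROTTLING_INFO_CLASS,
        ctypes.byref(state),
        ctypes.sizeof(state),
    )
    return bool(ok)


if __name__ == "__main__":
    print("EcoQoS requested:", set_current_thread_eco_qos(True))
```

A background worker (indexing, telemetry upload and the like) would call the helper once at startup. The tag is a hint to the scheduler, not a hard core affinity, so the operating system is still free to place the thread wherever it sees fit.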

This looks to be a promising and less-intrusive (at least from a coding standpoint) method of ensuring that the correct threads land on the correct cores, thus delivering optimal performance. That said, we'll have to see it in action before we can pass judgment on its efficacy – much of its potency will boil down to the latency involved in communicating telemetry data and moving the thread, and Intel isn't sharing those details yet. Additionally, it's possible that an excess of communication between the Thread Director and the Windows 11 scheduler could create a challenging workload of its own, so finding the right amount of granularity will be key to ensuring both timely thread placement and a minimum of system overhead.

The system is already far along in development, and Microsoft says that further enhancements to the engine are underway and planned for Windows 11, with more details to be shared at a later date.

Alder Lake chips will also work fine with a bog-standard Windows 10 operating system – existing thread-scheduling techniques continue to work with the processors, just not as well. While the chips work, you'll miss out on the enhanced capabilities of Thread Director (that's Windows 11 only), which will have a varying impact on performance and power consumption based on instruction type and application usage models. In other words, your mileage will vary.

Intel AVX-512 Support Culled

Finally, it has long been known that the Gracemont cores do not support the AVX-512 instruction set, and speculation has been rife about how the code would work on Alder Lake processors, if at all. Intel's answer is simple: AVX-512 will not work on either type of core present in Alder Lake. The high-performance cores do feature the Golden Cove architecture that supports AVX-512 natively, but Intel has fused that feature off (yes, the 512-bit FMA is still present and consumes die area) for the consumer chips. In contrast, server chips with Golden Cove have two 512-bit FMAs and fully support AVX-512. Meanwhile, the Gracemont cores are simply not AVX-512 capable, and disabling support allows the Alder Lake chip to have uniform ISA support. 
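For software that ships an AVX-512 code path, the practical upshot is a runtime feature check with a fallback. The sketch below is one way to do that on Linux, assuming the kernel's `/proc/cpuinfo` flag names (`avx512f`, `avx2`); the function names `cpu_flags` and `pick_kernel` and the three code-path labels are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def pick_kernel(flags):
    """Choose the widest code path the CPU reports support for."""
    if "avx512f" in flags:   # AVX-512 Foundation
        return "avx512"
    if "avx2" in flags:
        return "avx2"
    return "scalar"


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        text = ""  # not Linux, or the file is unreadable: fall back to scalar
    print(pick_kernel(cpu_flags(text)))
```

Checking the reported flags, rather than the processor model, means the same binary would take the AVX2 path on a chip that does not advertise AVX-512 and the AVX-512 path on one that does.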

Paul Alcorn
Managing Editor: News and Emerging Tech

Paul Alcorn is the Managing Editor: News and Emerging Tech for Tom's Hardware US. He also writes news and reviews on CPUs, storage, and enterprise hardware.

  • TerryLaze
    Alder Lake does not support AVX-512 under any condition (fused off in P cores, not supported in E cores).
    Called it: all they need to do to get power draw down to Ryzen levels is turn off AVX.
    If they also locked down power limits, at least on non OC boards, they can sell it as super future low power tech.
  • JWNoctis
    No AVX-512 at all?...Yeah, that's gonna be a rather huge regression for those applications that made use of them, which is admittedly uncommon in consumer space.

    But then there's not that much difference between Core, Pentium, and Celeron lines anymore, unless they are going to detune IPC in microcode or something. What's the name of the next one, I wonder?
  • TerryLaze
    JWNoctis said:
    No AVX-512 at all?...Yeah, that's gonna be a rather huge regression for those applications that made use of them, which is admittedly uncommon in consumer space.

    But then there's not that much difference between Core, Pentium, and Celeron lines anymore, unless they are going to detune IPC in microcode or something. What's the name of the next one, I wonder?
    Rocketlake didn't get any pentiums or celerons, no reason to believe that alder lake will have them.
    Now the celeron (atom) is going to be integrated in the core... :p
  • NightLight
    Promising stuff... great time to buy some shares!
  • Giroro
    Intel keeps talking up how great their tiny Gracemont cores are... But if 4 Gracemont cores were able to outperform 1 Golden Cove core, then the entire CPU would be Gracemont. I think it's no coincidence that their desktop CPUs tacked on exactly enough tiny cores to confuse people into thinking they have parity with 16-core Ryzen. Just like how they renamed their 10nm process to give the illusion of parity.

    I have no confidence whatsoever that their 8C/8c/24t processor has better multithreaded performance than a hypothetical 10C/0c/20t processor. If that were the case, then the configurations would be more like 0C/40c/40t... Or maybe even 2C/32c/36t.

    But no, this is all about how they can technically get away with selling what is essentially an 8 core processor, using a giant sign that says 16 CORES* WORLD'S BEST EFFICIENCY**!
    They at least know performance matters a little bit, because a 40 CORE CPU has got to be pretty tempting to somebody in their marketing department, regardless of how bad it would be.
  • Johnpombrio
    This will probably be my next CPU, replacing my i9-9900K. I need at least PCIe 4.0 for my 2TB Samsung 980 Pro ($313 lightning deal on Amazon Prime Day in June). I will never need all of these cores tho.
  • mdd1963
    Cautiously optimistic, but, I recall feeling the same way before 11th gen released...

    This time I will be pessimistic until happily (hopefully) proven wrong. :)

    (Need some BF1/BF5 1080P benchmarks to truly know if Alder Lake is 'mo betta'!)
  • JamesJones44
    Giroro said:
    Intel keeps talking up how great their tiny Gracemont cores are... But if 4 Gracemont cores were able to outperform 1 Golden Cove core, then the entire CPU would be Gracemont. I think it's no coincidence that their desktop CPUs tacked on exactly enough tiny cores to confuse people into thinking they have parity with 16-core Ryzen. Just like how they renamed their 10nm process to give the illusion of parity.

    I have no confidence whatsoever that their 8C/8c/24t processor has better multithreaded performance than a hypothetical 10C/0c/20t processor. If that were the case, then the configurations would be more like 0C/40c/40t... Or maybe even 2C/32c/36t.

    But no, this is all about how they can technically get away with selling what is essentially an 8 core processor, using a giant sign that says 16 CORES* WORLD'S BEST EFFICIENCY**!
    They at least know performance matters a little bit, because a 40 CORE CPU has got to be pretty tempting to somebody in their marketing department, regardless of how bad it would be.

    How do you explain the M1's multi-thread performance then? The quote makes little sense. There is a lot more to CPU design than the number of cores and their single-thread IPC. I don't claim to know if they will be able to compete with a 10 big core CPU, but the M1 and other hybrid architectures prove that very good multi-thread performance can be had with the big.LITTLE design.
  • ezst036
    Giroro said:
    I think it's no coincidence that their desktop CPUs tacked on exactly enough tiny cores to confuse people into thinking they have parity with 16-core Ryzen. Just like how they renamed their 10nm process to give the illusion of parity.

    There may be some of that, but Intel at this point can afford to cede some of the high end to AMD. They don't have to outright win, they just have to be competitive enough. And Intel is also prepping to fight AMD (and Nvidia) on the GPU front.

    Intel's biggest threat is ARM. They cannot afford to keep taking it on the chin any longer in mobile. Alder Lake big.LITTLE will be a game changer even if it doesn't go the final mile to energy efficiency utopia.

    But really, I think people also forget or discount that the pressure from manufacturing is also playing a factor here. Intel's fab woes go back how many years now? Intel needs small cores partially because of that, and manufacturing woes in all sectors of chip manufacturing are going to force AMD to do the same with big.LITTLE. They've got Jaguar or Bobcat or whatever the latest iteration of that little core was; it won't be long before it's tacked on for some AMD big.LITTLE also.

    16 big cores is simply more stress on manufacturing than 8 big and 8 small when you factor in the big picture and tons of silicon, wafer after wafer after wafer. Alder Lake helps Intel with its fab woes.
  • ezst036
    JamesJones44 said:
    How do you explain the M1's multi-thread performance then

    You only need one word.

    Optimization.

    Apple controls all aspects of macOS, and is particularly fond of cutting off its own customers after so many years. They don't want, don't need, and simply don't carry a lot of legacy "baggage" - even if you spent $8000 on your computer, Apple will cut you off.

    You explain M1 performance with optimizations under the hood.