AMD Talks Hybrid Ryzen CPU Concepts, Avoiding Intel's AVX-512 Problem
AMD wants to avoid Intel's mistakes.
During Computex 2023, I had a chance to visit AMD's towering offices in Taipei, Taiwan, to see the company's Ryzen AI demo and speak with David McAfee, the Corporate VP and GM of the Client Channel Business. Most of our conversation centered on AMD's efforts in the consumer AI space, but I also squeezed in a few questions about AMD's take on hybrid CPUs. McAfee told me AMD has a different vision of hybrid processors than Intel that would avoid the complexity that forced Intel to remove AVX-512 support from its chips.
I interviewed AMD CTO Mark Papermaster two weeks ago in Antwerp, Belgium. He told me that we would "see high-performance cores mixed with power-efficient cores mixed with acceleration" in future AMD client [consumer] processors, signaling that, like Intel before it, AMD would adopt a hybrid CPU execution core design in the future. That wasn't too surprising -- we saw the first signs of two different CPU core types in AMD's software manuals months ago. Besides, AMD is already laying the foundation with its coming EPYC Bergamo chips with dense Zen 4c cores akin to efficiency cores.
AMD's current Ryzen 7040 laptop chips already feature a hybrid design, but not with two different types of CPU cores. Instead, the Ryzen 7040 has just one type of CPU core paired with an in-built AI accelerator engine that operates independently of the CPU and GPU cores. This engine provides advantages for certain types of AI inference workloads, but the CPU and GPU cores are better for other types of inference. So, the trick is to direct the different AI workloads to the correct type of cores to extract the best performance and power efficiency.
Throwing separate performance and efficiency CPU cores into that mix would introduce yet another compute option for AI inference workloads, and I asked McAfee if, conceptually, it would be feasible that efficiency cores would be better for AI than a dedicated piece of silicon (the AI engine). McAfee explained that the AI engines' strict focus on AI-specific operations would give it an efficiency advantage over any general-purpose CPU compute -- even an efficiency core.
Then we shifted to discussing Intel's hybrid chips, which have two types of cores, each with its own unique microarchitecture. That's created interesting problems: Intel's performance cores support AVX-512, but the smaller efficiency cores do not. That led Intel to disable AVX-512 support entirely (forcibly in the end), thus de-featuring its own chip and wasting precious die area.
I asked McAfee how AMD felt about that approach to hybrid designs.
"What I will say is this, I think the way that we think about it, the approach of two very different performance and efficiency cores with very different ISA support and IPC and capability is not necessarily the right approach," McAfee responded. "I think it invites far more complexity around what can execute where, and as we've looked at different options for core design, that's not the approach that we're taking.
"I think as we roll more of this out over time, what you'll see from us is an approach that takes into consideration the advantages that different core targeting can provide, but doing it in a way that's much more, from an application perspective, much more homogeneous."
We already know that AMD's Zen 4c efficiency cores, which it will use in the upcoming Bergamo server chips, will support the same instructions, like AVX-512, as the full-featured performance cores. However, they'll have a cut-down cache hierarchy to reduce die area consumption. The goal of keeping IPC the same across the performance and efficiency cores is important. In contrast, Intel's efficiency cores have lower IPC than its performance cores (that could result in tradeoffs in its other e-core aspirations, like Sierra Forest).
"ISA, first of all, keeping that consistent to where a workload can operate on any core, has dramatic advantages," McAfee said. "And even when you look at a Ryzen desktop CPU today, the way that the Windows scheduler is plumbed, the ability to identify cores that are faster, slower, etc., and steer threads to different cores depending on the ranking or capability within a CPU; that's a well-established technique that we've used for quite some time. This then leads to, in our opinion, using a mechanism where the capability of the cores is more consistent.
"This is a far more tried and true way to look at bringing multiple different core targeting types into a design. I think the Intel approach invites a lot of complexity into the way that it operates. And I think our analysis has been that. I don't think you'll see us go down that path in the same way they have, if and when it comes to a Ryzen processor," McAfee concluded.
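To make the idea of ranking-based thread steering concrete, here is a minimal sketch in Python. It is not AMD's or Microsoft's actual scheduler logic; the `Core` class, the `rank` field, and the `place_threads` function are hypothetical names invented for illustration. The point it shows is that when every core supports the same instructions, the placement decision needs only a ranking and never a check of what a given core can execute.

```python
from dataclasses import dataclass


@dataclass
class Core:
    core_id: int
    rank: int          # hypothetical ranking: lower value = faster, preferred core
    busy: bool = False


def pick_core(cores):
    """Return the best-ranked idle core, or None if every core is busy."""
    idle = [c for c in cores if not c.busy]
    return min(idle, key=lambda c: c.rank) if idle else None


def place_threads(cores, n_threads):
    """Place up to n_threads threads, always taking the best-ranked idle core.

    No ISA capability check is needed: in this model any thread can run
    on any core, so only the ranking matters.
    """
    placement = []
    for _ in range(n_threads):
        core = pick_core(cores)
        if core is None:
            break  # more threads than idle cores
        core.busy = True
        placement.append(core.core_id)
    return placement
```

With cores ranked 2, 0, and 1, two threads would land on the cores ranked 0 and 1, leaving the slowest core idle until it is needed.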
Unlike Papermaster, McAfee was noncommittal on if or when hybrid would come to Ryzen, and we don't know where AMD would first introduce a hybrid architecture with Ryzen, be it with a monolithic APU or one of its chiplet-based models. However, it is clear that AMD envisions a hybrid future that would avoid the tradeoffs we've seen with Intel's design decisions behind the Alder and Raptor Lake processors.
Some of AMD's own decisions might be informed by analyzing Intel's missteps, or it may have just been the common sense of IP reuse with the existing core architecture -- it's a far lighter lift to tweak a microarchitecture than embarking upon a clean-sheet design. In either case, the ability to preserve support for AVX-512 would likely give AMD the performance advantage in vectorized workloads, provided Intel doesn't follow suit.
Conversely, one could argue that Intel's approach of having a separate microarchitecture tuned for lower-power operation is the better one, provided it is paired with uniform ISA support across both types of cores. If Intel has corrected its ISA mismatch with Meteor Lake and maintained support for AVX-512 across both core types, it could also prove to be a potent combo.
In either case, it's clear that while AMD would be second to market with a hybrid design, it will take a much different approach. Only time will tell how the two techniques stack up in the benchmarks.
Paul Alcorn is the Managing Editor: News and Emerging Tech for Tom's Hardware US. He also writes news and reviews on CPUs, storage, and enterprise hardware.
-
Speaking of Intel's hybrid approach to AVX-512 on consumer CPUs, the AVX-512 instructions were never added to the efficiency cores because they would use way too much die space and defeat the whole purpose of efficiency cores.
And, this is why scalable vector ISAs like the RISC-V vector extensions are superior to fixed-size SIMD. You can support both kinds of microarchitecture while running the exact same code.
Though catching the bad instruction fault on the E-cores and then scheduling the thread only on the P-cores is something that could have been added, at least to Linux (some third-party patches were in the works), if Intel had not fused off the feature entirely.
Developers were also reluctant to use AVX-512 because the CPU takes a heavy frequency hit when this mode is engaged. I don't see how Intel didn't notice this before. Then, there's also a small additional penalty of about three percent when switching into and out of 512-bit execution mode.
Since AVX-512 couldn't be enabled at all unless you disabled the efficiency cores completely, it wouldn't benefit your workload unless the AVX-512 gains on the remaining cores offset the loss of those extra cores.
If your code benefits from AVX-512, it'll probably benefit from turning off the efficiency cores too. Sixteen 256-bit AVX channels are the same as eight 512-bit channels in theoretical throughput. Because there are fewer load/store commands and fewer bunches of setup code to run, overall theoretical efficiency should be higher.
The real solution would be for Intel to detect the presence of AVX-512 instructions then automatically and unconditionally pin the thread to the big cores. It wouldn't be THAT hard either, just catch the unknown instruction exception and see if it is AVX-512, and then move the thread. -
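The fault-and-migrate idea in the comment above can be sketched as a small Python simulation. This is not how any real kernel or Intel hardware implements it; `IllegalInstruction`, `Core`, `Thread`, and `run_with_fallback` are hypothetical names, and instructions are modeled as plain strings. One simplification is stated in the code: the simulation re-runs the thread from the start after the fault, whereas a real implementation would resume at the faulting instruction.

```python
from dataclasses import dataclass, field


class IllegalInstruction(Exception):
    """Stands in for the hardware fault raised on an unsupported opcode."""


@dataclass
class Core:
    core_id: int
    supports_avx512: bool


@dataclass
class Thread:
    name: str
    instructions: list
    allowed_cores: set = field(default_factory=set)  # empty set = no pinning


def execute(thread, core):
    """Run every instruction; fault on AVX-512 if the core lacks support."""
    for ins in thread.instructions:
        if ins.startswith("avx512") and not core.supports_avx512:
            raise IllegalInstruction(ins)


def run_with_fallback(thread, cores, start):
    """Try the starting core; on an AVX-512 fault, pin to capable cores and retry."""
    try:
        execute(thread, start)
        return start.core_id
    except IllegalInstruction:
        capable = [c for c in cores if c.supports_avx512]
        if not capable:
            raise  # no core can run this code at all
        # Pin the thread so it is never scheduled on a small core again.
        thread.allowed_cores = {c.core_id for c in capable}
        # Simplification: restart from the beginning instead of resuming.
        execute(thread, capable[0])
        return capable[0].core_id
```

A thread containing an `avx512_*` instruction that starts on a core without support ends up pinned to the capable cores, while a purely scalar thread stays where it started and is never pinned.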
truerock I asked McAfee how AMD felt about that approach (Intel's) to hybrid designs.
"What I will say is this, I think the way that we think about it, the approach of two very different performance and efficiency cores with very different ISA support and IPC and capability is not necessarily the right approach," McAfee responded. "I think it invites far more complexity around what can execute where, and as we've looked at different options for core design, that's not the approach that we're taking.
I know I keep repeating myself and a lot of people disagree. Regardless, IMO, putting low performance cores on a CPU designed for a HEDT is a dumb idea.
Even more... putting a GPU on a CPU designed for HEDTs is an even dumber idea. -
RedBear87 That's created interesting problems: Intel's performance cores support AVX-512, but the smaller efficiency cores do not. That led Intel to disable AVX-512 support entirely (forcibly in the end), thus de-featuring its own chip and wasting precious die area
I would stress the part where the article says "forcibly in the end". Lack of AVX-512 isn't an issue, it's a feature: Intel loves its segmentation and assumed that blocking/removing AVX-512 in its mainstream Alder Lake and Raptor Lake CPUs would boost (somewhat) the sales of its HEDT Sapphire Rapids. AMD is looking at this issue from a completely different angle, since they (still) haven't become so lunatic as to fuse off features like AVX-512 that are physically present on their chips. -
JayNor Intel joined something named "RISE" with Samsung and QCOM, to work on RISC-V designs. Perhaps they can make use of it in a SIMD accelerator. Slap a CXL/PCIE5 interface on a RISC-V avx512 accelerator chip and get all the avx512 heat you want to pay for.
There may be another option in the works. One of the PVC GPU presentations mentions a common SIMD architecture with the CPU. How far are we from getting standard C++ inclusion for something like SYCL that would detect accelerators with avx512 capabilities at runtime and JIT compile the kernels to take advantage? You can have this now if you use DPC++. -
msroadkill612 The key to AMD's smooth roadmap, & much else, is Infinity Fabric IMO.
AMD figured they couldn't beat Intel at big fancy chips, so they focused on a fancy bus to team simpler resource modules into consumer & large scale modules.
This often meant that old, prevalidated subsystems (IO eg.) could be left intact, & performance objectives are pursued in focused/isolated areas.
Less to re-design, test & troubleshoot. Their product roadmap timelines have been achieved far better.
AMD's focus on IF was a big part of its amazing resurgence from the brink. They made inferior early Zen parts initially, but won via superior architecture (including costs). -
usertests AMD's current Ryzen 7040 laptop chips already feature a hybrid design, but not with two different types of CPU cores. Instead, the Ryzen 7040 has just one type of CPU core paired with an in-built AI accelerator engine that operates independently of the CPU and GPU cores.
An incidental accelerator shouldn't make the CPU/APU be considered a hybrid design. Maybe mismatched cache per chiplet (7900X3D, 7950X3D) could be thought of as hybrid, since that sounds like what Zen 4 + Zen 4C and Zen 5 + Zen 5C could mostly be.
I asked McAfee if, conceptually, it would be feasible that efficiency cores would be better for AI than a dedicated piece of silicon (the AI engine). McAfee explained that the AI engines' strict focus on AI-specific operations would give it an efficiency advantage over any general-purpose CPU compute -- even an efficiency core.
That answer should have been obvious, come on now.
truerock said: I know I keep repeating myself and a lot of people disagree. Regardless, IMO, putting low performance cores on a CPU designed for a HEDT is a dumb idea.
Even more... putting a GPU on a CPU designed for HEDTs is an even dumber idea.
Not if it's done right.
I'm going to assume you are counting Ryzen 7000 as HEDT. That iGPU costs almost no die area, is in the TSMC N6 I/O chiplet separate from CPU cores, and can be more than enough for users who just want some display outs, video decode/encode, and even light gaming. Yeah, AMD should add a similar iGPU to Threadripper CPUs.
Ryzen (or any) CPUs can already have certain cores that consistently boost a little higher than others, so those cores would be favored. We've also seen 7900X3D and 7950X3D with the differing levels of cache per CCX. AMD's hybrid/heterogeneous implementation could be a more extreme version of these two things.
At this point I just want to know the specifics of how they would reduce the die area of a low performance core, other than removing half the L3 cache or using a smaller process node. That's one of the last pieces of the puzzle. -
PaulAlcorn
usertests said: That answer should have been obvious, come on now.
You're right -- it is completely obvious, and I knew the answer before I asked; it's an interviewing tactic. The intention of this question was to open up the topic of efficiency cores paired with performance cores in a hybrid config, which AMD hasn't spoken about publicly. The only reason I left mention of that question, which did work to get him to speak about a topic they aren't really talking about yet, was because I used the word "conceptually" in the question. It is important that I disclose that, as much of his answer came as a follow-on to that question, and I don't want to misconstrue his statements.
<snip>
I think that an accelerator does make the chip a hybrid architecture, as it is connected to the same memory space as the CPU and GPU cores. -
Co BIY RedBear87 said: lack of AVX-512 isn't an issue, it's a feature, Intel loves its segmentation and assumed that blocking/removing AVX-512 in its mainstream Alder Lake and Raptor Lake CPUs would boost (somewhat) the sales of its HEDT Sapphire Rapids.
The most inefficient core is one that the customer is not paying for! Giving away free capability is not efficient.
Chiplets, tiles and design tools that allow greater flexibility are going to lead to ever more market segmentation to reduce what economists would term "consumer surplus" (the consumer mirror image to profit).
RedBear87 said: AMD is looking at this issue from a completely different angle, since they (still) haven't become so lunatic to fuse off features like AVX-512 that are physically present on their chips.
AMD doesn't market segment as aggressively because their smaller volumes (and perhaps less control of the production side ) don't allow them to do so as efficiently. This can result in more "consumer surplus" (the feeling of getting more than you paid for) if you are in the right part of the SKU stack. It also means that finding a perfect match to your requirements/budget may be harder. -
truerock
msroadkill612 said: The key to AMD's smooth roadmap, & much else, is Infinity Fabric IMO.
AMD figured they couldn't beat Intel at big fancy chips, so they focused on a fancy bus to team simpler resource modules into consumer & large scale modules.
This often meant that old, prevalidated subsystems (IO eg.) could be left intact, & performance objectives are pursued in focused/isolated areas.
Less to re-design, test & troubleshoot. Their product roadmap timelines have been achieved far better.
AMD's focus on IF was a big part of its amazing resurgence from the brink. They made inferior early Zen parts initially, but won via superior architecture (including costs).
I agree with your assertions.
I think Intel has superior resources and some superior technologies - but, Intel overthinks, over engineers and puts too much effort in trying to differentiate its CPUs.
The AMD approach of KISS has a lot of short-term benefit. -
Actually, it's worth noting that the first gossip and leak about AMD's hybrid future surfaced in a June 2021 patent application.
This patent describes "A method for relocating a computer-implemented task from a relatively less-powerful processor to a relatively more-powerful processor."
This diagram shows the arrangement of Big/Little processors. There has been no update on this though. The hesitancy was due to Windows not offering a proper task scheduler in Windows 10, according to AMD, and it wanted those cores to be useful instead of just a marketing gimmick to let it say it has more cores.
AMD had applied for this patent describing methods by which one type of CPU would move work over to another type of CPU:
https://i.imgur.com/4mEXZzr.jpg
As per this patent, CPUs would rely on core utilization metrics to determine when it was appropriate to move a workload from one type of CPU to the other.
Metrics include the amount of time the CPU has been working at maximum speed, the amount of time the CPU has been using maximum memory, average utilization over a period of time, and a more general category in which a workload is moved from one CPU to the other based on unspecified metrics related to the execution of the task.
So if I understand this correctly, when the CPU determines that a workload should move from CPU A to CPU B, the core currently performing the work (CPU A, in this case), is put into an idle or stalled state.
The architecture state of CPU A is saved to memory and loaded by CPU B, which continues the process. AMD's patent describes these shifts as bi-directional -- the small core can shift work to the large, or vice-versa.
https://www.freepatentsonline.com/y2021/0173715.html
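The migration flow described in this comment can be sketched in Python as follows. This is only an illustration of the comment's summary, not the patent's actual method: the threshold values, the `CoreMetrics` fields, and the `migration_target` and `hand_off` functions are all assumptions made up for the example, and the patent's "unspecified metrics" category is left out.

```python
from dataclasses import dataclass


@dataclass
class CoreMetrics:
    time_at_max_speed_ms: float    # time spent running at maximum speed
    time_at_max_memory_ms: float   # time spent at maximum memory usage
    avg_utilization: float         # average utilization over a window, 0.0 to 1.0


# Hypothetical thresholds; no specific values are given in the comment above.
MAX_SPEED_LIMIT_MS = 50.0
MAX_MEMORY_LIMIT_MS = 50.0
HIGH_UTILIZATION = 0.80
LOW_UTILIZATION = 0.20


def migration_target(current, m):
    """Return "big" or "little": where the task should run next."""
    if current == "little":
        overloaded = (
            m.time_at_max_speed_ms > MAX_SPEED_LIMIT_MS
            or m.time_at_max_memory_ms > MAX_MEMORY_LIMIT_MS
            or m.avg_utilization > HIGH_UTILIZATION
        )
        return "big" if overloaded else "little"
    # Shifts are bi-directional: a lightly loaded big core hands work back.
    return "little" if m.avg_utilization < LOW_UTILIZATION else "big"


def hand_off(arch_state, memory):
    """Model the transfer: the source core saves its state, the target loads it."""
    memory["saved_state"] = dict(arch_state)   # source core is now idle/stalled
    return dict(memory["saved_state"])         # target core resumes from this copy
```

A little core that has spent 80 ms at maximum speed would be flagged for migration to the big core, and `hand_off` models the save-to-memory and reload step that moves the architecture state across.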