Intel files patent for 'Software Defined Supercore' — increases single-thread performance and IPC by mimicking ultra-wide execution using multiple cores
Reverse Hyper-Threading?

Intel has filed a patent application for a technology it calls 'Software Defined Supercore' (SDC), which enables software to fuse the capabilities of multiple cores into a virtual ultra-wide 'supercore' capable of improving single-thread performance, provided the workload has enough parallel work. If the technology works as designed, Intel's future CPUs could offer faster single-thread performance in select applications that can use SDC. For now, this is just a patent filing that may or may not become a reality.
Intel's Software Defined Supercore (SDC) technology combines two or more physical CPU cores to cooperate as a single high-performance virtual core by dividing a single thread's instructions into separate blocks and executing them in parallel. Each core runs a distinct portion of the program, while specialized synchronization and data-transfer instructions ensure that the original program order is preserved, maximizing instructions per clock (IPC) with minimal overhead. This approach is designed to improve single-thread performance without increasing clock speeds or building wide, monolithic cores, both of which inflate power consumption and/or transistor budgets.
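The "enough parallel work" caveat is the crux. As a loose software analogy (an invented example, not anything from the patent), two sub-computations with no data dependence between them can be handed to separate workers and re-joined in program order, whereas a fully serial dependence chain cannot be split this way:

```python
from concurrent.futures import ThreadPoolExecutor

def serial(n):
    # One 'core' does everything, in program order.
    a = sum(i * i for i in range(n))   # sub-computation A
    b = sum(i * 3 for i in range(n))   # sub-computation B (independent of A)
    return a + b                       # join point

def split_across_cores(n):
    # A and B share no data, so a supercore-style scheduler could
    # hand each block to a different physical core and merge results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        fa = pool.submit(lambda: sum(i * i for i in range(n)))
        fb = pool.submit(lambda: sum(i * 3 for i in range(n)))
        return fa.result() + fb.result()  # re-join in program order

assert serial(1000) == split_across_cores(1000)
```

The real mechanism operates on instruction blocks rather than Python callables, but the same constraint applies: the speedup only exists where the compiler or JIT can find blocks that do not serialize on each other.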
Modern x86 CPU cores can decode 4–6 instructions per cycle and, once those instructions are translated into micro-ops, execute 8–9 micro-ops per cycle, which is where such processors reach their peak IPC. By contrast, Apple's custom Arm-based high-performance cores (e.g., Firestorm, Avalanche, Everest) can decode up to 8 instructions per cycle and then execute over 10 instructions per cycle under ideal conditions. This is why Apple's processors typically offer significantly higher single-threaded performance and lower power consumption than their x86 counterparts.
While it is technically possible to build an 8-way x86 CPU core (i.e., a superscalar x86 processor that can decode, issue, and retire up to 8 instructions per clock), in practice it has not been done because of front-end bottlenecks and diminishing performance returns against significant power and area costs. In fact, even modern x86 CPUs typically sustain only 2–4 IPC on general workloads, depending on the software. So, instead of building an 8-way x86 CPU core, Intel's SDC proposes pairing two or more 4-wide cores to cooperate as one large core in cases where it makes sense.
On the hardware side, each core in an SDC-enabled system includes a small dedicated hardware module that manages synchronization, register transfers, and memory ordering between paired cores. These modules utilize a reserved memory region — known as the wormhole address space — to coordinate live-in/live-out data and synchronization operations, ensuring that instructions from separate cores retire in the correct program order. The design supports both in-order and out-of-order cores, requiring minimal changes to the existing execution engine, which results in a compact design in terms of die space.
On the software side, the system uses a JIT compiler, a static compiler, or binary instrumentation to split a single-threaded program into code segments and assign different blocks to different cores. It injects special instructions for flow control, register passing, and synchronization, enabling the hardware to maintain execution integrity. Operating system support is also crucial: the OS dynamically decides when to migrate a thread into or out of supercore mode based on runtime conditions, balancing performance against core availability.
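None of this hardware exists yet, but the handoff pattern can be sketched with ordinary threads. In the toy model below (all names are invented for illustration and do not come from the patent), a shared dictionary stands in for the wormhole address space and an event stands in for an injected synchronization instruction: the first block publishes its live-out value, and the second block stalls until that live-in arrives, so the result matches serial program order.

```python
import threading

def run_as_supercore(x):
    """Toy model: split one serial computation across two 'cores'."""
    wormhole = {}                      # stands in for the reserved wormhole address space
    live_in_ready = threading.Event()  # stands in for an injected sync instruction
    result = {}

    def block_a():
        # First half of the original program: y = x * 3 + 1
        y = x * 3 + 1
        wormhole["y"] = y              # publish the live-out register value
        live_in_ready.set()            # signal the consumer core

    def block_b():
        live_in_ready.wait()           # stall until the live-in data arrives
        y = wormhole["y"]              # consume the live-in
        result["out"] = y * y - 2      # second half of the program

    a = threading.Thread(target=block_a)
    b = threading.Thread(target=block_b)
    b.start(); a.start()
    a.join(); b.join()
    return result["out"]

print(run_as_supercore(4))  # same answer as the serial version: (4*3+1)**2 - 2 = 167
```

In the patent's scheme this transfer is handled by the per-core hardware modules rather than by locks, which is what keeps the overhead of crossing between cores low enough to be worthwhile.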
Intel's patent does not provide exact numerical performance gain estimates, but it implies that in select scenarios, it is realistic to expect the performance of two 'narrow' cores to approach the performance of a 'wide' core.

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.
-
bit_user Intel bought a startup promoting this concept, Soft Machines, almost a decade ago. They called it "VISC".
https://www.blopeur.com/2021/10/30/Intel-VISC-Processor-Architecture-Patent.html
The article said:
"Modern x86 CPU cores can decode 4–6 instructions and then execute 8-9 micro-ops per cycle after instructions are decoded into micro-ops"
Those numbers are rather dated. Zen 4 had only 4-wide decode. Zen 5 gives us 2x 4-wide decoders per core, but they're per-thread (meaning one is idle when only a single thread is using a core).
From Golden to Redwood Cove (Alder Lake to Meteor Lake) Intel did 6-wide decode. Lion Cove increased it to 8 (source: https://chipsandcheese.com/p/lion-cove-intels-p-core-roars ). But, these numbers can be deceiving. Intel's P-cores usually distinguish between "simple" and "complex" instructions, with only a couple complex decode slots and the rest being limited to "simple" instructions.
Gracemont had 2x 3-wide decoders, which Skymont boosted to 3x3. In some interviews, Intel has stated that its decoders almost never saturate, but that adding another 3-wide decode block was simply the easiest way to add frontend bandwidth (keeping in mind that Skymont has no separate micro-op cache). Would be interesting to see how Skymont's actual decode throughput compares with different P-cores, on the same instruction streams of varying types and complexity. For sure, Skymont is not decoding 9 instructions per cycle, in practice.
https://chipsandcheese.com/p/intel-details-skymont
The article said:
"Apple's custom Arm-based high-performance cores (e.g., Firestorm, Avalanche, Everest) can decode up to 8 instructions per cycle"
Apple's P-cores now have 10-wide decode, in the M4.
https://www.techpowerup.com/322195/apple-introduces-the-m4-chip
Arm's Cortex-X925 also features 10-wide decode.
The article said:
"In fact, even modern x86 CPUs can typically hit 2–4 sustained IPC on general workloads, depending on software."
Chips & Cheese has been looking at this. Here's Zen 5 on gaming + an assortment of non-gaming workloads:
Source: https://chipsandcheese.com/p/running-gaming-workloads-through
Here's Lion Cove:
Source: https://chipsandcheese.com/p/intels-lion-cove-p-core-and-gaming
For non-gaming, I'd characterize most of the cases on Zen 5 as 2-4 IPC, while the center of Lion Cove's distribution is a little more in the 2-3 range. It's annoying that the charts aren't both scaled to 6 IPC on the X-axis. -
dalek1234 I think that if this was possible and actually worked where single-threaded performance was improved, somebody would have implemented it by now.
Software is always much slower than baking that functionality into the silicon. Jim Keller worked on the concept described in this article, but at the hardware level. It was called Rentable Units, where multiple physical cores could switch to behave like a large single core, greatly improving single-threaded performance. Jim Keller never completed the project though. He left Intel after two years because it sucked working for Intel. Intel did continue the project, but Pat Gelsinger cancelled it, citing 'cost'.
So Intel's solution is to now do it in software. Well, good luck with that. Maybe they are just patenting their inventions now so that they can get more money out of them when they sell them. That's what BlackBerry did to raise money, back in the day: sell their IP. -
hotaru251 honestly I doubt it'll happen but I'd love for it to work that way... imagine a TR system using all that power for single thread performance.... -
bit_user My take on this is basically that it's a more efficient way to exploit coarse-level ILP (Instruction-Level Parallelism) than continuing to double-down on ever deeper and wider cores. The scheduling logic needed to keep ever larger cores fed just doesn't scale terribly well, especially with respect to the real-world gains achieved.
What I find particularly intriguing is to look at this (let's call it VISC, for lack of a better term) in conjunction with SMT. It'd be really interesting to use VISC to partition the scheduling problem, but then still execute multiple of these nano-threads on the same physical core, with the same shared backend resources. -
bit_user
dalek1234 said:
"I think that if this was possible and actually worked where single-threaded performance was improved, somebody would have implemented it by now."
It's a hard problem and requires a lot of work on both the hardware and software end of things, in order to make it work. As long as conventional approaches for scaling performance have continued to deliver gains, I think implementing such a complex solution couldn't be justified.
hotaru251 said:
"imagine a TR system using all that power for single thread performance...."
I'm sure it's not infinitely flexible. They must limit it to just the cores which share a cluster, like how the current E-cores are arranged in clusters of 4. I don't imagine you'd ever have more than 4-way scalability on this, and perhaps limited to only 2. -
Chiller2U Not to be picky, but Intel did not "patent" this...yet. The link is to a filed and published application, which has not yet been granted.