Intel files patent for 'Software Defined Supercore' — increases single-thread performance and IPC by mimicking ultra-wide execution using multiple cores

Intel has patented a technology it calls 'Software Defined Supercore' (SDC) that enables software to fuse the capabilities of multiple cores to assemble a virtual ultra-wide 'supercore' capable of improving single-thread performance, provided that it has enough parallel work. If the technology works as it is designed to, then Intel's future CPUs could offer faster single-thread performance in select applications that can use SDC. For now, this is just a patent which may or may not become a reality.

Intel's Software Defined Supercore technology combines two or more physical CPU cores so that they cooperate as a single high-performance virtual core, dividing one thread's instruction stream into separate blocks and executing them in parallel. Each core runs a distinct portion of the program, while specialized synchronization and data-transfer instructions ensure that the original program order is preserved, maximizing instructions per clock (IPC) with minimal overhead. This approach is designed to improve single-thread performance without raising clock speeds or building wide, monolithic cores, which can increase power consumption and/or transistor budgets.
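
To make the idea concrete, here is a minimal software analogy in C. This is only a sketch, not Intel's mechanism: the patent describes hardware support, and the threads, flags, and names below are invented for illustration. One logical task is split into two blocks that run concurrently on two cores, and flag-based handoffs stand in for the patent's synchronization instructions, so results are combined only in original program order:

    /* Software analogy of the 'supercore' idea: one task split into two
     * blocks that run on two cores, with an ordered handoff of results. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    static int partial_a, partial_b;          /* each block's "live-out" values */
    static atomic_int a_done = 0, b_done = 0; /* stand-ins for sync instructions */

    static void *block_a(void *arg) {         /* first half of the original thread */
        (void)arg;
        int sum = 0;
        for (int i = 0; i < 4; i++) sum += data[i];
        partial_a = sum;
        atomic_store(&a_done, 1);             /* publish live-outs to the peer core */
        return NULL;
    }

    static void *block_b(void *arg) {         /* second half, runs concurrently */
        (void)arg;
        int sum = 0;
        for (int i = 4; i < 8; i++) sum += data[i];
        partial_b = sum;
        atomic_store(&b_done, 1);
        return NULL;
    }

    int main(void) {
        pthread_t ta, tb;
        pthread_create(&ta, NULL, block_a, NULL);
        pthread_create(&tb, NULL, block_b, NULL);
        /* "Retirement": results are consumed only in original program order. */
        while (!atomic_load(&a_done) || !atomic_load(&b_done))
            ;
        printf("sum = %d\n", partial_a + partial_b);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        return 0;
    }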

On the hardware side, each core in an SDC-enabled system includes a small dedicated hardware module that manages synchronization, register transfers, and memory ordering between paired cores. These modules use a reserved memory region, known as the wormhole address space, to coordinate live-in/live-out data and synchronization operations, ensuring that instructions from separate cores retire in the correct program order. The design supports both in-order and out-of-order cores and requires minimal changes to the existing execution engine, so the added logic occupies little die area.
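
The patent does not spell out the wormhole region's layout, but one plausible sketch (the field and function names here are invented) is a mailbox of sequence-numbered slots: one core publishes a live-out register value, its partner waits for it, and acquire/release ordering plays the role of the dedicated synchronization hardware:

    /* Hypothetical 'wormhole' mailbox slot: a reserved memory region that
     * paired cores use to pass live-in/live-out values in program order. */
    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct {
        _Atomic uint64_t seq;   /* sequence number enforcing program order */
        uint64_t         value; /* register value being transferred */
    } wormhole_slot;

    /* Producer core: publish the live-out register for step 'n'. */
    static void wormhole_send(wormhole_slot *s, uint64_t n, uint64_t reg) {
        s->value = reg;
        atomic_store_explicit(&s->seq, n, memory_order_release);
    }

    /* Consumer core: wait until the value for step 'n' is visible. */
    static uint64_t wormhole_recv(wormhole_slot *s, uint64_t n) {
        while (atomic_load_explicit(&s->seq, memory_order_acquire) < n)
            ;                   /* in hardware, a dedicated sync instruction */
        return s->value;
    }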

Intel's patent does not provide exact performance estimates, but it implies that, in select scenarios, it is realistic to expect two 'narrow' cores working together to approach the performance of one 'wide' core.
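
As a purely illustrative back-of-the-envelope calculation (the figures are assumptions, not from the patent): fusing two 4-wide cores gives a theoretical ceiling of 8 instructions per cycle; if synchronization and data transfers consume, say, 20% of that, the effective width is 8 × 0.8 = 6.4, closer to a single 8-wide core than to either 4-wide core on its own.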

Anton Shilov
Contributing Writer

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.

  • Thunder64
    This has been talked about for decades. I'll believe it when I see it.
  • bit_user
    Intel bought a startup promoting this concept, Soft Machines, almost a decade ago. They called it "VISC".
    https://www.blopeur.com/2021/10/30/Intel-VISC-Processor-Architecture-Patent.html
    The article said:
    Modern x86 CPU cores can decode 4–6 instructions and then execute 8–9 micro-ops per cycle after instructions are decoded into micro-ops
    Those numbers are rather dated. Zen 4 had only 4-wide decode. Zen 5 gives us 2x 4-wide decoders per core, but they're per-thread (meaning one is idle when only a single thread is using a core).

    From Golden Cove to Redwood Cove (Alder Lake through Meteor Lake), Intel did 6-wide decode. Lion Cove increased it to 8 (source: https://chipsandcheese.com/p/lion-cove-intels-p-core-roars ). But these numbers can be deceiving: Intel's P-cores usually distinguish between "simple" and "complex" instructions, with only a couple of complex decode slots and the rest limited to "simple" instructions.

    Gracemont had 2x 3-wide decoders, which Skymont boosted to 3x3. In some interviews, Intel has stated that its decoders almost never saturate, but that adding another 3-wide decode block was simply the easiest way to add frontend bandwidth (keeping in mind that Skymont has no separate micro-op cache). Would be interesting to see how Skymont's actual decode throughput compares with different P-cores, on the same instruction streams of varying types and complexity. For sure, Skymont is not decoding 9 instructions per cycle, in practice.
    https://chipsandcheese.com/p/intel-details-skymont
    The article said:
    Apple's custom Arm-based high-performance cores (e.g., Firestorm, Avalanche, Everest) can decode up to 8 instructions per cycle
    Apple's P-cores now have 10-wide decode, in the M4.
    https://www.techpowerup.com/322195/apple-introduces-the-m4-chip
    Arm's Cortex-X925 also features 10-wide decode.

    The article said:
    In fact, even modern x86 CPUs can typically hit 2–4 sustained IPC on general workloads, depending on software.
    Chips & Cheese has been looking at this. Here's Zen 5 on gaming + an assortment of non-gaming workloads:
    [IPC distribution chart]
    Source: https://chipsandcheese.com/p/running-gaming-workloads-through
    Here's Lion Cove:
    [IPC distribution chart]
    Source: https://chipsandcheese.com/p/intels-lion-cove-p-core-and-gaming
    For non-gaming, I'd characterize most of the cases on Zen 5 as 2–4 IPC, while the center of Lion Cove's distribution is a little more in the 2–3 range. It's annoying that the charts aren't both scaled to 6 IPC on the X-axis.
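
    For anyone curious how numbers like these are gathered, here's a rough sketch using the Linux perf_event_open syscall to count retired instructions and core cycles around a region of interest (the loop below is just a trivial stand-in for a real workload):

        /* Measure IPC of a code region via Linux hardware perf counters. */
        #define _GNU_SOURCE
        #include <linux/perf_event.h>
        #include <sys/ioctl.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <string.h>
        #include <stdio.h>

        static int open_counter(__u64 config) {
            struct perf_event_attr attr;
            memset(&attr, 0, sizeof(attr));
            attr.type = PERF_TYPE_HARDWARE;
            attr.size = sizeof(attr);
            attr.config = config;
            attr.disabled = 1;
            attr.exclude_kernel = 1;
            /* pid = 0, cpu = -1: count this thread on whatever CPU runs it */
            return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        }

        int main(void) {
            int instr = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
            int cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES);
            if (instr < 0 || cycles < 0) { perror("perf_event_open"); return 1; }

            ioctl(instr, PERF_EVENT_IOC_RESET, 0);
            ioctl(cycles, PERF_EVENT_IOC_RESET, 0);
            ioctl(instr, PERF_EVENT_IOC_ENABLE, 0);
            ioctl(cycles, PERF_EVENT_IOC_ENABLE, 0);

            volatile long sum = 0;               /* stand-in workload */
            for (long i = 0; i < 100000000; i++) sum += i;

            ioctl(instr, PERF_EVENT_IOC_DISABLE, 0);
            ioctl(cycles, PERF_EVENT_IOC_DISABLE, 0);

            long long n_instr = 0, n_cycles = 0;
            read(instr, &n_instr, sizeof(n_instr));
            read(cycles, &n_cycles, sizeof(n_cycles));
            printf("IPC = %.2f\n", (double)n_instr / (double)n_cycles);
            return 0;
        }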
  • dalek1234
    I think that if this were possible and actually worked to improve single-threaded performance, somebody would have implemented it by now.

    Software is always much slower than baking that functionality into the silicon. Jim Keller worked on the concept described in this article, but at the hardware level. It was called Rentable Units: multiple physical cores could switch to behave like one large single core, greatly improving single-threaded performance. Jim Keller never completed the project, though. He left Intel after two years because it sucked working for Intel. Intel did continue the project, but Pat Gelsinger cancelled it, citing 'cost'.

    So Intel's solution is now to do it in software. Well, good luck with that. Maybe they are just patenting their inventions now so that they can get more money out of them when they sell them. That's what BlackBerry did to raise money back in the day: sell their IP.
  • hotaru251
    honestly I doubt it'll happen, but I'd love for it to work that way... imagine a TR system using all that power for single-thread performance....
  • bit_user
    My take on this is basically that it's a more efficient way to exploit coarse-level ILP (Instruction-Level Parallelism) than continuing to double down on ever deeper and wider cores. The scheduling logic needed to keep ever-larger cores fed just doesn't scale terribly well, especially with respect to the real-world gains achieved.

    What I find particularly intriguing is to look at this (let's call it VISC, for lack of a better term) in conjunction with SMT. It'd be really interesting to use VISC to partition the scheduling problem, but then still execute multiple of these nano-threads on the same physical core, with the same shared backend resources.
  • bit_user
    dalek1234 said:
    I think that if this was possible and actually worked where single-threaded performance was improved, somebody would have implemented it by now.
    It's a hard problem, and making it work requires a lot of effort on both the hardware and software ends. As long as conventional approaches to scaling performance have continued to deliver gains, I think implementing such a complex solution couldn't be justified.

    hotaru251 said:
    imagine a TR system using all that pwoer for single thread performance....
    I'm sure it's not infinitely flexible. They must limit it to just the cores that share a cluster, like how the current E-cores are arranged in clusters of 4. I don't imagine you'd ever get more than 4-way scaling out of this, and it might be limited to only 2.
  • Chiller2U
    Not to be picky, but Intel did not "patent" this...yet. The link is to a filed and published application, which has not yet been granted.