AMD's upcoming RDNA 5 GPUs might improve dual-issue execution & use shader units more efficiently — LLVM patch adds new FMA instruction to ease compiling

(Image: AMD RDNA 3 GPU architecture deep dive. Credit: AMD)

The next generation of Radeon GPUs from AMD is expected to be a significant upgrade over RDNA 4, and one of the issues Team Red seems to be tackling is dual-issue execution: the GPU's ability to execute two instructions in the same cycle. AMD's cards have had this capability since RDNA 3, but strict pairing rules meant that compilers couldn't always take advantage of it, leaving much of the theoretical peak performance on the table. A new LLVM patch now suggests that AMD will be solving this on RDNA 5.

On a technical level, the existing system, known as VOPD, largely worked only with simpler two-operand instructions, which made it harder for compilers to find compatible instruction pairs to schedule. VOPD3 expands support to three-operand instructions, so it can handle operations like fused multiply-add (FMA). In fact, V_FMA_F32 was added in this very pull request, which is how we can infer it will land on RDNA 5.


This would allow dual-issue execution to happen more often, leading to a potentially massive increase in FP32 throughput in some cases. Shader units would spend fewer cycles waiting and more cycles doing work, making each instruction slot count. This could help in demanding scenarios such as rendering, and it means game engines will be able to optimize for the dual-issue VALU.

Reducing the number of cases where pairing fails due to restrictions is a key step toward making the hardware more efficient without brute-forcing IPC uplifts through extra silicon. FMA instructions also matter for neural rendering, so upscaling and frame-generation tech could get a boost here even if the hardware itself is no faster, since dual-issue execution improves efficiency regardless.

You can check out the Coelacanth's Dream article linked above if you're interested in more specifics, but be warned that it's very dense. Moreover, RDNA 5 is a ways out at this point, and more consumer-facing upgrades such as higher core counts would certainly make for more marketable headline features. Still, a GPU that reaches its advertised FP32 throughput more easily and more consistently is a big architectural win.


Hassam Nasir
Contributing Writer

Hassam Nasir is a die-hard hardware enthusiast with years of experience as a tech editor and writer, focusing on detailed CPU comparisons and general hardware news. When he’s not working, you’ll find him bending tubes for his ever-evolving custom water-loop gaming rig or benchmarking the latest CPUs and GPUs just for fun.

  • -Fran-
    Anything about fixing the chiplet design with RDNA5?

    Regards.
    Reply
  • Faiakes
    This sounds like driver issue.

    Couldn't this apply to rdna 4?
    Reply
  • bit_user
    I'd just point out a couple of things:
    Needing to handle VOPD3 adds more complexity to the decoders, which might not have been possible in RDNA 3 & 4, without a penalty of some sort (pipeline stages, clock speed, etc.)
    They had to beef up the X pipeline to handle more instruction types, which costs die area.
    Running more instructions in parallel increases energy usage and would've come in RDNA 3 & 4 at higher power and/or lower clock speeds.
    The point is that, while this may seem like "free performance", it's actually not. It's the sort of thing that makes sense to improve IPC, while moving to a smaller process node.

    It also strikes me as interesting that VOPD is basically the hint of a return to VLIW (although I wouldn't consider dual-issue to be "Very Long"). You're having the compiler explicitly schedule multiple execution resources. That's a first-order characteristic of VLIW.


    BTW, if anyone is curious, you can see the RDNA 4 ISA, documented here:
    https://docs.amd.com/v/u/en-US/rdna4-instruction-set-architecture
    Neither Nvidia nor Intel documents the ISA of their shader cores like that. Nvidia does document PTX, but that's a pseudo-assembly and not the actual hardware ISA.
    Reply
  • bit_user
    Faiakes said:
    This sounds like driver issue.
    No, this is lifting some hardware limitations.

    Faiakes said:
    Couldn't this apply to rdna 4?
No, the fact that they're reporting on LLVM (i.e. the open source compiler infrastructure for RDNA GPUs) is simply giving us a peek at how the ISA of RDNA 5 will differ from RDNA 4.

    The change they noticed is basically teaching the compiler about the upcoming hardware. Whether or not you do that doesn't change the hardware or its actual capabilities.
    Reply
  • Alpha_Lyrae
    In RT workloads, dual-issue came under vector register pressure, as RT tends to eat up VGPRs and RDNA3/4 had to secure two separate VGPRs for the dual-issue instruction, else hardware will simply refuse to launch it. So, if uArch is low on available VGPRs, dual-issue becomes a very hard ask.

    It looks like VOPD3 can map to the same input VGPRs for X/Y, conserving a resource that is increasingly coming under heavy pressure in wavefront queues. Ports could be double banked or there could be specialized tags to keep things organized.

    The other thing I see here is addition of signed and unsigned integer ops, which also allows for dual-issue I32/U32 for the first time; this is most likely the reason the CU is twice as wide as before, going from 64SPs to 128SPs, though INT32 could have been added to the extra FP32-only ALU, I think there were resource limitations (along with the mentioned execution limitations). Combining the WGP into a local CU means all of the caches and registers are now global to the CU instead of just the LDS and GDS. 2x integer ops will be important for neural rendering operations. If there's still a WGP, then this is also 2x wider at 256SPs or 8x SIMD32s. Or, there's a chance AMD also moved to SIMD64 with pseudo-SIMD32 support.

    Gfx1250 looks to be CDNA5, as it has support for dual-issue 64-bit instructions.
    Reply
  • bit_user
    Alpha_Lyrae said:
    In RT workloads, dual-issue came under vector register pressure,
    Are you talking specifically about RDNA 3? Chips & Cheese found that RDNA 4's dynamic register allocation was able to achieve full occupancy in RT workloads:

    Source: https://chipsandcheese.com/p/dynamic-register-allocation-on-amds
    Alpha_Lyrae said:
    as RT tends to eat up VGPRs and RDNA3/4 had to secure two separate VGPRs for the dual-issue instruction, else hardware will simply refuse to launch it.
    I think it's worse than that. From the RDNA 4 manual's description of VOPD:
    "The two operations must be independent of each other. This instruction has certain restrictions that must be met - hardware does not function correctly if they are not."
    Alpha_Lyrae said:
    there's a chance AMD also moved to SIMD64 with pseudo-SIMD32 support.
    Why would they go back to Wave64?
    Reply
  • usertests
    -Fran- said:
    Anything about fixing the chiplet design with RDNA5?

    Regards.
    RDNA5 could have single GCDs of different sizes shared between desktop cards, some of the laptop APUs for the first time, and Xbox Helix. With other things like memory controllers on a different chiplet.

    Reply
  • Alpha_Lyrae
    bit_user said:
    Why would they go back to Wave64?
    Wave64 is still supported. Wave32 is often used for vertex work items (and compute/async compute too), while wave64 is almost exclusively used for pixel work. So, SIMD64 would enable easy 1-cycle wave64 and double wave32 ops within SIMD64, each using SIMD32 in a pseudo-mode, much like the dual ALU did in RDNA3/4. This would remove most of the restrictions entirely. Wave32 can still be supported in SIMD64. For compatibility, you can require that 2 wave32 ops must be scheduled concurrently, which is essentially 2xFP32 and now 2xINT32.

    SIMD64 is also mostly beneficial for CDNA5, as prior compilers for CDNA4 and below based on gfx9 used 4xSIMD16 per CU or 4 cycles of gathering and executing SIMD16/SIMD16/SIMD16/SIMD16 (all the same instruction), and hoping an instruction doesn't branch. If CDNA5 is SIMD64, it could do the same op in 1-cycle with same throughput (greatly reduced execution latency, frees resources faster) or 4x instruction throughput in the same 4 cycles.

    There's a lot of latency hiding and deep parallel work queues in modern GPUs, so in isolation, 4x throughput per SIMD64 (vs SIMD16) and 2-4 SIMD64s per CU sounds great (CDNA lacks gfx engines, so CU can be physically wider than RDNA). In practice, you still need to ensure you can fully fill wavefront queues by not being locally resource limited. Dynamic allocation of registers is a good start, but there's more to be done on that front.
    Reply
  • bit_user
    Alpha_Lyrae said:
    Wave64 is still supported.
    Not by the new VOPD3 instructions. Probably not by other new features, either. It seems like they've been steadily trying to move away from it, because I think it just leads to poor utilization.

    Nvidia uses SIMD-32, as well. TBH, I don't know exactly where Intel is at, but I know they optionally extended their SIMD width in either Gen11 or Xe and might support 32 (or maybe it maxes out at 16).

    To the extent they are trying to drop Wave64 support, I think they really hurt themselves by not taking a clearer stance on its deprecation, though. It clearly seems to be there for backward-compatibility reasons, and IMO they ought to put a stake in the ground and say when it'll be dropped. I had thought maybe it's all about the consoles, but after reading about how Sony basically forked RDNA2, I'm not sure if anything in RDNA3+ still matters for them.

    Alpha_Lyrae said:
    There's a lot of latency hiding and deep parallel work queues in modern GPUs, so in isolation, 4x throughput per SIMD64 (vs SIMD16) and 2-4 SIMD64s per CU sounds great (CDNA lacks gfx engines, so CU can be physically wider than RDNA). In practice, you still need to ensure you can fully fill wavefront queues by not being locally resource limited. Dynamic allocation of registers is a good start, but there's more to be done on that front.
    Well, you seem to be touching on this, but juggling more thread contexts means you need larger physical register files. This is the cost of using SMT for latency-hiding. If SIMD-64 means more stalls, that translates into needing more SMT threads (i.e. wavefronts) and therefore more registers. That seems like a good argument against it.

    Anyway, I appreciate your insights. I mostly just read about this stuff - don't have much time to fiddle with it, myself. Thanks for taking the time to reply!
    : )
    Reply
  • beyondlogic
    -Fran- said:
    Anything about fixing the chiplet design with RDNA5?

    Regards.

    i personally think they have shelved that idea since rdna 3 was an absolute disaster.
    Reply