AMD Details Asynchronous Shaders In DirectX 12, Promises Performance Gains

AMD has been working closely with Microsoft on the upcoming DirectX 12 API, and it likes to show off every once in a while just how well its graphics cards will support some of the new features. One example is the so-called "Asynchronous Shaders," a different way of handling task queues than was possible in older graphics APIs, and one that is potentially much more efficient.

In DirectX 11, there are two primary ways of scheduling tasks synchronously: multi-threaded graphics, and multi-threaded graphics with pre-emption and prioritization, each with its own advantages and disadvantages.

Before we continue, we must clarify a couple of terms. The GPU's shaders draw the image, compute the game physics, handle post-processing and more, and they do this by being assigned various tasks. These tasks are delivered through the command stream, the main queue of tasks that the shaders need to execute. The command stream is generated by merging individual command queues, which consist of multiple tasks along with empty spaces, or gaps, between them.

These gaps exist because, in multi-threaded graphics, the tasks in a single queue aren't generated one right after another; a task in one queue sometimes depends on, and is only generated after, tasks in another queue. Because of these gaps, a single queue cannot utilize the shaders to their full potential.

Generally speaking, there are three kinds of command queues: the graphics queue, the compute queue, and the copy queue.
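
To make this concrete, here is a minimal sketch (not from AMD's or Microsoft's samples; device creation and error checking are omitted, and the device pointer is assumed to exist already) of how an application creates those three queue types through the DirectX 12 API:

    // Minimal sketch: creating the three DirectX 12 command queue types.
    // Assumes a valid ID3D12Device* already exists; link against d3d12.lib.
    #include <d3d12.h>
    #include <wrl/client.h>

    using Microsoft::WRL::ComPtr;

    void CreateQueues(ID3D12Device* device,
                      ComPtr<ID3D12CommandQueue>& graphicsQueue,
                      ComPtr<ID3D12CommandQueue>& computeQueue,
                      ComPtr<ID3D12CommandQueue>& copyQueue)
    {
        D3D12_COMMAND_QUEUE_DESC desc = {};

        // Graphics ("direct") queue: accepts draw, compute and copy commands.
        desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&graphicsQueue));

        // Compute queue: compute and copy commands only; on GCN hardware this is
        // the kind of work the ACEs schedule.
        desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

        // Copy queue: pure data transfers, such as streaming in textures.
        desc.Type = D3D12_COMMAND_LIST_TYPE_COPY;
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&copyQueue));
    }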

The simplest way to describe synchronous multi-threaded graphics is that the command queues are merged by switching between them at set time intervals: one queue feeds the main command stream briefly, then the next queue takes its turn, and so on. As a result, the gaps mentioned above remain in the merged command stream, meaning that the GPU will never run at 100 percent actual load. In addition, if an urgent task comes along, it must merge into the stream like any other and wait for the tasks ahead of it to finish executing. One way of picturing this is multiple lanes of traffic merging into a single lane at a traffic light.
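
As a purely illustrative toy (this is not real driver or API code), the time-sliced merge can be modeled like this, with a gap represented as an empty slot that is carried straight into the merged stream:

    // Toy model of synchronous, time-sliced merging: the queues are visited in
    // round-robin order, and any gap (empty slot) in a queue is copied into the
    // merged stream instead of being filled with work from another queue.
    #include <algorithm>
    #include <optional>
    #include <string>
    #include <vector>

    using Task  = std::optional<std::string>;  // std::nullopt models a gap
    using Queue = std::vector<Task>;

    std::vector<Task> MergeSynchronously(const std::vector<Queue>& queues)
    {
        std::vector<Task> stream;
        size_t longest = 0;
        for (const Queue& q : queues)
            longest = std::max(longest, q.size());

        for (size_t slot = 0; slot < longest; ++slot)   // one time slice per slot
            for (const Queue& q : queues)
                if (slot < q.size())
                    stream.push_back(q[slot]);          // gaps are preserved as-is

        return stream;
    }

Every gap that survives into the merged stream is a moment when the shaders sit idle.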

This led to the birth of pre-emption and prioritization, which works in exactly the same way as synchronous multi-threaded graphics, except that the merging point prioritizes urgent tasks. Furthermore, the main command stream can be paused to make way for an urgent task, allowing it to execute with the lowest latency possible. The catch, however, is that all the other tasks have to be paused, which can lead to performance issues due to the switching overhead. The issue with the gaps also remains, meaning there is still room for performance improvement. You can think of this as the same traffic junction as above, but one that makes way for emergency services.
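
DirectX 12 exposes a related notion of priority directly on its command queues. As a minimal sketch, again assuming the device already exists, a latency-critical workload (a VR timewarp pass, for example) could be given its own high-priority queue:

    // Minimal sketch of a prioritized queue in DirectX 12: a queue created with
    // D3D12_COMMAND_QUEUE_PRIORITY_HIGH is favored by the scheduler over
    // normal-priority queues.
    #include <d3d12.h>
    #include <wrl/client.h>

    Microsoft::WRL::ComPtr<ID3D12CommandQueue>
    CreateHighPriorityComputeQueue(ID3D12Device* device)
    {
        D3D12_COMMAND_QUEUE_DESC desc = {};
        desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
        desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;

        Microsoft::WRL::ComPtr<ID3D12CommandQueue> queue;
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
        return queue;
    }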

In DirectX 12, however, a new merging method called Asynchronous Shaders is available, which is essentially asynchronous multi-threaded graphics with pre-emption and prioritization. What happens here is that the ACEs (Asynchronous Compute Engines) on AMD's GCN-based GPUs interleave the tasks, filling the gaps in one queue with tasks from another, much like merging onto a flowing highway by slotting into gaps in traffic rather than waiting for anyone to move aside for you. Despite that, the hardware can still move the main command stream aside to let priority tasks pass when necessary. It probably goes without saying that this leads to a performance gain.
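
In application terms, asynchronous compute boils down to submitting graphics and compute work to separate queues and re-synchronizing only where a real dependency exists. The following is a rough sketch; the queues, command lists and fence are assumed to have been created elsewhere, and the names are purely illustrative:

    // Rough sketch of asynchronous compute in DirectX 12: compute work goes to its
    // own queue so the hardware (the ACEs on GCN) can overlap it with graphics work,
    // and a fence re-synchronizes the two queues only where needed.
    #include <d3d12.h>

    void SubmitFrame(ID3D12CommandQueue* gfxQueue,
                     ID3D12CommandQueue* computeQueue,
                     ID3D12GraphicsCommandList* gfxList,
                     ID3D12GraphicsCommandList* computeList,
                     ID3D12Fence* fence,
                     UINT64& fenceValue)
    {
        // Kick off compute work (post-processing, physics, etc.) on its own queue
        // and have that queue signal the fence when it is done.
        ID3D12CommandList* computeLists[] = { computeList };
        computeQueue->ExecuteCommandLists(1, computeLists);
        computeQueue->Signal(fence, ++fenceValue);

        // Independent graphics work runs concurrently on the direct queue.
        ID3D12CommandList* gfxLists[] = { gfxList };
        gfxQueue->ExecuteCommandLists(1, gfxLists);

        // GPU-side wait (the CPU is not blocked): anything submitted to the graphics
        // queue after this point will not start until the compute queue has signaled,
        // i.e. until the compute results are ready to be consumed.
        gfxQueue->Wait(fence, fenceValue);
    }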

On AMD's GCN GPUs, each ACE can handle up to eight queues, and each ACE can address its own share of the GPU's shaders. The most basic GPUs have just two ACEs, while the more elaborate parts carry eight, for up to 64 compute queues in total.

To put some numbers to this, AMD ran the LiquidVR SDK sample, which hit 245 FPS with both Asynchronous Shaders and post-processing disabled. With post-processing enabled, it dipped to 158 FPS, but with both Asynchronous Shaders and post-processing enabled, the framerate jumped back up to 230 FPS, nearly that of the original. Of course, this is probably a best-case scenario, but it means that you can essentially get post-processing effects almost for free.
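
A rough frame-time translation shows what that means: 245 FPS works out to about 4.1 ms per frame, 158 FPS to about 6.3 ms, and 230 FPS to about 4.3 ms. In other words, a post-processing pass that added roughly 2.2 ms per frame when run synchronously adds only around 0.3 ms when run asynchronously, because most of it hides in the gaps the graphics work leaves anyway.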

Anyway, the big reason Asynchronous Shaders are interesting is clearly the performance gain. Not only do they ensure that the gaps in the queue are filled for improved throughput, but due to the way the ACEs fill the queues and prioritize tasks, they also help with timing and latency. Basically, the increased parallelism and newly available headroom ensure that more frames make their way to the screen, and faster, which is especially interesting for purposes such as VR.

With VR demanding higher resolutions than we've seen before for a comfortable viewing experience, and higher framerates to reduce headaches, minimize nausea and increase immersion, we're at a point where we need more GPU power than ever before. Simply throwing more GPU power at the problem doesn't cut it anymore; GPUs need to become more powerful, yes, but we also need to use them more efficiently.

Is all of this AMD's work? Probably not, but the company is working closely with Microsoft to ensure the best support possible. During the briefing, the spokesman mentioned that he had seen no such information regarding support from AMD's competitors, but we know that Nvidia is always very "hush hush" about unannounced products. It should be noted, however, that asynchronous shading isn't exclusive to DirectX 12; it will also be part of the new Vulkan API as well as LiquidVR, and it already exists in AMD's Mantle.


Niels Broekhuijsen

Niels Broekhuijsen is a Contributing Writer for Tom's Hardware US. He reviews cases, water cooling and PC builds.

  • Calculatron
    I love the change in focus for working smarter, not harder.
    Reply
  • Larry Litmanen
    I read about AMD making amazing products for the last 10 years and somehow, someway they are always behind.
    Reply
  • basroil
    Is this a previously unmentioned DX12 api feature or is it just more AMD marketing renaming (like calling their SMT designs "cluster-multithread")? From the looks of it they are just talking about executeIndirect, which was shown on Intel (and later Nvidia) chips at GDC.
    Reply
  • bloc97
    Great, always glad to see free performance improvements.
    Reply
  • Shankovich
    Thanks for this push AMD. If you didn't make Mantle, this probably wouldn't be coming today.
    Reply
  • ykki
    So, AMD when can we get the 300 series (a specific date would be nice)?
    Reply
  • Sony clearly saw the future moving away from traditional raster-based rendering for GPU's as the hardware evolves towards general compute.

    XBOX ONE GPU-

    1.18 TF GPU (12 CUs) for games
    768 Shaders
    48 Texture units
    16 ROPS
    2 ACE/ 16 queues

    PLAYSTATION 4 GPU-

    1.84TF GPU (18 CUs) for games + 56%
    1152 Shaders +50%
    72 Texture units +50%
    32 ROPS + 100%
    8 ACE/64 queues +300%

    Here's how important ACE's are:

    http://www.dualshockers.com/2014/09/03/ps4s-powerful-async-compute-tech-allows-the-tomorrow-childrens-developer-to-save-5-ms-per-frame/?
    Reply
  • PaulBags
    So, why can't it do this anyway? If the shaders compute the data why does it matter what dx version? Why does dx or anything else need to see the process under the shaders and understand it to make use of it?
    Reply
  • basroil
    15584048 said:
    So, why can't it do this anyway? If the shaders compute the data why does it matter what dx version? Why does dx or anything else need to see the process under the shaders and understand it to make use of it?

    You have it backwards, the shader code itself needs to be in a format that can be compiled by the version of DX (or higher usually) it is targeting, and any functions used by that shader have to be valid for the DX version you target. It's no different than trying to run a DX11 game on an XP machine, it just won't work because it has invalid calls.

    If DX can't compile the shader code (shaders are compiled at run time usually), no compute, so DX needs to be able to read the shader code and make sure it's valid (and also compile it, no magical compilation).
    Reply
  • PaulBags
    15584157 said:
    15584048 said:
    So, why can't it do this anyway? If the shaders compute the data why does it matter what dx version? Why does dx or anything else need to see the process under the shaders and understand it to make use of it?

    You have it backwards, the shader code itself needs to be in a format that can be compiled by the version of DX (or higher usually) it is targeting, and any functions used by that shader have to be valid for the DX version you target. It's no different than trying to run a DX11 game on an XP machine, it just won't work because it has invalid calls.

    If DX can't compile the shader code (shaders are compiled at run time usually), no compute, so DX needs to be able to read the shader code and make sure it's valid (and also compile it, no magical compilation).
    Hmm, woosh, straight over my head. Sounds like I need to do some serious reading.
    Reply