Bottlenecks & Render Times
Before we get to the actual data analysis, let’s take another look at the serialized command stream under DirectX 11 and its predecessors. The graph shows that the load across the CPU cores isn’t anywhere near equal, and that it takes a long time until the finished frame actually reaches the screen, which the log records as Present Time. It’s also striking how much of that time is consumed by the driver and the CPU.
The picture looks very different under DirectX 12. The load is distributed far more evenly across the CPU cores, which means that more of the tasks are completed in parallel. The time spent in the driver is now spread across all of the threads. The process doesn’t necessarily consume less total processor time, but the time to reach Present is cut down significantly because parts of the workload execute in parallel.
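The difference between the two submission models can be illustrated with a toy calculation. All of the numbers below are hypothetical, not measurements: the point is only that the total CPU work per frame stays the same, while spreading it across worker threads shortens the wall-clock time until Present.

```python
import math

# Hypothetical per-frame workload (illustrative numbers only):
BATCH_COST_MS = 2.0   # CPU cost to record one command batch
BATCHES = 8           # command batches per frame

def serial_present_ms():
    """DirectX 11 style: a single thread records every batch in sequence."""
    return BATCHES * BATCH_COST_MS

def parallel_present_ms(threads):
    """DirectX 12 style: batches are recorded on several threads.
    The wall-clock time until Present shrinks; the total CPU time does not."""
    return math.ceil(BATCHES / threads) * BATCH_COST_MS

total_cpu_ms = BATCHES * BATCH_COST_MS   # identical under both models
print(serial_present_ms())      # 16.0 ms until Present on one thread
print(parallel_present_ms(4))   # 4.0 ms until Present on four threads
```

Note that `total_cpu_ms` is 16 ms either way; only the time until the frame can be presented changes.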
Batches (GPU Commands)
We start by comparing the number of GPU commands (batches) per rendered frame. It turns out that this figure isn’t all that different across CPUs, or even between manufacturers.
The analysis above is just a starting point. Next, we need to examine how the CPU’s load is distributed by analyzing the percentage of time that the CPU is in use. Even with good parallelization, this metric still mirrors gaming performance: fast graphics cards produce more frames in a given period of time than slower ones, which means that the CPU has to complete more tasks in that same period as well.
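That relationship between frame rate and CPU load can be written as a simple formula. The function and its numbers below are a hypothetical model for illustration, not measured data:

```python
def cpu_utilization(cpu_ms_per_frame, fps, cores=1):
    """Fraction of available CPU time consumed per second, assuming a
    fixed per-frame CPU cost (hypothetical model, not measured data)."""
    return (cpu_ms_per_frame * fps) / (1000.0 * cores)

# A faster card that doubles the frame rate doubles the CPU load, too.
print(cpu_utilization(5.0, 60))    # 0.3 -> 30% of one core
print(cpu_utilization(5.0, 120))   # 0.6 -> 60% of one core
```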
So, at which point does the CPU become a bottleneck? Either too few cores are actually working on the tasks to get them done on time due to a lack of parallelization, or there simply isn’t enough computing power available even with all cores working in parallel. The latter means that the CPU isn’t fast enough to keep up with the GPU, even when working at maximum capacity.
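Both failure modes can be captured in one toy model. The split into serial and parallelizable work, and all of the timings, are assumptions for illustration; the frame rate is simply limited by whichever of the CPU and GPU finishes last.

```python
def effective_fps(serial_ms, parallel_ms, cores, gpu_ms):
    """Toy model of the two CPU bottlenecks (hypothetical numbers):
    serial_ms   -- CPU work that cannot be spread across cores
    parallel_ms -- CPU work that scales with the core count
    gpu_ms      -- time the GPU needs to render the frame"""
    cpu_ms = serial_ms + parallel_ms / cores
    frame_ms = max(cpu_ms, gpu_ms)   # the slower side sets the pace
    return 1000.0 / frame_ms

# Bottleneck 1: poor parallelization leaves most cores idle (CPU-bound).
print(effective_fps(2.0, 12.0, cores=1, gpu_ms=8.0))   # ~71 FPS

# With four cores, the GPU becomes the limit instead.
print(effective_fps(2.0, 12.0, cores=4, gpu_ms=8.0))   # 125.0 FPS

# Bottleneck 2: too much serial work -- CPU-bound even with all cores busy.
print(effective_fps(8.0, 12.0, cores=4, gpu_ms=8.0))   # ~91 FPS
```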
There are other potential CPU-related bottlenecks as well. For instance, if a lot of data is worked on in parallel, and then needs to be shuffled back and forth between other subsystems, some other interface could slow you down.
The more the GPU is kept busy, the easier the CPU’s job gets. This effect is easy to see when comparing different resolutions. AMD’s Radeon R9 Fury X bucks the trend in a good way, though. It’s a fast graphics card, but doesn’t torture the CPU as much as some of the other cards in its segment. Clearly, parallelization improved markedly between AMD’s Hawaii and Fiji architectures.
Again, a lot of a frame’s overall render time is consumed by the driver. AMD loses a lot more time this way than Nvidia. This phenomenon could just be blamed on driver overhead, but then again, running many asynchronous tasks does create some additional work.
Present Time is the time needed to actually output the finished frame to the viewer; consequently, it’s the last step. The overall time varies quite a bit depending on the resolution. This is due both to the resolution’s influence on the CPU bottlenecks that we discussed above and to the additional GPU load at higher resolutions.
If the GPU is the bottleneck, then the Present Time is very short, since the CPU has already had plenty of time to finish its computations.
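One way to read that observation as a formula: Present is the CPU’s final step in a frame, so any CPU work that hasn’t finished by the time the GPU needs the frame spills into the Present step. The function, the fixed base cost, and all timings below are assumptions for illustration, not how the logging tool actually measures.

```python
def present_time_ms(cpu_ms, gpu_ms, base_ms=0.3):
    """Hypothetical model: if the GPU is the bottleneck (gpu_ms > cpu_ms),
    the CPU finished long ago and only a small fixed cost remains at
    Present; if the CPU is the bottleneck, its leftover work spills
    into the Present step."""
    return base_ms + max(0.0, cpu_ms - gpu_ms)

print(present_time_ms(cpu_ms=6.0, gpu_ms=12.0))   # GPU-bound: only the base cost
print(present_time_ms(cpu_ms=12.0, gpu_ms=6.0))   # CPU-bound: Present grows
```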
Render Time Bottom Line
AMD is the clear winner with its current graphics cards. Real parallelization and asynchronous task execution are just better than splitting up the tasks via a software-based solution. It’s not entirely clear just how much of a difference this will make in real-life gaming scenarios, but it's certainly a technology that could see more use in the future.