Turing Improves Performance in Today’s Games
Some enthusiasts have expressed concern that Turing-based cards don’t boast dramatically higher CUDA core counts than their previous-gen equivalents, and that the older boards even have higher rated GPU Boost frequencies. Nvidia didn’t help matters by failing to address generational gains in today’s games at its launch event in Cologne, Germany. But the company did put a lot of effort into rearchitecting Turing for better per-core performance.
To start, Turing borrows from the Volta playbook in its support for simultaneous execution of FP32 arithmetic instructions, which constitute most shader workloads, and INT32 operations (for addressing/fetching data, floating-point min/max, compare, etc.). When you hear about Turing cores achieving better performance than Pascal at a given clock rate, this capability largely explains why.
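To see what that looks like in practice, here’s a minimal CUDA sketch (our own illustration, not code from Nvidia): the index math and bounds check are INT32 work, while the fused multiply-add is FP32 work, and Turing can keep both instruction streams in flight instead of forcing them to take turns in one pipeline.

```cuda
// Hypothetical gather-and-blend loop of the sort found in shader code.
// The address math (tid, offset, the bounds compare) is INT32 work; the
// multiply-add is FP32 work. On Turing the two can issue concurrently.
__global__ void gatherBlend(const float* __restrict__ src,
                            const int* __restrict__ indices,
                            float* __restrict__ dst,
                            int count, float weight)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // INT32: index math
    if (tid >= count) return;                          // INT32: compare/branch

    int offset = indices[tid] * 4 + (tid & 3);         // INT32: addressing
    float sample = src[offset];                        // load via computed address
    dst[tid] = fmaf(sample, weight, dst[tid]);         // FP32: fused multiply-add
}
```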
In generations prior, a single math data path meant that dissimilar instruction types couldn’t execute at the same time, leaving the floating-point pipeline idle whenever a shader program needed non-FP operations. Volta sought to change this by creating separate pipelines. Even though Nvidia eliminated the second dispatch unit assigned to each warp scheduler, it claimed that instruction issue throughput actually rose.
How is that possible? It's all about the composition of each architecture's SM.
Check out the two block diagrams below. Pascal has one warp scheduler per quad, with each quad containing 32 CUDA cores. A quad's scheduler can issue one instruction pair per clock through its two dispatch units, with the stipulation that both instructions come from the same 32-thread warp and only one of them can be a core math instruction. Still, that's one dispatch unit per 16 CUDA cores.
In contrast, Turing packs fewer CUDA cores into an SM, and then spreads more SMs across each GPU. There's now one scheduler per 16 CUDA cores (2x Pascal), along with one dispatch unit per 16 CUDA cores (same as Pascal). Gone is the instruction-pairing constraint. And because Turing doubles up on schedulers, it only needs to issue an instruction to the CUDA cores every other cycle to keep them full (with 32 threads per warp, 16 CUDA cores take two cycles to consume them all). In between, it's free to issue a different instruction to any other unit, including the new INT32 pipeline. The new instruction can also be from any warp.
Turing's flexibility comes from having twice as many schedulers per CUDA core as Pascal, so that each one has fewer math units to feed per cycle, not from a more complicated design. The schedulers still issue one instruction per clock cycle; the architecture simply utilizes its resources better thanks to the improved balance throughout the SM.
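A quick bit of bookkeeping, using the SM configurations described above, shows why that every-other-cycle cadence falls out naturally:

```cuda
#include <cstdio>

// Host-side arithmetic only; numbers come from the SM descriptions above.
int main() {
    const int pascal_cores = 128, pascal_sched = 4;   // one scheduler per 32-core quad
    const int turing_cores = 64,  turing_sched = 4;   // one scheduler per 16 cores
    const int warp_threads = 32;

    printf("Cores per scheduler: Pascal %d, Turing %d\n",
           pascal_cores / pascal_sched, turing_cores / turing_sched);

    // A 32-thread warp on a 16-core group needs two cycles, so a Turing
    // scheduler only has to issue to the CUDA cores every other cycle; the
    // spare cycle can feed another unit (INT32, SFU, load/store, Tensor).
    printf("Cycles to drain one warp on Turing's 16-core group: %d\n",
           warp_threads / (turing_cores / turing_sched));
    return 0;
}
```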
According to Nvidia, the potential gains are significant. In a game like Battlefield 1, for every 100 floating-point instructions there are 50 non-FP instructions in the shader code. Other titles bias more heavily toward floating-point math, but on average, the company claims, 36 integer-pipeline instructions would have stalled the floating-point pipeline for every 100 FP instructions. Those now get offloaded to the INT32 cores.
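Back-of-the-envelope math (ours, not Nvidia's) puts an upper bound on what that offload is worth: with one shared pipe, all 136 instructions queue behind each other, while with separate pipes the 100 FP instructions set the pace.

```cuda
#include <cstdio>

// Best-case issue-slot arithmetic for the claimed 36-per-100 mix.
// Ignores dependencies and latency, so treat the result as an upper bound.
int main() {
    const int fp_instructions  = 100;
    const int int_instructions = 36;   // average Nvidia cites

    const int shared_pipe_slots   = fp_instructions + int_instructions; // 136
    const int separate_pipe_slots = fp_instructions;                    // 100

    printf("Theoretical speedup: %.2fx\n",
           (double)shared_pipe_slots / separate_pipe_slots);            // ~1.36x
    return 0;
}
```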
Although its block diagrams show separate FP32 and INT32 paths, Nvidia keeps things straightforward by describing each Turing SM as containing 64 CUDA cores. The Turing SM also comprises 16 load/store units, 16 special-function units, 256KB of register file space, 96KB of shared memory and L1 data cache, four texture units, eight Tensor cores, and one RT core.
On paper, an SM in the previous-generation GP102 appears more complex, sporting twice as many CUDA cores, load/store units, SFUs, and texture units, just as much register file capacity, and more cache. But remember that the new TU102 boasts as many as 72 SMs across the GPU, while GP102 topped out at 30. The result is a Turing-based flagship with 21% more CUDA cores and texture units than GeForce GTX 1080 Ti, plus far more SRAM for registers, shared memory, and L1 cache (not to mention 6MB of L2 cache, double GP102’s 3MB).
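For a sense of scale, here's how the totals work out, assuming the shipping configurations of 68 enabled SMs on RTX 2080 Ti and 28 on GTX 1080 Ti (figures not spelled out above):

```cuda
#include <cstdio>

// Where the 21% figure (and the SRAM advantage) comes from: fewer CUDA
// cores per SM, but far more SMs. Assumes 68 of TU102's 72 SMs enabled on
// RTX 2080 Ti and 28 of GP102's 30 SMs on GTX 1080 Ti.
int main() {
    const int turing_sms = 68, turing_cores_per_sm = 64;
    const int pascal_sms = 28, pascal_cores_per_sm = 128;
    const int regfile_kb_per_sm = 256;   // same on both architectures

    printf("CUDA cores: %d vs. %d\n",
           turing_sms * turing_cores_per_sm,    // 4352 (+21%)
           pascal_sms * pascal_cores_per_sm);   // 3584
    printf("Register file: %d KB vs. %d KB\n",
           turing_sms * regfile_kb_per_sm,      // 17408 KB
           pascal_sms * regfile_kb_per_sm);     // 7168 KB
    return 0;
}
```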
That increase in on-die memory plays another critical role in improving performance, as does the reorganization of the memory hierarchy. Consider the three different data memories: texture cache for textures, L1 cache for load/store data, and shared memory for compute workloads. As far back as Kepler, each SM had 48KB of read-only texture cache plus a 64KB shared memory/L1 cache. In Maxwell and Pascal, the L1 and texture caches were combined, leaving 96KB of shared memory on its own. Now, Turing combines all three into one shared, configurable 96KB pool.
The benefit of unification, of course, is that on-chip storage gets utilized whether a workload is optimized for L1 or for shared memory, rather than sitting idle as it might have before. Folding the L1 functionality into the shared memory block also puts it on a wider bus, doubling L1 cache hit bandwidth (at the TPC level, Pascal supports 64 bytes per clock, while Turing can do 128 bytes per clock). And because those 96KB can be configured as 64KB L1 and 32KB shared memory (or vice versa), L1 capacity can be 50% higher on a per-SM basis.
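That configurability is also exposed to developers: CUDA lets a kernel hint at which carveout it prefers. A minimal sketch, assuming a hypothetical kernel named myKernel that leans heavily on shared memory:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel() { /* hypothetical kernel that uses shared memory heavily */ }

int main() {
    // Ask the driver to prefer the larger shared-memory carveout for this
    // kernel. It's a hint, not a guarantee; on Turing it maps to the
    // 64KB shared / 32KB L1 split described above.
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);

    myKernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```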
Taken together, Nvidia claims, the redesigned math pipelines and memory architecture deliver a 50% performance uplift per CUDA core. To keep those data-hungry cores fed, Nvidia paired TU102 with GDDR6 memory and further optimized its traffic-reduction technologies (like delta color compression). Pit GeForce GTX 1080 Ti’s 11 Gb/s GDDR5X modules against RTX 2080 Ti’s 14 Gb/s GDDR6, both on an aggregate 352-bit bus, and you’re looking at a 27% higher data rate and peak bandwidth figure. Then, depending on the game, RTX 2080 Ti’s compression lets it avoid sending some of that data over the bus at all, pushing effective throughput up by additional double-digit percentages.
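The 27% figure falls straight out of the quoted per-pin data rates and bus width; a quick check:

```cuda
#include <cstdio>

// Peak bandwidth = per-pin data rate * bus width / 8 bits per byte.
int main() {
    const double bus_bits = 352.0;
    const double gddr5x_gbps_per_pin = 11.0;   // GTX 1080 Ti
    const double gddr6_gbps_per_pin  = 14.0;   // RTX 2080 Ti

    const double pascal_gbs = gddr5x_gbps_per_pin * bus_bits / 8.0;  // ~484 GB/s
    const double turing_gbs = gddr6_gbps_per_pin  * bus_bits / 8.0;  // ~616 GB/s

    printf("%.0f GB/s vs. %.0f GB/s (+%.0f%%)\n", turing_gbs, pascal_gbs,
           100.0 * turing_gbs / pascal_gbs - 100.0);                 // ~27%
    return 0;
}
```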