Additional Reading: Geometry, Raster, ROP, And GPGPU
Geometry Shader And Raster Engine
After our little sidebar on tessellation, let’s continue the tour of Nvidia's geometry pipeline. The next stage consists of the Geometry Shader, which first appeared with Direct3D 10 and allows vertices to be added to or removed from primitives. We’re a far cry from T&L with our dear old GeForce (it’s been 10 years already).
This is an area where the new GeForce GTX 480 has evolved the most compared to the previous architecture, with no less than a 311% improvement in RightMark's Hyperlight.
Finally, the PolyMorph Engine performs the viewport transform and perspective correction calculations before passing the vertices and all of their attributes to the Raster Engine. The Raster Engine is made up of three main stages. First, the equations for the edges of the triangle are calculated, and the triangles that are not facing the camera are rejected. Then, the rasterizer generates the pixels (and the samples in the case of MSAA) covered by the triangle before passing all the data to a Z-Cull unit, now a familiar entity, which avoids performing pixel shading on hidden pixels through the use of a hierarchical Z-buffer.
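To make the edge-equation stage a little more concrete, here is a minimal CUDA sketch of the principle, with one thread evaluating the three edge functions of a single triangle for one pixel. This is our own illustration of the math, not GF100's hardware logic; the kernel name, counter-clockwise winding, and launch shape are all assumptions.

```cuda
// Illustrative edge-equation rasterizer: one thread per pixel.
// Assumes a counter-clockwise triangle; not GF100's actual hardware logic.
__device__ float edge(float ax, float ay, float bx, float by,
                      float px, float py) {
    // Signed area: positive when (px, py) lies to the left of edge a->b.
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

__global__ void rasterize(unsigned char* image, int w, int h,
                          float x0, float y0, float x1, float y1,
                          float x2, float y2) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float px = x + 0.5f, py = y + 0.5f;  // sample at the pixel center
    // A pixel is covered when it lies on the inner side of all three edges.
    bool inside = edge(x0, y0, x1, y1, px, py) >= 0.0f &&
                  edge(x1, y1, x2, y2, px, py) >= 0.0f &&
                  edge(x2, y2, x0, y0, px, py) >= 0.0f;
    image[y * w + x] = inside ? 255 : 0;
}
```

A launch such as rasterize<<<dim3((w + 15) / 16, (h + 15) / 16), dim3(16, 16)>>>(...) would fill a coverage mask; the hardware does the equivalent per tile, with Z-Cull pruning hidden pixels before any shading happens.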
Reworked ROP Units
As we saw earlier, Nvidia increased the number of ROPs, but it also made a few changes to them. The units' "graphics" performance hasn’t changed (one 32-bit integer pixel per clock cycle, one FP16 pixel over two clocks, or one FP32 pixel over four clocks), but Nvidia has greatly optimized atomic operations, that is, read-modify-write memory operations carried out as a single transaction that cannot be interrupted. This type of operation is extremely useful in parallel programming, where several threads may attempt to access the same resource. Nvidia claims very big gains: up to 20x for atomic operations to a single address and 7.5x for contiguous memory regions. In practice, though, these gains are probably more the result of the L2 cache than of any substantial modification of the ROP units.
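To see why this matters, here is a minimal CUDA sketch (our own example, not Nvidia's benchmark code): a byte histogram in which thousands of threads may hammer the same counter, relying on atomicAdd() to turn each update into a single, uninterruptible transaction.

```cuda
// Many threads updating shared counters: a classic use of atomics.
__global__ void histogram(const unsigned char* data, int n,
                          unsigned int* bins) {  // bins[256], zero-initialized
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);  // read-modify-write in one transaction
}
```

When many increments land on the same address, the updates serialize; that is exactly the case where Nvidia claims its 20x improvement.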
Nvidia also claims a substantial improvement to its compression algorithms, resulting in better efficiency with 8x anti-aliasing.
Finally, GF100 has a new 32x Coverage Sampling Anti-Aliasing (CSAA) mode, plus the ability to combine CSAA with transparency multi-sampling to improve anti-aliasing of transparent surfaces. An interesting detail: until now, the number of pixels generated by the setup engine and the number of ROP units increased together. GF100, however, has 48 ROP units while its rasterizers can generate only 32 pixels per cycle. That may seem like an odd choice at first, but in practice the latest games are usually played on high-end hardware with MSAA enabled, and they often use floating-point frame buffers, which impose a heavier workload on the ROP units and force them to spend several cycles per pixel. With FP16 blending at one pixel every two clocks, for example, 48 ROPs can sustain only 24 pixels per clock, fewer than the 32 the rasterizers produce. So the increase in the number of ROP units is justified, even if, in certain very simple rendering passes, they’ll be underutilized by the rasterizers.
GPGPU
It’s hard to discuss the GF100 without talking about GPGPU, since Nvidia focused on it so much during its presentations of the new architecture. When the company designed G80, the GPGPU market was in its infancy. Nvidia’s choices proved to be good ones, both from the point of view of hardware (with Shared Memory) and software (Compute Shaders and OpenCL have programming paradigms that are very close to CUDA). But it’s impossible to offer a perfect solution on the first try, and Nvidia continues to develop CUDA, adding support for double-precision and atomic instructions with GT200. But those were just incremental improvements. With its Fermi architecture, Nvidia was able to take advantage of all the expertise gained from several years of work with CUDA to offer a much more powerful solution.
The first point that has been greatly improved with GF100, and that will directly benefit GPGPU applications, is support for double-precision arithmetic. As we saw earlier, the addition of double-precision support to the GT200 really looked like a quick and dirty solution, added just to stake out the territory. With a single 64-bit unit for every eight 32-bit units, the GT200’s double-precision performance wasn’t really up to snuff. Without using dedicated units, AMD was even able to gain the advantage: its GPU loses "only" a factor of four in double precision, half the factor-of-eight hit on Nvidia's architecture.
But with GF100, Nvidia completely reworked the architecture, and the dedicated double-precision MAD unit has been done away with. Now, the same units handle single- and double-precision calculation, and performance is only halved in double precision. The impact on performance is thus much more reasonable, comparable to what’s observed on CPUs when using SSE. The advantage of using a GPU for this type of calculation should now be attractive enough to motivate programmers to do the necessary recoding.
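As a hypothetical illustration, here's a simple double-precision daxpy kernel. The code is identical on GT200 and GF100; only the throughput changes, from roughly 1/8 to 1/2 of the single-precision rate.

```cuda
// y = a * x + y in double precision; runs at half the SP rate on GF100.
__global__ void daxpy(int n, double a, const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];  // maps to a double-precision multiply-add
}
```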
While we’re on the subject of floating-point calculation, note that GF100 supports the most recent IEEE 754-2008 standard, with all of the required rounding modes and the fused multiply-add (FMA) instruction. FMA maintains full precision for the intermediate product and performs only a single rounding, unlike the classic multiply-add (MAD) instruction, which rounds twice. Note, however, that its direct competitor, AMD's RV870, is not really behind in this area, since it also supports the latest floating-point standard and the FMA instruction.
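A small sketch of the difference, using the standard CUDA math function fma() (the example itself is ours):

```cuda
// FMA versus classic multiply-add: the fused version rounds only once.
__global__ void compare(double a, double b, double c, double* out) {
    out[0] = a * b + c;     // may round twice (though nvcc can contract
                            // this into an FMA on its own)
    out[1] = fma(a, b, c);  // exact intermediate product, single rounding
}
```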
We described another advantage favoring Fermi in GPGPU environments on the previous page: its new memory hierarchy. In certain cases, a little scratchpad RAM can work miracles, but there are situations where nothing can replace a real cache. GF100 also has optional support for ECC, and all of the GPU’s internal memory (the L1 and L2 caches, plus shared memory) is protected as well.
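For reference, here's the classic scratchpad pattern in CUDA, sketched with an assumed block size of 256 threads: a per-block reduction whose intermediate traffic never leaves the chip.

```cuda
// Per-block sum using shared memory as an explicitly managed scratchpad.
// Assumes the kernel is launched with exactly 256 threads per block.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float scratch[256];  // fast on-chip memory
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    scratch[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    // Tree reduction carried out entirely within shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            scratch[tid] += scratch[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = scratch[0];  // one partial sum per block
}
```

Regular, predictable reuse like this suits the scratchpad; irregular, data-dependent access is where the new L1/L2 caches earn their keep.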
Nvidia corrected two bottlenecks that could slow the performance of its chips in GPGPU mode. Like a CPU, a GPU gives the illusion of executing several tasks in parallel by alternating between them, with each task being given a portion of the GPU time. The main difference between CPUs and GPUs is that, on the latter, switching tasks is extremely expensive. Nvidia grappled with this issue in designing GF100, optimizing these operations. Context-switch time is now down to less than 25 microseconds. With that degree of improvement, frequent inter-kernel communication is now feasible where it wasn’t before.
Another big improvement: until now, a GPU could execute only one kernel at a time across the entire chip. With large kernels that wasn’t a problem, and all resources were used. But with small kernels, part of the GPU could go unused. GF100 is now capable of executing several kernels in parallel (in practice, up to one per multiprocessor), which makes for more efficient use of the GPU, even with large kernels: as a kernel nears the end of its execution, it may no longer have enough blocks in flight to occupy the entire GPU.
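From the programmer's side, exploiting this simply means launching kernels into different CUDA streams (launches in the same stream stay serialized). A minimal sketch, with two hypothetical kernels too small to fill the GPU on their own:

```cuda
#include <cuda_runtime.h>

__global__ void kernel_a(float* d) { d[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f; }
__global__ void kernel_b(float* d) { d[blockIdx.x * blockDim.x + threadIdx.x] = 2.0f; }

int main() {
    float *da, *db;
    cudaMalloc(&da, 1024 * sizeof(float));
    cudaMalloc(&db, 1024 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Four blocks each: neither kernel fills the GPU alone, but on
    // Fermi-class hardware the two can execute side by side.
    kernel_a<<<4, 256, 0, s1>>>(da);
    kernel_b<<<4, 256, 0, s2>>>(db);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(da);
    cudaFree(db);
    return 0;
}
```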
Management of branching has been optimized, too, with support for predication. Predicated instructions execute both divergent portions of code, then keep only the results of the path actually taken. This avoids the cost of the branch instruction itself, which can be advantageous when the amount of code to be executed is small.
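The kind of branch that benefits looks like this (an illustrative sketch; whether predication is actually used remains the compiler's decision):

```cuda
// A short, divergent branch: a natural candidate for predication, where
// both assignments are issued under opposite predicates instead of a jump.
__global__ void sign_of(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (in[i] >= 0.0f)
            out[i] = 1.0f;   // kept when the predicate is true
        else
            out[i] = -1.0f;  // kept when the predicate is false
    }
}
```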
The final new feature is a unified memory space. Until now, PTX ISA 1.0 (the virtual instruction set to which CUDA programs are compiled) had three address spaces: a global space, visible device- and system-wide; a local space, private to each thread; and a shared space, common to all the threads in a given block. The target of a load/store instruction had to be determined at compile time, which made it difficult to fully implement pointers whose target can change dynamically at run time. With the PTX ISA 2.0 supported by GF100, a single address space is used, which, among other things, makes support for C++ programs possible: C++ objects rely heavily on pointers to implement virtual functions, whose behavior can change at run time depending on the dynamic type of the object.
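A minimal sketch of what the unified space enables, with hypothetical names: a single device function takes a plain float* and works whether the pointer refers to global or shared memory. Under PTX 1.x, the compiler had to pin down the address space of every load and store statically; with PTX 2.0's generic addressing, one code path suffices.

```cuda
// One function, any address space: generic loads/stores under PTX 2.0.
__device__ void scale(float* p, float f) {
    *p *= f;  // the hardware resolves the address space at run time
}

// Assumes a launch with exactly 32 threads per block.
__global__ void demo(float* gdata) {
    __shared__ float sdata[32];
    sdata[threadIdx.x] = gdata[threadIdx.x];
    __syncthreads();
    scale(&sdata[threadIdx.x], 2.0f);  // pointer into shared memory
    scale(&gdata[threadIdx.x], 0.5f);  // pointer into global memory
    __syncthreads();
    gdata[threadIdx.x] += sdata[threadIdx.x];
}
```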
Let’s give credit where credit is due. After several months of stagnation, the little world of 3D has finally awakened, and we now have two Direct3D 11 architectures, designed with quite different approaches.