- Vive le GeForce FX!
- The advent of GPGPU
- The CUDA APIs
- A Few Definitions
- The Theory: CUDA from the Hardware Point of View
- Hardware Point of View, Continued
- The Theory: CUDA from the Software Point of View
- In Practice
- Conclusion, Continued
Meanwhile, as CPU makers were tearing their hair out trying to find a solution to their problems, GPU makers were continuing to benefit more than ever from the advantages of Moore’s Law.
Why weren’t they handicapped in the same way as their counterparts who design CPUs? The reason is very simple: CPUs are designed to get maximum performance from a stream of instructions that operates on diverse data (such as integers and floating-point values), performs random memory accesses, branches, and so on. Up to that point, architects had been working to extract more instruction-level parallelism – that is, to launch as many instructions as possible in parallel. The Pentium introduced superscalar execution, making it possible to issue two instructions per cycle under certain conditions, and the Pentium Pro ushered in out-of-order execution to make optimum use of the calculating units. The problem is that there’s a limit to the parallelism that can be extracted from a sequential stream of instructions, and consequently, blindly increasing the number of calculating units is useless, since they remain idle most of the time.
Conversely, the operation of a GPU is sublimely simple. The job consists of taking a group of polygons, on the one hand, and generating a group of pixels on the other. The polygons and pixels are independent of each other, and so can be processed by parallel units. That means that a GPU can afford to devote a large part of its die to calculating units which, unlike those of a CPU, will actually be used.
GPUs differ from CPUs in another way. Memory access in a GPU is extremely coherent – when a texel is read, a few cycles later the neighboring texel will be read, and when a pixel is written, a few cycles later a neighboring pixel will be written. By organizing memory intelligently, performance comes close to the theoretical bandwidth. That means that a GPU, unlike a CPU, doesn’t need an enormous cache, since its role is principally to accelerate texturing operations. A few kilobytes are all that’s needed to contain the few texels used in bilinear and trilinear filters.