The Architecture in Detail
A SIMT Architecture?
You’re familiar with the terms SIMD and MIMD, but with the GT200 Nvidia describes its Streaming Multiprocessors as "SIMT units." So what does that mean? The acronym stands for Single Instruction, Multiple Threads, and the main difference from SIMD is that the width of the vectors being processed is not fixed in advance. Concretely, given a sufficient number of threads, the processor behaves like a scalar processor from the programmer’s point of view. To grasp the difference, recall how pixel shader units operated in previous architectures.
The rasterizer generated quads – squares of 2×2 pixels, where each pixel is a vector of four single-precision floating-point values, (R, G, B, A) or (X, Y, Z, W), the formats most often used in 3D calculations. These quads were then sent to an ALU operating in 16-way SIMD mode, which applied the same instruction to all 16 floating-point numbers. This is a simplification for the purpose of illustrating the principle; in practice, GeForce 6 and 7 had a mode called co-issue for executing two instructions per vector.
Since the G80, this mode of operation has been reworked: the rasterizer still generates quads, which are placed in a buffer. When 8 quads (32 pixels – a "warp" in CUDA terminology) are present in the buffer, they can be executed by a multiprocessor in SIMD mode. So what’s the difference? It’s in how the data are organized: instead of working on four vectors of four floating-point numbers packed like this: (R, G, B, A, R, G, B, A, R, G, B, A, R, G, B, A), the multiprocessors work on vectors of 32 floating-point numbers, each made up of a single component from each of the 32 threads:
(R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R), then
(G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G, G), etc.
In SIMD programming, the first data layout is called AoS (Array of Structures) and the second SoA (Structure of Arrays). The second organization yields better performance: provided there is enough data to fill a vector, the processor behaves, from the programmer’s point of view, like a scalar processor, since the SIMD units are always used at 100% regardless of the width of the data being processed. Conversely, AoS reaches peak performance only when the same instruction is applied to all four components of every vector.