Hardware Point of View, Continued
This memory area provides a way for threads in the same block to communicate. It’s important to stress the restriction: all the threads in a given block are guaranteed to be executed by the same multiprocessor. Conversely, the assignment of blocks to the different multiprocessors is completely undefined, meaning that two threads from different blocks can’t communicate during their execution. That means that using this memory is complicated. But it can also be worthwhile, because except for cases where several threads try to access the same memory bank, causing a conflict; the rest of the time, access to shared memory is as fast as access to the registers.
The shared memory is not the only memory the multiprocessors can access. Obviously they can use the video memory, but it has lower bandwidth and higher latency. Consequently, to limit too-frequent access to this memory, Nvidia has also provided its multiprocessors with a cache (approximately 8 KB per multiprocessor) for access to constants and textures.
The multiprocessors also have 8,192 registers that are shared among all the threads of all the blocks active on that multiprocessor. The number of active blocks per multiprocessor can’t exceed eight, and the number of active warps are limited to 24 (768 threads). So, an 8800GTX can have up to 12,288 threads being processed at a given instant. It’s worth mentioning all these limits because it helps in dimensioning the algorithm as a function of the available resources.
Optimizing a CUDA program, then, essentially consists of striking the optimum balance between the number of blocks and their size – more threads per block will be useful in masking the latency of the memory operations, but at the same time the number of registers available per thread are reduced. What’s more, a block of 512 threads would be particularly inefficient, since only one block might be active on a multiprocessor, potentially wasting 256 threads. So, Nvidia advises using blocks of 128 to 256 threads, which offers the best compromise between masking latency and the number of registers needed for most kernels.