The Theory: CUDA from the Software Point of View
From a software point of view, CUDA consists of a set of extensions to the C language, which of course recalls BrookGPU, and a few specific API calls. Among the extensions are type qualifiers that apply to functions and variables. The keyword to remember here is __global__, which when prefixed to a function indicates that the latter is a kernel – that is, a function that will be called by the CPU and executed by the GPU. The __device__ keyword designates a function that will be executed by the GPU (which CUDA refers to as the “device”) but can only be called from the GPU (in other words, from another __device__ function or from a __global__ function). Finally, the __host__ keyword is optional, and designates a function that’s called by the CPU and executed by the CPU – in other words, a traditional function.
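As a minimal sketch of how these three qualifiers fit together (the function names here are hypothetical, chosen only for illustration):

```cuda
// Ordinary CPU function; __host__ is implicit and optional.
__host__ float scale(float x) { return 2.0f * x; }

// GPU-only helper: callable only from __global__ or other __device__ functions.
__device__ float square(float x) { return x * x; }

// Kernel: called by the CPU, executed by the GPU.
__global__ void squareAll(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = square(data[i]);
}
```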
There are a few restrictions associated with __device__ and __global__ functions: they can’t be recursive (that is, they can’t call themselves) and they can’t take a variable number of arguments. And since __device__ functions reside in the GPU’s memory space, it is, logically enough, impossible to obtain their address.
Variables also have new qualifiers, which control the memory area where they’ll be stored. A variable preceded by the keyword __shared__ will be stored in the streaming multiprocessors’ shared memory.
The way a __global__ function is called is also a little different, because the execution configuration has to be defined at the time of the call – more concretely, the size of the grid to which the kernel is applied and the size of each block. Take the example of a kernel with the following signature:
__global__ void Func(float* parameter);
which will be called as follows:
Func<<< Dg, Db >>>(parameter);
where Dg is the grid dimension and Db the dimension of a block. Both are of type dim3, a new vector type introduced by CUDA.
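A call site might therefore look like the following sketch (the grid and block sizes here are arbitrary, picked only for illustration):

```cuda
dim3 Dg(16, 16);   // a grid of 16 x 16 blocks
dim3 Db(8, 8);     // each block holds 8 x 8 threads

// Launch the kernel with this execution configuration.
Func<<< Dg, Db >>>(parameter);
```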
The CUDA API essentially comprises functions for memory manipulation in VRAM: cudaMalloc to allocate memory, cudaFree to free it, and cudaMemcpy to copy data between RAM and VRAM in either direction.
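A typical round trip through these three calls might look like this sketch (error checking is omitted, and the kernel launch is only indicated by a comment):

```cuda
float host[256];
float* device;
size_t bytes = sizeof(host);

cudaMalloc((void**)&device, bytes);                      // allocate VRAM
cudaMemcpy(device, host, bytes, cudaMemcpyHostToDevice); // RAM -> VRAM
// ... launch a kernel operating on 'device' here ...
cudaMemcpy(host, device, bytes, cudaMemcpyDeviceToHost); // VRAM -> RAM
cudaFree(device);                                        // release VRAM
```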
We’ll end this overview with the way a CUDA program is compiled, which is interesting. Compilation happens in several phases. First, the code dedicated to the CPU is extracted from the file and passed to the standard compiler. The code dedicated to the GPU is then converted into an intermediate language, PTX. This intermediate language resembles assembly, so the generated code can be studied for potential inefficiencies. Finally, the last phase translates the intermediate language into commands specific to the GPU and encapsulates them in binary form in the executable.
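For example, assuming a source file named kernel.cu, NVIDIA’s nvcc compiler driver performs these phases in one invocation, and its -ptx flag stops at the intermediate stage so the generated PTX can be inspected:

```shell
# Full compilation: host code goes to the standard C compiler,
# device code is lowered to PTX and then to GPU-specific binary.
nvcc kernel.cu -o program

# Stop at the intermediate language to study the generated PTX.
nvcc -ptx kernel.cu -o kernel.ptx
```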