Once you’ve pored over Nvidia’s documentation, it’s hard to resist getting your hands a little dirty. After all, what better way is there of judging an API than trying to write a little program using it? That’s where most of the problems come to the surface, even if everything looks perfect on paper. It’s also the best way to see if you’ve assimilated all the concepts described in the CUDA documentation.
And it’s actually quite easy to dive into such a project, since a lot of high-quality free tools are available. For this test we used Visual C++ Express 2005, which had everything we needed. The hardest part was finding a program simple enough to port to the GPU without spending weeks on it, yet interesting enough to make the adventure worthwhile. We ended up choosing a code snippet we already had that takes a height map and calculates the corresponding normal map. We won’t go into the details of the function, which isn’t of much interest in itself at this point. In brief, we’re dealing with a convolution: for each pixel of the source image, we apply a matrix to the adjacent pixels and use a more or less complicated formula to determine the color of the corresponding pixel in the generated image. The advantage of this function is that it’s very easily parallelizable, since each output pixel can be computed independently of all the others, making it an ideal test of what CUDA is capable of.
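The function itself isn’t reproduced here, but a CPU version of this kind of convolution might look roughly like the sketch below. It is an illustration under stated assumptions, not the code used for the test: it assumes a 3×3 Sobel filter as the matrix, a height map stored as one float per pixel in row-major order, and clamping at the image borders. The names `Rgb`, `heightAt`, `encode`, `heightToNormalMap` and the `strength` parameter are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pixel type for the generated normal map.
struct Rgb {
    std::uint8_t r, g, b;
};

// Read one height sample, clamping the coordinates to the image borders
// (an assumption; the original border handling is not described).
static float heightAt(const std::vector<float>& height, int width, int rows,
                      int x, int y) {
    x = std::max(0, std::min(x, width - 1));
    y = std::max(0, std::min(y, rows - 1));
    return height[static_cast<std::size_t>(y) * width + x];
}

// Map a normal component in [-1, 1] to a color channel in [0, 255].
static std::uint8_t encode(float c) {
    return static_cast<std::uint8_t>(std::lround((c * 0.5f + 0.5f) * 255.0f));
}

// Convert a height map to a normal map with a 3x3 Sobel filter.
// Each output pixel depends only on the input image, never on another
// output pixel, which is why the two loops are easy to parallelize.
std::vector<Rgb> heightToNormalMap(const std::vector<float>& height,
                                   int width, int rows, float strength) {
    assert(width > 0 && rows > 0 &&
           height.size() == static_cast<std::size_t>(width) * rows);

    std::vector<Rgb> out(height.size());
    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < width; ++x) {
            // The eight neighbors of the current pixel.
            const float tl = heightAt(height, width, rows, x - 1, y - 1);
            const float t  = heightAt(height, width, rows, x,     y - 1);
            const float tr = heightAt(height, width, rows, x + 1, y - 1);
            const float l  = heightAt(height, width, rows, x - 1, y);
            const float r  = heightAt(height, width, rows, x + 1, y);
            const float bl = heightAt(height, width, rows, x - 1, y + 1);
            const float b  = heightAt(height, width, rows, x,     y + 1);
            const float br = heightAt(height, width, rows, x + 1, y + 1);

            // Sobel gradients along x and y.
            const float dx = (tr + 2.0f * r + br) - (tl + 2.0f * l + bl);
            const float dy = (bl + 2.0f * b + br) - (tl + 2.0f * t + tr);

            // Build the normal from the gradients and normalize it.
            const float nx = -dx * strength;
            const float ny = -dy * strength;
            const float nz = 1.0f;
            const float len = std::sqrt(nx * nx + ny * ny + nz * nz);

            out[static_cast<std::size_t>(y) * width + x] =
                Rgb{encode(nx / len), encode(ny / len), encode(nz / len)};
        }
    }
    return out;
}
```

In a GPU port of a function shaped like this, the body of the inner loop is the natural candidate for the per-thread work, with one thread per output pixel taking the place of the two loops.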
The other advantage is that we already had a CPU implementation against which we could easily compare the results of our CUDA version, which spared us, as programmers say, from having to reinvent the wheel. (When a programmer uses that expression, it means that the time saved can be spent more productively in exhaustive testing of a recent FPS game or close observation of an athletic contest via the medium of HDTV, and we’re no exception.)
We should repeat that the purpose of this test was to familiarize ourselves with the tools in the CUDA SDK, not to run a comparative benchmark between a CPU version and a GPU version. Since this was our first attempt at a CUDA program, we didn’t have high expectations about performance. And since this wasn’t a critical piece of code, the CPU version wasn’t all that optimized either, so a direct comparison of the results wouldn’t really be of interest.