However, we did decide to measure the processing time to see if there was any advantage to using CUDA even with our crude implementation, or on the other hand if it was going to take long, exhaustive practice to get any real control over the use of the GPU. The test machine was our development box – a laptop computer with a Core 2 Duo T5450 and a GeForce 8600M GT, operating under Vista. It’s far from being a supercomputer, but the results are interesting since our test is not all that favorable to the GPU. It’s fine for Nvidia to show us huge accelerations on systems equipped with monster GPUs and enormous bandwidth, but in practice many of the 70 million CUDA GPUs existing on current PCs are much less powerful, and so our test is quite germane.
The results we got are as follows for processing a 2048x2048 image:
- CPU 1 thread: 1419 ms
- CPU 2 threads: 749 ms
- CPU 4 threads: 593 ms
- GPU (8600M GT) blocks of 256 pixels: 109 ms
- GPU (8600M GT) blocks of 128 pixels: 94 ms
- GPU (8800 GTX) blocks of 128 pixels / 256 pixels: 31 ms
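The ratios between these timings are easy to check by hand. As a quick sketch (the dictionary labels are simply names chosen here, not identifiers from the test program), the following computes each configuration's speedup relative to the single-threaded CPU run and to the best CPU run:

```python
# Timings in milliseconds, copied from the list above (2048x2048 image).
timings_ms = {
    "CPU 1 thread": 1419,
    "CPU 2 threads": 749,
    "CPU 4 threads": 593,
    "8600M GT, 256-pixel blocks": 109,
    "8600M GT, 128-pixel blocks": 94,
    "8800 GTX": 31,
}


def speedup(baseline_ms, measured_ms):
    """Return how many times faster the measured run is than the baseline."""
    return baseline_ms / measured_ms


# Fastest of the CPU-only configurations.
best_cpu_ms = min(ms for name, ms in timings_ms.items() if name.startswith("CPU"))

for name, ms in timings_ms.items():
    vs_single = speedup(timings_ms["CPU 1 thread"], ms)
    vs_best_cpu = speedup(best_cpu_ms, ms)
    print(f"{name}: {vs_single:.2f}x vs 1 thread, {vs_best_cpu:.2f}x vs best CPU")
```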
Several observations can be made about these results. First of all you’ll notice that despite our crack about programmers’ laziness, we did modify the initial CPU implementation by threading it. As we said, the code is ideal for this situation – all you do is break down the initial image into as many zones as there are threads. Note that we got an almost linear acceleration going from one to two threads on our dual-core CPU, which shows the strongly parallel nature of our test program. Fairly unexpectedly, the four-thread version proved faster still, whereas we were expecting to see no difference at all on our processor, or even – and more logically – a slight loss of efficiency due to the cost of creating the additional threads. What explains that result? It’s hard to say, but the Windows thread scheduler may have something to do with it; in any case, the result was reproducible. With a smaller texture (512x512), the gain achieved by threading was a lot less marked (approximately 35% as opposed to 100%) and the behavior of the four-thread version was more logical, showing no gain over the two-thread version. The GPU was still faster, but less markedly so (the 8600M GT was three times faster than the two-thread version).
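The zone decomposition described above can be sketched as follows. This is only an illustration of the idea, not the test program's code: `split_rows`, `process_zone`, `process_image` and `pixel_op` are hypothetical names, the image is assumed to be a list of rows, and `pixel_op` stands in for whatever per-pixel work the real program does. Note also that CPython threads will not reproduce the timings above for pure-Python work because of the global interpreter lock; the point here is the partitioning.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(height, n_zones):
    """Split `height` rows into `n_zones` contiguous (start, end) bands.

    Leftover rows are given to the first bands, so sizes differ by at most one.
    """
    base, extra = divmod(height, n_zones)
    zones, start = [], 0
    for i in range(n_zones):
        end = start + base + (1 if i < extra else 0)
        zones.append((start, end))
        start = end
    return zones


def process_zone(image, out, zone, pixel_op):
    """Apply `pixel_op` to every pixel in rows [start, end) of `image`."""
    start, end = zone
    for y in range(start, end):
        out[y] = [pixel_op(p) for p in image[y]]


def process_image(image, n_threads, pixel_op):
    """Process the image with one zone per thread and return the new rows."""
    out = [None] * len(image)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(process_zone, image, out, zone, pixel_op)
            for zone in split_rows(len(image), n_threads)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return out
```

Each thread writes only to its own band of output rows, so no locking is needed; that independence is what makes this kind of workload scale almost linearly with the number of cores.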
The second notable observation is that even the slowest GPU implementation was more than five times faster than the best-performing CPU version (109 ms versus 593 ms). For a first program and a trivial version of the algorithm, that’s very encouraging. Notice also that we got significantly better results using smaller blocks, whereas intuitively you might think that the reverse would be true. The explanation is simple – our program uses 14 registers per thread, so with 256-thread blocks it needs 3,584 registers per block, and to saturate a multiprocessor it takes 768 threads, as we saw. In our case, that’s three blocks or 10,752 registers. But a multiprocessor has only 8,192 registers, so it can only keep two blocks active. Conversely, with blocks of 128 pixels, we need 1,792 registers per block; 8,192 divided by 1,792 and rounded down works out to four blocks being processed. In practice, the number of threads is the same (512 per multiprocessor, whereas theoretically it takes 768 to saturate it), but having more blocks gives the GPU additional flexibility with memory access – when an operation with a long latency is executed, it can launch execution of the instructions on another block while waiting for the results to be available. Four blocks would certainly mask the latency better, especially since our program makes several memory accesses.
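The register arithmetic in this paragraph can be reproduced with a few lines. This is a simplified sketch using only the limits quoted in the text (8,192 registers and 768 threads per multiprocessor); the function name `resident_blocks` is made up for illustration, and the model deliberately ignores other constraints a real GPU applies, such as shared memory usage, register allocation granularity and any cap on blocks per multiprocessor.

```python
def resident_blocks(regs_per_thread, threads_per_block,
                    regs_per_mp=8192, max_threads_per_mp=768):
    """Estimate how many blocks one multiprocessor can keep active.

    A block is only resident if all of its threads' registers fit, so the
    count is the smaller of the register limit and the thread limit.
    """
    regs_per_block = regs_per_thread * threads_per_block
    by_registers = regs_per_mp // regs_per_block          # integer division rounds down
    by_threads = max_threads_per_mp // threads_per_block
    return min(by_registers, by_threads)


for threads_per_block in (256, 128):
    blocks = resident_blocks(14, threads_per_block)
    print(f"{threads_per_block}-thread blocks: "
          f"{14 * threads_per_block} registers per block, "
          f"{blocks} active blocks, "
          f"{blocks * threads_per_block} active threads")
```

Both block sizes end up with 512 active threads per multiprocessor, but the 128-thread configuration spreads them over twice as many blocks, which is the extra scheduling flexibility the text refers to.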
CUDA software enables GPUs to do tasks normally reserved for CPUs. We look at how it works and its real and potential performance advantages.
Nvidia's CUDA: The End of the CPU?
Well if the technology was used just to play games yes, it would be crap tech, spending billions just so we can play quake doesnt make much sense ;)
Wow a gaming GFX into a serious work horse LMAO.
The Best thing that could happen is for M$ to release an API similar to DirectX for developers. That way both ATI and NVidia can support the API.
And no mention of OpenCL? I guess there's not a lot of details about it yet, but I find it surprising that you look to M$ for a unified API (who have no plans to do so that we know of), when Apple has already announced that they'll be releasing one next year. (unless I've totally misunderstood things...)
Im not gonna bother reading this article, I just thought the title was funny seeing as how Nvidia claims CUDA in NO way replaces the CPU and that is simply not their goal.
I'd like it better if DirectX wouldnt be used.
Anyways, NV wants to sell cuda, so why would they change to DX ,-)
I think the best way to go for MS is announce to support OpenCL like Apple. That way it will make things a lot easier for the developers and it makes MS look good to support the open standard.
Mr Roboto: Very interesting. I'm anxiously awaiting the RapiHD video encoder. Everyone knows how long it takes to encode a standard definition video, let alone an HD or multiple HD videos. If a 10x speedup can materialize from the CUDA API, lets just say it's more than welcome. I understand from the launch of the GTX280 and GTX260 that Nvidia has a broader outlook for the use of these GPU's. However I don't buy it fully especially when they cost so much to manufacture and use so much power. The GTX 280 has been reported as using upwards of 300w. That doesn't translate to that much money in electrical bills over a span of a year but never the less it's still moving backwards. Also don't expect the GTX series to come down in price anytime soon. The 8800GTX and it's 384 Bit bus is a prime example of how much these devices cost to make. Unless CUDA becomes standardized it's just another niche product fighting against other niche products from ATI and Intel. On the other hand though, I was reading on Anand Tech that Nvidia is sticking 4 of these cards (each with 4GB RAM) in a 1U formfactor using CUDA to create ultra cheap Super Computers. For the scientific community this may be just what they're looking for. Maybe I was misled into believing that these cards were for gaming and anything else would be an added benefit. With the price and power consumption this makes much more sense now. Agreed. Also I predict in a few years we will have a Linux distro that will run mostly on a GPU.