OpenCL In Action: Post-Processing Apps, Accelerated

DirectCompute And OpenCL: The Not-So-Secret Sauces

On and off over the past couple of years, we've tried to evaluate AMD's efforts to promote Stream and Nvidia's implementations of CUDA. Those comparisons are tough to make, though, because proprietary technologies severely limit the number of components you can match up against each other. We were able to combine Stream, CUDA, and Quick Sync into one transcoding piece thanks to the diligent engineering of CyberLink and ArcSoft: Video Transcoding Examined: AMD, Intel, And Nvidia In-Depth.

Since then, AMD has moved away from a proprietary approach to general-purpose GPU (GPGPU) processing in favor of the industry-standard DirectCompute and OpenCL APIs. With these, developers can more easily take advantage of the GPU’s programmable logic to perform highly parallelized tasks faster, and often more efficiently, than an x86 CPU on its own. Such tasks often exist within graphics-intensive workloads, but developers are gradually expanding how GPUs, and now APUs, can be applied in other areas. In fact, APUs may turn out to be a more optimized solution because they feature silicon suited to both single-instruction, single-data (SISD) and single-instruction, multiple-data (SIMD) processing on the same die. Whereas applications used to stress one type of processing or the other, we’re now seeing increasingly graphical interfaces applied to structured data software, making a hybrid approach to processing more forward-looking. Nvidia, in comparison, is still pushing CUDA hard. But it isn't ignoring OpenCL; the company's drivers incorporate OpenCL 1.1 support.
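
To make the SISD/SIMD distinction concrete, here is a minimal Python sketch. It is purely illustrative: the function names, the brightness operation, and the four-wide lane count are our own assumptions, and plain Python only emulates the grouping that real SIMD hardware performs in a single step. It is not OpenCL or DirectCompute code.

```python
# Illustrative sketch only: names and the 4-wide lane count are assumptions,
# and Python merely emulates what SIMD hardware does in one step.
LANES = 4


def brighten_sisd(pixels, gain):
    """Scalar style: one operation is applied to one data item per step."""
    out = []
    for p in pixels:
        out.append(min(255, p * gain))
    return out


def brighten_simd(pixels, gain, lanes=LANES):
    """Data-parallel style: the same operation hits `lanes` items per step."""
    out = []
    steps = 0
    for i in range(0, len(pixels), lanes):
        group = pixels[i:i + lanes]          # one "vector" worth of data
        out.extend(min(255, p * gain) for p in group)
        steps += 1                           # one step per group, not per item
    return out, steps


pixels = [10, 60, 120, 200, 250, 30, 90, 140]
scalar = brighten_sisd(pixels, 2)
vector, steps = brighten_simd(pixels, 2)
print(scalar == vector, steps)
```

Both functions produce the same result. The difference is that the data-parallel version needs one step per group of pixels rather than one per pixel, which is why workloads such as video filters map so well onto hardware with many lanes.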

In late 2006, ATI began catering to developers who wanted to dive more deeply into SIMD, vector-based, highly parallel computing tasks. Soon, the ATI Stream SDK and Brook+ language started providing tools that let software vendors get, as ATI said, “closer to the metal” in graphics processors. But a broader, standards-oriented approach was needed. This is where the Windows DirectX API called DirectCompute and its counterpart from the Khronos Group, OpenCL, came into play. As with DirectX and OpenGL, Windows-based apps are likely to adopt DirectCompute while OpenCL has a more platform-agnostic design.

With standard APIs on the table, developers are finally comfortable adopting GPU/APU acceleration in ways that simply didn’t happen when AMD and Nvidia were each pursuing their own competing interests. To give you a taste of what’s on deck in this article series, we're going to be exploring graphics hardware-based acceleration in:

  • Video post-processing
  • Gaming
  • Personal smart cloud apps
  • Videoconferencing
  • Video editing
  • Media transcoding
  • Productivity and security software
  • Photography and facial recognition
  • Advanced user interface design

If AMD and Nvidia are to be believed, we should expect to see GPU/APU acceleration spread through a more diverse range of applications, introducing significant performance gains. Will more expensive graphics cards or more complex APUs deliver better results? Probably. Thousands of stream processors should naturally do more work in less time than hundreds. But even modest mainstream APUs should deliver quantifiable benefits.

Note that AMD’s architecture allows for the APU and certain discrete GPUs to work in tandem, much like CrossFire or SLI. So, it should be possible to start on a budget and scale up acceleration down the road. We don't really touch on this multi-GPU functionality here today, but we might as the series progresses.

  • DjEaZy
    ... OpenCL FTW!!!
  • amuffin
    Will there be an OpenCL vs CUDA article coming out anytime soon?
  • do I win a 7970 for OpenCl tasks?
  • deanjo
    DjEaZy wrote: "... OpenCL FTW!!!"
    You're welcome.

  • bit_user
    amuffin wrote: "Will there be an OpenCL vs CUDA article coming out anytime soon?"
    At the core, they are very similar. I'm sure that Nvidia's toolchains for CUDA and OpenCL share a common backend, at least. Any differences between versions of an app coded for CUDA vs OpenCL will have a lot more to do with the amount of effort its developers spent optimizing it.
  • bit_user
    Fun fact: The president of Khronos (the industry consortium behind OpenCL, OpenGL, etc.) and chair of its OpenCL working group is an Nvidia VP.

    Here's a document outlining the similarities between CUDA and OpenCL (it's an OpenCL Jump Start Guide for existing CUDA developers):

    NVIDIA OpenCL JumpStart Guide

    I think they tried to make sure that OpenCL would fit their existing technologies, in order to give them an edge on delivering better support, sooner.
  • deanjo
    bit_user wrote: "I think they tried to make sure that OpenCL would fit their existing technologies, in order to give them an edge on delivering better support, sooner."
    Well, Nvidia did work very closely with Apple during the development of OpenCL.
  • nevertell
    At last, an article to point to for people who love shoving a GTX 580 in the same box with a Celeron.
  • JPForums
    Regarding testing the APU without a discrete GPU, you wrote:

    However, the performance chart tells the second half of the story. Pushing CPU usage down is great at 480p, where host processing and graphics working together manage real-time rendering of six effects. But at 1080p, the two subsystems are collaboratively stuck at 29% of real-time. That's less than half of what the Radeon HD 5870 was able to do matched up to AMD's APU. For serious compute workloads, the sheer complexity of a discrete GPU is undeniably superior.

    While the discrete GPU is superior, the architecture isn't all that different. I suspect the larger performance issue was stated in the interview earlier:

    TH: Specifically, what aspects of your software wouldn’t be possible without GPU-based acceleration?

    NB: ... are also solving a bandwidth bottleneck problem. ... It’s a very memory- or bandwidth-intensive problem to even a larger degree than it is a compute-bound problem. ... It’s almost an order of magnitude difference between the memory bandwidth on these two devices.

    APUs may be bottlenecked simply because they have to share CPU level memory bandwidth.

    While the APU's memory bandwidth will never approach a discrete card's, I am curious to see whether overclocking the memory paired with an APU makes a noticeable difference in performance. Intuition says that it will never approach a discrete card, and given the low-end compute performance, it may not make a difference at all. However, it would help to characterize the APU's performance balance a little better, i.e., does it make sense to push more GPU muscle onto an APU, or is the GPU portion constrained by memory bandwidth?

    In any case, this is a great article. I look forward to the rest of the series.
  • What about power consumption? It's fine if we can lower CPU load, but less so if total power consumption increases.