OpenCL In Action: Post-Processing Apps, Accelerated

Benchmark Results: vReveal On The FX-8150 And Radeon HD 5870

Again, we're trying to compare the experience you get before and after applying DirectCompute/OpenCL acceleration. The trick is identifying the right environment to flag as normal. Is 480p DVD video still an acceptable baseline, or is 1080p the very least you'd accept? In an app like vReveal, should we expect people to only stabilize their videos, or will they take the kitchen-sink approach and apply every effect? Ultimately, we tested all four combinations of those two variables (resolution and number of effects), first with GPU-based acceleration enabled and then with it turned off.

We’re going to examine our data in terms of CPU utilization, measuring system impact, as well as render speed. Rather than indicate frames per second (which pegs at 30 and stays there, telling us very little), vReveal spits back a percentage of real-time at which a render job is operating. This is probably a more meaningful number to the average user. For instance, if a one-minute video clip is rendering at 50%, the render job takes two minutes to complete.
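To make the percentage-of-real-time figure concrete, here is a minimal sketch of the conversion to wall-clock render time. The function name and signature are our own for illustration and are not part of vReveal.

```python
def render_minutes(clip_minutes: float, percent_of_real_time: float) -> float:
    """Wall-clock minutes needed to render a clip at a given percentage of real-time.

    Hypothetical helper for illustration only; vReveal itself just reports the percentage.
    """
    if percent_of_real_time <= 0:
        raise ValueError("percent_of_real_time must be positive")
    # 100% means the render keeps pace with playback; 50% means it takes twice as long.
    return clip_minutes * 100.0 / percent_of_real_time


# The example from the text: a one-minute clip rendering at 50% of real-time.
print(render_minutes(1, 50))  # 2.0
```

The lower the percentage, the longer the job: halving the figure doubles the time you wait.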

Predictably, we’re going to see a lot of data showing that GPU acceleration takes a large strain off of the CPU, and 1080p video is much more demanding than 480p.

The Radeon HD 5870 benefiting from OpenCL-based acceleration handles a single image enhancement on 480p with ease, offloading almost the entire task from an FX processor. But don’t assume that this means the FX is impervious. Without GPU assistance, that one effect applied to the 1080p video via software chews up almost half of the chip's processing time. Enabling GPU acceleration brings utilization down to about 10%, validating claims of a 4x to 5x benefit from hardware-based compute.

Now, what does GPU-assist mean in terms of time and getting a job completed? With acceleration enabled and only one render effect active, even our 1080p clip is processed in real-time. Note that the 480p clip can render in real-time without hardware acceleration, but the 1080p clip cannot; in software mode, it requires almost twice as much time to complete.

Let’s crank up the dial and apply six render effects. Interestingly, there isn’t much difference between the 480p and 1080p clips when you use AMD's fastest CPU to manipulate them. Simply having six effects to chew on keeps the processor heavily loaded, regardless of resolution.

When we enable GPU acceleration on the Radeon HD 5870, the performance improvement is even more dramatic than our single-effect baseline. We measure an 11x difference in CPU utilization with the 480p clip and 6x for the 1080p. Even so, it's interesting that, with acceleration enabled and six effects applied, the 1080p workload is roughly twice as demanding on the CPU as 480p.

Although this test pegs our FX-8150 at nearly 70% utilization, you can almost render the 480p clip in real-time using software mode. The same cannot be said for the more demanding 1080p clip, which slows to 13% of real-time when AMD's FX is the only resource operating on it. That means processing takes about eight minutes for every one minute of video footage you feed through.

Enabling hardware acceleration brings us back up to 92% of real-time—a 7x gain under a worst-case load.
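The arithmetic behind those two figures can be checked with a short sketch. The 13% and 92% values come from the results above; the variable and function names are our own.

```python
def speed_gain(accelerated_pct: float, software_pct: float) -> float:
    """Ratio of two percent-of-real-time figures (illustrative helper, not a vReveal API)."""
    return accelerated_pct / software_pct


software_pct = 13.0  # 1080p, six effects, CPU only (from the text)
gpu_pct = 92.0       # same workload with GPU acceleration (from the text)

# Minutes of processing per minute of footage in software mode.
print(round(100.0 / software_pct, 1))               # 7.7, i.e. about eight minutes

# Render-speed gain from enabling hardware acceleration.
print(round(speed_gain(gpu_pct, software_pct), 1))  # 7.1, i.e. roughly 7x
```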

  • DjEaZy
    ... OpenCL FTW!!!
  • amuffin
    Will there be an OpenCL vs. CUDA article coming out anytime soon?
  • do I win a 7970 for OpenCl tasks?
  • deanjo
    DjEaZy wrote: "... OpenCL FTW!!!"
    You're welcome.

  • bit_user
    amuffin wrote: "Will there be an OpenCL vs. CUDA article coming out anytime soon?"
    At the core, they are very similar. I'm sure that Nvidia's toolchains for CUDA and OpenCL share a common backend, at least. Any differences between versions of an app coded for CUDA vs. OpenCL will have a lot more to do with the amount of effort its developers spent optimizing it.
  • bit_user
    Fun fact: the president of Khronos (the industry consortium behind OpenCL, OpenGL, etc.) and chair of its OpenCL working group is an Nvidia VP.

    Here's a document paralleling the similarities between CUDA and OpenCL (it's an OpenCL Jump Start Guide for existing CUDA developers):

    NVIDIA OpenCL JumpStart Guide

    I think they tried to make sure that OpenCL would fit their existing technologies, in order to give them an edge on delivering better support, sooner.
  • deanjo
    bit_user wrote: "I think they tried to make sure that OpenCL would fit their existing technologies, in order to give them an edge on delivering better support, sooner."
    Well, Nvidia did work very closely with Apple during the development of OpenCL.
  • nevertell
    At last, an article to point to for people who love shoving a gtx 580 in the same box with a celeron.
  • JPForums
    In regards to testing the APU w/o discrete GPU you wrote:

    However, the performance chart tells the second half of the story. Pushing CPU usage down is great at 480p, where host processing and graphics working together manage real-time rendering of six effects. But at 1080p, the two subsystems are collaboratively stuck at 29% of real-time. That's less than half of what the Radeon HD 5870 was able to do matched up to AMD's APU. For serious compute workloads, the sheer complexity of a discrete GPU is undeniably superior.

    While the discrete GPU is superior, the architecture isn't all that different. I suspect, the larger issue in regards to performance was stated in the interview earlier:

    TH: Specifically, what aspects of your software wouldn’t be possible without GPU-based acceleration?

    NB: are also solving a bandwidth bottleneck problem. ... It’s a very memory- or bandwidth-intensive problem to even a larger degree than it is a compute-bound problem. ... It’s almost an order of magnitude difference between the memory bandwidth on these two devices.

    APUs may be bottlenecked simply because they have to share CPU level memory bandwidth.

    While the APU's memory bandwidth will never approach that of a discrete card, I am curious to see whether overclocking an APU's memory will make a noticeable difference in performance. Intuition says that it will never approach a discrete card, and given the low-end compute performance, it may not make a difference at all. However, it would help to characterize the APU's performance balance a little better. That is, does it make sense to push more GPU muscle on an APU, or is the GPU portion constrained by the memory bandwidth?

    In any case, this is a great article. I look forward to the rest of the series.
  • What about power consumption? It's fine if we can lower CPU load, but not that much if the total power consumption increases.