OpenCL In Action: Post-Processing Apps, Accelerated

Tom's Hardware Q&A With MotionDSP

What seems like forever ago, this author had his first brush with MotionDSP on Tom’s Guide while trying to find out whether the photo/video analysis seen in movies is actually possible. Can someone take small, blurry footage and process it into a clear, recognizable face? MotionDSP’s vReveal is the consumer version of its Ikena software, which caters to vertical markets, including government agencies. The company’s algorithms can produce some rather amazing results (although there are limits), but this sort of post-processing takes a heavy toll on hardware. What should you expect when getting into this sort of processing? We talked to MotionDSP CTO Nik Bozinovic and CEO Sean Varah to find out.

TH: How did you guys get started with hardware-based acceleration?

SV: When we were starting five years ago, there wasn’t an easy way to program GPUs. Basically, as we were working on our algorithms as early as 2008, we realized that as video was moving from sub-SD to standard-def to HD, there would be an increased need for high-performance computing. Only with that can we process video in real time, which is absolutely mission-critical for our professional-grade product. So the need to use heterogeneous computing was very obvious long before it became a reality, before vendors like AMD and Nvidia started supporting it. It’s been on the road map for years, but it’s only been in the last 12 to 24 months that we’re finally seeing that promise of supercomputer-like performance from a GPU become a reality. Honestly, it’s really making a big impact on our bottom line.

MotionDSP's Sean Varah

TH: What are you doing with heterogeneous computing capabilities in your software?

SV: A couple things. One is...there are several simultaneous things that are wrong with video at any one time. It’s very rare that you have the perfect camera in perfect conditions, especially with consumers. There could be problems with resolution, noise, lighting, stabilization. So, in vReveal, we’ve packaged a series of different video filters that together can address these problems, and we’ve made the process incredibly simple by putting them all into an automatic, one-click fix operation. The reason we can make that a snappy experience for consumers is because we utilize heterogeneous computing.

NB: In addition to stabilization, we have noise removal, which we call a cleaning filter. We have auto light balance, sharpening, contrast improvement, and, as Sean mentioned, all this can happen automatically, with the complexity hidden from the user. But this is where heterogeneous computing comes in extremely handy. We have advanced video processing and video analysis tools that are all taking advantage of heterogeneous computing. I’ll just give you a couple of examples. In vReveal, we can, in near-real-time, create panoramas out of panning videos. You can take any panning shot, click a button in vReveal, and mere seconds later end up with a beautiful stitched panorama. In our pro-grade software, called Ikena, we have a similar thing where stitching happens on the fly, and you can create massive mosaics used for different things. That wouldn’t be possible without using the GPU. Sure, we started as a video enhancement company, but now, using these GPU capabilities, we’re way beyond that.

TH: Specifically, what aspects of your software wouldn’t be possible without GPU-based acceleration?

SV: Well, doing it in real-time wouldn't be possible without the GPU.

NB: It’s especially true with higher-resolution media. Aside from the sheer compute bottleneck that you solve by going to the GPU (or heterogeneous computing in general, as opposed to running things only on the CPU), you are also solving a bandwidth bottleneck. What I mean by that is this: for our software to work and create the desired results, we have to look at a number of frames at the same time, anywhere from a couple to 30, 40, or 50 frames. It’s a memory- or bandwidth-intensive problem to an even larger degree than it is a compute-bound problem. You have a double win when you execute something like that on a GPU because just copying a large number of massive, uncompressed, high-definition frames is something you can do on a GPU, but it’s impossible on the CPU. There is almost an order of magnitude difference in memory bandwidth between the two devices.
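
To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch in Python. Every parameter in it is our own illustrative assumption, not a MotionDSP figure: 1920x1080 frames, 4 bytes per pixel, 30 frames per second, and a filter that reads each frame in its temporal window exactly once per output frame. The function names are hypothetical and exist only for this example.

```python
# Rough estimate of the memory traffic behind multi-frame video filtering.
# All parameters are illustrative assumptions, not MotionDSP's figures.


def frame_bytes(width: int, height: int, bytes_per_pixel: int) -> int:
    """Size of one uncompressed frame in bytes."""
    return width * height * bytes_per_pixel


def window_bytes(num_frames: int, width: int, height: int,
                 bytes_per_pixel: int) -> int:
    """Size of a temporal window of uncompressed frames held at once."""
    return num_frames * frame_bytes(width, height, bytes_per_pixel)


def read_traffic_per_second(num_frames: int, width: int, height: int,
                            bytes_per_pixel: int, fps: int) -> int:
    """Bytes read per second, assuming every output frame reads each
    frame in the window exactly once (writes and re-reads are ignored)."""
    return window_bytes(num_frames, width, height, bytes_per_pixel) * fps


if __name__ == "__main__":
    # Assumed format: 1080p, 4 bytes per pixel, 30 frames per second.
    W, H, BPP, FPS = 1920, 1080, 4, 30
    for n in (2, 30, 40, 50):
        size_mib = window_bytes(n, W, H, BPP) / 2**20
        traffic_gb = read_traffic_per_second(n, W, H, BPP, FPS) / 1e9
        print(f"{n:2d} frames: window {size_mib:6.1f} MiB, "
              f"reads {traffic_gb:5.2f} GB/s")
```

Under these assumptions, a 30-frame window at real-time speed already implies roughly 7.5 GB/s of reads before any writes or intermediate buffers are counted. A real filter may touch each frame more or fewer times, so treat this as an order-of-magnitude illustration only.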

  • DjEaZy
    ... OpenCL FTW!!!
  • amuffin
    Will there be an OpenCL vs. CUDA article coming out anytime soon?
  • Hmmm...how do I win a 7970 for OpenCl tasks?
  • deanjo
    DjEaZy wrote: ... OpenCL FTW!!!

    You're welcome.

    --Apple
  • bit_user
    amuffin wrote: Will there be an OpenCL vs. CUDA article coming out anytime soon?

    At the core, they are very similar. I'm sure that Nvidia's toolchains for CUDA and OpenCL share a common backend, at least. Any differences between versions of an app coded for CUDA vs. OpenCL will have a lot more to do with the amount of effort its developers spent optimizing it.
  • bit_user
    Fun fact: The president of Khronos (the industry consortium behind OpenCL, OpenGL, etc.) and chair of its OpenCL working group is an Nvidia VP.

    Here's a document laying out the parallels between CUDA and OpenCL (it's an OpenCL Jump Start Guide for existing CUDA developers):

    NVIDIA OpenCL JumpStart Guide

    I think they tried to make sure that OpenCL would fit their existing technologies, in order to give them an edge on delivering better support, sooner.
  • deanjo
    bit_user wrote: I think they tried to make sure that OpenCL would fit their existing technologies, in order to give them an edge on delivering better support, sooner.

    Well, Nvidia did work very closely with Apple during the development of OpenCL.
  • nevertell
    At last, an article to point to for people who love shoving a GTX 580 in the same box with a Celeron.
  • JPForums
    In regards to testing the APU w/o discrete GPU you wrote:

    However, the performance chart tells the second half of the story. Pushing CPU usage down is great at 480p, where host processing and graphics working together manage real-time rendering of six effects. But at 1080p, the two subsystems are collaboratively stuck at 29% of real-time. That's less than half of what the Radeon HD 5870 was able to do matched up to AMD's APU. For serious compute workloads, the sheer complexity of a discrete GPU is undeniably superior.

    While the discrete GPU is superior, the architecture isn't all that different. I suspect the larger performance issue was stated earlier in the interview:

    TH: Specifically, what aspects of your software wouldn’t be possible without GPU-based acceleration?

    NB: ...you are also solving a bandwidth bottleneck problem. ... It’s a very memory- or bandwidth-intensive problem to even a larger degree than it is a compute-bound problem. ... It’s almost an order of magnitude difference between the memory bandwidth on these two devices.

    APUs may be bottlenecked simply because they have to share CPU level memory bandwidth.

    While APU memory bandwidth will never approach that of a discrete card, I am curious to see whether overclocking the memory on an APU makes a noticeable difference in performance. Intuition says that it will never approach a discrete card and, given the low-end compute performance, it may not make a difference at all. However, it would help to characterize the APU's performance balance a little better, i.e., does it make sense to put more GPU muscle on an APU, or is the GPU portion constrained by memory bandwidth?

    In any case, this is a great article. I look forward to the rest of the series.
  • What about power consumption? It's fine if we can lower CPU load, but not so much if the total power consumption increases.