Intel, AMD, And Nvidia: Decode And Encode Support
Decode Support: A History Of Formats
There is a caveat with video decoded in hardware. Not every solution exposes the same degree of processing.
For Intel, motion compensation was the only hardware-accelerated stage of the video pipeline for several generations of graphics products (GMA 900, 950, 3000, and 3100). That meant a software decoder had to decompress the video bitstream before Intel's logic circuits performed motion compensation. It wasn't until a later revision of 3100 that we actually saw full hardware-based decoding of MPEG-2. Support for VC-1 and H.264 didn't come until the 4500MHD. Remember that GMA 500 doesn't count since Imagination Technologies developed it for Intel.
Meanwhile, AMD recently released UVD 3 with its Radeon HD 6000-series. Originally, UVD 1 supported full VC-1 and H.264 decoding. UVD 2 added frequency transformation and motion compensation to MPEG-2. UVD 3 adds full decoding support for MVC, MPEG-2, and MPEG-4/DivX/Xvid. Note that UVD on AMD's 5000-series cards underwent a firmware-level revision. There was enough of a hardware change that AMD added dual-stream decoding support. The Radeon HD 4000-series already had picture-in-picture and dual-stream decoding support, but this was limited to SD resolutions.
Nvidia started out with MPEG-1/MPEG-2 hardware-based decoding on its GeForce FX. The first generation of PureVideo emerged when Nvidia took that hardware, improved deinterlacing and overlay resizing, and built it onto the GeForce 6000-series. For the most part, decoding acceleration was limited, as it excluded frequency transformation, initial run and variable length decoding. H.264 hardware decoding didn't crop up until GeForce 6600. Today, we are up to the fourth generation of PureVideo, which adds MPEG-4 (Advanced) Simple Profile bitstream decoding, along with MVC for 3D Blu-ray content.
Encode Support
Up until Quick Sync, there was no such thing as a fixed-function encoder (for desktop PCs). Nearly all encoding was achieved on the software side using pure CPU horsepower. If you had a fast computer, you could encode faster. It was as simple as that. And, if you remember far enough back, there was a time when software-based encoding was only done on a single thread, severely limiting performance. Times have changed, and the process can largely be parallelized.
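As a rough illustration of why encoding parallelizes, the sketch below splits a frame sequence into independent chunks and encodes them on a thread pool. It is a minimal sketch under stated assumptions: split_into_chunks, encode_chunk, and encode_parallel are hypothetical names, and zlib compression stands in for a real video encoder, which would also have to respect frame dependencies when choosing chunk boundaries.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(frames, chunk_size):
    """Group consecutive frames into chunks that can be encoded independently."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def encode_chunk(chunk):
    """Placeholder encoder (assumption): zlib stands in for a real video codec."""
    return zlib.compress(b"".join(chunk))


def encode_parallel(frames, chunk_size=4, workers=4):
    """Encode every chunk on a thread pool and return results in source order."""
    chunks = split_into_chunks(frames, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # so the encoded chunks can be concatenated directly.
        return list(pool.map(encode_chunk, chunks))


if __name__ == "__main__":
    # Ten fake 64-byte "frames" (illustrative data only).
    frames = [bytes([i]) * 64 for i in range(10)]
    encoded = encode_parallel(frames, chunk_size=4, workers=2)
    print(len(encoded), "chunks encoded")
```

A single-threaded encoder corresponds to calling encode_chunk on each chunk in a plain loop; the output is identical, only the wall-clock time differs.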
For the most part, we are still dealing with software-based encoders. The only difference is that now we have encoders that can do all of the work in the GPU by way of GPGPU programming libraries. And while we’ve all been trained to think that general-purpose GPU computing is the future, at least relative to the more limited parallelism offered by a CPU, the tasks we’re talking about here simply cannot run as quickly or as efficiently (power-wise) in general-purpose logic as they can in fixed-function hardware. So, now we end up comparing GPGPU-based encoders that run on a graphics card's hardware to Intel's mixed fixed-function/general-purpose implementation.
On the encode side, you have fixed-function logic working in concert with the programmable execution units. There’s a media sampler block attached to the EUs (Intel calls this a co-processor) that handles motion estimation, augmenting the programmable logic. Of course, the decoding tasks that happen during a transcode travel down the same fixed-function pipeline already discussed, so there’s additional performance gained there. Feed in MPEG-2, VC-1, or AVC, and you get MPEG-2 or AVC output from the other side.
Naturally, each application is going to employ Quick Sync differently. Take CyberLink, for example. PowerDVD 10 capitalizes on the pipeline’s decode acceleration. A MediaEspresso project is going to be significantly more involved: it’ll read the file in, decode, encode, and return the output stream. Then, in PowerDirector, a video editing app, you have to factor in post-processing, meaning the effects and compositing that happen before everything gets fed into the encode stage.
Optimizing for Transcoding
The transcode pipeline involves reading a file in, decoding it, encoding it, and outputting it. Before the development of GPGPU-based encoding, transcoding software used the CPU to copy data from video memory (where it resided after hardware-accelerated decoding) and send it back to system memory, where the CPU was able to perform the encode stage.
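To make the data movement concrete, here is a minimal sketch of the pipeline as an ordered list of stages. The function name transcode_stages and the stage labels are illustrative assumptions, not part of any vendor API. The only thing the sketch models is whether decoded frames have to be copied back to system memory before the encode stage.

```python
def transcode_stages(encode_on_gpu):
    """Return the ordered stages of a transcode (illustrative labels only)."""
    stages = [
        "read source file",
        "decode in hardware (frames land in video memory)",
    ]
    if encode_on_gpu:
        # Encoder sits next to the decoder, so frames stay in video memory.
        stages.append("encode on GPU")
    else:
        # CPU encoder needs the decoded frames in system memory first.
        stages.append("copy decoded frames back to system memory")
        stages.append("encode on CPU")
    stages.append("write output file")
    return stages


if __name__ == "__main__":
    for on_gpu in (False, True):
        label = "GPU encode" if on_gpu else "CPU encode"
        print(label)
        for number, stage in enumerate(transcode_stages(on_gpu), start=1):
            print(f"  {number}. {stage}")
```

The CPU-encode path carries one extra stage, the copy of decoded frames out of video memory, which is the cost the older transcoding software paid on every job.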
Because the fixed-function decoder is on the same piece of silicon as the GPU, the software can skip copying the data back to system memory (step four in the first diagram). General-purpose GPU-based transcoding allows almost the entire process to happen on one piece of silicon. These are performance-oriented considerations, though, and our focus today is on quality. Let's move on.