More Inklings: Video Transcoding
Let’s be honest—smooth video playback probably isn’t what springs to mind when you picture the melding of CPU and GPU resources onto a single die. You, like us, are looking for a more compelling application able to show off the strengths of both previously-separate worlds working together on a parallelized task. Video transcoding would be the perfect fit.
Unfortunately, Intel took a lot of the wind out of AMD’s sails with its introduction of Quick Sync, which is explicitly designed to accelerate both decode and encode. Quick Sync is found on ultra-low voltage Sandy Bridge processors that dip down to 17 W, but the functionality doesn’t come cheap: the least-expensive SKU (Intel’s Core i5-2537M) costs $250 for the processor alone. Its performance in this sort of workload is compelling, though. In contrast, an entire Brazos-based netbook or nettop should sell for less than $500, and motherboards with the E-350 soldered on should be available for less than $100. A direct comparison between Zacate and Sandy Bridge consequently isn’t really fair.
The compromise you make in stepping toward a more budget-friendly design is less aggressive transcode acceleration. The transcode pipeline involves reading a file in, decoding it, encoding it, and outputting it. AMD’s Zacate APU is able to accelerate the decode stage, naturally. From there, it’s able to take advantage of the fact that the GPU and CPU are on the same die to speed up the process of copying data from graphics memory to the processor. AMD markets this capability as Fast Copy Optimization.
The way it works is simple. Previously, transcoding apps used CPU instructions to copy decoded video data from a graphics card on one end of the PCI Express bus to the processor, where post-processing and encoding took place. Copying between dissimilar memory spaces this way burned CPU cycles. On a modern desktop processor, that probably wasn’t a debilitating bottleneck. In a more mobile implementation, though, burnt cycles not only hold back performance more noticeably, but also hurt power consumption. Fast Copy instead uses DMA to copy the same data without spending CPU cycles, freeing the two Bobcat cores to work on the encode.
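To put rough shape on that reasoning, here is a minimal Python sketch of the trade-off. It is a toy model, not a description of AMD’s implementation: the function names are hypothetical, the per-frame timings are made-up placeholders rather than measurements, and it assumes the encode scales perfectly across both cores.

```python
# Toy model of CPU time in the transcode pipeline described above.
# All timings are illustrative placeholders, not measurements of real hardware.


def cpu_ms_per_frame(encode_ms: float, copy_ms: float, dma_copy: bool) -> float:
    """Return the CPU time spent on one frame, in milliseconds.

    Decode is assumed to run on the GPU side of the APU, so it costs no CPU
    time in this model. The copy out of graphics memory costs CPU time only
    when the CPU performs it itself; with DMA it is treated as free.
    """
    return encode_ms + (0.0 if dma_copy else copy_ms)


def frames_per_second(encode_ms: float, copy_ms: float, dma_copy: bool,
                      cores: int = 2) -> float:
    """Return throughput, assuming the work splits evenly across `cores`."""
    return cores * 1000.0 / cpu_ms_per_frame(encode_ms, copy_ms, dma_copy)


# Hypothetical per-frame costs, chosen only to make the arithmetic easy.
ENCODE_MS = 40.0
COPY_MS = 10.0

cpu_copy_fps = frames_per_second(ENCODE_MS, COPY_MS, dma_copy=False)
dma_copy_fps = frames_per_second(ENCODE_MS, COPY_MS, dma_copy=True)

print(f"CPU copy: {cpu_copy_fps:.1f} fps")  # CPU copy: 40.0 fps
print(f"DMA copy: {dma_copy_fps:.1f} fps")  # DMA copy: 50.0 fps
```

With these placeholder figures, taking the copy off the CPU lifts throughput from 40 to 50 frames per second. The real gain depends entirely on how large the copy cost is relative to the encode, which this sketch does not attempt to estimate.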
Wait, encode is happening on the processor? We have 80 stream processors in a pair of SIMD engines. Why not offload to those in much the same way that Intel involved its EUs in encode acceleration on Sandy Bridge? AMD does have encode acceleration available on its discrete graphics products, but the two SIMD engines on Zacate simply aren’t powerful enough to demonstrate an appreciable benefit. GPU-accelerated encode will be available through the Sabine platform’s Llano APU, so we’ll have to wait until later this year to see how well it works.
In the meantime, one of CyberLink’s competitors, ArcSoft, is working on its own OpenCL-based encoder that may or may not change the Brazos performance story in the near term. CyberLink is going the OpenCL route later this year as well. But again, both companies are more likely focused on Llano, which has the GPU muscle to make encoding worth offloading to graphics.