Inside The Black Box: GPGPU Encoding
Alright, so we've established that video quality analysis isn't easy, unless you're looking at a clear and present mistake. But how are Nvidia and AMD handling transcoding on the GPU? More specifically, how are they taking what logically seems like a serial process and turning it into a parallel one?
Threaded encoding is dynamic. When a software encoder is optimized for a multi-core CPU, each thread typically encodes an individual frame. However, multi-threaded time allocation is controlled by the OS without any software oversight, which means a programmer can't control which thread finishes first and gets allocated CPU resources. For example, core number one may be just completing the encode for frame 80, even though cores two and three aren't finished with frames 78 and 79. Because there aren't extra buffers for frame 81, the dynamic bitrate for the next frame gets altered. You have to tolerate this looseness in order to get threaded performance; otherwise, you have threads waiting on one another, and it ends up being a one-core/one-thread encode. That is why frame 81 in transcoding trial #1 can be different from frame 81 in transcoding trial #2.
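To make that concrete, here is a minimal C++ sketch of the frame-per-thread approach, with a shared rate controller standing in for the encoder's bitrate logic. The RateController class, the bit-pool numbers, and the sleep-based timing are all invented for illustration; no real encoder works exactly this way.

```cpp
#include <chrono>
#include <cstdio>
#include <functional>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

// Shared rate-control state: every frame draws its bit budget from one pool.
struct RateController {
    std::mutex m;
    double pool = 16000.0;  // bits left for this group of frames (made-up number)
    double take(int frame) {
        std::lock_guard<std::mutex> lock(m);
        double budget = pool * 0.25;  // naive split of whatever is left
        pool -= budget;
        std::printf("frame %d granted %.0f bits\n", frame, budget);
        return budget;
    }
};

// Stand-in for encoding one frame on its own thread. The sleep models the
// fact that frames take different amounts of time and the OS decides which
// thread runs when; the programmer does not control the arrival order.
void encodeFrame(int frame, RateController& rc) {
    std::mt19937 rng(std::random_device{}());
    std::this_thread::sleep_for(std::chrono::milliseconds(rng() % 20));
    double budget = rc.take(frame);
    (void)budget;  // a real encoder would pick quantizers to hit this budget
}

int main() {
    RateController rc;
    std::vector<std::thread> workers;
    for (int frame = 78; frame <= 81; ++frame)
        workers.emplace_back(encodeFrame, frame, std::ref(rc));
    for (auto& t : workers) t.join();
    // Run it twice: frame 81 is granted a different budget depending on how
    // many other frames reached the rate controller before it.
}
```

The point isn't the arithmetic; it's that the state frame 81 sees depends on scheduling, and scheduling is out of the programmer's hands.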
This is only one way to program a multi-threaded encoder; programmers employ other strategies as well. For example, you could divide the work by slices and have n slices per frame (see the sketch below), or assign specific parts of the encoding pipeline to each thread (for instance, motion search, macroblock encoding, entropy encoding, rate control). There are advantages and disadvantages to each strategy. For consistent scaling, especially on systems with many CPU cores, dividing the work by frames tends to win out, because each portion of the encoding pipeline takes a different amount of time to complete. The nondeterministic part of this process is the complex bitrate control inherent to each encoder, though it may also be tied into specific operations like motion search/estimation.
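As a rough illustration of the slice-based alternative, the sketch below splits one frame's rows across n worker threads. The Frame struct and the encodeSlice placeholder are hypothetical; a real slice encoder would do motion search, transform, and entropy coding inside each band.

```cpp
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical slice encoder: each thread handles a horizontal band of the
// frame, so all slices belong to the same frame and share its bit budget.
struct Frame { int width = 1920, height = 1080; std::vector<uint8_t> pixels; };

void encodeSlice(const Frame& f, int firstRow, int lastRow) {
    // A real slice encoder would run motion search, transform, and entropy
    // coding for just these rows; here it is left as a placeholder.
    (void)f; (void)firstRow; (void)lastRow;
}

void encodeFrameBySlices(const Frame& f, int sliceCount) {
    std::vector<std::thread> workers;
    int rowsPerSlice = f.height / sliceCount;
    for (int s = 0; s < sliceCount; ++s) {
        int first = s * rowsPerSlice;
        int last = (s == sliceCount - 1) ? f.height - 1 : first + rowsPerSlice - 1;
        workers.emplace_back(encodeSlice, std::cref(f), first, last);
    }
    for (auto& t : workers) t.join();  // the frame is done when every slice is
}

int main() {
    Frame f;
    f.pixels.resize(static_cast<size_t>(f.width) * f.height * 3 / 2);  // 4:2:0
    encodeFrameBySlices(f, 4);
}
```

The usual trade-off is that slices restrict prediction across slice boundaries, so pushing the slice count very high can cost some compression efficiency.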
That alone isn't what sets one encoder apart from another. The sequence of stages in the encoding pipeline may not differ from one encoder to the next, but how each stage is carried out can. Operations like motion search can differ significantly between encoders. One might predict a starting point based on the prior frame; another might base it on neighboring macroblocks. As Mike Schmit, senior manager of digital video software at AMD, explains, "If you are processing many frames (or macroblocks) at once, you will not have the luxury/advantage of a predictor, so many frames/macroblocks will start their search with no predictor. This can cause different outcomes. But knowing this, a programmer could force the serial path to also start with no predictors to force identical outcomes. This probably wouldn't happen in real code because it would be a slower choice."
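Schmit's point about predictors is easy to show with a toy example. The cost function below is faked (a real encoder would compute a SAD over pixel blocks), but it demonstrates how the same greedy search lands on different motion vectors depending on whether a neighbor's vector is available as a starting point.

```cpp
#include <cstdio>
#include <cstdlib>

// Toy motion vector and a toy cost function, invented for illustration.
struct MV { int x, y; };

int cost(MV mv) {
    // Two local minima: a shallow one near (0,0) and a deeper one near (8,2).
    int a = std::abs(mv.x) + std::abs(mv.y);
    int b = std::abs(mv.x - 8) + std::abs(mv.y - 2);
    return (a < b) ? a + 3 : b;
}

// Greedy diamond-style search: step toward the cheapest neighbor until no
// neighbor improves. Where it ends up depends on where it starts.
MV search(MV start) {
    MV best = start;
    bool improved = true;
    while (improved) {
        improved = false;
        const MV steps[4] = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        for (MV s : steps) {
            MV cand = {best.x + s.x, best.y + s.y};
            if (cost(cand) < cost(best)) { best = cand; improved = true; }
        }
    }
    return best;
}

int main() {
    MV fromPredictor = search({7, 2});  // neighbor's vector was available
    MV noPredictor   = search({0, 0});  // neighbor still in flight, start cold
    std::printf("with predictor: (%d,%d)  without: (%d,%d)\n",
                fromPredictor.x, fromPredictor.y, noPredictor.x, noPredictor.y);
}
```

Run it and the predicted start converges on one vector while the cold start gets stuck on another, which is exactly the kind of divergence that ripples into different encoded output.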
All of these programming strategies introduce some randomness into transcoding, and that's another factor that can throw off comparisons between encoders, particularly those that run on the CPU. Even if you transcode with the same software encoder, you can still end up with different-quality video over multiple transcoding runs.
MediaConverter is the only software of the three that lets you select how many cores transcoding gets to monopolize. If you transcode with a single core, you also get the same output file size every single time. This makes sense, since the entire process is now serial: frame 81 doesn't start until frame 80 clears the buffer.
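For contrast with the threaded sketch above, the single-core case reduces to a plain loop: frame 81 always sees exactly the same rate-control state, so the output never varies. Again, the bit-pool numbers are invented.

```cpp
#include <cstdio>

// Serial version of the earlier sketch: one core, one thread, frames in order.
// The pool seen by frame 81 is identical on every run, so the output is too.
int main() {
    double pool = 16000.0;  // same invented bit pool as before
    for (int frame = 78; frame <= 81; ++frame) {
        double budget = pool * 0.25;
        pool -= budget;
        std::printf("frame %d granted %.0f bits\n", frame, budget);
    }
}
```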
What does this have to do with graphics processors? A lot, actually. I accidentally ran a benchmark twice and stumbled onto a curiosity: the file sizes of GPGPU- and fixed-function-encoded video are the same, no matter how many times you try it. For example, when we run a Quick Sync-based encode and decode with MediaConverter, Badaboom, or MediaEspresso, we get the same file size every time. Why does this parallel process behave like a serial one?
Obviously, some things have to run serially, like reassembling the encoded frames. But an identical file size means that frame #80 is encoded the same way every single time, which is exactly what we see in a single-threaded video transcode. What is going on here?
For the most part, the experts we interviewed said that while they have control over the flow of data, they don't really know what happens between the time they pass a frame to a reference library and when it is passed back, encoded.
Frankly, answers were hard to come by until we talked to AMD's Mike Schmit, who leads the team doing video codec research and development. He gave the following answer: "Even though the GPU is all about parallelism, it is sort of similar to a single core where you have the SSE instructions...and they're just a 16-byte wide instruction. Essentially, one way (not the only way) to program the GPU shaders is to basically program as if it is just a thousand-wide SIMD instruction inside of SSE. So, it's still deterministic, but it doesn't have that randomness that you might think. It completely depends on the programmer and how they attack the problem."
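A loose illustration of that mental model, in plain C++ rather than actual shader code: every "lane" reads fixed inputs and writes one fixed output slot, so the result doesn't depend on which lanes the hardware happens to run first. The quantizeBlocks function and its parameters are invented for illustration.

```cpp
#include <cstdint>
#include <vector>

// Sketch of the "thousand-wide SIMD" mental model Schmit describes: every
// lane (here, a loop iteration) reads fixed inputs and writes one fixed
// output slot. Nothing depends on which lane happens to finish first, so the
// result is bit-identical on every run: parallel, but still deterministic.
std::vector<int16_t> quantizeBlocks(const std::vector<int16_t>& coeffs,
                                    int blockSize, int qp) {
    std::vector<int16_t> out(coeffs.size());
    // On a GPU, each block below would map to one shader thread / SIMD lane.
    for (size_t block = 0; block < coeffs.size() / blockSize; ++block) {
        for (int i = 0; i < blockSize; ++i) {
            size_t idx = block * blockSize + i;
            out[idx] = static_cast<int16_t>(coeffs[idx] / qp);  // per-lane work
        }
    }
    return out;
}

int main() {
    std::vector<int16_t> coeffs(64 * 1000, 37);  // 1000 fake 8x8 blocks
    auto q = quantizeBlocks(coeffs, 64, 6);
    return q[0] == 6 ? 0 : 1;  // 37 / 6 == 6, the same answer on every run
}
```

Because no lane reads another lane's in-flight result, and shared decisions like rate control stay on the serial side of the pipeline, the encode comes out bit-identical no matter how the hardware schedules the work. That is why the GPGPU and fixed-function paths keep producing the same file size run after run.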