Ambient Occlusion, Continued
Ambient occlusion can also be performed via pixel shaders. Developers are free to choose either method and, going into this article, we were a bit in the dark about why DirectCompute might be preferable. After all, we’d seen enough early benchmarks showing that DC-enabled effects could significantly impact graphics performance (and not in a positive way). Using compute resources to achieve a feature that couldn’t be done otherwise was one thing, but why pick DirectCompute when shaders were already getting the job done? Well, for starters, DirectCompute has no more of an impact on performance than pixel shaders.
“For each pixel the occlusion term is calculated for, multiple reads of the depth texture are required,” says Codemasters’ Thomas. “In a pixel shader, each texture read costs cycles. In a compute shader, the LDS (local data share) is filled with the nearby depth information from the depth texture, and subsequent reads are significantly cheaper compared to a texture fetch.”
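To put rough numbers on Thomas's point, here is a back-of-the-envelope sketch in Python that counts depth-texture fetches for a single thread-group tile. The 8x8 tile, the 4-pixel sampling radius, and the idea that every pixel samples its full square neighborhood are illustrative assumptions of ours, not details of the actual HDAO implementation. The pixel shader figure also ignores the texture cache, so treat it as a worst case.

```python
def pixel_shader_fetches(tile: int, radius: int) -> int:
    """Texture fetches when every pixel reads its whole neighborhood itself."""
    samples_per_pixel = (2 * radius + 1) ** 2
    return tile * tile * samples_per_pixel


def compute_shader_fetches(tile: int, radius: int) -> int:
    """Texture fetches when the group loads the tile plus its border into the LDS once."""
    padded = tile + 2 * radius
    return padded * padded


TILE = 8    # hypothetical thread-group width and height, in pixels
RADIUS = 4  # hypothetical sampling radius, in pixels

ps = pixel_shader_fetches(TILE, RADIUS)
cs = compute_shader_fetches(TILE, RADIUS)
print(f"pixel shader:   {ps} texture fetches per tile")
print(f"compute shader: {cs} texture fetches per tile (all other reads hit the LDS)")
```

Under these assumptions, the compute path touches the depth texture once per texel in the padded tile, while the pixel shader path re-fetches overlapping neighborhoods for every pixel. The gap widens as the sampling radius grows.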
In this series, we want to keep returning to the question of heterogeneous computing and how adept today's hardware is at handling the tasks discussed. How do APUs compare to discrete graphics and host processors operating over PCI Express? If texture fetches are coming from memory, and APUs are relying on a shared system RAM architecture, does this inherently handicap an APU's ability to execute this task efficiently, or is its proximity to the host processing resources a boon instead?
“HDAO only requires the depth of the scene as an input,” says Thomas. “This has to be rendered first, but in practice most games already have this information hanging around from either the g-buffer or a depth pre-pass. The depth buffer is a video memory resource and the implementation of HDAO would be no different on an APU compared to a GPU. The technique is very memory efficient since the only extra memory requirement is for the output texture. This is another reason why the technique is becoming increasingly popular.”
This is borne out in our test results, and it’s an important point to make up front. You're going to look at our upcoming Battlefield 3 results and see that the APU only manages an average of 14 FPS with horizon-based ambient occlusion (HBAO) enabled, a clearly unplayable rate. With the Radeon HD 7970 card pulling in results 8.5x higher, it'd be natural to assume that the APU simply can’t handle the DirectCompute load. But don’t let the article’s context mislead you. Even with ambient occlusion disabled, the APU system only averages 16.6 FPS.
Battlefield 3's load is such that the APU's graphics muscle is unable to keep up. It's not the chip's heterogeneous architecture killing performance. We simply need hardware with more horsepower.
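For readers who prefer frame times to frame rates, the short Python sketch below converts the two APU averages quoted above into milliseconds per frame and estimates how much of each frame the ambient occlusion pass accounts for. It assumes the whole difference between the two runs is attributable to HBAO, which is a simplification on our part.

```python
FPS_HBAO_ON = 14.0   # APU average with HBAO enabled (from the text)
FPS_HBAO_OFF = 16.6  # APU average with ambient occlusion disabled (from the text)


def frame_time_ms(fps: float) -> float:
    """Convert an average frame rate into an average frame time in milliseconds."""
    return 1000.0 / fps


# Extra time per frame spent when HBAO is switched on.
ao_cost_ms = frame_time_ms(FPS_HBAO_ON) - frame_time_ms(FPS_HBAO_OFF)

# Fraction of the HBAO-enabled frame that the extra time represents.
ao_share = ao_cost_ms / frame_time_ms(FPS_HBAO_ON)

print(f"HBAO cost: {ao_cost_ms:.1f} ms per frame ({ao_share:.1%} of the frame)")
```

By this estimate, HBAO costs roughly 11 ms per frame, or about 16 percent of the frame time. The remaining 84 percent or so is spent on everything else the game asks of the APU, which is why disabling the effect does not rescue playability.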