The East-Side GPU
Llano’s CPU side borrows heavily from existing processor technology, so it shouldn’t be a surprise that the GPU portion of the die is also quite similar to Radeon graphics cards on the market today. The Sumo core is essentially an updated version of the Redwood GPU found in the Radeon HD 5500 and 5600 cards.
As you can see, there aren’t many differences between them until you look outside the hub and render back-ends. Llano’s GPU accesses memory through the integrated northbridge, but it still has a 128-bit interface that delivers bandwidth comparable to a discrete card with DDR3 memory. Outside the hub, the Fusion APU has two display controllers and UVD3 capabilities, where Redwood has four display controllers and UVD2.
When it comes to raw specifications, these two GPUs are virtually identical. Both are based on AMD's VLIW5 architecture, just like the rest of the Radeon HD 5000 series: each thread processor (previously referred to as a Stream processor) contains four Radeon cores plus one special function ALU, a branch unit, and special purpose registers. Sumo has five SIMD engines, each containing 16 thread processors and four texture units. Calculate it out (five SIMD engines, 16 thread processors apiece, five ALUs per thread processor) and you have 400 Radeon cores and 20 texture units in total, with two render back-ends, each capable of four color ROPs (adding up to eight at the end of the day). These are the same specifications as the Radeon HD 5570 and 5670 cards.
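For readers who want to check the math, here is a minimal sketch of that arithmetic. The class and field names are our own invention for illustration, and it assumes the 400-core figure counts all five ALUs in each thread processor (the four Radeon cores plus the special function ALU), which is the only way the numbers above add up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuConfig:
    """Hypothetical helper: derives unit counts from the per-block figures quoted above."""
    simd_engines: int
    render_backends: int
    thread_processors_per_simd: int = 16
    alus_per_thread_processor: int = 5   # assumption: four cores + one special function ALU
    texture_units_per_simd: int = 4
    rops_per_backend: int = 4

    @property
    def radeon_cores(self) -> int:
        return (self.simd_engines
                * self.thread_processors_per_simd
                * self.alus_per_thread_processor)

    @property
    def texture_units(self) -> int:
        return self.simd_engines * self.texture_units_per_simd

    @property
    def rops(self) -> int:
        return self.render_backends * self.rops_per_backend


# Full Sumo configuration: five SIMD engines, two render back-ends.
sumo = GpuConfig(simd_engines=5, render_backends=2)
print(sumo.radeon_cores, sumo.texture_units, sumo.rops)
```

Dropping `simd_engines` to four in the same helper gives 320 cores and 16 texture units, so each disabled SIMD engine costs 80 cores and four texture units.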
Why not use hardware from the Radeon HD 6000 generation? According to AMD, the complexities of aligning work schedules between the graphics team and Llano team are responsible. The Radeon HD 5000 series is quite similar to the 6000 series anyway, so there’s little negative impact to speak of, especially considering that the UVD engine does get updated in Llano.
The A8-series APUs take advantage of the full 400-shader GPU, while the A6 has one of its SIMD engines disabled, yielding 320 Radeon cores and 16 texture units (similar to the Radeon HD 5550). The A4 has yet another SIMD engine stripped from its belly, resulting in a total of 240 Radeon cores and 12 texture units (three SIMD engines with four texture units each). One of the render back-ends is also shut down, limiting this model to four ROPs. On a side note, as far as I can remember there was only ever one other Radeon card with 240 cores: the Radeon HD 2900 GT, which itself was a crippled version of the 320-core Radeon HD 2900 XT.
We’re not going to rehash the technical nuances of the VLIW5 architecture—we’ve been there and done that in the Radeon HD 5870 launch. What we are going to do is take a closer look at what Llano’s GPU does differently. And there are some significant differences.
For instance, the UVD engine is updated to version three with power gating capabilities from the Radeon HD 6000 series. This means that MPEG-4 Part 2 (which includes DivX and Xvid), MPEG-2, and the Multi-View Codec (MVC) that Blu-ray 3D uses receive decode acceleration. Yes, Llano is capable of 3D playback over HDMI. In addition, the power gating capabilities make it possible to play back media using the fixed-function UVD3 block instead of the GPU’s shaders, saving a great deal of power in the process. AMD claims that Llano has the ability to play back two Blu-ray discs on one battery charge as a result of this optimization.
The memory interface and host interface required (and received) radical changes, as the APU communicates with memory through the integrated northbridge. The GPU can now write directly to the same cache that the CPU traditionally had exclusive access to. Having said that, the GPU portion of the die has priority access to memory through a true dual-channel 128-bit interface, which is the same width as the Radeon HD 5570 and 5670. The bandwidth is limited by system memory, which is significantly slower than GDDR5. Note that Llano’s GPU memory interface is twice as wide as the 64-bit interface used on the lower-power E- and C-series Fusion processors.
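To put the interface width in perspective, here is a rough sketch of the usual peak-bandwidth calculation: bus width in bytes multiplied by the transfer rate. The function name is ours, and the DDR3-1600 transfer rate plugged in below is purely an illustrative input, not a figure taken from AMD or from our test platform.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, mega_transfers_per_s: float) -> float:
    """Theoretical peak bandwidth in GB/s for a given bus width and transfer rate."""
    bytes_per_transfer = bus_width_bits / 8
    # MT/s times bytes per transfer gives MB/s; divide by 1000 for GB/s.
    return bytes_per_transfer * mega_transfers_per_s / 1000


# Assumed example speed: DDR3-1600 (1600 MT/s). Substitute whatever memory is installed.
wide = peak_bandwidth_gb_s(128, 1600)    # 128-bit interface, as on Llano
narrow = peak_bandwidth_gb_s(64, 1600)   # 64-bit interface at the same assumed speed
print(f"128-bit: {wide:.1f} GB/s, 64-bit: {narrow:.1f} GB/s")
```

At any given memory speed, the 128-bit interface has twice the theoretical peak of a 64-bit one. Bear in mind that on an APU the CPU cores draw from the same pool.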
The Fusion APU also boasts a unique ability that dedicated graphics cards cannot possess: direct access to unified memory shared between the CPU and GPU, something that makes Zero Copy and Pin-in-Place possible. To understand the advantage, consider how a discrete graphics card works today: texture maps are created in system memory and then transferred to virtual memory in Windows. When the system needs to bind the texture, it first makes sure it’s in virtual memory, then the OS copies it to DRAM, and the DMA of the PCIe bus transfers it to the graphics memory for access. Simply put, there’s a lot of copying going on that can cause significant latency.
But an APU doesn’t need to copy memory contents because the GPU and CPU blocks share access to the same memory. Zero Copy can access virtual memory directly. Just update the page tables and point to it; no copying is necessary. Application memory can be pinned in place without copying it through the operating system staging buffers. When very large data sets are involved, the APU can even outrun a dedicated GPU (Ed.: I covered this optimization, which AMD was calling Fast Copy previously, in ASRock's E350M1: AMD's Brazos Platform Hits The Desktop First. Brazos is also able to share that memory space, which was previously separate, and enjoy a latency reduction).
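The difference between the two paths can be illustrated with a loose analogy in plain Python. To be clear, this is not GPU or driver code: the function names are hypothetical, and a `memoryview` merely stands in for the idea of pointing at existing memory instead of duplicating it.

```python
def staged_upload(texture: bytearray) -> bytearray:
    """Discrete-card style path: each hop makes its own copy of the data."""
    staging = bytearray(texture)   # copy 1: into a staging buffer
    device = bytearray(staging)    # copy 2: stand-in for the transfer to graphics memory
    return device


def zero_copy_map(texture: bytearray) -> memoryview:
    """APU style path: hand back a view of the very same bytes, with no copy made."""
    return memoryview(texture)


texture = bytearray(b"texels")
device_copy = staged_upload(texture)
shared_view = zero_copy_map(texture)

# Change the source after both "uploads" have happened.
texture[0] = ord("T")

print(bytes(device_copy))   # the copy still holds the old contents
print(bytes(shared_view))   # the view sees the change immediately
```

The copied buffer goes stale and cost two full passes over the data to create, while the shared view cost nothing regardless of how large the texture is. That is why the advantage grows with very large data sets.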
That’s a best-case scenario. And on the whole, Llano is about 5-7% slower than a dedicated card because of the extra latencies involved. CPUs and GPUs aren’t all that compatible when you get right down to it. The GPU has to give the CPU low-latency access to memory, reorganize its memory accesses, and deal with extra latency because of it. A lot of work went into memory handling, and while some efficiency is lost, the final performance is very close to a discrete part with the same specifications.
Aside from these differences, the GPU block is identical to any other Radeon HD 5000 card. It features the TeraScale 2 unified processing architecture, full DirectX 11 support (something AMD repeatedly pointed out that Sandy Bridge doesn’t offer), OpenGL 4.1, MSAA, SSAA, and MLAA anti-aliasing, angle-independent anisotropic filtering, and OpenCL 1.1 support. While it certainly isn’t as powerful as a high-end discrete card, the point to take away is that this isn’t a crippled or cut-down GPU. It’s capable of the exact same features as any other Radeon.
The A-series APUs have a unique capability that, at least in theory, complements the integrated GPU nicely. They’re able to work cooperatively with separate discrete graphics for a net performance boost. Even more surprising is Llano’s ability to cooperate with GPUs that are faster or slower than its own integrated engine. Dual Graphics does not require identical GPUs to work properly, nor does it cripple the faster GPU to the specifications of the lowest common denominator, as we’ve seen from CrossFire. It actually load balances the available graphics hardware for more performance. For instance, if the discrete GPU is twice as fast as the on-die graphics, the driver takes one frame from the APU for every two frames it takes from the dedicated card.
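A simple way to picture that kind of load balancing is to hand each new frame to whichever GPU would finish it soonest. The sketch below is our own illustration under that assumption, not AMD's actual driver logic; the function name and the relative speed numbers are hypothetical.

```python
from collections import Counter


def assign_frames(speeds: dict, n_frames: int) -> list:
    """Give each frame to the GPU that would finish it earliest.

    speeds maps a GPU name to its relative speed in frames per unit time.
    """
    busy_until = {name: 0.0 for name in speeds}
    schedule = []
    for _ in range(n_frames):
        # Finish time if this GPU takes the next frame; ties go to the first GPU listed.
        name = min(speeds, key=lambda gpu: busy_until[gpu] + 1.0 / speeds[gpu])
        busy_until[name] += 1.0 / speeds[name]
        schedule.append(name)
    return schedule


# Discrete card assumed to be twice as fast as the APU's graphics.
schedule = assign_frames({"discrete": 2.0, "apu": 1.0}, 6)
counts = Counter(schedule)
print(schedule)
print(counts)
```

Over six frames the faster card ends up with four and the APU with two, which is the two-to-one split described above.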
This asymmetrical CrossFire implementation sounds like a fantastic idea, but there are serious limitations. First, it doesn’t work at all unless it’s driving a DirectX 10 or 11 application. And if you run a DirectX 9 or earlier game engine, it actually degrades performance to the slower of the two graphics options installed.
Update: According to AMD, actual production models should revert to the faster of the two graphics options installed when running a game engine using DirectX versions lower than 10. The company claims that the early test hardware we were given suffered from this issue, so we'll have to wait for actual production units to verify. OpenGL is not supported by Dual Graphics; OpenGL rendering is always handled by the GPU driving the primary display outputs.
Even when it does work, the feature is somewhat inconsistent, and we definitely noticed stuttering, despite benchmark results claiming faster raw frame rates. Finally, Dual Graphics won’t work unless the two GPUs’ performance is within a two-to-one ratio. For instance, if a graphics card is three times as fast as Llano’s GPU, Dual Graphics won’t work. We’ll cover the performance ramifications shortly.
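The conditions described so far can be summed up in a small checklist function. This is only a sketch of the rules as we understand them: the function name, the API strings, and the relative speed values are hypothetical, and we are assuming the two-to-one limit applies whether the discrete card is the faster or the slower of the pair.

```python
def dual_graphics_engages(api: str, api_version: int,
                          discrete_speed: float, apu_speed: float) -> bool:
    """Return True if Dual Graphics load balancing would be expected to kick in."""
    # Only DirectX 10 and 11 applications qualify; OpenGL and DirectX 9 do not.
    if api != "directx" or api_version not in (10, 11):
        return False
    # Assumption: the ratio limit is symmetric between the two GPUs.
    faster = max(discrete_speed, apu_speed)
    slower = min(discrete_speed, apu_speed)
    return faster <= 2.0 * slower


print(dual_graphics_engages("directx", 11, 2.0, 1.0))  # within the two-to-one limit
print(dual_graphics_engages("directx", 11, 3.0, 1.0))  # discrete card too fast
print(dual_graphics_engages("directx", 9, 2.0, 1.0))   # DirectX 9 title
print(dual_graphics_engages("opengl", 4, 2.0, 1.0))    # OpenGL title
```

Only the first case passes; the other three fall back to a single GPU.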
Another limitation with the Sabine notebook platform is that OEMs will have to decide between Dual Graphics and Eyefinity support. Since the A-series notebook uses both dedicated display controllers for the APU and Dual Graphics configurations, if you want to use three displays in Eyefinity, you’d have to use the discrete card’s controllers. In other words, no Dual Graphics would be possible. The lack of Eyefinity support is probably unimportant in the laptop space, though.
Having pointed out these unfortunate side-effects, we think there’s a lot of potential here. Assuming that AMD puts more resources into driver development and fixes the issues we’ve encountered, Dual Graphics could be a serious consideration for the consumer. Having a Fusion-based system might mean that you could spend $50 for a graphics card and end up with the same performance as an $80 model. It becomes even more attractive in the notebook space, as the graphics subsystem can revert to the power-saving APU when battery life is a consideration, then switch to both GPUs when an outlet is available for much better performance.
Even if you match it up to a discrete card too powerful to allow for Dual Graphics operation, the APU is able to execute OpenCL calls while the graphics card handles 3D rendering. This is a forward-looking scenario, mind you, but if game developers embrace this application interaction for tasks like physics calculation, it introduces some interesting possibilities.