Building Cayman By Improving Cypress
According to AMD, it had four principal design goals in building Cayman: more efficiency, improved geometry performance, new image quality features, and better power management.
First, it wanted to create a more efficient graphics and compute architecture. The motivation behind this decision is sound enough: AMD was seeing an average VLIW utilization of roughly 3.4 of the five available slots in games. So, removing the special-function transcendental unit and distributing its functionality across the other four units was a good performance-per-area optimization that promised to keep the GPU running within the observed utilization. There are situations where performance could take a hit (when a workload would otherwise fill more than four slots), but AMD says that's unlikely.
More important, AMD had little choice but to chase efficiency. Stuck on TSMC’s 40 nm manufacturing node, the company had to figure out how to get more performance per square millimeter of die space, rather than simply focusing on adding absolute performance. By shifting from its five-way VLIW architecture to a four-way design, AMD claims a 10% improvement in performance per square millimeter of die, as it’s able to fit more SIMDs into the same amount of space.
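The utilization argument above can be put in numbers. The following is a minimal sketch, assuming the 3.4 figure is a simple average of filled slots per instruction; the function name `slot_utilization` is purely illustrative, and a real workload has a distribution of bundle widths that an average hides.

```python
def slot_utilization(avg_ops_per_bundle: float, slots: int) -> float:
    """Fraction of issue slots doing useful work, capped at 1.0.

    Simplified model: treats the average as if every bundle had that width.
    """
    return min(avg_ops_per_bundle, slots) / slots


AVG_OPS = 3.4  # average slots filled in games, the figure attributed to AMD

vliw5 = slot_utilization(AVG_OPS, 5)  # Cypress-style five-way design
vliw4 = slot_utilization(AVG_OPS, 4)  # Cayman-style four-way design

print(f"VLIW5: {vliw5:.0%} of slots busy")
print(f"VLIW4: {vliw4:.0%} of slots busy")
```

Under this simplified model, the same 3.4 operations leave about a third of a five-wide unit idle but only about 15% of a four-wide one, which is the sense in which the narrower design wastes less die area.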
Streamlining the architecture doesn’t make it any less capable. The four stream processors now have identical capabilities, absorbing the special function unit’s role as well. In its VLIW4 configuration, each four-wide unit can issue:
- Four 32-bit FP FMA, MAD, MUL, or ADD per clock
- Two 64-bit FP ADD per clock
- One 64-bit FP FMA or MUL per clock
- One FP Special Function per clock
- Four 24-bit Int MAD, MUL, or ADD per clock
- Four 32-bit Int ADD or bitwise ops per clock
- One 32-bit Int MAD or MUL per clock
- One 64-bit ADD per clock
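One way to read the list above is as a table of relative rates. The sketch below copies the per-clock figures into a dictionary and expresses each operation as a fraction of the 32-bit FP rate; the key names and the `relative_rate` helper are hypothetical labels chosen for this example, not AMD terminology.

```python
# Per-clock issue rates in the VLIW4 configuration, copied from the list above.
# Key names are illustrative labels only.
OPS_PER_CLOCK = {
    "fp32_fma_mad_mul_add": 4,
    "fp64_add": 2,
    "fp64_fma_mul": 1,
    "fp_special_function": 1,
    "int24_mad_mul_add": 4,
    "int32_add_bitwise": 4,
    "int32_mad_mul": 1,
    "add64": 1,
}


def relative_rate(op: str, baseline: str = "fp32_fma_mad_mul_add") -> float:
    """Rate of `op` as a fraction of the baseline operation's rate."""
    return OPS_PER_CLOCK[op] / OPS_PER_CLOCK[baseline]


for name in OPS_PER_CLOCK:
    print(f"{name:>22}: {relative_rate(name):.2f}x the 32-bit FP rate")
```

Read this way, a 64-bit FMA runs at one-quarter of the 32-bit FP rate and a 64-bit FP ADD at one-half.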
Moving beyond the GPU’s shading core, its render back-ends are able to handle 16-bit integer ops 2x faster, while 32-bit FP ops are 2x-4x faster. According to AMD, this most directly affects anti-aliasing performance.
Augmenting Compute Performance
Although AMD’s compute-oriented aspirations are often taken less seriously than Nvidia’s, this does sound like an area that received some attention with Cayman. For instance, whereas the Radeon HD 5800-series cards perform double-precision math at one-fifth of the single-precision rate, Cayman operates at one-quarter of the SP rate. Although the Radeon HD 6970’s peak single-precision rate is a touch lower than the Radeon HD 5870’s (2.7 TFLOPS versus 2.72 TFLOPS), you end up with 675 GFLOPS of peak double-precision math on the Radeon HD 6970 compared to the 5870’s 544 GFLOPS.
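The peak figures quoted above follow from simple arithmetic. This sketch assumes the published reference specifications, which are not stated in this section: 1536 stream processors at 880 MHz for the Radeon HD 6970 and 1600 at 850 MHz for the Radeon HD 5870, with a fused or combined multiply-add counted as two floating-point operations. The function name `peak_gflops` is illustrative.

```python
def peak_gflops(stream_processors: int, clock_mhz: float, dp_ratio: float):
    """Return (single-precision, double-precision) theoretical peak in GFLOPS.

    Counts one multiply-add per stream processor per clock as two FLOPs.
    """
    sp = stream_processors * 2 * clock_mhz / 1000.0
    return sp, sp * dp_ratio


# Assumed reference specs (not given in the text above).
hd6970_sp, hd6970_dp = peak_gflops(1536, 880, dp_ratio=1 / 4)
hd5870_sp, hd5870_dp = peak_gflops(1600, 850, dp_ratio=1 / 5)

print(f"HD 6970: {hd6970_sp:.1f} SP GFLOPS, {hd6970_dp:.1f} DP GFLOPS")
print(f"HD 5870: {hd5870_sp:.1f} SP GFLOPS, {hd5870_dp:.1f} DP GFLOPS")
```

With those assumed inputs, the results land on the quoted numbers: roughly 2.70 TFLOPS and just under 676 GFLOPS for the 6970, against 2.72 TFLOPS and 544 GFLOPS for the 5870. These are theoretical peaks, not measured throughput.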
Note also that the Barts GPU sacrifices DP altogether, focusing on gaming performance rather than compute capabilities.
Cayman also incorporates dual bidirectional DMA engines, which ideally yield faster reads and writes to and from system memory over the PCI Express bus.
Finally, AMD gives Cayman the ability to run independent applications across the GPU. This is in contrast to Fermi, which can handle multiple kernels only so long as they’re spawned from the same CPU thread. Interestingly, that functionality isn’t part of DirectX 11, so AMD instead has to enable it through OpenCL at some point in the future.
Aside from those functionality tweaks, Cayman retains Cypress’ cache structure. Each SIMD has its own 8 KB L1 cache for computational work in addition to the 16 KB L1 texture cache, plus a 32 KB local data share. Four 128 KB L2 caches continue to keep those SIMDs fed with information, and there is still a 64 KB global repository shared by all of the SIMDs.