Local and Global Data Share
With the RV770, AMD’s engineers didn’t stop at reworking their architecture so that it cost only slightly more die area; they also borrowed a few good ideas from the competition. The G80 had introduced a small, 16-KB memory area per multiprocessor that, unlike a cache, is entirely under the programmer’s control. This memory area, accessible in CUDA applications, can be used to share data among threads. AMD has introduced its own version of this with the RV770. It’s called Local Data Share and is exactly the same size as its competitor’s Shared Memory. It also plays a similar role, enabling GPGPU applications to share data among several threads. The RV770 goes even further, with another memory area (also 16 KB) called Global Data Share that enables communication among SIMD arrays.
Texture units
While the ALUs haven’t undergone a major modification, the texture units have been completely redesigned. The goal was obvious – as with the rest of the GPU, it was to increase performance significantly while maintaining as small a die area as possible. The engineers set fairly ambitious goals, aiming for an increase of 70% in performance for an equivalent die area. To do that, they focused their efforts largely on the texture cache. The bandwidth of the L1 texture cache was increased to 480 GB/s.
But that’s not all; the L1 cache that was shared by all the SIMD arrays has been broken down into 10 cache memories, one per SIMD array, each containing only data exclusive to the corresponding SIMD array. Shared data are now stored in an L2 cache, which has also been completely redesigned and now has a bandwidth of 384 GB/s to the L1 cache. In order to reduce latency, this L2 cache has been positioned near the memory controllers. Let’s see what the results of these improvements are in practice:
Compared to its direct competitor, the 9800 GTX, the Radeon HD 4850 showed first-rate performance with single and dual texturing, while not giving up any performance in terms of raw fill rate – which is to be expected considering the 40 texture units for 16 ROPs (to simplify, “2.5 texture units per pixel,” as they used to say in another era). On the other hand, with triple and quad texturing, the RV770, logically enough, can’t compete with the G92’s 64 texture units (the equivalent of “4 texture units per pixel”); but in all cases the RV770 proved to be closer to its theoretical performance than its competitor.
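The "texture units per pixel" reasoning above can be sketched with a little arithmetic. The short Python example below is only an idealized model, not a description of how either chip actually schedules work: it assumes each texture unit delivers one texel per clock, that the ROPs can write one pixel each per clock, and that nothing else (bandwidth, filtering mode, shader load) gets in the way. The function names and the CHIPS table are hypothetical and introduced purely for illustration; only the unit counts (40 and 64 texture units, 16 ROPs) come from the text.

```python
# Idealized per-clock fill-rate model (an assumption for illustration only).
# Unit counts are the ones quoted in the article; everything else is hypothetical.

CHIPS = {
    "RV770": (40, 16),  # (texture units, ROPs)
    "G92": (64, 16),
}


def texture_units_per_rop(texture_units, rops):
    """Ratio the article calls 'texture units per pixel'."""
    return texture_units / rops


def pixels_per_clock(texture_units, rops, layers):
    """Pixels completed per clock when each pixel needs `layers` texels.

    The rate is capped either by the ROPs or by how many pixels the
    texture units can feed, whichever is lower.
    """
    return min(rops, texture_units / layers)


def fill_rate_table(max_layers=4):
    """Per-clock pixel rate for 1..max_layers texture layers on each chip."""
    return {
        name: [pixels_per_clock(tmus, rops, n) for n in range(1, max_layers + 1)]
        for name, (tmus, rops) in CHIPS.items()
    }


if __name__ == "__main__":
    for name, (tmus, rops) in CHIPS.items():
        print(name, "texture units per ROP:", texture_units_per_rop(tmus, rops))
    for name, rates in fill_rate_table().items():
        print(name, "pixels/clock for 1-4 layers:", [round(r, 2) for r in rates])
```

Under these assumptions the model stays ROP-bound at 16 pixels per clock for the RV770 with one or two texture layers, then drops once a third and fourth layer are added, while the G92 only reaches its texture-unit limit at four layers. Real results also depend on clock speeds and on how close each chip gets to its theoretical rate, which this sketch deliberately ignores.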