The Architecture in Detail
Like Nvidia, AMD has chosen to build on its previous architecture rather than start from scratch. The RV770’s design is very much the same as that of the R600, which had already been reused for the RV670.
SIMD cores
The architecture, first introduced with the Xenos (the GPU used in the Xbox 360), is based on a group of SIMD arrays. The Xenos had three SIMD arrays, and the R600 and RV670 have four. The RV770 goes much further with ten.
As you may have deduced, each SIMD array contains 80 ALUs, since the GPU has 800 ALUs in total. That’s true, but it’s a slightly simplified view of reality. In practice, the 80 ALUs are not independent of each other. They’re grouped together in five-way VLIW units, with 16 units per SIMD array. This organization imposes certain restrictions on the instructions executed: each of the five instructions in a VLIW bundle has to be independent of the others. It’s up to the compiler to find enough independent instructions to saturate the ALUs, unlike the G80, which uses a more "hardware" solution.
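The ALU count above is simple arithmetic, and a short Python sketch makes the hierarchy explicit. The constant and function names are illustrative only; the figures are the ones quoted in the text.

```python
# Figures quoted in the text for the RV770; the names are illustrative.
SIMD_ARRAYS = 10           # SIMD arrays on the chip
VLIW_UNITS_PER_ARRAY = 16  # five-way VLIW units per SIMD array
ALUS_PER_VLIW_UNIT = 5     # ALUs (lanes) in each VLIW unit


def alus_per_array():
    """ALUs in one SIMD array."""
    return VLIW_UNITS_PER_ARRAY * ALUS_PER_VLIW_UNIT


def total_alus():
    """ALUs across the whole GPU."""
    return SIMD_ARRAYS * alus_per_array()


print(alus_per_array())  # 80
print(total_alus())      # 800
```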
Here’s an example to illustrate what we just described:
- I1 FADD R1, R1, 3.14
- I2 FMUL R2, R1, 1.41
- I3 FMAD R3, R0, 0.5, 0.5
In this case, Instructions 1 and 3 can share the same bundle, but Instruction 2 cannot, because it depends on the result of Instruction 1. If the compiler can’t find enough independent operations in its window of instructions, it has to fill the rest of the bundle with NOP instructions that do nothing, which reduces the chip’s performance.
What all that adds up to in the present case is that Nvidia’s ALUs will hit their peak performance more often because they’re less dependent on the underlying code; the downside is that they’re much more costly in terms of transistors. AMD’s units depend strongly on the performance of the compiler (the one that’s “internal” to the driver, which reorganizes the assembler instructions generated by the HLSL compiler), but AMD can afford to include a much larger number of them on a die that’s still significantly smaller.
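The bundling constraint can be illustrated with a toy packer in Python. This is only a minimal sketch under stated assumptions, not AMD’s actual driver compiler: the `Instr` type, the `conflicts` and `pack` functions, and the greedy in-order strategy are all hypothetical, and the sketch conservatively treats any register hazard (read-after-write, write-after-write, write-after-read) as a dependency. Constants such as 3.14 are left out of the source list because they are not registers.

```python
from dataclasses import dataclass

BUNDLE_WIDTH = 5  # five-way VLIW, as described in the text


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, register sources only."""
    name: str
    op: str
    dest: str
    srcs: tuple


def conflicts(earlier, later):
    """True if `later` cannot share a bundle with (or move ahead of) `earlier`."""
    return (earlier.dest in later.srcs      # read-after-write
            or earlier.dest == later.dest   # write-after-write
            or later.dest in earlier.srcs)  # write-after-read (conservative)


def pack(program, width=BUNDLE_WIDTH):
    """Greedy packer: fill each bundle in program order with independent instructions."""
    remaining = list(program)
    bundles = []
    while remaining:
        bundle, deferred, seen = [], [], []
        for instr in remaining:
            # `seen` holds every not-yet-issued instruction that precedes
            # `instr` in program order, whether bundled now or deferred.
            if len(bundle) < width and not any(conflicts(e, instr) for e in seen):
                bundle.append(instr)
            else:
                deferred.append(instr)
            seen.append(instr)
        bundles.append(bundle)
        remaining = deferred
    return bundles


# The three instructions from the example in the text.
PROGRAM = [
    Instr("I1", "FADD", "R1", ("R1",)),
    Instr("I2", "FMUL", "R2", ("R1",)),
    Instr("I3", "FMAD", "R3", ("R0",)),
]

bundles = pack(PROGRAM)
layout = [[i.name for i in b] for b in bundles]
nop_slots = sum(BUNDLE_WIDTH - len(b) for b in bundles)

print(layout)     # [['I1', 'I3'], ['I2']]
print(nop_slots)  # 7 of the 10 available slots are filled with NOPs
```

With only three instructions and one dependency, two bundles are needed and most of the slots go unused, which is why the quality of the compiler’s scheduling matters so much on this kind of design.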
The VLIW units themselves haven’t been heavily reworked. Each contains four units capable of executing an FMAD or an integer addition, plus a special unit capable of executing an FMAD, an integer multiplication, or a transcendental function (sine, cosine, log, exp, etc.). The only real improvement concerns integer bit-shift operations, which can now be handled by any of the five units, whereas on the 2900/3800 only the special unit could perform them. Rather than make the units more powerful, AMD has concentrated on shrinking the die area each one occupies so that more of them fit on the chip.
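The division of labor inside a VLIW unit can be summarized as a small capability table. The Python sketch below is purely illustrative: the operation labels (`FMAD`, `IADD`, `IMUL`, `TRANSCENDENTAL`, `SHIFT`) and the function names are hypothetical, and the table encodes only what the text states, not the full instruction set.

```python
# Capabilities as described in the text; operation labels are illustrative.
REGULAR_OPS = {"FMAD", "IADD"}                    # the four ordinary units
SPECIAL_OPS = {"FMAD", "IMUL", "TRANSCENDENTAL"}  # the fifth, special unit


def vliw_unit(shift_on_all_lanes):
    """Return the five lanes of one VLIW unit as sets of supported operations."""
    lanes = [set(REGULAR_OPS) for _ in range(4)] + [set(SPECIAL_OPS)]
    for index, lane in enumerate(lanes):
        # Before the RV770, only the special unit (last lane) handled bit shifts.
        if shift_on_all_lanes or index == 4:
            lane.add("SHIFT")
    return lanes


def lanes_supporting(lanes, op):
    """Count how many lanes of a VLIW unit can execute `op`."""
    return sum(op in lane for lane in lanes)


older = vliw_unit(shift_on_all_lanes=False)  # 2900/3800 behavior per the text
rv770 = vliw_unit(shift_on_all_lanes=True)

print(lanes_supporting(older, "SHIFT"))  # 1
print(lanes_supporting(rv770, "SHIFT"))  # 5
print(lanes_supporting(rv770, "FMAD"))   # 5
```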