The Vector Unit And The New Instruction Set
But as you can well imagine, the Pentium cores aren’t what gives Larrabee its processing power. To compete with GPUs on their home turf, you need a lot more than an FPU or even SSE. So Intel equipped each core with a vector unit that operates on 16 elements simultaneously (compared to four for SSE or the Cell’s SPUs). These units can operate on integers, single-precision floating-point numbers, and double-precision floating-point numbers. Throughput is halved in double precision, but that penalty is still smaller than on current GPUs: AMD’s are between two and four times slower when moving from single to double precision, and Nvidia’s are practically 10 times slower.
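The halving in double precision follows directly from the lane arithmetic. The short sketch below works it through; it assumes the 16 elements are 32 bits wide, which implies a 512-bit register (the register width itself is not stated above), and the names `lanes` and `slowdown` are purely illustrative.

```python
# Assumption: the 16 elements mentioned in the text are 32-bit values,
# so one vector register is 16 * 32 = 512 bits wide.
LANES_32BIT = 16
REGISTER_BITS = LANES_32BIT * 32


def lanes(element_bits):
    """Number of elements of the given width that fit in one register."""
    return REGISTER_BITS // element_bits


def slowdown(narrow_bits=32, wide_bits=64):
    """Throughput ratio when moving from narrow to wide elements,
    assuming one vector instruction is issued per cycle in both cases."""
    return lanes(narrow_bits) / lanes(wide_bits)


if __name__ == "__main__":
    print("single-precision lanes:", lanes(32))
    print("double-precision lanes:", lanes(64))
    print("slowdown factor:", slowdown())
```

Under these assumptions the unit processes eight doubles per instruction instead of 16 floats, a factor of two, set against the factors of two to four (AMD) and roughly 10 (Nvidia) quoted above.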
Rather than extending the SSE instruction set (again) to support the new vector unit, the Intel engineers created a new one, called Larrabee New Instructions (LRBni). For the moment, Intel is rather vague about which instructions are supported, but we should learn more at the upcoming Game Developers Conference (GDC). Intel plans several press conferences at the trade show, during which Michael Abrash of RAD Game Tools and Intel’s Tom Forsyth are expected to share details about the instruction set.
We do already know several things, however. The instruction set supports up to three operands, which makes multiply-and-add (MAD) instructions possible and allows non-destructive operations, unlike SSE, where the result overwrites one of the source registers. And unlike the VMX instruction set found in the Cell’s PowerPC Processing Element (PPE), for example, which operates only on registers, LRBni lets one of the operands be read directly from the L1 cache, so the cache can serve as an extended register file. The unit is also very flexible: it can reorganize the data in a register or perform various conversions involving the “exotic” formats frequently found in GPUs with no loss of performance or, in the worst case, only a slight one. These conversions can be performed as the data is loaded from cache memory, so the data can be kept in memory in a compact form, which maximizes the amount of data the cache can hold.
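To make the difference concrete, here is a minimal scalar model of these ideas in Python. It is only a sketch, not LRBni itself: the function names are hypothetical, the 8-bit normalized format stands in for one of the “exotic” GPU formats (the text does not say which formats are supported), and how the real instructions encode their destination is not modeled.

```python
LANES = 16  # vector width given in the article


def mul_destructive(dst, src):
    """Two-operand style, as in SSE: dst = dst * src, overwriting dst."""
    for i in range(LANES):
        dst[i] *= src[i]


def mad(a, b, c):
    """Three-operand multiply-and-add: a * b + c per lane.
    All three inputs are left untouched (non-destructive)."""
    assert len(a) == len(b) == len(c) == LANES
    return [x * y + z for x, y, z in zip(a, b, c)]


def load_unorm8(memory, offset):
    """Load 16 packed bytes and widen each one to a float in [0, 1]
    as part of the load. The unorm8 format is an assumed example."""
    return [memory[offset + i] / 255.0 for i in range(LANES)]


if __name__ == "__main__":
    packed = bytes(range(0, 256, 17))   # 16 bytes: 0, 17, ..., 255
    colors = load_unorm8(packed, 0)     # converted while loading
    result = mad(colors, [2.0] * LANES, [1.0] * LANES)
    print(result[0], result[-1])
```

In this model the 16 values occupy 16 bytes in memory rather than the 64 bytes they would need as 32-bit floats, which is the kind of cache saving the conversion-on-load mechanism is meant to deliver.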
Another interesting feature of the unit is its ability to execute scatter/gather operations, which are typically problematic in a GPU. SIMD units are generally very restrictive when it comes to memory access: a vector is read from memory at a single address, which often has to meet particular alignment constraints. Larrabee is much more flexible. The 16 elements of a vector can be loaded from, or stored to, 16 different addresses contained in another vector. Obviously, totally incoherent memory accesses will hurt cache efficiency, and in the worst case this type of operation will take up to 16 cycles (a maximum of one cache line is read per cycle).
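The behavior described here can be modeled in a few lines of Python. This is a rough sketch under stated assumptions: memory is a flat list of 4-byte elements, the cache line size is taken to be 64 bytes (the article does not give it), and the cycle estimate simply counts distinct cache lines, following the one-line-per-cycle limit mentioned above. All function names are illustrative.

```python
LANES = 16
CACHE_LINE_BYTES = 64  # assumption: line size is not stated in the text
ELEMENT_BYTES = 4      # assumption: 32-bit elements


def gather(memory, indices):
    """Load lane i from memory[indices[i]]."""
    assert len(indices) == LANES
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Store lane i of values to memory[indices[i]]."""
    assert len(indices) == len(values) == LANES
    for i, v in zip(indices, values):
        memory[i] = v


def gather_cycles(indices):
    """Estimate cycles as the number of distinct cache lines touched,
    since at most one cache line is read per cycle."""
    lines = {(i * ELEMENT_BYTES) // CACHE_LINE_BYTES for i in indices}
    return len(lines)


if __name__ == "__main__":
    memory = list(range(1024))
    contiguous = list(range(LANES))           # all in one cache line
    strided = [i * 16 for i in range(LANES)]  # one cache line per lane
    print(gather(memory, strided))
    print(gather_cycles(contiguous), gather_cycles(strided))
```

With these assumptions, 16 consecutive elements fall in a single cache line and cost one cycle, while a stride of 16 elements puts every lane in a different line and reaches the 16-cycle worst case.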