In Detail: The Scalar Unit And SMT
Now, let’s look at the cores in detail. As we said, they’re based on the Pentium’s design, though Intel has made some significant modifications. The legacy of the P54C is undeniable in the scalar unit, which uses the Pentium’s dual-issue superscalar design with two execution pipelines, U and V.
The first can execute all scalar x86 instructions, while the second is limited to a fairly complete subset (excluding, for example, complex arithmetic and logical instructions like multiplication and division). As for the modifications to the Pentium core, the engineers first added 64-bit support, along with several instructions for controlling the level two cache. These instructions are especially important for streaming-type applications, which don’t follow the principle of temporal locality found in traditional applications. That is, once an operation has been performed on a piece of data, that data is certain not to be used again within a short period of time.
This behavior tends to prove disastrous with the LRU (least recently used) replacement algorithm that cache memories use, which ends up discarding important data to make room for data that will be used only once. Aware of this problem, Larrabee’s engineers added instructions for marking cache lines as low priority, indicating that the data in them can be replaced as soon as it has been accessed. In this way, Intel has combined the best of both worlds: scratchpad-type (buffer memory) operation and the transparency of a standard cache memory, with a mechanism for coherence among the caches of the different cores.
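To make the cache pollution problem concrete, here is a minimal Python sketch of a toy LRU cache with an optional low-priority hint. It is only a model: the class `LRUCache`, the `low_priority` flag, the function `hot_hits`, and the sizes used are all assumptions made for illustration, and nothing here reflects Larrabee’s actual instructions or replacement hardware.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache; capacity is counted in lines."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # first entry is the next victim

    def access(self, addr, low_priority=False):
        """Return True on a hit. low_priority models an eviction hint."""
        hit = addr in self.lines
        if hit:
            del self.lines[addr]
        elif len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the next victim
        self.lines[addr] = True  # (re)insert as most recently used
        if low_priority:
            # Hinted lines move to the victim end, so they are replaced first.
            self.lines.move_to_end(addr, last=False)
        return hit


def hot_hits(use_hint):
    """Count hits on a small reused working set mixed with streamed data."""
    cache = LRUCache(capacity=8)
    hot = list(range(6))      # hypothetical working set that is reused
    hits = 0
    stream_addr = 1000        # hypothetical streamed addresses, used once
    for _ in range(50):
        for addr in hot:
            hits += cache.access(addr)
        for _ in range(8):
            cache.access(stream_addr, low_priority=use_hint)
            stream_addr += 1
    return hits


if __name__ == "__main__":
    print("hot-set hits without hint:", hot_hits(False))
    print("hot-set hits with hint:   ", hot_hits(True))
```

With these toy parameters, the unhinted run never hits on the working set, because each burst of streamed data pushes it out of the cache. The hinted run hits on every working-set access after the first pass, since the streamed lines only ever displace each other.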
Another change was the addition of Simultaneous Multithreading (SMT). This technology has just made a comeback in Intel architectures with the Core i7, and it is built into the Larrabee processors, where the in-order nature of the cores makes it even more important. Modern CPUs can reorder instructions to maximize use of their execution units, which the Larrabee cores can’t do. Consequently, certain sequences of code can make very little use of resources, but by interleaving several threads, it’s possible to increase that use at a lower cost. If instruction two of thread A is blocked waiting on instruction one, the core simply switches threads and executes an instruction from thread B.
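The effect of interleaving threads on an in-order core can be illustrated with a small simulation. This is a rough sketch under stated assumptions: a single-issue core, a round-robin choice among ready threads, and made-up stall values. The function `run` and the sample `program` are hypothetical, and Larrabee’s real thread selection policy is not described here.

```python
def run(threads):
    """Simulate a single-issue in-order core with round-robin threads.

    threads: list of lists; each entry is the number of cycles the thread
    must wait after issuing that instruction before its next one is ready.
    Returns (total_cycles, idle_cycles).
    """
    n = len(threads)
    pc = [0] * n          # index of the next instruction for each thread
    ready_at = [0] * n    # first cycle at which each thread may issue again
    cycle = idle = 0
    last = -1             # thread that issued most recently
    while any(pc[t] < len(threads[t]) for t in range(n)):
        issued = False
        # Try the other threads first, starting after the last issuer.
        for step in range(1, n + 1):
            t = (last + step) % n
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + threads[t][pc[t]]
                pc[t] += 1
                last = t
                issued = True
                break
        if not issued:
            idle += 1     # every unfinished thread is stalled this cycle
        cycle += 1
    return cycle, idle


if __name__ == "__main__":
    program = [3, 0, 3, 0]  # assumed stalls after each of four instructions
    print("1 thread: ", run([program]))
    print("4 threads:", run([program] * 4))
```

In this model, one thread needs 10 cycles for its 4 instructions and leaves the core idle for 6 of them. Four copies of the same thread finish 16 instructions in 16 cycles with no idle cycles, because another thread is always ready to issue while the others wait.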
The engineers have enabled execution of four threads per core, each with its own set of registers. Using four threads also helps hide the latency of accesses to the level one cache. So that sharing them among several threads doesn’t diminish the efficiency of the L1 instruction and data caches, their size was increased from 8 KB each on the Pentium to 32 KB each on the Larrabee cores.