Unlike the changes made in moving from Core to Core 2, Intel hasn’t done much to Nehalem’s front end. It has the same four decoders that made their appearance with the Conroe—three simple and one complex. It is still capable of macro-ops fusion and so offers a theoretical maximum throughput of 4+1 x86 instructions per cycle.
Though there are no revolutionary changes at first glance, the devil is in the details. As we noted in our article on the Barcelona architecture, increasing the number of processing units is an extremely inefficient way to boost performance. The cost is high and the gains shrink more and more with each addition, following the law of diminishing returns. So instead of adding a new decoder, the engineers concentrated on making the existing ones more efficient.
First, they added support for macro-ops fusion in 64-bit mode, which is justified for an architecture like Nehalem that makes no attempt to hide its ambitions in the server market segment. But the engineers didn’t stop there. Where the Conroe architecture could fuse only a limited number of instructions, the Nehalem architecture supports a greater number of variations, making it possible to use macro-ops fusion more frequently.
Another new feature introduced by the Conroe has also been improved: the Loop Stream Detector. Behind this name lies what is in fact a data buffer that holds a few instructions (18 x86 instructions on Core 2s). When the processor detects a loop, it disables certain parts of the pipeline. Since a loop consists of executing the same instructions a given number of times, it’s not necessary to perform branch prediction or to recover the instruction from the L1 cache at each iteration of the loop. So the Loop Stream Detector acts as a small cache memory that short-circuits the first stages of the pipeline in such situations. The gains made via this technique are twofold: it decreases power consumption by avoiding useless tasks and it improves performance by reducing the pressure on the L1 instruction cache.
With the Nehalem architecture, Intel has improved the functionality of the Loop Stream Detector. First of all the buffer is larger—it can now store 28 instructions. But what’s more, its position in the pipeline has changed. In Conroe, it was located just after the instruction fetch phase. It’s now located after the decoders; this new position allows a larger part of the pipeline to be disabled. The Nehalem’s Loop Stream Detector no longer stores x86 instructions, but rather µops. In this sense, it’s similar to the Pentium 4’s trace cache concept. It’s no surprise to find certain innovations ushered in by that architecture in the Nehalem, given that the Hillsboro team now in charge of the Nehalem was responsible for the Pentium 4 project. However, where the Pentium 4 used the trace cache exclusively, since it could only count on one decoder in case of a data cache miss, Nehalem has the benefit of the power of its four decoders, while the Loop Stream Detector is only an additional optimization for certain situations. In a way, it’s the best of both worlds.