Intel Core i7 (Nehalem): Architecture By AMD?

Reading And Decoding Instructions

Unlike the changes made in moving from Core to Core 2, Intel hasn’t done much to Nehalem’s front end. It has the same four decoders that made their appearance with the Conroe—three simple and one complex. It is still capable of macro-ops fusion and so offers a theoretical maximum throughput of 4+1 x86 instructions per cycle.

Though there are no revolutionary changes at first glance, the devil is in the details. As we noted in our article on the Barcelona architecture, increasing the number of processing units is an extremely inefficient way to boost performance. The cost is high and the gains shrink more and more with each addition, following the law of diminishing returns. So instead of adding a new decoder, the engineers concentrated on making the existing ones more efficient.

First, they added support for macro-ops fusion in 64-bit mode, which is justified for an architecture like Nehalem that makes no attempt to hide its ambitions in the server market segment. But the engineers didn’t stop there. Where the Conroe architecture could fuse only a limited number of instructions, the Nehalem architecture supports a greater number of variations, making it possible to use macro-ops fusion more frequently.

Another new feature introduced by the Conroe has also been improved: the Loop Stream Detector. Behind this name lies what is in fact a data buffer that holds a few instructions (18 x86 instructions on Core 2s). When the processor detects a loop, it disables certain parts of the pipeline. Since a loop consists of executing the same instructions a given number of times, it’s not necessary to perform branch prediction or to recover the instruction from the L1 cache at each iteration of the loop. So the Loop Stream Detector acts as a small cache memory that short-circuits the first stages of the pipeline in such situations. The gains made via this technique are twofold: it decreases power consumption by avoiding useless tasks and it improves performance by reducing the pressure on the L1 instruction cache.

With the Nehalem architecture, Intel has improved the functionality of the Loop Stream Detector. First of all the buffer is larger—it can now store 28 instructions. But what’s more, its position in the pipeline has changed. In Conroe, it was located just after the instruction fetch phase. It’s now located after the decoders; this new position allows a larger part of the pipeline to be disabled. The Nehalem’s Loop Stream Detector no longer stores x86 instructions, but rather µops. In this sense, it’s similar to the Pentium 4’s trace cache concept. It’s no surprise to find certain innovations ushered in by that architecture in the Nehalem, given that the Hillsboro team now in charge of the Nehalem was responsible for the Pentium 4 project. However, where the Pentium 4 used the trace cache exclusively, since it could only count on one decoder in case of a data cache miss, Nehalem has the benefit of the power of its four decoders, while the Loop Stream Detector is only an additional optimization for certain situations. In a way, it’s the best of both worlds.

  • cl_spdhax1
    good write-up, cant wait for the new architecture , plus the "older" chips are going to become cheaper/affordable. big plus.
    Reply
  • neiroatopelcc
    No explaination as to why you can't use performance modules with higher voltage though.
    Reply
  • neiroatopelcc
    AuDioFreaK39TomsHardware is just now getting around to posting this?Not to mention it being almost a direct copy/paste from other articles I've seen written about Nehalem architecture.I regard being late as a quality seal really. No point being first, if your info is only as credible as stuff on inquirer. Better be last, but be sure what you write is correct.
    Reply
  • cangelini
    AuDioFreaK39TomsHardware is just now getting around to posting this?Not to mention it being almost a direct copy/paste from other articles I've seen written about Nehalem architecture.
    Perhaps, if you count being translated from French.
    Reply
  • randomizer
    Yea, 13 pages is quite alot to translate. You could always use google translation if you want it done fast :kaola:
    Reply
  • Duncan NZ
    Speaking of french... That link on page 3 goes to a French article that I found fascinating... Would be even better if there was an English version though, cause then I could actually read it. Any chance of that?

    Nice article, good depth, well written
    Reply
  • neiroatopelcc
    randomizerYea, 13 pages is quite alot to translate. You could always use google translation if you want it done fast I don't know french, so no idea if it actually works. But I've tried from english to germany and danish, and viseversa. Also tried from danish to german, and the result is always the same - it's incomplete, and anything that is slighty technical in nature won't be translated properly. In short - want it done right, do it yourself.
    Reply
  • neiroatopelcc
    I don't think cangelini meant to say, that no other english articles on the subject exist.
    You claimed the article on toms was a copy paste from another article. He merely stated that the article here was based on a french version.
    Reply
  • enewmen
    Good article.
    I actually read the whole thing.
    I just don't get TLP when RAM is cheap and the Nehalem/Vista can address 128gigs. Anyway, things have changed a lot since running Win NT with 16megs RAM and constant memory swapping.
    Reply
  • cangelini
    I can't speak for the author, but I imagine neiro's guess is fairly accurate. Written in French, translated to English, and then edited--I'm fairly confident they're different stories ;)
    Reply