Intel Core i7 (Nehalem): Architecture By AMD?

Memory Access And Prefetcher

Optimized Unaligned Memory Access

With the Core architecture, memory access was subject to several performance restrictions. The processor was optimized for accesses to memory addresses aligned on 64-byte boundaries (the size of one cache line). Not only was access to unaligned data slow, but executing an unaligned load or store instruction was more costly than its aligned counterpart, regardless of the actual alignment of the data in memory. That's because these instructions generated several µops for the decoders to handle, which reduced the throughput of this type of instruction. As a result, compilers avoided generating these instructions, substituting sequences of instructions that were less costly.

Thus, memory reads that overlapped two cache lines took a performance hit of approximately 12 cycles, compared to 10 cycles for writes. The Intel engineers have optimized these accesses to make them faster. First of all, there's no longer a performance penalty for using the unaligned versions of load/store instructions when the data are in fact aligned in memory. In other cases, Intel has optimized these accesses to reduce the penalty relative to the Core architecture.

More Prefetchers Running More Efficiently

With the Conroe architecture, Intel was especially proud of its hardware prefetchers. A prefetcher is a mechanism that observes memory access patterns and tries to anticipate, several cycles in advance, which data will be needed next. The point is to bring that data into the cache, where the processor can access it more quickly, while making the most of memory bandwidth by using it when the processor doesn't otherwise need it.

This technique produced remarkable results with most desktop applications, but in the server world the result was often a loss of performance. There are several reasons for that inefficiency. First of all, memory accesses are often much harder to predict with server applications. Database accesses, for example, aren't linear: when an item of data is accessed in memory, the adjacent data won't necessarily be needed next, which limits the prefetcher's effectiveness. But the main problem was memory bandwidth in multi-socket configurations. As we said earlier, there was already a bottleneck between processors, and the prefetchers added further pressure at this level. When a microprocessor wasn't accessing memory, the prefetchers kicked in to use bandwidth they assumed was available; they had no way of knowing that the other processor might need that bandwidth at that precise moment. That meant the prefetchers could deprive a processor of bandwidth that was already at a premium in this kind of configuration. To solve the problem, Intel had no better solution to offer than disabling the prefetchers in these situations, which was hardly a satisfactory answer.

Intel says the problem is solved now, but provides no details on the operation of the new prefetch algorithms; all it says is that it won't be necessary to disable them for server configurations. But even if Intel hasn't changed anything, the new memory organization and the resulting wider bandwidth should limit any negative impact from the prefetchers.

  • cl_spdhax1
good write-up, can't wait for the new architecture, plus the "older" chips are going to become cheaper/affordable. big plus.
  • neiroatopelcc
No explanation as to why you can't use performance modules with higher voltage though.
  • neiroatopelcc
"AuDioFreaK39: TomsHardware is just now getting around to posting this? Not to mention it being almost a direct copy/paste from other articles I've seen written about Nehalem architecture."
I regard being late as a quality seal, really. No point being first if your info is only as credible as stuff on the Inquirer. Better to be last, but be sure what you write is correct.
  • cangelini
"AuDioFreaK39: TomsHardware is just now getting around to posting this? Not to mention it being almost a direct copy/paste from other articles I've seen written about Nehalem architecture."
    Perhaps, if you count being translated from French.
  • randomizer
Yea, 13 pages is quite a lot to translate. You could always use Google translation if you want it done fast :kaola:
  • Duncan NZ
    Speaking of french... That link on page 3 goes to a French article that I found fascinating... Would be even better if there was an English version though, cause then I could actually read it. Any chance of that?

    Nice article, good depth, well written
  • neiroatopelcc
"randomizer: Yea, 13 pages is quite a lot to translate. You could always use google translation if you want it done fast"
I don't know French, so no idea if it actually works. But I've tried from English to German and Danish, and vice versa. Also tried from Danish to German, and the result is always the same: it's incomplete, and anything slightly technical in nature won't be translated properly. In short: want it done right, do it yourself.
  • neiroatopelcc
I don't think cangelini meant to say that no other English articles on the subject exist.
You claimed the article on Tom's was a copy/paste from another article; he merely stated that the article here was based on a French version.
  • enewmen
    Good article.
    I actually read the whole thing.
I just don't get TLP when RAM is cheap and Nehalem/Vista can address 128 GB. Anyway, things have changed a lot since running Win NT with 16 MB of RAM and constant memory swapping.
  • cangelini
    I can't speak for the author, but I imagine neiro's guess is fairly accurate. Written in French, translated to English, and then edited--I'm fairly confident they're different stories ;)