
Memory Access And Prefetcher

Intel Core i7 (Nehalem): Architecture By AMD?
Optimized Unaligned Memory Access

With the Core architecture, memory access was subject to several performance restrictions. The processor was optimized for access to memory addresses aligned on 64-byte boundaries, the size of one cache line. Not only was access to unaligned data slow, but executing an unaligned load or store instruction cost more than executing its aligned counterpart, regardless of how the data was actually aligned in memory. That’s because the unaligned instructions generated several µops for the decoders to handle, which reduced their throughput. As a result, compilers avoided generating them and substituted less costly instruction sequences instead.

Memory reads that straddled two cache lines took a performance hit of approximately 12 cycles, and writes approximately 10. With Nehalem, Intel’s engineers have made these accesses faster. First of all, there’s no longer a performance penalty for using the unaligned versions of load/store instructions when the data are aligned in memory. In the other cases, Intel has reduced the performance hit compared to the Core architecture.
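To make the distinction between "unaligned" and "overlapping two cache lines" concrete, here is a minimal sketch in Python. It assumes the 64-byte cache line mentioned above; the function names and the sample addresses are hypothetical and chosen only for illustration.

```python
CACHE_LINE_BYTES = 64  # cache-line size given in the article


def is_aligned(addr: int, size: int) -> bool:
    """True if addr is a multiple of the access size (natural alignment)."""
    return addr % size == 0


def crosses_cache_line(addr: int, size: int, line_bytes: int = CACHE_LINE_BYTES) -> bool:
    """True if the bytes [addr, addr + size) fall in two different cache lines."""
    first_line = addr // line_bytes
    last_line = (addr + size - 1) // line_bytes
    return first_line != last_line


# Three hypothetical 16-byte loads (addresses chosen only for illustration).
for addr in (0x1000, 0x1004, 0x1038):
    print(hex(addr), is_aligned(addr, 16), crosses_cache_line(addr, 16))
```

The first load is aligned, the second is unaligned but stays inside one cache line, and only the third actually spans two lines. That third case is the one the cycle penalties above apply to.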

More Prefetchers Running More Efficiently

With the Conroe architecture, Intel was especially proud of its hardware prefetchers. As you know, a prefetcher is a mechanism that observes memory access patterns and tries to anticipate which data will be needed several cycles in advance. The point is to bring the data into the cache, where it will be more readily accessible to the processor, while making the most of the available bandwidth by fetching at times when the processor doesn’t otherwise need it.
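The general idea of "observing access patterns" can be sketched with a toy stride detector. This is not Intel's algorithm, only a simplified model: the class and function names are hypothetical, and the rule used (predict the next address once the same stride has been seen twice in a row) is an assumption made for illustration.

```python
class StridePrefetcher:
    """Toy model: predicts the next address once a stride repeats."""

    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def observe(self, addr):
        """Record one access; return the predicted next address, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Only predict when the same non-zero stride was seen twice in a row.
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prediction


def count_prefetch_hits(addresses):
    """Count accesses that the model had predicted one step earlier."""
    prefetcher = StridePrefetcher()
    predicted = None
    hits = 0
    for addr in addresses:
        if addr == predicted:
            hits += 1
        predicted = prefetcher.observe(addr)
    return hits


linear = [0, 64, 128, 192, 256, 320]      # walks memory one cache line at a time
scattered = [0, 4096, 192, 8000, 64, 1024]  # no repeating stride
print(count_prefetch_hits(linear), count_prefetch_hits(scattered))
```

With the linear sequence the model needs three accesses to lock onto the stride and then predicts every later one; with the scattered sequence it never makes a correct prediction.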

This technique produced remarkable results with most desktop applications, but in the server world the result was often a loss of performance. There are several reasons for that inefficiency. First of all, memory accesses are often much harder to predict with server applications. Database accesses, for example, aren’t linear: when an item of data is accessed in memory, the adjacent data won’t necessarily be called on next. That limits the prefetcher’s effectiveness. But the main problem was with memory bandwidth in multi-socket configurations. As we said earlier, there was already a bottleneck between processors, and the prefetchers put additional pressure on it. When a processor wasn’t accessing memory, its prefetchers kicked in to use bandwidth they assumed was available. They had no way of knowing at that precise point that the other processor might need the bandwidth. That meant the prefetchers could deprive a processor of bandwidth that was already at a premium in this kind of configuration. To solve the problem, Intel had no better solution to offer than to disable the prefetchers in these situations, which is hardly a satisfactory answer.
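The cost of mispredicted prefetches can be illustrated with another toy model. The sketch below assumes a simple next-line prefetcher (after each access to cache line n it fetches line n+1) and counts a prefetch as useful only if the very next access lands on that line. Both the policy and the function name are assumptions for illustration, not a description of Conroe's hardware.

```python
def next_line_prefetch_traffic(addresses, line_bytes=64):
    """Return (useful, wasted) prefetch counts for a toy next-line prefetcher."""
    lines = [addr // line_bytes for addr in addresses]
    useful = 0
    wasted = 0
    # Compare each accessed line with the one accessed right after it.
    for current, following in zip(lines, lines[1:]):
        if following == current + 1:
            useful += 1   # the prefetched line was the next one needed
        else:
            wasted += 1   # the transfer consumed bandwidth for nothing
    return useful, wasted


linear_scan = list(range(0, 64 * 8, 64))                    # 8 consecutive lines
database_like = [0, 4096, 192, 8000, 64, 1024, 320, 5000]   # scattered lookups
print(next_line_prefetch_traffic(linear_scan))
print(next_line_prefetch_traffic(database_like))
```

In this model every wasted prefetch is a memory transfer that still occupies the shared link, which is the kind of traffic that hurts when another socket is waiting for the same bandwidth.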

Intel says the problem is solved now, but provides no details on the operation of the new prefetch algorithms; all it says is that it won’t be necessary to disable them for server configurations. But even if Intel hasn’t changed anything, the gains stemming from the new memory organization and the resulting wider bandwidth should limit any negative impact of the prefetchers.

  • 3 Hide
    cl_spdhax1 , October 14, 2008 7:15 AM
    good write-up, cant wait for the new architecture , plus the "older" chips are going to become cheaper/affordable. big plus.
  • 0 Hide
    neiroatopelcc , October 14, 2008 7:55 AM
    No explaination as to why you can't use performance modules with higher voltage though.
  • 4 Hide
    neiroatopelcc , October 14, 2008 8:08 AM
    AuDioFreaK39 wrote: TomsHardware is just now getting around to posting this? Not to mention it being almost a direct copy/paste from other articles I've seen written about Nehalem architecture.

    I regard being late as a quality seal really. No point being first, if your info is only as credible as stuff on inquirer. Better be last, but be sure what you write is correct.
  • 6 Hide
    cangelini , October 14, 2008 8:08 AM
    AuDioFreaK39 wrote: TomsHardware is just now getting around to posting this? Not to mention it being almost a direct copy/paste from other articles I've seen written about Nehalem architecture.

    Perhaps, if you count being translated from French.
  • 0 Hide
    randomizer , October 14, 2008 8:19 AM
    Yea, 13 pages is quite alot to translate. You could always use google translation if you want it done fast :kaola: 
  • 0 Hide
    Duncan NZ , October 14, 2008 8:21 AM
    Speaking of french... That link on page 3 goes to a French article that I found fascinating... Would be even better if there was an English version though, cause then I could actually read it. Any chance of that?

    Nice article, good depth, well written
  • -1 Hide
    neiroatopelcc , October 14, 2008 8:21 AM
    randomizer wrote: Yea, 13 pages is quite alot to translate. You could always use google translation if you want it done fast

    I don't know french, so no idea if it actually works. But I've tried from english to germany and danish, and viseversa. Also tried from danish to german, and the result is always the same - it's incomplete, and anything that is slighty technical in nature won't be translated properly. In short - want it done right, do it yourself.
  • 0 Hide
    neiroatopelcc , October 14, 2008 8:28 AM
    I don't think cangelini meant to say, that no other english articles on the subject exist.
    You claimed the article on toms was a copy paste from another article. He merely stated that the article here was based on a french version.
  • 0 Hide
    enewmen , October 14, 2008 8:41 AM
    Good article.
    I actually read the whole thing.
    I just don't get TLP when RAM is cheap and the Nehalem/Vista can address 128gigs. Anyway, things have changed a lot since running Win NT with 16megs RAM and constant memory swapping.
  • 5 Hide
    cangelini , October 14, 2008 9:17 AM
    I can't speak for the author, but I imagine neiro's guess is fairly accurate. Written in French, translated to English, and then edited--I'm fairly confident they're different stories ;) 
  • 2 Hide
    neiroatopelcc , October 14, 2008 10:17 AM
    Questions to the author (or anyone else who's understood what I have not)
    1) How's the loop detection feature know when it is a loop ? The diagrams posted don't show any connection between it and the 'front' of the pipeline, so how can it know that the next operation is the same if it hasn't yet entered the loop?
    2) On page 8 there's a diagram with a 4 socket setup showing 2 io hubs. Are they connected to the same pcie bus and whatever else they interface with? or are only 2 of the sockets able to directly access a given resource?
    3) With the modular design, would one risk buying a cpu that doesn't work in a motherboard because it is intended for a 2 or 4 socket system? or are they all the same, simply with some qpi's disabled?
    4) Am I right assuming that qpi replaces fsb when it has to communicate with an i/o hub only? (as shown in one of the top diagrams on page 8) Or is it used for every one of the 'blue' lines on the lower diagram (10 total in a 4 socket layout). The latter would mean 4 qpi's are barely enough to satisfy bandwidth needs in a server enviroment. I imagine an esx server with 4 processors (32 threads) can easily demand memory from dram pools not linked to the local core the threads are running on, and use 96GB/s (3x32) of the 102GB/s (4x12,8x2) total theoretical bandwidth in addition to some of the local 32GB/s bandwidth from the socket a given core/thread is running on. So if this scenario is correct, is it possible to increase the speed of the qpi (read: oc the link) to increase available bandwidth? And what happends if one would successfully find ddr3-1600 modules that would run within the 1,65v limitation? Wouldn't that mean the qpi was already at its limit? (38,4GB/s per dram pool x 3 sockets not local to the core that runs a thread). I know memory isn't truely the bottleneck in modern computers, but I still find it wierd that they put so much effort into the memory controller if it isn't actually the problem. Simply adding a few qpi links between the sockets and the chipset would've solved the bandwidth issue without limiting usable memory types by choosing a certain cpu. Sure it wouldn't have improved latencies, but honestly, who cares? neither in a gaming pc, netbook or any number of common server configurations is it the memory lantecy that is the bottleneck.
    5) How much time should one assume is wasted when a core on conroe flushes the l2 cache? they seem to have solved the issue and as consequence increased cache latency (which should turn into slower overall cache performance). In english : can we expect any gain from this change?
    6) Would the immensely increased tlb size improve performance in newer games which precache loads of data? (thinking quicker retrieval of texture data etc)
    7) Page 12 mentions unalligned memory access, which I've never heard of before. Appearently compilers already try to avoid this situation, so can we expect the improvement to handling such to be of interest? What's the point of improving a feature to handle a situation that hardly ever arises in the first place?
  • 1 Hide
    Reynod , October 14, 2008 10:22 AM
    Is the loop stream detector to stop the sorts of problems that were occuring with the long pipes in the P4 with REPLAY ... causing major IPC problems despite the high clock frequency?

    http://www.xbitlabs.com/articles/cpu/display/replay.html

    This was a very good article and is not a copy ... well done Fedy !!
  • 1 Hide
    V3NOM , October 14, 2008 10:41 AM
    lol i barely understood a word of that..
  • 0 Hide
    apache_lives , October 14, 2008 11:25 AM
    neiroatopelcc wrote: No explaination as to why you can't use performance modules with higher voltage though.

    perhaps it will burn out the IMC within the chip since its all done at 45nm, 1.6+v would be deadly, imagine air cooling a 3ghz quad core chip at ~2v? i take it it shares the rail even within the cpu so
  • -1 Hide
    neiroatopelcc , October 14, 2008 11:37 AM
    Perhaps so, but why does the supply voltage to the dram sockets have to pass thru the cpu then? (if that's what it does) Why can't that be supplied by the motherboard and only data sent from the dram be sent to the cpu? or better yet, pass thru a piece of hardware that stabilizes the signals at a lower voltage level, so the cpu doesn't fry even if one was to attempt booting with memory running at 2v or the like.
  • 0 Hide
    apache_lives , October 14, 2008 12:36 PM
    neiroatopelcc wrote: Perhaps so, but why does the supply voltage to the dram sockets have to pass thru the cpu then? (if that's what it does) Why can't that be supplied by the motherboard and only data sent from the dram be sent to the cpu? or better yet, pass thru a piece of hardware that stabilizes the signals at a lower voltage level, so the cpu doesn't fry even if one was to attempt booting with memory running at 2v or the like.

    depends on how connected that ram is, there might be advantages etc this way, and it also makes you wonder if AMD suffers from this - iv heard of extreme overclockers killing ram channels on AMD's etc

    on the other hand who cares about high performance memory - 3 x 1333mhz is going to be better then 2 x 1600+mhz channels etc, along with the fact its an IMC based setup etc and average maximum bandwidths of ~32gb/s vs the current average maximum of ~12.8gb/s etc
  • 0 Hide
    Shadow703793 , October 14, 2008 12:40 PM
    One interesting thing is L2 cache is quite low. Any one care to explain exactly why?
  • 0 Hide
    neiroatopelcc , October 14, 2008 12:43 PM
    L2 is low because it no longer needs to be bigger. It isn't shared between cores. It'll just full up from l3 cache later (which is plenty big)

    as for the memory issue. Who'd want to run 3x1333 if they could run 6x1600 ? any enthusiast will only be satisfied with the best, and 1333 just isn't it. Not even 1600 is. ddr3-1333 is basicly obsolete, and it's not even mainstream yet. It's a disaster really.
  • 1 Hide
    Scarchunk , October 14, 2008 1:18 PM
    Considering the title of this article, there's not much about the direct comparisons between nehalem and barcelona. It's basically just a break down of nehalem with AMD hardly mentioned at all. I expected more from the title.
  • 1 Hide
    Reynod , October 14, 2008 1:30 PM
    If the supply rail for the RAM is tied to the CPU then it is possibly to ensure there is minimal ripple or othere observable intereference which might limit the speed the ram is operated at ... not so much each indididual stick, but as it is triple channel, I imagine the cpu would be very intolerant of impedence mismatch ... causing signal processing degredation at the interconnect point.

    A case of perhaps minimising reflected impedence?

    Just my theory anyway ... remember ... I am only here for the humour ... not the technology.

    AMD4LIFE