Memory Access and Prefetchers
Optimized Unaligned Memory Access
With the Core architecture, memory access was subject to several performance restrictions. The processor was optimized for accesses to memory addresses aligned on 64-byte boundaries, the size of one cache line. Not only was access to unaligned data slow, but executing an unaligned load or store instruction was more costly than its aligned counterpart, regardless of the actual alignment of the data in memory. That's because these instructions generated several µops for the decoders to handle, which reduced their throughput. As a result, compilers avoided generating them, substituting sequences of instructions that were less costly.
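To see why this mattered to compilers, consider the SSE load intrinsics: `_mm_load_ps` maps to the aligned MOVAPS instruction, while `_mm_loadu_ps` maps to the unaligned MOVUPS. A minimal C sketch (the function names and loop are our own illustration, not from Intel):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sum n floats (n a multiple of 4). On the Core architecture the
 * unaligned variant paid a decode penalty even when p happened to be
 * 16-byte aligned, so compilers preferred the aligned intrinsic
 * whenever alignment was provable. */
float sum_aligned(const float *p, int n)   /* p must be 16-byte aligned */
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(p + i));   /* MOVAPS */
    float out[4];
    _mm_storeu_ps(out, acc);
    return out[0] + out[1] + out[2] + out[3];
}

float sum_unaligned(const float *p, int n) /* p may have any alignment */
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));  /* MOVUPS */
    float out[4];
    _mm_storeu_ps(out, acc);
    return out[0] + out[1] + out[2] + out[3];
}
```

On Core, a compiler that couldn't prove alignment had to choose between the slower unaligned form and emitting extra alignment-checking code around the aligned one.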
There was a second penalty for accesses that crossed cache-line boundaries: memory reads overlapping two cache lines cost approximately 12 cycles, compared to 10 for writes. The Intel engineers have optimized these accesses to make them faster. First of all, there's no longer any performance penalty for using the unaligned versions of the load/store instructions when the data happen to be aligned in memory. In the other cases, the accesses have been optimized to take a smaller performance hit than on the Core architecture.
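The split case is easy to picture: any 16-byte load that begins in the last few bytes of a 64-byte line necessarily touches two lines. A small C sketch (the buffer layout and offsets are our own, purely illustrative):

```c
#include <stdlib.h>
#include <xmmintrin.h>

int main(void)
{
    /* 64-byte-aligned buffer: its start coincides with a cache-line start. */
    float *buf = aligned_alloc(64, 128 * sizeof(float));
    if (!buf) return 1;
    for (int i = 0; i < 128; i++) buf[i] = (float)i;

    /* A 16-byte load starting at byte offset 60 covers bytes 60..75, so it
     * straddles the boundary between the first and second 64-byte lines:
     * the ~12-cycle split penalty on Core, reduced on Nehalem. */
    __m128 split = _mm_loadu_ps((const float *)((char *)buf + 60));

    /* The same load at offset 0 stays within a single line. */
    __m128 whole = _mm_loadu_ps(buf);

    (void)split; (void)whole;
    free(buf);
    return 0;
}
```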
More Prefetchers Running More Efficiently
With the Conroe architecture, Intel was especially proud of its hardware prefetchers. As you know, a prefetcher is a mechanism that observes memory access patterns and tries to anticipate which data will be needed several cycles in advance. The point is to bring that data into the cache, where the processor can reach it quickly, while making the most of the memory bandwidth by issuing the requests when the processor doesn't otherwise need it.
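The hardware prefetchers are transparent to software, but the same idea can be expressed manually with the compiler's prefetch builtin. A rough C sketch, assuming GCC/Clang's `__builtin_prefetch`; the lookahead distance of 16 elements is an arbitrary illustration, not an Intel figure:

```c
/* Software analogue of what a hardware prefetcher does for a linear
 * access pattern: request data a fixed distance ahead of where the
 * computation currently is, so it arrives in cache before it's needed. */
double sum_with_prefetch(const double *data, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], 0, 3); /* read, high locality */
        sum += data[i];
    }
    return sum;
}
```

As with the hardware version, this only pays off when the access pattern really is predictable; for pointer-chasing or index-driven lookups, the next address simply isn't known in advance.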
This technique produced remarkable results with most desktop applications, but in the server world it often cost performance instead. There are several reasons for that inefficiency. First of all, memory accesses are much harder to predict in server applications. Database accesses, for example, aren't linear: when an item of data is accessed in memory, the adjacent data won't necessarily be needed next. That limits the prefetchers' effectiveness.

But the main problem was memory bandwidth in multi-socket configurations. As we said earlier, there was already a bottleneck between processors, and the prefetchers added still more pressure at that point. When a processor wasn't accessing memory, its prefetchers kicked in to use bandwidth they assumed was available, with no way of knowing that the other processor might need it at that precise moment. The prefetchers could thus deprive a processor of bandwidth that was already at a premium in this kind of configuration. To solve the problem, Intel had no better solution to offer than to disable the prefetchers in these situations, hardly a satisfactory answer.
Intel says the problem is solved now, but provides no details on how the new prefetch algorithms operate; all it says is that it's no longer necessary to disable them in server configurations. But even if Intel hadn't changed anything, the gains from the new memory organization and the resulting wider bandwidth should limit any negative impact of the prefetchers.