TLB
For many years now, processors have worked not with physical memory addresses but with virtual addresses. Among other advantages, this approach lets more memory be allocated to a program than the machine physically has, keeping only the data needed at a given moment in physical memory while the rest stays on the hard disk. It means that every memory access requires translating a virtual address into a physical address, and a huge table, the page table, keeps track of the correspondences. The problem is that this table is far too large to be stored on-chip: it lives in main memory, and can even itself be paged (parts of it can be absent from memory and kept on the hard disk).
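To make the mechanism concrete, here is a minimal sketch in C of how such a translation works. It assumes a hypothetical flat, single-level page table with 4 KB pages; real x86 paging uses a multi-level structure, but the principle is the same:

```c
#include <stdint.h>

#define PAGE_SHIFT 12                      /* 4 KB pages */
#define PAGE_SIZE  (1ull << PAGE_SHIFT)

/* Hypothetical flat page table: one physical frame number per virtual page.
 * Real x86 paging uses a multi-level tree, but the principle is the same. */
static uint64_t page_table[1 << 20];       /* enough entries for a 4 GB space */

uint64_t translate(uint64_t vaddr)
{
    uint64_t vpn    = vaddr >> PAGE_SHIFT;       /* virtual page number */
    uint64_t offset = vaddr & (PAGE_SIZE - 1);   /* offset within the page */
    uint64_t frame  = page_table[vpn];           /* the slow part: a memory access */
    return (frame << PAGE_SHIFT) | offset;       /* physical address */
}
```

Note that the lookup in page_table is itself a memory access, which is exactly the overhead described next.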
If this translation had to consult the full table on every memory access, it would make accesses far too slow. So engineers once again applied the caching principle, adding a small memory directly on the processor that stores the translations of a few recently accessed addresses. This cache is called a Translation Lookaside Buffer (TLB).
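As a sketch of the principle (not Nehalem's actual organization), here is a tiny fully associative TLB in C; the entry count, the tag format, and the round-robin replacement are all illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 16   /* illustrative: the size of Core 2's tiny L1 load TLB */

struct tlb_entry {
    uint64_t vpn;        /* virtual page number, used as the tag */
    uint64_t frame;      /* cached physical frame number */
    bool     valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];
static unsigned next_victim;              /* trivial round-robin replacement */

/* Returns true on a hit; on a miss the caller walks the page table in
 * main memory and then calls tlb_fill() with the result. */
bool tlb_lookup(uint64_t vpn, uint64_t *frame)
{
    for (unsigned i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *frame = tlb[i].frame;        /* hit: no page-table access needed */
            return true;
        }
    }
    return false;                         /* miss: fall back to the page walk */
}

void tlb_fill(uint64_t vpn, uint64_t frame)
{
    tlb[next_victim] = (struct tlb_entry){ .vpn = vpn, .frame = frame, .valid = true };
    next_victim = (next_victim + 1) % TLB_ENTRIES;
}
```

On a hit the page walk is skipped entirely; only on a miss does the processor go back to the table in main memory. What matters, then, is how often the recently used pages fit in those few entries.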
Intel has completely revamped the operation of the TLBs in the new architecture. Until now, the Core 2 used a level 1 TLB that is extremely small (16 entries) but very fast and handles loads only, plus a larger level 2 TLB (256 entries) that handles loads that miss the level 1 TLB, as well as stores.

Nehalem now has a true two-level TLB hierarchy. The first level is split between data and instructions: the level 1 data TLB stores 64 entries for small pages (4K) or 32 for large pages (2M/4M), while the level 1 instruction TLB stores 128 entries for small pages (the same as Core 2) and seven for large pages. The second level is a unified cache that stores up to 512 entries and works only with small pages. The goal of this improvement is to increase the performance of applications that work on large data sets. As with the introduction of the two-level branch predictor, this is further evidence of the architecture's server orientation.
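A quick back-of-the-envelope calculation shows what those entry counts buy in terms of TLB "reach", that is, the amount of memory addressable without a miss:

```c
#include <stdio.h>

int main(void)
{
    /* Reach = number of entries x page size */
    printf("L1 DTLB, 4 KB pages: 64  x 4 KB = %d KB\n", 64 * 4);   /* 256 KB */
    printf("L1 DTLB, 2 MB pages: 32  x 2 MB = %d MB\n", 32 * 2);   /* 64 MB */
    printf("L2 TLB,  4 KB pages: 512 x 4 KB = %d KB\n", 512 * 4);  /* 2 MB */
    return 0;
}
```

With small pages the whole hierarchy covers only about 2 MB, whereas the 32 large-page entries alone cover 64 MB, which is why large pages matter so much for applications with big working sets.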
Let's go back to SMT for a moment, since it also has an impact on the TLBs. The level 1 data TLB and the level 2 TLB are dynamically shared between the two threads. Conversely, the level 1 instruction TLB is statically partitioned for small pages, while the one dedicated to large pages is fully replicated, which is understandable given its small size (seven entries per thread).