SMT Implementation
Still, SMT improves performance most of the time, and its cost in terms of resources remains very limited, which explains why the technology is making a comeback. But programmers will have to pay attention, because with Nehalem, not all threads are created equal: two logical processors on the same physical core share that core’s execution resources. To help solve this puzzle, Intel provides a way of determining the exact topology of the processor (the number of physical and logical processors), and programmers can then use the operating system’s affinity mechanism to assign each thread to a processor. This kind of thing shouldn’t be a problem for game programmers, who are already in the habit of working that way because of the way the Xenon processor (the one used in the Xbox 360) works. But unlike consoles, where programmers have very low-level access, on a PC the operating system’s thread scheduler will always have the last word.
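To make the topology-plus-affinity idea concrete, here is a minimal sketch in Python. It does not query the processor directly. Instead it assumes a Linux system, where the kernel exposes each logical processor's SMT siblings in a sysfs file named thread_siblings_list, and it relies on Python's os.sched_getaffinity and os.sched_setaffinity, which are only available on some platforms. The helper names (parse_cpu_list, one_cpu_per_core, read_sibling_lists, pin_to_first_core) are made up for this illustration.

```python
import os


def parse_cpu_list(text):
    """Parse a Linux CPU list such as "0,4" or "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists):
    """Keep the lowest-numbered logical CPU of every physical core.

    Each element of sibling_lists is the sibling list reported for one
    logical CPU; logical CPUs on the same core report the same list.
    """
    cores = {tuple(parse_cpu_list(s)) for s in sibling_lists}
    cores.discard(())  # ignore empty or unreadable entries
    return sorted(core[0] for core in cores)


def read_sibling_lists():
    """Assumes Linux: read the sibling list of every CPU we are allowed to run on."""
    lists = []
    for cpu in sorted(os.sched_getaffinity(0)):
        path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
        with open(path) as f:
            lists.append(f.read())
    return lists


def pin_to_first_core():
    """Request that the calling process run on one logical CPU of the first core found."""
    allowed = os.sched_getaffinity(0)
    candidates = [c for c in one_cpu_per_core(read_sibling_lists()) if c in allowed]
    if candidates:
        os.sched_setaffinity(0, {candidates[0]})
    return candidates
```

On a hypothetical four-core, eight-thread part whose siblings are numbered (0,4), (1,5), (2,6), (3,7), one_cpu_per_core would select logical processors 0 through 3, so four worker threads could each be given a physical core to themselves. The affinity mask only restricts where a thread may run; when it actually runs is still decided by the operating system's scheduler.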
Since SMT puts a heavier load on the out-of-order execution engine, Intel has increased the size of certain internal buffers to avoid turning them into bottlenecks. The reorder buffer, which tracks all in-flight instructions so that they can be retired in program order, has grown from 96 entries on the Core 2 to 128 entries on Nehalem. In practice, since this buffer is partitioned statically to keep any one thread from monopolizing all the resources, each thread gets 64 entries when SMT is active. Obviously, when only a single thread is executing, it has access to all the entries, which should mean that there won’t be any specific situations where Nehalem turns out to have worse performance than its predecessor.
The reservation station, which is the unit in charge of assigning instructions to the different execution units, has also increased in size: from 32 to 36 entries. But unlike the reorder buffer, here partitioning is dynamic, so that a thread can take up more or fewer entries as a function of its needs.
Two other buffers have also been resized: the load buffer and the store buffer. The former has 48 entries as opposed to 32 with Conroe, and the latter 32 instead of 20. Here too, partitioning between threads is static.
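The buffer figures above can be collected into a small model. The sketch below is illustrative only: the entry counts come from the text, but the Buffer class and the guaranteed_entries function are invented for this example, and the equal split for the load and store buffers is an assumption extrapolated from the 64-entry-per-thread figure given for the reorder buffer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buffer:
    name: str
    core2_entries: int
    nehalem_entries: int
    static: bool  # True: fixed per-thread split; False: shared dynamically


BUFFERS = [
    Buffer("reorder buffer", 96, 128, True),
    Buffer("reservation station", 32, 36, False),
    Buffer("load buffer", 32, 48, True),
    Buffer("store buffer", 20, 32, True),
]


def guaranteed_entries(buf, active_threads):
    """Entries one thread can count on, or None when the share is not fixed.

    Assumption: statically partitioned buffers are split equally between
    the active threads. Dynamically partitioned buffers have no fixed
    per-thread share, so None is returned for them under SMT.
    """
    if active_threads == 1:
        return buf.nehalem_entries  # a lone thread sees the whole buffer
    if buf.static:
        return buf.nehalem_entries // active_threads
    return None


if __name__ == "__main__":
    for buf in BUFFERS:
        print(buf.name, guaranteed_entries(buf, 1), guaranteed_entries(buf, 2))
```

Under these assumptions, a thread running alone sees 128, 36, 48 and 32 entries, while with two active threads each is guaranteed 64 reorder buffer entries and, if the split is indeed equal, 24 load buffer and 16 store buffer entries. The reservation station has no fixed per-thread figure because its 36 entries are shared on demand.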
Another consequence of the return of SMT is that the performance of thread synchronization instructions has improved, according to Intel.