Entering The Execution Pipeline - Pentium 4's Trace Cache, Continued
The example below shows the actual code in the upper box and the actual content of the trace cache in the lower box. Unused code is not stored in the trace cache.
You may remember from my description of 'µOPs' above that some x86 instructions are rather complex. In that case the decoder needs the processor's microcode ROM to produce a sometimes very long chain of µOPs. The trace cache does not get filled with all of those µOPs. As a placeholder it only contains some kind of flag, which signals that the micro instruction sequencer is supposed to supply the µOPs to the next pipeline stage. It is not known how many µOPs per clock the micro instruction sequencer is able to deliver, but it would not be surprising if it were fewer than the 3 µOPs per clock that the trace cache can send to the next pipeline stage. This can obviously have an important performance impact on the Pentium 4, which has been tuned for simple instructions but seems to suffer from complex ones, as you will also see further down.
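To get a feel for why the delivery rate of the micro instruction sequencer matters, here is a minimal back-of-the-envelope sketch. It is a toy model, not a description of the real hardware: the entry format, the function name `delivery_clocks` and the parameter `sequencer_uops_per_clock` are made up for illustration, the sequencer rate is deliberately left as a parameter because it is not known, and the model assumes that the trace cache and the sequencer never deliver in the same clock.

```python
import math

# Figure quoted in the text: the trace cache can send 3 µOPs per clock.
TRACE_CACHE_UOPS_PER_CLOCK = 3


def delivery_clocks(trace, sequencer_uops_per_clock):
    """Estimate the clocks needed to deliver a trace to the next stage.

    `trace` is a list of (kind, count) entries (a hypothetical format):
      ("uop", n)       -> n ordinary µOPs stored directly in the trace cache
      ("microcode", n) -> a placeholder flag standing for n µOPs that the
                          micro instruction sequencer has to supply

    Assumption: trace cache and sequencer never overlap, so a placeholder
    always starts a new clock.
    """
    clocks = 0
    pending = 0  # ordinary µOPs waiting to be sent by the trace cache
    for kind, count in trace:
        if kind == "uop":
            pending += count
        elif kind == "microcode":
            # Drain the ordinary µOPs first, then let the sequencer run.
            clocks += math.ceil(pending / TRACE_CACHE_UOPS_PER_CLOCK)
            pending = 0
            clocks += math.ceil(count / sequencer_uops_per_clock)
        else:
            raise ValueError(f"unknown trace entry kind: {kind!r}")
    clocks += math.ceil(pending / TRACE_CACHE_UOPS_PER_CLOCK)
    return clocks


if __name__ == "__main__":
    simple = [("uop", 6)]
    mixed = [("uop", 6), ("microcode", 12)]
    print(delivery_clocks(simple, sequencer_uops_per_clock=1))
    print(delivery_clocks(mixed, sequencer_uops_per_clock=1))
    print(delivery_clocks(mixed, sequencer_uops_per_clock=3))
```

In this model, a complex instruction that expands to 12 µOPs costs four clocks if the sequencer could match the trace cache, but twelve clocks if it only managed one µOP per clock. Both rates are assumptions chosen to show the spread, not measured values.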
As briefly mentioned above, the trace cache can also be of significant benefit when a branch is mispredicted, because the alternative code may already be in the trace cache. To check whether certain code already resides there, the trace cache has a rather complex structure of tags, indices and cache lines.
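The details of the real tag and index structure are not spelled out here, so the following is only a generic sketch of how a tag/index lookup works in any cache, applied to the trace cache idea. The class name `ToyTraceCache`, the number of sets and the way the address is split are all assumptions for illustration, and the model ignores capacity limits and eviction.

```python
class ToyTraceCache:
    """Generic tag/index lookup, NOT the Pentium 4's actual organisation."""

    def __init__(self, num_sets=4):
        self.num_sets = num_sets
        # One dictionary per set, mapping tag -> cache line (list of µOPs).
        self.sets = [dict() for _ in range(num_sets)]

    def _split(self, address):
        # Low part of the address selects the set (index),
        # the remaining high part identifies the line within it (tag).
        return address % self.num_sets, address // self.num_sets

    def fill(self, address, uops):
        """Store the decoded µOPs for the code starting at `address`."""
        index, tag = self._split(address)
        self.sets[index][tag] = list(uops)

    def lookup(self, address):
        """Return the stored µOPs on a hit, or None on a miss."""
        index, tag = self._split(address)
        return self.sets[index].get(tag)


if __name__ == "__main__":
    cache = ToyTraceCache()
    cache.fill(0x40, ["load", "add", "store"])  # made-up µOP names

    # After a mispredicted branch, the front end asks for the other target.
    for target in (0x40, 0x44):
        line = cache.lookup(target)
        print(hex(target), "hit" if line is not None else "miss")
```

A hit means the alternative path can be fed to the pipeline straight away. A miss means the code has to be fetched and decoded again before it can enter the trace cache.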