x86 instructions must be decoded somewhere. Which sounds better to you: A) Decoding instructions on the fly for every instruction, even those recently executed already and sitting in the L1 cache, or B) Decoding all instructions ahead of time and placing the decoded instructions in the L1 cache, allowing the decoding stage to be entirely skipped for anything in the L1 cache? I would prefer the latter. Having a cache of pre-decoded instructions speeds things up.
Yes, but only if you have very limited instruction issue and execution (which I agree goes perfectly with the P4).
But when you have a 3-way instruction decoder that can decode 3 complex instructions and issue 9 micro-ops, which is perfectly sufficient for 3 ALUs, 3 AGUs, FMUL, FADD and FSTORE, all of them out-of-order and fed from a 72-uop queue, there is very little need to cache decoded instructions...
The L2 cache is still a limiting factor in case of a branch misprediction, and the 12K-uop trace cache can also run empty, since the execution resources of the P4 outweigh its instruction issue.
While the trace cache will be helpful for the P4, it's obvious that the overall execution design is inferior to other, wider superscalar microprocessors.
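For anyone following along, here is a rough sketch of the trade-off being argued: decoding on every fetch versus reusing already-decoded uops. It is a toy model, not how any real CPU works. The function name `count_decodes`, the 8-instruction loop and the unlimited cache size are all assumptions made up for illustration.

```python
def count_decodes(trace, use_decoded_cache):
    """Count how often the decoder has to run over a trace of instruction addresses.

    Toy model: the decoded cache is assumed to be unbounded, and one
    address stands for one x86 instruction.
    """
    decoded = set()   # addresses whose uops are already cached
    decodes = 0
    for addr in trace:
        if use_decoded_cache and addr in decoded:
            continue  # hit: cached uops are reused, decode stage is skipped
        decodes += 1  # miss (or no cache): the decoder has to run
        if use_decoded_cache:
            decoded.add(addr)
    return decodes


# Hypothetical workload: an 8-instruction loop body executed 100 times.
loop = list(range(8)) * 100

print(count_decodes(loop, use_decoded_cache=False))  # 800
print(count_decodes(loop, use_decoded_cache=True))   # 8
```

The cache clearly saves decode work in a tight loop. Whether that matters in practice depends on whether the decoder was the bottleneck in the first place, which is exactly the point of disagreement here.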
In essence, the Out-of-Order Execution Logic keeps its buffers full by loading new instructions from the Trace Cache at all times. Even when there is no way to execute instructions in a parallel fashion at the moment, it is still loading new instructions from the Trace Cache to replenish its buffers.
Out-of-order execution with instruction aligning and reordering has already been introduced by <i> another x86 CPU Manufacturer. </i> And while buffers and out-of-order execution do help the P4 come close to its theoretical limit of 3 uops per cycle (on average), it cannot possibly execute more than 3 instructions per cycle on average.
Meanwhile, <i> another x86 CPU </i> also has out-of-order and reordering execution mechanisms (for the ALUs, AGUs and floating-point units) which, as you said, help uops execute as quickly and efficiently as possible in its wide, superscalar, fully out-of-order execution unit.
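The throughput ceiling argument can be sketched with a small queue model: no matter how many execution units sit behind the buffers, the average rate cannot exceed what the front end issues per cycle. The function `simulate` and the widths below are illustrative assumptions, not measured figures for any real chip.

```python
def simulate(issue_width, exec_width, queue_size, cycles):
    """Average uops executed per cycle in a simple issue-queue model.

    Toy assumptions: every queued uop is always ready to execute and
    there are no stalls, cache misses or mispredictions.
    """
    queued = 0
    executed = 0
    for _ in range(cycles):
        # Front end: issue up to issue_width uops, limited by free queue slots.
        queued += min(issue_width, queue_size - queued)
        # Back end: execute up to exec_width uops from the queue.
        done = min(exec_width, queued)
        queued -= done
        executed += done
    return executed / cycles


# Narrow front end feeding a wider back end: capped by the issue width.
print(simulate(issue_width=3, exec_width=6, queue_size=72, cycles=1000))  # 3.0
# Front end and back end equally wide: the cap moves up with the issue width.
print(simulate(issue_width=9, exec_width=9, queue_size=72, cycles=1000))  # 9.0
```

In this idealised model the sustained rate is simply the smaller of the two widths. Real code never reaches that peak, but the ceiling itself does not move.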
This post is best viewed with common sense enabled