Raystonn can u please explain this

xcom_cheetah

Distinguished
Oct 24, 2001
71
0
18,630
I was just reading this P4 review on VR-Zone:
http://www.vr-zone.com/reviews/Intel/P42400/page5.htm
They used a benchmark called Calibrator 0.9e, and in it the TLB miss is 50 cycles for the P4 but only 5 cycles for the Athlon XP. Is this right? If so, why is it so huge, and where are the TLBs located? Are they not on-chip, or are they on the motherboard's northbridge? Can you shed any more light on how much performance this huge delay is costing?
 

Raystonn

Distinguished
Apr 12, 2001
2,273
0
19,780
Due to superb branch prediction algorithms, the Pentium 4 was designed with a trace cache instead of a classic L1 instruction cache. The trace cache stores instructions after they have already been decoded into RISC-style instructions, whereas a classic L1 instruction cache stores standard x86 instructions. We call these RISC-style instructions "µops" (pronounced micro-ops).

The trace cache does not perform any TLB checks because this cache uses virtual addressing. When dealing only with this L1 cache (the trace cache), there is no TLB. A TLB is accessed only when the L2 cache must be accessed. Thus, attempting to measure the latency of an "L1 TLB" is a misnomer for the Pentium 4. It is not clear what this application actually is measuring, but it certainly is not an L1 TLB. The Pentium 4 does not have one and it does not need one. This application is likely measuring the L2 TLB, which would of course be much slower than anything associated with the L1 cache.
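
For what it's worth, tools like this calibrator typically estimate TLB-miss cost with a dependent pointer chase that touches one word in each of many pages, so every load needs a fresh address translation. I do not have the calibrator's source, so the following is only a minimal C sketch of that general technique; the page size, page count and iteration count are my own assumptions:

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096        /* x86 page size */
#define NUM_PAGES 1024        /* far more pages than a TLB holds entries for */
#define ITERATIONS 1000000

int main(void)
{
    char *buf = malloc((size_t)NUM_PAGES * PAGE_SIZE);
    char **slot;
    int i;

    if (buf == NULL)
        return 1;

    /* Build a dependent chain: the slot in page i points to the slot in page i+1. */
    for (i = 0; i < NUM_PAGES; i++) {
        char **cur  = (char **)(buf + (size_t)i * PAGE_SIZE);
        char **next = (char **)(buf + (size_t)((i + 1) % NUM_PAGES) * PAGE_SIZE);
        *cur = (char *)next;
    }

    /* Walk the chain.  Every load lands in a different page, so each access
       needs a new translation; time this loop and divide by ITERATIONS. */
    slot = (char **)buf;
    for (i = 0; i < ITERATIONS; i++)
        slot = (char **)*slot;

    printf("%p\n", (void *)slot);   /* keep the compiler from removing the loop */
    free(buf);
    return 0;
}

Once the page count exceeds the number of TLB entries, the per-access time is dominated by the miss penalty, which is presumably what the review's 50-cycle number is meant to represent.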

-Raystonn


= The views stated herein are my personal views, and not necessarily the views of my employer. =
 

IIB

Distinguished
Dec 2, 2001
417
0
18,780
Due to superb branch prediction algorithms
The P4's branch prediction is the best of its kind (and then some), but the algorithm has little to do with the cache structure.

the Pentium 4 was designed with a trace cache instead of a classic L1 instruction cache
Errr... wrong here, Ray - the P4 does have an L1 cache (after all, where else would the decoder be fed from? not the L2); it's 8 KB in size.
The trace cache is another cache - it sits between the decoder and the queue that feeds the execution units. It's built that way so that, in case of a branch misprediction, decoded instructions can be rapidly recovered from it, saving the extra cycles that complex decoding takes.
The trace cache is able to hold 12 µops and issue 4 µops per cycle.

whereas a classic L1 instruction cache stores standard x86 instructions. We call these RISC-style instructions "µops" (pronounced micro-ops).
RISC instructions are not micro-ops.
RISC is a middle-level instruction set, and thus it too has to go through decoding (RISC processors do contain decoders).


About the TLB score - maybe the testing methodology is the thing to question...?

This post is best viewed with common sense enabled
 

Raystonn

Distinguished
Apr 12, 2001
2,273
0
19,780
Please read the article on the microarchitecture of the Pentium 4 processor at http://www.intel.com/technology/itj/q12001/articles/art_2.htm. Pay special attention to page 5, entitled "NetBurst Microarchitecture".

I will quote a bit from it:

"The Trace Cache is the primary or Level 1 (L1) instruction cache of the Pentium 4 processor and delivers up to three uops per clock to the out-of-order execution logic. Most instructions in a program are fetched and executed from the Trace Cache. Only when there is a Trace Cache miss does the NetBurst microarchitecture fetch and decode instructions from the Level 2 (L2) cache. This occurs about as often as previous processors miss their L1 instruction cache. The Trace Cache has a capacity to hold up to 12K uops. It has a similar hit rate to an 8K to 16K byte conventional instruction cache."


As far as the analogy to RISC-style instructions goes, this is mostly correct. Standard x86 instructions are decoded into µops. These µops are very similar to RISC instructions.

At any rate, perhaps this application is attempting to measure access to a non-existent instruction TLB. Is it attempting to measure TLB access using instructions or data? The L1 data cache does have a TLB, but the L1 instruction cache (the trace cache) does not...

-Raystonn


= The views stated herein are my personal views, and not necessarily the views of my employer. =
 

IIB

Distinguished
Dec 2, 2001
417
0
18,780
"The Trace Cache is the primary or Level 1 (L1) instruction"

You're right. There is no Level 1 (L1) instruction cache - only an L1 DATA cache (8 KB).

Seems fairly idiotic to me - why would you want to feed your decoder x86 instructions from a higher-latency L2 cache?

Not only that, but the trace cache is limited to issuing 3 µops (not 4 as I thought) per cycle - what's the point of having 4 ALUs, 2 of them double-pumped, if you are limited to only 3 micro-ops a cycle? I understand that some calculations take more than 1 clock cycle (due to addressing dependencies and such), but this still seems rather wrong.




This post is best viewed with common sense enabled
 

Raystonn

Distinguished
Apr 12, 2001
2,273
0
19,780
Seems fairly idiotic to me - why would you want to feed your decoder x86 instructions from a higher-latency L2 cache?
x86 instructions must be decoded somewhere. Which sounds better to you: A) Decoding instructions on the fly for every instruction, even those recently executed already and sitting in the L1 cache, or B) Decoding all instructions ahead of time and placing the decoded instructions in the L1 cache, allowing the decoding stage to be entirely skipped for anything in the L1 cache? I would prefer the latter. Having a cache of pre-decoded instructions speeds things up.
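
As a software analogy only (this is not how the hardware is wired, just an illustration of the idea), pre-decoding is like memoizing an expensive translation step: decode an x86 instruction once, keep the result keyed by its address, and skip the decoder on every later visit. A rough C sketch with made-up types and sizes:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define TRACE_SLOTS 256                  /* toy capacity, not the real 12K uops */

/* Hypothetical "decoded" form of one x86 instruction (stand-in for uops).
   For example, "add eax, [ebx]" would break into a load uop plus an add uop. */
typedef struct { uint32_t uop[4]; int count; } decoded_t;

typedef struct { uint32_t addr; decoded_t uops; int valid; } trace_entry_t;

static trace_entry_t trace[TRACE_SLOTS];
static int decodes_done;                 /* how many times the slow decoder ran */

/* The expensive step: variable-length x86 decode (simulated here). */
static decoded_t decode_x86(uint32_t addr)
{
    decoded_t d;
    memset(&d, 0, sizeof d);
    d.uop[0] = addr ^ 1u;                /* pretend-uops derived from the address */
    d.uop[1] = addr ^ 2u;
    d.count = 2;
    decodes_done++;
    return d;
}

/* Fetch path: a hit in the cache of already-decoded instructions skips the
   decoder entirely; a miss decodes once and keeps the result for next time. */
static decoded_t fetch(uint32_t addr)
{
    trace_entry_t *e = &trace[addr % TRACE_SLOTS];
    if (e->valid && e->addr == addr)
        return e->uops;                  /* hit: no decode needed */
    e->addr = addr;
    e->uops = decode_x86(addr);          /* miss: pay the decode cost once */
    e->valid = 1;
    return e->uops;
}

int main(void)
{
    fetch(0x1000);                       /* first visit decodes */
    fetch(0x1000);                       /* second visit is a pure cache hit */
    printf("decoder invoked %d time(s) for two fetches\n", decodes_done);
    return 0;
}

Two fetches of the same address invoke the decoder only once; that saved decode work is the whole point of caching µops instead of raw x86 bytes.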


Not only that, but the trace cache is limited to issuing 3 µops (not 4 as I thought) per cycle - what's the point of having 4 ALUs, 2 of them double-pumped, if you are limited to only 3 micro-ops a cycle? I understand that some calculations take more than 1 clock cycle (due to addressing dependencies and such), but this still seems rather wrong.
The Out-of-Order Execution Logic has several buffers that it uses to smooth and reorder the flow of instructions to optimize performance as they go down the pipeline and get scheduled for execution. Instructions are aggressively reordered to allow them to execute as quickly as their input operands are ready. This out-of-order execution allows instructions in the program following delayed instructions to proceed around them as long as they do not depend on those delayed instructions. Out-of-order execution allows the execution resources such as the ALUs and the cache to be kept as busy as possible executing independent instructions that are ready to execute.

In essence, the Out-of-Order Execution Logic keeps its buffers full by loading new instructions from the Trace Cache at all times. Even when there is no way to execute instructions in a parallel fashion at the moment, it is still loading new instructions from the Trace Cache to replenish its buffers.
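
To illustrate the scheduling idea in software (a toy model only, with none of the real hardware's structures or issue-width limits), here is a small C sketch in which each instruction issues as soon as its inputs are ready, so independent work flows around a slow load:

#include <stdio.h>

#define N 5

/* Toy instruction: depends on up to two earlier results, takes 'latency' cycles. */
typedef struct { int dep1, dep2, latency; } insn_t;

int main(void)
{
    /* 0: slow load (simulated cache miss), 1: add that needs the load,
       2-4: independent work that can execute around the stalled add. */
    insn_t prog[N] = {
        { -1, -1, 10 },   /* 0: load r0, [mem]    (slow)        */
        {  0, -1,  1 },   /* 1: add  r1, r0, 4    (waits on 0)  */
        { -1, -1,  1 },   /* 2: add  r2, r9, 8    (independent) */
        { -1, -1,  1 },   /* 3: sub  r3, r8, 1    (independent) */
        {  2,  3,  1 },   /* 4: mul  r4, r2, r3   (waits on 2,3)*/
    };
    int done_at[N];          /* cycle at which each result becomes available */
    int issued[N] = {0};
    int cycle, remaining = N;

    for (cycle = 0; remaining > 0; cycle++) {
        int i;
        for (i = 0; i < N; i++) {
            int ready;
            if (issued[i])
                continue;
            /* Issue as soon as both inputs are ready, regardless of program order. */
            ready = (prog[i].dep1 < 0 ||
                     (issued[prog[i].dep1] && done_at[prog[i].dep1] <= cycle)) &&
                    (prog[i].dep2 < 0 ||
                     (issued[prog[i].dep2] && done_at[prog[i].dep2] <= cycle));
            if (ready) {
                done_at[i] = cycle + prog[i].latency;
                issued[i] = 1;
                remaining--;
                printf("cycle %2d: issue insn %d (result ready at cycle %d)\n",
                       cycle, i, done_at[i]);
            }
        }
    }
    return 0;
}

Running it shows instructions 2 through 4 finishing long before instruction 1, which must wait the full 10 cycles for the simulated cache miss.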

-Raystonn


= The views stated herein are my personal views, and not necessarily the views of my employer. =
 

The_MaguS

Distinguished
Mar 25, 2002
269
0
18,780
Raystonn just whupped some azz =)

<font color=blue> There's no such thing as hell, but you can make it if you try.</font color=blue>
 

IIB

Distinguished
Dec 2, 2001
417
0
18,780
x86 instructions must be decoded somewhere. Which sounds better to you: A) Decoding instructions on the fly for every instruction, even those recently executed already and sitting in the L1 cache, or B) Decoding all instructions ahead of time and placing the decoded instructions in the L1 cache, allowing the decoding stage to be entirely skipped for anything in the L1 cache? I would prefer the latter. Having a cache of pre-decoded instructions speeds things up.
Yes - but only if you have very limited instruction issuing and execution (which I agree fits the P4 perfectly).
But when you have a 3-way instruction decoder that can decode 3 complex instructions and issue 9 micro-ops, which is perfectly sufficient for 3 ALUs, 3 AGUs, FMUL, FADD and FSTORE, all out-of-order and fed from a 72-µop queue, there is very little need to cache instructions...

The L2 cache is still limiting in the case of a branch misprediction, and the 12-µop trace cache can also go empty since the execution resources of the P4 outweigh the instruction issuing.

While the trace cache will be helpful for the P4, it's obvious that the overall execution design is inferior to other, wider superscalar microprocessors.

In essence, the Out-of-Order Execution Logic keeps its buffers full by loading new instructions from the Trace Cache at all times. Even when there is no way to execute instructions in a parallel fashion at the moment, it is still loading new instructions from the Trace Cache to replenish its buffers.

Out-of-order execution, instruction aligning and reordering were already introduced by <i>another x86 CPU manufacturer</i>, and while buffers and out-of-order execution do help the P4 come close to its theoretical limit of 3 µops per cycle (on average), it cannot possibly execute more than 3 instructions per cycle on average.
Meanwhile, <i>another x86 CPU</i> also has out-of-order and reordering execution mechanisms (for the ALUs, AGUs and floating-point units) which, as you said, help µops execute as quickly and efficiently as possible in its wide, superscalar, fully out-of-order execution unit.




This post is best viewed with common sense enabled
 

xcom_cheetah

Distinguished
Oct 24, 2001
71
0
18,630
Raystonn, I have a big confusion here... Is every access a processor (or process) generates looked up in the TLB before it is fetched? Or does fetching through the L1 or L2 cache not need any MMU conversion, i.e. logical-to-physical address translation? I hope I'm making some sense.
Secondly, the P4 does have an 8 KB L1 data cache... Isn't the TLB needed more during data accesses?
 

Kennyshin

Distinguished
Nov 11, 2001
658
0
18,980
Good source of learning. Thank you. (I mean both your postings here and the Intel Technology Journal.)

Searching for the true, the beautiful, and the eternal
 

Raystonn

Distinguished
Apr 12, 2001
2,273
0
19,780
Is every access a processor (or process) generates looked up in the TLB before it is fetched?
No. Instructions and addresses of instructions for branches, etc., are not looked up in any TLB if they are found in the L1 cache. If they are not found in the L1 cache, they will be looked up in the TLB and fetched from the L2 cache or main memory. The instruction will then be stored in the L1 cache after being decoded, for future use.


Secondly, the P4 does have an 8 KB L1 data cache... Isn't the TLB needed more during data accesses?
TLB lookups are required for accessing 'data' in the L1 'data' cache. This does not include instructions and the addresses of instructions, which are stored in the trace cache mentioned above.
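
If it helps to picture the bookkeeping, here is a much-simplified C sketch of what a data-side TLB does on each access. The entry count, the direct-mapped lookup and the fake page-table walk are assumptions made purely for illustration, not the real hardware:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12                 /* 4 KB pages */
#define TLB_ENTRIES 64                /* toy size, direct-mapped for simplicity */

typedef struct { uint32_t vpn; uint32_t pfn; int valid; } tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Slow path: walk the page tables in memory (faked here with a fixed mapping). */
static uint32_t page_table_walk(uint32_t vpn)
{
    return vpn ^ 0x80000u;            /* pretend virtual-to-physical mapping */
}

/* Translate a virtual data address to a physical one.  A TLB hit costs almost
   nothing; a miss pays for the page-table walk, which is where the large
   cycle counts reported as "TLB miss" latency come from. */
static uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);
    tlb_entry_t *e  = &tlb[vpn % TLB_ENTRIES];

    if (!e->valid || e->vpn != vpn) { /* TLB miss */
        e->vpn   = vpn;
        e->pfn   = page_table_walk(vpn);
        e->valid = 1;
    }
    return (e->pfn << PAGE_SHIFT) | offset;
}

int main(void)
{
    printf("0x%08x -> 0x%08x\n", 0x12345678u, translate(0x12345678u));
    return 0;
}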

-Raystonn





= The views stated herein are my personal views, and not necessarily the views of my employer. =
 

Raystonn

Distinguished
Apr 12, 2001
2,273
0
19,780
But when you have a 3-way instruction decoder that can decode 3 complex instructions and issue 9 micro-ops, which is perfectly sufficient for 3 ALUs, 3 AGUs, FMUL, FADD and FSTORE, all out-of-order and fed from a 72-µop queue, there is very little need to cache instructions...
The caching of instructions is required to be able to reorder them for execution when parallelism can be exploited. While I do agree that a limit of 3 µops transferred to the buffer per cycle is not as high as it could be, it is usually sufficient. We will likely see this increased as one of the microarchitectural improvements of the Prescott.


The L2 cache is still limiting in the case of a branch misprediction
In the case of branch misprediction, the new instructions will be fetched from the trace cache, not the L2 cache, unless there is a cache miss as well.
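
You can get a feel for what a misprediction costs from software by timing the same data-dependent branch twice: once with a pattern the predictor handles easily and once with random data. The difference per iteration roughly reflects the flush-and-refill penalty. This is only a sketch (compile without aggressive optimization, or the compiler may turn the branch into a conditional move):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)
#define REPEAT 100

/* Sum the elements above a threshold; the inner 'if' is the branch under test. */
static double run(const unsigned char *data)
{
    double sum = 0.0;
    int r, i;
    for (r = 0; r < REPEAT; r++)
        for (i = 0; i < N; i++)
            if (data[i] >= 128)            /* easy or hard to predict, per input */
                sum += data[i];
    return sum;
}

int main(void)
{
    static unsigned char easy[N], hard[N];
    clock_t t0, t1, t2;
    double s1, s2;
    int i;

    for (i = 0; i < N; i++) {
        easy[i] = (i < N / 2) ? 0 : 255;            /* long taken/not-taken runs */
        hard[i] = (unsigned char)(rand() & 0xff);   /* ~50/50, effectively random */
    }

    t0 = clock();
    s1 = run(easy);
    t1 = clock();
    s2 = run(hard);
    t2 = clock();

    printf("predictable: %.2fs   random: %.2fs   (sums %.0f %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    return 0;
}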


the 12-µop trace cache can also go empty since the execution resources of the P4 outweigh the instruction issuing.
I do not understand you here. The trace cache will never be empty. It will always be full of cached instructions. In this regard it is very similar to how a standard L1 cache is used on other processors.


While the trace cache will be helpful for the P4, it's obvious that the overall execution design is inferior to other, wider superscalar microprocessors.
Actually it is far superior. I expect to see AMD using such designs in the future. There are definitely some areas that can be improved. We will see many of these improvements in the Prescott. However, the overall design is solid and is a definite improvement over that which is used in, say, the Pentium III. The performance issues being seen with the Pentium 4 are due to implementation, not design.

-Raystonn


= The views stated herein are my personal views, and not necessarily the views of my employer. =
 

eden

Champion
We will likely see this increased as one of the microarchitectural improvements of the Prescott.
More L1 maybe? I sure hope so. 8K is somewhat weak in our eyes, and personally I think 32K or more would definitely help the P4; nothing negative in doing so AFAIK.

Actually it is far superior. I expect to see AMD using such designs in the future. There are definitely some areas that can be improved. We will see many of these improvements in the Prescott. However, the overall design is solid and is a definite improvement over that which is used in, say, the Pentium III. The performance issues being seen with the Pentium 4 are due to implementation, not design.
I do agree the new design is a step forward, but it is barely noticeable because of the size and the many problems that plague the P4, such as the inability to refill the pipeline quickly after a branch misprediction... But it's definitely true that Prescott will showcase more of this interesting cache design. What if it were used for the L2 also?




--
For the first time, Hookers are hooked on Phonics!!
 

Raystonn

Distinguished
Apr 12, 2001
2,273
0
19,780
More L1 maybe? I sure hope so. 8K is somewhat weak in our eyes, and personally I think 32K or more would definitely help the P4; nothing negative in doing so AFAIK.
The Pentium 4 has an L1 data cache that will hold 8K bytes and an L1 trace cache that will hold 12K micro-ops. It is designed to be extremely fast instead of big. The bigger it is, the slower it will be.

-Raystonn


= The views stated herein are my personal views, and not necessarily the views of my employer. =
 

eden

Champion
Alright well I am still trying to get the hang of it.
So if I understand correctly, the L1 cache the P4 uses has already stored many decoded operations in reserve, including predicted ones, so that when the P4 needs them it fetches them in a split second compared to the L2. Am I at least on the surface of this? I can't go any deeper; I'm no programmer and really don't deal with fancy words that include ops and math in them. I'm hoping I got the surface of this at least!

--
For the first time, Hookers are hooked on Phonics!!
 

FatBurger

Illustrious
Just think of the P4's L1 as being 128MB of RDRAM instead of 256MB of SDRAM.

<font color=blue>If you don't buy Windows, then the terrorists have already won!</font color=blue> - Microsoft
 

Raystonn

Distinguished
Apr 12, 2001
2,273
0
19,780
Except in this analogy the RDRAM has far superior latency! :)

-Raystonn


= The views stated herein are my personal views, and not necessarily the views of my employer. =
 

FatBurger

Illustrious
Speaking of which, take a look in the memory forum; there's a thread about latency I started a couple of days ago. You might be interested.

<font color=blue>If you don't buy Windows, then the terrorists have already won!</font color=blue> - Microsoft
 

xcom_cheetah

Distinguished
Oct 24, 2001
71
0
18,630
Raystonn, I think I'm troubling you too much... anyway, one last thing. You said that data is fetched from the L1, and going by those Calibrator 0.9e results that means it will wait 50 cycles for each piece of data fetched from there, which doesn't sound right at all. I mean, suppose we have this assembly language instruction:
mov ax, [bx]
To execute this instruction, the processor will look in the L1 for the data, and for that a TLB lookup will be required to see where the data is stored... How long will this whole process take?
 

Raystonn

Distinguished
Apr 12, 2001
2,273
0
19,780
Access to the data in the L1 cache is extremely fast. That 50-cycle figure is wrong. The benchmark is broken for the Pentium 4 processor. I believe they are attempting to measure TLB access using instruction fetches, not data fetches. Thus they are measuring the L2 cache's TLB.
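
If anyone wants to check the real number, a dependent-load chase inside a buffer small enough to stay in the 8KB L1 data cache gives a rough load-to-use latency. This is only a sketch; the rdtsc inline assembly assumes gcc on x86, and the buffer size and stride are my own choices:

#include <stdio.h>
#include <stdint.h>

#define BUF_BYTES 4096                  /* well under the 8 KB L1 data cache */
#define SLOTS (BUF_BYTES / sizeof(void *))
#define ITERS 1000000

/* Read the time-stamp counter (gcc inline assembly, x86). */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    static void *chain[SLOTS];
    void **p = (void **)chain;
    uint64_t start, end;
    size_t i;

    /* Dependent chain inside one small buffer: every load's address comes from
       the previous load, so the loop time is dominated by L1 load-to-use latency. */
    for (i = 0; i < SLOTS; i++)
        chain[i] = &chain[(i + 8) % SLOTS];    /* stride of 8 pointers */

    start = rdtsc();
    for (i = 0; i < ITERS; i++)
        p = (void **)*p;
    end = rdtsc();

    printf("approx cycles per L1 load: %.1f  (final %p)\n",
           (double)(end - start) / ITERS, (void *)p);
    return 0;
}

The result includes a little loop overhead, but it should be nowhere near 50 cycles if the L1 is doing its job.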

-Raystonn


= The views stated herein are my personal views, and not necessarily the views of my employer. =
 

eden

Champion
How did you get it to stick to the top of the page??? At least it's a FAQ that newbies won't need to ask again, since it's right in front of them. Now if only we could do the same here for those who keep asking about Thoroughbreds (although it will have to be edited once it's out) and that damned "Why is my AXP running at a slower speed" question where the user ignores the manual and sets a 100MHz FSB!!!

--
For the first time, Hookers are hooked on Phonics!!
 

FatBurger

Illustrious
Fredi locked/stickied it, and the other FAQs are supposedly being worked on.

<font color=blue>If you don't buy Windows, then the terrorists have already won!</font color=blue> - Microsoft
 

eden

Champion
So you posted it and he agreed to stick it? Wow you guys really do have a lot of insider contacts!
And again, if only he would look at Meltdown and ban him... as well as Kennyshin for the racial comments.

--
For the first time, Hookers are hooked on Phonics!!