1cc=1ns? Also, pipeline question

July 25, 2002 4:02:09 AM

Bear with me as I try to remember my question and try to make it as understandable as possible!

I've been reading a lot of Ars Technica's CPU architecture articles recently, but I've become confused about what clock cycles refer to in time. Are they just a general term and not a fixed amount of time in the CPU world? What I mean is this: Ars explained that longer pipelines allow higher speeds because each clock cycle corresponds to one stage, and with a longer pipeline each stage gets shorter, since the work done in one stage can now be laid out over several stages (so Decode becomes Trace Fetch and many other stages on the P4). That means each clock cycle has less to do and is shorter, therefore "faster". What I'm no longer getting after reading that is what clock cycles stand for and how long they are. Can a clock cycle's length actually vary? Or is it always 1 nanosecond? If so, then what explains Ars' claims, which seem logical?
To me it seems a clock cycle can be any number of nanoseconds; it all depends on how the CPU does one. That would explain why an Athlon does a lot in a clock cycle that may take more nanoseconds, but in the end fewer total nanoseconds than a P4 executing the same thing, since the P4 does less per clock cycle and therefore takes fewer nanoseconds per cycle.
Anyway, I'd like some details on clock cycles and their real time.
What I'm also wondering is, when they say 2 GHz, how much time in our measure of time does that take? Is it doing 2 billion clock cycles in one second? If it's more than that, I will get even more confused, which is why I am hoping you guys understand what I am asking!

My second question is about the pipeline. Long before I read the article on pipelining, I always thought the pipeline was an actual CPU component! But it never made sense: why build some component where data goes straight through (like what a real-life pipeline looks like) if stages like Execute should be done in the CPU's execution units? It wasn't until Ars that I learned it's just a way of describing the minimum number of cycles, and the steps, needed to process one instruction. So would I be correct to say they could have chosen something other than "pipeline" as the name, maybe a board describing each clock cycle/step and what it means, rather than a drawing of a long rectangle with stages in it? They could call it the Step Board. Basically, what I'm trying to ask is whether the terms pipeline and pipelining are just a way of saying how many cycles, or stages, it takes to finish an instruction.

Anyways thanks for bearing with me!

--
An AMD employee's diary: Today I kicked an Intel worker in the "Willy"! :lol: 
July 25, 2002 6:44:01 AM

You got the basic gist of it. Pipelining is the method by which tasks are split into multiple parts (think of an assembly line). Each time one part finishes, that's one clock cycle. The reason you actually have a clock cycle is that all the individual "stages", or parts, must be coordinated so that when one finishes working on a specific instruction, another continues it. That is what the clock cycle is for: coordinating stages. Otherwise one stage could work very fast while another worked slower; that's the concept behind a clockless CPU design, and such a design could actually work much faster. Unfortunately the cost of designing such a thing seems to have been too great even for Intel, so they dropped it. Either way, I'm rambling. Back to the point.

Yes, if you split the task of executing an instruction into more parts, each part will do less and as a result the "clock speed" will increase. However, there is a rule: the execution unit (or rather, the ALU) must always be just one stage. The decoding and scheduling and everything else can take as many stages as they want, but one pass through the ALU can't take longer than one stage. So in a 4 GHz P4 the ALU is running at 4 billion passes per second, and in a 2 GHz Athlon the ALU is running at 2 billion passes per second. This is where the advantage of hyperpipelining comes into play: while each stage has less work to do, there are twice as many stages working at once in a 20-stage design as opposed to a 10-stage design.
Let's go back to that assembly line example. Say you have 10 workers and they produce a shoe. One worker readies the sole of the shoe, then passes it on to the next guy, who puts on the laces or whatever. The 10 workers can only work so fast because each has a pretty big task. Now, if you're familiar with how an assembly line works, you'll see that if you provide a constant stream of shoes to make, each person is working on a small part at the same time. While a shoe may take 10 passes (person 1 finishes and passes it on to person 2, person 2 finishes his task and passes it on to person 3, and so on), once everyone is working (i.e. after person 1 finishes his task on the first shoe and passes it on, he immediately receives another shoe to start), the throughput is 1 shoe per pass. In other words, the person at the end always gets a shoe to finish every pass.
Now imagine you have 20 people working, but each person doing a smaller task. Each person can do his task much faster because repeating small tasks is much easier than repeating large tasks (i.e. you can hit a nail 10 times per second, but if you also had to place the nail, you could only put in 2 nails per second). Each person may be doing a smaller job, but you have twice as many people, and since they're working faster, the guy at the end still always has a shoe to complete. So the end throughput is the same, these guys still finish 1 shoe every pass, but you're able to do it 2x or 3x or 4x as fast (i.e. the guys are able to repeat their tasks much faster).
A processor's "clock speed" is limited by its slowest component, or stage (the weakest link in the chain). If, say, part of the decoding stage cannot operate faster than 4 billion times per second, then the whole processor is limited to 4 billion passes per second.
That's pretty much what pipelining is. It's just splitting a job into pieces and therefore being able to work on smaller bits of more jobs at a time.
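
To put some rough numbers on the assembly-line idea, here is a small Python sketch (the stage counts and instruction count are made up purely for illustration). Once the pipeline is full, a 10-stage and a 20-stage design both retire about one instruction per cycle; the deeper design only pays a slightly longer fill time, and its advantage is that each of its shorter stages can be clocked faster in wall-clock terms.

# Toy model of an ideal, stall-free pipeline (illustrative numbers only).

def finish_cycles(stages, instructions):
    # With no stalls, instruction i leaves the pipe on cycle stages + i:
    # the first one needs 'stages' cycles to fill the pipe, then one
    # instruction drains per cycle after that.
    return [stages + i for i in range(instructions)]

for depth in (10, 20):
    finish = finish_cycles(depth, 1000)
    total = finish[-1]
    print(f"{depth}-stage pipe: first result on cycle {finish[0]}, "
          f"1000 instructions in {total} cycles "
          f"(~{1000 / total:.2f} per cycle)")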
July 25, 2002 8:18:50 AM

A clock cycle <> a nanosecond.
A clock cycle is not a constant amount of time; what's constant is how many clock cycles are performed per second. 2 GHz is 2 billion clock cycles per second.
A nanosecond (ns) is a constant amount of time: one billionth of a second.

So if a processor runs at 1 GHz it performs a clock cycle a billion times per second, so you can say that a 1 GHz processor does a cycle every nanosecond.

Or:
For a 1 GHz processor, 1 clock cycle = 1 ns.

Now, a 2 GHz processor does 2 billion cycles every second, so you can say it does 2 cycles each nanosecond.

In practice, when you calculate the performance lost to memory latency, which is measured in ns, you need to convert it to clock cycles. Example:
Say it takes 100 ns of latency to access memory (chipset latency + memory latency). With a 2 GHz processor you effectively lose 200 cycles of processing just on memory latency. Now say that in Hammer the integrated memory controller cuts ~30 ns off memory access. It now takes 70 ns to access memory and you effectively lose "only" 140 cycles at 2 GHz, roughly a 43% speed improvement on memory latency.
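
Putting that arithmetic into a few lines of Python (same invented 100 ns and 70 ns latencies as above):

# Convert between clock frequency, cycle time and memory latency in cycles.

def cycle_time_ns(freq_ghz):
    # Period is 1 / frequency: 1 GHz -> 1 ns per cycle, 2 GHz -> 0.5 ns.
    return 1.0 / freq_ghz

def latency_in_cycles(latency_ns, freq_ghz):
    return latency_ns * freq_ghz  # ns times cycles-per-ns

print(cycle_time_ns(1.0))            # 1.0 ns per cycle at 1 GHz
print(cycle_time_ns(2.0))            # 0.5 ns per cycle at 2 GHz
print(latency_in_cycles(100, 2.0))   # 200 cycles lost per memory access
print(latency_in_cycles(70, 2.0))    # 140 cycles once ~30 ns is shaved off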

This post is best viewed with common sense enabled
July 25, 2002 8:24:55 AM

One important thing to know and remember is that all stages in a pipeline work at the same time, so a processor with a 10-stage pipeline shouldn't be any slower than a 20-stage processor at the same clock speed, since all 20 stages are being worked on in the same clock cycle.
The problem is that this only holds in theory, when all stages are in fact filled and ready to execute. Sometimes the pipeline empties (cache misses, branch mispredictions, etc.) and then it takes the 20-stage processor more time to fill its pipeline again: 20 clock cycles.
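
A quick back-of-the-envelope model of that refill cost (the flush rate and base throughput here are invented just to show the shape of the effect):

# Effective throughput when the pipeline occasionally has to be refilled.

def effective_ipc(base_ipc, flushes_per_instr, refill_cycles):
    cycles_per_instr = 1.0 / base_ipc + flushes_per_instr * refill_cycles
    return 1.0 / cycles_per_instr

for depth in (10, 20):
    print(depth, round(effective_ipc(1.0, 0.01, depth), 3))
# With one flush per 100 instructions, the 10-stage pipe averages ~0.91
# instructions per clock and the 20-stage pipe ~0.83: the deeper pipe
# pays more for every flush.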

This post is best viewed with common sense enabled
July 25, 2002 2:38:49 PM

That last part was a bit confusing to me.
On a P4, is the minimum amount of time for an instruction on a FULL pipeline (which means it HAS to go through every stage) 20 cycles?
If so, I don't get why you were saying that all stages can be done in 1 clock cycle.
Quote:
since all 20 stages are being worked on in the same clock cycle.

However, your 1 GHz = 1 billion cycles per second explanation, combined with imgod2u's, has finally opened up what I wanted to know about cycles!
It's all clearing up now...

-However, what natural behavior makes the clock speed on P4s, with twice the pipeline, really able to go faster in MHz? OK, so the cycles are shorter to finish, but what technical (maybe even physical, due to the material?) behavior allows it to ramp in MHz?
-And also, about adding components inside and thereby, as they say, increasing IPC: say we add an FPU to the P4, what makes it that from then on, speed ramping slows down a bit? I mean, all the FPU does is help the cycles do even more, but are they being slowed down?
-Another question (sorry for all of them, lol, but I think it will help some others as well): now that we know the stages are shorter, there are more steps to take overall, though technically they're done in the same time as in a 10-stage design, so why the need for the big bandwidth? As in 3.2 GB/s compared to 2.1 GB/s?

Thanks guys, you're clearing up a lot, and it's exactly what my question was about; I was worried nobody would get the gist of it!


--
An AMD employee's diary: Today I kicked an Intel worker in the "Willy"! :lol: 
July 25, 2002 2:50:39 PM

There is something I don't get here: what about FPU pipelines? (You talked about the ALU being only one pass, one stage.)
What do they serve, and how fast is the FPU usually?
They say the K7 FPU is 15 stages long, but to counter any kind of mispredict, miscalculation, lack of data (pipeline bubbles?) or whatever problem that could cause it to flush, it has a 36-entry scheduler to make sure the ops get there constantly. So how long are these stages at minimum, anyway? Are they saying the 15-stage pipeline in the FPU is actually a sub-step, and that its stages are all done in 1 clock cycle in order to move on to the Write stage?

Also, when they talk about high IPC, does that mean that in one clock cycle, say we look at the Athlon architecture and we see 6 MacroOPs ready to go through 3 ALUs and 3 FPUs, so it's doing an efficient job per clock cycle, all these units will work at the same time in that cycle and the ops will be executed all at once? (That would explain why clock cycles are not a given length of time, and are only there to say that a job is done NOW.)
This is why I'm wondering about pipelined execution units, unless ALUs are never pipelined and FPUs are, in which case I don't get how you clock them if they have 5 more stages in a K7 than its own pipeline.

OK, I know I have other questions, but I will try to sort them out and remember them first. I can see I'm getting what I want to know and finally stopping all the scramble in my head, so thanks for bearing with me.

--
An AMD employee's diary: Today I kicked an Intel worker in the "Willy"! :lol: 
July 25, 2002 3:03:07 PM

Hmm, speaking of latency: if the P4 craves low latency, am I right to say that the L2 cache architecture of the P4 is actually limiting? I have read it is 8-way associative, and if there is anything I have learned from Ars, it's that a longer search within a set means longer latency. And since a balance between reduced latency and reduced conflict misses is better (remember, you can't push one without the other becoming more of a disadvantage), shouldn't Intel have used 4-way associativity? The way I see it, from what I learned, that would have been much more effective, given the P4 requires fast access.



--
An AMD employee's diary: Today I kicked an Intel worker in the "Willy"! :lol: 
July 25, 2002 4:26:41 PM

You got something wrong... the Pentium 4 L2 cache is 24-way set associative and the Athlon's is 16-way set associative.
The Pentium 3 is 8-way if I remember correctly...

I will reply to your other messages later... kinda busy right now.

This post is best viewed with common sense enabled
July 25, 2002 6:28:13 PM

Quote:
-However, what natural behavior makes the clock speed on P4s, with twice the pipeline, really able to go faster in MHz? OK, so the cycles are shorter to finish, but what technical (maybe even physical, due to the material?) behavior allows it to ramp in MHz?

Well, as I mentioned, it's easier to do simple things repeatedly than to do complex things repeatedly. So if you break stages down into simpler tasks, it's easier to operate very fast. Now, just by the sheer shortening of the stages, you should gain 2x the passes per second. However, as you'll notice with the P7 core, the gain is much more than just 2x. The P7 core's theoretical scaling is around 10 GHz.

Quote:
-And also, about adding components inside and thereby, as they say, increasing IPC: say we add an FPU to the P4, what makes it that from then on, speed ramping slows down a bit? I mean, all the FPU does is help the cycles do even more, but are they being slowed down?

I don't quite understand what you're saying here. The IPC should theoretically be the same per pipeline no matter how many stages the processing is broken down into. Branch mispredicts do account for some loss in IPC, but not very dramatically with modern prediction methods. The FPU is an execution unit: after the instructions have been decoded and scheduled and the actual math is ready to be done, that's what the FPU does. It's one stage in the FPU pipeline. Sure, by adding another one you increase the chance that one component can't operate as fast as the others, but that's true of adding anything.

Quote:
-Another question (sorry for all of them, lol, but I think it will help some others as well): now that we know the stages are shorter, there are more steps to take overall, though technically they're done in the same time as in a 10-stage design, so why the need for the big bandwidth? As in 3.2 GB/s compared to 2.1 GB/s?

There are actually several reasons. The first and most obvious is the high clock rate. Instructions are fetched from memory (and even prefetched) every clock. The higher the clock rate, the more you have to fetch from memory (or cache) every second. This is why more memory bandwidth is needed. The prefetch logic will have more of a "window" to look at what's in memory and what's being processed, and speculatively load what it thinks the processor will need next into the caches. This requires a significant amount of bandwidth, as you must have instructions to fetch and decode every clock, otherwise you get a gap in the pipeline. The more you can fetch and hold in cache, the more likely the processor will always have something to work on.
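
As a very rough illustration of why bandwidth has to track clock rate (the bytes-per-instruction figure is a guess picked only for the example, and caches absorb most of this in practice):

# Crude estimate of instruction-fetch bandwidth, ignoring caches entirely.

def fetch_bandwidth_gb_s(freq_ghz, ipc, bytes_per_instruction):
    # instructions per second times bytes each, expressed in GB/s
    return freq_ghz * ipc * bytes_per_instruction

print(fetch_bandwidth_gb_s(1.0, 1.0, 3))   # ~3 GB/s of instruction bytes at 1 GHz
print(fetch_bandwidth_gb_s(2.0, 1.0, 3))   # ~6 GB/s at 2 GHz with the same IPC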

Quote:
There is something I don't get here: what about FPU pipelines? (You talked about the ALU being only one pass, one stage.)
What do they serve, and how fast is the FPU usually?
They say the K7 FPU is 15 stages long, but to counter any kind of mispredict, miscalculation, lack of data (pipeline bubbles?) or whatever problem that could cause it to flush, it has a 36-entry scheduler to make sure the ops get there constantly. So how long are these stages at minimum, anyway? Are they saying the 15-stage pipeline in the FPU is actually a sub-step, and that its stages are all done in 1 clock cycle in order to move on to the Write stage?

I'm not too sure of the number of stages in the FPU pipelines of most modern CPUs; however, they are almost always longer than the ALU pipelines. FP operations generally don't consist of many branch-dependent instructions and therefore don't suffer as much from branch mispredicts. With modern prediction algorithms, branch mispredicts in the FPU pipeline are almost nonexistent. Generally, FP instructions are more complex and require more stages to complete. Each pass through a stage in the FPU pipeline takes the same time as in an integer pipeline; it simply takes more stages for an FP instruction to be processed than for an integer one. However, as I've pointed out, once the pipeline is filled, the throughput is still the same per pipeline, theoretically.

Quote:
Also, when they talk about high IPC, does that mean that in one clock cycle, say we look at the Athlon architecture and we see 6 MacroOPs ready to go through 3 ALUs and 3 FPUs, so it's doing an efficient job per clock cycle, all these units will work at the same time in that cycle and the ops will be executed all at once? (That would explain why clock cycles are not a given length of time, and are only there to say that a job is done NOW.)

Yes. Execution units work in parallel. Micro-ops are scheduled and executed at once and released at once. However, data dependencies (i.e. you must finish one instruction before you can know its result and start work on another) almost always prevent all 9 of the Athlon's execution units from having work to do at once. It's almost never a limitation of the execution units themselves; scheduling and having enough micro-ops to work on is usually the limitation. IPC is never an exact thing, as it varies with code. Average IPC is basically the average of how much work is done per clock. In one clock the Athlon may only be able to complete 1 micro-op, while in the next clock all 9 of its execution units could be filled and it completes 9 micro-ops. You take the average of all of this and you get average IPC. The Athlon's 9 execution units are normally considered overkill. The P3 had 2 FPU units and its FP power was just fine. There are many quirks in FP processing that become a big hassle, but I won't get into those right now.
A clock cycle, as I said, is one pass through the stages. It says nothing about how much work on average each stage does, nor about the average throughput (i.e. how many instructions are completed per clock). As I said, theoretically the average throughput of the P7 core design shouldn't be less than its predecessor's. However, many things that were in the P3 were cut from the P4, such as support for barrel-style execution of FXCH FP instructions, and its fetch ability is limited, etc. That's really why you see such a reduction in average IPC in the P4.

Quote:
This is why I'm wondering about pipelined execution units, unless ALUs are never pipelined and FPUs are, in which case I don't get how you clock them if they have 5 more stages in a K7 than its own pipeline.

Each pass through one stage is 1 clock. All stages, be they FP stages or integer stages, work at the same rate. It takes the same amount of time for 1 stage in the integer pipeline to complete as for a stage in the FP pipeline. If the FP pipeline has more stages, then it simply takes more passes (clocks) to finish an instruction. FP and integer operations are almost never dependent on each other, so it really doesn't matter that an integer operation finishes 5 clocks before an FP operation does.

Quote:
Hmm, speaking of latency: if the P4 craves low latency, am I right to say that the L2 cache architecture of the P4 is actually limiting? I have read it is 8-way associative, and if there is anything I have learned from Ars, it's that a longer search within a set means longer latency. And since a balance between reduced latency and reduced conflict misses is better (remember, you can't push one without the other becoming more of a disadvantage), shouldn't Intel have used 4-way associativity? The way I see it, from what I learned, that would have been much more effective, given the P4 requires fast access.

No, I think, as someone else stated, the P4's L2 cache is 24-way set associative. Its latency is somewhere around 6 cycles, I think. That's a pretty big improvement over any other caching method I know of. The associativity of a cache is how many "channels" of access, I guess you could call them, you have. The more the better.

Quote:
You got something wrong... the Pentium 4 L2 cache is 24-way set associative and the Athlon's is 16-way set associative.
The Pentium 3 is 8-way if I remember correctly...


The P3 had 16-way set associative L2 cache. The Celeron had 8. Not sure about the Athlon, I think it had 8 as well.
(Edited by imgod2u on 07/25/02 11:48 AM.)
July 25, 2002 7:29:20 PM

What I meant about the IPC increase is: what causes speed ramping to become slower when you add components? Technically speaking, adding more cache, adding an FPU, wouldn't that make clock speed ramping a bit slower? (In a way this is the K7: it is so filled with power inside that clock speed ramping isn't as easy anymore.)

I am not sure I got the answer to my question about pipelined execution units. I am asking: since a clock cycle length in a CPU applies to each stage in the pipeline, then when you get to, say, the 18th one, which is the Execute stage, the op is now, for example, in the FPU and is about to be executed. Well, if that stage is done in one clock cycle, that means after that cycle the data has been executed, no? What I don't get is: what about the stages inside the FPU? Is each stage there a clock cycle, or, no matter how many stages an execution unit has, will it always be done in one clock cycle like the stage it sits in? It's the fact that they pipeline execution units that mixes me up about how long clock cycles can be in a pipeline.

So I take it that IPC, instructions per clock cycle, actually means micro-ops or MacroOPs and not real undecoded x86 instructions, right? And the IPC, or MacroOPs per clock cycle, is counted when they are in the execution units, and IPC isn't talked about in any stages before that, no?

Also, while reading Ars and THG, they specify that the K7 decoder allows a much more efficient maximum decoding of 6 uOPs, so why do we say 9 here?

You also state the P4 has lowered IPC, but also say it has some things removed, so why was its per-clock performance so horrible compared to the P6 core or the PIII back in the Willamette days? It seems you're saying it is actually remarkable that its per-clock performance is as good as it is, while in reality we all know it was horrible to see a 1.4 Willy go against a 1 GHz P3 and get beaten by the P3 in most results.

As for cache, hmm, I didn't think I was so behind, lol. Last time I checked WCPUID for my Palomino's cache it said 4-way, but I was reading off the instruction TLB. What I don't get, though, is: isn't higher associativity going to result in longer searches within the set, giving higher latencies? I do know that conflicts, where blocks can't be stored in the same place and have to be swapped every cycle, would be a disaster on the P4, but I thought at first that the big associativity would also make search times longer, making it almost as bad as low associativity's conflict-miss moments.
And what are TLBs for in the cache anyway, and why are THEY also associative, at 4-way on my AXP?

I guess I was thinking of small associativity at first, but I may have been thinking of the L1 cache, not the L2, so they may have gotten mixed up in my mind, and I expected the L2 to be the same.

--
An AMD employee's diary: Today I kicked an Intel worker in the "Willy"! :lol: 
July 25, 2002 8:39:50 PM

Quote:
What I meant about the IPC increase is: what causes speed ramping to become slower when you add components? Technically speaking, adding more cache, adding an FPU, wouldn't that make clock speed ramping a bit slower? (In a way this is the K7: it is so filled with power inside that clock speed ramping isn't as easy anymore.)

No, the K7's low scalability isn't exactly due to its higher IPC. It's simply due to the design. With complex stages, you can only scale so much. As I mentioned, all else equal, a 20-stage design will have the same throughput as a 10-stage design, but the 20-stage design will be able to scale much higher. This is not counting data dependencies, memory latency, etc.

Quote:
I am not sure I got the answer to my question about pipelined execution units. I am asking: since a clock cycle length in a CPU applies to each stage in the pipeline, then when you get to, say, the 18th one, which is the Execute stage, the op is now, for example, in the FPU and is about to be executed. Well, if that stage is done in one clock cycle, that means after that cycle the data has been executed, no? What I don't get is: what about the stages inside the FPU? Is each stage there a clock cycle, or, no matter how many stages an execution unit has, will it always be done in one clock cycle like the stage it sits in? It's the fact that they pipeline execution units that mixes me up about how long clock cycles can be in a pipeline.

As far as I'm aware, FP operations (done in the execution units) can be pipelined. This means the execution can be broken down into several parts and multiple executions can be worked on at once in different stages. This doesn't affect overall throughput, but it does mean each operation takes longer to finish.

Quote:
So I take it that IPC, instructions per clock cycle, actually means micro-ops or MacroOPs and not real undecoded x86 instructions, right? And the IPC, or MacroOPs per clock cycle, is counted when they are in the execution units, and IPC isn't talked about in any stages before that, no?

No, IPC is the average number of x86 instructions that can be finished per clock. It doesn't matter how many micro-ops are finished and it doesn't matter how many stages they go through. You measure how many x86 instructions are finished per clock and average them out. This is why you can't simply state the IPC of a CPU: it varies with code and with different types of instructions.
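
In other words, average IPC is just instructions completed divided by elapsed cycles, measured after the fact. A trivial sketch with an invented per-cycle trace:

# Average IPC is measured after the fact: retired instructions / cycles.
retired_per_cycle = [1, 0, 3, 2, 0, 1, 2, 0, 0, 1]   # made-up counts
average_ipc = sum(retired_per_cycle) / len(retired_per_cycle)
print(average_ipc)   # 1.0 for this invented trace; real code varies wildly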

Quote:
You also state the P4 has lowered IPC, but also say it has some things removed, so why was its per-clock performance so horrible compared to the P6 core or the PIII back in the Willamette days? It seems you're saying it is actually remarkable that its per-clock performance is as good as it is, while in reality we all know it was horrible to see a 1.4 Willy go against a 1 GHz P3 and get beaten by the P3 in most results.

Not just back in the Willamette days. The lower average IPC exists on all P4s. The only thing is, with the increase in L2 cache and memory bandwidth, the P4 is getting better IPC due to fewer memory constraints (so it gets fewer gaps in the pipeline). And yes, the reason for its horrific IPC compared to the K7 or P6 designs is that many things were cut out. Code can be changed to use other parts of the CPU (such as SSE2) and then you wouldn't see such a dramatic difference in IPC. Again, it's all about the software you run.

Quote:
As for cache, hmm, I didn't think I was so behind, lol. Last time I checked WCPUID for my Palomino's cache it said 4-way, but I was reading off the instruction TLB. What I don't get, though, is: isn't higher associativity going to result in longer searches within the set, giving higher latencies? I do know that conflicts, where blocks can't be stored in the same place and have to be swapped every cycle, would be a disaster on the P4, but I thought at first that the big associativity would also make search times longer, making it almost as bad as low associativity's conflict-miss moments.
And what are TLBs for in the cache anyway, and why are THEY also associative, at 4-way on my AXP?

As far as I'm aware, the associativity of a cache is how many "channels" of access the CPU has to that cache. In other words, with a 16-way associative cache access you have 16 different pathways to access the cache, whereas with an 8-way associative cache you have only 8. The more pathways, the more flexible cache accesses can be and the more parts of the cache you can attempt to access.
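
For what it's worth, here is a minimal sketch of how a set-associative lookup works (block size, set count and way count are arbitrary): the address picks exactly one set, and only the ways of that set are compared, which is why adding ways reduces conflict misses without the whole cache being scanned.

# Minimal set-associative cache lookup (illustrative sizes only).
BLOCK = 64    # bytes per cache line
SETS = 128    # number of sets
WAYS = 8      # associativity: lines compared per set

cache = [[None] * WAYS for _ in range(SETS)]   # each entry holds a tag

def lookup(address):
    block_number = address // BLOCK
    set_index = block_number % SETS
    tag = block_number // SETS
    ways = cache[set_index]
    if tag in ways:          # hardware compares all WAYS tags in parallel
        return "hit"
    ways[0] = tag            # miss: evict one way (trivial policy) and refill
    return "miss"

print(lookup(0x12345))   # miss on the first access, line gets filled
print(lookup(0x12345))   # hit on the second access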
July 26, 2002 4:07:26 AM

Quote:
No, the K7's low scalability isn't exactly due to its higher IPC. It's simply due to the design. With complex stages, you can only scale so much. As I mentioned, all else equal, a 20-stage design will have the same throughput as a 10-stage design, but the 20-stage design will be able to scale much higher. This is not counting data dependencies, memory latency, etc.

Although that counters everything this forum taught me and what so many websites claimed, you just enlightened me, and I have been right all that time. So in reality, if I took an AXP and extended it to 20 stages, I would not see a difference in per-clock performance, right?
I am aware that, all things being the same, the increase in pipeline stages would mean more bandwidth needed, so in reality increasing an AXP to 20 stages might lower per-clock performance by at most 20%, and that could be reclaimed just by adding bandwidth through high-speed RAM and maybe dual channeling.

Quote:
As far as I'm aware, FP operations (done in the execution units) can be pipelined. This means the execution can be broken down into several parts and multiple executions can be worked on at once in different stages. This doesn't affect overall throughput, but it does mean each operation takes longer to finish.

So what this leads me to conclude is that no stage is ever one clock cycle? If so, does that mean FP operations, say going into the K7 for execution, and since it is 10 stages and we're doing FP, would take a minimum of 25 cycles when doing FP ops?

Another thing unanswered is how the Athlon can spit out a max of 9 IPC when the decoders can only do 6 uOPs at once? You state that IPC is the average x86, so can you point me to datasheets, or the area of the K7 where 9 ops can go through at once in a single cycle?

As for the P4's IPC, I have a hard time understanding that even though it has cheap execution units compared to the Athlon (forget SSE2), and many other things, at the same clock the Willy performed a good 20% worse. What is hard to get is that even with things like the trace cache, very good hardware prefetch, a huge 400 MHz FSB (might not be as effective since it's still GTL+) and 3.2 GB/s dual-channel RDRAM, it still wasn't enough. One can't imagine how much it would have sucked without all that. So how is that possible? What other things are significant to improve to get the 20% back?

So, in conclusion on IPC: when people say that to get higher speeds you need to sacrifice IPC, it isn't entirely true and in fact only applies to latency and low-bandwidth situations, no?

Finally, for the cache: I know about associativity, and I should not forget that it's the number of channels PER SET, but again, I was asking why such a high associativity would not result in higher latencies? And how did Intel get around that to make the P7 core's cache so good as to reduce the big latency penalty that would usually result from searching so many ways?



--
An AMD employee's diary: Today I kicked an Intel worker in the "Willy"! :lol: 
July 26, 2002 4:31:36 AM

Quote:
Although that counters everything this forum taught me and what so many websites claimed, you just enlightened me, and I have been right all that time. So in reality, if I took an AXP and extended it to 20 stages, I would not see a difference in per-clock performance, right?
I am aware that, all things being the same, the increase in pipeline stages would mean more bandwidth needed, so in reality increasing an AXP to 20 stages might lower per-clock performance by at most 20%, and that could be reclaimed just by adding bandwidth through high-speed RAM and maybe dual channeling.

Well, first of all, notice I mentioned not accounting for data dependencies. In realistic cases, increasing the pipeline length does affect performance, as branched instructions become a potential hazard. Of course, prediction methods nowadays usually keep branch mispredicts to a minimum, so the loss in average IPC isn't very dramatic. Certainly not as dramatic as Athlon vs. P4.

Quote:
So what this leads me to conclude is that no stage is ever one clock cycle? If so, does that mean FP operations, say going into the K7 for execution, and since it is 10 stages and we're doing FP, would take a minimum of 25 cycles when doing FP ops?

You kinda lost me. Each stage is a piece of the pipeline. The execution of an FP instruction can take multiple stages to complete, but each stage = 1 clock. But yes, on the Athlon it would take 15 cycles (usually) for an FP instruction to go in and come out all finished. However, the same principles of pipelining apply: after the first 15 instructions have been put into the process, 1 will come out each clock, maintaining the same throughput as if it were all done in 1 stage.

Quote:
Another thing unanswered is how the Athlon can spit out a max of 9 IPC when the decoders can only do 6 uOPs at once? You state that IPC is the average x86, so can you point me to datasheets, or the area of the K7 where 9 ops can go through at once in a single cycle?

The answer is that it cannot do 9 IPC. I don't know where this BS came from and frankly, it's about as bad as Apple's "we're twice as fast". It has 9 execution units and can execute a maximum of 9 micro-ops per clock. That is entirely different from completing 9 whole x86 instructions per clock. The average number of x86 instructions completed per clock on the Athlon is about 1.5, I'd say. A rough guess, though, as it varies significantly with code.

Quote:
As for the P4's IPC, I have a hard time understanding that even though it has cheap execution units compared to the Athlon (forget SSE2), and many other things, at the same clock the Willy performed a good 20% worse. What is hard to get is that even with things like the trace cache, very good hardware prefetch, a huge 400 MHz FSB (might not be as effective since it's still GTL+) and 3.2 GB/s dual-channel RDRAM, it still wasn't enough. One can't imagine how much it would have sucked without all that. So how is that possible? What other things are significant to improve to get the 20% back?

What good is having a BMW Z8 if you're driving in a traffic jam? You can put all the memory bandwidth and impressive caching you want on it, but when it gets to the end of the pipeline and the execution units aren't powerful enough, there is not a thing you can do to speed up the process. What would help is if Intel either 1. gives up and decides to go back to putting heavy-duty x87 FPU units on the chip, or 2. releases a scalar extension for FP operations which would actually be useful to developers who can't take advantage of SIMD. Also, the trace cache is actually one of the limitations. While it's a great idea, it needs to be beefed up. Only being able to issue 3 micro-ops per clock is killing it, and 12k micro-ops simply isn't big enough to store the majority of recurring loops in modern code. Hopefully Prescott will address these issues (since its trace/L1 cache will be double-pumped).

Quote:
So, in conclusion on IPC: when people say that to get higher speeds you need to sacrifice IPC, it isn't entirely true and in fact only applies to latency and low-bandwidth situations, no?

It isn't just "not entirely true"; it isn't true at all. Intel simply didn't have enough time with the P7 core to make everything work top-notch and still have the benefits of a 20-stage pipeline. If you look at the roadmaps from a few years ago, the P3 should have transitioned to the .13 micron process sometime in 2000/2001 and the P4 wasn't supposed to be introduced until somewhere around 2002. If they'd had more time (and possibly done it right), there should have been no, or very little, loss in IPC.
As for memory bandwidth, that's just an obvious statement. As GHz increases, so will the need for higher memory bandwidth and lower latency. This is regardless of how much work per clock the processor does. It's not that difficult a concept to grasp; I wonder why so many people don't get it.

Quote:
Finally, for the cache: I know about associativity, and I should not forget that it's the number of channels PER SET, but again, I was asking why such a high associativity would not result in higher latencies? And how did Intel get around that to make the P7 core's cache so good as to reduce the big latency penalty that would usually result from searching so many ways?

Associativity lowers the potential latency if I remember correctly. The more access pathways you have, the less time you have to spend looking for stuff. Should be common sense....
July 26, 2002 9:44:58 AM

Actually, it's about 1.1 complex x86 instructions completed per cycle on average for the Athlon.

This post is best viewed with common sense enabled
July 26, 2002 2:33:20 PM

I don't think I have the time right now to write back, but I want you to check out the top area of this article section, so you'd know where I got the idea that more associativity to search = more latency, and therefore there is some point where too much latency no longer makes the tradeoff of fewer caching conflicts useful.
http://arstechnica.com/paedia/c/caching/caching-7.html

Oh, and btw, thank you for finally being the one who knew that IPC does not decrease with pipeline increase and that you DO NOT need to strip, for example, the Athlon K7's core components to ramp in clock speed. Man, the number of people here and pretty much everywhere on the web who were taught this FUD... I hope we can spread the truth in this forum at least!

--
An AMD employee's diary: Today I kicked an Intel worker in the "Willy"! :lol: 
July 26, 2002 5:05:49 PM

I've just glanced over a few PDFs over at Intel and apparently the P4's L2 cache is 8-way set associative. So forget whoever said it was 24-way. Also, keep in mind the P4's prefetch logic speculatively loads data that it views as either temporally or spatially related into the cache, either in the same blocks or near the same blocks. This reduces cache lookup time significantly in most cases.
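
A toy version of that kind of spatially-related prefetch (purely illustrative; the real logic is far more elaborate): on a miss, also pull in the neighbouring block, betting that nearby data is about to be used.

# Toy next-line prefetcher: on a miss, also fetch the following block.
BLOCK = 64
cached_blocks = set()

def access(address):
    block = address // BLOCK
    if block in cached_blocks:
        return "hit"
    cached_blocks.add(block)       # demand fill
    cached_blocks.add(block + 1)   # speculative prefetch of the next block
    return "miss"

print(access(0))    # miss
print(access(64))   # hit: the prefetcher already brought this block in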
July 26, 2002 5:30:57 PM

This is not totally true. You know, the Athlon's wide execution unit, in particular its 3-way complex decoder, is somewhat limiting in terms of clocking and also generates more heat, packed into a single stage of the pipeline...
Again, this could be said for a lot of components of a lot of CPUs. The designer's job is to strike a balance between the amount of IPC lost and the amount of frequency gained, and thus the total absolute performance gained or lost.
The Pentium 4's designers did not necessarily make the wrong choices with the Pentium 4, as you can see overall performance is very good.

This post is best viewed with common sense enabled
July 26, 2002 6:37:49 PM

Yup, I got the 8-way idea from also having read the PDF posted by Crashman in this forum. I dunno, man, but 24-way sounds like overkill and would really kill latency.
I do know the prefetch helps, however, for sure.
And I am also aware of temporal and spatial locality, which is probably something they tried to improve on the P4 as well?

--
An AMD employee's diary: Today I kicked an Intel worker in the "Willy"! :lol: 
July 26, 2002 6:41:12 PM

Yeah, there are some things that CAN limit the clocking as well. For sure, heat is one big aspect. Though why would the P4 generate so much heat when it has 20 stages and a much weaker IPC than the Athlon?

--
An AMD employee's diary: Today I kicked an Intel worker in the "Willy"! :lol: 
July 26, 2002 7:25:49 PM

Quote:
This is not totally true. You know, the Athlon's wide execution unit, in particular its 3-way complex decoder, is somewhat limiting in terms of clocking and also generates more heat, packed into a single stage of the pipeline...
Again, this could be said for a lot of components of a lot of CPUs. The designer's job is to strike a balance between the amount of IPC lost and the amount of frequency gained, and thus the total absolute performance gained or lost.
The Pentium 4's designers did not necessarily make the wrong choices with the Pentium 4, as you can see overall performance is very good.

Well, you just answered your own argument: "packed into a single stage of the pipeline". It doesn't matter how many components you have; you can always split them into multiple stages to achieve higher clock rates and maintain (theoretical) throughput through pipelining. Take those 3 decoding units. Split the decoding into twice as many stages as the Athlon has now and you'll get an increase in scalability. Of course, it's not a good idea, considering that having 3 decoding units in parallel would produce heavy data dependencies, and therefore it's best to keep pipelining of the decoding stage to a minimum, but you get the point.
As for the P4, it's not so much the "wrong choice". I think the introduction of the P4 was necessary. The .13 micron process wasn't ready and the P3 wasn't ramping well anymore. Intel needed to release something based on the P7 core, and if you look at the design, it's pretty obvious that it came before its time. Many, many parts need to be beefed up. Although it is pulling out better overall performance, it's not what it could have been, or, as I would imagine, should have been in the minds of the engineers.
July 26, 2002 7:43:59 PM

Before I go on with anything, I'd like to be sure we're on the same page here. When nowadays we say CPU X suddenly gets a 10% increase in IPC because of some core improvement, why do we say IPC if it always remains the same and is about the number of instructions that can be handled?

OK, this brings me also to the IPC thingy. When I said 9 IPC, it was written everywhere on the web and everyone knows that, but what I never got was WHAT instructions they are talking about! Do you know, and can you explain, what and WHERE the K7 can really do 9 IPC at max?

Just before continuing, I'd like to mention I am only 15 and I am no programmer, so this is why I am trying to learn extremely in-depth stuff; I am very interested in it, and that is why I ask you all to bear with me!

Ok back to topic:
Quote:
You kinda lost me. Each stage is a piece of the pipeline. The execution of an FP instruction can take multiple stages to complete, but each stage = 1 clock. But yes, on the Athlon it would take 15 cycles (usually) for an FP instruction to go in and come out all finished. However, the same principles of pipelining apply: after the first 15 instructions have been put into the process, 1 will come out each clock, maintaining the same throughput as if it were all done in 1 stage.

Now I know that once a pipeline is filled it can then execute a lot faster than one that fills once, waits until the op is done, then fills with the next op. However:
Does that mean that the first time you fill the pipeline, there will be some latency compared to once it is filled and keeps spitting out a result each clock cycle afterwards?

Also, it seems we've gotten mixed up, at least me for sure. I was under the impression that 10 stages applied to any kind of execution on the K7. It turns out the K7 pipeline is 10 stages for integer and 15 for FP, so that means in reality the FPUs are not 15 stages long, but the entire CPU pipeline for FP ops totals 15? So in total it would take 15 cycles to do FP ops. Maybe this explains the rather better CPU scaling than the P3s (seeing 0.18 micron K7s reaching 1.8 GHz while 0.18 micron P3s reached 1 GHz).

Quote:
The answer is that it cannot do 9 IPC. I don't know where this BS came from and frankly, it's about as bad as Apple's "we're twice as fast". It has 9 execution units and can execute a maximum of 9 micro-ops per clock. That is entirely different from completing 9 whole x86 instructions per clock. The average number of x86 instructions completed per clock on the Athlon is about 1.5, I'd say. A rough guess, though, as it varies significantly with code.

Reading through the K7 articles today, I found that the ports that dispatch the MacroOPs can send one op per clock. This totally mixes me up now. If one port is in charge of sending ops to either the ALUs or FPUs, all sharing the same port, then how the heck can the Athlon achieve 9 uOPs at once? With 9 exec units, how can you use them all per clock if either the ports or the uOP decoders can only spit out 3?
This is indeed mixing me up, and I am led to believe that a pipeline never really takes 10 cycles to finish. Again, maybe you could explain, because I am losing my train of thought trying to put it all together.

Also, if you're saying that IPC has no regard to pipeline length, then say you had a 10-stage Athlon. Each stage can do up to 3 ops. If you shrink the pipe to 5 stages, how can the IPC not change (remember, I am referring to IPC here as overall per-clock performance)? Wouldn't that mean that a 1 GHz Athlon like this with 10 stages would perform worse than an Athlon with 5 stages at 1 GHz? If so, then you can expect me to be even more confused, since you said before that the stages, whether big or not, do not affect IPC; it all depends on whether each clock cycle on a CPU is not wasted, and on having powerful parallel execution.

I realize I might sound very clueless here, and maybe even annoying, but I've gotten so far now that giving up on making things clear would not work. And if there is any way to help me get it right, please provide links, explanations, anything that clears things up! I'd be very grateful, thank you!

--
An AMD employee's diary: Today I kicked an Intel worker in the "Willy"! :lol: 
July 26, 2002 7:51:39 PM

Quote:
Well, you just answered your own argument: "packed into a single stage of the pipeline". It doesn't matter how many components you have; you can always split them into multiple stages to achieve higher clock rates and maintain (theoretical) throughput through pipelining. Take those 3 decoding units. Split the decoding into twice as many stages as the Athlon has now and you'll get an increase in scalability. Of course, it's not a good idea, considering that having 3 decoding units in parallel would produce heavy data dependencies, and therefore it's best to keep pipelining of the decoding stage to a minimum, but you get the point.
As for the P4, it's not so much the "wrong choice". I think the introduction of the P4 was necessary. The .13 micron process wasn't ready and the P3 wasn't ramping well anymore. Intel needed to release something based on the P7 core, and if you look at the design, it's pretty obvious that it came before its time. Many, many parts need to be beefed up. Although it is pulling out better overall performance, it's not what it could have been, or, as I would imagine, should have been in the minds of the engineers.

It could be that I am a visual person and need visual help to understand, because I didn't quite get what happens when you try to split decoding into 2 stages. This is exactly why pipeline stuff mixes me up. How do you split decoding into two stages, how can they both still give the same per-clock performance as one stage would, and how would it help?

I do agree the P4 is a new thing, and it did come out very badly. You could imagine if it had 3 fully pipelined FPUs with SSE2 on each, the ALUs, AGUs and trace cache double-pumped, a trace cache over 32KB (which would hold tons more ops), even more improved prefetching, the 512K L2 from the start, and dual-channel DDR400. Yet with all this it could still easily ramp in speed regardless of the number of components, so it could reach 10 GHz, and the thing would be a MONSTER! (Again, based on how imgod2u has explained that adding more per-clock performance does not mean less scalability.)

Though I learned this from you, there is still something not clear. If speed ramping is not affected by the amount of components added, why is overclocking affected, then? For example, the Celeron 300 with no cache overclocked a lot, while the one with 128K L2 didn't. How is that explainable?

--
An AMD employee's diary: Today I kicked an Intel worker in the "Willy"! :lol: 
July 26, 2002 8:41:48 PM

This is an interesting line of thought going on here. Let me try to interject something. A while back (1-2 years?) I did a paper on the processes inside a CPU, specifically the Athlon. The Athlon has a processor, but inside the processor is a co-processor for floating point and for math operations that require a wider register for the result; these are regularly handled by the co-processor. Some of these math operations or commands are single one-command operations and some are complex multiple-command operations. The multiple-command operations are referred to as complex commands. Inside the processor, some commands are hard-wired into logic circuits, like add and subtract, load, store. Many commands are complex. These complex commands, which are not hard-wired into the logic circuits, are made up of multiple simple commands. Their composition is in a ROM area of the CPU, which has to be read before the command can be processed. Complex commands are then stored temporarily in a register. If they are used over and over, they do not have to be looked up again as long as they are still in the register.

The time a clock cycle takes is relevant. Keep in mind that the Athlon does 2 operations during each clock cycle: once as the clock reaches the top of the square wave and once before the end of the top of the square wave. The clock cycle is constant and perfectly timed. There may be nothing going on, because the CPU is waiting for something to happen, but the clock is still running. Think of it as a flow of water: you can interrupt the flow when you put a glass under it to fill up, but water is still flowing into the glass and still falling below the glass.

With multiple pipelines, many commands are processed out of order. They are placed in registers awaiting the other results and then completed in the correct order. Because many pipelines are running out of order, it takes a lot of overhead to predict the order, get the data, run the commands and then store the results. The more complex the system, the more complex the overhead.

A CPU is like a marching band. Everyone marches at the same time. Everyone plays a different instrument, but all at once, to the same beat and tempo. When the band turns to do a wheel, the people on the outside have to speed up a little to keep in unison (call them simple operations), while the people on the inside of the wheel take a lot of small steps to cover their distance (call them complex operations). But they all move at once.

The Athlon also has some registers that are longer than 32 bits wide. It has registers of different lengths for different purposes. Some commands are simple and take up less room, and some are more complex and have multiple operands. It might be possible to actually look up and store two 32-bit commands at once in a double-wide register.

Clock cycles are important because keeping track of how well a processor utilizes its time is very important. This is how we can determine how efficient a computer system is. I would venture that a processor spends most of its time sitting around waiting on the rest of the computer system to do things like fetch memory or read a hard drive. This can be improved through techniques like symmetric multi-processing (SMP) and massively parallel processing (MPP), and combinations of different techniques like Non-Uniform Memory Access (NUMA).

I would suggest you try to get hold of a documentation CD-ROM that AMD used to send to developers. It is very handy to study, and it has lots of diagrams. AMD goes to great lengths to document the processors they make. I don't know if Intel publishes any of its documentation or not.
July 26, 2002 8:44:32 PM

IPC is not the theoretical number of instructions that can be handled; that would be MIPS, or millions of instructions per second.

IPC is the real-deal average throughput of the processor, which is about 1.1 for the Athlon and about 0.9 for the P4. I am talking about x86 instructions.

This post is best viewed with common sense enabled
July 26, 2002 8:52:03 PM

You just said it yourself: it's not always good or useful or even possible to cut complicated processes into more than one stage. It costs you transistor count, since such logic slices are not perfect (look at Willamette), and either way, you gain clock speed but you lose IPC that way.

This post is best viewed with common sense enabled
July 26, 2002 8:59:25 PM

And why lose IPC?

Also, then, where does the 9 IPC figure come from? Damn, I should relocate the links where I read that number...

--
The sound of determination is the echo of will...
July 26, 2002 9:16:31 PM

Could you be more clear on your first question? As it is now, it could have many, many answers.

The 9 IPC comes from the fact that the Athlon has 9 execution units.

This post is best viewed with common sense enabled
July 26, 2002 10:29:09 PM

My first question is: why are you claiming that you have to sacrifice IPC just to get more MHz?

Also, OK, it has 9 exec units, but when is it ever possible to use them all at once? I mean, is it even possible?
I had asked before if it can do 9 uOPs per clock cycle, but nobody said it can, and I think imgod2u said no, that it is limited or something.

--
The sound of determination is the echo of will...
July 26, 2002 11:44:36 PM

Quote:
Before I go on with anything, I'd like to be sure we're on the same page here. When nowadays we say CPU X suddenly gets a 10% increase in IPC because of some core improvement, why do we say IPC if it always remains the same and is about the number of instructions that can be handled?

I didn't say it would remain the same; I said, all else being equal, the theoretical number of instructions completed each clock (if the pipelines were filled) would remain the same.
Real-world conditions are far different, and that's why we talk of average IPC. One clock, a processor may finish 1 instruction; the next clock, it may finish 3. It's near impossible to predict with all the code out there.
Keep in mind average IPC is the average number of instructions COMPLETED per clock. It doesn't matter how many the CPU is working on and it doesn't matter how many are being fetched from memory. What matters is what is currently coming out of the processor as a freshly processed instruction.

Quote:
OK, this brings me also to the IPC thingy. When I said 9 IPC, it was written everywhere on the web and everyone knows that, but what I never got was WHAT instructions they are talking about! Do you know, and can you explain, what and WHERE the K7 can really do 9 IPC at max?


As I said, it doesn't complete 9 instructions per clock. It is capable of a maximum of 9 micro-ops (as AMD calls them) per clock. A micro-op is a far cry from a whole x86 instruction.
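
For a feel of the difference (this breakdown is made up for illustration; real decoders differ in the details): one memory-touching x86 instruction typically splits into several simpler internal operations, while a register-only one may map to just one.

# Rough illustration: one x86 instruction can decode into several micro-ops.
decode_table = {
    "add eax, ebx":   ["alu_add eax, ebx"],                      # 1 micro-op
    "add eax, [mem]": ["load tmp, [mem]", "alu_add eax, tmp"],   # 2 micro-ops
    "add [mem], eax": ["load tmp, [mem]", "alu_add tmp, eax",
                       "store [mem], tmp"],                      # 3 micro-ops
}
for inst, uops in decode_table.items():
    print(f"{inst!r} decodes into {len(uops)} micro-ops")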

Quote:
Now I know that once a pipeline is filled it can then execute a lot faster than one that fills once, waits until the op is done, then fills with the next op. However:
Does that mean that the first time you fill the pipeline, there will be some latency compared to once it is filled and keeps spitting out a result each clock cycle afterwards?

Also, it seems we've gotten mixed up, at least me for sure. I was under the impression that 10 stages applied to any kind of execution on the K7. It turns out the K7 pipeline is 10 stages for integer and 15 for FP, so that means in reality the FPUs are not 15 stages long, but the entire CPU pipeline for FP ops totals 15? So in total it would take 15 cycles to do FP ops. Maybe this explains the rather better CPU scaling than the P3s (seeing 0.18 micron K7s reaching 1.8 GHz while 0.18 micron P3s reached 1 GHz).


I've put together a Flash presentation on pipelining: http://server29.hypermart.net/imgod2u/Pipelining/Pipeli...
I suggest you watch it. It should answer most of your questions.

Also, the P3 did have a longer FP pipeline than its integer pipeline as well. The reasons it wasn't able to scale as high are:
1. The Athlon, from the T-bird line and up, had copper interconnects instead of aluminum.
2. Intel didn't bleed every bit they had out of the P3 like AMD did with the Athlon.
3. Intel didn't redesign the circuit layout of the P3 like AMD did with the Palomino line to reduce heat by an average of 20-30%.

Quote:
Reading through the K7 articles today, I found that the ports that dispatch the MacroOPs can send one op per clock. This totally mixes me up now. If one port is in charge of sending ops to either the ALUs or FPUs, all sharing the same port, then how the heck can the Athlon achieve 9 uOPs at once? With 9 exec units, how can you use them all per clock if either the ports or the uOP decoders can only spit out 3?

First of all, 1 micro-op does not equal 1 x86 instruction. The Athlon has 3 decoders capable of decoding a maximum of 3 x86 instructions per clock into micro-ops. An x86 instruction can decode into 3, or 4, or 5, or 12 micro-ops; x86 is variable-length, so instructions vary in complexity. Its decoding buffer is able to issue a maximum of 6 micro-ops per clock. These are then sent to a scheduling buffer, which issues the micro-ops to the execution units. As far as I'm aware, there are 2 independent scheduling buffers in the K7 design: one for integer and memory micro-ops and another for FP/MMX/3DNow/SSE micro-ops. These scheduling buffers have enough ports to issue 1 micro-op to each execution unit.
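
A sketch of why the 9 execution units rarely all fire at once: the sustained flow is capped by the narrowest point upstream. The widths below are the ones quoted in this thread, and the micro-ops-per-instruction average is invented.

# Sustained micro-ops per clock is limited by the narrowest stage upstream.
decode_x86_per_clock = 3     # x86 instructions decoded per clock
avg_uops_per_x86 = 1.5       # invented average; varies with the code
issue_uops_per_clock = 6     # micro-ops the decode buffer can issue per clock
execution_units = 9          # micro-ops that could execute per clock

sustained = min(decode_x86_per_clock * avg_uops_per_x86,
                issue_uops_per_clock,
                execution_units)
print(sustained)   # 4.5: the execution units are not the sustained bottleneck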

Quote:
This is indeed mixing me up, and I am led to believe that a pipeline never really takes 10 cycles to finish.

I have absolutely no idea what that was supposed to mean.

Quote:
Also, if you're saying that IPC has no regard for pipeline length, then say you had a 10-stage Athlon. Each stage can do up to 3 ops. If you shrink the pipe to 5 stages, how can the IPC not change (remember, I am referring here to overall per-clock performance)? Wouldn't that mean that a 1GHZ Athlon like this on 10 stages would perform worse than an Athlon on 5 stages at 1GHZ? If so, then you can expect me to be more confused, since you said before that the stages, whether big or not, do not affect IPC; it all depends on whether each clock cycle on the CPU is not wasted and it has powerful parallel execution.

The size of the stages does not affect average throughput, which is really what IPC is. You're trying to think too literally. Focus on the amount of TOTAL work getting done by the processor, not just each stage. Remember, while each stage may be doing less, there are twice as many stages doing work at the same time on twice as many instructions. Again, go watch that presentation I put together on pipelining.

Quote:
Though I learned this from you, there is still something not clear. If speed ramping is not affected by the amount of components added, why is overclocking affected then? For example, the Celeron 300 with no cache overclocked a lot, while the one with 128K L2 didn't. How is that explainable?

First of all, that's not true at all. The first overclocking monsters were the Celeron 300A's, which had 128KB of on-die L2 cache. The original 300's overclocked well too, but not better than the ones with L2 cache.
Also, with anything you add, there is a chance of a loss in scalability. When you add something, you increase the chance that something went wrong while producing that component and it doesn't run as fast as it's supposed to (or rather, as fast as the rest of the processor). This is a manufacturing limitation and will be solved through refinements in manufacturing. It's not an architectural limitation.

Quote:
You said it yourself - it's not always good or useful or possible at all to cut complicated processes into more than one stage; it costs you transistor count, since such logic slices are not perfect (look at Willamette) - and in any case, you gain clock speed but you lose IPC that way.

Well, yes, that's why I said theoretically. In realistic terms, data dependencies and the increase in transistor count would hamper IPC somewhat, but that's like saying replacing a 4-cylinder engine in a car with an 8-cylinder engine reduces engine efficiency due to the added weight. Such things can be refined and corrected later on with lighter materials, without having to settle for a weak engine that would not provide as much torque. You can only make a car so light before you lose traction.
July 27, 2002 3:28:29 AM

I am well aware that uOPS are not the same as x86 instructions and that each x86 instruction is decoded into smaller uOPS. Sorry I didn't mention it! What I wanted to know, which IIB answered, is what they mean by a max of 9 IPC, and it means 9 uOPS at the EXEC stage. I had been under the impression that every stage does that, which mixed me up a lot. All it means is that at the EXEC stage of any CPU, a certain maximum amount can be executed, right? Now if that IS the case (hope to god it is), then in what possible situation can the actual 9 units all be working together, 9 uOPS one each, to effectively spit out 9 executed at once in that clock?

I checked your flash project and again I do understand that. I think I found a way to ask one of the main things confusing the living hell outta me. Ok, so we know a clock cycle is what is referred to as one pass, one step done, and that one clock is one hertz. Now it could be because I didn't learn in school what Hz really is and how it works, which would explain why I have a hard time learning this, but it could also be because I can't get WHAT generates that hertz, and how that thing knows the cycles are now divided into smaller, faster ones, so that from there on it can create even more cycles per second and scale so much higher. I guess it's the physics behind the CPU's Hz that I am not getting, and how the pipeline length affects it. Hell, I don't even know how the CPU maker can tell the silicon that the section where a given stage occurs means 1 CLOCK CYCLE, not more, not less. That is, when we are in the EXEC stage, how does the CPU maker tell the CPU's physics that this section (the execution units) HAS to equal one cycle, one hertz, and not more than one?

Another thing bothering me, which goes against what you told me (and what I thought too), is this: when you split the tasks into smaller ones, you say the work per clock SHOULD, under ideal conditions, remain the same. HOW can it still be said to do the same amount of work per clock? Let us go back to your shoe example. You are at the part with the coloring and the shining stages. Now you add twice the workers, so at this point one worker does the coloring and one does the shining. How can we still say that the efficiency, the work done per worker (say we took the worker doing the coloring stage), is the same as when one worker did the coloring AND the shining? This is why I am still thinking that increasing pipeline stages means reducing the work per clock cycle as well, as IIB also tried to say, at least I think he was.

Quote:
Also, the P3 did have a longer FP pipeline than its integer pipeline as well. The reason it wasn't able to scale as high is because:
1. The Athlon, from the t-birds line and up had copper interconnects instead of aluminum.
2. Intel didn't bleed every bit they had out of the P3 like AMD did with the Athlon.
3. Intel didn't redesign the circuit layout of the P3 like AMD did with the Palomino line to reduce heat by an average of 20-30%.

Ok, for the FP and integer pipeline question, your flash didn't really cover it, since it was a general all-around rule. I just wanted to know if the Athlon doesn't really have ONLY a 10-stage pipeline, but in reality 10 stages for integer and 15 for FP (8 added to the 7 of the integer pipeline that are shared by both, like fetching and decoding). And does that mean that if we started a CPU fresh and sent in one FP and one integer instruction at the same time, the integer one would finish first, because its uOP only takes 10 cycles and not 15, while 5 cycles later the FP one finally reaches the WRITE stage (or whatever the Athlon calls the final one)?

As for the P3 vs Athlon architecture, I understand why the Athlon scaled better; however, if both had these advantages, then technically, with no process obstacles and on the same process, they could potentially reach the same max clock speed, right?

Also, WHY do we say the K7 is at the end of its days? How can a pipeline one day become so limited in scalability that nothing can help it scale higher?

Quote:
First of all, 1 micro-op does not equal 1 x86 instruction. The Athlon has 3 decoders capable of decoding a maximum of 3 x86 instructions per clock into micro-ops. An x86 instruction can decode into 3, or 4, or 5, or 12 micro-ops; x86 is variable length, so instructions vary in complexity. Its decoding buffer is able to issue a maximum of 6 micro-ops per clock. These are then sent to a scheduling buffer which issues the micro-ops to the execution units. As far as I'm aware, there are 2 independent scheduling buffers on the K7 design: 1 for integer and memory micro-ops and another for FP/MMX/3DNow/SSE micro-ops. These scheduling buffers have enough ports to issue 1 micro-op to each execution unit.

That last part is what isn't clear to me. Does that mean that in order to fill all the exec units through the ports, you'd need 9 cycles to dispatch to each unit in total? If not, does it mean that in one cycle it can send 9 uOPS, one to each unit, and make them all work? And since 9 uOPS spit out at the same time is a rare thing, can you clarify HOW the ports, when properly worked out with the units, can really spit out 9 processed uOPS in the same clock?

Quote:
I have absolutely no idea what that was suppose to mean.

Never mind, I think I got it.

Quote:
The size of the stages do not affect average throughput, which is really what IPC is. You're trying to think too literally. Focus on the amount of TOTAL work getting done by the processor, not just each stage. Because remember, while each stage may be doing less, there are twice as many stages doing work at the same time on twice as many instructions. Again, go watch that presentation I put together on pipelining


I know I am thinking so much I'm confusing myself, but I got this far and I need to get it right, clarified. It may be overkill for my age, I dunno man, I am just in the "orgasm" of learning CPU stuff, so I need to get it right!

Well, at least now I know that adding units or components, or increasing their size, does not in fact stop the scaling, which would explain why even with Prescott's enhancements the 10GHZ target is in fact still there and not even close to being stopped, unless by physical fab and process problems.


--
The sound of determination is the echo of will...
July 27, 2002 4:05:04 AM

Quote:
I am well aware of uOPS not the same as x86 and how each are decoded into small uOPS. Sorry I didn't mention it! What I wanted to know, which IIB answered, is what do they mean by a max of 9 IPC, and it means 9 uOPS at the EXEC stage. I was put in the thought that in the beginning each stage does that, but that mixed me a lot. All it means is that on the EXEC stage of any CPU, a certain maximum amount can be executed, right? Now if that IS the case (hope to god it is), then when in what possible situation can the actual 9 units all be working together or 9 uOPS one each, to effectively in that clock spit out the 9 executed at once?

The answer to that is data dependencies. Let's say you decoded 3 x86 instructions into 6 micro-ops. Now, let's say that 4 of those micro-ops came from instructions that depend on the result of 2 of those micro-ops (shouldn't be too hard to figure out that each x86 instruction in my example decoded into 2 micro-ops). Those 4 micro-ops must be stored in the scheduling buffer until those 2 micro-ops are done and the result is known. Now, by the time those 2 micro-ops are executed, another 3 x86 instructions were decoded into 5 micro-ops and put into the scheduling buffer. Let's say that the 4 micro-ops from before, and these 5 micro-ops, can all be executed without depending on each other's results. This would fill the execution units (assuming the mix of ALU, FPU and AGU micro-ops was evenly distributed).
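Here's a stripped-down Python sketch of that "wait in the scheduling buffer until your inputs are ready" idea (the micro-ops, their arrival clocks and dependencies are all made up to mirror the example above; real scheduling logic is far more involved):

# Toy scheduling buffer: every clock, execute every ready micro-op,
# up to the 9 execution units.
uops = {
    # name: (clock it shows up in the scheduler, micro-ops whose results it needs)
    "a1": (1, []), "a2": (1, []),
    "b1": (1, ["a1", "a2"]), "b2": (1, ["a1", "a2"]),
    "b3": (1, ["a1", "a2"]), "b4": (1, ["a1", "a2"]),
    "c1": (2, []), "c2": (2, []), "c3": (2, []), "c4": (2, []), "c5": (2, []),
}

EXEC_UNITS = 9
done = set()
clock = 0
while len(done) < len(uops):
    clock += 1
    ready = [name for name, (arrival, deps) in uops.items()
             if name not in done and arrival <= clock and all(d in done for d in deps)]
    issued = ready[:EXEC_UNITS]          # at most one micro-op per unit per clock
    done.update(issued)
    print("clock", clock, ": executed", len(issued), "micro-ops:", issued)

On clock 1 only the 2 independent micro-ops can go; on clock 2 the 4 that were waiting plus the 5 new ones all fire at once, which is the rare "9 in one clock" case.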

Quote:
I checked your flash project and again I do understand that. I think I found a way to ask what is one of the main things confusing the living hell outta me. Ok so we know a clock cycle is what is refered as one pass, one step done, and that one clock is one hertz. Now it could be because I didn't learn in school what Hz really is and how it functions, and so, which would explain why I have a hard time learning this, but it could be also because I can't get, WHAT generates that hertz, and how does that thing know that the cycles are now divided into small faster ones, so that it knows from there on it can create even more cycles per second, so that it knows it's capable of scaling so much higher? I guess it's the physics behind the CPU's Hz that I am not getting, and how the pipeline lengh affects it. Hell I don't even know how can the CPU maker tell the silicone and specify which section where that stage occurs in, means 1 CLOCK CYCLE, not more not less. That means when we are in the EXEC stage, how does the CPU maker tell the CPU's physics that in this section (the execution units), it HAS to equal one cycle, one hertz, and not more than one?

A clock cycle isn't exactly a physical thing. It's just a description of how fast each stage can operate. The clock rate is generated on the chipset and the CPU takes that (the FSB) and multiplies it (each stage does a certain number of passes for every signal it receives from the FSB).

Quote:
Another thing bothering me, and is countering what you told me and I thought so too, is that when you split the tasks into smaller ones, and yet you say that the work per clock SHOULD under ideal conditions remain the same, HOW can it still be said to work the same amount per clock? Let us go back to your shoe example. You are at the part where you put the coating color, and the shining stages. Now you add twice the workers, thus at this stage one worker does the coloring and one does the shining. How can we still say that the efficiency, and the work done per worker, if say we took the worker doing the coloring stage, would be the same as when it was one worker doing coloring AND shining? This is why I still am thinking that increase pipeline stages means reducing the work per clock cycle as well, as IIB also tried to say, at least I think he was.

IPC is not work done per WORKER, it's work done per CLOCK. In other words, it's the COMBINED work of all the workers in a pass. Two people doing smaller jobs equal 1 person doing a big job, except the 2 people can do it faster, get it? This is explained in the flash presentation.

Quote:
Ok for the FP and Integer pipeline question, your flash didn't really explain, since it was a general all around rule, however I just wanted to know if the Athlon doesn't really have a 10 stage pipeline ONLY but in reality it is 10 stages for Integer and 15 (8 added to the 7 of the Integer one, which are the same for each pipeline like fetching and decoding) for the FP, and whether that means that if we started a CPU freshly and sent in one FP and one integer instruction at the same time, wouldn't the Integer one finish first, because the the uOP there only has to take 10 cycles and not 15, while 5 cycles later the FP one finally reaches the WRITE stage (or whatever the Athlon calls the final one)?

Now you're getting it.

Quote:
As for the P3 vs Athlon architecture, I understand why it scaled better, however if both had these advantages, technically without any process obstacles at the same process, they could reach potentially the same max clock speed, right?

Actually the P3 should reach higher speeds as it has a 13 stage integer pipeline if I recall correctly. And a 15 or 16 or something like that FP pipeline.

Quote:
Also WHY do we say the K7 is at the end of its days? How can a pipeline one day become limited in scalability to the point nothing can help scale higher?

Yes. Now you're getting it. You can only operate complex stages so fast. There comes a point where the passage of electrons through the stages can't go any faster without too many errors occurring. You can shrink the transistor size all you want and you can use whatever material you want, but you're still limited by the rate at which electrons can travel reliably.

Quote:
That last part is what isn't clear to me. Does that mean that in order to fill all exec units by the ports, you'd need 9 cycles to dispatch to each unit totally? If not, then does it mean in one cycle it can send 9 uOPS, one to each unit, and make them all work? And since 9 uOPS spit out at the same time is a rare one, can you clarify me HOW can the ports when properly worked out, with the units, really spit out at the same clock 9 uOPS processed?

Yes, the scheduling buffer has the ability to issue to all 9 execution units at once. And you're still trying to track just 1 instruction. Remember, there are micro-ops to be executed almost every clock. So while the micro-ops being processed didn't necessarily start together, there will always be micro-ops waiting to be processed. Just think of multiple lines waiting to see a movie. It doesn't matter how much slower one ticket guy works than another; if there are enough people, the lines will be filled and they will all be busy. And since they work on a cycle (the shortest amount of operation time), they each will be able to FINISH one person with each cycle.

Quote:
Well, at least now I know that adding units or components, or increasing their size, does not in fact stop the scaling, which would explain why even with Prescott's enhancements the 10GHZ target is in fact still there and not even close to being stopped, unless by physical fab and process problems.

Again, theoretically. Realistic cases will be somewhat different. There are methods (such as prediction) that are used to counter some of the effects that may cause reduced average IPC. And again, you have to be careful what components you add. If you add a really complex component to the processor design that can't be operated as quickly, it will surely limit scalability. But then again, you could just pipeline that component. You just have to be careful that the component isn't heavily branch dependent.
July 27, 2002 4:37:37 AM

Well at least I get the 9 uOPS thingy.
But the port thing itself I don't get. Unless the ports are actually BEFORE the scheduling buffers? In that case it would make more sense, since then the buffers of the exec units CAN, in some given case, get all 9 uOPS done in one cycle.

Quote:
A clockcycle isn't exactly a physical thing. It's just a description of how fast each stage can operate. The clockrate is generated on the chipset and the CPU takes that (the FSB) and multiplies it (each stage does a certain amount of passes for every signal it receives from the FSB).

At first glance, that doesn't help much, as I don't know what does that! I told you, I haven't learned what Hz are and do, which is why all we usually know is that Hz is what we use to state the frequency of a computer, but I wanna know the physics and how THAT works. I also wanted to know, after learning what makes the MHZ, how it is that dividing the stages, and therefore having more cycles on a CPU, lets you know that from here on you can go up to 10GHZ if you want, anytime (given the process shrinks and all the quirks were ready to be done)?
And as well, how does a CPU maker specify which part becomes a cycle's stage?

Another weird thing I noticed is that they give the P4 one 20-stage pipeline. Does it mean 20 stages regardless of which execution unit? If so, I don't get how the EXEC stage works here. In fact, why is there an EXEC stage at all if the ALU or FPU are pipelined, and therefore there is not just ONE stage to do the calc, but many more added by the units?

Quote:
IPC is not work done per WORKER, it's work done per CLOCK. In other words, the COMBINED work of all the workers in a pass. Two people doing smaller jobs is equal to 1 person doing a big job. Except the 2 people can do it faster, get it? This is explained in the flash presentation.

Perhaps the reason why I come back with so many questions is the lack of any examples...? I mean, I could only understand Ars so well because the guy provided examples all the time, as well as repeated explanations with images.
So again, here I don't get what you mean. I didn't mean IPC per worker, that was just an example. I meant that each pass of, say, the 5 stages of making a shoe would involve much more work than each pass of 10 stages of making a shoe, and if so, then how can we still claim that the per-clock performance of a 1.5GHZ P4 on 20 stages vs a 1.5GHZ P4 on 10 stages is the same? I know the end throughput is the same, but if 20 stages are meant to get to 10GHZ and we limit it to 1.5GHZ, I technically don't get how the possible cycles we wasted (8.5GHZ more) don't make the 20-stage P4 weaker than the 10-stage one. Again, I have trouble making this work because of the lack of any examples or any visual input. You are free not to bother man, I don't wanna bother you too much!

Hmm, thinking of branches brings another thing to mind. When they say a branch mispredict causes pipeline flushing, why does it have to do that if, say, one of the stages has the faulty instruction, but the instruction in the stage before it has no branch misprediction problem and is in fact good to go on? Why does this one also have to suffer from the bad instruction in the stage right after it, and also go wasted out of the CPU?

--
The sound of determination is the echo of will...
July 27, 2002 6:51:27 AM

Quote:
Well at least I get the 9 uOPS thingy.
But the port thing, itself, I don't get. Unless ports are actually BEFORE the schedule buffers? In that case it would make more sense since then the buffers of each exec units CAN in some given case make all 9 uOPS done in one cycle.

As far as I am aware, the scheduling buffers have one port to every execution unit, making it a total of 3 and 6 ports for the FPU and ALU/AGU respectively.

Quote:
At first glance, that doesn't help much, as I don't know what does that! I told you, I haven't learned what Hz are and do, which is why all we know usually is Hz is what we say to determine the freq speed of a comp, but I wanna know the physics and how THAT works. I wanted to know also after knowing what makes the MHZ, what makes it that since the stages are divided and therefore more cycles on a CPU, be able to know that from here on you can go up to 10GHZ if you want, anytime (given the process shrink and all the quirks were ready to be done)?
And as well how does a CPU maker specify what each part becomes a cycle stage?

As I said, a clock cycle, or Hz, is not a physical thing. There is no physics behind it because it isn't a physical thing. It's merely a measurement of time. "MHz" measures how many passes each stage is operating at. Or rather, how many times per second (or in terms of MHz, million times per second) it can do its job and pass it on to the next stage. In that flash presentation, when you have the instruction going through the pipeline, each time it jumps from one stage to another, that is a "Hz". MHz is how many million times it can do that per second.
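To put numbers on it, here's a tiny Python sketch (just plugging in a few example clock speeds, nothing tied to a specific chip):

# Clock speed is "passes per second"; the time one pass takes is 1/frequency.
for hz in (66e6, 133e6, 1.5e9, 2.0e9):       # example clock speeds
    period_ns = 1e9 / hz                      # nanoseconds per clock cycle
    print("%8.0f MHz -> %.3f ns per clock cycle" % (hz / 1e6, period_ns))

So at 2 GHz, each stage has roughly half a nanosecond to do its job and hand off to the next stage.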

Quote:
Another weird thing I noticed, is that they give the P4 one 20 stage pipeline. Does it mean 20 stages regardless which execution unit? If so, I don't get how the EXEC stage works here. In fact why is there an EXEC stage if the ALU or FPU are pipelined, and therefore there is not just ONE stage to do the calc but added many more by the units?

The P4 has a 20-stage integer pipeline. Its FP pipeline is somewhat longer. These two pipelines (or rather, 6 pipelines I should say) start out as just 1 but then separate. The integer pipelines are traced through the scheduling buffer and into the ALU units, the memory pipeline is traced through the same scheduling buffer and into the AGU units, and the FP pipeline is traced through the FP scheduling buffer and into the FP units. Pipeline length doesn't necessarily mean a physical, independent pipeline. It's just a measure from start to finish.
As an example, let's say you have a water system. You have a big pipe that splits into 2 pipes, and later those two pipes split into 9 pipes. Let's say in order to keep the water running you have to have pumps along these pipes. In the beginning, you only have 1 pipe but you have 3 pumps. Then, when the pipe splits into two, one of these two has 2 pumps but the other has 6. Then, when it splits into 9 pipes, 3 of these 9 have 6 pumps while 6 of these 9 have 5 pumps. The measurement of an "integer pipeline" would be equivalent to the total number of pumps the water goes through from the start at the top, where there's only one pipe, to the end of whichever path it takes.

Quote:
So again here I don't get what you mean. I didn't mean IPC per worker, that was just an example, I was meaning that each pass of the say 5 stages of making a shoe, would have much more work than each pass on 10 stages of making a shoe,

Wrong. Each pass in each INDIVIDUAL stage would do less work. But the total work is the same. This is not that difficult a concept to understand. 10 stages, each do half as much work as each stage in the 5 stage design, but you have TWICE as many stages doing work at the same time. You have 10 people EACH doing less vs 5 people EACH doing more. 10 x 2 = 5 x 4.

Quote:
and if so then how can we still claim that the performance per clock of a 1.5GHZ P4 on 20 stages vs a 1.5GHZ P4 on 10 stages be the same? I know the end throughput is the same, but if 20 stages are meant to get to 10GHZ and we limit it to 1.5GHZ, technically I don't get how the wasted possible cycles we could've used (8.5GHZ more) don't make the P4 on stages weaker than the 10 stages one.

They are weaker. They each do less work. But there are twice as many of them doing work at the same time. You're still thinking about how much work each stage does INDIVIDUALLY. Do the math, 20 times x work for each stage is equal to 10 times twice x work for each stage.
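Spelled out as arithmetic (x is just an arbitrary "amount of work" unit, not a real measurement):

# Work finished per clock = (work per stage) x (stages working at once).
x = 1.0                                  # work one 20-stage stage does per clock
work_per_clock_20_stage = 20 * x         # 20 small stages
work_per_clock_10_stage = 10 * (2 * x)   # 10 stages, each doing twice as much
print(work_per_clock_20_stage == work_per_clock_10_stage)   # True: same total per clock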

Quote:
Again I have trouble making this work because of lack of any examples or any visual input. You are free not to bother man, I don't wanna bother you too much!

You're having so much trouble because you're not seeing the big picture. You're still thinking about just 1 instruction floating through the pipeline. Yes in a 20-stage design each stage does less but you have twice as many stages vs a 10-stage design doing work all at once. You're still only focusing on how much 1 stage does. You need to consider the TOTAL.

Quote:
Hmm, thinking of branches brings another thing to mind. When they say a branch mispredict causes pipeline flushing, why does it have to do that if, say, one of the stages has the faulty instruction, but the instruction in the stage before it has no branch misprediction problem and is in fact good to go on? Why does this one also have to suffer from the bad instruction in the stage right after it, and also go wasted out of the CPU?

They say flush because the CPU cannot speed up one stage so that it can "catch up" with another. So if an instruction goes in and its results are wrong, it still has to spend 20 cycles being processed. The CPU cannot disregard it (as it wouldn't know the results until the end anyway). So in a sense, a whole 20 cycles that would've otherwise been used to do valuable work were wasted on an instruction which is useless. It's flushed in the sense that that 1 instruction's path needs to be rinsed out (it's surrounded by instructions that are useful).
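If you want to see roughly why that hurts, here's a back-of-the-envelope Python sketch (the mispredict rate and depth are made-up illustration numbers, not measurements of any real chip):

# Toy estimate of the cost of branch mispredicts.
pipeline_depth = 20          # roughly the clocks of work thrown away per mispredict
mispredict_rate = 0.02       # fraction of instructions that are mispredicted branches
instructions = 1_000_000

wasted_clocks = instructions * mispredict_rate * pipeline_depth
useful_clocks = instructions                   # pretend 1 instruction finishes per clock otherwise
print("clocks wasted on flushes:", int(wasted_clocks))
print("effective IPC:", instructions / (useful_clocks + wasted_clocks))

The deeper the pipeline, the more clocks each flush throws away, which is why mispredicts hurt a 20-stage design more than a 10-stage one.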
July 27, 2002 2:34:07 PM

Quote:
As I said, a clockcycle, or Hz, is not a physical thing. There is no physics behind it because it isn't a physical thing. It's merely a measurement of time. "MHz" measure how many passes each stage is operating at. Or rather, how many times per second (or in terms of MHz, million times per second) it can do its job and pass it on to the next stage. In that flash presentation, when you have the instruction going through the pipeline. Each time it jumps from one stage to another, that is a "Hz". MHz is how many million times it can do that per second.

This I know, but I asked you HOW the Hz is generated, electrically speaking. And in that sense, how does the generator know that with the P4 it can now generate twice as many cycles per stage, or that it has the ability to ramp up to 10GHZ because something has changed in the layout of the chip it generates for? I'm not really asking the why, but the HOW.

I think I am understanding the reasons for bigger pipelines, and I would suppose the biggest one of them is the electrons' flow, right? Extending the pipeline helps them get through one stage more comfortably, no?

Now, I was thinking of this while rechecking the Flash you had. At the end, where you state the speed of the 10-stage processor at 4Hz, technically that means the 4Hz one performs like the 2Hz one, no? If that is the case there, then how the hell can Intel keep low-clocked P4s in the future having the same overall performance at their clock speeds? I could focus on the TOTAL amount as you said, but seeing as 4Hz is bigger than 2Hz, I start wondering: what if the clock had been set at 2Hz for the 10-stage one as well? In that case, how can it possibly perform as well, if you state that in reality, for example, a 2GHZ K7 could perform almost the same on 20 stages...?

Quote:
The P4 has a 20-stage integer pipeline. It's FP pipeline is somewhat longer. These two pipelines (or rather, 6 pipelines I should say) start out as just 1 but then separates. The integer pipelines are traced through the scheduling buffer and into the ALU units, the memory pipeline are traced through the same scheduling buffer and into the AGU units and the FP pipeline is traced through the FP scheduling buffer and into the FP units. Pipeline length doesn't neccessarily mean a physical, independent pipeline. It's just a measure from start to finish.
As an example, let's say you have a water system. You have a big pipe that splits into 2 pipes and later those two pipes are split into 9 pipes. Let's say in order to keep water running you have to have pumps in the middle of these pipes. In the beginning, you only have 1 pipe but you have 3 pumps. Then, when the pipe splits into two, one of these two has 2 pumps but the other has 6. Then, when it splits into 9 pipes, 3 of these 9 had 6 pumps while 6 of these 9 had 5 pumps. The measurement of an "integer pipeline" would equivalent to the total number of pumps water that goes from start at the top where there's only one pipe to the end whichever path it would take.

Makes sense to me, though I wonder why they only showed THIS pipeline! And why is there one stage that says EXEC if they don't show us the other pipelines with the real trajectory and the number of additional stages to reach, for example, the FP pipe (remember the K7 integer pipeline: at the 7th stage it splits, and you know from THERE the exec stage begins when you add in the extra 8 stages which in total form the 15-stage FP pipeline).

Quote:
Wrong. Each pass in each INDIVIDUAL stage would do less work. But the total work is the same. This is not that difficult a concept to understand. 10 stages, each do half as much work as each stage in the 5 stage design, but you have TWICE as many stages doing work at the same time. You have 10 people EACH doing less vs 5 people EACH doing more. 10 x 2 = 5 x 4.

Again you're right on this and I wasn't thinking that way. But that still doesn't answer how a 2Hz 10-stage CPU can perform as well as the 2Hz 5-stage one?
I can't find a way to express Hz in the shoe example sadly, so you'll have to understand this question on yer own...



--
The sound of determination is the echo of will...
July 27, 2002 8:21:36 PM

Quote:
This I know, but I asked you HOW does Hz generate. Electrically-wise that is. And in that sense how does the generator know that with the P4 it can now generate twice more cycles per stage, or that it has the ability to ramp up to 10GHZ because something has changed in the layout of the chip it generates at? I'm not really asking the why, but the HOW.

The CPU has an onboard clock regulator. This tells each stage how fast to operate at. They all operate at the same speed. That's why the slowest stage is the limit of how fast a clockspeed the CPU can run at. The clockspeed is derived from the FSB. Whatever the FSB runs at, the CPU will multiply that by a certain number and run its stages at that speed. If the speed is too high, the CPU will fail to run properly.
As for the theoretical limit. I think they are just projections. No one can ever be sure the P7 design can reach 10 GHz. However, the engineers do feel it can. I really can't get too much into the process of those projections.
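Roughly what that multiplying looks like in numbers (example FSB and multiplier, not any particular chip):

# The core clock is derived from the FSB by a fixed multiplier (example numbers only).
fsb_mhz = 133.0
multiplier = 15
core_mhz = fsb_mhz * multiplier        # about 2000 MHz, i.e. roughly 2 GHz
stage_time_ns = 1000.0 / core_mhz      # each stage gets about 0.5 ns per clock
print("core clock: %.0f MHz, %.3f ns per clock" % (core_mhz, stage_time_ns))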

Quote:
I think I am understanding the reasons for bigger pipelines, and I would suppose the biggest one of them is because of the electrons' flow, right? Extending helps them float through one stage more comfortably, no?

Now you're getting it.

Quote:
Now I was thinking of this while rechecking the Flash you had. At the end where you state the speed of the 10 stage processor, at 4Hz, technically said that means the 4Hz is as performing as the 2Hz, no? If that is the case there, then how the hell can Intel be able to keep low clocked P4 in the future, to have the same overall performance at their clock speeds? I could focus on the TOTAL amount as you said, but seeing as 4Hz is bigger than 2Hz, I start wondering what if the clock was decided to be put at 2Hz for the 10 stage one as well, and in that case how can it possibly perform as good, if you state that in reality for example a 2GHZ K7 could perform almost the same on 20 stages...?

No, the 4 Hz processor performs TWICE as much as the 2 Hz 5-stage processor. Look at the diagram. Each "clock" or "Hz", a whole instruction is finished by the pipeline once the pipeline is filled. If each processor finishes 1 instruction per clock, and one runs at 4 clocks per second and the other at 2 clocks per second, then obviously the one running at 4 clocks per second would be faster.

Quote:
Makes sense to me, thought I wonder why they only showed THIS pipeline only! And why is there one stage that says EXEC if they don't show us the other pipelines which show the real trajectory and number of additional pipes to reach the FP pipe for example (remember the K7 Integer pipeline, at the 7th stage it splits, and you know from THERE the exec stage begins when you add in the extra 8 stages which will form in total the 15-stage FP pipeline.)

Where did you read that the execution units are only 1 stage? I mean, there is a rule: the ALU's must always be 1 stage or less. In the P4's case, the ALU's take up 1/2 a stage. But you can never pipeline your ALU's. The FPU's, on the other hand, are a different story. As for why the 20-stage number gets the focus: in the beginning, there were only integer operations. CPU's were just math devices. Only later were FP operations added. So it's traditional, I guess, to label everything according to the properties of the ALU and the integer pipeline.

Quote:
Again you're right on this and I wasn't thinking this way. But that still doesn't answer how can a 2Hz 10 staged CPU perform as well as the 2Hz 5-stage one?
I can't try to find a way to explain Hz in the shoe example sadly, so you'll have to understand this question on yer own...

In the shoe making example, let's say you have 2 guys. One does the shining and one does the painting. How many shoes can the first guy paint and then give to the next guy to shine per second? That's the "Hz". How many times a guy can do his job and pass the shoe on is how many "Hz" the whole system runs at. All the guys run at exactly the same "Hz". So when one guy finishes working on a shoe, the next guy has just finished his work on the shoe he was working on and receives the shoe from the guy before him. If there were only 1 guy, he would have to both paint and shine the shoe. This is more complex and he couldn't do it as fast. If you have 2 guys, one shining and one painting, they each have simpler jobs and can repeat them faster (do more shoes per second). Even though they're each doing less, the combined work of the two does more than that 1 guy. Now, if each "clock" means that each of these guys finishes his part of the work on a shoe, that means the guy at the end will finish one shoe per clock. This is the same as if 1 guy were working and a whole shoe were counted as a "clock", except these 2 guys can do their jobs faster, or more jobs per second.
July 27, 2002 10:38:23 PM

I gotta admit you are a survivor to have put up with me lol! To think all these 3 pages are about me constantly whipping questions!
But it gets clearer each time, that's for sure.

Quote:
No, the 4 Hz processor performs TWICE as much as the 2 Hz 5-stage processor. Look at the diagram. Each "clock" or "Hz", a whole instruction is finished by the pipeline once the pipeline is filled. If each processor finishes 1 instruction per clock, and one runs at 4 clocks per second and the other at 2 clocks per second, then obviously the one running at 4 clocks per second would be faster.

Obviously yes. However, here is another thing which IMO might make my trouble with getting it easier to find and solve. Maybe you had explained this before, but again it could be that I didn't get it:

Let us say we clocked two processors, one at 5 stages and one at 10, at 2GHZ; therefore each stage or clock must be 0.5ns, not more, not less. That means also that if we took the wave example another poster in this topic used, the next push must also last 0.5ns. Now let us go to the shoe example so you can see where I am not getting how a 4Hz processor with more stages wouldn't be outperformed by a 4Hz processor with fewer stages, when both do the same number of clocks per second. At the shoe factory, let's say we decided to set each worker's turn at 30 cycles per HOUR. This means we "clocked" their work at 2 minutes each, no matter how many workers there are, each doing less work and passing it on. When you think about it, at the other shoe factory, where they have 5 workers, they also decided to use 2-minute limits or turns per stage of work, so they also do 30 cycles. The way I see it, because we clocked the longer work line at the same 2 minutes per worker, the shoe factory with 5 workers would finish any given shoe faster, because they are forced to do it in 2 minutes per stage (which is the concept of clocking, and why now we have to forget the physics of the pipeline and think only of the clock speed set by the CPU maker). So this, my friend, is what is troubling me. I am no longer wondering why a longer pipeline helps reach high speeds; I am wondering HOW a pipeline with more stages, at the same clock speed as another CPU with a shorter pipeline, can STILL output the same or even higher benchmark results. Of course I mentioned benchmark results not literally, just to mean that it is more powerful per clock. This is why I said take the 1.5GHZ P4 at 10 stages and the one at 20 stages, also at 1.5GHZ, and ask how the one that takes more cycles to process an instruction, and is limited to the same 1.5 billion clocks per second as the other, would still have the same total performance after, say, 1 second has passed, or one full 1.5GHZ done. If you can clarify that, it would make sense. All I can possibly theorize is that increasing the pipeline in fact means that two otherwise identical CPU cores, one with 10 stages and one with 20 stages, cannot perform the same per clock, and that the 20-stage one HAS to perform worse if all we did was redesign/increase the pipeline length. So based on that theory, I can think that the only ways to help the 20-stage processor be better would be to start cramming in more ways to decode more x86 instructions, or use Trace Cache fetching, since that cuts the instruction from 20 cycles to fewer to finish, increase the number of exec units or their effectiveness, or possibly shorten the FPU length, since the cycle time is still the same at the FPUs and therefore you waste more time making that instruction at 1.5GHZ.

Quote:
No, the 4 Hz processor performs TWICE as much as the 2 Hz 5-stage processor. Look at the diagram. Each "clock" or "Hz", a whole instruction is finished by the pipeline once the pipeline is filled. If each processor finishes 1 instruction per clock, and one runs at 4 clocks per second and the other at 2 clocks per second, then obviously the one running at 4 clocks per second would be faster.


In a beautiful world that would work, except Intel eventually had their first P4s clocked the same as the T-birds and Palominos. So that means the Palominos and the P4s at the same clock had the same number of cycles per ns, and therefore the pipeline advantage doesn't apply. I explained just above why I have trouble believing that even splitting the stages into faster ones lets a CPU perform the same per clock, at say 1.6GHZ, as the same CPU with fewer stages also clocked at 1.6 cycles per ns.

You see where I am going at?

Quote:
Where do you read the execution units are only 1 stage? I mean, there is a rule, the ALU's must always be 1 stage or less. In the P4's case, the ALU's take up 1/2 a stage. But you can never pipeline your ALU's. The FPU's, on the other hand, are a different story. As for why the 20-stage is focused on. In the beginning, there was only integer operations. CPU's were just math devices. Only later was FP operations added. So it's traditional I guess to label everything according to the properties of the ALU and the integer pipeline.

And that was where I was confused man! I had thought the ALUs were pipelined as well, which is why I was wondering how you create an EXECUTE stage alone if that needs the FPUs. Then I found out that the FPUs have their own trajectory and stages, and finally I found out that the ALUs are not pipelined, which explains the classical pipeline diagram with the one stage saying EXECUTE.

--
The sound of determination is the echo of will...
July 27, 2002 11:01:21 PM

Quote:
Let us say we clocked two processors, one at 5 Stages and one at 10, at 2GHZ, therefore each stage or clock must be 0.5ns, not more not less. That means also that if we took some poster in this topic's example of waves, the next push must also last 0.5ns. Now let us go to the shoe example so you can see where I am not getting how a 4Hz processor with more stages than a 4Hz processor at less stages but each of these last the same amount of clocks per second, wouldn't be outperformed. At the shoe factory, let's say we decided to make each worker's turns, at 30 cycles per HOUR. This means we "clocked" their work, no matter what the number of workers able to do less work and pass it on, at 2 minutes each. When you think about it, on the other shoe factory, where they have 5 workers, they also decided to use 2 minute limits or turns per stage of work, thus they also do 30 cycles. The way I see it, because we clocked the longer work line at also 2 minutes per worker, it means that the shoe factory with 5 workers would finish faster any shoe, because they are forced to do it in 2 minutes (which is the concept of clocking and why now we have to forget the physics pipeline topic and think only clock speed by a CPU maker). So this my friend is what is troubling me. I am no longer wondering why a longer pipeline would help reach high speeds, I am wondering HOW a pipeline with longer stages, at the same clock speed of another CPU with a shorter pipeline can STILL be able to output the same or even higher benchmark results.

Again, you're still thinking of just one instruction going through the pipeline. Yes, if there were only 1 instruction going through the pipeline, the 5-stage design would finish the instruction faster than the 10-stage design.
Taking the shoe example, if each worker worked on a shoe, then passed it on, that shoe would take longer to make with 10 workers than with 5 workers. But you're not just making 1 shoe. The line of 10 workers is constantly working on 10 different shoes at the same time, while the 5-worker setup is only working on 5 shoes at the same time. While the 10-worker setup may be making each individual shoe half as fast as the 5-worker setup, they're working on twice as many shoes at the same time, so the total that comes out will be the same.

Quote:
This is why I said take the 1.5GHZ P4 at 10 stages and the one at 20, also at 1.5GHZ, and how one who takes more cycles to process an instruction and is limited to 1.5 clocks per nanosecond like the other, would still have the same total performance after say, 1 second past, or one full 1.5GHZ done. If you can clarify that, it would make sense. All I can possibly theorize is that increasing the pipeline in fact means that the two same CPU cores but one with 10 stages and one with 20 stages, cannot perform the same per clock, and that the 20 stage one HAS to perform less if all we did was redesign/increase the pipeline length. So based on that theory, I can think that the only ways to help the processor with 20 stages be better would be to start cramming more ways to decode more x86 instructions, or use Trace Cache fetching, since that cuts the instruction from 20 cycles to less to finish, increase the number of exec units or even their effectiveness, possibly shorten the FPU lengh since the cycle time is still the same at the FPUs and therefore you waste more time to make that instruction at 1.5GHZ.

Yes, a 20-stage design would take longer on A SINGLE INSTRUCTION. But you're working on 20 instructions at any given time, while in a 10-stage design you're working on 10 instructions at any given time. So while it'd take instructions twice as long in the 20-stage design to finish, there are twice as many things being worked on. They're just worked on in separate stages. So when one finishes, another will be in line to finish, and then another, and another. If you measured how long it'd take for a 20-stage design to finish 10 instructions vs a 10-stage design, the 10-stage design would be faster. However, say you had 40 instructions. The 20-stage design would be able to work on 20 of them at a time (slightly skewed, of course) and pop out one instruction every clock and take in one instruction every clock, while the 10-stage design would take in 10 at a time and pop out 1 instruction per clock and take in 1 instruction per clock.
Think of it as a road. A single line. And that line is filled. Yes, it'd take longer to fill the line, but once the line is filled, 1 goes in and 1 comes out per clock.
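Here's a little Python sketch of that "fill it once, then 1 in and 1 out per clock" idea, comparing depths at the same clock speed (ideal toy model: no stalls, no dependencies, no mispredicts):

# Clocks to finish N instructions on an ideal pipeline of a given depth:
# the first instruction needs 'depth' clocks, every one after adds 1 clock.
def clocks_to_finish(n_instructions, depth):
    return depth + (n_instructions - 1)

for depth in (5, 10, 20):
    for n in (1, 40, 1_000_000):
        print("%2d stages, %9d instructions: %9d clocks"
              % (depth, n, clocks_to_finish(n, depth)))

The gap between the designs is just the fill time, a fixed handful of clocks, so it matters for 1 instruction and basically disappears over a million.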

I've put together another <A HREF="http://server29.hypermart.net/imgod2u/Pipelining/Pipeli..." target="_new">Flash presentation</A>.
July 28, 2002 12:01:19 AM

-Ah and that Flash again helped!
This is also where my gripe of the pipeline time thing begins.
Doesn't it still mean that at the same clock speed, the processor with the smaller pipeline already has a 5-instruction head start?

-And not only this, but also the fact that a long pipeline gets misprediction problems and flushes more.
-Ok, I know that once the pipe is filled, you constantly get one op out at the end for sure. However, say we took your presentation, showed all slots filled, and wanted to follow one instruction out of all of these (monitored it), and for the sake of my lack of visualizing we colored it blue to show where it is, and it entered the 10-stage pipeline in your example. Wouldn't it still mean it takes more time to process it than in the 5-stage one, and that in the end you process fewer instructions at any time? Again, this is keeping in mind the clock speed is the same.
Ok, I reread your post and checked again, so I know I am only talking about one instruction, but that still applies to all instructions entering the 10-stage one. But then again, maybe I worry too much. Basically I am just trying to know in what possible architecture the 20-stage CPU, at the same clock as the 10-stage one, can have the same performance?

Quote:
Yes, a 20-stage design would take longer on A SINGLE INSTRUCTION. But you're working on 20 instructions at any given time while in a 10-stage design you're working on 10 instructions at any given time. So while it'd take instructions twice as long in the 20-stage design to finish, there are twice as many things being worked on. They're just worked on in separate stages. So when one finishes, another will be in line to finish, and then another, and another. If you measured how long it'd take for a 20 stage design to finish 10 instructions vs a 10-stage design, the 10-stage design would be faster. However, say you had 40 instructions. The 20-stage design would be able to work on 20 of them at a time (slightly skews of course) and pop out one instruction every clock and take in one instruction every clock. While the 10 stage design would take in 10 at a time and pop out 1 instruction per clock and take in one instruction per clock.
Think of it as a road. A single line. And that line is filled. Yes, it'd take longer to fill the line but once the line is filled, 1 goes in and 1 comes out per clock.

I had to reread that many times to get it, but the end conclusion is that at any time, at the same clock, the 10-stage one trying to do the 40 would finish first, because it has 10 already processed before the 20-stage one, no? And that means the same in reality, so longer-pipelined CPUs with the same core components would generally indeed be slower per clock than their counterparts, but not like twice as slow, rather 20-30% max (when not thinking of the P4 redesign over the P3)?

--
The sound of determination is the echo of will...
July 28, 2002 1:12:34 AM

Quote:
Doesn't it still mean that at a same clock speed, the processor with the smaller pipeline has already a 5 instructions advantage in advance?

Yes, but when you're working on damn near billions, even trillions of instructions, does 5 make a difference?

Quote:
And not only this but the fact a long pipeline would also get misprediction problems and flush more.

Yes, which is why branch mispredicts hurt longer-pipelined processors.

Quote:
Ok I know that that once the pipe is filled, you constantly get one op out for sure at the end. However, say we took your pres, and we showed all slots filled, and we wanted to see one instruction of all these (monitored it), and for the sake of my lack of visualizing, we colored it blue to show where it is, and it entered the 10 stage pipeline in your example, wouldn't it still mean it takes more time to process it than the 5 stage one, and that in the end you process less instructions anytime? Again this is keeping in mind the clock speed is the same anytime.
Ok I reread your post and checked again, so I know I am only talking of one instruction but that still accounts for all instructions entering the 10 stage one. But then again it could be I worry too much. But basically I am trying just to know in what possible architecture can it be that the 20 stage CPU at the same clock as the 10 stage one, be having the same performance?

Well, does it matter how long the instruction takes to get processed? As long as you're finishing 1 every clock does it matter which one it is? You're dealing with billions and possibly even trillions of instructions when processing, does the amount of clocks one instruction takes really matter compared to how many of the total instructions you finish? How many instructions you finish IS performance. That is, after all, the result.

Quote:
I had to reread that many times to get it, but in the end conclusion, it means that at any time, at same clock, the 10-stage one trying to make the 40 would finish first, because it has 10 already processed before the 20 Stage one, no? And that means the same in reality and so that longer stage CPUs under the same core components, would generally indeed be slower per clock than their counterparts, but not by like twice slower, but rather 20-30% max, (when not thinking of the P4 redesign over P3)?

It would not be faster "per clock". It would finish any set of instructions 5 clocks faster than a 10-stage design. This would matter if we were only doing a set of 40 or maybe even 100 instructions. But after the first 10 clocks, the 10-stage design would finish as many instructions as the 5-stage design. When you're working on a set of billions, possibly even trillions of instructions at 2 GHz, does a 2.5 ns difference matter? This is a matter of growth vs constant difference. No matter how many instructions you're working on, there will always only be 5 clocks worth of difference. You could be working on a trillion instructions with 2 GHz processors and the 10-stage design would take 5 "Hz" more than a 5-stage design. That's 5 "Hz" out of 2 GHz. After the first 10 clocks, both designs finish as much per clock.
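Putting actual times on that (the 0.5 ns per clock is just the 2 GHz case being discussed; the instruction count is an arbitrary big number):

# The fixed fill-time difference in real time, at 2 GHz (0.5 ns per clock).
clock_period_ns = 0.5
extra_fill_clocks = 5                      # 10-stage vs 5-stage design
print("extra fill time:", extra_fill_clocks * clock_period_ns, "ns")   # 2.5 ns

instructions = 1_000_000_000               # say, a billion instructions of work
total_ns_5_stage = (5 + instructions - 1) * clock_period_ns
total_ns_10_stage = (10 + instructions - 1) * clock_period_ns
print("difference as a fraction of the whole run:",
      (total_ns_10_stage - total_ns_5_stage) / total_ns_5_stage)

That fraction comes out around 5e-9, which is the sense in which the extra fill clocks just don't matter.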
July 28, 2002 4:31:02 AM

Yeah you're pretty much dead on and it just shows how much I really am too picky on some things, such as looking on one instruction.

So in the end, what you're saying is that the two P4s, with pipeline lengths of 10 and 20 respectively, would still perform close to each other at a given clock speed, only the longer-staged one would definitely have problems more often and should technically perform a tad worse?

Don't ask me why, but something is still bugging me about the 5 vs 10 design and how one instruction takes more time on the second at the same clock speed. It simply bugs me, even though you're telling me that in reality it doesn't matter, and all that was late was the 5 extra cycles on the 10-stage one before it was filled.

Though when I think of my shoe example, where each shoe factory (one with 5 workers and one with 10) has a 2-minutes-per-stage policy, the 5-worker one would at any time have an advantage of probably 10 minutes' worth of shoes over the 10-worker one. And from that I get lost on how the long-pipelined processor at the same clock speed still outputs around the same performance in the end. I guess you've explained that; I am just missing something here for sure, especially when I think of the shoe factory example and see how one has a slight advantage. But then I think of how you said 5 cycles is nothing, and I imagine the factory continues work for 10 hours, so I guess the slight advantage the 5-worker one had wouldn't make much of a difference in the end, no?

In fact, back to CPUs: we could still get more performance out of the 20-stage one if we improved the number of exec units, added prefetching methods, or Trace Cache-like tricks and so on, and in the end the 20-stage one at the low clock speed would still win. Am I right in stating this? That improving the IPC, the entire end efficiency, would help the final overall performance, so that regardless of the longer length, it can win?

This leads me to what I asked in another topic by some guy who refuses to think Hammer is the next 32-bit CPU from AMD (junkyy). I asked why AMD wouldn't think of just going for 20 stages or thereabouts for their CPU, instead of 12. I mean, if the performance per clock isn't overall that much less, and they did improve IPC anyway with the memory controller and some core optimizations, what would the longer pipeline have hurt, other than in fact making it a straight competitor, all the while with some real IPC in it? Wouldn't that make it nearly as competitive as the P4? Though its clock speed might not be as high, it could still easily reach the P4's performance and outrun it within a few speed bumps, no?

--
The sound of determination is the echo of will...
July 28, 2002 6:00:23 AM

Quote:
In fact back to CPUs, we could still be able to get more performance off the 20-staged one if we improved the number of exec units, added prefetching methods, or Trace Cache-like times and so, and in the end the 20-stage one at the low clock speed, would still win. Am I right by stating this? That improving the IPC, the entire end efficiency would help the final overall performance, so that regardless of the longer length, it can win?

Each execution unit could be said to be 1 pipeline. Although they all start at the same place, they eventually split up. So more execution units = more pipelines working in parallel with each other. Of course, the point about the trace cache being beefed up is still true, as that is part of the one big pipeline everything started out as.
The Athlon has 9 execution pipelines fed by 3 decoding pipelines, while the P4 has 6 execution pipelines fed by 1 trace cache. The trace cache is able to issue only 3 micro-ops per clock. This is, I think, one of the very basic limitations of the P4. The length of a pipeline, if you trace the path of an integer instruction and it was "20 stages", is relatively unimportant. As I said, these days prediction methods are very accurate. However, modern processors aren't just 1 pipeline; they're a huge combination of different processes going on at once, and optimizing how instructions are managed among those processes is where the real IPC gain or loss happens, along with the basic units themselves.
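A crude way to see that "narrow front end feeding a wide back end" point (the widths are the numbers quoted just above; treating sustained throughput as a simple min() is obviously a big simplification):

# Sustained micro-ops per clock can't exceed the narrowest point in the chain.
def sustained_uops_per_clock(front_end_width, exec_pipelines):
    return min(front_end_width, exec_pipelines)

print("P4-style (3-wide trace cache, 6 exec pipelines):",
      sustained_uops_per_clock(3, 6), "micro-ops/clock at best")
print("Athlon-style (6-wide issue, 9 exec pipelines):  ",
      sustained_uops_per_clock(6, 9), "micro-ops/clock at best")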

Quote:
This lead me to what I asked in another topic by some guy who refuses to think Hammer is the next 32-bit CPU from AMD (junkyy). I asked why wouldn't AMD think of just going for 20 stages or around that for their CPU, instead of 12? I mean if the performance per clock isn't overall this much less, and they did improve IPC anyways by the mem controller and some core optimizations, what would have the longer pipeline held wrong other than in fact making it a straight competitor all the while with some real IPC in it? Wouldn't that make it nearly as competitive as the P4, though its clockspeed might not be as high, it still can easily reach the P4's performance and outrun it in a matter of some speed bumps, no?

The P7 core (the 20-stage integer pipeline design used in the P4) took Intel 4 years to develop, and even then it was not quite ready. This is Intel, as in "shove billions into R&D" Intel. And it would take them somewhere around 5 years to make a design that had as much IPC as the Athlon's current design and was still scalable. Can you imagine what a feat that would be for AMD?
Hammer is an improvement, but it's an improvement over the K7 design. Many of the things in the core are left intact and are near-identical to those of the K7. x86-64 extensions were added, SSE2 extensions were added, as well as an extra packing stage in the pipeline (all pipelines) and some other cache-management features. It is certainly not as dramatic a change as the P6-to-P7 core switch.
July 28, 2002 9:45:41 AM

Quote:

A clock cycle isn't exactly a physical thing. It's just a description of how fast each stage can operate. The clock rate is generated on the chipset and the CPU takes that (the FSB) and multiplies it (each stage does a certain number of passes for every signal it receives from the FSB).

No, the CPU does not multiply the FSB. There is other electronic circuitry that does that - a multiplier - which is actually a pretty nifty chip in its complexity; it multiplies the FSB frequency.
The FSB frequency isn't generated by the chipset (it is generated for the chipset); it is generated by another nifty electric circuit called a PLL (Phase-Locked Loop), which generates a certain frequency from a reference frequency given by an oscillating crystal (like the quartz crystal whose frequency is used to keep time in quartz watches).

This post is best viewed with common sense enabled
July 28, 2002 9:56:26 AM

It's amazing that the Athlon was conceived and implemented in a record time of about a year and a half under the lead of ex-DEC EV6 (head?) designer Dirk Meyer.

The Hammer might only be an improvement, but it's a very big one - I doubt the P7 will get much bigger improvements over its 6 years of projected life.
The core improvements, along with the around-the-core improvements (high-speed interconnects for NUMA, and major cache improvements), make it a big change.


This post is best viewed with common sense enabled
July 28, 2002 5:48:51 PM

While I agree Hammer has some really nice improvements and additions (HyperTransport should help reduce motherboard costs), I also would not call it revolutionary. The Trace Cache, IMO, is a really nice, innovative addition, because the fact that it skips pipeline stages when it delivers a needed uOP means many operations or programs can run faster when the app is properly optimized for the P4 and for that caching. I am not a programmer, so I don't know whether 12,000 uOPs is a lot of space or not. I just like to study these things without the programming side, as I am a very big enthusiast of this.



--
The sound of determination is the echo of will...
July 28, 2002 6:00:31 PM

Sometimes billions of dollars are nothing if the company doesn't have competent developers. Not saying Intel has incompetent ones, no way, but if AMD was able to conceive a from-the-ground-up 10-stage K7 core, which to this day has been a very good performer with some very nice, powerful features (three fully pipelined FP units), I'd say they could take a shot at a K9 core with more stages. Also, do you know how many stages the K6 pipeline has? It seems it didn't last beyond 600MHz.

Also, I am now aware that splitting stages helps ramp clock speeds by letting the electrons once again flow comfortably within a cycle, but what happens when you lengthen the pipeline by ADDING stages with new functions? Does it still behave the same as splitting up existing tasks?
Going back to the shoe analogy, it doesn't seem to - it'd be like adding a worker who does paint jobs for the shoes, which isn't very practical really.

I personally don't know why I keep asking you the same questions about why instructions taking more cycles to complete on a longer-pipelined CPU, versus a same-clocked one with fewer stages, keep mixing me up.
Though if we looked at a continuous flow of instructions (as in a filled pipeline), and a new app suddenly asked for 40 instructions to be processed, and we watched that group of 40 going down the pipe, wouldn't the one with fewer stages still finish them a few cycles earlier, and therefore in almost any situation process faster? Of course I'm aware of the "this is just one of billions in a second" point, but I'm just wondering.
I dunno; when I think of a single instruction coming in, it will always take longer to finish on the same-clocked, longer-staged CPU, because it takes, say, 10 cycles instead of 5. But then I think of the last stage, which IMO is the most effective one since it's the one spitting out finished instructions, and that kind of contradicts my first thought. I guess this is why I'm mixed up here, and though you don't need to answer, I just want you to know why I'm confused: I keep seeing two sides that completely contradict each other, well, to me at least.

Quote:
And it would take them somewhere around 5 years to make a design that had as much IPC as the Athlon's current design and was still scalable. Can you imagine what a feat that would be for AMD?

Then how come Prescott should, from then on, be able to beat the Athlon PER CLOCK, so that if we ever threw a 1.8GHz Prescott P4 against a 1.8GHz Palomino, it'd beat the crap out of it, no?


--
The sound of determination is the echo of will...
July 28, 2002 7:53:48 PM

Quote:
No, the CPU does not multiply the FSB. There is other electronic circuitry that does that - a multiplier - which is actually a pretty nifty chip in its complexity; it multiplies the FSB frequency.
The FSB frequency isn't generated by the chipset (it is generated for the chipset); it is generated by another nifty electric circuit called a PLL (Phase-Locked Loop), which generates a certain frequency from a reference frequency given by an oscillating crystal (like the quartz crystal whose frequency is used to keep time in quartz watches).

Actually, there is a clock generator on the motherboard. The crystal is used to regulate frequency generation, but it's just a reference point - about 14.3 MHz if I'm not mistaken. I'm not sure whether the clock generator itself is part of the chipset. And the CPU multiplier is on the packaging, which is really part of the CPU.
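Putting rough numbers on the whole chain (the crystal, FSB, and multiplier values below are just example figures, not taken from any particular board):

# Toy illustration of how the core clock gets derived, and how long one
# clock cycle actually is. All values are assumed example numbers.

crystal_hz = 14.318e6        # reference crystal the PLL locks onto
fsb_hz     = 133.33e6        # front-side bus clock synthesized by the PLL
multiplier = 15              # CPU clock multiplier

core_hz      = fsb_hz * multiplier   # ~2.0 GHz core clock
cycle_time_s = 1.0 / core_hz         # ~0.5 ns per clock cycle

print(core_hz / 1e9, cycle_time_s * 1e9)  # ~2.0 GHz, ~0.5 ns per cycle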

Quote:
It's amazing that the Athlon was conceived and implemented in a record time of about a year and a half under the lead of ex-DEC EV6 (head?) designer Dirk Meyer.

The Hammer might only be an improvement, but it's a very big one - I doubt the P7 will get much bigger improvements over its 6 years of projected life.
The core improvements, along with the around-the-core improvements (high-speed interconnects for NUMA, and major cache improvements), make it a big change.

The point is, it was not a huge leap in core design. Rather, it was a vast array of small improvements here and there. Certainly not as dramatic a jump as the P6-to-P7 change. Hell, there is almost nothing identical about the P4 compared to the P3. The ALUs are different, the FPU is different, the decoding stage is different. About the only things that are the same are the legacy x87 FP registers and the general-purpose x86 registers.

Quote:
Sometimes billions of dollars are nothing if the company doesn't have competent developers. Not saying Intel has incompetent ones, no way, but if AMD was able to conceive a from-the-ground-up 10-stage K7 core, which to this day has been a very good performer with some very nice, powerful features (three fully pipelined FP units), I'd say they could take a shot at a K9 core with more stages. Also, do you know how many stages the K6 pipeline has? It seems it didn't last beyond 600MHz.

The K7 wasn't built from the ground up. I think the original design was bought by AMD and improved upon.

Quote:
Also, I am now aware that splitting stages helps ramp clock speeds by letting the electrons once again flow comfortably within a cycle, but what happens when you lengthen the pipeline by ADDING stages with new functions? Does it still behave the same as splitting up existing tasks?
Going back to the shoe analogy, it doesn't seem to - it'd be like adding a worker who does paint jobs for the shoes, which isn't very practical really.

I don't think so. That is, adding extra stages with new functions won't help the ramping of clock speed. But then again, it would depend on what those stages are for, I guess. I can't come up with an example right now, but some engineer somewhere may find a way to help ramping by adding another function.

Quote:
I personally don't know why I keep asking you the same questions about why instructions taking more cycles to complete on a longer-pipelined CPU, versus a same-clocked one with fewer stages, keep mixing me up.
Though if we looked at a continuous flow of instructions (as in a filled pipeline), and a new app suddenly asked for 40 instructions to be processed, and we watched that group of 40 going down the pipe, wouldn't the one with fewer stages still finish them a few cycles earlier, and therefore in almost any situation process faster? Of course I'm aware of the "this is just one of billions in a second" point, but I'm just wondering.
I dunno; when I think of a single instruction coming in, it will always take longer to finish on the same-clocked, longer-staged CPU, because it takes, say, 10 cycles instead of 5. But then I think of the last stage, which IMO is the most effective one since it's the one spitting out finished instructions, and that kind of contradicts my first thought. I guess this is why I'm mixed up here, and though you don't need to answer, I just want you to know why I'm confused: I keep seeing two sides that completely contradict each other, well, to me at least.

As I said, there will be a latency of 5 clocks more on a 10-stage design versus the 5-stage one. However, this is a constant latency. It doesn't grow as the number of instructions processed grows, and it doesn't grow as time goes on. It's not "per clock", it's just an initial lag. Think of it this way: two groups of people are walking in line down two roads, and one road is twice as long as the other. Both groups walk at the same speed, and once people start coming off the end of a road, they emerge from both roads at the same rate. The group on the shorter road just starts emerging earlier, by however much shorter the road is. But it will always be only an initial difference in time.
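Here is the same idea as a tiny sketch (idealized: one instruction advances one stage per clock, no stalls; the depths and counts are just example numbers):

# Idealized pipeline timing: with no stalls, N instructions take
# (depth - 1) + N clocks for all of them to come out the far end.

def cycles_to_finish(n_instructions, depth):
    return (depth - 1) + n_instructions

for n in (1, 40, 1_000_000):
    short = cycles_to_finish(n, depth=5)
    long_ = cycles_to_finish(n, depth=10)
    print(n, short, long_, long_ - short)

# The gap is always 5 cycles, whether 1 instruction, 40, or a million
# go through: a fixed fill latency, not a per-instruction cost.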

Quote:
Then how come Prescott should, from then on, be able to beat the Athlon PER CLOCK, so that if we ever threw a 1.8GHz Prescott P4 against a 1.8GHz Palomino, it'd beat the crap out of it, no?


I really can't say. Prescott may be a wonderful per-clock performer, or it may not be. Who knows. But by 2003, the P7 core will have been in design for 7 years. Its design did start in 1996.
Edited by imgod2u on 07/28/02 01:04 PM.
July 29, 2002 4:31:14 AM

Quote:
As I said, there will be a latency of 5 clocks more on a 10-stage design versus the 5-stage one. However, this is a constant latency. It doesn't grow as the number of instructions processed grows, and it doesn't grow as time goes on. It's not "per clock", it's just an initial lag. Think of it this way: two groups of people are walking in line down two roads, and one road is twice as long as the other. Both groups walk at the same speed, and once people start coming off the end of a road, they emerge from both roads at the same rate. The group on the shorter road just starts emerging earlier, by however much shorter the road is. But it will always be only an initial difference in time.

Yeah, I get that now, but IMO it would still apply: if a new app sent in a set of 40 instructions, and we colored them blue so we could watch them, the shorter-pipelined one would still get a little advantage. That means that in one second you could have many gaps where an app sends in a set, and in the end the shorter-pipelined one at the same clock always comes out performing a bit better, especially when occasional sets are being sent in for that app. Though the gap isn't big, you get what I mean. It'd be so nice if one day we could see two otherwise identical CPU cores running with different pipeline lengths; it'd really show who is contradicting whom here. Well anyways, as you say, "one in billions", don't matta!

--
The sound of determination is the echo of will...
July 29, 2002 6:00:26 AM

You need to get some perspective on processing. In benchmarks or any other intensive task (or even a not-so-intensive one) the CPU is at 100% utilization - which means it is constantly working and switching threads between the different processes running on the OS (the whole definition of "CPU utilization" is how much time a certain process executes relative to the total time allocated to executing all programs in one scheduling cycle). This means the processor is, most of the time, faced with incoming instructions and the pipeline is filled the majority of the time - the things that make it go empty are branch mispredictions, cache misses and maybe some other stuff I don't know of - but in those cases the long pipeline presents more latency.
I can't think of any modern task which "suddenly" (CPU 100% idle) asks for just 40 instructions - even opening Notepad should generate a nice stream of instructions, both from Notepad and from the multitasking OS.
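One rough way to put numbers on that (a toy model; the flush rates and penalties are assumptions for illustration, not measurements):

# Toy model: the effective cycles-per-instruction grows with how often
# the pipeline has to be flushed and refilled (branch mispredictions,
# etc.). The refill penalty is roughly the pipeline depth.

def effective_cpi(base_cpi, flushes_per_instruction, refill_penalty):
    return base_cpi + flushes_per_instruction * refill_penalty

for rate in (0.001, 0.01, 0.05):
    print(rate,
          effective_cpi(1.0, rate, refill_penalty=5),    # shallow pipe
          effective_cpi(1.0, rate, refill_penalty=10))   # deep pipe

# At 1 flush per 1000 instructions the two are nearly identical; at 1
# flush per 20 the deeper pipe's refill cost starts to show clearly.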

This post is best viewed with common sense enabled
July 29, 2002 7:37:13 AM

Pretty much. It's also the reason why memory bandwidth and latency are such an issue on the longer-pipeline design. You absolutely must make sure there is a flow of instructions going into the processor at all times.
This is what Intel hopes to address with Hyperthreading and why it may indeed help. All they need to do is widen the decoding/issuing of instructions and micro-ops so that the execution units actually have enough to do. x86 has a "sweet spot" of 3 instruction decodes per clock, so 2 threads would have a "sweet spot" of 6 x86 decodes per clock. The P4 decodes 1 x86 instruction per clock (albeit the trace cache does make up for it somewhat). In code that repeats small sequences of instructions over and over again, the P4 would have a significant advantage, but you still have cases where code isn't being re-used, in which case you need a beefy enough decoding stage to effectively feed the execution units with incoming instructions.
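A crude way to picture that front-end bottleneck (all the widths and hit rates below are assumed toy figures, not real P4 numbers):

# Toy model: when code replays out of the trace cache you get its issue
# width; when it doesn't, you're stuck at the raw x86 decode rate. Either
# way the front end, not the number of execution units, sets the ceiling.

def front_end_uops_per_clock(trace_hit_rate, issue_width=3.0, decode_width=1.0):
    return trace_hit_rate * issue_width + (1.0 - trace_hit_rate) * decode_width

for hit in (0.95, 0.60):
    print(hit, front_end_uops_per_clock(hit))

# ~2.9 uops/clock when tight loops replay from the trace cache, ~2.2
# when a lot of fresh code must go through the single decoder - well
# short of keeping 6 execution pipelines busy by itself.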
July 29, 2002 6:58:28 PM

That was something I was wondering about: how can the P4 possibly issue to, or fully utilize, its 6 execution pipelines at once?
I know you talked about data dependencies, but is it different on the P4?
Give me an example if you can, btw - that'd really help.
This also leads me to another thing: the scheduler.

On page 2 of this thread, you talked about data dependencies. You also mentioned how, when you get 3 x86 instructions for example, they get decoded into 6 uOPs and sent to the scheduler buffer (you didn't specify which one, though, so I don't quite get that either). You then said that maybe 4 of those uOPs depend on the other 2. Does that mean, in the end, that an instruction on a 10-stage CPU, for example, would not always take 10 cycles to be processed, but might sometimes have to wait in a stage with a buffer, so that if we measured this one uOP's time it would come out to more than 10 cycles?
If so, how do the 2 uOPs signal back to the CPU that it can now dispatch the remaining 4 uOPs? Or were they intentionally released because 5 others came in from another decoded instruction? Also, what kind of entries does a scheduler's entry count represent, and how does it work? I read Ars' presentation of the K7's scheduler, but I couldn't entirely grasp how it works.
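If I try to write down what I think happens, it looks something like this toy sketch (the uOPs, dependencies and latencies are all made up by me, so correct it if I've got the idea wrong):

# Toy sketch of a scheduler: a uOP waits in the buffer until the uOPs it
# depends on have produced their results, so its total trip through the
# machine can be longer than the minimum pipeline length.

uops = {
    # name: (uOPs it needs results from, execution latency in cycles)
    "load_a": ([],                   3),
    "load_b": ([],                   3),
    "add1":   (["load_a", "load_b"], 1),
    "add2":   (["add1"],             1),
}

done_at = {}  # cycle at which each uOP's result becomes available
for name, (deps, latency) in uops.items():
    ready_at = max((done_at[d] for d in deps), default=0)  # time spent waiting in the buffer
    done_at[name] = ready_at + latency

print(done_at)  # {'load_a': 3, 'load_b': 3, 'add1': 4, 'add2': 5}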

--
The sound of determination is the echo of will...