
AMD vs Intel on the performance equation

November 13, 2001 6:30:10 PM

<b>Performance = Clock Speed x Operations/Cycle</b>

Looking at the equation, to get high performance, we need to have a high clock speed and also high operations per cycle.
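To make the equation concrete, here's a toy Python sketch; the clock and IPC figures below are made up for illustration, not measured:

```python
def performance(clock_mhz, ops_per_cycle):
    """Naive performance model: clock speed times operations per cycle."""
    return clock_mhz * ops_per_cycle

# Hypothetical figures, purely for illustration:
p4 = performance(2000, 1.0)      # higher clock, lower ops/cycle
athlon = performance(1533, 1.3)  # lower clock, higher ops/cycle
print(p4, athlon)                # the two come out roughly comparable
```

On these made-up numbers the two chips land within half a percent of each other, which is the whole point of the trade-off.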

Intel keeps bringing out higher-clocked Pentiums than the competing Athlons, but their "operations per cycle" figures are lower.

Athlons, on the other hand, have a higher figure for "operations per cycle", although they have lower clock speeds. Hence they are able to match the performance of the Pentium processors.

What if Intel comes out with a higher "operations per cycle" figure as well, by using, for example, higher bus speeds and higher cache transfer rates? Athlons have a larger Level 1 cache, and that is mainly how they are able to match the performance of the Pentiums. What if the Pentium came out with this large Level 1 cache too (is it possible!!)?

This would result in Pentium computers outperforming Athlons by miles. Athlons would not be able to keep up, because AMD cannot manufacture higher-clocked processors the way Intel can. (I mean, AMD is having a hard time bringing out higher-clocked processors.)

Pentiums have gone beyond 2000MHz, and soon, when Northwood comes out, they will go much further. Where is the Athlon in clock speed?


Don't start blasting me, kids. I'm only a "stranger". Although I use a computer 24/7, and I know what the inside of a computer looks like and what each component does, I don't know much about the technical issues. Correct me if I'm wrong.

Cheers.
November 13, 2001 6:52:37 PM

It all comes down to whether you want to pay the extra $200 for the name and the extra memory bandwidth that Intel has.

-¤ Shut the f*ck up or go AMD ¤-
November 13, 2001 7:44:18 PM

Hey, good to see someone actually stretching their brains, instead of sitting on their ass doing nothing :) 

Let's see if I can shed some light on the subject...

Higher "operations per cycle", or "instructions per clock" (IPC), as AMD likes to call it, has to do more with multiple simultaneous pipelines than the bus speed or the cache. The bus and cache help the clock speed, not the IPC.

The P4's small L1 cache is the main reason it has poor performance. 8k is not nearly enough to satisfy a 2.2GHz beast. The Northwood should have been designed with more L1 cache to go along with its doubled L2. That is not the case, regrettably.

You're right. If Intel suddenly matched or beat the Athlon's IPC, then they would blow by AMD. However, if AMD suddenly matched or beat the P4's clock speed, they would blow by Intel.

As for bus speed, remember that the P4 operates on a quad-pumped 100MHz bus, reaching effective 400MHz speeds. The Tbird/Palomino is on a double-pumped 100/133MHz bus, effectively reaching 200/266MHz.
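Quad- and double-pumping just multiply the base clock; with the 64-bit FSB width both chips used, the peak bandwidth falls out directly. A quick sketch (peak numbers, ignoring all real-world overhead):

```python
def effective_bus(base_mhz, pumps, width_bits=64):
    """Effective transfer rate (MHz) and peak bandwidth (MB/s) of a pumped FSB."""
    eff_mhz = base_mhz * pumps
    bandwidth_mb_s = eff_mhz * width_bits / 8  # one transfer per effective clock
    return eff_mhz, bandwidth_mb_s

print(effective_bus(100, 4))  # P4 quad-pumped 100MHz: (400, 3200.0) -> 3.2 GB/s peak
print(effective_bus(133, 2))  # Athlon double-pumped 133MHz: (266, 2128.0) -> ~2.1 GB/s
```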

<font color=orange>Quarter</font color=orange> <font color=blue>Pounder</font color=blue> <font color=orange>Inside</font color=orange>
November 13, 2001 8:45:19 PM

Well said Burger.

Quote:
If Intel suddenly matched or beat the Athlon's IPC, then they would blow by AMD. However, if AMD suddenly matched or beat the P4's clock speed, they would blow by Intel.

True, but you can't play the what if game.

As far as clock speeds vs. IPC and which each CPU can add, there is a tradeoff. It would seem that it's hard to make a CPU do both well. AMD is performing better because they have done both a bit better at the moment. From the roadmaps I see, it looks like AMD will be turning the clocks up soon to go with their better IPC. Intel is simply aiming to turn the clock up to incredible numbers.

Performance = Clock Speed x IPC

That's not quite true. If it were, a P4 2.0GHz would be EXACTLY 25% faster than a P4 1.6GHz, and that is not the case. You begin to lose a bit of performance if all you do is change the multiplier.

Here's one last look at a RL example of this not being true.

A Pentium 133 was faster than a Pentium 150. Both had the same IPC, so according to the formula there, the 150 should be faster. The problem was that the P150 ran on a 60MHz bus, while the 133 was on a 66MHz bus. That 10% bus speed difference affected not only the CPU, but the memory and the expansion cards. This obviously isn't the best illustration, but it does prove my point: you can't simply chalk up performance to a simple equation.
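The P133-vs-P150 story is exactly where the naive formula breaks. A sketch of the prediction it makes (the IPC of 1.0 is a placeholder; only the ranking matters):

```python
def naive_perf(clock_mhz, ipc):
    """Performance = clock x IPC, with nothing else in the model."""
    return clock_mhz * ipc

# Same core, same IPC -- the formula insists the P150 must win:
p133 = naive_perf(133, 1.0)
p150 = naive_perf(150, 1.0)
print(p150 > p133)  # True -- yet the P133's 66MHz bus often made it faster in practice
```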

Chestnuts roasting on an open CPU
Bill Gates nipping at your wallet
November 13, 2001 8:47:41 PM

It's much more than just L1 cache that holds back the P4.

The 20-stage pipeline is also holding it up; Intel deliberately gave it such a long pipeline so they could ramp up clockspeeds higher. Intel's best answer to the problem is HyperThreading Technology, which isn't even released and probably won't be released for at least a year yet. When it <i>is</i> released, it will spend the first part of its life as a Xeon-specific feature.

The Pentium 4 also has a hacked-up ALU and a miserably deficient x87 FPU. The x87 FPU hurts it more than anything. Intel was planning to force their new SSE2 instruction set into mainstream use quickly to make up for it--but the Athlon is even managing to defeat the Pentium 4 in SSE2-enabled benchmarks, so it doesn't look all that promising. Not to mention, we have no idea how many apps will be SSE2-optimized in the future. Currently, the majority of apps are <i>not</i> SSE2-optimized.

Quote:
Pentium's have gone beyond 2000MHz, and soon when Northwood will come out, they will go much further. Where is Athlon in clock speed?

How much further? I believe the .18u P4's speed advance was put on hold primarily because of power issues. The power draw on a .18u P4 is outrageous. Athlons consume about 33% less power for the same performance, and people were already up in arms about the Athlon's increased power demands.

It seems the P4, like the Athlon, is also hitting a brick wall in terms of clock scaling. 0.13u will help with the Northwood, but it will also help with the Thoroughbred.

Kelledin
<A HREF="http://www.linuxfromscratch.org/" target="_new">LFS</A>: "You don't eat or sleep or mow the lawn; you just hack your distro all day long."
November 13, 2001 9:26:44 PM

I'm not an expert on processors. But even if Intel were to increase the L1 cache on the P4, would it be able to keep up with an Athlon clock for clock? I know Athlons have a much higher cache hit rate. But wouldn't the lower IPC on the P4 still hold it back? Also, when AMD goes 0.13 micron,
they will only increase the pipeline by 2 stages, and IPC is supposed to increase. The only thing I see holding back the Athlon is the FSB. They seriously have to increase it, and soon.
November 13, 2001 10:18:35 PM

Actually there is even more =)

First we have the decoder. The P4 can decode 1 instruction every cycle. The Athlon can decode 1 complex instruction per cycle, or up to 3 if they are "DirectPath", meaning they consist of at most 2 micro-ops. Now, this doesn't mean that the K7 executes at 3 times the speed of the P7. The P7 has something called an execution trace cache, which means that the L1 instruction cache of the P7 caches instructions after they have been decoded instead of before. Applications spend 90% of their time in some inner loop, so the instructions will be found in the ETC most of the time. But: the ETC can only issue 3 micro-ops to the execution ports, where the K7 can issue up to 6 micro-ops (3 MacroOPs @ 2 micro-ops each).

That means that both the K7 and P7 can execute 3 simple instructions, like adding two registers, per cycle. A slightly more complex instruction, like adding 3 variables in memory to 3 registers, takes one cycle on the K7 but 2 cycles on the P7. These are of course best-case scenarios where no instruction depends on the result of another.
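The best-case cycle counts above follow straight from the issue widths. A minimal sketch (the 6-vs-3 widths are as described in this post; the model ignores dependencies, decode limits, and everything else):

```python
import math

def best_case_cycles(n_uops, issue_width):
    """Cycles to issue n_uops micro-ops, assuming no dependencies or stalls."""
    return math.ceil(n_uops / issue_width)

# Adding 3 memory variables to 3 registers ~ 6 micro-ops (a load + an add each):
print(best_case_cycles(6, 6))  # K7, up to 6 micro-ops/cycle: 1 cycle
print(best_case_cycles(6, 3))  # P7, 3 micro-ops/cycle from the trace cache: 2 cycles
```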

Then we have the issue ports and execution units. Up to three FPU instructions can begin execution each cycle on the K7 if they are of "different types". That is, you have to issue, for example, one add, one multiply and one load/store to make use of all units. The P7 only allows 2 FPU micro-ops per cycle (1 operation + 1 store).

Then we have the latency of FPU instructions. In general, operations on the P7 take longer to complete due to its longer pipelines. For example, the FADD instruction, which adds two real numbers in two registers together, completes in 4 cycles on the K7 and 5 cycles on the P7. This doesn't always matter, as the beauty of pipelined processors is that a new instruction can begin execution every cycle even though each instruction takes many cycles to execute. But if you need the result of an add after 4 cycles, then you will have to wait an extra cycle on the P7 for the result to become available. If this happens, you should probably fire your software engineer before you change your hardware.
Note: The latencies of SSE/SSE2 instructions are very good on the pentium.
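The latency-vs-throughput point can be put as a toy model: independent ops start one per cycle, while a dependent chain pays the full latency every time (FADD latencies of 4 and 5 cycles as quoted above; everything else is idealized):

```python
def completion_cycle(n_ops, latency, dependent):
    """Cycle on which the last of n pipelined ops completes."""
    if dependent:
        # each op must wait for the previous result before starting
        return n_ops * latency
    # ops start one per cycle; the last one starts at cycle n-1
    return (n_ops - 1) + latency

print(completion_cycle(10, 4, dependent=False))  # K7, independent adds: 13
print(completion_cycle(10, 5, dependent=False))  # P7, independent adds: 14
print(completion_cycle(10, 4, dependent=True))   # K7, dependent chain: 40
print(completion_cycle(10, 5, dependent=True))   # P7, dependent chain: 50
```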

As for your main question: what if Intel increases its IPC? Well, they probably will, but the Pentium 4 will never reach the IPC of an Athlon, because the P4 is designed for high clock speeds. The reason the P4 can be clocked so high is that some "features" are sacrificed at the microarchitectural level. There is no such thing as a free lunch. Or maybe there is? With Northwood arriving early next year and AMD staying with 0.18u for another 9 months (?), Intel has a golden opportunity to either raise the clock frequency to insane levels or increase their IPC (I hate that word...). Both options would leave the Athlon in the dust, because I don't think the K7 can be clocked much higher at the present manufacturing process. The K7 architecture is more advanced than the P6, and Coppermine freaked out @ 1.13GHz. I think it would be weird if the designers at AMD could reach twice the clock speed of their Intel brothers, who went to the same school (as someone put it in another thread).

Of course, the main objective of any company is to maximize its earnings, and even if Intel had the technology to produce Northwoods at 3GHz, it would not automatically mean they would release one. They probably make more money releasing a processor every 3 months, with each release adding 200MHz to the frequency, compared to releasing all flavours at once.

Well. That's about it =)
Maybe there is even more?
/Markus
November 13, 2001 10:37:40 PM

It's all about the $$$.

-¤ Shut the f*ck up or go AMD ¤-
November 13, 2001 11:25:12 PM

Very good job mala. I never thought of it that way.

KG.
November 14, 2001 12:50:42 AM

Nicely written and well thought out Mala, but I have one possible correction to make....

AMD should have Thoroughbreds (.13 micron) relatively early in 2002. If Jerry is to be believed, the XP2200+ will be released before the end of the 1st quarter next year, which is only a couple of months (no more than 3) after Northwood. Being optimistic, Thoroughbred might come out by mid 1st quarter, basically nixing just about any advantage Northwood may gain within a few weeks.

Being more realistic, it looks like Northwood might bring the performance crown back to Intel for 2-4 months.

Fun stuff, ehh?

Mark-

When all else fails, throw your computer out the window!!!
November 14, 2001 3:21:54 AM

>Currently, the majority of apps are not SSE2-optimized

kelledin, most of your post is BS so I won't even comment on it, but I'll say this: most professional (that's like <1% of you folks!) apps are getting optimized for the P4. Take a look at Lightwave, 3ds Max, Maya; Photoshop and others are coming out! These are industry standards. Just because they have not done MS Word or your fav games doesn't mean things aren't happening.

"<b>AMD/VIA!</b>...you are <i>still</i> the weakest link, good bye!"
November 14, 2001 3:40:11 AM

Quote:
Just because they have not done MS Word or your fav games doesn't mean things aren't happening.

As I said, the majority of apps are not SSE2-optimized. There's a difference between "getting optimized" and "is optimized." Plus, we've already seen that SSE2 often doesn't put the P4 in the lead.

If so much of my post is BS, it should be no problem for you to refute it. :tongue:

Kelledin
<A HREF="http://www.linuxfromscratch.org/" target="_new">LFS</A>: "You don't eat or sleep or mow the lawn; you just hack your distro all day long."

P.S. I think it's hilarious how every time the Athlon beats the P4 at something, you claim it's "badly written software."

You kept cheering for Maya+SSE2--then stopped cheering when SSE2 came to Maya, and the P4 still couldn't win the benchmarks.

Then you pointed out Chris Blanos's Lightwave benchmark database as a symbol of the P4's SSE2 superiority--until a dual AthlonMP 1800+ took top rankings.

At your current rate of denial, every piece of software will be labelled as "crap" within the year... :lol: 
November 14, 2001 3:48:12 AM

Well written analysis Markus!

Although much of it is best-case scenarios, it does happen in the real world! And that is reflected in the performance of the processors. Why don't these guys take off their orange glasses and see the green? (both P3 and Athlon ;-))

Intel already has the P4 RAE running at 4GHz, and that is on a 0.18u chip. So I guess 0.18 still has some juice left. Maybe that's why the AMD guys have decided to stay at 0.18 for the time being!

The IPC (I too hate this term - the P4 has simply left it meaningless, blurring the instructions per second *handled* vs. *executed*!) and a thinner die seem to be marketing gimmicks. Even a VIA C3 boasts of being a green chip fabricated with 0.15u technology. But how does it perform? Even a Celeron beats it!

girish

<font color=red>No system is fool-proof. Fools are Ingenious!</font color=red>
Anonymous
November 14, 2001 10:40:56 AM

>Intel already has the P4 RAE running at 4 GHz, and that is
>on a 0.18u chip.

Hmm... what is "RAE"? And are you serious about this? Have a link? FOUR gigahertz @ .18u?

= The views stated herein are my personal views, and not necessarily the views of my wife. =
November 14, 2001 10:44:01 AM

Mala, great technical info. You forgot to mention as well that with the P4's longer pipeline, branch prediction misses hurt it much more than the Athlon. I also believe the Athlon has a better branch prediction rate than the P4 (but am not 100% positive).

~Matisaro~
"The Cash Left In My Pocket,The BEST Benchmark"
~Tbird1.3@1.5~
November 14, 2001 11:50:39 AM

All true guys,
but my few coins on the equation at the start.
Performance is nearly impossible to measure, which is why most comparisons of processors use a dozen or so "different" benchmarks.
Clock speed has something to do with it, as does the amount done per clock cycle (on average). But it hardly stops there.
You need to consider memory bandwidth; this is where the P4 can gain or lose. It will lose out if you can't feed it data, because the cost of not providing enough data is larger (i.e. the quicker the clock cycle, the more idle instruction cycles for a given memory latency). However, the P4 would streak ahead given good memory bandwidth, because it could improve further by accepting and processing more data in a given time than an Athlon (this is debatable, but generally speaking...). Unfortunately for the P4, the memory specs of today's tech leave processors waiting, and TLBs, hardware prefetch and yada yada are the attempts to make better use of this waiting time, or really, to remove waiting time.

As for the IPC complaints and thoughts on AMD's clock speed versus Intel's, I believe the P4 gained its clock speed by the simple feat of lengthening its pipeline. To put it very, very basically, AMD wins out at the moment because memory cannot support such high data throughput requirements. AMD chose lower clock speeds but "let's do more at each stage of our pipeline" (essentially shortening the pipeline), while Intel said "let's do less per stage of the pipeline so that we can reach higher clock speeds". AMD's choice was better FOR THE MOMENT.

Perhaps I should note something about clock speed.
Clock speed is limited by the latency of signals (transistors turning on and off) through a pipeline stage. All pipeline stages should be similar to get the best efficiency per clock cycle; more on that later.
Intel could probably get a clock to go 40GHz. But if they attached it to a P4 core, well, no data would ever get through a pipeline stage, and it certainly would get out of sync real quick.
Here's an IPC vs. clock-speed line:

|<-- high clock rates ----------------|--------|------------------ big IPC -->|
                                    Intel     AMD

One extreme is to have tiny stages, maybe 5 transistors each. You can have massive clock speeds, with 100-stage pipelines and simple micro-ops like 'AND bit with bit', 'XOR bit with bit'. This sounds ridiculous, I know.
The other extreme is to have low clock speeds, but your micro-ops are things like "calculate an FFT on two 32KB arrays", "parse 5 simultaneous Perl scripts", etc.
-- you're doing heaps per clock cycle but can't really run very fast, because the data might take milliseconds to get through any one stage of the pipe.

Intel's P4 and AMD's Athlon both fall around the middle, which has the best advantages from both sides, but they played the trade-off game differently.

Hope any of that helps.
Feel free to pick it to bits, because it's 12:50am here and I need to be asleep; probably come the morning even I'll not know what I was talking about.

pray for my coherence please... thank you

Balzi

So performance is lots of different things all mixed up.

"I spilled coffee all over my wife's nighty... ...serves me right for wearing it?!?"
Anonymous
November 14, 2001 1:05:14 PM

I'm gonna print this lot off and show it to my grandma - ha - she'll think we're aliens.
:lol: 

I am NOT responsible for any damage the info in this post may cause to your system.
November 14, 2001 2:28:53 PM

You're not? Get him, guys!!

<font color=orange>Quarter</font color=orange> <font color=blue>Pounder</font color=blue> <font color=orange>Inside</font color=orange>
Anonymous
November 14, 2001 2:50:27 PM

"Feel free to pick it to bits"

why? I rather liked it. I was a bit confused about the line thing about half way down.
Anonymous
November 14, 2001 3:04:32 PM

It was probably the most informative post I've read.

You are responsible for any damage the info in your post may cause to my system.
Anonymous
November 14, 2001 3:13:06 PM

Duh *run*

You are responsible for any damage the info in your post may cause to my system.
November 14, 2001 5:52:33 PM

But you can't fit all that many transistors on today's sized cores, can you?
AMD's core is smaller than Intel's; why don't they use that to put more transistors on a specific (or custom-made) CPU?
Does a deeper pipeline mean more energy consumption?
SSE/SSE2 -- how effective are they compared to just executing the proggy?

I will be in my base (army) for the next few days; it would be great if you could mail me with the response. Thx a bunch.

<font color=green>
*******
*K.I.S.S*
*(k)eep (I)t (S)imple (S)tupid*
*******
</font color=green>
November 14, 2001 5:52:42 PM

Quote:
If Intel suddenly matched or beat the Athlon's IPC, then they would blow by AMD. However, if AMD suddenly matched or beat the P4's clock speed, they would blow by Intel.


The role of the CPU clock is to co-ordinate the CPU activities. Each time the clock pulses, operations waiting to happen are carried out. That means that each part of an instruction can only be accessed every time the clock pulses. So an instruction occupying three bytes of memory will take three clock pulses to carry out (i.e. there are three memory reads required). Clock speed is therefore important. Potentially, the faster the clock the faster the instruction cycle time.

But the thing is, it is much more difficult to develop processors with higher clock speeds. So Intel have an advantage, because they have already produced high-clocked processors. All they have to do is use a different architecture (maybe similar to AMD's) that will do more instructions per clock.

Remember,

Performance = clock speed x instructions per cycle


AMD, on the other hand, are having a hard time increasing clock speed (I don't blame them, because it is very difficult to do), but they do have an architecture which carries out more instructions per clock.

I think Intel is at an advantage, but I don't see why they don't modify their systems so that they will do more instructions per cycle. Then they will have a high figure for clock speed and a high figure for instructions per cycle, and therefore high overall performance.


It is true that performance depends on factors that you guys have mentioned, such as memory bandwidth, cache transfer rate, etc., but all of these play a role in carrying out instructions, so a large memory bandwidth would help carry out more instructions every time the clock pulses. So this is already implemented in the "instructions per cycle" term of the equation.

My question is, why doesn't Intel change the structure of their computers (maybe make them similar to AMD's, apart from the CPU of course) so that they would carry out more instructions per clock, and hence get better performance?
November 14, 2001 9:35:33 PM

OK, sorry Conqueror, I mustn't have made my point clear enough.

The point my friend, behind higher clock-speeds and more done per clock-cycle, is that you MUST trade between massive clock-speeds and "IPC".

Intel has chosen to have less done at each stage of their pipe, which is the *MAIN* reason why they can hit higher clock-speeds.

AMD have chosen a different tack (note 1): they have decided to do a little more per clock cycle (the more you do, the easier it is to tweak and get a higher average "IPC"), but they can't scale their clock speed as high as Intel can.

Imagine, for the sake of this example, that it takes 10ps (picoseconds) to switch a transistor on a 0.18um process. If Intel's pipe stages were 10 transistors, then maybe AMD's pipe stages could be 20 transistors (??).

So each time Intel feeds data to a stage, it takes 10 x 10ps = 100ps to complete. So if you clocked faster than 1/100ps = 10GHz, you would not have valid data at the end of the pipe... therefore Intel's machine would have a high-end clock limit of 10GHz, assuming all pipe stages were 10 transistors or less.

AMD, on the other hand, would need 20 transistors x 10ps = 200ps per clock cycle, which is 5GHz. So it can't 'start' as many instructions per second (5 billion vs. 10 billion), but it does just as much processing by doing twice as much with its 20 transistors as Intel can do with their 10 transistors.
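The worked numbers above, as code (all the figures here are the hypothetical ones from this example, not real transistor counts or switching times):

```python
def max_clock_ghz(switch_ps, transistors_per_stage):
    """Max clock given the propagation delay through one pipeline stage."""
    stage_delay_ps = switch_ps * transistors_per_stage
    return 1000.0 / stage_delay_ps  # a 1GHz clock has a 1000ps period

print(max_clock_ghz(10, 10))  # "Intel": 10-transistor stages -> 10.0 GHz
print(max_clock_ghz(10, 20))  # "AMD": 20-transistor stages -> 5.0 GHz
print(max_clock_ghz(5, 10))   # same stages on a faster process -> 20.0 GHz
```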

Now to address the point of Intel doing as much as AMD per clock, or AMD meeting Intel's clock speed: it is not that AMD's tech is slower, or that Intel's tech has bad IPC. It's all a matter of trade-offs.

Intel cannot do as much as AMD because, to get 10GHz, they made each stage only 10 transistors, to reduce **PROPAGATION DELAYS** (say it with me: propagation).
However, AMD cannot match Intel's 10GHz because their pipeline stages have 20 transistors and therefore double the propagation delay.
So you can't get both with current technology, unless you get your transistors to switch faster (that is exactly what SOI - silicon on insulator - tries to achieve).

So if they both move to a 0.13um process, or when AMD employs SOI, maybe the transistors will switch in 5ps. So now Intel's architecture can hit 5ps x 10 transistors = 50ps, or 20GHz, but AMD will also hit 10GHz and will still be doing more per clock cycle.

Now, scale this all up to reality and the main concepts still hold. 0.13um probably won't double clock-speeds though.

But after all that is said, you can see why clock speed matters a lot less than how smart your design team is, how you can optimise your pipeline, how your branch prediction unit works, how your parallel FPUs and ALUs work together, etc.

That's the end of "Balzi knows best", part 2... hehehe

Hope I cleared your head out, Conqueror. Hey, are you related to my web browser? His name is Konqueror.

"I spilled coffee all over my wife's nighty... ...serves me right for wearing it?!?"
November 14, 2001 10:24:15 PM

<A HREF="http://www.blanos.com/benchmark/bprint.cgi?lw_scene=var..." target="_new">Yes, even with SSE2 optimizations on and an 800MHz handicap.</A>

Only reason the 1800MP isn't there is that it hasn't been put through those benchmarks yet. Actually use the search feature, and you'll see which one wins. :tongue:

Kelledin
<A HREF="http://www.linuxfromscratch.org/" target="_new">LFS</A>: "You don't eat or sleep or mow the lawn; you just hack your distro all day long."
November 14, 2001 10:51:42 PM

Not fair, the AthlonMP has twice as much RAM.

KG.
November 14, 2001 10:57:46 PM

*sniff sniff*

Ahh...the sweet sound of hairs being split down the middle.

<font color=orange>Quarter</font color=orange> <font color=blue>Pounder</font color=blue> <font color=orange>Inside</font color=orange>
November 14, 2001 11:00:36 PM

From LoveGuRu
"SSE/SSE2 -- how effective are they compared to just executing the proggy?"

SSE/SSE2, 3DNow! and the MMX stuff... it's all pretty much the same, just tailored differently for different CPU cores.

When coding in standard asm you use codes like mov eax,yada... add cx,y... add y,bx... whatever. They are "generally" small tasks. A better example is probably:

let's say you wanna code C = (A*D + A*B) % 10;
(now there's probably no SSE/SSE2 instruction for this.)
but normally you would have to execute
A*D -> save result.
A*B -> add to previous result.
result % 10 -> store in C.

That's three instructions, and takes 3 clock cycles (well, kinda; pipelines make this all messy, but imagine it takes 3).

The reason it takes 3 clock cycles and can't be calculated at once is that the CPU is not capable of knowing the final outcome of the maths; it just executes little bits of math in whatever order you tell it.

A smart compiler might even be able to say "hey, let's add D and B and multiply by A later, just to save dragging A out of memory twice" -- but it's still 3 cycles.

Now the CPU might be capable of doing add/mul integer maths on a few numbers in parallel, and doing modulo arithmetic on the results, in a single pipeline stage.
But this parallelism cannot be utilised, because the instructions are separate; the CPU can't know to calculate them all at once without special flow-tracking HW, which takes silicon and would produce heat.

But what if you could make an instruction that does it all at once?
Simple instructions are embedded together; the CPU can easily calculate A*D and A*B at the same time, add them together, and do the modulo just before the result pops out of the pipe. An instruction could do all this... and it would get labelled as an MMX-type thingy.

SSE/SSE2 and MMX yada yada do similar things (AFAIK), but as to exactly what they help with, I have no clue. But that's the concept: do lots of stuff in one instruction to make more efficient use of the CPU.
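The "do it all in one instruction" idea can be simulated in a few lines; the fused instruction here is hypothetical, as the post says, but the two routes must agree on the answer:

```python
def scalar_steps(a, b, d):
    """The three separate instructions from the C = (A*D + A*B) % 10 example."""
    r = a * d       # A*D -> save result
    r = r + a * b   # A*B -> add to previous result
    return r % 10   # result % 10 -> store in C

def fused_step(a, b, d):
    """What a hypothetical do-it-all-at-once instruction would compute."""
    return (a * d + a * b) % 10

print(scalar_steps(2, 3, 4), fused_step(2, 3, 4))  # same answer either way: 4 4
```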

An analogy might help.
Imagine you know your 6 times tables (hopefully no-one is 'imagining')
and can add 3 digit numbers with ease.

Now I am dumb, and want something calculated for me.
You can calculate anything, but it takes you 3 seconds to write anything down.
I write down my question in bits... 5+5, 6*4, % 10, AND 0xff.
I pass you 5+5?
wait for an answer --- 10.
I pass you 6*4?
wait for an answer --- 24.
I pass you ans1 + ans2
wait for an answer ---- 34.
I pass you ans3 % 10.
wait for answer ----- 4.
I pass you ans4 AND 0xff.
wait for answer --- 4.

So that's all very inefficient.
It takes you 5 x 3 seconds = 15 seconds to write down all those intermediate answers and give me the final result.
What if I pass you ((5+5)+(6*4)) % 10 AND 0xff?
You look at it; it takes you 1 second to calculate: umm, (10+24) % 10... AND with 0xff...
yep...
You write back 4. It took 3 seconds, because I told you everything at once.
Because you know the full story, you can optimise the maths behind the result and get me the answer writing down only one thing, '4', in 3 seconds.

Same with 'puters. They are real smart and real quick, but can you make good use of having 3 32-bit ALUs, 2 single-cycle FPUs, etc.?

That's SSE/SSE2, simple style...

Balzi

"I spilled coffee all over my wife's nighty... ...serves me right for wearing it?!?"
November 15, 2001 4:52:41 PM

balzi, I understand what you're saying. I know you have to trade off clock speed against IPC. It's like this: you may have a very high clock speed, but not enough speed to carry out instructions on every clock. In other words, the clock is pulsing so fast that the system that carries out the instructions cannot keep up with it. So the high clock speed is wasted.

On the other hand, you could have a low clock speed while the system that carries out the instructions is fast. It carries out instructions at a much higher rate than the clock can pulse; it finishes carrying out the instructions before the next clock pulse. Here, the clock speed is too slow for the system that carries out the instructions.


But I ask the question again: why can't Intel make the "instructions per cycle" (IPC) term of the equation bigger?

Why can't they use 20 transistors in their pipe stages, together with their high clock speed?

<font color=blue><i>Mankind must put an end to War,
or War will put an end to mankind!<i></font color=blue>
Anonymous
November 15, 2001 5:36:03 PM

Maybe you need to repeat it again: "propagation delays".

If the propagation delay is longer than the clock period, then while the next clock pulse (i.e. sampling period) is occurring, the data is still trying to propagate to where it needs to be to be properly sampled. If, for example, some piece of hardware was designed to output valid data within a certain time period and doesn't do it, then invalid data gets used by the software, and we all know what kind of loveliness that causes. Of course, it would be possible to only sample the data half as often, but that just gets you back to half speed; I don't think that's what you were after.

There are physical limits as to how the hardware can be used. If you are less than master of time space and dimension then you must remain within these limits.

"On the other hand, you could have a low clock speed while the system that carries out the instructions is fast. It carries out instructions at a much higher rate than the clock can pulse; it finishes carrying out the instructions before the next clock pulse. Here, the clock speed is too slow for the system that carries out the instructions."

No dude, not too slow; that is what you want, and how synchronous state machines operate. The idea, to get maximum performance, is to make sure all the hardware stages stay busy until just before the next clock pulse, and have their data valid just in time for that next clock pulse.

It might be easiest to understand this by picturing the clock pulse as the data sampling time, where the data must be valid for a window of time before and after the actual clock pulse to make a successful sample.
November 15, 2001 6:13:06 PM

<i>No dude, not too slow; that is what you want, and how synchronous state machines operate. The idea, to get maximum performance, is to make sure all the hardware stages stay busy until just before the next clock pulse, and have their data valid just in time for that next clock pulse.</i>

What do you mean, "not too slow"? Of course it would be slow in comparison to the data transfer rate.

This is exactly what happened, for example, to the XT and AT computers. Their data transfer rate was too slow in comparison to the clock speed. So they had to be abandoned.

<b>Data tranfer rate = Bus Width * Clock speed / (8 * No. of clock pulses per transfer)</b>

You can see that it depends on the data bus, address bus, and clock speed.
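Plugging illustrative numbers into that formula (a 16-bit bus at 8MHz taking 2 clocks per transfer is a made-up ISA-style example, not a measured spec):

```python
def transfer_rate_mb_s(bus_width_bits, clock_mhz, clocks_per_transfer):
    """Data transfer rate = bus width * clock / (8 * clock pulses per transfer)."""
    return bus_width_bits * clock_mhz / (8 * clocks_per_transfer)

print(transfer_rate_mb_s(16, 8, 2))  # 8.0 MB/s peak
```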

Some historical background for you:

Since a system can only run at the speed of its slowest element (a system is only as strong as its weakest link), the clock speed was reduced to what the memory could handle. This was a reflection of the state of the art in memory chip development at the time. Most ISA (Industry Standard Architecture) buses ran at around 5MB/s. It was important that all cards in use at the time on XTs would also be able to work in the AT machine. So the bus layout would have to be compatible - and still provide the extra data and address bus connections. This was achieved by keeping the original XT expansion bus and adding an extension section to the bus for the extra connections. In that way, XT cards would fit in the expansion slot, while AT cards would also use the slot extension.

This is termed the ISA system (Industry Standard Architecture).

Then of course, they brought out the 386/486, MCA, EISA systems etc.

<font color=blue><i>Mankind must put an end to War,
or War will put an end to mankind!<i></font color=blue>
November 15, 2001 6:22:01 PM

Conqueror....

Intel probably COULD turn up the IPC if they completely redesigned the chip to shorten the steps etc. However, they would have to lower the clockspeed. It's a matter of tradeoffs.

Now, supposedly, there's a P4 variant that will be able to perform more instructions per cycle using a technique they call Hyper-Threading. However, that is earmarked for their server chips and will likely be very expensive; PLUS, even with Hyper-Threading, the P4 will not have the IPC of the Athlon.

Mark-

When all else fails, throw your computer out the window!!!
November 15, 2001 7:57:53 PM

What I'd like to know is if Intel can increase the IPC and also the MHz. I don't know if this is possible or not. But let's just say they did. Then do you think AMD will go back and rename their Athlon XP 1900+ the Athlon XP 1700-?

Just making a joke.
KG.
November 15, 2001 8:10:43 PM

They can, but it'll be more costly, and they'll need to implement some serious heat spreaders and thermal protection, as well as move to the .13-micron process.

AMD technology + Intel technology = Intel/AMD Pentathlon IV; the <b>ULTIMATE</b> PC processor
November 15, 2001 9:31:24 PM

hehehe.. Ok guys, let's try and wrap this one up THIS TIME.

firstly-> Conquerors comment
"This is exactly what happened, for example to the XT and AT computers. Their data tranfer rate was too sow in compasrison to the clock speed. So they had to be abandoned.

Data tranfer rate = Bus Width * Clock speed / (8 * No. of clock pulses per transfer)"

Your point is extremely correct but also extremely off-topic. The data-transfer rate you speak of is the memory sub-system transfer rate.

What knewton and myself have been saying is that each 'stage' of a pipe-line WITHIN the processor has a transfer or 'propagation' rate.

I'm sorry my previous examples haven't been satisfactory. but I might try again.

Let's say that you and a group of friends (10 of you) are a professional PC construction team. (I thought I'd use something we all might know about)

You can organise yourselves into groups or do one task each, but at the end you want the PC to be together and working (oh really!!)

Now, your construction team needs an overseer; that's me - I'm your clock.
For the sake of this example, let's assume that you have as many PC parts as you'll need (that's equivalent to unlimited memory bandwidth.. always enough data)

First attempt, you guys organise yourselves into two groups of 5 people.
First team puts CPU and RAM onto MB. Second team puts MB into case along with all PCI cards, video card, HD, PS, CD, Floppy.. we're ignoring peripherals for now.

Now the flow of PC bits is moderated by me yelling "MOVE ON" at exact intervals, say every 1 minute.

So every 2 minutes (1 minute per stage), 5 PCs come out... or do they?

Now the first group has to pass their MBs to the second group. But we all know that the second stage is going to take longer; in fact they can't complete their duties in that 1 minute. So they tell me to 'clock' less often, yelling "MOVE ON" every 5 minutes (a slower clock).

So I do that, and now the second group is ready for new MBs on time. But now group one is sitting around doing nothing for most of each cycle... this is inefficient.

It takes 10 minutes (5 per stage) to 'clock' the first 5 PCs through,
but on average, if we left the process running long enough, we'd start and finish 5 PCs every 5 minutes (that's 1 PC per minute).

You re-organise your team. Now you're in 2 groups again, but the first group is only 2 people, and group 2 is 8 people.
The first group does the easier work, installing CPU and RAM. The second group does the same job as before, but there's more of them to do it.
Now I 'clock' every 5 minutes, but in that clock the first team can get 8 MBs ready for the second team, who can each finish 1 MB, so that's 8 MBs for the group.
So now every 5 minutes (on average, after time) you produce 8 PCs, or 1.6 PCs per minute...

You just upped your rate by simply re-organising your pipe-line stages; notice that the clock is the same.
Now I can't start 'clocking' every 2, 3 or 4 minutes to make it faster, unless I train you guys to build faster... this is the equivalent of getting transistors to switch faster.

You guys re-organise again; this time you are in 4 groups: 2 people, 2 people, 3 and 3 (10 altogether).
Group 1 of 2 puts CPUs and RAM in - this takes 1 minute each, so 2 PCs per minute.
Group 2 of 2 puts PS and MB in case - this takes 1 minute each as well.
Group 3 of 3 work together to get HD, Floppy and CD into the case; they can produce 2 PCs per minute all up.
Group 4 of 3 people stick all the PCI cards and video card in and plug in the HD, CD and Floppy cables; together they can assemble 2 PCs per minute.

So you have organised a nice pipe-line; each stage is equal, throughput is maximised, and you can change your clock to 1-minute intervals. All up, on average (after time), your PCs will be assembled at 2 per minute.

Now, see here, you have doubled your pipe-line depth and gone for a 5-times-faster clock, but your throughput is only up 25%, and that's really because your pipe-line is managed better.

Now if you clock faster, no-one will get anything done.
You could split into 8 stages and halve your clock period again, but your throughput wouldn't change, would it? Each stage might only get 1 PC out per clock... with a 30-second clock, that's still 2 PCs per minute.

So the time it takes for each person or group to get their tasks done is the assembly line's propagation delay, which is directly relatable to the propagation delay we spoke of in the real-world CPU transistor delays.
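For anyone who'd rather see the analogy as numbers, here's a minimal sketch of it in Python, using the same figures as above. The only rules are that the clock can tick no faster than the slowest stage, and that steady-state output is whatever falls off the end of the line each tick:

```python
def steady_state_rate(pcs_per_clock, clock_minutes):
    # once the pipe-line is full, output per minute is what
    # comes off the end of the line at every tick
    return pcs_per_clock / clock_minutes

def fill_latency(num_stages, clock_minutes):
    # time for the very first PC to travel the whole line
    return num_stages * clock_minutes

print(steady_state_rate(5, 5))  # attempt 1: 5 PCs per 5-min tick -> 1.0/min
print(steady_state_rate(8, 5))  # attempt 2: 8 PCs per 5-min tick -> 1.6/min
print(steady_state_rate(2, 1))  # attempt 3: 2 PCs per 1-min tick -> 2.0/min
print(fill_latency(4, 1))       # ...but the first PC still takes 4 minutes
```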

I sincerely hope I covered it properly... any problems to speak of, feel free to correct me.

The logic is kinda simple when you finally get your head around it... sort of like once you find Wally, you think everyone will see him straight away - too easy!!!!

balzi

"I spilled coffee all over my wife's nighty... ...serves me right for wearing it?!?"
November 15, 2001 9:48:35 PM

This is a fairly good explanation. Now add to that the following: The people building the systems can hear the guys taking orders on the phones for these systems. They will attempt to predict what kind of system they need to build before the sales call has been completed, so they do not have to stall and waste time. If they predict incorrectly, they must start building the correct system from the beginning. This is the equivalent of a branch misprediction.

The amount of time they wasted by mispredicting is determined by the clockspeed (the guy yelling 'Move on!') and the number of steps in the pipeline (the assembly line of people making the system.) In essence it is the time for one system to make its way through the assembly line from start to finish. You can think of this as latency while total number of produced systems is bandwidth.
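That latency/bandwidth point can be sketched as a toy formula: every mispredicted branch costs roughly one full trip through the assembly line. All the numbers below are made up for illustration:

```python
def effective_ipc(peak_ipc, branch_fraction, mispredict_rate, pipeline_depth):
    # average cycles per instruction = the ideal cost plus the flush
    # penalty (one pipeline length) paid on each mispredicted branch
    cycles_per_instr = 1.0 / peak_ipc
    flush_cost = branch_fraction * mispredict_rate * pipeline_depth
    return 1.0 / (cycles_per_instr + flush_cost)

# same predictor, same peak IPC: the deeper pipe loses more per miss
print(effective_ipc(3.0, 0.2, 0.05, 10))  # shallower pipeline
print(effective_ipc(3.0, 0.2, 0.05, 20))  # deeper pipeline, lower result
```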

-Raystonn


= The views stated herein are my personal views, and not necessarily the views of my employer. =
November 15, 2001 10:17:10 PM

hehe. Nice explanation :) .

I have a question.

The pipeline discussed in your (and many other) posts determines the amount of work done along ONE path inside the processor, whereas my post talked more about the number of paths, or the parallelism, of the processor.

My question is:
How do wider designs (i.e. higher issue bandwidth) limit the maximum clock frequency? I have the feeling that giving the P4's long pipeline an Athlon-style wide issue could only be done at the expense of maximum operating frequency, although I can't really convince myself of why that would be.

Of course, the logic issuing 6 uops has to "do more" than the one issuing 3 uops, but that can't be the limiting factor, or can it?

/Markus
November 15, 2001 10:22:26 PM

Don't forget that the x87 FPU is inherently more accurate than SSE2: 80 bits vs. 64.
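Python has no 80-bit float, but the idea that a wider mantissa accumulates less rounding error is easy to demonstrate by emulating a *narrower* (32-bit) float and comparing it with Python's native 64-bit double. This is a stand-in to show the principle, not the actual x87 format:

```python
import struct

def to_float32(x):
    # round a 64-bit double to 32-bit float precision and back
    return struct.unpack('<f', struct.pack('<f', x))[0]

acc64, acc32 = 0.0, 0.0
for _ in range(1_000_000):
    acc64 += 0.1
    acc32 = to_float32(acc32 + 0.1)

print(acc64)  # very close to 100000
print(acc32)  # drifts visibly: fewer mantissa bits, bigger accumulated error
```

The same relationship holds one level up: the x87's 80-bit intermediate results round less than 64-bit doubles over long calculation chains.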

Why do I feel like the lone sane voice in the mental asylum?
November 15, 2001 11:12:51 PM

ahh yes, parallelism... the buzzword of CPUs quite a while back, when I was in uni. Parallel processing, WOW?!?!?! hehe

Anyway, parallelism is great because it's like having two sets of assembly teams who can operate independently but respond to the same 'clock'. The thing is, this works lovely for assembly lines, but when you're performing calculations, who cares about having two pipes to get the same answer?
You double production in assembly; you look stupid in CPU calculations... "hey, the answer's 50..", "I know!!!"

Instead, parallelism in CPUs is very different: it's when you try to ask as many mutually exclusive questions as possible, and then have the processor calculate lots of things in parallel (same time, different silicon).

Because of program flow on a CPU, with branches and instructions that want to know the answer to the previous instruction and yada yada, parallelism is difficult to implement.
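A tiny sketch of why those dependences kill parallelism; nothing here is hardware-specific, it's just the data-flow idea:

```python
def dependent_chain(n):
    # each step needs the previous answer, so no two steps can run
    # at the same time, no matter how wide the processor is
    x = 1
    for _ in range(n):
        x = x * 3 + 1
    return x

def independent_ops(values):
    # these results never feed each other, so a superscalar CPU
    # (or a second assembly team) could compute them side by side
    return [v * 3 + 1 for v in values]
```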

To Raystonn: there's lots of stuff that my analogy covers, but it's not really that close to the real thing. Thanks for your clarification... the more people explaining, the better we cover everyone else trying to comprehend.

In thinking of your additions to the PC assembly example, I thought about the P4's branch prediction... at least I think it's the P4.
It has a simultaneous execution unit... where it executes both branches.
Kind of like: whenever the construction team hears a phone call, each team starts building a PC of a different type. When they find out what PC they're supposed to build, everyone but the correct team pulls their PC to bits and discards their work.

oo I like this analogy.. it's so scalable.. hehehe.



"I spilled coffee all over my wife's nighty... ...serves me right for wearing it?!?"
November 16, 2001 1:58:23 AM

About the branch prediction:

I've read about some processor taking both branches, but I can't remember which one. I am pretty sure P4 doesn't do this though.

Actually, it is kind of a stupid approach. Branch prediction is based on the assumption that your hardware can predict the outcome of a branch more than 50% of the time; otherwise the predictor would be a waste of space, as either 'always taken' or 'never taken' must be right at least 50% of the time. If you continue execution at both the taken and not-taken address, only half of your execution paths would be doing any useful work, and essentially you would be 50% right; that is, a waste of space...
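That break-even argument can be put in numbers. With eager (both-paths) execution you throw away one path's worth of work at every branch; with a predictor you pay the flush penalty only on the fraction you get wrong. Everything below is illustrative, not measured:

```python
def predictor_wasted_cycles(accuracy, flush_penalty, num_branches):
    # a predictor only pays the flush penalty when it guesses wrong
    return (1.0 - accuracy) * flush_penalty * num_branches

def eager_wasted_cycles(num_branches, path_length):
    # eager execution always runs both paths, so one whole
    # path's worth of work per branch is always discarded
    return num_branches * path_length

# 90%-accurate predictor, 20-cycle flush, 20-instruction paths:
print(predictor_wasted_cycles(0.90, 20, 1000))  # ~2000 cycles wasted
print(eager_wasted_cycles(1000, 20))            # 20000 cycles wasted
```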

/Markus
<P ID="edit"><FONT SIZE=-1><EM>Edited by mala on 11/15/01 11:30 PM.</EM></FONT></P>
November 16, 2001 2:37:53 AM

Yeah, sorry, I was only guessing... way back in the recesses of my mind it linked the P4 to multiple branch execution... don't know why.

It wasn't supposed to be informative... I was just mucking around with the assembly line analogy.

Hey, Mala... you seem to know a bit about everything... I always wondered what the P4 did wrong with its FPU... everybody disses it, but I don't know what's up with that.

do you know??

ta
balzi

"I spilled coffee all over my wife's nighty... ...serves me right for wearing it?!?"
November 16, 2001 3:59:23 AM

I don't know how much of a difference there is (I still have Celerons powering my system) =/.

First you have the overall design of the P4, which affects all code, both ALU and FPU.

Then, the Athlon (and P3) can issue up to 2 FP uops/cycle; the P4 can only do one.

And add to that higher latencies for instructions on the P4. The latency problem is not as bad as it sounds, but as Athlons can issue twice as many instructions per cycle, there is probably a greater risk of stalling the Athlon pipeline than the P4's. (my guess)

I guess some of its "slowness" has to do with the P4 being a new processor, and with the NetBurst microarchitecture differing quite a lot from previous CPUs. Maybe the applications produced today don't use the newest compilers, or maybe the programmers don't care to optimize for the P4 as it would lower performance on P3s and Athlons(?). And of course, old programs released before the P4 haven't been optimized...

There was an article here on THG when the P4 was new. THG got horrible results in some video encoding benchmark. A guy working for Intel asked to look at the software and the next day the programmers at Intel (Raystonn? =) ) had tuned the code to much higher scores.

Humm... That was the long version of "sorry, I don't know" ;) 

/Markus
Anonymous
a b à CPUs
November 16, 2001 8:37:03 AM

>I've read about some processor taking both branches, but I
>can't remember which one.

That would be the Itanium. If I'm not mistaken, Intel calls this 'branch predication' (Raystonn, correct me if I'm wrong).

Anyway... while this seems a crazy idea, it might not be that crazy. A lot of logic is spent on predicting the outcome of branches, and therefore a lot of die space. While it's true the predictions are correct in much more than 50% of cases, don't forget the penalty for mispredicting is HUGE, especially with a long pipeline. When you follow ALL the branches, there is never a misprediction. It's a strange approach, I'll admit, and frankly, it doesn't seem to work well. Read this:
<A HREF="http://www.aceshardware.com/Spades/read.php?article_id=..." target="_new">http://www.aceshardware.com/Spades/read.php?article_id=...;/A>

Aces benched a Queens chess puzzle on the Itanium. This code has lots of branches, and this really seems to kill its performance. (To be fair, some of the other benches in that preview, especially the FP benches on the Itanium, are very impressive.)

= The views stated herein are my personal views, and not necessarily the views of my wife. =
Anonymous
a b à CPUs
November 16, 2001 8:45:12 AM

>maybe the programmers doesn't care to optimize for the P4
>as it would lower the performance on P3's and Athlons(?).

The funny thing is, these P4-optimized apps generally also perform (much) faster on the Athlon, and often even on the P3... Go figure.

So that's not the reason we are seeing relatively little P4- or SSE2-optimized code. I think it has more to do with the compilers being used. Currently, Intel's compilers seem to produce the best-performing code, especially on the P4. But I read somewhere that Intel's compilers are rarely used to compile mainstream applications, because they supposedly aren't as stable as other C/C++ compilers?

= The views stated herein are my personal views, and not necessarily the views of my wife. =
November 16, 2001 9:01:04 AM

Maybe the processor I read about was the Itanium.
But I don't agree that predication on the Itanium takes both branches.
The point of predication is to do if-statements without using branches at all.

Instead of a general compare setting flags in the processor, with branches taken if some flag equals something, IA64 lets you test a value for a specific property and store the answer (true/false) in a 1-bit predicate register. Then any instruction can be made conditional on the result in a predicate register.

Kind of like the cmovxx instruction in x86, but more powerful.
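As a software caricature of predication (pure illustration, not IA-64 syntax): compute both candidate results, then let a one-bit predicate select between them, so there is no branch left for the front end to predict:

```python
def predicated_abs(x):
    p = int(x < 0)      # the 'predicate register': a single true/false bit
    neg, pos = -x, x    # both candidate results get computed...
    return p * neg + (1 - p) * pos  # ...and the predicate selects one

print(predicated_abs(-5))  # 5, with no if-statement in the data path
print(predicated_abs(3))   # 3
```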

/Markus
Anonymous
a b à CPUs
November 16, 2001 9:11:55 AM

You're right... it's not predication; your explanation of it seems correct.

However, the capability of the Itanium to process different branches in parallel is called "control speculation". Here is a quote from it-enquirer.com:
---------------
In contrast to branch prediction, the speculative execution performed by processors with Itanium-architecture involves loading and executing both expected instruction sequences. HP and Intel call this procedure control speculation.

Itanium architecture flags the results in additional registers, so that the results of the unnecessary program branch which was executed can be discarded without any problems. The “costs” of this procedure are smaller than the time and thus performance losses if a false branch prediction is made. The multiple functional units within Itanium processors facilitate simultaneous execution of various program branches.
---------------

= The views stated herein are my personal views, and not necessarily the views of my wife. =
November 16, 2001 3:47:36 PM

Balzi, with all due respect, I understood your explanation about the clock when you explained it the very first time in one of your previous posts.

Even I mentioned in one of my previous posts: <i>"You may have a very high clock speed, but not enough speed to carry out instructions per every clock. In other words, the clock is pulsing so fast, but the system that carries out the instruction cannot keep up with the clock. So the high clock speed is wasted.

On the other hand, you could have a low clock speed, and the system that carries out the instructions is fast. It carries out instructions at a much higher rate than the clock can pulse. It finishes carrying out the instructions BEFORE the next clock can pulse. Here, the clock speed is too slow for the system that carries out the instructions."</i>

Sorry if my language was a bit less technical, but that was basically what you were trying to explain, wasn't it?

My question again is (lol, sorry!): why can't Intel organise their pipe-line to make it similar to the Athlon's, so that it would carry out more instructions per clock? That way, they would have a high clock speed AND high instructions per cycle, and hence higher performance.

I know about propagation delays etc., but this question is not related to that.

<font color=blue><i>Mankind must put an end to War,
or War will put an end to mankind!<i></font color=blue>
November 16, 2001 4:33:16 PM

Interesting.
I just reread the IA64 Software Developer's Manual, but I can't find anything that suggests that the Itanium would behave like this. Actually, I didn't find any reference to "procedure control speculation" at all.

I think it is possible that it-enquirer.com has misunderstood some concept. Of course, it is possible that I have too.

/Markus
Anonymous
a b à CPUs
November 16, 2001 4:33:51 PM

Of course I could be wrong ($h1t happens), but this is the tradeoff: you either get high complexity per stage or high clock speed, not both. If you increase complexity per stage, there is more propagation delay as a result, and thus a lower clock is necessary. Likewise, if complexity per stage is reduced, the propagation delay through each stage is reduced, and so the clock speed may safely be increased. For Intel to have their design behave more like the Athlon's, i.e. get more work done per cycle, would require them to increase stage complexity, and thus force them to reduce clock speed to compensate. Each approach is valid, just different.
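The tradeoff fits in a toy formula: an instruction's total logic delay is fixed, and slicing it into more stages shortens each stage and lets the clock rise, but every stage boundary adds a fixed latch overhead, so the returns diminish. The delays below are invented for illustration:

```python
def max_clock_ghz(total_logic_ns, num_stages, latch_overhead_ns=0.05):
    # the clock period can't be shorter than one stage's share of the
    # logic delay plus the fixed latch/setup overhead at its boundary
    stage_delay_ns = total_logic_ns / num_stages + latch_overhead_ns
    return 1.0 / stage_delay_ns

print(max_clock_ghz(10.0, 10))  # ~0.95 GHz from 10 complex stages
print(max_clock_ghz(10.0, 20))  # ~1.82 GHz from 20 simpler stages
```

This is the Athlon-vs-P4 choice in miniature: fewer, fatter stages (more work per cycle, lower clock) versus many thin stages (less work per cycle, higher clock).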

"I know about propagation delays etc. but this question is not relating to that."

Sorry about using the "P" word, but I just don't know how else to talk about this.

I think maybe it is this statement where the communication breakdown is occurring:

"It carries out instructions at a much higher rate then the clock can pulse."

This is most certainly what is happening in an underclocked part. If, in general, there were an excess of idle time, then the clock speed would be increased by the manufacturer, right up as close as possible to the limit. After all, by taking up this slack and raising the clock speed, the manufacturer can call it a faster processor and charge more for it.

It appears to me that everyone seems to be saying the same thing here, and chances are good that we are getting close to what is true. There is obviously some kind of miscommunication going on, but I'll play the odds and say it is you who is misunderstanding. Not to be disrespectful, but you may want to take a step back and re-evaluate what you are asking and what the replies were. You may find that the question has been sufficiently answered already.