AMD v INTEL on performance equation

Conqueror

Distinguished
Nov 8, 2001
87
0
18,630
<b>Performance = Clock Speed x Operations/Cycle</b>

Looking at the equation, to get high performance, we need to have a high clock speed and also high operations per cycle.

Intel is continually bringing out higher-clocked Pentium processors than their Athlon competitors, but their "operations per cycle" figures are lower.

Athlons, on the other hand, have a higher figure for "operations per cycle", although they have lower clock speeds. Hence they are able to match the performance values of the Pentium processors.
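
To make the equation concrete, here is a minimal sketch in Python. The clock speeds and operations-per-cycle figures are made-up illustrative numbers, not measurements of any real Pentium or Athlon; they only show how a lower-clocked chip can match a higher-clocked one.

```python
def performance(clock_mhz: float, ops_per_cycle: float) -> float:
    """Millions of operations per second: clock speed x operations per cycle."""
    return clock_mhz * ops_per_cycle


# Hypothetical chips (illustrative numbers only, not real specs):
# one trades operations per cycle for clock speed, the other does the opposite.
high_clock_chip = performance(clock_mhz=2000, ops_per_cycle=0.75)
high_ipc_chip = performance(clock_mhz=1500, ops_per_cycle=1.0)

print(high_clock_chip, high_ipc_chip)
```

With these assumed figures the two products come out equal, which is the situation described above.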

What if Intel comes out with a higher "operations per cycle" figure as well, by using, for example, higher bus speeds and higher cache transfer rates etc.? Athlons have a larger Level 1 cache, and that is mainly how they are able to match the performance values of the Pentium. What if the Pentium comes out with this large Level 1 cache too (is it possible!!)?

This would result in Pentium computers outperforming Athlons by miles. AMD would not be able to keep up, because they cannot manufacture higher-clocked processors like Intel can. (I mean AMD is having a hard time bringing out higher-clocked Athlons.)

Pentiums have gone beyond 2000MHz, and soon, when Northwood comes out, they will go much further. Where is Athlon in clock speed?


Don't start blasting at me kids. I'm only a "stranger". Although I use a computer 24/7, and also know what the inside of a computer looks like and what each component does, I don't know much about the technical issues. Correct me if I'm wrong.

Cheers.
 

bgates

Distinguished
Nov 12, 2001
161
0
18,680
It all comes down to whether you want to pay the extra $200 for the name and the extra memory bandwidth that Intel has.

-¤ Shut the f*ck up or go AMD ¤-
 

FatBurger

Illustrious
Hey, good to see someone actually stretching their brains, instead of sitting on their ass doing nothing :)

Let's see if I can shed some light on the subject...

Higher "operations per cycle", or "instructions per clock" (IPC), as AMD likes to call it, has to do more with multiple simultaneous pipelines than the bus speed or the cache. The bus and cache help the clock speed, not the IPC.

The P4's small L1 cache is the main reason it has poor performance. 8k is not nearly enough to satisfy a 2.2GHz beast. The Northwood should have been designed with more L1 cache to go along with its doubled L2. That is not the case, regrettably.

You're right. If Intel suddenly matched or beat the Athlon's IPC, then they would blow by AMD. However, if AMD suddenly matched or beat the P4's clock speed, they would blow by Intel.

As for bus speed, remember that the P4 operates on a quad-pumped 100MHz bus, reaching effective 400MHz speeds. The Tbird/Palomino is on a double-pumped 100/133MHz bus, effectively reaching 200/266MHz.
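
The "effective" figures are just the base clock times the number of transfers per clock. Here is a rough Python sketch of that arithmetic. The function names are made up for illustration, and the 8-byte (64-bit) bus width used for the peak bandwidth figure is an assumption stated in the code, not something taken from this thread.

```python
def effective_bus_mhz(base_clock_mhz: float, transfers_per_clock: int) -> float:
    """Effective bus speed: base clock multiplied by transfers per clock."""
    return base_clock_mhz * transfers_per_clock


def peak_bandwidth_mb_s(base_clock_mhz: float, transfers_per_clock: int,
                        bus_width_bytes: int = 8) -> float:
    """Theoretical peak in MB/s, assuming a 64-bit (8-byte) wide bus."""
    return effective_bus_mhz(base_clock_mhz, transfers_per_clock) * bus_width_bytes


print(effective_bus_mhz(100, 4))    # quad-pumped 100MHz bus
print(effective_bus_mhz(100, 2))    # double-pumped 100MHz bus
print(effective_bus_mhz(133, 2))    # double-pumped 133MHz bus
print(peak_bandwidth_mb_s(100, 4))  # peak figure under the 8-byte assumption
```

These are theoretical peaks only; real sustained bandwidth is lower.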

<font color=orange>Quarter</font color=orange> <font color=blue>Pounder</font color=blue> <font color=orange>Inside</font color=orange>
 

bront

Distinguished
Oct 16, 2001
2,122
0
19,780
Well said Burger.

>If Intel suddenly matched or beat the Athlon's IPC, then they would blow by AMD. However, if AMD suddenly matched or beat the P4's clock speed, they would blow by Intel.
True, but you can't play the what if game.

As for clock speed vs IPC, and how much of each a CPU can add, there is a tradeoff. It would seem that it's hard to make a CPU do both well. AMD is performing better because they have done both a bit better at the moment. From the roadmaps I see, it looks like AMD will be turning the clocks up soon to go with their better IPC. Intel is simply aiming to turn the clock up to incredible numbers.

>Performance = Clock Speed x IPC

That's not quite true. If it were, a P4 2.0 GHz would be EXACTLY 25% faster than a P4 1.6 GHz, and that is not the case. You begin to lose a bit of performance if all you do is change the multiplier.
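
One way to see why a 25% clock bump gives less than 25% is to split run time into a part that scales with the CPU clock and a part that doesn't (time spent waiting on memory and the bus). The Python sketch below uses invented numbers for the cycle count and the stall time; it is a toy model, not a benchmark of any real P4.

```python
def run_time_s(core_cycles: float, clock_hz: float, memory_stall_s: float) -> float:
    """Toy model: time in the core scales with clock, memory stall time does not."""
    return core_cycles / clock_hz + memory_stall_s


# Hypothetical workload (illustrative numbers only).
core_cycles = 1.6e9      # cycles of pure CPU work
memory_stall_s = 0.25    # seconds spent waiting on memory, independent of CPU clock

t_slow = run_time_s(core_cycles, 1.6e9, memory_stall_s)
t_fast = run_time_s(core_cycles, 2.0e9, memory_stall_s)
speedup = t_slow / t_fast

print(f"clock ratio: {2.0 / 1.6:.2f}, actual speedup: {speedup:.2f}")
```

Under these assumptions the speedup comes out at roughly 1.19 rather than 1.25, because only the multiplier changed while the memory wait stayed the same.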

Here's one last real-life example of this not being true.

A Pentium 133 was faster than a Pentium 150. Both had the same IPC, so according to the formula, the 150 should be faster. The problem was that the P150 ran on a 60 MHz bus, while the 133 was on a 66 MHz bus. That 10% bus speed difference affected not only the CPU, but also the memory and the expansion cards. This obviously isn't the best illustration, but it does prove my point: you can't simply chalk up performance to a simple equation.

Chesnuts roasting on an open CPU
Bill Gates nipping at your wallet
 

Kelledin

Distinguished
Mar 1, 2001
2,183
0
19,780
It's much more than just L1 cache that holds back the P4.

The 20-stage pipeline is also holding it up; Intel deliberately gave it such a long pipeline so they could ramp up clockspeeds higher. Intel's best answer to the problem is HyperThreading Technology, which isn't even released and probably won't be released for at least a year yet. When it <i>is</i> released, it will spend the first part of its life as a Xeon-specific feature.

The Pentium 4 also has a hacked-up ALU and miserably deficient x87 FPU. The x87 FPU hurts it more than anything. Intel was planning to force their new SSE2 instruction set into mainstream use quickly to make up for it--but the Athlon is even managing to defeat the Pentium 4 in SSE2-enabled benchmarks, so it doesn't look all that promising. Not to mention which, we have no idea how many apps will be SSE2-optimized in the future. Currently, the majority of apps are <i>not</i> SSE2-optimized.

>Pentium's have gone beyond 2000MHz, and soon when Northwood will come out, they will go much further. Where is Athlon in clock speed?
How much further? I believe the .18u P4's speed advance was put on hold primarily because of power issues. The power draw on a .18u P4 is outrageous. Athlons consume about 33% less power for the same performance, and people were already up in arms about the Athlon's increased power demands.

It seems the P4, like the Athlon, is also hitting a brick wall in terms of clock scaling. 0.13u will help with the Northwood, but it will also help with the Thoroughbred.

Kelledin
<A HREF="http://www.linuxfromscratch.org/" target="_new">LFS</A>: "You don't eat or sleep or mow the lawn; you just hack your distro all day long."
 

Makaveli

Splendid
I'm not an expert on processors. But even if Intel were to increase the L1 cache on the P4, would it be able to keep up with an Athlon clock for clock? I know the Athlon has a much higher cache hit rate. But wouldn't the lower IPC on the P4 still hold it back? Also, when AMD goes 0.13 micron they will only increase the pipeline by 2 stages, and IPC is supposed to increase. The only thing I see holding back the Athlon is the FSB. They seriously have to increase it, and soon.
 

mala

Distinguished
Oct 12, 2001
45
0
18,530
Actually there is even more =)

First we have the decoder. The P4 can decode 1 instruction every cycle. The Athlon can decode 1 complex instruction per cycle, or up to 3 if they are "directpath", meaning they consist of at most 2 microOPs. Now, this doesn't mean that the K7 executes at 3 times the speed of the P7. The P7 has something called an execution trace cache (ETC), which means that the L1 instruction cache of the P7 caches instructions after they have been decoded instead of before. Applications spend 90% of their time in some inner loop, so the instructions will be found in the ETC most of the time. But the ETC can only issue 3 microOPs to the execution ports, whereas the K7 can issue up to 6 microOPs (3 MacroOPs @ 2 MicroOPs each).

That means that both K7 and P7 can execute 3 simple instructions, like adding two registers, per cycle. A slightly more complex instruction, like adding 3 variables in memory to 3 registers, takes one cycle on K7 but 2 cycles on P7. These are of course best-case scenarios where no instruction depends on the result of another.

Then we have the issue ports and execution units. Up to three FPU instructions can begin execution each cycle on the K7 if they are of "different types". That is you have to issue for example one add, one multiply and one load/store to make use of all units. P7 only allows 2 FPU-microOPs per cycle (1 operation + 1 store).

Then we have the latency of FPU instructions. In general, operations on P7 take longer to complete due to its longer pipelines. For example, the FADD instruction, which adds two real numbers held in two registers, completes in 4 cycles on K7 and 5 cycles on P7. This doesn't always matter, as the beauty of pipelined processors is that a new instruction can begin execution every cycle even though each instruction takes many cycles to execute. But if you need the result of an add after 4 cycles, then you will have to wait an extra cycle on the P7 for the result to become available. If this happens you should probably fire your software engineer before you change your hardware.
Note: The latencies of SSE/SSE2 instructions are very good on the pentium.
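
The difference between latency and throughput can be sketched with a very simplified Python model. It assumes a single fully pipelined unit that can start one add per cycle, and ignores every other effect; the 4- and 5-cycle latencies are the FADD figures quoted above, and the function names are made up for illustration.

```python
def cycles_independent(n_adds: int, latency: int) -> int:
    """n independent adds on one pipelined unit: one starts per cycle,
    so only the last add pays the full latency."""
    return latency + (n_adds - 1)


def cycles_dependent(n_adds: int, latency: int) -> int:
    """n adds where each needs the previous result: every add pays the full latency."""
    return n_adds * latency


for name, latency in (("K7", 4), ("P7", 5)):
    print(name,
          "independent:", cycles_independent(100, latency),
          "dependent:", cycles_dependent(100, latency))
```

In this toy model the extra cycle of latency barely shows up for independent adds, but it costs a cycle on every single add once each one has to wait for the one before it.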

As for your main question: what if Intel increases its IPC? Well, they probably will, but the Pentium 4 will never reach the IPC of an Athlon, because the P4 is designed for high clock speeds. The reason the P4 can be clocked so high is that some "features" are sacrificed at the microarchitectural level. There is no such thing as a free lunch. Or maybe there is? With Northwood arriving early next year and AMD staying with 0.18u for another 9 months (?), Intel has a golden opportunity to either raise the clock frequency to insane levels or increase their IPC (I hate that word...). Both options would leave the Athlon in the dust, because I don't think the K7 can be clocked much higher on the present manufacturing process. The K7 architecture is more advanced than the P6, and Coppermine freaked out @ 1.13GHz. I think it would be weird if the designers at AMD could reach twice the clock speed of their Intel brothers who went to the same school (as someone put it in another thread).

Of course, the main objective of any company is to maximize its earnings, and even if Intel had the technology to produce Northwoods at 3GHz it would not automatically mean they would release one. They probably make more money releasing a processor every 3 months, with each release adding 200MHz to the frequency, than they would by releasing all flavours at once.

Well. That's about it =)
Maybe there is even more?
/Markus
 

zengeos

Distinguished
Jul 3, 2001
921
0
18,980
Nicely written and well thought out Mala, but I have one possible correction to make....

AMD should have Thoroughbreds (.13 micron) relatively early in 2002. If Jerry is to be believed, the XP2200+ will be released before the end of the 1st quarter next year, which is only a couple of months (no more than 3) after Northwood. Being optimistic, Thoroughbred might come out by mid 1st quarter, basically nixing just about any advantage Northwood may gain within a few weeks.

Being more realistic, it looks like Northwood might bring the performance crown back to Intel for 2-4 months.

Fun stuff, ehh?

Mark-

When all else fails, throw your computer out the window!!!
 

AmdMELTDOWN

Distinguished
Dec 31, 2007
2,000
0
19,780
>Currently, the majority of apps are not SSE2-optimized

kelledin, most of your post is bs so I won't even comment on it, but I'll say this: most professional (that's like <1% of you folks!) apps are getting optimized for the P4. Take a look at Lightwave, 3dmax, Maya, Photoshop, and others that are coming out! These are industry standards. Just because they have not done MS Word or your fav games doesn't mean things aren't happening.

"<b>AMD/VIA!</b>...you are <i>still</i> the weakest link, good bye!"
 

Kelledin

Distinguished
Mar 1, 2001
2,183
0
19,780
>Just because they have not done MS Word or your fav games, doesn't mean things aren't happening.
As I said, the majority of apps are not SSE2-optimized. There's a difference between "getting optimized" and "is optimized." Plus, we've already seen that SSE2 often doesn't put the P4 in the lead.

If so much of my post is BS, it should be no problem for you to refute it. :tongue:

Kelledin
<A HREF="http://www.linuxfromscratch.org/" target="_new">LFS</A>: "You don't eat or sleep or mow the lawn; you just hack your distro all day long."

P.S. I think it's hilarious how every time the Athlon beats the P4 at something, you claim it's "badly written software."

You kept cheering for Maya+SSE2--then stopped cheering when SSE2 came to Maya, and the P4 still couldn't win the benchmarks.

Then you pointed out Chris Blanos's Lightwave benchmark database as a symbol of the P4's SSE2 superiority--until a dual AthlonMP 1800+ took top rankings.

At your current rate of denial, every piece of software will be labelled as "crap" within the year... :lol:
 

girish

Distinguished
Dec 31, 2007
2,885
0
20,780
Well written analysis Markus!

Although much of it is BCS, it does happen in the real world! And that is reflected in the performance of the processors. Why don't these guys take off their orange glasses and see the green? (both P3 and Athlon ;-))

Intel already has the P4 RAE running at 4 GHz, and that is on a 0.18u chip. So I guess 0.18 still has some juice left. That may be why the AMD guys decided to stay at 0.18 for the time being!

The IPC (I hate this term too; the P4 has simply left it with no meaning, blurring instructions per second *handled* with *executed*!) and the thinner die seem to be marketing gimmicks. Even a VIA C3 boasts of being a green chip fabricated with 0.15u technology. But how does it perform? Even a Celeron beats it!

girish

<font color=red>No system is fool-proof. Fools are Ingenious!</font color=red>
 

Guest

Guest
>Intel already has the P4 RAE running at 4 GHz, and that is
>on a 0.18u chip.

Hmm... what is "RAE"? And are you serious about this? Have a link? FOUR gigahertz @ .18?

= The views stated herein are my personal views, and not necessarily the views of my wife. =
 

Matisaro

Splendid
Mar 23, 2001
6,737
0
25,780
Mala, great technical info. You forgot to mention as well that, with the P4's longer pipeline, branch prediction misses hurt it much more than the Athlon. I also believe the Athlon has a better branch prediction rate than the P4 (but am not 100% positive).
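
A rough way to picture the cost of a miss is the usual back-of-the-envelope formula: average cycles per instruction = base CPI + (fraction of instructions that are branches) x (mispredict rate) x (penalty in cycles). The Python sketch below assumes the penalty is about as long as the pipeline, and every number in it (branch fraction, mispredict rate, the 10- and 20-cycle penalties) is a made-up illustration, not a measured figure for either chip.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  mispredict_rate: float, penalty_cycles: int) -> float:
    """Average cycles per instruction once branch mispredictions are added in."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% branches, 5% of them mispredicted (illustrative only).
short_pipe = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=10)
long_pipe = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=20)

print(f"short pipeline CPI: {short_pipe:.2f}, long pipeline CPI: {long_pipe:.2f}")
```

With the same predictor accuracy, doubling the penalty doubles the cycles lost to mispredictions in this model, which is why a longer pipeline needs a better predictor just to break even.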

~Matisaro~
"The Cash Left In My Pocket,The BEST Benchmark"
~Tbird1.3@1.5~
 

balzi

Distinguished
Oct 16, 2001
121
0
18,680
All true guys,
but my few coins on the equation at the start.
performance is nearly impossible to measure, which is why most comparisons of processors use a dozen or so "different" bench-marks.
clock speed has something to do with it, as does the amount done for each clock-cycle (on average). but it hardly stops there.
you need to consider memory bandwidth; this is where the P4 can gain or lose. it will lose out if you can't give it data, because the cost of not providing enough data is larger (i.e. the quicker the clock-cycle, the more idle instruction-cycles for a given memory latency). however, the P4 would streak ahead given good memory bandwidth, because it could improve further by accepting and processing more data in a given time than an Athlon (this is debatable, but generally speaking...). unfortunately for the P4, the memory specs for today's tech leave processors waiting, and TLB, hardware prefetch and yada yada are the attempts to make better use of this waiting time or, really, to remove waiting time.
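
the "idle cycles for a given memory latency" point is simple arithmetic: cycles lost = latency in nanoseconds x clock in GHz. here is a tiny Python sketch; the 100 ns latency and the two clock speeds are invented round numbers for illustration, not real specs.

```python
def stall_cycles(memory_latency_ns: float, clock_ghz: float) -> float:
    """Clock cycles that pass while the CPU waits for one memory access."""
    return memory_latency_ns * clock_ghz


# Same hypothetical memory latency, two hypothetical clock speeds.
latency_ns = 100.0
print(stall_cycles(latency_ns, 1.5))  # slower clock: fewer cycles wasted per miss
print(stall_cycles(latency_ns, 2.0))  # faster clock: more cycles wasted per miss
```

the wait is the same in nanoseconds either way; the faster chip just throws away more cycles while it waits.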

as for the IPC complaints and thoughts on AMD's clock-speed versus Intel's, I believe the P4 gained its clock speed by the simple feat of lengthening its pipe-line. To put it very very basically, AMD wins out at the moment because memory cannot support such high data throughput requirements. AMD said "let's accept lower clock-speeds but do more at each stage of our pipe-line" (essentially shortening the pipe-line), while Intel said "let's do less per stage of the pipe-line so that we can go to higher clock-speeds". AMD's choice was better FOR THE MOMENT.

Perhaps I should note something about clock-speed.
Clock-speed is limited by the latency of signals (transistors turning on and off) through a pipe-line stage. All pipe-line stages should be similar to get the best efficiency per clock-cycle, more on that later.
Intel could probably get a clock to go 40 GHz. But if they attached it to a P4 core, well, no data would ever get through its pipe-line stages and it certainly would get out of sync real quick.
Here's an IPC vs clock-speed line.
----|---------------------|-------|----------------------------------|
high clock rates        Intel    AMD                           big IPC

one extreme is to have little stages, maybe 5 transistors each. You can have massive clock speeds, with 100 stage pipe-lines and simple micro-ops like 'AND bit with bit', XOR bit with bit. this sounds ridiculous I know.
the other extreme, is to have low clock cycles but your micro-ops are things like calculate FFT on two 32KB arrays, parse 5 simultaneous perl scripts, etc.
-- you're doing heaps per clock cycle but can't really run very fast 'cause the data might take milliseconds to get through any one stage of the pipe.

Intel's P4 and AMD's Athlon both fall around the middle, which has the best advantages from both sides, but they played the trade-off game differently.

hope any of that helps.
Feel free to pick it to bits 'cause it's 12:50am here and I need to be asleep; come the morning, even I probably won't know what I was talking about.

pray for my coherence please... thank you

Balzi

So performance is lots of different things all mixed up.

"I spilled coffee all over my wife's nighty... ...serves me right for wearing it?!?"
 

Guest

Guest
I'm gonna print this lot off and show it to my grandma - ha - she'll think we're aliens.
:lol:

I am NOT responsible for any damage the info in this post may cause to your system.
 

Guest

Guest
"Feel free to pick it to bits"

why? I rather liked it. I was a bit confused about the line thing about half way down.
 

Guest

Guest
It was probably the most informative post I've read.

You are responsible for any damage the info in your post may cause to my system.
 

Guest

Guest
Duh *run*

You are responsible for any damage the info in your post may cause to my system.
 

AmdMELTDOWN

Distinguished
Dec 31, 2007
2,000
0
19,780
"Then you pointed out Chris Blanos's Lightwave benchmark database as a symbol of the P4's SSE2 superiority--until a dual AthlonMP 1800+ took top rankings."

MP in the top rankings???

<A HREF="http://www.blanos.com/benchmark/bprint.cgi?lw_scene=raytrace&limit=10&search=type]" target="_new">http://www.blanos.com/benchmark/bprint.cgi?lw_scene=raytrace&limit=10&search=type]</A>

"<b>AMD/VIA!</b>...you are <i>still</i> the weakest link, good bye!"
 

LoveGuRu

Distinguished
Sep 21, 2001
612
0
18,980
but you can't fit all that many transistors on today's sized cores, can you?
AMD's core is smaller than Intel's, so why don't they use that to put more transistors on a specific (or custom) made CPU?
does a deeper pipe-line mean more energy consumption?
SSE/SSE2: how effective are they compared to just executing the proggy?

i will be on my base (army) for the next few days, would be great if you could mail me with the response. Thx a bunch.

<font color=green>
*******
*K.I.S.S*
*(k)eep (I)t (S)imple (S)tupid*
*******
</font color=green>
 

Conqueror

Distinguished
Nov 8, 2001
87
0
18,630
>If Intel suddenly matched or beat the Athlon's IPC, then they would blow by AMD. However, if AMD suddenly matched or beat the P4's clock speed, they would blow by Intel.

The role of the CPU clock is to co-ordinate the CPU activities. Each time the clock pulses, operations waiting to happen are carried out. That means that each part of an instruction can only be accessed every time the clock pulses. So an instruction occupying three bytes of memory will take three clock pulses to carry out (i.e. there are three memory reads required). Clock speed is therefore important. Potentially, the faster the clock the faster the instruction cycle time.

But the thing is, it is much more difficult to develop processors with higher clock speeds. So Intel have an advantage, because they have already produced high-clocked processors. All they have to do is use a different architecture (maybe similar to AMD's) that will do more instructions per clock.

Remember,

Performance = clock speed x instructions per cycle


AMD on the other hand are having a hard time in increasing clock speed (I don't blame them because it is very difficult to do this), but they do have an architecture which carries out more instructions per clock.

I think Intel is at an advantage, but I don't see why they don't modify their systems so that they will do more instructions per cycle. Then they will have a high figure for clock, and a high figure for instructions per cycle, and therefore a high overall performance.


It is true that performance depends on factors that you guys have mentioned, such as memory bandwidth, cache transfer rate etc., but all these play a role in carrying out instructions, so a large memory bandwidth would play a role in carrying out more instructions every time the clock pulses. So this is already included in the "instructions per cycle" term of the equation.

My question is, why doesn't intel change the structure of their computers (maybe make them similar to AMD's, apart from the CPU of course) so that they would carry out more instructions per clock, and hence better performance?
 

balzi

Distinguished
Oct 16, 2001
121
0
18,680
OK, sorry Conqueror, I mustn't have made my point clear enough.

The point my friend, behind higher clock-speeds and more done per clock-cycle, is that you MUST trade between massive clock-speeds and "IPC".

Intel has chosen to have less done at each stage of their pipe, which is the *MAIN* reason why they can hit higher clock-speeds.

AMD have chosen a different tack (note 1): they have decided to do a little more per clock-cycle (the more you do, the easier it is to tweak and get a higher average "IPC"), but they can't scale their clock-speed as high as Intel can.

imagine, for the sake of this example, that it takes 10pS (picoseconds) to switch a transistor on a 0.18um process. If Intel's pipe stages were 10 transistors then maybe AMD's pipe stages could be 20 transistors (??).

so each time Intel feeds data to a stage it takes 10 x 10pS = 100pS to complete. so if you clocked faster than 1/100pS = 10GHz you would not have valid data at the end of the pipe... therefore Intel's machine would have a high-end clock limit of 10GHz, assuming all pipe stages were 10 transistors or less.

AMD, on the other hand, would need 20 transistors x 10pS = 200pS per clock-cycle, which is 5GHz. so it can't 'start' as many instructions per second (5 billion vs. 10 billion), but it does just as much processing by doing twice as much with its 20 transistors as Intel can do with their 10 transistors.
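
here's the same made-up arithmetic as a small Python sketch. the transistor counts and the 10pS switching time are the invented numbers from the example above, not real chip data, and "work per stage" is just a hypothetical unit (1 for the 10-transistor stage, 2 for the 20-transistor stage).

```python
def max_clock_ghz(transistors_per_stage: int, switch_time_ps: float) -> float:
    """Clock ceiling set by how long a signal takes to ripple through one stage."""
    stage_delay_ps = transistors_per_stage * switch_time_ps
    return 1000.0 / stage_delay_ps  # 1000 ps per ns, so 1/ns gives GHz


def relative_throughput(clock_ghz: float, work_per_stage: float) -> float:
    """Toy measure of work done per second: clock ceiling x work done per cycle."""
    return clock_ghz * work_per_stage


short_stage_clock = max_clock_ghz(10, 10.0)  # the 10-transistor design
long_stage_clock = max_clock_ghz(20, 10.0)   # the 20-transistor design

print(short_stage_clock, long_stage_clock)
print(relative_throughput(short_stage_clock, 1.0),
      relative_throughput(long_stage_clock, 2.0))
```

in this toy model the two designs land on the same throughput; one gets there with clock speed and the other with work per cycle.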

Now to address the point of Intel doing as much as AMD per clock, or AMD meeting Intel's clock-speed: it is not that AMD's tech is slower, or that Intel's tech has bad IPC. It's all a matter of trade-off.

Intel cannot do as much as AMD because to get 10GHz they made each stage only 10 transistors to reduce **PROPAGATION DELAYS** (say it with me, propagation).
However, AMD cannot match Intel's 10GHz 'cause their pipe-line stages have 20 transistors and therefore double the propagation delay.
so you can't get both with current technology unless you get your transistors to switch faster (that is exactly what SOI, silicon on insulator, tries to achieve)

So if they both move to a 0.13um process, or when AMD employs SOI, maybe the transistors will switch in 5pS. so now Intel's architecture can hit 5pS x 10 transistors = 50pS, or 20GHz, but AMD will also hit 10GHz and will still be doing more per clock-cycle.

Now, scale this all up to reality and the main concepts still hold. 0.13um probably won't double clock-speeds though.

but after all that is said, you can see why clock-speed matters a lot less than how smart your design team is, how you can optimise your pipe-line, how your branch prediction unit works, how your parallel FPUs and ALUs work together, etc.

that's the end of "Balzi know best" part 2.. hehehe

hope i cleared your head out Conqueror. hey, are you related to my web-browser? his name is Konqueror.

"I spilled coffee all over my wife's nighty... ...serves me right for wearing it?!?"