AMD vs. Intel

October 27, 2002 12:32:49 PM

How can an Athlon XP 2800+ with an nVidia GeForce2 chipset beat an Intel Pentium 4 2.8GHz with the Intel 850E chipset in most of the test results in the THG article, "AMD Travels Through Time: Athlon XP 2800+ with 166 FSB", when the bandwidth of the former is 2.6GB/s and that of the latter is 4.2GB/s? Also, the Athlon CPU is not as optimized as the Intel CPU.

Thanks in advance.


October 27, 2002 2:06:15 PM

Correction: there is no chipset named GeForce2. It should be nForce2.

Quote:
Also, the Athlon CPU is not as optimized as the Intel CPU.


What kind of optimization do you mean? The Athlon XP performs much better clock for clock than the P4. And a 2.8 GHz P4 doesn't really get 4.2 GB/s of memory bandwidth; in practice it gets about 3.3 GB/s. The Athlon gets about 2.5 GB/s in practice instead of the theoretical 2.7 GB/s. So the practical bandwidth difference is not as large as the theoretical difference. And DDR SDRAM has much better access time than RDRAM.
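
To put rough numbers on the theoretical versus practical gap, here is a minimal Python sketch. The bus parameters are assumptions not stated in the thread (64-bit buses, a P4 FSB quad-pumped at 133 MHz, an Athlon XP FSB double-pumped at 166 MHz), and `peak_bandwidth_gbs` is a hypothetical helper name. The measured figures are the ones quoted in the post above.

```python
def peak_bandwidth_gbs(base_clock_mhz, transfers_per_clock, bus_width_bytes):
    """Theoretical peak in GB/s (decimal): clock x transfers per clock x bytes per transfer."""
    return base_clock_mhz * 1e6 * transfers_per_clock * bus_width_bytes / 1e9

# Assumed bus parameters (not from the thread): 8-byte-wide buses,
# P4 FSB quad-pumped at 133 MHz, Athlon XP FSB double-pumped at 166 MHz.
p4_peak = peak_bandwidth_gbs(133, 4, 8)
athlon_peak = peak_bandwidth_gbs(166, 2, 8)

# Practical figures as quoted in the post (GB/s).
p4_measured = 3.3
athlon_measured = 2.5

# Fraction of the quoted theoretical bandwidth each platform actually reaches.
p4_efficiency = p4_measured / 4.2
athlon_efficiency = athlon_measured / 2.7

# How far ahead the P4 is on paper versus in practice.
theoretical_gap = 4.2 / 2.7
practical_gap = p4_measured / athlon_measured

print(f"P4 peak {p4_peak:.2f} GB/s, Athlon peak {athlon_peak:.2f} GB/s")
print(f"P4 efficiency {p4_efficiency:.0%}, Athlon efficiency {athlon_efficiency:.0%}")
print(f"Gap on paper {theoretical_gap:.2f}x, in practice {practical_gap:.2f}x")
```

Under these assumptions the P4's lead shrinks once measured numbers replace the theoretical ones, which is the point the post is making.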

What Audio Compression Technology you use for storing music? <A HREF="http://forumz.tomshardware.com/community/modules.php?na..." target="_new"> Tell Here</A>
October 27, 2002 2:19:55 PM

Quote:
How can an Athlon XP 2800+ with an nVidia GeForce2 chipset beat an Intel Pentium 4 2.8GHz with the Intel 850E chipset in most of the test results in the THG article, "AMD Travels Through Time: Athlon XP 2800+ with 166 FSB", when the bandwidth of the former is 2.6GB/s and that of the latter is 4.2GB/s? Also, the Athlon CPU is not as optimized as the Intel CPU.


The nForce2, as I recall, provides up to 5.2 GB/sec of memory bandwidth. What's the problem? Just because the Athlon can't take advantage of that, you deem it unfair?

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 27, 2002 3:41:48 PM

How, you ask? Well, the only reason the Athlon XP 2800+ manages to achieve that performance is the very aggressive memory settings on the nForce2, which have been proven to cause serious stability problems.

BTW, bandwidth doesn't really mean a lot. In the P4's case, it has a lot of bandwidth, but it uses it well. The P4 was designed to have high bandwidth.

The Athlon core doesn't need a lot of bandwidth, because it has a lot of execution power. The Athlon is very efficient in terms of using all of its resources and performance potential, but that does limit its ability to ramp in clock speed. The P4 can ramp high, and it's not using its full potential, which means the core has a lot of room for growth, which is a good thing.

Based on my explanation, you're probably thinking that the P4 should have horrible performance. In fact, it doesn't. To boost performance on the P4, Intel uses a few "tricks". One of them is a very efficient cache. Also, very good prefetching along with SSE2 helps out. The upcoming Hyperthreading is Intel's newest "trick" to boost the P4's performance without adding much heat or die space.

- - -
All good things must come to an end … so they can be replaced by better things! :wink:
Edited by Dark_Archonis on 10/27/02 12:48 PM.
October 27, 2002 6:34:34 PM

The stability problem of the early nForce2 was caused by the MCP-T southbridge. nVidia has fixed it.

Quote:
The Athlon core doesn't need a lot of bandwidth, because it has a lot of execution power


The Athlon core benefits from higher bandwidth if paired with a good chipset.

Quote:
The upcoming Hyperthreading is Intel's newest "trick" to boost the P4's performance without adding much heat or die space.


Hyperthreaded P4s will require more power than plain P4s, so there will be extra heat.

What Audio Compression Technology you use for storing music? <A HREF="http://forumz.tomshardware.com/community/modules.php?na..." target="_new"> Tell Here</A>
October 27, 2002 6:45:07 PM

Quote:
In the P4's case, it has a lot of bandwidth, but it uses it well

You should revise your facts, Dark: it does NOT use it well, and never has, due to latency. Look at the bandwidth Sandra reports, for example: it never reaches the theoretical 4.2GB/sec. And that's logical, since RDRAM has high latency, plus the CAS. In the end it's more like ~3.5GB/sec. The same goes for DDR; however, the low latency it can reach without much trouble gets it closer to its theoretical bandwidth.
Therefore, the P4 is NOT using its bandwidth efficiently. If it were, PC1066 would've given it a tremendous boost over what it currently gives.

Quote:
The Athlon core doesn't need a lot of bandwidth, because it has a lot of execution power.

From what I have read, the reason different architectures at the same clock speed need different amounts of bandwidth comes down to branch misprediction frequency and prefetching. Otherwise, how can you explain the 1.4GHz P3 needing 1.06GB/sec of bandwidth while Athlons at such speeds require DDR266? To be able to use more bandwidth, a CPU would simply need to be crappy at branch prediction and such. Improving that on a P4 would probably do it a lot of credit.

Quote:
The Athlon is very efficient in terms of using all of its resources and performance potential, but that does limit its ability to ramp in clock speed

If you mean heat-wise, I agree. If you mean technically, you are thinking like the rest of those who didn't read my 1ns= Hz topic, or who don't know a lot about frequency and pipelines.
I wish imgod2u would've told you this, but you DO NOT sacrifice IPC in order to get higher speeds by pipelining. This is theoretical, of course, because you run into heat and power demands as well as the physical limits of silicon. The point, however, is that the Athlon's high IPC is not what keeps it from ramping.
I agree it can't do so now because of heat problems and many process variables, but please don't state it like this without specifying HEAT.

Quote:
The P4 can ramp high, and it's not using its full potential, which means the core has a lot of room for growth, which is a good thing.

Referring to what I said above, this just backs up what you said about the Athlon: once P4s reach the IPC of Athlons, they would have ramping problems as well, THEORETICALLY, under your assumption.

Quote:
without adding much heat

That is not proven yet. A 70% increase in IPC due to HT would seem to me a very heat-demanding operation. The average 25% increases won't yield as much heat, but you can't just say right now that it won't add much.



--
I guess I just see the world from a fisheye. -Eden
Edited by Eden on 10/27/02 03:52 PM.
October 27, 2002 6:47:11 PM

I'd like to add that technically, there is no extra die space for HT, it is already on any P4, so I fail to see how future ones would need bigger space, unless it is a dramatically improved version that required a good 40mm^2 die space!

Didn't know nVidia fixed the problem, but did they confirm the performance variation after that?

--
I guess I just see the world from a fisheye. -Eden
October 27, 2002 8:23:27 PM

Quote:
How, you ask? Well, the only reason the Athlon XP 2800+ manages to achieve that performance is the very aggressive memory settings on the nForce2, which have been proven to cause serious stability problems.


The processor prefetch circuitry can't load enough instructions into cache at a time due to the lack of bandwidth. Timing or not, the best case for a memory access would be no memory access at all, i.e. the prefetch logic managed to load correct data into cache. This is very bandwidth-intensive and something the Athlon can't take advantage of due to its slower FSB.

Quote:
BTW, bandwidth doesn't really mean a lot. In the P4's case, it has a lot of bandwidth, but it uses it well. The P4 was designed to have high bandwidth.


The P4 craves bandwidth due to its excellent prefetch logic. Get an Athlon at 2.8 GHz and you will require just as much bandwidth if not more than a 2.8 GHz P4 because of how aggressively the prefetch logic would have to work to make sure the processor never has to access memory directly (and wait for the stalls).

Quote:
The Athlon core doesn't need a lot of bandwidth, because it has a lot of execution power.


Actually, it's due to its low clockspeed. Just think about it this way:
The processor runs at 1 GHz, that means each clock is 1 ns. The memory runs at 100 MHz, each clock is 10 ns. Let's say that the memory is DDR and is set to 2:2:2. Even in the best case, if the processor accesses memory, it'll have to wait 3 memory clocks (synchronized memory). That's 30 ns. That translates to 30 clocks that the processor remains idle.
Now let's say you had a 2 GHz processor. That same 30 clocks wasted is now 60 clocks wasted. What's the solution? More aggressive prefetch. Use as much memory side-bandwidth as you can to look through all the data, check the patterns of access and load all of that information into cache so that the processor never has to access memory directly.
At 1.8 or whatever GHz, the Athlon doesn't need prefetching as aggressive as a 2.8 GHz P4's. Therefore, it can make do with less bandwidth, and in fact it won't benefit much from more bandwidth anyway, because it is already getting excellent cache hit rates.
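
The stall arithmetic above can be sketched in a few lines of Python. This is only an illustration of the post's simplified model (a fixed wait of 3 memory clocks on a 100 MHz memory bus); `stall_cycles` is a hypothetical helper name, and real memory timings are more involved.

```python
def stall_cycles(cpu_clock_ghz, mem_clock_mhz, mem_clocks_waited):
    """CPU clocks spent idle while waiting on one memory access (simplified model)."""
    mem_cycle_ns = 1000.0 / mem_clock_mhz        # 100 MHz -> 10 ns per memory clock
    wait_ns = mem_clocks_waited * mem_cycle_ns   # 3 memory clocks -> 30 ns
    cpu_cycle_ns = 1.0 / cpu_clock_ghz           # 1 GHz -> 1 ns per CPU clock
    return wait_ns / cpu_cycle_ns

# Same memory, faster processors: the wait in nanoseconds is constant,
# but it costs more and more CPU clocks.
for ghz in (1.0, 1.8, 2.0, 2.8):
    print(f"{ghz} GHz CPU idles for about {stall_cycles(ghz, 100, 3):.0f} clocks per access")
```

The 1 GHz and 2 GHz cases reproduce the 30 and 60 wasted clocks from the post; the 1.8 GHz and 2.8 GHz rows are added here only to show the same model at the Athlon and P4 clock speeds under discussion.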

Quote:
Based on my explanation, you're probably thinking that the P4 should have horrible performance. In fact, it doesn't. To boost performance on the P4, Intel uses a few "tricks". One of them is a very efficient cache. Also, very good prefetching along with SSE2 helps out. The upcoming Hyperthreading is Intel's newest "trick" to boost the P4's performance without adding much heat or die space.


The P4 does indeed perform pretty horribly. Perhaps not "horribly" compared to older designs like the P3 or Athlon but horribly in that the P7 core is capable of so much more. The P4's design seems to be filled with crippling cutbacks. Of course, it's not impossible to get great performance out of the P4, but you'd have to significantly change the way you code your application.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 28, 2002 12:41:56 AM

So, it seems that the Athlon CPU's logic circuitry has a higher cache hit ratio than the P4, so it doesn't need the extra bandwidth that the P4 requires? Interesting.

It sounds like, dollar for dollar, the Athlon XP 2800+ packs a better punch.

I plan to build two computers: one will be used for gaming and development (e.g., Star Trek Voyager Elite Force, Mechwarrior 4: Vengeance, and Visual Studio .NET Professional), the other for graphics (e.g., Adobe Photoshop 6). Both computers will run Windows 2000 Professional and have business apps and utilities such as Office 2000 Professional, Norton AntiVirus 2002 and PerfectDisk 2000.

I may go with the Athlon XP 2800+ w/ nVidia nForce2 chipset (when it comes out) for the gaming computer. However, since Adobe Photoshop takes advantage of multiprocessors, I'm wondering if there would be a performance boost with a dual-processor over a single processor system. Any thoughts?

Edited by joey256 on 10/27/02 09:43 PM.
October 28, 2002 12:44:09 AM

I recommend a 2.4GHz P4 system, which you can overclock to 2.8GHz quite safely with the stock retail HSF. Couple it with DDR400 on the 845PE chipset, which has killer performance and support for HT, and you get yourself an all-in-one rendering and general-purpose system!

--
I guess I just see the world from a fisheye. -Eden
October 28, 2002 2:05:20 AM

damn. I wish I could understand what you guys are talking about so that I could participate in these cpu comparison conversations. I'm just a math major (who takes lots of cs classes). If anyone wants to talk about partial differential equations I'm on it but until then, damn I wish I would understand what you guys are talking about. How did you all learn as much as you know? or do you just regurgitate what you have read elsewhere? I suspect some of you know and some of you don't really know but for those of you who do know: where did you learn all this?

It's always darkest just before it goes pitch black.
October 28, 2002 2:08:47 AM

http://www.arstechnica.com is where I started learning a lot about CPUs. Mind you I'm only 16 so I can't take any IT or Computer Architecture classes yet, so I take what I learn and mix it with what current technology has. It's really interesting.

But those guys know far more than I do, so I also have trouble grasping it all.
--
I guess I just see the world from a fisheye. -Eden
Edited by Eden on 10/27/02 11:10 PM.
October 28, 2002 7:13:21 AM

Well, you are very intelligent and mature for a 16-year-old. My hat's off to you. :)

If ya don't ask..How ya gonna know.
October 28, 2002 1:41:46 PM

Quote:
What kind of optimization do you mean? The Athlon XP performs much better clock for clock than the P4. And a 2.8 GHz P4 doesn't really get 4.2 GB/s of memory bandwidth; in practice it gets about 3.3 GB/s. The Athlon gets about 2.5 GB/s in practice instead of the theoretical 2.7 GB/s. So the practical bandwidth difference is not as large as the theoretical difference. And DDR SDRAM has much better access time than RDRAM.


Only SiSoft Sandra says so. The STREAM benchmark shows that RDRAM is almost 2x faster than PC2100.

Now what to do??
October 28, 2002 2:45:47 PM

true dat.

It's always darkest just before it goes pitch black.
October 28, 2002 6:12:36 PM

:redface: ...thanks!

--
I guess I just see the world from a fisheye. -Eden
October 28, 2002 6:58:17 PM

Thanks for the link.
For my money, the XP 2200+ with the Ti 4600 is the best multitasking system right now.
October 28, 2002 7:38:28 PM

I forgot to add that in current games the CPU plays a much less important role, so a lower-end Athlon XP will make little difference in gaming performance compared to a high-end P4. It all depends on the graphics card.

--
I guess I just see the world from a fisheye. -Eden
October 28, 2002 8:23:46 PM

It would depend really. UT2003 seems to be pretty processor-intensive. I put together a Celeron system for my cousin a few months back and overclocking the thing from 1.1 to 1.46 GHz brought about a 20% boost in the flyby framerate and a 15% boost in the drone match framerate. Upping the thing to 1.6 GHz brought about a 14% and 10% boost in framerate respectively, so for a lot of the newer games, a fast processor is pretty important.
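
One way to read those numbers is to compare the framerate gain with the clock gain. The Python sketch below does this for the first overclock only (1.1 to 1.46 GHz), since the post does not say what baseline the 1.6 GHz percentages are measured against. `percent_gain` is a hypothetical helper, and the scaling ratio is an illustrative metric, not something the post itself computes.

```python
def percent_gain(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0

# Clock speeds and framerate gains as reported in the post.
clock_gain = percent_gain(1.1, 1.46)   # Celeron overclocked from 1.1 to 1.46 GHz
flyby_gain = 20.0                      # reported flyby framerate boost, percent
drone_gain = 15.0                      # reported drone match framerate boost, percent

# Fraction of the extra clock speed that showed up as extra framerate.
flyby_scaling = flyby_gain / clock_gain
drone_scaling = drone_gain / clock_gain

print(f"Clock gain: {clock_gain:.1f}%")
print(f"Flyby scaling: {flyby_scaling:.2f}, drone match scaling: {drone_scaling:.2f}")
```

A ratio of 1.0 would mean framerate scales perfectly with clock speed; values well below that suggest something other than the CPU (the graphics card or memory, for instance) is also limiting performance.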

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 28, 2002 8:33:10 PM

I'd say the minimum requirements are rising, but users with a 2GHz P4 who think a 2.8GHz one would bring them nearly 40% better performance are in for a disappointment.

--
I guess I just see the world from a fisheye. -Eden
October 28, 2002 8:50:27 PM

Quote:
Written by Spitfire_x86
The stability problem of the early nForce2 was caused by the MCP-T southbridge. nVidia has fixed it.

Interesting, I never knew that was the problem. I really do hope that retail nForce2's don't have any stability problems.

Quote:
Written by Spitfire_x86
Hyperthreaded P4s will require more power than plain P4s, so there will be extra heat.

Notice how (in my post) I said "without adding MUCH heat". I know there will be more heat, but it won't be a significant amount.

Quote:
Written by Eden
You should revise your facts, Dark: it does NOT use it well, and never has, due to latency. Look at the bandwidth Sandra reports, for example: it never reaches the theoretical 4.2GB/sec. And that's logical, since RDRAM has high latency, plus the CAS. In the end it's more like ~3.5GB/sec. The same goes for DDR; however, the low latency it can reach without much trouble gets it closer to its theoretical bandwidth.
Therefore, the P4 is NOT using its bandwidth efficiently. If it were, PC1066 would've given it a tremendous boost over what it currently gives.

The P4 itself does not suffer from high latency or anything. The P4's caches are extremely low latency, and the FSB is fast. Indeed, RDRAM does have high latency, which hinders the bandwidth of the P4, but the latency of PC1066 RDRAM is fairly small (it's about 20% less than PC800).

Quote:
Written by Eden
If you mean heat-wise, I agree. If you mean technically, you are thinking like the rest of those who didn't read my 1ns= Hz topic, or who don't know a lot about frequency and pipelines.
I wish imgod2u would've told you this, but you DO NOT sacrifice IPC in order to get higher speeds by pipelining. This is theoretical, of course, because you run into heat and power demands as well as the physical limits of silicon. The point, however, is that the Athlon's high IPC is not what keeps it from ramping.
I agree it can't do so now because of heat problems and many process variables, but please don't state it like this without specifying HEAT.

Indeed, I was referring to heat, but also to technical aspects. You do realize how hard it is (technically) to make all 9 K7 execution units run at 4GHz, for example? The problem is, there are physical and material limitations on how fast execution units (and caches) can run. Sure, the obvious problem is heat, but there are also technical constraints. For example, on a 0.13-micron process, a K8 WOULD NOT be able to reach 4GHz, simply due to heat as well as stability reasons. Normally, the architecture wouldn't be able to run at such a clock speed (similar to the P3 1.13GHz situation). For example, maybe the interconnects might not handle the frequency, or there might be too much leakage, or it might get too hot, etc. With SOI, all of these things are reduced, thus giving AMD the ability to ramp the K8's frequency 35% higher than normal. That's what I meant. Also, you must remember that the K8 is 9 layers deep, which causes quite a few heat pockets, not to mention electron interference. SOI remedies this somewhat. For its CPUs to meet its very stringent stability, reliability, safety, and other requirements at high frequencies (4GHz+), Intel is using such things as strained silicon, a 0.09-micron process, and low-k dielectrics. For speeds in the area of 10GHz and beyond, Intel will use the 0.065-micron and 0.045-micron manufacturing processes, BBUL packaging, as well as SOI (in 2005).

Quote:
Written by imgod2u
The P4 craves bandwidth due to its excellent prefetch logic. Get an Athlon at 2.8 GHz and you will require just as much bandwidth if not more than a 2.8 GHz P4 because of how aggressively the prefetch logic would have to work to make sure the processor never has to access memory directly (and wait for the stalls).

Sorry, I forgot about that.

Quote:
Written by imgod2u
Actually, it's due to its low clockspeed. Just think about it this way:
The processor runs at 1 GHz, that means each clock is 1 ns. The memory runs at 100 MHz, each clock is 10 ns. Let's say that the memory is DDR and is set to 2:2:2. Even in the best case, if the processor accesses memory, it'll have to wait 3 memory clocks (synchronized memory). That's 30 ns. That translates to 30 clocks that the processor remains idle.
Now let's say you had a 2 GHz processor. That same 30 clocks wasted is now 60 clocks wasted. What's the solution? More aggressive prefetch. Use as much memory side-bandwidth as you can to look through all the data, check the patterns of access and load all of that information into cache so that the processor never has to access memory directly.
At 1.8 or whatever GHz, the Athlon doesn't need prefetching as aggressive as a 2.8 GHz P4's. Therefore, it can make do with less bandwidth, and in fact it won't benefit much from more bandwidth anyway, because it is already getting excellent cache hit rates.

Darn, I knew I had forgotten something. It all makes sense now. I knew that it had something to do with the architecture, prefetching and cache hit-rates both slipped out of my mind.

BTW (this applies to you Eden, too), the P4 has excellent prefetching (more so than the Athlon) which does negate most of RDRAM's latency (after reading imgod2u's post over and over, I remembered this).

Quote:
Written by imgod2u
The P4 does indeed perform pretty horribly. Perhaps not "horribly" compared to older designs like the P3 or Athlon but horribly in that the P7 core is capable of so much more. The P4's design seems to be filled with crippling cutbacks. Of course, it's not impossible to get great performance out of the P4, but you'd have to significantly change the way you code your application.

Indeed, the P4 has a large amount of room for growth, since many features were cut back from the Willamette. No doubt you are familiar with the original features of the Willamette. I can just see it now: 3 double-pumped ALUs, 1MB L3 cache ...

Tommunist, I learned what I know in a way similar to Eden. I also read a lot, and I know this may sound corny, but I go digging for whitepapers on many companies' websites, like Intel's, AMD's, and Nvidia's. Whitepapers have a wealth of information, in case you didn't know.

BTW, I'm also 16.


- - -
All good things must come to an end … so they can be replaced by better things! :wink:
October 29, 2002 2:09:15 AM

Quote:
The P4 itself does not suffer from high latency or anything. The P4's caches are extremely low latency, and the FSB is fast. Indeed, RDRAM does have high latency, which hinders the bandwidth of the P4, but the latency of PC1066 RDRAM is fairly small (it's about 20% less than PC800).

Obviously, if the P4 can sustain a quad-pumped 100MHz FSB, it has good latency. However, RDRAM's problem is that it is still far from reaching its theoretical (and advertised) bandwidth. Just think how powerful it would be if it did; at least it would feed the FSB, which is probably sitting there waiting for about 20% of its clock cycles!

Quote:
Indeed, I was referring to heat, but also to technical aspects. You do realize how hard it is (technically) to make all 9 K7 execution units run at 4GHz, for example? The problem is, there are physical and material limitations on how fast execution units (and caches) can run. Sure, the obvious problem is heat, but there are also technical constraints. For example, on a 0.13-micron process, a K8 WOULD NOT be able to reach 4GHz, simply due to heat as well as stability reasons. Normally, the architecture wouldn't be able to run at such a clock speed (similar to the P3 1.13GHz situation). For example, maybe the interconnects might not handle the frequency, or there might be too much leakage, or it might get too hot, etc. With SOI, all of these things are reduced, thus giving AMD the ability to ramp the K8's frequency 35% higher than normal. That's what I meant. Also, you must remember that the K8 is 9 layers deep, which causes quite a few heat pockets, not to mention electron interference. SOI remedies this somewhat. For its CPUs to meet its very stringent stability, reliability, safety, and other requirements at high frequencies (4GHz+), Intel is using such things as strained silicon, a 0.09-micron process, and low-k dielectrics. For speeds in the area of 10GHz and beyond, Intel will use the 0.065-micron and 0.045-micron manufacturing processes, BBUL packaging, as well as SOI (in 2005).


I never even expected 0.13-micron K8s to reach 4GHz; I dunno how you of all people would!
Technically, by pipeline length laws, the K7/K8 should be half as fast as a P7 core at the utmost maximum. A 0.13-micron P4 can reach 4GHz and top out there, so a K7 has trouble scaling higher than 2GHz. 0.09-micron K7s can reach 4GHz.

Quote:
BTW (this applies to you Eden, too), the P4 has excellent prefetching (more so than the Athlon) which does negate most of RDRAM's latency (after reading imgod2u's post over and over, I remembered this).

I am aware of that; however, I am still having trouble believing that a 1.8GHz K7 needs less bandwidth than a P4 at 1.8GHz. If the P4 has 1 decoder and the Athlon has 3, technically the Athlon would need more bandwidth! Some say it's due to branch misprediction, so I guess it makes sense: with the P4's pipeline, more bandwidth is easily wasted, therefore you need more. I guess I may have answered my own question!
(if not, please correct me)

BTW if you're like me, you will drool at every 'naked silicon' you see, and feel excitement over seeing a new chip technology as well as dreaming of marrying a supercomputer. :smile: (ok scratch that last one!)


--
I guess I just see the world from a fisheye. -Eden
October 29, 2002 4:33:26 AM

Quote:
I never even expected 0.13-micron K8s to reach 4GHz; I dunno how you of all people would!
Technically, by pipeline length laws, the K7/K8 should be half as fast as a P7 core at the utmost maximum. A 0.13-micron P4 can reach 4GHz and top out there, so a K7 has trouble scaling higher than 2GHz. 0.09-micron K7s can reach 4GHz.


I don't know where you got this, I never said anything like this and no respectable web site has. There is no "law" stating that twice the pipeline length would result in twice the scalability, it simply makes scaling easier, nobody actually knows how much.

Quote:
I am aware of that; however, I am still having trouble believing that a 1.8GHz K7 needs less bandwidth than a P4 at 1.8GHz. If the P4 has 1 decoder and the Athlon has 3, technically the Athlon would need more bandwidth! Some say it's due to branch misprediction, so I guess it makes sense: with the P4's pipeline, more bandwidth is easily wasted, therefore you need more. I guess I may have answered my own question!
(if not, please correct me)


Umm, a P4 at 1.8 GHz would require less memory bandwidth than an Athlon at 1.8 GHz. Both processors are fetching instructions at the same rate and both suffer as much from a cache miss, with the Athlon requiring more instructions to be fetched every clock. A P4 at 2.26 GHz, however (roughly equivalent in performance to a 1.8 GHz Athlon), would require more memory bandwidth. An Athlon at 2.26 GHz would probably require more memory bandwidth than a P4 at 2.26 GHz; however, no Athlon currently on sale runs at 2.26 GHz. The top-of-the-line P4 that thrives on RDRAM bandwidth is at 2.8 GHz. You see where I'm going here?

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 29, 2002 1:57:12 PM

Dark_Archonis:
The P4 itself does not suffer from high latency or anything. The P4's caches are extremely low latency, and the FSB is fast. Indeed, RDRAM does have high latency, which hinders the bandwidth of the P4, but the latency of PC1066 RDRAM is fairly small (it's about 20% less than PC800).
--------------------------
Have you ever read any P4 benchmarks? It shines in memory benchmarks, but sucks in almost all the others (clock for clock).

It is not the latency of memory bandwidth that hinders the P4; it's the latency of the data it gets for processing, due to its puny and "smart" L1 cache.

The P4 was originally designed to operate with the now-famous so-called "hyperthreading"; Intel just couldn't make it work in time. But Intel was in a big hurry to bring a new processor to market because AMD was getting too far ahead (P3 1.13GHz, remember). So Intel brought out a lobotomized CPU with a higher clock speed than AMD's. "Hyperthreading" is a fundamental part of the P4, meant to cooperate with the "smart" L1 cache and the rest of the design. Intel just failed, once again. That's it.

Or am I wrong?
October 29, 2002 3:28:21 PM

Quote:
Have you ever read any P4 benchmarks? It shines in memory benchmarks, but sucks in almost all the others (clock for clock).


I assume you mean compared to an Athlon. I don't think there is a single application in which the P4 matches the Athlon's per-clock performance. However, that was not the point of the design. The P4 has about 1/3 the decoding logic, 66% of the execution logic, and half the issuing ability; it obviously was not meant to be a monster clock for clock. Nor does it need to be. Its extremely hyperpipelined design seems to be giving it a lot more scalability than either the previous P6 core or the K7 core, and the cutbacks saved a tremendous amount of die space and heat to allow for fancier features such as SSE2. And per-clock performance is at what, a 20% loss in modern code that has been well tuned for the previous P6 and K7 architectures? As code is adjusted to the P4's... quirks, that gap shrinks even further.

Quote:
It is not the latency of memory bandwidth that hinders the P4; it's the latency of the data it gets for processing, due to its puny and "smart" L1 cache.


I can't say I agree. While I do think that the L1 data cache should be bigger and the trace cache could benefit from being bigger, I don't think it's the major bottleneck. I would say that the major bottleneck for the P4 would be its x87 FPU. However, since Intel is trying to cast aside x87 and replace it with its own SSE to begin with, I doubt they really see it as a "bottleneck" anyway.

Quote:
The P4 was originally designed to operate with the now-famous so-called "hyperthreading"; Intel just couldn't make it work in time. But Intel was in a big hurry to bring a new processor to market because AMD was getting too far ahead (P3 1.13GHz, remember). So Intel brought out a lobotomized CPU with a higher clock speed than AMD's. "Hyperthreading" is a fundamental part of the P4, meant to cooperate with the "smart" L1 cache and the rest of the design. Intel just failed, once again. That's it.


Here I will agree with you. Intel obviously gave up on spending tons of die space to squeeze 10-20% more instruction level parallelism out of code and instead opted to go for thread-level parallelism.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 29, 2002 4:08:28 PM

The bottom line is =

AMD is equal performance to the P4 for half the price or less ~
October 29, 2002 4:08:29 PM

The bottom line is =

AMD is equal performance to the P4 for half the price or less ~
October 29, 2002 4:08:29 PM

The bottom line is =

AMD is equal performance to the P4 for half the price or less ~
October 29, 2002 4:08:38 PM

The bottom line is =

AMD is equal performance to the P4 for half the price or less ~
October 29, 2002 4:08:38 PM

The bottom line is =

AMD is equal performance to the P4 for half the price or less ~
October 29, 2002 6:57:18 PM

Quote:
Written by Eden
I never even expected 0.13-micron K8s to reach 4GHz; I dunno how you of all people would!
Technically, by pipeline length laws, the K7/K8 should be half as fast as a P7 core at the utmost maximum. A 0.13-micron P4 can reach 4GHz and top out there, so a K7 has trouble scaling higher than 2GHz. 0.09-micron K7s can reach 4GHz.

I didn't say they actually would reach such a clock speed, I was just providing an example. BTW, did you know that Intel recently stated that the "optimal" pipeline length is 20 stages, so it's very unlikely they'll go any higher.

Quote:
Written by Eden
I am aware of that; however, I am still having trouble believing that a 1.8GHz K7 needs less bandwidth than a P4 at 1.8GHz. If the P4 has 1 decoder and the Athlon has 3, technically the Athlon would need more bandwidth! Some say it's due to branch misprediction, so I guess it makes sense: with the P4's pipeline, more bandwidth is easily wasted, therefore you need more. I guess I may have answered my own question!

As imgod2u stated, the Athlon fetches more instructions per clock than the P4. The P4's prefetching also allows it to cut back a bit on the bandwidth.

Quote:
<i>Written by Era</i>
Have You ever read any benchmarks of P4? It shines on memory benchmarks, but sucks almost all in in the others(clock per clock).

It is not the latency of memory bandwith what hinders P4, it's the latency of data it gets for processing, due to it's puny and "smart" L1 cache.

P4 was originally designed to operate with the now famous so called "hyperthreading", Intel just gouldn't make it work in time. But Intel was in such a big hurry to bring a new processor in the market because AMD was getting too far ahead( P3 1,13MHz, remember).So Intel brought out a lobotomized CPU with a higher clockspeed then AMD.The "hyperthreading" is fundamental part of P4 to co-operate with the "smart" L1 cache and with the rest of the stuff .Intel just failed, ones again. Thats it.

Well, the P4 doesn't really have a lot of data processing latency. I don't know where you would get that. The P4's L1 is ultra-low latency, so it's very fast, but a side effect is that it's very small.

I also agree with you on the whole HT thing, and the fact that the P4 is missing so many of its original specs.

Hey phial, nice quintuple post. Hey, can anyone confirm if this is a forum record?

Your statement may have been true 3 years ago, but not now. Now, even the most expensive P4 out there (2.8GHz) is $458. The most expensive Athlon (XP2400) is $191. In most benchmarks, this P4 achieves about 20-30% better performance than the Athlon. Let's not forget that the P4 has better overclocking. So, in this case, your statement is semi-correct. But you're looking at the top-of-the-line P4, where Intel charges a hefty premium. In the mid-range, your statement is not true.

Things are only going to get worse for AMD. I heard that AMD will start selling their Athlons at a higher ASP (average selling price) than ever before, which is apparently a decision Hector Ruiz made. On top of that, Intel has another price cut coming in November, along with HT-enabled CPUs, so the price/performance gap between AMD and Intel will shrink even further.

BTW, if you check out Pricewatch right now, the XP2400 Athlon is going for $191, while the P4 2.4B is going for only $188. The 2.4B has about 5% better performance, has better overclocking ability, and actually costs less. Even if the prices for both were the same, the P4 would still be the better CPU in this case.
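To put rough numbers on the price/performance argument, here is a minimal sketch using only the prices and performance estimates quoted in this post. Relative performance is normalized to the Athlon XP 2400+ = 1.00, the 1.25 figure is just the midpoint of the "20-30%" claim, and the function name is made up for illustration.

```python
# Rough performance-per-dollar comparison using the figures quoted in the post.
# Relative performance is normalized so that the Athlon XP 2400+ = 1.00;
# the P4 uplifts (20-30% and about 5%) are the poster's estimates, not measurements.

def perf_per_dollar(relative_perf: float, price_usd: float) -> float:
    """Relative performance delivered per dollar spent."""
    return relative_perf / price_usd

athlon_2400 = perf_per_dollar(1.00, 191)

# Top of the line: P4 2.8 GHz at $458, taking the midpoint of the 20-30% claim.
p4_2800 = perf_per_dollar(1.25, 458)

# Mid-range: P4 2.4B at $188, about 5% faster than the Athlon.
p4_2400b = perf_per_dollar(1.05, 188)

print(f"Athlon XP 2400+ vs P4 2.8:  {athlon_2400 / p4_2800:.2f}x perf per dollar")
print(f"P4 2.4B vs Athlon XP 2400+: {p4_2400b / athlon_2400:.2f}x perf per dollar")
```

With these inputs the Athlon delivers a bit under twice the performance per dollar against the 2.8GHz part, while the 2.4B comes out slightly ahead of the Athlon, which lines up with the "semi-correct" verdict above.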

- - -
All good things must come to an end … so they can be replaced by better things! :wink:
Edited by Dark_Archonis on 10/29/02 04:01 PM.
October 29, 2002 7:21:28 PM

Quote:
I don't know where you got this, I never said anything like this and no respectable web site has. There is no "law" stating that twice the pipeline length would result in twice the scalability, it simply makes scaling easier, nobody actually knows how much.

Ok, I shouldn't have said laws, you're right. However, if we followed the idea of splitting tasks into two, you'd expect capabilities to be around twice as fast and perhaps even more. I still believe the K7 will top out at 5GHZ, after which no die shrink can possibly help the physics extend further, and that the P4 will go for ~10GHZ and top out there for a 20 stage pipe. We'll just see 2-3 years from now.

Quote:
Umm, a P4 at 1.8 GHz would require less memory bandwidth than an Athlon at 1.8 GHz. Both processors are fetching instructions at the same rate and both suffer as much from a cache miss, with the Athlon requiring more instructions to be fetched every clock. A P4 at 2.26 GHz, however (roughly equivalent in performance of a 1.8 Athlon) would require more memory bandwidth. An Athlon at 2.26 GHz would probably require more memory bandwidth than a P4 at 2.26 GHz, however, currently, no selling Athlons are at 2.26 GHz. The top of the line P4 that thrives on RDRAM bandwidth is at 2.8 GHz. You see where I'm going here?

That's where it becomes mixed up. If you actually did put the P4 at lower bandwidth, say we go for 266MHZ FSB on P4s, assuming they either used QDR at 66MHZ OR DDR at 133MHZ, and we used DDR 266. So you're telling me the P4 needs less bandwidth? Then how do you explain the big performance gap it'll suffer from?
When we increased the AthlonXP to 333MHZ FSB and memory, the performance jump was small.
It still is not entirely clear to me why bandwidth varies despite the exact same clock frequency rate of data entering the CPU, should we go for theoretical.
Let us assume there is a water pipe, and a half pipe connects the water from the pumps, to that pipe. Let us assume we send the water in at a certain pressure stream in function of the time in seconds, (clock speed) and we wanted to see how much has entered and filled the pipe, per second. The only way one pipe would have less water in the end and therefore require a higher amount sent at the same pressure, is for it to have little holes on the half pipe, before it entered the pipe to fill it. What am I trying to say?
That stuff like branch mispredictions in one second, flush some of the "flow" instead of maxing the pipe out at all times, and so in a water pipe example, you would want a higher "bandwidth" of water at the beginning to offset that problem, and in the end the bandwidth is higher, but the output is the same for both processors. That's how I see it, as why the P4 would need 3.2GB/sec at 1.8GHZ so it can possibly perform decently, when compared to an Athlon at 1.8GHZ requiring 2.1GB/sec.

--
I guess I just see the world from a fisheye. -Eden
October 29, 2002 9:22:44 PM

Quote:
That's where it becomes mixed up. If you actually did put the P4 at lower bandwidth, say we go for 266MHZ FSB on P4s, assuming they either used QDR at 66MHZ OR DDR at 133MHZ, and we used DDR 266. So you're telling me the P4 needs less bandwidth? Then how do you explain the big performance gap it'll suffer from?


A P4 Northwood at 1.8 GHz suffers from a huge performance gap when using DDR266? The 2.4B P4 shows an average difference of <A HREF="http://www.anandtech.com/showdoc.html?i=1624&p=6" target="_new">8.43%</A> when comparing PC1066 to DDR266 (i845G chipset). The VIA P4X266 chipset performs even better. And that's at 2.4 GHz; the gap's even smaller at 1.8 GHz.

Quote:
When we increased the AthlonXP to 333MHZ FSB and memory, the performance jump was small.


The Athlon was at 1.8 GHz, with an average of 5% performance boost from DDR266/266 FSB to DDR333/333 FSB. The P4 at 2.4 GHz improved 8.43% when going from DDR266 to PC1066. I would imagine at 1.8 GHz, the difference between DDR333 and DDR266 would be even less.

Quote:
It still is not entirely clear to me why bandwidth varies despite the exact same clock frequency rate of data entering the CPU, should we go for theoretical.


Chipset efficiency for the most part.

Quote:
Let us assume there is a water pipe, and a half pipe connects the water from the pumps, to that pipe. Let us assume we send the water in at a certain pressure stream in function of the time in seconds, (clock speed) and we wanted to see how much has entered and filled the pipe, per second. The only way one pipe would have less water in the end and therefore require a higher amount sent at the same pressure, is for it to have little holes on the half pipe, before it entered the pipe to fill it. What am I trying to say?

Unfortunately, it is not an entirely accurate analogy. You see, memory doesn't just send information constantly. The processor has to send a request for data, the chipset has to relay that signal, then the data has to be found in memory, and the memory has to charge, open the right bank and close the previous bank. All of that has to be done before 64 bytes (1 cache line for the P4) of data can be sent. After those 64 bytes, it has to start over again. As you can see, the memory is actually sending data for only a portion of the clocks. A lot of the time, it's looking for where the data is, charging, or waiting for a signal. How fast the chipset can queue and relay requests from the processor, along with certain other abilities, is crucial to how much total memory bandwidth you will get.
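The request/wait/transfer cycle described here can be sketched as a toy model: every cache-line request pays a fixed overhead before the burst starts, and requests are handled strictly one after another. That serialization is a simplifying assumption (real chipsets overlap requests), and the overhead values are hypothetical, picked only to show the shape of the effect. The 4.2 GB/s peak is the theoretical P4 figure quoted earlier in the thread.

```python
# Toy model: every cache-line request pays a fixed overhead (request relay,
# bank open/close, precharge) before the burst transfer starts.
# The overhead values below are hypothetical, not measured figures.

def effective_bandwidth_gbs(peak_gbs: float, line_bytes: int, overhead_ns: float) -> float:
    """Sustained bandwidth in GB/s when each line costs overhead + transfer time."""
    transfer_ns = line_bytes / peak_gbs  # 1 GB/s is 1 byte per ns, so this is in ns
    return line_bytes / (overhead_ns + transfer_ns)

for overhead_ns in (0.0, 5.0, 10.0):
    bw = effective_bandwidth_gbs(peak_gbs=4.2, line_bytes=64, overhead_ns=overhead_ns)
    print(f"overhead {overhead_ns:4.1f} ns -> {bw:.2f} GB/s of a 4.2 GB/s peak")
```

Only the zero-overhead case reaches the theoretical peak. Any fixed per-request cost pulls the sustained figure below it, which is one way to read why measured bandwidth falls short of the spec sheet.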

Quote:
That stuff like branch mispredictions in one second, flush some of the "flow" instead of maxing the pipe out at all times, and so in a water pipe example, you would want a higher "bandwidth" of water at the beginning to offset that problem, and in the end the bandwidth is higher, but the output is the same for both processors. That's how I see it, as why the P4 would need 3.2GB/sec at 1.8GHZ so it can possibly perform decently, when compared to an Athlon at 1.8GHZ requiring 2.1GB/sec.


As I said above, a P4 at 1.8 GHz would not necessarily require that much more memory bandwidth.
All that's required from cache/memory is that enough instructions are fetched per clock to enter the pipeline. Even if there was a branch mispredict, you'd still only need 1 instruction every clock. So really, that's not the reason.
Now, you'd think with the Athlon attempting to fetch 3 instructions every clock, it'd require more memory bandwidth per clock. However, that's not how the processor accesses memory. Whenever the processor accesses memory, it transfers a cache line. On the P4, this is 64 bytes. On the Athlon, I think this is 32 bytes. So even if those 32 bytes contained 3 instructions that were fetched and those 64 bytes contained only 1 instruction that was fetched, you'd still be transferring 32 bytes and 64 bytes. If those 64 bytes contain other instructions that will be used later, you will get a cache hit soon; otherwise, you'd have to go fetch other cache lines.
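A minimal sketch of that point: memory traffic is counted in whole cache lines, no matter how many instructions in each line actually get used. The cache here is idealized (unlimited size, no evictions), the 4-byte instruction size is a simplifying assumption, and the 32 and 64 byte line sizes are simply the ones assumed in this post.

```python
# Minimal sketch: count memory traffic for an instruction stream when every
# miss pulls in a whole cache line. The cache is idealized (unlimited size,
# no evictions), and the 4-byte instruction size is a simplifying assumption.

def fetch_traffic(addresses, line_bytes):
    """Return (misses, bytes_transferred) for a sequence of byte addresses."""
    cached_lines = set()
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        if line not in cached_lines:
            cached_lines.add(line)
            misses += 1
    return misses, misses * line_bytes

# 256 instructions laid out back to back (good spatial locality).
sequential = [i * 4 for i in range(256)]
# 256 instructions scattered 4 KB apart (no spatial locality).
scattered = [i * 4096 for i in range(256)]

for name, trace in (("sequential", sequential), ("scattered", scattered)):
    for line_bytes in (32, 64):
        misses, traffic = fetch_traffic(trace, line_bytes)
        print(f"{name:10s} {line_bytes}B lines: {misses:3d} misses, {traffic:5d} bytes moved")
```

For the back-to-back stream, both line sizes move the same 1024 bytes, but the 64-byte line gets there with half the misses. For the scattered stream, the miss count is identical and the 64-byte line moves twice as many bytes, most of them never used.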

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 30, 2002 12:51:40 AM

I think you missed a part here, I mentioned if we also downclocked the FSB of the P4, to 266MHZ, to sync with the memory.
THG's very first P4 article had benches where they downclocked RDRAM, there was a loss of performance at 1.4GHZ, and it'd have been even worse if the bus went down as well.

Quote:
I would imagine at 1.8 GHz, the difference between DDR333 and DDR266 would be even less.

I wouldn't, a lot here would agree that the DDR333 boosts much more on the P4 than Athlons.

Quote:
Chipset efficiency for the most part.

Yes that might change what I just said above, but that'd mean Intel's chipsets are still lax in performance, so that 3.2GB/sec is a need at these clock speeds.

Quote:
Now, you'd think with the Athlon attempting to fetch 3 instructions every clock, it'd require more memory bandwidth per clock. However, that's not how the processor accesses memory. Whenever the processor accesses memory, it transfers a cache line. On the P4, this is 64 bytes. On the Athlon, I think this is 32 bytes. So even if those 32 bytes contained 3 instructions that were fetched and those 64 bytes contained only 1 instruction that was fetched, you'd still be transferring 32 bytes and 64 bytes. If those 64 bytes contain other instructions that will be used later, you will get a cache hit soon; otherwise, you'd have to go fetch other cache lines.

Ok, that answers again what I said above about the chipset, heh. If it's true the P4 fetches bigger cache lines, then I guess it would make sense that some of those are wasted and so the total memory bandwidth could stand to go down. It would make sense as to why DDR333 competes with RDRAM: it is very efficient at its latencies, and being 500MB/sec less, as well as not even reaching its theoretical maximum, it kind of creates a leveling in the end, so that if you had the two cores at 1.8GHZ, the same bandwidth of 2.7GB/sec would work quite equally. Yes, I can see clearer now! It's really about the latency.
In the end all that remains, is really the IPC, in that the only reason for the clock per clock loss here is the IPC, as you remove the bandwidth variable for an instant.

PS: I just rethought about it, I'm having trouble here man, is it the latency or is it the cache lines? If you say the P4 could use less bandwidth at that clock speed, then just how on earth will it do so if it fetches 64 bytes compared to 32 bytes? Technically it will need higher bandwidth plus a very low latency. "sigh", there's always something that makes things more complex...
--
I guess I just see the world from a fisheye. -Eden
Edited by Eden on 10/29/02 09:53 PM.
October 30, 2002 5:41:32 AM

Quote:
I wouldn't, a lot here would agree that the DDR333 boosts much more on the P4 than Athlons.


Did you actually read the numbers I posted? At 2.4 GHz with a 533 MT/s FSB, there was a little over 8% difference performance-wise between PC1066 and DDR266. Dependency on the memory subsystem increases as the clock rate disparity between the processor and memory grows (e.g. a 2.4 GHz processor would require more aggressive prefetching than a 1.8 GHz processor of the same architecture and would therefore require more memory bandwidth). A 1.8 GHz P4 would suffer from even less than an 8% performance gap between DDR266 and PC1066 RIMMs. Imagine the difference between DDR266 and DDR333.

Quote:
Yes that might change what I just said above, but that'd mean Intel's chipsets are still lax in performance, so that 3.2GB/sec is a need at these clock speeds.


As I remember it, the 440BX was the last of the Intel chipsets to be the crown king as far as memory performance.

Quote:
I just rethought about it, I'm having trouble here man, is it the latency or is it the cache lines? If you say the P4 could use less bandwidth at that clock speed, then just how on earth will it do so if it fetches 64 bytes compared to 32 bytes? Technically it will need higher bandwidth plus a very low latency. "sigh", there's always something that makes things more complex...


Memory only spends part of its time actually transmitting the data; a great deal of time is just spent searching for the right row and column in memory and receiving the signal. The difference in real time between transferring 64 bytes and 32 bytes isn't all that much when you factor in the total time from when the processor sends the request to when it receives the entire cache line.
Also, keep in mind that since 64 bytes are transferred to cache, there's a greater chance that the cache line contains instructions that may be accessed later, as many programs follow spatial and temporal locality in that they store instructions that run one after the other close together in memory. So even though it takes a bit longer for the P4 to receive one cache line, that one cache line could potentially contain more useful instructions. This is where prefetching comes in as well, as it uses a lot of bandwidth to transfer other cache lines to cache.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 30, 2002 5:54:26 PM

Quote:
As I remember it, the 440BX was the last of the Intel chipsets to be the crown king as far as memory performance.

Snif, you're the second one this week to make me regret trading-in my old i440BX Pentium II MMX system for this AthlonXP! :frown:

--
I guess I just see the world from a fisheye. -Eden
October 30, 2002 6:08:40 PM

You gave up a 440BX system? That's like a collector's item. A legend if you will. Never give that up.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 30, 2002 6:48:03 PM

Ok that's it, I'm off to the bridge.... :frown:


PS: It was an Abit BH6, supporting P,P2,P3,Celeron.

--
I guess I just see the world from a fisheye. -Eden