p4 w/hyperthreading VS XP2800+/nforce2

October 27, 2002 5:07:03 AM

So here's the scoop: what do you think will be the dominant force?

I'm excited to see AMD back in the picture. I am curious what the current speculations are for which one will be the best in terms of overall performance.

Does hyperthreading require new programming for it to become useful? Will the XP 2800+ require a super cooling unit just to remain stable?

My current P4 (1.6, overclocked to 2.2) is a lot more stable than my old Athlon XP. That could be attributable to many factors, but in the end, it simply is more stable overall.

I use a lot of 3D and audio apps: Lightwave and 3ds Max, and Cubase SX. I loved seeing Lightwave gain such a boost from the Intel optimizations, but it took long enough for them to do it. Would hyperthreading be the same way? Will Intel have to convince programmers to implement it?

I'm very curious to hear what Tom's community thinks will be the more powerful solution. Thanks.
October 27, 2002 5:57:16 AM

The way I see it, Hyper-Threading will only improve programs with multiple contexts that don't use the processor to its fullest. A well-written application should be able to utilize the processor to its fullest already. Two dogs fighting over the same bone often end up fighting more than enjoying the bone.

The new nForce2 platform makes up for the relatively limited cache of the XP processor by speculatively prefetching memory loads. Once Barton comes out this will be less necessary, but it's still an improvement over other chipsets.

Each technology may sway the benchmarks a bit, but nothing makes up for core technologies and brute horsepower.

I think everyone will agree that for the moment Intel will continue to dominate with AMD catching up here and there. Hopefully next year will be much more competitive.

Complicated proofs are proofs of confusion.
October 27, 2002 2:21:49 PM

Something to mention: while it's true that applications may contend for resources, OSes that can recognize logical processors (e.g. WinXP) will assign threads to logical processors according to priority. Also, no code, anywhere, ever, maxes out the processor's resources, not x86 code.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 27, 2002 3:02:31 PM

Quote:
The way I see it, Hyper-Threading will only improve programs with multiple contexts that don't use the processor to its fullest.

Based on this comment, I don't believe you understand Hyperthreading very well. The whole purpose of Hyperthreading is to allow the CPU's execution units to do more work.

Even when your CPU monitor says the CPU is at 100%, the truth is that the CPU's execution units are somewhere around 30-35% utilized. (The rest of the time they're sitting idle waiting for the rest of the CPU to get ready to feed them some work.) The idea behind Hyperthreading is to boost <i><b>that</b></i> usage, thereby boosting the overall work that can be done by the processor.

As for individual programs, yes, ones that are coded with multiple execution threads will benefit most. But anyone trying to run two programs at once on a Hyperthreading enabled OS (WinXP, Linux) will see a boost, even if the individual programs are single-threaded.

PS - Don't run multiple programs at once? Think again. I run a fairly "clean" machine (i.e. right now I have antivirus software running in the background, I'm running IE to look at this forum, and I have Task Manager open). I currently have 27 processes and 319 threads executing! And it's only going to get worse.
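The thread-count observation above is easy to reproduce yourself. Here's a small Python sketch (the worker function is hypothetical, just a stand-in for a background poller like antivirus software) that spawns a few background threads and then counts its own threads the way Task Manager would:

```python
import threading
import time

def background_worker(stop_event):
    # Stand-in for an antivirus-style background poller.
    while not stop_event.is_set():
        time.sleep(0.01)

stop = threading.Event()
workers = [threading.Thread(target=background_worker, args=(stop,))
           for _ in range(4)]
for t in workers:
    t.start()

print(threading.active_count())  # main thread + 4 workers

stop.set()
for t in workers:
    t.join()
```

Even this trivial process runs five threads; a real desktop with a browser, antivirus, and system services multiplies that many times over.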
October 27, 2002 3:29:00 PM

The dominant force? Well, by the time the Athlon XP 2800+ comes out, you'll probably be seeing 3.2 GHz HT-enabled P4s running around. And the XP 2800+, even with the nForce2, won't be able to keep up with the 3.2 P4 performance-wise. What if the worst happens, and only the 3 GHz P4 is around by the time the 2800+ comes out? Well, the 3 GHz P4 combined with the Granite Bay chipset will be able to hold on to the performance crown. As I have stated many times, even Barton will have a hard time competing with the P4s. The only thing that can help AMD in the desktop market is the Clawhammer, and we won't be seeing it for a while, since AMD has decided to stop concentrating on desktops for a while.

Personally, I don't expect AMD to regain the performance crown in the desktop arena for quite a while.

To make FULL use of HT, some apps will have to be reprogrammed. By "FULL" use I mean the maximum benefit you can get out of HT, which Intel states is a 70% increase in performance, quite a significant increase. On average, without reprogramming, you can expect to see a 25% performance increase with HT. That is, of course, if you're using WinXP. Intel recommends that to enjoy HT to the fullest, you should be using WinXP, or at least Linux 2.4.x. The reason is simply the way earlier OSes handle logical CPUs, and how they assign threads. Intel, after having done testing, says that Win2k/98/Me/95 do not handle HT very well. That is, they aren't very good at switching a P4 with HT between multi-tasking and single-tasking. With the apps you use, expect to see a noticeable performance increase, especially if you use several of them at the same time.

Quote:
<i>Written by imgod2u</i>
Also, no code, anywhere, ever, maxes out the processor's resources, not x86 code.

That line is just classic!


Everything sonoran said is correct. Even if you're a "light" user, which most people on these forums are not, you'll see a benefit from HT. On Windows, the benefit should be a bit higher than on Linux, simply because Windows is more bloated and runs more useless processes than Linux.


- - -
<font color=green>All good things must come to an end … so they can be replaced by better things! :wink: </font color=green>
October 27, 2002 3:36:45 PM

Thanks for the great responses, guys.



Can you please explain the "Granite Bay" chipset to me? I have not heard of this yet :)
October 27, 2002 5:50:28 PM

Quote:
Based on this comment, I don't believe you understand Hyperthreading very well. The whole purpose of Hyperthreading is to allow the CPU's execution units to do more work.

Even when your CPU monitor says the CPU is at 100%, the truth is that the CPU's execution units are somewhere around 30-35% utilized. (The rest of the time they're sitting idle waiting for the rest of the CPU to get ready to feed them some work.) The idea behind Hyperthreading is to boost that usage, thereby boosting the overall work that can be done by the processor.

Hyper-Threading divides the processor down the middle except for the execution units. This means programs are sharing cache, interrupt request lines, and memory pathways. You can't tell me there aren't programs out there that need 512k of L2 cache and utilize the FP units to their fullest. Quake, anyone? When the P4's floating point execution units are running at a latency of 7-8 ticks and an instruction throughput of 1-2, throwing in a couple of extra instructions here and there is definitely going to slow things down a bit. Most programmers do their best to schedule a mix of integer and floating point instructions to utilize the pipeline to its fullest.

I had a boss once who one day decided our highly pipelined transformation engine needed to be multi-threaded to improve performance. I told him: first, it can be done (the answer all programmers should give first); second, it probably won't buy us that much; and third, it may cost us a little. Sometimes order is totally important. Because I didn't agree 100%, he got this other programmer to do it. So it got done, and it ended up costing us 20% performance. We had two threads fighting for the same resources, thrashing the cache, and in some cases running idle. I spent 3 weeks reworking the code, writing scheduling algorithms, and optimizing the system. I got it back to 95%, but it never reached the simple model. On a multi-processor system it ran at 110% when only running the transformation engine, and broke even when rasterizing was performed. In the end, our footprint was only 10% of the total execution timings and we were fighting for 1-2% of the pie. Oh well, I got paid.
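The "two dogs fighting over one bone" failure mode from that story can be sketched with a toy direct-mapped cache model in Python. Every number here (64 cache lines, the access streams) is invented purely for illustration, not a model of any real CPU:

```python
# Toy direct-mapped cache: two threads whose working sets alias the same
# cache lines evict each other on every access ("thrashing").

CACHE_LINES = 64

def hit_rate(accesses):
    cache = {}                       # line index -> tag currently resident
    hits = 0
    for addr in accesses:
        line, tag = addr % CACHE_LINES, addr // CACHE_LINES
        if cache.get(line) == tag:
            hits += 1
        else:
            cache[line] = tag        # miss: fill line, evicting the old tag
    return hits / len(accesses)

# One thread looping over a working set that fits: hits after warm-up.
loop = [a for _ in range(100) for a in range(CACHE_LINES)]
alone = hit_rate(loop)

# A second thread touches different addresses that map to the same lines.
other = [a + CACHE_LINES * CACHE_LINES for a in loop]
shared = hit_rate([x for pair in zip(loop, other) for x in pair])

print(round(alone, 2), round(shared, 2))  # high hit rate alone, thrashing shared
```

Running alone, only the first pass over the working set misses; interleaved, every access evicts the other thread's line, which is exactly the pathology the reworked scheduler in the story had to code around.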

Complicated proofs are proofs of confusion.
October 27, 2002 5:56:55 PM

You should hire Carmack then; sounds like you had a sloppy programmer. Don't get me wrong, I couldn't do any better, but I know many others can.

-Jeremy

<font color=blue>Just some advice from your friendly neighborhood blue man </font color=blue> :smile:
October 27, 2002 6:11:41 PM

I would not be surprised if AMD starts considering HT for the K8. That could potentially give them a significant boost over even an HT-enabled P4, as imgod2u said, because the number of execution units it has would definitely yield something. Just that alone could buy AMD a huge competitive edge. Unfortunately, I am not expecting them to implement a similar multithreading technique, though I would not be surprised if they did.

BTW I have 561 threads at the moment with 53 processes, I think I definitely need HTing!

--
I guess I just see the world from a fisheye. -Eden<P ID="edit"><FONT SIZE=-1><EM>Edited by Eden on 10/27/02 03:12 PM.</EM></FONT></P>
October 27, 2002 7:19:27 PM

Actually, he was a great programmer. Russian. He just didn't like arguing with the boss. I discussed the theories with him and he agreed with the outcome. With some serious redesigning I'm sure it could have become an optimal solution. As with my original comment on the Hyper-Threading architecture, sometimes the effort isn't worth it.

Complicated proofs are proofs of confusion.
October 27, 2002 8:33:00 PM

Quote:
Hyper-Threading divides the processor down the middle except for the execution units. This means programs are sharing cache, interrupt request lines, and memory pathways. You can't tell me there aren't programs out there that need 512k of L2 cache and utilize the FP units to their fullest. Quake, anyone? When the P4's floating point execution units are running at a latency of 7-8 ticks and an instruction throughput of 1-2, throwing in a couple of extra instructions here and there is definitely going to slow things down a bit. Most programmers do their best to schedule a mix of integer and floating point instructions to utilize the pipeline to its fullest.


I don't think this is true at all. While you may have a point about cache being split down the middle (which I still don't believe, though I can't find anything about it in the HT documentation), the actual physical resources are not split 50/50. One thread is given priority while a secondary thread is used to "fill in the gaps", at least in the early decoding stages. The problem comes when you get to the execution stage and you have more integer micro-ops from the 2 threads than your ALUs can handle. Supposedly, Intel has done away with this problem in revision 2 of HT, but only time will tell.
And while programmers try to throw in a mix of different instructions, that's not gonna help when you have data dependencies and memory latency, something the programmers have no control over. That's really the bulk of the limitation on ILP in x86 code: data dependencies. With 2 separate threads, you have 2 independent sets of instructions to choose from.
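The point about independent threads hiding dependency stalls can be illustrated with a toy in-order issue model. The 3-cycle latency and the single issue slot per cycle are invented for the illustration; these are not P4 or Athlon figures:

```python
# Toy model: every instruction depends on the previous instruction in its
# OWN thread, whose result is ready LATENCY cycles after issue. With one
# issue slot per cycle, a second independent thread fills the stall cycles.

LATENCY = 3

def cycles_to_retire(n_instructions, n_threads):
    ready = [0] * n_threads         # cycle at which each thread may issue next
    issued = 0
    cycle = 0
    while issued < n_instructions:
        for t in range(n_threads):  # round-robin: at most one issue per cycle
            if ready[t] <= cycle:
                issued += 1
                ready[t] = cycle + LATENCY
                break
        cycle += 1
    return cycle

print(cycles_to_retire(300, 1))  # one dependent stream: ~1 instr per 3 cycles
print(cycles_to_retire(300, 2))  # two independent streams fill the stalls
```

The second thread needs no knowledge of the first; its instructions are independent by construction, which is exactly why SMT can raise throughput without any help from the programmer.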

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 27, 2002 8:55:59 PM

Quote:
not x86 code

All code, unless you can find a computer with zero latency.

Now what to do??
October 27, 2002 8:58:06 PM

Quote:
PS - Don't run multiple programs at once? Think again. I run a fairly "clean" machine (i.e. right now I have antivirus software running in the background, I'm running IE to look at this forum, and I have Task Manager open). I currently have 27 processes and 319 threads executing! And it's only going to get worse.


That's what is in RAM, not what needs to be executed; only a few can be executing at the same time.

Now what to do??
October 27, 2002 11:11:18 PM

You'd be surprised at what processes are run at the same time. The majority of these "processes" are indeed only stored in RAM; however, a lot of things need to be done in real time, such as updating the clock based on the RTC. Antivirus programs run constantly, etc.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 27, 2002 11:46:29 PM

Quote:
Hyper-Threading divides the processor down the middle except for the execution units.

Actually, I just looked back through some of the stuff online about Hyperthreading. You may be surprised at how much stuff is duplicated. The following does a better job of explaining it all than I could, including a list of what's duplicated vs what's shared: <A HREF="http://cedar.intel.com/media/training/hyper_threading_i..." target="_new">http://cedar.intel.com/media/training/hyper_threading_i...</A>
October 28, 2002 12:03:28 AM

By the way, I never said Hyper-Threading was bad technology. I'm just saying it's not some magic bullet that makes everything that much faster. There are some cases where it does not apply. The best thing is that it can be disabled without too much trouble.

Complicated proofs are proofs of confusion.
October 28, 2002 12:14:36 AM

On a pure test like SPEC bench, only a few additional threads need to be executed (real life has many more threads). On Windows 98, Winamp keeps running even if the computer has crashed; the CPU isn't needed, or only very few instructions need to be done. Here we'd be trying to find 100K instructions that can be done at the same time as the main program's threads. It also decreases latency from having two threads: in the case of a cache miss, the other thread will have all the execution resources to itself until the stalled one finishes its work. I'd like to point out that the SMP implementation in the P4 Northwood is not thread-aware, so it doesn't make any distinction between thread 1 and thread 2. Most of the time the gain in benchmarks will be small, but real life is another thing.

Now what to do??
October 28, 2002 2:53:02 AM

What about it? Here's a quote:

"Finally, resources like caches (trace and L1 data), branch global history array, microcode ROM, the scheduler, register renaming control logic, and execution units are fully shared, though <b>not partitioned.</b>"

Certain queues, including the reorder buffers (which come into effect after the actual execution), are partitioned (split down the middle), but resources like cache, decoding, and execution are fully shared, not partitioned (i.e. not split down the middle); they're dynamically allocated.

<A HREF="http://developer.intel.com/technology/itj/2002/volume06..." target="_new">Intel's presentation</A> also depicts this.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 28, 2002 4:07:38 AM

Agreed. Shared does not necessarily mean halved and other stuff like Register Tables and Instruction Translation Lookaside Buffers are duplicated. Sorry about being misleading.

One more article with good benchmarks: <A HREF="http://www.2cpu.com/Hardware/ht_analysis" target="_new">http://www.2cpu.com/Hardware/ht_analysis</A>.

Here's a story about Windows multi-tasking. At my old company we had our controls embedded in a web page with some Flash animations. We found that if we had more than one of our graphs animating on the page, the Flash controls would not animate. I was put on the task of figuring out why this was so. After watching the messages, I found that Flash controls run off a multimedia timer, and while we were sucking up the CPU, they had no chance of acquiring it. Even on a multi-CPU system the same thing would happen. Another minor caveat: if we attempted to animate more than two graphs, the first one would run at full speed, the second would run at half speed, and the third would tick like once every two seconds. Any graph after that wouldn't tick at all.

I came up with an elegant scheduling algorithm that passed the execution time around and every so often just dropped out to allow other programs to execute. Even though it worked like a charm, my boss hated it. He said I had just added Windows 3.1 multi-tasking to Windows 98. In some ways he was right. We ended up trying some other tricks like adding more threads, setting priorities, etc. Nothing worked as well. The code remained, and another rift was made between my boss and me.
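The scheduling idea in that story (pass execution time around, then deliberately drop out so starved timers can fire) can be sketched in Python. All the names and slice sizes here are hypothetical; this illustrates the yielding pattern, not the original code:

```python
import time

def run_cooperatively(tasks, steps_per_slice=10, total_steps=100):
    # Round-robin driver: step each animation a little, then yield the CPU
    # so lower-priority work (like a Flash multimedia timer) can run.
    done = 0
    while done < total_steps * len(tasks):
        for task in tasks:              # pass execution time around
            for _ in range(steps_per_slice):
                task()                  # one small unit of animation work
            done += steps_per_slice
        time.sleep(0)                   # drop out: let other programs execute

ticks = [0, 0, 0]
def make_task(i):
    def task():
        ticks[i] += 1                   # stand-in for drawing one frame
    return task

run_cooperatively([make_task(i) for i in range(3)])
print(ticks)  # every "graph" advances equally instead of starving
```

The boss's jab was fair: `time.sleep(0)` here is morally the same as a Windows 3.1 cooperative yield. The difference is that it guarantees every task makes progress instead of the first graph starving the rest.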

Recursion - I curse then I recurse again and again
October 28, 2002 9:10:49 PM

Actually, because the K8 is so efficient in using its execution units and resources, I doubt that HT would make a big difference on the K8. The K8 does not have a lot of its execution units sitting idle, or its resources sitting idle. HT <b>might</b> provide a noticeable performance boost in 32-bit, IF the idle 64-bit resources in 32-bit mode can be utilized. Otherwise, there wouldn't be a big benefit.

Gemini, Granite Bay (or the 7205) is supposed to be similar to Intel's 850E chipset, except it's supposed to offer dual channel DDR266 support, AGP 8X support, and USB 2 support. Also, it has HT support and uses the ICH4 I/O hub, which is newer than the ICH2 I/O on the 850E.

Currently, I have 100 threads, and 20 processes running. I get this on a typical day.

- - -
<font color=green>All good things must come to an end … so they can be replaced by better things! :wink: </font color=green>
<P ID="edit"><FONT SIZE=-1><EM>Edited by Dark_Archonis on 10/28/02 06:11 PM.</EM></FONT></P>
October 28, 2002 9:20:02 PM

HT or CMP would give more power to a K7 than to a P4.

Now what to do??
October 28, 2002 10:46:13 PM

Care to explain?

- - -
<font color=green>All good things must come to an end … so they can be replaced by better things! :wink: </font color=green>
October 28, 2002 10:58:06 PM

Quote:
Actually, because the K8 is so efficient in using its execution units and resources, I doubt that HT would make a big difference on the K8. The K8 does not have a lot of its execution units sitting idle, or its resources sitting idle. HT might provide a noticeable performance boost in 32-bit, IF the idle 64-bit resources in 32-bit mode can be utilized. Otherwise, there wouldn't be a big benefit.


I wouldn't say so at all. Consider this, the K8 has 3 decoding units capable of decoding 3 simple x86 instructions per clock or 1 complex x86 instruction per clock, but it has 9 execution units (assuming nothing was changed from the Athlon). Your average x86 instruction translates to about 1.5 micro-ops, so that's an average of 4.5 micro-ops per clock that is fed into the execution units and that's the theoretical maximum. In reality, the maximum sustained decoding rate of an Athlon would be around 2.2 instructions or 3.3 micro-ops per clock. That is nowhere near enough to fill 9 execution units. Even at this optimal condition, you'd still have 66% of the execution units remaining idle at any given time. In reality, even the decoder units aren't filled at any given time. Your average instruction throughput for an Athlon (and considering the decoding stage for the K8 isn't very different in 32-bit mode) is around 1.2 IPC in good, well optimized code. Many times it's not even a full 1 instruction per clock. Think about how much you could benefit if you could get twice that at any stage, even the decoding rate. Multiple threads don't encounter data dependencies with respect to each other, so instead of guaranteeing that you'll always have 1 instruction to decode and enter into the pipeline, you're guaranteed to have 2, provided you get a cache hit.
Whatever made you think the K7 or K8 ever "efficiently" used its die space? It has 3 times the decoding logic as the P4, 1.5 times the execution resources and up to twice the dispatching ability, with what, a 20% IPC advantage on average code that is optimized for the previous P6/K7 design?
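The decode-versus-execute arithmetic above is easy to spell out. All figures here are the poster's own estimates (3 decoders, roughly 1.5 micro-ops per x86 instruction, 9 execution units, roughly 2.2 sustained decoded instructions per clock), not official AMD numbers:

```python
# Back-of-the-envelope: how full can 9 execution units be kept by the
# K7/K8-style front end described above? Estimates only, not AMD data.

DECODERS = 3
UOPS_PER_X86 = 1.5           # estimated micro-ops per average x86 instruction
EXEC_UNITS = 9
SUSTAINED_X86_PER_CLK = 2.2  # estimated realistic sustained decode rate

peak_uops = DECODERS * UOPS_PER_X86                 # theoretical best case
sustained_uops = SUSTAINED_X86_PER_CLK * UOPS_PER_X86

print(peak_uops / EXEC_UNITS)        # 0.5: half the units idle even at peak
print(sustained_uops / EXEC_UNITS)   # ~0.37: roughly 2/3 idle in practice
```

Under these assumptions, even a perfect decode cycle feeds only half the execution units, and a realistic one feeds barely a third, which is the gap an SMT scheme would try to fill with a second thread's micro-ops.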

Quote:
Gemini, Granite Bay (or the 7205) is supposed to be similar to Intel's 850E chipset, except it's supposed to offer dual channel DDR266 support, AGP 8X support, and USB 2 support. Also, it has HT support and uses the ICH4 I/O hub, which is newer than the ICH2 I/O on the 850E.


The i850e really just specifies the North Bridge. The motherboard manufacturers can actually choose to use whatever South Bridge they want. So it is very possible for say, Asus to use ICH4 with an i850e chipset.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 28, 2002 11:08:23 PM

Correct me if I'm wrong: the P4 stores code in its L1 cache as micro-ops, thus no decoding is necessary, while the K family decodes on the fly. This would cause a major bottleneck if twice as many instructions were thrown at K-family processors.

Recursion - I curse then I recurse again and again
October 28, 2002 11:14:14 PM

Yes, but storing decoded micro-ops in the trace cache only helps with repeatable code. A good deal of your average program contains one-time-run code that will never be used again, in which case you have the limitation of 1 x86 decode per clock. And even then, the trace cache is only capable of issuing 3 micro-ops per clock (equivalent to about 2 x86 instructions on average). Compare that with the K7's (and probably K8's) issuing rate of up to 6 micro-ops per clock (depending on how many micro-ops the x86 instructions are decoded into): that's still twice the issuing rate, with a resulting benefit of less than 10% (due to the issuing rate alone). In fact, I doubt that the issuing rate is the bottleneck even in the P4.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 29, 2002 1:38:42 AM

I dunno about Dark but I had a feeling he'd disagree on the HT thing. Still too much Intel faith.

Something came to me today as I was thinking about HT. If the P4 has 1 decoder, and you're trying to allow more threads at once, just how can you shove them all in to fill the units?
By definition, that would mean Intel has put a lot of hard work into making sure the HT rev 2 logic can predict threads and store their uops into the Trace Cache; otherwise I fail to see how one decoder can possibly do parallelism, considering there's only one!

BTW, I think you're being a bit strict on the uop decoding; I doubt anyone has proof of how many uops are decoded on the P4 on average. You say 1.5; I'd say 2.2, since Intel claims an average of 35% of units are used, which rounds to 2, and you could eventually say that the FPU/SSE2 unit would always be filled in intensive FP operations. You'd also want the AGUs to be ready to store and load, so 2-3 uops on average on P4s seems about right to me. Not to mention full 3-uop Trace Cache successes.

--
I guess I just see the world from a fisheye. -Eden
October 29, 2002 1:51:49 AM

Quote:
Something came to me today as I was thinking about HT. If the P4 has 1 decoder, and you're trying to allow more threads at once, just how can you shove them all in to fill the units?
By definition, that would mean Intel has put a lot of hard work into making sure the HT rev 2 logic can predict threads and store their uops into the Trace Cache; otherwise I fail to see how one decoder can possibly do parallelism, considering there's only one!


As I said, even 1 decoder isn't necessarily a bottleneck. Just think about what happens when you have a cache miss. The decoder just sits there idle while the processor makes a call to memory. What if you had another thread with instructions ready to feed the decoder, and that thread was in cache? You could effectively have the decoder keep on working clock after clock without any stalls or idle time.
Also, the trace cache does store almost all the repeatable code, so there are quite a few cache hits when you're running the bulk of the program, so HT would still benefit in that sense.

Quote:
BTW I think you're being a bit strict on the uOP decoding, I doubt anyone has proof of how many uOPS are decoded on the P4s on average. You say 1.5, I'd say 2.2 since Intel claims an average of 35% units are used, which is 2 rounded, and you could eventually say that the FPU/SSE2 unit would be always filled in intensive FP operations, you'd also want the AGUs to be ready to store and load, therefore 2-3 uOPS on average on P4s is quite average to me. Not to mention Trace Cache full 3uOP successes.


1.5 is my guess. The majority of simple x86 instructions such as iadd or imul would only take 1 micro-op. You throw in more complex instructions that could take 4 micro-ops and you bring that average up a little. Of course, the simple instructions are used more frequently than the complex ones, so the average shifts more towards the simple instruction lengths. You also have to keep in mind that in Intel's eyes, the trace cache is where the "process" starts, and they expect the trace cache to do the issuing of micro-ops most of the time, so that 35% number would be based on how many micro-ops the trace cache can issue, which is 3 maximum but probably a little less than that in actuality.
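A frequency-weighted average like that "1.5 micro-ops per instruction" guess works out as follows. The instruction mix here is invented purely for illustration; it is not measured P4 data:

```python
# Hypothetical dynamic instruction mix: micro-ops per instruction mapped to
# the fraction of the instruction stream with that cost.
mix = {
    1: 0.70,   # simple ops decode to a single micro-op
    2: 0.20,   # ops with an extra step, e.g. load + operate
    4: 0.10,   # complex instructions expanded into several micro-ops
}

avg_uops = sum(uops * share for uops, share in mix.items())
print(round(avg_uops, 2))  # -> 1.5
```

Because simple single-micro-op instructions dominate the stream, even a tail of 4-micro-op instructions only pulls the average up to about 1.5, which is why the estimate sits so close to 1.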

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 29, 2002 2:31:43 AM

Quote:
The i850e really just specifies the North Bridge. The motherboard manufacturers can actually choose to use whatever South Bridge they want. So it is very possible for say, Asus to use ICH4 with an i850e chipset.

The i850, or the version where they add an E, has Hub Link 1.1; the ICH4 needs Hub Link 1.5. Gigabyte has been able to make it work, but the possibility of instability is increased.

Now what to do??
October 29, 2002 2:40:39 AM

The trace cache is a no-brainer: big gain, no drawback except a bit more transistor use. In video games the gains are huge. At the Microprocessor Forum, an AMD engineer spoke about SMT on the K7; according to the engineer, the gain from HT would be little or none, but superthreading would be more useful. The K7 and K8 don't have a design made for doing SMT, so I guess AMD will have to find another next-gen design.

Now what to do??
October 29, 2002 2:57:47 AM

I'm trying to find some information on how HT will affect Banias.

Micro-op fusion, and Hyper-Threading 1 or 2.

Now what to do??
October 29, 2002 4:26:28 AM

Quote:
The trace cache is a no-brainer: big gain, no drawback except a bit more transistor use. In video games the gains are huge.


There is not really a lot of practical benefit in the trace cache idea as far as achieving average throughput. However, it does save a lot of die space. You don't need to have a 3-way, heavy duty decoding stage with tons of messy predecode and predict methods just to try to get enough instructions decoded. That's really the point, to save transistors, and it does a great job at it, although it's not an "all-round" replacement for heavy-duty decoders as there is a good amount of code in your average program that just can't be cached (it's not repeated, caching it would be useless).

Quote:
At the Microprocessor Forum, an AMD engineer spoke about SMT on the K7; according to the engineer, the gain from HT would be little or none, but superthreading would be more useful. The K7 and K8 don't have a design made for doing SMT, so I guess AMD will have to find another next-gen design.


What'd you expect AMD to say? "Oh wow! this Hyperthreading thing is really awesome! Too bad we don't have it in our current design!"

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 29, 2002 3:12:59 PM

Quote:
<i>Written by imgod2u</i>
That's really the point, to save transistors, and it does a great job at it, although it's not an "all-round" replacement for heavy-duty decoders as there is a good amount of code in your average program that just can't be cached (it's not repeated, caching it would be useless).

The "smart" cache of P4 takes almost 100kB worth of die space. And you call it a great job??
October 29, 2002 3:32:49 PM

Quote:
The "smart" cache of P4 takes almost 100kB worth of die space.
And you call it a great job??


Do you have any idea how much die space and heat 3-way heavy duty decoders would take? The trace cache is absolutely puny compared to that. Also where did you get 100 KB? The trace cache is 12k entries. Also, keep in mind that because you're constantly decoding and predecoding, you need a larger L1 cache, which, for instance on the Athlon takes up 64KB on top of the 3-way heavy duty decoders.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 29, 2002 4:09:36 PM

Well, I got it from a THG article.

But if that article was right, the puny 8kB predecoded L1 cache doesn't do its job very well.
It could, or will, do OK with "hyperthreading". That has been an essential part of the P4 from the very beginning; Intel just hadn't gotten it to work. It has been taking up a lot of the die space all along, since the first Willamette.

Tell me I'm wrong!!
October 29, 2002 4:22:47 PM

It's funny that I get a lot of crackers on me every time I write something opposing Intel.
October 29, 2002 4:30:06 PM

Quote:
But, if that article was right, the puny 8kB predecoded L1 cache doesn't do its job very well.

I would like to see this THG article. Also, the "8 KB of L1 cache" refers to the data cache, not the trace cache. IMO, the 8 KB of data cache is too small, but it's that small for a reason. It's ultra-low latency. The trace cache is 12k entries. Intel has never disclosed how much actual "bitspace" it takes up.

Quote:
It's funny that I get a lot of crackers on me every time I write something opposing Intel.


When you post information someone doesn't agree with, they post a reply, in a forum, imagine that.......

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 29, 2002 5:36:14 PM

Yes, you are right.
I was wrong there. It's a 12k-entry predecoded L1 cache? I got my wires crossed. Sorry.

But I still think that the P4 cache stinks; that is, without the HT.

I'm off for today.
October 29, 2002 6:35:58 PM

Quote:
<i>Written by imgod2u</i>
I wouldn't say so at all. Consider this, the K8 has 3 decoding units capable of decoding 3 simple x86 instructions per clock or 1 complex x86 instruction per clock, but it has 9 execution units (assuming nothing was changed from the Athlon). Your average x86 instruction translates to about 1.5 micro-ops, so that's an average of 4.5 micro-ops per clock that is fed into the execution units and that's the theoretical maximum. In reality, the maximum sustained decoding rate of an Athlon would be around 2.2 instructions or 3.3 micro-ops per clock. That is nowhere near enough to fill 9 execution units. Even at this optimal condition, you'd still have 66% of the execution units remaining idle at any given time. In reality, even the decoder units aren't filled at any given time. Your average instruction throughput for an Athlon (and considering the decoding stage for the K8 isn't very different in 32-bit mode) is around 1.2 IPC in good, well optimized code. Many times it's not even a full 1 instruction per clock. Think about how much you could benefit if you could get twice that at any stage, even the decoding rate. Multiple threads don't encounter data dependencies with respect to each other, so instead of guaranteeing that you'll always have 1 instruction to decode and enter into the pipeline, you're guaranteed to have 2, provided you get a cache hit.
Whatever made you think the K7 or K8 ever "efficiently" used its die space? It has 3 times the decoding logic as the P4, 1.5 times the execution resources and up to twice the dispatching ability, with what, a 20% IPC advantage on average code that is optimized for the previous P6/K7 design?

Thanks for the wake-up call. I really needed that. To be honest with you, I never knew that the Athlon was <b>that</b> inefficient. I guess I've been reading too many posts made by AMD fanboys. Now that you mention the Athlon's inefficiency, I'd like to change my stance on the HT thing. It seems HT WOULD provide a big performance advantage, but correct me if I'm wrong: wouldn't the increase in heat be higher than if HT were being used on a P4? I mean, since it has so much more decoding logic, as well as execution units, compared to the P4? If the K7 can get a bigger gain from HT than the P4, my understanding is that it would result in a more significant heat increase compared to the P4. Clock for clock, Athlons already produce more heat than P4s, since the transistors in the Athlons are packed closer together, along with the fact that it has 9 layers now.

Referring back to the Athlon's IPC, I used to wonder why the Athlon didn't perform better than it was achieving. I used to think about all the decoding and execution units, and the large L1 cache the Athlon carries, compared to the P4's small L1. And then I also thought about how current P4s perform so well. I remembered that the cache width of the P4 is a lot higher than that of the Athlon (I forget the exact numbers). Also, I remembered the excellent cache hit rate of the Athlon, as well as the great prefetching of the P4. In short, your post really gave me a wake-up call.

Quote:
<i>Written by imgod2u</i>
The i850e really just specifies the North Bridge. The motherboard manufacturers can actually choose to use whatever South Bridge they want. So it is very possible for say, Asus to use ICH4 with an i850e chipset.

When Intel says that Granite Bay will come with ICH4, that means that's their <i>recommended</i> South Bridge, correct?

Quote:
<i>Written by Eden</i>
I dunno about Dark but I had a feeling he'd disagree on the HT thing. Still too much Intel faith.

Hmm, I'm curious as to why you would think that. I made an honest mistake and <b>overestimated</b> the Athlon's ability. Because of this, I thought that HT wouldn't be a big help to the Athlon, because of my overestimation. How could you call that too much Intel faith?

Quote:
<i>Written by imgod2u</i>
I would like to see this THG article. Also, the "8 KB of L1 cache" refers to the data cache, not the trace cache. IMO, the 8 KB of data cache is too small, but it's that small for a reason. It's ultra-low latency. The trace cache is 12k entries. Intel has never disclosed how much actual "bitspace" it takes up.

True, true. I've been hearing rumours that both the data cache and the trace cache will increase in size and speed on Prescott.

Era, the trace cache is great. It's a very novel idea. Intel makes great caches, and technologically, the trace cache in the P4 is much better than the simple L1 cache in the Athlon. The only problem with the trace cache is that it's very small (along with the data cache). I don't know why you'd think that the P4's cache stinks. I've seen AMD fanboys before actually admit that Intel makes great caches.


- - -
<font color=green>All good things must come to an end … so they can be replaced by better things! :wink: </font color=green>
October 29, 2002 7:08:27 PM

Quote:
But I still think that the P4 cache stinks. That is, without the HT.

Well, that's your opinion and I can't control what you think. But the fact of the matter is that the trace cache is one of the more brilliant ideas in the P4: it cuts the necessary decoding logic to 1/3 of what it would otherwise need to be, yet maintains almost identical throughput with very little heat generated. (Remember, it's cache, meaning it's idle most of the time except for whatever small portion is being accessed, whereas a decoder is active almost all the time, sifting through instructions and trying to resolve data dependencies.) Despite whatever you may have heard, size isn't everything.

In response to Dark's post, ya, more parts of the die active means more heat, but you're actually getting more work done. Heat can be resolved through process refinements and new transistor technologies, we're mainly speaking about possible performance on an architectural level.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 29, 2002 7:35:51 PM

Quote:
As I said, even 1 decoder isn't necessarily a bottleneck. Just think about what happens when you have a cache miss. The decoder just sits there idle while the processor makes a call to memory. What if you had another thread with instructions ready to feed the decoder, and that thread was in cache? You could effectively have the decoder keep on working clock after clock without any stalls or idle time.

But by that definition, this isn't HyperThreading anymore; you're just doing ONE thread at a time during a fetch. I thought the point of HT was to send various threads at once to fill the units. On the P4 you'd have to rely heavily on the trace cache, which can only supply 3 uOPs per clock (two of one type and one of another, or one of each), to fill roughly 50% of the units. Otherwise, if you had 2 threads with no code repetition, I fail to see how in heavens HT could ever work when you shove 2 threads into one decoder. You know what I'm saying?

Quote:
which is 3 maximum but probably a little less than that in actuality.

Since uOPs come in whole numbers (you can't have one op and a half except as an average), we can safely say it's either 1, 2, or 3! :wink:

--
I guess I just see the world from a fisheye. -Eden
October 29, 2002 7:49:46 PM

Quote:
Hmm, I'm curious as to why you would think that. I made an honest mistake and overestimated the Athlon's ability. Because of this, I thought that HT wouldn't be a big help to the Athlon, because of my overestimation. How could you call that too much Intel faith?

To be honest, I still just plain felt that your posts here were somewhat pro-Intel, to the point that I couldn't figure out whether your saying the Athlon is very efficient in its execution units was an actual observation or just a concession to strengthen later arguments in favor of Intel. Sorry dude, I just felt it that way; you do often look for ways, in any situation, to give pointers to Intel, not from a competitive point of view but just for their sake. Though I apologize for that comment. You can comment on Intel's innovations anytime, I often do, but sometimes you just slip it in between words to boost their image.

Quote:
To be honest with you, I never knew that the Athlon was that inefficient.

All x86 processors, in the end, are inefficient. If I knew programming, I'd validate that in a second, but I'll take imgod2u's and Schmide's word for it that x86 simply isn't parallel-efficient, so no matter how far you push it, it'll never reach the optimum. Now imagine if AMD went to the extreme of building 9 decoders on the assumption that every instruction were a single uOP, lol, the lengths we'll go to to create ILP in x86!
Still, whatever the K7's inefficiencies, in contrast to the P4 it is much more efficient, and I'd assume 5 out of 9 units are used on average, which is about 55% of the K7's execution units.

Quote:
Era, the trace cache is great. It's a very novel idea. Intel makes great caches, and technologically, the trace cache in the P4 is much better than the simple L1 cache in the Athlon. The only problem with the trace cache is that it's very small (along with the data cache). I don't know why you'd think that the P4's cache stinks. I've seen AMD fanboys before actually admit that Intel makes great caches.

I couldn't agree more that the trace cache is awesome. The real deal, however, will be when programmers code for it in a way that reduces non-repeatable code, so that the TC can issue 3 uOPs more often and push single-thread execution toward 50% unit usage per clock. That could help significantly; otherwise the TC increases IPC by a mere 5% or so, since it only cuts the number of pipeline stages (out of 20) needed to execute an instruction. And if we count the frequency of mispredictions on P4s, the TC's efficiency only climbs as clock speeds increase! (kind of contradictory, but in a good way)

Quote:
True, true. I've been hearing rumours that both the data cache and the trace cache will increase in size and speed on Prescott.

I doubt it will; remember, Prescott is intended as a small core advancement. I personally brought up the expected core improvements with an Intel employee here, namely the higher FSB, the 1MB cache, and something else that just slipped my mind! He probably couldn't divulge anything under NDA, though he seemed to indicate he didn't really expect anything beyond the known advancements. Anyone would agree that, aside from SSE2, you still need to improve the x87 FPU, you just HAVE to, as there are applications that will probably not benefit much from raw power alone! (Look at the LW7 benches vs. 3ds Max, both rendering tools, or look at the science benches where K7s dominate.) I don't see how Intel can claim more powerful FPUs wouldn't help; if anything it would encourage programmers to code efficiently for the trace cache so it issues more FP uOPs, probably making it as efficient as the K7 or more so, assuming the P4 had 3 FPUs.

BTW imgod2u, I may have forgotten your answer, but if we had 3 FPUs with SSE2, instead of the P4's 1, would it be more powerful? As you say, we are discussing performance right now, not size, heat, or anything else that makes such a chip hard to produce.

--
I guess I just see the world from a fisheye. -Eden
October 29, 2002 7:58:25 PM

I think you have totally misunderstood HT on the P4. The way it works is that the idle slots in the first thread get filled with ops from the second thread. Here is an example...
<pre>Thread One
Units    1   2   3
ALU      X   X
FPU
LOD/STR          X
</pre><p><pre>Thread Two
Units    1   2   3
ALU          X
FPU
LOD/STR  X       X
</pre><p>As you can see, there are two conflicts between Thread One and Thread Two: one over the ALU and one over the Load/Store unit. Run back-to-back on a single processor, these two threads would take 6 clocks to complete. On an HT processor it would take only four clocks. Here's how...
<pre>Threads One and Two (One marked as X1 and Two marked as X2)
Units    1   2   3   4
ALU      X1  X1  X2
FPU
LOD/STR  X2      X1  X2
</pre><p>So you see, during the first clock the Load/Store unit would otherwise sit idle, so the HT logic fills it with Thread Two's Load/Store instruction. Thread Two's ALU op then has to wait until clock 3 for the ALU to be free, and its final Load/Store issues on clock 4.

I hope this makes sense.
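The tables above can be reproduced with a toy scheduler. This is only a sketch under the example's simplifying assumptions (one op per thread per clock, strict program order within each thread, one op per unit per clock, Thread One winning ties); the real P4 scheduler is far more elaborate:

```python
# Toy in-order SMT issue simulator reproducing the example above.
# Assumptions (not the real P4 scheduler): each thread issues at most
# one op per clock, ops issue in program order, each unit accepts one
# op per clock, and earlier-listed threads win ties.

def schedule(threads):
    """Greedy issue; returns the clocks needed to drain all threads."""
    pos = [0] * len(threads)              # next-op index per thread
    clock = 0
    while any(pos[i] < len(t) for i, t in enumerate(threads)):
        busy = set()                      # units already claimed this clock
        for i, t in enumerate(threads):
            if pos[i] < len(t) and t[pos[i]] not in busy:
                busy.add(t[pos[i]])       # issue this thread's next op
                pos[i] += 1
        clock += 1
    return clock

t1 = ["ALU", "ALU", "LOD/STR"]            # Thread One
t2 = ["LOD/STR", "ALU", "LOD/STR"]        # Thread Two

sequential = schedule([t1]) + schedule([t2])   # one thread after the other
smt = schedule([t1, t2])                        # both threads interleaved
print(f"one at a time: {sequential} clocks, SMT: {smt} clocks")
```

Running it gives 6 clocks sequentially and 4 clocks interleaved, matching the tables.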

KG


"Artificial intelligence is no match for natural stupidity." - Sarah Chambers
October 29, 2002 9:21:27 PM

I do understand that and I understand the cases when HT works, sometimes 25% better. I also understand that sometimes it does worse, sometimes 25% worse.

How do you explain the 15-25% hits on the memory benches and DivX benches? Sometimes programs do better when they have free rein of the processor.

Recursion - I curse then I recurse again and again
October 29, 2002 9:32:40 PM

Quote:
But by that definition, this isn't HyperThreading anymore; you're just doing ONE thread at a time during a fetch. I thought the point of HT was to send various threads at once to fill the units. On the P4 you'd have to rely heavily on the trace cache, which can only supply 3 uOPs per clock (two of one type and one of another, or one of each), to fill roughly 50% of the units. Otherwise, if you had 2 threads with no code repetition, I fail to see how in heavens HT could ever work when you shove 2 threads into one decoder. You know what I'm saying?


Hyperthreading is Intel's marketing name, so if they wanna call it Hyperthreading, it's Hyperthreading. That aside, it is indeed SMT (simultaneous multithreading). Conventional MPUs have to wait until one thread is finished before starting another (a lot of idle time). SMT-enabled MPUs can work on two different threads; one doesn't have to be completed before another can start. And btw, the trace cache <b>does</b> supply much more than 50% of the instructions used. It stores almost all repeatable code that's currently being processed, and code is repeated a lot in your average program.
As for non-repetitive code, did you read what I wrote? In a conventional non-SMT processor, if you run into a memory stall (the processor has to call out to memory and sit there idle), you <b>can't</b> do anything else but wait, because you must finish one thread before you start another. In an SMT-enabled MPU, the processor can take instructions from either thread and put them in the pipeline. So if the next instruction from one thread is out in memory, and an instruction from the other thread is in cache, you don't have to wait for the fetch from memory (wasting a lot of clock cycles); you can fetch the other thread's instruction from cache instead (far fewer clock cycles wasted). Less idle time, less wasted time.
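The stall-filling argument above can be put in rough numbers. A sketch with made-up latencies (the 100-clock miss penalty and the instruction counts are purely illustrative, not measured figures):

```python
# Toy model of the memory-stall argument above. Thread A takes one cache
# miss; thread B runs entirely out of cache. All numbers are made up.

a_work = 20      # clocks thread A spends actually decoding/executing
a_stall = 100    # clocks thread A sits waiting on one memory access
b_work = 60      # clocks thread B needs, all cache hits

# Non-SMT: one thread must finish before the other starts, stall and all.
sequential = (a_work + a_stall) + b_work

# SMT: B's instructions fill A's stall; only the uncovered part is lost.
smt = a_work + b_work + max(0, a_stall - b_work)

print(f"non-SMT: {sequential} clocks, SMT: {smt} clocks")
```

With these particular numbers the stall shrinks the total from 180 clocks to 120; the point is the mechanism, not the exact figures.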

Quote:
Since uOPs are integer numbers as in you can't have one op and a half, only in averaging, we can safely say it's either 1,2 or 3!


You can't have 2.2 instructions either. Or 1.5, or whatever. It's always the average that we speak of. Modern MPUs perform the same operations billions of times per second; it'd be pretty pointless to worry about what happens in just 1 clock. Again, total throughput is much more important than any 1 specific instruction or 1 specific clock.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
October 29, 2002 9:37:46 PM

Let me correct that. Remember, 100% better is the same gap as 50% worse. Sometimes HT does 70% better and sometimes it does 25% worse. The best thing is you can disable it, so it's a win-win situation: you can only do better or break even.
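The percentage point above is just the same ratio measured from opposite ends. A quick sketch with made-up timings (the 100 s and 200 s runtimes are purely hypothetical):

```python
# "100% better" and "50% worse" describe the same gap, measured from
# opposite ends. The times here are made up purely for illustration.

time_with_ht = 100.0      # seconds for a run with HT (hypothetical)
time_without_ht = 200.0   # seconds for the same run without HT

faster = time_without_ht / time_with_ht - 1   # the HT case is 100% faster
slower = 1 - time_with_ht / time_without_ht   # the non-HT case is 50% slower

print(f"{faster:.0%} better == {slower:.0%} worse")
```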

Recursion - I curse then I recurse again and again
October 29, 2002 11:04:49 PM

That's why they corrected the performance "hits" in HT2. At least they say they did.

-Jeremy

<font color=blue>Just some advice from your friendly neighborhood blue man </font color=blue> :smile:
October 29, 2002 11:13:02 PM

I thought HT2 wasn't scheduled until Prescott mid next year.

Recursion - I curse then I recurse again and again
October 29, 2002 11:18:19 PM

Perhaps, but from my understanding Prescott's HT (officially HT2) has LaGrande onboard, and HT on the 3.06 has been changed so that there won't be a performance hit any longer. So technically I'm dumb and should have said "revised HT", not HT2. My bad.

-Jeremy

<font color=blue>Just some advice from your friendly neighborhood blue man </font color=blue> :smile: