p4 w/hyperthreading VS XP2800+/nforce2

gemini8026

Distinguished
Oct 27, 2002
27
0
18,530
So here's the scoop: what do you think will be the dominant force?

I'm excited to see AMD back in the picture. I am curious what the current speculation is about which one will be the best in terms of overall performance.

Does hyperthreading require new programming for it to become useful? Will the XP2800+ require a super cooling unit just to remain stable?

My current P4 (1.6 o/c'd to 2.2) is a lot more stable than my old Athlon XP. That could be attributable to many factors, but in the end, overall it is.

I use a lot of 3D and audio apps (Lightwave/3dsmax and CubaseSX). I loved seeing Lightwave gain such a boost from the Intel optimizations, but it TOOK long enough for them to do it. Would hyperthreading be the same way? Will Intel have to convince programmers to implement it?

I'm very curious to hear what Tom's community thinks will be the more powerful solution. Thx
 

Schmide

Distinguished
Aug 2, 2001
1,442
0
19,280
The way I see it, Hyper-Threading will only improve programs with multiple contexts that don't use the processor to its fullest. A well-written application should be able to utilize the processor to its fullest. Two dogs fighting over the same bone often end up fighting more than enjoying the bone.

The new nForce2 platform makes up for the relatively limited cache of the XP processor by speculatively guessing memory loads. Once Barton comes out this will be less necessary but still an improvement over other chipsets.

Each technology may sway the benchmarks a bit, but nothing makes up for core technologies and brute horsepower.

I think everyone will agree that for the moment Intel will continue to dominate with AMD catching up here and there. Hopefully next year will be much more competitive.

Complicated proofs are proofs of confusion.
 

imgod2u

Distinguished
Jul 1, 2002
890
0
18,980
Something to mention: while it's true that applications may contend for resources, OSes that can recognize logical processors (i.e. WinXP) will assign threads to logical processors based on priority. Also, no code, anywhere, ever, maxes out the processor's resources, not x86 code.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
 

sonoran

Distinguished
Jun 21, 2002
315
0
18,790
The way I see it, Hyper-Threading will only improve programs with multiple contexts that don't use the processor to its fullest.
Based on this comment, I don't believe you understand Hyperthreading very well. The whole purpose of Hyperthreading is to allow the CPU's execution units to do more work.

Even when your CPU monitor says the CPU is at 100%, the truth is that the CPU's execution units are somewhere around 30-35% utilized. (The rest of the time they're sitting idle waiting for the rest of the CPU to get ready to feed them some work.) The idea behind Hyperthreading is to boost that usage, thereby boosting the overall work that can be done by the processor.

As for individual programs, yes, ones that are coded with multiple execution threads will benefit most. But anyone trying to run two programs at once on a Hyperthreading enabled OS (WinXP, Linux) will see a boost, even if the individual programs are single-threaded.

PS - Don't run multiple programs at once? Think again. I run a fairly "clean" machine (i.e. right now I have antivirus software running in the background, I'm running IE to look at this forum, and I have Task Manager open). I currently have 27 processes and 319 threads executing! And it's only going to get worse.
 

Dark_Archonis

Distinguished
Apr 20, 2002
286
0
18,780
The dominant force? Well, by the time the Athlon XP 2800+ comes out, you'll probably be seeing the 3.2 GHz HT-enabled P4s running around. And the XP 2800+, even with the nForce2, won't be able to keep up with the 3.2 GHz P4 performance-wise. What if the worst happens, and only the 3 GHz P4 is around by the time the 2800+ comes out? Well, the 3 GHz P4 combined with the Granite Bay chipset will be able to hold on to the performance crown. As I have stated many times, even Barton will have a hard time competing with the P4s. The only thing that can help AMD in the desktop market is the ClawHammer, and we won't be seeing it for a while, since AMD has decided to stop concentrating on desktops for a while.

Personally, I don't expect AMD to regain the performance crown in the desktop arena for quite a while.

To make FULL use of HT, some apps will have to be reprogrammed. By "FULL" use I mean the maximum benefit you can get out of HT, which Intel states is a 70% increase in performance, quite a significant increase. On average, without reprogramming, you can expect to see a 25% performance increase with HT. That is, of course, if you're using WinXP. Intel recommends that to enjoy HT to the fullest, you should be using WinXP, or at least Linux 2.4.x. The reason is simply the way earlier OSes handle logical CPUs and how they assign threads. Intel, after having done testing, says that Win2k/98/Me/95 do not handle HT very well. That is, they aren't very good at switching a P4 with HT between multi-task and single-task. With the apps you use, expect to see a noticeable performance increase, especially if you use several of them at the same time.

Written by imgod2u
Also, no code, anywhere, ever, maxes out the processor's resources, not x86 code.
That line is just classic!


Everything sonoran said is correct. Even if you're a "light" user, which most people on these forums are not, you'll see a benefit from HT. In Windows, the benefit should be a bit higher than on Linux, simply because Windows is more bloated and runs more useless processes than Linux.


- - -
All good things must come to an end … so they can be replaced by better things! :wink:
 

gemini8026

Distinguished
Oct 27, 2002
27
0
18,530
Thanks for the great responses, guys.

Can you please explain the "Granite Bay" chipset to me? I have not heard of this yet :)
 

Schmide

Distinguished
Aug 2, 2001
1,442
0
19,280
Based on this comment, I don't believe you understand Hyperthreading very well. The whole purpose of Hyperthreading is to allow the CPU's execution units to do more work.

Even when your CPU monitor says the CPU is at 100%, the truth is that the CPU's execution units are somewhere around 30-35% utilized. (The rest of the time they're sitting idle waiting for the rest of the CPU to get ready to feed them some work.) The idea behind Hyperthreading is to boost that usage, thereby boosting the overall work that can be done by the processor.
Hyper-Threading divides the processor down the middle except for the execution units. This means programs are sharing cache, interrupt request lines, and memory pathways. You can't tell me there aren't programs out there that need 512k of L2 cache and utilize the FP units to their fullest. Quake, anyone? When the P4's floating point execution units are running at a tick latency of 7-8 and an instruction throughput of 1-2, throwing in a couple of instructions here and there is definitely going to slow things down a bit. Most programmers do their best to schedule a mix of integer and floating point instructions to utilize the pipeline to its fullest.

I had a boss once who one day decides our highly pipelined transformation engine needed to be multi-threaded to improve performance. I told him, first it can be done (the answer all programmers should give first), second it probably won't buy us that much and third it may cost us a little. Sometimes order is totally important. Because I didn't agree 100% he gets this other programmer to do it. So it gets done and ends up costing us 20% performance. We had two threads fighting for the same resources, thrashing the cache, and in some cases running idle. I spent 3 weeks reworking the code, writing scheduling algorithms, and optimizing the system. I got it back to 95% but it never reached the simple model. On a multi-processor system it ran at 110% when only running the transformation engine, and broke even when rasterizing was performed. In the end, our footprint was only 10% of the total execution timings and we were fighting for 1-2% of the pie. Oh well I got paid.
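For anyone who wants to check the "1-2% of the pie" figure, the numbers in the story can be run through Amdahl's law. This is only a rough sketch in Python, not anything from the original project: the function name and the scenario labels are made up for illustration, and it assumes "costing us 20%" means the engine ran at 0.80x its original speed.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the run time
    is sped up (or slowed down) by a factor of `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# From the post: the transformation engine was about 10% of total execution time.
FOOTPRINT = 0.10

# Hypothetical labels for the three outcomes described in the post.
scenarios = {
    "first threaded attempt (0.80x)": 0.80,
    "after three weeks of rework (0.95x)": 0.95,
    "multi-processor system (1.10x)": 1.10,
}

for name, local in scenarios.items():
    total = overall_speedup(FOOTPRINT, local)
    print(f"{name}: {100 * (total - 1):+.2f}% overall")
```

With a 10% footprint, the best case moves the whole program by about 1% and the worst case costs a couple of percent, which lines up with the story.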

Complicated proofs are proofs of confusion.
 

spud

Distinguished
Feb 17, 2001
3,406
0
20,780
You should hire Carmack then; sounds like you have a sloppy programmer. Don't get me wrong, I couldn't do any better, but I know many others can.

-Jeremy

Just some advice from your friendly neighborhood blue man :smile:
 

eden

Champion
I would not be surprised if AMD starts considering HT for the K8. That could potentially give them a significant boost over even an HT-enabled P4, as imgod2u said, because the number of execution units it has would definitely give out something. Just that alone could buy AMD a huge competitive edge. Unfortunately, I am not expecting them to implement a similar MTing technique, though I would not be surprised if they did.

BTW I have 561 threads at the moment with 53 processes, I think I definitely need HTing!

--
I guess I just see the world from a fisheye. -Eden
Edited by Eden on 10/27/02 03:12 PM.
 

Schmide

Distinguished
Aug 2, 2001
1,442
0
19,280
Actually, he was a great programmer. Russian. He just didn't like arguing with the boss. I discussed the theories with him and he agreed with the outcome. With some serious redesigning I'm sure it could have become an optimal solution. As with my original comment on the Hyper-Threading architecture, sometimes the effort isn't worth it.

Complicated proofs are proofs of confusion.
 

imgod2u

Distinguished
Jul 1, 2002
890
0
18,980
Hyper-Threading divides the processor down the middle except for the execution units. This means programs are sharing cache, interrupt request lines, and memory pathways. You can't tell me there aren't programs out there that need 512k of L2 cache and utilize the FP units to their fullest. Quake, anyone? When the P4's floating point execution units are running at a tick latency of 7-8 and an instruction throughput of 1-2, throwing in a couple of instructions here and there is definitely going to slow things down a bit. Most programmers do their best to schedule a mix of integer and floating point instructions to utilize the pipeline to its fullest.

I don't think this is true at all. While you may have a point about cache being split down the middle (which I still don't think is the case, but I can't find anything about it in the HT documentation), the actual physical resources are not split 50/50. One thread is given priority while a secondary thread is used to "fill in the gaps", at least in the early decoding stages. The problem comes when you get to the execution stage and you have more integer micro-ops from 2 threads than your ALUs can handle. Supposedly, Intel has done away with this problem with revision 2 of HT, but only time will tell.
And while programmers try to throw in a mix of different instructions, that's not gonna help when you have data dependencies and memory latency, something the programmers have no control over. That's really the bulk of the limitation on ILP in x86 code, data dependencies. With 2 separate threads, you have 2 independent sets of instructions to choose from.
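To put the data-dependency point in concrete terms, here is a toy Python model. It is purely illustrative: the latency and issue width are made-up parameters, not P4 figures, and giving the lower-numbered thread first pick each cycle is just a simple assumption. Each thread is a chain of operations where every op waits on the previous one, so a single thread can only issue once every `latency` cycles no matter how wide the core is.

```python
def cycles_to_finish(chain_lengths, latency=3, width=2):
    """Cycles needed to run several independent dependency chains.

    Each entry in `chain_lengths` is one thread whose ops each depend on the
    previous op, so that thread can issue again only `latency` cycles later.
    The core issues at most `width` ops per cycle across all threads.
    """
    remaining = list(chain_lengths)
    ready_at = [0] * len(remaining)  # cycle at which each thread may issue again
    cycle = 0
    finish = 0
    while any(remaining):
        issued = 0
        for t in range(len(remaining)):  # lower-numbered threads get first pick
            if issued == width:
                break
            if remaining[t] and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + latency
                finish = max(finish, cycle + latency)
                issued += 1
        cycle += 1
    return finish


one_after_another = cycles_to_finish([100]) + cycles_to_finish([100])
side_by_side = cycles_to_finish([100, 100])
print(one_after_another, side_by_side)
```

In this model the second thread costs nothing extra, because it only uses issue slots the first thread could never have filled. A real core has many more limits (cache, ALU contention) that this sketch ignores.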

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
 

juin

Distinguished
May 19, 2001
3,323
0
20,780
PS - Don't run multiple programs at once? Think again. I run a fairly "clean" machine (i.e. right now I have antivirus software running in the background, I'm running IE to look at this forum, and I have Task Manager open). I currently have 27 processes and 319 threads executing! And it's only going to get worse.


That is what is in RAM, not what needs to be executed. Only a few can be executed at the same time.

Now what to do??
 

imgod2u

Distinguished
Jul 1, 2002
890
0
18,980
You'd be surprised at what processes are run at the same time. The majority of these "processes" are indeed only stored in RAM; however, a lot of things need to be done in real time, such as updating the clock based on the RTC, antivirus programs that run constantly, etc.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
 

Schmide

Distinguished
Aug 2, 2001
1,442
0
19,280
Please read http://www.extremetech.com/article2/0,3973,570431,00.asp

Complicated proofs are proofs of confusion.
 

sonoran

Distinguished
Jun 21, 2002
315
0
18,790
Hyper-Threading divides the processor down the middle except for the execution units.
Actually, I just looked back through some of the stuff online about Hyperthreading. You may be surprised at how much stuff is duplicated. The following does a better job of explaining it all than I could, including a list of what's duplicated vs what's shared: http://cedar.intel.com/media/training/hyper_threading_intro/tutorial/index.htm
 

Schmide

Distinguished
Aug 2, 2001
1,442
0
19,280
Here is another great article:

http://arstechnica.com/paedia/h/hyperthreading/hyperthreading-1.html

Especially paragraph 3 here: http://arstechnica.com/paedia/h/hyperthreading/hyperthreading-5.html

After the game of course. Go Giants.

Complicated proofs are proofs of confusion.
 

Schmide

Distinguished
Aug 2, 2001
1,442
0
19,280
By the way, I never said Hyper-Threading was bad technology. I'm just saying it's not some magic bullet that makes everything that much faster. There are some cases where it does not apply. The best thing is it can be disabled without too much trouble.

Complicated proofs are proofs of confusion.
 

juin

Distinguished
May 19, 2001
3,323
0
20,780
On pure basic tests like the SPEC benchmarks, only a few additional threads need to be executed (real life has many more threads). For example, on Windows 98, Winamp continues running even if the computer has crashed; the CPU is not needed, or only very few instructions need to be done. Here we are trying to find some 100K instructions that can be done at the same time as the main program's threads. It also decreases latency from having two threads: in the case of a cache miss, the other thread will have all the execution resources to itself while it finishes its work. I'd like to point out that the SMP implementation in the P4 NW is not thread-aware, so it doesn't make any difference between threads '1' and '2'. Most of the time the gain in benchmarks will be small, but real life is another thing.

Now what to do??
 

imgod2u

Distinguished
Jul 1, 2002
890
0
18,980
What about it? Here's a quote:

"Finally, resources like caches (trace and L1 data), branch global history array, microcode ROM, the scheduler, register renaming control logic, and execution units are fully shared, though not partitioned."

Certain queues, including the reorder buffers (which come into effect after the actual execution), are partitioned (split down the middle), but resources like cache, decoding and execution are fully shared, not partitioned (i.e. not split down the middle); they are dynamically allocated.

Intel's presentation (http://developer.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.pdf) also depicts this.
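The difference between "partitioned" and "fully shared" is easy to show with a toy allocator. This Python sketch is only an illustration of the two ideas, not Intel's actual allocation policy: the function names, the 8-entry capacity, and the even split when both threads are oversubscribed are all assumptions.

```python
def partitioned(capacity, demand_a, demand_b):
    """Static split: each thread may use at most half the entries."""
    half = capacity // 2
    return min(demand_a, half), min(demand_b, half)


def dynamically_shared(capacity, demand_a, demand_b):
    """Shared pool: a thread may take whatever the other leaves unused.
    If both want more than their half, split evenly (an assumption)."""
    if demand_a + demand_b <= capacity:
        return demand_a, demand_b
    half = capacity // 2
    if demand_a <= half:
        return demand_a, capacity - demand_a
    if demand_b <= half:
        return capacity - demand_b, demand_b
    return half, capacity - half


# One busy thread (wants 7 entries) next to a mostly idle one (wants 1).
print(partitioned(8, 7, 1))
print(dynamically_shared(8, 7, 1))
```

When one thread is nearly idle, the partitioned version leaves entries unused while the shared version hands them to the busy thread. When both threads are hungry, the two behave the same.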

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
 

Schmide

Distinguished
Aug 2, 2001
1,442
0
19,280
Agreed. Shared does not necessarily mean halved and other stuff like Register Tables and Instruction Translation Lookaside Buffers are duplicated. Sorry about being misleading.

One more article with good benchmarks: http://www.2cpu.com/Hardware/ht_analysis

Here's a story about Windows multitasking. At my old company we had our controls embedded in a web page with some Flash animations. We found that if we had more than one of our graphs animating on the page, the Flash controls would not animate. I was put on the task of figuring out why this was so. After watching the messages, I found that Flash controls run off a multimedia timer, and while we were sucking up the CPU, they had no chance of acquiring it. Even on a multi-CPU system the same thing would happen. Another minor caveat: if we attempted to animate more than two graphs, the first one would run at full speed, the second would run at half speed, and the third would tick about once every two seconds. Any graph after that wouldn't tick at all. I came up with an elegant scheduling algorithm that passed the execution time around and every so often just dropped out to allow other programs to execute. Even though it worked like a charm, my boss hated it. He said I had just added Windows 3.1 multitasking to Windows 98. In some ways he was right. We ended up trying some other tricks like adding more threads, setting priorities, etc. Nothing worked as well. The code remained and another rift was made between my boss and me.
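The general shape of that kind of scheduler, passing time slices around and periodically stepping aside, can be sketched with Python generators. This is not the original code (which was Windows-specific and is not shown in the post): `graph_animation`, `run_round_robin`, and the `let_others_run` hook are hypothetical names, and the hook stands in for whatever call hands the CPU back to other programs.

```python
from collections import deque


def graph_animation(name, frames, log):
    """Hypothetical stand-in for one animated graph: draw a frame, then yield."""
    for frame in range(frames):
        log.append((name, frame))  # pretend to draw one frame
        yield                      # hand control back to the scheduler


def run_round_robin(tasks, slices_per_pass=2, let_others_run=lambda: None):
    """Pass execution around the tasks in turn; after every `slices_per_pass`
    slices, drop out by calling `let_others_run` (e.g. a sleep)."""
    queue = deque(tasks)
    slices = 0
    while queue:
        task = queue.popleft()
        try:
            next(task)          # run one slice of this task
        except StopIteration:
            continue            # task finished, do not requeue it
        queue.append(task)
        slices += 1
        if slices % slices_per_pass == 0:
            let_others_run()


log = []
graphs = [graph_animation(n, 3, log) for n in ("A", "B", "C")]
run_round_robin(graphs, slices_per_pass=3)
print(log)
```

Every graph gets a frame per pass, so none of them starves, which is the Windows 3.1 style of cooperative multitasking the boss was complaining about.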

Recursion - I curse then I recurse again and again
 

Dark_Archonis

Distinguished
Apr 20, 2002
286
0
18,780
Actually, because the K8 is so efficient in using its execution units and resources, I doubt that HT would make a big difference on the K8. The K8 does not have a lot of its execution units or its resources sitting idle much of the time. HT might provide a noticeable performance boost in 32-bit, IF the idle 64-bit resources in 32-bit mode can be utilized. Otherwise, there wouldn't be a big benefit.

Gemini, Granite Bay (or the 7205) is supposed to be similar to Intel's 850E chipset, except it's supposed to offer dual channel DDR 266 support, AGP 8X support, and USB 2 support. Also, it has HT support and uses the ICH 4 I/O, which is a newer hub compared to the ICH 2 I/O on the 850E.

Currently, I have 100 threads, and 20 processes running. I get this on a typical day.

- - -
All good things must come to an end … so they can be replaced by better things! :wink:
Edited by Dark_Archonis on 10/28/02 06:11 PM.
 

imgod2u

Distinguished
Jul 1, 2002
890
0
18,980
Actually, because the K8 is so efficient in using its execution units and resources, I doubt that HT would make a big difference on the K8. The K8 does not have a lot of its execution units or its resources sitting idle much of the time. HT might provide a noticeable performance boost in 32-bit, IF the idle 64-bit resources in 32-bit mode can be utilized. Otherwise, there wouldn't be a big benefit.

I wouldn't say so at all. Consider this, the K8 has 3 decoding units capable of decoding 3 simple x86 instructions per clock or 1 complex x86 instruction per clock, but it has 9 execution units (assuming nothing was changed from the Athlon). Your average x86 instruction translates to about 1.5 micro-ops, so that's an average of 4.5 micro-ops per clock that is fed into the execution units and that's the theoretical maximum. In reality, the maximum sustained decoding rate of an Athlon would be around 2.2 instructions or 3.3 micro-ops per clock. That is nowhere near enough to fill 9 execution units. Even at this optimal condition, you'd still have 66% of the execution units remaining idle at any given time. In reality, even the decoder units aren't filled at any given time. Your average instruction throughput for an Athlon (and considering the decoding stage for the K8 isn't very different in 32-bit mode) is around 1.2 IPC in good, well optimized code. Many times it's not even a full 1 instruction per clock. Think about how much you could benefit if you could get twice that at any stage, even the decoding rate. Multiple threads don't encounter data dependencies with respect to each other, so instead of guaranteeing that you'll always have 1 instruction to decode and enter into the pipeline, you're guaranteed to have 2, provided you get a cache hit.
Whatever made you think the K7 or K8 ever "efficiently" used its die space? It has 3 times the decoding logic as the P4, 1.5 times the execution resources and up to twice the dispatching ability, with what, a 20% IPC advantage on average code that is optimized for the previous P6/K7 design?
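The arithmetic above is easy to reproduce. A small Python sketch, using only the figures quoted in this post (3 decoders, 1.5 micro-ops per instruction, 9 execution units) and the simplifying assumption that each micro-op keeps one unit busy for one clock; the function name is made up for illustration:

```python
DECODERS = 3                # simple x86 instructions decoded per clock (from the post)
UOPS_PER_INSTRUCTION = 1.5  # average micro-ops per x86 instruction (from the post)
EXECUTION_UNITS = 9         # Athlon-style execution units (from the post)


def idle_fraction(instructions_per_clock):
    """Fraction of execution units with nothing to do in an average clock,
    assuming one micro-op occupies one unit for one clock (a simplification)."""
    uops = instructions_per_clock * UOPS_PER_INSTRUCTION
    return 1.0 - uops / EXECUTION_UNITS


for label, ipc in [("theoretical peak decode", DECODERS),
                   ("sustained decode", 2.2),
                   ("typical optimized code", 1.2)]:
    print(f"{label}: {ipc * UOPS_PER_INSTRUCTION:.1f} micro-ops/clock, "
          f"{idle_fraction(ipc):.0%} of units idle")
```

Under these assumptions the sustained decode rate leaves roughly 63% of the units idle, close to the two-thirds figure quoted above, and at 1.2 IPC about 80% are idle.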

Gemini, Granite Bay (or the 7205) is supposed to be similar to Intel's 850E chipset, except it's supposed to offer dual channel DDR 266 support, AGP 8X support, and USB 2 support. Also, it has HT support and uses the ICH 4 I/O, which is a newer hub compared to the ICH 2 I/O on the 850E.

The i850e really just specifies the North Bridge. The motherboard manufacturers can actually choose to use whatever South Bridge they want. So it is very possible for say, Asus to use ICH4 with an i850e chipset.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.