HyperThreading article misleading?

Last response: in CPUs
November 30, 2002 1:24:50 PM

I just read the article on Hyper-Threading, and was left with a feeling
that major points were missing and that the article was generally
misleading.

As I understand it, hyper-threading allows the CPU to execute
instructions in parallel whenever two instructions do not touch the
same registers. Additional hardware ensures that no "primary" thread
exists, and that both threads running on the CPU are served almost
equally. Please note that HT is not two CPUs in one, but an extra
virtual CPU, as noted in the article.

Now come the frustrating points:

1) The graph on "Hyper-Threading: A Virtual Dual CPU System" is very
misleading. It seems to indicate that it is not possible to utilize
the CPU 100% without HT if two threads running in parallel do not
utilize the CPU 100% each. This is simply untrue. The OS should be
able to switch to the next thread as soon as one thread makes a
blocking system call. This means that the CPU will _always_ be
utilized 100% if there exists a thread which can be executed on the
CPU. HT, however, can "over-utilize" the CPU better than non-HT CPUs, as
it has more instructions to choose from when trying to find two
instructions that do not touch the same registers. So an HT-enabled
CPU may perform at 105%, clock cycle-wise, compared to a non-HT CPU. That
said, there can be huge improvements for threads that often synchronize
with each other, but that is really a big study and a subject for a
Ph.D.

2) The conclusion of the article is that HT CPUs outperform non-HT
CPUs whenever multiple CPU-intensive programs are run in
parallel. Please - HT-enabled CPUs are not much faster than non-HT
CPUs. The reason for the "it feels better (TM)" is that threads
cannot starve other threads (if there are only two threads running at
the same time). If more than two CPU-intensive threads are running on
the system, starvation can still occur. What the conclusion of the
article indicates is that the Windows scheduler does a really poor job
- not that HT is better. It's like curing the symptoms of a badly
written OS. (Please note that I'm not favoring other OSes over Windows
- I do not know the implementation of the Windows scheduler
well enough to do that. But it would be interesting to
see how the same tests perform on, say, Linux 2.5.)

My conclusion (from a theoretical point of view) would be: If you
usually play games, HT is nothing for you - go with the clock rate. If,
however, you run multiple programs in parallel (recording videos,
ripping CDs and browsing the Internet), and cannot afford a real dual
system, then HT will provide a more interactive feel, but if the clock
rate is lower for the HT-enabled CPU (compared to the clock freq. of non-HT CPUs), encoding and recording will take longer.

I would go for HT anytime, as I do not care for games that much.

Lastly, the article seems to indicate that only Windows supports HT.
Note that Linux does also - I do not know about other OSes for the Intel platform.

Regards
November 30, 2002 8:02:13 PM

Quote:
This means that the CPU will _always_ be utilized 100% if there exists a thread which can be executed on the CPU.

No, the CPU is hardly ever utilized 100%. Modern CPUs have several separate execution units. When any single thread is being run, chances are that some of those execution units are not being used. For example, the SSE2 or x87 FPU units will not be running if you're playing Doom because Doom was programmed in 1993 and knows nothing about SSE2 and also doesn't use x87 code. However, with HT another thread that *does* use SSE2 may be run in parallel with Doom.

Ritesh
November 30, 2002 8:21:35 PM

My definition of 100% utilized was that a thread is executing instructions on the CPU. It's true that a CPU can never be utilized 100% in terms of all execution units being used by the thread having exclusive rights over the CPU. What I meant was that if two threads are running, and they both are CPU intensive, the CPU will not be idle. And as far as scheduling theory is concerned, a CPU can either execute a thread or not.
November 30, 2002 8:39:31 PM

Quote:
My definition of 100% utilized was that a thread is executing instructions on the CPU.

That is one screwed up definition. :wink:
But I can see what you mean.

In my eyes HT should never have been advertised as "two CPUs in one", nor as a "performance booster". Properly understood, it is a multitasking enhancement feature that allows you to run many programs at once with little to no lag or performance drop.
The extra performance boost is simply a side effect: HT keeps instructions from more than one thread flowing at once, so that the CPU is not left with pipeline bubbles.

--
*You can do anything you set your mind to man. -Eminem
November 30, 2002 9:13:17 PM

I think you're missing the point. You continue to talk about "the thread that has exclusive rights to the cpu". The point is that the operating system sees the HT enabled cpu as 2 processors and not as one, thereby allowing 2 threads to have "exclusive rights" to the cpu. As a result, the cpu can either be executing 1 thread, 2 threads or none at all.
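
A quick way to check what the OS sees is to ask it for the number of logical processors. This is only a minimal sketch in Python, which nobody in this thread mentions; the helper name logical_cpus is made up for the example. On a single HT-enabled CPU with Hyper-Threading switched on, and an OS that supports it, the count would be expected to be 2 rather than 1.

```python
import os


def logical_cpus():
    """Return the number of logical CPUs the OS exposes (None if unknown)."""
    return os.cpu_count()


if __name__ == "__main__":
    print("Logical CPUs seen by the OS:", logical_cpus())
```

The number reported is a count of logical processors, so it says nothing about how many physical cores or execution units sit behind them.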

What really amazes me is the different results which Tom and Anand get. The benchmarks on Anandtech paint a much better picture than the benches on this site. At least that's what I have noticed.

Regards
Andreas
November 30, 2002 9:22:13 PM

Not really, it's just that Anandtech's test CPUs are less spread out, and they only have 266 MHz deltas between CPUs. THG has almost every P4 MHz model.
With Anand you also see some benches THG doesn't have, or they used different rendering files. 3ds Max 5 at Anandtech had a file that rendered especially well for the P4's style.

--
**Canadian joke:
Here we don't say the word "retarded", we say "Alliance"! -Mike Bullard**
November 30, 2002 9:28:42 PM

I'm not missing the point at all :-). Try replacing CPU in my previous post with "processing element". Now see if it all makes sense. A processing element (PE) can only serve one thread. An HT CPU can be said to have two PEs. What I meant was that either a thread has exclusive rights over a PE, and can execute all the instructions it wants to, or not. That is the term as I use it. It's a definition, and a definition cannot (by definition) be wrong.
I still believe that the graph in that article makes the reader think: "Uh, this is smart - if two programs are running, each of which only requires 80% CPU time of its total lifetime, then HT can utilize the remaining 20% idleness." This is utterly wrong.
November 30, 2002 9:54:52 PM

Well, it would use the otherwise unused units, if it knows the uops of the second thread can go there. As I said, HT also tries as much as possible to reduce pipeline bubbles; if it can't, it lets the trace cache fetch pieces of the second thread to fill the other gaps.

--
**Canadian joke:
Here we don't say the word "retarded", we say "Alliance"! -Mike Bullard**
December 1, 2002 12:40:05 AM

Quote:
The graph on "Hyper-Threading: A Virtual Dual CPU System" is very
misleading. It seems to indicate that it is not possible to utilize
the CPU 100% without HT if two threads running in parallel do not
utilize the CPU 100% each. This is simply untrue. The OS should be
able to switch to the next thread as soon as one thread makes a
blocking system call. This means that the CPU will _always_ be
utilized 100% if there exists a thread which can be executed on the
CPU. HT, however, can "over-utilize" the CPU better than non-HT CPUs, as
it has more instructions to choose from when trying to find two
instructions that do not touch the same registers. So an HT-enabled
CPU may perform at 105%, clock cycle-wise, compared to a non-HT CPU. That
said, there can be huge improvements for threads that often synchronize
with each other, but that is really a big study and a subject for a
Ph.D.


This is simply not true. The processor is virtually never utilized 100%. Data dependencies between instructions in a single thread, not to mention memory latencies, decoding bottlenecks, scheduling conflicts, etc. etc. etc. all cause stalls in the processor. And unless that thread just happens to know when these problems happen (which is very unpredictable) and issues a HALT command, the processor is forced to remain idle and wait for the next instruction from the thread. This is one of the biggest problems in processor performance with x86, and probably all code out there.
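
The two meanings of "utilized" in this thread (a thread is on the CPU vs. the execution resources are actually busy) can be made concrete with a toy model. The sketch below is not how a P4 works internally. It assumes a core with three issue slots per cycle and describes each thread as a made-up list of how many instructions it can issue on each of its steps, where 0 stands for a stall such as waiting on memory. All names and numbers are hypothetical.

```python
ISSUE_WIDTH = 3  # assumed number of issue slots per cycle (illustrative only)


def run_time_sliced(threads):
    """One thread owns the core in each cycle; slots it cannot fill are wasted.

    Returns (cycles, instructions_issued).
    """
    cycles = sum(len(t) for t in threads)
    issued = sum(sum(t) for t in threads)
    return cycles, issued


def run_smt(a, b, width=ISSUE_WIDTH):
    """Two threads share the issue slots of one core in every cycle.

    A thread's next step is issued only if it fits in the slots still free
    in this cycle. Returns (cycles, instructions_issued).
    """
    threads = [list(a), list(b)]
    if any(step > width for t in threads for step in t):
        raise ValueError("a step cannot need more slots than the core has")
    pos = [0, 0]
    cycles = issued = 0
    while pos[0] < len(threads[0]) or pos[1] < len(threads[1]):
        free = width
        first = cycles % 2  # alternate priority so neither thread is "primary"
        for tid in (first, 1 - first):
            if pos[tid] < len(threads[tid]):
                need = threads[tid][pos[tid]]
                if need <= free:
                    free -= need
                    issued += need
                    pos[tid] += 1
        cycles += 1
    return cycles, issued


if __name__ == "__main__":
    thread_a = [2, 0, 1, 2, 0, 1]  # hypothetical per-step issue counts
    thread_b = [1, 1, 0, 2, 1, 0]
    for name, (cycles, issued) in [
        ("time-sliced", run_time_sliced([thread_a, thread_b])),
        ("SMT", run_smt(thread_a, thread_b)),
    ]:
        used = issued / (cycles * ISSUE_WIDTH)
        print(f"{name}: {cycles} cycles, {issued / cycles:.2f} instr/cycle, "
              f"{used:.0%} of issue slots used")
```

In both runs some thread is on the core in every cycle, so a scheduler would call the CPU busy the whole time. With these made-up traces the time-sliced run still leaves most issue slots empty, and the shared run finishes the same work in fewer cycles.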

Quote:
The conclusion of the article is that HT CPUs outperform non-HT
CPUs whenever multiple CPU-intensive programs are run in
parallel. Please - HT-enabled CPUs are not much faster than non-HT
CPUs. The reason for the "it feels better (TM)" is that threads
cannot starve other threads (if there are only two threads running at
the same time). If more than two CPU-intensive threads are running on
the system, starvation can still occur. What the conclusion of the
article indicates is that the Windows scheduler does a really poor job
- not that HT is better. It's like curing the symptoms of a badly
written OS. (Please note that I'm not favoring other OSes over Windows
- I do not know the implementation of the Windows scheduler
well enough to do that. But it would be interesting to
see how the same tests perform on, say, Linux 2.5.)


Windows XP seems to be able to manage multiple threads between logical processors quite well, as do Linux kernels above 2.4.3, I think. It's true that multiple processor-intensive applications can contend for the resources of the processor; however, with the OS being able to give priority to threads, you can effectively have a primary thread that's given full speed. It all depends on how well the OS handles thread priority, and looking at the benchmarks under WinXP (on every other site except Tom's, of course), it does a pretty good job.

Quote:
My conclusion (from a theoretical point of view) would be: If you
usually play games, HT is nothing for you - go with the clock rate. If,
however, you run multiple programs in parallel (recording videos,
ripping CDs and browsing the Internet), and cannot afford a real dual
system, then HT will provide a more interactive feel, but if the clock
rate is lower for the HT-enabled CPU (compared to the clock freq. of non-HT CPUs), encoding and recording will take longer.


Not necessarily; again, it would depend on the OS's ability to manage threads. It would also depend on how "thread-hogging" the second program is. Current HT implementations can only handle 2 threads simultaneously. So if an application took up both threads, it could significantly decrease the performance of the other program.

Quote:
Note that Linux does also - I do not know about other OSes for the Intel platform.


As I recall, Linux kernels 2.4.3 and higher support, and are optimized for, SMT.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
December 1, 2002 1:11:43 AM

The only program I know of that uses ALL the CPU pipelines is called Toast, a CPU stress tester specially designed for AMD CPUs.


Just because someone's a member of an ethnic minority doesn't mean they're not a nasty small-minded little jerk. -Terry Pratchett
December 1, 2002 1:19:28 AM

Quote:

Not necessarily; again, it would depend on the OS's ability to manage threads. It would also depend on how "thread-hogging" the second program is. Current HT implementations can only handle 2 threads simultaneously. So if an application took up both threads, it could significantly decrease the performance of the other program.

If there are two PEs (processing elements) in the machine, the worst-case latency will be better than the latency for a single-PE machine.

Quote:

As I recall, Linux kernels 2.4.3 and higher support, and are optimized for, SMT.

I guess that you mean SMP (symmetric multiprocessing).
Yes - 2.4 does a good job at scheduling threads to PEs, but somewhat lacks PE affinity. Linux 2.5 has an improved scheduler which uses per-PE runqueues and therefore automatically provides good PE affinity.
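
To illustrate why per-PE runqueues give affinity almost for free, here is a toy sketch. It is not the Linux 2.5 scheduler: the class and method names are invented for this example, and the placement rule for new threads (shortest queue) is an assumption made to keep it short.

```python
from collections import deque


class PerPERunqueues:
    """Toy model: one FIFO run queue per processing element (PE)."""

    def __init__(self, n_pe):
        self.queues = [deque() for _ in range(n_pe)]

    def enqueue(self, thread, last_pe=None):
        """Queue a runnable thread and return the PE it was placed on.

        A thread that has run before goes back to the queue of the PE it
        last ran on (affinity); a new thread goes to the shortest queue.
        """
        if last_pe is None:
            pe = min(range(len(self.queues)), key=lambda i: len(self.queues[i]))
        else:
            pe = last_pe
        self.queues[pe].append(thread)
        return pe

    def pick_next(self, pe):
        """Return the next thread for this PE, or None if its queue is empty."""
        queue = self.queues[pe]
        return queue.popleft() if queue else None


if __name__ == "__main__":
    rq = PerPERunqueues(2)
    for name in ("a", "b", "c"):
        print(name, "placed on PE", rq.enqueue(name))
    running = rq.pick_next(0)
    # When its time slice ends, the thread returns to the same PE's queue.
    print(running, "re-queued on PE", rq.enqueue(running, last_pe=0))
```

With a single global queue, any idle PE may pick up any thread, so a thread tends to wander between PEs. With one queue per PE, a thread stays where it is unless something explicitly moves it.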

But the fact still remains: having more PEs will help in the average case when threads are waiting to be processed, and the latency will be smaller the more PEs you have. But if you play games, it all comes down to the makespan - how many instructions the machine can execute over time.
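
The latency vs. makespan trade-off can be sketched with a small model. Everything here is an assumption made for illustration: all jobs are ready at time 0, they are dispatched in FIFO order without preemption, and the two PEs of the second machine each run at half the speed of the single PE of the first. The function name schedule and the job sizes are made up.

```python
import heapq


def schedule(jobs, n_pe, speed):
    """Dispatch jobs (in work units) FIFO onto n_pe PEs running at `speed`.

    Returns (worst_wait, makespan): the longest time any job waits before
    it starts, and the time at which the last job finishes.
    """
    free_at = [0.0] * n_pe  # time at which each PE next becomes free
    heapq.heapify(free_at)
    worst_wait = 0.0
    makespan = 0.0
    for work in jobs:
        start = heapq.heappop(free_at)  # earliest PE to become available
        end = start + work / speed
        worst_wait = max(worst_wait, start)
        makespan = max(makespan, end)
        heapq.heappush(free_at, end)
    return worst_wait, makespan


if __name__ == "__main__":
    jobs = [10, 1, 1]  # one long job (the "game") and two short ones
    print("1 PE at full speed:", schedule(jobs, n_pe=1, speed=1.0))
    print("2 PEs at half speed:", schedule(jobs, n_pe=2, speed=0.5))
```

In this model the short jobs start much sooner on the two-PE machine, because they no longer queue behind the long one, while the long job itself takes longer to finish. That matches the argument above: more PEs help responsiveness, while a single demanding program cares about raw throughput.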
December 1, 2002 1:42:54 AM

No, I mean SMT. All Linux kernels since 1.0 had support and were attempted to be optimized for SMP. Only 2.4.3 and higher were able to recognize logical processors and had optimizations to balance the load with multiple threads on an SMT-enabled processor.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
December 1, 2002 3:09:38 AM

Quote:
Data dependencies between instructions in a single thread, not to mention memory latencies, decoding bottlenecks, scheduling conflicts, etc. etc. etc. all cause stalls in the processor. And unless that thread just happens to know when these problems happen (which is very unpredictable) and issues a HALT command, the processor is forced to remain idle and wait for the next instruction from the thread. This is one of the biggest problems in processor performance with x86, and probably all code out there.

I've studied a lot of architectures and I can't think of a single platform (CISC, RISC, EPIC) that isn't susceptible to the above problems. Why single out x86? Or is that what you meant by "and probably all code out there"?

Dichromatic for your viewing pleasure...
December 1, 2002 3:30:45 AM

Yes, that is what I mean; however, the problem is particularly great with x86. Of all the ISAs currently in use, it is probably the one that has the most problems with ILP.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
December 1, 2002 12:01:24 PM

It actually IS smart, the HT principle. I think you are looking at it at the wrong level. You look at the CPU as a black box that gives its resources to a thread. Your words: "... that either a thread has exclusive rights over a PE, and can execute all the instructions it wants to, or not." That was halfway true for non-HT CPUs, but it is not anymore. What HT does is break through that black box and look into it, seeing where pipeline bubbles occur and pushing in some instructions of another thread. HT goes deeper than any OS can go. So I disagree with your statement about OSes, too. The fact that a system performs better on an HT-enabled CPU does not prove at all that the OS it is running has a lousy (?) scheduler.
To describe HT in brief: HT takes a CPU that could only 'see' ILP (instruction-level parallelism) and moves it up one level, so that it also sees parallelism between different threads, thereby gaining IPC (number of instructions completed per clock cycle).
I think, afgm, that you didn't get the principle behind HyperThreading. I hope it's a little more clear now, and if it's not ... go and have a look at ArsTechnica.com (http://www.arstechnica.com). That is quite a technical website, but they have very thorough explanations of CPUs and how they work. Not easy to read, but really worth it if you want to know the how and why of recent CPU developments.

Greetz,
Bikeman

Then again, that's just my opinion
December 2, 2002 2:59:19 AM

Hyper-Threading is the best feature I have seen since the days of the 486 and the original Pentium. The 486 introduced the staged pipeline and the Pentium superscalar execution; there is also the OoO feature.

I think HT is better than a 2-CPU box. It cuts latency everywhere, increases the load on the FSB and on I/O, and does not add very complex silicon like OoO. Most 64-bit CPUs are moving to SMT: POWER5, Alpha EV8, Montecito and, slowly, SPARC.
Edited by juin on 12/02/02 00:04 AM.
December 2, 2002 3:07:01 AM

The first SMT chip was the Pentium 4, and no compiler or OS is built to take advantage of it, and I will bet that they will never make great use of it.

Now what to do??
December 2, 2002 3:19:09 AM

Hyperthreading is just an effect of a good CPU design. With more registers added to the CPU, a virtual mode seems to have been created, thus making the CPU function in real as well as in virtual mode.

It's very nice, of course. A vast improvement.

But we should not worry ourselves much over hyperthreading, because in the end it is the buying market that will decide things for us: whether they like it or not, whether they like it but will buy their old preferences, whether they will gamble on the new one, whether their finances will allow it, and whether they will profit from it all!


"A wise man gets wiser as he gains more experiences", am I right?

What do you say?
December 2, 2002 9:41:05 AM

You guys are all missing the point...it simply doesn't matter how many threads are in the queue or if 2 threads are being executed simultaneously on 2 virtual CPUs that the OS thinks is a dualie but is really a single physical processor. (Anyway, that is an old concept...like 1975). Anyhoo, I have always had 2-CPU rigs, with Xeons, Athlon XPs (unlocked with Tom's help) and even PIII (yech!). This little Shuttle I have rox and I can take it anywhere I want. I filled it full with 2 GB Mushkin DDR-RAM and 2 WD JB1200s striped and mirrored and an ATi 9700 Pro. I would put it up against anyone's rig, dualie or not, running most apps. I run OGR 24/7, burn DVDs, and play UT2003 without a stutter. Sure it cost like a dualie, but wtf? I can take it anywhere I want. It weighs in at only like <4 kilos. My only gripe is that I use it with a DELL 2000FP at work, but I can see a real difference with a really good 21" CRT in terms of image quality and pixel response when I take it to a LAN party. The benchmarks run by Dr Tom and Anand the Shrimp will never show up in real life, so what the hell? Until the next one comes out (maybe the Nforce with dual DDR-RAM channels) I am quite satisfied. My rig cost me prolly $2200 without the LCD, and I think I got my money's worth. Here are full specs:

BTW: If you really are serious about performance, spend a few bucks on MemTurbo II now that it supports 2048 MB RAM. It rox! Also, RamDisk that makes a RAM Drive on your sys is totally cool when you can load a full game into RAM like Max Payne. I don't know if it works in WXP, but I used it alla time in W2K. Try them both.

Shuttle Intel chipset HT enabled MB Aluminium case
P4 3.06 HT enabled (little dab of AS3 on heatpipe sys)
2 GB Mushkin 266 DDR-RAM
2 WD 1200JB HDD (www.coolerguys.com thin HDD coolers)
Adaptec PCI IDE RAID 1+0 or RAID 10 or however you like to term it
Pioneer DVD-R/RW 4x w/ CD R/RW
Ati 9700 Pro 128 MB DDR-RAM ref model
WXP Pro SP1
No OCing
No Floppy
All onboard USB 2.0, Firewire, all that crap and 5.1 sound and 10/100 Ethernet chip.
Bose Noise cancelling headset (neighbor control)
Built in 1 day w/SW and all (LOL I got over 2 Mbits on tampabay.rr.com cable sys)RR Rulz!
Next week I will tweak the FSB till it rox or I get BSOD ;-)


Dual CPU freak. 2 Xeon systems and a Shuttle with HT running W2k, XP Pro, and Suse Linux 8.1
December 2, 2002 12:18:38 PM

Quote:
2 WD JB1200s striped and mirrored

Interesting that you managed to get 2 physical hard disks into a RAID10 array from a manufacturer that doesn't even support it. I imagine that's a new form of HyperThreading for hard disks.
December 4, 2002 7:51:01 PM

I guess thanatos does not have a 3.06 after all.

You are limited to what your mind can perceive.
December 4, 2002 8:45:07 PM

If that uses more than two instructions I will be impressed:
two instructions, a load and a store.

Now what to do??
December 4, 2002 10:54:35 PM

I'm unsure how Toast works, but when it's running at high priority the CPU gets 2-5C hotter than with any other program I've ever used: Burn in, prime95, that passcode cracker, seti@home, superPI, etc.


Just because someone's a member of an ethnic minority doesn't mean they're not a nasty small-minded little jerk. -Terry Pratchett
December 5, 2002 5:30:21 AM

The Shuttle AB48N is not available in the US; it's the only PE board from Shuttle.

You are limited to what your mind can perceive.