
How Hyper-Threading Works

Core i7-980X: Do You Want Six Cores Or 12 Threads?

While the Pentium III had a 10-stage instruction pipeline, the Pentium 4 processor increased pipeline length to 20 stages with the Willamette (180nm) and Northwood (130nm) cores. The following Prescott core (90nm) even ran a 31-stage pipeline. The last of its kind, Cedar Mill (65nm), maintained this execution pipeline structure.

The basic idea behind an instruction pipeline is to structure processing into independent steps, and putting more steps into a pipeline translates into higher execution throughput, especially at high clock speeds. However, leaving the pipeline partially empty or loaded with the wrong instructions leads to performance penalties. Program branches are the most critical factor, as the branch prediction unit of a CPU has to guess which branch will be followed in order to load the appropriate instructions.
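
As a rough illustration of why pipeline depth is a double-edged sword, here is a toy model (Python; all numbers are made-up assumptions for illustration, not measurements from these CPUs): splitting the work into more, shorter stages allows a shorter cycle time, but each mispredicted branch then wastes that many more cycles refilling the pipeline.

```python
# Toy model of pipelining (illustrative assumptions, not measured data):
# - A fixed amount of logic per instruction (WORK_NS) is split across
#   `depth` pipeline stages, so the cycle time shrinks as depth grows.
# - Each stage adds a small latch/setup overhead (OVERHEAD_NS).
# - A mispredicted branch flushes the pipeline, costing `depth` cycles.

WORK_NS = 10.0       # hypothetical total logic delay per instruction
OVERHEAD_NS = 0.1    # hypothetical per-stage latch/setup overhead
BRANCH_RATE = 0.20   # assumed fraction of instructions that are branches
MISS_RATE = 0.05     # assumed fraction of branches that are mispredicted

def avg_ns_per_instruction(depth):
    cycle_ns = WORK_NS / depth + OVERHEAD_NS                 # deeper pipe -> faster clock
    cycles_per_instr = 1 + BRANCH_RATE * MISS_RATE * depth   # misprediction flush penalty
    return cycle_ns * cycles_per_instr

for depth in (10, 20, 31):   # Pentium III, Willamette/Northwood, Prescott
    print(f"{depth:>2} stages: {avg_ns_per_instruction(depth):.3f} ns per instruction")
```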

The 31-stage pipelines of Prescott and Cedar Mill in particular depended on being kept efficiently filled. Intel therefore added a "replay unit," which allowed the processor to intercept operations that had been mistakenly sent for execution and replay them once the proper execution conditions were met. A side effect of the replay system was that some applications actually slowed down with Hyper-Threading enabled, because execution resources were tied up replaying work and thus unavailable to the second thread. At the time, Hyper-Threading's value was called into question, since it sometimes helped and sometimes hurt.

All Core i7 processors and most of the upper-mainstream Core i5 CPUs support Hyper-Threading.

Today’s implementation of Hyper-Threading is similar in that it presents each physical core to the operating system as a pair of logical processors. If the current task leaves execution resources idle, whether because of a branch misprediction, a cache miss, or another data dependency, the core can work on instructions from the second logical processor instead of stalling, which improves overall utilization.
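
From software's point of view, those logical processors simply show up as additional CPUs. As a minimal sketch (Python, assuming the third-party psutil package is installed; not part of the original article), you can compare the logical count the OS reports with the physical core count: on a Hyper-Threading chip like the Core i7-980X, the former is twice the latter.

```python
# Minimal sketch: compare logical vs. physical CPU counts.
# Assumes Python 3 and the third-party "psutil" package (pip install psutil).
import os
import psutil

logical = os.cpu_count()                    # logical processors the OS schedules on
physical = psutil.cpu_count(logical=False)  # physical cores reported by psutil

print(f"Logical processors: {logical}")
print(f"Physical cores:     {physical}")

if logical and physical and logical > physical:
    print("SMT/Hyper-Threading appears to be enabled.")
else:
    print("No SMT detected (or it is disabled in the BIOS).")
```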

All that you need to support Hyper-Threading is a platform whose BIOS exposes the feature and a compatible operating system (we take the actual HT-equipped processor for granted here). This has been the case since the days of Windows NT.
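
On Linux, one quick way to verify that the BIOS and operating system are actually exposing the extra threads is to compare the "siblings" and "cpu cores" fields in /proc/cpuinfo, as in this small sketch (our own illustration; Linux-only, and the field names assume an x86 kernel):

```python
# Linux-only sketch: if "siblings" exceeds "cpu cores", the OS sees
# Hyper-Threading (SMT) on this package.
fields = {}
with open("/proc/cpuinfo") as f:
    for line in f:
        if ":" in line:
            key, _, value = line.partition(":")
            fields.setdefault(key.strip(), value.strip())  # keep first CPU's entry

siblings = int(fields.get("siblings", "0"))
cores = int(fields.get("cpu cores", "0"))
print(f"siblings={siblings}, cpu cores={cores}")
print("Hyper-Threading is visible to the OS" if siblings > cores
      else "No Hyper-Threading visible (unsupported or disabled in the BIOS)")
```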

In the past, we’ve seen Hyper-Threading provide additional performance, but it also clearly added to power consumption (even if, according to Intel, it's a cheap feature in terms of die area). Heavily threaded applications and workloads naturally get more out of many cores and extra threads than mainstream software that is poorly optimized for multi-threading.
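
Whether the extra threads help or hurt a given workload is ultimately something you have to measure. A minimal benchmarking sketch (Python with multiprocessing; the busy-work function and worker counts are hypothetical stand-ins for a real workload) might look like this: run the same CPU-bound job once with one worker per physical core and once with one worker per logical processor, then compare the wall-clock times.

```python
# Minimal sketch: time a CPU-bound job at different worker counts to see
# whether scheduling on the logical (Hyper-Threading) processors adds
# throughput. The workload below is a made-up stand-in; substitute your own.
import time
from multiprocessing import Pool

def busy_work(seed):
    total = 0
    for i in range(2_000_000):
        total += (i * seed) % 7
    return total

def run(workers, jobs=24):
    start = time.perf_counter()
    with Pool(processes=workers) as pool:
        pool.map(busy_work, range(jobs))
    return time.perf_counter() - start

if __name__ == "__main__":
    for workers in (6, 12):   # e.g., physical cores vs. logical processors on a 980X
        print(f"{workers:>2} workers: {run(workers):.2f} s")
```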

Comments
  • 3 Hide
    shin0bi272 , March 22, 2010 6:18 AM
    One of the issues I have tried to get the guys at dbpoweramp to see is that at least with HT you have the opportunity to do twice the number of tracks (12 in the case of the 980); even if it doesn't actually finish much faster, it's still working on all 12 at once. They so far have not adjusted their converter to support HT in multicore CPUs, though.
  • -4 Hide
    Lutfij , March 22, 2010 6:42 AM
    here we go AGAIN!

    then Intel will wipe the HT off of the rest of its new processor lineups - I'm beginning to see this feature as a limited edition... back in 2004 I couldn't get my hands on one since it was darn expensive, coupled with a board that had HT support.
  • -5 Hide
    p1n3apqlexpr3ss , March 22, 2010 6:48 AM
    So using this theory of how HT isnt really that useful... the i3s are nothing better than a better architectured version of the C2D e8xxxs?
  • 11 Hide
    uh_no , March 22, 2010 6:58 AM
    p1n3apqlexpr3ss: So using this theory of how HT isnt really that useful... the i3s are nothing better than a better architectured version of the C2D e8xxxs?

    or you could ignore the microarchitechtural differences.....
  • 27 Hide
    ta152h , March 22, 2010 7:33 AM
    Quote "The basic idea behind an instruction pipeline is to structure processing into independent steps, and putting more steps into a pipeline translates into higher execution throughput, especially at high clock speeds."

    That's really quite convoluted, and not even accurate. Apparently, the author of this doesn't really understand pipelining.

    Back in the bad old days of the 386, only one instruction was worked on at a time. There were separate parts of the 386, so it's not entirely true, but basically one instruction was being worked and after it was done, the next was started. Now, let's say this instruction took three clock cycles to perform, that's pretty much all you could do during those clock cycles (again, there was some parallelization on the 386, like memory pre-fetching, but I'm simplifying to illustrate the point).

    The 486 was a scalar processor, meaning it had pipelines. Now, let's say we have four stages in our pipeline. The first instruction starts on clock cycle one, and goes into stage one. Clock cycle two sees the first instruction go to stage two, and the next instruction go to stage one. The next cycle sees them move on down. The benefit is, without mispredictions or stalls, you can push out an instruction per cycle. You're parallelizing your execution since more than one instruction is being worked on at the same time in a different stage of the pipeline. Of course, they added more parallelization with more than one pipeline, but that's a different technology called "super-scalar".

    The other remark, which is just worded badly is " and putting more steps into a pipeline translates into higher execution throughput, especially at high clock speeds."

    The extra steps mean less work is done per cycle, so each cycle can take less time, meaning higher clock speeds. It's not "especially" at high clock speeds, the high clock speeds are why super-pipelined processors can execute quickly, as they would otherwise be slower since they have greater branch mispredict penalties.

    Having said that, the Pentium 4 was poor at Hyper-Threading, vis-a-vis the Nehalem for a different reason. The trace-cache was quite small, and there was only one decoder in front of it. Even on one thread, the cache misses on this processor crippled the performance of it, as it was only running as a scalar processor far too often. Add in Hyper-Threading, and you lower the cache hit rate even worse, and you're still limited by the one decoder in front of the trace-cache, so you do have the potential of lowering performance in some situations.

    The Nehalem doesn't have a trace-cache (although there is some loop caching after the decoders which was added to save power), and has far greater decoder capabilities because of it. It is also wider, so shows greater benefits since it's architecture is better suited for it.
  • 2 Hide
    shreeharsha , March 22, 2010 9:26 AM
    What about use in gaming. Is HT useless for gaming....
  • 10 Hide
    kokin , March 22, 2010 9:44 AM
    shreeharsha: What about use in gaming. Is HT useless for gaming....

    It's like you never even bothered to read any of the HT-related articles Tom's has published... Most games don't even make use of quad cores, so expect most of them to also not make use of 8-12 logical cores. If you plan to play FSX or GTA4, then you'll probably see a small benefit as these two games rely heavily on the CPU, but having HT is no game changer by any means. This is why Tom's always recommends the Intel i5-750/AMD Phenom II 955/965 and any other CPUs below that.

    If your main purpose is gaming, stick with an AMD CPU. Otherwise, for any other type of work+gaming, go for the Intel CPUs.
  • 7 Hide
    Tomtompiper , March 22, 2010 10:16 AM
    Any chance of doing some Linux tests to see if there are any benefits to running HT?
  • 1 Hide
    NucDsgr , March 22, 2010 10:32 AM
    Next time a hyperthreading comparison article is written, make sure you include the memory requirements for the application with and without hyperthreading. Hyperthreading requires more memory to an extent that is surprising.
  • 0 Hide
    amnotanoobie , March 22, 2010 10:38 AM
    kokin: This is why Tom's always recommends the Intel i5-750/AMD Phenom II 955/965 and any other CPUs below that. If your main purpose is gaming, stick with an AMD CPU. Otherwise, for any other type of work+gaming, go for the Intel CPUs.

    Plus (if you're gaming), you're probably better off saving the cash for a better video card.

    Even a lightly OCed 920 is overkill for a lot of games today.

    NucDsgr: Next time a hyperthreading comparison article is written, make sure you include the memory requirements for the application with and without hyperthreading. Hyperthreading requires more memory to an extent that is surprising.

    Why would HT require more memory when it's just less than a real, full-fledged core?
  • -1 Hide
    nekoangel , March 22, 2010 10:52 AM
    not much of a fan of hyper threading yet on an industry side though multi cores and visualization and cloud computing are just plain awesome. so much feasibility has been unlocked due to multi cores. I can see hyper threading still waiting for both the OS and applications. Apparently programing for more than one core is supposed to be quite the challenge and to make use of multi cores with hyper threading an even greater, though I look forward to condensing server farms even more in the future.
  • 1 Hide
    Hilarion , March 22, 2010 11:12 AM
    In our data processing and image conversion realm of business with current business software, hyper-threading has proved to be a hard and heavy hit on performance. We have realized greater speed by shutting off the hyper-threading though I had to prove it to our system admin by having him run benches on our systems with and without hyper-threading. None of our data acquisition/conversion machines are currently utilizing hyper-threading for our business applications.

    Software in the business world has a long way to go yet to catch up with our hardware.

    The only place I am aware that we are using hyper-threading at all is in our virtualized servers whose performance is out of my purview.
  • 0 Hide
    JohnnyLucky , March 22, 2010 11:16 AM
    Will software developers be able to keep up with Intel?
  • 2 Hide
    neiroatopelcc , March 22, 2010 11:47 AM
    JohnnyLucky: Will software developers be able to keep up with Intel?

    In games this is only going to happen if the engine designers (id, Valve, etc.) figure out ways to put as much of their graphics engine into new threads as possible. The base graphics engine cannot be threaded if Direct3D is being used, which pretty much limits performance gains from having more threads. AI, physics, data acquisition, and interface-related things can still be moved to other threads, but if the graphics engine is bottlenecked by insufficient core speed, it won't help. *

    nekoangel: not much of a fan of hyper threading yet on an industry side though multi cores and visualization and cloud computing are just plain awesome. so much feasibility has been unlocked due to multi cores. I can see hyper threading still waiting for both the OS and applications. Apparently programing for more than one core is supposed to be quite the challenge and to make use of multi cores with hyper threading an even greater, though I look forward to condensing server farms even more in the future.

    It is quite beneficial on terminal servers and in virtualization environments. It won't make a difference on file servers and database servers with limited memory, but it does work. Also remember that any pre-2008 server only uses one core for networking, so network-intensive stuff won't benefit at all - but that won't benefit from real cores either.

    NucDsgr: Next time a hyperthreading comparison article is written, make sure you include the memory requirements for the application with and without hyperthreading. Hyperthreading requires more memory to an extent that is surprising.

    Really? I haven't noticed any difference in memory usage with it enabled or not. I usually have about 2-3GB memory that windows uses as file cache because it has no other use for it.

    shreeharsha: What about use in gaming. Is HT useless for gaming....

    I've tested several games, running the game at native resolution on a 22" display and having resmon and taskman running on a secondary monitor.
    I don't know how, but apparently Windows 7 knows which cores are real and which are not. For instance, Dragon Age uses only 4 cores - and somehow Windows makes sure it uses the real ones, i.e. you won't see core #0 and #1 being loaded at the same time unless you start something else in addition to Dragon Age.

    Most games I've seen so far don't use more than one or two cores. Some newer seem to use three or four, but I've yet to see a game that uses all eight.

    * I don't know if the problem has been solved with the newer DirectX versions; I'm guessing it hasn't.
  • 2 Hide
    soky602 , March 22, 2010 12:50 PM
    hoof_hearted: But can it play Crysis?

    Why can't people save this overused line for new GPUs? Obviously a multiple-core CPU can run it, but you're not going to see anything without a GPU.
  • 0 Hide
    eaclou , March 22, 2010 1:01 PM
    Thanks for including Cinebench results - I'd love to see this included in future CPU articles, and maybe GPU for OpenGL performance.
  • 3 Hide
    Anonymous , March 22, 2010 1:08 PM
    All these benchmarks take only one running application into account. What if you're zipping something, converting a video, and doing some photoshop while listening to music all at the same time? Then what!
  • 6 Hide
    tipoo , March 22, 2010 1:38 PM
    You guys see the leaked pricing for AMD's Phenom II X6 on Techreport? Apparently it's 200 dollars, or 300 for the BE. That's in comparison to over 1000 for this. Intel may have the performance crown, but that price jump is MASSIVE!