IBM thinks Hyperthreading is the way forward

mr_gobbledegook

Distinguished
<A HREF="http://news.com.com/2100-1006_3-5065839.html?tag=fd_top" target="_new">Clicky</A>

Interesting... so it looks like Power5 will have dual cores and hyperthreading capability! I bet companies will have fun licensing their software on these machines!

 

slvr_phoenix

Splendid
That's cool, a simul-multi-threaded multi-cored CPU. I wonder how well it'd scale... Imagine a dualie box: 2 CPUs, 4 cores, 8 logical processors. That'd be some darn funky stuff. I don't even want to think about a real server with a massive CPU count...

Hmm. At some point you have to ask yourself whether things could go too far. I mean, you usually only have so many heavy threads running at a time...

 

cdpage

Distinguished
Seems pretty cool to me. Power consumption's a bit high, but for the increase in speed it should pay off, as long as it's being used.





 

eden

Champion
Now, multithreading is just another way of saying HyperThreading, right?

 

Guest

Guest
Or maybe HyperThreading is Intel's way of saying multithreading (at least that's the way I see it), and multithreading could mean more than two...
 

Crashman

Polypheme
Former Staff
Multithreading is what software does with two or more processors, be they logical or physical.

 

c0d1f1ed

Distinguished
Hyper-Threading *is* the way forward.

It's simple. We started with processors that did one instruction at a time, over several clock cycles. Then we got pipelining, which means the processor starts fetching and decoding the next instruction before the previous one has finished executing. Then with the 486 nearly every instruction was done in one clock cycle, so we needed something new. The Pentium had two pipelines, so when subsequent instructions were independent it executed them in parallel. The Pentium Pro went even further by allowing fully out-of-order execution to avoid dependencies. But then we got a little stuck; only MMX and SSE further improved performance, by adding explicit parallelism to the instructions, and compilers hardly use them.

So... what else can be done besides fine-tuning? We're already selecting as many independent instructions as practically possible, everything is pipelined with huge latencies and clock rates, branch prediction tries to keep the processor busy even when dependencies are not resolved yet, etc. The only way forward is to think outside the box. You don't have to execute instructions from just one program; take two, four, or even more! And with a little effort, the programmer can split his program into several largely independent threads that can be executed in parallel.
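Roughly, such a split looks like this (just a rough, untested sketch assuming POSIX threads; sum_half() and the array are made up for the example). The two halves of the array are independent, so the threads only meet at the final join:

/* Minimal sketch of splitting independent work across two threads
   (illustrative only; POSIX threads, build with: gcc split.c -lpthread). */
#include <pthread.h>
#include <stdio.h>

#define N 1000000

static double data[N];
static double partial[2];

/* Each worker sums its own half of the array; the halves are independent,
   so the two threads never need to synchronize until the final join. */
static void *sum_half(void *arg)
{
    int id = *(int *)arg;
    int begin = id * (N / 2);
    int end = begin + (N / 2);
    double s = 0.0;
    for (int i = begin; i < end; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int ids[2] = { 0, 1 };

    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, sum_half, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    printf("sum = %f\n", partial[0] + partial[1]);
    return 0;
}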

Suddenly instruction dependencies are no problem, huge latencies are no problem, and branch misprediction has no penalty, in the optimal case. Furthermore the extra transistor cost is negligible, and extra instruction units can be added that will be used effectively. So it's really a logical step forward. We might even see new architectures that save transistors for everything else and focus on Hyper-Threading to lower heat dissipation.

Too bad AMD still believes in low latency and ILP in the same program...
 

Pirox

Distinguished
AMD does not need hyperthreading... yet. If you claim it does now, you'll only be showing a lack of knowledge about processors. The P4 uses HyperThreading not just for multiprocessing alone but because it has a longer pipeline. That pipeline is idle much of the time and needs to be kept filled with data, which is also helped by higher bandwidth...

What you probably meant was that AMD needs a processor with dual cores or more registers, since it has a shorter pipeline... but until it has a longer pipeline, HyperThreading isn't needed. Remember, shorter-pipelined processors don't need it; they can get data into the pipeline much faster, and to achieve that, lower latency is what's needed.

 

c0d1f1ed

Distinguished
> If you claim it does now, you'll only be showing a lack of knowledge about processors.

I had a course on computer architecture, and more specifically processor architecture, including the Pentium 4 and Hyper-Threading. I am no fool and I don't lack knowledge about processors.

> The P4 uses HyperThreading not just for multiprocessing alone but because it has a longer pipeline.

Wrong; pipeline length does not determine the benefits of Hyper-Threading. It is true that processors with longer pipelines have extra advantages, but that doesn't mean an AMD wouldn't benefit from it. No matter how short your pipeline, you will be able to process more instructions per clock if you have Hyper-Threading and extra execution units.

> What you probably meant was that AMD needs a processor with dual cores or more registers, since it has a shorter pipeline...

No, that's not what I meant, and your statement is false. Dual core is much less cost-effective. More registers are not going to help fill your pipelines.

> Remember, shorter-pipelined processors don't need it...

The AMD pipeline is not so much shorter than Intel's that it wouldn't benefit from Hyper-Threading. And like I said, even the shortest pipeline would benefit from it if you have extra execution units.

> They can get data into the pipeline much faster, and to achieve that, lower latency is what's needed.

Bullshit. The ILP of an Athlon isn't that much higher than that of a Pentium. And programs simply don't have unlimited ILP, so no matter how short your pipelines are, how low your latency is, and how many execution units you have, you are not going to get past dependency bottlenecks. Only Hyper-Threading can keep things efficient.
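To make the dependency point concrete, here's a tiny made-up sketch (plain C, nothing to do with any particular program). The first loop is one long serial dependency chain, so extra execution units sit idle; the second splits the sum into two independent chains that can overlap, which is exactly the kind of extra independence a second thread gives you for free:

/* Toy illustration of a dependency bottleneck (made up for this post). */
#include <stdio.h>

#define N 1000000

static double x[N];

int main(void)
{
    for (int i = 0; i < N; i++)
        x[i] = 0.5;

    /* One loop-carried dependency chain: every add has to wait for the
       previous one, so extra execution units sit idle. */
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += x[i];

    /* Two independent chains: the adds feeding s0 and s1 don't depend on
       each other, so they can execute in parallel. */
    double s0 = 0.0, s1 = 0.0;
    for (int i = 0; i < N; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }

    printf("%f %f\n", s, s0 + s1);
    return 0;
}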
 

slvr_phoenix

Splendid
> And with a little effort, the programmer can split his program into several largely independent threads that can be executed in parallel.

You're obviously either not a computer programmer, or you've never worked with any serious multi-threaded code. That statement is wrong on so many levels...

1) It's a LOT of effort. Even with libraries and compilers to help, it still introduces a whole new world of bugs caused by timing problems. And even if you work out all of the timing problems on one PC, they can be completely different on another PC. So once you switch to multi-threaded code you have to test your code on numerous PCs if you want to ensure that these types of bugs don't happen. (See the sketch after this list.) On top of that, debugging multi-threaded code is a heck of a lot harder than debugging single-threaded code, period. So multi-threaded code is a huge effort compared to single-threaded code.

2) Most programs only have one primary line of logic anyway and thus can't be split into "several largely independent threads that can be executed in parallel". Games are the only real exception, and even then it's mostly just the AI that gets split into independent threads.

3) Most single-threaded code suffers not from a CPU that is incapable of fully utilizing its execution units, but from the code itself being so poorly optimized that it can't even use the execution units efficiently. Programmers who use any sort of profiler to maximize the efficiency of their code will gain far more performance on a single-CPU system with HT than by making their code multi-threaded.

4) Making code multi-threaded adds overhead. All of the extra pointers, timing logic, etc. create additional use of resources that single-threaded code doesn't have. If you run multi-threaded code next to single-threaded code on a single non-HT CPU you can even measure this overhead rather effectively.
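Here's about the smallest demonstration of the timing-bug and overhead problems I'm talking about (a rough, untested sketch, POSIX threads assumed, nothing more): two threads bump the same counter, and without the lock the final count comes out wrong on some runs and right on others; with the lock it's always right, but now every single increment pays for a lock and unlock that single-threaded code never would.

/* Tiny sketch of a multithreading timing bug and its locking overhead
   (illustrative only; POSIX threads assumed). */
#include <pthread.h>
#include <stdio.h>

#define ITERS 1000000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Unsafe: the read-modify-write races with the other thread, so the result
   differs from run to run -- exactly the kind of bug that shows up on one
   PC and not another. */
static void *bump_unsafe(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        counter++;
    return NULL;
}

/* Safe, but every iteration now pays for a lock and an unlock: overhead
   that single-threaded code never has. */
static void *bump_safe(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    counter = 0;
    pthread_create(&a, NULL, bump_unsafe, NULL);
    pthread_create(&b, NULL, bump_unsafe, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("unsafe: %ld (expected %d)\n", counter, 2 * ITERS);

    counter = 0;
    pthread_create(&a, NULL, bump_safe, NULL);
    pthread_create(&b, NULL, bump_safe, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("safe:   %ld\n", counter);
    return 0;
}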

Anyone who does good coding for a living knows these things. This is why most game developers don't even write multi-threaded code. And for those who do, the only reason is that they know some of their biggest enthusiasts (even if they're a rather small percentage of their customers, they're usually the loudest) run dualie boxes, so multi-threading the code simply gives them access to two CPUs instead of one.

Basically, concepts like HyperThreading do absolutely nothing for people who know how to write efficient code, even if the code is multi-threaded. All that it really does is slightly help poorly written multi-threaded code and greatly improve multi-tasking.

I guess you could call HyperThreading a hardware engineer's trick to try and overcome some of the bloat caused by lazy programmers.

 

eden

Champion
Actually, aside from what Codified said, a K7 or K8 would benefit even more than a P4 ever will from HT.
Why?
Because HT's job is to make sure more of the pipelines stay filled, hence raising the number of instructions going through, and out of, the CPU per clock.
Right now the 9 IPC (ops per clock) are not fully used, probably not even half of them most of the time. x86 is just like that: it doesn't translate well and isn't efficient at extracting parallelism. Therefore the 3 decoders on the Athlon are basically not that helpful. Refer to THG's main Athlon article, which introduces all the new IPC features on it. You'd expect a good 40% rise in performance, but at best it was 15%, and 40% only in the FP benchmarks.
With HT, you could almost double the IPC in there. That would be such a great thing for Athlons, because it means they could reach an almost perfect ratio of "over 6" IPC, which is the maximum a P4 can take and probably rarely ever reaches (one decoder with one trace cache). That means whatever IPC boosts the P4 got in some apps, the Athlon could drive right through. Of course, they'd probably be equal if they were both at 6 IPC. I dunno what could make one faster than the other. I guess latency would.

But I can be certain a K7 would benefit from HT far more than a P4 will. It's AMD's stubbornness not to accept it. Heck, I bet SSE2 ops could perform better too, if all the threads issued were for that purpose.

BTW, it is true the NetBurst core suffers much more from pipeline bubbles, but HT also does more to help it, since it pulls in instructions more often to fill the pipeline back up.
Yes, it takes more time to refill it, but is 10 clocks that much? A programmer might tell me more about the average number of clocks spent refilling the pipeline per second on a P4 with lots of pipeline bubbles. But I am pretty sure it's not THAT dramatic.

 

imgod2u

Distinguished
SSE/SSE2 would probably be one of the few cases in which the P4/Athlon would not benefit that much from SMT. Assuming you're just processing SSE/SSE2 instructions, both processors only have enough execution resources to complete 1 SSE/SSE2 instruction per clock. There's simply no more room for growth.
Considering SSE/SSE2 is very high-latency and high-bandwidth, I doubt there are ever many data dependencies, and since most of the applications that use it are streaming, I doubt memory latency is much of an issue either.
As far as SMT goes, a 5% increase in transistors for 5-30% more processor utilization (and more in some database-level benchmarks) is certainly a smart move. And the K7/K8 would probably benefit even more with its vast array of parallel execution units.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
 

slvr_phoenix

Splendid
I have to agree with the SSE/SSE2 bit. That wouldn't gain from HT because it's designed not to run in parallel with anything else. I mean, that's the point: cram in as much data as you can so the execution units never have to sit around wondering when their next feed will arrive. Well, something like that anyway. :)

Besides, Opteron/A64 has a less-than-stellar implementation of SSE2. I'd dare say that for the most part AMD's implementation was simply to interpret SSE2 commands as normal x86 commands. That'd be compatible, but performance would then be no better than regular x86. It'd certainly explain their SSE2 performance. But enough about SSE2 and back to HT.

I don't think AMD would gain much in IOps with HT. The P4 has some snazzy IOp execution that to my knowledge AMD hasn't even tried to duplicate yet. An Athlon might gain a little from it, but I doubt it'd be all that much. (And in fact, if AMD implemented HT and didn't increase the number of integer calcs performable per cycle, then HT might even be a detriment, by choking up the integer calculations of at least one of the two threads being run simultaneously.)

On the other hand though I think that an Athlon would gain a crap-load in floating point performance if they implemented HT. I mean unlike the P4 the Athlons actually have more than one floating point execution unit, so anything that could increase the utilization of that strong point could get nasty.

But then again, killer FP does you no good if the FlOps are waiting on conditional IOps. So HT in an Athlon without improved IOp throughput might result in uselessness even for FP.

One other concern would be AMD's die sizes. They're tiny as it is, and the chips are already running hot for processors that don't have SMT. I dare say they'd give Scotty competition as the hottest desktop processor on Earth if AMD implemented HT. (Or implemented it well, anyway. If they implement it badly, non-executing units still don't generate additional heat one way or the other.)

Just my thoughts...

 

imgod2u

Distinguished
FP and integer data are rarely interdependent. Most FP-intensive applications nowadays are heavily streaming and usually don't involve much interaction with integer data.
As for increasing the Athlon's integer performance: there is certainly room for improvement. That's the whole point of the K8's extra packing/depacking stages. By utilizing SMT, you'd have less need for such stages.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
 

eden

Champion
It's a bit contradictory. I thought Slvr said you needed to improve the integer performance first, then add SMT.
Then you say the things to improve include the extra packing stages, and that by using SMT you wouldn't need them.

Contradictory to me.

BTW, just what do you mean by "increase integer performance"? The P4 doesn't process as many integer ops as the Athlon, and we've yet to be convinced the double-pumped ALUs do anything. So I don't understand what improving integer is all about.

 


Pirox

Distinguished
I read somewhere on AMDZONE a while ago about an AMD engineer who said that if they were to use HT it would be with a dual-core processor, and that what Intel is doing with HT now is overloading the registers. I think he meant they want to be able to get a 100% gain, and not the 30% or so you get using HT...

 

slvr_phoenix

Splendid
> FP and integer data are rarely interdependent. Most FP-intensive applications nowadays are heavily streaming and usually don't involve much interaction with integer data.
That's really not very true. The 'streaming' FP is SSE2, not x87. And as already stated SSE/SSE2 gains nothing from HT because it's already designed to max out the utilization.

The x87 FP usage most certainly isn't streamed, and there is an awful lot of code that still uses this, even with SSE2 being available, simply because not all FP is just crunching large amounts of data. Conditional FP logic is still very heavily used and still x87. That's what HT would help. But without better integer throughput it might not do much to help the FP throughput.

> That's the whole point of the K8's extra packing/depacking stages. By utilizing SMT, you'd have less need for such stages.
Less of a need, but unless I'm mistaken, SMT would make things even worse unless you improved that first.

 

slvr_phoenix

Splendid
> The P4 doesn't process as many integer ops as the Athlon, and we've yet to be convinced the double-pumped ALUs do anything.
That's only true for programmers who don't optimize their code. If you actually write code well then there isn't even a comparison.

 

imgod2u

Distinguished
> That's really not very true. The 'streaming' FP is SSE2, not x87. And as already stated, SSE/SSE2 gains nothing from HT because it's already designed to max out the utilization.
>
> The x87 FP usage most certainly isn't streamed, and there is an awful lot of code that still uses this, even with SSE2 being available, simply because not all FP is just crunching large amounts of data. Conditional FP logic is still very heavily used and still x87. That's what HT would help. But without better integer throughput it might not do much to help the FP throughput.

I find this very hard to believe, as almost every x87 and SSE FP instruction has an exceedingly high latency (30+ clocks). Doing large amounts of logical operations with these FP instructions simply doesn't make sense from a performance point of view.

> Less of a need, but unless I'm mistaken, SMT would make things even worse unless you improved that first.

It depends on how well you can manage thread priority. With Prescott, Intel will introduce a few new instructions allowing the OS to control threads running on a single physical processor. If the OS gives the main thread priority, the other thread will only "fill in the gaps" and not interfere with the first. Even with the K8's implementation of advanced scheduling through pack/unpack, I doubt it has a mean throughput anywhere near the maximum 3 integer ops/clock.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
 

slvr_phoenix

Splendid
> I find this very hard to believe, as almost every x87 and SSE FP instruction has an exceedingly high latency (30+ clocks). Doing large amounts of logical operations with these FP instructions simply doesn't make sense from a performance point of view.
Have you seen any of my rants on software engineers being lazy? I think that explains a lot of it in and of itself. Heh heh.

Seriously though, the thing is that SSE2 simply doesn't give much of a gain unless you can actually feed it a good solid chunk of data to process. It's designed for streaming, not for conditional use.

On top of that, there are a lot of cases where it simply isn't worth the manpower when you consider just how many PCs out there don't support SSE2 and still have to be written for either way. It's just extra time and money (often because time is money) for developers to write SSE2 paths on top of their normal software development.

But again, the big thing is that SSE2 is for processing large amounts of data. Games, rendering software, graphics software, encoding/decoding, etc. can all benefit a lot from SSE2, so they're optimized for it. But that doesn't mean that all of their FP ops go through SSE2. For small amounts of FP calcs (such as you get with conditional logic) x87 is still heavily used, because there it just makes more sense. Look through SSE2-optimized code some time; there's usually still a lot of x87 use in the logic sections. And for software that's primarily conditional (such as many scientific apps) it often isn't even worth the time to try to find places in the code that could be optimized for SSE2.
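To show what I mean, here's a made-up little sketch (nothing real, just an illustration; assumes a compiler that ships <emmintrin.h>). The bulk array work maps naturally onto SSE2 intrinsics, while a one-off conditional calculation stays plain scalar FP:

/* Illustrative sketch only: streaming SSE2 for bulk data, plain scalar FP
   for one-off conditional math. */
#include <emmintrin.h>
#include <stdio.h>

#define N 1024  /* multiple of 2, since each __m128d holds 2 doubles */

int main(void)
{
    static double a[N], b[N], out[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0; }

    /* Streaming case: the same add applied to a long run of data,
       two doubles at a time. This is what SSE2 is built for. */
    for (int i = 0; i < N; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);
        __m128d vb = _mm_loadu_pd(&b[i]);
        _mm_storeu_pd(&out[i], _mm_add_pd(va, vb));
    }

    /* Conditional case: a single value computed only when some test passes.
       There's no long stream here, so ordinary scalar FP is the natural fit. */
    double threshold = 100.0;
    double correction = 0.0;
    if (out[N - 1] > threshold)
        correction = out[N - 1] * 0.5;

    printf("%f %f\n", out[0], correction);
    return 0;
}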

 

AMD_Man

Splendid
> 1) It's a LOT of effort. Even with libraries and compilers to help, it still introduces a whole new world of bugs caused by timing problems. And even if you work out all of the timing problems on one PC, they can be completely different on another PC. So once you switch to multi-threaded code you have to test your code on numerous PCs if you want to ensure that these types of bugs don't happen. On top of that, debugging multi-threaded code is a heck of a lot harder than debugging single-threaded code, period. So multi-threaded code is a huge effort compared to single-threaded code.

Tell me about it. Also, synchronizing threads too often defeats most of the point of multithreading anyway. The biggest advantage multithreading has is when you need an application to run instructions asynchronously. A simple example would be a file search subroutine. You'd like to search for a particular file on the hard drive, but at the same time you'd like the program to execute other instructions not directly related to the search. While this can be done both single-threaded and multi-threaded, the multithreaded approach allows the application's various threads to share CPU resources as each one's needs require.

For example, if the user is not executing any other instructions in the main thread of an application at a given point in time, why should you have the extra overhead of waiting for application messages to be processed? In a Windows application, if there is no work being done on the main window, even paint messages can be delayed. Processing application thread messages during a heavy operation just wastes time if there's no vital work being done.

A good programmer who wants to use multithreading will want to avoid as much synchronization between threads as possible. Therefore, you should write the program so that the threads don't rely directly on each other. Each thread should go its own separate way until it's done executing or until a vital event needs to be raised.
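Something like this, roughly (just a quick sketch to show the shape of it; POSIX threads, and file_search() with its pattern argument is made up): the worker goes off on its own and only touches shared state once, when it finishes, so the main thread stays free for other work.

/* Minimal sketch of a background task that barely synchronizes with the
   main thread (POSIX threads; file_search() and its argument are invented). */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int search_done = 0;    /* set exactly once by the worker */

static void *file_search(void *arg)
{
    const char *pattern = arg;
    printf("searching for %s...\n", pattern);
    /* ... walk the directory tree looking for `pattern` ... */
    sleep(1);                  /* stand-in for the real search work */

    pthread_mutex_lock(&lock); /* the only synchronization point */
    search_done = 1;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t worker;
    pthread_create(&worker, NULL, file_search, "*.txt");

    /* The main thread keeps doing unrelated work (UI, messages, ...)
       and only peeks at the completion flag now and then. */
    for (;;) {
        pthread_mutex_lock(&lock);
        int done = search_done;
        pthread_mutex_unlock(&lock);
        if (done)
            break;
        /* ... handle other events here ... */
        usleep(10000);
    }

    pthread_join(worker, NULL);
    printf("search finished\n");
    return 0;
}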

> 2) Most programs only have one primary line of logic anyway and thus can't be split into "several largely independent threads that can be executed in parallel". Games are the only real exception, and even then it's mostly just the AI that gets split into independent threads.

Agreed; in most office applications multithreading wouldn't make much sense. However, it might be possible for tasks like the automatic spell checker or grammar checker, where threads synchronize after every word or sentence respectively.

> 3) Most single-threaded code suffers not from a CPU that is incapable of fully utilizing its execution units, but from the code itself being so poorly optimized that it can't even use the execution units efficiently. Programmers who use any sort of profiler to maximize the efficiency of their code will gain far more performance on a single-CPU system with HT than by making their code multi-threaded.

Hehe, I completely agree. Programmers first need to consider simplifying their logic so that programs execute faster. They also need to use, as you said, profiling tools and performance counters to test how efficient their program is. They need to avoid higher-level language constructs whenever possible for small subroutines. Really, there are hundreds of ways to accelerate code for any type of application. However, the thing that will most improve performance is simplifying logic. Programmers need to step back and think about what they're trying to do and the quickest way to do it. No compiler and no programming language will ever figure out for you what end result you're looking for and what the quickest way of getting it is.

I recently wrote an OCR class for a hobby project. The first time I ran it, it took nearly 60 seconds to process a 600 dpi scanned page. After simplifying the logic (figuring out how to do the same thing faster), I got that time down to a blistering 2 seconds. I moved from GDI for processing the bitmap image to direct memory access. I also looked at all the loops I had. Way too many. Worst of all, I was looping through the same information over and over again, looking for different things. So I reduced the number of loops: I get everything done in a single pass, store that information in a cache, and then piece everything together. BTW, this OCR class runs in a separate thread. It only sends feedback to the calling thread (aka synchronizes) 5 or 6 times throughout the entire process. I was surprised myself at how much a little optimization can improve performance.
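To give an idea of the "one pass instead of several" change (a made-up toy, not my actual OCR code; the image and the stats are invented for the example):

/* Illustrative sketch of fusing several passes over the same data into one
   loop with cached results. All names are invented for the example. */
#include <stdio.h>

#define W 640
#define H 480

static unsigned char image[H][W];

int main(void)
{
    /* Fill with a dummy gradient so there is something to scan. */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            image[y][x] = (unsigned char)((x + y) & 0xFF);

    /* Instead of one loop to find the darkest pixel, another to count dark
       pixels, and a third to build per-row counts, gather everything in a
       single pass and keep it in a small "cache" for the later steps. */
    unsigned char darkest = 255;
    long dark_count = 0;
    long row_dark[H] = { 0 };  /* cached per-row counts */

    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            unsigned char p = image[y][x];
            if (p < darkest)
                darkest = p;
            if (p < 64) {
                dark_count++;
                row_dark[y]++;
            }
        }
    }

    printf("darkest=%u dark=%ld first_row_dark=%ld\n",
           darkest, dark_count, row_dark[0]);
    return 0;
}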
