Penryn to have Hyperthreading

the_vorlon

The "always" accurate folks at the Inquirer say Penryn will have hyperthreading in addition to 6 MB of cache, SSE4, high-k dielectrics with metal gates, and a 45 nm process.

There have always been "rumours" that Core 2 Duo actually had HT circuitry (much like Willamette and early Northwoods did) but that Intel never turned it on.

HT was never worth a whole lot in the desktop universe, a couple of percent at best. It did, however, make differences of 25% or more in selected server-type applications.

The big problem was that Netburst could only RETIRE two micro-ops per clock, so regardless of how full the pipeline got stuffed, what came out the other end was always limited.

With Core doing 4 (or kinda 5, depending on how you want to argue and/or count things) instructions per clock, hyperthreading may actually work this time.
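
To put rough numbers on the retire-width argument, here is a toy model in Python. It is only a sketch: it assumes the back end is the sole bottleneck, and the per-thread rates and widths are hypothetical values picked for illustration, not measurements of any real CPU.

```python
def combined_throughput(per_thread_ipc, retire_width):
    """Toy model: total ops retired per clock, capped by the retire width."""
    return min(sum(per_thread_ipc), retire_width)


def smt_gain(per_thread_ipc, retire_width):
    """Relative gain of running all threads together vs. the first thread alone."""
    single = min(per_thread_ipc[0], retire_width)
    together = combined_throughput(per_thread_ipc, retire_width)
    return together / single - 1.0


if __name__ == "__main__":
    # Hypothetical: each thread could sustain 1.5 ops per clock on its own.
    threads = [1.5, 1.5]
    for width in (2, 4):
        gain = smt_gain(threads, retire_width=width)
        print(f"retire width {width}: gain from second thread = {gain:.0%}")
```

With a narrow back end the second thread quickly hits the cap; with a wider one there is room for both threads, which is the point being argued above.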

A more interesting possibility is that Intel may actually have Penryn use hyperthreading to (under some conditions) execute BOTH halves of a conditional in the code, hyperthread both branches, and then actually use the one that turns out to be correct, greatly reducing the penalty of having to flush the pipeline in the event of a branch prediction error.
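
A back-of-the-envelope way to see why this would only pay off "under some conditions": compare the expected cost of predicting and occasionally flushing with the cost of running both paths at reduced speed until the branch resolves. The sketch below is a simplified model of my own, not anything Intel has described; the penalty, resolve time, and sharing factor are all hypothetical.

```python
def predict_cost(mispredict_rate, flush_penalty):
    """Expected cycles lost per branch when predicting and flushing on a miss."""
    return mispredict_rate * flush_penalty


def dual_path_cost(resolve_cycles, share=0.5):
    """Cycles of progress lost while the correct path only gets `share` of the core."""
    return resolve_cycles * (1.0 - share)


def dual_path_wins(mispredict_rate, flush_penalty, resolve_cycles, share=0.5):
    """True if executing both paths is cheaper than predict-and-flush in this model."""
    return dual_path_cost(resolve_cycles, share) < predict_cost(mispredict_rate, flush_penalty)


if __name__ == "__main__":
    # Hypothetical: 14-cycle flush, branch resolves after 10 cycles.
    for rate in (0.05, 0.50):
        better = dual_path_wins(rate, flush_penalty=14, resolve_cycles=10)
        print(f"mispredict rate {rate:.0%}: run both paths? {better}")
```

In this model, running both halves only helps for branches the predictor gets wrong a lot of the time; for well-predicted branches the wasted work costs more than the occasional flush.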
 

HeavyF

Interesting post. However, I have never given much credence to HT tech. It just seems like a marketing-driven term rather than an actual benefit. This doesn't mean that as a "free" feature I wouldn't take it. I am just skeptical. :wink:
 

1Tanker

HeavyF said:
"Interesting post. However, I have never given much credence to HT tech. It just seems like a marketing-driven term rather than an actual benefit. This doesn't mean that as a "free" feature I wouldn't take it. I am just skeptical. :wink:"

Methinks you never used an HT-enabled CPU.
 

epsilon84

HeavyF said:
"Interesting post. However, I have never given much credence to HT tech. It just seems like a marketing-driven term rather than an actual benefit. This doesn't mean that as a "free" feature I wouldn't take it. I am just skeptical. :wink:"

In multithreaded apps it improves performance by up to 25%, but since these apps are rare on the desktop (especially a few years ago) I found HT more useful for keeping the PC 'responsive' during high CPU load times. It's also particularly useful if an application 'hangs' Windows and is using 100% of CPU resources: HT keeps the system responsive enough to actually reach Task Manager and end the hung application there.
 

HeavyF

I've been outed :eek:

OK, so I am talking about something I don't really know about... BUT even with this info taken with a grain of salt, why should HT technology be something that is compelling on the Penryn CPU? I am not bashing HT, just mentioning that it occupies ~5% of the transistor space (or so I have heard) and does not offer many benefits besides a few particular apps. AGAIN, I will take HT tech / SSE4+ :wink: (didn't Conroe already have SSE4 :wink: ) + 1333 FSB, etc. without so much as a grumble :lol:
 

epsilon84

HeavyF said:
"I've been outed :eek:

OK, so I am talking about something I don't really know about... BUT even with this info taken with a grain of salt, why should HT technology be something that is compelling on the Penryn CPU? I am not bashing HT, just mentioning that it occupies ~5% of the transistor space (or so I have heard) and does not offer many benefits besides a few particular apps. AGAIN, I will take HT tech / SSE4+ :wink: (didn't Conroe already have SSE4 :wink: ) + 1333 FSB, etc. without so much as a grumble :lol:"

I would imagine the main point would be to improve multithreaded performance. It would allow a dual-core CPU to execute four threads, although not at the same level as a true quad-core CPU.

Take Alan Wake: we have all heard the hype about how there are individual threads for physics, AI, audio etc. The physics component is very intensive and should max out one core, but apparently the AI and audio threads are far less intensive, so in such an instance we may see the benefits of HT in being able to execute more threads simultaneously.
 

darkz

An application that fully utilizes a multicore CPU has to be multithreaded anyway, and in that case the chances are good that it will scale to 4-8 threads when the CPU has HT.

I've seen ~30% performance gains from hyperthreading in a server app so it does work and is not just a "marketing" feature.
 

m25

An efficient architecture like Core2 or K8/K8L cannot benefit from HT; it was just some kind of viagra to keep P4s and some PDs competitive for a little longer. HT just made up for the incredible rate of cache misses and overall pipeline stalls the Netburst architecture had. However, I'd not be surprised if they reintroduced it, just to boast a little more.

To darkz:
The 30% you mention is for a Netburst P4 or Xeon with a Prescott core. The (somewhat) more efficient Northwood core only gained ~15% from HT, and something like a Core2 will get no more than 5%, an amount that can well be swallowed by thread synchronization in some cases.
 

Slobogob

m25 said:
"An efficient architecture like Core2 or K8/K8L cannot benefit from HT; it was just some kind of viagra to keep P4s and some PDs competitive for a little longer. HT just made up for the incredible rate of cache misses and overall pipeline stalls the Netburst architecture had. However, I'd not be surprised if they reintroduced it, just to boast a little more.

To darkz:
The 30% you mention is for a Netburst P4 or Xeon with a Prescott core. The (somewhat) more efficient Northwood core only gained ~15% from HT, and something like a Core2 will get no more than 5%, an amount that can well be swallowed by thread synchronization in some cases."

Reintroducing it just to boast makes no sense. Intel chased the magic GHz once just to boast and I'm certain they have learned from that.

Having HT with a long-pipeline CPU like the P4 wasn't a bad idea. I'm not sure how well it would work with a 14- or 15-stage CPU. I like the idea though. Since development goes to multicore anyway, why not reintroduce the virtual cores?
 

darkz

m25 said:
"An efficient architecture like Core2 or K8/K8L cannot benefit from HT; it was just some kind of viagra to keep P4s and some PDs competitive for a little longer. HT just made up for the incredible rate of cache misses and overall pipeline stalls the Netburst architecture had. However, I'd not be surprised if they reintroduced it, just to boast a little more.

To darkz:
The 30% you mention is for a Netburst P4 or Xeon with a Prescott core. The (somewhat) more efficient Northwood core only gained ~15% from HT, and something like a Core2 will get no more than 5%, an amount that can well be swallowed by thread synchronization in some cases."

I think the performance advantage from HT will be more than 5%, more like 10%-15% at least. If it was only 5% it would mean that C2D basically has 5x better branch prediction and 5x fewer cache misses than Netburst, which I find hard to believe. AFAIK the main advantage of C2D is being able to execute more instructions per cycle, which doesn't really affect HT, does it?
 

darkz

Slobogob said:
"I like the idea though. Since development goes to multicore anyway, why not reintroduce the virtual cores?"

My thoughts exactly :D Also, the apps I work with would benefit fully from HT, so I'm all for it...
 

Aragorn

The differences in branch prediction and cache misses don't have to be 5x for HT to make so much less difference on a Core 2. Since the processor itself is so much more efficient, the penalties for cache misses and mispredictions are far smaller. That means that there is less room for that virtual thread to run in the processor (it's not empty for as long).
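
A quick sketch of that arithmetic in Python, under heavy simplification: the time a core sits idle is taken as (number of stall events) x (cycles lost per event), so idle time can shrink by about 5x without either factor shrinking by 5x. All the rates and penalties below are hypothetical illustration values, not measured figures for Netburst or Core 2.

```python
def stall_fraction(events_per_kilo_instr, penalty_cycles, base_cpi=1.0):
    """Fraction of all cycles spent stalled, per 1000 instructions (toy model)."""
    busy = 1000 * base_cpi
    stalled = events_per_kilo_instr * penalty_cycles
    return stalled / (busy + stalled)


if __name__ == "__main__":
    # Hypothetical "long pipeline" core: many stall events, expensive flushes.
    long_pipe = stall_fraction(events_per_kilo_instr=10, penalty_cycles=31)
    # Hypothetical "efficient" core: half the events, much cheaper recovery.
    short_pipe = stall_fraction(events_per_kilo_instr=5, penalty_cycles=10)
    print(f"long pipeline idle:  {long_pipe:.1%}")
    print(f"short pipeline idle: {short_pipe:.1%}")
    print(f"ratio: {long_pipe / short_pipe:.1f}x")
```

Here the event rate only halves, yet the idle share of cycles (the room a second thread could fill) drops by far more, because the per-event penalty fell too.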
 

1Tanker

m25 said:
"An efficient architecture like Core2 or K8/K8L cannot benefit from HT; it was just some kind of viagra to keep P4s and some PDs competitive for a little longer. HT just made up for the incredible rate of cache misses and overall pipeline stalls the Netburst architecture had. However, I'd not be surprised if they reintroduced it, just to boast a little more.

To darkz:
The 30% you mention is for a Netburst P4 or Xeon with a Prescott core. The (somewhat) more efficient Northwood core only gained ~15% from HT, and something like a Core2 will get no more than 5%, an amount that can well be swallowed by thread synchronization in some cases."

In some benches HT actually hurt performance (ever so slightly). With my 3.0C, HT gave a 19% advantage in Cinebench (2003). I liken HT to NCQ on hard drives. It helps in certain scenarios, and doesn't in some. :? HT does give a great increase in responsiveness (undeniably), and as someone else mentioned, it does leave some horsepower, while the CPU is heavily loaded, to do minor things, and/or as said... at least get to Task Manager without having to wait for 10 minutes for the CPU to summon up a few cycles.
 

m25

They haven't enabled HT in 65 nm Core2 CPUs, and this tells a lot about its impact on the new arch. Pentium Ds have it, while Core2 dual-cores don't, not even the Extreme Editions.
 

AZHeat

Correct - no HT on Penryn.

Nehalem (phase II 45nm) will be the next CPU w/ hyper-threading enabled... and it sounds like they are going to make some "modifications" to how it works to minimize the penalty for using it (when you would lose performance) and to make the CPUs work better together to gain more performance when possible... i.e. all CPUs will share a cache, thus many things can be done smarter / more efficiently w/ this...
 

m25

1Tanker said:
"In some benches HT actually hurt performance (ever so slightly). With my 3.0C, HT gave a 19% advantage in Cinebench (2003). I liken HT to NCQ on hard drives. It helps in certain scenarios, and doesn't in some. :? HT does give a great increase in responsiveness (undeniably), and as someone else mentioned, it does leave some horsepower, while the CPU is heavily loaded, to do minor things, and/or as said... at least get to Task Manager without having to wait for 10 minutes for the CPU to summon up a few cycles."

Even the gain in multitasking is only relative to the inefficient Netburst architecture; the multitasking of an HT-enabled P4 is often not distinctly superior to that of a comparable K8 Athlon. Taking a look at the charts, the inverse may be true: a single-core Athlon 64 sometimes beats an HT-enabled P4:
http://www23.tomshardware.com/cpu.html?modelx=33&model1=444&model2=491&chart=192
 

DavidC1

1Tanker said:
"With my 3.0C, HT gave a 19% advantage in Cinebench (2003)."

I know Cinebench is a respected benchmark and lots of users/reviewers use it, but for benchmarking, it's a bad one. It's no less synthetic than the PCMark benchmarks. It may be a real-world benchmark, but when its behavior is similar to synthetic ones, it's not really useful.

Take a look at multiprocessor scaling for all processors: Pentium 4/D/M, Core Duo, Core 2 Duo, Athlon, Athlon FX, etc. They have nearly identical numbers for 2P, 4P, 8P. Since it's a rendering benchmark, the Athlon 64 should scale better since it has more bandwidth, right? No. It scales just as well as any Intel processor.

Cinebench is one of the few benchmarks that shows Core 2 Duo pretty close to the X2.
 
Hyperthreading is Intel's name for simultaneous multi-threading (SMT), which lets a processor work on two threads at once. Intel did this by having each physical CPU look like two virtual CPUs so that the OS will schedule tasks on both of them. Other manufacturers use multi-threading techniques similar to HT. Sun's UltraSPARC T1 uses a fine-grained threading technique to let each of its 4, 6, or 8 cores run four threads, so the chip looks like 16, 24, or 32 cores to the OS, respectively. IBM's POWER5 dual-core CPU has SMT enabled so the CPU looks like 4 cores to the OS. The Power Processor Element (PPE) core in the Cell CPU has SMT enabled also.
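
The core counts above are just physical cores times hardware threads per core, and that product is what the OS reports. A small Python sketch: `logical_cpus` is a hypothetical helper written for this post, while `os.cpu_count()` is the standard-library call that returns the number of logical CPUs the OS sees (or `None` if it cannot tell).

```python
import os


def logical_cpus(physical_cores, threads_per_core):
    """Number of CPUs the OS sees when every core exposes several hardware threads."""
    return physical_cores * threads_per_core


if __name__ == "__main__":
    # Figures taken from the paragraph above.
    examples = [
        ("UltraSPARC T1, 8 cores x 4 threads", 8, 4),
        ("POWER5, 2 cores x 2 threads", 2, 2),
    ]
    for name, cores, threads in examples:
        print(f"{name}: OS sees {logical_cpus(cores, threads)} CPUs")

    # Logical CPUs on the machine running this script; may be None.
    print("This machine reports:", os.cpu_count())
```

Note that `os.cpu_count()` alone cannot tell you how many of those are physical cores and how many are SMT siblings.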

The POWER5 CPU has a long-ish pipeline at 20 stages IIRC, which is similar to the P4 Northwood's 21 and much less than the P4 Prescott/Cedar Mill's 31 stages. However, the T1 supposedly has 6-stage pipelines (part of why it runs at only 1.4 GHz maximum), so apparently multi-threading isn't only useful for mitigating branch mispredictions on long-pipeline CPUs. I'd like to see Intel bring it back, even if it's just to see how well it would work.
 

samxxxii

I didn't know that HT was implemented in other CPU uArchs outside of Intel. Thanks for the info...!

The thing I was wondering about is the way Intel is going to implement HT. DailyTech reports that the implementation of HT on Penryn's derivatives is different compared to the P4; they say that the OS is not going to see any virtual cores, as far as I can recall now...!

So if we are only going to see 2 cores for Wolfdale and 4 cores for Yorkfield, then how is this version of HT supposed to work? :roll:

Anyone?

Thx

Sam
8)