The end of Netburst in 2006

Guest · May 12, 2005

Archived from groups: comp.sys.ibm.pc.hardware.chips,comp.sys.intel (More info?)

X-bit labs - Hardware news - Intel Confirms New CPU Architecture to
Launch in Late 2006.
http://www.xbitlabs.com/news/cpu/display/20050512111032.html

One interesting thing they mentioned is that they're going to attempt
to retain Hyperthreading even with the new less-pipelined core.
Hyperthreading is easy on a highly-pipelined core like with the Pentium
4, which has a lot of idle slots in its pipeline to fit two threads. In
a shallow pipelined architecture, with fewer idle slots, fitting a
second thread in there would probably end up making one thread or the
other, or both slower. The only way around it is to actually do proper
Simultaneous MultiThreading (SMT), and install more execution units for
each thread. The difference between SMT and Hyperthreading is like the
difference between a Concorde and a jumbo jet -- they both achieve the
same thing, but go about it in different ways. SMT is also much more
difficult to design, not only than Hyperthreading but also than
multicores.

It would be interesting to know if they're just going to try to graft
simple HT onto the new core without any additional execution units, for a
cheap marketing stunt, despite the fact that it might slow down
applications badly. Or if they're going to do true SMT and just call it
HT to keep people from being confused.

Yousuf Khan

Guest · May 26, 2005

YKhan wrote:
> X-bit labs - Hardware news - Intel Confirms New CPU Architecture to
> Launch in Late 2006.
> http://www.xbitlabs.com/news/cpu/display/20050512111032.html
>
> One interesting thing they mentioned is that they're going to attempt
> to retain Hyperthreading even with the new less-pipelined core.
> Hyperthreading is easy on a highly-pipelined core like with the Pentium
> 4, which has a lot of idle slots in its pipeline to fit two threads. In
> a shallow pipelined architecture, with fewer idle slots, fitting a
> second thread in there would probably end up making one thread or the
> other, or both slower. The only way around it is to actually do proper
> Simultaneous MultiThreading (SMT), and install more execution units for
> each thread. The difference between SMT and Hyperthreading is like the
> difference between a Concorde and a jumbo jet -- they both achieve the
> same thing, but go about it in different ways. SMT is also much more
> difficult to design, not only than Hyperthreading but also than
> multicores.
>
> It would be interesting to know if they're just going to try to graft
> simple HT onto the new core without any additional execution units, for a
> cheap marketing stunt, despite the fact that it might slow down
> applications badly. Or if they're going to do true SMT and just call it
> HT to keep people from being confused.

I think what you are calling SMT is really multicore. The whole benefit
of HT is that it uses idle execution units with the addition of minimal
complexity, and by the time you add a lot of execution units it becomes
simpler to have individual cores with shared cache. Feel free to clarify
if you're not looking for that level of added Xunits.

What you said about pipeline length is correct, but there may be ways
around it. Consider as an example some sort of system where there are
several pipelines, one per thread, and an execution unit traffic control
which offers all available execution units to one thread, gets zero or
more micro-ops started, and then offers any remaining units to another
thread. Clearly this could slow a thread at some point in the future,
but would allow better use of all Xunits, and probably more work done by
the CPU overall. No matter how you add CPU Xunits, they compete for
cache and eventually total memory bandwidth.
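
The traffic-control idea above can be sketched as a toy model. This is only an illustration under stated assumptions, not a description of any real core: the names `arbitrate` and `utilization`, the four-unit core, and the per-cycle ready counts are all hypothetical, and every execution unit is treated as interchangeable.

```python
def arbitrate(num_units, ready_counts):
    """Grant execution units for one cycle, in thread priority order.

    ready_counts[i] is how many micro-ops thread i could start this cycle.
    Returns how many micro-ops each thread actually starts.
    """
    free = num_units
    grants = []
    for ready in ready_counts:
        take = min(ready, free)  # a thread takes zero or more of the free units
        grants.append(take)
        free -= take             # whatever is left is offered to the next thread
    return grants


def utilization(num_units, cycles):
    """Fraction of unit-slots used over a list of per-cycle ready counts."""
    used = sum(sum(arbitrate(num_units, ready)) for ready in cycles)
    return used / (num_units * len(cycles))


# Hypothetical 4-unit core. Thread 0 is bursty, thread 1 always has 3 ready.
one_thread = utilization(4, [[2], [0], [4], [1]])
two_threads = utilization(4, [[2, 3], [0, 3], [4, 3], [1, 3]])
print(one_thread, two_threads)
```

In the third cycle of the two-thread run the first thread takes all four units and the second thread starts nothing, which is the "could slow a thread" case, while overall utilization is still higher than with one thread.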

I note that as the Linux HT scheduler has gotten better the CPU time has
stayed the same but the clock time has dropped for some benchmarks.

--
bill davidsen (davidsen@darkstar.prodigy.com)
SBC/Prodigy Yorktown Heights NY data center
Project Leader, USENET news
http://newsgroups.news.prodigy.com

Guest · Jun 9, 2005

> In your dreams, perhaps. But that's not how processors work. Execution
> units can be kept busy even bound to a single thread. There is no
> requirement, nor reason, to dedicate execution units to a thread. To do
> so is simply silly, when a single thread may be able to use them more
> effectively.

You totally lost me here. You said (a) Xunits can be kept busy when
bound to a single thread, then (b) there's no reason to do that, then
(c) a single thread can use them more effectively.

>>Not only does the CPU do more work, but it actually can use HT to make
>>less work (fewer context switches) needed. That shows up as less cache
>>misses as well. More work done, less work needed, better cache
>>performance. Not a waste in my book!
>
>
> You must be an Intel marketeer. Screw SMT and go SMP, if you must.

What does marketing have to do with it? HT makes programs run faster ON
than OFF. Any arguments that it can't are suspect.

--
bill davidsen (davidsen@darkstar.prodigy.com)
SBC/Prodigy Yorktown Heights NY data center
Project Leader, USENET news
http://newsgroups.news.prodigy.com

Guest · Jun 9, 2005

On Thu, 09 Jun 2005 04:52:49 GMT, Bill Davidsen
<davidsen@darkstar.prodigy.com> wrote:

>>>Not only does the CPU do more work, but it actually can use HT to make
>>>less work (fewer context switches) needed. That shows up as less cache
>>>misses as well. More work done, less work needed, better cache
>>>performance. Not a waste in my book!
>>
>>
>> You must be an Intel marketeer. Screw SMT and go SMP, if you must.
>
>What does marketing have to do with it? HT makes programs run faster ON
>than OFF. Any arguments that it can't are suspect.

It's also pretty obvious that in some, not so rare, task mixes HT can make
all tasks/threads run slower... i.e. longer time to complete than if run
consecutively. I'd hesitate to use it for any situation where I had a
compute bound task.

--
Rgds, George Macdonald

Guest · Jun 9, 2005

I'd rather think that HT will slow things down for L1-cache-bound tasks. You
effectively have half of each cache and half of the uOP cache for each thread.
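
A rough way to see the cache-sharing effect is a toy simulation. It assumes a fully associative LRU cache, which is a simplification (real L1 caches are set-associative), and the capacity and working-set sizes below are made-up numbers chosen only so that one thread fits in the cache and two do not.

```python
from collections import OrderedDict


def hit_rate(capacity, accesses):
    """Hit rate of a fully associative LRU cache over a list of line IDs."""
    cache = OrderedDict()
    hits = 0
    for line in accesses:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(accesses)


def loop_trace(thread_id, working_set, passes):
    """A thread that walks the same working set over and over."""
    return [(thread_id, i) for _ in range(passes) for i in range(working_set)]


def interleave(a, b):
    """Alternate accesses from two threads, as if they shared the cache."""
    return [x for pair in zip(a, b) for x in pair]


CAPACITY = 128     # cache lines (hypothetical)
WORKING_SET = 96   # lines touched per thread (hypothetical)
PASSES = 50

alone = hit_rate(CAPACITY, loop_trace(0, WORKING_SET, PASSES))
shared = hit_rate(CAPACITY, interleave(loop_trace(0, WORKING_SET, PASSES),
                                       loop_trace(1, WORKING_SET, PASSES)))
print(f"alone: {alone:.2f}  sharing with a second thread: {shared:.2f}")
```

With these particular numbers the lone thread misses only on its first pass, while the two interleaved threads evict each other's lines before reuse. Real workloads sit somewhere in between, so this is a worst-case illustration rather than a prediction.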

"George Macdonald" <fammacd=!SPAM^nothanks@tellurian.com> wrote in message
news:9obga1dmsvok9fl9f6b9bc2cfot6hucf8u@4ax.com...
> On Thu, 09 Jun 2005 04:52:49 GMT, Bill Davidsen
> <davidsen@darkstar.prodigy.com> wrote:
>
>>>>Not only does the CPU do more work, but it actually can use HT to make
>>>>less work (fewer context switches) needed. That shows up as less cache
>>>>misses as well. More work done, less work needed, better cache
>>>>performance. Not a waste in my book!
>>>
>>>
>>> You must be an Intel marketeer. Screw SMT and go SMP, if you must.
>>
>>What does marketing have to do with it? HT makes programs run faster ON
>>than OFF. Any arguments that it can't are suspect.
>
> It's also pretty obvious that in some, not so rare, task mixes HT can make
> all tasks/threads run slower... i.e. longer time to complete than if run
> consecutively. I'd hesitate to use it for any situation where I had a
> compute bound task.
>
> --
> Rgds, George Macdonald

keith · Jun 10, 2005

On Thu, 09 Jun 2005 04:52:49 +0000, Bill Davidsen wrote:

>
>
>> In your dreams, perhaps. But that's not how processors work. Execution
>> units can be kept busy even bound to a single thread. There is no
>> requirement, nor reason, to dedicate execution units to a thread. To do
>> so is simply silly, when a single thread may be able to use them more
>> effectively.
>
> You totally lost me here. You said (a) Xunits can be kept busy when
> bound to a single thread, then (b) there's no reason to do that, then
> (c) a single thread can use them more effectively.

You obviously don't read very well.

A) You stated that execution units were only necessary for multiple
threads. False. A single thread can use multiple execution units in a
super-scalar processor. An OoO processor has more opportunity to find
parallelism in a single thread. Multiple execution units came long before
multi-threaded processors (well, ignoring the 360/91).

B) There is *every* reason to have multiple execution units for a single
threaded processor (see: A).

C) Since there is no reason that multiple threads are necessary to keep
many execution units busy, this is *not* a reason for multi-threaded
processors. In fact multiple threads (at least as Intel does things)
isn't much of a gain at all and often a negative.

--
Keith

>
>>
>> You must be an Intel marketeer. Screw SMT and go SMP, if you must.
>
> What does marketing have to do with it? HT makes programs run faster ON
> than OFF. Any arguments that it can't are suspect.

Because Intel's HT is a marketing gimmick that you've obviously fallen
for. ...and you're spreading the FUD.

--
Keith

Guest · Jun 10, 2005

"keith" <krw@att.bizzzz> wrote in message
news:pan.2005.06.10.01.31.25.341612@att.bizzzz...

> A) You stated that execution units were only necessary for multiple
> threads. False. A single thread can use multiple execution units in a
> super-scalar processor. An OoO processor has more opportunity to find
> parallelism in a single thread. Multiple execution units came long before
> multi-threaded processors (well, ignoring the 360/91).

Correct.

> B) There is *every* reason to have multiple execution units for a single
> threaded processor (see: A).

Correct.

> C) Since there is no reason that multiple threads are necessary to keep
> many execution units busy, this is *not* a reason for multi-threaded
> processors.

Wrong. The general form of your argument is wrong and it's wrong in this
particular situation as well.

The flaw in the general form of your argument can easily be seen if you
try the argument on other things. For example, you don't need to brush your
teeth daily to have healthy teeth. You could, for example, go to a dental
hygienist daily. It does not, however, follow that having healthy teeth is
not a reason to brush daily.

It's wrong in this particular case because one of the main benefits of
multi-threaded processors is that execution units that would otherwise lie
idle can do useful work. The more parallelism you can exploit, the greater
percentage of your execution units you can keep busy. Multi-threaded
processors give the processor more parallelism to exploit.

> In fact multiple threads (at least as Intel does things) isn't much of a
> gain at all and often a negative.

Actually, in my experience it has been a *huge* benefit on machines that
only have a single physical CPU. Not as useful on machines that have
multiple CPUs already.

DS

Guest · Jun 10, 2005

On Thu, 09 Jun 2005 14:59:24 GMT, "Alexander Grigoriev"
<alegr@earthlink.net> wrote:

>I'd rather think that HT will slow things down for L1-cache-bound tasks. You
>effectively have half of each cache and half of the uOP cache for each thread.

Yup and the TLB, which is a *big* part of CPU performance is going to get
soiled.

>"George Macdonald" <fammacd=!SPAM^nothanks@tellurian.com> wrote in message
>news:9obga1dmsvok9fl9f6b9bc2cfot6hucf8u@4ax.com...
>> On Thu, 09 Jun 2005 04:52:49 GMT, Bill Davidsen
>> <davidsen@darkstar.prodigy.com> wrote:
>>
>>>>>Not only does the CPU do more work, but it actually can use HT to make
>>>>>less work (fewer context switches) needed. That shows up as less cache
>>>>>misses as well. More work done, less work needed, better cache
>>>>>performance. Not a waste in my book!
>>>>
>>>>
>>>> You must be an Intel marketeer. Screw SMT and go SMP, if you must.
>>>
>>>What does marketing have to do with it? HT makes programs run faster ON
>>>than OFF. Any arguments that it can't are suspect.
>>
>> It's also pretty obvious that in some, not so rare, task mixes HT can make
>> all tasks/threads run slower... i.e. longer time to complete than if run
>> consecutively. I'd hesitate to use it for any situation where I had a
>> compute bound task.

--
Rgds, George Macdonald

Guest · Jun 10, 2005

Tony Hill wrote:

> Now, obviously I'll take dual-core over SMT any day, but by it's very
> nature dual-core involves doubling the transistors.

Not if part of the cache hierarchy is shared between cores,
e.g. Intel's Yonah.

By the way, you often write "it's" instead of its ;-)

Guest · Jun 11, 2005

keith wrote:
> On Thu, 09 Jun 2005 04:52:49 +0000, Bill Davidsen wrote:
>
>
>>
>>>In your dreams, perhaps. But that's not how processors work. Execution
>>>units can be kept busy even bound to a single thread. There is no
>>>requirement, nor reason, to dedicate execution units to a thread. To do
>>>so is simply silly, when a single thread may be able to use them more
>>>effectively.
>>
>>You totally lost me here. You said (a) Xunits can be kept busy when
>>bound to a single thread, then (b) there's no reason to do that, then
>>(c) a single thread can use them more effectively.
>
>
> You obviously don't read very well.

Given that zero of the things you reply to below are in the text you
quoted, or in the original article, I don't think the problem is mine.
>
> A) You stated that execution units were only necessary for multiple
> threads. False. A single thread can use multiple execution units in a
> super-scalar processor. An OoO processor has more opportunity to find
> parallelism in a single thread. Multiple execution units came long before
> multi-threaded processors (well, ignoring the 360/91).
>
> B) There is *every* reason to have multiple execution units for a single
> threaded processor (see: A).
>
> C) Since there is no reason that multiple threads are necessary to keep
> many execution units busy, this is *not* a reason for multi-threaded
> processors. In fact multiple threads (at least as Intel does things)
> isn't much of a gain at all and often a negative.
>

>Because Intel's HT is a marketing gimmick that you've obviously fallen
>for. ...and you're spreading the FUD.

I ran real benchmarks, for large compiles, DNS servers, and NNTP
servers. The compiles ran in 10-30% less clock time, the max tps of the
servers went up 10-15%. That's not FUD that's FACT.
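
For comparing the two kinds of figures, a reduction in clock time can be converted to an equivalent throughput gain. This is plain arithmetic, not additional benchmark data, and the helper name is made up for the example.

```python
def speedup_from_time_saved(fraction_saved):
    """Throughput ratio implied by finishing the same work in less time."""
    return 1.0 / (1.0 - fraction_saved)


for saved in (0.10, 0.30):
    ratio = speedup_from_time_saved(saved)
    print(f"{saved:.0%} less clock time is about {ratio:.2f}x the throughput")
```

So compiles that finish in 10-30% less clock time correspond to roughly 1.11x to 1.43x throughput, which puts them on the same scale as the 10-15% tps improvement.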

--
bill davidsen
SBC/Prodigy Yorktown Heights NY data center
http://newsgroups.news.prodigy.com

Guest · Jun 11, 2005

David Schwartz wrote:
> "keith" <krw@att.bizzzz> wrote in message
> news:pan.2005.06.10.01.31.25.341612@att.bizzzz...
>
>
>>A) You stated that execution units were only necessary for multiple
>>threads. False. A single thread can use multiple execution units in a
>>super-scalar processor. An OoO processor has more opportunity to find
>>parallelism in a single thread. Multiple execution units came long before
>>multi-threaded processors (well, ignoring the 360/91).
>
>
> Correct.

I assume you mean he's correct in his technical statement, and not that
you agree I ever said any such thing...
>
>
>>B) There is *every* reason to have multiple execution units for a single
>>threaded processor (see: A).
>
>
> Correct.
>
>
>>C) Since there is no reason that multiple threads are necessary to keep
>>many execution units busy, this is *not* a reason for multi-threaded
>>processors.
>
>
> Wrong. The general form of your argument is wrong and it's wrong in this
> particular situation as well.
>
> The flaw in the general form of your argument can easily be seen if you
> try the argument on other things. For example, you don't need to brush your
> teeth daily to have healthy teeth. You could, for example, go to a dental
> hygienist daily. It does not, however, follow that having healthy teeth is
> not a reason to brush daily.
>
> It's wrong in this particular case because one of the main benefits of
> multi-threaded processors is that execution units that would otherwise lie
> idle can do useful work. The more parallelism you can exploit, the greater
> percentage of your execution units you can keep busy. Multi-threaded
> processors give the processor more parallelism to exploit.
>
>
>>In fact multiple threads (at least as Intel does things) isn't much of a
>>gain at all and often a negative.
>
>
> Actually, in my experience it has been a *huge* benefit on machines that
> only have a single physical CPU. Not as useful on machines that have
> multiple CPUs already.

Thank you, I'm not sure I've seen *huge* gains, but 10-30% for free is a
nice bonus. I've never seen a negative on real work, although there was
a benchmark showing that. Gains appear larger on threaded applications
than general use, probably because of more shared code and data in cache.

The real gain I see is when multiple threads exchange data via shared
memory. With one CPU there are constant context switches between the
producer and consumer threads. With SMT the number of CTX goes down,
which means that the CPU not only does more work in unit time, but that
the work to be done is reduced. User CPU percentage goes up, CTX rate
goes down, system time goes down. Win-win-win!
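
One way to watch the context-switch count for this kind of workload is sketched below. It is a minimal illustration, not the benchmark described above: a producer and a consumer thread hand items over a bounded in-memory queue, and the process-wide switch count is read from `resource.getrusage`, which is only available on Unix-like systems. Because of Python's global interpreter lock the two threads do not truly run in parallel, so the absolute numbers say little about HT itself; the point is only how to observe the counter.

```python
import queue
import threading

try:
    import resource  # Unix only
except ImportError:
    resource = None


def context_switches():
    """Voluntary plus involuntary context switches for this process so far."""
    if resource is None:
        return None
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw + usage.ru_nivcsw


def run_pipeline(num_items, buffer_size=1):
    """Pass num_items from a producer thread to a consumer thread."""
    buf = queue.Queue(maxsize=buffer_size)
    received = []

    def producer():
        for i in range(num_items):
            buf.put(i)
        buf.put(None)  # sentinel: no more items

    def consumer():
        while True:
            item = buf.get()
            if item is None:
                break
            received.append(item)

    before = context_switches()
    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    after = context_switches()
    switches = None if before is None else after - before
    return received, switches


items, switches = run_pipeline(2000)
print(len(items), "items handed off;", switches, "context switches")
```

Running the same pipeline with HT on and off (or with a larger `buffer_size`) and comparing the reported count is the kind of measurement being described.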

--
bill davidsen
SBC/Prodigy Yorktown Heights NY data center
http://newsgroups.news.prodigy.com

keith · Jun 11, 2005

On Thu, 09 Jun 2005 20:19:00 -0700, David Schwartz wrote:

>
> "keith" <krw@att.bizzzz> wrote in message
> news:pan.2005.06.10.01.31.25.341612@att.bizzzz...
>
>> A) You stated that execution units were only necessary for multiple
>> threads. False. A single thread can use multiple execution units in a
>> super-scalar processor. An OoO processor has more opportunity to find
>> parallelism in a single thread. Multiple execution units came long before
>> multi-threaded processors (well, ignoring the 360/91).
>
> Correct.
>
>> B) There is *every* reason to have multiple execution units for a single
>> threaded processor (see: A).
>
> Correct.
>
>> C) Since there is no reason that multiple threads are necessary to keep
>> many execution units busy, this is *not* a reason for multi-threaded
>> processors.
>
> Wrong. The general form of your argument is wrong and it's wrong in this
> particular situation as well.

Nope. I'm now rather more confident that you haven't a clue.

<how you wash your hog deleted>

> It's wrong in this particular case because one of the main benefits of
> multi-threaded processors is that execution units that would otherwise lie
> idle can do useful work. The more parallelism you can exploit, the greater
> percentage of your execution units you can keep busy. Multi-threaded
> processors give the processor more parallelism to exploit.

That is *not* the point. Modern processors are more limited in dispatch
and completion slots than they are in execution units (e.g. the developers
don't know how many FP instructions you're going to run together). As long
as a single thread can dispatch the processor will be full. Another thread
is *ONLY* useful if the pipe stalls. Even then, it's only useful to
restart another thread if your caches aren't trashed. Another thread can
muck up the works in any number of ways other than the caches.

>> In fact multiple threads (at least as Intel does things) isn't much of
>> a gain at all and often a negative.
>
> Actually, in my experience it has been a *huge* benefit on machines that
> only have a single physical CPU. Not as useful on machines that have
> multiple CPUs already.

Your workload is quite unique then. No one else, other than Intel's
marketing department, has found such a workload.

--
Keith

Guest · Jun 11, 2005

"keith" <krw@att.bizzzz> wrote in message
news:pan.2005.06.11.02.21.23.761332@att.bizzzz...

> On Thu, 09 Jun 2005 20:19:00 -0700, David Schwartz wrote:

>> Wrong. The general form of your argument is wrong and it's wrong in
>> this particular situation as well.

> Nope. I'm now rather more confident that you haven't a clue.

Always good to throw in a few insults while someone's trying to reason
with you. Your mother dresses you funny.

>> It's wrong in this particular case because one of the main benefits
>> of multi-threaded processors is that execution units that would
>> otherwise lie idle can do useful work. The more parallelism you can
>> exploit, the greater percentage of your execution units you can keep
>> busy. Multi-threaded processors give the processor more parallelism to
>> exploit.

> That is *not* the point. Modern processors are more limited in dispatch
> and completion slots than they are in execution units (e.g. the developers
> don't know how many FP instructions you're going to run together). As long
> as a single thread can dispatch the processor will be full. Another thread
> is *ONLY* useful if the pipe stalls. Even then, it's only useful to
> restart another thread if your caches aren't trashed. Another thread can
> muck up the works in any number of ways other than the caches.

This doesn't sound like anything even remotely resembling a reasonable
argument. It is a fact that a single thread is just not going to keep all
the execution units busy. Another thread could use those execution units.

>>> In fact multiple threads (at least as Intel does things) isn't
>>> much of a gain at all and often a negative.

>> Actually, in my experience it has been a *huge* benefit on machines
>> that only have a single physical CPU. Not as useful on machines that
>> have multiple CPUs already.

> Your workload is quite unique then. No one else, other than Intel's
> marketing department, has found such a workload.

Here's a trivial example -- one program goes into a 100% CPU spin. With
HT, the system stays responsive (because the program can, at most, grab half
the CPU resources). Without it, it doesn't. Now you think a program that has
to do a lot of work while I'd prefer the system remain responsive is
unique?!

DS

keith · Jun 11, 2005

On Fri, 10 Jun 2005 19:39:43 -0700, David Schwartz wrote:

>
> "keith" <krw@att.bizzzz> wrote in message
> news:pan.2005.06.11.02.21.23.761332@att.bizzzz...
>
>> On Thu, 09 Jun 2005 20:19:00 -0700, David Schwartz wrote:
>
>>> Wrong. The general form of your argument is wrong and it's wrong in
>>> this particular situation as well.
>
>> Nope. I'm now rather more confident that you haven't a clue.
>
> Always good to throw in a few insults while someone's trying to reason
> with you.

Hmm, I didn't see much "reason".

> Your mother dresses you funny.

s/mother/wife

I went on my way 34 years ago.

>>> It's wrong in this particular case because one of the main benefits
>>> of multi-threaded processors is that execution units that would
>>> otherwise lie idle can do useful work. The more parallelism you can
>>> exploit, the greater percentage of your execution units you can keep
>>> busy. Multi-threaded processors give the processor more parallelism to
>>> exploit.
>
>> That is *not* the point. Modern processors are more limited in dispatch
>> and completion slots than they are in execution units (e.g. the developers
>> don't know how many FP instructions you're going to run together). As long
>> as a single thread can dispatch the processor will be full. Another thread
>> is *ONLY* useful if the pipe stalls. Even then, it's only useful to
>> restart another thread if your caches aren't trashed. Another thread can
>> muck up the works in any number of ways other than the caches.
>
> This doesn't sound like anything even remotely resembling a reasonable
> argument. It is a fact that a single thread is just not going to keep all
> the execution units busy. Another thread could use those execution units.

All the execution units won't be busy because there aren't enough
issue/completion slots to fill all units. Another thread doesn't increase
the number of I/C slots. A single thread can easily fill the slots
available.

The argument for a second thread isn't execution units, rather OoO,
speculative execution, long pipes, and slow memory, thus expensive
flushes. Adding a thread adds more speculative execution and resource
thrashing for *perhaps* a chance of utilizing the pipeline when one thread
flushes. If it's done right it even works. Apparently Intel has an
"issue" with their implementation. It's not a clear win like you folks
believe it to be.
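
The issue-slot argument can be put into a toy one-cycle model. Everything here is an assumption made for illustration: the function name, the counts of 7 execution units and 3 issue slots, and the idea that a thread simply has some number of micro-ops ready. It does not describe any particular processor.

```python
def units_busy_fraction(num_units, issue_slots, ready_per_thread):
    """Fraction of execution units started in one cycle of a toy core.

    Work can only reach the units through the issue slots, so the number of
    micro-ops started is capped by the slots no matter how many threads
    have work ready.
    """
    wanted = sum(ready_per_thread)
    issued = min(wanted, issue_slots, num_units)
    return issued / num_units


NUM_UNITS = 7     # hypothetical
ISSUE_SLOTS = 3   # hypothetical

one_thread = units_busy_fraction(NUM_UNITS, ISSUE_SLOTS, [3])
two_threads = units_busy_fraction(NUM_UNITS, ISSUE_SLOTS, [3, 3])
stalled_alone = units_busy_fraction(NUM_UNITS, ISSUE_SLOTS, [0])
stalled_with_partner = units_busy_fraction(NUM_UNITS, ISSUE_SLOTS, [0, 3])
print(one_thread, two_threads, stalled_alone, stalled_with_partner)
```

In this model a second thread changes nothing while the first can fill the slots, and only helps in the cycle where the first thread has nothing to issue. The model ignores the cache and resource thrashing that a second thread can add.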

>>>> In fact multiple threads (at least as Intel does things) isn't much
>>>> of a gain at all and often a negative.
>
>>> Actually, in my experience it has been a *huge* benefit on
>>> machines that only have a single physical CPU. Not as useful on
>>> machines that have multiple CPUs already.
>
>> Your workload is quite unique then. No one else, other than Intel's
>> marketing department, has found such a workload.
>
> Here's a trivial example -- one program goes into a 100% CPU spin. With
> HT, the system stays responsive (because the program can, at most, grab
> half the CPU resources). Without it, it doesn't. Now you think a program
> that has to do a lot of work while I'd prefer the system remain
> responsive is unique?!

Like all trivial examples and hand-waving...

This is perhaps a good argument for SMP, but SMT will likely still
choke because the thread that's "spinning" isn't likely flushing the pipe,
since the pre-fetching/branch prediction is trying its best to keep the
pipe full. Of course for any implementation it's possible to come up with
a degenerate case. As noted elsewhere in this thread a "spinning thread"
can trash the L1, perhaps even L2, causing SMT to make things even worse.
Indeed, this is shown in several benchmarks.

Sometimes (Intel's implementation of) SMT is a win, sometimes a loss.
It's not at all clear whether it's worth it, but in any case it has
*nothing* to do with filling execution units (the OP's argument).
Multiple issue/completion slots will fill execution units from a single
thread.

--
Keith

Guest · Jun 11, 2005

On Fri, 10 Jun 2005 15:55:50 +0200, Grumble <devnull@kma.eu.org>
wrote:

>Tony Hill wrote:
>
>> Now, obviously I'll take dual-core over SMT any day, but by it's very
>> nature dual-core involves doubling the transistors.
>
>Not if part of the cache hierarchy is shared between cores,
>e.g. Intel's Yonah.

Perhaps I should have specified that you're doubling the transistors
in the core at the very least and possibly doubling cache transistors
as well.

>By the way, you often write "it's" instead of its ;-)

Yeah, I do it mainly to piss off Keith who has commented on this more
than once! :> (actually I'm just lazy and never did learn me that
grammar stuff none too good!)

-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca

Guest · Jun 11, 2005

Robert Redelmeier wrote:
>
> AFAIK, the only case where SMT is a win is when a thread
> stalls like waiting for uncached data, IO or frequent branch
> misprediction. Otherwise it is a loss because of lower cache
> hits (caches split). Some apps, like relational databases
> are pointer chasing exercises and need a lot of uncached data.
> I think compilers suffer a lot of misprediction.
>

It's amazing that this topic just won't die. The only thing that's new
about, say, Pentium-M and HT is that Pentium-M's shorter pipeline makes
hyperthreading less valuable than on NetBurst, where it is already
marginal in the sense that it seems to cost just about the same
increase in power and transistors as you get in increased performance.
If you had no other way of jamming more throughput onto the die and you
could swallow the hit in power, HT is almost always a clear win. If
you include the cost in power or transistors, it's almost always a
wash.

The biggest win I remember seeing for HT (about 35%, IIRC) was on a
chess playing program, where I assume the win came from stalled
pointer-chasing threads. Server applications, which also typically
spend significant (~50%) time stalled for memory, should benefit
significantly, as well.
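
A back-of-the-envelope model shows why stall time matters so much. It assumes each thread is stalled a fixed fraction of the time, that stalls in different threads are independent, that a single unstalled thread can use the whole core, and that the second thread adds no extra cache misses. Those assumptions are generous, so the result is closer to an upper bound than a prediction.

```python
def core_busy_fraction(stall_fraction, threads):
    """Fraction of time at least one of the threads is able to run."""
    return 1.0 - stall_fraction ** threads


STALL = 0.5  # roughly the ~50% memory-stall figure mentioned above

one = core_busy_fraction(STALL, 1)
two = core_busy_fraction(STALL, 2)
print(f"one thread: {one:.2f}, two threads: {two:.2f}, ratio: {two / one:.2f}x")
```

Under these idealized assumptions a second thread could raise throughput by up to half at a 50% stall rate. The measured gains quoted in this thread are smaller, consistent with cache trashing and shared resources eating part of the benefit.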

HT may, in practice, do little more than to reduce the hit that Intel
takes in latency from having the memory controller off the die, but it
does do that (up to whatever effect cache-trashing has in the other
direction).

HT does give Intel marketeers a feature that AMD doesn't have to talk
about. The fact that AMD doesn't have nearly the need for SMT because
memory latency is lower and the pipeline is shorter isn't something
you'd really expect Intel to emphasize in its advertising. The way HT
is used in Intel advertising is just market babble. As a part of the
design philosophy of Intel microprocessors, HT actually does make
sense.

RM

Guest · Jun 12, 2005

In comp.sys.ibm.pc.hardware.chips keith <krw@att.bizzzz> wrote:
> All the execution units won't be busy because there
> aren't enough issue/completion slots to fill all units.
> Another thread doesn't increase the number of I/C slots.
> A single thread can easily fill the slots available.

Very true, especially on a CPU like the iP7 (Pentium4)
that has lots of execution units, but very few issue ports.

AFAIK, the only case where SMT is a win is when a thread
stalls like waiting for uncached data, IO or frequent branch
misprediction. Otherwise it is a loss because of lower cache
hits (caches split). Some apps, like relational databases
are pointer chasing exercises and need a lot of uncached data.
I think compilers suffer a lot of misprediction.

-- Robert

keith · Jun 12, 2005

On Sat, 11 Jun 2005 18:52:17 +0000, Robert Redelmeier wrote:

> In comp.sys.ibm.pc.hardware.chips keith <krw@att.bizzzz> wrote:
>> All the execution units won't be busy because there
>> aren't enough issue/completion slots to fill all units.
>> Another thread doesn't increase the number of I/C slots.
>> A single thread can easily fill the slots available.
>
> Very true, especially on a CPU like the iP7 (Pentium4)
> that has lots of execution units, but very few issue ports.

IIRC, the P4 can only issue from the same thread, which reduces the
benefit of INTC's version of SMT.

> AFAIK, the only case where SMT is a win is when a thread stalls like
> waiting for uncached data, IO or frequent branch misprediction.

I thought I said that. ;-) It has *nothing* to do, as the OP proposed,
with execution units.

> Otherwise it is a loss because of lower cache hits (caches split). Some
> apps, like relational databases are pointer chasing exercises and need a
> lot of uncached data. I think compilers suffer a lot of misprediction.

...particularly with the minuscule P4 I-Cache.

--
Keith

Guest · Jun 12, 2005

"Bill Davidsen" <davidsen@deathstar.prodigy.com> wrote in message
news:Mplqe.7544$_A5.24@newssvr19.news.prodigy.com...

> David Schwartz wrote:

>> "keith" <krw@att.bizzzz> wrote in message
>> news:pan.2005.06.10.01.31.25.341612@att.bizzzz...

>>>A) You stated that execution units were only necessary for multiple
>>>threads. False. A single thread can use multiple execution units in a
>>>super-scalar processor. An OoO processor has more opportunity to find
>>>parallelism in a single thread. Multiple execution units came long
>>>before multi-threaded processors (well, ignoring the 360/91).

>> Correct.

> I assume you mean he's correct in his technical statement, and not that
> you agree I ever said any such thing...

Yes.

> Thank you, I'm not sure I've seen *huge* gains, but 10-30% for free is a
> nice bonus. I've never seen a negative on real work, although there was a
> benchmark showing that. Gains appear larger on threaded applications than
> general use, probably because of more shared code and data in cache.

The huge gains are not the kind benchmarks measure; they show up as usability
and interactive responsiveness. Benchmark gains do tend to be modest.

I think a lot of the usability gains are due to design problems with the
hardware and software. On a single CPU system without HT, for example, an
interrupt that takes too long to service makes the system non-responsive and
frustrates the user. On a system with either multiple CPUs or HT, the system
remains responsive.

DS

mygarbage2000 · Jun 13, 2005

Archived from groups: comp.sys.ibm.pc.hardware.chips,comp.sys.intel (More info?)

On Sat, 28 May 2005 01:01:32 -0400, Tony Hill
<hilla_nospam_20@yahoo.ca> wrote:

>I think you're kind of hitting the nail on the head with the second
>option. My understanding is that SMT added only a very small number
>of transistors to the core (the numbers I've heard floated around are
>5-10%, though I have no firm quote and I'm not sure if that's for
>Northwood or Prescott). With IBM's Power5, where the performance
>boost from SMT is much larger, I understand that they were looking at
>a 25% increase in the transistor count.
>
>That actually brings up a rather interesting point though. At some
>point SMT may become counter-productive vs. multi-core. In the case
>of the Power5, if you need to increase your transistor count by 25%
>per core for SMT, you only need 4 cores before you've got enough
>extra transistors for another full-fledged core. That of course leads
>to the question, are you better off with 4 cores with SMT or 5 cores
>without? My money is on 5 cores without.
>
....snip...
>
>-------------
>Tony Hill
>hilla <underscore> 20 <at> yahoo <dot> ca
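
The break-even arithmetic in the quote above can be sketched as follows. The function name is hypothetical, and the only inputs are the overhead figures Tony quotes (25% for Power5, the rumoured 5-10% for the P4). It counts transistors only and says nothing about which option is actually faster.

```python
import math
from fractions import Fraction


def break_even_cores(smt_overhead):
    """Smallest number of SMT cores whose combined SMT overhead
    adds up to at least one plain (non-SMT) core."""
    overhead = Fraction(smt_overhead)
    if overhead <= 0:
        raise ValueError("smt_overhead must be positive")
    # n cores spend n * overhead extra core-equivalents on SMT.
    return math.ceil(1 / overhead)


if __name__ == "__main__":
    for label, overhead in [("Power5, 25%", Fraction(1, 4)),
                            ("P4 rumour, 10%", Fraction(1, 10)),
                            ("P4 rumour, 5%", Fraction(1, 20))]:
        print(f"{label}: break-even at {break_even_cores(overhead)} cores")
```

Exact fractions are used so that 4 x 25% comes out as exactly one core rather than a rounded float.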

Now a question: why we didn't see the dual cores many years
earlier? The main (OK, in overly simplistic view) difference between
486 and P5(AKA Pentium) was the second integer pipeline. While I
don't have the transistor count per pipeline in P5 and proportion of
it to the total count, I may suppose (just hypothetically) that making
a dual core of 486-style single pipeline (minus branch prediction
logic, plus extra FPU for second core) would not be much more
complicated than a single core P5, using the same logic as above.
Besides, 486 was even easier to crank up the clock - when Pentium was
around 100 (or 120 - too ancient history to remember exactly) AMD had
its 586 (486 with bigger L1 cache) at 133, easily overclockable to 160
- did it myself. So why nobody back then jumped to make dual cores?
My answer - no software to take advantage of it, at least on
consumer level. Win95 had nothing in it to take advantage of SMP.
Ditto Quake 1 ;-). And in corporate world, it was more than a year
before the introduction of NT4 that might (or might not) have been
benefitted.
Any other answers?
Just another 'what if' speculation with no practical meaning...

keith · Jun 13, 2005

Archived from groups: comp.sys.ibm.pc.hardware.chips,comp.sys.intel (More info?)

On Mon, 13 Jun 2005 02:09:16 +0000, nobody@nowhere.net wrote:

> On Sat, 28 May 2005 01:01:32 -0400, Tony Hill
> <hilla_nospam_20@yahoo.ca> wrote:
>
>>I think you're kind of hitting the nail on the head with the second
>>option. My understanding is that SMT added only a very small number
>>of transistors to the core (the numbers I've heard floated around are
>>5-10%, though I have no firm quote and I'm not sure if that's for
>>Northwood or Prescott). With IBM's Power5, where the performance
>>boost from SMT is much larger, I understand that they were looking at
>>a 25% increase in the transistor count.
>>
>>That actually brings up a rather interesting point though. At some
>>point SMT may become counter-productive vs. multi-core. In the case
>>of the Power5, if you need to increase your transistor count by 25%
>>per core for SMT, you only need 4 cores before you've got enough
>>extra transistors for another full-fledged core. That of course leads
>>to the question, are you better off with 4 cores with SMT or 5 cores
>>without? My money is on 5 cores without.
>>
> ...snip...
>>
>>-------------
>>Tony Hill
>>hilla <underscore> 20 <at> yahoo <dot> ca
>
> Now a question: why we didn't see the dual cores many years
> earlier?

There is a simple answer to that. There were other, better, things to
do with transistors.

> The main (OK, in overly simplistic view) difference between
> 486 and P5(AKA Pentium) was the second integer pipeline.

I think there was a tad more than that, but...

> While I
> don't have the transistor count per pipeline in P5 and proportion of
> it to the total count, I may suppose (just hypothetically) that making
> a dual core of 486-style single pipeline (minus branch prediction
> logic, plus extra FPU for second core) would not be much more
> complicated than a single core P5, using the same logic as above.
> Besides, 486 was even easier to crank up the clock - when Pentium was
> around 100 (or 120 - too ancient history to remember exactly) AMD had
> its 586 (486 with bigger L1 cache) at 133, easily overclockable to 160
> - did it myself. So why nobody back then jumped to make dual cores?
> My answer - no software to take advantage of it, at least on
> consumer level. Win95 had nothing in it to take advantage of SMP.
> Ditto Quake 1 ;-). And in corporate world, it was more than a year
> before the introduction of NT4 that might (or might not) have been
> benefitted.
> Any other answers?

Yes! Caches were a better use of transistors.

> Just another 'what if' speculation with no practical meaning...

What if the Earth were flat sorta thing? ;-)

--
Keith

Guest · Jun 13, 2005

Archived from groups: comp.sys.ibm.pc.hardware.chips,comp.sys.intel (More info?)

nobody@nowhere.net wrote:
> On Sat, 28 May 2005 01:01:32 -0400, Tony Hill
> <hilla_nospam_20@yahoo.ca> wrote:
>
>
>>I think you're kind of hitting the nail on the head with the second
>>option. My understanding is that SMT added only a very small number
>>of transistors to the core (the numbers I've heard floated around are
>>5-10%, though I have no firm quote and I'm not sure if that's for
>>Northwood or Prescott). With IBM's Power5, where the performance
>>boost from SMT is much larger, I understand that they were looking at
>>a 25% increase in the transistor count.
>>
>>That actually brings up a rather interesting point though. At some
>>point SMT may become counter-productive vs. multi-core. In the case
>>of the Power5, if you need to increase your transistor count by 25%
>>per core for SMT, you only need 4 cores before you've got enough
>>extra transistors for another full-fledged core. That of course leads
>>to the question, are you better off with 4 cores with SMT or 5 cores
>>without? My money is on 5 cores without.
>>
>
> ...snip...
>
>>-------------
>>Tony Hill
>>hilla <underscore> 20 <at> yahoo <dot> ca
>
>
> Now a question: why we didn't see the dual cores many years
> earlier? The main (OK, in overly simplistic view) difference between
> 486 and P5(AKA Pentium) was the second integer pipeline. While I
> don't have the transistor count per pipeline in P5 and proportion of
> it to the total count, I may suppose (just hypothetically) that making
> a dual core of 486-style single pipeline (minus branch prediction
> logic, plus extra FPU for second core) would not be much more
> complicated than a single core P5, using the same logic as above.
> Besides, 486 was even easier to crank up the clock - when Pentium was
> around 100 (or 120 - too ancient history to remember exactly) AMD had
> its 586 (486 with bigger L1 cache) at 133, easily overclockable to 160
> - did it myself. So why nobody back then jumped to make dual cores?
> My answer - no software to take advantage of it, at least on
> consumer level. Win95 had nothing in it to take advantage of SMP.
> Ditto Quake 1 ;-). And in corporate world, it was more than a year
> before the introduction of NT4 that might (or might not) have been
> benefitted.
> Any other answers?
> Just another 'what if' speculation with no practical meaning...
>

A dual core 486 would have performed no better than a 486 on any single
program. The paradigm of use in that era was a single process on a
single processor. Dual core would require two programs to be running.
Adding a second core would add no value to the customers. On the other
hand, adding additional decode and execution pipes makes the single
program go faster, something every customer was screaming at intel for.
The choice was obvious, and correct at the time.

Alex

Guest · Jun 14, 2005

Archived from groups: comp.sys.ibm.pc.hardware.chips,comp.sys.intel (More info?)

On Mon, 13 Jun 2005 08:59:49 -0400, Alex Johnson <compuwiz@psualum.com>
wrote:

>nobody@nowhere.net wrote:

>> Now a question: why we didn't see the dual cores many years
>> earlier? The main (OK, in overly simplistic view) difference between
>> 486 and P5(AKA Pentium) was the second integer pipeline. While I
>> don't have the transistor count per pipeline in P5 and proportion of
>> it to the total count, I may suppose (just hypothetically) that making
>> a dual core of 486-style single pipeline (minus branch prediction
>> logic, plus extra FPU for second core) would not be much more
>> complicated than a single core P5, using the same logic as above.
>> Besides, 486 was even easier to crank up the clock - when Pentium was
>> around 100 (or 120 - too ancient history to remember exactly) AMD had
>> its 586 (486 with bigger L1 cache) at 133, easily overclockable to 160
>> - did it myself. So why nobody back then jumped to make dual cores?
>> My answer - no software to take advantage of it, at least on
>> consumer level. Win95 had nothing in it to take advantage of SMP.
>> Ditto Quake 1 ;-). And in corporate world, it was more than a year
>> before the introduction of NT4 that might (or might not) have been
>> benefitted.
>> Any other answers?
>> Just another 'what if' speculation with no practical meaning...
>>
>
>A dual core 486 would have performed no better than a 486 on any single
>program. The paradigm of use in that era was a single process on a
>single processor. Dual core would require two programs to be running.
>Adding a second core would add no value to the customers. On the other
>hand, adding additional decode and execution pipes makes the single
>program go faster, something every customer was screaming at intel for.
> The choice was obvious, and correct at the time.

There *were* multi-tasking environments available at the time. It's even
remotely possible that if dual cores had been available at a reasonable
price, DesqView, which did real pre-emptive multi-tasking for 386 Protected
Mode progs, would not have disappeared into oblivion. Then again.....

--
Rgds, George Macdonald

keith · Jun 14, 2005

Archived from groups: comp.sys.ibm.pc.hardware.chips,comp.sys.intel (More info?)

On Mon, 13 Jun 2005 20:13:57 -0400, George Macdonald wrote:

> On Mon, 13 Jun 2005 08:59:49 -0400, Alex Johnson <compuwiz@psualum.com>
> wrote:
>
>>nobody@nowhere.net wrote:
>
>>> Now a question: why we didn't see the dual cores many years
>>> earlier? The main (OK, in overly simplistic view) difference between
>>> 486 and P5(AKA Pentium) was the second integer pipeline. While I
>>> don't have the transistor count per pipeline in P5 and proportion of
>>> it to the total count, I may suppose (just hypothetically) that making
>>> a dual core of 486-style single pipeline (minus branch prediction
>>> logic, plus extra FPU for second core) would not be much more
>>> complicated than a single core P5, using the same logic as above.
>>> Besides, 486 was even easier to crank up the clock - when Pentium was
>>> around 100 (or 120 - too ancient history to remember exactly) AMD had
>>> its 586 (486 with bigger L1 cache) at 133, easily overclockable to 160
>>> - did it myself. So why nobody back then jumped to make dual cores?
>>> My answer - no software to take advantage of it, at least on
>>> consumer level. Win95 had nothing in it to take advantage of SMP.
>>> Ditto Quake 1 ;-). And in corporate world, it was more than a year
>>> before the introduction of NT4 that might (or might not) have been
>>> benefitted.
>>> Any other answers?
>>> Just another 'what if' speculation with no practical meaning...
>>>
>>
>>A dual core 486 would have performed no better than a 486 on any single
>>program. The paradigm of use in that era was a single process on a
>>single processor. Dual core would require two programs to be running.
>>Adding a second core would add no value to the customers. On the other
>>hand, adding additional decode and execution pipes makes the single
>>program go faster, something every customer was screaming at intel for.
>> The choice was obvious, and correct at the time.
>
> There *were* multi-tasking environments available at the time. It's even
> remotely possible that if dual cores had been available at a reasonable
> price, DesqView, which did real pre-emptive multi-tasking for 386 Protected
> Mode progs, would not have disappeared into oblivion. Then again.....

Not to mention OS/2. ...but the "official" view from Mt. Redmond was
that no one needed to multi-task. ...which is obvious because Win
*couldn't*.

No, the real reason was that caches, OoO, and speculation, were a better
use of the real estate until quite recently.

--
Keith

Guest · Jun 15, 2005

Archived from groups: comp.sys.ibm.pc.hardware.chips,comp.sys.intel (More info?)

On Mon, 13 Jun 2005 02:09:16 GMT, "nobody@nowhere.net"
<mygarbage2000@hotmail.com> wrote:

>On Sat, 28 May 2005 01:01:32 -0400, Tony Hill
><hilla_nospam_20@yahoo.ca> wrote:
>
>>That actually brings up a rather interesting point though. At some
>>point SMT may become counter-productive vs. multi-core. In the case
>>of the Power5, if you need to increase your transistor count by 25%
>>per core for SMT, you only need 4 cores before you've got enough
>>extra transistors for another full-fledged core. That of course leads
>>to the question, are you better off with 4 cores with SMT or 5 cores
>>without? My money is on 5 cores without.
>
> Now a question: why we didn't see the dual cores many years
>earlier? The main (OK, in overly simplistic view) difference between
>486 and P5(AKA Pentium) was the second integer pipeline.

I'd say that is a grossly over-simplistic view which ignores the
improvements in cache, memory bus, FPU, branch prediction, pipelining,
etc. The two chips were really quite different, to the extent that
the Pentium was easily twice as fast, clock for clock, as a 486.

> While I
>don't have the transistor count per pipeline in P5 and proportion of
>it to the total count, I may suppose (just hypothetically) that making
>a dual core of 486-style single pipeline (minus branch prediction
>logic, plus extra FPU for second core) would not be much more
>complicated than a single core P5, using the same logic as above.

The 486 weighed in at 1.2M transistors, the P5 had 3.1M transistors.
I don't know the exact break-down of the transistor count, but it is
certainly quite reasonable to assume you could build a dual-core 486
for no more than (and probably for less than) a single-core P5.
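
A quick check of that budget, using only the two transistor counts given above. The constant and variable names are made up for illustration, and the sum ignores whatever glue logic a real dual-core part would need to share the bus.

```python
I486_TRANSISTORS = 1_200_000   # figure quoted above for the 486
P5_TRANSISTORS = 3_100_000     # figure quoted above for the P5 Pentium

# Two 486 cores side by side, with no allowance for shared-bus logic.
dual_core_486 = 2 * I486_TRANSISTORS

# How many whole 486 cores fit inside one P5 transistor budget.
cores_that_fit = P5_TRANSISTORS // I486_TRANSISTORS

# Share of the P5 budget that a dual-core 486 would use.
budget_used = dual_core_486 / P5_TRANSISTORS

if __name__ == "__main__":
    print(f"dual-core 486: {dual_core_486:,} transistors")
    print(f"486 cores per P5 budget: {cores_that_fit}")
    print(f"share of P5 budget used: {budget_used:.0%}")
```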

Of course, the 486 would be a LOT slower. In fact, in '93 when the
Pentium was released a dual-core 486 would really struggle to be more
than 5% faster than a single-core 486 in any application at all, most
would end up being slower. The first problem was lack of software,
but it didn't end there.

>Besides, 486 was even easier to crank up the clock - when Pentium was
>around 100 (or 120 - too ancient history to remember exactly) AMD had
>its 586 (486 with bigger L1 cache) at 133, easily overclockable to 160
>- did it myself.

Of course the 100MHz Pentium was MUCH faster than the 160MHz AMD 486.

> So why nobody back then jumped to make dual cores?

Because they could get more performance in 99% of the cases by going
with a beefier single-core.

> My answer - no software to take advantage of it, at least on
>consumer level. Win95 had nothing in it to take advantage of SMP.

Win95 (and '98 and Me) didn't support SMP at all. If you booted Win95
on a dual-processor (either dual-core or two separate processors) then
the second processor would simply be disabled because it had
absolutely zero support.

Oh, and the Pentium pre-dated Win95 by 2 years, so really we're
talking about Win3.1 timeframe.

>Ditto Quake 1 ;-). And in corporate world, it was more than a year
>before the introduction of NT4 that might (or might not) have been
>benefitted.

WinNT 4.0 at least supported multiple processors, but most of the
software would have made little to no use of it. I suppose you could
have done ok in OS/2 as well, though really only the weirdos like
Keith ran OS/2 :>

> Any other answers?
> Just another 'what if' speculation with no practical meaning...

Back with the 486 vs. Pentium, doubling the transistors resulted in
roughly doubling the performance in single-core chips. Since this
gave you twice as much performance in ALL situations, going to
dual-core which only gave you a small increase in performance in a
very few applications available at the time made no sense.
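
One way to put numbers on that comparison is Amdahl's law. That framing is an addition here, not something the post uses, and the parallel fractions in the example are invented purely for illustration. The function name is hypothetical.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup when `parallel_fraction` of the work
    is spread perfectly over `cores` and the rest stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative fractions only: mostly serial software of the Win3.1 era
    # versus a hypothetical well-threaded workload.
    for fraction in (0.0, 0.1, 0.5, 0.9):
        print(f"parallel={fraction:.1f} -> dual-core x{amdahl_speedup(fraction, 2):.2f}")
```

A faster single core speeds up the serial and the parallel part alike, which is why it won as long as almost nothing ran in parallel.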

Now the tables are very much turned. With the Northwood P4 vs.
Prescott P4, Intel more than doubled their transistor count. The
result was basically a negligible increase in performance. On the
other hand a LOT more software is available now that can take
advantage of multiple processing cores and we do a lot more
multitasking than we did back in the Win3.1 or even Win9x days.

Processor design is always a question of trade-offs, what feature
gives you the most performance for most users with a given number of
transistors and/or power consumption. Back in the 486 -> Pentium days
it was pretty clear that adding logic transistors was the best way to
go. This held true when going from the Pentium to the PPro as well,
though after that time the trend shifted to adding cache transistors.
Now we're starting to see the benefits of extra cache trailing off,
but multiple cores are becoming more interesting. Eventually it's
likely that adding more cores will no longer buy you much and people
will have to think up something altogether new to do with their
transistors.

-------------
Tony Hill
hilla <underscore> 20 <at> yahoo <dot> ca
