A Map to the New World

Archived from groups: comp.sys.ibm.pc.hardware.chips

There are some very astute techies who follow this NG and who
regularly contribute to it. There are also a lot of folks who mostly
lurk and, hopefully, learn. This writeup is intended for that latter
group.


I love show & tell. Let's make a computer core: we'll need some
checkers (round), dominoes (rectangular) and a large kiddie block (a
cube).

First we place the cube on a tabletop. The whole tabletop is our die
(the chip). Now, starting at one face of the cube, we build a fence,
bending it around in a loop about a foot across until it returns to
the opposite face of the cube. This produces two tabletop spaces -
inside the fence, and outside.

The cube is our computer control center. Outside the fence, but on
the tabletop, is all the memory and I/O including the caches. Inside
the fence is our logic core. The control center exists in both
worlds, and connects them.

OK, let's take about 16 checkers and stack them up. This is our
computer execution pipeline, which does all the work. Looks like a
smokestack, doesn't it? Put it in the middle of the fenced area. Now
we need something to hold the machine state: a register set, program
and stack pointers, etc. We'll represent this with a domino, which
we'll place inside the fence alongside the execution pipeline. Voila!
A CPU core!
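
For the programmers in the audience, the same picture can be written
down as a toy C sketch. This is purely illustrative; the struct and
field names are mine, not anything from a real datasheet:

    /* Toy model of the tabletop: one core = one execution      */
    /* pipeline plus one copy of the machine state.             */
    #include <stdint.h>

    struct machine_state {        /* the "domino"               */
        uint64_t gpr[16];         /* general-purpose registers  */
        uint64_t pc;              /* program counter            */
        uint64_t sp;              /* stack pointer              */
    };

    struct pipeline {             /* the stack of checkers      */
        int stages;               /* about 16 in our example    */
    };

    struct core {                 /* what sits inside the fence */
        struct pipeline      pipe;
        struct machine_state state;
    };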

Note that this represents the 486 core and the Opteron core equally
well. BUT. It does **not** represent the P4 core. Why?
Hyperthreading.

The hyperthreaded P4 maintains two copies of the computer state. So,
we place a second domino around the execution pipe; two altogether.
Only one domino is active at any one time. Which domino is active
depends on which thread is being executed. At any given time, the
other domino sits inactive. Only one thread can be executed by the
P4 at any one time.

Is this SMT, **Simultaneous** MultiThreading? Not in my book, it isn't.

What it is, is a way to give the execution pipeline something to do
(another thread) when one thread stalls due to a cache miss. This can
change the performance of the P4, and in general it does (when used).
The change can be from about -7% to +25%. On average, the change is
positive (an improvement).
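
In C-flavored pseudocode, the switching trick looks something like
this. It's a minimal sketch of the switch-on-stall idea described
above, not Intel's actual logic; the names are mine:

    /* Switch-on-stall multithreading: when the running thread    */
    /* misses the cache, hand the pipeline to the other "domino". */
    #include <stdint.h>

    struct machine_state { uint64_t gpr[16], pc, sp; };

    static struct machine_state state[2]; /* two dominoes         */
    static int active = 0;                /* which feeds the pipe  */

    static void on_cache_miss(void)
    {
        /* The stalled thread's registers stay put in its domino;  */
        /* the pipeline just starts fetching for the other thread. */
        active ^= 1;
    }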

THIS IS IMPORTANT: Hyperthreading (and SMT in general) is both a
hardware and a software technique. When in use, that "control center"
block becomes more generalized as it includes both the hardware _and_
the OS (software), plus the application itself.

Building a True SMT Chip
------------------------

Simplicity itself. Start with our P4 representation, with its two
dominoes. Spread them apart, give each its own execution pipeline,
and then add two more pipeline-and-domino pairs. This is CMP, chip
multiprocessing. It is capable of true SMT; four threads can be
executed simultaneously.

This example has four cores on-chip. AMD and now Intel have announced
dual-core chips while IBM has been there for a while. Sun has just
announced an 8-core die as a "throughput" machine.

But Wait! We're Not Done Yet!
------------------------------

Let's Hyperthread-enable each of our four cores. We place a second
domino beside each of the four execution stacks: eight dominoes all
told. Our chip can now hold eight threads, executing four of them
simultaneously. Is this an SMT chip? Definitely yes. Can all its
threads be executed simultaneously? Definitely no.
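
In the same toy C as before (numbers illustrative), the whole
tabletop now looks like this:

    /* CMP plus hyperthreading: four pipelines, two dominoes each, */
    /* so eight resident threads, at most four executing at once.  */
    #include <stdint.h>

    struct machine_state { uint64_t gpr[16], pc, sp; }; /* domino  */
    struct pipeline      { int stages; };       /* checker stack   */

    #define NUM_CORES       4
    #define STATES_PER_CORE 2

    struct smt_core {
        struct pipeline      pipe;                   /* one pipeline */
        struct machine_state state[STATES_PER_CORE]; /* its dominoes */
        int                  active;                 /* which is live */
    };

    struct chip {
        struct smt_core core[NUM_CORES];
    };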

Gentlemen, I present the future. In the future, the CPU die will
contain multiple cores, and each core will hold several copies of that
core's machine state.

Have we lost anything? Yes. We've lost the ability to focus all the
chip's compute resources on a single user's single-threaded
application. Technically, this isn't much of a disadvantage; most
legacy apps run much faster than needed already. BUT: IMNSHO there's
a HUGE marketing problem here.

If you think some folk are hanging onto their old CPUs too long
already, when a new CPU would in fact run legacy apps faster, wait
until the new era when a new CPU will NOT run legacy apps any faster
than the old, obsolescent CPU!

I don't think Intel has thought this part out. ;-(
  1.

    On Sat, 08 May 2004 18:09:54 GMT, "Felger Carbon" <fmsfnf@jfoops.net>
    wrote:

    >legacy apps run much faster than needed already. BUT: IMNSHO there's
    >a HUGE marketing problem here.
    >
    >If you think some folk are hanging onto their old CPUs too long
    >already, when a new CPU would in fact run legacy apps faster, wait
    >until the new era when a new CPU will NOT run legacy apps any faster
    >than the old, obsolescent CPU!

    hmm.... but wouldn't legacy apps still see a performance increase from
    going from last year's 2-core 1.5GHz model to this year's 4-core 2GHz
    model?

    --
    L.Angel: I'm looking for web design work.
    If you need basic to med complexity webpages at affordable rates, email me :)
    Standard HTML, SHTML, MySQL + PHP or ASP, Javascript.
    If you really want, FrontPage & DreamWeaver too.
    But keep in mind you pay extra bandwidth for their bloated code
  2.

    The little lost angel <a?n?g?e?l@lovergirl.lrigrevol.moc.com> wrote:
    > On Sat, 08 May 2004 18:09:54 GMT, "Felger Carbon" <fmsfnf@jfoops.net>
    > wrote:
    > >If you think some folk are hanging onto their old CPUs too long
    > >already, when a new CPU would in fact run legacy apps faster, wait
    > >until the new era when a new CPU will NOT run legacy apps any faster
    > >than the old, obsolescent CPU!
    >
    > hmm.... but wouldn't legacy apps still see a performance increase from
    > going from last year's 2-core 1.5GHz model to this year's 4-core 2GHz
    > model?

    Sure. Not to mention the fact that the 4-core model may well have 3MB
    of L2 cache.

    Further, the move to needing multithreading to get the full value out of a
    CPU has already allegedly happened, with the Hyperthreading-enabled models
    of the P4s. With multiple-core CPUs, it's just a matter of degree.

    --
    Nate Edel http://www.nkedel.com/

    "Elder Party 2004: Cthulhu for President -- this time WE'RE the lesser
    evil."
  3.

    "The little lost angel" <a?n?g?e?l@lovergirl.lrigrevol.moc.com> wrote
    in message news:40a05cc4.382707671@news.pacific.net.sg...
    > On Sat, 08 May 2004 18:09:54 GMT, "Felger Carbon" <fmsfnf@jfoops.net>
    > wrote:
    >
    > >legacy apps run much faster than needed already. BUT: IMNSHO there's
    > >a HUGE marketing problem here.
    > >
    > >If you think some folk are hanging onto their old CPUs too long
    > >already, when a new CPU would in fact run legacy apps faster, wait
    > >until the new era when a new CPU will NOT run legacy apps any faster
    > >than the old, obsolescent CPU!
    >
    > hmm.... but wouldn't legacy apps still see a performance increase from
    > going from last year's 2-core 1.5GHz model to this year's 4-core 2GHz
    > model?

    The problem is that tomorrow's 2-core CPU will not run today's
    single-threaded apps faster than today's 1-core CPU.

    Multi-core CPUs are a huge win for almost all servers. What the hell
    they'll be doing on my single-user private party desktop, I dunno.
  4.

    Felger Carbon <fmsfnf@jfoops.net> wrote:
    > Multi-core CPUs are a huge win for almost all servers. What the hell

    If they're running bloatware. But a fileserver, newsserver
    or popserver should be I/O (disk & network) limited.
    A webserver with lots of scripts might wind up compute limited.
    AFAIK a Google-style search engine is memory bandwidth-limited.

    > they'll be doing on my single-user private party desktop, I dunno.

    Me neither, unless some must-have compute-intensive software
    comes along. 'Til then, I like my 1999 vintage dual Celeron.

    -- Robert

  5.

    Robert Redelmeier wrote:
    > Felger Carbon <fmsfnf@jfoops.net> wrote:
    >
    >>Multi-core CPUs are a huge win for almost all servers. What the hell
    >
    >
    > If they're running bloatware. But a fileserver, newsserver
    > or popserver should be I/O (disk & network) limited.

    OLTP tends to stall a lot. Nothing much you can do about it, apparently,
    except to add more pipes and let them stall. You can haggle over how
    to add the pipes, like SMT or CMP, but those are details. In the end,
    adding pipes is a win, up to a point, as long as you are putting
    underutilized bandwidth to work. Beyond that point, you are just
    wasting transistors on pipes that will stall. Working this all out is a
    paycheck for somebody. :-).

    The critical resource for OLTP is bandwidth. Disk and I/O bandwidth are
    shared and so are not affected by how you deploy processors, but memory
    bandwidth most definitely is. If you have a single pipe and a single
    memory controller, then the memory bandwidth that controller schedules
    will be used in haphazard ways, and you will inevitably throw away
    bandwidth. If you let multiple pipes share a single memory controller,
    then, with luck, you will wind up with the maximum bandwidth utilization
    possible.
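
    One crude way to put numbers on that intuition (a toy model of my
    own, assuming each pipe's demand is independent and that one pipe
    alone keeps the controller ~40% busy; nothing here comes from a
    real OLTP study):

        /* Utilization of a shared memory controller as pipes are */
        /* added, assuming independent 40%-busy demand per pipe.  */
        #include <math.h>
        #include <stdio.h>

        int main(void)
        {
            const double per_pipe = 0.40; /* one pipe's utilization */
            for (int pipes = 1; pipes <= 4; pipes++) {
                double util = 1.0 - pow(1.0 - per_pipe, pipes);
                printf("%d pipe(s): ~%.0f%% of controller busy\n",
                       pipes, 100.0 * util);
            }
            return 0;
        }

    It prints roughly 40%, 64%, 78%, 87%: diminishing returns, which
    is exactly the "wasting transistors on pipes that will stall"
    point above.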

    > A webserver with lots of scripts might wind up compute limited.

    Ever seen any published evidence of that? Not a challenge. I'm
    curious. I'd guess that you'd see spikes in processor utilization
    interspersed with an idle processor and an average utilization no better
    than the ~40% utilization that fully-loaded OLTP processors see, anyway.

    > AFAIK a Google-style search engine is memory bandwidth-limited.

    When this subject came up in comp.arch, a poster claimed that Google
    claimed (i.e., I can come up with a link to a comp.arch thread and
    nothing more) that compute, disk, and memory were all roughly in balance
    for their system. Would you expect any less from such a bright bunch?

    >>they'll be doing on my single-user private party desktop, I dunno.
    >
    >
    > Me neither, unless some must-have compute-intensive software
    > comes along. 'Til then, I like my 1999 vintage dual Celeron.

    I went into a cataleptic state for ten seconds, consulted my private
    oracle, and came out with the conclusion that programming styles,
    compilers, or architectures would have to change so that more ordinary
    applications could utilize multiple pipes, and that such a change is
    inevitable. If my oracle says so, it must be true. :-).

    So many things point that way that I do not see how things will be
    otherwise. If we haven't reached a knee in the performance curve for a
    single pipe, we're going to very soon. Graphics processors and thus
    games already make abundant use of threaded programming.

    I'd like to see programming for parallelism become safer and easier.
    Having had that problem on the table for so long with what I see as very
    little progress, I'm not optimistic. There are people for whom I have
    profound respect working on this problem, but I'm not sure that they
    don't underestimate it. People work at the surface with languages and
    whatnot, when the real problem is that we lack even the most basic tools
    for talking about actual algorithms in a formal way. People are working
    at that level, too. For the most part, though, no one pays attention.

    Compilers for parallelism have also seen a lot of work and went through a
    period when a tremendous amount of money was pumped into them. 'Nuff said.

    In the face of this record of disappointment, hardware architects have
    made what I see as substantial progress on the problem while addressing
    the memory latency problem. If you want something done, ask a busy
    person to do it. OoO processors look through a rather large window in
    an instruction stream (and it has to be large to hide memory latency)
    and move forward whatever they can. As aggressive as the strategies
    currently in use are, they could be even more aggressive. An OoO
    superscalar processor already parallelizes opportunistically, and it is
    only a short step to handing a processor a binary compiled for
    single-threaded execution and watching it execute multi-threaded. It
    will happen. I heard it straight from my own private oracle. :-).

    RM
  6.

    "Robert Myers" <rmyers1400@comcast.net> wrote in message
    news:iEaoc.73922$Ik.5254893@attbi_s53...
    >
    > I went into a cataleptic state for ten seconds, consulted
    > my private oracle, and came out with the conclusion that
    > programming styles, compilers, or architectures would
    > have to change so that more ordinary applications could
    > utilize multiple pipes, and that such a change is
    > inevitable. If my oracle says so, it must be true. :-).

    Gosh, I hope you make a full recovery, Robert. How else will I learn
    what the future holds? ;-)

    But in the past - starting a microsecond ago - all my software was
    single-threaded. Past software is legacy software. Future software
    is not legacy software. I stated (didn't I?) that the future
    dual-core CPUs would not improve the performance of (single-threaded)
    **legacy** applications.

    In other words, Robert, you changed the subject from legacy software
    (that's the stuff we already own) to future software (which none of us
    own). This means to benefit from future dual-core CPUs we will also
    have to buy new software? Hmm. New hardware **and** new software?
    Sounds like we'll all have to throw out what we have right now, today,
    and buy all new stuff, both hardware and software.

    I suggested this might be a huge marketing problem, as I recall.

    I'm sure glad you're around. I would never have come to the
    conclusion that we're all gonna trash what we have now! ;-) ;-)
  7.

    Felger Carbon wrote:
    > "Robert Myers" <rmyers1400@comcast.net> wrote in message
    > news:iEaoc.73922$Ik.5254893@attbi_s53...
    >
    >>I went into a cataleptic state for ten seconds, consulted
    >>my private oracle, and came out with the conclusion that
    >>programming styles, compilers, or architectures would
    >>have to change so that more ordinary applications could
    >>utilize multiple pipes, and that such a change is
    >>inevitable. If my oracle says so, it must be true. :-).
    >
    >
    > Gosh, I hope you make a full recovery, Robert.

    Even the most optimistic of my friends gave up on that thought long ago.

    > How else will I learn what the future holds? ;-)

    I wouldn't be missed. No shortage of fortune tellers on Usenet. ;-).

    > But in the past - starting a microsecond ago - all my software was
    > single-threaded. Past software is legacy software. Future software
    > is not legacy software. I stated (didn't I?) that the future
    > dual-core CPUs would not improve the performance of (single-threaded)
    > **legacy** applications.
    >
    > In other words, Robert, you changed the subject from legacy software
    > (that's the stuff we already own) to future software (which none of us
    > own).

    The implication being that the software we own can't be used on
    radically different hardware. Always an incorrect conclusion for Linux
    users, who can just recompile. Transmeta has some ideas of its own, and
    the most likely line of development I see for aggressively scheduled SMT
    cores wouldn't need new software, either.

    > This means to benefit from future dual-core CPUs we will also
    > have to buy new software? Hmm. New hardware **and** new software?
    > Sounds like we'll all have to throw out what we have right now, today,
    > and buy all new stuff, both hardware and software.
    >

    Ah, but you see, onboard scheduling hardware already evokes parallelism
    from a nominally single-threaded instruction stream. The parallelism
    frequently is there, whether you are accustomed to diagramming it that
    way (or whatever mental way you have of thinking of parallel processes)
    or not. On-board scheduling hardware, among other things, discovers and
    implements streaming parallelism, although not always without a struggle.

    On-board scheduling hardware could initiate new threads where there is
    exploitable parallelism. If you're not _sure_ the parallelism is there,
    you can speculate, often successfully. In the most naive of strategies,
    you just pick places to jump into the instruction stream and start a new
    thread.
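
    To make this concrete, here is the sort of nominally single-threaded
    loop such hardware could mine for parallelism (the example is mine,
    not from any of the sources mentioned below):

        /* Single-threaded source, but with restrict the iterations */
        /* are provably independent, so aggressive scheduling       */
        /* hardware could hand blocks of them to separate pipes     */
        /* without any change to the program.                       */
        void scale(float *restrict dst, const float *restrict src,
                   float k, int n)
        {
            for (int i = 0; i < n; i++)
                dst[i] = k * src[i];   /* no loop-carried dependence */
        }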

    One place to read about this kind of stuff is Andy Glew's home page.
    It's also been a subject, one way or another, of a fair number of my
    posts to comp.arch. Dick Wilmot has a particularly aggressive scheme
    that he calls data surfing. Since I don't know any better, I think of
    Andy Glew as the leading proponent of this particular set of tactics.

    If you google comp.arch for "dusty decks" over the last year, you will
    find more than one thread talking about feeding, er, legacy software to
    a hungry multi-threaded monster.

    > I suggested this might be a huge marketing problem, as I recall.

    If Intel decides that persuading people that "legacy" software is
    something they don't need or want is in their best interest, they'll
    find a way to market it. I would suspect that things are a bit tense
    between the two halves of the Wintel monopoly just about now.

    As it is, I don't think Intel will need any spectacular marketing ploys,
    because my most likely scenario is that hardware will manage to
    accommodate legacy software, anyway.

    > I'm sure glad you're around. I would never have come to the
    > conclusion that we're all gonna trash what we have now! ;-) ;-)
    >

    It's always nice to feel appreciated. Thank you. :-).

    I would suspect that most performance-sensitive "legacy" applications,
    meaning really Windows applications, have been gradually retuned and
    recompiled from source as it is. You think you own software that
    doesn't know about MMX, SSE, SSE2, and the vagaries of the Pentium 4?
    If so, you haven't bought software in a long time.

    RM
  8.

    In article <GO8oc.16447$Hs1.13999@newsread2.news.pas.earthlink.net>,
    fmsfnf@jfoops.net says...
    > "The little lost angel" <a?n?g?e?l@lovergirl.lrigrevol.moc.com> wrote
    > in message news:40a05cc4.382707671@news.pacific.net.sg...
    > > On Sat, 08 May 2004 18:09:54 GMT, "Felger Carbon" <fmsfnf@jfoops.net>
    > > wrote:
    > >
    > > >legacy apps run much faster than needed already. BUT: IMNSHO there's
    > > >a HUGE marketing problem here.
    > > >
    > > >If you think some folk are hanging onto their old CPUs too long
    > > >already, when a new CPU would in fact run legacy apps faster, wait
    > > >until the new era when a new CPU will NOT run legacy apps any faster
    > > >than the old, obsolescent CPU!
    > >
    > > hmm.... but wouldn't legacy apps still see a performance increase from
    > > going from last year's 2-core 1.5GHz model to this year's 4-core 2GHz
    > > model?
    >
    > The problem is that tomorrow's 2-core CPU will not run today's
    > single-threaded apps faster than today's 1-core CPU.
    >
    > Multi-core CPUs are a huge win for almost all servers. What the hell
    > they'll be doing on my single-user private party desktop, I dunno.


    Geez, Felg. How many times do I have to tell you that WinBlows
    isn't all there is to computing! Sometimes people have multiple
    things going on at once! ;-)

    SMP is Billy's best chance to actually have a multitasking OS!
    ...although others figured out how to do it on a UP long ago.

    --
    Keith