Pretty good explanation of x86-64 by HP

Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

I found this whitepaper from HP to be pretty good, it is surprisingly
candid, considering HP was the coinventor of the Itanium. It does a
pretty good job of explaining and summarizing the similarities and
differences between AMD64 and EM64T, and their comparison to the
Itanium's IA64 instruction set. AMD64 and EM64T are "broadly
compatible", but IA64 is a different animal altogether.

Yousuf Khan

http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c00238028.pdf
  1. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    On Sun, 05 Dec 2004 01:02:11 -0500, Yousuf Khan <bbbl67@ezrs.com> wrote:

    >I found this whitepaper from HP to be pretty good, it is surprisingly
    >candid, considering HP was the coinventor of the Itanium. It does a
    >pretty good job of explaining and summarizing the similarities and
    >differences between AMD64 and EM64T, and their comparison to the
    >Itanium's IA64 instruction set. AMD64 and EM64T are "broadly
    >compatible", but IA64 is a different animal altogether.
    >
    > Yousuf Khan
    >
    >http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c00238028.pdf

    Hmm and the following quote: "However, the latency difference between local
    and remote accesses is actually very small because the memory controller is
    integrated into and operates at the core speed of the processor, and
    because of the fast interconnect between processors." is relevant to
    another discussion here. I wish we could get a firm answer on this one.

    Rgds, George Macdonald

    "Just because they're paranoid doesn't mean you're not psychotic" - Who, me??
  2. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    Yousuf Khan wrote:
    > I found this whitepaper from HP to be pretty good, it is surprisingly
    > candid, considering HP was the coinventor of the Itanium. It does a
    > pretty good job of explaining and summarizing the similarities and
    > differences between AMD64 and EM64T, and their comparison to the
    > Itanium's IA64 instruction set. AMD64 and EM64T are "broadly
    > compatible", but IA64 is a different animal altogether.
    >
    > Yousuf Khan
    >
    > http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c00238028.pdf

    When did the non-Xeon Prescott P4s start offering EM64T as listed in
    the paper? News to me. Does HP know something the rest of the world
    doesn't?

    Bill
  3. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    On Sun, 05 Dec 2004 06:44:31 GMT, Bill Bradley
    <senator2@NOSPAMearthlink.net> wrote:

    >Yousuf Khan wrote:
    >> I found this whitepaper from HP to be pretty good, it is surprisingly
    >> candid, considering HP was the coinventor of the Itanium. It does a
    >> pretty good job of explaining and summarizing the similarities and
    >> differences between AMD64 and EM64T, and their comparison to the
    >> Itanium's IA64 instruction set. AMD64 and EM64T are "broadly
    >> compatible", but IA64 is a different animal altogether.
    >>
    >> Yousuf Khan
    >>
    >> http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c00238028.pdf
    >
    > When did the non-Xeon Prescott P4s start offering EM64T as listed in
    >the paper? News to me. Does HP know something the rest of the world
    >doesn't?

    Not that they know something the rest of the world doesn't, just that
    they have access to processors that most of us do not. IBM sells them
    as well, but for the time being Intel will ONLY sell them for use in
    servers. Why? I really don't know. Maybe it's just a bit too much
    crow for them to eat after saying (only a bit over a year ago) that
    64-bit wouldn't be useful for the desktop until the end of the year?

    -------------
    Tony Hill
    hilla <underscore> 20 <at> yahoo <dot> ca
  4. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    Bill Bradley wrote:
    > When did the non-Xeon Prescott P4s start offering EM64T as listed in
    > the paper? News to me. Does HP know something the rest of the world
    > doesn't?

    It must have been at least two or three months now; I posted a message
    about it in one of these newsgroups.

    Google Search: g:thl403337196d
    http://groups.google.ca/groups?q=g:thl403337196d&dq=&hl=en&lr=&selm=cGVPc.1412825%24Ar.705528%40twister01.bloor.is.net.cable.rogers.com

    or,

    http://tinyurl.com/6tnjy

    Yousuf Khan
  5. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    George Macdonald wrote:
    > Hmm and the following quote: "However, the latency difference between local
    > and remote accesses is actually very small because the memory controller is
    > integrated into and operates at the core speed of the processor, and
    > because of the fast interconnect between processors." is relevant to
    > another discussion here. I wish we could get a firm answer on this one.

    Yeah, but that's why I think AMD insists on calling their multiprocessor
    connection scheme SUMO (Sufficiently Uniform Memory Organization)
    rather than NUMA. It's not worth getting a headache over such small
    differences in latency, is basically what they're saying.

    Yousuf Khan
  6. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    George Macdonald wrote:
    > On Sun, 05 Dec 2004 01:02:11 -0500, Yousuf Khan <bbbl67@ezrs.com> wrote:
    >
    >
    >>I found this whitepaper from HP to be pretty good, it is surprisingly
    >>candid, considering HP was the coinventor of the Itanium. It does a
    >>pretty good job of explaining and summarizing the similarities and
    >>differences between AMD64 and EM64T, and their comparison to the
    >>Itanium's IA64 instruction set. AMD64 and EM64T are "broadly
    >>compatible", but IA64 is a different animal altogether.
    >>
    >> Yousuf Khan
    >>
    >>http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c00238028.pdf
    >
    >
    > Hmm and the following quote: "However, the latency difference between local
    > and remote accesses is actually very small because the memory controller is
    > integrated into and operates at the core speed of the processor, and
    > because of the fast interconnect between processors." is relevant to
    > another discussion here. I wish we could get a firm answer on this one.
    >

    Not sure if this is exactly what you are looking for in the
    way of a "firm answer", but the latencies in an Opteron system are:

    Hops  Latency  Notes
    0      80 ns   uniprocessor (local access)
    0     100 ns   multiprocessor (local access, with cache snooping on other processors)
    1     115 ns
    2     150 ns
    3     190 ns

    I couldn't find my original source for those numbers, and
    the two and three hop numbers above are a little higher
    than I remembered them as being. This time around I got
    them from this thread:
    http://www.aceshardware.com/forum?read=80030960

    That thread refers to this article:
    http://www.digit-life.com/articles2/amd-hammer-family/
    which gives slightly different numbers for a 2 GHz Opteron
    with DDR333:
    Uni-processor system: 45 ns
    Dual-processor system: 0-hop - 69 ns, 1-hop - 117 ns.
    Four-processor system: 0-hop - 100 ns, 1-hop - 118 ns, 2-hop - 136 ns.


    I don't know if any of the numbers above are for cache misses
    or if they are averages that include both hits and misses.
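As a rough illustration of why the "sufficiently uniform" argument gets made, the digit-life figures above can be turned into an average latency for a program whose memory ends up spread evenly across all nodes. This is a back-of-envelope sketch, not a measurement, and it inherits the caveat above about whether the figures are pure misses. It assumes the 4-processor box is wired as a square, so each node sees one local node, two nodes at one hop and one node at two hops; that topology is an assumption, not something stated in the thread.

```python
# Back-of-envelope: average memory latency when pages are spread evenly
# across all nodes, using the digit-life figures quoted above (in ns).
# ASSUMPTION: the 4P system is a square ring, so from any node there is
# 1 local node, 2 nodes at 1 hop and 1 node at 2 hops.

def average_latency(latency_by_hops, nodes_at_hops):
    """Weighted mean latency, treating every node as equally likely to hold a page."""
    total_nodes = sum(nodes_at_hops.values())
    weighted = sum(latency_by_hops[h] * n for h, n in nodes_at_hops.items())
    return weighted / total_nodes

dual = average_latency({0: 69, 1: 117}, {0: 1, 1: 1})
quad = average_latency({0: 100, 1: 118, 2: 136}, {0: 1, 1: 2, 2: 1})

print(f"2P: local 69 ns, evenly spread {dual:.0f} ns, ratio {dual / 69:.2f}")
print(f"4P: local 100 ns, evenly spread {quad:.0f} ns, ratio {quad / 100:.2f}")
```

Under these assumptions an evenly spread working set costs about 93 ns per miss on the 2P box and 118 ns on the 4P box, roughly 35% and 18% above a purely local access.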
  7. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    "Bill Bradley" <senator2@NOSPAMearthlink.net> wrote in message
    news:j7ysd.1769$yr1.125@newsread3.news.pas.earthlink.net...
    > Yousuf Khan wrote:
    >> I found this whitepaper from HP to be pretty good, it is surprisingly
    >> candid, considering HP was the coinventor of the Itanium. It does a
    >> pretty good job of explaining and summarizing the similarities and
    >> differences between AMD64 and EM64T, and their comparison to the
    >> Itanium's IA64 instruction set. AMD64 and EM64T are "broadly compatible",
    >> but IA64 is a different animal altogether.
    >>
    >> Yousuf Khan
    >>
    >> http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c00238028.pdf
    >
    > When did the non-Xeon Prescott P4s start offering EM64T as listed in the
    > paper? News to me. Does HP know something the rest of the world
    > doesn't?
    >
    > Bill

    www.overclockers.co.uk had some a few weeks back, and they sold very quickly.
    I think there are a few more in now.

    hamman
  8. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    "George Macdonald" <fammacd=!SPAM^nothanks@tellurian.com> wrote in message
    news:hmr5r05drs3hird56j69qs2nbu5mth1b95@4ax.com...

    > Hmm and the following quote: "However, the latency difference between
    > local
    > and remote accesses is actually very small because the memory controller
    > is
    > integrated into and operates at the core speed of the processor, and
    > because of the fast interconnect between processors." is relevant to
    > another discussion here. I wish we could get a firm answer on this one.

    In typical Opteron setups (2-8 CPUs, using the Opteron's built-in SMP
    hardware), the latency difference between local and remote memory accesses
    is so small that the benefits of treating it as NUMA are typically
    outweighed by the costs. Generally, you just distribute the memory evenly
    and interleaved on the nodes (if you can) to avoid overloading one memory
    controller channel.
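A minimal sketch of what the interleaving described above buys: with pages assigned round-robin across nodes, each node's memory controller serves an equal share of the buffer, whereas placing everything on one node sends all traffic to a single controller. The 4 KiB page size, the node count and the simple round-robin policy are illustrative assumptions, not a description of any particular BIOS "node interleave" implementation.

```python
from collections import Counter

PAGE_SIZE = 4096  # ASSUMED page size in bytes

def node_for_page(page_index, num_nodes, interleave=True):
    """Return the node whose memory controller holds this page.

    interleave=True:  pages are assigned round-robin across nodes.
    interleave=False: every page lands on node 0 (for example, a single
                      thread on node 0 touched the whole buffer first).
    """
    return page_index % num_nodes if interleave else 0

def controller_load(buffer_bytes, num_nodes, interleave=True):
    """Count how many pages of a buffer each node's controller serves."""
    pages = (buffer_bytes + PAGE_SIZE - 1) // PAGE_SIZE
    return Counter(node_for_page(p, num_nodes, interleave) for p in range(pages))

print(controller_load(64 * 1024 * 1024, 4, interleave=True))
print(controller_load(64 * 1024 * 1024, 4, interleave=False))
```

With a 64 MiB buffer on four nodes, interleaving gives each controller 4096 pages, while the non-interleaved case puts all 16384 pages behind node 0.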

    DS
  9. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    Tony Hill <hilla_nospam_20@yahoo.ca> writes:

    >Not that they know something the rest of the world doesn't, just that
    >they have access to processors that most of us do not. IBM sells them
    >as well, but for the time being Intel will ONLY sell them for use in
    >servers. Why? I really don't know. Maybe it's just a bit too much
    >crow for them to eat after saying (only a bit over a year ago) that
    >64-bit wouldn't be useful for the desktop until the end of the year?

    How much does Intel stockpile? Could it be that they have warehouses
    full of already produced non-64-bit processors, and those want to be
    sold at the projected prices, not thrown away?

    best regards
    Patrick
  10. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    > Patrick Schaaf <mailer-daemon@bof.de> wrote:

    > How much does Intel stockpile?

    Well, according to the Reg,
    <http://www.theregister.co.uk/2004/12/03/intel_eol_p2/>
    they just finally announced EOL for the Pentium-II.

    "The Register reveals that you'll be able to continue
    ordering the part for a year, with the last trays
    leaving the chip giant's Pentium II warehouse on
    1 June 2006."

    > Could it be that they have warehouses full of already
    > produced non-64-bit processors, and those want to be
    > sold at the projected prices, not thrown away?

    Whether there is any connection between your hypothesis
    and the Reg news, is left as an exercise for the reader :-)

    --
    Regards, Bob Niland mailto:name@ispname.tld
    http://www.access-one.com/rjn email4rjn AT yahoo DOT com
    NOT speaking for any employer, client or Internet Service Provider.
  11. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    On Sun, 05 Dec 2004 16:30:15 -0500, Yousuf Khan wrote:

    > George Macdonald wrote:
    >> Hmm and the following quote: "However, the latency difference between local
    >> and remote accesses is actually very small because the memory controller is
    >> integrated into and operates at the core speed of the processor, and
    >> because of the fast interconnect between processors." is relevant to
    >> another discussion here. I wish we could get a firm answer on this one.
    >
    > Yeah, but that's why I think AMD insists on calling their multiprocessor
    > connection scheme as SUMO (Sufficiently Uniform Memory Organization),
    > rather than NUMA. It's not worth headaching over such small differences
    > in latency, is basically what they're saying.

    I'd say that's because in small systems (fewer than 8 CPUs), Opterons are
    coherent in hardware and thus sufficiently tightly coupled to be called
    UMA, as far as the user is concerned.

    --
    Keith
  12. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    On Sun, 05 Dec 2004 19:47:30 +0000, Patrick Schaaf wrote:

    > Tony Hill <hilla_nospam_20@yahoo.ca> writes:
    >
    >>Not that they know something the rest of the world doesn't, just that
    >>they have access to processors that most of us do not. IBM sells them
    >>as well, but for the time being Intel will ONLY sell them for use in
    >>servers. Why? I really don't know. Maybe it's just a bit too much
    >>crow for them to eat after saying (only a bit over a year ago) that
    >>64-bit wouldn't be useful for the desktop until the end of the year?
    >
    > How much does Intel stockpile? Could it be that they have warehouses
    > full of already produced non-64-bit processors, and those want to be
    > sold at the projected prices, not thrown away?

    Unsold inventory is a very bad thing indeed. The tax man isn't happy.
    Stockholders aren't happy. Executives shiver.

    --
    Keith
  13. Archived from groups: comp.arch,comp.sys.ibm.pc.hardware.chips

    On Sun, 05 Dec 2004 19:12:44 -0800, Greg Lindahl wrote:

    > In article <pan.2004.12.06.02.44.34.997332@att.bizzzz>,
    > keith <krw@att.bizzzz> wrote:
    >
    >>I'd say that because in small systems (less than 8 CPUs), Opterons are
    >>coherent in hardware thus sufficiently tightly coupled to be called UMA,
    >>as far as the user is concerned.
    >
    > However, it's not hard to show with benchmarks that paying attention
    > to the NUMA nature of the Opteron is a significant win. So you can
    > call it what you want, but...

    Point well taken. So we have a desert topping and a floor wax. ;-)

    > Newsgroups trimmed.

    ..chips added back in.

    --
    Keith
  14. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    In comp.arch Tony Hill <hilla_nospam_20@yahoo.ca> wrote:

    > Not that they know something the rest of the world doesn't, just that
    > they have access to processors that most of us do not. IBM sells them
    > as well, but for the time being Intel will ONLY sell them for use in
    > servers. Why? I really don't know.

    FWIW, Dell are shipping EM64T-equipped non-Xeon P4 workstations (the
    Precision 370).

    -a
  15. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    On Sun, 05 Dec 2004 17:37:16 GMT, Rob Stow <rob.stow.nospam@shaw.ca> wrote:

    >George Macdonald wrote:
    >> On Sun, 05 Dec 2004 01:02:11 -0500, Yousuf Khan <bbbl67@ezrs.com> wrote:
    >>
    >>
    >>>I found this whitepaper from HP to be pretty good, it is surprisingly
    >>>candid, considering HP was the coinventor of the Itanium. It does a
    >>>pretty good job of explaining and summarizing the similarities and
    >>>differences between AMD64 and EM64T, and their comparison to the
    >>>Itanium's IA64 instruction set. AMD64 and EM64T are "broadly
    >>>compatible", but IA64 is a different animal altogether.
    >>>
    >>> Yousuf Khan
    >>>
    >>>http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c00238028.pdf
    >>
    >>
    >> Hmm and the following quote: "However, the latency difference between local
    >> and remote accesses is actually very small because the memory controller is
    >> integrated into and operates at the core speed of the processor, and
    >> because of the fast interconnect between processors." is relevant to
    >> another discussion here. I wish we could get a firm answer on this one.
    >>
    >
    >Not sure if this is exactly what you are looking for in the
    >way of a "firm answer", but the latencies in a Opteron system are:
    >
    >0 hops 80 ns uniprocessor (Local access)
    > 100 ns multiprocessor (Local access, with cache snooping on other processors)
    >1 hop 115 ns
    >2 hops 150 ns
    >3 hops 190 ns
    >
    >I couldn't find my original source for those numbers, and
    >the two and three hop numbers above are a little higher
    >than I remembered them as being. This time around I got
    >them from this thread:
    >http://www.aceshardware.com/forum?read=80030960
    >
    >That thread refers to this article:
    > http://www.digit-life.com/articles2/amd-hammer-family/
    >which gives slightly different numbers for a 2 GHz Opteron
    >with DDR333:
    > Uni-processor system: 45 ns
    > Dual-processor system: 0-hop - 69 ns, 1-hop - 117 ns.
    > Four-processor system: 0-hop - 100 ns, 1-hop - 118 ns, 2-hop - 136 ns.
    >
    >
    >I don't know if any of the numbers above are for cache misses
    >or if they are averages that include both hits and misses.

    Thanks for the data, but no; I guess I should have highlighted better what
    I was getting at: "the memory controller is integrated into and operates
    at the core speed of the processor", which is what was being
    discussed/disputed in another thread.

    I haven't been able to find any hard data from AMD on where the clock
    domain boundaries are in the Opteron/Athlon64 but if the memory controller
    is not operating at "core speed" it's now at the stage of Internet
    Folklore.

    Rgds, George Macdonald

    "Just because they're paranoid doesn't mean you're not psychotic" - Who, me??
  16. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    On Sun, 05 Dec 2004 16:30:15 -0500, Yousuf Khan <bbbl67@ezrs.com> wrote:

    >George Macdonald wrote:
    >> Hmm and the following quote: "However, the latency difference between local
    >> and remote accesses is actually very small because the memory controller is
    >> integrated into and operates at the core speed of the processor, and
    >> because of the fast interconnect between processors." is relevant to
    >> another discussion here. I wish we could get a firm answer on this one.
    >
    >Yeah, but that's why I think AMD insists on calling their multiprocessor
    > connection scheme as SUMO (Sufficiently Uniform Memory Organization),
    >rather than NUMA. It's not worth headaching over such small differences
    >in latency, is basically what they're saying.

    See my reply to Rob Stow.

    Rgds, George Macdonald

    "Just because they're paranoid doesn't mean you're not psychotic" - Who, me??
  17. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips

    In article <8c97r0hqh2sqf8sh89ut3153lpdmddfs76@4ax.com>,
    George Macdonald <fammacd=!SPAM^nothanks@tellurian.com> wrote:

    >I haven't been able to find any hard data from AMD on where the clock
    >domain boundaries are in the Opteron/Athlon64 but if the memory controller
    >is not operating at "core speed" it's now at the stage of Internet
    >Folklore.

    Note that the STREAM bandwidth and lmbench latency change with every
    CPU speed bump. So clearly part of the memory controller runs at the CPU
    core frequency, or a related frequency, and not at the HT frequency
    or the SDRAM external bus frequency.

    Please reduce the cross-post. Followups set to a group I read.

    -- greg
  18. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips

    In article <pqe8r0hsb4vfiqv4uvpk6h2h7cn8gq5q37@4ax.com>,
    Tony Hill <hilla_nospam_20@yahoo.ca> wrote:

    >It does, but the difference is small, usually less than 10% and often
    >much closer to 0%.

    No, it's not. The Opteron builds the best 4-cpu SMP system out there
    according to the SPECrate2000 cpu benchmark, but in order to get that
    best result, you need to pin the individual processes to cpus and
    memory using a utility. Without it, the performance is no longer the
    best. So people really care about that last bit of performance.
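On Linux, one way to approximate this kind of process binding from inside a program is to set the CPU affinity before allocating and touching data, relying on the kernel's default first-touch policy to place pages on the node of the CPU that first writes them. This is a sketch under those assumptions; it is not the (unnamed) utility mentioned above, and explicit memory binding would need something like numactl or libnuma, which Python's standard library does not expose.

```python
import os

def pin_and_touch(cpu, n_bytes):
    """Pin the calling process to one CPU, then allocate and touch a buffer.

    Linux-only: os.sched_setaffinity is not available on every platform.
    ASSUMPTION: the kernel's default first-touch policy places each page on
    the NUMA node of the CPU that first writes it, so the buffer ends up local.
    """
    os.sched_setaffinity(0, {cpu})     # 0 means "the calling process"
    buf = bytearray(n_bytes)           # zero-filled allocation
    for i in range(0, n_bytes, 4096):  # write one byte per (assumed 4 KiB) page
        buf[i] = 1
    return buf

if __name__ == "__main__":
    allowed = sorted(os.sched_getaffinity(0))
    buf = pin_and_touch(allowed[0], 1 << 20)
    print("pinned to", os.sched_getaffinity(0), "buffer size", len(buf))
```

Pinning first and allocating second is the point of the ordering: if the buffer were touched before pinning, its pages could already sit on another node.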

    Now I don't have the direct comparison for that, but here's a
    comparison of some benchmarks from a recent competitive bid. "Slow" is
    a system without the processor binding and with "node interleave"
    turned on. "Fast" is with processor binding and node interleave off,
    which lets the processor binding have the best benefit. Note that it's
    only a trivial amount of work to get this improvement for a serial
    code, so this is a common situation, although these benchmarks are, of
    course, particular to this scientific-computing customer. In these
    results, the comparison is scaling for 4 processes on a 4 cpu machine.
    4.0 would be a perfect score.

                   fast   slow   difference
    benchmark 1    3.71   3.03   +22%
    benchmark 2    3.76   3.29   +14%
    benchmark 3    3.78   3.26   +16%
    benchmark 4    3.79   3.45   +10%
    benchmark 5    3.92   3.89    +1%
    benchmark 6    3.88   3.71    +5%

    These benchmarks were run with the best Opteron compiler, so this
    scaling improvement was very good to see. And it's bigger than
    "usually less than 10%".
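The "difference" column in the table above can be reproduced from the two scaling columns as the relative gain of "fast" over "slow". The short script below does that and also expresses each "fast" result as a fraction of perfect 4.0 scaling; it only re-derives numbers already in the table.

```python
# Scaling for 4 processes on a 4-CPU Opteron, copied from the table above.
# 4.0 would be perfect scaling.
results = {
    "benchmark 1": (3.71, 3.03),
    "benchmark 2": (3.76, 3.29),
    "benchmark 3": (3.78, 3.26),
    "benchmark 4": (3.79, 3.45),
    "benchmark 5": (3.92, 3.89),
    "benchmark 6": (3.88, 3.71),
}

for name, (fast, slow) in results.items():
    gain = (fast / slow - 1) * 100  # improvement of "fast" over "slow", percent
    efficiency = fast / 4.0 * 100   # "fast" run as a percentage of perfect scaling
    print(f"{name}: +{gain:.0f}%  (fast run at {efficiency:.1f}% of perfect)")
```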

    > When well over 90% of your memory access is coming
    > from cache anyway and (assuming a totally random distribution in a
    > strictly UMA setup) 50% of your memory access is going to be local,
    > most of the performance difference is lost in the noise.

    Handwaving is a bad way to evaluate effects like this.
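The quoted cache argument can be made concrete with the usual average-memory-access-time formula, AMAT = hit time + miss rate × miss latency, using the dual-processor digit-life figures quoted earlier in the thread (69 ns local, 117 ns one hop). The hit time and miss rate below are assumptions chosen to match "well over 90% of your memory access is coming from cache"; the model treats run time as proportional to memory access time, which overstates the effect for compute-bound code, and it ignores bandwidth contention on a single memory controller, which real benchmarks like the ones above would capture.

```python
def amat(hit_ns, miss_rate, miss_ns):
    """Average memory access time: every access pays hit_ns, misses add miss_ns."""
    return hit_ns + miss_rate * miss_ns

LOCAL_NS, REMOTE_NS = 69.0, 117.0  # 2P digit-life figures quoted earlier
HIT_NS = 2.0                       # ASSUMED average cache hit time
MISS_RATE = 0.10                   # ASSUMED: 90% of accesses hit in cache

pinned = amat(HIT_NS, MISS_RATE, LOCAL_NS)                          # all misses local
spread = amat(HIT_NS, MISS_RATE, 0.5 * LOCAL_NS + 0.5 * REMOTE_NS)  # 50% local

print(f"pinned {pinned:.1f} ns, spread {spread:.1f} ns, "
      f"slowdown {100 * (spread / pinned - 1):.0f}%")
```

With these particular assumptions the 50%-local case comes out at 11.3 ns per access against 8.9 ns when pinned; changing the assumed hit time or miss rate moves that gap considerably, which is why measurement beats estimation here.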

    >I've said it before and I'll say it again: Hardware is cheap,
    >software is expensive. It would be a true disservice to your
    >customers to tell them to spend thousands upon thousands of dollars
    >changing all their software for the small improvement in performance
    >equal to a few hundred dollars of hardware costs.

    Customers know what 10% or 20% more performance means, as do vendors
    who are doing competitive bidding. The fact that I care a lot about
    this should give you a clue. And in some cases, such as serial codes,
    the benefits are easy to achieve. It took only a moderate amount of
    work in our OpenMP compiler and runtime to get these benefits for some
    parallel programs, too. Well worth it to our customers.

    -- greg
    speaking for myself, not PathScale
  19. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    On 05 Dec 2004 19:47:30 GMT, mailer-daemon@bof.de (Patrick Schaaf)
    wrote:

    >Tony Hill <hilla_nospam_20@yahoo.ca> writes:
    >
    >>Not that they know something the rest of the world doesn't, just that
    >>they have access to processors that most of us do not. IBM sells them
    >>as well, but for the time being Intel will ONLY sell them for use in
    >>servers. Why? I really don't know. Maybe it's just a bit too much
    >>crow for them to eat after saying (only a bit over a year ago) that
    >>64-bit wouldn't be useful for the desktop until the end of the year?
    >
    >How much does Intel stockpile? Could it be that they have warehouses
    >full of already produced non-64-bit processors, and those want to be
    >sold at the projected prices, not thrown away?

    ALL of the "Prescott" and "Nocona" cores are 64-bit capable excluding
    those that would pass a validation as 32-bit chips but fail as 64-bit
    chips, but such chips would be rather few and far between. It could
    be that Intel still has a reasonable amount of inventory of their old
    "Northwood" P4 chips and they want to clear those out first, but that
    certainly doesn't seem to be the case looking at Intel's pricing
    structure and what is being sold by the major OEMs (Intel seems to be
    pushing Prescott VERY hard here).

    Long story short, I'm not quite sure what the actual answer is, but
    excessive inventory of 32-bit chips doesn't seem to make sense from
    what I've seen.

    -------------
    Tony Hill
    hilla <underscore> 20 <at> yahoo <dot> ca
  20. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    On 6 Dec 2004 00:30:48 GMT, ammonton@cc.full.stop.helsinki.fi wrote:

    >In comp.arch Tony Hill <hilla_nospam_20@yahoo.ca> wrote:
    >
    >> Not that they know something the rest of the world doesn't, just that
    >> they have access to processors that most of us do not. IBM sells them
    >> as well, but for the time being Intel will ONLY sell them for use in
    >> servers. Why? I really don't know.
    >
    >FWIW, Dell are shipping EM64T-equipped non-Xeon P4 workstations (the
    >Precision 370).

    Ahh, thanks. When I first wrote the above I had actually included
    Dell's name as well, but then removed it when I couldn't find any
    EM64T P4 processors in any of their servers (didn't think to check
    workstations first). I figured that if anyone was selling 64-bit P4s
    it would be Dell!

    -------------
    Tony Hill
    hilla <underscore> 20 <at> yahoo <dot> ca
  21. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips

    "Greg Lindahl" <lindahl@pbm.com> wrote in message
    news:41b45512$1@news.meer.net...
    > ...
    > These benchmarks were run with the best Opteron compiler
    > ...

    Visual C?

    :-)

    Thanks,
    Eugene
  22. Archived from groups: comp.arch,comp.sys.ibm.pc.hardware.chips

    Greg Lindahl wrote:

    > These benchmarks were run with the best Opteron compiler [...]

    Which compiler would that be? PathScale?

    :-)
  23. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    keith wrote:
    > I'd say that because in small systems (less than 8 CPUs), Opterons are
    > coherent in hardware thus sufficiently tightly coupled to be called UMA,
    > as far as the user is concerned.

    Yes, exactly my point: it's more or less UMA in the up-to-8-processor
    range. Beyond that, you can start thinking of it as NUMA. But having
    up to 8 processors treated as UMA is quite a lot.

    Yousuf Khan
  24. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    George Macdonald wrote:
    > On Sun, 05 Dec 2004 17:37:16 GMT, Rob Stow <rob.stow.nospam@shaw.ca> wrote:
    >
    >
    >>George Macdonald wrote:
    >>
    >>>On Sun, 05 Dec 2004 01:02:11 -0500, Yousuf Khan <bbbl67@ezrs.com> wrote:
    >>>
    >>>
    >>>
    >>>>I found this whitepaper from HP to be pretty good, it is surprisingly
    >>>>candid, considering HP was the coinventor of the Itanium. It does a
    >>>>pretty good job of explaining and summarizing the similarities and
    >>>>differences between AMD64 and EM64T, and their comparison to the
    >>>>Itanium's IA64 instruction set. AMD64 and EM64T are "broadly
    >>>>compatible", but IA64 is a different animal altogether.
    >>>>
    >>>> Yousuf Khan
    >>>>
    >>>>http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c00238028.pdf
    >>>
    >>>
    >>>Hmm and the following quote: "However, the latency difference between local
    >>>and remote accesses is actually very small because the memory controller is
    >>>integrated into and operates at the core speed of the processor, and
    >>>because of the fast interconnect between processors." is relevant to
    >>>another discussion here. I wish we could get a firm answer on this one.
    >>>
    >>
    >>Not sure if this is exactly what you are looking for in the
    >>way of a "firm answer", but the latencies in a Opteron system are:
    >>
    >>0 hops 80 ns uniprocessor (Local access)
    >> 100 ns multiprocessor (Local access, with cache snooping on other processors)
    >>1 hop 115 ns
    >>2 hops 150 ns
    >>3 hops 190 ns
    >>
    >>I couldn't find my original source for those numbers, and
    >>the two and three hop numbers above are a little higher
    >>than I remembered them as being. This time around I got
    >>them from this thread:
    >>http://www.aceshardware.com/forum?read=80030960
    >>
    >>That thread refers to this article:
    >> http://www.digit-life.com/articles2/amd-hammer-family/
    >>which gives slightly different numbers for a 2 GHz Opteron
    >>with DDR333:
    >> Uni-processor system: 45 ns
    >> Dual-processor system: 0-hop - 69 ns, 1-hop - 117 ns.
    >> Four-processor system: 0-hop - 100 ns, 1-hop - 118 ns, 2-hop - 136 ns.
    >>
    >>
    >>I don't know if any of the numbers above are for cache misses
    >>or if they are averages that include both hits and misses.
    >
    >
    > Thanks for the data but no I guess I should have highlighted better what I
    > was getting at: "the memory controller is integrated into and operates at
    > the core speed of the processor", which is what was being
    > discussed/disputed in another thread.
    >
    > I haven't been able to find any hard data from AMD on where the clock
    > domain boundaries are in the Opteron/Athlon64 but if the memory controller
    > is not operating at "core speed" it's now at the stage of Internet
    > Folklore.

    Ah, that one is much easier to answer. ;-)

    Straight from the horse's mouth:
    http://www.amd.com/us-en/Processors/ProductInformation/0%2C%2C30_118_4699_7981%5E7983%2C00.html

    "By running at the processor’s core frequency, an integrated
    memory controller greatly increases bandwidth directly available
    to the processor at significantly reduced latencies."
  25. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    In article <gbe8r0pp5ip0dl4d58fvktklbuu35442it@4ax.com>, Tony Hill wrote:
    > It could
    > be that Intel still has a reasonable amount of inventory of their old
    > "Northwood" P4 chips and they want to clear those out first, but that
    > certainly doesn't seem to be the case looking at Intel's pricing
    > structure and what is being sold by the major OEMs (Intel seems to be
    > pushing Prescott VERY hard here).

    A friend recently (1 month ago IIRC) wanted a Northwood for his DIY
    computer, but he found that none of the usual suspects around here had
    them in stock. Eventually he called the importer, who said that
    they're out of stock and they're not getting anymore either, buy a
    Prescott instead.

    > Long story short, I'm not quite sure what the actual answer is, but
    > excessive inventory of 32-bit chips doesn't seem to make sense from
    > what I've seen.

    Considering the rate chips depreciate I guess manufacturers think
    pretty hard about what they can do to minimize inventory.


    --
    Janne Blomqvist
  26. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    On Mon, 06 Dec 2004 18:39:46 GMT, Rob Stow <rob.stow.nospam@shaw.ca> wrote:

    >George Macdonald wrote:
    >> On Sun, 05 Dec 2004 17:37:16 GMT, Rob Stow <rob.stow.nospam@shaw.ca> wrote:
    <<snip>>

    >> Thanks for the data but no I guess I should have highlighted better what I
    >> was getting at: "the memory controller is integrated into and operates at
    >> the core speed of the processor", which is what was being
    >> discussed/disputed in another thread.
    >>
    >> I haven't been able to find any hard data from AMD on where the clock
    >> domain boundaries are in the Opteron/Athlon64 but if the memory controller
    >> is not operating at "core speed" it's now at the stage of Internet
    >> Folklore.
    >
    >Ah, that one is much easier to answer. ;-)
    >
    >Straight from the horse's mouth:
    >http://www.amd.com/us-en/Processors/ProductInformation/0%2C%2C30_118_4699_7981%5E7983%2C00.html
    >
    > "By running at the processor’s core frequency, an integrated
    > memory controller greatly increases bandwidth directly available
    > to the processor at significantly reduced latencies."

    Ah so there we have it... assuming this has been approved by the technical
    folks. :-) BTW I notice that AMD seems to be cutting back on the depth of info
    in their technical docs - the Product Data Sheets now consist of one
    page... a far cry from the excruciating detail on cache operation etc. we
    used to get.

    Rgds, George Macdonald

    "Just because they're paranoid doesn't mean you're not psychotic" - Who, me??
  27. Archived from groups: comp.arch,comp.sys.ibm.pc.hardware.chips

    On Sun, 05 Dec 2004 23:29:08 -0800, Greg Lindahl wrote:

    > In article <8c97r0hqh2sqf8sh89ut3153lpdmddfs76@4ax.com>,
    > George Macdonald <fammacd=!SPAM^nothanks@tellurian.com> wrote:
    >
    >>I haven't been able to find any hard data from AMD on where the clock
    >>domain boundaries are in the Opteron/Athlon64 but if the memory controller
    >>is not operating at "core speed" it's now at the stage of Internet
    >>Folklore.
    >
    > Note that the STREAM bandwidth and lmbench latency change with every
    > CPU speed bump. So clearly part of the memory controller is at the CPU
    > core frequency, or a related frequency, and not at the HT frequency,
    > or the SDRAM external bus frequency.

    That does *not* mean that the memory controller runs at the core speed.
    It would be nuts to assume such. Would you assume the caches of the
    PII run at the I/O bus speed?


    > Please reduce the cross-post. Followups set to a group I read.

    Isn't this a rather egotistical statement? "I don't read other
    groups, so no one else matters!" Hint: Others are reading this thread
    from other groups! It's posted to *three* related groups (hardly a breach
    of USENET protocol).

    --
    Keith
  28. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    On Sun, 05 Dec 2004 01:02:11 -0500, Yousuf Khan <bbbl67@ezrs.com> wrote,
    in part:

    >I found this whitepaper from HP to be pretty good, it is surprisingly
    >candid, considering HP was the coinventor of the Itanium. It does a
    >pretty good job of explaining and summarizing the similarities and
    >differences between AMD64 and EM64T, and their comparison to the
    >Itanium's IA64 instruction set. AMD64 and EM64T are "broadly
    >compatible", but IA64 is a different animal altogether.

    >http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c00238028.pdf

    I would have preferred if you had given the URL of a page with a *link*
    on it to this manual. That would make it easier to back-navigate for
    other items of related interest, and it would have meant that the manual
    could be downloaded with a right-click without waiting for the browser
    plug-in to display the whole manual.

    On page 13, under the heading "Power Considerations", I noticed a real
    whopper. Or, at least, what _seemed_ to me to be a real whopper
    initially.

    It is true that for a given implementation, a higher clock speed means
    more power consumption. It takes more power to make gates switch faster.

    However, if a higher clock speed is obtained by splitting the pipeline
    into more itty-bitty pieces, for the same level of instruction latency,
    then one still has the same number of gates, each consuming the same
    amount of power. (Except for the overhead of the pipelining process...
    and one more thing to be noted later.)

    What is the point of splitting up a pipeline into smaller pieces? Is it
    to put more megahertz in the ad copy? No, it is so that more
    instructions can be executing, in different stages, at once. (Which
    means that a Pentium IV ought to have explicit vector instructions. Yes,
    it has a separate instruction cache and data cache, but there's still
    only one bus to *main memory*, and caches do have to get filled from
    somewhere.)
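    A minimal sketch of the "more instructions executing, in different
    stages, at once" point, assuming an idealized pipeline with no stalls:
    a fixed total logic delay (t_logic) is split evenly over k stages, and
    each stage adds a fixed latch overhead (t_latch). The numbers are
    hypothetical and only illustrate the trade-off: throughput rises with
    depth while single-instruction latency does not improve.

```python
def pipeline_time(n_instr: int, stages: int, t_logic: float, t_latch: float) -> float:
    """Time (ns) to finish n_instr instructions on an ideal, stall-free pipeline.

    The clock period is the per-stage share of the logic delay plus one latch
    overhead. The first instruction needs `stages` cycles to drain through;
    every later instruction completes one cycle after its predecessor.
    """
    period = t_logic / stages + t_latch
    cycles = stages + n_instr - 1
    return cycles * period


def latency(stages: int, t_logic: float, t_latch: float) -> float:
    """Time (ns) for one instruction to traverse the whole pipeline."""
    return pipeline_time(1, stages, t_logic, t_latch)


# Hypothetical figures: 7 ns of total logic delay, 0.1 ns latch overhead per stage.
T_LOGIC, T_LATCH = 7.0, 0.1
for k in (7, 14):
    clock_ghz = 1 / (T_LOGIC / k + T_LATCH)
    print(f"{k:2d} stages: clock {clock_ghz:.2f} GHz, "
          f"latency {latency(k, T_LOGIC, T_LATCH):.1f} ns, "
          f"1000 instructions {pipeline_time(1000, k, T_LOGIC, T_LATCH):.0f} ns")
```

    With these made-up numbers, the 14-stage version finishes 1000
    instructions in roughly 608 ns instead of roughly 1107 ns, while a
    single instruction takes slightly longer (8.4 ns versus 7.7 ns) because
    of the extra latches.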

    Since CMOS gates only consume power when they are changing state, unused
    elements of a non-pipelined ALU are not consuming power, so it may well
    be that a 14-stage pipelined ALU can consume twice as much power as a
    7-stage pipelined ALU.

    But that will be because twice as much of it is in use, not because it
    is going "twice as fast".

    Since they are still sort of right, even if for the wrong reason,
    perhaps all I am criticizing is an oversimplification here. But I think
    that this can lead to a profound misconception of how microprocessors
    work.

    John Savard
    http://home.ecn.ab.ca/~jsavard/index.html
  29. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    "John Savard" <jsavard@excxn.aNOSPAMb.cdn.invalid> wrote in message
    news:41b50db1.4580547@news.ecn.ab.ca...
    > On Sun, 05 Dec 2004 01:02:11 -0500, Yousuf Khan <bbbl67@ezrs.com>
    wrote,
    > in part:

    OK, I won't trim the wonderful newsgroup list, all of whose readers are
    breathlessly awaiting my immortal prose....
    >
    > >I found this whitepaper from HP to be pretty good, it is surprisingly
    > >candid, considering HP was the coinventor of the Itanium. It does a
    > >pretty good job of explaining and summarizing the similarities and
    > >differences between AMD64 and EM64T, and their comparison to the
    > >Itanium's IA64 instruction set. AMD64 and EM64T are "broadly
    > >compatible", but IA64 is a different animal altogether.
    >
    >
    >http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c002
    38028.pdf
    >
    > I would have preferred if you had given the URL of a page with a
    *link*
    > on it to this manual. That would make it easier to back-navigate for
    > other items of related interest, and it would have meant that the
    manual
    > could be downloaded with a right-click without waiting for the browser
    > plug-in to display the whole manual.

    What braindamaged newsreader are you using that won't let you right
    click the link in the newsreader? Even OE does that. So quit whining
    and switch to a decent newsreader.
    >
    > On page 13, under the heading "Power Considerations", I noticed a real
    > whopper. Or, at least, what _seemed_ to me to be a real whopper
    > initially.
    >
    > It is true that for a given implementation, a higher clock speed means
    > more power consumption. It takes more power to make gates switch
    faster.
    >
    Probably referring to that esoteric equation P = (sf)*0.5*C*V**2 which you
    may have encountered. Or perhaps I = C*dV/dt.
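    Del's formula is the usual CMOS dynamic (switching) power expression.
    A rough sketch, reading "(sf)" as switching activity s times clock
    frequency f (an interpretation; the post does not spell it out), with C
    the switched capacitance and V the supply voltage. All numbers are
    made up; only the scaling matters: doubling f at fixed V doubles P, and
    doubling both the busy fraction of the logic and f quadruples it, which
    is the "maybe 4 times" case raised later in this post.

```python
def dynamic_power(s: float, f_hz: float, c_farads: float, v_volts: float) -> float:
    """CMOS dynamic power P = s * f * 0.5 * C * V**2, in watts."""
    return s * f_hz * 0.5 * c_farads * v_volts ** 2


# Hypothetical baseline: 10% activity, 1.5 GHz, 50 nF switched capacitance, 1.4 V.
base = dynamic_power(0.10, 1.5e9, 50e-9, 1.4)

# Same design at twice the clock, voltage held constant: power doubles.
double_f = dynamic_power(0.10, 3.0e9, 50e-9, 1.4)

# Twice as much logic busy *and* twice the clock: power quadruples.
double_s_and_f = dynamic_power(0.20, 3.0e9, 50e-9, 1.4)

print(f"baseline {base:.2f} W, 2x f -> {double_f / base:.1f}x, "
      f"2x s and 2x f -> {double_s_and_f / base:.1f}x")
```

    In practice a higher clock often also needs a higher V, which makes the
    increase steeper still because of the V**2 term.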

    > However, if a higher clock speed is obtained by splitting the pipeline
    > into more itty-bitty pieces, for the same level of instruction
    latency,
    > then one still has the same number of gates, each consuming the same
    > amount of power. (Except for the overhead of the pipelining process...
    > and one more thing to be noted later.)
    If one adds pipe stages one has more gates and more latches and more
    clock drivers. And the power per gate goes up because of the higher
    frequency.
    >
    > What is the point of splitting up a pipeline into smaller pieces? Is
    it
    > to put more megahertz in the ad copy? No, it is so that more
    > instructions can be executing, in different stages, at once. (Which
    > means that a Pentium IV ought to have explicit vector instructions.
    Yes,
    > it has a separate instruction cache and data cache, but there's still
    > only one bus to *main memory*, and caches do have to get filled from
    > somewhere.)

    Actually one reason for Intel to "superpipeline" was to jack up the freq
    for the ad copy.
    You lost me with the "Pentium IV ought to have explicit vector
    instructions" leap.
    >
    > Since CMOS gates only consume power when they are changing state,
    unused
    > elements of a non-pipelined ALU are not consuming power, so it may
    well
    > be that a 14-stage pipelined ALU can consume twice as much power as a
    > 7-stage pipelined ALU.
    Or maybe 4 times, if the freq is double.
    >
    > But that will be because twice as much of it is in use, not because it
    > is going "twice as fast".

    Clearly they are using "twice as fast" to mean "double the frequency".
    Why do you find that so hard to understand?
    >
    > Since they are still sort of right, even if for the wrong reason,
    > perhaps all I am criticizing is an oversimplification here. But I
    think
    > that this can lead to a profound misconception of how microprocessors
    > work.

    What ARE you talking about?
    >
    > John Savard
    > http://home.ecn.ab.ca/~jsavard/index.html

    Del Cecchi.
  30. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    George Macdonald wrote:
    > On Mon, 06 Dec 2004 18:39:46 GMT, Rob Stow <rob.stow.nospam@shaw.ca> wrote:
    >
    >
    >>George Macdonald wrote:
    >>
    >>>On Sun, 05 Dec 2004 17:37:16 GMT, Rob Stow <rob.stow.nospam@shaw.ca> wrote:
    >
    > <<snip>>
    >
    >>>Thanks for the data but no I guess I should have highlighted better what I
    >>>was getting at: "the memory controller is integrated into and operates at
    >>>the core speed of the processor", which is what was being
    >>>discussed/disputed in another thread.
    >>>
    >>>I haven't been able to find any hard data from AMD on where the clock
    >>>domain boundaries are in the Opteron/Athlon64 but if the memory controller
    >>>is not operating at "core speed" it's now at the stage of Internet
    >>>Folklore.
    >>
    >>Ah, that one is much easier to answer. ;-)
    >>
    >>Straight from the horse's mouth:
    >>http://www.amd.com/us-en/Processors/ProductInformation/0%2C%2C30_118_4699_7981%5E7983%2C00.html
    >>
    >> "By running at the processor’s core frequency, an integrated
    >> memory controller greatly increases bandwidth directly available
    >> to the processor at significantly reduced latencies."
    >
    >
    > Ah so there we have it... assuming this has been approved by the technical
    > folks.:-) BTW I notice that AMD seems to cutting back on the depth of info
    > in their technical docs - the Product Data Sheets now consist of one
    > page... a far cry from the excruciating detail on cache operation etc. we
    > used to get.

    The "Product Data Sheets" are indeed so brief as to be
    virtually useless, but there is still a wealth of PDFs
    that provide details about just about everything.

    The useless Product Data Sheet heads the list of
    "AMD Opteron™ Processor Tech Docs" at
    http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_9003,00.html
    but the other PDFs there have mind-numbing details about
    every little thing that does not give away trade secrets.
    For example, read the "BIOS and Kernel Developer's Guide
    for AMD Athlon™ 64 and AMD Opteron™ Processors".
  31. Archived from groups: comp.arch,comp.sys.ibm.pc.hardware.chips

    In article <pan.2004.12.07.01.37.06.417847@att.bizzzz>,
    keith <krw@att.bizzzz> wrote:

    >> Note that the STREAM bandwidth and lmbench latency change with every
    >> CPU speed bump. So clearly part of the memory controller is at the CPU
    >> core frequency, or a related frequency, and not at the HT frequency,
    >> or the SDRAM external bus frequency.
    >
    >That does *not* mean that the memory controller runs at the core speed.
    >It would be nuts to assume such. Would you assume the caches of the
    >PII run at the I/O bus speed?

    "or a related frequency", i.e. based on the cpu frequency with a
    constant divider.
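    A toy model of why every speed bump would show up in lmbench latency
    whether the controller logic runs at the core clock or at the core clock
    divided by a constant. The split into "controller cycles" plus a fixed
    DRAM time, and all the numbers, are assumptions for illustration, not
    AMD figures.

```python
def load_latency_ns(core_ghz: float, divider: int,
                    controller_cycles: int, dram_ns: float) -> float:
    """Load-to-use latency of a toy memory path, in ns.

    The controller portion runs at core_ghz / divider, so its contribution
    shrinks as the core clock rises. The DRAM portion is fixed in ns because
    the SDRAM bus clock does not change with a CPU speed bump.
    """
    controller_ghz = core_ghz / divider
    return controller_cycles / controller_ghz + dram_ns


# Hypothetical: 40 controller cycles and 50 ns of DRAM time.
for divider in (1, 2):
    lats = [load_latency_ns(ghz, divider, 40, 50.0) for ghz in (1.8, 2.0, 2.2)]
    print(f"divider {divider}:", ", ".join(f"{lat:.1f} ns" for lat in lats))
```

    In both cases the measured latency drops with each speed bump, so the
    benchmark trend alone cannot distinguish "core speed" from "core speed
    with a constant divider".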

    >> Please reduce the cross-post. Followups set to a group I read.
    >
    >Isn't this a rather egotistical statement?

    No, it follows Usenet tradition: post only to groups that you read.

    But thanks for giving me the benefit of the doubt.

    -- greg
  32. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips

    Eugene Nalimov wrote:

    > Greg Lindahl wrote:
    >
    >> These benchmarks were run with the best Opteron compiler
    >
    > Visual C?
    >
    > :-)

    Maybe he meant GCC!

    ;-)
  33. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    Del Cecchi wrote:

    > What braindamaged newsreader are you using that won't let you right
    > click the link in the newsreader? Even OE does that. So quit whining
    > and switch to a decent newsreader.

    Speaking of brain-damaged newsreaders, take a look at the mess yours
    did when you quoted John's message. I rest my case.
  34. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    "Grumble" <devnull@kma.eu.org> wrote in message
    news:cp4djh$vdh$1@news-rocq.inria.fr...
    > Del Cecchi wrote:
    >
    > > What braindamaged newsreader are you using that won't let you right
    > > click the link in the newsreader? Even OE does that. So quit whining
    > > and switch to a decent newsreader.
    >
    > Speaking of brain-damaged newsreaders, take a look at the mess yours
    > did when you quoted John's message. I rest my case.

    A few lines got wrapped. That what you are talking about?

    del
  35. Archived from groups: comp.arch,comp.sys.ibm.pc.hardware.chips

    Del Cecchi wrote:

    > Grumble wrote:
    >
    >> Del Cecchi wrote:
    >>
    >>> What braindamaged newsreader are you using that won't let you
    >>> right click the link in the newsreader? Even OE does that.
    >>> So quit whining and switch to a decent newsreader.
    >>
    >> Speaking of brain-damaged newsreaders, take a look at the mess
    >> yours did when you quoted John's message. I rest my case.
    >
    > A few lines got wrapped. That what you are talking about?

    Yessir!

    Perhaps OE-QuoteFix might help if you must use OE?
  36. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    In article <41B578C0.1000400@sgi.com>, Michael Woodacre wrote:
    > Another example would be making sure that people understand that when
    > Opteron goes dual core, unless you double the memory bandwidth
    > available, you effectively cut the bandwidth per core in half. This will
    > impact some workloads quite dramatically. Has AMD made public statements
    > about supporting higher local bandwidth for the dual core chip?

    No public statements that I know of, but there are rumors that the
    90nm Opterons, due Real Soon Now, will support DDR2 in addition to
    plain old DDR. See e.g.

    http://www.xbitlabs.com/news/cpu/display/20040212022200.html

    By the time dual-core Opterons arrive, I suspect that DDR2-800 will
    also be available, thus providing twice the memory BW compared to the
    current single-core offerings using DDR-400.
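    Back-of-the-envelope arithmetic for the bandwidth-per-core point,
    assuming a 128-bit memory interface (two 64-bit channels) as on current
    Opterons and counting only theoretical peak bandwidth; sustained
    figures such as STREAM will be lower.

```python
def peak_bandwidth_gbs(mega_transfers: int, bus_bytes: int = 8, channels: int = 2) -> float:
    """Theoretical peak bandwidth in GB/s (10**9 bytes per second)."""
    return mega_transfers * 1e6 * bus_bytes * channels / 1e9


# (transfer rate in MT/s, number of cores sharing the controller)
configs = {
    "1 core, DDR-400": (400, 1),
    "2 cores, DDR-400": (400, 2),
    "2 cores, DDR2-800": (800, 2),
}
for name, (mts, cores) in configs.items():
    total = peak_bandwidth_gbs(mts)
    print(f"{name}: {total:.1f} GB/s total, {total / cores:.1f} GB/s per core")
```

    Under these assumptions a dual-core chip on DDR2-800 gets back the
    same peak per-core share (6.4 GB/s) that a single core has today on
    DDR-400, while a dual-core chip left on DDR-400 gets half of it.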


    --
    Janne Blomqvist
  37. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    On Mon, 6 Dec 2004 20:16:21 -0600, "del cecchi" <dcecchi.nojunk@att.net>
    wrote, in part:

    >What braindamaged newsreader are you using that won't let you right
    >click the link in the newsreader?

    Clicking on the link in the newsreader, supposing I could do that, would
    simply cause the link to open in a browser window. Which is exactly what
    I achieved by cutting and pasting.

    Maybe some newsreaders do allow right-clicking links. Such newsreaders
    would probably also do dangerous and reckless things like rendering HTML
    posts instead of displaying them in all their <angle bracket> glory.

    This could result in having a brain-damaged computer, were I to view the
    wrong post by accident.

    As the posting in question was a text posting, this means that the
    newsreader would have to guess at what constituted an URL, as well, with
    no doubt occasional hilarious results.
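    To illustrate the guessing involved: a newsreader that turns URLs in
    plain-text posts into clickable links needs some heuristic to find them.
    A minimal sketch with a deliberately naive regular expression (a
    hypothetical heuristic, not what any particular newsreader does) shows
    two typical failures: trailing punctuation gets swallowed, and a URL
    that was wrapped across two lines is cut short.

```python
import re

# Naive heuristic: "http://" or "https://" followed by any run of non-whitespace.
URL_RE = re.compile(r"https?://\S+")


def find_urls(text: str) -> list[str]:
    """Return every substring the naive heuristic treats as a URL."""
    return URL_RE.findall(text)


sentence = "Details are at http://www.amd.com/us-en/."
wrapped = ("http://h200001.www2.hp.com/bc/docs/support/SupportManual/c00238028/c002\n"
           "38028.pdf")

print(find_urls(sentence))  # the sentence-ending period is included in the match
print(find_urls(wrapped))   # the wrapped URL loses its second line
```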

    John Savard
    http://home.ecn.ab.ca/~jsavard/index.html
  38. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    On 06 Dec 2004 14:12:20 +0100, Per Ekman <pek@pdc.kth.se> wrote:

    >Tony Hill <hilla_nospam_20@yahoo.ca> writes:
    >
    >> It does, but the difference is small, usually less than 10% and often
    >> much closer to 0%.
    >
    >And sometimes 50%...

    Sure, there will be extreme cases in everything.

    >> Most users don't use their computer to run STREAM though. Even in the
    >> HPC community where memory bandwidth is king, STREAM is still a rather
    >> extreme case.
    >
    >I admit I'm from the HPC-sector and memory bandwidth is very important
    >to many applications here.

    One thing that you need to keep in mind is that you represent a VERY
    small minority here in terms of PC server sales. Just because it
    matters to your application doesn't mean it has much relevance to
    the bulk of the buying public, and it almost certainly isn't going to
    have implications for what the marketing people write in the trade
    rags.

    >> Besides, they do recognize that it is NUMA, just that they are saying
    >> you don't NEED to worry about that if you don't want to because for
    >> the vast majority of times the performance difference is lost in the
    >> noise.
    >
    >It's a pretty strange argument in my eyes, "If you ignore the
    >applications that run poorly because of property X, then it makes
    >sense to downplay property X." True, but not helpful if you have such
    >an application.

    Ahh, but it's VERY helpful if you're in the marketing department! :>

    In the end, the people that are going to take a performance hit due to
    lack of NUMA optimizations probably already know as much and have
    factored it into their buying decisions. The people who are talking
    to Dell or HPaq's server sales and are thinking about an Opteron
    system but are worried that this here NoooMah thingy might cause their
    application to run slow most likely don't have to worry about much.
    Hence SUMO.

    It's all a matter of perspective.
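    Whether the NUMA penalty is "usually less than 10%" or "sometimes 50%"
    depends largely on how often an application goes off-node. A minimal
    sketch of the expected memory latency when a fraction p_remote of
    accesses goes to a remote node; the 80 ns local and 120 ns remote
    figures are placeholders, not measured Opteron numbers.

```python
def avg_latency_ns(p_remote: float, local_ns: float, remote_ns: float) -> float:
    """Expected latency when a fraction p_remote of accesses go off-node."""
    if not 0.0 <= p_remote <= 1.0:
        raise ValueError("p_remote must be between 0 and 1")
    return (1.0 - p_remote) * local_ns + p_remote * remote_ns


LOCAL_NS, REMOTE_NS = 80.0, 120.0  # hypothetical local/remote latencies
for p in (0.0, 0.25, 0.5, 1.0):
    avg = avg_latency_ns(p, LOCAL_NS, REMOTE_NS)
    print(f"p_remote={p:.2f}: {avg:.0f} ns ({avg / LOCAL_NS - 1:+.0%} vs all-local)")
```

    This is only the raw memory latency; an application that mostly hits
    in cache sees a much smaller slowdown in run time than one that is
    dominated by memory accesses.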

    -------------
    Tony Hill
    hilla <underscore> 20 <at> yahoo <dot> ca
  39. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    On Wed, 08 Dec 2004 00:19:59 -0500, Tony Hill <hilla_nospam_20@yahoo.ca>
    wrote:

    >On 06 Dec 2004 14:12:20 +0100, Per Ekman <pek@pdc.kth.se> wrote:
    >
    >>Tony Hill <hilla_nospam_20@yahoo.ca> writes:
    >>
    >>> It does, but the difference is small, usually less than 10% and often
    >>> much closer to 0%.
    >>
    >>And sometimes 50%...
    >
    >Sure, there will be extreme cases in everything.
    >
    >>> Most users don't use their computer to run STREAM though. Even in the
    >>> HPC community where memory bandwidth is king, STREAM is still a rather
    >>> extreme case.
    >>
    >>I admit I'm from the HPC-sector and memory bandwidth is very important
    >>to many applications here.
    >
    >One thing that you need to keep in mind is that you represent a VERY
    >small minority here in terms of PC server sales. Just because it
    >matters to your application doesn't mean it has much relevance to
    >the bulk of the buying public, and it almost certainly isn't going to
    >have implications for what the marketing people write in the trade
    >rags.

    I think you're underestimating the size of the "workstation" market, which
    will include people finding they can migrate down to PC-grade CPUs to
    replace old "higher power" systems as well as people on the lower-end
    fringe who may have grown their problem complexity beyond a uni-PC, or who
    *could* get by with a fastish PC but like the comfort of the move up to
    dual for future growth. Add them to the current established base of CAD,
    engineering and modeling etc. applications and there is a decent sized
    market.

    There are a lot of mathematical/engineering problems out there which are
    just part of everyday business computing - many *used* to be considered HPC
    and are now quite routine on desktop sized boxes. In many cases,
    proprietary (purchased) software is used and the algorithmic methods are
    only understood fairly superficially by the user; what that user wants is
    response, whether it's measured in minutes, hours or a day or more. The
    software vendor thus feels responsible for supplying the best combination
    of software and recommended hardware selection.

    Rgds, George Macdonald

    "Just because they're paranoid doesn't mean you're not psychotic" - Who, me??
  40. Archived from groups: comp.arch,comp.sys.ibm.pc.hardware.chips

    On Tue, 07 Dec 2004 09:56:44 -0800, Greg Lindahl wrote:

    > In article <pan.2004.12.07.01.37.06.417847@att.bizzzz>,
    > keith <krw@att.bizzzz> wrote:
    >
    >>> Note that the STREAM bandwidth and lmbench latency change with every
    >>> CPU speed bump. So clearly part of the memory controller is at the CPU
    >>> core frequency, or a related frequency, and not at the HT frequency,
    >>> or the SDRAM external bus frequency.
    >>
    >>That does *not* mean that the memory controller runs at the core speed.
    >>It would be nuts to assume such. Would you assume the caches of the
    >>PII run at the I/O bus speed?
    >
    > "or a related frequency", i.e. based on the cpu frequency with a
    > constant divider.

    Ok, how many "unrelated frequencies" are there in a CPU? Let's get real
    here.

    >>> Please reduce the cross-post. Followups set to a group I read.
    >>
    >>Isn't this a rather egotistical statement?
    >
    > No, it follows Usenet tradition: post only to groups that you read.

    No, that is *not* Usenet tradition. The tradition is to limit
    cross-postings to on-topic newsgroups. Cross-posting is not expensive
    (unless you have a dran-bamaged newsreader).

    > But thanks for giving me the benefit of the doubt.

    Cutting off your audience, particularly those who *you* have responded to,
    is rude. Sorry if I've ruffled your feathers!

    --
    Keith
  41. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    David Schwartz wrote:
    > The scaling advantage comes largely from the architecture of a single
    > processor. The memory controller is on the chip. The main reason this
    > matters is that it means that local memory accesses don't have to contend
    > with any other inter-CPU or I/O traffic.

    That's only partly true. The Opterons still talk to each other even on local
    accesses (coherency tokens only, no real data transfer). This both takes
    time and adds to the traffic, since such a token needs to get everywhere.

    What's missing here is an "exclusive" bit in the page table, for non-coherent
    pages. The OS pretty well knows (or can know) which core is accessing a
    page, and for a page that's not shared, the coherency token is not
    necessary.
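    A sketch of the proposed idea, under loudly stated assumptions: x86
    page tables have no such bit, so the `exclusive` field below is
    hypothetical, and the model simply counts coherence probe messages,
    assuming a broadcast-style protocol in which a miss to a shared page
    probes every other CPU.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageEntry:
    exclusive: bool               # hypothetical "exclusive" page-table bit
    owner: Optional[int] = None   # CPU the OS has granted exclusive use to


def probes_for_miss(page: PageEntry, requester: int, num_cpus: int) -> int:
    """Coherence probes needed to service one cache miss in this toy model.

    Shared pages: probe every CPU other than the requester (broadcast).
    Pages marked exclusive to the requester: no other cache can hold the
    line, so the probe is skipped entirely.
    """
    if page.exclusive and page.owner == requester:
        return 0
    return num_cpus - 1


# Hypothetical 4-CPU system: CPU 0 misses on 6 private pages and 2 shared pages.
private = PageEntry(exclusive=True, owner=0)
shared = PageEntry(exclusive=False)
misses = [private] * 6 + [shared] * 2

with_bit = sum(probes_for_miss(p, requester=0, num_cpus=4) for p in misses)
without_bit = len(misses) * (4 - 1)  # every miss is probed when no such bit exists
print(f"probes without the bit: {without_bit}, with it: {with_bit}")
```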

    --
    Bernd Paysan
    "If you want it done right, you have to do it yourself"
    http://www.jwdt.com/~paysan/
  42. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    In comp.arch David Schwartz <davids@webmaster.com> wrote:
    > In typical Opteron setups (2-8 CPUs, using the Opteron's
    > built-in SMP hardware), the latency difference between local and remote
    > memory accesses is so small that the benefits of treating it as NUMA
    > are typically outweighed by the costs.

    SPECweb99_SSL is probably atypical then (Yes, one of my favorite
    benchmarks :) - the evolution of the tunes for Opteron systems on that
    benchmark shows the size of the Zeus tunable "cache_small_file"
    increasing to 90000 bytes. That brings many more of the URLs into the
    "malloc" cache of Zeus, where they are replicated per Zeus instance and,
    in this case, per CPU (things being bound to CPUs). "Normal"
    practice is to have cache_small_file be "NBPG"/numCPU to optimize the
    memory consumption.

    It all depends, of course :) Maybe that wasn't done for latency but to
    cut down the bandwidth consumed. Who knows - although I am interested
    in trying to find out :)

    > Generally, you just distribute the memory evenly and interleaved on
    > the nodes (if you can) to avoid overloading one memory controller
    > channel.

    FWIW, I've noticed that Node interleave is (or seems to be, it was set
    that way on the first one I saw and had no indication from the source
    that it had been altered) disabled by default on the Sun V20z's.
    Anyone have data on how Node interleave defaults on other
    Opteron-based systems?

    rick jones
    --
    a wide gulf separates "what if" from "if only"
    these opinions are mine, all mine; HP might not want them anyway... :)
    feel free to post, OR email to raj in cup.hp.com but NOT BOTH...
  43. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    Rick Jones <foo@bar.baz.invalid> writes:

    >
    >FWIW, I've noticed that Node interleave is (or seems to be, it was set
    >that way on the first one I saw and had no indication from the source
    >that it had been altered) disabled by default on the Sun V20z's.
    >Anyone have data on how Node interleave defaults on other
    >Opteron-based systems?

    It defaults to "off" on Penguin systems, too.

    scott
  44. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    Rick Jones <foo@bar.baz.invalid> writes:

    > FWIW, I've noticed that Node interleave is (or seems to be, it was set
    > that way on the first one I saw and had no indication from the source
    > that it had been altered) disabled by default on the Sun V20z's.
    > Anyone have data on how Node interleave defaults on other
    > Opteron-based systems?

    As far as I know it's disabled by default on most shipping Opteron
    servers. Only a few build-it-yourself dual motherboards have it
    enabled by default.

    For Linux use I would recommend to always disable it. The modern
    kernel can do page interleaving on demand (with numactl or libnuma),
    which is nearly as good, and most programs seem to just prefer
    good memory latency.
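    A sketch of the two placement policies being contrasted, assuming a
    4-node system and page-granular placement: "local" puts every page on
    the node of the CPU that first touches it (the usual Linux default
    policy), while interleaving spreads pages round-robin across nodes, as
    `numactl --interleave=all` does. The helper names are hypothetical.
    Interleaving balances load across the memory controllers, but makes
    roughly (N-1)/N of a single thread's pages remote.

```python
from collections import Counter


def local_placement(num_pages: int, cpu_node: int) -> list[int]:
    """Every page lands on the node of the CPU that touches it first."""
    return [cpu_node] * num_pages


def interleaved_placement(num_pages: int, num_nodes: int) -> list[int]:
    """Pages are spread round-robin: page i goes to node i % num_nodes."""
    return [i % num_nodes for i in range(num_pages)]


def remote_fraction(placement: list[int], cpu_node: int) -> float:
    """Fraction of pages that live on a node other than cpu_node."""
    return sum(node != cpu_node for node in placement) / len(placement)


NODES, PAGES, CPU_NODE = 4, 1000, 0
for name, placement in (("local", local_placement(PAGES, CPU_NODE)),
                        ("interleave", interleaved_placement(PAGES, NODES))):
    per_node = dict(sorted(Counter(placement).items()))
    print(f"{name:10s} pages per node {per_node}, "
          f"remote fraction {remote_fraction(placement, CPU_NODE):.2f}")
```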

    -Andi
  45. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips

    lindahl@pbm.com (Greg Lindahl) writes:

    > benchmark 1 3.71 3.03 + 22 %
    > benchmark 2 3.76 3.29 + 14 %
    > benchmark 3 3.78 3.26 + 16 %
    > benchmark 4 3.79 3.45 + 10 %
    > benchmark 5 3.92 3.89 + 1 %
    > benchmark 6 3.88 3.71 + 5 %
    >
    > These benchmarks were run with the best Opteron compiler, so this
    > scaling improvement was very good to see. And it's bigger than
    > "usually less than 10%".

    Averages out to 11%.

    Sounds like "usually less than 10%" may be right when talking about non-scientific workloads.
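    Recomputing the percentages from Greg's table. The two columns are used
    as-is (they are unlabeled in the quoted excerpt, so nothing is assumed
    about which configuration each represents). Both the mean of the
    rounded percentages, as in the reply, and the geometric mean of the
    exact ratios are shown, since how to average speedups is itself a
    choice.

```python
import math

# (first column, second column) from the quoted table.
results = [
    (3.71, 3.03),
    (3.76, 3.29),
    (3.78, 3.26),
    (3.79, 3.45),
    (3.92, 3.89),
    (3.88, 3.71),
]

ratios = [a / b for a, b in results]
percent = [round((r - 1) * 100) for r in ratios]
arith_mean = sum(percent) / len(percent)
geo_mean = (math.prod(ratios) ** (1 / len(ratios)) - 1) * 100

print("per-benchmark gains (%):", percent)
print(f"mean of rounded gains: {arith_mean:.1f}%")
print(f"geometric-mean gain:   {geo_mean:.1f}%")
```

    The recomputed gains match the table (22, 14, 16, 10, 1 and 5%), and
    both ways of averaging land a little above 11%.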
  46. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    jsavard@excxn.aNOSPAMb.cdn.invalid (John Savard) writes:

    > As the posting in question was a text posting, this means that the
    > newsreader would have to guess at what constituted an URL, as well, with
    > no doubt occasional hilarious results.

    Sorry, you don't make sense.
    You really should get a decent newsreader.
  47. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    On Wed, 15 Dec 2004 03:35:43 +0000, Israel T wrote:

    > jsavard@excxn.aNOSPAMb.cdn.invalid (John Savard) writes:
    >
    >> As the posting in question was a text posting, this means that the
    >> newsreader would have to guess at what constituted an URL, as well, with
    >> no doubt occasional hilarious results.
    >
    > Sorry, you don't make sense.
    > You really should get a decent newsreader.

    Hmmm, I always thought Agent was fairly good. Perhaps yours can't show
    headers? ...oh, another emacs bigot.

    --
    Keith
  48. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    keith <krw@att.bizzzz> writes:

    > Hmmm, I always thought Agent was fairly good. Perhaps yours can't show
    > headers?

    I used Agent for some years until its limitations became irritating.

    >...oh, another emacs bigot.

    It is a matter of using the right tool for the job.
    Emacs's mail/news subsystem, Gnus, is superb.
  49. Archived from groups: comp.arch,comp.sys.intel,comp.sys.ibm.pc.hardware.chips,alt.comp.hardware.amd.x86-64

    On Tue, 14 Dec 2004 23:41:04 -0500, keith <krw@att.bizzzz> wrote:

    >On Wed, 15 Dec 2004 03:35:43 +0000, Israel T wrote:
    >
    >> jsavard@excxn.aNOSPAMb.cdn.invalid (John Savard) writes:
    >>
    >>> As the posting in question was a text posting, this means that the
    >>> newsreader would have to guess at what constituted an URL, as well, with
    >>> no doubt occasional hilarious results.
    >>
    >> Sorry, you don't make sense.
    >> You really should get a decent newsreader.
    >
    >Hmmm, I always thought Agent was fairly good. Perhaps yours can't show
    >headers? ...oh, another emacs bigot.

    Well, jsavard is using an *old* version of Free Agent, but even the 1.93 I'm
    using doesn't have a right click and "Save Link Target As..." I dunno what
    the big deal is on either side here - copy/paste of a URL is always coming
    up as a nuisance for file downloads, especially with Adobe Reader 6.0
    being so damned slow to get started - the plugin has to load its err,
    plugins to get started and then you also have to have it configured to turn
    off "fast web view" to get the whole document without paging through the
    bugger... all a royal PITA.

    Rgds, George Macdonald

    "Just because they're paranoid doesn't mean you're not psychotic" - Who, me??