
Link on 32-bit vs. 64-bit software (Hammer)

December 12, 2002 1:28:45 AM

For all who saw no graph or statement saying that AMD is king and the god of CPUs: in conclusion, you have not read the post or followed the discussion.

The Linux programmer claims a 50% decrease in performance when moving to 64-bit on Hammer. The test was SPEC CPU 2000 (some of its tests).

Now what to do??
December 12, 2002 2:52:02 AM

Really? Let's see what he's actually saying.

Quote:

> The reason I'm making such a stink about this is that I don't want
> people believing that "the code generation improvements due to the
> extra x86_64 registers available nullifies the bloat cost from
> going to 64-bit"

Actually, it tends to nullify the bloat cost and then make it few
percent faster... For most of spec2000 modulo two or three cache-bound
tests that are 50% slower :-(.

He's actually saying that in most cases, 64-bit mode is actually faster on Hammer, despite the added bloat. It's just a few cases ("modulo two or three cache-bound tests") where 64-bit causes the 50% performance hit.

Also, check out another interesting quote earlier in the thread:

Quote:

> On Sun, Dec 01, 2002 at 08:46:40PM -0800, David S. Miller wrote:
> > X86_64 on the other hand seems to run x86 binaries in a similar
> > fashion. I don't know how people currently doing this port intend
> > to do the useland, but I bet it would benefit from a mostly 32-bit
> > userland just like sparc64/ppc64 does, both in space and performance.
>
> Except that x86-64 binaries get to use 16 more registers, can use
> pc-relative addressing modes, and have a sane function calling
> convention. So things tend to run a bit faster in 64-bit mode.

Cache locality tends to make a much bigger difference. I've been reported
a performance difference of over 100% for certain applications running 32-bit
on MIPS. 64-bit stuff just tends to push apps out of caches so I don't
see us on MIPS switch to a pure 64-bit any time soon either. Assuming the
availability of 64-bit compiler etc that is - at the moment we're entirely
dependant on the 32-bit environment.

This quote suggests that pure 64-bit mode on any architecture is actually *bad*--primarily because of the cache-bound cases mentioned in the first quote. But since IA64 doesn't have anything *but* pure 64-bit mode (at least not one that performs decently), it necessarily fares much worse in those special cases. Which suggests the x86-64 design approach is actually a better decision than IA64...

It seems the thread you linked actually defeats your argument, juin. Maybe you should have left that thread alone...

I can love my fellow man...but I'm damned if I'll love yours.
December 12, 2002 5:08:14 AM

Quote:
He's actually saying that in most cases, 64-bit mode is actually faster on Hammer, despite the added bloat. It's just a few cases ("modulo two or three cache-bound tests") where 64-bit causes the 50% performance hit.

Good, I was hoping to bring up that discussion. The code footprint gets bigger when moving to 64-bit, which isn't really an issue on the server side, but ClawHammer will have a hard time carrying both a 32-bit library and a 64-bit library. That grows the OS overhead.

Quote:
This quote suggests that pure 64-bit mode on any architecture is actually *bad*--primarily because of the cache-bound cases mentioned in the first quote. But since IA64 doesn't have anything *but* pure 64-bit mode (at least not one that performs decently), it necessarily fares much worse in those special cases. Which suggests the x86-64 design approach is actually a better decision than IA64...

It seems the thread you linked actually defeats your argument, juin. Maybe you should have left that thread alone...

No, Itanium can work with 32-bit precision in the FPU. I suggest you read up on 64-bit CPUs before spreading false information. I remind you that Itanium is the fastest CPU in the world.

Now what to do??
December 12, 2002 6:11:03 AM

Quote:
No, Itanium can work with 32-bit precision in the FPU. I suggest you read up on 64-bit CPUs before spreading false information. I remind you that Itanium is the fastest CPU in the world.

Every worthwhile FPU can work with 32-bit (single-precision) FP numbers; that's not the real problem. What about 32-bit address pointers?

When you move to a 64-bit architecture, forced use of 64-bit address pointers is what really kills the cache space--not 64-bit integers, because in many 64-bit architectures, the compiler still sets the default integer size to 32 bits.

The only way to really circumvent this cache-bloat problem is to have an execution mode where address pointers default to 32 bits. As far as I can tell, IA64 doesn't support this without forcing abysmal performance (via x86 compatibility mode). So in the cache-bound SPEC2000 cases, where the Hammer can fall back to a worthwhile 32-bit mode, Itanium is just forced to take the performance hit.
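To put a number on that pointer bloat, here's a minimal C sketch (my illustration, not from the thread) of how one linked-list node grows when pointers go from 32 to 64 bits:

#include <stdio.h>

/* A typical linked-list node: one pointer, one 32-bit payload. */
struct node {
    struct node *next;
    int          value;
};

int main(void) {
    /* ILP32 target: 4-byte pointer + 4-byte int             =  8 bytes. */
    /* LP64 target:  8-byte pointer + 4-byte int + 4 padding = 16 bytes. */
    /* A 64-byte cache line holds 8 such nodes in 32-bit mode, only 4 in 64-bit. */
    printf("sizeof(struct node) = %zu\n", sizeof(struct node));
    return 0;
}

The integers didn't grow at all; the pointer (and the padding it drags in) does the damage.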

Quote:
I remind you that Itanium is the fastest CPU in the world.

Itanium II, perhaps, but not Itanium 1. And Hammer samples apparently already surpass Itanium II in integer ops, while coming rather close in FPU math--all while consuming much less power and supporting better multiprocessor scaling.

I can love my fellow man...but I'm damned if I'll love yours.
December 12, 2002 6:22:25 AM

Quote:
The only way to really circumvent this cache-bloat problem is to have an execution mode where address pointers default to 32 bits. As far as I can tell, IA64 doesn't support this without forcing abysmal performance (via x86 compatibility mode). So in the cache-bound SPEC2000 cases, where the Hammer can fall back to a worthwhile 32-bit mode, Itanium is just forced to take the performance hit.


In legacy 32-bit mode, you don't have the extra advantages of the increased register set. I think it may actually be better to run in 64-bit mode with the added registers and take the hit from using 64-bit address pointers.

Quote:
Itanium II, perhaps, but not Itanium 1. And Hammer samples apparently already surpass Itanium II in integer ops, while coming rather close in FPU math--all while consuming much less power and supporting better multiprocessor scaling.


A 2.8 GHz P4 surpasses the Itanium 2 in SPECint. It's one of the apparent strengths of modern x86 MPUs. Currently, the 3.06 GHz P4 is on top with over 1000 SPECint, ahead of the Itanium 2 at 1 GHz, ahead of the IBM POWER4 at 1.2 GHz, ahead of the Ultrasparc64 V at 1.25 GHz. Itanium 2's SPECfp is 1431 with the new compilers, which is significantly higher than the close-to-1000 SPECfp that AMD released for a 2.0 GHz Opteron--and don't forget, that chip is still months away.

I do, however, think that the comparison is pretty pointless. The two chips are not direct competitors yet and might not be for a while.

As for the penalty of 64-bit pointers when moving to a 64-bit architecture, the point is pretty much moot with IA-64: the new ISA saves a ton of die space on the actual logical core (20 million transistors for the Itanium 2 core), leaving a lot of room for cache. The advantages of the new ISA over x86 more than make up for the increase in pointer size, especially considering you have a lot more headroom for cache.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
December 12, 2002 7:56:42 AM

Quote:
In legacy 32-bit mode, you don't have the extra advantages of the increased register set.

Actually, you do. That's one of the things kernel developers are discussing in the linked thread--a 32-bit mode that can still access the extra GPRs. 32-bit programs would essentially be able to toss around 64-bit integers faster and use the extra registers for whatever they like, they just couldn't address a 64-bit memory space. This same kind of "transitional" optimization was available for 32-bit integers with the 386.
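As an aside, this idea eventually shipped in mainstream toolchains: GCC's x32 ABI (the -mx32 flag) compiles exactly this way--64-bit registers with 32-bit pointers. A hedged C sketch of what such code looks like (the flag is a modern reference point, not something from this thread's timeframe):

#include <stdint.h>
#include <stdio.h>

/* 64-bit multiply-accumulate: under a 32-bit-pointer/64-bit-register ABI
 * this compiles to plain 64-bit register math using the extra GPRs,
 * while every pointer in the program still occupies only 4 bytes. */
uint64_t mac64(uint64_t acc, uint64_t a, uint64_t b) {
    return acc + a * b;
}

int main(void) {
    printf("pointer size: %zu bytes\n", sizeof(void *)); /* 4 under x32, 8 under LP64 */
    printf("mac64: %llu\n", (unsigned long long)mac64(1, 2, 3));
    return 0;
}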

Quote:
As for the penalty of 64-bit pointers when moving to a 64-bit architecture, the point is pretty much moot with IA-64: the new ISA saves a ton of die space on the actual logical core (20 million transistors for the Itanium 2 core), leaving a lot of room for cache.

Since the transistor count on the Pentium 4 jumped from 42mil to 55mil from Willamette to Northwood, we can infer that each 256K of L2 cache on Northwood takes about 13mil transistors, and that the Northwood's core logic is probably about 29mil transistors. I wouldn't regard 9mil transistors as a "ton" of savings, especially as McKinley is still on an .18µ process. By my guess, the die space savings is good for no more than another 256K of cache.

I can love my fellow man...but I'm damned if I'll love yours.
December 12, 2002 8:22:47 AM

Quote:
Actually, you do. That's one of the things kernel developers are discussing in the linked thread--a 32-bit mode that can still access the extra GPRs. 32-bit programs would essentially be able to toss around 64-bit integers faster and use the extra registers for whatever they like, they just couldn't address a 64-bit memory space. This same kind of "transitional" optimization was available for 32-bit integers with the 386.


As I recall, the Hammer requires one mode or the other as far as legacy and 64-bit go. There's an x86 prefix you have to include to shift the processor to a different mode. I'm not entirely sure how this could be circumvented. Please elaborate.

Quote:
Since the transistor count on the Pentium 4 jumped from 42mil to 55mil from Willamette to Northwood, we can infer that each 256K of L2 cache on Northwood takes about 13mil transistors, and that the Northwood's core logic is probably about 29mil transistors. I wouldn't regard 9mil transistors as a "ton" of savings, especially as McKinley is still on an .18µ process. By my guess, the die space savings is good for no more than another 256K of cache.


The necessary execution logic for a desktop processor is nowhere near that of the Itanium 2. Currently, the Itanium 2 has more execution logic than a P4 with fewer transistors. Consider how much savings an IA-64 chip with execution logic comparable to a P4's would have in transistor count, and put those transistors towards cache.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
December 12, 2002 10:10:17 AM

Quote:
As I recall, the Hammer requires one mode or the other as far as legacy and 64-bit go. There's an x86 prefix you have to include to shift the processor to a different mode. I'm not entirely sure how this could be circumvented. Please elaborate.

There's "64-bit" mode, "compatibility" mode, and "legacy" mode. "Legacy" is what you get when nothing bothers to activate the 64-bit logic of the Hammer--not even the operating system. "Compatibility" is basically code running in a 32-bit cage set up by a 64-bit O/S--but that doesn't mean the code in that 32-bit cage can't use 64-bit registers. It's a simple matter of applying an instruction prefix to change the default operand width (see the x86-64 white paper, http://www.x86-64.org/documentation_folder/white_paper...., page 8, table, footnote 1). The code knows about the register extensions and how to access them, but it still lets itself get dropped into a 32-bit address space on the pretext that 64-bit addressing is just overkill. The code has to be recompiled to get this benefit, but this is not such a big deal in Linux.
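For the curious, the prefix in question is a single byte. A sketch using GCC inline assembly on x86-64 (my illustration; the exact encodings shown assume the compiler happens to pick eax/ebx, so treat the byte sequences as representative):

#include <stdint.h>

/* The two adds below differ only by the one-byte REX.W prefix (0x48):
 *   01 d8      add %ebx,%eax   -- 32-bit operands, the default width
 *   48 01 d8   add %rbx,%rax   -- 64-bit operands, REX.W prepended  */
uint32_t add32(uint32_t a, uint32_t b) {
    __asm__("addl %1, %0" : "+r"(a) : "r"(b));
    return a;
}

uint64_t add64(uint64_t a, uint64_t b) {
    __asm__("addq %1, %0" : "+r"(a) : "r"(b)); /* assembler emits the REX.W byte */
    return a;
}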

As for legacy mode, I'm not sure whether 32-bit code running in legacy mode can access 64-bit registers like that. Unreal64 is supposed to be released as soon as Hammer's available, apparently before any x86-64-ready WinXP is available--so maybe they figured out a way to hack it?

I can love my fellow man...but I'm damned if I'll love yours.
December 12, 2002 4:38:32 PM

Quote:
This quote suggests that pure 64-bit mode on any architecture is actually bad--primarily because of the cache-bound cases mentioned in the first quote. But since IA64 doesn't have anything but pure 64-bit mode (at least not one that performs decently), it necessarily fares much worse in those special cases.

This is one of the reasons why Intel is pursuing the use of a trace cache for the L1 in its processors. The trace cache stores the instructions after they have already been decoded. It does not store physical addresses. The trace cache can be implemented in such a manner that 64-bit addressing does not add that much overhead. This eliminates the problem for 64-bit Intel processors.

-Raystonn


= The views stated herein are my personal views, and not necessarily the views of my employer. =
December 12, 2002 8:07:23 PM

Quote:
This is one of the reasons why Intel is pursuing the use of a trace cache for the L1 in its processors.

Unfortunately, from all I've read, McKinley lacks this feature--it just has the old-style, plain-jane 32K L1 cache. Is trace cache scheduled for Madison?

I can love my fellow man...but I'm damned if I'll love yours.
December 12, 2002 9:18:06 PM

No; Madison, as I recall, will be a die shrink plus extra cache. However, simply adding more cache would alleviate this overhead problem somewhat.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
December 12, 2002 10:18:17 PM

Quote:
Itanium II, perhaps, but not Itanium 1. And Hammer samples apparently already surpass Itanium II in integer ops, while coming rather close in FPU math--all while consuming much less power and supporting better multiprocessor scaling.

It's useless to compare a desktop CPU with a bigger cache to a pure 64-bit CPU.

Now what to do??
December 12, 2002 10:22:55 PM

Latency, size, and bandwidth of the caches are at least 100% better on Itanium 2 compared to Itanium 1, and even better compared to the AXP.

Now what to do??
December 12, 2002 10:27:43 PM

VLIW saves you a lot of transistors as the core increases in size. You're comparing 5 units to 6 ALUs, 4 MMX units, 4 FPUs, and 3 branch units.

Now what to do??
December 12, 2002 10:31:20 PM

All IA-64 chips have a 16 KB L1 instruction cache.

Now what to do??
December 13, 2002 12:11:50 AM

Quote:
It's useless to compare a desktop CPU with a bigger cache to a pure 64-bit CPU.

Opteron's not a desktop CPU, and we all know this.

Besides which, the comparison isn't really all that useless when the "desktop CPU" actually compares rather favorably to the pure 64-bit CPU and actually has most of the same 64-bit advantages. Highlighting silly semantics issues to try to invalidate some result you don't like seeing--now *that* could be considered pretty useless...

Quote:
Latency, size, and bandwidth of the caches are at least 100% better on Itanium 2 compared to Itanium 1, and even better compared to the AXP.

The Athlon actually has the same L1/L2 cache sizes as Itanium 2--Itanium 2 just gets a large (and high-latency) L3 cache tacked on the side. Opteron will actually have *more* L1/L2 cache than Itanium 2, plus a low-latency on-die memory controller, and it supports an external L3 cache as well. That can effectively nullify and surpass any cache-size advantage Itanium 2 has.

We don't know exactly what Hammer's cache latency/bandwidth will be. It will probably be an improvement over the Athlon.

I can love my fellow man...but I'm damned if I'll love yours.
December 13, 2002 12:37:02 AM

If you want to compare Opteron to any CPU, we must take what will actually be out in 2003 H1. We'll have Madison, Prescott, and Nocona. Madison will have no trouble being 2x faster than Opteron (3x in real time) on 64-bit code, and the P4 3.06 already has equal or superior SPEC scores on 32-bit code. I don't see them in the same market.

Quote:
The Athlon actually has the same L1/L2 cache sizes as Itanium 2--Itanium 2 just gets a large (and high-latency) L3 cache tacked on the side. Opteron will actually have *more* L1/L2 cache than Itanium 2, plus a low-latency on-die memory controller, and it supports an external L3 cache as well. That can effectively nullify and surpass any cache-size advantage Itanium 2 has.

As the Opteron is aimed at 2-way to N-way setups, its latency grows with each CPU added, due to the extra HyperTransport latency and bandwidth limitations. A big cache helps avoid bus use, but Opteron's cache is six times smaller. An external L3 cache has more than 25 cycles of latency, versus 11 for the on-die L3 of the Itanium 2. As for L2: since Opteron doesn't go for an L3, it must use a larger L2 (equal in size to Intel's desktop CPUs, 1 MB), and latency rises as size grows. All Athlons have much more L2 latency than any Intel CPU, and the size will put it close to 8 or 9 cycles; even if AMD works a miracle on the L2, the latency will be 80% higher than Itanium 2's, and 1 or 2 cycles over Prescott's. On L1, the P4 proves that latency is more important than size; the Athlon's big L1 dates from the days when there was no on-die L2 (the P2 and the original K7).
Bandwidth-wise it's a disaster: Intel beats it on every level by 100%. I suggest you attack the power consumption of the IA-64 line instead, since Intel has put its eggs in strained silicon, which isn't really useful for Itanium but is for the P4's high clock speeds. By 2004 Itanium will have a process equal in microns to the P4 or P5 (if Tejas or Prescott takes another name).

Now what to do??
December 13, 2002 12:40:06 AM

You'll have to explain to me where to put a trace cache system in the IA-64 line, unless Intel is planning to put an OoO decoder on Itanium, which would kill all of Itanium's advantages.

Now what to do??
December 13, 2002 1:37:14 AM

Just so things don't get too out of whack, realize that an 11-tick latency on a 1 GHz chip is equal to a 22-tick latency on a 2 GHz chip. By that same logic, a DDR400 CAS2 DRAM should outperform the Itanium's L3 cache.
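The arithmetic behind that, as a quick C sketch (illustration only):

#include <stdio.h>

int main(void) {
    /* wall-clock latency in ns = cycles / frequency in GHz */
    double ns_at_1ghz = 11 / 1.0;  /* 11 cycles at 1 GHz = 11 ns */
    double ns_at_2ghz = 22 / 2.0;  /* 22 cycles at 2 GHz = 11 ns, identical realtime */
    printf("%.1f ns vs %.1f ns\n", ns_at_1ghz, ns_at_2ghz);
    return 0;
}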

Dichromatic for your viewing pleasure...
December 13, 2002 2:22:29 AM

Hammer has no L3. L3 caches became unnecessary when memory speeds moved below 10 ns (10 ns = 100 MHz). Now that memory speeds are crossing the 5 ns (= 200 MHz) barrier, they're even more unnecessary. I would love to know why IA64 really needs an L3.

Dichromatic for your viewing pleasure...
December 13, 2002 2:29:20 AM

Quote:
BTW I don't recall ever hearing that Hammer will have L3 cache support. Anyone has any official links on this?

Link: http://www.amd.com/us-en/assets/content_type/white_pape...

It mentions an external L3 cache option for Hammer.

I can love my fellow man...but I'm damned if I'll love yours.
December 13, 2002 3:34:44 AM

Quote:
If you want to compare Opteron to any CPU, we must take what will actually be out in 2003 H1. We'll have Madison, Prescott, and Nocona.

Prescott and Nocona are pushed to the end of 2003. You may get Madison in 2Q03 if you're lucky--which means Opteron will probably beat it to market.

Quote:
As the Opteron is aimed at 2-way to N-way setups, its latency grows with each CPU added, due to the extra HyperTransport latency and bandwidth limitations.

This is a problem for all multiprocessor platforms. AMD's solution (NUMA-like memory arrangement) actually cuts down on the required interaction between processors and gives each CPU its own little dedicated high-bandwidth memory bus.

IA64's solution (shared bus) is the same SMP architecture used by every SMP-capable Intel chip since the PPro, and it's one of the big reasons Intel isn't a big-tin player. It forces most processors to starve when one processor hogs the bus--this leads to poor multiprocessor scaling (see below).

Quote:
latency rises as size grows. All Athlons have much more L2 latency than any Intel CPU, and the size will put it close to 8 or 9 cycles; even if AMD works a miracle on the L2, the latency will be 80% higher than Itanium 2's, and 1 or 2 cycles over Prescott's.

As I mentioned before, we don't know what Hammer's cache-latency properties will be. I've seen nothing to suggest that L2 cache latency will increase with size, and I'm pretty sure your personal crystal ball doesn't qualify. :wink:

Quote:
On L1, the P4 proves that latency is more important than size

It proves latency is more important than size *on the P4's cache architecture.* It proves nothing about more common cache architectures like the Athlon's.

Quote:
Bandwidth-wise it's a disaster: Intel beats it on every level by 100%.

So far, I've seen no benchmarks proving that. Theoretically your figures are way off.

Opteron has 5.2GB/sec of memory bandwidth; McKinley has 6.4GB/sec. Hardly a 100% increase for McKinley.

McKinley then loses this advantage when it tries to scale SMP-wise, because of the shared multiprocessor bus. When one CPU makes heavy use of the memory bandwidth, other processors are forced to starve. Opteron, on the other hand, gets 5.2GB/sec of memory bandwidth *per CPU,* so by the time it reaches 2-way, it's already theoretically outpaced the Itanium. It pulls even further ahead in 4-way configurations.

I can love my fellow man...but I'm damned if I'll love yours.
December 13, 2002 3:56:00 AM

Quote:
Just so things don't get too out of whack, realize that an 11-tick latency on a 1 GHz chip is equal to a 22-tick latency on a 2 GHz chip. By that same logic, a DDR400 CAS2 DRAM should outperform the Itanium's L3 cache.

I should probably mention that while the 2 GHz chip's L3 cache latency would be equivalent in realtime, it's really how many clock cycles are wasted that matters for access latency. A 22-cycle access latency can be very hampering, especially when the processor absolutely needs the data/instruction.

Quote:
McKinley then loses this advantage when it tries to scale SMP-wise, because of the shared multiprocessor bus. When one CPU makes heavy use of the memory bandwidth, other processors are forced to starve. Opteron, on the other hand, gets 5.2GB/sec of memory bandwidth per CPU, so by the time it reaches 2-way, it's already theoretically outpaced the Itanium. It pulls even further ahead in 4-way configurations.

This isn't really true at all. Processors don't just "suck up" a bus like that. Most of the time the memory bus remains idle as requests and accesses are queued up and sent. The "sweet" spot with your traditional GTL+ bus like that of the Itanium 2 is usually 4 processors. It doesn't mean that each will only get 1/4 the bus, it means that the chipset will have to schedule accesses better. So in reality, no, there won't be a "starving", not even if the memory bus is being accessed constantly (which, considering the huge amount of L3 cache, probably isn't true).

Quote:
Hammer has no L3. L3 caches became unnecessary when memory speeds moved below 10 ns (10 ns = 100 MHz). Now that memory speeds are crossing the 5 ns (= 200 MHz) barrier, they're even more unnecessary. I would love to know why IA64 really needs an L3.


Comparing DRAM access with SRAM access is pretty futile. The memory bus and FSB run at 200 MHz. Let's assume it takes one clock for the processor's access to reach memory (assuming a very well-tuned integrated memory controller); that's 1 clock at 200 MHz, which is 5 ns--on a 1 GHz chip, that's equal to 5 clocks. Then you have the memory access itself. Assume the best case, where you need no precharge or strobe access latency, so you're down to CAS latency. Let's say that takes 2 clocks (which, at 200 MHz for DRAM, is NOT common). That's still another 10 ns, which translates into 10 clocks of processor idle time. Then take the time for the memory to transfer the first critical word back to the processor (assuming the memory has critical-word support): that's another 5 ns. So, under the best circumstances (which happen less than 20% of the time), you have a 20-cycle latency on a 1 GHz processor. And that latency worsens as the processor scales: at 2 GHz, the processor would experience a 40-cycle latency for every memory access. Compare that with even a 22-cycle access latency (under best AND worst conditions) for onboard SRAM. L3 cache useless? I think not.
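Here's that best-case walk-through condensed into a C sketch (all figures are the assumptions from the paragraph above, not measurements):

#include <stdio.h>

int main(void) {
    const double mhz = 200.0;                /* memory bus / FSB clock */
    double reach_ns    = 1 * 1000.0 / mhz;   /* 1 bus clock to reach memory =  5 ns */
    double cas_ns      = 2 * 1000.0 / mhz;   /* CAS 2                       = 10 ns */
    double transfer_ns = 1 * 1000.0 / mhz;   /* first critical word back    =  5 ns */
    double total_ns    = reach_ns + cas_ns + transfer_ns;           /* 20 ns total */

    printf("%.0f ns = %.0f cycles at 1 GHz, %.0f cycles at 2 GHz\n",
           total_ns, total_ns * 1.0, total_ns * 2.0);  /* 20 and 40 cycles */
    return 0;
}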

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
December 13, 2002 5:04:15 AM

Quote:
This isn't really true at all. Processors don't just "suck up" a bus like that. Most of the time the memory bus remains idle as requests and accesses are queued up and sent. The "sweet" spot with your traditional GTL+ bus like that of the Itanium 2 is usually 4 processors. It doesn't mean that each will only get 1/4 the bus, it means that the chipset will have to schedule accesses better. So in reality, no, there won't be a "starving", not even if the memory bus is being accessed constantly (which, considering the huge amount of L3 cache, probably isn't true).

4-way x86 may seem like the "sweet spot" if you're used to x86's SMP scaling characteristics. Many other RISC-based platforms do much better, though--AlphaServers and SGI Origins often get close to doubling their SPEC results when going from 1 to 2 CPUs, or even 4 to 8 (sometimes even farther). x86 can't touch that by a long shot--40-50% is about the average, just when going from single to dual x86 CPUs. As x86 tries to scale to quad CPUs, the scaling gets even worse, and the shared GTL+ bus is primarily responsible for that.

McKinley approaches about a 95% scaling when going from 1 to 2 CPUs--no thanks to the SMP bus architecture. The (relatively) enormous amount of memory bandwidth and large caches just about make up for the shared bus limitations. Try to push it to 4-way though, and you start to see the scaling drop off--the factor falls to around 60%.

I can love my fellow man...but I'm damned if I'll love yours.
December 13, 2002 6:15:39 AM

Quote:
4-way x86 may seem like the "sweet spot" if you're used to x86's SMP scaling characteristics. Many other RISC-based platforms do much better, though--AlphaServers and SGI Origins often get close to doubling their SPEC results when going from 1 to 2 CPUs, or even 4 to 8 (sometimes even farther). x86 can't touch that by a long shot--40-50% is about the average, just when going from single to dual x86 CPUs. As x86 tries to scale to quad CPUs, the scaling gets even worse, and the shared GTL+ bus is primarily responsible for that.


I see no evidence to support that the GTL+ bus is responsible. In Anand's database benchmark (http://www.anandtech.com/IT/showdoc.html?i=1747), scaling was mainly limited by the load on the processor. As he upped the load, performance scaled much better when going from 2-way to 4-way Xeon MP. These are database transactions, may I add, which entail a lot of memory accesses--and no, it was not I/O bound.

Quote:
McKinley approaches about a 95% scaling when going from 1 to 2 CPUs--no thanks to the SMP bus architecture. The (relatively) enormous amount of memory bandwidth and large caches just about make up for the shared bus limitations. Try to push it to 4-way though, and you start to see the scaling drop off--the factor falls to around 60%.


Again, I see no evidence that the bus architecture is in any way related to the scaling considering there is no point of reference. No one has tested an Itanium 2 with a different bus system. However, judging from the scaling of the 2-way and 4-way XeonMP's, I would definitely say that the bus is not the limitation.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
December 13, 2002 6:32:41 AM

No matter how big an L3 cache is, the processor still pays the exact same penalty each time it has to fill a cache line from main memory. The difference is that with a larger cache, less time is spent refilling cache lines once they're dumped from the cache. I'd much rather have a large L2 than a large L3. I do get your point though.

Dichromatic for your viewing pleasure...
December 13, 2002 8:34:04 AM

Quote:
I see no evidence to support that the GTL+ bus is responsible.

No offense but...it's been common knowledge for ages that the GTL+ doesn't scale well. The SPEC scores are evidence that something kills SMP scalability on x86; process of elimination works it down to CPUs having to share access to the memory.

It can't be the PCI interface, because everyone and their second cousin has standardized to PCI. The Alpha used PCI almost from the day the PCI standard was finalized. Even the ever-arrogant Sun long ago broke down and admitted that PCI was all-around superior to their proprietary SBUS interface.

It can't be peripheral I/O, because Xeons are gifted with the ServerWorks chipsets, specifically designed to address this.

It can't be disk or network access, because... well... that's peripheral I/O.

It could be the extra CPU caches on RISC platforms, if Xeons didn't come with the same buttloads of cache at even higher speeds.

So that sort of leaves...the SMP bus.

Quote:
In Anand's database benchmark, scaling was mainly limited by the load on the processor. As he upped the load, performance scaled much better when going from 2-way to 4-way Xeon MP. These are database transactions, may I add, which entail a lot of memory accesses--and no, it was not I/O bound.

Well...this is liable to take a bit of explaining.

Anand's test setup doesn't really test the platform's SMP scalability. A "proper" SMP scalability test would be to have similar 1-way, 2-way, and 4-way setups, with each setup tested at a load roughly equal to the number of processors. Taking the load lower than the number of processors leaves processors sitting idle (obviously), and taking the load much higher hammers the system with context switches--the CPUs spend too much time in the process scheduler instead of doing real work.

Unfortunately, Anand doesn't have anything like that. So we'll have to assume there's a skew. If we just go by the 4x-load tests, there's going to be a skew in the 2-way tests--if the 2-way box was run at a "proper" 2x load, it would score higher marks thanks to more efficient throughput. How much skew this causes is anyone's guess, but I'd estimate that it's nothing to sneeze at.

(Maybe Anand could set up a proper test scenario just for us--or maybe he could just loan me his test boxen...ok, wishful thinking. :wink: )

When we look at the 4x-load WebDB tests, we see a boost of about 95% going from 2-way to 4-way. Pretty good, until we remember the skew.

When we look at the 4x-load AdDB tests, the boost drops to about 55%, and that isn't even considering the skew. The access patterns are supposed to be roughly the same; the only difference is that the system has to handle a much larger database (and thus throw around a much larger amount of memory). Either the test is I/O bound (longer disk access times thanks to larger files?), or the processors are having to fight for memory bandwidth.

I can love my fellow man...but I'm damned if I'll love yours.
December 16, 2002 12:55:10 AM

DDR400 has 200 clocks more latency if it isn't able to prefetch.

Now what to do??
December 16, 2002 1:17:31 AM

Quote:
Prescott and Nocona are pushed to the end of 2003. You may get Madison in 2Q03 if you're lucky--which means Opteron will probably beat it to market.

JP Morgan just said that Hammer won't be able to hit the market in H1--unlike Itanium 3 (or 2.5), which has been shown working at full speed and stable. It will stay in the validation process until the stability target is hit.


Quote:
This is a problem for all multiprocessor platforms. AMD's solution (NUMA-like memory arrangement) actually cuts down on the required interaction between processors and gives each CPU its own little dedicated high-bandwidth memory bus.

Dedicated memory becomes a real burden when many CPUs can access the same RAM. So it's 5.4 GB/s if PC2700 is used, but I guess most supercomputers will go for PC2100. PC2700 has never been validated for any workstation--not from Intel, not from IBM, not from Alpha.


Quote:
As I mentioned before, we don't know what Hammer's cache-latency properties will be. I've seen nothing to suggest that L2 cache latency will increase with size, and I'm pretty sure your personal crystal ball doesn't qualify.

As SRAM grows, its latency scales up, since part of it always ends up placed further from the core, which also adds a lot of bypass wires.


Quote:
So far, I've seen no benchmarks proving that. Theoretically your figures are way off.

Opteron has 5.2GB/sec of memory bandwidth; McKinley has 6.4GB/sec. Hardly a 100% increase for McKinley.

McKinley then loses this advantage when it tries to scale SMP-wise, because of the shared multiprocessor bus. When one CPU makes heavy use of the memory bandwidth, other processors are forced to starve. Opteron, on the other hand, gets 5.2GB/sec of memory bandwidth per CPU, so by the time it reaches 2-way, it's already theoretically outpaced the Itanium. It pulls even further ahead in 4-way configurations.

Cache-wise, yes, or more: a 4-times-larger bit path on most levels.


Quote:
Opteron has 5.2GB/sec of memory bandwidth; McKinley has 6.4GB/sec. Hardly a 100% increase for McKinley.

The market will use PC2100 for all CPUs even if Itanium cannot use all the bandwidth. Intel could have raised the FSB to 133, which has already been done for the P4. It's important to keep systems upgradeable. Intel has promised power consumption that will not change, and it will be socket and board compatible.



Now what to do??
December 16, 2002 1:23:57 AM

Quote:
It could be the extra CPU caches on RISC platforms, if Xeons didn't come with the same buttloads of cache at even higher speeds.

So that sort of leaves...the SMP bus.

Alpha comes with a huge off-die L2; POWER4 has 1.5 MB of L2 and 128 MB of off-die L3.

Now what to do??
December 16, 2002 2:34:21 AM

Quote:
DDR400 has 200 clocks more latency if it isn't able to prefetch.

Do you mean that DDR400's latency is based on a 200 MHz clock? I have no idea what you mean by 200 latency. I hope you don't mean 200 ticks, because you'd be way off.

Look back at imgod2u's elaborated timings.

Dichromatic for your viewing pleasure...
December 16, 2002 3:35:17 PM

Quote:
JP Morgan just said that Hammer won't be able to hit the market in H1--unlike Itanium 3 (or 2.5), which has been shown working at full speed and stable. It will stay in the validation process until the stability target is hit.

And you'd prefer to trust the technical advice of a stockbroker? I suppose you'd trust JP Morgan to build your computer for you?

A lot of stockbrokers also thought Rambus would be the next big stock, thanks to their bogus patent claims. Just about every clear-minded, non-fanboy tech geek knew better. At best, JP Morgan's statement is a somewhat educated guess.

Quote:
Dedicated memory becomes a real burden when many CPUs can access the same RAM. So it's 5.4 GB/s if PC2700 is used, but I guess most supercomputers will go for PC2100. PC2700 has never been validated for any workstation--not from Intel, not from IBM, not from Alpha.

The O/S is primarily responsible for keeping specific memory regions bound to specific processors as much as possible. This is really a no-brainer to implement, especially as kernels like the Linux kernel already have NUMA support. The only thing that really becomes a problem is threads--but again, since NUMA is nothing new, many developers are already optimizing threaded apps for NUMA.
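For a flavor of what that O/S-level binding looks like from user space, here's a minimal sketch using Linux's libnuma (the library matured after this thread's timeframe; the calls are from the numa(3) API, and you'd link with -lnuma):

#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "kernel has no NUMA support\n");
        return 1;
    }
    int node = 0;
    numa_run_on_node(node);                        /* keep this thread on node 0's CPUs */
    char *buf = numa_alloc_onnode(1 << 20, node);  /* 1 MB out of node 0's local RAM */
    if (buf) {
        buf[0] = 42;                               /* touch it: the page lands locally */
        numa_free(buf, 1 << 20);
    }
    return 0;
}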

Quote:
As SRAM grows, its latency scales up, since part of it always ends up placed further from the core, which also adds a lot of bypass wires.

And when the cache is on-die, this isn't such a problem. Trace paths can be optimized far better.

I can love my fellow man...but I'm damned if I'll love yours.
December 17, 2002 2:04:55 AM

Quote:
And when the cache is on-die, this isn't such a problem. Trace paths can be optimized far better.

You are forced to increase the latency, or it becomes the critical path; so you are forced to add an extra mini-pipeline stage, reducing your overall clock speed. That's why nobody builds a 3 MB L1. To put it simply: size equals higher latency.


Quote:
And you'd prefer to trust the technical advice of a stockbroker? I suppose you'd trust JP Morgan to build your computer for you?

A lot of stockbrokers also thought Rambus would be the next big stock, thanks to their bogus patent claims. Just about every clear-minded, non-fanboy tech geek knew better. At best, JP Morgan's statement is a somewhat educated guess.

Yes. Compared to most of the websites that fear attacking AMD because 90% of their viewers are lemmings, it's more reliable than your faithful.


Quote:
The O/S is primarily responsible for keeping specific memory regions bound to specific processors as much as possible. This is really a no-brainer to implement, especially as kernels like the Linux kernel already have NUMA support. The only thing that really becomes a problem is threads--but again, since NUMA is nothing new, many developers are already optimizing threaded apps for NUMA.

"As much as possible," as you write. NUMA under x86 is new; blade servers have just started selling, and Xeons normally work in 2-way or 4-way setups. Nothing changes the fact that Hammer will be stuck with a real 4.2 GB/s, less 1 GB/s for I/O, plus bigger I/O latency.


Now what to do??
December 17, 2002 3:01:38 AM

Quote:
256K of L2 cache on Northwood takes about 13mil transistors

I'd just like to add that the minimum number of transistors for 256 KB is 12.6 million, without logic circuitry. So about 400 thousand on logic circuitry is, I think, pretty average for an 8-way set-associative cache. (I'm assuming the Northwood here, as the Willamette has an 8-way set-associative L2 cache with a line size of 128 bytes; anyone know different?)

6 transistors/bit x 8 bits/byte x 1024 bytes/KB x 256 KB = 12,582,912
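Spelled out in code, with the 6-transistor SRAM cell made explicit (just restating the arithmetic above):

#include <stdio.h>

int main(void) {
    long transistors = 6L     /* transistors per 6T SRAM cell (one bit) */
                     * 8      /* bits per byte    */
                     * 1024   /* bytes per KB     */
                     * 256;   /* KB of data array */
    printf("%ld\n", transistors);  /* 12582912, i.e. ~12.6 million */
    return 0;
}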


(bb || !bb) - Shakespeare

Edited by tnadrev on 12/17/02 00:11 AM.