AMD's Barcelona to "stink" in ORACLE - Page 4

Last message on previous page:
"Surely, for applications that have lots of inter-process communication, no 8S would perform well unless say it has a 4GHz broadcast bus with"

...not without some NUMA-awareness built in to the code along with NUMA-optimized mutual exclusion primitives...but who am I to know?

Reply to pdxkevinc

Quote :


1. A local memory transaction does not need to wait for remote caches unless the cache line is currently shared.


But the CPU doesn't know if the memory is shared or not unless it asks every socket first.

Reply to accord99

Quote :


1. A local memory transaction does not need to wait for remote caches unless the cache line is currently shared.


But the CPU doesn't know if the memory is shared or not unless it asks every socket first.

You have the wrong idea about both NUMA and the snooping protocol, then. With a snooping protocol, the CPU knows whether memory is shared or not; with proper "NUMA" support, the CPU further knows whether a memory page is local or not.

Reply to abinstein

Will this snooping function be like an SSE-type instruction, or is it something entirely different?

Reply to dasickninja

I think it's a protocol, quite different than an instruction set.

Reply to GlacierFreeze

Ah thanks.


That post was just a way of changing the subject header into something less annoying.

Reply to dasickninja

Quote :


You have the wrong idea about both NUMA and the snooping protocol, then. With a snooping protocol, the CPU knows whether memory is shared or not;


And the only way to find out is to ask other CPUs, which consumes HT bandwidth, which is why latencies of Opteron systems increase as socket counts are increased.

Quote :

with proper "NUMA" support, the CPU further knows whether a memory page is local or not.


With proper NUMA support, the OS will attempt to use local memory as much as possible per CPU. NUMA support is for the OS and software, not for the CPU.

Reply to accord99

Quote :


You have the wrong idea about both NUMA and the snooping protocol, then. With a snooping protocol, the CPU knows whether memory is shared or not;


And the only way to find out is to ask other CPUs, which consumes HT bandwidth, which is why latencies of Opteron systems increase as socket counts are increased.
Yes and no. The local processor does not "ask" other processors about the current cache line state; all it has to do is "push" (broadcast) state changes of shared cache lines to other processors.

The local processor needs to "ask" other CPUs when there's an L2 cache miss. But that is not unique to Opteron. Every multi-processor system using a snooping protocol must do so, by definition. With HT, messages of this type are switched through all processors in a manner much more efficient than with an FSB.

Also, "asking other CPUs" itself won't increase main memory access latency unless it's a remote cache miss, since in cHT cache coherence and memory access can go in parallel. The increased latency you see is the result of remote memory access, not the result of snooping.
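
A toy model may help separate the two claims about overlap here. The sketch below is not a measurement of any real Opteron: the function name and every number are made up for illustration. It only shows that when probes run alongside the DRAM read, the miss completes when the slower of the two finishes.

```python
def miss_latency_ns(dram_ns, probe_ns, parallel=True):
    """Latency of a cache miss that is served from local DRAM.

    parallel=True models probes issued alongside the DRAM read, so the
    request completes when the slower of the two finishes.
    parallel=False models probing first and reading DRAM afterwards.
    """
    if parallel:
        return max(dram_ns, probe_ns)
    return dram_ns + probe_ns


# Invented figures, for illustration only.
local_dram_ns = 60.0
for probe_ns in (30.0, 60.0, 90.0):
    overlapped = miss_latency_ns(local_dram_ns, probe_ns, parallel=True)
    serialized = miss_latency_ns(local_dram_ns, probe_ns, parallel=False)
    print(f"probe={probe_ns:5.1f} ns  overlapped={overlapped:5.1f} ns  "
          f"serialized={serialized:5.1f} ns")
```

Under these assumptions, probes add nothing while they return faster than local DRAM, and they set the latency once they return slower, which is where socket count and hop count would start to matter.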

Quote :

with proper "NUMA" support, the CPU further knows whether a memory page is local or not.


With proper NUMA support, the OS will attempt to use local memory as much as possible per CPU. NUMA support is for the OS and software, not for the CPU.
Under NUMA, the OS can reserve an address space exclusively for a process. Then it is straightforward for the CPU to know that any memory access to this address space is local only.

Reply to abinstein

Quote :


Yes and no. The local processor does not "ask" other processors about the current cache line state; all it has to do is "push" (broadcast) state changes of shared cache lines to other processors.

The local processor needs to "ask" other CPUs when there's an L2 cache miss. But that is not unique to Opteron. Every multi-processor system using a snooping protocol must do so, by definition. With HT, messages of this type are switched through all processors in a manner much more efficient than with an FSB.


It broadcasts and must wait for a response from every other socket.

Quote :

Also, "asking other CPUs" itself won't increase main memory access latency unless it's a remote cache miss, since in cHT cache coherence and memory access can go in parallel. The increased latency you see is the result of remote memory access, not the result of snooping.


The increased latency is the result of the broadcasts, since the latency increases even in NUMA OSes running a single application.

Reply to accord99

Quote :

Will this snooping function be like an SSE-type instruction, or is it something entirely different?



Snooping is a protocol used by multiprocessors to track and snoop global cache states. Its operation is like caching - usually not controlled by (and transparent to) software.

Reply to abinstein

Quote :


Yes and no. The local processor does not "ask" other processors about the current cache line state; all it has to do is "push" (broadcast) state changes of shared cache lines to other processors.

The local processor needs to "ask" other CPUs when there's an L2 cache miss. But that is not unique to Opteron. Every multi-processor system using a snooping protocol must do so, by definition. With HT, messages of this type are switched through all processors in a manner much more efficient than with an FSB.


It broadcasts and must wait for a response from every other socket.

That is true only among the processor nodes where the address space is accessible. Besides, it has nothing to do with NUMA. Snooping under FSBs does precisely the same thing.

Quote :

Also, "asking other CPUs" itself won't increase main memory access latency unless it's a remote cache miss, since in cHT cache coherence and memory access can go in parallel. The increased latency you see is the result of remote memory access, not the result of snooping.


The increased latency is the result of the broadcasts, since the latency increases even in NUMA OSes running a single application.
I don't understand your argument. Yes, broadcast messages will increase congestion and thus latency; the "best" example of this is actually the FSB, which by definition uses broadcast only.

OTOH, messaging under NUMA can be multicast, sent only to the nodes concerned. The actual data reply, which is the slowest message and needs the most bandwidth, is switched unicast, and thus much, much better than with UMA/FSB.
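
The difference between broadcast, multicast and unicast can be put in numbers with a small sketch. It simply counts messages per coherence event; the function name and the choice of a single sharer are assumptions for illustration, not a description of how cHT or any FSB actually routes traffic.

```python
def coherence_messages(total_nodes, sharers, mode):
    """Count messages sent for one coherence event.

    total_nodes: sockets in the system, the requester included
    sharers:     set of nodes that hold (or may hold) the line
    mode:        "broadcast" reaches every other node,
                 "multicast" reaches only the sharers,
                 "unicast"   sends one message to a single node
    """
    if mode == "broadcast":
        return total_nodes - 1
    if mode == "multicast":
        return len(sharers)
    if mode == "unicast":
        return 1
    raise ValueError(f"unknown mode: {mode}")


sharers = {1}  # assumption: exactly one other node holds the line
for nodes in (2, 4, 8):
    print(nodes,
          coherence_messages(nodes, sharers, "broadcast"),
          coherence_messages(nodes, sharers, "multicast"))
```

Broadcast grows with the socket count while multicast tracks the number of sharers, which is the scaling argument being made above.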

It seems to me that you kind of lost the focus of this discussion, which I believe is to discuss NUMA benefits w.r.t. UMA. Please don't repeat yourself and speak of memory access problems in general.

Reply to abinstein

LordAMD,

In this case I would have to reply that this article is useless.

Again, speculation on performance, positive or negative, AMD or Intel, does not matter.

It is a matter of testing engineering samples to establish an early baseline (still not the total performer that should be released but close) and then expanding on the trended information from reputable sites.

Even then throwing out the outliers.

AMD may in fact perform well in this arena.

Also LordAMD,

Posting for the sake of posting is really not called for. A half hearted attempt at appeasing the seemingly Intel crowd is not what is needed either.

What "is" needed is factual information both FOR and AGAINST reviewed CPUs.

Early speculation is bound to happen... But it is by no means a substitute for accurate and legitimate info.

Reply to Ches111

Quote :

Also, "asking other CPUs" itself won't increase main memory access latency unless it's a remote cache miss, since in cHT cache coherence and memory access can go in parallel. The increased latency you see is the result of remote memory access, not the result of snooping.


The increased latency is the result of the broadcasts, since the latency increases even in NUMA OSes running a single application.

accord99, you know, exchanging with you reminds me of playing with one of those artificial characters: extremely "localized" replies with extremely short memory.

For some reason, I don't think you really know nothing. But it seems to me that you just play dumb for the sake of arguing with me. I have no idea why you keep mixing up a few very orthogonal aspects of memory access:

1. UMA or NUMA
2. FSB or direct connect
3. snooping or not (e.g., message passing)

1 is programming paradigm. 2 is architecture implementation. 3 is protocol algorithm. So what exactly do you want to discuss, or to object to? NUMA is more scalable than UMA; direct connect is more efficient than FSB. Both of these are proven - what's your objection? These have little to do with the problem/deficiency of snooping.

If you think broadcast of cache line states is bad, then don't even think of FSB, since it broadcasts not only snooping messages but actually every byte of memory read/write data. UMA will not help you one bit in this case, either. However, with NUMA and its (non)assumption of asymmetric memory access, not all nodes need to access all address spaces. With direct connect, not all nodes need to hear all (data) messages.

Reply to abinstein

Quote :


1. A local memory transaction does not need to wait for remote caches unless the cache line is currently shared.


But the CPU doesn't know if the memory is shared or not unless it asks every socket first.


The MOESI protocol marks cache lines. It only has to check how the cache line is marked, not query the other sockets. If that were the case, Opteron wouldn't rule the 4S transaction world.

The cores would spend most of their time snooping HT.
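
As a rough illustration of what "checking how the line is marked" can and cannot decide, here is a simplified sketch of textbook MOESI. It is not AMD's actual implementation, and the names are illustrative only. A hit can often be resolved from the local state alone; a miss (Invalid) cannot, because nothing local says who else holds the line.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def needs_remote_traffic(state, is_write):
    """Return True if the access cannot be completed from the local state alone."""
    if state is State.INVALID:
        return True   # miss: another cache may hold the line, so it must be probed
    if not is_write:
        return False  # a read hit in M, O, E or S is served locally
    # Write hit: M and E mean this is the only cached copy, so nobody needs
    # to be told. O and S mean other copies may exist and must be invalidated.
    return state in (State.OWNED, State.SHARED)


for s in State:
    print(s.value,
          "read:", needs_remote_traffic(s, is_write=False),
          "write:", needs_remote_traffic(s, is_write=True))
```

In this simplified picture both sides have a point: hits in M or E stay local, while misses and writes to shared lines still generate traffic to other sockets.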

Reply to BaronMatrix

Quote :


accord99, you know, exchanging with you reminds me of playing with one of those artificial characters: extremely "localized" replies with extremely short memory.


This sub-thread between you and me came about over why the Quad FX-74 is slower than the FX-62 in games and most single-threaded applications. I presented the reason.

Reply to accord99

Quote :


The cores would spend most of their time snooping HT.


That is one reason why scaling to 8 sockets is poor.

Reply to accord99

Quote :


The cores would spend most of their time snooping HT.


That is one reason why scaling to 8 sockets is poor.

That is why Barcelona will have L3. This allows better sharing of data amongst sockets. You either need L3 or a helluva fast bus as someone said.

Reply to BaronMatrix

Quote :


The cores would spend most of their time snooping HT.


That is one reason why scaling to 8 sockets is poor.

Beyond 8 sockets, AMD uses an HT switch to keep the number of hops down to a reasonable level.

Yeah, or the Horus chipset, which got some attention a couple of years ago but surprisingly didn't make it to market. Possibly the appearance of dual-cores reduced the need for really big x86 servers.

http://www.realworldtech.com/page. [...] 202353&p=1

Reply to accord99

Quote :


accord99, you know, exchanging with you reminds me of playing with one of those artificial characters: extremely "localized" replies with extremely short memory.


This sub-thread between you and me came about over why the Quad FX-74 is slower than the FX-62 in games and most single-threaded applications. I presented the reason.

Your claim was actually this:

Quote :

Games are latency dependent, which is why the Quad FX loses to the single socket FXes in games, let alone the Core 2s.


And my question was:

Quote :

why would latency be higher if NUMA was properly supported?


Then your response was:

Quote :

Snooping, every memory transaction requires the CPU to check the other CPU first. NUMA doesn't solve this "requirement".



See your problem here? Sure, NUMA doesn't "solve" the snooping problem, but does it make it worse? You didn't answer the question at all; instead you spoke as if it's NUMA's deficiency for not doing away with snooping's requirement. And of course if you run single-threaded apps or games that are optimized for UMA/dual-core, QFX will have higher memory latency. This is not the fault of NUMA, but of your poor choice of application.

Snooping efficiency has little to do with being NUMA or UMA. It's a factor of system architecture (FSB or direct connect) and algorithm (MOESI, etc).

Further, if a process is bound to a node, and an address space is reserved exclusively for the process, then a cache miss in that address space does not need to consult remote caches at all. This is true of NUMA, but not of UMA.
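
For readers wondering what "bound to a node" looks like from software, here is a minimal sketch for Linux. It only covers the binding half: `os.sched_setaffinity` and the sysfs `cpulist` file are standard on Linux, but the helper names `parse_cpulist` and `bind_to_node` are made up for this example. It does not reserve an exclusive address space, and it says nothing about whether the hardware then skips remote probes.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def bind_to_node(node):
    """Pin the calling process to the CPUs of one NUMA node (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means "the calling process"
    return cpus


print(sorted(parse_cpulist("0-3,8-11")))
# On a NUMA Linux box one would then call, for example: bind_to_node(0)
```

Once a process is pinned this way, Linux's default policy of allocating pages on the node that first touches them tends to keep its memory local, which is the software side of the optimization being argued about.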

BTW, directory-based cache coherence usually has higher latency than snooping. The former trades off latency for better scalability.

Reply to abinstein

Quote :

See your problem here? Sure, NUMA doesn't "solve" the snooping problem, but does it make it worse?


It's the cause of the problem. As the Quad FX is targeted at enthusiasts and gamers, its inferior performance in games versus the single-socket FX-62 is a problem.

Having NUMA has its advantages, but it also has its disadvantages in the desktop market. Clearly it was a poorly planned knee-jerk reaction on the part of AMD in trying to bring a 2-socket platform for the desktop market, giving up memory latency and thus reducing performance in desktop applications and games in order to increase throughput, though still not a match for the mega-tasking ability of the QX6700.

Quote :

Further, if a process is bound to a node, and an address space is reserved exclusively for the process, then a cache miss in that address space does not need to consult remote caches at all. This is true of NUMA, but not of UMA.


And I doubt if it's true of ccNUMA as implemented in Opterons.

Reply to accord99

Quote :

See your problem here? Sure, NUMA doesn't "solve" the snooping problem, but does it make it worse?


It's the cause of the problem. As the Quad FX is targeted at enthusiasts and gamers, its inferior performance in games versus the single-socket FX-62 is a problem.
NUMA is the cause of the problem? That's pure BS. How in heaven's name does NUMA/UMA affect snooping delay, which is really determined by implementation (e.g. FSB vs. direct connect) and algorithm (e.g. MOESI vs. write-update)?

If you say NUMA makes some memory access longer, then yes, but not much if you had optimized for it.

Quote :

Having NUMA has its advantages, but it also has its disadvantages in the desktop market. Clearly it was a poorly planned knee-jerk reaction on the part of AMD in trying to bring a 2-socket platform for the desktop market, giving up memory latency and thus reducing performance in desktop applications and games in order to increase throughput, though still not a match for the mega-tasking ability of the QX6700.


The real disadvantage of NUMA on desktop today is lack of software support. Period.

However, 4x4 is far better than "poorly planned knee-jerk reaction." Even today there are megatasking scenarios where NUMA is better than UMA. For example, if you run two virtual machines each running some high-BW apps, there is no need to mix them together with uniform memory access. This is what megatasking is about, not just running your favorite game in SLI while encoding mp3s - far more than that.

Quote :

Further, if a process is bound to a node, and an address space is reserved exclusively for the process, then a cache miss in that address space does not need to consult remote caches at all. This is true of NUMA, but not of UMA.


And I doubt if it's true of ccNUMA as implemented in Opterons.
Allow me to say that, even if this is not already the case in Opteron, with some tweaks, it will be transparently possible.

Imagine how you could make it true in an FSB/UMA architecture? With UMA, you can't. With FSB, you can't to the square. The NUMA propagation from server to desktop, IMO, is just a matter of time, and AMD is accelerating it.

Reply to abinstein

Quote :


If you say NUMA makes some memory access longer, then yes, but not much if you had optimized for it.


It's about the difference between DDR2-800 and DDR2-533.

Quote :


The real disadvantage of NUMA on desktop today is lack of software support. Period.


Actually, it's a lack of benefits. Desktop apps benefit more from lower latency than higher bandwidth.

Quote :

However, 4x4 is far better than "poorly planned knee-jerk reaction." Even today there are megatasking scenarios where NUMA is better than UMA. For example, if you run two virtual machines each running some high-BW apps, there is no need to mix them together with uniform memory access. This is what megatasking is about, not just running your favorite game in SLI while encoding mp3s - far more than that.


I wouldn't use a Quad FX system since a) it has only 4 DIMMs and b) it doesn't support registered ECC memory. So the Quad FX is a product in search of a market: too slow in desktop applications and games (not to mention too power-hungry and in need of too large a case), while offering not enough RAS and memory, and too much extraneous stuff, for workstations.

Quote :


Allow me to say that, even if this is not already the case in Opteron, with some tweaks, it will be transparently possible.


Simple tweaks, such as a complete reworking of the Opteron platform.

Quote :


Imagine how you could make it true in an FSB/UMA architecture? With UMA, you can't. With FSB, you can't to the square. The NUMA propagation from server to desktop, IMO, is just a matter of time, and AMD is accelerating it.


I hadn't noticed this trend. It's actually more the opposite, everything coming to a single socket.

Reply to accord99

Quote :


If you say NUMA makes some memory access longer, then yes, but not much if you had optimized for it.


It's about the difference between DDR2-800 and DDR2-533.
This is the non-optimized case.

Quote :


The real disadvantage of NUMA on desktop today is lack of software support. Period.


Actually, it's a lack of benefits. Desktop apps benefit more from lower latency than higher bandwidth.
Whatever. You have your definition of "desktop apps," and if you like, believe it with all your heart. NUMA does not necessarily offer higher bandwidth, though, and when it does, it usually also offers lower latency.

Quote :


Allow me to say that, even if this is not already the case in Opteron, with some tweaks, it will be transparently possible.


Simple tweaks, such as a complete reworking of the Opteron platform.
So an additional "local-ness" bit in the page table is a complete reworking of the platform?
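
To show what such a bit would buy, here is a purely hypothetical sketch. The `node_local` flag is not an existing x86 page-table bit, and `PageEntry` and `probes_on_miss` are names invented for this example; the code only models the proposal that a miss to a node-private page could skip remote probes.

```python
from dataclasses import dataclass


@dataclass
class PageEntry:
    frame: int
    node_local: bool  # hypothetical flag, not an existing x86 page-table bit


def probes_on_miss(entry, total_nodes):
    """Probes sent for a cache miss under the proposed scheme."""
    if entry.node_local:
        return 0            # page is private to this node: skip remote caches
    return total_nodes - 1  # otherwise probe every other node


private_page = PageEntry(frame=0x1000, node_local=True)
shared_page = PageEntry(frame=0x2000, node_local=False)
for nodes in (2, 4, 8):
    print(nodes,
          probes_on_miss(private_page, nodes),
          probes_on_miss(shared_page, nodes))
```

Whether a real platform could guarantee that such a page is never cached elsewhere is exactly the point under dispute, so treat this as a picture of the idea rather than evidence for it.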

Quote :


Imagine how you could make it true in an FSB/UMA architecture? With UMA, you can't. With FSB, you can't to the square. The NUMA propagation from server to desktop, IMO, is just a matter of time, and AMD is accelerating it.


I hadn't noticed this trend. It's actually more the opposite, everything coming to a single socket.
Do you expect an FSB with UMA to work well with 16 or even 8 cores of Conroe on a single socket? If you think it's plausible, why is Intel working on CSI now (or do you think it isn't)?

Reply to abinstein

Quote :


This is the non-optimized case.


This is the case in reality.

Quote :


Whatever. You have your definition of "desktop apps," and if you like, believe it with all your heart. NUMA does not necessarily offer higher bandwidth, though, and when it does, it usually also offers lower latency.


Desktop apps = games and single-threaded apps that the Quad FX is slower than the FX-62 in.

Quote :

So an additional "local-ness" bit in the page table is a complete reworking of the platform?


And yet no such example exists.

Quote :

Do you expect an FSB with UMA to work well with 16 or even 8 cores of Conroe on a single socket? If you think it's plausible, why is Intel working on CSI now (or do you think it isn't)?


And you can't even fit 2 memory controllers in a normal size board now.

Reply to accord99

Quote :


This is the non-optimized case.


This is the case in reality.
No, it's just your reality where you only look at non-NUMA optimized programs.

Quote :


Whatever. You have your definition of "desktop apps," and if you like, believe it with all your heart. NUMA does not necessarily offer higher bandwidth, though, and when it does, it usually also offers lower latency.


Desktop apps = games and single-threaded apps that the Quad FX is slower than the FX-62 in.
I knew it. Suffice it to say QFX is not only for games and single-threaded apps.

Quote :

So an additional "local-ness" bit in the page table is a complete reworking of the platform?


And yet no such example exists.
Just search for NUMAchine on Google. There are others, but this one is the easiest to find a reference on.

Quote :

Do you expect an FSB with UMA to work well with 16 or even 8 cores of Conroe on a single socket? If you think it's plausible, why is Intel working on CSI now (or do you think it isn't)?


And you can't even fit 2 memory controllers in a normal size board now.
WTF? Memory controllers are relatively small. Having two memory controllers on a chip (not to mention a board) is nothing compared to having 16 cores share a single FSB.

Reply to abinstein

Quote :


The cores would spend most of their time snooping HT.


That is one reason why scaling to 8 sockets is poor.

That is why Barcelona will have L3. This allows better sharing of data amongst sockets. You either need L3 or a helluva fast bus as someone said.

No it doesn't.

I always thought cache was used to bring needed data/instructions closer to the core to avoid main memory latency and reduce overall demand on the bus. Needless to say, adding another level of cache simply means another cache pool that will need to be checked when sharing --- hence the drive to a new HT revision with more BW and lower latency.


That's what the extra bandwidth is for. HT latency is almost nil. Barcelona also has 4 HT links, which connect 8 CPUs with 1-hop NUMA. If you ungang them you can get out to a cache-coherent 32S.

Reply to BaronMatrix

Quote :


No, it's just your reality where you only look at non-NUMA optimized programs.


Which are almost every application.

Quote :

I knew it. Suffice it to say QFX is not only for games and single-threaded apps.


That's what it was targeted at: enthusiasts. I guess until AMD realized what a dud it was for enthusiasts. And for mega-tasking, it's slower than the QX6700, and usually the Q6600.

And now, unfortunately for AMD, they don't know what it's targeted at, other than being an even better heater than the 840 EE.

Quote :


Just search for NUMAchine on Google. There are others, but this one is the easiest to find a reference on.


What does an academic research project that has its own OS and compiler have to do with production machines today? And it doesn't really look much different from, say, IBM's X3.

Quote :

WTF? Memory controllers are relatively small. Having two memory controllers on a chip (not to mention a board) is nothing compared to having 16 cores share a single FSB.


What about the pin count to get memory to the controllers? The Quad FX motherboard is a clear example, larger than the standard MB. Good luck fitting it into a typical Dell case.

Reply to accord99

Quote :

See your problem here? Sure, NUMA doesn't "solve" the snooping problem, but does it make it worse?


It's the cause of the problem. As the Quad FX is targeted at enthusiasts and gamers, its inferior performance in games versus the single-socket FX-62 is a problem.

Having NUMA has its advantages, but it also has its disadvantages in the desktop market. Clearly it was a poorly planned knee-jerk reaction on the part of AMD in trying to bring a 2-socket platform for the desktop market, giving up memory latency and thus reducing performance in desktop applications and games in order to increase throughput, though still not a match for the mega-tasking ability of the QX6700.

Quote :

Further, if a process is bound to a node, and an address space is reserved exclusively for the process, then a cache miss in that address space does not need to consult remote caches at all. This is true of NUMA, but not of UMA.


And I doubt if it's true of ccNUMA as implemented in Opterons.


So what you're saying is that SSE performance doesn't count because it does have proper SW support?

If you don't have an awareness of dual+ sockets, then your code will be fragmented and its "kernel" will not look for local RAM vs. just calling malloc.

I have long said that using SLI-like "Profiles" would allow for smart swapping so that the game will take all of the RAM in the banks local to the executing processor (QFX banks have a max of 2GB ). This would only mean slightly longer loading times as data is swapped.

Vista shows nice improvements with everything and people who buy this could sacrifice a few fps for the ability to run a virus and spyware scan while playing at 1920 (that can happen).

Reply to BaronMatrix

Quote :


The cores would spend most of their time snooping HT.


That is one reason why scaling to 8 sockets is poor.

That is why Barcelona will have L3. This allows better sharing of data amongst sockets. You either need L3 or a helluva fast bus as someone said.

No it doesn't.

I always thought cache was used to bring needed data/instructions closer to the core to avoid main memory latency and reduce overall demand on the bus. Needless to say, adding another level of cache simply means another cache pool that will need to be checked when sharing --- hence the drive to a new HT revision with more BW and lower latency.

That's what the extra bandwidth is for. HT latency is almost nil. Barcelona also has 4 HT links, which connect 8 CPUs with 1-hop NUMA. If you ungang them you can get out to a cache-coherent 32S.
A few things -
1. IMO, the L3 helps data sharing among the cores inside a socket. It doesn't seem to help data sharing among different sockets.
2. The main purpose of cache is to reduce average memory access latency.
3. With cHT, remote cache coherence and local/remote memory access go in parallel. The latency increases slowly with load.
4. cHT latency (local & remote memory access difference) is about 3 cycles for Opteron @1.4GHz. See Fig.21 on page 21 of this paper.

Reply to abinstein

Quote :

I have long said that using SLI-like "Profiles" would allow for smart swapping so that the game will take all of the RAM in the banks local to the executing processor (QFX banks have a max of 2GB ). This would only mean slightly longer loading times as data is swapped.



You have long said it, but that does not make it true.... 'what is smart swapping'? Does it start here and end there? Exactly what is it about smart swapping that makes it localize better?


Are you being funny or are you just dumb? It seems pretty obvious it means swapping one bank up to the max RAM in the profile.

For example, I have viewed RAM on Q4 (Task Mgr) and it peaked at about 1.5GB. If the bank has 2GB, that leaves .5GB for other apps. This would have to also be careful of memory fragmentation within the bank but the algorithm would not be difficult and could even be generic.
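
A minimal sketch of the fit check such a "profile" would need, using the figures from this post (1.5GB peak, 2GB bank). The function `pick_bank` and the idea of a stored per-game peak are hypothetical; no existing driver or OS feature is implied, and fragmentation inside a bank is ignored.

```python
def pick_bank(peak_gb, free_gb, preferred):
    """Choose a memory bank able to hold a game's peak footprint.

    peak_gb:   peak memory use recorded in the (hypothetical) profile
    free_gb:   free memory per bank, indexed by bank number
    preferred: bank local to the socket the game will run on
    Returns the bank index, or None if no single bank is large enough.
    """
    if free_gb[preferred] >= peak_gb:
        return preferred
    for bank, free in enumerate(free_gb):
        if free >= peak_gb:
            return bank
    return None


# Figures from the post: a 2 GB bank and a 1.5 GB peak.
bank = pick_bank(1.5, [2.0, 2.0], preferred=0)
headroom_gb = 2.0 - 1.5
print(bank, headroom_gb)
```

If the preferred bank is too full, the sketch falls back to another bank, and the process would then have to run on that bank's socket to keep its accesses local.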

Reply to BaronMatrix

Quote :


The cores would spend most of their time snooping HT.


That is one reason why scaling to 8 sockets is poor.

That is why Barcelona will have L3. This allows better sharing of data amongst sockets. You either need L3 or a helluva fast bus as someone said.

No it doesn't.

I always thought cache was used to bring needed data/instructions closer to the core to avoid main memory latency and reduce overall demand on the bus. Needless to say, adding another level of cache simply means another cache pool that will need to be checked when sharing --- hence the drive to a new HT revision with more BW and lower latency.

That's what the extra bandwidth is for. HT latency is almost nil. Barcelona also has 4 HT links, which connect 8 CPUs with 1-hop NUMA. If you ungang them you can get out to a cache-coherent 32S.
A few things -
1. IMO, the L3 helps data sharing among the cores inside a socket. It doesn't seem to help data sharing among different sockets.


Not correct. The cache protocol itself can determine if more than one socket needs the data, which will fill all L3 with it.

2. The main purpose of cache is to reduce average memory access latency.

Yes, it would lower latency by smart prefetching.


3. With cHT, remote cache coherence and local/remote memory access go in parallel. The latency increases slowly with load.

Yes, that is again why Barcelona has 4 links, so that separate accesses can happen between 0,4 and 0,3. With things like Oracle, bandwidth can actually hide latency, and its transactional nature means that transactions aren't shared across sockets or even cores.

SQL uses a memory pool unlike games which have to allow for turning left or right or firing or not firing. This is why they are so sensitive to latency.

4. cHT latency (local & remote memory access difference) is about 3 cycles for Opteron @1.4GHz. See Fig.21 on page 21 of this paper.

That means that one cycle is 1/1,400,000,000 s, or ~0.71 ns, so ~2.1 ns total. A 2.8GHz Opteron should theoretically have half the latency difference.
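
The cycle-to-time conversion is easy to check. The sketch below takes the 3-cycle figure quoted from the paper as given and assumes, as this post does, that the cycle count stays the same at a higher clock; `cycles_to_ns` is just a helper name for this example.

```python
def cycles_to_ns(cycles, clock_hz):
    """Convert a cycle count at a given clock frequency to nanoseconds."""
    return cycles / clock_hz * 1e9


for ghz in (1.4, 2.8):
    one = cycles_to_ns(1, ghz * 1e9)
    three = cycles_to_ns(3, ghz * 1e9)
    print(f"{ghz} GHz: 1 cycle = {one:.2f} ns, 3 cycles = {three:.2f} ns")
```

At 1.4GHz one cycle is about 0.71 ns and three cycles about 2.14 ns; at 2.8GHz the same three cycles take about 1.07 ns.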

Reply to BaronMatrix

Quote :


No, it's just your reality where you only look at non-NUMA optimized programs.


Which are almost every application.
Stop playing a dumb enthusiast, will you? Single-threaded and game apps are not all apps. There are probably more apps optimized for NUMA than for SSE (as Baron made the good point).

Quote :

I knew it. Suffice it to say QFX is not only for games and single-threaded apps.


That's what it was targeted at: enthusiasts. I guess until AMD realized what a dud it was for enthusiasts. And for mega-tasking, it's slower than the QX6700, and usually the Q6600.
QX6700 can't even perform the mega-tasking workload that QFX demonstrated.

Quote :

And now, unfortunately for AMD, they don't know what it's targeted at, other than being an even better heater than the 840 EE.


Now what are you? Mind reader? IMO you're just either playing dumb or being blind.

Quote :


Just search for NUMAchine on Google. There are others, but this one is the easiest to find a reference on.


What does an academic research project that has its own OS and compiler have to do with production machines today? And it doesn't really look much different from, say, IBM's X3.
Search the paper for "multicast," which shows an example of what I said: multicast. Again, you're playing your extremely short memory here.
BTW, no, this architecture is totally different from IBM's X3.

Quote :

WTF? Memory controllers are relatively small. Having two memory controllers on a chip (not to mention a board) is nothing compared to having 16 cores share a single FSB.


What about the pin count to get memory to the controllers? The Quad FX motherboard is a clear example, larger than the standard MB. Good luck fitting it into a typical Dell case.
Thousands of pins per chip is technically nothing. 10 years ago it wasn't possible to fit 1207 pins on one single socket. You're just too backward-looking. Besides, the difficulty of connecting everything to one place is yet another reason why NUMA is preferable.

Reply to abinstein

Quote :


Stop playing a dumb enthusiast, will you? Single-threaded and game apps are not all apps. There are probably more apps optimized for NUMA than for SSE (as Baron made the good point).


I doubt that considering SSE is by default the mode for floating point operations in Windows x64 mode, while many compilers are primarily compiling floating point operations to SSE ops now.

And if so many apps are "optimized" for NUMA, why does the Quad FX with a greater than 100% advantage in bandwidth lose so badly to the QX6700?

Quote :

can't even perform the mega-tasking workload that QFX demonstrated.


Of course it can, and as all the reviews prove it can do it faster.

Quote :

Now what are you? Mind reader? IMO you're just either playing dumb or being blind.


It's not hard to see with the disastrous results of the Quad FX. It's pretty easy to see the lack of markets for it, considering its excessive power consumption, excessive costs, excessive size and lack of memory slots.

Quote :


Search the paper for "multicast," which show an example of what I said, multicast. Again, you're playing your extremely short memory here.
BTW, no, this architecture is totally different from IBM's X3.


It has nothing to do with what you're talking about. It's basically just like IBM's X3, its stations are equivalent to IBM's quads, it utilizes ideas similar to IBM's snoop filters.

And it's from 1995.

Quote :


Thousands of pins per chip is technically nothing. 10 years ago it wasn't possible to fit 1207 pins to one single socket. You're just too backward looking. Besides, hard to connect everything to one place is yet another reason why NUMA is preferable.


But thousands of pins across a cheap motherboard is not. The width of memory has grown much less than the clock speed of memory over the years.

Reply to accord99

Quote :


Stop playing a dumb enthusiast, will you? Single-threaded and game apps are not all apps. There are probably more apps optimized for NUMA than for SSE (as Baron made the good point).


I doubt that, considering SSE is the default mode for floating-point operations in Windows x64, and many compilers now primarily compile floating-point operations to SSE ops.

And if so many apps are "optimized" for NUMA, why does the Quad FX with a greater than 100% advantage in bandwidth lose so badly to the QX6700?
there are lots of sweet fruits but they pick the sour ones. does that make all fruits sour?

Quote :

can't even perform the mega-tasking workload that QFX demonstrated.


Of course it can, and as all the reviews prove it can do it faster.
what review shows mega-tasking? don't give me a list of reviews each a copy of one another. point out which is precisely measuring mega-tasking performance.

Quote :

Now what are you? Mind reader? IMO you're just either playing dumb or being blind.


It's not hard to see with the disastrous results of the Quad FX. It's pretty easy to see the lack of markets for it, considering its excessive power consumption, excessive costs, excessive size and lack of memory slots.
You are confusing QFX's technical merits with its business/marketing ones.

Quote :


Search the paper for "multicast," which show an example of what I said, multicast. Again, you're playing your extremely short memory here.
BTW, no, this architecture is totally different from IBM's X3.


It has nothing to do with what you're talking about. It's basically just like IBM's X3, its stations are equivalent to IBM's quads, it utilizes ideas similar to IBM's snoop filters.

And it's from 1995.
I am talking about multicast. are you saying the ability to do multicast has nothing to do with the multicast I talk about? It may use a different mechanism but the possibility of multicast is due to its NUMA nature.
Not just IBM's X3 uses quads, so does Newisys Horus. Not just X3 uses snoop filters, so does Horus. In your superficial view, everything becomes the same.

Quote :


Thousands of pins per chip is technically nothing. 10 years ago it wasn't possible to fit 1207 pins to one single socket. You're just too backward looking. Besides, hard to connect everything to one place is yet another reason why NUMA is preferable.


But thousands of pins across a cheap motherboard is not. The width of memory has grown much less than the clock speed of memory over the years.
You're again confusing the memory clock frequency with access time, and the board level with the chip level. The clock frequency increases not because of shorter access time but because of more pipelined/parallel access ports.

Reply to abinstein

Quote :


there are lots of sweet fruits but they pick the sour ones. does that make all fruits sour?


When they pick a wide spectrum of applications and the results are all the same, then it's good evidence that NUMA has virtually no impact on performance on applications used by people.

Quote :


what review shows mega-tasking? don't give me a list of reviews each a copy of one another. point out which is precisely measuring mega-tasking performance.


Every review had its own scenarios. The QX6700 won almost every one. They were also typically more demanding than your weak Youtube version.

Quote :


you are confused of QFX's technical merit with its business/marketing ones.


Its original business and marketing aims determine how its (lack of) technical merits are viewed.

Quote :


I am talking about multicast. are you saying the ability to do multicast has nothing to do with the multicast I talk about? It may use a different mechanism but the possibility of multicast is due to its NUMA nature.
Not just IBM's X3 uses quads, so does Newisys Horus. Not just X3 uses snoop filters, so does Horus. In your superficial view, everything becomes the same.


No it has nothing to do with what you said since there is currently no way to tell an Opteron CPU to not snoop another connected one. And yes, Horus and X3 are all pretty similar in idea, minimizing snoops across quads.

Quote :


you're again confused of the memory clock freq with access time, and the board level with chip level. The clock freq increases not because of shorter access time but more pipelined/parallel access ports.


And again you are talking about something you have no understanding about.

Reply to accord99

Quote :


And again you are talking about something you have no understanding about.


Frankly, you were talking but not really speaking of anything material.
- An Opteron does not need to snoop another chip when it's accessing memory exclusive to itself.
- As I have said, in all cases that an Opteron has to snoop another Opteron, a Woodcrest/Clovertown with FSB must snoop another Woodcrest/Clovertown, too. The latter will actually put more pressure on the shared bandwidth and thus experience higher delay. I.e., snooping or not makes little difference between K8 and Core 2 (K8's MOESI is actually more efficient than Core 2's MESI). Snooping over direct connect arch with cHT is in fact more efficient than that over shared FSB.
- X3 and Horus have vastly different implementations. If you're talking about similar ideas, then yes, from K8 to Conroe to Power are all "the same" in this regard; but in fact they all have different mechanisms and pros and cons.
- It takes about 10 years for memory speed to double (measured by reduced access time). OTOH, from Socket 7 (1995) to Socket F (2006) the pin counts almost quadrupled.
- Your talk about business/marketing strategy affecting technical merits is just nonsense. You wouldn't think like this if you knew the loathing some Intel engineers feel about marketing BS (even from their own company).

It's clear to me now that you live in a totally different world from mine in terms of technical understanding. I'll only say that if you don't like QFX, then know that it was never meant to please you; if you think FSB/UMA is sufficient/efficient, then know that you'll soon become part of the forgotten past.
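
For anyone following along, the MOESI versus MESI point above can be illustrated with a toy model. This is only a textbook-level sketch in Python, not a description of the actual K8 or Core 2 implementations; the function name `snoop_read` and its return shape are made up for illustration. It shows what the cache holding a line does when another socket snoops that line with a read.

```python
# Toy, textbook-level model of a snoop read hitting a line in another cache.
# Assumption: classic MESI writes dirty data back to memory on a snoop read;
# classic MOESI keeps the dirty data and moves to Owned instead.
# Nothing here models a real K8 or Core 2 pipeline.

def snoop_read(protocol, state):
    """Return (new_state, memory_writeback_needed) for the cache holding the line."""
    if protocol not in ("MESI", "MOESI"):
        raise ValueError("unknown protocol: " + protocol)
    if state == "O" and protocol == "MESI":
        raise ValueError("MESI has no Owned state")

    if state == "M":
        if protocol == "MOESI":
            # Dirty data is supplied cache-to-cache; memory is not updated yet.
            return ("O", False)
        # MESI: the dirty line is written back, then both copies are Shared.
        return ("S", True)
    if state == "E":
        # Clean exclusive copy simply becomes Shared in both protocols.
        return ("S", False)
    # O, S and I keep their state on a remote read.
    return (state, False)


for proto in ("MESI", "MOESI"):
    print(proto, snoop_read(proto, "M"))
```

With a Modified line, the textbook MESI path forces a write-back to memory before the line can be shared, while MOESI moves to Owned and hands the dirty data over directly. Whether that difference is visible in a given benchmark is a separate question.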

Reply to abinstein

Frankly, it's clear that you simply don't understand as

a) you persist in your mistaken belief that Opterons somehow don't need to snoop a second socket in all cases, despite all evidence to the contrary
b) the benefits of the direct connected architecture are pretty small given the superior performance of the QX6700
c) and overall, none of NUMAchine, Horus or X3 has anything to do with your original claim
d) memory bandwidth on the other hand has jumped over 10x, primarily due to increased clockspeed, while the width has only gone from 64-bit to 128-bit
e) I'd imagine it's the same loathing AMD engineers feel when Quad FX was talked about

It's clear to me from this discussion, your many initial false statements (Quad FX better mega-tasking, QX6700 can't handle two intensive tasks, lack of knowledge of the Quad FX's inability to handle 4 dual-slot video cards) and usage of that incredibly inaccurate Tom Yager article that your sense of reality is badly distorted.

Reply to accord99

Quote :

Frankly, it's clear that you simply don't understand as

a) you persist in your mistaken belief that Opterons somehow don't need to snoop a second socket in all case, despite all evidence to the contrary


BS. I said "NUMA," not "Opteron." Here is my original claim: "Snooping efficiency has little to do with being NUMA or UMA. It's a factor of system architecture (FSB or direct connect) and algorithm (MOESI, etc). Further, if a process is bound to a node, and an address space is reserved exclusively for the process, then a cache miss in that address space do not need to consult remote cache at all. This is true to NUMA, but not UMA." Where did I say Opteron?
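
To make the quoted claim concrete, here is a toy sketch in Python of the mechanism as described: a page reserved for a process bound to one node would need no remote probes on a miss. The `home` and `private` page fields and the function `nodes_to_probe` are hypothetical names invented for this illustration. It models the idea only and says nothing about how any shipping Opteron behaves, which is exactly what is being argued about in this thread.

```python
# Hypothetical model of the claim, not of real hardware.
# Each page is tagged with its home node and whether it is reserved
# exclusively for a process bound to that node.

def nodes_to_probe(page, requesting_node, all_nodes):
    """Return the set of remote nodes whose caches must be probed on a miss."""
    if page["home"] == requesting_node and page["private"]:
        # By assumption no other node can ever have cached this page,
        # so no remote cache needs to be consulted.
        return set()
    # Otherwise fall back to probing every other node.
    return set(all_nodes) - {requesting_node}


nodes = [0, 1]
private_page = {"home": 0, "private": True}
shared_page = {"home": 0, "private": False}

print(nodes_to_probe(private_page, 0, nodes))
print(nodes_to_probe(shared_page, 0, nodes))
```

The whole argument hinges on whether the hardware actually has something like the `private` bit available to it at the time of the miss.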

I studied the architectures of NUMA (and COMA, MPI, among others). I was debunking your false claim that NUMA is not suited for mega-tasking; I was not trying to make technical inferences based on a few on-line enthusiast reviews.

Quote :

b) the benefits of the direct connected architecture is pretty small given the superior performance of the QX6700


QX6700 wins over QFX the same or less than Conroe over K8. The former's win is clearly on single core optimization (both software and hardware), and has little to do with system architecture. In terms of memory architecture, direct connect (plus NUMA) is huge when it comes to mega-tasking and virtualization.

Quote :

c) and overall, neither NUMAchine or Horus or X3 has anything related to your original claim


BS. I was speaking of the NUMA architecture, of which both NUMAchine and Horus are examples. X3 OTOH is what you brought on to the table, and indeed has little to do with what I spoke of.

Quote :

d)memory bandwidth on the other hand has jumped over 10x, primarily due to increased clockspeed, while the width has only gone from 64-bit to 128-bit


Another BS. You simply don't understand the working of SDRAM. The "10x" increased clockrate is not due to memory access speed (which increases only ~2x in 10 years), but intensive use of banking inside/among the DRAM chips. The "double-ness" of DDR and higher frequency over the SDRAM are not due to memory access speed increase, either, but increased parallel access of memory cells; such a parallelism went from 64-bit of single rate to 128-bit of twice the double rate (DDR2), or about 8x increase over 10 years.
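
The "about 8x" figure above is just arithmetic on the numbers given in the post, and can be checked in a couple of lines of Python. The helper name `relative_parallelism` is made up, and "twice the double rate" is read here as four transfers per base clock relative to single-rate SDRAM; that reading is an assumption.

```python
# Back-of-the-envelope check of the "about 8x" figure.
# Assumption: "twice the double rate" means 4 transfers per base clock
# relative to single-data-rate SDRAM (1 transfer per clock).

def relative_parallelism(bus_bits, transfers_per_clock):
    """Bits moved per base clock, used only as a relative measure."""
    return bus_bits * transfers_per_clock


sdram = relative_parallelism(64, 1)    # 64-bit, single rate
ddr2 = relative_parallelism(128, 4)    # 128-bit, twice the double rate
ratio = ddr2 / sdram

print(ratio)
```

So 2x from the wider interface times 4x from the transfer rate gives the 8x, with no help from cell access time.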

Quote :

e)I'd imagine it's the same loathing AMD engineers feel when Quad FX was talked about


AMD never talks about QFX the way enthusiasts like you did.

Quote :

It's clear to me from this discussion, your many initial false statements (Quad FX better mega-tasking, QX6700 can't handle two intensive tasks, lack of knowledge of the Quad FX's inability to handle 4 dual-slot video cards) and usage of that incredibly inaccurate Tom Yager article that your sense of reality is badly distorted.


From this discussion I clearly see that you are a FUDer, nothing more. I never said QFX is able to handle 4 dual-slot video cards; I said you don't need 4 dual-slots to run 4 video cards. I never said QX6700 can't handle two intensive tasks; I clearly said mega-tasking is more than running two intensive tasks.

OTOH, it was you who claimed that QFX has higher memory access delay (than QX6700) due to snooping(!). You went on to claim that it's all because NUMA is not going to be supported, or that NUMA performs poorly even if supported, or that you simply don't know how NUMA could be properly supported (please at least pick your favorite...). If you only criticized QFX for its lack of business tactics, I would have no objection. But you went further to make false technical implications based on wrong (single-threaded & games) benchmarks and (lack of) business appeal. You were just plain wrong.

Besides, Tom Yager's article has nothing to do with our discussion here. Does it talk about NUMA? Does it talk about snooping? Are you sure your mind is clear, or were you on some drugs?

Reply to abinstein

Quote :


BS. I said "NUMA," not "Opteron." Here is my original claim: "Snooping efficiency has little to do with being NUMA or UMA. It's a factor of system architecture (FSB or direct connect) and algorithm (MOESI, etc). Further, if a process is bound to a node, and an address space is reserved exclusively for the process, then a cache miss in that address space do not need to consult remote cache at all. This is true to NUMA, but not UMA." Where did I say Opteron?



This is what you said.

2) I see no reason that memory latency on Quad FX would be higher if NUMA is properly supported. Maybe you can tell me how and why?

Even if NUMA is properly supported, a Quad FX's memory latency will simply be unable to match that of a comparably clocked single-socket FX.

Quote :


QX6700 wins over QFX the same or less than Conroe over K8. The former's win is clearly on single core optimization (both software and hardware), and has little to do with system architecture. In terms of memory architecture, direct connect (plus NUMA) is huge when it comes to mega-tasking and virtualization.


The QX6700 wins on almost everything, from single applications to mega-tasking and basically shows the direct connection/NUMA advantage has little effect on the overwhelming majority of applications.

Quote :


BS. I was speaking of the NUMA architecture, of which both NUMAchine and Horus are examples. X3 OTOH is what you brought on to the table, and indeed has little to do with what I spoke of.


And neither of them has anything to do with your little "flip a bit and an Opteron won't have to snoop another socket" claim.

Quote :


Another BS. You simply don't understand the working of SDRAM. The "10x" increased clockrate is not due to memory access speed (which increases only ~2x in 10 years), but intensive use of banking inside/among the DRAM chips. The "double-ness" of DDR and higher frequency over the SDRAM are not due to memory access speed increase, either, but increased parallel access of memory cells; such a parallelism went from 64-bit of single rate to 128-bit of twice the double rate (DDR2), or about 8x increase over 10 years.


As I said, the bandwidth has increased from the increased clock rates of the memory devices, not by increasing the width of the memory connection or by adding more memory controllers.

Quote :


AMD never talks about QFX the way enthusiasts like you did.



Let's read what AMD originally said about the 4X4:

New Enthusiast Platform
Building on AMD’s already recognized leadership among PC enthusiasts, AMD also announced plans for a new enthusiast platform codenamed “4x4” that will extend AMD’s long-standing commitment to those consumers who demand the highest-performing PCs. The 4x4 platform features a four-core, multi-socket processor configuration uniquely possible via AMD’s Direct Connect Architecture. The 4X4 platform will be designed to be upgraded to eight total processor cores when AMD launches quad-core processors in 2007. Project 4x4 represents system-level enthusiast enhancements and is designed for ultimate multi-tasking performance across gaming, digital video, processor-intensive and heavily-threaded applications.


http://images.tomshardware.com/2006/11/10/amd4by4.jpg

Quote :


From this discussion I clearly see that you are a FUDer, nothing more. I never said QFX able to handle 4 dual-slot video cards; I said you don't need 4 dual-slots to run 4 video cards.


Let me refresh your memory:

I guess those reviews were performed with 4 graphics cards performing 2+ graphically intensive tasks? Ouch, aren't they?

Or do they simply think like most people on this forum did, that hacking (C2D and C2Q) around for SLI doesn't worth one's efforts? Last time I checked quad-SLI isn't even possible on any Core 2 platform (please correct me if I'm wrong).


Quote :

I never said QX6700 can't handle two intensive tasks; I clearly said mega-tasking is more than running two intensive tasks.


Let me refresh your memory:

OTOH, one would react to such a phrase so much probably because it was right on the spot for him. 4x4 and C2Q are different animals, for different purposes. AMD showcased 4x4 by running 4 heavy weight tasks on 4 screens smoothly at the same time; can C2Q do so? Can C2Q even do two?

Quote :

OTOH, it was you who claimed that QFX has higher memory access delay (than QX6700) due to snooping(!).


I claimed that the purpose of SLI is to run games, and that games are heavily dependent on memory latency which is why the Quad FX-74 loses to a FX-62 in games. Which is true as proven out by every review. I never said the Quad FX has higher memory latency than the QX6700.

http://forumz.tomshardware.com/har [...] 56#1494556

Quote :

You went on to claim that it's all because NUMA that is not going to be supported, or NUMA performs poorly even if supported, or you simply don't know how could NUMA be properly supported (please at least pick your favorite...).


I went on to say that NUMA will not fix the memory latency issue, and the FX-74 will simply never be able to match the memory latency of a single socket AMD platform and that NUMA will not be a magic fix that suddenly improves performance in desktop applications.

Quote :

Besides, Tom Yager's article has nothing to do with our discussion here. Does it talk about NUMA? Does it talk about snooping? Are you sure your mind is clear, or were you on some drugs?


It just goes to show your lack of technical knowledge when you post an article that shows a staggering number of mistakes and inaccuracies.

Reply to accord99

Quote :


BS. I said "NUMA," not "Opteron." Here is my original claim: "Snooping efficiency has little to do with being NUMA or UMA. It's a factor of system architecture (FSB or direct connect) and algorithm (MOESI, etc). Further, if a process is bound to a node, and an address space is reserved exclusively for the process, then a cache miss in that address space do not need to consult remote cache at all. This is true to NUMA, but not UMA." Where did I say Opteron?



This is what you said.

2) I see no reason that memory latency on Quad FX would be higher if NUMA is properly supported. Maybe you can tell me how and why?

Even if NUMA is properly supported, a Quad FX's memory latency will simply be unable to match a comparable clocked single socket FX.
First, you didn't tell me why.
Second, your claims are wrong, anyway. QFX's (essentially dual-Opteron) memory latency will be smaller due to its integrated memory controller and direct connect architecture. Under high bandwidth, its memory latency will be smaller due to its NUMA, with benchmarks that are NUMA-optimized.

Quote :


QX6700 wins over QFX the same or less than Conroe over K8. The former's win is clearly on single core optimization (both software and hardware), and has little to do with system architecture. In terms of memory architecture, direct connect (plus NUMA) is huge when it comes to mega-tasking and virtualization.


The QX6700 wins on almost everything, from single applications to mega-tasking and basically shows the direct connection/NUMA advantage has little effect on the overwhelming majority of applications.
You're simply dodging the real point. QX6700's win has little to do with memory architecture and much to do with (single) core optimization.

Quote :


BS. I was speaking of the NUMA architecture, of which both NUMAchine and Horus are examples. X3 OTOH is what you brought on to the table, and indeed has little to do with what I spoke of.


And neither of which have anything to do with your little flip a bit and an Opteron won't have to snoop another socket claim.
Really, how much of a FUDer are you going to be? First, my claim is about NUMA, not Opteron. Second, just because they use another mechanism doesn't mean 1) the mechanism I described can't be done, 2) NUMA must snoop all other sockets.

Quote :


Another BS. You simply don't understand the working of SDRAM. The "10x" increased clockrate is not due to memory access speed (which increases only ~2x in 10 years), but intensive use of banking inside/among the DRAM chips. The "double-ness" of DDR and higher frequency over the SDRAM are not due to memory access speed increase, either, but increased parallel access of memory cells; such a parallelism went from 64-bit of single rate to 128-bit of twice the double rate (DDR2), or about 8x increase over 10 years.


As I said, the bandwidth has increased from the increased clock rates of the memory devices, not by increasing the width of the memory connection or by adding more memory controllers.
As I said, you're confused/mixing up the intra-chip and inter-chip levels. Inside the chips, width of memory access has been increased to support the higher clock rates.

Quote :


AMD never talks about QFX the way enthusiasts like you did.



Let's read what AMD originally said about the 4X4:

New Enthusiast Platform
Building on AMD’s already recognized leadership among PC enthusiasts, AMD also announced plans for a new enthusiast platform codenamed “4x4” that will extend AMD’s long-standing commitment to those consumers who demand the highest-performing PCs. The 4x4 platform features a four-core, multi-socket processor configuration uniquely possible via AMD’s Direct Connect Architecture. The 4X4 platform will be designed to be upgraded to eight total processor cores when AMD launches quad-core processors in 2007. Project 4x4 represents system-level enthusiast enhancements and is designed for ultimate multi-tasking performance across gaming, digital video, processor-intensive and heavily-threaded applications.

I must've missed it! Where did AMD say it'll run faster than QX6700 on games and single-threaded applications?

Quote :


From this discussion I clearly see that you are a FUDer, nothing more. I never said QFX able to handle 4 dual-slot video cards; I said you don't need 4 dual-slots to run 4 video cards.


Let me refresh your memory:

I guess those reviews were performed with 4 graphics cards performing 2+ graphically intensive tasks? Ouch, aren't they?

Or do they simply think like most people on this forum did, that hacking (C2D and C2Q) around for SLI doesn't worth one's efforts? Last time I checked quad-SLI isn't even possible on any Core 2 platform (please correct me if I'm wrong).

You FUDer, where did I say QFX is able to handle 4 dual-slot video cards? OTOH, QFX is able to run 4 video cards. It was demoed by AMD. And I was right that quad-SLI isn't available on Core 2 platform. The Dell system you happily suggested was a Netburst system, not Core 2!

Quote :

I never said QX6700 can't handle two intensive tasks; I clearly said mega-tasking is more than running two intensive tasks.


Let me refresh your memory:

OTOH, one would react to such a phrase so much probably because it was right on the spot for him. 4x4 and C2Q are different animals, for different purposes. AMD showcased 4x4 by running 4 heavy weight tasks on 4 screens smoothly at the same time; can C2Q do so? Can C2Q even do two?
In case you're reading-deficient (not to suggest you are, but you might be, since you really can't read what I said), I was asking: can C2Q even do two?

Quote :

OTOH, it was you who claimed that QFX has higher memory access delay (than QX6700) due to snooping(!).


I claimed that the purpose of SLI is to run games, and that games are heavily dependent on memory latency which is why the Quad FX-74 loses to a FX-62 in games. Which is true as proven out by every review. I never said the Quad FX has higher memory latency than the QX6700.
How many times do I need to tell you that those games are just not optimized for NUMA? What's wrong with your mind, is something blocking it from working? If you run a game on QFX with a single critical path that is sometimes on the first socket and sometimes moved to the second, of course its memory latency will be higher. It's not QFX's problem, it's your problem of not making the game NUMA-aware.
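
The effect of a thread bouncing between sockets can be put into a one-line model: average latency is a weighted mix of local and remote access latency. The sketch below is in Python, the function name `average_latency` is invented here, and the 60 ns and 100 ns figures are placeholders chosen for easy arithmetic, not measurements of any Quad FX or Opteron system.

```python
# Simple weighted-average model of memory latency on a two-socket NUMA box.
# The latency numbers used below are made-up placeholders, not measurements.

def average_latency(local_ns, remote_ns, remote_fraction):
    """Mean access latency when remote_fraction of accesses go to the other node."""
    if not 0 <= remote_fraction <= 1:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1 - remote_fraction) * local_ns + remote_fraction * remote_ns


LOCAL_NS = 60     # hypothetical local-node latency
REMOTE_NS = 100   # hypothetical one-hop remote latency

pinned = average_latency(LOCAL_NS, REMOTE_NS, 0)      # thread and data on one node
bouncing = average_latency(LOCAL_NS, REMOTE_NS, 0.5)  # half the accesses are remote

print(pinned, bouncing)
```

Note that this model only covers where the data lives. It does not include any cost of probing the other socket on local accesses, which is the other half of the disagreement here.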

Your argument is stupid. First you claim QFX is (only) for games just because AMD mentioned "gamers" in its ads. Then you claim the games mean only currently available, non-NUMA-aware ones. Then you start to FUD on NUMA of QFX based on the two erroneous implications above. Yes, game performance depends on memory latency, I never said otherwise, but that doesn't make your arguments a bit more correct.

Quote :

You went on to claim that it's all because NUMA that is not going to be supported, or NUMA performs poorly even if supported, or you simply don't know how could NUMA be properly supported (please at least pick your favorite...).


I went on to say that NUMA will not fix the memory latency issue, and the FX-74 will simply never be able to match the memory latency of a single socket AMD platform and that NUMA will not be a magic fix that suddenly improves performance in desktop applications.
You were again FUDing. NUMA will make memory latency worse if the program is not NUMA-aware. NUMA should not make memory latency worse if the program properly supports NUMA. Your QFX to FX comparison proves nothing, because the games used are not NUMA-aware or optimized.

Quote :

Besides, Tom Yager's article has nothing to do with our discussion here. Does it talk about NUMA? Does it talk about snooping? Are you sure your mind is clear, or were you on some drugs?


It just goes to show your lack of technical knowledge when you post an article that has show a staggering number of mistakes and inaccuracies.
I quoted Tom Yager's article for its description of the shared L3 cache on Barcelona. So that shows my lack of knowledge on QFX? You're just BS-ing.

Reply to abinstein

Quote :


Second, your claims are wrong, anyway. QFX's (essentially dual-Opteron) memory latency will be smaller due to its integrated memory controller and direct connect architecture. Under high bandwidth, it's memory latency will be smaller due to its NUMA, with benchmarks that are NUMA-optimized.


Where are these benchmarks?

Quote :


You're simply docking the real point. QX6700's win has little to do with memory architecture, but the (single) core optimization.


You're missing the point. The direct connect architecture barely matters in comparison to the core.

Quote :

Really, how much a FUDer you're going to be? First, my claim is NUMA, not Opteron. Second, just because they use another mechanism doesn't mean 1) the mechanism I described can't be done, 2) NUMA must snoop all other sockets.


And the fact is, in an Opteron system today, each socket must snoop every other socket before it can do a local memory transaction.

Quote :

you're confused/mixing up the intra-chip and inter-chip levels. Inside the chips, width of memory access has been increased to support the higher clock rates.


And you're missing the fact that this doesn't matter. What matters is that PC designers are not willing to increase the width of the memory connections on the motherboard significantly, which would be needed if multiple memory controller

Quote :


I must've missed it! Where did AMD say it'll run faster than QX6700 on games and single-threaded applications?


The target is enthusiasts and gamers. The fact that it's slower than a FX-62 in games makes it dead on arrival, let alone the across the board loss to the QX6700.

Quote :

You FUDer, where did I say QFX able to handle 4 dual-slot video cards? OTOH, QFX is able to run 4 video cards. It's demoed by AMD. And I was right that quad-SLI isn't available on Core 2 platform. The Dell system you happily suggested was a Netburst system, not Core 2!


It hasn't been demoed by AMD at all. The physical locations of the x16 slots make it impossible to fit 4 top-end video cards so it's a useless feature anyways. And I didn't recommend any system, that was Gojdo who only pointed out that it's readily available on the Intel platform. But that's to be expected, given your lack of memory.

Quote :


In case you're reading deficient (not to suggest you are, but you might since you really can't read what I said), I was asking, Can C2Q even do two?


And the way you phrased it clearly suggested you didn't think it could. Despite the evidence that the QX6700 handles 4 threads better than your beloved Quad FX.

And after that you went on a little incorrect rant about running 4 video cards.

Quote :


How many times do I need to tell you that those games are just not optimized for NUMA. What's wrong with your mind, something blocking its working? If you run a game on QFX with a single critical path sometimes on the first socket and sometimes moved to the second, of course its memory latency will be higher. It's not the problem of QFX, its your problem of not making the game NUMA-aware.


Well, the Quad FX's problem is software engineering then. And of course it's the Quad FX's problem since games will probably never be NUMA optimizable.

Quote :


Your argument is stupid. First you claim QFX is (only) for games just because AMD mentioned "gamers" in its ads.


I never claimed it was just for gamers. I claimed it was targeted for gamers and enthusiasts because that's what AMD announced the platform for, and what the FX series of processors were meant for over the years. If you want AMD to switch their advertising so that the Quad FX is aiming at the heater market, you should tell them that because that would be a good idea.

Quote :

Then you claim the games mean only currently available, non-NUMA-aware ones.


Where are these NUMA aware ones? There won't even be many multi-threaded games in the near future.

Quote :


Then you start to FUD on NUMA of QFX based on the two erroneous implications above. Yes, games performance depend on memory latency, I never said otherwise, but that doesn't make your arguments a bit more correct.


If memory latency is not higher, then why does the 3GHz FX-74 consistently lose to the 2.8GHz FX-62 in games?

Quote :


You were again FUDing. NUMA will make memory latency worse if the program is not NUMA-aware. NUMA should not make memory latency worse if the program properly support NUMA. Your QFX to FX comparison proves nothing, because the games used are not NUMA aware or optimized.


Yawn, give it up. A NUMA-aware program can't stop the Opterons from snooping each other, which is what causes the memory latency increase.

Quote :


I quote Tom Yager's article for its description on the shared L3 cache on Barcelona. So that shows my lack of knowledge on QFX? You're just BS-ting.


What description? He doesn't reveal anything new, and even in that paragraph alone he has at least two mistakes.

Reply to accord99

Quote :


Second, your claims are wrong, anyway. QFX's (essentially dual-Opteron) memory latency will be smaller due to its integrated memory controller and direct connect architecture. Under high bandwidth, it's memory latency will be smaller due to its NUMA, with benchmarks that are NUMA-optimized.


Where are these benchmarks?
It's not that difficult to write one and test the memory latency yourself.

Quote :


You're simply docking the real point. QX6700's win has little to do with memory architecture, but the (single) core optimization.


You're missing the point. The direct connect architecture barely matters in comparison to the core.
Direct connect architecture matters when it comes to memory latency.

Quote :

Really, how much a FUDer you're going to be? First, my claim is NUMA, not Opteron. Second, just because they use another mechanism doesn't mean 1) the mechanism I described can't be done, 2) NUMA must snoop all other sockets.


And the fact is, in an Opteron system today, each socket must snoop every other socket before it can do a local memory transaction.
The fact is, this is also true of the dual-socket Intel machines, maybe even more so. So now you're not arguing against NUMA, but against multiple sockets on the desktop? If so, don't mention NUMA or QFX, because these are not the only multiple-socket desktop systems.

Quote :

You're mixing up the intra-chip and inter-chip levels. Inside the chips, the width of memory access has been increased to support the higher clock rates.


And you're missing the fact that this doesn't matter. What matters is that PC designers are not willing to significantly increase the width of the memory connections on the motherboard, which would be needed if multiple memory controller
I'll leave that decision to the market, not you.

Quote :


I must've missed it! Where did AMD say it'll run faster than QX6700 on games and single-threaded applications?


The target is enthusiasts and gamers. The fact that it's slower than an FX-62 in games makes it dead on arrival, let alone the across-the-board loss to the QX6700.
Can't you just answer my question straight? Where did AMD say it'll run faster than QX6700 on games and single-threaded applications? Because that's the way you described it.

Quote :

You FUDer, where did I say QFX is able to handle 4 dual-slot video cards? OTOH, QFX is able to run 4 video cards. It's been demoed by AMD. And I was right that quad-SLI isn't available on the Core 2 platform. The Dell system you happily suggested was a Netburst system, not Core 2!


It hasn't been demoed by AMD at all. The physical locations of the x16 slots make it impossible to fit 4 top-end video cards, so it's a useless feature anyway. And I didn't recommend any system; that was gOJDO, who only pointed out that it's readily available on the Intel platform. But that's to be expected, given your lack of memory.
gOJDO or you, doesn't matter. Quad-SLI is not available on Core 2, and I didn't say it's available on K8. Plus, I said 4 video cards, not 4 top-end or dual-slot ones. You really can't read, and keep being an illogical FUDer.

Quote :


In case you're reading-deficient (not to suggest you are, but you might be, since you really can't read what I said), I was asking: can C2Q even do two?


And the way you phrased it clearly suggested you didn't think it could, despite the evidence that the QX6700 handles 4 threads better than your beloved Quad FX.

And after that you went on a little incorrect rant about running 4 video cards.
The way I phrased it? It's your way of reading that has a problem. And just admit it, there are 4 video cards (built-in or dual-slot) in the 4x4 system AMD demoed.

Quote :


How many times do I need to tell you that those games are just not optimized for NUMA? What's wrong with your mind, is something blocking it from working? If you run a game on QFX with a single critical path that is sometimes on the first socket and sometimes moved to the second, of course its memory latency will be higher. It's not a problem of QFX, it's your problem of not making the game NUMA-aware.


Well, the Quad FX's problem is software engineering then. And of course it's the Quad FX's problem since games will probably never be NUMA optimizable.
If we run a game inside a HW VM, the game itself doesn't need to be NUMA-optimized to benefit from QFX, as long as the virtual machine that runs it is NUMA-aware.
But this is not the point. The point is that running a game as-is on QFX doesn't show the system's strengths or benefits. You keep talking as if all gamers want is just an increase from 100fps to 110fps, which is bullshit. Your eyes can't even tell the difference. What's far more interesting is to run one game on one (top-end) card, a few other intensive tasks on another card, and yet a few other programs, probably in a different virtual machine. AMD demoed smooth operation in a similar scenario; I've yet to see an Intel box that demos such.
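
The claim about a thread bouncing between sockets can be put as plain arithmetic. A rough sketch, assuming only two latency classes (local and remote) and made-up numbers of 60 ns and 100 ns; neither figure comes from this thread or from any measurement, and average_latency_ns is a hypothetical helper name:

```python
def average_latency_ns(local_fraction, local_ns, remote_ns):
    """Expected memory latency when a fraction of accesses hit local memory.

    local_fraction: share of accesses served by the socket's own memory (0..1).
    local_ns / remote_ns: assumed latencies for local and remote accesses.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    # Hypothetical latencies, for illustration only.
    for frac in (1.0, 0.5, 0.0):
        avg = average_latency_ns(frac, 60.0, 100.0)
        print(f"{frac:.0%} local -> {avg:.0f} ns average")
```

Under these assumptions, a thread that the scheduler moves freely between sockets ends up somewhere between the two extremes, while one that stays next to its memory pays only the local figure. Whether real games can be arranged that way is exactly what the two posters disagree about.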

Quote :


Your argument is stupid. First you claim QFX is (only) for games just because AMD mentioned "gamers" in its ads.


I never claimed it was just for gamers. I claimed it was targeted for gamers and enthusiasts because that's what AMD announced the platform for, and what the FX series of processors were meant for over the years. If you want AMD to switch their advertising so that the Quad FX is aiming at the heater market, you should tell them that because that would be a good idea.
You claim the purpose of QFX is to run games. So... it's not just for gamers, but for enthusiasts (whose purpose is still to run games) as well? Please... where's your logic at all?

Quote :

Then you claim the games mean only currently available, non-NUMA-aware ones.


Where are these NUMA aware ones? There won't even be many multi-threaded games in the near future.
Games are not the only type of program that enthusiasts run. Nor is running one game with all the power of a system the only plausible scenario.

Quote :


Then you start to FUD on the NUMA of QFX based on the two erroneous implications above. Yes, game performance depends on memory latency, I never said otherwise, but that doesn't make your arguments a bit more correct.


If memory latency is not higher, then why does the 3GHz FX-74 consistently lose to the 2.8GHz FX-62 in games?
BECAUSE THE PROGRAMS ARE NOT NUMA AWARE/OPTIMIZED.
Gosh.... what's the problem with you?
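
As a rough illustration of the coarsest kind of "NUMA awareness", a process can simply be kept on the cores of one socket so that the scheduler doesn't move it across. The sketch below uses Python's os.sched_setaffinity, which is available on Linux only; pin_to_cpus is a hypothetical helper, and which core IDs belong to which socket is system-specific and assumed here, not detected. Affinity only controls where the code runs; where its memory lands depends on the operating system's allocation policy.

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict a process (0 = the calling one) to the given CPU IDs.

    Returns the previous affinity set so the caller can restore it.
    Linux-only: os.sched_setaffinity does not exist on other platforms.
    """
    previous = os.sched_getaffinity(pid)
    os.sched_setaffinity(pid, cpus)
    return previous


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)
    # Assumption for illustration: the lowest-numbered allowed core
    # sits on the socket whose memory we want to stay close to.
    previous = pin_to_cpus({min(allowed)})
    print("now restricted to:", os.sched_getaffinity(0))
    pin_to_cpus(previous)  # restore the original affinity
```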

Quote :


You were again FUDing. NUMA will make memory latency worse if the program is not NUMA-aware. NUMA should not make memory latency worse if the program properly supports NUMA. Your QFX to FX comparison proves nothing, because the games used are not NUMA-aware or NUMA-optimized.


Yawn, give it up. A NUMA-aware program can't stop the Opterons from snooping each other, which is what causes the memory latency increase.
Bullshit. All Intel-based multiprocessors must snoop each other; NUMA doesn't solve it, but it doesn't make it worse. I've said this to you before, but it seems your head has a problem acquiring knowledge. THE CAUSE OF THE NUMA LATENCY INCREASE IS NOT SNOOPING, BUT LACK OF SOFTWARE SUPPORT. (Don't even mention the snoop filter here, because it helps Opteron, too.)
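
The snooping side of this disagreement can be written down as a toy model. Assume a broadcast scheme in which a local read counts as complete only once both the local DRAM access and the slowest probe response from the other sockets have arrived. That assumption, the function name local_read_latency_ns and every number below are illustrative only and are not taken from any vendor documentation:

```python
def local_read_latency_ns(dram_ns, probe_round_trips_ns):
    """Toy model of a local read under broadcast snooping.

    dram_ns: assumed latency of the local DRAM access.
    probe_round_trips_ns: assumed round-trip time of the probe to each
        other socket (empty for a single-socket system).
    The read finishes when the slower of the two paths finishes.
    """
    slowest_probe = max(probe_round_trips_ns, default=0.0)
    return max(dram_ns, slowest_probe)


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    print(local_read_latency_ns(60.0, []))            # one socket
    print(local_read_latency_ns(60.0, [40.0]))        # probe hidden behind DRAM
    print(local_read_latency_ns(60.0, [40.0, 90.0]))  # far socket dominates
```

In this model, snooping costs nothing extra as long as the probes return before the DRAM access does, and it starts to dominate once any socket is far enough away. Which of the two cases applies to a two-socket Quad FX is the factual question the posters are arguing about; the model doesn't settle it.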

Quote :


I quote Tom Yager's article for its description on the shared L3 cache on Barcelona. So that shows my lack of knowledge on QFX? You're just BS-ting.


What description? He doesn't reveal anything new, and even in that paragraph alone he makes at least two mistakes.
Whether he made mistakes or not doesn't change the fact that the article has nothing to do with what I'm saying here, and your mentioning his article only shows a logic as messed-up as a plate of spaghetti.

Reply to abinstein

Well Barcelona is coming out soon which should replace that 4x4 soon enough!

Reply to bfellow

Quote :


it's not that difficult to write one and test the memory latency yourself.


What's the point, when I can see that the applications are slower.

Quote :


direct connect architecture matters when it comes to memory latency.


But it doesn't particularly matter for the desktop.

Quote :


the fact is, this is also true to the dual-socket Intel machines. maybe truer. so now you're not talking against NUMA, but multiple sockets on desktop?


Yes in general, which is why AMD's Quad FX is a knee-jerk reaction.

Quote :


I'll leave that decision to the market, not you.


The market's already made the decision. Systems are getting smaller, notebooks are increasing in popularity.

Quote :


Can't you just answer my question straight? Where did AMD say it'll run faster than QX6700 on games and single-threaded applications? Because that's the way you described it.


The point is that it's supposed to be the ultimate platform for the enthusiast; your YouTube video spokesman even said it.

Quote :

Quad-SLI is not available on Core 2, and I didn't say it's available on K8. Plus, I said 4 video cards, not 4 top-end or dual-slot ones. You really can't read, and keep being a FUDer in descent to a liar.


It's not available for any platform, unless you count the 7950 GX2s. And 4 SLI-ed video cards that are less than top-end will get destroyed by two top-end video cards in the end.

Quote :


The way I phrase? it's your way of reading that has problem. And just admit it, there are 4 video cards (built-in or dual-slot) in the 4x4 system AMD demoed.


AMD has never made such a demonstration.

Quote :


But this is not the point. The point is that running a game as-is on QFX doesn't show the system's strengths or benefits. You keep acting as if all enthusiast gamers want is just an increase from 100fps to 110fps, which is bull ****.


Well, they sure don't want a decrease to 95fps after spending a couple of grand.

Quote :

What's far more interesting is to run one game on one top-end card, a few other intensive tasks on another card, and probably a few other programs in a totally different virtual machine. AMD demoed a smooth operation in such scenario; I've yet to see an Intel box that demos such.


They haven't demoed anything like that. They had a WoW title screen, a couple of movies and a simple encoder. Not particularly tough for a dual-core.

Quote :


You claim the purpose of QFX is to run games. Right... it's not just for gamers, but for enthusiasts (who are not gamers but whose purpose is still to run games) as well.


And for gamers it's slower than the FX-62, and for enthusiasts it's slower than the QX6700.

Quote :


Games are not the only type of program that enthusiasts run. Nor is running one game with all the power of the system the only type of scenario.


Well, the Quad FX will be waiting a long time for the particular software to arrive.

Quote :


BECAUSE THE PROGRAMS ARE NOT NUMA AWARE/OPTIMIZED.
Gosh.... what's the problem with you?


And almost all applications never will be. That's AMD's problem.

Quote :


Bullshit. All Intel-based multiprocessors must snoop each other; NUMA doesn't solve it, but it doesn't make it worse. I've said this to you before, but it seems your head has a problem acquiring knowledge. THE CAUSE OF THE NUMA LATENCY INCREASE IS NOT SNOOPING, BUT LACK OF SOFTWARE SUPPORT. (Don't even mention the snoop filter here, because it helps Opteron, too.)


Intel's snooping doesn't have a significant impact on memory latency, so, unlike with the Quad FX, there is no noticeable loss of performance in games and desktop applications.

Quote :


Whether he made mistakes or not doesn't change the fact that the article has nothing to do with what I'm saying here, or the fact that your logic, based on what you've claimed above, is as thick as a plate of spaghetti.


It does have to do with your lack of knowledge.

Reply to accord99

Quote :

Well Barcelona is coming out soon which should replace that 4x4 soon enough!


That may or may not be true. Barcelona can replace 4x4, or fit into 4x4. The point of 4x4 is not to be the fastest quad-core for running a game, but to be a more efficient way to realize a multi-socket desktop. NUMA will help performance when it comes to hardware-based virtualization, which is becoming more and more of a reality given the increase in processing cores and in the total amount of memory a system has. The FSB will soon be obsolete except for low-end legacy purposes, just as Ethernet hubs gave way to high-speed switches.

Reply to abinstein