K10 benchmark investigation; POV-Ray experts needed :P

m25

Distinguished
May 23, 2006
2,363
0
19,780
Well, since we're all confused about yesterday's first kind-of benchmark from AMD, one of the most crucial things to clarify is how POV-Ray used those systems in the tests. If anybody has exact information on any POV-Ray limit on thread count, it would help us a lot to determine WHAT THE **** (replace the '*' with what you please) HAPPENED in that test.
 

WR

Distinguished
Jul 18, 2006
603
0
18,980
This is all from observing others' data and discussions:

POVRay certainly has no problem scaling to 16 cores on 4 sockets. It exhibits no measurable slowdown from 1 to 8 or 16 cores on a Clovertown system with centralized memory.

There is some mention that the setup of the K10 caches can interfere slightly with distributed workloads - while the IMC more than compensates in other cases.

It comes down to a 4S K10 having 36 different caches to keep coherent, as each L1-D, L2, and L3 potentially contains a unique copy of overlapping data. A 4S Clovertown system, in contrast, has just 8 caches for the controller to worry about, since L1 data is contained wholly in L2.
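For anyone who wants to check the cache counts above, here is a quick sketch of the arithmetic. It only follows the reasoning in this post (4 cores per K10 socket, each with its own L1-D and L2, plus one L3 per socket; one shared L2 per dual-core die on the Intel side). The function names are made up for illustration.

```python
# Count the caches that take part in coherency, following the post's reasoning.

def k10_caches(sockets, cores_per_socket=4):
    # Each K10 core has its own L1-D and its own L2; each socket adds one shared L3.
    return sockets * (cores_per_socket * 2 + 1)

def clovertown_caches(sockets, dies_per_socket=2):
    # Per the post: L1 data is contained wholly in L2, and each dual-core die
    # has a single shared L2, so only the L2s need to be counted.
    return sockets * dies_per_socket

print(k10_caches(4))         # 36
print(clovertown_caches(4))  # 8
```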

POVRay is a benchmark that fits entirely into cache, so to speak, and so it downplays the fact that 4S K10 has 8 channels of RAM, while simultaneously it exacerbates the snooping inefficiency caused by so many separate caches.
 

Thanatos421

Distinguished
Mar 26, 2007
549
0
18,990
Well put.

As an aside: no one will know how Barcelona will compete against Intel's offerings until AMD wants us to know. Period.

I am still optimistic. I love AMD, although, not blindly. I really want Barcelona to exceed what Intel has to offer for the time being. I'm sure the 2nd coming of AMD won't last as long though. Intel is out for blood now and they seem to have their proverbial pedal to the metal.

Here's hoping!
 

r0ck

Distinguished
Oct 8, 2006
469
0
18,780
I have a theory! Intel would not have reminded us of their V8 score unless they were somehow sure that AMD couldn't trump their score. I'll leave it at that.
 

HYST3R

Distinguished
Feb 27, 2006
463
0
18,780
I too was wondering about the IMCs with so many separate chips, but I'm not sure that's the only problem.

I'm still kind of confused about the whole thing: was AMD trying to show off power consumption, or something else?
 

m25

Distinguished
May 23, 2006
2,363
0
19,780
I don't think cache has anything to do with it: rendering is more or less a streaming job where cache hardly matters and only FP efficiency and, to some extent, branching count, especially on the AMD (K8) arch. You can see this in the CPU charts by comparing the render time of a 2.0GHz / 1M L2 Athlon64 with a 2.0GHz / 128K L2 one; the difference is an insignificant 0.58%, or 1 sec, while the L2 differs by a factor of 8:
[Chart image: 257-263-74.png]
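A quick sanity check of those figures, as a sketch. The only inputs are the numbers quoted in the post (1M vs 128K of L2, a 1 sec gap, 0.58%). The render time printed at the end is merely what those two numbers imply together, not a value read off the chart.

```python
# Figures quoted in the post.
l2_large_kb = 1024   # 1M L2
l2_small_kb = 128    # 128K L2
diff_seconds = 1.0   # render-time gap between the two chips
diff_fraction = 0.0058  # the same gap expressed as 0.58%

# Cache size ratio between the two Athlon64 models.
cache_ratio = l2_large_kb / l2_small_kb
print(cache_ratio)  # 8.0

# Render time implied by "1 sec is 0.58%" (derived, not taken from the chart).
implied_render_time = diff_seconds / diff_fraction
print(round(implied_render_time))  # 172
```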
 

BaronMatrix

Splendid
Dec 14, 2005
6,655
0
25,790
This is all from observing others' data and discussions:

POVRay certainly has no problem scaling to 16 cores on 4 sockets. It exhibits no measurable slowdown from 1 to 8 or 16 cores on a Clovertown system with centralized memory.

There is some mention that the setup of the K10 caches can interfere slightly with distributed workloads - while the IMC more than compensates in other cases.

It comes down to a 4S K10 having 36 different caches to keep coherent, as each L1-D, L2, and L3 potentially contains a unique copy of overlapping data. A 4S Clovertown system, in contrast, has just 8 caches for the controller to worry about, since L1 data is contained wholly in L2.

POVRay is a benchmark that fits entirely into cache, so to speak, and so it downplays the fact that 4S K10 has 8 channels of RAM, while simultaneously it exacerbates the snooping inefficiency caused by so many separate caches.


You are incorrect about the caching setup. L1D and L2 are different. L3 is a victim cache and also contains different data, but only shared data. RealWorld did a really in-depth analysis recently (someone posted the link) that shows how Barcelona handles it.

It only has to keep the L3 coherent so that's only 4 caches. Because they are still on HT1.1 they can only go out to 4 sockets and keep the 1-hop latency.

The closest approximation from all of the other POV-Ray scaling articles is that the Opteron was running at 2.8GHz and the Barcelona at 1.9-2.1GHz (the starting frequencies). Again, I don't know, but it looks like the test would need to be run with a 2P Barcelona vs. a 4P Opteron; that would make it 8 cores vs. 8 cores.

There seems to be NO scaling from 8 to 16 cores, so it could be that the SW isn't as multithreaded as it could be. Maybe it's too coarse, putting large blocks on each core rather than smaller ones; that would put more stress on the load/store mechanism than on the functional units.
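To make the "too coarse" idea concrete, here is a toy sketch of one way large blocks can stall scaling from 8 to 16 cores: load imbalance. This is not POV-Ray's actual scheduler and not necessarily the effect meant above. The per-row costs, block counts, and function names are all hypothetical.

```python
import heapq

def makespan(block_costs, cores):
    # Greedy dynamic scheduling: the next block goes to whichever core frees up first.
    finish = [0.0] * cores
    heapq.heapify(finish)
    for cost in block_costs:
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + cost)
    return max(finish)

def split(row_costs, n_blocks):
    # Cut the image into n_blocks equal bands of rows and sum the cost of each band.
    size = len(row_costs) // n_blocks
    return [sum(row_costs[i * size:(i + 1) * size]) for i in range(n_blocks)]

# Hypothetical scene: 512 rows, with the middle quarter 5x as expensive to render.
rows = [5.0 if 192 <= r < 320 else 1.0 for r in range(512)]
total = sum(rows)

for cores in (8, 16):
    coarse = makespan(split(rows, 16), cores)   # 16 large blocks
    fine = makespan(split(rows, 256), cores)    # 256 small blocks
    print(cores, "cores: coarse speedup", round(total / coarse, 2),
          "fine speedup", round(total / fine, 2))
```

With these made-up costs the coarse split finishes in 192 time units on 8 cores and 160 on 16, because a few expensive blocks dominate, while the fine split keeps gaining from the extra cores.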

If you look at the way Valve is doing theirs, they noted that you can lose perf when increasing multithreading. This is the first multithreaded version of POV-Ray and it seems to be listed as a beta everywhere.

Also, on the site they only mention single core, dual core and quad core, not 16 cores. Though at the bottom of the beta page they say "There have been reports of benefits for users of hyperthreading systems, particularly with higher thread counts (e.g. 16 threads)."

I'm not really sure if C2Q or Barcelona or Opteron are bearing this out, though.

I guess someone should be asking them. They have the source code for 3.6 but not 3.7.
 
I would think that the key would be reducing the number of times you dynamically allocate memory. The program (in this case POV-Ray) must treat each thread independently, allocate memory for its data, and work within its specific cpu memory area.

Since each processor in an AMD multi-processor / socket system has its own local memory - when a processor accesses remote memory (i.e., cpu0 to cpu1 memory banks) latencies have to grow, right?

Can one of you programmers explain the Allocation Call Tree? In the case of an AMD SMP rig do you have linear reading/writing from/to memory? I don't see where that is possible when cpu0 has to access data from cpu1 memory . . .

I'll need one of yahs to figure this one out . . .
 

WR

Distinguished
Jul 18, 2006
603
0
18,980
You are incorrect about the caching setup. L1D and L2 are different. L3 is a victim cache and also contains different data, but only shared data. RealWorld did a really in-depth analysis recently (someone posted the link) that shows how Barcelona handles it.

It only has to keep the L3 coherent so that's only 4 caches.
In order for coherency among the 4 L3's to be sufficient, the L3 must act as a write cache for all data from 4 cores, which is the reverse of the K10 victim read-only design - any modified L3 data is written back into L1D or L2, then that portion in L3 is flushed.

There are 16 independent threads, each of which touches an individual L1D and/or L2, so the minimum number of caches to worry about is 32 - assuming the L3 never steps in. The processors don't know these are independent threads because there is no central manager among the 16 cores telling them so.

The reason 4S Clovertown gets this down to 8 caches is that (1) the L1D is included fully in L2 and (2) each L2 contains all the cached data for two cores and handles all the read/write load. Neither method is present in K10, or else we'd see only L1's and a single, large L2 per die.

I don't think cache has anything to do with it: rendering is more or less a streaming job where cache hardly matters and only FP efficiency and, to some extent, branching count, especially on the AMD (K8) arch. You can see this in the CPU charts by comparing the render time of a 2.0GHz / 1M L2 Athlon64 with a 2.0GHz / 128K L2 one; the difference is an insignificant 0.58%, or 1 sec, while the L2 differs by a factor of 8:
This is not about the amount of cache but the simple yet tedious problem of keeping so many different caches in sync.
 
Tweakers.net had a review of an 8-socket Sun x4600. POV-Ray shows improved performance between 8 and 16 threads. The scaling from 8->16 is much poorer than from 4->8, but I can't say whether that's because of POV-Ray or because of the 8 sockets.

http://tweakers.net/reviews/674/9

I'm just guessin' . . . I'd say it's due to a single cpu (let's say cpu1) having to access memory at a separate socket (either cpu0, cpu2 or cpu3). I know this is really simplistic (due to my pea brain) but are you not potentially introducing a 4-fold increase in latencies (i.e., per cpu socket memory banks) ??
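A rough way to think about that question, as a sketch: if pages are spread evenly across sockets and every remote access costs the same single hop, the average latency is a weighted mix of local and remote latency, so it is bounded by the remote figure rather than multiplied by the socket count. The latency values below are hypothetical placeholders, not measurements of any Opteron or Barcelona system.

```python
def average_latency_ns(sockets, local_ns, remote_ns):
    # Assumption: memory pages are spread evenly, so 1/sockets of accesses are local
    # and every remote access costs the same (one hop).
    local_share = 1.0 / sockets
    return local_share * local_ns + (1.0 - local_share) * remote_ns

LOCAL_NS = 70.0    # hypothetical local-memory latency
REMOTE_NS = 110.0  # hypothetical one-hop remote latency

for sockets in (1, 2, 4):
    print(sockets, "sockets:", average_latency_ns(sockets, LOCAL_NS, REMOTE_NS), "ns")
```

Under these assumptions a 4-socket box averages 100 ns against 70 ns for a single socket; a NUMA-aware program that keeps each thread's data local would land closer to the 70 ns end.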
 

crazypyro

Distinguished
Mar 4, 2006
325
0
18,780
Since the version of POV-Ray used is more or less still in the beta phase, why would AMD use it solely for the purpose of a PR benchmark? I've come up with a few answers myself:

(1) They are marketing to the less technically inclined "omfg Barcelona is 2x faster than Opteron" crowd, showing that 4 K10s (effectively 16 cores) double the performance of 4 Opterons (effectively 8 cores)

(2) AMD was using POV-ray to test their new architecture before Computex and released the results to appease the raging media and public

(3) AMD has Alzheimer's (I think they need to get their F@H R600 farms running)
 

BaronMatrix

Splendid
Dec 14, 2005
6,655
0
25,790
I would think that the key would be reducing the number of times you dynamically allocate memory. The program (in this case POV-Ray) must treat each thread independently, allocate memory for its data, and work within its specific cpu memory area.

Since each processor in an AMD multi-processor / socket system has its own local memory - when a processor accesses remote memory (i.e., cpu0 to cpu1 memory banks) latencies have to grow, right?

Can one of you programmers explain the Allocation Call Tree? In the case of an AMD SMP rig do you have linear reading/writing from/to memory? I don't see where that is possible when cpu0 has to access data from cpu1 memory . . .

I'll need one of yahs to figure this one out . . .


I would say you have to look at some of the older benchmarks that use different size RAM blocks. IIRC, some apps like the larger blocks and some don't.

All of the POVRay tests I have seen show VERY LITTLE scaling from 8-16 threads. Even C2Q. 32 threads showed very little also from 16. Perhaps a better test would be SQL Server or Oracle.

AMD has analog tracing gear that can simulate real-world usage of transistors (Intel does too). They would have known many moons ago if it wasn't tracing better than Opteron.

Again, it's just a few weeks until Computex and more light should be shed.
 

BaldEagle

Distinguished
Jul 28, 2004
652
0
18,980
All of the POVRay tests I have seen show VERY LITTLE scaling from 8-16 threads. Even C2Q. 32 threads showed very little also from 16. Perhaps a better test would be SQL Server or Oracle.

There are not many sites that show data for 8-16 cores (try and look). However, as for what you refer to, 16 to 32 threads on a C2Q not scaling at all ... well, the scaling stops when the thread count exceeds the core count... duh. You are not very bright.
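The point about thread count versus core count can be written down as a one-line model. This is an idealised sketch that ignores scheduling overhead and hyperthreading; the function name is made up for illustration.

```python
def ideal_speedup(threads, cores):
    # Extra threads beyond the number of physical cores just take turns on the
    # same hardware, so the best-case speedup is capped at the core count.
    return min(threads, cores)

# A Core 2 Quad has 4 cores, so 16 and 32 threads give the same ceiling.
for threads in (1, 2, 4, 8, 16, 32):
    print(threads, "threads ->", ideal_speedup(threads, 4), "x")
```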

There you go, now he is going to say something else stupid for Wombat to put into his sig.
 

BaldEagle

Distinguished
Jul 28, 2004
652
0
18,980
:) :) Baron has 5779 posts, of which 5769 are something stupid -- he has plenty to choose from.

Yes, but of the 5769 stupid ones there are only 30 or so that are worthy of repeating over and over again. Best of luck educating Baron; he seems to be a real slow learner.
 

BaronMatrix

Splendid
Dec 14, 2005
6,655
0
25,790
You are incorrect about the caching setup. L1D and L2 are different. L3 is a victim cache and also contains different data, but only shared data. RealWorld did a really in-depth analysis recently (someone posted the link) that shows how Barcelona handles it.

It only has to keep the L3 coherent so that's only 4 caches.
In order for coherency among the 4 L3's to be sufficient, the L3 must act as a write cache for all data from 4 cores, which is the reverse of the K10 victim read-only design - any modified L3 data is written back into L1D or L2, then that portion in L3 is flushed.

There are 16 independent threads, each of which touches an individual L1D and/or L2, so the minimum number of caches to worry about is 32 - assuming the L3 never steps in. The processors don't know these are independent threads because there is no central manager among the 16 cores telling them so.

The reason 4S Clovertown gets this down to 8 caches is that (1) the L1D is included fully in L2 and (2) each L2 contains all the cached data for two cores and handles all the read/write load. Neither method is present in K10, or else we'd see only L1's and a single, large L2 per die.

I don't think cache has anything to do with it: rendering is more or less a streaming job where cache hardly matters and only FP efficiency and, to some extent, branching count, especially on the AMD (K8) arch. You can see this in the CPU charts by comparing the render time of a 2.0GHz / 1M L2 Athlon64 with a 2.0GHz / 128K L2 one; the difference is an insignificant 0.58%, or 1 sec, while the L2 differs by a factor of 8:
This is not about the amount of cache but the simple yet tedious problem of keeping so many different caches in sync.


Again you should look at some of the Barcelona analysis articles. The way it works is AMD uses a MOESI protocol to determine if a query is needed. In K8 all cores were always queried, but Barcelona has a new method where if a cache line is not marked shared then the query doesn't happen.
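As a toy illustration of the idea in this post (always query the other caches versus query only when a line may be shared), here is a sketch that counts probe messages for a string of writes. It is a simplified model built on the textbook MOESI states and is not a description of how K8 or Barcelona hardware actually works. The access pattern and function names are invented.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"

def probes_broadcast(write_states, other_caches):
    # "Always query" model: every write probes every other cache.
    return len(write_states) * other_caches

def probes_filtered(write_states, other_caches):
    # Filtered model: only lines that may have copies elsewhere (Shared or Owned)
    # trigger probes; Modified and Exclusive lines are known to be private.
    maybe_shared = (State.SHARED, State.OWNED)
    return sum(other_caches for s in write_states if s in maybe_shared)

# Hypothetical run of writes, each tagged with the state of the line being written.
writes = [State.MODIFIED, State.EXCLUSIVE, State.SHARED, State.EXCLUSIVE,
          State.OWNED, State.MODIFIED, State.SHARED, State.EXCLUSIVE]

print(probes_broadcast(writes, other_caches=3))  # 24
print(probes_filtered(writes, other_caches=3))   # 9
```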

Even then a 4P Opteron has seriously low NUMA latency. They also don't say if the Barcelona is using DDR2-800.

I guess the difference between what I think and what others think is that most of us can't have it faster than C2Q because you all talk too much (include yourself if you want).

I believe that AMD has Itanium, Alpha, and K7 engrs/designers so they will do what they set out to. Beat the crap out of

OPTERON.
 

halbhh

Distinguished
Mar 21, 2006
965
0
18,980
If someone makes you want to lower yourself to making insults, I've found it's usually better to just dryly point out objective facts in your reply. Else we end up in an endless back and forth that's really about ego. I'm just objecting to the forumz degrading into flames. If the folks like you and me start flaming, that will result in pretty much nothing but flames in 98% of threads after the first post or two, imo.
 

m25

Distinguished
May 23, 2006
2,363
0
19,780
Again you should look at some of the Barcelona analysis articles.


He has, Baron, and he is interpreting them correctly --- you are not.

Even then a 4P Opteron has seriously low NUMA latency. They also don't say if the Barcelona is using DDR2-800.

Define seriously low.... it is this latency that causes the 4x4 to take a performance hit when you compare 1 vs 2 sockets populated:
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2879&p=6
However you interpret those results, they don't make much sense either; if both kinds of cores were running at the same frequency, how would it be possible for a K10 to have the same, identical IPC as a K8?! I think that in the end this test was just set up to show what they wanted to show; most probably there were 2.8GHz Opterons and 1.9GHz Barcelonas. First the guy says they are the FASTEST Optys they have (that means the 3.0GHz ones) and Barcelona is not even planned to clock that high, so how can they run at the same frequency?! Then he says that the Barcelona was not the fastest clocked they plan to deliver, and I don't think that they will release a 3.0GHz+ quad; it's a stupid paradox however you see it :roll:
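The IPC argument can be put into numbers with a small sketch. It assumes performance is proportional to cores x IPC x frequency with perfect multi-core scaling, and it takes the "double the performance with 16 cores vs 8" claim discussed in this thread as the input. The clock speeds are the ones guessed at above, not confirmed figures.

```python
def implied_ipc_ratio(perf_ratio, old_cores, old_ghz, new_cores, new_ghz):
    # perf ~ cores * IPC * GHz (assumes perfect scaling across cores), so
    # IPC_new / IPC_old = perf_ratio * (old_cores * old_ghz) / (new_cores * new_ghz)
    return perf_ratio * (old_cores * old_ghz) / (new_cores * new_ghz)

# Same frequency on both systems: 2x the cores giving 2x the score means no IPC gain.
print(implied_ipc_ratio(2.0, 8, 2.8, 16, 2.8))            # 1.0

# Guessed clocks of 2.8GHz Opteron vs 1.9GHz Barcelona.
print(round(implied_ipc_ratio(2.0, 8, 2.8, 16, 1.9), 2))  # 1.47
```

So under these assumptions the same result reads as either zero per-clock improvement or roughly 47% more work per clock, depending entirely on which frequencies were actually used.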
 

HotFoot

Distinguished
May 26, 2004
789
0
18,980
I couldn't see from the video you linked above what the CPU load was for the 4xQuad Barcelona. If something else is holding up these processors so that they're not being utilised as well as the Opterons, then we would know something is definitely up. How the hell could Barcelona not have IPC improvements over K8? This could be incredibly disappointing.

In AMD's defence, they are offering double the performance in the same thermal envelope. Now, if they offer it at a competitive cost, then they could still have a winning solution on their hands. Doubling performance for server-type tasks is what matters... not how many cores you used to get there, so long as you fit in the same platform.

On the other hand, what this means to home users is that Barcelona may have nothing incredible to offer at all. If the only useful new feature is the ability to choose a quad-core, I see little reason to expect Barcelona to keep up with the Core 2 Quads, some of which will have been around for a full year by the time Barcelona is out of the gate.
 

halbhh

Distinguished
Mar 21, 2006
965
0
18,980
[......


However you interpret those results, they don't make much sense either; if both kinds of cores were running at the same frequency, how would it be possible for a K10 to have the same, identical IPC as a K8?! I think that in the end this test was just set up to show what they wanted to show; most probably there were 2.8GHz Opterons and 1.9GHz Barcelonas. First the guy says they are the FASTEST Optys they have (that means the 3.0GHz ones) and Barcelona is not even planned to clock that high, so how can they run at the same frequency?! Then he says that the Barcelona was not the fastest clocked they plan to deliver, and I don't think that they will release a 3.0GHz+ quad; it's a stupid paradox however you see it :roll:

This does make sense, and is a good possibility. So it's logical imo to try to disprove this, and guess it may be right until disproved.
 

r0ck

Distinguished
Oct 8, 2006
469
0
18,780
most probably there were 2.8GHz Opterons and 1.9GHz Barcelonas. First the guy says they are the FASTEST Optys they have (that means the 3.0GHz ones) and Barcelona is not even planned to clock that high, so how can they run at the same frequency?! Then he says that the Barcelona was not the fastest clocked they plan to deliver, and I don't think that they will release a 3.0GHz+ quad; it's a stupid paradox however you see it :roll:

http://www.youtube.com/watch?v=VGiv9Dtrc5Q
I recall "identical frequency" and don't remember "fastest Opteron".
 

halbhh

Distinguished
Mar 21, 2006
965
0
18,980
Re what frequencies, it's possible Randy Allen misunderstood what the setup was. Just a possibility. Also, even though the "2 core" is an "opteron", does that guarantee it's not a new arch opty?

I don't know anything about POV. Is there a reason to expect it to be better with a better arch?
 

gOJDO

Distinguished
Mar 16, 2006
2,309
1
19,780
Baron, have you ever tried to break a brick with your head?
Honestly, I really wonder...

I doubt it --- why ruin a perfectly good brick. What if you need a half of the brick?
Imagine the effort needed to break it with hammer. Why agonize if Baron can break it into pieces with one head-shot? I am just guessing...
 

HotFoot

Distinguished
May 26, 2004
789
0
18,980
Sorry for the off-topic, but is the Ninja Appreciation Week thing by invitation only? I'd like to show my appreciation, but I don't want to pretend I'm part of some club that I haven't been invited to.