
K10 benchmark investigation; POV-Ray experts needed :P

May 23, 2007 8:16:52 PM

Well, since we're all confused about yesterday's first kind-of benchmark from AMD, one of the most crucial aspects to clarify is the way POV-Ray used those systems in the tests. If anybody has exact information on any POV-Ray limitation in thread count, it would help us a lot to determine WHAT THE **** (replace the '*' with what you please) HAPPENED in that test.

May 23, 2007 8:41:28 PM

Tweakers.net had a review of an 8-socket Sun x4600. Povray shows improved performance between 8 and 16 threads. The scaling from 8->16 is much poorer than 4->8, but I can't say whether it's because of Povray or because of the 8 sockets.

http://tweakers.net/reviews/674/9
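One way to put a number on "much poorer" scaling is to compare the measured speedup with the ideal speedup for each step in thread count. A minimal Python sketch follows; the function name and the render times are placeholders for illustration only, not figures from the Tweakers.net review:

```python
def scaling_efficiency(time_before, time_after, threads_before, threads_after):
    """Fraction of the ideal speedup actually achieved when adding threads.

    1.0 means perfect scaling; 0.0 means no gain at all.
    """
    measured_speedup = time_before / time_after
    ideal_speedup = threads_after / threads_before
    return (measured_speedup - 1.0) / (ideal_speedup - 1.0)


# Placeholder render times in seconds (hypothetical, NOT measured data).
print(scaling_efficiency(100.0, 50.0, 4, 8))   # perfect doubling -> 1.0
print(scaling_efficiency(100.0, 80.0, 8, 16))  # only 1.25x faster -> 0.25
```

Plugging in the real render times from the review would show how far the 8->16 step falls short of the 4->8 step.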
May 23, 2007 9:04:12 PM

This is all from observing others' data and discussions:

POVRay certainly has no problem scaling to 16 cores on 4 sockets. It exhibits no measurable slowdown from 1 to 8 or 16 cores on a Clovertown system with centralized memory.

There is some mention that the setup of the K10 caches can interfere slightly with distributed workloads - while the IMC more than compensates in other cases.

It comes down to a 4S K10 having 36 different caches to keep coherent, as each L1-D, L2, and L3 potentially contains a unique copy of overlapping data. A 4S Clovertown system, in contrast, has just 8 caches for the controller to worry about, since L1 data is contained wholly in L2.

POVRay is a benchmark that fits entirely into cache, so to speak, and so it downplays the fact that 4S K10 has 8 channels of RAM, while simultaneously it exacerbates the snooping inefficiency caused by so many separate caches.
May 23, 2007 9:20:55 PM

Well put.

As an aside... No one will know how Barcelona will compete against Intel's offerings until AMD wants us to know. Period.

I am still optimistic. I love AMD, although, not blindly. I really want Barcelona to exceed what Intel has to offer for the time being. I'm sure the 2nd coming of AMD won't last as long though. Intel is out for blood now and they seem to have their proverbial pedal to the metal.

Here's for hoping!
May 23, 2007 9:31:41 PM

I have a theory! Intel would not have reminded us of their V8 score unless they were somehow sure that AMD couldn't trump their score. I'll leave it at that.
May 23, 2007 9:32:08 PM

i too was wondering about the IMC's with so many separate chips but im not sure thats the only problem.

im still kind of confused about the whole thing, was AMD trying to show off power consumption, or something else?
May 23, 2007 10:20:39 PM

I don't think there's anything about cache here, since rendering as a process is more or less a streaming job, where cache is more or less out of the picture and only FP efficiency and, to some extent, branch count matter, especially in the AMD (K8) arch. You can see this by comparing, in the CPU charts, the render time of a 2.0GHz / 1M L2 Athlon64 and a 2.0GHz / 128K L2 one; the difference is an insignificant 0.58%, or 1 sec, while the L2 differs by a factor of 8X:
May 24, 2007 1:18:29 PM

Quote:
This is all from observing others' data and discussions:

POVRay certainly has no problem scaling to 16 cores on 4 sockets. It exhibits no measurable slowdown from 1 to 8 or 16 cores on a Clovertown system with centralized memory.

There is some mention that the setup of the K10 caches can interfere slightly with distributed workloads - while the IMC more than compensates in other cases.

It comes down to a 4S K10 having 36 different caches to keep coherent, as each L1-D, L2, and L3 potentially contains a unique copy of overlapping data. A 4S Clovertown system, in contrast, has just 8 caches for the controller to worry about, since L1 data is contained wholly in L2.

POVRay is a benchmark that fits entirely into cache, so to speak, and so it downplays the fact that 4S K10 has 8 channels of RAM, while simultaneously it exacerbates the snooping inefficiency caused by so many separate caches.



You are incorrect about the caching setup. L1D and L2 are different. L3 is a victim cache and also contains different data, but only shared data. RealWorld did a real in-depth analysis recently - someone posted the link - and it shows the way Barcelona handles it.

It only has to keep the L3 coherent so that's only 4 caches. Because they are still on HT1.1 they can only go out to 4 sockets and keep the 1-hop latency.

The closest approximation with all of the other POV-Ray scaling articles is that the Opteron was running at 2.8GHz and the Barcelona was running at 1.9-2.1GHz (the starting frequencies). Again, I don't know, but it looks like the test would need to be run with a 2P Barcelona vs. a 4P Opteron; that would make it 8 cores vs. 8 cores.

There seems to be NO scaling from 8 to 16 cores, so it could be that the SW isn't as multithreaded as it could be. Maybe it's too coarse, such that it is putting large blocks in each core rather than smaller blocks - that would put more stress on the load/store mechanism rather than the functional units.

If you look at the way Valve is doing theirs, they noted that you can lose perf when increasing multithreading. This is the first multithreaded version and it seems to be listed as a beta everywhere.

Also, on the site they make note only of single core - dual core - quad core but not out to 16 cores. Though at the bottom of the beta page they say "There have been reports of benefits for users of hyperthreading systems, particularly with higher thread counts (e.g. 16 threads)."

I'm not really sure if C2Q or Barcelona or Opteron are bearing this out, though.

I guess someone should be asking them. They have the source code for 3.6 but not 3.7.
May 24, 2007 2:28:14 PM

I would think that the key would be reducing the number of times you dynamically allocate memory. The program (in this case POV-Ray) must treat each thread independently and allocate memory for its data and works with its specific cpu memory area.

Since each processor in an AMD multi-processor / socket system has its own local memory - when a processor accesses remote memory (i.e., cpu0 to cpu1 memory banks) latencies have to grow, right?

Can one of you programmers explain the Allocation Call Tree? In the case of an AMD SMP rig do you have linear reading/writing from/to memory? I don't see where that is possible when cpu0 has to access data from cpu1 memory . . .

I'll need one of yahs to figure this one out . . .
May 24, 2007 2:29:31 PM

Quote:
You are incorrect about the caching setup. L1D, and L2 are different. L3 is a victim cache and also contains different data, but only data shared. RealWorld did a real in-depth analysis recently - someone posted the link - and it shows that the way Barcelona handles it.

It only has to keep the L3 coherent so that's only 4 caches.

In order for coherency among the 4 L3's to be sufficient, the L3 must act as a write cache for all data from 4 cores, which is the reverse of the K10 victim read-only design - any modified L3 data is written back into L1D or L2, then that portion in L3 is flushed.

There are 16 independent threads, each of which touches an individual L1D and/or L2, so the minimum number of caches to worry about is 32 - assuming the L3 never steps in. The processors don't know these are independent threads because there is no central manager among the 16 cores telling them so.

The reason 4S Clovertown gets this down to 8 caches is that (1) the L1D is included fully in L2 and (2) each L2 contains all the cached data for two cores and handles all the read/write load. Neither method is present in K10, or else we'd see only L1's and a single, large L2 per die.
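The cache counts in this argument can be reproduced with a little arithmetic. This is only a sketch of the counting as described above (one L1D and one L2 per K10 core plus one L3 per socket; one shared L2 per dual-core Clovertown die, two dies per package); the function names are made up for illustration, and it says nothing about whether the count itself drives snoop cost:

```python
def k10_caches(sockets, cores_per_socket=4):
    # Each K10 core has its own L1D and L2; each socket adds one shared L3.
    private = sockets * cores_per_socket * 2
    shared_l3 = sockets
    return private, private + shared_l3


def clovertown_caches(sockets, dies_per_socket=2):
    # Each Clovertown package holds two dual-core dies; each die has one
    # shared L2 that (per the argument above) fully includes its L1Ds.
    return sockets * dies_per_socket


print(k10_caches(4))         # (32, 36): 32 private caches, 36 with the L3s
print(clovertown_caches(4))  # 8
```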

Quote:
I don't think there's anything about cache here, since rendering as a process is more or less a streaming job, where cache is more or less cut off and only FP efficiency and somehow branching count, especially in the AMD ( K8 ) arch. You can see this comparing in the CPU charts the render time of a 2.0GHz / 1M L2 Athlon64 and a 20.GHz / 128K L2; the difference is insignificantly 0.58% or 1sec while the L2 differs by a factor of 8X:

This is not about the amount of cache but the simple yet tedious problem of keeping so many different caches in sync.
May 24, 2007 2:36:53 PM

Quote:
Tweakers.net had a review of a 8 socket Sun x4600. Povray shows improved performance between 8 and 16 threads. The scaling from 8->16 is much poorer than 4->8 but can't say whether it's because of Povray or because of the 8 sockets.

http://tweakers.net/reviews/674/9


I'm just guessin' . . . I'd say it's due to a single cpu (let's say cpu1) having to access memory at a separate socket (either cpu0, cpu2 or cpu3). I know this is really simplistic (due to my pea brain) but are you not potentially introducing a 4-fold increase in latencies (i.e., per cpu socket memory banks) ??
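A rough way to reason about this is an average-latency model. The sketch below assumes memory is spread evenly across sockets, so a core finds 1/N of its data locally, and that every remote access costs the same single-hop latency (which would not hold for topologies with multi-hop paths). The latency figures are placeholders chosen for illustration, not measurements of any Opteron system:

```python
def average_memory_latency(sockets, local_ns, remote_ns):
    # Assumption: data is interleaved evenly, so 1/sockets of accesses are local.
    local_fraction = 1.0 / sockets
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies in nanoseconds (placeholders, NOT measured values).
LOCAL_NS = 60.0
REMOTE_NS = 100.0

for sockets in (1, 2, 4):
    print(sockets, average_memory_latency(sockets, LOCAL_NS, REMOTE_NS))
```

Under these assumptions the average latency creeps toward the remote figure as sockets are added rather than multiplying by the socket count, so a flat 4-fold increase would need remote accesses to be far slower than local ones.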
May 24, 2007 2:46:40 PM

Since the version of POV-Ray used is more or less still in the beta phase, why would AMD use it solely for the purpose of a PR benchmark? I've come up with a few answers myself:

(1) They are marketing to the not so inclined audience of "omfg Barcelona is 2x faster than opteron" crowd, when showing 4 K10s (16cores effectively) doubles the performance of 4 Opterons (8cores effectively)

(2) AMD was using POV-ray to test their new architecture before Computex and released the results to appease the raging media and public

(3) AMD has Alzheimer's (I think they need to get their F@H R600 farms running)
May 24, 2007 3:54:09 PM

Quote:
I would think that the key would be reducing the number of times you dynamically allocate memory. The program (in this case POV-Ray) must treat each thread independently and allocate memory for its data and works with its specific cpu memory area.

Since each processor in an AMD multi-processor / socket system has its own local memory - when a processor accesses remote memory (i.e., cpu0 to cpu1 memory banks) latencies have to grow, right?

Can one of you programmers explain the Allocation Call Tree? In the case of an AMD SMP rig do you have linear reading/writing from/to memory? I don't see where that is possible when cpu0 has to access data from cpu1 memory . . .

I'll need one of yahs to figure this one out . . .



I would say you have to look at some of the older benchmarks that use different size RAM blocks. IIRC, some apps like the larger blocks and some don't.

All of the POVRay tests I have seen show VERY LITTLE scaling from 8-16 threads. Even C2Q. 32 threads showed very little also from 16. Perhaps a better test would be SQL Server or Oracle.

AMD has analog tracing gear that can simulate real world (intel does also) usage of transistors. They would have known many moons ago if it wasn't tracing better than Opteron.

Again, it's just a few weeks until Computex and more light should be shed.
May 24, 2007 4:20:49 PM

Quote:

All of the POVRay tests I have seen show VERY LITTLE scaling from 8-16 threads. Even C2Q. 32 threads showed very little also from 16. Perhaps a better test would be SQL Server or Oracle.
.


There are not many sites that show data for 8-16 cores -- try and look. However, what you refer to, 16 to 32 threads on a C2Q not scaling at all ... well, the scaling stops when the thread count exceeds the core count... duh. You are not very bright.

There you go, now he is going to say something else stupid for Wombat to put into his sig.
May 24, 2007 4:44:08 PM

Quote:
:)  :)  Baron has 5779 posts, of which 5769 are something stupid -- he has plenty to choose from.


Yes, but of the 5769 stupid ones there are only 30 or so that are worthy of repeating over and over again. Best of luck educating Baron; he seems to be a real slow learner.
May 24, 2007 4:55:56 PM

Quote:
You are incorrect about the caching setup. L1D, and L2 are different. L3 is a victim cache and also contains different data, but only data shared. RealWorld did a real in-depth analysis recently - someone posted the link - and it shows that the way Barcelona handles it.

It only has to keep the L3 coherent so that's only 4 caches.

In order for coherency among the 4 L3's to be sufficient, the L3 must act as a write cache for all data from 4 cores, which is the reverse of the K10 victim read-only design - any modified L3 data is written back into L1D or L2, then that portion in L3 is flushed.

There are 16 independent threads, each of which touches an individual L1D and/or L2, so the minimum number of caches to worry about is 32 - assuming the L3 never steps in. The processors don't know these are independent threads because there is no central manager among the 16 cores telling them so.

The reason 4S Clovertown gets this down to 8 caches is that (1) the L1D is included fully in L2 and (2) each L2 contains all the cached data for two cores and handles all the read/write load. Neither method is present in K10, or else we'd see only L1's and a single, large L2 per die.

Quote:
I don't think there's anything about cache here, since rendering as a process is more or less a streaming job, where cache is more or less cut off and only FP efficiency and somehow branching count, especially in the AMD ( K8 ) arch. You can see this comparing in the CPU charts the render time of a 2.0GHz / 1M L2 Athlon64 and a 20.GHz / 128K L2; the difference is insignificantly 0.58% or 1sec while the L2 differs by a factor of 8X:

This is not about the amount of cache but the simple yet tedious problem of keeping so many different caches in sync.


Again you should look at some of the Barcelona analysis articles. The way it works is AMD uses a MOESI protocol to determine if a query is needed. In K8 all cores were always queried, but Barcelona has a new method where if a cache line is not marked shared then the query doesn't happen.

Even then a 4P Opteron has seriously low NUMA latency. They also don't say if the Barcelona is using DDR2-800.

I guess the difference between what I think and what others think is that most of us can't have it faster than C2Q because you all talk too much (include yourself if you want).

I believe that AMD has Itanium, Alpha, and K7 engrs/designers so they will do what they set out to. Beat the crap out of

OPTERON.
May 24, 2007 5:12:19 PM

If someone makes you want to lower yourself to making insults, I've found it's usually better to just dryly point out objective facts in a reply. Else we end up in endless back and forth that's about ego, really. I'm just objecting to the forumz degrading into flames. If the folks like you and me start flaming, that will result in pretty much nothing but flames in 98% of threads after the first post or two, imo.
May 24, 2007 5:17:02 PM

Quote:
Again you should look at some of the Barcelona analysis articles.



He has Baron, and he is interpreting them correctly --- you are not.

Quote:
Even then a 4P Opteron has seriously low NUMA latency. They also don't say if the Barcelona is using DDR2-800.


Define seriously low.... it is this latency that causes the 4x4 to take a performance hit when you compare 1 vs 2 sockets populated:
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=287...
However you interpret those results, they don't make much sense either; if both kinds of cores were running at the same frequency, how would it be possible for a K10 to have the same, identical IPC as a K8?! I think that in the end, this test was just set up to show what they wanted to show; most probably there were 2.8GHz Opterons and 1.9GHz Barcelonas. First the guy says they are the FASTEST optys they have (that means the 3.0GHz ones) and Barcelona is not even planned to clock that high, so how can they run at the same frequency?! Then he says that the Barcelona was not the fastest clocked they plan to deliver and I don't think that they will release a +3.0GHz quad; it's a stupid paradox however you see it :roll:
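The arithmetic behind this objection can be sketched as follows. It takes the roughly 2x result with 16 cores against 8 as discussed in this thread and asks what per-core, per-clock throughput that implies under each clock scenario. The function name is made up, and the model assumes POV-Ray scales perfectly with core count, which is itself disputed here:

```python
def per_core_per_clock_ratio(perf_ratio, core_ratio, clock_ratio):
    """Relative throughput per core per clock (new system vs. old)."""
    return perf_ratio / (core_ratio * clock_ratio)


# Same clocks on both systems, as stated in the demo.
same_clock = per_core_per_clock_ratio(2.0, 16 / 8, 1.0)
# Speculated alternative: 1.9 GHz Barcelona vs. 2.8 GHz Opteron.
lower_clock = per_core_per_clock_ratio(2.0, 16 / 8, 1.9 / 2.8)

print(same_clock)             # 1.0
print(round(lower_clock, 2))  # 1.47
```

So identical clocks imply no per-clock gain at all, while the speculated 1.9GHz vs. 2.8GHz setup would imply roughly a 47% gain, assuming perfect scaling in both cases.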
May 24, 2007 5:22:16 PM

Baron, have you ever tried to break a brick with your head?
Honestly, I really wonder...
May 24, 2007 5:24:34 PM

I couldn't see from the video you linked above what the CPU load was for the 4xQuad Barcelona. If something else is holding up these processors so that they're not being utilised as well as the Opterons, then we would know something is definitely up. How the hell could Barcelona not have IPC improvements over K8? This could be incredibly disappointing.

In AMD's defence, they are offering double the performance in the same thermal envelope. Now, if they offer it at a competitive cost, then they could still have a winning solution on their hands. Doubling performance for server-type tasks is what matters... not how many cores you used to get there, so long as you fit in the same platform.

On the other hand, what this means to home users is that Barcelona may have nothing incredible to offer at all. If the only useful new feature is the ability to choose a quad-core, I see little reason to expect Barcelona to keep up with the Core 2 Quads, some of which will have been around for a full year by the time Barcelona is out of the gate.
May 24, 2007 5:32:25 PM

Quote:
[......


However you interpret those results, they don't make much sense either; if both kinds of cores were running at the same frequency, how would it be possible for a K10 to have the same, identical IPC of a K8?! I think that at the end, this test was just set up to show what they wanted to show; most probably there ware 2.8Ghz Opterons and 1.9Ghz Barcelonas. First the guy says they are the FASTEST optys they have (that means the 3.0GHz ones) and Barcelona is not even planned to clock that high, so how can they run at the same frequency?! Then he says that the Barcelona was not the fastest clocked they plan to deliver and I don't think that they will release a +3.0GHz quad; it's a stupid paradox however you see it :roll:


This does make sense, and is a good possibility. So it's logical imo to try to disprove this, and guess it may be right until disproved.
May 24, 2007 5:32:48 PM

Quote:
most probably there ware 2.8Ghz Opterons and 1.9Ghz Barcelonas. First the guy says they are the FASTEST optys they have (that means the 3.0GHz ones) and Barcelona is not even planned to clock that high, so how can they run at the same frequency?! Then he says that the Barcelona was not the fastest clocked they plan to deliver and I don't think that they will release a +3.0GHz quad; it's a stupid paradox however you see it :roll:


http://www.youtube.com/watch?v=VGiv9Dtrc5Q
I recall "identical frequency" and don't remember "fastest Opteron".
May 24, 2007 5:43:05 PM

Re what frequencies, it's possible Randy Allen misunderstood what the setup was. Just a possibility. Also, even though the "2 core" is an "opteron", does that guarantee it's not a new arch opty?

I don't know anything about POV. Is there a reason to expect it to be better with a better arch?
May 24, 2007 5:43:44 PM

Quote:
Baron, have you ever tried to break a brick with your head?
Honestly, I really wonder...


I doubt it --- why ruin a perfectly good brick. What if you need a half of the brick?
Imagine the effort needed to break it with a hammer. Why agonize if Baron can break it into pieces with one head-shot? I am just guessing...
May 24, 2007 5:45:54 PM

Sorry for the off-topic, but is the Ninja Appreciation Week thing by invitation only? I'd like to show my appreciation, but I don't want to pretend I'm part of some club that I haven't been invited to.
May 24, 2007 6:02:11 PM

Quote:

All of the POVRay tests I have seen show VERY LITTLE scaling from 8-16 threads. Even C2Q. 32 threads showed very little also from 16. Perhaps a better test would be SQL Server or Oracle.
.


There are not many sites that show data for 8-16 cores -- try and look, however what you refer to 16 to 32 threads on a C2Q not scaling at all ... well, the scaling stops when when the thread count exceeds the core count... duh. You are not very bright.

How many do you need? Someone posted an X4600 test from somewhere that showed up to 32 threads/16 cores compared. Depending on the thread scheduling mechanism, it is possible to run more threads than cores; you should get a SMALL increase.
May 24, 2007 6:03:36 PM

Quote:
Baron, have you ever tried to break a brick with your head?
Honestly, I really wonder...


I doubt it --- why ruin a perfectly good brick.

No but I'd definitely break your head with a brick.
May 24, 2007 6:05:32 PM

Quote:
How many do you need? Someone posted a X4600 test from somewhere that showed up to 32 threads/16 cores compared. Depending on the thread scheduling mecahanism, it is possible to run more threads than cores, you should get a SMALL increase.


The point is that you cannot (yet) get a C2Q chip for 4P, so the most cores that you can get for the Core in one system is 8. Saying that POVray doesn't scale well on C2Q from 8 to 16 cores is meaningless when you're trying to justify why a 16-core AMD system did relatively poorly. It's as simple as that.
May 24, 2007 6:05:40 PM

Quote:
Baron, have you ever tried to break a brick with your head?
Honestly, I really wonder...


I doubt it --- why ruin a perfectly good brick.

No but I'd definitely break your brick with a head.
May 24, 2007 6:07:04 PM

Quote:
No but I'd definitely break your head with a brick.


... Thus proving, once again, that you are indeed at the top of the "intellectual food chain". :lol: 
May 24, 2007 6:14:47 PM

Quote:
Baron, have you ever tried to break a brick with your head?
Honestly, I really wonder...


I doubt it --- why ruin a perfectly good brick.

No but I'd definitely break your head with a brick.
But, why Baron? :( 
Why all this hate? What is that rancor you have accumulated?
We should respect each other. We should be friends. :cry: 
May 24, 2007 6:24:39 PM

Quote:
No but I'd definitely break your head with a brick.

... Thus proving, once again, that you are indeed at the top of the "intellectual food chain". :lol:


There's a time to reap and a time to sow. There's a time to .....ehhh, never mind.

Quote:
displaying or characterized by quickness of understanding, sound thought, or good judgment


That's the definition of intelligent. It has nothing to do with what you know but what you can learn. There hasn't been much. My PhD instructor in college recommended me for positions as I could take all of those equations and understand how to use them rather than thinking that the equations came first before experimentation.

It's too bad I couldn't mass market my PID retrofit Cruise control. That thing is cool. It uses an analog algorithm to adjust for hills, rough roads, desert etc.


You guys don't seem to understand that the more crap you talk the more conceited I get. Wait til I whip out my next big thing. All I can say is WOW.
May 24, 2007 6:26:59 PM

Quote:
Baron, have you ever tried to break a brick with your head?
Honestly, I really wonder...


I doubt it --- why ruin a perfectly good brick.

No but I'd definitely break your head with a brick.
But, why Baron? :( 
Why all this hate? What is that rancor you have accumulated?
We should respect each other. We should be friends. :cry: 


To quote DMX,

Stop talking S&*T.
May 24, 2007 6:30:18 PM

Quote:
To quote DMX,

Stop talking S&*T.
But If I kiss you, are we going to be friends? again?
May 24, 2007 7:07:27 PM

Baron, your posts are the only thing i read on this forum. Leave these "jokes of nature" to insult you.
If their mother's loves them, they have to be very, very happy, beacouse these are the only womans ever would.
Don't bother any more, you can't make them smarter.


P.S. sorry for my english, i'm from Croatia, and i learn german in my school.
May 24, 2007 7:23:19 PM

Quote:
How many do you need? Someone posted a X4600 test from somewhere that showed up to 32 threads/16 cores compared. Depending on the thread scheduling mecahanism, it is possible to run more threads than cores, you should get a SMALL increase.


The point is that you cannot (yet) get a C2Q chip for 4P, so the most cores that you can get for the Core in one system is 8. Saying that POVray doesn't scale well on C2Q from 8 to 16 cores is meaningless when you're trying to justify why a 16-core AMD system did relatively poorly. It's as simple as that.

The test I mentioned had an 8P Sun Opteron box, I think that's 16 cores.
May 24, 2007 7:45:50 PM

Oh, I must have just misread what we were talking about then. What I read was this:

Quote:
All of the POVRay tests I have seen show VERY LITTLE scaling from 8-16 threads. Even C2Q. 32 threads showed very little also from 16. Perhaps a better test would be SQL Server or Oracle.


That, and the conversation that followed afterwards. You have a habit of making a statement, and then when someone refutes that statement, you insist that you were right and try to tell people that you were really talking about something else.

"Oops, I was out in left-field on that one, you got me." - Now, how hard was that? You'd avoid a lot of the attacks in these forumz if you weren't so often both cocky and wrong at the same time.
May 24, 2007 8:52:25 PM

Quote:
To quote DMX,

Stop talking S&*T.
But If I kiss you, are we going to be friends? again?
nah...
but if Henri kisses him, he might.
May 24, 2007 8:54:24 PM

Quote:
Oh, I must have just misread what we were talking abou then. What I read was this:

All of the POVRay tests I have seen show VERY LITTLE scaling from 8-16 threads. Even C2Q. 32 threads showed very little also from 16. Perhaps a better test would be SQL Server or Oracle.


That, and the conversation that followed afterwards. You have a habit of making a statement, and then when someone refutes that statement, you insist that you were right and try to tell people that you were really talking about something else.

"Oops, I was out in left-field on that one, you got me." - Now, how hard was that? You'd avoid a lot of the attacks in these forumz if you weren't so often both cocky and wrong at the same time.
well....he's Baron..
May 24, 2007 9:50:37 PM

Quote:
To quote DMX,

Stop talking S&*T.
But If I kiss you, are we going to be friends? again?
8O Hey, woah,.. kiss what, where?!
May 24, 2007 10:20:17 PM

You studying under baron or what?
It says EXPECTED!!!
May 24, 2007 10:49:02 PM

Quote:
I couldn't see from the video wha the CPU load was for the 4xQuad Barcelona on the video you linked above. If something else is holding up these processors so that they're not being utilised as well as the Operton's, then we would know something is definitely up. How the hell could Barcelona not have IPC improvements over K8? This could be incredibly dissappointing.

Task manager shows a full green bar at the left for each system (0:55 - 1:10, mid-render), which indicates full utilization across all cores. With blurriness, I can't honestly tell between 90% and 100% load, but if there were a software scaling problem, that bar would only be half full or less, like it is for some dated audio and video encoders which cap at 2 or 4 worker threads. AMD wouldn't let that oversight slip through - you can bet their demo uses all the cores given.

As for IPC improvements, K10 is a derivative, not a complete redesign, of K8, and so you're bound to find applications that don't exhibit higher IPC on K10 - they simply don't take advantage of the logic that K10 improved upon. Here, I suspect they're running the x87 (legacy FPU) version of POVRay and thus hiding SSE2 performance; the K10 is not intrinsically faster than K8 at x87.

m25, there is no need to imagine words that aren't there - Randy Allen clearly states multiple times that the frequencies are identical and not the fastest available for either architecture. This benchmark doesn't spell doom for K10; it just doesn't cast K10 in the best light.

Quote:
Again you should look at some of the Barcelona analysis articles. The way it works is AMD uses a MOESI protocol to determine if a query is needed. In K8 all cores were always queried, but Barcelona has a new method where if a cache line is not marked shared then the query doesn't happen.

I appreciate that AMD continually tries to improve parallel caching efficiency, but MOESI (which I think is already in K8) isn't going to help applications which write almost exclusively to shared memory, of which ray-tracing software is an example.
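To make the point about writes to shared data concrete, here is a sketch of the textbook MOESI rule for when a write can complete locally. It models the generic protocol states only, not AMD's actual implementation in K8 or Barcelona, and the names are chosen for illustration:

```python
from enum import Enum


class LineState(Enum):
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def write_needs_coherence_traffic(state):
    # M and E guarantee no other cache holds the line, so a write stays local.
    # O and S mean other copies may exist and must be invalidated first;
    # I means the line must be fetched with ownership before writing.
    return state not in (LineState.MODIFIED, LineState.EXCLUSIVE)


for state in LineState:
    print(state.value, write_needs_coherence_traffic(state))
```

If a workload really does write mostly to lines that other cores also hold, most of its writes land in the S or O cases, where the protocol cannot avoid traffic.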

Ray-tracing works by following individual rays of light backward from camera to light source and calculating surface color every time the ray intersects/reflects off objects. To remain efficient, those surface calculations are cached as part of the scene, usable by any other thread - this is apparent when you see an object reflected on the surface of another.
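The core step of that backward trace, firing a ray from the camera and finding the first surface it meets, can be sketched in a few lines. This is a generic ray-sphere intersection test written for illustration; it is not taken from POV-Ray's source and says nothing about how POV-Ray shares or caches results between threads:

```python
import math


def ray_sphere_hit(origin, direction, center, radius):
    """Distance along a unit-length ray to the nearest hit in front of it, or None.

    Simplification: an origin inside the sphere is reported as a miss.
    """
    oc = [o - c for o, c in zip(origin, center)]
    b = 2.0 * sum(d * e for d, e in zip(direction, oc))
    c = sum(e * e for e in oc) - radius * radius
    disc = b * b - 4.0 * c  # quadratic coefficient a == 1 for a unit direction
    if disc < 0.0:
        return None
    t = (-b - math.sqrt(disc)) / 2.0
    return t if t > 0.0 else None


camera = (0.0, 0.0, 0.0)
sphere_center = (0.0, 0.0, 5.0)
print(ray_sphere_hit(camera, (0.0, 0.0, 1.0), sphere_center, 1.0))  # 4.0
print(ray_sphere_hit(camera, (0.0, 1.0, 0.0), sphere_center, 1.0))  # None
```

A full tracer repeats this for every pixel and then recurses from each hit point for reflections and shadows.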

Of course, without understanding exactly how POVRay is coded, it's only a theory that the 4S K10 bogged down from so many caches. It's equally valid to speculate (as WiseCracker did) that the scaling slowdown came from latency in the NUMA setup. Whatever weaknesses an FSB has, going from 1 to 2 or 4 FSBs isn't associated with a latency penalty, thus such a slowdown is absent from Clovertown.

Quote:
All of the POVRay tests I have seen show VERY LITTLE scaling from 8-16 threads. Even C2Q. 32 threads showed very little also from 16. Perhaps a better test would be SQL Server or Oracle.

Scaling occurs up to the minimum of # of threads and # of cores. If you have one simple core, running 16 threads would be a futile way to extract more performance. With that disclaimer, I haven't seen a test showing poor scaling from 8 to 16 cores... but I've seen only two tests incorporating 16 cores, the Clovertown 4S system last October and the K10 demo a few days ago. It's not every day that a reviewer benches an 8S dual-core Opteron with POVRay, and 4S Clovertowns aren't commercially available to my knowledge. Have you any links of 16-core benches showing nonlinear performance?

As for software capability, POVRay is capable of dispatching up to 255 threads, and I don't see a problem right off, as long as the OS can manage that many cores.
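The "minimum of threads and cores" rule above can be written down directly. This is an idealized upper bound that ignores memory, cache, and scheduling effects; the function name is made up for illustration:

```python
def ideal_speedup(threads, cores):
    # Extra threads beyond the core count only time-share existing cores.
    return min(threads, cores)


for threads in (1, 4, 8, 16, 32):
    print(threads, ideal_speedup(threads, cores=8))
```

On an 8-core box the bound stops at 8 for both 16 and 32 threads, which is why flat results beyond the core count say nothing about how the software would scale on a 16-core system.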

Quote:
It comes down to a 4S K10 having 36 different caches to keep coherent, as each L1-D, L2, and L3 potentially contains a unique copy of overlapping data. A 4S Clovertown system, in contrast, has just 8 caches for the controller to worry about, since L1 data is contained wholly in L2..

Could you briefly explain the difference between exclusive and inclusive caching and why your argument works they way it does...

Thanks,
Jack

Shoot, I rethought this for a few moments and noticed I mischaracterized a common argument. It doesn't matter how many caches are active per core, as cache snooping uses a broadcast to search for data (that's why it's fast). However, as many pointed out, such snooping faces scalability issues as a fixed interconnect bandwidth supplies an ever-increasing number of cores, each trying to access RAM.

Additionally, there has to be some overhead associated with processing snoops from 15 other active cores (instead of 7). This overhead is mitigated in 4S Clovertown because of (1) a snoop filter in the Northbridge monitoring communications between sockets and (2) the inclusive L1 cache allowing a core to read data while L2 is preoccupied.
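The "15 other active cores (instead of 7)" figure, and how quickly the total grows, can be sketched with a simple count. This assumes pure broadcast snooping with no filter, which is the worst case; the function names are made up for illustration:

```python
def snoop_sources_per_core(active_cores):
    # With broadcast snooping, each core sees probes from every other core.
    return active_cores - 1


def snoop_pairs(active_cores):
    # Number of (sender, receiver) pairs, which grows quadratically.
    return active_cores * (active_cores - 1)


for cores in (8, 16):
    print(cores, snoop_sources_per_core(cores), snoop_pairs(cores))
# 8 7 56
# 16 15 240
```

Doubling the cores roughly quadruples the sender/receiver pairs, which is the load a snoop filter is meant to cut down.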

Quote:

AMD has analog tracing gear that can simulate real world (intel does also) usage of transistors. They would have known many moons ago if it wasn't tracing better than Opteron.

That is news for this century... links? Last I checked, logic device simulations were all digital - software based.

As mentioned earlier, it's a big risk in CPU design to change what works fine. There are many parts of K10 which are simply copied from K8, just as portions of C2D derive from PII/III, P4, and P-M. This isn't like the PIII to P4 transition, a total overhaul which arguably was a huge mistake. Frequently, changes intended to increase performance also decrease it elsewhere, and it's a designer's call whether the sacrifice is worth it.
May 24, 2007 10:51:31 PM

Quote:
Penn and Teller should do a special on BM: http://en.wikipedia.org/wiki/Bullshit!


Right, we've already been through this. I remember comparing EIT horror stories on here with an ACTUAL ENGINEER.

I do like that new quote though. Maybe I can give you another one.

How about

I am probably the most justifiably conceited person to ever live.
May 24, 2007 11:02:37 PM

Quote:
As for IPC improvements, K10 is a derivative, not a complete redesign, of K8, and so you're bound to find applications that don't exhibit higher IPC on K10 - they simply don't take advantage of the logic that K10 improved upon. Here, I suspect they're running the x87 (legacy FPU) version of POVRay and thus hiding SSE2 performance; the K10 is not intrinsically faster than K8 at x87.



Barcelona has dual fp units and dual 128bit loads. That CAN'T be slower than a single fp unit that loads 1 64bit load. Also, because Barc has a special routine that doesn't change to store context until a certain amount of data is buffered for writes, it should not bottleneck on switches. It can also read and write at the same time with two 64bit buffers.

The problem with latency on QFX was on SINGLE-THREADED games because, as I said, K8 queries EVERY TIME while Barc only queries if the L1D cache line is marked shared.

Even Opteron, WITHOUT L3, DID NOT show bottlenecking on multithreaded apps. It actually shows perfect scaling on databases/warehousing, which is why I suggested that they use SQL.
May 24, 2007 11:15:14 PM

Quote:

displaying or characterized by quickness of understanding, sound thought, or good judgment


That's the definition of intelligent. It has nothing to do with what you know but what you can learn. There hasn't been much. My PhD instructor in college recommended me for positions as I could take all of those equations and understand how to use them rather than thinking that the equations came first before experimentation.

It's too bad I couldn't mass market my PID retrofit Cruise control. That thing is cool. It uses an analog algorithm to adjust for hills, rough roads, desert etc.

Baron's net result of practicing what he preaches...

Quote:
You guys don't seem to understand that the more crap you talk the more conceited I get. Wait till I whip out my next big thing. All I can say is WOW.
May 24, 2007 11:25:09 PM

The Inq is reporting that they did some testing at some lab they have and determined that the 82xx numbers were for a 1.8GHz Opteron. This means that a 1.8GHz Barc will be at 65W.

Linkage!

It also means that with say 500-600 for each 100MHz, a 2.5GHz Barc will do 7000.

That may be a little high, but if someone feels like determining what a 100MHz increase does, a simple multiplication will show what a 2.9GHz Phenom will do.

If that score was really a 1.8GHz Barc, then to quote Sharikou, C2Q will be worth $125.
May 24, 2007 11:37:25 PM

Quote:
The Inq is reporting that they did some testing at some lab they have and determined that the 82xx numbers were for a 1.8GHz Opteron. This means that a 1.8GHz Barc will be at 65W.

Linkage!

It also means that with say 500-600 for each 100MHz, a 2.5GHz Barc will do 7000.

That may be a little high, but if someone feels like determining what a 100MHz increase does, a simple multiplication will show what a 2.9GHz Phenom will do.

If that score was really a 1.8GHz Barc, then to quote Sharikou, C2Q will be worth $125.


FUD, FUD, and more FUD
I am not going to point out the obvious reasons why what you're saying is crap... damn
More Lies, Baron kissing Sharikou, and Videotapes

BTW I finally found a common link between Baron & Sharikou posts... 8O :)  :D 
May 24, 2007 11:47:21 PM

Quote:
The Inq is reporting that they did some testing at some lab they have and determined that the 82xx numbers were for a 1.8GHz Opteron. This means that a 1.8GHz Barc will be at 65W.

Linkage!

It also means that with say 500-600 for each 100MHz, a 2.5GHz Barc will do 7000.

That may be a little high, but if someone feels like determining what a 100MHz increase does, a simple multiplication will show what a 2.9GHz Phenom will do.


1. The slowest Socket F 8xxx Opterons run at least 2GHz
2. 2.5/1.8 = 1.39, i.e. a 39% higher clock. 7000 is 75% more than 4000 :roll:
3. AMD HE is 68W.
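
For anyone who wants to check the arithmetic in point 2, a quick sketch follows. It assumes the leaked score was roughly 4000 at 1.8GHz (the figure used in the point above) and that the score scales at best linearly with clock speed; both are assumptions for illustration, not measured data, and the function names are made up.

```python
BASE_CLOCK_GHZ = 1.8   # clock The Inq attributed to the leaked result
BASE_SCORE = 4000.0    # approximate score, taken from the reply above (assumption)


def linear_scaled_score(base_score: float, base_clock: float, target_clock: float) -> float:
    """Best-case projection: score grows in direct proportion to clock speed."""
    return base_score * target_clock / base_clock


def percent_increase(old: float, new: float) -> float:
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    projected = linear_scaled_score(BASE_SCORE, BASE_CLOCK_GHZ, 2.5)
    print(f"clock increase: {percent_increase(BASE_CLOCK_GHZ, 2.5):.1f}%")
    print(f"projected score at 2.5GHz: {projected:.0f}")
    print(f"increase needed to reach 7000: {percent_increase(BASE_SCORE, 7000.0):.1f}%")
```

Even with perfect linear scaling, a 2.5GHz part lands well short of 7000 under these assumptions, so the "500-600 per 100MHz" estimate implies better-than-linear scaling.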