K10 benchmark investigation; POV-Ray experts needed :P




m25
Profile: Faithful Poster

Well, since we're all confused about yesterday's first kind-of benchmark from AMD, one of the most crucial things to clarify is the way POV-Ray used those systems in the tests. If anybody has exact information on any POV-Ray limitation on thread count, it would help us a lot to determine WHAT THE **** (replace the '*' with what you please) HAPPENED in that test.

Profile: enthusiast

Tweakers.net had a review of an 8-socket Sun x4600. Povray shows improved performance between 8 and 16 threads. The scaling from 8->16 is much poorer than 4->8, but I can't say whether it's because of Povray or because of the 8 sockets.

http://tweakers.net/reviews/674/9

wr
Profile: addict

This is all from observing others' data and discussions:

POVRay certainly has no problem scaling to 16 cores on 4 sockets. It exhibits no measurable slowdown from 1 to 8 or 16 cores on a Clovertown system with centralized memory.

There is some mention that the setup of the K10 caches can interfere slightly with distributed workloads - while the IMC more than compensates in other cases.

It comes down to a 4S K10 having 36 different caches to keep coherent, as each L1-D, L2, and L3 potentially contains a unique copy of overlapping data. A 4S Clovertown system, in contrast, has just 8 caches for the controller to worry about, since L1 data is contained wholly in L2.

POVRay is a benchmark that fits entirely into cache, so to speak, and so it downplays the fact that 4S K10 has 8 channels of RAM, while simultaneously it exacerbates the snooping inefficiency caused by so many separate caches.
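
The cache counts in this post can be tallied with a few lines of Python. This is only a sketch of the counting argument as stated: it assumes 4 cores per K10 socket with one L1D and one L2 per core plus one shared L3 per socket, and two dual-core dies per Clovertown socket with one shared L2 each. The function names are made up for illustration.

```python
SOCKETS = 4  # the 4S systems discussed in the post


def k10_cache_count(sockets, cores_per_socket=4):
    """Caches to keep coherent if every L1D, L2 and L3 is counted separately."""
    per_core = 2      # one L1D and one L2 per core
    per_socket_l3 = 1  # one shared L3 per socket
    return sockets * (cores_per_socket * per_core + per_socket_l3)


def clovertown_cache_count(sockets, dies_per_socket=2):
    """Caches to keep coherent if L1D is treated as contained in the shared L2."""
    shared_l2_per_die = 1  # each dual-core die has one shared L2
    return sockets * dies_per_socket * shared_l2_per_die


print("4S K10 caches:", k10_cache_count(SOCKETS))
print("4S Clovertown caches:", clovertown_cache_count(SOCKETS))
```

Whether the coherency logic really has to track each of those caches individually is exactly what is being argued in this thread; the snippet only reproduces the arithmetic.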

Tarheel Blue thru and thru!
Profile: addict

Well put.

As an aside..... No one will know how Barcelona will compete against Intel's offerings until AMD wants us to know.... Period.

I am still optimistic. I love AMD, although not blindly. I really want Barcelona to exceed what Intel has to offer for the time being. I'm sure the 2nd coming of AMD won't last as long though. Intel is out for blood now and they seem to have their proverbial pedal to the metal.

Here's hoping!

Profile: addict

I have a theory! Intel would not have reminded us of their V8 score unless they were somehow sure that AMD couldn't trump their score. I'll leave it at that.

Profile: addict

I too was wondering about the IMCs with so many separate chips, but I'm not sure that's the only problem.

I'm still kind of confused about the whole thing: was AMD trying to show off power consumption, or something else?

m25
Profile: Faithful Poster

I don't think cache has anything to do with this, since rendering as a process is more or less a streaming job, where cache is more or less cut out and only FP efficiency and, to some extent, branching count, especially in the AMD (K8) arch. You can see this in the CPU charts by comparing the render time of a 2.0GHz / 1M L2 Athlon 64 and a 2.0GHz / 128K L2 one; the difference is an insignificant 0.58%, or 1 sec, while the L2 differs by a factor of 8X:
http://www23.tomshardware.com/charts4/257-263-74.png
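
For anyone who wants to check the arithmetic, here is a quick sketch using only the figures quoted in this post (1M vs 128K L2, a 1 sec gap, 0.58%). The render time itself is not stated, so the value computed below is just what those two numbers imply, and it is only approximate since both are rounded.

```python
# Figures quoted in the post
l2_large_kb = 1024      # 1M L2
l2_small_kb = 128       # 128K L2
time_gap_s = 1.0        # "1sec" difference in render time
relative_gap = 0.0058   # "0.58%"

cache_ratio = l2_large_kb / l2_small_kb
# If 1 s is 0.58% of the render time, the render time is roughly 1 / 0.0058 s
implied_render_time_s = time_gap_s / relative_gap

print(f"L2 size ratio: {cache_ratio:.0f}x")
print(f"Implied render time: about {implied_render_time_s:.0f} s")
```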

Profile: Forum Resident

Quote :

This is all from observing others' data and discussions:

POVRay certainly has no problem scaling to 16 cores on 4 sockets. It exhibits no measurable slowdown from 1 to 8 or 16 cores on a Clovertown system with centralized memory.

There is some mention that the setup of the K10 caches can interfere slightly with distributed workloads - while the IMC more than compensates in other cases.

It comes down to a 4S K10 having 36 different caches to keep coherent, as each L1-D, L2, and L3 potentially contains a unique copy of overlapping data. A 4S Clovertown system, in contrast, has just 8 caches for the controller to worry about, since L1 data is contained wholly in L2.

POVRay is a benchmark that fits entirely into cache, so to speak, and so it downplays the fact that 4S K10 has 8 channels of RAM, while simultaneously it exacerbates the snooping inefficiency caused by so many separate caches.




You are incorrect about the caching setup. L1D and L2 are different. L3 is a victim cache and also contains different data, but only shared data. RealWorld did a real in-depth analysis recently - someone posted the link - and it shows the way Barcelona handles it.

It only has to keep the L3 coherent so that's only 4 caches. Because they are still on HT1.1 they can only go out to 4 sockets and keep the 1-hop latency.

The closest approximation with all of the other POV-Ray scaling articles is that the Opteron was running at 2.8GHz and the Barcelona was running at 1.9-2.1GHz (the starting frequencies). Again, I don't know, but it looks like the test would need to be run with a 2P Barcelona vs. a 4P Opteron; that would make it 8 cores vs. 8 cores.

There seems to be NO scaling from 8 to 16 cores, so it could be that the SW isn't as multithreaded as it could be. Maybe it's too coarse, such that it is putting large blocks in each core rather than smaller blocks - that would put more stress on the load/store mechanism rather than the functional units.
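
The coarse-block idea can be illustrated with a toy model. This is a minimal sketch, not POV-Ray's actual scheduler: it assumes the image is cut into equal-cost blocks and that each idle thread simply grabs the next block. The work total, block counts and function names are hypothetical.

```python
import heapq


def makespan(block_costs, n_threads):
    """Finish time when each idle thread takes the next block in order."""
    finish_times = [0.0] * n_threads  # min-heap of per-thread finish times
    heapq.heapify(finish_times)
    for cost in block_costs:
        earliest = heapq.heappop(finish_times)        # first thread to go idle
        heapq.heappush(finish_times, earliest + cost)  # it renders this block
    return max(finish_times)


def split_evenly(total_work, n_blocks):
    """Cut the same total work into n_blocks equal blocks."""
    return [total_work / n_blocks] * n_blocks


TOTAL_WORK = 64.0  # hypothetical work units for one frame
coarse = split_evenly(TOTAL_WORK, 8)   # 8 large blocks
fine = split_evenly(TOTAL_WORK, 64)    # 64 small blocks

for name, blocks in (("coarse", coarse), ("fine", fine)):
    t8 = makespan(blocks, 8)
    t16 = makespan(blocks, 16)
    print(f"{name}: 8 threads={t8}, 16 threads={t16}, speedup={t8 / t16}")
```

In this model, with only 8 blocks the extra 8 threads have nothing to pick up, so going from 8 to 16 threads gains nothing, while 64 blocks let 16 threads finish in half the time. Real scenes have uneven block costs, so treat this only as the shape of the argument.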

If you look at the way Valve is doing theirs, they noted that you can lose perf when increasing multithreading. This is the first multithreaded version and it seems to be listed as a beta everywhere.

Also, on the site they make note only of single core - dual core - quad core but not out to 16 core. Though at the bottom of the beta page they say "There have been reports of benefits for users of hyperthreading systems, particularly with higher thread counts (e.g. 16 threads)."

I'm not really sure if C2Q or Barcelona or Opteron are bearing this out, though.

I guess someone should be asking them. They have the source code for 3.6 but not 3.7.

Profile: addict

I would think that the key would be reducing the number of times you dynamically allocate memory. The program (in this case POV-Ray) must treat each thread independently, allocate memory for its data, and work with its specific cpu memory area.

Since each processor in an AMD multi-processor / socket system has its own local memory - when a processor accesses remote memory (i.e., cpu0 to cpu1 memory banks) latencies have to grow, right?

Can one of you programmers explain the Allocation Call Tree? In the case of an AMD SMP rig do you have linear reading/writing from/to memory? I don't see where that is possible when cpu0 has to access data from cpu1 memory . . .

I'll need one of yahs to figure this one out . . .

wr
Profile: addict

Quote :

You are incorrect about the caching setup. L1D and L2 are different. L3 is a victim cache and also contains different data, but only shared data. RealWorld did a real in-depth analysis recently - someone posted the link - and it shows the way Barcelona handles it.

It only has to keep the L3 coherent so that's only 4 caches.


In order for coherency among the 4 L3's to be sufficient, the L3 must act as a write cache for all data from 4 cores, which is the reverse of the K10 victim read-only design - any modified L3 data is written back into L1D or L2, then that portion in L3 is flushed.

There are 16 independent threads, each of which touches an individual L1D and/or L2, so the minimum number of caches to worry about is 32 - assuming the L3 never steps in. The processors don't know these are independent threads because there is no central manager among the 16 cores telling them so.

The reason 4S Clovertown gets this down to 8 caches is that (1) the L1D is included fully in L2 and (2) each L2 contains all the cached data for two cores and handles all the read/write load. Neither method is present in K10, or else we'd see only L1's and a single, large L2 per die.

Quote :

I don't think cache has anything to do with this, since rendering as a process is more or less a streaming job, where cache is more or less cut out and only FP efficiency and, to some extent, branching count, especially in the AMD (K8) arch. You can see this in the CPU charts by comparing the render time of a 2.0GHz / 1M L2 Athlon 64 and a 2.0GHz / 128K L2 one; the difference is an insignificant 0.58%, or 1 sec, while the L2 differs by a factor of 8X:


This is not about the amount of cache but the simple yet tedious problem of keeping so many different caches in sync.

Profile: addict

Quote :

Tweakers.net had a review of an 8-socket Sun x4600. Povray shows improved performance between 8 and 16 threads. The scaling from 8->16 is much poorer than 4->8, but I can't say whether it's because of Povray or because of the 8 sockets.

http://tweakers.net/reviews/674/9



I'm just guessin' . . . I'd say it's due to a single cpu (let's say cpu1) having to access memory at a separate socket (either cpu0, cpu2 or cpu3). I know this is really simplistic (due to my pea brain) but are you not potentially introducing a 4-fold increase in latencies (i.e., per cpu socket memory banks) ??
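
One way to put rough numbers on that guess is a simple average-latency model. This is a sketch under loud assumptions: data is spread uniformly over every socket's memory, every remote access costs the same, and the latency figures are normalized placeholders, not measurements of any real Opteron or Barcelona system.

```python
def average_latency(n_sockets, local, remote):
    """Mean access latency when data is spread uniformly over all sockets."""
    p_local = 1.0 / n_sockets  # chance the data sits in this socket's own RAM
    return p_local * local + (1.0 - p_local) * remote


LOCAL = 1.0    # hypothetical: local access, normalized to 1
REMOTE = 1.5   # hypothetical: remote access 50% slower (illustrative only)

for sockets in (1, 2, 4):
    print(sockets, "socket(s):", average_latency(sockets, LOCAL, REMOTE))
```

In this model the average can never exceed the remote latency, so the penalty is capped by the remote/local ratio rather than growing with the number of sockets. Multi-hop links or contention on the interconnect would change that, and the model ignores both.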

Profile: enthusiast

Since the version of POV-Ray used is more or less still in the beta phase, why would AMD use it solely for the purpose of a PR benchmark? I've come up with a few answers myself:

(1) They are marketing to the not-so-technically-inclined "omfg Barcelona is 2x faster than Opteron" crowd, by showing that 4 K10s (16 cores effectively) double the performance of 4 Opterons (8 cores effectively)

(2) AMD was using POV-ray to test their new architecture before Computex and released the results to appease the raging media and public

(3) AMD has Alzheimer's (I think they need to get their F@H R600 farms running)

Profile: Forum Resident

Quote :

I would think that the key would be reducing the number of times you dynamically allocate memory. The program (in this case POV-Ray) must treat each thread independently, allocate memory for its data, and work with its specific cpu memory area.

Since each processor in an AMD multi-processor / socket system has its own local memory - when a processor accesses remote memory (i.e., cpu0 to cpu1 memory banks) latencies have to grow, right?

Can one of you programmers explain the Allocation Call Tree? In the case of an AMD SMP rig do you have linear reading/writing from/to memory? I don't see where that is possible when cpu0 has to access data from cpu1 memory . . .

I'll need one of yahs to figure this one out . . .




I would say you have to look at some of the older benchmarks that use different size RAM blocks. IIRC, some apps like the larger blocks and some don't.

All of the POVRay tests I have seen show VERY LITTLE scaling from 8 to 16 threads. Even C2Q. 32 threads also showed very little over 16. Perhaps a better test would be SQL Server or Oracle.

AMD has analog tracing gear that can simulate real-world usage of transistors (Intel does also). They would have known many moons ago if it wasn't tracing better than Opteron.

Again, it's just a few weeks until Computex and more light should be shed.

Profile: old hand
n°1677787
05-24-2007 at 06:20:49 PM