
Does L3 really count this much?!

February 7, 2007 12:30:10 PM

Article:
http://www.infoworld.com/article/07/02/07/07OPcurve_1.html
Quote:
Unlike Intel’s Core, Barcelona gives each core dedicated L2 cache, and Barcelona incorporates a redesign that reduces cache latency (access delays). Barcelona adds Level 3 cache, a newcomer to the x86 and a page out of IBM’s POWER playbook. All four CPU cores in a Barcelona socket will share a single master catalog of recently-retrieved data. A three-level cache is a must-have for a multicore CPU, and that becomes obvious when you get a demo that switches L3 on and off.


February 7, 2007 12:43:51 PM

Quote:
Article:
http://www.infoworld.com/article/07/02/07/07OPcurve_1.html
Unlike Intel’s Core, Barcelona gives each core dedicated L2 cache, and Barcelona incorporates a redesign that reduces cache latency (access delays). Barcelona adds Level 3 cache, a newcomer to the x86 and a page out of IBM’s POWER playbook. All four CPU cores in a Barcelona socket will share a single master catalog of recently-retrieved data. A three-level cache is a must-have for a multicore CPU, and that becomes obvious when you get a demo that switches L3 on and off.


With a cross-bar that cannot synchronize the L2 cache, yes.
February 7, 2007 12:58:49 PM

I think that the shared L3 could have significant impact, depending on how much crosstalk there is between CPUs. If I remember correctly, L2 access is something like 16 clocks, L3 is +20 clocks, and memory access is +120-300 clocks, depending on multiplier, bus speed, and memory latency. L3 also provides an effective crosstalk method, as locking could be handled at the cache. If the chip is designed to use the L3 for locking and interchip communication, disabling it could really cripple cross-CPU memory access. To answer your question, the architecture might have made L3 critical.
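The impact of those cycle counts is easy to sketch with the standard average-memory-access-time formula. Here is a back-of-the-envelope model in Python; the hit rates and the ~150-cycle memory penalty are made-up illustrative values, not measured figures:

```python
def amat(levels):
    """Average memory access time for a cache hierarchy.

    levels: list of (hit_rate, total_latency_cycles) ordered from the
    fastest level down to main memory; the last entry must have
    hit_rate 1.0 since memory always "hits".
    """
    cycles, reach = 0.0, 1.0   # reach = fraction of accesses that get this far
    for hit_rate, latency in levels:
        cycles += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    return cycles

# Using the figures above: L2 = 16 cycles, L3 = 16+20 = 36 cycles,
# memory roughly 36+150 = 186 cycles (assumed hit rates: 90% L2, 60% L3).
with_l3 = amat([(0.90, 16), (0.60, 36), (1.0, 186)])   # 24.0 cycles
# With L3 disabled, every L2 miss goes straight to memory (~166 cycles):
without_l3 = amat([(0.90, 16), (1.0, 166)])            # 31.0 cycles
```

Even with invented hit rates, the model shows why an on/off demo can look dramatic: disabling the L3 pushes every L2 miss all the way out to memory.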
February 7, 2007 1:13:43 PM

Quote:
I think that the shared L3 could have significant impact, depending on how much crosstalk there is between CPU's. If I remember correctly, L2 access is something like 16 clocks, L3 is +20 clocks, and memory access is +120-300 clocks, depending on multiplier, bus speed, and memory latency. L3 also provides an effective crosstalk method, as locking could be handled at the cache. If the chip is designed to use the L3 for locking and interchip communication, disabling it could really cripple cross-CPU memory access. To answer your question, the architecture might have made L3 critical.

AMD's 65nm L2 latency is 20 cycles; maybe that is to make it synchronous with L3 :!: :?: :!: :?: :!: :?:
February 7, 2007 1:17:05 PM

Quote:
I think that the shared L3 could have significant impact, depending on how much crosstalk there is between CPU's. If I remember correctly, L2 access is something like 16 clocks, L3 is +20 clocks, and memory access is +120-300 clocks, depending on multiplier, bus speed, and memory latency. L3 also provides an effective crosstalk method, as locking could be handled at the cache. If the chip is designed to use the L3 for locking and interchip communication, disabling it could really cripple cross-CPU memory access. To answer your question, the architecture might have made L3 critical.

AMD's 65nm L2 latency is 20 cycles, maybe that is to make it synchronous with L3 :!: :?: :!: :?: :!: :?:

More recent data put it at 14-16 cycles.
(From JJ) :wink:
February 7, 2007 1:20:16 PM

That is more reasonable; I have 14 cycles on my 90nm X2 4200+... then what is the cause of the small performance penalty?!
February 7, 2007 1:27:11 PM

I think L3 would make a difference if you are using quad core, but otherwise it doesn't really do much
February 7, 2007 1:28:59 PM

Quote:
That is more reasonable; I have 14 cycles on my 90nm X2 4200+... then what is the cause of the small performance penalty?!


Both the increased L2 cache latency and slower memory performance due to lower memory frequency.

e.g.

x2 5000+ (2.6GHz)
DDR2-800=>743MHz, DDR2-667=>650MHz
x2 4800+ (2.5GHz)
DDR2-800=>714MHz, DDR2-667=>625MHz
x2 4600+ (2.4GHz)
DDR2-800=>800MHz, DDR2-667=>600MHz
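Those odd effective frequencies fall out of the K8-style integer memory divider: the memory clock is derived from the CPU clock, the divider must round up (never overclocking the RAM), so memory often runs below its rated speed. A sketch that reproduces the numbers above:

```python
import math

def effective_mem_mhz(cpu_mhz, ddr2_rating):
    """Effective DDR2 speed under a K8-style integer memory divider."""
    target_clock = ddr2_rating / 2               # DDR2-800 -> 400 MHz memory clock
    divider = math.ceil(cpu_mhz / target_clock)  # integer divider, rounded up
    return round(2 * cpu_mhz / divider)          # back to double-data-rate MHz

for cpu in (2600, 2500, 2400):                   # X2 5000+, 4800+, 4600+
    print(cpu, effective_mem_mhz(cpu, 800), effective_mem_mhz(cpu, 667))
# 2600 743 650
# 2500 714 625
# 2400 800 600
```

Note how the 2.4GHz part is the only one whose clock divides evenly by 400MHz, which is why it alone gets the full DDR2-800 speed.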
February 7, 2007 1:30:46 PM

If you ask me, if switching off the L3 cache really makes that much of a difference, all it shows is how much Barcelona, not any multicore CPU, relies on L3 cache. Not that there's anything wrong with that. I just doubt that L3 Cache is essential... Also, didn't the Gallatin 3.46GHz EE also have L3 Cache? What purpose did it serve there?
February 7, 2007 2:01:15 PM

Quote:
If you ask me, if switching off the L3 cache really makes that much of a difference, all it shows is how much Barcelona, not any multicore CPU, relies on L3 cache. Not that there's anything wrong with that. I just doubt that L3 Cache is essential... Also, didn't the Gallatin 3.46GHz EE also have L3 Cache? What purpose did it serve there?


The Gallatin did smoke some multimedia benchmarks in its day...
February 7, 2007 2:02:23 PM

Quote:
If you ask me, if switching off the L3 cache really makes that much of a difference, all it shows is how much Barcelona, not any multicore CPU, relies on L3 cache. Not that there's anything wrong with that. I just doubt that L3 Cache is essential... Also, didn't the Gallatin 3.46GHz EE also have L3 Cache? What purpose did it serve there?


Agreed, just because one specific multicore CPU drops in performance when L3 cache is disabled only means that it is more dependent on the L3 cache.

This is like Intel stating that you need at least 2MB of L2 cache otherwise you will cripple your CPU, when in fact we all know that AMD Athlon X2s run on 512+512KB of L2 and perform just as well; to prove the point even more, that's why AMD discontinued their non-FX 1MB+1MB CPUs...
February 7, 2007 2:13:11 PM

Quote:
If you ask me, if switching off the L3 cache really makes that much of a difference, all it shows is how much Barcelona, not any multicore CPU, relies on L3 cache. Not that there's anything wrong with that. I just doubt that L3 Cache is essential... Also, didn't the Gallatin 3.46GHz EE also have L3 Cache? What purpose did it serve there?


The Gallatin did smoke some multimedia benchmarks in its day...
True... the 3.73GHz EE, but that was still a P4, which was very cache-hungry.
February 7, 2007 2:15:43 PM

I am just curious: why 4x512 L2 + 2048 L3 and not just 4x1024 L2 :?: :!:
February 7, 2007 2:30:49 PM

Quote:
I think that the shared L3 could have significant impact, depending on how much crosstalk there is between CPU's. If I remember correctly, L2 access is something like 16 clocks, L3 is +20 clocks, and memory access is +120-300 clocks, depending on multiplier, bus speed, and memory latency. L3 also provides an effective crosstalk method, as locking could be handled at the cache. If the chip is designed to use the L3 for locking and interchip communication, disabling it could really cripple cross-CPU memory access. To answer your question, the architecture might have made L3 critical.

AMD's 65nm L2 latency is 20 cycles, maybe that is to make it synchronous with L3 :!: :?: :!: :?: :!: :?:


No, it was shown to be 14, not 20. L3 is designed to "eliminate" latency on the desktop and create better cache coherency across SMP servers.

It also has the added effect of seriously boosting single-threaded apps, as one core can use the entire L3.
February 7, 2007 2:34:07 PM

Quote:
I am just curious; Why 4x512 L2 + 2048 L3 and not just 4x1024 L2


You've got to deal with cache coherency somewhere, and it's probably much better handled in L3 than in system memory. Also, larger L2s make for more locks to deal with: if each core has 1MB of L2 cache, there is a greater chance that another core will need access to an address that another core already has allocated.
February 7, 2007 2:34:25 PM

Quote:
It also has the added effect of seriously boosting single-threaded apps, as one core can use the entire L3.

Sounds good.
February 7, 2007 2:37:03 PM

I'm not certain, but I think it is easier for the cores to talk over the L3 than it is to construct an 8-way crossbar system to share the L2. That would be 8 connections versus 4 for a shared L3 arrangement. And just linking the cache would slave the chip in the center to the other two. So if core 1 wanted something from core 3 it would have to go through core 2 to get it.
February 7, 2007 4:37:04 PM

The advantage of L3 is that it makes data available to all the cores simultaneously and with extremely low latency: the data is available in 3-5ns versus 50ns for DDR and 100+ns for DDR2. Large cache was the reason ASCI Red was number 1 in the world for 4 years using 250MHz Pentium Pros with 2MB of L2 when everyone else was using 900MHz with 256KB cache. 8MB of cache is why Blue Gene/L can run at 700MHz and beat Intel in any of the HPC benchmarks.

The following are the winners of the 2006 HPC Challenge Class 1 Awards:
G-HPL Achieved System Affiliation Submitter
1st place 259 Tflop/s IBM BG/L DOE/NNSA/LLNL Tom Spelce
1st runner up 67 Tflop/s IBM BG/L IBM T.J. Watson John Gunnels
2nd runner up 57 Tflop/s IBM p5-575 LLNL Charles Grassl
G-RandomAccess Achieved System Affiliation Submitter
1st place 35 GUPS IBM BG/L DOE/NNSA/LLNL Tom Spelce
1st runner up 17 GUPS IBM BG/L IBM T.J. Watson John Gunnels
2nd runner up 10 GUPS Cray XT3 Dual ORNL Jeff Larkin
G-FFT Achieved System Affiliation Submitter
1st place 2311 Gflop/s IBM BG/L DOE/NNSA/LLNL Tom Spelce
1st runner up 1122 Gflop/s Cray XT3 Dual ORNL Jeff Larkin
2nd runner up 1118 Gflop/s Cray XT3 SNL Courtenay Vaughan
EP-STREAM-Triad (system) Achieved System Affiliation Submitter
1st place 160 TB/s IBM BG/L DOE/NNSA/LLNL Tom Spelce
1st runner up 55 TB/s IBM p5-575 LLNL Charles Grassl
2nd runner up 43 TB/s Cray XT3 SNL Courtenay Vaughan
http://www.hpcchallenge.org/
February 7, 2007 6:33:22 PM

wow!
February 8, 2007 12:57:37 AM

Quote:
The advantage of L3 is that it makes data available to all the cores simultaneously and with extremely low latency: the data is available in 3-5ns versus 50ns for DDR and 100+ns for DDR2. Large cache was the reason ASCI Red was number 1 in the world for 4 years using 250MHz Pentium Pros with 2MB of L2 when everyone else was using 900MHz with 256KB cache. 8MB of cache is why Blue Gene/L can run at 700MHz and beat Intel in any of the HPC benchmarks.

The following are the winners of the 2006 HPC Challenge Class 1 Awards:
G-HPL Achieved System Affiliation Submitter
1st place 259 Tflop/s IBM BG/L DOE/NNSA/LLNL Tom Spelce
1st runner up 67 Tflop/s IBM BG/L IBM T.J. Watson John Gunnels
2nd runner up 57 Tflop/s IBM p5-575 LLNL Charles Grassl
G-RandomAccess Achieved System Affiliation Submitter
1st place 35 GUPS IBM BG/L DOE/NNSA/LLNL Tom Spelce
1st runner up 17 GUPS IBM BG/L IBM T.J. Watson John Gunnels
2nd runner up 10 GUPS Cray XT3 Dual ORNL Jeff Larkin
G-FFT Achieved System Affiliation Submitter
1st place 2311 Gflop/s IBM BG/L DOE/NNSA/LLNL Tom Spelce
1st runner up 1122 Gflop/s Cray XT3 Dual ORNL Jeff Larkin
2nd runner up 1118 Gflop/s Cray XT3 SNL Courtenay Vaughan
EP-STREAM-Triad (system) Achieved System Affiliation Submitter
1st place 160 TB/s IBM BG/L DOE/NNSA/LLNL Tom Spelce
1st runner up 55 TB/s IBM p5-575 LLNL Charles Grassl
2nd runner up 43 TB/s Cray XT3 SNL Courtenay Vaughan
http://www.hpcchallenge.org/

February 8, 2007 1:13:47 AM

all depends on architecture ;) 

and BTW, the Pentium Pros from ~10 years ago had 1MB L2 cache
February 8, 2007 1:24:15 AM

Quote:

Another superfluous post.
February 8, 2007 1:39:01 AM

CONGRATULATIONS gOJDO, you have made a complete fool of yourself. What you complain about is:

HPC Challenge Award Competition

The DARPA High Productivity Computing Systems (HPCS) Program and HPCWire are pleased to announce the annual HPC Challenge Award Competition. The goal of the competition is to focus the HPC community's attention on developing a broad set of HPC hardware and HPC software capabilities that are necessary to productively use HPC systems. The core of the HPC Challenge Award Competition is the HPC Challenge benchmark suite developed at the University of Tennessee under the DARPA HPCS program with inputs from a wide range of organizations from around the world (see http://icl.cs.utk.edu/hpcc/).
The Competition will focus on four of the most challenging benchmarks in the suite:

* Global HPL
* Global RandomAccess
* EP STREAM (Triad) per system
* Global FFT

There will be two classes of awards.

Class 1: Best Performance (4 awards) Best performance on a base or optimized run submitted to the HPC Challenge website. The benchmarks to be judged are: Global HPL, Global RandomAccess, EP STREAM (Triad) per system and Global FFT. The prize will be $500 plus a certificate for the best of each. Entries will be considered for award if they are submitted before November 13, 2006. http://www.hpcchallenge.org/

For the illiterate who do not know what DARPA is: http://www.darpa.mil/ipto/Programs/hpcs/index.htm

You claim that the premier benchmark suite is 24k crap. There are no more prestigious awards in the computing community. The crap, sir, is inside your shoes.
February 8, 2007 1:46:02 AM

Quote:
Unlike Intel’s Core, Barcelona gives each core dedicated L2 cache,

Second, there's the implication that Intel's shared L2 is somehow inferior... oddly, there are advantages and disadvantages to both implementations. Discrete cache is not capable of thrashing, but shared L2 cache enables easier coherency between the cores. The debate will rage forever on which is better (at least in the enthusiast circles).
Intel's L2 cache needs to be "smart" since it is shared across 2 cores. The term "smart" implies there is added logic to ensure the integrity of the cache is maintained between the 2 cores. I would think the additional logic comes with a small performance hit.

AMD's L2 cache is unique to each core. No additional logic. No need to maintain integrity across cores. I would think this results in an overall faster process.

I could be way off-base, but it sounds logical to me. "Smart" does not necessarily equal better.
February 8, 2007 1:51:04 AM

Quote:
CONGRATULATIONS gOJDO you have made a complete fool of yourself.

He does it quite often. :lol: 

If gOJDO doesn't remove that perverted image from his signature, I will report him. Complete and utter disrespect for the forum rules! :x I guess he thinks it's funny... I don't! :x :x
February 8, 2007 8:11:47 AM

Quote:

Intel's L2 cache needs to be "smart" since it is shared across 2 cores. The term "smart" implies there is added logic to ensure the integrity of the cache is maintained between the 2 cores. I would think the additional logic comes with a small performance hit.

It's "smart" because it's adaptable: with a single-threaded application, one core gets the whole amount. With two apps, at worst it becomes two split caches.

Quote:

AMD's L2 cache is unique to each core. No additional logic. No need to maintain integrity across cores. I would think this results in an overall faster process.

Yes, in other words it's no different from two separate CPUs. And the split L2 caches must communicate to remain coherent, unlike Conroe's shared cache.
February 8, 2007 8:18:38 AM

Quote:
The advantage to L3 is that it makes data availble to all the cores simultaneously and with exteremely low latency.. Having the data available in 3-5 ns versus 50ns for DDR and 100+ns for DDR2. Large cache was the reason ASCI Red was number 1 in the world for 4 yerars using pentium pro 250mhz with 2mb of L2 when every one else was using 900mhz with 256k cache. 8mb of cache is why Blue Gene L can run at 700mhz and beat Intel in any of the HPC benchmarks

No, the reason Blue Gene does well in HPC is its low power consumption and good interconnect, which allow a staggering number of CPUs (which run HPC code really well, but not much else) to be linked together.
February 8, 2007 9:27:55 AM

L3 is very much needed in multithreaded applications. All the previous posts stressed the fetch latency; however, there is also a write latency not to be underestimated. When a mutex is locked, the CPU cache is forced (among other things) to synchronize with system RAM to allow all threads coherent access to the same copy of the data (L2 in SMPs is not shared). With L3, mutex_lock() might be optimized to sync the L3 cache instead of DRAM and thus gain much more time. Of course this is just my idea... I don't know if such optimizations will exist (or already do exist).
This also works for the 3 other thread functions.
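The potential saving can be put in rough numbers: if every mutex hand-off forces the lock's cache line through some level of the hierarchy, the cost scales with that level's round-trip latency. The cycle counts below are illustrative guesses, not Barcelona figures:

```python
def lock_handoff_cycles(n_handoffs, sync_latency_cycles):
    """Rough cost of mutex hand-offs between cores, assuming each
    hand-off forces one round trip (flush + reload) of the lock's
    cache line through the synchronizing level of the hierarchy."""
    return n_handoffs * 2 * sync_latency_cycles

L3_CYCLES, DRAM_CYCLES = 40, 200   # assumed one-way latencies (guesses)

# One million lock hand-offs:
via_dram = lock_handoff_cycles(1_000_000, DRAM_CYCLES)  # 400,000,000 cycles
via_l3   = lock_handoff_cycles(1_000_000, L3_CYCLES)    #  80,000,000 cycles
```

Under these assumed latencies, synchronizing locks through a shared L3 instead of DRAM would cut the hand-off overhead by roughly 5x, which is the optimization the post above is speculating about.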
February 8, 2007 10:11:09 AM

boo @ you guys all knowing and stuff =P


From what I've read, AMD made the arch really rely on L3...


BTW, is there such a thing as L4? Maybe a mobo implementation between the CPU, L1, L2, L3 and memory.

(sorry if I went off topic... I'm sorta high and tired)
February 8, 2007 10:54:21 AM

Quote:
Having the data available in 3-5 ns versus 50ns for DDR and 100+ns for DDR2.


Do you really believe there is such a thing as 3-5ns latency L3 cache?
Right. :wink:
And I wonder how he measured the 50ns for DDR and 100+ns for DDR2.
When the data will be available depends on a lot of factors: the size of the data, RAM frequency, RAM latency, etc.
So, which DDR, which DDR2, etc.
If he compares DDR-400 with DDR2-800, both have roughly the same latency in ns, but DDR2-800 provides twice the bandwidth. Thus, DDR2-800 will deliver the requested data faster than DDR-400.

Also, this statement belongs with the 24 KARAT CRAP:
Quote:
The advantage to L3 is that it makes data availble to all the cores simultaneously and with exteremely low latency..

He has no idea what the L3 VICTIM cache on Barcelona is.
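The latency-vs-bandwidth point about DDR-400 and DDR2-800 can be made concrete: the time to deliver a cache line is the access latency plus the burst transfer time, and only the second part improves with bandwidth. A sketch (the 50ns latency, 64-byte line and 8-byte bus width are assumptions for illustration):

```python
def delivery_ns(latency_ns, line_bytes, mt_per_s, bus_bytes):
    """Time to deliver one cache line: access latency plus burst transfer.

    mt_per_s: memory transfers per second in millions (DDR-400 -> 400).
    """
    bandwidth_bytes_per_s = mt_per_s * 1e6 * bus_bytes
    return latency_ns + line_bytes / bandwidth_bytes_per_s * 1e9

# Same ~50ns latency assumed for both modules, 64-byte line, 8-byte bus:
ddr_400  = delivery_ns(50, 64, 400, 8)   # 50 + 20 = 70.0 ns
ddr2_800 = delivery_ns(50, 64, 800, 8)   # 50 + 10 = 60.0 ns
```

So even at equal latency in nanoseconds, DDR2-800 finishes the whole line sooner, which is the point being made above.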
February 8, 2007 11:19:18 AM

Quote:



Amen to that!
February 8, 2007 11:45:50 AM

Quote:
CONGRATULATIONS gOJDO you have made a complete fool of yourself. What you complain of is :

HPC Challenge Award Competition

The DARPA High Productivity Computing Systems (HPCS) Program and HPCWire are pleased to announce the annual HPC Challenge Award Competition. The goal of the competition is to focus the HPC community's attention on developing a broad set of HPC hardware and HPC software capabilities that are necessary to productively use HPC systems. The core of the HPC Challenge Award Competition is the HPC Challenge benchmark suite developed at the University of Tennessee under the DARPA HPCS program with inputs from a wide range of organizations from around the world (see http://icl.cs.utk.edu/hpcc/).
The Competition will focus on four of the most challenging benchmarks in the suite:

* Global HPL
* Global RandomAccess
* EP STREAM (Triad) per system
* Global FFT

There will be two classes of awards.

Class 1: Best Performance (4 awards) Best performance on a base or optimized run submitted to the HPC Challenge website. The benchmarks to be judged are: Global HPL, Global RandomAccess, EP STREAM (Triad) per system and Global FFT. The prize will be $500 plus a certificate for the best of each. Entries will be considered for award if they are submitted before November 13, 2006. http://www.hpcchallenge.org/

For the illiterate who do not know what DARPA is: http://www.darpa.mil/ipto/Programs/hpcs/index.htm

You claim that the premier benchmark suite is 24k crap. There are no more prestigious awards in the computing community. The crap, sir, is inside your shoes.


In the immortal words of Jimi Hendrix, "blah, blah woof, woof."

Quit posting unrelated spam and you won't get any more awards from gOJDO.

A discussion about K10 L3 implementation has nothing to do with PPro-based supercomputers. And the key behind those supercomputer metrics was not the mere presence of L2 (not L3, which we were discussing), but the fact that the code was highly optimized to use it. Look at the performance difference between 256KB, 512KB, 1024KB, etc. cache sizes in general, non-assembly-optimized code and you'll see maybe a few % increase.

Since the crap you are spewing doesn't add anything to the discussion, I'd like to get back on topic now:

Quote:

Intel's L2 cache needs to be "smart" since it is shared across 2 cores. The term "smart" implies there is added logic to ensure the integrity of the cache is maintained between the 2 cores. I would think the additional logic comes with a small performance hit.

AMD's L2 cache is unique to each core. No additional logic. No need to maintain integrity across cores. I would think this results in an overall faster process.


Intel has coherent L2, AMD is going for coherent L3. Every layer of cache adds latency, so coherent L2 with no L3 is faster in the case of a cache miss, and Intel has designed in advantages like reallocation for cases where a single large thread is in process. Dedicated L2 is great for multiple unrelated threads, like a lot of server tasks where you're handling hundreds of unrelated threads in different apartments, or several different unrelated apps. Coherency on the die in the shared L2 or L3 dramatically cuts down on bus traffic for crosstalk between cores, so AMD's 4-core should scale better with multiple sockets than Intel's, at least as far as crosstalk and memory access go.

Another benefit will be when tightly coupled multithreaded apps become more common. Game architecture, for example, doesn't see a huge improvement moving to loosely coupled thread models because one thread generally does most of the heavy lifting. With tighter coupling of threads the shared L3 (and probably the shared L2) will show dramatic improvements in performance, assuming the programmers can write code to take advantage of it...

Intel's shared L2 implementation occupies the 'owner core' to make a cross-core cache request, right? Isn't that why AMD's 'crossbar switch' is being touted as a great improvement?
February 8, 2007 12:20:49 PM

You can't really judge how a cache is going to perform until you test it: this has been true since before the on-die L2 cache of the Celeron II. Just wait and see: a new architecture is not a hack onto an old CPU design... if there is an L3 then there must be a good reason for it to be there... we'll see how it compares to Core's L2 cache. Besides, cache optimizations have always been different at AMD and Intel, even when they both used the L2.
February 8, 2007 12:48:22 PM

Well, that pretty much makes sense, seeing that we are comparing a chip that AMD itself says will be for server usage to a desktop chip. It would be like comparing the Duos to Xeon 51xxs.
February 8, 2007 1:21:18 PM

Quote:

Intel's L2 cache needs to be "smart" since it is shared across 2 cores. The term "smart" implies there is added logic to ensure the integrity of the cache is maintained between the 2 cores. I would think the additional logic comes with a small performance hit.

It's "smart" because it's adaptable: with a single-threaded application, one core gets the whole amount. With two apps, at worst it becomes two split caches.
Is this true? Only 2 possible partitions of the cache? What about the case where the cache is shared across both cores? I envisioned the potential for a large number of partitions (possibly in the hundreds or thousands), each slated for Core A, Core B, or both Core A & B.

Quote:

AMD's L2 cache is unique to each core. No additional logic. No need to maintain integrity across cores. I would think this results in an overall faster process.

Yes, in other words it's just no different than two separate CPUs. And the split L2 caches must communicate to remain coherent unlike Conroe's shared.
That is the point... The split L2 caches do NOT have to communicate. That is what the "shared" L3 cache is for. :wink:
February 8, 2007 1:28:59 PM

Quote:
I am just curious; Why 4x512 L2 + 2048 L3 and not just 4x1024 L2 :?: :!:


'Cause AMD are silly.
February 8, 2007 2:08:16 PM

Quote:
I am just curious; Why 4x512 L2 + 2048 L3 and not just 4x1024 L2 :?: :!:


'Cause AMD are silly.

Nope.
The shared VICTIM cache can be dedicated to one or more cores as exclusive, if needed. For example, 1 core that has a lot more tasks to do than the others can use all 2MB of L3 as exclusive.
AMD's L3 and Intel's L2 shared caches are implemented differently and work differently.
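For the curious, here is a toy model of the exclusive (victim) arrangement: a line enters the L3 only when it is evicted from an L2, so the two levels never duplicate data. This models only one core, uses plain LRU replacement, and the capacities are made up; it is a sketch of the concept, not of Barcelona's actual policy:

```python
from collections import OrderedDict

class VictimL3:
    """Toy exclusive cache pair: an L2 backed by a victim L3."""
    def __init__(self, l2_lines, l3_lines):
        self.l2 = OrderedDict()          # this core's L2 (LRU order)
        self.l3 = OrderedDict()          # shared victim cache
        self.l2_cap, self.l3_cap = l2_lines, l3_lines

    def access(self, addr):
        if addr in self.l2:
            self.l2.move_to_end(addr)    # refresh LRU position
            return "L2 hit"
        if addr in self.l3:              # promote back into L2 (exclusive)
            del self.l3[addr]
            result = "L3 hit"
        else:
            result = "miss"              # would go to main memory
        self.l2[addr] = True
        if len(self.l2) > self.l2_cap:   # evict LRU line INTO the L3
            victim, _ = self.l2.popitem(last=False)
            self.l3[victim] = True
            if len(self.l3) > self.l3_cap:
                self.l3.popitem(last=False)
        return result

cache = VictimL3(l2_lines=2, l3_lines=4)
cache.access(1); cache.access(2); cache.access(3)  # 1 is evicted into L3
print(cache.access(1))                             # "L3 hit" - saved from L3
```

The point of the exclusive design is capacity: since nothing is stored twice, a 512KB L2 plus 2MB L3 really holds 2.5MB of distinct data per core's worth of reach.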
February 8, 2007 2:19:24 PM

Quote:
The shared VICTIM cache can be dedicated to one or more cores as exclusive, if needed. For example, 1 core that has a lot more tasks to do than the others can use all 2MB of L3 as exclusive.
AMD's L3 and Intel's L2 shared caches are implemented differently and work differently.


Excellent post :wink:

I am more interested about the performance of dual-core Kuma with 2MB L3 cache and 2 blocks of 512KB L2 cache. :wink:
February 8, 2007 2:43:09 PM

while I am for an AM2 quad :wink:
February 8, 2007 3:00:32 PM

Quote:
The shared VICTIM cache can be dedicated to one or more cores as exclusive, if needed. For example, 1 core that has a lot more tasks to do than the others can use all 2MB of L3 as exclusive.
AMD's L3 and Intel's L2 shared caches are implemented differently and work differently.


Excellent post :wink:

I am more interested about the performance of dual-core Kuma with 2MB L3 cache and 2 blocks of 512KB L2 cache. :wink:

I would like to see that, just to have a better idea of the per-clock improvements over the X2. I would think the per-core increases will be the same for Agena and Kuma, but Agena will have the advantage of the extra two cores.
February 8, 2007 6:54:54 PM

Quote:

Is this true? Only 2 possible partitions of the cache? What about the case where the cache is shared across both cores? I envisioned the potential for a large number of partitions (possibly in the hundreds or thousands). Each partition could be slated for Core A, Core B, or both Core A & B.

No, I was just stating the two extremes.

Quote:

That is the point... The split L2 caches do NOT have communicate. That is what the "shared" L3 cache is for. :wink:

The shared L3 doesn't exist on the current X2s. The speed of the L3 is unknown at this point. I'd rather have 4MB of L2 cache than 4MB total of L2 and L3.
February 8, 2007 9:13:27 PM

Quote:
The shared L3 doesn't exist on the current X2s. The speed of the L3 is unknown at this point. I'd rather have 4MB of L2 cache than 4MB total of L2 and L3.

It's not that simple though. Generally, a larger cache requires higher latencies; by that argument, why not do away with L2 cache altogether and make a huge L1? That's the reason why. By having a separate L3 cache, you maintain the benefits of a low-latency dedicated L2 cache, whilst gaining more performance by having another layer of cache between L2 and main memory.
February 9, 2007 3:20:34 AM

Quote:
The shared L3 doesn't exist on the current X2s. The speed of the L3 is unknown at this point. I'd rather have 4MB of L2 cache than 4MB total of L2 and L3.

True... but keeping in context; we are talking about Barcelona. :wink:

AMD did not choose to implement a shared L3 cache across all cores without a good reason. :wink:

Barcelona example:

Say I have a 4-threaded application where each thread runs on a separate core, works in conjunction with the other 3 to accomplish a single main objective, and requires access to a common shared memory region. If a copy of the shared memory region existed in the shared L3 cache, then you have effectively minimized all access to main memory for this memory region. This has the potential to be a big performance gain. :D

You cannot do that with separate L2 caches. The affected areas of the separate L2 caches would need to be flushed prior to each core accessing the shared memory region. This is a performance hit. :(

The same example applied to a Core 2 Quad processor would require L2 cache flushes between the 2 separate dual-core dies' shared L2 caches. Hence, expect the theoretical performance to be somewhere in between.
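That example reduces to a simple counting argument. A trivial sketch (the core and block counts are invented, and real caches would also have capacity misses this model ignores):

```python
def memory_fetches(n_cores, n_blocks, shared_l3):
    """Main-memory fetches when n_cores each read the same n_blocks region.

    shared_l3=True : the first core's fill services everyone else.
    shared_l3=False: private caches only, so every core fetches its own copy.
    """
    resident = set()                 # blocks already in the shared L3
    fetches = 0
    for _ in range(n_cores):
        for blk in range(n_blocks):
            if shared_l3 and blk in resident:
                continue             # another core already pulled it in
            fetches += 1
            if shared_l3:
                resident.add(blk)
    return fetches

# 4 threads reading a 100-block shared region:
print(memory_fetches(4, 100, shared_l3=False))  # 400 fetches
print(memory_fetches(4, 100, shared_l3=True))   # 100 fetches
```

With private L2s only, the memory traffic scales with the number of cores touching the region; with a shared L3 it scales with the region size, which is the gain described above.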
February 9, 2007 6:35:47 AM

Yes, a shared L3 is useful. But in this case it's small in comparison to the L2, judging by the L3:L2 ratios of Xeon Tulsa, POWER5 or Itanium. Just how useful it is remains to be seen.
February 9, 2007 6:51:39 AM

Quote:
Yes, a shared L3 is useful. But in this case, it's small in comparison to the L2 cache, as compared to the L3:L2 ratios of Xeon Tulsa, Power5 or Itanium. Just how useful it is remains to be seen.


They have totally different caching systems and can't be compared with simple calculations.