Cache Efficiency Analysis of K8 and Core 2

October 24, 2006 3:02:18 AM

http://www.digit-life.com/articles2/cpu/rmmt-l2-cache.h...

A great look at the cache and memory efficiency of K8 and Core 2.

It's quite interesting how much more raw L2 bandwidth Core 2 has due to its 256-bit buses. Despite not having an IMC, Core 2's memory read bandwidth is actually very good; only its write bandwidth is poor.

In terms of shared cache efficiency, when 2 threads both fit in the L2 cache, each individual thread does lose a little bandwidth compared to having the cache to itself, but total bandwidth does scale up to accommodate 2 threads. The FSB is a limiter, though, since its bandwidth doesn't scale as well to accommodate 2 threads.

The other interesting issue they dealt with is cache thrashing, where the total amount of cache needed by both threads is larger than the L2 cache size. The big question is: does it occur? Sometimes.

They tested with the 2 threads needing 5MB total. In cases where the cache footprints of the threads are very different (i.e. thread 0 uses 1MB and thread 1 needs 4MB), bandwidth is shared efficiently between the threads. It's only when the cache footprints are similar, within 2.5MB of each other (i.e. thread 0 needs at least 1.25MB and thread 1 needs less than 3.75MB), that some thrashing begins to occur. The worst performance is when the cache footprints are within 1MB of each other (i.e. thread 0 needs at least 2MB and thread 1 needs less than 3MB).

While some thrashing will always occur, they mention that an ideal shared cache (i.e. hope for the future) would reduce this "conflict zone" to the 1MB-spread case, so that bandwidth doesn't taper off until thread 0 needs more than 2MB of space while thread 1 needs less than 3MB for a given 5MB load. Until that time, programmers should avoid creating 2 threads with similar cache footprints; in other words, 1 thread should use noticeably more cache than the other. Or better yet, program them to fit within the 4MB cache so that thrashing doesn't occur.
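
A minimal sketch of the kind of test they ran, in C with pthreads (my own toy, not Digit-Life's RMMT tool; the footprints and pass count are just assumptions to play with). Two threads stream reads over private buffers; 1MB + 4MB is the friendly case, while 2.5MB + 2.5MB should land in the conflict zone on a 4MB shared L2:

Code:
/* Hypothetical two-thread cache-footprint test; NOT the actual RMMT tool.
 * Build: gcc -O2 -pthread thrash.c -lrt -o thrash */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>

#define MB (1024 * 1024)
#define PASSES 200

struct job { size_t bytes; double gbps; };

static void *stream(void *arg)
{
    struct job *j = arg;
    size_t n = j->bytes / sizeof(long), i;
    long *buf = malloc(j->bytes);
    volatile long sink = 0;
    struct timespec t0, t1;
    int p;

    if (!buf) return NULL;
    for (i = 0; i < n; i++) buf[i] = (long)i;      /* touch every line once */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (p = 0; p < PASSES; p++)
        for (i = 0; i < n; i++) sink += buf[i];    /* sequential read stream */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    j->gbps = (double)j->bytes * PASSES /
              ((t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9) / 1e9;
    free(buf);
    return NULL;
}

int main(void)
{
    /* assumed footprints: 1MB + 4MB = the 5MB total from the article */
    struct job jobs[2] = { { 1 * MB, 0 }, { 4 * MB, 0 } };
    pthread_t t[2];
    int i;

    for (i = 0; i < 2; i++) pthread_create(&t[i], NULL, stream, &jobs[i]);
    for (i = 0; i < 2; i++) pthread_join(t[i], NULL);
    for (i = 0; i < 2; i++)
        printf("thread %d: %.2f GB/s (%zu MB footprint)\n",
               i, jobs[i].gbps, jobs[i].bytes / MB);
    return 0;
}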

No word on whether Penryn will bring updates to the shared cache mechanism, although the 6MB L2 cache will reduce thrashing. Digit-Life is planning a follow-up article specifically on memory bandwidth sharing rather than cache bandwidth sharing, so I'll watch out for that.
October 24, 2006 3:12:11 AM

Interesting find, thanks.
Core 2 bandwidth is huge! :o 
October 24, 2006 4:29:56 AM

Quote:
http://www.digit-life.com/articles2/cpu/rmmt-l2-cache.h...

A great look at the cache and memory efficiency of K8 and Core 2.

It's quite interesting how much more raw L2 bandwidth Core 2 has due to its 256-bit buses. Despite not having an IMC, Core 2's memory read bandwidth is actually very good; only its write bandwidth is poor.

In terms of shared cache efficiency, when 2 threads both fit in the L2 cache, each individual thread does lose a little bandwidth compared to having the cache to itself, but total bandwidth does scale up to accommodate 2 threads. The FSB is a limiter, though, since its bandwidth doesn't scale as well to accommodate 2 threads.

The other interesting issue they dealt with is cache thrashing, where the total amount of cache needed by both threads is larger than the L2 cache size. The big question is: does it occur? Sometimes.

They tested with the 2 threads needing 5MB total. In cases where the cache footprints of the threads are very different (i.e. thread 0 uses 1MB and thread 1 needs 4MB), bandwidth is shared efficiently between the threads. It's only when the cache footprints are similar, within 2.5MB of each other (i.e. thread 0 needs at least 1.25MB and thread 1 needs less than 3.75MB), that some thrashing begins to occur. The worst performance is when the cache footprints are within 1MB of each other (i.e. thread 0 needs at least 2MB and thread 1 needs less than 3MB).

While some thrashing will always occur, they mention that an ideal shared cache (i.e. hope for the future) would reduce this "conflict zone" to the 1MB-spread case, so that bandwidth doesn't taper off until thread 0 needs more than 2MB of space while thread 1 needs less than 3MB for a given 5MB load. Until that time, programmers should avoid creating 2 threads with similar cache footprints; in other words, 1 thread should use noticeably more cache than the other. Or better yet, program them to fit within the 4MB cache so that thrashing doesn't occur.

No word on whether Penryn will bring updates to the shared cache mechanism, although the 6MB L2 cache will reduce thrashing. Digit-Life is planning a follow-up article specifically on memory bandwidth sharing rather than cache bandwidth sharing, so I'll watch out for that.
Every CPU architecture will have some weak point. In the end, the C2D still performs more efficiently than any other arch. :? Two things I wonder about.

1. Why didn't they compare to an AM2 A64... would this have skewed AMD's results at all compared to S939 (i.e. would the increased memory bandwidth and/or new IMC change anything)?

2. I think it would have been good to test both Conroe and Allendale processors, to see how much difference the two L2 sizes exhibit.
October 24, 2006 6:54:39 AM

Quote:
Every CPU architecture will have some weak point. In the end, the C2D still performs more efficiently than any other arch. :? Two things I wonder about.

1. Why didn't they compare to an AM2 A64... would this have skewed AMD's results at all compared to S939 (i.e. would the increased memory bandwidth and/or new IMC change anything)?

2. I think it would have been good to test both Conroe and Allendale processors, to see how much difference the two L2 sizes exhibit.


1. The bandwidth provided by dual channel DDR2-800 is already 12.8GB/s, compared to 6.4GB/s from dual channel DDR-400 (quick math below). The memory is already a high-latency L3 cache. 8O

2. Agreed :D 
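
(For reference, those peaks are just the standard calculation, assuming the usual 64-bit (8-byte) channels: dual channel DDR2-800 is 800MT/s x 8B x 2 = 12.8GB/s, and dual channel DDR-400 is 400MT/s x 8B x 2 = 6.4GB/s.)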
October 24, 2006 8:20:03 AM

I am a little confused over this bit:

Quote:
The FSB is a limiter, though, since its bandwidth doesn't scale as well to accommodate 2 threads.


Is this impacted at all when, for example, you change from 266MHz x9 to 400MHz x6? This takes it pretty close to the speed of AMD's IMC, right? Or is it wishful thinking that MHz are a good indication of performance, even with regard to this?
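
(Doing the peak math myself: the FSB is quad-pumped and 64 bits wide, so a 266MHz bus gives 266 x 4 x 8B = 8.5GB/s on paper, while a 400MHz bus gives 400 x 4 x 8B = 12.8GB/s, which would match dual channel DDR2-800. Whether that translates into real 2-thread scaling is the question.)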
October 24, 2006 9:33:41 AM

Thanks, that was an interesting article. I guess having a shared cache has its advantages and disadvantages: in single-threaded stuff, the shared cache is probably an advantage since one core can use all of the cache, whereas with dedicated caches you are still limited to one core's cache. In multithreaded stuff, this situation can obviously reverse. On this front, I think AMD probably has it right with moving to dedicated L2/shared L3, considering that things are only going to become more multithreaded.
October 24, 2006 9:40:43 AM

I thought that that was the most contrived crap I have ever seen.
I give it three bags of salt.
October 24, 2006 2:06:00 PM

2MB of L2 nullified in the conflict zone... that's why they're still willing to increase the L2 size in future designs.
October 24, 2006 3:01:49 PM

They should really title this article "Core 2 cache efficiency" since it only made a cursory mention of AMD and didn't even include the AM2 setups. Crazy high cache hits for them though. I would have liked to have seen the bandwidth vs. block size chart for the X2 as well.

They did say this, though:
Quote:
The next article will be devoted to a thorough analysis of this problem.


Keep your eyes peeled.
October 24, 2006 5:33:19 PM

Quote:
They should really title this article "Core 2 cache efficiency" since it only made a cursory mention of AMD and didn't even include the AM2 setups. Crazy high cache hits for them though. I would have liked to have seen the bandwidth vs. block size chart for the X2 as well.

They did say this, though:
The next article will be devoted to a thorough analysis of this problem.


Keep your eyes peeled.
Yeah, though it looks pretty odd that half the cache is effectively lost under intensive dual-core bandwidth consumption, yet you don't get a chance to compare it with an X2's cache in this article.
October 24, 2006 10:23:59 PM

I'm guessing that these similar-data-block-size problems are rare in common use at the moment, but as multithreading grows in use, this could become more of a problem. I believe Intel takes the approach of "don't fix it until it's a problem" nowadays; they figured out when the FSB was going to start hurting them and came up with something to implement for that. I'd be surprised if they didn't have a plan to mitigate this problem, either by doing it the AMD way (exclusive L2, shared L3) or via some other path.
October 25, 2006 12:13:50 AM

Great analysis, as usual.

"Idle shared cache" seems an interesting concept...

The more I think about cache size limitations (within the current technological standpoint), the more I wonder about the IMC benefits; i.e., by the time any [Intel] IMC arrives, the total Ln cache per core will most probably be huge by any standards (be it Intel or AMD). With Penryn @ 6MB L2 and K8L with >2MB shared L3, chip-stacked RAM starts to make more sense than ever...

My opinion.


Cheers!
October 27, 2006 1:56:37 AM

Some people were questioning why an S939 X2 was used to represent AMD; the reason was that the testing software was developed on the S939 X2 3800+, so presenting those results was most convenient. AM2 didn't really change the cache subsystem, so S939 results are representative. The memory tests were just presented there for reference, and here now is the dedicated article on threaded memory bandwidth testing. This time they did use an AM2 X2 4800+ as well as an E6600 on both the i975X and P965 chipsets.

http://www.digit-life.com/articles2/mainboard/rmmt-ddr2...

What's interesting is that for read bandwidth, Core 2 Duo actually has bandwidth similar to K8's, despite the fact that in theory Core 2 Duo can only receive dual channel DDR2-533 equivalent bandwidth (8.5GB/s), compared to dual channel DDR2-800 equivalent on AM2 (12.8GB/s). AM2's advantage is in writing, where it has nearly double the bandwidth of Core 2 Duo.
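
(That 8.5GB/s ceiling is presumably just the E6600's 1066MT/s FSB times its 8-byte width, 1066 x 8B = 8.5GB/s; dual channel DDR2-533 works out to the same figure, 533 x 8B x 2 = 8.5GB/s, hence the equivalence.)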

In terms of threading, neither AM2 nor Core 2 achieves double the memory bandwidth to feed 2 full threads versus a single thread. In any case, all this is really just interesting in a theoretical sense, since many other factors affect end performance besides memory bandwidth. Still, it's clear that both AMD and Intel have some work to do in getting bandwidth scaling to match thread scaling, and this will become more important as we get to quad core and above. AMD in particular has a lot of potential in their DDR2-800 IMC that has yet to be realized.
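
If anyone wants to see that read/write asymmetry on their own box, here's a rough single-thread sketch in C (mine, not the RMMT tool they used; the 64MB buffer is just an assumption big enough to spill any L2 so the streams hit DRAM):

Code:
/* Hypothetical read vs. write streaming test; NOT the actual RMMT tool.
 * Build: gcc -O2 membw.c -lrt -o membw */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BYTES (64 * 1024 * 1024)   /* far larger than any L2 cache */
#define PASSES 20

static double secs(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    long *buf = malloc(BYTES);
    size_t n = BYTES / sizeof(long), i;
    volatile long sink = 0;
    struct timespec t0, t1;
    int p;

    if (!buf) return 1;
    memset(buf, 1, BYTES);                          /* touch every page */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (p = 0; p < PASSES; p++)
        for (i = 0; i < n; i++) sink += buf[i];     /* read stream */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("read:  %.2f GB/s\n", (double)BYTES * PASSES / secs(t0, t1) / 1e9);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (p = 0; p < PASSES; p++)
        for (i = 0; i < n; i++) buf[i] = p;         /* write stream */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("write: %.2f GB/s\n", (double)BYTES * PASSES / secs(t0, t1) / 1e9);

    free(buf);
    return 0;
}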
October 27, 2006 4:16:25 AM

Quote:
What's interesting is that for read bandwidth, Core 2 Duo actually has bandwidth similar to K8's, despite the fact that in theory Core 2 Duo can only receive dual channel DDR2-533 equivalent bandwidth (8.5GB/s), compared to dual channel DDR2-800 equivalent on AM2 (12.8GB/s). AM2's advantage is in writing, where it has nearly double the bandwidth of Core 2 Duo.


This is the point I try to make when discussing Barcelona. Out-of-order loads will unlock that tremendous bandwidth: since more stores can already occur, increasing the number of loads in flight will increase the read bandwidth and enable a noticeable increase in IPC.

This of course is in tandem with the greater prefetch (32B) and line width (256b, 128b x 2). Core 2 also uses a refined algorithm for threading, as is evidenced by the difference between 1 core and 2 cores.

Because Windows has its own buffering mechanism, efficiency can be increased by tapping into that buffer, or by having it accommodate your buffering scheme.

Vista will improve this by moving drivers to user mode, and as manufacturers get used to the differences.......


blah, blah, blah
October 27, 2006 1:00:20 PM

Quote:
This is the point I try to make when discussing Barcelona. Out-of-order loads will unlock that tremendous bandwidth: since more stores can already occur, increasing the number of loads in flight will increase the read bandwidth and enable a noticeable increase in IPC.

This of course is in tandem with the greater prefetch (32B) and line width (256b, 128b x 2). Core 2 also uses a refined algorithm for threading, as is evidenced by the difference between 1 core and 2 cores.


Unless AMD comes up with a scheme similar to Intel's Memory Disambiguation, any L2/L3 conflicts/latencies will hamper any OoOE-load and prefetcher boost, and that will consequently be reflected in the available-but-not-fully-used bandwidth and the hoped-for IPC increase.

(My thoughts, anyway).


Cheers!