> in reality the jump from 256 to 512 kB of cache for the Athlon XP really did not affect the performance as much as for the Willamette to Northwood

Why is that? I've been thinking about why, but no clue...
> Why is that? I've been thinking about why, but no clue...

Cache associativity.
> The optimal size of cache depends on the loop size.

Wrong. It depends on the average size of the working set: the data that is used most often over a stretch of time.
Athlons have 16-way set associativity for the L2 cache. It's been shown that it performs almost as good as an 8-way set associative cache of double the size. Taking this further, a 256 kB 16-way set associative cache is nearly like a 8 MB cache without associativity. A Pentium 4 on the other hand has 'only' 8-way set associativity. A 256 kB 8-way set associative cache is nearly like a 4 MB cache without associativity.
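To make the associativity point concrete, here is a toy LRU cache model (plain Python; not modeled on any real CPU, and the sizes and access pattern are made up). With the same total capacity, a conflict-heavy pattern thrashes a direct-mapped cache but fits comfortably once there are enough ways per set:

```python
from collections import OrderedDict

def miss_rate(addresses, num_sets, ways, line=64):
    """Miss rate of a toy set-associative cache with LRU replacement."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        tag, index = divmod(addr // line, num_sets)
        cache_set = sets[index]
        if tag in cache_set:
            cache_set.move_to_end(tag)         # hit: mark most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict the LRU line
            cache_set[tag] = True
    return misses / len(addresses)

# 12 addresses, 4 kB apart, visited round-robin: they pile up on a few sets.
trace = [i * 4096 for i in range(12)] * 100

# Same 32 kB of total capacity, different shapes:
print(miss_rate(trace, num_sets=512, ways=1))   # direct-mapped: 0.67
print(miss_rate(trace, num_sets=32, ways=16))   # 16-way: 0.01 (cold misses only)
```

Of course this pattern was chosen to favor associativity; real gains depend on the workload, which is exactly what simulators like sim-cache measure.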
For example, when applying a filter to an image, the filter parameters are used very often, but the pixels of the image are only read once or a few times. So the filter is part of the working set; the image is not. Loop size corresponds to the image size here, but cache size hardly influences efficiency because only small blocks are processed at once.
On the contrary, it's even best to keep the image data out of the cache so the often-used filter can stay there. This can be done with MMX and SSE instructions that bypass the cache, and it is used by many image and video editing tools.
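As an illustration of that working-set point, here is a toy Python sketch (the 64x64 image and 3x3 filter sizes are arbitrary) that just counts how often each datum gets touched during a naive filter pass. The filter weights are read thousands of times; each pixel only a handful:

```python
# Count how often each datum is touched by a naive KxK filter pass.
W, H, K = 64, 64, 3   # image 64x64, filter 3x3 (arbitrary sizes)

kernel_reads = [[0] * K for _ in range(K)]
pixel_reads = [[0] * W for _ in range(H)]

for y in range(H - K + 1):
    for x in range(W - K + 1):
        for ky in range(K):
            for kx in range(K):
                kernel_reads[ky][kx] += 1         # filter weight reused
                pixel_reads[y + ky][x + kx] += 1  # image pixel touched

print(kernel_reads[0][0])          # 3844: each weight read (W-K+1)*(H-K+1) times
print(max(map(max, pixel_reads)))  # 9: a pixel is read at most K*K times
```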
> Associativity doesn't guarantee "equivalent cache-size".

I didn't say anything about equivalence. I said "performs nearly like", which is true.
> And, btw, they didn't change the cache associativity when moving from T-Bred to Barton. The cache associativity was still 16-way.

I didn't say anything about changing associativity from Thoroughbred to Barton. I did say "Athlons have 16-way set associativity for the L2 cache", and both are Athlon processors, so this is true as well.
> Higher associativity would offer more flexibility when dealing with memory; however, it would also make for higher search times which, in turn, means slower caches.

Which is another reason why Pentium 4s have lower associativity.
> I would also like to see numbers as to how this would make a 16-way associative cache "as good as an 8-way set associative cache of double the size".

Since you're such an expert, I thought you'd know... Oh well, look for a program named sim-cache.
> Saying higher is automatically better is an oversimplification and very very wrong.

Wow, you're such a very good reader. I must have written that in really small print, because I can't find myself stating that anywhere.
> Saying that it makes it "equivalent to xxx size cache" is just ridiculous.

It is, isn't it?
> No, actually, it depends on the size of the loop. You could have a working dataset of 1 MB, but your loop only works on 256 kB at a time. In which case, your optimal data cache size would be 256 kB. Once your critical loop finishes, you can then just load the next 256 kB.

Most of the time there is absolutely no relationship between cache size and loop size. Like in my example of a filter, you mostly can't keep the whole image in the cache, so you have to read every pixel from RAM anyway. So it doesn't matter how big your loop is. What does matter is the filter size. It's very uncommon, but if the working set didn't fit in the L1 cache it would be much less efficient. But again, that directly has nothing to do with loop size. It's always about data size.
> That depends very much on how the application is written. Most filters are multipass and rely heavily on accessing the image data to sample values, put it through transforms, sample again, calculate shifts, etc. Only the simplest filters ever read the image values just once.

Absolutely, that's why I said "once or a few times". But the point is that it isn't cacheable. That's not terrible either, because it's prefetchable. Again, however, it has nothing to do with loop size.
> SSE and MMX instructions don't "bypass" the L2. I have no idea where you get this.

I didn't say all instructions do this. The prefetchnta, movntq and movntps instructions do.
> Also, by "filter" I'll assume you mean the filter code.

No, I meant the filter data. Seemed obvious to me, so I didn't think I had to specify it further...
> I didn't say anything about equivalence. I said "performs nearly like", which is true.

Correct me if I'm wrong, but "performs nearly like" is pretty much stating it's about equivalent.
> I didn't say anything about changing associativity from Thoroughbred to Barton. I did say "Athlons have 16-way set associativity for the L2 cache", and both are Athlon processors, so this is true as well.

Then why is it that the Athlon did not benefit from its increase in cache? Assuming its associativity remained the same, the delta in caching performance shouldn't have been affected. Double 16-way is still double, just as double 8-way is still double...
> Since you're such an expert, I thought you'd know... Oh well, look for a program named sim-cache.

What exactly are you using sim-cache to measure when you say "it performs almost as good as an 8-way set associative cache of double the size"? Cache miss rate? I would assume so; however, I would argue that cache miss rate also depends on cache-line size and the adaptiveness of the prefetch algorithm. So tell me, where exactly did you get data on two identical architectures, one with 8-way associative caching and the other with 16-way, and measure that the cache miss rate of the 16-way was close to half that of the 8-way?
> Wow, you're such a very good reader. I must have written that in really small print, because I can't find myself stating that anywhere.

"It's been shown that it performs almost as good as an 8-way set associative cache of double the size. Taking this further, a 256 kB 16-way set associative cache is nearly like a 8 MB cache without associativity. A Pentium 4 on the other hand has 'only' 8-way set associativity. A 256 kB 8-way set associative cache is nearly like a 4 MB cache without associativity."
> Most of the time there is absolutely no relationship between cache size and loop size. Like in my example of a filter, you mostly can't keep the whole image in the cache, so you have to read every pixel from RAM anyway. So it doesn't matter how big your loop is. What does matter is the filter size. It's very uncommon, but if the working set didn't fit in the L1 cache it would be much less efficient. But again, that directly has nothing to do with loop size. It's always about data size.

You don't have to keep the whole image in cache. How many image filtering programs do you know that work on the whole image? Most split the image into sub-blocks and process those, then run a global-level filter at the end to "weed out" any edge distortions. DivX is one of the video imaging filters that does this quite extensively for low-bitrate video.
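For illustration, a minimal sketch of that sub-block (tiling) pattern in Python. The tile size is a made-up number, and a trivial per-pixel operation stands in for a real filter; actual filters also need halo pixels around each tile, which is where the edge-distortion cleanup comes in:

```python
# Apply a per-pixel operation tile by tile, so only a TILE x TILE block
# of the image is "hot" in cache at any moment. TILE is a made-up value.
TILE = 64

def process_tiled(img, op):
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for ty in range(0, h, TILE):          # walk over tile origins
        for tx in range(0, w, TILE):
            for y in range(ty, min(ty + TILE, h)):
                for x in range(tx, min(tx + TILE, w)):
                    out[y][x] = op(img[y][x])
    return out

img = [[(x + y) % 256 for x in range(200)] for y in range(100)]
inverted = process_tiled(img, lambda p: 255 - p)
```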
> Absolutely, that's why I said "once or a few times". But the point is that it isn't cacheable. That's not terrible either, because it's prefetchable. Again, however, it has nothing to do with loop size.

I would say it is cacheable. See above.
> No, I meant the filter data. Seemed obvious to me, so I didn't think I had to specify it further...

I would say the majority of the filter data would just be a subset of the image you're working on.
> Correct me if I'm wrong, but "performs nearly like" is pretty much stating it's about equivalent.

That's true, but you said equivalent, not -about- equivalent. You're as good at bending your own words as at bending mine. Do you work in PR?
> Then why is it that the Athlon did not benefit from its increase in cache? Assuming its associativity remained the same, the delta in caching performance shouldn't have been affected. Double 16-way is still double, just as double 8-way is still double...

It did benefit, just not as much. The 16-way associativity of Athlons results in less data being evicted by other data that shares the same hash (in practice, the lower address bits). In other words, the cache miss rate will be lower, so it can more effectively keep the active working set close to the core. Or, rephrased again for simplicity: it can cache more useful data.
> What exactly are you using sim-cache to measure when you say "it performs almost as good as an 8-way set associative cache of double the size"? Cache miss rate? I would assume so; however, I would argue that cache miss rate also depends on cache-line size and the adaptiveness of the prefetch algorithm. So tell me, where exactly did you get data on two identical architectures, one with 8-way associative caching and the other with 16-way, and measure that the cache miss rate of the 16-way was close to half that of the 8-way?

Yes, cache miss rate is the main factor that, together with search latency and miss penalty, determines performance. You clearly didn't try hard to find the answers yourself, because sim-cache is completely configurable and is an emulator. So it doesn't matter what your own computer's cache-line size or prefetch unit is like.
> Not only was your implication that we can compare n-way associative cache with discrete mapping in a *linear* way, but you also implied that the higher your associativity, the larger cache size your cache will "act like".

See above.
> You don't have to keep the whole image in cache. How many image filtering programs do you know that work on the whole image? Most split the image into sub-blocks and process those, then run a global-level filter at the end to "weed out" any edge distortions. DivX is one of the video imaging filters that does this quite extensively for low-bitrate video.

Applications like Paint Shop Pro and ColorDraw work interactively, so you see the result between every filter step. Hence the image doesn't get reused in a short time, and it isn't cached. Video filters like those used with Avisynth also apply one filter at a time, exactly to avoid edge distortion effects. Run-time filters like those used with DivX might work a little differently, but I would be surprised if more than a few basic filters could operate on subsets of the image. I would appreciate it if you could show me a reference that confirms your claim. Anyway, varghesejim's original sentence was "the optimal size of cache depends on the loop size". This is not true in general, and certainly not for the loop he showed.