> in reality the jump from 256 to 512 kB of cache for the Athlon XP really did not affect the performance as much as for the Willamette to Northwood

Why is that? I've been thinking about why, but no clue...
> Why is that? I've been thinking about why, but no clue...

Cache associativity.
> The optimal size of cache depends on the loop size.

Wrong. It depends on the average size of the working set: the data that is used most often over a stretch of time.
Athlons have 16-way set associativity for the L2 cache. It's been shown that it performs almost as good as an 8-way set associative cache of double the size. Taking this further, a 256 kB 16-way set associative cache is nearly like a 8 MB cache without associativity. A Pentium 4 on the other hand has 'only' 8-way set associativity. A 256 kB 8-way set associative cache is nearly like a 4 MB cache without associativity.
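To make the associativity point concrete, here is a toy LRU cache model (plain Python; not modeled on any real CPU, and the sizes and access pattern are made up). With the same total capacity, a conflict-heavy pattern thrashes a direct-mapped cache but fits comfortably once there are enough ways per set:

```python
from collections import OrderedDict

def miss_rate(addresses, num_sets, ways, line=64):
    """Miss rate of a toy set-associative cache with LRU replacement."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        tag, index = divmod(addr // line, num_sets)
        cache_set = sets[index]
        if tag in cache_set:
            cache_set.move_to_end(tag)         # hit: mark most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict the LRU line
            cache_set[tag] = True
    return misses / len(addresses)

# 12 addresses, 4 kB apart, visited round-robin: they pile up on a few sets.
trace = [i * 4096 for i in range(12)] * 100

# Same 32 kB of total capacity, different shapes:
print(miss_rate(trace, num_sets=512, ways=1))   # direct-mapped: 0.67
print(miss_rate(trace, num_sets=32, ways=16))   # 16-way: 0.01 (cold misses only)
```

Of course this pattern was chosen to favor associativity; real gains depend on the workload, which is exactly what simulators like sim-cache measure.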
For example, when applying a filter to an image, the filter parameters are used very often, but the pixels of the image are only read once or a few times. So the filter is part of the working set; the image is not. Loop size corresponds to the image size here, but cache size hardly influences efficiency because only small blocks are processed at once.
On the contrary, it's even best to keep the image data out of the cache so the often-used filter can stay there. This can be done with MMX and SSE instructions that bypass the cache, and it is used by many image and video editing tools.
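As an illustration of that working-set point, here is a toy Python sketch (the 64x64 image and 3x3 filter sizes are arbitrary) that just counts how often each datum gets touched during a naive filter pass. The filter weights are read thousands of times; each pixel only a handful:

```python
# Count how often each datum is touched by a naive KxK filter pass.
W, H, K = 64, 64, 3   # image 64x64, filter 3x3 (arbitrary sizes)

kernel_reads = [[0] * K for _ in range(K)]
pixel_reads = [[0] * W for _ in range(H)]

for y in range(H - K + 1):
    for x in range(W - K + 1):
        for ky in range(K):
            for kx in range(K):
                kernel_reads[ky][kx] += 1         # filter weight reused
                pixel_reads[y + ky][x + kx] += 1  # image pixel touched

print(kernel_reads[0][0])          # 3844: each weight read (W-K+1)*(H-K+1) times
print(max(map(max, pixel_reads)))  # 9: a pixel is read at most K*K times
```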
> Associativity doesn't guarantee "equivalent cache-size".

I didn't say anything about equivalence. I said "performs nearly like", which is true.
> And, btw, they didn't change the cache associativity when moving from T-Bred to Barton. The cache associativity was still 16-way.

I didn't say anything about changing associativity from Thoroughbred to Barton. I did say "Athlons have 16-way set associativity for the L2 cache", and both are Athlon processors, so this is true as well.
> Higher associativity would offer more flexibility when dealing with memory; however, it would also make for higher search times which, in turn, means slower caches.

Which is another reason why Pentium 4s have lower associativity.
> I would also like to see numbers as to how this would make a 16-way associative cache "as good as an 8-way set associative cache of double the size".

Since you're such an expert, I thought you'd know... Oh well, look for a program named sim-cache.
> Saying higher is automatically better is an oversimplification and very very wrong.

Wow, you're such a very good reader. I must have written that in really small print, because I can't find myself stating that anywhere.
> Saying that it makes it "equivalent to xxx size cache" is just ridiculous.

It is, isn't it?
> No, actually, it depends on the size of the loop. You could have a working dataset of 1 MB, but your loop only works on 256 kB at a time. In which case, your optimal data cache size would be 256 kB. Once your critical loop finishes, you can then just load the next 256 kB.

Most of the time there is absolutely no relationship between cache size and loop size. Like in my example of a filter, you mostly can't keep the whole image in the cache, so you have to read every pixel from RAM anyway. So it doesn't matter how big your loop is. What does matter is the filter size. It's very uncommon, but if the working set didn't fit in the L1 cache it would be much less efficient. But again, that directly has nothing to do with loop size. It's always about data size.
> That depends very much on how the application is written. Most filters are multipass and rely heavily on accessing the image data to sample values, put it through transforms, sample again, calculate shifts, etc. Only the simplest filters ever read the image values just once.

Absolutely, that's why I said "once or a few times". But the point is that it isn't cacheable. That's not terrible either, because it's prefetchable. Again, however, it has nothing to do with loop size.
> SSE and MMX instructions don't "bypass" the L2. I have no idea where you get this.

I didn't say all instructions do this. The prefetchnta, movntq and movntps instructions do.
> Also, by "filter" I'll assume you mean the filter code.

No, I meant the filter data. Seemed obvious to me, so I didn't think I had to specify it further...
> I didn't say anything about equivalence. I said "performs nearly like", which is true.

Correct me if I'm wrong, but "performs nearly like" is pretty much stating it's about equivalent.
> I didn't say anything about changing associativity from Thoroughbred to Barton. I did say "Athlons have 16-way set associativity for the L2 cache", and both are Athlon processors, so this is true as well.

Then why is it that the Athlon did not benefit from its increase in cache? Assuming its associativity remained the same, the delta in caching performance shouldn't have been affected. Double 16-way is still double, just as double 8-way is still double...
> Since you're such an expert, I thought you'd know... Oh well, look for a program named sim-cache.

What exactly are you using sim-cache to measure when you say "it performs almost as good as an 8-way set associative cache of double the size"? Cache miss rate? I would assume so; however, I would argue that cache miss rate also depends on cache-line size and the adaptiveness of the prefetch algorithm. So tell me, where exactly did you get data on two identical architectures, one with 8-way associative caching and the other with 16-way, and measure that the cache miss rate of the 16-way was close to half that of the 8-way?
> Wow, you're such a very good reader. I must have written that in really small print, because I can't find myself stating that anywhere.

"It's been shown that it performs almost as good as an 8-way set associative cache of double the size. Taking this further, a 256 kB 16-way set associative cache is nearly like a 8 MB cache without associativity. A Pentium 4 on the other hand has 'only' 8-way set associativity. A 256 kB 8-way set associative cache is nearly like a 4 MB cache without associativity."
> Most of the time there is absolutely no relationship between cache size and loop size. Like in my example of a filter, you mostly can't keep the whole image in the cache, so you have to read every pixel from RAM anyway. So it doesn't matter how big your loop is. What does matter is the filter size. It's very uncommon, but if the working set didn't fit in the L1 cache it would be much less efficient. But again, that directly has nothing to do with loop size. It's always about data size.

You don't have to keep the whole image in cache. How many image filtering programs do you know that work on the whole image? Most split the image into sub-blocks and process those, then run a global-level filter at the end to "weed out" any edge distortions. DivX is one of the video imaging filters that does this quite extensively for low-bitrate video.
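For illustration, a minimal sketch of that sub-block (tiling) pattern in Python. The tile size is a made-up number, and a trivial per-pixel operation stands in for a real filter; actual filters also need halo pixels around each tile, which is where the edge-distortion cleanup comes in:

```python
# Apply a per-pixel operation tile by tile, so only a TILE x TILE block
# of the image is "hot" in cache at any moment. TILE is a made-up value.
TILE = 64

def process_tiled(img, op):
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for ty in range(0, h, TILE):          # walk over tile origins
        for tx in range(0, w, TILE):
            for y in range(ty, min(ty + TILE, h)):
                for x in range(tx, min(tx + TILE, w)):
                    out[y][x] = op(img[y][x])
    return out

img = [[(x + y) % 256 for x in range(200)] for y in range(100)]
inverted = process_tiled(img, lambda p: 255 - p)
```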
> Absolutely, that's why I said "once or a few times". But the point is that it isn't cacheable. That's not terrible either, because it's prefetchable. Again, however, it has nothing to do with loop size.

I would say it is cacheable. See above.
> No, I meant the filter data. Seemed obvious to me, so I didn't think I had to specify it further...

I would say the majority of the filter data would just be a subset of the image you're working on.
> Correct me if I'm wrong, but "performs nearly like" is pretty much stating it's about equivalent.

That's true, but you said equivalent, not -about- equivalent. You're as good at bending your own words as at bending mine. Do you work in PR?
> Then why is it that the Athlon did not benefit from its increase in cache? Assuming its associativity remained the same, the delta in caching performance shouldn't have been affected. Double 16-way is still double, just as double 8-way is still double...

It did benefit, just not as much. The 16-way associativity of Athlons results in less data being evicted by other data that shares the same hash (in practice, the lower address bits). In other words, the cache miss rate will be lower, so it can more effectively keep the active working set close to the core. Or, rephrased again for simplicity: it can cache more useful data.
> What exactly are you using sim-cache to measure when you say "it performs almost as good as an 8-way set associative cache of double the size"? Cache miss rate? I would assume so; however, I would argue that cache miss rate also depends on cache-line size and the adaptiveness of the prefetch algorithm. So tell me, where exactly did you get data on two identical architectures, one with 8-way associative caching and the other with 16-way, and measure that the cache miss rate of the 16-way was close to half that of the 8-way?

Yes, cache miss rate is the main factor that, together with search latency and miss penalty, determines performance. You clearly didn't try hard to find the answers yourself, because sim-cache is completely configurable and is an emulator. So it doesn't matter what your own computer's cache-line size or prefetch unit is like.
> Not only was your implication that we can compare n-way associative cache with discrete mapping in a *linear* way, but you also implied that the higher your associativity, the larger cache size your cache will "act like".

See above.
> You don't have to keep the whole image in cache. How many image filtering programs do you know that work on the whole image? Most split the image into sub-blocks and process those, then run a global-level filter at the end to "weed out" any edge distortions. DivX is one of the video imaging filters that does this quite extensively for low-bitrate video.

Applications like Paint Shop Pro and ColorDraw work interactively, so you see the result between every filter step. Hence the image doesn't get reused in a short time, and it isn't cached. Video filters like those used with Avisynth also apply one filter at a time, exactly to avoid edge distortion effects. Run-time filters like those used with DivX might work a little differently, but I would be surprised if more than a few basic filters could operate on subsets of the image. I would appreciate it if you could show me a reference that confirms your claim. Anyway, varghesejim's original sentence was "the optimal size of cache depends on the loop size". This is not true in general, and certainly not for the loop he showed.