The Cache Question (Warning Ugly)

Schmide · Mar 5, 2002

Ok this is ugly but thank MeldarthX for this.
Some basic information.

A table on cache information.

L1CS = L1 Cache Size (in bytes)
L1CLS = L1 Cache Line Size (in bytes)
L1CL = L1 Cache Lines
L2CR = L2 Cache Ratio
L2CA = L2 Cache Associations
L2CS = L2 Cache Size (in bytes)
L2CL = L2 Cache Lines

PROC     L1CS    L1CLS  L1CL   L2CR  L2CA  L2CS    L2CL
P4       8192    4      2048   8     8     262144  16384
P4NW     8192    4      2048   16    8     524288  32768
AMD      65536   4      16384  1     8     262144  32768
AMD13a   65536   4      16384  2     8     524288  65536
AMD13b   131072  4      32768  1     8     524288  65536

PS: I could not find anything on the actual size of the P4’s L1 except that it holds over 12000 micro-ops, so a fair number of these figures have been fudged. The L2 Cache Ratio is a multiplier chosen so that the L1 cache size, multiplied by the L2 Cache Associations (always 8 on seventh-generation and later processors) and by the ratio, equals the final L2 cache size. All numbers assume that half of the cache will be associated with data and half with code (for example, the P4 row works out as 8192 / 2 * 8 * 8 = 262144).
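As a sanity check on the table, here is a minimal Python sketch that recomputes two of the columns from the others. It assumes the relations L1CL = L1CS / L1CLS and L2CS = (L1CS / 2) * L2CA * L2CR, where the division by two stands for the half-data, half-code assumption. The dictionary layout and the function name are made up for illustration, and the sketch only checks the table's own arithmetic, not how real caches are sized.

```python
# Rows copied from the table: (L1CS, L1CLS, L1CL, L2CR, L2CA, L2CS).
# The layout and names are illustrative, not from any vendor document.
TABLE = {
    "P4":     (8192,   4, 2048,  8,  8, 262144),
    "P4NW":   (8192,   4, 2048,  16, 8, 524288),
    "AMD":    (65536,  4, 16384, 1,  8, 262144),
    "AMD13a": (65536,  4, 16384, 2,  8, 524288),
    "AMD13b": (131072, 4, 32768, 1,  8, 524288),
}

def check_row(l1cs, l1cls, l1cl, l2cr, l2ca, l2cs):
    """Return True if the row satisfies both assumed relations."""
    lines_ok = l1cs // l1cls == l1cl
    # Half of the L1 is assumed to hold data, half code.
    size_ok = (l1cs // 2) * l2ca * l2cr == l2cs
    return lines_ok and size_ok

results = {name: check_row(*row) for name, row in TABLE.items()}
for name, ok in results.items():
    print(f"{name:7s} consistent: {ok}")
```

Every row passes under those two assumed relations, which says the fudged numbers are at least internally consistent.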

AMD13a would be an increase in the L2 cache to 512K, while AMD13b would be an increase in both the L1 and L2.

Enough said. The question is as follows. Considering some people believe that the P4 has a crippled L1, while the Athlon has a very robust L1 with a relatively modest L2, and if NW acquired a 10 to 20% increase by increasing its L2, what percentage of increase would there be for an appropriately configured Athlon?

Matisaro · Mar 6, 2002

The Northwood is more dependent on data throughput than the AMD, so the gain from a 512k cache on an AMD chip will NOT be as high, percentage-wise, as the gain the NW received.

However there would be a gain.

"The Cash Left In My Pocket,The BEST Benchmark"
No Overclock+stock hsf=GOOD!

MeldarthX · Mar 7, 2002

The size of the L1 cache on the PIII and P4 is 8k; it hasn't changed to my knowledge, but if that is wrong please let me know.

Comparatively, the Athlon with its 64k of L1 cache doesn't fill as fast as the PIII's and P4's L1 cache. This is why the jump in performance in going from 256k to 512k would be less than the PIII's and P4's jump.

Also, the PIII's and P4's L1 cache is duplicated in the L2, which reduces the effective cache size but speeds transfer from the L1 to the L2, which is important to Intel's CPUs.

The Athlon's cache is unified, with the L1 and L2 seen as a whole.

This brings an interesting question to the equation. If the L1 cache on the Athlon were raised by 64k, bringing the total to 128k of L1 cache, would that performance gain be larger than the gain from raising the L2 cache from 256k to 512k on the Athlon?

I believe it would be. We all saw the jump in performance with the first Athlon. We also saw the jump in the first tbirds, just bringing the cache on to the die was a nice jump. Not as large as some thought it would be because of the large L1 cache.

For a real jump in performance, AMD should raise the L1 and L2......L1 by 64k, and the L2 by the difference....it would make a funky L2 number, but the increase in performance would be huge......

If there are flaws in my math.....let me know, I just got off a 10 hour shift.....

MeldarthX

eden · Mar 7, 2002

Yes you got flaws there!
The Athlons have 128K now.
AFAIK, there is enough in L1 to ensure the L2 won't need much more. I'm wondering if Dual-Channel Cache is possible?

Oh, and since when did the P3 have 8K of L1? I don't remember any processor before a 486 having so little!!

--
For the first time, Hookers are hooked on Phonics!!

AMD_Man · Mar 7, 2002

What the Athlon needs is faster cache not more cache, unlike the P4. A good upgrade to low latency 256-bit (rather than 64-bit) L2 Cache should give the Athlon a sizable boost.

AMD technology + Intel technology = Intel/AMD Pentathlon IV; the ULTIMATE PC processor

Schmide · Mar 7, 2002

AFAIK, the Athlon does have 128k of L1 cache, but it's separated into 64k code and 64k data. These two caches then share their 8 way associations with their L2 of 256k. (I.e. 64k * 8 = 256k). The PIII has 32K of L1 divided into 16K data and 16K code, which do not share their associations. (I.e. 16k * 8 = 128k) What I don’t understand is how Intel chips can maintain their 8 way associations and still increase their L2. It would seem to me that a PIII with an L1 of 16k would require 16 way associations to fill a 256k L2 cache. (512k = 16k * 16 + 16k * 16) My head would explode if I attempted to comment on the P4.

All errors are undocumented features waiting to be discovered.

MeldarthX · Mar 7, 2002

Thanks...........it's been a while since I've been able to read the tech docs on CPUs..........

What AMD_Man points out is right: Athlons don't suffer from a lack of cache, just cache speed. Hopefully the Tbred will fix this..........

But widening the cache pathways in the Athlon and raising the cache to 512k would be a very nice boost indeed........

MeldarthX

phsstpok · Mar 7, 2002

Can you explain cache associations, or direct us to some reading on this subject?

These two caches then share their 8 way associations with their L2 of 256k. (I.e. 64k * 8 = 256k).

You seem to be indicating that the size of the L1 cache dictates the size of the L2 cache. How does this mechanism differ between Athlons and Durons? Both have 128KB of L1 cache, but the latter only has 64KB of L2 cache.

I've often wondered how such a small L2 cache benefits the Duron at all.

We are all beta testers!

Schmide · Mar 7, 2002

Here is a quick one.

anandtech: http://www.anandtech.com/cpu/showdoc.html?i=1252&p=5

I was under the assumption that the number of associations of the L2 is directly related to the size of the L1, but I am not completely sure.

All errors are undocumented features waiting to be discovered.

Schmide · Mar 7, 2002

Here’s a good table

AMDvsIntel.pdf: http://common.ziffdavisinternet.com/download/0/1326/AMDvsIntel.pdf

and I’m definitely wrong about the L1 determining the size of the L2. Oh well I hope I didn’t cause that much damage.
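For anyone following along, the geometry of a set-associative cache can be sketched in a few lines of Python. This assumes the textbook relation size = sets * ways * line size; the 64-byte line and the 8-way, 256 KB figures are illustrative assumptions, not numbers taken from the linked articles. Nothing in it refers to the L1, which fits the correction: a cache's associativity and size are chosen independently of the level above it.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (number of sets, offset bits, index bits) for a set-associative cache."""
    sets = size_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1   # log2, valid for powers of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits

def split_address(addr, line_bytes, sets):
    """Split an address into (tag, set index, byte offset)."""
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset

# Hypothetical 256 KB, 8-way cache with 64-byte lines.
sets, offset_bits, index_bits = cache_geometry(256 * 1024, 64, 8)
print(sets, offset_bits, index_bits)
print(split_address(0x12345, 64, sets))

# Doubling the size at the same 8 ways only doubles the number of sets.
print(cache_geometry(512 * 1024, 64, 8))
```

In this sketch a cache can grow from 256 KB to 512 KB and stay 8-way: the extra capacity shows up as more sets, not more ways.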

All errors are undocumented features waiting to be discovered.

phsstpok · Mar 7, 2002

No harm done (except to my eyes. That PDF has what looks like a 4 point type size on a 15" monitor).

Guess I'll have to do my own research. I think I understand how a set associative cache works but I'm looking to learn the interaction between L1 and L2 caches. I'd like to learn how different cache architectures enhance/degrade performance. I'm also a little curious why on-motherboard L3 caches aren't used. There was a time when motherboards had 1 and 2 MB of L2 cache.

We are all beta testers!

Schmide · Mar 7, 2002

Here is the best one yet, old but very informative.

systemlogic.net: http://www.systemlogic.net/articles/00/10/cache/index.php

I guess before I post I should do a search.

All errors are undocumented features waiting to be discovered.

MeldarthX · Mar 9, 2002

The last chip to use L3 was the K6-3 from AMD......the biggest reason why motherboards don't have cache on them anymore is cost. Putting a couple of megs of cache on the motherboard raises its cost and cuts the profit on it. Nvidia is playing with up to 8 megs of L3 cache for their nForce motherboard chipsets for the Athlons......

I'm already very impressed with that chipset, and we've seen as the drivers matured just how fast it really could be....

One of the problems with Intel's cache system is that the L1 cache is duplicated in the L2.....yes, on some accounts it does speed up transfer of data, but you have less room to work with, and to keep those speeds you need a very high data rate between the caches.

AMD's cache is associative.....meaning it sees the cache as one large cache, which is why its data path could get away with being only 64 bits wide. But we are starting to see that data path's limitation too. It has less bandwidth for the data than Intel's but a much larger cache to hold the data. If the new Athlon's cache data path is changed to 256 bits, then we will see a nice boost even if there is no added cache. But I am sure we will see more cache because of the .13 process.....

MeldarthX

lhgpoobaa · Mar 9, 2002

Regarding the Athlon cache, you COULD use the Duron as a comparison...

i.e. a late model Spitfire Duron at 1000 and an Athlon 1000B would give good values for 64k -> 256k. One could then assume that 256k -> 512k would give similar or slightly smaller advantages.

I love helping people in Toms Forums... It reinforces my intellectual superiority!

phsstpok · Mar 9, 2002

Great link! That is just what I was looking for. Plain English! Last time I was trying to understand this stuff, all I could find were various engineering white papers. Way over my head.

We are all beta testers!

phsstpok · Mar 9, 2002

I'm still trying to absorb the information at Schmide's Systemlogic.net link. I'm not sure how expensive 1MB of L3 and associated circuitry would be. On the other hand, there was a time I was paying $300+ for a motherboard (pre-Athlon, pre-P3 days). For all I know those large motherboard caches could have been the reason for the high costs.

We are all beta testers!

Schmide · Mar 9, 2002

the biggest reason why motherboards don't have cache on them anymore is cost

Second biggest reason is that memory speeds have made up the difference so the L3 is no longer necessary.

All errors are undocumented features waiting to be discovered.

MeldarthX · Mar 9, 2002

Bingo, I remember those days very well. Consumers wanted lower costs, and one of the things that got cut was the L3 cache.

MeldarthX

MeldarthX · Mar 9, 2002

Yes and no.......as L2 cache became faster, L3 seemed to become less and less important, but the K6-3 proves that wrong. The K6-3 was the first to have onboard L2 cache and use the motherboard's cache as L3. We all saw the difference between the K6-2 and the K6-3. The K6-3 was much faster because of the L3 cache.

The K6-3 was designed to see the L3 cache as part of the CPU cache as a whole, even though the motherboard cache was running at a much slower speed. When you have the information preloaded into the motherboard cache, it's feeding the CPU directly.

Will we ever see L3 cache again? I think we will. Nvidia was playing with an 8 MB L3 cache for their Crush chipset and there were some very nice performance gains from it.

MeldarthX

AMD_Man · Mar 9, 2002

The K6-3 was faster than the K6-2 because it had integrated L2 cache. The K6-2 relied on the motherboard for the cache.

AMD technology + Intel technology = Intel/AMD Pentathlon IV; the ULTIMATE PC processor

phsstpok · Mar 9, 2002

I had completely forgotten about the on-chip L3 cache with which nVidia has been experimenting. I have really high hopes for the 2nd generation of nForce. I think their lightspeed (or whatever it's called) memory architecture will come into its own with faster DRAM. As it stands in nForce(1), the bandwidth is split between onboard video and the main system. Bandwidth for video is far below that of the average video card, and main system bandwidth is no greater than with other chipsets. Granted, the bus speed limits bandwidth the most, but the architecture doesn't seem to show any improvement over conventional memory architecture. Perhaps an on-chip cache (in the MCP chip, or whatever that's called; I don't have a good memory for techno-jargon) WILL greatly help.

We are all beta testers!

eden · Mar 9, 2002

IMO the P4 would greatly benefit from L3 cache. Anything that adds bandwidth would help it. The L3 could maybe help in storing needed operations that the FPU or ALU alone cannot handle...
Could anybody explain?

--
For the first time, Hookers are hooked on Phonics!!

phsstpok · Mar 9, 2002

Check out the last link that Schmide provided. Large caches can improve performance but on average tend to reduce bandwidth. Larger caches mean longer cache searches. I believe the P4's tiny 8K L1 cache was chosen to boost memory bandwidth, but this is only true for sequential memory access. More scattered access and more program branching kills the bandwidth (in my opinion).

We are all beta testers!

girish · Mar 10, 2002

The dataset an application works with decides the optimum size of the cache. That's why Intel cut the L2 cache in half with the Coppermine processors: a typical desktop PC doesn't need half an MB of cache, which costs a lot too.

With the P4's memory bandwidth as high as 3.2 GB/sec, the amount of data getting into the P4 isn't really a problem; what is critical is the internal bandwidth, L2 to processor core. Having a larger L2 cache would help in memory-intensive apps, increasing the effective bandwidth, but not by much. Memory has become fast, but the proportion still isn't quite right. Internal bandwidth requirements of the Pentium 4 core are as high as 12+ GB/sec (IPC x core speed), which need to be satisfied by a better L2/L1 cache combination. 256k or 512k of core-speed cache is fine, but what serves the core is the L1, which should be large enough and filled properly by the L2. L3 cache in real-world terms (more scattered access and more program branching) would be a waste of resources since it won't really help much in performance; increasing the L2 is a much better option.

In real-world terms, where memory access is more scattered and more program branching takes place, you need a larger cache that's closer to the core. Ideally, we need register-speed memory, which is achieved to a close approximation by the L1 cache. Having a larger L1 cache will certainly improve the overall core bandwidth, but overdoing it won't offer any advantages. 8k is too small, 32 or 64k might be optimum, as Athlon and P3 performance indicates.
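The bandwidth figures in this post are peak numbers of the form bus width times transfer rate. A rough Python sketch of that arithmetic follows. It assumes a 64-bit bus at 400 million transfers per second as the source of the 3.2 GB/sec figure, and a hypothetical 1.5 GHz clock with one transfer per cycle for comparing the 64-bit and 256-bit L2 data paths mentioned earlier in the thread. The function name is made up for illustration.

```python
def peak_bandwidth_gb_s(bus_bits, transfers_per_sec):
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes) for a given bus width and rate."""
    return bus_bits / 8 * transfers_per_sec / 1e9

# 64-bit bus at 400 million transfers/s (assumed origin of the 3.2 GB/sec figure).
fsb = peak_bandwidth_gb_s(64, 400e6)

# Hypothetical 1.5 GHz cache clock, one transfer per cycle.
l2_narrow = peak_bandwidth_gb_s(64, 1.5e9)
l2_wide = peak_bandwidth_gb_s(256, 1.5e9)

print(f"memory bus:  {fsb:.1f} GB/s")
print(f"64-bit L2:   {l2_narrow:.1f} GB/s")
print(f"256-bit L2:  {l2_wide:.1f} GB/s")
```

These are ceilings only. Real throughput also depends on latency and access pattern, which is the point about scattered access and branching.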

girish

Nothing is fool-proof. Fools are Ingenious!

phsstpok · Mar 10, 2002

8k is too small, 32 or 64k might be optimum, as Athlon and P3 performance indicates.

I wonder how one goes about determining optimum cache sizes. I understand that there are trade-offs. A large cache increases the chance of a "hit", but too large a cache means more time is spent searching the cache. I have no idea how penalizing a cache "miss" is. I neither know nor understand things like how quickly data can be accessed in L1 cache, L2 cache, or main memory. I only vaguely understand that retrieving data from sequential addresses is much faster than if the addresses are truly random. However, I don't know how much faster, nor how fast data access is relative to cache speeds. It seems obvious that cache sizes can't be optimal for all situations. I imagine it depends on how a program is coded, compiled and optimized. It depends on how the data the program uses is arranged. The amount of total system memory must also be a factor. So how does one pick an optimum cache size when a processor can be used in very different ways? Why did Intel choose a size of 8kb for L1 cache? There must be a reason.
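One common textbook way to frame this trade-off is average memory access time: hit time plus miss rate times miss penalty, applied level by level. Here is a small Python sketch of that formula. Every latency and miss rate in it is hypothetical, chosen only to show the shape of the calculation, and none of it is a measurement of any real chip.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same units as hit_time (cycles here)."""
    return hit_time + miss_rate * miss_penalty

def two_level_amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_latency):
    """AMAT for an L1 backed by an L2 backed by main memory."""
    l2_time = amat(l2_hit, l2_miss_rate, mem_latency)
    return amat(l1_hit, l1_miss_rate, l2_time)

# Hypothetical small, fast L1 versus a larger, slower L1 (all values in cycles).
small_fast = two_level_amat(l1_hit=2, l1_miss_rate=0.10,
                            l2_hit=10, l2_miss_rate=0.20, mem_latency=200)
large_slow = two_level_amat(l1_hit=3, l1_miss_rate=0.04,
                            l2_hit=10, l2_miss_rate=0.20, mem_latency=200)
print(small_fast, large_slow)
```

With these made-up numbers the larger, slower L1 comes out ahead (about 5 cycles against about 7). Lower the small cache's miss rate, as a workload with mostly sequential access might, and the ranking can flip. That is one way to see why no single size is optimal for every program.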

We are all beta testers!
