The Cache Question (Warning Ugly)

March 5, 2002 4:37:34 PM

Ok this is ugly but thank MeldarthX for this.
Some basic information.

A table on cache information.

L1CS = L1 Cache Size (in bytes)
L1CLS = L1 Cache Line Size (in bytes)
L1CL = L1 Cache Lines
L2CR = L2 Cache Ratio
L2CA = L2 Cache Associations
L2CS = L2 Cache Size (in bytes)
L2CL = L2 Cache Lines

PROC      L1CS   L1CLS   L1CL  L2CR  L2CA    L2CS   L2CL
P4        8192       4   2048     8     8  262144  16384
P4NW      8192       4   2048    16     8  524288  32768
AMD      65536       4  16384     1     8  262144  32768
AMD13a   65536       4  16384     2     8  524288  65536
AMD13b  131072       4  32768     1     8  524288  65536

PS: I could not find anything on the actual size of the P4’s L1 except that it holds over 12000 micro-ops, so a fair number of these figures have been fudged. The L2 Cache Ratio is a multiplier chosen so that half the L1 cache size, multiplied by the L2 Cache Associations (always 8 on seventh-generation and later processors) and by the ratio, equals the final L2 cache size. All numbers assume that half of the cache is used for data and half for code.
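
A quick way to sanity-check the table's bookkeeping is sketched below. It reads the PS as (L1CS / 2) x L2CA x L2CR = L2CS and assumes L1CL = L1CS / L1CLS; both readings are an interpretation of the post, and the names `TABLE` and `row_is_consistent` are made up for this example. It says nothing about the real hardware, since the figures are admittedly fudged.

```python
# (L1CS, L1CLS, L1CL, L2CR, L2CA, L2CS) copied from the table in the post
TABLE = {
    "P4":     (8192,   4, 2048,  8,  8, 262144),
    "P4NW":   (8192,   4, 2048,  16, 8, 524288),
    "AMD":    (65536,  4, 16384, 1,  8, 262144),
    "AMD13a": (65536,  4, 16384, 2,  8, 524288),
    "AMD13b": (131072, 4, 32768, 1,  8, 524288),
}


def row_is_consistent(l1cs, l1cls, l1cl, l2cr, l2ca, l2cs):
    """Check one row against the two relations assumed in the lead-in."""
    lines_ok = l1cs // l1cls == l1cl                 # L1 lines = size / line size
    size_ok = (l1cs // 2) * l2ca * l2cr == l2cs      # half of L1 x ways x ratio
    return lines_ok and size_ok


for name, row in TABLE.items():
    print(name, row_is_consistent(*row))
```

The L2CL column is left out of the check because the post does not say how it was derived.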

AMD13a would be an increase in the L2 cache to 512K, while AMD13b would be an increase in both the L1 and L2.

Enough said. The question is as follows. Considering some people believe that the P4 has a crippled L1, while the Athlon has a very robust L1 with a relatively modest L2, and if NW acquired a 10 to 20% increase by increasing its L2, what percentage of increase would there be for an appropriately configured Athlon?
March 6, 2002 3:36:11 AM

The Northwood is more dependent on data throughput than the AMD; the gain from a 512k cache on an AMD chip will NOT be as high percentage-wise as the gain the NW received.


However there would be a gain.

"The Cash Left In My Pocket,The BEST Benchmark"
No Overclock+stock hsf=GOOD!
March 7, 2002 2:01:22 AM

the size of the cache on the PIII and P4 is 8k, it hasn't changed to my knowledge, but if that is wrong please let me know.

By comparison, the Athlon with its 64k of L1 cache doesn't fill as fast as the PIII's and P4's L1 cache. This is why the jump in performance in going from 256k to 512k of L2 would be smaller than the PIII's and P4's jump.

Also, the contents of the PIII's and P4's L1 cache are repeated in the L2, which reduces the effective cache size but speeds up transfers from the L1 to the L2, which is important to Intel's CPUs.

The Athlon's cache is unified, with the L1 and L2 seen as a whole.

This brings an interesting question to the equation. If the L1 cache on the Athlon were raised by 64k, bringing the total to 128k of L1, would that performance gain be larger than the gain from raising the L2 cache from 256k to 512k on the Athlon?

I believe it would be. We all saw the jump in performance with the first Athlon. We also saw the jump in the first tbirds, just bringing the cache on to the die was a nice jump. Not as large as some thought it would be because of the large L1 cache.

For a real jump in performance, AMD should raise the L1 and L2......L1 by 64k, and the L2 by the difference....it would make a funky L2 number, but the increase in performance would be huge......

If there are flaws in my math.....let me know, I just got off a 10 hour shift.....:) 

MeldarthX
March 7, 2002 2:12:47 PM

Yes you got flaws there!
The Athlons have 128K now.
AFAIK, there is enough in L1 to ensure the L2 won't need much more. I'm wondering if Dual-Channel Cache is possible?

Oh, and since when did the P3 have 8K of L1? I don't remember any processor before a 486 having so little!!

--
For the first time, Hookers are hooked on Phonics!!
March 7, 2002 2:22:59 PM

What the Athlon needs is faster cache not more cache, unlike the P4. A good upgrade to low latency 256-bit (rather than 64-bit) L2 Cache should give the Athlon a sizable boost.

AMD technology + Intel technology = Intel/AMD Pentathlon IV; the ULTIMATE PC processor
March 7, 2002 2:40:24 PM

AFAIK, the Athlon does have 128k of L1 cache, but it's separated into 64k code and data. These two caches then share their 8 way associations with their L2 of 256k. (I.e. 64k * 8 = 256k). The PIII has 32K of L1 divided into 16K data and code, and these do not share their associations. (I.e. 16k * 8 = 128k) What I don’t understand is how Intel chips can maintain their 8 way associations and still increase their L2. It would seem to me that a PIII with an L1 of 16k would require 16 way associations to fill a 256k L2 cache. (512k = 16k * 16 + 16k * 16) My head would explode if I attempted to comment on the P4.

All errors are undocumented features waiting to be discovered.
March 7, 2002 7:01:15 PM

thanks...........its been awhile since i've been able to read the tech docs on cpus..........:) 

What AMD Man points out is right: Athlons don't suffer from a lack of cache, just cache speed. Hopefully the Tbred will fix this..........

But now an increase in the pathways of the cache in the Athlon and raising it to 512k would be a very nice boost indeed........

MeldarthX
March 7, 2002 7:02:15 PM

Can you explain cache associations? or direct us to some reading on this subject.

Quote:
These two caches then share their 8 way associations with their L2 of 256k. (I.e. 64k * 8 = 256k).

You seem to be indicating that the size of the L1 cache dictates the size of the L2 cache. How does this mechanism differ between Athlons and Durons? Both have 128KB of L1 cache but the latter only has 64KB of L2 cache.

I've often wondered how such a small L2 cache benefits the Duron at all.

We are all beta testers!
March 7, 2002 7:24:15 PM

Here is a quick one.

anandtech: http://www.anandtech.com/cpu/showdoc.html?i=1252&p=5

I was under the assumption that the number of associations of the L2 is directly related to the size of the L1, but I am not completely sure.

All errors are undocumented features waiting to be discovered.
March 7, 2002 7:37:27 PM

Here’s a good table

AMDvsIntel.pdf: http://common.ziffdavisinternet.com/download/0/1326/AMD...

and I’m definitely wrong about the L1 determining the size of the L2. Oh well I hope I didn’t cause that much damage.
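
For anyone else untangling this, here is a minimal sketch of why associativity does not tie the L2 size to the L1 size. In a set-associative cache, size = sets x ways x line size, and each cache level picks those three numbers on its own. The line sizes and way counts below are assumptions chosen for illustration, not figures from this thread, and `cache_sets` is a made-up helper name.

```python
def cache_sets(size_bytes: int, line_bytes: int, ways: int) -> int:
    """Number of sets in a set-associative cache of the given geometry."""
    if size_bytes % (line_bytes * ways) != 0:
        raise ValueError("size must be a multiple of line_bytes * ways")
    return size_bytes // (line_bytes * ways)


# Hypothetical geometries (assumed for illustration only)
l1_sets = cache_sets(64 * 1024, 64, 2)       # 64 KB, 64-byte lines, 2-way
l2_sets = cache_sets(256 * 1024, 64, 8)      # 256 KB, 64-byte lines, 8-way
l2_big_sets = cache_sets(512 * 1024, 64, 8)  # same 8 ways, twice the sets

print(l1_sets, l2_sets, l2_big_sets)
```

Doubling an 8-way L2 from 256 KB to 512 KB just doubles the number of sets; the way count can stay at 8 regardless of what the L1 looks like.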


All errors are undocumented features waiting to be discovered.
March 7, 2002 7:59:12 PM

No harm done (except to my eyes. That PDF has what looks like a 4 point type size on a 15" monitor).

Guess I'll have to do my own research. I think I understand how a set associative cache works but I'm looking to learn the interaction between L1 and L2 caches. I'd like to learn how different cache architectures enhance/degrade performance. I'm also a little curious why on-motherboard L3 caches aren't used. There was a time when motherboards had 1 and 2 MB of L2 cache.

We are all beta testers!
March 9, 2002 5:40:37 AM

last chip to use l3 was the K6-3 chip from AMD......the biggest reason why motherboards don't have cache on them anymore is cost. Putting a couple of megs of cache on the mb raises the cost of the mb and cuts profits on them. Nvidia is playing with up to 8 megs of l3 cache for their nforce mb chipsets for the Athlons......

I'm already very impressed with that chipset, and we've seen as the drivers matured just how fast it really could be....:) 

One of the problems of Intel's cache system is that the L1 cache is duplicated in the L2.....yes, on some accounts it does speed up transfer of data, but you have less room to work with, and to keep those speeds you have to have a very wide data path.

AMD's cache is associative (sp?).....meaning it sees the cache as one large cache; that is why its data path could get away with only being 64 bits wide. But we are starting to see that data path's limitation also. It has less bandwidth for the data than Intel's but a much larger cache to hold the data. If the new Athlon's cache data path is changed to 256 bits, then we will see a nice boost even if there is no added cache. But I am sure we will see more cache because of the .13 process.....
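
The "duplicated" versus "one large cache" distinction is usually called inclusive versus exclusive caching. A rough sketch of what it means for usable capacity follows, taking the thread's descriptions at face value (Intel's L2 repeating the L1, the Athlon and Duron treating L1 and L2 as one pool). The function name is hypothetical and the model ignores everything except capacity.

```python
def effective_capacity_kb(l1_kb: int, l2_kb: int, exclusive: bool) -> int:
    """Unique data the L1 + L2 pair can hold, in KB."""
    # Exclusive: a line lives in L1 or L2, never both, so capacities add.
    # Inclusive: everything in L1 is also in L2, so L2 is the ceiling.
    return l1_kb + l2_kb if exclusive else max(l1_kb, l2_kb)


# Sizes as quoted in this thread
print("Athlon:", effective_capacity_kb(128, 256, exclusive=True))
print("Duron: ", effective_capacity_kb(128, 64, exclusive=True))
print("PIII:  ", effective_capacity_kb(32, 256, exclusive=False))
```

Under this simple model a Duron's 64 KB of L2 still adds a full 64 KB on top of its L1, which is one way to see how such a small L2 can be worth having.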

MeldarthX
March 9, 2002 7:09:31 AM

regarding the athlon cache, you COULD use the duron as comparison...

i.e. a late model spitfire duron at 1000 and an athlon 1000B would give good values for 64k -> 256k one could assume then that 256 -> 512 would give similar or slightly smaller advantages.

I love helping people in Toms Forums... It reinforces my intellectual superiority! :smile:
March 9, 2002 3:29:56 PM

Great link! That is just what I was looking for. Plain english! Last time I was trying to understand this stuff, all I could find were various engineering white papers. Way over my head.

We are all beta testers!
March 9, 2002 3:38:14 PM

I'm still trying to absorb the information at Schmide's Systemlogic.net link. I'm not sure how expensive 1MB of L3 and associated circuitry would be. On the other hand, there was a time I was paying $300+ for a motherboard (pre-Athlon, pre-P3 days). For all I know those large motherboard caches could have been the reason for the high costs.





We are all beta testers!
March 9, 2002 3:44:54 PM

Quote:
the biggest reason why motherboards don't have cache on them anymore is cost

Second biggest reason is that memory speeds have made up the difference so the L3 is no longer necessary.


All errors are undocumented features waiting to be discovered.
March 9, 2002 5:06:07 PM

Bingo, I remember those days very well. consumers wanted lower costs, one of the things that got cut was the l3 cache.

MeldarthX
March 9, 2002 5:11:20 PM

Yes and no.......as l2 cache became faster l3 seemed to become less and less important, but the K6-3 proves that wrong. The K6-3 was the first to have onboard L2 cache and use the l3 cache of the motherboard. We all saw the difference between the K6-2 and the K6-3. The k6-3 was much faster because of the L3 cache.

The K6-3 was designed to see the L3 cache as part of the cpu cache as a whole, even though the mb cache was running much slower speed. When you have the information preloaded into the mb cache its feeding the cpu directly.

Will we ever see L3 cache again? I think we will. Nvidia was playing with 8 MB of L3 cache for their Crush chipset, and there were some very nice performance gains from it.

MeldarthX
March 9, 2002 5:49:53 PM

The K6-3 was faster than the K6-2 because it had integrated L2 cache. The K6-2 relied on the motherboard for the cache.

AMD technology + Intel technology = Intel/AMD Pentathlon IV; the ULTIMATE PC processor
March 9, 2002 6:46:18 PM

I had completely forgotten about the on-chip L3 cache with which nVidia has been experimenting. I have really high hopes for the 2nd generation of nForce. I think their lightspeed (or whatever it's called) memory architecture will come into its own with faster DRAM. As it stands in nForce(1), the bandwidth is split between onboard video and main system. Bandwidth for video is far below that of the average video card, and main system bandwidth is no greater than with other chipsets. Granted, the bus speed limits bandwidth the most, but the architecture doesn't seem to show any improvement over conventional memory architecture. Perhaps an on-chip cache (on the MCP chip, or whatever that's called; I don't have a good memory for techno-jargon) WILL greatly help.

We are all beta testers!
March 9, 2002 7:03:40 PM

IMO the P4 would greatly benefit from L3 cache. Anything that adds bandwidth would help it. The L3 could maybe help in storing needed operations that the FPU or ALU alone cannot handle...
Could anybody explain?

--
For the first time, Hookers are hooked on Phonics!!
March 9, 2002 7:53:40 PM

Check out the last link that Schmide provided. Large caches can improve performance but on average tend to reduce bandwidth. Larger caches mean longer cache searches. I believe the P4's tiny 8K L1 cache was chosen to boost memory bandwidth, but this is only true for sequential memory access. More scattered access and more program branching kills the bandwidth (in my opinion).
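
The sequential versus scattered point can be seen in a toy model. The sketch below simulates a direct-mapped 8 KB cache with 64-byte lines, which is a deliberate simplification and not a model of the P4's actual L1; the function name, sizes, and access patterns are all assumptions picked for illustration.

```python
import random


def hit_rate(addresses, size_bytes=8192, line_bytes=64):
    """Hit rate of a direct-mapped cache over a sequence of byte addresses."""
    num_lines = size_bytes // line_bytes
    tags = [None] * num_lines          # block number currently held by each line
    hits = 0
    total = 0
    for addr in addresses:
        block = addr // line_bytes     # which memory block the byte belongs to
        index = block % num_lines      # which cache line that block maps to
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block        # miss: fill the line with the new block
        total += 1
    return hits / total if total else 0.0


sequential = range(0, 1 << 20, 4)      # walk 1 MB in 4-byte steps
rng = random.Random(0)                 # fixed seed so runs are repeatable
scattered = [rng.randrange(0, 1 << 20) for _ in range(len(sequential))]

print(f"sequential: {hit_rate(sequential):.3f}")
print(f"scattered:  {hit_rate(scattered):.3f}")
```

In the sequential walk each 64-byte line is fetched once and then serves the next fifteen 4-byte reads, so 15 out of 16 accesses hit. With addresses scattered over 1 MB, almost every access lands on a line that has since been replaced.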

We are all beta testers!
March 10, 2002 7:06:53 AM

The dataset an application works with decides the optimum size of the cache. That's why Intel cut the L2 cache in half with the Coppermine processors: a typical desktop PC doesn't need half an MB of cache, which costs a lot too.

With the P4's memory bandwidth as high as 3.2 GB/sec, the amount of data getting into the P4 isn't really the problem; what is critical is the internal bandwidth, L2 to processor core. Having a larger L2 cache would help in memory-intensive apps by increasing the effective bandwidth, but not by much. Memory has become fast, but the proportion still isn't quite right. Internal bandwidth requirements of the Pentium 4 core are as high as 12+ GB/sec (IPC x core speed), which need to be satisfied by a better L2/L1 cache combination. 256k or 512k of core-speed cache is fine, but what serves the core is the L1, which should be large enough and filled properly by the L2. L3 cache in real-world terms (more scattered access and more program branching) would be a waste of resources since it won't really help performance much; increasing the L2 is a much better option.
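
To put the two figures side by side, here is a small sketch. It assumes the 3.2 GB/sec number comes from a 64-bit bus doing 400 million transfers per second, which is an assumption about its origin; the 12 GB/sec core demand is simply the figure quoted above, and the helper name is made up.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s (decimal, 1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


memory_bw = peak_bandwidth_gb_s(400, 64)   # assumed 400 MT/s on a 64-bit bus
core_demand = 12.0                         # GB/s, as quoted in the post

print(f"memory peak: {memory_bw:.1f} GB/s")
print(f"core demand is {core_demand / memory_bw:.2f}x the memory peak")
```

The gap between the two numbers is what the caches have to cover, which is why the hit rate close to the core matters more than raw memory speed.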

In real-world terms, where memory access is more scattered and more program branching takes place, you need a larger cache that's closer to the core. Ideally, we need register-speed memory, which is achieved to a close approximation by the L1 cache. Having a larger L1 cache will certainly improve the overall core bandwidth, but overdoing it won't offer any advantages. 8k is too small, 32 or 64k might be optimum, as Athlon and P3 performance indicates.

girish

Nothing is fool-proof. Fools are Ingenious!
March 10, 2002 7:59:07 AM

Quote:
8k is too small, 32 or 64k might be optimum, as Athlon and P3 performance indicates.

I wonder how one goes about determining optimum cache sizes. I understand that there are trade-offs. A large cache increases the chance of a "hit" but too large a cache means more time is spent searching the cache. I have no idea how penalizing a cache "miss" is. I neither know nor understand things like how quickly data can be accessed in L1 cache, L2 cache, or main memory. I only vaguely understand that retrieving data from sequential addresses is much faster than if the addresses are truly random. However, I don't know how much faster, nor how fast data access is relative to cache speeds. It seems obvious that cache sizes can't be optimal for all situations. I imagine it depends on how a program is coded, compiled and optimized. It depends on how the data the program uses is arranged. The amount of total system memory must also be a factor. So how does one pick an optimum cache size when a processor can be used in very different ways? Why did Intel choose a size of 8kb for L1 cache? There must be a reason.
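
One common way to reason about this trade-off is average memory access time: hit time plus miss rate times miss penalty, applied level by level. The sketch below uses invented latencies and miss rates purely to show the shape of the trade-off; none of the numbers are measurements of any real CPU, and `amat` is a hypothetical helper.

```python
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_latency):
    """Average memory access time in cycles for a two-level cache."""
    l2_penalty = l2_hit + l2_miss_rate * mem_latency   # cost of an L1 miss
    return l1_hit + l1_miss_rate * l2_penalty


# Hypothetical numbers, for illustration only
small_fast_l1 = amat(l1_hit=2, l1_miss_rate=0.10, l2_hit=10,
                     l2_miss_rate=0.20, mem_latency=200)
big_slow_l1 = amat(l1_hit=3, l1_miss_rate=0.05, l2_hit=10,
                   l2_miss_rate=0.20, mem_latency=200)
no_l2 = amat(l1_hit=2, l1_miss_rate=0.10, l2_hit=0,
             l2_miss_rate=1.0, mem_latency=200)

print(small_fast_l1, big_slow_l1, no_l2)
```

With these made-up inputs the larger, slower L1 wins, but a different workload (different miss rates) can flip the result, which is why no single cache size is optimal for every use. The last case, where every L1 miss goes straight to memory, hints at why a second cache level is worth having.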

We are all beta testers!
March 10, 2002 10:41:38 AM

Well, there is no hard and fast rule for determining the optimum amount of cache. Especially in the x86 world, people look upon the processor as a general-purpose device that can be used for all applications, ranging from accounts, word processing and surfing to CAD, engineering and scientific apps, heavy-duty games and even serving! It is difficult to cater to such a wide range of applications with a single device. So the 256k cache Intel put into the P3 was a pretty good tradeoff between cost and performance, optimum for the major range of applications that the P3 was supposed to work with.

It might be worthwhile to consider that even the first Celeron 266 and 300 MHz units without any L2 cache did a pretty good job for day-to-day use, but adding that measly 128k of L2 did help performance a lot, and it opened the market for what they call the value segment, where performance isn't the real criterion. You could define it as slow and cheap!

As I mentioned earlier, the nature and number of applications the processor is to run, and the largest dataset an application is supposed to use, dictate what amount of cache would be beneficial. Servers and workstations need more cache because they execute monotonous code and/or work with large amounts of data. Having this code or data stay in the cache for a longer time is beneficial for their performance. Imagine a workstation editing an 800x600 bitmap on a processor with 2 MB of L2 cache: the complete bitmap of roughly 1.4 MB is in the cache. A server with the same amount of cache would have all the code to serve the webpage and run the scripts in the cache most of the time. On the other hand, on a typical desktop system, when you switch between applications, all the code and data has to be driven out and new code and data filled in. No amount of cache can compensate for such behaviour, since a cache miss occurs for sure! Hence, desktops and low-end workstations can get by with a smaller amount of cache.
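
The bitmap figure can be checked with a couple of lines. The sketch assumes 24-bit colour (3 bytes per pixel), which the post does not state but which matches its number; `bitmap_bytes` is a made-up helper.

```python
def bitmap_bytes(width: int, height: int, bits_per_pixel: int = 24) -> int:
    """Raw size of an uncompressed bitmap in bytes (no header, no padding)."""
    return width * height * bits_per_pixel // 8


size = bitmap_bytes(800, 600)            # assumes 24-bit colour
l2_bytes = 2 * 1024 * 1024               # the 2 MB L2 from the example

print(size, "bytes =", round(size / (1024 * 1024), 3), "MB")
print("fits in a 2 MB cache:", size <= l2_bytes)
print("fits in a 256 KB cache:", size <= 256 * 1024)
```

The same image is far larger than a 256 KB desktop-class L2, which is the contrast the paragraph is drawing.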

Intel's approach of making two versions of the Pentium-III is in fact a good policy for both segments, since a desktop user doesn't really need the extra cache. Instead, he goes for the desktop model with the smaller cache and saves money. Specialised server processors like the Xeon and upcoming Hammers have to have a much larger cache, on the order of megabytes, to improve their performance.

As for the amount of L1 cache, different theories exist. It is, in fact, dependent on the penalty associated with a cache miss that occurs in the L1. If it's too high, it's better to reduce this kind of miss with a larger cache. If it isn't really much, then why the heck do we want an L1 at all!?

I wonder how much a cache as small as 8k could help! I would really like to see the performance difference with that cache turned off!

girish


Nothing is fool-proof. Fools are Ingenious!
March 10, 2002 4:19:41 PM

Quote:
I wonder how much a cache as small as 8k could help! I would really like to see the performance difference with that cache turned off!

This brings up more questions. Since L1 and L2 cache are now on the die, why isn't there just one sizeable L1 cache? Why is an L2 cache used at all? Does L2 cache use less die space than L1 cache?


We are all beta testers!
March 10, 2002 4:49:35 PM

Quote:
Yes and no.......as l2 cache became faster l3 seemed to become less and less important, but the K6-3 proves that wrong. The K6-3 was the first to have onboard L2 cache and use the l3 cache of the motherboard. We all saw the difference between the K6-2 and the K6-3. The k6-3 was much faster because of the L3 cache.

Actually this is untrue. The reason the K6-3 offered improvements was due to the fact that its L2 was on die and ran at full processor clock, as opposed to running at 100 MHz on the motherboard as the K6-2's had to. This made the additional L3 cache on the motherboard rather insignificant, and only in rare cases where the motherboard had a large amount (>1 meg) of on-board cache (L3) would it offer any significant improvement at all. This is why the K6-3 was faster at certain applications than the P3 (pre-Coppermine) of that time period. Unfortunately, the P3 had a vastly superior FPU and still was a better gaming chip. As processors moved to higher clocks, it was difficult to have the cache running at full processor clock, and this once again saw the movement of the L2 cache off the chip and onto a separate board. This brought about the cartridge style of CPUs, with their L2 running at or around 1/2 CPU clock. This was still a much higher clock rate than the motherboard's FSB, however, and adding an L3 still made no sense. The only way adding L3 cache to a motherboard today would make any improvement would be by adding a large amount of it at a fast enough clock speed. I believe Micron actually played around with this a bit just before they got out of the business altogether.

It's not what they tell you, its what they don't tell you!
March 10, 2002 6:00:20 PM

Which brings me to ask: Why were there cartridge CPUs and Socket CPUs?
I've always wondered...it seems like cartridges ran less hot since they look protected... and they also looked cooler that way...


--
For the first time, Hookers are hooked on Phonics!!
March 10, 2002 6:46:26 PM

It's simple: they needed to go to a cartridge to get the cache off the core of the processor. This was due to the fact that at that time there was no cache capable of operating at full CPU clock. Hence the slot cartridge with a CPU and its cache mounted on a PCB. The reason they offered better cooling was the increased surface area via a heatspreader; plus, one of the heat-producing elements was no longer in the die of the CPU.

It's not what they tell you, its what they don't tell you!
March 10, 2002 7:13:26 PM

Ahh nice, so did all P2s have that? I never checked mine and it's gone now :( .
Although my friend's Celly 400 was a slot cartridge.
In the end, what allowed them to put that full speed cache back into socket form?

--
For the first time, Hookers are hooked on Phonics!!
March 10, 2002 11:30:32 PM

my statement wasn't untrue........I just didn't add more about the L2..........which you guys did point out nicely....which you are right, the l2 had a huge impact on the chip performance when compared to rest of the cpus out there. The L3 average size was 1 meg at least, I remember selling boards with 2-3 megs of L3. The performance of the K6-3 jumped even more with those boards.

People tend to downplay the role mb cache played. Until the K6-3, that was the L2 cache. We saw what happened when the cache was moved from the mb to the cpu, a nice jump in performance, but we also got to see what happened when that cache was left there and was used.

K6-3 was very nice cpu, and still to this date has the highest rated ALU unit in the x86 world. Athlon comes close but doesn't beat it nor does the P4.

MeldarthX
March 10, 2002 11:41:44 PM

Actually the biggest reason Intel went to the Slot format was to keep AMD and Cyrix from following. Due to a court agreement, Intel had to share socket designs with AMD and Cyrix through the first Pentiums; after that Intel was free to do what they wanted.

Intel created the slot processor to keep AMD and Cyrix out of the market.

The slot also made it easy to move cache closer to the cpu, and that would be the second biggest reason.
March 11, 2002 12:08:52 AM

But if it was advantageous, why'd they switch back to socket which is hotter and requires more handling? Also how did they solve the full speed cache when switched to Socket later on?

--
For the first time, Hookers are hooked on Phonics!!
March 11, 2002 5:52:32 AM

well, i heard a lot of reasons why they got into slots in the first place!
1. to keep the competition away
2. to move the heat generated by the processor die off the motherboard
3. mechanical stability is reduced with a large number of pins on the underside of a socketed chip, so they designed the edge connector
4. they wanted to put some cache along with the processor, but fabricating it on the die was costly and it took up a large amount of space
5. make a ton by making people think it was a revolutionary new design, cutting edge latest tech! see how much these P-II 233 and 266 MHz chips were priced when they were launched!!

they succeeded in all of them! AMD did make a slot design (the Tbird slot A) but it was different from that of Intel.

Now back to sockets! I did have a feeling that time that they would eventually get back to sockets, and they did! maybe the reasons were:
1. making a PCB mounted processor was relatively costly, with additional steps and material involved, and it did not give them any significant advantage (I remember the time when P-II/Celeron prices suddenly increased by as much as 20%, when the customs officials at New Delhi airport noticed the bare Celeron cartridge, reclassified the item from integrated circuits to a PPCB (populated PCB) like motherboards, and hiked the import duty from 5% to 22%! it wasn't too good at least for us Indians :lol:  )
2. fabrication technology improved and it was now possible to integrate a fair amount of L2 cache on the processor die itself, so there was no need for any extra real estate
3. the FSB had to be ramped up, and given the very nature of the problem with high frequencies on a PCB, the cartridge was no different and had the same limitations getting the signals between the cartridge and the board. it was necessary, as far as possible, for all the signal lines to have equal length to retain data integrity.
4. larger circuitry on the core increased heat generation of the package, and it was difficult to improve the thermal design by laterally expanding the cooler/heatsink/fan modules. the only solution was to direct it upwards, and a socket is the best way to do that!
5. make a ton again by making people switch their motherboards again to put in the newer processor!

girish

Nothing is fool-proof. Fools are Ingenious!
March 11, 2002 9:25:05 AM

I think you missed it. The Pentium Pro had an onboard cache. The problem was the price.
!