Cache Architecture or Overall CPU Architecture?

MeTaLrOcKeR · Jan 10, 2002

Ok....I've been pondering this in my head for a long time now, and I just don't know what to think of it....the only thing I can come up with is that it is VERY impressive to say the least...now read on to find out what I'm talking about...

All Socket A processors use a cache data-path (for L2 cache anyway) that is only 64 bits wide....now, that is a VERY narrow path, is it not ?? Especially compared to all Pentium III Coppermines and better (Tully, P4W & NW), which are 256 bits wide.....

Now....the Athlons (all of them) only have this 64-bit wide cache interface and they STILL outpace the P3.....Tully, and P4 Willy in the majority of cases.....

Now, could this 64-bit data-path be castrating the Athlon ??

Example.......The unexplained reason why the P4 outperforms the Athlon in time demos of Q3A....it's by a significant margin as well....could this be due to the data-pathway of the L2 cache architecture ??!?? Like, what would happen if AMD put a 256-bit wide data-pathway on the Athlon now for the L2 cache?? Would it then have a negligible performance increase? Or would it be significant?

This 64-bit L2 cache data-pathway interface is the only thing I can really think of right now that would explain this unexplained experience in Q3A.....we know it's not totally due to the P4 running the quad-pumped FSB, as it still remains when running either DDR or RDRAM on the P4 platform.....so the cache architecture is what comes to my mind, and I did a bit of research on it, and the Athlon is STILL running with this 64-bit data-pathway and it can totally compete with its Intel counterparts......is this good engineering on the part of the K7 architecture or just coincidence?? I'd really like to know.....anyone's thoughts, comments? All is appreciated...

-MeTaL RoCkEr

My Z28 can take your P4 off the line!

Edited by MeTaLrOcKeR on 01/11/02 11:39 AM.

MeTaLrOcKeR · Jan 10, 2002

So no one has anything to say about this at all ?

What the hell, is everyone only interested in AMD vs. Intel flame wars or what??? This is something relevant.....and all I wanted was some input on my questions......

-MeTaL RoCkEr

My Z28 can take your P4 off the line!

FatBurger · Jan 10, 2002

I'm interested too, but I have nothing to say.

Quarter Pounder Inside

MeTaLrOcKeR · Jan 11, 2002

Maybe I'm confused by your statement? Does this mean you only want to view other opinions too? Or are you just being too lazy to put forth any info?

-MeTaL RoCkEr

My Z28 can take your P4 off the line!

FatBurger · Jan 11, 2002

I too am interested in seeing what people know, but I have nothing useful I can contribute. That's probably how I should've worded it the first time.

Quarter Pounder Inside

juin · Jan 11, 2002

Maybe there is a software prefetch in Quake 3, so the P4 runs way faster if there is less misprediction.

On the cache side, not only does the cache have less bandwidth but also way more latency. The L1 cache of the P4 is fast, effective and has high bandwidth, but a 128 KB version of it could cost a lot.

http://gamershq.madonion.com/products/orb/?publish_compare.shtml?&project_type=6&dprid=2310900

juin · Jan 11, 2002

Maybe there is a software prefetch in Quake 3, so the P4 runs way faster if there is less misprediction.

On the cache side, the L1 cache of the P4 is fast, effective and has high bandwidth, but a 128 KB version of it could cost a lot.

http://gamershq.madonion.com/products/orb/?publish_compare.shtml?&project_type=6&dprid=2310900

juin · Jan 11, 2002

Maybe there is a software prefetch in Quake 3, so the P4 runs way faster if there is less misprediction.

On the cache side, the L1 cache of the P4 is fast, effective and has high bandwidth, but a 128 KB version of it could cost a lot.

http://gamershq.madonion.com/products/orb/?publish_compare.shtml?&project_type=6&dprid=2310900

AMD_Man · Jan 11, 2002

In benchmark tests, the 256-bit L2 cache of modern Intel processors has proven to be 30-50% faster than the Athlon's L2 cache, not 4 times faster as the bus width would suggest. The Athlon is also far more efficient at utilizing available bandwidth than any Intel processor. In modern AMD processors, data is not replicated from the L1 cache to the L2 cache, so the actual cache space is not wasted. This increases the effective bandwidth use by reducing the amount of time wasted copying data from one cache to another. However, I don't agree with AMD's comment that a 256-bit L2 cache wouldn't improve performance. I believe it would improve performance by 5%, on average. That might seem insignificant but it's enough for an Athlon XP 2000+ to outpace a 2.2GHz Northwood.
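To put rough numbers on the bus-width point: on paper, peak bandwidth is just bus width times clock, so a 256-bit path is 4x a 64-bit one at the same clock, and an exclusive L1/L2 pair holds more unique data than a pair where L2 duplicates L1. Here is a minimal Python sketch of that arithmetic. It assumes one transfer per clock, and the function names, the 1000 MHz clock and the 128 KB / 256 KB sizes are illustrative choices for the example, not measurements.

```python
def peak_bandwidth_gb_s(bus_width_bits, clock_mhz, transfers_per_clock=1.0):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


def unique_capacity_kb(l1_kb, l2_kb, exclusive):
    """Unique data held across L1 and L2, in KB.

    Exclusive: L2 never duplicates L1, so the sizes add up.
    Otherwise: assume every L1 line is also in L2, so the larger level is the ceiling.
    """
    return l1_kb + l2_kb if exclusive else max(l1_kb, l2_kb)


if __name__ == "__main__":
    clock_mhz = 1000  # hypothetical cache clock, identical for both designs
    narrow = peak_bandwidth_gb_s(64, clock_mhz)
    wide = peak_bandwidth_gb_s(256, clock_mhz)
    print(f"64-bit path:  {narrow:.1f} GB/s peak")
    print(f"256-bit path: {wide:.1f} GB/s peak ({wide / narrow:.0f}x on paper)")
    print("exclusive 128+256 KB:  ", unique_capacity_kb(128, 256, exclusive=True), "KB unique")
    print("duplicating 128+256 KB:", unique_capacity_kb(128, 256, exclusive=False), "KB unique")
```

The 4x figure is only a ceiling. A measured gap of 30-50% is consistent with latency and access patterns mattering more than raw width.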

Remember, with such high-speed caches, the bandwidth isn't the worst bottleneck, the latency is. P4 processors may have great bandwidth for their L2 cache but their latency is worse than the AXP's. The P4 L1 cache is amazingly powerful in potential. I remember reading that the P4 can directly execute data from the L1 cache in some cases, thus virtually eliminating latency altogether. Now, I am not certain about this information (perhaps Raystonn can shed some light), but if this is the case, I would be extremely impressed by the P4's potential. The problem is, not many instructions or much data can be cached in the L1. So you are saving only a couple of cycles each time data is found in L1.

In truth, the P4 has always had a lot more potential to be an amazing breakthrough for x86 processors, but Intel fell short of their hype in 2000 with the seemingly rushed release of the P4. Intel appeared to be afraid of losing market share to AMD in 2000. Had Intel given up a bit of market share temporarily, and spent more time and money on R&D, they may have had a processor that was clock for clock faster and more advanced than anything AMD could ever hope to release with their relatively limited (at least compared to Intel) budget.
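One common way to see why latency matters so much is the textbook average memory access time (AMAT) formula: hit time plus miss rate times miss penalty, applied level by level. The Python sketch below uses made-up cycle counts and miss rates purely for illustration. None of the numbers describe a real Athlon or P4, and the function names are hypothetical.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time for one cache level, in cycles."""
    return hit_time + miss_rate * miss_penalty


def two_level_amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_penalty):
    """AMAT seen by the core with an L1 backed by an L2 backed by RAM.

    The L1 miss penalty is itself the AMAT of the L2.
    """
    l2_amat = amat(l2_hit, l2_miss_rate, mem_penalty)
    return amat(l1_hit, l1_miss_rate, l2_amat)


if __name__ == "__main__":
    # Hypothetical numbers: 2-cycle L1, 5% L1 misses, 20% of those also miss L2,
    # 150 cycles to reach RAM. Only the L2 latency differs between the two cases.
    slow_l2 = two_level_amat(2, 0.05, 10, 0.20, 150)
    fast_l2 = two_level_amat(2, 0.05, 6, 0.20, 150)
    print(f"L2 latency 10 cycles: {slow_l2:.2f} cycles per access on average")
    print(f"L2 latency  6 cycles: {fast_l2:.2f} cycles per access on average")
```

Every L1 miss pays the L2 latency in full, no matter how wide the path is, which is why a lower-latency L2 can compete with a higher-bandwidth one.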

AMD technology + Intel technology = Intel/AMD Pentathlon IV; the ULTIMATE PC processor

MeTaLrOcKeR · Jan 11, 2002

Very good information AMD_Man......

But how would increasing the data-pathway to 256 bits wide ONLY increase performance by 5% on average?!?

The Pentium 3 Katmai processors used a 64-bit wide L2 cache data-pathway to the off-die L2 cache...as did the Slot A Athlon.......but once Intel incorporated everything on-die with "Advanced Transfer Cache" on the Coppermines, at least they made it 256 bits wide....and that seems to be the only logical conclusion I can come up with as to why clock for clock the Coppermine vs. Athlon Thunderbird Type-'B' would be neck and neck in gaming, and by negligible amounts, the P3 had a SLIGHT advantage over the Athlon...even though the P3 was at a 133MHz FSB and the T-Bird is at a 100MHz FSB in the case I'm talking about...the difference was almost not there.......

One WOULD think that increasing the bandwidth WOULD be a good thing, as I know the Athlon NEVER wastes ANY clock cycles.....it's VERY efficient....maybe even more so than the P4.....just because of the way the P4 was designed, we all know a big L1 cache would in fact make a penalty to the P4, and that's because it's got a long pipeline....it would have many cache misses......again, a lot of IPC wasted......good thing that Intel made that new style of L1 cache for the P4.....

The Athlon on the other hand, it really utilizes the big cache.......it is VERY efficient, another reason why I would think if it had a bigger data-pathway, it would be able to process a LOT more per EDIT:CLOCK than it currently can, and maybe even reduce overall heat to the die of the CPU....anyone follow ??? Basically you won't have to cram everything through a narrow pipe and make it heat up the tiny wires/paths because they're being forced, but with the bigger pathway it can flow nice and easily, and even MORE efficiently.....anyone follow me? Or am I not making sense anymore?

Well, please, I'd like this topic to stay up there, and for more people to post their thoughts please =)

Oh and FatBurger...ok I understand, you were right, your first post kinda confused me....

This post was edited as my typing is terrible, and I wouldn't have wanted to offend anyone...LoL Actually, it just made me look like an idiot.....but please forgive me

-MeTaL RoCkEr

My Z28 can take your P4 off the line!

Edited by MeTaLrOcKeR on 01/11/02 11:46 AM.

MeTaLrOcKeR · Jan 11, 2002

Juin......Please be clearer with your post.........

Yes, the P4 wouldn't benefit from a large L1 cache...it would have an even worse IPC than it already does.....

ALSO....can you please refrain from TRIPLE posting? Thanx...

-MeTaL RoCkEr

My Z28 can take your P4 off the line!

Raystonn · Jan 11, 2002

I know the Athlon NEVER wastes ANY clock cycles.....

It wastes plenty... All processors waste clock cycles fetching things from memory when they are not in the L1 or L2 cache.

we all know a big L1 cache would in fact make a penalty to the P4, and that's because it's got a long pipeline....it would have many cache misses

Huh? (A larger cache would result in fewer cache misses and a large improvement.)

I would think if it had a bigger data-pathway, it would be able to process a LOT more per cock than it currently can

*chuckle* Watch those typos...

-Raystonn

= The views stated herein are my personal views, and not necessarily the views of my employer. =

eden · Jan 11, 2002

I think Ray has it right though; logically a larger cache means much more efficiency and somewhat less latency. The more available resources you have, the faster you can produce, rather than simply storing a small amount while the rest must wait. I do not know if increasing bandwidth for AXPs would help; it depends on how it is used, considering that its bus and memory usage is quite different from the P4's. In any case AMD isn't failing with their CPUs here, so even without large cache bandwidth we're still talking big performance.

--
The other day I heard an explosion from the other side of town.... It was a 486 booting up...

AMD_Man · Jan 11, 2002

It wastes plenty... All processors waste clock cycles fetching things from memory when they are not in the L1 or L2 cache.

Completely agree! It's impossible not to waste clock cycles. The Athlon wastes cycles even when accessing L1 and L2 cache, however. All processors waste cycles, ALL THE TIME.

we all know a big L1 cache would in fact make a penalty to the P4, and that's because it's got a long pipeline....it would have many cache misses

I agree with Raystonn, what you said makes absolutely no sense. The bigger the cache the more the cache hits not misses! The large cache compensates for the long pipeline in the P4.

I would think if it had a bigger data-pathway, it would be able to process a LOT more per cock than it currently can

So the Athlon is a pimp now?

AMD technology + Intel technology = Intel/AMD Pentathlon IV; the ULTIMATE PC processor

balzi · Jan 11, 2002

a few issues need clearing here.

Yes, the P4 wouldnt benefit from a large L1 cache...it would have an even worse IPC than it already does.....

how is this so?? in my mind cache increases mean benefits. And certainly IPC is not really related to cache.. I suppose in a way "useful" instructions per clock would go up with larger L1 OR L2 caches, but either way the facts are in opposition to this statement.

other issues are -> datapath to cache. I believe AMD_man's 5% would be close, even if it is speculative. The reason... well, performance drops off as the chunks of data that the processor is working on get bigger than the cache size (L1 and then L2+L1) ..

When you get a cache miss, it takes a certain number of cycles to grab the data the CPU needs from RAM. All data inside the cache is still being accessed incredibly fast - usually at full speed on today's processors. So if you speed up the cache <-> CPU datapath (widen it) then you only help so much...
However, if you could get a few channels or quadruple your effective datapath to the RAM, well now you've got somethin.... oops, that's RDRAM isn't it...
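The "you only help so much" point can be sketched with Amdahl's law: if only a fraction of run time is spent on the part you speed up, the overall gain is capped by everything else. In the Python sketch below, the 6% share of time spent waiting on L2 transfers is a made-up figure, picked only because it lands near the 5% estimate discussed in this thread. The function name is hypothetical too.

```python
def overall_speedup(fraction_affected, local_speedup):
    """Amdahl's law: whole-program speedup when only part of the time gets faster."""
    return 1.0 / ((1.0 - fraction_affected) + fraction_affected / local_speedup)


if __name__ == "__main__":
    l2_share = 0.06  # hypothetical fraction of run time limited by L2 transfer width
    for width_gain in (2, 4, 1_000_000):
        gain = overall_speedup(l2_share, width_gain)
        print(f"L2 path {width_gain}x wider -> {100 * (gain - 1):.1f}% faster overall")
```

Even an infinitely wide path cannot recover more than the share of time that was actually waiting on it, so the interesting question is how large that share is for a given workload.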

in conclusion to Balzi's rant cum thesis, IPC is irrelevant to cache speed arguments, if you can't decide between speeding up cache<->CPU datapath and RAM<->cache path then you need to stop designing mainstream processors.

Here's some info. for a x86/87 reality check.
My processors at work
PIC_inside -> 16f877 44PLCC package.
Clock -> 4.9152 MHz with 4x divider (yes, I mean divider) whopping 2 stage pipeline , 1 decode, 1 execute.
RAM -> 384 bytes full speed
HD -> solid_state flash memory, 256 bytes full speed read, 16 ms write time per byte.
no AGP, no PCI, no North, south east or west, up or down Bridges, just a UART, 10 bit ADC (~20kSps)

runs, 1 instruction per cycle or sometimes 2 cycles. UART max speed 75.6kbaud.. WITH ADDRESSING heehhe

beat that.. I know you're all jealous

I spilled coffee all over my wife's nighty... ...serves me right for wearing it?!?

MeTaLrOcKeR · Jan 11, 2002

First I'd like to apologize to Ray and everyone else.......Please excuse my typos, I just quickly scanned through the previous posts and fixed what I saw........you'll notice that.......

Ok......Let me explain what I was trying to say a bit better.....

I was reading a website almost a year ago, and it was explaining the K7 architecture in detail........I didn't really mean that it doesn't actually waste any clock cycles ever........it's just I remember reading a few websites saying that when a clock cycle is wasted it's put to use somewhere else OR it gets recycled instead of just being dropped and the CPU having to request another.......does that make sense Ray ?!? That's more of what I meant, that is why I said that it is very efficient........

Now....about the L1 cache.....again, at this same site I was talking about, it was a P4 vs. K7 architecture comparison.....and they spoke about this new style L1 cache that Intel developed for it. They said that BECAUSE of the way the P4 was designed, a larger L1 cache would penalize it more than a smaller L1 cache with a fast advanced pre-fetching feature.....logically a bigger L1 cache would help DECREASE cache misses.....

It's like throwing sand onto a piece of paper.....the more glue spread over the piece of paper the more sand you will have on the sheet.......same concept sort of.......

Supply & Demand, you must have enough supply to meet the demand type thing....

They said something about a lot of instructions getting lost down the path of the big pipeline of the P4, and that would be why a smaller L1 cache but at the same time a FASTER L1 cache would be much better than just a bigger sized L1.....

It's like.....You have 5 kids with REALLY bright flashlights walking down a path and they need to get to the end of the forest without wandering off the path, otherwise they will get eaten by wild animals and such.......they'd be better off with the bright lights shining, walking quickly down the path, than let's say 20 kids with not so bright lights walking down the path, as they will have to travel slower, otherwise they'd bump into each other and possibly push people off the path when it's already hard enough to see as their lights are not so bright.......

Does that make sense at all Ray or anyone else ???

That's kind of how they explained it on the website I'm talking about...dammit I wish I saved my favourites last time I formatted but I always seem to forget to...grrrr......

Anyways.......yea, I think I made SOME sense with this post.......

-MeTaL RoCkEr

My Z28 can take your P4 off the line!

xxsk8er101xx · Jan 11, 2002

ya pretty much just amd vs intel bs. I posted a post about pocket pcs no one was interested. I think it's great technology and has a lot of potential to make startrek type things a reality! But people don't see that they only want to bi'tch and complain.

About the cache architecture: most or all of the performance increase is due to the cache design. If you look at history, cache has been the primary factor in increasing performance.

We cache everything. From L1 cache to virtual memory (that's a cache!). A better cache design will influence performance.

How come, with Intel's better cache design, it still gets stomped? I would think it would have something to do with the architecture itself.

If AMD had a better cache design it would have an impact. Maybe something 1.4 times. Nothing major but it would help.

<A HREF="http://www.anandtech.com/mysystemrig.html?id=9933" target="_new"> My Rig </A>

LoveGuRu · Jan 11, 2002

you might gain 5%(maybe), yet you have to make your core bigger as you need 4 times the pipelines to L1 memroy, that would traslate into more heat and other panelties.

if i made any mistakes by this post plz do correct me.


*******
*K.I.S.S*
*(k)eep (I)t (S)imple (S)tupid*
*******

xxsk8er101xx · Jan 11, 2002

Pipelining is just a concept, like cache, to increase performance, and the two are not related at all.

--added--
Not related as in: having more cache does not mean you need more pipelines. They are related only in that they both increase performance.

Also, if there were no pipelining, the processor would wait until the whole instruction is done and then go on to the next one.

The next question is single cycle processor vs multi cycle processor? HUGE difference between the two.

-----------

A pipeline is the same concept as an assembly line.

See, the MIPS I architecture has 5 stages to an instruction.

fetch->readblock->execute->writeblock->store

That would be the load word instruction. Every instruction is different and has different stages, but loading a word takes the most. Fetch through store take up a cycle each.

The MIPS I architecture has a 5-stage pipeline.

When one cycle is done it feeds another instruction into the 5-stage pipeline.

So if you had 5 instructions one after another and they were all loads, it would look like this.

fetch->readblock->execute->writeblock->store
-------fetch->readblock->execute->writeblock->store
--------------fetch->readblock->execute->writeblock->store
---------------------fetch->readblock->execute->writeblock->store
----------------------------fetch->readblock->execute->writeblock->store

That's a 5-stage pipeline, and that's the concept behind pipelining. Simple, and it's very effective. Now there are more complexities to it, such as the different forms of hazards that you run into. There are fixes for those as well, such as forwarding and stalls. Ever heard of the NOP instruction? There is a reason for it being there.
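The staggered diagram above can be reproduced in a few lines of Python. This is a minimal sketch of an ideal pipeline only: it assumes one instruction enters per cycle and ignores hazards, stalls and forwarding entirely. It uses the usual textbook MIPS stage names (IF, ID, EX, MEM, WB) rather than the informal names in the post, and all function names are made up for the example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # fetch, decode, execute, memory, write-back


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c ('' if none)."""
    cycles = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * cycles
        for offset, name in enumerate(stages):
            row[i + offset] = name  # instruction i starts one cycle after instruction i-1
        rows.append(row)
    return rows


def pipelined_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles to finish all instructions when a new one starts every cycle."""
    return num_instructions + num_stages - 1


def unpipelined_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles if each instruction must finish all stages before the next starts."""
    return num_instructions * num_stages


def render(rows):
    """Format the schedule as fixed-width text, one instruction per line."""
    return "\n".join(" ".join(f"{cell:<3}" for cell in row).rstrip() for row in rows)


if __name__ == "__main__":
    print(render(pipeline_schedule(5)))
    print("pipelined:  ", pipelined_cycles(5), "cycles")
    print("unpipelined:", unpipelined_cycles(5), "cycles")
```

Real pipelines lose some of this ideal overlap to the hazards and stalls mentioned above.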

Actually you would gain a good amount of performance going from a 64-bit to a 256-bit data path. There is a formula for calculating performance.
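The post does not say which formula is meant. One common textbook candidate is CPU time = instruction count x cycles per instruction (CPI) / clock rate, sketched below in Python. The instruction count, CPI values and clock rates are hypothetical and chosen only to show how a higher clock and a worse CPI can cancel out. The function name is made up.

```python
def cpu_time_seconds(instruction_count, cpi, clock_hz):
    """Classic performance equation: time = instructions * cycles-per-instruction / clock rate."""
    return instruction_count * cpi / clock_hz


if __name__ == "__main__":
    instructions = 1_000_000_000  # hypothetical program length
    # Hypothetical chips: one does more work per clock, the other clocks higher.
    chip_a = cpu_time_seconds(instructions, cpi=1.5, clock_hz=1.5e9)
    chip_b = cpu_time_seconds(instructions, cpi=2.0, clock_hz=2.0e9)
    print(f"chip A (1.5 GHz, CPI 1.5): {chip_a:.2f} s")
    print(f"chip B (2.0 GHz, CPI 2.0): {chip_b:.2f} s")
```

A wider cache path would show up in this formula as a lower CPI, through fewer cycles lost waiting on data. How much lower depends on the workload.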

The only heat that would be added is from the wires that would need to be added in order to implement a 256-bit data path. Not to mention additional MUXes and whatnot. So as you can see it would add cost to the chip. How much performance increase for that cost?

<A HREF="http://www.anandtech.com/mysystemrig.html?id=9933" target="_new"> My Rig </A>

Edited by xxsk8er101xx on 01/11/02 04:36 PM.

FatBurger · Jan 11, 2002

you might gain 5%(maybe), yet you have to make your core bigger as you need 4 times the pipelines to L1 memroy, that would traslate into more heat and other panelties.

if i made any mistakes by this post plz do correct me.

No problem, I'll be glad to fix them.

You might gain 5% (maybe), yet you have to make your core bigger, as you need 4 times as many pipelines to the L1 memory. That would translate into more heat and other penalties.

If I made any mistakes by this post, please correct me.

Oh, did you mean TECHNICAL mistakes?

Quarter Pounder Inside

Edited by FatBurger on 01/14/02 11:29 AM.

MeTaLrOcKeR · Jan 12, 2002

So you're a funny man now, are ya ?

heh heh

-MeTaL RoCkEr

My Z28 can take your P4 off the line!

MeTaLrOcKeR · Jan 12, 2002

That's good information Sk8er.. =) Thanx.......

Well, for the most part I understand the whole situation better now.....

Still, if anyone else has some more information, it would be greatly appreciated, but again, thanx to everyone so far for what you've contributed to this topic.

Oh BTW Sk8er......about the PDAs.......If they were actually going to become like Tricorders (sp?) from Star Trek..then I would DEFINITELY be interested in seeing how these things work etc.

-MeTaL RoCkEr

My Z28 can take your P4 off the line!

Phelk · Jan 12, 2002

The P4 is designed to pump massive amounts of data through it with its high clock speed, larger cache (finally with Northwood), wide datapath and RAMBUS memory. However, as we all know, the processor spends a lot of its time spinning its wheels.

To make the most of this bandwidth will take two things:

1) Improve the compilers to reduce wasted cycles
2) Improve I/O

Going on past history, the Intel compiler hackers have always been a pretty awesome bunch, so I have no worries that the P4 will become more efficient over time.

The RAMBUS roadmap is pretty impressive over the next few years, and feeding the processor with data faster will lessen the impact of cache misses. Also, new bus technology like Hyper-transport will get the non-memory data to the processor faster.

 Smoke me a Chip'er ... I'll be back in the Morgan

LoveGuRu · Jan 12, 2002

What you're saying is compiling data for a specific CPU?
So when you run it on another processor there would be a need for a hybrid component, a higher-level language, or just an emulation, which would translate into big ass lost CPU time on architectures other than Intel's (maybe even on a P3>P4 change?).
That's not a feasible solution, although Sun is using it with its servers, as they are running Unix (Solaris) with a RISC processor (or not..someone correct me plz..); they optimised the server-based applications for their platform for max performance.

Either way no one would compile two versions of the same application. SSE2 (and such) optimisation is acceptable, as it does not require a second version for other platforms, just a few scripts.

im sure i made many technical mistakes in this post so dont be shy..

xxsk8er101xx-
thanks for the info there, yet you didn't answer the main question: would this mean a bigger core size as you need additional pipelines, and doesn't it need some architectural changes as well?

FatBurger-
both plz

*******
*K.I.S.S*
*(k)eep (I)t (S)imple (S)tupid*
*******

LoveGuRu · Jan 13, 2002

restored...


*******
*K.I.S.S*
*(k)eep (I)t (S)imple (S)tupid*
*******
