Cache Architecture or Overall CPU Architecture?

January 10, 2002 3:27:24 PM

Ok....I've been pondering this for a long time now, and I just don't know what to make of it....the only thing I can come up with is that it is VERY impressive, to say the least...now read on to find out what I'm talking about...

All Socket A processors use a cache data path (for the L2 cache, anyway) that is only 64 bits wide....now, that is a VERY narrow path, is it not?? Especially compared to all Pentium III Coppermines and better (Tualatin, P4 Willamette & Northwood), which are 256 bits wide.....

Now....the Athlons (all of them) have only this 64-bit-wide cache interface, and they STILL outpace the P3, Tualatin, and P4 Willamette in the majority of cases.....

Now, could this 64-bit data path be castrating the Athlon??

Example.......the unexplained reason why the P4 outperforms the Athlon in timedemos of Q3A, and by a significant margin as well....could this be due to the data path of the L2 cache architecture??!?? Like, what would happen if AMD put a 256-bit-wide data path on the Athlon now for the L2 cache?? Would the performance increase be negligible, or would it be significant?

This 64-bit L2 cache data-path interface is the only thing I can really think of right now that would explain this unexplained result in Q3A.....we know it's not totally due to the P4 running the quad-pumped FSB, as it still remains when running either DDR or RDRAM on the P4 platform.....so the cache architecture is what comes to my mind. I did a bit of research on it, and the Athlon is STILL running with this 64-bit data path and it can totally compete with its Intel counterparts......is this good engineering on the part of the K7 architecture, or just coincidence?? I'd really like to know.....anyone's thoughts, comments? All are appreciated...

-MeTaL RoCkEr

My <font color=red>Z28</font color=red> can take your <font color=blue>P4</font color=blue> off the line!<P ID="edit"><FONT SIZE=-1><EM>Edited by MeTaLrOcKeR on 01/11/02 11:39 AM.</EM></FONT></P>
January 10, 2002 8:45:52 PM

So no one has anything to say about this at all?

What the hell, is everyone only interested in AMD vs. Intel flame wars or what??? This is something relevant.....and all I wanted was some input on my questions......

-MeTaL RoCkEr

My <font color=red>Z28</font color=red> can take your <font color=blue>P4</font color=blue> off the line!
January 10, 2002 8:56:07 PM

I'm interested too, but I have nothing to say.

<font color=orange>Quarter</font color=orange> <font color=blue>Pounder</font color=blue> <font color=orange>Inside</font color=orange>
January 10, 2002 9:43:02 PM

Maybe I'm confused by your statement? Does this mean you only want to view others' opinions? Or are you just being lazy about putting forth any info? :wink:

-MeTaL RoCkEr

My <font color=red>Z28</font color=red> can take your <font color=blue>P4</font color=blue> off the line!
January 10, 2002 9:45:22 PM

I too am interested in seeing what people know, but I have nothing useful I can contribute. That's probably how I should've worded it the first time.

<font color=orange>Quarter</font color=orange> <font color=blue>Pounder</font color=blue> <font color=orange>Inside</font color=orange>
January 10, 2002 10:04:32 PM

Maybe a software prefetch in Quake 3, so the P4 runs way faster if there is less misprediction.

On the cache side, not only does the cache have less bandwidth, it also has way more latency. The L1 cache of the P4 is fast, effective, and has high bandwidth, but in a 128 KB version it could cost a lot.

http://gamershq.madonion.com/products/orb/?publish_comp...
January 10, 2002 10:05:11 PM

Maybe a software prefetch in Quake 3, so the P4 runs way faster if there is less misprediction.

On the cache side, the L1 cache of the P4 is fast, effective, and has high bandwidth, but in a 128 KB version, that could cost a lot.

http://gamershq.madonion.com/products/orb/?publish_comp...
January 10, 2002 10:05:46 PM

Maybe a software prefetch in Quake 3, so the P4 runs way faster if there is less misprediction.

On the cache side, the L1 cache of the P4 is fast, effective, and has high bandwidth, but in a 128 KB version, that could cost a lot.

http://gamershq.madonion.com/products/orb/?publish_comp...
January 10, 2002 10:22:15 PM

In benchmark tests, the 256-bit L2 cache of modern Intel processors has proven to be 30-50% faster than the Athlon's L2 cache; not 4 times faster, as the bus width would suggest. The Athlon is also far more efficient at utilizing available bandwidth than any Intel processor. In modern AMD processors, data is not replicated from the L1 cache to the L2 cache, so actual cache space is not wasted. This increases the effective bandwidth by reducing the amount of time wasted copying data from one cache to another. However, I don't agree with AMD's comment that a 256-bit L2 cache wouldn't improve performance. I believe it would improve performance by 5%, on average. That might seem insignificant, but it's enough for an Athlon XP 2000+ to outpace a 2.2GHz Northwood.

Remember, with such high-speed caches, bandwidth isn't the worst bottleneck; latency is. P4 processors may have great bandwidth for their L2 cache, but their latency is worse than the AXP's. The P4's L1 cache is amazingly powerful in potential. I remember reading that the P4 can directly execute data from the L1 cache in some cases, thus virtually eliminating latency altogether. Now, I am not certain about this information (perhaps Raystonn can shed some light), but if this is the case, I would be extremely impressed by the P4's potential. The problem is, not many instructions/data can be cached in the L1. So you are saving only a couple of cycles each time data is found in L1.
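The latency-versus-bandwidth point can be put in rough numbers with the standard average-memory-access-time formula. A sketch, where every cycle count is an illustrative assumption rather than a measured P4 or Athlon XP figure:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, all values in CPU cycles."""
    return hit_time + miss_rate * miss_penalty

# Same miss rate and miss penalty; only the L2 hit latency differs.
# A lower-latency L2 wins on AMAT even if a wider bus moves more
# bytes per cycle once the access finally starts.
low_latency_l2  = amat(hit_time=11, miss_rate=0.05, miss_penalty=150)
high_latency_l2 = amat(hit_time=18, miss_rate=0.05, miss_penalty=150)

print(low_latency_l2)   # 18.5 cycles
print(high_latency_l2)  # 25.5 cycles
```

Since most accesses hit, the hit latency term dominates the average; that is why a narrow but quick L2 can keep up with a wide but slower one.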

In truth, the P4 has always had a lot more potential to be an amazing breakthrough for x86 processors, but Intel fell short of their hype in 2000 with the seemingly rushed release of the P4. Intel appeared to be afraid of losing market share to AMD in 2000. Had Intel given up a bit of market share temporarily, and spent more time and money on R&D, they might have had a processor that was clock-for-clock faster and more advanced than anything AMD could ever hope to release with their relatively limited (at least compared to Intel) budget.

AMD technology + Intel technology = Intel/AMD Pentathlon IV; the <b>ULTIMATE</b> PC processor
January 10, 2002 10:47:32 PM

Very good information AMD_Man......

But how would increasing the data path to 256 bits wide make an increase in performance of ONLY 5% on average?!?
The Pentium 3 Katmai processors used a 64-bit-wide L2 cache data path to the off-die L2 cache, as did the Slot A Athlon.......but once Intel incorporated everything on-die with "Advanced Transfer Cache" on the Coppermines, at least they made it 256 bits wide....and that seems to be the only logical conclusion I can come up with as to why, clock for clock, the Coppermine vs. Athlon Thunderbird Type-'B' would be neck and neck in gaming, with the P3 having a SLIGHT advantage over the Athlon by negligible amounts, even though the P3 was at a 133MHz FSB and the T-Bird at 100MHz FSB in the case I'm describing...the difference was almost not there.

One WOULD think that increasing the bandwidth WOULD be a good thing, as I know the Athlon NEVER wastes ANY clock cycles.....it's VERY efficient....maybe even more so than the P4.....just because of the way the P4 was designed, we all know a big L1 cache would in fact make a penalty to the P4, and that's because it's got a long pipeline....it would have many cache misses......again, a lot of IPC wasted......good thing that Intel made that new style of L1 cache for the P4.....The Athlon, on the other hand, really utilizes the big cache.......it is VERY efficient, another reason why I would think that if it had a bigger data path, it would be able to process a LOT more per <b>EDIT:<i>CLOCK</i></b> than it currently can, and maybe even reduce overall heat to the die of the CPU....anyone follow??? Basically you won't have to cram everything through a narrow pipe and make it heat up the tiny wires/paths because they're being forced, but with the bigger pathway it can flow nice and easy, and even MORE efficiently.....anyone follow me? Or am I not making sense anymore?

Well, please, I'd like this topic to stay up there and more people, post on your thought please =)

Oh, and FatBurger...ok, I understand, you were right, your first post kinda confused me....

<i><b>This post was edited as my typing is terrible, and I wouldn't have wanted to offend anyone...LoL Actually, it just made me look like an idiot.....but please forgive me</b></i>

-MeTaL RoCkEr

My <font color=red>Z28</font color=red> can take your <font color=blue>P4</font color=blue> off the line!<P ID="edit"><FONT SIZE=-1><EM>Edited by MeTaLrOcKeR on 01/11/02 11:46 AM.</EM></FONT></P>
January 10, 2002 10:50:21 PM

Juin......please be more clear with your posts.........

Yes, the P4 wouldn't benefit from a large L1 cache...it would have an even worse IPC than it already does.....

ALSO....can you please refrain from <b>TRIPLE</b> posting? Thanx...

-MeTaL RoCkEr

My <font color=red>Z28</font color=red> can take your <font color=blue>P4</font color=blue> off the line!
January 10, 2002 11:08:00 PM

Quote:
I know the Athlon NEVER wastes ANY clock cycles.....

It wastes plenty... All processors waste clock cycles fetching things from memory when they are not in the L1 or L2 cache.


Quote:
we all know a big L1 cache would infact make a penalty to the P4, and thats because its got a long pipeline....it would have many caches misses

Huh? (A larger cache would result in fewer cache misses and a large improvement.)


Quote:
I would think if it had a bigger data-pathway, it would be able to process a LOT more per cock than it currently can

*chuckle* Watch those typos... ;) 

-Raystonn


= The views stated herein are my personal views, and not necessarily the views of my employer. =
January 10, 2002 11:32:51 PM

I think Ray has it right, though: logically, a larger cache means much more efficiency and somewhat less latency. The more available resources you have, the more you can produce quickly, rather than storing a small amount while the rest must wait. I do not know if increasing bandwidth would help the AXP; it depends on how it is used, considering its bus and memory usage compared to the P4, which is quite different. In any case, AMD isn't failing at their CPUs here, so even without large cache bandwidth we're still talking big performance.

--
The other day I heard an explosion from the other side of town.... It was a 486 booting up...
January 10, 2002 11:37:46 PM

Quote:

It wastes plenty... All processors waste clock cycles fetching things from memory when they are not in the L1 or L2 cache.

Completely agree! It's impossible not to waste clock cycles. The Athlon wastes cycles even when accessing L1 and L2 cache, however. All processors waste cycles, ALL THE TIME.

Quote:

we all know a big L1 cache would infact make a penalty to the P4, and thats because its got a long pipeline....it would have many caches misses

I agree with Raystonn; what you said makes absolutely no sense. The bigger the cache, the more cache hits, not misses! The large cache compensates for the long pipeline in the P4.

Quote:

I would think if it had a bigger data-pathway, it would be able to process a LOT more per cock than it currently can

So the Athlon is a pimp now? :wink:


AMD technology + Intel technology = Intel/AMD Pentathlon IV; the <b>ULTIMATE</b> PC processor
January 11, 2002 3:29:43 AM

A few issues need clearing up here.
Quote:
Yes, the P4 wouldnt benefit from a large L1 cache...it would have an even worse IPC than it already does.....

How is this so?? In my mind, cache increases mean benefits. And certainly IPC is not really related to cache.. I suppose in a way "useful" instructions per clock would go up with a larger L1 OR L2 cache, but either way the facts are in opposition to this statement.

The other issue is the data path to the cache. I believe AMD_Man's 5% would be close, even if it is speculative. The reason... well, performance drops off as the chunks of data the processor is working on get bigger than the cache size (L1, and then L1+L2)..

When you get a cache miss, it takes a certain number of cycles to grab the data the CPU needs from RAM. All data inside the cache is still being accessed incredibly fast - usually at full speed on today's processors. So if you speed up the cache <-> CPU datapath (widen it), you only help so much...
However, if you could get a few channels or quadruple your effective datapath to the RAM, well, now you've got something.... oops, that's RDRAM, isn't it...
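The "only helps so much" point can be sketched with a back-of-envelope miss-penalty calculation. The RAM latency and line size below are assumptions for illustration, not measured Athlon numbers:

```python
# Cycles to service one cache miss: a fixed RAM access latency plus the
# time to stream a full cache line across the cache<->CPU datapath.
# All figures are illustrative assumptions, not measured values.

LINE_SIZE = 64          # bytes per cache line
RAM_LATENCY = 120       # cycles before the first data arrives from RAM

def miss_penalty(bus_width_bits):
    bytes_per_cycle = bus_width_bits // 8
    transfer_cycles = LINE_SIZE // bytes_per_cycle
    return RAM_LATENCY + transfer_cycles

print(miss_penalty(64))   # 128 cycles: 8 bytes/cycle -> 8 transfer cycles
print(miss_penalty(256))  # 122 cycles: 32 bytes/cycle -> 2 transfer cycles
```

Quadrupling the width shaves only about 5% off the total miss penalty here, because the fixed RAM latency dominates; that lines up with AMD_Man's estimate and with Balzi's point that widening the RAM path matters more.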

In conclusion to Balzi's rant-cum-thesis: IPC is irrelevant to cache-speed arguments, and if you can't decide between speeding up the cache<->CPU datapath and the RAM<->cache path, then you need to stop designing mainstream processors.

Here's some info for an x86/87 reality check.
My processors at work
PIC_inside -> 16f877 44PLCC package.
Clock -> 4.9152 MHz with 4x divider (yes, I mean divider) whopping 2 stage pipeline , 1 decode, 1 execute.
RAM -> 384 bytes <b>full speed</b>
HD -> solid_state flash memory, 256 bytes <b>full speed</b> read, 16mS write time per byte.
no AGP, no PCI, no North, south east or west, up or down Bridges, just a UART, 10 bit ADC (~20kSps)

Runs 1 instruction per cycle, or sometimes 2 cycles. UART max speed 75.6 kbaud.. <b>WITH ADDRESSING</b> heehhe

beat that.. I know you're all <font color=green>jealous</font color=green>


I spilled coffee all over my wife's nighty... ...serves me right for wearing it?!?
January 11, 2002 2:59:46 PM

First, I'd like to apologize to Ray and everyone else.......Please excuse my typos; I just quickly scanned through the previous posts and fixed what I saw........you'll notice that.......

Ok......Let me explain what I was trying to say a bit better.....


I was reading a website almost a year ago, and it was explaining the K7 architecture in detail........I didn't really mean that it never actually wastes any clock cycles ever........it's just that I remember reading a few websites saying that when a clock cycle is wasted, it's put to use somewhere else OR it gets recycled instead of just being dropped and the CPU having to make another request.......does that make sense, Ray?!? That's more of what I meant; that is why I said that it is very efficient........

Now....about the L1 cache.....again, at this same site I was talking about, it was a P4 vs. K7 architecture comparison.....and they spoke about this new style of L1 cache that Intel developed for it. They said that BECAUSE of the way the P4 was designed, a larger L1 cache would penalize it more than a smaller L1 cache with a fast, advanced prefetching feature.....even though logically a bigger L1 cache would help DECREASE cache misses.....

It's like throwing sand onto a piece of paper.....the more glue spread over the piece of paper, the more sand will stick to the sheet.......same concept, sort of.......

Supply & demand: you must have enough supply to meet the demand, type of thing....

They said something about a lot of instructions getting lost down the path of the P4's big pipeline, and that would be why a smaller but at the same time FASTER L1 cache would be much better than just a bigger-sized L1.....

It's like.....you have 5 kids with REALLY bright flashlights walking down a path, and they need to get to the end of the forest without wandering off the path, otherwise they will get eaten by wild animals and such.......they'd be better off with the bright lights shining, walking quickly down the path, than, let's say, 20 kids with not-so-bright lights, who would have to travel slower, otherwise they'd bump into each other and possibly push people off the path when it's already hard enough to see, as their lights are not so bright.......

Does that make sense at all, Ray, or anyone else???

That's kind of how they explained it on the website I'm talking about...dammit, I wish I had saved my favourites the last time I formatted, but I always seem to forget to...grrrr......

Anyway.......yeah, I think I made SOME sense with this post.......

-MeTaL RoCkEr

My <font color=red>Z28</font color=red> can take your <font color=blue>P4</font color=blue> off the line!
January 11, 2002 4:40:47 PM

Yeah, pretty much just AMD vs. Intel BS. I posted about Pocket PCs and no one was interested. I think it's great technology, and it has a lot of potential to make Star Trek-type things a reality! But people don't see that; they only want to bi'tch and complain.

About the cache architecture: consider that most or all of the performance increase is due to the cache design. If you look at history, cache has been the primary factor in increasing performance.

We cache everything, from the L1 cache to virtual memory (that's a cache!). A better cache design will influence performance.

So how come, with Intel's better cache design, it still gets stomped? I would think it has something to do with the architecture itself.

If AMD had a better cache design, it would have an impact. Maybe something like 1.4 times. Nothing major, but it would help.

<A HREF="http://www.anandtech.com/mysystemrig.html?id=9933" target="_new"> My Rig </A>
January 11, 2002 5:12:11 PM

you might gain 5%(maybe), yet you have to make your core bigger as you need 4 times the pipelines to L1 memroy, that would traslate into more heat and other panelties.

if i made any mistakes by this post plz do correct me.

<font color=green>
*******
*K.I.S.S*
*(k)eep (I)t (S)imple (S)tupid*
*******
</font color=green>
January 11, 2002 7:18:15 PM

A pipeline is just a concept, like a cache, to increase performance; they are not related at all.

--added--
Not related in the sense that having more cache means you need more pipelines. They are related in that they both increase performance.

Also, if there were no pipelining, the CPU would wait until a whole instruction is done and then go on to the next one.

The next question is single-cycle processor vs. multi-cycle processor. HUGE difference between the two.

-----------

A pipeline is the same concept as an assembly line.

See, the MIPS I architecture has 5 stages to an instruction:

fetch->readblock->execute->writeblock->store

That would be the load-word instruction. Every instruction is different and has different stages, but loading a word takes the most; fetch through store take up a cycle each.

The MIPS I architecture has a 5-stage pipeline.

When one cycle is done, it feeds another instruction into the 5-stage pipeline.

So if you had 5 instructions one after another, and they were all loads, it would look like this:

fetch->readblock->execute->writeblock->store
-------fetch->readblock->execute->writeblock->store
--------------fetch->readblock->execute->writeblock->store
---------------------fetch->readblock->execute->writeblock->store
----------------------------fetch->readblock->execute->writeblock->store

That's a 5-stage pipeline, and that's the concept behind pipelining. Simple, and it's very effective. Now, there are more complexities, such as the different forms of hazards you run into. There are fixes for those as well, such as forwarding and stalls. Ever heard of the NOP instruction? There's a reason for it being there.
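The staggered diagram above boils down to a simple cycle count; this is a generic sketch of ideal, hazard-free pipelining, not a model of any specific CPU:

```python
# With a k-stage pipeline and no hazards, the first instruction takes
# k cycles, then one instruction completes every cycle after that.
# Without pipelining, every instruction takes the full k cycles.

def pipelined_cycles(n_instructions, n_stages):
    return n_stages + (n_instructions - 1)

def unpipelined_cycles(n_instructions, n_stages):
    return n_instructions * n_stages

# The 5-load MIPS example from the diagram above:
print(pipelined_cycles(5, 5))    # 9 cycles
print(unpipelined_cycles(5, 5))  # 25 cycles
```

Hazards and stalls (the NOPs mentioned above) add cycles on top of the ideal count, which is why real speedup is less than the stage count.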

Actually, you would gain a good amount of performance going from a 64-bit to a 256-bit data path. There is a formula for calculating performance.

The only heat that would be added comes from the extra wires needed to implement a 256-bit data path, not to mention additional MUXes and whatnot. So, as you can see, it would add cost to the chip. And for what performance increase, at what cost?

<A HREF="http://www.anandtech.com/mysystemrig.html?id=9933" target="_new"> My Rig </A><P ID="edit"><FONT SIZE=-1><EM>Edited by xxsk8er101xx on 01/11/02 04:36 PM.</EM></FONT></P>
January 11, 2002 7:49:46 PM

Quote:
you might gain 5%(maybe), yet you have to make your core bigger as you need 4 times the pipelines to L1 memroy, that would traslate into more heat and other panelties.

if i made any mistakes by this post plz do correct me.

No problem, I'll be glad to fix them.

Quote:
You might gain 5% (maybe), yet you have to make your core bigger, as you need 4 times as many pipelines to the L1 memory. That would translate into more heat and other penalties.

If I made any mistakes by this post, please correct me.

Oh, did you mean TECHNICAL mistakes? :wink:

<font color=orange>Quarter</font color=orange> <font color=blue>Pounder</font color=blue> <font color=orange>Inside</font color=orange><P ID="edit"><FONT SIZE=-1><EM>Edited by FatBurger on 01/14/02 11:29 AM.</EM></FONT></P>
January 12, 2002 12:38:49 PM

So you're a funny man now, are ya?

heh heh

-MeTaL RoCkEr

My <font color=red>Z28</font color=red> can take your <font color=blue>P4</font color=blue> off the line!
January 12, 2002 12:42:24 PM

That's good information, Sk8er.. =) Thanx.......

Well, for the most part, I understand the whole situation better now.....

Still, anyone else with some more information, it would be greatly appreciated. But again, thanx to everyone so far for what you've contributed to this topic. :smile:

Oh, BTW, Sk8er......about the PDAs.......if they were actually going to become like tricorders from Star Trek..then I would DEFINITELY be interested in seeing how these things work etc.

-MeTaL RoCkEr

My <font color=red>Z28</font color=red> can take your <font color=blue>P4</font color=blue> off the line!
January 12, 2002 1:22:39 PM

The P4 is designed to pump massive amounts of data through it, with its high clock speed, larger cache (finally, with Northwood), wide datapath, and RAMBUS memory. However, as we all know, the processor spends a lot of its time spinning its wheels.

To make the most of this bandwidth will take two things:

1) Improve the compilers to reduce wasted cycles
2) Improve I/O

Going on past history, the Intel compiler hackers have always been a pretty awesome bunch, so I have no worries that the P4 will become more efficient over time.

The RAMBUS roadmap is pretty impressive over the next few years, and feeding the processor with data faster will lessen the impact of cache misses. Also, new bus technology like HyperTransport will get the non-memory data to the processor faster.

<font color=blue> Smoke me a Chip'er ... I'll be back in the Morgan </font color=blue> :eek: 
January 12, 2002 2:35:35 PM

So what you're saying is compiling code for a specific CPU?
Then when you run it on another processor there would be a need for a hybrid component, a higher-level language, or just emulation, which would translate into a big-ass loss of CPU time on architectures other than Intel's (maybe even with the P3 > P4 change?).
That's not a feasible solution, although Sun is using it with its servers: they run Unix (Solaris) on a RISC processor (or not..someone correct me please..) and they optimised the server-based applications for their platform for max performance.

Either way, no one would compile two versions of the same application. SSE2 (and such) optimisation is acceptable, as it does not require a second version for other platforms, just a few scripts.

I'm sure I made many technical mistakes in this post, so don't be shy..:)

xxsk8er101xx-
Thanks for the info there, yet you didn't answer the main question: would this mean a bigger core size, as you need additional pipelines, and doesn't it need some architectural changes as well?

FatBurger-
Both, please :)

<font color=green>
*******
*K.I.S.S*
*(k)eep (I)t (S)imple (S)tupid*
*******
</font color=green>
January 13, 2002 6:58:47 PM

restored...

<font color=green>
*******
*K.I.S.S*
*(k)eep (I)t (S)imple (S)tupid*
*******
</font color=green>
January 13, 2002 8:38:27 PM

Quote:



In reply to:


You might gain 5% (maybe), yet you have to make your core bigger, as you need 4 times as many piplines to the L1 memory. That would translate into more heat and other penalties.

If I made any mistakes by this post, please correct me.


no no no....
In your fixing frenzy you've re-spelled pipelines to piplines. *add Goons accent* You idiot....

bye

I spilled coffee all over my wife's nighty... ...serves me right for wearing it?!?
January 14, 2002 7:31:56 AM

LoL

Oh go ye brown-eyed toothless wonder.
January 14, 2002 7:35:13 AM

What about the algorithms used? Could it be that AMD has a much better cache algorithm than Intel?

BTW, single-cycle versus multi-cycle...is that equivalent to RISC vs. CISC? If so what is Intel and AMD's architecture? I heard that it used to be:

AMD = RISC
Intel = CISC

Is this still the case today?

Oh go ye brown-eyed toothless wonder.
January 14, 2002 11:48:34 AM

Have you tried the 18F*** series of PICs?

The great thing about the 4 MHz clock is that it takes 1 µs to execute each line of assembler, good for calculating delay loops etc.
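That 1 µs-per-instruction figure makes the delay math trivial. A sketch of the arithmetic, where the 3-cycles-per-iteration loop cost is an assumption (a real PIC delay loop's cost depends on the exact instructions used):

```python
# On a classic PIC, one instruction cycle = 4 oscillator clocks.
# With a 4 MHz oscillator that works out to exactly 1 us per cycle.

OSC_HZ = 4_000_000            # the 4 MHz example from the post above
CYCLE_S = 4 / OSC_HZ          # 1 instruction cycle = 4 clocks = 1 us

def delay_iterations(delay_ms, cycles_per_iteration=3):
    """Iterations of a decrement/test/jump loop (assumed ~3 cycles/pass)
    needed to burn the requested delay."""
    return round(delay_ms / 1000 / (CYCLE_S * cycles_per_iteration))

print(delay_iterations(1))    # ~333 iterations for a 1 ms delay
```

The same arithmetic explains why a 4.9152 MHz crystal gives slightly-under-a-microsecond cycles (about 0.81 µs), which is less convenient for mental math but handy for standard baud rates.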

By the way, Ray (poet and I don't know it), when you're engineering software for good old INTC, are you working more with assembler, C++, or what?

Never spit into the wind...
January 14, 2002 5:31:36 PM

Quote:
AMD = RISC
Intel = CISC


Nope, both have been CISC the whole time. All x86 processors are CISC.
Apple's processors, on the other hand, are RISC. That's probably what was supposed to be in place of AMD; someone just got confused when they told you.

<font color=orange>Quarter</font color=orange> <font color=blue>Pounder</font color=blue> <font color=orange>Inside</font color=orange>
January 14, 2002 6:13:52 PM

Like FatBurger said.......all x86 CPUs are CISC......BUT, if you look IN DEPTH at the AMD K7 architecture, it was designed with a RISC influence......it's been said that the K7 architecture would make a better RISC processor than an x86 processor.........but hey, it's still kickin' some serious arse..... =)

-MeTaL RoCkEr

My <font color=red>Z28</font color=red> can take your <font color=blue>P4</font color=blue> off the line!
January 14, 2002 6:52:06 PM

As far as I know, current AMD and Intel processors handle instructions in a RISC fashion, in that they receive CISC instructions and convert them into simple RISC-like operations, which are faster to execute.

AMD technology + Intel technology = Intel/AMD Pentathlon IV; the <b>ULTIMATE</b> PC processor
January 14, 2002 6:56:47 PM

i'll be polite as much as possible here...

wtf are you talking about, different cache algorithms?

Here are the cache implementations:
Direct mapped
Set associative (2-way set associative, for example)
and fully associative

That's it as far as cache "algorithms" go. There might be other cache designs in development, but that's it. Look in WCPUID under CPU details; it tells you what the cache implementation is, either set associative or fully associative. I haven't seen direct mapping used.

Intel and AMD both use the same three implementations; the last two are used more than direct mapping.
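To make "set associative" concrete, here is a sketch of how a lookup splits an address into tag, set index, and byte offset. The geometry (64 KB, 2-way, 64-byte lines) is borrowed from the Athlon L1 figures WCPUID reports later in this thread; the example address is arbitrary:

```python
# Splitting a 32-bit address for a 2-way set-associative cache.
# Geometry: 64 KB cache, 64-byte lines, 2 ways -> 512 sets.

CACHE_BYTES = 64 * 1024
LINE_BYTES = 64
WAYS = 2
SETS = CACHE_BYTES // (LINE_BYTES * WAYS)    # 512 sets

OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 6 bits of byte offset
INDEX_BITS = SETS.bit_length() - 1           # 9 bits of set index

def split_address(addr):
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0x12345678))  # (9320, 345, 56)
```

Two lines whose addresses differ only in the tag bits land in the same set; a 2-way cache can hold two of them at once, a direct-mapped cache only one, and a fully associative cache ignores the index entirely. That is the whole difference between the three implementations listed above.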

"BTW, single-cycle versus multi-cycle...is that equivalent to RISC vs. CISC? If so what is Intel and AMD's architecture? I heard that it used to be:"

Absolutely not!

A single-cycle CPU does only ONE instruction per cycle, no matter what the length of the instruction is. You measure the longest instruction, and that's your "cycle". Every instruction therefore takes the same amount of time. Even if an add instruction takes 2ns to complete and a load-word instruction takes 5ns (which is usually the longest instruction), the add instruction would take 5ns instead of 2ns. That's a single-cycle processor. As you can see, it does a horrible job of using the resources available to it.

A multi-cycle processor is what current CPUs are. Each instruction gets its own cycle count, so for the add instruction that takes 2ns and the load-word that takes 5ns, the add takes 2ns and the load-word takes 5ns, because each has its own cycles. That, as you can see, uses the resources more efficiently, and it improves performance depending on the code (if it uses lots of loads, or what have you). Adding pipelining makes it use its resources even more efficiently. Notice the modular design of these implementations.
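The add/load-word example can be sketched as a quick timing comparison; the 2ns and 5ns latencies are the illustrative numbers from the paragraph above, and the store/branch figures are assumed for the sake of the example:

```python
# Single-cycle: the clock is set by the slowest instruction, so every
# instruction takes that long. Multi-cycle: each instruction takes only
# the time it actually needs. Latencies in ns are illustrative.

LATENCY_NS = {"add": 2, "load": 5, "store": 4, "branch": 3}

def single_cycle_time(program):
    cycle = max(LATENCY_NS.values())   # clock period set by load-word
    return len(program) * cycle

def multi_cycle_time(program):
    return sum(LATENCY_NS[op] for op in program)

program = ["load", "add", "add", "store", "branch"]
print(single_cycle_time(program))  # 25 ns
print(multi_cycle_time(program))   # 16 ns
```

The gap grows with the share of short instructions in the program, which is exactly the "wasted resources" point: a single-cycle design pays the load-word price on every add.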

If you're going "huh", I can't explain it any more simply than that. In fact, I probably left out details, some of them important. Just go with it, and now you somewhat know what it is, kind of. If you want to know more, you must take a class: computer architecture, computer design, or computer organization. They change the names oh so frequently.

As for CISC and RISC: CISC was the old Pentiums and 486s. The CPUs (Intel's and AMD's) NOW are actually a combination of both, since the CPUs reduce the instructions and compute those; hence "reduced instruction set computing". So to say the Pentium 4 is a CISC or a RISC processor is wrong, since it is both.

<A HREF="http://www.anandtech.com/mysystemrig.html?id=9933" target="_new"> My Rig </A>
January 14, 2002 7:03:01 PM

Hey, they have voice recognition (cell phones), they have LCD touch screens and software (PDAs), so why not mix them in with medicine? Radar. I mean, think about it! It'll be better than Star Trek! Voice-recognition tricorders, lol. Tricorders that can record a person's condition in real time. See through walls with software and optical sensors, and do some nifty things with the software for it. LCD touchpads for notes, stored in memory. Which they already have!

You have any idea how exciting this is? All because of the PDA's evolution into a Pocket PC.

<A HREF="http://www.anandtech.com/mysystemrig.html?id=9933" target="_new"> My Rig </A>
Anonymous
January 14, 2002 7:15:32 PM

Sure, you get good amounts of work done per cycle, but can that baby run Minesweeper?

PIC's are fun.
Anonymous
January 14, 2002 7:15:50 PM

Oops, double post. How in the world did that happen?
<P ID="edit"><FONT SIZE=-1><EM>Edited by knewton on 01/14/02 01:19 PM.</EM></FONT></P>
January 15, 2002 4:03:09 AM

Dude...that would be SWEET...LoL..
Tricorders that see through walls....hahahahahahahaha

What about clothes?????(GIRLS ONLY!!!!!!!!!!!!!) =)~

-MeTaL RoCkEr

My <font color=red>Z28</font color=red> can take your <font color=blue>P4</font color=blue> off the line!
January 15, 2002 6:20:48 AM

Quote:
i'll be polite as much as possible here...

O.k.
Quote:
wtf

Ummm...could ya' try a little harder?

Quote:
here are the cache implementations:
Direct mapping
Set associative (2 way set-associative for example)
and fully associative

Exactly. So if I were to compare the WCPUID results for both the Intel and AMD processors, would they be identical in terms of cache implementation? Hey, wait, I have a Celeron and an AMD system. Here's what I found:


AMD 500Mhz:
[ WCPUID Ver.2.7c (c) 1996-2000 By H.Oda! ]
(Processor 1)
<< Cache Info. >>

[L1 Instruction TLB]
2-Mbyte/4-Mbyte Pages, fully associative, 8 entries
4-Kbyte Pages, fully associative, 16 entries

[L1 Data TLB]
2-Mbyte/4-Mbyte Pages, 4-way set associative, 8 entries
4-Kbyte Pages, fully associative, 24 entries

[L1 Instruction cache]
64K byte cache size, 2-way set associative, 64 byte line size, 1 line par tag

[L1 Data cache]
64K byte cache size, 2-way set associative, 64 byte line size, 1 line par tag

[L2 Unified cache]
512K byte cache size, 2-way set associative, 64 byte line size, 1 line par tag

[L2 Instruction/Unified TLB]
2-Mbyte/4-Mbyte Pages, Off, 0 entries
4-Kbyte Pages, 4-way set associative, 256 entries

[L2 Data TLB]
2-Mbyte/4-Mbyte Pages, Off, 0 entries
4-Kbyte Pages, 4-way set associative, 256 entries


Intel 750Mhz:
[ WCPUID Ver.2.7c (c) 1996-2000 By H.Oda! ]
(Processor 1)
<< Cache Info. >>

[L1 Instruction TLB]
4K byte pages, 4-way set associative, 32 entries
4M byte pages, fully associative, 2 entries

[L1 Data TLB]
4K byte pages, 4-way set associative, 64 entries
4M byte pages, 4-way set associative, 8 entries

[L1 Instruction cache]
16K byte cache size, 4-way set associative, 32 byte line size

[L1 Data cache]
16K byte cache size, 4-way set associative, 32 byte line size

[L2 Unified cache]
256K byte cache size, 8-way set associative, 32 byte cache line

You see what I mean? The caching schemes are indeed different, correct? For example: the Intel uses 4-way set associative for the L1 Instruction cache, while the AMD uses 2-way set associative. And notice that the Celeron doesn't appear to use an L2 translation look-aside buffer.

Secondly, perhaps the memory management algorithm that decides which data gets cached from RAM is significantly different between the two. I don't know, just a thought. Is that o.k. with you?

Quote:
if your going "huh" i can't explain it any simpler then that.

How about being a little less condescending? If it frustrates you to answer my post, then don't respond. You make it look like you're trying your hardest to look smart. Let me know if you want me to respond with "duh" in order to boost your ego...I'll be happy to do so because I'm a nice guy and I want you to feel good about yourself.

Oh go ye brown-eyed toothless wonder.
January 15, 2002 6:47:27 AM

Quote:
Let me know if you want me to respond with "duh" in order to boost your ego...I'll be happy to do so because I'm a nice guy and I want you to feel good about yourself.


LOL ... the technical stuff is a little above me, but I like the humor.

<i>Real knowledge is to know the extent of one's ignorance.</i>
January 15, 2002 12:19:22 PM

i said:

---
What you're saying is compiling code for a specific CPU?
So when you run it on another processor there would be a need for a hybrid component, a higher-level language, or just an emulation layer, which would translate into a big loss of CPU time on architectures other than Intel (maybe even on a P3 > P4 change?).
That's not a feasible solution, although Sun does something like it with its servers, as they run Unix (Solaris) on RISC processors (or not... someone correct me please) and optimize their server applications for that platform for max performance.

Either way, no one would compile two versions of the same application. SSE2 (and similar) optimization is acceptable, as it doesn't require a second version for other platforms, just a few extra code paths.
---

Anyone gonna answer??????
?????????????????????????ARG!

<font color=green>
*******
*K.I.S.S*
*(k)eep (I)t (S)imple (S)tupid*
*******
</font color=green>
January 15, 2002 12:49:02 PM

I did answer your post.

You people have way too much anger built up inside you to make believe I'm being mean. Reread it, please, and this time think "hey, this guy is giving me free information about computers." Remember, college isn't free!
In fact, this will be the last time I share information, as you people like to be ignorant and like to think you know it all. Clearly you do not if you think CISC and RISC are the same thing as single cycle and multi cycle. You do not appreciate knowledge from what I can see! Continue living your life in ignorance.

If you actually read it you would see I did answer your question. You asked if single cycle and multi cycle are like RISC and CISC. I answered no. If you were to read it you would find that it is completely different from what you thought, and I had to set it straight so you understand.

If you had read what I said: there are 3 cache implementations that Intel and AMD both use. Does that mean they are exactly the same? NO! Learn to read. Also, do you understand that for set-associative you can have 2-way, 4-way, or 16-way set associative? Understand now? I can't believe you actually went out of your way to check the cache. I hope you did it out of wanting to learn and not to try and prove me wrong about something you have no clue about.

You should appreciate free information, not scorn someone because they gave it to you!

<A HREF="http://www.anandtech.com/mysystemrig.html?id=9933" target="_new"> My Rig </A>
January 15, 2002 1:19:19 PM

I will try and respond to what you said. I just personally don't TOTALLY understand what you mean; do you think you can be a little bit more specific?

Is this what you mean??

Why don't software designers compile a STANDARD for SSE2, meaning the compiled version is no different whether it's on a Socket 462 AMD platform or a Socket 423/478 Intel Pentium 4 platform, so it uses the same code and executes it the same on each?

OR do you mean, why don't they take advantage of all the processor's x87 commands (FPU enhancements), which I personally don't understand? Example: the AMD Athlon XP supports MMX, Enhanced MMX, 3DNow!, Enhanced 3DNow!, and SSE.
Now, are you saying, why don't they compile something that utilizes ALL of these, so they ALL work at the same time? Theoretically it would then speed up by whatever percentage EACH of those gives over raw integer and FPU performance, as opposed to using just SSE, or just 3DNow!, MMX, etc.

If that is the case, I'd also like to know.
ANOTHER example...

Anyone here with an AXP can experiment with this:

In SiSoft Sandra you have the CPU benchmark, the Multimedia benchmark, etc.
Now, in the options for each benchmark you can enable/disable certain things, like SSE, Enhanced 3DNow!, etc. By default everything is enabled, and when you run the test, it reports the FPU/ALU score as using SSE. BUT if you uncheck SSE and run the test again, it will say it's using Enhanced 3DNow!, and you'll notice the SSE score is a LITTLE bit higher than the Enhanced 3DNow! score. Now, if you disable everything but SSE, the score won't change from the original SSE score, and vice versa for the Enhanced 3DNow! score when enabling/disabling MMX, etc. What I don't understand is why it can't use BOTH SSE & 3DNow!, because theoretically shouldn't that give a much higher score overall? Like, I know SSE and 3DNow! work almost the same, but they both have their own advantages, so if it utilized both, it should be better, shouldn't it?

Anyways, if anyone still follows me or can answer LoveGuRu's original question, please do =)

-MeTaL RoCkEr

My <font color=red>Z28</font color=red> can take your <font color=blue>P4</font color=blue> off the line!
January 15, 2002 1:22:23 PM

You know that thats all anyone will buy it for! =)

lol....j/k

-MeTaL RoCkEr

My <font color=red>Z28</font color=red> can take your <font color=blue>P4</font color=blue> off the line!
January 15, 2002 2:11:58 PM

lol, not me, pal, lol

Have any idea how fast my girlfriend would kick my ass? lol ... I can see it maybe for the groups who prefer to stay single.

But still, it would be cool!

<A HREF="http://www.anandtech.com/mysystemrig.html?id=9933" target="_new"> My Rig </A>
January 15, 2002 7:58:10 PM

Post deleted by Flyboy
January 15, 2002 8:53:49 PM

Well, kind of. An algorithm is a series of pseudo code steps to accomplish a task. An implementation is simply the actual code for the task. Cache design surely has algorithms behind it. However, what I was referring to was the actual task itself. Almost the same thing, but not quite. It's just one of those things where it looks to be the same but it isn't. Doesn't matter really.

You didn't get anything wrong. You just got the single cycle and multi cycle thing wrong. It truly doesn't matter, but I wanted to clear that up.

To be or not to be?



<A HREF="http://www.anandtech.com/mysystemrig.html?id=9933" target="_new"> My Rig </A>
January 15, 2002 9:03:24 PM

Well Flyboy, I won't comment on what is going on between you and Sk8er, but I will try to understand what you mean.

EARLIER you asked why the AMD chips have an L2 TLB and why the Intels don't, and you asked about the differences between the two, why your Celeron is like such and such and why your AMD is like such and such, correct?

Ok, well for one, you said you have an AMD 500MHz, and reading the WCPUID output you posted, it appears you have an original K7 (Slot A Athlon). Am I right?

Now, if this is true, I think I can explain a little bit. For one, the Athlon you have is Slot A, using the original K7 core, which has off-die L2 cache at ½ the clock speed. The Celeron has 128K of on-die, full-speed L2 cache. AND they're different cache designs.

Your AMD Athlon 500MHz chip is more comparable to the original P3, the Katmai core, which also had off-die ½-speed L2 cache. That's the only thing I can think of in that respect.

Now, the method the Athlon uses for decoding/sending/receiving instructions is ESSENTIALLY the same as its Intel counterparts, but it also does a few things differently, like the way instructions go through the decoders. Like I mentioned before, the K7 architecture was created with RISC in mind. You have to look into the details of the cache design on the K7. Tom's has a good article, in depth on everything related to the original K7. It was his first article ever on the AMD Athlon, and quite a good one at that. Check it out in the CPU Guide history section; the article should be dated sometime around August of 1999.

As far as set-associativity goes, every processor is different. As far as I can recall, the AMD Duron's L2 cache is 16-way set-associative, and the AMD Thunderbird core Athlon is like 12 or something like that.

Anyways, again, read up on a few past articles, both on AMD and Intel processors. It will help you learn more about this stuff, I can guarantee, more than we could by just sitting here telling you. Check it out =)


-MeTaL RoCkEr

My <font color=red>Z28</font color=red> can take your <font color=blue>P4</font color=blue> off the line!
January 15, 2002 9:04:22 PM

Thanks. Do you feel that either the implementation (differences b/w intel and AMD) or the algorithms may be responsible for the performance differences?

That's where you and I got mixed up I think. My original thought was the actual memory management algorithms used to decide which data to cache and replace. But your post on cache implementation also intrigued me. Now I'm wondering whether either (or both) of these is key to the performance issues between AMD and Intel.

But thanks for clearing up the single cycle vs. multi cycle thing.