
Will Intel be able to increase cache size?

August 11, 2002 5:05:36 PM

Any increase in cache size increases latency and brings a whole bunch of other trouble.
So with Prescott, will Intel increase cache size?
For the L2, everything is almost perfect, so there's not much to say there. Now the L1 - some may call it a short-cut or stripped-down cache. If you were Intel, how would you choose between the L1, the AGU, the FPU, the decoder, and the trace cache output?

The day I meet a goth queen who tells me Intel sucks, I'll turn into a lemming to fill her need in hardware.
August 11, 2002 6:36:03 PM

Rumors are that Prescott will have 1 MB of L2 cache and double (quadruple?) the L1 data cache (and I'm hoping trace cache too). They will be double-pumping the L1 and hopefully the trace cache as well, so it'll run at twice the core clock. This will hopefully make up for the access latency. Of course, 8 KB of L1 data cache is too small anyway; the lower latency is in no way worth the small size. You can only fit 1 FP data type in there, and that can't help SSE2 vectors.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
August 11, 2002 9:02:06 PM

Technically speaking, a double-pumped Trace Cache means that in each normal CPU clock it can spit out 6 uOPS instead of 3, which would be very nice, no? I mean, it's not like you can always get the full IPC total each time, but it seems Intel wants to do as little decoding as possible, thereby reducing the pipeline length penalties... JA?

--
You are about to witness crazy, mindless, eye-gagging, naive programming, welcome to FOX!
August 12, 2002 12:07:18 AM

Well, it really depends on how Intel "double-pumps" its L1 cache (and hopefully trace cache as well). If they use the same method as their double-pumped ALUs (in which the ALUs actually run at twice the frequency), then it'll help both throughput and latency. If they use some trick such as sending more information per wave, then latency won't improve much. We'll just have to wait and see.
As for the total number of micro-ops issued per clock, yeah, 6 would help a lot. Although Intel should consider putting in two full-fledged decoding units (since Hyper-Threading will provide 2 threads of x86 instructions). That would help significantly.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
August 12, 2002 12:34:40 AM

But then what good would the Trace Cache do if 2 decoders have a good chance of delivering the max IPC?

--
You are about to witness crazy, mindless, eye-gagging, naive programming, welcome to FOX!
August 12, 2002 12:39:03 AM

Well, x86 instructions usually decode into about 2 micro-ops each. And the trace cache is a cache, after all. Its purpose is not to speed up streaming instructions but rather to speed up access to instructions that are repeated over and over again (such as in video processing applications).
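A hedged analogy (not the P4's actual internals): the trace cache behaves like a memo table sitting behind the decoder, so instructions in a hot loop only pay the decode cost on their first pass. All names below are made up for illustration:

```python
# Toy sketch of a decoded-instruction cache. "decode" is a stand-in
# for the real x86 decoder, not an implementation of it.
decoded = {}  # hypothetical trace cache: x86 instruction -> micro-ops

def decode(insn):
    # Pretend each x86 instruction expands into ~2 micro-ops,
    # as noted in the post above.
    return [insn + ".uop0", insn + ".uop1"]

def fetch_uops(insn):
    if insn in decoded:        # cache hit: skip the decoder entirely
        return decoded[insn]
    uops = decode(insn)        # slow path through the decoder
    decoded[insn] = uops
    return uops

# A repeated loop body hits the cache on every pass after the first.
for _ in range(3):
    fetch_uops("addps")
```

The point being that the savings only show up for code that repeats; a one-shot streaming instruction goes through the decoder either way.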

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
August 12, 2002 12:47:11 AM

Yes, but given that it's a cache, what exactly or how much are we saving by storing decoded instructions for a long pipeline, compared to the L1 instruction cache of a P6 or K7 core?

--
You are about to witness crazy, mindless, eye-gagging, naive programming, welcome to FOX!
August 12, 2002 2:21:46 PM

Well, considering that the P4 has only 1 decoder as opposed to the Athlon's 3, and it's not performing 1/3 as well per clock, I'd say the decoding isn't a limitation (or at least, not a major one). I'd say the trace cache is being used quite a lot. Of course, it depends on the code. I still say expanding the trace cache beyond 12k micro-ops would be very useful, as not many realistically coded loops are ever that small.

"We are Microsoft, resistance is futile." - Bill Gates, 2015.
August 12, 2002 2:37:07 PM

Dude, 8 KB (8 * 8 * 1024 bits) is a lot of double-precision (64-bit) floating-point data types.
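A quick back-of-the-envelope check of the figure above (just arithmetic, assuming only an 8 KB cache and 8-byte doubles):

```python
# How many 64-bit doubles fit in an 8 KB L1 data cache?
cache_bytes = 8 * 1024    # 8 KB
double_bytes = 8          # one double-precision float = 64 bits
print(cache_bytes // double_bytes)  # 1024 doubles
```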

The Trace Cache currently delivers 6 uops per cycle at half the clock speed of the processor. With the average x86 program sustaining ~1.5 uops/cycle of instruction-level parallelism, Intel may or may not want to change that - the performance boost would not be much.
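The issue-width-vs-parallelism point above can be sketched as a toy min() model; the 1.5 uops/cycle figure is the poster's estimate, not a measured number:

```python
# Sustained throughput is bounded by the narrower of the two:
# how many uops the front end can issue per cycle, and how much
# instruction-level parallelism the program itself exposes.
def sustained_uops(issue_width, program_ilp):
    return min(issue_width, program_ilp)

print(sustained_uops(3, 1.5))  # 1.5: a 3-wide issue already exceeds average ILP
print(sustained_uops(6, 1.5))  # 1.5: doubling the width buys nothing for this workload
```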

Quote:

the lower latency is in no way worth the small size.

Intel purposefully designed the tiny L1 so it could fit within the tiny timing budget and run at super high frequency. Increasing the size of the cache while making sure it would still fit within the same timing budget, without increasing the latency by another cycle, would presumably be incredibly difficult if not downright impossible. Someone somewhere might have decided that an 85% hit rate with a latency of 2 cycles might be better than a 92% hit rate with a latency of 3 cycles, and he/she was in a better position to make that choice than you are, so you should refrain from such statements.
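The hit-rate-vs-latency tradeoff above can be made concrete with the standard average-access-time formula; the 10-cycle miss penalty below is an assumed figure for illustration, not an Intel spec:

```python
# Average memory access time: hit latency plus the expected miss cost.
def amat(hit_rate, hit_latency, miss_penalty):
    return hit_latency + (1 - hit_rate) * miss_penalty

miss_penalty = 10  # cycles to reach L2 on a miss -- assumed, not measured

small_fast = amat(0.85, 2, miss_penalty)
big_slow = amat(0.92, 3, miss_penalty)
print(small_fast, big_slow)  # the small, fast cache wins under this penalty
```

Under this assumed penalty the smaller cache comes out ahead (~3.5 vs ~3.8 cycles); a longer miss penalty shifts the balance back toward the bigger cache.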



This post is best viewed with common sense enabled
Edited by IIB on 08/12/02 05:55 PM.
August 12, 2002 7:23:17 PM

Quote:
Intel purposefully designed the tiny L1 so it could fit within the tiny timing budget and run at super high frequency. Increasing the size of the cache while making sure it would still fit within the same timing budget, without increasing the latency by another cycle, would presumably be incredibly difficult if not downright impossible. Someone somewhere might have decided that an 85% hit rate with a latency of 2 cycles might be better than a 92% hit rate with a latency of 3 cycles, and he/she was in a better position to make that choice than you are, so you should refrain from such statements.



Did you take that from another forum?

The day I meet a goth queen who tells me Intel sucks, I'll turn into a lemming to fill her need in hardware.
August 12, 2002 10:59:29 PM

Quote:
Dude, 8 KB (8 * 8 * 1024 bits) is a lot of double-precision (64-bit) floating-point data types.

Damn, I was thinking bytes, not KB. I guess looking at it that way, it can hold a hefty amount of data.

Quote:
The Trace Cache currently delivers 6 uops per cycle at half the clock speed of the processor. With the average x86 program sustaining ~1.5 uops/cycle of instruction-level parallelism, Intel may or may not want to change that - the performance boost would not be much.

Not in the current P4 it isn't. The tagged micro-ops can be issued every clock, but only 3 micro-ops per clock. Unless, of course, it has to go through the micro-code ROM.
As for how many micro-ops the average x86 program can sustain, I'll have to disagree. Judging from the Athlon, I'd say 3 x86 instructions per clock would be the sweet spot as far as parallelism goes. Now, the reason that wouldn't matter anyway is that the only time the micro-ops in the trace cache help is on a cache hit. And after the first time an instruction is used, it is given a trace tag to help out-of-order execution. This provides much greater parallelism. I would say that 3 micro-ops per clock is definitely somewhat of a limitation.

Quote:
Intel purposefully designed the tiny L1 so it could fit within the tiny timing budget and run at super high frequency. Increasing the size of the cache while making sure that it would still fit within the same timing budget, without increasing the latency by another cycle would presumably be incredibly difficult if not downright impossible. Someone somewhere might have decided that an 85% hit rate with a latency of 2 cycles might be better than a 92% hit rate with a latency of 3 cycles. and he/she was at a better possetion to make that choise then you are. so you should refain from such statements.

I was speaking of the trace cache mostly. And also, I was under the impression I could express my opinions as I see fit. Unless you can prove to me that expanding the trace cache by an extra 8k micro-ops would produce a latency of 3 cycles (which I don't think it would) and that it would provide worse overall performance, I highly doubt my opinions are somehow invalid.
And btw, it's not just about an 85% hit rate vs. 92%; it's something a little more dramatic than that. Your average loop in a normal x86 program is pretty big. Certainly a great number of them are beyond 12k micro-ops (or at least, the equivalent of that in x86 instructions). And I'm willing to bet that the ratio of heavily used loops (or any other type of repeatable code) that can't fit inside the trace cache to those that can is way more than 1:4. So we're not talking about an 80% hit rate.
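For reference, here is how the 1:4 figure maps to a hit rate (pure arithmetic; the ratio itself is the poster's guess, not a measurement):

```python
# If 1 in every 5 heavily used loops misses the trace cache
# (a miss:hit ratio of 1:4), the hit rate is 4/5.
misses, hits = 1, 4
print(hits / (hits + misses))  # 0.8
```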

"We are Microsoft, resistance is futile." - Bill Gates, 2015.