clue69less

Splendid
Mar 2, 2006
3,622
0
22,780
http://www.xbitlabs.com/articles/cpu/display/core2duo-preview.html

OK, I've worked through the whole thing and wrote the following notes, then decided to share them so the more knowledgeable here can correct any mistakes I've made. I don't see anything there that's really new, but maybe some different ways of saying things. I'm no architecture expert, but these are the points that stick out to me the most:

-Driving force for change that brought about Core: Find the best balance between performance and power consumption.

-Processor performance is proportional to clock frequency x the number of instructions per clock cycle. (Core runs up to 4 instructions/cycle, and its execution pipeline is 14 stages long compared to 30 for the P4.)
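That relation can be sketched as a back-of-the-envelope calculation. The clock speeds are real shipping frequencies, but the sustained IPC values below are my own illustrative assumptions (the 4-wide figure is a peak decode/retire rate, not what either chip sustains):

```python
# performance ~ clock_frequency * instructions_per_clock (IPC)

def relative_perf(freq_ghz, sustained_ipc):
    return freq_ghz * sustained_ipc

core = relative_perf(2.4, 2.0)   # Core 2 at 2.4 GHz, assumed ~2 sustained IPC
p4   = relative_perf(3.8, 1.0)   # Pentium 4 at 3.8 GHz, assumed ~1 sustained IPC

print(core, p4)  # the lower-clocked chip comes out ahead on throughput
```

The point is just that a wide core at a modest clock can beat a narrow core at a high clock.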

-Reducing the number of operations needed to process a quantity of data effectively speeds up the CPU

-Power consumption is proportional to clock frequency x the square of Vcore x dynamic capacitance (C_dyn). C_dyn is proportional to the number of transistors and their switching activity during operation. The shorter execution pipeline gives a performance-per-watt advantage, and a 15% reduction in clock frequency (with the voltage drop it permits) can cut power by roughly half.
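A quick sketch of that relation. The 15% frequency cut is from the article; the ~23% voltage cut is my own assumption, chosen to show how the quadratic Vcore term is what actually halves the power:

```python
# P ~ f * Vcore^2 * C_dyn  (C_dyn ~ transistor count * switching activity)

def dynamic_power(freq, vcore, c_dyn=1.0):
    return freq * vcore**2 * c_dyn

baseline = dynamic_power(1.00, 1.00)
scaled   = dynamic_power(0.85, 0.77)   # -15% clock allows ~-23% Vcore (assumed)

print(scaled / baseline)   # roughly 0.50, i.e. about half the power
```

Frequency only enters linearly; it is the allowed voltage reduction, squared, that does most of the work.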

-The first Core CPUs will be dual core with 64KB L1 per core and a shared 2 or 4 MB L2 cache. Sharing the L2 cache has numerous advantages, including lowering the workload on system memory and the processor bus, which are both bottlenecks. Core controller logic has been added that allows L1 cache data to be exchanged through the shared L2, another efficiency boost. EM64T extensions will be supported. Additional decoder and execution units were added to allow wide dynamic execution: four decoders feed the out-of-order engine, which issues through six dispatch ports. Branch prediction units were updated and command buffers were made larger. Macrofusion technology has been added to the existing microfusion tech. SSE instruction processing has been sped up 2X by revising the SSE command system and speeding up the SIMD units.
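The macrofusion idea can be illustrated with a toy decoder. The common case is a compare immediately followed by a conditional branch being decoded as one fused micro-op; the fusible pairs below are a simplified stand-in, not the real fusion rules:

```python
# Toy model of macrofusion: fuse a compare/test with the branch that follows it.
FUSIBLE = {("cmp", "jne"), ("cmp", "je"), ("test", "jz")}

def decode_with_fusion(instrs):
    out, i = [], 0
    while i < len(instrs):
        if i + 1 < len(instrs) and (instrs[i], instrs[i + 1]) in FUSIBLE:
            out.append(instrs[i] + "+" + instrs[i + 1])  # one fused micro-op
            i += 2
        else:
            out.append(instrs[i])
            i += 1
    return out

stream = ["mov", "cmp", "jne", "add", "test", "jz"]
print(decode_with_fusion(stream))  # ['mov', 'cmp+jne', 'add', 'test+jz']
```

Six instructions become four micro-ops, which is exactly where the effective decode-width gain comes from.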

-Core has six data prefetch units - two linked to the shared L2 and two tied to the L1 of each core. I'd appreciate it if someone here would explain memory disambiguation to me. Any idea what the probability of a prediction error is with this kind of unit? The efficiency increases made possible by these processes seem logical, but I've always been a little queasy about comments like: "According to the accumulated stats, the data prefetch units try to load the data into the processor cache even before the corresponding request is made." Sometimes the cart needs to go ahead of the horse, it seems.
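The "cart before the horse" behavior is less mysterious as a sketch. A simple stride prefetcher watches the addresses a load stream actually touches and, once the same stride repeats, requests the next line before the program asks for it (this is a minimal model, not Intel's actual predictor):

```python
# Minimal stride prefetcher: confirm a stride, then fetch one line ahead.
class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.stride = None
        self.prefetched = []          # addresses requested ahead of demand

    def observe(self, addr):
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.stride:  # same stride seen twice: pattern confirmed
                self.prefetched.append(addr + stride)
            self.stride = stride
        self.last_addr = addr

pf = StridePrefetcher()
for a in [100, 164, 228, 292]:        # a steady 64-byte-stride walk
    pf.observe(a)
print(pf.prefetched)  # [292, 356]
```

If the pattern breaks, the only cost is a wasted cache fill, which is why the gamble usually pays off.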

The intelligent power capability discussion made me think back to when Cadillac developed an 8-cylinder engine that would shut down two or four cylinders when demand was low, and to gasoline engines that shut off at stop lights. This kind of strategy will probably infuse every energy-related technology as energy costs climb. I like the idea of having an array of thermal diodes to monitor CPU temperature.

I've read some speculative stuff about K8L, but I'd be interested to hear from some folks on this forum about whether (and how) you think K8L will attempt to bridge the gap Intel has made, for example one 128-bit command per clock for K8 vs. 3 for Core. What about Intel's supposed decoding logic advantage - do you believe there will be anything really new for AMD in this area with K8L? What I'm struggling with is trying to understand specifically what new Core technologies have made possible the huge performance increases being reported. I realize that the AMD design philosophy is very different with the onboard memory controller, etc., but I have not heard anything convincing that the AMD arch includes any real handicaps to further development. Has the AMD design run its course, or can it grow? Jack has talked about transistor quality as seen in cross-sectional SEM images. This makes me wonder if AMD has reached the limits of its fab processing technology. From my days analyzing processor microstructure, I know that these issues are very real. My mind drifts into a future where bus speeds are well beyond the current 1066MHz, with higher efficiencies, laptops running faster than any current desktop, zzzz....
 

BaronMatrix

Splendid
Dec 14, 2005
6,655
0
25,790
http://www.xbitlabs.com/articles/cpu/display/core2duo-preview.html



I think I can answer most of your questions after reading the link also.

Firstly, memory disambiguation is a coined term that means exactly what it says: disambiguation. Something is ambiguous if it can't be "strongly defined" in the space it resides in. What the technique does is verify that

1. all reordered instructions are suitably marked with a few bits per instruction stream.

2. all memory locations are "indexed" by the prefetch mechanism so that when groups of instructions are passing through the functional units, a load that is not dependent upon a store can be dispatched immediately, and a bit marker can be applied to determine at what point a load CAN safely happen before a store. There is only a conflict if the load is dependent upon the "stored" result of a parallel instruction which may complete first.
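The dependence check at the heart of this can be sketched very simply: a load may issue ahead of older stores only if none of those stores writes the load's address. (Real disambiguation hardware *predicts* this before the store addresses are even known and verifies afterwards; the sketch below only shows the verification step, with plain ints standing in for cache-line addresses.)

```python
# Can a load safely bypass the older, still-pending stores ahead of it?
def load_can_bypass(older_store_addrs, load_addr):
    """True if the load is independent of every older pending store."""
    return all(store_addr != load_addr for store_addr in older_store_addrs)

pending_stores = [0x1000, 0x2040]
print(load_can_bypass(pending_stores, 0x3000))  # True  -> dispatch the load early
print(load_can_bypass(pending_stores, 0x1000))  # False -> wait, or replay on conflict
```

When the prediction is wrong (the load bypassed a store it actually depended on), the load and everything after it must be replayed, which is where the misprediction cost comes from.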

This technique should match at least the 95% (I think it may be closer to 97%) prediction rate that the Athlon does - previous Intel cores were averaging 90%. This is aided by the shortened pipeline: flushing and recalculating after a missed prediction costs fewer cycles with 14 stages than with 30.
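The interaction between accuracy and pipeline depth is easy to put in numbers. Treating the flush penalty as roughly the pipeline depth (a simplification), the expected cycles lost per branch are:

```python
# expected cycles lost per branch ~ (1 - accuracy) * flush_penalty,
# with the penalty approximated by the pipeline depth
def cost_per_branch(accuracy, pipeline_depth):
    return (1 - accuracy) * pipeline_depth

core = cost_per_branch(0.95, 14)   # ~0.7 cycles lost per branch
p4   = cost_per_branch(0.90, 30)   # ~3.0 cycles lost per branch
print(core, p4)
```

So the short pipeline and the better predictor compound: the 14-stage design pays roughly a quarter of the per-branch penalty.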


As I said, though, this looks a LOT like the patent material from one of the schools I read about - I believe it was Princeton. This type of tech can't really be given a "name" since it's just how the CPU functions.


As far as single-cycle SSE128 goes, K8L does have that planned (X2, I believe). Intel would NOT have been able to do this on 90nm, and AMD can't do K8L on 90nm. Their 65nm process is supposedly on schedule for Dec. This means that they can experiment with Brisbane in terms of transistor density for each unit.
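The throughput difference behind the "single-cycle SSE128" point is simple to show: K8 cracks each 128-bit SSE op into two 64-bit halves for its 64-bit SIMD units, while a full-width datapath issues it as one op. The cycle counts below are an idealized sketch (no scheduling or memory effects):

```python
# Idealized cycle count for a stream of 128-bit SSE ops on a given datapath width.
def sse_cycles(num_128bit_ops, datapath_bits):
    uops_per_op = 128 // datapath_bits   # 2 halves on a 64-bit unit, 1 at full width
    return num_128bit_ops * uops_per_op

print(sse_cycles(1000, 64))    # 2000 cycles: each op split into two halves
print(sse_cycles(1000, 128))   # 1000 cycles: one full-width op per cycle
```

That factor of two is the gap K8L's widened units are meant to close.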

Adding the "scratch pad" functinoality fromthe patent and speeding up the functional units would allow for extra superscalar execution for the AMD version of "advanced speculation."

AMD has stated they have K8L taped out - they don't have the money to have extra units floating around as Intel does - but they still have to make K8. I think that once Socket F comes out there will be more emphasis on Rev G and K8L.


Either way they have time with 80% of retail - excluding Dell.