The initial block diagram from Dailytech seemed to indicate that there were three ALUs and only one FPU. It would have been amazing if that single FPU were responsible for the gaming and encoding performance the Conroe benchmarks seem to indicate, especially considering that K8 uses 3 FPUs. However, the fact that each execution unit had its own port seemed inefficient.
In actuality, Core has three execution dispatch ports, which feed a total of three 128-bit SSE units, two 128-bit floating point units, and three 64-bit integer units.
The new architecture is much more efficient, with all the execution units doubling in width: 64-bit ALUs versus 32-bit in Yonah, and 128-bit FPU and SSE units versus 64-bit in Yonah. The latter results in a twofold increase in performance, as 128-bit SSE instructions previously had to be split in half and executed consecutively. The major improvement is that SSE2 multiplication (2x64-bit) now operates 4 times faster. Previously, the SSE2 instruction had to be split into its 64-bit halves, cutting speed in half, and 64-bit multiplication itself ran at half speed, cutting performance in half yet again. I wonder if Core has a dedicated integer multiplier like the one added to Prescott?
Originally it was reported that Core had 4 symmetric decoders, which would have implied 4 complex ones, but in actuality there is only 1 complex and 3 simple. This would have been crippling, but of course Intel cheated starting with Yonah by implementing micro-op fusion across all instructions, the major benefit being that SSE instructions are no longer limited to the complex decoder. Interestingly, with macro-op fusion the decoder width can be up to 4+1, with one of the 4 decoder slots carrying 2 instructions fused into a single uop.
Prefetching and branch prediction have of course also been improved, and memory disambiguation added. The power-saving features are also quite advanced, with the FPU capable of shutting half of itself down if it's only processing a 64-bit instruction. SSE4 looks to be decidedly unimpressive though, since it was originally designed for Tejas, Prescott's original successor. (Cringe.) Intel looks to release an SSE5 custom-designed for Core next year, as the 45nm transition gives them more room and time to refresh the architecture.
Just to warn you, I'm going off on a tangent here.
Based on the architecture, I'm still inclined to feel that Core would have worked well with Hyperthreading. The fact that Core has a short 14-stage pipeline is no reason not to, because Power5 also uses SMT and it only has a 16-stage pipeline. Itanium has a 10-stage pipeline and it uses CMT. The major reason HT theoretically works better with Prescott was that it would have been extremely wasteful to wait on a stalled thread, so HT would allow another thread to keep running. Since stalls occurred more often on Prescott, as its branch prediction isn't as advanced as the Pentium M's or Core's, improvements would be more readily apparent.
With the shorter pipeline and better branch prediction of the Core architecture, the advantage of HT would not be as a latency-hiding technique. The futility of Intel incorporating 4 decoders has been pointed out, because average code only sustains about 2.5 instructions per cycle. However, this is precisely where Core's advantage can come in. For example, if there are 2 threads each averaging 2 instructions per cycle, then the 4 decoders can handle the two threads in parallel. This is of course better than Netburst, which only had 3 decoders, so the options for combining threads were more limited. With macro-op fusion, Core can even handle up to 5 instructions at a time under certain circumstances, which is exactly twice the average 2.5 instructions per cycle, allowing true parallelism within a core. Of course, this assumes there are sufficient execution units to accommodate those instructions.
Again, the Core architecture is very wide, with 3 ALUs, 2 FPUs, and 3 SSE units. Core's major advantage is that it has 3 execution dispatch ports compared to Netburst's two. This means that Core can consistently dispatch three instructions per cycle, while Netburst on average could only do 2. The only advantage Netburst has is that it can theoretically dispatch 4 instructions if they fit into the 2 double-pumped fast ALUs, which isn't very common. Core also has the advantage in FP and SSE calculations, because the 128-bit width allows any such instruction to be completed in one cycle, while Netburst's 64-bit width ties up two cycles, making it less suited for HT.
One of the concerns about implementing HT is that in theory one thread would hog resources from the second thread. This doesn't appear to be as much of a concern for Core. All the resources in the architecture have been expanded to sustain a 4-instruction width, and HT ensures that there are actually 4 instructions moving through the processor. A lot of Netburst's performance decrease was due to its Replay function, which can let one thread hog valuable resources (especially execution units) as it loops waiting for a cache miss to be resolved. Core, however, doesn't appear to have a Replay system, as its shorter pipeline doesn't make one necessary.

In Netburst, performance can also degrade through memory thrashing. This again is less likely for Core because of the larger caches, both L1 and L2. While Presler also had 4MB of L2 cache, the fact that Core's is shared means that replication is eliminated, maximizing space. Netburst's L2 cache also wasn't SMT-aware, allowing one thread to consume the entire cache if it wanted, starving the other thread. The beauty of Core is that it already has a shared cache and dynamically allocates space to each core. It could easily do the same for each logical core, preventing thrashing from occurring. The lower latency of Core's L2 cache and the higher-speed FSB also reduce performance losses from memory conflicts.
The other major concern with HT is of course heat. When a processor is working it'll obviously generate heat, but the real concern is heat generated for no increase in performance. The heat wasted by Netburst's HT is again a byproduct of Replay constantly operating the pipeline even when nothing useful is produced, as the looping thread waits for a cache miss to be resolved. Again, Core doesn't have Replay. The Core architecture and the newer 65nm process mean that power consumption and heat aren't that drastic at full load anyway, and it'd certainly be worth it if the chip is doing legitimate work.
With the width of the Core architecture, HT would certainly be useful, since the likelihood of actually sustaining two threads per core is a lot higher. This would mean that the performance of 4 logical cores comes a lot closer to that of 4 real cores. Supposedly, Vista is also supposed to improve support for HT, so the risk of bad scheduling putting two threads on the same core should be reduced. (That's if you trust Microsoft of course, especially considering that battery life issue that's quietly gone away.) In any case, even if two similar threads were given to one real core, the architecture has more execution units, reducing the likelihood of conflict. This is especially the case with SSE instructions, where Core can process 3 per cycle while Netburst could only do 1, and that one was shared with the FPU. Similarly, FP and SSE moves can now be performed by any of the 3 SSE units, compared to the single FP/SSE unit in Netburst.
Overall, the best time to implement HT would be with the switch to 45nm. The die shrink would give space for the HT-related transistors, avoid the power and heat concerns, and, with the planned L2 cache increase to 6MB, prevent memory conflict concerns. 45nm would also allow an extra FPU to be added to port 2, balancing out the system and reducing execution unit conflicts.
They got a laugh outta me with
"Imagine a program that goes through a list of numbers and adds 1 to each entry; since this is the year 2006, the program was written poorly and the store addresses are unknown."