I don't really see Conroe being bandwidth starved at all. At least not to the point of being uncompetitive. Just looking at Yonah, we see a mobile processor running on a 667MHz FSB that can meet the performance of an X2 at the same clock speed.
"But what about the bigger picture? What does our most recent look at the performance of Intel's Core Duo tell us about future Intel desktop performance? We continue to see that the Core Duo can offer, clock for clock, overall performance identical to that of AMD's Athlon 64 X2 - without the use of an on-die memory controller. The only remaining exception at this point appears to be 3D games, where the Athlon 64 X2 continues to do quite well, most likely due to its on-die memory controller."
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2648&p=14
And this is on a 667MHz FSB. Conroe will have a 1066MHz FSB, with Extreme Editions getting a 1333MHz FSB. With so much additional bandwidth available over Yonah, it's doubtful that Conroe or a dual processor Woodcrest would be bandwidth starved. Even in a 4-way Woodcrest system, with 4 processors on two 1333MHz FSBs, each processor would have about 667MHz to work with, the same as Yonah. There would be bandwidth restrictions if every processor were going full tilt, but a lot of that would be alleviated by the fact that Woodcrest would have double or more the L2 cache of Yonah. Now on 8-way or higher systems I'll admit Woodcrest would definitely be bandwidth restricted, although Intel is planning on implementing a 4 FSB design in those situations.
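For anyone who wants to check the numbers, here is a rough back-of-the-envelope sketch of the bus math above. It assumes a 64-bit (8 byte) wide FSB data path and treats the quoted MHz figures as effective transfer rates. The function names are just mine for illustration, and these are theoretical peaks, not measured throughput.

```python
# Assumption: the FSB data path is 64 bits (8 bytes) wide, and the quoted
# "MHz" figures are effective transfer rates (millions of transfers/second).
FSB_WIDTH_BYTES = 8


def peak_bandwidth_mb_s(effective_mhz):
    """Theoretical peak bandwidth of a single FSB in MB/s."""
    return effective_mhz * FSB_WIDTH_BYTES


def per_cpu_share_mhz(effective_mhz, num_buses, num_cpus):
    """Effective bus rate left for each CPU if all CPUs are equally busy."""
    return effective_mhz * num_buses / num_cpus


# Bus speeds quoted in the post.
buses = [("Yonah", 667), ("Conroe", 1066), ("Conroe EE / Woodcrest", 1333)]
for name, mhz in buses:
    print(f"{name}: {mhz}MHz FSB -> {peak_bandwidth_mb_s(mhz)} MB/s peak")

# 4-way Woodcrest: 4 processors spread over two 1333MHz buses.
print(per_cpu_share_mhz(1333, num_buses=2, num_cpus=4))
```

The last line works out to 666.5, which is where the "same as Yonah" figure comes from. It is a worst case that only applies when all four processors are hammering the bus at once.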
The problem with the FSB design is not so much bandwidth as latency. Even then, the relevance in real world situations depends on the circumstances. As mentioned in the Yonah review, FSB latency only hinders the processor in gaming. This actually wasn't so much of an issue for Dothan, since its large 2MB L2 cache with a very low 10 cycle latency helped buffer the FSB latency. The problem with Yonah was that the new shared cache required new algorithms that introduced latency. The cache was also asleep by default and had to be reawakened before use, further adding latency. With the total cache size kept the same, meaning a reduction per core, Yonah's performance couldn't help but fall, buffered only by the FSB increase from 533MHz to 667MHz.
Conroe and Woodcrest won't have as many problems. They will at least double the L2 cache, bringing the per core amounts in line with Dothan. The latency may also be reduced if Intel removes Yonah's power saving features. The caches don't need to be asleep by default, but could just be put to sleep when not in use, as in Dothan, which would help reduce the cache latency. FSB latency will also be reduced when moving to the desktop, as Yonah and Merom use a power savings optimized FSB. I may be wrong, but I thought the mobile FSBs used fewer lanes, which kept average throughput the same but reduced burst speeds. Those cut lanes may have just had to do with reduced power requirements though. In any case, with the FSB not needing to sleep most of the time, latency will also be reduced over Yonah. Latency can also be reduced by running synchronous RAM, which is why it was strange that all the reviews I looked at coupled Yonah's 667MHz FSB with 533MHz RAM. Intel will also be continuing to optimize their northbridge memory controller. The i975 had reduced latencies and improved performance over the i955, so performance gains are possible. Overall latency won't be reduced to match an onboard memory controller, but the gap can be closed.
FSB design could actually be a benefit of Intel's partnership with Apple. Apple's G5s have an FSB even faster than Intel's, topping out at 1.35GHz for the 2.7GHz G5. What's interesting is that Apple reaches this speed with only a dual pumped bus. If Intel adds their quad-pumped architecture to it, the resulting 2.7GHz FSB will be more than enough to satisfy Intel's needs. Apple's FSB is also bidirectional like HT, allowing transfers at 1.35GHz in each direction at the same time. Co-operation between Apple and Intel would remove the bandwidth constraints of the FSB, and bidirectional support would cut the latency by avoiding the wait for the bus to switch directions. Of course, this would require Intel to swallow their pride and ask Apple for help, but that shouldn't be too hard if Apple has already done the same to ask Intel for help.
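To show where the 2.7GHz figure comes from, here is a quick sketch of the pumping arithmetic. It assumes the 1.35GHz G5 bus is a 675MHz base clock moving 2 transfers per clock, and that quad-pumping simply doubles the transfers on that same base clock. Whether such a bus could actually be built is a separate question, and the function name is made up for illustration.

```python
def effective_rate_mhz(base_clock_mhz, transfers_per_clock):
    """Effective bus rate = base clock times transfers per clock cycle."""
    return base_clock_mhz * transfers_per_clock


# Assumption: a dual pumped 1.35GHz bus implies a 675MHz base clock.
g5_base_mhz = 1350 / 2

dual_pumped = effective_rate_mhz(g5_base_mhz, 2)  # the G5 bus as described
quad_pumped = effective_rate_mhz(g5_base_mhz, 4)  # hypothetical quad-pumped version

print(dual_pumped, quad_pumped)

# For comparison, the base clock behind a quad-pumped 1066MHz Intel FSB.
print(1066 / 4)
```

So the hypothetical bus is the G5's base clock with Intel's 4 transfers per clock. By the same math, a 1066MHz FSB sits on a base clock of only about 266MHz, which gives an idea of how much faster the G5's underlying clock is.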
And for slvr_phoenix, one of the major focuses of the Merom family architecture was correcting the FPU limitations. All the execution units have been redesigned. The Inquirer believes the new architecture will easily surpass K8 on integer calculations, which isn't so hard to believe given the Pentium M's integer performance, and will likely tie on FP calculations. The execution unit count for the Merom family is likely 3 complex (full) ALUs and 2 full FPUs.
A major limitation in Dothan was the fact that only 1 of the 3 decoders could process SSE instructions. This bottleneck was removed in Yonah by allowing all 3 decoders to process SSE instructions. Further improvement was made by extending micro-ops fusion to include SSE. The major bottleneck is actually not the number of FPUs but how instructions are processed. Currently on both Dothan and Yonah, the FPU (vector unit) is only 64 bits wide. This means that 128-bit SSE (4x32 bit) and SSE2 (2x64 bit) instructions can't fit and must be split into 64-bit pieces. That alone cuts performance in half. The problem gets worse when looking at multiplication. While it can be done at full speed for 32-bit SSE instructions, it only runs at half speed for 64-bit SSE2. This means that SSE performance is cut in half and SSE2 performance is as low as a quarter of its potential in the 1 FPU that Yonah has.
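The half and quarter figures above fall out of some simple arithmetic, sketched below. This is a toy model and not how the hardware is actually specified: it assumes throughput scales inversely with the number of passes a 128-bit operation needs through the vector unit, and that the SSE2 multiply penalty is a flat factor of one half. The function and parameter names are just for illustration.

```python
from fractions import Fraction


def relative_throughput(op_width_bits, unit_width_bits, multiply_rate=Fraction(1)):
    """Fraction of ideal throughput for one vector op on a narrower unit.

    Toy model: an op wider than the unit is split into equal passes, and
    multiply_rate scales the result for ops that run below full speed.
    """
    passes = -(-op_width_bits // unit_width_bits)  # ceiling division
    return Fraction(1, passes) * multiply_rate


# 128-bit SSE op on the 64-bit wide unit: two passes.
sse = relative_throughput(128, 64)

# 128-bit SSE2 multiply: two passes, each at half speed.
sse2_mul = relative_throughput(128, 64, multiply_rate=Fraction(1, 2))

print(sse, sse2_mul)

# Headroom a full speed 128-bit unit would recover under this model.
print(1 / sse, 1 / sse2_mul)
```

Under those assumptions SSE comes out at 1/2 and SSE2 multiplication at 1/4 of potential, so a full speed 128-bit unit would have 2 times and 4 times headroom respectively.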
Merom and family look to completely alleviate these problems. The FPU will be full speed 128-bit compatible, meaning up to a 2 times performance increase in SSE and 4 times in SSE2. These units will be fed by 4 full (complex) decoders. The number of ports has also increased from the current 2, although the specific number hasn't been confirmed. The minimum would be 3 universal ports with 2 shared with the FPUs, while 4 universal ports is also likely, which would indicate that another ALU has been added. Expanded micro-ops fusion and the addition of macro-ops fusion, along with various prefetch techniques, will also help ensure the execution units are operating at max efficiency.