I think I figured all this out last week. Here is how I summarized it for a colleague. How close did I come to your view of the way things really are?
Several months ago (November?) - way back in my PC knowledge infancy - I read that AMD was not going to be able to take advantage of DDR because the memory bandwidth was the same as the processor (front-side bus) bandwidth. I didn't understand the author's gist and asked about it as well as I could phrase it at that time. I got an answer like 'those numbers are just peak burst rates - they don't mean that much - it doesn't matter'.
That would be a pretty good answer to a question about ATA100 vs ATA33, but I now see what the guy was saying, and it is borne out by all the testing/reviews that show DDR does not deliver on its promise (2x SDRAM), while Rambus does deliver when code is optimized to take advantage of the P4. It's amazing how much you can learn in a few months - actually just 6-10 articles if you don't count the ones that don't help.
We were touching on the topic today with the question of abandoning the C chip when you go with SDR instead of DDR. It's easier to see if you convert the frequency numbers into bytes per unit of time; see below. [I am running with your statement that the interface between the northbridge and the CPU actually runs at 266 MHz - 'something' in the system has to run at 266 to keep up with DDR]
Bottom line, with 133 MHz DDR, you get 2100 MBps and that's also the capacity of the FSB. I assume the FSB is isolated from the processor core inside the CPU chip, so there are still processor cycles while the FSB is bogged down. If not, then the CPU has to stop to access memory. Either way, if the memory runs wide open, the PCI & AGP get nothing.
It's kind of analogous to the LAN hub vs switch situation: 200 Mbps (100 each way, full duplex) at each port of a switch instead of 100/n Mbps at each of n hub ports.
Today's northbridge is a hub. If it were a switch, it could/would run the PCI, AGP & DDR buses concurrently, and then the system would fully benefit from the RAM speed.
Since it isn't and doesn't, there will be intervals where the processor waits for somebody else to get his two bits in. This is why we only see 10-20% improvements with DDR over SDR (216 vs 236 [about 10%], in one of Tom's reviews). The processor can't USE the available data rate. In P4 systems with the "400 MHz" bus, even with dual banks of RDRAM, there is FSB bandwidth left over for the rest of the system - or else the point of tie-up is 1.5x (400 vs 266) farther down the frequency pipe, so performance is better.
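Going back to the hub vs switch comparison, here is a minimal sketch of the per-port arithmetic. It only restates the numbers in the analogy (100 Mbps links, a hub shared by n active ports, a full-duplex switch); the function names are made up for illustration, and it says nothing about how a real northbridge arbitrates.

```python
# Hub vs. switch, per-port throughput, using the figures from the analogy.
LINK_MBPS = 100  # nominal link speed assumed in the post

def hub_share(n_active_ports):
    """A hub is one shared medium: n active ports split the link rate."""
    return LINK_MBPS / n_active_ports

def switch_port():
    """A full-duplex switch port gets the link rate in each direction."""
    return 2 * LINK_MBPS

for n in (1, 2, 4, 8):
    print(f"{n} active ports: hub {hub_share(n):.1f} Mbps each, "
          f"switch {switch_port()} Mbps each")
```

The point of the analogy is the shape of the two functions: the hub figure shrinks as more devices want the bus, the switch figure does not.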
What does AMD need to capitalize on DDR:
RAM    8 Bytes @ 266 MHz  or 2100 Bytes per microsecond (double data rate)
PCI    4 Bytes @  66 MHz  or  264 Bytes per microsecond (half word width)
AGP    8 Bytes @ 264 MHz  or 2112 Bytes per microsecond (quad-pumped for 4x AGP)
total  4476 Bytes per microsecond for 4x AGP
total  3420 Bytes per microsecond for 2x AGP
total  2892 Bytes per microsecond for 1x AGP
(More than these, if I left out some significant resource on the FSB.)
If this 'new AMD chip' sticks with a 64-bit (8-byte) width, it needs 4476/8 = about 560 MHz - close to P4's 400 MHz!!!
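The budget above can be redone as a quick sketch. The widths and clocks are the figures from my table, taken as-is (I have not checked them against the bus specs), RAM is entered as the rounded 2100 rather than 8 x 266 = 2128, and the names `bytes_per_us` and `fsb_demand` are made up for illustration.

```python
# Peak bandwidth budget from the table. 1 byte per microsecond == 1 MBps.
# All widths and clocks are the post's own assumptions, not verified specs.

def bytes_per_us(width_bytes, mhz):
    """Peak rate: bytes per transfer times transfers per microsecond."""
    return width_bytes * mhz

RAM = 2100                     # rounded from 8 * 266 = 2128, as in the table
PCI = bytes_per_us(4, 66)      # 264
AGP_1X = bytes_per_us(8, 66)   # 528, using the table's 8-byte AGP width

def fsb_demand(agp_multiplier):
    """Total peak demand if RAM, PCI and AGP all ran flat out at once."""
    return RAM + PCI + AGP_1X * agp_multiplier

for mult in (4, 2, 1):
    total = fsb_demand(mult)
    # Clock an 8-byte (64-bit) FSB would need in order to carry that total.
    print(f"{mult}x AGP: {total} bytes/us, needs {total / 8} MHz at 64 bits")
```

These are worst-case peaks added together, so they are an upper bound on what the FSB would ever be asked to carry, not a measurement.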
So what is P4 actually doing:
400*8 = 3200 MBps (MBps and bytes per microsecond are the same thing). Take away 2100 for RAM and 264 for PCI, and Intel has 836 bytes per microsecond left for AGP & everybody else.
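The same subtraction as a small sketch, with the 266 MHz Athlon bus run through it for contrast. The 2100 and 264 figures are the ones I used above, not checked against specs, and `leftover` is just an illustrative name.

```python
# What is left on the FSB after RAM and PCI take their peak share.
# Figures are the post's assumptions; 1 byte per microsecond == 1 MBps.
FSB_WIDTH_BYTES = 8   # 64-bit data path
RAM = 2100            # figure used in the post
PCI = 264             # 4 bytes @ 66 MHz

def leftover(fsb_mhz, *consumers):
    """FSB peak capacity minus the peak demand of the listed consumers."""
    return fsb_mhz * FSB_WIDTH_BYTES - sum(consumers)

print("P4 @ 400 MHz:    ", leftover(400, RAM, PCI))
# 266 * 8 is 2128, which the post rounds to 2100.
print("Athlon @ 266 MHz:", leftover(266, RAM, PCI))
```

A positive result means there is room for AGP and everything else; a negative one means RAM and PCI alone can already oversubscribe the bus.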
This is what the unknown guy I first read was trying to say. AMD will need a new processor to take advantage of DDR. It will need a 4000-5000 MBps FSB. [that's the sentence he left out of his article] Is this beast in AMD's Hammer line? I haven't the foggiest. But at least, now, I know where we are and why.
Tom wonders why P4 excels at Quake 3 (and SSE2 'optimized' MPEG4 & nothing else). I'll bet it's because that program has some tight loops, where the game runs from the cache and memory is only accessed for data. That's what the 'streamlined' P4 is designed to do. Intel said that the Flask MPEG4 conversion could be optimized more if it was rewritten with the P4 in mind. I'll bet it could: say, limit the processing to loops that fit in the cache and run one loop over a whole block of data. Then load another loop and process the same block again. Do it again ... until that block is complete. Then repeat the process for another block of data. That way you might load actual processing code into the CPU once! per data block instead of once per pixel. Even if there were 50-100% code overhead in breaking the process into such small jobs, there might still be orders of magnitude improvement in the overall, hours-long job.
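To make the idea concrete, here is a rough sketch of the two loop orders. The three stages are hypothetical stand-ins, not Flask's real MPEG4 steps, and the `switches` counter is only a crude proxy for "how often a different piece of code has to be brought in"; Python itself will not show the cache effect.

```python
# Per-pixel order vs. blocked order. Stages are hypothetical placeholders.

def stage_a(x): return x + 1
def stage_b(x): return x * 2
def stage_c(x): return x - 3

STAGES = (stage_a, stage_b, stage_c)

def per_pixel(data):
    """Run every stage on one pixel before moving on to the next pixel."""
    out, switches = [], 0
    for x in data:
        for stage in STAGES:
            x = stage(x)
            switches += 1        # a different stage runs at every step
        out.append(x)
    return out, switches

def per_block(data, block_size):
    """Sweep one stage over a whole block, then the next stage, and so on."""
    out, switches = [], 0
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        for stage in STAGES:
            block = [stage(x) for x in block]
            switches += 1        # one stage change per block, not per pixel
        out.extend(block)
    return out, switches

data = list(range(1000))
result_pixel, switches_pixel = per_pixel(data)
result_block, switches_block = per_block(data, block_size=100)
print(result_pixel == result_block, switches_pixel, switches_block)
```

Both orders give the same output; only the number of stage changes differs, and it drops by a factor equal to the block size. Whether that turns into real speed depends on the block and the loop both fitting in cache, which this sketch does not model.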
If this last thought is true, imagine Intel's frustration at having such a powerhouse on the market and no field where it can run....
The same thing could be done with the AMD processor as the target, but since Athlon's instruction cache holds undecoded instructions and P4's trace cache holds already-decoded micro-ops, P4 would still be faster, even if the loops were the same, since there would be no latency for instruction decode.