Two years ago, Intel pulled off a coup with the introduction of its Conroe architecture, which surfaced as the Core 2 Duo and Core 2 Quad. With this move, the company won back the performance crown after losing a bit of favor in the debacle that was its Pentium 4 "Prescott" design. At that time, Intel announced an ambitious plan to return to evolving its processor architectures at a rapid pace, as it had done in the mid-1990s. The first phase of the plan was the release of a “refresh” of the architecture 12 months after its introduction, to take advantage of progress in fabrication processes. That was done with Penryn. Then a whole new architecture was set to arrive 24 months later, with the code name Nehalem. That new architecture is the subject of this article.
The Conroe architecture offered first-rate performance and very reasonable power consumption, but it was far from perfect. Admittedly, the conditions under which it was developed weren’t ideal. When Intel realized its Pentium 4 was a dead end, it had to reinvent an architecture in a hurry, something that’s far from easy for a company the size of Intel. The team of engineers in Haifa, Israel, which until then had been responsible for mobile architectures, was suddenly charged with providing a design that would power the entire new line of Intel processors. It was a challenging task for the team, which now bore Intel’s future on its shoulders. Given those conditions, with the tight schedule they had to stick to and the pressure they were under, the results that the Intel engineers achieved are remarkable. The situation also explains why the team had to make some compromises.
Although it was a serious reworking of the Pentium M, the Conroe architecture still sometimes betrayed its mobile roots. For one thing, the architecture wasn’t really modular. It had to cover the entire Intel range, from notebooks to servers, but in practice it was virtually the same chip in each case; the only place for variation was in the L2 cache memory. The architecture was also clearly designed to be dual-core, and moving to a quad-core version required the same kind of trick that Intel had resorted to for its first dual-core processors: two dies in a single package. The presence of the FSB also hampered the development of configurations using several processors, since it was a bottleneck in terms of memory access. And a final little giveaway: macro-ops fusion, one of the new features introduced with the Conroe architecture, which combines two x86 instructions into a single one, didn’t work in 64-bit mode, the standard operating mode for servers.
These compromises were understandable two years ago, but today Intel can no longer justify them, especially when faced with its rival AMD, whose Opteron processor is still a compelling play for enterprise environments. With Nehalem, Intel needed to remedy its last weaknesses by designing a modular architecture that could adapt to the differing needs of all three major markets: mobile, desktop and server.

I regard being late as a quality seal really. No point being first, if your info is only as credible as stuff on inquirer. Better be last, but be sure what you write is correct.
Perhaps, if you count being translated from French.
Nice article, good depth, well written
I don't know French, so no idea if it actually works. But I've tried from English to German and Danish, and vice versa. Also tried from Danish to German, and the result is always the same - it's incomplete, and anything that is slightly technical in nature won't be translated properly. In short - want it done right, do it yourself.
You claimed the article on toms was a copy paste from another article. He merely stated that the article here was based on a french version.
I actually read the whole thing.
I just don't get TLP when RAM is cheap and the Nehalem/Vista can address 128gigs. Anyway, things have changed a lot since running Win NT with 16megs RAM and constant memory swapping.
1) How does the loop detection feature know when it is a loop? The diagrams posted don't show any connection between it and the 'front' of the pipeline, so how can it know that the next operation is the same if it hasn't yet entered the loop?
2) On page 8 there's a diagram with a 4 socket setup showing 2 io hubs. Are they connected to the same pcie bus and whatever else they interface with? or are only 2 of the sockets able to directly access a given resource?
3) With the modular design, would one risk buying a cpu that doesn't work in a motherboard because it is intended for a 2 or 4 socket system? or are they all the same, simply with some qpi's disabled?
4) Am I right in assuming that qpi replaces fsb only when it has to communicate with an i/o hub (as shown in one of the top diagrams on page 8)? Or is it used for every one of the 'blue' lines on the lower diagram (10 total in a 4 socket layout)? The latter would mean 4 qpi's are barely enough to satisfy bandwidth needs in a server environment. I imagine an esx server with 4 processors (32 threads) can easily demand memory from dram pools not linked to the local core the threads are running on, and use 96GB/s (3x32) of the 102GB/s (4x12.8x2) total theoretical bandwidth in addition to some of the local 32GB/s bandwidth from the socket a given core/thread is running on. So if this scenario is correct, is it possible to increase the speed of the qpi (read: oc the link) to increase available bandwidth? And what happens if one successfully finds ddr3-1600 modules that run within the 1.65v limitation? Wouldn't that mean the qpi was already at its limit (38.4GB/s per dram pool x 3 sockets not local to the core that runs a thread)? I know memory isn't truly the bottleneck in modern computers, but I still find it weird that they put so much effort into the memory controller if it isn't actually the problem. Simply adding a few qpi links between the sockets and the chipset would've solved the bandwidth issue without limiting usable memory types by choosing a certain cpu. Sure it wouldn't have improved latencies, but honestly, who cares? Neither in a gaming pc, netbook or any number of common server configurations is it the memory latency that is the bottleneck.
5) How much time should one assume is wasted when a core on conroe flushes the l2 cache? They seem to have solved the issue and as a consequence increased cache latency (which should turn into slower overall cache performance). In English: can we expect any gain from this change?
6) Would the immensely increased tlb size improve performance in newer games which precache loads of data? (thinking quicker retrieval of texture data etc)
7) Page 12 mentions unaligned memory access, which I've never heard of before. Apparently compilers already try to avoid this situation, so can we expect the improvement to handling such to be of interest? What's the point of improving a feature to handle a situation that hardly ever arises in the first place?
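The bandwidth figures in question 4 can be checked with a few lines of arithmetic. The sketch below is only an illustration of that reasoning: the constants (8 bytes per DDR3 transfer, three memory channels per socket, 12.8GB/s per QPI link per direction) are assumptions matching the numbers quoted in the question, and the function names are made up for this example.

```python
# Rough peak-bandwidth arithmetic for the 4-socket scenario in question 4.
# All constants are assumptions that match the figures quoted in the question.

DDR3_BUS_BYTES = 8             # one 64-bit DDR3 channel moves 8 bytes per transfer (assumed)
CHANNELS_PER_SOCKET = 3        # triple-channel memory controller (assumed)
QPI_GBS_PER_DIRECTION = 12.8   # per link and per direction, as used in the question


def dram_pool_gbs(mega_transfers_per_s):
    """Peak bandwidth of one socket's local DRAM pool, in GB/s."""
    return mega_transfers_per_s * DDR3_BUS_BYTES * CHANNELS_PER_SOCKET / 1000.0


def qpi_total_gbs(links):
    """Peak aggregate bandwidth of `links` QPI links, both directions counted."""
    return links * QPI_GBS_PER_DIRECTION * 2


def remote_demand_gbs(mega_transfers_per_s, remote_sockets=3):
    """Worst case from the question: every remote DRAM pool is read at full speed."""
    return remote_sockets * dram_pool_gbs(mega_transfers_per_s)


if __name__ == "__main__":
    qpi = qpi_total_gbs(4)
    for speed in (1333, 1600):
        demand = remote_demand_gbs(speed)
        print(f"DDR3-{speed}: pool {dram_pool_gbs(speed):.1f} GB/s, "
              f"remote demand {demand:.1f} GB/s, QPI total {qpi:.1f} GB/s")
```

Under these assumptions, DDR3-1333 gives roughly the 96GB/s against 102GB/s quoted in the question, while DDR3-1600 would ask for more than the four links can carry in theory. These are peak numbers only, and the comparison assumes all remote traffic shares those four links.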
http://www.xbitlabs.com/articles/cpu/display/replay.html
This was a very good article and is not a copy ... well done Fedy !!
perhaps it will burn out the IMC within the chip since it's all done at 45nm, 1.6+v would be deadly, imagine air cooling a 3ghz quad core chip at ~2v? i take it it shares the rail even within the cpu so
depends on how connected that ram is, there might be advantages etc this way, and it also makes you wonder if AMD suffers from this - I've heard of extreme overclockers killing ram channels on AMD's etc
on the other hand who cares about high performance memory - 3 x 1333mhz is going to be better than 2 x 1600+mhz channels etc, along with the fact it's an IMC based setup etc and average maximum bandwidths of ~32gb/s vs the current average maximum of ~12.8gb/s etc
as for the memory issue. Who'd want to run 3x1333 if they could run 6x1600 ? any enthusiast will only be satisfied with the best, and 1333 just isn't it. Not even 1600 is. ddr3-1333 is basically obsolete, and it's not even mainstream yet. It's a disaster really.
A case of perhaps minimising reflected impedance?
Just my theory anyway ... remember ... I am only here for the humour ... not the technology.
AMD4LIFE