
Intel's Developer Forum has always been a great event for getting up to speed on processor technology trends and developments. This year, however, IDF was much more than an interesting summit for pros, geeks, analysts and journalists. For starters, the company's CTO Justin Rattner referred to this year's event as the most memorable IDF since its inception in 1997. IDF is Intel's stage to show the world where its technology is going, and the company used that stage to announce its intention to shift the balance of processor performance from Austin back to Santa Clara.
It has been two bumpy years for Intel's processor business. The introduction of the 90-nm Pentium 4 (Prescott) marked a turning point, because all new Pentium 4 releases increased power consumption by a significant amount without a commensurate boost in performance. Intel tried increasing clock speeds with its NetBurst architecture, but it hit the ceiling faster than expected.
At the same time, AMD has offered a more efficient processor architecture, and recently extended its lead further with the introduction of dual-core models. Intel has thus been taking a beating both in performance and in energy efficiency.
Some even began to speculate that Intel was out of the running. However, we should not forget that Intel's financial, manufacturing and other resources are ten times those of AMD. Hence, it is obvious that such a corporation would not just walk off the battlefield. So, last year, a generational change was announced in the form of Intel's next-generation micro architecture. All new Intel processors launching in the third quarter of this year are going to be based on one common design, and they are targeted to do big things: Merom (mobile) is to add 20% more performance compared to the Core Duo; Conroe (desktop) is to boost performance by 40% while decreasing power requirements by 40%; and Woodcrest (servers) will purportedly increase performance by as much as 80% while reducing power by 35%.
It sounds almost too good to be true, which is why Intel focused on disclosing many technical details of the architecture (which is called Core Micro Architecture). We will go through the key details first, and then have a look at more developments in the mobile and platform areas, since IDF, of course, is not only about processors.

Intel's chief technology officer Justin Rattner is the brains behind what he called "Intel 3.0".

Intel finally admits that NetBurst was anything but ideal. Justin Rattner said: "We were under tremendous competitive pressure."
If you are aware of recent processor history, Intel's new strategy is a very logical reaction. Pentium 4 and Pentium D processors draw more power than their AMD counterparts, the Athlon 64 and Athlon 64 X2. This translates into higher cooling requirements and a higher energy bill. For systems that run 24/7, this difference easily adds up to $100 a year in North America, or considerably more in countries with higher energy costs. Also think of large corporations with hundreds or thousands of systems: the energy cost difference can become a competitive disadvantage.
A simple reduction in energy consumption could have been accomplished by taking the Pentium M or Core Duo architecture (Yonah core) and speeding it up to match Pentium 4 performance. However, Intel seems to be doing it right, defining a new efficiency metric, Energy per Instruction (EPI), alongside the familiar performance equation:
Performance = Frequency * Instructions per clock cycle
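To spell out the logic (our reading of Intel's reasoning, not a slide from the keynote): if dynamic power is, to a first order, energy per instruction multiplied by instruction throughput, then performance per Watt reduces to the inverse of EPI, which is exactly why Intel optimizes for that metric:

```latex
\text{Performance} = f \times \text{IPC}, \qquad
\text{Power} \approx \text{EPI} \times f \times \text{IPC}
\quad\Longrightarrow\quad
\frac{\text{Performance}}{\text{Power}} \approx \frac{1}{\text{EPI}}
```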
At last year's fall IDF, the stated goal was to beat the competition both in absolute performance and in performance per Watt. Intel has since taken that statement even further and talks about "satisfaction per Watt," which brings all the current processor features such as 64-bit capability and virtualization technology into the equation. In the end, it all comes down to delivering as much value as possible per clock cycle without involving ridiculous thermals.
And Intel goes a step further: It does not matter how long the processor pipeline actually is, it does not matter whether the memory controller is integrated or not, and it does not matter what clock speed the device actually is running at. All that matters is to deliver maximum performance at minimum power requirements. It does sound good, doesn't it? What we want now is proof.

Consuming less energy per instruction is the primary goal now. Interestingly, the Pentium M and Core Duo processors offer the same low level of power consumption per instruction as the initial Pentium processor (P54).

One important ingredient is the 65-nm manufacturing process that Intel describes as allowing for 20% faster transistors while requiring 30% less power. This, by the way, is also the projection the firm has for its introduction of 45-nm manufacturing late next year.

This makes it pretty clear what happened with the P4 family: increasing clock speed and voltage by 20% would improve performance only a little, but power draw would increase by roughly three-fourths.
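For the curious, the arithmetic behind that figure can be reconstructed with the usual first-order model, in which dynamic power scales with frequency times voltage squared (our back-of-the-envelope calculation, not Intel's):

```latex
P \propto f \cdot V^2 \quad\Longrightarrow\quad
\frac{P_{\text{new}}}{P_{\text{old}}} = 1.2 \times 1.2^2 \approx 1.73
```

A 20% bump in both frequency and voltage thus costs roughly 73% more power, which matches the "three-fourths" above.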

The trend for future multi-core processors is simple: There will be even bigger caches and possibly smaller cores, because this is the only way Intel could possibly make its vision of packing hundreds of cores on a single die come true within the next decade. The next generation of processors, however, will be available in the second half of 2007: Kentsfield in the desktop and Clovertown in the server/workstation space will merge two dual-core Conroe or Woodcrest dies into one package.
This approach is described as a multi-chip package and is already in use today with the Pentium D Presler, which is based on two Cedar Mill-type Pentium 4 chips. Of course, there are disadvantages, such as L2 cache access: separate caches create additional FSB load as soon as one core needs to access the other die's L2 memory. But from a business point of view, this approach definitely makes sense: it will still scale up performance quite a bit, while the implementation is something that can be done with a 65-nm silicon process. Intel stated that we should not expect monolithic quad-core processors prior to the introduction of 45-nm manufacturing.

The way to quad cores is paved: First there will be multi-chip packages, which integrate two dual-core dies into one package.

This is the dual-core armada for this year and H1/2007.

We already mentioned the key milestones that Intel set for the development of its next-generation micro architecture: a greater number of instructions per clock cycle and record-setting energy efficiency (measured in energy per instruction). There are three processor designs derived from the same dual-core architecture: Conroe for the desktop, Merom for mobile and Woodcrest for servers. Everything will be produced with 65-nm process technology. While the three are technically almost identical, certain features and characteristics will be enabled for certain segments only. High clock speeds are something we will only see in the high-end desktop and maybe the server space. For all other applications, clock-speed-independent efficiency was the primary goal, achieved by increasing pipeline throughput and bandwidth.
The new micro architecture is now called Core Micro Architecture and is characterized by five key features: Wide Dynamic Execution, Advanced Digital Media Boost, Advanced Smart Cache, Smart Memory Access and Intelligent Power Capability.
Core Micro Architecture is an out-of-order design in which individual instructions are scheduled and staggered across a 14-stage pipeline. In order to increase instruction efficiency, Intel focused on making instruction execution more flexible. While that sounds easy, it conflicts with the requirement of IA machines to maintain a clean memory ordering for the sake of adhering to program semantics. One easy example: a store to an address must complete before a subsequent load from that address, because the load must return the current (latest) data.
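Here is a minimal sketch in C of that ordering constraint (a hypothetical illustration of ours, not Intel code): when a store and a subsequent load touch the same address, the hardware may not reorder them, because the load must observe the freshly written value.

```c
/* Hypothetical illustration: why a load cannot blindly pass a store.
 * If both touch the same address, program semantics require the load
 * to return the value just stored. */
int buffer[256];

int update_and_read(int index)
{
    buffer[index] = 42;     /* store */
    return buffer[index];   /* load: must return 42, so it may not be
                               executed ahead of the store above */
}
```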
Executing more instructions at the same time was also achieved within the three ALUs (Arithmetic Logic Units), which can process SSE instructions in a single cycle (128-bit wide SSE). In addition to that, L2 cache improvements, thanks to the shared design as well as new prefetchers that work on the basis of memory disambiguation (prefetching data that is not going to be modified by other queued instructions), help to feed the pipeline more efficiently.
Critics might want to compare the Core architecture to the Pentium III now. However, Intel obviously built something completely new: Core features inline decoding, which was not the case with the P3. Also, there are three ALUs, while the P3 had only one (NetBurst has two). Lastly, NetBurst's trace cache has been eliminated.

Intel built a radically new processor design from scratch, but took lots of ingredients from what it has learned with the Pentium M (Banias, Dothan), all for the sake of improving instruction-level performance while keeping thermals low. "We had to go back and carefully design a balanced machine," Intel mobility Vice President Mooly Eden said.


Merom will be available for Socket 479. Basically, current Napa systems are capable of running Merom processors by updating the BIOS only. Intel prefers to call this a 'Napa Refresh'.

Conroe will be deployed for Socket 775. It will require either the 975X chipset (for gaming) or the upcoming 965 chipset (for the digital home and office). Again, a BIOS update should do it, unless you want to use the upcoming Extreme Edition, which is expected to offer FSB1333 (333 MHz base clock) system speed.

Woodcrest will be known as Xeon and will run on Bensley platforms, which are going to power Dempsey processors at up to 3.73 GHz starting next month.

Wide Dynamic Execution summarizes the improvements that Intel made to execution width (four instructions in parallel rather than three) and to the efficiency of micro-ops processing.
As you can see in the image above, the greater execution width of four (partially even five) is maintained throughout the whole execution path, which represents an internal bandwidth increase. In other words, the processor can fetch, dispatch, execute and retire four instructions simultaneously.
In addition to that, the Core architecture supports the techniques that the Pentium M applies to reduce the total number of micro ops. Micro ops are x86 instructions broken down into the primitive operations the processor actually understands; two of them can be fused into a single micro op in order to save time (and energy). According to Intel, roughly every tenth instruction can be merged with another one using Micro Ops Fusion.
The idea of fusing micro ops has also been applied at the instruction level (instruction level parallelism), by allowing two adjacent instructions (e.g. a compare and a jump) to be merged for decoding and execution. This feature, which is called Macro Ops Fusion, even carries into the ALUs: these allow single-cycle instruction execution, whether for a fused macro op consisting of two instructions or for generic instructions.
Both fusion mechanisms together can help to increase the efficiency of each core considerably. Think of it as a kind of compression at the instruction and micro-ops level.
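As an illustration of the kind of code Macro Ops Fusion targets, consider an ordinary loop (a hypothetical example of ours, in generic C): compilers emit loop conditions as a compare followed by a conditional jump, and it is exactly such adjacent cmp/jcc pairs that Core can decode and execute as one fused operation.

```c
/* Hypothetical illustration: compare-and-branch pairs, the classic
 * Macro Ops Fusion candidates. */
long sum_until_limit(const int *data, int n, long limit)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {   /* 'i < n' compiles to cmp + jcc */
        sum += data[i];
        if (sum > limit)            /* another cmp + jcc pair */
            break;
    }
    return sum;
}
```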





Previously, the ALUs broke 128-bit instructions into two 64-bit blocks, which resulted in two micro ops and thus two execution clock cycles. Intel has now extended the execution width of the three ALUs and the load/store units to 128 bits, allowing eight single-precision or four double-precision operations to be processed per cycle. The feature is called Advanced Digital Media Boost, because it applies to SSE instructions in particular. This is called Single Cycle SSE and, for example, allows a vector of four 32-bit elements to be processed as a single 128-bit operation.
Intel expects this to make a tremendous difference for all types of media processing applications (encoding, transcoding, compression, etc.), and it even says Core offers the highest IA computation density for vector processing.
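A small sketch of what Single Cycle SSE accelerates, using standard SSE intrinsics in C (our example, not Intel demo code): the single `_mm_add_ps` call below performs four single-precision additions as one 128-bit instruction, which previous architectures had to split into two 64-bit halves over two cycles.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Add two vectors of four floats with one 128-bit SSE instruction. */
void add_vectors(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);     /* load four 32-bit floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vr = _mm_add_ps(va, vb);  /* four adds, one instruction */
    _mm_storeu_ps(out, vr);          /* store the 128-bit result */
}
```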


The unified L2 cache is probably the feature that gets mentioned first. It allows a large L2 cache (2 MB or 4 MB) to be shared by two processing cores. Caching can be more effective because data is no longer stored twice in separate L2 caches (no replication). The full L2 cache is highly dynamic and can adapt to each core's load, which means that a single core may allocate up to 100% of the L2 cache dynamically if required (on a line-by-line basis).
Sharing data is also more efficient now, because no front side bus load is generated when reading or writing the cache (which is the case with the Pentium D), and there is no stalling when both cores try to access it. A good example of the advantages in multi-threaded environments is one core writing data into the cache while the other reads something else at the same time. Cache misses are reduced, latency goes down, and access itself is faster as well, because the Front Side Bus definitely was a limiting factor.
Advanced Prefetch

After developing a clearly more efficient processing architecture and a powerful L2 cache, Intel wanted to make sure these units are used as efficiently as possible. Each dual-core Core processor comes with a total of eight prefetcher units: two data prefetchers and one instruction prefetcher per core, plus two prefetchers as part of the shared L2 cache. Intel says they can be fine-tuned for each of the Core processor models (Merom/Conroe/Woodcrest) to prefetch data differently for mobile, desktop or server usage models.
A prefetcher moves data into a higher-level unit using speculative algorithms; it is designed to provide data that is likely to be requested soon, which reduces latency and increases efficiency. The memory prefetchers constantly watch memory access patterns, trying to predict whether there is something they could move into the L2 cache, just in case that data is requested next. At the same time, the prefetchers are tuned to watch for demand traffic and back off when it appears, since speculative fetches must not compete with data the cores are actually requesting.
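The difference between prefetch-friendly and prefetch-hostile code is easy to show (a hypothetical C illustration of ours): a sequential array walk exposes a fixed stride the prefetchers can predict, while a pointer chase reveals the next address only after the current load completes, leaving them little to work with.

```c
#include <stddef.h>

/* Fixed-stride access: a pattern hardware prefetchers predict well. */
long sum_sequential(const long *array, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += array[i];
    return sum;
}

/* Pointer chase: the next address is unknown until the current load
 * completes, so there is little for a prefetcher to predict. */
struct node { long value; struct node *next; };

long sum_linked(const struct node *p)
{
    long sum = 0;
    for (; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}
```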

Many times, load instructions have to wait until earlier stores are executed, although they have nothing to do with them. The memory disambiguation predictor selects memory loads that are independent of prior write operations, in order to execute them ahead of their scheduled point in time (see image). Again, this is a way of making sure that the processor pipeline is supplied with data as efficiently as possible, and it masks memory latencies at the same time.
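In code, the situation looks like this (a hypothetical C sketch of ours): at decode time the hardware cannot prove that `a + i` and `b + j` differ, so without disambiguation the load would stall behind the store; the predictor learns that the two rarely alias and lets the load execute early, rolling the pipeline back in the rare case it guessed wrong.

```c
/* Hypothetical illustration of memory disambiguation. */
void store_then_load(int *a, const int *b, int i, int j)
{
    a[i] = 7;            /* store to an address computed at run time */
    int x = b[j] * 2;    /* independent load: may be executed ahead of
                            the store if the predictor says no aliasing */
    a[i + 1] = x;
}
```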



Decreasing the energy per instruction is one thing, but there are additional measures that provide better power management and efficiency. Intel combines several of them, starting at the manufacturing level: the 65-nm process provides a good basis for efficient ICs. Clock gating and sleep transistors make sure that unneeded units, down to individual transistors, are shut down. Enhanced SpeedStep still reduces the clock speed when the system is idle or under low load, but it is now capable of controlling each core separately. Voltage can also differ between different blocks of the processor.


Having gone through the key features of Intel's upcoming Core micro architecture, some people, especially at AMD, may still complain that there will still be a front side bus, and that the memory controller will still sit within the Northbridge, a long data path away from the processor, with higher latency, and so on. AMD's memory controller, which is part of the Athlon 64 processor, scales beautifully with processor clock speed and provides enough bandwidth at DDR400 speeds to compete against conventional platforms with DDR2-667 memory.
However, the approach Intel takes makes sense as well: why should it integrate the memory controller if it can achieve even better overall performance by applying other techniques? Intel's line of argument is that a Northbridge memory controller is more flexible: platform and processor architecture can be updated independently of one another. At the same time, integrated graphics controllers (which are used in the majority of systems sold today) still benefit from memory being "local" in the Northbridge.
As systems scale up to two or more processor sockets, memory coherency between several memory controllers will require considerable design effort, while in Intel systems the complete memory management is handled by one single memory unit. Power saving could be another issue: with an integrated controller, it is impossible to move the CPU to lower power states as long as the memory needs to remain active.
At this point, we should also mention Intel's upcoming I/O Acceleration Technology (IOAT), which is designed to move I/O data directly into memory while bypassing the processor. This could not be done without creating additional system traffic if the memory controller were integrated into the processor.
This discussion can go on for a long time, but at the end of the day most users won't care about how their systems actually work, but how efficient they are.

The Core Micro Architecture's platform was, of course, another hot topic; "satisfaction per Watt" was a buzz phrase we heard often at IDF.
In regard to power savings, the platform could centralize power management by, for example, putting the chipset and the memory into a power-saving state, while the graphics chip remains active. Intel will add several digital thermal sensors (DTS) and digital thermometers to the processor cores to facilitate the task.
Virtualization Of I/O Devices: VT-d
Intel's virtualization technology (VT) has only just been introduced, but Intel is already talking about the next generation. Virtualized solutions are great for failover, expansion or consolidation scenarios. The next generation is called VT-d, which stands for directed I/O.

For the first time, Intel compared one of its upcoming server systems to an AMD-powered machine. Yes, you got it right; it actually mentioned the name of the competitor that has been taking away market share for some time. We understand that the comparison was meant for the sole purpose of showing the tremendous advances and the dominance Intel expects to have over AMD here, but we found it a little odd to compare an unreleased, future product (Woodcrest) against a product that has been around for a while.
The system in question was HP's ProLiant DL380, which is under development and is going to use Woodcrest Xeon processors at around 3.0 GHz. The comparison system was a SunFire X4200 with Opteron 280 processors at 2.4 GHz. Intel used risk-assessment software from SunGard, which certainly is a great application for showing computational power (not a chance for AMD here), but it does not stress the memory subsystem very much, and that is exactly where AMD's localized DDR400 memory controllers and Intel's centralized quad-channel DDR2-533 engine with fully-buffered DIMMs differ most.
FBD setups are known to suffer from added latency, though the Core micro architecture will certainly do its part to hide it. And again, we prefer to wait for final systems before judging.

This contest was not about performance only; it was about performance per Watt. Here, Intel claims an advantage of 1.4x over AMD.

From a technology point of view, the Bensley platform is certainly going to offer more than most competitors - whether you can make use of it or not.

We found the presentation of so-called "Internet Mashups" very interesting. A mashup is the combination of two different Internet services to achieve a new goal. The example Intel used was Google Earth merged with the FAA's flight information system, so the result displays the actual locations of hundreds of flights on Google Earth in real time. Intel believes this type of application will drive mobility in the future.
Santa Rosa Platform For 2007

Santa Rosa will be Intel's 2007 mobile platform. In contrast to today's 945GM Napa platform, the 965GM (we suppose this is a very likely name) will add fourth-generation mobile graphics and new 802.11n wireless (Kedron). ICH8M will support AMT (Active Management Technology), and it is also intended to support Wireless AMT, which means that system maintenance (remote startup, BIOS changes, etc.) could even be done over wireless networks. Isn't that scary and amazing at the same time?
Three Serial ATA II ports and 10 USB 2.0 ports will be part of the new mobile Southbridge. Intel also intends to bundle its mobile chipsets with Viiv Media Share software, which facilitates accessing multimedia content on a Viiv PC. Of course, you are supposed to buy one of these in order to make use of this feature. Finally, Intel NAND flash could be found on mobiles in the form of an add-on module called Robson, which we describe below.

Currently, both Microsoft and the hard drive makers are working on integrating a limited amount of flash memory into future-generation drives. This flash memory could serve several purposes, including power savings when hard drive data is not required, quicker boot times by reading operating system data out of fast flash memory, or even performance enhancements. Flash-enabled hard drives certainly are the future, but they have nothing to do with Intel's latest approach to selling more chips.
Robson will serve as an add-on component to help speed up the process of booting Windows. Currently, Intel is talking about 256 MB, but depending on flash prices, that could easily end up being 1 or 2 GB. While Windows PCs will certainly benefit quite a bit, flash cannot replace conventional hard drives any time soon because of its capacity constraints. However, small form factor solutions could definitely be powered by multi-gigabyte flash storage solutions rather than hard drives.

This is not the future of the PC, but it probably is a new market sector: The Ultra Mobile PC (UMPC) is smaller than a sheet of paper, weighs less than two pounds and integrates a full PC plus tablet PC features into a form factor between handhelds/PDAs and ultra-portable notebooks. Since the notebook market grew by 40% last year, there obviously is demand for more versatile, flexible or individual solutions.
The UMPC is likely to end up being the same thing Microsoft targets with Origami, which is based on a stripped-down version of Windows. Right now, the UMPCs we looked at are based on 915 chipset architecture and Celeron M or Pentium M processors, but it might not be long before we see Core Duo based solutions.
This whole segment is still pretty new, and it is hard to gauge how the market will react. Personally, I would not want such a UMPC, as it is not going to replace my cell phone completely, and it is going to suffer from limitations such as device size and weight (yes, batteries are heavy) for quite some time. But we're eager to see the first products and judge then.


It's nice to see Intel sponsoring the BMW F1 team, but isn't the keynote analogy between an efficient platform and a fuel-hungry racing car a bit off?
From what we have seen, Intel must have been working quietly for quite some time to come up with a processor and platform solution meant to regain what has been lost to the competition. There is a new confidence and a new fire at Intel: confidence based on a manufacturing process that gives the firm a nice nine- to 12-month lead over AMD, and on the fact that the architectural improvements seem substantial enough to give it an architecture lead as well.
Yes, there still is the Front Side Bus, and yes, it is going to be a bottleneck again as soon as quad cores are introduced in 2007 (these will merge two dual-core processors into one package first). But in the end, Intel is right with its assessment: Who cares as long as the platform is capable of beating anything else? Performance is still very important, efficiency is a metric that more and more people are finally becoming aware of and value is what it comes down to in the end.
IDF was a great show, but that is what it is: a show. It is time to present proof, and to present some real systems. It is Intel's top priority now to make it to the market around the projected time frame, which is Q3/2006. Any delay will give AMD the chance to come up with its reworked 65-nm processors (slated for Q3 or Q4), and, of course, those new AMD processors will involve more enhancements than just a die shrink.

