Seven years have passed since AMD first launched its K8 “Hammer” microarchitecture, which was updated three years ago by K10. Brand new, the Athlon 64 processors based on K8 kicked ass and took names, flying past Intel’s Pentium 4 processors to become enthusiast favorites.
But the performance landscape changes quickly, and Intel is notorious for feverish comebacks when it’s in second position. The company’s Core microarchitecture shifted favor back toward Intel in 2006, and that is where it has remained for the past four years.
Sure, AMD still sells attractive CPUs. Its Athlon II lineup consistently headlines our monthly Best Gaming CPUs For The Money column, thanks to respectable performance and entry-level pricing. The dual-core Phenom II X2 555 Black Edition is unbeatable under $100. And AMD’s Thuban-based offerings actually make hexa-core computing viable under $200.
Clock-for-clock, though, nobody will deny that AMD’s portfolio trails Intel’s. And Intel, as always, has a sizable manufacturing technology advantage with its newest chips centering on a 32 nm process. Meanwhile, AMD is forced to engineer its six-core CPUs into a 125 W TDP using 45 nm lithography.
Heavy Machinery Is On The Way
AMD hopes that its K10 design won’t have to stave off Intel’s Westmere microarchitecture for long, though. Last year, at one of its Financial Analyst Days, AMD gave a first look at the “modules” that will come to define how its next-generation processors will be put together. Though detail was sparse, company representatives made it clear that this is the most significant redesign seen since K8.
We already know that there will be two x86 cores based on this new architecture, each facilitating competitive functionality in a handful of different markets. Bulldozer is intended for deployment in everything from mainstream clients (including desktops and notebooks) to servers. Bobcat is supposed to be the more flexible design, enabling the low power and small dies needed in netbooks and cloud-optimized clients.
Bear in mind that, as with any generational leap, there are a number of new internally-referenced names to keep straight. AMD is really only discussing Bulldozer and Bobcat at Hot Chips 22 (the IEEE-sponsored symposium on performance processors). However, it’s probably worth going into a little more depth on where you’ll see these CPU designs crop up, if only to prevent code-name confusion. If you're a little lost on the nomenclature, use the last page of this piece as a reference to AMD's plans for 2011.
In reality, much of what AMD is discussing at Hot Chips is already known, which makes for a much-regurgitated slide deck covering the Bulldozer and Bobcat architectures.
Much of the company’s emphasis is on Bulldozer and its approach to threading. AMD draws a clear distinction between conventional simultaneous multi-threading (productized by Intel as Hyper-Threading) and chip-level multi-processing, employed by the six-core Thuban design, for example, where one core operates on one thread.
CMP is pretty straightforward. You replicate physical cores to scale out performance in threaded software, basically. It’s a brute-force approach that yields the best performance, but becomes very expensive for a manufacturer bumping up against the limits of its process technology, especially if execution resources are left idle. This is the exact reason we often recommend quick quad-core processors over slower six-core CPUs for gaming. Unless your workload is properly optimized for parallelism, CMP results in over-provisioning, and the higher clock rates of less-complex dual- and quad-core designs yield better performance.
Intel combats this with Hyper-Threading, which allows each physical core to work on two threads. Over-provisioning is assumed, meaning you rely on under-utilization to extract additional performance from each core. This is a relatively inexpensive technology. But it’s also quite limited in the benefits it offers. Some workloads don’t see any speed-up from Hyper-Threading. Others barely crack double-digit performance gains.
AMD is trying to define a third approach to threading it calls Two Strong Threads. Whereas Hyper-Threading only duplicates architectural states, the Bulldozer design shares the front-end (fetch/decode) and back-end of the core (through a shared L2 cache), but duplicates integer schedulers and execution pipelines, offering dedicated hardware to each of two threads.
The pair of threads share a floating point scheduler with two 128-bit fused multiply-accumulate-capable units. Consequently, it’s clear that AMD’s emphasis here is integer performance, which makes sense given the company’s Fusion initiative and impending plan to have GPU resources handle floating-point work. Just bear in mind that the first Bulldozer-powered processors won’t be APUs. Despite the fact that it is sharing FP resources here, AMD remains confident in its balance between dedicated and shared components.
None of that is new, though. AMD talked about it all back in November of last year.
Ahead of the Hot Chips presentation, we had the opportunity to refresh what we knew about Bulldozer with Dina McKinney, corporate vice president of design engineering at AMD. According to Dina, the company’s Two Strong Threads approach achieves somewhere in the neighborhood of 80% of the performance you’d see from simply replicating cores. At the same time, sharing some resources helps cut back on power use and die space.
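To put that 80% figure in perspective, here is a back-of-the-envelope sketch in Python. It assumes performance can be counted in "full-core equivalents" and that the 80% estimate applies uniformly to every module; both are simplifications of ours, not something AMD has stated, and the function names are purely illustrative.

```python
# Rough model: a two-thread Bulldozer module versus two fully replicated cores.
# MODULE_EFFICIENCY is the "neighborhood of 80%" figure quoted above; treating
# it as an exact, workload-independent constant is an assumption.
MODULE_EFFICIENCY = 0.80


def module_throughput(efficiency: float = MODULE_EFFICIENCY) -> float:
    """Throughput of one two-thread module, in full-core equivalents."""
    return 2 * efficiency


def chip_throughput(modules: int, efficiency: float = MODULE_EFFICIENCY) -> float:
    """Throughput of a chip built from identical modules, in full-core equivalents."""
    return modules * module_throughput(efficiency)


if __name__ == "__main__":
    print(f"One module: {module_throughput():.1f} core equivalents")
    print(f"Four modules: {chip_throughput(4):.1f} core equivalents")
```

Under those assumptions, a single module behaves like roughly 1.6 conventional cores, and a four-module part like roughly 6.4, while spending less die area than eight full cores would.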
This development, along with a shift to 32 nm SOI manufacturing, is leading AMD to estimate a 33% increase in core count and a 50% increase in throughput (suggesting significant IPC gains) in the same power envelope as Magny-Cours-based Opteron processors. The projections here are based on simulated comparisons between today’s 12-core Opteron 6100-series chips and upcoming 16-core Bulldozer-based models, currently code-named Interlagos.
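The arithmetic behind that projection is easy to check. The sketch below simply divides the projected throughput increase by the core-count increase; the function name is ours, and the result cannot separate IPC improvements from clock speed changes, so treat it as a rough per-core figure rather than an IPC number.

```python
# Per-core throughput gain implied by AMD's projection, using only the
# figures quoted above: 12 cores -> 16 cores and +50% total throughput.
def per_core_gain(core_increase: float, throughput_increase: float) -> float:
    """Fractional per-core throughput gain implied by two chip-level increases."""
    return (1 + throughput_increase) / (1 + core_increase) - 1


cores_old, cores_new = 12, 16                 # Opteron 6100 vs. Interlagos
core_increase = cores_new / cores_old - 1     # about 0.333, i.e. 33%
gain = per_core_gain(core_increase, 0.50)     # 1.5 / 1.333... - 1

if __name__ == "__main__":
    print(f"Core count increase: {core_increase:.1%}")
    print(f"Implied per-core throughput gain: {gain:.1%}")
```

In other words, the extra cores account for most of the projected 50%, with each core contributing about one-eighth more throughput than a Magny-Cours core does today, if the simulations hold.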
Duplicated execution resources lead AMD to call this a dual-core implementation.
Now, one of the concerns I’ve seen brought up regarding AMD’s taxonomy is that a Bulldozer module looks like an SMT-enabled single-core processor. Only, instead of duplicating registers to store the architectural state, AMD gives each thread its own instruction window and dedicated pipelines. In talking back and forth with AMD’s John Fruehe, it’s clear that the company thinks that the duplication of integer schedulers and corresponding pipelines (disregarding the other shared components) makes each Bulldozer module a dual-core design, distinguishing it from SMT as it’s associated with Hyper-Threading. That gets a little marketing-heavy for me, but I can certainly respect that we’re looking at an architecture that’ll do much more for performance than Hyper-Threading in parallelized workloads.
I was also curious how Bulldozer modules are expected to interact with Windows 7. Intel and Microsoft put a deliberate effort into optimizing for Hyper-Threading. The operating system’s scheduler knows the difference between a physical core and a Hyper-Threaded core. If it has two threads to schedule, Windows 7 and Server 2008 R2 use two physical cores. The alternative—scheduling two threads to the same physical, Hyper-Threaded core—would naturally sacrifice performance. Because Bulldozer modules are still sharing resources, it’d stand to reason that a four-module Zambezi CPU would be best served by similarly handling two threads using different modules. Though AMD wasn’t able to address how it’ll handle this interaction, it assures me that it’s working with OS vendors on optimizations that’ll be ready for Bulldozer’s release.
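The placement policy described above can be sketched in a few lines of Python. This is a hypothetical illustration of "fill separate modules first," not a description of how the Windows scheduler works or of anything AMD has confirmed; the function name and the (module, core) slot representation are ours.

```python
# Hypothetical module-aware placement: spread threads across modules before
# doubling up inside any one module, so shared resources are contended last.
def place_threads(num_threads: int, modules: int, cores_per_module: int = 2):
    """Return a list of (module, core) slots, one per thread."""
    capacity = modules * cores_per_module
    if num_threads > capacity:
        raise ValueError(f"{num_threads} threads exceed {capacity} hardware slots")
    slots = []
    for i in range(num_threads):
        module = i % modules      # walk across modules first
        core = i // modules       # move to the second core only after wrapping
        slots.append((module, core))
    return slots


if __name__ == "__main__":
    # Two threads on a four-module chip land on two different modules.
    print(place_threads(2, modules=4))
    # A fifth thread is the first one forced to share a module.
    print(place_threads(5, modules=4))
```

With two threads on a four-module chip, each thread gets a module to itself; only the fifth thread has to share a front-end, FP scheduler, and L2 with a neighbor.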
Zambezi, based on Bulldozer, might just look like this.
I also asked John about the front-end’s instruction/cycle capabilities and the shared L2’s capacity configuration, but neither of those details is available yet. What he could tell me was that the 128-bit FP units are symmetrical, and that, on any cycle, either integer core can dispatch a 256-bit AVX instruction (assuming software compiled to support AVX). Or, both integer cores can dispatch a single 128-bit instruction at the same time.
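One way to picture the shared FP unit is as a per-cycle budget of two 128-bit slots. The Python sketch below encodes only the two cases John described; the assumption that a 256-bit AVX instruction occupies both 128-bit units is our reading of his description, and the names are illustrative.

```python
# Toy model of the shared floating-point unit: two symmetrical 128-bit units
# per module. Assumption (ours): a 256-bit op ties up both units for the cycle.
FP_UNITS = 2
UNIT_WIDTH_BITS = 128


def fits_in_one_cycle(op_widths):
    """True if the requested FP ops (widths in bits) fit the module's two units."""
    # Ceiling division: a 256-bit op needs two 128-bit units.
    units_needed = sum(-(-width // UNIT_WIDTH_BITS) for width in op_widths)
    return units_needed <= FP_UNITS


if __name__ == "__main__":
    print(fits_in_one_cycle([256]))        # one core issues a 256-bit AVX op
    print(fits_in_one_cycle([128, 128]))   # both cores issue a 128-bit op
    print(fits_in_one_cycle([256, 128]))   # more than the two units can take
```

The first two cases fit within a single cycle in this model, while the third does not, which is why a module running two AVX-heavy threads would see them take turns on the FP hardware.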
In addition, John clarified how each integer unit’s pipelines are oriented. Whereas K10 enables three pipelines shared between ALUs and AGUs (effectively 1.5 of each), Bulldozer increases this number to four pipelines—two dedicated AGU and two dedicated ALU. The L1 cache configuration is a bit different, too. Whereas K10 offered 64 KB of L1 instruction and 64 KB of L1 data cache per core, Bulldozer enables 16 KB of L1 data cache per core and 64 KB of 2-way L1 instruction cache per module. It remains to be seen how the smaller L1 affects performance.
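To see how much L1 capacity changes, it helps to compare two K10 cores against one two-core Bulldozer module. The Python sketch below uses only the figures quoted above; the class and field names are our own, and it says nothing about latency or associativity effects, which may matter as much as raw capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreConfig:
    name: str
    int_pipes_per_core: int   # integer pipelines available to each core
    l1d_kb_per_core: int      # L1 data cache per core, in KB
    l1i_kb: int               # size of one L1 instruction cache, in KB
    l1i_shared_by: int        # how many cores share one L1 instruction cache


def l1_total_for_two_cores(cfg: CoreConfig) -> int:
    """Total L1 capacity (KB) serving two cores under the given configuration."""
    l1d = 2 * cfg.l1d_kb_per_core
    l1i = cfg.l1i_kb * (2 // cfg.l1i_shared_by)
    return l1d + l1i


K10 = CoreConfig("K10", int_pipes_per_core=3, l1d_kb_per_core=64,
                 l1i_kb=64, l1i_shared_by=1)
BULLDOZER = CoreConfig("Bulldozer", int_pipes_per_core=4, l1d_kb_per_core=16,
                       l1i_kb=64, l1i_shared_by=2)

if __name__ == "__main__":
    for cfg in (K10, BULLDOZER):
        print(f"{cfg.name}: {l1_total_for_two_cores(cfg)} KB of L1 for two cores, "
              f"{2 * cfg.int_pipes_per_core} integer pipelines")
```

By this count, two K10 cores carry 256 KB of L1 between them, while a Bulldozer module carries 96 KB, even as the number of integer pipelines per pair of cores rises from six to eight.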
AMD had a bit more to add on its Bobcat design, unquestionably created with the Fusion initiative in mind. The focus here is on Bobcat as a technology, which AMD plans to use to create SoCs targeting specific markets—the first of which should be its Ontario APU, featuring on-die graphics processing, fixed-function video playback acceleration, a DDR3 memory controller, and the dedicated bus linking everything together.
AMD’s estimate here is retention of 90% of today’s mainstream performance (I’d certainly consider something in Athlon II territory reasonable) in less than half of the silicon area. That’s a figure we’ve seen AMD use in past discussions of Bobcat. But perhaps less known was how the company planned to achieve this.
Details being discussed today include a dual-issue x86 decoder and out-of-order execution, perhaps enabling a performance advantage compared to Intel’s Atom CPUs. Bobcat will support SSE, SSE2, and SSE3, along with virtualization acceleration.
Beyond its performance implications, though, AMD repeats over and over that this is a sub-1 W-capable core. Possible though that might be (at standby), remember that Ontario will incorporate a pair of these cores. Additionally, Bobcat is part of a SoC. So, it might be a little more realistic to expect power numbers between 10 and 20 W.
AMD is juggling a ton of jargon, from far-reaching initiatives to very specific logic designs. The following list should help clarify some of what the company is doing. We’ll start with the broadest concepts and narrow it down to the hardware you’ll see turned into actual products.
Initiatives:
Fusion: AMD is using the word Fusion to describe an approach to processor design and software development, in its words: “…delivering powerful CPU and GPU capabilities for HD, 3D and data-intensive workloads in a single-die processor called an APU (accelerated processing unit). APUs combine high-performance serial and parallel processing cores with other special-purpose hardware accelerators, enabling breakthroughs in visual computing, security, performance-per-watt and device form factor.”
In short, an APU designed according to AMD’s Fusion initiative will include a CPU and a GPU on a single piece of silicon. The improvements an APU is expected to deliver include enhanced mainstream gaming performance and accelerated video transcoding, to name a couple of specific examples.
Microarchitectures:
Bulldozer: One of two new x86 architectures, Bulldozer will be used in performance desktops and servers. Bulldozer-based modules will serve as the basis for AMD’s next generation of processors. The company has already confirmed that it’ll maintain socket compatibility with existing Magny-Cours-based Opteron processors. Thus, you can expect to see Bulldozer-based CPUs dropping into existing server boards and, likely, Socket AM3 desktop platforms as well. AMD’s target power use for Bulldozer-based chips is between 10 and 100 W.
Bobcat: The second of two new x86 architectures, Bobcat is aimed at the low-power, ultrathin notebook and netbook spaces. Expect Bobcat-based cores to go up against Intel Atom and Via Nano. AMD has aspirations of hitting a sub-1 W power ceiling, though there will likely be models exceeding that figure. Bobcat is designed to be synthesizable, meaning AMD can build it into complementary logic blocks more easily than a processor laid out by hand. In other words, expect to see Bobcat CPUs rolled into AMD’s Fusion initiative.
Platforms:
Sabine: Mainstream mobile platform based on the Llano APU, which will see a quad-core Stars-based CPU and DirectX 11-class graphics processor tied together on the same piece of silicon, manufactured using 32 nm lithography. Sabine is expected to arrive in 2011.
Brazos: Ultra low-power mobile platform based on the Ontario APU, which will see a dual-core Bobcat-based CPU and DirectX 11-class graphics processor tied together on the same piece of silicon. Brazos is expected to arrive in 2011, and will allow AMD to drive netbooks, along with form factors the company’s hardware hasn’t yet appeared in (possibly tablets).
Scorpius: Enthusiast desktop platform based on AMD’s Zambezi processor and discrete graphics (AMD, of course, specifies an ATI GPU). The platform requires a quad-core CPU or higher, DDR3 memory, and a revised Socket AM3 interface. Availability is expected in 2011.
Lynx: Mainstream desktop platform based on AMD’s Llano APU. It’ll feature up to four CPU cores, a single graphics core (integrated onto the APU, naturally), and DDR3 memory. Availability is expected in 2011.
Components:
Llano: This is going to be AMD’s first APU, combining a quad-core Stars-based CPU and DirectX 11-class GPU on a single piece of silicon. It’ll be manufactured using a 32 nm SOI process, support DDR3 memory, and include core-level power gating. Because there are brand new capabilities in play here, it should surprise no one that Llano will drop into a new socket interface. Availability is expected in 2011.
Ontario: While the Llano APU absorbs much of AMD’s risk in shifting to 32 nm manufacturing (since it employs a familiar CPU microarchitecture), Ontario will be the first APU to employ AMD’s Bobcat CPU microarchitecture, which it pairs with a more mature manufacturing process. Ontario is manufactured at 40 nm, armed with DirectX 11-class graphics, and expected in 2011.
Zambezi: Per AMD, Zambezi will be the first desktop processor based on the company’s Bulldozer architecture. Featuring as many as eight cores, Zambezi-based offerings will incorporate as many as four processor “modules.” AMD plans to use 32 nm manufacturing, and early reports suggest Socket AM3 compatibility (along with DDR3 memory support). Zambezi is not an APU, but rather is meant to be paired with discrete graphics.
Interlagos/Valencia: Code-names for AMD’s upcoming 16-core and eight-core Opteron processors, respectively, both based on the Bulldozer microarchitecture. Interlagos will drop into the existing G34 interface, while Valencia is C32-compatible. Both families will be manufactured using 32 nm SOI lithography, will support DDR3 (including load-reduced DIMMs and 1.25 V memory modules), and are expected in 2011.





