The Bulldozer Platform: Using FX-8150 To Test
Within the Zambezi line-up, three models employ two Bulldozer modules totaling four cores, one includes three modules adding up to six cores, and three are fully-featured four-module SKUs boasting eight cores. We're using the flagship FX-8150 for testing here. It's currently selling for about $270 and, again, drops into the Socket AM3+ interface.
As you no doubt already know, AMD counts cores differently from Intel (and even from its own previous architectures). In contrast to chip-level multiprocessing, where each core is complete and distinct, AMD uses the phrase chip-level multi-threading and draws a distinction between modules and cores. The emphasis is on efficiency in a multi-core design, on the premise that the days of single cores operating on their own are over. Rather than cramming as many complete cores as possible onto a piece of silicon and brute-forcing the performance story, AMD tries to strike a balance by duplicating key resources and sharing the rest, theoretically adding complexity where it'll be best utilized and avoiding waste in less sensitive parts of the chip.
Again, from our launch coverage of FX-8150:
...the Bulldozer module doesn’t incorporate two complete cores. Instead, it shares certain parts of what we’d expect to find as dedicated resources in a typical execution core, including instruction fetch and decode stages, floating-point units, and the L2 cache.
According to Mike Butler, chief architect behind Bulldozer, this is justifiable because traditional cores operating in a power-constrained environment don’t make optimal use of thermal headroom. That completely makes sense; when you’re trying to pack as many cores into a server as possible, you want to bias in favor of the resources most likely to be used most often, and avoid chewing up die space/power with components that can be shared without negatively impacting performance too severely.
...but simultaneously optimizing for performance and power necessitates sharing of certain resources.
The decision to share only bites you in the butt when both threads need the same resources, at which point performance drops relative to chip-level multiprocessing. But AMD is optimistic: last August, when it started releasing architectural details at the Hot Chips conference, it estimated that a Bulldozer module could average 80% of two complete cores, while only affecting die space minimally. As a result, in heavily-threaded environments, a Bulldozer-based processor should deliver significant efficiency improvements.
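To put AMD's 80% estimate in concrete terms, here is a minimal back-of-the-envelope sketch in Python. It assumes the 80% figure holds uniformly across workloads and that throughput scales linearly with module count, neither of which AMD's estimate guarantees. The function name `core_equivalents` and the constants are illustrative only.

```python
# Back-of-the-envelope arithmetic for AMD's estimate that one Bulldozer
# module averages about 80% of two complete cores.
# ASSUMPTIONS (not from AMD): the figure applies uniformly to every
# workload, and throughput scales linearly with the number of modules.

MODULE_EFFICIENCY = 0.80   # AMD's Hot Chips estimate, as quoted in the text
CORES_PER_MODULE = 2       # each module exposes two cores, as AMD counts them


def core_equivalents(modules: int, efficiency: float = MODULE_EFFICIENCY) -> float:
    """Estimated throughput, in units of complete (unshared) cores."""
    if modules < 0:
        raise ValueError("modules must be non-negative")
    return modules * CORES_PER_MODULE * efficiency


# The three module counts found in the Zambezi line-up.
for modules in (2, 3, 4):
    amd_cores = modules * CORES_PER_MODULE
    print(f"{modules} modules ({amd_cores} cores as AMD counts them) "
          f"~ {core_equivalents(modules):.1f} complete cores")
```

Under those assumptions, a four-module part such as the FX-8150 works out to roughly 6.4 complete cores' worth of throughput when every thread is busy. Real results depend on how often the two threads in a module contend for the shared front-end and floating-point hardware.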
This also means AMD has to redefine what actually constitutes a core. To best accommodate its Bulldozer module, the company is saying that anything with its own integer execution pipelines qualifies as a core (no surprise there, right?), if only because most processor workloads emphasize integer math. I don’t personally have any problem with that definition. But if sharing resources negatively impacts per-cycle performance, then AMD necessarily has to lean on higher clocks or a greater emphasis on threading in order to compensate.
Learning To Share
Of course, AMD’s architects were careful in deciding which parts of the core could be shared, keeping power and efficiency in mind. As an example, following a branch misprediction, the front-end of a conventional core has to be flushed, wasting both bandwidth and power. Sharing that hardware between two cores helps improve the utilization of those resources. AMD also looked for areas where it could “afford” to share without hurting the timing of critical paths, hence the shared floating-point scheduler, which wasn’t considered to be as latency-sensitive as the integer units.
To the operating system, the resulting module appears as a pair of cores, similar to how a Hyper-Threaded core would appear. AMD is naturally eager to dispel the idea that Bulldozer will behave anything like Hyper-Threading (or SMT), claiming that its design facilitates better scalability than two threads sharing one physical core. Again, that makes sense—a Bulldozer module really can’t be characterized as a single core because many of its resources are, in fact, duplicated.
|Model|Base Clock|Turbo-Core Clock|Max. Turbo Clock|TDP|Cores|Total L2 Cache|L3 Cache|North Bridge Freq.|
|---|---|---|---|---|---|---|---|---|
|FX-8150|3.6 GHz|3.9 GHz|4.2 GHz|125 W|8|8 MB|8 MB|2.2 GHz|
|FX-8120|3.1 GHz|3.4 GHz|4.0 GHz|125 / 95 W|8|8 MB|8 MB|2.2 GHz|
|FX-8100|2.8 GHz|3.1 GHz|3.7 GHz|95 W|8|8 MB|8 MB|2.0 GHz|
|FX-6100|3.3 GHz|3.6 GHz|3.9 GHz|95 W|6|6 MB|8 MB|2.0 GHz|
|FX-4170|4.2 GHz|-|4.3 GHz|125 W|4|4 MB|8 MB|2.2 GHz|
|FX-B4150|3.8 GHz|3.9 GHz|4.0 GHz|95 W|4|4 MB|8 MB|2.2 GHz|
|FX-4100|3.6 GHz|3.7 GHz|3.8 GHz|95 W|4|4 MB|8 MB|2.0 GHz|
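The module-versus-core distinction can be read straight out of the table. The following Python sketch encodes a few of its columns and derives the module count and L2 cache per module for each model. It assumes two cores per module, as described earlier, and an even split of L2 across modules; the class and property names (`FxModel`, `modules`, `l2_per_module_mb`, `max_turbo_uplift_pct`) are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FxModel:
    """One table row; only the columns needed for the derived figures."""
    name: str
    base_ghz: float
    max_turbo_ghz: float
    cores: int
    l2_total_mb: int

    @property
    def modules(self) -> int:
        # Assumption from the text: every Bulldozer module exposes two cores.
        return self.cores // 2

    @property
    def l2_per_module_mb(self) -> float:
        # Assumption: total L2 is divided evenly among the modules.
        return self.l2_total_mb / self.modules

    @property
    def max_turbo_uplift_pct(self) -> float:
        # Headroom of the maximum turbo clock over the base clock, in percent.
        return (self.max_turbo_ghz / self.base_ghz - 1.0) * 100.0


# Values copied from the table above.
FX_LINEUP = [
    FxModel("FX-8150", 3.6, 4.2, 8, 8),
    FxModel("FX-8120", 3.1, 4.0, 8, 8),
    FxModel("FX-8100", 2.8, 3.7, 8, 8),
    FxModel("FX-6100", 3.3, 3.9, 6, 6),
    FxModel("FX-4170", 4.2, 4.3, 4, 4),
    FxModel("FX-B4150", 3.8, 4.0, 4, 4),
    FxModel("FX-4100", 3.6, 3.8, 4, 4),
]

for m in FX_LINEUP:
    print(f"{m.name}: {m.modules} modules, "
          f"{m.l2_per_module_mb:.0f} MB L2 per module, "
          f"max turbo +{m.max_turbo_uplift_pct:.1f}% over base")
```

Every model works out to 2 MB of L2 per module under the even-split assumption, which is consistent with the L2 cache being one of the resources shared between the two cores in a module.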