The Idea Behind AMD’s Bulldozer
If you were to sum up the Bulldozer concept with just one word, it’d have to be scalability. AMD put the bulk of its effort into designing a building block that’s small enough to be duplicated over and over in silicon, and yet capable enough to handle both integer- and floating-point-based workloads as deftly as possible. Indeed, the company confirms this was a from-scratch project started several years ago after considering some of the target markets its next-gen architecture would end up addressing: everything from mainstream clients to the top of the server range.
Even as far back as Bulldozer’s design phase, AMD recognized that the days of single-core processors were at an end. Indeed, even today’s entry-level desktops center on CPUs with at least two cores. It’s hardly a coincidence, then, that each “module” in a Bulldozer-based processor is able to address two threads concurrently.
Now, we all know that there are a number of ways to tackle multiple threads. Chip-level multiprocessing sits at one end of the spectrum, leveraging the brute force of multiple execution cores on the same piece of silicon. Replicating those resources yields the highest performance potential in well-threaded applications. However, it’s also the most “expensive” in terms of eating up a limited transistor budget.
Simultaneous multi-threading (SMT) sits at the other end, duplicating only the resources needed to issue instructions from multiple threads to one physical core, at minimal hardware cost. When a single thread fails to fully utilize a core’s resources, SMT fills in the gaps, squeezing every bit of potential from it. That’s what Intel’s Hyper-Threading technology does. The thing is, although Windows sees two logical processors for every physical core, the performance gain in real-world apps is much more modest than a second physical core would deliver.
And that’s why AMD gives Intel a hard time about its Hyper-Threaded chips, making a loud distinction between physical and logical cores. We’ve seen cases in our own benchmarks where a cheap quad-core Phenom II outperforms a dual-core, Hyper-Threaded Core i3 in a test like WinRAR or 7-Zip by virtue of its more resource-laden architecture.
What Is A Core, Anyway?
But now AMD is going to have to hop off of its high horse because the Bulldozer module doesn’t incorporate two complete cores. Instead, it shares certain parts of what we’d expect to find as dedicated resources in a typical execution core, including instruction fetch and decode stages, floating-point units, and the L2 cache.
According to Mike Butler, chief architect behind Bulldozer, this is justifiable because traditional cores operating in a power-constrained environment don’t make optimal use of thermal headroom. That completely makes sense; when you’re trying to pack as many cores into a server as possible, you want to bias in favor of the resources most likely to be used most often, and avoid chewing up die space/power with components that can be shared without negatively impacting performance too severely.
The decision to share only bites you in the butt when both threads need the same resources, at which point performance drops relative to chip-level multiprocessing. But AMD is optimistic: last August, when it started releasing architectural details at the Hot Chips conference, it estimated that a Bulldozer module could average 80% of the throughput of two complete cores, with only a minimal impact on die space. As a result, in heavily-threaded environments, a Bulldozer-based processor should deliver significant efficiency improvements.
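To put that estimate in perspective, here’s a back-of-the-envelope sketch. The 80% figure is AMD’s; the function name and the assumption that throughput scales linearly across modules are ours, purely for illustration.

```python
# AMD's Hot Chips estimate: with both threads busy, one module delivers
# roughly 80% of what two complete, independent cores would.
MODULE_EFFICIENCY = 0.80
CORES_PER_MODULE = 2


def core_equivalents(modules, efficiency=MODULE_EFFICIENCY):
    """Estimate how many 'complete' cores a fully loaded chip behaves like.

    Hypothetical helper: assumes perfect linear scaling across modules,
    which real workloads will not necessarily achieve.
    """
    return modules * CORES_PER_MODULE * efficiency


if __name__ == "__main__":
    for modules in (1, 2, 4):
        estimate = core_equivalents(modules)
        print(f"{modules} module(s): ~{estimate:.1f} core equivalents")
```

By that math, a four-module chip with all eight threads busy would behave like roughly 6.4 conventional cores, assuming AMD’s average holds.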
This also means AMD has to redefine what actually constitutes a core. To best accommodate its Bulldozer module, the company is saying that anything with its own integer execution pipelines qualifies as a core (no surprise there, right?), if only because most processor workloads emphasize integer math. I don’t personally have any problem with that definition. But if sharing resources negatively impacts per-cycle performance, then AMD necessarily has to lean on higher clocks or a greater emphasis on threading in order to compensate. Remember that for later.
Learning To Share
Of course, AMD’s architects were careful in deciding which parts of the core could be shared, keeping power and efficiency in mind. As an example, following a branch misprediction, the front-end of a conventional core has to be flushed, wasting both bandwidth and power. Sharing that hardware between two cores helps improve the utilization of those resources. AMD also looked for areas where it could “afford” to share without hurting the timing of critical paths, hence the shared floating-point scheduler, which wasn’t considered to be as latency-sensitive as the integer units.
To the operating system, the resulting module appears as a pair of cores, similar to how a Hyper-Threaded core would appear. AMD is naturally eager to dispel the idea that Bulldozer will behave anything like Hyper-Threading (or SMT), claiming that its design facilitates better scalability than two threads sharing one physical core. Again, that makes sense—a Bulldozer module really can’t be characterized as a single core because many of its resources are, in fact, duplicated.
But this does force us to address the relationship between AMD’s hardware and the software that’ll invariably run on it. In “Intel Core i5 And Core i7: Intel’s Mainstream Magnum Opus,” I brought up specific optimizations in Windows 7 that were the product of collaboration between Intel and Microsoft, most notably core parking. Windows 7 intelligently schedules threads to physical cores before utilizing logical (Hyper-Threaded) cores.
In theory, AMD could benefit from the same thing. If Windows were able to utilize an FX-8150’s four modules first, and then backfill each module’s second core, it’d maximize performance with up to four threads running concurrently. This isn’t the case, though. According to Arun Kishan, software design engineer at Microsoft, each module is currently detected as two cores that are scheduled equally. So, in a dual-threaded application, you might see one active module and three idle modules—great for optimizing power, but theoretically less ideal from a performance standpoint. This also plays havoc with AMD’s claim that, when only one thread is active, it has full access to shared resources. Adding just one additional thread could tie up those shared resources, even as multiple other modules sit idle.
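The difference between those two placement policies is easier to see in a toy model. To be clear, this is not how the Windows scheduler actually works; the function names and the simple packed/spread policies are hypothetical, and the packed case represents the worst outcome described above. The sketch only counts how many modules end up hosting two threads, and are therefore contending for the shared front-end, floating-point hardware, and L2.

```python
def _check(threads, modules):
    # Each module exposes two cores, so it can host at most two threads here.
    if threads < 0 or threads > 2 * modules:
        raise ValueError("thread count must fit within the available cores")


def place_packed(threads, modules=4):
    """Worst case when all cores look equal: fill a module before moving on."""
    _check(threads, modules)
    load = [0] * modules
    for t in range(threads):
        load[t // 2] += 1
    return load


def place_spread(threads, modules=4):
    """Module-aware policy: one thread per module first, then backfill."""
    _check(threads, modules)
    load = [0] * modules
    for t in range(threads):
        load[t % modules] += 1
    return load


def shared_modules(load):
    """Count modules where two threads compete for shared resources."""
    return sum(1 for n in load if n == 2)


if __name__ == "__main__":
    for threads in (2, 4):
        packed = place_packed(threads)
        spread = place_spread(threads)
        print(threads, "threads packed:", packed, "contended:", shared_modules(packed))
        print(threads, "threads spread:", spread, "contended:", shared_modules(spread))
```

With two threads, the packed policy lands both on one module and leaves three idle, while the spread policy gives each thread a module to itself. At four threads, packing leaves two modules contended and two idle; spreading leaves none contended.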
Microsoft is looking to change that behavior moving forward, though. Kishan says that the dual-core modules have performance characteristics more similar to SMT than physical cores, so the company is looking to detect and treat them the same as Hyper-Threading in the future. The implications there would be significant. Performance would unquestionably improve, while AMD’s efforts to spin down idle modules would become less effective.
That’s pretty granular stuff, though. And it’s the performance today that matters most. So, let’s keep this party rocking…