The CPU Side: An All-New Piledriver Core
An APU is an amalgamation of x86 cores and graphics resources. So, let’s start by exploring the component of the die traditionally referred to as the CPU.
When Llano was introduced a year ago, we already knew that its Stars architecture was on its last legs. AMD’s plans for the future clearly centered on Bulldozer, a design that wouldn’t make it into a desktop-oriented product until last October.
Well, the situation is reversed for Trinity’s introduction. This time, AMD’s most modern processor architecture is being shown off in an APU, and a mobile APU at that. Dubbed Piledriver, it is an update to Bulldozer that won’t find its way onto the desktop until later in 2012.
What are the main differences between the Husky cores in AMD’s Llano architecture and the Piledriver-based cores in Trinity? Whereas a quad-core APU built on the Llano design employs four distinct execution cores, quad-core Trinity chips feature two Bulldozer modules. Each module boasts two integer cores. However, they share some of the resources that you’d typically find duplicated on more traditional multi-core implementations, such as the fetch and decode stages, floating point units, and L2 cache. Again, you can read more about the Bulldozer architecture in AMD Bulldozer Review: FX-8150 Gets Tested.
The most obvious difference between AMD’s desktop FX processors and the CPU component of Trinity is cache. While each of the APU’s modules still shares 2 MB of L2, Trinity lacks the 8 MB shared L3. That leaves a dual-module Trinity chip with 4 MB of L2 and no L3, matching Llano’s on-die memory.
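To make the module arithmetic concrete, here is a minimal sketch that tallies cores and cache for a dual-module Trinity chip using only the figures quoted above. The class and field names are hypothetical and chosen purely for illustration; they do not correspond to any AMD tool or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleChip:
    """Hypothetical model of a module-based CPU, for illustration only."""
    modules: int
    integer_cores_per_module: int = 2   # each module holds two integer cores
    l2_mb_per_module: int = 2           # L2 is shared within a module
    l3_mb: int = 0                      # Trinity has no L3

    @property
    def integer_cores(self) -> int:
        return self.modules * self.integer_cores_per_module

    @property
    def total_l2_mb(self) -> int:
        return self.modules * self.l2_mb_per_module


# A "quad-core" Trinity is two modules, per the text above.
trinity = ModuleChip(modules=2)
print(trinity.integer_cores, "integer cores,",
      trinity.total_l2_mb, "MB L2,",
      trinity.l3_mb, "MB L3")
```

The point of the sketch is that the "quad-core" label counts integer cores, while the fetch/decode hardware, floating-point unit, and L2 are counted per module.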
AMD engineers made it clear that one of their main design goals for Piledriver was to improve IPC compared to Bulldozer. We knew this as far back as AMD’s original Bulldozer briefing, so it’s not a surprise. With FX, we saw that the architecture gave up significant per-clock performance compared to its predecessor, and that clearly needed to be addressed. The engineering team didn’t use just one magic bullet in its quest, but rather a variety of strategies that result in improved performance per clock.
Here are the main improvements implemented in the Piledriver core:
First, the branch predictor was significantly revamped and split into a two-level structure. Keeping the instruction pipeline flowing is a critical job when performance is the target, and while AMD didn’t disclose anything more specific, it did make it clear that branch prediction plays a significant role.
In addition, engineers increased the size of the instruction window to allow a larger group of instructions to be processed; this improves performance, and helps process operating system-level code more efficiently. More ISA instructions were added as well, including a fused multiply-add (FMA3) and a floating point 16-bit convert (F16C). The Bulldozer architecture already supported FMA4, so the inclusion of FMA3 enables support for a capability that Intel will introduce in its next-gen architecture as well. According to AMD, instruction execution times were improved, resulting in faster floating-point and integer divide results in addition to faster calls and returns, changes that are critical to getting in and out of subroutines quickly. Page translation has also been improved and optimized.
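For readers unfamiliar with the two FMA flavors: both compute a*b + c with a single rounding step, and they differ in how operands are encoded. FMA3 takes three operands, so the destination register is also one of the inputs and gets overwritten. FMA4 takes four, leaving all three inputs intact. The sketch below is a rough software model of that difference, not real instruction behavior. The register names and helper functions are hypothetical, the three-operand helper is modeled on the "231" operand ordering, and exact rational arithmetic stands in for the hardware's single rounding.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Return a*b + c rounded once, emulated with exact rationals."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # the only rounding step


def fma3(regs, dst, src1, src2):
    """Three-operand form: dst is also the addend and is overwritten."""
    regs[dst] = fused_multiply_add(regs[src1], regs[src2], regs[dst])


def fma4(regs, dst, src1, src2, src3):
    """Four-operand form: separate destination, inputs are preserved."""
    regs[dst] = fused_multiply_add(regs[src1], regs[src2], regs[src3])


# Values chosen so that a*b is not exactly representable as a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = a * b + c                  # product is rounded, then the sum
fused = fused_multiply_add(a, b, c)  # rounded once at the end

regs3 = {"r0": c, "r1": a, "r2": b}
fma3(regs3, "r0", "r1", "r2")        # r0 (the addend) is replaced

regs4 = {"r0": c, "r1": a, "r2": b, "r3": 0.0}
fma4(regs4, "r3", "r1", "r2", "r0")  # r0 survives, result lands in r3

print(unfused, fused, regs3["r0"], regs4["r0"], regs4["r3"])
```

With these inputs, the separate multiply and add lose the small residual entirely, while the fused version keeps it. That accuracy benefit applies to both FMA3 and FMA4; the operand count only changes how compilers have to juggle registers.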
The memory subsystem is another key component of performance, and we saw early on that high cache latencies were one of Bulldozer’s key weaknesses. AMD engineers claim to have invested a lot of effort to improve Piledriver’s L2 cache and hardware prefetcher, purportedly reducing latencies when memory is read. Stream prediction is also said to be significantly better than in the previous generation of APUs.
The Load/Store unit has also been targeted as a place where latency can be reduced, so store-to-load reordering has been improved with follow-up reads to better anticipate compiler requests and reduce load latency. The L1 translation lookaside buffer (TLB) has been doubled to 64 entries, so more address translations can be served from the L1 TLB and fewer incur the latency of a miss. Finally, both the integer and floating-point schedulers have been improved to better utilize all of the hardware units that Piledriver has to offer.
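One way to picture what a doubled TLB buys is its "reach," meaning the amount of memory whose translations fit in the TLB at once. The back-of-the-envelope sketch below assumes standard 4 KiB x86 pages, which AMD did not specify in its briefing, and takes the previous size to be 32 entries because the text says the count was doubled to 64. The function name is hypothetical.

```python
PAGE_SIZE_BYTES = 4096  # assumption: standard 4 KiB x86 pages


def tlb_reach_bytes(entries, page_size=PAGE_SIZE_BYTES):
    """Memory covered when every TLB entry maps a distinct page."""
    return entries * page_size


before = tlb_reach_bytes(32)  # implied by "doubled to 64 entries"
after = tlb_reach_bytes(64)

print(f"{before // 1024} KiB -> {after // 1024} KiB")
```

Under those assumptions the reach grows from 128 KiB to 256 KiB, so a working set in that range is less likely to fall through to a slower translation lookup.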
With improvements in clock rate (something we’ll talk about a little later), AMD claims its Trinity-based A10-5800K offers a 26% improvement over the Llano-based A8-3850 on the desktop, and that its A10-4600M shows a 29% improvement over the A8-3500M in notebooks.
Those are pretty aggressive improvements, and we’ll be keeping them in mind as we run through our tests. But first, let’s take a look at the graphics segment of Trinity.