Jaguar: A Low-Power x86 Core
We've already introduced you to a number of AMD's APU designs, which combine general-purpose and graphics processing resources onto a single die. First it was Llano in the mobile space with The AMD A8-3500M APU Review: Llano Is Unleashed. Then it was Trinity on the desktop in AMD Trinity On The Desktop: A10, A8, And A6 Get Benchmarked! But both of those APU designs followed AMD's more performance-oriented roadmap with the Stars- and Piledriver-derived CPU architectures.
For an example of the company's low-power efforts, we have to go all the way back to January of 2011 for ASRock's E350M1: AMD's Brazos Platform Hits The Desktop First. The Brazos platform came armed with a Zacate APU. Within Zacate, AMD integrated two 1.6 GHz Bobcat-based x86 cores and its Cedar (Radeon HD 5450ish) GPU.
The Jaguar architecture we're looking at today is an iterative improvement over Bobcat. In approaching Jaguar, AMD says it had three design goals. First, improve IPC. Bobcat was (in)famously slow, barely outperforming Intel's 2008-era Atom 330. Second, bring the ISA's functionality up to more modern standards, introducing instruction sets like SSE4.1/4.2 and AVX. Third, augment portability for the future, making Jaguar easier to take to new process technologies and fab partners.
As end users, that last point isn't our problem. The modern list of features is nice, but once you know what Jaguar supports, it's easy to anticipate the gains in specific, optimized workloads. AMD's efforts to improve IPC are much more interesting, though.
Let's start with the basics. Jaguar (as it shows up in the SoCs we're talking about today) is available in dual- and quad-core configurations. Bobcat-based SoCs were limited to dual-core arrangements. The quad-core variants based on Jaguar require active cooling, while the dual-core chips should run cool enough for passive cooling.
The CPU core is manufactured using 28 nm technology, and AMD's chief technology officer, Joe Macri, points out that the x86 design team leveraged some of the software tools used to build GPUs, squeezing more resources into a smaller area than the more custom previous-generation cores allowed. As a result, each Jaguar core occupies 3.1 square millimeters of die space. That's notably smaller than the 4.9 square millimeters each Bobcat core monopolized.
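To put those two area figures in perspective, here is a quick back-of-the-envelope calculation. It uses only the per-core numbers quoted above; the comparison of four Jaguar cores against two Bobcat cores counts core area alone and ignores cache and the rest of the SoC, so treat it as a rough sketch rather than a die-size estimate.

```python
# Per-core die area figures quoted in the article (square millimeters).
BOBCAT_CORE_MM2 = 4.9
JAGUAR_CORE_MM2 = 3.1


def area_reduction(old_mm2: float, new_mm2: float) -> float:
    """Return the fractional area saved when going from old_mm2 to new_mm2."""
    return (old_mm2 - new_mm2) / old_mm2


reduction = area_reduction(BOBCAT_CORE_MM2, JAGUAR_CORE_MM2)
print(f"Each Jaguar core is about {reduction:.0%} smaller than a Bobcat core")

# Core area only (no L2 cache, no GPU, no uncore logic).
quad_jaguar_mm2 = 4 * JAGUAR_CORE_MM2
dual_bobcat_mm2 = 2 * BOBCAT_CORE_MM2
print(f"Four Jaguar cores: {quad_jaguar_mm2:.1f} mm^2")
print(f"Two Bobcat cores:  {dual_bobcat_mm2:.1f} mm^2")
```

In other words, doubling the core count costs only about a quarter more core area than the old dual-core arrangement.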
Now, where does Jaguar improve over Bobcat? In the front end, Jaguar's instruction cache offers similar throughput, though it delivers this bandwidth at a lower power cost thanks to a selective read process that only activates one-fourth of the banks. A 4x32B loop buffer is also added; when the execution pipelines can use information stored there, the instruction cache can stay powered down, yielding the double benefit of lower power and lower latency.
In addition, the instruction buffer is about 30% larger than it was on Bobcat, circumventing some of the hit you might take after a cache miss.
Finally, the execution pipeline grows by one decode stage. As we saw so painfully when Intel introduced Pentium 4, longer pipelines are actually detrimental to IPC. However, breaking the pipeline up does help improve scalability. The assumption is that AMD is countering the IPC hit with higher clock rates.
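The trade-off is easy to sketch if we assume, as a first-order model, that performance scales with IPC multiplied by clock rate. The 3% IPC penalty below is a purely hypothetical figure chosen for illustration; AMD did not give us a number for the cost of the extra stage.

```python
def breakeven_clock_ratio(ipc_loss: float) -> float:
    """Clock-rate multiplier needed to offset a fractional IPC loss.

    Assumes performance = IPC * frequency, which ignores memory stalls
    and other effects that do not scale with core clock.
    """
    if not 0.0 <= ipc_loss < 1.0:
        raise ValueError("ipc_loss must be in the range [0, 1)")
    return 1.0 / (1.0 - ipc_loss)


# HYPOTHETICAL: suppose the extra decode stage cost 3% of IPC.
hypothetical_loss = 0.03
ratio = breakeven_clock_ratio(hypothetical_loss)
print(f"Clock must rise by {ratio - 1:.1%} to break even")
```

Under that assumption, any frequency gain beyond the break-even point is a net performance win.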
The integer pipeline is augmented with a divider unit pulled over from Llano's Stars architecture and modified for Jaguar. Support for a number of familiar complex operation (COP) instructions is included, in addition to hardware CRC units, to improve the CPU's x86 code execution efficiency. Schedulers and re-order buffers are anywhere from 30 to 70% larger, improving the parallelism of code executed out-of-order.
The L2 cache and its interface with the execution cores are completely redesigned. The cache is now a shared 2 MB pool (broken up into 512 KB banks) with 16-way associativity, rather than 512 KB dedicated to each core. AMD says this is a nod to efficiency, as software can take advantage of a little or a lot, depending on a thread's needs.
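Those figures are enough to work out the cache's basic geometry. The sketch below uses the capacity, bank size, and associativity quoted above; the 64-byte line size is our assumption, since AMD's numbers here do not state it.

```python
KIB = 1024

L2_TOTAL_BYTES = 2 * 1024 * KIB  # 2 MB shared L2 (from the article)
L2_BANK_BYTES = 512 * KIB        # 512 KB per bank (from the article)
L2_WAYS = 16                     # 16-way associative (from the article)
LINE_BYTES = 64                  # ASSUMPTION: line size is not stated above


def cache_geometry(total_bytes: int, bank_bytes: int, ways: int, line_bytes: int) -> dict:
    """Derive bank, line, and set counts for a set-associative cache."""
    lines = total_bytes // line_bytes
    return {
        "banks": total_bytes // bank_bytes,
        "lines": lines,
        "sets": lines // ways,
    }


geometry = cache_geometry(L2_TOTAL_BYTES, L2_BANK_BYTES, L2_WAYS, LINE_BYTES)
for name, value in geometry.items():
    print(f"{name}: {value}")
```

With a different line size, the line and set counts scale accordingly; the bank count does not depend on that assumption.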
Bobcat's L2 cache ran at half of the CPU's clock rate. Jaguar's interface runs at full processor frequency. Pre-fetching is improved; AMD's algorithm pays better attention to data patterns, assisting the predictor in making better choices. Sixteen additional L2 snoop entries act as a probe filter to avoid look-ups whenever possible, again, saving power and improving latencies. According to AMD, its shared L2 is one of the greatest contributors to IPC improvements in Jaguar compared to Bobcat.
The load/store unit, which sits between the execution pipeline and the L2 cache, and the data cache are both improved to help make AMD's L2 enhancements more tangible. Jaguar combines loads, utilizing a much bigger buffer to avoid store data shuffling and perform load bypasses at lower latencies.
AMD's changes to Jaguar add up to a 22% single-threaded IPC increase over Bobcat, the company says. That's a per-clock improvement, so optimizations for clock rate should push that number upwards as this architecture hits higher frequencies. Naturally, we'll be putting those claims to the test in just a few pages...
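For a rough sense of how the IPC claim and clock rate combine, the sketch below multiplies AMD's 22% figure by a frequency ratio. The 1.6 GHz baseline is the Bobcat clock from the Zacate APU mentioned earlier; the 2.0 GHz Jaguar clock is a hypothetical value picked for illustration, and the model assumes performance scales linearly with frequency.

```python
IPC_GAIN = 0.22  # AMD's claimed single-threaded IPC uplift over Bobcat


def relative_performance(ipc_gain: float, old_clock_ghz: float, new_clock_ghz: float) -> float:
    """Estimate single-threaded performance relative to the old core.

    First-order model: performance = IPC * frequency.
    """
    return (1.0 + ipc_gain) * (new_clock_ghz / old_clock_ghz)


BOBCAT_CLOCK_GHZ = 1.6  # Zacate's Bobcat cores, per the article
JAGUAR_CLOCK_GHZ = 2.0  # HYPOTHETICAL clock rate, for illustration only

same_clock = relative_performance(IPC_GAIN, BOBCAT_CLOCK_GHZ, BOBCAT_CLOCK_GHZ)
higher_clock = relative_performance(IPC_GAIN, BOBCAT_CLOCK_GHZ, JAGUAR_CLOCK_GHZ)
print(f"At the same clock:    {same_clock:.2f}x")
print(f"At the higher clock:  {higher_clock:.3f}x")
```

Real workloads rarely scale this cleanly, which is exactly why benchmarks matter more than the arithmetic.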