The Piledriver Architecture: Improving On Bulldozer
The very foundation of AMD’s current x86 architecture was covered in great depth back when I reviewed the FX-8150 (AMD Bulldozer Review: FX-8150 Gets Tested). All of those tenets carry over to the company’s Piledriver update. However, we know that AMD’s engineers learned a number of lessons as they took the original Bulldozer concept from theories and diagrams to actual silicon. We also know that process technology evolved over the last year, even if the company continues to use a 32 nm node for manufacturing its Vishera-based CPUs. It should come as no surprise, then, that today’s reformulation is the result of several tweaks flagged for improvement a long time ago.
Front-End Improvements
In the days that followed AMD’s Bulldozer introduction, branch prediction was flagged as one of the architecture’s possible weaknesses. The module concept involves certain shared resources feeding two execution threads, and the architects attempted to minimize front-end bottlenecks by implementing one prediction queue per thread behind a 512-entry L1 and 5120-entry L2 branch target buffer. For Piledriver, the company claims its predictor’s accuracy is improved.
Piledriver adds support for a couple of ISA extensions that we first covered in our mobile Trinity-based APU coverage. Fused multiply-add was introduced a year ago in Bulldozer, but that version, FMA4, allowed an instruction to take four operands. Intel only plans to support the simpler three-operand FMA3 instruction set in its upcoming Haswell architecture, so AMD preempts that addition with Piledriver. The other extension, F16C, adds instructions for converting up to four half-precision values to single-precision floating-point (and back) at a time. Intel’s Ivy Bridge architecture already includes this, so its implementation in Piledriver simply catches AMD up. Not that Bulldozer was suffering without FMA3/F16C; compiler support only arrived in Visual Studio 2012.
Inside The Integer Cluster
Each of the two integer clusters in a compute module features an out-of-order load/store unit capable of two 128-bit loads per cycle or one 128-bit store per cycle. AMD discovered certain cases where Bulldozer wouldn’t catch store data already sitting in a register file; rectifying this lets instructions feed into the integer clusters more quickly.
Within each integer core, we’re still dealing with two execution units and two address generation units (referred to simply as AGens). Those AGens are more capable this time around: they can now execute MOV instructions, so when address-generation traffic is light, the architecture shifts MOVs onto those otherwise-idle pipes.
One of the most notable changes is a larger translation lookaside buffer for the L1 data cache, which grows from 32 entries to 64. Because the L2 TLB has fairly high 20-cycle latency, improving the hit rate in L1 can yield significant performance gains in workloads that touch large data structures. This is particularly important in the server space, but AMD’s architects say they noticed certain games demonstrating sensitivity to this too, which isn’t something they had expected.
L2 Cache Optimizations
Hardware prefetching into the L2 is improved as well. Minimum latency doesn’t change, which is why cache latency doesn’t look any better in our Sandra 2013 benchmark. However, as the prefetcher and L2 are used more effectively, average latency (much more difficult to measure with a diagnostic) should be expected to drop, AMD claims. The same Sandra 2013 module also reflects very little change in L3 latency, and Vishera’s architects confirm that no changes were made to the L3 cache shared by all modules on an FX package.
Putting It All Together: Five Architectures At 4 GHz
What effect do all of those adjustments have on Piledriver's per-cycle performance? We ran five different architectures at 4 GHz to compare their relative results.
In iTunes, which we know to be single-threaded, the FX-8350 demonstrates significant gains over the Bulldozer-based FX-8150. But a Phenom II X6 1100T operating at the same frequency is still faster. And that's before we look at the Sandy and Ivy Bridge architectures, which jump way out in front of anything from AMD.
Notice that the Core i7 is listed as a quad-core CPU capable of addressing four threads. I disabled Hyper-Threading in this test to isolate core performance. Had it been turned on, Intel's client flagship would have likely finished in first place.
Nevertheless, we're most interested in the gain realized by shifting from FX-8150 to FX-8350, and it is significant. Again, though, Thuban's six cores manage to outmaneuver Vishera's quad-module configuration. AMD relies on a clock rate advantage to keep its latest architecture in front of the older design; Thuban simply can't scale to such high frequencies, even though it gets more done per cycle.