Apple’s A8 SoC: A More Powerful Cyclone
With the launch of its iPhone 5s, Apple introduced the A7 SoC sporting the first ever 64-bit ARMv8 CPU core: Cyclone. Apple’s adoption of a 64-bit, desktop-like architecture caught the entire industry by surprise. Only now, a year later, are we beginning to see some ARM Cortex-A53 and Cortex-A57 based devices come to market. These, however, are stock ARM designs.
ARM architecture licensee Qualcomm has yet to announce its custom ARMv8-based CPU, which probably won’t arrive until the latter half of 2015. To bridge the gap in its roadmap, Qualcomm will release the Snapdragon 810 SoC, pairing four Cortex-A57 cores with four Cortex-A53 cores, in early 2015. This is in stark contrast to the current generation’s timeline, where Qualcomm’s custom Krait cores, loosely based on the Cortex-A15, appeared inside the 28nm Snapdragon S4 SoC roughly half a year before the first Cortex-A15-based SoC, Samsung’s Exynos 5 Dual, started shipping in October of 2012.
Nvidia has managed to push its own Denver core, which executes the ARMv8 instruction set, to market earlier than its own roadmap suggested. Even so, Denver arrives more than a year after Apple’s 64-bit A7. Denver looks to be even wider (a seven-way superscalar pipeline versus the six-way A7/A8), but it takes a different approach: an efficient in-order hardware design that relies on software to translate instructions and reorder them ahead of execution. It will certainly be interesting to see how well it performs relative to the native ARMv8 cores. Nvidia isn’t the first company to try this approach: Transmeta’s Crusoe family of processors did something similar, emulating the x86 instruction set. While fairly power efficient, Crusoe never achieved the same level of performance as native x86 CPUs.
The dual-core Cyclone CPU in the A7 has more in common with Intel’s desktop CPUs than with its ARM-based brethren. It’s capable of decoding, issuing, executing and retiring up to six instructions per cycle, twice the width of the Cortex-A15 and Krait; the effective advantage is sometimes even greater, since those cores restrict which integer and floating-point instructions can execute in parallel. Each core has four integer ALUs, three floating-point/NEON ALUs and two load/store units.
To go along with its wide, out-of-order pipeline, the A7 has a large out-of-order window, or reorder buffer, holding up to 192 micro-ops versus 128 for the Cortex-A15 and only 40 for Krait. Also on board is a generous cache hierarchy: 64KB/64KB L1 instruction/data caches, 1MB of L2 and 4MB of L3.
The A8 SoC: A New Process And More Transistors
The A7 SoC was built on Samsung’s 28nm HKMG process with a die area of 102mm². Apple stated that the A7 contains “over 1 billion” transistors. The A8’s transistor count grows to 2 billion, significantly more than the Snapdragon 805’s estimated ~700 million transistors and more than the 1.6 billion transistors in Intel’s quad-core Haswell with GT2 graphics. Despite doubling the transistor count, Apple managed to reduce the die size by 13 percent to 89mm² by switching to TSMC’s latest 20nm process (based on analysis performed by Chipworks).
The second-generation HKMG 20nm node “can provide 30 percent higher speed, 1.9 times the density or 25 percent less power than [TSMC’s] 28nm technology,” according to TSMC. A chip designer can’t get all three improvements at the same time of course, with the speed and power improvements being mutually exclusive. The A8’s modest increase in CPU clock frequency shows Apple chose the power-saving path.
In the post Inside the iPhone 6 and iPhone 6 Plus, Chipworks estimates that the A7’s Cyclone CPU block occupies 17.1mm², while the A8’s CPU block occupies 12.2mm², a 29% area reduction. Based solely on the process change from 28 to 20nm, the maximum theoretical reduction in area is ~49%, since die area ideally scales with the square of the linear feature size: (20/28)² ≈ 0.51. This number doesn’t account for differences between Samsung’s and TSMC’s processes, or for SRAM transistors scaling differently than logic transistors, but it provides an upper limit on the shrink. Even with more realistic numbers, it looks like some additional logic has been added to the A8 CPU.
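The scaling limit is simple geometry. A back-of-the-envelope sketch, treating the node names as ideal linear feature sizes (an idealization; real drawn dimensions are less tidy):

```python
# Idealized die-area scaling from a 28nm to a 20nm process.
old_node_nm = 28.0
new_node_nm = 20.0

# Area scales with the square of the linear dimension.
area_scale = (new_node_nm / old_node_nm) ** 2   # ~0.51 of the original area
max_reduction = 1.0 - area_scale                # ~49% smaller, at best

# Chipworks' estimates for the CPU block areas.
a7_cpu_mm2 = 17.1
a8_cpu_mm2 = 12.2
observed_reduction = 1.0 - a8_cpu_mm2 / a7_cpu_mm2   # ~29%

print(f"ideal max reduction: {max_reduction:.0%}, observed: {observed_reduction:.0%}")
```

The gap between the ~49% ideal and the ~29% observed shrink is what suggests extra logic in the A8 CPU, even after allowing for SRAM scaling and cross-foundry differences.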
Apple bumps the CPU clock rate from 1.3 to 1.4GHz for the A8, an increase of roughly 8%. With the company's promise of a 25% performance increase over Cyclone, there are clearly some tweaks and extra transistors hiding within A8. The most obvious change is that the CPU and GPU have swapped sides. Focusing on the CPU, we see it’s still dual-core, but the L1 and L2 caches moved farther apart and the logic circuits appear mirrored left-to-right. A simple visual analysis doesn’t reveal anything about performance or where Apple put the extra transistors, but it does show us that the A8 CPU is not a mere shrink of A7.
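A rough decomposition of Apple’s claimed 25% gain shows how much must come from those tweaks. This is a sketch; it assumes performance scales linearly with clock frequency and that the 25% figure is an overall average:

```python
# How much of Apple's claimed 25% A8 speedup does the clock bump explain?
a7_clock_ghz = 1.3
a8_clock_ghz = 1.4
claimed_speedup = 1.25  # Apple's stated improvement over the A7

clock_speedup = a8_clock_ghz / a7_clock_ghz         # ~1.077, i.e. ~8%
# Whatever remains must come from microarchitectural changes,
# assuming performance scales linearly with frequency.
implied_other_gains = claimed_speedup / clock_speedup  # ~1.16, i.e. ~16%

print(f"clock alone: +{clock_speedup - 1:.1%}, implied other gains: +{implied_other_gains - 1:.1%}")
```

In other words, roughly two-thirds of the promised improvement has to come from somewhere other than frequency.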
Based on the die comparison and what various software benchmarks report, the A8’s cache hierarchy appears largely unchanged: 64KB/64KB of L1 instruction/data cache per core, 1MB of L2 and a shared 4MB L3. The one apparent difference is that the L2 now seems to be split into 512KB per core rather than shared between them.
Thanks to the team at iFixit, we know that the A8 uses LPDDR3-1600 DRAM in a package-on-package (PoP) configuration, which is unchanged from the A7. While theoretical max memory bandwidth remains unchanged, the A8’s memory performance is consistently faster in Geekbench 3.
Geekbench 3 Pro Memory Bandwidth (Single-Core)

| |STREAM Copy (GB/s)|STREAM Scale (GB/s)|STREAM Add (GB/s)|STREAM Triad (GB/s)|
|---|---|---|---|---|
|iPhone 6 Plus (A8)|9.61|5.81|6.20|6.16|
|iPhone 6 (A8)|9.95|6.00|6.32|6.33|
|iPhone 5s (A7)|8.32|5.21|5.69|5.71|
|A8 advantage (6 Plus vs. 5s)|15.6%|11.5%|8.9%|7.9%|
STREAM’s Copy metric simply copies the contents of one large array to another and is the most indicative of raw memory bus performance. Here the A8 shows a greater than 15% improvement in throughput, suggesting further memory controller optimizations for sequential data. The other three tests add arithmetic: STREAM Scale reads floating-point numbers from an array and multiplies each by a constant; STREAM Add reads numbers from two arrays, sums them and writes the results to a third array; and STREAM Triad reads numbers from two arrays, multiplies one by a constant, adds it to the other and writes the result to a third array [a(i) = b(i) + q*c(i)]. On these three tests the gains mirror the ~8% increase in clock frequency, with perhaps a small bonus from the optimized memory controller. Based on these results, the floating-point pipeline appears unchanged from the A7.
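The four kernels themselves are trivial to express. A minimal pure-Python sketch of the operations Geekbench’s STREAM-style tests perform (real implementations iterate over arrays much larger than the last-level cache so the kernels are bound by memory bandwidth, not arithmetic; this illustrates the operations, not the methodology):

```python
# The four STREAM kernels, over Python lists of floats.

def stream_copy(b):
    return [x for x in b]                      # a(i) = b(i)

def stream_scale(c, q):
    return [q * x for x in c]                  # b(i) = q * c(i)

def stream_add(a, b):
    return [x + y for x, y in zip(a, b)]       # c(i) = a(i) + b(i)

def stream_triad(b, c, q):
    return [x + q * y for x, y in zip(b, c)]   # a(i) = b(i) + q * c(i)

print(stream_triad([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], 2.0))  # [9.0, 12.0, 15.0]
```

Copy moves data with no arithmetic, which is why it isolates the memory subsystem; Scale, Add and Triad each add one or two floating-point operations per element, pulling the floating-point pipeline into the measurement.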
Looking at all the individual single-core integer and floating-point tests from Geekbench 3 shows a similar pattern. All tests suggest that the A8 is at least moderately faster than the A7, with most showing gains just beyond the ~8% increase in clock rate. Cryptography routines show the smallest gains, while tests that stream sequential data show slightly better results. There are a few outliers, but overall it looks like the A8 retains the same basic architecture as the A7: a six-wide design with four integer ALUs, three FP/NEON ALUs and two load/store units.
Apple’s Cyclone architecture was a huge advancement for mobile CPUs, both in terms of performance and features. Its emphasis on IPC, out-of-order execution and large caches drew more similarities to desktop processors rather than existing mobile CPUs. The higher IPC allowed Apple to extract the performance it desired while keeping frequency, and thus, power consumption under control.
The A8 CPU refines rather than replaces Cyclone’s radical design. It gains performance from an optimized memory controller, with further gains likely coming from improved hardware prefetching, reduced instruction latencies and lower cache/memory latency. It remains the fastest mobile CPU, a title it may keep for another year depending on how Nvidia’s Denver and ARM’s Cortex-A57 cores perform.