Processor Performance: Now Dual-Core Flavored
Apple uses what’s referred to as a system-on-a-chip (SoC) in mobile devices like the iPad and iPhone. In this particular implementation, the SoC combines the processor core (or cores), graphics processing, and RAM in a package-on-package. Because those components sit next to each other in the same package, data moves between them more efficiently. Moreover, less PCB space is consumed, since more functionality lives in one on-board component.
An SoC's influence isn't all positive, however. A heavily integrated IC still faces specific physical and thermal constraints, so the subsystems that make up the SoC aren't as potent as they might be if they were discrete.
Intel's Sandy Bridge architecture is a good example. The company improved platform performance while simultaneously trimming power consumption versus its previous-generation design. However, keeping processing, memory control, cache, and graphics in the same 95 W thermal window required concessions. The HD Graphics engine is perhaps the clearest indicator that Intel was working with a very specific transistor budget. Though the company's engineers created an engine deemed "good enough" for many desktop workloads, discrete graphics cards like AMD's Radeon HD 6970 and Nvidia's GeForce GTX 580 demonstrate how much more headroom there is when a design is free of the constraints imposed on more integrated solutions.
|Specification||Apple A4 (iPad)||Apple A5 (iPad 2)|
|Processor||1 GHz ARM Cortex-A8 (single-core)||1 GHz ARM Cortex-A9 (dual-core)|
|Memory||256 MB LP-DDR (single-channel?)||512 MB LP-DDR2 (dual-channel)|
|Graphics||PowerVR SGX535 (single-core)||PowerVR SGX543MP2 (dual-core)|
|L1 Cache (Instruction/Data)||32 KB / 32 KB||32 KB / 32 KB|
|L2 Cache||640 KB||1 MB|
The iPad 2 features Apple's newest SoC, the A5, which differs substantially from the A4 in the original iPad. Let's start with what changes in the CPU.
|Specification||ARM Cortex-A8||ARM Cortex-A9|
|Package Size||198.8 mm²||238.8 mm²|
|Execution Pipeline Depth||13 stages||8 stages|
|Processing Power||2.0 DMIPS/MHz/Core||2.5 DMIPS/MHz/Core|
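As a rough sanity check on the DMIPS figures in the table above, the per-core ratings can be turned into a theoretical aggregate number. The sketch below assumes both chips run at 1 GHz, as listed in the SoC comparison, and that Dhrystone throughput scales perfectly across cores, which real workloads rarely manage. The function name is purely illustrative.

```python
def peak_dmips(dmips_per_mhz, clock_mhz, cores):
    """Theoretical aggregate Dhrystone MIPS, assuming perfect multi-core scaling."""
    return dmips_per_mhz * clock_mhz * cores

a4 = peak_dmips(2.0, 1000, 1)  # Cortex-A8: one core at 1 GHz
a5 = peak_dmips(2.5, 1000, 2)  # Cortex-A9: two cores at 1 GHz

print(f"A4: {a4:.0f} DMIPS, A5: {a5:.0f} DMIPS, ratio: {a5 / a4:.1f}x")
```

On paper that is a 2.5x gap, but only 1.25x of it comes from the core itself. The rest depends on software actually keeping both cores busy.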
Instead of the iPad’s ARM Cortex-A8, the iPad 2 uses a dual-core ARM Cortex-A9 with a total of 1 MB of L2 cache. At the architectural level, the major difference is out-of-order execution. This is regarded as a higher-performance approach than in-order execution, which executes instructions strictly in the order they appear. An out-of-order design issues instructions based on the availability of their input data, thereby preventing the pipeline from sitting idle while data is retrieved.
If you want to draw a summertime analogy, consider the process of preparing a glass of ice water. You could choose to put the ice in the cup before you get the water, or you might fill the cup with water before getting the ice. The quicker sequence depends on where you are in relation to the refrigerator and the faucet. Out-of-order execution pipelines operate similarly.
The problem is that out-of-order execution requires extra die space in order to rearrange all those operations, which means that you're using more transistors and increasing energy consumption. That's one reason why Intel's small, power-efficient Atom architecture employs in-order execution. The benefit, however, is improved performance, as fewer CPU cycles are wasted. The fact that Apple moved to out-of-order execution is indicative of its emphasis on augmenting the iPad 2's performance.
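To make the in-order versus out-of-order distinction concrete, here is a deliberately simplified toy model. It is not how the Cortex-A8 or Cortex-A9 actually schedule work: it assumes a single-issue pipeline, made-up latencies (four cycles for a load, one for arithmetic), and a hypothetical five-instruction program in which each instruction may only depend on earlier ones.

```python
from dataclasses import dataclass, field

@dataclass
class Instr:
    name: str
    latency: int                               # cycles until the result is available
    deps: list = field(default_factory=list)   # indices of earlier instructions

def in_order_cycles(program):
    """Issue strictly in program order; a stalled instruction blocks everything behind it."""
    done = []
    next_issue = 0
    for instr in program:
        ready = max((done[d] for d in instr.deps), default=0)
        issue = max(next_issue, ready)
        done.append(issue + instr.latency)
        next_issue = issue + 1
    return max(done)

def out_of_order_cycles(program):
    """Each cycle, issue the oldest instruction whose inputs are already available."""
    done = [None] * len(program)
    pending = list(range(len(program)))
    cycle = 0
    while pending:
        for i in pending:
            if all(done[d] is not None and done[d] <= cycle for d in program[i].deps):
                done[i] = cycle + program[i].latency
                pending.remove(i)
                break                          # single-issue: at most one per cycle
        cycle += 1
    return max(done)

# Hypothetical program: two load/add pairs plus one independent multiply.
PROGRAM = [
    Instr("load a", 4),
    Instr("add b = a + 1", 1, [0]),
    Instr("load c", 4),
    Instr("add d = c + 1", 1, [2]),
    Instr("mul e = x * y", 1),
]

print("in-order:    ", in_order_cycles(PROGRAM), "cycles")
print("out-of-order:", out_of_order_cycles(PROGRAM), "cycles")
```

In this contrived case the in-order model needs 11 cycles and the out-of-order model needs 6, because the second load and the multiply are issued while the first load is still outstanding. Real pipelines add many complications (multiple issue, register renaming, caches), so treat the numbers as illustrative only.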
According to analysis done by Chipworks, Apple also couples its dual-core ARM Cortex-A9 with 512 MB of LP-DDR2 (low-power DDR2). The original iPad only used 256 MB of LP-DDR. So, not only do we have twice as much memory, we have it delivered through a more modern memory technology (LP-DDR2 versus LP-DDR).
Geekbench is a synthetic benchmark similar to SiSoftware's Sandra, and it's one of the few benchmarks available for iOS. The best part about Geekbench, however, is that it's offered on multiple platforms. That means we can use it to make apples-to-apples comparisons against low-power x86-based devices like netbooks.
|Geekbench v2 Score in Points (Higher Is Better)||Apple iPad||Apple iPad 2||Dell Mini 1012 (Atom N450)|
Single-threaded floating point and integer performance is much stronger on the iPad 2 than on its predecessor. On average, performance nearly doubles.
The Cortex-A9 demonstrates a large lead in single-threaded scenarios due to its updated execution pipeline. However, threaded floating point performance sees an even larger boost, as the architecture's advantages are multiplied by the increased parallelism enabled by a second core. I should point out, though, that this doesn’t necessarily translate into better real-world performance. Most apps tend to rely more heavily on integer performance. That's the case whether you're talking about iTunes on the desktop or on the iPad.
From an architectural standpoint, we've come a long way since the original iPad debuted. But tablets fall well short of netbook-class performance. Intel's old Atom N450 still manages to outclass even Apple's latest hardware.
|Geekbench v2 (detailed results)||Apple iPad||Apple iPad 2||Dell Mini 1012|
|Blowfish (single-threaded scalar)||13.6 MB/s||13.2 MB/s||26.2 MB/s|
|Blowfish (multi-threaded scalar)||14.3 MB/s||26.0 MB/s||41.5 MB/s|
|Text Compress (single-threaded scalar)||1.25 MB/s||1.49 MB/s||2.49 MB/s|
|Text Compress (multi-threaded scalar)||1.20 MB/s||2.79 MB/s||3.60 MB/s|
|Text Decompress (single-threaded scalar)||1.13 MB/s||2.07 MB/s||3.22 MB/s|
|Text Decompress (multi-threaded scalar)||1.09 MB/s||3.24 MB/s||4.86 MB/s|
|Image Compress (single-threaded scalar)||3.26 Mpixels/s||3.77 Mpixels/s||6.00 Mpixels/s|
|Image Compress (multi-threaded scalar)||3.38 Mpixels/s||7.42 Mpixels/s||8.81 Mpixels/s|
|Image Decompress (single-threaded scalar)||6.12 Mpixels/s||6.66 Mpixels/s||9.98 Mpixels/s|
|Image Decompress (multi-threaded scalar)||6.04 Mpixels/s||12.8 Mpixels/s||15.0 Mpixels/s|
|Lua (single-threaded scalar)||173.5 Knodes/s||272.6 Knodes/s||340.4 Knodes/s|
|Lua (multi-threaded scalar)||172.9 Knodes/s||535.0 Knodes/s||488.4 Knodes/s|
|Floating Point Section|
|Mandelbrot (single-threaded scalar)||79.9 MFLOPS||278.8 MFLOPS||339.6 MFLOPS|
|Mandelbrot (multi-threaded scalar)||79.4 MFLOPS||549.0 MFLOPS||613.2 MFLOPS|
|Dot Product (single-threaded scalar)||247.5 MFLOPS||221.3 MFLOPS||204.9 MFLOPS|
|Dot Product (multi-threaded scalar)||246.2 MFLOPS||435.5 MFLOPS||361.5 MFLOPS|
|LU Decomposition (single-threaded scalar)||50.5 MFLOPS||207.3 MFLOPS||309.7 MFLOPS|
|LU Decomposition (multi-threaded scalar)||54.7 MFLOPS||403.4 MFLOPS||534.0 MFLOPS|
|Primality Test (single-threaded scalar)||71.4 MFLOPS||176.6 MFLOPS||126.7 MFLOPS|
|Primality Test (multi-threaded scalar)||69.2 MFLOPS||316.8 MFLOPS||194.5 MFLOPS|
|Sharpen Image (single-threaded scalar)||1.51 Mpixels/s||1.68 Mpixels/s||482.1 Kpixels/s|
|Sharpen Image (multi-threaded scalar)||1.52 Mpixels/s||3.32 Mpixels/s||858.9 Kpixels/s|
|Blur Image (single-threaded scalar)||762.2 Kpixels/s||664.4 Kpixels/s||535.6 Kpixels/s|
|Blur Image (multi-threaded scalar)||762.0 Kpixels/s||1.31 Mpixels/s||941.5 Kpixels/s|
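One way to read the table above is to divide each multi-threaded score by its single-threaded counterpart, which shows how much the second core actually contributes. The short sketch below does that for a handful of iPad 2 results copied from the table; the selection of tests and the helper name are arbitrary choices for illustration.

```python
# (single-threaded, multi-threaded) iPad 2 scores copied from the table above
IPAD2 = {
    "Blowfish (MB/s)": (13.2, 26.0),
    "Lua (Knodes/s)": (272.6, 535.0),
    "Mandelbrot (MFLOPS)": (278.8, 549.0),
    "Primality Test (MFLOPS)": (176.6, 316.8),
}

def scaling(single, multi):
    """Multi-threaded speedup over single-threaded; 2.0 would be perfect dual-core scaling."""
    return multi / single

for test, (single, multi) in IPAD2.items():
    print(f"{test:26s} {scaling(single, multi):.2f}x")
```

Most of these land close to 2x, while the Primality Test scales to roughly 1.8x. Running the same division on the original iPad's columns gives values near 1x, as you would expect from a single-core chip.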
The write sequential and stdlib write memory tests in Geekbench confirm better RAM performance, but it's difficult to separate how much of this is due to memory technology and how much is attributable to the processor. At the end of the day, it really doesn't matter; what does matter is that throughput goes up.
Intel's Atom N450 still manages to remain top dog, despite its 64-bit single-channel interface. The Atom only falls behind in the stdlib allocate and write tests. However, the N450's 1.97 GB/s score in read sequential is nearly 6x higher than what we see in the iPad 2.
|Geekbench v2 (detailed results)||Apple iPad||Apple iPad 2||Dell Mini 1012|
|Read Sequential (single-threaded scalar)||306 MB/s||342.2 MB/s||1.97 GB/s|
|Write Sequential (single-threaded scalar)||849.1 MB/s||1.02 GB/s||1.32 GB/s|
|Stdlib Allocate (single-threaded scalar)||1.99 Mallocs/s||1.83 Mallocs/s||1.25 Mallocs/s|
|Stdlib Write (single-threaded scalar)||1.28 GB/s||2.57 GB/s||1.34 GB/s|
|Stdlib Copy (single-threaded scalar)||830.4 MB/s||474.8 MB/s||1.03 GB/s|
|Stream Copy (single-threaded scalar)||465.5 MB/s||449.9 MB/s||1.18 GB/s|
|Stream Scale (single-threaded scalar)||320.5 MB/s||372.5 MB/s||1.08 GB/s|
|Stream Add (single-threaded scalar)||655.9 MB/s||606.3 MB/s||1.41 GB/s|
|Stream Triad (single-threaded scalar)||427.4 MB/s||426.6 MB/s||1.11 GB/s|
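Because the memory table mixes MB/s and GB/s, comparing rows by eye is error-prone. The sketch below normalizes the read sequential scores to MB/s before dividing them. It assumes Geekbench reports binary gigabytes (1 GB = 1024 MB), which the article does not state; the helper and constant names are illustrative.

```python
UNITS_IN_MB = {"MB/s": 1, "GB/s": 1024}  # assumption: binary gigabytes

def to_mb_per_s(text):
    """Convert a score such as '1.97 GB/s' to a float in MB/s."""
    value, unit = text.split()
    return float(value) * UNITS_IN_MB[unit]

# Read sequential scores copied from the table above
ipad = to_mb_per_s("306 MB/s")
ipad2 = to_mb_per_s("342.2 MB/s")
atom = to_mb_per_s("1.97 GB/s")

print(f"Atom N450 vs. iPad 2: {atom / ipad2:.1f}x")
print(f"iPad 2 vs. iPad:      {ipad2 / ipad:.2f}x")
```

Under that assumption the Atom's lead works out to roughly 5.9x; with decimal gigabytes it would be closer to 5.8x. Either way, it is far larger than the roughly 12% generational gain between the two iPads in this test.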