Nitty Gritty: CPU Core Performance, Per Clock
Performance Per Clock Cycle
Up until now, we've compared the performance of different SoCs in different devices. But we know even just from Qualcomm's APQ8064 spec sheet that the Krait CPU cores can be made to run from 1.5 to 1.7 GHz. And we've seen Tegra 3 running from 1.2 to 1.6 GHz.
So, the conclusions we draw about the devices in our lab can't automatically be applied to other tablets or smartphones, particularly if their SoCs operate at higher or lower frequencies. That's precisely why Sandra's Core Performance Per Clock index is valuable: it lets us drill down one more level, from performance per core to performance per core at a constant clock rate.
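To make the normalization concrete, here is a minimal sketch of how a per-clock index could be derived from a raw benchmark score. The function name, the division by core count, and the sample score are assumptions for illustration only; this is not Sandra's documented formula.

```python
def per_clock_index(raw_score, clock_mhz, cores=1):
    """Return score per core per MHz.

    Hypothetical formula: divides the raw score evenly across cores
    and clock cycles. Sandra's real calculation may differ.
    """
    if clock_mhz <= 0 or cores <= 0:
        raise ValueError("clock_mhz and cores must be positive")
    return raw_score / (cores * clock_mhz)


# Made-up raw result: 1,200 MOPS from four cores running at 1,500 MHz.
index = per_clock_index(1200.0, 1500.0, cores=4)
print(f"{index:.2f} MOPS/MHz per core")
```

Dividing out both the core count and the frequency is what lets a dual-core 1 GHz part be set beside a quad-core 1.5 GHz part on equal footing.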
Core Performance At A Given Clock Rate

SoC | OMAP 4430 | Tegra 3 (T30L) | S3 (APQ8060) | S4 Plus (MSM8960) | S4 Pro (APQ8064) |
---|---|---|---|---|---|
CPU | Two Cortex-A9 Cores @ 1 GHz | Four Cortex-A9 Cores @ 1.3 GHz | Two Scorpion Cores @ 1.2 GHz | Two Krait Cores @ 1.5 GHz | Four Krait Cores @ 1.5 GHz |
Native Arithmetic (MOPS/MHz) | 0.23 | 0.21 | 0.15 | 0.20 | 0.20 |
Native Multi-media (kPix/s/MHz) | 1.15 | 1.14 | 1.37 | 1.69 | 1.60 |
Java Arithmetic (MOPS/MHz) | 0.045 | 0.043 | 0.035 | 0.057 | 0.051 |
Memory (MB/s/MHz) | 0.301 | 0.19 | 0.53 | 1.10 | 0.75 |
Qualcomm's Krait processor architecture certainly does well, but it relies largely on its 1.5 GHz clock rate (at least in our mobile development platform) to exert its advantage over the OMAP 4430. Per cycle, TI's SoC actually has an advantage in native arithmetic: 0.23 MOPS/MHz versus Krait's 0.20.
Of course, that's not to detract from what Qualcomm is achieving with its APQ8064. The company designed this SoC to run at 1.5 GHz or higher. TI's chip operates between 1 and 1.2 GHz. So, even if it does achieve slightly better arithmetic performance per cycle, its specific implementation simply cannot catch the more modern Krait-based design.
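A quick back-of-the-envelope projection shows why the clock rate matters more than the per-cycle edge. The sketch below multiplies the Native Arithmetic figures from the table by each chip's operating frequency. It assumes throughput scales linearly with clock rate, which ignores memory and thermal limits, and the function name is ours.

```python
def projected_score(per_mhz, clock_mhz):
    """Project per-core throughput, assuming linear scaling with clock rate."""
    return per_mhz * clock_mhz


# Native Arithmetic (MOPS/MHz) values taken from the table above.
omap_at_1000 = projected_score(0.23, 1000)   # OMAP 4430 as tested
omap_at_1200 = projected_score(0.23, 1200)   # OMAP 4430 at its 1.2 GHz ceiling
krait_at_1500 = projected_score(0.20, 1500)  # APQ8064 as tested

print(f"OMAP 4430 @ 1.0 GHz: {omap_at_1000:.0f} MOPS")
print(f"OMAP 4430 @ 1.2 GHz: {omap_at_1200:.0f} MOPS")
print(f"APQ8064   @ 1.5 GHz: {krait_at_1500:.0f} MOPS")
```

Under that assumption, the OMAP 4430 lands at roughly 230 to 276 MOPS per core, while Krait reaches about 300 at 1.5 GHz, so the higher frequency more than offsets the lower per-cycle figure.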
Core Performance At A Given Clock Rate: Arithmetic

SoC | OMAP 4430 | Tegra 3 (T30L) | S3 (APQ8060) | S4 Plus (MSM8960) | S4 Pro (APQ8064) |
---|---|---|---|---|---|
CPU | Two Cortex-A9 Cores @ 1 GHz | Four Cortex-A9 Cores @ 1.3 GHz | Two Scorpion Cores @ 1.2 GHz | Two Krait Cores @ 1.5 GHz | Four Krait Cores @ 1.5 GHz |
Dhrystone (MIPS/MHz) | 2.34 | 2.21 | 1.92 | 2.55 | 2.64 |
Whetstone Double (FLOPS/MHz) | 0.023 | 0.021 | 0.012 | 0.15 | 0.015 |
Whetstone Float (FLOPS/MHz) | 0.031 | 0.029 | 0.016 | 0.16 | 0.022 |
Whetstone Float/Double (FLOPS/MHz) | 0.026 | 0.025 | 0.011 | 0.15 | 0.018 |
Breaking out the Arithmetic sub-test, we can get inside the OMAP 4430's advantage, which was reflected in the first table. Although Qualcomm's APQ8064 achieves superior integer performance per cycle, its showing in the floating-point-based Whetstone metric is consistently worse than TI's.
Again, though, these results are completely synthetic. The OMAP 4430 and APQ8064 will never be made to compete at the same clock rate. We're simply interested in where each architecture derives its strengths.
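To put a number on that split between integer and floating-point behavior, the following sketch computes the APQ8064-to-OMAP 4430 ratio for each arithmetic sub-test, using the per-MHz values from the table above. The dictionary layout and variable names are our own.

```python
# Per-MHz results copied from the arithmetic table above.
omap_4430 = {
    "Dhrystone": 2.34,
    "Whetstone Double": 0.023,
    "Whetstone Float": 0.031,
}
apq8064 = {
    "Dhrystone": 2.64,
    "Whetstone Double": 0.015,
    "Whetstone Float": 0.022,
}

# A ratio above 1.0 means Krait does more work per cycle than the Cortex-A9.
ratios = {name: apq8064[name] / omap_4430[name] for name in omap_4430}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The integer ratio comes out above 1.0 and both floating-point ratios fall below it, matching the pattern described in the text.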
Core Performance At A Given Clock Rate: Multi-media

SoC | OMAP 4430 | Tegra 3 (T30L) | S3 (APQ8060) | S4 Plus (MSM8960) | S4 Pro (APQ8064) |
---|---|---|---|---|---|
CPU | Two Cortex-A9 Cores @ 1 GHz | Four Cortex-A9 Cores @ 1.3 GHz | Two Scorpion Cores @ 1.2 GHz | Two Krait Cores @ 1.5 GHz | Four Krait Cores @ 1.5 GHz |
Multi-media Integer [NEON] (kPix/s/MHz) | 1.15 | 1.14 | 1.23 | 1.34 | 1.38 |
Multi-media Float [NEON] (kPix/s/MHz) | 1.16 | 1.09 | 1.53 | 2.13 | 1.81 |
Multi-media Double [FPU] (kPix/s/MHz) | 0.56 | 0.54 | 0.40 | 0.33 | 0.42 |
Multi-media Float/Double (kPix/s/MHz) | 0.80 | 0.77 | 0.77 | 0.83 | 0.87 |
When we perform the same exercise with Sandra's Multi-media module, we see where the Krait architecture earns its advantage, first over Scorpion and then over the OMAP's Cortex-A9 cores.
Particularly when it's able to exploit ARM's NEON SIMD instruction set, which operates on 64- and 128-bit vectors, Krait dominates handily. Only when Sandra falls back to the Vector Floating Point (VFP) unit for double-precision math does Qualcomm's latest cede its lead. Not that you should be worried; NEON is far more powerful, making it a more likely instruction set to see in real-world apps.
Multi-Core Efficiency
Many years ago, Intel and AMD stopped emphasizing fast single-core desktop processors and started designing CPUs with multiple cores per package. Software developers had to learn how to exploit those duplicated resources in order to extract some benefit from them.
The same thing is happening in the mobile space as multi-core SoCs facilitate parallelism in power-optimized architectures. As on the desktop, though, the performance of a dual- or quad-core chip doesn't scale linearly with core count. Synthetic measurements make it possible for us to get a best-case scaling number, but the real world is far less exact.
This is partly a result of how cores work together. Threaded apps involve data sharing between cores. If this isn't done efficiently, performance drops. Lots of bandwidth and low latency are important. TI's OMAP 4430 is able to move the most data per second between its cores, while Nvidia's Tegra 3 follows closely behind, instead standing out for its minimal latency.
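Scaling that falls short of linear is commonly expressed as an efficiency figure: the measured speedup divided by the number of cores. The sketch below illustrates the idea with made-up scores. The function name and sample numbers are hypothetical and are not drawn from Sandra's results.

```python
def scaling_efficiency(single_core_score, multi_core_score, cores):
    """Return multi-core efficiency, where 1.0 means perfect linear scaling."""
    if single_core_score <= 0 or cores <= 0:
        raise ValueError("single_core_score and cores must be positive")
    speedup = multi_core_score / single_core_score
    return speedup / cores


# Hypothetical scores: 100 on one core, 340 with all four cores active.
efficiency = scaling_efficiency(100.0, 340.0, cores=4)
print(f"Speedup: {340.0 / 100.0:.1f}x, efficiency: {efficiency:.0%}")
```

In this invented case, four cores deliver a 3.4x speedup, or 85% efficiency; the missing 15% is the kind of overhead that inter-core bandwidth and latency help determine.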