Nitty Gritty: CPU Core Performance, Per Clock

Snapdragon S4 Pro: Krait And Adreno 320, Benchmarked

Performance Per Clock Cycle

Up until now, we've compared the performance of different SoCs in different devices. But Qualcomm's APQ8064 spec sheet alone tells us that the Krait CPU cores can be made to run anywhere from 1.5 to 1.7 GHz, and we've seen Tegra 3 running from 1.2 to 1.6 GHz.

So, the conclusions we draw about the devices in our lab can't automatically be applied to other tablets or smartphones, particularly if their SoCs operate at higher or lower frequencies. That's precisely why Sandra's Core Performance Per Clock index is valuable: it lets us drill down one more level, from performance per core to performance per core at a constant clock rate.
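As a rough illustration of what a per-clock index does, the sketch below divides an aggregate score by core count and clock rate. The function name, the formula, and the input scores are all assumptions made for illustration; Sandra's exact calculation isn't spelled out here.

```python
def per_core_per_mhz(aggregate_score: float, cores: int, clock_mhz: float) -> float:
    """Normalize an aggregate benchmark score to a single core running at 1 MHz."""
    if cores <= 0 or clock_mhz <= 0:
        raise ValueError("cores and clock_mhz must be positive")
    return aggregate_score / (cores * clock_mhz)


# Hypothetical scores, NOT measurements from this article: two chips with
# different core counts and clocks that turn out equally efficient per cycle.
chip_a = per_core_per_mhz(aggregate_score=400.0, cores=2, clock_mhz=1000.0)
chip_b = per_core_per_mhz(aggregate_score=1200.0, cores=4, clock_mhz=1500.0)

print(f"chip A: {chip_a:.2f} per core per MHz")
print(f"chip B: {chip_b:.2f} per core per MHz")
```

Although the second hypothetical chip posts three times the aggregate score, both land on the same per-core, per-MHz figure. That is the kind of architectural comparison the index is meant to expose.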

Core Performance At A Given Clock Rate

| SoC | OMAP 4430 | Tegra 3 (T30L) | S3 (APQ8060) | S4 Plus (MSM8960) | S4 Pro (APQ8064) |
|---|---|---|---|---|---|
| CPU | Two Cortex-A9 Cores @ 1 GHz | Four Cortex-A9 Cores @ 1.3 GHz | Two Scorpion Cores @ 1.2 GHz | Two Krait Cores @ 1.5 GHz | Four Krait Cores @ 1.5 GHz |
| Native Arithmetic (MOPS/MHz) | 0.23 | 0.21 | 0.15 | 0.20 | 0.20 |
| Native Multi-media (kPix/s/MHz) | 1.15 | 1.14 | 1.37 | 1.69 | 1.60 |
| Java Arithmetic (MOPS/MHz) | 0.045 | 0.043 | 0.035 | 0.057 | 0.051 |
| Memory (MB/s/MHz) | 0.301 | 0.19 | 0.53 | 1.10 | 0.75 |


Qualcomm's Krait processor architecture certainly does well, but in native arithmetic it relies largely on its 1.5 GHz clock rate (at least in our mobile development platform) to exert its advantage over the OMAP 4430. Per cycle, TI's SoC actually comes out ahead in that test: 0.23 versus 0.20 MOPS/MHz.

Of course, that's not to detract from what Qualcomm is achieving with its APQ8064. The company designed this SoC to run at 1.5 GHz or higher. TI's chip operates between 1 and 1.2 GHz. So, even if it does achieve slightly better arithmetic performance per cycle, its specific implementation simply cannot catch the more modern Krait-based design.
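To put numbers on that argument, here is a minimal sketch that multiplies the Native Arithmetic per-clock figures from the table above by each chip's clock rate. It assumes throughput scales linearly with frequency, which real silicon only approximates; the function and variable names are made up for this example.

```python
# Per-core Native Arithmetic index (MOPS/MHz), copied from the first table.
NATIVE_ARITH_PER_MHZ = {
    "OMAP 4430": 0.23,
    "S4 Pro (APQ8064)": 0.20,
}


def projected_mops(per_mhz: float, clock_mhz: float) -> float:
    """Estimate per-core MOPS at a clock rate, ASSUMING linear frequency scaling."""
    return per_mhz * clock_mhz


omap_1000 = projected_mops(NATIVE_ARITH_PER_MHZ["OMAP 4430"], 1000)
omap_1200 = projected_mops(NATIVE_ARITH_PER_MHZ["OMAP 4430"], 1200)
krait_1500 = projected_mops(NATIVE_ARITH_PER_MHZ["S4 Pro (APQ8064)"], 1500)

# Clock the OMAP 4430 would need to match Krait at 1.5 GHz under this model.
break_even_mhz = krait_1500 / NATIVE_ARITH_PER_MHZ["OMAP 4430"]

print(f"OMAP 4430 @ 1.0 GHz: ~{omap_1000:.0f} MOPS per core")
print(f"OMAP 4430 @ 1.2 GHz: ~{omap_1200:.0f} MOPS per core")
print(f"S4 Pro    @ 1.5 GHz: ~{krait_1500:.0f} MOPS per core")
print(f"OMAP 4430 break-even clock: ~{break_even_mhz:.0f} MHz")
```

Under this simple model, the OMAP 4430 reaches roughly 230 to 276 MOPS per core across its 1 to 1.2 GHz range, against roughly 300 for a Krait core at 1.5 GHz. TI's chip would need a clock of about 1.3 GHz, above its stated range, to pull even.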

Core Performance At A Given Clock Rate: Arithmetic

| SoC | OMAP 4430 | Tegra 3 (T30L) | S3 (APQ8060) | S4 Plus (MSM8960) | S4 Pro (APQ8064) |
|---|---|---|---|---|---|
| CPU | Two Cortex-A9 Cores @ 1 GHz | Four Cortex-A9 Cores @ 1.3 GHz | Two Scorpion Cores @ 1.2 GHz | Two Krait Cores @ 1.5 GHz | Four Krait Cores @ 1.5 GHz |
| Dhrystone (MIPS/MHz) | 2.34 | 2.21 | 1.92 | 2.55 | 2.64 |
| Whetstone Double (FLOPS/MHz) | 0.023 | 0.021 | 0.012 | 0.15 | 0.015 |
| Whetstone Float (FLOPS/MHz) | 0.031 | 0.029 | 0.016 | 0.16 | 0.022 |
| Whetstone Float/Double (FLOPS/MHz) | 0.026 | 0.025 | 0.011 | 0.15 | 0.018 |


Breaking out the Arithmetic sub-tests shows where the OMAP 4430's advantage in the first table comes from. Although Qualcomm's APQ8064 achieves superior integer performance per cycle (2.64 versus 2.34 in Dhrystone), its showing in the floating-point-based Whetstone metrics is consistently worse than TI's.

Again, though, these results are completely synthetic. The OMAP 4430 and APQ8064 will never be made to compete at the same clock rate. We're simply interested in where each architecture derives its strengths.
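The size of that split is easier to see as a ratio. The sketch below divides the APQ8064's per-clock scores by the OMAP 4430's, using only values from the Arithmetic table; the helper name is invented for this example.

```python
# Per-clock scores for the two SoCs being compared, copied from the Arithmetic table.
OMAP_4430 = {"Dhrystone": 2.34, "Whetstone Double": 0.023, "Whetstone Float": 0.031}
S4_PRO = {"Dhrystone": 2.64, "Whetstone Double": 0.015, "Whetstone Float": 0.022}


def relative_per_clock(candidate: dict, baseline: dict) -> dict:
    """Return candidate/baseline for every test present in the baseline."""
    return {test: candidate[test] / baseline[test] for test in baseline}


ratios = relative_per_clock(S4_PRO, OMAP_4430)
for test, ratio in ratios.items():
    print(f"{test}: S4 Pro delivers {ratio:.2f}x the OMAP 4430's per-clock score")
```

The ratios work out to about 1.13x for Dhrystone, 0.65x for Whetstone Double, and 0.71x for Whetstone Float. In other words, Krait holds a modest per-cycle lead in integer work and a larger per-cycle deficit in floating-point.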

Core Performance At A Given Clock Rate: Multi-media

| SoC | OMAP 4430 | Tegra 3 (T30L) | S3 (APQ8060) | S4 Plus (MSM8960) | S4 Pro (APQ8064) |
|---|---|---|---|---|---|
| CPU | Two Cortex-A9 Cores @ 1 GHz | Four Cortex-A9 Cores @ 1.3 GHz | Two Scorpion Cores @ 1.2 GHz | Two Krait Cores @ 1.5 GHz | Four Krait Cores @ 1.5 GHz |
| Multi-media Integer [NEON] (kPix/s/MHz) | 1.15 | 1.14 | 1.23 | 1.34 | 1.38 |
| Multi-media Float [NEON] (kPix/s/MHz) | 1.16 | 1.09 | 1.53 | 2.13 | 1.81 |
| Multi-media Double [FPU] (kPix/s/MHz) | 0.56 | 0.54 | 0.40 | 0.33 | 0.42 |
| Multi-media Float/Double (kPix/s/MHz) | 0.80 | 0.77 | 0.77 | 0.83 | 0.87 |


When we perform the same exercise with Sandra's Multi-media module, we see where the Krait architecture earns its advantage over Scorpion, first, and the OMAP's Cortex-A9 cores, second.

Particularly when it's able to exploit ARM's NEON SIMD instructions, which operate on 64- and 128-bit vectors, Krait dominates handily. Only when Sandra drops back to measuring double-precision performance on the Vector Floating Point (VFP) unit does Qualcomm's latest cede its lead. Not that you should be worried; NEON is far more powerful, making it a more likely instruction set to see in real-world apps.

Multi-Core Efficiency


Many years ago, Intel and AMD stopped emphasizing fast single-core desktop processors and started designing CPUs with multiple cores per package. Software developers had to learn how to exploit those duplicated resources in order to extract some benefit from them.

The same thing is happening in the mobile space as multi-core SoCs facilitate parallelism in power-optimized architectures. As on the desktop, though, the performance of a dual- or quad-core chip doesn't scale linearly. Synthetic measurements make it possible for us to get a best-case scaling number, but the real world is far less exact.
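One common way to reason about sub-linear scaling is Amdahl's law, which caps speedup by the portion of a workload that must run serially. The article doesn't name a model, so treat the sketch below as a generic illustration; the 90% parallel fraction is a hypothetical input, not a measurement.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup over one core when only part of the work can be parallelized."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def scaling_efficiency(parallel_fraction: float, cores: int) -> float:
    """Fraction of perfect linear scaling actually achieved (1.0 means ideal)."""
    return amdahl_speedup(parallel_fraction, cores) / cores


PARALLEL_FRACTION = 0.90  # hypothetical workload: 90% of the work parallelizes

for cores in (1, 2, 4):
    speedup = amdahl_speedup(PARALLEL_FRACTION, cores)
    efficiency = scaling_efficiency(PARALLEL_FRACTION, cores)
    print(f"{cores} core(s): {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

With that hypothetical workload, two cores yield about 1.82x and four cores about 3.08x, so efficiency falls from roughly 91% to 77% as cores are added. The model ignores communication costs between cores, so real scaling is typically worse.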

This is partly a result of how cores work together. Threaded apps involve data sharing between cores. If this isn't done efficiently, performance drops. Lots of bandwidth and low latency are both important. TI's OMAP 4430 is able to move the most data per second between its cores. Nvidia's Tegra 3 follows closely behind on bandwidth and stands out instead for its minimal latency.
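Why both figures matter can be shown with a simple first-order model: transfer time equals a fixed latency plus payload size divided by bandwidth. The two interconnect profiles below are hypothetical placeholders chosen to show the trade-off; they are not measurements of the OMAP 4430 or Tegra 3.

```python
def transfer_time_ns(payload_bytes: int, latency_ns: float, bandwidth_gb_per_s: float) -> float:
    """First-order cost of moving data between cores: latency + size / bandwidth.

    1 GB/s (10**9 bytes per second) equals 1 byte per nanosecond,
    so dividing bytes by GB/s gives nanoseconds directly.
    """
    return latency_ns + payload_bytes / bandwidth_gb_per_s


# HYPOTHETICAL profiles for illustration only.
HIGH_BANDWIDTH = {"latency_ns": 100.0, "bandwidth_gb_per_s": 4.0}
LOW_LATENCY = {"latency_ns": 50.0, "bandwidth_gb_per_s": 3.0}

small = 64            # a single 64-byte cache line
large = 1024 * 1024   # a 1 MiB buffer

small_high_bw = transfer_time_ns(small, **HIGH_BANDWIDTH)
small_low_lat = transfer_time_ns(small, **LOW_LATENCY)
large_high_bw = transfer_time_ns(large, **HIGH_BANDWIDTH)
large_low_lat = transfer_time_ns(large, **LOW_LATENCY)

print(f"64 B : high-bandwidth {small_high_bw:.0f} ns, low-latency {small_low_lat:.0f} ns")
print(f"1 MiB: high-bandwidth {large_high_bw:.0f} ns, low-latency {large_low_lat:.0f} ns")
```

In this model the low-latency profile wins for small, frequent exchanges, while the high-bandwidth profile wins once payloads grow large. Which matters more depends on how a threaded app shares its data.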
