Snapdragon S4 Pro: Krait And Adreno 320, Benchmarked
1. Qualcomm's Fourth-Generation Snapdragon Family Gets A Flagship

Since the beginning of computing, true enthusiasts have wanted to know more about the underlying hardware in their machines. From the old Tandy 1000 with Intel's 8088 processor to modern tablets with ARM-based architectures, components under the hood matter, even if we only use them to enjoy the device itself.

Back in the '80s, standardization made it possible to start swapping parts in and out of an IBM-compatible PC. Although most subsystems were soldered in, we still remember adding serial mouse cards, modems, and "high-end" 20 MB hard drives.

Over time, we were delighted to get access to upgradeable processors (even on-board L2 cache modules), standardized memory formats, faster graphics interfaces, and a broad range of peripherals. It was truly a golden age for power users who knew exactly what they wanted to spend and where to spend it for the best experience.

What's The Future of Mobile Gaming?

Check out our interview with Four Android And iOS Game Developers for an insider's look at how the software world will evolve, and why hardware plays such an important role.

That model lives on today in enthusiast desktops. However, as we push forward into an era of mobility, compact tablet and smartphone form factors prevent the flexibility typical of a gaming machine at home. When you buy a mobile device, your choices narrow to the capacity of non-volatile memory available for storing music, movies, and pictures.

We've moved backwards in a sense. In the interest of being portable, we're willing to give up the ability to play games at the same quality we're used to. And the smaller you go, the worse performance gets. Because these diminutive platforms are all highly integrated, there's not a darned thing an enthusiast can typically do to make his or her hardware run faster.

Here's the thing, though. Developers aren't just writing software for iOS or Android anymore. They're actually optimizing and targeting specific platforms now. Nvidia has, perhaps, seen the most success engaging with the software community and getting games enabled on its Tegra 3 SoC that simply won't run as well on other Android-based gadgets. That means you have to do your homework now more than ever. 

Despite the truly amazing things we've seen ISVs do for Tegra 3, Nvidia's share of the smartphone market was comparatively small in 2011 (it ranked sixth, according to Strategy Analytics). The top player was Qualcomm with its family of Snapdragon SoCs. Naturally, any move the company makes is going to have a profound impact on the mobile market moving forward.

S4 Pro MDP

Qualcomm recently invited us to a benchmark workshop where we were offered an opportunity to go hands-on with its S4 Pro, available in two- and four-core configurations.

Shortly thereafter, we acquired an S4 Pro Mobile Development Platform of our own, allowing us to perform controlled comparisons against other tablets in our lab based on competing architectures. This one's data-heavy, so buckle up!

2. Qualcomm's Snapdragon S4 Line-Up: Krait CPUs And Adreno Graphics

Qualcomm's product portfolio is both deep and wide. Its mobile SoCs in the Snapdragon family stretch back to 2008, when the S1 platform was first made available. Now, in 2012, we're looking at the S4 series, indicating the company's fourth generation. 

You'll find four product families under Qualcomm's S4 umbrella, each consisting of individual chips organized in such a way as to address specific workloads.

S4 Prime, for example, is being positioned as a solution for smart TVs and set-top boxes. The MPQ8064 SoC is the only component under the S4 Prime moniker, boasting a quad-core Krait architecture with Adreno 320 graphics.

The focus of today's story, S4 Pro, includes a couple of different components: MSM8960T and APQ8064, the former featuring a dual-core Krait-based processor and the latter equipped with four cores. Both are 28 nm components with the same high-end Adreno 320 graphics engine. Whereas the MSM8960T part features an integrated cellular radio, the APQ8064 does not.

S4 Plus and Play, intended for smartphones and tablets, are composed of an additional 14 SoCs with and without built-in modems. 

In Qualcomm's hierarchy, S4 Pro is the highest-end platform you'll see used in mobile devices, so it makes sense that the company built its mobile development platforms around the APQ8064. That's what we have in the lab today.

Although it takes the second spot in the S4 line-up, the Pro segment is certainly still a performance-oriented part. As mentioned, the APQ8064 features a quad-core Krait-based processor operating between 1.5 and 1.7 GHz. Qualcomm couldn't get us access to a block diagram of the APQ8064, so imagine the shot of the MSM8960 above with a much smaller modem subsystem (no cellular radio, just Wi-Fi and Bluetooth), and an additional pair of cores.

Each core has 16 KB of L1 data and 16 KB of L1 instruction cache, and each pair of cores shares a 1 MB L2 cache. Qualcomm's Krait-based cores succeed the Scorpion-based design that we first covered in Third-Generation Snapdragon: The Dual-Core Scorpion. In the table below, we drill down into more granular specifics of the Krait and Scorpion architectures, comparing them to ARM's Cortex-A9 and Cortex-A15 core designs.

Architecture Comparison

| | Cortex-A9 | Cortex-A15 | Scorpion | Krait |
| --- | --- | --- | --- | --- |
| Pipeline Depth | Eight-Stage | 15/17-24-Stage (Integer/FPU) | 10-Stage | 11-Stage |
| Out-of-Order Execution | Yes | Yes | Partial | Yes |
| Fab Node | 45/30/32 nm | 32/28 nm | 65/45 nm | 28 nm |
| Core Configurations | Single, Dual, Quad | Dual, Quad | Single, Dual | Dual, Quad |
| Cache | L1: 32 KB + 32 KB; L2: 1 MB | L1: 32 KB + 32 KB; L2: 4 MB max | L1: 32 KB + 32 KB; L2: 256 KB (per core) | L1: 16 KB + 16 KB; L2: 1 MB (per dual-core) |
| DMIPS/MHz | 2.5 | 3.5 | 2.5 | 3.3 |
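
To put the DMIPS/MHz row in perspective, it can be multiplied by a clock rate to get a rough, theoretical per-core Dhrystone figure. The sketch below is ours, not Qualcomm's or ARM's: it assumes the table's DMIPS/MHz values and the clock rates of the devices tested in this piece (1.3 GHz Cortex-A9 in Tegra 3, 1.2 GHz Scorpion, 1.5 GHz Krait). The function name `estimated_dmips` is purely illustrative.

```python
# DMIPS/MHz figures taken from the architecture comparison table.
DMIPS_PER_MHZ = {
    "Cortex-A9": 2.5,
    "Cortex-A15": 3.5,
    "Scorpion": 2.5,
    "Krait": 3.3,
}


def estimated_dmips(core: str, clock_mhz: float) -> float:
    """Return a theoretical per-core Dhrystone MIPS estimate at clock_mhz."""
    return DMIPS_PER_MHZ[core] * clock_mhz


if __name__ == "__main__":
    # Clock rates of the devices in our lab (an assumption for this sketch).
    for core, mhz in [("Cortex-A9", 1300), ("Scorpion", 1200), ("Krait", 1500)]:
        print(f"{core} @ {mhz} MHz: ~{estimated_dmips(core, mhz):.0f} DMIPS per core")
```

These are paper numbers only; the measured results later in the story are what count.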


Unlike many of its competitors, Qualcomm employs custom processor designs based on ARM IP, investing considerable time and money developing its own cores. For example, its Scorpion design employs the same ARMv7-A architecture used by the Cortex-A8 and -A9 cores. However, Qualcomm's specific implementation breaks the instruction pipeline down into a different number of stages, utilizes non-speculative out-of-order execution, and offers 128-bit SIMD functionality. Featuring a lot of in-house work, Scorpion is easily differentiated from the standard Cortex-A9, which helps explain certain benchmark victories.

Krait improves performance tangibly through increased complexity (due in no small part, we imagine, to a smaller 28 nm process node). Each core can now decode up to three instructions per clock cycle (up from two), similar to the Cortex-A15 design. Its integer pipeline is now 11 stages long, one stage longer than Scorpion's but not as long as the -A15's 15-stage implementation. In practice, the longer pipeline should translate into a clock rate advantage.

Qualcomm also enables Krait with the ability to run each core's clock rate asymmetrically. This helps facilitate power savings in applications where all of the SoC's compute resources aren't needed. Useful though it may be, this isn't a new feature. The Scorpion core featured it as well, and Nvidia's Tegra 3 leans on the same principle for its fifth companion core.

3. Performance From Scorpion To Krait: What A Difference One Generation Makes

GeekBench v2

Beyond the speeds and feeds, how does the performance of Qualcomm's Scorpion and Krait cores differ? GeekBench can help us out with a general assessment.

GeekBench Scores

| SoC | Integer | Floating-Point | Memory |
| --- | --- | --- | --- |
| Nvidia Tegra 3 (T30L) (Four Cortex-A9 Cores @ 1.3 GHz) | 1298 | 2288 | 1222 |
| TI OMAP 4430 (Two Cortex-A9 Cores @ 1 GHz) | 750 | 1298 | 853 |
| Apple A5/A5X (Two Cortex-A9 Cores @ 1 GHz) | 691 | 921 | 830 |
| Qualcomm S3 (APQ8060) (Two Scorpion Cores @ 1.2 GHz) | 594 | 708 | 946 |
| Qualcomm S4 Plus (MSM8960) (Two Krait Cores @ 1.5 GHz) | 964 | 2251 | 1666 |
| Qualcomm S4 Pro (APQ8064) (Four Krait Cores @ 1.5 GHz) | 1400 | 3292 | 1276 |


According to our numbers, Krait nearly triples the performance of its predecessor, with the biggest gain seen in floating-point performance. Also interesting is the comparison between Krait and Nvidia's Tegra 3, which drives tablets like the Nexus 7 and Transformer Pad.
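
For readers who want the generational gain per sub-score, the GeekBench table can be reduced to simple ratios. This is a minimal sketch using only the scores listed above; the `speedup` helper and the dictionary layout are our own illustration, not part of GeekBench.

```python
# GeekBench v2 sub-scores copied from the table above.
GEEKBENCH = {
    "S3 (APQ8060)": {"integer": 594, "floating_point": 708, "memory": 946},
    "S4 Plus (MSM8960)": {"integer": 964, "floating_point": 2251, "memory": 1666},
    "S4 Pro (APQ8064)": {"integer": 1400, "floating_point": 3292, "memory": 1276},
}


def speedup(new: dict, old: dict) -> dict:
    """Return new/old for every sub-score the two SoCs share."""
    return {name: new[name] / old[name] for name in old}


if __name__ == "__main__":
    scorpion = GEEKBENCH["S3 (APQ8060)"]
    for soc in ("S4 Plus (MSM8960)", "S4 Pro (APQ8064)"):
        ratios = speedup(GEEKBENCH[soc], scorpion)
        summary = ", ".join(f"{name} {ratio:.2f}x" for name, ratio in ratios.items())
        print(f"{soc} vs. S3: {summary}")
```

Run against these figures, floating-point shows the largest ratio for both Krait-based parts, while the integer and memory gains are more modest.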

Strong floating-point performance is a notable boon for game developers, and we hope that Qualcomm's strength in this discipline serves to further the work being done in mobile gaming. Google's own Android documentation recommends judicious use of floating-point math, since it's about 2x slower than integer math on Android-based devices. And yet, we see TI's OMAP 4430 outmaneuver Apple's A5/A5X in GeekBench's floating-point metric, despite the fact that they both employ dual Cortex-A9 cores at 1 GHz. So, what does that say about performance under iOS?

Although software developers are still tied to programming for multiple hardware platforms, some faster and some slower, it's entirely possible that, in some situations, Krait-based devices will offer the best performance currently available.

SiSoft Sandra, Android Edition

Sandra is one of those diagnostic tools that lets us dig a little deeper on the desktop, isolating specific performance characteristics in a granular way. SiSoftware eventually plans to release an Android-specific version of the software, but the company granted us exclusive access to an early beta copy for our story today.

SiSoftware Sandra Aggregate Performance

| | OMAP 4430 | Tegra 3 (T30L) | S3 (APQ8060) | S4 Plus (MSM8960) | S4 Pro (APQ8064) |
| --- | --- | --- | --- | --- | --- |
| CPU | Two Cortex-A9 Cores @ 1 GHz | Four Cortex-A9 Cores @ 1.3 GHz | Two Scorpion Cores @ 1.2 GHz | Two Krait Cores @ 1.5 GHz | Four Krait Cores @ 1.5 GHz |
| Native Arithmetic (MOPS) | 463 | 1133 | 365 | 593 | 1194 |
| Native Multi-media (kPix/s) | 2301 | 5912 | 3297 | 5067 | 9642 |
| Java Arithmetic (MOPS) | 90 | 225 | 86 | 171 | 278 |
| Memory (MB/s) | 603 | 968 | 1265 | 3308 | 4104 |


Naturally, the quad-core architectures are at an inherent advantage in any threaded workload. So, we also run aggregate performance-per-core tests to zero in on the capabilities of each computational building block.

Again, the S4 Pro platform's Krait processor takes a commanding lead over the Cortex-A9- and Scorpion-based competition.

Aggregate Performance-Per-Core

| | OMAP 4430 | Tegra 3 (T30L) | S3 (APQ8060) | S4 Plus (MSM8960) | S4 Pro (APQ8064) |
| --- | --- | --- | --- | --- | --- |
| CPU | Two Cortex-A9 Cores @ 1 GHz | Four Cortex-A9 Cores @ 1.3 GHz | Two Scorpion Cores @ 1.2 GHz | Two Krait Cores @ 1.5 GHz | Four Krait Cores @ 1.5 GHz |
| Native Arithmetic (MOPS/Thread) | 231.5 | 283.2 | 182.5 | 296.5 | 298.5 |
| Native Multi-media (kPix/s/Thread) | 1150.5 | 1478.0 | 1648.5 | 2533.5 | 2410.5 |
| Java Arithmetic (MOPS/Thread) | 45.0 | 56.2 | 43.0 | 85.5 | 69.5 |
| Memory (MB/s/Thread) | 301.5 | 242.0 | 632.5 | 1654.0 | 1026.0 |

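
The per-thread figures line up with dividing each aggregate Sandra score by the SoC's core count. Here is a small sketch of that normalization, assuming one thread per core; the names `CORES`, `AGGREGATE`, and `per_core` are ours and only illustrate the arithmetic.

```python
# Core counts and aggregate Sandra scores copied from the tables above.
CORES = {
    "OMAP 4430": 2,
    "Tegra 3 (T30L)": 4,
    "S3 (APQ8060)": 2,
    "S4 Plus (MSM8960)": 2,
    "S4 Pro (APQ8064)": 4,
}

AGGREGATE = {
    "Native Arithmetic (MOPS)": [463, 1133, 365, 593, 1194],
    "Native Multi-media (kPix/s)": [2301, 5912, 3297, 5067, 9642],
    "Java Arithmetic (MOPS)": [90, 225, 86, 171, 278],
    "Memory (MB/s)": [603, 968, 1265, 3308, 4104],
}


def per_core(metric: str) -> dict:
    """Divide each SoC's aggregate score by its core count (one thread per core)."""
    return {soc: score / CORES[soc] for soc, score in zip(CORES, AGGREGATE[metric])}


if __name__ == "__main__":
    for metric in AGGREGATE:
        row = ", ".join(f"{soc}: {value:.1f}" for soc, value in per_core(metric).items())
        print(f"{metric} per thread -> {row}")
```

Normalizing this way removes the quad-core parts' head start and makes it easier to compare one Krait core against one Cortex-A9 or Scorpion core.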
4. Nitty Gritty: CPU Core Performance, Per Clock

Performance Per Clock Cycle

Up until now, we've compared the performance of different SoCs in different devices. But we know even just from Qualcomm's APQ8064 spec sheet that the Krait CPU cores can be made to run from 1.5 to 1.7 GHz. And we've seen Tegra 3 running from 1.2 to 1.6 GHz.

So, the conclusions we draw about the devices in our lab can't automatically be applied to other tablets or smartphones, particularly if their SoCs operate at higher or lower frequencies. That's precisely why Sandra's Core Performance Per Clock index is valuable: it lets us drill down one level more from performance-per-core to performance-per-core at a constant clock rate.

Core Performance At A Given Clock Rate

| | OMAP 4430 | Tegra 3 (T30L) | S3 (APQ8060) | S4 Plus (MSM8960) | S4 Pro (APQ8064) |
| --- | --- | --- | --- | --- | --- |
| CPU | Two Cortex-A9 Cores @ 1 GHz | Four Cortex-A9 Cores @ 1.3 GHz | Two Scorpion Cores @ 1.2 GHz | Two Krait Cores @ 1.5 GHz | Four Krait Cores @ 1.5 GHz |
| Native Arithmetic (MOPS/MHz) | 0.23 | 0.21 | 0.15 | 0.20 | 0.20 |
| Native Multi-media (kPix/s/MHz) | 1.15 | 1.14 | 1.37 | 1.69 | 1.60 |
| Java Arithmetic (MOPS/MHz) | 0.045 | 0.043 | 0.035 | 0.057 | 0.051 |
| Memory (MB/s/MHz) | 0.301 | 0.19 | 0.53 | 1.10 | 0.75 |
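
The idea behind a per-clock index can be approximated by dividing a per-thread score by the core's clock rate in MHz. The sketch below does that with the nominal clock rates listed in the CPU row, which is an assumption on our part. Dividing this way lands close to most of the table's entries, though not all (the S4 Pro's Java and memory rows differ), so treat it as an illustration of the concept and not a reproduction of how Sandra calculates its index.

```python
# Nominal clock rates (MHz) from the CPU row; actual measured clocks may differ.
CLOCK_MHZ = {
    "OMAP 4430": 1000,
    "Tegra 3 (T30L)": 1300,
    "S3 (APQ8060)": 1200,
    "S4 Plus (MSM8960)": 1500,
    "S4 Pro (APQ8064)": 1500,
}

# Native Multi-media per-thread scores (kPix/s/Thread) from the per-core table.
MULTIMEDIA_PER_THREAD = {
    "OMAP 4430": 1150.5,
    "Tegra 3 (T30L)": 1478.0,
    "S3 (APQ8060)": 1648.5,
    "S4 Plus (MSM8960)": 2533.5,
    "S4 Pro (APQ8064)": 2410.5,
}


def per_clock(per_thread_score: float, clock_mhz: float) -> float:
    """Normalize a per-thread score to a single MHz of clock rate."""
    return per_thread_score / clock_mhz


if __name__ == "__main__":
    for soc, score in MULTIMEDIA_PER_THREAD.items():
        print(f"{soc}: {per_clock(score, CLOCK_MHZ[soc]):.2f} kPix/s/MHz")
```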


Qualcomm's Krait processor architecture certainly does well, but it relies largely on its 1.5 GHz clock rate (at least in our mobile development platform) to exert its advantage over the OMAP 4430 in native arithmetic. Per cycle, TI's SoC actually holds a slight advantage in that metric (0.23 versus 0.20 MOPS/MHz).

Of course, that's not to detract from what Qualcomm is achieving with its APQ8064. The company designed this SoC to run at 1.5 GHz at least. TI's chip operates between 1 and 1.2 GHz. So, even if it does achieve slightly better arithmetic performance per cycle, its specific implementation simply cannot catch the more modern Krait-based design.

Core Performance At A Given Clock Rate: Arithmetic

| | OMAP 4430 | Tegra 3 (T30L) | S3 (APQ8060) | S4 Plus (MSM8960) | S4 Pro (APQ8064) |
| --- | --- | --- | --- | --- | --- |
| CPU | Two Cortex-A9 Cores @ 1 GHz | Four Cortex-A9 Cores @ 1.3 GHz | Two Scorpion Cores @ 1.2 GHz | Two Krait Cores @ 1.5 GHz | Four Krait Cores @ 1.5 GHz |
| Dhrystone (MIPS/MHz) | 2.34 | 2.21 | 1.92 | 2.55 | 2.64 |
| Whetstone Double (FLOPS/MHz) | 0.023 | 0.021 | 0.012 | 0.15 | 0.015 |
| Whetstone Float (FLOPS/MHz) | 0.031 | 0.029 | 0.016 | 0.16 | 0.022 |
| Whetstone Float/Double (FLOPS/MHz) | 0.026 | 0.025 | 0.011 | 0.15 | 0.018 |


Breaking out the Arithmetic sub-test, we can get inside the OMAP 4430's advantage, which was reflected in the first table. Although Qualcomm's APQ8064 achieves superior integer performance per cycle, its showing in the floating-point-based Whetstone metric is consistently worse than TI's. 

Again, though, these results are completely synthetic. The OMAP 4430 and APQ8064 will never be made to compete at the same clock rate. We're simply interested in where each architecture derives its strengths.

Core Performance At A Given Clock Rate: Multi-media

| | OMAP 4430 | Tegra 3 (T30L) | S3 (APQ8060) | S4 Plus (MSM8960) | S4 Pro (APQ8064) |
| --- | --- | --- | --- | --- | --- |
| CPU | Two Cortex-A9 Cores @ 1 GHz | Four Cortex-A9 Cores @ 1.3 GHz | Two Scorpion Cores @ 1.2 GHz | Two Krait Cores @ 1.5 GHz | Four Krait Cores @ 1.5 GHz |
| Multi-media Integer [NEON] (kPix/s/MHz) | 1.15 | 1.14 | 1.23 | 1.34 | 1.38 |
| Multi-media Float [NEON] (kPix/s/MHz) | 1.16 | 1.09 | 1.53 | 2.13 | 1.81 |
| Multi-media Double [FPU] (kPix/s/MHz) | 0.56 | 0.54 | 0.40 | 0.33 | 0.42 |
| Multi-media Float/Double (kPix/s/MHz) | 0.80 | 0.77 | 0.77 | 0.83 | 0.87 |


When we perform the same exercise with Sandra's Multi-media module, we see where the Krait architecture earns its advantage over Scorpion, first, and the OMAP's Cortex-A9 cores, second.

Particularly when it's able to exploit ARM's NEON 64- and 128-bit instruction set, Krait dominates handily. Only when Sandra drops back to measuring performance using the Vector Floating Point mode does Qualcomm's latest cede its lead. Not that you should be worried; NEON is far more powerful, making it a more likely instruction set to see in real-world apps.

Multi-Core Efficiency


Many years ago, Intel and AMD stopped emphasizing fast single-core desktop processors and started designing CPUs with multiple cores per package. Software developers had to learn how to exploit those duplicated resources in order to extract some benefit from them.

The same thing is happening in the mobile space as multi-core SoCs facilitate parallelism in power-optimized architectures. As on the desktop, though, the performance of a dual- or quad-core chip doesn't scale linearly. Synthetic measurements make it possible for us to get a best-case scaling number, but the real world is far less exact.

This is partly a result of how cores work together. Threaded apps involve data sharing between cores. If this isn't done efficiently, performance drops. Lots of bandwidth and low latency are important. TI's OMAP 4430 is able to move the most data per second between its cores, while Nvidia's Tegra 3 follows closely behind, instead standing out for its minimal latency.
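
One rough way to see non-linear scaling in our own data is to compare the dual-core S4 Plus against the quad-core S4 Pro in the Sandra aggregate results from earlier. The sketch below computes a scaling efficiency, where 1.0 would mean doubling the cores doubled the score. Keep in mind that these are two different SoCs in two different devices, so this is only an approximation; the `scaling_efficiency` helper is our own illustration.

```python
# Aggregate Sandra scores for the two Krait-based parts (both at 1.5 GHz).
S4_PLUS_2_CORES = {"Native Arithmetic": 593, "Native Multi-media": 5067,
                   "Java Arithmetic": 171, "Memory": 3308}
S4_PRO_4_CORES = {"Native Arithmetic": 1194, "Native Multi-media": 9642,
                  "Java Arithmetic": 278, "Memory": 4104}


def scaling_efficiency(score_small: float, score_large: float,
                       cores_small: int, cores_large: int) -> float:
    """Observed speedup divided by the ideal (linear) speedup from added cores."""
    observed = score_large / score_small
    ideal = cores_large / cores_small
    return observed / ideal


if __name__ == "__main__":
    for test, dual_score in S4_PLUS_2_CORES.items():
        eff = scaling_efficiency(dual_score, S4_PRO_4_CORES[test], 2, 4)
        print(f"{test}: {eff:.0%} of linear scaling from two to four cores")
```

With these figures, the compute-bound arithmetic test scales almost linearly, while the memory test gains far less from the extra pair of cores.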

5. Graphics Performance: Adreno 320 Under GLBenchmark 2.1 And 2.5

Fortunately, it's easier for us to evaluate graphics performance. Let's start by running off-screen tests using GLBenchmark 2.1.

GLBenchmark 2.1

Our benchmark results tell it all. The SGX543MP4 in Apple's iPad 3 is the performance king at 1280x720. Qualcomm's Adreno 320 nearly matches Imagination Technologies' work in the Egypt test, but falls short by roughly 30% in the Pro benchmark.

The effective fill rate of the PowerVR GPUs is good, and that's a big advantage for Imagination Technologies, since mobile game developers tell us that fill rate is a primary bottleneck in their work. 

Although Adreno 320 is routed by the SGX543MP4, Qualcomm still deserves recognition for improving its graphics architecture. The Adreno 320 serves up three to four times as many frames per second as Adreno 220, and its fill rate is more than six times higher.

GLBenchmark 2.5, OpenGL ES 2.0 benchmark for 1080p

Over time, today's most common resolutions will give way to higher-definition screens. We've already seen Apple's third-generation iPad enable 2048x1536, necessitating more powerful graphics hardware in the process. Clearly, it'll become increasingly important to benchmark mobile graphics architectures under more demanding circumstances. Today, the best we can do is Kishonti's latest GLBenchmark 2.5, which continues to primarily test OpenGL ES 2.0 features, but now specifically targets 1920x1080 and higher-quality textures.

Under the heavier load that GLBenchmark 2.5 applies to each graphics subsystem, the results change. Qualcomm's Adreno 320 emerges victorious over Imagination's SGX543MP4. The margin is pretty small, and the Pro test that previously favored PowerVR graphics is no longer part of the benchmark, so the swap isn't quite decisive. But it's at least interesting to see Adreno maintain more of its frame rate in this more taxing workload.

Unfortunately, Qualcomm can't achieve the fill rate of Imagination's tile-based deferred renderer. The Adreno 320 does, however, improve greatly on the last-generation Adreno 220 implementation.

6. S4 Pro Puts Qualcomm Back In The Fight

Dell XPS 10: Qualcomm's S4 + Windows RT

Qualcomm's Snapdragon S4 Pro platform is backed by big improvements in CPU and graphics performance, arming the company with the numbers it needs to go up against the current heavy-hitters employing Cortex-A9-based SoCs.

As the technology world works, however, more competition is headed Qualcomm's way. Nvidia's next-generation Tegra and TI's OMAP 5430 are both expected soon. Both will employ Cortex-A15 cores and, we are certain, deliver substantially better graphics performance.

Qualcomm faces a few challenges moving forward. The first is a lack of games able to showcase the strengths of its Adreno architecture. In comparison, Nvidia's storied history with ISVs has already proven quite fruitful for the company's mobile efforts. Today, there are several Tegra-optimized games (like RipTide and Shadowgun). Further, titles like Sonic the Hedgehog 4: Episode II enjoy an early release on Tegra platforms compared to the rest of the Android ecosystem.

It'd be easy for PC enthusiasts to dismiss the value of titles written specifically for smartphones and tablets. However, they're very important to hardware vendors trying to demonstrate what a graphics processor can do. For example, Apple uses Epic's Infinity Blade series to evangelize what its iPads can do. As we all know, manufacturers need software to justify ever more powerful hardware.

Qualcomm's second obstacle is that S4 Pro-based devices may show up later to the party than the company would have liked. The APQ8064 in our mobile development platform is a very attractive SoC sporting the new Krait CPU architecture, Adreno 320 graphics, and Wi-Fi connectivity. But without an integrated cellular modem, the chip has to be complemented by another piece of silicon in smartphone designs. This almost certainly has ramifications for power consumption.

Snapdragon S4 Thermal Comparison and Butter Benchmark

Its MSM8960T is where we'll see copious integration translate to a better balance between performance and power (for several reasons). Phones based on the SoC still haven't landed, though. And it remains to be seen how much of a head-start Qualcomm might enjoy with its Snapdragon S4 Pro platform compared to upcoming solutions from Nvidia and TI. 

In the near term, we have no problem waiting for retail S4 Pro-based tablet or smartphone hardware, which is faster than anything else we've used in those segments. Based on the roadmaps we've seen, it'll be 2013 before Qualcomm is challenged. We only hope the company uses that time to court ISVs able to properly utilize its promising Adreno 320 graphics engine.