Today, we outline the strict testing procedures used to obtain accurate data and discuss each test that we perform on smartphones and tablets.
Advances in transportation, networking, and wireless technologies have allowed our society to become increasingly mobile. And whether we’re moving to another room in our home, down the street to a coffee shop, or to a city halfway around the world, our computers, which we’ve become so dependent upon, need to come with us. Laptops are still the go-to mobile device for getting real work done; however, they’re too big and bulky to comfortably use while lying in bed, and you certainly can’t put one in your pocket and carry it everywhere you go. For these reasons, tablets and smartphones are stealing screen time from traditional PCs and laptops as we increasingly use them to browse the Web, read books, watch TV and movies, play games, socialize, find a place to eat, and yes, even get some work done. Both hardware and software must evolve as we continue to find new uses for these amazing devices.
Since we carry these devices with us everywhere we go, they are even more personal than personal computers. So it’s no surprise that mobile devices vary so dramatically in size, design, performance, and price. With so many options available, and so much marketing hype, it can be difficult to choose the device that’s right for you.
Whether you’re a curious enthusiast or trying to make an informed buying decision, understanding how a device performs is both interesting and crucial for a positive user experience. This is why an in-depth performance evaluation is part of our comprehensive product reviews.
Hey, this is not easy!
Testing mobile devices presents many challenges, however. For starters, they contain a lot of hardware, from internal components such as the SoC, memory, and storage to external components such as the display, cameras, and speakers, all of which have some impact on user experience. Software plays a vital role too, and how all of these pieces work together affects battery life.
The biggest challenge when testing complete, working systems is collecting data in an uncontrolled environment. Mobile devices by their very nature are connected devices, constantly sending and receiving data from the cloud. Apps running in the background wake periodically to collect data, send notifications, or any number of things. Security software installed by the OEM or carrier may perform a routine scan. The operating system is constantly busy too. All of this activity leads to a system with a high degree of entropy.
Collecting accurate, repeatable, and fair results from such a noisy environment requires a strict testing methodology. Here at Tom’s Hardware, we’ve used our knowledge and experience to develop a procedure that minimizes background tasks and strives to create a level playing field for all devices—as well as we can anyway, since there are some variables beyond our control.
Before discussing the details, however, we should answer a more basic question: Where do our review units come from? In some cases, we purchase the products ourselves, but most of the units we review are retail products provided by OEMs.
While Tom’s Hardware attracts readers from all over the world, the main site is based in the United States (there are other sites in the Tom’s family focusing on different regions). Therefore, the devices we test are models intended for sale in the North American market. The increasing importance of markets outside of the US, however, means many OEMs are launching products in these other regions first. Because of the media’s unhealthy obsession with being the first to post, many tech news sites are now reviewing international or even pre-production units running different software and often exhibiting different performance characteristics than the North American retail models. Many of these sites do not even disclose this information. We feel that this potentially misleading practice is not in our readers’ best interest. If we do test an international or pre-production unit outside of a full review, we will always disclose this within the article.
So, after acquiring our North American retail units, what’s next? The first thing we do is install operating system and software updates. Next, we perform a factory reset to make sure we’re starting from a clean slate. The final step in preparing a device for testing involves diving into the settings menu. We’ll spare you our full list of configuration settings (we go through every possible menu), which are meant to minimize background tasks and keep comparisons as fair as possible, and just show you the most important ones in the table below.
| Setting | State | Notes |
|---|---|---|
| Wireless & Data | | |
| Airplane Mode | on | This is an uncontrollable variable because signal strength (and thus power draw) varies by location, carrier, time of day, etc. The cellular radio is powered down to keep device testing fair. |
| Location Services | off | Reduces background activity. |
| Data Collection | off | Options for improving customer experience and sending diagnostic data, usage statistics, etc. to Google, OEMs, or cellular providers are disabled to reduce background activity. |
| Auto or Adaptive Brightness | off | |
| Display Brightness | 200 nits | The screens are calibrated to 200 nits, keeping results comparable between devices. |
| Special display settings | off | |
| Screen Mode | | Set to basic, native, standard, sRGB, or device default. When testing the display, each mode is tested separately. |
| Battery saving modes | off | Devices implement different techniques. Turning these off shows battery life without having to sacrifice performance. |
| Turn on automatically | never | |
| Google, iCloud, Facebook, Twitter, etc. | inactive | All cloud-based accounts (or any account that accesses the internet in the background) are deleted after initial setup. This reduces background activity that interferes with measurements. The only exception is the Microsoft account for Windows Phone, which cannot be removed. |
In order to keep testing fair and results comparable, we strive to make each device’s configuration as similar as possible. Due to differences in operating systems and features, however, there will always be some small differences. In these situations, we leave the settings at their default values.
Devices may also contain pre-installed software from cellular providers and OEMs, introducing more variability. When present, we do not remove or disable this software—it’s usually not possible and the average user will likely leave it running anyway. To help mitigate this issue, we try to get either unlocked devices or devices from carriers with the least amount of “bloatware.”
A consistent testing procedure is just as important to the data collection process as our device configuration profile. Since power level and temperature both affect a device’s performance, tests are performed in a controlled manner that accounts for these factors. Below are the main points of our testing procedure:
- The ambient room temperature is kept between 70 °F (21 °C) and 80 °F (26.5 °C). We do not actively cool the devices during testing. While this would further reduce the possibility of thermal throttling affecting the results, it’s not a realistic condition. After all, none of us carry around bags of ice, fans, or thermoelectric coolers in our pockets to cool our devices.
- Smartphones lie flat on a wood table (screen facing up) during testing, with tests conducted in portrait mode unless forced to run landscape by the app. Tablets are propped up in a holder in landscape mode, so that the entire backside of the device is exposed to air. This is to better simulate real-world usage.
- Devices are allowed to sit for a specified length of time after they are turned on to allow initial network syncing and background tasks to complete before starting a test.
- Devices are not touched or moved while tests are running.
- Devices are allowed to cool between test runs so that subsequent tests are not affected by heat buildup.
- All tests are performed while running on battery power. The battery charge level is not allowed to drop below a specific value while performing any performance measurement other than battery life.
Benchmarks are run at least two times and the results are averaged to get a final score. The minimum and maximum values from each benchmark run must not vary from the computed average value by more than 5%. If the variance threshold is exceeded, all of the benchmark scores for that run are discarded and the benchmark is run again. This ensures that the occasional outlier caused by random background tasks, cosmic rays, spooky quantum effects, interference from technologically advanced aliens, disturbances in the space-time continuum, or other unexplainable phenomena does not skew the final results.
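The acceptance rule described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `accept_runs` and the `tolerance` parameter are hypothetical, scores are assumed to be positive, and the actual tooling used for reviews is not shown here.

```python
from statistics import mean


def accept_runs(scores, tolerance=0.05):
    """Return the averaged score, or None if the set of runs must be discarded.

    Hypothetical helper: a set of runs is accepted only when both the
    minimum and the maximum score lie within `tolerance` (5% by default)
    of the computed average.
    """
    if len(scores) < 2:
        raise ValueError("at least two runs are required")
    avg = mean(scores)
    allowed = tolerance * avg
    if (avg - min(scores)) > allowed or (max(scores) - avg) > allowed:
        return None  # threshold exceeded: discard and run the benchmark again
    return avg


print(accept_runs([1000, 1020]))  # runs agree closely, so the average is kept
print(accept_runs([1000, 1200]))  # runs disagree by more than 5%, so None
```

Returning `None` rather than raising an error makes it easy to wrap the check in a retry loop.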
Our testing procedure also includes several methods for detecting benchmark cheats.
The mixture of synthetic and real-world tests we run is meant to give a comprehensive overview of a device’s performance. Synthetic tests, which are generally composed of several small, specialized blocks of code for performing operations such as cryptography, file compression, matrix operations, and alpha blending, are good at isolating the performance of different parts of a component’s architecture, including integer and floating point math, specialized instructions, pixel shaders, and rasterization. With this information, we can make comparisons between individual hardware components like SoCs (CPU A is faster than CPU B) or NAND flash. Because of their highly focused nature, however, it can be difficult to relate these results to the overall user experience, limiting us to generic statements about a device being faster because it has a faster CPU in certain benchmarks. Furthermore, synthetic tests are generally designed to push hardware to its limits (useful for determining maximum performance and for spotting weak points in a design) but do not represent real-world workloads.
For this reason, we also try to include benchmarks that test macro-level activities you do every day such as web browsing, composing text, editing photos, or watching a video. While these benchmarks are a better indicator of overall user experience, they are much more difficult to develop, leaving testers few options for mobile platforms.
To truly understand the performance of a device, we need to test it at the component level and at the system level, we need to know its maximum performance and its performance in real-world scenarios, and we also need to spot deficiencies (thermal throttling) and anomalies (unsupported features). No single benchmark can do all of these things. There’s not even a single benchmark that can adequately test any one of these things (creating a good benchmark is extremely difficult and there are always compromises). This is why we run a whole suite of benchmarks, many of which have overlapping functionality.
By now it should be apparent that the benchmarks we use are not randomly selected. In addition to fulfilling the requirements above, our benchmark suite comes from experienced developers who are willing to openly discuss how their benchmarks work. We work closely with most of these developers so that we may gain a better understanding of the tests themselves and to provide them with feedback for improving their tests. The table below lists the benchmarks we currently use to test mobile devices.
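Google Android

| Category | Benchmark | Version | Developer |
|---|---|---|---|
| CPU And System Performance | AndEBench Pro 2015 | 2.1.2472 | EEMBC |
| CPU And System Performance | Basemark OS II Full | 2.0 | Basemark Ltd |
| CPU And System Performance | Geekbench 3 | 3.3.1 | Primate Labs |
| CPU And System Performance | MobileXPRT 2013 | 188.8.131.52 | Principled Technologies |
| GPU And Gaming Performance | 3DMark: Ice Storm Unlimited | 1.2 | Futuremark |
| GPU And Gaming Performance | Basemark X | 1.1 | Basemark Ltd |
| GPU And Gaming Performance | GFXBench 3 Corporate | 3.0.28 | Kishonti |
| GPU And Gaming Performance | GFXBench 3.1 Corporate | 3.1.0 | Kishonti |
| GPU And Gaming Performance | Basemark ES 3.1 | 1.0.2 | Basemark Ltd |
| Battery Life And Thermal Throttling | Basemark OS II Full | 2.0 | Basemark Ltd |
| Battery Life And Thermal Throttling | GFXBench 3 Corporate | 3.0.28 | Kishonti |

Apple iOS

| Category | Benchmark | Version | Developer |
|---|---|---|---|
| CPU And System Performance | Basemark OS II Full | 2.0 | Basemark Ltd |
| CPU And System Performance | Geekbench 3 | 3.3.4 | Primate Labs |
| GPU And Gaming Performance | 3DMark: Ice Storm Unlimited | 1.2 | Futuremark |
| GPU And Gaming Performance | Basemark X | 1.1 | Basemark Ltd |
| GPU And Gaming Performance | GFXBench 3 Corporate | 3.0.32 | Kishonti |
| GPU And Gaming Performance | GFXBench 3.1 Corporate | 3.1.0 | Kishonti |
| GPU And Gaming Performance | Basemark ES 3.1 | 1.0.2 | Basemark Ltd |
| Battery Life And Thermal Throttling | Basemark OS II Full | 2.0 | Basemark Ltd |
| Battery Life And Thermal Throttling | GFXBench 3 Corporate | 3.0.32 | Kishonti |

Microsoft Windows Phone

| Category | Benchmark | Version | Developer |
|---|---|---|---|
| CPU And System Performance | Basemark OS II Full | 2.0 | Basemark Ltd |
| GPU And Gaming Performance | Basemark X | 1.1 | Basemark Ltd |
| GPU And Gaming Performance | GFXBench 3 DirectX | 3.0.4 | Kishonti |
| Battery Life And Thermal Throttling | Basemark OS II Full | 2.0 | Basemark Ltd |
| Battery Life And Thermal Throttling | GFXBench 3 DirectX | 3.0.4 | Kishonti |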
A mobile device’s display serves as its primary interface—the front is basically all screen—and, over the course of a day, we can spend many hours staring at it in lighting conditions varying from a pitch-black room to bright sunlight outdoors. Because display quality so profoundly impacts our mobile experience, we thoroughly scrutinize its performance.
For display measurements, we use a SpectraCal C6 colorimeter, which measures color and luminance accurately and works equally well with LCD or OLED screens. The C6 is calibrated by SpectraCal with a Konica Minolta CS-2000 and comes with a NIST (National Institute of Standards and Technology) certificate of accuracy. Even though sealed designs like the C6 are more stable than designs whose color filters are exposed to air, the accuracy of all colorimeters can drift over time. Therefore, our C6 hardware gets calibrated as required.
In conjunction with the C6 colorimeter, we use SpectraCal's CalMAN Ultimate for Business v5 software, which allows us to create custom workflows and provides us with a wealth of measurement options. The CalMAN software also makes the nice looking graphs you see in our reviews.
The C6 is placed at the center of a thoroughly cleaned screen during testing. The display testing patterns are controlled manually or by the CalMAN software working together with SpectraCal’s MobileForge app.
Display Brightness: Maximum & Minimum
A display’s brightness or luminance, measured in candelas per square meter (cd/m²) or nits, affects its visibility in different lighting conditions. A higher maximum brightness makes the screen easier to see in bright conditions such as a brightly lit room or in direct sunlight outdoors. For a dimly lit room or at night, a lower minimum brightness is preferable to keep your screen from blinding you or bothering others around you.
To measure the maximum brightness, the brightness slider is adjusted to its upper bound and a 100% white pattern is displayed onscreen. The minimum brightness is measured the same way, but with the brightness slider set to its lower bound.
The values we report for maximum and minimum brightness represent what’s achievable with the stock brightness slider. Sometimes, however, higher or lower values can be “unlocked” by using a third-party app to control the screen brightness—the mapping of slider positions to luminance values is set by the OEM and may be capped below the hardware’s true limits. These unlocked values may be reported in addition to the stock values if there’s a significant difference between the two.
Full Brightness chart including values for APL=50.
We report two different brightness levels for OLED displays: APL=50% and APL=100% (APL stands for Average Picture Level). This is because the brightness of an OLED screen changes depending on what content is actually being displayed. The APL values we use provide a good upper and lower bound for real-world content.
A word of caution: Maximum brightness values for OLED screens that do not specify the APL used for the measurement are essentially useless. The spec sheets for many OEMs, and even some mobile review websites, list values much higher than we report by using very low APL levels—sometimes less than 10%—which will never be seen in real-world use.
Display Brightness: Calibrated
After measuring the maximum and minimum brightness, the screen is calibrated to 200 nits ±1% for the remainder of the display tests and all of the battery tests.
Since some OEMs manipulate luminance in subtle ways—usually decreasing it by a small amount after a few seconds to minutes after an adjustment is made—we continue to monitor screen brightness during the remaining tests to ensure it stays at 200 nits.
Black Level & Contrast Ratio
Black level represents the luminance of a full-black (0% white) pattern, which we measure after setting 100% white to 200 nits as mentioned above. Because OLED displays are self-emitting, they are capable of displaying a true black by turning pixels off. This is not the case for LCD displays which use a separate, always-on backlight. Even when an LCD pixel is off, it allows some of the backlight to leak through.
A display’s contrast ratio represents the luminance ratio between a full-white (100% white) and a full-black (0% white) pattern; similar to dynamic range, the higher the value the better. Written another way: contrast = (100% white luminance) / (0% white luminance). Clearly, having a lower black level is desirable, not just for better looking blacks, but for maximizing the screen’s contrast ratio. Because OLED displays have black levels at or very near zero, they essentially have infinite contrast ratios.
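As a small worked example of the ratio above, the sketch below divides the white luminance by the black level and treats a black level of exactly zero as an infinite contrast ratio, which is how an ideal OLED panel would behave. The function name and the sample luminance values are hypothetical.

```python
import math


def contrast_ratio(white_nits, black_nits):
    """Contrast = (100% white luminance) / (0% white luminance)."""
    if white_nits <= 0 or black_nits < 0:
        raise ValueError("luminance values must be non-negative, white above zero")
    if black_nits == 0:
        return math.inf  # pixels fully off: effectively infinite contrast
    return white_nits / black_nits


# Hypothetical LCD: white calibrated to 200 nits, black level of 0.2 nits.
print(round(contrast_ratio(200.0, 0.2)))
# Hypothetical OLED with a black level of zero.
print(contrast_ratio(200.0, 0.0))
```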
The human visual system perceives light in a non-linear manner according to a gamma or power function, with better sensitivity to changes in dark tones than light ones. This quality improves our dynamic range and helps keep us from being blinded by bright sunlight outdoors (CMOS sensors in digital cameras perceive light in a more linear fashion, which is one reason they suffer from poor dynamic range).
If the luminance values in digital images were encoded linearly, matching the linear brightness levels displayed by the screen (non-CRT), then too many bits would be wasted encoding highlights we cannot perceive and too few used for encoding shadows, either leading to a loss in quality or larger files. By encoding luminance with a non-linear gamma function, however, bits are optimized for how we perceive light, leading to higher visual quality and lower file sizes.
The gamma value is actually the exponent used in the power-law expression to generate a specific gamma curve. A gamma of 2.2 is the ideal target value. A screen with a gamma less than 2.2 appears brighter or washed out with fewer shadows, while a gamma larger than 2.2 displays a darker image with a loss of shadow detail and fewer highlights. The sequence of images above shows how a screen’s gamma curve affects the displayed image.
Gamma is measured in 10% increments from the display’s black level (0% white) to 200 nits (100% white). In our reviews, we show the average gamma for the whole luminance range in a chart for comparison with other devices. We also include a graph that shows the gamma value at each measurement point relative to the ideal value of 2.2, shown as a yellow line. The ideal display would match the flat yellow line at every point.
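To make the power-law relationship concrete, here is a minimal sketch of how a per-point gamma could be derived from luminance readings. It assumes a plain power law, luminance = white × stimulus^gamma, with no black-level compensation; the measurement software may use a different formulation. The function names and the sample readings are hypothetical.

```python
import math


def point_gamma(stimulus, luminance, white_luminance):
    """Solve luminance / white = stimulus ** gamma for gamma.

    `stimulus` is the grayscale level as a fraction strictly between 0 and 1.
    """
    return math.log(luminance / white_luminance) / math.log(stimulus)


def average_gamma(measurements, white_luminance):
    """Average the per-point gamma over (stimulus, luminance) pairs.

    The 0% and 100% points are skipped because the logarithm is undefined
    or zero there.
    """
    gammas = [
        point_gamma(s, lum, white_luminance)
        for s, lum in measurements
        if 0.0 < s < 1.0 and lum > 0.0
    ]
    return sum(gammas) / len(gammas)


# Hypothetical readings from an ideal gamma 2.2 display calibrated to 200 nits.
white = 200.0
readings = [(i / 10, white * (i / 10) ** 2.2) for i in range(0, 11)]
print(round(average_gamma(readings, white), 3))
```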
Color temperature, also known as white balance or white point, describes the color of light emitted by an object and is measured in Kelvin. For example, an incandescent bulb has a color temperature around 2700 K and a fluorescent lamp is around 5000 K. The ideal correlated color temperature for a computer display, defined by the D65 standard, is 6500 K, which is the color temperature of sunlight on an overcast day.
To put this another way, if you take a white object and view it outside on an overcast day it will take on a specific color. If you look at the same white object in a room lit by incandescent bulbs with tungsten filaments, the object will appear more yellow or orange. In fluorescent light, the object would take on a green tint. Since light with a color temperature less than our standard 6500 K has more red than blue content, we describe it as being “warm.” Conversely, color temperatures above 6500 K skew towards blue, so we say the light is “cool.”
Similar to gamma, we measure color temperature in 10% grayscale increments from the display’s black level (0% white) to 200 nits (100% white). The average correlated color temperature (CCT) is compared to other devices in a chart, while the CCT at each measured grayscale value is shown in a bar graph. The yellow line represents the ideal value of 6500 K. Values close to 0% white are not accurate due to limitations of the testing hardware.
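For readers curious how a correlated color temperature can be estimated from a chromaticity reading, the sketch below uses McCamy's well-known cubic approximation on CIE 1931 (x, y) coordinates. This is an illustration only: we are not claiming it is the method the measurement software uses, and the approximation is only reasonable for near-white colors in the range typical of displays.

```python
def mccamy_cct(x, y):
    """Approximate CCT in kelvins from CIE 1931 xy chromaticity (McCamy)."""
    n = (x - 0.3320) / (0.1858 - y)
    return 449.0 * n ** 3 + 3525.0 * n ** 2 + 6823.3 * n + 5520.33


# D65 white point: the result lands close to the 6500 K target.
print(round(mccamy_cct(0.3127, 0.3290)))
# CIE Illuminant A (incandescent): a much "warmer" result.
print(round(mccamy_cct(0.4476, 0.4074)))
```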
Grayscale RGB Balance
Every luminance or grayscale value between black (0% white) and 100% white is a mixture of the three primary colors red, green, and blue (abbreviated RGB), which also happen to be the colors of an LCD or OLED display’s sub-pixels.
The RGB balance graph shows the individual levels of the red, green, and blue primary colors for grayscale (luminance) values ranging from black (0% white) to 100% white (200 nits) in 10% increments. The vertical axis is percent relative to a target value. Below the graph, the average value for each primary color is displayed for reference.
This graph does not give us any new information; it’s the same data from the color temperature and grayscale accuracy graphs, but plotted in a different way. We include this graph because RGB balance is fundamental to display performance and helps us explain most of the other charts.
Ideally, a screen would display red, green, and blue equally at 100% for each grayscale value. In the example graph above, we see that there’s too much blue (blue is greater than 100%) and not enough red (red is less than 100%) for grayscale values between 10% and 80%. This deviation results in grayscale error and produces a cool color temperature above 6500 K. At 100%, green is the dominant color, which gives a visible green tint to a 100% white screen.
Grayscale error is the mismatch between a target value and the measured value. There are several formulas for measuring error, but we’re using the most recent formula known as CIE ΔE2000 (luminance compensated).
The grayscale error is calculated for the same data set used for the other grayscale tests, and the average value is shown in a chart for easy comparison to other devices. The bar graph shows the amount of error at each measured luminance level. Error values below one are imperceptible. The green line represents an error level of three, anything below which is considered good. Anything between the green and yellow line at an error level of five is clearly visible, but most people would consider the error acceptable. Error values above five signify unacceptable results.
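The error bands described above map naturally onto a small lookup. The sketch below is a hypothetical helper: the labels are ours, and treating a value of exactly five as still acceptable is an assumption, since the text only says that values above five are unacceptable.

```python
def classify_delta_e(delta_e):
    """Map a CIE ΔE2000 error value onto the bands used in the charts."""
    if delta_e < 0:
        raise ValueError("error values cannot be negative")
    if delta_e < 1:
        return "imperceptible"
    if delta_e < 3:
        return "good"
    if delta_e <= 5:
        return "visible but acceptable"
    return "unacceptable"


for value in (0.6, 2.4, 4.1, 7.9):  # hypothetical error values
    print(value, classify_delta_e(value))
```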
The grayscale error, which is influenced by the RGB balance and gamma, influences more than just the color and luminance of white; it also affects the hue and saturation of other colors. If a screen performs poorly in grayscale measurements, then it will also perform poorly when displaying colors.
Current screen technology is not capable of displaying all of the colors we can perceive. Instead, a screen has a color gamut or subset of colors that it can generate. This also applies to other output devices such as projectors and printers; however, if you edit a video or design a poster using a screen whose color gamut differs from the projector’s or printer’s color gamut, the final output may not look as you intended. Using standardized color spaces ensures that different devices are capable of displaying the same set of colors.
There are many different color spaces, but the only one that matters for mobile devices is the sRGB standard. Most content intended for mobile or computer screens is created, and thus intended to be displayed in, the sRGB color space. Even if content is created in a different color space, Adobe RGB for example, it will be mapped into the sRGB color space because mobile apps and operating systems are not color space aware—they do not understand anything other than sRGB.
A display’s color gamut is displayed with a chromaticity diagram like the one shown above. We use the CIE 1976 u’v’ diagram because it more accurately represents our eye’s sensitivity to different colors: We are less sensitive to differences in green tones than blue tones. The goal is to create a perceptually uniform color space where equal distances anywhere on the diagram correspond to equally perceived color differences.
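The CIE 1976 u’v’ coordinates are a simple projective transform of the older CIE 1931 (x, y) chromaticity coordinates. The sketch below shows that standard conversion, plus a straight-line distance in the u’v’ plane as a rough indication of how far a measured point sits from its target. The function names are hypothetical, and the distance is shown for illustration; it is not the ΔE2000 error reported in our charts.

```python
import math


def xy_to_uv_prime(x, y):
    """Convert CIE 1931 xy chromaticity to CIE 1976 u'v' coordinates."""
    denom = -2.0 * x + 12.0 * y + 3.0
    return 4.0 * x / denom, 9.0 * y / denom


def uv_distance(measured_xy, target_xy):
    """Euclidean distance between two chromaticities in the u'v' plane."""
    mu, mv = xy_to_uv_prime(*measured_xy)
    tu, tv = xy_to_uv_prime(*target_xy)
    return math.hypot(mu - tu, mv - tv)


# D65 white point and the sRGB red primary expressed in u'v'.
print(tuple(round(c, 4) for c in xy_to_uv_prime(0.3127, 0.3290)))
print(tuple(round(c, 4) for c in xy_to_uv_prime(0.64, 0.33)))
```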
In this diagram, the larger, horseshoe shaped color region represents the total range of colors humans can perceive. The brighter, triangular region is the sRGB color space. Ideally, the tested display’s color gamut, defined by the white triangle, would match the boundaries of the sRGB color space exactly.
Color gamut is tested by measuring the three primary colors (red, green, and blue), which form the vertices of the gamut triangle; the three secondary colors (cyan, magenta, and yellow), which lie along the edges of the triangle; and the white point, which is inside the triangle where the primary colors combine, at a stimulus level of 75%. The small white squares represent the target values, while the dots are the measured values.
Color Saturation Sweep
Color saturation for the three primary and three secondary colors is measured in 20% increments, with 0% being white, at a stimulus level of 75%.
The results are shown in the CIE 1976 u’v’ chromaticity diagram, with the white squares depicting the target values and the dots the measured values. We’re looking for two things in this test: shifts in hue and color compression. In the example above, we can see that magenta tones are shifted towards blue, giving them a purple hue instead. There’s also evidence of color compression, signified by unequal spacing between the saturation levels for a particular color. Some OEMs will increase color saturation at the higher end of the scale to make colors look more “vibrant.” The side effect, however, is “color banding,” where smooth saturation gradients are replaced by discrete steps and regions of nearly constant color.
The color accuracy of a display is tested using a total of 30 colors: the 24 GretagMacbeth ColorChecker colors plus the three primary and three secondary colors. The measured values, represented by dots, are plotted on the CIE 1976 u’v’ chromaticity diagram relative to the target values, represented by white squares. The blackbody curve is shown for reference.
Color error, the mismatch between a target value and the measured value, is also calculated for this same data set using the CIE ΔE2000 formula. The average color error is shown in a chart for easy comparison to other devices. The bar graph shows the error values for each tested color. Error values below one are imperceptible. The green line represents an error level of three, anything below which is considered good. Anything between the green and yellow line at an error level of five is clearly visible, but most people would consider the error acceptable. Error values above five signify unacceptable results.
We also include a color swatch to better visualize the difference between the target color (bottom half) and the displayed color (top half). In our discussion about RGB balance above, we noted how this particular screen had a surplus of green and a deficit of blue at 100% white. This shows up as a visible green tint for White in the color swatch. Note that your display’s accuracy will influence the actual color and color difference you see in the swatch. If your display is not calibrated properly or is of low quality, what you see will not match the results of our analysis.
Our audio testing is limited to a subjective listening test with comparisons to a reference device. We explore the audio quality of both the external speaker(s) and headphone output. For the external speaker(s), we listen to the output in several common orientations, including lying on a desk screen-side up and held in both portrait and landscape. If the speaker(s) are not front-facing, we also listen with the speaker facing directly towards us.
The listening tests start with the equalizer turned off or at neutral settings. All other sound enhancement or special effects features are also turned off. If deficiencies in sound quality are detected, we try making adjustments to the equalizer or other settings to examine their effect on quality.
Camera Performance And Photo Quality
Photo quality is composed of many facets, including exposure, white balance, sharpness, and noise to name just a few. There are also many variables that affect these qualities, including hardware (lens system, CMOS sensor), software (noise reduction algorithms), and environmental conditions (light level, light color, scene contrast). Clearly, photo quality is a complex topic, making it difficult to measure.
Our current method for evaluating quality involves taking a series of pictures and subjectively comparing them to photos taken with other competing cameras. The pictures are taken in a variety of lighting conditions, capturing scenes our users might encounter or that present a challenge to the cameras. To help ensure that the conditions for each photo are similar, the photos for each camera are taken from the same spot and all at the same time (within a two to three minute window). The stock camera app and default settings are used for all images except where noted in the review.
CPU And System Performance
The overall user experience is greatly affected by system-level performance. How responsive is the user interface? Do Web pages scroll smoothly? How quickly do apps open? How long does it take to search for an email in the inbox? All of these things are influenced by single- and multi-threaded CPU performance, memory and storage speed, and GPU rendering. Software plays an important role too: dynamic voltage and frequency scaling (DVFS) includes SoC-specific drivers, with parameters configured by the OEM, that adjust various hardware frequencies to find a balance between performance and battery life.
Using a combination of synthetic and real-world workloads, along with data gathered by our system monitoring utility, we quantify the performance of a device at both a hardware and system level, relating these values to the overall user experience.
Basemark OS II
Basemark OS II by Basemark Ltd. is one of the few benchmarks for comparing hardware performance across platforms—it runs on Android, iOS, and Windows Phone. The individual tests are written in C++, ensuring the same code is run on each platform, and cover four main categories: System, Memory, Graphics, and Web.
The System test scores CPU and memory performance and is composed of several subtests for evaluating both single- and multi-threaded integer and floating-point calculations. The Math subtest includes lossless data compression (for testing integer operations) and a physics simulator using eight independent threads (giving a slight advantage to multi-core CPUs). The XML Parsing subtest parses three different XML files (one file per thread) into a DOM (Document Object Model) Tree, a common operation for modern websites. Single-core CPU performance is assessed by applying a Gaussian blur filter to a 2048x2048 pixel, 32-bit image (16MB). There’s also a multi-core version of this test that uses 64 threads.
The Memory test was originally designed to measure the transfer rate of the internal NAND storage by reading and writing files with a fixed size, files varying from 65KB to 16MB, and files in a fragmented memory scenario. However, on devices with sufficient RAM (the cutoff seems to be around 1GB for Android) the operating system uses a portion of RAM to buffer storage reads/writes to improve performance. Since the files used in the Memory test are small (a necessary requirement for the benchmark to run on low-end hardware when it was created), they fit completely within the RAM buffer, turning this test into a true memory test that has little to do with storage performance.
The Graphics test uses a proprietary engine supporting OpenGL ES 2.0 on Android and iOS and DirectX 11 feature level 9_3 on Windows Phone. It mixes 2D objects (simulating UI elements) and 3D objects with per-pixel lighting, bump mapping, environment mapping, multi-pass refraction, and alpha blending. The scene also includes 100 particles rendered with a single draw call to test GPU vertex operations. For scoring purposes, 500 frames are rendered offscreen with a resolution of 1920x1080.
Finally, the Web score stresses the CPU by performing 3D transformations and object resizing with CSS, and also includes an HTML5 Canvas particle physics test.
The Basemark OS II results are presented in a stacked bar chart. The scores for the individual tests are shown in white. The overall score, which is the geometric average of the individual test scores, is shown in black at the left end of each bar.
AndEBench Pro 2015
This Android-only benchmark is developed by the Embedded Microprocessor Benchmark Consortium (EEMBC), with tests targeting CPU, GPU, memory, and storage performance.
The CoreMark-Pro test suite, which measures single- (one thread) and multi-core (two or more threads) CPU performance, is composed of several different algorithms designed to exercise both integer and floating-point operations. The “Base” version is compiled with the Google Native Development Kit (NDK) default compiler, while the “Peak” version uses a CPU-specific compiler meant to deliver optimal performance. The “Peak” version, which is only supported by Intel, is included for reference and is not included within the overall device score.
The Platform test performs a number of operations using Android platform SDK API calls meant to mimic real-world application scenarios, testing CPU, memory, and storage performance. It first decrypts and unpacks a file, calculating its hash to verify integrity. Inside the file is a database of contact names, a to-do list in XML format, and some pictures. Following the to-do list instructions, it performs searches within the contact database and manipulates the images. When the tasks are complete, the files are compressed, encrypted, and saved back to local storage.
There are two different memory (RAM) tests. The Memory Bandwidth test, based on the STREAM benchmark, measures memory throughput by performing a series of read and write operations in both single- and multi-threaded modes. The geometric mean is then calculated and reported in MB/s. The Memory Latency test initializes a 64MB block of memory as a linked list, whose elements are arranged in random order. Measuring the time to traverse each node in the list, which requires fetching the next node pointer from memory for each step, provides the memory subsystem’s random access latency in number of memory elements accessed per second.
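The pointer-chasing idea behind the Memory Latency test can be sketched in a few lines. This is only an illustration of the access pattern, not EEMBC's code: the names `build_random_chain` and `chase` are hypothetical, and in Python the interpreter overhead swamps real DRAM latency, so the reported rate is not comparable to the benchmark's numbers.

```python
import random
import time


def build_random_chain(n, seed=0):
    """Return a list where chain[i] is the index of the node that follows i.

    The nodes form one cycle that visits every index exactly once in random
    order, so each hop depends on the value loaded by the previous hop.
    """
    order = list(range(n))
    random.Random(seed).shuffle(order)
    chain = [0] * n
    for pos in range(n):
        chain[order[pos]] = order[(pos + 1) % n]
    return chain


def chase(chain, steps):
    """Follow the chain for `steps` hops; return (final index, hops per second)."""
    idx = 0
    start = time.perf_counter()
    for _ in range(steps):
        idx = chain[idx]  # the next address is unknown until this load completes
    elapsed = time.perf_counter() - start
    rate = steps / elapsed if elapsed > 0 else float("inf")
    return idx, rate


if __name__ == "__main__":
    chain = build_random_chain(1 << 16)
    _, rate = chase(chain, 1_000_000)
    print(f"{rate:,.0f} elements accessed per second")
```

Because the order is random, hardware prefetchers cannot guess the next node, which is why this pattern exposes random access latency rather than streaming throughput.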
The Storage test performs a series of sequential and random reads and writes to the internal NAND storage in file sizes ranging from 512 bytes to 256KB using a single CPU thread. For each block size, access pattern, and access mode, multiple tests are repeated and the median throughput is used in the result calculation. The overall storage performance is the geometric mean of these median values and is reported in KB/s.
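The median-then-geometric-mean aggregation described above might look like the following sketch. The function name `storage_score`, the dictionary layout, and the sample numbers are all made up for illustration; only the aggregation steps come from the description.

```python
import math
from statistics import median


def storage_score(samples):
    """Combine repeated throughput measurements into one figure in KB/s.

    `samples` maps a (block size, access pattern, access mode) tuple to the
    list of throughputs measured for that configuration.
    """
    # Step 1: the median of the repeated runs for each configuration.
    medians = [median(runs) for runs in samples.values()]
    # Step 2: the geometric mean of those medians, computed in log space.
    log_sum = sum(math.log(m) for m in medians)
    return math.exp(log_sum / len(medians))


if __name__ == "__main__":
    # Invented measurements, KB/s.
    samples = {
        ("512B", "random", "read"): [100.0, 90.0, 400.0],
        ("256KB", "sequential", "write"): [400.0, 400.0, 400.0],
    }
    print(f"{storage_score(samples):.1f} KB/s")
```

Taking the median first keeps a single unusually fast or slow run from moving the result, and the geometric mean keeps the large sequential numbers from drowning out the small random ones.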
The 3D test measures GPU performance by running a scripted game sequence, running on a custom engine developed by Ravn Studio, offscreen with a resolution of 1920x1080. The test also runs onscreen purely as a demonstration and is not included in the final score.
The overall device score is calculated using a weighted geometric mean of all of the subtest scores with the following category weights: 25% CoreMark-PRO (base), 25% Platform, 25% 3D, 8.33% Memory Bandwidth, 8.33% Memory Latency, and 8.33% Storage. The weighted geometric mean is then scaled by 14.81661 (a factor used to calibrate the reference device to give a score of 5000) to arrive at the final score.
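As a rough sketch of that calculation, the weighted geometric mean and the scale factor can be written as follows. The dictionary keys and the function name `device_score` are hypothetical, the 8.33% weights are treated as exactly one-twelfth so the weights sum to 1, and any normalization the benchmark applies to the subscores beforehand is not described here and is not modeled.

```python
import math

# Category weights from the text; 8.33% is taken to mean 1/12 (an assumption).
WEIGHTS = {
    "coremark_pro_base": 0.25,
    "platform": 0.25,
    "3d": 0.25,
    "memory_bandwidth": 1 / 12,
    "memory_latency": 1 / 12,
    "storage": 1 / 12,
}

# Calibration factor quoted in the text.
SCALE = 14.81661


def device_score(subscores):
    """Weighted geometric mean of the subtest scores, times the scale factor."""
    total_weight = sum(WEIGHTS.values())
    log_sum = sum(w * math.log(subscores[name]) for name, w in WEIGHTS.items())
    return SCALE * math.exp(log_sum / total_weight)


if __name__ == "__main__":
    # Invented subscores, purely to show the call.
    example = {name: 300.0 for name in WEIGHTS}
    print(round(device_score(example), 2))
```

With every subscore equal, the weighted geometric mean is just that value, so the result is the subscore multiplied by 14.81661.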
Geekbench 3 Pro
Primate Labs' Geekbench is a synthetic benchmark that runs on both Android and iOS as well as Windows, OS X, and Linux on the desktop. It isolates single- and multi-core CPU performance by running a series of integer workloads, including encryption, compression, route finding, and edge detection algorithms, and floating-point workloads, including matrix multiplication, fast Fourier transforms, image filters, ray tracing, and physics simulation algorithms, for each case.
There are also single- and multi-core memory tests based on the STREAM benchmark. STREAM copy measures a device’s ability to work with large data sets in memory by copying a large list of floating-point numbers. STREAM scale works the same way as STREAM copy, but multiplies each value by a constant during the copy. STREAM add takes two lists of floating-point numbers and adds them value-by-value, storing the result back to memory in a third list. Finally, STREAM triad is a combination of the add and scale workloads. It reads two lists of floating-point numbers, multiplying one of the numbers by a constant and adding the result to the other number. The end result gets written back to memory in a third list.
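The four STREAM kernels are simple enough to write out directly. The sketch below uses plain Python lists to show what each kernel computes; it is not Geekbench's implementation, and a real memory test would run these over arrays far larger than the CPU caches and time them to derive a bandwidth figure.

```python
def stream_copy(a):
    """c[i] = a[i]"""
    return [x for x in a]


def stream_scale(a, k):
    """c[i] = k * a[i]"""
    return [k * x for x in a]


def stream_add(a, b):
    """c[i] = a[i] + b[i]"""
    return [x + y for x, y in zip(a, b)]


def stream_triad(a, b, k):
    """c[i] = a[i] + k * b[i], the add and scale kernels combined"""
    return [x + k * y for x, y in zip(a, b)]


if __name__ == "__main__":
    a = [1.0, 2.0]
    b = [3.0, 4.0]
    print(stream_copy(a))
    print(stream_scale(a, 3.0))
    print(stream_add(a, b))
    print(stream_triad(a, b, 2.0))
```

Each kernel does almost no arithmetic per element, so on real hardware the run time is dominated by how fast data moves between memory and the CPU.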
Geekbench calculates three different scores. Each individual workload gets a performance score relative to a baseline score, which happens to be 2,500 points set by a Mac mini (Mid 2011) with an Intel Core i5-2520M @ 2.50GHz processor. A section score (e.g., single-core integer) is the geometric mean of all the workload scores for that section. Finally, the Geekbench score is the weighted arithmetic mean of the three section scores.
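The three levels of scoring can be sketched as below. Several details are assumptions made for illustration: that a workload score scales linearly with throughput relative to the baseline machine, the section names, and the section weights, which are placeholders because the text does not give the actual values.

```python
import math

BASELINE_SCORE = 2500  # score assigned to the baseline machine in the text


def workload_score(device_rate, baseline_rate):
    """Score one workload relative to the baseline (linear scaling assumed)."""
    return BASELINE_SCORE * device_rate / baseline_rate


def section_score(workload_scores):
    """Geometric mean of the workload scores in one section."""
    log_sum = sum(math.log(s) for s in workload_scores)
    return math.exp(log_sum / len(workload_scores))


def overall_score(sections, weights):
    """Weighted arithmetic mean of the section scores."""
    total_weight = sum(weights[name] for name in sections)
    return sum(weights[name] * sections[name] for name in sections) / total_weight


if __name__ == "__main__":
    # Placeholder weights and invented section scores.
    weights = {"integer": 0.4, "floating_point": 0.4, "memory": 0.2}
    sections = {"integer": 3000.0, "floating_point": 2000.0, "memory": 1000.0}
    print(round(overall_score(sections, weights), 1))
```

Under the linear-scaling assumption, a device that completes a workload twice as fast as the baseline would score 5,000 on it.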
While synthetic tests like Geekbench are useful for measuring peak performance at the hardware level, it’s difficult to extrapolate these results to the system level. This is what makes the PCMark benchmark by Futuremark a valuable tool, because it measures system-level performance by running real-world workloads. Not only does it measure a device’s performance for several common scenarios, including web browsing, video playback, writing text, and photo editing, it also gives us insight into how its DVFS system is configured, which has a substantial impact on overall user experience.
The Web Browsing test uses a native Android WebView object to load and interact with a custom page meant to mimic a social networking site. Because the content is stored as local files on the device, the test is not influenced by network speed, latency, or bandwidth. The test begins by loading the initial page containing text and images; it then scrolls the page up and down, zooms into a picture, loads a set of data for the search feature, executes searches from the search box, selects an image thumbnail to add to the page, and then re-renders the page after inserting the new image. The final Web Browsing score is based on the geometric mean of the time to render and re-render the page, the time to load the search data, and the time to execute the search. It does not include the time spent scrolling and zooming.
The Video Playback test loads and plays four different videos in succession, and then performs a series of seek operations in the last video. The 1080p @ 30fps videos are encoded in H.264 with a baseline profile and a variable bit rate target of 4.5 Mb/s (maximum of 5.4 Mb/s). The test runs in landscape orientation and uses the native Android MediaPlayer API. Scoring is based on the weighted arithmetic averages for video load times, video seek times, and average playback frame rates. Frame rate has the largest effect on the score, heavily penalizing a device for dropping frames, followed by video seek time, and finally video load time.
The Writing test simulates editing multiple documents using the Android EditText view in portrait mode. The test starts by uncompressing two ZIP files (2.5MB and 3.5MB) and displaying the two text documents with images in each. All of document one is then copied and pasted into document two, creating a file with ~200,000 characters and four images. This combined document gets compressed into a 6MB ZIP file and saved to internal storage. It then performs a series of cut and paste operations, types several sentences, inserts an image, and then compresses and saves the final document in a 7.4MB ZIP file. The final score is based on a geometric mean of the times to complete these tasks.
The Photo Editing task loads and displays a source image (2048×2048 in size), applies a photo effect, displays the result, and for certain images, saves it back to internal storage in JPEG format with 90% quality. This process is repeated using 24 different photo filters and 13 unique source images, with a total of six intermediate file save operations. A total of four different APIs are used to modify the source images as shown in the table below.
|APIs used by the PCMark Photo Editing test for image processing|
|API||Description||Number of filters|
|android.media.effect||Performs all image processing on the GPU||15|
|android.support.v8.renderscript||Uses the RenderScript Intrinsics functions that support multi-core CPU or GPU processing||3|
|android-jhlabs||Runs Java-based filters on the CPU||4|
|android.graphics||Handles drawing to the screen directly||2|
The Photo Editing score includes a geometric mean of three parameters: the geometric mean of the time to apply the RenderScript effects, the geometric mean of the time to apply the Java and android.graphics effects, and the time to apply the android.media.effect effects. This final parameter also includes the time to perform file loads and saves, image compression/decompression, and an additional face detection test.
The scores we report come from the PCMark battery test, which loops all four tests and provides a larger sample size. The scores are simply the geometric mean of the scores achieved on each individual pass. There’s also an overall Work performance score, which is the geometric mean of the four category scores.
Like PCMark, TabletMark 2014 by BAPCo is a system-level benchmark with workloads meant to mimic common usage scenarios. Unlike PCMark which only runs on Android, TabletMark 2014 is cross-platform, supporting Windows, Windows RT, Android, and iOS. The only caveat is that it requires a 7-inch or larger screen, limiting it to tablets only.
This benchmark uses platform APIs wherever possible to measure performance for two different scenarios. It also computes an overall performance score which is the geometric mean of the two scenario scores.
The Photo and Video editing scenario does just what it says. It applies various filters and image adjustments to eight 14MP images. It also tests floating-point and integer math with an HDR filter written in C. There are also two video transcoding projects: a 40MB movie composed of four smaller clips and a 13MB movie composed of two smaller clips with a B&W filter applied. The output video is encoded with H.264 at 1080p and 8Mb/s bit rate.
Principled Technologies' MobileXPRT 2013 for Android consists of 10 real-world test scenarios split into two categories of testing: Performance and User Experience. The Performance suite contains five tests: Apply Photo Effects, Create Photo Collages, Create Slideshow, Encrypt Personal Content, and Detect Faces to Organize Photos. Performance results are measured in seconds. The User Experience suite also has five tests: List Scroll, Grid Scroll, Gallery Scroll, Browser Scroll, and Zoom and Pinch. These results are measured in frames per second. The category scores are generated by taking a geometric mean of the ratio between a calibrated machine (Motorola's Droid Razr M) and the test device for each subtest.
How a device achieves a score in this benchmark is more important than the score itself, since most devices hit the 60fps vsync limit in the User Experience tests and the Performance tests are largely redundant. Its real-world scenarios provide insight into a device’s DVFS system, which directly affects the perceived user experience.
Web Browsing Benchmarks
While it’s desirable to measure a device’s performance for a common task like Web browsing, collecting good data is difficult because of the browser software layer. In many cases, the choice of Web browser has a larger effect on performance than hardware or even CPU scaling frequencies. Since the browser is so important, which one should we use?
There are several options available for Android, but the two obvious choices are Google’s Chrome browser and the stock Android browser. As the most commonly used browsers, their performance is the most representative of what a user would experience using a particular device. Unfortunately, neither is well suited for benchmarking. Chrome’s frequent updates make comparing scores from device to device, or even week to week, difficult. The stock browser is even worse, because every OEM makes its own modifications (including some benchmark cheating), making device-to-device comparisons impossible. To avoid these issues, we use a static version of the Chromium-based Opera browser. The advantage to this approach is consistency and the ability to compare hardware performance. The downside is that using an out-of-date, less-popular browser produces scores that are not representative of what a user would actually see.
Due to platform restrictions, Safari is the only choice for iOS-based devices, while Internet Explorer is the only game in town on Windows RT. This makes it impossible to compare hardware performance across different platforms. This also is not fair to Android devices since iOS and Windows devices get to use newer, and likely higher performing, browsers.
When running browser benchmarks, no additional browser tabs or pages are loaded. After running each benchmark, we close the page and force close the browser to make sure caches are cleared between runs.
Because some of the Octane test scores show a wider variance than other benchmarks, we run it three times and average the results (throwing away any outlier runs) instead of the usual two runs.
GPU And Gaming Performance
Fueled by dramatic increases in mobile GPU performance and increasing familiarity with touch-based controls, developers both big and small are creating a rich gaming ecosystem for our phones and tablets. But just like on the desktop, better looking graphics and higher resolution screens require faster GPUs and more memory bandwidth. The synthetic and real-world game engine tests in this section probe the various aspects of GPU performance to identify weak points that might ruin the fun.
3DMark: Ice Storm Unlimited
This test by Futuremark includes two different graphics tests and a CPU-based physics simulation. It’s a cross-platform benchmark that targets DirectX 11 feature level 9 on Windows and Windows RT and OpenGL ES 2.0 on Android and iOS. The graphics tests use low-quality textures and a GPU memory budget of 128MB. All tests render offscreen at 1280x720 to avoid vsync limitations and display resolution scaling. These features allow hardware performance comparisons between devices and even across platforms.
The two different graphics tests stress vertex and pixel processing separately. Graphics test 1 focuses on vertex processing while excluding pixel related tasks like post-processing and particle effects. Graphics test 2 uses fewer polygons without shadows to minimize vertex operations, while boosting overall pixel count by including particle effects. This second test measures a system’s ability to read textures, write to render targets, and add post-processing effects such as bloom, streaks, and motion blur. The table below summarizes the differences in geometry and pixel count between the two graphics tests.
|Test||Vertices||Triangles||Pixels|
|Graphics test 1||530,000||180,000||1.9 million|
|Graphics test 2||79,000||26,000||7.8 million|
The Physics test uses the Bullet Open Source Physics Library to perform game-based physics simulations on the CPU. It uses one thread per available CPU core to run four simulated worlds, each containing two soft and two rigid bodies colliding. Each frame of the soft-body vertex data is sent to the GPU. Because the soft-body objects use a data structure that requires random memory access patterns, SoCs whose memory controller is optimized for serial rather than random memory access perform poorly.
The performance of each graphics test is measured in frames per second, and the graphics score is the harmonic mean of these results times a constant. The Physics test score is just the raw frames per second performance times a constant. The overall score is a weighted harmonic mean of the graphics and physics scores.
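The scoring structure can be illustrated with a short sketch. The multiplying constants and the graphics/physics weights below are placeholders chosen for readability, not Futuremark's published values, and the function names are hypothetical.

```python
def harmonic_mean(values, weights=None):
    """Weighted harmonic mean; equal weights when none are given."""
    if weights is None:
        weights = [1.0] * len(values)
    return sum(weights) / sum(w / v for w, v in zip(weights, values))


# Placeholder constants and weights (assumptions, not the real values).
GRAPHICS_CONSTANT = 1.0
PHYSICS_CONSTANT = 1.0
GRAPHICS_WEIGHT = 0.75
PHYSICS_WEIGHT = 0.25


def ice_storm_scores(gt1_fps, gt2_fps, physics_fps):
    """Return (graphics score, physics score, overall score)."""
    graphics = GRAPHICS_CONSTANT * harmonic_mean([gt1_fps, gt2_fps])
    physics = PHYSICS_CONSTANT * physics_fps
    overall = harmonic_mean([graphics, physics], [GRAPHICS_WEIGHT, PHYSICS_WEIGHT])
    return graphics, physics, overall


if __name__ == "__main__":
    print(ice_storm_scores(30.0, 60.0, 20.0))
```

A harmonic mean is pulled toward the smaller of its inputs, so a device cannot offset a weak result in one test with a very high result in another.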
Basemark X by Basemark Ltd. is built on top of a real-world game engine, Unity 4.2.2, and runs on Android, iOS, and Windows Phone. It includes two different tests, Dunes and Hangar, which stress the GPU with lighting effects, particles, dynamic shadows, shadow mapping, and other post-processing effects found in modern games. With as many as 900,000 triangles per frame, these tests also strain the GPU’s vertex processing capabilities. The tests target DirectX 11 feature level 9_3 on Windows and OpenGL ES 2.0 on Android and iOS.
Both tests are run offscreen at 1920x1080 (for direct comparison of hardware across devices) and onscreen at the device’s native resolution using default settings (antialiasing disabled). The same set of tests is run at both medium- and high-quality settings. Each test reports the average frames per second after rendering the entire scene. The final score is an equal combination of the offscreen Dunes and Hangar tests, normalized to the performance of a Samsung Galaxy S4.
GFXBench by Kishonti is a full graphics benchmarking suite, including two high-level, 3D gaming tests (Manhattan, T-Rex) for measuring real-world gaming performance and five low-level tests (Alpha Blending, ALU, Fill, Driver Overhead, Render Quality) for measuring hardware-level performance. It’s also cross-platform, supporting Windows 8 and Mac OS X on the desktop with OpenGL; Android and iOS with OpenGL ES; and Windows 8, Windows RT, and Windows Phone with DirectX 11 feature level 9/10.
All of the tests are run offscreen at 1920x1080, to facilitate direct comparisons between devices/hardware, and onscreen at the device’s native resolution, to see how the device handles the actual number of pixels supported by its screen. The tests are run using default settings and are broken into three run groups (Manhattan, T-Rex, and low-level tests) with a cooling period in between to mitigate thermal throttling, which can occur if all the tests are run back-to-back.
Manhattan: This OpenGL ES 3.0 based game simulation includes several modern effects, including diffuse and specular lighting with more than 60 lights, cube map reflection and emission, triplanar mapping, and instanced mesh rendering, along with post-processing effects like bloom and depth of field. The geometry pass employs multiple render targets and uses a combination of traditional forward and more modern deferred rendering processes in separate passes. Its graphics pipeline rewards architectures proficient in pixel shading.
T-Rex: This demanding OpenGL ES 2.0 based game simulation is as much a stress test as it is a performance test, pushing the GPU hard and generating a lot of heat. While not as dependent on pixel shading as Manhattan, this test still uses a number of visual effects such as motion blur, parallax mapping, planar reflections, specular highlights, and soft shadows. Its more balanced rendering pipeline also uses complex geometry, high-res textures, and particles.
Alpha Blending: This synthetic test measures a GPU’s alpha-blended overdraw capability by layering 50 semi-transparent rectangles and measuring the frame rate. Rectangles are added or removed until the rendered scene runs steadily between 20 and 25 FPS. Performance is reported in MB/s, which represents the total number of different sized layers blended together, an important metric for hardware-accelerated UIs and games that include translucent surfaces.
This test is highly dependent on GPU memory bandwidth, since it uses high-resolution, uncompressed textures and requires reading/writing to the frame buffer (memory) during alpha blending. It also stresses the back-end of the rendering pipeline (ROPs) rasterizing the frame. Because all of the onscreen objects are transparent, GPUs see no benefit from early z-buffer optimizations.
ALU: This test measures pixel shader compute performance, an important metric for visually-rich modern games, by rendering a scene with rippling water and lighting effects like reflection and refraction. Performance is measured in frames per second. The onscreen results are vsync limited (60fps) for most GPUs, but the offscreen test is still useful for comparing the ALU performance of different devices.
Fill: The fill test measures texturing performance (in millions of texels per second) by rendering four layers of compressed textures. This test combines aspects of both the alpha blending and ALU tests, since it depends on both pixel processing performance and frame buffer bandwidth.
From left to right: Alpha Blending, ALU, Fill
Driver Overhead: This test measures the graphics driver’s CPU overhead by rendering a large number of simple objects one-by-one, continuously changing the state of each object. Issuing separate draw calls for every object stresses both hardware (CPU) and software (graphics API and driver efficiency). While the GPU does render each scene, its impact on the overall score (given in frames per second) is minimal.
Render Quality: This test compares a single rendered frame from T-Rex to a reference image, computing the peak signal-to-noise ratio (PSNR) based on mean square error (MSE). The test value, measured in milliBels, reflects the visual fidelity of the rendered image. The primary purpose of this test is to make sure that the GPU driver is not “optimizing” (i.e. cheating) for performance by sacrificing quality.
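For reference, PSNR follows directly from the mean square error, and one decibel equals 100 milliBels. The sketch below assumes 8-bit channel values passed as flat lists; how GFXBench actually samples and compares the frames is not described here, and the function names are hypothetical.

```python
import math


def mean_square_error(rendered, reference):
    """Average squared difference between corresponding channel values."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)


def psnr_millibels(rendered, reference, max_value=255):
    """Peak signal-to-noise ratio in milliBels (1 dB = 100 mB)."""
    mse = mean_square_error(rendered, reference)
    if mse == 0:
        return math.inf  # identical images, no noise at all
    psnr_db = 10 * math.log10(max_value ** 2 / mse)
    return 100 * psnr_db


if __name__ == "__main__":
    reference = [10, 10, 10, 10]
    rendered = [10, 10, 10, 12]
    print(round(psnr_millibels(rendered, reference), 1))
```

A higher value means the rendered frame is closer to the reference, so a driver that trades image quality for speed would show up as a lower number.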
Battery Life
The defining feature of a mobile device is its ability to function untethered, with no physical connection required for data or power, making battery life a critical performance metric. Knowing which device will last the longest on a full charge is complicated, however, since it’s affected by many different factors, some of which can be gleaned from a spec sheet and some only through testing.
The battery’s storage capacity plays the biggest role, since this limits the total amount of energy available to the system. With only minimal gains in energy density, the only reasonable way to increase capacity is by increasing size. A bigger battery, however, means a bigger and heavier device, so compromises must be made.
The other half of the battery life story is power consumption, which is influenced by hardware, software, and, ultimately, how you actually use your device. From a hardware perspective, there are a number of different components that drain power, including wireless radios, cameras, speakers, and sensors, but the two biggest culprits are the screen and SoC. Screen size and resolution (more pixels generally use more power), panel technology (AMOLED uses less power than LCD to display black), and panel self refresh (local RAM caches the frame buffer so the GPU and memory bus are not required for static images) all influence display power. The display’s brightness level also affects battery life, which is why we calibrate all screens to 200 nits, removing this variable from our results. The SoC’s power consumption is influenced by process technology, number and type of transistors, power gating, and max core frequencies. Dynamic voltage and frequency scaling (DVFS), a system of software drivers that adjust core and bus frequencies, also has a significant impact on battery life.
Designing battery life tests that account for all of a system’s hardware and software influences is difficult enough without considering different usage scenarios. Do you use your phone to play 3D games or to just occasionally check email? Is the screen powered on for hours at a time or just a few minutes several times a day? Do you get a flood of notifications that keep turning the screen on or have apps constantly running in the background? Since everyone uses their devices differently, we cannot tell you how long your battery will last. Instead, we run some worst-case tests to put a lower bound on battery life, and another test modeled after more real-world usage.
PCMark measures system-level performance and battery life by running real-world workloads. The battery life test starts with a full 100% charge and loops the Work performance benchmark (see description in the CPU and System Performance section) until the battery charge reaches 20%. The reported battery life estimates a 95% duty cycle (from 100% to 5% charge remaining) by extrapolating the measured battery life from the test. In addition to showing the battery life in minutes, the overall work performance score (the same value reported in the CPU and System Performance section) is shown again for reference.
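The extrapolation from the measured 80% discharge to a 95% duty cycle amounts to a simple proportion if the drain rate is assumed to be constant. That linearity is our simplifying assumption for this sketch, as is the function name; PCMark's exact method may differ.

```python
def extrapolate_battery_life(measured_minutes, start_pct=100.0, stop_pct=20.0,
                             floor_pct=5.0):
    """Scale a measured runtime to a deeper discharge, assuming linear drain.

    measured_minutes: time taken to go from start_pct down to stop_pct.
    floor_pct: charge level at which the extrapolated run would end.
    """
    measured_span = start_pct - stop_pct   # 80 points with the defaults
    target_span = start_pct - floor_pct    # 95 points with the defaults
    if measured_span <= 0:
        raise ValueError("stop_pct must be below start_pct")
    return measured_minutes * target_span / measured_span


if __name__ == "__main__":
    # A device that takes 400 minutes to drop from 100% to 20%.
    print(extrapolate_battery_life(400.0))
```

With the default values, the measured time is multiplied by 95/80, so a 400-minute run from 100% to 20% is reported as 475 minutes.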
As a system-level test, the power consumption of the CPU, GPU, RAM, and screen all factor into the final battery life number. By running realistic workloads, the DVFS functions just as it would when running common apps, providing a more accurate representation of battery life.
TabletMark 2014 is similar to PCMark in that it measures system-level performance and battery life by running real-world workloads. Its 7-inch or larger screen requirement limits it to tablets only, however.
The battery life test uses three different scenarios, including Web and Email and Photo and Video Sharing, both of which are explained in the CPU and System Performance section. The other test is Video Playback, which loops three one-minute, 1080p H.264 video clips (~60 MB each) three times for nine minutes total playback time.
The device starts with a full 100% charge and loops the following 50 minute script until the battery dies:
- Web and Email workload
- Idle for 3 minutes (screen on)
- Photo and Video Sharing workload
- Idle for 3 minutes (screen on)
- Video Playback workload for 9 minutes
- Idle to 50 minute mark (~8 minutes)
Basemark OS II: Battery
Basemark OS II by Basemark Ltd. includes a battery rundown test in addition to the performance tests discussed earlier in the CPU and System Performance section. The largely synthetic battery test runs a multi-core workload similar to the CPU performance test, and provides a worst-case battery life primarily based on CPU, memory, and display power consumption.
The test calculates ratios and standard deviations for battery percentage consumed per minute. The final score is based on the arithmetic average of these values plus a bonus score based on CPU usage.
GFXBench 3.0: Battery Life
The GFXBench battery life test loops the T-Rex game simulation benchmark (detailed in the GPU and Gaming Performance section) continuously for 60 minutes, starting from a full 100% charge. This provides a worst-case battery life based primarily on GPU, memory, and display power consumption, and is indicative of what you might see while playing an intense 3D game.
Test results are displayed in two different charts: the extrapolated battery life in minutes and the average performance during the test in frames per second. It’s important to see both charts, because looking at only the battery life chart can be misleading; thermal throttling will cause the GPU to run at a lower frequency, leading to better battery life but lower performance.
Thermal Throttling
Running at a higher clock frequency requires a higher voltage, which generates more heat. This heat moves from the core to the SoC package and, eventually, finds its way to the device’s chassis, where it gets dissipated to the surrounding environment. If the core(s) produce heat faster than it can be dissipated, or the external chassis reaches a temperature making it uncomfortable to hold, then the system reduces clock frequency to satisfy thermal constraints. This is what we mean by thermal throttling. Because it reduces performance and can negatively affect the user experience, it’s something our performance testing needs to account for.
GFXBench 3.0: Battery Life
We use the GFXBench 3.0 battery life test to examine GPU thermal throttling, because the T-Rex workload generates a lot of heat. At the end of the test, GFXBench creates a nice diagram that plots battery charge (green line) and performance (blue line) versus time. Since performance is directly linked to the GPU core frequency, this diagram characterizes the device’s GPU thermal throttling over time.
During this test, we use an infrared camera from Seek Thermal to measure the surface temperature of the device. The camera sensor is a vanadium oxide microbolometer type with a 12μm pixel pitch. It has a resolution of 206x156 for a total of 32k thermal pixels, which can detect long-wave infrared of 7.2-13 microns. The camera is rated to be accurate to within 2% at 100 °F, and uses a “black body” shutter that passes in front of the lens periodically for self calibration.
The maximum skin temperature of the rear cover is recorded during the test and compared to other devices in a chart. The thermal image is also provided, because it shows how a device’s structural design helps or hinders heat dissipation.
Final Thoughts
Performance testing complete systems, and mobile devices in particular, is challenging, but it’s one aspect of product reviews we take very seriously. From selecting benchmarks and testing hardware/software to designing our own tests to crafting a methodology for gathering accurate data, a lot of thought and research goes into our testing process to ensure you get useful information you can trust.
Building this trust requires transparency, which is the primary purpose of this article. It also aims to inform; the information in this article provides a deeper understanding when reading our reviews and will, hopefully, help you see through less rigorous “reviews” and data presented by other media outlets and marketing firms, allowing you to make better buying decisions.
We’re never satisfied, however. There are still improvements to make and holes to fill, so how we test mobile devices will continue to evolve. We want you to be a part of this process too. Is there something we’re missing that you would like to see? Do you have an idea for improving our methodology? Let us know.