Tegra 2: Nvidia Goes Mobile
As we’ve mentioned in the past, mobile devices like smartphones and tablets use what’s known as a system-on-chip (SoC). This integrates the processor, GPU, and RAM, along with several other subsystems, onto a single device. Because all of those components sit next to each other on the same chip, data transfers are more efficient and less space is consumed on the PCB.
Specification | Apple A4 (iPad) | Apple A5 (iPad 2) | Tegra 2 (Xoom) |
---|---|---|---|
Processor | 1 GHz ARM Cortex-A8 (single-core) | 1 GHz ARM Cortex-A9 (dual-core) | 1 GHz ARM Cortex-A9 (dual-core) |
Memory | 256 MB 333 MHz LP-DDR (single-channel) | 512 MB 800 MHz LP-DDR2 (dual-channel) | 1 GB 667 MHz LP-DDR2 (single-channel) |
Graphics | PowerVR SGX535 | PowerVR SGX543MP2 | ULP GeForce |
L1 Cache (Instruction/Data) | 32 KB / 32 KB | 32 KB / 32 KB | 32 KB / 32 KB |
L2 Cache | 640 KB | 1 MB | 1 MB |
Tegra is Nvidia’s SoC brand, and it symbolizes the company’s effort to tap into the mobile market beyond its desktop-derived GeForce graphics processors. A lot of engineering is tied up in this initiative, and what we see today in tablets like the Xoom represents the company's second incarnation of Tegra.
You may be asking "What happened to the first Tegra?" Flatly, it was far less impressive, even when it hit the market in 2009. Compared to Apple’s A4, it was a much more conservative design. Nvidia chose the older ARM11 processor, which probably explains the lack of design wins. Microsoft’s Zune HD was the only major product that employed the original Tegra.
Tegra 2 is an entirely different beast. It’s based on the Cortex-A9, which is a generation ahead of the older ARM11. This is the same CPU seen in Apple’s A5 (iPad 2). Read Apple's iPad 2 Review: Tom's Goes Down The Tablet Rabbit Hole for a full discussion of Cortex-A9 performance.
The ultra-low power GeForce isn't just a physically smaller GPU than the A5’s SGX 543MP2. Unlike Nvidia's desktop GPUs, Tegra 2 is based on an architecture that pre-dates its unified shader design. So, you’re looking at four pixel shader cores and four vertex shader cores, each limited to its own type of work. This means Tegra 2 operates most efficiently when it's presented with an even mix of vertex and pixel shader code. We expect Nvidia to address that constraint in Tegra 3 (code named Kal-El).
GPU (System-on-Chip) | PowerVR SGX 535 (Apple A4) | PowerVR SGX 543 (Apple A5) | ULP GeForce (Tegra 2) |
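To illustrate why a split design favors an even workload, here is a minimal sketch of a toy utilization model. It is not a description of how the ULP GeForce schedules work: the function name, the assumption that every unit does one unit of work per unit of time, and the comparison against a hypothetical unified pool of the same size are ours.

```python
def split_design_utilization(vertex_share, vertex_units=4, pixel_units=4):
    """Toy model: fraction of shader units kept busy by a given workload mix.

    vertex_share is the fraction of total shader work that is vertex work
    (0.0 to 1.0); the rest is pixel work. Each unit is assumed to complete
    one unit of work per unit of time, which is a simplification.
    """
    if not 0.0 <= vertex_share <= 1.0:
        raise ValueError("vertex_share must be between 0 and 1")
    pixel_share = 1.0 - vertex_share
    # A split design finishes only when its busier pool finishes.
    split_time = max(vertex_share / vertex_units, pixel_share / pixel_units)
    # A unified pool of the same size could spread the work over every unit.
    unified_time = 1.0 / (vertex_units + pixel_units)
    return unified_time / split_time


for share in (0.0, 0.25, 0.5, 0.75, 1.0):
    utilization = split_design_utilization(share)
    print(f"vertex share {share:.2f}: utilization {utilization:.0%}")
```

In this model, utilization peaks at a 50/50 mix and falls to half when the workload is entirely vertex or entirely pixel work, because one pool of four cores sits idle.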
---|---|---|---|
SIMD | USSE | USSE2 | Core |
Pipelines | 2 (unified) | 4 (unified) | 8 (4 pixel / 4 vertex) |
TMUs | 2 | 2 | 2 |
Bus Width (bit) | 64 | 64 | 32 |
Triangle rate @ 200 MHz | 14 MTriangles/s | 35 MTriangles/s | ? |
The ULP GeForce has a maximum operating frequency of 300 MHz, but device vendors can tweak this setting to save on power. Nvidia provides less information on the Tegra 2 than it does for its desktop GPUs, so it’s best to move on to benchmarks. As in our iPad 2 review, we're turning to GLBenchmark 2.0.
In terms of frames rendered in a set period of time, the Xoom offers more performance than the original iPad, but it still falls short of the iPad 2. Google's first Honeycomb-based tablet renders roughly half as many frames as the iPad 2 in the Pro test, and the gap widens to about 3.7x in the Egypt test.
GPU (System-on-Chip) | PowerVR SGX 535 (Apple A4) | PowerVR SGX 543 (Apple A5) | ULP GeForce (Tegra 2) |
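For readers who want to check the arithmetic, the following sketch computes the frame-count ratios from the Egypt and Pro results (without FSAA) listed in the GLBenchmark 2.0 table in this article. The dictionary layout and the helper name `frame_ratio` are our own choices for illustration.

```python
# Frames rendered in GLBenchmark 2.0 (values from this article's results table).
frames = {
    "Egypt": {"iPad": 575, "iPad 2": 5075, "Xoom": 1371},
    "Pro": {"iPad": 880, "iPad 2": 2897, "Xoom": 1347},
}


def frame_ratio(test, device, baseline):
    """Return how many times as many frames `device` rendered as `baseline`."""
    return frames[test][device] / frames[test][baseline]


for test in frames:
    vs_ipad = frame_ratio(test, "Xoom", "iPad")
    vs_ipad2 = frame_ratio(test, "iPad 2", "Xoom")
    print(f"{test}: Xoom renders {vs_ipad:.2f}x the original iPad's frames; "
          f"iPad 2 renders {vs_ipad2:.2f}x the Xoom's frames")
```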
---|---|---|---|
SIMD | USSE | USSE2 | Core |
Channels | Single | Dual | Single |
Memory Bandwidth | 2.6 GB/s | 17.0 GB/s | 2.6 GB/s |
You can't use fill or triangle rates to draw a direct comparison of how well Tegra 2 utilizes its memory bandwidth, even though it's a quick-and-dirty way of sizing up other mobile GPUs.
According to Intel, the SGX 535 (GMA 500) requires 4.2 GB/s of memory bandwidth to reach a triangle rate of 14 Mtriangles/s, but that's not the result we get in GLBenchmark's triangle test. The iPad's A4 uses 333 MHz LP-DDR, which offers up to 2.6 GB/s of throughput. If you do the math, the ratio of available to required memory bandwidth (2.6/4.2 ≈ 62%) closely matches the ratio of measured to rated triangle rate (8.6/14 ≈ 61%).
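The comparison can be reproduced in a few lines. This sketch uses only the rounded figures quoted in the paragraph above; the variable names are ours.

```python
# Figures quoted in the text for the SGX 535 in Apple's A4.
required_bandwidth_gbs = 4.2    # bandwidth Intel cites for the rated triangle rate
available_bandwidth_gbs = 2.6   # 333 MHz LP-DDR in the iPad
rated_mtriangles = 14.0         # rated triangle rate
measured_mtriangles = 8.6       # triangle rate cited from GLBenchmark

bandwidth_ratio = available_bandwidth_gbs / required_bandwidth_gbs
triangle_ratio = measured_mtriangles / rated_mtriangles

print(f"bandwidth ratio: {bandwidth_ratio:.1%}")
print(f"triangle ratio:  {triangle_ratio:.1%}")
print(f"difference:      {abs(bandwidth_ratio - triangle_ratio):.1%}")
```

The two ratios land within a percentage point of each other, which is the basis for treating memory bandwidth as the limiting factor here.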
In comparison, the iPad 2 uses 800 MHz LP-DDR2 in a dual-channel configuration. This adds up to about 17.0 GB/s of memory bandwidth. GLBenchmark suggests that this isn't enough though, because a single-core SGX 543 should reach 35 Mtriangles/sec. And yet, we only achieve about 30 Mtriangles/sec with our dual-core SGX 543. Adding another core doesn't exactly double performance because it's not a linear scale. However, given our previous experience with desktop GPUs, we suspect that another 30-40% could be squeezed out of the iPad 2's GPU if Apple used higher-performance memory.
We can make this assertion because there is a direct relationship between memory bandwidth and triangle rates in the A4's and A5's PowerVR GPUs, due to their tile-based deferred rendering architecture. Those GPUs operate differently than what we're used to seeing on the desktop. Tegra 2, however, is an entirely different beast. It employs a more traditional z-buffered rendering architecture, like desktop GPUs. That's why it's pointless to compare triangle and fill rates. It's more important to look at the Egypt and Pro benchmarks.
Interestingly, Tegra 2 only employs a single-channel 32-bit LP-DDR2 memory controller. This could be a bottleneck restricting performance, but there is no benchmark we can use to determine that for sure. Then again, we do know that the version of Tegra 2 in the Xoom is somewhat restricted. Motorola wanted to emphasize better battery life, so it capped the Tegra 2's memory clock at 600 MHz. This effectively limits bandwidth to 2.4 GB/s. Nvidia specs the Tegra 2 for up to 667 MHz operation, which means there could be other tablets that offer better performance through a higher data rate.
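The bandwidth figures for Tegra 2 follow from a simple peak-throughput calculation. The sketch below assumes the quoted MHz figure is the effective transfer rate and that every transfer moves one full bus width, which gives a theoretical ceiling rather than a measured number. The function name is hypothetical.

```python
def peak_bandwidth_gbs(rate_mhz, bus_width_bits, channels=1):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    Assumes rate_mhz is the effective transfer rate in millions of
    transfers per second and each transfer moves bus_width_bits bits.
    """
    bytes_per_transfer = bus_width_bits / 8
    return rate_mhz * 1e6 * bytes_per_transfer * channels / 1e9


# Single-channel, 32-bit LP-DDR2 controller as described for Tegra 2.
print(f"Xoom (capped at 600 MHz): {peak_bandwidth_gbs(600, 32):.2f} GB/s")
print(f"Tegra 2 spec (667 MHz):   {peak_bandwidth_gbs(667, 32):.2f} GB/s")
```

Under these assumptions, the 600 MHz cap costs the Xoom roughly 10% of the peak bandwidth Tegra 2 is specified for.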
GLBenchmark 2.0 | Apple iPad | Apple iPad 2 | Motorola Xoom |
---|---|---|---|
Egypt frames (frames) | 575 | 5075 | 1371 |
Egypt with FSAA (frames) | 436 | 5057 | - |
Pro (frames) | 880 | 2897 | 1347 |
Pro with FSAA (frames) | 672 | 2851 | - |
Egypt with FSAA Fixed Time (sec) | 825.6 | 65.0 | - |
Pro with FSAA Fixed Time (sec) | 123.3 | 22.6 | - |
Swap Buffer Test (frames) | 600 | 599 | 603 |
Fill Test (texture fetch) ktexel/s | 170 980 | 918 551 | 129 897 |
Trigonometric Test (vertex weighted) kvertex/s | 1039 | 3326 | 2632 |
Trigonometric Test (fragment weighted) kfragment/s | 1191 | 3512 | 4452 |
Trigonometric test (balanced) kshader/s | 1259 | 3158 | 2543 |
Exponential Test (vertex weighted) kvertex/s | 3130 | 3535 | 2628 |
Exponential Test (fragment weighted) kfragment/s | 3774 | 11 165 | 3003 |
Exponential Test (balanced) kshader/s | 2043 | 11 735 | 1656 |
Common Test (vertex weighted) kvertex/s | 1524 | 3727 | 1973 |
Common Test (fragment weighted) kfragment/s | 1634 | 3699 | 4451 |
Common Test (balanced) kshader/s | 1065 | 4114 | 2530 |
Geometric Test (Vertex Weighted) kvertex/s | 1949 | 3776 | 1316 |
Geometric Test (Fragment Weighted) kfragment/s | 2081 | 6388 | 2888 |
Geometric Test (Balanced) kshader/s | 1281 | 6181 | 1628 |
For Loop Test (Vertex Weighted) kvertex/s | 1671 | 3860 | 1315 |
For Loop Test (Fragment Weighted) kfragment/s | 1842 | 6237 | 7271 |
For Loop Test (balanced) kshader/s | 1275 | 3718 | 3583 |
Branching Test (vertex weighted) kvertex/s | 3906 | 3778 | 2633 |
Branching Test (fragment weighted) kfragment/s | 6045 | 22 557 | 3211 |
Branching Test (balanced) kshader/s | 2106 | 11 193 | 1493 |
Array Test (uniform array access) kvertex/s | 2918 | 3658 | 3946 |
Triangle Test (white) ktriangle/s | 9548 | 29 957 | 12 595 |
Triangle Test (textured, vertex lit) ktriangle/s | 7058 | 21 129 | 10 520 |
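As a quick way to read the shader results, this sketch picks out the fragment-weighted tests in which the Xoom outscores the iPad 2. The values are copied from the table above; the data structure and variable names are ours, and the script does not attempt to explain why those tests favor Tegra 2.

```python
# Fragment-weighted shader tests, in kfragment/s: (iPad 2, Xoom).
fragment_tests = {
    "Trigonometric": (3512, 4452),
    "Exponential": (11165, 3003),
    "Common": (3699, 4451),
    "Geometric": (6388, 2888),
    "For Loop": (6237, 7271),
    "Branching": (22557, 3211),
}

# Keep the tests where the Xoom's score is higher than the iPad 2's.
xoom_leads = [name for name, (ipad2, xoom) in fragment_tests.items() if xoom > ipad2]

for name, (ipad2, xoom) in fragment_tests.items():
    print(f"{name:14s} Xoom/iPad 2 = {xoom / ipad2:.2f}")
print("Xoom leads in:", ", ".join(xoom_leads))
```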