Nvidia Announces Maxwell-Powered Tegra X1 SoC At CES

Nvidia announced the Tegra K1 SoC a year ago at CES 2014, bringing a desktop-caliber GPU architecture to mobile (albeit slimmed down to 192 CUDA cores) along with newfound attention to mobile gaming, and to Android as a gaming platform in particular. When it first launched, Tegra K1 easily surpassed the best tablet SoCs available from ARM, Imagination Technologies, and Qualcomm. One year later, the Tegra K1 still commands a lead in most GPU benchmarks, although its competitors have closed the performance gap and have even matched the K1 in specific tests. Today at CES 2015, Nvidia raises the mobile SoC performance bar once again with Tegra X1. Originally codenamed Erista, the X1 includes new CPU, GPU, and ISP components.

Built on TSMC's 20 nm process, Tegra X1's 64-bit CPU utilizes four ARM A53 cores and four ARM A57 cores. The higher-performing A57 cores each have a 48KB L1 instruction cache and a 32KB L1 data cache, with the four-core cluster sharing a common 2MB L2 cache. The more power-efficient A53 cores each contain their own 32KB L1 instruction cache and 32KB L1 data cache, with the cluster sharing a smaller, common 512KB L2 cache. Nvidia uses a cluster migration scheme for thread management, where either the four A57 cores are active or the four A53 cores are active, but never both at the same time. Thus, the operating system scheduler can only see the four cores inside the active cluster. Within a given cluster, however, individual CPU cores can be throttled or shut down entirely depending on the given workload.

Generally, cluster migration is the least efficient method for thread management, since it's less granular than either the CPU migration scheme (where each cluster contains both a fast and slow core) or a heterogeneous scheme (where all cores are active and available to the operating system scheduler). However, Nvidia claims to have twice the power efficiency at the same performance level as Samsung's Exynos 5433, which also uses a big.LITTLE configuration of A53/A57 cores but uses heterogeneous multi-processing. How Nvidia accomplishes this feat of efficiency is unknown, especially since both CPUs are built on similar, but not identical, 20 nm processes.
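To make the difference in granularity concrete, here is a minimal Python sketch of which cores the operating system scheduler could see under each of the three schemes. The core names, the `visible_cores` function, and the "any demanding thread wakes the big cluster" policy are all hypothetical simplifications for illustration; they are not Nvidia's or Samsung's actual switching logic.

```python
from enum import Enum


class Scheme(Enum):
    CLUSTER_MIGRATION = "cluster migration"
    CPU_MIGRATION = "CPU migration"
    HETEROGENEOUS = "heterogeneous multi-processing"


# Hypothetical core labels for a 4+4 big.LITTLE design.
BIG = ["A57-0", "A57-1", "A57-2", "A57-3"]
LITTLE = ["A53-0", "A53-1", "A53-2", "A53-3"]


def visible_cores(scheme, high_load):
    """Return the cores the OS scheduler can place threads on.

    high_load holds one flag per big/LITTLE pair (four flags here).
    The switching policy is an assumed simplification.
    """
    if scheme is Scheme.CLUSTER_MIGRATION:
        # All or nothing: one demanding thread switches the whole cluster.
        return list(BIG) if any(high_load) else list(LITTLE)
    if scheme is Scheme.CPU_MIGRATION:
        # Each pair independently exposes its fast or its slow core.
        return [b if hot else l for b, l, hot in zip(BIG, LITTLE, high_load)]
    # Heterogeneous: all eight cores are always available.
    return BIG + LITTLE


if __name__ == "__main__":
    load = [True, False, False, False]  # a single demanding thread
    for scheme in Scheme:
        print(f"{scheme.value}: {visible_cores(scheme, load)}")
```

With a single demanding thread, the cluster migration case in this sketch brings up all four A57 cores, while CPU migration swaps in only one of them. That coarseness is the reason cluster migration is generally considered the least efficient of the three.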

CPU Power [mW] vs. SPECint Score (Tegra X1 values based on development platform. Exynos 5433 values based on Galaxy Note 4.) source: Nvidia

It's curious that Nvidia moves back to stock ARM cores rather than using its own 64-bit Denver CPU. Nvidia's motive is simply market timing. Getting the ARM cores onto a 20 nm process was more attainable than porting the new Denver architecture.

Media capabilities also see an improvement with X1. It uses two Image Signal Processors (ISP) to process a total of 1.3 Gpixels/s from up to six camera inputs. It also supports image sensors up to 100 MP, can manage up to 4096 focus points, and has up to 600 Mpixels/s of JPEG encode/decode throughput. As for video, Tegra X1 encodes 4K video at 30 fps in H.264, H.265, or VP8 format in hardware, and can decode 4K H.265 (with 10-bit color depth) and VP9 video at 60 fps, also in hardware. It also supports the HDMI 2.0 interface for external displays.

The Maxwell GPU in Tegra X1 looks similar to the GM204 used in the GTX 980 graphics cards. In order to squeeze GM204 into a mobile TDP, however, the number of Graphics Processing Clusters (GPC) is reduced from four in GM204 to one in X1. Additionally, whereas each GPC in GM204 contains four Streaming Multiprocessors (SM), the GPC in X1 has only two SM blocks. This gives Tegra X1 256 total CUDA cores, up from 192 in Tegra K1.

GPU	Tegra K1	Tegra X1
Architecture	Kepler	Maxwell
Manufacturing Process	28 nm	20 nm
SMs	1	2
CUDA Cores	192	256
GFLOPS (FP32) Peak	365	512
GFLOPS (FP16) Peak	365	1024
Texture Units	8	16
Texel Fill-Rate	7.6 Gtexels/s	16 Gtexels/s
GPU Clock	~ 950 MHz	~ 1000 MHz
Memory Clock	930 MHz (LPDDR3)	1600 MHz (LPDDR4)
Memory Bandwidth	14.9 GB/s	25.6 GB/s
ROPs	4	16
L2 Cache Size	128KB	256KB
Z-cull	256 pixels/clock	256 pixels/clock
Raster	4 pixels/clock	16 pixels/clock
Texture	8 bilinear filters/clock	16 bilinear filters/clock
ZROP	64 samples/clock	128 samples/clock
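Several of the peak figures in the table follow from the core counts and clocks. The Python sketch below reproduces them under stated assumptions: one fused multiply-add (two FLOPs) per CUDA core per clock, two FP16 operations per lane on X1 (inferred from the table's 2x FP16 figure), and a 64-bit double-data-rate memory interface, which the table does not state. The function names are ours, chosen for illustration.

```python
def peak_gflops(cuda_cores, clock_ghz, ops_per_lane=1):
    """Peak GFLOPS: cores x 2 FLOPs per fused multiply-add x clock in GHz."""
    return cuda_cores * 2 * clock_ghz * ops_per_lane


def texel_rate(texture_units, clock_ghz):
    """Peak texel fill rate in Gtexels/s, one texel per unit per clock."""
    return texture_units * clock_ghz


def bandwidth_gbs(memory_clock_mhz, bus_bits=64):
    """Peak bandwidth in GB/s for double-data-rate memory.

    bus_bits=64 is an assumption; the table does not give the bus width.
    """
    transfers_per_second = memory_clock_mhz * 1e6 * 2
    return transfers_per_second * bus_bits / 8 / 1e9


if __name__ == "__main__":
    # Clocks are the approximate values from the table.
    print("K1 FP32 GFLOPS:", round(peak_gflops(192, 0.95), 1))
    print("X1 FP32 GFLOPS:", round(peak_gflops(256, 1.0), 1))
    print("X1 FP16 GFLOPS:", round(peak_gflops(256, 1.0, ops_per_lane=2), 1))
    print("K1 Gtexels/s:", round(texel_rate(8, 0.95), 1))
    print("X1 Gtexels/s:", round(texel_rate(16, 1.0), 1))
    print("K1 GB/s:", round(bandwidth_gbs(930), 2))
    print("X1 GB/s:", round(bandwidth_gbs(1600), 2))
```

Under those assumptions the results line up with the table: roughly 365 and 512 GFLOPS at FP32, 1024 GFLOPS at FP16 for X1, 7.6 and 16 Gtexels/s, and about 14.9 and 25.6 GB/s.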

So what does all this mean for performance? For starters, Tegra X1 is the first mobile SoC to exceed one TeraFLOPS of peak FP16 throughput and 500 GFLOPS at FP32. That is quite an accomplishment next to ASCI Red, the first supercomputer to break one TeraFLOPS on the LINPACK benchmark and the world's fastest supercomputer until the year 2000. ASCI Red occupied almost 1600 square feet of space, used 9,298 Intel Pentium Pro processors running at 200 MHz, and required 850 kW of power. Of course, one simple benchmark doesn't tell the whole story, but clearly we've made some progress over the past decade and a half.

Nvidia also invited us to a benchmarking session where we saw Tegra X1 development boards running Android Lollipop and several graphics and system benchmarks. Note that the Tegra X1 performance numbers in the charts below were not obtained with our specific versions of these benchmarks and we couldn't verify clock speeds. The SoC itself had a small heatsink (no fan), which Nvidia said represented the typical thermal dissipation capability of a tablet chassis.

3DMark Ice Storm Unlimited

Test	Nvidia Tegra X1 (Dev Board)	Nvidia Tegra K1 (Shield Tablet)	Apple A8x (iPad Air 2)	Adreno 420 (Samsung Galaxy Note 4)	Benchmark Units
Overall Score	43860	30545	21708	19684	score
Graphics	58448	35588	31525	20298	score
Physics	23410	20418	10388	17802	score
Graphics Test 1	285.4	212.0	147.9	102.9	fps
Graphics Test 2	229.0	121.8	127.8	77.5	fps
Physics Test	74.3	64.8	33.0	56.5	fps

The Maxwell-based Tegra X1 easily outperforms the latest GPUs from Imagination and Qualcomm in this benchmark. Its overall score is also almost 44 percent higher than that of its predecessor, the Tegra K1.

GFXBench 3.0

Test	Nvidia Tegra X1 (Dev Board)	Nvidia Tegra K1 (Shield Tablet)	Apple A8x (iPad Air 2)	Adreno 420 (Samsung Galaxy Note 4)	Benchmark Units
Manhattan Offscreen	65.8	30.8	32.6	18.0	fps
T-Rex Offscreen	124.2	70.0	70.4	19.0	fps
ALU Offscreen	455.2	273.0	184.3	151.5	fps
Alpha Blending Offscreen	21888	4249	17229	11882	MB/s
Fill Offscreen	12197	5830	7606	7582	MTexels/s
Driver Overhead Offscreen	63.0	52.0	105.9	27.0	fps

While the A8x in the iPad Air 2 matches the performance of the Tegra K1 in several sub-tests, it's easily outclassed by the new Tegra X1. We see a greater than 2x improvement over the K1 in Manhattan and about a 1.8x increase in T-Rex. The Alpha Blending and Fill rates see a dramatic improvement too, due partially to the significant increase in memory bandwidth.
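The generational gains quoted above can be checked directly against the GFXBench table. This short Python sketch divides each Tegra X1 result by the corresponding Tegra K1 result; the values are copied from the table, and the `speedup` helper is just an illustrative name.

```python
# (Tegra X1, Tegra K1) results copied from the GFXBench 3.0 table.
gfxbench = {
    "Manhattan Offscreen": (65.8, 30.8),
    "T-Rex Offscreen": (124.2, 70.0),
    "ALU Offscreen": (455.2, 273.0),
    "Alpha Blending Offscreen": (21888, 4249),
    "Fill Offscreen": (12197, 5830),
    "Driver Overhead Offscreen": (63.0, 52.0),
}


def speedup(x1, k1):
    """Ratio of the X1 result to the K1 result (higher is better)."""
    return x1 / k1


ratios = {test: speedup(x1, k1) for test, (x1, k1) in gfxbench.items()}

if __name__ == "__main__":
    for test, ratio in ratios.items():
        print(f"{test}: {ratio:.2f}x")
```

The ratios come out to about 2.1x for Manhattan and 1.8x for T-Rex, with Alpha Blending showing the largest jump at a little over 5x.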

In addition to better performance, X1 also sees a significant improvement in energy efficiency. To prove this point, Nvidia monitored the power rail current from both the X1 development board and an iPad Air 2. Since the X1 performs better, it was underclocked to match the GPU performance of the A8x.

In this example, the average current draw for the X1 is 44 percent lower than for the A8x. We'll withhold judgement until we can run some additional battery drain benchmarks of our own; however, this result is encouraging.
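As a rough back-of-the-envelope check, the current reduction can be turned into an implied performance-per-watt ratio. The sketch below assumes the two GPUs really are delivering matched performance and that both power rails sit at the same voltage, so that current tracks power; Nvidia's demo did not confirm the second point, and the function name is ours.

```python
def implied_efficiency_gain(current_reduction):
    """Performance-per-watt ratio at matched performance.

    current_reduction is the fractional drop in average current (0.44 here).
    Assumes equal rail voltage on both devices, which is not confirmed.
    """
    if not 0.0 <= current_reduction < 1.0:
        raise ValueError("current_reduction must be in [0, 1)")
    return 1.0 / (1.0 - current_reduction)


if __name__ == "__main__":
    print(f"{implied_efficiency_gain(0.44):.2f}x")
```

Under those assumptions, drawing 44 percent less current for the same work implies roughly 1.8 times the performance per watt.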

Based on these early results, it appears that Nvidia has once again raised the bar for performance and power efficiency. The X1 is currently in production, and while Nvidia wasn't prepared to disclose any upcoming products using its new SoC, we certainly expect to see the first device ship in the first half of the year, if not Q1. We also inquired about the prospect of seeing X1 in smartphones in addition to tablets. Again, Nvidia was noncommittal but seemed to hint that it's possible. With the tablet version running in the 4-5W range, it's conceivable that dialing back clock speed and sacrificing some performance, which appears to be in ample supply, could get the TDP low enough to work in a smartphone. It looks like it's going to be another exciting year for mobile.

10 Comments from the forums

Cash091

The K1 is still a beast!! I really hope they can bring something to the smartphone world. I would love the ability to stream PC games to my smartphone. Would need a 5.5+ inch screen though, IMO.
Reply
hEt lEyd

i think with those info's (1024 GFLOPS, +60 in Manhattan) it will be more powerfull than Intel Broadwell-U HD6000 and HD6100. what's goin on here?!
Reply
anxiousinfusion

i think with those info's (1024 GFLOPS, +60 in Manhattan) it will be more powerfull than Intel Broadwell-U HD6000 and HD6100. what's goin on here?!

Depending on the test, which Broadwell-U may be measured in, Tegra X1 may only be rated 512 peak.

Reply
alextheblue

Hey look, unreleased hardware is faster than current hardware on the market!

I look forward to an independent review complete with benchmarks... when it's in a shipping tablet. :)
Reply
photonboy

Cash091,
Streaming still needs to be done locally to get a good experience (i.e. local AC network) not using your PC as a remote server.

Also, not sure streaming to a PHONE is a good idea since controller support wouldn't really be there. That's why you need an NVidia Shield.
Reply
alextheblue

Cash091,
Streaming still needs to be done locally to get a good experience (i.e. local AC network) not using your PC as a remote server.
Yes.

Also, not sure streaming to a PHONE is a good idea since controller support wouldn't really be there. That's why you need an NVidia Shield.
No. Not necessarily. Plenty of tablets and phones have support for controllers. Heck if you get a Windows tablet you've got even more options for controllers (and other input devices). The only time a Shield might be strictly better is if you're not streaming games, but rather running them locally on the tablet itself.
Reply
TripleHeinz

I'd wish Nintendo to build their next handheld with a SOC like the X1...keeping the ARM A9 & A11 for backwards compatibility with previous gen consoles' games.
Anybody can dream on the internet these days.
Reply
007agentHP

4k phone anyone?
Reply
Abram730

i think with those info's (1024 GFLOPS, +60 in Manhattan) it will be more powerfull than Intel Broadwell-U HD6000 and HD6100. what's goin on here?!

Depending on the test, which Broadwell-U may be measured in, Tegra X1 may only be rated 512 peak.

Yep, the demo needs 1 TFLOP, but they run color compressed to FP16, so that's where the 1 TFLOP comes from, but they are measured based on FP32.. so 512 GFLOP

Their compression is great though. The demo looked great, so there can be multiplatform games for mobile.

Cash091,
Streaming still needs to be done locally to get a good experience (i.e. local AC network) not using your PC as a remote server.

Also, not sure streaming to a PHONE is a good idea since controller support wouldn't really be there. That's why you need an NVidia Shield.

I stream from 3000 miles away all the time without latency issues. You clearly haven't used grid or Nvidia's game streaming. Web browsing in front of my computer is where the latency is. I still haven't found a PC that can web browse without lagging out.
Reply
g00ey

Devices with this chip ought to surpass the PS3/XB360 in terms of performance, even when accounting for additional OS overhead. It would be interesting to see PS3/XB360 games be ported to such devices...
Reply
