There will clearly be limitations to what Tegra K1 runs smoothly. Yes, it leverages the Kepler architecture. But the specific implementation is necessarily distilled in order to fit within a constrained power budget. In essence, we’re looking at one Streaming Multiprocessor (SMX) built into a single Graphics Processing Cluster. The SMX contains 192 CUDA cores. Instead of the 16 texture units you find on the desktop, Nvidia pares Tegra K1’s SMX down to eight. And whereas each ROP partition outputs eight pixels per clock in, say, GK104, Tegra K1 drops to four.
Although Nvidia didn't give us a specific clock rate for its graphics complex, one of its slides mentions a 365 GFLOPS peak shader performance figure. With 192 shaders each retiring two floating-point operations per clock (one fused multiply-add), that'd put frequency right around 950 MHz.
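The clock estimate is easy to retrace. The short Python sketch below is our own back-of-the-envelope check, not anything Nvidia published; it assumes each CUDA core retires one fused multiply-add (counted as two floating-point operations) per clock, and the function and constant names are ours.

```python
# Estimate the GPU clock implied by a peak shader throughput figure.
PEAK_GFLOPS = 365.0            # from Nvidia's slide
CUDA_CORES = 192               # one SMX
FLOPS_PER_CORE_PER_CLOCK = 2   # assumption: one fused multiply-add = 2 FLOPs


def implied_clock_mhz(peak_gflops, cores, flops_per_core_per_clock):
    """Return the clock rate (in MHz) needed to reach peak_gflops."""
    flops_per_clock = cores * flops_per_core_per_clock
    clock_hz = peak_gflops * 1e9 / flops_per_clock
    return clock_hz / 1e6


clock_mhz = implied_clock_mhz(PEAK_GFLOPS, CUDA_CORES, FLOPS_PER_CORE_PER_CLOCK)
print(f"Implied GPU clock: {clock_mhz:.1f} MHz")  # Implied GPU clock: 950.5 MHz
```

If the peak figure were quoted under a different operations-per-clock convention, the implied frequency would shift accordingly.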
Some of the other changes needed to enable Kepler on Tegra K1 are more difficult to illustrate. In short, if you look at the GPU block diagram, everything in grey, representing the fabric by which components of the engine communicate with each other, was replaced to optimize for efficiency. Although Nvidia constructed its next-generation Maxwell architecture with mobile in mind, you will continue to see the company utilize distinctly different fabrics to build its mobile and scaled-up GPUs, balancing performance and power consumption.
Learn About Nvidia's SMX
If you want to know more about Nvidia’s nomenclature and how its SMX appears in a discrete graphics architecture, check out GeForce GTX 680 2 GB Review: Kepler Sends Tahiti On Vacation.
Nvidia is quick to point out that it didn’t handicap certain other features of the architecture. For example, even though tessellation is exposed through DirectX 11 and OpenGL, the same second-gen PolyMorph engine found in desktop Kepler-based GPUs is still part of the SMX. This isn’t the first we’ve heard of DirectX 11-compliant tessellation enabled in hardware—Qualcomm’s Snapdragon 805 with Adreno 420 graphics is also equipped with hull, domain, and geometry shader support, as are Vivante’s licensable Vega cores. Nvidia is confident that its implementation is best, but there’s simply no way to test the company’s claims right now. We suspect, however, that industry-wide adoption of features like tessellation and geometry shading will make developers more likely to utilize those capabilities in next-gen games.
GPU-accelerated path rendering is another technology that Nvidia experimented with on its big GPUs first (back in 2011, in fact), and is now trying to advocate in the mobile world. Briefly, path rendering is involved in resolution-independent 2D graphics: content like PostScript, PDFs, TrueType fonts, Flash, Silverlight, and HTML5 Canvas, along with the Direct2D and OpenVG APIs. It’s historically a CPU-oriented task. The artifacts of this are painfully obvious in the mobile space, though. When I pinch to zoom on a Web page using my first-gen iPad, I can let go and count several seconds as the A4 SoC re-rasterizes the scene. During this time, the text remains blurry. My iPad Mini’s A5 handles the task better; fonts sharpen almost instantly after letting go. But so long as my fingers remain pinched, the blur persists. Now, Nvidia’s saying that accelerated path rendering gets rid of that, simultaneously conferring certain power-oriented benefits since the CPU isn’t touching the scene.
Perhaps sensitive to Qualcomm’s disclosure that Snapdragon 805 sports a 128-bit memory interface supporting LPDDR3-1600 memory (128-bit divided by eight, multiplied by 1600 MT/s, equals 25.6 GB/s), Nvidia is eager to assure us that the 17 GB/s enabled by its 64-bit bus populated with 2133 MT/s memory is still ample. Of course, raw bandwidth is an important specification. However, Nvidia carries over architectural features from Kepler that benefit Tegra beyond its spec sheet. A 128 KB L2 cache is one example, naturally alleviating demands on the DRAM in situations where references to already-used data result in a high hit rate. And because the cache is unified, whatever on-chip unit is experiencing locality can use it. A number of rejection and compression technologies also minimize memory traffic, including on-chip Z-cull, early Z and Z compression, texture compression (including DXT, ETC, and ASTC), and color compression.
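Both bandwidth figures fall out of the same formula: bus width in bytes multiplied by the transfer rate. Here is a minimal Python sketch of that arithmetic; the function and variable names are ours, and the result is theoretical peak bandwidth only, saying nothing about what either SoC sustains in practice.

```python
def peak_bandwidth_gbs(bus_width_bits, transfer_rate_mts):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1000 MB)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mts / 1000  # MB/s -> GB/s


snapdragon_805 = peak_bandwidth_gbs(128, 1600)  # 128-bit LPDDR3-1600
tegra_k1 = peak_bandwidth_gbs(64, 2133)         # 64-bit bus at 2133 MT/s

print(f"Snapdragon 805: {snapdragon_805:.1f} GB/s")
print(f"Tegra K1:       {tegra_k1:.1f} GB/s")
print(f"Tegra K1 / Snapdragon 805: {tegra_k1 / snapdragon_805:.2f}")
```

On paper, then, Tegra K1 has roughly two-thirds of the Snapdragon 805's peak bandwidth, which is the gap Nvidia's caching and compression features are meant to offset.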
Some of those capabilities even extend beyond 3D workloads into layered user interfaces, where bandwidth savings pave the way for higher-res outputs (and perhaps explain why most of the Tegra 4-based devices we’ve seen to date employ lower resolutions). New to Tegra K1 is delta-encoded compression, which uses comparisons between blocks of pixels to reduce the footprint of color data. Nvidia is also able to save bandwidth on UI layers with a lot of transparency; the GPU recognizes clear areas and skips that work completely. We’ll naturally get a better sense of how Tegra’s memory subsystem affects performance with hardware to test. For now, Nvidia insists elegant technology is just as effective as brute force.
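Nvidia hasn't detailed how its delta-encoded color compression works, so the Python sketch below is not its algorithm. It is a toy illustration of the general idea behind delta encoding, applied to a single row of 8-bit values rather than blocks of pixels: store one reference value, then only the differences between neighbors. In smooth regions those differences are small and repetitive, which is what makes them cheaper to represent. All names here are hypothetical.

```python
def delta_encode(pixels):
    """Keep the first value, then each value's difference from its predecessor."""
    if not pixels:
        return []
    return [pixels[0]] + [b - a for a, b in zip(pixels, pixels[1:])]


def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# A smooth gradient, typical of UI backgrounds (illustrative data only).
row = [200, 200, 201, 201, 202, 202, 202, 203]
encoded = delta_encode(row)

print(encoded)                       # [200, 0, 1, 0, 1, 0, 0, 1]
print(delta_decode(encoded) == row)  # True
```

The encoding is lossless, and every value after the first fits in a bit or two instead of a full byte. A real hardware scheme would also need a fallback for noisy blocks where the deltas are no smaller than the raw data.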
Tegra K1 additionally inherits the Kepler architecture’s support for heterogeneous computing. Up until now, the latest PowerVR, Mali, and Adreno graphics designs all facilitated some combination of OpenCL and/or Renderscript, isolating Nvidia’s aging mobile architecture as the least flexible. That changes as Nvidia enables CUDA, OpenCL, Renderscript, Rootbeer (for Java), and a number of other compute-oriented languages on its newest SoC.
How Do You Scale Kepler Down Under 2 W?
At first glance, the math doesn’t add up. The GK104 GPU in Nvidia’s GeForce GTX 680 contains eight SMX blocks and is rated for roughly 200 W. Sure, there are four ROP partitions, a 256-bit memory bus, and twice as many texture units per SMX. Still, you’re looking at a factor-of-10 difference, at least, between Kepler as it appears in Tegra and a theoretical single-SMX discrete GPU. How is that reconciled?
Nvidia’s Jonah Alben used GeForce GT 740M, a 19 W part with two SMXes, to illustrate. Memory I/O and PCI Express 3.0 are responsible for roughly 3 W of the GPU’s power budget. Leakage accounts for about 6 W. Subtract both from 19 W and about 10 W remains. Because GK107 is a dual-SMX design, divide by two for the power of a single block, giving us 5 W. From there, consider that Nvidia is able to turn up the voltage and clock rates of its discrete GPUs to fill an allowable power envelope. Through voltage scaling, it’s possible to dial back to 2 W or so, which is where Tegra K1’s GPU lands.
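Alben's budget can be retraced in a few lines of Python. The first half uses only the numbers quoted above. The second half is our own assumption, not something Nvidia stated: it applies the textbook approximation that dynamic power scales with voltage squared times frequency, and that frequency scales roughly linearly with voltage, to estimate how far voltage would have to drop to get from 5 W to 2 W.

```python
# Figures quoted for GeForce GT 740M (GK107).
TOTAL_W = 19.0     # board power rating
IO_W = 3.0         # memory I/O and PCI Express 3.0
LEAKAGE_W = 6.0    # leakage
SMX_COUNT = 2      # GK107 is a dual-SMX design
TARGET_W = 2.0     # approximate Tegra K1 GPU power

remaining_w = TOTAL_W - IO_W - LEAKAGE_W   # what is left for the SMXes
per_smx_w = remaining_w / SMX_COUNT        # power of a single block

# Assumption (ours): P ~ V^2 * f with f ~ V, so P ~ V^3.
power_ratio = TARGET_W / per_smx_w
voltage_scale = power_ratio ** (1 / 3)

print(f"Remaining after I/O and leakage: {remaining_w:.0f} W")
print(f"Per SMX: {per_smx_w:.0f} W")
print(f"Voltage (and clock) scale for {TARGET_W:.0f} W: {voltage_scale:.2f}x")
```

Under that simplified cubic model, running at roughly three-quarters of the discrete part's voltage and clock would be enough to reach the 2 W range. Real silicon won't follow the curve exactly.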
Maximizing the design’s efficiency is naturally quite a bit more complex than that description conveys. Multi-level clock gating ensures that, throughout the GPU, logic not needed at any given time is turned off. There are also two levels of power gating to cut current in the chip or at the regulator. Inside the SoC, Nvidia’s engineers had to optimize interconnects and data paths, as mentioned, trading performance for power where it made the most sense.
Nvidia presented its own benchmark results from the upcoming GFXBench 3.0, graphing framerate against power consumption in the 1080p Manhattan off-screen test. It chose Apple’s iPhone 5s and Sony’s Xperia Z Ultra as comparison points, targeting the A7 and Snapdragon 800 SoCs. At a constant performance level, the company claims a 1.5x performance per watt advantage over both with the application processor and DRAM power summed.