Just for today, let’s play pretend. We’ll pretend that a technology’s specs directly translate into market success. We’ll pretend that a company’s level of enthusiasm correlates to the public’s. We’ll pretend that Tegra 4 brought good news for Nvidia.
Although Tegra 4’s hardware looked reasonably good on paper, although the execs we talked to a year ago were ultra-amped about their SoC’s prospects, and although we were expecting tablet and smartphone design wins aplenty, the list of relevant devices remains painfully short.
Was it the missing API support? Nvidia’s delivery schedule? Qualcomm’s broad product line that lets its partners pick through a number of different application processors with or without LTE? Power? Or the lack of a well-established modem? The answer likely involves a combination of factors. But again, we have to push on as if the past is no determinant of the future.
Stylized depiction of the Tegra K1 die
Why? Because Nvidia is taking the wraps off of its Tegra K1 SoC—a mobile platform it considers so fundamentally improved from anything prior that, at the last second, it changed course and shelved the expected Tegra 5 branding. Company executives are ratcheting up the excitement beyond where it was for Tegra 4. They’re saying design wins are a certainty. And all of the communication thus far ignores cellular connectivity, keeping the focus on Tegra K1’s processing, graphics, and imaging.
So, we check history at the door and dig into this generation’s specifications. After all, Tegra K1 represents the first time that Nvidia’s GeForce and mobile architectures converge—and GeForce is where the company put itself on the map. Vendors are already taking notice, too. In the hours before Nvidia's Sunday night reveal, our team on the ground at CES was already meeting with Lenovo, which disclosed the ThinkVision 28, an Ultra HD display running Android with "next-gen" Tegra inside. We think this is Nvidia's first design win (and one driving a 3840x2160 panel, no less). The device itself is a little obscure, but what matters more is that partners are willing to take Tegra K1 and run right out of the gate.
Meet The Tegra K1 SoC
Before we dive deeper into the Tegra K1 SoC’s subsystems, here’s an overview that looks a lot like what Nvidia presented when it launched Tegra 4.

Last night, Nvidia announced two versions of Tegra K1. The first version we knew was coming. That's what you see above. It's a 4+1 architecture consisting of four “big” Cortex-A15 cores and one Cortex-A15 “battery saver” core with 2 MB of L2 cache. The second version was a surprise. It's pin-compatible, but instead sports two 64-bit cores based on the company's Project Denver, first discussed back in 2011.
Nvidia wouldn't answer any of our questions about the Denver-based model, except to say that it introduces 64-bit support, is a much wider seven-way superscalar design (versus Cortex-A15's three-way), runs at up to 2.5 GHz, features 128 KB of L1 instruction and 64 KB of L1 data cache, and of course is a custom design based on the ARMv8 architecture.

The GPU is significantly overhauled, no longer composed of separate programmable vertex and pixel shaders, but rather built using the same Kepler architecture prevalent across Nvidia’s GeForce family. Tegra K1 sports 192 CUDA cores, and of course, because they’re so fundamentally different, you can’t compare them to Tegra 4’s 24 vertex and 48 pixel shaders. Implementing Kepler in Tegra K1 does enable OpenGL ES 3.0 support, though, which was notably missing from Tegra 4. DirectX 11-, OpenGL 4.4-, OpenCL 1.1-, and CUDA-based apps will also run on Tegra K1.
Imaging was clearly a focus of the Tegra 4 launch, although I’d argue that we still haven’t seen the real potential of Nvidia’s ISP. Nevertheless, the company makes significant improvements to this subsystem for Tegra K1, boosting peak throughput to 1.2 gigapixels/s (up from a claimed 350 megapixels/s) and supporting sensors as large as 100 megapixels.
Tegra 4’s fixed-function video encode and decode pipelines were limited to 2160p at 24 FPS; Tegra K1 pushes that to 2160p at 30 FPS. Whereas Qualcomm’s Snapdragon 805 is expected to support HEVC decoding in hardware, Nvidia makes no mention of acceleration in Tegra K1, unfortunately.
The SoC’s display controller is improved. Nvidia says Tegra K1 supports the same 4x2 DSI, but also includes eDP, LVDS, and HDMI 1.4b support to drive up to 4K panels and 4K external devices. Tegra 4 required an external DSI-to-eDP bridge chip for DisplayPort connectivity.
Tegra K1 is manufactured using TSMC’s 28 nm HPM process, the same node as Qualcomm’s Snapdragon 800, which already powers a number of shipping devices. In comparison, Tegra 4 utilized 28 nm HPL, a high-performance, low-leakage technology not as well suited to mobile SoCs.
Nvidia and Samsung both utilize cores designed by ARM, while Qualcomm and Apple build their own cores using ARM’s instruction set. Tegra 4 was Nvidia’s first effort based on Cortex-A15, and it ran at clock rates up to 1.9 GHz. The company sticks with Cortex-A15 in its 32-bit Tegra K1, but makes improvements that it claims facilitate up to 40% more performance at the same power level, or the same performance point at 45% of the power, as measured in SPECint2000.
Nvidia's four Cortex-A15 cores share a 16-way set associative L2 cache
Those are some fairly substantial claims for the same processor generation. Nvidia says they’re the product of three factors. First, its engineers have all of that time building Tegra 4 under their belts. Although this is Cortex-A15, there are purportedly some layout-related optimizations unique to Nvidia’s implementation. The shift from TSMC’s 28 nm HPL to HPM process brings dynamic power down a bit as well. Finally, ARM is on its fourth revision of Cortex-A15: Tegra K1 employs r3p3, whereas Tegra 4 was based on r2. According to ARM’s technical documentation, most of the changes between those revisions are related to regional clock gating and a couple of other configurable power-saving options. Estimates put the resulting performance/watt improvement between 5 and 10%. As a result, Nvidia reaches clock rates as high as 2.3 GHz with the quad-core Tegra K1.
The flexibility to choose between higher performance and lower consumption means that Nvidia can allocate its power budget more freely than the generation prior, scaling back on the CPU in favor of its graphics engine, for example. In graphics-bound applications, this is precisely what you’d want to see. Doubly so given Tegra K1’s broader API support, which at least makes it technically feasible for developers to port higher-end titles down to Android-based tablets.

Far less was said about the dual-core, Denver-based model, except that its wide superscalar execution pipeline should facilitate notably better single-threaded performance at clock rates of up to 2.5 GHz.
Alright, so, we have revised Cortex-A15 cores, Nvidia's own Denver cores coming sometime later, and a more appropriate variation of the 28 nm manufacturing process improving the performance-per-watt of Tegra K1 compared to its predecessor. That story could turn out to be very good, but we suspect it isn't strong enough on its own to coax partners away from entrenched incumbents.
Nvidia is instead making a move a lot of enthusiasts expected in the Tegra 4 timeframe: it’s converging Tegra’s graphics technology with the same design pervasive through its HPC, workstation, desktop, and notebook products. Gone are the programmable vertex and pixel shaders, which Nvidia previously told me were the right decision at the time for Tegra 4, replaced by the company’s Kepler architecture.
What Nvidia’s reps couldn’t tell me back when “Wayne” launched was that they built Kepler in such a way to facilitate scalability from the 250 W discrete GPUs we see in workstations down to the sub-2 W version that shows up in Tegra K1. Kepler was the missing piece Nvidia needed to phase out its older mobile architecture.
Moving forward, every GPU architecture Nvidia develops will be mobile-first. That was a decision management made during Kepler’s design and then applied to Maxwell from the start. It doesn’t mean Maxwell will show up in Tegra first (in fact, the first Maxwell-powered discrete GPUs are expected in the next few weeks). But the architecture was approached with its mobile configuration and power characteristics in mind, scaling up from there. Sounds like a gamble for a company so reliant on the success of its big GPUs, right? Nvidia says that the principles applied to getting mobile right will be what help it maximize the efficiency of its discrete products moving forward—and we’ll have the hardware to put those claims to the test once GeForce GTX 750 materializes.
Here’s the other side of that coin from a group of guys who do nearly all of their gaming on PCs: do smartphones and tablets really need capabilities similar to desktops? We sit down to a mouse and keyboard for story-driven content or all-night multiplayer marathons, and kick back on the couch for more casual console gaming. Given small screens and interface limitations, is there a real reason to make powerful graphics the front-and-center priority? According to Nvidia’s data, yes. A majority of the revenue earned through Google Play comes from games. A majority of time spent on tablets is spent gaming. And a majority of game developers are targeting mobile devices (slightly more than PCs, even). Although we joke that tablet gaming happens in the bathroom, leaving little time for an immersive experience, perhaps ample graphics horsepower and the right API support are what push this segment to the next level.
Naturally, Nvidia needs to count on gaming continuing as the dominant tablet workload. Graphics is its wheelhouse, after all. With the company’s old mobile architecture set aside, it can push that agenda forward using the same software tools developers leverage in the desktop and professional spaces. Whether or not ISVs move massively successful games like GTA over from previous-gen consoles to Android is a business decision. But Nvidia makes this possible in hardware and through its various development environments on the software side. The individual tools aren’t particularly relevant to enthusiasts. What they mean, however, is that titles written for DirectX 11, OpenGL, any version of OpenGL ES, CUDA, and OpenCL 1.1 can be brought over using Windows, Linux, and OS X.
It’s hard not to notice that all of this is a sharp contrast to last year’s message, which was that Tegra 4 lacked OpenGL, OpenGL ES 3.0, CUDA, and DirectX 11 support, but that it was optimized for available apps, making its feature set perfectly suitable. We much prefer Tegra’s prospects going into 2014, with the onus on Android and Windows RT developers to exploit Nvidia’s hardware. If this was a story about any other hardware vendor, it might seem far-fetched to expect titles tuned for a specific platform. But Nvidia tends to play the relationship game well; there is already plenty of content with Tegra-oriented optimizations.
Screen shot from Nvidia's UE 4 demo
As if to drive its point home, Nvidia showed off a demonstration of the Unreal Engine 4’s feature set (which includes several line items that transcend OpenGL ES 3.0) running on a Tegra K1 reference tablet in the Tegra Note 7’s chassis. It took the engine, ported it to Android, implemented an OpenGL 4.4 renderer, and now UE4-based content runs on Tegra K1. Also on display were Serious Sam 3 and Trine 2—both 2011-era titles that looked great on Nvidia’s samples.
There will clearly be limitations to what Tegra K1 runs smoothly. Yes, it leverages the Kepler architecture. But the specific implementation is necessarily distilled in order to fit within a constrained power budget. In essence, we’re looking at one Streaming Multiprocessor built into a single Graphics Processing Cluster. The SMX contains 192 CUDA cores. Instead of 16 texture units, which is what you find on the desktop, Nvidia pares Tegra K1’s SMX down to eight. And whereas each ROP outputs eight pixels per clock in, say, GK104, Tegra K1 drops to four.
Although Nvidia didn't give us a specific clock rate for its graphics complex, one of its slides mentions a 365 GFLOPS peak shader performance figure. With 192 shaders, that'd put frequency right around 950 MHz.
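For the curious, that estimate falls out of simple arithmetic, assuming each CUDA core retires one fused multiply-add (two floating-point operations) per clock, as it does in desktop Kepler:

```python
# Back-of-the-envelope clock estimate from Nvidia's quoted peak shader figure.
cuda_cores = 192
peak_gflops = 365             # from Nvidia's slide
flops_per_core_per_clock = 2  # one fused multiply-add, as in desktop Kepler

clock_mhz = peak_gflops / (cuda_cores * flops_per_core_per_clock) * 1000
print(f"Implied GPU clock: {clock_mhz:.0f} MHz")  # 951 MHz, right around 950
```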
Some of the other changes needed to enable Kepler on Tegra K1 are more difficult to illustrate. In short, if you look at the GPU block diagram, everything in grey, representing the fabric by which components of the engine communicate with each other, was replaced to optimize for efficiency. Although Nvidia constructed its next-generation Maxwell architecture with mobile in mind, you will continue to see the company utilize distinctly different fabrics to build its mobile and scaled-up GPUs, balancing performance and power consumption.
Learn About Nvidia's SMX
If you want to know more about Nvidia’s nomenclature and how its SMX appears in a discrete graphics architecture, check out GeForce GTX 680 2 GB Review: Kepler Sends Tahiti On Vacation.
Nvidia is quick to point out that it didn’t handicap certain other features of the architecture. For example, even though tessellation is exposed through DirectX 11 and OpenGL, the same second-gen PolyMorph engine found in desktop Kepler-based GPUs is still part of the SMX. This isn’t the first time we’ve heard of DirectX 11-compliant tessellation enabled in hardware—Qualcomm’s Snapdragon 805 with Adreno 420 graphics is also equipped with hull, domain, and geometry shader support, as are Vivante’s licensable Vega cores. Nvidia is confident that its implementation is best, but there’s simply no way to test the company’s claims right now. We suspect, however, that industry-wide adoption of features like tessellation and geometry shading will make developers more likely to utilize those capabilities in next-gen games.
GPU-accelerated path rendering is another technology that Nvidia experimented with on its big GPUs first (back in 2011, in fact), and is now trying to advocate in the mobile world. Briefly, path rendering covers resolution-independent 2D graphics—content like PostScript, PDFs, TrueType fonts, Flash, Silverlight, and HTML5 Canvas, along with the Direct2D and OpenVG APIs. It has historically been a CPU-oriented task, and the artifacts of that are painfully obvious in the mobile space. When I pinch to zoom on a Web page using my first-gen iPad, I can let go and count several seconds as the A4 SoC re-rasterizes the scene. During this time, the text remains blurry. My iPad mini’s A5 handles the task better; fonts sharpen almost instantly after letting go. But so long as my fingers remain pinched, the blur persists. Now, Nvidia’s saying that accelerated path rendering gets rid of that, simultaneously conferring certain power-oriented benefits, since the CPU isn’t touching the scene.
On the right, you see the blur from pinching and zooming; fonts on the left are re-rasterized
Perhaps sensitive to Qualcomm’s disclosure that Snapdragon 805 sports a 128-bit memory interface supporting LPDDR3-1600 memory (128 bits divided by eight, multiplied by 1600 MT/s, equals 25.6 GB/s), Nvidia is eager to assure us that the 17 GB/s enabled by its 64-bit bus populated with 2133 MT/s memory is still ample. Of course, raw bandwidth is an important specification. However, Nvidia carries over architectural features from Kepler that benefit Tegra beyond its spec sheet. A 128 KB L2 cache is one example, naturally alleviating demands on the DRAM when references to already-used data result in a high hit rate. And because the cache is unified, whatever on-chip unit is experiencing locality can use it. A number of rejection and compression technologies also minimize memory traffic, including on-chip Z-cull, early Z and Z compression, texture compression (including DXT, ETC, and ASTC), and color compression.
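Both bandwidth figures are easy to sanity-check; peak theoretical throughput is just the interface width in bytes multiplied by the transfer rate:

```python
# Peak theoretical DRAM bandwidth: bus width (in bytes) x transfer rate.
def bandwidth_gb_s(bus_width_bits: int, transfer_rate_mt_s: int) -> float:
    return bus_width_bits / 8 * transfer_rate_mt_s / 1000

print(f"Snapdragon 805: {bandwidth_gb_s(128, 1600):.1f} GB/s")  # 25.6 GB/s
print(f"Tegra K1:       {bandwidth_gb_s(64, 2133):.1f} GB/s")   # ~17.1 GB/s
```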
Some of those capabilities even extend beyond 3D workloads into layered user interfaces, where bandwidth savings pave the way for higher-res outputs (and perhaps explain why most of the Tegra 4-based devices we’ve seen to date employ lower resolutions). New to Tegra K1 is delta-encoded compression, which uses comparisons between blocks of pixels to reduce the footprint of color data. Nvidia is also able to save bandwidth on UI layers with a lot of transparency—the GPU recognizes clear areas and skips that work completely. We’ll naturally get a better sense of how Tegra’s memory subsystem affects performance once we have hardware to test. For now, Nvidia insists elegant technology is just as effective as brute force.
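Nvidia hasn’t published the details of its compression format, but the principle behind delta encoding is straightforward: neighboring pixels tend to be similar, so storing small differences takes far fewer bits than storing raw values. A minimal, purely illustrative sketch:

```python
# Illustrative only -- not Nvidia's actual format. Neighboring pixels in a
# block tend to be similar, so deltas need far fewer bits than raw 8-bit values.
def delta_encode(block: list[int]) -> tuple[int, list[int]]:
    anchor = block[0]
    deltas = [b - a for a, b in zip(block, block[1:])]
    return anchor, deltas

# One color channel across an 8-pixel run of a smooth gradient:
pixels = [200, 201, 201, 202, 203, 203, 204, 204]
anchor, deltas = delta_encode(pixels)
print(anchor, deltas)  # 200 [1, 0, 1, 1, 0, 1, 0] -- tiny deltas compress easily
```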
Tegra K1 additionally inherits the Kepler architecture’s support for heterogeneous computing. Up until now, the latest PowerVR, Mali, and Adreno graphics designs all facilitated some combination of OpenCL and/or Renderscript, isolating Nvidia’s aging mobile architecture as the least flexible. That changes as Nvidia enables CUDA, OpenCL, Renderscript, Rootbeer (for Java), and a number of other compute-oriented languages on its newest SoC.
How Do You Scale Kepler Down To Under 2 W?
At first glance, the math doesn’t add up. The GK104 GPU in Nvidia’s GeForce GTX 680 contains eight SMX blocks and is rated for roughly 200 W. Sure, there are four ROP partitions, a 256-bit memory bus, and twice as many texture units per SMX. Still, you’re looking at a factor-of-10 difference, at least, between Kepler as it appears in Tegra and a theoretical single-SMX discrete GPU. How is that rectified?
Nvidia’s Jonah Alben used GeForce GT 740M—a 19 W part with two SMXes—to illustrate. Memory I/O and PCI Express 3.0 are responsible for roughly 3 W of the GPU’s power budget. Leakage accounts for about 6 W. Because GK107 is a dual-SMX design, divide the remaining 10 W by two for the power of a single block: 5 W. From there, consider that Nvidia is able to turn up the voltage and clock rates of its discrete GPUs to fill an allowable power envelope. Through voltage scaling, it’s possible to dial back to 2 W or so, which is where Tegra K1’s GPU lands.
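Put in code, the walk from a 19 W notebook GPU down to a roughly 5 W SMX looks like this (a rough sketch using the round numbers Nvidia quoted):

```python
# Rough reconstruction of Nvidia's GT 740M power walk, using its quoted figures.
total_board_power_w = 19.0  # GeForce GT 740M
mem_io_and_pcie_w   = 3.0   # memory I/O plus PCI Express 3.0
leakage_w           = 6.0   # static leakage
smx_count           = 2     # GK107 is a dual-SMX design

per_smx_w = (total_board_power_w - mem_io_and_pcie_w - leakage_w) / smx_count
print(f"Dynamic power per SMX: {per_smx_w:.1f} W")  # 5.0 W

# Dynamic power scales with frequency and roughly the square of voltage, so
# backing both off is what takes a single SMX from ~5 W down to ~2 W in Tegra K1.
```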
Maximizing the design’s efficiency is naturally quite a bit more complex than that description conveys. Multi-level clock gating ensures that, throughout the GPU, logic not needed at any given time is turned off. There are also two levels of power gating to cut current in the chip or at the regulator. Inside the SoC, Nvidia’s engineers had to optimize interconnects and data paths, as mentioned, trading performance for power where it made the most sense.
Nvidia presented its own benchmark results from the upcoming GFXBench 3.0, graphing framerate against power consumption in the 1080p Manhattan off-screen test. It chose Apple’s iPhone 5s and Sony’s Xperia Z Ultra as comparison points, targeting the A7 and Snapdragon 800 SoCs. At a constant performance level, the company claims a 1.5x performance per watt advantage over both with the application processor and DRAM power summed.
Nvidia was clearly bullish on imaging when it introduced Tegra 4. Given the company’s strength in graphics, it made sense that it’d extend expertise to photography and video as well. Disappointingly, we waited almost an entire year before the first manifestation of the company’s Computational Photography Engine surfaced, enabling always-on HDR and video stabilization. Then again, given the dearth of devices sporting Tegra 4, a modest software ecosystem isn’t altogether surprising. Should Tegra K1 enjoy more rapid pick-up, we’d hope to see hardware manufacturers doing more with the SoC’s imaging capabilities.
The potential balloons with Tegra K1. Last generation, Tegra 4’s ISP was rated for up to 350 megapixels/s of throughput. This time around, dual ISPs offer 600 megapixels/s each (a 20 MP stream at 30 Hz apiece). Having a pair of ISPs makes it possible to gather images from one source as information is processed from another, either from two cameras or from a camera and memory. A crossbar fabric connects both ISPs to memory, where they’re able to communicate with the CPU and GPU.
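Those throughput figures check out against the stream specs Nvidia quotes:

```python
# Sanity check on the ISP throughput claims.
megapixels_per_frame = 20
frames_per_second = 30

per_isp_mp_s = megapixels_per_frame * frames_per_second  # 600 MP/s per ISP
combined_gp_s = 2 * per_isp_mp_s / 1000                  # two ISPs in parallel
print(f"{per_isp_mp_s} MP/s per ISP, {combined_gp_s} GP/s combined")  # 1.2 GP/s
```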
There’s a lot of future-looking enablement going on—support for up to 4096 focus points, 14-bit input, 100 MP sensors, interoperability with general-purpose compute, and quality-enhancing capabilities to minimize noise, correct bad pixels, and downscale are all at least possible. Some of the features are in place simply to improve quality, while others may pave the way for new imaging applications.

For instance, allowing more than 4000 focus points seems like overkill. Nvidia counters that this is useful for tracking multiple objects, and indeed we’ve seen demos of moving content where the camera detects and tracks the subject, maintaining focus on it, despite the rest of the scene changing. In the same vein, leveraging the graphics engine’s newfound compute capability, effects can be applied to the ISP’s contents in real-time. Deep images, where each pixel can store any number of samples per channel (instead of just one), become viable. So does creating panoramic shots by “painting” with a camera sensor.
Powerful ISPs backed by a capable GPU open the door to new forms of artistic expression, and Nvidia’s team clearly has the vision to drive innovation forward in that space. As mentioned, though, demonstrations of what computational photography can achieve have been somewhat slow to emerge in production. By taking the reins and introducing its own Tegra Note 7, Nvidia put itself in a position to roll out new software features. The first OTA update appeared mere weeks ago. Hopefully, that cadence speeds up with Tegra K1—we don’t want to wait another year before the company’s most recent demos become tangible.
When Nvidia introduced Chimera last year, it presented a slide very similar to this one, except that there was only one ISP and the GPU wasn’t Kepler-based.
On paper, Tegra K1 fixes the issues that most clearly put Tegra 4 at a competitive disadvantage. Some of the questions we still have can’t be answered until we get our hands on a derived device to test. Others won’t be resolved until game developers either bring premium content over from the console space or create newer titles using advanced APIs.
We had hoped to benchmark one of Nvidia’s reference platforms in time for the Tegra K1 announcement. They’re still so rare, though, that real performance data will need to come later. Production of the SoC purportedly started in December, and company representatives claim devices based on Tegra K1 will ship in the first half of 2014. However, specific announcements aren’t Nvidia’s to make, so it can’t comment on the form factors we’ll see or the regions they’ll be available in (nor did it mention any product of its own based on K1). But our own research in the early hours before CES suggests that at least one Tegra K1-based product is already on display. We consider this to be promising news.

What the company did say was that Tegra K1 is definitely a tablet play, and will also be available for premium superphones (think big screens loaded with new technology). Tegra 4i, the more smartphone-oriented SoC with Nvidia’s i500 LTE modem built in, is purportedly still happening, and we’re told there will be more information, again, in the first half of 2014.
But today’s discussion clearly isn’t about Tegra’s recent track record or Tegra K1’s ultimate destiny. Rather, knowing how amped our audience gets about speeds and feeds, Nvidia wanted to share more information about the SoC's inner workings. Together with Intel, Nvidia is one of the most forthcoming vendors in the mobile segment, revealing far more about its hardware than Apple or Qualcomm. Small gaps in the spec sheet (like a final GPU clock rate) remain; however, given what we already know about the Kepler architecture, this is the Tegra we were hoping for in 2013.
**Approximate Comparison To Both Last-Gen Consoles**

| | Tegra K1 | PlayStation 3 | Xbox 360 |
|---|---|---|---|
| Peak Shader (GFLOPS) | 365 | 192 | 240 |
| Texturing (GTex/s) | 7.6 | 12 | 8 |
| Memory Bandwidth (GB/s) | 17 | 28.8 | 22.4 |
| Feature Set (DX) | DX 11.2 | DX 9 | DX 9 |
| CPU Performance (SPECint, Per-Core) | 1403 | 1200 | 1200 |
One of the slides Nvidia presented in its briefing compared Tegra K1 to the Xbox 360 and PS3. Although the console specifications aren’t 100% on-target, the math suggests that a single SMX running at what we presume to be about 950 MHz offers substantially more shader horsepower and almost as much peak texture fillrate. At least in theory, Tegra K1 could be on par with those previous-generation systems that continue to occupy shelf space today. Could you imagine the gaming performance of your old Xbox in a tablet form factor, perhaps with a Bluetooth-connected controller to solve the I/O issue?
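Both of the table’s Tegra K1 graphics entries fall out of the unit counts and that presumed clock (again assuming one FMA per CUDA core and one texel per texture unit per clock, as on desktop Kepler):

```python
# Deriving the table's Tegra K1 numbers from unit counts at a presumed 950 MHz.
clock_ghz = 0.95
cuda_cores = 192
texture_units = 8

peak_gflops = cuda_cores * 2 * clock_ghz  # two FLOPs (one FMA) per core per clock
peak_gtex_s = texture_units * clock_ghz   # one texel per unit per clock
print(f"{peak_gflops:.0f} GFLOPS, {peak_gtex_s:.1f} GTex/s")  # 365 GFLOPS, 7.6 GTex/s
```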
We’d only be missing the games—and that’s not an altogether bad position for Nvidia to be in, given the developer relationships it maintains on the PC side. The company appears to have its hardware ducks in a row. Let’s see if it can redefine the mobile gaming experience beyond Tegra-optimized titles with a few extra effects in them.
My challenge to Nvidia: do whatever it takes to bring games to tablets that enthusiasts want to play, instead of that superficial content we only bother with because we’re bored somewhere else. Show us that your mobile hardware has the same goodness as Kepler on the desktop and that those same developers will follow your lead. When Android gaming is as compelling as it is on the PC, there will be a long line of Tom’s Hardware readers ready to buy new tablets.