Nvidia Shares More Information About Its Innovative 64-bit ARMv8-Based Denver Core

Nvidia recently wrote a blog post about its upcoming ARMv8-based 64-bit Denver CPU core, which the company claims will be the “first 64-bit processor for Android”. The first Denver-based device is likely to be the rumored Nexus 9 tablet, but depending on how its launch is timed against mid-range products based on the Cortex A53, it may not be the first 64-bit Android chip to hit the market.

Denver was announced early this year at CES along with the “Tegra K1” product name, which Nvidia initially said would arrive with a quad-core Cortex A15 R3 CPU. The Cortex A15 version should be pin-compatible with the Denver version, so it should be very easy for OEMs to switch between the two. At the time, Nvidia didn’t release details about its upcoming 64-bit processor, but it did show an image that suggests the Denver core is about twice the size of a Cortex A15 core. It also claimed that the new silicon would have very high single-threaded and multi-threaded performance.

Now that Nvidia has released more information about Denver’s architecture, we can get an idea of where this added performance comes from. Unlike the Cortex A15’s three-way superscalar and the Apple A7’s six-way superscalar architecture, Nvidia’s Denver is a seven-way superscalar part, which means it can execute up to 7 micro-ops per clock cycle. Its L1 cache is larger than the competition’s, too: 128KB instruction/64KB data, compared to the Cortex A15’s 32KB instruction/32KB data and the Apple A7’s 64KB instruction/64KB data per core.

The most innovative thing about Denver (and this CPU differs from other mobile processors in many ways) is the instruction pipeline. Instead of going with a “deeply out of order pipeline” like ARM chose with its Cortex A57, Nvidia went with a more efficient, fully in-order hardware design. The difference is that in-order designs must execute instructions in the same order they occur in the application, while out-of-order processors can execute an instruction as soon as its operands are ready, even if earlier instructions are still waiting.
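To make the in-order versus out-of-order distinction concrete, here is a minimal sketch of a toy single-issue pipeline model in Python. It is not a model of Denver or of any real ARM core: the `Instr` type, the latencies, and the register names are all made up for illustration, and it assumes every instruction writes a unique register (so only true data dependencies matter).

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str      # register written (assumed unique per instruction)
    srcs: tuple    # registers or inputs read
    latency: int   # cycles until the result is available


def in_order_cycles(program):
    """Issue strictly in program order, one instruction per cycle."""
    ready_at = {}  # register -> cycle at which its value is available
    cycle = 0      # earliest cycle in which the next instruction may issue
    finish = 0
    for ins in program:
        # Wait for our turn AND for every operand (inputs are ready at 0).
        start = max([cycle] + [ready_at.get(s, 0) for s in ins.srcs])
        ready_at[ins.dest] = start + ins.latency
        finish = max(finish, ready_at[ins.dest])
        cycle = start + 1  # later instructions cannot pass this one
    return finish


def out_of_order_cycles(program):
    """Each cycle, issue the oldest instruction whose operands are ready."""
    produced = {ins.dest for ins in program}
    ready_at = {}
    pending = list(program)
    cycle = 0
    finish = 0
    while pending:
        for ins in pending:
            if all(s not in produced or ready_at.get(s, float("inf")) <= cycle
                   for s in ins.srcs):
                ready_at[ins.dest] = cycle + ins.latency
                finish = max(finish, ready_at[ins.dest])
                pending.remove(ins)
                break  # single-issue: at most one instruction per cycle
        cycle += 1
    return finish


# Two independent load-then-add pairs; loads take 4 cycles, adds take 1.
PROGRAM = [
    Instr("r1", ("a",), 4),        # load r1 <- [a]
    Instr("r2", ("r1", "x"), 1),   # add  r2 <- r1 + x
    Instr("r3", ("b",), 4),        # load r3 <- [b]
    Instr("r4", ("r3", "y"), 1),   # add  r4 <- r3 + y
]

print(in_order_cycles(PROGRAM))      # 10
print(out_of_order_cycles(PROGRAM))  # 6
```

In this toy case the in-order model stalls behind the first add while the second load could already be running; the out-of-order model overlaps the two loads and finishes in 6 cycles instead of 10.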

An out-of-order pipeline greatly reduces the delay between processing instructions, but the problem is that it significantly increases the power consumption and physical die size of the CPU. This is why ARM has delayed going out-of-order for as long as possible and why the Cortex A53 is still an in-order design.

So why did Nvidia choose an inherently slower in-order pipeline? The company claims to have found a way to create an efficient in-order hardware design by applying out-of-order techniques in software, an approach it calls “Dynamic Code Optimization”. If what it says is true, the slight software overhead this adds is outweighed by the performance gained on Denver’s in-order pipeline.

“As part of the Dynamic Code Optimization process, Denver looks across a window of hundreds of instructions and unrolls loops, renames registers, removes unused instructions, and reorders the code in various ways for optimal speed. This effectively doubles the performance of the base-level hardware through the conversion of ARM code to highly optimized microcode routines and increases the execution energy efficiency.”
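The kinds of transformations named in that quote can be illustrated with a toy pass over straight-line code. The sketch below is not Nvidia’s optimizer, whose internals are not described here; the `Instr` type, the helper names, and the latencies are hypothetical. It assumes registers have already been renamed so each instruction writes a unique register, removes instructions whose results are never used, and then reorders the rest with a simple greedy scheduler.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str      # register written (assumed unique, i.e. already renamed)
    srcs: tuple    # registers or inputs read
    latency: int   # cycles until the result is available


def remove_unused(program, live_out):
    """Backward scan: keep only instructions whose result is needed."""
    live = set(live_out)
    kept = []
    for ins in reversed(program):
        if ins.dest in live:
            kept.append(ins)
            live.discard(ins.dest)
            live.update(ins.srcs)
    kept.reverse()
    return kept


def reorder(program):
    """Greedy list scheduling: among instructions whose operands have
    already been scheduled, emit the longest-latency one first."""
    produced = {ins.dest for ins in program}
    done = set()
    pending = list(program)
    out = []
    while pending:
        ready = [i for i in pending
                 if all(s in done or s not in produced for s in i.srcs)]
        pick = max(ready, key=lambda i: i.latency)  # ties keep program order
        pending.remove(pick)
        out.append(pick)
        done.add(pick.dest)
    return out


def in_order_cycles(program):
    """Cycle count on a toy single-issue, strictly in-order pipeline."""
    ready_at, cycle, finish = {}, 0, 0
    for ins in program:
        start = max([cycle] + [ready_at.get(s, 0) for s in ins.srcs])
        ready_at[ins.dest] = start + ins.latency
        finish = max(finish, ready_at[ins.dest])
        cycle = start + 1
    return finish


ORIGINAL = [
    Instr("r1", ("a",), 4),         # load r1 <- [a]
    Instr("r2", ("r1",), 1),        # add  r2 <- r1 + 1
    Instr("r9", ("r2",), 3),        # mul  r9 <- r2 * 3  (result never used)
    Instr("r3", ("b",), 4),         # load r3 <- [b]
    Instr("r4", ("r3", "r2"), 1),   # add  r4 <- r3 + r2
]

OPTIMIZED = reorder(remove_unused(ORIGINAL, live_out={"r4"}))

print([i.dest for i in OPTIMIZED])   # ['r1', 'r3', 'r2', 'r4']
print(in_order_cycles(ORIGINAL))     # 11
print(in_order_cycles(OPTIMIZED))    # 6
```

On this made-up sequence, the same in-order model drops from 11 cycles to 6 once the unused multiply is gone and the second load is hoisted, which is the general idea behind doing the reordering in software rather than in hardware.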

Nvidia’s benchmarks show that Denver is roughly twice as fast as Cortex A15 R3, Krait 400 and Silvermont/Bay Trail. It even beats Intel’s mainstream Haswell Celeron CPU in the majority of tests. Of course we have to leave any final judgements until we have hardware in our own hands for testing, but these numbers do suggest that Nvidia’s use of the phrase “doubles the performance” might not be far from the mark.

The company has been working on the Denver design for more than 5 years, and its investment in Android may very well pay off. With the transition of its desktop GPU architecture to mobile (Kepler), Nvidia achieved a significant lead over its competition in many respects. This innovative new processor design may earn the company a stronger presence in the mobile market.


Lucian Armasu
Lucian Armasu is a Contributing Writer for Tom's Hardware US. He covers software news and the issues surrounding privacy and security.
  • whyso
    Just like Tegra 3 was comparable to core 2 a few years back. I need proof.
  • jasonelmore
    this is cpu not gpu, argument invalid.
  • CRITICALThinker
    I wonder if this will make it into the shield tablet, and how it will change performance.
  • whyso
    Nvidia compared T3 CPU to core 2 T7200, not GPU.
  • somebodyspecial
    This will be more impressive than K1 on the cpu side, but the real magic happens with 20nm Maxwell strapped to it so you get 64bit cpu that blows away the A15 K1, and then the huge gpu boost also from the first NV chip aimed directly at mobile applications. Don't get me wrong K1 was a huge leap and vaulted them to the top of the gpu for a while (with a great a15r3 also) but a bigger hit is looming with the die shrink maxwell version. I can't wait to see what xmas gaming looks like next year on android with all the great 20nm chips coming from everyone (qcom, samsung, nvidia, arm, MediaTek).

    The lowest common denominator gets FAR better next year at 20nm (from the entire ARM side), and we need that to get devs to aim higher with unity5 and unreal 4 type games for mobile. I like where we're heading ;)
  • somebodyspecial
    Just like Tegra 3 was comparable to core 2 a few years back. I need proof.

    Check anandtech's benchmarks with K1 vs. Surface Pro 1/2/3, and again vs. the Asus T100.
    http://anandtech.com/show/8296/the-nvidia-shield-tablet-review/5
    Hit back one page to page 4 for the cpu side. Z3740 Baytrail in that Asus Transformer T100. Smoked in GPU period, and does well against Surface1, not bad against 2/3 either.

    http://anandtech.com/show/8296/the-nvidia-shield-tablet-review/4
    Page 4 for the cpu side. K1 beating the T100 in Sunspider, Kraken, Octane v2 handily in all. Only tie is WebXPRT. Otherwise, Baytrail got their clock cleaned right? Big time. Don't forget WebXPRT is Intel's favorite benchmark...LOL. Not sure that's real world then, knowing that in every other benchmark on cpu they got smoked. It's suspicious at best right?
  • fykusfire
    God it would be awesome if AMD's Seattle was going up against Nvidia's Denver. The Super Bowl of ARM 64-bit chips.
  • photonboy
    jasonelmore,
    It is a valid question, as there was a version of the K1, I believe, that had two Denver cores on it.

    A mobile chip has both a CPU and GPU on it (plus other stuff) and it's often called an "SoC" or System on a Chip.

    Again though, if a program can actually use the Denver cores as well as the other CPU components the Shield Tablet would run better. I doubt it will happen for a while due to the cost, and likely heat constraints. Mobile chips are a very, very competitive area so you need to justify the extra Denver cores.
  • juanrga
    Seeing a 7W ARM Denver SoC matching (or even beating) a 15W Haswell CPU is amazing! Now anxiously waiting to see a 70W ARM Boulder SoC matching 140W Haswell Xeon CPUs
  • somebodyspecial
    juanrga said:
    Seeing a 7W ARM Denver SoC matching (or even beating) a 15W Haswell CPU is amazing! Now anxiously waiting to see a 70W ARM Boulder SoC matching 140W Haswell Xeon CPUs

    I'd like to see them up the Nov Denver version to 4ghz, put TWO in a pc (so 4 cpu cores, as denver is DUAL), slap a fan/heatsink on it, put an NV GPU inside and see what we get. If they can already do 2.5Ghz in a tablet (which was already benchmarked ages ago, I'm talking denver here), what happens with a massive heatsink/fan like a PC? So no OS or Intel premium, but with nearly the same power :) A 50-80w version, with a 780TI in it etc (pick your card). Something I can run a triboot of Linux, Android and SteamOS, assuming this gets ported to arm shortly by NV or Valve or both working together. They could sell them cheaper with no discrete card (just the two socs) and then have a PCIE in there so you can pick a card at any time to explode your gaming ;) I think I'd double the SMX's though (so you'd have two chips with TWO SMX on each) to catch the apus out now (iris/A10's etc) to really take them on.

    That would have lots of versatility in software, great gamer, and certainly steal a lot of sales from Wintel camp. Surely a great HTPC, great for email, web etc same crap low end stuff can do now for 85% of us etc.

    As for Boulder, it's been given a rest for a while. They aren't concentrating on server right now. They didn't say they wouldn't again (or that Denver couldn't fill SOME of those roles), I just think they don't see the ecosystem to support the endeavor yet so are passing on this first round of arm stuff. It's easier to take on a desktop where there isn't so much testing/certification that has to happen (costly) and the base work is all done for that transition into the PC like box. Just amp it up with more mhz and watts and slap a HSF on it.