Nvidia recently wrote a blog post about its upcoming ARMv8-based 64-bit Denver CPU core, which the company claims will be the “first 64-bit processor for Android”. The first Denver-based device is likely to be the rumored Nexus 9 tablet, but depending on launch scheduling vs. other Cortex A53-based mid-range launch products, it may not be the first 64-bit Android chip that hits the market.
Denver was announced early this year at CES along with the “Tegra K1” product name, which Nvidia initially said would arrive with a quad-core Cortex A15 R3 CPU. That version should be pin-compatible with the Denver chip, so it should be very easy for OEMs to switch between the two. At the time, Nvidia didn’t release details about its upcoming 64-bit processor, but it did show an image that suggests the Denver core is about twice the size of a Cortex A15 core. It also claimed that the new silicon would have very high single-threaded and multi-threaded performance.
Now that Nvidia has released more information about Denver’s architecture, we can get an idea of where this added performance comes from. Unlike the Cortex A15’s three-way superscalar and the Apple A7’s six-way superscalar architecture, Nvidia’s Denver is a seven-way superscalar part. This means it can execute up to 7 micro-ops per clock cycle. Its 128KB L1 instruction/64KB L1 data cache is larger than the competition’s, too: the Cortex A15 has a 32KB L1 instruction/32KB L1 data cache, and Apple’s A7 has a 64KB L1 instruction/64KB L1 data cache per core.
The most innovative thing about Denver, a CPU that differs from other mobile processors in many ways, is the instruction pipeline. Instead of going with a “deeply out of order pipeline” like ARM chose with its Cortex A57, Nvidia went with a more power-efficient, fully in-order hardware design. The difference is that in-order designs must execute instructions in the same order they occur in the application, while out-of-order processors can execute an instruction as soon as its inputs are ready, even if earlier instructions are still waiting.
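The difference is easier to see with a toy model. The sketch below is a simplified illustration, not a model of Denver or any real CPU: it assumes a machine that issues one instruction per cycle, made-up latencies (4 cycles for a load, 1 for an add), and that each register is written only once. The names `Instr`, `run_in_order` and `run_out_of_order` are hypothetical.

```python
from collections import namedtuple

# Toy instruction: destination register, source registers, latency in cycles.
Instr = namedtuple("Instr", "dest srcs latency")

def run_in_order(program):
    """Issue strictly in program order, stalling until inputs are ready."""
    done_at, cycle, finish = {}, 0, 0
    for instr in program:
        # Registers with no producer in the program count as ready at cycle 0.
        issue = max([cycle] + [done_at.get(s, 0) for s in instr.srcs])
        done_at[instr.dest] = issue + instr.latency
        finish = max(finish, done_at[instr.dest])
        cycle = issue + 1  # one instruction issued per cycle
    return finish

def run_out_of_order(program):
    """Each cycle, issue the oldest pending instruction whose inputs are ready."""
    produced = {i.dest for i in program}
    done_at, pending, cycle, finish = {}, list(program), 0, 0
    while pending:
        for instr in pending:
            ready = all(s not in produced or done_at.get(s, float("inf")) <= cycle
                        for s in instr.srcs)
            if ready:
                done_at[instr.dest] = cycle + instr.latency
                finish = max(finish, done_at[instr.dest])
                pending.remove(instr)
                break  # at most one issue per cycle
        cycle += 1
    return finish

# Two independent load/add pairs (latencies are assumptions for illustration).
program = [
    Instr("a", ("x",), 4),  # a = load x
    Instr("b", ("a",), 1),  # b = a + 1
    Instr("c", ("y",), 4),  # c = load y
    Instr("d", ("c",), 1),  # d = c + 1
]

print(run_in_order(program))      # 10
print(run_out_of_order(program))  # 6
```

In this toy case the in-order machine waits for the first load before it even starts the second one, while the out-of-order machine overlaps the two loads and finishes in 6 cycles instead of 10.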
An out-of-order pipeline greatly reduces the time the CPU spends stalled waiting on earlier instructions, but it also significantly increases the power consumption and physical die size of the CPU. This is why ARM has delayed going out-of-order for as long as possible and why the Cortex A53 is still an in-order design.
So why did Nvidia choose an inherently slower in-order pipeline? The company claims to have found a way to create an efficient in-order hardware design by applying out-of-order techniques in software, an approach it calls “Dynamic Code Optimization”. If what it says is true, the slight software overhead is more than offset by the performance gained from running the optimized code on Denver’s in-order pipeline.
“As part of the Dynamic Code Optimization process, Denver looks across a window of hundreds of instructions and unrolls loops, renames registers, removes unused instructions, and reorders the code in various ways for optimal speed. This effectively doubles the performance of the base-level hardware through the conversion of ARM code to highly optimized microcode routines and increases the execution energy efficiency.”
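Two of the steps Nvidia lists, removing unused instructions and reordering code, can be sketched in a few lines. This is only a toy illustration under stated assumptions, not Nvidia's actual optimizer: it assumes a one-instruction-per-cycle in-order machine, invented latencies, and registers that are each written once. All names (`Instr`, `remove_unused`, `reorder`, `in_order_cycles`) are hypothetical.

```python
from collections import namedtuple

# Toy instruction: destination register, source registers, latency in cycles.
Instr = namedtuple("Instr", "dest srcs latency")

def remove_unused(program, live_out):
    """Backward scan that drops instructions whose result is never read."""
    live, kept = set(live_out), []
    for instr in reversed(program):
        if instr.dest in live:
            kept.append(instr)
            live.discard(instr.dest)
            live.update(instr.srcs)
    kept.reverse()
    return kept

def reorder(program):
    """Greedy software scheduling: emit the oldest instruction whose inputs are ready."""
    produced = {i.dest for i in program}
    done_at, pending, order, cycle = {}, list(program), [], 0
    while pending:
        for instr in pending:
            if all(s not in produced or done_at.get(s, float("inf")) <= cycle
                   for s in instr.srcs):
                done_at[instr.dest] = cycle + instr.latency
                order.append(instr)
                pending.remove(instr)
                break  # one instruction per scheduling slot
        cycle += 1
    return order

def in_order_cycles(program):
    """Cycles needed by a simple in-order machine that stalls on missing inputs."""
    done_at, cycle, finish = {}, 0, 0
    for instr in program:
        issue = max([cycle] + [done_at.get(s, 0) for s in instr.srcs])
        done_at[instr.dest] = issue + instr.latency
        finish = max(finish, done_at[instr.dest])
        cycle = issue + 1
    return finish

original = [
    Instr("a", ("x",), 4),  # a = load x
    Instr("b", ("a",), 1),  # b = a + 1
    Instr("t", ("a",), 1),  # t = a * 2, result never used
    Instr("c", ("y",), 4),  # c = load y
    Instr("d", ("c",), 1),  # d = c + 1
]

optimized = reorder(remove_unused(original, live_out={"b", "d"}))

print([i.dest for i in optimized])  # ['a', 'c', 'b', 'd']
print(in_order_cycles(original))    # 11
print(in_order_cycles(optimized))   # 6
```

The hardware model is the same simple in-order machine in both runs; only the instruction stream changes. The numbers come from the invented latencies and say nothing about Denver's real speedup.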
Nvidia’s benchmarks show that Denver is roughly twice as fast as Cortex A15 R3, Krait 400 and Silvermont/Bay Trail. It even beats Intel’s mainstream Haswell Celeron CPU in the majority of tests. Of course we have to leave any final judgements until we have hardware in our own hands for testing, but these numbers do suggest that Nvidia’s use of the phrase “doubles the performance” might not be far from the mark.
The company has been working on the Denver design for more than 5 years, and its investment in Android may very well pay off. With the transition of its desktop GPU architecture to mobile (Kepler), Nvidia achieved a significant lead over its competition in many respects. This innovative new processor design may earn the company a stronger presence in the mobile market.