Deep Dive into Arm Cortex-A76 Performance Improvements

Editor's Note: This article is sponsored content from our advertiser Arm and was not reported or published by the Tom's Hardware staff.

The new Arm Cortex-A76 CPU, based on DynamIQ technology, is focused on delivering the richest user experience on any kind of mobile computing device, from smartphones to large-screen devices such as laptops and DTVs, while remaining power efficient. It aims to improve the traditional mobile experience through faster responsiveness and constant connectivity. The performance improvements and longer-lasting battery life of multiple working days allow mobile devices to sustain current and emerging power-hungry use cases, such as more immersive AR/VR, mobile gaming and machine learning (ML) at the edge. All these use cases are enabled by design modifications and additions to the CPU.

Significant performance improvements

Despite building on the performance and efficiency improvements of the DynamIQ cluster microarchitecture, the design of the new Cortex-A76 CPU essentially started from scratch, which enabled Arm to focus entirely on the user experience. As a result, it achieves greater performance, better power efficiency and significant ML improvements compared to the Cortex-A75 CPU. There are improvements in all directions: integer, floating point, 8-bit fixed point (which includes ML) and bandwidth. These can be scaled into different use cases across multiple devices.

The Cortex-A76 CPU achieves 35 percent more performance, a figure based on a power envelope of 750mW per core. This provides better responsiveness and an improved user experience for critical applications such as AAA gaming, web browsing and application start. The 40 percent better power efficiency reflects the lower power required to deliver highly demanding sustained performance, which leads to 50 percent energy savings over 2018 devices. This means that the Cortex-A76 CPU can operate within the constraints of the mobile device without compromising the overall experience, which translates into a better UX. The energy savings mean that battery life is extended, which sustains use cases such as AR/VR and mobile gaming. The fourfold ML improvement of the Cortex-A76 CPU, which focuses on inference, means that devices can manage multiple ML workloads requiring different performance and efficiency points.
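As a rough illustration of how an efficiency figure translates into power and energy, here is a minimal sketch. It assumes that "power efficiency" means performance per watt and that both CPUs are compared at equal performance; neither assumption comes from Arm's published methodology. The 750mW value is used purely as an illustrative starting point, the function names are hypothetical, and the 50 percent device-level energy figure is a separate comparison that this sketch does not attempt to derive.

```python
def power_at_equal_performance(baseline_power_mw, efficiency_gain):
    """Power needed to match the baseline's performance after a perf/W gain.

    efficiency_gain is fractional (0.40 for "40 percent better").
    Assumption: efficiency = performance / power.
    """
    return baseline_power_mw / (1.0 + efficiency_gain)


def energy_mj(power_mw, seconds):
    """Energy in millijoules: power (mW) multiplied by time (s)."""
    return power_mw * seconds


baseline_mw = 750.0  # illustrative starting point, not a measured baseline
new_mw = power_at_equal_performance(baseline_mw, 0.40)

print(f"power at equal performance: {new_mw:.1f} mW")
print(f"energy for a 10 s task: {energy_mj(new_mw, 10):.0f} mJ "
      f"vs {energy_mj(baseline_mw, 10):.0f} mJ")
```

Under these assumptions, a 40 percent perf/W gain corresponds to roughly 29 percent less power for the same work, since 1/1.4 is about 0.71.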

The performance curve

Looking at the performance curve, the Cortex-A76 CPU accelerates performance across any sustained workload and multiple performance points. This is all achieved without compromising battery life. It has 25 percent more integer IPC, 35 percent higher ASIMD/FP performance and 90 percent higher memory bandwidth than the Cortex-A75 CPU. There is also a boost to the mobile experience, with 28 percent more Geekbench performance and 35 percent more JavaScript performance.

Key performance features of the Arm Cortex-A76 CPU compared to the Cortex-A75 CPU

Regardless of Moore’s law, the performance trajectory is set to continue, as noted in our recent CPU roadmap disclosure. The follow-up to the Cortex-A76 CPU is codenamed ‘Deimos’ and was delivered to our partners this year. Optimized for the latest 7nm nodes, ‘Deimos’ is expected to deliver a 15+ percent increase in compute performance. In 2019, the CPU codenamed ‘Hercules’ will be available to Arm partners. ‘Hercules’, also based on DynamIQ technology, will be optimized for both the latest 5nm and 7nm nodes. ‘Hercules’ continues the trajectory of increased compute performance, while also improving power and area efficiency by 10 percent. This is in addition to the efficiency gains achievable from the 5nm process node.

The Arm Client compute CPU roadmap

New microarchitecture capabilities

A big part of these performance improvements is the new microarchitecture capabilities of the Cortex-A76 CPU, which are designed for premium performance, performance efficiency and power efficiency. The Cortex-A76 microarchitecture also removes bottlenecks throughout the design to break through theoretical limits. In addition, partners can get the best performance out of the microarchitecture through full SoC optimization. This means:

  1. Aggressive implementation, with the Cortex-A76 CPU showing scalability above 3GHz.
  2. Increasing LITTLE core performance, which sustains more background tasks and gives better power efficiency and better overall performance.
  3. Implementing large caches, with support for up to 4MB already available for single-thread and multi-core performance. The multi-level branch-target caches minimize exposed latency and reduce power.
  4. Implementing the memory system, with this enabling latency at 2.5 percent every 10ns and multiple memory channels using higher-bandwidth memory. In fact, the front-end of the Cortex-A76 CPU is built to hide latency at high bandwidth.

Inside the IP

Looking at the Cortex-A76 CPU in more detail, you can understand how these performance and efficiency improvements are possible.

The front-end of the IP has been built to hide latency at a high bandwidth on devices. It has a decoupled predict/fetch, multi-level branch-target caches and a hybrid indirect predictor, which has unparalleled prediction capability for polymorphic branches.
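To see why polymorphic branches are hard to predict, consider a toy simulation. This is a minimal sketch and not a model of the Cortex-A76 predictor, whose internals are not described here; the class names, the call-site address and the target names are all hypothetical. It compares a predictor that remembers only the last target of an indirect branch with one that also indexes by recent target history.

```python
from collections import deque


class LastTargetPredictor:
    """Predicts that an indirect branch goes wherever it went last time."""

    def __init__(self):
        self.table = {}

    def predict(self, pc):
        return self.table.get(pc)

    def update(self, pc, target):
        self.table[pc] = target


class HistoryPredictor:
    """Indexes targets by branch address plus recent target history."""

    def __init__(self, history_length=2):
        self.table = {}
        self.history = deque(maxlen=history_length)

    def _key(self, pc):
        return (pc, tuple(self.history))

    def predict(self, pc):
        return self.table.get(self._key(pc))

    def update(self, pc, target):
        # Record the target under the history seen *before* this branch,
        # then shift the new target into the history.
        self.table[self._key(pc)] = target
        self.history.append(target)


def accuracy(predictor, pc, targets):
    """Fraction of branches whose target was predicted correctly."""
    hits = 0
    for target in targets:
        if predictor.predict(pc) == target:
            hits += 1
        predictor.update(pc, target)
    return hits / len(targets)


# One call site (hypothetical address) cycling through three receiver types.
targets = ["Circle.draw", "Square.draw", "Triangle.draw"] * 100

print("last-target:", accuracy(LastTargetPredictor(), 0x4000, targets))
print("history-based:", accuracy(HistoryPredictor(), 0x4000, targets))
```

On this repeating pattern the last-target scheme never predicts correctly, because the target always differs from the previous one, while the history-based scheme is right on every branch after a short warm-up.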

The decode, rename and commit elements of the Cortex-A76 CPU provide a 4-instruction/cycle, power-optimized decode with instruction transformation for low power. These elements also contain high-density decode/rename, including an expansion to 8-wide dispatch. The dispatch to the out-of-order core and the commit unit are optimized for low-latency OS and hypervisor activity. The design contains a 128-160 entry instruction window optimized for area and power, alongside a hybrid commit unit that is likewise optimized for area and power.

The execution core dispatches µops (micro-operations) to a 120-entry issue queue capacity, split across eight independent issue queues that are power-optimized for their attached execution pipelines. The integer pipelines include two simple ALUs, one branch unit and one multi-cycle integer and simple ALU. The dual 128-bit ASIMD/FP execution pipelines provide twice the bandwidth of prior CPUs. The state-of-the-art, latency-optimized VX datapaths offer 2-cycle FADD, 3-cycle FMUL, 4-cycle FMAC and radix-64 FDIV.
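The interplay between the 4-cycle FMAC latency and the two ASIMD/FP pipelines can be illustrated with a back-of-the-envelope estimate. This is a simplified sketch under stated assumptions: each pipeline accepts one FMAC per cycle, FMACs that share an accumulator form a dependency chain, and nothing else limits the core. The function name and the operation count are hypothetical.

```python
import math


def fmac_cycles(n_ops, accumulators, latency=4, pipes=2):
    """Approximate cycles for n_ops FMACs spread over independent accumulators.

    Assumptions (not taken from Arm documentation): pipelines are fully
    pipelined, and each accumulator chain can start a new FMAC only once
    the previous one on that chain has completed.
    """
    latency_bound = math.ceil(n_ops / accumulators) * latency
    throughput_bound = math.ceil(n_ops / pipes)
    return max(latency_bound, throughput_bound)


for accumulators in (1, 2, 4, 8, 16):
    cycles = fmac_cycles(1024, accumulators)
    print(f"{accumulators:2d} accumulators: about {cycles} cycles")
```

In this model a single dependency chain is limited by latency, and it takes latency times pipelines (here 4 x 2 = 8) independent accumulators before the two pipelines become the limit. Shorter latencies reduce how much parallelism software has to expose to keep the pipelines busy.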

The L1 data cache of the Cortex-A76 CPU has decoupled address-generation and cache-lookup pipelines for optimal bandwidth. The cache is optimized for extreme memory-level parallelism: 68 in-flight loads, 72 in-flight stores and 20 outstanding non-prefetch misses. It also contains a sophisticated fourth-generation prefetcher, designed around the philosophy of perfect cache-hit operation.
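Why the number of outstanding misses matters can be shown with Little's law: sustained bandwidth is roughly the number of bytes in flight divided by the memory latency. The sketch below is illustrative only. The 64-byte line size and the latency values are assumptions, not figures from this article, and the function name is hypothetical.

```python
def sustained_bandwidth_gb_per_s(outstanding_misses, line_bytes, latency_ns):
    """Little's law estimate: bytes in flight / latency.

    Bytes per nanosecond is numerically equal to GB/s (10^9 bytes/s).
    """
    return outstanding_misses * line_bytes / latency_ns


# 20 outstanding non-prefetch misses is the figure quoted above;
# the 64-byte line and the latencies are assumed values.
for latency_ns in (80, 100, 150):
    bandwidth = sustained_bandwidth_gb_per_s(20, 64, latency_ns)
    print(f"{latency_ns} ns latency: about {bandwidth:.1f} GB/s per core")
```

The estimate covers demand misses only. Prefetches add further requests in flight, which is one reason a capable prefetcher raises achievable bandwidth.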

The full cache hierarchy is optimized for latency and bandwidth to ensure the best of both worlds. This adapts to both system latency and bandwidth characteristics. The increased bandwidth at low latency provides more than twice the performance compared to the Cortex-A75 CPU.

The performance curve that goes beyond Moore’s law

Arm is continuing to accelerate performance gains through our annual design cadence, while also improving efficiency and the silicon footprint. This has enabled our ecosystem of silicon, software, and OEM partners to bring smartphones to the market every year with new innovations and features.

The Cortex-A76 CPU is aiming for nothing less than changing the face of intelligent mobile computing. Our recent roadmap disclosure shows that the Cortex-A76 CPU represents the continuation of the trajectory that will increase performance at a staggering pace up until 2020, going beyond the performance trajectory for Moore’s law. All of this means that the end user will be able to get more out of their mobile computing experience, whether on a smartphone or using a PC on the go.