Architecture Overview
Instruction Fetch
The A72 sees improvements to performance too, starting with a much improved branch prediction algorithm. ARM’s performance modeling group is continuously updating the simulated workloads fed to the processor, and these affect the design of the branch predictor as well as the rest of the CPU. For example, based on these workloads, ARM found that the offsets between branches are often small, meaning branches tend to sit close together in memory. This allowed it to make certain optimizations to its dynamic predictor, such as enabling the Branch Target Buffer (BTB) to hold anywhere from 2,000 large branches to 4,000 small branches.
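To make the mechanism concrete, here is a minimal C sketch of a direct-mapped BTB lookup and update. The entry count, indexing, and the use of the full PC as a tag are illustrative assumptions for the sketch, not the A72's actual organization:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative branch target buffer: size, indexing, and tagging are
 * assumptions for this sketch, not the A72's real design. */
#define BTB_ENTRIES 2048

typedef struct {
    uint64_t tag;     /* full PC kept as the tag, for simplicity */
    uint64_t target;  /* predicted branch target address */
    bool     valid;
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

/* Look up the fetch PC; a hit yields a predicted target, letting the
 * front-end redirect fetch without waiting for the branch to execute. */
static bool btb_lookup(uint64_t pc, uint64_t *target)
{
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES]; /* 4-byte instrs */
    if (e->valid && e->tag == pc) {
        *target = e->target;
        return true;
    }
    return false;
}

/* When a taken branch resolves, install or refresh its entry. */
static void btb_update(uint64_t pc, uint64_t target)
{
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
    e->tag = pc;
    e->target = target;
    e->valid = true;
}
```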
Because real-world code tends to include many branch instructions, branch prediction and speculative execution can greatly improve performance, something that synthetic benchmarks usually fail to test. Better branch prediction usually costs more power, however, at least in the front-end. This is at least partially offset by fewer mispredictions, which save power in the back-end by avoiding pipeline flushes and wasted clock cycles. Another way ARM is saving power is by shutting off the branch predictor in situations where it’s unnecessary. In many blocks of code, for instance, the distance between branches is greater than the 16-byte instruction window the predictor uses, so it makes sense to shut the predictor down there: it obviously will not hit a branch in that section of code.
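As a rough illustration of that gating idea, the sketch below remembers which 16-byte fetch windows were previously seen to contain no branches and lets the predictor sleep for them. The tracking table, its size, and the training step are all hypothetical; ARM has not published the real mechanism:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical structure: one flag per recently seen 16-byte fetch
 * window recording "no branch lives here". Purely illustrative. */
#define WINDOW_BITS 4       /* log2 of the 16-byte fetch window */
#define TABLE_SIZE  1024

static bool no_branch[TABLE_SIZE];

static unsigned window_index(uint64_t pc)
{
    return (pc >> WINDOW_BITS) % TABLE_SIZE;
}

/* Called at fetch: true means the predictor can stay gated off,
 * saving its dynamic power for this window. */
static bool predictor_can_sleep(uint64_t pc)
{
    return no_branch[window_index(pc)];
}

/* Called after decode: remember whether this window held any branch. */
static void record_window(uint64_t pc, bool had_branch)
{
    no_branch[window_index(pc)] = !had_branch;
}

int main(void)
{
    record_window(0x1000, false);                /* straight-line code */
    printf("%d\n", predictor_can_sleep(0x1008)); /* same window: 1 */
    return 0;
}
```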
Decode / Rename
The rest of the front-end is still in-order with a 3-way decoder just like the A57. However, unlike the A57’s decoder, which decodes instructions into micro-ops, the A72 decodes instructions into macro-ops that can contain multiple micro-ops. It also supports AArch64 instruction fusion within the decoder. These macro-ops get “late-cracked” into multiple micro-ops at the dispatch stage, which is now capable of issuing up to five micro-ops per cycle, up from three on the A57. ARM quotes an average ratio of 1.08 micro-ops per instruction when executing common code. Improving the A72’s front-end bandwidth helps keep the new lower-latency execution units fed and also reduces the number of cases where the front-end bottlenecks performance.
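A small sketch may help picture late-cracking. Here, a 3-wide group of decoded macro-ops is cracked at dispatch against a 5-micro-op-per-cycle budget; the macro-op representation and dispatch loop are simplified assumptions, not ARM's implementation:

```c
#include <stdio.h>

/* Pipeline widths from the article: 3-wide decode into macro-ops,
 * up to 5 micro-ops dispatched per cycle after late-cracking. */
#define DECODE_WIDTH   3
#define DISPATCH_WIDTH 5

typedef struct {
    int num_uops;  /* micro-ops this macro-op cracks into (usually 1,
                    * matching the quoted 1.08 uops/instruction) */
} macro_op_t;

/* Dispatch stage: crack macro-ops into micro-ops in program order,
 * stopping when this cycle's 5-micro-op budget is spent. Returns how
 * many macro-ops were fully dispatched. */
static int dispatch_cycle(const macro_op_t *mops, int n)
{
    int budget = DISPATCH_WIDTH, done = 0;
    for (int i = 0; i < n && mops[i].num_uops <= budget; i++) {
        budget -= mops[i].num_uops;
        done++;
    }
    return done;
}

int main(void)
{
    /* e.g., the middle macro-op cracks into 2 micro-ops */
    macro_op_t group[DECODE_WIDTH] = { {1}, {2}, {1} };
    printf("dispatched %d macro-ops\n", dispatch_cycle(group, DECODE_WIDTH));
    return 0;
}
```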
Dispatch / Retire
As explained above, the A72 is now able to dispatch five micro-ops into the issue queues that feed the execution units. The issue queues still hold a total of 66 micro-ops, as they did on the A57: eight entries for each pipeline, except for the branch execution unit's queue, which holds ten. While queue depth is the same, the A72 does have an improved issue-queue load-balancing algorithm that eliminates some additional cases of uneven utilization between the FP/Advanced SIMD units.
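The load-balancing idea can be sketched as steering each FP/Advanced SIMD micro-op to the shallower of the two eight-entry queues. The A72's real heuristic is undisclosed and certainly more involved; this shows only the principle:

```c
#include <stdio.h>

/* Two FP/ASIMD issue queues of eight entries each, per the article.
 * Steering to the shallower queue is an assumed, simplified policy. */
#define FP_QUEUES   2
#define QUEUE_DEPTH 8

static int occupancy[FP_QUEUES];

/* Returns the queue chosen for this micro-op, or -1 if both are full
 * (dispatch would stall this cycle). */
static int steer_fp_uop(void)
{
    int q = (occupancy[0] <= occupancy[1]) ? 0 : 1;
    if (occupancy[q] >= QUEUE_DEPTH)
        return -1;
    occupancy[q]++;
    return q;
}

int main(void)
{
    for (int i = 0; i < 6; i++)
        printf("uop %d -> queue %d\n", i, steer_fp_uop());
    return 0;
}
```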
The A72’s dispatch unit also sees significant area and power reductions from reorganizing the architectural and speculative register files. ARM also reduced the number of register read ports through port sharing. This is important because the read ports are very timing-sensitive and require larger gates to enable higher core frequencies, which has a second-order effect on area and power by pushing other components further away. In total, the dispatch unit sees a 10% reduction in area with no adverse effect on performance.
Execution Units
Moving to the out-of-order back-end, the A72 can still issue eight micro-ops per cycle like the A57, but execution latency is significantly reduced thanks to the next-generation FP/Advanced SIMD units. Floating-point instructions see up to a 40% latency reduction over the A57 (FMUL drops from 5 cycles to 3), and the new radix-16 FP divider doubles bandwidth. The changes to the integer units are less extensive, but they also move to a radix-16 integer divider (doubling bandwidth over the A57) and add a 1-cycle CRC unit.
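The bandwidth claim for the dividers follows from how iterative SRT-style dividers work: a radix-4 stage retires 2 quotient bits per cycle, while a radix-16 stage retires 4, halving the iteration count. A back-of-envelope calculation (the fixed setup/rounding overhead is an assumption):

```c
#include <stdio.h>

/* Rough cycle count for an iterative divider: a radix-2^k stage
 * retires k quotient bits per cycle. The 2-cycle overhead for setup
 * and rounding is illustrative, not a published figure. */
static int divide_cycles(int quotient_bits, int bits_per_cycle)
{
    const int overhead = 2;
    return quotient_bits / bits_per_cycle + overhead;
}

int main(void)
{
    printf("64-bit divide, radix-4:  ~%d cycles\n", divide_cycles(64, 2));
    printf("64-bit divide, radix-16: ~%d cycles\n", divide_cycles(64, 4));
    return 0;
}
```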
Reducing pipeline length improves performance directly by reducing latency and indirectly by easing pressure on the out-of-order window. So even though the instruction reorder buffer remains at 128 entries like the A57, the core now has more opportunity to extract instruction-level parallelism (ILP).
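Little's law gives a rough feel for why latency matters to a fixed-size window: the number of instructions in flight scales with issue rate times time-in-flight, so shorter latencies leave more of the 128 entries free for independent work. The rates below are illustrative only:

```c
#include <stdio.h>

/* Little's law applied loosely to the out-of-order window:
 * entries busy ~= sustained issue rate x instruction latency.
 * Fewer busy entries means more room to find independent work. */
int main(void)
{
    const int rob_entries = 128;
    const int issue_rate  = 3;  /* assumed sustained uops/cycle */
    for (int latency = 3; latency <= 5; latency += 2)
        printf("latency %d: ~%d entries busy, %d free for ILP\n",
               latency, issue_rate * latency,
               rob_entries - issue_rate * latency);
    return 0;
}
```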
Expanded zero-cycle forwarding (covering all integer and some floating-point instructions) further reduces effective execution latency. This technique allows dependent operations, where the output of one instruction is the input of the next, to follow each other directly in the pipeline without waiting one or more cycles in between. This is an important optimization for cryptography, which tends to execute long chains of dependent operations.
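Hash and cipher kernels show why this matters: each operation consumes the previous result, so there is no independent work to hide latency behind. The MurmurHash3-style finalizer chain below is an illustrative example of such code, not something taken from the article:

```c
#include <stdint.h>
#include <stdio.h>

/* A serial dependency chain typical of hashing/crypto code: every
 * operation reads the result of the one before it. Zero-cycle
 * forwarding lets each dependent op start right behind its producer
 * instead of stalling for the result to be written back. */
static uint64_t mix(uint64_t x)
{
    for (int i = 0; i < 64; i++) {
        x ^= x >> 33;                 /* depends on previous x   */
        x *= 0xff51afd7ed558ccdULL;   /* depends on the xor      */
        x ^= x >> 29;                 /* depends on the multiply */
    }
    return x;
}

int main(void)
{
    printf("%llx\n", (unsigned long long)mix(42));
    return 0;
}
```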
Load / Store
The load/store unit also sees improvements to both performance and power. One of the big changes from the A57 is a move away from separate L1 and L2 prefetchers to a single, more sophisticated prefetcher that fills both the L1 and L2 caches. This improves bandwidth while also reducing power. There are additional power optimizations in the L1-hit pipeline and forwarding network too.
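One plausible shape for such a combined prefetcher is a stride engine that pulls the next line into L1 and a farther-ahead line into L2. The distances, the hooks into the cache hierarchy, and the training step are all assumptions for the sketch, not ARM's design:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the cache hierarchy interface. */
static void prefetch_into_l1(uint64_t a) { printf("L1 <- %#llx\n", (unsigned long long)a); }
static void prefetch_into_l2(uint64_t a) { printf("L2 <- %#llx\n", (unsigned long long)a); }

/* Assumed lookahead distances: near lines go to L1, far lines to L2,
 * so one engine covers both levels' timeliness needs. */
#define L1_AHEAD 1
#define L2_AHEAD 4

static void on_demand_load(uint64_t addr, int64_t stride)
{
    if (stride == 0)
        return;  /* stride not yet trained; training logic not shown */
    prefetch_into_l1(addr + L1_AHEAD * stride);
    prefetch_into_l2(addr + L2_AHEAD * stride);
}

int main(void)
{
    on_demand_load(0x10000, 64);  /* sequential 64-byte-line stream */
    return 0;
}
```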
The L2 cache sees significant optimizations for higher bandwidth workloads. Memory streaming tasks see the biggest benefit, but ARM says general floating-point and integer workloads also see an increase in performance. A new cache replacement policy increases hit rates in the L2, which again improves performance and reduces power overall.
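For readers unfamiliar with the term, a replacement policy simply decides which way of a set to evict on a miss. The plain-LRU sketch below illustrates the concept for a 16-way set; ARM has not disclosed the A72's actual policy, so this is a generic example, not its algorithm:

```c
#include <stdint.h>
#include <stdio.h>

/* Generic 16-way set with LRU ages; illustrative only. */
#define WAYS 16

typedef struct {
    uint64_t tag[WAYS];
    uint8_t  age[WAYS];  /* 0 = most recently used */
} cache_set_t;

/* The replacement decision: evict the least recently used way. */
static int choose_victim(const cache_set_t *set)
{
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (set->age[w] > set->age[victim])
            victim = w;
    return victim;
}

int main(void)
{
    cache_set_t s = { .age = { 3, 7, 1 } };  /* remaining ways age 0 */
    printf("evict way %d\n", choose_victim(&s));
    return 0;
}
```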
Another optimization increases parallelism in the table-walker hardware for the memory management unit (MMU), which is responsible for, among other things, translating virtual memory addresses into physical addresses. Together with a lower-latency L2 TLB, this improves performance for programs, such as Web browsers, that spread their data across many pages.
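To see why walker parallelism helps, consider a standard 4-level AArch64 walk with a 4 KB granule: each level consumes 9 bits of the virtual address and requires a dependent memory access, so a single walk is inherently serial. The sketch below computes the per-level table indices; allowing several such walks in flight at once hides that serial latency for page-hopping workloads:

```c
#include <stdint.h>
#include <stdio.h>

/* 4 KB granule, 48-bit VA: level 0 indexes with bits [47:39],
 * level 3 with bits [20:12]; each index selects one of 512 entries. */
static uint64_t walk_index(uint64_t vaddr, int level)
{
    return (vaddr >> (39 - 9 * level)) & 0x1FF;
}

int main(void)
{
    uint64_t vaddr = 0x00007F123456789AULL;  /* arbitrary example VA */
    for (int level = 0; level < 4; level++)
        printf("level %d index: %llu\n", level,
               (unsigned long long)walk_index(vaddr, level));
    return 0;
}
```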