At today's Hot Chips Symposium, Mark Papermaster, AMD's Senior Vice President and CTO, discussed the upcoming "Steamroller" microarchitecture.
This is our first look at "Steamroller", the core behind the "Kaveri" APU, among others. AMD expects a 15 percent improvement in performance per watt over the "Piledriver" core, achieved through design-level rather than process-level improvements.
Those design-level improvements center on three goals: feeding the cores faster, improving single-core execution, and pushing performance per watt.
To "Feed the Cores Faster", AMD has increased the instruction cache size, enhanced instruction prefetch, made dispatch more efficient, and given each integer pipe its own dedicated decode unit. These changes yield a 30 percent reduction in i-cache misses, a 25 percent increase in maximum-width dispatches per thread, and a 20 percent reduction in mispredicted branches, adding up to a 30 percent increase in overall ops delivered per cycle.
Steamroller improves single-core execution by tuning up integer execution bandwidth and decreasing average load latency. Integer execution bandwidth benefits from the "Feed the Cores Faster" front-end work, along with more register resources (at the same latency) and smarter scheduling. Average load latency is reduced not only by trimming the latency itself but through faster handling of data cache misses and accelerated store-to-load forwarding. These changes bring a 5 to 10 percent increase in scheduling efficiency, along with major improvements in store handling.
As with any redesign, the goal is more performance at equal or lower power, a trend visible in graphics cards as much as in processors. AMD improves Steamroller's performance per watt through power optimization, a floating-point rebalance, and dynamic resizing of the L2 cache. Dynamic resizing lets the shared L2 cache operate in an adaptive mode based on workload. The floating-point rebalance streamlines the execution hardware and adjusts it to application trends, adding to the design's efficiency. The power optimizations lower average dynamic power and are tuned for loop behavior.
Look for more details from AMD during the Hot Chips Symposium on its Surround Computing and High Density (Thin) Libraries.