AMD has teased us with Carrizo, its sixth-generation APU, for months, dribbling out bits of detail, culminating in a recent series of worldwide briefings meant to showcase the engineering prowess behind this Bulldozer-based processing platform, and finally launch the product. It will be available in notebooks from OEMs starting this month.
The APU features Excavator, a new core built on the same 28nm process node as its predecessor, yet it packs in a dizzying array of efficiency improvements and new features, including HEVC decoding, a bigger L1 cache and the first true HSA 1.0 support. AMD said that Carrizo is aimed squarely at the sweet spot of the mobile market: the $400 to $700 laptop. And that's where Carrizo's story truly begins.
Today's notebooks are used for all sorts of tasks that had previously been reserved for high powered desktop systems. Video editing, photo editing, gaming, high quality video playback and heavy multitasking are now normal affairs for portable computers.
Citing internal research, AMD claimed that every second, two notebooks are sold in the $400 to $700 price range, which works out to two out of every five notebooks sold. But these computers typically lack performance, especially for playing games or watching high quality video smoothly, without stuttering or significant battery drain. This is the main problem that AMD hopes Carrizo will solve.
Carrizo is set to replace Kaveri in the mobile market, and it will usher in a host of new forward-looking features that AMD believes will not only set it apart from the current competition, but will ensure relevance and usability for the coming years.
AMD claimed Carrizo has 17x the compute performance, twice the graphics capability and double the battery life of the Kaveri architecture. All this, while maintaining the process at 28nm, and focusing on a power target of 15 W for light and portable notebooks.
Company executives indicated that this is the result of a corporate culture that has shifted toward power efficiency.
Energy efficiency has been steadily improving year over year for the last six years, with AMD achieving 10x increased efficiency between 2009 and 2014, but in 2014 the company vowed to increase energy efficiency an additional 25x by the year 2020. To do that, engineers have to find ways to outpace Moore's Law.
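To put the 25x goal in perspective, here is a rough calculation of the compound yearly improvement it implies. It assumes the gain is spread evenly over the six years from 2014 to 2020, and it uses the two-year doubling cadence commonly associated with Moore's Law as the comparison point. Both are simplifying assumptions for illustration, not AMD figures.

```python
def annual_rate(total_gain: float, years: float) -> float:
    """Compound per-year multiplier needed to reach total_gain in `years`."""
    return total_gain ** (1.0 / years)

# AMD's stated goal: 25x between 2014 and 2020 (six years).
goal_rate = annual_rate(25.0, 6.0)

# Comparison baseline (assumption): a doubling every two years.
doubling_rate = annual_rate(2.0, 2.0)

print(f"25x in 6 years   -> {goal_rate:.2f}x per year")
print(f"2x every 2 years -> {doubling_rate:.2f}x per year")
print(f"Gain after 6 years at the doubling pace: {doubling_rate ** 6:.0f}x")
```

Under those assumptions, the goal works out to roughly 1.71x per year, against about 1.41x per year (8x over six years) for a steady two-year doubling.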
AMD claimed Carrizo already represents a big leap towards that goal. Amazingly, it said company engineers managed to squeeze out 240 percent more performance per watt.
"It's important to set aggressive aspirational goals. It motivates the engineering team; it helps galvanize thinking," said Sam Naffziger, AMD Corporate Fellow. "It's one of the things that stimulates innovation and ways to get there when it seems too big a stretch."
Coming Together To Innovate Better
AMD acquired ATI Technologies back in 2006. Since then the company has used its technologies and people to come up with some very forward thinking ideas, but AMD's engineering teams are spread out and synergy takes time. AMD indicated that Carrizo's most promising innovation is the result of such collaborations of knowledge.
Specifically, the CPU team observed the way the GPU team used high density library technologies to maximize the available space with the most capability possible, surmising that this technology could benefit the CPU side. Without any official direction, experiments began.
With the lack of a new process node, the engineering team had to figure out how to reduce the space used on the die to make room for other features, such as improved graphics and multimedia decoding. Implementing the knowledge learned from the GPU team reduced the size of the Floating Point Scheduler by 38 percent, the FMAC by 35 percent and i-Cache Control by 35 percent. In total, even with all the extra features, AMD reduced the APU area by 23 percent in Carrizo.
Smaller, Cooler, Less Power, More Performance
AMD knew the processor couldn't be slower than its predecessor, because power savings and area reduction alone don't make a compelling product. The skinnier metal in the High Density Library cells was expected to be slower than previous designs, but at the lower 15 W power threshold, this actually became a positive.
As the core got smaller with the introduction of the High Density design, the power draw was reduced. Naturally, with less power draw, the chip runs cooler. Cooler silicon in turn leaks less power. Due to these increases in efficiency, Excavator cores can run faster than Steamroller while consuming fewer watts. Ten thousand lines of firmware have been added to track what the CPU is doing and to decide if and when the core receives more power, as it can be scaled up and down to fit specific needs.
Creative Thinking Goes A Long Way
Turning up the clock speed doesn't improve the number of instructions run per clock. IPC for Carrizo has increased by 4 to 15 percent nonetheless, thanks to doubling of the L1 Cache, better branch prediction, support for a new instruction set, and separating the memory interface controller from the cores.
One of the main bottlenecks that held back performance in the previous generation APU was the limited 16 KB L1 cache. Increasing the cache here means a higher L1 hit rate, which not only increases performance (less time spent waiting on slower memory), but also saves power.
L1 cache has traditionally been very small to keep the latency at the lowest point possible; L2 cache is larger, but much slower. In the new design, AMD had to shrink the L2 cache for the area and power savings, and therefore the L1 cache had to grow to keep up. AMD's engineers managed to devise a way to double the capacity to 32 KB, while maintaining the same latency as previous designs and cutting raw power draw in half. To pull this off, AMD had to make a number of different changes in the fundamental design of the cache memory.
L1 cache memory has multiple address segments known as ways. In a traditional cache, all of the segments are accessed at the same time, rather than only the one that is needed, which uses unnecessary power. Kaveri's design used four ways; Carrizo has seen an increase to eight ways. In order to reduce power use, AMD introduced a way predictor that predicts which segment will be needed and powers that particular way up, leaving the others dormant.
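The idea behind way prediction can be sketched with a toy model. AMD did not detail how its predictor works, so the "remember the last way that hit" policy, the round-robin replacement, the 64-set geometry (which assumes 64-byte lines) and every name below are illustrative assumptions, not Carrizo's actual design.

```python
class WayPredictedCache:
    """Toy set-associative cache that counts how many ways get powered up."""

    def __init__(self, num_sets: int = 64, num_ways: int = 8):
        self.num_sets = num_sets
        self.num_ways = num_ways
        # tags[set][way] holds the tag stored in that way (None = empty).
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        # Predictor (assumed policy): remember the last way used in each set.
        self.predicted_way = [0] * num_sets
        self.next_victim = [0] * num_sets  # simple round-robin replacement
        self.ways_powered = 0

    def access(self, line_address: int) -> bool:
        """Look up one cache line; returns True on a hit."""
        index = line_address % self.num_sets
        tag = line_address // self.num_sets
        ways = self.tags[index]

        # Step 1: power up only the predicted way.
        guess = self.predicted_way[index]
        self.ways_powered += 1
        if ways[guess] == tag:
            return True

        # Step 2: misprediction, so wake the remaining ways and check them.
        self.ways_powered += self.num_ways - 1
        for way, stored in enumerate(ways):
            if stored == tag:
                self.predicted_way[index] = way
                return True

        # Miss: fill a victim way and predict it for the next access.
        victim = self.next_victim[index]
        ways[victim] = tag
        self.next_victim[index] = (victim + 1) % self.num_ways
        self.predicted_way[index] = victim
        return False


cache = WayPredictedCache()
for _ in range(100):
    cache.access(5)  # hammer the same line, as a tight loop would
print(cache.ways_powered, "ways powered vs", 100 * cache.num_ways, "without prediction")
```

In this best-case loop, the first access misses and wakes all eight ways, and the remaining 99 each wake only one, for 107 way activations instead of 800. Real workloads mispredict more often, so the actual saving depends on the access pattern.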
AMD designed Excavator with thermal aware floor plans to keep the heat generating components separated. This allowed for higher power levels in the memory controller, which AMD claims resulted in better memory performance.
Higher Quality Video, Longer Playback
When it comes to power draw, it's very easy to fall back on standby numbers and idle time, but those stats don't mean anything. People buying notebooks expect to use them for the content that they desire, including streaming the latest League of Legends match on Twitch.
Traditionally, this has been a major drain on battery life because high quality video content, particularly files encoded in the latest compression formats, demands the use of a GPU to decode. The problem is that the GPU is very much underutilized in this scenario, though powering it on is necessary for reasonably fast playback.
The video playback path used for Kaveri is rather wasteful when it comes to energy efficiency. Video files go through the UVD (unified video decoder), into memory, into the GPU, back into memory, and then out to the display. This double use of memory significantly hindered performance.
To fix this problem, AMD designed a special purpose pipeline into the display engine that does all the scaling work. A special purpose logic underlay engine was also added to take care of most of the video playback tasks that were formerly handled by the graphics engine. This reduces power draw by nearly half a watt, which translates to almost half an hour of video playback time, according to AMD.
4K Video Playback Nets More Power Savings
The unified video decoder is another key piece of the puzzle. The UVD 6 engine has four times its predecessor's bandwidth, which is required to support 4K video at 60fps. AMD claimed that this small dedicated decoder chip is the first of its kind to support HEVC/H.265 decode. Without this chip, the traditional GCN cores aren't capable of decoding this high quality format in real time.
4K was a key aspect of the design. AMD wanted to make sure that the new APUs were ready for the future. At the same time, the company also realized that the average 1080p content was being rendered much faster than needed. Applying the energy efficiency mantra to this problem actually created some very impressive power savings yet again, for example, by switching off the power to the decoder between frames running at 60fps.
Power Gating was added to the UVD, allowing the decoder to handle each frame in 25 percent of the frame time, and then go to sleep until the next frame. Software is used to synchronize the CPU so that when it goes to sleep, the DRAM can finish the job.
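The arithmetic behind that duty cycle is straightforward. The sketch below works out the per-frame time budget at 60fps with the decoder active for 25 percent of each frame, then shows how a time-weighted average power would be computed. The wattages in the last line are hypothetical placeholders chosen only to show the shape of the saving; AMD did not supply decoder power figures.

```python
def frame_budget(fps: float, active_fraction: float):
    """Return (frame_ms, active_ms, sleep_ms) for a power-gated decoder."""
    frame_ms = 1000.0 / fps
    active_ms = frame_ms * active_fraction
    return frame_ms, active_ms, frame_ms - active_ms

def average_power(active_w: float, gated_w: float, active_fraction: float) -> float:
    """Time-weighted average power over one frame period."""
    return active_w * active_fraction + gated_w * (1.0 - active_fraction)

frame_ms, active_ms, sleep_ms = frame_budget(fps=60, active_fraction=0.25)
print(f"frame {frame_ms:.2f} ms: decode {active_ms:.2f} ms, gated {sleep_ms:.2f} ms")

# Hypothetical wattages (1.0 W active, 0.05 W gated), for illustration only.
print(f"average decoder power: {average_power(1.0, 0.05, 0.25):.4f} W")
```

At 60fps each frame lasts about 16.67 ms, so the decoder would be busy for roughly 4.17 ms and gated for the remaining 12.5 ms.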
All of these changes resulted in twice the battery life while playing video content, according to AMD.
What About Gaming?
AMD has been very bullish on targeting the eSports market. Carrizo engineers had a similar target in mind, and were able to squeeze out enough power to run Dota 2 on max settings at 1080p above 30fps on the top-end FX-8800P. League of Legends runs at nearly 50fps on the mid-tier A10-8700P, and CS:GO should hit nearly 40fps on the lower-price A8-8600P. For a laptop without discrete graphics, this level of performance is unheard of.
Note: These are AMD's numbers. We'll have to wait for products to test this out ourselves.
To manage this, AMD needed even more innovation. Memory bandwidth for an APU is always in short supply, as it relies on DRAM instead of the GDDR that discrete GPUs are paired with. To overcome this issue, AMD included a module that handles color compression. Doing this allows data to be compressed in a lossless format that the GCN cores are able to read and write. Memory performance is improved yet again by 5 to 7 percent using color compression, according to AMD.
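AMD did not describe the compression scheme itself, so the following is only a generic illustration of how lossless color compression can cut memory traffic: a tile of similar pixel values is stored as one base value plus small deltas. The tile size, the 4-bit width field and the function names are all assumptions made for this sketch, not details of Carrizo's hardware.

```python
def compress_tile(pixels):
    """Encode a tile of 8-bit values as (base, bits_per_delta, deltas)."""
    base = min(pixels)
    deltas = [p - base for p in pixels]
    bits = max(deltas).bit_length()  # bits needed for the largest delta
    return base, bits, deltas

def decompress_tile(encoded):
    """Rebuild the exact original values, which is what makes this lossless."""
    base, _bits, deltas = encoded
    return [base + d for d in deltas]

def compressed_bits(encoded):
    """Size estimate: 8-bit base + 4-bit width field + packed deltas."""
    _base, bits, deltas = encoded
    return 8 + 4 + bits * len(deltas)

sky = [200, 201, 201, 202, 200, 201, 202, 203]   # smooth gradient
noise = [0, 255, 3, 250, 17, 240, 99, 128]       # worst case for deltas

for name, tile in (("sky", sky), ("noise", noise)):
    enc = compress_tile(tile)
    assert decompress_tile(enc) == tile
    print(name, compressed_bits(enc), "bits vs", 8 * len(tile), "raw")
```

The smooth tile shrinks from 64 bits to 28, while the noisy one would come out larger than the original, so a scheme like this would typically fall back to storing such tiles uncompressed.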
Carrizo will support dual graphics when paired with specific AMD GPUs. In current games this can offer a significant boost in performance, especially once Windows 10 is released and DirectX 12 is put to use. DX12 will also leverage the power of any GPU, and combine it with the GCN cores of an APU, making a compelling argument for Carrizo to be the base of gaming-dedicated notebooks. Once we have hardware in the lab to put this to the test, you can be sure we'll let you know how it really stacks up against AMD's claims.
The new APU will support other advanced APIs as well. AMD's own Mantle API is a given. Vulkan is another option once the spec is finalized. AMD FreeSync technology is also supported by the new iGPU. Keeping the refresh rate and frame rate synchronized is yet another feature that helps reduce power draw.
Creating Performance With Intelligence, Not Brawn
On its path toward energy efficiency, AMD didn't stop there. Carrizo has more voltage islands than previous iterations to further reduce power draw and minimize leakage. As mentioned earlier, silicon tends to leak more when it runs warmer, so limiting heat generation is an important goal.
A voltage island is pretty much what it sounds like: a segregated area with its own power source. The reason this makes a difference with heat generation is simple. If graphics, multimedia and the Northbridge share the same voltage source, and one of them demands more power, all three are fed more power. This causes the other components to generate more heat, with no real benefit to performance.
With everything on its own island, voltage is only increased for the part that is actually requesting more. In a mobile platform with a 15 W power constraint, this solution is especially potent. Now more voltage can be pumped into the graphics engine than before, as there is no unnecessary drain on the limited power source. Doing so allowed Carrizo to have 33 percent more graphics compute units (CUs) than Kaveri did.
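A back-of-the-envelope model shows why separate islands help. It uses the standard CMOS switching-power approximation, P = C x V^2 x f, and ignores leakage entirely. The capacitance, voltage and clock values for the three blocks are hypothetical numbers picked for illustration, not Carrizo specifications.

```python
def dynamic_power(cap: float, volts: float, freq: float) -> float:
    """Classic CMOS switching-power approximation: P = C * V^2 * f."""
    return cap * volts ** 2 * freq

# Hypothetical blocks: (effective capacitance in F, required voltage in V, clock in Hz).
blocks = {
    "graphics":    (1.0e-9, 1.00, 800e6),
    "multimedia":  (0.5e-9, 0.80, 400e6),
    "northbridge": (0.5e-9, 0.85, 600e6),
}

def shared_rail_power(blocks: dict) -> float:
    """Every block runs at the highest voltage any one of them needs."""
    rail = max(v for _c, v, _f in blocks.values())
    return sum(dynamic_power(c, rail, f) for c, _v, f in blocks.values())

def island_power(blocks: dict) -> float:
    """Each block gets only the voltage it asked for."""
    return sum(dynamic_power(c, v, f) for c, v, f in blocks.values())

shared = shared_rail_power(blocks)
islands = island_power(blocks)
print(f"shared rail: {shared:.3f} W, islands: {islands:.3f} W, "
      f"saved: {shared - islands:.3f} W")
```

Because power scales with the square of voltage, even modest over-volting of the multimedia and Northbridge blocks adds up. In a 15 W envelope, whatever is saved there can be spent on the graphics engine instead.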
With more CUs to work with, next came the task of trying to extract more performance from each of them. There are two different devices that can be used to build a CU: one is optimized for higher frequencies, and the other is optimized for low leakage. Engineers have to strike a balance here.
The efforts put forth in optimizing the graphics resulted in a very respectable 10 percent higher frequency, while reducing leakage by a whopping 18 percent, according to AMD.
AMD usually aims for a 30 to 40 percent increase in performance with each generation. In this case, the company said that it has increased performance by as much as 65 percent over Kaveri, at a power level 20 W lower. AMD pulled these numbers from tests running 3DMark 11 with performance settings.
In This Day And Age, Security Is Paramount
Carrizo has a dedicated security subsystem, which includes a dedicated ARM Cortex-A5 integrated into the APU die. This microcontroller has its own dedicated ROM and SRAM that are isolated from the rest of the system. It has access to main system memory if needed, but nothing else can access the dedicated memory.
A cryptographic co-processor is also included, and it's used to handle RSA up to 16,384 bits, SHA-1 through SHA-512 and a host of other cryptographic methods. Using the dedicated processor allows for encryption to be handled while using very little power. This chip handles SecureBoot, hard drive encryption, and TrustZone applications. When security isn't handled by the CPU, it becomes much harder to exploit.
AMD emphasized the importance of consistency across its product line in regard to security and highlighted that with the release of Carrizo, all of the APUs AMD builds now have the same level of cryptographic protection. The company sees a future where secure applets will be created that execute directly on the security chip. This is only possible with commonality across the product line.
Heterogeneous System Architecture 1.0
AMD has been working towards full HSA 1.0 specification support for some time now. Kaveri gave us an early look at what the technology can do. The GPU and CPU cores were able to access all available memory, up to a limit of 32 GB, but memory allocated to the CPU cores was segregated from memory allocated to the GCN cores. If a workflow demanded both GCN and CPU cores, then memory had to be mirrored, rather than properly shared.
The full HSA specification allows memory access equally across all 12 compute cores. Now, large workloads being sent to the GCN cores aren't limited to allocated graphics memory. All system memory is available to work with, and the CPU cores can see the same workload at the same time.
The ability to create and dispatch work equally between CPU and GPU cores is one of the main advantages of HSA. These two processing units handle very different workloads and are very much mismatched. One core is optimized for scalar work, and the other is optimized for vector work. Both cores have to be able to pass work back and forth, but with rigid connections between the two, they tend to fight each other.
This is where a driver would normally get involved, but a driver acting as task manager in this case would slow the work down. AMD employs heterogeneous queuing (hQ) here, and described these queues as rubber bands connecting the data paths between cores. This provides the flexibility for the workloads to mismatch, yet neither side will run out of work.
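The "rubber band" analogy can be modeled with a simple bounded queue sitting between a bursty producer and a steady consumer. This is only a conceptual simulation of the buffering behavior AMD described; it does not use or represent the HSA runtime, and the function name, tick-based timing and queue capacity are assumptions made for the example.

```python
from collections import deque

def simulate(produce_per_tick, consume_per_tick, capacity=16):
    """Run a producer and consumer in lockstep ticks through a bounded queue.

    Returns (items_consumed, consumer_idle_ticks, peak_depth).
    """
    queue = deque()
    consumed = idle = peak = 0
    next_item = 0
    for make, take in zip(produce_per_tick, consume_per_tick):
        # Producer side: enqueue work until the queue is full.
        for _ in range(make):
            if len(queue) < capacity:
                queue.append(next_item)
                next_item += 1
        peak = max(peak, len(queue))
        # Consumer side: note a starved tick, then drain up to `take` items.
        if take and not queue:
            idle += 1
        for _ in range(min(take, len(queue))):
            queue.popleft()
            consumed += 1
    return consumed, idle, peak

# One side submits work in bursts of four; the other retires one item per tick.
bursty = [4, 0, 0] * 3
steady = [1] * 9
print(simulate(bursty, steady))
```

Even though the two sides never work at the same rate in any given tick, the consumer is never left idle in this run: the queue stretches to a peak depth of six items and absorbs the mismatch.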
Popular programming languages Java, C++ AMP, and Python are all accelerated by Carrizo, and with their widespread use, it should make harnessing it and future HSA processors simple for developers that use these platforms today.
Simultaneously Handles Compute And 3D Graphics
The graphics units in an APU are expected to do two very different tasks: 3D graphics and compute. To do that, there needs to be an efficient way for the GCN unit to switch between the two tasks on the fly.
Context switching is the ability to push tasks aside into a save state to make room for a different task. This is not a new idea; your CPU has been doing this forever. Take a look at the processes running in the background of Windows. Imagine requiring a core for each process running. You'd need a CPU with thousands of cores in that scenario. Context switching is the reason you don't. It enables the ability to pass work between processors quickly.
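For readers who want to see the principle in action, here is a toy round-robin scheduler that runs several tasks on a single "core". Python generators stand in for processes, since a generator's state is preserved automatically each time it pauses. This is an analogy for how context switching lets one core serve many tasks, not a model of how an operating system or Carrizo's hardware implements it.

```python
from collections import deque

def task(name, steps):
    """A toy process: does `steps` units of work, pausing after each one."""
    for i in range(1, steps + 1):
        yield f"{name}:{i}"  # yielding saves this task's state until resumed

def run_on_one_core(tasks):
    """Round-robin scheduler: one core, many tasks, switch after every step."""
    ready = deque(tasks)
    trace = []
    while ready:
        current = ready.popleft()        # pick up the next task's saved state
        try:
            trace.append(next(current))  # run it for one time slice
            ready.append(current)        # set it aside, back of the line
        except StopIteration:
            pass                         # task finished; drop it
    return trace

print(run_on_one_core([task("A", 2), task("B", 3), task("C", 1)]))
```

The output interleaves the three tasks (A:1, B:1, C:1, A:2, B:2, B:3), showing each one making progress without ever having a core to itself.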
Context switching on a CPU is an accepted practice, and AMD wants the same to be true for APUs. To do that, it had to overcome a few problems. Most notably, 3D graphics carry a lot of data, and switching state means pushing all of it to the side. Compute, on the other hand, doesn't need that much space, and therein lies the solution AMD pursued.
3D state gets pushed aside, but the 3D pipeline is much larger than compute workloads. Most of the data can remain stored in the pipeline, while the space needed is switched for compute data. Compute workloads can context switch independently from the 3D data until it’s time to bring 3D back. All of this is done in the background with no interruptions.
So What Does This All Mean?
AMD's new processor holds plenty of promise on paper. The level of productivity that can be achieved with an HSA-enabled platform has yet to be demonstrated, but it's easy to see it could have a dramatic positive effect.
We have actually seen first-hand what Carrizo can do, and we feel it will be a game changer for the mobile market. It lays the foundation for a future of very efficient, very powerful HSA-enabled processors.