This year’s CES was my most insane to date. I showed up in Vegas two days early, stayed a day later, and managed to fit close to 50 different meetings into a schedule that started early in the morning and didn’t end until late at night. But by the end, I had a solid grasp on the technologies we’ll be seeing in 2014. Some of them are decidedly evolutionary. Others, like Oculus’ Crystal Cove prototype, will fundamentally change PC gaming for the better.
For AMD’s part, it spent CES talking about Kaveri—a design that, on paper, should be interesting stuff for enthusiasts. There are the Steamroller-based x86 cores, giving us a new processor architecture to talk about. This is also the first time AMD’s vaunted Graphics Core Next design finds its way into an APU. The company did a ton of work enabling Heterogeneous System Architecture features for better interplay between computing resources and software developers. And it’s using a new 28 nm manufacturing process from GlobalFoundries.
But although this week's introduction focused on the top-end 95 W A10-7850K, the real emphasis of Kaveri is down in the lower-power segments. Company representatives say engineers designed for the 35 to 45 W range, scaling as high as 95 and as low as 15 W. AMD wants to see APUs in desktops, notebooks, embedded environments, and servers. So, it took the middle road in order to better optimize for those targets. AMD also had to make some compromises on the manufacturing side, better balancing transistor density to enable a 512-shader Radeon graphics core, while ultimately sacrificing CPU speed.

Of course, at the end of the day, once we’ve carefully carved through the architecture and AMD’s vision for Kaveri, what matters most is how this APU family compares to what came before and Intel’s best effort in the same space.
Building A Better Computing Device
Integration is a word that gets thrown around a lot, and often with a negative connotation. Ew, integrated graphics, right? But integration is an important part of making complex technologies more affordable. In many cases, it’s very, very good for performance. And it typically helps power consumption, too. By now we all know that AMD’s APUs combine multiple subsystems to allow the fast movement of data between programmable and fixed-function logic, maximizing flexibility and, ideally, making it possible to run demanding workloads on affordable hardware.
That Kaveri includes multiple x86 cores, graphics processing, memory control, cache, hardware-based accelerators, and PCI Express connectivity on a single piece of silicon is no surprise; its predecessor offered a similarly-thorough list of capabilities as well. But if you think of Kaveri as a puzzle, AMD took each piece and tweaked it in such a way that the finished product would reflect the latest technologies, more advanced manufacturing, and another step toward the company’s vision of utilizing the most appropriate resources for any workload.
One component of this approach involved re-thinking lithography. In partnership with GlobalFoundries, AMD is shifting from 32 nm SOI to a 28 nm bulk silicon process. Now, there are associated advantages and disadvantages. Previously, AMD was building its APUs using technology optimized for CPUs. That allowed chips like the A10-6800K to hit clock rates as high as 4.4 GHz through Turbo Core. But tuning for low density, low resistance, and ultimately higher frequencies negatively affected the number of transistors that AMD could fit on a die, limiting the complexity of its GPU. In a world where x86 cores are considered “fast enough” in workloads that wait for user input, the decision was made to slide the scale toward density. AMD calls this APU-optimized, but the bottom line is that it’s using slower, higher-resistance transistors in order to facilitate better area utilization.
The consequence is lower-frequency x86 cores, which you’ll see reflected in a comparison of Kaveri and Richland. AMD says it compensates with a transition from the Piledriver architecture to Steamroller. A focus on improving IPC—or the amount of work each core does per cycle—purportedly yields up to 20% gains, leaving Kaveri net-positive in most x86 workloads.
On the other hand, the APU sports a more potent graphics subsystem, wielding up to 512 shaders based on the GCN architecture. Richland topped out at 384 of the previous-gen VLIW4 ALUs. This clear re-distribution of transistor wealth in favor of the GPU better-addresses the performance-sensitive workloads AMD is targeting (gaming, multimedia, and content creation), while maintaining a status quo in more general-purpose tasks.
All told, Kaveri is a 2.41 billion-transistor SoC crammed into 245 square millimeters. Richland was nearly the same size (246 mm²), but comprising just 1.3 billion transistors. Do you like that? We’re now dismissing billion-plus-transistor processors as pedestrian. Still, the comparison shows the impact of AMD’s shift to 28 nm bulk silicon, optimized for a more GPU-focused die.
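To put the density shift in numbers, here is a quick back-of-the-envelope sketch using only the transistor counts and die sizes quoted above. The helper name `density_mtr_per_mm2` is ours, and the result is an average across the whole die, not a statement about any particular block.

```python
# Transistor counts and die areas as quoted in the text above.
chips = {
    "Kaveri (28 nm bulk)": {"transistors": 2.41e9, "area_mm2": 245},
    "Richland (32 nm SOI)": {"transistors": 1.3e9, "area_mm2": 246},
}


def density_mtr_per_mm2(transistors, area_mm2):
    """Average density in millions of transistors per square millimeter."""
    return transistors / area_mm2 / 1e6


densities = {name: density_mtr_per_mm2(**spec) for name, spec in chips.items()}
for name, value in densities.items():
    print(f"{name}: {value:.2f} MTr/mm^2")

# How much more logic Kaveri fits into (almost) the same area.
ratio = densities["Kaveri (28 nm bulk)"] / densities["Richland (32 nm SOI)"]
print(f"Kaveri averages about {ratio:.2f}x Richland's density")
```

By this rough measure, Kaveri fits close to twice as many transistors into each square millimeter, which is the budget that pays for the larger GPU.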
The Kaveri Family, As It Exists Today
Two models (A10-7850K and A10-7700K) are expected to ship immediately, with a third (A8-7600) surfacing in the first quarter of 2014. The flagship is priced at $173. So, you get a lot of additional goodness, but pay an additional 22% compared to A10-6800K. Even the -7700K is pricier than last-gen's fastest offering at $152. Ahead of its official debut, the -7600 is expected to sell for $119.
| | A10-7850K | A10-7700K | A8-7600 |
|---|---|---|---|
| Graphics Level | Radeon R7 | Radeon R7 | Radeon R7 |
| TDP | 95 W | 95 W | 65/45 W |
| CPU Cores | 4 | 4 | 4 |
| CPU Base Clock Rate | 3.7 GHz | 3.4 GHz | 3.3 / 3.1 GHz |
| Max. Turbo Core Clock Rate | 4 GHz | 3.8 GHz | 3.8 / 3.3 GHz |
| GPU Shaders | 512 | 384 | 384 |
| GPU Clock Rate | 720 MHz | 720 MHz | 720 MHz |
| "Compute Cores" | 12 | 10 | 10 |
| Price | $173 | $152 | $119 |
Both of the just-launched Kaveri-based APUs are 95 W parts (ironically, the thermal ceiling AMD appears least concerned with).
A10-7850K sports two Steamroller modules and 512 shaders. The processor’s base clock rate is 3.7 GHz, though it can reach up to 4 GHz in lightly threaded apps. Meanwhile, the R7 graphics engine operates at 720 MHz.
In fact, all three Kaveri models sport GPUs at 720 MHz. The biggest difference between A10-7850K and the other two SKUs is shader count. A10-7700K and A8-7600 both come with 384. The -7700K operates at a 3.4 GHz base clock that ramps up as high as 3.8 GHz under the right thermal conditions.
The A8-7600 is unique in that it offers a TDP that can be manually configured to 65 or 45 W. A higher thermal ceiling allows for a 3.3 GHz base clock and 3.8 GHz peak, while the 45 W setting keeps the APU cycling between 3.1 and 3.3 GHz.

Kaveri-based APUs drop into a new interface called Socket FM2+. We’ve already seen compatible motherboards employing AMD’s A88X, A78, A75, and A55 Fusion Controller Hubs; it’s really up to each board vendor to hit the right price points with Socket FM2+. You can use Socket FM2-based APUs on FM2+-equipped boards, but not vice versa. A block diagram of the Kaveri die also reveals a PCI Express 3.0 controller (presumably with 16 lanes of connectivity, given the motherboards we have in the lab so far), support for up to four display outputs, and the same XDMA engines found in AMD’s Hawaii GPU for CrossFire (in this case, enabling Dual Graphics functionality). We’ll go into multi-GPU rendering in greater depth later in today’s story.
Send In The Marketers
Here’s the part of the discussion where our more technical readers may let out a collective groan. Last generation, AMD referred to its x86 and graphics shaders independently. A10-6800K had four cores (actually, two Piledriver modules with four distinct integer clusters) and 384 shaders.
This time around, the company takes the fundamental graphics building block—the Compute Unit—which is replicated over and over to give us GCN-based GPUs like Hawaii with up to 2816 shaders, and dubs it a Compute Core. By definition, a Compute Core is HSA-enabled, programmable, and capable of running at least one process in its own context and virtual memory space, independent of other cores.

Of course, this gives AMD the ability to sum its CPU and GPU resources, yielding Kaveri-based APUs with 10 and 12 compute cores, all with access to the same unified coherent memory. That’s compelling nomenclature when your competition is selling dual- and quad-core mainstream processors. Fortunately, the company’s legal department insists on a specific breakdown of CPU and GPU resources any time core count is used to describe a Kaveri-based APU.
AMD (validly) posits that it wants the technical community to think of up to 12 threads running concurrently, which is why it talks about Kaveri as a 12-core device. The APU does, in fact, address parallelism in a new and interesting way. We simply want to see the company use its messaging for good. At a time when AMD talks about CPUs in terms of their highest Turbo Core frequencies and rates high-end GPUs for clock rates they simply cannot sustain, the mainstream customers most likely to buy an APU aren’t going to understand the advanced implications of its nomenclature.
A New x86 Architecture: The First Steamroller CPU

At least most folks seem comfortable with the definition of AMD’s module-based approach to x86 cores, right? Kaveri represents the first outing for the Steamroller architecture, succeeding the Piledriver design at the heart of AMD’s Richland APUs. Although some of those previous-gen parts sported one module (or two cores), the just-introduced Kaveri models include two. AMD calls this a four-core configuration, though we know each module exposes two integer clusters and a shared floating-point unit.
Back when AMD introduced the Bulldozer architecture, we immediately took note of the big step back in per-cycle performance. Piledriver helped a little, but IPC remained painfully low compared to Intel’s Sandy Bridge, Ivy Bridge, and Haswell architectures. Steamroller was designed to help make up some of the difference, and engineers claim instruction throughput is up as much as 20%. Unfortunately, manufacturing decisions temper that gain.

The changes made to Steamroller predominantly improve efficiencies at the front-end of the pipe to minimize stalls and, according to AMD, get single-threaded performance back up to more competitive levels. The L1 instruction cache, previously 64 KB and two-way set associative, is now 96 KB and three-way set associative, reducing misses by 30%. AMD’s engineers similarly went after mispredicted branches by increasing the L2 branch target buffer from 5,000 to 10,000 entries and augmenting the branch predictor itself. Instruction scheduling is made 5 to 10% more efficient through a jump to 48 entries (from 40). And company reps say that both integer clusters can access the microcode ROM simultaneously now, where they couldn’t before. Steamroller can issue two stores at once; the Piledriver architecture would only do one. Finally, the load/store units in each integer cluster feature ~20%-larger queues, further benefiting efficiency.
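The instruction cache change is easier to picture with a little arithmetic. The sketch below computes the number of sets in a set-associative cache from its capacity and associativity. The 64-byte line size is our assumption (AMD's figures above don't state it), and `cache_sets` is a hypothetical helper name.

```python
def cache_sets(size_kb, ways, line_bytes=64):
    """Sets = capacity / (ways * line size).

    line_bytes=64 is an assumption for illustration; the article
    does not state the L1 instruction cache line size.
    """
    total_bytes = size_kb * 1024
    bytes_per_set = ways * line_bytes
    if total_bytes % bytes_per_set:
        raise ValueError("capacity must divide evenly into sets")
    return total_bytes // bytes_per_set


piledriver_sets = cache_sets(64, ways=2)   # 64 KB, two-way
steamroller_sets = cache_sets(96, ways=3)  # 96 KB, three-way

print("Piledriver L1I sets: ", piledriver_sets)
print("Steamroller L1I sets:", steamroller_sets)
print("Capacity per way (KB):", 64 // 2, "vs", 96 // 3)
```

Whatever the real line size, both designs work out to 32 KB per way, so the set count is unchanged. The extra 32 KB shows up as a third way in every set, which gives each set more room before a conflict forces an eviction.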
To test AMD’s claims, I dialed in a Core i5-4670K, A10-6800K, and A10-7850K to exactly 4 GHz, then ran our single-threaded iTunes and LAME benchmarks.

In iTunes, Steamroller gets exactly zero benefit. The Haswell-based Core i5 is naturally quite a bit faster. LAME actually reflects a tiny gain, but again, Intel’s architecture enjoys a commanding lead.
Frustrated at the lack of single-core speed-up, I decided to add our threaded 3ds Max 2013 render project. Only then, after spinning up both Steamroller modules, does the architecture demonstrate significantly better results. At 4 GHz, the A10-7850K is 22% faster than the A10-6800K. Some of that is eroded in practice by the Richland-based APU’s higher shipping clocks. However, it does appear that improvements made to Steamroller show up selectively, depending on the workload.
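For a rough sense of how much of that gain survives at shipping clocks, here is a first-order sketch that treats throughput as per-clock performance times frequency. The clock rates are the base and Turbo Core figures quoted in this article. The model itself is our simplification: real chips move between those clocks, and the 22% figure only applies to this one threaded workload.

```python
def relative_throughput(per_clock_gain, new_clock_ghz, old_clock_ghz):
    """First-order model: throughput scales with IPC times clock rate."""
    return (1.0 + per_clock_gain) * new_clock_ghz / old_clock_ghz


MEASURED_GAIN = 0.22  # A10-7850K vs. A10-6800K in 3ds Max, both at 4 GHz

# A10-7850K: 3.7 GHz base, 4.0 GHz Turbo. A10-6800K: 4.1 GHz base, 4.4 GHz Turbo.
at_base = relative_throughput(MEASURED_GAIN, 3.7, 4.1)
at_turbo = relative_throughput(MEASURED_GAIN, 4.0, 4.4)

# Per-clock gain Kaveri needs just to match Richland at base clocks.
break_even_gain = 4.1 / 3.7 - 1.0

print(f"Estimated advantage at base clocks:  {at_base - 1:.1%}")
print(f"Estimated advantage at Turbo clocks: {at_turbo - 1:.1%}")
print(f"Per-clock gain needed to break even: {break_even_gain:.1%}")
```

Under these assumptions, roughly half of the 22% per-clock advantage is spent making up for the lower frequency, and any workload that gains less than about 11% per clock would come out behind.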
I’m a bit of a geek in that I get excited about testing the differences between subsequent processor designs. But Steamroller really only serves as an enabler for AMD’s Graphics Core Next architecture in Kaveri, improving IPC enough so that the more dense APU doesn’t sacrifice too much general-purpose performance as the graphics subsystem grows. In fact, AMD says Kaveri’s GPU accounts for 47% of the die.

The engine is composed of up to eight GPU “cores”, formerly referred to as Compute Units, and made up of four Vector Units with 16 shaders each. In total, that’s 64 shaders per core and 512 shaders in an eight-core implementation. Don’t let the numbers or evolving terminology confuse you though. Architecturally, this is the same technology found in AMD’s Hawaii GPU, which I covered in Radeon R9 290X Review: AMD's Back In Ultra-High-End Gaming, including precision improvements to the native LOG/EXP operations and MQSAD optimizations for speeding up motion estimation algorithms, mentioned back when Hawaii launched. Of course, the big addition is coherent shared unified memory. That coherency makes it easier to pass data between the GPU and CPU cores—again, the degree of “equalness” between dissimilar on-die resources is the exact reason why AMD is using the term Compute Core to begin with.
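The shader and "Compute Core" arithmetic can be checked in a few lines. This is a minimal sketch built from the figures above. The peak-throughput helper assumes each shader retires one fused multiply-add (counted as two floating-point operations) per clock, which is our assumption and not a number taken from this article, and the function names are ours.

```python
VECTOR_UNITS_PER_GPU_CORE = 4
SHADERS_PER_VECTOR_UNIT = 16
SHADERS_PER_GPU_CORE = VECTOR_UNITS_PER_GPU_CORE * SHADERS_PER_VECTOR_UNIT  # 64


def gpu_cores(shaders):
    """Number of GPU cores (Compute Units) implied by a shader count."""
    return shaders // SHADERS_PER_GPU_CORE


def compute_cores(cpu_cores, shaders):
    """AMD's marketing total: x86 cores plus GPU cores."""
    return cpu_cores + gpu_cores(shaders)


def peak_gflops(shaders, clock_mhz, flops_per_clock=2):
    """Theoretical single-precision peak.

    flops_per_clock=2 assumes one fused multiply-add per shader per clock.
    """
    return shaders * flops_per_clock * clock_mhz / 1000.0


for name, cpu, shaders in [("A10-7850K", 4, 512),
                           ("A10-7700K", 4, 384),
                           ("A8-7600", 4, 384)]:
    print(f"{name}: {gpu_cores(shaders)} GPU cores, "
          f"{compute_cores(cpu, shaders)} compute cores, "
          f"{peak_gflops(shaders, 720):.0f} GFLOPS peak (GPU only)")
```

The totals match the "Compute Cores" row in the specification table: 12 for the A10-7850K and 10 for the two 384-shader parts.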

There’s a lot of strategic trimming that goes into optimizing Kaveri’s GPU compared to AMD’s discrete solutions. The Hawaii GPU has four geometry processors able to rasterize as many primitives per clock cycle. Tahiti features two. Kaveri gets one. And while 16 render back-ends give Hawaii massive pixel fillrate, Kaveri is pared down to two ROP partitions, capable of eight pixels per clock. Given the bandwidth limitations of an integrated solution attached to DDR3 memory, those design decisions make perfect sense.
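To see why a small render back-end is a sensible match for DDR3, compare peak pixel output with available memory bandwidth. The sketch below uses the eight pixels per clock and 720 MHz figures from this article plus the DDR3-2133 memory from our test bench. The dual-channel, 64-bit-per-channel layout and the four bytes written per pixel are our assumptions, and the model ignores compression, caching, and everything else competing for memory.

```python
def fill_rate_gpixels(pixels_per_clock, clock_mhz):
    """Peak pixel fill rate in gigapixels per second."""
    return pixels_per_clock * clock_mhz / 1000.0


def ddr3_bandwidth_gbs(mega_transfers, channels=2, bytes_per_transfer=8):
    """Peak bandwidth in GB/s; assumes 64-bit (8-byte) channels."""
    return mega_transfers * channels * bytes_per_transfer / 1000.0


BYTES_PER_PIXEL = 4  # assumed 32-bit color writes, no compression

kaveri_fill = fill_rate_gpixels(pixels_per_clock=8, clock_mhz=720)
memory_bw = ddr3_bandwidth_gbs(2133)
write_traffic = kaveri_fill * BYTES_PER_PIXEL
share = write_traffic / memory_bw

print(f"Peak fill rate:         {kaveri_fill:.2f} Gpixels/s")
print(f"Peak DDR3-2133 (dual):  {memory_bw:.1f} GB/s")
print(f"Color writes at peak:   {write_traffic:.2f} GB/s ({share:.0%} of bandwidth)")
```

Even in this simplified view, color writes alone could consume about two-thirds of the memory bandwidth the CPU cores also depend on, so adding more ROPs would mostly add hardware that waits on DDR3.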

Not every piece of the Kaveri GPU is a subset of Hawaii. AMD exposes all eight of the discrete processor’s Asynchronous Compute Engines, which independently schedule tasks to the CUs (incidentally, Sony’s PlayStation 4 also boasts eight ACEs). They all share access to a global data share and a 512 KB L2 cache. But they can otherwise operate on their own for efficient multi-tasking. Back when I was digging into Hawaii, the shift from two ACEs in Tahiti to four in Kabini/Temash and then eight didn’t seem immediately necessary. Now that we’re seeing the design exposed on Kaveri, however, its importance to AMD’s HSA is clearer.
Fixed-Function Accelerators: More Specialized Hardware
I already mentioned that Kaveri lacks fixed-function support for H.265 decoding. However, the old faithful Unified Video Decoder is in there, accelerating playback of H.264, VC-1, MPEG-2, MVC, and MPEG-4. In essence, the “new” UVD 4 in Kaveri is similar to the older UVD 3 block, except for improved error resiliency during AVC decoding.
AMD also claims to have improved its Video Codec Engine, adding I, P, and B frame support to the common H.264 YUV420 video format and I frames to the simpler YUV444 format. To be sure, we’re happy to see AMD adding to the VCE block’s functionality. However, our most recent look at the VCE’s performance put AMD behind Nvidia’s NVEnc solution and significantly slower than Intel’s Quick Sync. So, while Kaveri’s second-gen VCE might represent a functional step forward, we want to see more attention paid to its position relative to competitive encoders.

As with the Hawaii and Bonaire GPUs powering Radeon R9 290X, 290, and R7 260X, Kaveri includes TrueAudio support. That means there are, presumably, three Tensilica HiFi2 EP Audio DSP cores built into the APU’s die able to offload sound processing. I say Kaveri supports this technology because it needs to be exploited in software before you realize any benefit, and thus far there aren’t any applications we can use to illustrate TrueAudio’s impact. At least in practice, it’s intended to facilitate more complex effects without a corresponding drain on host resources. But every attempt we’ve seen to demonstrate TrueAudio hasn’t translated particularly well to a conference room setting.
Let’s get back to that concept of integration. Done right, integration should allow for compounding efficiencies. Subsystems brought closer together can communicate more quickly and save power.

Back when AMD introduced its Llano APU, the company put four Stars cores, a northbridge, two 64-bit memory channels, PCIe control, and a graphics engine on one die. It provided a 128-bit Fusion Control Link to the GPU for access to coherent memory space, simultaneously giving the CPU access to the GPU’s frame buffer. Separately, another bus gave the graphics engine higher-bandwidth access to memory.

The advent of Trinity (and then Richland) saw AMD push integration even further. It unified the CPU and graphics northbridges, doubling the data path bandwidth of its Radeon Memory Bus in the process. Perhaps even more significantly, it added an I/O memory management unit, attached through the Fusion Control Link, which gave the GPU access to virtual address space. The road to HSA was slowly being paved.

Kaveri incorporates a second bus through the IOMMU for coherency. It also exposes functionality called system-level atomics for synchronizing work across different cores. Together, those features complete the puzzle and enable a trio of HSA features.
A heterogeneous unified memory architecture, to begin, gives the CPU and GPU subsystems visibility into the entire memory space, up to 32 GB. Additionally, both the CPU and GPU are treated equally by AMD’s heterogeneous queuing model. Work can be dispatched from one to the other and vice versa. As a result, the APU’s on-die resources can tag-team more compute-intensive workloads.
Right off the bat, AMD is identifying a handful of tasks that’ll benefit from greater compute potential in the mobile and desktop spaces. Media playback is the first. You’ve already seen us demonstrate how demanding H.265 encoding can be. AMD is going to offload encode/decode onto the GPU, since it wasn’t able to build a fixed-function accelerator for playback in time, and doesn’t expect to even try encoding that way. Unfortunately, the requisite software is still being worked on, so we can’t compare CPU- to GPU-based HEVC playback today. In the same vein, video and image editing already do lean on GPU resources (we have our own Photoshop, Premiere Pro, and After Effects tests that are technically OpenCL-optimized). This will naturally continue with Kaveri. Of course, gaming is that killer app always able to push the latest and greatest; developers are already using compute in a variety of ways. For example, DICE uses a compute shader for tile-based deferred rendering in Battlefield 4.
How Do I Use This HSA You Speak Of?
AMD makes a big deal about its effort to design hardware that just works within the scope of how developers write code today, rather than forcing them to change direction yet again. Leveraging HSA shouldn't have the long adoption curve of multi-core CPUs, which were difficult to fully utilize, or GPGPU computing, which was only possible through low-level APIs for quite a while. Instead, the company’s HSA features map to OpenCL 2.0, ratified late last year.
The bad news is that the applications already installed on your PC aren’t optimized for Kaveri’s full feature set (though AMD does claim legacy OpenCL benefits from HSA thanks to run-time improvements). That’ll require ISVs to gradually introduce updated software. But a growing swath of developers is becoming increasingly proficient with OpenCL, and we’ve already incorporated a number of workloads into our benchmark suite able to leverage the API. Though testing won’t reflect HSA-oriented design today, we’re already working with a couple of big names to fold in relevant workloads.
Bottom line: we waited years for the first mainstream OpenCL-optimized applications, and now we have many well-known multimedia, content creation, productivity, and gaming titles benefiting from heterogeneous computing. We only expect to be on-hold for months before software starts showing up written for OpenCL 2.0. When that happens, AMD’s HSA features should augment performance and power consumption in different ways.
| Test Hardware | |
|---|---|
| Processors | AMD A10-7850K (Kaveri) 3.7 GHz, Four Cores, Socket FM2+, 4 MB Shared L2, 512 Shaders, 720 MHz GPU Clock Rate, Power-savings enabled |
| | AMD A8-7600 (Kaveri) 3.3 GHz, Four Cores, Socket FM2+, 4 MB Shared L2, 384 Shaders, 720 MHz GPU Clock Rate, Power-savings enabled |
| | AMD A10-6800K (Richland) 4.1 GHz, Four Cores, Socket FM2, 4 MB Shared L2, 384 Shaders, 844 MHz GPU Clock Rate, Power-savings enabled |
| | AMD A8-6500T (Richland) 2.1 GHz, Four Cores, Socket FM2, 4 MB Shared L2, 256 Shaders, 720 MHz GPU Clock Rate, Power-savings enabled |
| | Intel Core i5-4670K (Haswell) 3.4 GHz, Four Cores, LGA 1150, 6 MB Shared L3, HD Graphics 4600, 1.2 GHz Max. GPU Clock Rate, Power-savings enabled |
| | Intel Core i3-4330 (Haswell) 3.5 GHz, Two Cores, LGA 1150, 4 MB Shared L3, Hyper-Threading enabled, HD Graphics 4600, 1.15 GHz Max. GPU Clock Rate, Power-savings enabled |
| Motherboards | ASRock FM2A88X-ITX+ (Socket FM2+) AMD A88X Fusion Controller Hub, BIOS 1.90 |
| | MSI Z87I Gaming AC (LGA 1150) Intel Z87 Platform Controller Hub, BIOS 1.0 |
| Memory | AMD Radeon Memory (2 x 8 GB) DDR3-2133 10-11-11-30, AG316G2130U2K |
| | G.Skill Ripjaws X (2 x 8 GB) DDR3-2133 9-11-10-28, F3-17000CL9Q-16GBXM |
| Hard Drive | Samsung 840 Pro 256 GB SATA 6 Gb/s |
| Power Supply | Corsair AX860i 80 PLUS Platinum-Rated |
| System Software And Drivers | |
| Operating System | Windows 8 Professional 64-bit |
| DirectX | DirectX 11 |
| Graphics Driver | AMD Catalyst 13.30 RC2 |
| | Intel 15.33.8.64.3345 |

| Benchmark Configuration | |
|---|---|
| Gaming | |
| BioShock Infinite | Medium Quality Settings, 1920x1080, Built-in Benchmark Sequence, 75-Second playback, Fraps |
| Grid 2 | Medium Quality Preset, 2x MSAA, V-sync off, 1920x1080, Built-In Benchmark, Fraps |
| The Elder Scrolls V: Skyrim | Medium Quality Preset, FXAA Disabled, 1920x1080, Custom Run-Through, 25-Second playback, Fraps |
| World of Warcraft: Mists of Pandaria | Good Quality Preset, DirectX 11, 1920x1080, Flight Point Recording, Fraps |
| Adobe Creative Suite | |
| Adobe After Effects CC | Version 12.0.0.404 x64: Create Video which includes three Streams, 210 Frames, Render Multiple Frames Simultaneously |
| Adobe Photoshop CC | Version 14.0 x64: Filter 15.7 MB TIF Image: Radial Blur, Shape Blur, Median, Polar Coordinates |
| Adobe Premiere Pro CC | Version 7.0.0, 6.61 GB MXF Project to H.264 to H.264 Blu-ray, Output 1920x1080, Maximum Quality |
| Audio/Video Encoding | |
| iTunes | Version 11.0.4.4 x64: Audio CD (Terminator II SE), 53 minutes, default AAC format |
| LAME MP3 | Version 3.98.3: Audio CD "Terminator II SE", 53 min, convert WAV to MP3 audio format, Command: -b 160 --nores (160 Kb/s) |
| HandBrake CLI | Version: 0.9.9: Video from Canon EOS 7D (1920x1080, 25 FPS) 1 Minutes 22 Seconds Audio: PCM-S16, 48,000 Hz, Two-Channel, to Video: AVC1 Audio: AAC (High Profile) |
| TotalCode Studio 2.5 | Version: 2.5.0.10677: MPEG-2 to H.264, MainConcept H.264/AVC Codec, 28 sec HDTV 1920x1080 (MPEG-2), Audio: MPEG-2 (44.1 kHz, 2 Channel, 16-Bit, 224 Kb/s), Codec: H.264 Pro, Mode: PAL 50i (25 FPS), Profile: H.264 BD HDMV |
| Productivity | |
| ABBYY FineReader | Version 11.0.102.583: Read PDF save to Doc, Source: Political Economy (J. Broadhurst 1842) 111 Pages |
| Adobe Acrobat XI | Version 11.0.0: Print PDF from 115 Page PowerPoint, 128-bit RC4 Encryption |
| Autodesk 3ds Max 2012 and 2013 | Version 14.0 x64: Space Flyby Mentalray, 248 Frames, 1440x1080 |
| Blender | Version: 2.68a, Cycles Engine, Syntax blender -b thg.blend -f 1, 1920x1080, 8x Anti-Aliasing, Render THG.blend frame 1 |
| Visual Studio 2010 | Version 10.0, Compile Google Chrome, Scripted |
| Cinebench | Cinebench R15.0 CPU Component |
| File Compression | |
| WinZip | Version 18.0 Pro: THG-Workload (1.3 GB) to ZIP, command line switches "-a -ez -p -r" |
| WinRAR | Version 5.0: THG-Workload (1.3 GB) to RAR, command line switches "winrar a -r -m3" |
| 7-Zip | Version 9.30 Alpha: THG-Workload (1.3 GB) to .7z, command line switches "a -t7z -r -m0=LZMA2 -mx=5" |
| Synthetic Benchmarks and Settings | |
| 3DMark | Version: 1.2.250, Cloud Gate |
| PCMark 8 | Version: 2.0, Home (OpenCL-Accelerated), Creative (OpenCL-Accelerated), Work |

On-die graphics engines occupy more die area than ever; Kaveri’s Radeon R7-class GPU gets 47% of the SoC. And yet it seems that 1920x1080 remains just out of reach.
Yes, A10-7850K is 11% faster than A10-6800K. But at less than 30 FPS using the Medium detail setting, your choices are either to dial back graphics quality even more or play at a lower resolution.
I imagine AMD is more excited about the A8-7600’s performance. Configured as a 65 W device, it’s a hair quicker than the A10-6800K, and even with a 45 W TDP it comes within 1 FPS of the 100 W Richland-based flagship. AMD is shooting for a roughly $120 price tag—about $20 cheaper than what the -6800K is selling for right now.
It’s unfortunate that even the -7850K’s average frame rates leave us wanting more. In a world where current-gen gaming consoles yield good frame rates at impressive detail levels, it’s not enough to say “turn BioShock’s settings as low as they can go” or “just step back to 720p”. Getting the gaming performance we’d recommend still requires purchasing discrete graphics in this case.

Grid 2 is typically more platform-bound than BioShock, benefiting greatly from fast system memory. In this title, we can use the game’s Medium detail preset at 1920x1080 and race around fairly smoothly.
AMD will want to flaunt this chart. Not only do we get playable performance from three different Kaveri-based configurations, but the 45 W A8-7600 decimates the Richland-based A8-6500T and both of Intel’s HD Graphics 4600-powered CPUs.

We tend to think of Skyrim as a graphics pushover. But at 1920x1080, it keeps everything we’re testing under an average of 40 FPS.
First things first—Intel’s HD Graphics 4600 is wholly incapable on its own. The company really needs to enable its GT3/GT3e configuration on the desktop at a reasonable price.
From there, we see the 45 W A8-7600 smoke the 45 W -6500T, hammering home AMD’s emphasis on GPU performance at lower power ceilings. The same is true to a lesser degree as the 65 W -7600 configuration outperforms the 100 W A10-6800K in our Skyrim benchmark. Finally, the -7850K is around 19% quicker than the fastest Richland-based APU—still pretty darned good.

World of Warcraft also gets derided as mainstream fodder, though it drags down our integrated graphics engines as well (even at the mid-range Good quality preset).
Intel’s more capable x86 cores allow the Core i5 and i3 to make up some ground against AMD’s APUs, but not enough for playable performance at 1920x1080.
The A8-7600 manually set to 45 W does get us pretty close though, simultaneously embarrassing the 45 W A8-6500T. Stepping up through the APU line-up yields small gains, culminating in the 95 W -7850K’s victory over A10-6800K by just 7%, likely due to the loss of clock rate on the host processing side. Even so, Kaveri is quite clearly a win performance-wise. Is it worth an extra $30+, plus a new motherboard? Probably not. Could we see some really cool mobile gaming platforms at more affordable price points than anything Intel has? It’d certainly seem so.
If you're not already familiar with AMD's Dual Graphics technology, you can think of it as a form of CrossFire that pairs an APU and discrete GPU to extract additional performance. The company started touting Dual Graphics back in the Llano days, so we published an in-depth examination of its inner workings back in August of last year (AMD Dual Graphics Analysis: Better Benchmarks; Same Experience?).
Back then, we observed that Fraps suggested that the add-in GPU was adding a lot of extra speed. However, our video-based FCAT testing showed a lot of those new frames were getting dropped and chopped off, ultimately yielding an experience no better than integrated graphics operating on its own. To demonstrate this phenomenon to our readers, we captured lossless video from the graphics card and uploaded it to YouTube with instructions on how to watch at 60 Hz. If you want more information, and haven't already seen the story we wrote, you owe it to yourself to check it out.
It took a few months, but AMD says it addressed the problem in its beta Catalyst 13.35 driver. We didn't have much time to prepare for this piece, but we did manage to test with Tomb Raider, BioShock Infinite, and The Elder Scrolls V: Skyrim. Previously, all three games fared well in Fraps and fell on their faces during our FCAT analysis.


We begin with Tomb Raider. These results were recorded using FCAT, so dropped and runt frames are excluded; the results can be believed because we're pulling the output straight from the DVI port.
There's an almost-100% boost with Dual Graphics enabled. But before we get too excited, let's look at frame time variance:


Latencies appear low, and lower is better. But to make sure our quantitative data corresponds to what we see, let's revisit the video like we did in our previous Dual Graphics analysis.
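For readers who want to run this kind of analysis on their own captures, the sketch below shows one plausible way to reduce a list of per-frame display times to the two numbers discussed here. It is not the FCAT tooling itself: the sample frame times are made up, the 2 ms runt threshold is an arbitrary illustration, and treating variance as the change between consecutive frame times is our simplification.

```python
import statistics


def analyze(frame_times_ms, runt_threshold_ms=2.0):
    """Summarize a capture, discounting frames too short to matter on screen.

    runt_threshold_ms is a hypothetical cutoff chosen for illustration.
    """
    elapsed_s = sum(frame_times_ms) / 1000.0
    kept = [t for t in frame_times_ms if t >= runt_threshold_ms]

    # Runts still consume wall-clock time, so only the frame count changes.
    fps_raw = len(frame_times_ms) / elapsed_s
    fps_effective = len(kept) / elapsed_s

    # Frame-to-frame change in display time; lower means smoother pacing.
    deltas = [abs(b - a) for a, b in zip(kept, kept[1:])]
    return {
        "fps_raw": fps_raw,
        "fps_effective": fps_effective,
        "mean_delta_ms": statistics.mean(deltas),
        "worst_delta_ms": max(deltas),
    }


# Made-up capture: two 1 ms runts hide between normal frames.
sample = [16.0, 1.0, 31.0, 16.0, 1.0, 31.0, 16.0, 16.0]
result = analyze(sample)
for key, value in result.items():
    print(f"{key}: {value:.2f}")
```

In this invented capture, a frame counter that includes every frame reports 62.5 FPS, while discounting the two runts drops the figure to about 46.9 FPS. That gap is the effect video-based capture is designed to expose.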
Our video jibes with the data we collected; Dual Graphics appears to work much better in Tomb Raider than it did on AMD's previous-gen platform.
Next up is BioShock Infinite, which has also been problematic in the past.



Frame rates clearly jump in BioShock thanks to Dual Graphics, though we do observe more of that nasty frame time variance.
The last title we had time for was The Elder Scrolls V: Skyrim.




FCAT tells us that Skyrim's average frame rate is up with Dual Graphics enabled. However, the game's minimum performance level doesn't increase. A look at frame rate over time reveals a couple of valleys, one of which corresponds to a spike in the frame time variance chart. Worse, the Dual Graphics comparison video doesn't look any smoother than the A8-7600 or Radeon R7 240 on their own. If anything, it even looks choppier at times.
Skyrim is known for its frame time variance issues, and we've seen other dual-GPU configurations behave strangely in the game. Although we plan to write a more thorough follow-up to our first excursion with Dual Graphics, we're at least glad that frame pacing appears enabled in the company's driver. As for the value analysis, we have to save that for next time.

Sorted by 3DMark score, Intel’s Core i5-4670K takes first place. But it appears that’s only the result of a winning Physics sub-test result. Intel’s four physical cores outperforming AMD’s two Steamroller modules really should come as no surprise. More significant is the Graphics number, which puts four AMD APU configurations ahead of Intel's HD Graphics 4600 engine.

The latest version of PCMark is quite a bit different from previous Futuremark benchmarks. It’s broken down into three separate suites, including Home, Creative, and Work. Each has a collection of workloads (for example, the Home test emphasizes Web browsing, writing, casual gaming, photo editing, and video chatting). Moreover, the Home and Creative benchmarks can be run with or without OpenCL acceleration. Oddly, our A8-6500T couldn’t get through either without crashing, so its lower numbers had to be run using the Conventional setting.
Likely as a result of its strong graphics engine, the A10-7850K secures wins in the Home, Creative, and Work tests (though we wouldn’t expect a CPU-only Work run to favor AMD).
The A8-7600 in its 65 W configuration fares pretty well against AMD’s 100 W A10-6800K. Given a slightly lower price tag, that’d likely become an attractive option for comparable performance in a lower-power machine.
We’ll refrain from drawing sweeping conclusions about AMD’s showing against Intel in a synthetic benchmark. However, the Kaveri-based APUs easily slip past Intel’s $140 Core i3-4330. In most cases, they also do really well versus the pricier Core i5.


Based on our earlier exploration of per-cycle performance, we know that AMD’s Steamroller architecture yields a nice speed-up in 3ds Max. And even though A10-6800K enjoys higher frequencies, the A10-7850K and A8-7600 (65 W) still turn in slightly better results.
The top Richland-based APU trades blows with Intel’s Core i3-4330, incidentally at the same price. AMD’s A8-7600 isn’t available yet, but once it is, we expect it’ll impress at around $120. We’d have a much harder time arguing in favor of the A10-7850K for $50 more when it’s only marginally faster.

Both 3ds Max and Blender might be considered heavy lifting for low-cost PCs. But that doesn’t stop Intel’s dual-core Core i3 from performing surprisingly well in our render project, besting AMD’s dual-module Kaveri APUs. The A10-7850K narrowly beats AMD’s A10-6800K, while the A8-7600 lands just behind Richland’s quickest incarnation.
Benchmarks make it pretty clear that it’s worth stepping up to Core i5 (or better) if you’re into more demanding content creation apps…at least for now. After all, this would seem to be the type of workload AMD is talking about when it extols the virtues of HSA.

A pure CPU test, Maxon’s Cinebench isolates the differences in host processing performance. Steamroller, operating at base and peak clock rates limited by its manufacturing process, turns out to be slower than A10-6800K.
No surprise—the Haswell-based chips are faster than Piledriver and Steamroller in a single-threaded benchmark. It’s less expected to see Core i3 in front of the APUs, though.

The same thing happens in Sony’s Vegas Pro 12 as Intel’s CPUs take first and second place, while A10-6800K slides past the Kaveri-based APUs. This is an OpenCL-accelerated workload, so it’s strange that AMD’s latest doesn’t turn in stronger numbers. But when we turn off OpenCL and run the same benchmark, completion time nearly doubles. So hardware acceleration is definitely helping, just not as much as we would have thought given AMD's more modern graphics architecture.

As we start testing the applications in Adobe’s Creative Cloud suite, a pattern emerges suggesting that two Hyper-Threaded Haswell cores can keep up with Kaveri’s Steamroller modules.
Our Premiere Pro workload encodes a project to .mp4, and doesn’t benefit from OpenCL acceleration. Therefore, we’re left to the mercy of Steamroller, which nudges A10-7850K in front of last generation’s -6800K. Core i3-4330 is just a few seconds faster than AMD’s new flagship though.

The After Effects rendering project we run is also CPU-limited, pegging each of our processors at 100% utilization. But AMD pulls out a win. It’s not clear how the A10-7850K manages to trump Intel’s Core i5-4670K, but it does, if only by one second. Really, only the dual-core Core i3 and 45 W A8-6500T suffer in this test.

You probably already know that we script two different Photoshop tests, each with its own set of filters. The CPU-oriented metric uses threaded routines to tax as many cores as we can expose, while the OpenCL-accelerated benchmark taps graphics resources to speed up an entirely separate workload.
Sorting according to the CPU results, we see AMD’s A10-7850K just ahead of the -6800K. Yes, Intel’s Core i5-4670K is quite a bit quicker, but you’ll pay an extra $70 or so for the privilege of owning it. A much closer contender is the $140 Core i3, which uses two Hyper-Threaded cores to pull within 12 seconds of AMD’s -7850K.
But the Intel chip does quite a bit better in our OpenCL-accelerated metric. In fact, we also see the Richland-based A10-6800K finish up ahead of any Kaveri-based APU.

ABBYY’s OCR software is very well-threaded, so Intel’s quad-core -4670K tears through it with aplomb. Improvements made to the Steamroller architecture kick into gear and facilitate a second-place finish for the A10-7850K. The next three processors land quite close, while the A8-6500T is way in the back.
Thus far, a majority of our tests have shown the A10-7850K to be fairly comparable to the -6800K. Meanwhile, the A8-7600 at its 65 W setting is remarkably strong against the higher-power parts. Curious as to how the lower-power part would fare at 45 W, I ran it through FineReader and came back with a result of 188 seconds. That’s 61% of the time it takes AMD’s A8-6500T—another 45 W part you should be able to find for about $112.

Our Google Chrome compile job in Visual Studio is quite demanding. Even the lowly Core i3 smokes through it ahead of AMD’s dual-module Kaveri-based APUs, though.
Knowing that this test fully taxes each SoC’s CPU complex, we would have guessed that the Steamroller-powered Kaveri would beat Richland. And we’d be right. The difference is subtle, but A10-7850K is a bit faster than -6800K, which is in turn slightly quicker than the 65 W A8-7600.

Printing a PowerPoint file to PDF happens in one thread, and the results of our benchmark are right in line with what previous metrics tell us to expect. Haswell dominates, while Steamroller and Piledriver achieve parity, with efficiency and clock rate balancing each other out. Meanwhile, the 65 W A8-7600 gives up a tiny bit of performance in the interest of lower power.

Our WinZip 18 benchmark has three components: a pure CPU test, a maximum compression (EZ) run, and an OpenCL-accelerated pass.
The fact that A10-6800K finishes the CPU-specific metric (in red) ahead of -7850K suggests that we aren’t getting maximum utilization. In any case, the Core i3 is both cheaper and faster than AMD’s fastest Kaveri-based APU.
Creating a more intense load with the EZ setting doesn’t change the finishing order—it only makes each run take longer.
Switching on OpenCL does help, though. In this case, the A10-6800K jumps in ahead of Intel’s Core i3. The A10-7850K roughly ties the Intel CPU (albeit for at least $30 more). It also bears mention that the Core i3 is a 54 W part, while AMD is hitting 95 and 65 W thermal ceilings with the two Kaveri APUs in our chart.

The latest version of WinRAR is fairly effective at utilizing parallel computing resources; however, the Kaveri APUs trail a bit behind Richland. The Haswell-based Intel processors are significantly faster.
Shifting down to 45 W on the A8-7600 yields a completion time of 111 seconds. As enthusiasts, it’s most natural for us to look at the highest-end parts for inspiration. But in the case of Kaveri, AMD says it focused its attention in the 35 to 45 W band. It shows, too.

7-Zip is typically considered one of the best-threaded compression workloads in our benchmark suite, so it’s a bummer to see Kaveri receiving little validation at the 95 W level. Fortunately, the 65 W A8-7600 gives up fairly little performance for a big reduction in peak thermal dissipation. It’ll be interesting to see how this translates over to our efficiency calculations across the entire benchmark suite.

The benchmarks on this page employ workloads that we imagine are high on AMD’s list of tasks to accelerate through OpenCL, and ultimately to optimize for its HSA features. In fact, there’s already a beta version of HandBrake with OpenCL-based optimizations that offload cropping and down-scaling to the GPU.
At least in TotalCode Studio, however, encoding happens on x86 cores. This application leverages Rovi’s popular MainConcept codecs, which run well on Intel’s Core i5-4670K. The dual-core Core i3 and dual-module A10 and A8 APUs all turn in very similar results. Only the 45 W A8-6500T is completely blown away.
Switch over to the -7600 with a 45 W ceiling and you can take that 144-second finish time down to 98 seconds. We've already pointed this out several times, but AMD says it optimized Kaveri for that 45 W ceiling. In this case, those improvements cut 31% from the test's completion time.
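As a quick sanity check on that comparison, the reduction can be worked out directly from the two finish times quoted above. This is only a minimal sketch: the function name `percent_reduction` is ours, and the only inputs are the 144- and 98-second results from the text. The result lands just under 32%, in line with the figure quoted.

```python
def percent_reduction(baseline_s: float, improved_s: float) -> float:
    """Return how much shorter improved_s is than baseline_s, in percent."""
    if baseline_s <= 0:
        raise ValueError("baseline must be positive")
    return (baseline_s - improved_s) / baseline_s * 100.0


# Finish times quoted in the text, in seconds.
a8_6500t_s = 144.0     # 45 W Richland-based A8-6500T
a8_7600_45w_s = 98.0   # A8-7600 with its TDP set to 45 W

reduction = percent_reduction(a8_6500t_s, a8_7600_45w_s)
print(f"{reduction:.1f}% less time")  # (144 - 98) / 144
```

The same helper applies to any of the other 45 W comparisons in this piece, provided both finish times are known.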

Although we’re not using the OpenCL-accelerated beta of HandBrake, the stable version in our suite does explicitly leverage Kaveri’s support for FMA3/4, LZCNT, and BMI1. Then again, so does Richland’s Piledriver architecture.
Either way, A10-7850K manages a win against -6800K (for that matter, A8-7600 does too).
Dialed down to 45 W, the -7600 finishes in 213 seconds. Compared to the other 45 W part in our chart, AMD’s A8-6500T, that’s a phenomenal improvement. It’s just not particularly sexy in a desktop environment.

Our LAME audio conversion test is single-threaded. It’ll allow each of these CPUs to spin up to its maximum Turbo Boost or Turbo Core frequency (unlike the per-cycle comparison we ran earlier, which sought to compare architectural efficiency at a fixed 4 GHz).
Intel’s Haswell design maintains its advantage. Richland, as it appears on the A10-6800K, hits higher clock rates and therefore is faster than Kaveri.
Again, curious as to how the 45 W version of A8-7600 would size up with its peak clock rate constrained to 3.3 GHz, I adjusted down the configurable TDP in ASRock’s firmware. The outcome was a finish time of 139 seconds—an impressive improvement over the A8-6500T standing in as our 45 W Richland-based APU.

The same story applies to iTunes, which is also single-threaded.
I’d argue that this stuff is more fun than the actual performance benchmarks. We get to see how quickly each processor cruises through our suite, we measure instantaneous power consumption along the way, and then we're able to calculate what that data means to overall efficiency in the tests we’re running.

The line chart is admittedly messy, but it shows you that there’s actually something being logged as our benchmarks run. Each platform has 30 minutes of idle time inserted at the end of the suite, where we record power on the Windows desktop, just to ensure we’re not exclusively reporting results under load.
The outlier is AMD’s 45 W A8-6500T, which maintains remarkably low power draw through our workloads, but takes a long time to finish up. It’s a bold strategy, Cotton. Let’s see if it pays off in the efficiency chart.
Hidden under all of those lines, the A10-7850K (red) and A8-7600 (black) track very closely throughout the scripted sequence, even though one is a 95 W part and the other has a 65 W TDP. I went back and re-ran the -7600 using Disabled, 65 W, and 45 W options in ASRock’s Configurable TDP firmware setting, verifying that this feature works as it should.

Averaging power consumption confirms that the Kaveri-based parts end up really, really close to each other. Given that the A10-6800K’s average use is almost 20 W higher, the -7850K doesn’t appear to exploit all of its power budget to max out performance.
Both the Core i5-4670K and Core i3-4330 average lower power consumption through the Tom’s Hardware benchmark suite. Now we want to know how long each processor takes to finish the job.

Those same Intel CPUs end up being the first- and second-fastest through our benchmarks.
We can’t script the game tests, so 3D performance isn’t a component in these charts. You’re seeing the result of content creation, media encoding, productivity, and compression workloads. Any time you fold in gaming and compare AMD's Radeon R7 engine to HD Graphics 4600, the Kaveri design wins consistently. Whether or not an APU can deliver the frame rate, detail setting, and resolution you want to use is another matter entirely. Still, context is important here because Steamroller just doesn’t do much for the high end of Kaveri as it exists today. There’s a lot more to like at lower power levels when you compare inside of AMD’s portfolio. But once you add low-power Core i3s to the mix, again, Intel comes out on top.

Multiply power consumption across the duration of our test suite and this is what you get. The flagship A10-7850K is notably more efficient than the Richland-based A10-6800K, partly because it’s a little faster, but mostly as a result of lower power use. That doesn’t stop the $140 Core i3 from burning close to 50% of the A10’s energy, though.
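For readers who want to reproduce this kind of efficiency figure from their own power logs, here is a minimal sketch of the arithmetic. It assumes a log of timestamped wattage samples; the function names and the sample values are hypothetical placeholders, not our measured data. Energy is the integral of power over time, which for a log reduces to average power multiplied by duration.

```python
from typing import Sequence


def energy_wh(times_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate logged power over time (trapezoidal rule); return watt-hours."""
    if len(times_s) != len(power_w) or len(times_s) < 2:
        raise ValueError("need at least two matching samples")
    joules = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        joules += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return joules / 3600.0  # 1 Wh = 3600 J


def average_power_w(times_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Time-weighted average power across the whole log, in watts."""
    duration_h = (times_s[-1] - times_s[0]) / 3600.0
    return energy_wh(times_s, power_w) / duration_h


# Hypothetical one-hour log: three samples, 30 minutes apart.
times = [0.0, 1800.0, 3600.0]
watts = [60.0, 80.0, 40.0]

print(f"energy: {energy_wh(times, watts):.1f} Wh")
print(f"average: {average_power_w(times, watts):.1f} W")
```

A chip that draws less power but takes much longer to finish can still end up using more energy overall, which is why both the duration and the average draw matter here.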
AMD deserves praise for championing heterogeneous computing, and doing it in a way that opens up the benefits to multiple market segments, hardware vendors, developers, and ultimately a larger number of end users. Even as Intel increases the footprint of the graphics engine on its own processors and shores up driver support for compute, AMD gets most of the credit for pushing forward with OpenCL evangelization.

The real purpose of Kaveri is encapsulated in this diagram. Leveraging the right resources for any given workload can have a profound impact on performance and power—it just takes developers optimizing for the potential available in today’s GPUs.
Unfortunately, much of our benchmark suite (just like today’s software landscape) isn’t put together with those resources in mind. We started folding in OpenCL-capable tests a long time ago, and we currently have a number of metrics able to expose the benefits of GPU acceleration. Even those tasks didn’t shine brightly on Kaveri today, though. Despite the lack of innovation we’ve seen from Intel on the PC desktop, its Haswell architecture cuts through most tasks faster and at lower power than what AMD is showing.
Of course, the notable exception is in the gaming space. Intel does have an ace in its Iris Pro graphics, but charges far too much for them and is very limited in exposing the GT3/GT3e implementation of its GPU. That leaves the door open for AMD to kick HD Graphics 4600 around, which it does. Even the 45 W version of A8-7600 has no trouble trouncing Intel's 84 W Core i5-4670K. The difference between them is made more significant when one pumps out playable frame rates and the other cannot.
AMD was pretty explicit that it designed Kaveri to fit in thermal envelopes from 15 to 95 W, but that it optimized for the middle of that range. Today’s numbers confirm the company’s assessment.

A10-7850K, the 95 W flagship, is a capable little gaming processor. It should be able to handle most games at 1920x1080 using decent detail settings. Thanks to a 512-shader GCN-based graphics engine, it’s faster than A10-6800K. But when you switch over to the other desktop-oriented workloads in our test suite, it merely trades blows with Richland’s top-end model. And you should expect to pay an extra $30+ for it (AMD says the -7850K will sell for $173, though it hasn’t shown up for sale on Newegg yet). You’ll need a Socket FM2+-equipped motherboard, too.
Personally, I’d go for a $75 Athlon X4 740 and a Radeon HD 7750 for close to the same amount of money…at least until HSA-optimized software gives us a reason to favor Kaveri-based APUs.
There’s a lot more to get excited about at the 65 and 45 W power levels. AMD’s bias to the GPU is evident in games where the A8-7600, constrained to a 45 W TDP, destroys the 84 W Core i5 with HD Graphics 4600 and its own 45 W A8-6500T. Kaveri doesn’t have to lean on games, either. Its application performance is quite a bit better than the similarly-priced Richland-based APU we used for comparison. The caveat is that Intel does the low-power, high-performance song and dance more convincingly.
With its emphasis shifting away from big desktop CPUs, and even the 100 W APUs it used to herald at the top of its stack, AMD appears most focused on 65 W and lower. That’s probably not what performance-hungry enthusiasts want to hear, but with significant real-world gains in graphics and general-purpose computing, Kaveri looks to be a stronger mainstream desktop and mobile contender than Richland.
The story of this APU isn’t opened and closed today. AMD is championing the Heterogeneous System Architecture, and Kaveri represents the pinnacle of its work so far. Although we have a better idea of what HSA might enable, there is still a lot of work to be done on the software side. And with OpenCL 2.0 recently ratified, we’ll only start scratching the surface in 2014.
And so we step away from Kaveri marveling at the SoC’s potential and eager for developers to start pushing out optimized apps, but lacking a real reason to make a purchase today. Fortunately, developers don’t have to program for this platform specifically. AMD just needs ISVs to adopt OpenCL 2.0. From there, the merits of HSA should become more apparent. We look forward to watching the saga unfold.

