AMD is known for provoking its fans into frothy feeding frenzies ahead of important launches. After slowly teasing out details of its Radeon R9 Fury X flagship graphics card, it’s time for us to determine if the hype was warranted. Can a more complex GPU, groundbreaking memory technology and closed-loop liquid cooler generate enough performance to usurp Nvidia’s decidedly efficient Maxwell architecture in the GeForce GTX 980 Ti?
AMD’s last piece of ultra-high-end hardware surfaced more than a year ago. The Radeon R9 295X2 was a crowning achievement for the company. It showed that two Hawaii GPUs fit on one graphics card and, unlike the Radeon HD 7990 and 6990 before it, could be cooled relatively quietly.
The secret to AMD’s success was a closed-loop liquid cooler. A large radiator and 120mm fan effectively exhausted waste heat right out the back of your chassis. The combination didn’t make much noise, and yet it enabled enough thermal capacity for AMD to overclock its big processors beyond the reference Radeon R9 290X.
Best of all, it surfaced for $1500—half the price Nvidia was asking for its ill-fated GeForce GTX Titan Z, a card that ate up three expansion slots and still needed significant detuning from its Titan pedigree to behave under air cooling.
We certainly admired what AMD accomplished in its Radeon R9 295X2. But over time, and in the face of increasingly fast single-GPU boards from Nvidia, the 295X2 became a reminder of the company’s tendency to compete through brute force rather than efficiency. Meanwhile, prices on the dual-GPU board dipped as low as $600, a steal for anyone willing to cope with its physical size and the sometimes-frustrating state of CrossFire support.
A Blast From The Past
The Radeon R9 Fury X is born of the same DNA, with Graphics Core Next in its heart and liquid drawing thermal energy away from the massive Fiji GPU. It’s a single-processor board, though, so it doesn’t need to reside on an especially long PCB. What’s more, Fury X is AMD’s first graphics card featuring HBM, putting 4GB of stacked dies on a silicon interposer, right next to Fiji, further condensing the necessary dimensions.
What we’ve been promised, of course, is that the new GPU’s bigger pool of processing resources plus an unprecedented amount of memory bandwidth come together on a graphics card able to beat Nvidia’s GeForce GTX 980 Ti (and at a similar $650 price point, too).
AMD isn’t leaving this card’s fate to chance. The company’s marketing machine is invoking a brand from the past, one that predates Tom’s Hardware, even. Rage was what ATI called its first 3D accelerators in 1995, back before PCI Express or even AGP. And yes, I owned an original 3D Rage graphics card. The Rage Pro, Rage 128 Pro and Rage Fury MAXX also found their way into my various PCs. This is AMD conjuring up some of the mojo that made ATI a company it wanted to acquire for more than $5 billion back in 2006.
Fiji Takes Shape
Is the Radeon R9 Fury X worthy of such a designation? It wields promising specifications, that’s for sure. We covered most of the vitals in a preview last week. But to recap, the card centers on AMD’s new Fiji GPU.
Both AMD and Nvidia knew that 28nm manufacturing would be a long-term affair. However, AMD acknowledges it was counting on process technology to evolve more quickly. The same almost certainly applies to Nvidia. As reality set in, the two companies adapted and took different paths in designing their newest GPUs. Whereas GM200 measures 601mm², Fiji is almost as large at 596mm². AMD crams a claimed 8.9 billion transistors into that space, and then mounts the chip on a 1011mm² silicon interposer, flanking it with four stacks of High Bandwidth Memory.
A quick glance at Fiji’s block diagram is suggestive of the Hawaii design launched back in 2013, if only because they’re both organized into four Shader Engines, each with its own geometry processor and rasterizer, plus four render back-ends capable of 16 pixels per clock cycle. AMD doesn’t touch any of that. But the company does replicate more of the Compute Units in each Shader Engine, deploying 16 rather than 11. With 64 shaders per CU, you end up with 1024 per Shader Engine and an aggregate of 4096 shaders across the entire GPU. Furthermore, AMD maintains four texture filter units per CU, yielding a total of 256 in Fiji versus Hawaii’s 176.
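The unit counts above follow from simple multiplication, and a minimal sketch makes the Fiji-versus-Hawaii comparison explicit. The function and constant names here are illustrative only; the inputs (four Shader Engines, 64 shaders and four texture units per CU, 16 versus 11 CUs per engine) come from the figures quoted in the text.

```python
# Unit counts derived from the per-engine figures quoted in the article.
# All names are hypothetical helpers, not part of any AMD tool or API.
SHADER_ENGINES = 4
SHADERS_PER_CU = 64
TEXTURE_UNITS_PER_CU = 4


def gpu_totals(cus_per_engine: int) -> dict:
    """Return aggregate CU, shader and texture unit counts for a GPU."""
    compute_units = SHADER_ENGINES * cus_per_engine
    return {
        "compute_units": compute_units,
        "shaders": compute_units * SHADERS_PER_CU,
        "texture_units": compute_units * TEXTURE_UNITS_PER_CU,
    }


fiji = gpu_totals(16)    # 16 CUs per Shader Engine
hawaii = gpu_totals(11)  # 11 CUs per Shader Engine

print("Fiji:  ", fiji)
print("Hawaii:", hawaii)
```

Because the front end and render back-ends are unchanged, only the per-engine CU count differs between the two calls.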
Clearly, Fiji’s theoretical shading, compute and texture filtering performance are way up. Without corresponding improvements to the geometry engines or ROP count, however, aren’t we staring down the barrel of another big bottleneck? It’s going to depend on the workload. Just remember that back when it introduced Hawaii, AMD made a concerted effort to augment geometry throughput with the four-way Shader Engine layout and boost pixel fill rate. Representatives went so far as to posit that memory bandwidth was the GPU’s limiter, despite a wide 512-bit bus. Today, the company says its analysis suggests standard eight-bit-per-channel raster ops rarely bottleneck performance. Sixteen-bit-per-channel ops can be more of a challenge; however, the combination of HBM and color compression allows Fiji to realize GCN’s ability to support full-rate 16bpc raster operations where previous GPUs were, in fact, bottlenecked. Does AMD wish it could have built a bigger engine? That seems to have been the plan. Facing a limit to the interposer’s size, though, AMD pulled right up to the ceiling for its GPU, yielding Fiji.
What you don’t see in the processor’s block diagram are the incremental improvements made to AMD’s Graphics Core Next architecture, some of which actually help alleviate the bottlenecks we’d worry about. Hawaii employed a second iteration of GCN, which was subsequently updated for Radeon R9 285’s Tonga GPU. Fiji inherits the benefits of what AMD calls its third-gen GCN design. One such advantage is updated geometry processors that improve tessellation performance. Lossless color compression for frame buffer reads and writes, new 16-bit integer/floating-point instructions and a doubling of L2 cache to 2MB are on the list too. Less relevant to Fiji’s 3D pipeline but no less welcome are a higher-quality display scaler and an updated video decode engine supporting accelerated HEVC playback.
On the compute side, Fiji incorporates improved task scheduling and some new data parallel processing instructions to go along with the eight Asynchronous Compute Engines carried over from Hawaii. Given this GPU’s 4096 shaders and 1050MHz maximum core frequency, AMD can claim an 8.6 TFLOPS single-precision compute rate. However, it limits FP64 to 1/16th of that, yielding a DP ceiling of 537.6 GFLOPS (less than Hawaii). After the handicapping of GM200, consider this another nod to the purpose-built nature of high-end gaming GPUs.
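For readers who want to check the compute figures, here is a rough sketch of the arithmetic. It assumes the usual convention of two floating-point operations per shader per clock (one fused multiply-add), which the article does not state outright; the function name is illustrative.

```python
def peak_gflops(shaders: int, clock_mhz: float, flops_per_clock: int = 2) -> float:
    """Theoretical peak rate in GFLOPS.

    flops_per_clock=2 is an assumption: one fused multiply-add
    counted as two operations per shader per clock.
    """
    return shaders * flops_per_clock * clock_mhz / 1000.0


FP64_RATIO = 1 / 16  # Fiji's double-precision rate relative to FP32, per the text

fp32_gflops = peak_gflops(shaders=4096, clock_mhz=1050)
fp64_gflops = fp32_gflops * FP64_RATIO

print(f"FP32 peak: {fp32_gflops / 1000:.1f} TFLOPS")
print(f"FP64 peak: {fp64_gflops:.1f} GFLOPS")
```

These are theoretical ceilings at the maximum core frequency, not measured throughput.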
The HBM Hookup
Where AMD makes up ground is its implementation of High Bandwidth Memory, which propels peak throughput from 320 GB/s on R9 290X to Fury X’s 512 GB/s. The low-level details are already fairly well-known, but HBM achieves its big bandwidth numbers by stacking DRAM vertically. Each die has a pair of 128-bit channels, so the four dies in a stack create an aggregate 1024-bit path.
This first generation of HBM runs at a relatively conservative 500MHz and transfers two bits per clock. GDDR5, in comparison, is currently up to 1750MHz at four bits per clock (call it quad-pumped, to borrow a term from the old Pentium 4 front-side bus days). That’s the difference between 1 Gb/s and 7 Gb/s. Ouch. But factor in the bus width and you have 128 GB/s per stack of HBM versus 28 GB/s from a 32-bit GDDR5 package. A card like GeForce GTX 980 Ti employs six 64-bit memory controllers. Multiply that all out and you get its 336 GB/s specification. Meanwhile, Radeon R9 Fury X employs four stacks of HBM, which is where we come up with 512 GB/s.
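The bandwidth comparison above can be reproduced with a short calculation. This is a sketch using only the clocks, transfer rates and bus widths quoted in the text; the function and variable names are illustrative.

```python
def bandwidth_gbs(bus_width_bits: int, clock_mhz: float, bits_per_clock: int) -> float:
    """Peak bandwidth in GB/s: per-pin rate (Gb/s) x bus width / 8 bits per byte."""
    per_pin_gbps = clock_mhz * bits_per_clock / 1000.0
    return per_pin_gbps * bus_width_bits / 8


# First-generation HBM: 500MHz, two bits per clock, 1024-bit interface per stack
hbm_stack = bandwidth_gbs(bus_width_bits=1024, clock_mhz=500, bits_per_clock=2)

# GDDR5: 1750MHz, four bits per clock, 32-bit interface per package
gddr5_package = bandwidth_gbs(bus_width_bits=32, clock_mhz=1750, bits_per_clock=4)

# Radeon R9 Fury X: four HBM stacks
fury_x = 4 * hbm_stack

# GeForce GTX 980 Ti: six 64-bit memory controllers of GDDR5
gtx_980_ti = bandwidth_gbs(bus_width_bits=6 * 64, clock_mhz=1750, bits_per_clock=4)

print(f"HBM stack:     {hbm_stack:.0f} GB/s")
print(f"GDDR5 package: {gddr5_package:.0f} GB/s")
print(f"Fury X:        {fury_x:.0f} GB/s")
print(f"GTX 980 Ti:    {gtx_980_ti:.0f} GB/s")
```

The per-pin rate favors GDDR5 by a factor of seven, but HBM’s interface is 32 times wider per device, which is why each stack still comes out well ahead.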
It’s not often you see such an instrumental specification jump by 60%, or sit more than 50% higher than the competition. There’s no doubt that HBM plays a big role in Fury X’s performance story, or that it would have been even more influential had Fiji been a bigger chip. But here’s where we’re thrown a bit of a curveball. You’re going to see in the performance results that Radeon R9 Fury X and GeForce GTX 980 Ti are closely matched today. We know that this is AMD’s first outing with Fiji and HBM, and it’s logical to assume the company’s driver team will extract some more performance from the combination. AMD even has specific release targets planned for when it expects significant speed-ups. We certainly can’t predicate a conclusion on guesses as to where Fury X will land. Still, it’s interesting that AMD sees unrealized potential.
There is some uncertainty about Fury X’s long-term prospects given its 4GB of HBM. Indeed, it’s easy to get spooked in the face of 6GB 980 Tis and 12GB Titan Xes. None of our benchmarks at 4K suggest that Radeon R9 Fury X will prove problematic with 4GB, though. We were able to set up a fairly contrived combination of settings in Grand Theft Auto V that blew past 4GB of memory use and dropped frame rates to single digits. But the game was barely playable by then anyway. AMD does find itself in a somewhat strange position, what with arming Radeon R9 390X and 390 with 8GB and all. Still, we just don’t think the flagship’s halved capacity is much of a handicap. At the resolutions and settings required to exceed 4GB, a single Fiji already finds itself out of its element. Moreover, AMD says there’s a lot more it could be doing to manage memory that wasn’t happening before. Now that the issue is more complicated than simply throwing down twice as much GDDR5, the company is motivated to take better care of the capacity available. This is receiving engineering attention now, naturally.