Vega Architecture & HBM2
Although we've detailed Vega's architecture before, it's worthwhile to provide it here again as a refresher. Vega represents a new GPU generation for AMD, with a reported 200+ changes and improvements separating it from the GCCN implementation that came before.
HBM2: A Scalable Memory Architecture
Both AMD and Nvidia are working on ways to reduce host processor overhead, maximize throughput to feed the GPU, and circumvent existing bottlenecks—particularly those that surface in the face of large datasets. Getting more capacity closer to the GPU in a fairly cost-effective manner seemed to be the Radeon Pro SSG’s purpose. And Vega appears to take this mission a step further with a more flexible memory hierarchy.
Vega makes use of HBM2, of course, which AMD officially introduced more than six months ago. At the time, we also discovered that the company calls this pool of on-package memory—previously the frame buffer—a high-bandwidth cache. HBM2 equals high-bandwidth cache in AMD parlance. Got it?
According to Joe Macri, corporate fellow and product CTO, the vision for HBM was to have it be the highest-performance memory closest to the GPU. However, he also wanted system memory and storage available to the graphics processor, as well. In the context of this broader memory hierarchy, sure, it’s logical to envision HBM2 as a high-bandwidth cache relative to slower technologies. But for the sake of disambiguation, we’re going to continue calling HBM2 what it is.
After all, HBM2 represents a significant step forward already. An up-to-8x capacity increase per vertical stack, compared to first-gen HBM, addresses questions enthusiasts raised about Radeon R9 Fury X’s longevity. Further, a doubling of bandwidth per pin significantly increases potential throughput.
That’s the change we expect to have the largest impact on gamers as far as Vega's memory subsystem goes. However, AMD also gives the high-bandwidth cache controller (no longer just the memory controller) access to a massive 512TB virtual address space for large datasets.
When asked about how the Vega architecture's broader memory hierarchy might be utilized, AMD suggested that Vega can move memory pages in fine-grained fashion using multiple, programmable techniques. It can receive a request to bring in data and then retrieve it through a DMA transfer while the GPU switches to another thread and continues work without stalling. The controller can go get data on demand but also bring it back in predictively. Information in the HBM can be replicated in system memory like an inclusive cache, or the HBCC can maintain just one copy to save space. All of this is managed in hardware, so it should be quick and low-overhead.
As it pertains to Radeon RX Vega 64, AMD exposes an option in its driver called HBCC Memory Segment to allocate system memory to Vega's cache controller. The corresponding slider determines how much memory gets set aside. Per AMD, once the HBCC is operating, it'll monitor the utilization of bits in local GPU memory and, if needed, move unused information to the slower system memory space, effectively increasing the capacity available to the GPU. Given Vega 64's 8GB of HBM2, this option is fairly forward-looking; there aren't many games that need more. However, AMD has shown off content creation workloads that truly need access to additional memory. Of course, the one recommendation is to have plenty of system RAM available before using the HBCC. If you have 16GB installed, you don't want to lose 4GB or 8GB to the HBCC.
AMD suggested that Unigine Heaven might reflect some influence from enabling the HBCC memory segment, so we ran the benchmark at 4K using 8x anti-aliasing and Ultra quality. With the HBCC disabled, we measured 25.7 FPS. Allocating an extra 4GB of DDR4 at 3200 MT/s nudged that result up to 26.9 FPS.
New Programmable Geometry Pipeline
The Hawaii GPU (Radeon R9 290X) incorporated some notable improvements over Tahiti (Radeon HD 7970), one of which was a more robust front end with four geometry engines instead of two. The more recent Fiji GPU (Radeon R9 Fury X) maintained that same four-way Shader Engine configuration. However, because it also rolled in goodness from AMD’s third-gen GCN architecture, there were some gains in tessellation throughput, as well. More recently, the Ellesmere GPU (Radeon RX 480/580) implemented a handful of techniques for getting more from a four-engine arrangement, including a filtering algorithm/primitive discard accelerator.
AMD promised us last year that Vega’s peak geometry throughput would be 11 polygons per clock, up from the preceding generations' four, yielding up to a 2.75x boost. That specification came from adding a new primitive shader stage to the geometry pipeline. Instead of using the fixed-function hardware, this primitive shader uses the shader array for its work.
More recently, AMD published a Vega architecture whitepaper that revised the geometry pipeline's peak rate to more than 17 primitives per clock.
AMD describes this as having similar access as a compute shader for processing geometry in that it’s lightweight and programmable, with the ability to discard primitives at a high rate. The primitive shader’s functionality includes a lot of what the DirectX vertex, hull, domain, and geometry shader stages can do but is more flexible about the context it carries and the order in which work is completed.
The front-end also benefits from an improved workgroup distributor, responsible for load balancing across programmable hardware. AMD says this comes from its collaboration with efficiency-sensitive console developers, and that effort will now benefit PC gamers, as well.
Vega’s Next-Generation Compute Unit (NCU)
Using its many Pascal-based GPUs, Nvidia is surgical about segmentation. The largest and most expensive GP100 processor offers a peak FP32 rate of 10.6 TFLOPS (if you use the peak GPU Boost frequency). A 1:2 ratio of FP64 cores yields a double-precision rate of 5.3 TFLOPS, and support for half-precision compute/storage enables up to 21.2 TFLOPS. The more consumer-oriented GP102 and GP104 processors naturally offer full-performance FP32 but deliberately handicap FP64 and FP16 rates so you can’t get away with using cheaper cards for scientific or training datasets.
Conversely, AMD looks like it’s trying to give more to everyone. The Compute Unit building block, with 64 IEEE 754-2008-compliant shaders, persists, only now it’s being called an NCU, or Next-Generation Compute Unit, reflecting support for new data types. Of course, with 64 shaders and a peak of two floating-point operations/cycle, you end up with a maximum of 128 32-bit ops per clock. Using packed FP16 math, that number turns into 256 16-bit ops per clock. AMD even claimed it can do up to 512 eight-bit ops per clock. Of course, leveraging this functionality requires developer support, so it’s not going to manifest as a clear benefit at launch.
Double-precision is a different animal—AMD doesn’t seem to have a problem admitting it sets FP64 rates based on target market, and we confirmed that Vega 10’s FP64 rate is 1:16 of its single-precision specification. This is another gaming-specific architecture; it’s not going to live in the HPC space at all.
The impetus for Vega 10's flexibility may have very well come from the console world. After all, we know Sony’s PlayStation 4 Pro can use half-precision to achieve up to 8.4 TFLOPS—twice its performance using 32-bit operations. Or perhaps it started with AMD’s aspirations in the machine learning space, resulting in products like the upcoming Radeon Instinct MI25 that aim to chip away at Nvidia’s market share. Either way, consoles, datacenters, and PC gamers alike stand to benefit.
AMD claimed the NCUs are optimized for higher clock rates, which isn’t particularly surprising, but it also implemented larger instruction buffers to keep the compute units busy.
Next-Generation Pixel Engine: Waiting For A Miracle
Let’s next take a look at AMD’s so-called Draw Stream Binning Rasterizer, which is supposed to be a supplement to the traditional ROP, and as such, should help improve performance while simultaneously lowering power consumption.
At a high level, an on-chip bin cache allows the rasterizer to fetch data only once for overlapping primitives, and then shade pixels only once by culling pixels not visible in the final scene.
AMD fundamentally changes its cache hierarchy by making the render back-ends clients of the L2.
In architectures before Vega, AMD had non-coherent pixel and texture memory access, meaning there was no shared point for each pipeline stage to synchronize. In the example of texture baking, where a scene is rendered to a texture for reuse later and then accessed again through the shader array, data has to be pulled all the way back through off-die memory. Now, the architecture has coherent access, which AMD said particularly boosts performance in applications that use deferred shading.
Spoiler alert: AMD's launch driver doesn't unleash the massive performance improvements we were hoping for after reviewing the Frontier Edition board. We'd remind you, though, that Fiji and Hawaii, just like good wine (especially good red wine), took some time to achieve their full potential.
MORE: Best Graphics Cards
MORE: All Graphics Content