NVAPI: Measuring Graphics Memory Bandwidth Utilization
Experimenting With Nvidia's NVAPI
Nvidia, through its GeForce driver, exposes a programming interface ("NVAPI") that, among other things, allows for collecting performance measurements. For the technically inclined, here is the relevant section in the nvapi.h header file:
FUNCTION NAME: NvAPI_GPU_GetDynamicPstatesInfoEx
DESCRIPTION: This API retrieves the NV_GPU_DYNAMIC_PSTATES_INFO_EX structure for the specified physical GPU. Each domain's info is indexed in the array. For example:
- pDynamicPstatesInfo->utilization[NVAPI_GPU_UTILIZATION_DOMAIN_GPU] holds the info for the GPU domain.
There are currently four domains for which GPU utilization and dynamic P-state thresholds can be retrieved: graphic engine (GPU), frame buffer (FB), video engine (VID), and bus interface (BUS).
Beyond this header commentary, the API's specific functionality isn't documented. The information below is our best interpretation of its workings, though it relies on a lot of conjecture.
- The graphics engine ("GPU") metric is expected to be your bottleneck in most games. If you don't see this at or close to 100%, something else (like your CPU or memory subsystem) is limiting performance.
- The frame buffer ("FB") metric is interesting, if it works as intended. From the name, you'd expect it to measure graphics memory utilization (the percentage of memory used). That is not what this is, though. It appears, rather, to be the memory controller's utilization in percent. If that's correct, it would measure the bandwidth actually being used by the controller, a figure that isn't available any other way.
- We're not as interested in the video engine ("VID"); it's not generally used in gaming, and registers a flat 0% typically. You'd only see the dial move if you're encoding video through ShadowPlay or streaming to a Shield.
- The bus interface ("BUS") metric refers to utilization of the PCIe controller, again, as a percentage. The corresponding measurement, which you can trace in EVGA PrecisionX and MSI Afterburner, is called "GPU BUS Usage".
We asked Nvidia to shed some light on the inner workings of NVAPI. Its response confirmed that the FB metric measures graphics memory bandwidth usage, but Nvidia dismissed the BUS metric as "considered to be unreliable and thus not used internally".
We asked AMD if it had any API or function that allowed for similar measurements. After internal verification, company representatives confirmed that they did not. As much as we would like to, we are unable to conduct similar tests on AMD hardware.
Moving On To The Actual Tests...
Myth: Graphics memory bandwidth utilization and PCIe bus utilization are impossible to measure directly.
The amount of data moved between graphics memory and the graphics processor is massive. That's why graphics cards need such complex memory controllers capable of pushing tons of bandwidth. In the case of AMD's Radeon R9 290X, you're looking at up to 320GB/s. Nvidia's GeForce GTX 780 Ti is rated for up to 336GB/s. Maximum PCIe throughput isn’t as impressive (15.75GB/s through a 16-lane third-gen link), though it isn’t in as much demand. But how much of that is utilized at any given point in time? Is this a bottleneck? Until now, it has been hard to answer those questions. But the Kepler architecture and NVAPI make it possible to address them with more precision, we hope.
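For readers who want to check the PCIe figure, here is a minimal Python sketch of the arithmetic. The signaling rates and line encodings come from the PCI Express specifications; the function and table names are our own, chosen for illustration.

```python
# Per-lane signaling rate (GT/s) and line-encoding efficiency per PCIe generation.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),     # 8b/10b encoding
    2: (5.0, 8 / 10),     # 8b/10b encoding
    3: (8.0, 128 / 130),  # 128b/130b encoding
}


def pcie_bandwidth_gb_s(generation: int, lanes: int = 16) -> float:
    """Peak payload bandwidth of a PCIe link in GB/s, in one direction."""
    gigatransfers, efficiency = PCIE_GENERATIONS[generation]
    # Each transfer moves one bit per lane; divide by 8 to convert bits to bytes.
    return gigatransfers * efficiency * lanes / 8


print(f"PCIe 3.0 x16: {pcie_bandwidth_gb_s(3):.2f} GB/s")  # 15.75 GB/s
print(f"PCIe 2.0 x16: {pcie_bandwidth_gb_s(2):.2f} GB/s")  # 8.00 GB/s
```

These are theoretical peaks per direction; protocol overhead means real transfers land somewhat lower.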
We began our exploration looking at the BUS metric on a GeForce GTX 690. While Nvidia says it’s unreliable, we still wondered what we could glean from our test results. As we took readings, however, we faced another complication: the card's two GK104 GPUs are not linked directly to the main PCIe bus, but are rather switched through a PLX PEX 8747. So, no matter what setting the motherboard uses, the GPUs always operate at PCI Express 3.0 signaling rates, except when they're power-saving. That's why GPU-Z shows them operating at PCIe 3.0, even on platforms limited to PCIe 2.0. The PEX 8747 switch is what drops to previous-gen rates on the host side.
With each GPU's bus controller operating at PCIe 3.0 on a 16-lane link, utilization at 100% should be 15.75GB/s. That information alone doesn't help us much, though. It's impossible to say how much traffic is directed at the host and how much goes to the other GPU. And unfortunately, the PLX switch doesn't give us access to more granular data. For now, we're left with a worst-case scenario: that each GPU is receiving all of its traffic from the host, and none is multicast.
With that explanation, check out the graph above. It is a 200-second run of BioShock: Infinite at a custom quality setting, running at a grueling 3840x2160 and averaging 80 FPS. PCIe bus utilization is in the ~15% range for each of the two GK104s. Even in the academic worst-case situation described above, we're using about 30% of a 16-lane PCIe 3.0 link, or about 60% of 16 second-gen lanes. And that's a brutal scenario, where 663.6 million pixels per second are being rendered (3840x2160 * 80 FPS) across two GPUs in SLI on a single board. For comparison, 1920x1080 * 60 FPS is 124.4 million pixels per second.
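The worst-case reasoning above boils down to a few multiplications, sketched here in Python. The ~15% per-GPU reading and the link speeds are the figures quoted in the text; the helper names are ours, and the calculation assumes, as the worst case does, that all of each GPU's traffic comes from the host.

```python
PCIE3_X16_GB_S = 15.75  # each GPU's link behind the PLX switch
PCIE2_X16_GB_S = 8.0    # host-side link on a PCIe 2.0 platform


def pixels_per_second(width: int, height: int, fps: float) -> float:
    """Rendered pixel throughput for a given resolution and frame rate."""
    return width * height * fps


def worst_case_host_share(per_gpu_utilization: float, gpu_count: int,
                          gpu_link_gb_s: float, host_link_gb_s: float) -> float:
    """Share of the host link used if every GPU's traffic crosses it."""
    traffic_gb_s = per_gpu_utilization * gpu_count * gpu_link_gb_s
    return traffic_gb_s / host_link_gb_s


print(f"{pixels_per_second(3840, 2160, 80) / 1e6:.1f} Mpixels/s")  # 663.6
print(f"{pixels_per_second(1920, 1080, 60) / 1e6:.1f} Mpixels/s")  # 124.4
print(f"{worst_case_host_share(0.15, 2, PCIE3_X16_GB_S, PCIE3_X16_GB_S):.0%}")  # 30%
print(f"{worst_case_host_share(0.15, 2, PCIE3_X16_GB_S, PCIE2_X16_GB_S):.0%}")  # 59%
```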
Given the shortcomings of our current hardware selection, we won't go any further with the GeForce GTX 690. Instead, let's proceed to more meaningful tests conducted on a GeForce GTX 750 Ti and the GeForce GTX 650 Ti it replaces. The seven charts that follow may seem like data overload. But they really do need to be considered together. In them, you'll see runs of the Metro: Last Light benchmark using the two aforementioned cards operating on the same platform.
Starting from the first graph, in order, we have: GPU core utilization (%), graphics memory utilization (MB), GPU core temperature (Celsius), frames per second, frame time (milliseconds), "FB" utilization (%) and "BUS" utilization (%). We're omitting core and memory clock rates; they were fixed at their respective GPU Boost values on both cards.
GPU utilization shows that the graphics processor on both cards is the primary system bottleneck. It's close to 100% most of the time. This is as we'd expect, and easy to explain. The GPU core is the component you'd expect a game to tax most heavily.
Graphics memory utilization appears stable under 1GB, which is what the GeForce GTX 650 Ti includes. There's nothing special to report, overall.
GPUs predictably warm up with use. Both of these test subjects remain quite cool compared to enthusiast-class cards, though.
Frame rates are typically what gamers look at, and we can see the GeForce GTX 750 Ti is faster than the 650 Ti. We're not necessarily interested in average FPS in this experiment, but will need relative frame rate as a measure of throughput for an adjustment we'll make later.
The spikes in frame times increase frame time variance. Those spikes coincide with higher (and not lower) frame rates, and with lower (not higher) GPU utilization. This could be the smoking gun of a platform bottleneck (although in this case, not a major one, since GPU utilization remains in the ~95% range; if the platform bottleneck were severe, utilization would drop much more). Still, while that's interesting, we aren't where we want to be with our experiment.
This is one of the new metrics enabled by the NVAPI queries: "FB" or "Framebuffer" Utilization. It's expressed as a percentage, and its name is misleading. What it really measures is the percentage of time each GPU's memory controller is busy and, by proxy, its bandwidth utilization.
Now, we didn't pick these two cards at random. We chose them because they come equipped with the same 128-bit interface at 1350MHz, delivering up to 86.4GB/s. At an equal throughput (FPS), their bandwidth utilization should be directly comparable. Their frame rates aren't the same, though. The GeForce GTX 750 Ti achieves higher performance, shown by the frame rate chart. Thus, we normalize the metric for the GM107-based board using the GeForce GTX 650 Ti's performance. That is the third blue line you see in the chart.
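Here is a rough Python sketch of both steps: deriving the 86.4GB/s peak from the interface width and clock, and scaling one card's utilization to the other's frame rate. The factor of four transfers per clock is standard for GDDR5. The linear scaling is our assumption about how such a normalization would work, and the sample readings are placeholders, not measurements from our charts.

```python
def gddr5_bandwidth_gb_s(bus_width_bits: int, memory_clock_mhz: float,
                         transfers_per_clock: int = 4) -> float:
    """Peak memory bandwidth in GB/s for a GDDR5 interface."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = memory_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


def normalize_utilization(utilization_pct: float, fps: float,
                          reference_fps: float) -> float:
    """Scale a utilization reading to what it would be at reference_fps.

    Assumes bandwidth use grows linearly with frame rate (a simplification).
    """
    return utilization_pct * reference_fps / fps


print(f"{gddr5_bandwidth_gb_s(128, 1350):.1f} GB/s")  # 86.4 GB/s

# Placeholder readings for illustration only.
fb_pct, fps_fast, fps_slow = 30.0, 50.0, 40.0
print(f"{normalize_utilization(fb_pct, fps_fast, fps_slow):.1f}%")  # 24.0%
```

Multiplying a utilization percentage by the peak figure gives a rough estimate of bandwidth in GB/s, provided the metric really does track controller busy time.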
The results are impressive. A Maxwell-based GPU appears to deliver 25% more FPS than a Kepler GPU in the same price range, while at the same time reducing its memory bandwidth utilization by 33%. Stated differently, on a per-frame basis, the GeForce GTX 750 Ti needs roughly half of the memory bandwidth in the Metro: Last Light benchmark.
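The step from "25% more FPS at 33% lower utilization" to "half the bandwidth per frame" can be checked with a short Python sketch. The two percentages are the ones quoted above; the function name is ours, and we assume the 33% figure refers to the raw, un-normalized utilization.

```python
def per_frame_bandwidth_ratio(fps_gain: float, utilization_reduction: float) -> float:
    """Bandwidth per frame of the newer card relative to the older one.

    fps_gain: fractional frame rate increase (0.25 means 25% more FPS).
    utilization_reduction: fractional drop in bandwidth utilization.
    """
    relative_utilization = 1.0 - utilization_reduction
    relative_fps = 1.0 + fps_gain
    return relative_utilization / relative_fps


ratio = per_frame_bandwidth_ratio(fps_gain=0.25, utilization_reduction=0.33)
print(f"{ratio:.2f}")  # 0.54, i.e. roughly half the bandwidth per frame
```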
This fact may have profound implications for the GeForce GTX 980 and 970. Memory interface area can be sacrificed for additional SMM area, bringing power to bear where it's most needed, or to shrink die size, reducing cost.
Our final graph shows, for the first time, directly (versus indirectly via FPS measurements), how extraordinarily small PCIe bandwidth requirements are for a modern graphics card (assuming the readings are meaningful and not "unreliable" as Nvidia says). PCI Express 2.0 gave us 8GB/s of throughput in each direction from 16-lane links. Both cards fail to hit 10% of that, requiring a sustained figure under 0.8GB/s. That is a mere 5% of the 15.75GB/s you have available to a single card over a 16-lane PCIe 3.0 link on a Haswell-based platform.
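As a quick sanity check on those percentages, the following Python sketch divides the 0.8GB/s ceiling by each link's capacity. The capacities are the per-direction figures for 16-lane links; the variable and function names are ours.

```python
SUSTAINED_GB_S = 0.8  # upper bound on sustained traffic read from the chart

LINK_CAPACITY_GB_S = {
    "PCIe 2.0 x16": 8.0,
    "PCIe 3.0 x16": 15.75,
}


def share_of_link(traffic_gb_s: float, capacity_gb_s: float) -> float:
    """Fraction of a link's peak bandwidth consumed by the given traffic."""
    return traffic_gb_s / capacity_gb_s


for name, capacity in LINK_CAPACITY_GB_S.items():
    print(f"{name}: {share_of_link(SUSTAINED_GB_S, capacity):.1%}")
# PCIe 2.0 x16: 10.0%
# PCIe 3.0 x16: 5.1%
```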
Graphics Memory Bandwidth Conclusions
Our tests confirm two things:
- Graphics memory bandwidth is not a bottleneck (at least in Metro with a GeForce GTX 750 Ti/650 Ti).
- Graphics memory bandwidth utilization improvements linked to the Maxwell architecture are beyond impressive: bandwidth requirements are essentially halved by Maxwell. This latter finding was confirmed by Nvidia.
The huge reduction in memory bandwidth needs has broad architectural implications, we believe. We're expecting to see (comparatively) smaller graphics memory controllers and the recovered die area put to good use in the form of additional SMMs, further increasing efficiency and performance gains of Maxwell over Kepler.
We'd love to make similar comparisons on AMD cards in the future, should the company enable common functionality in its drivers.