The Myths Of Graphics Card Performance: Debunked, Part 2

NVAPI: Measuring Graphics Memory Bandwidth Utilization

Experimenting With Nvidia's NVAPI

Nvidia, through its GeForce driver, exposes a programming interface ("NVAPI") that, among other things, allows for collecting performance measurements. For the technically inclined, here is the relevant section in the nvapi.h header file:

FUNCTION NAME: NvAPI_GPU_GetDynamicPstatesInfoEx

DESCRIPTION: This API retrieves the NV_GPU_DYNAMIC_PSTATES_INFO_EX structure for the specified physical GPU. Each domain's info is indexed in the array. For example:

- pDynamicPstatesInfo->utilization[NVAPI_GPU_UTILIZATION_DOMAIN_GPU] holds the info for the GPU domain. There are currently four domains for which GPU utilization and dynamic P-state thresholds can be retrieved: graphic engine (GPU), frame buffer (FB), video engine (VID), and bus interface (BUS).

Beyond this header commentary, the API's specific functionality isn't documented. The information below is our best interpretation of its workings, though it relies on a fair amount of conjecture (a minimal sketch of how these counters can be polled follows the list below).

  • The graphics engine ("GPU") metric is expected to be your bottleneck in most games. If you don't see this at or close to 100%, something else (like your CPU or memory subsystem) is limiting performance.
  • The frame buffer ("FB") metric is interesting, if it works as intended. From the name, you'd expect it to measure graphics memory usage (the percentage of memory occupied). That's not what it is, though. It appears, rather, to be the memory controller's utilization, in percent. If that's correct, it would measure the bandwidth actually being consumed by the controller, a figure that isn't exposed any other way.
  • We're not as interested in the video engine ("VID"); it isn't generally used in gaming and typically registers a flat 0%. You'd only see the dial move if you were encoding video through ShadowPlay or streaming to a Shield.
  • The bus interface ("BUS") metric refers to utilization of the PCIe controller, again, as a percentage. The corresponding measurement, which you can trace in EVGA PrecisionX and MSI Afterburner, is called "GPU BUS Usage".
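For the technically curious, here is a minimal sketch (untested, with error handling trimmed) of how we understand these counters can be read through Nvidia's public NVAPI SDK; the structure and enum names are taken from our copy of nvapi.h. Polling this once per second produces traces like the ones you'll see later in this article.

    #include <cstdio>
    #include "nvapi.h"

    int main()
    {
        if (NvAPI_Initialize() != NVAPI_OK)
            return 1;

        NvPhysicalGpuHandle gpus[NVAPI_MAX_PHYSICAL_GPUS] = {};
        NvU32 gpuCount = 0;
        NvAPI_EnumPhysicalGPUs(gpus, &gpuCount);

        NV_GPU_DYNAMIC_PSTATES_INFO_EX info = {};
        info.version = NV_GPU_DYNAMIC_PSTATES_INFO_EX_VER;  // NVAPI structs are versioned

        for (NvU32 i = 0; i < gpuCount; i++)
        {
            if (NvAPI_GPU_GetDynamicPstatesInfoEx(gpus[i], &info) != NVAPI_OK)
                continue;

            // The four utilization domains discussed above, each in percent
            printf("GPU %u: GPU %u%%  FB %u%%  VID %u%%  BUS %u%%\n", (unsigned)i,
                   (unsigned)info.utilization[NVAPI_GPU_UTILIZATION_DOMAIN_GPU].percentage,
                   (unsigned)info.utilization[NVAPI_GPU_UTILIZATION_DOMAIN_FB].percentage,
                   (unsigned)info.utilization[NVAPI_GPU_UTILIZATION_DOMAIN_VID].percentage,
                   (unsigned)info.utilization[NVAPI_GPU_UTILIZATION_DOMAIN_BUS].percentage);
        }

        NvAPI_Unload();
        return 0;
    }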

We asked Nvidia to shed some light on the inner workings of NVAPI. Its response confirmed that the FB metric measures graphics memory bandwidth usage, but it dismissed the BUS metric, which it said is "considered to be unreliable and thus not used internally".

We asked AMD if it had any API or function that allowed for similar measurements. After checking internally, company representatives confirmed that it does not. As much as we'd like to, we're unable to conduct similar tests on AMD hardware.

Moving On To The Actual Tests...

Myth: Graphics memory bandwidth utilization and PCIe bus utilization are impossible to measure directly.

The amount of data moved between graphics memory and the graphics processor is massive. That's why graphics cards need such complex memory controllers capable of pushing tons of bandwidth. In the case of AMD's Radeon R9 290X, you're looking at up to 320GB/s. Nvidia's GeForce GTX 780 Ti is rated for up to 336GB/s. Maximum PCIe throughput isn't as impressive (15.75GB/s through a 16-lane third-gen link), though it isn't in as much demand, either. But how much of that bandwidth is actually utilized at any given point in time? Is it a bottleneck? Until now, those questions have been hard to answer. The Kepler architecture and NVAPI, we hope, make it possible to address them with more precision.
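For reference, here's where those PCIe figures come from; the arithmetic is ours (lanes, times transfer rate, times encoding efficiency, divided by eight bits per byte):

    #include <cstdio>

    // Per-direction bandwidth of a 16-lane PCIe link, in GB/s
    static double pcie_x16_gbs(double gtransfers_per_s, double coding_efficiency)
    {
        return 16 * gtransfers_per_s * coding_efficiency / 8.0;
    }

    int main()
    {
        printf("PCIe 2.0 x16: %.2f GB/s\n", pcie_x16_gbs(5.0, 8.0 / 10.0));     // 8b/10b encoding   ->  8.00 GB/s
        printf("PCIe 3.0 x16: %.2f GB/s\n", pcie_x16_gbs(8.0, 128.0 / 130.0));  // 128b/130b encoding -> 15.75 GB/s
        return 0;
    }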

We began our exploration looking at the BUS metric on a GeForce GTX 690. While Nvidia says it’s unreliable, we still wondered what we could glean from our test results. As we took readings, however, we faced another complication: the card's two GK104 GPUs are not linked directly to the main PCIe bus, but are rather switched through a PLX PEX 8747. So, no matter what setting the motherboard uses, the GPUs always operate at PCI Express 3.0 signaling rates, except when they're power-saving. That's why GPU-Z shows them operating at PCIe 3.0, even on platforms limited to PCIe 2.0. The PEX 8747 switch is what drops to previous-gen rates on the host side.

With each GPU's bus controller operating at PCIe 3.0 on a 16-lane link, 100% utilization should correspond to 15.75GB/s. That information alone doesn't help us much, though. It's impossible to say how much traffic is directed at the host and how much goes to the other GPU, and unfortunately, the PLX switch doesn't give us access to more granular data. For now, we're left assuming a worst-case scenario: that each GPU receives all of its traffic from the host, and none of it is multicast.

With that explanation, check out the graph above. It is a 200-second run of BioShock: Infinite at a custom quality setting, using a grueling 3840x2160 to average 80 FPS. PCIe bus utilization is in the ~15% range for each of the two GK104s. Even in the academic worst-case situation described above, we're using roughly 30% of a 16-lane PCIe 3.0 link, or about 60% of 16 second-gen lanes. And that's a brutal scenario, where 663.6 million pixels per second are being rendered (3840x2160 * 80 FPS) across two GPUs in SLI on a single board. For comparison, 1920x1080 * 60 FPS is 124.4 million pixels per second.
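Here is that worst-case arithmetic spelled out, under our assumption that every bit of each GPU's ~15% bus utilization comes from the host and none of it is shared between the two GK104s:

    #include <cstdio>

    int main()
    {
        const double gen3_x16 = 15.75, gen2_x16 = 8.0;   // GB/s, one direction
        const double per_gpu  = 0.15 * gen3_x16;         // ~2.4 GB/s per GK104 at ~15% BUS utilization
        const double total    = 2 * per_gpu;             // ~4.7 GB/s if nothing is multicast

        // Prints roughly 30% of a gen3 x16 link and ~59% (call it 60%) of a gen2 x16 link
        printf("Worst case: %.1f GB/s = %.0f%% of gen3 x16, %.0f%% of gen2 x16\n",
               total, 100 * total / gen3_x16, 100 * total / gen2_x16);
        return 0;
    }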

Given the shortcomings of my current hardware selection, I won't go any further with the GeForce GTX 690. Instead, let's proceed to more meaningful tests conducted on a GeForce GTX 750 Ti and the GeForce GTX 650 Ti it replaces. The seven charts that follow may seem like data overload. But they really do need to be considered together. In them, you'll see runs of the Metro: Last Light benchmark using the two aforementioned cards operating on the same platform.

Starting from the first graph, in order, we have: GPU core utilization (%), graphics memory utilization (MB), GPU core temperature (Celsius), frames per second, frame time (milliseconds), "FB" utilization (%) and "BUS" utilization (%). We're omitting core and memory clock rates; they were fixed at their respective GPU Boost values on both cards.

GPU utilization shows that the graphics processor on both cards is the primary system bottleneck. It's close to 100% most of the time. This is as expected, and easy to explain: the GPU core is the component a game taxes most heavily.

Graphics memory utilization appears stable under 1GB, which is what the GeForce GTX 650 Ti includes. There's nothing special to report, overall.

GPUs predictably warm up with use. Both of these test subjects remain quite cool compared to enthusiast-class cards, though.

Frame rates are typically what gamers look at, and we can see the GeForce GTX 750 Ti is faster than the 650 Ti. We're not necessarily interested in average FPS in this experiment, but will need relative frame rate as a measure of throughput for an adjustment we'll make later.

Spikes in frame time are what drive frame time variance up. Those spikes coincide with higher (not lower) frame rates, and with lower (not higher) GPU utilization. This could be the smoking gun of a platform bottleneck (although in this case not a major one, since GPU utilization remains in the ~95% range; if the platform bottleneck were severe, utilization would drop much further). Still, while that's interesting, we aren't yet where we want to be with our experiment.

This is one of the new metrics enabled by the NVAPI queries: "FB", or framebuffer, utilization. It's expressed as a percentage, and its name is misleading. What it really measures is the percentage of time each GPU's memory controller is busy and, by proxy, its bandwidth utilization.

Now, we didn't pick these two cards at random. We chose them because they come equipped with the same 128-bit interface at 1350MHz, delivering up to 86.4GB/s. At equal throughput (FPS), their bandwidth utilization should be directly comparable. Their frame rates aren't the same, though. The GeForce GTX 750 Ti achieves higher performance, as the frame rate chart shows. Thus, we normalize the metric for the GM107-based board using the GeForce GTX 650 Ti's performance. That is the third blue line you see in the chart.

The results are impressive. A Maxwell-based GPU appears to deliver 25% more FPS than a Kepler GPU in the same price range, while at the same time reducing its memory bandwidth utilization by 33%. Stated differently, on a per-frame basis, the GeForce GTX 750 Ti needs roughly half the memory bandwidth in the Metro: Last Light benchmark.
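As a sanity check on that claim, here's the per-frame arithmetic, treating the FB percentage as a proxy for bandwidth consumed and using the rounded 25% and 33% averages cited above:

    #include <cstdio>

    int main()
    {
        const double fps_ratio = 1.25;  // GTX 750 Ti frame rate relative to the GTX 650 Ti
        const double fb_ratio  = 0.67;  // GTX 750 Ti FB utilization relative to the GTX 650 Ti

        // Bandwidth per frame scales as FB% / FPS; as we read it, the third line in the
        // chart applies the same scaling to the 750 Ti's FB trace.
        printf("Per-frame bandwidth, Maxwell vs. Kepler: %.2fx\n", fb_ratio / fps_ratio);  // ~0.54x
        return 0;
    }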

This may have profound implications for the GeForce GTX 980 and 970. Die area spent on the memory interface can be traded for additional SMMs, bringing power to bear where it's needed most, or simply cut to shrink the die and reduce cost.

Our final graph shows directly, for the first time (rather than indirectly via FPS measurements), how extraordinarily small a modern graphics card's PCIe bandwidth requirements are (assuming the readings are meaningful, and not "unreliable" as Nvidia says). PCI Express 2.0 gives us 8GB/s of throughput in each direction across a 16-lane link. Both cards fail to hit 10% of that, sustaining less than 0.8GB/s. That is a mere 5% of what a single card has available on a Haswell-based platform.
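And, for completeness, the headroom arithmetic behind those percentages:

    #include <cstdio>

    int main()
    {
        const double observed = 0.10 * 8.0;  // under 10% of a gen2 x16 link, i.e. < 0.8 GB/s sustained
        printf("Share of gen2 x16 (8 GB/s):     < %.0f%%\n", 100 * observed / 8.0);     // < 10%
        printf("Share of gen3 x16 (15.75 GB/s): < %.0f%%\n", 100 * observed / 15.75);   // ~5%, Haswell platform
        return 0;
    }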

Graphics Memory Bandwidth Conclusions

Our tests confirm two things:

  • Graphics memory bandwidth is not a bottleneck (at least in Metro with a GeForce GTX 750 Ti/650 Ti).
  • Graphics memory bandwidth utilization improvements linked to the Maxwell architecture are beyond impressive: memory bandwidth requirements are essentially halved. Nvidia confirmed this latter finding.

The huge reduction in memory bandwidth needs has broad architectural implications, we believe. We expect to see (comparatively) smaller graphics memory controllers, with the recovered die area put to good use in the form of additional SMMs, further increasing Maxwell's efficiency and performance gains over Kepler.

We'd love to make similar comparisons on AMD cards in the future, should the company enable common functionality in its drivers.

  • iam2thecrowe
    i've always had a beef with gpu ram utillization and how its measured and what driver tricks go on in the background. For example my old gtx660's never went above 1.5gb usage, searching forums suggests a driver trick as the last 512mb is half the speed due to it's weird memory layout. Upon getting my 7970 with identical settings memory usage loading from the same save game shot up to near 2gb. I found the 7970 to be smoother in the games with high vram usage compared to the dual 660's despite frame rates being a little lower measured by fraps. I would love one day to see an article "the be all and end all of gpu memory" covering everything.

    Another thing, i'd like to see a similar pcie bandwidth test across a variety of games and some including physx. I dont think unigine would throw much across the bus unless the card is running out of vram where it has to swap to system memory, where i think the higher bus speeds/memory speed would be an advantage.
  • blackmagnum
    Suggestion for Myths Part 3: Nvidia offers superior graphics drivers, while AMD (ATI) gives better image quality.
  • photonboy
    Implying that an i7-4770K is little better than an i7-950 is just dead wrong for quite a number of games.

    There are plenty of real-world gaming benchmarks that prove this so I'm surprised you made such a glaring mistake. Using a synthetic benchmark is not a good idea either.

    Frankly, I found the article was very technically heavy were not necessary like the PCIe section and glossed over other things very quickly. I know a lot about computers so maybe I'm not the guy to ask but it felt to me like a non-PC guy wouldn't get the simplified and straightforward information he wanted.
  • eldragon0
    If you're going to label your article "graphics performance myths" Please don't limit your article to just gaming, It's a well made and researched article, but as Photonboy touched, the 4770k vs 950 are about as similar as night and day. Try using that comparison for graphical development or design, and you'll get laughed off the site. I'd be willing to say it's rendering capabilities are actual multiples faster at those clock speeds.
  • SteelCity1981
    photonboy this article isn't for non pc people, because non pc people wouldn't care about detailed stuff like this.
  • renz496
    749236 said:
    Suggestion for Myths Part 3: Nvidia offers superior graphics drivers


    even if toms's hardware really did their own test it doesn't really useful either because their test setup won't represent million of different pc configuration out there. you can see one set of driver working just fine with one setup and totally broken in another setup even with the same gpu being use. even if TH represent their finding you will most likely to see people to challenge the result if it did not reflect his experience. in the end the thread just turn into flame war mess.

    749236 said:
    Suggestion for Myths Part 3: while AMD (ATI) gives better image quality.


    this has been discussed a lot in other tech forum site. but the general consensus is there is not much difference between the two actually. i only heard about AMD cards the in game colors can be a bit more saturated than nvidia which some people take that as 'better image quality'.
  • ubercake
    Just something of note... You don't necessarily need Ivy Bridge-E to get PCIe 3.0 bandwidth. Sandy Bridge-E people with certain motherboards can run PCIe 3.0 with Nvidia cards (just like you can with AMD cards). I've been running the Nvidia X79 patch and getting PCIe gen 3 on my P9X79 Pro with a 3930K and GTX 980.
  • dovah-chan
    There is one AM3+ board with PCI-E 3.0. That would be the Sabertooth Rev. 2.
  • ubercake
    Another article on Tom's Hardware by which the 'ASUS ROG Swift PG...' link listed for an unbelievable price takes you to the PB278Q page.

    A little misleading.
  • WyomingKnott
    "Smaller cards fit into longer slots (for instance, a x8 or x1 card into a x16 slot), but larger cards do not fit into shorter slots (a x16 card into a x1 slot, for example). "

    Not sure that this is correct - aren't some slots made open-ended, so that you can use all of the slot's lanes but not all of the card's lanes?
  • chaospower
    Quote:
    Implying that an i7-4770K is little better than an i7-950 is just dead wrong for quite a number of games. There are plenty of real-world gaming benchmarks that prove this so I'm surprised you made such a glaring mistake. Using a synthetic benchmark is not a good idea either. Frankly, I found the article was very technically heavy were not necessary like the PCIe section and glossed over other things very quickly. I know a lot about computers so maybe I'm not the guy to ask but it felt to me like a non-PC guy wouldn't get the simplified and straightforward information he wanted.


    You're wrong. The old I7s had much slower clocks, thats why the performance wasn't as good as the newer ones, and many becnhmarks would confirm that they are indeed slower. But when clocked similarly the difference is indeed incredibly small. The author of this article knows what he's talking about.
    And here's proof (It's 3770k and not 4770k, but im sure most people would agree the difference isn't great between those two)
    http://alienbabeltech.com/main/ivy-bridge-3770k-gaming-results-vs-core-i7-920-at-4-2ghz/5/
  • cknobman
    Great article, I really enjoyed the read. Thanks!
  • boju
    1272783 said:
    Quote:
    Implying that an i7-4770K is little better than an i7-950 is just dead wrong for quite a number of games. There are plenty of real-world gaming benchmarks that prove this so I'm surprised you made such a glaring mistake. Using a synthetic benchmark is not a good idea either. Frankly, I found the article was very technically heavy were not necessary like the PCIe section and glossed over other things very quickly. I know a lot about computers so maybe I'm not the guy to ask but it felt to me like a non-PC guy wouldn't get the simplified and straightforward information he wanted.
    You're wrong. The old I7s had much slower clocks, thats why the performance wasn't as good as the newer ones, and many becnhmarks would confirm that they are indeed slower. But when clocked similarly the difference is indeed incredibly small. The author of this article knows what he's talking about. And here's proof (It's 3770k and not 4770k, but im sure most people would agree the difference isn't great between those two) http://alienbabeltech.com/main/ivy-bridge-3770k-gaming-results-vs-core-i7-920-at-4-2ghz/5/


    To my understanding Photonboy meant stock speeds 950 @ 3.06ghz would have a hard time keeping up with the later generations with higher clocks along with other improvements. A genuine analyst in this case wouldn't consider overclocking as it's too inconsistent to set a standard. Photonboy would be well aware of the potential in overclocking.

    I love your link though, haven't seen one comparing an overclocked 920. Since the first generation i7's there has been a GHz+ stock increase. My 920 is @ 3.9, not quite 4.2 as in the link, though I'd imagine not far off in performance and I'm glad the early generations 'when' overclocked to that level are still making a statement :)
  • redgarl
    Myth, stuttering...

    1 big card is better than 2 because it prevents stuttering... dead wrong, two cards gives better FPS. I have been using CF for the last 5 years and I never had stuttering issues.
  • Dantte
    Your Cable chart is wrong, HDMI 1.4 DOES support 4K, it does not support 4K @60hz, this is what HDMI 2.0 addresses!

    "HDMI 1.4 was released on May 28, 2009, and the first HDMI 1.4 products were available in the second half of 2009. HDMI 1.4 increases the maximum resolution to 4K × 2K, i.e. 4096×2160 at 24 Hz"
  • danwat1234
    What's the name of the giant cat godzilla-esque game? Can't find it on google for some reason
  • TechyInAZ
    Thank you Tom's Hardware for making part 2! I loved part 1 a lot which made me wonder if you were ever gona get part 2 posted.

    This defiantly clears some stuff even i didn't know about, glad for this series.
  • pc-cola
    @danwat1234

    Catzilla is the name of the benchmarking program.
  • youcanDUit
    so... did i do a good job picking up a GTX 970? or was that a waste? i got it for 220 bucks.
  • Math Geek
    I always wondered about the pcie bandwidth and how much was actually used. theoretical bandwidth doubling each generation left me wondering if the cards were keeping up and using it or if the technology was simply a "because we can" kind of thing. i know this is not the last word on the subject but i won't feel as apprehensive using pcie 2.0 mobo's for the near future. clearly this is ample bandwidth for the average build and average user.
  • SheaK
    I'm leaving this comment mostly because I'm sure this is an uncommon bit of information, and I wanted to throw out my community contribution.

    I have a Xeon workstation based on two E5 2696v2 chips (24 cores, 48 threads) with 128gb RAM running a 2tb Mushkin Scorpion Deluxe PCIe and currently a single 295x2 at 2560x1600 (system is liquid cooled).

    With regard to mantle: I've noticed a massive difference and with core-parking disabled mantle gives as much as a 15% increase in FPS. It appears mantle can consume upward of 24 threads during normal use, and I've seen CPU hit 97% of 48 threads (!!!) briefly during loads.

    As this system has 80 PCI-E lanes available, the second 295x2 will probably give a more linear "crossfire" scaling than some systems with bottlenecks.

    I'm passing the information along as many workstations like this are not used for both work and for gaming.
  • Eggz
    OMG! So glad you made this series. There's a lot of assumption-based randomness echoing throughout the interwebz, and this does a great job of addressing graphics-related questions in a disciplined way.
  • bygbyron3
    Some (very few) TVs can display a 120 Hz signal from a PC without any interpolation. The refresh rate has to be overridden or set custom.