Nvidia Boosts AI Performance With TensorRT

Nvidia Stable Diffusion TensorRT Update
(Image credit: Tom's Hardware)

Nvidia has been busy working on further improvements to its suite of AI/ML (Artificial Intelligence/Machine Learning) and LLM (Large Language Model) tools. The latest additions are TensorRT and TensorRT-LLM, designed to boost performance on consumer GPUs, including many of the best graphics cards, for tasks like Stable Diffusion and Llama 2 text generation. We've tested some of Nvidia's latest GPUs using TensorRT and found performance in Stable Diffusion improved by up to 70%. TensorRT should be available for download from Nvidia's GitHub page now, though we had early access for the purposes of this initial look.

We've seen a lot of movement in Stable Diffusion over the past year or so. Our first look used Automatic1111's webui, which initially only supported Nvidia GPUs under Windows. Since then, the number of forks and alternative text-to-image AI generation tools has exploded, and both AMD and Intel have released more finely tuned libraries that have closed the gap somewhat with Nvidia's performance. You can see our latest roundup of Stable Diffusion benchmarks in our AMD RX 7800 XT and RX 7700 XT reviews. Now, Nvidia is ready to widen the gap again with TensorRT.

The basic idea is similar to what AMD and Intel have already done. Leveraging ONNX, an open format for AI and ML models and operators, the base Hugging Face Stable Diffusion model gets converted into ONNX format. From there, TensorRT can further optimize performance for the specific GPU that you're using. It takes a few minutes (or sometimes more) for TensorRT to tune things, but once complete, you should get a substantial boost in performance along with better memory utilization.
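If you're curious what that per-GPU tuning actually entails, it's essentially a TensorRT engine build. The sketch below is not Nvidia's extension code (the webui handles all of this automatically, and the file names here are placeholders), but it shows the kind of work happening under the hood during those few minutes:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

# "unet.onnx" is a placeholder name for the ONNX export of the Stable Diffusion UNet.
with open("unet.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(f"ONNX parse failed: {parser.get_error(0)}")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 so the Tensor cores get used

# This is the slow step: TensorRT times candidate kernels on the installed GPU and
# keeps the fastest ones, which is why every card needs its own engine. Models with
# dynamic input shapes also need an optimization profile (see the later example).
engine = builder.build_serialized_network(network, config)
with open("unet.plan", "wb") as f:
    f.write(engine)

The resulting engine file is tied to the GPU (and TensorRT version) it was built on, which is why the process has to be repeated for every card.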

We've run all of Nvidia's latest RTX 40-series GPUs through the tuning process (each has to be done separately for optimal performance), along with testing the base Stable Diffusion performance and performance using Xformers. We're not quite ready for the full update comparing AMD, Intel, and Nvidia performance in Stable Diffusion, as we're retesting a bunch of additional GPUs using the latest optimized tools, so this initial look is focused only on Nvidia GPUs. We've included one RTX 30-series (RTX 3090) and one RTX 20-series (RTX 2080 Ti) to show how the TensorRT gains apply to all of Nvidia's RTX line.

Each of the above galleries, at 512x512 and 768x768, uses the Stable Diffusion 1.5 model. We've "regressed" to 1.5 instead of 2.1, as the creator community generally prefers the results from 1.5, though the relative standings should be roughly the same with the newer models. For each GPU, we ran different batch sizes and batch counts to find the optimal throughput, generating 24 images total per run. Then we averaged the throughput of three separate runs to determine the overall rate, meaning 72 images in total were generated for each model format and GPU (not counting discarded runs).
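For reference, the measurement itself is simple. The sketch below uses the diffusers library rather than our actual Automatic1111-based harness, and the prompt, sampler, and step count are placeholders, but it shows the basic images-per-minute sweep we ran for each GPU:

import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

def images_per_minute(batch_size, total_images=24, runs=3, steps=50, size=512):
    # Generate 24 images per run, average three runs, return the throughput.
    rates = []
    for _ in range(runs):
        start = time.time()
        for _ in range(total_images // batch_size):
            pipe(prompt=["a photo of an astronaut riding a horse"] * batch_size,
                 num_inference_steps=steps, height=size, width=size)
        rates.append(total_images / (time.time() - start) * 60)
    return sum(rates) / len(rates)

# Sweep batch sizes to find the optimal configuration for the installed GPU.
for bs in (1, 2, 4, 8):
    print(f"batch {bs}: {images_per_minute(bs):.1f} images/min")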

Various factors come into play with the overall throughput. GPU compute matters quite a bit, as does memory bandwidth. VRAM capacity tends to be a lesser factor, other than potentially allowing for larger image resolutions or batch sizes; there are things you can do with 24GB of VRAM that won't be possible with 8GB, in other words. L2 cache sizes may also factor in, though we didn't attempt to model this directly. What we can say is that the 4060 Ti 16GB and 8GB cards, which have basically the same specs (slightly different clocks due to the custom model on the 16GB), delivered nearly identical performance and optimal batch sizes.

There are some modest differences in relative performance, depending on the model format used. The base model is the slowest, with Xformers boosting performance by anywhere from 30–80 percent for 512x512 images, and 40–100 percent for 768x768 images. TensorRT then boosts performance an additional 50–65 percent at 512x512, and 45–70 percent at 768x768.

What's interesting is that the smallest gains (of the GPUs tested so far) come from the RTX 3090. It's not exactly clear what the limiting factor might be, and we'll have to test additional GPUs to come to any firm conclusions. The RTX 40-series has fourth-generation Tensor cores, the RTX 30-series has third-generation Tensor cores, and the RTX 20-series has second-generation Tensor cores (with the Volta architecture being the first generation). Newer architectures should be more capable, in other words, though for the sort of work required in Stable Diffusion it mostly seems to boil down to raw compute and memory bandwidth.

We're not trying to make this the full Nvidia-versus-the-world performance comparison, but as a point of reference, updated testing of the RX 7900 XTX tops out at around 18–19 images per minute for 512x512, and around five images per minute at 768x768. We're working on full testing of the AMD GPUs with the latest Automatic1111 DirectML branch, and we'll have an updated Stable Diffusion compendium once that's complete. Note also that Intel's Arc A770 manages 15.5 images/min at 512x512, and 4.7 images/min at 768x768.

Nvidia Stable Diffusion TensorRT Update

(Image credit: Nvidia)

So what's going on exactly with TensorRT that it can improve performance so much? I spoke with Nvidia about this topic, and it's mostly about optimizing resources and model formats.

ONNX was originally developed by Facebook and Microsoft, but it's now an open source initiative under the Apache License. ONNX is designed to allow AI models to be used with a wide variety of backends: PyTorch, OpenVINO, DirectML, TensorRT, etc. It provides a common definition for different AI models, with a computation graph, the necessary built-in operators, and a set of standard data types. This allows models to be easily moved between various AI acceleration frameworks.
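As a toy illustration of that portability (a generic model, not Stable Diffusion), exporting from PyTorch to ONNX and then running the same file through a different backend only takes a few lines:

import torch
import torch.nn as nn

# A throwaway model, exported once to the common ONNX format.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
dummy = torch.randn(1, 128)
torch.onnx.export(model, dummy, "toy.onnx",
                  input_names=["input"], output_names=["logits"],
                  dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
                  opset_version=17)

# The same toy.onnx can now be consumed by ONNX Runtime, OpenVINO, DirectML, or
# parsed by TensorRT, without touching the original PyTorch code.
import onnxruntime as ort
session = ort.InferenceSession("toy.onnx", providers=["CPUExecutionProvider"])
print(session.run(None, {"input": dummy.numpy()})[0].shape)  # (1, 10)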

TensorRT, meanwhile, is designed to be more performant on Nvidia GPUs. To take advantage of TensorRT, a developer would normally need to write their models directly in the format TensorRT expects, or convert an existing model into that format. ONNX helps simplify this process, which is why it has also been used by AMD (DirectML) and Intel (OpenVINO) for their tuned branches of Stable Diffusion.

Finally, one of the options with TensorRT is that you can also tune for an optimal path within a model. In our case, we're doing batches of 512x512 and 768x768 images. The generic TensorRT model we generate may support a dynamic image size of 512x512 to 1024x1024, with a batch size of one to eight, and an optimal configuration of 512x512 at batch size one. Doing 512x512 batches of eight with that engine might end up being 10% slower, give or take. So we can build another TensorRT model that specifically targets 512x512x8, or 768x768x4, or whatever. We did all of this to find the best configuration for each GPU.
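Under the hood, that corresponds to TensorRT optimization profiles. A rough sketch is below; the input name and shapes are illustrative only (Stable Diffusion's UNet works on 4-channel latents at 1/8 the pixel resolution, so 512x512 maps to 64x64):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Dynamic engine: 512x512 up to 1024x1024, batch 1 to 8, tuned hardest for batch 1 at 512x512.
dynamic = builder.create_optimization_profile()
dynamic.set_shape("sample",           # placeholder name for the UNet's latent input
                  (1, 4, 64, 64),     # min shape: batch 1 at 512x512
                  (1, 4, 64, 64),     # opt shape: what TensorRT optimizes for
                  (8, 4, 128, 128))   # max shape: batch 8 at 1024x1024
config.add_optimization_profile(dynamic)

# A dedicated engine pinned to batch 8 at 512x512 (built separately, with its own
# builder config) trades that flexibility for the last bit of speed.
static = builder.create_optimization_profile()
static.set_shape("sample", (8, 4, 64, 64), (8, 4, 64, 64), (8, 4, 64, 64))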

AMD's DirectML fork has some similar options, though right now there are some limitations we've encountered (we can't do a batch size other than one, as an example). We anticipate further tuning of AMD and Intel models as well, though over time the gains are likely to taper off.

Nvidia Stable Diffusion TensorRT Update

(Image credit: Nvidia)

The TensorRT update isn't just for Stable Diffusion, of course. Nvidia shared the above slide detailing improvements it has measured with Llama 2 7B int4 inference, using TensorRT. That's a text generation model with seven billion parameters.

As the chart shows, generating a single batch of text shows a modest benefit, but the GPU in this case (RTX 4090) doesn't appear to be getting a full workout. Increasing the batch size to four boosts overall throughput by 3.6X, while a batch size of eight gives a 4.4X speedup. Larger batch sizes in this case can be used to generate multiple text responses, allowing the user to select the one they prefer — or even combine parts of the output, if that's useful.
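Nvidia hasn't released TensorRT-LLM yet (see below), but the batching concept itself is the same in any inference stack. Here's a plain Hugging Face transformers sketch of what a batch of eight looks like in practice; the model ID and prompt are just placeholders, and this is not TensorRT-LLM code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"   # gated model; requires approved access
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token    # Llama 2 has no pad token by default
tokenizer.padding_side = "left"              # pad on the left for decoder-only generation
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# One prompt repeated eight times: a single generation pass now produces eight
# candidate responses, which is where the batch-size throughput multipliers come from.
prompts = ["Write a two-sentence summary of ray tracing."] * 8
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.8,
                         pad_token_id=tokenizer.eos_token_id)

for i, text in enumerate(tokenizer.batch_decode(outputs, skip_special_tokens=True)):
    print(f"--- candidate {i} ---\n{text}\n")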

TensorRT-LLM isn't out yet, but it should be available on developer.nvidia.com (free registration required) in the near future.

Nvidia Stable Diffusion TensorRT Update

(Image credit: Nvidia)

Finally, as part of its AI-focused updates for LLMs, Nvidia is also working on a TensorRT-LLM tool that will allow the use of Llama 2 as the base model, and then import local data for more domain-specific and up-to-date knowledge. As an example of what this can do, Nvidia imported 30 recent Nvidia news articles into the tool, and you can see the difference in responses between the base Llama 2 model and the model augmented with that local data.

The base model provides all the information about how to generate meaningful sentences and such, but it has no knowledge of recent events or announcements. In this case, it thinks no official information about Alan Wake 2 has been released. With the added local data, however, it's able to provide a more meaningful response.
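Nvidia hasn't detailed how the tool works internally, but conceptually this is retrieval-augmented generation: embed the local documents, pull the ones most relevant to the question, and hand them to the LLM as context. The sketch below is purely our own illustration (the embedding library, document text, and prompt format are all placeholders, not Nvidia's implementation):

import numpy as np
from sentence_transformers import SentenceTransformer

# Stand-ins for the 30 news articles Nvidia fed into its demo.
documents = [
    "Alan Wake 2 launches October 27 with full ray tracing and DLSS 3.5 support.",
    "Nvidia's VSR 1.5 adds RTX 20-series support and artifact reduction at native resolution.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(documents, normalize_embeddings=True)

def retrieve(question, k=2):
    # Cosine similarity search (vectors are normalized, so a dot product suffices).
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

question = "When does Alan Wake 2 come out?"
context = "\n".join(retrieve(question))
prompt = f"Answer using this context:\n{context}\n\nQuestion: {question}"
# `prompt` then goes to the Llama 2 model, which can now answer from the local data.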

Another example Nvidia gave was using such local data with your own email or chat history. Then you could ask it things like, "What movie were Chris and I talking about last year?" and it would be able to provide an answer. It's potentially a smarter search option, using your own information.

We can't help but see this as a potential use case for our own HammerBot, though we'll have to see if it's usable on our particular servers (since it needs an RTX card). As with all LLMs, the results can be a bit variable in quality depending on the training data and the questions you ask.

Nvidia Stable Diffusion TensorRT Update

(Image credit: Nvidia)

Nvidia also announced updates to its Video Super Resolution, now with support for RTX 20-series GPUs and native-resolution artifact reduction. The latter means that if you're watching a 1080p stream on a 1080p monitor, VSR can still help with noise removal and image enhancement. VSR 1.5 is available with Nvidia's latest drivers.

Jarred Walton

Jarred Walton is a senior editor at Tom's Hardware focusing on everything GPU. He has been working as a tech journalist since 2004, writing for AnandTech, Maximum PC, and PC Gamer. From the first S3 Virge '3D decelerators' to today's GPUs, Jarred keeps up with all the latest graphics trends and is the one to ask about game performance.

  • derekullo
    Would it be possible to do an upscale test to see how long it takes to upscale a single image with SwinIR or another upscaler?

    Using a 1024x1024 image from Bing AI I am able to upscale it to 8192x8192 using an 8x scale in SwinIR with Stable Diffusion in about 4 minutes using a Geforce 3080 Ti.

    I just stumbled upon SwinIR the other day ... had no idea you can add detail so easily!
  • JarredWaltonGPU
    derekullo said:
    Would it be possible to do an upscale test to see how long it takes to upscale a single image with SwinIR or another upscaler?

    Using a 1024x1024 image from Bing AI I am able to upscale it to 8192x8192 using an 8x scale in SwinIR with Stable Diffusion in about 4 minutes using a Geforce 3080 Ti.

    I just stumbled upon SwinIR the other day ... had no idea you can add detail so easily!
    I haven't played around much with upscaling, as it tends to be very memory intensive and runs into out of memory errors a lot. I'm not sure what variant of SwinIR you're using, but the "SwinIR_4x" that's part of the base Automatic1111 install isn't doing so hot trying to go from 768x768 to 3072x3072 (4x upscale) with an RTX 2080 Ti. A 2x upscale went relatively quickly, which makes me think this is barely fitting in system+GPU memory and will take forever — I don't even have a progress bar yet! LOL Poking around a bit more, I seem able to do 768x768 to 1920x1920 via a 2.5X upscale without going beyond 8~10 GB of VRAM use.

    I had the 2080 Ti in my test PC from yesterday, so maybe it would do better with a 30-series card. But I'm not sure what settings exactly you're using with SwinIR, or if you're using it in a different fashion than the standard A1111 integration. (You probably are, since I can only go to 4X with the built-in options.)

    Details on what precisely you're doing would be helpful, but I also suspect that while I might be able to get this working to varying degrees with Nvidia GPUs, it could be more problematic with the Intel and AMD GPUs.
  • derekullo
    SwinIR_4x was the setting I used ... sorry for the confusion!
    I also use the base Automatic1111 install.
    Not sure why your resize slider doesn't go to 8x.
    I didn't set many other settings besides the upscaler, image scale multiplier and the image itself.
    Upscaler 2 is set to none
    GFPGAN visibility, CodeFormer visibility, CodeFormer weight sliders at the bottom are set to 0.

    Was able to upscale a 450x303 picture to 3600x2424 in 41.79s ...8x
    Postprocess upscale by: 8, Postprocess upscaler: SwinIR_4x
    Time taken: 41.79s
    Torch active/reserved: 2982/3676 MiB, Sys VRAM: 5914/12288 MiB (48.13%)

    Was able to upscale a 1024x1024 picture to 8192x8192 in 5 minutes 20 seconds.
    Postprocess upscale by: 8, Postprocess upscaler: SwinIR_4x
    Time taken: 5m20s
    Torch active/reserved: 6943/9058 MiB, Sys VRAM: 11424/12288 MiB (92.97%)

    Looking at those results a 12 gigabyte card may be needed for an 8x upscale on a 1024x1024 image.

    More pixels takes more vram but the relationship is a bit unclear.
    Multiplying it all out 1024x1024 is 7.69 times the amount of pixels as 450x303, but only about twice the amount of vram.
  • JarredWaltonGPU
    derekullo said:
    SwinIR_4x was the setting I used ... sorry for the confusion!
    I also use the base Automatic1111 install.
    Not sure why your resize slider doesn't go to 8x.
    I didn't set many other settings besides the upscaler, image scale multiplier and the image itself.
    Upscaler 2 is set to none
    GFPGAN visibility, CodeFormer visibility, CodeFormer weight sliders at the bottom are set to 0.

    Was able to upscale a 450x303 picture to 3600x2424 in 41.79s ...8x
    Postprocess upscale by: 8, Postprocess upscaler: SwinIR_4x
    Time taken: 41.79s
    Torch active/reserved: 2982/3676 MiB, Sys VRAM: 5914/12288 MiB (48.13%)

    Was able to upscale a 1024x1024 picture to 8192x8192 in 5 minutes 20 seconds.
    Postprocess upscale by: 8, Postprocess upscaler: SwinIR_4x
    Time taken: 5m20s
    Torch active/reserved: 6943/9058 MiB, Sys VRAM: 11424/12288 MiB (92.97%)

    Looking at those results a 12 gigabyte card may be needed for an 8x upscale on a 1024x1024 image.

    More pixels takes more vram but the relationship is a bit unclear.
    Multiplying it all out 1024x1024 is 7.69 times the amount of pixels as 450x303, but only about twice the amount of vram.
    Are you doing this in one pass, or is this a separate upscale, after you've already generated an image? Any screenshots would be useful, as well as information on whether you're using xformers or not. :)

    I am testing other GPUs right now and have a 3080 12GB in my test PC, so I'm trying to figure out how to test upscaling. One thing I'm relatively sure of is that less tuning and optimization has been put into the upscalers as opposed to the base Stable Diffusion stuff.

    Here's what I've tried (which does not seem to be going too well as it has shown zero progress on the actual resizing process). I don't know if you're using the Hires.fix or something else. Best I can do, with a 12GB 3080, is about a 3X upscale from 768x768.

    Are you running on Windows, or Linux? Did you have to do any special patching or whatever first?

  • derekullo
    I am running xformers ... must have toggled that on when I first configured it.

    @Echo off
    set PYTHON=
    set GIT=
    set VENV_DIR=
    set COMMANDLINE_ARGS=--xformers --listen
    call webui.bat
    That's the entire .bat I use to launch it.

    I am up-scaling the image as a separate step after an image has been created ... no point up-scaling an image I don't like!

    At the top click Extras and that should take you to the section I am at.

    I am pretty sure the checkpoint/model isn't used for this ... tried an anime checkpoint and the image was identical.
    I am using Windows 11 Pro
  • JarredWaltonGPU
    derekullo said:
    I am running xformers ... must have toggled that on when I first configured it.

    @Echo off
    set PYTHON=
    set GIT=
    set VENV_DIR=
    set COMMANDLINE_ARGS=--xformers --listen
    call webui.bat
    That's the entire .bat I use to launch it.

    I am up-scaling the image as a separate step after an image has been created ... no point up-scaling an image I don't like!

    At the top click Extras and that should take you to the section I am at.

    Ha! I never even looked in the "Extras" tab... because why would upscaling of an existing image be listed under "Extras" rather than "Upscaling"? I'll have to see if the upscaling works with non-Nvidia cards. I did confirm it works, going up to very high resolutions, even with cards that only have 10GB. I suppose it probably does a sort of tiled upscaling, whereas if you use the "Hires Fix" it does some weird stuff — like not just upscaling, but generating in a fashion that is not at all the same as the upscaling.
  • derekullo
    Here are the actual 2 pictures from the test!
  • KarlKnecht
    The future for Stable Diffusion is SDXL with a resolution of 1024x1024. Can you give us an idea of the performance boost we can expect for that scenario? Would it be similar?
  • JarredWaltonGPU
    KarlKnecht said:
    The future for Stable Diffusion is SDXL with a resolution of 1024x1024. Can you give us an idea of the performance boost we can expect for that scenario? Would it be similar?
    I would expect relatively similar scaling, though my past experience suggests Nvidia GPUs do better at higher resolutions than AMD GPUs. I'm not sure what the exact cause might be, possibly just better memory management in the drivers, or CUDA, or whatever.

    One thing to also consider is whether you even want to bother with SDXL. You could potentially do 512x512 and upscaling using SwinIR_4X and get similar results, faster. But that's a different can of worms to open up. :-)

    Which brings up the final point: I think the base models from Hugging Face (SD1.5, SD2.1, SDXL) can all work as is with the TensorRT stuff from Nvidia. But other things like upscaling may need tweaking and work to convert them to TensorRT. I'm not even sure how to go about doing that, which means you might end up back at base CUDA performance in many cases until/unless TensorRT uptake for AI stuff really takes off.