Here’s how NVIDIA is supercharging GenAI speed with TensorRT running locally

(Image credit: NVIDIA)

In the last couple of years, AI has exploded in popularity, with chatbots and image generators driving much of that surge. Chatbots are built on Large Language Models (LLMs), AI models trained on vast datasets, which they draw on to generate the results we see. Getting those results quickly, however, relies on some serious computing power. Over 100 million users are already putting powerful NVIDIA hardware to work running AI models. That’s because NVIDIA’s GPUs include cores specifically designed for that process, known as inference, and because NVIDIA pairs this hardware with TensorRT, software that optimizes performance by essentially finding shortcuts through the models without sacrificing accuracy.

These AI-specific cores are known as Tensor Cores, and they are the backbone of NVIDIA’s TensorRT, an SDK that optimizes AI models to run on NVIDIA hardware for dramatically accelerated inference. While a typical computer might manage between 10 and 45 trillion operations per second (TOPS) of AI performance, the latest NVIDIA RTX GPUs can deliver between 200 and 1,300 TOPS, and that’s local, on-device processing. Data center GPUs can take it up another notch.
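
To get a sense of what TensorRT actually does with a model, here is a minimal sketch of the standard TensorRT Python workflow: parse a network exported to ONNX, enable FP16 so the Tensor Cores can be used, and build an optimized engine. The file names are placeholders, and details can vary slightly between TensorRT versions.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# "model.onnx" is a placeholder for any model you've exported to ONNX.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # lower precision lets Tensor Cores accelerate the math

# Build and save an optimized inference engine for this GPU.
serialized_engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(serialized_engine)
```

The resulting engine is tuned for the specific GPU it was built on, which is part of how TensorRT squeezes out extra inference speed.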

With laptop, desktop, and data center NVIDIA RTX GPUs all offering the Tensor Cores needed to work with the TensorRT SDK, NVIDIA’s hardware is accelerating AI operations across the board.

Using TensorRT-LLM, a library that optimizes LLMs to run on NVIDIA hardware, these Tensor Cores can be put to work on the latest popular LLMs, such as Llama 2 or Mistral. This makes it easy to run these LLMs quickly on-device, without sending information back and forth between your computer and a data center (i.e., without the need for an internet connection). It also makes it possible to feed the LLM new information of your own and then query it with that data in mind, as in the sketch below.
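
As a rough illustration, here is a minimal sketch using TensorRT-LLM’s high-level Python LLM API (available in recent releases). The model ID, prompt, and sampling settings are placeholders; the library builds an optimized engine for the chosen model and runs generation entirely on the local GPU.

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder model ID; a supported Hugging Face checkpoint or a local path works here.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

prompts = ["Summarize what Tensor Cores do in one sentence."]
sampling = SamplingParams(temperature=0.7, max_tokens=64)

# Generation runs on the local RTX GPU; no data leaves the machine.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```

Customizing the model with your own documents typically means retrieving relevant snippets from them and including those snippets in the prompt, which is the approach NVIDIA’s ChatRTX app packages up for you.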

NVIDIA has even built ChatRTX, a demo app that streamlines this process for new users by letting you connect your own files and notes to a local LLM and chat with them.

Between the speed of local processing accelerated by Tensor Cores and the customization available, TensorRT and TensorRT-LLM are making AI all the more accessible, and this has made NVIDIA one of the top players in the space.

If you have NVIDIA RTX hardware running in your system, you can tap into TensorRT now to begin running AI text and image generators locally. And that’s just scratching the surface of what you can do. 
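
For example, once you have built an engine like the one sketched earlier, a tool such as Polygraphy (which ships alongside TensorRT) can run it locally in a few lines. The engine path, input tensor name, and input shape below are placeholders you would match to your own model.

```python
import numpy as np
from polygraphy.backend.trt import EngineFromBytes, TrtRunner

# Load the serialized engine built earlier (placeholder path).
with open("model.engine", "rb") as f:
    load_engine = EngineFromBytes(f.read())

with TrtRunner(load_engine) as runner:
    # "input" and its shape are placeholders; use your model's real input tensor.
    feed = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
    outputs = runner.infer(feed)
    print({name: arr.shape for name, arr in outputs.items()})
```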

To stay up to date on the latest developments for TensorRT and NVIDIA’s AI capabilities, follow NVIDIA’s AI Decoded series. There you’ll find AI news as well as helpful, digestible explanations of the technology working behind the scenes and looks at how others are deploying RTX-powered AI to tackle all sorts of challenges.
