Nvidia ACE Brings AI to Game Characters, Allows Lifelike Conversations

Jin
(Image credit: Tom's Hardware)

There are so many ways you can have a text chat with a large language model, from ChatGPT to Google Bard or MLC LLM, a local chatbot that can run on your phone. The next frontier for AI is bringing the power of LLMs to NPCs (non-player characters) in games where, instead of having a canned set of interactions, you can have a wide-open conversation. 

During its Computex 2023 keynote, Nvidia CEO Jensen Huang unveiled ACE for Games, an AI model foundry service, designed to bring game characters to life using natural language conversation, audio-to-facial-expression and text-to-speech / speech-to-text capabilities. Huang showed a game demo where an NPC named Jin, who runs a ramen noodle shop, interacted with a human player who was asking questions with voice and getting back real-sounding answers that match the NPC's backstory.

In the demo, the gamer (named Kai), walks into Jin's Ramen shop, asks him how he's doing (in voice) and has a conversation about the fact that the area has a high crime rate. Kai asks if he can help and Jin responds saying that "if you want to do something about this, I have heard rumors that the powerful crime lord Kumon Aoki is causing all sorts of chaos in the city. He may be the root of this violence." Kai asks where to find Aoki and Jin tells him, setting the user off on his quest.

"Not only will AI contribute to the rednering and the synthesis of the environment, AI will also animate the characters," Huang said. "AI will be a very big part of the future of video games."

Nvidia ACE for Games will offer high-speed access to three different components that already exist. The first, Nvidia NeMo, is an AI framework for training and deploying LLMs and it includes NeMo Guardrails, which is designed to prevent inappropriate / "unsafe" AI conversations. Presumably, this would stop NPCs from answering inappropriate, off-topic prompts from users. Guardrails also has security which should prevent users or would-be prompt injectors from "jailbreaking" the bots and getting them to do bad things.

Nvidia Riva is the company's speech-to-text / text-to-speech solution. In the ACE for games workflow, a gamer will ask a question via their microphone and Riva will convert it to text which is fed to the LLM. The LLM will then generate a text response which Riva turns back into speech that the user will hear. Of course, we'd expect games to also show the responses in text.  You can try Nvidia Riva's speech-to-text and text-to-speech capabilities yourself on the company's site.

Nvidia Omniverse Audio2Face provides the last step in the ACE for games workflow as it allows the characters to have facial expressions that match what they're saying. The company currently offers this product in beta and you can try it here

The demo, which is called Kairos, was designed by Convai, an AI-in-gaming startup that's part of Nvidia's Inception program that connects up-and-coming companies with venture capital. On the company's site, it offers a toolset that allows game developers to build lifelike NPCs with complex backstories. 

The company has a great explainer video about how its tools work and what they can do. In the video, you can see players talking to NPCs and asking them to do things that involve actual objects and other characters in the game.

For example, in the video, a player asks an NPC to hand him a gun that's sitting on a table and the NPC complies. In another part of the video, the player asks a soldier NPC to shoot at a target that's located in a particular place. We also see how Convai's tools make this all possible.

Having that added context so that the NPC is aware of what's going on in-game is so important. Recently, we tested a Minecraft AI plugin that allows you to talk to NPCs in that game, but the NPCs have no situational awareness at all. We were able to continue a conversation with a sheep after we had killed it (and it didn't know it was dead), for example.

Avram Piltch
Avram Piltch is Tom's Hardware's editor-in-chief. When he's not playing with the latest gadgets at work or putting on VR helmets at trade shows, you'll find him rooting his phone, taking apart his PC or coding plugins. With his technical knowledge and passion for testing, Avram developed many real-world benchmarks, including our laptop battery test.
  • PlaneInTheSky
    I have seen kitchen robots with more life in them.
    Reply
  • InvalidError
    NeMo Guardrails... I predict we'll see a bunch of videos about creative ways to bypass attempts at preventing inappropriate discussions.
    Reply
  • USAFRet
    In game, why would I want to talk to a software routine?
    Reply
  • bit_user
    If their LLM is any good, I'd expect it's going to chew up a lot of VRAM. That's not good, when people are already complaining about insufficient VRAM in recent graphics card models.

    The situational awareness is another aspect that really intrigues me. I'm guessing it's quite restricted in what sorts of things the NPC is aware of.
    Reply
  • bit_user
    USAFRet said:
    In game, why would I want to talk to a software routine?
    Usually, you talk to NPCs to get information needed to play the game (e.g. find items, reveal story line, complete quests, etc.). AI could enable more free-form queries, rather than being limited to multiple-choice (which is sort of "cheating", in a way) or having to phrase things very particularly.

    Also, the speech-to-text aspect is nice, since gaming with a headset means you can just speak it at them.
    Reply
  • hotaru251
    bit_user said:
    . That's not good, when people are already complaining about insufficient VRAM in recent models.
    "just buy a 4090" - jensen
    Reply
  • InvalidError
    bit_user said:
    If their LLM is any good, I'd expect it's going to chew up a lot of VRAM. That's not good, when people are already complaining about insufficient VRAM in recent models.
    The way Nvidia pitches it as "offering high-speed access to three components that already exists" makes me think this may be an AI-as-a-service thing rather than something intended to run locally.

    If you want NPCs to feel relatively unique, you need to have dozens of AI model so you don't feel like you are running into the same AI every 3rd NPC you encounter. If each "personality" has a 8GB model like the smallest portable GPT variant, your games would end up having 100+GB of AI models if this stuff ran locally.
    Reply
  • bit_user
    InvalidError said:
    The way Nvidia pitches it as "offering high-speed access to three components that already exists" makes me think this may be an AI-as-a-service thing rather than something intended to run locally.
    Yeah, that thought crossed my mind, but I thought maybe not, considering the scalability challenges of hosting server resources for it + the online dependency. I guess I'm old-school in my thinking about the latter, but still... the costs of centralized processing would seem to be an obstacle.

    InvalidError said:
    If you want NPCs to feel relatively unique, you need to have dozens of AI model so you don't feel like you are running into the same AI every 3rd NPC you encounter.
    I'd imagine that can be controlled via implicit prompting. Since the vast majority of the LLM would be the same, between different NPCs, you'd rather have a single instance of the model loaded and just prompt it to behave differently.
    Reply
  • s997863
    If the AI learns from scouring the internet:
    PC: "ok, I'll go and see this drug lord"
    NPC: "wait, PC. You must see the terminal by door for a better ship before you go. Try something at StarCitizen.Nvidia.DLC. Now is not the time to be cheap! You can get 1950coins for discount in the next 5 minutes! I used to be a detective like you, but then I took a bullet to the knee!"
    Reply
  • baboma
    One of my main pet peeves when playing a game with a storyline is that it's mostly canned, including all of the convos. It's not interactive.

    The Nvidia ACE (Avatar Cloud Engine) perked my interest in that respect. But watching the demo, it seems that it would be very expensive (in HW resources) to implement, and not much improvement over the canned stuff.

    The main problem is that it still looks like a mannequin piping out the audio. The delivery is very mechanical, with no inflection. It's not enough that the mouth/face move in sync with the words. Normal people gesticulates when they talk; they emote. This doesn't feel real; I don't get any extra immersion from it.

    But the extra "conversational personality" for NPCs (using LLM) would be a win in making the storyline less canned, less on-rail. I think we can get that just with a text interface, like with ChatGPT, or perhaps with speech recognition so you can talk to NPC w/o typing.

    BTW, for those who prefer text rather than trudge thru the keynote video, you can read all the doodads on the Nvidia blog.

    https://blogs.nvidia.com/blog/2023/05/28/computex-keynote-generative-ai/
    Reply