Google, OpenAI, and Anthropic are competing to see whose AI can play Pokémon the best: Twitch streams of the beloved RPG test the models' true might

Game Boy Advance SP, Pokémon (Image credit: Getty / John Keeble)

While innumerable benchmarks and tests exist to gauge the capabilities of AI, one more obscure benchmark appears to be making waves in the AI community. According to a Wall Street Journal report, companies like Google, OpenAI, and Anthropic are now having their models play old-school Pokémon to evaluate performance.

"The thing that has made Pokémon fun and that has captured the [machine learning] community’s interest is that it’s a lot less constrained than Pong or some of the other games that people have historically done this on. It’s a pretty hard problem for a computer program to be able to do," Anthropic AI lead David Hershey told the outlet.

Visual explainer on how Claude plays Pokémon (Image credit: ClaudePlaysPokémon on Twitch)

It all started last year when Hershey put Claude, Anthropic's frontier LLM, on a Twitch stream dubbed "Claude Plays Pokémon." Hershey is the applied AI lead at Anthropic, meaning his job is to help customers deploy the AI, so the stream is another way for him to test the models. Claude's gaming efforts have inspired freelance developers to put up similar "Gemini Plays Pokémon" and "GPT Plays Pokémon" streams, too.

These projects have received official recognition from Google and OpenAI, whose labs have even stepped in to tweak the models at times. That support has helped both Gemini and GPT beat Pokémon Blue already, so they've moved on to the sequels, but no version of Claude has finished the game yet. The latest model, Opus 4.5, is currently tackling the challenge on stream.

Claude playing Pokémon on Twitch, with chat helping and cheering it on at the side (Image credit: ClaudePlaysPokémon on Twitch)

Hershey says that using Pokémon to test these AI models is quite beneficial as "it provides [us] with, like, this great way to just see how a model is doing and to evaluate it in a quantitative way." In the game, you have to capture new Pokémon, train and level up your existing roster, and beat gym leaders to earn badges. It's not a simple linear progression, but one that requires judgment.

You're often faced with a choice: take a risk, such as fighting a powerful trainer, or sharpen the skills of the Pokémon you already have. Humans excel at making decisions like these, and they're part of the fun. For AI, though, each one is a test of logical reasoning, risk assessment, and long-term thinking that affects overall progress. How a model chooses to play the game therefore helps researchers understand it better.

Hershey shares his findings with customers and uses them to improve the "harness" built around the AI for specific tasks. A harness is essentially the software framework that controls a model and directs its resources toward the requirements of a particular task. The lessons from his Pokémon streams carry over to real-world clients looking to improve their compute efficiency.
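To make the idea of a harness concrete, here is a minimal sketch in Python. It is not Anthropic's harness or any of the streams' actual code: `ToyGame`, `Harness`, `stub_model`, and the button list are all hypothetical stand-ins, with a one-dimensional corridor in place of an emulator and a fixed reply in place of an LLM call. It only illustrates the general loop of observing the game, prompting the model, validating the reply, and applying the action.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical button set; a real emulator interface would differ.
BUTTONS = ("up", "down", "left", "right", "a", "b")


@dataclass
class ToyGame:
    """Stand-in for an emulator: the player walks along a corridor."""
    position: int = 0
    goal: int = 3

    def observe(self) -> str:
        return f"position={self.position} goal={self.goal}"

    def press(self, button: str) -> None:
        if button == "right":
            self.position += 1
        elif button == "left" and self.position > 0:
            self.position -= 1

    def done(self) -> bool:
        return self.position >= self.goal


@dataclass
class Harness:
    """Feeds observations to a model, validates its reply, applies the action."""
    model: Callable[[str], str]
    game: ToyGame
    history: List[str] = field(default_factory=list)

    def step(self) -> None:
        observation = self.game.observe()
        # The prompt carries a short memory of recent actions plus the current state.
        prompt = f"recent={self.history[-5:]} {observation}"
        reply = self.model(prompt).strip().lower()
        # Guard against invalid output by falling back to a harmless button.
        button = reply if reply in BUTTONS else "b"
        self.game.press(button)
        self.history.append(button)

    def run(self, max_steps: int = 20) -> int:
        """Loop until the goal is reached or the step budget runs out."""
        steps = 0
        while not self.game.done() and steps < max_steps:
            self.step()
            steps += 1
        return steps


def stub_model(prompt: str) -> str:
    """Placeholder for an LLM call; always answers 'Right'."""
    return "Right"


harness = Harness(model=stub_model, game=ToyGame())
steps_taken = harness.run()
print(steps_taken, harness.game.done())
```

In this toy version, the harness decides how much history the model sees and what happens when its reply is not a valid button. Those are the kinds of choices that can change how far the same model gets.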

As Big Tech moves toward its goal of achieving AGI, inference will shift from producing simple answers to sustaining long-running, multi-step work, which a game like Pokémon is well suited to testing. To finish the game, you have to win the Pokémon League, and that requires several steps in a row, testing the AI's strategic planning and resource management. It also makes the performance easily quantifiable instead of being subjective.
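As a rough illustration of what "quantifiable" could mean here, the sketch below scores a run by the share of major milestones reached. The milestone list uses the eight Kanto gym badges plus the Pokémon League, but the scoring scheme and the `progress` function are assumptions made for illustration, not a metric any of the labs or streams has published.

```python
from typing import Iterable, List

# Eight gym badges from Pokémon Red/Blue, followed by the final goal.
MILESTONES: List[str] = [
    "Boulder Badge",
    "Cascade Badge",
    "Thunder Badge",
    "Rainbow Badge",
    "Soul Badge",
    "Marsh Badge",
    "Volcano Badge",
    "Earth Badge",
    "Pokémon League",
]


def progress(reached: Iterable[str]) -> float:
    """Return the fraction of known milestones a run has reached."""
    reached_set = set(reached)
    # Entries that are not recognized milestones are ignored.
    completed = sum(1 for milestone in MILESTONES if milestone in reached_set)
    return completed / len(MILESTONES)


# Hypothetical run that has earned the first three badges.
example_run = ["Boulder Badge", "Cascade Badge", "Thunder Badge"]
print(f"{progress(example_run):.0%} of milestones reached")
```

A single number like this hides how long each milestone took, so in practice it would likely be paired with step counts or play time.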

Previously, we covered another exercise in AI capabilities where a bunch of models were asked to build a clone of Minesweeper. OpenAI's Codex emerged as the winner there, with Google's Gemini failing to even produce a playable game. That was a much easier ask, so playing through even a retro RPG is certainly a step up in assessment criteria.


Hassam Nasir is a die-hard hardware enthusiast with years of experience as a tech editor and writer, focusing on detailed CPU comparisons and general hardware news. When he’s not working, you’ll find him bending tubes for his ever-evolving custom water-loop gaming rig or benchmarking the latest CPUs and GPUs just for fun.