AI language model runs on a Windows 98 system with Pentium II and 128MB of RAM — Open-source AI flagbearers demonstrate Llama 2 LLM in extreme conditions

Running Llama LLM on a Pentium II
(Image credit: Alex Cheema on GitHub)

EXO Labs has penned a detailed blog post about running Llama on Windows 98 and demonstrated a rather powerful AI large language model (LLM) running on a 26-year-old Windows 98 Pentium II PC in a brief video on social media. The video shows an ancient Elonex Pentium II @ 350 MHz booting into Windows 98; EXO then fires up its custom pure C inference engine based on Andrej Karpathy's llama2.c and asks the LLM to generate a story about Sleepy Joe. Amazingly, it works, with the story being generated at a very respectable pace.

The above eye-opening feat is nowhere near the end game for EXO Labs. This somewhat mysterious organization came out of stealth in September with a mission "to democratize access to AI," and was formed by a team of researchers and engineers from Oxford University. Briefly, EXO sees a handful of megacorps controlling AI as a very bad thing for culture, truth, and other fundamental aspects of our society. Thus EXO hopes to "Build open infrastructure to train frontier models and enable any human to run them anywhere." In this way, ordinary folk can hope to train and run AI models on almost any device – and this crazy Windows 98 AI feat is a totemic demo of what can be done with (severely) limited resources.

Since the Tweet video is rather brief, we were thankful to find EXO's blog post about Running Llama on Windows 98. This post is published as Day 4 of "the 12 days of EXO" series (so stay tuned).

(Image credit: Alex Cheema on GitHub)

As readers might expect, it was trivial for EXO to pick up an old Windows 98 PC from eBay as the foundation of this project, but there were many hurdles to overcome. EXO explains that getting data onto the old Elonex-branded Pentium II was a challenge, forcing the team to fall back on "good old FTP" for file transfers over the ancient machine's Ethernet port.

Compiling modern code for Windows 98 was probably a greater challenge. EXO was glad to find Andrej Karpathy's llama2.c, which can be summarized as "700 lines of pure C that can run inference on models with Llama 2 architecture." With this resource, the old Borland C++ 5.02 IDE and compiler, and a few minor tweaks, the code could be built into a Windows 98-compatible executable and run. Here's a GitHub link to the finished code.
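For a flavor of what "pure C" inference involves, consider the sketch below: a simplified matrix-vector multiply of the sort llama2.c builds its forward pass from, written in C89-friendly style (declarations at the top of each block) that a 1997-era compiler such as Borland C++ 5.02 can still digest. It is an illustration in the spirit of llama2.c, not EXO's actual patched source.

void matmul(float *xout, const float *x, const float *w, int n, int d)
{
    /* Simplified sketch of the kind of kernel llama2.c leans on:
       a plain matrix-vector multiply over float arrays. Written in
       C89-friendly style so a 1990s compiler can build it; this is
       an illustration, not EXO's patched source. */
    int i, j;
    for (i = 0; i < d; i++) {            /* one output value per row of W */
        float val = 0.0f;
        for (j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];  /* W (d x n) @ x (n,) -> xout (d,) */
        }
        xout[i] = val;
    }
}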

One of the fine folks behind EXO, Alex Cheema, made a point of thanking Andrej Karpathy for his code, marveling at how it delivered "35.9 tok/sec on Windows 98" using a 260K-parameter LLM with the Llama architecture. It is probably worth highlighting that Karpathy was previously a director of AI at Tesla and was on the founding team at OpenAI.

Of course, a 260K LLM is on the small side, but this ran at a decent pace on an ancient 350 MHz single-core PC. According to the EXO blog, moving up to a 15M LLM resulted in a generation speed of a little over 1 tok/sec. Llama 3.2 1B, however, was glacially slow at 0.0093 tok/sec (nearly two minutes per token).

BitNet is the bigger plan

By now, you will be well aware that this story isn't just about getting an LLM to run on a Windows 98 machine. EXO rounds out its blog post by talking about the future, which it hopes will be democratized thanks to BitNet.

"BitNet is a transformer architecture that uses ternary weights," it explains. Importantly, using this architecture, a 7B parameter model only needs 1.38GB of storage. That may still make a 26-year-old Pentium II creak, but that's feather-light to modern hardware or even for decade-old devices.

EXO also highlights that BitNet is CPU-first – swerving expensive GPU requirements. Moreover, models of this type are claimed to be 50% more efficient than full-precision models, and a 100B parameter model can reportedly run on a single CPU at human reading speeds (about 5 to 7 tok/sec).
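Part of the reason a ternary model suits CPUs so well is that, with weights limited to -1, 0, and +1, the multiplications in each matrix-vector product collapse into additions, subtractions, and skips. The toy routine below illustrates the idea; it is our own sketch, not BitNet's actual kernel.

void ternary_matvec(float *out, const float *x, const signed char *w, int n, int d)
{
    /* Toy illustration of why ternary weights are CPU-friendly: every
       "multiply" becomes an add, a subtract, or a skip. Our own sketch,
       not BitNet's real kernel. */
    int i, j;
    for (i = 0; i < d; i++) {
        float acc = 0.0f;
        for (j = 0; j < n; j++) {
            signed char t = w[i * n + j];   /* each weight is -1, 0, or +1 */
            if (t == 1)       acc += x[j];
            else if (t == -1) acc -= x[j];
            /* t == 0 contributes nothing and is skipped */
        }
        out[i] = acc;
    }
}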

Before we go, please note that EXO is still looking for help. If you want to avoid a future where AI is locked into massive data centers owned by billionaires and megacorps, and you think you can contribute in some way, you can reach out.

For a more casual liaison with EXO Labs, the team hosts a retro channel on its Discord for discussing LLMs running on old hardware such as vintage Macs, Game Boys, Raspberry Pis, and more.

Mark Tyson
News Editor

Mark Tyson is a news editor at Tom's Hardware. He enjoys covering the full breadth of PC tech; from business and semiconductor design to products approaching the edge of reason.

  • acadia11
    Very cool! Something has to be done about the power consumption required to train today's models … I wonder how precise their approach happens to be?
    Reply
  • bit_user
    The article said:
    Of course, a 260K LLM is on the small side
    That's putting it mildly! If you look at the output it generated, it's like one step above complete gibberish!

    In my mind, this utterly fails as a PR stunt, because the result is both useless and unimpressive. They should've gone for the largest model the machine could handle, unless that would've barely moved the needle on output quality.

    The article said:
    Thus EXO hopes to "Build open infrastructure to train frontier models and enable any human to run them anywhere." In this way, ordinary folk can hope to train and run AI models on almost any device
    Did he say the "any device" part? Because no - you need a lot of data, a large cluster, and lots of communication bandwidth for training. This isn't going to happen on "any device".

    The way I read the "open infrastructure" comment is just that you would have a training infrastructure you could more easily port from one hardware platform to another, not that it's going to vastly lower compute requirements compared with existing training solutions.
    Reply
  • evdjj3j
    I wonder how much upgrading to a P3, which would add SSE, would speed things up?
    Reply
  • bit_user
    evdjj3j said:
    I wonder how much upgrading to a P3, which would add SSE, would speed things up?
    The original SSE only added 4x fp32 operations. So, if they're using a lot of 8-bit integer arithmetic or similar, then you might not benefit from going beyond MMX, until you get to SSE2.
    Reply
  • Notton
    All this proved to me was that an LED light bulb seller can slap on the AI moniker so long as it has a processor and some storage.
    Reply
  • awake283
    This is just flat out cool.
    Reply
  • Mindstab Thrull
    Suddenly I wonder how much AI LLM you could get running on Knight Rider's KITT... from the 80s.
    Reply
  • awake283
    How about
    Mindstab Thrull said:
    Suddenly I wonder how much AI LLM you could get running on Knight Rider's KITT... from the 80s.
    Or this? https://s.yimg.com/ny/api/res/1.2/Md7ZN7FRIBnCpeTV8w5mhQ--/YXBwaWQ9aGlnaGxhbmRlcjt3PTk2MDtoPTYzMQ--/https://o.aolcdn.com/hss/storage/midas/271bd57a05b524f513e91b913675dcb5/202807395/wargames_still8.jpg
    Reply