Nvidia and Mistral AI's super-accurate small language model works on laptops and PCs
The model uses two highly effective optimizations to reduce the parameter count from 12 billion to 8 billion.
Nvidia and Mistral AI have released a new small language model that purportedly delivers "state-of-the-art" accuracy in a compact footprint. The new model, Mistral-NeMo-Minitron 8B, is a miniaturized version of Mistral NeMo 12B that has been pruned from 12 billion down to 8 billion parameters.
The new 8-billion-parameter small language model was shrunk using two AI optimization methods, said Bryan Catanzaro, VP of applied deep learning research at Nvidia, in a blog post. The team combined pruning with distillation: "Pruning downsizes a neural network by removing model weights that contribute the least to accuracy. During distillation, the team retrained this pruned model on a small dataset to significantly boost accuracy, which had decreased through the pruning process."
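In practice, a prune-then-distill pipeline looks roughly like the sketch below. This is a generic illustration, assuming PyTorch, using simple unstructured magnitude pruning and a standard temperature-scaled distillation loss; it is not Nvidia's actual Minitron recipe, and the models and training loop are placeholders.

```python
# Generic prune-then-distill sketch (illustrative only, assumes PyTorch).
import torch
import torch.nn.functional as F

def magnitude_prune(model: torch.nn.Module, fraction: float = 0.3) -> None:
    """Zero out the smallest-magnitude weights in each linear layer."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            k = int(w.numel() * fraction)
            if k == 0:
                continue
            # Threshold below which weights are considered low-contribution.
            threshold = w.abs().flatten().kthvalue(k).values
            w[w.abs() <= threshold] = 0.0

def distill_step(student, teacher, batch, optimizer, T: float = 2.0) -> float:
    """One distillation step: train the pruned student to match the
    teacher's softened output distribution (both assumed to return logits)."""
    with torch.no_grad():
        teacher_logits = teacher(batch)
    student_logits = student(batch)
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # conventional temperature scaling of the KD loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this scheme, the original 12B model would play the teacher role and the pruned model the student, with the distillation pass recovering the accuracy that pruning removed.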
These optimizations let the developers train the optimized language model on a "fraction of the original dataset," yielding up to 40x savings in raw compute costs. AI models normally have to trade off size against accuracy, but with the pruning and distillation techniques Nvidia and Mistral AI used, language models can get the best of both worlds.
Armed with these enhancements, Mistral-NeMo-Minitron 8B purportedly leads nine language-driven AI benchmarks among models of a similar size. The reduced compute requirements are modest enough for laptops and workstation PCs to run Minitron 8B locally, making it faster and more secure to operate than cloud-based services.
Nvidia has designed Minitron 8B around consumer-grade computer hardware. The model is packaged as an Nvidia NIM microservice and is optimized for low latency, which improves response times. Nvidia also offers its custom model service, AI Foundry, to pare Minitron 8B down further for even less powerful systems, such as smartphones. Accuracy and performance won't be as good, but Nvidia claims the result would still be a high-accuracy model requiring a fraction of the training data and compute infrastructure it would otherwise need.
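For a sense of what the NIM packaging means in practice: NIM microservices generally expose an OpenAI-compatible API, so a locally deployed instance could be queried with a sketch like the one below. The endpoint URL and model identifier here are illustrative assumptions, not details confirmed by Nvidia.

```python
# Minimal sketch of querying a locally deployed NIM microservice.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local NIM endpoint
    api_key="not-used-locally",           # placeholder; a local NIM may ignore it
)

response = client.chat.completions.create(
    model="nvidia/mistral-nemo-minitron-8b-instruct",  # assumed model id
    messages=[
        {"role": "user", "content": "Explain pruning vs. distillation briefly."}
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```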
Pruning and distillation appear to be the next frontier for AI performance optimization. Theoretically, there's nothing preventing developers from applying these techniques to any current language model, which could significantly boost efficiency across the board, including for large language models that can currently only be powered by AI-accelerated server farms.
Aaron Klotz is a contributing writer for Tom’s Hardware, covering news related to computer hardware such as CPUs and graphics cards.
Tom_Neverwinter I mean, Llama 3, 3.1, and most 8B models will run on a CPU with zero issues. I am literally running these on an Orange Pi 5 Plus for fun. If they get model switching working so it can load Whisper, then unload it, or be more efficient, I can then load an 8B model, process the data, unload the model, then load an XTTS model, output voice to the user, and repeat, all in 8GB. My Orange Pi 5 Plus has 16GB of RAM, so I don't need to offload Whisper, the model, or XTTS, but the CPU bottleneck, even at 6 TOPS, is painfully slow at this time.
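A rough sketch of the load/process/unload loop the commenter describes, assuming the openai-whisper, Hugging Face transformers, and Coqui TTS packages. Model names and file paths are placeholders, real memory behavior depends on the runtime, and this is not a tested Orange Pi setup.

```python
# Sequential load/process/unload pipeline: ASR -> LLM -> TTS in limited RAM.
import gc
import torch

def voice_round_trip(audio_path: str) -> None:
    # 1) Load Whisper, transcribe the user's audio, then free the model.
    import whisper
    asr = whisper.load_model("small")
    text = asr.transcribe(audio_path)["text"]
    del asr
    gc.collect()

    # 2) Load an 8B LLM, generate a reply, then free it.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    name = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder model id
    tok = AutoTokenizer.from_pretrained(name)
    llm = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)
    inputs = tok(text, return_tensors="pt")
    reply = tok.decode(llm.generate(**inputs, max_new_tokens=64)[0],
                       skip_special_tokens=True)
    del llm, tok
    gc.collect()

    # 3) Load an XTTS model, synthesize the reply as audio, then free it.
    from TTS.api import TTS
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(text=reply, file_path="reply.wav",
                    speaker_wav="ref_voice.wav", language="en")
    del tts
    gc.collect()
```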