Microsoft researchers build 1-bit AI LLM with 2B parameters — model small enough to run on some CPUs
This is about as lightweight as it gets.

Microsoft researchers just created BitNet b1.58 2B4T, an open-source 1-bit large language model (LLM) with two billion parameters trained on four trillion tokens. But what makes this AI model unique is that it’s lightweight enough to work efficiently on a CPU, with TechCrunch saying an Apple M2 chip can run it. The model is also readily available on Hugging Face, allowing anyone to experiment with it.
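For anyone who wants to poke at it right away, below is a hedged quick-start sketch using Hugging Face's transformers library. The model ID is an assumption based on the release name, and the required transformers version may differ, so check the official model card; as the article notes further down, this route runs the model but does not deliver the bitnet.cpp efficiency gains.

```python
# Hypothetical quick-start via Hugging Face transformers (for experimentation only).
# The model ID below is an assumption; check the official model card for the exact
# repo name and the transformers version it requires. This path does NOT give the
# CPU efficiency gains of bitnet.cpp.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/bitnet-b1.58-2B-4T"  # assumed repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Explain what a 1-bit LLM is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```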
Bitnets use 1-bit weights with only three possible values: -1, 0, and +1. Because storing one of three states takes log2(3) ≈ 1.58 bits of information, it's technically a "1.58-bit model." This saves a lot of memory compared to mainstream AI models that store weights in 32-bit or 16-bit floating-point formats, allowing bitnets to run much more efficiently and with far less memory and computational power. That simplicity has one drawback, though: a bitnet is less accurate than larger, full-precision AI models. However, BitNet b1.58 2B4T makes up for this with its massive training data, which is estimated to be equivalent to more than 33 million books.
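To illustrate the idea, here is a minimal sketch of ternary weight quantization in the spirit of the absmean scheme described in the BitNet b1.58 paper. The function names and the simple per-tensor scaling are illustrative assumptions, not Microsoft's actual implementation.

```python
# Illustrative sketch of ternary ("1.58-bit") weight quantization.
# Assumes a simple per-tensor absmean scale, a simplification of the approach
# described in the BitNet b1.58 paper, not Microsoft's exact code.
import numpy as np

def quantize_ternary(weights: np.ndarray, eps: float = 1e-6):
    """Map full-precision weights to {-1, 0, +1} plus a single scale factor."""
    scale = np.mean(np.abs(weights)) + eps              # absmean scale
    ternary = np.clip(np.round(weights / scale), -1, 1).astype(np.int8)
    return ternary, scale

def dequantize(ternary: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction: each weight becomes scale * {-1, 0, +1}."""
    return ternary.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, s = quantize_ternary(w)
print(q)                                    # entries drawn only from {-1, 0, 1}
print(np.abs(w - dequantize(q, s)).mean())  # mean quantization error
```

In practice the ternary weights are packed several to a byte, which is where the memory savings shown in the table below come from.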
The team behind this lightweight model compared it against leading mainstream models, including Meta's LLaMa 3.2 1B, Google's Gemma 3 1B, and Alibaba's Qwen 2.5 1.5B. BitNet b1.58 2B4T scored relatively well against these models in most tests, and even took top honors in a few benchmarks. More importantly, it consumed only 0.4 GB of non-embedding memory, less than 30% of the 1.4 GB used by the next most frugal model on the list, Gemma 3 1B (a quick back-of-the-envelope check of that figure follows the table).
Benchmark | BitNet b1.58 2B | LLaMa 3.2 1B | Gemma 3 1B | Qwen 2.5 1.5B |
---|---|---|---|---|
Non-embedding memory usage | 0.4 GB | 2 GB | 1.4 GB | 2.6 GB |
Latency (CPU Decoding) | 29ms | 48ms | 41ms | 65ms |
Training tokens | 4 trillion | 9 trillion | 2 trillion | 18 trillion |
ARC-Challenge | 49.91 | 37.80 | 38.40 | 46.67 |
ARC-Easy | 74.79 | 63.17 | 63.13 | 76.01 |
OpenbookQA | 41.60 | 34.80 | 38.80 | 40.80 |
BoolQ | 80.18 | 64.65 | 74.22 | 78.04 |
HellaSwag | 68.44 | 60.80 | 57.69 | 68.28 |
PIQA | 77.09 | 74.21 | 71.93 | 76.12 |
WinoGrande | 71.90 | 59.51 | 58.48 | 62.83 |
CommonsenseQA | 71.58 | 58.48 | 42.10 | 76.41 |
TruthfulQA | 45.31 | 43.80 | 38.66 | 46.67 |
TriviaQA | 33.57 | 37.60 | 23.49 | 38.37 |
MMLU | 53.17 | 45.58 | 39.91 | 60.25 |
HumanEval+ | 38.40 | 31.10 | 37.20 | 50.60 |
GSM8K | 58.38 | 38.21 | 31.16 | 56.79 |
MATH-500 | 43.40 | 23.00 | 42.00 | 53.00 |
IFEval | 53.48 | 62.71 | 66.67 | 50.12 |
MT-bench | 5.85 | 5.43 | 6.40 | 6.12 |
Average | 54.19 | 44.90 | 43.74 | 55.23 |
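That 0.4 GB figure squares with simple arithmetic on the parameter count. The sketch below is only a rough estimate: it ignores embeddings, activations, the KV cache, and packing overhead.

```python
# Rough estimate of weight storage for a 2B-parameter ternary model,
# ignoring embeddings, activations, KV cache, and packing overhead.
params = 2e9
bits_per_weight = 1.58                                            # ~log2(3) for ternary weights
print(f"ternary: {params * bits_per_weight / 8 / 1e9:.2f} GB")    # ~0.40 GB
print(f"bf16:    {params * 16 / 8 / 1e9:.2f} GB")                 # ~4.00 GB for comparison
```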
However, the LLM needs the bitnet.cpp inference framework to run this efficiently. The team specifically said that the model will not show these efficiency gains “when using it with the standard transformers library, even with the required fork.”
You will need to grab the framework from GitHub if you want to take advantage of its benefits on lightweight hardware. The repository describes bitnet.cpp as offering “a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU (with NPU and GPU support coming next).” While it doesn’t take advantage of AI-specific hardware at the moment, it still allows anyone with a computer to experiment with AI without requiring expensive components.
AI models are often criticized for consuming too much energy to train and operate. But lightweight LLMs such as BitNet b1.58 2B4T could let us run AI models locally on less powerful hardware. This could reduce our dependence on massive data centers and even let people without the latest NPU-equipped processors or the most powerful GPUs use artificial intelligence.

Jowi Morales is a tech enthusiast with years of experience working in the industry. He has been writing for several tech publications since 2021, with a particular interest in tech hardware and consumer electronics.
Giroro: So, like, CPUs can access way more memory than a GPU. Or in the case of the mentioned Apple M2, the memory is shared. So I'm not sure what they're trying to brag about.

hotaru251, quoting the article's line that it's "lightweight enough to work efficiently on a CPU, with TechCrunch saying an Apple M2 chip can run it": ...that isn't saying much given the M2 has unified memory.

coolitic, quoting "Bitnets use 1-bit weights with only three possible values: -1, 0, and +1": That's at least 2 bits, not 1.

AngelusF, replying to coolitic: Well, the article then goes on to describe it as 1.58 bits, which is more accurate. Although I've also heard that described as ternary.

jp7189: The race to the bottom is finished... time to turn around and start heading back up. Though I think we'll need a paradigm shift to improve efficiency at higher precision. I still believe that eventually we'll get there.

usertests, replying to jp7189: If it works, who cares? But I say that more about the precision than the parameter count. Parameter counts being optimized downward has been an important development, but at some point you have to increase it, and you mostly need more memory, which desktop CPUs/APUs should be able to have easily.
We can make LLMs run on 8 GB phones, but it would be better to have 24-32 GB of memory. I believe we can reach a point where budget smartphones can have 32-64 GB of RAM, if Samsung/Micron/SK Hynix successfully develop 3D DRAM (like 3D NAND) in the early-mid 2030s, and that drives cost-per-bit down by an order of magnitude.
As far as improving efficiency at higher precision goes, I think AMD's adoption of "Block FP16" in the XDNA2 NPU could qualify for that. They promoted it as having the accuracy of 16-bit with the speed of INT8. There are only so many mathematical tricks you could pull to double performance though, and maybe we're already stuck with a choice of BFP16, INT8, INT4, FP4, INT2, INT1.58, etc.

jp7189, replying to usertests: I don't think GPU or even NPU will scale well enough to get beyond small incremental improvements, and as long as that's the paradigm we'll keep working at the bottom: lower precision, lower parameters, better distillation techniques.
I'm thinking something entirely new will be needed to make something significantly more useful than what we have today. I have no idea what that will look like. A new architecture? New math? Cheap quantum endpoints? Who knows.

usertests, replying to jp7189: We're on an "s-curve" desperately searching for the next s-curve. This research could be applicable, if low-precision math can be used for e.g. a spiking neuron model. On the hardware side maybe we'll see 3D neuromorphic chips developed to accelerate it. I'm predicting that an ever-closer "brain imitation" will be the next big thing, and planar chips are insufficient. The current NPUs could become a thing of the past or repurposed for non-AI matrix operations.
I don't think we've seen much adoption of "in-memory computing" yet.

Rob1C: Just when you think they couldn't get decent results below FP4, this comes out, and it's been a thing for a year or more:
https://github.com/ggml-org/llama.cpp/pull/8151
https://blog.nolano.ai/Spectra-suite/
https://arxiv.org/abs/2403.01241