Microsoft researchers build 1-bit AI LLM with 2B parameters — model small enough to run on some CPUs

(Image: data being transferred. Credit: Getty Images)

Microsoft researchers have created BitNet b1.58 2B4T, an open-source 1-bit large language model (LLM) with two billion parameters trained on four trillion tokens. What makes this model unique is that it is lightweight enough to run efficiently on a CPU; TechCrunch reports that an Apple M2 chip can run it. The model is also freely available on Hugging Face, so anyone can experiment with it.

BitNet models use ternary weights with only three possible values: -1, 0, and +1. Strictly speaking, that makes it a "1.58-bit" model, since storing one of three states requires log2(3) ≈ 1.58 bits of information. This saves a great deal of memory compared with mainstream AI models that store weights in 32-bit or 16-bit floating-point formats, letting the model run with far less memory and computational power. BitNet's simplicity has one drawback, though: it is less accurate than larger AI models. BitNet b1.58 2B4T compensates with its massive training data, estimated at more than 33 million books' worth of text.
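To make the idea concrete, here is a minimal NumPy sketch of "absmean" ternary quantization in the spirit of the BitNet b1.58 paper: scale a weight tensor by its mean absolute value, then round each weight to -1, 0, or +1. This is an illustration of the technique, not the model's actual training code, and the function name is my own.

```python
import numpy as np

def quantize_ternary(w: np.ndarray):
    """Round weights to {-1, 0, +1} after scaling by the mean absolute
    value (an "absmean"-style scheme); returns the ternary matrix and
    the per-tensor scale needed to approximately reconstruct w."""
    scale = np.mean(np.abs(w)) + 1e-8            # avoid division by zero
    w_ternary = np.clip(np.round(w / scale), -1, 1)
    return w_ternary.astype(np.int8), scale

# Three states per weight carry log2(3) bits of information,
# hence the "1.58-bit" label.
bits_per_weight = np.log2(3)                     # ~1.58

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
wq, s = quantize_ternary(w)
print(wq)            # every entry is -1, 0, or +1
print(bits_per_weight)
```

A full-precision FP32 weight takes 32 bits, so even a naive 2-bit encoding of these three states cuts weight storage by 16x; packed encodings get closer to the 1.58-bit ideal.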

The team behind this lightweight model compared it against leading mainstream models, including Meta's LLaMa 3.2 1B, Google's Gemma 3 1B, and Alibaba's Qwen 2.5 1.5B. BitNet b1.58 2B4T scored relatively well against these models in most tests, and even took top honors in a few benchmarks. More importantly, it consumed only 400MB of non-embedding memory, less than 30% of the 1.4 GB used by the next most frugal model, Gemma 3 1B.

| Benchmark | BitNet b1.58 2B | LLaMa 3.2 1B | Gemma 3 1B | Qwen 2.5 1.5B |
|---|---|---|---|---|
| Non-embedding memory usage | 0.4 GB | 2 GB | 1.4 GB | 2.6 GB |
| Latency (CPU decoding) | 29 ms | 48 ms | 41 ms | 65 ms |
| Training tokens | 4 trillion | 9 trillion | 2 trillion | 18 trillion |
| ARC-Challenge | 49.91 | 37.80 | 38.40 | 46.67 |
| ARC-Easy | 74.79 | 63.17 | 63.13 | 76.01 |
| OpenbookQA | 41.60 | 34.80 | 38.80 | 40.80 |
| BoolQ | 80.18 | 64.65 | 74.22 | 78.04 |
| HellaSwag | 68.44 | 60.80 | 57.69 | 68.28 |
| PIQA | 77.09 | 74.21 | 71.93 | 76.12 |
| WinoGrande | 71.90 | 59.51 | 58.48 | 62.83 |
| CommonsenseQA | 71.58 | 58.48 | 42.10 | 76.41 |
| TruthfulQA | 45.31 | 43.80 | 38.66 | 46.67 |
| TriviaQA | 33.57 | 37.60 | 23.49 | 38.37 |
| MMLU | 53.17 | 45.58 | 39.91 | 60.25 |
| HumanEval+ | 38.40 | 31.10 | 37.20 | 50.60 |
| GSM8K | 58.38 | 38.21 | 31.16 | 56.79 |
| MATH-500 | 43.40 | 23.00 | 42.00 | 53.00 |
| IFEval | 53.48 | 62.71 | 66.67 | 50.12 |
| MT-bench | 5.85 | 5.43 | 6.40 | 6.12 |
| Average | 54.19 | 44.90 | 43.74 | 55.23 |

However, the model must run on the bitnet.cpp inference framework to achieve this efficiency. The team specifically notes that the model will not show these performance and efficiency gains "when using it with the standard transformers library, even with the required fork."

To take advantage of those benefits on lightweight hardware, you will need to grab the framework from GitHub. The repository describes bitnet.cpp as "a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU (with NPU and GPU support coming next)." While it doesn't yet target AI-specific hardware, it still lets anyone with an ordinary computer experiment with AI without requiring expensive components.
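The core trick such kernels exploit is that multiplying by a ternary weight never needs an actual multiplication: a +1 weight adds the activation, a -1 weight subtracts it, and a 0 weight is skipped. The NumPy sketch below illustrates the idea; it is not bitnet.cpp's actual kernel code, and the function name is my own.

```python
import numpy as np

def ternary_matvec(w_ternary: np.ndarray, scale: float, x: np.ndarray) -> np.ndarray:
    """Apply a ternary weight matrix to an activation vector using only
    additions and subtractions, then rescale by the per-tensor scale."""
    pos = (w_ternary == 1)     # weights that add their activation
    neg = (w_ternary == -1)    # weights that subtract their activation
    out = np.where(pos, x, 0.0).sum(axis=1) - np.where(neg, x, 0.0).sum(axis=1)
    return scale * out

w = np.array([[1, 0, -1],
              [0, 1, 1]], dtype=np.int8)
x = np.array([2.0, 3.0, 4.0])
print(ternary_matvec(w, 1.0, x))   # matches w.astype(float) @ x, i.e. [-2. 7.]
```

Real implementations go further by packing several ternary weights into each byte and using SIMD instructions, which is where the memory and latency wins in the table above come from.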

AI models are often criticized for taking too much energy to train and operate. But lightweight LLMs, such as BitNet b1.58 2B4T, could help us run AI models locally on less powerful hardware. This could reduce our dependence on massive data centers and even let people without the latest NPU-equipped processors or the most powerful GPUs use artificial intelligence.


Jowi Morales
Contributing Writer

Jowi Morales is a tech enthusiast with years of experience working in the industry. He has been writing for several tech publications since 2021, focusing on tech hardware and consumer electronics.

  • Giroro
    So, like, CPUs can access way more memory than a GPU.
    Or in the case of the mentioned Apple M2, the memory is shared.

    So I'm not sure what they're trying to brag about.
    Reply
  • ezst036
    Microsoft? Light weight? Could they port some of these enhancements over to Windows?
    Reply
  • hotaru251
    that it’s lightweight enough to work efficiently on a CPU, with TechCrunch saying an Apple M2 chip can run it.

    ....that isnt saying much given M2 has unified memory.
    Reply
  • coolitic
    Bitnets use 1-bit weights with only three possible values: -1, 0, and +1
    That's at least 2 bits, not 1.
    Reply
  • AngelusF
    coolitic said:
    That's at least 2 bits, not 1.
    Well, the article then goes on to describe it as 1.58 bits, which is more accurate. Although I've also heard that described as ternary.
    Reply
  • jp7189
    The race to the bottom is finished... time to turn around and start heading back up. Though I think we'll need a paradigm shift to improve efficiency at higher precision. I still believe that eventually we'll get there.
    Reply
  • usertests
    jp7189 said:
    The race to the bottom is finished... time to turn around and start heading back up. Though I think we'll need a paradigm shift to improve efficiency at higher precision. I still believe that eventually we'll get there.
    If it works, who cares? But I say that more about the precision than the parameter count. Parameter counts being optimized downward has been an important development, but at some point you have to increase it, and you mostly need more memory which desktop CPUs/APUs should be able to have easily.

    We can make LLMs run on 8 GB phones, but it would be better to have 24-32 GB of memory. I believe we can reach a point where budget smartphones can have 32-64 GB of RAM, if Samsung/Micron/SK Hynix successfully develop 3D DRAM (like 3D NAND) in the early-mid 2030s, and that drives cost-per-bit down by an order of magnitude.

    As far as improving efficiency at higher precision goes, I think AMD's adoption of "Block FP16" in the XDNA2 NPU could qualify for that. They promoted it as having the accuracy of 16-bit with the speed of INT8. There are only so many mathematical tricks you could pull to double performance though, and maybe we're already stuck with a choice of BFP16, INT8, INT4, FP4, INT2, INT1.58, etc.
    Reply
  • jp7189
    usertests said:
    If it works, who cares? But I say that more about the precision than the parameter count. Parameter counts being optimized downward has been an important development, but at some point you have to increase it, and you mostly need more memory which desktop CPUs/APUs should be able to have easily.

    We can make LLMs run on 8 GB phones, but it would be better to have 24-32 GB of memory. I believe we can reach a point where budget smartphones can have 32-64 GB of RAM, if Samsung/Micron/SK Hynix successfully develop 3D DRAM (like 3D NAND) in the early-mid 2030s, and that drives cost-per-bit down by an order of magnitude.

    As far as improving efficiency at higher precision goes, I think AMD's adoption of "Block FP16" in the XDNA2 NPU could qualify for that. They promoted it as having the accuracy of 16-bit with the speed of INT8. There are only so many mathematical tricks you could pull to double performance though, and maybe we're already stuck with a choice of BFP16, INT8, INT4, FP4, INT2, INT1.58, etc.
    I don't think GPU or even NPU will scale well enough to get beyond small incremental improvements, and as long as that's the paradigm we'll keep working at the bottom: lower precision, lower parameters, better distillation techniques.

    I'm thinking something entirely new will be needed to make something significantly more useful than what we have today. I have no idea what that will look like. A new architecture? New math? Cheap quantum endpoints? Who knows.
    Reply
  • usertests
    jp7189 said:
    I'm thinking something entirely new will be needed to make something significantly more useful than what we have today. I have no idea what that will look like. A new architecture? New math? Cheap quantum endpoints? Who knows.
    We're on an "s-curve" desperately searching for the next s-curve. This research could be applicable, if low-precision math can be used for e.g. a spiking neuron model. On the hardware side maybe we'll see 3D neuromorphic chips developed to accelerate it. I'm predicting that an ever-closer "brain imitation" will be the next big thing, and planar chips are insufficient. The current NPUs could become a thing of the past or repurposed for non-AI matrix operations.

    I don't think we've seen much adoption of "in-memory computing" yet.
    Reply
  • Rob1C
    Just when you think they couldn't get decent results below FP4 this comes out, and it's been a thing for a year or more:

    https://github.com/ggml-org/llama.cpp/pull/8151
    https://blog.nolano.ai/Spectra-suite/
    https://arxiv.org/abs/2403.01241
    Reply