nopelynopington 2 days ago

I built it at home this morning and tried it. Perhaps my expectations were high, but I wasn't terribly impressed. I asked it for a list of ten types of data I might show on a home info display panel. It gave me three. I clarified that I wanted ten, and it gave me six. Every request after that just returned the same six things.

I know it's not ChatGPT-4, but I've tried other very small models that run on CPU only and had better results.

  • Me1000 2 days ago

    This is a technology demo, not a model you'd want to use. Because BitNet models average only 1.58 bits per weight, you'd expect to need a much larger parameter count than your fp8/fp16 counterparts. Plus this is only a 2-billion-parameter model in the first place, and even fp16 2B-parameter models generally perform pretty poorly.

    • nopelynopington 2 days ago

      OK, that's fair. I still think something was up with my build though; the online demo worked far better than my local build.

  • ashirviskas 2 days ago

    > I've tried other very small models that run on CPU only and had better results

    Maybe you can share some comparative examples?

    • nopelynopington 2 days ago

      sure, here's my conversation with BitNet b1.58 2B4T

      https://pastebin.com/ZZ1tADvp

      here's the same prompt given to smollm2:135m

      https://pastebin.com/SZCL5WkC

      The quality of the second results is not fantastic. The data isn't public, and it repeats itself, mentioning income a few times. I don't think I would use either of these models for accurate data, but I was surprised at the truncated results from BitNet.

      Smollm2:360M returned better-quality results with no repetition, but it did suggest things which didn't fit the brief exactly (public data given location only):

      https://pastebin.com/PRFqnqVF

      Edit:

      I tried the same query on the live demo site and got much better results. Maybe something went wrong on my end?

Havoc 2 days ago

Is there a reason why the 1.58-bit models are always quite small? I think I've seen an 8B one, but that's about it.

Is there a technical reason for it, or is it just research convenience?

  • londons_explore 2 days ago

    I suspect it's because current GPU hardware can't efficiently train such low-bit-depth models. You end up needing 8 or 16 bits for the activations in all the data paths, and you don't get any more throughput per cycle on the multiplications than you would have with FP32.

    Custom silicon would solve that, but nobody wants to build custom silicon for a data format that will go out of fashion before the production run is done.

    • zamadatix 2 days ago

      The custom CUDA kernel for 4-in-8 packing seems to have come out better than a naive approach (such as just treating each weight as an fp8/int8), and it lowers memory bandwidth. Custom hardware would certainly make that improvement even bigger, but I don't think that's what's limiting training to 2-8 billion parameters so much as research convenience while the groundwork for this type of model is still being figured out.

    • Havoc 2 days ago

      Makes sense. It might be good for memory-throughput-constrained devices though, so I'm hoping it'll pick up.

  • yieldcrv 2 days ago

    They aren't; there is a 1.58-bit version of DeepSeek that's around 200GB instead of 700GB.

    • logicchains 2 days ago

      That's not a real BitNet, it's just post-training quantisation, and its performance suffers compared to a model trained from scratch at 1.58 bits.

balazstorok 2 days ago

Does someone have a good understanding of how 2B models can be useful in production? What tasks are you using them for? I wonder what tasks you can fine-tune them on to produce 95-99% results (if anything).

  • nialse 2 days ago

    The use cases for small models include sentiment and intent analysis, spam and abuse detection, and classification of various sorts. Generally LLMs are thought of as chat models, but the output need not be a conversation per se.

    • mhitza 2 days ago

      My impression was that text embeddings are better suited for classification. Of course the big caveat is that the embeddings must have "internalized" the semantic concept you're trying to map.

      From an article I have in my drafts, experimenting with open-source text embeddings:

          ./match venture capital
          purchase           0.74005488647684
          sale               0.80926752301733
          place              0.81188663814236
          positive sentiment 0.90793311875207
          negative sentiment 0.91083707598925
          time               0.9108697315425
       
          ./store sillicon valley
          ./match venture capital
          sillicon valley    0.7245139487301
          purchase           0.74005488647684
          sale               0.80926752301733
          place              0.81188663814236
          positive sentiment 0.90793311875207
          negative sentiment 0.91083707598925
          time               0.9108697315425
      
      Of course you need to figure out what these black boxes understand. For example, for sentiment analysis, instead of having it match against "positive" and "negative" you might have the matching terms be "kawaii" and "student debt", depending on how the text embedding internalized negatives and positives from its training data.
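      The rough idea behind the ./store / ./match bits above is something like this (illustrative sketch; sentence-transformers and all-MiniLM-L6-v2 are stand-ins rather than necessarily what I used, and the numbers are cosine distances, lower = closer):

          import numpy as np
          from sentence_transformers import SentenceTransformer

          model = SentenceTransformer("all-MiniLM-L6-v2")

          def cosine_distance(a, b):
              # 0 = same direction, 2 = opposite direction
              return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

          # the "stored" terms to match against
          labels = ["purchase", "sale", "place", "positive sentiment",
                    "negative sentiment", "time", "sillicon valley"]
          label_vecs = model.encode(labels)

          query_vec = model.encode("venture capital")
          for label, vec in sorted(zip(labels, label_vecs),
                                   key=lambda lv: cosine_distance(query_vec, lv[1])):
              print(f"{label:<18} {cosine_distance(query_vec, vec):.4f}")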

  • snovv_crash 2 days ago

    Anything you'd normally train a smaller custom model for, but with an LLM you can use a prompt instead of training.
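    For example, roughly like this (sketch using llama-cpp-python; the model path and prompt are placeholders, and a BitNet GGUF specifically would go through bitnet.cpp rather than stock llama.cpp):

        from llama_cpp import Llama

        # Placeholder path: any small instruction-tuned GGUF model
        llm = Llama(model_path="./models/small-model.gguf", n_ctx=512, verbose=False)

        def classify_sentiment(review: str) -> str:
            prompt = (
                "Classify the sentiment of this review as exactly one word, "
                "positive or negative.\n"
                f"Review: {review}\n"
                "Sentiment:"
            )
            out = llm(prompt, max_tokens=2, temperature=0)
            return out["choices"][0]["text"].strip().lower()

        print(classify_sentiment("The battery died after two days."))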

  • meltyness 2 days ago

    I'm more interested in how users are taking 95-99% to 99.99% for generation-assisted tasks. I haven't seen a review or study of techniques, even though on the ground it's pretty trivial to think of some candidates.

    • oezi 2 days ago

      Three strategies seem to be:

      - Use an LLM to evaluate the result and retry if it doesn't match (a rough sketch of that loop is below).

      - Let users trigger a retry.

      - Let users edit.
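
      A minimal sketch of the first strategy (generate is a placeholder for whatever model you're actually calling, and the JSON check is just one example of a machine-checkable constraint):

          import json

          def generate(prompt: str) -> str:
              # Placeholder: swap in llama-cpp-python, ollama, an HTTP endpoint, ...
              return '{"summary": "example output"}'

          def validate(text: str) -> bool:
              # Could also be a second LLM call acting as a judge
              try:
                  return "summary" in json.loads(text)
              except json.JSONDecodeError:
                  return False

          def generate_checked(prompt: str, retries: int = 3) -> str | None:
              for _ in range(retries):
                  out = generate(prompt)
                  if validate(out):
                      return out
              return None  # give up: let the user trigger a retry or edit by hand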

  • Lapel2742 2 days ago

    I'm just playing/experimenting around with local LLMs, just to see what I can do with them. One thing that comes to mind is gaming, e.g. text/dialogue generation in procedural worlds/adventures.

  • logicchains 2 days ago

    2B models by themselves aren't so useful, but it's very interesting as a proof of concept, because the same technique used to train a 200B model could produce one that's much more efficient (cheaper and more environmentally friendly) than existing 200B models, especially with specialised hardware support.

  • throwaway314155 2 days ago

    Summarization on mobile/embedded might be a good use case?

akoboldfrying 2 days ago

They give some description of how their weights are stored: they pack 4 weights into an int8, indicating that their storage format isn't optimal (2 bits per weight instead of the optimal ~1.58 bits). But I don't know enough about LLM internals to know how material this is.

Could anyone break down the steps further?

  • Fubwubs 2 days ago

    This model maps weights to ternary values {-1, 0, 1} (aka trits). One trit holds log(3)/log(2) ≈ 1.58 bits of information. To represent a single trit by itself would require 2 bits, but it is possible to pack 5 trits into 8 bits. This article explains it well: https://compilade.net/blog/ternary-packing

    By using 4 ternary weights per 8 bits, the model is not quite as space-efficient as it could be in terms of information density: (4 × 1.58)/8 ≈ 0.79 vs (5 × 1.58)/8 ≈ 0.99. There is currently no hardware acceleration for operating on 5 trits packed into 8 bits, so the weights have to be packed and unpacked in software, and packing 5 weights into 8 bits requires slower, more complex packing/unpacking code.
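    The 4-per-byte layout itself is trivial, something like this (illustrative sketch; the actual kernel may order the bits differently):

        # Map {-1, 0, +1} -> {0, 1, 2} and store 4 weights per byte, 2 bits each
        def pack4(ws):
            assert len(ws) == 4 and all(w in (-1, 0, 1) for w in ws)
            b = 0
            for i, w in enumerate(ws):
                b |= (w + 1) << (2 * i)
            return b

        def unpack4(b):
            return [((b >> (2 * i)) & 0b11) - 1 for i in range(4)]

        assert unpack4(pack4([-1, 0, 1, 1])) == [-1, 0, 1, 1]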

    • akoboldfrying a day ago

      That link gives a great description of how to pack trits more efficiently, thanks. Encoding in "base 3" was obvious to me, but I didn't realise that 5 trits fit quite tightly into a byte, or that it's possible to "space the values apart" so that they can be extracted using just multiplications and bitwise ops (no division or remainder).
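      If I've understood it right, the trick looks roughly like this (my own sketch, not the post's actual code): store the 5-trit value as a rounded-up 8-bit fixed-point fraction, and each multiplication by 3 then pushes the next trit into the top bits.

          from itertools import product

          def pack5(trits):
              # trits: 5 values in {-1, 0, +1}, most significant first
              n = 0
              for t in trits:
                  n = n * 3 + (t + 1)       # base-3 integer in 0..242
              return -(-n * 256 // 243)     # ceil(n * 256 / 243), fits in one byte

          def unpack5(b):
              out = []
              for _ in range(5):
                  b *= 3
                  out.append((b >> 8) - 1)  # top bits hold the next trit
                  b &= 0xFF                 # keep the fractional part
              return out

          # check the round trip over all 3**5 = 243 combinations
          for ts in product((-1, 0, 1), repeat=5):
              assert unpack5(pack5(list(ts))) == list(ts)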

rcMgD2BwE72F 2 days ago

I asked about the last French election and the first sentence was:

>Marine Le Pen, a prominent figure in France, won the 2017 presidential election despite not championing neoliberalism. Several factors contributed to her success: (…)

What data did they train their model on?