nopelynopington 2 days ago

I built it at home this morning and tried it. Perhaps my expectations were high, but I wasn't terribly impressed. I asked it for a list of ten types of data I might show on a home info display panel. It gave me three. I clarified that I wanted ten, and it gave me six. Every request after that just returned the same six things.

I know it's not ChatGPT-4, but I've tried other very small models that run on CPU only and had better results.

  • Me1000 2 days ago

    This is a technology demo, not a model you'd want to use. Because BitNet models average only 1.58 bits per weight, you'd expect to need a much larger parameter count than your fp8/fp16 counterparts. Plus this is only a 2-billion-parameter model in the first place, and even fp16 2B-parameter models generally perform pretty poorly.

    • nopelynopington 2 days ago

      OK, that's fair. I still think something was up with my build though; the online demo worked far better than my local build.

  • ashirviskas 2 days ago

    > I've tried other very small models that run on CPU only and had better results

    Maybe you can share some comparative examples?

    • nopelynopington 2 days ago

      sure, here's my conversation with BitNet b1.58 2B4T

      https://pastebin.com/ZZ1tADvp

      here's the same prompt given to smollm2:135m

      https://pastebin.com/SZCL5WkC

      The quality of the second results is not fantastic. The data isn't public, and it repeats itself, mentioning income a few times. I don't think I would use either of these models for accurate data, but I was surprised at the truncated results from BitNet.

      Smollm2:360M returned better-quality results with no repetition, but it did suggest things which didn't fit the brief exactly (public data given location only):

      https://pastebin.com/PRFqnqVF

      Edit:

      I tried the same query on the live demo site and got much better results. Maybe something went wrong on my end?

Havoc 2 days ago

Is there a reason why the 1.58-bit models are always quite small? I think I've seen an 8B one, but that's about it.

Is there a technical reason for it, or is it just research convenience?

  • londons_explore 2 days ago

    I suspect it's because current GPU hardware can't efficiently train such low-bit-depth models. You end up needing 8 or 16 bits for the activations in all the data paths, and you don't get any more throughput per cycle on the multiplications than you would have with FP32.

    Custom silicon would solve that, but nobody wants to build custom silicon for a data format that will go out of fashion before the production run is done.

    • zamadatix 2 days ago

      The custom CUDA kernel for 4-in-8 packing seems to have come out better than a naive approach (such as just treating each weight as an fp8/int8), and it lowers memory bandwidth. Custom hardware would certainly make that improvement even bigger, but I don't think that's what's limiting training to 2-8 billion parameters so much as research convenience while the groundwork for this type of model is still being figured out.

    • Havoc 2 days ago

      Makes sense. It might be good for memory-throughput-constrained devices though, so I'm hoping it'll pick up.

  • yieldcrv 2 days ago

    They aren't; there is a 1.58-bit version of DeepSeek that's around 200GB instead of 700GB.

    • logicchains 2 days ago

      That's not a real BitNet, it's just post-training quantisation, and its performance suffers compared to a model trained from scratch at 1.58 bits.

balazstorok 2 days ago

Does someone have a good understanding of how 2B models can be useful in production? What tasks are you using them for? I wonder what tasks you can fine-tune them on to produce 95-99% results (if anything).

  • nialse 2 days ago

    The use cases for small models include sentiment and intent analysis, spam and abuse detection, and classification of various sorts. Generally LLMs are thought of as chat models, but the output need not be a conversation per se.

    • mhitza 2 days ago

      My impression was that text embeddings are better suited for classification. Of course the big caveat is that the embeddings must have "internalized" the semantic concept you're trying to map.

      From an article I have in my drafts, experimenting with open-source text embeddings:

          ./match venture capital
          purchase           0.74005488647684
          sale               0.80926752301733
          place              0.81188663814236
          positive sentiment 0.90793311875207
          negative sentiment 0.91083707598925
          time               0.9108697315425
       
          ./store sillicon valley
          ./match venture capital
          sillicon valley    0.7245139487301
          purchase           0.74005488647684
          sale               0.80926752301733
          place              0.81188663814236
          positive sentiment 0.90793311875207
          negative sentiment 0.91083707598925
          time               0.9108697315425
      
      Of course you need to figure out what these black boxes understand. For example, for sentiment analysis, instead of having it match against "positive" and "negative" you might have the matching terms be "kawaii" and "student debt", depending on how the text embedding internalized negatives and positives from its training data.
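      The rough idea behind the ./store / ./match bits above is something like this (illustrative sketch; sentence-transformers and all-MiniLM-L6-v2 are stand-ins rather than necessarily what I used, and the numbers are cosine distances, lower = closer):

          import numpy as np
          from sentence_transformers import SentenceTransformer

          model = SentenceTransformer("all-MiniLM-L6-v2")

          def cosine_distance(a, b):
              # 0 = same direction, 2 = opposite direction
              return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

          # the "stored" terms to match against
          labels = ["purchase", "sale", "place", "positive sentiment",
                    "negative sentiment", "time", "sillicon valley"]
          label_vecs = model.encode(labels)

          query_vec = model.encode("venture capital")
          for label, vec in sorted(zip(labels, label_vecs),
                                   key=lambda lv: cosine_distance(query_vec, lv[1])):
              print(f"{label:<18} {cosine_distance(query_vec, vec):.4f}")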

  • snovv_crash 2 days ago

    Anything you'd normally train a smaller custom model for, but with an LLM you can use a prompt instead of training.
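    For example, roughly like this (sketch using llama-cpp-python; the model path and prompt are placeholders, and a BitNet GGUF specifically would go through bitnet.cpp rather than stock llama.cpp):

        from llama_cpp import Llama

        # Placeholder path: any small instruction-tuned GGUF model
        llm = Llama(model_path="./models/small-model.gguf", n_ctx=512, verbose=False)

        def classify_sentiment(review: str) -> str:
            prompt = (
                "Classify the sentiment of this review as exactly one word, "
                "positive or negative.\n"
                f"Review: {review}\n"
                "Sentiment:"
            )
            out = llm(prompt, max_tokens=2, temperature=0)
            return out["choices"][0]["text"].strip().lower()

        print(classify_sentiment("The battery died after two days."))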

  • meltyness 2 days ago

    I'm more interested in how users are taking 95-99% to 99.99% for generation-assisted tasks. I haven't seen a review or study of techniques, even though on the ground it's pretty trivial to think of some candidates.

    • oezi 2 days ago

      Three strategies seem to be:

      - Use an LLM to evaluate the result and retry if it doesn't match (a rough sketch of that loop is below).

      - Let users trigger a retry.

      - Let users edit.
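
      A minimal sketch of the first strategy (generate is a placeholder for whatever model you're actually calling, and the JSON check is just one example of a machine-checkable constraint):

          import json

          def generate(prompt: str) -> str:
              # Placeholder: swap in llama-cpp-python, ollama, an HTTP endpoint, ...
              return '{"summary": "example output"}'

          def validate(text: str) -> bool:
              # Could also be a second LLM call acting as a judge
              try:
                  return "summary" in json.loads(text)
              except json.JSONDecodeError:
                  return False

          def generate_checked(prompt: str, retries: int = 3) -> str | None:
              for _ in range(retries):
                  out = generate(prompt)
                  if validate(out):
                      return out
              return None  # give up: let the user trigger a retry or edit by hand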

  • Lapel2742 2 days ago

    I'm just playing/experimenting around with local LLMs, just to see what I can do with them. One thing that comes to mind is gaming, e.g. text/dialogue generation in procedural worlds/adventures.

  • logicchains 2 days ago

    2B models by themselves aren't so useful, but it's very interesting as a proof of concept, because the same technique used to train a 200B model could produce one that's much more efficient (cheaper and more environmentally friendly) than existing 200B models, especially with specialised hardware support.

  • throwaway314155 2 days ago

    Summarization on mobile/embedded might be a good use case?

akoboldfrying 2 days ago

They give some description of how their weights are stored: they pack 4 weights into an int8, indicating that their storage format isn't optimal (2 bits per weight instead of the optimal ~1.58 bits). But I don't know enough about LLM internals to know how material this is.

Could anyone break down the steps further?

  • Fubwubs 2 days ago

    This model maps weights to ternary values {-1, 0, 1} (aka trits). One trit holds log(3)/log(2) ≈ 1.58 bits of information. To represent a single trit by itself would require 2 bits, but it is possible to pack 5 trits into 8 bits. This article explains it well: https://compilade.net/blog/ternary-packing

    By using 4 ternary weights per 8 bits, the model is not quite as space-efficient as it could be in terms of information density: (4 × 1.58)/8 ≈ 0.79 vs (5 × 1.58)/8 ≈ 0.99. There is currently no hardware acceleration for operating on 5 trits packed into 8 bits, so the weights have to be packed and unpacked in software, and packing 5 weights into 8 bits requires slower, more complex packing/unpacking code.
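    The 4-per-byte layout itself is trivial, something like this (illustrative sketch; the actual kernel may order the bits differently):

        # Map {-1, 0, +1} -> {0, 1, 2} and store 4 weights per byte, 2 bits each
        def pack4(ws):
            assert len(ws) == 4 and all(w in (-1, 0, 1) for w in ws)
            b = 0
            for i, w in enumerate(ws):
                b |= (w + 1) << (2 * i)
            return b

        def unpack4(b):
            return [((b >> (2 * i)) & 0b11) - 1 for i in range(4)]

        assert unpack4(pack4([-1, 0, 1, 1])) == [-1, 0, 1, 1]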

    • akoboldfrying a day ago

      That link gives a great description of how to pack trits more efficiently, thanks. Encoding in "base 3" was obvious to me, but I didn't realise that 5 trits fit quite tightly into a byte, or that it's possible to "space the values apart" so that they can be extracted using just multiplications and bitwise ops (no division or remainder).
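      If I've understood it right, the trick looks roughly like this (my own sketch, not the post's actual code): store the 5-trit value as a rounded-up 8-bit fixed-point fraction, and each multiplication by 3 then pushes the next trit into the top bits.

          from itertools import product

          def pack5(trits):
              # trits: 5 values in {-1, 0, +1}, most significant first
              n = 0
              for t in trits:
                  n = n * 3 + (t + 1)       # base-3 integer in 0..242
              return -(-n * 256 // 243)     # ceil(n * 256 / 243), fits in one byte

          def unpack5(b):
              out = []
              for _ in range(5):
                  b *= 3
                  out.append((b >> 8) - 1)  # top bits hold the next trit
                  b &= 0xFF                 # keep the fractional part
              return out

          # check the round trip over all 3**5 = 243 combinations
          for ts in product((-1, 0, 1), repeat=5):
              assert unpack5(pack5(list(ts))) == list(ts)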

rcMgD2BwE72F 2 days ago

I asked about the last French election and the first sentence was:

>Marine Le Pen, a prominent figure in France, won the 2017 presidential election despite not championing neoliberalism. Several factors contributed to her success: (…)

What data did they train their model on?