
At my work, we self-host some models and have found that for anything resembling RAG, or for very narrowly scoped use cases, quantized models have proven more than sufficient. This lets us keep them running on smaller infra and generally lowers costs.


Personally, I've noticed major differences in performance between different quantisations of the same model.

Mistral's large 123B model works well (but slowly) at 4-bit quantisation, but if I knock it down to 2.5-bit quantisation for speed, performance drops to the point where I'm better off with a 70B 4-bit model.

This makes me reluctant to evaluate new models in heavily quantised forms, as you're measuring the quantisation more than the actual model.
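For what it's worth, here is roughly how I A/B two quants of the same base model: a minimal sketch using llama-cpp-python with local GGUF files. The file paths and prompts are made up for illustration, not the actual models I tested.

    # Sketch: compare two quantisations of the same model on a fixed prompt set.
    # Assumes llama-cpp-python is installed and GGUF files exist locally;
    # the paths and prompts below are hypothetical.
    from llama_cpp import Llama

    PROMPTS = [
        "Summarise the key clauses of the contract below in three bullet points: ...",
        "Extract every date mentioned in the following text: ...",
    ]

    def run(model_path: str) -> list[str]:
        llm = Llama(model_path=model_path, n_ctx=4096, verbose=False)
        # temperature=0 so differences come from the weights, not sampling
        return [
            llm(p, max_tokens=256, temperature=0.0)["choices"][0]["text"]
            for p in PROMPTS
        ]

    out_q4 = run("mistral-large-123b.Q4_K_M.gguf")   # 4-bit quant
    out_q2 = run("mistral-large-123b.IQ2_M.gguf")    # ~2.5-bit quant

    for prompt, a, b in zip(PROMPTS, out_q4, out_q2):
        print(f"PROMPT: {prompt}\nQ4: {a}\nQ2: {b}\n" + "-" * 40)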


That's a fair point - the trick with dynamic quants is that we selectively choose not to quantize certain components: attention is left at 4- or 6-bit, and only the MoE parts go down to 1.5-bit (-1, 0, 1).

There are distilled versions (Qwen 1.5B, 3B, 14B, 32B and Llama 8B, 70B), but those are distillations rather than the original model - if you want to run the original R1, the quants are currently the only way.

But I agree quants do affect perf - hence the trick for MoEs is to not quantize specific areas!
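To make the selective-quantisation idea concrete, here's a toy sketch of routing tensors to different bit widths by name and rounding only the expert weights to ternary. The layer-name patterns and the per-tensor scale rule are illustrative assumptions, not the actual GGUF/llama.cpp quantisation code.

    # Sketch of a "dynamic quant" policy: keep attention/embeddings at higher
    # precision, push MoE expert weights to ternary (-1, 0, 1).
    # Name patterns and the scaling rule are assumptions for illustration.
    import numpy as np

    def bits_for(name: str) -> float:
        if "attn" in name or "embed" in name or "lm_head" in name:
            return 4.0      # leave attention / embeddings at 4-bit (or 6-bit)
        if "experts" in name:
            return 1.5      # MoE expert weights go ternary
        return 4.0

    def quantize_ternary(w: np.ndarray) -> tuple[np.ndarray, float]:
        """Round each weight to {-1, 0, +1} with one per-tensor scale."""
        scale = float(np.abs(w).mean()) + 1e-8
        q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
        return q, scale

    tensors = {
        "blk.0.attn_q.weight": np.random.randn(8, 8).astype(np.float32),
        "blk.0.ffn_experts.weight": np.random.randn(8, 8).astype(np.float32),
    }
    for name, w in tensors.items():
        if bits_for(name) <= 2:
            q, s = quantize_ternary(w)
            print(name, "-> ternary, scale", round(s, 4))
        else:
            print(name, "-> kept at", bits_for(name), "bits")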


How are you doing your evals?

Being able to do semantic diffs of the outputs of the two models should tell you what you need to know.
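A minimal sketch of what I mean by a semantic diff, assuming sentence-transformers for the embeddings - the embedding model name and the similarity threshold are just placeholders, not a recommendation.

    # Sketch: embed each model's answer to the same prompt and flag pairs
    # whose cosine similarity falls below a threshold.
    # "all-MiniLM-L6-v2" and 0.85 are placeholder choices.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def semantic_diff(answers_a: list[str], answers_b: list[str],
                      threshold: float = 0.85) -> list[tuple[int, float]]:
        emb_a = encoder.encode(answers_a, normalize_embeddings=True)
        emb_b = encoder.encode(answers_b, normalize_embeddings=True)
        sims = np.sum(emb_a * emb_b, axis=1)  # cosine similarity of unit vectors
        return [(i, float(s)) for i, s in enumerate(sims) if s < threshold]

    # divergent = semantic_diff(full_precision_answers, quantized_answers)
    # -> indices of prompts where the two models meaningfully disagree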



