Model: Qwen3 32B

GPU: RTX 5090 (no ROPs missing), 32 GB VRAM

Quants: Unsloth Dynamic 2.0, 4-6 bits depending on the layer.

RAM is 96 GB. More RAM makes a difference even if the model fits entirely in the GPU: the filesystem pages holding the model files on disk stay cached in RAM, so when we switch models (we use other models as well) the unload/load overhead is only 3-5 seconds.
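For illustration, a minimal Python sketch of pre-warming the page cache for a model file so the next load comes from RAM rather than disk (the path and filename are just examples, not the exact setup described here):

    # Hypothetical sketch: read a GGUF file once so the kernel keeps its pages
    # cached in free RAM and a later llama.cpp load avoids hitting the disk.
    from pathlib import Path

    def warm_page_cache(path: str, chunk_mb: int = 64) -> int:
        """Read the whole file sequentially; returns the number of bytes read."""
        total = 0
        with open(path, "rb") as f:
            while chunk := f.read(chunk_mb * 1024 * 1024):
                total += len(chunk)
        return total

    if __name__ == "__main__":
        model = Path("/models/Qwen3-32B-UD-Q4_K_XL.gguf")  # example filename
        print(f"warmed {warm_page_cache(str(model)) / 2**30:.1f} GiB")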

The key-value (KV) cache is also quantized to 8-bit (anything lower degrades quality considerably).

This gives you 1 generation with 64k context, or 2 concurrent generations with 32k each. Everything fits in 30 GB of VRAM, which leaves enough room for a Whisper speech-to-text model (turbo, quantized) running in parallel.
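As a sanity check on those numbers, here is a back-of-the-envelope KV-cache calculation in Python. The architecture figures are assumptions about Qwen3-32B (64 layers, 8 KV heads via GQA, head dim 128), not values stated in the post:

    # Rough KV-cache sizing. Architecture numbers below are assumptions.
    n_layers, n_kv_heads, head_dim = 64, 8, 128
    bytes_per_elem = 1                      # ~1 byte per value with a q8_0 KV cache
    ctx = 64 * 1024                         # one 64k-context slot

    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    total_gib = per_token * ctx / 2**30
    print(f"{per_token / 1024:.0f} KiB per token, {total_gib:.1f} GiB for 64k context")
    # ~128 KiB/token and ~8 GiB total, which together with roughly 18-20 GB of
    # 4-6 bit weights lands in the ~30 GB VRAM figure mentioned above.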

Thanks a lot. Interesting that without concurrent requests the context can be doubled; 64k is pretty decent for working on a few files at once. A local LLM server is something a lot of companies should be looking into, I think.

Are you doing this with vLLM? If you're using Llama.cpp/Ollama, you could likely see some pretty massive improvements.

We're using llama.cpp. We run all kinds of models besides Qwen3, and vLLM's startup when switching models is prohibitively slow (several times slower than llama.cpp, which already takes ~5 seconds).

From what I understand, vLLM is best when there's only 1 active model pinned to the GPU and you have many concurrent users (4, 8, etc.). But with just a single 32 GB GPU you have to switch models pretty often, and you can't fit more than 2 concurrent users anyway without sacrificing context length considerably (4 users = just 16k context, 8 users = 8k context), so vLLM isn't worth it for us so far. Once we have several cards, we may switch to vLLM.
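For reference, a rough sketch of what 2 concurrent generations could look like against llama.cpp's OpenAI-compatible server (the launch flags, port, and model name are assumptions, not the exact command used here):

    # Assumed server launch, splitting a 64k context into two 32k slots:
    #   llama-server -m qwen3-32b.gguf -c 65536 --parallel 2 \
    #       --cache-type-k q8_0 --cache-type-v q8_0
    import concurrent.futures
    import requests

    URL = "http://localhost:8080/v1/chat/completions"  # llama-server's default port

    def ask(prompt: str) -> str:
        resp = requests.post(URL, json={
            "model": "qwen3-32b",  # informational; llama-server serves the loaded model
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        }, timeout=300)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    # Two requests in flight at once, one per server slot.
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        for answer in pool.map(ask, ["Summarise file A", "Summarise file B"]):
            print(answer[:80])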
