GPU: RTX 5090 (no ROPs missing), 32 GB VRAM.
Quants: Unsloth Dynamic 2.0, 4-6 bits depending on the layer.
RAM: 96 GB. More RAM makes a difference even if the model fits entirely on the GPU: the filesystem pages holding the model file stay cached in RAM, so when we switch models (we use other models as well) the unload/load overhead is only 3-5 seconds.
The KV cache is also quantized to 8-bit (anything lower degrades quality considerably).
This gives you one generation with 64k context, or two concurrent generations with 32k each. All of this takes about 30 GB of VRAM, which also leaves some room for a Whisper speech-to-text model (turbo, quantized) running in parallel.
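For reference, a minimal sketch of how a setup like this could be launched with llama.cpp's llama-server. The model filename, port, and exact flag spellings are assumptions (they vary between builds, so check llama-server --help), not our exact command:

```python
# Sketch: launch llama-server with an 8-bit KV cache and two parallel slots
# that split a 64k context. Model path and port are hypothetical.
import subprocess

cmd = [
    "llama-server",
    "-m", "Qwen3-32B-UD-Q4_K_XL.gguf",   # hypothetical Unsloth Dynamic 2.0 quant file
    "-ngl", "99",                         # offload all layers to the GPU
    "-c", "65536",                        # total context window
    "-np", "2",                           # two concurrent generation slots
    "-fa",                                # flash attention; some builds need it for a quantized V cache
    "--cache-type-k", "q8_0",             # quantize the K cache to 8-bit
    "--cache-type-v", "q8_0",             # quantize the V cache to 8-bit
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```

With -np 2 the total context is divided evenly between the slots, which is where the "2 concurrent generations with 32k each" figure comes from.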
Thanks a lot. Interesting that without concurrent requests the context can be doubled; 64k is pretty decent for working on a few files at once. A local LLM server is something a lot of companies should be looking into, I think.
We're using llama.cpp. We use all kinds of models other than Qwen3, and vLLM's startup when switching models is prohibitively slow (several times slower than llama.cpp, which already takes about 5 seconds).
From what I understand, vLLM is best when a single model stays pinned to the GPU and there are many concurrent users (4, 8, etc.). But with just a single 32 GB GPU you have to switch models pretty often, and you can't fit more than 2 concurrent users anyway without sacrificing context length considerably (4 users = just 16k context each, 8 users = 8k), so vLLM isn't worth it for us so far. Once we have several cards, we may switch to vLLM.
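To illustrate the two-slot setup, here is a rough sketch of two concurrent requests against llama-server's OpenAI-compatible endpoint; the URL, port, and prompts are assumptions, not our actual client code:

```python
# Sketch: two concurrent chat requests, one per llama-server slot.
# The endpoint is llama-server's OpenAI-compatible API; the port is hypothetical.
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8080/v1/chat/completions"  # assumed local server

def ask(prompt: str) -> str:
    resp = requests.post(URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# With -np 2 both requests are served in parallel, each in its own 32k slot.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(ask, [
        "Summarize the changes in src/main.rs",    # hypothetical prompts
        "Write unit tests for the parser module",
    ]))

for r in results:
    print(r, "\n---")
```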