GPU: RTX 5090 (no ROPs missing), 32 GB VRAM.
Quants: Unsloth Dynamic 2.0, 4-6 bits depending on the layer.
RAM: 96 GB. More RAM makes a difference even if the model fits entirely on the GPU: the filesystem pages holding the model file stay cached in RAM, so when we switch models (we use other models as well) the unload/load overhead is only 3-5 seconds.
The KV cache is also quantized to 8-bit (anything lower degrades quality considerably).
This gives you one generation with 64k context, or two concurrent generations with 32k each. All of this takes about 30 GB of VRAM, which also leaves some room for a Whisper speech-to-text model (turbo, quantized) running in parallel.
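For reference, a minimal sketch of how a setup like this could be launched with llama.cpp's llama-server. The model filename, port, and exact flag spellings are assumptions (they vary between builds, so check llama-server --help), not our exact command:

```python
# Sketch: launch llama-server with an 8-bit KV cache and two parallel slots
# that split a 64k context. Model path and port are hypothetical.
import subprocess

cmd = [
    "llama-server",
    "-m", "Qwen3-32B-UD-Q4_K_XL.gguf",   # hypothetical Unsloth Dynamic 2.0 quant file
    "-ngl", "99",                         # offload all layers to the GPU
    "-c", "65536",                        # total context window
    "-np", "2",                           # two concurrent generation slots
    "-fa",                                # flash attention; some builds need it for a quantized V cache
    "--cache-type-k", "q8_0",             # quantize the K cache to 8-bit
    "--cache-type-v", "q8_0",             # quantize the V cache to 8-bit
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```

With -np 2 the total context is divided evenly between the slots, which is where the "2 concurrent generations with 32k each" figure comes from.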
Thanks a lot. Interesting that without concurrent requests the context can be doubled; 64k is pretty decent for working on a few files at once. A local LLM server is something a lot of companies should be looking into, I think.
We're using llama.cpp. We use all kinds of models other than Qwen3, and vLLM's startup when switching models is prohibitively slow (several times slower than llama.cpp, which already takes about 5 seconds).
From what I understand, vLLM is best when a single model stays pinned to the GPU and there are many concurrent users (4, 8, etc.). But with just a single 32 GB GPU you have to switch models pretty often, and you can't fit more than 2 concurrent users anyway without sacrificing context length considerably (4 users = just 16k context each, 8 users = 8k), so vLLM isn't worth it for us so far. Once we have several cards, we may switch to vLLM.
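To illustrate the two-slot setup, here is a rough sketch of two concurrent requests against llama-server's OpenAI-compatible endpoint; the URL, port, and prompts are assumptions, not our actual client code:

```python
# Sketch: two concurrent chat requests, one per llama-server slot.
# The endpoint is llama-server's OpenAI-compatible API; the port is hypothetical.
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8080/v1/chat/completions"  # assumed local server

def ask(prompt: str) -> str:
    resp = requests.post(URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# With -np 2 both requests are served in parallel, each in its own 32k slot.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(ask, [
        "Summarize the changes in src/main.rs",    # hypothetical prompts
        "Write unit tests for the parser module",
    ]))

for r in results:
    print(r, "\n---")
```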