
I have a 2023 MBP, and I get about 100-150 tok/sec locally with LM Studio.


Which models?


For context, I have an M2 Max MBP with 64 GB of shared RAM, bought in March 2023 for $5-6K.

  Llama 3.2 1.0B - 650 t/s
  Phi 3.5   3.8B -  60 t/s
  Llama 3.1 8.0B -  37 t/s
  Mixtral  14.0B -  24 t/s
Full GPU acceleration, using llama.cpp, just like LM Studio.
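
If you want to reproduce numbers like these outside LM Studio, here's a minimal sketch using llama-cpp-python (the Python bindings for llama.cpp). The model path and prompt are placeholders, and the timing includes prompt processing, so treat the result as a ballpark figure:

  # Rough decode-throughput measurement with llama-cpp-python
  # (pip install llama-cpp-python). Model path and prompt are placeholders.
  import time
  from llama_cpp import Llama

  llm = Llama(
      model_path="llama-3.1-8b-instruct-q8_0.gguf",  # any local GGUF file
      n_gpu_layers=-1,  # offload every layer to Metal on Apple Silicon
      verbose=False,
  )

  start = time.perf_counter()
  out = llm("Explain memory bandwidth in one paragraph.", max_tokens=256)
  elapsed = time.perf_counter() - start

  n_tokens = out["usage"]["completion_tokens"]
  print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tok/s")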


hugging-quants/llama-3.2-1b-instruct-q8_0-gguf - 100-150 tok/sec

second-state/llama-2-7b-chat-gguf nets me around 35 tok/sec

lmstudio-community/granite-3.1-8b-instruct-GGUF - ~50 tok/sec

MBP M3 Max, 64 GB - $3k
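
If you'd rather measure against LM Studio itself, here's a rough sketch that hits its local OpenAI-compatible server. Port 1234 is LM Studio's default, and the model id and prompt are placeholders you'd swap for whatever you have loaded:

  # Ballpark tok/sec via LM Studio's local OpenAI-compatible server.
  # Assumes the server is running on the default port; model id is a placeholder.
  import time
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

  start = time.perf_counter()
  resp = client.chat.completions.create(
      model="granite-3.1-8b-instruct",  # whichever model is loaded
      messages=[{"role": "user", "content": "Write a limerick about RAM."}],
      max_tokens=256,
  )
  elapsed = time.perf_counter() - start

  n_tokens = resp.usage.completion_tokens
  print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tok/s")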


I'm not sure if you're pointing out any / all of these:

#1. It is possible to get an arbitrarily fast tokens/second number, given you can pick model size.

#2. Llama 1B is roughly GPT-4.

#3. Given Llama 1B runs at 100 tokens/sec, and given performance at a given model size has continued to improve over the past 2 years, we can assume there will eventually be a GPT-4 quality model at 1B.

On my end:

#1. Agreed.

#2. Vehemently disagree.

#3. TL;DR: I don't expect that; at the very least, the trend line isn't steep enough for me to expect it within the next decade.


I specifically missed the GPT-4 part of "up to 10 token/sec out of a kinda sorta GPT-4". I was just looking at tokens/sec.



