
Given that LLMs appear to, in large part, "think" by feeding their own output back into themselves, people have consistently noticed that insisting the model "think out loud" results in higher-quality reasoning. I.e., "chain of thought" prompting contrasts simply having the model answer a question directly with first having it write out things like:

- restating what it thinks is being asked of it

- expressing a high level strategy over what sort of information it might need in order to answer that question

- stating the information it knows

- describing how that information might inform its initial reasoning

etc...
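The contrast above can be sketched as two prompt templates. The wording here is purely illustrative (these are not canonical templates from any paper); any instruction-following model would slot in behind them:

```python
# Sketch of the two prompting styles described above.
# Template wording is hypothetical, for illustration only.

DIRECT_PROMPT = "Q: {question}\nA:"

COT_PROMPT = (
    "Q: {question}\n"
    "Before answering:\n"
    "1. Restate what you think is being asked.\n"
    "2. Outline what information you would need to answer it.\n"
    "3. State the relevant facts you know.\n"
    "4. Explain how those facts inform your reasoning.\n"
    "Then give your final answer.\n"
    "A:"
)

def build_prompt(question: str, chain_of_thought: bool = True) -> str:
    """Return either a direct prompt or a chain-of-thought prompt."""
    template = COT_PROMPT if chain_of_thought else DIRECT_PROMPT
    return template.format(question=question)
```

The only difference between the two is the extra instruction text, which pushes the model to emit intermediate tokens it then conditions on.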

I'd be concerned that having the model predict multiple next tokens at a time would essentially have the opposite effect.

Chain of thought prompting appears to indicate that a model is "smarter" when it has n + m tokens of input than when it has only n. As such, getting the next 5 tokens from a given n-token prefix might net worse results than getting the next token at n, then the next token at n + 1, and so on.
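The context-length point can be made concrete with a toy decoding loop. The `next_token` and `next_block` callables here are hypothetical stand-ins for a model call, not any real API:

```python
def generate_sequential(prefix, next_token, steps=5):
    # One token at a time: step k conditions on all n + k tokens
    # generated so far, so each new token "sees" the previous ones.
    tokens = list(prefix)
    for _ in range(steps):
        tokens.append(next_token(tokens))
    return tokens

def generate_block(prefix, next_block, steps=5):
    # All `steps` tokens are predicted from the same n-token prefix;
    # later tokens in the block never condition on the earlier ones.
    return list(prefix) + next_block(list(prefix), steps)
```

With a toy `next_token` that simply returns the current context length, the sequential loop produces a chain where each token depends on the last, which is exactly the dependency that a block prediction from position n cannot exploit.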



If the LLM had an affordable model, it would always generate enough tokens for the task at hand, and the fact that this particular method requires more tokens would be irrelevant. If you don't have an affordable model, you would always be at the mercy of the LLM being biased towards answering with an estimate instead of the actual answer.

Also, most speculative decoding strategies produce output identical to running the model sequentially: if a predicted token is wrong, it gets discarded and the speedup is lost for that position.
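The accept/discard logic described here can be sketched as follows, using greedy verification only. The hypothetical `target_next` stands in for the large model's prediction at a position; in a real implementation the target model scores all draft positions in a single batched forward pass rather than one call per token:

```python
def verify_draft(prefix, draft_tokens, target_next):
    # Greedy speculative decoding sketch: accept draft tokens only
    # while they match what the target model would have produced.
    # On the first mismatch, discard the rest of the draft and emit
    # the target's own token instead, so the final output is
    # identical to sequential decoding.
    accepted = list(prefix)
    for tok in draft_tokens:
        expected = target_next(accepted)
        if tok == expected:
            accepted.append(tok)       # draft confirmed: this is the speedup
        else:
            accepted.append(expected)  # correction; remaining draft discarded
            break
    return accepted
```

Because every emitted token is one the target model agrees with, the output sequence is unchanged; only the wall-clock cost varies with how often the draft is right.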



