
Given that LLMs appear to, in large part, "think" by feeding their own output back into themselves, people have consistently noticed that insisting the model "think out loud" results in higher-quality reasoning. I.e., "chain of thought" prompting contrasts simply having the model answer a question directly with first having it write out things like:

- restating what it thinks is being asked of it

- expressing a high level strategy over what sort of information it might need in order to answer that question

- stating the information it knows

- describing how that information might inform its initial reasoning

etc...
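The contrast above can be sketched as two prompt templates. The wording here is purely illustrative (these are not canonical templates from any paper); any instruction-following model would slot in behind them:

```python
# Sketch of the two prompting styles described above.
# Template wording is hypothetical, for illustration only.

DIRECT_PROMPT = "Q: {question}\nA:"

COT_PROMPT = (
    "Q: {question}\n"
    "Before answering:\n"
    "1. Restate what you think is being asked.\n"
    "2. Outline what information you would need to answer it.\n"
    "3. State the relevant facts you know.\n"
    "4. Explain how those facts inform your reasoning.\n"
    "Then give your final answer.\n"
    "A:"
)

def build_prompt(question: str, chain_of_thought: bool = True) -> str:
    """Return either a direct prompt or a chain-of-thought prompt."""
    template = COT_PROMPT if chain_of_thought else DIRECT_PROMPT
    return template.format(question=question)
```

The only difference between the two is the extra instruction text, which pushes the model to emit intermediate tokens it then conditions on.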

I'd be concerned that having the model predict multiple next tokens at a time would essentially have the opposite effect.

Chain of thought prompting appears to indicate that a model is "smarter" when it has n + m tokens of input than when it has only n. As such, getting the next 5 tokens from a given n-token prefix might net worse results than getting the next token at n, then the next token at n + 1, and so on.
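The context-length point can be made concrete with a toy decoding loop. The `next_token` and `next_block` callables here are hypothetical stand-ins for a model call, not any real API:

```python
def generate_sequential(prefix, next_token, steps=5):
    # One token at a time: step k conditions on all n + k tokens
    # generated so far, so each new token "sees" the previous ones.
    tokens = list(prefix)
    for _ in range(steps):
        tokens.append(next_token(tokens))
    return tokens

def generate_block(prefix, next_block, steps=5):
    # All `steps` tokens are predicted from the same n-token prefix;
    # later tokens in the block never condition on the earlier ones.
    return list(prefix) + next_block(list(prefix), steps)
```

With a toy `next_token` that simply returns the current context length, the sequential loop produces a chain where each token depends on the last, which is exactly the dependency that a block prediction from position n cannot exploit.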



If the LLM had an affordable model, it would always generate enough tokens for the task at hand, and the fact that this particular method requires more tokens would be irrelevant. If you don't have an affordable model, you would always be at the mercy of the LLM being biased towards answering with an estimate instead of the actual answer.

Also, most speculative decoding strategies produce output identical to running the model sequentially: if a predicted token is wrong, it gets discarded and the speedup is lost for that position.
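The accept/discard logic described here can be sketched as follows, using greedy verification only. The hypothetical `target_next` stands in for the large model's prediction at a position; in a real implementation the target model scores all draft positions in a single batched forward pass rather than one call per token:

```python
def verify_draft(prefix, draft_tokens, target_next):
    # Greedy speculative decoding sketch: accept draft tokens only
    # while they match what the target model would have produced.
    # On the first mismatch, discard the rest of the draft and emit
    # the target's own token instead, so the final output is
    # identical to sequential decoding.
    accepted = list(prefix)
    for tok in draft_tokens:
        expected = target_next(accepted)
        if tok == expected:
            accepted.append(tok)       # draft confirmed: this is the speedup
        else:
            accepted.append(expected)  # correction; remaining draft discarded
            break
    return accepted
```

Because every emitted token is one the target model agrees with, the output sequence is unchanged; only the wall-clock cost varies with how often the draft is right.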



