I've been prototyping with LLMs for some borderline use cases, and cost isn't really the concern; reliability is. Using anything less than the best frontier model seems irresponsible if it could mean the difference between 99.95% reliability and 99% reliability, because that's the threshold where you should have hired a human to do it: you lost more money on that 0.95% error rate than you saved on salaries. (I don't actually have any use cases where this kind of calculation makes sense, but in principle I think it applies to most uses of LLMs, even if you can't quantify the harm.)
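For illustration, here's a toy break-even sketch with made-up numbers (the task volume, cost per error, and salary saved are all assumptions) showing how that threshold falls out:

```python
# Toy break-even sketch: at what error rate does automating a task with an
# LLM stop paying for itself? All numbers below are illustrative assumptions.
tasks_per_year = 100_000          # volume handled by the system (assumption)
cost_per_error = 500.0            # average loss when a task is handled wrong (assumption)
human_salary_saved = 400_000.0    # annual payroll avoided by automating (assumption)
human_error_rate = 0.0005         # assume 99.95% human reliability as the baseline

def annual_loss(error_rate: float) -> float:
    """Expected yearly cost of mistakes at a given error rate."""
    return tasks_per_year * error_rate * cost_per_error

# Break-even error rate: where the extra mistakes eat the salary savings.
breakeven = human_error_rate + human_salary_saved / (tasks_per_year * cost_per_error)

for rate in (0.0005, 0.01, breakeven):
    extra = annual_loss(rate) - annual_loss(human_error_rate)
    print(f"error rate {rate:.2%}: extra loss vs. human = ${extra:,.0f}")
```

With those assumptions, a 1% error rate costs $475k more than the human baseline, which already exceeds the $400k saved on salary; the break-even sits at 0.85%.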
The problem is that the frontier models are nowhere near 99% reliable on their own. Orchestration and good system design are how you get reliability. Yes, the frontier models are still going to be better by default than open-source models, but the LLM is only one component in a broader system. What actually seems necessary for any high-volume, worthwhile use case is making your model task-specific (via fine-tuning / post-training / RL). I build these systems for enterprises; the frontier models alone are not enough.
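As a rough sketch of what I mean by the LLM being one component: wrap every call in task-specific validation, retries, and an escalation path. `call_model` and the invoice validator below are hypothetical stand-ins, not any real API:

```python
# Minimal orchestration sketch: the LLM is a fallible component wrapped in
# validation, retries, and an escalation path. `call_model` and the
# invoice-extraction validator are hypothetical stand-ins.
import json

def call_model(prompt: str, model: str) -> str:
    raise NotImplementedError("stand-in for whatever inference client you use")

def validate(raw: str) -> dict | None:
    """Task-specific check: here, require well-formed JSON with an 'amount' field."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) and "amount" in parsed else None

def extract_invoice(prompt: str, max_attempts: int = 3) -> dict:
    for _ in range(max_attempts):
        raw = call_model(prompt, model="task-tuned-extractor")  # task-specific model
        result = validate(raw)
        if result is not None:
            return result
    # The system, not the model, decides what happens on repeated failure.
    raise RuntimeError("validation failed; escalate to human review queue")
```

The reliability number that matters is the one for the whole pipeline, including the checks and the fallback, not the model's raw accuracy.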
First off, you’re ignoring error bars. On average, frontier models might be 99.95% accurate, but for many work streams there are surely tail cases where a series of questions produces only 99% accuracy (or even less), even with a frontier model.
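One way those tail cases show up: per-call accuracy compounds across a multi-step work stream, so even a 99.95%-per-call model can land at 99% or worse end to end. A quick illustration:

```python
# Per-call accuracy compounds across a chained workflow.
per_call_accuracy = 0.9995          # headline 99.95% per call

for steps in (1, 10, 20, 100):
    end_to_end = per_call_accuracy ** steps
    print(f"{steps:3d} chained calls -> {end_to_end:.2%} end-to-end accuracy")
# 20 chained calls already drop to ~99%; 100 drop to ~95%.
```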
The challenge businesses face is how to integrate these fallible models into reliable, repeatable business processes. That doesn’t sound so different from the software engineering of yesteryear.
I suspect that as the AI hype continues to level off, business leaders will come to their senses and realize that the marginal dollar is better spent on integration practices than on eking out minor gains from frontier models.