The trick they announce for Grok Heavy is running multiple agents in parallel and then having them compare results at the end, with impressive benchmarks across the board. This is a neat idea! Expensive and slow, but it tracks as a logical step. Should work for general agent design, too. I'm genuinely looking forward to trying this out.
EDIT: They're announcing big jumps in a lot of benchmarks. TIL they have an API one could use to check this out, but it seems like xAI really has something here.
I can understand how/that this works, but it still feels like a 'hack' to me. It still feels like the LLM's themselves are plateauing but the applications get better by running the LLM's deeper, longer, wider (and by adding 'non ai' tooling/logic at the edges).
But maybe that's simply the solution, like the solution to original neural nets was (perhaps too simply put) to wait for exponentially better/faster hardware.
This is exactly how human society scaled from the cavemen era to today. We didn't need to make our brains bigger in order to get to the modern industrial age - increasingly sophisticated tool use and organization was all we did.
It only mattered that human brains are just big enough to enable tool use and organization. It ceased to matter once our brains are past a certain threshold. I believed LLMs are past this threshold as well (it has not 100% matched human brain or ever will, but this doesn't matter.)
An individual LLM call might lack domain knowledge, context and might hallucinate. The solution is not to scale the individual LLM and hope the problems are solved, but to direct your query to a team of LLMs each playing a different role: planner, designer, coder, reviewer, customer rep, ... each working with their unique perspective & context.
I get that feeling too - the underlying tech has plateaued, but now they're brute force trading extra time and compute for better results. I don't know if that scale anything but, at best, linearly. Are we going to end up with 10,000 AI monkeys on 10,000 AI typewriters and a team of a dozen monkeys deciding which one's work they like the most?
This is an interesting point. If this ends up working well after being optimized for scale it could become the dominant architecture. If not it could become another dead leaf node in the evolutionary tree of AI.
Isn't that kinda why we have collaboration and get in room with colleagues to discuss ideas? i.e., thinking about different ideas, getting different perspectives, considering trade-offs in various approaches, etc. results in a better solution than just letting one person go off and try to solve it with their thoughts alone.
Not sure if that's a good parallel, but seems plausible.
…like what? I thought the consensus was that humans exhibit truly general intelligence. If LLMs require access to very specific tools to solve certain classes of problems, then it’s not clear that they can evolve into a form of general intelligence.
Specifically, which portions of the brain are “very specialized”? I’m not aware of any aspect of the brain that’s as narrowly applied to tasks as the tools LLMs use. For example, there’s no coding module within the brain - the same brain regions you use when programming could be used to perform many, many other tasks.
They are, but I think the keyword is "generalization". Humans do very well when innovation is required, because innovation needs generalized models that can be used to make very specialized predictions and then meta-models that can predict how specialized models relate to each other and cross reference those predictions. We don't learn arithmetic by getting fed terabytes of text like "1+1=2". We only use text to communicate information, but learn the actual logic and concept behind arithmetic, and then we use that generalized model for arithmetic in our reasoning.
I struggle to imagine how much further a purely text based system can be pushed - a system that basically knows that 1+1=2 not because it has built an internal model of arithmetic, but because it estimates that the sequence of `1+1=` is mostly followed by `2`.
They have somewhat an internal model of arithmetic, with lookup tables and separate treatment of digits. I'm conscious you might have seen this already and not interpret it like that, but in case you haven't section 6 on addition in this Anthropic interpretability paper goes into it.
Keep in mind that is a basic level of understanding of what is going on in quite a small model (Claude 3.5 Haiku). We don't know what is happening inside larger models.
I can’t help but call out that o1-pro was great, it rarely took more than five minutes and I was almost never dissatisfied with the results per the wait. I happily paid for o1-pro the entire time it was available. Now, o3-pro is a relative disaster, often taking over 20 minutes just to refuse to follow directions and gaslight people about files being available for download that don’t exist, or provide simplified answers after waiting 20 minutes. It’s worse than useless when it actively wastes users time. I don’t see myself ever trusting OpenAI again after this “pro” subscription fiasco. To go from a great model to then just take it away and force an objectively terrible replacement, is definitely going the wrong way, when everyone else is improving (Gemini 2.5, Claude code with opus, etc). I can’t believe meta would pay a premium to poach the OpenAI people responsible for this severe regression.
I have never had o3-pro take longer than 6-8 minutes. How are you getting it to think for 20 minutes?! My results using it have also been great, but I never used o1-pro so I don't have that as a reference point.
I mean, either they cheated on evals ala Llama4, or they have a paradigm that's currently best in class in at least a few standard evals. Both alternatives are possible, I suppose.
So the progress is basically to brute force even more?
We got from "single prompt, single output", to reasoning (simple brute-forcing) and now to multiple parallel instances of reasoning (distributed brute-forcing)?
No wonder the prices are increasing and capacity is more limited.
EDIT: They're announcing big jumps in a lot of benchmarks. TIL they have an API one could use to check this out, but it seems like xAI really has something here.