The trick they announce for Grok Heavy is running multiple agents in parallel an...

icoder · 2025-07-10T10:49:06 1752144546

I can understand how/that this works, but it still feels like a 'hack' to me. It still feels like the LLMs themselves are plateauing but the applications get better by running the LLMs deeper, longer, wider (and by adding 'non ai' tooling/logic at the edges).

But maybe that's simply the solution, like the solution to original neural nets was (perhaps too simply put) to wait for exponentially better/faster hardware.

crazylogger · 2025-07-11T04:39:02 1752208742

This is exactly how human society scaled from the cavemen era to today. We didn't need to make our brains bigger in order to get to the modern industrial age - increasingly sophisticated tool use and organization was all we did.

It only mattered that human brains were just big enough to enable tool use and organization. Brain size ceased to matter once our brains were past a certain threshold. I believe LLMs are past this threshold as well (they have not 100% matched the human brain, nor ever will, but this doesn't matter).

An individual LLM call might lack domain knowledge, context and might hallucinate. The solution is not to scale the individual LLM and hope the problems are solved, but to direct your query to a team of LLMs each playing a different role: planner, designer, coder, reviewer, customer rep, ... each working with their unique perspective & context.
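To make the "team of roles" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `fake_llm` is a hypothetical stand-in for a real model call, the role list is just the first few roles named above, and chaining each role's output into the next is only one of many ways such a team could be wired.

```python
from typing import Callable

# Hypothetical stand-in for a real model call; swap in any client you like.
def fake_llm(system: str, user: str) -> str:
    return f"[{system}] {user}"

# Assumed role order, taken from the roles named in the comment above.
ROLES = ["planner", "designer", "coder", "reviewer"]

def run_team(query: str, llm: Callable[[str, str], str] = fake_llm) -> list[tuple[str, str]]:
    """Pass the query through each role in turn; each role sees the previous role's output."""
    transcript = []
    context = query
    for role in ROLES:
        output = llm(f"You are the {role}.", context)
        transcript.append((role, output))
        context = output  # the next role builds on this role's work
    return transcript

transcript = run_team("build a todo app")
for role, output in transcript:
    print(role, "->", output)
```

In a real setup each role would likely get its own context (docs, tickets, test results) rather than only the previous role's text.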

SketchySeaBeast · 2025-07-10T17:10:22 1752167422

I get that feeling too - the underlying tech has plateaued, but now they're brute-force trading extra time and compute for better results. I don't know if that scales anything but linearly, at best. Are we going to end up with 10,000 AI monkeys on 10,000 AI typewriters and a team of a dozen monkeys deciding which one's work they like the most?

woah · 2025-07-10T18:24:52 1752171892

> the underlying tech has plateaued, but now they're brute force trading extra time and compute for better results

You could say the exact same thing about the original GPT. Brute forcing has gotten us pretty far.

SketchySeaBeast · 2025-07-10T18:55:29 1752173729

How much farther can it take us? Apparently they've started scaling out rather than up. When does the compute become cost-prohibitive?

tibbar · 2025-07-11T00:28:41 1752193721

Until recently, training-time compute was the dominant cost, so we're really just getting started down the test-time scaling road.

jjmarr · 2025-07-10T23:50:58 1752191458

Yes. It works pretty well.

the8472 · 2025-07-10T11:56:25 1752148585

grug think man-think also plateau, but get better with tool and more tribework

Pointy sticks and ASML's EUV machines were designed by roughly the same lumps of compute-fat :)

SauciestGNU · 2025-07-10T16:47:58 1752166078

This is an interesting point. If this ends up working well after being optimized for scale it could become the dominant architecture. If not it could become another dead leaf node in the evolutionary tree of AI.

billti · 2025-07-10T18:36:36 1752172596

Isn't that kinda why we have collaboration and get in a room with colleagues to discuss ideas? I.e., thinking about different ideas, getting different perspectives, considering trade-offs in various approaches, etc. results in a better solution than just letting one person go off and try to solve it with their thoughts alone.

Not sure if that's a good parallel, but seems plausible.

cfn · 2025-07-10T11:36:21 1752147381

Maybe this is the dawn of the multicore era for LLMs.

qoez · 2025-07-10T17:32:23 1752168743

It's basically a mixture of experts but instead of a learned operator picking the predicted best model, you use a 'max' operator across all experts.
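A rough sketch of that contrast, treating "experts" as whole agents rather than the per-token expert layers inside a real mixture-of-experts model. The `gate`, `score`, and toy experts here are all hypothetical placeholders, not anything xAI has described.

```python
from typing import Callable, Sequence

Expert = Callable[[str], str]

def routed(query: str, experts: Sequence[Expert], gate: Callable[[str], int]) -> str:
    """MoE-style: a gate picks one expert up front, so only that expert runs."""
    return experts[gate(query)](query)

def best_of_all(query: str, experts: Sequence[Expert], score: Callable[[str], float]) -> str:
    """'max' style: every expert runs, then the highest-scoring answer wins."""
    answers = [expert(query) for expert in experts]
    return max(answers, key=score)

# Toy experts and scorers, purely illustrative.
experts = [lambda q: q.upper(), lambda q: q + "!!", lambda q: q[::-1]]

picked_by_gate = routed("hi", experts, gate=lambda q: 0)
picked_by_max = best_of_all("hi", experts, score=len)
```

The cost difference is the point: `routed` pays for one expert per query, while `best_of_all` pays for all of them plus the scoring step.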

simondotau · 2025-07-10T12:09:47 1752149387

You could argue that many aspects of human cognition are "hacks" too.

emp17344 · 2025-07-10T13:31:26 1752154286

…like what? I thought the consensus was that humans exhibit truly general intelligence. If LLMs require access to very specific tools to solve certain classes of problems, then it’s not clear that they can evolve into a form of general intelligence.

whynotminot · 2025-07-10T13:52:00 1752155520

What would you call the very specialized portions of our brains?

The brain is not a monolith.

emp17344 · 2025-07-10T14:28:20 1752157700

Specifically, which portions of the brain are “very specialized”? I’m not aware of any aspect of the brain that’s as narrowly applied to tasks as the tools LLMs use. For example, there’s no coding module within the brain - the same brain regions you use when programming could be used to perform many, many other tasks.

satvikpendem · 2025-07-10T17:32:02 1752168722

Broca's area, Wernicke's area, visual and occipital cortices (the latter of which, if damage occurs, can cause loss of sight).

Xmd5a · 2025-07-10T23:21:06 1752189666

Most people with aphasia can still swear because it's handled by the reptilian part of the brain. ahaha

djmips · 2025-07-10T15:51:35 1752162695

Are you able to point to a coding module in an LLM?

short_sells_poo · 2025-07-10T16:42:13 1752165733

They are, but I think the keyword is "generalization". Humans do very well when innovation is required, because innovation needs generalized models that can be used to make very specialized predictions and then meta-models that can predict how specialized models relate to each other and cross reference those predictions. We don't learn arithmetic by getting fed terabytes of text like "1+1=2". We only use text to communicate information, but learn the actual logic and concept behind arithmetic, and then we use that generalized model for arithmetic in our reasoning.

I struggle to imagine how much further a purely text-based system can be pushed - a system that basically knows that 1+1=2 not because it has built an internal model of arithmetic, but because it estimates that the sequence of `1+1=` is mostly followed by `2`.

frabcus · 2025-07-10T23:33:15 1752190395

They do have something of an internal model of arithmetic, with lookup tables and separate treatment of digits. I'm conscious you might have seen this already and not interpreted it like that, but in case you haven't, section 6 on addition in this Anthropic interpretability paper goes into it.

https://transformer-circuits.pub/2025/attribution-graphs/bio...

Keep in mind that is a basic level of understanding of what is going on in quite a small model (Claude 3.5 Haiku). We don't know what is happening inside larger models.

Voloskaya · 2025-07-10T09:07:51 1752138471

> Expensive and slow

Yes, but... in order to train your next SotA model you have to do this anyway and do rejection sampling to generate good synthetic data.

So if you can do it in prod for users paying $300/month, it's a pretty good deal.
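For anyone unfamiliar with rejection sampling in this sense, a minimal sketch: sample several answers per prompt, keep only the ones a checker accepts, and save those as training pairs. The `toy_generate` and `toy_accept` functions are hypothetical stand-ins for a model and a verifier; how any lab actually does this is not something the thread establishes.

```python
from itertools import cycle
from typing import Callable

def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],
    accept: Callable[[str, str], bool],
    n: int = 8,
) -> list[dict]:
    """Draw n candidate answers and keep only those the checker accepts."""
    kept = []
    for _ in range(n):
        answer = generate(prompt)
        if accept(prompt, answer):
            kept.append({"prompt": prompt, "completion": answer})
    return kept

# Toy model: cycles through fixed answers instead of sampling from an LLM.
_answers = cycle(["4", "5", "6", "5"])

def toy_generate(prompt: str) -> str:
    return next(_answers)

# Toy verifier: accepts only the known-correct answer.
def toy_accept(prompt: str, answer: str) -> bool:
    return answer == "5"

dataset = rejection_sample("What is 2 + 3?", toy_generate, toy_accept, n=8)
print(len(dataset), "accepted samples")
```

The expensive part is the n generations per prompt, which is the same cost a parallel-agent product already pays at inference time.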

daniel_iversen · 2025-07-10T10:13:59 1752142439

Very clever, thanks for mentioning this!

irthomasthomas · 2025-07-10T08:19:43 1752135583

Like llm-consortium? But without the model diversity.

https://x.com/karpathy/status/1870692546969735361

https://github.com/irthomasthomas/llm-consortium

simianwords · 2025-07-10T04:56:22 1752123382

that's how o3 pro also works IMO

bobjordan · 2025-07-10T08:09:42 1752134982

I can’t help but call out that o1-pro was great: it rarely took more than five minutes, and I was almost never dissatisfied with the results given the wait. I happily paid for o1-pro the entire time it was available. Now o3-pro is a relative disaster, often taking over 20 minutes just to refuse to follow directions, gaslight people about downloadable files that don’t exist, or provide simplified answers. It’s worse than useless when it actively wastes users’ time. I don’t see myself ever trusting OpenAI again after this “pro” subscription fiasco. To take away a great model and force an objectively terrible replacement is definitely going the wrong way when everyone else is improving (Gemini 2.5, Claude Code with Opus, etc.). I can’t believe Meta would pay a premium to poach the OpenAI people responsible for this severe regression.

sothatsit · 2025-07-10T11:40:51 1752147651

I have never had o3-pro take longer than 6-8 minutes. How are you getting it to think for 20 minutes?! My results using it have also been great, but I never used o1-pro so I don't have that as a reference point.

zone411 · 2025-07-10T05:37:13 1752125833

This is the speculation, but then it wouldn't have to take much longer to answer than o3.

tibbar · 2025-07-10T04:59:48 1752123588

Interesting. I'd guess this technique should probably work with any SOTA model in an agentic tool loop. Fun!

JKCalhoun · 2025-07-10T12:59:23 1752152363

> I'm genuinely looking forward to trying this out.

Myself, I'm looking forward to trying it out when companies with less, um, baggage implement the same. (I have principles I try to maintain.)

nisegami · 2025-07-10T14:52:22 1752159142

I've suspected that this technique could help mitigate hallucinations, where other agents could call bullshit on a made-up source.

sidibe · 2025-07-10T04:52:41 1752123161

You are making the mistake of taking one of Elon's presentations at face value.

tibbar · 2025-07-10T04:55:56 1752123356

I mean, either they cheated on evals à la Llama 4, or they have a paradigm that's currently best in class in at least a few standard evals. Both alternatives are possible, I suppose.

einrealist · 2025-07-10T14:43:01 1752158581

So the progress is basically to brute force even more?

We got from "single prompt, single output", to reasoning (simple brute-forcing) and now to multiple parallel instances of reasoning (distributed brute-forcing)?

No wonder the prices are increasing and capacity is more limited.

Impressive. /s