Google's Genie is more impressive than GPT5

mirblitzarmaven · 2025-08-08T17:39:05 1754674745

> Imagine asking a model a question like “what's the weather in Tibet” and instead of doing something lame like check weather.com, it does something awesome like stimulate Tibet [...]

Let's not stimulate Tibet

cedws · 2025-08-09T02:07:31 1754705251

Yeah China might have something to say about that unapproved stimulation.

pyman · 2025-08-08T19:06:55 1754680015

Taiwan?

tim333 · 2025-08-08T20:03:55 1754683435

stimulate? That would be more interesting than checking weather.com.

surround · 2025-08-08T17:07:15 1754672835

> The betting markets were not impressed by GPT-5. I am reading this graph as "there is a high expectation that Google will announce Gemini-3 in August", and not as "Gemini 2.5 is better than GPT-5".

This is an incorrect interpretation. The benchmark which the betting market is based upon currently ranks Gemini 2.5 higher than GPT-5.

theahura · 2025-08-08T17:12:16 1754673136

EDIT: I updated the article to account for this perspective.

------

This can't be right -- they're using LMArena without style control to resolve the market, and GPT-5 is ahead right? (https://lmarena.ai/leaderboard/text/overall-no-style-control)

> This market will resolve according to the company which owns the model which has the highest arena score based off the Chatbot Arena LLM Leaderboard (https://lmarena.ai/) when the table under the "Leaderboard" tab is checked on August 31, 2025, 12:00 PM ET.

> Results from the "Arena Score" section on the Leaderboard tab of https://lmarena.ai/leaderboard/text with the style control off will be used to resolve this market.

> If two models are tied for the top arena score at this market's check time, resolution will be based on whichever company's name, as it is described in this market group, comes first in alphabetical order (e.g. if both were tied, "Google" would resolve to "Yes", and "xAI" would resolve to "No")

> The resolution source for this market is the Chatbot Arena LLM Leaderboard found at https://lmarena.ai/. If this resolution source is unavailable at check time, this market will remain open until the leaderboard comes back online and resolve based on the first check after it becomes available. If it becomes permanently unavailable, this market will resolve based on another resolution source.

surround · 2025-08-08T18:22:26 1754677346

You may have already figured this out, but the leaderboard you linked to (https://lmarena.ai/leaderboard/text/overall-no-style-control) shows gemini-2.5-pro ahead with a score of 1471 compared to gpt-5 at 1462.

rrhjm53270 · 2025-08-08T20:49:27 1754686167

It is very interesting that among top-20 models, all non-proprietary ones are from China.

tim333 · 2025-08-08T20:06:56 1754683616

gpt-5 was ahead on that last night

surround · 2025-08-08T20:25:23 1754684723

The leaderboard hasn't changed since it was updated to add gpt-5. Here's what it looked like yesterday https://archive.is/XIrbN

If you saw gpt-5 was ahead, you might have been looking at the leaderboard with style control https://lmarena.ai/leaderboard/text/overall

JimDabell · 2025-08-08T17:28:03 1754674083

> This is an incorrect interpretation. The benchmark which the betting market is based upon currently ranks Gemini 2.5 higher than GPT-5.

You can see from the graph that Google shot way up from ~25% to ~80% upon the release of GPT-5. Google’s model didn’t suddenly get way better at any benchmarks, did it?

dcre · 2025-08-08T17:51:03 1754675463

It's not about Google's model getting better. It is that gpt-5 already has a worse score than Gemini 2.5 Pro had before gpt-5 came out (on the particular metric that determines this bet: Overall Text without Style Control).

https://lmarena.ai/leaderboard/text/overall-no-style-control

That graph is a probability. The fact that it's not 100% reflects the possibility that gpt-5 or someone else will improve enough by the end of the month to beat Gemini.

standardUser · 2025-08-08T17:37:33 1754674653

> The goal of AGI is to make programs that can do lots of things.

What do Genie and GPT have to do with AGI? I'm sure the people who stand to make billions love to squint and see their LLM as only steps away from an AGI. Or that guy at Google who fell in love with one. But the rest of us know better.

throwup238 · 2025-08-08T18:02:05 1754676125

Ostensibly a model like Genie3 probably encodes physical laws into the weights like LLMs encode language. That's generally considered a prerequisite for true AGI to have an intuitive grasp of physics as part of their "world model." It's a minor but significant step towards AGI (assuming Genie3 plays out successfully).

jaredklewis · 2025-08-08T19:30:04 1754681404

The debate over whether something is or is not AGI is entirely semantic and basically uninteresting.

Let’s talk about what LLM agents demonstrably can or cannot do. That’s an interesting discussion and we can design experiments to validate the claims. Comparing how LLMs perform versus humans in various tasks is also interesting and verifiable.

But trying to decide whether LLM agents have crossed over some amorphous, imaginary line with no quantitative definition is just a waste of time. It’s about equally productive as debating the question: “which is more human, an ant eater or a tree shrew?” Like why is there any value in that discussion?

tim333 · 2025-08-08T20:23:59 1754684639

Most people seem to have a definition of AGI something like being able to think as well as a human in all regards. The debates on current stuff doing that are dull because the answer is no, but the future may be interesting.

jaredklewis · 2025-08-08T22:32:59 1754692379

Right but “being able to think as a human in all regards” is a miserably vague definition that can’t be tested. To start, define think and specify which human. The best human? The average? Average or best by what metric?

Without a quantitative definition, all views are basically valid and non-falsifiable.

tim333 · 2025-08-08T22:58:23 1754693903

I'm not so sure. Like at the moment AI robots can't fix your plumbing hence the hypothesis we have AGI is falsified.

I suspect when it comes, AI will blast through the "the best human? The average? Average or best by what metric?" thing fairly rapidly like it did with chess, go and the like.

jaredklewis · 2025-08-09T00:40:41 1754700041

So I’m fine with being able to fix some specified plumbing issue as being the AGI test, but it probably also means that humans don’t have AGI since it won’t be hard to find humans that can’t.

But it doesn’t matter because that’s not the issue. The issue is that unless we all agree on that definition, then debates about AGI are just semantic equivocating. We all have our own idiolects of what AGI means for us, but like who cares?

What everyone can agree on is that LLM agents cannot do plumbing now. This is observable and tells us interesting information about the capabilities of LLM agents.

Jensson · 2025-08-09T07:32:45 1754724765

> but it probably also means that humans don’t have AGI since it won’t be hard to find humans that can’t.

Humans can learn to fix it. Learning is a part of intelligence.

The biggest misconception is thinking that a humans intelligence is based on what they can do today and not what they can learn to do in 10 years. And since the AI model has already trained to completion when you use it, it should be able to do whatever any human can learn to do, or it should be able to learn.

With this definition AGI is not that complicated at all.

jaredklewis · 2025-08-09T17:54:41 1754762081

That’s not what I was getting at.

Did Stephen Hawking have GI? He was physically incapable of plumbing.

There might be other limitations as well, but clearly the first hurdle between LLM agents and plumbing is any sort of interaction with the physical world.

So a debate about AGI just becomes a debate about whether it includes interaction with the physical world and a billion other things. Anyone can redraw the semantic line anywhere that suits them.

chpatrick · 2025-08-08T23:09:30 1754694570

I think plumbing is an overly high bar. Your friend who lives on a different continent can't fix your plumbing either but they're still generally intelligent. It's only a fair comparison if we compare it to a human also communicating via some channel.

Jensson · 2025-08-09T07:34:43 1754724883

If they had 10 years to practice plumbing problems then they could walk you through it via video.

dsadfjasdf · 2025-08-08T18:08:17 1754676497

The rest of us still can't prove we our conscious, either.. remember?

therein · 2025-08-08T18:28:47 1754677727

Neither are even close to AGI. Here is something they can't do and won't be able to do for a very long time:

If you're inferring in English and ask it a question, it will never be able to pull from the knowledge it has ingested in another language. Humans are able to do this without relying on a neurotic inner voice spinning around in circles and doing manual translations.

This should be enough to arrive at the conclusion that there is no real insights in the model. It has no model of the world.

Jensson · 2025-08-09T07:37:03 1754725023

> If you're inferring in English and ask it a question, it will never be able to pull from the knowledge it has ingested in another language. Humans are able to do this without relying on a neurotic inner voice spinning around in circles and doing manual translations.

This is not true, this is the biggest strength of LLM that they are very language agnostic since they can parse things down to more general concepts. There are many things they are bad at but using things from other languages is not one of them.

therein · 2025-08-09T17:46:34 1754761594

It is true, and LLMs do no such things. You are getting that impression not because they are language agnostic across training and inference but because they are throwing multi-language text at it during training. Try asking it about nanomaterials in Chewa language.

OsrsNeedsf2P · 2025-08-08T16:41:39 1754671299

This article has zero substance

raincole · 2025-08-08T16:48:10 1754671690

It's just someone noticed that people are not happy with GPT5 release and came up with an apple-to-screech-owl comparison (two completely different kinds of models, one product ready and the other internal test only) to farm clicks.

aerhardt · 2025-08-08T16:50:03 1754671803

Why bother with substance in the era of vibes?

aydyn · 2025-08-08T17:28:41 1754674121

Sounds like someone needs to come up with VibesBench.

Maybe it could just be a distilled scoring of social media sentiment the day after announcement? The more positive hype, the higher the VibeScore.

olddustytrail · 2025-08-08T18:12:04 1754676724

> Sounds like someone needs to come up with VibesBench

Possibly. Have you grabbed that domain? Might be worth doing.

pyman · 2025-08-08T19:09:17 1754680157

Can't wait for Hollywood directors to start "vibing".

floren · 2025-08-08T16:46:12 1754671572

A substack article? Zero substance? Whaaaaaaaaaaaaaaaaat

theahura · 2025-08-08T17:04:37 1754672677

bastard_op · 2025-08-09T03:59:43 1754711983

I started finally using gpt about 6mo ago a lot to do useful things, it finally came far enough along. One day I thought after one of Google's big hype sessions about Gemini I should check it out too.

So I was working with a customer and needed something done in powershell to extract some customer data, as I normally hate windoze and don't touch it as a linux person. I told gemini what I wanted and to make me a script. It started trying teach me how to program powershell.

After laughing to myself and trying nicely a few times just frigging do it and make me (sandwich) script, it literally said something akin to "I can't do that Dave". It simply would not make me a script, instead trying to force me to RTFM. If I wanted to do that, I would have taken a windoze powershell class.

I just kinda stared at it for a minute, thinking what the hell google was thinking lately. I was a google fan, I have google glasses (plural) to prove it. Then I went back to gpt and never looked at gemini again and laugh when I hear of it.

Sad how google has fallen.

gman83 · 2025-08-09T10:19:37 1754734777

What are you talking about? Gemini 2.5 Pro with it's large context window was for months the best model, only surpassed by Claude Sonnet 4. https://polymarket.com/event/which-company-has-best-ai-model...

Pigalowda · 2025-08-09T12:06:17 1754741177

Is this a bit? Are you channeling your inner Jeff Albertson? I don’t feel like this is even real. Almost like a response from Gemini itself to give a parody anecdote about itself.

Cipater · 2025-08-09T12:29:27 1754742567

I had a similar experience trying out Gemini early this year where it would always say it couldn't do the thing I asked but could provide resources and/or walk me through doing the thing myself.

tunesmith · 2025-08-08T18:26:56 1754677616

I felt like it was getting somewhere and then it pivoted to the stupid graph thing, which I can't seem to escape. Anyway, I think it'll be really interesting to see how this settles out over the next few weeks, and how that'll contrast to what the 24-hour response has been.

My own very naive and underinformed sense: OpenAI doesn't have other revenue paths to fall back on like Google does. The GPT5 strategy really makes sense to me if I look at this as a market share strategy. They want to scale out like crazy, in a way that is affordable to them. If it's that cheap, then they must have put a ton of work in to some scaling effort that the other vendors just don't care about as much, whether due to loss-leader economics or VC funding. It really makes me wonder if OpenAI is sitting on something much better that also just happens to be much, much more expensive.

Overall, I'm weirdly impressed because if that was really their move here, it's a slight evidence point that shows that somewhere down in their guts, they do really seem to care about their original mission. For people other than power users, this might actually be a big step forward.

theahura · 2025-08-08T22:52:11 1754693531

I agree that they don't have other revenue paths and think that's a big issue for them. I disagree that this means they care about their original mission though; I think they're struggling to replicate their original insane step function model improvements and other players have caught up.

If you liked the general analysis of OpenAI and the AI space, you may appreciate https://open.substack.com/pub/theahura/p/tech-things-gemini-... Or https://open.substack.com/pub/theahura/p/tech-things-gpt-pro...

Which focus much more on macro analysis and less on memes

bko · 2025-08-08T16:42:16 1754671336

It's pretty incredible a model like Genie can deduce the laws of physics from mere observation of video. Even fluid dynamics which is a notoriously difficult problem. It's not obvious that this would happen or would even be possible from this kind of architecture. It's obviously doing something deep here.

As an aside, I think it's funny that the AI Doomer crowd ignores image and video AI models when it comes to AI models that will enslave humanity. It's not inconceivable that a video model would have a better understanding of the world than an LLM. So perhaps it would grow new capabilities and sprout some kind of intent. It's super-intelligence! Surely these models if trained long enough will deduce hypnosis or some similar kind of mind control and cause mass extinction events.

I mean, the only other explanation why LLMs are so scary and likely to be the AI that kills us all is that they're trained on a lot of sci-fi novels so sometimes they'll say things mimicking sentient life and express some kind of will. But obviously that's not true ;-)

gmueckl · 2025-08-08T17:04:59 1754672699

These models aren't rigorously deriving the future state of a system from a quantitative model based in physical theories. Their understanding of the natural environment around is in line the innate understanding that animals and humams have that is based on the experience of living in an environment that follows deterministic patterns. It is easy learn that a river flows faster in the middle by empirical observation. But that is not correlated with a deeper understanding of hydrodynmics.

theahura · 2025-08-08T22:48:14 1754693294

This is sorta semantic. What does "deeper understanding" mean? ML models are compression algorithms. But so is newtonian mechanics, which is essentially a compressed description of state space that we know falls apart at the extremes (black holes, quantum, etc). These are different in scale but not in kind

gmueckl · 2025-08-09T01:30:35 1754703035

It is absolutely not semantic.

A deeper understanding is acomplete enough understanding of the relevant quantitative theories (e.g. classical mechanics) and the ability to to apply them to a system (real or imagined) to the point where it can rigorously derive future states of the system given initial conditions quantitatively within the margins of error of any quantitative experimental results. This predictive ability is the defining core of scientific theories.

Looking at a thrown ball and running to where you feel that it might land is different.

cortesoft · 2025-08-08T17:37:40 1754674660

What is a deeper understanding of the laws of physics other than understanding the patterns?

ikiris · 2025-08-08T18:04:03 1754676243

You can't calculate gravity from the basic observation "rocks fall" without other data.

olddustytrail · 2025-08-08T18:15:55 1754676955

If you know the Earth is round you can guess at it.

ChrisMarshallNY · 2025-08-08T18:36:08 1754678168

If you watched Ex Machina, there was a little twist at the end, which basically showed that she ("it," really) was definitely a machine, and had no human "drivers."

I thought that was a clever stroke, and probably a good comment on how we'll be perceiving machines; and how they'll be perceiving us.

chairhairair · 2025-08-08T17:36:10 1754674570

I don't know how one would think doomers "ignore image and video AI models". They (Yudkowsky, Hinton, Kokotajlo, Scott Alexander) point at these things all the time.

reducesuffering · 2025-08-08T18:23:15 1754677395

It's completely apparent that HN dismissing doomers with strawmen is because these HN'ers simply don't even read their arguments and just handwave away based on vibes they heard through the grapevine

p0w3n3d · 2025-08-08T22:27:44 1754692064

What could be impressive more, would be creating a distilled offline model to download

jeremyjh · 2025-08-08T17:27:03 1754674023

> Imagine asking a model a question like “what's the weather in Tibet” and instead of doing something lame like check weather.com, it does something awesome like stimulate Tibet exactly so that it can tell you the weather based on the simulation.

Was where I stopped reading.

justonceokay · 2025-08-08T17:54:31 1754675671

We already automate away all possible human interaction. Maybe in the future we can automate away our senses themselves.

My roommate already looks at his weather app to see what to wear instead of putting his hand out the window. Simulating the weather instead of experiencing it is just the next logical step

gundmc · 2025-08-08T18:10:39 1754676639

When I get dressed in the morning, it's 58 degrees outside. Today there's a high of ~88. It's totally normal to look at the weather to determine what to wear.

bookofjoe · 2025-08-08T18:22:56 1754677376

I'm reminded of an article about Truman Capote I read sometime last century in which he related visiting a European princess at the Plaza Hotel in the dead of winter during an ongoing snowstorm. He entered her room completely covered in snow; she looked at him and asked, is it still snowing?

zb3 · 2025-08-08T16:41:50 1754671310

Is Genie available for me to try? No? Then I can't tell, because I won't blindly trust Google.

Remember Imagen? They advertised Imagen 4 level quality long before releasing the original Imagen model. Not falling for this again.

beepbooptheory · 2025-08-08T16:16:36 1754669796

> The goal of AGI is to make programs that can do lots of things.

Wait, is it?

theahura · 2025-08-08T22:44:43 1754693083

Tbh I picked possibly the least offensive definition for agi I could think of, since I mostly agree with other comments that it's entirely semantic. I've written in the past that I think we've already hit AGI for any reasonable definitions of "artificial", "general", and "intelligence"

rvnx · 2025-08-08T16:17:26 1754669846

We reached AGI about 30 years ago then

lm28469 · 2025-08-08T16:25:56 1754670356

That certainly how it feels to me. Every demo seems like it's presenting some kind of socially maladjusted silicon valley nerd's wet dream. Half of it doesn't interest non tech people, the other half seems designed for teenagers.

Look at this image of Zuckerberg demoing his new product: https://imgur.com/1naGLfp

Or gpt5 press release: "look at this shitty game it made", "look at the bars on this graph showing how we outperform some other model by 2% in a benchmark that doesn't actually represent anything"

mind-blight · 2025-08-08T16:33:54 1754670834

GPT-5 is a bit better -particularly around consistency - and a fair amount cheaper. For all of my use cases, that's a huge win.

Products using AI powered days processing (a lot of what I use it for) don't need mind blowing new features. I just want it to be better at summarizing and instruction following, and I want it to be cheaper. GPT-5 seems to knock all of that out of the park

benjiro · 2025-08-08T17:48:11 1754675291

> GPT-5 is a bit better -particularly around consistency - and a fair amount cheaper. For all of my use cases, that's a huge win.

What is more or less a natural evolution of LLMs... The thing is, where are my benefits as a developer?

If for instance CoPilot charges 1 Premium request for Claude and 1 Premium request for GPT-5, despite that GPT-5 is (with resource usage), supposed to be on a level of GPT 4.1 (a free model). Then (from my point of view) there is no gain.

So far from coding point of view, Claude does coding (often) still better. I made the comparison that Claude feels like a Senior dev, with years of experience, where GPT 5 feels like a academic professor, that is too focus on analytic presentation.

So while its nice to see more competition in the market, i still rank (with Copilot):

Claude > Gemini > GPT5 ... big gap ... GPT4.1 (beast mode) > GPT 4.1

LLM's are following the same progression these days like GPUs, or CPU ... Big jumps at first, then things slow down, you get more power efficiency but only marginal jumps on improvements.

Where we will see benefits, is specialized LLMs, for instance, Anthropic doing a good job for creating a programmer focused LLM. But even those gates are starting to get challenged by Chinese (open source) models, step by step.

GPT5 simply follows a trend. And within a few months, Anthropic will release something probably not much of a improvement over 4.0 but cheaper. Probably better with tool usage. And then comes GPT5.1, 6 months later, and ...

GPT-5.0 in my opinion, for a company with the funding that openAI has, needed to be beat the competition with much more impact.

mind-blight · 2025-08-08T21:30:27 1754688627

I'm not even considering the coding use case. It's been fine in cursor. I care about the days extraction and basic instruction following in my application - coding ability doesn't come into play.

For example, I want the model to be able to take a basic rule and identify what subset of given text fits into the rule. (E.g. find and extract all last names) 4o and 4.1 we're decent, but still left a lot to be desired. o4-mini was pretty good at not ambiguous cases. Getting a model that runs cheaper and is better at following instructions makes my product better and more profitable with a could lines of code change.

It's not emotionally revolutionary, but it hours a great sweet spot for a lot of business use cases

pton_xd · 2025-08-08T16:39:41 1754671181

> "look at this shitty game it made"

This is basically every agentic coding demo I've seen to date. It's the future but man we're still years and years away.

thewebguyd · 2025-08-08T16:18:43 1754669923

lol. The definition of AGI seems to change on the daily, and usually coincides with whatever the describer is trying to sell.

adeelk93 · 2025-08-08T16:29:31 1754670571

I’d amend that to - it coincides with whatever the describer is trying to get funding for

whimsicalism · 2025-08-08T18:29:56 1754677796

blogspam

SV_BubbleTime · 2025-08-08T16:04:20 1754669060

Geez… Make me pick between trusting Google, or trusting OpenAI… I’ll go with Anthropic.

sirbutters · 2025-08-08T16:16:12 1754669772

Honestly, same. Anthropic CEO radiates good vibes.

wagwang · 2025-08-08T16:56:35 1754672195

The anthropic CEO dooms all day about how AI is going to kill anyone and yet works on frontier models and gives them agentic freedom.

SV_BubbleTime · 2025-08-08T22:33:25 1754692405

I’ll be concerned what the CEO publicly says if the model doesn’t work anymore. Until then, he can go off my little pony and tinfoil hats and wild AGI predictions.

tekno45 · 2025-08-08T16:45:02 1754671502

yay! security and privacy are just VIBES!!!

thegrim33 · 2025-08-08T16:23:27 1754670207

I like how we've just collectively forgotten about the absolutely disastrous initial release of Gemini. Were the people responsible for that fired? Are they still still there making decisions? Why should I ever trust them and give them a second chance when I could just choose to use a competitor that doesn't have that history?

rvnx · 2025-08-08T16:24:23 1754670263

We did not forget this scam that was Google Bard, but still, it is the past now

echelon · 2025-08-08T16:39:31 1754671171

I know this is sarcasm, but a misstep like this by OpenAI will harm their future funding and hiring prospects.

They're supposed to be in the lead against a company 30x their size by revenue, and 10,000x their might. That lead is clearly slipping.

Despite ChatGPT penetration, it's not clear that OpenAI can compete toe to toe with a company that has distribution on every pane of glass.

While OpenAI has incredible revenue growth, they also have incredible spending and have raised at crazier and crazier valuations. It's a risky gamble, but one they're probably forced to take.

Meanwhile, Meta is hiring away all of their top talent. I'll bet that anyone that turned down offers is second guessing themselves right now.

raincole · 2025-08-08T16:07:50 1754669270

So where can I try out Genie 3? Did the author try it out?

If not it's just vibe^2 blogging.

password54321 · 2025-08-08T16:26:56 1754670416

Basically free advertising for something not released.

echelon · 2025-08-08T16:43:39 1754671419

Genie 3 just had the Sora treatment.

Lots of press for something by invitation only.

This probably means it takes an incredible amount of resources to power in its current form. Possibly tens of H100s (or TPUs) simultaneously. It'll take time to turn that from a wasteful tech preview into a scaleable product.

But it's clearly impressive, and it did the job of making what OpenAI did look insignificant.