Claude Opus 4.6

ck_one · 2026-02-05T21:38:32 1770327512

Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books.

All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it.) For now you can fit the first 4 books (~733K tokens).

Results: Opus 4.6 found 49 out of 50 officially documented spells across those 4 books. The only miss was "Slugulus Eructo" (a vomiting spell).

Freaking impressive!

matt_lo · 2026-02-06T05:41:36 1770356496

Use AI to rewrite all the spells from all the books, then see if AI can detect the rewritten ones. This will ensure it's not pulling from its training data set.

gbalduzzi · 2026-02-06T06:35:18 1770359718

Neat idea, but why should I use AI for a find and replace?

It feels like shooting a fly with a bazooka

jack_pp · 2026-02-06T08:08:23 1770365303

it's like hiring someone to come pick up your trash from your house and put it on the curb.

it's fine if you're disabled

miohtama · 2026-02-06T06:51:12 1770360672

Bazooka guarantees the hit

xenodium · 2026-02-06T07:46:41 1770364001

I like LLMs, but guarantees in LLMs are... you know... not guaranteed ;)

throwaway290 · 2026-02-06T08:20:42 1770366042

I think that was the point

bilekas · 2026-02-06T08:15:25 1770365725

You're missing the point: it's only a testing exercise for the new model.

happyraul · 2026-02-06T08:38:22 1770367102

No, the point is that you can set up the testing exercise without using an LLM to do a simple find and replace.

luckydata · 2026-02-06T08:12:29 1770365549

do you know all the spells you're looking for from memory?

LeoPanthera · 2026-02-06T07:45:51 1770363951

That won't help. The AI replacing them will probably miss the same ones as the AI finding them.

xiomrze · 2026-02-05T22:06:41 1770329201

Honest question, how do you know if it's pulling from context vs from memory?

If I use Opus 4.6 with Extended Thinking (Web Search disabled, no books attached), it answers with 130 spells.

ozim · 2026-02-05T22:24:40 1770330280

Exactly. There was this study where they tried to make an LLM reproduce an HP book word for word, like giving it the first sentences and letting it cook.

Basically they managed, with some tricks, to get 99% word for word. The tricks were needed to bypass the security measures that are in place for exactly that reason: to stop people from retrieving training material.

pron · 2026-02-05T23:06:11 1770332771

This reminds me of https://en.wikipedia.org/wiki/Pierre_Menard,_Author_of_the_Q... :

> Borges's "review" describes Menard's efforts to go beyond a mere "translation" of Don Quixote by immersing himself so thoroughly in the work as to be able to actually "re-create" it, line for line, in the original 17th-century Spanish. Thus, Pierre Menard is often used to raise questions and discussion about the nature of authorship, appropriation, and interpretation.

ck_one · 2026-02-05T22:35:00 1770330900

Do you remember how to get around those tricks?

djhn · 2026-02-05T22:53:43 1770332023

This is the paper: https://arxiv.org/abs/2601.02671

Grok and Deepmind IIRC didn’t require tricks.

eek2121 · 2026-02-05T23:13:19 1770333199

This really makes me want to try something similar with content from my own website.

I shut it down a while ago because bot traffic overtook human traffic. The site had quite a bit of human traffic (enough to bring in a few hundred bucks a month in ad revenue, and a few hundred more in subscription revenue). However, the AI scrapers really started ramping up, and the only way I could realistically continue would be to pay a lot more for hosting/infrastructure.

I had put a ton of time into building out content...thousands of hours, only to have scrapers ignore robots, bypass cloudflare (they didn't have any AI products at the time), and overwhelm my measly infrastructure.

Even now, with the domain pointed at NOTHING, it gets almost 100,000 hits a month. There is NO SERVER on the other end. It is a dead link. The stats come from Cloudflare, where the domain name is hosted.

I'm curious if there are any lawyers who'd be willing to take someone like me on contingency for a large copyright lawsuit.

apsurd · 2026-02-06T05:42:34 1770356554

Can we help get your infra cost down to negligible? I'm thinking things like pre-generated static pages and CDNs. I won't assume you hadn't thought of this before, but I'd like to understand more about where your non-trivial infra costs come from.

djhn · 2026-02-06T06:39:52 1770359992

I would be tempted to try and optimise this as well. 100000 hits on an empty domain and ~200 dollars worth of bot traffic sounds wild. Are they using JS-enabled browsers or sim farms that download and re-download images and videos as well?

camdenreslink · 2026-02-06T00:14:08 1770336848

The new cloudflare products for blocking bots and AI scrapers might be worth a shot if you put so much work into the content.

petercooper · 2026-02-05T22:18:20 1770329900

One possible trick could be to search and replace them all with nonsense alternatives then see if it extracts those.

andai · 2026-02-05T22:29:47 1770330587

That might actually boost performance since attention pays attention to stuff that stands out. If I make a typo, the models often hyperfixate on it.

jazzyjackson · 2026-02-06T03:48:57 1770349737

A fine instruction following task but if harry potter is in the weights of the neural net, it's going to mix some of the real ones with the alternates.

ck_one · 2026-02-05T22:34:11 1770330851

When I tried it without web search so only internal knowledge it missed ~15 spells.

clanker_fluffer · 2026-02-05T22:16:36 1770329796

What was your prompt?

golfer · 2026-02-05T22:41:13 1770331273

There's lots of websites that list the spells. It's well documented. Could Claude simply be regurgitating knowledge from the web? Example:

https://harrypotter.fandom.com/wiki/List_of_spells

qwertytyyuu · 2026-02-06T03:03:50 1770347030

Hmm… maybe he could switch out all the spell names for slightly different ones and see how that goes

ck_one · 2026-02-05T22:44:33 1770331473

It didn't use web search. But for sure it has some internal knowledge already. It's not a perfect needle-in-a-haystack problem, but Gemini Flash was much worse when I tested it last time.

viraptor · 2026-02-05T22:57:55 1770332275

If you want to really test this, search/replace the names with your own random ones and see if it lists those.

Otherwise, LLMs have most of the books memorised anyway: https://arstechnica.com/features/2025/06/study-metas-llama-3...
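For anyone who wants to try it, the search/replace test could be sketched roughly like this. Everything here is an assumption for illustration: the spell list is whatever reference list you trust, `make_nonsense_name`, `replace_spells` and `score` are made-up helper names, and the fixed seed only keeps the mapping reproducible.

```python
import random
import re
import string


def make_nonsense_name(rng, length=9):
    """Hypothetical helper: build a random capitalized pseudo-word."""
    letters = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return letters.capitalize()


def mentions(name, text):
    """True if `name` appears in `text` as a whole word/phrase, ignoring case."""
    return re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE) is not None


def replace_spells(text, spells, seed=0):
    """Swap each known spell name for a random one; return new text and the mapping."""
    rng = random.Random(seed)
    mapping = {}
    for spell in spells:
        name = make_nonsense_name(rng)
        while name in mapping.values():  # keep replacement names unique
            name = make_nonsense_name(rng)
        mapping[spell] = name
    # Longest names first so "Lumos Maxima" is matched before "Lumos".
    ordered = sorted(spells, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(s) for s in ordered) + r")\b",
        re.IGNORECASE,
    )
    lookup = {s.lower(): mapping[s] for s in spells}
    new_text = pattern.sub(lambda m: lookup[m.group(0).lower()], text)
    return new_text, mapping


def score(model_answer, mapping):
    """Return (replacement names the model listed, original names it leaked)."""
    found_new = [new for new in mapping.values() if mentions(new, model_answer)]
    leaked_old = [old for old in mapping if mentions(old, model_answer)]
    return found_new, leaked_old
```

If the model lists the replacement names, it read them from context. If original names show up in `leaked_old`, those came from memory, since they are no longer in the text.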

jazzyjackson · 2026-02-06T03:45:50 1770349550

Being that it has the books memorized (huh, just learned another US/UK spelling quirk), I would suppose feeding it the books with altered spells would get you a confused mishmash of data in the context and data in the weights.

ribosometronome · 2026-02-06T00:01:45 1770336105

Couldn't you just ask the LLM which 50 (or 49) spells appear in the first four Harry Potter books without the data for comparison?

viraptor · 2026-02-06T00:10:33 1770336633

It's not going to be as consistent. It may get bored of listing them (you know how you can ask for many examples and get 10 in response?), or omit some minor ones for other reasons.

By replacing the names with something unique, you'll get much more certainty.

Grimblewald · 2026-02-06T00:29:27 1770337767

It might not work well, but by navigating to a very Harry Potter dominant part of latent space by preconditioning on the books, you make it more likely to get good results. An example would be taking a base model and prompting "what follows is the book 'X'": it may or may not regurgitate the book correctly. Give it a chunk of the first chapter and let it regurgitate from there and you tend to get fairly faithful recovery, especially for things on Gutenberg.

So it might be there: by preconditioning latent space to the area of the Harry Potter world, you make it much more probable that the full spell list is regurgitated from online resources that were also read, while asking naively might get it sometimes, and sometimes not.

The books act like a hypnotic trigger, and may not represent a generalized skill. Hence why replacing with random words would help clarify: if you still get the original spells, regurgitation is confirmed; if it finds the replaced spells, it could be doing what we think. An even better test would be to replace all spell references AND jumble chapters around. This way it can't even "know" where to "look" for the spell names from training.
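The chapter-jumbling half of that test could look something like the sketch below. It assumes every chapter starts on a line beginning with the word CHAPTER, which depends entirely on how your copy of the text is formatted; `jumble_chapters` is a made-up name.

```python
import random
import re


def jumble_chapters(book_text, seed=0):
    """Split on lines starting with 'CHAPTER' and return the chapters in random order.

    Anything before the first chapter heading (title page etc.) stays at the front.
    Returns the reordered text and the chapter order that was used.
    """
    # Zero-width split: each piece after the first starts with a chapter heading.
    parts = re.split(r"(?m)^(?=CHAPTER\b)", book_text)
    front_matter, chapters = parts[0], parts[1:]
    rng = random.Random(seed)
    order = list(range(len(chapters)))
    rng.shuffle(order)
    shuffled = "".join(chapters[i] for i in order)
    return front_matter + shuffled, order
```

Keeping the returned `order` around lets you map anything the model reports back to the original chapter positions.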

angst · 2026-02-06T02:29:37 1770344977

btw it recalls 42 when i asked. (without web search)

full transcript: pastebin.com/sMcVkuwd

f33d5173 · 2026-02-06T02:44:56 1770345896

Not sure how they're being counted, but that adds up to 46 with the pair spells counted separately. But then nox is counted twice, so maybe 45.

heavyset_go · 2026-02-06T02:13:16 1770343996

No, because you don't know the magic spell (forgive me) of context that can be used to "unlock" that information if it's stored in the NN.

I mean, you can try, but it won't be a definitive answer as to whether that knowledge truly exists or doesn't exist as it is encoded into the NN. It could take a lot of context from the books themselves to get to it.

joshmlewis · 2026-02-05T22:57:41 1770332261

I think the OP was implying that it's probably already baked into its training data. No need to search the web for that.

obirunda · 2026-02-06T01:17:03 1770340623

This underestimates how much of the Internet is actually compressed into and is an integral part of the model's weights. Gemini 2.5 can recite the first Harry Potter book verbatim for over 75% of the book.

NiloCK · 2026-02-06T02:42:10 1770345730

I'm getting astrology when I search for this. Any links on this?

f33d5173 · 2026-02-06T02:48:28 1770346108

Iirc it's not quite true. 75% of the book is more likely to appear than you would expect by chance if prompted with the prior tokens. This suggests that it has the book encoded in its weights, but you can't actually recover it by saying "recite harry potter for me".

jdminhbg · 2026-02-06T02:54:11 1770346451

Do you happen to know, is that because it can’t recite Harry Potter, or because it’s been instructed not to recite Harry Potter?

jazzyjackson · 2026-02-06T03:47:36 1770349656

It's a matter of token likelihood... as a continuation, the rest of chapter one is highly likely to follow the first paragraph.

The full text of Chapter One is not the only/likeliest possible response to "recite chapter one of harry potter for me"

jamesfinlayson · 2026-02-06T04:53:45 1770353625

Instructed not to was my understanding.

obirunda · 2026-02-06T03:22:43 1770348163

https://arxiv.org/abs/2601.02671?hl=en-US

altmanaltman · 2026-02-06T05:58:49 1770357529

> But for sure it has some internal knowledge already.

Pretty sure the books had to be included in its training material in full text. It's one of the most popular book series ever created, of course they would train on it. So "some" is an understatement in this case.

Trasmatta · 2026-02-06T00:54:01 1770339241

Do the same experiment in the Claude web UI. And explicitly turn web searches off. It got almost all of them for me over a couple of prompts. That stuff is already in its training data.

soulofmischief · 2026-02-05T23:23:30 1770333810

The only worthwhile version of this test involves previously unseen data that could not have been in the training set. Otherwise the results could be inaccurate to the point of harmful.

IAmGraydon · 2026-02-06T03:39:07 1770349147

I'm not sure what your knowledge level of the inner workings of LLMs is, but a model doesn't need search or even an internet connection to "know" the information if it's in its training dataset. In your example, it's almost guaranteed that the LLM isn't searching books - it's just referencing one of the hundreds of lists of those spells in its training data.

This is the LLM's magic trick that has everyone fooled into thinking they're intelligent - it can very convincingly cosplay an intelligent being by parroting an intelligent being's output. This is equivalent to making a recording of Elvis, playing it back, and believing that Elvis is actually alive inside of the playback device. And let's face it, if a time traveler brought a modern music playback device back hundreds of years and showed it to everyone, they WOULD think that. Why? Because they have not become accustomed to the technology and have no concept of how it could work. The same is true of LLMs - the technology was thrust on society so quickly that there was no time for people to adjust and understand its inner workings, so most people think it's actually doing something akin to intelligence. The truth is it's just as far from intelligence as your music playback device is from having Elvis inside of it.

eek2121 · 2026-02-05T22:58:03 1770332283

Honestly? My advice would be to cook something custom up! You don't need to do all the text yourself. Maybe have AI spew out a bunch of text, or take obscure existing text and insert hidden phrases here or there.

Shoot, I'd even go so far as to write a script that takes in a bunch of text, reorganizes sentences, and outputs them in a random order with the secrets. Kind of like a "Where's Waldo?", but for text
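Something like this is what I have in mind, as a rough sketch only. The sentence splitting is deliberately naive, and `build_haystack` and `recall` are names I made up for illustration:

```python
import random
import re


def build_haystack(text, secrets, seed=0):
    """Shuffle the sentences of `text` and drop each secret in at a random position."""
    rng = random.Random(seed)
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    rng.shuffle(sentences)
    for secret in secrets:
        position = rng.randint(0, len(sentences))  # inclusive, so the end is allowed
        sentences.insert(position, secret)
    return " ".join(sentences)


def recall(model_answer, secrets):
    """Fraction of secrets that appear verbatim in the model's answer."""
    if not secrets:
        return 0.0
    hits = sum(1 for secret in secrets if secret in model_answer)
    return hits / len(secrets)
```

Because the secrets are your own invention, a hit can only come from the context, not from anything the model memorized.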

Just a few casual thoughts.

I'm actually thinking about coming up with some interesting coding exercises that I can run across all models. I know we already have benchmarks, however some of the recent work I've done has really shown huge weak points in every model I've run them on.

clhodapp · 2026-02-05T23:43:57 1770335037

Having AI spew it might suffer from the fact that the spew itself is influenced by AI's weights. I think your best bet would be to use a new human-authored work that was released after the model's training cutoff.

meroes · 2026-02-05T22:06:52 1770329212

What is this supposed to show exactly? Those books have been fed into LLMs for years and there's even likely specific RLHF on extracting spells from HP.

muzani · 2026-02-05T22:33:17 1770330797

There was a time when I put the Ea-nasir text into base64 and asked AI to convert it. Remarkably, it identified the correct text but pulled the most popular translation of the text rather than the one I gave it.

majewsky · 2026-02-06T01:28:34 1770341314

Sucks that you got a really shitty response to your prompt. If I were you, the model provider would be receiving my complaint via clay tablet right away.

rvz · 2026-02-05T22:24:21 1770330261

> What is this supposed to show exactly?

Nothing.

You can be sure that this was already known in the training data of PDFs, books and websites that Anthropic scraped to train Claude on; hence 'documented'. This is why tests like the one the OP just did are meaningless.

Such "benchmarks" are performative for VCs, who do not ask why the research and testing isn't done independently but is almost always done by the labs' own in-house researchers.

dudewhocodes · 2026-02-06T08:12:23 1770365543

There are websites with the spells listed... which makes this a search problem. Why is an LLM used here?

bilekas · 2026-02-06T08:14:21 1770365661

It's just a benchmark test exercise.

zamadatix · 2026-02-05T21:51:45 1770328305

To be fair, I don't think "Slugulus Eructo" (the name) is actually in the books. This is what's in my copy:

> The smug look on Malfoy’s face flickered.

> “No one asked your opinion, you filthy little Mudblood,” he spat.

> Harry knew at once that Malfoy had said something really bad because there was an instant uproar at his words. Flint had to dive in front of Malfoy to stop Fred and George jumping on him, Alicia shrieked, “How dare you!”, and Ron plunged his hand into his robes, pulled out his wand, yelling, “You’ll pay for that one, Malfoy!” and pointed it furiously under Flint’s arm at Malfoy’s face.

> A loud bang echoed around the stadium and a jet of green light shot out of the wrong end of Ron’s wand, hitting him in the stomach and sending him reeling backward onto the grass.

> “Ron! Ron! Are you all right?” squealed Hermione.

> Ron opened his mouth to speak, but no words came out. Instead he gave an almighty belch and several slugs dribbled out of his mouth onto his lap.

sobjornstad · 2026-02-05T23:56:13 1770335773

I have a vague recollection that it might come up named as such in Half-Blood Prince, written in Snape's old potions textbook?

In support of that hypothesis, the Fandom site lists it as “mentioned” in Half-Blood Prince, but it says nothing else and I'm traveling and don't have a copy to check, so not sure.

zamadatix · 2026-02-06T01:48:40 1770342520

Hmm, I don't get a hit for "slugulus" or "eructo" (case insensitive) in any of the 7. Interestingly, two mentions of "vomit" are in book 6, but neither in reference to slugs (plenty of Slughorn of course!). Book 5 was the only other one where a related hit came up:

> Ron nodded but did not speak. Harry was reminded forcibly of the time that Ron had accidentally put a slug-vomiting charm on himself. He looked just as pale and sweaty as he had done then, not to mention as reluctant to open his mouth.

There could be something with regional variants but I'm doubtful as the Fandom site uses LEGO Harry Potter: Years 1-4 as the citation of the spell instead of a book.

Maybe the real LLM is the universe and we're figuring this out for someone on Slacker News a level up!
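The check above is easy to reproduce if you have the texts as plain strings. A minimal sketch, assuming a `books` dict mapping title to full text; `find_terms` and its `context` parameter are made-up names:

```python
import re


def find_terms(books, terms, context=40):
    """Case-insensitive search; returns (title, term, snippet) for every hit."""
    hits = []
    for title, text in books.items():
        for term in terms:
            for m in re.finditer(re.escape(term), text, re.IGNORECASE):
                # Keep a little surrounding text so each hit can be eyeballed.
                start = max(0, m.start() - context)
                end = min(len(text), m.end() + context)
                hits.append((title, term, text[start:end]))
    return hits
```

An empty result for a term across all books is what "no hit" means here; the snippets help weed out false positives like Slughorn.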

ck_one · 2026-02-05T22:30:35 1770330635

Then it's fair that it didn't find it

muzani · 2026-02-05T23:06:15 1770332775

There's a benchmark which works similarly but they ask harder questions, also based on books https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/o...

I guess they have to add more questions as these context windows get bigger.

kybernetikos · 2026-02-05T23:49:59 1770335399

I recently got junie to code me up an MCP for accessing my calibre library. https://www.npmjs.com/package/access-calibre

My standard test for that was "Who ends up with Bilbo's buttons?"

dwa3592 · 2026-02-05T23:30:00 1770334200

have another LLM (gemini, chatgpt) make up 50 new spells. insert those and test and maybe report here :)

dom96 · 2026-02-06T00:05:39 1770336339

I often wonder how much of the Harry Potter books were used in the training. How long before some LLM is able to regurgitate full HP books without access to the internet?

bartman · 2026-02-05T22:32:54 1770330774

Have you by any chance tried this with GPT 4.1 too (also 1M context)?

siwatanejo · 2026-02-06T01:31:27 1770341487

> All 7 books come to ~1.75M tokens

How do you know? Each word is one token?

koakuma-chan · 2026-02-06T01:49:02 1770342542

You can download the books and run them through a tokenizer. I did that half a year ago and got ~2M.
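Roughly like this, as a sketch. The `crude_tokenize` function below is only a stand-in that splits words and punctuation; it is NOT any model's real tokenizer, so pass in the real one via the `tokenize` argument if you want meaningful numbers. All function names are made up.

```python
import re


def crude_tokenize(text):
    """Stand-in tokenizer: words and punctuation marks. Not a real model tokenizer."""
    return re.findall(r"\w+|[^\w\s]", text)


def count_tokens(texts, tokenize=crude_tokenize):
    """Return per-document token counts using whichever tokenizer is passed in."""
    return {name: len(tokenize(body)) for name, body in texts.items()}


def books_that_fit(counts, context_limit):
    """Take documents in order until the next one would exceed the limit."""
    total, fitting = 0, []
    for name, n in counts.items():
        if total + n > context_limit:
            break
        total += n
        fitting.append(name)
    return fitting, total
```

Different tokenizers give noticeably different counts for the same text, which is one plausible reason for the gap between ~1.75M and ~2M.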

LanceJones · 2026-02-05T22:34:48 1770330888

Assuming this experiment involved isolating the LLM from its training set?

dr_dshiv · 2026-02-06T07:56:56 1770364616

Comparison to another model?

irishcoffee · 2026-02-05T23:40:07 1770334807

The top comment is about finding basterized latin words from childrens books. The future is here.

mhink · 2026-02-06T03:58:13 1770350293

> basterized

And yet, it's still somewhat better than the Hacker News comment using bastardized English words.

Geste · 2026-02-05T23:51:21 1770335481

I'll have some of that coffee too. This is quite a sad time we're living in, where this is a proper use of our limited resources.

guluarte · 2026-02-05T22:19:09 1770329949

you can get the same result just asking opus/gpt, it is probably internalized knowledge from reddit or similar sites.

ck_one · 2026-02-05T22:33:19 1770330799

If you just ask it you don't get the same result. Around 13 spells were missing when I just prompted Opus 4.6 without the books as context.

guluarte · 2026-02-06T03:05:18 1770347118

because it is non-deterministic, i just did and got 50 spells.

prompt: finding every spell in all Harry Potter books in the first 4 books, only the officially documented spells across those 4 books, there are 50, without using websearch

Here's my best compilation from memory, organized by book. I'll be upfront: getting to exactly 50 is tricky without cross-referencing, so some might be debatable or I might miss a couple. Here goes:

---

*Book 1 - Philosopher's/Sorcerer's Stone*

1. *Lumos* - produces light
2. *Nox* - extinguishes Lumos
3. *Alohomora* - unlocks doors
4. *Wingardium Leviosa* - levitation
5. *Petrificus Totalus* - full body-bind
6. *Locomotor Mortis* - leg-locker curse
7. *Incendio* - produces fire
8. *Reparo* - mends broken objects

*Book 2 - Chamber of Secrets*

9. *Expelliarmus* - disarms opponent
10. *Rictusempra* - tickling charm
11. *Tarantallegra* - forces target to dance
12. *Serpensortia* - conjures a snake
13. *Finite Incantatem* - ends spell effects
14. *Obliviate* - erases memories
15. *Aparecium* - reveals invisible ink
16. *Immobulus* - freezes target in place
17. *Homorphus* - mentioned by Lockhart (werewolf cure claim)
18. *Peskipiksi Pesternomi* - Lockhart's failed pixie spell

*Book 3 - Prisoner of Azkaban*

19. *Expecto Patronum* - produces a Patronus
20. *Riddikulus* - repels a Boggart
21. *Waddiwasi* - shoots objects at target
22. *Impervius* - waterproofing charm
23. *Dissendium* - opens secret passage
24. *Ferula* - conjures bandages/splint
25. *Mobilicorpus* - moves a body
26. *Lumos Maxima* - intense light

*Book 4 - Goblet of Fire*

27. *Accio* - summoning charm
28. *Avada Kedavra* - killing curse
29. *Crucio* - Cruciatus curse (torture)
30. *Imperio* - Imperius curse (control)
31. *Stupefy* - stunning spell
32. *Engorgio* - enlarges target
33. *Reducio* - shrinks target
34. *Sonorus* - amplifies voice
35. *Quietus* - reverses Sonorus
36. *Morsmordre* - conjures the Dark Mark
37. *Priori Incantatem* - reveals last spell cast
38. *Deletrius* - erases magical residue
39. *Densaugeo* - enlarges teeth
40. *Furnunculus* - causes boils
41. *Impedimenta* - slows/stops target
42. *Reducto* - blasts solid objects
43. *Diffindo* - severing charm
44. *Relashio* - releases sparks/grip
45. *Orchideous* - conjures flowers
46. *Avis* - conjures birds
47. *Point Me* - Four-Point Spell (compass)
48. *Ennervate* - revives stunned person
49. *Protego* - shield charm
50. *Conjunctivitis Curse* - affects eyesight (Krum on the dragon)

---

A few caveats: some of these (like Lumos Maxima, Homorphus, Peskipiksi Pesternomi) are borderline since they're either mentioned rather than properly cast, or might be film additions that bleed into memory. The Conjunctivitis Curse is described but its incantation isn't explicitly given in the text. And Protego might technically first appear with its incantation in Book 5 during DA practice rather than Book 4.

If you want, I can turn this into a spreadsheet or document with columns for spell name, effect, who casts it, and which chapter.

TheRealPomax · 2026-02-06T00:00:42 1770336042

That doesn't seem a super useful test for a model that's optimized for programming?

huangmeng · 2026-02-06T02:31:47 1770345107

you are rich

IhateAI · 2026-02-06T01:29:21 1770341361

like I often say, these tools are mostly useful for people to do magic tricks on themselves (and to convince C-suites that they can lower pay, and reduce staff if they pay Anthropic half their engineering budget lmao)

gizmodo59 · 2026-02-05T18:14:01 1770315241

5.3 codex https://openai.com/index/introducing-gpt-5-3-codex/ crushes with a 77.3% in Terminal Bench. The shortest lived lead in less than 35 minutes. What a time to be alive!

callamdelaney · 2026-02-06T08:43:17 1770367397

Anthropic models generally are right first time for me. Chatgpt and Gemini are often way, way out with some fundamental misunderstanding of the task at hand.

wasmainiac · 2026-02-05T19:10:52 1770318652

Dumb question. Can these benchmarks be trusted when the model performance tends to vary depending on the hours and load on OpenAI’s servers? How do I know I’m not getting a severe penalty for chatting at the wrong time. Or even, are the models best after launch then slowly eroded away at to more economical settings after the hype wears off?

tedsanders · 2026-02-05T20:09:42 1770322182

We don't vary our model quality with time of day or load (beyond negligible non-determinism). It's the same weights all day long with no quantization or other gimmicks. They can get slower under heavy load, though.

(I'm from OpenAI.)

wasmainiac · 2026-02-05T22:50:53 1770331853

Thanks for the response, I appreciate it. I do notice variation in quality throughout the day. I use it primarily for searching documentation since it's faster than Google in most cases. Often it is on point, but it also seems off at times, inaccurate or shallow maybe. In some cases I just end the session.

nl · 2026-02-05T23:02:33 1770332553

Usually I find this kind of variation is due to context management.

Accuracy can decrease at large context sizes. OpenAI's compaction handles this better than anyone else, but it's still an issue.

If you are seeing this kind of thing start a new chat and re-run the same query. You'll usually see an improvement.

wasmainiac · 2026-02-06T06:14:09 1770358449

I don't think so. I am aware that large contexts impact performance. In long chats an old topic will sometimes be brought up in new responses, and the direction of the model is not as focused.

Regardless I tend to use new chats often.

repeekad · 2026-02-06T01:38:13 1770341893

This is called context rot

charcircuit · 2026-02-06T06:04:56 1770357896

I thought context rot was only for long distance queries.

robertclaus · 2026-02-06T06:11:56 1770358316

Hi Ted! Small world to see you here!

GorbachevyChase · 2026-02-06T01:35:37 1770341737

Hi Ted. I think that language models are great, and they’ve enabled me to do passion projects I never would have attempted before. I just want to say thanks.

zamadatix · 2026-02-05T21:43:38 1770327818

I appreciate you taking the time to respond to these kinds of questions the last few days.

Trufa · 2026-02-05T20:18:40 1770322720

Can you be more specific than this? does it vary in time from launch of a model to the next few months, beyond tinkering and optimization?

tedsanders · 2026-02-05T21:11:40 1770325900

Yeah, happy to be more specific. No intention of making any technically true but misleading statements.

The following are true:

- In our API, we don't change model weights or model behavior over time (e.g., by time of day, or weeks/months after release)

- Tiny caveats include: there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware, bugs or API downtime can obviously change behavior, heavy load can slow down speeds, and this of course doesn't apply to the 'unpinned' models that are clearly supposed to change over time (e.g., xxx-latest). But we don't do any quantization or routing gimmicks that would change model weights.

- In ChatGPT and Codex CLI, model behavior can change over time (e.g., we might change a tool, update a system prompt, tweak default thinking time, run an A/B test, or ship other updates); we try to be transparent with our changelogs (listed below) but to be honest not every small change gets logged here. But even here we're not doing any gimmicks to cut quality by time of day or intentionally dumb down models after launch. Model behavior can change though, as can the product / prompt / harness.

ChatGPT release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...

Codex changelog: https://developers.openai.com/codex/changelog/

Codex CLI commit history: https://github.com/openai/codex/commits/main/
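On the non-associative math caveat: floating-point addition can give slightly different results depending on the order of operations, so a different batch layout or reduction order can shift the low bits. A tiny illustration in plain Python, showing only the arithmetic effect and not how any serving stack actually batches (the `naive_sum` helper is made up for the demo):

```python
import math


def naive_sum(values):
    """Plain left-to-right accumulation, with no compensation tricks."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same three numbers, two groupings, two different answers.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)  # False

# Order matters even more when magnitudes differ wildly.
values = [1e16, 1.0, -1e16, 1.0]
forward = naive_sum(values)
reordered = naive_sum(sorted(values))
exact = math.fsum(values)  # correctly rounded reference sum
print(forward, reordered, exact)
```

The three sums in the second half disagree even though the inputs are identical, which is the kind of small numerical drift the caveat refers to.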

Trufa · 2026-02-05T23:37:11 1770334631

I ask unironically, then: am I imagining that models are great when they start and degrade over time?

I've had this perceived experience so many times, and while of course it's almost impossible to be objective about this, it just seems so in your face.

I don't rule out novelty wearing off, plus getting used to it, plus psychological factors. Do you have any takes on this?

jason_oster · 2026-02-06T01:21:22 1770340882

You might be susceptible to the honeymoon effect. If you have ever felt a dopamine rush when learning a new programming language or framework, this might be a good indication.

Once the honeymoon wears off, the tool is the same, but you get less satisfaction from it.

Just a guess! Not trying to psychoanalyze anyone.

jychang · 2026-02-05T21:30:20 1770327020

What about the juice variable?

https://www.reddit.com/r/OpenAI/comments/1qv77lq/chatgpt_low...

tedsanders · 2026-02-05T21:43:21 1770327801

Yep, we recently sped up default thinking times in ChatGPT, as now documented in the release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...

The intention was purely making the product experience better, based on common feedback from people (including myself) that wait times were too long. Cost was not a goal here.

If you still want the higher reliability of longer thinking times, that option is not gone. You can manually select Extended (or Heavy, if you're a Pro user). It's the same as at launch (though we did inadvertently drop it last month and restored it yesterday after Tibor and others pointed it out).

tgrowazay · 2026-02-05T21:40:54 1770327654

Isn’t that just how many steps at most a reasoning model should do?

ComplexSystems · 2026-02-05T21:22:47 1770326567

Do you ever replace ChatGPT models with cheaper, distilled, quantized, etc ones to save cost?

tedsanders · 2026-02-06T06:52:18 1770360738

We do care about cost, of course. If money didn't matter, everyone would get infinite rate limits, 10M context windows, and free subscriptions. So if we make new models more efficient without nerfing them, that's great. And that's generally what's happened over the past few years. If you look at GPT-4 (from 2023), it was far less efficient than today's models, which meant it had slower latency, lower rate limits, and tiny context windows (I think it might have been like 4K originally, which sounds insanely low now). Today, GPT-5 Thinking is way more efficient than GPT-4 was, but it's also way more useful and way more reliable. So we're big fans of efficiency as long as it doesn't nerf the utility of the models. The more efficient the models are, the more we can crank up speeds and rate limits and context windows.

That said, there are definitely cases where we intentionally trade off intelligence for greater efficiency. For example, we never made GPT-4.5 the default model in ChatGPT, even though it was an awesome model at writing and other tasks, because it was quite costly to serve and the juice wasn't worth the squeeze for the average person (no one wants to get rate limited after 10 messages). A second example: in our API, we intentionally serve dumber mini and nano models for developers who prioritize speed and cost. A third example: we recently reduced the default thinking times in ChatGPT to speed up the times that people were having to wait for answers, which in a sense is a bit of a nerf, though this decision was purely about listening to feedback to make ChatGPT better and had nothing to do with cost (and for the people who want longer thinking times, they can still manually select Extended/Heavy).

I'm not going to comment on the specific techniques used to make GPT-5 so much more efficient than GPT-4, but I will say that we don't do any gimmicks like nerfing by time of day or nerfing after launch. And when we do make newer models more efficient than older models, it mostly gets returned to people in the form of better speeds, rate limits, context windows, and new features.

jghn · 2026-02-05T21:28:59 1770326939

He literally said no to this in his GP post

joshvm · 2026-02-05T20:58:50 1770325130

My gut feeling is that performance is more heavily affected by harnesses which get updated frequently. This would explain why people feel that Claude is sometimes more stupid - that's actually accurate phrasing, because Sonnet is probably unchanged. Unless Anthropic also makes small A/B adjustments to weights and technically claims they don't do dynamic degradation/quantization based on load. Either way, both affect the quality of your responses.

It's worth checking different versions of Claude Code, and updating your tools if you don't do it automatically. Also run the same prompts through VS Code, Cursor, Claude Code in terminal, etc. You can get very different model responses based on the system prompt, what context is passed via the harness, how the rules are loaded and all sorts of minor tweaks.

If you make raw API calls and see behavioural changes over time, that would be another concern.

Someone1234 · 2026-02-05T20:29:28 1770323368

Specifically including routing (i.e. which model you route to based on load/ToD)?

PS - I appreciate you coming here and commenting!

hhh · 2026-02-05T20:32:14 1770323534

There is no routing with the API, or when you choose a specific model in ChatGPT.

zwaps · 2026-02-06T06:09:51 1770358191

In the past it seemed there was routing based on context-length. So the model was always the same, but optimized for different lengths. Is this still the case?

derwiki · 2026-02-06T00:07:15 1770336435

Has this always been the case?

fragmede · 2026-02-05T23:25:50 1770333950

I believe you when you say you're not changing the model file loaded onto the H100s or whatever, but there's something going on, beyond just being slower, when the GPUs are heavily loaded.

clbrmbr · 2026-02-06T00:27:52 1770337672

I do wonder about reasoning effort.

hauntsaninja · 2026-02-06T08:40:16 1770367216

Reasoning effort is denominated in tokens, not time, so no difference beyond slowness at heavy load

(I work at OpenAI)

Corence · 2026-02-05T19:31:24 1770319884

It is a fair question. I'd expect the numbers are all real. Competitors are going to rerun the benchmark with these models to see how the model is responding and succeeding on the tasks and use that information to figure out how to improve their own models. If the benchmark numbers aren't real their competitors will call out that it's not reproducible.

However it's possible that consumers without a sufficiently tiered plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do.

mrandish · 2026-02-05T22:22:20 1770330140

> I'd expect the numbers are all real.

I think a lot of people are concerned due to 1) significant variance in performance being reported by a large number of users, and 2) We have specific examples of OpenAI and other labs benchmaxxing in the recent past (https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...).

It's tricky because there are so many subtle ways in which "the numbers are all real" could be technically true in some sense, yet still not reflect what a customer will experience (eg harnesses, etc). And any of those ways can benefit the cost structures of companies currently subsidizing models well below their actual costs with limited investor capital. All with billions of dollars in potential personal wealth at stake for company employees and dozens of hidden cost/performance levers at their disposal.

And it doesn't even require overt deception on anyone's part. For example, the teams doing benchmark testing of unreleased new models aren't the same people as the ops teams managing global deployment/load balancing at scale day-to-day. If there aren't significant ongoing resources devoted to specifically validating those two things remain in sync - they'll almost certainly drift apart. And it won't be anyone's job to even know it's happening until a meaningful number of important customers complain or sales start to fall. Of course, if an unplanned deviation causes costs to rise over budget, it's a high-priority bug to be addressed. But if the deviation goes the other way and costs are a little lower than expected, no one's getting a late night incident alert. This isn't even a dig at OpenAI in particular, it's just the default state of how large orgs work.

ifwinterco · 2026-02-05T19:54:11 1770321251

On benchmarks GPT 5.2 was roughly equivalent to Opus 4.5 but most people who've used both for SWE stuff would say that Opus 4.5 is/was noticeably better

CraigJPerry · 2026-02-05T20:57:04 1770325024

There's an extended thinking mode for GPT 5.2; I forget the name of it right at this minute. It's super slow - a 3 minute Opus 4.5 prompt is circa 12 minutes to complete in 5.2 on that super extended thinking mode, but it is not a close race in terms of results - GPT 5.2 wins by a handy margin in that mode. It's just too slow to be usable interactively though.

ifwinterco · 2026-02-05T21:55:12 1770328512

Interesting, sounds like I definitely need to give the GPT models another proper go based on this discussion

elAhmo · 2026-02-05T20:00:07 1770321607

I mostly used Sonnet/Opus 4.x in the past months, but 5.2 Codex seemed to be on par or better for my use case in the past month. I tried a few models here and there but always went back to Claude, but with 5.2 Codex for the first time I felt it was very competitive, if not better.

Curious to see how things will be with 5.3 and 4.6

georgeven · 2026-02-05T20:04:20 1770321860

Interesting. Everyone in my circle said the opposite.

MadnessASAP · 2026-02-05T23:27:50 1770334070

My experience is that Codex follows directions better but Claude writes better code.

ChatGPT-5.2-Codex follows directions to ensure a task [bead](https://github.com/steveyegge/beads) is opened before starting a task and to keep it updated almost to a fault. Claude-Opus-4.5, with the exact same directions, forgets about it within a round or two. Similarly, I had a project that required very specific behaviour from a couple of functions; it was documented in a few places, including comments at the top and bottom of the function. Codex was very careful in ensuring the function worked as was documented. Claude decided it was easier to do the exact opposite, rewrote the function, the comments, and the documentation to say it now did the opposite of what was previously there.

If I believed a LLM could be spiteful, I would've believed it on that second one. I certainly felt some after I realised what it had done. The comment literally said:

  // Invariant regardless of the value of X, this function cannot return Y

And it turned it into:

  // Returns Y if X is true
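One way to make an invariant like that harder to rewrite quietly is to pin it with a test as well as a comment. A minimal sketch in Python; `classify`, `Y`, and the inputs are hypothetical stand-ins, not the actual function described above:

```python
import itertools

# Hypothetical sentinel standing in for the value that must never be returned.
Y = "forbidden"


def classify(x: bool, payload: int) -> str:
    """Invariant: regardless of the value of x, this function cannot return Y."""
    if x:
        return "even" if payload % 2 == 0 else "odd"
    return "skipped"


def check_invariant() -> None:
    """Exercise both values of x over a range of payloads and fail loudly on Y."""
    for x, payload in itertools.product([True, False], range(-5, 6)):
        result = classify(x, payload)
        assert result != Y, f"invariant broken for x={x}, payload={payload}"


check_invariant()
```

An agent can still edit the test too, but a change that touches both the function and the check guarding it tends to stand out in a diff more than a reworded comment does.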

planckscnst · 2026-02-06T00:23:03 1770337383

That's so strange. I found GPT to be abysmal at following instructions to the point of unusability for any direction-heavy role. I have a common workflow that involves an orchestrator that pretty much does nothing but follow some simple directions [1]. GPT flat-out cannot do this most basic task.

[1]: https://github.com/Vibecodelicious/llm-conductor/blob/main/O...

MadnessASAP · 2026-02-06T02:47:13 1770346033

Strange behaviour and LLMs are the iconic duo of the decade. They've definitely multiplied my productivity, since now instead of putting off writing boring code or getting stuck on details till I get frustrated and give up I just give it to an agent to figure out.

I don't think my ability to read, understand, and write code is going anywhere though.

Neat tool BTW, I'm in the market for something like that.

krzyk · 2026-02-05T20:53:03 1770324783

It probably depends on programming language and expectations.

ifwinterco · 2026-02-05T21:51:56 1770328316

This is mostly Python/TS for me... what Jonathan Blow would probably call not "real programming" but it pays the bills

They can both write fairly good idiomatic code but in my experience opus 4.5 is better at understanding overall project structure etc. without prompting. It just does things correctly first time more often than codex. I still don't trust it obviously but out of all LLMs it's the closest to actually starting to earn my trust

deaux · 2026-02-06T02:54:25 1770346465

Even for the same language it depends on domain.

SatvikBeri · 2026-02-05T22:05:36 1770329136

I pretty consistently heard people say Codex was much slower but produced better results, making it better for long-running work in the background, and worse for more interactive development.

smcleod · 2026-02-05T21:37:35 1770327455

I don't think much from OpenAI can be trusted tbh.

aaaalone · 2026-02-05T19:19:47 1770319187

At the end of the day you test it for your use cases anyway, but it makes for a great initial hint as to whether it's worth testing out.

cyanydeez · 2026-02-05T19:36:03 1770320163

When do you think we should run this benchmark? Friday, 1pm? Monday 8AM? Wednesday 11AM?

I definitely suspect all these models are being degraded during heavy loads.

j_maffe · 2026-02-05T19:49:36 1770320976

This hypothesis is tested regularly by plenty of live benchmarks. The services usually don't decay in performance.
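If you want to check this against your own workload rather than a public leaderboard, a rough approach is to log a pass/fail result with a timestamp for each run of a fixed task set and compare pass rates per hour. A minimal sketch; the `(unix_timestamp, passed)` record format and the sample data are assumptions for illustration, not real measurements:

```python
from collections import defaultdict
from datetime import datetime, timezone


def pass_rate_by_hour(runs):
    """Group (unix_timestamp, passed) pairs by UTC hour and return {hour: pass_rate}."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [passes, attempts]
    for ts, passed in runs:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        totals[hour][0] += int(passed)
        totals[hour][1] += 1
    return {hour: passes / attempts for hour, (passes, attempts) in sorted(totals.items())}


# Made-up results, purely to show the shape of the data.
runs = [
    (1770320163, True),
    (1770320976, False),
    (1770326555, True),
]
print(pass_rate_by_hour(runs))
```

With only a handful of runs per hour the differences will mostly be noise, so this only says something once each bucket has a decent number of attempts on the same tasks.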

thinkingtoilet · 2026-02-05T21:22:35 1770326555

We know OpenAI got caught getting benchmark data and tuning their models to it already. So the answer is a hard no. I imagine over time it gives a general view of the landscape and improvements, but take it with a large grain of salt.

tedsanders · 2026-02-06T01:03:22 1770339802

Are you referring to FrontierMath?

We had access to the eval data (since we funded it), but we didn't train on the data or otherwise cheat. We didn't even look at the eval results until after the model had been trained and selected.

rvz · 2026-02-05T22:37:56 1770331076

The same thing was done by Meta researchers with Llama 4, which shows what can go wrong when 'independent' researchers begin to game AI benchmarks. [0]

You always have to question these benchmarks, especially when the in-house researchers can potentially game them if they wanted to.

Which is why it must be independent.

[0] https://gizmodo.com/meta-cheated-on-ai-benchmarks-and-its-a-...

purplerabbit · 2026-02-05T18:29:18 1770316158

The lack of broad benchmark reports in this makes me curious: Has OpenAI reverted to benchmaxxing? Looking forward to hearing opinions once we all try both of these out

MallocVoidstar · 2026-02-05T19:24:14 1770319454

The -codex models are only for 'agentic coding', nothing else.

nharada · 2026-02-05T18:23:20 1770315800

That's a massive jump, I'm curious if there's a materially different feeling in how it works or if we're starting to reach the point of benchmark saturation. If the benchmark is good then 10 points should be a big improvement in capability...

jkelleyrtp · 2026-02-05T18:27:40 1770316060

claude swe-bench is 80.8 and codex is 56.8

Seems like 4.6 is still all-around better?

gizmodo59 · 2026-02-05T18:28:19 1770316099

It's SWE-bench Pro, not SWE-bench Verified. The Verified benchmark has stagnated.

joshuahedlund · 2026-02-05T18:30:36 1770316236

Any ideas why verified has stagnated? It was increasing rapidly and then basically stopped.

Snuggly73 · 2026-02-05T18:53:23 1770317603

it has been pretty much a benchmark for memorization for a while. there is a paper on the subject somewhere.

swe bench pro public is newer, but it's not live, so it will get slowly memorized as well. the private dataset is more interesting, as are the results there:

https://scale.com/leaderboard/swe_bench_pro_private

Rudybega · 2026-02-05T21:59:49 1770328789

You're comparing two different benchmarks. Pro vs Verified.

pjot · 2026-02-05T18:03:16 1770314596

Claude Code release notes:

  > Version 2.1.32:
     • Claude Opus 4.6 is now available!
     • Added research preview agent teams feature for multi-agent collaboration (token-intensive feature, requires setting
     CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1)
     • Claude now automatically records and recalls memories as it works
     • Added "Summarize from here" to the message selector, allowing partial conversation summarization.
     • Skills defined in .claude/skills/ within additional directories (--add-dir) are now loaded automatically.
     • Fixed @ file completion showing incorrect relative paths when running from a subdirectory
     • Updated --resume to re-use --agent value specified in previous conversation by default.
     • Fixed: Bash tool no longer throws "Bad substitution" errors when heredocs contain JavaScript template literals like ${index + 1}, which
     previously interrupted tool execution
     • Skill character budget now scales with context window (2% of context), so users with larger context windows can see more skill descriptions
     without truncation
     • Fixed Thai/Lao spacing vowels (สระ า, ำ) not rendering correctly in the input field
     • VSCode: Fixed slash commands incorrectly being executed when pressing Enter with preceding text in the input field
     • VSCode: Added spinner when loading past conversations list
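For a sense of scale on the skill character budget item, here is the arithmetic as a small sketch. It assumes the budget is simply 2% of the context window size, with that number read as characters; the release note does not spell out the unit conversion, and the two window sizes are just illustrative:

```python
def skill_budget(context_window: int, percent: int = 2) -> int:
    """Return `percent` percent of the context window, using integer arithmetic."""
    return context_window * percent // 100


for window in (200_000, 1_000_000):
    # 200,000 -> 4,000 and 1,000,000 -> 20,000
    print(f"{window:,} -> {skill_budget(window):,}")
```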

neuronexmachina · 2026-02-05T18:08:39 1770314919

> Claude now automatically records and recalls memories as it works

Neat: https://code.claude.com/docs/en/memory

I guess it's kind of like Google Antigravity's "Knowledge" artifacts?

bityard · 2026-02-05T19:46:23 1770320783

If it works anything like the memories on Copilot (which have been around for quite a while), you need to be pretty explicit about it being a permanent preference for it to be stored as a memory. For example, "Don't use emoji in your response" would only be relevant for the current chat session, whereas this is more sticky: "I never want to see emojis from you, you sub-par excuse for a roided-out spreadsheet"

flutas · 2026-02-05T21:10:54 1770325854

It's a lot more iffy than that IME.

It's very happy to throw a lot into the memory, even if it doesn't make sense.

anupamchugh · 2026-02-06T05:05:56 1770354356

This is the core problem. The agent writes its own memory while working, so it has blind spots about what matters. I've had sessions where it carefully noted one thing but missed a bigger mistake in the same conversation — it can't see its own gaps.

A second pass over the transcript afterward catches what the agent missed. Doesn't need the agent to notice anything. Just reads the conversation cold.

The two approaches have completely different failure modes, which is why you need both. What nobody's built yet is the loop where the second pass feeds back into the memory for the next session.

9dev · 2026-02-05T20:56:29 1770324989

> you sub-par excuse for a roided-out spreadsheet

That’s harsh, man.

om8 · 2026-02-05T18:23:41 1770315821

Is there a way to disable it? Sometimes I value agent not having knowledge that it needs to cut corners

nerdsniper · 2026-02-05T19:19:31 1770319171

90-98% of the time I want the LLM to only have the knowledge I gave it in the prompt. I'm actually kind of scared that I'll wake up one day and the web interface for ChatGPT/Opus/Gemini will pull information from my prior chats.

vineyardmike · 2026-02-05T21:18:12 1770326292

All of these providers support this feature. I don’t know about ChatGPT but the rest are opt-in. I imagine with Gemini it’ll be default on soon enough, since it’s consumer focused. Claude does constantly nag me to enable it though.

pdntspa · 2026-02-05T21:27:58 1770326878

They already do this

I've had claude reference prior conversations when I'm trying to get technical help on thing A, and it will ask me if this conversation is because of thing B that we talked about in the immediate past

sanxiyn · 2026-02-06T00:53:52 1770339232

You can disable this at Settings > Capabilities > Memory > Search and reference chats.

hypercube33 · 2026-02-05T19:48:30 1770320910

I'm fairly sure OpenAI/GPT does pull prior information in the form of its memories

nerdsniper · 2026-02-05T19:50:39 1770321039

Ah, that could explain why I've found myself using it the least.

sharifhsn · 2026-02-05T19:53:04 1770321184

Gemini has this feature but it’s opt-in.

kzahel · 2026-02-05T20:17:57 1770322677

Claude told me he can disable it by putting instructions in the MEMORY.md file to not use it. So only a soft disable AFAIK and you'd need to do it on each machine.

4b11b4 · 2026-02-05T21:23:10 1770326590

I understand everyone's trying to solve this problem but I'm envisioning 1 year down the line when your memory is full of stuff that shouldn't be in there.

codethief · 2026-02-05T18:26:37 1770315997

Are we sure the docs page has been updated yet? Because that page doesn't say anything about automatic recording of memories.

neuronexmachina · 2026-02-05T19:39:47 1770320387

Oh, quite right. I saw people mention MEMORY.md online and I assumed that was the doc for it, but it looks like it isn't.

pdntspa · 2026-02-05T21:26:45 1770326805

I thought it was already doing this?

I asked Claude UI to clear its memory a little while back and hoo boy CC got really stupid for a couple of days

kzahel · 2026-02-05T20:20:08 1770322808

I looked into it a bit. It stores memories near where it stores JSONL session history. It's per-project (and specific to the machine). Claude pretty aggressively and frequently writes stuff in there. It uses MEMORY.md as sort of the index, and will write out other files with other topics (linking to them from the main MEMORY.md file).

It gives you a convenient way to say "remember this bug for me, we should fix tomorrow". I'll be playing around with it more for sure.

I asked Claude to give me a TLDR (condensed from its system prompt):

----

Persistent directory at ~/.claude/projects/{project-path}/memory/, persists across conversations

MEMORY.md is always injected into the system prompt; truncated after 200 lines, so keep it concise

Separate topic files for detailed notes, linked from MEMORY.md

What to record: problem constraints, strategies that worked/failed, lessons learned

Proactive: when I hit a common mistake, check memory first - if nothing there, write it down

Maintenance: update or remove memories that are wrong or outdated

Organization: by topic, not chronologically

Tools: use Write/Edit to update (so you always see the tool calls)
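Given the 200-line truncation mentioned in that summary, a quick check of how big the index has grown can be handy. A minimal sketch, assuming the layout described above (a `MEMORY.md` index plus Markdown topic files in one directory); `memory_report` and the `my-project` path are hypothetical, and the 200-line limit is taken from the summary rather than verified:

```python
from pathlib import Path

MAX_INJECTED_LINES = 200  # truncation limit quoted in the summary above (unverified)


def memory_report(memory_dir: Path) -> dict:
    """Count lines in MEMORY.md and list the other Markdown topic files next to it."""
    index = memory_dir / "MEMORY.md"
    lines = index.read_text(encoding="utf-8").splitlines() if index.exists() else []
    topic_files = sorted(p.name for p in memory_dir.glob("*.md") if p.name != "MEMORY.md")
    return {
        "index_lines": len(lines),
        "lines_truncated": max(0, len(lines) - MAX_INJECTED_LINES),
        "topic_files": topic_files,
    }


if __name__ == "__main__":
    # "my-project" is a placeholder; the real directory name depends on the project path.
    print(memory_report(Path.home() / ".claude" / "projects" / "my-project" / "memory"))
```

A missing directory just reports zero lines and no topic files, so it is safe to run before any memory has been written.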

ra7 · 2026-02-05T22:18:12 1770329892

> Persistent directory at ~/.claude/projects/{project-path}/memory/, persists across conversations

I create a git worktree, start Claude Code in that tree, and delete after. I notice each worktree gets a memory directory in this location. So is memory fragmented and not combined for the "main" repo?

vardalab · 2026-02-06T01:57:32 1770343052

Yes, I noticed the same thing, and Claude told me that it's going to be deleted. I will have it improve the skill that is part of our worktree cleanup process to consolidate that memory into the main memory if there's anything useful.
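A cleanup step along those lines could look roughly like this. It is only a sketch under assumptions: both memory directories are passed in explicitly (how the project path maps to a directory name is not covered here), `consolidate_memory` and the `label` prefix are made up for illustration, and nothing here is part of Claude Code itself:

```python
import shutil
from pathlib import Path


def consolidate_memory(worktree_memory: Path, main_memory: Path, label: str) -> list[str]:
    """Copy a worktree's memory files into the main memory directory.

    Topic files are copied with a `label-` prefix so nothing is overwritten, and the
    worktree's MEMORY.md is appended to the main one under a heading for later pruning.
    Returns the names of the files written or updated.
    """
    main_memory.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(worktree_memory.glob("*.md")):
        if src.name == "MEMORY.md":
            continue
        dest = main_memory / f"{label}-{src.name}"
        shutil.copy2(src, dest)
        written.append(dest.name)
    index = worktree_memory / "MEMORY.md"
    if index.exists():
        with (main_memory / "MEMORY.md").open("a", encoding="utf-8") as f:
            f.write(f"\n## Imported from worktree {label}\n\n")
            f.write(index.read_text(encoding="utf-8"))
        written.append("MEMORY.md")
    return written
```

Because the topic files are renamed, any links to them inside the imported index text will need fixing by hand (or by the agent), and the appended section counts toward the size of the main MEMORY.md, so it is worth pruning afterwards.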

legitster · 2026-02-05T18:05:12 1770314712

I'm still not sure I understand Anthropic's general strategy right now.

They are doing these broad marketing programs trying to take on ChatGPT for "normies". And yet their bread and butter is still clearly coding.

Meanwhile, Claude's general use cases are... fine. For generic research topics, I find that ChatGPT and Gemini run circles around it: in the depth of research, the type of tasks it can handle, and the quality and presentation of the responses.

Anthropic is also doing all of these goofy things to try to establish the "humanity" of their chatbot - giving it rights and a constitution and all that. Yet it weirdly feels the most transactional out of all of them.

Don't get me wrong, I'm a paying Claude customer and love what it's good at. I just think there's a disconnect between what Claude is and what their marketing department thinks it is.

bobbylarrybobby · 2026-02-06T05:18:13 1770355093

I really like that Claude feels transactional. It answers my question quickly and concisely and then shuts up. I don't need the LLM I use to act like my best friend.

andkenneth · 2026-02-06T06:18:27 1770358707

Weirdly I feel like partially because of this it feels more "human" and more like a real person I'm talking to. GPT models feel fake and forced, and will yap in a way that is like they're trying to get to be my friend, but offputting in a way that makes it not work. Meanwhile claude has always had better "emotional intelligence".

Claude also seems a lot better at picking up what's going on. If you're focused on tasks, then yeah, it's going to know you want quick answers rather than detailed essays. Could be part of it.

apples_oranges · 2026-02-06T08:25:21 1770366321

fyi in settings, you can configure chatGPT to do the same

cryptoegorophy · 2026-02-06T06:48:55 1770360535

Then why are they advertising to people that are complete opposite of you? Why couldn’t they just … ask LLM what their target audience is?

tgtweak · 2026-02-05T18:09:50 1770314990

Claude itself (outside of code workflows) actually works very well for general purpose chat. I have a few non-technical friends that have moved over from chatgpt after some side-by-side testing and I've yet to see one go back - which is good since claude circa 8 months ago was borderline unusable for anything but coding on the api.

eaf7e281 · 2026-02-05T18:16:31 1770315391

I kinda agree. Their model just doesn't feel "daily" enough. I would use it for any "agentic" tasks and for using tools, but definitely not for day to day questions.

lukebechtel · 2026-02-05T18:22:40 1770315760

Why? I use it for all and love it.

That doesn't mean you have to, but I'm curious why you think it's behind in the personal assistant game.

legitster · 2026-02-05T18:41:54 1770316914

I have three specific use cases where I try both but ChatGPT wins:

- Recipes and cooking: ChatGPT just has way more detailed and practical advice. It also thinks outside of the box much more, whereas Claude gets stuck in a rut and sticks very closely to your prompt. And ChatGPT's easier to understand/skim writing style really comes in useful.

- Travel and itinerary: Again, ChatGPT can anticipate details much more, and give more unique suggestions. I am much more likely to find hidden gems or get good time-savers than Claude, which often feels like it is just rereading Yelp for you.

- Historical research: ChatGPT wins on this by a mile. You can tell ChatGPT has been trained on actual historical texts and physical books. You can track long historical trends, pull examples and quotes, and even give you specific book or page(!) references of where to check the sources. Meanwhile, all Claude will give you is a web search on the topic.

aggie · 2026-02-05T19:59:43 1770321583

How does #3 square with Anthropic's literal warehouse full of books we've seen from the copyright case? Did OpenAI scan more books? Or did they take a shadier route of training on digital books despite copyright issues, but end up with a deeper library?

legitster · 2026-02-05T22:27:31 1770330451

I have no idea, but I suspect there's a difference between using books to train an LLM and be able to reproduce text/writing styles, and being able to actually recall knowledge in said books.

rolisz · 2026-02-05T20:10:09 1770322209

I think they bought the books after they were caught pirating books and lost that case (because they pirated, not because of copyright).

eaf7e281 · 2026-02-05T21:25:55 1770326755

It's hard to say. Maybe it has to do with the way Claude responds or the lack of "thinking" compared to other models. I personally love Claude and it's my only subscription right now, but it just feels weird compared to the others as a personal assistant.

lukebechtel · 2026-02-06T00:07:34 1770336454

Oh, I always use opus 4.5 thinking mode. Maybe that's the diff.

FergusArgyll · 2026-02-06T00:36:41 1770338201

My 2 cents:

All the labs seem to do very different post-training. OpenAI focuses on search. If it's set to thinking, it will search 30 websites before giving you an answer. Claude regularly doesn't search at all even for questions it obviously should. Its post-training seems more focused on "reasoning" or planning - things that would be useful in programming where the bottleneck is: just writing code without thinking how you'll integrate it later and search is mostly useless. But for non coding - day to day "what's the news with x" "How to improve my bread" "cheap tasty pizza" or even medical questions, you really just want a distillation of the internet plus some thought

solarkraft · 2026-02-05T18:43:14 1770316994

But that’s what makes it so powerful (yeah, mixing model and frontend discussion here yet again). I have yet to see a non-DIY product that can so effortlessly call tens of tools by different providers to satisfy your request.

quietsegfault · 2026-02-05T22:40:07 1770331207

Claude is far superior for daily chat. I have to work hard to get it to not learn how to work around various bad behaviors I have but don’t want to change.

Squarex · 2026-02-05T20:59:21 1770325161

Claude sucks at non English languages. Gemini and ChatGPT are much better. Grok is the worst. I am a native Czech speaker and Claude makes up words and Grok sometimes respond in Russian. So while I love it for coding, it’s unusable for general purpose for me.

9dev · 2026-02-05T21:05:55 1770325555

> Grok sometimes respond in Russian

Geopolitically speaking this is hilarious.

Squarex · 2026-02-05T21:58:47 1770328727

The voice mode sounded like a Ukrainian trying to speak Czech. I don’t think it means anything.

deaux · 2026-02-06T03:04:11 1770347051

You mean Claude sucks at Czech. You're extrapolating here. I can name languages that Claude is better at than GPT.

Gemini is the most fluent in the highest number of human languages and has been for years (!) at this point - namely since Gemini 1.5 Pro, which was released Feb 2024. Two years ago.

Squarex · 2026-02-06T05:55:41 1770357341

Yeah, sure, I was overly generalising it from one experience.

jorl17 · 2026-02-05T22:07:17 1770329237

Claude is quite good at European Portuguese in my limited tests. Gemini 3 is also very good. ChatGPT is just OK and keeps code-switching all the time, it's very bizarre.

I used to think of Gemini as the lead in terms of Portuguese, but recently subjectively started enjoying Claude more (even before Opus 4.5).

In spite of this, ChatGPT is what I use for everyday conversational chat because it has loads of memories there, because of the top of the line voice AI, and, mostly, because I just brainstorm or do 1-off searches with it. I think effectively ChatGPT is my new Google and first scratchpad for ideas.

khendron · 2026-02-06T01:01:47 1770339707

Claude is helping me learn French right now. I am using it as a supplementary tutor for a class I am taking. I have caught it in a couple of mistakes, but generally it seems to be working pretty well.

kuboble · 2026-02-05T21:58:15 1770328695

Claude code (opus) is very good in Polish.

I sometimes vibe code in polish and it's as good as with English for me. It speaks a natural, native level Polish.

I used opus to translate thousands of strings in my app into Polish, Korean, and two Chinese dialects. The Polish one is great, and the others are also good according to my customers.

Squarex · 2026-02-06T05:58:58 1770357538

> I sometimes vibe code in polish

This is interesting to me. I always switch to English automatically when using Claude Code as I have learned software engineering on an English speaking Internet. Plus the muscle memory of having to query google in English.

altern8 · 2026-02-06T00:35:41 1770338141

Your game is amazing!

I wish there was a "Reset" button to go back to the original position.

Where are you in Poland?

kuboble · 2026-02-06T04:34:37 1770352477

Thanks :) Click "Level" -> "Try again"

Originally from Wrocław, but don't live in Poland anymore

koakuma-chan · 2026-02-06T02:27:39 1770344859

You could say its Polish is polished.

dimgl · 2026-02-06T01:49:01 1770342541

I don't get what's so difficult to understand. They have ambitions beyond just coding. And Claude is generally a good LLM. Even beyond just the coding applications.

redox99 · 2026-02-06T00:33:33 1770338013

Why would I even use Claude for asking something on their web, considering that chips away my claude code usage limit?

Their limit system is so bad.

fnordpiglet · 2026-02-06T05:49:47 1770356987

Enterprise, government, and regulated institutions. It’s also the de facto standard for programming assistants at most places. They have a better story around compliance, alignment, task based inference, agentic workflows, etc. Their retail story is meh, but I think their view is to be the AWS of LLMs while OpenAI can be the retail and Gemini the whatever Google does with products.

derwiki · 2026-02-06T00:14:01 1770336841

It feels very similar to how Lyft positioned themselves against Uber. (And we know how that played out)

dev1ycan · 2026-02-06T07:37:15 1770363435

Their "constitution" is just garbage meant to defend them ripping off copyrighted material with the excuse that "it's not plagiarizing, it thinks!!!!1" which is, false.

handoflixue · 2026-02-06T07:48:36 1770364116

I don't recall them ever offering that legal reasoning - I'm sure you can provide a citation?

simonw · 2026-02-05T17:58:52 1770314332

The bicycle frame is a bit wonky but the pelican itself is great: https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...

stkai · 2026-02-05T18:56:40 1770317800

Would love to find out they're overfitting for pelican drawings.

fdeage · 2026-02-05T23:28:49 1770334129

OpenAI claims not to: https://x.com/aidan_mclau/status/1986255202132042164

andy_ppp · 2026-02-05T19:16:19 1770318979

Yes, Raccoon on a unicycle? Magpie on a pedalo?

throw310822 · 2026-02-05T21:27:51 1770326871

Correct horse battery staple:

https://claude.ai/public/artifacts/14a23d7f-8a10-4cde-89fe-0...

ta988 · 2026-02-05T21:49:58 1770328198

no staple?