Surely the corpus Opus 4.6 ingested would include whatever reference you used to check the spells were there. I mean, there are probably dozens of pages on the internet like this:
Do you think it's actually ingesting the books and only using those as a reference? Is that how LLMs work at all? It seems more likely it's predicting these spell names from all the other references it has found on the internet, including lists of spells.
Most people still don't realize that general public world knowledge is not really a test for a model that was trained on general public world knowledge. I wouldn't be surprised if even proprietary content like the books themselves found their way into the training data, despite what publishers and authors may think of that. As a matter of fact, with all the special deals these companies make with publishers, it is getting harder and harder for normal users to come up with validation data that only they have seen. At least for human written text, this kind of data is more or less reserved for specialist industries and higher academia by now. If you're a janitor with a high school diploma, there may be barely any textual information or fact you have ever consumed that such a model hasn't seen during training already.
> I wouldn't be surprised if even proprietary content like the books themselves found their way into the training data
No need for surprises! It is publicly known that the corpora of 'shadow libraries' such as Library Genesis and Anna's Archive were specifically and manually requested by at least NVIDIA for their training data [1], used by Google in their training [2], downloaded by Meta employees [3], etc.
The big AI houses are all involved in varying degrees of litigation (all the way to class action lawsuits) with the big publishing houses. I think they at least apply some level of filtering to their training data to keep themselves somewhat legally compliant. But considering how much copyrighted material is spread blissfully online, that filtering is probably not enough to catch the actual ebooks of certain publishers.
"Even if LLM training is fair use, AI companies face potential liability for unauthorized copying and distribution. The extent of that liability and any damages remain unresolved."
> even proprietary content like the books themselves
This definitely raises an interesting question. It seems like a good chunk of popular literature (especially from the 2000s) exists online in big HTML files. What immediately came to mind were House of Leaves, Infinite Jest, Harry Potter, basically any Stephen King book - they've all been posted at some point.
Do LLMs have a good way of inferring where knowledge from the context begins and knowledge from the training data ends?
> If you're a janitor with a high school diploma, there may be barely any textual information or fact you have ever consumed that such a model hasn't seen during training already.
So a good test would be replacing the spell names in the books with made-up spells. And if a "real" spell name was given, it also tests whether it "cheated".
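The replacement test above could be sketched in a few lines. This is a hypothetical illustration: the spell names and their invented substitutes are made up here, not taken from any actual experiment.

```python
import re

# Map a few real spell names to invented ones (illustrative examples only).
# Feeding the modified text to the model tests whether it reads the context
# or falls back on memorized spell lists.
REPLACEMENTS = {
    "Expelliarmus": "Dravokin",
    "Expecto Patronum": "Lumivarra",
    "Wingardium Leviosa": "Ostrevel Nandu",
}

def swap_spells(text: str) -> str:
    """Replace each known spell name with its invented counterpart."""
    for real, fake in REPLACEMENTS.items():
        text = re.sub(re.escape(real), fake, text, flags=re.IGNORECASE)
    return text

page = "Harry shouted 'Expelliarmus!' and the wand flew out of his hand."
print(swap_spells(page))
```

If the model then lists "Expelliarmus" instead of the substitute, that is direct evidence it answered from training data rather than from the provided text.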
A real test would be synthesizing 100,000 sentences like this, selecting random ones, and injecting the traits you want the LLM to detect and describe, e.g. have a set of words or phrases that may represent spells and have them used so that they do something. Then have the LLM find these random spells in the random corpus.
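A minimal sketch of that synthetic-corpus setup might look like the following. Everything here is hypothetical (the filler templates and fake spell names are invented for illustration); the point is that the ground truth is generated locally, so it cannot exist in any training set.

```python
import random

random.seed(42)

# Invented filler material; none of this appears in any real corpus.
FILLER_TEMPLATES = [
    "The {adj} {noun} waited by the {place}.",
    "Nobody remembered why the {noun} was left in the {place}.",
    "A {adj} wind moved through the {place}.",
]
WORDS = {
    "adj": ["quiet", "old", "restless", "pale"],
    "noun": ["lantern", "clerk", "dog", "letter"],
    "place": ["station", "garden", "archive", "market"],
}

# Made-up spell names the model cannot have seen during training.
FAKE_SPELLS = ["vortimax", "quellibra", "sonavire"]

def filler_sentence() -> str:
    template = random.choice(FILLER_TEMPLATES)
    return template.format(**{k: random.choice(v) for k, v in WORDS.items()})

def build_corpus(n: int, n_injections: int) -> tuple[list[str], set[int]]:
    """Generate n filler sentences, then overwrite n_injections random
    positions with sentences where a fake spell visibly does something."""
    sentences = [filler_sentence() for _ in range(n)]
    positions = set(random.sample(range(n), n_injections))
    for i in positions:
        spell = random.choice(FAKE_SPELLS)
        sentences[i] = f'She whispered "{spell}" and the door unlocked itself.'
    return sentences, positions

corpus, ground_truth = build_corpus(100_000, 50)
# `corpus` goes into the model's context; `ground_truth` stays private
# and is used to score the spells the model claims to have found.
```

Scoring against the held-back `ground_truth` then measures in-context retrieval directly, with no possible contamination from memorized spell lists.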
It could still remember where each spell is mentioned. I think the only way to properly test this would be to run it against an unpublished manuscript.
For fun I've asked Gemini Pro to answer open ended questions about obscure books like "Read this novel and tell me what the hell is this book, do a deep reading and analyze" and I've gotten insightful/ enjoyable answers but I've never asked it to make lists of spells or anything like that.
It's impressive, even if the books and the posts you're talking about were both key parts of the training data.
There are many academic domains where the research portion of a PhD is essentially what the model just did. For example, PhD students in some of the humanities will spend years combing ancient sources for specific combinations of prepositions and objects, only to write a paper showing that the previous scholars were wrong (and that a particular preposition has examples of being used with people rather than places).
This sort of experiment shows that Opus would be good at that. I'm assuming it's trivial for the OP to extend their experiment to determine how many times "wingardium leviosa" was used on an object rather than a person.
(It's worth noting that other models are decent at this, and you would need to find a way to benchmark between them.)
I don’t think this example proves your point. There’s no indication that the model actually worked this out from the input context, instead of regurgitating it from the training weights. A better test would be to subtly modify the books fed in as input to the model so that there were actually 51 spells, and see if it pulls out the extra spell, or to modify the names of some spells, etc.
In your example, it might be the case that the model simply spits out the consensus view, rather than actually finding/constructing this information on its own.
Since it got 49 of 50 right, it's worse than what you would get from a simple Google search. People would immediately disregard a conventional source that only listed 49 out of 50.
The poster you reply to works in AI. The marketing strategy is to always have a cute Pelican or Harry Potter comment as the top comment for positive associations.
The poster knows all of that, this is plain marketing.
This sounds compelling, but also something that an armchair marketer would have theorycrafted without any real-world experience or evidence that it actually works - and I searched online and can't find any references to something like it.
https://www.wizardemporium.com/blog/complete-list-of-harry-p...
Why is this impressive?