We tasked Opus 4.6 using agent teams to build a C Compiler

ndesaulniers · 2026-02-05T21:41:17 1770327677

I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel. https://clangbuiltlinux.github.io/

This LLM did it in (checks notes):

> Over nearly 2,000 Claude Code sessions and $20,000 in API costs

It may build, but does it boot (was also a significant and distinct next milestone)? (Also, will it blend?). Looks like yes!

> The 100,000-line compiler can build a bootable Linux 6.9 on x86, ARM, and RISC-V.

The next milestone is:

Is the generated code correct? The jury is still out on that one for production compilers. And then you have performance of generated code.

> The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

Still a really cool project!

iberator · 2026-02-06T06:28:40 1770359320

Claude did not wrote it. you wrote it with PREVIOUS EXPERIENCE with 20.000 long commandshyellihg him exactly what to do.

Real usable AI would create it with simple: 'make c compilers c99 faster than GCC'.

AI usage should be banned in general. It takes jobs faster than creating new ones ..

arcanemachiner · 2026-02-06T07:44:30 1770363870

That's actually pretty funny. They're patting it on the back for using, in all likelihood, some significant portions of code that they actually wrote, which was stolen from them without attribution so that it could be used as part of a very expensive parlour trick.

shakna · 2026-02-05T21:59:21 1770328761

> Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase

Does it really boot...?

ndesaulniers · 2026-02-05T22:11:50 1770329510

> Does it really boot...?

They don't need 16b x86 support for the RISCV or ARM ports, so yes, but depends on what 'it' we're talking about here.

Also, FWIW, GCC doesn't directly assemble to machine code either; it shells out to GAS (GNU Assembler). This blog post calls it "GCC assembler and linker" but to be more precise the author should edit this to "GNU binutils assembler and linker." Even then GNU binutils contains two linkers (BFD and GOLD), or did they excise GOLD already (IIRC, there was some discussion a few years ago about it)?

shakna · 2026-02-05T22:34:04 1770330844

Yeah, didn't mention gas or ld, for similar reasons. I agree that a compiler doesn't necessarily "need" those.

I don't agree that all the claims are backed up by their own comments, which means that there's probably other places where it falls down.

Its... Misrepresentation.

Like Chicken is a Scheme compiler. But they're very up front that it depends on a C compiler.

Here, they wrote a C compiler that is at least sometimes reliant on having a different C compiler around. So is the project at 50%? 75%?

Even if its 99%, thats not the same story as they tried to write. And if they wrote that tale instead, it would be more impressive, rather than "There's some holes. How many?"

Philpax · 2026-02-05T22:51:01 1770331861

Their C compiler is not reliant on having another C compiler around. Compiling the 16-bit real mode bootstrap for the Linux kernel on x86(-64) requires another C compiler; you certainly don't need another compiler to compile the kernel for another architecture, or to compile another piece of software not subject to the 32k constraint.

The compiler itself is entirely functional; it just can't generate code optimal enough to fit within the constraints for that very specific (tiny!) part of the system, so another compiler is required to do that step.

TheCondor · 2026-02-06T03:52:45 1770349965

The assembler seems like nearly the easiest part. Slurp arch manuals and knock it out, it’s fixed and complete.

shakna · 2026-02-06T04:40:52 1770352852

Huh. A second person mentioning the assembler. Don't think I ever referred to one...?

qarl · 2026-02-06T00:43:59 1770338639

> Still a really cool project!

Yeah. This test sorta definitely proves that AI is legit. Despite the millions of people still insisting it's a hoax.

The fact that the optimizations aren't as good as the 40 year gcc project? Eh - I think people who focus on that are probably still in some serious denial.

byzantinegene · 2026-02-06T08:45:54 1770367554

it costs $20,000 to reinvent the wheel, that it probably trained on. If that's your definition of legit, sure

PostOnce · 2026-02-06T00:48:44 1770338924

It's amazing that it "works", but viability is another issue.

It cost $20,000 and it worked, but it's also totally possible to spend $20,000 and have Claude shit out a pile of nonsense. You won't know until you've finished spending the money whether it will fail or not. Anthropic doesn't sell a contract that says "We'll only bill you if it works" like you can get from a bunch of humans.

Do catastrophic bugs exist in that code? Who knows, it's 100,000 lines, it'll take a while to review.

On top of that, Anthropic is losing money on it.

All of those things combined, viability remains a serious question.

qarl · 2026-02-06T02:05:51 1770343551

> It cost $20,000

I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???

You should look it up. :)

lelanthran · 2026-02-06T04:56:48 1770353808

> > It cost $20,000

> I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???

I'll bite - I can write you an unoptimised C compiler that emits assembly for $20k, and it won't be 100k lines of code (maybe 15k, the last time I did this?).

It won't take me a week, though.

I think this project is a good frame of reference and matches my experience - vibing with AI is sometimes more expensive than doing it myself, and always results in much more code than necessary.

flakiness · 2026-02-06T05:20:00 1770355200

Does it support x64, x8664, arm64 and riscv? (sorry, just trolling - we don't know the quality of backend other than x8664 which is supposed to be able to build bootable linux.)

lelanthran · 2026-02-06T05:44:20 1770356660

It's not hard to build a compiler just for a bootable linux.

I see no test criteria that actually runs that built linux through various test plans, so, yeah emitting enough asm just to boot is doable.

p-e-w · 2026-02-06T05:19:33 1770355173

> I can write you an unoptimised C compiler that emits assembly for $20k

You may be willing to sell your work at that price, but that’s not the market rate, to put it very mildly. Even 10 times that would be seriously lowballing in the realm of contract work, regardless of whether it’s “optimised” or not (most software isn’t).

lelanthran · 2026-02-06T05:41:44 1770356504

> You may be willing to sell your work at that price, but that’s not the market rate, to put it very mildly.

It is now.

At any rate, this is my actual rate. I live in South Africa, and that's about 4 weeks of work for me, without an AI.

etler · 2026-02-06T08:21:17 1770366077

If my devs are writing that much code they're doing something wrong. Lines of code is an anti metric. That used to be commonly accepted knowledge.

sarchertech · 2026-02-06T02:16:01 1770344161

You wouldn’t pay a human to write 100k LOC. Or at least you shouldn’t. You’d pay a human to write a working useful compiler that isn’t riddled with copyright issues.

If you didn’t care about copying code, usefulness, or correctness you could probably get a human to whip you up a C compiler for a lot less than $20k.

qarl · 2026-02-06T02:19:27 1770344367

Are you trolling me? Companies (made of humans) write 100,000 LOC all the time.

And it's really expensive, despite your suspicions.

sarchertech · 2026-02-06T03:00:19 1770346819

No, companies don’t pay people to write 100k LOC. They pay people to write useful software.

We figured out that LOC was a useless productivity metric in the 80s.

qarl · 2026-02-06T03:40:48 1770349248

Dude.

Microsoft paid my company a lot of money to write code. And in the end you were able to count it, and the LOC is a perfectly fine metric which is still used today to measure complexity of a project.

If you actually work in software you know this.

I have no idea what point you're trying to make - but I've grown very tired of all the trolls attacking me. Good night.

EDIT: OH. Maybe you mean that people don't cite LOC in contract deliverables. Yeah, I know. I never said that, and it's irrelevant to my point.

bopbopbop7 · 2026-02-06T02:17:25 1770344245

100k lines of clean, bug free, optimized, and vulnerability free code or 100k lines of outsourced slop? Two very different price points.

qarl · 2026-02-06T02:20:14 1770344414

A compiler that can build linux.

That level of quality should be sufficient.

Do you know any low quality programmers that write C compilers in rust THAT CAN BUILD LINUX?

No you don't. They do not exist.

icedchai · 2026-02-06T02:44:29 1770345869

Yep. Building a working C compiler that compiles Linux is an impossible task for all but the top 1% of developers. And the ones that could do it have better things to do, plus they’d want a lot more than 20K for the trouble.

rhubarbtree · 2026-02-06T08:06:05 1770365165

Interesting, why impossible? We studied compiler construction at uni. I might have to dig out a few books, but I’m confident I could write one. I can’t imagine anyone on my course of 120 nerds being unable to do this.

vbezhenar · 2026-02-06T04:39:44 1770352784

What's so hard about it? Compiler construction is well researched topic and taught in the universities. I made toy language compiler as a student. May be I'm underestimating this task, but I think that I can build some simple C compiler which will output trivial assembly. Given my salary of $2500, that would probably take me around a year, so that's pretty close LoL.

fsflover · 2026-02-06T08:04:55 1770365095

https://news.ycombinator.com/item?id=46909529

bopbopbop7 · 2026-02-06T02:28:14 1770344894

Do you think this was guided by a low quality Anthropic developer?

You can give a developer the GCC test suite and have them build the compiler backwards, which is how this was done. They literally brute forced it, most developers can brute force. It also literally uses GCC in the background... Maybe try reading the article.

qarl · 2026-02-06T02:34:47 1770345287

HEH... I'm sorry man but I truly don't understand what point you're trying to make, other than change the subject and get nasty.

You take care now.

bopbopbop7 · 2026-02-06T02:36:58 1770345418

The trick to not be confused is to read more than the title of the article.

qarl · 2026-02-06T02:42:47 1770345767

[flagged]

bopbopbop7 · 2026-02-06T03:15:38 1770347738

Speaking of obnoxious

tumdum_ · 2026-02-06T01:15:03 1770340503

> On top of that, Anthropic is losing money on it.

It seems they are *not* losing money on inference: https://bsky.app/profile/steveklabnik.com/post/3mdirf7tj5s2e

chamomeal · 2026-02-06T02:42:35 1770345755

That's a good point! Here claude opus wrote a C compiler. Outrageously cool.

Earlier today, I couldn't get opus to replace useEffect-triggered-redux-dispatch nonsense with react-query calls. I already had a very nice react-query wrapper with tons of examples. But it just couldn't make sense of the useEffect rube goldberg machine.

To be fair, it was a pretty horrible mess of useEffects. But just another data point.

Also I was hoping opus would finally be able to handle complex typescript generics, but alas...

bdangubic · 2026-02-06T01:19:38 1770340778

> On top of that, Anthropic is losing money on it

This has got to be my favorite one of them all that keeps coming up in too many comments… You know who also was losing money in the beginning?! every successful company that ever existed! some like Uber were losing billions for a decade. and when was the last time you rode in a taxi? (I still do, my kid never will). not sure how old you are and if you remember “facebook will never be able to monetize on mobile…” - they all lose money, until they do not

ThrowawayR2 · 2026-02-06T04:03:25 1770350605

Anyone remember the dotcom bust?

qarl · 2026-02-06T04:39:45 1770352785

Oh yeah, I do. That whole internet thing was a total HOAX. I can't believe people bought into that.

Can you imagine if Amazon, EBay, PayPal, or Saleforce existed today?

pjmlp · 2026-02-06T06:53:21 1770360801

Well, how is your Solaris installation going?

I also remember having gone into research, because there were no jobs available, and even though I was employed at the time, our salaries weren't being paid.

deaux · 2026-02-06T05:24:20 1770355460

Completely detached from reality, brainwashed SV VC's who have made dumping the norm in their bubble.

I can guarantee you that 90% of successful businesses in the world made a profit their first year.

brookst · 2026-02-06T06:34:11 1770359651

I’ll bite. Share your data?

Companies that were not profitable in their first year: Microsoft, Google, SpaceX, airBnB, Uber, Apple, FedEx, Amazon.

If the vast majority of companies are immediately profitable, why do we have VC and investment at all? Shouldn’t the founders just start making money right aeay?

deaux · 2026-02-06T08:09:29 1770365369

> Companies that were not profitable in their first year: Microsoft, Google, SpaceX, airBnB, Uber, Apple, FedEx, Amazon.

US Big Tech, US Big Tech, US Tech-adjacent, US Big Tech, US Big Tech, US Big Tech, FedEx, US Tech-adjacent.

In other words, exactly what I was getting at.

Also, a basic search shows Microsoft to have been profitable first year. I'd be very surprised if they weren't. Apple also seems to have taken less than 2 years. And unsurprisingly, these happen to be the only two among the tech companies you named that launched before 1995.

Check out the Forbes Global 5000. Then go think about the hypothetical Forbes Global 50,000. Is the 50,000th most successful company in the world not successful? Of course not, it's incredibly successful.

> why do we have VC and investment at all

Out of all companies started in 2024 I can guarantee you that <0.01% have received VC investment by now (Feb 2026) and <1% of tech companies did. I'll bet my house on it.

PostOnce · 2026-02-06T01:51:34 1770342694

Are we forgetting that sometimes, they just go bankrupt?

bdangubic · 2026-02-06T02:18:04 1770344284

name one with comparable number of users and revenue? not saying you are wrong but I would bet against the outcome

svieira · 2026-02-06T03:42:47 1770349367

Enron

bdangubic · 2026-02-06T04:30:02 1770352202

I should have guessed someone would answer this question in this thread with Enron :)

I did not ask for random company that went under for any reason but specific question related to users and revenue.

qarl · 2026-02-06T03:52:02 1770349922

I love how your comment is getting downvoted.

Like it's a surprise that startups burn through money. I get the feeling that people really have no idea what they're talking about in here anymore.

It's a shame.

cowl · 2026-02-06T05:48:15 1770356895

then you are misunderstaing the downvoting. it's not that the fact that they are burning money. it's the fact that this cost today 20k but that is not the real cost if you factor the it is losing money on this price.

So Tomorrow when this "startup" will need to come out of their money burning phase, like every startup has to sooner or later, that cost will increase, because there is no other monetising avenue, at least not for anthropic that "wilL never use ads".

at 20k this "might" be a reasonable cost for "the project", at 200k it might not.

brookst · 2026-02-06T06:38:03 1770359883

That would be insightful if the cost of inference weren’t declining at roughly 90% per year. Source: https://epoch.ai/data-insights/llm-inference-price-trends

EddieRingle · 2026-02-06T08:17:56 1770365876

According to that article, the data they analyzed was API prices from LLM providers, not their actual cost to perform the inference. From that perspective, it's entirely possible to make "the cost of inference" appear to decline by simply subsidizing it more. The authors even hint at the same possibility in the overview:

> Note that while the data insight provides some commentary on what factors drive these price drops, we did not explicitly model these factors. Reduced profit margins may explain some of the drops in price, but we didn’t find clear evidence for this.

thesz · 2026-02-06T06:03:34 1770357814

  > This test sorta definitely proves that AI is legit.

This is an "in distribution" test. There are a lot of C compilers out there, including ones with git history, implemented from scratch. "In distribution" tests do not test generalization.

The "out of distribution" test would be like "implement (self-bootstrapping, Linux kernel compatible) C compiler in J." J is different enough from C and I know of no such compiler.

LinXitoW · 2026-02-06T03:12:30 1770347550

How does 20K to replicate code available in the thousands online (toy C compilers) prove anything? It requires a bunch of caveats about things that don't work, it requires a bunch of other tools to do stuff, and an experienced developer had to guide it pretty heavily to even get that lackluster result.

soperj · 2026-02-06T06:11:58 1770358318

Only if we take them at their word. I remember thinking things were in a completely different state when Amazon had their shop and go stores, but then finding out it was 1000s of people in Pakistan just watching you via camera.

miohtama · 2026-02-06T07:03:13 1770361393

GCC had 40 years headstart

kvemkon · 2026-02-06T01:31:47 1770341507

> optimizations aren't as good as the 40 year gcc project

with all optimizations disabled:

> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

qarl · 2026-02-06T01:47:53 1770342473

That distinction doesn't change my point. I am not surprised that a 40 year old project generates better code than this brand new one.

charcircuit · 2026-02-06T03:51:35 1770349895

Not only is it new. There has been 0 performance optimization done. Well none prompted for at least. Once you give the agents a profiler and start a loop focusing on performance you'll see it start improving it.

thesz · 2026-02-06T06:15:45 1770358545

We are talking about compiler here and "performance" referred above is the performance of generated code.

When you are optimizing a program, you have a specific part of code to improve. The part can be found with profiler.

When you are optimizing a compiler generated code, you have many similar parts of code in many programs and not-so-specific part of compiler that can be improved.

charcircuit · 2026-02-06T06:28:14 1770359294

Yes, performance of the generated code. You have some benchmark of using a handful of common programs going through common workflows and you measure the performance of the generated code. As tweaks are made you see how the different performance experiments effect the overall performance. Some strategies are always a win, but things like how you layout different files and functions in memory have different trade offs and are hard to know up front without doing actual real world testing.

ip26 · 2026-02-06T07:24:48 1770362688

I’m excited and waiting for the team that shows with $20k in credits they can substantially speed up the generated code by improving clang!

beambot · 2026-02-05T22:15:20 1770329720

This is getting close to a Ken Thompson "Trusting Trust" era -- AI could soon embed itself into the compilers themselves.

int_19h · 2026-02-06T08:34:47 1770366887

What I want to know is when we get AI decompilers

Intuitively it feels like it should be a straightforward training setup - there's lots of code out there, so compile it with various compilers, flags etc and then use those pairs of source+binary to train the model.

bopbopbop7 · 2026-02-05T22:19:29 1770329969

A pay to use non-deterministic compiler. Sounds amazing, you should start.

Aurornis · 2026-02-05T22:36:03 1770330963

Application-specific AI models can be much smaller and faster than the general purpose, do-everything LLM models. This allows them to run locally.

They can also be made to be deterministic. Some extra care is required to avoid computation paths that lead to numerical differences on different machines, but this can be accomplished reliably with small models that use integer math and use kernels that follow a specific order of operations. You get a lot more freedom to do these things on the small, application-specific models than you do when you're trying to run a big LLM across different GPU implementations in floating point.

soraminazuki · 2026-02-06T03:08:04 1770347284

> They can also be made to be deterministic.

Yeah, in the same way how pseudo-random number generators are "deterministic." They generate the exact same sequence of numbers every time given the seeds are the same!

But that's not the "determinism" people are referring to when they say LLMs aren't deterministic.

ndesaulniers · 2026-02-05T22:26:59 1770330419

Some people care more about compile times than the performance of generated code. Perhaps even the correctness of generated code. Perhaps more so than determinism of the generated code. Different people in different contexts can have different priorities. Trying to make everyone happy can sometimes lead to making no one happy. Thus dichotomies like `-O2` vs `-Os`.

EDIT (since HN is preventing me from responding):

> Some people care more about compiler speed than the correctness?

Yeah, I think plenty of people writing code in languages that have concepts like Undefined Behavior technically don't really care as much about correctness as they may claim otherwise, as it's pretty hard to write large volumes of code without indirectly relying on UB somewhere. What is correct in such case was left up to interpretation of the implementer by ISO WG14.

bopbopbop7 · 2026-02-05T22:33:33 1770330813

Some people care more about compiler speed than the correctness? I would love to meet these imaginary people that are fine with a compiler that is straight up broken. Emitting working code is the baseline, not some preference slider.

gerdesj · 2026-02-06T02:07:59 1770343679

You might have not run Gentoo. Most Gentooers will begrudgingly but eventually admit to cooking their own gonads when updating a laptop.

Anyway, please define: "correctness".

fragmede · 2026-02-05T22:57:27 1770332247

Let's pretend, for just a second, that the people who do, having been able to learn how to program, are not absolute fucking morons. Straight up broken is obviously not useful, so maybe the conclusions you've jumped to could use some reexamination.

chasd00 · 2026-02-05T22:36:02 1770330962

a compiler introducing bugs into code it compiles is a nightmare thankfully few have faced. The only thing worse would be a CPU bug like the legendary Pentium bug. Imagine you compile something like Postgres only to have it crash in some unpredictable way. How long do you stare at Postgres source before suspecting the compiler? What if this compiler was used to compile code in software running all over cloud stacks? Bugs in compilers are very bad news, they have to be correct.

ndesaulniers · 2026-02-05T23:17:35 1770333455

Yeah, my current boss spent time weeding out such hardware bugs: https://arxiv.org/abs/2110.11519 (EDIT: maybe https://x.com/Tesla_AI/status/1930686196201714027 is a more relevant citation)

They found a bimodal distribution in failures over the lifetime of chips. Infant mortality was well understood. Silicon aging over time was much less well understood, and I still find surprising.

addaon · 2026-02-06T00:46:06 1770338766

> a compiler introducing bugs into code it compiles is a nightmare thankfully few have faced

Is this true? It’s not an everyday thing, but when using less common flags, or code structures, or targets… every few years I run into a codegen issue. It’s hard to imagine going through a career without a handful…

ndesaulniers · 2026-02-05T22:31:12 1770330672

We're already starting to see people experimenting with applying AI towards register allocation and inlining heuristics. I think that many fields within a compiler are still ripe for experimentation.

https://llvm.org/docs/MLGO.html

andai · 2026-02-05T23:57:50 1770335870

The asymmetry will be between the frontier AI's ability to create exploits vs find them.

jojobas · 2026-02-06T00:24:43 1770337483

Sorry, clang 26.0 requires an Nvidia B200 to run.

dnautics · 2026-02-06T03:07:06 1770347226

would be hard to miss gigantic kv cache matrix multiplications

greenavocado · 2026-02-06T00:41:15 1770338475

Then i'll be left wondering why my program requires 512TB of RAM to open

grey-area · 2026-02-06T07:45:49 1770363949

Isn't the AI basing what it does heavily on the publicly available source code for compilers in C though? Without that work it would not be able to generate this would it? Or in your opinion is it sufficiently different from the work people like you did to be classed as unique creation?

I'm curious on your take on the references the GAI might have used to create such a project and whether this matters.

MaskRay · 2026-02-06T01:52:47 1770342767

I want to verify the claim that it builds the Linux kernel. It quickly runs into errors, but yeah, still pretty cool!

make O=/tmp/linux/x86 ARCH=x86_64 CC=/tmp/p/claudes-c-compiler/target/release/ccc -j30 defconfig all

``` /home/ray/Dev/linux/arch/x86/include/asm/preempt.h:44:184: error: expected ';' after expression before 'pto_tmp__' do { u32 pto_val__ = ((u32)(((unsigned long) ~0x80000000) & 0xffffffff)); if (0) { __typeof_unqual__((__preempt_count)) pto_tmp__; pto_tmp__ = (~0x80000000); (void)pto_tmp__; } asm ("and" "l " "%[val], " "%" "[var]" : [var] "+m" (((__preempt_count))) : [val] "ri" (pto_val__)); } while (0); ^~~~~~~~~ fix-it hint: insert ';' /home/ray/Dev/linux/arch/x86/include/asm/preempt.h:49:183: error: expected ';' after expression before 'pto_tmp__' do { u32 pto_val__ = ((u32)(((unsigned long) 0x80000000) & 0xffffffff)); if (0) { __typeof_unqual__((__preempt_count)) pto_tmp__; pto_tmp__ = (0x80000000); (void)pto_tmp__; } asm ("or" "l " "%[val], " "%" "[var]" : [var] "+m" (((__preempt_count))) : [val] "ri" (pto_val__)); } while (0); ^~~~~~~~~ fix-it hint: insert ';' /home/ray/Dev/linux/arch/x86/include/asm/preempt.h:61:212: error: expected ';' after expression before 'pao_tmp__' ```

silver_sun · 2026-02-06T03:59:43 1770350383

They said it builds Linux 6.9, maybe you are trying to compile a newer version there?

MaskRay · 2026-02-06T04:44:56 1770353096

git switch v6.9

The riscv build succeeded. For the x86-64 build I ran into

    % make O=/tmp/linux/x86 ARCH=x86_64 CC=/tmp/p/claudes-c-compiler/target/release/ccc-x86 HOSTCC=/tmp/p/claudes-c-compiler/target/release/ccc-x86 LDFLAGS=-fuse-ld=bfd LD=ld.bfd -j30 vmlinux -k
    make[1]: Entering directory '/tmp/linux/x86'
    ...
      CC      arch/x86/platform/intel/iosf_mbi.o
    ccc: error: lgdtl requires memory operand
      AR      arch/x86/platform/intel-mid/built-in.a
    make[6]: *** [/home/ray/Dev/linux/scripts/Makefile.build:362: arch/x86/realmode/rm/wakeup_asm.o] Error 1
    ld.bfd: arch/x86/entry/vdso/vdso32/sigreturn.o: warning: relocation in read-only section `.eh_frame'
    ld.bfd: error in arch/x86/entry/vdso/vdso32/sigreturn.o(.eh_frame); no .eh_frame_hdr table will be created
    ld.bfd: warning: creating DT_TEXTREL in a shared object
    ccc: error: unsupported pushw operand

There are many other errors.

tinyconfig and allnoconfig have fewer errors.

    RELOCS  arch/x86/realmode/rm/realmode.relocs
    Invalid absolute R_386_32 relocation: real_mode_seg

Still very impressive.

pertymcpert · 2026-02-06T06:32:40 1770359560

They said that it wasn't able to support 16 bit real mode. Needs to call gcc for that.

the_jends · 2026-02-06T01:27:55 1770341275

Being just a grunt engineer in a product firm I can't imagine being able to spend multiple years on one project. If it's something you're passionate about, that sounds like a dream!

9rx · 2026-02-06T07:17:26 1770362246

> I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel.

How much of that time was spent writing the tests that they found to use in this experiment? You (or someone like you) were a major contributor to this. All Opus had to do here was keep brute forcing a solution until the tests passed.

It is amazing that it is possible at all, but remains an impossibly without a heavy human hand. One could easily still spend a good part of their career reproducing this if they first had to rewrite all of the tests from scratch.

zaphirplane · 2026-02-05T22:02:00 1770328920

What were the challenges out of interest. Some of it is the use of gcc extensions? Which needed an equivalent and porting over to the equivalent

ndesaulniers · 2026-02-05T22:17:51 1770329871

`asm goto` was the big one. The x86_64 maintainers broke the clang builds very intentionally just after we had gotten x86_64 building (with necessary patches upstreamed) by requiring compiler support for that GNU C extension. This was right around the time of meltdown+spectre, and the x86_64 maintainers didn't want to support fallbacks for older versions of GCC (and ToT Clang at the time) that lacked `asm goto` support for the initial fixes shipped under duress (embargo). `asm goto` requires plumbing throughout the compiler, and I've learned more about register allocation than I particularly care...

Fixing some UB in the kernel sources, lots of plumbing to the build system (particularly making it more hermetic).

Getting the rest of the LLVM binutils substitutes to work in place of GNU binutils was also challenging. Rewriting a fair amount of 32b ARM assembler to be "unified syntax" in the kernel. Linker bugs are hard to debug. Kernel boot failures are hard to debug (thank god for QEMU+gdb protocol). Lots of people worked on many different parts here, not just me.

Evangelism and convincing upstream kernel developers why clang support was worth anyones while.

https://github.com/ClangBuiltLinux/linux/issues for a good historical perspective. https://github.com/ClangBuiltLinux/linux/wiki/Talks,-Present... for talks on the subject. Keynoting LLVM conf was a personal highlight (https://www.youtube.com/watch?v=6l4DtR5exwo).

TZubiri · 2026-02-06T05:25:35 1770355535

>Is the generated code correct? The jury is still out on that one for production compilers. And then you have performance of generated code.

It's worth noting that this was developed by compiling Linux and running tests, so at least that is part of the training set and not the testing set.

But at least for linux, I'm guessing the tests are very robust and I'm guessing that will work correctly. That said, if any bugs pop up, it will show weak points in the linux tests.

phillmv · 2026-02-05T21:56:06 1770328566

i mean… your work also went into the training set, so it's not entirely surprising that it spat a version back out!

underdeserver · 2026-02-05T21:59:51 1770328791

Anthropic's version is in Rust though, so at least a little different.

ndesaulniers · 2026-02-05T22:22:35 1770330155

There's parts of LLVM architecture that are long in the tooth (IMO) (as is the language it's implemented in, IMO).

I had hoped one day to re-implement parts of LLVM itself in Rust; in particular, I've been curious if we can concurrently compile C (and parse C in parallel, or lazily) that haven't been explored in LLVM, and I think might be safer to do in Rust. I don't know enough about grammers to know if it's technically impossible, but a healthy dose of ignorance can sometimes lead to breakthroughs.

LLVM is pretty well designed for test. I was able to implement a lexer for C in Rust that could lex the Linux kernel, and use clang to cross check my implementation (I would compare my interpretation of the token stream against clang's). Just having a standard module system makes having reusable pieces seems like perhaps a better way to compose a toolchain, but maybe folks with more experience with rustc have scars to disagree?

jcranmer · 2026-02-06T01:30:57 1770341457

> I had hoped one day to re-implement parts of LLVM itself in Rust

Heh, earlier this day, I was just thinking how crazy a proposal would it actually be to have a Rust dependency (specifically, the egg crate, since one of the things I'm banging my head against right now might be better solved with egraphs).

rwmj · 2026-02-05T22:06:29 1770329189

It's not really important in latent space / conceptually.

D-Machine · 2026-02-06T02:01:12 1770343272

This is the proper deep critique / skepticism (or sophisticated goal-post moving, if you prefer) here. Yes, obviously this isn't just reproducing C compiler code in the training set, since this is Rust, but it is much less clear how much of the generated Rust code can (or can not) be accurately seen as being translated from C code in the training set.

yoz-y · 2026-02-05T23:39:54 1770334794

One thing LLMs are really good at is translation. I haven’t tried porting projects from one language to another, but it wouldn’t surprise me if they were particularly good at that too.

GaggiX · 2026-02-05T21:59:21 1770328761

Clang is not written in Rust tho

underdeserver · 2026-02-05T22:00:07 1770328807

jbjbjbjb · 2026-02-05T23:20:13 1770333613

It’s cool but there’s a good chance it’s just copying someone else’s homework albeit in an elaborate round about way.

nomel · 2026-02-05T23:43:14 1770334994

I would claim that LLMs desperately need proprietary code in their training, before we see any big gains in quality.

There's some incredible source available code out there. Statistically, I think there's a LOT more not so great source available code out there, because the majority of output of seasoned/high skill developers is proprietary.

To me, a surprising portion of Claude 4.5 output definitely looks like student homework answers, because I think that's closer to the mean of the code population.

dcre · 2026-02-06T03:30:52 1770348652

This is dead wrong: essentially the entirety of the huge gains in coding performance in the past year have come from RL, not from new sources of training data.

I echo the other commenters that proprietary code isn’t any better, plus it doesn’t matter because when you use LLMs to work on proprietary code, it has the code right there.

thesz · 2026-02-06T06:42:43 1770360163

  > the huge gains in coding performance in the past year have come from RL, not from new sources of training data.

This one was on HN recently: https://spectrum.ieee.org/ai-coding-degrades

Author attributes past year's degradation of code generation by LLMs to excessive use of new source of training data, namely, users' code generation conversations.

elevation · 2026-02-06T03:58:40 1770350320

> it doesn’t matter because when you use LLMs to work on proprietary code, it has the code right there

The quality of the existing code base makes a huge difference. On a recent greenfield effort, Claude emitted an MVP that matched the design semantics, but the code was not up to standards. For example, it repeatedly loaded a large file into memory in different areas where it was needed (rather than loading once and passing a reference.)

However, after an early refactor, the subsequently generated code vastly improved. It honors the testing and performance paradigms, and it's so clean there's nothing for the linter to do.

nextos · 2026-02-06T04:32:38 1770352358

Progress with RL is very interesting, but it's still too inefficient. Current models do OK on simple boring linear code. But they output complete nonsense when presented with some compact but mildly complex code, e.g. a NumPyro model with some nesting and einsums.

For this reason, to be truly useful, model outputs need to be verifiable. Formal verification with languages like Dafny , F*, or Isabelle might offer some solutions [1]. Otherwise, a gigantic software artifact such as a compiler is going to have a critical correctness bugs with far-fetched consequences if deployed in production.

Right now, I think treating a LLM like something different than a very useful information retrieval system with excellent semantic capabilities is not something I am comfortable with.

[1] https://risemsr.github.io/blog/2026-02-04-nik-agentic-pop

bearjaws · 2026-02-06T01:59:52 1770343192

I will say many closed source repos are probably equally as poor as open source ones.

Even worse in many cases because they are so over engineered nobody understands how they work.

hirvi74 · 2026-02-06T03:56:57 1770350217

I firmly agree with your first sentence. I can just think about the various modders that have created patches and performance enhancing mods for games with budgets of tens to hundreds of millions of dollars.

But to give other devs and myself some grace, I do believe plenty of bad code can likely be explained by bad deadlines. After all, what's the Russian idiom? "There is nothing more permanent than the temporary."

typ · 2026-02-06T01:50:58 1770342658

I'd bet, on average, the quality of proprietary code is worse than open-source code. There have been decades of accumulated slop generated by human agents with wildly varied skill levels, all vibe-coded by ruthless, incompetent corporate bosses.

Manouchehri · 2026-02-06T02:15:31 1770344131

There's only very niche fields where closed-source code quality is often better than open-source code.

Exploits and HFT are the two examples I can think of. Both are usually closed source because of the financial incentives.

ozim · 2026-02-06T05:06:45 1770354405

Here we can start debating what means better code.

I haven’t seen HFT code but I have seen examples of exploit codes and most of it is amateur hour when it comes to building big size systems.

They are of course efficient in getting to the goal. But exploits are one off code that is not there to be maintained.

Take8435 · 2026-02-06T02:30:01 1770345001

Not to mention, a team member is (surprise!) fired or let go, and no knowledge transfer exists. Womp, womp. Codebase just gets worse as the organization or team flails.

Seen this way too often.

hirvi74 · 2026-02-06T04:03:20 1770350600

In my time, I have potentially written code that some legal jurisdictions might classify as a "crime against humanity" due to the quality.

kortilla · 2026-02-06T04:15:14 1770351314

It doesn’t matter what the average is though. If 1% of software is open source, there is significantly more closed source software out there and given normal skills distributions, that means there is at least as much high quality closed source software out there, if not significantly more. The trick is skipping the 95% of crap.

bhadass · 2026-02-06T00:16:08 1770336968

yeah, but isn't the whole point of claude code to get people to provide preference data/telemetry data to anthropic (unless you opt out?). same w/ other providers.

i'm guessing most of the gains we've seen recently are post training rather than pretraining.

nomel · 2026-02-06T01:04:10 1770339850

Yes, but you have the problem that a good portion of that is going to be AI generated.

But, I naively assume most orgs would opt out. I know some orgs have a proxy in place that will prevent certain proprietary code from passing through!

This makes me curious if, in the allow case, Anthropic is recording generated output, to maybe down-weight it if it's seen in the training data (or something similar)?

andai · 2026-02-05T23:56:07 1770335767

Let's start with the source code for the Flash IDE :)

wvenable · 2026-02-06T00:28:04 1770337684

This is cool and actually demonstrates real utility. Using AI to take something that already exists and create it for a different library / framework / platform is cool. I'm sure there's a lot of training data in there for just this case.

But I wonder how it would fare given a language specification for a non-existent non-trivial language and build a compiler for that instead?

nmstoker · 2026-02-06T01:18:13 1770340693

If you come up with a realistic language spec and wait maybe six months, by then it'll probably be approach being cheap enough that you could test the scenario yourself!

luke5441 · 2026-02-05T23:37:26 1770334646

It looks like a much more progressed/complete version of https://github.com/kidoz/smdc-toolchain/tree/master/crates/s... . But that one is only a month old. So a bit confused there. Maybe that was also created via LLM?

nlawalker · 2026-02-06T00:59:31 1770339571

I see that as the point that all this is proving - most people, most of the time, are essentially reinventing the wheel at some scope and scale or another, so we’d all benefit from being able to find and copy each others’ homework more efficiently.

computerex · 2026-02-06T01:50:43 1770342643

And the goal post shifts.

kreelman · 2026-02-06T00:53:02 1770339182

..A small thing, but it won't compile the RISCV version of hello.c if the source isn't installed on the machine it's running on.

It is standing on the shoulders of giants (all of the compilers of the past, built into it's training data... and the recent learnings about getting these agents to break up tasks) to get itself going. Still fairly impressive.

On a side-quest, I wonder where Anthropic is getting there power from. The whole energy debacle in the US at the moment probably means it made some CO2 in the process. Would be hard to avoid?

eek2121 · 2026-02-05T23:58:43 1770335923

Also: a large amount of folks seem to think Claude code is losing a ton of money. I have no idea where the final numbers land, however, if the $20,000 figure is accurate and based on some of the estimates I've seen, they could've hired 8 senior level developers at a quarter million a year for the same amount of money spent internally.

Granted, marketing sucks up far too much money for any startup, and again, we don't know the actual numbers in play, however, this is something to keep in mind. (The very same marketing that likely also wrote the blog post, FWIW).

willsmith72 · 2026-02-06T00:11:37 1770336697

this doesn't add up. the 20k is in API costs. people talk about CC losing money because it's way more efficient than the API. I.e. the same work with efficient use of CC might have cost ~$5k.

but regardless, hiring is difficult and high-end talent is limited. If the costs were anywhere close to equivalent, the agents are a no-brainer

majormajor · 2026-02-06T01:27:23 1770341243

CC hits their APIs, And internally I'm sure Anthropic tracks those calls, which is what they seem to be referencing here. What exactly did Anthropic do in this test to have "inefficient use of CC" vs your proposed "efficient use of CC"?

Or do you mean that if an external user replicated this experience they might get billed less than $20k due to CC being sold at lower rates than per-API-call metered billing?

GorbachevyChase · 2026-02-06T00:45:38 1770338738

Even if the dollar cost for product created was the same, the flexibility of being able to spin a team up and down with an API call is a major advantage. That AI can write working code at all is still amazing to me.

bloaf · 2026-02-06T01:38:22 1770341902

This thing was done in 2 weeks. In the orgs I've worked in, you'd be lucky to get HR approval to create a job posting within 2 weeks.

NitpickLawyer · 2026-02-05T19:28:13 1770319693

This is a much more reasonable take than the cursor-browser thing. A few things that make it pretty impressive:

> This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQlite, postgres, redis

> I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes) I did not go into any detail on how to do so.

> Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects.

And the very open points about limitations (and hacks, as cc loves hacks):

> It lacks the 16-bit x86 compiler that is necessary to boot [...] Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase

> It does not have its own assembler and linker;

> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

Ending with a very down to earth take:

> The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

All in all, I'd say it's a cool little experiment, impressive even with the limitations, and a good test-case as the author says "The resulting compiler has nearly reached the limits of Opus’s abilities". Yeah, that's fair, but still highly imrpessive IMO.

shubhamjain · 2026-02-06T08:43:06 1770367386

Yeah, I am amazed how people are brushing this off simply because GCC exists. This was far more challenging task than the browser thing, because of how far few open source compilers are there. Add to that no internet access and no dependencies.

At this point, it’s hard to deny that AI has become capable of completing extremely difficult tasks, provided it has enough time and tokens.

bjackman · 2026-02-06T08:49:03 1770367743

I don't think this is more challenging than the browser thing. The scope is much smaller. The fact that this is "only" 100k lines is evidence for this. But, it's still very impressive.

I think this is Anthropic seeing the Cursor guy's bullshit and saying "but, we need to show people that the AI _can actually_ do very impressive shit as long as you pick a more sensible goal"

geraneum · 2026-02-05T19:39:38 1770320378

> This was a clean-room implementation

This is really pushing it, considering it’s trained on… internet, with all available c compilers. The work is already impressive enough, no need for such misleading statements.

raincole · 2026-02-05T21:38:36 1770327516

It's not a clean-room implementation, but not because it's trained on the internet.

It's not a clean-room implementation because of this:

> The fix was to use GCC as an online known-good compiler oracle to compare against

Calavar · 2026-02-05T22:38:36 1770331116

The classical definition of a clean room implementation is something that's made by looking at the output of a prior implementation but not at the source.

I agree that having a reference compiler available is a huge caveat though. Even if we completely put training data leakage aside, they're developing against a programmatic checker for a spec that's already had millions of man hours put into it. This is an optimal scenario for agentic coding, but the vast majority of problems that people will want to tackle with agentic coding are not going to look like that.

visarga · 2026-02-06T05:43:07 1770356587

This is the reimplementation scenario for agentic coding. If you have a good spec and battery of tests you can delete the code and reimplement it. Code is no longer the product of eng work, it is more like bytecode now, you regenerate it, you don't read it. If you have to read it then you are just walking a motorcycle.

We have seen at least 3 of these projects - the JustHTML one, the FastRender and this one. All started from beefy tests and specs. They show reimplementation without manual intervention kind of works.

Calavar · 2026-02-06T06:18:19 1770358699

I think that's overstating it.

JustHTML is a success in large part because it's a problem that can be solved with 4 digit LOC. The whole codebase can sit in an LLM's context at once. Do LLMs scale beyond that?

I would classify both FastRender and Opus C compiler as interesting failures. They are interesting because they got a non-negligible fraction of the way to feature complete. They are failures because they ended with no clear path for moving the needle forward to 80% feature complete, let alone 100%.

From the original article:

> The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

From the experiments we've seen so far it seems that a large enough agentic code base will inevitably collapse under its own weight.

array_key_first · 2026-02-05T22:54:11 1770332051

If you read the entire GCC source code and then create a compatible compiler, it's not clean room. Which Opus basically did since, I'm assuming, its training set contained the entire source of GCC. So even if they were actively referencing GCC I think that counts.

nmilo · 2026-02-06T00:00:50 1770336050

What if you just read the entire GCC source code in school 15 years ago? Is that not clean room?

hex4def6 · 2026-02-06T00:27:28 1770337648

No.

I'd argue that no one would really care given it's GCC.

But if you worked for GiantSodaCo on their secret recipe under NDA, then create a new soda company 15 years later that tastes suspiciously similar to GiantSodaCo, you'd probably have legal issues. It would be hard to argue that you weren't using proprietary knowledge in that case.

pertymcpert · 2026-02-06T06:44:37 1770360277

I read the source. If anything it takes concepts from LLVM more than GCC, but the similarities aren't very deep.

GorbachevyChase · 2026-02-06T01:23:44 1770341024

https://arxiv.org/abs/2505.03335

Check out the paper above on Absolute Zero. Language models don’t just repeat code they’ve seen. They can learn to code give the right training environment.

cryptonector · 2026-02-06T02:19:31 1770344371

Hmm... If Claude iterated a lot then chances are very good that the end result bears little resemblance to open source C compilers. One could check how much resemblance the result actually bears to open source compilers, and I rather suspect that if anyone does check they'll find it doesn't resemble any open source C compiler.

TacticalCoder · 2026-02-05T23:16:33 1770333393

I'm using AI to help me code and I love Anthropic but I chocked when I read that in TFA too.

It's all but a clean-room design. A clean-room design is a very well defined term: "Clean-room design (also known as the Chinese wall technique) is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design."

https://en.wikipedia.org/wiki/Clean-room_design

The "without infringing any of the copyrights" contains "any".

We know for a fact that models are extremely good at storing information with the highest compression rate ever achieved. It's not because it's typically decompressing that information in a lossy way that it didn't use that information in the first place.

Note that I'm not saying all AIs do is simply compress/decompress information. I'm saying that, as commenters noted in this thread, when a model was caught spotting out Harry Potter verbatim, there is information being stored.

It's not a clean-room design, plain and simple.

iberator · 2026-02-06T06:30:37 1770359437

this. last sane person in HN

antirez · 2026-02-05T20:05:38 1770321938

The LLM does not contain a verbatim copy of whatever it saw during the pre-training stage, it may remember certain over-represented parts, otherwise it has a knowledge about a lot of things but such knowledge, while about a huge amount of topics, is similar to the way you could remember things you know very well. And, indeed, if you give it access to internet or the source code of GCC and other compilers, it will implement such a project N times faster.

halxc · 2026-02-05T20:14:29 1770322469

We all saw verbatim copies in the early LLMs. They "fixed" it by implementing filters that trigger rewrites on blatant copyright infringement.

It is a research topic for heaven's sake:

https://arxiv.org/abs/2504.16046

RyanCavanaugh · 2026-02-05T20:18:12 1770322692

The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte. While they are certainly capable of doing some verbatim recitations, this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?) and stored inside the model.

silver_sun · 2026-02-06T03:31:58 1770348718

> this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?)

A quick search brings up several C compilers written in Rust. I'm not claiming they are necessarily in Claude's training data, but they do exist.

https://github.com/PhilippRados/wrecc (unfinished)

https://github.com/ClementTsang/rustcc

https://codeberg.org/notgull/dozer (unfinished)

https://github.com/jyn514/saltwater

I would also like to add that as language models improve (in the sense of decreasing loss on the training set), they in fact become better at compressing their training data ("the Internet"), so that a model that is "half a terabyte" could represent many times more concepts with the same amount of space. Only comparing the relative size of the internet vs a model may not make this clear.

philipportner · 2026-02-05T21:46:51 1770328011

This seems related, it may not be a codebase but they are able to extract "near" verbatim books out of Claude Sonnet.

https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

Aurornis · 2026-02-05T22:45:54 1770331554

Their technique really stretched the definition of extracting text from the LLM.

They used a lot of different techniques to prompt with actual text from the book, then asked the LLM to continue the sentences. I only skimmed the paper but it looks like there was a lot of iteration and repetitive trials. If the LLM successfully guessed words that followed their seed, they counted that as "extraction". They had to put in a lot of the actual text to get any words back out, though. The LLM was following the style and clues in the text.

You can't literally get an LLM to give you books verbatim. These techniques always involve a lot of prompting and continuation games.

D-Machine · 2026-02-06T02:13:31 1770344011

To make some vague claims explicit here, for interested readers:

> "We quantify the proportion of the ground-truth book that appears in a production LLM’s generated text using a block-based, greedy approximation of longest common substring (nv-recall, Equation 7). This metric only counts sufficiently long, contiguous spans of near-verbatim text, for which we can conservatively claim extraction of training data (Section 3.3). We extract nearly all of Harry Potter and the Sorcerer’s Stone from jailbroken Claude 3.7 Sonnet (BoN N = 258, nv-recall = 95.8%). GPT-4.1 requires more jailbreaking attempts (N = 5179) [...]"

So, yes, it is not "literally verbatim" (~96% verbatim), and there is indeed A LOT (hundreds or thousands of prompting attempts) to make this happen.

I leave it up to the reader to judge how much this weakens the more basic claims of the form "LLMs have nearly perfectly memorized some of their source / training materials".

I am imagining a grueling interrogation that "cracks" a witness, so he reveals perfect details of the crime scene that couldn't possibly have been known to anyone that wasn't there, and then a lawyer attempting the defense: "but look at how exhausting and unfair this interrogation was--of course such incredible detail was extracted from my innocent client!"

DiogenesKynikos · 2026-02-06T06:38:51 1770359931

The one-shot performance of their recall attempts is much less impressive. The two best-performing models were only able to reproduce about 70% of a 1000-token string. That's still pretty good, but it's not as if they spit out the book verbatim.

In other words, if you give an LLM a short segment of a very well known book, it can guess a short continuation (several sentences) reasonably accurately, but it will usually contain errors.

D-Machine · 2026-02-06T06:54:22 1770360862

Right, and this should be contextualized with respect to code generation. It is not crazy to presume that LLMs have effectively nearly perfectly memorized certain training sources, but the ability to generate / extract outputs that are nearly identical to those training sources will of course necessarily be highly contingent on the prompting patterns and complexity.

So, dismissals of "it was just translating C compilers in the training set to Rust" need to be carefully quantified, but, also, need to be evaluated in the context of the prompts. As others in this post have noted, there are basically no details about the prompts.

Calavar · 2026-02-06T01:34:54 1770341694

Sure, maybe it's tricky to coerce an LLM into spitting out a near verbatim copy of prior data, but that's orthoginal to whether or not the data to create a near verbatim copy exists in the model weights.

D-Machine · 2026-02-06T02:31:27 1770345087

Especially since the recalls achieved in the paper are 96% (based on block largest-common substring approaches), the effort of extraction is utterly irrelevant.

seba_dos1 · 2026-02-05T22:32:57 1770330777

> The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte.

The lesson here is that the Internet compresses pretty well.

mft_ · 2026-02-05T22:15:04 1770329704

(I'm not needlessly nitpicking, as I think it matters for this discussion)

A frontier model (e.g. latest Gemini, Gpt) is likely several-to-many times larger than 500GB. Even Deepseek v3 was around 700GB.

But your overall point still stands, regardless.

uywykjdskn · 2026-02-05T23:45:02 1770335102

You got a source on frontier models being maybe half a terabyte. That's not passing the sniff test.

ben_w · 2026-02-05T20:24:02 1770323042

We saw partial copies of large or rare documents, and full copies of smaller widely-reproduced documents, not full copies of everything. An e.g. 1 trillion parameter model is not a lossless copy of a ten-petabyte slice of plain text from the internet.

The distinction may not have mattered for copyright laws if things had gone down differently, but the gap between "blurry JPEG of the internet" and "learned stuff" is more obviously important when it comes to e.g. "can it make a working compiler?"

tza54j · 2026-02-05T21:09:05 1770325745

We are here in a clean room implementation thread, and verbatim copies of entire works are irrelevant to that topic.

It is enough to have read even parts of a work for something to be considered a derivative.

I would also argue that language models who need gargantuan amounts of training material in order to work by definition can only output derivative works.

It does not help that certain people in this thread (not you) edit their comments to backpedal and make the followup comments look illogical, but that is in line with their sleazy post-LLM behavior.

ben_w · 2026-02-05T21:35:57 1770327357

> It is enough to have read even parts of a work for something to be considered a derivative.

For IP rights, I'll buy that. Not as important when the question is capabilities.

> I would also argue that language models who need gargantuan amounts of training material in order to work by definition can only output derivative works.

For similar reasons, I'm not going to argue against anyone saying that all machine learning today, doesn't count as "intelligent":

It is perfectly reasonable to define "intelligence" to be the inverse of how many examples are needed.

ML partially makes up for being (by this definition) thick as an algal bloom, by being stupid so fast it actually can read the whole internet.

philipportner · 2026-02-05T21:49:22 1770328162

Granted, these are some of the most widely spread texts, but just fyi:

https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

D-Machine · 2026-02-06T02:17:39 1770344259

Note "near-verbatim" here is:

> "We quantify the proportion of the ground-truth book that appears in a production LLM’s generated text using a block-based, greedy approximation of longest common substring (nv-recall, Equation 7). This metric only counts sufficiently long, contiguous spans of near-verbatim text, for which we can conservatively claim extraction of training data (Section 3.3). We extract nearly all of Harry Potter and the Sorcerer’s Stone from jailbroken Claude 3.7 Sonnet (BoN N = 258, nv-recall = 95.8%). GPT-4.1 requires more jailbreaking attempts (N = 5179) and refuses to continue after reaching the end of the first chapter; the generated text has nv-recall = 4.0% with the full book. We extract substantial proportions of the book from Gemini 2.5 Pro and Grok 3 (76.8% and 70.3%, respectively), and notably do not need to jailbreak them to do so (N = 0)."

if you want to quantify the "near" here.

ben_w · 2026-02-05T21:56:25 1770328585

Already aware of that work, that's why I phrased it the way I did :)

Edit: actually, no, I take that back, that's just very similar to some other research I was familiar with.

antirez · 2026-02-05T20:44:25 1770324265

Besides, the fact an LLM may recall parts of certain documents, like I can recall incipits of certain novels, does not mean that when you ask LLM of doing other kind of work, that is not recalling stuff, the LLM will mix such things verbatim. The LLM knows what it is doing in a variety of contexts, and uses the knowledge to produce stuff. The fact that for many people LLMs being able to do things that replace humans is bitter does not mean (and is not true) that this happens mainly using memorization. What coding agents can do today have zero explanation with memorization of verbatim stuff. So it's not a matter of copyright. Certain folks are fighting the wrong battle.

shakna · 2026-02-05T22:12:28 1770329548

During a "clean room" implementation, the implementor is generally selected for not being familiar with the workings of what they're implementing, and banned from researching using it.

Because it _has_ been enough, that if you can recall things, that your implementation ends up not being "clean room", and trashed by the lawyers who get involved.

I mean... It's in the name.

> The term implies that the design team works in an environment that is "clean" or demonstrably uncontaminated by any knowledge of the proprietary techniques used by the competitor.

If it can recall... Then it is not a clean room implementation. Fin.

boroboro4 · 2026-02-05T20:47:51 1770324471

While I mostly agree with you, it worth noting modern llms are trained on 10-20-30T of tokens which is quite comparable to their size (especially given how compressible the data is)

Aurornis · 2026-02-05T22:39:12 1770331152

Simple logic will demonstrate that you can't fit every document in the training set into the parameters of an LLM.

Citing a random arXiv paper from 2025 doesn't mean "they" used this technique. It was someone's paper that they uploaded to arXiv, which anyone can do.

soulofmischief · 2026-02-05T21:13:04 1770325984

The point is that it's a probabilistic knowledge manifold, not a database.

PunchyHamster · 2026-02-05T21:17:46 1770326266

we all know that.

soulofmischief · 2026-02-05T23:19:29 1770333569

Unfortunately, that doesn't seem to be the case. The person I replied to might not understand this, either.

majormajor · 2026-02-06T01:33:10 1770341590

You couldn't reasonably claim you did a clean-room implementation of something you had read the source to even though you, too, would not have a verbatim copy of the entire source code in your memory (barring very rare people with exceptional memories).

It's kinda the whole point - you haven't read it so there's no doubt about copying in a clean-room experiment.

A "human style" clean-room copy here would have to be using a model trained on, say, all source code except GCC. Which would still probably work pretty well, IMO, since that's a pretty big universe still.

PunchyHamster · 2026-02-05T21:17:26 1770326246

So it will copy most code with adding subtle bugs

inchargeoncall · 2026-02-05T21:20:26 1770326426

[flagged]

teaearlgraycold · 2026-02-05T21:25:37 1770326737

With just a few thousand dollars of API credits you too can inefficiently download a lossy copy of a C compiler!

modeless · 2026-02-05T19:35:58 1770320158

There seem to still be a lot of people who look at results like this and evaluate them purely based on the current state. I don't know how you can look at this and not realize that it represents a huge improvement over just a few months ago, there have been continuous improvements for many years now, and there is no reason to believe progress is stopping here. If you project out just one year, even assuming progress stops after that, the implications are staggering.

zamadatix · 2026-02-05T21:22:47 1770326567

The improvements in tool use and agentic loops have been fast and furious lately, delivering great results. The model growth itself is feeling more "slow and linear" lately, but what you can do with models as part of an overall system has been increasing in growth rate and that has been delivering a lot of value. It matters less if the model natively can keep infinite context or figure things out on its own in one shot so long as it can orchestrate external tools to achieve that over time.

LinXitoW · 2026-02-06T03:18:45 1770347925

The main issue with improvements in the last year is that a lot of it is based not on the models strictly becoming better, but on tooling being better, and simply using a fuckton more tokens for the same task.

Remember that all these companies can only exist because of massive (over)investments in the hope of insane returns and AGI promises. While all these improvements (imho) prove the exact opposite: AGI is absolutely not coming, and the investments aren't going to generate these outsized returns. The will generate decent returns, and the tools are useful.

modeless · 2026-02-06T05:48:52 1770356932

I disagree. A year ago the models would not come close to doing this, no matter what tools you gave them or how many tokens you generated. Even three months ago. Effectively using tools to complete long tasks required huge improvements in the models themselves. These improvements were driven not by pretraining like before, but by RL with verifiable rewards. This can continue to scale with training compute for the foreseeable future, eliminating the "data wall" we were supposed to be running into.

nozzlegear · 2026-02-05T21:16:51 1770326211

Every S-curve looks like an exponential until you hit the bend.

NitpickLawyer · 2026-02-05T21:24:41 1770326681

We've been hearing this for 3 years now. And especially 25 was full of "they've hit a wall, no more data, running out of data, plateau this, saturated that". And yet, here we are. Models keep on getting better, at more broad tasks, and more useful by the month.

LinXitoW · 2026-02-06T03:22:42 1770348162

Model improvement is very much slowing down, if we actually use fair metrics. Most improvements in the last year or so comes down to external improvements, like better tooling, or the highly sophisticated practice of throwing way more tokens at the same problem (reasoning and agents).

Don't get me wrong, LLMs are useful. They just aren't the kind of useful that Sam et al. sold investors. No AGI, no full human worker replacement, no massive reduction in cost for SOTA.

kelnos · 2026-02-05T23:09:59 1770332999

Yes, and Moore's law took decades to start to fail to be true. Three years of history isn't even close to enough to predict whether or not we'll see exponential improvement, or an unsurmountable plateau. We could hit it in 6 months or 10 years, who knows.

And at least with Moore's law, we had some understanding of the physical realities as transistors would get smaller and smaller, and reasonably predict when we'd start to hit limitations. With LLMs, we just have no idea. And that could be go either way.

nozzlegear · 2026-02-05T21:56:49 1770328609

> We've been hearing this for 3 years now

Not from me you haven't!

> "they've hit a wall, no more data, running out of data, plateau this, saturated that"

Everyone thought Moore's Law was infallible too, right until they hit that bend. What hubris to think these AI models are different!

But you've probably been hearing that for 3 years too (though not from me).

> Models keep on getting better, at more broad tasks, and more useful by the month.

If you say so, I'll take your word for it.

torginus · 2026-02-05T22:07:43 1770329263

Except for Moore's law, everyone knew decades ahead of what the limits of Dennard scaling are (shrinking geometry through smaller optical feature sizes), and roughly when we would get to the limit.

Since then, all improvements came at a tradeoff, and there was a definite flattening of progress.

nozzlegear · 2026-02-05T22:28:10 1770330490

> Since then, all improvements came at a tradeoff, and there was a definite flattening of progress.

Idk, that sounds remarkably similar to these AI models to me.

kijiki · 2026-02-06T02:17:27 1770344247

Everyone?

Intel, at the time the unquestioned world leader in semiconductor fabrication was so unable to accurately predict the end of Dennard scaling that they rolled out the Pentium 4. "10Ghz by 2010!" was something they predicted publicly in earnest!

It, uhhh, didn't quite work out that way.

Cyphase · 2026-02-05T22:06:19 1770329179

25 is 2025.

nozzlegear · 2026-02-05T22:25:22 1770330322

Oh my bad, the way it was worded made me read it as the name of somebody's model or something.

fmbb · 2026-02-05T22:18:01 1770329881

> And yet, here we are.

I dunno. To me it doesn’t even look exponential any more. We are at most on the straight part of the incline.

sdf2erf · 2026-02-06T01:36:07 1770341767

Personally my usage has fell off a cliff the past few months. Im not a SWE.

SWE's may be seeing benefit. But in other areas? Doesnt seem to be the case. Consumers may use it as a more preferred interface for search - but this is a different discussion.

esafak · 2026-02-06T05:26:53 1770355613

What if it plateaus smarter than us? You wouldn't be able to discern where it stopped. I'm not convinced it won't be able to create its own training data to keep improving. I see no ceiling on the horizon, other than energy.

raincole · 2026-02-05T21:29:27 1770326967

This quote would be more impactful if people haven't been repeating it since gpt-4 time.

kimixa · 2026-02-05T22:11:18 1770329478

People have also been saying we'd be seeing the results of 100x quality improvements in software with corresponding decease in cost since gpt-4 time.

So where is that?

nozzlegear · 2026-02-05T21:58:12 1770328692

I agree, I have been informed that people have been repeating it for three years. Sadly I'm not involved in the AI hype bubble so I wasn't aware. What an embarrassing faux pas.

famouswaffles · 2026-02-06T02:30:55 1770345055

Cool I guess. Kind of a meaningless statement yeah? Let's hit the bend, then we'll talk. Until then repeating, 'It's an S Curve guys and what's more, we're near the bend! trust me" ad infinitum is pointless. It's not some wise revelation lol.

smj-edison · 2026-02-06T02:46:29 1770345989

Maybe the best thing to say is we can only really forecast about 3 months out accurately, and the rest is wild speculation :)

History has a way of being surprisingly boring, so personally I'm not betting on the world order being transformed in five years, but I also have to take my own advice and take things a day at a time.

nozzlegear · 2026-02-06T03:59:40 1770350380

> Kind of a meaningless statement yeah?

If you say so. It's clear you think these marketing announcements are still "exponential improvements" for some reason, but hey, I'm not an AI hype beast so by all means keep exponentialing lol

chasd00 · 2026-02-05T22:05:26 1770329126

i have to admit, even if model and tooling progress stopped dead today the world of software development has forever changed and will never go back.

uywykjdskn · 2026-02-05T23:46:41 1770335201

Yea the software engineering profession is over, even if all improvements stop now.

gmueckl · 2026-02-05T19:43:54 1770320634

The result is hardly a clean room implementation. It was rather a brute force attempt to decompress fuzzily stored knowledge contained within the network and it required close steering (using a big suite of tests) to get a reasonable approximation to the desired output. The compression and storage happened during the LLM training.

Prove this statement wrong.

libraryofbabel · 2026-02-05T21:07:34 1770325654

Nobody disputes that the LLM was drawing on knowledge in its training data. Obviously it was! But you'll need to be a bit more specific with your critique, because there is a whole spectrum of interpretations, from "it just decompressed fuzzily-stored code verbatim from the internet" (obviously wrong, since the Rust-based C compiler it wrote doesn't exist on the internet) all the way to "it used general knowledge from its training about compiler architecture and x86 and the C language."

Your post is phrased like it's a two sentence slam-dunk refutation of Anthropic's claims. I don't think it is, and I'm not even clear on what you're claiming precisely except that LLMs use knowledge acquired during training, which we all agree on here.

nicoburns · 2026-02-05T22:55:43 1770332143

"clean room" usually means "without looking at the source code" of other similar projects. But presumably the AIs training data would have included GCC, Clang, and probably a dozen other C compilers.