The result is a fuzzy reproduction of the training input, specifically of the co...

astrange · 2026-02-06T10:00:08 1770372008

> The result is a fuzzy reproduction of the training input, specifically of the compilers contained within.

Is it? I'm somewhat familiar with gcc and clang's source, and it doesn't particularly look like it to me.

https://github.com/anthropics/claudes-c-compiler/blob/main/s...

https://llvm.org/doxygen/LoopStrengthReduce_8cpp_source.html

https://github.com/gcc-mirror/gcc/blob/master/gcc/gimple-ssa...

gmueckl · 2026-02-06T15:49:51 1770392991

Checking for similarity with compilers that consist of orders of magnitude more code probably doesn't reveal much. There are many more smaller compilers for C-adjacent languages out there, plus code fragments from textbooks.

astrange · 2026-02-06T17:36:31 1770399391

There are not many more compilers with the specific optimization pass I linked.

Also, I don't think you could reuse code from a different compiler unless you used the same IR.

libraryofbabel · 2026-02-06T01:00:07 1770339607

Thanks for elaborating. So what is the empirically-testable assertion behind this… that an LLM cannot create a (sufficiently complex) system without examples of the source code of similar systems in its training set? That seems empirically testable, although not for compilers without training a whole new model that excludes compiler source code from training. But what other kind of system would count for you?

gmueckl · 2026-02-06T15:52:58 1770393178

I personally work on simulation software and create novel simulation methods as part of the job. I find that LLMs can only help if I reduce the task to a translation of detailed algorithm descriptions from English to code. And even then, the output is often riddled with errors.