Nope, they encoded or compiled in a simple VM / WASM interpreter to the transformer weights, there is no training. You'd be forgiven for this misreading, as they deliberately mislead early on that their model is (in principle) trainable, but later admit that their actual model is not actually differentiable, but that a differentiable approximation "should" still work (despite no info about what loss function or training data could allow scoring partially correct / incomplete program outputs).
Thanks, but where do they say that? I can only find this instance of "different" (as in "differentiable") in their article:
Because the execution trace is part of the forward pass, the whole process remains differentiable: we can even propagate gradients through the computation itself. That makes this fundamentally different from an external tool. It becomes a trainable computational substrate that can be integrated directly into a larger model.
In the section "Programs into weights & training beyond gradient descent", near the end, they say:
[...] *the compilation machinery we built for generating those weights** can go further. In principle, arbitrary programs can be compiled directly into the transformer weights, bypassing the need to represent them as token sequences at all. [...] [my emphasis]
In the same section, they also continue:
Weights become a deployment target: instead of learning software-like behavior, models contain compiled program logic.
If logic can be compiled into weights, then gradient descent is no longer the only way to modify a model. Weight compilation provides another route for inserting structure, algorithms, and guarantees directly into a network.
So they (almost-invisibly) admit they compile in the weights, but make it clearer this was the whole intention the whole time in later sentences.