a text LLM isn't going to learn by trial and error, it's not been given that sort of freedom. RLHF would be the llm version of trial and error - but it's like the chef is only allowed to do that for a few days after years of chef school and from then on, he has to stick to what he has already learnt.
Pre-training is based on a proxy for desired output not actually desired output. It’s not in the form of responses to a prompt, and 1:1 reproducing copyrighted works in production would be bad.
It’s the difference between a painter copying some work and a painter making an original piece and then get feedback on it. We consider the second trial and error because the full process is being tested not just technique.
a chef doesn't get feedback on his meal after picking up the spoon. he gets feedback when he or somebody else tastes the meal part way through and at the end.
This is exemplified by how altitude has a meaningful impact but isn’t discussed for a given recipe.