> I highly doubt you are getting novel, high quality data.
Why wouldn't you? Presumably the end user would try their use case on the existing model, and if it performs well, wouldn't bother with the expense of setting up an RL environment specific to their task.
If it doesn't perform well, they do bother, and they have all the incentive in the world to get the verifier right -- which is not an extraordinarily sophisticated task if you're only using rules-based outcome rewards (as R1 and R1-Zero do).
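For concreteness, a rules-based outcome reward is often just deterministic string or number matching against a known-correct answer. Here's a minimal sketch of what such a verifier might look like; the function name, the \boxed{...} answer convention, and the exact-match rule are all illustrative assumptions, not R1's actual implementation:

```python
import re

def outcome_reward(completion: str, gold_answer: str) -> float:
    """Rules-based outcome reward: 1.0 if the model's final answer
    matches the known-correct one, else 0.0. Purely deterministic --
    no learned reward model involved."""
    # Assumes the model was prompted to emit its answer as \boxed{...};
    # that convention is an illustrative assumption, not R1's spec.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no parseable final answer -> no reward
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

# A rollout earns reward only if its final answer is exactly right:
print(outcome_reward(r"... so the total is \boxed{42}", "42"))  # 1.0
print(outcome_reward(r"... so the total is \boxed{41}", "42"))  # 0.0
```

Writing a checker like this for your own task is cheap compared to labeling preference data or training a learned reward model, which is the point: the barrier to contributing a well-verified task is low.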