
Are you referring to FrontierMath?

We had access to the eval data (since we funded it), but we didn't train on the data or otherwise cheat. We didn't even look at the eval results until after the model had been trained and selected.



No one believes you.


If you don't believe me, that's fair enough. Some pieces of evidence that might update you or others:

- a member of the team who worked on this eval has left OpenAI and now works at a competitor; if we had cheated, he would have every incentive to blow the whistle

- cheating on evals is fairly easy to catch and risks destroying employee morale, customer trust, and investor appetite; even if you're evil, the cost-benefit of cheating on a niche math eval doesn't really pencil out

- Epoch built a private held-out set (albeit with a different difficulty profile); OpenAI's performance on that set doesn't suggest any cheating or overfitting (a rough sketch of this holdout check follows the list)

- Gemini and Claude have since achieved similar scores, suggesting that scoring ~40% is not, by itself, evidence of cheating; it's consistent with what the private set shows

- the vast majority of evals are open-source (e.g., SWE-bench Pro Public), and OpenAI, along with everyone else, has access to their problems and thus the opportunity to cheat, so FrontierMath isn't even unique in that respect
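
For readers unfamiliar with the holdout argument above: the standard contamination check is to compare a model's score on the public problems against its score on a private set it could not have seen during training; a model that memorized the public problems would collapse on the holdout. Here's a minimal sketch in Python. All names and numbers are purely illustrative, not Epoch's actual methodology:

    # Illustrative contamination check: compare accuracy on the public split
    # against accuracy on a private held-out split of similar difficulty.
    # The function name and all numbers below are hypothetical.

    def contamination_gap(public_acc: float, holdout_acc: float) -> float:
        """Accuracy drop from the public problems to the held-out problems."""
        return public_acc - holdout_acc

    # A model that genuinely solves the problems scores similarly on both splits:
    print(contamination_gap(0.40, 0.38))  # ~0.02 -> consistent with no overfitting

    # A model trained on the public problems collapses on the holdout:
    print(contamination_gap(0.95, 0.30))  # 0.65 -> red flag for contamination

The point of the argument is that no such gap appeared: the scores on the public and held-out problems were in line with each other.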



