They also said this is not part of GPT-5, and “will be released later”. It’s very, very likely a model specifically fine-tuned for this benchmark, where afterwards they’ll evaluate what actual real-world problems it’s good at (eg like “use o4-mini-high for coding”).
Their hardware isn't fine tuned to it though, it uses the same general intelligence hardware that all other humans use.
So its a big difference if you use a general intelligence system and makes it do well in math, or when you create a specialized system that is only good at math and can't be used to get good in other areas.