> Testers preferred o3-mini's responses to o1-mini 56% of the time
I hope by this they don't mean me, when I'm asked 'which of these two responses do you prefer'.
They're both 2,000 words, and I asked a question because I have something to do. I'm not reading them both; I'm usually just selecting the one that answered first.
That prompt is pointless. Perhaps as evidenced by the essentially 50/50 result: it's a coin-flip.
Typically in these tests you have three options: "A is better", "B is better", or "they're equal / can't decide". So if 56% prefer o3-mini, it's likely that well under half prefer o1-mini. Also, the way I understand it, they're comparing a mini model with a large one.
If you use ChatGPT, it sometimes gives you two versions of its response, and you have to choose one or the other if you want to continue prompting. Sure, not picking a response might be a third category. But if that's how they were approaching the analysis, they could have put out a more favorable-looking stat.
Erm, why not? A 0.56 result with n=1000 ratings is statistically significantly better than 0.5, with a one-sided p-value of roughly 0.0001, well beyond any standard statistical significance threshold I've ever heard of. I don't know how many ratings they collected, but 1000 doesn't seem crazy at all. Assuming, of course, that raters are blind to which model is which and the order of the two responses is randomized with every rating -- or is that what you meant by "poorly designed"? If so, where do they indicate they failed to randomize/blind the raters?
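A quick sanity check of that claim with an exact binomial test (the 1000-rating sample and the 560/440 split are my hypothetical numbers, not OpenAI's):

```python
from math import comb

def binom_tail(n, k):
    """P(X >= k) for X ~ Binomial(n, 0.5): the chance of a split at
    least this lopsided if raters truly had no preference."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Hypothetical numbers: 1000 ratings, 560 of which preferred o3-mini.
p = binom_tail(1000, 560)
print(f"one-sided p = {p:.5f}")
```

Under those assumptions the tail probability comes out around 10⁻⁴, comfortably below the usual 0.05 threshold, so the "it's just a coin flip" reading requires either a tiny sample or a broken study design, not the 56% figure itself.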
> If so, where do they indicate they failed to randomize/blind the raters?
> Win rate if user is under time constraint
This is hard to read, to be honest. Is it STEM? Non-STEM? Either way it suggests a bias, and if it's a mix, we can't know anything without understanding the split.
Note that Non-STEM is still within error, and STEM is less than a 2-sigma deviation, so our confidence still shouldn't be that high.
Because you're not testing "will a user click the left or right button" (for which asking a thousand users to click a button would be a pretty good estimation), you're testing "which response is preferred".
If 10% of people just click based on how fast the response was because they don't want to read both outputs, your p-value for the latter hypothesis will be atrocious, no matter how large the sample is.
Yes, I am assuming they evaluated the models in good faith, understand how to design a basic user study, and therefore when they ran a study intended to compare the response quality between two different models, they showed the raters both fully-formed responses at the same time, regardless of the actual latency of each model.
I did read that comment. I don't think that person is saying they were part of the study that OpenAI used to evaluate the models. They would probably know if they had gotten paid to evaluate LLM responses.
But I'm glad you pointed that out. I now suspect that a large part of the disagreement between the "huh? a statistically significant blind evaluation is a statistically significant blind evaluation" repliers and the "oh, this was obviously a terrible study" repliers is due to different interpretations of that post. Thanks. I genuinely didn't consider the alternative interpretation before.
Sure, it could be, you can define "preference" as basically anything, but it just loses its meaning if you do that. I think most people would think "56% prefer this product" means "when well-informed, 56% of users would rather have this product than the other".
Each question falls into a different category (ie math, coding, story writing etc). Typically models are better at some categories and worse at others. Saying "56% of people preferred responses from o3-mini" makes me wonder if those 56 are only from certain categories and the model isn't uniformly 56% preferred.
Those prompts are so irritating and so frequent that I’ve taken to just quickly picking whichever one looks worse at a cursory glance. I’m paying them, they shouldn’t expect high quality work from me.
You know there's no such thing as a base truth here? Do you want to start your prompts with something like, "Respond in English, using standard capitalization and punctuation, following the rules of grammar as written by Strunk & White, where numbers are represented using Arabic numerals in base-10 notation..."???
A lot of preferences have nothing to do with any truth. Do you like code segments or full code? Do you like paragraphs or bullet points? Heck, do you want English or Japanese?
I think my awareness that this may influence future responses has actually been detrimental to my response rate. The responses are often so similar that I can imagine preferring either in specific circumstances. While I’m sure that can be guided by the prompt, I’m often hesitant to click on a specific response as I can see the value of the other response in a different situation and I don’t want to bias the future responses. Maybe with more specific prompting this wouldn’t be such an issue, or maybe more of an understanding of how inter-chat personalisation is applied (maybe I’m missing some information on this too).
I know for a fact that as of yesterday I did not have to pick one to continue the conversation. It just maximized the second choice and displayed a "2/2" below the response.
Why not always pick the one on the left, for example? I understand wanting to speed through and not spend time doing labor for OpenAI, but it seems counter-productive to spend any time feeding it false information.
My assumption is they measure the quality of user feedback, either on a per user basis or in an aggregate. I want them to interrupt me less, so I want them to either decide I’m a bad teacher or that users in general are bad teachers.
The order doesn't matter. They often generate tokens at different speeds, and produce different lengths of text. "The one that answered first" != "The first option"
"Evaluations by expert testers showed that o3-mini produces more accurate and clearer answers, with stronger reasoning abilities, than OpenAI o1-mini. Testers preferred o3-mini's responses to o1-mini 56% of the time and observed a 39% reduction in major errors on difficult real-world questions."
That makes the result stronger, though. Even though many people click randomly, there is still a 12-point margin between the two groups. Not earth-shattering, but still quite a lot.
I think it’s somewhat natural and am not personally surprised. It’s easy to quickly select an option, that has no consequence, compared to actively considering that not selecting something is an option. Not selecting something feels more like actively participating than just checking a box and moving on. /shrug
We -- the people who live in front of a computer -- have been training ourselves to avoid noticing annoyances like captchas, advertising, and GDPR notices for quite a long time.
We find what appears to be the easiest combination of "Fuck off, go away" buttons and use them without a moment of actual consideration.
(This doesn't mean that it's actually the easiest method.)
Then 56% is even more impressive. Example: if 80% choose randomly and 20% choose carefully, that implies an 80% preference rate for o3-mini (0.8*0.2 + 0.5*0.8 = 0.56)
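That mixture arithmetic can be inverted in general: given an observed preference rate and an assumed fraction of random clickers (both numbers here are hypothetical), you can back out what the careful raters actually preferred.

```python
def careful_preference(observed, random_frac):
    """Solve observed = random_frac * 0.5 + (1 - random_frac) * p
    for p, the preference rate among the careful raters."""
    return (observed - 0.5 * random_frac) / (1 - random_frac)

# The 80%-random example from above: careful raters would have had
# to prefer o3-mini 80% of the time to produce an observed 56%.
print(round(careful_preference(0.56, 0.8), 3))
```

The more random clicking you assume, the more extreme the implied preference among the remaining raters, which is exactly the parent's point.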
I read the one on the left but choose the shorter one.
The interface wastes so much screen real estate already and the answers are usually overly verbose unless I've given explicit instructions on how to answer.
The default level of verbosity you get without explicitly prompting for it to be succinct makes me think there’s an office full of workers getting paid by the token.
Also, it's not clear if the preference comes from the quality of the 'meat' of the answer, or the way it reports its thinking and the speed with which it responds. With o1, I get a marked feeling of impatience waiting for it to spit something out, and the 'progress of thought' is in faint grey text I can't read. With o3, the 'progress of thought' comes quickly, with more to read, and is more engaging even if I don't actually get anything more than entertainment value.
I'm not going to say there's nothing substantive about o3 vs. o1, but I absolutely do not put it past Sam Altman to juice the stats every chance he gets.
This is just a way to prove, statistically, that one model is better than another as part of its validation. It's not collected from normal people using ChatGPT, you don't ever get shown two responses from different models at once.
1) Coming off as a jerk, and from a new account is a bad look
2) "Literally the opposite of a coin flip" would probably be either 0% or 100%
3) Your reasoning doesn't stand up without further info; it entirely depends on the sample size. I could have 5 coin flips all come up heads, but over thousands or millions it averages to 50%. 56% on a small sample size is absolutely within the margin of error/noise. 56% on a MASSIVE sample size is _statistically_ significant, but still isn't that much to brag about for something they probably intended to be a big step forward.
1. The message was net-upvoted. Whether there are downvotes in there I can't tell, but the final karma is positive. A similarly spirited message of mine in the same thread was quite well received as well.
2. I can't see how my message would come across as a jerk? I wrote 2 simple sentences, not using any offensive language, stating a mere fact of statistics. Is that being a jerk? And a long-winded berating of a new member of the community isn't?
3. A coin flip is 50%. Anything else is not, once you have a certain sample size. So, this was not. That was my statement. I don't know why you are building a strawman of 5 coin flips. 56% vs 44% is a margin of 12%, as I stated, and with a huge sample size, which they had, that's massive in a space where the returns are deep in "diminishing" territory.
I wasn't expecting my comment to be read so literally, but ok.
We're talking about the most cost-efficient model; the competition here is on price, not on a 12-point incremental performance gain (which would make sense for the high-end model).
To my knowledge DeepSeek is the cheaper service, which is what matters on the low end (unless the increase in performance were of such magnitude that the extra charge would be worth the money).
I don't think they make it clear: do they mean testers preferred o3-mini 56% of the time when they expressed an opinion, or overall? Some percentage of people don't choose; if that number is 10% and they aren't excluded, that means 56% of the time people preferred o3-mini, 34% of the time people preferred o1-mini, and 10% of the time people didn't choose. I'm not sure it would be reasonable to present the data that way, but it seems possible.
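For what it's worth, under that reading (all numbers hypothetical), excluding the non-choosers would have produced the more favorable-looking figure, so reporting 56% including them would actually be the conservative framing:

```python
prefer_o3, prefer_o1, no_choice = 0.56, 0.34, 0.10

# Renormalize over only the raters who expressed a preference.
excluded = prefer_o3 / (prefer_o3 + prefer_o1)
print(round(excluded, 3))  # ~0.622: a 62% win rate instead of 56%
```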
I too have questioned the approach of showing the long side-by-side answers from two different models.
1) sometimes I wanted the short answer, so even though the long answer was better I picked the short one.
2) sometimes both contain code that is different enough that I am inclined to go with the one that is more similar to what I already had, even if the other approach seems a bit more solid.
3) Sometimes one will have less detail but more big picture awareness and the other will have excellent detail but miss some overarching point that is valuable. Depending on my mood I sometimes choose but it is annoying to have to do so because I am not allowed to say why I made the choice.
The area of human training methodology seems to be a big part of what got DeepSeek's model so strong. I read the explanation of the test results as an acknowledgement by OpenAI of some weaknesses in its human feedback paradigm.
IMO the way it should work is that the thumbs up or down should be read in context by a reasoning being and a more in-depth training case should be developed that helps future models learn whatever insight the feedback should have triggered.
Feedback that A is better or worse than B is definitely not (in my view) sufficient except in cases where a response is a total dud. Usually the responses have different strengths and weaknesses and it's pretty subjective which one is better.