
> Testers preferred o3-mini's responses to o1-mini 56% of the time

I hope by this they don't mean me, when I'm asked 'which of these two responses do you prefer'.

They're both 2,000 words, and I asked a question because I have something to do. I'm not reading them both; I'm usually just selecting the one that answered first.

That prompt is pointless. Perhaps as evidenced by the essentially 50/50 preference split: it's a coin flip.



It's kind of strange that they gave that stat. Maybe they thought people would somehow read it as "56% better" or something.

Because when you think about it, it really is quite damning. Minus statistical noise it's no better.


Another way to phrase it is that almost half of users preferred the older model, which is terrible PR.


Not if the goal is to claim that the models deliver comparable quality, but with the new one excelling at something else (here: inference cost).


It's mini to mini; it's the same cost.


Typically in these tests you have three options: "A is better", "B is better", or "they're equal / can't decide". So if 56% prefer o3-mini, it's likely that way less than half prefer o1. Also, the way I understand it, they're comparing a mini model with a large one.


If you use ChatGPT, it sometimes gives you two versions of its response, and you have to choose one or the other if you want to continue prompting. Sure, not picking a response might be a third category. But if that's how they were approaching the analysis, they could have put out a more favorable-looking stat.


> If you use ChatGPT, it sometimes gives you two versions

Does no one else hate it when this happens (especially when on a handheld device)?


That would be a 12% margin; why would you assume that's eaten by statistical noise?


The OP's comment is probably a testament to that. With such a poorly designed A/B test, I doubt this has a p-value of < 0.10.


Erm, why not? A 0.56 result with n=1000 ratings is statistically significantly better than 0.5, with a p-value around 0.0002, well beyond any standard statistical significance threshold I've ever heard of. I don't know how many ratings they collected, but 1000 doesn't seem crazy at all. Assuming of course that raters are blind to which model is which and the order of the 2 responses is randomized with every rating -- or is that what you meant by "poorly designed"? If so, where do they indicate they failed to randomize/blind the raters?
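A quick sanity check on that figure (just a sketch; the 560-of-1000 sample is the hypothetical above, not anything OpenAI published):

  # Two-sided exact binomial test: is 560 wins out of 1000 blind ratings
  # distinguishable from a fair 50/50 coin?
  from scipy.stats import binomtest

  result = binomtest(560, 1000, p=0.5, alternative='two-sided')
  print(result.pvalue)  # roughly 2e-4, far below the usual 0.05 threshold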


  > If so, where do they indicate they failed to randomize/blind the raters?

  Win rate if user is under time constraint

That chart is hard to read, tbh. Is it STEM? Non-STEM? If it's STEM, then this shows there is a bias. If it's non-STEM, then this shows a bias. If it's a mix, well, we can't know anything without understanding the split.

Note that non-STEM is still within error. STEM is less than 2 sigma out, so our confidence still shouldn't be that high.


Because you're not testing "will a user click the left or right button" (for which asking a thousand users to click a button would be a pretty good estimation), you're testing "which response is preferred".

If 10% of people just click based on how fast the response was because they don't want to read both outputs, your p-value for the latter hypothesis will be atrocious, no matter how large the sample is.


Yes, I am assuming they evaluated the models in good faith, understand how to design a basic user study, and therefore when they ran a study intended to compare the response quality between two different models, they showed the raters both fully-formed responses at the same time, regardless of the actual latency of each model.


I would recommend you read the comment that started this thread then, because that's the context we're talking about: https://news.ycombinator.com/item?id=42891294


I did read that comment. I don't think that person is saying they were part of the study that OpenAI used to evaluate the models. They would probably know if they had gotten paid to evaluate LLM responses.

But I'm glad you pointed that out. I now suspect that a large part of the disagreement between the "huh? a statistically significant blind evaluation is a statistically significant blind evaluation" repliers and the "oh, this was obviously a terrible study" repliers is due to different interpretations of that post. Thanks, I genuinely didn't consider the alternative interpretation before.


> If 10% of people just click based on how fast the response was

Couldn't this be considered a form of preference?

Whether it's the type of preference OpenAI was testing for, or the type of preference you care about, is another matter.


Sure, it could be, you can define "preference" as basically anything, but it just loses its meaning if you do that. I think most people would think "56% prefer this product" means "when well-informed, 56% of users would rather have this product than the other".


They even include error bars. It doesn't seem to be statistical noise, but it's still not great.


It’s 3x cheaper and faster


Yeah. I immediately thought: I wonder if that 56% is in one or two categories and the rest are worse?


44% of people prefer the existing model?


Each question falls into a different category (i.e. math, coding, story writing, etc.). Typically models are better at some categories and worse at others. Saying "56% of people preferred responses from o3-mini" makes me wonder if those 56% are only from certain categories and the model isn't uniformly 56% preferred.


With many people too lazy to read 2 walls of text, a lot of picks might be random.


Exactly, I was surprised as well.


Those prompts are so irritating and so frequent that I’ve taken to just quickly picking whichever one looks worse at a cursory glance. I’m paying them, they shouldn’t expect high quality work from me.


Have you considered the possibility that your feedback is used to choose what type of response to give to you specifically in the future?

I would not consider purposely giving inaccurate feedback for this reason alone.


I don't want a model that's customized to my preferences. My preferences and understanding change all the time.

I want a single source model that's grounded in base truth. I'll let the model know how to structure it in my prompt.


You know there's no such thing as base truth here? You want to write something like this to start your prompts: "Respond in English, using standard capitalization and punctuation, following rules of grammar as written by Strunk & White, where numbers are represented using Arabic numerals in base 10 notation...."???


actually, I might appreciate that.

I like precision of language, so maybe just have a system prompt that says "use precise language (ex: no symbolism of any kind)"


A lot of preferences have nothing to do with any truth. Do you like code segments or full code? Do you like paragraphs or bullet points? Heck, do you want English or Japanese?


What is base truth for e.g. creative writing?


Constant meh and fixing prompts in the right direction vs. being unable to escape the bubble.


I think my awareness that this may influence future responses has actually been detrimental to my response rate. The responses are often so similar that I can imagine preferring either in specific circumstances. While I’m sure that can be guided by the prompt, I’m often hesitant to click on a specific response as I can see the value of the other response in a different situation and I don’t want to bias the future responses. Maybe with more specific prompting this wouldn’t be such an issue, or maybe more of an understanding of how inter-chat personalisation is applied (maybe I’m missing some information on this too).


Alternatively, I'll use the tool that is most user friendly and provides the most value for my money.

Wasting time on an anti-pattern isn't valuable, nor is trying to outguess how that selection mechanism is used.


Spotted the pissed off OpenAI RLHF engineer! Hahahahaha!


That's such a counter-productive and frankly dumb thing to do. Just don't vote on them.


You have to pick one to continue the chat.


I know for a fact that as of yesterday I did not have to pick one to continue the conversation. It just maximized the second choice and displayed a 2/2 below the response.


Why not always pick the one on the left, for example? I understand wanting to speed through and not spend time doing labor for OpenAI, but it seems counter-productive to spend any time feeding it false information.


My assumption is they measure the quality of user feedback, either on a per user basis or in an aggregate. I want them to interrupt me less, so I want them to either decide I’m a bad teacher or that users in general are bad teachers.


> I'm usually just selecting the one that answered first

Which is why you randomize the order. You aren’t a tester.

56% vs 44% may not be noise. That’s why we have p values. It depends on sample size.


The order doesn't matter. They often generate tokens at different speeds, and produce different lengths of text. "The one that answered first" != "The first option"


The article says "expert testers."

"Evaluations by expert testers showed that o3-mini produces more accurate and clearer answers, with stronger reasoning abilities, than OpenAI o1-mini. Testers preferred o3-mini's responses to o1-mini 56% of the time and observed a 39% reduction in major errors on difficult real-world questions. W"


Those are two different sentences. The second sentence doesn't refer to experts explicitly.


That makes the result stronger though. Even though many people click randomly, there is still a 12% margin between the two groups. Not huge, but still quite a lot.


Funny - I had ChatGPT document some stuff for me this week and asked which responses I preferred as well.

Didn’t bother reading either of them, just selected one and went on with my day.

If it were me I would have set up a “hey do you mind if we give you two results and you can pick your favorite?” prompt to weed out people like me.


I'm surprised how many people claim to do this. You can just not select one.


I think it’s somewhat natural and am not personally surprised. It’s easy to quickly select an option that has no consequence, compared to actively considering that not selecting anything is also an option. Not selecting something feels more like actively participating than just checking a box and moving on. /shrug


We -- the people who live in front of a computer -- have been training ourselves to avoid noticing annoyances like captchas, advertising, and GDPR notices for quite a long time.

We find what appears to be the easiest combination "Fuck off, go away" buttons and use them without a moment of actual consideration.

(This doesn't mean that it's actually the easiest method.)


I can't even believe how many times in a day I frustratedly think "whatever, go away!"


I wonder if they down-weight responses that come in too fast to be meaningful, or without sufficient scrolling.


That’s fine. Your random click would be balanced by someone else randomly clicking.


Then 56% is even more impressive. Example: if 80% choose randomly and 20% choose carefully, that implies an 80% preference rate for o3-mini among the careful raters (0.2*0.8 + 0.8*0.5 = 0.56).
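A rough back-of-the-envelope version of that mixture model (the 80/20 split is purely an assumed illustration, not a measured number):

  # Assumed mixture: random_frac of raters click 50/50, the rest read carefully.
  observed = 0.56        # reported overall win rate for o3-mini
  random_frac = 0.80     # hypothetical share of raters clicking at random
  careful_frac = 1 - random_frac

  careful_pref = (observed - random_frac * 0.5) / careful_frac
  print(careful_pref)    # ~0.8, i.e. an 80% preference among the careful raters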


Yes, I'd bet most users just 50/50 it, which actually makes it more remarkable that there was a 56% selection rate.


I read the one on the left but choose the shorter one.

The interface wastes so much screen real estate already and the answers are usually overly verbose unless I've given explicit instructions on how to answer.


The default level of verbosity you get without explicitly prompting for it to be succinct makes me think there’s an office full of workers getting paid by the token.


In my experience the verbosity significantly improves output quality


Also, it's not clear if the preference comes from the quality of the 'meat' of the answer, or the way it reports its thinking and the speed with which it responds. With o1, I get a marked feeling of impatience waiting for it to spit something out, and the 'progress of thought' is in faint grey text I can't read. With o3, the 'progress of thought' comes quickly, with more to read, and is more engaging even if I don't actually get anything more than entertainment value.

I'm not going to say there's nothing substantive about o3 vs. o1, but I absolutely do not put it past Sam Altman to juice the stats every chance he gets.


They also pay contractors to do these evaluations with much more detailed metrics; no idea which one their number is based on, though.


Maybe we should take both answers, paste them into a new chat and ask for a summary amalgamation of them


This is just a way to prove, statistically, that one model is better than another as part of its validation. It's not collected from normal people using ChatGPT; you don't ever get shown two responses from different models at once.


Wait what? I get shown this with ChatGPT maybe 5% of the time


Those are both responses from the same model. It's not one response from o1 and another from o3.


People could be flipping a coin and the score would be the same.


A 12% margin is literally the opposite of a coin flip. Unless you have a really bad coin.


You're being downvoted for 3 reasons:

1) Coming off as a jerk, and from a new account, is a bad look

2) "Literally the opposite of a coin flip" would probably be either 0% or 100%

3) Your reasoning doesn't stand up without further info; it entirely depends on the sample size. I could have 5 coin flips all come up heads, but over thousands or millions it averages to 50%. 56% on a small sample size is absolutely within margin of error/noise. 56% on a MASSIVE sample size is _statistically_ significant, but still isn't that much to brag about for something that I feel like they intended to be a big step forward.
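For what it's worth, a quick sketch of how the same 56% split reads at different (made-up) sample sizes, using a plain two-sided binomial test against a fair coin:

  from scipy.stats import binomtest

  for n in (50, 500, 5000):            # hypothetical sample sizes
      k = round(0.56 * n)              # 56% of picks going to one model
      p = binomtest(k, n, p=0.5).pvalue
      print(f"n={n}: {k}/{n} picks, p={p:.4g}")
  # tiny samples: easily noise; large samples: clearly significant, though still a modest margin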


I'm a little puzzled by your response.

1. The message was net-upvoted. Whether there are downvotes in there I can't tell, but the final karma is positive. A similarly spirited message of mine in the same thread was quite well received as well.

2. I can't see how my message would come across as a jerk? I wrote 2 simple sentences, not using any offensive language, stating a mere fact of statistics. Is that being a jerk? And a long-winded berating of a new member of the community isn't?

3. A coin flip is 50%. Anything else is not, once you have a certain sample size. So, this was not. That was my statement. I don't know why you are building a strawman of 5 coin flips. 56% vs 44% is a margin of 12%, as I stated, and with a huge sample size, which they had, that's massive in a space where the returns are deep in "diminishing" territory.


I wasn't expecting my comment to be read so literally, but ok.

We're talking about the most cost-efficient model; the competition here is on price, not on a 12% incremental performance gain (which would make sense for the high-end model).

To my knowledge DeepSeek is the cheaper service, which is what matters on the low end (unless the increase in performance is of such magnitude that the extra charge would be worth the money).


What does deepseek have to do with a comparison between o1-mini and o3-mini?


I'm not sure I follow - your assertion was that 12% is significant.

I personally choose based on price for a low-cost model (unless the improvement is so significant that it justifies the higher price).


I don't think they make it clear: I wonder if they mean testers prefer o3 mini 56% of the time when they express an opinion, or overall? Some percentage of people don't choose; if that number is 10% and they aren't excluded, that means 56% of the time people prefer o3 mini, 34% of the time people prefer o1 mini, and 10% of the time people don't choose. I'm not sure I think it would be reasonable to present the data that way, but it seems possible.


This prompt is like "See Attendant" on the gas pump. I'm just going to use another AI instead for this chat.


Glad to know I’m not the only person who just drives to the next station when I see a “see attendant” message.


I almost always pick the second one, because it's closer to the submit button and the one I read first.


It seems like the first response must get chosen a majority of the time just due to friction.


I too have questioned the approach of showing the long side-by-side answers from two different models.

1) sometimes I wanted the short answer, and so even though the long answer is better I picked the short one.

2) sometimes both contain code that is different enough that I am inclined to go with the one that is more similar to what I already had, even if the other approach seems a bit more solid.

3) Sometimes one will have less detail but more big picture awareness and the other will have excellent detail but miss some overarching point that is valuable. Depending on my mood I sometimes choose but it is annoying to have to do so because I am not allowed to say why I made the choice.

The area of human training methodology seems to be a big part of what got DeepSeek's model so strong. I read the explanation of the test results as an acknowledgement by OpenAI of some weaknesses in its human feedback paradigm.

IMO the way it should work is that the thumbs up or down should be read in context by a reasoning being and a more in-depth training case should be developed that helps future models learn whatever insight the feedback should have triggered.

Feedback that A is better or worse than B is definitely not (in my view) sufficient except in cases where a response is a total dud. Usually the responses have different strengths and weaknesses and it's pretty subjective which one is better.


i enjoy it, i like getting two answers for free - often one of them is significantly better and probably the newer model


RLUHF, U = useless.


You know you can configure default instructions to your prompts, right?

I have something like “always be terse and blunt with your answers.”



