It's wild that people post papers that they haven't read or don't understand because the headline supports some view they have.
To wit, in your first link it seems the figure is just showing the trivial fact that the model is trained on the MMLU dataset (and after RLHF it is no longer optimized for that). The second link's main claim seems to be contradicted by their Figure 12 left panel, which shows ~0 correlation between model-predicted and actual truth.
I'm not going to bother going through the rest.
I don't yet understand exactly what they are doing in the OP's article but I suspect it also suffers from serious problems.
>The second link main claim seems to be contradicted by their Figure 12 left panel which shows ~0 correlation between model-predicted and actual truth.
The claim in the abstract is:
"""We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format.
Next, we investigate whether models can be trained to
predict "P(IK)", the probability that "I know" the answer to a question, without reference
to any particular proposed answer. Models perform well at predicting P(IK) and partially
generalize across tasks, though they struggle with calibration of P(IK) on new tasks."""
The plot is much denser at the origin and the top right. How is that 0 correlation? Depending on the size of their held-out test set, that could even be a pretty strong correlation.
And how does that contradict the claims they've made, especially on calibration (Figure 13, bottom)?
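To make that concrete with synthetic numbers (illustrative data only, nothing from the paper): points piled up near the two extremes produce a high Pearson r on their own, even if the guess is unrelated to the truth within each pile.

```python
import numpy as np

rng = np.random.default_rng(0)
# Half the points pile up near (0, 0) and half near (1, 1); within each
# pile the "guess" is drawn independently of the "truth".
lo = rng.uniform(0.0, 0.3, size=(500, 2))
hi = rng.uniform(0.7, 1.0, size=(500, 2))
truth, guess = np.vstack([lo, hi]).T

r = np.corrcoef(truth, guess)[0, 1]
print(r > 0.8)  # the two dense clusters alone drive a strong correlation
```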
Figure 13 right panel also shows there isn't a y=x relationship on out-of-sample tests.
First, we agree by observation that outside of the top-right and bottom-left corners there isn't any meaningful relationship in the data, regardless of the numerical value of the correlation. Second, in those corners it is not clear to me what the relationship is, but it looks flattish (i.e. if the ground truth is ~0, the model's guess for truth could be anywhere from 0 to 0.5). This is also consistent with the general behavior displayed in Figure 13.
If you have some other interpretation of the data you should lay it out. The authors certainly did not do that.
edit:
By the way, there are people working on a re-sampling algorithm called entropix, based on the entropy and variance of the output logits: if the output probabilities for the next token are spread evenly, for example (rather than concentrating overwhelming probability on a single token), they prompt for additional clarification. They don't really claim anything like the model "knows" whether it's wrong, but they say it improves performance.
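Roughly, the gating idea can be sketched like this (a toy illustration with a made-up threshold, not entropix's actual code):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_resample(probs, entropy_threshold=1.5):
    """Flag a 'spread out' distribution: no token dominates, so the
    sampler might branch or ask for clarification instead of committing.
    The threshold is illustrative only."""
    return entropy(probs) > entropy_threshold

# Peaked distribution: the model is 'confident' about the next token.
print(should_resample([0.9, 0.05, 0.03, 0.02]))    # False
# Flat distribution: probability is spread across many tokens.
print(should_resample([0.2, 0.2, 0.2, 0.2, 0.2]))  # True
```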
>Figure 13 right panel also shows there isn't a y=x relationship on out-of-sample tests.
A y=x relationship is not necessary for meaningful correlation and the abstract is quite clear on out of sample performance either way.
>Second, in those corners it is not clear to me what the relationship is but it looks flattish (i.e. if the ground truth is ~0 then the model-guess-for-truth could be anywhere from 0 to 0.5).
The upper bound for the guess-for-truth is not as important as the frequency. Yes, it could guess 0.5 for 0, but how often, compared to reasonable numbers? A test set on TriviaQA could well be thousands of questions.
>edit: By the way there are people working on a re-sampling algorithm based on the entropy and variance of the output logits called entropix
I know about entropix. It hinges strongly on the model's representations. If it works, then choosing to call it "knowing" or not is just semantics.
> A y=x relationship is not necessary for meaningful correlation
I'm not concerned with correlation (which may or may not indicate an actual relationship) per se; I'm concerned with whether there is a meaningful relationship between predicted and actual. The Figure 12 plot clearly shows that predicted isn't tracking actual even in the corners. I think one of the lines of Figure 13 right (predicting 0% when actual is around 40%, going from memory on my phone) even more clearly shows there isn't a meaningful relationship. In any case the authors haven't made any argument about how those plots support their claims, and I don't think you can either.
> the abstract is quite clear on out of sample performance either way.
Yes I’m saying the abstract is not supported by the results. You might as well say the title is very clear.
> The upper bound for guess-for-truth is not as important as the frequency. Yes it could guess 0.5 for 0 but how often compared to reasonable numbers? A test set on TriviaQA could well be thousands of questions.
Now we’ve gone from “the paper shows” to speculating about what the paper might have shown (and even that is probably not possible based on the Figure 13 line I described above)
> choosing to call it "knowing" or not is just semantics.
Yes, it's semantics, but that implies the term adds nothing over the actual underlying properties.
For the red Lambada line in Fig 13 when the model predicts ~0 the ground truth is 0.7. No one can look at that line and say there is a meaningful relationship. The Py Func Synthesis line also doesn't look good above 0.3-0.4.
> The abstract also quite literally states that models struggle with out of distribution tests so again, what is the contradiction here ?
Out of distribution is the only test that matters. If it doesn't work out of distribution it doesn't work. Surely you know that.
> Would it have been hard to simply say you found the results unconvincing?
Anyone can look at the graphs, especially Figure 13, and see this isn't a matter of opinion.
> There is nothing contradictory in the paper.
The results contradict the titular claim that "Language Models (Mostly) Know What They Know".
>For the red Lambada line in Fig 13 when the model predicts ~0 the ground truth is 0.7. No one can look at that line and say there is a meaningful relationship. The Py Func Synthesis line also doesn't look good above 0.3-0.4.
Yeah but Lambada is not the only line there.
>Out of distribution is the only test that matters. If it doesn't work out of distribution it doesn't work. Surely you know that.
Train the classifier on math questions and get good calibration for math; train the classifier on true/false questions and get good calibration for true/false; train the classifier on math but struggle with true/false (and vice versa). This is what "out-of-distribution" refers to here.
Make no mistake, the fact that both the first two work is evidence that models encode some knowledge about the truthfulness of their responses. If they didn't, it wouldn't work at all. Statistics is not magic and gradient descent won't bring order where there is none.
What out-of-distribution "failure" here indicates is that "truth" is multifaceted and situation-dependent, and interpreting the model's features is very difficult. You can't train a "general LLM lie detector", but that doesn't mean model features are unable to provide insight into whether a response is true or not.
> Well good thing Lambada is not the only line there.
There are 3 out-of-distribution lines, all of them bad. I explicitly described two of them. Moreover, it seems like the worst time for your uncertainty indicator to silently fail is when you are out of distribution.
But okay, forget about out-of-distribution and go back to Figure 12 which is in-distribution. What relationship are you supposed to take away from the left panel? From what I understand they were trying to train a y=x relationship but as I said previously the plot doesn't show that.
An even bigger problem might be the way the "ground truth" probability is calculated: they sample the model 30 times and take the percentage of correct results as the ground-truth probability. But it's really fishy to call something "ground truth" when it is partly an internal property of the model's sampler and not an objective/external fact. I don't have more time to think about this, but something is off about it.
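For concreteness, my understanding of the procedure looks roughly like this (function and variable names are my own, not the paper's):

```python
import random

def empirical_p_true(model_sample, question, reference_answer, n=30):
    """Estimate the 'ground truth' P(True) for a question by sampling
    the model n times and counting matches against the reference answer.
    Note the estimate depends on the sampler (temperature, seed), not
    only on external fact."""
    correct = sum(
        model_sample(question) == reference_answer for _ in range(n)
    )
    return correct / n

# Toy stand-in model: answers correctly about 70% of the time.
rng = random.Random(0)
toy_model = lambda q: "Paris" if rng.random() < 0.7 else "Lyon"
p = empirical_p_true(toy_model, "Capital of France?", "Paris")
print(0.0 <= p <= 1.0)  # the estimate is a sampler-dependent frequency
```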
All this to say that reading long scientific papers is difficult and time-consuming and let's be honest, you were not posting these links because you've spent hours poring over these papers and understood them, you posted them because the headlines support a world-view you like. As someone else noted you can find good papers that have opposite-concluding headlines (like the work of rao2z).
>It's wild that people post papers that they haven't read or don't understand because the headline supports some view they have.
It's related research either way. And I did read them. I think there are probably issues with the methodology of the fourth, but it's there anyway because it's related, interesting research that is not without merit.
>The second link main claim seems to be contradicted by their Figure 12 left panel which shows ~0 correlation between model-predicted and actual truth.
The panel is pretty weak on correlation, but it's quite clearly not the only thing that supports that particular claim, nor does it contradict it.
>I'm not going to bother going through the rest.
Ok? That's fine
>I don't yet understand exactly what they are doing in the OP's article but I suspect it also suffers from serious problems.
> The panel is pretty weak on correlation but it's quite clearly also not the only thing that supports that particular claim neither does it contradict it.
It very clearly contradicts it: There is no correlation between the predicted truth value and the actual truth value. That is the essence of the claim. If you had read and understood the paper you would be able to specifically detail why that isn't so rather than say vaguely that "it is not the only thing that supports that particular claim".
To be fair, I'm not sure the people writing papers understand what they're writing either. Much of the ML community seems to have fully embraced the "black box" nature rather than seeing it as something to overcome. I routinely hear both readers and writers tout that you don't need much math. And yet mistakes and misunderstandings are commonplace, and they're right, they don't need much math. How much do you need to understand the difference between entropy and perplexity? Is that more or less than what's required to know the difference between probability and likelihood? I would hope we could at least get to a level where we understand the linear nature of PCA.
I'm not so sure that's the reason. I'm in the field, and trust me, I'm VERY frustrated[0]. But isn't the saying to not attribute to malice what can be attributed to stupidity? I think the problem is that they're blinded by the hype but don't use the passion to drive deeper understanding. It's a belief that the black box can't be opened, so why bother?
I think it comes from the ad hoc nature of evaluation in young fields. It's like you need an elephant but obviously you can't afford one, so you put a dog in an elephant costume and call it an elephant, just to head in the right direction. It takes a long time to get that working, and progress can still be made by upgrading the dog costume. But at some point people forget that we need an elephant, so everyone is focused on the intricacies of the costume and some will try dressing up the "elephant" as another animal. Eventually the dog costume isn't "good enough" and leads us in the wrong direction. I think that's where we are now.
I mean, do we really think we can measure language with entropy? Fidelity and coherence with FID? We have no mathematical description of language, artistic value, aesthetics, and so on. The biggest improvement has been RLHF, where we just use Justice Potter Stewart's metric: "I know it when I see it."
I don't think it's malice. I think it's just easy to lose sight of the original goal. ML certainly isn't the only field to have done this, but it's also hard to bring rigor in, and I think the hype makes it harder. Frankly, I think we still aren't ready for a real elephant yet, but I'd just be happy if we openly acknowledged the difference between a dog in a costume proxying as an elephant and an actual fucking elephant.
[0] seriously, how do we live in a world where I have to explain what covariance means to people publishing works on diffusion models and working for top companies or at top universities‽
>If you had read and understood the paper you would be able to specifically detail why that isn't so rather than say vaguely that "it is not the only thing that supports that particular claim".
Not every internet conversation need end in a big debate. You've been pretty rude and I'd just rather not bother.
You also seem to have a lot to say on how much people actually read papers, but your first response took like 5 minutes. I'm sorry, but you can't say you've read even one of those in that time. Why would I engage with someone being intellectually dishonest?
> I guess i understand seeing as you couldn't have read the paper in the 5 minutes it took for your response.
You've posted the papers multiple times over the last few months, so no, I did not read them in the last five minutes, though you could in fact find both of the very basic problems I cited in that amount of time.
Because it's pointless to reply to a comment days after it was made or after engagement with the post has died down. All of this is a convenient misdirection for not having read and understood the papers you keep posting because you like the headlines.
> you can't say you've read even one of those in that time.
I'm not sure if you're aware, but most of those papers are well known. All the arxiv papers are from 2022 or 2023. So I think your 5 minutes is pretty far off. I for one have spent hours, but the majority of that was prior to this comment.
You're claiming intellectual dishonesty too soon.
That said, @foobarqux, I think you could expand on your point more to clarify. @og_kalu, focus on the topic and claims (even if not obvious) rather than the time
>I'm not sure if you're aware, but most of those papers are well known. All the arxiv papers are from 2022 or 2023. So I think your 5 minutes is pretty far off. I for one have spent hours, but the majority of that was prior to this comment.
You're claiming intellectual dishonesty too soon.
Fair enough. With the "I'm not going to bother with the rest," it seemed like a now thing.
>focus on the topic and claims (even if not obvious) rather than the time
I should have just done that yes. 0 correlation is obviously false with how much denser the plot is at the extremes and depending on how many questions are in the test set, it could even be pretty strong.
> 0 correlation is obviously false with how much denser the plot is at the extremes and depending on how many questions are in the test set, it could even be pretty strong.
I took it as hyperbole. And honestly I don't find that plot, or much of the paper, convincing. Though I have a general frustration in that many researchers (especially in NLP) seem to willfully not look for data spoilage. I know they do deduplication, but I do question how many try to vet this by manual inspection. Sure, you can't inspect everything, but we have statistics for that. And any inspection I've done leaves me very unconvinced that there is no spoilage. There's quite a lot in most datasets I've seen, which can hugely change the interpretation of results. After all, we're elephant fitting.
I explicitly wrote "~0", and anyone who looks at that graph can say that there is no relationship at all in the data, except possibly at the extremes, where it doesn't matter that much (it "knows" sure things) and I'm not even sure of that. One of the reasons to plot data is so that this type of thing jumps out at you and you aren't misled by some statistic.
They just posted a list of articles, and said that they were related. What view do you think they have, that these papers support? They haven’t expressed a view as far as I can see…
Maybe you’ve inferred some view based on the names of the titles, but in that case you seem to be falling afoul of your own complaint?
Much like you can search the internet until you find a source that agrees with you, you can select a set of papers that "confirm" a particular viewpoint, especially in developing fields of research. In this case, the selected papers all support the view LLMs "know what they know" on some internal level, which iiuc is not (yet?) a consensus viewpoint (from my outsider perspective). But from the list alone, you might get that impression.
If you have discussed those things previously with the poster, I don't agree. If you were to go digging through their history only to respond to the current comment, that's more debatable. But, we're supposed to assume good faith here on HN, so I would take the first explanation.
In this case the poster seems to have projected opinions onto a post where none were expressed. That seems problematic regardless of how they came to associate the opinions with their respondent. Maybe the poster they responded to still holds the projected opinions, perhaps that poster abandoned the projected opinions, or perhaps they thought the projected opinions distracting and consequently chose not to share them.
If I am wrong or not useful in my posts, I would hope to be allowed to remove what was wrong and/or not useful without losing my standing to share the accurate, useful things. Anything else seems like residual punishment outside the appropriate context.
When I see a post I strongly disagree with, I tend to check out the poster's history: it's often quite illuminating to be confronted with completely different viewpoints, and also realize I agree to other posts of the same person.
GPT-4 logits calibration pre RLHF - https://imgur.com/a/3gYel9r
Language Models (Mostly) Know What They Know - https://arxiv.org/abs/2207.05221
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets - https://arxiv.org/abs/2310.06824
The Internal State of an LLM Knows When It's Lying - https://arxiv.org/abs/2304.13734
LLMs Know More Than What They Say - https://arjunbansal.substack.com/p/llms-know-more-than-what-...
Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback - https://arxiv.org/abs/2305.14975
Teaching Models to Express Their Uncertainty in Words - https://arxiv.org/abs/2205.14334