> Sometimes that's presented as deception/misalignment but that's a category error: "find the answer" and "explain your reasoning" are two distinct tasks
Right, but if your answer to "explain your reasoning" is not a true representation of your reasoning, then you are being deceptive. If the model doesn't "know" its reasoning, then the honest answer is that it doesn't know.
(To head off any meta-commentary on humans' inability to explain their own reasoning: a human would at least be able to honestly say whether they used EXIF data or actual semantic knowledge of the photograph.)
My point is that dishonesty/misalignment doesn't make sense for o3, which is not capable of being honest because it's not capable of understanding what words mean. It's like saying a monkey at a typewriter is being dishonest if it happens to write a falsehood.
No, I think a non-sentient AI that is much more advanced than GPT could lie - I never said sentience, and the example I gave involved a monkey, which is sentient. The problem is that transformer ANNs themselves are too stupid to lie.
In 2023 OpenAI co-authored an excellent paper on LLMs disseminating conspiracy theories - sorry, don't have the link handy. But one result stuck with me: if you train a bidirectional transformer LLM on data where half the information about 9/11 is honest and half is conspiracy theories, it has a 50-50 chance of telling you one or the other when you ask about 9/11. It is not smart enough to tell that there is an inconsistency. This extends to reasoning traces vs. its "explanations": it does not understand its own reasoning steps, and it is not smart enough to notice when the explanation is inconsistent with them.