And these models' architectures keep changing over time, so I can't tell whether they're "hallucinating" when they say they can do something. Some multimodal models are entirely token-based, with the transformer operating on image and audio tokens as well as text, and some are entirely isolated systems glued together.

You can't know unless you know that specific model's architecture, and I'm not at all up-to-date on which of OpenAI's models are now text-token only and which are multimodal.