The fact that they give these models low-res photos but don't provide built-in tools for querying more detail feels suboptimal. Executing Python to crop an image is clever on the model's part and a facepalm on the implementation side.
No, the LLM can only "see" a lower-res version of the uploaded photo. It has to crop to process finer details, and they're suggesting it's silly this isn't a built-in feature and instead relies on Python to do it.
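For what it's worth, the crop-and-reinspect step being described is trivial, which is part of why it feels silly that it isn't built in. A rough sketch of what the model does with its Python tool (the image sizes and crop coordinates here are illustrative assumptions, not anything from the actual implementation):

```python
from PIL import Image

# Stand-in for the user's full-resolution upload (real case: the photo file).
img = Image.new("RGB", (4032, 3024), "gray")

# By default the model only "sees" something like a downscaled preview.
preview = img.resize((1008, 756))

# To read fine detail, it crops a region of interest from the original
# so that region is re-ingested at native resolution.
region = img.crop((1200, 800, 1800, 1400))  # (left, upper, right, lower)
```

The point of the complaint is that this two-line crop is something the harness could expose as a native "zoom" tool rather than leaving the model to rediscover it via a code-execution sandbox.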