The python zoom in seems performative. A vision model already has access to all ...

Legend2440 · 2025-04-26T18:43:37 1745693017

Vision models are typically bad at small details. If there’s too much stuff going on at once, they can’t focus on the entire image.

simonw · 2025-04-26T17:59:15 1745690355

Yeah, I'm a little unconvinced by that. My best guess there is that the vision input has quite a restricted resolution and "zooming in" (really, cropping to an area) lets it get more information about the region of the photo because it's not as "fuzzy". Just a hunch though.

energy123 · 2025-04-26T17:58:55 1745690335

Yeah, once it gets converted into tokens how does "zooming in" somehow increase information content?

nutrientharvest · 2025-04-26T19:30:14 1745695814

It's cropping the original image then tokenizing it again with less downsampling, not cropping its internal representation.