
The Python zoom-in seems performative. A vision model already has access to all the data, so how does zooming in help it? Still very cool that it can!



Vision models are typically bad at small details. If there’s too much stuff going on at once, they can’t focus on the entire image.


Yeah, I'm a little unconvinced by that. My best guess there is that the vision input has quite a restricted resolution and "zooming in" (really, cropping to an area) lets it get more information about the region of the photo because it's not as "fuzzy". Just a hunch though.


Yeah, once the image gets converted into tokens, how does "zooming in" somehow increase the information content?


It's cropping the original image then tokenizing it again with less downsampling, not cropping its internal representation.
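
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the encoder shrinks any image to fit a fixed input size before tokenizing; the 768-pixel limit and the photo and crop dimensions are made-up values for illustration, not any particular model's.

```python
def downsample_factor(width, height, max_side=768):
    """Factor by which an image is shrunk to fit within max_side pixels.

    max_side is a hypothetical encoder input limit, chosen for illustration.
    Images that already fit are left alone (factor 1.0).
    """
    return max(1.0, max(width, height) / max_side)


def effective_pixels(region_w, region_h, factor):
    """Pixels the encoder ends up with for a region after downsampling."""
    return (region_w / factor) * (region_h / factor)


# Hypothetical full photo and a region of interest inside it.
full_w, full_h = 4032, 3024
crop_w, crop_h = 800, 600

# Factor applied when the whole photo is sent to the encoder.
full_factor = downsample_factor(full_w, full_h)
# Factor applied when only the cropped region is sent.
crop_factor = downsample_factor(crop_w, crop_h)

# Pixels covering the same region in each case.
seen_in_full = effective_pixels(crop_w, crop_h, full_factor)
seen_in_crop = effective_pixels(crop_w, crop_h, crop_factor)

print(f"full image downsampled by {full_factor:.2f}x")
print(f"crop downsampled by {crop_factor:.2f}x")
print(f"region gets {seen_in_crop / seen_in_full:.1f}x more pixels when cropped")
```

Under those assumptions the same region reaches the encoder with far more pixels after cropping, because the crop is taken from the original file and then re-encoded, not cut out of the already-downsampled representation.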



