
Different tasks then? If you are using VLMs in the context of medical imaging, I have concerns. That is not a place to use hallucinatory AI.

But yes, the transformer model itself isn't useless; it's the application of it. OCR, image description, etc., are the kind of narrow-intelligence tasks that lend themselves well to the fuzzy nature of AI/ML.



The world is a fuzzy place, most things are not binary.

I haven't worked in medical imaging in a while, but VLMs make for much better diagnostic tools than task-specific classifiers or segmentation models, which tend to find hacks in the data to cheat on the objective they're optimized for.

The next-token objective turns out to give us much better vision supervision than things like CLIP or classification losses (e.g., https://arxiv.org/abs/2411.14402).
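To make the contrast concrete: CLIP produces one similarity score per image-text pair, while next-token prediction scores every single caption token, so the gradient signal is much denser. A toy cross-entropy in plain Python (purely illustrative, not code from the paper):

```python
import math

def softmax(xs):
    # numerically stable softmax over one position's logits
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_step, target_ids):
    """Mean cross-entropy over caption positions.

    `logits_per_step[i]` are the vocabulary logits at step i,
    `target_ids[i]` is the index of the true next token.
    Unlike a CLIP-style objective (one scalar per image-text pair),
    every token of the caption contributes its own loss term.
    """
    total = 0.0
    for step_logits, target in zip(logits_per_step, target_ids):
        total += -math.log(softmax(step_logits)[target])
    return total / len(target_ids)

# With uniform logits over a 2-token vocab, the loss is ln(2) per step.
print(next_token_loss([[0.0, 0.0]], [0]))  # → 0.693...
```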

I spent the last few years working on large-scale food recognition models, and my multi-label classification models had no chance of competing with GPT-4 Vision, which was trained on all of the internet and has an amazing prior thanks to its vast knowledge of facts about food (recipes, menus, ingredients, etc.).
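For readers unfamiliar with the setup: a multi-label head puts an independent sigmoid on each label, so one photo can be "pasta" and "salad" at once, but the label set is fixed up front, which is exactly where a VLM's open-ended prior wins. A minimal sketch (the label set and logits here are made up):

```python
import math

# Hypothetical label vocabulary; a real food model has thousands of these.
FOOD_LABELS = ["pizza", "salad", "sushi", "pasta"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, threshold=0.5):
    """Multi-label classification: each label gets its own independent
    sigmoid score, and every label above the threshold is emitted.
    Anything outside FOOD_LABELS simply cannot be predicted."""
    return [label for label, z in zip(FOOD_LABELS, logits)
            if sigmoid(z) >= threshold]

print(predict_labels([3.0, -2.0, 0.1, 2.0]))  # → ['pizza', 'sushi', 'pasta']
```

The closed vocabulary is the limitation: a dish missing from `FOOD_LABELS` is invisible to this model, whereas a VLM can name it from its pretraining prior.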

The same goes for other areas like robotics: we saw very little progress outside of simulation until about a year ago, when people took pretrained VLMs and tuned them to predict robot actions, beating all previous methods by a large margin (google Vision-Language-Action models). It turns out you need a good foundation model with a core understanding of the world before you can train a robot to do general tasks.




