![Cover image](https://deep-paper.org/en/paper/file-3216/images/cover.png)
# Why CLIP Can't Read Between the Lines: Fixing Compositional Reasoning in Vision-Language Models
## Introduction

Imagine showing a picture of a horse riding on a person (a strange image, granted) to a state-of-the-art AI model. Then, you ask the model to pick the correct caption between two options: “a person riding a horse” and “a horse riding a person.” Ideally, this should be easy. The nouns are the same, but the relationship is flipped. However, most modern Vision-Language Models (VLMs), including the famous CLIP, struggle significantly with this. They act like “Bag-of-Words” models: they see “horse,” they see “person,” and they declare a match, completely ignoring the syntax or the relationship described by the verb “riding.” ...
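To make that caption-matching setup concrete, here is a minimal sketch using the Hugging Face Transformers CLIP API. The model and processor names are the standard public checkpoints, but the image path is a hypothetical placeholder, and this is an illustrative sketch rather than the evaluation code from any particular paper. For a bag-of-words-style model, the two captions often receive nearly identical scores, since they contain the same nouns.

```python
# Minimal sketch: score two word-order-swapped captions against one image with CLIP.
# "horse_riding_person.jpg" is a hypothetical placeholder image path.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("horse_riding_person.jpg")
captions = ["a person riding a horse", "a horse riding a person"]

# Encode the image and both captions in one batch.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity for each caption.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs):
    print(f"{p:.3f}  {caption}")
```

If the model truly understood the relation expressed by “riding,” the probability mass would concentrate on the caption that matches the image; near-50/50 scores are the signature of bag-of-words matching.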