![Cover image](https://deep-paper.org/en/paper/2501.03841/images/cover.png)
Bridging the Gap: How OmniManip Connects VLM Reasoning to Precise Robot Action
The dream of general-purpose robotics is a machine that can walk into a messy kitchen, identify a teapot and a cup, and pour you a drink without having been explicitly programmed for that specific teapot or that specific cup. In recent years, we have seen massive leaps in Vision-Language Models (VLMs). These models (like GPT-4V) have incredible “common sense.” They can look at an image and tell you, “That is a teapot, you hold it by the handle, and you pour liquid from the spout.” However, knowing what to do is very different from knowing exactly how to do it in 3D space. ...