![Cover image](https://deep-paper.org/en/papers/2025-10/2510.00515/images/cover.png)
# The Tortoise and the Hare of AI: How Gradual Learning Makes Visual AI Faster
Multimodal Large Language Models (MLLMs) are reshaping how we interact with AI. Models like LLaVA can look at an image and hold a conversation about it, combining the seeing ability of computer vision with the reasoning power of large language models (LLMs). They're like high-performance sports cars: incredible on the track, but they burn through fuel (in this case, computational resources) at a staggering rate. The main fuel drain? The sheer number of visual tokens. While a text prompt might be a few dozen tokens, a single image is typically broken into hundreds of them, and high-resolution images or multi-frame videos can inflate this count several times over. This flood of tokens creates a computational bottleneck, slowing inference and hogging memory. ...
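To make the "hundreds of tokens" claim concrete, here is a back-of-the-envelope sketch of how a ViT-style encoder turns an image into patch tokens. The 336-pixel image and 14-pixel patch match LLaVA-1.5's CLIP ViT-L/14 encoder; the high-resolution tiling multiplier below is an illustrative assumption, not a figure from the paper.

```python
# Back-of-the-envelope visual token count for a ViT-style encoder,
# as used by LLaVA-like models. Numbers are illustrative, not from the paper.

def visual_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a square image yields under a ViT encoder."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2

# CLIP ViT-L/14 at 336x336 (LLaVA-1.5's encoder): (336 / 14)^2 = 576 tokens,
# versus a text prompt that might be only a few dozen tokens.
print(visual_token_count(336, 14))       # 576

# A hypothetical high-resolution scheme that tiles a 672x672 image into four
# 336x336 crops plus one downsized global view multiplies the cost fivefold:
print(5 * visual_token_count(336, 14))   # 2880
```

Because self-attention scales quadratically with sequence length, going from a few dozen text tokens to thousands of visual tokens is what makes this a bottleneck rather than a mere inconvenience.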