Introduction
Imagine you are trying to find a specific friend in a crowded stadium. You don’t stare at the entire stadium at once and hope to instantly process every single face. Instead, your eyes dart around. You scan sections, focus on a group of people wearing the right color jersey, zoom in on a specific row, and filter out the noise. This cognitive process is known as visual search, and it is fundamental to how humans interact with the world. We dynamically adjust our focus to filter out irrelevant information and concentrate on what matters.
For Large Multimodal Models (LMMs)—the AI systems that power tools like GPT-4V or LLaVA—this process is surprisingly difficult. Most LMMs process images in a static way. They either downsize a high-resolution image to a fixed size (losing crucial details) or chop it up into fixed patches (losing global context). When asked to find a small object or answer a detailed question about a cluttered scene, these models often suffer from “hallucinations”—they confidently claim to see things that aren’t there because they cannot effectively “zoom in” or filter out visual noise.
Today, we are diving into a fascinating paper titled “DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding.” The researchers propose a novel method called DyFo (Dynamic Focus) that allows LMMs to mimic human visual search. The best part? It requires zero training. By orchestrating a collaboration between a smart LMM and a specialized “Visual Expert” model, DyFo allows AI to dynamically explore an image, focus on relevant regions, and answer fine-grained questions with significantly higher accuracy.

As shown in Figure 1, standard LMMs (a) often miss small details, like a person on a bike. Previous attempts like SEAL (b) try to fix this but require expensive fine-tuning. DyFo (c) successfully locates the target through an intelligent search process, without modifying the underlying model weights.
Background: The Resolution and Attention Problem
To understand why DyFo is necessary, we first need to look at the limitations of current LMMs.
The Trade-off: Context vs. Detail
When an LMM processes an image, it typically uses an encoder like ViT (Vision Transformer). These encoders often have a fixed input resolution (e.g., 336x336 pixels). If you feed a 4K image of a busy street into this model, the image is squashed down, turning small objects into unrecognizable blurs. This leads to object hallucination, where the model guesses what is in the blurry region based on probability rather than visual evidence.
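A quick back-of-the-envelope calculation shows how severe this squashing is (the 4K resolution and the 336×336 input size are just illustrative numbers):

```python
# Illustrative arithmetic: how small does an object become after resizing?
orig_w, orig_h = 3840, 2160      # a 4K street scene
target = 336                     # typical fixed ViT input resolution

scale_w, scale_h = target / orig_w, target / orig_h

# A pedestrian occupying 120 x 260 pixels in the original image...
obj_w, obj_h = 120, 260
print(round(obj_w * scale_w), "x", round(obj_h * scale_h), "pixels after resizing")
# -> about 10 x 40 pixels: barely enough signal to recognize anything
```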
Some newer models, like LLaVA-Next or Qwen2-VL, try to solve this by “patching”—cutting the image into tiles. However, blindly increasing the number of tiles introduces a new problem: Information Overload. In a cluttered scene, 90% of the image might be irrelevant background. Feeding all of that into the model creates interference, confusing the AI and actually increasing the rate of hallucinations.
Existing Solutions and Their Flaws
There have been attempts to give LMMs “eyes” to look around. Methods like SEAL integrate visual search by adding localization modules (tools that output bounding boxes or heatmaps) and a “visual working memory.” While effective, SEAL has a major drawback: it requires fine-tuning. You have to train the model specifically to use these tools, which involves collecting expensive datasets and spending massive computational resources. Furthermore, if a new, better LMM comes out (like Qwen2-VL), you have to start the training process all over again.
This brings us to the core innovation of DyFo: Can we achieve dynamic visual search using off-the-shelf models, without any training?
The Core Method: Dynamic Focus (DyFo)
DyFo is designed to act as a bridge between two types of AI models:
- The Large Multimodal Model (LMM): The “Brain.” It understands the user’s question, reasons about the image, and decides what to look for next. (e.g., LLaVA, Qwen-VL).
- The Visual Expert: The “Eyes.” These are specialized models (like Grounding DINO or SAM) that are excellent at detecting objects based on text prompts but lack deep reasoning capabilities.
DyFo combines these two using a framework inspired by Monte Carlo Tree Search (MCTS). It treats the process of finding the right visual region as a game where the model must choose the best “move” (focus shift) to maximize the “reward” (finding the answer).

As illustrated in Figure 2, the framework consists of two main pillars: the Focus Adjuster and the Focus Tree Search. Let’s break these down.
1. The Focus Adjuster
The Focus Adjuster is the mechanism that allows the LMM and the Visual Expert to talk to each other. It creates a closed loop of interaction.
The process is iterative. We define a “focus state” \(f = (I, T)\), where \(I\) is the current image region (a crop of the original) and \(T\) is the text description or query associated with it. The update cycle works as follows:
- LMM’s Turn: The LMM looks at the current image region (\(I\)) and the current question. It generates a new text instruction (\(T\)) on what to look for next.
- Visual Expert’s Turn: The Visual Expert takes this text instruction (\(T\)) and scans the image to find the matching object. It crops the image to that region, creating a new image input (\(I\)).
- Update: This creates a new focus state.
Mathematically, with the focus state \(f^i = (I^i, T^i)\), the update loop can be written as:

\[
T^{i+1} = L\!\left(f^{i}, A^{i}\right), \qquad I^{i+1} = E\!\left(f^{i}, T^{i+1}, A^{i}\right), \qquad f^{i+1} = \left(I^{i+1}, T^{i+1}\right)
\]
Here, \(L\) represents the LMM and \(E\) represents the Visual Expert. \(A^i\) represents the “action” being taken.
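A minimal sketch of one such iteration, with `lmm_propose_query` and `expert_locate` as hypothetical stand-ins for the LMM and the Visual Expert (in practice these would wrap something like LLaVA and a grounding model such as Grounding DINO):

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in original-image pixels

@dataclass
class FocusState:
    box: Box     # I: the current crop of the original image
    query: str   # T: the text description tied to that crop

def lmm_propose_query(box: Box, question: str, action: str) -> str:
    """Ask the LMM what to look for next; plug in LLaVA/Qwen-VL here."""
    raise NotImplementedError

def expert_locate(box: Box, query: str, action: str) -> Box:
    """Ask the Visual Expert (e.g., a grounding model) to localize the query."""
    raise NotImplementedError

def adjust_focus(state: FocusState, question: str, action: str) -> FocusState:
    """One Focus Adjuster iteration: propose T^{i+1}, then locate I^{i+1}."""
    new_query = lmm_propose_query(state.box, question, action)   # LMM's turn
    new_box = expert_locate(state.box, new_query, action)        # Visual Expert's turn
    return FocusState(box=new_box, query=new_query)              # Update
```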
The Action Space
To simulate human eyes, DyFo doesn’t just randomly crop images. It uses specific actions (\(A\)) that mimic human cognitive reflexes (a rough sketch of both follows the list):
- Semantic Focus (\(A_1\)): This mimics “conjunction search.” The model actively looks for an object mentioned in the query (e.g., “Find the red bus”).
- Semantic Scatter (\(A_2\)): This mimics “divergence.” If the focus is too tight, the model zooms out slightly to capture context, ensuring it hasn’t missed neighboring details.
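Concretely, both actions can be viewed as bounding-box arithmetic. The sketch below is an assumption about how they might be implemented; the 25% margin used for scattering is an illustrative choice, not a value from the paper.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)

def semantic_focus(detected_box: Box) -> Box:
    """A1: narrow the focus to the region the Visual Expert found for the query."""
    return detected_box

def semantic_scatter(box: Box, image_size: Tuple[int, int], margin: float = 0.25) -> Box:
    """A2: zoom out by padding the current box so neighboring context is recovered."""
    l, t, r, b = box
    w, h = r - l, b - t
    W, H = image_size
    return (max(0, int(l - margin * w)), max(0, int(t - margin * h)),
            min(W, int(r + margin * w)), min(H, int(b + margin * h)))
```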

Figure 3 visualizes this loop. Notice how the “Context” (the question) drives the LMM to generate a prompt, which drives the Visual Expert to change the visual input, which loops back to the LMM.
2. Focus Tree Search (MCTS)
If we only followed the loop above, the model might get stuck in a “rabbit hole,” zooming in on the wrong object and never coming back. To prevent this, DyFo uses Monte Carlo Tree Search (MCTS).
MCTS builds a “Focus Tree.” The root of the tree is the original full image. Each branch represents a decision to focus on a specific sub-region. The goal is to explore this tree to find the node (region) that provides the best answer to the user’s question.
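One way to represent a node in this focus tree, building on the `FocusState` sketch above (field names are illustrative, not the paper's):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class FocusNode:
    state: "FocusState"                    # the (I, T) pair from the Focus Adjuster sketch
    parent: Optional["FocusNode"] = None
    children: Dict[str, "FocusNode"] = field(default_factory=dict)  # action -> child node
    visits: int = 0                        # N(f): how many times search passed through here
    value: float = 0.0                     # Q: running estimate of this node's reward

    def is_root(self) -> bool:
        return self.parent is None
```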
Selection: Balancing Exploration and Exploitation
At every step, the algorithm must decide: Should I keep investigating this promising region (Exploitation), or should I try a different area that I haven’t looked at yet (Exploration)?
To make this decision, DyFo uses the Upper Confidence Bound applied to Trees (UCT) formula. For a given focus node \(f\), the algorithm selects the next action \(a^*\) that maximizes:

\[
a^{*} = \arg\max_{a} \left[ Q(f, a) + w \sqrt{\frac{\ln N(f)}{N\!\left(c(f, a)\right)}} \right]
\]

- \(Q(f, a)\): The expected quality (reward) of taking action \(a\). This represents exploitation—sticking to what works.
- The square root term: This represents exploration. \(N(f)\) is how many times the parent node was visited, and \(N(c(f,a))\) is how many times this specific child node was visited. If a node hasn’t been visited much, this term is large, encouraging the model to try it. The weight \(w\) controls how strongly exploration is favored over exploitation.
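In code, the selection step might look like the sketch below, reusing the `FocusNode` structure from earlier; the exploration weight `w` is a tunable constant whose value here is only a placeholder:

```python
import math

def select_action(node: "FocusNode", actions=("focus", "scatter"), w: float = 1.0) -> str:
    """Pick the child action with the highest UCT score."""
    def uct(action: str) -> float:
        child = node.children.get(action)
        if child is None or child.visits == 0:
            return float("inf")          # unvisited actions are tried first
        exploit = child.value            # Q(f, a)
        explore = w * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(actions, key=uct)
```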
Rewards and Backpropagation
How does the model know if a focus region is “good”? It needs a Reward Function (\(R\)).
In reinforcement learning, rewards usually come from an external environment. Here, the environment is the image itself. The authors devised a clever “Consensus-Based Reward.” A focus region is considered good if the LMM and the Visual Expert agree.
If the Visual Expert crops an image based on the text “red car,” and the LMM looks at that crop and independently says “this is a red car,” we have a match.
The reward function is defined as:

\[
R\!\left(f^{i}\right) = \mathbb{I}_{\{I=T\}} \cdot \frac{s_{f^{i}}}{s_{o}}
\]
- \(\mathbb{I}_{\{I=T\}}\) is an indicator that is 1 if the image content matches the text, and 0 otherwise.
- \(\frac{s_{f^{i}}}{s_{o}}\) is the ratio of the cropped area to the original area. This term penalizes the model for zooming in too much on insignificant pixels, encouraging it to find the largest relevant region.
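A sketch of this consensus reward, with `lmm_confirms` as a hypothetical stand-in for the LMM's yes/no check that the crop really shows what the text describes:

```python
def lmm_confirms(box: tuple, query: str) -> bool:
    """Ask the LMM whether the cropped region actually matches the text."""
    raise NotImplementedError  # plug in the LMM here

def consensus_reward(box: tuple, query: str, image_size: tuple) -> float:
    """Reward = agreement indicator * (crop area / original image area)."""
    l, t, r, b = box
    crop_area = max(0, r - l) * max(0, b - t)
    orig_area = image_size[0] * image_size[1]
    agree = 1.0 if lmm_confirms(box, query) else 0.0   # the indicator term
    return agree * (crop_area / orig_area)             # area ratio discourages over-zooming
```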
Once a leaf node is reached, the accumulated reward is backpropagated up the tree, updating the visit counts and \(Q\) values of all parent nodes in standard MCTS fashion.
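A standard MCTS-style running-average update over the `FocusNode` sketch above is one reasonable way to implement this:

```python
def backpropagate(leaf: "FocusNode", reward: float) -> None:
    """Push the leaf's reward up the focus tree, updating counts and Q values."""
    node = leaf
    while node is not None:
        node.visits += 1
        # Incremental mean: Q <- Q + (R - Q) / N
        node.value += (reward - node.value) / node.visits
        node = node.parent
```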

3. Multi-Grained Voting
After the search finishes, we have a tree full of different focus regions, each with a different view of the scene. Instead of just picking the single “best” node, DyFo uses a voting mechanism.
It aggregates the answers from different nodes, weighted by their rewards (\(R_f\)). This ensures that the final answer considers both the broad context (from nodes higher up the tree) and fine details (from nodes lower down).

This voting strategy is crucial because it prevents the model from losing global hints (“The scene is a kitchen”) while focusing on local details (“The spoon is silver”).
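A minimal sketch of that reward-weighted vote; `answer_from_node` is hypothetical and would query the LMM with the crop belonging to each node:

```python
from collections import defaultdict
from typing import Iterable

def answer_from_node(node: "FocusNode", question: str) -> str:
    """Ask the LMM the question while it sees only this node's crop."""
    raise NotImplementedError

def vote(nodes: Iterable["FocusNode"], question: str) -> str:
    """Aggregate answers across focus regions, weighted by each node's reward."""
    scores = defaultdict(float)
    for node in nodes:
        scores[answer_from_node(node, question)] += node.value  # R_f as the weight
    return max(scores, key=scores.get)
```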
Experiments and Results
The researchers tested DyFo extensively to answer two main questions: Does it reduce hallucinations? And does it improve fine-grained detail recognition?
Reducing Hallucinations (POPE Benchmark)
The POPE benchmark evaluates how often a model claims an object exists when it doesn’t. It uses three settings: Random, Popular (objects that appear often in the dataset), and Adversarial (objects that often co-occur with present objects but are absent).
The results, shown in Table 1 below, are compelling.

Key Takeaways from POPE:
- Consistent Gains: DyFo improved performance across almost every single category for both LLaVA-1.5 (fixed resolution) and Qwen2-VL (dynamic resolution).
- Adversarial Robustness: Look at the “Adversarial” setting for LLaVA-1.5: accuracy jumps from 81.83 to 83.40. This confirms that actively searching for visual evidence prevents the model from hallucinating objects just because they “should” be there.
We can see a visual example of this in Figure 4.

In the baseball example (right), standard models (LLaVA and Qwen) fail to see the baseball bat because it is small and off-center. They answer “No.” DyFo, however, actively searches, locates the bat (red box), and correctly answers “Yes.”
Fine-Grained Understanding (V* Bench)
V* Bench is a difficult benchmark specifically designed for high-resolution images and small details. It asks questions about attributes (color, material) and spatial relationships.

Key Takeaways from V* Bench:
- Beating the Baseline: Look at the “Overall” column in Table 3. DyFo-L (using LLaVA) scores 59.16%, significantly higher than the base LLaVA-1.5 at 48.68%.
- Surpassing Training-Based Methods: Most impressively, DyFo-Q (using Qwen) achieves 81.15%, beating SEAL (75.39%). Remember, SEAL required specialized fine-tuning and extra modules. DyFo achieved this result essentially “out of the box” by orchestrating existing models.
Figure 5 shows why this matters in real-world scenarios.

In the left image, the user asks about the material of a glove. The glove is a tiny fraction of the image. Standard models guess “Cotton” (a safe, common guess). DyFo zooms in on the hand, sees the texture, and correctly identifies it as “Rubber.” On the right, it correctly identifies a stylized “Dove” on a poster that other models mistook for a horse or dog.
Why MCTS? (Ablation Studies)
You might wonder: Do we really need the complex Tree Search? Can’t we just crop the image once and be done?
The researchers tested this. They compared MCTS against other search algorithms like Breadth-First Search (BFS), Depth-First Search (DFS), and A*.

As shown in Table 6, MCTS is the most efficient, requiring an average of only 3.20 search steps to find the target. BFS and DFS waste time exploring irrelevant areas (high search length). This efficiency is critical because every step requires calling the LMM, which takes time and compute. MCTS balances the need to find the answer with the need to be efficient.
Conclusion
The DyFo paper presents a compelling argument: we don’t always need bigger models or more training data to solve computer vision problems. Sometimes, we just need to change how the model looks at the data.
By simulating the human cognitive process of visual search—scanning, focusing, and verifying—DyFo transforms general-purpose LMMs into sharp-eyed detectives.
Key Implications:
- Plug-and-Play: The biggest advantage is that DyFo is training-free. As soon as a better LMM (like GPT-5V or LLaVA-Next) is released, DyFo can immediately be applied to it to enhance its fine-grained capabilities.
- Hallucination Mitigation: By grounding answers in specific image crops rather than the entire blurry scene, DyFo offers a robust defense against AI hallucinations.
- Modular AI: This research reinforces the trend of “Compound AI Systems”—combining different specialized models (experts + reasoners) to achieve results that neither could achieve alone.
As LMMs continue to integrate into high-stakes fields like medical imaging or autonomous driving, the ability to reliably “focus” on the right details will be just as important as the model’s general intelligence. DyFo is a significant step toward that reality.