Introduction
If you have ever tried to ask a Multimodal Large Language Model (MLLM) like LLaVA or GPT-4V a question about a tiny detail in a massive panoramic photo, you might have noticed a frustrating phenomenon: the model often hallucinates or simply says it cannot see the object.
The reason lies in the architecture. While models have scaled up in intelligence, their “eyes” are often limited. Most MLLMs resize inputs to a fixed, low resolution (typically \(336 \times 336\) or \(448 \times 448\) pixels) to save on computational costs. For a high-resolution (HR) image—say, an 8K photo—this downsampling is catastrophic. It introduces shape distortion and blurring that wipes out fine-grained details necessary for tasks like reading small text (OCR) or visual grounding.
To solve this, researchers have historically relied on heuristic cropping (cutting the image into fixed grids) or complex sliding window searches. However, these methods often lose the global context or become computationally heavy.
In a recent paper, Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG, researchers propose a paradigm shift. Instead of treating high-resolution perception as a purely visual processing task, they approach it as a long-context retrieval problem. By adapting Retrieval-Augmented Generation (RAG)—a technique famous in Natural Language Processing—to the visual domain, they achieve remarkable improvements in accuracy without retraining the underlying models.

As shown in Figure 1, the proposed framework, Retrieval-Augmented Perception (RAP), dynamically retrieves relevant image crops and reconstructs them in a way that preserves spatial context, leading to massive performance gains across various model sizes.
Background: The Resolution Bottleneck
Current MLLMs consist of a Visual Encoder (like CLIP or SigLIP) and a Large Language Model (LLM). The visual encoder turns images into “visual tokens,” which the LLM processes alongside text.
If you feed an 8K image directly into a standard Vision Transformer (ViT), you might generate ~300,000 visual tokens. This sequence length is prohibitively expensive for most LLMs. The standard solution is to resize the image, which works for describing a sunset but fails when asked, “What is the date written on the tiny stamp in the corner?”
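A quick back-of-the-envelope calculation makes the bottleneck concrete. The sketch below assumes a plain ViT with 14-pixel patches (as in CLIP ViT-L/14); real pipelines differ in resizing and pooling, so treat the numbers as order-of-magnitude estimates only.

```python
# Rough patch-token counts for a plain ViT with 14x14 pixel patches.
# Illustrative only: actual encoders add pooling, resizing, and
# aspect-ratio handling that change the exact numbers.

def visual_tokens(width: int, height: int, patch: int = 14) -> int:
    return (width // patch) * (height // patch)

print(visual_tokens(336, 336))    # 576 tokens: what most MLLMs actually see
print(visual_tokens(7680, 4320))  # ~169,000 tokens for a single 8K-UHD frame
```

Depending on the resolution and aspect ratio, the count quickly reaches several hundred thousand tokens, which is why aggressive downsampling became the default.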
Existing Approaches and Limitations
- Cropping-Based Methods: These split the image into a grid. While this preserves detail, it increases the token count linearly with the number of crops and often breaks objects that lie on crop boundaries.
- Search-Based Methods: These treat the image like a map, zooming in step-by-step. However, they usually start from a low-resolution overview. If the small object isn’t visible in the overview, the search path fails immediately.
The researchers behind RAP asked a fundamental question: Can we enhance the long-context capability of MLLMs using RAG, just as we do for text LLMs?
The Pilot Study: How to Do Visual RAG?
Implementing RAG for images isn’t as simple as retrieving text chunks. Images are 2D, spatial data. The researchers conducted a pilot study to answer two critical questions before building their framework.
1. Does Layout Matter?
In text RAG, you can often paste retrieved paragraphs sequentially. In vision, if you retrieve three crops from an image—top-left, bottom-right, and center—and feed them to the model as a sequence, the model loses the understanding of where these crops are relative to each other.
The researchers tested three strategies:
1. Sorting crops by retrieval score (relevance).
2. Keeping crops in their original order.
3. Preserving relative spatial positions.

As Table 1 illustrates, Strategy 3 (preserving positions) was critical. While simple retrieval helped with finding single objects (Fine-grained Single-instance Perception, or FSP), it hurt the model’s ability to understand relationships between objects (Fine-grained Cross-instance Perception, or FCP) unless the spatial layout was preserved.
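To make the three strategies concrete, here is a tiny sketch using a hypothetical `Crop` record (retrieval score, original index, grid position). It is not the paper’s code, just an illustration of the three orderings.

```python
from dataclasses import dataclass

@dataclass
class Crop:
    score: float   # retrieval similarity to the query
    index: int     # position in the original crop sequence
    row: int       # grid coordinates in the original image
    col: int

crops = [Crop(0.91, 7, 2, 3), Crop(0.84, 1, 0, 0), Crop(0.79, 5, 2, 0)]

# Strategy 1: order crops purely by relevance.
by_score = sorted(crops, key=lambda c: -c.score)

# Strategy 2: keep the crops in their original order.
by_original = sorted(crops, key=lambda c: c.index)

# Strategy 3: keep relative spatial positions -- place each crop back on a
# (row, col) grid so that "above" and "left of" still mean something.
grid = {(c.row, c.col): c for c in crops}
```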
2. How Many Crops (\(K\)) Should We Retrieve?
In RAG, the number of retrieved documents (\(K\)) is a hyperparameter. In Visual RAG, \(K\) represents the number of image crops.

The results in Figure 2 reveal a tradeoff:
- For FSP (finding one thing): A small \(K\) is better. Adding too many crops introduces noise and resolution overhead.
- For FCP (relationships): A larger \(K\) is necessary to capture the context between objects.
This implies that a fixed \(K\) is suboptimal. The system needs to dynamically decide how much visual information to retrieve based on the query.
Core Method: Retrieval-Augmented Perception (RAP)
Based on these insights, the authors propose RAP, a training-free framework composed of three main stages:
- VisRAG: Retrieving relevant crops.
- Spatial-Awareness Layout: Compressing the image while keeping relative positions.
- RE-Search: An adaptive algorithm to find the optimal \(K\).
Let’s break down the architecture.

Step 1: Retrieval with VisRAG
First, the high-resolution image is divided into a set of crops \(V\). The system treats the user’s text question as the query \(q\). Using a visual retriever (like VisRAG or SigLIP), the system calculates the similarity score between the query and every image crop.
The similarity score \(s(q, V)\) assigns each crop a relevance value, typically computed as the cosine similarity between the query embedding and the crop embedding in the retriever’s shared embedding space.
This step filters out the noise. Instead of processing the whole image, the model focuses only on the regions semantically related to the question.
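To make the retrieval step concrete, here is a minimal sketch that uses an off-the-shelf CLIP dual encoder from Hugging Face `transformers` as a stand-in for the paper’s VisRAG retriever. The uniform 8×8 cropping grid, the model choice, and the file name are assumptions for illustration, not the paper’s setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def grid_crops(image: Image.Image, rows: int = 8, cols: int = 8):
    """Split the image into a uniform grid of crops (a simplification)."""
    w, h = image.size
    cw, ch = w // cols, h // rows
    return [image.crop((c * cw, r * ch, (c + 1) * cw, (r + 1) * ch))
            for r in range(rows) for c in range(cols)]

@torch.no_grad()
def score_crops(image: Image.Image, question: str) -> torch.Tensor:
    crops = grid_crops(image)
    txt = processor(text=[question], return_tensors="pt", padding=True)
    img = processor(images=crops, return_tensors="pt")
    q = model.get_text_features(**txt)
    v = model.get_image_features(**img)
    # Cosine similarity between the query and every crop embedding.
    q = q / q.norm(dim=-1, keepdim=True)
    v = v / v.norm(dim=-1, keepdim=True)
    return (v @ q.T).squeeze(-1)  # one relevance score per crop

scores = score_crops(Image.open("8k_photo.jpg"), "What date is on the stamp?")
top_k = scores.topk(k=8).indices  # indices of the K most relevant crops
```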
Step 2: Spatial-Awareness Layout
Once the top \(K\) crops are selected, we cannot simply concatenate them. We must present them to the MLLM as a cohesive image to preserve spatial reasoning (e.g., “left of,” “above”).
The researchers introduce a Spatial-Awareness Layout. They represent the selected crops in a binary matrix \(M\), where \(1\) indicates a selected crop and \(0\) is an empty space. To create an efficient input image, they compress this matrix by removing rows and columns that are entirely empty (zeros).
The indices of the compressed matrix are obtained by counting, for each selected crop, how many non-empty rows and columns precede it in \(M\); those counts become the crop’s new row and column positions.
By mapping the selected crops into this compressed grid, the relative positions are maintained—a crop that was to the top-left of another crop in the original image remains to the top-left in the synthesized image—but the overall resolution is significantly reduced compared to the full original image.
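A minimal sketch of this compression, assuming a uniform grid of equally sized crops (a simplification of the paper’s layout), could look like this:

```python
from PIL import Image

def compress_layout(selected):
    """selected: set of (row, col) grid cells chosen by the retriever.
    Returns a mapping from each original cell to its compressed position."""
    kept_rows = sorted({r for r, _ in selected})
    kept_cols = sorted({c for _, c in selected})
    row_map = {r: i for i, r in enumerate(kept_rows)}  # old row -> new row
    col_map = {c: j for j, c in enumerate(kept_cols)}  # old col -> new col
    return {(r, c): (row_map[r], col_map[c]) for r, c in selected}

def synthesize(crops, mapping, crop_w, crop_h):
    """crops: dict {(row, col): PIL image} for the selected cells."""
    n_rows = 1 + max(nr for nr, _ in mapping.values())
    n_cols = 1 + max(nc for _, nc in mapping.values())
    canvas = Image.new("RGB", (n_cols * crop_w, n_rows * crop_h))
    for (r, c), (nr, nc) in mapping.items():
        canvas.paste(crops[(r, c)], (nc * crop_w, nr * crop_h))
    return canvas

# A crop that was above / left of another crop in the original image stays
# above / left of it in the synthesized image, at a fraction of the size.
mapping = compress_layout({(0, 0), (0, 5), (6, 5)})
# {(0, 0): (0, 0), (0, 5): (0, 1), (6, 5): (1, 1)}
```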
Step 3: Retrieved-Exploration Search (RE-Search)
Since different questions require different amounts of detail (different \(K\)), RAP uses an adaptive search algorithm called RE-Search. This is inspired by the A* search algorithm.
The algorithm builds a “RE-Tree.” Each node in the tree represents a version of the image constructed with a different retention ratio of crops (e.g., top 10%, top 20%, top 50%).
To navigate this tree and find the best node (image version), RAP calculates a cost function \(f(t)\) that balances two factors:
- Relevance: How similar are the crops to the query?
- Confidence: Is the MLLM confident it can answer the question given these crops?
The Relevance Cost (\(g\)) is the average similarity score of the crops retained at a node.
The Heuristic Cost (\(h\)) estimates how close we are to a good answer. It uses the MLLM’s own confidence: the system prompts the model, “Could you answer the question based on the available visual information? Answer Yes or No.” A higher probability of “Yes” drives the cost down.
Finally, these are combined into a total cost function \(f(t_s)\). The weights shift dynamically: at shallow depths (fewer crops), the model’s confidence is unreliable, so the system relies more on retrieval scores. As the tree gets deeper (more visual context), the model’s confidence becomes the primary guide.
In essence, \(f(t_s) = (1 - w)\,g(t_s) + w\,h(t_s)\), where the dynamic weight \(w\) grows with the depth of the node (a sketch of one plausible implementation appears below).
The search terminates when the model’s answering confidence exceeds a threshold (e.g., 0.6), ensuring efficiency.
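As a concrete illustration, here is a minimal Python sketch of how such a cost and search loop could look. The node structure (`image`, `depth`, `crop_scores`), the `expand` and `answering_confidence` callables, and the linear depth schedule for \(w\) are all assumptions made for this sketch, not the paper’s implementation.

```python
# answering_confidence(image) stands in for prompting the MLLM with
# "Could you answer the question based on the available visual
# information? Answer Yes or No." and reading off P("Yes").
import heapq

def node_cost(crop_scores, p_yes, depth, max_depth):
    g = sum(crop_scores) / len(crop_scores)  # relevance: avg similarity of kept crops
    h = 1.0 - p_yes                          # heuristic: low when the model says "Yes"
    w = depth / max_depth                    # assumed schedule: deeper => trust confidence
    # Similarity is inverted here so that a lower cost always means a better node.
    return (1.0 - w) * (1.0 - g) + w * h

def re_search(root, expand, answering_confidence, threshold=0.6, max_depth=4):
    """Best-first exploration of the RE-Tree; expand(node) yields children
    built with larger crop-retention ratios (e.g., top 10% -> 20% -> 50%)."""
    frontier, tie = [(0.0, 0, root)], 0
    while frontier:
        _, _, node = heapq.heappop(frontier)
        if answering_confidence(node.image) >= threshold or node.depth == max_depth:
            return node.image                # confident enough, or out of budget
        for child in expand(node):
            tie += 1
            cost = node_cost(child.crop_scores,
                             answering_confidence(child.image),
                             child.depth, max_depth)
            heapq.heappush(frontier, (cost, tie, child))
    return root.image                        # fallback if the tree is exhausted
```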
Experiments and Results
The researchers evaluated RAP on challenging benchmarks like \(V^*\) Bench and HR-Bench (4K and 8K resolutions).
Quantitative Performance
The improvements are drastic. As shown in Table 2, RAP boosts the performance of open-source models (like LLaVA-v1.5 and v1.6) to levels that rival or surpass closed-source giants like GPT-4o on specific tasks.

For example:
- LLaVA-v1.5-7B jumped from 32.1% to 53.8% overall accuracy on HR-Bench 8K.
- \(V^*\) Bench scores nearly doubled for the 7B model.
Comparison with Other Search Methods
How does RAP compare to other methods that try to solve the high-resolution problem, like Zoom Eye or \(DC^2\)?

Table 8 shows that RAP consistently outperforms these methods. Specifically, on the LLaVA-v1.5-7B model, RAP achieves a +27.0% gain over the baseline, compared to +22.5% for Zoom Eye. This suggests that “retrieving” is more effective than “zooming” hierarchically.
Efficiency
One might assume that searching for the optimal crop count is slow. However, because VisRAG computes similarities for all crops in parallel and the search tree is shallow, RAP is highly efficient.

Table 5 demonstrates that RAP actually achieves higher throughput (4.2 samples/minute) compared to Zoom Eye (3.3) and \(DC^2\) (2.1), while also delivering higher accuracy.
Adaptive Selection of \(K\)
Does RE-Search actually work? The distribution of selected \(K\) values shows that the system adapts to task difficulty.

Figure 4 shows that for FSP tasks (finding a single object), the distribution leans toward smaller \(K\) values. For FCP tasks (relationships), the system automatically selects larger \(K\) values, confirming the hypothesis from the pilot study.
Qualitative Analysis
The power of RAP is best understood through examples.
Example 1: Single-Instance Perception
In Figure 7, the model is asked to identify a number on a small sign in the background.

- Zoom Eye: Fails because its search path cuts off the text “08-26,” leaving only “08-”.
- RAP: Retrieves the specific crops containing the sign and the text, allowing the MLLM to read “08-26” correctly.
Example 2: Cross-Instance Perception
In Figure 8, the query asks about the location of a stone cairn relative to a waterfall.

- Zoom Eye: Retrieves a crop of the cairn but loses the context of the waterfall, leading to a hallucinated directional answer (“To the left”).
- RAP: Using the Spatial-Awareness Layout, it retrieves crops for both the cairn and the waterfall and maintains their relative positions. The model correctly identifies the cairn is at the “bottom right.”
Ablation Study: What Matters Most?
The authors broke down the contribution of each component in RAP.

- Baseline: 32.1% accuracy.
- + VisRAG: Using retrieval alone boosts FSP significantly but hurts FCP (due to lost spatial info).
- + Spatial Layout (SL): Fixes the FCP drop, slightly improving overall score.
- + RE-Search: The massive jump to 53.8% comes from dynamically selecting \(K\). This confirms that using a static number of crops is a major bottleneck.
Conclusion
The paper Retrieval-Augmented Perception introduces a compelling argument: we don’t necessarily need larger context windows or heavier encoders to solve high-resolution vision. Instead, we can treat visual details as retrievable information.
By combining VisRAG for relevance, Spatial-Awareness Layout for context, and RE-Search for adaptability, RAP allows standard MLLMs to “see” details in 8K images that were previously invisible to them. This approach not only improves accuracy by large margins (up to 43% on some benchmarks) but does so efficiently, without the need for expensive training.
As multimodal models continue to integrate into real-world applications—from analyzing satellite imagery to reading dense documents—techniques like RAP will be essential in bridging the gap between pixel count and genuine perception.