The human eye is a marvel of biological engineering, but it is also surprisingly economical. We do not perceive the world in uniform high definition. Instead, we possess a fovea—a small central region of high acuity—surrounded by a periphery that progressively blurs into low resolution. This mechanism allows us to process complex scenes efficiently, allocating limited biological resources (photoreceptors and optic nerve bandwidth) where they matter most.
In contrast, modern Computer Vision (CV) and Large Multimodal Models (LMMs) are brute-force processors. They typically ingest images at a uniform, high resolution across the entire Field of View (FOV). While effective, this approach is computationally expensive and bandwidth-heavy.
For an autonomous drone operating on a weak connection or an edge device with limited battery, transmitting 4K images for processing is simply not feasible. This leads to a compelling question posed by researchers Gizdov, Ullman, and Harari in their recent paper: Can we improve the performance and efficiency of AI models by mimicking the human foveated sampling scheme?
In this post, we will dive deep into their research, exploring how “seeing more with less” not only saves bandwidth but can actually improve model accuracy and induce human-like representations in artificial neural networks.
Background: The Efficiency Gap
Before understanding the solution, we must define the problem. Current multimodal models (like BLIP2, LLaVA, or MDETR) treat every pixel with equal importance. Whether a pixel contains a crucial detail of a face or a patch of empty sky in the corner, the computational cost of processing it is identical.
Recent techniques like “token merging” attempt to reduce this load by pruning redundant information after the image has been fully embedded. However, this doesn’t solve the “bandwidth bottleneck.” If you are a rover on Mars or a drone in a remote area, you cannot afford to upload a high-resolution image to a server just to have the server throw away half the tokens. The compression needs to happen at the source.
The biological solution is Foveation. By varying resolution—high at the center of fixation and lower at the edges—humans balance detail with context. This paper investigates whether applying this biological principle to standard, pre-trained AI architectures yields similar benefits.
The Core Method: Information-Matched Sampling
The researchers set out to test two competing hypotheses regarding image downsampling. To do this fairly, they introduced the concept of Information-Matched Images.
If we want to compare a “Human-like” view vs. a “Machine-like” view, we must ensure both views consume the exact same “budget” of pixels. If one method uses more pixels, it has an unfair advantage.
1. The Sampling Schemes
The authors compared two specific sampling strategies:
- Variable Resolution (Foveated): This mimics the human eye. The density of sample points peaks at the center (fixation point) and decreases linearly as you move toward the periphery.
- Uniform Resolution: The sample points are spread evenly across the image.
Crucially, both schemes are constrained to use the exact same number of total samples (\(N\)).
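To make the budget constraint concrete, here is a minimal sketch of how such information-matched masks could be generated. The linear density falloff, the weighted sampling without replacement, and the `sampling_mask` helper are our illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sampling_mask(h, w, n_samples, fixation=None, scheme="variable", rng=None):
    """Build a binary sampling map S with exactly n_samples ones.

    scheme="variable": density peaks at the fixation point and falls off
    linearly with distance (a simple stand-in for a foveated map).
    scheme="uniform": every pixel is equally likely to be sampled.
    """
    rng = np.random.default_rng(rng)
    ys, xs = np.mgrid[0:h, 0:w]
    if scheme == "uniform":
        prob = np.ones(h * w)
    else:
        fy, fx = fixation if fixation is not None else (h / 2, w / 2)
        r = np.hypot(ys - fy, xs - fx)
        # Linear falloff; a small floor keeps the periphery sparsely sampled
        prob = np.maximum(1.0 - r / r.max(), 0.01).ravel()
    prob /= prob.sum()
    idx = rng.choice(h * w, size=n_samples, replace=False, p=prob)
    mask = np.zeros(h * w, dtype=bool)
    mask[idx] = True
    return mask.reshape(h, w)

# Both schemes consume the same pixel budget N (a 3% density here)
h, w = 480, 640
n = int(0.03 * h * w)
s_var = sampling_mask(h, w, n, scheme="variable", rng=0)
s_uni = sampling_mask(h, w, n, scheme="uniform", rng=0)
assert s_var.sum() == s_uni.sum() == n  # information-matched
```

Because both masks are drawn with the same `n`, any downstream accuracy difference can be attributed to where the pixels were spent, not how many.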

As shown in Figure 1, the difference is striking. In panel (a), the Variable scheme allocates the pixel budget to the carriage (the object of interest), preserving fine details like the wheels and passengers. In panel (b), the Uniform scheme spreads that same budget across the trees and street, rendering the carriage a blurry mess.
2. Mathematical Formulation
To achieve this, the researchers define a sampling map \(S\). Let \(I\) be the original high-resolution image. The sampling map determines which pixels are kept (\(1\)) and which are discarded (\(0\)).

The total number of samples \(N\) is the sum of all points where \(S(x,y)=1\). The strict constraint for their experiments is:
\[ \sum S_{\mathrm{var}} = \sum S_{\mathrm{uni}} = N \]
Once the pixels are sampled, the image is effectively a sparse collection of points. To feed this into standard architectures (which expect a grid), the researchers reconstruct the image using interpolation (denoted as \(\mathcal{I}\)).
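Feeding the sparse samples to a standard backbone requires densifying them again. Below is a minimal sketch of that reconstruction using SciPy's `griddata`; the choice of linear interpolation and mean-fill for uncovered borders is an assumption, since the paper only denotes the operator abstractly as \(\mathcal{I}\):

```python
import numpy as np
from scipy.interpolate import griddata

def reconstruct(image, mask):
    """Rebuild a dense image from the sparse samples (the operator I).

    image: (H, W) or (H, W, C) array; mask: boolean sampling map S.
    Linear interpolation inside the convex hull of the samples; the
    mean sampled value fills any uncovered border pixels.
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)            # coordinates where S(x, y) = 1
    grid_y, grid_x = np.mgrid[0:h, 0:w]
    chans = image.reshape(h, w, -1)
    out = np.empty_like(chans, dtype=float)
    for c in range(chans.shape[-1]):
        vals = chans[ys, xs, c]          # the only pixel values we "kept"
        out[..., c] = griddata((ys, xs), vals, (grid_y, grid_x),
                               method="linear", fill_value=vals.mean())
    return out.squeeze()

# e.g. dense = reconstruct(img, s_var), using the mask from the sketch above
```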

3. The Visual Result
What does this look like to the AI? Figure 2 below illustrates a Visual Question Answering (VQA) task. The model is asked, “What are the people sitting in front of?”

In the Variable resolution version (b), despite having a tiny fraction of the original pixels, the “plane” is clearly visible because the sampling focused on the center. In the Uniform version (c), the plane is blurred beyond recognition, leading the model to hallucinate “ocean.”
Experiments & Results
The researchers evaluated several state-of-the-art models, including ViLT, MDETR, BLIP2, InstructBLIP, and LLaVA. They tested these models on major datasets like VQAv2, GQA, and SEED-Bench.
The primary constraint was severe: a 3% sampling density. This means the models were only allowed to “see” 3% of the pixels present in the original image.
1. Foveation Wins on Accuracy
The results were consistent across the board: Variable (Foveated) sampling outperforms Uniform sampling when the pixel budget is tight.

As seen in Table 1:
- ViLT on VQAv2: Variable sampling scored 64.9% vs. Uniform’s 62.9%.
- MDETR on GQA: Variable scored 46.8% vs. Uniform’s 44.1%.
While these margins might seem small in absolute terms, a gain of 2.7 percentage points is significant by benchmark standards, especially considering that no architectural changes were made to the models. The improvement comes purely from how the data was presented.
We can see a more detailed breakdown in Figure 3, which displays a radar chart for the BLIP2 model on the SEED-Bench dataset.

Remarkably, for specific tasks like “Instance Identity,” the Variable model at 3% density actually outperformed the full-resolution baseline. The authors hypothesize that the blur in the periphery acts as a natural attention mechanism, filtering out background noise and forcing the model to focus on the subject.
2. Diminishing Returns of Resolution
One of the most profound findings of this paper is how wasteful current approaches are. By testing sample densities ranging from 1% to 100%, the authors mapped out the performance scaling.

Looking at Figure 5, observe the curve for “MDetr - Variable” (solid orange line in panel b). It rises sharply at low densities and begins to plateau around 20-30% density; even at just 3% of the pixels, the model already reaches roughly 80% of its full-resolution performance.
This finding challenges the trend of simply training models on larger, higher-resolution images. It suggests that texture and fine-grained detail (preserved by foveation) are more critical than having high-definition pixels on background elements like grass or sky.
3. Addressing the “Center Bias”
A skeptical reader might ask: “Is the variable model only winning because photographers usually put the subject in the center of the picture?”
This is a valid concern, known as photographer bias. To control for this, the researchers ran ablation studies where they shifted the fixation point to the corners of the image (Bottom-Left, Top-Right, etc.).

Table 2 shows the results. Even when the high-resolution “fovea” was moved to a corner (away from the likely center subject), the Variable method still outperformed the Uniform method (45.2% vs 44.1%).
Furthermore, they conducted a “Bin Experiment” for object detection (Figure 4 below). They measured performance based on how much of an object was actually inside the high-resolution area (HRA).

The graph in Figure 4b reveals that as soon as about 10% of an object enters the high-resolution fovea, the Variable model dominates. The Uniform model only wins when the object is almost entirely in the blurry periphery—a rare edge case.
Interpretability: Developing a “Human” Brain?
Perhaps the most fascinating section of the paper explores the internal state of the models. Does feeding a neural network foveated images change how it “thinks”?
The authors analyzed MDETR (a Transformer-based detection model) to see how its attention mechanisms adapted.
1. Global Attention
In Vision Transformers, “Self-Attention” calculates how much one part of an image relates to another. The authors defined Attention Distance (\(d_i\)) to measure how far across the image a specific token is “looking.”

A higher \(d_i\) means the model is integrating information from distant parts of the image (global context).
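The paper's exact formula is not reproduced here, but a common way to operationalize attention distance (familiar from the ViT literature) is the attention-weighted mean distance between token positions. The sketch below assumes a square token grid and a row-normalized attention matrix:

```python
import numpy as np

def attention_distance(attn, grid_h, grid_w):
    """Attention-weighted mean spatial distance d_i per image token.

    attn: (T, T) row-normalized self-attention over T = grid_h * grid_w
    tokens. Returns d of shape (T,): how far token i "looks" on average,
    in units of the token grid.
    """
    ys, xs = np.mgrid[0:grid_h, 0:grid_w]
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)     # (T, 2)
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)  # (T, T)
    return (attn * dist).sum(axis=1)  # d_i = sum_j A_ij * ||p_i - p_j||
```

Comparing the returned `d_i` for peripheral tokens under the two sampling schemes is one way to reproduce the kind of analysis described next.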
The analysis revealed that models trained with Variable sampling developed significantly longer attention distances for peripheral tokens. Essentially, because the periphery is blurry, the model learned to look toward the sharp center for context, establishing a strong information flow between the “fovea” and the “periphery.”
2. Neuronal Selectivity
In the convolutional backbone (ResNet), the authors looked for Resolution Selectivity. In biological brains, some neurons fire only for high-frequency details, while others fire for low-frequency shapes.

Figure 6 provides a comprehensive look at this phenomenon:
- Panels a, b, c: The red dots represent peripheral tokens. In the Variable model (b), the attention map (heatmap) spreads much wider than in the Uniform model (c).
- Panel h: This histogram compares activations on high-resolution crops (blue) vs. low-resolution crops (orange). The separation indicates that specific neurons have specialized to trigger only on high-resolution details—a behavior observed in the human visual cortex but typically absent in standard CNNs.
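As a rough illustration of how such selectivity could be quantified, one could score each unit with a d'-style separation between its activation distributions on the two crop types. This index is our assumption for illustration, not the paper's metric:

```python
import numpy as np

def selectivity_index(act_high, act_low):
    """d'-style separation between one unit's activations on
    high-resolution crops vs. low-resolution (peripheral) crops.
    Values far from zero mark a resolution-selective unit, i.e. the
    separated histograms of Figure 6h.
    """
    pooled_var = 0.5 * (act_high.var() + act_low.var())
    return (act_high.mean() - act_low.mean()) / np.sqrt(pooled_var + 1e-8)
```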
Conclusion and Future Implications
This research, “Seeing more with less,” provides a strong information-theoretic argument for rethinking how machines see. The key takeaways are:
- Efficiency: We can discard nearly 97% of an image’s pixels and still retain ~80% of model performance if we sample intelligently (foveally).
- Performance: Under tight bandwidth constraints, foveated images yield higher accuracy than uniformly downscaled ones.
- Biology in Silicon: Foveated inputs naturally induce human-like processing traits, such as global attention and resolution-specific neurons, without explicit programming.
Why does this matter? As we push AI to the “edge”—deploying models on micro-drones, wearable glasses, and remote sensors—bandwidth and power are the ultimate constraints. We cannot always rely on massive servers and 5G connections. By adopting the biological strategy of foveation, we can build systems that are not only faster and lighter but also surprisingly more robust.
The paper suggests that the future of computer vision might not just be about bigger models, but about models that—like us—know exactly where to look.