Introduction: Beyond Just Words

Imagine searching for a specific chart hidden in hundreds of pages of financial reports, or trying to locate a product in a sprawling digital catalog using both its image and a brief description.
In today’s increasingly multimedia world, documents are more than just text—they are rich ecosystems of words, images, layouts, charts, and tables. Traditional text-only search engines often fail to capture the meaning locked inside these visual elements, missing vital context.

Visual document retrieval aims to bridge this gap—building search systems that understand both textual and visual signals. Recently, the field has looked to large Vision-Language Models (VLMs)—the same models that power impressive image captioning and question-answering systems—and repurposed them for retrieval. The logic seems straightforward: if a model can describe an image, surely it understands it well enough to retrieve it.

While this approach works in part, it carries significant drawbacks. These repurposed VLMs are often huge, slow, and expensive to run. More critically, their architectures—especially causal attention, where tokens only look backwards in the sequence—were designed for generation, not for building precise, context-rich embeddings for retrieval.

A new paper, ModernVBERT: Towards Smaller Visual Document Retrievers, challenges the “bigger is better” assumption. The authors systematically examine the design choices that matter most for retrieval, distilling them into a principled recipe.

Their key contribution: ModernVBERT, a compact 250M-parameter model purpose-built for retrieval, not just repurposed from generation. Despite its modest size, this model matches (and sometimes surpasses) the performance of models more than 10× larger.

In this article, we’ll explore their methodology and findings:

  • Does the model’s attention type matter?
  • How crucial is image resolution?
  • Can we train better models with less—or different—data?

Background: Two Flavors of Multimodal Models

Before diving into ModernVBERT’s innovations, it’s useful to understand the two dominant paradigms for vision-language models:

  1. Dual Encoders
    Think of models like CLIP. They use separate “towers” for images and text, each producing a single vector embedding. During training, the model learns to align vectors for matching image-text pairs in a shared embedding space.
    This approach is fast and efficient for retrieval. However, compressing all information into a single vector can lose fine-grained details important for nuanced matches.

  2. Early-Fusion Encoders
    These models merge visual patches and text tokens into a single transformer, enabling deep, token-level cross-modal interactions.
    This architecture excels at capturing complex relationships, but it has typically been used in large, generative VLMs with causal attention, which makes those models suboptimal for retrieval.

Attention mechanisms are critical here:
Most generative VLMs are causal decoders, predicting each token based only on previous ones. In contrast, bidirectional encoders—like BERT—use Masked Language Modeling (MLM) to predict masked tokens using context from both before and after the mask. For retrieval, where full-context representations are crucial, the bidirectional approach is likely superior.
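
To make the distinction concrete, here is a minimal NumPy sketch of scaled dot-product attention with and without a causal mask; the sequence length, dimensions, and random inputs are purely illustrative.

```python
import numpy as np

def attention(q, k, v, causal=False):
    """Scaled dot-product attention over a single sequence.

    With causal=True each position attends only to itself and earlier
    positions (decoder-style); with causal=False every position sees the
    full sequence (encoder-style, as in BERT-like models)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (T, T) similarity matrix
    if causal:
        # Mask out future positions with -inf before the softmax.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                  # (T, d) contextualized tokens

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                             # toy sequence: 5 tokens, dim 8
causal_out = attention(x, x, x, causal=True)            # token 0 sees only itself
bidirectional_out = attention(x, x, x, causal=False)    # every token sees all 5
```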

Finally, consider the retrieval mechanism itself:

  • Single-vector retrieval compares one embedding per query and document.
  • Late interaction (popularized by ColBERT) compares token-level embeddings between query and document, aggregating the strongest matches. This retains fine-grained detail but needs embeddings enriched with global context—something bidirectional encoders are naturally better at.
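
The difference is easy to see in code. Below is a small NumPy sketch contrasting single-vector cosine scoring with ColBERT-style MaxSim late interaction; the embedding shapes and random values are illustrative assumptions, not numbers from the paper.

```python
import numpy as np

def single_vector_score(query_emb, doc_emb):
    """Cosine similarity between one pooled query vector and one pooled document vector."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_emb / np.linalg.norm(doc_emb)
    return float(q @ d)

def late_interaction_score(query_tokens, doc_tokens):
    """ColBERT-style MaxSim: for every query token, take its best-matching
    document token and sum those maxima."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                          # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())    # best document match per query token, summed

rng = np.random.default_rng(0)
query = rng.normal(size=(8, 128))          # 8 query token embeddings
doc = rng.normal(size=(500, 128))          # 500 document token/patch embeddings
print(single_vector_score(query.mean(axis=0), doc.mean(axis=0)))
print(late_interaction_score(query, doc))
```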

The ModernVBERT Method: A Recipe for a Better Retriever

The authors designed controlled experiments that isolate one design choice at a time and measure its effect on retrieval performance.

Model Architecture

ModernVBERT uses an early-fusion design (Figure 2):

  1. Vision Tower – A pre-trained Vision Transformer (siglip2-base-16b-512) encodes images by splitting them into small patches and embedding each patch.
  2. Language Model – A 150M–210M parameter pre-trained language model processes text tokens.
    Image patch embeddings are projected into the same space as text embeddings and concatenated. The unified sequence flows through the language model for joint processing.
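
To make the data flow concrete, here is a minimal PyTorch sketch of the fusion step; the hidden sizes, patch counts, and stubbed random embeddings are illustrative assumptions, not the actual ModernVBERT configuration.

```python
import torch
import torch.nn as nn

# Assumed hidden sizes for the two towers (illustrative only).
vision_dim, text_dim = 768, 576
num_patches, num_text_tokens = 1024, 32

# 1. Vision tower output: one embedding per image patch (stubbed with random values here).
patch_embeds = torch.randn(1, num_patches, vision_dim)

# 2. Project patch embeddings into the language model's embedding space.
projector = nn.Linear(vision_dim, text_dim)
projected_patches = projector(patch_embeds)              # (1, num_patches, text_dim)

# 3. Text token embeddings from the language model's embedding table (also stubbed).
text_embeds = torch.randn(1, num_text_tokens, text_dim)

# 4. Concatenate into one sequence and run it through a bidirectional encoder,
#    so every text token can attend to every patch and vice versa.
fused = torch.cat([projected_patches, text_embeds], dim=1)
layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
contextualized = encoder(fused)                          # (1, num_patches + num_text_tokens, text_dim)
```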

Figure 2: The MLM-based early-fusion architecture. The vision encoder turns image patches into patch representations, which are fed into a bidirectional language model alongside text tokens; training with a Masked Language Modeling objective yields rich sequence- and token-level representations.

Two-Phase Training

Phase 1: Modality Alignment
Teaches the language model to interpret visual features via a language modeling objective.

  • For decoders:
    Causal Language Modeling (CLM), predicting next tokens:
    \[ \mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T} \log P_{\theta}(x_t \mid x_{<t}) \]
  • For encoders:
    Masked Language Modeling (MLM), predicting masked tokens with full context:
    \[ \mathcal{L}_{\text{MLM}} = -\sum_{t \in \mathcal{M}} \log P_{\theta}(x_t \mid x_{\setminus \mathcal{M}}) \]

Training uses billions of tokens from document-rich sources (web pages, books, scientific papers).
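
For concreteness, here is a minimal PyTorch sketch of the two objectives on a toy batch; the vocabulary size, masking rate, and random logits stand in for a real model and are not the paper's settings.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 16
tokens = torch.randint(0, vocab_size, (1, seq_len))   # toy token ids
logits = torch.randn(1, seq_len, vocab_size)          # stand-in for the model's output logits

# CLM (decoder): predict token t from tokens < t, i.e. shift the targets by one position.
clm_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),           # predictions for positions 0 .. T-2
    tokens[:, 1:].reshape(-1),                        # targets are the next tokens
)

# MLM (encoder): score only the masked positions, which the model predicts using
# full bidirectional context. (In practice the masked inputs are replaced by a
# [MASK] token before the forward pass; the random logits above skip that step.)
mask = torch.rand(1, seq_len) < 0.15                  # mask roughly 15% of positions
mask[:, 0] = True                                     # ensure at least one masked position
labels = tokens.clone()
labels[~mask] = -100                                  # ignore unmasked positions in the loss
mlm_loss = F.cross_entropy(
    logits.reshape(-1, vocab_size), labels.reshape(-1), ignore_index=-100
)
```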

Phase 2: Contrastive Post-Training
Specializes the model for retrieval using InfoNCE loss over positive and negative pairs:

\[ \mathcal{L}_{\text{InfoNCE}}(\mathbf{q}, \mathbf{d}^+) = -\log \frac{\Phi(\mathbf{q}, \mathbf{d}^+)}{\Phi(\mathbf{q}, \mathbf{d}^+) + \sum_{\mathbf{d}^- \in \mathcal{N}_q} \Phi(\mathbf{q}, \mathbf{d}^-)} \]

Here, \(\Phi\) is a (typically exponentiated, temperature-scaled) similarity between query and document embeddings, and \(\mathcal{N}_q\) is the set of negative documents paired with query \(\mathbf{q}\).
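
Below is a minimal PyTorch sketch of InfoNCE with in-batch negatives and single-vector embeddings; the temperature, batch size, and pooled embeddings are illustrative assumptions. For late-interaction models, the similarity inside \(\Phi\) would be the MaxSim score rather than a dot product.

```python
import torch
import torch.nn.functional as F

def info_nce(query_embs, doc_embs, temperature=0.05):
    """In-batch InfoNCE: each query's positive is the document at the same index;
    every other document in the batch serves as a negative. (The paper additionally
    mines hard negatives; this sketch uses in-batch negatives only.)"""
    q = F.normalize(query_embs, dim=-1)
    d = F.normalize(doc_embs, dim=-1)
    logits = q @ d.T / temperature            # (B, B) similarity matrix
    targets = torch.arange(q.size(0))         # the positive sits on the diagonal
    return F.cross_entropy(logits, targets)

queries = torch.randn(8, 256)   # 8 pooled query embeddings
docs = torch.randn(8, 256)      # the 8 matching pooled document embeddings
loss = info_nce(queries, docs)
```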


What Makes a Great Visual Retriever? Key Findings

1. Modality Alignment Boosts Document Retrieval, Not Natural Images

The authors compared early-fusion encoders (encoder/decoder variants) with the standalone SigLIP vision tower.

Figure 3: Encoder (enc) and decoder (dec) early-fusion variants versus the standalone SigLIP tower on document retrieval, image/caption retrieval, and image classification. Fusion with a language model greatly improves document retrieval, while SigLIP remains stronger on natural image tasks.

For documents, fusion gave a +10.9 nDCG@5 gain. Natural image tasks, however, got worse, suggesting that modality alignment helps fine-grained document understanding but degrades the general-purpose image representations the standalone vision tower already provides.

2. More Alignment Data Helps—But Only for Documents

Scaling alignment data improved document retrieval steadily (Figure 4), surpassing standalone vision models. Natural image task performance plateaued early.

Figure 4: Scaling modality-alignment data steadily improves document retrieval, while image/caption retrieval and classification plateau well below the SigLIP baseline, with diminishing returns beyond roughly 1B tokens.

3. Bidirectional Attention Supercharges Late Interaction

When using single-vector retrieval, encoder and decoder models performed similarly. With multi-vector late interaction, bidirectional encoders excelled—boosting performance by ~20 points versus causal decoders.

Figure 5: ViDoRe nDCG@5 under single-vector and late-interaction retrieval. The bidirectional encoder gains massively from late interaction, while causal decoders fail to enrich early-sequence token embeddings and see little benefit.

Critically, removing causal masks from decoders late in training didn’t close the gap—native bidirectional design is key.

4. For Documents, Resolution is King

Training at higher image resolutions (e.g., 2048px) improved document retrieval significantly, at the expense of some natural image performance.

Table 1: Effect of training image resolution. Document retrieval jumps notably at higher resolutions, while non-document tasks drop slightly.

5. Cross-Modal Transfer via Text Data

Adding large volumes of text-only document-query pairs into visual contrastive training improved visual retrieval.

Table 2: Effect of mixing contrastive training data. Text-only document-query pairs boost document retrieval, while natural image-caption pairs boost classification.


ModernVBERT: Putting Insights into Practice

The final recipe:

  • Text Encoder: 150M param bidirectional encoder.
  • Vision Encoder: 100M param SigLIP tower.
  • Alignment: MLM objective over 10B tokens.
  • Resolution Cooldown: 2048px high-res stage.
  • Contrastive Training: Scaled mix of visual and text-only document-query pairs with hard negatives.

The late-interaction variant is ColModernVBERT, a lean 250M-parameter model.

Figure 1: Model size versus ViDoRe score. ColModernVBERT sits at the top left of the plot, achieving top-tier efficiency and outperforming much larger models.

Performance comparison (Table 3) shows ColModernVBERT topping all sub-1B models and matching models up to 10× larger, while being CPU-friendly.

Table 3: The ViDoRe leaderboard. ColModernVBERT leads its size class, offering the best performance-size tradeoff, and excels in CPU latency.


Conclusion: Design Trumps Size

The ModernVBERT study provides a clear blueprint for building efficient visual document retrievers:

  1. Bidirectional attention is essential—especially for late interaction retrieval.
  2. Task-specific design matters—document retrieval thrives on early fusion, high-res inputs, and token-level interactions.
  3. Cross-modal data can mitigate scarce visual corpora.
  4. Compact models can match giants when built with the right principles.

This work demonstrates that impactful design beats brute-force scaling. By open-sourcing models and code, the authors invite practitioners to deploy powerful, efficient retrieval systems and to explore the next generation of multimodal architectures.

ModernVBERT proves that in retrieval, it’s not about how big your model is—it’s about how smartly you build it.