Introduction
Imagine you are a financial analyst tasked with answering a specific question based on a 100-page annual report. The report isn’t just text; it is a chaotic mix of paragraphs, bar charts, scatter plots, and infographics spread across different pages. To answer the question, you can’t just find a single sentence. You might need to look at a chart on page 5 to identify a specific region, and then use that region to find a corresponding revenue figure in a table on page 12.
This is the challenge of Multi-page Heterogeneous Document Question-Answering (Doc-QA).
Current AI solutions, specifically Retrieval-Augmented Generation (RAG) and Multimodal Large Language Models (MLLMs), struggle here. Standard RAG retrieves documents based on semantic similarity, often missing visual context. MLLMs have context window limits and can hallucinate when overloaded with too many irrelevant images.
In this post, we explore Doc-React, a novel framework proposed by researchers from UC San Diego and Adobe. Doc-React treats document QA not as a static retrieval task, but as an iterative, agentic process. By mathematically balancing information gain against uncertainty, Doc-React progressively “reasons” its way through complex documents to find precise answers.
The Problem: Why Single-Step Retrieval Fails
In traditional RAG, the system takes a user query, searches a database for the top-k most similar chunks (text or images), and feeds them to an LLM. This works for simple questions like “What is the company’s mission?”
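To make the single-step nature concrete, here is a minimal sketch of this kind of retrieval (illustrative code, not the paper's implementation; the embeddings are assumed to come from any off-the-shelf text/image encoder):

```python
import numpy as np

def top_k_retrieve(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Single-step retrieval: rank chunks by cosine similarity to the query.

    query_vec:  (d,) embedding of the user question
    chunk_vecs: (n, d) embeddings of the document's text/image chunks
    """
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    # The top-k chunks are passed to the LLM once; nothing the model learns
    # while answering ever feeds back into the retrieval step.
    return np.argsort(-scores)[:k]
```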
However, complex documents require multi-hop reasoning. Consider the example below. The user asks about “active social network users” in a region with “252M mobile broadband subscriptions.”

As shown in Figure 1, a standard retriever might fail because the query doesn’t explicitly mention “North America.” It only mentions the subscription count. A standard system cannot bridge the gap between the map (which links 252M to North America) and the bar chart (which lists North America’s social users).
We need a system that can look at the map, realize “Ah, the region is North America,” and then generate a new search query for “North America social network users.” This is the core intuition behind Doc-React.
The Doc-React Method
Doc-React is an adaptive, iterative framework. Instead of trying to guess the answer in one shot, it breaks the process down. It uses an MLLM (Multimodal LLM) as a judge and generator to refine its search strategy dynamically.
The framework is grounded in information theory. The goal is to extract a subset of document pages (\(S\)) that contains enough information to derive the correct answer (\(A\)), while minimizing the uncertainty (entropy) of the model.
1. The Mathematical Objective
The researchers formulate the task as an optimization problem. We want to select a sequence of document pages (\(S_i\)) such that we minimize the entropy (uncertainty) of the generated answer, provided that the mutual information between our selected pages and the answer is sufficiently high.
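Based on this description, one plausible way to write the objective is as follows (the notation is illustrative and may differ from the paper's exact formulation; \(D\) is the full set of pages, \(Q\) the query, and \(\tau\) a sufficiency threshold):

\[
\min_{S \subseteq D} \; H\!\left(A \mid Q, S\right)
\quad \text{subject to} \quad
I\!\left(S; A \mid Q\right) \;\ge\; \tau
\]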

In simpler terms: Find the smallest set of pages that makes the model certain about the correct answer.
2. Normalized Information Gain
Since we cannot know the perfect answer (\(A\)) beforehand, the system acts greedily. At each step \(t\), it tries to select the next set of pages (\(S_{t+1}\)) that provides the most new information relative to what it already knows.
The authors propose maximizing Normalized Information Gain:
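Based on the explanation that follows, a plausible form of this objective is (illustrative notation):

\[
S_{t+1}^{*} \;=\; \arg\max_{S_{t+1}} \; \frac{I\!\left(S_{t+1};\, A \mid \pi_t\right)}{H\!\left(S_{t+1}\right)}
\]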

Here is how to read this equation:
- Numerator (Information Gain): How much does the new page \(S_{t+1}\) tell us about the Answer (\(A\)), given what we already learned from previous pages (\(\pi_t\))?
- Denominator (Entropy): How “confusing” or complex is the new page? We normalize by entropy to avoid selecting pages that are information-dense but irrelevant or overly noisy.
3. Information Differentiation (The “Residual”)
How does the model know what information is “new”? It calculates the information residual (\(A'_t\)). This represents the “gap” between the complete answer and what the model currently knows.
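One illustrative way to read this gap in information-theoretic terms (a reading of the description above, not necessarily the paper's exact definition) is that the residual carries whatever uncertainty about \(A\) the already-retrieved pages \(\pi_t\) have not resolved:

\[
H\!\left(A'_t\right) \;\approx\; H\!\left(A \mid \pi_t\right) \;=\; H(A) \;-\; I\!\left(\pi_t;\, A\right)
\]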

In practice, we can’t calculate abstract residuals. Instead, Doc-React uses an LLM to explicitly state what is missing. This is called Information Differentiation. The framework prompts the LLM with the current retrieved pages and the query, asking it to identify what information is still needed.
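A minimal sketch of what this prompting step might look like (the prompt wording and the `call_mllm` wrapper are hypothetical, not taken from the paper):

```python
def information_differentiation(query: str, retrieved_pages: list, call_mllm) -> str:
    """Ask the MLLM to state what information is still missing (the residual)."""
    prompt = (
        "You are answering the question below using the attached document pages.\n"
        f"Question: {query}\n"
        "If the pages are not sufficient to answer, describe precisely what "
        "information is still missing, phrased so it can be used as a new search "
        "query. If they are sufficient, reply with exactly 'SUFFICIENT'."
    )
    # `call_mllm` is an assumed wrapper around a multimodal LLM API that accepts
    # a text prompt plus page images and returns the model's text reply.
    return call_mllm(prompt, images=retrieved_pages)
```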

By substituting this residual back into our gain equation, the objective becomes maximizing the information regarding this missing piece (\(A'_t\)):
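A plausible form of the substituted objective (again, illustrative notation):

\[
S_{t+1}^{*} \;=\; \arg\max_{S_{t+1}} \; \frac{I\!\left(S_{t+1};\, A'_t \mid \pi_t\right)}{H\!\left(S_{t+1}\right)}
\]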

4. InfoNCE-Guided Retrieval
Now that the system knows what is missing (the residual \(A'_t\)), it needs to find a page that fills that gap. Scanning every page in a large document with a heavy MLLM is too slow.
Instead, Doc-React uses a lightweight retrieval mechanism guided by InfoNCE (Information Noise-Contrastive Estimation), a contrastive-learning objective that gives a tractable estimate (a lower bound) of the mutual information between a candidate page and the query/residual pair.
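For context, the standard InfoNCE estimator bounds mutual information from below; applied to this setting it plausibly takes a form like the following, where \(f\) is a lightweight similarity score (e.g., from the retriever's embeddings) and \(\mathcal{N}\) is a pool of candidate pages (notation illustrative):

\[
I\!\left(S_{t+1};\, (Q, A'_t)\right) \;\gtrsim\; \log \frac{\exp f\!\left(S_{t+1},\, (Q, A'_t)\right)}{\tfrac{1}{|\mathcal{N}|}\sum_{S_j \in \mathcal{N}} \exp f\!\left(S_j,\, (Q, A'_t)\right)}
\]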

This equation allows the system to score candidate pages efficiently. It looks for pages (\(S_{t+1}\)) that are highly correlated with the current query and the missing information (\(A'_t\)).
5. The Algorithm in Action
Putting it all together, the Doc-React algorithm follows this loop (a code sketch follows the list):
- Initialize: Start with an empty set of pages.
- Formulate Sub-query: The MLLM analyzes the original query and any currently held pages. It generates a “residual”—a description of what is missing.
- Candidate Evaluation: The system scans the document pool. It scores pages based on how well they address the residual (using the InfoNCE score) and normalizes by the page’s complexity.
- Select & Update: The best page (\(S^*_{t+1}\)) is added to the context.

- Repeat: This loop continues until the MLLM determines it has sufficient information to answer the user’s question or reaches a maximum step count.
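Below is a minimal, illustrative sketch of this loop in Python. The helpers `information_differentiation` (from the sketch above), `score_candidates`, and `answer` stand in for the components described in the previous sections; none of this is the authors' code.

```python
def doc_react(query: str, pages: list, call_mllm, retriever, max_steps: int = 5) -> str:
    """Iteratively retrieve pages until the MLLM judges it can answer."""
    selected = []  # pi_t: pages retrieved so far
    for _ in range(max_steps):
        # 1) Information Differentiation: ask the MLLM what is still missing.
        residual = information_differentiation(query, selected, call_mllm)
        if residual.strip() == "SUFFICIENT":
            break
        # 2) Candidate Evaluation: score the remaining pages against the
        #    (query, residual) pair, normalized by a page-complexity proxy.
        #    `score_candidates` is an assumed helper built on the retriever.
        candidates = [p for p in pages if p not in selected]
        if not candidates:
            break
        scores = score_candidates(candidates, query, residual, retriever)
        # 3) Select & Update: add the best-scoring page to the context.
        best_idx = max(range(len(candidates)), key=lambda i: scores[i])
        selected.append(candidates[best_idx])
    # Final answer generation over the selected pages only.
    return answer(query, selected, call_mllm)
```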
This iterative process transforms the “needle in a haystack” problem into a structured treasure hunt.
Experiments and Results
The researchers evaluated Doc-React on two challenging benchmarks:
- SlideVQA: A dataset of presentation slides requiring visual reasoning (charts, diagrams).
- MMLongBench-Doc: A benchmark for long-context document understanding.
Comparison with Multimodal RAG
First, they compared Doc-React against state-of-the-art Multimodal RAG systems, specifically VisRAG and ColPali. ColPali is a powerful retrieval model that uses vision-language models to index documents visually (treating pages as screenshots).

As shown in Table 1, Doc-React significantly outperforms the baselines. On MMLongBench, Doc-React achieves an F1 score of 38.07, compared to 32.17 for ColPali and 29.02 for VisRAG. This highlights that simply having a good retriever (like ColPali) isn’t enough; the iterative reasoning provided by Doc-React is crucial for connecting disjointed pieces of information.
Comparison with Multi-Image MLLMs
Next, they compared Doc-React against “brute force” methods where multiple images are simply fed into an MLLM context window (Standard) or used with Chain-of-Thought prompting (CoT).

Table 2 reveals an interesting trend. While GPT-4o performs decently on its own (Standard), Doc-React still provides a performance boost, particularly on SlideVQA (54.87 vs 53.58). More importantly, Doc-React is more scalable. Feeding all pages into GPT-4o is expensive and context-heavy. Doc-React selectively finds only the necessary pages.
Does the Retriever Matter?
One might wonder if Doc-React’s success is just because it uses ColPali as its underlying retrieval engine. To test this, the authors ran an ablation study.

Table 3 shows that even without ColPali (using a weaker retriever), Doc-React achieves an F1 of 37.22 on MMLongBench, which is still significantly higher than the standalone ColPali baseline (32.17). This confirms that the framework—the logic of iterative differentiation and update—is the primary driver of performance, not just the strength of the visual encoder.
Conclusion
Doc-React represents a significant step forward in document intelligence. It moves away from the “one-shot” retrieval paradigm toward an agentic, iterative approach. By mathematically modeling the question-answering process as an information maximization problem, Doc-React can handle the messy, heterogeneous layouts of real-world documents.
For students and practitioners, the key takeaways are:
- Visual Context Matters: Text-only search fails on charts and figures.
- Iterative is Better: Hard questions often require intermediate steps (finding “X” to find “Y”).
- Feedback Loops: Using an LLM to judge “what is missing” (the residual) is a powerful technique for guiding retrieval.
As multimodal models continue to evolve, frameworks like Doc-React will likely become the standard for processing complex reports, scientific papers, and financial audits where accuracy depends on connecting the dots across pages.