Introduction

Imagine you are a financial analyst tasked with answering a specific question based on a 100-page annual report. The report isn’t just text; it is a chaotic mix of paragraphs, bar charts, scatter plots, and infographics spread across different pages. To answer the question, you can’t just find a single sentence. You might need to look at a chart on page 5 to identify a specific region, and then use that region to find a corresponding revenue figure in a table on page 12.

This is the challenge of Multi-page Heterogeneous Document Question-Answering (Doc-QA).

Current AI solutions, specifically Retrieval-Augmented Generation (RAG) and Multimodal Large Language Models (MLLMs), struggle here. Standard RAG retrieves documents based on semantic similarity, often missing visual context. MLLMs have context window limits and can hallucinate when overloaded with too many irrelevant images.

In this post, we explore Doc-React, a novel framework proposed by researchers from UC San Diego and Adobe. Doc-React treats document QA not as a static retrieval task, but as an iterative, agentic process. By mathematically balancing information gain against uncertainty, Doc-React progressively “reasons” its way through complex documents to find precise answers.

The Problem: Why Single-Step Retrieval Fails

In traditional RAG, the system takes a user query, searches a database for the top-k most similar chunks (text or images), and feeds them to an LLM. This works for simple questions like “What is the company’s mission?”
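
To make the contrast concrete, here is a minimal sketch of that single-shot retrieval step in Python. The embedding function is a stand-in for whatever text or image encoder the system uses, not any particular library's API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; in practice a text/image encoder would go here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(384)
    return vec / np.linalg.norm(vec)

def top_k_retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Single-shot RAG: rank every chunk once by similarity to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: float(q @ embed(c)), reverse=True)
    return ranked[:k]

# The top-k chunks are pasted into the LLM prompt as context; the system gets
# exactly one chance to surface the right evidence.
```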

However, complex documents require multi-hop reasoning. Consider the example below. The user asks about “active social network users” in a region with “252M mobile broadband subscriptions.”

Figure 1: Doc-React applied to the multi-page document QA task. The framework processes a user query as input and operates on multi-page documents. It iteratively refines information retrieval and query formulation to maximize information gain and reduce uncertainty, ultimately generating an accurate and contextually relevant answer.

As shown in Figure 1, a standard retriever might fail because the query doesn’t explicitly mention “North America.” It only mentions the subscription count. A standard system cannot bridge the gap between the map (which links 252M to North America) and the bar chart (which lists North America’s social users).

We need a system that can look at the map, realize “Ah, the region is North America,” and then generate a new search query for “North America social network users.” This is the core intuition behind Doc-React.

The Doc-React Method

Doc-React is an adaptive, iterative framework. Instead of trying to guess the answer in one shot, it breaks the process down, using an MLLM as both judge and generator to dynamically refine its search strategy.

The framework is grounded in information theory. The goal is to extract a subset of document pages (\(S\)) that contains enough information to derive the correct answer (\(A\)), while minimizing the model's uncertainty (entropy) about that answer.

1. The Mathematical Objective

The researchers formulate the task as an optimization problem. We want to select a sequence of document pages (\(S_i\)) such that we minimize the entropy (uncertainty) of the generated answer, provided that the mutual information between our selected pages and the answer is sufficiently high.

Equation 1: Minimizing entropy while ensuring sufficient mutual information.
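
Written out from that description (our notation; the paper's exact symbols may differ), the objective has roughly this shape:

\[
\min_{S \subseteq D} \; H\!\left(A \mid S, Q\right)
\quad \text{subject to} \quad I\!\left(S; A \mid Q\right) \ge \tau
\]

where \(D\) is the full document, \(Q\) the user query, and \(\tau\) a sufficiency threshold.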

In simpler terms: Find the smallest set of pages that makes the model certain about the correct answer.

2. Normalized Information Gain

Since we cannot know the perfect answer (\(A\)) beforehand, the system acts greedily. At each step \(t\), it tries to select the next set of pages (\(S_{t+1}\)) that provides the most new information relative to what it already knows.

The authors propose maximizing Normalized Information Gain:

Equation 2: The formula for Normalized Information Gain.
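
A plausible written-out form, matching the reading that follows (again, our notation), is:

\[
\mathrm{NIG}\!\left(S_{t+1}\right) \;=\; \frac{I\!\left(S_{t+1};\, A \mid \pi_t\right)}{H\!\left(S_{t+1}\right)}
\]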

Here is how to read this equation:

  • Numerator (Information Gain): How much does the new page \(S_{t+1}\) tell us about the Answer (\(A\)), given what we already learned from previous pages (\(\pi_t\))?
  • Denominator (Entropy): How “confusing” or complex is the new page on its own? Dividing by this entropy keeps the system from favoring pages that are dense or noisy but contribute little relevant information.

3. Information Differentiation (The “Residual”)

How does the model know what information is “new”? It calculates the information residual (\(A'_t\)). This represents the “gap” between the complete answer and what the model currently knows.

Equation 3: Defining the information residual.
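
One way to formalize the gap, consistent with the prose here (a reconstruction rather than the paper's exact definition), is to require that the residual carry precisely the uncertainty about the answer that remains after seeing the pages gathered so far:

\[
H\!\left(A'_t\right) \;=\; H\!\left(A \mid \pi_t, Q\right)
\]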

In practice, we can’t calculate abstract residuals. Instead, Doc-React uses an LLM to explicitly state what is missing. This is called Information Differentiation. The framework prompts the LLM with the current retrieved pages and the query, asking it to identify what information is still needed.

Equation 5: Approximating the residual using the MLLM.
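
Schematically, with \(f_{\mathrm{MLLM}}\) standing in for the prompted model (a sketch of the approximation, not the paper's exact formulation):

\[
\hat{A}'_t \;\approx\; f_{\mathrm{MLLM}}\!\left(Q, \pi_t\right)
\]

where the output is a textual description of the information that is still missing.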

By substituting this residual back into our gain equation, the objective becomes maximizing the information regarding this missing piece (\(A'_t\)):

Equation 4: The practical objective function using the residual.
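
In our notation, the practical objective then reads roughly as:

\[
\max_{S_{t+1}} \; \frac{I\!\left(S_{t+1};\, \hat{A}'_t \mid \pi_t\right)}{H\!\left(S_{t+1}\right)}
\]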

4. InfoNCE-Guided Retrieval

Now that the system knows what is missing (the residual \(A'_t\)), it needs to find a page that fills that gap. Scanning every page in a large document with a heavy MLLM is too slow.

Instead, Doc-React uses a lightweight retrieval mechanism guided by InfoNCE (Information Noise-Contrastive Estimation), a technique from contrastive learning that provides a tractable lower bound on the mutual information between a candidate page and the query/residual pair.

Equation 6: The InfoNCE lower bound for mutual information estimation.
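
The general InfoNCE lower bound from contrastive learning has the following shape; instantiating it with a candidate page scored against the query/residual pair is our reading of the setup, and the similarity function \(\mathrm{sim}\) is an assumption:

\[
I\!\left(S_{t+1};\, Q, A'_t\right) \;\ge\; \log N \;+\; \mathbb{E}\!\left[\log \frac{e^{\,\mathrm{sim}\left(S_{t+1},\; Q,\, A'_t\right)}}{\sum_{j=1}^{N} e^{\,\mathrm{sim}\left(S_{j},\; Q,\, A'_t\right)}}\right]
\]

where \(N\) is the number of candidate pages in the contrastive batch.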

This equation allows the system to score candidate pages efficiently. It looks for pages (\(S_{t+1}\)) that are highly correlated with the current query and the missing information (\(A'_t\)).
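
A minimal sketch of this scoring step, assuming precomputed page embeddings from a lightweight retriever; the softmax weights play the role of the InfoNCE critic's relative scores, and `page_complexity` is a stand-in for the entropy normalization rather than the paper's actual estimator:

```python
import numpy as np

def contrastive_scores(page_vecs: np.ndarray, qr_vec: np.ndarray) -> np.ndarray:
    """Score every candidate page against the (query + residual) embedding,
    normalizing over all candidates as in an InfoNCE-style critic."""
    logits = page_vecs @ qr_vec        # one similarity per candidate page
    logits -= logits.max()             # numerical stability
    return np.exp(logits) / np.exp(logits).sum()

def select_page(page_vecs: np.ndarray, qr_vec: np.ndarray,
                page_complexity: np.ndarray) -> int:
    """Pick the page with the highest score per unit of page complexity."""
    scores = contrastive_scores(page_vecs, qr_vec)
    return int(np.argmax(scores / page_complexity))
```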

5. The Algorithm in Action

Putting it all together, the Doc-React algorithm follows this loop:

  1. Initialize: Start with an empty set of pages.
  2. Formulate Sub-query: The MLLM analyzes the original query and any currently held pages. It generates a “residual”—a description of what is missing.
  3. Candidate Evaluation: The system scans the document pool. It scores pages based on how well they address the residual (using the InfoNCE score) and normalizes by the page’s complexity.
  4. Select & Update: The best page (\(S^*_{t+1}\)) is added to the context (this step is written out after the list).

Equation 7: Selecting the best subset and updating the policy.

  5. Repeat: This loop continues until the MLLM determines it has sufficient information to answer the user’s question or reaches a maximum step count.
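
Step 4, captured by Equation 7, then amounts to a greedy selection followed by a context update; in our notation:

\[
S_{t+1}^{*} \;=\; \arg\max_{S_{t+1}} \; \frac{\widehat{I}\!\left(S_{t+1};\, Q,\, \hat{A}'_t\right)}{H\!\left(S_{t+1}\right)},
\qquad
\pi_{t+1} \;=\; \pi_t \cup \left\{S_{t+1}^{*}\right\}
\]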

This iterative process transforms the “needle in a haystack” problem into a structured treasure hunt.
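
Putting the pieces together in code, a compact sketch of the loop might look like this. Everything here is illustrative: `mllm_residual`, `mllm_answer`, and `page_complexity` are hypothetical helpers standing in for the MLLM prompts and the complexity estimate, and `embed` and `contrastive_scores` are the placeholder functions from the earlier sketches.

```python
import numpy as np

def doc_react(query: str, pages: list[str], max_steps: int = 5) -> str:
    """Iterative retrieve-and-reason loop sketched from the steps above."""
    context: list[str] = []                        # pi_t: pages gathered so far
    page_vecs = np.stack([embed(p) for p in pages])
    complexity = page_complexity(pages)            # proxy for per-page entropy

    for _ in range(max_steps):
        # Information differentiation: ask the MLLM what is still missing.
        residual = mllm_residual(query, context)
        if residual is None:                       # model says it has enough
            break

        # InfoNCE-guided scoring of every candidate page against query + residual.
        qr_vec = embed(query + " " + residual)
        scores = contrastive_scores(page_vecs, qr_vec)

        # Normalize by page complexity and add the best unused page to the context.
        gain = scores / complexity
        for idx in np.argsort(-gain):
            if pages[idx] not in context:
                context.append(pages[idx])         # pi_{t+1} = pi_t U {S*_{t+1}}
                break

    # Final answer conditioned only on the selected pages.
    return mllm_answer(query, context)
```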

Experiments and Results

The researchers evaluated Doc-React on two challenging benchmarks:

  1. SlideVQA: A dataset of presentation slides requiring visual reasoning (charts, diagrams).
  2. MMLongBench-Doc: A benchmark for long-context document understanding.

Comparison with Multimodal RAG

First, they compared Doc-React against state-of-the-art Multimodal RAG systems, specifically VisRAG and ColPali. ColPali is a powerful retrieval model that uses vision-language models to index documents visually (treating pages as screenshots).

Table 1: Comparison with multimodal retrieval-augmented generation baselines.

As shown in Table 1, Doc-React significantly outperforms the baselines. On MMLongBench, Doc-React achieves an F1 score of 38.07, compared to 32.17 for ColPali and 29.02 for VisRAG. This highlights that simply having a good retriever (like ColPali) isn’t enough; the iterative reasoning provided by Doc-React is crucial for connecting disjointed pieces of information.

Comparison with Multi-Image MLLMs

Next, they compared Doc-React against “brute force” methods where multiple images are simply fed into an MLLM context window (Standard) or used with Chain-of-Thought prompting (CoT).

Table 2: Comparisons with multi-image multimodal LLM baselines.

Table 2 reveals an interesting trend. While GPT-4o performs decently on its own (Standard), Doc-React still provides a performance boost, particularly on SlideVQA (54.87 vs 53.58). More importantly, Doc-React is more scalable. Feeding all pages into GPT-4o is expensive and context-heavy. Doc-React selectively finds only the necessary pages.

Does the Retriever Matter?

One might wonder if Doc-React’s success is just because it uses ColPali as its underlying retrieval engine. To test this, the authors ran an ablation study.

Table 3: Ablation study comparing Doc-React with and without ColPali retrieval.

Table 3 shows that even without ColPali (using a weaker retriever), Doc-React achieves an F1 of 37.22 on MMLongBench, which is still significantly higher than the standalone ColPali baseline (32.17). This confirms that the framework—the logic of iterative differentiation and update—is the primary driver of performance, not just the strength of the visual encoder.

Conclusion

Doc-React represents a significant step forward in document intelligence. It moves away from the “one-shot” retrieval paradigm toward an agentic, iterative approach. By mathematically modeling the question-answering process as an information maximization problem, Doc-React can handle the messy, heterogeneous layouts of real-world documents.

For students and practitioners, the key takeaways are:

  1. Visual Context Matters: Text-only search fails on charts and figures.
  2. Iterative is Better: Hard questions often require intermediate steps (finding “X” to find “Y”).
  3. Feedback Loops: Using an LLM to judge “what is missing” (the residual) is a powerful technique for guiding retrieval.

As multimodal models continue to evolve, frameworks like Doc-React will likely become the standard for processing complex reports, scientific papers, and financial audits where accuracy depends on connecting the dots across pages.