Introduction: The Cost of Seeing Clearly
In the rapidly evolving world of Artificial Intelligence, Multimodal Large Language Models (MLLMs)—models that can see and talk—have become the new frontier. Systems like GPT-4V have demonstrated incredible capabilities, describing complex scenes and answering questions about images. However, a significant bottleneck remains: efficiency.
For a model to understand text inside an image (like reading a receipt or analyzing a chart), it typically needs high-resolution inputs. High resolution means dividing the image into thousands of small patches (tokens). For a standard Transformer architecture, more tokens result in quadratically higher computational costs. This creates a barrier for real-world applications where latency and memory are limited.
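To get a feel for the numbers: self-attention compares every token with every other token, so its cost grows with the square of the token count. A quick back-of-envelope calculation, using the roughly 576 image tokens of a standard-resolution LLaVA-1.5 input versus the ~2,880 tokens cited below for LLaVA-NeXT, illustrates the gap:

```python
# Back-of-envelope: self-attention cost grows with the square of the token count.
# 576 is the usual image-token count for standard-resolution LLaVA-1.5;
# ~2,880 is the high-resolution LLaVA-NeXT figure cited later in this post.

def attention_cost(num_tokens: int) -> int:
    """Number of pairwise attention scores per layer (proportional to FLOPs)."""
    return num_tokens * num_tokens

low_res, high_res = 576, 2880
print(attention_cost(high_res) / attention_cost(low_res))  # 25.0 -> ~25x more work per layer
```

Five times as many tokens means roughly twenty-five times as many attention scores per layer, before even counting the larger key/value caches during generation.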
Do we really need massive, resource-hungry models to read text in images? Or can we design a smarter, more compact architecture?
This is the question addressed in the paper “On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding.” The researchers introduce ELVA, a model designed to challenge the trend of simply scaling up. ELVA achieves state-of-the-art performance on text-rich visual tasks while maintaining low inference costs, effectively democratizing access to powerful visual assistants.
The Background: LLaVA and the Resolution Trap
To understand ELVA, we first need to look at the standard architecture it builds upon: LLaVA (Large Language and Vision Assistant).
LLaVA connects a pre-trained vision encoder (such as CLIP) to a Large Language Model (LLM) through a simple projection layer (a Multi-Layer Perceptron, or MLP). The process is straightforward, and is sketched in code after the list:
- An image is fed into the vision encoder.
- The encoder breaks the image into patches and creates embeddings (vectors).
- The MLP aligns these visual vectors with the LLM’s text space.
- The LLM generates a response based on the visual features and the text prompt.
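Below is a minimal PyTorch sketch of this wiring. The encoder and LLM are stand-in modules, and only the connector logic mirrors the description above (LLaVA-1.5 uses a two-layer MLP projector, which is what the sketch assumes):

```python
import torch
import torch.nn as nn

class LlavaStyleConnector(nn.Module):
    """Minimal sketch of the LLaVA-style bridge: vision encoder -> MLP projector -> LLM.

    `vision_encoder` and `llm` are stand-in modules; only the wiring mirrors
    the architecture described above.
    """

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. a CLIP ViT
        self.projector = nn.Sequential(               # the two-layer MLP projector
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                                # the language model backbone

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patch_feats = self.vision_encoder(image)      # (batch, num_patches, vision_dim)
        visual_tokens = self.projector(patch_feats)   # aligned to the LLM's embedding space
        # Visual tokens are prepended to the text tokens before the LLM runs.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs)
```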

The Problem: To read small text in a document, the image must be high-resolution. In models like LLaVA-NeXT, handling high-resolution images can require up to 2,880 image tokens. This puts a massive load on the LLM, slowing down inference to a crawl and hogging GPU memory.
As shown in the table below, older LLaVA models were fast, but the newer high-resolution iterations (LLaVA-NeXT) show a significant jump in latency.

The researchers behind ELVA realized that simply throwing more tokens at the problem wasn’t sustainable. They needed a way to make the model “read” better without making it heavier.
The ELVA Method: Smarter, Not Just Bigger
The core philosophy of ELVA is optimization over expansion. The authors identify two main reasons why standard open-source models fail at reading text in images:
- Weak Vision Encoders: Standard CLIP models are trained on general natural images, not documents or dense text.
- Lack of Reasoning Supervision: Models often try to guess the answer without explicitly “reading” the text first.
ELVA introduces two novel strategies to fix this: Weight Averaging for the Vision Encoder and Read-and-Reason Prompting.
1. The ELVA-Encoder: The Power of Model Soups
Instead of using a gigantic vision encoder (which slows down the system), ELVA sticks to a smaller, efficient encoder (CLIP-Base) but supercharges it.
The team utilized a technique inspired by “Model Soups.” Here is the recipe:
- Specialized Training: They take the standard vision encoder and fine-tune it specifically on text-reading tasks (OCR datasets) using a small Language Model (1B parameters).
- Multiple Runs: They train 12 different versions of this encoder using different random seeds.
- Weight Averaging: Finally, they average the weights of all these trained encoders into a single, robust encoder.
This process results in a vision module that is highly specialized for reading text but retains the general capabilities of the original model. Crucially, because they average the weights, the inference cost does not increase—the final model is the same size as a single encoder, but significantly smarter.
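Here is a minimal sketch of the uniform weight-averaging step, assuming all fine-tuned copies share the same architecture. This is the generic "model soup" recipe rather than the authors' released code:

```python
import copy
import torch

def average_encoder_weights(encoders):
    """Uniform "model soup": average the parameters of several fine-tuned copies
    of the same vision encoder into one model of identical size.

    `encoders` is a list of nn.Module instances sharing the same architecture,
    e.g. 12 CLIP-Base encoders fine-tuned on OCR data with different seeds.
    """
    souped = copy.deepcopy(encoders[0])
    avg_state = souped.state_dict()
    for key, value in avg_state.items():
        if value.is_floating_point():
            # Stack the corresponding parameter from every encoder and take the mean.
            stacked = torch.stack([enc.state_dict()[key] for enc in encoders], dim=0)
            avg_state[key] = stacked.mean(dim=0)
    souped.load_state_dict(avg_state)
    return souped  # same size as a single encoder, no extra inference cost
```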

As seen in Figure 3, the optimized encoder (configurations C5 and C7) consistently outperforms the baseline across varying model sizes.
2. Read-and-Reason (RR) Prompting
Even with a good eye, a model needs to know how to process information. In standard training, a model is given an image of a menu and asked, “How much is the burger?” Without explicit grounding, it may hallucinate an answer based on the statistical priors of its training data.
ELVA introduces Read-and-Reason (RR) Prompting during the training phase. For text-rich images, the model is explicitly prompted to first transcribe the text it sees and then answer the question.
For example, the training data is structured so the model learns the behavior:
- Prompt: “What is written in this image?”
- Response: “The menu lists a Burger for $10…”
This forces the model to attend to the textual evidence before reasoning. Notably, this extra step is only used during training. By inference time, the model has internalized the behavior and can reason directly without having to explicitly output the “reading” step every time, preserving efficiency.
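For illustration, here is one hypothetical way such a Read-and-Reason sample could be serialized. The field names and exact wording are assumptions rather than the paper's released format; the key idea is that the target response first transcribes the relevant text, then answers:

```python
# Hypothetical serialization of a Read-and-Reason training sample.
# Field names and wording are illustrative assumptions, not the paper's schema.
rr_sample = {
    "image": "menu_0421.png",
    "conversations": [
        {"role": "user",
         "content": "<image>\nHow much is the burger?"},
        {"role": "assistant",
         # Read first (transcribe the evidence), then reason to the answer.
         "content": "The menu reads: 'Burger ... $10, Fries ... $4, Soda ... $2.' "
                    "Based on this, the burger costs $10."},
    ],
}
```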

Figure 4 illustrates that models trained with RR-Prompt (R2) consistently outperform those without it (R1), particularly in text-centric tasks.
An example of what this prompt looks like in the training data can be seen below:

Experiments and Results
The researchers evaluated ELVA across a wide range of benchmarks, including DocVQA (documents), ChartQA (charts), and general multimodal tasks.
The Efficiency “Sweet Spot”
The most compelling result is the balance between performance and cost.

In Figure 1, look at the charts on the left.
- Top Left: ELVA achieves high scores with significantly lower latency (ms/img) compared to LLaVA-NeXT.
- Bottom Left: ELVA uses much less memory.
Comprehensive Benchmarking
When compared head-to-head with other state-of-the-art models, ELVA demonstrates that you don’t need massive parameter counts to achieve top-tier results in document understanding.

In Table 4, the ELVA-7B and ELVA-13B models rival or beat LLaVA-NeXT-13B on text-centric benchmarks (Doc, Chart, Info) while using a fraction of the token count: standard LLaVA-NeXT uses ~2,880 image tokens, whereas ELVA uses at most 637.
Latency Analysis
For real-world users, waiting 4 seconds for a response is often unacceptable. ELVA maintains low latency even as task complexity increases.

Figure 5 shows that while LLaVA-NeXT’s latency spikes dramatically (the green line), ELVA (the red line) stays relatively flat and low, comparable to the much simpler LLaVA-1.5 but with far better accuracy.
Deep Dive: Why Models Fail (and Succeed)
The paper provides a fascinating analysis of hallucinations—when models confidently say something that isn’t true. The authors used a technique called “Logit Lens” to peek inside the model’s layers and see what token it was predicting at each stage of processing.
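Logit Lens is easy to reproduce on any open causal language model: project each intermediate hidden state through the model's final layer norm and output head, and decode the resulting top token. The sketch below uses GPT-2 as a small stand-in rather than the paper's instrumented MLLM:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal logit-lens sketch: project each layer's hidden state through the
# model's output head to see which token it "leans toward" at that depth.
# GPT-2 is a small stand-in here; the paper applies the idea to its MLLM.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The title of the report is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer_idx, hidden in enumerate(out.hidden_states):
    # Apply the final layer norm, then the unembedding matrix (lm_head),
    # to the hidden state at the last position.
    logits = model.lm_head(model.transformer.ln_f(hidden[:, -1]))
    top_token = tok.decode(logits.argmax(dim=-1))
    print(f"layer {layer_idx:02d} -> {top_token!r}")
```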
A Success Case
In a successful prediction, the model identifies the correct text early in its layers. It “sees” the answer.

A Failure Case (Hallucination)
In an ablation study (a stripped-down version of the model without ELVA’s improvements), the model was asked to identify a title. The image did not contain the word “Sweden,” nor did the question. Yet, the model output “Sweden.”
Why? Because the vision encoder failed to extract the text features clearly. Without clear visual evidence, the Language Model fell back on its training data priors—it just guessed a likely word that fits the context of “Country” or “Demographics.”

This highlights exactly why the ELVA-Encoder (for better vision) and RR-Prompting (for grounding text) are so vital. They prevent the LLM from “dreaming” up answers when it can’t see clearly.
New Benchmarks: CORD-Instruct and Parsing-Bench
To further push the field of document understanding, the authors released two new datasets. Existing benchmarks often focus on simple Q&A. However, real-world assistants need to extract structured data (like JSON or Markdown) from documents.
Parsing-Bench and CORD-Instruct require the model to output structured formats, simulating tasks like expense reporting or identity verification.
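As a rough illustration of what such a structured-parsing task looks like (the instruction wording and JSON schema below are assumptions for illustration, not the released CORD-Instruct format):

```python
# Illustrative structured-extraction sample in the spirit of a receipt-parsing task.
# The instruction text and schema are assumptions, not the released dataset format.
instruction = "Parse the receipt in the image into JSON with menu items and totals."
expected_output = {
    "menu": [
        {"name": "Iced Americano", "count": 2, "price": "6.000"},
        {"name": "Cheesecake", "count": 1, "price": "5.500"},
    ],
    "total": {"total_price": "11.500", "cash": "12.000", "change": "0.500"},
}
```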

Figure 11 shows how models are evaluated on Parsing-Bench using an “LLM-as-a-judge” mechanism to verify the accuracy of the extracted structured data.
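A minimal sketch of how such a judge step can be wired up, assuming a hypothetical `call_judge_llm` helper and an illustrative rubric rather than the paper's exact judge prompt:

```python
import json

def judge_parse(prediction: dict, reference: dict, call_judge_llm) -> bool:
    """LLM-as-a-judge check for structured parsing (illustrative sketch).

    `call_judge_llm` is a hypothetical helper that sends a prompt to whatever
    judge model is used and returns its text response.
    """
    prompt = (
        "You are grading a document-parsing model.\n"
        f"Reference JSON:\n{json.dumps(reference, indent=2)}\n\n"
        f"Predicted JSON:\n{json.dumps(prediction, indent=2)}\n\n"
        "Answer 'yes' if the prediction captures the same fields and values "
        "as the reference (ignoring key order and whitespace), else 'no'."
    )
    verdict = call_judge_llm(prompt)
    return verdict.strip().lower().startswith("yes")
```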
Conclusion
The ELVA paper presents a crucial lesson for the AI industry: efficiency is an architectural choice, not just a hardware problem. By intelligently designing the vision encoder through weight averaging and structuring the training data with Read-and-Reason prompting, the authors achieved state-of-the-art performance on visually-situated text understanding.
They managed to break the “resolution curse,” proving that we can build powerful, high-resolution document assistants that are lightweight enough to be deployed in practical, cost-sensitive environments. For students and researchers, ELVA serves as a blueprint for optimizing Multimodal LLMs without simply resorting to “making it bigger.”