Introduction
In the rapid evolution of Artificial Intelligence, we have witnessed a shift from models that simply predict the next word to models that can solve complex logic puzzles. With the release of systems like OpenAI’s o1, text-based Large Language Models (LLMs) have demonstrated “System 2” thinking—the ability to deliberate, reason step-by-step, and self-correct before answering.
However, there is a glaring gap in this progress: Vision.
While Multimodal Large Language Models (MLLMs)—models that can see and talk—have become excellent at describing images (perception), they often struggle when asked to perform complex reasoning about what they see. If you show an AI a chart and ask for a deep economic analysis, or a geometric figure and ask for a multi-step proof, it frequently hallucinates or takes a shortcut to the wrong answer.
Why is this happening? Primarily, we lack high-quality data that teaches models how to reason visually, and we rely on single models to do too much at once.
Enter Insight-V.
In a recent paper, researchers introduced a new framework that significantly advances the state-of-the-art in visual reasoning. By creating a scalable pipeline for generating reasoning data and employing a multi-agent system (separating the “thinker” from the “judge”), they have achieved remarkable performance gains.

As shown in Figure 1, Insight-V significantly outperforms standard Chain-of-Thought (CoT) methods and baseline models across difficult benchmarks like MMMU (expert-level multi-discipline tasks) and ChartQA.
In this deep dive, we will unpack how Insight-V works. We will explore how the researchers automated the creation of reasoning data, why they split the model into two agents, and how they used Reinforcement Learning to fine-tune the system.
The Core Problem: Why is Visual Reasoning So Hard?
To understand the solution, we must first appreciate the problem. Text-based reasoning has benefited immensely from “Chain-of-Thought” (CoT) prompting. If you ask an LLM a math problem, asking it to “think step by step” usually yields a better answer.
Applying this to vision is tricky for two main reasons:
- Data Scarcity: Text-only reasoning data is abundant or easily synthesized. Visual reasoning data—images paired with long, correct, step-by-step logical deductions—is incredibly expensive and slow to collect manually.
- Interference: Visual signals are noisy. When a single model tries to perceive pixels, organize them into objects, and perform abstract logic simultaneously, it often gets confused. It might hallucinate visual details to fit a reasoning path or ignore logic to fit a visual perception.
The researchers behind Insight-V tackled these issues by asking: Can we automate the data collection process? And can we architect the model to separate “thinking” from “answering”?
Part 1: The Data Generation Engine
The first contribution of the Insight-V paper is a robust pipeline to generate training data without human labor. If you want a model to reason, you need to show it millions of examples of good reasoning. Since humans are too slow to write these, the researchers designed a Progressive Long-Chain Reasoning Data Generation method.
Step 1: Progressive Generation
Instead of asking a model to output a whole paragraph of reasoning at once (which often leads to rambling errors), Insight-V uses a step-by-step iterative approach.
For a given image \(I\) and question \(Q\), a “Reasoning Generator” model produces a structured response in JSON format. Crucially, this response includes an Action.
- If the model feels it needs to think more, the action is `continue`.
- If the model feels it has solved the problem, the action is `summary`.
This can be formalized mathematically. At step \(t\), the response \(R_t\) is generated based on the image, the question, and all previous reasoning steps:

\[ R_t = \mathcal{M}\big(I,\; Q,\; \{R_1, \dots, R_{t-1}\},\; A_{t-1}\big) \]

Here, \(\mathcal{M}\) is the Reasoning Generator and \(A_{t-1}\) represents the action from the previous step. This loop continues until the model decides it is time to summarize. This mimics how a human solves a hard problem: we think a bit, write down an intermediate result, look at it, think some more, and finally conclude.
By running this process multiple times (\(N\) times) for the same question, the system generates a variety of potential reasoning paths—some short, some long, some correct, and some wrong.
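To make the loop concrete, here is a minimal sketch of what the progressive generation step might look like. It assumes a hypothetical `generate_step` wrapper around the Reasoning Generator that returns a JSON string with `reasoning` and `action` fields; the field names are illustrative, not the paper's exact schema.

```python
import json

def generate_reasoning_path(model, image, question, max_steps=10):
    """Iteratively query the Reasoning Generator until it decides to summarize.

    `model.generate_step` is a hypothetical wrapper that returns a JSON string
    containing one reasoning step and an action ("continue" or "summary").
    """
    steps = []
    for _ in range(max_steps):
        # Each step conditions on the image, the question, and all prior steps.
        raw = model.generate_step(image=image, question=question, history=steps)
        response = json.loads(raw)
        steps.append(response["reasoning"])
        if response["action"] == "summary":
            break
    return steps  # one candidate reasoning path

# Running this N times with sampling enabled yields diverse paths
# (some correct, some flawed) for the same (image, question) pair.
```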
Step 2: Multi-Granularity Assessment
Generating data is easy; generating good data is hard. Once the pipeline creates thousands of reasoning paths, how do we know which ones are smart and which ones are hallucinations?
Insight-V employs a two-tier filtering system, as illustrated below:

Tier 1: Answer Filtering (The “Sanity Check”). First, a strong Language Model (like Qwen2) compares the model’s generated final answer against the ground truth. If the model got the answer wrong, the reasoning path is likely flawed. These paths are filtered out of the reasoning set and categorized as “Response with Wrong Answer” (which, surprisingly, becomes useful later).
Tier 2: Reasoning Path Scoring (The “Quality Check”). Getting the right answer isn't enough; the model might have guessed correctly using bad logic. The remaining responses are sent to a scoring agent (a strong MLLM like Qwen2-VL). This agent reads the step-by-step reasoning and scores it from 1 to 100 based on:
- Logic: Does step B actually follow from step A?
- Hallucination: Did the model invent visual details that aren’t in the image?
- Completeness: Did it skip important steps?
Only the highest-scoring paths make it into the “Reason Dataset.” This automated curation results in a massive, high-quality dataset of long-chain visual reasoning, completely free of human annotation costs.
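A rough sketch of this two-tier curation step is shown below. The `answer_checker` and `path_scorer` callables are hypothetical stand-ins for the LLM-based answer comparison and the MLLM-based 1–100 path scoring; the threshold value is an assumption for illustration.

```python
def curate_paths(paths, ground_truth, answer_checker, path_scorer, threshold=85):
    """Split candidate reasoning paths into a high-quality reason set and a
    'wrong answer' set (later reused as flawed examples for the Summary Agent).
    """
    reason_dataset, flawed_dataset = [], []
    for path in paths:
        # Tier 1: set aside paths whose final answer does not match the ground truth.
        if not answer_checker(path.final_answer, ground_truth):
            flawed_dataset.append(path)
            continue
        # Tier 2: score surviving paths on logic, hallucination, and completeness.
        score = path_scorer(path.image, path.question, path.steps)
        if score >= threshold:
            reason_dataset.append(path)
    return reason_dataset, flawed_dataset
```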
Part 2: The Multi-Agent Architecture
With high-quality data in hand, the researchers proposed a novel architecture. Most MLLMs are “monolithic”—one model takes an image and question and outputs an answer.
Insight-V argues that decomposition is key. They split the task into two distinct roles handled by two separate agents (derived from the same base model): the Reasoning Agent and the Summary Agent.

The Reasoning Agent (The “Detective”)
This agent is trained specifically on the high-quality “Reason Dataset” created in Part 1. Its sole job is to produce detailed, step-by-step analysis. It does not worry about being concise; it worries about being thorough. It outputs the structured JSON reasoning steps.
The Summary Agent (The “Judge”)
This agent takes the original question, the image, and the long reasoning trace generated by the Reasoning Agent. Its job is to synthesize this information and provide the final answer.
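At inference time, the collaboration can be pictured as a simple two-stage pipeline. The sketch below is only a schematic, assuming generic `generate` methods on each agent; both agents are fine-tuned from the same base MLLM.

```python
def answer_with_insight_v(reasoning_agent, summary_agent, image, question):
    """Two-stage inference: the Reasoning Agent thinks out loud,
    then the Summary Agent reads that trace and commits to an answer.
    """
    # Stage 1: produce a detailed, step-by-step reasoning trace.
    reasoning_trace = reasoning_agent.generate(image=image, question=question)

    # Stage 2: the Summary Agent sees the image, the question, AND the trace.
    # Because it was trained on both good and flawed traces, it can discount
    # steps that conflict with the image instead of blindly copying them.
    final_answer = summary_agent.generate(
        image=image,
        question=question,
        context=reasoning_trace,
    )
    return final_answer
```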
Why is this separation necessary? If a single model tries to reason and answer, errors in the reasoning chain usually lead directly to a wrong answer. However, the Summary Agent in Insight-V is trained to be critical.
The researchers intentionally trained the Summary Agent on a mix of:
- Perfect Reasoning: High-scoring paths leading to correct answers.
- Flawed Reasoning: Paths that scored lower or had incorrect logic.
This training strategy teaches the Summary Agent to recognize when the “Detective” has messed up. If the Reasoning Agent hallucinates, the Summary Agent can identify the inconsistency with the image and ignore the bad reasoning, or correct it in the final summary. This collaboration significantly increases robustness.
Part 3: Iterative DPO Training
The final piece of the puzzle is how to fine-tune these agents to reach peak performance. Supervised Fine-Tuning (SFT)—teaching the model to mimic the training data—is a good start, but it has limits. To push the reasoning capabilities further, the authors utilized Reinforcement Learning, specifically Direct Preference Optimization (DPO).
Understanding DPO
In standard training, we show the model a “good” response and say “copy this.” In DPO, we show the model two responses: a “winner” (\(y_w\)) and a “loser” (\(y_l\)). We then mathematically adjust the model to increase the probability of generating the winner and decrease the probability of generating the loser.
The probability of preferring one output over another is modeled using the sigmoid function \(\sigma\) of the difference in their rewards, derived from the Bradley-Terry model:

\[ p^*(y_1 \succ y_2 \mid x) = \sigma\big(r^*(x, y_1) - r^*(x, y_2)\big) \]

This equation essentially says that the probability of a human (or expert model) preferring response \(y_1\) over \(y_2\) depends on the gap in their “quality scores” (\(r^*\)).
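In practice, DPO turns this preference model into a loss over (winner, loser) pairs. The following PyTorch-style sketch shows the standard DPO objective in its common form, not the paper's exact training code; the inputs are summed log-probabilities of each response under the current policy and a frozen reference model.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on one (winner, loser) pair of responses."""
    # Implicit reward of each response: beta * (log pi_theta - log pi_ref).
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # Maximize the log-sigmoid of the reward gap, i.e. prefer the winner.
    return -F.logsigmoid(reward_w - reward_l).mean()
```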
The “Iterative” Innovation
Standard DPO is often done “offline” using a static dataset. However, as the model learns, its behavior changes. The “loser” responses it generated at the start of training might be too easy to beat later on, providing little learning signal.
Insight-V uses Iterative DPO.
- Generate: The current Reasoning Agent generates new reasoning paths for the training images.
- Evaluate: These new paths are scored/ranked to create new pairs of “winners” and “losers.”
- Train: The model is updated using DPO on this fresh data.
- Repeat: The cycle starts again.
This ensures the model is always training against the “frontier” of its own capabilities, constantly refining its reasoning logic rather than just memorizing static preferences.
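Conceptually, the iterative variant simply wraps that preference step in an outer loop. A hedged sketch under these assumptions: `rank_paths` and `dpo_update` are placeholder functions standing in for the scoring/ranking stage and one DPO training pass, and the number of candidates per question is illustrative.

```python
def iterative_dpo(reasoning_agent, train_set, rank_paths, dpo_update, rounds=3):
    """Repeatedly regenerate, re-rank, and retrain so that preference pairs
    always come from the model's current behavior."""
    for _ in range(rounds):
        pairs = []
        for image, question in train_set:
            # Generate: sample fresh reasoning paths from the *current* agent.
            candidates = [reasoning_agent.sample(image, question) for _ in range(4)]
            # Evaluate: rank them to pick a winner and a loser.
            winner, loser = rank_paths(image, question, candidates)
            pairs.append((image, question, winner, loser))
        # Train: one DPO pass on the freshly collected pairs.
        reasoning_agent = dpo_update(reasoning_agent, pairs)
    return reasoning_agent
```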
Experiments and Key Results
How does Insight-V stack up against the competition? The researchers integrated their system into the popular LLaVA-NeXT architecture and also built their own strong base model using Qwen-2.5-7B.
Quantitative Analysis
The results, presented in Table 1 below, are striking.

Key takeaways from the data:
- Broad Improvement: Insight-V improves performance across all 7 benchmarks.
- The “Hard” Tasks: The biggest gains are seen in tasks requiring deep reasoning. On ChartQA (reading and analyzing charts), Insight-V boosts the baseline LLaVA-NeXT model from 69.4% to 77.4%.
- Expert Knowledge: On MMMU, a massive multi-discipline benchmark (science, engineering, culture), the score jumps from 36.9% to 42.0%.
- Impact of DPO: The table shows a clear progression. `+ Multi-Agent` improves the score, and `+ Iterative DPO` pushes it even higher (the “Insight-V-LLaVA” row), proving the value of the reinforcement learning stage.
Does Reasoning Hurt Perception?
A common fear in AI research is “catastrophic forgetting.” By teaching the model to think hard about math, did we make it forget how to read text (OCR) or identify a cat?

Table 2 puts this fear to rest. On benchmarks like TextVQA and OCRBench, Insight-V actually improves performance slightly. By learning to reason, the model likely becomes better at attending to specific visual regions, which aids in perception tasks.
The Importance of Data Scaling
One of the most interesting findings in the paper is the relationship between the amount of reasoning data and model performance.

Figure 4 shows a clear trend: More data = Smarter Agent. At 50k samples, the model barely outperforms the baseline. But as the dataset grows to 200k samples, the performance curve shoots upward. This validates the importance of the automated data generation pipeline; since the pipeline is scalable, the model can continue to improve simply by running the generator longer.
Qualitative Case Study
Numbers are great, but seeing the model in action is better. Let’s look at a complex economic problem involving marginal product and revenue.

In Figure 5, we see a comparison between:
- Direct SFT (Vanilla): A standard model fine-tuned on the data.
- Reasoning Agent (Insight-V): The multi-agent system.
The Scenario: The question asks to identify the incorrect statement about a table of economic data.
The Vanilla Failure: The standard model tries to reason, but it gets confused. It correctly identifies some information but then claims “Option (C) does not match” without proper calculation, and then surprisingly selects Option (D) as the final answer, which contradicts its own previous sentence. It’s a classic case of an LLM getting “lost in the weeds.”
The Insight-V Success: The Reasoning Agent breaks it down methodically:
- Identify Key Info: Locates the table columns.
- Analyze Data: Calculates marginal products (18 - 13 = 5).
- Verify Option A: Correct.
- Verify Option B: Correct.
- Verify Option C: It explicitly calculates \(7 \times \$20 = \$140\), compares it to the table, finds the mismatch, and flags it.
- Conclusion: The Summary Agent reviews this logical chain, agrees with the math, and correctly identifies (C) as the answer.
This structured, verifying approach is what separates Insight-V from standard vision models.
Assessing the Architecture: Ablation Studies
Are both agents really necessary? Could we just train one super-smart agent? The researchers ran ablation studies to find out.

Table 3 compares different configurations:
- Summary Agent Only: Just answering without generating reasoning first. (Performance drops significantly, e.g., ChartQA drops from 81.2 to 76.3).
- Vanilla - Direct SFT: One model doing both CoT and answering. (Better than nothing, but worse than the split system).
- Multi-Agent: The full Insight-V setup. This achieves the highest average score (62.1).
This confirms that the cognitive load of generating long reasoning and extracting the final answer is best handled by two specialized components.
Finally, looking at the DPO strategy:

Table 4 shows that while standard DPO helps, Iterative DPO is superior. By constantly refreshing the training pairs, the model improves its average score from 62.7 to 63.3. It’s a small but consistent gain that pushes the model toward state-of-the-art.
Conclusion and Future Implications
Insight-V represents a significant step forward in making Multimodal LLMs true reasoning engines rather than just sophisticated image captioners.
The paper makes three critical contributions to the field:
- Scalability: A pipeline that creates infinite high-quality reasoning data without human cost.
- Architecture: A multi-agent design that decouples “thinking” (Reasoning Agent) from “deciding” (Summary Agent), creating robustness against hallucinations.
- Alignment: An iterative Reinforcement Learning strategy that continuously sharpens the model’s logic.
What does this mean for the future? This research suggests that the “System 2” reasoning capabilities we see in text models like OpenAI’s o1 are achievable in vision models. We are moving toward AI that can look at a complex schematic, a messy spreadsheet, or a scientific diagram, and think through the problem with the patience and logic of a human expert.
While the current system relies on large, separate models for scoring and summarization (which can be computationally expensive), the principles established here—data synthesis, agent decomposition, and iterative alignment—will likely form the blueprint for the next generation of visual intelligence.
Note: This blog post is based on the paper “Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models” by Dong et al. For full implementation details and code, please refer to the original paper and GitHub repository.