Introduction

We are currently living in the golden age of Large Multimodal Models (LMMs). Models like GPT-4V and Claude-3 have demonstrated astonishing capabilities: they can describe a complex photograph of a busy street, explain a meme, or identify the breed of a dog from a blurry picture. To the casual observer, it seems like the problem of “computer vision” is largely solved.

However, a peculiar paradox has emerged. While these models can interpret complex natural scenes, they often stumble over tasks that a human child would find trivial. Ask a state-of-the-art model to read the time from a simple analog clock, navigate a 2D floor plan, or interpret the flow of logic in a basic chart, and you might witness a surprising failure.

Why does this happen? The answer lies in the difference between natural images and abstract images. Natural images contain semantic richness (objects, textures, faces) that models have seen billions of times during training. Abstract images—charts, maps, layouts, and dashboards—rely on rigorous geometric logic and spatial reasoning. They are composed of lines, symbols, and precise relationships where “close enough” is often wrong.

In this post, we will take a deep dive into a fascinating paper titled “Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model.” The researchers identify this critical gap in AI capability and propose a novel, code-centric solution to fix it.

Figure 1: Benchmarking leading LMMs on abstract image understanding.

As shown in Figure 1, the researchers highlight the stark contrast between human performance and AI performance across various abstract tasks. While humans breeze through road maps and visual puzzles, models—even the best ones—struggle to keep up.

The Problem: The Abstract Image Gap

To understand why this research is necessary, we must first define the scope of the problem. Current LMMs are primarily trained on massive datasets of image-text pairs scraped from the internet (like LAION). These datasets are dominated by natural photography. Consequently, models become excellent at pattern matching features in natural scenes but fail to learn the underlying rules of visual reasoning.

The researchers categorized these “blind spots” into eight specific daily scenarios:

  1. Charts: Understanding data visualization.
  2. Tables: Extracting structured text data.
  3. Road Maps: Planning routes and spatial navigation.
  4. Dashboards: Reading precise instruments like clocks and speedometers.
  5. Flowcharts: Understanding algorithmic processes.
  6. Relation Graphs: Parsing hierarchical or network structures.
  7. Floor Plans: Spatial reasoning within 2D layouts.
  8. Visual Puzzles: Pattern induction and logic.

Figure 2: Examples of abstract reasoning tasks where models fail.

Figure 2 illustrates these specific failures. Notice the “Dashboard” example. A human looks at the clock and immediately sees “10:10.” An AI might hallucinate a completely different time because it doesn’t actually “measure” the angle of the hands; it guesses based on visual similarity to training data. Similarly, in the “Road Map” example, finding a path from point A to point B requires a step-by-step algorithmic approach, not just a visual description.

The challenge in fixing this is data scarcity. Collecting millions of high-quality, annotated abstract images (like a map with a specific, valid path drawn on it and a textual explanation of that path) is incredibly labor-intensive. You cannot simply scrape this from the web because the reasoning (the “why”) is rarely explicitly written down next to the image.

The Solution: Multimodal Self-Instruct

The core contribution of this paper is a methodology called Multimodal Self-Instruct. The authors realized that they didn’t need to manually collect this data. Instead, they could leverage the reasoning and coding capabilities of existing Large Language Models (LLMs) to synthesize it.

The intuition is brilliant: LLMs may be bad at seeing abstract images, but they are excellent at writing code to create them.

The Pipeline

The synthesis pipeline operates in three distinct stages. The entire process is automated, requiring only an LLM (like GPT-4) and a code execution environment.

The Multimodal Self-Instruct pipeline.

Step 1: Visual Idea Proposal

The process begins with the LLM proposing a visual idea. It doesn’t just say “draw a chart”; it creates a specific, context-rich scenario. For example, it might propose “Create a step-by-step flowchart demonstrating how to register for an academic conference” or “Design a road map for a delivery route in a city center.”

This step ensures diversity. By prompting the LLM to cover various topics (economics, daily life, science), the resulting dataset covers a wide distribution of semantic contexts.
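
To make Step 1 concrete, here is a minimal sketch of what an idea-proposal call could look like. The prompt wording, the topic list, and the use of the OpenAI Python client are illustrative assumptions, not the paper’s actual prompts.

```python
# Hypothetical sketch of Step 1: asking an LLM to propose visual ideas.
# Prompt text and topic list are illustrative, not the paper's exact prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TOPICS = ["daily life", "economics", "science", "logistics"]

def propose_visual_idea(scenario: str, topic: str) -> str:
    """Ask the LLM for one concrete, context-rich visual idea."""
    prompt = (
        f"Propose one specific {scenario} about {topic}. "
        "Describe the scene and the underlying data in 2-3 sentences, "
        "concretely enough that it could be drawn with Python code."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# e.g. propose_visual_idea("flowchart", "daily life")
```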

Step 2: Code-Centric Image Synthesis

This is the most critical technical innovation. In many previous attempts to generate synthetic image data, researchers used text-to-image diffusion models (like DALL-E or Stable Diffusion). While those models are artistic, they are imprecise. They struggle with text rendering (producing gibberish) and precise spatial relationships (drawing a line graph that doesn’t actually match the numbers).

Instead of diffusion models, this pipeline uses code. The LLM generates Python code (using libraries like Matplotlib, Graphviz, or Pyecharts) to render the image; a minimal sketch of the idea follows the list below.

  • Why Code? Code is deterministic. If the LLM sets a variable time='8:10', the code will render the clock hands exactly at 8:10. There is no ambiguity. The “ground truth” is hard-coded into the generation process.
  • Simulated Data: For things like charts or maps, the LLM first generates the underlying data (e.g., specific percentages for a pie chart) and then writes the code to visualize it.
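
To see why code gives exact ground truth, consider a minimal Matplotlib sketch that renders an analog clock for a given time. This is a toy illustration of the idea rather than the paper’s actual generator; the angle formulas are just standard clock-face geometry.

```python
# Minimal sketch: rendering an analog clock whose ground truth is known by construction.
# Toy illustration of the code-centric idea, not the paper's actual generator.
import math
import matplotlib.pyplot as plt

def draw_clock(time_str: str, path: str = "clock.png") -> None:
    hour, minute = map(int, time_str.split(":"))
    fig, ax = plt.subplots(figsize=(3, 3))
    ax.add_patch(plt.Circle((0, 0), 1.0, fill=False, linewidth=2))  # dial

    # Hour tick marks every 30 degrees.
    for k in range(12):
        a = math.radians(90 - 30 * k)
        ax.plot([0.9 * math.cos(a), math.cos(a)], [0.9 * math.sin(a), math.sin(a)], "k-")

    # Standard clock geometry: minute hand moves 6 deg/min, hour hand 30 deg/hour + 0.5 deg/min.
    m_ang = math.radians(90 - 6 * minute)
    h_ang = math.radians(90 - (30 * (hour % 12) + 0.5 * minute))
    ax.plot([0, 0.85 * math.cos(m_ang)], [0, 0.85 * math.sin(m_ang)], linewidth=2)  # minute hand
    ax.plot([0, 0.55 * math.cos(h_ang)], [0, 0.55 * math.sin(h_ang)], linewidth=4)  # hour hand

    ax.set_xlim(-1.1, 1.1)
    ax.set_ylim(-1.1, 1.1)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(path, dpi=150)
    plt.close(fig)

draw_clock("8:10")  # the label "8:10" is correct by construction
```

Whatever time string is passed in becomes the ground-truth label, with no human verification needed.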

Step 3: Visual Instruction Construction

Once the image is rendered via code, we have the image and the metadata used to create it. The LLM is then fed this context (the idea, the data, and the code) and asked to generate Question-Answer (Q&A) pairs.

Because the LLM has access to the source code that drew the image, it doesn’t need to look at the image to know the answer. It knows the answer by definition.

  • Question Generation: The model creates diverse questions, ranging from simple perception (“What color is the starting point?”) to complex reasoning (“If I travel from A to B, which intersection do I pass?”).
  • Rationale Generation: Crucially, the model also generates a “rationale”: a chain-of-thought explanation of why the answer is correct. (A minimal sketch of this construction step follows the list.)
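
As a rough illustration of Step 3, the construction might look like the sketch below, where the idea, the rendering code, and the underlying data are fed back to the LLM to produce question-answer-rationale triples. The prompt wording and the output schema are assumptions for illustration only.

```python
# Hypothetical sketch of Step 3: turning the generation code + metadata into Q&A pairs.
# Prompt wording and output schema are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

def build_instructions(idea: str, render_code: str, metadata: dict) -> list[dict]:
    """Ask the LLM for question/answer/rationale triples grounded in the source code."""
    prompt = (
        "You are given the idea, the Python code, and the data used to render an image.\n"
        f"Idea: {idea}\n"
        f"Data: {json.dumps(metadata)}\n"
        f"Code:\n{render_code}\n\n"
        "Write 3 questions about the image, each with the exact answer and a short "
        "step-by-step rationale. Reply as a JSON list of "
        '{"question": ..., "answer": ..., "rationale": ...} objects.'
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # A production pipeline would validate/repair the JSON; this sketch assumes it parses.
    return json.loads(resp.choices[0].message.content)
```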

By the end of this pipeline, the researchers created a massive, high-quality dataset without a single human annotator drawing a line.

The Benchmark

Using this strategy, the authors constructed a benchmark of 11,193 instructions covering the eight scenarios mentioned earlier. Let’s look at a few examples of what this synthetic data looks like.

Dashboards and Instruments

One of the most surprising weaknesses of LMMs is reading instruments. The benchmark creates synthetic gauges, thermometers, and clocks.

Figure A4: Examples of dashboard tasks.

As seen in Figure A4, the dataset includes questions about rulers, blood pressure monitors, and financial dials. These require the model to perform “visual math”—interpolating values between tick marks—rather than just object recognition.
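
The arithmetic behind such a reading is ordinary linear interpolation; the tiny sketch below (our own illustration, not from the paper) shows what the model effectively has to compute when the needle sits between two labeled ticks.

```python
# Toy illustration of the "visual math" a gauge question requires:
# linearly interpolating a reading between two labeled tick marks.
def read_between_ticks(lower_tick: float, upper_tick: float, fraction: float) -> float:
    """fraction = how far the needle sits between the two ticks (0.0 to 1.0)."""
    return lower_tick + fraction * (upper_tick - lower_tick)

# A needle 40% of the way between the 60 and 80 marks on a speedometer reads 68.
print(read_between_ticks(60, 80, 0.4))  # 68.0
```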

Road Map Navigation

This is perhaps the most demanding task. The researchers used a Rapidly-exploring Random Tree (RRT) algorithm to generate random maps with obstacles and valid paths.

Figure A3: Road map navigation examples.

In Figure A3, you can see the complexity. The model is given a map with a start (red) and end (yellow) point and obstacles (dark areas). The question asks for a path. To solve this, a model must understand grid coordinates, spatial directions (up, down, left, right), and obstacle avoidance. The synthetic pipeline generates the map and the correct path text simultaneously.
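
For intuition about how a map and a valid path can be produced together, here is a compact, generic RRT sketch on a unit square with circular obstacles. It is a simplified illustration of the algorithm family the authors name, not their actual map generator; the obstacle layout, step size, and goal bias are arbitrary choices.

```python
# Simplified RRT sketch: the tree, the obstacles, and the found path are all known
# exactly at generation time, so the "correct answer" comes for free.
import math
import random

OBSTACLES = [((0.5, 0.5), 0.15), ((0.25, 0.75), 0.10)]  # (center, radius), arbitrary layout
STEP, GOAL_TOL = 0.05, 0.05

def collides(p):
    return any(math.dist(p, c) <= r for c, r in OBSTACLES)

def steer(src, dst):
    """Move from src toward dst by at most STEP."""
    d = math.dist(src, dst)
    if d <= STEP:
        return dst
    return (src[0] + STEP * (dst[0] - src[0]) / d,
            src[1] + STEP * (dst[1] - src[1]) / d)

def rrt(start, goal, iters=5000):
    parents = {start: None}
    for _ in range(iters):
        # 10% goal bias, otherwise sample uniformly in the unit square.
        sample = goal if random.random() < 0.1 else (random.random(), random.random())
        nearest = min(parents, key=lambda n: math.dist(n, sample))
        new = steer(nearest, sample)
        if collides(new):
            continue
        parents[new] = nearest
        if math.dist(new, goal) <= GOAL_TOL:
            # Walk back up the tree to recover the path.
            path, node = [], new
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
    return None  # no path found within the iteration budget

random.seed(0)
path = rrt(start=(0.05, 0.05), goal=(0.95, 0.95))
print(f"path with {len(path)} waypoints" if path else "no path")
```

Because the planner itself produces the waypoints, a textual route description can be derived from them directly, with no human in the loop.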

2D Planar Layouts

Spatial reasoning also applies to understanding layouts, such as floor plans or software UI diagrams.

Figure A8: 2D planar layout examples.

Figure A8 showcases questions about architectural layouts (e.g., “Does the smallest bedroom have a washroom?”) and diagrams of rocket components. These require the model to understand containment (what is inside what) and connectivity (what is connected to what).
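
Because these layouts are synthesized from a structured description, containment and connectivity questions can be answered directly from that description. The snippet below is a hypothetical example of such a check; the room names and fields are invented for illustration.

```python
# Hypothetical sketch: a floor plan represented as structured data, so containment
# questions ("does room X contain feature Y?") have answers by construction.
rooms = {
    "bedroom_1": {"area_m2": 16, "contains": ["washroom"]},
    "bedroom_2": {"area_m2": 11, "contains": []},
    "living_room": {"area_m2": 28, "contains": []},
}

def smallest_bedroom_has_washroom(plan: dict) -> bool:
    bedrooms = {k: v for k, v in plan.items() if k.startswith("bedroom")}
    smallest = min(bedrooms, key=lambda k: bedrooms[k]["area_m2"])
    return "washroom" in bedrooms[smallest]["contains"]

print(smallest_bedroom_has_washroom(rooms))  # False: bedroom_2 (11 m^2) has no washroom
```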

Experiments: How Smart are Today’s Models?

The researchers extensively tested current state-of-the-art models against this new benchmark. The lineup included proprietary giants like GPT-4V, Claude-3.5-Sonnet, and Gemini-1.5, as well as open-source models like LLaVA and DeepSeek-VL.

The results, summarized in Table A1, were sobering.

Table A1: Benchmarking results.

Key Takeaways from the Results:

  1. The Human-AI Gap is Huge: Average human performance on these tasks is 82.1%. The best-performing model (Claude-3.5-Sonnet) achieved only 64.74%, with GPT-4o following at roughly 60%.
  2. Simple Tasks are Hard: Look at the “Dashboard” column in the table above. Humans score 85.3%; GPT-4o scores only 54.79%. For a model that can pass professional exams, reading a clock turns out to be surprisingly hard.
  3. Map Navigation is a Disaster: In the “Road Map” task, open-source models like LLaVA-1.5 scored essentially 0%. They simply cannot plan a valid path. Even GPT-4V only scored 23.3%.
  4. Closed vs. Open Source: There is a massive disparity. While closed-source models (GPT-4, Claude) show some reasoning ability, standard open-source models generally fail completely on these abstract reasoning tasks, often performing no better than random guessing.

Closing the Gap: Fine-Tuning with Synthetic Data

The benchmark revealed the problem, but the researchers also wanted to prove their solution. They took a standard open-source model, LLaVA-1.5-7B, and fine-tuned it using their synthetic dataset.

They generated a training set of 62,476 instructions focusing on Charts, Tables, and Road Maps. They then compared this fine-tuned model (dubbed Llava-our-62k) against the original baseline.
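
The fine-tuning itself follows the usual LLaVA visual-instruction-tuning recipe, so the main data-side work is converting the synthetic question-answer-rationale records into the conversation-style JSON that LLaVA training scripts typically consume. The converter below is a plausible sketch under that assumption (field names and file paths are hypothetical), not the authors’ preprocessing code.

```python
# Sketch: converting synthetic (image, question, answer, rationale) records into the
# conversation-style JSON commonly used for LLaVA visual instruction tuning.
# Field names and file paths are hypothetical; check what your training script expects.
import json

def to_llava_records(samples: list[dict]) -> list[dict]:
    records = []
    for i, s in enumerate(samples):
        records.append({
            "id": f"synthetic_{i:06d}",
            "image": s["image_path"],
            "conversations": [
                {"from": "human", "value": "<image>\n" + s["question"]},
                # Target text concatenates the rationale and the final answer.
                {"from": "gpt", "value": s["rationale"] + " Answer: " + s["answer"]},
            ],
        })
    return records

with open("synthetic_samples.json") as f:       # hypothetical input file
    samples = json.load(f)
with open("llava_train_62k.json", "w") as f:    # hypothetical output file
    json.dump(to_llava_records(samples), f, indent=2)
```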

The Results of Fine-Tuning

The improvements were dramatic.

Table 2: Comparison of the fine-tuned model vs. the baseline.

As shown in Table 2:

  • Chart Understanding: Improved from 10.5% to 30.3%.
  • Table Understanding: Improved from 15.8% to 51.8%.
  • Road Map Navigation: This was the shocker. The model improved from 0.3% to 67.7%.

Wait, read that again. The fine-tuned 7B parameter model (a relatively small model) achieved 67.7% on Road Maps. Referring back to Table A1, GPT-4o only scored 37.8% on this same task.

By training on high-quality, code-generated synthetic data, a small open-source model was able to outperform the world’s most powerful proprietary models on specific visual reasoning tasks.

Synergistic Effects

The researchers also explored whether learning one task helped with others.

Table 3: Synergistic effects of training data.

Table 3 reveals an interesting phenomenon. Training on Charts and Tables actually helped the model perform better on Road Maps (improving from 0.3% to roughly 8.9% even without seeing a map). This suggests that training on abstract images helps the model develop a generalized capability for “interpreting lines and geometries,” which transfers across different domains.

Conclusion and Implications

The “Multimodal Self-Instruct” paper provides a compelling blueprint for the future of computer vision training. It highlights that we cannot rely solely on natural images to teach AI how to see. To create truly useful AI agents—ones that can navigate software UIs, analyze business dashboards, or plan routes—we need to teach them the language of abstract imagery.

The implications are threefold:

  1. Code is the Ultimate Annotator: We don’t need human labor to label every chart or map. If we can write code to generate the data, we get perfect labels for free.
  2. Visual Reasoning is distinct from Recognition: Identifying a “clock” is different from “reading the time.” Models need specific training on the latter.
  3. Small Models can Win: With targeted, high-quality synthetic data, small models can punch way above their weight class, beating general-purpose giants in specialized reasoning tasks.

As LMMs continue to integrate into our daily workflows, overcoming the “Abstract Image Gap” will be essential. Thanks to strategies like Multimodal Self-Instruct, we are one step closer to AI that can not only see the world but also understand the diagrams we use to explain it.