Introduction
Imagine you are an interior designer. You look at an empty room and a piece of furniture. In your mind, you rotate the furniture, place it against the back wall, and visualize how the light hits it. You haven’t moved a muscle, but you have performed a complex feat of multimodal reasoning. You combined visual perception with spatial logic.
Now, consider the state of Artificial Intelligence. We know that Large Language Models (LLMs) like GPT-4o or Claude 3.5 are incredible at text-based reasoning. They can pass bar exams and solve complex riddles. We also know they can “see” images. But can they actually reason with those images in the way humans do? Can they perform that mental rotation, or simulate a physics experiment in their “mind’s eye”?
A new research paper, Can MLLMs Reason in Multimodality?, argues that the answer is largely “no.” The researchers introduce EMMA (Enhanced MultiModal reAsoning), a benchmark designed to expose the gap between a model’s ability to recognize an object and its ability to reason about it.
Take a look at the problem below.

As shown in Figure 1, a human solves this physics problem by visualizing force vectors. We know that like charges repel and opposite charges attract. We mentally draw arrows and sum them up. The model (GPT-4o), however, stumbles. It knows the text-based rule (“like charges repel”), but when it tries to apply that to the visual arrangement of charges, it gets the direction wrong. It fails to bridge the gap between text theory and visual reality.
In this post, we will tear down the EMMA benchmark, explore why current AI models struggle with it, and look at what this means for the future of Multimodal LLMs (MLLMs).
The Illusion of Multimodal Competence
To understand why EMMA is necessary, we first need to look at the flaws in previous benchmarks.
In the past few years, we have seen high scores on multimodal benchmarks like MMMU or MathVista. These scores suggest that AI is becoming an expert at interpreting charts, diagrams, and photos. However, the authors of EMMA point out a critical flaw: redundancy.
In many existing datasets, the image is often just “decoration.” The text of the question might contain all the information needed to solve the problem (e.g., “Calculate the area of a circle with radius 5” accompanied by a generic picture of a circle). If a model can solve the problem by reading the text and ignoring the image, we aren’t testing multimodal reasoning—we are just testing text reasoning again.
The Filtering Process: Forcing the Model to “Look”
To create a benchmark that truly tests visual reasoning, the researchers employed a rigorous “filtering” process.

As illustrated in Figure 4, the curation process was ruthless. The researchers took questions from existing datasets and applied a “blindfold” test.
1. They used a model to generate a text caption of the image.
2. They fed only the question and the text caption to an LLM.
3. If the LLM could solve the question using just the text description, the question was thrown out.
This ensures that every question left in EMMA requires the model to actively engage with the visual data. The model cannot shortcut the process by leaning on a text summary; it must interpret spatial relationships, recognize visual patterns, or simulate physical processes that are too complex to capture in a simple caption.
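To make the procedure concrete, here is a minimal Python sketch of the blindfold test. Everything in it is illustrative: the item fields and the captioner/LLM stand-ins are placeholders, not the authors' actual pipeline.

```python
def requires_the_image(item, caption_image, ask_llm):
    """Blindfold test (illustrative sketch): keep an item only if a
    text-only LLM cannot solve it from the question plus a caption."""
    caption = caption_image(item["image"])                       # 1. describe the image in text
    prompt = f'{item["question"]}\n\nImage description: {caption}'
    answer = ask_llm(prompt)                                     # 2. the LLM answers without seeing pixels
    return answer != item["answer"]                              # 3. keep only if text alone was not enough

# Toy demo with stub "models": the captioner returns a generic caption
# and the LLM always guesses option A.
def stub_captioner(image):
    return "a diagram with several labeled charges"

def stub_llm(prompt):
    return "A"

candidate_items = [
    {"question": "Which direction is the net force on +Q?", "image": "fig1.png", "answer": "C"},
    {"question": "What is 2 + 2? (A) 4 (B) 5", "image": "decorative.png", "answer": "A"},
]
kept = [it for it in candidate_items if requires_the_image(it, stub_captioner, stub_llm)]
print([it["question"] for it in kept])  # only the first, image-dependent question survives
```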
Inside EMMA: The Four Pillars of Reasoning
EMMA is not just a random collection of hard pictures. It is structured around four specific domains: Mathematics, Physics, Chemistry, and Coding.

Figure 2 provides a high-level view of the types of tasks included. Let’s break down what makes each section so difficult for AI.
1. Mathematics: Spatial Gymnastics
The math section isn’t about solving equations found in a textbook; it’s about visual manipulation.

Look at Figure 9 above. These tasks require 2D Transformation. The model must mentally rotate a pattern, translate shapes across a plane, or visualize a reflection.
- Rotation: Can you visualize what this grid looks like turned 90 degrees?
- Translation: If you slide these shapes across the plane, can they fit together?
- Flipping: What does this image look like in a mirror?
There are also tasks involving 3D Spatial Simulation, such as mental paper folding or cube rotation, and Pattern Inference, where the model must deduce a visual rule (like a color sequence changing based on position) rather than a numerical one.
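These operations are easy to state in code but must be carried out purely "in the model's head" from pixels. As a point of reference, here is what a 90-degree rotation and a mirror flip of a tiny pattern look like when written out explicitly with NumPy (my own illustration, not an EMMA item).

```python
import numpy as np

# A tiny 3x3 pattern: 1 marks a filled cell, 0 an empty one.
grid = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
])

rotated_cw = np.rot90(grid, k=-1)  # rotate 90 degrees clockwise
mirrored = np.fliplr(grid)         # reflect left-right, as in a mirror

print(rotated_cw)  # [[0 1 1], [0 1 0], [1 0 0]]
print(mirrored)    # [[0 0 1], [0 1 1], [1 0 0]]
```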
2. Physics: Simulating the World
Physics questions in EMMA require the model to run a “simulation” of physical laws.

Consider the error case shown in Figure 26. The problem asks for the direction of the net electric force on a charge.
- Ground Truth (Human): A human draws a free-body diagram. We see that \(+Q\) is repelled by \(+3Q\) (pushing away) and attracted to \(-2Q\) (pulling closer). The net result is a vector pointing down and to the left.
- Model Failure: The model attempts to calculate this via Chain-of-Thought (CoT). It writes down formulas (Coulomb’s Law). However, it makes a catastrophic error in step 1: it claims the \(+3Q\) charge attracts the \(+Q\) charge. Despite “knowing” the physics laws in text, it fails to map that knowledge correctly onto the spatial arrangement in the image.
This category also includes Dynamics (predicting the path of a billiard ball) and Circuit Analysis, where the topology of the wires matters more than just the component values.
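The ground-truth reasoning above is just vector addition, which is easy to write down once the geometry is explicit. The sketch below is purely illustrative: the charge positions are invented to mirror the described outcome, not taken from the paper's figure.

```python
import numpy as np

K = 8.99e9  # Coulomb constant, N*m^2/C^2

def coulomb_force(q_target, pos_target, q_source, pos_source):
    """Force on the target charge from one source charge. A positive product
    of charges gives repulsion (the force points away from the source)."""
    r = pos_target - pos_source
    return K * q_target * q_source * r / np.linalg.norm(r) ** 3

# Invented geometry (not the paper's figure): +Q at the origin,
# +3Q directly above it, -2Q directly to its left.
Q = 1e-6  # 1 microcoulomb, arbitrary scale
pos_Q = np.array([0.0, 0.0])
pos_plus_3Q = np.array([0.0, 1.0])    # like charge above: pushes +Q downward
pos_minus_2Q = np.array([-1.0, 0.0])  # opposite charge to the left: pulls +Q leftward

net = (coulomb_force(Q, pos_Q, 3 * Q, pos_plus_3Q)
       + coulomb_force(Q, pos_Q, -2 * Q, pos_minus_2Q))
print(net)  # negative x and negative y components: the net force points down and to the left
```

Getting the same answer from a picture alone requires the model to recover this geometry and the sign conventions implicitly, which is exactly where GPT-4o slips.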
3. Chemistry: The Art of Electron Flow
Chemistry is inherently visual. Chemists use diagrams to represent molecular structures and reactions. EMMA tests this through Reaction Simulation.

In Figure 32, we see a “correct” case involving arrow-pushing. This is a standard notation in organic chemistry where curved arrows show where electrons are moving. To solve this, the model cannot just memorize a chemical formula. It must:
- Recognize the starting molecule structure.
- Interpret the curved arrows as instructions (“these electrons move here, breaking this bond”).
- Visualize the resulting molecular structure.
While models can sometimes get this right (as shown above), they frequently fail when the structures become complex or when the visual representation (like skeletal structures) requires implicit knowledge of carbon and hydrogen placement.
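To see how much information a skeletal drawing leaves implicit, it helps to ask a cheminformatics toolkit to spell it out. The sketch below uses RDKit (assuming it is installed; this is my own illustration, not tooling from the paper) to parse acetic acid and report the hydrogens that the drawing never shows.

```python
from rdkit import Chem

# Acetic acid written as SMILES. Like a skeletal drawing, the string never
# mentions the hydrogens on carbon; they must be inferred from valence rules.
mol = Chem.MolFromSmiles("CC(=O)O")

for atom in mol.GetAtoms():
    print(atom.GetIdx(), atom.GetSymbol(), "hydrogens:", atom.GetTotalNumHs())
# The methyl carbon carries 3 implicit hydrogens, the carboxyl carbon none,
# and the hydroxyl oxygen one; none of them appear explicitly in "CC(=O)O".
```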
4. Coding: Visualizing Output
The coding section of EMMA is particularly innovative. Rather than just asking “Write code to plot a graph,” EMMA tests the alignment between code and its visual output.

Figure 15 highlights the difference. Traditional benchmarks rely on “MLLM judges” to grade code, which can be unreliable. EMMA uses objective multiple-choice questions:
- Vis Choose Code: “Here is a chart. Which of these four Python snippets generated it?”
- Code Choose Vis: “Here is a Python script. Which of these four charts will it produce?”
- Modification: “Here is a chart and the code that made it. How do you modify the code to change the chart to this new version?”
This tests if the model truly understands the relationship between a line of code (e.g., plt.barh) and the resulting pixels on the screen.
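For a feel of what these items demand, here is a tiny, self-contained snippet of the kind a "Code Choose Vis" question might show (an invented example with made-up numbers, not an actual EMMA item). Answering correctly requires knowing, among other things, that plt.barh draws horizontal bars with the first category at the bottom.

```python
import matplotlib.pyplot as plt

labels = ["math", "physics", "chemistry", "coding"]
scores = [32, 38, 29, 45]  # made-up values for illustration

fig, ax = plt.subplots()
ax.barh(labels, scores)  # horizontal bars: categories on the y-axis, "math" at the bottom
ax.set_xlabel("accuracy (%)")
ax.set_title("Which of the four candidate charts does this code produce?")
fig.savefig("chart.png")
```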
The Results: A Reality Check for AI
So, how did the state-of-the-art models perform? The researchers tested ten major models, including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and open-source models like Qwen2-VL.
The results, summarized in Table 2, are sobering.

Here are the key takeaways from the data:
- Humans are Undefeated: On the balanced subset of the benchmark (“EMMA-mini”), human experts scored 77.75%. The best performing model (Gemini 2.0 Flash Thinking) scored 48.00%. That is a massive gap of nearly 30 percentage points.
- Visual Reasoning is Hard: Most models hovered around 30-40% accuracy. For context, random guessing on a 4-option multiple choice test yields 25%. Some sophisticated models barely outperformed random chance on specific tasks.
- The “Thinking” Models Lead: Models trained specifically to generate reasoning steps (like OpenAI’s o1 and Gemini Flash Thinking) performed best, particularly in coding and physics.
The Chain-of-Thought Paradox
One of the most fascinating findings in the paper involves Chain-of-Thought (CoT) prompting—the technique where you ask the model to “think step-by-step.”
Usually, CoT improves performance. However, in EMMA, the results were mixed.
- Closed-Source Models (GPT-4o, Claude): CoT generally helped slightly (+0.3% to +3%).
- Open-Source Models (LLaVA, Qwen): CoT often hurt performance.
Why would “thinking” make a model worse? The researchers hypothesize that for visual tasks, textual reasoning can be a distraction or a source of hallucination.

Look at the paper-folding problem in Figure 8.
- The Task: A square paper is folded twice, and a corner is cut off. The model must predict the unfolded pattern.
- Direct Answer: GPT-4o (without CoT) guesses correctly (Option B).
- With CoT: When forced to explain the steps, GPT-4o hallucinates. It tries to verbally describe the symmetry but gets confused by the spatial geometry, leading it to a wrong conclusion (Option E).
This suggests that some visual tasks are “ineffable”—they are hard to describe in words. When a model tries to force a visual process into a text-based chain of thought, it confuses itself.
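For readers who have not used CoT prompting, the difference between the two settings is just a change in the instruction appended to the question. A minimal sketch (the query_model call is a hypothetical stand-in, not the paper's evaluation code):

```python
def build_prompts(question: str, options: list[str]) -> dict[str, str]:
    """Two ways of asking the same visual question (illustrative sketch)."""
    body = question + "\n" + "\n".join(options)
    return {
        # Direct answering: the model commits to a letter immediately.
        "direct": body + "\nAnswer with the option letter only.",
        # Chain-of-Thought: the model verbalizes its reasoning first,
        # which EMMA shows can hurt on tasks that resist verbal description.
        "cot": body + "\nThink step by step, then give the option letter.",
    }

prompts = build_prompts(
    "A square sheet is folded twice and a corner is cut off. What does it look like unfolded?",
    ["(A)", "(B)", "(C)", "(D)", "(E)"],
)
# answer = query_model(prompts["cot"], image="folded_paper.png")  # hypothetical MLLM call
```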
Can We Brute Force It? (Test-Time Compute)
If one attempt isn’t enough, what if we let the model try 16 times and vote on the best answer? This is known as test-time compute scaling.

As shown in Table 3, throwing more compute at the problem helps, but it doesn’t solve it. Using “Majority Voting” or “Tournament” selection (where answers compete against each other) raised scores by 4% to 7%. However, even with 16 attempts, the models still trailed far behind human experts.
The issue isn’t just generating one bad answer; it’s that the models fundamentally lack the visual reasoning “engine” to consistently simulate the problem correctly. If a model doesn’t understand the Right-Hand Rule in physics, asking it 16 times won’t fix the underlying misconception.
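Majority voting itself is trivial to implement; the problem is that all 16 samples come from the same flawed visual reasoner. A minimal sketch (illustrative, not the authors' implementation):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most common answer among repeated samples (ties broken arbitrarily)."""
    return Counter(answers).most_common(1)[0][0]

# 16 hypothetical samples from the same model: if its spatial reasoning is
# systematically wrong, the wrong answer simply wins the vote.
samples = ["B", "E", "E", "B", "E", "E", "C", "E", "E", "B", "E", "D", "E", "E", "A", "E"]
print(majority_vote(samples))  # "E", even if the correct answer is "B"
```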
Why Do Models Fail? The Error Analysis
To understand the root cause of these failures, the researchers analyzed the errors made by OpenAI’s o1 model (one of the strongest reasoners).

Figure 5 shows the breakdown:
- Perceptual Error (30%): The model simply didn’t see the image correctly (e.g., missed a line, misread a number).
- Textual Reasoning Error (9%): Calculation mistakes or logic errors.
- Visual Reasoning Error (53%): This is the big one. The model saw the image correctly and knew the text theory, but failed to process the visual logic.
This confirms the central thesis of the paper: The bottleneck isn’t perception (seeing pixels) or knowledge (reading textbooks). The bottleneck is processing visual relationships dynamically.
A classic example of this is the Right-Hand Rule failure in physics.

In Figure 6, the model (o1) correctly identifies that it needs to use the Right-Hand Rule to determine magnetic force. It even describes the rule in text! But because it lacks “spatial simulation skills”—it cannot literally visualize a hand curling around a wire in 3D space—it guesses the direction of the force incorrectly (predicting \(+y\) instead of \(+x\)).
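Symbolically, the Right-Hand Rule is nothing more than a cross product, which is exactly why this failure mode is so striking: the model can state the rule but cannot act it out spatially. Below is a short NumPy example for an invented configuration (not the setup in Figure 6).

```python
import numpy as np

q = 1.0                        # positive test charge (arbitrary units)
v = np.array([1.0, 0.0, 0.0])  # velocity along +x
B = np.array([0.0, 1.0, 0.0])  # magnetic field along +y

F = q * np.cross(v, B)         # the Right-Hand Rule, F = q v x B, as a cross product
print(F)                       # [0. 0. 1.] -> force along +z
```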
Conclusion: The Road Ahead
EMMA serves as a reality check for the AI community. It demonstrates that while MLLMs have become excellent at describing images and solving text problems, they have not yet mastered organic multimodal reasoning.
The ability to look at a diagram, simulate a change, and derive a conclusion is a pillar of human intelligence. It is what allows us to be engineers, architects, and scientists. For MLLMs to truly assist in these fields, they need to move beyond simple pattern recognition.
The paper suggests that we need new architectures. Simply scaling up existing models or adding more text-based training data might not be enough. We may need training paradigms that specifically target visual imagination and spatial simulation—teaching models not just to see, but to envision.
Until then, if you need someone to figure out if your couch fits in the corner or which way the magnetic force points, you are still better off asking a human.