Imagine you are looking at a photo of an elderly man sitting in a wheelchair next to a window. A child asks you, “I need to reach something high. Can you move this chair over so I can use it?”

As a human, your brain instantly processes a complex web of causal relationships. You see the chair, you see the man, and you understand the relationship: “The chair supports the man.” Moving the chair would cause the man to fall or be displaced. Therefore, the answer is an obvious “No.”

However, if you ask this same question to a state-of-the-art AI, the answer might surprise you. It might simply recognize the object “chair,” associate chairs with “climbing to reach things,” and cheerfully answer, “Yes, I can help!”

Figure 1: An example of causal reasoning in the vision-language context. LVLMs (e.g., GPT-4o) might generate inappropriate responses due to a limited understanding of causal relationships.

This discrepancy highlights a critical gap in modern Artificial Intelligence. While Large Vision-Language Models (LVLMs) like GPT-4o are incredibly good at describing what is in an image, they often struggle to understand why things are the way they are—the causal logic that governs the physical and social world.

In this deep dive, we will explore a recent research paper titled “CELLO: Causal Evaluation of Large Vision-Language Models.” We will unpack how the researchers defined visual causality, how they built a massive dataset to test it, and the new prompting strategy they developed to help AI “think” more like a causal reasoner.

The Problem: Seeing Without Understanding

Causal reasoning is a cornerstone of human intelligence. It allows us to predict the future (“If I drop this glass, it will break”) and explain the past (“The floor is wet because it rained”). For AI agents, such as robots in a home or autonomous vehicles on the street, this ability is non-negotiable. A robot cannot just “see” a vase; it must understand that pushing the vase will cause it to fall.

Previous attempts to test AI on this front have focused on “commonsense causality”—basic associations between events. However, these tests often lack formal causal graphs. They rely on loose associations rather than strict logic. This makes it hard to pinpoint exactly where an AI’s reasoning breaks down.

The researchers behind CELLO argue that to truly evaluate AI, we need a fine-grained definition of causality involving interactions between humans and objects, mapped out on the “Ladder of Causation.”

Background: The Ladder of Causation

To understand the CELLO benchmark, we must first understand the theoretical framework it uses: Judea Pearl’s Ladder of Causation. The researchers extended this ladder to include four distinct levels (or “rungs”):

  1. Rung 0: Discovery. Identifying that a relationship exists. (e.g., “Is there a connection between the wind and the tree moving?”)
  2. Rung 1: Association. Recognizing dependencies and correlations. (e.g., “When the wind blows, how likely is the tree to move?”)
  3. Rung 2: Intervention. Understanding the effect of actions. This involves the “do-operator.” (e.g., “If I cut the tree down, will it still move?”)
  4. Rung 3: Counterfactuals. Imagining alternative realities. (e.g., “If the wind had not blown, would the tree have moved?”)

Most existing datasets only scratch the surface of Rung 1. CELLO aims to test models across all four.
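To make the four rungs concrete, here is a minimal Python sketch. The rung names follow the paper's extended ladder; the question templates are illustrative assumptions for this post, not the literal templates used to build CELLO:

```python
from enum import IntEnum

class Rung(IntEnum):
    """Extended Ladder of Causation used by CELLO (Rung 0 added below Pearl's three)."""
    DISCOVERY = 0       # Does a causal relationship exist at all?
    ASSOCIATION = 1     # Observed dependencies / correlations
    INTERVENTION = 2    # Effects of actions (the do-operator)
    COUNTERFACTUAL = 3  # Imagined alternative realities

# Illustrative question templates per rung (assumed for this sketch,
# not CELLO's actual question templates).
EXAMPLE_QUESTIONS = {
    Rung.DISCOVERY: "Is there a connection between the {cause} and the {effect}?",
    Rung.ASSOCIATION: "When the {cause} occurs, how likely is the {effect}?",
    Rung.INTERVENTION: "If we do({action}), will the {effect} still occur?",
    Rung.COUNTERFACTUAL: "If the {cause} had not happened, would the {effect} have happened?",
}

print(EXAMPLE_QUESTIONS[Rung.COUNTERFACTUAL].format(cause="wind", effect="tree's movement"))
```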

The Core Method: Defining Visual Causality

The heart of this paper is how the authors translate abstract causal theory into concrete visual problems. They propose a unified definition inspired by “causal dispositions.” Simply put: A causal relationship exists when one entity influences the state of another.

This is best understood through counterfactual reasoning: If the “cause” object were removed, would the “effect” object stay the same?

The researchers categorized these interactions into three types:

  • Object-Object: A stick holding a balloon. (Without the stick, the balloon flies away).
  • Human-Object: A woman holding a stick. (Without the woman, the stick falls).
  • Human-Human: A woman holding a child. (Without the woman, the child is not held).

Figure 2: The three types of causal relationships considered in the vision-language context: object-object, human-object, and human-human.

As shown in Figure 2, these relationships can be mapped into a Causal Graph—a diagram where nodes represent entities (Woman, Child, Stick, Balloon) and arrows represent the direction of influence. This graph provides the “ground truth” logic that the AI must understand to answer questions correctly.
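Here is a minimal sketch of that idea, using a hand-built graph for the Figure 2 scene. The entity names and the traversal are illustrative; CELLO derives such graphs automatically from scene-graph annotations:

```python
# Directed causal graph for the Figure 2 scene: an edge u -> v means
# "u causally supports or influences the state of v".
CAUSAL_EDGES = {
    ("woman", "child"),    # human-human: the woman holds the child
    ("woman", "stick"),    # human-object: the woman holds the stick
    ("stick", "balloon"),  # object-object: the stick holds the balloon
}

def affected_if_removed(removed_entity, edges):
    """Counterfactual test: which entities would change state if
    `removed_entity` were taken out of the scene?
    (Transitive closure over outgoing edges.)"""
    affected, frontier = set(), {removed_entity}
    while frontier:
        nxt = {v for (u, v) in edges if u in frontier and v not in affected}
        affected |= nxt
        frontier = nxt
    return affected

# "Does the woman causally influence the balloon?" Yes, via the stick.
print(affected_if_removed("woman", CAUSAL_EDGES))  # {'child', 'stick', 'balloon'}
```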

Constructing the CELLO Dataset

Creating a few causal questions is easy; creating 14,094 of them to evaluate AI systematically is a massive engineering challenge. The authors developed an automated pipeline to generate the CELLO dataset.

The process, illustrated in Figure 3, follows three main steps:

  1. Causal Graph Extraction: They utilize the Visual Genome dataset, which already contains “scene graphs” (data structures describing objects and their relationships like “on,” “holding,” “fixed to”). They map these scene graphs to formal causal structures (Direct, Confounding, Collision, and Chain).
  2. Causal Task Selection: Based on the extracted graph, they assign specific tasks from the Ladder of Causation. For example, if there is a “confounding” structure (where one object influences two others), they might assign a “Confounder Identification” task.
  3. Question Construction: Using Large Language Models (LLMs) and strict templates, they generate multiple-choice questions.

Figure 3: Dataset construction pipeline of CELLO, using the confounder identification task as an example.

Take the example in Figure 3. The system identifies a Confounder structure: The “Wall” supports the “Shelf,” and the “Wall” also supports the “Books” (indirectly).

  • The Question: “Why are the books placed steadily?”
  • The Logic: A correct answer must acknowledge the confounder. “Because the shelf attached to the wall keeps the books organized.”
  • Distractors: The system also generates incorrect answers based on the image (e.g., mentioning the window), the graph (only mentioning the shelf but ignoring the wall), or pure text hallucinations.

This systematic approach ensures that the questions aren’t just testing whether the AI recognizes a “book,” but whether it understands the structural support system keeping the book upright.
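The sketch below mirrors those three steps for the confounder example above. The relation-to-edge mapping, the structure detection, and the distractor categories are simplified assumptions about how such a pipeline could work, not CELLO's actual implementation:

```python
SUPPORT_RELATIONS = {"on", "holding", "fixed to", "attached to"}  # assumed subset

def scene_graph_to_causal_edges(triples):
    """Step 1: turn (subject, relation, object) scene-graph triples into
    directed causal edges supporter -> supported."""
    return {(obj, subj) for subj, rel, obj in triples if rel in SUPPORT_RELATIONS}

def descendants(node, edges):
    """Everything directly or indirectly supported by `node`."""
    seen, frontier = set(), {node}
    while frontier:
        nxt = {v for (u, v) in edges if u in frontier} - seen
        seen |= nxt
        frontier = nxt
    return seen

def find_confounders(edges):
    """Step 2: flag nodes that (directly or indirectly) support two or more
    other entities as candidate confounders."""
    nodes = {u for u, _ in edges}
    return {n: descendants(n, edges) for n in nodes if len(descendants(n, edges)) >= 2}

def build_question(confounder, mediator, target):
    """Step 3: fill a multiple-choice template (phrasing is illustrative)."""
    return {
        "question": f"Why are the {target} placed steadily?",
        "answer": f"Because the {mediator} attached to the {confounder} keeps the {target} organized.",
        "distractors": [
            f"Because of the {mediator} alone.",     # graph-based: ignores the confounder
            "Because of the window nearby.",         # image-based: irrelevant object
            "Because they are glued to the floor.",  # textual hallucination
        ],
    }

triples = [("shelf", "fixed to", "wall"), ("books", "on", "shelf")]
edges = scene_graph_to_causal_edges(triples)
print(find_confounders(edges))                        # {'wall': {'shelf', 'books'}}
print(build_question("wall", "shelf", "books")["answer"])
```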

Quality Assurance

How do we know these computer-generated questions are any good? The researchers analyzed the linguistic quality of CELLO compared to existing datasets like VQA and VisualCOMET.

Figure 4: Question quality of CELLO compared to other vision-language datasets in terms of lexical diversity and fluency.

As seen in Figure 4, CELLO (the point at the far right) demonstrates significantly higher lexical diversity while maintaining strong fluency (low perplexity). This means its questions are less repetitive and more complex than those in standard datasets, posing a tougher challenge for models.
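As a rough illustration of what a lexical-diversity measure looks like, here is a small sketch of the distinct-n-gram ratio. The exact metrics and perplexity model used by the authors are not detailed here, so treat this particular choice as an assumption:

```python
from collections import Counter

def distinct_n(questions, n=2):
    """Distinct-n: unique n-grams divided by total n-grams across a question set.
    Higher values mean less repetitive, more lexically diverse questions."""
    ngrams = Counter()
    for q in questions:
        toks = q.lower().split()
        ngrams.update(zip(*(toks[i:] for i in range(n))))
    return len(ngrams) / max(sum(ngrams.values()), 1)

sample = [
    "Why are the books placed steadily?",
    "If the wall were removed, would the shelf stay in place?",
]
print(f"distinct-2 = {distinct_n(sample):.2f}")
```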

The Solution: CELLO-CoT

After building the dataset, the researchers tested standard LVLMs (like LLaVA and raw GPT-4o) and found them lacking. The models often guessed or hallucinated.

To bridge this gap, the authors introduced CELLO-CoT, a “Chain-of-Thought” prompting strategy designed specifically for causal reasoning. Instead of asking the model to jump straight to the answer, CELLO-CoT forces it to follow the cognitive steps a human would take.

The strategy breaks the reasoning process into four explicit steps:

  1. Extract Core Entities: Look at the text and image. Who are the main actors? (e.g., “Shelf”, “Wall”, “Books”).
  2. Identify Causal Graph: Analyze the image to determine the structure. Does X cause Y? Does Z cause both?
  3. Determine Task Type: What kind of causal question is this? (e.g., “Confounder Identification”).
  4. Compile Knowledge: Retrieve specific causal rules relevant to that task.

Figure 5: Illustration of the CELLO-CoT strategy.

By forcing the model to output these intermediate steps (as shown in Figure 5), the final answer becomes grounded in logic rather than statistical probability.
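Here is a minimal sketch of how such a prompt might be assembled. The step wording paraphrases the four steps above; the exact instructions, answer format, and whatever API call sends this (together with the image) to an LVLM are assumptions of this sketch:

```python
CELLO_COT_STEPS = [
    "Step 1 - Extract core entities: list the people and objects in the image "
    "that are relevant to the question.",
    "Step 2 - Identify the causal graph: state which entities influence the "
    "state of which others (e.g., 'wall -> shelf -> books').",
    "Step 3 - Determine the task type: name the causal task being asked "
    "(e.g., confounder identification, counterfactual reasoning).",
    "Step 4 - Compile knowledge: recall the causal rules relevant to that task, "
    "then pick the best answer.",
]

def build_cello_cot_prompt(question, choices):
    """Assemble a chain-of-thought prompt that asks the model to emit the
    intermediate reasoning steps before committing to an answer."""
    lines = [f"Question: {question}", "Choices:"]
    lines += [f"  ({chr(65 + i)}) {c}" for i, c in enumerate(choices)]
    lines.append("Reason through the following steps before answering:")
    lines += [f"  {step}" for step in CELLO_COT_STEPS]
    lines.append("Finally, reply with the letter of the best choice.")
    return "\n".join(lines)

prompt = build_cello_cot_prompt(
    "Why are the books placed steadily?",
    ["Because of the window nearby.",
     "Because the shelf attached to the wall keeps the books organized.",
     "Because of the shelf alone."],
)
print(prompt)  # this text is sent to the LVLM together with the image
```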

Experiments and Results

The researchers evaluated ten leading LVLMs, including proprietary models like GPT-4o and Claude-3, and open-source models like LLaVA and Qwen-VL. The results were revealing.

1. Models Struggle with Causality

The overall performance of standard models was poor. Some models, like BLIP-2 and Claude-3-Sonnet, performed worse than random guessing on binary (Yes/No) questions. This confirms the hypothesis: current vision-language models are great at recognition but terrible at reasoning.

2. CELLO-CoT Works

The proposed prompting strategy made a massive difference. When applied to GPT-4o, accuracy jumped significantly.

Figure 6: Ablation study on the proposed CELLO-CoT strategy.

The ablation study in Figure 6(a) shows the impact of adding each step of the Chain-of-Thought.

  • Step 1 (Entity Extraction) provided the biggest boost for lower-level tasks like Discovery.
  • Steps 2-4 (Graph and Knowledge) were essential for complex tasks like Counterfactual reasoning (Rung 3).

This suggests that giving models a “structural hint” about causality unlocks reasoning capabilities that otherwise lie dormant.

3. The “Helpfulness” Trap (Robustness Testing)

Perhaps the most fascinating result came from the Robustness Testing. The researchers created “trick” questions where the request was polite but physically impossible or dangerous (like the wheelchair example in the introduction).

In these scenarios, models often prioritized being “helpful” over being “causally correct.”

Figure 7: Robustness testing across various LVLMs, showing a significant decline in performance.

Figure 7 illustrates a dramatic collapse in performance for models like BakLlava and Qwen-VL. When asked, “Can you move this shelf?” (where the shelf is bolted to a wall and holding objects), the models ignored the physical constraints and answered “Yes.”

  • BakLlava dropped from 57% accuracy to 3% accuracy.
  • GPT-4o was the only model to maintain stability, likely due to its heavy “safety” training (RLHF), where it learns to refuse unreasonable requests—though often with a generic “I am an AI” response rather than a physics-based explanation.
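To see why causal structure matters for this kind of request, consider a toy feasibility check over the same sort of causal graph sketched earlier. This is an illustration of the underlying logic, not part of the CELLO pipeline:

```python
def is_request_feasible(target, edges):
    """A request to move `target` conflicts with the scene if `target`
    currently supports other entities: moving it would change their state."""
    dependents = {v for (u, v) in edges if u == target}
    return len(dependents) == 0, dependents

edges = {("wall", "shelf"), ("shelf", "books")}  # the wall holds the shelf, the shelf holds the books
ok, deps = is_request_feasible("shelf", edges)
print(ok, deps)  # False {'books'} -> "No, the shelf is holding the books."
```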

4. Where Do Models Fail?

The error analysis (Figure 8) showed that the vast majority of errors (nearly 90%) were “Mischosen Answers.”

Figure 8: Error analysis of LVLMs.

This suggests the models aren’t crashing or failing to understand the format; they are confidently choosing the wrong causal explanation. They are being distracted by irrelevant objects in the image (Visual Distractors) or by language biases.

A Concrete Example: Counterfactual Reasoning

To visualize these failures, let’s look at a specific case study from the paper regarding Counterfactual Reasoning.

The question asks: “If the person holding the banana steps aside, would the shadow still exist?”

  • Common Sense: No. The person is blocking the light to create the shadow.
  • Model Failure: Many models answer “Yes.” They fail to link the existence of the “shadow” (Effect) to the presence of the “person/banana” (Cause). They treat the shadow as a permanent fixture of the scene rather than a dependent state.

Figure 20: Case study of counterfactual reasoning.

As shown in Figure 20, while some advanced models get this right, many open-source models fail to trace this simple causal link, proving the difficulty of Rung 3 (Counterfactual) tasks.
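Plugging the case study into the toy counterfactual check from earlier (redefined here so the snippet stands alone, with entity names assumed from the example) makes the expected reasoning explicit:

```python
def affected_if_removed(removed_entity, edges):
    """Which entities change state if `removed_entity` leaves the scene?"""
    affected, frontier = set(), {removed_entity}
    while frontier:
        nxt = {v for (u, v) in edges if u in frontier and v not in affected}
        affected |= nxt
        frontier = nxt
    return affected

scene = {("person", "banana"), ("person", "shadow")}  # the person holds the banana and casts the shadow
print("shadow" in affected_if_removed("person", scene))  # True -> the shadow would NOT still exist
```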

Conclusion and Implications

The CELLO paper serves as a reality check for the AI community. It demonstrates that while we have built models that can write poetry and identify dog breeds, we have not yet built models that truly understand the physics of a “shelf supporting a book” or the social implication of “moving a wheelchair.”

The key takeaways are:

  1. Causality is distinct from Recognition: Identifying objects is not the same as understanding their interactions.
  2. Explicit Structure Helps: The CELLO-CoT strategy shows that when models are prompted to think in terms of graphs and entities, their reasoning improves drastically.
  3. Robustness is a Safety Issue: The tendency of models to say “Yes” to physically impossible requests poses a real danger for future embodied agents (robots).

By providing a rigorous dataset and a unified definition of visual causality, CELLO paves the way for the next generation of AI—systems that don’t just look at the world, but actually understand how it works.