Introduction
Imagine asking a robot to “assemble gift baskets” in your living room. A standard Large Language Model (LLM) might give you a perfect textual list of instructions: find the basket, put in the cookies, add the cheese. But what if the robot looks at the table and sees that the cookies are burnt? What if the water supply you need to water the plants has been shut off?
This is the frontier of Embodied AI—moving beyond generating text to generating actionable plans based on what an agent actually sees. While LLMs have demonstrated incredible reasoning abilities, we are still figuring out how well Vision Language Models (VLMs) handle complex, multi-modal procedural planning. Can they integrate visual cues with textual goals? Can they handle “counterfactual” scenarios where things go wrong?
In a recent paper, researchers introduced ActPlan-1K, a rigorous benchmark designed to answer these questions. By testing models like GPT-4V, Gemini, and Claude on over 1,000 household activity instances, they discovered that while current AI is impressive, it still struggles to plan like a human when the pressure is on.
The Problem: From Text to Reality
Current research often treats planning as a text-only problem or focuses on very short, simple tasks. Previous benchmarks might ask an AI to “pick up the apple,” but they rarely ask it to “clean the entire kitchen, keeping in mind that the sink is clogged.”
The authors of ActPlan-1K identified two major gaps in the field:
- Multi-modality: There is a lack of study on how VLMs behave when they must process both text instructions and visual environments simultaneously to create a long-term plan.
- Counterfactual Planning: Most benchmarks assume a “happy path” where everything works perfectly. Real life is full of constrained situations (e.g., broken tools, missing ingredients), and we need to know if AI can adapt.

As shown in Figure 1, the goal is to take a visual scene (a room with specific objects) and a task description, and output a valid sequence of actions (a “Gold Plan”).
The Solution: ActPlan-1K
To bridge this gap, the researchers constructed ActPlan-1K, a dataset built using the iGibson2 household simulator and ChatGPT. The benchmark is substantial, featuring 153 different activities and 1,187 specific instances.
How the Benchmark is Built
The construction of ActPlan-1K is a multi-step pipeline designed to mimic real-world complexity.
- BDDL Definitions: The team started with symbolic activity definitions (BDDL) from the BEHAVIOR-100 dataset. These define the logic of a task (e.g., “gift basket exists” AND “contains cookie”); a minimal sketch of what such a goal condition amounts to appears after this list.
- Simulation & Capture: Using iGibson2, they loaded these definitions to generate 3D household environments. They then captured 2 to 5 images per instance to provide the VLM with “eyes.”
- Natural Language Translation: The symbolic logic was converted into natural language task descriptions.
- Gold Plans: Human annotators wrote the “correct” sequences of actions (Gold Plans) required to solve the task given the specific visual constraints.
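BDDL is a PDDL-style symbolic language, and this post does not reproduce its syntax. As a rough illustration only, the following Python sketch shows what a goal condition like “gift basket exists AND contains cookie” amounts to logically. Every name here (SimObject, goal_satisfied, the category strings) is a hypothetical stand-in, not an identifier from BEHAVIOR-100 or iGibson2.

```python
# Illustrative sketch only: a symbolic goal condition in the spirit of a BDDL
# activity definition ("gift basket exists" AND "contains cookie"), rendered
# as plain Python. Object and predicate names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class SimObject:
    name: str
    category: str
    properties: set = field(default_factory=set)   # e.g. {"burnt"}
    contains: list = field(default_factory=list)   # objects placed inside

def goal_satisfied(scene: list[SimObject]) -> bool:
    """Check: at least one gift basket exists, and every basket holds a cookie."""
    baskets = [o for o in scene if o.category == "gift_basket"]
    if not baskets:
        return False
    return all(
        any(item.category == "cookie" for item in basket.contains)
        for basket in baskets
    )

# Tiny usage example
cookie = SimObject("cookie_1", "cookie")
basket = SimObject("basket_1", "gift_basket", contains=[cookie])
print(goal_satisfied([basket]))  # True
```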

Figure 2 illustrates this workflow. The VLM takes the text description (\(T\)) and the images from the simulator to generate a predicted plan (\(\mathcal{P}^*\)), which is then compared against the human-annotated Gold Plan (\(\mathcal{P}\)).
The mathematical formulation for the VLM’s task is:

\[
\mathcal{P}^{*} = \operatorname{VLM}\left(T, I\right), \qquad I = (i_1, i_2, \dots, i_n)
\]

Here, the model must synthesize the text \(T\) and the sequence of images \(I\) to produce the plan \(\mathcal{P}^{*}\).
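To make this setup concrete, here is a minimal sketch of how such a query could be issued with the OpenAI Python client. The model name, prompt wording, and file paths are illustrative assumptions, not the paper’s exact prompts or protocol.

```python
# Minimal sketch (assumed setup, not the paper's exact prompts or models):
# send the task description T plus the scene images I to a VLM and ask for
# a step-by-step plan P*.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

task_description = "Assemble gift baskets in the living room."  # T (illustrative)
image_paths = ["scene_view_1.png", "scene_view_2.png"]          # I (illustrative)

content = [{"type": "text",
            "text": f"Task: {task_description}\n"
                    "Based on the attached scene images, output a numbered, "
                    "step-by-step action plan."}]
for path in image_paths:
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{encode_image(path)}"}})

response = client.chat.completions.create(
    model="gpt-4o",  # stand-in for the GPT-4V-class models tested in the paper
    messages=[{"role": "user", "content": content}],
)
predicted_plan = response.choices[0].message.content  # P*
print(predicted_plan)
```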
The Twist: Counterfactual Activities
The most innovative aspect of ActPlan-1K is the inclusion of counterfactual activities. These are scenarios where the standard procedure is disrupted by a specific constraint.
To create these, the researchers prompted ChatGPT to brainstorm “unexpected situations” for standard household tasks. Human annotators then selected the most plausible ones that could be visually represented in the simulator.

Figure 3 shows a clear example.
- Normal Activity: Assemble gift baskets. The plan involves grabbing cookies and putting them in baskets.
- Counterfactual Activity: Two of the cookies are burnt. The plan changes: the agent must recognize the visual property (burnt) and decide not to put those cookies in the baskets (a tiny sketch of this conditional logic follows this list).
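In code terms, the counterfactual boils down to a conditional branch on a visually grounded property. The sketch below is purely illustrative (hypothetical names and action strings), not how the benchmark or any model actually generates plans.

```python
# Illustrative sketch (hypothetical names): how a counterfactual object
# property ("burnt") changes the gift-basket plan.
def gift_basket_plan(cookies: list[dict], basket: str) -> list[str]:
    plan = ["walk to table", f"pick up {basket}"]
    for cookie in cookies:
        if "burnt" in cookie.get("properties", []):
            # Counterfactual branch: a burnt cookie must not go into the basket.
            plan.append(f"leave {cookie['name']} on the table")
        else:
            plan.append(f"pick up {cookie['name']}")
            plan.append(f"put {cookie['name']} into {basket}")
    return plan

cookies = [
    {"name": "cookie_1", "properties": []},
    {"name": "cookie_2", "properties": ["burnt"]},
]
print(gift_basket_plan(cookies, "basket_1"))
```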
Types of Counterfactuals
The researchers categorized these “curveballs” into three types:
- Object Property: The physical state of an object changes the plan (e.g., the burnt cookies above).
- Object Function: An object must be used differently than usual.
- Event Causality: An unexpected event requires extra steps (e.g., needing to clean a spill before you can set the table).

Another example is shown in Figure 8. In the normal scenario, you water plants using the sink. In the counterfactual scenario, the water supply is off, so the agent must realize it needs to use bottled water found on the countertop.
The distribution of these tricky scenarios is fairly balanced, ensuring the AI is tested on various types of reasoning.

Evaluation Methodology
How do we know if the AI’s plan is good? The researchers used a combination of human and automatic metrics.
Human Evaluation
Because these plans are complex, human judgment is the gold standard. Annotators looked for:
- Correctness: Does the plan actually achieve the final goal?
- Commonsense Satisfaction: Is every step logical? (e.g., you can’t put the milk in the fridge before you open the fridge door).
Automatic Evaluation
To scale evaluation, the team also used automated metrics:
- LCS (Longest Common Subsequence): Measures how many steps in the AI’s plan match the order of the Gold Plan (a minimal implementation sketch follows this list).
- Learned Metrics (BLEURT): They finetuned a BLEURT model to predict the correctness of a plan.
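As flagged above, here is a generic normalized-LCS scorer over plan steps. It illustrates the idea rather than reproducing the paper’s exact matching procedure, which must also decide when two differently worded steps count as the same action; here two steps match only if their normalized text is identical.

```python
# Generic normalized longest-common-subsequence (LCS) score over plan steps.
# Illustrative implementation of the idea, not the paper's exact matcher.

def lcs_length(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming LCS over two sequences of steps."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, step_a in enumerate(a, 1):
        for j, step_b in enumerate(b, 1):
            if step_a == step_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcs_score(predicted: list[str], gold: list[str]) -> float:
    """LCS length normalized by the longer plan (one possible normalization)."""
    norm = lambda s: s.strip().lower()
    a, b = [norm(s) for s in predicted], [norm(s) for s in gold]
    return lcs_length(a, b) / max(len(a), len(b), 1)

gold = ["open cabinet", "grasp cookie", "put cookie in basket"]
pred = ["grasp cookie", "put cookie in basket", "close cabinet"]
print(round(lcs_score(pred, gold), 2))  # 0.67
```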
The BLEURT model processes sentence pairs (the generated step and the gold step) to determine semantic equivalence. The process involves generating a classification vector:

\[
v = \operatorname{BLEURT}\left(s_{\text{gen}}, s_{\text{gold}}\right)
\]

And predicting a label (correct/incorrect):

\[
\hat{y} = \operatorname{softmax}(W v + b)
\]

Optimized via a standard loss function:

\[
\mathcal{L} = -\sum_{c \in \{0,1\}} y_c \log \hat{y}_c
\]
This allowed the researchers to create a robust automatic scorer that correlated well with human judgment.
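The classification step in those equations can be sketched in a few lines of PyTorch. This is a generic stand-in under stated assumptions: the encoder that produces the pooled pair representation \(v\) is abstracted away, and the paper’s actual setup finetunes a BLEURT checkpoint whose architecture is not reproduced here.

```python
# Minimal sketch of the classification step in the equations above: a pooled
# sentence-pair vector v -> linear layer (W, b) -> label, trained with
# cross-entropy. The encoder producing v (BLEURT in the paper) is abstracted away.
import torch
import torch.nn as nn

class PlanPairClassifier(nn.Module):
    def __init__(self, hidden_size: int = 768, num_labels: int = 2):
        super().__init__()
        self.head = nn.Linear(hidden_size, num_labels)  # W, b in the equations

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        return self.head(v)  # logits; softmax is folded into the loss below

# v: pooled representations of (generated step, gold step) pairs from an encoder
batch_size, hidden_size = 4, 768
v = torch.randn(batch_size, hidden_size)
labels = torch.tensor([1, 0, 1, 1])  # 1 = correct, 0 = incorrect

model = PlanPairClassifier(hidden_size)
logits = model(v)
loss = nn.CrossEntropyLoss()(logits, labels)  # standard cross-entropy objective
probs = torch.softmax(logits, dim=-1)         # \hat{y} in the equations
loss.backward()
print(loss.item(), probs.shape)
```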
Experiments & Key Results
The researchers tested three state-of-the-art VLMs: Claude-3, Gemini-Pro-1.5, and GPT-4V. The results revealed that household planning is far from solved.
1. Performance is Low Across the Board
Even the best models struggled. Gemini-Pro-1.5 achieved the highest scores, but “high” is relative. It reached about 41.7% correctness on counterfactual activities. This means that more than half the time, the AI failed to generate a plan that successfully completed the task.
2. Counterfactuals are Harder
Unsurprisingly, the models performed significantly better on “normal” activities than on the counterfactual ones. This highlights a critical weakness in current AI: it relies heavily on memorized patterns (scripts) and struggles when it has to reason about exceptions.
3. The “Length Cliff”
One of the most telling findings is how performance degrades as the task gets longer.

As Figure 5 demonstrates, performance falls off a cliff as the number of plan steps increases. For short plans (0-10 steps), models like GPT-4V and Gemini perform decently (around 60%). But for long plans (over 40 steps), correctness drops to nearly zero. The models simply lose the thread of the narrative or hallucinate steps when the sequence gets too long.
4. Images Matter
The researchers performed an ablation study, removing the images and giving the models only text. Performance dropped significantly. Without visual context, models couldn’t verify which objects were present or their states (e.g., open vs. closed), leading to hallucinations.
Error Analysis: Why Do They Fail?
To understand why the models failed, the researchers categorized the errors into six types.

Figure 6 breaks this down; the most prominent categories include:
- Missing Actions: This was the most common error. The AI would skip necessary logical steps (e.g., trying to wash dishes without turning on the faucet).
- Mistake of Event Cause/Result: Misunderstanding the flow of cause and effect.
- Hallucination: GPT-4V, in particular, had a tendency to invent tools that weren’t in the image (e.g., using a vacuum cleaner that doesn’t exist).
- Incorrect Image Understanding: Gemini-Pro struggled the most with correctly interpreting visual details, such as the number of plates or the distance between objects.
Table 8 provides concrete examples of these errors. For instance, in a “packing food” task, a model might fail to distinguish between different spices (chives vs. chili) based on the visual input, leading to a failed plan.

Conclusion and Future Implications
ActPlan-1K serves as a reality check for the Embodied AI community. While VLMs are powerful, they are not yet ready to be autonomous household managers. The benchmark highlights two critical areas for improvement:
- Long-Horizon Consistency: Models need better memory or reasoning structures to handle tasks that require 30, 40, or 50+ steps without losing track of the goal.
- Robust Visual Grounding: Models must get better at “verifying” their plans against the image. They need to see that the cookies are burnt and adjust their plan immediately, rather than following a pre-trained script.
By providing a standardized way to test these capabilities—specifically including the difficult “counterfactual” scenarios—ActPlan-1K paves the way for the next generation of more reliable, adaptable, and helpful AI agents.