Introduction

Imagine unboxing a new, high-tech rice cooker. It has a dozen buttons, a digital display, and a distinct lack of intuitive design. You, a human, likely grab the user manual, flip to the “Cooking Brown Rice” section, and figure out the sequence of buttons to press.

Now, imagine you want your home assistant robot to do this. For a robot, this is a nightmare scenario. Unlike a hammer or a cup, an appliance is a “state machine”—it has hidden internal modes, logic constraints (you can’t start cooking if the lid is open), and complex inputs. Historically, roboticists have had to hard-code these interactions for specific devices. But what if a robot could simply read the manual and figure it out, just like we do?

In a fascinating new paper, researchers introduce ApBot, a framework that enables robots to operate novel, complex home appliances in a zero-shot manner. By leveraging Large Vision-Language Models (LVLMs) to interpret manuals and constructing a structured symbolic model of the device, ApBot bridges the gap between unstructured text and precise physical action.

ApBot enables robots to operate diverse, novel, complex home appliances in a zero-shot manner using manuals.

The Challenge: Why Not Just Ask GPT-4?

With the rise of powerful multimodal models like GPT-4o or Claude, one might assume we can simply show the robot a picture of the microwave, feed it the manual, and say, “Heat this up.”

However, existing Vision-Language Models (VLMs) struggle with this for several reasons:

  1. Unstructured Information: Manuals are messy. They contain diagrams, warnings, and long paragraphs that mix varying types of information.
  2. Complex State Transitions: Appliances are unforgiving. If you press “Menu” three times instead of four, you might end up in “Porridge” mode instead of “Rice” mode. A generic VLM often lacks the long-horizon reasoning capabilities to track these internal states accurately over time.
  3. Visual Grounding: Knowing you need to press “Start” is different from finding the exact pixel coordinates of the “Start” button on a cluttered panel, especially if the lighting is poor or the text is small.

The researchers found that relying solely on an LVLM as a direct policy (inputting image + text and asking for the next action) leads to high failure rates, particularly as appliances get more complex.

The ApBot Solution: Structure is Key

ApBot solves this by treating the manual not as mere context text, but as a blueprint for building a Structured Appliance Model. Instead of guessing the next token, ApBot essentially writes a small computer program that represents the appliance’s logic.

As illustrated in the system overview below, the process moves from the raw manual to a symbolic representation, and finally to physical execution with closed-loop feedback.

Overview of ApBot. The structured model built from manuals can generate actions to operate novel appliances.

The framework operates in four distinct phases:

1. Constructing the Symbolic Model

First, ApBot uses an LVLM to read the manual and extract a formal definition of the appliance, denoted \(\overline{\mathcal{M}}\). This isn’t just a summary; it’s a tuple \(\overline{\mathcal{M}} = \langle \overline{S}, \overline{A}, \overline{\mathcal{T}}, \overline{S}_g \rangle\) containing:

  • State Space (\(\overline{S}\)): A list of variables the machine possesses (e.g., power: [on, off], temperature: [150, 160, ... 200]).
  • Action Space (\(\overline{A}\)): What can physically be done (e.g., press_button_menu, turn_dial_clockwise).
  • Transition Rules (\(\overline{\mathcal{T}}\)): The logic of cause and effect. For example, “If mode is ‘Settings’, pressing ‘Up’ increases brightness.”
  • Goal States (\(\overline{S}_g\)): The target configuration(s) that satisfy the user’s instruction.
  • Macro Actions: The system identifies high-level tasks described in the manual (e.g., “Defrost”) and breaks them down into sequences of symbolic actions.

This step converts ambiguous natural language into a deterministic state machine that a planner can use reliably.
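To make this concrete, here is a minimal sketch of what such a structured model could look like in code. The variable names, value ranges, and rule format below are illustrative assumptions, not the paper’s exact schema.

```python
# Minimal sketch of a structured appliance model for a rice cooker.
# Variable names, value ranges, and the rule format are illustrative
# assumptions, not the paper's exact representation.
RICE_COOKER_MODEL = {
    # State space: each hidden variable and the values it can take.
    "state_vars": {
        "power": ["off", "on"],
        "mode": ["standby", "rice", "porridge", "settings"],
        "timer_min": list(range(0, 121, 10)),
    },
    # Action space: symbolic actions the robot can execute.
    "actions": ["press_power", "press_menu", "press_plus", "press_start"],
    # Transition rules: (precondition, action, effect) triples.
    "transitions": [
        ({"power": "off"}, "press_power", {"power": "on", "mode": "standby"}),
        ({"mode": "standby"}, "press_menu", {"mode": "rice"}),
        ({"mode": "rice"}, "press_menu", {"mode": "porridge"}),
        ({"mode": "rice"}, "press_plus", {"timer_min": "+10"}),
    ],
}

def step(model, state, action):
    """Apply `action` to `state` using the model's transition rules."""
    for pre, act, eff in model["transitions"]:
        if act == action and all(state.get(k) == v for k, v in pre.items()):
            new_state = dict(state)
            for var, val in eff.items():
                if isinstance(val, str) and val.startswith("+"):
                    new_state[var] = state[var] + int(val)  # relative update
                else:
                    new_state[var] = val
            return new_state
    return state  # no rule fired: the press has no effect in this state
```

Because the model is an explicit data structure rather than free-form text, a planner can simulate button presses symbolically before the robot ever touches the appliance.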

2. Visual Action Grounding

Once the robot knows what to press (symbolic actions), it needs to find where to press (grounding). This is often the point of failure for general-purpose models.

ApBot employs a robust pipeline for this. It combines three different vision systems:

  1. SAM (Segment Anything Model): To find physical object boundaries.
  2. OWL-ViT: To detect semantic objects like “buttons” or “knobs.”
  3. OCR (Optical Character Recognition): To read labels on the buttons.

By taking the union of these detections and filtering them through an LVLM verification step, ApBot maps every symbolic action (e.g., press_start) to a specific bounding box on the image.

Overview of action grounding with visual observations.
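A rough sketch of how this union-and-verify step could be organized is below. The detector calls (`run_sam`, `run_owl_vit`, `run_ocr`, `lvlm_verify`) are hypothetical wrappers around the respective models, not actual library APIs.

```python
def ground_actions(image, symbolic_actions):
    """Map each symbolic action (e.g. 'press_start') to a bounding box.

    Sketch of the union-and-verify idea; run_sam, run_owl_vit, run_ocr, and
    lvlm_verify are hypothetical wrappers, not real library calls.
    """
    # 1. Collect candidate regions from three complementary detectors.
    candidates = []
    candidates += run_sam(image)                                  # class-agnostic segments
    candidates += run_owl_vit(image, queries=["button", "knob"])  # semantic detections
    candidates += run_ocr(image)                                  # text labels with boxes

    # 2. Ask an LVLM to verify which candidate corresponds to each action.
    grounded = {}
    for action in symbolic_actions:                # e.g. "press_start"
        box = lvlm_verify(image, action, candidates)  # returns a box or None
        if box is not None:
            grounded[action] = box                 # (x_min, y_min, x_max, y_max)
    return grounded
```

The redundancy is the point: a segmenter, an open-vocabulary detector, and OCR each fail in different ways, and the LVLM verification step filters out spurious candidates from all three.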

3. Macro Actions and Planning

Planning individual button presses for a complex task is inefficient and error-prone. Instead, ApBot utilizes Macro Actions.

User manuals often organize instructions into “features” or tasks. ApBot captures these as macro actions—parameterized sequences like Cook(LongGrain, 1 hour).

When a user gives a command like “Cook long grain rice for 1 hour,” ApBot doesn’t try to hallucinate the steps. Instead, it looks at its generated model, finds the macro action for Cook, and instantiates it with the target parameters. This reduces the reasoning burden significantly.
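One way to picture this is as a parameterized template that gets filled in at request time. The `Cook` template and helper below are made-up examples to illustrate the idea, not the paper’s actual encoding.

```python
# Hypothetical macro-action template extracted from a rice-cooker manual.
COOK_MACRO = {
    "name": "Cook",
    "params": ["grain_type", "duration_min"],
    "steps": [
        ("press_power", {}),                              # wake the appliance
        ("press_menu_until", {"mode": "{grain_type}"}),
        ("press_plus_until", {"timer_min": "{duration_min}"}),
        ("press_start", {}),
    ],
}

def instantiate(macro, **kwargs):
    """Fill the template's parameters, yielding a concrete action sequence."""
    plan = []
    for action, target in macro["steps"]:
        bound = {k: v.format(**kwargs) if isinstance(v, str) else v
                 for k, v in target.items()}
        plan.append((action, bound))
    return plan

# "Cook long grain rice for 1 hour" -> Cook(long_grain, 60 minutes)
plan = instantiate(COOK_MACRO, grain_type="long_grain", duration_min=60)
```

The LVLM only has to pick the right macro and fill in its parameters; the button-level sequencing comes from the model built in step 1.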

4. Closed-Loop Execution and Self-Correction

This is perhaps the most critical contribution of the paper. Even the best manual reading can result in an imperfect model. Maybe the manual says “Press + to increase time,” but doesn’t specify that the time increases by 10 minutes rather than 1 minute.

ApBot implements Closed-Loop Model Updates:

  1. Execute: The robot performs a macro action.
  2. Observe: It looks at the appliance’s screen (visual feedback) to see the result.
  3. Compare: If the observed state matches the predicted state, great.
  4. Correct: If there is a mismatch (e.g., the timer says “30” but the robot expected “3”), ApBot enters a diagnostic mode.

In diagnostic mode, the robot might repeatedly press a button to observe how the variable changes (e.g., 0 -> 30 -> 60 -> 90). It uses this empirical data to rewrite the transition rule in its model on the fly. This allows the robot to “learn” the specific quirks of an appliance that weren’t clear in the text.
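A minimal sketch of this kind of model repair, assuming a fixed increment per press: the robot probes the button a few times, measures how the displayed value changes, and overwrites its assumed step size with the one it observed. The function names (`press`, `read_display`) are hypothetical robot primitives.

```python
def diagnose_increment(press, read_display, n_probes=3):
    """Empirically estimate how much one button press changes a variable.

    `press` executes the physical press; `read_display` returns the value
    currently shown on the appliance's screen (hypothetical primitives).
    """
    readings = [read_display()]
    for _ in range(n_probes):
        press()
        readings.append(read_display())
    deltas = [b - a for a, b in zip(readings, readings[1:])]
    # Assume a fixed increment; take the most frequently observed delta.
    return max(set(deltas), key=deltas.count)

# If the model predicted +1 per press but the display went 0 -> 30 -> 60 -> 90,
# the inferred increment (30) replaces the old transition rule on the fly.
```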

Experiments and Results

To rigorously test this, the researchers built a benchmark of 30 different appliances across 6 categories: dehumidifiers, bottle washers, rice cookers, microwaves, bread makers, and washing machines.

Appliances in our benchmark. (a) Appliance Types. (b) All Instances of Bread Maker.

They compared ApBot against state-of-the-art LVLMs acting directly as policies (with and without visual grounding). The metrics focused on Success Rate (SR) and SPL (Success weighted by Path Length—essentially, how efficiently the robot reached the goal).
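For reference, SPL is commonly defined as in the standard embodied-AI formulation, which matches the “success weighted by path length” description here (the paper may differ in episode-level details):

\[
\mathrm{SPL} \;=\; \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(p_i,\, \ell_i)}
\]

where \(S_i \in \{0, 1\}\) indicates success on episode \(i\), \(\ell_i\) is the length of the shortest action sequence that completes the task, and \(p_i\) is the number of steps the robot actually took.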

ApBot vs. The Baselines

The results were stark. As shown in the charts below, ApBot (red bars) consistently outperformed the baselines (blue and purple).

Performance of home appliance operation by appliance type: average task success rate (SR), average number of execution steps (Average Steps), and SPL for ApBot and the baseline methods.

Notice the trend in the “Success Rate” graphs. For simple appliances (like the dehumidifier, with a single state variable), standard LVLMs perform decently. But look at the washing machine, with six state variables: the complexity crushes the baseline models, dropping their success rates below 50%. ApBot, however, maintains a success rate near 90%.

This demonstrates that structured modeling is essential for complexity. Pure end-to-end learning struggles to maintain coherence when a task involves manipulating multiple variables (temperature, mode, time, spin speed) sequentially.

The Importance of Being Grounded

The ablation studies highlighted two major takeaways:

  1. Visual Grounding matters: The baseline “LLM as policy w/ grounded actions” performed significantly better than “LLM w/ image.” If the robot can’t reliably find the buttons, no amount of reasoning will help.
  2. Closed-Loop is non-negotiable: Removing the model update mechanism (ApBot w/o closed-loop update) caused a massive performance drop on complex appliances. The ability to verify and correct one’s own internal model is what makes the system robust.

Real-World Deployment

Simulation is one thing, but does it work on physical hardware? The team deployed ApBot on a Kinova Gen3 robot arm to operate real appliances, including an induction cooker and a water dispenser.

The setup required a specialized “pressing” policy where the robot estimates the surface normal of the button to apply force correctly.

The real-world framework
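One common way to obtain such a surface normal is to fit a plane to the depth points around the button’s center; the sketch below does this via PCA on a point-cloud patch. It assumes a depth-camera-derived patch with the camera at the origin and is not the paper’s exact pressing policy.

```python
import numpy as np

def press_direction(points):
    """Estimate the pressing direction from a small point-cloud patch.

    `points` is an (N, 3) array of 3D points around the button, assumed to
    come from the depth camera (camera at the origin). The surface normal is
    the direction of least variance in the patch; the robot presses along
    the direction pointing into the surface.
    """
    centered = points - points.mean(axis=0)
    # PCA: the right-singular vector with the smallest singular value is
    # normal to the local surface.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]
    # Orient the normal toward the camera so that its negation points
    # into the panel.
    if np.dot(normal, points.mean(axis=0)) > 0:
        normal = -normal
    return -normal  # unit vector along which to apply the pressing force
```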

The system successfully translated open-ended instructions (e.g., “Select the HotPot mode and set power to 2000W”) into long-horizon physical manipulations.

Snapshots of our system operating an induction cooker and a water dispenser.

In these real-world tests, ApBot showed it could handle the noise of real visual sensors and the physical constraints of button pressing, further validating the robustness of the structured model approach.

Conclusion

The ApBot paper makes a compelling argument for the future of assistive robotics. It suggests that we shouldn’t rely solely on the “black box” intuition of large neural networks for complex, sequential tasks. Instead, using LLMs to generate structured, symbolic code—which can then be verified, executed, and corrected—offers a path toward much higher reliability.

By treating the user manual not just as text, but as source code for the appliance’s logic, robots can finally start using our tools the way they were designed to be used. As household robots become more common, capabilities like this will be the difference between a robot that stares blankly at your new coffee maker and one that brews you a perfect cup.