Introduction: The “Pasta Jar” Problem
Imagine you are sitting in a wheelchair, using a joystick to control a robotic arm attached to your chair. You are in the kitchen, and your goal is to make dinner. You navigate the robot toward a shelf, pick up a jar of pasta, and move it toward the counter where a cooking pot and a laptop are sitting side-by-side.
To a human observer, your intent is obvious: you are going to pour the pasta into the pot. However, to a traditional robotic system, this is a baffling geometric puzzle. If the pasta jar happens to pass slightly closer to the laptop than the pot, a standard robot might infer that you want to pour the pasta onto the keyboard.
This scenario highlights the fundamental challenge in Assistive Teleoperation. The goal is to share control between a human and a robot—the human provides high-level guidance, and the robot handles the low-level motor skills. But for this partnership to work, the robot needs to understand what the human is trying to do.
Traditional methods rely on geometric cues, like the distance between the robot’s gripper and an object. But real life requires more than geometry; it requires commonsense. We know pasta goes in pots, not computers.
In this post, we will dive deep into CASPER (Commonsense Analysis for Shared Perception and Execution in Robotics), a new system presented by researchers at UT Austin, UCLA, and CMU. CASPER uses pre-trained Vision Language Models (VLMs) to bring semantic understanding to assistive robotics, allowing robots to infer complex human intents and execute long-horizon tasks reliably.
The Spectrum of Control: Why We Need Assistance
To understand why CASPER is necessary, we must first look at the extremes of robot control:
- Full Teleoperation: The user manually controls every joint or movement of the robot. This offers perfect agency—the robot does exactly what you say. However, it is cognitively exhausting. Controlling a 7-degree-of-freedom arm with a joystick requires intense focus and fine motor skills.
- Full Autonomy: The robot acts independently. While low-effort for the user, current autonomous systems often struggle to understand nuanced human needs or operate in unstructured environments (like a cluttered home).
Assistive Teleoperation sits in the middle. The user initiates a motion, and the robot attempts to predict the goal (Intent Inference). Once the intent is clear, the robot takes over and autonomously finishes the task (Skill Execution).
The “Commonsense” Gap
The critical weak point in existing assistive systems is Intent Inference.
Prior methods have relied on Motion-based Inference. If you move the joystick to the right, the robot scans for objects on the right. If you move the gripper toward a cup, it assumes you want the cup. These systems use mathematical models (like Bayesian inference) to update probabilities based on motion.
The problem is that human motion is noisy, and the environment is complex. If you have to navigate around a vase to get to a book, your initial motion might point at the vase. A geometric system will wrongly predict “Pick up Vase.” Furthermore, these systems are “closed-set,” meaning they can only recognize a small, pre-programmed list of objects or actions.
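To make this failure mode concrete, here is a minimal sketch of a purely geometric, Bayesian-style goal predictor, written in the spirit of these motion-based methods rather than as a reimplementation of any specific baseline. The Boltzmann likelihood, the beta value, and the object positions are all illustrative assumptions.

```python
import numpy as np

def update_goal_belief(belief, gripper_pos, gripper_vel, goal_positions, beta=5.0):
    """One Bayesian update of P(goal | motion) for a purely geometric predictor.

    The likelihood is a Boltzmann distribution over how directly the current
    velocity points at each candidate goal -- no semantics involved.
    """
    likelihoods = []
    for g in goal_positions:
        to_goal = g - gripper_pos
        # Cosine alignment between the motion direction and the direction to the goal.
        align = np.dot(gripper_vel, to_goal) / (
            np.linalg.norm(gripper_vel) * np.linalg.norm(to_goal) + 1e-8
        )
        likelihoods.append(np.exp(beta * align))
    posterior = belief * np.array(likelihoods)
    return posterior / posterior.sum()

# Two candidate goals: a laptop that happens to lie along the motion, and a pot.
goals = {"laptop": np.array([0.4, 0.0]), "pot": np.array([0.3, 0.4])}
belief = np.array([0.5, 0.5])
belief = update_goal_belief(belief, gripper_pos=np.array([0.0, 0.0]),
                            gripper_vel=np.array([0.4, 0.0]),
                            goal_positions=list(goals.values()))
for name, p in zip(goals, belief):
    print(name, round(float(p), 2))   # geometry alone strongly favors the laptop
```

Because the update only sees directions and distances, the predictor confidently picks the laptop even though no human would pour pasta onto a keyboard.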
CASPER proposes a paradigm shift: Instead of just measuring motion vectors, let’s use Vision Language Models (VLMs) to look at the scene and reason about it like a human would.
The CASPER Architecture
CASPER is designed to act as a “shadow” to the human operator. As the user controls the robot in the foreground, CASPER runs a sophisticated reasoning loop in the background. It observes, predicts, and waits until it is confident before offering help.

As shown in Figure 1, the workflow is seamless:
- Human Teleoperation: The user starts the task manually.
- Intent Inference: The system analyzes the scene and the user’s input.
- Offer Help: When the system is sure (e.g., “Do you want to pick up the screwdriver?”), it prompts the user.
- Skill Execution: If the user confirms, the robot takes over and executes the skill.
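Conceptually, this workflow boils down to a single loop. The sketch below shows that loop, where every interface (robot, user, infer_intent, execute_skill) is a hypothetical placeholder standing in for CASPER's actual components, and the confidence threshold is illustrative.

```python
def assistive_loop(robot, user, infer_intent, execute_skill, confidence_threshold=0.8):
    """Minimal sketch of the Figure 1 loop. Every argument is a hypothetical
    interface: infer_intent(history) -> (intent, confidence) and
    execute_skill(robot, intent) stand in for CASPER's actual modules."""
    history = []
    while not user.task_done():
        # 1. Human teleoperation: pass the joystick command straight through.
        command = user.read_joystick()
        robot.apply(command)
        history.append((robot.camera_image(), command))

        # 2. Intent inference runs in the background over recent observations.
        intent, confidence = infer_intent(history)

        # 3. Offer help only once the inference is confident and the user agrees.
        if intent is not None and confidence >= confidence_threshold and user.confirm(intent):
            # 4. Skill execution: the robot autonomously completes this sub-task.
            execute_skill(robot, intent)
            history.clear()
```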
Let’s break down the technical architecture that makes this possible.
1. Open-World Perception
Traditional robots need to be trained on specific datasets to recognize objects (e.g., a “coke can” detector). CASPER needs to work in the real world, where any object might exist.
To achieve this, the authors utilize an Open-World Perception Module. They combine state-of-the-art vision models (like GroundingDINO and GSAM) to detect and segment objects in the scene based on open-vocabulary text descriptions. This means the robot can identify a “blue screwdriver,” a “sweetener packet,” or “the third door on the left” without needing specific training data for those objects.
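As a rough illustration of what this module does, the snippet below runs an open-vocabulary detector over a camera frame. It uses OWL-ViT through Hugging Face's zero-shot-object-detection pipeline purely as a convenient stand-in (the paper itself uses GroundingDINO and GSAM), and the image path and query list are made up.

```python
from PIL import Image
from transformers import pipeline

# Stand-in open-vocabulary detector; any model that accepts free-form text
# queries fits this slot in the architecture.
detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

image = Image.open("kitchen_scene.jpg")  # hypothetical robot camera frame
queries = ["pasta jar", "cooking pot", "laptop", "robot gripper"]

for det in detector(image, candidate_labels=queries):
    if det["score"] > 0.3:
        print(det["label"], det["box"])  # box: {'xmin': ..., 'ymin': ..., 'xmax': ..., 'ymax': ...}
```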
2. Generating Task Candidates
Before the robot can guess what you are doing, it needs to know what is possible.
CASPER uses a VLM (specifically GPT-4o in this paper) to analyze the current image and generate a list of plausible Task Candidates. The VLM combines the visual detections with a Skill Library (a list of things the robot can physically do, like PickUp, Place, OpenDoor, NavigateTo).
Crucially, the VLM applies commonsense filtering.
- If the robot is holding a mug, PickUp is not a valid candidate; Place or Pour are.
- If the robot is far away from a door, PushDoor is not valid; NavigateTo is.
This step generates a dynamic set of multiple-choice options, such as:
- A) Pick up the apple.
- B) Pick up the pink bowl.
- C) Navigate to the fridge.
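A minimal sketch of this candidate-generation step is shown below. In CASPER the filtering is done by the VLM's own commonsense rather than hand-written rules; the explicit rules, skill names, and scene state here are assumptions used only to make the idea concrete.

```python
from itertools import product

SKILL_LIBRARY = ["PickUp", "Place", "Pour", "OpenDoor", "NavigateTo"]  # illustrative subset

def generate_candidates(objects, holding, near):
    """Enumerate (skill, object) pairs and drop physically implausible ones.
    CASPER delegates this filtering to the VLM; explicit rules appear here
    only to make the idea concrete."""
    candidates = []
    for skill, obj in product(SKILL_LIBRARY, objects):
        if skill == "PickUp" and (holding is not None or obj == "door"):
            continue                          # hands full, or object not graspable
        if skill in ("Place", "Pour") and holding is None:
            continue                          # nothing in the gripper to place or pour
        if skill == "OpenDoor" and (obj != "door" or obj not in near):
            continue                          # must be a door, and within reach
        candidates.append(f"{skill}[{obj}]")
    return candidates

print(generate_candidates(["apple", "pink bowl", "door"], holding=None, near=["apple"]))
# -> ['PickUp[apple]', 'PickUp[pink bowl]', 'NavigateTo[apple]',
#     'NavigateTo[pink bowl]', 'NavigateTo[door]']
```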
3. VLM-Powered Intent Selection
This is the core innovation. Once the candidates are generated, CASPER needs to figure out which one the user actually wants.
The system feeds the VLM a history of the robot’s observations (images) and the user’s recent actions. The VLM acts as a detective, looking for cues.

Visual Prompting: To help the VLM understand the robot’s motion, the researchers don’t just send raw images. As seen in Figure 2, they overlay visual aids onto the images:
- Gripper Masks: Highlighting the robot’s hand so the VLM knows where “self” is.
- Motion Arrows: Drawing 2D arrows indicating the recent trajectory of the gripper or base.
This allows the VLM to reason: “The gripper is empty, and it is moving toward the right, directly at the apple. The user likely intends to PickUp[Apple].”
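A minimal sketch of such an overlay, using OpenCV, might look like the following. The mask, trajectory, and styling choices (color, arrow thickness) are assumptions; the paper's exact rendering may differ.

```python
import cv2
import numpy as np

def add_visual_prompts(frame, gripper_mask, trajectory_px, color=(0, 255, 0)):
    """Overlay a gripper mask and a 2D motion arrow onto a camera frame.

    frame: HxWx3 BGR image; gripper_mask: HxW boolean array from the
    segmentation model; trajectory_px: list of (x, y) pixel positions of the
    gripper over the last few frames.
    """
    out = frame.copy()

    # Tint the gripper region so the VLM can tell "self" apart from the scene.
    overlay = out.copy()
    overlay[gripper_mask] = color
    out = cv2.addWeighted(overlay, 0.4, out, 0.6, 0)

    # Draw an arrow from the oldest to the newest gripper position.
    if len(trajectory_px) >= 2:
        cv2.arrowedLine(out, trajectory_px[0], trajectory_px[-1],
                        color, thickness=3, tipLength=0.2)
    return out

# Toy usage with dummy data.
frame = np.zeros((240, 320, 3), dtype=np.uint8)
mask = np.zeros((240, 320), dtype=bool)
mask[100:140, 150:190] = True
cv2.imwrite("prompted.png", add_visual_prompts(frame, mask, [(60, 200), (150, 120)]))
```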
4. The Confidence Mechanism (Self-Consistency)
A robot that interrupts you with wrong guesses is more annoying than a robot that does nothing. To prevent “premature assistance,” CASPER employs a confidence mechanism inspired by Self-Consistency.
The system doesn’t just ask the VLM once. It asks the VLM multiple times (e.g., K times) in parallel to predict the intent.
- If the VLM returns “Pick up Apple” 9 out of 10 times, the confidence is high. The system interrupts the user and offers help.
- If the VLM returns a mix of “Pick up Apple,” “Pick up Bowl,” and “Navigate to Door,” the confidence is low. The system stays silent and lets the user continue teleoperating.
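A minimal sketch of this voting scheme is below. The query_vlm callable, the value of K, and the agreement threshold are all hypothetical; the point is simply that the system acts only when repeated samples agree.

```python
import random
from collections import Counter

def infer_intent_with_confidence(query_vlm, prompt, k=10, threshold=0.8):
    """Self-consistency check: sample the VLM K times (at non-zero temperature)
    and only report an intent when a large enough fraction of samples agree.
    query_vlm is a hypothetical callable returning one candidate label."""
    votes = Counter(query_vlm(prompt) for _ in range(k))
    intent, count = votes.most_common(1)[0]
    confidence = count / k
    if confidence >= threshold:
        return intent, confidence     # confident: interrupt and offer help
    return None, confidence           # ambiguous: stay silent, keep observing

# Toy usage with a stubbed "VLM" that is 90% consistent.
stub = lambda _: random.choices(["PickUp[apple]", "PickUp[pink bowl]"], [0.9, 0.1])[0]
print(infer_intent_with_confidence(stub, "Which task is the user doing?"))
```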

Figure 7 illustrates this beautifully.
- Top row (T=0 to T=40): The user is moving generally toward a wall. The intent is ambiguous. The system waits.
- By T=100: The user has clearly approached the door. The system becomes confident (“Go to wooden door”) and takes over.
This dynamic thresholding is what makes the system feel “smart” rather than intrusive.
5. Skill Execution
Once the intent is confirmed, CASPER triggers its Parameterized Skill Library. These are modular, pre-programmed behaviors.
If the intent is Pour[Pasta, Pot], the system:
- Calls a specialized VLM to estimate the parameters (Where exactly is the pot? How high is it?).
- Executes the motion using low-level planners (Inverse Kinematics or navigation stacks).
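Here is a minimal sketch of what dispatching into such a skill library could look like. The robot methods (estimate_grasp_pose, move_arm_to, and so on) and the specific skills included are assumptions for illustration; only the intent string format (e.g., Pour[Pasta, Pot]) comes from the paper's examples.

```python
from typing import Callable, Dict

def pick_up(robot, target: str):
    pose = robot.estimate_grasp_pose(target)    # parameters come from the perception module
    robot.move_arm_to(pose)
    robot.close_gripper()

def pour(robot, source: str, container: str):
    pose = robot.estimate_pour_pose(container)  # e.g. estimated by a VLM/perception call
    robot.move_arm_to(pose)
    robot.tilt_gripper()

SKILLS: Dict[str, Callable] = {"PickUp": pick_up, "Pour": pour}  # illustrative subset

def execute_skill(robot, intent: str):
    """Parse an intent like 'Pour[Pasta, Pot]' and dispatch the matching skill."""
    name, args = intent.rstrip("]").split("[")
    SKILLS[name](robot, *[a.strip() for a in args.split(",")])
```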
Experimental Validation
To prove that commonsense reasoning beats geometric calculation, the researchers tested CASPER on a TIAGo mobile manipulator.
The Setup
They designed three challenging real-world tasks (shown in Figure 3):
- Shelf Task: Pick a specific jar from a shelf, navigate to a table, and pour it into a specific container.
- Toy Task: A long-horizon task involving picking a toy, navigating rooms, opening doors (using a card reader), and placing the toy.
- Door Task: Navigating to and opening various types of doors (push, button, card swipe).

They compared CASPER against three baselines:
- Full Teleop: Manual control (no assistance).
- HAT: Uses gripper-to-goal distance to guess intent.
- RBII: Uses Bayesian inference on user joystick inputs.
Quantitative Results: Success Rates
The results were stark. As shown in the table below, CASPER dominated the baselines.

- Task Success: CASPER achieved an average 88.9% success rate, compared to just 40-45% for the assistive baselines (HAT and RBII). The baselines often failed because they couldn’t distinguish between objects that were close together or required context to understand.
- Efficiency: Despite the time taken for VLM inference, CASPER had the fastest completion times (218s average vs. 256s for manual teleop).
The “Sweetener vs. Pan” Test
Why did the baselines fail? Figure 6 provides a perfect example of the “Commonsense Gap.”

In this scenario, the user wants to pour food into the Pan (black object). However, the Sweetener box (white object) is physically closer to the gripper’s path.
- Geometric Baselines (HAT/RBII): They see the gripper getting closer to the sweetener. They predict “Pour into Sweetener.” This is a catastrophic failure (and a messy one).
- CASPER: The VLM recognizes the objects. It uses commonsense: You pour food into a pan, not a cardboard box. It correctly identifies the Pan as the target, despite the geometric data.
User Experience: Workload and Satisfaction
Robots are tools for humans, so the user’s subjective experience is just as important as the success rate. The researchers used the NASA-TLX (Task Load Index) to measure cognitive strain.

Figure 4 reveals significant improvements:
- Lower Workload: CASPER (orange bars) scored significantly lower on Mental Demand, Physical Demand, and Frustration compared to Full Teleop (gray).
- Higher Satisfaction: Users trusted CASPER more. They felt safer and more confident. The baselines (HAT/RBII) scored poorly on trust because they kept offering the wrong help, which users described as “annoying” or “alarming.”
Analysis and Ablations
The researchers didn’t just stop at “it works.” They dug into why it works through ablation studies.

1. Does Visual Prompting Matter?
Look at the left chart in Figure 5. The bar “Casper - No VP” (No Visual Prompting) is lower than the full CASPER model. This shows that drawing those green arrows and gripper masks on the image helps the VLM understand the scene better, boosting success rates by roughly 6%.
2. The Importance of Patience
The right chart in Figure 5 shows the “False Prediction Rate.” The blue line represents CASPER without the confidence module (it guesses immediately); the orange line is the full system. Without the confidence check, the robot makes many more false predictions. By waiting for self-consistency, the error rate drops significantly.
Conclusion
CASPER represents a significant step forward in human-robot interaction. By integrating Vision Language Models, the system moves beyond simple geometry and begins to understand the semantics of a task.
It solves the “Pasta Jar” problem not by measuring millimeters, but by understanding the relationship between “pasta,” “pot,” and “pouring.”
Key Takeaways:
- Commonsense is King: Purely geometric intent inference fails in cluttered, real-world environments.
- Shadowing works: Allowing the robot to “think” in the background while the user acts in the foreground creates a fluid user experience.
- Confidence prevents frustration: A robot that knows when not to help is just as important as one that knows how to help.
As VLMs continue to get faster and more accurate, systems like CASPER will likely become the standard for assistive robotics, empowering users with motor impairments to interact with their environments more independently and with far less cognitive effort.