Imagine you are sitting in a chair, and a robotic arm holding a razor blade begins to move toward your face. You know the robot is designed to assist with shaving, but you don’t know its exact path. Is it aiming for your cheek? Is it going to pause? Is it moving too fast?
In scenarios involving Physical Human-Robot Interaction (pHRI), such as robotic feeding, bathing, or shaving, the difference between a helpful interaction and a terrifying one often comes down to transparency. If the robot could just tell you, “I am going to move slowly towards your left cheek to trim your sideburns,” the anxiety would vanish, and the collaboration would be seamless.
However, giving robots the ability to “speak” their intent is surprisingly difficult. Traditionally, this required hard-coding specific phrases for specific tasks. But what if we could build a system that allows any robot to explain any physical task, just by looking at the environment and its own motion plan?
This is the problem tackled by CoRI (Communication of Robot Intent), a new research framework that combines robotic motion planning with the reasoning power of modern Vision-Language Models (VLMs).

The Core Problem: The Silent Robot
As robots move from industrial cages into our homes and care facilities, they need to interact physically with humans. In assistive robotics, a robot might wipe a person’s face, feed them, or help them dress.
The standard way robots communicate their intent today is through visual cues: LED lights blinking, projecting arrows onto the floor, or displaying a path on a screen. While these methods work for navigation (e.g., a robot vacuum signaling it’s turning left), they fail in complex manipulation tasks. A blinking light doesn’t tell you why the robot is applying pressure to your arm or that it needs you to lean forward.
Natural language is the most intuitive interface for humans. The challenge lies in generation. A robot “knows” its plan as a series of mathematical waypoints (x, y, z coordinates, velocities, and forces). A human understands concepts like “wiping the arm” or “approaching the mouth.” CoRI is designed to translate the former into the latter.
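To appreciate the gap, it helps to see what a motion plan actually looks like from the robot's side. The snippet below is a purely illustrative sketch; the field names and numbers are made up and are not CoRI's internal format:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Waypoint:
    """One step of a motion plan: raw numbers with no semantics attached."""
    position: Tuple[float, float, float]  # end-effector x, y, z (meters)
    velocity: Tuple[float, float, float]  # linear velocity (m/s)
    force: float                          # commanded contact force (Newtons)
    gripper_open: bool                    # gripper state at this waypoint

# What the robot "knows":
plan = [
    Waypoint((0.42, -0.10, 0.95), (0.00, 0.02, -0.01), 0.0, False),
    Waypoint((0.40, -0.05, 0.91), (0.00, 0.03, -0.02), 1.5, False),
]

# What the human needs to hear:
# "I'm moving toward your left cheek and will press gently."
```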
The CoRI Solution: A Task-Agnostic Pipeline
The researchers introduce CoRI as a pipeline that generates natural language explanations without being trained on specific tasks. It doesn’t know what “shaving” is beforehand. Instead, it figures it out on the fly.
The pipeline answers three critical questions for the user:
- Intention: What is the robot trying to achieve overall?
- Motion: How will it move? (Speed, direction, force).
- Cooperation: What does the human need to do?
The Architecture
The beauty of CoRI lies in how it processes data. As shown in the overview below, the system takes two inputs that live in very different representations: a 2D image from the robot’s camera and a list of 3D waypoints describing the robot’s planned motion.

The pipeline consists of three main stages:
- Interaction-Aware Trajectory Encoding: Visualizing the math.
- Visual Reasoning (VLM): Interpreting the scene.
- Verbal Communication (LLM): Generating the speech.
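Before diving into each stage, here is a hypothetical skeleton of how the three stages chain together. The function bodies are placeholders standing in for the components described below, not the authors' code; they only illustrate the data flow:

```python
def encode_trajectory(image, trajectory):
    """Stage 1: segment the plan and render each segment onto the camera image."""
    return [image]  # placeholder: one annotated frame per segment

def reason_visually(annotated_frames):
    """Stage 2: ask a VLM what the scene and each drawn segment show."""
    return "the robot holds a cloth and will wipe along the left forearm"  # placeholder

def verbalize(summary):
    """Stage 3: ask an LLM to phrase the summary as friendly, user-directed speech."""
    return f"Heads up: {summary}."  # placeholder

def communicate_intent(image, trajectory):
    return verbalize(reason_visually(encode_trajectory(image, trajectory)))
```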
Let’s break down the technical innovations in each stage.
1. Translating Math to Graphics (Trajectory Encoding)
Vision-Language Models (like GPT-4o) are incredibly good at understanding images, but they struggle to interpret raw lists of numerical coordinates. To solve this, CoRI converts the robot’s motion plan into a visual overlay on top of the camera feed.
First, the system detects the human in the scene using pose estimation (identifying wrists, elbows, shoulders). Then, it processes the trajectory \(\tau\): a list of waypoints, where each waypoint contains position, velocity, force, and gripper-state data.
However, explaining a long, complex movement all at once is confusing. CoRI automatically segments the trajectory into “chunks” based on interaction events. The researchers defined a specific logic for when to slice a trajectory into a new segment:
\[
(g_i \neq g_{i+1}) \;\lor\; \big(\mathbb{1}[f_i \neq 0] \neq \mathbb{1}[f_{i+1} \neq 0]\big) \;\lor\; (\Delta t_i > 2\,\text{s})
\]

where \(g_i\) is the gripper state, \(f_i\) the contact force, and \(\Delta t_i\) the time the robot dwells at waypoint \(i\).
This equation essentially says a new segment begins if:
- Gripper Change (\(g_i \neq g_{i+1}\)): The robot opens or closes its hand (e.g., grabbing a towel).
- Force Change: The robot transitions from moving through free space (\(f=0\)) to making contact (\(f \neq 0\)), or vice versa.
- Pause: The robot stops moving for more than 2 seconds (likely waiting for the human).
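In code, that segmentation rule could look roughly like the following sketch (illustrative field names, not the paper's implementation):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Waypoint:
    t: float            # timestamp (seconds)
    gripper_open: bool  # gripper state g_i
    force: float        # contact force f_i (Newtons)
    speed: float        # end-effector speed (m/s)

def split_segments(traj: List[Waypoint], pause_s: float = 2.0) -> List[List[Waypoint]]:
    """Cut a trajectory into segments at interaction events (assumes traj is non-empty)."""
    segments, current = [], [traj[0]]
    for prev, nxt in zip(traj, traj[1:]):
        gripper_change = prev.gripper_open != nxt.gripper_open
        contact_change = (prev.force == 0.0) != (nxt.force == 0.0)
        long_pause = prev.speed == 0.0 and (nxt.t - prev.t) > pause_s
        if gripper_change or contact_change or long_pause:
            segments.append(current)
            current = []
        current.append(nxt)
    segments.append(current)
    return segments
```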
Once segmented, the pipeline draws the trajectory onto the image. This isn’t just a simple line; it’s a data-rich visualization designed for an AI to read:
- Start/End: Marked with blue and red squares.
- Velocity: Represented by color brightness (dark green = slow, bright green = fast).
- Force: Represented by line color (cyan = no force, gradient to magenta = high force).
This allows the VLM to “see” dynamics. If the AI sees a magenta line, it knows the robot is pushing against something. If it sees a bright green line, it knows the robot is moving quickly.

In the image above (Figure 8), you can see this visualization in action during a shaving task. The skeletal tracking identifies the human arm, and the colored lines show the robot’s intended path along the arm. This visual context is what allows the AI to ground the robot’s numbers in the real world.
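To make the encoding concrete, here is a rough OpenCV sketch of how such an overlay could be drawn. The exact color values, and how the velocity and force colorings are combined, are my assumptions rather than the paper's specification:

```python
import cv2

def draw_segment(image, points_px, speeds, forces, max_speed=0.2, max_force=5.0):
    """Draw one trajectory segment onto a camera frame.

    points_px: list of (x, y) integer pixel coordinates of projected waypoints
    speeds:    end-effector speed at each waypoint (m/s)
    forces:    contact force at each waypoint (N)
    """
    img = image.copy()
    for i in range(len(points_px) - 1):
        if forces[i] > 0:
            # In contact: interpolate cyan (low force) -> magenta (high force), in BGR.
            a = min(forces[i] / max_force, 1.0)
            color = (255, int(255 * (1 - a)), int(255 * a))
        else:
            # Free space: dark green (slow) -> bright green (fast).
            b = min(speeds[i] / max_speed, 1.0)
            color = (0, int(80 + 175 * b), 0)
        cv2.line(img, points_px[i], points_px[i + 1], color, thickness=3)

    # Blue square marks the start, red square marks the end (BGR colors).
    for (x, y), marker_color in ((points_px[0], (255, 0, 0)), (points_px[-1], (0, 0, 255))):
        cv2.rectangle(img, (x - 6, y - 6), (x + 6, y + 6), marker_color, -1)
    return img
```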
2. The Visual Reasoning Engine
Once the trajectory is visually encoded, CoRI queries a Vision-Language Model. It performs a two-step reasoning process:
Step A: Environment Comprehension. The model looks at the scene (with faces blurred for privacy) and identifies the context. It sees a person, perhaps a bed, and the tool in the robot’s gripper.
- Query: “What is the robot holding?”
- VLM Output: “The robot is holding a white cloth. It is likely used for cleaning or wiping.”
Step B: Trajectory Comprehension. The system then feeds the VLM the images with the trajectory overlays. It asks structured questions: “Where is the blue square? Is there force involved? What body part is it near?”
By combining the environment context (“holding a razor”) with the trajectory visuals (“moving along the forearm with light force”), the VLM deduces the intent: “The robot is shaving the arm.” This is achieved without the robot ever being explicitly programmed to know what shaving is.
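A generic version of this two-step querying might look like the snippet below, using the OpenAI Python client with GPT-4o. The prompts and file names are hypothetical; CoRI's actual queries are more structured than this:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_vlm(image_path: str, question: str) -> str:
    """Send one annotated camera frame plus a question to the VLM."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Step A: environment comprehension on the raw (face-blurred) frame.
scene = ask_vlm("frame_blurred.png", "What is the robot holding, and who is nearby?")

# Step B: trajectory comprehension on the frame with the drawn overlay.
motion = ask_vlm(
    "frame_overlay.png",
    "Where does the drawn path start (blue square) and end (red square)? "
    "Does the line color indicate contact force? Which body part is it near?",
)
```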
3. Generating the Statement
Finally, a Reasoning LLM (like o3-mini) takes the structured summary from the VLM and turns it into natural, user-directed speech. The researchers prioritize concise, friendly, and instructive language.
Instead of saying, “Trajectory segment 2 moves from coordinates X to Y with 2 Newtons of force,” the robot says:
“I’m now moving from your left wrist to your left elbow… gradually increasing my touch with the towel for a gentle sweep.”
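This last step can be sketched as a single prompt to a reasoning model. Again, the prompt wording below is hypothetical; the paper's actual instructions are more detailed:

```python
from openai import OpenAI

client = OpenAI()

def generate_speech(scene_summary: str, motion_summary: str) -> str:
    """Turn the VLM's findings into short, user-directed speech."""
    prompt = (
        "You are the voice of an assistive robot about to perform the motion "
        "described below. Speak directly to the user in one or two friendly, "
        "concise sentences. Mention speed, force, and anything the user should do.\n\n"
        f"Scene: {scene_summary}\nPlanned motion: {motion_summary}"
    )
    response = client.chat.completions.create(
        model="o3-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```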
Experimental Setup: Putting CoRI to the Test
To prove that this pipeline works across different scenarios, the researchers implemented three distinct assistive tasks using two different robot platforms (Stretch RE1 and xArm 7).

- Simulated Bathing: The robot holds a washcloth and wipes a user’s arm. This tests the communication of force and velocity changes.
- Simulated Shaving: The robot holds a clipper (with a fake blade) and moves along the arm. This tests the communication of complex trajectory shapes and precision.
- Feeding: The robot brings a spoon to the user’s mouth. This tests user cooperation (telling the user when to open their mouth).
In the user study, 16 participants interacted with the robots. They experienced the tasks under three conditions:
- No Communication: The robot moved silently.
- Baseline (Scripted): The robot used a standard template like “I am moving towards your [Left Wrist].”
- CoRI (Ours): The robot used the generated natural language explanations.
Results: Did Communication Improve?
The results of the user study were compelling. The researchers measured performance using Likert-scale questionnaires focusing on motion comprehension and communication clarity.

As shown in Figure 4, CoRI (the dark blue bars) significantly outperformed the baseline and no-communication strategies.
- Motion Comprehension (Left Plot): Users felt significantly more confident predicting the robot’s next action (L1) and understanding what the robot was going to do (L2) when CoRI was active.
- Communication Quality (Right Plot): The differences were even more stark here.
  - L4 (Intention): CoRI was far better at communicating why the robot was moving.
  - L6 (Cooperation): This was a critical win. CoRI successfully told users what they needed to do (e.g., “keep your arm still” or “lean forward”). The scripted baseline often failed here because it couldn’t infer the context of the interaction.
Entailment: Is the Robot Telling the Truth?
One major risk with using LLMs is “hallucination”—the AI making things up. To verify accuracy, the researchers compared the CoRI-generated statements against “Ground Truth” descriptions of the trajectories. They used a metric called Entailment Probability to check if the generated text was logically consistent with the actual plan.
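Conceptually, the check can be reproduced with any off-the-shelf natural language inference (NLI) model: treat the ground-truth description as the premise and the generated statement as the hypothesis, then read off the entailment probability. The sketch below uses a public MNLI model; the paper's exact scoring setup may differ:

```python
from transformers import pipeline

# A public NLI model as a stand-in for the evaluation's entailment scorer.
nli = pipeline("text-classification", model="roberta-large-mnli")

ground_truth = ("The robot moves from the left wrist to the left elbow, "
                "gradually increasing contact force, then pauses.")
generated = ("I'm now moving from your left wrist to your left elbow, "
             "gradually increasing my touch with the towel.")

# Premise = what the plan actually does; hypothesis = what the robot said.
scores = nli({"text": ground_truth, "text_pair": generated}, top_k=None)
entailment = next(s["score"] for s in scores if s["label"] == "ENTAILMENT")
print(f"Entailment probability: {entailment:.2f}")
```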

Table 1 shows that CoRI achieved entailment scores of roughly 0.95 across all tasks. This is comparable to the “Oracle” (human-written summaries) and significantly higher than the baseline. This confirms that the pipeline doesn’t just sound natural; it is technically accurate regarding the robot’s position, speed, and force.
Why This Matters
The significance of CoRI extends beyond just shaving or feeding. It represents a shift in how we program robots.
- Generalization: We don’t need to write new communication scripts for every new task. If a robot learns to comb hair or paint a wall, CoRI can automatically generate the explanation for it just by analyzing the motion plan.
- Trust: By explaining “invisible” factors like force and velocity before they happen, robots become less intimidating.
- Accessibility: Natural language lowers the barrier to entry. A user doesn’t need to understand robotics or read a complex display; they just need to listen.
Conclusion
The CoRI pipeline demonstrates that the gap between low-level robot control (numbers and forces) and high-level human understanding (language and intent) can be bridged using visual reasoning. By turning motion plans into images and letting advanced AI models interpret them, robots can finally explain themselves.
As assistive robots become common fixtures in elderly care and physical therapy, systems like CoRI will be essential. They transform the robot from a silent, unpredictable machine into a communicative, transparent partner.
The visuals and data discussed in this post are based on the research paper “CoRI: Communication of Robot Intent for Physical Human-Robot Interaction.”