[CoRI: Communication of Robot Intent for Physical Human-Robot Interaction 🔗](https://arxiv.org/abs/2505.20537)

Decoding the Black Box: How Robots Can Explain Their Moves Before They Touch You

Imagine you are sitting in a chair, and a robotic arm holding a razor blade begins to move toward your face. You know the robot is designed to assist with shaving, but you don’t know its exact path. Is it aiming for your cheek? Is it going to pause? Is it moving too fast? In scenarios involving Physical Human-Robot Interaction (pHRI), such as robotic feeding, bathing, or shaving, the difference between a helpful interaction and a terrifying one often comes down to transparency. If the robot could just tell you, “I am going to move slowly towards your left cheek to trim your sideburns,” the anxiety would vanish, and the collaboration would be seamless. ...

2025-05 · 8 min · 1666 words
[MirrorDuo: Reflection-Consistent Visuomotor Learning from Mirrored Demonstration Pairs 🔗](https://openreview.net/pdf?id=cUeY476ohd)

Double Your Robot's Skills: How MirrorDuo Uses Reflection for Efficient Learning

Imagine you are teaching a child to catch a ball. You demonstrate the motion with your right hand. Intuitively, the child understands that they can perform a similar motion with their left hand, just mirrored. They don’t need to relearn the physics of the ball or the concept of catching from scratch; they simply apply a reflection symmetry to what they already know. ...

8 min · 1552 words
[AimBot: A Simple Auxiliary Visual Cue to Enhance Spatial Awareness of Visuomotor Policies 🔗](https://arxiv.org/abs/2508.08113)

Aim Like a Robot: Enhancing Spatial Awareness with AimBot

In the world of first-person shooter (FPS) video games, players rely heavily on a simple UI element: the crosshair. It provides an explicit visual anchor that connects the player’s intent with the game world’s geometry. Now, imagine playing that same game without one: estimating the exact center of the screen, and consequently where your character is aiming, becomes incredibly difficult, and you miss targets that should be easy to hit. Surprisingly, this is exactly how we have been training state-of-the-art robots to manipulate the world. ...

2025-08 · 11 min · 2141 words
[Latent Adaptive Planner for Dynamic Manipulation 🔗](https://arxiv.org/abs/2505.03077)

How Robots Learn to Catch: Inside the Latent Adaptive Planner

Imagine someone tossing a cardboard box at you. You don’t freeze, calculate the wind resistance, solve a differential equation, and then move. You react fluidly. You extend your arms, anticipate the contact, and as the box hits your hands, you pull back slightly to absorb the impact. This is dynamic manipulation—interacting with objects through rapid contact changes and physical forces. For humans, this is instinct. For robots, it is an algorithmic nightmare. ...

2025-05 · 9 min · 1712 words
[Improving Efficiency of Sampling-based Motion Planning via Message-Passing Monte Carlo 🔗](https://arxiv.org/abs/2410.03909)

Better Than Random: Using Graph Neural Networks to Optimize Robot Motion Planning

The Problem with “Random”: Imagine you are a robot arm tasked with reaching into a crowded box to pick up a specific object without touching the sides. To figure out how to move, you need a plan. In the world of robotics, one of the most popular ways to find this plan is to play a game of “connect the dots.” You scatter thousands of random points across the space of possible movements, check which ones are safe, and connect them to form a roadmap. ...

2024-10 · 9 min · 1881 words
[CASPER: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models 🔗](https://arxiv.org/abs/2506.14727)

How VLMs Are Giving Robots Commonsense for Shared Control

The “Pasta Jar” Problem: Imagine you are sitting in a wheelchair, using a joystick to control a robotic arm attached to your chair. You are in the kitchen, and your goal is to make dinner. You navigate the robot toward a shelf, pick up a jar of pasta, and move it toward the counter where a cooking pot and a laptop are sitting side-by-side. To a human observer, your intent is obvious: you are going to pour the pasta into the pot. However, to a traditional robotic system, this is a baffling geometric puzzle. If the pasta jar happens to pass slightly closer to the laptop than the pot, a standard robot might infer that you want to pour the pasta onto the keyboard. ...

2025-06 · 10 min · 1924 words
[VT-Refine: Learning Bimanual Assembly with Visuo-Tactile Feedback via Simulation Fine-Tuning 🔗](https://arxiv.org/abs/2510.14930)

Giving Robots the Human Touch: Mastering Bimanual Assembly with VT-Refine

Imagine you are trying to plug a charging cable into a port behind your desk. You can’t really see the port, or perhaps your hand is blocking the view. How do you do it? You rely on touch. You feel around for the edges, align the connector, and gently wiggle it until you feel it slide into place. This interplay between vision (locating the general area) and touch (performing the precise insertion) is second nature to humans. However, for robots, reproducing this “bimanual assembly” capability is an immense challenge. While computer vision has advanced rapidly, giving robots the ability to “feel” and react to physical contact—especially with two hands simultaneously—remains a frontier in robotics research. ...

2025-10 · 9 min · 1862 words
[Few-Shot Neuro-Symbolic Imitation Learning for Long-Horizon Planning and Acting 🔗](https://arxiv.org/abs/2508.21501)

From Pixels to Plans: How Robots Learn to Reason and Act from Just Five Demonstrations

Imagine trying to teach a robot how to cook a meal. You could grab its arm and guide it through the motions: opening the fridge, grabbing an egg, cracking it, and frying it. This is the essence of Imitation Learning (IL). It works wonderfully for short, specific skills. But what happens when you ask that same robot to prepare a three-course dinner? Suddenly, the robot isn’t just copying a motion; it needs to plan. It needs to understand that the “egg” must be “cracked” before it can be “fried,” and that the “pan” must be “hot” before the egg goes in. If the robot only knows how to mimic motions, it will inevitably fail over a long sequence. It might try to fry the eggshell, or put the egg in a cold pan, because it doesn’t understand the logic behind the task. ...

2025-08 · 9 min · 1767 words
[From Tabula Rasa to Emergent Abilities: Discovering Robot Skills via Real-World Unsupervised Quality-Diversity 🔗](https://arxiv.org/abs/2508.19172)

URSA: How Robots Teach Themselves to Survive in the Real World

The Blank Slate Problem: In the natural world, animals are born with a capacity to learn that far outstrips our current robotic systems. A foal learns to stand, walk, and eventually run, not by following a hard-coded script, but by interacting with its environment, failing, adjusting, and discovering what works. When an animal gets injured, it instinctively adapts its gait to favor the injured limb. It doesn’t need a software update; it relies on a diverse repertoire of movement skills it has acquired over time. ...

2025-08 · 9 min · 1758 words
[exUMI: Extensible Robot Teaching System with Action-aware Task-agnostic Tactile Representation 🔗](https://arxiv.org/abs/2509.14688)

Giving Robots the Magic Touch: Inside exUMI and Tactile Predictive Pretraining

Imagine trying to tie your shoelaces with numb fingers. You can see the laces perfectly, but without the subtle feedback of tension and texture, the task becomes clumsy and frustrating. This is the current state of most robotic manipulation. While computer vision has seen explosive growth, allowing robots to “see” the world with high fidelity, the sense of touch—tactile perception—remains a significant bottleneck. Robots generally struggle to manipulate objects where force and contact are critical, such as pulling a stuck drawer, inserting a key into a lock, or handling soft fruit. The challenges are twofold: Hardware (collecting reliable tactile data is hard) and Algorithms (teaching a robot to “understand” what it feels is even harder). ...

2025-09 · 8 min · 1682 words
[IRIS: An Immersive Robot Interaction System 🔗](https://arxiv.org/abs/2502.03297)

Step Into the Robot's World: How IRIS is Revolutionizing Data Collection with XR

In the rapidly evolving field of robotics, data is the new oil. For a robot to learn how to fold laundry, cook a meal, or assemble a car, it usually needs to observe thousands of demonstrations of that task being performed. This is the foundation of Imitation Learning and Robot Learning. However, teaching a robot isn’t as simple as showing it a video. The robot needs rich, 3D data about joint angles, spatial relationships, and physics. Traditionally, researchers have used keyboards, 3D mice, or expensive physical “puppet” robots to generate this data. These methods are often clunky, non-intuitive, or prohibitively expensive. ...

2025-02 · 10 min · 1992 words
[GraspQP: Differentiable Optimization of Force Closure for Diverse and Robust Dexterous Grasping 🔗](https://arxiv.org/abs/2508.15002)

Beyond the Power Grasp: How GraspQP Brings Physics and Diversity to Robotic Hands

If you look at your own hand, you’ll realize it is a marvel of engineering. You can wrap your fingers around a heavy hammer to swing it (a power grasp), but you can also delicately hold a key between your thumb and index finger to unlock a door (a pinch grasp), or use three fingers to manipulate a pen (a precision grasp). For robots, replicating this versatility is a massive challenge. While we have built sophisticated multi-fingered robotic hands (like the Shadow Hand or the Allegro Hand), teaching them how to use that dexterity remains difficult. The bottleneck is often data. To train a robot to grasp anything, we need massive datasets containing millions of examples of stable, physically realistic grasps. ...

2025-08 · 9 min · 1806 words
[ATK: Automatic Task-driven Keypoint Selection for Robust Policy Learning 🔗](https://arxiv.org/abs/2506.13867)

How Robots Learn to Ignore the Noise: Automatic Task-driven Keypoint Selection

Imagine you are teaching a robot to pick up a transparent glass pot. In a sterile lab environment or a computer simulation, this is relatively easy. The lighting is perfect, the background is a solid color, and the object’s position is known precisely. Now, move that robot into a real kitchen. Sunlight is streaming through the window, casting moving shadows. There is a patterned tablecloth. A bright blue coffee mug is sitting next to the pot. Suddenly, the robot fails. It gets confused by the mug, it can’t “see” the clear glass correctly because its depth sensors struggle with transparency, or the new lighting changes the pixel values so much that the robot’s neural network thinks it’s looking at a completely different scene. ...

2025-06 · 9 min · 1817 words
[ARCH: Hierarchical Hybrid Learning for Long-Horizon Contact-Rich Robotic Assembly 🔗](https://arxiv.org/abs/2409.16451)

Mastering Long-Horizon Assembly: How ARCH Combines Planning and Learning

In the world of industrial automation, robots are incredibly proficient at repeating identical tasks in structured environments. A robot arm in an automotive factory can weld the same spot on a chassis millions of times with sub-millimeter precision. However, the moment you ask that same robot to assemble a piece of furniture—where parts might be scattered randomly, fit tightly, or require complex sequencing—the system often fails. This creates a significant gap in robotics: the inability to handle long-horizon, contact-rich assembly tasks. “Long-horizon” means the robot must perform a sequence of many dependent steps (e.g., grasp leg A, reorient leg A, insert leg A, repeat for legs B, C, and D). “Contact-rich” implies that the parts interact physically with friction and force, requiring a finesse that pure position control cannot achieve. ...

2024-09 · 8 min · 1632 words
[Generalist Robot Manipulation beyond Action Labeled Data 🔗](https://arxiv.org/abs/2509.19958)

How to Train Your Robot with YouTube: Inside MotoVLA

Imagine trying to learn how to cook a complex dish. You have two ways to learn. One is to have a master chef stand behind you, physically guiding your hands, correcting every chop of the onion, and adjusting the heat for you. The other way is to simply watch a few YouTube videos of someone else cooking the dish, and then try to replicate the motion yourself. In the world of robotics, the first method—explicit supervision with “action labels”—is the standard. We collect data by teleoperating robots (puppeteering them), recording exactly which motors moved and by how much. This produces high-quality data, but it is incredibly slow and expensive to collect. ...

2025-09 · 9 min · 1807 words
[Subteaming and Adaptive Formation Control for Coordinated Multi-Robot Navigation 🔗](https://arxiv.org/abs/2509.16412)

Divide and Conquer: How Hierarchical Learning Enables Robot Swarms to Navigate Narrow Corridors

Imagine a team of robots deployed for a search and rescue mission in a collapsed building. To ensure safety and maximum sensor coverage, they need to move in a specific formation—perhaps a circular perimeter protecting a human responder in the center. Everything works perfectly in the open atrium. But then, the team encounters a narrow hallway or a partially blocked bridge. This presents a fundamental conflict in multi-robot systems: the need for coordinated formation versus the need for environmental adaptability. If the robots rigidly stick to their circle, they cannot fit through the door. If they break formation completely, they lose their protective coordination. ...

2025-09 · 8 min · 1567 words
[Unsupervised Skill Discovery as Exploration for Learning Agile Locomotion 🔗](https://arxiv.org/abs/2508.08982)

Beyond Reward Engineering: How SDAX Teaches Robots Parkour via Unsupervised Skill Discovery

Deep Reinforcement Learning (RL) has revolutionized robotics, enabling legged machines to walk, run, and recover from falls. Yet, when we look at the most impressive demonstrations—robots doing backflips, leaping across wide gaps, or climbing vertically—there is often a hidden cost. That cost is human engineering effort. To achieve agile locomotion, researchers often have to manually design complex reward functions, gather expensive expert demonstrations, or carefully architect curriculum learning phases that guide the robot step-by-step. If you simply ask a robot to “move forward” across a gap, it will likely fall into the gap repeatedly and fail to learn that it needs to jump. It gets stuck in a local optimum. ...

2025-08 · 9 min · 1846 words
[Tool-as-Interface: Learning Robot Policies from Observing Human Tool Use 🔗](https://arxiv.org/abs/2504.04612)

Robots that Watch and Learn: Mastering Tool Use via Human Observation

Humans are master tool users. From using a hammer to drive a nail to flipping a pancake with a spatula, our ability to extend our physical capabilities through objects is a defining characteristic of our species. For robots, however, this remains a significant hurdle. While robots have become proficient at simple pick-and-place operations, dynamic tool use—which requires understanding the tool, the object, and the interaction between them—is far more complex. ...

2025-04 · 7 min · 1330 words
[GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering 🔗](https://arxiv.org/abs/2412.14480)

Solving the 'Where are my Keys?' Problem: How GraphEQA Grounds VLMs in 3D Space

Imagine you are visiting a friend’s house for the first time. You ask, “Where can I find a glass of water?” Even though you’ve never been in this specific house before, you know exactly what to do: look for the kitchen. Once in the kitchen, you look for cupboards or a drying rack. You don’t wander into the bedroom or inspect the bookshelf. You are using semantic reasoning combined with spatial exploration. For robots, replicating this intuitive process is incredibly difficult. This is the domain of Embodied Question Answering (EQA). A robot must explore an unseen environment, understand what it sees, remember where things are, and answer a natural language question. ...

2024-12 · 8 min · 1647 words
[Mechanistic Interpretability for Steering Vision-Language-Action Models 🔗](https://arxiv.org/abs/2509.00328)

Opening the Black Box: How to Steer Robot Minds Using Their Internal Concepts

Imagine a robot in your kitchen. You ask it to “pick up the apple.” It moves correctly. Now, you ask it to “pick up the apple carefully.” Does the robot actually understand the concept of “carefulness,” or is it just statistically mapping pixels to motor torques? In the rapidly evolving world of embodied AI, we are seeing the rise of Vision-Language-Action (VLA) models. These are massive neural networks—built on top of Large Language Models (LLMs)—that can see, read, and control robot bodies. Models like OpenVLA and \(\pi_0\) promise to be generalist agents that can adapt to new tasks “out of the box.” ...

2025-09 · 9 min · 1728 words