[Morphologically Symmetric Reinforcement Learning for Ambidextrous Bimanual Manipulation 🔗](https://arxiv.org/abs/2505.05287)

Ambidextrous Robots - Unlocking Bimanual Dexterity with Symmetry

If you have ever tried to write with your non-dominant hand, you know the struggle. Despite your left and right hands being nearly identical mirror images of each other—structurally and mechanically—your brain has specialized to favor one side for fine motor skills. This phenomenon, known as handedness, is efficient for humans but represents a significant limitation for robots. Bimanual robots (robots with two arms) are typically built with perfect bilateral symmetry. The left arm is a precise reflection of the right. Yet, when we teach these robots to perform tasks using Reinforcement Learning (RL), we often treat them like humans with a strong dominant hand. We might train the right arm to use a screwdriver while the left arm just holds the object. If the workspace is flipped, the robot fails or awkwardly reaches across its body, unable to transfer the skill to the other arm. ...

2025-05 · 10 min · 1998 words
[JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes 🔗](https://arxiv.org/abs/2505.06771)

Sim2Real in Minutes: Accelerating Multi-Robot Learning with JaxRobotarium

Imagine trying to teach a swarm of drones to fly in formation or a fleet of warehouse robots to sort packages without colliding. In the world of robotics, we are moving away from hard-coding these behaviors toward Multi-Agent Reinforcement Learning (MARL). The promise of MARL is enticing: let the robots learn to coordinate by themselves through trial and error. However, researchers in this field face a frustrating dilemma. On one hand, you have high-speed, game-like simulators (such as the StarCraft benchmarks) that allow for rapid training but ignore the laws of physics. On the other hand, you have realistic robotics testbeds that respect physics and safety but are agonizingly slow to train on because they cannot be easily parallelized. ...

2025-05 · 8 min · 1701 words
[AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World 🔗](https://arxiv.org/abs/2503.24278)

AutoEval: Let Your Robots Grade Their Own Homework

Imagine you have just trained a new, cutting-edge “generalist” robot policy—a brain capable of controlling a robot arm to do everything from folding laundry to sorting groceries. You are excited to see how well it works. But here is the problem: to statistically prove your model is good, you need to run it thousands of times across different scenarios. Who is going to sit there for 100 hours, putting the laundry back in the basket every time the robot successfully folds it? Who is going to reset the scene when the robot knocks a can of soup off the table? ...

2025-03 · 6 min · 1227 words
[Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation 🔗](https://arxiv.org/abs/2508.19958)

Breaking the Chain: How Long-VLA Masters Long-Horizon Robot Tasks

Imagine asking a robot to clean a kitchen. The instruction seems simple: “Clean up the kitchen.” However, for a robot, this isn’t a single action. It is a complex sequence: navigate to the counter, locate the sponge, pick it up, move to the sink, turn on the water, scrub the plate, and place it in the rack. In recent years, Vision-Language-Action (VLA) models have revolutionized robotics. By training on massive datasets of robot behavior and language, these models have become excellent at understanding instructions and performing short, discrete tasks. But there is a catch: while they are great at “picking up the sponge,” they often fail miserably when asked to string ten such actions together. ...

2025-08 · 9 min · 1757 words
[Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation 🔗](https://arxiv.org/abs/2502.16707)

How "Reflective Planning" Teaches Robots to Think Before They Act

Imagine you are trying to move a large sofa through a narrow doorway. Before you even lift a finger, you likely simulate the process in your head. You visualize tilting it at a 45-degree angle, realize that the legs will get stuck, and then revise your plan to take the legs off first. This process—mental simulation followed by critique and revision—is second nature to humans. In cognitive science, this is often linked to “System 2” thinking: slow, deliberative, and logical. ...

2025-02 · 8 min · 1570 words
[Learn from What We HAVE: History-Aware VErifier that Reasons about Past Interactions Online 🔗](https://arxiv.org/abs/2509.00271)

Stop Guessing, Start Verifying: How Robots Can Solve Ambiguity Using History-Aware Verification (HAVE)

We have all been there. You walk up to a door, grab the handle, and pull. It doesn’t budge. You realize, slightly embarrassed, that you were supposed to push. This scenario, often called the “Norman Door” problem, highlights a fundamental challenge in robotics: visual ambiguity. To a robot’s camera, a push-door and a pull-door often look identical. Two boxes might look the same, but one is empty while the other is filled with lead bricks on the left side. ...

2025-09 · 8 min · 1667 words
[Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild 🔗](https://openreview.net/pdf?id=iVbCWUDyBF)

Finding the Invisible: How Test-Time Adaptation Revolutionizes Autonomous Visual Search

Imagine a drone deployed over a dense forest in Yosemite Valley. Its mission: locate black bears. The drone has a satellite map, but the map resolution is too low to see a bear directly. Instead, the drone must rely on visual priors—intuition derived from the map about where bears likely are (e.g., “bears like dense vegetation, not parking lots”). But what happens if that intuition is wrong? What if the map is outdated, or the Vision-Language Model (VLM) guiding the drone hallucinates, predicting bears in an open field where none exist? In traditional systems, the drone would waste valuable battery life searching empty areas based on a bad initial guess. ...

9 min · 1706 words
[MEREQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention 🔗](https://arxiv.org/abs/2406.16258)

Teaching Robots New Tricks: How MEREQ Aligns Policies with Less Human Effort

Imagine you have trained a robot to push a bottle across a table. It does the job perfectly—the bottle gets from point A to point B. However, there is a catch: the robot pushes the bottle from the very top, making it wobble dangerously. As a human observer, you want to teach the robot a preference: “Push the bottle, but please push it from the bottom so it’s stable.” This is a classic problem in Interactive Imitation Learning (IIL). You don’t want to reprogram the robot from scratch; you just want to tweak its behavior through interventions. You watch the robot, and whenever it reaches for the top of the bottle, you take control and move its hand to the bottom. ...

2024-06 · 7 min · 1472 words
[TWIST: Teleoperated Whole-Body Imitation System 🔗](https://arxiv.org/abs/2505.02833)

Mastering the Humanoid: How TWIST Achieves Real-Time Whole-Body Teleoperation

Imagine you are carrying two heavy grocery bags into your house. You approach the door, balance on one leg, and use your other foot to push the door open. To you, this action is trivial. To a humanoid robot, this is a nightmare of physics, balance, and coordination. Humans possess an innate ability to couple locomotion (moving around) with manipulation (using hands) in a synchronized, “whole-body” manner. We crouch to reach under beds, we lunge to catch objects, and we adjust our stance to throw a ball. For humanoid robots to become truly useful in our homes, they need this same level of versatility. ...

2025-05 · 9 min · 1785 words
[KoopMotion: Learning Almost Divergence-Free Koopman Flow Fields for Motion Planning 🔗](https://arxiv.org/abs/2509.09074)

KoopMotion: Teaching Robots 'Muscle Memory' with Fluid Dynamics and Spectral Theory

Imagine teaching a robot to write the letter “A”. You grab its robotic arm, guide it through the strokes, and then expect it to repeat the motion on its own. This process, known as Learning from Demonstrations (LfD), is a holy grail in robotics. It bypasses the need for tedious code and allows anyone to program a robot simply by showing it what to do. But here is the catch: the real world is messy. If you bump the robot while it’s writing, or if a drone faces a sudden gust of wind, standard imitation learning methods often fail. The robot might struggle to recover its path or fail to stop exactly where it’s supposed to. It knows the path, but it doesn’t necessarily understand the flow of motion required to stabilize itself and converge to a stop. ...

2025-09 · 8 min · 1696 words
[Pointing3D: A Benchmark for 3D Object Referral via Pointing Gestures 🔗](https://openreview.net/pdf?id=h2K52fhsDU)

Point and Click (in Real Life): Teaching Robots to Understand 3D Pointing Gestures

Imagine you are working with a colleague to assemble a piece of furniture. You realize you need the screwdriver on the table behind them. You don’t say, “Please retrieve the object located at coordinates X: 1.2, Y: -0.5, Z: 0.8.” You simply point and say, “Pass me that.” Your colleague immediately understands the gesture, estimates the direction, identifies the specific object you mean, and hands it over. This interaction is seamless for humans, but it is incredibly difficult for robots. ...

8 min · 1585 words
[DEXPLORE: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration 🔗](https://arxiv.org/abs/2509.09671)

Beyond Retargeting: How DEXPLORE Teaches Robots to Move Like Humans by Learning Intent

The human hand is an engineering marvel. With over 20 degrees of freedom, dense tactile sensors, and complex muscle synergies, it allows us to perform tasks ranging from threading a needle to crushing a soda can with effortless grace. Replicating this level of dexterity in robots has been a “holy grail” challenge in robotics for decades. While we have made massive strides in computer vision and navigation, robotic manipulation—specifically using multi-fingered hands—remains surprisingly difficult. One of the most promising avenues to solve this is learning from human demonstrations. We have vast repositories of motion capture (MoCap) data showing humans interacting with objects. Theoretically, we should be able to feed this data to a robot and have it mimic the behavior. ...

2025-09 · 7 min · 1379 words
[Enabling Long(er) Horizon Imitation for Manipulation Tasks by Modeling Subgoal Transitions 🔗](https://openreview.net/pdf?id=fBRqCMqVyS)

Beyond Heuristics: How Soft Attention Solves Long-Horizon Robot Tasks

Imagine you are teaching a robot to make a cup of coffee. This isn’t just a single motion; it is a long sequence of distinct steps: grab the mug, place it under the machine, insert the pod, press the button, and finally, pick up the hot coffee. In robotics, we call this a Long-Horizon Task. While modern imitation learning has become quite good at short, atomic movements (like “pick up the mug”), stringing these skills together into a coherent, robust sequence remains a massive hurdle. The longer the sequence, the more likely the robot is to drift off course. A small error in step one becomes a catastrophe by step four. ...

8 min · 1642 words
[Cost-aware Discovery of Contextual Failures using Bayesian Active Learning 🔗](https://openreview.net/pdf?id=f2Y549UzM5)

Breaking Robots Better: How Bayesian Active Learning Finds Contextual Failures in Autonomous Systems

Imagine you are testing a new autonomous vehicle. You run it through a simulator on a sunny day, and it stops perfectly for pedestrians. You try it in the rain, and it works fine. But then, you deploy it in the real world, and it suddenly fails to brake for a cyclist at dusk, specifically when a parked car casts a shadow on the road. This is a contextual failure. The system didn’t fail because it was broken in a general sense; it failed because a specific combination of environmental context (lighting), task constraints (braking distance), and system state (sensor noise) conspired to trick it. ...

9 min · 1766 words
[ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training 🔗](https://arxiv.org/abs/2509.01819)

Mastering Robot Dexterity: Inside ManiFlow's Consistency Flow Training

The dream of general-purpose robotics often conjures images of humanoid machines fluidly pouring water, handing over tools, or tidying up a cluttered room. While we have made massive strides in robotic control, achieving this level of “dexterity”—precise, coordinated movement, often using two hands—remains a formidable challenge. Robots need to process complex inputs (vision, language, proprioception) and output high-dimensional actions instantly. Recent years have seen the rise of Diffusion Policies, which treat robot action generation like image generation: starting with noise and refining it into a trajectory. While effective, diffusion models can be slow, requiring many steps of iterative refinement (denoising) to produce a usable action. ...

2025-09 · 8 min · 1536 words
[Sim2Val: Leveraging Correlation Across Test Platforms for Variance-Reduced Metric Estimation 🔗](https://arxiv.org/abs/2506.20553)

Sim2Val: How to Validate Robots Without Breaking the Bank (or the Robot)

Imagine you have developed a new planning algorithm for a self-driving car. You are confident it works, but before you can deploy it to a fleet of vehicles, you need to answer a critical question: How safe is it, really? To statistically guarantee safety, you might need to test the vehicle over millions (or even billions) of miles. Doing this entirely in the real world is feasible for almost no one—it is prohibitively expensive, time-consuming, and potentially dangerous if the system fails. ...

2025-06 · 8 min · 1576 words
[Self-supervised perception for tactile skin covered dexterous hands 🔗](https://arxiv.org/abs/2505.11420)

Sparsh-skin - Giving Robots a "Sense of Touch" with Magnetic Skin and Self-Supervision

Imagine trying to plug a charging cable into a socket behind a nightstand in pitch darkness. You can’t see the socket, yet you successfully maneuver the plug, feel the edges of the port, align it, and push it in. You rely entirely on your sense of touch. For humans, touch is a seamless, high-bandwidth modality that covers our entire hand. For robots, however, replicating this capability has been a monumental challenge. While computer vision has seen explosive growth, robotic tactile sensing has lagged behind. Most robots today are essentially “numb,” relying heavily on vision to infer contact. ...

2025-05 · 8 min · 1558 words
[Learning Impact-Rich Rotational Maneuvers via Centroidal Velocity Rewards and Sim-to-Real Techniques: A One-Leg Hopper Flip Case Study 🔗](https://arxiv.org/abs/2505.12222)

Teaching a Pogo Stick to Flip: How Centroidal Physics and Motor Limits Enable Acrobatic Robots

We have entered an era where legged robots are doing far more than just walking. From the viral videos of Boston Dynamics’ Atlas doing parkour to quadrupedal robots performing agile jumps, the field is pushing toward highly dynamic, acrobatic behaviors. But there is a massive gap between watching a robot execute a backflip in a perfectly controlled simulation and achieving that same feat on physical hardware without the robot destroying itself. ...

2025-05 · 8 min · 1592 words
[3DS-VLA: A 3D Spatial-Aware Vision Language Action Model for Robust Multi-Task Manipulation 🔗](https://openreview.net/pdf?id=dT45OMevL5)

Bridging the Gap: How 3DS-VLA Brings 3D Awareness to 2D Robot Brains

Imagine trying to pour water into a cup while closing one eye and looking through a paper towel tube. You lose depth perception, and your field of view is restricted. This is essentially how many modern robots operate when powered by standard Vision-Language Models (VLMs). In recent years, the field of robotics has been revolutionized by Vision-Language-Action (VLA) models. These models leverage the massive knowledge base of internet-scale 2D data to help robots understand commands and recognize objects. However, there is a fundamental mismatch: these models are trained on flat, 2D images, but robots live and work in a complex, geometric 3D world. When a robot tries to grasp a bottle or open a drawer based solely on 2D inputs, it often struggles to reason about spatial depth and the precise geometry required for manipulation. ...

9 min · 1761 words
[CoRI: Communication of Robot Intent for Physical Human-Robot Interaction 🔗](https://arxiv.org/abs/2505.20537)

Decoding the Black Box: How Robots Can Explain Their Moves Before They Touch You

Imagine you are sitting in a chair, and a robotic arm holding a razor blade begins to move toward your face. You know the robot is designed to assist with shaving, but you don’t know its exact path. Is it aiming for your cheek? Is it going to pause? Is it moving too fast? In scenarios involving Physical Human-Robot Interaction (pHRI), such as robotic feeding, bathing, or shaving, the difference between a helpful interaction and a terrifying one often comes down to transparency. If the robot could just tell you, “I am going to move slowly towards your left cheek to trim your sideburns,” the anxiety would vanish, and the collaboration would be seamless. ...

2025-05 · 8 min · 1666 words