[NeuralSVCD for Efficient Swept Volume Collision Detection 🔗](https://arxiv.org/abs/2509.00499)

Solving the Tunneling Problem: How NeuralSVCD Makes Robot Motion Planning Safer and Faster

Imagine a robotic arm moving rapidly on a crowded factory floor. It needs to pick up a part from a bin and place it on a conveyor belt without hitting the bin walls, the conveyor, or itself. To plan this motion, the robot relies on a collision checker. Traditionally, motion planners work by sampling the robot’s trajectory at discrete points in time—like a flipbook. They check: “Is the robot hitting anything at time \(t=0\)? How about \(t=1\)?” If both are clear, the planner assumes the path is safe. But what happens at \(t=0.5\)? ...

2025-09 · 8 min · 1671 words
[First Order Model-Based RL through Decoupled Backpropagation 🔗](https://arxiv.org/abs/2509.00215)

Breaking the Cycle of Error: How Decoupled Gradients are Revolutionizing Model-Based RL

In the world of robotics, we are constantly chasing a specific dream: a robot that can learn complex agile behaviors—like parkour or bipedal walking—in minutes rather than days. Reinforcement Learning (RL) has given us some incredible results, but it comes with a heavy price tag: sample inefficiency. Standard “model-free” algorithms like PPO (Proximal Policy Optimization) act like trial-and-error machines. They try an action, see the result, and nudge their behavior slightly. This works, but it requires millions, sometimes billions, of interactions to converge. ...

2025-09 · 9 min · 1790 words
[Rapid Mismatch Estimation via Neural Network Informed Variational Inference 🔗](https://arxiv.org/abs/2508.21007)

How Robots Learn to Feel Weight: A Deep Dive into Rapid Mismatch Estimation

Imagine walking up to a cardboard box. You don’t know if it’s empty, filled with Styrofoam peanuts, or packed with heavy books. You reach out, lift it, and in a split second, your brain processes the proprioceptive feedback from your muscles. If it’s heavier than expected, you instantly recruit more motor units to stabilize your posture; if it’s lighter, you ease off to avoid flinging it over your shoulder. You do this naturally, safely, and—most importantly—rapidly. ...

2025-08 · 10 min · 1978 words
[Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning 🔗](https://arxiv.org/abs/2505.04317)

How AI Learned to Play 3v3 Drone Volleyball From Scratch

Competitive sports have long been the proving ground for Artificial Intelligence. We’ve seen AI conquer Chess, Go, Poker, and even complex video games like StarCraft II. But moving from the virtual world of pixels to the physical world of robotics introduces a massive spike in difficulty. In “embodied” sports, agents don’t just need to outsmart an opponent; they have to grapple with physics, gravity, aerodynamics, and the chaos of the real world. ...

2025-05 · 9 min · 1882 words
[CaRL: Learning Scalable Planning Policies with Simple Rewards 🔗](https://arxiv.org/abs/2504.17838)

Less is More: Solving Autonomous Driving with Simple Rewards and Massive Scale

In the quest for fully autonomous vehicles, the “planning” module is the brain of the operation. It decides where the car should go, how fast it should drive, and how to handle a sudden pedestrian crossing the street. Historically, this has been the domain of rule-based systems—massive decision trees written by engineers (e.g., “if light is red, stop”). While these work well for standard traffic, they are brittle. They fail to scale to the “long tail” of weird, unpredictable scenarios that happen in the real world. ...

2025-04 · 8 min · 1587 words
[Meta-Optimization and Program Search using Language Models for Task and Motion Planning 🔗](https://arxiv.org/abs/2505.03725)

Bridging the Gap: How MOPS Uses LLMs as Meta-Optimizers for Robotic Control

Imagine asking a robot to “draw a perfect star on a tilted whiteboard” or “push these scattered blocks into a neat line.” To a human, these requests are simple. To a robot, they represent a complex interplay of high-level semantic understanding and low-level geometric precision. For years, roboticists have struggled with the Task and Motion Planning (TAMP) problem. The challenge lies in the divide between discrete decisions (which object to pick up, which tool to use) and continuous control (how to move the joints smoothly without hitting obstacles). ...

2025-05 · 8 min · 1619 words
[From Real World to Logic and Back: Learning Generalizable Relational Concepts For Long Horizon Robot Planning 🔗](https://arxiv.org/abs/2402.11871)

Robots That Invent Their Own Logic: How LAMP Bridges the Gap Between Raw Data and Long-Horizon Planning

Imagine you are teaching a robot to clean a table. You show it how to pick up a single cup and place it in a bin. Now, you scatter twenty cups across a long dining table and tell the robot to clean it up. For a human, this is trivial. We intuitively understand the concept of “picking up” and “placing,” and we can apply that concept repeatedly, regardless of where the cups are or how many there are. For a robot, however, this is a nightmare. ...

2024-02 · 8 min · 1550 words
[Adapting by Analogy: OOD Generalization of Visuomotor Policies via Functional Correspondence 🔗](https://arxiv.org/abs/2506.12678)

Can Robots Learn by Analogy? Teaching Policies to Handle the Unexpected

Imagine you are teaching a robot to clean a table. You spend hours showing it how to pick up a pen and place it in a cup. You train it until it executes the motion perfectly. Then, you hand the robot a pencil. To you, the task is identical: the pencil is long, thin, and rigid, just like the pen. You intuitively understand that the pencil is “functionally analogous” to the pen. ...

2025-06 · 10 min · 1919 words
[Vision in Action: Learning Active Perception from Human Demonstrations 🔗](https://arxiv.org/abs/2506.15666)

Why Robots Need Necks: Learning Active Perception for Better Manipulation

Imagine you are trying to find a specific item—say, a yellow banana—buried at the bottom of a messy grocery bag. How do you do it? You don’t just stick your hand in blindly. You lean forward, tilt your head, maybe pull the bag open with one hand while peering inside with your eyes, and constantly adjust your gaze until you spot the target. Only then do you reach in to grab it. ...

2025-06 · 9 min · 1727 words
[Learning from 10 Demos: Generalisable and Sample-Efficient Policy Learning with Oriented Affordance Frames 🔗](https://arxiv.org/abs/2410.12124)

How Robots Can Learn Complex Tasks from Just 10 Demos: Introducing Oriented Affordance Frames

If you have ever tried to teach a robotic arm to perform a task using machine learning, you know the struggle: data hunger. To teach a robot to simply pick up a mug and place it on a coaster often requires hundreds, if not thousands, of human demonstrations. If you move the mug slightly to the left, or swap it for a taller mug, the robot frequently fails. This lack of “sample efficiency” (needing too much data) and “generalisation” (failing when things change slightly) is a massive bottleneck in robotics. It is the primary reason we don’t yet have Rosie the Robot tidying our kitchens. ...

2024-10 · 9 min · 1840 words
[Force-Modulated Visual Policy for Robot-Assisted Dressing with Arm Motions 🔗](https://arxiv.org/abs/2509.12741)

Robots That Feel: How Force Feedback Solves the Dressing Challenge

Imagine trying to put a jacket on a toddler. Now, imagine that toddler is actively moving—reaching for a toy, scratching their head, or waving. It is a task that requires patience, visual coordination, and, crucially, a sense of touch. If the sleeve gets stuck, you feel the resistance and adjust your angle. You don’t just keep pushing. For assistive robots, dressing a human is one of the “holy grail” challenges. It promises to restore independence to millions of individuals with mobility impairments. However, it is also a nightmare of physics and safety. Clothes are deformable objects with infinite ways to fold and snag. Humans are dynamic; they move, tremble, and shift posture. ...

2025-09 · 9 min · 1795 words
[LaVA-Man: Learning Visual Action Representations for Robot Manipulation 🔗](https://arxiv.org/abs/2508.19391)

Can Robots Learn to Act by Predicting the Future? A Deep Dive into LaVA-Man

Imagine you are teaching a child to “close the door.” You don’t necessarily describe the muscle movements required to rotate the handle and push. Instead, the child learns by understanding the causality: there is an open door, a command is given, and the desired result is a closed door. If the child can mentally visualize the “closed door” state based on the current scene and your instruction, they implicitly understand the action required to get there. ...

2025-08 · 9 min · 1712 words
[FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection 🔗](https://openreview.net/pdf?id=19LSN4QnV4)

FOMO-3D: Teaching Self-Driving Cars to See the Unexpected with Foundation Models

In the world of autonomous driving, recognizing a car or a pedestrian is a solved problem. Modern perception systems can spot a sedan from a hundred meters away with high accuracy. But what happens when the vehicle encounters something rare? A construction worker carrying a sheet of glass, a child in a stroller, or debris scattered on a highway? These “long-tailed” objects—classes that appear infrequently in training data—pose a massive safety risk. Standard AI models struggle to learn them because they simply don’t see enough examples during training. ...

6 min · 1266 words
[Disentangled Multi-Context Meta-Learning: Unlocking Robust and Generalized Task Learning 🔗](https://arxiv.org/abs/2509.01297)

Divide and Conquer: How Disentangled Meta-Learning Helps Robots Adapt to the Unknown

Imagine you are hiking with a heavy backpack. You step onto a patch of ice. Instantly, your brain adapts. It realizes two distinct things: “I am heavier than usual” and “The ground is slippery.” You adjust your stride accordingly. Now, imagine a robot in the same scenario. Traditional learning methods often struggle to make this distinction. The robot simply realizes “movement is harder,” creating a messy, entangled mental model that mixes the concept of “heavy backpack” with “slippery ground.” If you take that backpack off but keep the robot on the ice, it might fail because its adaptation was tied to the specific combination of both factors. ...

2025-09 · 8 min · 1591 words
[Fail2Progress: Learning from Real-World Robot Failures with Stein Variational Inference 🔗](https://arxiv.org/abs/2509.01746)

Failing Forward: How Robots Can Learn to Fix Their Own Mistakes with Fail2Progress

Imagine you are learning to play a new sport—say, tennis. You swing the racket, expecting to hit a perfect cross-court shot, but the ball sails wildly out of bounds. What do you do? You don’t just ignore it. You analyze the discrepancy between what you thought would happen and what actually happened. You adjust your mental model of the swing and try again. This process of learning from failure is fundamental to human intelligence. ...

2025-09 · 10 min · 1932 words
[Cross-Sensor Touch Generation 🔗](https://arxiv.org/abs/2510.09817)

Can Robots Feel What They See? Translating Touch Across Sensors

In the world of computer vision, a camera is generally just a camera. Whether you swap a Logitech webcam for a high-end DSLR, the fundamental data structure—an array of pixels representing light—remains consistent. You might need to resize an image, but a neural network trained on JPEGs can usually handle the switch with minimal fuss. In robotics, however, the sense of touch is far more chaotic. Tactile sensors come in every conceivable shape and form factor. Some are soft, air-filled bubbles; others are rigid, gel-based pads; some use internal cameras to track deformation, while others measure resistance. This hardware diversity creates a massive bottleneck: an algorithm trained to manipulate a cup using a Soft Bubble sensor will likely fail completely if you switch to a GelSlim sensor. The data distributions are simply too different. ...

2025-10 · 7 min · 1434 words
[Data Retrieval with Importance Weights for Few-Shot Imitation Learning 🔗](https://arxiv.org/abs/2509.01657)

Quality Over Quantity: Improving Robot Learning with Importance Weighted Retrieval

The field of robotics is currently facing a “data hunger” crisis. While we have seen massive leaps in capability thanks to Deep Learning, these models require enormous amounts of data. In Computer Vision or NLP, scraping the internet provides billions of examples. In robotics, however, data must be physically collected—a slow, expensive, and labor-intensive process. To solve this, researchers often turn to Few-Shot Imitation Learning. The goal is simple but ambitious: teach a robot a new task using only a handful of demonstrations (the “target” data) by supplementing them with relevant clips from massive, pre-existing datasets (the “prior” data). This process is known as Retrieval. ...

2025-09 · 8 min · 1626 words
[π0.5: a Vision-Language-Action Model with Open-World Generalization 🔗](https://openreview.net/pdf?id=vlhoswksBO)

Breaking the Lab Barrier: How π0.5 Brings Robots Into the Open World via Heterogeneous Co-Training

For decades, the “holy grail” of robotics has been a machine capable of walking into a messy, unfamiliar home and making itself useful—cleaning the kitchen, folding laundry, or tidying up a bedroom. While we have seen impressive videos of robots performing backflips or assembling cars, these feats usually occur in highly controlled environments or “labs” where the robot knows exactly where everything is. This is the generalization gap. A robot trained to pick up a red mug in a bright lab often fails to pick up a blue mug in a dimly lit kitchen. Scaling up data collection helps, but we simply cannot physically collect robot data in every possible home configuration on Earth. ...

8 min · 1610 words
[Geometric Red-Teaming for Robotic Manipulation 🔗](https://arxiv.org/abs/2509.12379)

Breaking Robots with Geometry: How to Red-Team Manipulation Policies

Imagine you have trained a robot to pick up a screwdriver. You’ve trained it on thousands of simulations, and it achieves a 95% success rate. You are ready to deploy. But then, in the real world, you hand the robot a screwdriver that is slightly bent, or perhaps the handle is a bit thicker than the one in the training set. Suddenly, the robot fails catastrophically—it slips, it drops the object, or it can’t find a grip. ...

2025-09 · 8 min · 1690 words
[Belief-Conditioned One-Step Diffusion: Real-Time Trajectory Planning with Just-Enough Sensing 🔗](https://arxiv.org/abs/2508.12166)

Navigating with Eyes Wide Shut: How Diffusion Models Enable Just-Enough Sensing in Robots

Imagine you are hiking through a dense forest at night. To navigate safely, you have a flashlight, a GPS device, and a map. Keeping all of them on continuously guarantees you won’t get lost, but it drains your batteries rapidly. If your batteries die, you’re stranded. On the other hand, turning everything off to save power is dangerous—you might fall off a cliff. The smartest strategy is to switch devices on only when necessary: use the flashlight for rocky terrain, check the GPS only when the trail forks, and walk in the dark when the path is straight and clear. ...

2025-08 · 8 min · 1634 words