[KDPE: A Kernel Density Estimation Strategy for Diffusion Policy Trajectory Selection 🔗](https://arxiv.org/abs/2508.10511)

Taming the Stochastic Robot: Enhancing Diffusion Policies with Kernel Density Estimation

The integration of Generative AI into robotics has sparked a revolution. Specifically, Diffusion Policy (DP) has emerged as a state-of-the-art approach for “behavior cloning”—teaching robots to perform tasks by mimicking human demonstrations. Unlike older methods that try to average out human movements into a single mean trajectory, Diffusion Policy embraces the fact that humans solve tasks in many different ways. It models the distribution of possible actions, allowing a robot to handle multimodal behavior (e.g., grasping a cup from the left or the right). ...

2025-08 · 9 min · 1797 words
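The entry above centers on one idea: a diffusion policy samples many plausible action trajectories, and KDE picks one that lies in a dense region of that batch rather than averaging modes together. As a rough illustration of that selection step only (not the paper's exact procedure; the function name, random "trajectories", and dimensions below are invented for the sketch), a Gaussian KDE fitted on the batch can score how typical each candidate is:

```python
import numpy as np
from scipy.stats import gaussian_kde

def select_by_density(candidates: np.ndarray) -> np.ndarray:
    """Return the candidate trajectory lying in the densest region of the batch.

    candidates: (N, horizon, action_dim) actions sampled from a stochastic
    policy (here plain noise stands in for a diffusion policy)."""
    n = candidates.shape[0]
    flat = candidates.reshape(n, -1)            # one vector per trajectory
    kde = gaussian_kde(flat.T)                  # KDE fitted on the batch itself
    scores = kde(flat.T)                        # density at each candidate
    return candidates[int(np.argmax(scores))]   # most "typical" sample

# toy usage: 64 sampled trajectories, 8 steps, 2-D actions
batch = np.random.randn(64, 8, 2)
print(select_by_density(batch).shape)           # (8, 2)
```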
[ActLoc: Learning to Localize on the Move via Active Viewpoint Selection 🔗](https://arxiv.org/abs/2508.20981)

Don't Just Look, See: How ActLoc Teaches Robots Where to Look for Better Navigation

Introduction Imagine navigating a familiar room in the dark with a flashlight. To figure out where you are, you wouldn’t point your light at a blank patch of white wall; that wouldn’t tell you anything. Instead, you would instinctively shine the light toward distinct features—a door frame, a bookshelf, or a unique piece of furniture. By actively selecting where to look, you maximize your ability to localize yourself in the space. ...

2025-08 · 8 min · 1613 words
[BEHAVIOR ROBOT SUITE: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities 🔗](https://arxiv.org/abs/2503.05652)

Mastering the Home: How BRS Solves Whole-Body Control for Household Robots

Introduction We often dream of the “Rosie the Robot” future—a general-purpose helper that can tidy the living room, clean the bathroom, and organize the pantry. While we have seen incredible advances in robotic manipulation in lab settings, bringing these capabilities into a real-world home remains a formidable challenge. Why is this so hard? It turns out that a messy home requires more than just a good gripper. It requires a robot that can coordinate its entire body. To open a heavy door, a robot can’t just use its arm; it needs to lean in with its torso and drive its base simultaneously. To put a box on a high shelf, it needs to stretch up; to scrub a toilet, it needs to crouch down. ...

2025-03 · 8 min · 1655 words
[TopoCut: Learning Multi-Step Cutting with Spectral Rewards and Discrete Diffusion Policies 🔗](https://arxiv.org/abs/2509.19712)

TopoCut: How Robots Learn to Slice and Dice using Spectral Rewards and Discrete Diffusion

Introduction Imagine teaching a robot to prepare a meal. Asking it to pick up an apple is a solved problem in many labs. Asking it to slice that apple into perfectly even wedges, however, is a nightmare of physics and control. Why? Because cutting fundamentally changes the topology of an object. One object becomes two; a solid mesh splits into independent clusters. The physics are complex—materials deform, squish, and fracture. Furthermore, evaluating success is incredibly difficult. If a robot slices a potato and the piece rolls off the table, did it fail? Geometrically, the slice might be perfect, but a standard sensor looking for the slice in a specific xyz-coordinate would register a zero score. ...

2025-09 · 7 min · 1476 words
[SPIN: distilling Skill-RRT for long-horizon prehensile and non-prehensile manipulation 🔗](https://arxiv.org/abs/2502.18015)

From Slow Planning to Fast Reflexes: How SPIN Masters Complex Robot Manipulation

Imagine you are trying to pick up a playing card that is lying flat on a table. You can’t just grab it directly because your fingers—or a robot’s gripper—can’t get underneath it. What do you do? You instinctively slide the card to the edge of the table using one finger (non-prehensile manipulation), and once it overhangs the edge, you pinch it (prehensile manipulation). This sequence of actions feels trivial to humans, but for robots, it is an immense challenge. It requires Long-Horizon Prehensile and Non-Prehensile (PNP) Manipulation. The robot must reason about physics, contacts, and geometry over a long sequence of steps. It has to decide how to slide the object, where to stop, and how to reposition its hand to transition from a slide to a grasp without knocking the card off the table. ...

2025-02 · 9 min · 1820 words
[FFHFlow: Diverse and Uncertainty-Aware Dexterous Grasp Generation via Flow Variational Inference 🔗](https://arxiv.org/abs/2407.15161)

Mastering the Unknown: How FFHFlow Gives Robots Human-Like Dexterity and Self-Awareness

Imagine you are reaching into a dark cupboard to grab a coffee mug. You can’t see the handle, but your brain makes a reasonable guess about where it is. If you touch something unexpected, you adjust. You don’t just grab blindly; you have an internal sense of uncertainty. For robots, specifically those with multi-fingered (dexterous) hands, this scenario is a nightmare. Most robotic systems struggle to grasp objects when they can only see part of them (partial observation). They either stick to very safe, repetitive grasping motions that might fail in cluttered spaces, or they try to generate complex grasps but lack the speed to do so in real-time. ...

2024-07 · 10 min · 2007 words
[CogniPlan: Uncertainty-Guided Path Planning with Conditional Generative Layout Prediction 🔗](https://arxiv.org/abs/2508.03027)

Giving Robots an Imagination: How CogniPlan Uses Generative AI for Path Planning

Introduction Imagine walking into a pitch-black building with a flashlight. Your goal is to find a specific exit or map out the entire floor. As you walk down a corridor, you don’t just see the illuminated patch in front of you; your brain instinctively constructs a mental model—a “cognitive map”—of what lies in the darkness. You might hypothesize, “This looks like a hallway, so it probably extends straight,” or “This looks like a lobby, so there might be doors on the sides.” ...

2025-08 · 8 min · 1617 words
[ObjectReact: Learning Object-Relative Control for Visual Navigation 🔗](https://arxiv.org/abs/2509.09594)

Why Robots Should Navigate by Objects, Not Just Images: Introducing ObjectReact

Visual navigation is one of the holy grails of robotics. We want robots to enter a new environment, look around, and navigate to a goal just like a human would. However, the current dominant paradigm—using raw images to estimate control—has a significant flaw: it is obsessed with the robot’s specific perspective. If you train a robot to navigate a hallway using a camera mounted at eye level, and then you lower that camera to knee height (simulating a smaller robot), the navigation system often breaks completely. The images look different, even though the geometry of the hallway hasn’t changed. ...

2025-09 · 8 min · 1625 words
[CHD: Coupled Hierarchical Diffusion for Long-Horizon Tasks 🔗](https://arxiv.org/abs/2505.07261)

When the Boss Listens to the Worker: Solving Long-Horizon Planning with Coupled Hierarchical Diffusion

Planning a sequence of actions to achieve a distant goal is one of the fundamental challenges in robotics. Imagine asking a robot to “cook a chicken dinner.” This isn’t a single action; it’s a complex hierarchy of tasks. The robot must plan high-level subgoals (open fridge, get chicken, place in pot, turn on stove) and execute low-level movements (joint angles, gripper velocity) to achieve them. Diffusion models have recently revolutionized this field, treating planning as a generative modeling problem. However, as the “horizon” (the length of the task) grows, these models often struggle. They either hallucinate physically impossible trajectories or get stuck in local optima. ...

2025-05 · 9 min · 1755 words
[Residual Neural Terminal Constraint for MPC-based Collision Avoidance in Dynamic Environments 🔗](https://arxiv.org/abs/2508.03428)

Safer Robots Through Residual Learning: Improving MPC in Dynamic Environments

Introduction Imagine a delivery robot navigating a busy warehouse or a cleaning robot moving through a crowded train station. These environments are unstructured and, more importantly, dynamic. Humans, forklifts, and other robots are constantly moving. For a robot to operate safely, it can’t just look at a static map; it must predict the future. One of the most popular ways to control such robots is Model Predictive Control (MPC). MPC is great because it looks a few seconds into the future, optimizes a trajectory to avoid collisions, executes the first step, and then repeats the process. However, MPC has a blind spot: the end of its horizon. If the robot plans 5 seconds ahead, what ensures it isn’t leading itself into a trap that closes at 5.1 seconds? ...

2025-08 · 9 min · 1715 words
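The teaser above describes the receding-horizon loop that makes MPC work: optimize a short trajectory, execute only its first action, then re-plan from the new state. The sketch below illustrates just that loop with toy point-mass dynamics, a drifting obstacle, and a random-shooting "optimizer" standing in for a real solver; it is not the paper's method (which concerns a learned residual terminal constraint), and every name, cost, and dynamic here is made up for illustration.

```python
import numpy as np

def rollout_cost(x0, controls, obstacle, goal, dt=0.1):
    """Cost of a candidate control sequence: distance to the goal plus
    a penalty for getting close to a moving obstacle."""
    x, cost = np.array(x0, dtype=float), 0.0
    for t, u in enumerate(controls):
        x = x + dt * u                                  # trivial integrator dynamics
        obs = obstacle + 0.2 * t * dt                   # obstacle drifts during the horizon
        cost += np.sum((x - goal) ** 2)
        cost += 50.0 * max(0.0, 0.5 - np.linalg.norm(x - obs))
    return cost

def mpc_step(x0, obstacle, goal, horizon=20, samples=256):
    """One receding-horizon iteration via random shooting: optimize over the
    horizon, but return only the first control of the best sequence."""
    best_u, best_c = None, np.inf
    for _ in range(samples):
        controls = np.random.uniform(-1.0, 1.0, size=(horizon, 2))
        c = rollout_cost(x0, controls, obstacle, goal)
        if c < best_c:
            best_c, best_u = c, controls[0]
    return best_u

# closed loop: re-plan at every step, execute only the first action
state, goal, obstacle = np.zeros(2), np.array([3.0, 0.0]), np.array([1.5, 0.1])
for _ in range(30):
    u = mpc_step(state, obstacle, goal)
    state = state + 0.1 * u
    obstacle = obstacle + np.array([0.02, 0.0])         # the world keeps moving
print(state)
```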
[Multi-Critic Learning for Whole-body End-effector Twist Tracking 🔗](https://arxiv.org/abs/2507.08656)

Walking and Working: How Multi-Critic RL Solves the Loco-Manipulation Paradox

Imagine a waiter carrying a tray of drinks through a crowded restaurant. To succeed, they must perform two distinct tasks simultaneously: they must navigate the room with their legs (locomotion) while keeping their hand perfectly level to avoid spilling the drinks (manipulation). For humans, this coordination is second nature. For robots, specifically quadrupedal robots equipped with robotic arms, this is a profound engineering challenge. This domain is known as Loco-Manipulation. ...

2025-07 · 8 min · 1683 words
[Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning 🔗](https://arxiv.org/abs/2503.04877)

Breaking the Brittleness: How Adapt3R Enables Generalizable Robot Learning

Introduction Imagine teaching a robot to pick up a coffee cup. You show it thousands of examples, and eventually, it masters the task perfectly. But then, you move the camera two inches to the left, or you swap the robot arm for a slightly different model. Suddenly, the robot fails completely. This “brittleness” is one of the most persistent challenges in robotic Imitation Learning (IL). While we have become very good at training specialists—agents that excel in a fixed environment—we struggle to train generalists that can adapt to changes in viewpoint or embodiment (the physical structure of the robot). ...

2025-03 · 8 min · 1686 words
[PrioriTouch: Adapting to User Contact Preferences for Whole-Arm Physical Human-Robot Interaction 🔗](https://arxiv.org/abs/2509.18447)

Robots That Feel: How PrioriTouch Adapts to Personal Comfort in Caregiving

Imagine a future where a robot helps a bedridden patient with a sponge bath. To do this effectively, the robot cannot just use its “hand” (end-effector); it must use its entire arm to reach around the patient, perhaps leaning its forearm gently against the patient’s shoulder to wipe their back. Now, imagine that patient has a bruised shoulder. The robot’s task is to wipe the back, but the patient’s priority is “don’t hurt my shoulder.” If the robot prioritizes the wiping motion over the contact force, the patient gets hurt. If it prioritizes the shoulder too much, it might not reach the back at all. ...

2025-09 · 9 min · 1755 words
[Sequence Modeling for Time-Optimal Quadrotor Trajectory Optimization with Sampling-based Robustness Analysis 🔗](https://arxiv.org/abs/2506.13915)

Flying at the Limit: How Deep Learning Accelerates Time-Optimal Quadrotor Planning

Introduction In the world of autonomous systems, speed often competes with safety. Nowhere is this clearer than in the domain of agile micro aerial vehicles (MAVs), or quadrotors. Whether it is for high-stakes search and rescue missions, disaster response, or competitive drone racing, we want robots that can move from point A to point B in the absolute minimum amount of time. However, flying at the limit of physics is not just about pushing the throttle to the max. It requires solving a complex “Time-Optimal Path Parameterization” (TOPP) problem. The drone must calculate exactly how fast it can travel along a curve without violating its motor limits or drifting off course due to momentum. Traditionally, this involves solving non-convex optimization problems—heavy mathematical lifting that taxes the onboard computers and takes precious time. ...

2025-06 · 8 min · 1649 words
[O3Afford: One-Shot 3D Object-to-Object Affordance Grounding for Generalizable Robotic Manipulation 🔗](https://arxiv.org/abs/2509.06233)

How Robots Learn to Interact: Inside O3Afford's One-Shot 3D Affordance System

Imagine you are teaching a robot to pour tea. To you, this action is intuitive. You pick up the teapot by the handle (not the spout) and tilt it over the opening of the cup (not the bottom). This intuition is based on “affordance”—the properties of an object that define how it can be used. For years, computer vision research has focused heavily on single-object affordance—identifying that a handle is for holding. But the real world is rarely about isolated objects. Most meaningful tasks involve object-to-object (O2O) interactions: a knife cutting an apple, a hammer hitting a button, or a plug inserting into a socket. ...

2025-09 · 9 min · 1721 words
[Multimodal Fused Learning for Solving the Generalized Traveling Salesman Problem in Robotic Task Planning 🔗](https://openreview.net/pdf?id=r29CIl3ePP)

Robots That 'See' the Map: Solving the Generalized TSP with Multimodal Learning

Introduction Imagine you are a robot in a massive fulfillment center. Your job is to collect five specific items to fulfill an order. It sounds simple, right? But here is the catch: “Item A” isn’t just in one place. It is stocked on Shelf 1, Shelf 5, and Shelf 12. “Item B” is available in two other locations. To be efficient, you don’t need to visit every location; you just need to pick one location for Item A, one for Item B, and so on, while keeping your total travel distance as short as possible. ...

9 min · 1845 words
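The teaser above defines the Generalized TSP: each item is stocked in several locations, and the robot must visit exactly one location per item while minimizing total travel from and back to its start. To make the formulation concrete, here is a tiny brute-force solver over a toy instance (purely illustrative; the paper's contribution is a learned multimodal solver, not this enumeration, and the depot and shelf coordinates are invented):

```python
import itertools
import numpy as np

def gtsp_brute_force(depot, groups):
    """Exhaustively solve a tiny Generalized TSP: visit exactly one
    location from each group, starting and ending at the depot.

    groups: list of (k_i, 2) arrays of candidate shelf locations."""
    best_len, best_tour = np.inf, None
    # choose one candidate per group, then try every visiting order
    for choice in itertools.product(*[range(len(g)) for g in groups]):
        points = [groups[i][c] for i, c in enumerate(choice)]
        for order in itertools.permutations(range(len(points))):
            tour = [depot] + [points[i] for i in order] + [depot]
            length = sum(np.linalg.norm(np.subtract(a, b))
                         for a, b in zip(tour[:-1], tour[1:]))
            if length < best_len:
                best_len, best_tour = length, tour
    return best_len, best_tour

# toy instance: 3 items, each stocked in 2-3 places
depot = (0.0, 0.0)
groups = [np.array([[5, 1], [1, 4], [6, 6]]),   # item A
          np.array([[2, 2], [7, 0]]),           # item B
          np.array([[3, 5], [0, 6]])]           # item C
length, tour = gtsp_brute_force(depot, groups)
print(round(length, 2))
```

Even this toy instance enumerates every per-item choice times every visiting order, which is exactly the combinatorial blow-up that makes exhaustive search hopeless at warehouse scale and motivates a learned solver.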
[CASH: Capability-Aware Shared Hypernetworks for Flexible Heterogeneous Multi-Robot Coordination 🔗](https://arxiv.org/abs/2501.06058)

Breaking the Trade-off: How Hypernetworks Enable Flexible Multi-Robot Teams

Introduction Imagine a team of robots deployed to fight a wildfire. This isn’t a uniform squad of identical drones; it is a heterogeneous team. Some are fast aerial scouts with limited payload, others are heavy ground rovers carrying massive water tanks, and a few are agile quadrupeds designed to navigate debris. To succeed, these robots must coordinate flawlessly. The scouts need to identify hotspots for the rovers, and the rovers need to position themselves where they can be most effective given their slow speed. ...

2025-01 · 8 min · 1608 words
[ReCoDe: Reinforcement Learning-based Dynamic Constraint Design for Multi-Agent Coordination 🔗](https://arxiv.org/abs/2507.19151)

The Best of Both Worlds: Merging Control Theory and RL for Multi-Robot Coordination

Introduction Imagine a narrow corridor in a busy warehouse. Two autonomous robots, moving in opposite directions, meet in the middle. Neither has enough room to pass the other. A human would instinctively know what to do: one person backs up into a doorway or hugs the wall to let the other pass. But for robots, this simple interaction is a complex mathematical standoff. In the world of robotics, this is a classic coordination problem. Traditionally, engineers solve this using optimization-based controllers. These are rigid, handcrafted mathematical rules that guarantee safety—ensuring the robot doesn’t hit a wall or another agent. However, these systems are notoriously bad at “social” negotiation. They often result in deadlocks where both robots just freeze, waiting for the other to move. ...

2025-07 · 10 min · 1982 words
[Predictive Red Teaming: Breaking Policies Without Breaking Robots 🔗](https://arxiv.org/abs/2502.06575)

Breaking Robots with Generative AI: A Guide to Predictive Red Teaming

Introduction: The “It Works in the Lab” Problem Imagine you have spent weeks training a robotic arm to perform a manipulation task, like picking up objects and sorting them into bins. You use Imitation Learning, showing the robot thousands of demonstrations. In your lab, under bright fluorescent lights with a standard pink table mat, the robot is a star. It achieves a 90% success rate. Then, you move the table two centimeters closer to the window. Or maybe someone walks by wearing a bright red shirt. Or you swap the table mat for a blue one. Suddenly, the robot’s performance plummets. It flails, misses the object, or freezes entirely. ...

2025-02 · 8 min · 1651 words
[Efficient Evaluation of Multi-Task Robot Policies With Active Experiment Selection 🔗](https://arxiv.org/abs/2502.09829)

Smarter, Not Harder—Cutting the Cost of Robot Evaluation with Active Testing

In the world of modern robotics, training a policy is only half the battle. The other half—and often the more expensive half—is figuring out if it actually works. Imagine you have trained a robot to perform household chores. It can pick up a cup, open a drawer, and wipe a table. But can it pick up a red cup? Can it open a stuck drawer? To be sure, you need to test it. Now, imagine you have five different versions of this robot software (policies) and fifty different tasks. That is 250 unique combinations. If you run each combination 10 times to get statistically significant results, you are looking at 2,500 physical experiments. ...

2025-02 · 7 min · 1436 words