[COMBO-Grasp: Learning Constraint-Based Manipulation for Bimanual Occluded Grasping 🔗](https://arxiv.org/abs/2502.08054)

Two Hands Are Better Than One: Mastering Occluded Grasping with COMBO-Grasp

Imagine a flat, thin computer keyboard lying on a desk. You want to pick it up. If you just try to grab it from the top, your fingers will likely hit the desk before you can get a solid grip. The desk “occludes” (blocks) the grasp. So, what do you do naturally? You probably use your non-dominant hand to tilt the keyboard up or brace it, while your dominant hand secures the grip. ...

2025-02 · 8 min · 1703 words
[SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation 🔗](https://openreview.net/pdf?id=xVDj9uq6K3)

Can Vision-Language Models Navigate a Crowd? Inside the SocialNav-SUB Benchmark

Introduction Imagine you are walking down a busy university hallway between classes. You see a group of students chatting on your left, a professor hurrying towards you on your right, and a janitor mopping the floor ahead. Without consciously thinking, you adjust your path. You weave slightly right to give the group space, you slow down to let the professor pass, and you avoid the wet floor. This dance of “social navigation” is second nature to humans. We effortlessly interpret intentions, social norms, and spatial dynamics. ...

10 min · 2029 words
[AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit 🔗](https://arxiv.org/abs/2502.09762)

Can Robots Collaborate with Strangers? Inside the AT-Drone Benchmark

Introduction Imagine a high-stakes search-and-rescue mission in a dense forest. A fleet of drones is scanning the ground below. Suddenly, one drone suffers a battery failure and must return to base. A backup drone is immediately deployed to take its place. In an ideal world, this swap is seamless. The new drone joins the formation, understands the current strategy, and collaborates perfectly with the existing team. But in reality, this is an immense challenge for robotics. Most multi-agent systems are trained to work with specific, pre-defined partners. They rely on “over-training” with a fixed team, developing a secret language of movements and reactions. When a stranger—an “unseen” teammate—enters the mix, the coordination often falls apart. ...

2025-02 · 10 min · 2025 words
[Robot Operating Home Appliances by Reading User Manuals 🔗](https://openreview.net/pdf?id=wZUQq0JaL6)

Can Robots Learn to Cook by Reading the Manual? Meet ApBot

Introduction Imagine unboxing a new, high-tech rice cooker. It has a dozen buttons, a digital display, and a distinct lack of intuitive design. You, a human, likely grab the user manual, flip to the “Cooking Brown Rice” section, and figure out the sequence of buttons to press. Now, imagine you want your home assistant robot to do this. For a robot, this is a nightmare scenario. Unlike a hammer or a cup, an appliance is a “state machine”—it has hidden internal modes, logic constraints (you can’t start cooking if the lid is open), and complex inputs. Historically, roboticists have had to hard-code these interactions for specific devices. But what if a robot could simply read the manual and figure it out, just like we do? ...
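To make the "state machine" framing concrete, here is a toy cooker with a hidden internal mode and a guarded transition. The buttons, states, and lid constraint are invented for illustration, not taken from the paper:

```python
# Toy appliance modeled as a state machine with guarded transitions.
# States, buttons, and the lid-open constraint are illustrative only.

class RiceCooker:
    def __init__(self):
        self.mode = "idle"        # hidden internal mode
        self.lid_open = True

    def press(self, button):
        if button == "close_lid":
            self.lid_open = False
        elif button == "menu":
            # Menu button toggles between modes.
            self.mode = "brown_rice" if self.mode == "idle" else "idle"
        elif button == "start":
            # Logic constraint: cooking cannot start while the lid is open.
            if self.lid_open:
                raise RuntimeError("cannot start: lid is open")
            self.mode = "cooking"
        return self.mode

cooker = RiceCooker()
for b in ["menu", "close_lid", "start"]:
    print(b, "->", cooker.press(b))
```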

7 min · 1395 words
[KDPE: A Kernel Density Estimation Strategy for Diffusion Policy Trajectory Selection 🔗](https://arxiv.org/abs/2508.10511)

Taming the Stochastic Robot: Enhancing Diffusion Policies with Kernel Density Estimation

The integration of Generative AI into robotics has sparked a revolution. Specifically, Diffusion Policy (DP) has emerged as a state-of-the-art approach for “behavior cloning”—teaching robots to perform tasks by mimicking human demonstrations. Unlike older methods that try to average out human movements into a single mean trajectory, Diffusion Policy embraces the fact that humans solve tasks in many different ways. It models the distribution of possible actions, allowing a robot to handle multimodal behavior (e.g., grasping a cup from the left or the right). ...
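That multimodality is also what makes picking a single trajectory tricky, and it is the kind of problem kernel density estimation suits. The sketch below is illustrative only: a two-mode Gaussian stands in for a trained diffusion policy, and `scipy.stats.gaussian_kde` stands in for the paper's estimator.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Density-based selection among sampled action trajectories.
# A two-mode Gaussian stands in for a trained diffusion policy
# (e.g., "grasp from the left" vs. "grasp from the right").
rng = np.random.default_rng(0)
modes = np.array([[-1.0, 0.5], [1.0, 0.5]])          # two valid behaviors
picks = rng.integers(0, 2, size=128)
samples = modes[picks] + 0.1 * rng.standard_normal((128, 2))

# Fit a KDE over the sampled trajectories and keep the densest one:
# a stray sample between the modes scores low, a typical one scores high.
kde = gaussian_kde(samples.T)
best = samples[np.argmax(kde(samples.T))]
print("selected action:", best)
```

Note that a naive mean of these samples would land between the two modes, in a region neither behavior occupies; selecting the highest-density sample avoids exactly that failure.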

2025-08 · 9 min · 1797 words
[ActLoc: Learning to Localize on the Move via Active Viewpoint Selection 🔗](https://arxiv.org/abs/2508.20981)

Don't Just Look, See: How ActLoc Teaches Robots Where to Look for Better Navigation

Introduction Imagine navigating a familiar room in the dark with a flashlight. To figure out where you are, you wouldn’t point your light at a blank patch of white wall; that wouldn’t tell you anything. Instead, you would instinctively shine the light toward distinct features—a door frame, a bookshelf, or a unique piece of furniture. By actively selecting where to look, you maximize your ability to localize yourself in the space. ...

2025-08 · 8 min · 1613 words
[BEHAVIOR ROBOT SUITE: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities 🔗](https://arxiv.org/abs/2503.05652)

Mastering the Home: How BRS Solves Whole-Body Control for Household Robots

Introduction We often dream of the “Rosie the Robot” future—a general-purpose helper that can tidy the living room, clean the bathroom, and organize the pantry. While we have seen incredible advances in robotic manipulation in lab settings, bringing these capabilities into a real-world home remains a formidable challenge. Why is this so hard? It turns out that a messy home requires more than just a good gripper. It requires a robot that can coordinate its entire body. To open a heavy door, a robot can’t just use its arm; it needs to lean in with its torso and drive its base simultaneously. To put a box on a high shelf, it needs to stretch up; to scrub a toilet, it needs to crouch down. ...

2025-03 · 8 min · 1655 words
[TopoCut: Learning Multi-Step Cutting with Spectral Rewards and Discrete Diffusion Policies 🔗](https://arxiv.org/abs/2509.19712)

TopoCut: How Robots Learn to Slice and Dice using Spectral Rewards and Discrete Diffusion

Introduction Imagine teaching a robot to prepare a meal. Asking it to pick up an apple is a solved problem in many labs. Asking it to slice that apple into perfectly even wedges, however, is a nightmare of physics and control. Why? Because cutting fundamentally changes the topology of an object. One object becomes two; a solid mesh splits into independent clusters. The physics are complex—materials deform, squish, and fracture. Furthermore, evaluating success is incredibly difficult. If a robot slices a potato and the piece rolls off the table, did it fail? Geometrically, the slice might be perfect, but a standard evaluation that looks for the slice at a specific xyz-coordinate would register a zero score. ...

2025-09 · 7 min · 1476 words
[SPIN: distilling Skill-RRT for long-horizon prehensile and non-prehensile manipulation 🔗](https://arxiv.org/abs/2502.18015)

From Slow Planning to Fast Reflexes: How SPIN Masters Complex Robot Manipulation

Imagine you are trying to pick up a playing card that is lying flat on a table. You can’t just grab it directly because your fingers—or a robot’s gripper—can’t get underneath it. What do you do? You instinctively slide the card to the edge of the table using one finger (non-prehensile manipulation), and once it overhangs the edge, you pinch it (prehensile manipulation). This sequence of actions feels trivial to humans, but for robots, it is an immense challenge. It requires Long-Horizon Prehensile and Non-Prehensile (PNP) Manipulation. The robot must reason about physics, contacts, and geometry over a long sequence of steps. It has to decide how to slide the object, where to stop, and how to reposition its hand to transition from a slide to a grasp without knocking the card off the table. ...

2025-02 · 9 min · 1820 words
[FFHFlow: Diverse and Uncertainty-Aware Dexterous Grasp Generation via Flow Variational Inference 🔗](https://arxiv.org/abs/2407.15161)

Mastering the Unknown: How FFHFlow Gives Robots Human-Like Dexterity and Self-Awareness

Imagine you are reaching into a dark cupboard to grab a coffee mug. You can’t see the handle, but your brain makes a reasonable guess about where it is. If you touch something unexpected, you adjust. You don’t just grab blindly; you have an internal sense of uncertainty. For robots, specifically those with multi-fingered (dexterous) hands, this scenario is a nightmare. Most robotic systems struggle to grasp objects when they can only see part of them (partial observation). They either stick to very safe, repetitive grasping motions that might fail in cluttered spaces, or they try to generate complex grasps but lack the speed to do so in real-time. ...

2024-07 · 10 min · 2007 words
[CogniPlan: Uncertainty-Guided Path Planning with Conditional Generative Layout Prediction 🔗](https://arxiv.org/abs/2508.03027)

Giving Robots an Imagination: How CogniPlan Uses Generative AI for Path Planning

Introduction Imagine walking into a pitch-black building with a flashlight. Your goal is to find a specific exit or map out the entire floor. As you walk down a corridor, you don’t just see the illuminated patch in front of you; your brain instinctively constructs a mental model—a “cognitive map”—of what lies in the darkness. You might hypothesize, “This looks like a hallway, so it probably extends straight,” or “This looks like a lobby, so there might be doors on the sides.” ...

2025-08 · 8 min · 1617 words
[ObjectReact: Learning Object-Relative Control for Visual Navigation 🔗](https://arxiv.org/abs/2509.09594)

Why Robots Should Navigate by Objects, Not Just Images: Introducing ObjectReact

Visual navigation is one of the holy grails of robotics. We want robots to enter a new environment, look around, and navigate to a goal just like a human would. However, the current dominant paradigm—using raw images to estimate control—has a significant flaw: it is obsessed with the robot’s specific perspective. If you train a robot to navigate a hallway using a camera mounted at eye level, and then you lower that camera to knee height (simulating a smaller robot), the navigation system often breaks completely. The images look different, even though the geometry of the hallway hasn’t changed. ...

2025-09 · 8 min · 1625 words
[CHD: Coupled Hierarchical Diffusion for Long-Horizon Tasks 🔗](https://arxiv.org/abs/2505.07261)

When the Boss Listens to the Worker: Solving Long-Horizon Planning with Coupled Hierarchical Diffusion

Planning a sequence of actions to achieve a distant goal is one of the fundamental challenges in robotics. Imagine asking a robot to “cook a chicken dinner.” This isn’t a single action; it’s a complex hierarchy of tasks. The robot must plan high-level subgoals (open fridge, get chicken, place in pot, turn on stove) and execute low-level movements (joint angles, gripper velocity) to achieve them. Diffusion models have recently revolutionized this field, treating planning as a generative modeling problem. However, as the “horizon” (the length of the task) grows, these models often struggle. They either hallucinate physically impossible trajectories or get stuck in local optima. ...

2025-05 · 9 min · 1755 words
[Residual Neural Terminal Constraint for MPC-based Collision Avoidance in Dynamic Environments 🔗](https://arxiv.org/abs/2508.03428)

Safer Robots Through Residual Learning: Improving MPC in Dynamic Environments

Introduction Imagine a delivery robot navigating a busy warehouse or a cleaning robot moving through a crowded train station. These environments are unstructured and, more importantly, dynamic. Humans, forklifts, and other robots are constantly moving. For a robot to operate safely, it can’t just look at a static map; it must predict the future. One of the most popular ways to control such robots is Model Predictive Control (MPC). MPC is great because it looks a few seconds into the future, optimizes a trajectory to avoid collisions, executes the first step, and then repeats the process. However, MPC has a blind spot: the end of its horizon. If the robot plans 5 seconds ahead, what ensures it isn’t leading itself into a trap that closes at 5.1 seconds? ...
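The receding-horizon loop described above is compact enough to sketch. Below, a toy 2D point robot re-plans at every step; the random-shooting optimizer, integrator dynamics, and constant-velocity obstacle model are hypothetical stand-ins, not the paper's formulation:

```python
import numpy as np

# Minimal receding-horizon (MPC) loop for a 2D point robot.
# Each cycle: predict the obstacle, optimize over a short horizon,
# execute only the first control, then re-plan from the new state.
DT, HORIZON, N_SAMPLES = 0.1, 20, 256   # 2-second lookahead
GOAL = np.array([5.0, 0.0])

def predict_obstacle(t):
    """Constant-velocity prediction of one moving obstacle (made up)."""
    return np.array([2.5, -1.0]) + t * np.array([0.0, 0.5])

def rollout_cost(state, controls):
    """Cost of a candidate control sequence: goal distance + collisions."""
    cost, x = 0.0, state.copy()
    for k, u in enumerate(controls):
        x = x + DT * u                                    # integrator dynamics
        cost += np.linalg.norm(x - GOAL)                  # progress term
        if np.linalg.norm(x - predict_obstacle(k * DT)) < 0.5:
            cost += 100.0                                 # collision penalty
    return cost

def mpc_step(state, rng):
    """Random-shooting stand-in for the trajectory optimizer."""
    candidates = rng.uniform(-1.0, 1.0, size=(N_SAMPLES, HORIZON, 2))
    costs = [rollout_cost(state, c) for c in candidates]
    return candidates[int(np.argmin(costs))][0]           # first control only

rng = np.random.default_rng(0)
state = np.zeros(2)
for _ in range(50):                                       # closed-loop execution
    state = state + DT * mpc_step(state, rng)
print("final position:", state)
```

The article's concern lives at the edge of this loop: nothing in the 20-step rollout above accounts for what happens one step past the horizon.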

2025-08 · 9 min · 1715 words
[Multi-critic Learning for Whole-body End-effector Twist Tracking 🔗](https://arxiv.org/abs/2507.08656)

Walking and Working: How Multi-Critic RL Solves the Loco-Manipulation Paradox

Imagine a waiter carrying a tray of drinks through a crowded restaurant. To succeed, they must perform two distinct tasks simultaneously: they must navigate the room with their legs (locomotion) while keeping their hand perfectly level to avoid spilling the drinks (manipulation). For humans, this coordination is second nature. For robots, specifically quadrupedal robots equipped with robotic arms, this is a profound engineering challenge. This domain is known as Loco-Manipulation. ...

2025-07 · 8 min · 1683 words
[Adapt3R: Adaptive 3D Scene Representation for Domain Transfer in Imitation Learning 🔗](https://arxiv.org/abs/2503.04877)

Breaking the Brittleness: How Adapt3R Enables Generalizable Robot Learning

Introduction Imagine teaching a robot to pick up a coffee cup. You show it thousands of examples, and eventually, it masters the task perfectly. But then, you move the camera two inches to the left, or you swap the robot arm for a slightly different model. Suddenly, the robot fails completely. This “brittleness” is one of the most persistent challenges in robotic Imitation Learning (IL). While we have become very good at training specialists—agents that excel in a fixed environment—we struggle to train generalists that can adapt to changes in viewpoint or embodiment (the physical structure of the robot). ...

2025-03 · 8 min · 1686 words
[PrioriTouch: Adapting to User Contact Preferences for Whole-Arm Physical Human-Robot Interaction 🔗](https://arxiv.org/abs/2509.18447)

Robots That Feel: How PrioriTouch Adapts to Personal Comfort in Caregiving

Imagine a future where a robot helps a bedridden patient with a sponge bath. To do this effectively, the robot cannot just use its “hand” (end-effector); it must use its entire arm to reach around the patient, perhaps leaning its forearm gently against the patient’s shoulder to wipe their back. Now, imagine that patient has a bruised shoulder. The robot’s task is to wipe the back, but the patient’s priority is “don’t hurt my shoulder.” If the robot prioritizes the wiping motion over the contact force, the patient gets hurt. If it prioritizes the shoulder too much, it might not reach the back at all. ...

2025-09 · 9 min · 1755 words
[Sequence Modeling for Time-Optimal Quadrotor Trajectory Optimization with Sampling-based Robustness Analysis 🔗](https://arxiv.org/abs/2506.13915)

Flying at the Limit: How Deep Learning Accelerates Time-Optimal Quadrotor Planning

Introduction In the world of autonomous systems, speed often competes with safety. Nowhere is this clearer than in the domain of agile micro aerial vehicles (MAVs), or quadrotors. Whether it is for high-stakes search and rescue missions, disaster response, or competitive drone racing, we want robots that can move from point A to point B in the absolute minimum amount of time. However, flying at the limit of physics is not just about pushing the throttle to the max. It requires solving a complex “Time-Optimal Path Parameterization” (TOPP) problem. The drone must calculate exactly how fast it can travel along a curve without violating its motor limits or drifting off course due to momentum. Traditionally, this involves solving non-convex optimization problems—heavy mathematical lifting that taxes the onboard computers and takes precious time. ...
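As a rough illustration of what TOPP computes, the classic forward-backward pass below caps speed along a fixed toy path using only an acceleration limit; real TOPP handles motor limits, drag, and curvature, and the path and limits here are made up:

```python
import numpy as np

# Toy stand-in for Time-Optimal Path Parameterization (TOPP):
# a forward-backward pass bounding speed along a fixed 2D path.
A_MAX, V_MAX = 2.0, 5.0                      # accel (m/s^2), speed (m/s) caps
t = np.linspace(0, 1, 200)
path = np.stack([t * 10.0, np.sin(2 * np.pi * t)], axis=1)
ds = np.linalg.norm(np.diff(path, axis=0), axis=1)   # segment lengths

v = np.full(len(path), V_MAX)
v[0] = v[-1] = 0.0                           # start and stop at rest
for i in range(1, len(v)):                   # forward pass: accel limit
    v[i] = min(v[i], np.sqrt(v[i - 1] ** 2 + 2 * A_MAX * ds[i - 1]))
for i in range(len(v) - 2, -1, -1):          # backward pass: decel limit
    v[i] = min(v[i], np.sqrt(v[i + 1] ** 2 + 2 * A_MAX * ds[i]))

print(f"traversal time: {np.sum(ds / ((v[:-1] + v[1:]) / 2)):.2f} s")
```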

2025-06 · 8 min · 1649 words
[O3Afford: One-Shot 3D Object-to-Object Affordance Grounding for Generalizable Robotic Manipulation 🔗](https://arxiv.org/abs/2509.06233)

How Robots Learn to Interact: Inside O3Afford's One-Shot 3D Affordance System

Imagine you are teaching a robot to pour tea. To you, this action is intuitive. You pick up the teapot by the handle (not the spout) and tilt it over the opening of the cup (not the bottom). This intuition is based on “affordance”—the properties of an object that define how it can be used. For years, computer vision research has focused heavily on single-object affordance—identifying that a handle is for holding. But the real world is rarely about isolated objects. Most meaningful tasks involve object-to-object (O2O) interactions: a knife cutting an apple, a hammer hitting a button, or a plug inserting into a socket. ...

2025-09 · 9 min · 1721 words
[Multimodal Fused Learning for Solving the Generalized Traveling Salesman Problem in Robotic Task Planning 🔗](https://openreview.net/pdf?id=r29CIl3ePP)

Robots That 'See' the Map: Solving the Generalized TSP with Multimodal Learning

Introduction Imagine you are a robot in a massive fulfillment center. Your job is to collect five specific items to fulfill an order. It sounds simple, right? But here is the catch: “Item A” isn’t just in one place. It is stocked on Shelf 1, Shelf 5, and Shelf 12. “Item B” is available in two other locations. To be efficient, you don’t need to visit every location; you just need to pick one location for Item A, one for Item B, and so on, while keeping your total travel distance as short as possible. ...
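In Generalized TSP terms, each item's candidate shelves form a cluster, and a valid tour visits exactly one node per cluster. A tiny instance can be solved by brute force; the coordinates and cluster assignments below are invented for illustration:

```python
import itertools
import math

# Brute-force Generalized TSP: visit exactly one location per item cluster,
# starting and ending at the depot, minimizing total travel distance.
DEPOT = (0.0, 0.0)
CLUSTERS = {                      # item -> candidate shelf locations
    "item_A": [(1, 4), (5, 1), (3, 6)],
    "item_B": [(2, 2), (6, 5)],
    "item_C": [(4, 3)],
}

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def tour_length(stops):
    loop = [DEPOT, *stops, DEPOT]
    return sum(dist(a, b) for a, b in zip(loop, loop[1:]))

best_len, best_tour = float("inf"), None
# Choose one shelf per cluster, then try every visiting order.
for choice in itertools.product(*CLUSTERS.values()):
    for order in itertools.permutations(choice):
        if tour_length(order) < best_len:
            best_len, best_tour = tour_length(order), order

print(f"best tour {best_tour} with length {best_len:.2f}")
```

This nested enumeration grows exponentially with the number of items and shelves, which is exactly the scaling wall that motivates a learned approach.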

9 min · 1845 words