[SafeBimanual: Diffusion-based Trajectory Optimization for Safe Bimanual Manipulation 🔗](https://arxiv.org/abs/2508.18268)

How to Stop Robots from Breaking Things: Inside SafeBimanual

Imagine you are asking a robot to help you prepare breakfast. It needs to pick up a bottle of milk with one hand, a bowl with the other, and pour the milk without spilling it, crushing the bowl, or banging its two arms together. For a human, this bimanual (two-handed) coordination is intuitive. For a robot, it is a geometric and kinematic nightmare. In the world of robotic learning, Diffusion Policies have emerged as the reigning champions. By learning from human demonstrations, these models are incredibly good at cloning behaviors and handling complex, multi-modal tasks. However, they have a significant blind spot: Physical Safety. ...

2025-08 · 11 min · 2204 words
[Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation 🔗](https://arxiv.org/abs/2508.06426)

Why Your Robot is Cheating: The Hidden Trap of Shortcut Learning in Generalist Policies

In recent years, the recipe for success in Artificial Intelligence has seemed deceptively simple: scale up. In Computer Vision and Natural Language Processing (NLP), feeding massive Transformers with internet-scale data produced emergent capabilities that stunned the world. Naturally, roboticists asked: Can we do the same for physical robots? The answer appeared to be “yes.” By aggregating robotic data from labs around the world into massive collections like the Open X-Embodiment (OXE) dataset, researchers trained Generalist Robot Policies (GRPs) like \(\pi_0\), Octo, and RT-X. These models can perform a wide range of tasks, from opening drawers to picking up specific objects. ...

2025-08 · 9 min · 1813 words
[HALO: Human Preference Aligned Offline Reward Learning for Robot Navigation 🔗](https://arxiv.org/abs/2508.01539)

Teaching Robots to Navigate Like Humans: A Deep Dive into HALO

Imagine you are teaching a teenager to drive. You typically don’t give them a differential equation describing the distance to the curb or a physics formula for the friction coefficient of the road. Instead, you offer intuitive feedback: “That was a bit too close to the parked car,” or “Good job slowing down for that pedestrian.” This human intuition is incredibly powerful, yet translating it into the mathematical language of robotics is notoriously difficult. ...

2025-08 · 9 min · 1709 words
[Contrastive Forward Prediction Reinforcement Learning for Adaptive Fault-Tolerant Legged Robots 🔗](https://openreview.net/pdf?id=P0uqo7CpL8)

When Robots Learn to Limp: Adaptive Fault Tolerance via Forward Prediction

Imagine you are hiking up a steep, rocky trail. Suddenly, you twist your ankle. It hurts, and your range of motion is limited. What do you do? You don’t stop functioning; you adapt. You shift your weight, change your gait, and favor the uninjured leg. You consciously predict that putting weight on the bad ankle will result in failure, so you adjust your control signals accordingly. This ability to adapt to physical impairment is natural for biological beings, but it is an immense challenge for robots. In the field of legged robotics, reliability is the holy grail. We want robots to navigate disaster zones, explore planetary surfaces, and inspect industrial sites. However, hardware breaks. Motors wear out, gearboxes jam, and legs sustain damage. For a standard robot, a single locked joint often leads to immediate failure. ...

9 min · 1875 words
[Self-supervised Learning Of Visual Pose Estimation Without Pose Labels By Classifying LED States 🔗](https://arxiv.org/abs/2509.10405)

Learning to See Without Teachers: How Robots Can Learn Pose Estimation Just by Blinking

In the world of robotics, answering the question “Where am I relative to you?” is surprisingly difficult. This problem, known as visual relative pose estimation, is fundamental for multi-robot systems. Whether it’s a swarm of drones coordinating a light show or warehouse robots avoiding collisions, robots need to know the position and orientation (pose) of their peers. Traditionally, teaching a robot to estimate pose from a camera image requires a heavy dose of supervision. You usually have two expensive options: ...

2025-09 · 10 min · 2028 words
[ImLPR: Image-based LiDAR Place Recognition using Vision Foundation Models 🔗](https://arxiv.org/abs/2505.18364)

Bridging the Gap: How ImLPR Adapts Vision Foundation Models for 3D LiDAR Place Recognition

Imagine a robot navigating a bustling city or a complex underground tunnel. To operate autonomously, it doesn’t just need to know where obstacles are; it needs to know where it is on a global map. GPS is often unreliable or unavailable in these environments (think urban canyons or indoor spaces). This is where LiDAR Place Recognition (LPR) comes in. The robot scans its surroundings with a laser scanner and asks, “Have I seen this pattern of geometry before?” ...

2025-05 · 7 min · 1457 words
[Door(s): Junction State Estimation for Efficient Exploration in Reinforcement Learning 🔗](https://openreview.net/pdf?id=NtnPVwUCAH)

Finding the Keys: How 'Junction States' Unlock Efficient Exploration in RL

Reinforcement Learning (RL) has achieved remarkable feats, from mastering complex strategy games to controlling robotic limbs. However, one bottleneck persistently stifles progress: efficient exploration. In environments where rewards are “sparse”—meaning the agent receives feedback only rarely, perhaps only upon completing a complex task—an agent can spend eons flailing randomly, never stumbling upon the specific sequence of actions required to earn a reward. Imagine you are dropped into a massive, dark maze with a single treasure chest hidden somewhere deep inside. If you wander randomly, you might eventually find it, but it could take a lifetime. However, if you realized that passing through a doorway (a junction) opens up a whole new section of the maze, you would prioritize finding those doors. This structural knowledge is crucial. ...

9 min · 1835 words
[DiWA: Diffusion Policy Adaptation with World Models 🔗](https://arxiv.org/abs/2508.03645)

Dreaming of Success: How Robots Can Fine-Tune Skills Entirely Offline

In the world of robotics, there is a massive gap between “watching” a task and “mastering” it. Imagine you are learning to play tennis. You can watch a professional player (Imitation Learning), and you might pick up the form. But to actually get good, you need to step onto the court and hit the ball thousands of times, adjusting your swing based on where the ball lands (Reinforcement Learning). ...

2025-08 · 8 min · 1594 words
[SimShear: Sim-to-Real Shear-based Tactile Servoing 🔗](https://arxiv.org/abs/2508.20561)

Feeling the Slide: How SimShear Bridges the Gap in Robotic Tactile Sensing

In the world of robotics, vision has long been king. We have taught robots to see obstacles, classify objects, and navigate rooms with impressive accuracy. But when it comes to manipulation—actually grabbing, holding, and moving things—sight isn’t enough. Try tying your shoelaces with numb fingers; even if you watch your hands closely, it’s incredibly difficult. You need the sense of touch. Specifically, you need to feel shear. Shear is the lateral force that stretches your skin when an object slides across your fingertips or when gravity pulls on a heavy object you’re holding. It’s the sensation that tells you a glass is about to slip from your hand before it actually falls. ...

2025-08 · 8 min · 1540 words
[ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation 🔗](https://arxiv.org/abs/2505.09698)

Can LLMs Control Robots? Exploring ManipBench and Low-Level Reasoning

The dream of the “generalist robot”—a machine capable of folding laundry, cooking dinner, and tidying a workshop without explicit reprogramming—has long captivated roboticists. Recently, Vision-Language Models (VLMs) like GPT-4 and Gemini have brought us closer to this reality. These models possess immense “common sense” knowledge about the world. They can look at a messy room and tell you what needs to be cleaned up. However, knowing what to do is very different from knowing how to do it. ...

2025-05 · 7 min · 1325 words
[Distributed Upload and Active Labeling for Resource-Constrained Fleet Learning 🔗](https://openreview.net/pdf?id=M1e2PEMLp2)

Taming the Data Deluge with DUAL—Optimizing Robotic Fleet Learning

Imagine a fleet of hundreds of autonomous vehicles or delivery drones operating in the real world. Every second, these machines capture high-resolution video, LiDAR scans, and telemetry. Collectively, they generate terabytes of data every single day. Ideally, we would upload all this data to the cloud, label it, and use it to train smarter, safer AI models. But in reality, this is impossible. We face two crushing bottlenecks. First, bandwidth is limited. A car operating in a busy city or a remote rural area cannot upload terabytes of raw sensor data over a 4G or 5G connection; the network simply can’t handle it. Second, annotation is expensive. Even if we could upload everything, human annotators (or expensive foundation models) cannot label millions of images a day. We have a limited budget for ground-truth labeling. ...

9 min · 1724 words
[Towards Embodiment Scaling Laws in Robot Locomotion 🔗](https://arxiv.org/abs/2505.05753)

The Body Problem: Can Scaling Robot Morphologies Unlock General Intelligence?

Two and a half millennia ago, the philosopher Heraclitus famously wrote that “no man steps in the same river twice.” In the field of robotics, we face a similar, strictly physical reality: no agent acts with exactly the same body twice. Consider a robot deployed in the real world. Over time, its motors degrade, its joints loosen, and it might even suffer damage to a limb. Even newly manufactured robots of the same model have subtle manufacturing variations. If we want to build truly generalist robots—agents that can operate not just one specific machine, but adapt to any physical form—we face a massive hurdle. Current deep learning success stories, like Large Language Models (LLMs), have thrived by scaling up data and model size. But in robotics, we have a third, largely unexplored dimension: embodiment. ...

2025-05 · 8 min · 1512 words
[RoboChemist: Long-Horizon and Safety-Compliant Robotic Chemical Experimentation 🔗](https://arxiv.org/abs/2509.08820)

RoboChemist: How Dual-Loop AI is Mastering the Chemistry Lab

Imagine a chemistry laboratory. It is a place of precise measurements, volatile reactions, and fragile equipment. Now, imagine a robot trying to navigate this space. Unlike organizing blocks on a table—a classic robotics benchmark—chemistry involves transparent glassware that depth sensors can’t see, liquids that slosh and spill, and safety protocols where a few millimeters of error can lead to a hazardous situation. For years, the dream of an autonomous “Robotic Chemist” has been stifled by these physical realities. Robots are good at repetitive motions, but they struggle with the dynamic, safety-critical reasoning required for experimental science. ...

2025-09 · 8 min · 1632 words
[D-CODA: Diffusion for Coordinated Dual-Arm Data Augmentation 🔗](https://arxiv.org/abs/2505.04860)

Teaching Robots to Coordinate: How Diffusion Models Are Solving Bimanual Manipulation

If you’ve ever tried to carry a heavy moving box or fold a large bedsheet with just one hand, you know the struggle. We humans rely heavily on bimanual manipulation—using two hands in coordination—to interact with the world. For robots to be truly useful in homes and warehouses, they need to master this same skill. However, training robots to coordinate two arms is exponentially harder than training one. You have to manage higher degrees of freedom, ensure the arms don’t crash into each other, and maintain precise coordination to keep objects from falling. Traditional Imitation Learning (IL), where robots learn by mimicking human demonstrations, works well but is data-hungry. Collecting thousands of coordinated, two-arm demonstrations is costly and labor-intensive. ...

2025-05 · 8 min · 1581 words
[UnPose: Uncertainty-Guided Diffusion Priors for Zero-Shot Pose Estimation 🔗](https://arxiv.org/abs/2508.15972)

Solving the Hallucination Problem: How UnPose Brings Uncertainty to Robotic Perception

Imagine a robot entering a kitchen for the first time. It sees a mug on the counter. To pick it up, the robot needs to know exactly where that mug is in 3D space and how it is oriented—a problem known as 6D pose estimation. Historically, this has been a rigid process. Robots relied on exact CAD models (digital blueprints) of every object they might encounter. If you didn’t have the CAD file for that specific “Master Chef” mug, the robot was blind. Recent advances in deep learning attempted to fix this with “category-level” training (teaching a robot what a generic mug looks like) or “model-free” approaches that reconstruct objects on the fly. ...

2025-08 · 8 min · 1676 words
[WoMAP: World Models For Embodied Open-Vocabulary Object Localization 🔗](https://arxiv.org/abs/2506.01600)

How Robots Learn to Search: A Deep Dive into WoMAP and World Models

Imagine you are looking for your keys. You don’t scan every millimeter of the ceiling or the blank wall; you look on the table, check the couch cushions, or look near the door. Your search is active, exploratory, and guided by a “mental model” of where things usually are. Now, imagine asking a robot to “find the banana.” For a robot, this is an incredibly complex task known as Open-Vocabulary Object Localization. The robot must understand what a “banana” is (semantics), navigate a potentially cluttered and unseen environment (exploration), and understand how its physical movements change what it sees (dynamics). ...

2025-06 · 8 min · 1687 words
[Constraint-Preserving Data Generation for Visuomotor Policy Learning 🔗](https://openreview.net/pdf?id=KSKzA1mwKs)

One Demo Is All You Need? Scaling Robot Learning with CP-Gen

We are living in the golden age of imitation learning. From robots that can cook shrimp to those that can repair themselves, we’ve seen incredible breakthroughs driven by large-scale demonstration data. However, there is a massive bottleneck hiding behind these viral videos: the cost of data. Projects like ALOHA Unleashed or DROID required months of labor, dozens of operators, and tens of thousands of collected trajectories. If we want robots to generalize to every cup, hammer, or door handle in the world, we cannot possibly teleoperate them through every variation. We need a way to multiply our data automatically. ...

7 min · 1433 words
[Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop 🔗](https://arxiv.org/abs/2506.10968)

Learning to Look: How EyeRobot Uses Active Vision to Master Manipulation

Imagine you are thirsty. You decide to reach for a cup of coffee sitting on your desk. What happens first? Before your arm muscles even engage, your eyes move. You scan the table, saccade toward the cup to lock in its position, and then guide your hand toward it. Once you grasp the cup, your eyes might immediately dart to the coaster where you plan to place it. This sequence feels instantaneous, but it reveals a fundamental truth about biological intelligence: we do not passively absorb the visual world like a video camera; we actively look in order to act. ...

2025-06 · 9 min · 1757 words
[Learning Deployable Locomotion Control via Differentiable Simulation 🔗](https://arxiv.org/abs/2404.02887)

How to Teach Robots to Walk Using Analytic Gradients and Differentiable Physics

In the world of robotics, we are constantly chasing the dream of efficient learning. If you have ever trained a neural network for image recognition, you know the power of backpropagation. You calculate the error, compute the gradient (the direction to adjust parameters to reduce that error), and update the network. It’s elegant, mathematical, and efficient. However, when we try to apply this same logic to robots interacting with the physical world—specifically legged robots that need to walk—we hit a massive wall: Contact. ...

2024-04 · 9 min · 1837 words
[Agreement Volatility: A Second-Order Metric for Uncertainty Quantification in Surgical Robot Learning 🔗](https://openreview.net/pdf?id=K7KLc4FexO)

When Robots Get Nervous: Making Autonomous Surgery Safer with Agreement Volatility

Imagine a future where a surgical robot operates autonomously on a patient. It’s stitching soft tissue with precision, relieving an overworked surgeon who oversees the process. Suddenly, the robot encounters a piece of tissue that is slightly more slippery or deformed than what it was trained on. In a standard automation scenario, the robot might plow ahead, confident in a wrong decision, potentially causing harm. Ideally, however, the robot would “realize” it is confused, pause, and ask the human surgeon to take over for a moment. Once the tricky part is navigated, the robot resumes control. ...

9 min · 1764 words