[CASH: Capability-Aware Shared Hypernetworks for Flexible Heterogeneous Multi-Robot Coordination 🔗](https://arxiv.org/abs/2501.06058)

Breaking the Trade-off: How Hypernetworks Enable Flexible Multi-Robot Teams

Imagine a team of robots deployed to fight a wildfire. This isn’t a uniform squad of identical drones; it is a heterogeneous team. Some are fast aerial scouts with limited payload, others are heavy ground rovers carrying massive water tanks, and a few are agile quadrupeds designed to navigate debris. To succeed, these robots must coordinate flawlessly. The scouts need to identify hotspots for the rovers, and the rovers need to position themselves where they can be most effective given their slow speed. ...

2025-01 · 8 min · 1608 words
[ReCoDe: Reinforcement Learning-based Dynamic Constraint Design for Multi-Agent Coordination 🔗](https://arxiv.org/abs/2507.19151)

The Best of Both Worlds: Merging Control Theory and RL for Multi-Robot Coordination

Imagine a narrow corridor in a busy warehouse. Two autonomous robots, moving in opposite directions, meet in the middle. Neither has enough room to pass the other. A human would instinctively know what to do: one person backs up into a doorway or hugs the wall to let the other pass. But for robots, this simple interaction is a complex mathematical standoff. In the world of robotics, this is a classic coordination problem. Traditionally, engineers solve this using optimization-based controllers. These are rigid, handcrafted mathematical rules that guarantee safety—ensuring the robot doesn’t hit a wall or another agent. However, these systems are notoriously bad at “social” negotiation. They often result in deadlocks where both robots just freeze, waiting for the other to move. ...

2025-07 · 10 min · 1982 words
[Predictive Red Teaming: Breaking Policies Without Breaking Robots 🔗](https://arxiv.org/abs/2502.06575)

Breaking Robots with Generative AI: A Guide to Predictive Red Teaming

The “It Works in the Lab” Problem: Imagine you have spent weeks training a robotic arm to perform a manipulation task, like picking up objects and sorting them into bins. You use Imitation Learning, showing the robot thousands of demonstrations. In your lab, under bright fluorescent lights with a standard pink table mat, the robot is a star. It achieves a 90% success rate. Then, you move the table two centimeters closer to the window. Or maybe someone walks by wearing a bright red shirt. Or you swap the table mat for a blue one. Suddenly, the robot’s performance plummets. It flails, misses the object, or freezes entirely. ...

2025-02 · 8 min · 1651 words
[Efficient Evaluation of Multi-Task Robot Policies With Active Experiment Selection 🔗](https://arxiv.org/abs/2502.09829)

Smarter, Not Harder—Cutting the Cost of Robot Evaluation with Active Testing

In the world of modern robotics, training a policy is only half the battle. The other half—and often the more expensive half—is figuring out if it actually works. Imagine you have trained a robot to perform household chores. It can pick up a cup, open a drawer, and wipe a table. But can it pick up a red cup? Can it open a stuck drawer? To be sure, you need to test it. Now, imagine you have five different versions of this robot software (policies) and fifty different tasks. That is 250 unique combinations. If you run each combination 10 times to get statistically significant results, you are looking at 2,500 physical experiments. ...

2025-02 · 7 min · 1436 words
[Learning Smooth State-Dependent Traversability from Dense Point Clouds 🔗](https://arxiv.org/abs/2506.04362)

SPARTA - Why Your Robot Needs to Care About Approach Angles

Imagine you are driving an off-road vehicle through a rocky field. You see a sharp, jagged rock ahead. If you hit it head-on, you’ll likely pop a tire or break an axle. However, if you approach that same rock at a slight angle, your tire might ride up the side smoothly, allowing you to pass without damage. This scenario highlights a fundamental truth in off-road navigation: risk is state-dependent. Whether a piece of terrain is traversable often depends not just on the terrain’s geometry, but on the robot’s angle of approach. ...

2025-06 · 12 min · 2370 words
[Robot Trains Robot: Automatic Real-World Policy Adaptation and Learning for Humanoids 🔗](https://arxiv.org/abs/2508.12252)

When Robots Become Teachers: Accelerating Humanoid Learning in the Real World

Imagine a toddler learning to walk. They stumble, they teeter, and inevitably, they fall. But usually, a parent is there—holding their hands, guiding their weight, catching them before they hit the ground, and picking them back up to try again. This biological “teacher-student” loop is fundamental to how humans master motor skills. In the world of robotics, specifically humanoid locomotion, we often skip this step. We typically train robots in a digital “matrix”—a physics simulation—where they can fall millions of times without breaking. Then, we copy-paste that brain into a physical robot and hope for the best. This is known as Sim-to-Real transfer. While effective, it suffers from the “reality gap”: friction, sensor noise, and complex physics in the real world never perfectly match the simulation. ...

2025-08 · 11 min · 2156 words
[Steerable Scene Generation with Post Training and Inference-Time Search 🔗](https://openreview.net/pdf?id=oOCa85Z1Ho)

Beyond Static Worlds - Steering Generative AI to Create Complex 3D Robot Environments

In the rapidly evolving world of robotics, data is the new gold. We are witnessing a shift where robots, much like Large Language Models (LLMs), are increasingly trained on massive datasets. However, unlike chatbots that feed on text scraped from the internet, robots need to understand physical space. They need 3D environments to practice navigation, manipulation, and interaction. This creates a bottleneck: Where do we get millions of diverse, physically realistic 3D scenes? ...

8 min · 1576 words
[Diffusion Dynamics Models with Generative State Estimation for Cloth Manipulation 🔗](https://arxiv.org/abs/2503.11999)

Solving the Laundry Problem: How Generative AI Teaches Robots to Fold Clothes

If you ask a robot to pick up a coffee mug, it will likely succeed. The mug is rigid; its shape doesn’t change when you touch it. If you know where the handle is, you can calculate exactly how to grab it. Now, ask that same robot to fold a crumpled t-shirt. The robot will likely fail miserably. Cloth manipulation is one of the “Holy Grails” of robotics. Unlike rigid objects, cloth has near-infinite degrees of freedom. It crumples, folds over itself, and occludes its own shape. When a shirt is in a pile, you can only see a fraction of its surface. To a robot’s camera, a crumpled shirt looks like a chaotic, unidentifiable blob. Furthermore, the physics of cloth are complex and non-linear; pulling a sleeve might drag the whole shirt, or it might just stretch the fabric. ...

2025-03 · 8 min · 1644 words
[LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation 🔗](https://arxiv.org/abs/2505.11528)

Can Robots Imagine? Inside LaDi-WM, the Latent Diffusion World Model Revolutionizing Robotic Manipulation

Imagine you are about to stack a heavy ceramic bowl onto a fragile glass cup. Before you even move your hand, your brain runs a split-second simulation. You visualize the bowl slipping, the glass shattering, and the mess that follows. Consequently, you adjust your grip and approach before you even make contact. This ability to “predict the future” based on our actions is fundamental to human dexterity. In robotics, this concept is implemented through World Models—internal simulators that allow robots to forecast the consequences of their actions. However, giving robots this foresight is notoriously difficult. Predicting every pixel of a future video frame is computationally expensive and often results in blurry, physically impossible hallucinations. ...

2025-05 · 10 min · 1963 words
[Learning Long-Context Diffusion Policies via Past-Token Prediction 🔗](https://openreview.net/pdf?id=o0LBjJxUeS)

Robots That Remember: Improving Diffusion Policies with Past-Token Prediction

In the world of robotics, memory is everything. Imagine trying to make a cup of coffee without remembering if you’ve already added the sugar, or trying to unlock a door without recalling which key you just tried. For humans, these temporal dependencies—the relationship between what happened five seconds ago and what should happen next—are intuitive. For robots, they are a massive computational and algorithmic headache. ...

9 min · 1849 words
[Constraint-Aware Diffusion Guidance for Robotics: Real-Time Obstacle Avoidance for Autonomous Racing 🔗](https://arxiv.org/abs/2505.13131)

Safety First at High Speed: How CoDiG Brings Diffusion Models to Autonomous Racing

Generative AI has recently transformed how we create images, videos, and text. Models like DALL-E and Sora rely on diffusion models—systems capable of learning complex, high-dimensional distributions and generating diverse, high-quality outputs. Naturally, roboticists have begun asking: Can we use this same technology to control robots? Ideally, a robot could “dream” up a path through a complex environment just as easily as an image generator dreams up a sunset. But there is a catch. If an image generator puts a sixth finger on a hand, it looks weird. If a robot planner puts a trajectory through a wall, the robot crashes. ...

2025-05 · 9 min · 1789 words
[QuaDreamer: Controllable Panoramic Video Generation for Quadruped Robots 🔗](https://arxiv.org/abs/2508.02512)

Teaching Robot Dogs to Dream: Inside QuaDreamer, A World Model for Quadruped Perception

In the rapidly evolving field of embodied AI, quadruped robots—often called “robot dogs”—are becoming essential tools for inspection, search and rescue, and industrial security. To navigate complex environments autonomously, these robots rely heavily on visual perception. Panoramic cameras, which capture a comprehensive 360-degree view, are particularly well-suited for this task, offering a field of view that standard cameras cannot match. However, training perception models for these robots faces a significant bottleneck: data scarcity. Collecting high-quality panoramic video data is labor-intensive, costly, and technically difficult. Unlike wheeled robots or drones, quadruped robots have a unique gait that introduces high-frequency vertical vibrations—essentially, they bounce as they walk. This “jitter” creates motion patterns that are difficult to simulate and often result in blurry or unusable real-world data. ...

2025-08 · 8 min · 1511 words
[Human-like Navigation in a World Built for Humans 🔗](https://arxiv.org/abs/2509.21189)

ReasonNav: Teaching Robots to Ask for Directions and Read Signs using VLMs

The “Lost in the Office” Problem: Imagine you are visiting a massive corporate headquarters for the first time. You need to deliver a document to “Jane Doe” in Room 4205. You walk into the lobby. What do you do? You probably don’t start walking down every single corridor, opening every door, and checking every room sequentially until you find Jane. That would take hours. Instead, you look for a directory. You check overhead signs pointing toward “Rooms 4100-4300.” If you get confused, you stop a passing employee and ask, “Excuse me, do you know where room 4205 is?” ...

2025-09 · 8 min · 1582 words
[GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation 🔗](https://arxiv.org/abs/2509.10454)

Unlocking Zero-Shot Robot Navigation: How Graph Constraints Turn Language into Motion

Imagine telling a robot, “Go through the living room, pass the sofa, and stop at the white table near the window.” To a human, this is a trivial task. We visualize the path, identify landmarks (sofa, table), and understand spatial relationships (pass, near). To a robot, however, this is a massive computational headache involving language processing, visual recognition, and path planning. ...

2025-09 · 7 min · 1450 words
[Point Policy: Unifying Observations and Actions with Key Points for Robot Manipulation 🔗](https://arxiv.org/abs/2502.20391)

Learning Robot Skills Entirely from Human Videos: The Point Policy Approach

The dream of general-purpose robotics is a machine that can watch a human perform a chore—like folding a towel or putting dishes away—and immediately replicate that behavior. In the fields of Computer Vision (CV) and Natural Language Processing (NLP), we have seen massive leaps in capability driven by internet-scale data. Models like GPT-4 or Sora are trained on vast oceans of text and video scraped from the web. Robotics, however, faces a stubborn bottleneck. Unlike text or images, robotic data usually requires physical interaction. Collecting data means driving a robot, often via tedious teleoperation, which takes time, money, and human labor. If we could teach robots using the millions of “how-to” videos already existing on YouTube, we could unlock a revolution in robotic capability. ...

2025-02 · 9 min · 1840 words
[Train-Once Plan-Anywhere: Kinodynamic Motion Planning via Diffusion Trees 🔗](https://arxiv.org/abs/2508.21001)

DiTree: When Diffusion Models Meet Search Trees for Robot Motion Planning

Imagine you are trying to drive a car through a dense, unfamiliar warehouse. You can’t just draw a straight line to the exit—you have to steer, accelerate, brake, and avoid pillars, all while respecting the car’s turning radius. This is the essence of Kinodynamic Motion Planning (KMP). It’s not just about geometry; it’s about physics. For decades, roboticists have struggled to solve KMP efficiently. We essentially had two choices: algorithms that are mathematically guaranteed to work but are painfully slow (Search), or modern AI models that are incredibly fast but often crash or hallucinate (Learning). ...

2025-08 · 8 min · 1588 words
[SDS – See it, Do it, Sorted: Quadruped Skill Synthesis from Single Video Demonstration 🔗](https://arxiv.org/abs/2410.11571)

From YouTube to Robot: How SDS Teaches Quadrupeds to Walk Just by Watching

Imagine trying to learn a new dance move. You don’t read a physics textbook about torque and angular velocity; you simply watch someone else do it, and you try to mimic them. You look at their feet, the rhythm of their steps, and you adjust your body until you match what you see. For a long time, teaching robots to move has been the exact opposite. Engineers have had to painstakingly define “rewards” using complex mathematical functions. If you want a robot dog to trot, you have to mathematically describe what a trot looks like—defining foot heights, joint velocities, and contact timings—and then use Reinforcement Learning (RL) to “train” the robot to maximize that score. It is a brittle, time-consuming process that often requires expensive Motion Capture (MoCap) labs. ...

2024-10 · 9 min · 1864 words
[ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models 🔗](https://arxiv.org/abs/2506.16211)

How ControlVLA Teaches Robots New Tricks with Just 10 Demonstrations

If you have ever tried to teach a robotic arm a new skill, you know the struggle: robots are data-hungry. To get a robot to reliably pour water or fold a shirt, you typically need hundreds, if not thousands, of expert demonstrations. This “data barrier” is one of the main reasons we don’t yet have general-purpose robots in our homes. Recent advances in Vision-Language-Action (VLA) models—essentially “Large Language Models for robots”—have shown promise. These models are pre-trained on massive datasets, giving them a form of robotic “common sense.” However, adapting these giant models to specific, local tasks usually requires expensive fine-tuning that washes away their pre-trained knowledge or requires data we simply don’t have. ...

2025-06 · 7 min · 1403 words
[Sample-Efficient Online Control Policy Learning with Real-Time Recursive Model Updates 🔗](https://arxiv.org/abs/2509.08241)

Learning on the Fly: How Recursive Koopman Learning Solves Robot Control in Real-Time

One of the most persistent challenges in robotics is controlling systems that are difficult to model mathematically. Soft robots, for example, have infinite degrees of freedom and complex, nonlinear dynamics that defy standard “first principles” physics modeling. Historically, engineers have faced a dilemma. They could use Model-Based Control, which is rigorous and efficient but fails when the mathematical model doesn’t perfectly match reality. Alternatively, they could use Data-Driven approaches, like Reinforcement Learning (RL). RL is powerful because it learns from experience, but it is notoriously “sample inefficient”—often requiring millions of trial-and-error interactions to learn a simple task. This makes RL impractical for real hardware, where gathering data is slow and expensive. ...

2025-09 · 8 min · 1637 words
[BranchOut: Capturing Realistic Multimodality in Autonomous Driving Decisions 🔗](https://openreview.net/pdf?id=jedBaI1fgU)

Beyond the One True Path: How BranchOut Revolutionizes Multimodal Autonomous Driving

Imagine you are driving down a busy street and you see a delivery truck stopped on the shoulder. As a human driver, what do you do? Do you nudge slightly into the next lane? Do you slow down and wait? Do you perform a full lane change to overtake? The answer is likely “it depends,” and crucially, all of those options might be valid and safe. ...

8 min · 1519 words