[Hand-Eye Autonomous Delivery: Learning Humanoid Navigation, Locomotion and Reaching 🔗](https://arxiv.org/abs/2508.03068)

HEAD: Teaching Humanoids to Move and Reach Like Humans

Imagine you are in a kitchen and you spot a coffee mug on the counter. You don’t just “drive” your body to the counter like a tank and then extend your arm like a crane. You walk towards it, likely adjusting your stride, leaning your torso forward, and extending your hand—all in one fluid, coordinated motion. Your eyes lock onto the target, and your body follows. For humanoid robots, however, this seamless integration of navigation (walking to a place) and manipulation (reaching for things) is incredibly difficult. Historically, roboticists have treated these as separate problems: mobile bases handle the 2D navigation, and once the robot stops, a manipulator arm handles the reaching. ...

2025-08 · 8 min · 1546 words
[RobotxR1: Enabling Embodied Robotic Intelligence on Large Language Models through Closed-Loop Reinforcement Learning 🔗](https://arxiv.org/abs/2505.03238)

Beyond Static Data: How Closed-Loop RL Teaches Small LLMs to Drive Better than GPT-4o

Introduction We are currently witnessing a massive shift in the capabilities of Large Language Models (LLMs). With the release of models like DeepSeek R1, we’ve seen that LLMs can “learn to reason” by verifying their own answers against mathematical truths. But there is a frontier where this reasoning capability hits a wall: Embodied AI. In the digital world, a math problem is static. In the physical world, environments are chaotic, dynamic, and unpredictable. A robot cannot simply “think” its way out of a problem; it must act, observe the consequences, and adapt. Furthermore, robots often operate on the “edge”—onboard computers with limited battery and memory—making massive cloud-based models like GPT-4 impractical and insecure for real-time control. ...

2025-05 · 8 min · 1546 words
[Off Policy Lyapunov Stability in Reinforcement Learning 🔗](https://arxiv.org/abs/2509.09863)

Bridging the Gap Between Safety and Efficiency in Reinforcement Learning

Introduction Imagine training a robot to carry a tray of drinks. In a simulation, if the robot trips and shatters the glass, you simply hit “reset.” In the real world, however, that failure is costly, dangerous, and messy. This is the fundamental tension in Deep Reinforcement Learning (DRL). While DRL has achieved incredible feats—from defeating grandmasters in Go to controlling complex robotic hands—it essentially learns through trial and error. It explores the world, makes mistakes, and gradually optimizes a policy to maximize rewards. ...

2025-09 · 9 min · 1872 words
[From Space to Time: Enabling Adaptive Safety with Learned Value Functions via Disturbance Recasting 🔗](https://arxiv.org/abs/2509.19597)

Flying Through the Unknown: How SPACE2TIME Turns Spatial Chaos into Temporal Order for Drone Safety

Imagine a drone delivering a package in a dense urban environment. It takes off from a calm rooftop, but as it descends into the “urban canyon” between two skyscrapers, it encounters a sudden, fierce wind tunnel. The wind dynamics here are completely different from the rooftop. For autonomous systems, this is a nightmare scenario. To guarantee safety, the drone needs to know the limits of the environment—specifically, the maximum disturbance (wind) it might encounter. If the drone assumes the worst-case storm at all times, it will be too conservative to fly efficiently (or at all). If it assumes calm weather, it might crash when it hits the wind tunnel. ...

2025-09 · 7 min · 1464 words
[KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation 🔗](https://arxiv.org/abs/2505.01974)

Can Robots Feel? Teaching Dexterity Through Touch with KineDex

Robotic manipulation has come a long way. We have robots that can lift heavy payloads, weld cars with sub-millimeter precision, and even dance. But when it comes to the subtle art of the human hand—buttoning a shirt, cracking an egg without crushing it, or squeezing just the right amount of toothpaste—robots often fall short. The missing link is tactile sensing. While computer vision gives robots “sight,” it doesn’t tell them how hard they are squeezing or if an object is slipping. To bridge this gap, a team of researchers has introduced KineDex, a new framework that teaches robots not just to move, but to feel. ...

2025-05 · 9 min · 1795 words
[TrackVLA: Embodied Visual Tracking in the Wild 🔗](https://arxiv.org/abs/2505.23189)

How TrackVLA Teaches Robots to Follow You: Unifying Vision, Language, and Action

Imagine walking through a crowded office or hiking down a forest trail with a robot assistant carrying your gear. For the robot to be useful, it needs to do one thing flawlessly: follow you. This task, known as Embodied Visual Tracking (EVT), sounds simple to us humans. We effortlessly track a friend in a crowd, predict where they will step next, and navigate around obstacles without losing sight of them. But for robots, this is a nightmare. It requires two distinct skills: recognition (identifying who to follow) and trajectory planning (deciding how to move). ...

2025-05 · 7 min · 1333 words
[CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion 🔗](https://arxiv.org/abs/2506.14769)

Why History Matters: Making Robots Robust with Causal Diffusion Policy

Introduction Imagine trying to pour a cup of coffee. You pick up the kettle, align the spout, and tilt. Now, imagine your eyes momentarily lose focus or the lights flicker out for a split second. Do you drop the kettle? Probably not. You rely on your muscle memory and the context of what you were doing just a moment ago—you know you were in the middle of a pouring motion, so you continue smoothly. ...

2025-06 · 9 min · 1907 words
[Towards Generalizable Safety in Crowd Navigation via Conformal Uncertainty Handling 🔗](https://arxiv.org/abs/2508.05634)

Why Uncertainty is Key to Safe Robot Navigation in Crowds

Imagine walking through a busy train station during rush hour. You don’t just calculate the exact future trajectory of every person around you. Instead, you instinctively identify who is walking steadily and who is rushing unpredictably. You give the erratic rushers more space—effectively placing a “safety bubble” around them based on how unsure you are of their movement. For mobile robots, replicating this intuition is incredibly difficult. While Deep Reinforcement Learning (DRL) has enabled robots to navigate crowds in simulation, these robots often suffer from a “reality gap.” They perform beautifully in the environments they were trained in but fail dangerously when faced with Out-of-Distribution (OOD) scenarios—such as sudden changes in walking speeds, group behaviors, or aggressive pedestrian dynamics. ...

2025-08 · 7 min · 1322 words
[Robot Learning from Any Images 🔗](https://arxiv.org/abs/2509.22970)

From Static Photos to Active Robots: How RoLA Turns Any Image into a Physics Simulator

Introduction In the field of Artificial Intelligence, language models like GPT-4 have achieved remarkable capabilities largely because they were trained on the entire textual internet. Robotics, however, faces a distinct “data starvation” problem. While text and images are abundant, robotic data—specifically, data that links visual perception to physical action—is incredibly scarce. Collecting data on real robots is slow, expensive, and potentially dangerous. The alternative has traditionally been simulation (Sim-to-Real), where we build digital twins of the real world. But creating these digital twins usually requires complex setups: multi-view camera rigs, 3D scanners, and manual asset creation. You can’t just take a photo of your messy kitchen and expect a robot to learn how to clean it… until now. ...

2025-09 · 8 min · 1546 words
[ComposableNav: Instruction-Following Navigation in Dynamic Environments via Composable Diffusion 🔗](https://arxiv.org/abs/2509.17941)

Building Complex Robot Behaviors from Simple Blocks: A Deep Dive into ComposableNav

Imagine you are walking down a crowded hallway. A friend calls out to you: “Hey, catch up to Alice, but make sure you pass Bob on his left, and try to stay on the right side of the carpet.” For a human, this instruction is complex but manageable. We instinctively break it down: Locate Alice (Goal). Locate Bob and plan a left-side pass (Constraint A). Identify the carpet and stay right (Constraint B). We execute these behaviors simultaneously. However, for a robot, this is a nightmare of combinatorial complexity. Standard robotic learning approaches often try to learn a single policy for every possible scenario. But as the number of potential constraints grows—yielding, following, avoiding, passing left/right—the number of combinations explodes exponentially. Training a robot for every possible permutation of instructions is computationally intractable. ...

2025-09 · 8 min · 1588 words
[UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations 🔗](https://arxiv.org/abs/2505.08787)

How Robots Can Learn from Human Videos: An Introduction to UniSkill

Introduction One of the “holy grails” of robotics and artificial intelligence is the ability to teach a robot new skills simply by showing it a video of a human performing a task. Imagine if, instead of programming a robot step-by-step or teleoperating it for hours to collect data, you could simply show it a YouTube clip of a person folding a shirt, and the robot would immediately understand how to do it. ...

2025-05 · 9 min · 1886 words
[PicoPose: Progressive Pixel-to-Pixel Correspondence Learning for Novel Object Pose Estimation 🔗](https://arxiv.org/abs/2504.02617)

PicoPose: Mastering Zero-Shot Object Pose Estimation with Progressive Learning

Introduction Imagine you are asking a robot to pick up a specific brand of drill from a cluttered table. If the robot has seen that exact drill a thousand times during its training, this is a trivial task. But what if it’s a brand new tool it has never encountered before? This scenario, known as novel object pose estimation, is one of the “holy grail” challenges in robotic vision. To interact with the physical world, a robot needs to know an object’s 6D pose—its position (3D coordinates) and its orientation (3D rotation) relative to the camera. Traditionally, accurate pose estimation required expensive depth sensors (RGB-D) to understand the geometry of the scene. While effective, depth sensors increase the cost and complexity of robotic systems. ...

2025-04 · 8 min · 1631 words
[Co-Design of Soft Gripper with Neural Physics 🔗](https://arxiv.org/abs/2505.20404)

Finding the Sweet Spot: How Neural Physics Optimizes Soft Robot Design

Introduction In the world of robotic manipulation, engineers face a persistent dilemma known as the “compliance trade-off.” Traditional rigid grippers—like the metal claws found on assembly lines—are precise and strong, but they struggle with irregular shapes and can easily crush delicate objects. On the other end of the spectrum, soft robotic grippers made of silicone or rubber offer excellent adaptability and safety; they can wrap around a strawberry without bruising it. However, soft grippers often lack the strength to lift heavy tools or the precision to handle specific orientations. ...

2025-05 · 11 min · 2170 words
[Joint Model-based Model-free Diffusion for Planning with Constraints 🔗](https://arxiv.org/abs/2509.08775)

The Best of Both Worlds: Teaching Diffusion Models to Think Like Control Theorists

Introduction In the world of robotics, there is a constant tug-of-war between creativity and safety. On one side, we have data-driven methods, particularly diffusion models. These are the “artists.” They have watched thousands of demonstrations and learned to generate complex, human-like motions. They can navigate cluttered rooms or manipulate objects with dexterity. However, like many artists, they don’t always like following strict rules. If you present a diffusion model with a safety constraint it hasn’t seen before, it might hallucinate a path right through a wall. ...

2025-09 · 7 min · 1402 words
[Action-Free Reasoning for Policy Generalization 🔗](https://arxiv.org/abs/2502.03729)

RAD: Teaching Robots to Reason by Watching Humans

Introduction One of the most persistent bottlenecks in robotics is data. To train a robot to perform useful tasks—like tidying a kitchen or sorting objects—we typically need thousands of demonstrations where a human manually guides the robot through the motions. This process, known as imitation learning, is slow, expensive, and difficult to scale. Conversely, the internet is overflowing with “human data.” There are millions of videos of people cooking, cleaning, and manipulating objects. If robots could learn from this data, we could solve the scalability problem overnight. However, there is a catch: the embodiment gap. A human hand does not look or move like a robotic gripper. Furthermore, human videos are “action-free”—they contain visual information but lack the precise motor commands (joint angles, torques) that a robot needs to execute a task. ...

2025-02 · 8 min · 1516 words
[UniTac2Pose: A Unified Approach Learned in Simulation for Category-level Visuotactile In-hand Pose Estimation 🔗](https://arxiv.org/abs/2509.15934)

How Robots Learn to Feel: A Unified Energy-Based Approach for Tactile Pose Estimation

Introduction In the world of robotics, vision often gets all the glory. We marvel at robots that can “see” and navigate complex environments. However, when it comes to the delicate art of manipulation—positioning a workpiece, assembling a component, or inserting a USB drive—vision has its limits. Cameras suffer from occlusions (the robot’s own hand often blocks the view) and lighting variations. This is where tactile sensing becomes indispensable. For a robot to manipulate an object precisely, it must know the object’s exact 6D pose (position and orientation) within its hand. This is known as in-hand pose estimation. While humans do this instinctively, it is a massive computational challenge for robots, particularly when handling objects they haven’t seen before or objects with symmetrical shapes that look identical from different angles. ...

2025-09 · 7 min · 1386 words
[OPAL: Visibility-aware LiDAR-to-OpenStreetMap Place Recognition via Adaptive Radial Fusion 🔗](https://arxiv.org/abs/2504.19258)

OPAL: How to Localize Self-Driving Cars Using Free Maps and Deep Learning

Introduction Imagine you are driving an autonomous vehicle through a dense urban center—perhaps downtown Manhattan or a narrow European street. Suddenly, the skyscrapers block your GPS signal. The blue dot on your navigation screen freezes or drifts aimlessly. For a human driver, this is an annoyance; for a self-driving car, it is a critical failure. To navigate safely without GNSS (Global Navigation Satellite Systems), a robot must answer the question: “Where am I?” based solely on what it sees. This is known as Place Recognition. Typically, this involves matching the car’s current sensor view (like a LiDAR scan) against a pre-built database. ...

2025-04 · 9 min · 1875 words
[Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration 🔗](https://arxiv.org/abs/2504.12609)

One Video is All You Need: Teaching Dexterous Robots with Human2Sim2Robot

Introduction Imagine you want to teach a robot how to pour a glass of water or place a dish in a rack. In an ideal world, you would simply show the robot how to do it once—perhaps by performing the task yourself—and the robot would immediately understand and replicate the skill. In reality, teaching robots “dexterous manipulation” (using multi-fingered hands to handle objects) is notoriously difficult. Traditional methods like Imitation Learning (IL) often require hundreds of demonstrations to learn a robust policy. Furthermore, capturing high-quality data of human hand motion typically involves expensive wearable sensors or complex teleoperation rigs. ...

2025-04 · 10 min · 1931 words
[Uncertainty-aware Latent Safety Filters for Avoiding Out-of-Distribution Failures 🔗](https://arxiv.org/abs/2505.00779)

When Robots Hallucinate: Making AI Safe in an Uncertain World

Imagine you are playing a high-stakes game of Jenga. You carefully tap a block, analyzing how the tower wobbles. You predict that if you pull it slightly to the left, the tower remains stable. If you pull it to the right, it crashes. Your brain is running a “world model”—simulating the physics of the tower to keep you safe from losing. ...

2025-05 · 9 min · 1731 words
[MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence 🔗](https://arxiv.org/abs/2508.13534)

Teach Once, Do Anywhere: How MimicFunc Enables Robots to Master Tools from One Video

Imagine you are teaching a friend how to scoop beans. You pick up a silver spoon, scoop the beans, and dump them into a bowl. Now, you hand your friend a large plastic ladle. Without hesitation, your friend adjusts their grip, accounts for the ladle’s larger size, and performs the exact same scooping action. They understood the function of the action, not just the specific geometry of the spoon. For robots, this simple act of transfer is incredibly difficult. Traditional robotic learning often relies on rote memorization of specific objects. If you teach a robot to pour with a red mug, it is likely to fail when handed a glass measuring cup. The shapes, sizes, and grasp points are mathematically different, even if the “function” (pouring) is identical. ...

2025-08 · 8 min · 1661 words