[ZipMPC: Compressed Context-Dependent MPC Cost via Imitation Learning 🔗](https://arxiv.org/abs/2507.13088)

ZipMPC: Teaching Short-Sighted Robots to Drive with Long-Term Foresight

In the world of robotics and autonomous systems, there is a constant tug-of-war between foresight and reaction speed. Imagine driving a race car at high speed. To drive optimally, you need to look far ahead (foresight), anticipating curves that are hundreds of meters away. However, you also need to make decisions instantly (reaction speed). If you spend too much time calculating the perfect line for the next ten curves, you’ll crash into the first wall before you’ve even turned the wheel. ...

2025-07 · 8 min · 1644 words
[LLM-Guided Probabilistic Program Induction for POMDP Model Estimation 🔗](https://arxiv.org/abs/2505.02216)

Can LLMs Code Their Own World Models? A Deep Dive into POMDP Coder

Imagine a robot searching for an apple in a cluttered kitchen. It scans the room but doesn’t see the fruit. A human would instinctively check the table or the counter, knowing that apples don’t hover in mid-air or hide inside the toaster. The robot, however, faces a massive challenge: decision-making under uncertainty. It doesn’t know where the apple is (partial observability), and it needs a model of how the world works to search efficiently. ...

2025-05 · 8 min · 1621 words
[Long Range Navigator (LRN): Extending robot planning horizons beyond metric maps 🔗](https://arxiv.org/abs/2504.13149)

Beyond the Map: How Long Range Navigator (LRN) Gives Robots 'Farsight'

Imagine you are hiking through a dense, unfamiliar forest. Your goal is a campsite several kilometers away. You don’t have a detailed topographic map of every tree and rock between you and the destination. Instead, you look into the distance. You see a break in the tree line to your left, a steep cliff to your right, and a dense thicket straight ahead. Even though the campsite is technically straight ahead, you instinctively head toward the clearing on the left. ...

2025-04 · 8 min · 1528 words
[RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models 🔗](https://arxiv.org/abs/2506.17811)

RoboMonkey: Bringing Test-Time Scaling Laws to Robotics

Imagine you are trying to solve a complex math problem. Do you simply blurt out the first number that comes to your head? Probably not. You likely scribble down a few potential approaches, double-check your logic, and verify your answer before committing to it. Humans naturally allocate more “compute” (thinking time) to harder problems. In the world of Large Language Models (LLMs), we have seen this principle formalized as “inference-time scaling.” Techniques like Chain-of-Thought reasoning or generating multiple code snippets and verifying them have revolutionized how AI solves complex logical tasks. ...

2025-06 · 8 min · 1656 words
[SafeBimanual: Diffusion-based Trajectory Optimization for Safe Bimanual Manipulation 🔗](https://arxiv.org/abs/2508.18268)

How to Stop Robots from Breaking Things: Inside SafeBimanual

Imagine you are asking a robot to help you prepare breakfast. It needs to pick up a bottle of milk with one hand, a bowl with the other, and pour the milk without spilling it, crushing the bowl, or banging its two arms together. For a human, this bimanual (two-handed) coordination is intuitive. For a robot, it is a geometric and kinematic nightmare. In the world of robotic learning, Diffusion Policies have emerged as the reigning champions. By learning from human demonstrations, these models are incredibly good at cloning behaviors and handling complex, multi-modal tasks. However, they have a significant blind spot: Physical Safety. ...

2025-08 · 11 min · 2204 words
[Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation 🔗](https://arxiv.org/abs/2508.06426)

Why Your Robot is Cheating: The Hidden Trap of Shortcut Learning in Generalist Policies

In recent years, the recipe for success in Artificial Intelligence has seemed deceptively simple: scale up. In Computer Vision and Natural Language Processing (NLP), feeding massive Transformers with internet-scale data produced emergent capabilities that stunned the world. Naturally, roboticists asked: Can we do the same for physical robots? The answer appeared to be “yes.” By aggregating robotic data from labs around the world into massive collections like the Open X-Embodiment (OXE) dataset, researchers trained Generalist Robot Policies (GRPs) like \(\pi_0\), Octo, and RT-X. These models can perform a wide range of tasks, from opening drawers to picking up specific objects. ...

2025-08 · 9 min · 1813 words
[HALO: Human Preference Aligned Offline Reward Learning for Robot Navigation 🔗](https://arxiv.org/abs/2508.01539)

Teaching Robots to Navigate Like Humans: A Deep Dive into HALO

Imagine you are teaching a teenager to drive. You typically don’t give them a differential equation describing the distance to the curb or a physics formula for the friction coefficient of the road. Instead, you offer intuitive feedback: “That was a bit too close to the parked car,” or “Good job slowing down for that pedestrian.” This human intuition is incredibly powerful, yet translating it into the mathematical language of robotics is notoriously difficult. ...

2025-08 · 9 min · 1709 words
[Contrastive Forward Prediction Reinforcement Learning for Adaptive Fault-Tolerant Legged Robots 🔗](https://openreview.net/pdf?id=P0uqo7CpL8)

When Robots Learn to Limp: Adaptive Fault Tolerance via Forward Prediction

Imagine you are hiking up a steep, rocky trail. Suddenly, you twist your ankle. It hurts, and your range of motion is limited. What do you do? You don’t stop functioning; you adapt. You shift your weight, change your gait, and favor the uninjured leg. You consciously predict that putting weight on the bad ankle will result in failure, so you adjust your control signals accordingly. This ability to adapt to physical impairment is natural for biological beings, but it is an immense challenge for robots. In the field of legged robotics, reliability is the holy grail. We want robots to navigate disaster zones, explore planetary surfaces, and inspect industrial sites. However, hardware breaks. Motors wear out, gearboxes jam, and legs sustain damage. For a standard robot, a single locked joint often leads to immediate failure. ...

9 min · 1875 words
[Self-supervised Learning Of Visual Pose Estimation Without Pose Labels By Classifying LED States 🔗](https://arxiv.org/abs/2509.10405)

Learning to See Without Teachers: How Robots Can Learn Pose Estimation Just by Blinking

In the world of robotics, answering the question “Where am I relative to you?” is surprisingly difficult. This problem, known as visual relative pose estimation, is fundamental for multi-robot systems. Whether it’s a swarm of drones coordinating a light show or warehouse robots avoiding collisions, robots need to know the position and orientation (pose) of their peers. Traditionally, teaching a robot to estimate pose from a camera image requires a heavy dose of supervision. You usually have two expensive options: ...

2025-09 · 10 min · 2028 words
[ImLPR: Image-based LiDAR Place Recognition using Vision Foundation Models 🔗](https://arxiv.org/abs/2505.18364)

Bridging the Gap: How ImLPR Adapts Vision Foundation Models for 3D LiDAR Place Recognition

Imagine a robot navigating a bustling city or a complex underground tunnel. To operate autonomously, it doesn’t just need to know where obstacles are; it needs to know where it is on a global map. GPS is often unreliable or unavailable in these environments (think urban canyons or indoor spaces). This is where LiDAR Place Recognition (LPR) comes in. The robot scans its surroundings with a laser scanner and asks, “Have I seen this pattern of geometry before?” ...

2025-05 · 7 min · 1457 words
[Door(s): Junction State Estimation for Efficient Exploration in Reinforcement Learning 🔗](https://openreview.net/pdf?id=NtnPVwUCAH)

Finding the Keys: How 'Junction States' Unlock Efficient Exploration in RL

Reinforcement Learning (RL) has achieved remarkable feats, from mastering complex strategy games to controlling robotic limbs. However, one bottleneck persistently stifles progress: efficient exploration. In environments where rewards are “sparse”—meaning the agent receives feedback only rarely, perhaps only upon completing a complex task—an agent can spend eons flailing randomly, never stumbling upon the specific sequence of actions required to earn a reward. Imagine you are dropped into a massive, dark maze with a single treasure chest hidden somewhere deep inside. If you wander randomly, you might eventually find it, but it could take a lifetime. However, if you realized that passing through a doorway (a junction) opens up a whole new section of the maze, you would prioritize finding those doors. This structural knowledge is crucial. ...

9 min · 1835 words
[DiWA: Diffusion Policy Adaptation with World Models 🔗](https://arxiv.org/abs/2508.03645)

Dreaming of Success: How Robots Can Fine-Tune Skills Entirely Offline

In the world of robotics, there is a massive gap between “watching” a task and “mastering” it. Imagine you are learning to play tennis. You can watch a professional player (Imitation Learning), and you might pick up the form. But to actually get good, you need to step onto the court and hit the ball thousands of times, adjusting your swing based on where the ball lands (Reinforcement Learning). ...

2025-08 · 8 min · 1594 words
[SimShear: Sim-to-Real Shear-based Tactile Servoing 🔗](https://arxiv.org/abs/2508.20561)

Feeling the Slide: How SimShear Bridges the Gap in Robotic Tactile Sensing

In the world of robotics, vision has long been king. We have taught robots to see obstacles, classify objects, and navigate rooms with impressive accuracy. But when it comes to manipulation—actually grabbing, holding, and moving things—sight isn’t enough. Try tying your shoelaces with numb fingers; even if you watch your hands closely, it’s incredibly difficult. You need the sense of touch. Specifically, you need to feel shear. Shear is the lateral force that stretches your skin when an object slides across your fingertips or when gravity pulls on a heavy object you’re holding. It’s the sensation that tells you a glass is about to slip from your hand before it actually falls. ...

2025-08 · 8 min · 1540 words
[ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation 🔗](https://arxiv.org/abs/2505.09698)

Can LLMs Control Robots? Exploring ManipBench and Low-Level Reasoning

The dream of the “generalist robot”—a machine capable of folding laundry, cooking dinner, and tidying a workshop without explicit reprogramming—has long captivated roboticists. Recently, Vision-Language Models (VLMs) like GPT-4 and Gemini have brought us closer to this reality. These models possess immense “common sense” knowledge about the world. They can look at a messy room and tell you what needs to be cleaned up. However, knowing what to do is very different from knowing how to do it. ...

2025-05 · 7 min · 1325 words
[Distributed Upload and Active Labeling for Resource-Constrained Fleet Learning 🔗](https://openreview.net/pdf?id=M1e2PEMLp2)

Taming the Data Deluge with DUAL—Optimizing Robotic Fleet Learning

Imagine a fleet of hundreds of autonomous vehicles or delivery drones operating in the real world. Every second, these machines capture high-resolution video, LiDAR scans, and telemetry. Collectively, they generate terabytes of data every single day. Ideally, we would upload all this data to the cloud, label it, and use it to train smarter, safer AI models. But in reality, this is impossible. We face two crushing bottlenecks. First, bandwidth is limited. A car operating in a busy city or a remote rural area cannot upload terabytes of raw sensor data over a 4G or 5G connection; the network simply can’t handle it. Second, annotation is expensive. Even if we could upload everything, human annotators (or expensive foundation models) cannot label millions of images a day. We have a limited budget for ground-truth labeling. ...

9 min · 1724 words
[Towards Embodiment Scaling Laws in Robot Locomotion 🔗](https://arxiv.org/abs/2505.05753)

The Body Problem: Can Scaling Robot Morphologies Unlock General Intelligence?

Two millennia ago, the philosopher Heraclitus famously wrote that “no man steps in the same river twice.” In the field of robotics, we face a similar, strictly physical reality: no agent acts with exactly the same body twice. Consider a robot deployed in the real world. Over time, its motors degrade, its joints loosen, and it might even suffer damage to a limb. Even newly manufactured robots of the same model have subtle manufacturing variations. If we want to build truly generalist robots—agents that can operate not just one specific machine, but adapt to any physical form—we face a massive hurdle. Current deep learning success stories, like Large Language Models (LLMs), have thrived by scaling up data and model size. But in robotics, we have a third, largely unexplored dimension: embodiment. ...

2025-05 · 8 min · 1512 words
[RoboChemist: Long-Horizon and Safety-Compliant Robotic Chemical Experimentation 🔗](https://arxiv.org/abs/2509.08820)

RoboChemist: How Dual-Loop AI is Mastering the Chemistry Lab

Imagine a chemistry laboratory. It is a place of precise measurements, volatile reactions, and fragile equipment. Now, imagine a robot trying to navigate this space. Unlike organizing blocks on a table—a classic robotics benchmark—chemistry involves transparent glassware that depth sensors can’t see, liquids that slosh and spill, and safety protocols where a few millimeters of error can lead to a hazardous situation. For years, the dream of an autonomous “Robotic Chemist” has been stifled by these physical realities. Robots are good at repetitive motions, but they struggle with the dynamic, safety-critical reasoning required for experimental science. ...

2025-09 · 8 min · 1632 words
[D-CODA: Diffusion for Coordinated Dual-Arm Data Augmentation 🔗](https://arxiv.org/abs/2505.04860)

Teaching Robots to Coordinate: How Diffusion Models Are Solving Bimanual Manipulation

If you’ve ever tried to carry a heavy moving box or fold a large bedsheet with just one hand, you know the struggle. We humans rely heavily on bimanual manipulation—using two hands in coordination—to interact with the world. For robots to be truly useful in homes and warehouses, they need to master this same skill. However, training robots to coordinate two arms is exponentially harder than training one. You have to manage higher degrees of freedom, ensure the arms don’t crash into each other, and maintain precise coordination to keep objects from falling. Traditional Imitation Learning (IL), where robots learn by mimicking human demonstrations, works well but is data-hungry. Collecting thousands of coordinated, two-arm demonstrations is costly and labor-intensive. ...

2025-05 · 8 min · 1581 words
[UnPose: Uncertainty-Guided Diffusion Priors for Zero-Shot Pose Estimation 🔗](https://arxiv.org/abs/2508.15972)

Solving the Hallucination Problem: How UnPose Brings Uncertainty to Robotic Perception

Imagine a robot entering a kitchen for the first time. It sees a mug on the counter. To pick it up, the robot needs to know exactly where that mug is in 3D space and how it is oriented—a problem known as 6D pose estimation. Historically, this has been a rigid process. Robots relied on exact CAD models (digital blueprints) of every object they might encounter. If you didn’t have the CAD file for that specific “Master Chef” mug, the robot was blind. Recent advances in deep learning attempted to fix this with “category-level” training (teaching a robot what a generic mug looks like) or “model-free” approaches that reconstruct objects on the fly. ...

2025-08 · 8 min · 1676 words
[WoMAP: World Models For Embodied Open-Vocabulary Object Localization 🔗](https://arxiv.org/abs/2506.01600)

How Robots Learn to Search: A Deep Dive into WoMAP and World Models

Imagine you are looking for your keys. You don’t scan every millimeter of the ceiling or the blank wall; you look on the table, check the couch cushions, or look near the door. Your search is active, exploratory, and guided by a “mental model” of where things usually are. Now, imagine asking a robot to “find the banana.” For a robot, this is an incredibly complex task known as Open-Vocabulary Object Localization. The robot must understand what a “banana” is (semantics), navigate a potentially cluttered and unseen environment (exploration), and understand how its physical movements change what it sees (dynamics). ...

2025-06 · 8 min · 1687 words