[AgentWorld: An Interactive Simulation Platform for Scene Construction and Mobile Robotic Manipulation 🔗](https://arxiv.org/abs/2508.07770)

Building AgentWorld: How Procedural Generation and Mobile Teleoperation Are Solving the Data Bottleneck in Robotics

Introduction: The Quest for the Generalist Household Robot

Imagine a robot that can walk into any kitchen, identify the ingredients for a meal, find the necessary cookware, and start cooking—all without having ever seen that specific room before. This is the “Holy Grail” of Embodied AI: a generalist robot capable of performing complex, multi-stage tasks in diverse, unstructured environments. However, there is a massive roadblock standing between us and that future: Data. ...

2025-08 · 8 min · 1697 words
[LocoTouch: Learning Dynamic Quadrupedal Transport with Tactile Sensing 🔗](https://arxiv.org/abs/2505.23175)

Giving Robots Skin: How Tactile Sensing Enables Quadrupedal Waiters

Imagine you are a waiter carrying a tray of drinks through a crowded restaurant. As you walk, you don’t just look at the floor to avoid tripping; you feel the weight of the tray shifting in your hands. If a glass starts to slide, your skin senses the friction change, and you instinctively adjust your gait—perhaps slowing down or stiffening your arms—to prevent a spill. For humans, this integration of locomotion (walking) and tactile feedback (feeling) is second nature. For robots, it is an immense challenge. ...

2025-05 · 8 min · 1660 words
[Uncertainty-aware Accurate Elevation Modeling for Off-road Navigation via Neural Processes 🔗](https://arxiv.org/abs/2508.03890)

Don't Fall into the Ditch: How Neural Processes Are Solving Off-Road Terrain Modeling

If you have ever driven a vehicle off-road, you know that the terrain is rarely forgiving. For a human driver, spotting a sudden dip, a ditch, or a cliff edge requires constant vigilance. For an autonomous robot, this challenge is magnified ten-fold. In the world of autonomous driving, “negative obstacles”—like ditches or craters—are notoriously difficult to detect. From a distance, a narrow ditch often looks like a continuous flat surface to sensors. If a robot underestimates a slope or misses a ditch, the consequences range from getting stuck to a catastrophic rollover. ...

2025-08 · 9 min · 1761 words
[Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-Top Manipulation 🔗](https://arxiv.org/abs/2509.00361)

Dreaming Actions: How GVF-TAPE Enables Robots to Plan Visually Without Action Labels

When you reach for a coffee mug, you don’t explicitly calculate the inverse kinematics of your elbow and shoulder joints. Instead, you likely visualize the outcome—your hand grasping the handle—and your body intuitively understands how to align your arm to match that mental image. There is a strong sensory coupling between our vision and our body awareness. In robotics, however, this process is usually far more rigid. Traditional robotic manipulation relies heavily on action-labeled data. This means humans must painstakingly teleoperate robots to demonstrate tasks, recording every joint angle and velocity. While effective, this process is expensive, slow, and hard to scale. If we want generalist robots, we cannot manually teach them every possible movement. ...

2025-09 · 8 min · 1618 words
[Text2Touch: Tactile In-Hand Manipulation with LLM-Designed Reward Functions 🔗](https://arxiv.org/abs/2509.07445)

Can LLMs Feel? How Text2Touch Automates Reward Design for Tactile Dexterity

Imagine you are holding a smooth, spherical object like an orange in your hand. Now, close your eyes and rotate it. Despite not seeing the object, you can manipulate it with ease. You rely entirely on the sensation of touch—the friction against your skin, the pressure on your fingertips, and the subtle shifts in weight. For humans, this is intuitive. For robots, it is an engineering nightmare. Achieving “dexterous in-hand manipulation”—like rotating an object without dropping it—requires complex control policies. Historically, the most difficult part of training a robot to do this isn’t building the hand; it’s defining the reward function. In Reinforcement Learning (RL), the reward function is the math that tells the robot “Good job” or “That was bad.” ...

2025-09 · 9 min · 1709 words
[Humanoid Policy ~ Human Policy 🔗](https://arxiv.org/abs/2503.13441)

Why Train Robots When You Can Train Humans? Scaling Humanoid Learning with Egocentric Video

The dream of general-purpose humanoid robots is inching closer to reality. We see the hardware improving rapidly—robots that can walk, carry boxes, and withstand shoves. But the “brain” of the robot, the policy that tells it how to dexterously manipulate objects like a cup or a screwdriver, remains a bottleneck. The standard approach to teaching robots involves Imitation Learning (IL). You teleoperate a robot (control it remotely), record the data, and train a neural network to mimic those movements. It works, but it is painfully slow, expensive, and difficult to scale. You need a physical robot, a skilled operator, and endless hours of tedious repetition. ...

2025-03 · 10 min · 2049 words
[CUPID: Curating Data your Robot Loves with Influence Functions 🔗](https://arxiv.org/abs/2506.19121)

Why Your Robot Fails: Fixing Imitation Learning with Causal Data Curation

If you have been following the explosion of Large Language Models (LLMs), you are likely familiar with the “scaling laws” hypothesis: more data generally leads to better performance. However, as models grow, a nuanced corollary has emerged—data quality matters just as much as, if not more than, data quantity. In the world of robotics, this lesson is proving to be even more critical, and significantly harder to implement. In robot Imitation Learning (IL), we train policies to copy human demonstrations. But not all demonstrations are created equal. Some are messy, some rely on strategies that don’t generalize, and some contain “spurious correlations” (like a robot learning to grab an object only when the table is white). ...

2025-06 · 8 min · 1597 words
[MoTo: A Zero-shot Plug-in Interaction-aware Navigation for General Mobile Manipulation 🔗](https://arxiv.org/abs/2509.01658)

MoTo: Bridging the Gap Between Navigation and Manipulation with Zero-Shot Learning

Imagine you are a robot butler. Your human asks you to “get a bottle of water from the fridge.” You have a map of the house, and you know how to open a door. You successfully navigate to the kitchen and park in front of the fridge. But there is a problem: you parked six inches too far to the left. Your robotic arm, despite being highly sophisticated, cannot reach the handle at the correct angle to pull the door open. You are stuck. To fix this, you have to move your entire base, but standard navigation systems don’t understand how to position the base to make the arm’s job easier. ...

2025-09 · 10 min · 1934 words
[Constrained Style Learning from Imperfect Demonstrations under Task Optimality 🔗](https://arxiv.org/abs/2507.09371)

Robots with Swag: Learning Style Without Failing the Mission

Reinforcement Learning (RL) has revolutionized how robots move. We can now train quadrupedal robots to run over rough terrain and robotic arms to reach targets with impressive reliability. However, there is often a stark difference between a robot that successfully completes a task and one that looks natural doing it. Pure RL policies tend to result in jittery, mechanical, or “weird” behaviors because the reward function strictly prioritizes efficiency—minimizing energy or maximizing speed—ignoring the nuances of biological motion. To fix this, researchers often turn to Imitation Learning, feeding the robot motion capture data from humans or animals. ...

2025-07 · 7 min · 1482 words
[SLAC: Simulation-Pretrained Latent Action Space for Whole-Body Real-World RL 🔗](https://arxiv.org/abs/2506.04147)

Bridging the Reality Gap: How SLAC Enables Safe, One-Hour Real-World Robot Learning

Imagine trying to teach a robot to wipe a whiteboard. To a human, this is trivial. To a robot, it is a nightmare of physics and control. The robot must navigate to the board without crashing, lift its arm, apply just enough pressure to erase the marker without punching through the wall, and coordinate its wheels and joints simultaneously. This is a “high-degree-of-freedom” (high-DoF) problem involving contact-rich manipulation. Traditionally, roboticists have had two main ways to solve this: ...

2025-06 · 8 min · 1533 words
[GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation 🔗](https://arxiv.org/abs/2505.13441)

How Robots Learn Context: Introducing GraspMolmo and the PRISM Dataset

Imagine handing a kitchen knife to a friend. You instinctively grasp the blade or the spine carefully, offering the handle to them. Now, imagine you are about to chop carrots. You firmly grip the handle. Finally, imagine you are washing the knife; you might hold it by the very tip of the handle so the soapy sponge can reach the blade. It is the same object—a knife—but the way you hold it changes drastically based on your intent. ...

2025-05 · 10 min · 1919 words
[Poke and Strike: Learning Task-Informed Exploration Policies 🔗](https://arxiv.org/abs/2509.00178)

Poke and Strike: How Robots Learn to Explore Before They Act

Imagine you are playing air hockey. You step up to a table you’ve never used before. Is the puck heavy or light? Is the table slick or sticky? Before you take your winning shot, you instinctively tap the puck a few times—a gentle “poke”—to get a feel for how it slides. Only then do you commit to the high-speed “strike.” Humans perform this kind of active exploration naturally. We interact with objects to uncover their hidden physical properties before attempting a difficult task. For robots, however, this is an immense challenge. Traditional robotic control often assumes we know the mass, friction, and center of mass of an object beforehand. If those parameters are wrong, the robot fails. ...

2025-09 · 8 min · 1504 words
[Robust Dexterous Grasping of General Objects 🔗](https://openreview.net/pdf?id=SNvUSjVm6C)

Mastering Dexterity: How Robots Learn Robust Grasping from Scratch

Imagine you are trying to pick up a wet bar of soap. As your fingers close around it, the soap slips slightly. Instantly, without looking or thinking consciously, your fingers adjust their pressure and position to secure the grip. This micro-adjustment is a hallmark of human dexterity. Now, imagine a robot trying to do the same. Most robotic systems plan a grasp based on a static snapshot, close their “eyes” (sensors), and execute the motion blindly. If the object moves, or if the robot bumps into something, the grasp fails. ...

9 min · 1766 words
[ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving 🔗](https://arxiv.org/abs/2505.20024)

Beyond Imitation: How ReasonPlan Gives Autonomous Vehicles a 'Chain of Thought'

Imagine you are driving down a narrow street. You see a delivery truck parked on the right and a ball rolling into the street from behind it. You don’t just “detect a ball”; you immediately simulate a future where a child might be chasing that ball, and you instinctively prepare to brake. This ability to reason about the scene and anticipate the future is second nature to humans. However, for autonomous vehicles (AVs), this is incredibly difficult. Most modern AV systems rely on End-to-End (E2E) Imitation Learning. They look at millions of hours of human driving and try to copy the steering and pedal inputs given similar visual inputs. While this works for driving straight down a highway, it often fails in “closed-loop” scenarios—complex, interactive environments where the car’s actions change the state of the world, and where one small mistake can compound into a crash. ...

2025-05 · 10 min · 2115 words
[FastUMI: A Scalable and Hardware-Independent Universal Manipulation Interface with Dataset 🔗](https://arxiv.org/abs/2409.19499)

Breaking the Hardware Barrier: How FastUMI Democratizes Robot Data Collection

If you have ever tried to train a robot to perform a simple household task, like folding a towel or opening a jar, you have likely run into the “Data Problem.” Humans can perform these tasks effortlessly, but teaching a robot requires thousands of examples. This is where Imitation Learning (IL) comes in—showing the robot what to do so it can copy you. However, collecting high-quality demonstration data is notoriously difficult. Teleoperation (controlling a robot with a joystick or VR rig) is slow, expensive, and often unintuitive. Recent innovations like the Universal Manipulation Interface (UMI) attempted to solve this by allowing humans to collect data with a handheld gripper. But even UMI had a flaw: it was “picky.” It required specific grippers, rigid hardware setups, and a finicky software pipeline that often broke when the camera view was blocked. ...

2024-09 · 10 min · 1959 words
[DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control 🔗](https://arxiv.org/abs/2502.05855)

DexVLA: How a 1-Billion Parameter Diffusion Expert Is Revolutionizing Robot Control

We are currently living through a golden age of Artificial Intelligence, largely driven by Large Language Models (LLMs) and Vision-Language Models (VLMs). These models can write poetry, debug code, and analyze complex images with startling accuracy. However, when we try to transfer this intelligence into a physical robot, we hit a wall. The “brain” is brilliant, but the “body” is clumsy. The dream of robotics is a generalist agent—a robot that can tidy your kitchen, fold your laundry, and sort trash, regardless of the environment or the specific robot hardware being used. The current state-of-the-art approach is the Vision-Language-Action (VLA) model. These models attempt to ground the vast knowledge of the internet (via VLMs) into robotic actions. ...

2025-02 · 9 min · 1883 words
[ZipMPC: Compressed Context-Dependent MPC Cost via Imitation Learning 🔗](https://arxiv.org/abs/2507.13088)

ZipMPC: Teaching Short-Sighted Robots to Drive with Long-Term Foresight

In the world of robotics and autonomous systems, there is a constant tug-of-war between foresight and reaction speed. Imagine driving a race car at high speed. To drive optimally, you need to look far ahead (foresight), anticipating curves that are hundreds of meters away. However, you also need to make decisions instantly (reaction speed). If you spend too much time calculating the perfect line for the next ten curves, you’ll crash into the first wall before you’ve even turned the wheel. ...

2025-07 · 8 min · 1644 words
[LLM-Guided Probabilistic Program Induction for POMDP Model Estimation 🔗](https://arxiv.org/abs/2505.02216)

Can LLMs Code Their Own World Models? A Deep Dive into POMDP Coder

Imagine a robot searching for an apple in a cluttered kitchen. It scans the room but doesn’t see the fruit. A human would instinctively check the table or the counter, knowing that apples don’t hover in mid-air or hide inside the toaster. The robot, however, faces a massive challenge: decision-making under uncertainty. It doesn’t know where the apple is (partial observability), and it needs a model of how the world works to search efficiently. ...

2025-05 · 8 min · 1621 words
[Long Range Navigator (LRN): Extending robot planning horizons beyond metric maps 🔗](https://arxiv.org/abs/2504.13149)

Beyond the Map: How Long Range Navigator (LRN) Gives Robots 'Farsight'

Imagine you are hiking through a dense, unfamiliar forest. Your goal is a campsite several kilometers away. You don’t have a detailed topographic map of every tree and rock between you and the destination. Instead, you look into the distance. You see a break in the tree line to your left, a steep cliff to your right, and a dense thicket straight ahead. Even though the campsite is technically straight ahead, you instinctively head toward the clearing on the left. ...

2025-04 · 8 min · 1528 words
[RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models 🔗](https://arxiv.org/abs/2506.17811)

RoboMonkey: Bringing Test-Time Scaling Laws to Robotics

Imagine you are trying to solve a complex math problem. Do you simply blurt out the first number that comes to your head? Probably not. You likely scribble down a few potential approaches, double-check your logic, and verify your answer before committing to it. Humans naturally allocate more “compute” (thinking time) to harder problems. In the world of Large Language Models (LLMs), we have seen this principle formalized as “inference-time scaling.” Techniques like Chain-of-Thought reasoning or generating multiple code snippets and verifying them have revolutionized how AI solves complex logical tasks. ...

2025-06 · 8 min · 1656 words