[BEVCALIB: LiDAR-Camera Calibration via Geometry-Guided Bird's-Eye View Representations 🔗](https://arxiv.org/abs/2506.02587)

Bridging the Gap: How BEVCALIB Uses Bird's-Eye View for Precise Sensor Calibration

Imagine you are driving a car. Your eyes (cameras) see the red stop sign ahead, and your brain estimates the distance. Now, imagine a sophisticated autonomous vehicle. It doesn’t just rely on cameras; it likely uses LiDAR (Light Detection and Ranging) to measure precise depth. Ideally, the camera and the LiDAR should agree perfectly on where that stop sign is located in 3D space. But what happens if they don’t? ...

2025-06 · 9 min · 1892 words
[Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models 🔗](https://arxiv.org/abs/2505.07815)

How Robots Can "Imagine" to Explore: Breaking Free from Random Actions

How does a human learn to interact with a new environment? If you place a toddler in front of a table with blocks and cups, they don’t just randomly twitch their muscles until something interesting happens. They look at the objects, form a mini-goal (e.g., “I want to put the blue block inside the cup”), and then try to execute it. If it doesn’t fit, they learn. If it works, they remember the result and try something new. ...

2025-05 · 7 min · 1465 words
[Distilling On-device Language Models for Robot Planning with Minimal Human Intervention 🔗](https://arxiv.org/abs/2506.17486)

Cutting the Cord: How PRISM Brings GPT-4 Level Planning to On-Device Robots

Imagine a robot navigating a disaster zone. It needs to find survivors, assess structural damage, and report back. To do this effectively, it needs to understand complex natural language instructions and reason about its environment in real-time. For the last few years, the standard solution has been to hook the robot up to a Large Language Model (LLM) like GPT-4. The robot sends a picture or a map to the cloud, the LLM processes it, and sends back a plan. In a perfect world with high-speed internet, this works beautifully. ...

2025-06 · 9 min · 1764 words
[Beyond Constant Parameters: Hyper Prediction Models and HyperMPC 🔗](https://arxiv.org/abs/2508.06181)

Dynamic Parameters for Dynamic Robots: How HyperPM Revolutionizes Model Predictive Control

In the world of robotics, there is a constant tug-of-war between speed and accuracy. Nowhere is this more apparent than in Model Predictive Control (MPC). MPC is the gold standard for controlling complex robots—from agile drones to autonomous race cars—because it doesn’t just react to the present; it looks into the future, plans a sequence of moves, and executes the best one. But to look into the future, MPC needs a crystal ball: a dynamics model. ...
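
The receding-horizon loop the excerpt describes fits in a few lines. The sketch below is a toy illustration, not the paper's HyperPM/HyperMPC method: a 1-D point mass, a hand-written dynamics model, and a random-shooting planner. Every function name and parameter here is invented for the example.

```python
import numpy as np

def dynamics(state, action, dt=0.1):
    """Toy point-mass model: state = [position, velocity], action = acceleration."""
    pos, vel = state
    return np.array([pos + vel * dt, vel + action * dt])

def mpc_step(state, goal, horizon=20, n_candidates=256, rng=np.random.default_rng(0)):
    """Random-shooting MPC: simulate candidate action sequences with the model,
    keep the cheapest one, and return only its first action."""
    best_cost, best_first_action = np.inf, 0.0
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=horizon)
        s, cost = state, 0.0
        for a in actions:
            s = dynamics(s, a)
            cost += (s[0] - goal) ** 2 + 0.01 * a ** 2  # tracking error + effort penalty
        if cost < best_cost:
            best_cost, best_first_action = cost, actions[0]
    return best_first_action

# Receding horizon: plan over the full horizon, execute one step, then re-plan.
state, goal = np.array([0.0, 0.0]), 1.0
for _ in range(50):
    state = dynamics(state, mpc_step(state, goal))
print("final position:", round(state[0], 3))
```

The article's premise, per its title, is that the parameters inside a dynamics model like this should not be treated as constant over the prediction horizon.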

2025-08 · 11 min · 2155 words
[Dynamics-Compliant Trajectory Diffusion for Super-Nominal Payload Manipulation 🔗](https://arxiv.org/abs/2508.21375)

Breaking the Limit: How Diffusion Models Enable Robots to Lift Super-Nominal Payloads

In the world of industrial robotics, specifications are often treated as gospel. If a robot manufacturer states that a robotic arm has a maximum payload capacity of 3 kilograms, engineers typically treat 3.01 kilograms as a hard “do not cross” line. But here is a secret: those numbers are conservative. Extremely conservative. Manufacturer ratings are typically derived from “worst-case” scenarios—configurations where the robot arm is fully extended, exerting maximum leverage on its joints. However, in vast regions of the robot’s workspace, the mechanical structure is capable of handling significantly more weight. The hardware is over-provisioned to ensure safety, but this leads to inefficiency. If you need to move a 35kg object, you might be forced to buy a massive, expensive 50kg-rated robot, even though a smaller, cheaper 30kg-rated robot could physically handle the task if it moved intelligently. ...
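
To make the leverage argument concrete, here is a back-of-the-envelope gravitational-torque estimate with invented numbers (not figures from the paper), showing how strongly the effective payload limit depends on arm configuration:

```latex
% Gravitational shoulder torque from a payload of mass m at horizontal reach r: tau ~ m g r.
\tau_{\text{extended}} \approx 3\,\mathrm{kg} \times 9.81\,\mathrm{m/s^2} \times 0.8\,\mathrm{m} \approx 23.5\,\mathrm{N\,m}
% The same torque budget, with the payload tucked in at r = 0.24 m, supports roughly
m_{\text{tucked}} \approx \frac{23.5\,\mathrm{N\,m}}{9.81\,\mathrm{m/s^2} \times 0.24\,\mathrm{m}} \approx 10\,\mathrm{kg}
```

In other words, the "moved intelligently" claim in the excerpt is about routing trajectories through low-leverage configurations where the nominal rating is not the binding constraint.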

2025-08 · 8 min · 1603 words
[Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids 🔗](https://arxiv.org/abs/2502.20396)

Cracking the Code: A Practical Recipe for Sim-to-Real Bimanual Dexterous Manipulation

We are currently witnessing a golden age of humanoid robotics. We see robots running, jumping, and performing backflips with impressive agility. Yet, there remains a glaring gap in their capabilities: manipulation. While a robot might be able to navigate a warehouse, asking it to perform a contact-rich task—like picking up a delicate object, reorienting it in its hand, or handing it over to another hand—remains incredibly difficult. The complexity stems from the hardware itself. Humanoid hands are sophisticated, multi-fingered mechanisms with high degrees of freedom (DoF). Controlling them requires precise coordination. Traditional approaches often rely on Imitation Learning (IL), where robots mimic human demonstrations. While effective, IL is data-hungry, expensive, and labor-intensive. You need thousands of hours of teleoperation data to cover every possible edge case. ...

2025-02 · 8 min · 1693 words
[ParticleFormer: A 3D Point Cloud World Model for Multi-Object, Multi-Material Robotic Manipulation 🔗](https://arxiv.org/abs/2506.23126)

ParticleFormer: Teaching Robots Physics with Transformers and Point Clouds

Imagine you are trying to sweep a pile of sand onto a dustpan using a brush. As you move the brush, you intuitively predict how the sand particles will flow, cascade, and settle. You don’t need to calculate the trajectory of every single grain consciously; you have a “world model”—an internal physics engine—that helps you plan your actions to achieve the goal. For robots, developing this kind of intuition is incredibly difficult, especially when dealing with mixed materials. Pushing a rigid box is one thing; manipulating a soft rope that is sweeping granular material (like sand) is entirely different. The robot needs to understand how the rope deforms and how that deformation transfers force to the sand. ...

2025-06 · 8 min · 1649 words
[EndoVLA: Dual-Phase Vision-Language-Action for Precise Autonomous Tracking in Endoscopy 🔗](https://openreview.net/pdf?id=7XyO9Y1hI1)

Can AI Surgeons See and Think? A Deep Dive into EndoVLA

Imagine trying to navigate a slippery, twisting tunnel while looking through a tiny camera. Now, imagine you have to locate a specific lesion, track its movement as the tunnel breathes and deforms, and precisely manipulate a tool to treat it. This is the daily reality of endoscopic surgery. It is a procedure that demands immense cognitive load, steady hands, and years of training. For years, roboticists have tried to automate parts of this process to relieve the burden on surgeons. However, the environment inside the human body is notoriously difficult for robots. It is unstructured, dynamic, and wet. Traditional automation methods are often “brittle”—they rely on complex mathematical models that break down the moment the tissue deforms unexpectedly or a reflection obscures the camera. ...

9 min · 1785 words
[TypeTele: Releasing Dexterity in Teleoperation by Dexterous Manipulation Types 🔗](https://arxiv.org/abs/2507.01857)

Beyond Mimicry: How TypeTele Unlocks True Robotic Dexterity via Manipulation Types

The dream of general-purpose robotics often centers on the hands. If we can build a robot hand with the same dexterity as a human hand, surely we can teleoperate it to do anything a human can do, right? This logic has driven the field of dexterous teleoperation for years. The standard approach is straightforward: capture the motion of a human hand and map it, joint-for-joint, to a robot hand. This process is known as retargeting. ...

2025-07 · 8 min · 1616 words
[LodeStar: Long-horizon Dexterity via Synthetic Data Augmentation from Human Demonstrations 🔗](https://arxiv.org/abs/2508.17547)

LODESTAR: Teaching Robots Long-Horizon Dexterity with Digital Twins and Residual RL

Imagine a task as simple as watering a plant. For a human, this is trivial: you pick up the spray bottle, aim it, squeeze the trigger, and put it back. But for a robot, this is a nightmare of complexity. To achieve this, a robot must possess dexterity—the ability to manipulate objects with fingers, not just a simple gripper—and long-horizon planning, the ability to string together a sequence of actions where a small mistake in step one causes a catastrophic failure in step five. ...

2025-08 · 8 min · 1546 words
[RICL: Adding In-Context Adaptability to Pre-Trained Vision-Language-Action Models 🔗](https://arxiv.org/abs/2508.02062)

Teaching Robots to Learn Like LLMs: A Deep Dive into RICL

Imagine you are using a Large Language Model (LLM) like GPT-4, and you want it to write a poem in a very specific, made-up dialect. You wouldn’t need to retrain the entire neural network to do this. Instead, you would simply provide a few examples of this dialect in the prompt—the “context”—and the model would adapt instantly. This capability is known as In-Context Learning (ICL). ...
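
For readers who have not seen it spelled out, plain in-context learning on the language side looks like the toy sketch below (illustrative only; per the title, RICL's contribution is bringing this prompt-time adaptation to a pre-trained vision-language-action model, with robot demonstrations rather than text pairs as the context):

```python
# In-context learning in its simplest form: the model's weights stay frozen and the
# "learning" happens entirely in the prompt. The dialect pairs below are invented.
examples = [
    ("hello friend", "ello-hay iend-fray"),
    ("good morning", "ood-gay orning-may"),
]
query = "nice robot"

prompt = "Translate into the made-up dialect.\n"
for source, target in examples:
    prompt += f"Input: {source}\nOutput: {target}\n"
prompt += f"Input: {query}\nOutput:"

print(prompt)  # send this to any frozen LLM; no gradient update is involved
```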

2025-08 · 10 min · 2023 words
[D-Cubed: Latent Diffusion Trajectory Optimisation for Dexterous Deformable Manipulation 🔗](https://arxiv.org/abs/2403.12861)

Mastering the Squishy: How D-Cubed Uses Diffusion to Teach Robots Dexterity

Robotics has achieved remarkable feats in industrial settings, particularly with rigid objects. We have robots that can weld car chassis with sub-millimeter precision or assemble electronics at lightning speeds. However, move that robot into a kitchen and ask it to fold a dumpling or wrap a piece of dough, and it will likely struggle. The challenge lies in deformable object manipulation. Unlike a rigid box, a piece of dough or cloth has infinite degrees of freedom. Its shape changes based on contact, gravity, and material properties. When you combine this with a dexterous robot hand (which has high dimensionality itself), the search space for finding a successful movement trajectory becomes computationally explosive. ...

2024-03 · 9 min · 1832 words
[Deep Reactive Policy: Learning Reactive Manipulator Motion Planning for Dynamic Environments 🔗](https://arxiv.org/abs/2509.06953)

How Deep Reactive Policy Enables Robots to Move Safely in Chaos

Imagine a robot in a kitchen. It’s not a pre-programmed factory arm welding the same car door every 30 seconds; it’s a household helper. You ask it to grab a mug from the drying rack. As it moves, you suddenly reach across its path to grab the salt shaker. For a traditional robot, this is a nightmare scenario. Most motion planners are too slow to re-calculate a path in the split second before your hand intersects the robot’s arm. The result? A frozen robot, or worse, a collision. ...

2025-09 · 8 min · 1626 words
[Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering 🔗](https://arxiv.org/abs/2507.12846)

Building a Mind Palace: How Robots Can Use Long-Term Memory to Answer Complex Questions

Imagine you are at the grocery store, and you suddenly wonder, “Do I need to buy milk?” To answer this, you don’t just look at the shelves in front of you. You mentally simulate a walk through your kitchen, recalling the last time you opened the fridge, perhaps remembering that you finished the carton during breakfast yesterday. You combine your current perception (being at the store) with long-term episodic memory (yesterday’s breakfast) to make a decision. ...

2025-07 · 9 min · 1818 words
[Motion-Blender Gaussian Splatting for Dynamic Scene Reconstruction 🔗](https://openreview.net/pdf?id=4Po2mqLjrQ)

Unlocking Controllable 3D Video with Motion-Blender Gaussian Splatting

Imagine watching a video of a cat stretching or a person waving their hand. Now, imagine you could reach into that video, grab the cat’s paw, and move it to a different position, or re-animate the hand to wave in a completely new pattern. This is the dream of dynamic scene reconstruction: transforming a flat video into a fully interactive, 3D digital twin. In recent years, a technique called 3D Gaussian Splatting (3DGS) has revolutionized how we render static 3D scenes. It’s fast, high-quality, and photorealistic. However, extending this to dynamic scenes (scenes that move) has hit a roadblock. Most current methods treat motion as a “black box.” They use neural networks to predict how pixels move, which looks great on playback but is impossible to control. You can replay the video in 3D, but you can’t change the movement. ...

9 min · 1852 words
[TReF-6: Inferring Task-Relevant Frames from a Single Demonstration for One-Shot Skill Generalization 🔗](https://arxiv.org/abs/2509.00310)

One-Shot Wonder: How TReF-6 Teaches Robots to Generalize from a Single Demo

Imagine teaching a robot to open a kitchen cabinet. You grab the robot’s arm, guide it to the handle, and pull the door open in a distinct arc. The robot records this motion. Great. But what happens if you ask the robot to open a different cabinet, one that is slightly larger, or perhaps positioned at an angle? For humans, this is trivial. We understand the underlying mechanics: “I need to rotate the door around its hinge.” For robots, however, this is a notorious stumbling block. Most robotic learning algorithms memorize the specific coordinates of your demonstration. If the environment changes, the robot tries to execute the exact same path in absolute space—often leading to it grasping thin air or colliding with the door. ...
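
The failure mode described here is easy to reproduce in a few lines. The toy sketch below (not TReF-6 itself; all coordinates are invented) replays a demonstrated waypoint either in absolute world coordinates or relative to a frame anchored at the cabinet handle; only the latter survives moving the cabinet:

```python
import numpy as np

# Demonstration: the handle sat at [0.60, 0.20] and the gripper finished 10 cm in front of it.
demo_handle = np.array([0.60, 0.20])
demo_waypoint = np.array([0.50, 0.20])

# At test time the cabinet, and therefore its handle, has moved.
new_handle = np.array([0.75, 0.35])

# Absolute replay: repeat the memorized world coordinates -- the gripper grasps thin air.
absolute_replay = demo_waypoint

# Task-frame replay: store the waypoint as an offset in the handle's frame and
# re-anchor it at the new handle pose (translation only, to keep the example short).
offset_in_handle_frame = demo_waypoint - demo_handle
relative_replay = new_handle + offset_in_handle_frame

print("absolute replay:", absolute_replay)   # [0.5 0.2] -- still at the old cabinet
print("relative replay:", relative_replay)   # [0.65 0.35] -- 10 cm in front of the new handle
```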

2025-09 · 9 min · 1750 words
[Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation 🔗](https://openreview.net/pdf?id=3p7rTnLJM8)

How Lucid-XR Turns VR Headsets into Data Factories for Robots

If you look at the history of filmmaking, there is a clear trajectory from practical effects—building physical sets and animatronics—to digital effects (CGI). Filmmakers made this switch because the digital world offers infinite control and scalability. Robotics is currently facing a similar transition, but the stakes are higher than just box office numbers. To train a general-purpose robot, we need data—massive amounts of it. Specifically, we need data showing robots successfully manipulating objects in the real world. The traditional way to get this is through teleoperation, where a human controls a robot to perform a task while recording the data. However, this is slow, expensive, and hard to scale. You need a physical robot, a physical set, and a human operator in the same room. ...

8 min · 1535 words
[DREAMGEN: Unlocking Generalization in Robot Learning through Video World Models 🔗](https://arxiv.org/abs/2505.12705)

Robots That Dream: How Synthetic Video Unlocks Generalization in Physical AI

In the field of robotics, data is the scarcest resource. While Large Language Models (LLMs) have consumed nearly the entire internet to learn how to write code and poetry, robots are stuck in a much slower loop: human teleoperation. To teach a robot to fold a shirt or pour coffee, a human operator usually has to manually guide the robot through the motion hundreds or thousands of times. This reliance on manual data collection creates a massive bottleneck. If you train a robot to pick up a red apple in a lab, it often fails to pick up a green apple in a kitchen. To fix this, you traditionally need to collect more data in the kitchen. This lack of generalization is the “Sim2Real” and “Real2Real” gap that has plagued the field for decades. ...

2025-05 · 9 min · 1767 words
[GENNAV: Polygon Mask Generation for Generalized Referring Navigable Regions 🔗](https://arxiv.org/abs/2508.21102)

Driving with Language: How GENNAV Solves the "Park Over There" Problem

Imagine you are in a taxi. You tell the driver, “Please park to the left of that red car.” The driver looks around, sees a blue truck and a white sedan, but no red car. The driver turns to you and says, “There is no red car.” This interaction seems trivial for humans. We possess an intuitive grasp of language, spatial relationships, and object permanence. However, for an autonomous vehicle (AV), this is a monumental challenge. Most current AI systems operate under the assumption that if you give a command, the target object must be there. If you tell a standard vision-language model to “find the red car,” and there isn’t one, it will often hallucinate—desperately selecting the closest match (like the red fire hydrant or the maroon truck) just to satisfy the request. ...

2025-08 · 9 min · 1874 words
[NeuralSVCD for Efficient Swept Volume Collision Detection 🔗](https://arxiv.org/abs/2509.00499)

Solving the Tunneling Problem: How NeuralSVCD Makes Robot Motion Planning Safer and Faster

Imagine a robotic arm moving rapidly on a crowded factory floor. It needs to pick up a part from a bin and place it on a conveyor belt without hitting the bin walls, the conveyor, or itself. To plan this motion, the robot relies on a collision checker. Traditionally, motion planners work by sampling the robot’s trajectory at discrete points in time—like a flipbook. They check: “Is the robot hitting anything at time \(t=0\)? How about \(t=1\)?” If both are clear, the planner assumes the path is safe. But what happens at \(t=0.5\)? ...
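
The "flipbook" failure, often called tunneling, is easy to show with a toy example (not the NeuralSVCD method; all numbers are invented). A straight-line motion passes through a small obstacle, a coarse discrete-time check declares the path clear, and a denser check, or a swept-volume test over the whole motion, catches the collision:

```python
import numpy as np

OBSTACLE_CENTER, OBSTACLE_RADIUS = np.array([0.33]), 0.05  # small obstacle on the path

def in_collision(position):
    """Point-vs-sphere check for an illustrative 1-D 'robot'."""
    return np.linalg.norm(position - OBSTACLE_CENTER) < OBSTACLE_RADIUS

start, goal = np.array([0.0]), np.array([1.0])

# Discrete-time ("flipbook") checking: sample the motion at a few instants and hope
# nothing slips between the samples. The coarse check misses the obstacle entirely.
for n_samples in (3, 101):
    times = np.linspace(0.0, 1.0, n_samples)
    hit = any(in_collision(start + t * (goal - start)) for t in times)
    print(f"{n_samples:3d} samples -> collision detected: {hit}")

# Swept-volume view: the interval of space swept by the robot between start and goal
# overlaps the obstacle, so the path is rejected no matter how it is sampled.
swept_lo, swept_hi = min(start[0], goal[0]), max(start[0], goal[0])
overlaps = OBSTACLE_CENTER[0] + OBSTACLE_RADIUS > swept_lo and OBSTACLE_CENTER[0] - OBSTACLE_RADIUS < swept_hi
print("swept-volume check -> collision detected:", overlaps)
```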

2025-09 · 8 min · 1671 words