[OPAL: Visibility-aware LiDAR-to-OpenStreetMap Place Recognition via Adaptive Radial Fusion 🔗](https://arxiv.org/abs/2504.19258)

OPAL: How to Localize Self-Driving Cars Using Free Maps and Deep Learning

Imagine you are driving an autonomous vehicle through a dense urban center—perhaps downtown Manhattan or a narrow European street. Suddenly, the skyscrapers block your GPS signal. The blue dot on your navigation screen freezes or drifts aimlessly. For a human driver, this is an annoyance; for a self-driving car, it is a critical failure. To navigate safely without GNSS (Global Navigation Satellite Systems), a robot must answer the question: “Where am I?” based solely on what it sees. This is known as Place Recognition. Typically, this involves matching the car’s current sensor view (like a LiDAR scan) against a pre-built database. ...

2025-04 · 9 min · 1875 words
[Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration 🔗](https://arxiv.org/abs/2504.12609)

One Video is All You Need: Teaching Dexterous Robots with Human2Sim2Robot

Imagine you want to teach a robot how to pour a glass of water or place a dish in a rack. In an ideal world, you would simply show the robot how to do it once—perhaps by performing the task yourself—and the robot would immediately understand and replicate the skill. In reality, teaching robots “dexterous manipulation” (using multi-fingered hands to handle objects) is notoriously difficult. Traditional methods like Imitation Learning (IL) often require hundreds of demonstrations to learn a robust policy. Furthermore, capturing high-quality data of human hand motion typically involves expensive wearable sensors or complex teleoperation rigs. ...

2025-04 · 10 min · 1931 words
[Uncertainty-aware Latent Safety Filters for Avoiding Out-of-Distribution Failures 🔗](https://arxiv.org/abs/2505.00779)

When Robots Hallucinate: Making AI Safe in an Uncertain World

Imagine you are playing a high-stakes game of Jenga. You carefully tap a block, analyzing how the tower wobbles. You predict that if you pull it slightly to the left, the tower remains stable. If you pull it to the right, it crashes. Your brain is running a “world model”—simulating the physics of the tower to keep you safe from losing. ...

2025-05 · 9 min · 1731 words
[MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence 🔗](https://arxiv.org/abs/2508.13534)

Teach Once, Do Anywhere: How MimicFunc Enables Robots to Master Tools from One Video

Imagine you are teaching a friend how to scoop beans. You pick up a silver spoon, scoop the beans, and dump them into a bowl. Now, you hand your friend a large plastic ladle. Without hesitation, your friend adjusts their grip, accounts for the ladle’s larger size, and performs the exact same scooping action. They understood the function of the action, not just the specific geometry of the spoon. For robots, this simple act of transfer is incredibly difficult. Traditional robotic learning often relies on rote memorization of specific objects. If you teach a robot to pour with a red mug, it is likely to fail when handed a glass measuring cup. The shapes, sizes, and grasp points are mathematically different, even if the “function” (pouring) is identical. ...

2025-08 · 8 min · 1661 words
[CLONE: Closed-Loop Whole-Body Humanoid Teleoperation for Long-Horizon Tasks 🔗](https://arxiv.org/abs/2506.08931)

Becoming the Robot: How CLONE Solves the Drift Problem in Humanoid Teleoperation

The dream of telepresence—operating a humanoid robot as if it were your own body—is a staple of science fiction. We imagine wearing a VR headset and having a robot miles away perfectly mimic our movements, walking through a room to pick up a package or perform a repair. However, the reality of robotics is often clumsier than the dream. While we have made massive strides in robotic control, two persistent “villains” plague humanoid teleoperation: unnatural movement and positional drift. ...

2025-06 · 7 min · 1405 words
[Hold My Beer: Learning Gentle Humanoid Locomotion and End-Effector Stabilization Control 🔗](https://arxiv.org/abs/2505.24198)

Don't Spill the Tea: How SoFTA Teaches Humanoids to Walk Gently

Imagine a waiter walking through a crowded restaurant carrying a tray full of drinks. They have to dodge obstacles, maintain their balance, and navigate uneven flooring. At the same time, their hands must remain perfectly steady to prevent the drinks from spilling. This is a feat of coordination that humans perform almost unconsciously, yet it remains one of the most difficult challenges in humanoid robotics. We have seen robots do backflips, run parkour, and dance to pop music. But ask those same robots to walk across a room holding a full cup of coffee without spilling a drop, and you will likely end up with a mess. ...

2025-05 · 9 min · 1908 words
[HyperTASR: Hypernetwork-Driven Task-Aware Scene Representations for Robust Manipulation 🔗](https://arxiv.org/abs/2508.18802)

Robots That See What Matters: Introducing HyperTASR

Imagine you are making a cup of tea. When you reach for the kettle, your eyes lock onto the handle. When you pour the water, your focus shifts entirely to the spout and the water level in the cup. You essentially ignore the toaster, the fridge, and the pattern on the tablecloth. Your visual perception is task-aware and evolves as the task progresses. Now, consider how robots typically “see.” In standard robotic manipulation pipelines, the robot takes an image of the scene and compresses it into a representation (a set of features). Crucially, this process is usually task-agnostic. The robot processes the toaster, the fridge, and the kettle with equal importance, regardless of whether it’s trying to boil water or make toast. ...

2025-08 · 7 min · 1398 words
[Phantom: Training Robots Without Robots Using Only Human Videos 🔗](https://arxiv.org/abs/2503.00779)

Phantom: How to Train Robots Using Only Human Videos and Zero Robot Data

If you look at the recent explosion in Natural Language Processing (like GPT-4) or Computer Vision, there is a common denominator: massive datasets. These models are trained on billions of data points scraped effortlessly from the internet. Robotics, however, is stuck in a data bottleneck. To train a robot to do the dishes, you typically need to “teleoperate” it—manually controlling the robot arm to perform the task hundreds of times to collect training data. This is slow, expensive, and hardware-dependent. ...

2025-03 · 7 min · 1482 words
[Granular Loco-Manipulation: Repositioning Rocks Through Strategic Sand Avalanche 🔗](https://arxiv.org/abs/2505.12934)

Mastering the Dunes: How Robots Can Use Sand Avalanches to Move Rocks

Imagine a robot deployed on a search-and-rescue mission in a desert environment, or perhaps an explorer rover on the steep, sandy slopes of Mars. The terrain is treacherous—loose sand shifts underfoot, and large rocks block the path. Traditionally, a robot’s navigation strategy is strictly avoidance: see a rock, plan a path around it. But what if the path is blocked? Or what if avoiding the rock puts the robot on a slope so steep it might slide? ...

2025-05 · 8 min · 1519 words
[Articulate AnyMesh: Open-Vocabulary 3D Articulated Objects Modeling 🔗](https://arxiv.org/abs/2502.02590)

Turning Static Meshes into Moving Machines: A Deep Dive into Articulate AnyMesh

In the rapidly evolving world of Embodied AI and robotics, data is oxygen. To teach a robot how to navigate a kitchen or tidy up a workshop, we rely heavily on simulation. It is safer, faster, and cheaper to crash a virtual drone a thousand times than a real one. However, there is a significant bottleneck in current simulation environments: the lack of diverse, interactive objects. While we have witnessed a revolution in generative AI that can produce stunning static 3D meshes from a simple text prompt, these objects are frozen statues. A generated microwave looks real, but you cannot open the door. A generated car has wheels, but they don’t spin. For a robot learning manipulation skills, a static object is useless. ...

2025-02 · 9 min · 1741 words
[COLLAGE: Adaptive Fusion-based Retrieval for Augmented Policy Learning 🔗](https://arxiv.org/abs/2508.01131)

Beyond Visual Similarity: How COLLAGE Teaches Robots to Learn from Heterogeneous Data

Imagine you are trying to teach a robot how to make a cup of coffee. You show it a few examples—maybe five or ten demonstrations of you grinding beans and pouring water. For a modern machine learning model, this small handful of examples is nowhere near enough to learn a robust policy. The robot might learn to move its arm, but it won’t understand how to handle slight variations in the mug’s position or changes in lighting. ...

2025-08 · 10 min · 1925 words
[Imitation Learning Based on Disentangled Representation Learning of Behavioral Characteristics 🔗](https://arxiv.org/abs/2509.04737)

Teaching Robots Nuance: How to Control Speed and Force in Real-Time

Imagine you are teaching a robot to wipe a whiteboard. You show it the motion, and it learns to mimic the trajectory perfectly. But then you notice a stubborn stain. You tell the robot, “Wipe harder,” or perhaps “Wipe faster.” In a typical robotic system, this is where things break down. Most imitation learning models treat a task as a static sequence: they learn what to do, but they struggle to adapt how they do it based on qualitative feedback during execution. ...

2025-09 · 9 min · 1858 words
[Bipedal Balance Control with Whole-body Musculoskeletal Standing and Falling Simulations 🔗](https://arxiv.org/abs/2506.09383)

Decoding Human Balance - How Muscle Simulations Reveal the Secrets of Standing and Falling

If you are reading this while standing up, you are performing a miracle of mechanical engineering. You are, effectively, an inverted pendulum—a heavy weight balanced precariously on top of two narrow supports. To stay upright, your brain is processing a torrent of sensory data and firing precise electrical signals to hundreds of individual muscles, making micro-adjustments every millisecond. ...

2025-06 · 9 min · 1806 words
[Do LLM Modules Generalize? A Study on Motion Generation for Autonomous Driving 🔗](https://arxiv.org/abs/2509.02754)

Driving Like an LLM: Can We Copy-Paste Language Model Architectures to Autonomous Vehicles?

In the last few years, the landscape of Artificial Intelligence has been dominated by one specific architecture: the Transformer. Large Language Models (LLMs) like GPT-4 and Llama have revolutionized how machines process sequence data, demonstrating reasoning capabilities that seem almost emergent. This naturally leads to a provocative question for researchers in other fields: If a model is excellent at predicting the next word in a sentence, can it be equally good at predicting the next move of a car in traffic? ...

2025-09 · 8 min · 1660 words
[Diffusion-Guided Multi-Arm Motion Planning 🔗](https://arxiv.org/abs/2509.08160)

How to Choreograph Robots: Scaling Multi-Arm Motion Planning with Diffusion Models

Imagine a busy factory floor. A single robotic arm picks up a part and places it on a conveyor belt. It’s a solved problem; the robot knows where it is, where the part is, and how to move without hitting itself. Now, imagine eight robotic arms arranged in a circle, all reaching into the same central bin to sort objects simultaneously. Suddenly, the problem isn’t just about moving; it’s about coordination. If Robot A moves left, Robot B might need to wait, and Robot C might need to take a completely different path to avoid a collision. ...

2025-09 · 8 min · 1617 words
[See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation 🔗](https://arxiv.org/abs/2509.22653)

How to Teach Drones to Fly Using Vision-Language Models (Without Training)

Imagine standing in a park and pointing to a distant tree, telling a friend, “Go over there.” Your friend sees your finger, estimates the distance, and walks toward it, adjusting their path to avoid a bench along the way. This interaction is intuitive, relying on visual understanding and common sense. Now, imagine trying to get a drone to do the same thing. Traditionally, this has been a nightmare. You either need a joystick to manually pilot it, or you need to train a complex neural network on thousands of hours of flight data just to recognize a “tree.” ...

2025-09 · 8 min · 1673 words
[Pseudo-Simulation for Autonomous Driving 🔗](https://arxiv.org/abs/2506.04218)

Bridging the Gap—How Pseudo-Simulation Solves the AV Evaluation Crisis

Imagine you are teaching a student driver. To test their skills, you have two options. Option A is to sit in the passenger seat while they drive through rush-hour traffic. It’s realistic and gives you immediate feedback, but it’s dangerous and stressful. Option B is to show them a video of a drive and ask, “What would you do here?” It’s safe and easy to grade, but it doesn’t tell you if they can actually handle the car when things go wrong. ...

2025-06 · 8 min · 1682 words
[Toward Real-World Cooperative and Competitive Soccer with Quadrupedal Robot Teams 🔗](https://arxiv.org/abs/2505.13834)

From Dribbling to Tactics: How Hierarchical RL Teaches Quadruped Robots to Play Soccer

Robotic soccer has long been viewed as a “Grand Challenge” for artificial intelligence and robotics. Since the inception of RoboCup in the 90s, the dream has been to field a team of robots capable of beating the human World Cup champions. While we aren’t quite there yet, the complexity of soccer makes it the perfect testbed for modern robotics. It combines everything that is difficult about robots: the need for agile, split-second motor control (balancing, kicking) and the need for high-level cognitive planning (strategy, teamwork, anticipation). ...

2025-05 · 9 min · 1829 words
[CLASS: Contrastive Learning via Action Sequence Supervision for Robot Manipulation 🔗](https://arxiv.org/abs/2508.01600)

Beyond Behavior Cloning: How CLASS Uses Action Sequences to Fix Robot Generalization

In the quest to build general-purpose robots, Behavior Cloning (BC) has been a dominant strategy. The premise is simple: collect a massive amount of data showing a human performing a task, and train a neural network to copy those actions given the current visual observation. With the rise of expressive models like Diffusion Policies and Transformers, robots have become remarkably good at mimicking complex movements. However, there is a catch. Behavior Cloning tends to be a “pixel-perfect” copycat. If you train a robot to pick up a mug with the camera fixed at a 45-degree angle, and then you bump the camera slightly to the left, the policy often fails catastrophically. The robot hasn’t learned how to pick up the mug; it has learned how to react to a specific arrangement of pixels. ...

2025-08 · 10 min · 1984 words
[Generating Robot Constitutions & Benchmarks for Semantic Safety 🔗](https://arxiv.org/abs/2503.08663)

Nightmares and Nuance: How DeepMind is Teaching Robots to Write Their Own Safety Laws

In 1942, Isaac Asimov introduced the “Three Laws of Robotics” in his short story Runaround. They were elegant, hierarchical, and seemingly comprehensive. The First Law stated that a robot may not injure a human being or, through inaction, allow a human to come to harm. For decades, these laws served as the philosophical bedrock of sci-fi robotics. But when roboticists were asked in 2009 why they hadn’t implemented these laws, the answer was pragmatic and blunt: “They are in English – how the heck do you program that?” ...

2025-03 · 9 min · 1802 words