[CLONE: Closed-Loop Whole-Body Humanoid Teleoperation for Long-Horizon Tasks 🔗](https://arxiv.org/abs/2506.08931)

Becoming the Robot: How CLONE Solves the Drift Problem in Humanoid Teleoperation

The dream of telepresence—operating a humanoid robot as if it were your own body—is a staple of science fiction. We imagine wearing a VR headset and having a robot miles away perfectly mimic our movements, walking through a room to pick up a package or perform a repair. However, the reality of robotics is often clumsier than the dream. While we have made massive strides in robotic control, two persistent “villains” plague humanoid teleoperation: unnatural movement and positional drift. ...

2025-06 · 7 min · 1405 words
[Hold My Beer: Learning Gentle Humanoid Locomotion and End-Effector Stabilization Control 🔗](https://arxiv.org/abs/2505.24198)

Don't Spill the Tea: How SoFTA Teaches Humanoids to Walk Gently

Imagine a waiter walking through a crowded restaurant carrying a tray full of drinks. They have to dodge obstacles, maintain their balance, and navigate uneven flooring. At the same time, their hands must remain perfectly steady to prevent the drinks from spilling. This is a feat of coordination that humans perform almost unconsciously, yet it remains one of the most difficult challenges in humanoid robotics. We have seen robots do backflips, run parkour, and dance to pop music. But ask those same robots to walk across a room holding a full cup of coffee without spilling a drop, and you will likely end up with a mess. ...

2025-05 · 9 min · 1908 words
[HyperTASR: Hypernetwork-Driven Task-Aware Scene Representations for Robust Manipulation 🔗](https://arxiv.org/abs/2508.18802)

Robots That See What Matters: Introducing HyperTASR

Imagine you are making a cup of tea. When you reach for the kettle, your eyes lock onto the handle. When you pour the water, your focus shifts entirely to the spout and the water level in the cup. You essentially ignore the toaster, the fridge, and the pattern on the tablecloth. Your visual perception is task-aware and evolves as the task progresses. Now, consider how robots typically “see.” In standard robotic manipulation pipelines, the robot takes an image of the scene and compresses it into a representation (a set of features). Crucially, this process is usually task-agnostic. The robot processes the toaster, the fridge, and the kettle with equal importance, regardless of whether it’s trying to boil water or make toast. ...
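
To make that contrast concrete, here is a minimal PyTorch sketch, my own illustrative toy rather than HyperTASR's actual architecture: a task-agnostic encoder ignores the goal entirely, while a task-conditioned encoder uses a small hypernetwork to generate its projection weights from a task embedding, so the same image is summarized differently depending on what the robot is trying to do. All layer sizes and names below are assumptions.

```python
import torch
import torch.nn as nn

class TaskAgnosticEncoder(nn.Module):
    """Compresses an image into features the same way for every task."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, image, task_embedding=None):
        return self.backbone(image)  # the task is ignored

class TaskConditionedEncoder(nn.Module):
    """A tiny hypernetwork emits the projection weights from the task embedding."""
    def __init__(self, feat_dim=256, task_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # task embedding -> weights of a per-sample (64 -> feat_dim) linear map
        self.hyper = nn.Linear(task_dim, 64 * feat_dim)
        self.feat_dim = feat_dim

    def forward(self, image, task_embedding):
        x = self.backbone(image)                           # (B, 64)
        W = self.hyper(task_embedding).view(-1, self.feat_dim, 64)
        return torch.bmm(W, x.unsqueeze(-1)).squeeze(-1)   # (B, feat_dim)
```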

2025-08 · 7 min · 1398 words
[Phantom: Training Robots Without Robots Using Only Human Videos 🔗](https://arxiv.org/abs/2503.00779)

Phantom: How to Train Robots Using Only Human Videos and Zero Robot Data

If you look at the recent explosions in Natural Language Processing (like GPT-4) and Computer Vision, there is a common denominator: massive datasets. These models are trained on billions of data points scraped effortlessly from the internet. Robotics, however, is stuck in a data bottleneck. To train a robot to do the dishes, you typically need to “teleoperate” it—manually controlling the robot arm to perform the task hundreds of times to collect training data. This is slow, expensive, and hardware-dependent. ...

2025-03 · 7 min · 1482 words
[Granular Loco-Manipulation: Repositioning Rocks Through Strategic Sand Avalanche 🔗](https://arxiv.org/abs/2505.12934)

Mastering the Dunes: How Robots Can Use Sand Avalanches to Move Rocks

Imagine a robot deployed on a search-and-rescue mission in a desert environment, or perhaps an explorer rover on the steep, sandy slopes of Mars. The terrain is treacherous—loose sand shifts underfoot, and large rocks block the path. Traditionally, a robot’s navigation strategy is strictly avoidance: see a rock, plan a path around it. But what if the path is blocked? Or what if avoiding the rock puts the robot on a slope so steep it might slide? ...

2025-05 · 8 min · 1519 words
[Articulate AnyMesh: Open-Vocabulary 3D Articulated Objects Modeling 🔗](https://arxiv.org/abs/2502.02590)

Turning Static Meshes into Moving Machines: A Deep Dive into Articulate AnyMesh

In the rapidly evolving world of Embodied AI and robotics, data is oxygen. To teach a robot how to navigate a kitchen or tidy up a workshop, we rely heavily on simulation. It is safer, faster, and cheaper to crash a virtual drone a thousand times than a real one. However, there is a significant bottleneck in current simulation environments: the lack of diverse, interactive objects. While we have witnessed a revolution in generative AI that can produce stunning static 3D meshes from a simple text prompt, these objects are frozen statues. A generated microwave looks real, but you cannot open the door. A generated car has wheels, but they don’t spin. For a robot learning manipulation skills, a static object is useless. ...

2025-02 · 9 min · 1741 words
[COLLAGE: Adaptive Fusion-based Retrieval for Augmented Policy Learning 🔗](https://arxiv.org/abs/2508.01131)

Beyond Visual Similarity: How COLLAGE Teaches Robots to Learn from Heterogeneous Data

Imagine you are trying to teach a robot how to make a cup of coffee. You show it a few examples—maybe five or ten demonstrations of you grinding beans and pouring water. For a modern machine learning model, this small handful of examples is nowhere near enough to learn a robust policy. The robot might learn to move its arm, but it won’t understand how to handle slight variations in the mug’s position or changes in lighting. ...

2025-08 · 10 min · 1925 words
[Imitation Learning Based on Disentangled Representation Learning of Behavioral Characteristics 🔗](https://arxiv.org/abs/2509.04737)

Teaching Robots Nuance: How to Control Speed and Force in Real-Time

Imagine you are teaching a robot to wipe a whiteboard. You show it the motion, and it learns to mimic the trajectory perfectly. But then you notice a stubborn stain. You tell the robot, “Wipe harder,” or perhaps “Wipe faster.” In a typical robotic system, this is where things break down. Most imitation learning models treat a task as a static sequence: they learn what to do, but they struggle to adapt how they do it based on qualitative feedback during execution. ...

2025-09 · 9 min · 1858 words
[Bipedal Balance Control with Whole-body Musculoskeletal Standing and Falling Simulations 🔗](https://arxiv.org/abs/2506.09383)

Decoding Human Balance: How Muscle Simulations Reveal the Secrets of Standing and Falling

If you are reading this while standing up, you are performing a miracle of mechanical engineering. You are, effectively, an inverted pendulum—a heavy weight balanced precariously on top of two narrow supports. To stay upright, your brain is processing a torrent of sensory data and firing precise electrical signals to hundreds of individual muscles, making micro-adjustments every millisecond. ...

2025-06 · 9 min · 1806 words
[Do LLM Modules Generalize? A Study on Motion Generation for Autonomous Driving 🔗](https://arxiv.org/abs/2509.02754)

Driving Like an LLM: Can We Copy-Paste Language Model Architectures to Autonomous Vehicles?

In the last few years, the landscape of Artificial Intelligence has been dominated by one specific architecture: the Transformer. Large Language Models (LLMs) like GPT-4 and Llama have revolutionized how machines process sequence data, demonstrating reasoning capabilities that seem almost emergent. This naturally leads to a provocative question for researchers in other fields: If a model is excellent at predicting the next word in a sentence, can it be equally good at predicting the next move of a car in traffic? ...

2025-09 · 8 min · 1660 words
[Diffusion-Guided Multi-Arm Motion Planning 🔗](https://arxiv.org/abs/2509.08160)

How to Choreograph Robots: Scaling Multi-Arm Motion Planning with Diffusion Models

Imagine a busy factory floor. A single robotic arm picks up a part and places it on a conveyor belt. It’s a solved problem; the robot knows where it is, where the part is, and how to move without hitting itself. Now, imagine eight robotic arms arranged in a circle, all reaching into the same central bin to sort objects simultaneously. Suddenly, the problem isn’t just about moving; it’s about coordination. If Robot A moves left, Robot B might need to wait, and Robot C might need to take a completely different path to avoid a collision. ...

2025-09 · 8 min · 1617 words
[See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation 🔗](https://arxiv.org/abs/2509.22653)

How to Teach Drones to Fly Using Vision-Language Models (Without Training)

Imagine standing in a park and pointing to a distant tree, telling a friend, “Go over there.” Your friend sees your finger, estimates the distance, and walks toward it, adjusting their path to avoid a bench along the way. This interaction is intuitive, relying on visual understanding and common sense. Now, imagine trying to get a drone to do the same thing. Traditionally, this has been a nightmare. You either need a joystick to manually pilot it, or you need to train a complex neural network on thousands of hours of flight data just to recognize a “tree.” ...

2025-09 · 8 min · 1673 words
[Pseudo-Simulation for Autonomous Driving 🔗](https://arxiv.org/abs/2506.04218)

Bridging the Gap—How Pseudo-Simulation Solves the AV Evaluation Crisis

Imagine you are teaching a student driver. To test their skills, you have two options. Option A is to sit in the passenger seat while they drive through rush-hour traffic. It’s realistic and gives you immediate feedback, but it’s dangerous and stressful. Option B is to show them a video of a drive and ask, “What would you do here?” It’s safe and easy to grade, but it doesn’t tell you if they can actually handle the car when things go wrong. ...

2025-06 · 8 min · 1682 words
[Toward Real-World Cooperative and Competitive Soccer with Quadrupedal Robot Teams 🔗](https://arxiv.org/abs/2505.13834)

From Dribbling to Tactics: How Hierarchical RL Teaches Quadruped Robots to Play Soccer

Robotic soccer has long been viewed as a “Grand Challenge” for artificial intelligence and robotics. Since the inception of RoboCup in the 90s, the dream has been to field a team of robots capable of beating the human World Cup champions. While we aren’t quite there yet, the complexity of soccer makes it the perfect testbed for modern robotics. It combines everything that is difficult about robots: the need for agile, split-second motor control (balancing, kicking) and the need for high-level cognitive planning (strategy, teamwork, anticipation). ...

2025-05 · 9 min · 1829 words
[CLASS: Contrastive Learning via Action Sequence Supervision for Robot Manipulation 🔗](https://arxiv.org/abs/2508.01600)

Beyond Behavior Cloning: How CLASS Uses Action Sequences to Fix Robot Generalization

In the quest to build general-purpose robots, Behavior Cloning (BC) has been a dominant strategy. The premise is simple: collect a massive amount of data showing a human performing a task, and train a neural network to copy those actions given the current visual observation. With the rise of expressive models like Diffusion Policies and Transformers, robots have become remarkably good at mimicking complex movements. However, there is a catch. Behavior Cloning tends to be a “pixel-perfect” copycat. If you train a robot to pick up a mug with the camera fixed at a 45-degree angle, and then you bump the camera slightly to the left, the policy often fails catastrophically. The robot hasn’t learned how to pick up the mug; it has learned how to react to a specific arrangement of pixels. ...
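
For readers new to that baseline, a behavior-cloning update really is just supervised regression from the current image to the demonstrated action, along the lines of the hedged PyTorch sketch below (module names and sizes are my own, not from the paper). Because the loss ties raw pixels directly to actions, nothing encourages views from a slightly moved camera to land near each other in feature space, which is the gap CLASS targets with its action-sequence supervision.

```python
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    """Vanilla behavior cloning: map the current image observation to an action."""
    def __init__(self, action_dim=7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, action_dim)

    def forward(self, obs):
        return self.head(self.encoder(obs))

def bc_update(policy, optimizer, obs_batch, action_batch):
    """One supervised step: copy the demonstrated action for this observation."""
    pred = policy(obs_batch)
    loss = nn.functional.mse_loss(pred, action_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```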

2025-08 · 10 min · 1984 words
[Generating Robot Constitutions & Benchmarks for Semantic Safety 🔗](https://arxiv.org/abs/2503.08663)

Nightmares and Nuance: How DeepMind is Teaching Robots to Write Their Own Safety Laws

In 1942, Isaac Asimov introduced the “Three Laws of Robotics” in his short story Runaround. They were elegant, hierarchical, and seemingly comprehensive. The First Law stated that a robot may not injure a human being or, through inaction, allow a human to come to harm. For decades, these laws served as the philosophical bedrock of sci-fi robotics. But when roboticists were asked in 2009 why they hadn’t implemented these laws, the answer was pragmatic and blunt: “They are in English – how the heck do you program that?” ...

2025-03 · 9 min · 1802 words
[BEVCALIB: LiDAR-Camera Calibration via Geometry-Guided Bird's-Eye View Representations 🔗](https://arxiv.org/abs/2506.02587)

Bridging the Gap: How BEVCALIB Uses Bird's-Eye View for Precise Sensor Calibration

Imagine you are driving a car. Your eyes (cameras) see the red stop sign ahead, and your brain estimates the distance. Now, imagine a sophisticated autonomous vehicle. It doesn’t just rely on cameras; it likely uses LiDAR (Light Detection and Ranging) to measure precise depth. Ideally, the camera and the LiDAR should agree perfectly on where that stop sign is located in 3D space. But what happens if they don’t? ...
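
Concretely, "agreeing in 3D" comes down to the extrinsic transform that carries LiDAR points into the camera frame. The small NumPy sketch below uses made-up intrinsics and a hypothetical 1-degree yaw error (not the paper's setup) to show how even a tiny extrinsic mistake shifts a stop sign 20 m away by well over a dozen pixels in the image.

```python
import numpy as np

def project_lidar_point(p_lidar, T_cam_from_lidar, K):
    """Map a 3D LiDAR point into camera pixel coordinates."""
    p_h = np.append(p_lidar, 1.0)          # homogeneous point
    p_cam = T_cam_from_lidar @ p_h         # extrinsics: LiDAR frame -> camera frame
    u, v, w = K @ p_cam[:3]                # intrinsics: pinhole projection
    return np.array([u / w, v / w])

# Illustrative numbers only: a typical intrinsic matrix and an extrinsic transform
# that is identity except for a 1-degree yaw error.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
T_true = np.eye(4)
yaw = np.deg2rad(1.0)
T_bad = np.eye(4)
T_bad[:3, :3] = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                          [0.0, 1.0, 0.0],
                          [-np.sin(yaw), 0.0, np.cos(yaw)]])
stop_sign = np.array([0.0, 0.0, 20.0])     # 20 m straight ahead (z-forward camera)
print(project_lidar_point(stop_sign, T_true, K))  # ~[640, 360]
print(project_lidar_point(stop_sign, T_bad, K))   # shifted right by ~17 pixels
```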

2025-06 · 9 min · 1892 words
[Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models 🔗](https://arxiv.org/abs/2505.07815)

How Robots Can "Imagine" to Explore: Breaking Free from Random Actions

How does a human learn to interact with a new environment? If you place a toddler in front of a table with blocks and cups, they don’t just randomly twitch their muscles until something interesting happens. They look at the objects, form a mini-goal (e.g., “I want to put the blue block inside the cup”), and then try to execute it. If it doesn’t fit, they learn. If it works, they remember the result and try something new. ...

2025-05 · 7 min · 1465 words
[Distilling On-device Language Models for Robot Planning with Minimal Human Intervention 🔗](https://arxiv.org/abs/2506.17486)

Cutting the Cord: How PRISM Brings GPT-4 Level Planning to On-Device Robots

Imagine a robot navigating a disaster zone. It needs to find survivors, assess structural damage, and report back. To do this effectively, it needs to understand complex natural language instructions and reason about its environment in real-time. For the last few years, the standard solution has been to hook the robot up to a Large Language Model (LLM) like GPT-4. The robot sends a picture or a map to the cloud, the LLM processes it, and sends back a plan. In a perfect world with high-speed internet, this works beautifully. ...

2025-06 · 9 min · 1764 words
[Beyond Constant Parameters: Hyper Prediction Models and HyperMPC 🔗](https://arxiv.org/abs/2508.06181)

Dynamic Parameters for Dynamic Robots: How HyperPM Revolutionizes Model Predictive Control

In the world of robotics, there is a constant tug-of-war between speed and accuracy. Nowhere is this more apparent than in Model Predictive Control (MPC). MPC is the gold standard for controlling complex robots—from agile drones to autonomous race cars—because it doesn’t just react to the present; it looks into the future, plans a sequence of moves, and executes the best one. But to look into the future, MPC needs a crystal ball: a dynamics model. ...
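
As a refresher on where that dynamics model sits, here is a minimal random-shooting MPC loop in NumPy. It is a generic toy with a fixed point-mass model and an invented cost, not HyperPM itself, whose point is precisely that the model's parameters should not stay constant over the horizon. Candidate action sequences are rolled through the model, the cheapest is kept, and only its first action is executed before re-planning.

```python
import numpy as np

def mpc_step(state, dynamics, cost, horizon=20, n_candidates=256, action_dim=2, rng=None):
    """Random-shooting MPC: simulate candidate action sequences through the dynamics
    model, keep the cheapest one, and execute only its first action."""
    if rng is None:
        rng = np.random.default_rng(0)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    best_cost, best_action = np.inf, None
    for seq in candidates:
        s, total = state.copy(), 0.0
        for a in seq:              # roll the "crystal ball" forward over the horizon
            s = dynamics(s, a)
            total += cost(s, a)
        if total < best_cost:
            best_cost, best_action = total, seq[0]
    return best_action             # receding horizon: re-plan at the next time step

# Toy example with an assumed point-mass model and a reach-the-origin cost.
dynamics = lambda s, a: s + 0.1 * a
cost = lambda s, a: float(np.sum(s**2) + 0.01 * np.sum(a**2))
first_action = mpc_step(np.array([1.0, -0.5]), dynamics, cost)
```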

2025-08 · 11 min · 2155 words