[Constraint-Preserving Data Generation for Visuomotor Policy Learning 🔗](https://openreview.net/pdf?id=KSKzA1mwKs)

One Demo Is All You Need? Scaling Robot Learning with CP-Gen

We are living in the golden age of imitation learning. From robots that can cook shrimp to those that can repair themselves, we’ve seen incredible breakthroughs driven by large-scale demonstration data. However, there is a massive bottleneck hiding behind these viral videos: the cost of data. Projects like ALOHA Unleashed or DROID required months of labor, dozens of operators, and tens of thousands of collected trajectories. If we want robots to generalize to every cup, hammer, or door handle in the world, we cannot possibly teleoperate them through every variation. We need a way to multiply our data automatically. ...

7 min · 1433 words
[Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop 🔗](https://arxiv.org/abs/2506.10968)

Learning to Look: How EyeRobot Uses Active Vision to Master Manipulation

Imagine you are thirsty. You decide to reach for a cup of coffee sitting on your desk. What happens first? Before your arm muscles even engage, your eyes move. You scan the table, saccade toward the cup to lock in its position, and then guide your hand toward it. Once you grasp the cup, your eyes might immediately dart to the coaster where you plan to place it. This sequence feels instantaneous, but it reveals a fundamental truth about biological intelligence: we do not passively absorb the visual world like a video camera; we actively look in order to act. ...

2025-06 · 9 min · 1757 words
[Learning Deployable Locomotion Control via Differentiable Simulation 🔗](https://arxiv.org/abs/2404.02887)

How to Teach Robots to Walk Using Analytic Gradients and Differentiable Physics

In the world of robotics, we are constantly chasing the dream of efficient learning. If you have ever trained a neural network for image recognition, you know the power of backpropagation. You calculate the error, compute the gradient (the direction to adjust parameters to reduce that error), and update the network. It’s elegant, mathematical, and efficient. However, when we try to apply this same logic to robots interacting with the physical world—specifically legged robots that need to walk—we hit a massive wall: contact. ...

2024-04 · 9 min · 1837 words
[Agreement Volatility: A Second-Order Metric for Uncertainty Quantification in Surgical Robot Learning 🔗](https://openreview.net/pdf?id=K7KLc4FexO)

When Robots Get Nervous: Making Autonomous Surgery Safer with Agreement Volatility

Imagine a future where a surgical robot operates autonomously on a patient. It’s stitching soft tissue with precision, relieving an overworked surgeon who oversees the process. Suddenly, the robot encounters a piece of tissue that is slightly more slippery or deformed than what it was trained on. In a standard automation scenario, the robot might plow ahead, confident in a wrong decision, potentially causing harm. Ideally, however, the robot would “realize” it is confused, pause, and ask the human surgeon to take over for a moment. Once the tricky part is navigated, the robot resumes control. ...

9 min · 1764 words
[CARE: Enhancing Safety of Visual Navigation through Collision Avoidance via Repulsive Estimation 🔗](https://arxiv.org/abs/2506.03834)

Making Robots Safe: How CARE Adds Collision Avoidance to Visual Navigation Models

Imagine unboxing a new robot, turning it on, and telling it to “go to the kitchen.” Thanks to recent advancements in foundation models and Vision-Language Models (VLMs), this is becoming a reality. Robots can now understand high-level instructions and navigate through environments they have never seen before. However, there is a catch. While these modern AI models are excellent at understanding where to go based on visual context, they often lack a precise understanding of physical geometry. They might recognize a path to the kitchen but fail to notice the small cardboard box left in the hallway or a chair leg protruding from under a table. The result? Collisions. ...

2025-06 · 8 min · 1689 words
[Embrace Contacts: Humanoid Shadowing with Full Body Ground Contacts 🔗](https://openreview.net/pdf?id=JibqR9sEdW)

Beyond Walking: Teaching Humanoids to Roll, Crawl, and Breakdance

When we imagine a humanoid robot, we typically picture it doing one of two things: walking on two legs or standing still while using its hands to manipulate an object. This mirrors how traditional robotics control has evolved—treating the robot as a bipedal platform for mobile manipulation. But think about how humans actually move. We don’t just walk. We sit in chairs, we crawl under low obstacles, we trip and roll to break a fall, and we push ourselves up from the ground. We use our elbows, knees, backs, and shoulders to interact with the world. In robotics terms, humans embrace “full-body ground contacts.” ...

9 min · 1865 words
[FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies 🔗](https://openreview.net/pdf?id=JeppaebLRD)

Shrinking the Brain to Grow the Skills: How FLOWER Makes Generalist Robots Efficient

The dream of generalist robotics is a machine that can walk into any room, look around, and perform a task simply by being asked. Whether it’s “clean up that spill” or “make me a sandwich,” the robot needs to understand the visual world, parse the language command, and translate that into precise physical movements. To achieve this, the field has coalesced around Vision-Language-Action (VLA) models. Think of these as the robotic equivalent of Large Language Models (LLMs). But instead of outputting text, they output robot actions. ...

8 min · 1668 words
[VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision 🔗](https://arxiv.org/abs/2412.14446)

Teaching Cars to Reason: How VLM-AD Enhances Autonomous Driving Without the Inference Cost

Imagine you are driving down a busy street. You see a ball roll out from between two parked cars. You don’t just see a “spherical orange object”; you immediately infer that a child might be chasing it, and you instinctively slow down. This is commonsense reasoning. Now, consider an autonomous driving (AD) system. Most modern end-to-end (E2E) models are excellent at pattern matching—they see the road geometry and other cars, and they mimic the trajectories found in their training data. However, they often lack the “why” behind the driving decisions. They might navigate a standard intersection perfectly but struggle in “long-tail” scenarios (rare, complex events) because they lack the underlying reasoning capabilities that humans possess. ...

2024-12 · 8 min · 1603 words
[SIREN: Semantic, Initialization-Free Registration of Multi-Robot Gaussian Splatting Maps 🔗](https://arxiv.org/abs/2502.06519)

Merging Reality: How SIREN Fuses Multi-Robot Maps Without GPS or Poses

In the rapidly evolving world of robotics, the ability to create high-fidelity 3D maps is crucial. Whether it is a drone inspecting a warehouse or a quadruped exploring a disaster zone, robots rely on these maps to understand their environment. Recently, 3D Gaussian Splatting (GSplat) has emerged as a powerful technique for representing scenes, offering photorealistic quality and real-time rendering speeds that surpass traditional point clouds or voxel maps. ...

2025-02 · 8 min · 1601 words
[Articulated Object Estimation in the Wild 🔗](https://arxiv.org/abs/2509.01708)

How Robots Can Learn to Open Doors by Watching You: Inside ArtiPoint

Imagine a robot entering a new kitchen for the first time. To a human, the function of the room is obvious: the refrigerator handle pulls open, the drawer slides out, and the cabinet door swings on a hinge. We understand these mechanics intuitively, often predicting how an object moves before we even touch it. For a robot, however, this is a geometric nightmare. A cabinet isn’t just a static box; it is an articulated object—rigid parts that move relative to one another along specific joints. If a robot miscalculates the axis of rotation for a heavy fridge door, it could rip the handle off or damage its own arm. ...

2025-09 · 9 min · 1812 words
[Decentralized Aerial Manipulation of a Cable-Suspended Load using Multi-Agent Reinforcement Learning 🔗](https://arxiv.org/abs/2508.01522)

No Comms, No Problem: How Drones Can Cooperate to Lift Heavy Loads Without Talking

Imagine trying to move a heavy sofa up a winding staircase with two friends. It requires constant communication: “Pivot left,” “lift higher,” “wait, it’s slipping.” Now, imagine doing that while flying, buffeted by wind, connected to the object only by loose cables, and—here is the kicker—you aren’t allowed to speak to each other. This is the challenge of cooperative aerial manipulation. While a single drone (Micro Aerial Vehicle or MAV) is often too weak to carry heavy payloads, a team of drones can lift significantly more. By tethering multiple drones to a single load, we can transport construction materials or emergency supplies to remote areas. ...

2025-08 · 8 min · 1617 words
[Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action Models 🔗](https://arxiv.org/abs/2509.23655)

Cutting the Visual Tax: How Oat-VLA Streamlines Robotic Learning by Focusing on What Matters

In the rapidly evolving landscape of Artificial Intelligence, we have witnessed a massive shift from text-based Large Language Models (LLMs) to multimodal systems. We no longer just want AI to write poetry; we want it to see the world and, more importantly, act upon it. This ambition has given rise to Vision-Language-Action (VLA) models—systems that ingest visual data and language instructions to output robotic control actions. ...

2025-09 · 10 min · 2019 words
[In-Context Iterative Policy Improvement for Dynamic Manipulation 🔗](https://arxiv.org/abs/2508.15021)

Teaching Robots Physics Without Training: How LLMs Master Dynamic Manipulation via In-Context Learning

Large Language Models (LLMs) like GPT-4 have transformed our expectations of artificial intelligence. We have grown accustomed to their ability to write code, summarize history, and even reason through logic puzzles. Recently, roboticists have begun connecting these “brains” to robot “bodies,” allowing LLMs to generate high-level plans or write control code. However, a significant gap remains. While LLMs understand language, they don’t natively “understand” the complex, low-level physics required to slide a puck across a table or swing a rope into a specific shape. ...

2025-08 · 9 min · 1907 words
[One Demo is Worth a Thousand Trajectories: Action-View Augmentation for Visuomotor Policies 🔗](https://openreview.net/pdf?id=Hu3NoPMAg4)

How to Turn One Robot Demo into a Thousand: The 1001 DEMOS Framework

Imagine you are teaching a robot to pick up a coffee mug. You guide the robot’s hand, gripping the mug and placing it on a coaster. The robot records this motion and the video feed from its camera. You run the policy, and it works perfectly. But then, you move the mug three inches to the left, or perhaps you rotate the robot’s base slightly. Suddenly, the robot flails, misses the mug, or crashes into the table. ...

10 min · 2006 words
[Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation 🔗](https://arxiv.org/abs/2409.16283)

Gen2Act: Teaching Robots by Hallucinating Human Videos

The dream of general-purpose robotics is a machine that can walk into a messy, unseen kitchen and simply “wash the dishes” or “put away the groceries” without needing weeks of specific training for that exact room. However, the reality of robotic learning is often far more brittle. Most robots are trained via Behavior Cloning (BC), where they meticulously mimic collected datasets of robot actions. The problem? Collecting robot data is expensive, slow, and dangerous. ...

2024-09 · 9 min · 1727 words
[FLARE: Robot Learning with Implicit World Modeling 🔗](https://arxiv.org/abs/2505.15659)

Predicting the Future, Implicitly: How FLARE Revolutionizes Robot Learning

Imagine you are reaching for a coffee mug on a cluttered desk. You don’t consciously hallucinate a photorealistic video of your hand moving, frame by frame, texture by texture, before you act. Instead, your brain operates on an intuitive, implicit level. It predicts the consequences of your movement—the spatial feeling of the grasp, the weight of the cup, the avoidance of the stapler—without needing to render every pixel of the scene in your mind’s eye. ...

2025-05 · 9 min · 1888 words
[GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation 🔗](https://arxiv.org/abs/2505.11865)

How Robots Learn to 'Click' with the World: A Deep Dive into GLOVER++ and HOVA-500K

When you look at a coffee mug, you don’t just see a cylindrical object with a curved protrusion; you intuitively see a handle that can be grasped. When you look at a drawer, you see a knob meant to be pulled. In psychology and robotics, this concept is known as affordance—the actionable properties of an object that define how an agent can interact with it. For humans, recognizing affordances is effortless. For robots, it is an immense challenge. While recent advancements in Vision-Language Models (VLMs) have given machines the ability to describe scenes and answer questions, bridging the gap between high-level semantic understanding (“That is a mug”) and low-level robotic control (“Grasp the handle at these coordinates”) remains a bottleneck. ...

2025-05 · 10 min · 1955 words
[Extracting Visual Plans from Unlabeled Videos via Symbolic Guidance 🔗](https://arxiv.org/abs/2505.08444)

Vis2Plan: Bridging the Gap Between Symbolic Reasoning and Visual Planning in Robotics

Imagine asking a robot to “make dinner.” To you, this is a single request. To a robot, it is a staggering sequence of complex, physically grounded actions: open the fridge, identify the ingredients, grasp the onion, place it on the cutting board, pick up the knife, and so on. In the field of robot learning, we call these long-horizon manipulation tasks. They are notoriously difficult because errors compound. If the robot fumbles opening the fridge, the rest of the plan is irrelevant. ...

2025-05 · 9 min · 1838 words
[TA-VLA: Elucidating the Design Space of Torque-aware Vision-Language-Action Models 🔗](https://arxiv.org/abs/2509.07962)

Feeling the Force: How Torque-Aware VLA Models Master Contact-Rich Manipulation

Imagine trying to plug a USB charger into a wall socket in a pitch-black room. You can’t see the socket, but you can feel around, sensing the resistance when you hit the plastic faceplate and the satisfying “click” when the plug slides in. Now, imagine a robot trying to do the same thing relying only on a camera. If its hand blocks the view, or if the lighting is bad, the robot is effectively blind and numb. It pushes, fails, and doesn’t know why. ...

2025-09 · 8 min · 1580 words
[AnyPlace: Learning Generalizable Object Placement for Robot Manipulation 🔗](https://openreview.net/pdf?id=H0zFqW6QM0)

AnyPlace: Solving Generalizable Robot Placement with VLMs and Diffusion

For humans, placing an object is deceptively simple. Whether you are hanging a mug on a rack, sliding a book onto a shelf, or inserting a battery into a remote, your brain seamlessly processes the geometry of the objects, identifies valid target locations, and coordinates your hand to execute the move. You don’t need to be retrained from scratch every time you encounter a slightly different mug or a new type of shelf. ...

10 min · 1967 words