[Wheeled Lab: Modern Sim2Real for Low-cost, Open-source Wheeled Robotics 🔗](https://arxiv.org/abs/2502.07380)

Democratizing Robot Learning: How Wheeled Lab Brings Modern Sim2Real to Low-Cost Robots

Imagine watching a $50,000 quadruped robot hike up a mountain trail or a specialized drone race through a complex circuit at champion-level speeds. These feats are awe-inspiring, representing the bleeding edge of robotics. They share a common secret sauce: Sim2Real—training a policy in a high-fidelity simulation and then deploying it into the real world. But herein lies the problem: these innovations are often locked behind a paywall of expensive hardware and proprietary software. For the average undergraduate student, researcher on a budget, or robotics hobbyist, accessing the tools required to learn these modern techniques is nearly impossible. You are often left with outdated simulators and basic line-following robots, while the state of the art races ahead. ...

2025-02 · 7 min · 1486 words
[FlashBack: Consistency Model-Accelerated Shared Autonomy 🔗](https://arxiv.org/abs/2505.16892)

The Fast Lane to Shared Autonomy: How Consistency Models Turn Robots into Real-Time Copilots

Imagine trying to land a spacecraft on the moon or insert a delicate plug into a socket using a robotic arm. These tasks require extreme precision. Now, imagine doing it with a joystick that has a slight delay or feels “mushy.” This is the challenge of teleoperation. Shared autonomy is the solution: a collaborative approach where a human pilot provides high-level intent (via a joystick or VR controller), and an AI “copilot” handles the low-level precision, stabilizing the motion and avoiding collisions. ...

2025-05 · 7 min · 1424 words
[CLAMP: Crowdsourcing a LArge-scale in-the-wild haptic dataset with an open-source device for Multimodal robot Perception 🔗](https://arxiv.org/abs/2505.21495)

Giving Robots the Sense of Touch: How CLAMP Crowdsourced Haptic Perception

Imagine you are reaching into a cluttered bag to find your house keys. You can’t really see them, but your fingers brush against cold metal, and you instantly know you’ve found them. Or consider checking if a banana is ripe; looking at the color helps, but a gentle squeeze tells you if it’s mushy or firm. For humans, integrating vision with haptics (the sense of touch) is seamless. For robots, it is an immense challenge. While computer vision has seen an explosion of capabilities due to massive datasets like ImageNet, robotic touch has lagged behind. Robots struggle to differentiate between a plastic apple and a real one, or a metal cup and a ceramic one, simply by looking. They need to feel. ...

2025-05 · 8 min · 1685 words
[DEQ-MPC: Deep Equilibrium Model Predictive Control 🔗](https://openreview.net/pdf?id=zQXurgHUVX)

Closing the Loop: Merging Neural Networks and Control Solvers with Deep Equilibrium Models

In the world of robotics, there is a constant tension between flexibility and safety. On one hand, we want robots to use Neural Networks (NNs) to learn complex behaviors, adapt to new environments, and process high-dimensional sensor data. On the other hand, neural networks are often “black boxes”—we can’t easily guarantee they won’t command a drone to fly into a wall. To solve this, roboticists rely on Model Predictive Control (MPC). MPC is a mathematical framework that plans movements by solving an optimization problem at every moment, strictly adhering to safety constraints (like “do not hit obstacles” or “stay within motor limits”). ...

8 min · 1684 words
[Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation 🔗](https://arxiv.org/abs/2505.04619)

MAD Skills: How to Teach Robots to See with One Eye or Many

Visual reinforcement learning (RL) has pushed the boundaries of what robots can do, from beating Atari games to performing complex dexterous manipulation. However, a significant gap remains between a robot that performs well in a controlled simulation and one that is robust enough for the real world. A major part of this challenge lies in vision. In the real world, depth perception is crucial. Humans achieve this naturally through binocular vision—using two eyes to triangulate 3D structure. Similarly, robots benefit immensely from multiple camera views. Merging these views creates a richer representation of the world, overcoming occlusions and improving learning speed (sample efficiency). ...

2025-05 · 9 min · 1779 words
[GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data 🔗](https://arxiv.org/abs/2505.03233)

Can Robots Learn to Grasp Using Only Synthetic Data? A Deep Dive into GraspVLA

The Data Bottleneck in Robotics: We are currently witnessing a golden age of Artificial Intelligence, driven largely by Foundation Models. In Natural Language Processing (NLP) and Computer Vision (CV), models like GPT-4 and Gemini have achieved staggering capabilities. Their secret weapon? The internet. These models are pre-trained on trillions of tokens of text and billions of images scraped from the web. However, there is one frontier where this formula hasn’t quite worked yet: Robotics. ...

2025-05 · 8 min · 1580 words
[Multi-Loco: Unifying Multi-Embodiment Legged Locomotion via Reinforcement Learning Augmented Diffusion 🔗](https://arxiv.org/abs/2506.11470)

One Brain, Many Bodies: How Multi-Loco Unifies Robot Control with Diffusion and RL

Imagine if learning to ride a bicycle immediately made you better at walking on stilts or ice skating. In the biological world, this kind of skill transfer happens constantly; animals adapt their motor control strategies to different terrains and physical changes. In robotics, however, this has remained a distant dream. Typically, if you want to train a quadruped (a four-legged robot dog) and a humanoid (a two-legged robot), you need two completely separate training pipelines. Their bodies are different, their motors are different, and the physics governing their movement are distinct. ...

2025-06 · 8 min · 1700 words
[Ensuring Force Safety in Vision-Guided Robotic Manipulation via Implicit Tactile Calibration 🔗](https://arxiv.org/abs/2412.10349)

Why Robots Break Doors (And How "SafeDiff" Fixes It with Tactile Diffusion)

Opening a door seems like the simplest task in the world. For a human, it’s effortless: you reach out, grab the handle, and pull. If the door is heavy or the hinge is stiff, your hand automatically adjusts the force and trajectory to follow the door’s natural arc. You don’t even think about it. For a robot, however, this simple act is a geometric nightmare. If a robot’s planned trajectory deviates even slightly from the door’s physical constraints—say, by pulling a centimeter too far to the left—it fights against the hinge. This creates “harmful forces.” In the best-case scenario, the robot fails the task. In the worst case, it rips the handle off the door or burns out its own motors. ...

2024-12 · 8 min · 1591 words
[Learning Long-Horizon Robot Manipulation Skills via Privileged Action 🔗](https://arxiv.org/abs/2502.15442)

Cheating to Win: How Privileged Actions Teach Robots Complex Skills

Reinforcement Learning (RL) has achieved remarkable things, from beating grandmasters at Go to teaching robots how to run. But if you ask a robot to perform a seemingly simple task—like picking up a credit card lying flat on a table—it often flails. This specific type of problem is known as a “long-horizon, contact-rich” task. To succeed, the robot cannot just close its gripper; it must push the card to the edge of the table, reorient its hand, and then grasp it. This requires a sequence of precise interactions (pushing, sliding, pivoting) where the reward (holding the object) only comes at the very end. Standard RL struggles here because the search space is massive, and randomly stumbling upon this complex sequence is vanishingly unlikely. ...

2025-02 · 7 min · 1399 words
[Estimating Value of Assistance for Online POMDP Robotic Agents 🔗](https://openreview.net/pdf?id=xzR8rBRgPp)

How Robots Decide to Help: Calculating Value of Assistance in Uncertain Worlds

Imagine a collaborative robotic scenario in a warehouse. One robot, equipped with a robotic arm, is trying to pick up a specific can of soda from a cluttered shelf. However, its sensors are noisy, and a large box is blocking its line of sight. It knows the can is there, but it can’t pinpoint the location with enough certainty to act safely. Nearby, a second robot with a vacuum gripper is idle. This second robot could move the box, revealing the can and making the first robot’s job significantly easier. ...

10 min · 1935 words
[COMBO-Grasp: Learning Constraint-Based Manipulation for Bimanual Occluded Grasping 🔗](https://arxiv.org/abs/2502.08054)

Two Hands Are Better Than One: Mastering Occluded Grasping with COMBO-Grasp

Imagine a flat, thin computer keyboard lying on a desk. You want to pick it up. If you just try to grab it from the top, your fingers will likely hit the desk before you can get a solid grip. The desk “occludes” (blocks) the grasp. So, what do you do naturally? You probably use your non-dominant hand to tilt the keyboard up or brace it, while your dominant hand secures the grip. ...

2025-02 · 8 min · 1703 words
[SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation 🔗](https://openreview.net/pdf?id=xVDj9uq6K3)

Can Vision-Language Models Navigate a Crowd? Inside the SocialNav-SUB Benchmark

Imagine you are walking down a busy university hallway between classes. You see a group of students chatting on your left, a professor hurrying towards you on your right, and a janitor mopping the floor ahead. Without consciously thinking, you adjust your path. You weave slightly right to give the group space, you slow down to let the professor pass, and you avoid the wet floor. This dance of “social navigation” is second nature to humans. We effortlessly interpret intentions, social norms, and spatial dynamics. ...

10 min · 2029 words
[AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit 🔗](https://arxiv.org/abs/2502.09762)

Can Robots Collaborate with Strangers? Inside the AT-Drone Benchmark

Imagine a high-stakes search-and-rescue mission in a dense forest. A fleet of drones is scanning the ground below. Suddenly, one drone suffers a battery failure and must return to base. A backup drone is immediately deployed to take its place. In an ideal world, this swap is seamless. The new drone joins the formation, understands the current strategy, and collaborates perfectly with the existing team. But in reality, this is an immense challenge for robotics. Most multi-agent systems are trained to work with specific, pre-defined partners. They rely on “over-training” with a fixed team, developing a secret language of movements and reactions. When a stranger—an “unseen” teammate—enters the mix, the coordination often falls apart. ...

2025-02 · 10 min · 2025 words
[Robot Operating Home Appliances by Reading User Manuals 🔗](https://openreview.net/pdf?id=wZUQq0JaL6)

Can Robots Learn to Cook by Reading the Manual? Meet ApBot

Imagine unboxing a new, high-tech rice cooker. It has a dozen buttons, a digital display, and a distinct lack of intuitive design. You, a human, likely grab the user manual, flip to the “Cooking Brown Rice” section, and figure out the sequence of buttons to press. Now, imagine you want your home assistant robot to do this. For a robot, this is a nightmare scenario. Unlike a hammer or a cup, an appliance is a “state machine”—it has hidden internal modes, logic constraints (you can’t start cooking if the lid is open), and complex inputs. Historically, roboticists have had to hard-code these interactions for specific devices. But what if a robot could simply read the manual and figure it out, just like we do? ...

7 min · 1395 words
[KDPE: A Kernel Density Estimation Strategy for Diffusion Policy Trajectory Selection 🔗](https://arxiv.org/abs/2508.10511)

Taming the Stochastic Robot: Enhancing Diffusion Policies with Kernel Density Estimation

The integration of Generative AI into robotics has sparked a revolution. Specifically, Diffusion Policy (DP) has emerged as a state-of-the-art approach for “behavior cloning”—teaching robots to perform tasks by mimicking human demonstrations. Unlike older methods that try to average out human movements into a single mean trajectory, Diffusion Policy embraces the fact that humans solve tasks in many different ways. It models the distribution of possible actions, allowing a robot to handle multimodal behavior (e.g., grasping a cup from the left or the right). ...

2025-08 · 9 min · 1797 words
[ActLoc: Learning to Localize on the Move via Active Viewpoint Selection 🔗](https://arxiv.org/abs/2508.20981)

Don't Just Look, See: How ActLoc Teaches Robots Where to Look for Better Navigation

Imagine navigating a familiar room in the dark with a flashlight. To figure out where you are, you wouldn’t point your light at a blank patch of white wall; that wouldn’t tell you anything. Instead, you would instinctively shine the light toward distinct features—a door frame, a bookshelf, or a unique piece of furniture. By actively selecting where to look, you maximize your ability to localize yourself in the space. ...

2025-08 · 8 min · 1613 words
[BEHAVIOR ROBOT SUITE: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities 🔗](https://arxiv.org/abs/2503.05652)

Mastering the Home: How BRS Solves Whole-Body Control for Household Robots

We often dream of the “Rosie the Robot” future—a general-purpose helper that can tidy the living room, clean the bathroom, and organize the pantry. While we have seen incredible advances in robotic manipulation in lab settings, bringing these capabilities into a real-world home remains a formidable challenge. Why is this so hard? It turns out that a messy home requires more than just a good gripper. It requires a robot that can coordinate its entire body. To open a heavy door, a robot can’t just use its arm; it needs to lean in with its torso and drive its base simultaneously. To put a box on a high shelf, it needs to stretch up; to scrub a toilet, it needs to crouch down. ...

2025-03 · 8 min · 1655 words
[TopoCut: Learning Multi-Step Cutting with Spectral Rewards and Discrete Diffusion Policies 🔗](https://arxiv.org/abs/2509.19712)

TopoCut: How Robots Learn to Slice and Dice using Spectral Rewards and Discrete Diffusion

Imagine teaching a robot to prepare a meal. Asking it to pick up an apple is a solved problem in many labs. Asking it to slice that apple into perfectly even wedges, however, is a nightmare of physics and control. Why? Because cutting fundamentally changes the topology of an object. One object becomes two; a solid mesh splits into independent clusters. The physics are complex—materials deform, squish, and fracture. Furthermore, evaluating success is incredibly difficult. If a robot slices a potato and the piece rolls off the table, did it fail? Geometrically, the slice might be perfect, but a standard sensor looking for the slice at a specific xyz-coordinate would register a zero score. ...

2025-09 · 7 min · 1476 words
[SPIN: distilling Skill-RRT for long-horizon prehensile and non-prehensile manipulation 🔗](https://arxiv.org/abs/2502.18015)

From Slow Planning to Fast Reflexes: How SPIN Masters Complex Robot Manipulation

Imagine you are trying to pick up a playing card that is lying flat on a table. You can’t just grab it directly because your fingers—or a robot’s gripper—can’t get underneath it. What do you do? You instinctively slide the card to the edge of the table using one finger (non-prehensile manipulation), and once it overhangs the edge, you pinch it (prehensile manipulation). This sequence of actions feels trivial to humans, but for robots, it is an immense challenge. It requires Long-Horizon Prehensile and Non-Prehensile (PNP) Manipulation. The robot must reason about physics, contacts, and geometry over a long sequence of steps. It has to decide how to slide the object, where to stop, and how to reposition its hand to transition from a slide to a grasp without knocking the card off the table. ...

2025-02 · 9 min · 1820 words
[FFHFlow: Diverse and Uncertainty-Aware Dexterous Grasp Generation via Flow Variational Inference 🔗](https://arxiv.org/abs/2407.15161)

Mastering the Unknown: How FFHFlow Gives Robots Human-Like Dexterity and Self-Awareness

Imagine you are reaching into a dark cupboard to grab a coffee mug. You can’t see the handle, but your brain makes a reasonable guess about where it is. If you touch something unexpected, you adjust. You don’t just grab blindly; you have an internal sense of uncertainty. For robots, specifically those with multi-fingered (dexterous) hands, this scenario is a nightmare. Most robotic systems struggle to grasp objects when they can only see part of them (partial observation). They either stick to very safe, repetitive grasping motions that might fail in cluttered spaces, or they try to generate complex grasps but lack the speed to do so in real-time. ...

2024-07 · 10 min · 2007 words