[Reactive In-Air Clothing Manipulation with Confidence-Aware Dense Correspondence and Visuotactile Affordance 🔗](https://arxiv.org/abs/2509.03889)

How Robots Learned to Fold Laundry by Admitting Confusion

Introduction If you have ever watched a robot try to fold a t-shirt, you might have noticed a stark difference between its strategy and yours. A typical robotic approach involves painstakingly smoothing the garment flat on a table, ironing out every wrinkle with computer vision algorithms, and then executing a pre-calculated fold. It is slow, rigid, and requires a lot of table space. You, on the other hand, probably pick the shirt up, give it a shake, and fold it in mid-air. If you grab it by the wrong end, you simply rotate it until you find the collar. You rely on the feel of the fabric and a general understanding of where the sleeves should be, even if the shirt is crumpled. ...

2025-09 · 10 min · 1960 words
[Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation 🔗](https://arxiv.org/abs/2506.14754)

Feeling the World: How Sparsh-X Gives Robots a Multisensory Sense of Touch

Introduction Imagine you are reaching into a dark bag to find a specific set of keys. You can’t see them, but your fingers instantly provide a flood of data. You feel the cold metal (temperature), the jagged edges (texture), the weight as you lift them (proprioception), and perhaps you hear the “clink” as they hit other objects (audio). Human dexterity relies on this symphony of signals. We don’t just “touch” by sensing skin deformation; we integrate vibrations, thermal cues, motion, and pressure into a cohesive understanding of the physical world. ...

2025-06 · 9 min · 1747 words
["Stack It Up!": 3D Stable Structure Generation from 2D Hand-drawn Sketch 🔗](https://arxiv.org/abs/2508.02093)

From Napkin Sketches to Robot Actions: How 'StackItUp' Turns 2D Drawings into Stable 3D Structures

Introduction: The Gap Between Imagination and Execution Imagine a child drawing a wobbly sketch of the Eiffel Tower and asking a robot, “Build this!” To a human, the request is obvious. We see the drawing, understand the structural intent—a wide base, a tapering tower, a spire on top—and we instinctively know how to arrange blocks to replicate it. To a robot, however, this is a nightmare. Current robotic manipulation systems are incredibly literal. They typically require precise 3D goal specifications: exact coordinates (\(x, y, z\)), orientation quaternions, and CAD models. These specifications usually come from complex design software, not a messy scribble on a piece of paper. There is a massive “modality gap” between the intuitive, noisy 2D way humans communicate ideas and the precise, physically grounded 3D data robots need to function. ...

2025-08 · 8 min · 1496 words
[One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation 🔗](https://arxiv.org/abs/2509.07978)

OnePoseViaGen: Solving One-Shot 6D Pose Estimation with Generative AI

Introduction Imagine a robot operating in a household or a flexible factory line. To interact with the world—to pick up a mug, insert a plug, or organize a shelf—the robot needs to know exactly where objects are. It’s not enough to simply draw a 2D box around an object on a screen; the robot needs the object’s 6D pose: its precise 3D position (\(x, y, z\)) and orientation (pitch, yaw, roll). ...
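
As a concrete illustration of what that 6D pose amounts to, here is a minimal Python sketch: three coordinates for position plus an orientation stored as a unit quaternion built from roll, pitch, and yaw. The class and field names are illustrative assumptions, not code from the paper.

```python
# A minimal sketch of a 6D pose: 3 numbers for position plus an orientation
# (here a unit quaternion built from roll/pitch/yaw). Names are illustrative.
import math
from dataclasses import dataclass

@dataclass
class Pose6D:
    x: float
    y: float
    z: float
    qw: float = 1.0
    qx: float = 0.0
    qy: float = 0.0
    qz: float = 0.0

    @classmethod
    def from_euler(cls, x, y, z, roll, pitch, yaw):
        """Build a pose from a position and roll/pitch/yaw angles (radians)."""
        cr, sr = math.cos(roll / 2), math.sin(roll / 2)
        cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
        cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)
        return cls(
            x, y, z,
            qw=cr * cp * cy + sr * sp * sy,
            qx=sr * cp * cy - cr * sp * sy,
            qy=cr * sp * cy + sr * cp * sy,
            qz=cr * cp * sy - sr * sp * cy,
        )

# A grasp or insertion planner consumes exactly this: where the object is and how it is rotated.
mug_pose = Pose6D.from_euler(0.42, -0.10, 0.03, roll=0.0, pitch=0.0, yaw=math.pi / 4)
```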

2025-09 · 8 min · 1623 words
[AirExo-2: Scaling up Generalizable Robotic Imitation Learning with Low-Cost Exoskeletons 🔗](https://arxiv.org/abs/2503.03081)

Robots Without Robots: Scaling Imitation Learning with AirExo-2 and RISE-2

Introduction In the world of Artificial Intelligence, we have witnessed a massive explosion in capabilities driven by data. Large Language Models (LLMs) like GPT-4 thrive because they ingest trillions of tokens of text from the internet. However, robotics faces a stubborn bottleneck: the physical world. Unlike text or images, high-quality data for robotic manipulation—teaching a robot how to fold laundry, cook a steak, or assemble a toy—is incredibly scarce. The gold standard for collecting this data is teleoperation. This involves a human expert controlling a physical robot arm to perform a task. While this produces perfect “robot-domain” data (exact joint angles and camera views), it is prohibitively expensive. You need the robot hardware, the safety infrastructure, and the time to operate it slowly. On the other end of the spectrum, we have in-the-wild demonstrations—videos of humans doing tasks with their own hands. This data is abundant and cheap, but it suffers from a massive “domain gap.” A human hand does not look or move like a two-finger robotic gripper. ...

2025-03 · 10 min · 2023 words
[Non-conflicting Energy Minimization in Reinforcement Learning based Robot Control 🔗](https://arxiv.org/abs/2509.01765)

How to Teach Robots to Be Lazy (Efficiently): Introducing PEGrad

Introduction: The Battery Bottleneck Imagine buying a state-of-the-art quadruped robot. It’s agile, intelligent, and capable of traversing rough terrain. You deploy it for a search and rescue mission or a routine inspection, and 60 minutes later, it shuts down. The battery is dead. This is the reality for many untethered robots today. For instance, the Unitree Go2 typically operates for only 1 to 4 hours on a single charge. If we want robots to be truly useful in the real world, we cannot just focus on what they do (task performance); we must focus on how they do it (energy efficiency). ...

2025-09 · 10 min · 1941 words
[Streaming Flow Policy: Simplifying diffusion/flow-matching policies by treating action trajectories as flow trajectories 🔗](https://arxiv.org/abs/2505.21851)

Streaming Flow Policy: How to Make Robot Reflexes Faster by Streaming Actions

In the rapidly evolving world of robotic imitation learning, we are constantly trying to bridge the gap between how a robot “thinks” (computation) and how it “acts” (execution). For the past few years, the gold standard in this field has been dominated by Diffusion Policies. These models are incredibly powerful; they can learn complex, multi-modal behaviors—like knowing that to get around an obstacle, you can go left or right, but not straight through the middle. However, they come with a significant cost: latency. ...

2025-05 · 9 min · 1744 words
[Steering Your Diffusion Policy with Latent Space Reinforcement Learning 🔗](https://arxiv.org/abs/2506.15799)

Don't Retrain, Just Steer: How DSRL Adapts Diffusion Robots via Latent Space

In the rapidly evolving world of robot learning, Behavioral Cloning (BC) has emerged as the dominant paradigm. By collecting demonstrations from humans (teleoperation) and training neural networks to mimic those actions, we have enabled robots to perform impressive manipulation tasks. Recently, Diffusion Models—the same tech behind DALL-E and Stable Diffusion—have taken over robotics, providing policies that can model complex, multi-modal action distributions with high precision. ...

2025-06 · 9 min · 1805 words
[Real-Time Out-of-Distribution Failure Prevention via Multi-Modal Reasoning 🔗](https://arxiv.org/abs/2505.10547)

When Robots Panic: Bridging Foundation Models and Real-Time Safety Control

Introduction Imagine a delivery drone navigating a busy city. It has been trained on thousands of hours of flight data—blue skies, clear landing pads, and standard obstacles. But today, the world throws a curveball. As the drone approaches its destination, it encounters a construction site that wasn’t there yesterday. There is a crane moving unpredictably, a person balancing on a ladder, and yellow caution tape fluttering in the wind. To a human, the danger is obvious: “Don’t fly near the person on the ladder.” But to a classical robotic control system, these are just undefined obstacles or, worse, confusing sensor noise. This is the Out-of-Distribution (OOD) problem. The robot is in a situation it wasn’t explicitly trained for, and its standard operating procedures might lead to a catastrophic failure. ...

2025-05 · 10 min · 1932 words
[ScrewSplat: An End-to-End Method for Articulated Object Recognition 🔗](https://arxiv.org/abs/2508.02146)

ScrewSplat: Teaching Robots How Things Move Using RGB Images and Gaussian Splatting

Introduction Imagine a robot walking into a kitchen. To be useful, it can’t just look at the scene; it needs to interact with it. It needs to know that the fridge door swings open (a revolute joint), the drawer slides out (a prismatic joint), and the laptop on the table flips up. This is the challenge of articulated object recognition. Traditionally, teaching robots to understand these movable parts has been cumbersome. Previous methods often relied on depth cameras (which struggle with transparent or shiny surfaces like glass cabinet doors), required humans to manually specify how many joints an object has, or used complex, multi-stage processing pipelines that could fail at any step. ...
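
To make the two joint types concrete, here is a minimal Python sketch of how a single axis in space can describe both cases: a revolute joint rotates about the axis (the fridge door) and a prismatic joint slides along it (the drawer), which is the intuition behind screw-based parameterizations. The function names and interfaces are illustrative assumptions, not ScrewSplat's actual implementation.

```python
# Illustrative sketch: one spatial axis covers both joint types the excerpt mentions.
import numpy as np

def revolute_transform(axis_dir, axis_point, angle):
    """4x4 transform for rotating by `angle` (rad) about a line in space."""
    k = np.asarray(axis_dir, dtype=float)
    k /= np.linalg.norm(k)
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    R = np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)  # Rodrigues' formula
    p = np.asarray(axis_point, dtype=float)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p - R @ p  # rotate about the line passing through `axis_point`
    return T

def prismatic_transform(axis_dir, displacement):
    """4x4 transform for sliding `displacement` metres along a direction."""
    k = np.asarray(axis_dir, dtype=float)
    k /= np.linalg.norm(k)
    T = np.eye(4)
    T[:3, 3] = displacement * k
    return T

# Fridge door: hinge along z through its left edge, opened 60 degrees.
door_T = revolute_transform([0, 0, 1], [0.3, 0.0, 0.0], np.deg2rad(60))
# Drawer: pulled out 15 cm along x.
drawer_T = prismatic_transform([1, 0, 0], 0.15)
```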

2025-08 · 8 min · 1532 words
[RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies 🔗](https://arxiv.org/abs/2506.18123)

RoboArena: Solving the Benchmarking Crisis for Generalist Robots

Introduction In the last few years, the field of robotics has witnessed a paradigm shift. We are moving away from specialized bots designed to do one thing perfectly (like welding a car door) toward generalist robot policies—AI brains capable of performing a wide range of tasks across diverse environments. These models, often trained on massive datasets like Open X-Embodiment or DROID, are the physical cousins of Large Language Models (LLMs). They can pick up fruits, fold laundry, or open drawers, often in scenes they have never encountered before. ...

2025-06 · 9 min · 1871 words
[Training Strategies for Efficient Embodied Reasoning 🔗](https://arxiv.org/abs/2505.08243)

Smarter Robots, Faster Speeds - Decoding Why Chain-of-Thought Reasoning Works in VLAs

Introduction In the quest to build general-purpose robots, we often look to the success of Large Language Models (LLMs). If an AI can plan a vacation or debug code by “thinking through” the problem step-by-step, shouldn’t a robot be able to plan how to tidy a kitchen using the same mechanism? This concept is known as Embodied Chain-of-Thought (CoT) reasoning. By training Vision-Language-Action (VLA) models to predict intermediate reasoning steps—like “identify the apple,” “plan to move the arm,” and “calculate gripper width”—before outputting a final movement command, researchers have achieved impressive gains in generalization. Robots trained this way are smarter; they handle new objects and instructions much better than those that just map pixels directly to actions. ...
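
As a rough picture of what an embodied chain-of-thought training target can look like, here is a minimal Python sketch in which the model is supervised to emit intermediate reasoning text before the low-level action vector. The field names, reasoning tags, and action encoding are illustrative assumptions, not the paper's exact format.

```python
# Illustrative sketch of a CoT-style training target: reasoning text first, action last.
example_target = {
    "task": "put the apple in the bowl",
    "reasoning": [
        "TASK: move the apple into the bowl",
        "OBJECT: apple at image coords (212, 148)",
        "PLAN: reach above apple, close gripper, move over bowl, open gripper",
        "GRIPPER: target width 0.04 m",
    ],
    "action": [0.31, -0.05, 0.12, 0.0, 0.0, 1.57, 0.04],  # e.g. xyz, rpy, gripper width
}

def to_training_string(sample: dict) -> str:
    """Flatten reasoning steps plus the action into one autoregressive target string."""
    reasoning = " ".join(sample["reasoning"])
    action = " ".join(f"{a:.3f}" for a in sample["action"])
    return f"{reasoning} ACTION: {action}"

print(to_training_string(example_target))
```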

2025-05 · 8 min · 1604 words
[Latent Theory of Mind: A Decentralized Diffusion Architecture for Cooperative Manipulation 🔗](https://arxiv.org/abs/2505.09144)

Robots Reading Minds? How Latent Theory of Mind Enables Decentralized Collaboration

Imagine two people trying to move a large, heavy couch up a winding staircase. One is at the top, pulling; the other is at the bottom, pushing. They can’t see each other’s faces, and the noise of the city drowns out their voices. Yet, they manage to pivot, lift, and tilt the couch in perfect unison. How? They rely on implicit cues—the tension in the couch, the speed of movement, and an internal mental model of what the other person is likely doing. In psychology, the ability to attribute mental states—beliefs, intents, desires—to oneself and others is known as Theory of Mind (ToM). ...

2025-05 · 9 min · 1807 words
[Fabrica: Dual-Arm Assembly of General Multi-Part Objects via Integrated Planning and Learning 🔗](https://arxiv.org/abs/2506.05168)

Fabrica: How Robots Learn to Assemble Complex Objects from Scratch

Introduction If you have ever struggled to assemble a piece of flat-pack furniture, you know that assembly is more than just putting peg A into hole B. It involves a complex choreography: holding one piece steady with one hand, aligning another piece with the other, applying just the right amount of force, and doing it all in a specific order so the whole thing doesn’t collapse. For humans, this is intuitive. For robots, it is an algorithmic nightmare. ...

2025-06 · 9 min · 1754 words
[The Sound of Simulation: Learning Multimodal Sim-to-Real Robot Policies with Generative Audio 🔗](https://arxiv.org/abs/2507.02864)

Can AI Hear Simulation? How Generative Audio is Solving the Blind Spots in Robotics

Introduction Imagine you are standing in a kitchen, holding a pitcher of water. You are pouring it into a cup. Now, imagine you close your eyes. Can you still fill the cup without spilling? Most likely, yes. As the water level rises, the pitch of the sound changes—the “glug-glug” becomes higher and faster as the resonant frequency of the remaining air space shifts. This is a prime example of multimodal perception. Humans don’t just see the world; we hear, touch, and feel it. We integrate these senses to perform complex tasks effortlessly. ...

2025-07 · 8 min · 1653 words
[DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation 🔗](https://arxiv.org/abs/2505.21864)

Your Hands Are the Best Controllers: How DexUMI Teaches Robots Dexterous Skills

Introduction Imagine trying to teach a robot to pour a cup of tea. For a human, this is trivial; we intuitively understand how much pressure to apply to the handle, how to rotate our wrist, and how to adjust if the pot feels heavy. For a robot, however, this requires complex coordination of vision, tactile sensing, and motor control. The “holy grail” of robotics is dexterous manipulation—giving robots the ability to handle objects with the same versatility as human hands. But there is a major bottleneck: data. To train a robot using Imitation Learning (teaching by demonstration), we need thousands of examples of the task being performed successfully. ...

2025-05 · 8 min · 1629 words
[ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations 🔗](https://arxiv.org/abs/2505.10911)

How to Train Your Robot: Teaching Unseen Tasks with Language and 'Rewound' Videos

Introduction Imagine trying to learn a new skill, like playing a specific song on the piano. A great teacher doesn’t just wait until you’ve finished the piece to tell you “Pass” or “Fail.” Instead, they provide continuous feedback while you play: “That’s the right chord,” “You’re slowing down too much here,” or “You missed that note, try again.” In the world of robotics, this kind of dense, informative feedback is crucial. Typically, we teach robots using Imitation Learning (showing them exactly what to do thousands of times) or Reinforcement Learning (RL) (giving them a reward signal when they succeed). However, both have a major bottleneck: scaling. ...
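
To make the sparse-versus-dense distinction concrete, here is a minimal Python sketch: a sparse reward only fires on final success, while a dense reward scores progress at every step via some learned, language-conditioned progress model. The `progress_model` interface is an assumption for illustration, not ReWiND's actual API.

```python
# Illustrative contrast between sparse (pass/fail at the end) and dense (per-step) rewards.
from typing import Callable, Sequence

def sparse_reward(success: bool, step: int, horizon: int) -> float:
    """1.0 only on the final step of a successful episode, 0.0 everywhere else."""
    return 1.0 if success and step == horizon - 1 else 0.0

def dense_reward(frames: Sequence, instruction: str,
                 progress_model: Callable[[Sequence, str], float]) -> list[float]:
    """Per-step reward = increase in predicted task progress toward the instruction."""
    progress = [progress_model(frames[: t + 1], instruction)
                for t in range(len(frames))]
    return [p - q for p, q in zip(progress, [0.0] + progress[:-1])]
```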

2025-05 · 9 min · 1852 words
[Planning from Point Clouds over Continuous Actions for Multi-object Rearrangement 🔗](https://arxiv.org/abs/2509.04645)

SPOT: How Robots Can Plan Long-Horizon Tasks Directly from Point Clouds

Imagine you have just finished a dinner party. The table is cluttered with stacked bowls, plates with cups resting on them, and scattered cutlery. Your task is to “bus” the table—move everything into a neat stack to be carried to the kitchen. To you, this is trivial. To a robot, it is a nightmare. This is a long-horizon manipulation task. It requires reasoning about physics (you can’t stack a plate on a cup), geometry (do I have space to set this down?), and sequencing (I need to move the cup before I can move the plate beneath it). ...

2025-09 · 8 min · 1636 words
[SAVOR: Skill Affordance Learning from Visuo-Haptic Perception for Robot-Assisted Bite Acquisition 🔗](https://arxiv.org/abs/2506.02353)

Can Robots Feel the Crunch? How SAVOR Teaches Robots to Eat Like Humans

Introduction Eating is one of the most fundamental human activities, an act so intuitive that we rarely give it a second thought. When you sit down to a meal, you don’t calculate the Newton-meters of force required to pierce a piece of broccoli versus a cherry tomato. You don’t consciously analyze the viscosity of mashed potatoes before deciding whether to scoop or skewer them. Yet, for the millions of people living with mobility limitations, the inability to feed themselves is a significant barrier to independence and dignity. ...

2025-06 · 10 min · 2074 words
[LocoFormer: Generalist Locomotion via Long-context Adaptation 🔗](https://arxiv.org/abs/2509.23745)

How One Brain Can Control Any Robot: Inside LocoFormer

Introduction In the biological world, adaptation is survival. A newborn calf learns to walk minutes after birth. A dog with an injured leg instinctively shifts its weight to a three-legged gait to keep moving. Humans can walk on sand, ice, or stilts, adjusting their motor control in real-time based on sensory feedback. In the world of robotics, however, this level of flexibility has historically been a pipe dream. Traditional locomotion controllers are brittle “specialists.” A controller tuned for a quadruped (four-legged robot) will fail instantly if deployed on a biped (two-legged robot). Even worse, if a robot’s motor burns out or its limb is damaged, the pre-programmed control policy usually fails catastrophically because the robot’s physical reality no longer matches its internal model. ...

2025-09 · 10 min · 2044 words