[Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware 🔗](https://arxiv.org/abs/2505.09601)

Real2Render2Real: How to Train Robots Without Robots (or Physics Engines)

In the world of Artificial Intelligence, scale is everything. Large Language Models (LLMs) like GPT-4 and Vision-Language Models (VLMs) have achieved “generalist” capabilities primarily because they consumed massive, internet-scale datasets. Robotics, however, has been left behind in this data revolution. This is the data scarcity problem in robotics, often discussed alongside Moravec’s paradox: while we have billions of text tokens, we do not have billions of examples of robots folding laundry or making coffee. ...

2025-05 · 8 min · 1604 words
[Sampling-Based System Identification with Active Exploration for Legged Sim2Real Learning 🔗](https://openreview.net/pdf?id=UTPBM4dEUS)

Closing the Sim-to-Real Gap: How Active Exploration Makes Legged Robots Agile and Precise

If you have ever watched a robot fail spectacularly—falling over while trying to walk, or missing a jump by several inches—you have witnessed the “Sim-to-Real gap.” In a physics simulator, robots are perfect. They have infinite battery life, their motors respond instantly, and the ground is perfectly flat. In the real world, however, motors lag, friction varies, and mass distributions are rarely what the spec sheet says they are. ...

9 min · 1856 words
[Versatile Loco-Manipulation through Flexible Interlimb Coordination 🔗](https://arxiv.org/abs/2506.07876)

ReLIC: Teaching Quadruped Robots to Use Legs as Hands

Introduction Imagine you are carrying a large, heavy box through a doorway. To get through, you might need to use your hip to nudge the door open while balancing the box with your arms, or perhaps you use a foot to kick a doorstop out of the way. As humans, we perform this kind of “loco-manipulation”—coordinating locomotion and manipulation simultaneously—effortlessly. We treat our limbs as versatile tools; a leg is usually for walking, but it can momentarily become a manipulator if the task demands it. ...

2025-06 · 9 min · 1708 words
[KineSoft: Learning Proprioceptive Manipulation Policies with Soft Robot Hands 🔗](https://arxiv.org/abs/2503.01078)

Teaching Soft Robots to Feel: How KineSoft Revolutionizes Dexterous Manipulation

Introduction Imagine shaking hands with a robot. If it’s a standard industrial arm, you might be terrified of it crushing your fingers. Its rigid metal skeleton and high-torque motors are designed for precision, not comfort. Now, imagine shaking a hand made of silicone—soft, compliant, and yielding to your touch. This is the promise of soft robotics: machines that are inherently safe and adaptable to the chaotic real world. However, there is a catch. While soft robots are mechanically safer, they are notoriously difficult to control. A rigid robot has discrete joints with encoders that tell the computer where the arm is to within a fraction of a millimeter. A soft robot, on the other hand, is a continuum body—it can bend, twist, and deform in virtually infinite ways, giving it effectively infinite degrees of freedom. If you can’t measure exactly where the robot is (proprioception), how can you teach it to perform complex tasks like unscrewing a bottle or picking a blackberry? ...

2025-03 · 9 min · 1879 words
[Learning a Unified Policy for Position and Force Control in Legged Loco-Manipulation 🔗](https://arxiv.org/abs/2505.20829)

How Robots Learn to Feel: Unified Force and Position Control Without Sensors

Robotic manipulation often feels like a magic trick. We see videos of robots backflipping or picking up delicate objects, and we assume the problem is solved. But there is a massive difference between waving a robotic arm in empty space and interacting with the physical world. The former requires position control (moving from A to B), while the latter requires force control (interacting with resistance). Imagine opening a heavy door. You don’t just move your hand along a trajectory; you lean into it, applying force while maintaining your balance. If you treated the door like empty air, you would either fail to open it or fall over. This combination of movement and physical interaction is called loco-manipulation. ...

2025-05 · 8 min · 1547 words
[Omni-Perception: Omnidirectional Collision Avoidance for Legged Locomotion in Dynamic Environments 🔗](https://openreview.net/pdf?id=KUSYJIlKor)

Can Robots See Like We Do? Mastering Legged Locomotion with Raw LiDAR

Introduction Imagine trying to sprint through a crowded room, dodging furniture and people, while looking through a narrow paper towel tube. This is effectively how many state-of-the-art legged robots operate today. While we have seen incredible videos of robot dogs backflipping or hiking, much of that agility relies on “proprioception”—the robot’s internal sense of its joint positions and balance. They are essentially moving blindly, relying on their ability to recover from stumbles rather than avoiding obstacles in the first place. ...

8 min · 1648 words
[HuB: Learning Extreme Humanoid Balance 🔗](https://arxiv.org/abs/2505.07294)

How to Teach a Robot to Kick Like Bruce Lee: A Deep Dive into HuB

The human body is a marvel of mechanical engineering. Think about the act of standing on one leg while extending the other high into the air—a “Bruce Lee kick.” To you, this might feel like a simple (albeit physically demanding) exertion of muscle. To a roboticist, this is a nightmare of physics. It requires precise center-of-mass control, active stabilization against gravity, and the ability to handle the subtle jitter of muscles—or in a robot’s case, motors. ...

2025-05 · 8 min · 1692 words
[FACET: Force-Adaptive Control via Impedance Reference Tracking for Legged Robots 🔗](https://arxiv.org/abs/2505.06883)

Making Robots Soft: How FACET Teaches Legged Robots to Go with the Flow

Introduction Reinforcement Learning (RL) has revolutionized how legged robots move. We have seen quadrupedal robots traversing rough terrain, climbing stairs, and even performing parkour with superhuman agility. However, there is a lingering problem with these state-of-the-art controllers: they are often incredibly “stiff.” Most RL locomotion policies are trained to track a specific velocity command. If you push a robot running a velocity-tracking controller, it treats your push as a disturbance to be rejected immediately. It fights back to maintain its target speed. While this works for minor bumps, it fails catastrophically under large forces. The robot either stiffens up and slips, or it breaks itself (and potentially its surroundings) by colliding hard with obstacles. ...

2025-05 · 7 min · 1453 words
[SAIL: Faster-than-Demonstration Execution of Imitation Learning Policies 🔗](https://arxiv.org/abs/2506.11948)

Breaking the Speed Limit: How SAIL Enables Robots to Move Faster Than Their Teachers

Introduction In the world of robotics, Imitation Learning (IL) has become the go-to method for teaching robots complex manipulation skills. By observing a human perform a task—like folding a shirt or stacking cups—a robot can learn to replicate the behavior using algorithms like Behavior Cloning. It is an elegant, data-driven solution that has shown incredible promise. But there is a catch: Speed. Typically, an imitation learning policy is confined to the speed of the demonstration. If a human moves cautiously to ensure safety or precision during data collection, the robot learns to be equally sluggish. In a research lab, this is fine. In industrial automation or logistics, where throughput is king, “slow and steady” is a dealbreaker. ...

2025-06 · 9 min · 1857 words
[Divide, Discover, Deploy: Factorized Skill Learning with Symmetry and Style Priors 🔗](https://arxiv.org/abs/2508.19953)

Taming the Chaos: How Factorized Learning and Symmetry Make Unsupervised Robot Skills Safe and Deployable

Reinforcement Learning (RL) has gifted robotics with some incredible capabilities, from parkour-performing quadrupeds to drones that can beat human champions. However, there is a significant bottleneck in this pipeline: the reward function. Designing a reward function that explicitly tells a robot how to perform a complex task requires immense engineering effort. As tasks get harder, the math gets messier. Unsupervised Skill Discovery (USD) promises a way out. Ideally, USD allows an agent to play around in its environment and autonomously learn a diverse library of “skills”—like walking, rolling, or jumping—without any task-specific rewards. The problem? Robots trained this way often behave like toddlers on a sugar rush. Their movements, while diverse, are often erratic, unsafe, and impossible to control or deploy on real hardware. ...

2025-08 · 8 min · 1687 words
[DexSkin: High-Coverage Conformable Robotic Skin for Learning Contact-Rich Manipulation 🔗](https://arxiv.org/abs/2509.18830)

DexSkin: Giving Robots the Sensitivity of Human Skin for Complex Manipulation

Introduction Imagine trying to blindly fish a specific key out of your pocket or wrap a rubber band around a small box while wearing thick, rigid winter gloves. Even if your muscles know the movements, the lack of tactile feedback makes these tasks nearly impossible. You rely on the subtle pressure on the tips, sides, and backs of your fingers to know where the object is and how it is behaving. ...

2025-09 · 10 min · 2000 words
[Visual Imitation Enables Contextual Humanoid Control 🔗](https://arxiv.org/abs/2505.03729)

From YouTube to Reality: Teaching Robots to Climb Stairs by Watching Videos

Introduction Imagine trying to learn how to parkour just by reading a textbook on physics. You would have to calculate friction coefficients, angular momentum, and trajectory arcs in real-time. It sounds impossible, right? Instead, humans learn by watching. We observe someone climb a set of stairs or sit on a chair, we internalize that motion, and then we try to replicate it, adjusting our balance as we go. For years, roboticists have been trying to teach humanoid robots to navigate complex environments—like climbing stairs or traversing rough terrain—using the “physics textbook” approach. This usually involves hand-tuning complex reward functions or setting up expensive motion capture (MoCap) studios to record data. But what if a robot could learn like we do? What if it could just watch a video of a person walking up stairs and figure it out? ...

2025-05 · 8 min · 1701 words
[X-SIM: Cross-Embodiment Learning via Real-to-Sim-to-Real 🔗](https://arxiv.org/abs/2505.07096)

Robots Watching YouTube: How X-SIM Turns Human Videos into Robot Skills

Introduction: The Data Bottleneck in Robotics Imagine you want to teach a robot how to make a cup of coffee. The traditional way to do this is through imitation learning. You, the human expert, have to grab a controller or physically guide the robot arm through the motion dozens, perhaps hundreds of times. This process, known as teleoperation, provides the robot with exact pairs of “what the robot sees” (images) and “what the robot did” (motor actions). ...

2025-05 · 9 min · 1872 words
[ImMimic: Cross-Domain Imitation from Human Videos via Mapping and Interpolation 🔗](https://arxiv.org/abs/2509.10952)

Bridging the Gap: How ImMimic Teaches Robots Using Human Videos and MixUp Interpolation

Introduction One of the most persistent bottlenecks in robotics is the cost of data. To teach a robot a new skill—like cracking an egg or using a hammer—we typically need hundreds, if not thousands, of teleoperated demonstrations. This process is slow, expensive, and scales poorly. On the other hand, we have the internet. Platforms like YouTube are overflowing with millions of videos of humans performing exactly these kinds of manipulation tasks. In theory, this is a goldmine of training data. But in practice, a massive barrier stands in the way: the domain gap. ...

2025-09 · 9 min · 1743 words
[FetchBot: Learning Generalizable Object Fetching in Cluttered Scenes via Zero-Shot Sim2Real 🔗](https://arxiv.org/abs/2502.17894)

Solving the Cluttered Shelf: How FetchBot Masters Zero-Shot Sim-to-Real Fetching

Introduction: The Bull in the China Shop Problem Imagine asking a robot to grab a specific bottle of soda from a densely packed refrigerator. For a human, this is trivial. We instinctively know how to reach in, avoid knocking over the yogurt, and infer where the back of the bottle is even if we can’t see it. For a robot, however, this “cluttered scene” scenario is a nightmare. Densely packed objects create severe occlusions. Traditional depth sensors struggle with transparent or reflective surfaces (like glass bottles), often returning noisy data or “holes” in the vision. Furthermore, collecting real-world training data for every possible arrangement of objects is prohibitively expensive. Consequently, most robots today struggle to fetch objects safely without disrupting their environment. ...

2025-02 · 7 min · 1448 words
[ClutterDexGrasp: A Sim-to-Real System for General Dexterous Grasping in Cluttered Scenes 🔗](https://arxiv.org/abs/2506.14317)

Solving the Junk Drawer Problem: How ClutterDexGrasp Brings Dexterous Hands to the Real World

Introduction Imagine reaching into a messy “junk drawer” to find a specific battery buried under tangled cables, loose change, and old receipts. As a human, you do this effortlessly. You don’t just grab; you nudge obstacles aside, slide your fingers into gaps, and carefully extract the target without breaking anything. For robots, however, this is a nightmare. While robotic grasping has seen massive improvements recently, most success stories involve simple, two-fingered grippers picking up isolated objects on clean tables. Dexterous grasping—using multi-fingered hands that mimic human physiology—offers the versatility needed for the real world, but it introduces a massive spike in complexity. When you add a cluttered environment into the mix, with objects blocking the target and the risk of collision everywhere, the difficulty skyrockets. ...

2025-06 · 8 min · 1697 words
[Zero-Shot Text-to-Speech for Vietnamese 🔗](https://arxiv.org/abs/2506.01322)

PhoAudiobook: Bridging the Gap in Vietnamese Zero-Shot Text-to-Speech

Introduction In the rapidly evolving world of Generative AI, Text-to-Speech (TTS) has moved far beyond the robotic voices of the past. We have entered the era of Zero-Shot TTS. This technology allows a system to clone a speaker’s voice using only a few seconds of reference audio, without ever having been trained on that specific person’s voice before. While models like VALL-E and XTTS have revolutionized this space for English, low-resource languages often get left behind. ...

2025-06 · 7 min · 1487 words
[WinSpot: GUI Grounding Benchmark with Multimodal Large Language Models 🔗](https://aclanthology.org/2025.acl-short.85.pdf)

Taming the Desktop: How WinSpot Brings AI Agents to Windows

Imagine a digital assistant that doesn’t just chat with you but actually uses your computer. You tell it, “Open the settings and change my default browser to Edge,” and it navigates the menus, finds the right buttons, and clicks them—just like a human would. This is the promise of Graphical User Interface (GUI) automation. While we have seen rapid progress in AI agents that can browse the web or navigate mobile apps, the desktop environment—specifically Windows—remains a massive, largely unconquered frontier. ...

8 min · 1557 words
[WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging 🔗](https://arxiv.org/abs/2502.18316)

Why "None of the Above" is the Ultimate Test for LLMs: Introducing WiCkeD

Introduction In the rapidly evolving world of Large Language Models (LLMs), we have hit a peculiar wall: the students are becoming smarter than the tests. Benchmarks that were once considered difficult—covering everything from high school chemistry to professional law exams—are now being “saturated.” Models are scoring so high that it is becoming increasingly difficult to distinguish a good model from a great one. When a benchmark gets saturated, researchers usually have two options. The first is to build a brand new, harder dataset from scratch. This is expensive, time-consuming, and requires expert human annotation. The second option is to take existing benchmarks and try to make them harder. Recent attempts have involved adding more “distractors” (wrong answers) to questions to lower the odds of guessing correctly. However, generating plausible distractors that don’t accidentally confuse the right answer is a massive challenge in itself. ...

2025-02 · 9 min · 1852 words
[Using Subtext to Enhance Generative IDRR 🔗](https://aclanthology.org/2025.acl-short.35.pdf)

Reading Between the Lines: How Subtext Enhances Implicit Discourse Relation Recognition in LLMs

When we communicate, we rarely say exactly what we mean. We rely on the listener to fill in the gaps. If someone says, “The new rate is payable Feb. 15,” and follows it with, “A record date hasn’t been set,” a human immediately understands the connection. There is a conflict here: the payment date is set, but the necessary record date isn’t. We infer a relationship of Concession (e.g., “however”). ...

8 min · 1589 words