Introduction

Imagine you are trying to plug a charging cable into a port behind your desk. You can’t really see the port, or perhaps your hand is blocking the view. How do you do it? You rely on touch. You feel around for the edges, align the connector, and gently wiggle it until you feel it slide into place.

This interplay between vision (locating the general area) and touch (performing the precise insertion) is second nature to humans. However, for robots, reproducing this “bimanual assembly” capability is an immense challenge. While computer vision has advanced rapidly, giving robots the ability to “feel” and react to physical contact—especially with two hands simultaneously—remains a frontier in robotics research.

The primary hurdle is data. Training a robot to perform precise assembly tasks usually requires Imitation Learning (teaching the robot by showing it examples). But collecting thousands of real-world demonstrations where a human carefully wiggles parts together is expensive and time-consuming. Furthermore, standard robot learning often ignores tactile data because it is notoriously difficult to simulate. If you can’t simulate touch accurately, you can’t train in simulation and transfer the result to hardware (Sim-to-Real transfer), and you are stuck collecting data in the real world forever.

In this post, we are diving deep into VT-Refine, a new framework presented at CoRL 2025. This research proposes a robust “Real-to-Sim-to-Real” pipeline that combines the best of both worlds: the realism of human demonstrations and the scale of simulation-based Reinforcement Learning (RL).

Figure 1: Overview of the VT-Refine framework. The pipeline starts with real-world demos, moves to simulation for RL fine-tuning, and transfers back to reality.

As illustrated in Figure 1, the authors have developed a system where a robot learns to see and feel, refining its skills in a digital twin before deploying those skills to the real world with impressive precision.

The Challenge of Bimanual Assembly

Bimanual manipulation, where two arms work together on the same objects, adds a layer of complexity over standard single-arm tasks: the arms must coordinate with each other. In assembly tasks (like inserting a plug into a socket or screwing a nut onto a bolt), the margin for error is often less than a millimeter.

Vision alone is rarely enough. When a robot hand approaches an object, the hand itself occludes the camera’s view. This is where tactile feedback becomes non-negotiable.

The researchers identified two main bottlenecks in current robotic learning:

  1. Data Scarcity: Collecting real-world data for contact-rich tasks is costly. Furthermore, human demonstrations are often “suboptimal.” A human might insert a part perfectly on the first try, but that doesn’t teach the robot how to recover if it gets stuck.
  2. The Tactile Sim-to-Real Gap: Simulation is great for scaling up training, but simulating the physics of soft, squishy tactile sensors is computationally heavy and often inaccurate. If the simulated touch doesn’t match the real touch, the policy fails when transferred to the real robot.

VT-Refine addresses these by creating a seamless loop that starts in the real world, masters the task in a high-fidelity tactile simulation, and returns to the real world.

The Hardware: Designing for Simulation

One of the cleverest decisions in this research was the choice of tactile sensor. Many modern researchers use optical tactile sensors (like GelSight), which use internal cameras to capture high-resolution images of the contact surface. While these provide amazing detail, they are incredibly difficult to simulate accurately. The “Sim-to-Real” gap for optical sensors is massive.

The authors of VT-Refine took a different approach. They designed a custom piezoresistive sensor called FlexiTac.

Figure 2: The FlexiTac sensor setup. (a) Real-world hardware with sensors on grippers. (b) The simulation model using spring-dampers.

Why Piezoresistive?

As shown in Figure 2, the FlexiTac sensor consists of a grid of sensing units (taxels) capable of measuring normal force (pressure).

  • Real World: It uses a force-sensitive film sandwiched between flexible circuits. It has a resolution of roughly 2mm, which is coarse compared to a camera but sufficient for detecting contact patterns.
  • Simulation: Because the sensor measures normal force, it can be simulated using a Spring-Damper Model (Kelvin-Voigt model).

This design choice is strategic. Instead of trying to simulate complex light refraction (as optical sensors require), the simulator only needs to calculate how much a point on the sensor is being “squished” (the penetration depth) and apply a simple formula to turn that into a force value. This computation is fast, GPU-parallelizable, and, most importantly, keeps the gap between simulation and reality narrow.
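To make the taxel idea concrete, here is a minimal sketch of how such a sensor pad could be represented in software: a regular grid of 3D taxel positions at roughly 2 mm pitch. The 16×16 grid size is an assumption for illustration; the post only tells us the spacing, not FlexiTac’s exact taxel count.

```python
import numpy as np

def make_taxel_grid(rows=16, cols=16, pitch=0.002):
    """Local 3D positions of a flat grid of taxels (tactile pixels) on a sensor pad.
    `pitch` is the taxel spacing in meters (~2 mm for FlexiTac); rows/cols are assumed."""
    xs = (np.arange(cols) - (cols - 1) / 2.0) * pitch   # centered x coordinates
    ys = (np.arange(rows) - (rows - 1) / 2.0) * pitch   # centered y coordinates
    xx, yy = np.meshgrid(xs, ys)
    zz = np.zeros_like(xx)                              # pad surface plane (z = 0 in pad frame)
    return np.stack([xx, yy, zz], axis=-1).reshape(-1, 3)

pad = make_taxel_grid()
print(pad.shape)   # (256, 3): one 3D position per taxel, 2 mm apart
```

Each taxel reports a single pressure value, so a full reading is a small 2D array of forces rather than a high-resolution image.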

The Simulation Engine: TacSL

To make the “Sim” part of “Real-to-Sim-to-Real” work, the environment needs to be a Digital Twin of the real setup. The researchers utilized TacSL, a library built on top of Isaac Gym, which allows for massive parallelization on GPUs.
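For context, a bare-bones Isaac Gym setup with many parallel environments looks roughly like the sketch below. This is a generic Isaac Gym snippet, not TacSL’s actual API; TacSL layers the tactile model on top of a simulation created in essentially this way, and the environment count and spacing here are arbitrary.

```python
from isaacgym import gymapi

gym = gymapi.acquire_gym()

# GPU-side physics so thousands of environments step together without CPU round-trips.
sim_params = gymapi.SimParams()
sim_params.dt = 1.0 / 60.0
sim_params.use_gpu_pipeline = True
sim_params.physx.use_gpu = True
sim = gym.create_sim(0, 0, gymapi.SIM_PHYSX, sim_params)

# Thousands of identical assembly scenes laid out in a grid and simulated in parallel.
num_envs = 4096
spacing = 1.0
lower = gymapi.Vec3(-spacing, -spacing, 0.0)
upper = gymapi.Vec3(spacing, spacing, spacing)
envs = [gym.create_env(sim, lower, upper, int(num_envs ** 0.5)) for _ in range(num_envs)]
```

Robot and object assets would then be added to each environment before stepping the simulation.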

Figure 9: The Tactile-Simulation Pipeline. It breaks down how a finger with a sensor pad (a) is modeled as a grid of taxels (b) using a spring-damper physics model (c).

The simulation process, detailed in Figure 9, works as follows:

  1. Modeling: The sensor pad is modeled as a grid of “taxels” (tactile pixels).
  2. Collision: When the robot touches an object, the simulator calculates the “interpenetration depth” (\(d\))—essentially, how far the object has pushed into the soft sensor pad.
  3. Physics Calculation: It uses the spring-damper equation to convert that depth into a force signal: \[\mathbf{f}_n = -(k_n d + k_d \dot{d})\,\mathbf{n}\] where \(k_n\) is the stiffness (spring constant) and \(k_d\) is the damping (viscosity) coefficient. A batched version of this computation is sketched just after this list.
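To see why this is cheap and GPU-friendly, note that the per-taxel force reduces to a couple of element-wise tensor operations that can be evaluated for every taxel in every parallel environment in a single call. The sketch below computes the force magnitude along the contact normal; the gains and tensor shapes are illustrative placeholders, not the paper’s calibrated values.

```python
import torch

def taxel_normal_force(depth, depth_rate, k_n=4000.0, k_d=10.0):
    """Kelvin-Voigt (spring-damper) normal force per taxel.
    `depth` is the interpenetration depth d, `depth_rate` its time derivative;
    the result is the force magnitude along the contact normal n."""
    f = k_n * depth + k_d * depth_rate
    # No contact means no force, and the sensor never pulls the object in.
    return torch.clamp(f, min=0.0) * (depth > 0).float()

# Example: 4096 parallel environments, 256 taxels per pad, one batched call.
d = torch.rand(4096, 256) * 1e-3           # interpenetration depths (m)
d_dot = torch.randn(4096, 256) * 1e-3      # depth rates (m/s)
print(taxel_normal_force(d, d_dot).shape)  # torch.Size([4096, 256])
```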

By tuning these \(k\) values, the researchers can make the simulated sensor closely match the behavior of the real FlexiTac sensor. This fidelity is what allows a policy refined in simulation to keep working once it is deployed in the real world.

The VT-Refine Pipeline

The core methodology consists of two distinct stages: Real-World Pre-Training and Simulation Fine-Tuning.

Figure 3: The Two-Stage Training Process. Stage 1 uses real data for pre-training. Stage 2 uses simulation for fine-tuning via Reinforcement Learning.

Stage 1: Real-World Pre-Training

The process begins with a human. An operator uses a teleoperation rig to perform the assembly task about 30 times. This is a tiny dataset by deep learning standards, but the goal here isn’t perfection; it’s initialization.

The robot records:

  • Visual Data: Point clouds from an ego-centric camera.
  • Tactile Data: Pressure readings from the fingertips.
  • Proprioception: The position of its own joints.

These inputs are fed into a Diffusion Policy. Diffusion models (the same technology behind image generators like DALL-E) are excellent at modeling multimodal distributions, which matters here because there is often more than one valid way to execute a motion. They help the robot learn the general “flow” of the movement.
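For intuition, generating an action with a diffusion policy amounts to iteratively denoising a random action sequence, conditioned on the encoded observations. The sketch below is a generic DDPM-style sampling loop with a hypothetical noise-prediction network interface; the 16-step action horizon, 14 action dimensions (two 7-DoF arms), and the noise schedule are assumptions, not the paper’s exact architecture.

```python
import torch

@torch.no_grad()
def sample_actions(noise_pred_net, obs_emb, horizon=16, action_dim=14,
                   num_steps=50, device="cpu"):
    """DDPM-style reverse diffusion: start from Gaussian noise and iteratively
    denoise it into an action chunk, conditioned on an observation embedding."""
    betas = torch.linspace(1e-4, 0.02, num_steps, device=device)  # simple linear schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    a = torch.randn(1, horizon, action_dim, device=device)        # start from pure noise
    for t in reversed(range(num_steps)):
        eps = noise_pred_net(a, obs_emb, t)                       # predicted noise at step t
        mean = (a - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(a) if t > 0 else torch.zeros_like(a)
        a = mean + torch.sqrt(betas[t]) * noise                   # ancestral sampling step
    return a

# Toy usage with a stand-in network that always predicts zero noise:
dummy_net = lambda a, obs, t: torch.zeros_like(a)
print(sample_actions(dummy_net, obs_emb=None).shape)   # torch.Size([1, 16, 14])
```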

However, with only 30 demos, the robot is merely “okay” at the task. It can pick up the parts and bring them close together, but it often fails the precise insertion because it hasn’t seen enough failure cases to know how to correct itself.

Stage 2: Simulation Fine-Tuning

This is where VT-Refine shines. The pre-trained policy is transferred into the simulation, where the robot can practice in thousands of parallel environments, accumulating experience far faster than real-world data collection allows.

The researchers use Reinforcement Learning (RL), specifically a method called Diffusion Policy Policy Optimization (DPPO).

In the simulation:

  1. The robot tries to assemble the parts.
  2. If it succeeds, it gets a reward (1). If it fails, it gets nothing (0).
  3. Because it already knows the basics (from Stage 1), it doesn’t flail around randomly; its exploration starts close to successful behavior.
  4. Through RL, it learns to refine its movements. It discovers that specific tactile signals (feeling a collision on the left side of the finger) should lead to specific adjustments (wiggling to the right).

This fine-tuning stage injects the “exploratory” behaviors—the wiggles, adjustments, and force corrections—that were missing from the limited human demonstrations.
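A toy version of the sparse, binary success signal might look like the following. The 1 mm tolerance and the simple distance check are assumptions for illustration (the paper’s success criterion and its DPPO update are more involved), but they convey how reward is computed for thousands of simulated attempts at once.

```python
import torch

def sparse_success_reward(plug_pos, socket_pos, tol=0.001):
    """Binary success reward: 1.0 if the plug center sits within `tol` meters of the
    socket center (a simplified stand-in for a full assembly check), else 0.0."""
    dist = torch.linalg.norm(plug_pos - socket_pos, dim=-1)
    return (dist < tol).float()

# Example: evaluate 4096 parallel simulated environments in one call.
plug = torch.rand(4096, 3) * 0.002      # hypothetical final plug positions (m)
socket = torch.zeros(4096, 3)           # hypothetical socket positions (m)
print(sparse_success_reward(plug, socket).mean())   # fraction of successful envs
```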

Unified Representation: Point Clouds

To ensure the robot doesn’t get confused when switching between Real and Sim, the inputs are converted into a unified format: Visuo-Tactile Point Clouds.

  • Visual Points: Derived from the depth camera.
  • Tactile Points: The 3D positions of the taxels on the fingers.

By treating tactile data as geometric points in 3D space (just like visual data), the neural network learns spatial relationships. For example, it learns that “points on the fingertip (touch) are colliding with points on the object (vision).” This representation is highly robust to visual noise and lighting changes.
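A minimal sketch of this unification could concatenate the two point sets and tag each point with a pressure value and a modality flag. The 5-channel layout and the point counts below are assumptions for illustration, not the paper’s exact feature schema.

```python
import torch

def fuse_visuo_tactile(visual_pts, taxel_pts, taxel_pressure):
    """Merge camera points and tactile points into one cloud.
    Each point carries (x, y, z, pressure, is_tactile); visual points get zero pressure."""
    vis_feat = torch.zeros(visual_pts.shape[0], 2)                       # pressure=0, flag=0
    tac_feat = torch.stack([taxel_pressure,
                            torch.ones_like(taxel_pressure)], dim=-1)    # pressure, flag=1
    visual = torch.cat([visual_pts, vis_feat], dim=-1)                   # (Nv, 5)
    tactile = torch.cat([taxel_pts, tac_feat], dim=-1)                   # (Nt, 5)
    return torch.cat([visual, tactile], dim=0)                           # (Nv + Nt, 5)

# Example: 1024 camera points plus a 16x16 taxel grid on each of two fingertips.
cloud = fuse_visuo_tactile(torch.rand(1024, 3),
                           torch.rand(2 * 16 * 16, 3),
                           torch.rand(2 * 16 * 16))
print(cloud.shape)   # torch.Size([1536, 5])
```

The downstream point-cloud encoder then sees touch and vision in the same coordinate frame, which is what makes reasoning like “my fingertip is pressing on the edge of the socket” possible.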

Experimental Results

The researchers evaluated VT-Refine on five challenging tasks from the AutoMate dataset, such as inserting distinct plugs into sockets. They compared their method against a Vision-Only baseline and analyzed the impact of the simulation fine-tuning.

Does Fine-Tuning Work?

The results were stark. As shown in Figure 6, the Visuo-Tactile policy (Blue line) significantly outperforms the Vision-Only policy (Orange line).

Figure 6: Fine-tuning curves showing success rates over training epochs. Visuo-Tactile policies (blue) consistently outperform Vision-Only (orange).

Notice the trajectory of the blue line. It starts with a decent success rate (thanks to the real-world pre-training) and then climbs toward 90-100% success as the RL fine-tuning kicks in. This shows that the robot is genuinely learning from its simulated practice sessions.

Sim-to-Real Transfer Performance

The ultimate test is deploying the policy back onto the physical robot.

Figure 7: Comparison of success rates across different stages: Pre-trained vs. Fine-tuned, in both Sim and Real environments.

Figure 7 highlights a critical finding: Fine-tuning in simulation improves real-world performance. Look at the jump between “Pre-Train (Real)” and “Fine-Tuned (Real).” For difficult assets (like Asset 00081 and 00007), the success rate jumps dramatically. The “Sim-Real Gap” (the performance drop when moving from sim to reality) is minimal, validating the high fidelity of the tactile simulation.

In numerical terms, Table 1 below details the specific success rates. For the “Visuo-Tactile Policy,” we see bold numbers in the 0.85 - 0.95 range for most objects after fine-tuning, compared to 0.55 - 0.65 before fine-tuning.

Table 1: Real-World Experiments table showing significant improvement in success rates after RL fine-tuning for visuo-tactile policies.

(Note: the caption accompanying this table refers to “Table 2,” but the data shown corresponds to the real-world outcomes discussed as Table 1 in the paper.)

Qualitative Analysis: The “Wiggle”

What does this improvement look like visually?

Figure 8: Policy Rollout comparison. (a) Successful insertion with wiggling/re-orienting. (b) Failure cases where the robot jams or misaligns.

In Figure 8(a), we see the fine-tuned policy in action. The authors describe this as a “wiggle-and-dock” maneuver. The robot arms continuously coordinate, sensing the forces. When the parts don’t align perfectly, the robot doesn’t just push harder (which causes jamming); it retracts slightly, re-orients, and tries again until the tactile map indicates a smooth slide.

Contrast this with Figure 8(b), the baseline policy. Without the fine-tuned tactile awareness, the robot pushes at bad angles, leading to jams. It lacks the “reactive” capability to fix small errors.

The Importance of Calibration

One technical nugget that shouldn’t be overlooked is sensor calibration. You cannot simply assume the simulator physics match the real world. The authors used a “Real-to-Sim” calibration step.

Figure 4: Sensor Calibration Histograms. The distribution of sensor readings in Sim (Orange) closely matches the Real (Blue) readings.

They poked the real sensor, recorded the data, and then tuned the simulator’s stiffness parameters until the simulated data matched. Figure 4 shows the histogram of sensor readings. The overlap between the Real (blue) and Sim (orange) distributions is excellent. Without this calibration, the RL agent would learn to react to forces that don’t exist in the real world, leading to failure upon deployment.
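In the simplest case, matching the simulated stiffness to real poke data reduces to a one-parameter least-squares fit, sketched below with made-up measurements. The paper’s calibration compares full distributions of sensor readings (Figure 4) rather than fitting a single line, so treat this purely as an illustration of the idea.

```python
import numpy as np

def fit_stiffness(depths, forces):
    """Closed-form 1D least squares for k_n such that k_n * depth ≈ measured force."""
    depths = np.asarray(depths, dtype=float)
    forces = np.asarray(forces, dtype=float)
    return float(depths @ forces / (depths @ depths))

# Hypothetical poke data: penetration depth (m) vs. measured normal force (N).
depths = [0.0002, 0.0005, 0.0010, 0.0015]
forces = [0.35, 0.90, 1.85, 2.70]
print(fit_stiffness(depths, forces))   # estimated stiffness k_n in N/m
```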

Conclusion and Implications

VT-Refine represents a significant step forward in robotic manipulation. It successfully bridges the gap between the data-hungry nature of deep learning and the scarcity of real-world robot data.

Key Takeaways:

  1. Touch is Essential: For precise assembly, vision isn’t enough. Tactile feedback provides the necessary cues to correct alignment errors.
  2. Simulation Scales Skills: By using a “Real-to-Sim” pipeline, we can leverage the speed of simulation to refine policies far beyond what is possible with human demonstrations alone.
  3. Hardware-Software Synergy: The choice of piezoresistive sensors was not just a hardware decision; it was a software decision. It enabled accurate, fast simulation, which was the linchpin of the whole operation.
  4. Point Cloud Representation: Unifying vision and touch into a single geometric representation simplifies the learning process and aids transfer.

This framework suggests a future where robots can learn rough skills from humans and then “dream” in simulation to perfect them, eventually mastering complex, contact-rich tasks that currently require human dexterity. Whether it’s assembling electronics or handling fragile items, the combination of vision, touch, and simulation is the key to the next generation of capable robots.