Introduction
Imagine trying to tie your shoelaces with numb fingers. You can see the laces perfectly, but without the subtle feedback of tension and texture, the task becomes clumsy and frustrating. This is the current state of most robotic manipulation. While computer vision has seen explosive growth, allowing robots to “see” the world with high fidelity, the sense of touch—tactile perception—remains a significant bottleneck.
Robots generally struggle to manipulate objects where force and contact are critical, such as pulling a stuck drawer, inserting a key into a lock, or handling soft fruit. The challenges are twofold: Hardware (collecting reliable tactile data is hard) and Algorithms (teaching a robot to “understand” what it feels is even harder).
Standard methods like teleoperation are expensive and slow. Passive human video data (like YouTube) lacks sensory information entirely. In this post, we are diving deep into a new paper that bridges this gap: exUMI.
The researchers from Shanghai Jiao Tong University have introduced a comprehensive system that includes a low-cost, handheld data collection device (hardware) and a novel learning framework called Tactile Prediction Pretraining (TPP) (software). By treating tactile sensing not just as a static image, but as a dynamic process predicted by action, they have achieved remarkable results in contact-rich manipulation tasks.

Part 1: The Hardware Barrier
To learn from demonstration, we need data—lots of it. The “Universal Manipulation Interface” (UMI) was a breakthrough device that allowed researchers to collect robot training data using a handheld gripper equipped with a camera. However, the original UMI had limitations:
- Proprioception Drift: It relied on visual SLAM (Simultaneous Localization and Mapping) to track where the gripper was in space. In featureless rooms or during fast motion, SLAM often failed.
- No Touch: It was purely visual.
- Gripper Uncertainty: It used visual markers (ArUco) to guess how wide the gripper was open, which is prone to error when the markers are occluded.
Enter exUMI: An Extensible Upgrade
The exUMI system is a “physical twin” of a robot gripper designed for robust, in-the-wild data collection. It addresses the flaws of its predecessor with three clever engineering upgrades.

1. Robust Proprioception via AR
Instead of relying on fragile visual SLAM algorithms, exUMI leverages the mature tracking technology of virtual reality. The researchers mounted a Meta Quest 3 controller onto the handheld device. This provides industrial-grade 6D pose tracking (position and rotation) that keeps working even when the camera is occluded or the background is a featureless white wall, scenarios that routinely break visual tracking systems.
As shown in Figure 3 below, environments with clean backgrounds or severe occlusions are nightmares for traditional vision-based tracking. The AR system ignores these visual distractions entirely.
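To make this concrete, here is a minimal sketch of how a tracked controller pose could be turned into the gripper's tool-center-point (TCP) pose: compose the controller's world pose with a fixed mounting transform measured once during calibration. The function names and the placeholder transform below are illustrative assumptions, not code from the paper.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def pose_to_matrix(position, quaternion):
    """Build a 4x4 homogeneous transform from a position and an xyzw quaternion."""
    T = np.eye(4)
    T[:3, :3] = R.from_quat(quaternion).as_matrix()
    T[:3, 3] = position
    return T

# Fixed transform from the Quest controller frame to the gripper TCP frame.
# In practice this comes from a one-time calibration; the values below are placeholders.
T_CONTROLLER_TO_TCP = pose_to_matrix([0.0, -0.05, 0.12], [0, 0, 0, 1])

def gripper_pose(controller_position, controller_quaternion):
    """Convert a tracked controller pose (world frame) to the gripper TCP pose."""
    T_world_controller = pose_to_matrix(controller_position, controller_quaternion)
    return T_world_controller @ T_CONTROLLER_TO_TCP
```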

2. Precise Gripper State
To know exactly how wide the gripper is open (crucial for grasping), the team ditched the visual markers. They installed a low-cost AS5600 magnetic rotary encoder directly into the joint.

This sensor measures the magnetic field of a radial magnet attached to the gripper mechanism. It provides sub-millimeter precision on finger width, regardless of lighting conditions or visual obstructions.
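As a rough illustration, reading such an encoder boils down to a one-time linear calibration: record the raw 12-bit angle at the fully closed and fully open positions, then interpolate. The constants below are placeholders, not values from the paper.

```python
# Minimal sketch: map a raw AS5600 angle reading to gripper finger width.
# The calibration constants are measured once by fully closing and fully
# opening the gripper; the numbers here are placeholders.

RAW_CLOSED = 812      # encoder reading with the fingers fully closed
RAW_OPEN = 2590       # encoder reading with the fingers fully open
WIDTH_MAX_MM = 85.0   # physical stroke of the gripper in millimetres

def raw_to_width_mm(raw: int) -> float:
    """Linearly interpolate the 12-bit AS5600 reading to finger width in mm."""
    frac = (raw - RAW_CLOSED) / (RAW_OPEN - RAW_CLOSED)
    return max(0.0, min(1.0, frac)) * WIDTH_MAX_MM
```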
3. Visuo-Tactile Sensing
The “fingertips” of exUMI are not just rubber pads; they are sensors. The team upgraded the 9DTact design—a vision-based tactile sensor.
Here is how it works: A camera looks at the back of a silicone gel pad from the inside. When the gel presses against an object, it deforms. The internal camera captures this deformation as a change in lighting/color, effectively turning “touch” into an “image.”
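A common first processing step for this kind of sensor (not necessarily the exact pipeline used in exUMI) is to subtract a reference frame captured with the gel untouched, which turns the deformation into a simple contact map:

```python
import cv2
import numpy as np

def contact_map(reference_bgr: np.ndarray, current_bgr: np.ndarray, thresh: int = 15):
    """Return a rough contact mask by differencing the current gel image
    against a reference frame captured with nothing touching the sensor."""
    ref = cv2.cvtColor(reference_bgr, cv2.COLOR_BGR2GRAY).astype(np.int16)
    cur = cv2.cvtColor(current_bgr, cv2.COLOR_BGR2GRAY).astype(np.int16)
    diff = np.abs(cur - ref).astype(np.uint8)
    diff = cv2.GaussianBlur(diff, (5, 5), 0)          # suppress sensor noise
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return diff, mask
```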

The researchers improved the durability of these sensors by adding a bevel to lock the gel in place (preventing it from peeling off during shear forces) and creating a custom mold to ensure consistent thickness.

Synchronization: The “Shake” Test
Integrating these disparate sensors—a GoPro camera, a VR controller, and tactile sensors—introduces a major headache: Latency. If the robot feels a bump 50 milliseconds after it sees the collision, the learning algorithm will be confused.
The exUMI system uses a clever calibration trick. The user simply waves the device back and forth in front of a visual marker. The system then aligns the trajectory from the camera (visual) with the trajectory from the AR tracker (motion) to find the exact time offset, achieving synchronization with less than 5 ms error.
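Conceptually, the offset can be recovered by cross-correlating the two motion signals, for example the speed profiles from the camera trajectory and the AR tracker, and picking the lag that maximizes their agreement. The sketch below assumes both signals have already been resampled to a common rate; it illustrates the idea rather than the authors' exact implementation.

```python
import numpy as np

def estimate_offset(camera_speed, tracker_speed, dt):
    """Estimate the time offset (seconds) between two 1-D speed signals
    sampled at the same rate by locating the peak of their cross-correlation."""
    a = (camera_speed - camera_speed.mean()) / camera_speed.std()
    b = (tracker_speed - tracker_speed.mean()) / tracker_speed.std()
    corr = np.correlate(a, b, mode="full")
    lag = np.argmax(corr) - (len(b) - 1)   # positive lag: camera lags the tracker
    return lag * dt

# Example usage: both signals resampled to 200 Hz before alignment.
# offset_s = estimate_offset(cam_speed, ar_speed, dt=1 / 200)
```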

Part 2: The Data Advantage
With robust hardware, data collection becomes a breeze. The authors collected a massive dataset of Human Play Data. Instead of strictly following a script, operators interacted with over 300 objects in 10 different environments—picking, pushing, stacking, and squishing things.
Because the system is reliable, they collected over 1 million tactile frames. What makes this dataset unique is the Contact Richness. In typical robot datasets, valid contact (actually touching something) happens less than 10% of the time. In the exUMI dataset, active tactile frames account for over 60% of the data.

As seen in the chart above, exUMI allows for significantly higher data throughput compared to traditional teleoperation, specifically in capturing active tactile frames.
Part 3: Tactile Prediction Pretraining (TPP)
Now we reach the core innovation of the paper. How do we turn these million frames of squishy silicone images into a “brain” that understands physics?
The researchers argue that existing methods fall short because they treat tactile images like standard photos.
- Contrastive Learning (common in computer vision) assumes that if you crop an image, it’s still the same object. But in tactile sensing, cropping an image changes the contact point entirely—it changes the physics.
- Visual-Tactile Alignment tries to force the touch sensor to match the camera view. But often, what you see (a flat surface) and what you feel (a slippery texture) are different, which is the whole point of having touch sensors.
The Hypothesis: Action-Awareness
The core insight of Tactile Prediction Pretraining (TPP) is that touch is a consequence of action. You cannot understand the tactile signal of “friction” without knowing that the finger is “sliding.”
Therefore, the model shouldn’t just classify tactile images; it should predict them.
The Algorithm
TPP functions as a self-supervised proxy task. The model is trained to answer the question: “Given what I felt in the past, and how I am moving my hand now, what will I feel in the future?”

Let’s break down the architecture shown in Figure 6:
- Inputs:
  - Tactile History: A sequence of past tactile images.
  - Action History: How the robot moved in the past.
  - Current Image: What the robot sees right now (context).
  - Future Action: How the robot plans to move.
- The Engine (Latent Diffusion Model): The system uses a Latent Diffusion Model (LDM). It takes the history and conditions and tries to “denoise” a random signal into a clean prediction of the Future Tactile Frames.
- The Goal: By forcing the model to generate future tactile frames based on actions, the network implicitly learns the dynamics of contact. It learns that pushing down creates a spreading pattern (pressure), while moving sideways creates a shear pattern (friction).
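Putting the pieces together, a schematic training step might look like the sketch below: encode the tactile/action history and the current image into a conditioning vector, compress the ground-truth future tactile frames into latent space, and train the denoiser with the standard diffusion objective. All module names (`tactile_enc`, `denoiser`, `vae`, `scheduler`, and so on) are placeholders standing in for the components described above, not the authors' code.

```python
import torch

def tpp_training_step(batch, tactile_enc, action_enc, image_enc, denoiser, vae, scheduler):
    """One schematic TPP step: condition on history, denoise future tactile latents."""
    # Encode the conditioning context described in Figure 6.
    ctx = torch.cat([
        tactile_enc(batch["tactile_history"]),   # what was felt
        action_enc(batch["action_history"]),     # how the hand moved
        action_enc(batch["future_action"]),      # how it plans to move
        image_enc(batch["current_image"]),       # current visual context
    ], dim=-1)

    # Compress the ground-truth future tactile frames into latent space.
    z0 = vae.encode(batch["future_tactile"])

    # Standard latent-diffusion objective: predict the injected noise.
    noise = torch.randn_like(z0)
    t = torch.randint(0, scheduler.num_steps, (z0.shape[0],), device=z0.device)
    zt = scheduler.add_noise(z0, noise, t)
    pred = denoiser(zt, t, ctx)
    return torch.nn.functional.mse_loss(pred, noise)
```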
This pretraining results in a frozen Tactile Encoder (\(E_T\)). This encoder can then be plugged into a standard Imitation Learning policy, providing the robot with a rich, physics-aware understanding of touch.
The policy learning equation looks like this:

\[ a_{t+1} = \pi(s, T, V) \]

Here, the policy \(\pi\) decides the next action \(a_{t+1}\) based on the robot state \(s\), the tactile embedding \(T\) (produced by the frozen TPP encoder \(E_T\)), and the visual input \(V\).
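In code, plugging the frozen encoder into an imitation-learning policy might look roughly like the sketch below. The class and attribute names (including `out_dim` on the encoders) are assumptions for illustration; the actual policy head could be any standard imitation-learning architecture.

```python
import torch
import torch.nn as nn

class TactilePolicy(nn.Module):
    """Schematic imitation-learning policy: a_{t+1} = pi(s, T, V).
    The TPP tactile encoder is frozen; only the policy head is trained."""

    def __init__(self, tactile_encoder, visual_encoder, state_dim, action_dim, hidden=512):
        super().__init__()
        self.tactile_encoder = tactile_encoder.eval()
        for p in self.tactile_encoder.parameters():
            p.requires_grad = False               # keep E_T frozen
        self.visual_encoder = visual_encoder
        feat_dim = state_dim + tactile_encoder.out_dim + visual_encoder.out_dim
        self.head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, tactile_frames, image):
        with torch.no_grad():
            T = self.tactile_encoder(tactile_frames)   # physics-aware embedding
        V = self.visual_encoder(image)
        return self.head(torch.cat([state, T, V], dim=-1))
```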
Experiments and Results
Does this theory hold up in the real world? The researchers tested the system on a Flexiv Rizon 4 robot arm across several difficult tasks.
Does the model actually “predict” touch?
Before putting it on a robot, they checked if the TPP model could accurately hallucinate the future.

In Figure 7, look at the timeline. The model receives history (t-1, t). It then predicts frames t+1 through t+4. The “Ground Truth” row shows what actually happened, and the “Prediction” row shows what the model thought would happen. The predictions are incredibly close, accurately forecasting the onset and release of contact.
Crucially, an ablation study (Table 1 below) showed that including Action data significantly reduced prediction error. The model predicts better because it knows what the hand is doing.
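As a simple proxy for that comparison, one could measure per-frame reconstruction error between predicted and ground-truth tactile sequences, for example plain MSE as sketched below (the paper may report different metrics):

```python
import numpy as np

def tactile_prediction_error(predicted, ground_truth):
    """Mean per-frame MSE between predicted and ground-truth tactile sequences,
    both shaped (num_frames, H, W, C) and scaled to [0, 1]."""
    assert predicted.shape == ground_truth.shape
    per_frame = ((predicted - ground_truth) ** 2).mean(axis=(1, 2, 3))
    return per_frame.mean(), per_frame
```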

Real-World Manipulation
The ultimate test involves tasks that are nearly impossible with vision alone.
- Pull Drawer: The robot must grasp a handle and pull. The drawer might be empty (easy) or weighed down with stones (hard). Vision can’t see the weight. The robot must “feel” the resistance to adjust its force.
- Peg in Hole: Inserting a tight-fitting block requires precise alignment that cameras often miss due to occlusion (the robot hand blocks the view).
- Open Bottle: Unscrewing a cap requires sensing the friction to rotate without slipping.

In the visualizations above (Figure 8), you can see the “thermal” maps representing the pressure on the gel sensors. The yellow arrows indicate the tangent forces inferred by the robot.
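For intuition, one common way to extract a tangential (shear) cue from a vision-based tactile sensor, though not necessarily the method used here, is dense optical flow between consecutive gel frames; the mean flow vector points along the dominant shear direction.

```python
import cv2
import numpy as np

def shear_vector(prev_gray: np.ndarray, curr_gray: np.ndarray) -> np.ndarray:
    """Estimate the dominant tangential (shear) direction on the gel surface
    from dense optical flow between two consecutive grayscale tactile frames."""
    # Farneback parameters: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return flow.reshape(-1, 2).mean(axis=0)  # average (dx, dy) displacement in pixels
```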
The Results: The TPP-enhanced policy significantly outperformed baselines.
- Pull Drawer (Random Weight): Vision-only policies frequently failed (40% success) because they couldn’t adapt to the heavy drawer. The TPP policy achieved a 95% success rate.
- Peg in Hole (Insert): Vision-only achieved 50%. TPP achieved 80%.
The experiments proved that the TPP encoder provided the necessary “physical intuition” to adjust the robot’s motion based on resistance and contact geometry.
Conclusion and Implications
The exUMI paper represents a significant step toward “sensory completeness” in robotics. By essentially open-sourcing a design for a $698 data collection device, the authors are democratizing access to high-quality, multimodal robot data.
More importantly, the Tactile Predictive Pretraining framework shifts the paradigm of how we teach robots to feel. Instead of static texture recognition, we are moving toward dynamic contact prediction. The robot isn’t just feeling; it’s anticipating.
Key Takeaways
- Hardware Matters: You can’t learn good policies from bad data. Using AR for tracking and magnetic encoders for gripper state ensures 100% data usability.
- Touch is Dynamic: Tactile signals are meaningless without the context of motion (Action).
- Prediction is Learning: If a robot can predict the tactile consequence of an action, it understands the physical interaction.
This work paves the way for robots that can handle delicate tasks—from elderly care to complex assembly—with the sensitivity and grace of a human hand.