Introduction
In the world of robotics, vision often gets all the glory. We marvel at robots that can “see” and navigate complex environments. However, when it comes to the delicate art of manipulation—positioning a workpiece, assembling a component, or inserting a USB drive—vision has its limits. Cameras suffer from occlusions (the robot’s own hand often blocks the view) and lighting variations. This is where tactile sensing becomes indispensable.
For a robot to manipulate an object precisely, it must know the object’s exact 6D pose (position and orientation) within its hand. This is known as in-hand pose estimation. While humans do this instinctively, it is a massive computational challenge for robots, particularly when handling objects they haven’t seen before or objects with symmetrical shapes that look identical from different angles.
In a recent paper titled UniTac2Pose, researchers propose a breakthrough framework that unifies pose estimation, tracking, and uncertainty estimation. By leveraging energy-based diffusion models trained entirely in simulation, this method allows robots to “feel” the pose of an object with high precision, bridging the gap between virtual training and real-world application.

The Challenge of Tactile Perception
Tactile sensors, such as the GelSlim sensor used in this research, provide high-resolution “images” of the contact surface. While these sensors offer rich geometrical data, they introduce specific challenges:
- Local Ambiguity: A tactile sensor only sees a small patch of the object. Touching a flat surface on a cube feels the same regardless of where you touch it, so mapping this local imprint to a location on the global object shape is difficult.
- Sim-to-Real Gap: Training robots in the real world is slow and expensive. Training them in simulation is fast, but simulated physics and rendering rarely match the real world perfectly.
- Generalization: Most existing methods are trained on specific objects. If you hand the robot a slightly different version of a tool (an “intra-category” object), traditional models fail.
Previous attempts have used regression (direct prediction), point cloud registration (aligning 3D points), or feature matching. However, these methods often struggle with initialization sensitivity, getting stuck in “local minima” (good guesses that aren’t quite right), or failing to handle symmetric objects.
The UniTac2Pose Framework
The core innovation of UniTac2Pose is its shift from direct prediction to an iterative, energy-based approach. Instead of guessing the pose once, the system defines an “energy landscape” in which the correct pose has the highest energy (and therefore the highest likelihood). It then uses a diffusion process to guide random guesses toward that peak.

As illustrated in Figure 2, the framework operates in two main phases: Synthetic Data Generation and the Real-world Inference Pipeline.
1. Learning from Simulation
The researchers developed a purely synthetic training pipeline. They used a tactile simulator (XENSIM) to generate thousands of virtual grasps. For a given object mesh (CAD model) and a pose, they simulate what the tactile sensor would see. This creates a massive dataset of (Pose, Object, Tactile Image) triplets without needing a single real-world experiment for training.
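To make the structure of this pipeline concrete, here is a minimal sketch of such a data-generation loop. The helper names and the tactile-rendering stub are purely illustrative; they are not XENSIM’s actual API.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def sample_grasp_pose(rng):
    """Sample a random in-hand pose: a random rotation plus a small
    translation (in meters) of the object relative to the sensor frame."""
    T = np.eye(4)
    T[:3, :3] = Rotation.random(random_state=rng).as_matrix()
    T[:3, 3] = rng.uniform(-0.01, 0.01, size=3)
    return T

def simulate_tactile_image(mesh, pose, resolution=(240, 320)):
    """Placeholder for the tactile simulator: render the contact patch of
    `mesh` at `pose` as a depth-like image. Returns a dummy array here."""
    return np.zeros(resolution, dtype=np.float32)

def generate_triplets(meshes, grasps_per_object=1000, seed=0):
    """Build the (Pose, Object, Tactile Image) training triplets."""
    rng = np.random.default_rng(seed)
    dataset = []
    for mesh in meshes:
        for _ in range(grasps_per_object):
            pose = sample_grasp_pose(rng)
            tactile = simulate_tactile_image(mesh, pose)
            dataset.append((pose, mesh, tactile))
    return dataset
```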
2. The Energy-Based Diffusion Model
At the heart of the system is the Energy Net. This neural network takes a proposed pose, the object’s 3D model, and the observed tactile image as input. It outputs a scalar “energy” score indicating how well the candidate pose explains the tactile observation.
Crucially, the network is trained using Denoising Score Matching (DSM). The objective function ensures that the gradient of the energy field points toward the true pose.
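The exact loss is not reproduced here, but a standard DSM objective under this convention (highest energy at the true pose \(x\)) takes the form:

\[
\mathcal{L}_{\text{DSM}} \;=\; \mathbb{E}_{x,\;\epsilon \sim \mathcal{N}(0,\,\sigma^2 I)}\!\left[\;\Big\|\,\nabla_{\tilde{x}} E_\theta(\tilde{x} \mid O, I) \;-\; \frac{x - \tilde{x}}{\sigma^2}\,\Big\|^2\;\right], \qquad \tilde{x} = x + \epsilon,
\]

where \(\tilde{x}\) is the perturbed pose, \(O\) the object model, and \(I\) the tactile observation; minimizing it forces the energy gradient at \(\tilde{x}\) to point back toward the ground-truth pose.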

Here, the model learns to denoise a perturbed pose, effectively learning the “slope” of the energy landscape. The researchers parameterize the energy function as the inner product of a feature vector and the pose itself:
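One way to write such a parameterization (the exact inputs of the feature extractor are an assumption here) is:

\[
E_\theta(x \mid O, I) \;=\; \big\langle\, \psi_\theta(O, I, x),\; x \,\big\rangle,
\]

where \(\psi_\theta\) is the feature vector produced by the network and \(x\) is the (vectorized) pose. This form keeps the scalar energy differentiable in the pose, which is exactly what the gradient-based refinement described below relies on.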

3. The Render-and-Compare Architecture
One of the most clever design choices in UniTac2Pose is the render-and-compare mechanism. To bridge the sim-to-real gap, the network doesn’t interpret the raw tactile observation in isolation.
Inside the network, the system takes the candidate pose and renders a synthetic tactile image from the CAD model. It then compares this rendered image with the actual real-world tactile image observed by the robot. By feeding both the “imagined” touch and the “felt” touch into the network, the model focuses on geometric consistency rather than over-fitting to specific visual textures.
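A rough PyTorch sketch of what such a render-and-compare energy network could look like is shown below; the layer sizes and the pose representation (a 9-D rotation-plus-translation vector) are illustrative choices, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class RenderCompareEnergyNet(nn.Module):
    """Illustrative render-and-compare energy network (not the paper's exact
    design). It encodes the observed and rendered tactile images with a shared
    CNN, fuses them with the candidate pose, and outputs a feature vector
    whose inner product with the pose gives the scalar energy."""

    def __init__(self, pose_dim=9):
        super().__init__()
        self.encoder = nn.Sequential(            # shared tactile-image encoder
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(               # fuses both embeddings + the pose
            nn.Linear(32 * 2 + pose_dim, 128), nn.ReLU(),
            nn.Linear(128, pose_dim),
        )

    def forward(self, observed, rendered, pose):
        f_obs = self.encoder(observed)           # the "felt" touch
        f_ren = self.encoder(rendered)           # the "imagined" touch from the CAD model
        feat = self.head(torch.cat([f_obs, f_ren, pose], dim=-1))
        return (feat * pose).sum(dim=-1)         # energy = <feature, pose>
```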
The Three-Stage Inference Process
When the robot actually grasps an object, UniTac2Pose determines the pose in three distinct stages.
Stage 1: Pre-filtering
The system starts by sampling a large number of random pose candidates (guesses) from a prior distribution. It passes these candidates through the Energy Net to get a rough score. Low-ranking candidates are immediately discarded, leaving only the most promising guesses.

Stage 2: Iterative Refinement
This is where the diffusion model shines. The remaining candidates are refined iteratively. The system calculates the gradient of the energy function—essentially asking, “In which direction should I nudge this pose to make it match the tactile data better?”
This process is modeled as a probability flow ODE (Ordinary Differential Equation). By following the gradient, the candidates “flow” from noisy guesses toward the true pose.
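In the variance-exploding formulation common in score-based diffusion (whether the paper uses exactly this noise schedule is an assumption here), the probability flow ODE reads:

\[
\frac{\mathrm{d}x}{\mathrm{d}t} \;=\; -\frac{1}{2}\,\frac{\mathrm{d}\sigma^2(t)}{\mathrm{d}t}\,\nabla_x \log p_t(x)
\;\approx\; -\frac{1}{2}\,\frac{\mathrm{d}\sigma^2(t)}{\mathrm{d}t}\,\nabla_x E_\theta(x \mid O, I, t),
\]

integrated from \(t = T\) down to \(t = 0\). Because time runs backward, each step nudges the candidate pose \(x\) along the energy gradient, i.e., toward higher-energy (more likely) poses as the noise level shrinks.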

Stage 3: Post-Ranking
After refinement, the candidates should be clustered tightly around the true pose. The system scores them one last time using the Energy Net and selects the candidate with the highest energy as the final estimate.
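Putting the three stages together, the inference loop can be sketched as follows. This is a simplified illustration reusing the energy network above; the candidate counts, step size, and flat pose vector are placeholder choices, and a real implementation would keep the rotation component on the SE(3) manifold.

```python
import torch

def estimate_pose(energy_net, observed, render_fn, pose_dim=9,
                  num_candidates=2048, keep=64, num_steps=100, step_size=1e-2):
    """Simplified three-stage inference: pre-filter, refine, re-rank.
    `observed` is a (1, 1, H, W) tactile image; `render_fn(poses)` renders
    synthetic tactile images from the CAD model for each candidate."""
    # Stage 1: pre-filtering - sample many candidates, keep the best-scoring ones.
    poses = torch.randn(num_candidates, pose_dim)   # stand-in for the pose prior
    with torch.no_grad():
        e = energy_net(observed.expand(num_candidates, -1, -1, -1),
                       render_fn(poses), poses)
    poses = poses[e.topk(keep).indices]

    # Stage 2: iterative refinement - follow the energy gradient uphill
    # (a plain Euler discretization standing in for the probability-flow ODE).
    for _ in range(num_steps):
        poses = poses.detach().requires_grad_(True)
        e = energy_net(observed.expand(keep, -1, -1, -1), render_fn(poses), poses)
        grad, = torch.autograd.grad(e.sum(), poses)
        poses = poses + step_size * grad

    # Stage 3: post-ranking - score the refined candidates and pick the best.
    with torch.no_grad():
        e = energy_net(observed.expand(keep, -1, -1, -1), render_fn(poses), poses)
    return poses[e.argmax()].detach()
```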

Beyond Estimation: Tracking and Uncertainty
Because the framework is probabilistic, it offers more than just a single coordinate output.
Pose Tracking: If the object moves, the system doesn’t start from scratch. It uses the previous frame’s estimate as the “prior” for the next frame. This allows for real-time tracking at roughly 10 Hz.
Uncertainty Estimation: If the refined candidates are spread far apart, it means the model is confused (high variance). If they cluster tightly, the model is confident. This variance (\(S^2\)) is a direct metric for uncertainty.

This capability is vital for manipulation. If the uncertainty is high, the robot can decide to re-grasp the object at a different location to get a better read.
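As a minimal illustration, the spread can be computed directly from the refined candidates. The paper’s statistic is denoted \(S^2\); this sketch simply sums the per-dimension variances into one scalar.

```python
import torch

def pose_uncertainty(refined_poses):
    """Scalar spread of the refined pose candidates: low values mean the
    candidates agree (confident estimate), high values mean the contact
    is ambiguous and a re-grasp may be warranted."""
    return refined_poses.var(dim=0).sum().item()
```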
Experimental Results
The researchers validated UniTac2Pose on a Franka Panda robot equipped with GelSlim 3.0 sensors. They tested on 30 distinct objects, including pipes, connectors, and tools.
Visualizing the Difference
The effectiveness of the simulation-based training relies on how well the synthetic data matches reality. Figure 5 shows the comparison between real tactile images (top) and the simulated ones (bottom). While not identical, the geometric features (contact shapes) are consistent enough for the render-and-compare module to work effectively.

Performance vs. Baselines
The method was compared against standard baselines:
- FilterReg: A point-cloud registration method.
- Regression: A standard deep learning model predicting pose directly.
- Matching: A feature-matching approach similar to Tac2Pose.
The metric used was ADD-S, which accounts for object symmetry (crucial for round objects like washers or nuts).
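For reference, ADD-S is commonly defined as the average distance from each model point transformed by the ground-truth pose to its closest model point transformed by the estimated pose:

\[
\text{ADD-S} \;=\; \frac{1}{|\mathcal{M}|} \sum_{x_2 \in \mathcal{M}} \;\min_{x_1 \in \mathcal{M}} \big\| (\hat{R}\,x_1 + \hat{t}) - (R^{*}x_2 + t^{*}) \big\|,
\]

where \(\mathcal{M}\) is the set of model points, \((\hat{R}, \hat{t})\) the estimated pose, and \((R^{*}, t^{*})\) the ground truth. The minimum over closest points is what makes the metric insensitive to rotations that map a symmetric object onto itself.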

As seen in Figure 3, the estimated poses (yellow) align very closely with the ground truth (red). The quantitative results showed that UniTac2Pose significantly outperforms baselines, particularly on complex shapes where global registration methods often fail.
Generalizing to Unseen Objects
Perhaps the most impressive result is the intra-category generalization. The researchers trained the model on a set of “Pipe” and “Connector” objects and then tested it on different pipes and connectors that the model had never seen before.

Figure 4 shows that the error rates for “Unseen” objects (pink background) are comparable to “Seen” objects (blue background). This suggests the model learns the fundamental geometric properties of “pipes” or “connectors” rather than memorizing specific object instances.
Conclusion and Future Outlook
UniTac2Pose represents a significant step forward in robotic tactile perception. By unifying pose estimation, tracking, and uncertainty into a single energy-based framework, it solves several longstanding problems in the field.
Key Takeaways:
- Simulation is Sufficient: You can train high-precision tactile models entirely in simulation if you use robust domain-randomization and render-and-compare architectures.
- Unified Framework: One model can handle estimation, tracking, and uncertainty, simplifying the robotic control stack.
- Generalization: Robots can potentially handle new tools and parts without needing retraining, provided they belong to a known category.
Limitations: The primary drawback noted by the authors is speed. The full inference process takes 1-2 seconds per pose, which is slow for dynamic tasks (though the tracking mode is much faster at 10 Hz).
Future work may focus on accelerating the diffusion process using techniques like Flow Matching, potentially bringing the full estimation pipeline up to real-time speeds. As robots move from structured factories into unstructured homes, this ability to “feel” and understand objects despite visual limitations will be critical.