Introduction
Imagine you are holding a smooth, spherical object like an orange in your hand. Now, close your eyes and rotate it. Despite not seeing the object, you can manipulate it with ease. You rely entirely on the sensation of touch—the friction against your skin, the pressure on your fingertips, and the subtle shifts in weight.
For humans, this is intuitive. For robots, it is an engineering nightmare.
Achieving “dexterous in-hand manipulation”—like rotating an object without dropping it—requires complex control policies. Historically, the most difficult part of training a robot to do this isn’t building the hand; it’s defining the reward function. In Reinforcement Learning (RL), the reward function is the math that tells the robot “Good job” or “That was bad.”
Designing these functions for tactile tasks is notoriously difficult. It often involves “alchemy”—experts manually tuning dozens of parameters (e.g., “bonus for touching the object,” “penalty for dropping,” “bonus for velocity”) until something works.
Enter Text2Touch, a new framework presented in a recent research paper that asks a bold question: Can Large Language Models (LLMs) replace the human expert in designing these tactile reward functions?
*(Figure 1: a four-fingered robot hand with vision-based tactile sensors rotating an object using an LLM-generated reward.)*
The answer appears to be yes. As shown in Figure 1, the researchers successfully used LLMs to generate reward functions that allow a four-fingered robot hand to rotate objects using vision-based tactile sensors. Most impressively, the LLM-designed rewards outperformed human-engineered baselines in the real world.
In this post, we will tear down the Text2Touch paper. We will explore why tactile sensing breaks traditional automated reward methods, how the authors engineered prompts to fix it, and how they successfully transferred these policies from simulation to the real world.
Background: The “Reward Engineering” Bottleneck
To understand why Text2Touch is significant, we need to understand the friction in modern robot learning.
Reinforcement Learning and Rewards
In Reinforcement Learning, an agent (the robot) explores an environment. It takes actions and receives a reward (a scalar number). The goal is to maximize the cumulative reward.
- If the reward function is too simple (e.g., “1 point for success, 0 otherwise”), the robot learns too slowly because successes are rare.
- If the reward function is too complex (dense rewards), it requires tedious manual tuning of coefficients. If you weight “don’t drop the object” too high, the robot might freeze and never move. If you weight “move fast” too high, it might fling the object across the room. (A sketch of this kind of hand-tuned reward follows below.)
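To make the tuning problem concrete, here is a minimal sketch of what a hand-tuned dense reward for in-hand rotation might look like. Every name and coefficient below is illustrative, not taken from the paper; the point is that each number is a knob an expert has to tune by trial and error.

```python
import numpy as np

def hand_tuned_reward(obj_angvel, contact_forces, obj_height, dropped):
    """A hypothetical dense reward for in-hand rotation (illustrative only)."""
    r_rotation = 2.0 * obj_angvel                      # bonus for rotating fast
    r_contact  = 0.5 * np.tanh(contact_forces.sum())   # bonus for touching the object
    r_height   = -5.0 * max(0.0, 0.05 - obj_height)    # penalty for sagging toward the palm
    r_drop     = -10.0 if dropped else 0.0             # large penalty for dropping
    # Each coefficient (2.0, 0.5, -5.0, -10.0) is a knob a human must tune by hand.
    return r_rotation + r_contact + r_height + r_drop
```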
Enter LLMs: The “Eureka” Moment
Recently, researchers realized that LLMs (like GPT-4) are excellent at writing code. A framework called Eureka demonstrated that LLMs could iteratively write reward function code, train a robot in simulation, read the performance stats, and then rewrite the code to be better.
However, prior work like Eureka focused primarily on proprioception (knowing where your joints are) and vision (seeing the object).
The Tactile Gap
Tactile sensing poses a unique challenge. Vision-based tactile sensors (like the TacTip used in this paper) provide high-dimensional, noisy, and physically complex data. They detect slip, shear, and contact patches. Incorporating this into an LLM-designed workflow is hard because the “state space” (the list of variables the robot knows about) explodes in size.
Text2Touch is the first attempt to bridge this gap, proving that LLMs can handle the messy, high-dimensional reality of tactile manipulation.
Core Method: The Text2Touch Framework
The Text2Touch framework operates as a loop that generates, evaluates, and refines reward functions in simulation, eventually distilling the best one for real-world use.
The architecture is visualized below:
*(Figure 2: overview of the Text2Touch framework, showing the reward generation pipeline and the model distillation stage.)*
Let’s break down the two main phases shown in Figure 2: the Reward Generation Pipeline (left/middle) and the Model Distillation (right).
1. Iterative Reward Design with LLMs
The core of the method is an iterative loop. The system provides the LLM with a task description (e.g., “rotate the object”) and the environment code. The LLM writes a Python function compute_reward(). This function is used to train a policy in simulation using PPO (Proximal Policy Optimization). The results are fed back to the LLM to prompt an improvement.
Mathematically, this loop looks like this:
$$
\pi_{R_k} = \arg\max_{\pi}\; \mathbb{E}\!\left[\sum_t R_k(s_t, a_t)\right], \qquad R_{k+1} = \mathrm{LLM}\big(R_k,\; F(\pi_{R_k})\big)
$$
Here, \(R_k\) is the reward function at iteration \(k\), \(\pi_{R_k}\) is the policy trained under it, and \(F\) is the fitness (success rate) that gets fed back to the LLM.
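A rough sketch of this loop in code is shown below. All helper names here (`llm`, `train_ppo`, `evaluate_fitness`, `stats_to_text`) are hypothetical stand-ins for the paper's actual pipeline.

```python
def text2touch_loop(task_description, env_signature, n_iterations=5, n_samples=4):
    """Generate-train-evaluate-refine loop (illustrative sketch; helpers are hypothetical)."""
    best_code, best_fitness, feedback = None, float("-inf"), ""
    for _ in range(n_iterations):
        # 1. The LLM proposes several candidate reward functions as Python code.
        candidates = [llm(task_description, env_signature, feedback)
                      for _ in range(n_samples)]
        for code in candidates:
            # 2. Train a policy with PPO in simulation under this candidate reward.
            policy, stats = train_ppo(reward_code=code)
            # 3. Score the trained policy with the fitness F (e.g. success rate).
            fitness = evaluate_fitness(policy)
            if fitness > best_fitness:
                best_code, best_fitness, best_stats = code, fitness, stats
        # 4. Summarised training statistics become feedback for the next round.
        feedback = stats_to_text(best_stats, best_fitness)
    return best_code
```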
The Problem with “Vanilla” Prompts
When the authors tried using standard prompts (like those from Eureka) for this tactile task, they failed. The environment had over 70 state variables—things like fingertip_pos, contact_force_mags, obj_orientation, tactile_patch_deformation, etc.
The LLMs got confused. They would hallucinate variables that didn’t exist or mix up data types, producing code that crashed. To solve this, the authors introduced Modified Prompt Structuring.
Instead of just giving the environment code, they explicitly passed a “Reward Function Signature”—a strict template listing every available variable and its type.

By strictly defining \(S_{detailed}\) (the signature), they drastically reduced syntax errors. The LLM didn’t have to guess variable names; it had a menu to choose from.
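A hypothetical excerpt of what such a signature might look like is shown below; the variable names, shapes, and types are illustrative, not the paper's exact state list.

```python
import torch

def compute_reward(
    obj_angvel: torch.Tensor,          # (num_envs, 3) object angular velocity
    obj_orientation: torch.Tensor,     # (num_envs, 4) object quaternion
    fingertip_pos: torch.Tensor,       # (num_envs, 4, 3) fingertip positions
    contact_force_mags: torch.Tensor,  # (num_envs, 4) per-finger contact magnitudes
    target_axis: torch.Tensor,         # (num_envs, 3) desired rotation axis
) -> torch.Tensor:                     # (num_envs,) scalar reward per environment
    """The LLM must fill in this body using only the arguments listed above."""
    ...
```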
The Innovation: Scalable Bonuses
The second major innovation in the method deals with reward scaling.
In previous works, the total reward was usually calculated as the LLM’s output plus a fixed binary bonus for success:
$$
R_{\text{total}} = R_{LLM}(s) + B \cdot \mathbb{1}[\text{success}]
$$
The problem is that the LLM generates the reward logic (\(R_{LLM}\)) from scratch. It might output values in the range of 0.1 to 1.0, or 100 to 1000. If the LLM outputs a reward of 1000, a fixed success bonus (\(B\)) of +1 is negligible—the robot won’t care about succeeding. If the LLM outputs 0.01, a bonus of +1 is huge, and the robot ignores everything else.
To fix this, Text2Touch forces the LLM to take the Success Bonus (\(B\)) and the Failure Penalty (\(P\)) as arguments inside the code.
$$
R_{\text{total}} = R_{LLM}(s,\, B,\, P)
$$
This allows the LLM to apply math to \(B\) and \(P\). The LLM might decide to multiply the bonus by 10 or divide the penalty by 5 so that they match the scale of the other reward terms (like velocity or contact stability). This small change was critical for convergence.
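Below is a hedged sketch of what an LLM-generated reward with scalable bonuses might look like. The term names, tensor shapes, and the 10x scaling factor are illustrative, not the paper's actual output; what matters is that \(B\) and \(P\) arrive as arguments the generated code is free to rescale.

```python
import torch

def compute_reward(
    obj_angvel: torch.Tensor,          # (num_envs, 3) object angular velocity
    contact_force_mags: torch.Tensor,  # (num_envs, 4) per-finger contact magnitudes
    target_axis: torch.Tensor,         # (num_envs, 3) desired rotation axis
    success: torch.Tensor,             # (num_envs,) bool: rotation target reached
    dropped: torch.Tensor,             # (num_envs,) bool: object fell off the fingertips
    success_bonus: float = 1.0,        # B, supplied by the framework
    fail_penalty: float = 1.0,         # P, supplied by the framework
) -> torch.Tensor:
    r_spin    = (obj_angvel * target_axis).sum(dim=-1)      # reward rotation about the target axis
    r_contact = torch.tanh(contact_force_mags.sum(dim=-1))  # keep the fingers on the object
    # The LLM can rescale B and P so they match the magnitude of the dense
    # terms above; the 10x factor here is purely illustrative.
    r_bonus   = 10.0 * success_bonus * success.float()
    r_penalty = -10.0 * fail_penalty * dropped.float()
    return r_spin + r_contact + r_bonus + r_penalty
```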
2. Sim-to-Real Distillation
Once the LLM designs a great reward function, we have a policy that works in simulation. However, this policy usually cheats. The reward function (and the “Teacher” policy trained on it) has access to privileged information—exact object velocity, exact center of mass, and perfect friction coefficients.
You cannot get this data in the real world. A real robot only knows what it feels (tactile) and where its joints are (proprioception).
To bridge this, the authors used a Teacher-Student approach:
- Teacher: Trained in simulation using the LLM-designed reward and privileged data.
- Student: A second neural network that takes only real-world-available data (tactile + proprioception) as input.
- Distillation: The Student tries to mimic the Teacher’s actions. It minimizes the difference between “what the Teacher would do” and “what the Student is doing.”
This Student policy is what gets deployed onto the physical robot.
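A minimal sketch of one distillation step is shown below, assuming hypothetical `teacher` and `student` networks and a batch of simulated transitions; the observation keys and the MSE imitation loss are illustrative choices, not necessarily the paper's exact setup.

```python
import torch
import torch.nn as nn

def distillation_step(teacher: nn.Module, student: nn.Module,
                      optimizer: torch.optim.Optimizer, batch: dict) -> float:
    """One student update: imitate the teacher's actions (illustrative sketch)."""
    with torch.no_grad():
        # The teacher acts on privileged simulator state (exact pose, velocity, friction, ...).
        teacher_actions = teacher(batch["privileged_obs"])
    # The student only sees what a real robot can measure: tactile images + joint states.
    student_actions = student(batch["tactile_obs"], batch["proprio_obs"])
    # The imitation loss pushes the student's actions toward the teacher's.
    loss = nn.functional.mse_loss(student_actions, teacher_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```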
Experiments and Results
The authors tested their framework using an Allegro Hand (a 4-fingered robotic hand) equipped with TacTip sensors. The task was multi-axis in-hand rotation: keeping an object off the palm and rotating it continuously around the X, Y, or Z axes.
Simulation Performance & Code Quality
The authors compared rewards generated by various LLMs (GPT-4o, Gemini-1.5, Llama-3.1, etc.) against a highly tuned Human-Engineered Baseline. This baseline was crafted by experts and contained complex terms for smoothness, stability, and orientation.
The results in simulation were telling:
*(Table 2: simulation performance and code-quality comparison between LLM-generated rewards and the human-engineered baseline.)*
Looking at Table 2, we see two major takeaways:
- Performance: The best LLM rewards (e.g., from Gemini-1.5-Flash or GPT-4o) achieved significantly higher “Rotations per Episode” (Rots/Ep) than the human baseline (5.48 vs 4.92).
- Simplicity: The “Code Quality” columns are fascinating. The human code used 66 variables and 111 lines of code. The LLM rewards used an average of only 7 variables and 20-30 lines of code.
The LLMs discovered that you don’t need complex, interconnected terms. You usually just need a few succinct components: touch the object, keep it oriented, and a sparse bonus for success. The LLM code was more interpretable and computationally efficient.
Real-World Deployment
Simulation is useful, but the real world is the true test. The authors deployed the distilled Student policies onto the physical Allegro Hand. They tested with diverse objects—some similar to the training set, and some “Out of Distribution” (OOD) objects with different shapes and weights.

The experiments covered different hand orientations, specifically “Palm Up” and “Palm Down,” which changes how gravity affects the manipulation.

The “Aggression” Advantage
A surprising finding emerged when moving to the real world. In simulation, the human baseline appeared very stable. However, on the physical robot, the LLM policies outperformed the human baseline in both rotation speed and stability.
Why?
The authors hypothesize that the LLM policies were “faster” and more aggressive in simulation. In the rigid physics of a simulator, fast movement can be risky (brittle). But in the real world, the TacTip sensors are soft and compliant. They provide rich, continuous friction data.
The LLM-trained agents leveraged this speed to be reactive. When the object started to slip, the fast LLM policy could quickly re-grasp or adjust. The human-engineered policy, which was tuned to be “smooth” and “cautious,” was too slow to catch the object when dynamic slips occurred.
In the end, the LLM-derived policies achieved 38% more rotations and 25% longer episode durations (before dropping) compared to the human baseline.
Conclusion and Implications
Text2Touch represents a significant step forward for robotic manipulation. It demonstrates that the tedious “alchemy” of manual reward tuning can be offloaded to Large Language Models, even for tasks involving complex tactile sensation.
Here are the key takeaways for students and researchers:
- LLMs Can Handle Touch: By converting the environment into a structured code context, LLMs can reason about tactile variables they have never physically “felt.”
- Prompt Engineering Matters: You cannot simply paste a 70-variable environment into a prompt. Structuring the input (function signatures) and output (scalable bonuses) is required to get valid code.
- Simplicity Wins: LLMs naturally converged on simpler, sparser reward functions that transferred better to reality than complex human-crafted ones.
- Better Sim-to-Real: The resulting policies were not just good in simulation; their dynamic, aggressive nature made them more robust in the real world.
This paper suggests a future where robot learning is more accessible. Instead of spending weeks tuning parameters, engineers might spend their time describing the task in natural language, letting the AI work out the mathematical details of how to feel and manipulate the world.