Imagine trying to plug a USB charger into a wall socket in a pitch-black room. You can’t see the socket, but you can feel around, sensing the resistance when you hit the plastic faceplate and the satisfying “click” when the plug slides in. Now, imagine a robot trying to do the same thing relying only on a camera. If its hand blocks the view, or if the lighting is bad, the robot is effectively blind and numb. It pushes, fails, and doesn’t know why.

This highlights a critical gap in modern robotics. While Vision-Language-Action (VLA) models have become incredibly good at “seeing” and “reading”—interpreting images and instructions to plan movements—they are often bad at “feeling.”

In a recent paper, “TA-VLA: Elucidating the Design Space of Torque-aware Vision-Language-Action Models,” researchers tackle this problem by giving VLA models a sense of touch through joint torque. By analyzing how much effort the robot’s motors are exerting, the system can detect contacts, collisions, and successful insertions without needing expensive external touch sensors.

In this post, we will deconstruct this paper to understand how torque signals can be integrated into large-scale robotic models. We will explore the “where, when, and how” of torque perception and see how it enables robots to perform precise, contact-rich tasks.

The Blind Spot of VLA Models

State-of-the-art VLA models, such as Octo or RT-2, operate primarily on visual inputs (RGB cameras) and language instructions. They map pixels and text to joint positions. This works well for “non-contact” tasks, like moving an apple from a table to a bowl.

However, contact-rich tasks—like inserting a charger, turning a stiff door handle, or wiping a surface—require feedback that vision cannot provide.

  1. Occlusion: As a robot arm approaches a target, the arm itself often blocks the camera’s view.
  2. Subtlety: The difference between a plug that is misaligned by 1mm and one that is perfectly aligned is often visually imperceptible, but physically obvious due to resistance forces.

This is where torque comes in. As shown below, torque signals provide a rich narrative of what is happening physically during a task.

Figure 1: Torque response during a charger-insertion task. Note the distinct spikes during successful insertion compared to failed attempts.

In Figure 1(a), you can see the “heartbeat” of a manipulation task. The gray areas show movement without contact. The orange area shows a failed insertion—the torque fluctuates slightly but never peaks. The green area shows a success: specific, sharp spikes in torque indicate the mechanical resistance of the socket clips engaging. A vision-only model misses this entire story.

The Physics of “Feeling”

Before diving into the neural network architecture, we need to understand the physical intuition. How does a robot “feel” without skin?

The robot possesses proprioception—it knows the angles of its joints (\(q\)) and the speed at which they are moving (\(\dot{q}\)). It also knows how much current it is sending to its motors. Since motor current is directly proportional to torque, the robot can measure the torque (\(\tau_{measured}\)) at each joint.

The relationship between the robot’s movement and the external forces acting on it is governed by rigid body dynamics:

\[
M(q)\ddot{q} + C(q,\dot{q})\dot{q} + G(q) = \tau_{cmd} + \tau_{ext}
\]

Here, \(M(q)\ddot{q} + C(q,\dot{q})\dot{q} + G(q)\) represents the torque required just to move the robot’s own weight and inertia. \(\tau_{cmd}\) is the commanded torque, and \(\tau_{ext}\) is the torque caused by the environment pushing back (external contact).

If we know the robot’s model (its mass, gravity, etc.), we can predict the torque required to move in free air (\(\tau_{model}\)). Any difference between the measured torque and this model torque is likely due to external contact:

\[
\tau_{ext} \approx \tau_{measured} - \tau_{model}
\]

This residual torque is the signal we want to feed into our AI. It tells the model, “I am pushing harder than I should be; something is blocking me.”
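
To make the residual idea concrete, here is a minimal sketch in Python. The inverse-dynamics term is a stand-in: on a real robot it would come from the manufacturer’s calibrated model or a rigid-body dynamics library, and the toy numbers below are purely illustrative.

```python
import numpy as np

def external_torque(tau_measured, q, dq, ddq, inverse_dynamics):
    """Estimate the torque contributed by contact with the environment.

    tau_measured     : measured joint torques (e.g. inferred from motor currents)
    q, dq, ddq       : joint positions, velocities, accelerations
    inverse_dynamics : callable returning M(q) @ ddq + C(q, dq) @ dq + G(q),
                       i.e. the torque needed to make this motion in free air
    """
    tau_model = inverse_dynamics(q, dq, ddq)   # torque predicted by the rigid-body model
    return tau_measured - tau_model            # residual = external contact torque

# Toy example: a 2-joint arm holding still (ddq = 0), so the model only fights gravity.
gravity_only = lambda q, dq, ddq: np.array([1.2, 0.4])       # illustrative G(q) values
tau_ext = external_torque(tau_measured=np.array([1.2, 1.1]),
                          q=np.zeros(2), dq=np.zeros(2), ddq=np.zeros(2),
                          inverse_dynamics=gravity_only)
print(tau_ext)   # [0.  0.7] -> joint 2 is pressing against something
```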

The Design Space: Integrating Torque into VLAs

The core contribution of this paper is a systematic study of the Design Space. The authors didn’t just slap torque inputs onto a model; they asked:

  1. Where should this signal go? (Encoder vs. Decoder)
  2. How should we represent time? (Single frame vs. History)
  3. What should the objective be? (Observation vs. Prediction)

Let’s break down their findings.

1. Where to Embed? (Encoder vs. Decoder)

A typical VLA model has two main parts:

  • Conditioning Encoder: Processes the “context”—the images and the text instruction (e.g., “Plug in the charger”). It builds a high-level understanding of the scene.
  • Denoising Decoder: The “action expert.” It takes the context and the robot’s current physical state (proprioception) to generate the next movement (action).

The researchers tested three architectures to see where torque fits best:

Figure 2: Architectures for embedding torque signals. (a) Into the Encoder. (b) Pre-concatenated in the Decoder. (c) Post-concatenated in the Decoder.

  • Encoder Embedding (Enc): Treating torque like an image or text. It goes into the high-level context processor.
  • Decoder Embedding (DePre & DePost): Treating torque like a physical state (similar to joint angles). It goes directly into the action generation module.

The Verdict: The Decoder wins, specifically the “DePost” method (Figure 2c), where torque is processed by an adapter and added to the decoder’s input.
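
As a rough illustration of what a DePost-style adapter could look like in code, here is a minimal PyTorch sketch. The module name, layer sizes, and the choice of a two-layer MLP are assumptions for illustration, not the paper’s implementation.

```python
import torch
import torch.nn as nn

class TorqueAdapter(nn.Module):
    """Projects raw joint torques into the decoder's token space (DePost-style sketch)."""

    def __init__(self, n_joints: int = 7, d_model: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(n_joints, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, decoder_input: torch.Tensor, torque: torch.Tensor) -> torch.Tensor:
        # decoder_input: (batch, n_tokens, d_model) -- the decoder's already-assembled input
        # torque:        (batch, n_joints)          -- current joint torques
        torque_emb = self.proj(torque).unsqueeze(1)   # (batch, 1, d_model)
        return decoder_input + torque_emb             # broadcast-add onto every input token

# Usage with random tensors
adapter = TorqueAdapter(n_joints=7, d_model=512)
out = adapter(torch.randn(2, 16, 512), torch.randn(2, 7))
print(out.shape)   # torch.Size([2, 16, 512])
```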

Why? The authors performed a statistical analysis (HSIC) to see which data types are correlated. They found that torque is highly correlated with joint angles (Action and Angle) but has very low correlation with Text or Images.

Figure 3: Normalized HSIC heatmap showing correlations. Note the high correlation between Torque, Angle, and Action (red), and the low correlation with Text (blue).

As seen in the heatmap above, Torque, Angles, and Actions are “birds of a feather.” They are all high-frequency, proprioceptive signals. The Encoder is designed to handle static, semantic data (images/text), while the Decoder is sensitive to fine-grained variations in physical state. Therefore, feeding torque to the Decoder aligns the signal with the part of the brain best suited to use it.
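
For readers curious how such a dependence score is computed, below is a minimal sketch of normalized HSIC with RBF kernels. The kernel choice, bandwidth, and toy data are assumptions; the paper’s exact estimator may differ.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Pairwise Gaussian kernel matrix for the row vectors of X."""
    sq = np.sum(X**2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-dists / (2.0 * sigma**2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC estimate: tr(K H L H) / (n - 1)^2."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return np.trace(rbf_kernel(X, sigma) @ H @ rbf_kernel(Y, sigma) @ H) / (n - 1) ** 2

def normalized_hsic(X, Y, sigma=1.0):
    """HSIC scaled by the self-dependence terms: near 0 for independent signals, 1 when Y = X."""
    return hsic(X, Y, sigma) / np.sqrt(hsic(X, X, sigma) * hsic(Y, Y, sigma))

# Toy check: a torque-like signal derived from joint angles scores much higher
# against those angles than against an unrelated "semantic" signal.
rng = np.random.default_rng(0)
angle = rng.normal(size=(200, 7))
torque = angle @ rng.normal(size=(7, 7)) + 0.1 * rng.normal(size=(200, 7))
text = rng.normal(size=(200, 7))
print(normalized_hsic(torque, angle))   # relatively high
print(normalized_hsic(torque, text))    # close to zero
```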

2. Sense What Was: Torque History

Knowing the torque right now is useful, but knowing how torque has changed over the last second is critical. A “click” is a temporal event—a rise followed by a drop.

The researchers explored two ways to feed history into the model:

  1. Frame-wise (Multi-Token): Feed 10 separate tokens representing the last 10 timesteps.
  2. Aggregate (Single-Token): Summarize the last 10 timesteps into one dense vector.

Figure 4: Architectures for embedding torque history. (c) Summarizing history into a single token for the decoder proved most effective.

The Verdict: Single-Token History in the Decoder (Figure 4c) is best.

Why? You might think more tokens = more information = better performance. However, VLA decoders are pretrained on specific input patterns. Flooding the decoder with 10 extra tokens disrupts its learned patterns, essentially acting as noise. Compressing the history into a single token preserves the information (the “feeling” of the contact event) without breaking the architecture’s expected structure.
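
One way such an aggregator might look, assuming a window of the last T torque readings: the sketch below uses a small GRU to squeeze the window into a single decoder token. The use of a GRU and the dimensions are assumptions for illustration rather than the paper’s exact design.

```python
import torch
import torch.nn as nn

class TorqueHistoryToken(nn.Module):
    """Compresses a window of recent torque readings into a single decoder token."""

    def __init__(self, n_joints: int = 7, d_model: int = 512):
        super().__init__()
        self.gru = nn.GRU(input_size=n_joints, hidden_size=d_model, batch_first=True)

    def forward(self, torque_history: torch.Tensor) -> torch.Tensor:
        # torque_history: (batch, T, n_joints), e.g. the last 10 timesteps
        _, h_last = self.gru(torque_history)     # final hidden state summarizes the window
        return h_last.transpose(0, 1)            # one token per sample: (batch, 1, d_model)

# Usage: 10 timesteps of 7-joint torques become a single 512-dim token
token = TorqueHistoryToken()(torch.randn(2, 10, 7))
print(token.shape)   # torch.Size([2, 1, 512])
```

Because the decoder only ever receives one extra token, its pretrained input pattern stays intact while the temporal rise-and-drop of a “click” is still captured.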

3. Predict What Will Be: Action-Torque Diffusion

Usually, models use torque only as an input (observation). But humans don’t just feel what is happening; we anticipate it. When you push a heavy door, you expect resistance. If you don’t feel it, you stumble.

The researchers proposed a “Joint Action-Torque Diffusion” approach. They trained the model to predict both the next action (movement) and the expected future torque.

Figure 5: Architecture for Action-Torque Diffusion. The model predicts future actions and future torques simultaneously.

This creates an auxiliary loss function. The model isn’t just penalized for moving incorrectly; it’s penalized if it doesn’t understand the physical consequences (torque) of its movement.
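
Schematically, the objective might look like the sketch below: the action chunk and the future torque trace are concatenated, corrupted, and denoised together, with the torque term acting as an auxiliary loss. The noising schedule, the λ weight, the ToyDenoiser stand-in, and all shapes are assumptions for illustration, not the paper’s training code.

```python
import torch
import torch.nn.functional as F

def joint_action_torque_loss(model, context, actions, future_torques, lam=0.5):
    """One denoising step that supervises future actions AND future torques.

    actions        : (batch, horizon, action_dim) ground-truth action chunk
    future_torques : (batch, horizon, n_joints)   torques recorded over that horizon
    """
    target = torch.cat([actions, future_torques], dim=-1)       # denoise both jointly
    t = torch.rand(target.shape[0], device=target.device)       # random corruption level
    noise = torch.randn_like(target)
    noisy = (1 - t[:, None, None]) * target + t[:, None, None] * noise  # illustrative schedule

    pred = model(context, noisy, t)                              # predict the clean sequence
    pred_actions, pred_torques = pred.split(
        [actions.shape[-1], future_torques.shape[-1]], dim=-1)

    action_loss = F.mse_loss(pred_actions, actions)
    torque_loss = F.mse_loss(pred_torques, future_torques)       # auxiliary "physics" term
    return action_loss + lam * torque_loss

# Toy stand-in for the denoising decoder (a real model conditions on images and text).
class ToyDenoiser(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Linear(dim + 1, dim)
    def forward(self, context, noisy, t):
        t_feat = t[:, None, None].expand(noisy.shape[0], noisy.shape[1], 1)
        return self.net(torch.cat([noisy, t_feat], dim=-1))

loss = joint_action_torque_loss(ToyDenoiser(dim=14 + 7), context=None,
                                actions=torch.randn(2, 16, 14),
                                future_torques=torch.randn(2, 16, 7))
loss.backward()
```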

The Verdict: This significantly improves performance. By forcing the model to predict forces, it builds an internal “physics engine,” grounding its latent space in reality.

Figure 6: Future torque signal prediction. The red lines (predictions) closely track the blue lines (ground truth), showing the model has learned physical dynamics.

Putting It All Together: The Experiments

The team integrated these findings into a model called \(\pi_0\) (Pi-Zero) and compared it against standard baselines like ACT and RDT. They tested on a dual-arm Aloha robot performing tasks that require “feeling,” such as button pushing, charger plugging, and door opening.

Quantitative Success

The results were stark. On contact-rich tasks, standard vision-based models often failed completely (0% success) because they couldn’t detect when they had made contact or if a plug was misaligned.

The Torque-Aware models (specifically \(\pi_0\) + obs + obj, combining torque observation and torque prediction objectives) achieved high success rates.

  • Charger Plugging:
      • Standard \(\pi_0\): 0/20 successes (vision cannot resolve the fine alignment).
      • Torque-Aware \(\pi_0\): 17/20 successes.
  • Button Pushing:
      • Standard \(\pi_0\): 5/20 successes.
      • Torque-Aware \(\pi_0\): 18/20 successes.

Even on “regular” tasks where torque seems less critical (like stacking cubes), the torque-aware model performed slightly better or on par, showing that adding this sense doesn’t hurt general capabilities.

Visualizing the “Touch”

The most compelling evidence is watching how the robot behaves. In the figure below, look at the Button Pushing and Door Handle sequences.

Figure 7: Visualization of tasks. In (a) and (b), the robot detects a failed first attempt via torque, readjusts, and retries successfully.

In sequence (a), the robot tries to push a button but misses or slips (misalignment). A standard model would blindly continue its pre-planned trajectory, likely jamming into the table. The Torque-Aware model detects the anomaly in the force feedback. It retreats, realigns, and presses again successfully. This closed-loop correction is the holy grail of robust manipulation.

Cross-Embodiment Generalization

Can a model trained on one robot work on another? The researchers tested this on a completely different industrial arm (ROKAE).

Figure 12: Cross-Embodiment Execution. The model adapts to different charging ports (fast vs. slow) using torque feedback.

The model successfully generalized, inserting different types of connectors (fast vs. slow chargers) by feeling the specific resistance profiles of each port.

Conclusion: The Future is Multisensory

The TA-VLA paper provides a clear blueprint for the next generation of robot brains. It moves us away from the “eyes-only” paradigm and towards “full-body” intelligence.

Key Takeaways:

  1. Decoder is King: Proprioceptive signals like torque belong in the action decoder, not the semantic encoder.
  2. Compress History: Don’t overwhelm the model; summarize the history of forces into a compact token.
  3. Predict Forces: Training a robot to anticipate resistance helps it understand the world better than just training it to move.

By elucidating this design space, the authors have opened the door for robots that can operate in the messy, contact-rich real world—plugging in cables, assembling furniture, and handling delicate objects with the sensitivity of a human hand.


This post explores the research presented in “TA-VLA: Elucidating the Design Space of Torque-aware Vision-Language-Action Models” by Zhang et al.