Introduction
Imagine you are reaching into a cluttered bag to find your house keys. You can’t really see them, but your fingers brush against cold metal, and you instantly know you’ve found them. Or consider checking if a banana is ripe; looking at the color helps, but a gentle squeeze tells you if it’s mushy or firm.
For humans, integrating vision with haptics (the sense of touch) is seamless. For robots, it is an immense challenge. While computer vision has seen an explosion of capabilities due to massive datasets like ImageNet, robotic touch has lagged behind. Robots struggle to differentiate between a plastic apple and a real one, or a metal cup and a ceramic one, simply by looking. They need to feel.
The primary bottleneck has been data. Collecting haptic data requires physical interaction—a robot must touch an object to “feel” it. This makes data collection slow, expensive, and usually confined to sterile lab environments that don’t reflect the real world.
In a recent paper, researchers from Cornell University introduced CLAMP, a project that aims to solve this scalability problem. They designed a low-cost, handheld device that allowed non-experts to crowdsource haptic data in their own homes. The result is the largest open-source multimodal haptic dataset to date, enabling robots to recognize materials and object stiffness (compliance) better than vision alone ever could.

Background: The Data Gap in Haptics
To understand why CLAMP is significant, we must look at the state of robotic perception.
- Vision is action-independent: You can scrape billions of images from the internet to train a vision model. The data already exists.
- Haptics is action-conditioned: To know what a surface feels like, you have to interact with it. The sensation depends on how hard you press, how fast you slide your finger, and the temperature of the sensor.
Previous attempts to build haptic datasets faced three major limitations:
- Lack of Multimodality: Many datasets only recorded force (how hard is the contact?). But human touch is complex—we sense temperature (thermal), vibration (texture), and limb position (proprioception).
- Lack of Scale: Most datasets were collected by a single robot in a lab, limiting the number of objects to a few hundred.
- Lack of Diversity: Lab objects are often uniform (blocks, sponges). The real world is filled with complex items like packaged snacks, heterogeneous tools, and furniture.
The researchers hypothesized that if they could get haptic data “in the wild”—from real homes using a standardized device—they could train robust models that generalize to different robots.
The CLAMP Device: Hardware for the Masses
To crowdsource physical data, you cannot ship a $50,000 robot arm to participants. You need something cheap, portable, and easy to use. The researchers developed the CLAMP device (Crowdsourcing a LArge-scale Multimodal Perception device).
The device is essentially a “smart” reacher-grabber. It costs less than $200 to build and weighs about 1.3 lbs, making it easy for participants to use with one hand.

The Sensor Suite
The “fingers” of the grabber are equipped with suction cups embedded with sensors designed to capture five distinct modalities of touch:
- Active Thermal: A sensor heated to 55°C. When it touches an object, the rate at which it cools down helps determine the object’s thermal conductivity (e.g., metal feels colder than wood because it draws heat away faster).
- Passive Thermal: Measures the ambient surface temperature of the object.
- Force: A Force-Sensing Resistor (FSR) measures the hardness or “squishiness” of the grasp.
- Vibration: A contact microphone listens to the sound of the interaction (like the scratchy sound of fabric vs. the silence of rubber).
- Proprioception: Inertial Measurement Units (IMUs) track how the gripper moves and rotates during the grasp.
The device includes a Raspberry Pi Zero for computing and a small screen with a GUI to guide users through the data collection process. Users simply take a picture of an object, say what it is, and then grasp it five times.
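As a concrete illustration (not the authors' implementation), the sketch below shows how a single grasp sample might be bundled and how a thermal decay-rate feature could be pulled from the active-thermal stream. The field names, sampling rate, and exponential-decay fit are all assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GraspSample:
    """One grasp trial from the handheld device (field names are illustrative)."""
    active_thermal: np.ndarray   # heated-sensor temperature trace (deg C)
    passive_thermal: np.ndarray  # ambient surface-temperature trace (deg C)
    force: np.ndarray            # FSR readings during the squeeze
    vibration: np.ndarray        # contact-microphone waveform
    imu: np.ndarray              # (T, 6) accelerometer + gyroscope readings

def thermal_decay_rate(active_thermal: np.ndarray, dt: float = 0.1) -> float:
    """Fit T(t) ~ T_inf + A * exp(-k * t) with a log-linear least-squares fit.

    A larger k means the object pulls heat away from the 55 C sensor faster
    (e.g., metal); a smaller k suggests an insulator (e.g., wood or plastic).
    """
    t = np.arange(len(active_thermal)) * dt
    t_inf = active_thermal.min()                       # crude asymptote estimate
    excess = np.clip(active_thermal - t_inf, 1e-3, None)
    k, _ = np.polyfit(t, -np.log(excess), deg=1)       # slope of -log(excess) is k
    return k
```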
The CLAMP Dataset
By deploying 16 of these devices to 41 participants, the researchers amassed a massive dataset. The numbers are unprecedented for haptics:
- 5,357 unique household objects.
- 25,100 distinct touches (trials).
- 12.3 million individual data points.

As shown in the statistics above, the dataset covers a wide distribution of materials—from hard plastics and metals to soft fabrics and foams. It also captures a variety of grasping forces and speeds, which is crucial for training a model that isn’t brittle.
To label this massive dataset, the team used an automated pipeline. They transcribed the user’s spoken description (e.g., “This is a steel cup”) and used GPT-4o to analyze the object’s image and description to generate ground-truth labels for material (e.g., Steel) and compliance (e.g., Hard).
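The exact prompt and label taxonomy are not reproduced here, but a hedged sketch of such a labeling call, using the OpenAI Python client with an assumed prompt and assumed label lists, might look like this:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MATERIALS = ["metal", "plastic", "wood", "glass", "ceramic", "fabric", "paper", "foam"]
COMPLIANCE = ["hard", "soft"]

def label_object(image_path: str, transcript: str) -> str:
    """Ask GPT-4o for a material and compliance label, given an object photo
    and the participant's transcribed spoken description."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    prompt = (
        f"The user said: '{transcript}'. Based on the image and description, "
        f"answer with one material from {MATERIALS} and one compliance label "
        f"from {COMPLIANCE}, formatted as 'material, compliance'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content  # e.g. "metal, hard" (parsing omitted)
```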
The Core Method: The CLAMP Model
Collecting data is only half the battle. The goal is to give robots the ability to identify materials and compliance. The researchers developed the CLAMP Model, a “visuo-haptic” architecture that fuses sight and touch.
Architecture Breakdown
The model consists of two parallel encoders that process different types of sensory input:
- Visual Encoder (GPT-4o): The robot takes an image of the object. This image is processed by GPT-4o to generate a visual estimate of the material. However, vision can be tricked—a metallic-painted plastic spoon looks like metal but isn’t.
- Haptic Encoder (InceptionTime): This is a specialized neural network designed for time-series data. It takes the raw streams from the thermal sensors, force sensors, microphone, and IMUs. It processes these signals to extract features like thermal decay rates or stiffness profiles.
![Figure 3: Model overview: We propose the CLAMP model, a visuo-haptic model that fuses outputs from a GPT-4o [50] visual encoder and a pretrained InceptionTime-based [51] haptic encoder.](/en/paper/2505.21495/images/004.jpg#center)
Impedance and Proprioception
One specific feature the researchers engineered is Impedance, which captures how strongly an object resists motion. They calculated it using the relationship between the change in force and the angular velocity of the gripper:

\[
\text{Impedance}(t) = \frac{F'(t)}{\omega(t)}
\]

In this equation, \(F'(t)\) is the rate of change of the grasp force and \(\omega(t)\) is the angular velocity. If the gripper is squeezing hard but moving little (low velocity), impedance is high (a hard object). If the gripper moves easily while force increases slowly, impedance is low (a soft object).
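The paper's exact preprocessing is not reproduced here, but a minimal NumPy sketch of this ratio might look like the following (the sampling rate, signal names, and the small floor on angular velocity are assumptions):

```python
import numpy as np

def impedance_profile(force, gyro_z, dt=0.01, eps=1e-3):
    """Rough impedance estimate from a force trace and gripper angular velocity.

    force:  1-D array of FSR readings over one grasp
    gyro_z: 1-D array of IMU angular velocity about the grasp axis (rad/s)
    dt:     sample period in seconds (100 Hz assumed here)
    eps:    floor on |omega| to avoid dividing by near-zero velocity
    """
    f_prime = np.gradient(force, dt)                # F'(t): rate of force change
    omega = np.clip(np.abs(gyro_z), eps, None)      # |omega(t)|, floored
    return f_prime / omega                          # high -> stiff, low -> compliant

# Toy example: a stiff object builds force quickly while the gripper barely rotates.
t = np.arange(0, 1, 0.01)
stiff = impedance_profile(force=50 * t, gyro_z=np.full_like(t, 0.05))
soft = impedance_profile(force=5 * t, gyro_z=np.full_like(t, 0.5))
print(stiff.mean(), soft.mean())   # stiff mean is far larger than soft mean
```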
Visuo-Haptic Fusion
The visual and haptic features are concatenated (joined together) and passed through a Multi-Layer Perceptron (MLP)—a standard neural network classifier. This fusion allows the model to correct visual errors using tactile data. If the vision system says “Metal” but the thermal sensor says “Warm” and the contact mic says “Soft,” the model can correct the prediction to “Plastic” or “Fabric.”
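As a rough sketch of this late-fusion design (not the authors' code), the snippet below stands in a simple 1-D convolutional encoder for the pretrained InceptionTime haptic encoder and assumes the GPT-4o visual output arrives as a fixed-length probability vector over material classes; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class HapticEncoder(nn.Module):
    """Stand-in for the pretrained InceptionTime encoder: maps a multichannel
    time series (B, channels, T) to a fixed-length feature vector."""
    def __init__(self, in_channels=8, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(64, feat_dim, kernel_size=9, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, x):
        return self.conv(x).squeeze(-1)  # (B, feat_dim)

class VisuoHapticClassifier(nn.Module):
    """Concatenate visual and haptic features, then classify with an MLP."""
    def __init__(self, vis_dim=8, feat_dim=128, n_classes=8):
        super().__init__()
        self.haptic = HapticEncoder(feat_dim=feat_dim)
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim + feat_dim, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, haptic_ts, vis_feat):
        fused = torch.cat([self.haptic(haptic_ts), vis_feat], dim=-1)
        return self.mlp(fused)  # material (or compliance) logits

# Toy forward pass: batch of 4 grasps, 8 sensor channels, 500 timesteps.
model = VisuoHapticClassifier()
logits = model(torch.randn(4, 8, 500), torch.randn(4, 8))
print(logits.shape)  # torch.Size([4, 8])
```

The point of the sketch is the late-fusion step: each modality is encoded separately, and the MLP head is where conflicting visual and haptic evidence gets reconciled.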
Performance vs. Baselines
The researchers compared their model against state-of-the-art vision models (like CLIP and Open-Vocabulary models) and haptic-only baselines.

The results (Table 2) show a clear hierarchy:
- Vision-only models struggle significantly with material classification (Accuracy near 0 or very low in zero-shot settings due to the difficulty of inferring material from RGB pixels alone).
- Haptic-only models perform better (Accuracy ~59%), proving that touch is informative.
- The CLAMP Model (Vision + Haptics) achieves the highest accuracy (87%), demonstrating that the two senses are complementary.
Crucially, ablation studies (removing one sensor at a time) showed that Force and Active Thermal sensing were the most critical haptic modalities. Without force, the model cannot distinguish rigid vs. soft; without thermal, it struggles with metal vs. plastic.
Experiments & Results: From Stick to Robot
A major question in robotics is “Sim-to-Real” or “Human-to-Robot” transfer. Can a model trained on data collected by humans holding a stick be used by a sophisticated robotic arm?
The researchers tested this on three different robot embodiments:
- Franka Emika Panda equipped with the CLAMP gripper.
- Franka Emika Panda with a standard parallel-jaw gripper (with the CLAMP sensors adapted to fit).
- WidowX arm with a parallel-jaw gripper.

They found that the CLAMP model generalized surprisingly well. With just a tiny amount of fine-tuning (using only 15% of robot-specific data), the model outperformed vision-only baselines on the robots. This suggests that the fundamental “physics” of touch learned from the crowdsourced dataset applies broadly, regardless of the specific robot arm moving the sensor.
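The fine-tuning recipe is not detailed in this summary, so the following is only one plausible way to use the 15% robot-specific split, under stated assumptions: freeze the pretrained haptic encoder and update only the fusion MLP, reusing the classifier sketched earlier (the dataset format shown is hypothetical).

```python
import torch
from torch.utils.data import DataLoader, random_split

def finetune_on_robot_data(model, robot_dataset, frac=0.15, epochs=10, lr=1e-4):
    """Fine-tune only the fusion MLP on a small fraction of robot-collected grasps.

    `model` is the VisuoHapticClassifier sketched above; `robot_dataset` is assumed
    to yield (haptic_ts, vis_feat, label) tuples from the target robot embodiment.
    """
    n_train = int(frac * len(robot_dataset))
    train_set, _ = random_split(robot_dataset, [n_train, len(robot_dataset) - n_train])
    loader = DataLoader(train_set, batch_size=32, shuffle=True)

    for p in model.haptic.parameters():      # keep the pretrained haptic encoder fixed
        p.requires_grad = False
    optim = torch.optim.Adam(model.mlp.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for haptic_ts, vis_feat, label in loader:
            optim.zero_grad()
            loss = loss_fn(model(haptic_ts, vis_feat), label)
            loss.backward()
            optim.step()
    return model
```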
Real-World Tasks
To prove practical utility, the team deployed the system in three realistic scenarios:
1. Recycling Sorting (Material Recognition)
The robot was tasked with sorting trash into “Recycle” or “Garbage.” This is notoriously hard for vision because materials like dirty paper, crushed cans, and plastic wrappers look chaotic.
- Result: The CLAMP-enabled robot successfully identified materials like aluminum, paper, and plastic, achieving a 90% success rate.
2. Bag Retrieval (Occluded Object Detection)
The robot had to find a metallic object inside a cluttered bag.
- Challenge: The camera cannot see inside the bag or distinguish a metal object from a non-metal one when they are jumbled together.
- Result: The vision-only model failed completely (0 successful retrievals). The CLAMP model successfully identified and retrieved the metallic object in 6 out of 13 trials by “feeling” the objects.
3. Banana Sorting (Compliance Recognition)
The robot needed to separate ripe bananas from overripe ones.
- Challenge: Visually, a spotted banana might be ripe or overripe. The difference is internal stiffness (compliance).
- Result: Using the force and impedance data, the robot could gently squeeze the fruit to determine if it was “Soft” (overripe) or “Hard” (ripe/green), achieving 83% accuracy.

Conclusion & Implications
The CLAMP project represents a pivotal shift in how we approach robotic perception. It moves away from the “lab-centric” view of data collection toward a distributed, crowdsourced model—similar to how the internet enabled large language models.
Key Takeaways:
- Haptics is Essential: For manipulation tasks involving sorting, recycling, or food handling, vision is insufficient. Touch provides the necessary ground truth.
- Crowdsourcing Works: You don’t need expensive robots to collect robot data. A $200 handheld device can generate millions of high-quality training points.
- Cross-Embodiment Transfer: Physics is universal. A model that learns how heat transfers from steel to a handheld sensor can apply that knowledge to a robot arm with minimal adjustment.
By bridging the gap between vision and touch, and by solving the data scarcity problem, CLAMP opens the door for robots that can operate not just by looking at the world, but by physically interacting with it—safely, intelligently, and effectively. Future work may see these sensors integrated into even more dexterous hands, allowing robots to perform complex tasks like buttoning a shirt or doing dishes, where the “feel” of the object is everything.