Introduction
If you have ever watched a robot try to fold a t-shirt, you might have noticed a stark difference between its strategy and yours. A typical robotic approach involves painstakingly smoothing the garment flat on a table, using computer vision to hunt down every wrinkle, and then executing a pre-calculated fold. It is slow, rigid, and requires a lot of table space.
You, on the other hand, probably pick the shirt up, give it a shake, and fold it in mid-air. If you grab it by the wrong end, you simply rotate it until you find the collar. You rely on the feel of the fabric and a general understanding of where the sleeves should be, even if the shirt is crumpled.
Why is this so hard for robots? Clothing is “deformable,” meaning it has an infinite number of possible shapes. A crumpled shirt looks nothing like a flat shirt, and key features (like a sleeve) are often hidden inside folds (self-occlusion).
In a fascinating new paper, “Reactive In-Air Clothing Manipulation with Confidence-Aware Dense Correspondence and Visuotactile Affordance,” researchers from MIT, Prosper AI, and Boston Dynamics present a framework that brings robots closer to human-level adaptability. Instead of demanding a perfect view of a flat shirt, their system learns to manipulate garments in mid-air, uses tactile sensors to “feel” the cloth, and—crucially—knows when it is confused and needs to take a second look.
In this deep dive, we will explore how they achieved this by combining high-fidelity simulation, a new way of calculating visual correspondence, and a reactive state machine that isn’t afraid to say, “I’m not sure yet.”
The Core Problem: The Infinite Shape of Cloth
Rigid-object manipulation is comparatively well understood in robotics. If you see a coffee mug, its handle is always in a fixed position relative to the cup. But a shirt? The “handle” (say, the shoulder seam) could be anywhere. It could be inside out, twisted, or buried under the rest of the fabric.
Previous attempts to solve this relied on flattening. The logic was: if we make the cloth 2D, we can treat it like a rigid object. But flattening is time-consuming and often fails if the robot can’t figure out how to un-crumple the object in the first place.
The researchers propose a different paradigm: In-Air Manipulation. By holding the garment up, gravity helps unfurl it. To make this work, the robot needs three specific capabilities:
- Dense Correspondence: It needs to look at a crumpled mess of pixels and identify “that pixel is the left shoulder.”
- Visuotactile Affordance: It needs to know “can I actually grip this spot without slipping or grabbing too many layers?”
- Confidence Awareness: It needs to know how certain it is. If the robot thinks a blob of fabric might be a sleeve but isn’t sure, it shouldn’t act. It should wait or change its view.

As shown in Figure 1, the system integrates these components into a loop. It perceives the cloth, estimates the probability of different parts being the “shoulder” or “sleeve,” checks if those parts are graspable, and uses tactile feedback to confirm the grip.
Step 1: Building a Digital Closet
To train a neural network to recognize parts of a shirt, you need data—specifically, labeled data. You need to know that this specific pixel on a crumpled image corresponds to that specific pixel on a flat shirt.
Collecting this data in the real world is a nightmare. You would need to take thousands of photos of shirts, manually clicking on the exact same stitch on every single one. Instead, the team turned to simulation using Blender 4.2.

As illustrated in Figure 2, they created a parametric generation pipeline. This isn’t just about rendering a generic t-shirt mesh; they randomized:
- Geometry: Sleeve lengths, collar types (V-neck vs. Crew), and tightness.
- Physics: Stiffness and damping (how the cloth swings).
- Visuals: Crucially, they added seams, hems, and stitching.
Why are seams important? In a featureless blob of colored fabric, a seam is often the only visual cue indicating orientation. By simulating these details, the visual model learns to rely on the same geometric cues humans do. They generated 1,500 scenes, creating a massive dataset of “deformed” vs. “canonical” (flat) pairs, all with perfect ground-truth labels.
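To give a flavor of what “parametric generation” means in practice, here is a toy sketch of the kind of parameter sampling such a pipeline performs before handing a scene to the cloth simulator. The parameter names and ranges below are invented for illustration; they are not the paper’s actual values.
```python
import random

def sample_garment_config():
    """Randomly sample one synthetic garment configuration for the training set."""
    return {
        # Geometry
        "sleeve_length": random.uniform(0.2, 0.8),    # fraction of full arm length
        "collar_type": random.choice(["crew", "v_neck"]),
        "tightness": random.uniform(0.9, 1.2),        # scale factor on torso width
        # Physics (controls how the cloth hangs and swings)
        "stiffness": random.uniform(0.1, 1.0),
        "damping": random.uniform(0.1, 1.0),
        # Visuals: seams, hems, and stitching provide the orientation cues
        "seam_width": random.uniform(0.002, 0.01),    # meters
        "texture": random.choice(["solid", "striped", "printed"]),
    }

# Each sampled config becomes one scene, rendered as a (deformed, canonical) image
# pair with perfect ground-truth pixel correspondences.
scenes = [sample_garment_config() for _ in range(1500)]
```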
Step 2: Dense Correspondence with Distributional Loss
Once the data is ready, the robot needs a brain to interpret it. The goal is Dense Correspondence. This means taking an image of a crumpled shirt (the input) and mapping every pixel to a coordinate on a flat, template shirt (the canonical space).
The Flaw of Contrastive Loss
Traditionally, researchers use Contrastive Loss. This method trains the network by saying, “Pixel A in the crumpled image matches Pixel B in the flat image. Make their feature vectors similar. Make all other pairs dissimilar.”
This works for rigid objects, but cloth has symmetries. A left sleeve looks exactly like a right sleeve. If the shirt is crumpled, it might be visually impossible to tell them apart. A contrastive loss forces the network to make a hard choice, often leading to a “mean” prediction that lands somewhere in the middle of the shirt—which is wrong.
The Solution: Distributional Loss
The researchers adopted a Distributional Loss approach. Instead of forcing a single point match, the network predicts a probability distribution (a heatmap) over the canonical shirt.
If the robot sees a sleeve but can’t tell if it’s left or right, the network can output two high-probability “blobs” on the heatmap—one at the left sleeve and one at the right. This explicitly models uncertainty.
The mathematical formulation for the probability estimator is:
$$
p\big((x_i, y_j)\mid(x_a, y_a)\big) \;=\; \frac{\exp\!\left(-\,\lVert f(I_c)(x_i, y_j) - f(I_d)(x_a, y_a)\rVert^2\right)}{\sum_{i', j'} \exp\!\left(-\,\lVert f(I_c)(x_{i'}, y_{j'}) - f(I_d)(x_a, y_a)\rVert^2\right)}
$$
Here, the network computes the probability that a pixel \((x_i, y_j)\) in the canonical (flat) image \(I_c\) corresponds to the query pixel \((x_a, y_a)\) in the crumpled image \(I_d\). The softmax compares the feature descriptors \(f(\cdot)\) of the two points: the more similar the descriptors, the more probability mass that canonical pixel receives.
During training, the system minimizes the KL-Divergence (a measure of difference between probability distributions) between the predicted heatmap and a target Gaussian distribution centered on the true match.

Figure 3 visualizes this training loop. Notice the Target Distribution (\(q_b\)). It isn’t just a single dot; it’s a Gaussian blob (or multiple blobs for symmetries). This teaches the network that “near the shoulder” is a better guess than “nowhere,” and allows it to express ambiguity.
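To make the idea concrete, here is a minimal PyTorch-style sketch of a distributional correspondence loss: a softmax over feature-space distances produces the predicted heatmap, a Gaussian blob centered on the true match (summed over symmetric locations if needed) serves as the target, and the KL divergence between the two is minimized. This is an illustrative reimplementation under those assumptions, not the authors’ code.
```python
import torch
import torch.nn.functional as F

def correspondence_heatmap(feat_canonical, feat_deformed, query_uv):
    """Predicted distribution over canonical pixels for one query pixel.

    feat_canonical: (C, H, W) descriptors of the flat template image
    feat_deformed:  (C, H, W) descriptors of the crumpled image
    query_uv:       (u, v) pixel coordinates of the query in the deformed image
    """
    C, H, W = feat_canonical.shape
    q = feat_deformed[:, query_uv[1], query_uv[0]]            # (C,) query descriptor
    diff = feat_canonical - q.view(C, 1, 1)                   # compare against every pixel
    logits = -(diff ** 2).sum(dim=0)                          # (H, W) negative squared distance
    return F.softmax(logits.view(-1), dim=0).view(H, W)       # normalized probability heatmap

def gaussian_target(center_uv, H, W, sigma=5.0):
    """Target distribution: a normalized Gaussian blob at the true match.
    For symmetric parts, sum blobs at each symmetric location before normalizing."""
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    d2 = (xs - center_uv[0]) ** 2 + (ys - center_uv[1]) ** 2
    blob = torch.exp(-d2.float() / (2 * sigma ** 2))
    return blob / blob.sum()

def distributional_loss(pred, target, eps=1e-9):
    """KL divergence between the target and predicted heatmaps."""
    return (target * (torch.log(target + eps) - torch.log(pred + eps))).sum()
```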
Why This Matters: Confidence
Because the network outputs a distribution, the peak value of that distribution acts as a confidence score. If the peak is sharp and high, the robot is sure. If the distribution is flat or spread out, the robot knows it is confused. This metric is the cornerstone of the reactive system.
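In code, extracting a confidence signal from that heatmap can be as simple as reading off its peak (the exact metric and threshold used in the paper may differ; this only illustrates the idea):
```python
def heatmap_confidence(heatmap):
    """Peak probability of a normalized heatmap: a sharp, high peak means the
    network is sure; a flat or spread-out distribution means it is confused."""
    return heatmap.max().item()

def confident_enough(heatmap, threshold):
    """Gate downstream actions on confidence; the threshold is tuned empirically."""
    return heatmap_confidence(heatmap) >= threshold
```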
Step 3: Visuotactile Affordance
Knowing where the shoulder is doesn’t mean you can grab it. It might be stretched tight against a table, or bundled up with three other layers of fabric. This is where Affordance comes in.
The team defined a good grasp based on three criteria:
- Reachability: Can the gripper actually get there?
- Collision: Will the gripper hit the table or other body parts?
- Material Thickness: Are there 2 or fewer layers of cloth? (Ideally, we want to grab one edge, not a whole bundle).
They trained a U-Net architecture to predict a “grasp quality” heatmap from depth images.
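A rough sketch of how such a predictor might be used at inference time, with the reachability and collision checks applied as masks on top of the learned heatmap (the `unet` model and mask inputs are placeholders, not the paper’s implementation; layer thickness is assumed to be learned implicitly from the grasp labels):
```python
import torch

def grasp_affordance(unet, depth_image, reachable_mask, collision_free_mask):
    """Predict a per-pixel grasp-quality heatmap from a depth image, then zero out
    pixels that violate the reachability or collision criteria.

    unet:                a U-Net mapping (1, 1, H, W) depth to (1, 1, H, W) scores
    depth_image:         (H, W) float tensor
    reachable_mask:      (H, W) bool tensor, True where the gripper can reach
    collision_free_mask: (H, W) bool tensor, True where no collision is expected
    """
    with torch.no_grad():
        quality = torch.sigmoid(unet(depth_image[None, None])).squeeze()  # (H, W)
    quality = quality * reachable_mask * collision_free_mask
    return quality  # the highest-scoring pixel becomes the grasp candidate
```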

Closing the Sim-to-Real Gap
A simulation can never perfectly match the physics of real cloth. To fix this, the researchers fine-tuned the model on the real robot using Tactile Self-Supervision.
They equipped the robot fingers with GelSight Wedge sensors. These are tactile sensors that provide high-resolution images of the contact surface. The robot attempted thousands of grasps. If the GelSight sensor saw fabric texture, it was a success. If it saw nothing (empty grasp) or too much bulk, it was a failure.
This data was fed back into the affordance network. The result? A system that not only looks for visual features but also anticipates how a grasp will feel before it tries it.
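The self-supervised labeling loop can be sketched roughly as follows; every interface here (the robot API, the pixel selector, the tactile classifier) is a hypothetical stand-in meant only to show the structure of the data collection.
```python
def collect_tactile_labels(robot, affordance_model, select_pixel, classify_tactile,
                           num_attempts):
    """Label real-world grasp attempts using the tactile sensor as the supervisor.

    robot:            interface with capture_depth() and execute_grasp(pixel)
    affordance_model: maps a depth image to a grasp-quality heatmap
    select_pixel:     picks a grasp pixel from the heatmap (argmax, or sampled)
    classify_tactile: labels a GelSight image as success (fabric texture, few
                      layers) or failure (empty or bulky grasp)
    """
    dataset = []
    for _ in range(num_attempts):
        depth = robot.capture_depth()
        heatmap = affordance_model(depth)
        pixel = select_pixel(heatmap)
        tactile_image = robot.execute_grasp(pixel)
        dataset.append((depth, pixel, classify_tactile(tactile_image)))
    return dataset  # later used to fine-tune the affordance network on real data
```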

As seen in Figure 16, the fine-tuned model (Right) strikes a balance. The pure simulation model (Left) is too scared to grasp anything. The pure real-world model (Middle) is reckless. The combined model understands the geometry but respects the physical reality of the cloth.
Step 4: The Reactive State Machine
Now we have a robot that can see (Correspondence) and predict graspability (Affordance). How do we combine them to fold a shirt?
The researchers built a Confidence-Based State Machine. This is the logic that governs the robot’s behavior. It doesn’t follow a rigid script like “Move to X, Grasp Y.” Instead, it follows a logic of inquiry.
- Observe: Look at the hanging shirt.
- Query: “Where is the left sleeve?”
- Check Confidence:
  - High Confidence: “I see it clearly.” -> Grasp.
  - Low Confidence: “It’s hidden or I’m unsure.” -> Rotate.
- Action: If the grasp happens, use tactile sensors to confirm. If the grasp is empty, retry.

This workflow, detailed in Figure 7, allows the robot to handle the infinite variability of crumpled clothes. If the shirt is bunched up such that the sleeve is hidden, the robot rotates it. Gravity changes the configuration, new features are revealed, confidence spikes, and the robot strikes.
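Stripped of the motion-planning details, the confidence-gated loop might look like the sketch below. All of the robot and model interfaces are illustrative stand-ins, not the paper’s API.
```python
def find_and_grasp_part(robot, part, locate_part, grasp_affordance,
                        grasp_confirmed, conf_threshold, max_rotations=8):
    """Rotate the hanging garment until the target part is seen with high
    confidence, then grasp it and verify the grasp by touch.

    locate_part(rgb, part)   -> (H, W) correspondence heatmap for the queried part
    grasp_affordance(depth)  -> (H, W) grasp-quality heatmap
    grasp_confirmed(tactile) -> True if fabric is actually felt in the gripper
    """
    for _ in range(max_rotations):
        rgb, depth = robot.observe()
        combined = locate_part(rgb, part) * grasp_affordance(depth)  # intersect maps
        if combined.max() < conf_threshold:
            robot.rotate_garment()        # low confidence: change the view and retry
            continue
        flat_idx = int(combined.argmax())
        pixel = (flat_idx % combined.shape[1], flat_idx // combined.shape[1])  # (u, v)
        tactile_image = robot.execute_grasp(pixel)
        if grasp_confirmed(tactile_image):
            return True                   # confirmed grasp: proceed with the fold
        robot.release()                   # empty grasp: try again
    return False
```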
Experimental Results
So, does it work?
Visual Performance
The researchers compared their Distributional Loss approach against standard Contrastive Loss methods. They evaluated how accurately the system could find a specific point on a crumpled shirt.

Figure 4 shows the results. The red line (Symmetric Distributional) sits highest, indicating the lowest error rate. The heatmaps in part (b) are telling: the Distributional method produces tight, accurate hotspots (red/yellow), whereas the Contrastive methods produce messy, confident-but-wrong predictions.
Real-World Folding
When deployed on a real dual-arm robot setup, the system showed impressive resilience.

In Figure 5, you can see the robot’s internal view. The “Correspondence Heatmap” shows where it thinks the part is, and the “Grasp Affordance” shows where it is safe to grab. The intersection of these two maps gives the target.
The system was able to successfully fold and hang shirts starting from highly occluded states. However, it wasn’t perfect.

Figure 8 highlights the remaining challenges. Sometimes the “Dense Correspondence” gets fooled—it might mistake the inside of the shirt for the outside, or grab the back layer instead of the front. These are “irrecoverable” because the robot thinks it succeeded. However, failures like “empty grasp” (grabbing air) were almost always caught by the tactile sensors, allowing the robot to recover and try again.
The Future: Learning from Humans
One of the most exciting implications of this work is the potential for Learning from Demonstration. Because the Dense Correspondence network maps any shirt to a canonical template, it can effectively “translate” videos of humans into robot instructions.

As shown in Figure 6, the system can watch a human grasp a specific point on a shirt. It tracks the human hand, maps that point to the canonical template, and then maps that template point back to the robot’s current view of its own shirt. This allows the robot to imitate folding strategies without needing to be explicitly programmed for every movement.
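Conceptually, the translation is a two-hop lookup through the canonical template. A hedged sketch, with every interface name invented for illustration:
```python
def transfer_human_grasp(human_frame, robot_frame, to_canonical, from_canonical,
                         track_hand):
    """Map a grasp point demonstrated by a human onto the robot's current view.

    track_hand(frame)            -> pixel the human grasped in their video frame
    to_canonical(frame, pixel)   -> corresponding point on the flat template
    from_canonical(frame, point) -> heatmap over the robot's view for that point
    """
    human_pixel = track_hand(human_frame)                         # 1. where the human grabbed
    canonical_point = to_canonical(human_frame, human_pixel)      # 2. onto the template
    robot_heatmap = from_canonical(robot_frame, canonical_point)  # 3. back to the robot's shirt
    return robot_heatmap  # the peak is the robot's grasp target
```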
Conclusion
The paper “Reactive In-Air Clothing Manipulation” represents a significant step forward in robotic manipulation of deformable objects. The key takeaway isn’t just about better vision or better sensors—it’s about the integration of uncertainty.
By moving from rigid “contrastive” matching to probabilistic “distributional” matching, the robot gains the ability to assess its own confusion. By combining this with tactile feedback and a reactive state machine, it gains the patience to wait for a better opportunity.
While we aren’t quite at the point where Rosie the Robot puts away all our laundry, this research provides a blueprint for how to get there: stop trying to force the world to be flat and rigid, and instead build robots that are comfortable with the crumpled, chaotic nature of reality.