Introduction
Imagine shaking hands with a robot. If it’s a standard industrial arm, you might be terrified of it crushing your fingers. Its rigid metal skeleton and high-torque motors are designed for precision, not comfort. Now, imagine shaking a hand made of silicone—soft, compliant, and yielding to your touch. This is the promise of soft robotics: machines that are inherently safe and adaptable to the chaotic real world.
However, there is a catch. While soft robots are mechanically safer, they are notoriously difficult to control. A rigid robot has discrete joints with encoders that tell the computer exactly where the arm is, down to a fraction of a millimeter. A soft robot, on the other hand, is a continuum body: it can bend, twist, and deform everywhere along its length, giving it virtually infinite degrees of freedom. If you can’t measure exactly where the robot is (proprioception), how can you teach it to perform complex tasks like unscrewing a bottle or picking a blackberry?
In this post, we are diving deep into KineSoft, a groundbreaking framework presented by researchers at Carnegie Mellon University and the Bosch Center for AI. This paper tackles the “brain” problem of soft robotics. By developing a way for soft robots to “feel” their own shape and learn from human touch, KineSoft bridges the gap between the inherent safety of soft materials and the precision required for dexterous manipulation.
The Core Challenge: The “Body Schema” Problem
To understand why KineSoft is necessary, we first need to look at why standard robotics techniques fail with soft hands.
In rigid robotics, Imitation Learning is a popular technique. A human demonstrates a task (like moving a cup), and the robot records the trajectory of its joints. Later, it replays that trajectory or learns a policy to adapt it. This works because the mapping between the robot’s motors and its position is generally static and well-understood.
For soft robots, this breaks down for two reasons:
- State Representation: What is the “state” of a piece of rubber? It doesn’t have joints. We need a way to represent its complex, deforming shape mathematically.
- The Demonstration-Execution Gap: This is subtle but crucial. If you grab a soft robotic finger and wiggle it (demonstration), the internal sensors reading the deformation will output specific values. However, when the robot tries to move that finger itself by pulling on internal tendons (execution), the mechanics are different. The finger might achieve the same shape, but the internal stresses—and thus the sensor readings—might be different.
If you simply train a robot to mimic the sensor readings it felt during the human demonstration, it will fail when it tries to move itself. KineSoft solves this by teaching the robot to mimic the shape, not just the raw sensor data.
The KineSoft Framework
KineSoft is a hierarchical framework designed to enable kinesthetic teaching—where a human physically guides the robot to do a task—for soft robots.

As illustrated in Figure 1, the framework consists of three main pillars:
- Proprioceptive Shape Estimation: A deep learning model that translates raw sensor data into a 3D mesh of the robot’s shape.
- Imitation Policy: A diffusion-based AI that observes the world and decides what shape the hand should be in to complete a task.
- Shape-Conditioned Controller: A low-level control loop that actuates the motors to achieve that desired shape.
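Before digging into each pillar, it helps to see how they compose. The skeleton below is a purely illustrative sketch of the hierarchical loop, with placeholder implementations standing in for the real components described in the rest of this post; none of the function names come from the paper.

```python
import numpy as np

# Illustrative skeleton of the KineSoft hierarchy (all placeholders).

def estimate_shape(resistances: np.ndarray, rest_mesh: np.ndarray) -> np.ndarray:
    """Proprioception: strain-sensor readings -> estimated deformed mesh."""
    return rest_mesh  # placeholder: pretend the finger is undeformed

def propose_goal_shape(current_mesh: np.ndarray) -> np.ndarray:
    """Imitation policy: predict the next desired hand shape."""
    return current_mesh  # placeholder: hold the current shape

def track_shape(current_mesh: np.ndarray, goal_mesh: np.ndarray) -> np.ndarray:
    """Shape-conditioned controller: geometric error -> tendon commands."""
    return np.zeros(4)  # placeholder: no actuation

def control_step(resistances: np.ndarray, rest_mesh: np.ndarray) -> np.ndarray:
    current = estimate_shape(resistances, rest_mesh)   # 1. feel the shape
    goal = propose_goal_shape(current)                 # 2. decide the target shape
    return track_shape(current, goal)                  # 3. actuate toward it
```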
Let’s break these down, starting with the hardware.
1. The Hardware: The MOE Hand
The researchers utilized a Multifinger Omnidirectional End-effector (MOE). The key innovation here isn’t just the soft silicone body, but the sensors embedded inside it. They integrated conductive elastic rubber sensors directly into the fingers. As the finger bends or twists, these sensors stretch, changing their electrical resistance.
This gives the robot raw data about its internal strain, but raw resistance values are just noisy numbers. The robot needs to translate these numbers into a mental model of its physical body.
2. Learning to “Feel” Shape (Proprioception)
How do you go from a list of electrical resistance values to a full 3D mesh of a finger? The authors propose a neural network architecture based on FoldingNet.

The process, shown in Figure 2A, works as follows:
- Input: The network takes the current resistance readings (\(\mathbf{R}\)) from the sensors.
- Encoding: These readings are passed through a “Signal Encoder” to create a latent feature vector—a compressed numerical summary of the sensor state.
- Decoding: This feature vector is combined with the “rest pose” (undeformed shape) of the finger mesh. The “Deformation Field Decoder” then predicts how much every single vertex on the mesh needs to move (\(\Delta \mathbf{V}\)) to match the current physical reality.
Mathematically, the network learns a function \(f\) that predicts a displacement field from the sensor readings and the rest pose, \(\Delta \mathbf{V} = f(\mathbf{R}, \mathbf{V}_{\text{rest}})\). Each vertex of the deformed mesh is then obtained by adding its predicted displacement to its rest-pose position: \(\mathbf{V} = \mathbf{V}_{\text{rest}} + \Delta \mathbf{V}\).
This approach is powerful because it outputs a mesh, which is a geometric format that humans, physics simulators, and control algorithms can all understand.
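To make that architecture concrete, here is a minimal PyTorch-style sketch of such a signal-encoder / deformation-decoder pair. The layer sizes, activation choices, and class name are my own assumptions for illustration, not the exact FoldingNet-based network from the paper:

```python
import torch
import torch.nn as nn

class ShapeEstimator(nn.Module):
    """Sketch: map strain-sensor readings to per-vertex mesh displacements."""

    def __init__(self, num_sensors: int, latent_dim: int = 128):
        super().__init__()
        # Signal encoder: resistance readings R -> latent feature vector
        self.signal_encoder = nn.Sequential(
            nn.Linear(num_sensors, 64), nn.ReLU(),
            nn.Linear(64, latent_dim), nn.ReLU(),
        )
        # Deformation field decoder: (latent, rest-pose vertex) -> displacement
        self.deformation_decoder = nn.Sequential(
            nn.Linear(latent_dim + 3, 128), nn.ReLU(),
            nn.Linear(128, 3),
        )

    def forward(self, resistances: torch.Tensor, rest_vertices: torch.Tensor):
        # resistances: (B, num_sensors); rest_vertices: (B, V, 3)
        z = self.signal_encoder(resistances)                       # (B, latent_dim)
        z = z.unsqueeze(1).expand(-1, rest_vertices.shape[1], -1)  # tile per vertex
        delta_v = self.deformation_decoder(
            torch.cat([z, rest_vertices], dim=-1))                 # (B, V, 3)
        return rest_vertices + delta_v                             # deformed mesh
```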
Visualizing the Sense
The result of this network is impressive. It allows the robot to visualize its own deformation in real-time. In Figure 9 below, you can see the correlation between the raw sensor signals (center) and the reconstructed shape (right) as the hand interacts with an object.

3. Bridging Simulation and Reality
Training the shape estimation network requires a massive amount of data—specifically, pairs of sensor readings and true 3D shapes. Collecting this data on a physical robot is a nightmare because you would need external motion capture cameras to track thousands of points on the silicone surface constantly.
The researchers solved this by training in simulation. They created a finite-element model of the finger and simulated thousands of deformations. However, simulations are never perfect. The electrical resistance of a real rubber sensor doesn’t perfectly match the simulated strain.
To fix this Sim-to-Real gap, they developed a domain alignment technique. They formulated an optimization problem to find correction factors (\(\kappa\)) that align the real-world resistance (\(R\)) with the simulated length changes (\(L^S\)).

Because they can’t know the “true” length of the real sensors during calibration, they use an external depth camera to observe the robot’s shape and minimize the Chamfer Distance (a metric for comparing two point clouds) between the observed shape and the shape predicted from the corrected sensor readings.

This calibration step is crucial. It allows the robot to learn its “body schema” in the Matrix (that is, in simulation) and then download it to the physical world after only a brief alignment phase.
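As a rough illustration of this alignment step, the sketch below fits per-sensor correction factors \(\kappa\) by minimizing the Chamfer distance between the shapes predicted from corrected sensor readings and point clouds observed by the depth camera. The multiplicative form of the correction, the optimizer, and the helper names are assumptions made for this example, not the paper's exact formulation:

```python
import torch

def chamfer_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point clouds a (N, 3) and b (M, 3)."""
    d = torch.cdist(a, b)                                   # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def calibrate_kappa(estimator, resistances, rest_vertices, observed_clouds, steps=200):
    """Fit per-sensor correction factors so that corrected readings reproduce
    the shapes seen by an external depth camera."""
    for p in estimator.parameters():          # freeze the pretrained shape estimator
        p.requires_grad_(False)
    kappa = torch.ones(resistances.shape[-1], requires_grad=True)
    opt = torch.optim.Adam([kappa], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.zeros(())
        for r, cloud in zip(resistances, observed_clouds):
            pred = estimator(kappa * r.unsqueeze(0), rest_vertices.unsqueeze(0))[0]
            loss = loss + chamfer_distance(pred, cloud)
        loss.backward()
        opt.step()
    return kappa.detach()
```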
4. The Shape-Conditioned Controller
Now the robot knows its current shape. But to perform a task, it needs to move to a target shape.
This is where KineSoft shines compared to traditional methods. Traditional “strain-matching” tries to force the motors to reproduce specific sensor readings. But as noted earlier, pulling a tendon creates different internal strains than pushing the finger with a human hand.
KineSoft’s controller doesn’t try to reproduce raw sensor values; it works directly in geometry. It computes the error between the current estimated mesh and the desired target mesh, then projects that error onto the available actuation directions (the tendons).

Here, the per-vertex error \(\mathbf{e}\) measures how far each mesh vertex is from where it should be, and the controller adjusts the servos (\(\delta u\)) to drive that geometric error toward zero. This effectively bypasses the demonstration-execution gap, because geometry is consistent regardless of whether the deformation is caused by a motor or a human hand.
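A minimal numerical sketch of this projection, assuming we have (or have numerically estimated) a Jacobian mapping small tendon motions to vertex motions, is an ordinary least-squares solve; the Jacobian, the gain, and the function name are assumptions for illustration:

```python
import numpy as np

def shape_conditioned_update(current_vertices: np.ndarray,
                             target_vertices: np.ndarray,
                             jacobian: np.ndarray,
                             gain: float = 0.5) -> np.ndarray:
    """Turn a mesh-level shape error into tendon commands.

    current_vertices, target_vertices: (V, 3) estimated and desired meshes
    jacobian: (3*V, num_tendons) sensitivity of vertex positions to tendon motion
    """
    e = (target_vertices - current_vertices).reshape(-1)      # stacked geometric error
    # Project the error onto the available actuation directions (least squares)
    delta_u, *_ = np.linalg.lstsq(jacobian, e, rcond=None)
    return gain * delta_u                                      # servo increments
```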
Learning Skills via Imitation
With the proprioception and control layers built, the researchers could finally teach the robot skills.
They used Diffusion Policies, a state-of-the-art method in imitation learning. The workflow is intuitive:
- A human operator grabs the soft fingers.
- They physically guide the robot to perform a task (e.g., unscrewing a bottle).
- The system records the sequence of shapes (meshes) generated during this motion.
- The policy learns to predict the next desired shape based on the current state.
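In code, the asymmetry between teaching and execution is easy to see: during a demonstration only shapes are recorded (no motor commands exist, since the human is doing the moving), while at execution time the learned policy proposes goal shapes for the low-level controller to track. The sketch below uses placeholder callables; the real policy is a diffusion model, which is beyond the scope of this snippet:

```python
def record_demonstration(estimate_shape, sensor_log, rest_mesh):
    """Kinesthetic teaching: the human moves the fingers; we log only the
    resulting shape trajectory (there are no motor commands to record)."""
    return [estimate_shape(readings, rest_mesh) for readings in sensor_log]

def run_policy(policy, estimate_shape, track_shape, read_sensors, rest_mesh, horizon=100):
    """Autonomous execution: the policy proposes goal shapes and the
    shape-conditioned controller tracks them geometrically."""
    for _ in range(horizon):
        current = estimate_shape(read_sensors(), rest_mesh)
        goal = policy(current)            # placeholder for the diffusion policy
        track_shape(current, goal)        # e.g., shape_conditioned_update above
```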

Figure 3 shows this in action. The top row shows the human guiding the robot to manipulate cones. The bottom row shows the robot executing the same behavior autonomously. Notice how the robot isn’t just replaying a recording; it is actively generating shape goals that its low-level controller tracks.
Experiments and Results
The authors evaluated KineSoft on two fronts: its ability to estimate/track shape, and its ability to perform useful work.
Shape Estimation Fidelity
First, can the robot actually tell what shape it is in? The team compared KineSoft against several baselines, including “DeepSoRo” (a vision-based method) and naive linear models.

As shown in Table 1, KineSoft achieved a shape error of just 1.92 mm. This is a massive improvement over naive methods (~4.9 mm) and even outperforms vision-based methods that suffer from occlusions (when the robot’s hand blocks the camera’s view of its own fingers).
Tracking Performance
Next, they tested if the controller could actually follow a trajectory. This is the critical test of the “Demonstration-Execution Gap.”

Figure 6B visualizes the tracking. The red dots represent the ground truth target. The blue lines represent KineSoft’s performance. The system closely follows the complex, non-linear deformations required.
In contrast, looking at Table 2 below, we see that a standard “Strain-tracking” baseline (trying to match sensor values directly) has nearly double the error (6.20 mm vs 3.29 mm). This confirms that geometric shape is a much more robust transfer medium than raw sensor data.

Real-World Manipulation Tasks
Finally, the ultimate test: Can it do chores? The team designed six tasks ranging from rigid object manipulation to delicate interaction with soft objects.

The tasks, pictured in Figure 4, included:
- Bottle Unscrewing: Requires torque and coordination.
- Berry Picking: Requires extreme gentleness to avoid crushing the fruit.
- Fabric Grasping: Hard because the object itself deforms.
The results were stark.

Table 3 reveals the performance gap.
- Bottle Unscrewing: KineSoft succeeded 17/20 times. The baseline strain policy failed completely (0/20).
- Berry Picking: KineSoft achieved 16/20. The baseline only managed 7/20.
The baseline failed largely because the sensor values recorded during the human demonstration (when the human is squeezing the finger) were physically impossible for the robot to reproduce using its tendons. KineSoft, by focusing on the shape, ignored those impossible sensor values and simply found the best way to actuate its tendons to achieve the geometric goal.
Conclusion and Future Implications
KineSoft represents a significant step forward for soft robotics. By decoupling the “what” (the geometric shape) from the “how” (the specific tension in the cables or readings in the sensors), it allows us to apply powerful imitation learning techniques to soft bodies.
The key takeaways from this research are:
- Compliance is an Asset: Soft robots shouldn’t be treated as “hard robots with bad sensors.” Their softness allows for intuitive kinesthetic teaching that rigid robots can’t easily support.
- Geometry is the Universal Language: Translating sensor data into a 3D mesh acts as a reliable bridge between human demonstration and robot execution.
- Sim-to-Real works for Soft Bodies: With clever domain alignment, we can learn complex deformation models in simulation and transfer them to the real world without needing expensive real-world motion capture setups.
As we look toward a future where robots help in elderly care, handle delicate produce, or work side-by-side with humans, frameworks like KineSoft will be essential. They provide the “body awareness” necessary for soft robots to move from the research lab into the real world, combining safety with the dexterity we expect from intelligent machines.