Imagine you are playing air hockey. You step up to a table you’ve never used before. Is the puck heavy or light? Is the table slick or sticky? Before you take your winning shot, you instinctively tap the puck a few times—a gentle “poke”—to get a feel for how it slides. Only then do you commit to the high-speed “strike.”

Humans perform this kind of active exploration naturally. We interact with objects to uncover their hidden physical properties before attempting a difficult task. For robots, however, this is an immense challenge. Traditional robotic control often assumes we know the mass, friction, and center of mass of an object beforehand. If those parameters are wrong, the robot fails.

In this post, we dive into the research paper “Poke and Strike: Learning Task-Informed Exploration Policies,” which proposes a novel Reinforcement Learning (RL) framework. This method teaches robots to autonomously “poke” objects to learn exactly what they need to know—and nothing more—before executing a high-stakes “strike.”

The Problem: One-Shot Tasks and Hidden Physics

The core problem addressed in this work is the one-shot robotic task involving objects with unknown physical properties.

Consider a robot arm trying to hit a puck into a goal that lies outside its reach. This is a dynamic, irreversible action. If the robot misjudges the friction and strikes too softly, the puck stops short; if it strikes too hard, the puck bounces off the table. Because the goal is out of reach, the robot cannot correct its mistake mid-action. It has one shot.

To succeed, the robot needs two things:

  1. Exploration: A strategy to manipulate the object (poke it) to estimate properties like mass and friction.
  2. Execution: A strategy to perform the task (strike it) using those estimated properties.

This raises difficult questions: How does the robot know which properties matter? (Maybe friction matters, but mass doesn’t). How long should it explore? And how can it do this quickly without needing a human to tune the system for every new object?

The Solution: Task-Informed Exploration

The researchers propose a framework where the robot learns an exploration policy based on the needs of the task itself. Instead of trying to learn everything about an object with perfect accuracy (which takes too long), the robot learns to identify only the properties that are critical for success.

Figure 1: Task-informed exploration approach enables the robot to autonomously learn how to explore.

As shown in Figure 1, the process is split into two phases:

  1. Train (Left): The robot learns to explore. It produces an exploration policy (\(\pi_{exp}\)) and an uncertainty estimator.
  2. Test (Right): The robot executes the learned behavior. It explores until it is confident, then switches to the task policy (\(\pi_{task}\)) to execute the strike.

Let’s break down the methodology into its core components.

Phase 1: The Privileged Teacher

The first step relies on privileged learning. In a simulation, we have access to “ground truth” data—we know the exact friction, mass, and restitution of every object.

The researchers train a Task Policy (\(\pi_{task}\)) that has access to these hidden values. This policy becomes the “expert” or “teacher”: it knows exactly how to strike the puck into the goal because it knows the physics perfectly. It can’t be used directly in the real world, where the true physics are unknown, but it defines what success looks like, and at test time it is this same policy that is fed estimated properties in place of the ground truth.
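As a rough illustration of the privileged setup, the policy’s input in simulation might be assembled as below. This is a sketch only; the property list, attribute names, and `sim` interface are assumptions for illustration, not the paper’s exact implementation.

```python
import numpy as np

def privileged_observation(task_obs: np.ndarray, sim) -> np.ndarray:
    """Assemble the input for the privileged task policy.

    In simulation the hidden physical parameters can be read directly,
    so the policy is conditioned on them alongside the usual task
    observation. Property names here are illustrative.
    """
    ground_truth_props = np.array([
        sim.puck_mass,          # object mass
        sim.dynamic_friction,   # sliding friction with the table
        sim.restitution,        # bounciness on contact
    ])
    return np.concatenate([task_obs, ground_truth_props])
```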

Phase 2: Determining What Matters

Not all physical properties are created equal. For a sliding puck, friction is crucial. For a tumbling box, the center of mass might be more important.

To automate this intuition, the authors perform a sensitivity analysis. They take the trained Privileged Task Policy and deliberately lie to it, feeding it incorrect physical parameters to see how much the performance drops.

Figure 4: Uni-modal functions fitted to the relationship between task success rate and normalized property estimation errors.

Figure 4 illustrates this beautifully. The graphs plot Success Rate against Estimation Error (\(\epsilon\)).

  • Steep curves (like Dynamic Friction in the left plot) mean the task is highly sensitive. A small error leads to failure.
  • Flat curves mean the property doesn’t matter much. The robot can be wrong about it and still succeed.

By analyzing these curves, the system automatically generates error thresholds (\(\epsilon_{threshold}\)). These thresholds dictate how accurate the robot needs to be for each specific property to ensure a high probability of success.
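As a minimal sketch of how such a threshold could be extracted, the snippet below sweeps the injected error for one property and keeps the largest error at which the privileged policy still succeeds often enough. It assumes a hypothetical `evaluate_success_rate(prop_name, rel_error)` helper that corrupts that property and measures success over many simulated rollouts; the 90% target and the direct sweep (rather than the uni-modal curve fitting shown in Figure 4) are simplifications for illustration.

```python
import numpy as np

def find_error_threshold(evaluate_success_rate, prop_name,
                         min_success=0.9, max_rel_error=1.0, num_points=20):
    """Return the largest normalized estimation error for which the
    privileged policy's success rate stays above min_success."""
    threshold = 0.0
    for eps in np.linspace(0.0, max_rel_error, num_points):
        if evaluate_success_rate(prop_name, eps) >= min_success:
            threshold = eps   # still accurate enough at this error level
        else:
            break             # success curve has dropped below the target
    return threshold

# Usage (property names illustrative):
# epsilon_threshold = {p: find_error_threshold(evaluate_success_rate, p)
#                      for p in ["dynamic_friction", "mass", "restitution"]}
```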

Phase 3: Learning to Explore

Now the robot needs to learn how to measure these properties. The researchers train an Exploration Policy (\(\pi_{exp}\)) using Reinforcement Learning.

Crucially, the reward function for this exploration is derived from the sensitivity analysis in Phase 2. The robot is rewarded not just for moving, but for gathering information that brings its estimation error below the required thresholds.

The reward function is defined as:

Equation for exploration reward based on estimation thresholds.

Here, the robot gets a positive reward (\(r_{estimation}\)) only if the errors for all relevant properties fall below their respective thresholds. This encourages the robot to perform actions—like poking or sliding the object—that reveal the necessary physical details.
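The paper’s exact notation isn’t reproduced here, but a reward consistent with that description would look like:

\[
r_t \;=\;
\begin{cases}
r_{estimation} & \text{if } \left|\hat{\phi}_{i,t} - \phi_i\right| < \epsilon_{threshold,i} \ \text{ for every property } i,\\
0 & \text{otherwise,}
\end{cases}
\]

where \(\hat{\phi}_{i,t}\) is the current estimate of property \(i\) at time \(t\), \(\phi_i\) is its ground-truth value in simulation, and \(\epsilon_{threshold,i}\) comes from the Phase 2 sensitivity analysis.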

Simultaneously, the robot trains an Online Property Estimator. This is a neural network (specifically an LSTM) that takes the history of observations (how the object moved when poked) and outputs the estimated physical properties (\(\hat{\phi}\)).
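For intuition, such an estimator could be sketched as follows; the PyTorch framing and layer sizes are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class PropertyEstimator(nn.Module):
    """LSTM that maps a history of (observation, action) pairs to estimated
    physical properties (e.g., friction, mass). Sizes are illustrative."""

    def __init__(self, obs_dim, act_dim, num_props, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim + act_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_props)

    def forward(self, obs_seq, act_seq):
        # obs_seq: (batch, time, obs_dim), act_seq: (batch, time, act_dim)
        x = torch.cat([obs_seq, act_seq], dim=-1)
        out, _ = self.lstm(x)
        # Use the final hidden state: the estimate after the full history.
        return self.head(out[:, -1, :])
```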

The estimation error used to train this network is simple:

Equation for estimation error.
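A plausible form, matching the normalized errors plotted in Figure 4 (the exact normalization in the paper may differ), is the absolute difference between the estimate and the simulated ground truth, scaled by the range of each property seen during training:

\[
e_{i,t} = \frac{\left|\hat{\phi}_{i,t} - \phi_i\right|}{\phi_i^{\,max} - \phi_i^{\,min}}
\]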

This creates a loop: the policy learns to move the object in ways that make it easier for the estimator to guess the physics, and the estimator gets better at guessing based on those movements.
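Conceptually, that loop could be organized like the sketch below. Interfaces such as `collect_rollout`, `rl_update`, and `supervised_update` are hypothetical stand-ins; the paper’s actual RL algorithm and training schedule may differ.

```python
import numpy as np

def train_exploration(pi_exp, estimator, env, rl_update, supervised_update,
                      epsilon_threshold, num_iters=1000):
    """Alternate between improving the exploration policy (via RL) and the
    property estimator (via supervised learning on simulated ground truth)."""
    for _ in range(num_iters):
        # 1. Roll out the current exploration policy in simulation.
        trajectory, true_props = env.collect_rollout(pi_exp)

        # 2. Estimate the physical properties from the interaction history.
        estimated_props = estimator(trajectory)

        # 3. Reward only when every error is under its task-derived threshold.
        errors = np.abs(estimated_props - true_props)
        reward = float(np.all(errors < epsilon_threshold))

        # 4. Update the policy with the reward, the estimator with ground truth.
        rl_update(pi_exp, trajectory, reward)
        supervised_update(estimator, trajectory, true_props)
```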

Phase 4: The Uncertainty Switch

In the real world, the robot doesn’t know the ground truth, so it can’t calculate the estimation error to know when to stop exploring. It needs a surrogate metric.

The authors introduce an Uncertainty-Based Policy Switching mechanism. They use an ensemble of neural networks to estimate the properties. If the networks disagree, uncertainty is high. If they agree, uncertainty is low.

The uncertainty quantification is calculated using the covariance of the ensemble:

Equation for uncertainty estimation using ensemble covariance.
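One standard way to compute this, consistent with the description above (the paper may use a slightly different estimator), is the sample covariance of the \(K\) ensemble predictions:

\[
\hat{\Sigma}_t = \frac{1}{K} \sum_{k=1}^{K} \left(\hat{\phi}_t^{(k)} - \bar{\phi}_t\right)\left(\hat{\phi}_t^{(k)} - \bar{\phi}_t\right)^{\top},
\qquad
\bar{\phi}_t = \frac{1}{K} \sum_{k=1}^{K} \hat{\phi}_t^{(k)}
\]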

\(\hat{\Sigma}_t\) represents the uncertainty. The system learns an “uncertainty threshold” during training. In the testing phase, the robot continues to explore (poke) until its uncertainty about the critical properties drops below this threshold. Once confident, it immediately switches to the Task Policy to strike.
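Putting the pieces together, the test-time behavior might look roughly like this. It is a sketch assuming hypothetical `pi_exp`, `pi_task`, `ensemble_predict`, and `env` interfaces; the per-property standard-deviation check is one illustrative way of applying the threshold.

```python
import numpy as np

def run_episode(env, pi_exp, pi_task, ensemble_predict, uncertainty_threshold,
                max_explore_steps=100):
    """Explore (poke) until the ensemble is confident, then execute (strike)."""
    obs = env.reset()
    history = []
    prop_mean = None

    # Exploration phase: poke until uncertainty drops below the threshold.
    for _ in range(max_explore_steps):
        action = pi_exp(obs)
        obs, _ = env.step(action)
        history.append((obs, action))

        prop_mean, prop_cov = ensemble_predict(history)   # estimate and covariance
        if np.all(np.sqrt(np.diag(prop_cov)) < uncertainty_threshold):
            break   # confident about the task-relevant properties

    # Execution phase: condition the task policy on the estimate and strike.
    done = False
    while not done:
        action = pi_task(obs, prop_mean)
        obs, done = env.step(action)
    return obs
```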

Figure 5: RMSE and uncertainty of dynamic friction over time.

Figure 5 demonstrates this correlation. As the robot explores (Time on x-axis), both the actual error (RMSE, top graph) and the estimated uncertainty (bottom graph) decrease. This validates that uncertainty is a reliable proxy for accuracy.

Experimental Results

The method was tested in simulation and on real hardware across different tasks.

The Tasks

The primary experiments focused on two distinct manipulation challenges:

  1. Striking: Hitting a puck with unknown friction/mass to a target.
  2. Edge Pushing: Pushing a box with an unknown center of mass (e.g., a box of eggs) to the edge of a table without it falling off.

Figure 2: Manipulation tasks including Striking and Edge Pushing.

Performance vs. Baselines

The results were compared against several baselines, including Domain Randomization (DR) (a common technique where a single policy tries to be robust to all possible physics) and methods that use generic system identification.

Figure 3: Performance of different methods on the Striking task.

Figure 3 shows the success rate on the Striking task.

  • Privilege (Red line): This is the teacher policy with perfect knowledge (~100% success).
  • Ours (Cyan triangle): The proposed method achieves 90.1% success, drastically outperforming the others.
  • Baselines: Standard Domain Randomization (Orange) and others hover below 40%. They simply cannot adapt well enough to the specific object variations.

The “Poke and Strike” method succeeds because it adapts. It doesn’t settle for “average” performance; it figures out the specific friction of the current puck and adjusts the strike accordingly.

Real-World Validation

The true test of robotic learning is transfer to the physical world (Sim-to-Real). The authors deployed their policy on a KUKA iiwa robot arm.

Figure 6: Robot experiments on Striking task showing exploration and task phases.

In Figure 6, we see the robot in action.

  • Phase 1 (Top left): The robot gently pokes the puck.
  • Phase 2 (Top right): Having estimated the friction, it winds up and strikes the puck into the green target zone.

The graphs in Figure 6(b) show the real-time estimation. You can see the estimated friction values (colored lines) converging quickly, and the uncertainty (bottom plot) dropping below the red threshold line, triggering the strike.

The system was able to distinguish between pucks made of different materials (aluminum, nylon, and ball bearings) and adjust the force of the strike so that the puck landed in the goal every time.

Figure 11: Hardware setup showing different pucks and surface materials.

Figure 11 provides a closer look at the hardware setup, highlighting the variety of materials used to test the robustness of the exploration policy.

Conclusion

The “Poke and Strike” paper presents a compelling argument for Task-Informed Exploration. Rather than separating system identification and control into two disconnected fields, this approach unifies them.

Key takeaways:

  1. Efficiency: Robots shouldn’t try to learn everything about an object; they should learn what matters for the task.
  2. Autonomy: By using uncertainty estimates, the robot decides for itself when it has learned enough to act.
  3. Sim-to-Real: Learning to estimate properties allows the robot to bridge the gap between simulation and reality, adapting to real-world friction and mass on the fly.

This research moves us closer to robots that can walk into a new environment, pick up an unfamiliar tool, give it a quick “heft” or “shake” to understand it, and then get to work.