Introduction
Large Language Models (LLMs) like GPT-4 have transformed our expectations of artificial intelligence. We have grown accustomed to their ability to write code, summarize history, and even reason through logic puzzles. Recently, roboticists have begun connecting these “brains” to robot “bodies,” allowing LLMs to generate high-level plans or write control code. However, a significant gap remains. While LLMs understand language, they don’t natively “understand” the complex, low-level physics required to slide a puck across a table or swing a rope into a specific shape.
Typically, solving these dynamic manipulation tasks requires training specialized neural networks on massive datasets or meticulously modeling physical properties like friction and mass. But what if we didn’t need to train a new model? What if the same “pattern matching” ability that allows an LLM to complete a sentence could also allow it to tune a robot’s physical movements?
In the paper “In-Context Iterative Policy Improvement for Dynamic Manipulation,” researchers Mark Van der Merwe and Devesh K. Jha explore this exact possibility. They propose a method called In-Context Policy Improvement (ICPI). Instead of asking the LLM to “know” physics, they treat the robot’s learning process as a sequence completion problem. By feeding the LLM a history of “attempts and corrections,” the model learns to predict the necessary adjustments to achieve a physical goal—without a single gradient update.

As illustrated in Figure 1, this approach allows robots to iteratively improve their performance on complex tasks—like sliding objects or swinging ropes—both in simulation and the real world, using very small datasets.
Background: The Challenge of Dynamic Manipulation
To understand why this research is significant, we must first distinguish between quasi-static and dynamic manipulation.
- Quasi-static manipulation: Think of a robot slowly picking up a cup. The forces are dominated by gravity and contact; inertia plays a minor role. If the robot stops moving, the cup stays put.
- Dynamic manipulation: Think of tossing a ball, cracking a whip, or sliding a beer mug down a bar. These actions rely on velocity, acceleration, and momentum. If the robot stops mid-action, the object keeps moving.
Dynamic manipulation extends a robot’s workspace and efficiency, but it is notoriously difficult. Success depends on hidden physical properties—the mass of the object, the friction of the surface, or the density of a rope. These properties are often invisible to cameras and difficult to measure directly.
Humans solve this through iterative improvement. If you try to slide a coaster to a friend and it stops short, you push harder next time. You don’t solve a physics equation; you look at the error (it stopped short) and adjust your policy (push harder). The researchers replicate this human-like trial-and-error loop using the In-Context Learning (ICL) capabilities of Transformers.
In-Weights vs. In-Context Learning
LLMs exhibit two types of learning:
- In-Weights Learning: Knowledge stored in the neural network’s parameters (weights) during its massive pre-training phase. This is how the model knows English grammar or historical facts.
- In-Context Learning (ICL): The ability to learn a new pattern “on the fly” from the prompt provided at inference time. If you give an LLM three examples of a made-up word game, it can often solve the fourth example, even though it never saw that game during training.
The researchers hypothesize that LLMs are general pattern machines. If the relationship between a robot’s action and the resulting error is presented as a pattern in the text prompt, the LLM should be able to predict the correct adjustment, effectively learning the “physics” of the task in-context.
Core Method: In-Context Policy Improvement (ICPI)
The goal of the research is to find the optimal policy parameters (\(\theta^*\))—such as the speed and angle of a robot arm—that minimize a task cost (\(C_\tau\)):

\[
\theta^* = \arg\min_{\theta} C_\tau\big(s_{1:T}\big), \qquad s_{1:T} = T_\tau(s_0, \theta)
\]
Here, \(s_0\) is the starting state, and \(T_\tau\) represents the dynamics of the world (how the physics actually plays out). The researchers formulate this as an iterative problem. They want to learn an improvement operator—a function that looks at what the robot just did and tells it how to fix it.
The Policy Improvement Operator
Let \(\theta^i\) be the robot’s parameters at attempt \(i\), and \(s_{1:T}^i\) be the resulting trajectory (what happened). We want a function \(f\) that outputs the change in parameters, \(\Delta \theta^i\):

\[
\Delta \theta^i = f\big(\theta^i, s_{1:T}^i\big), \qquad \theta^{i+1} = \theta^i + \Delta \theta^i
\]
Traditionally, this function \(f\) would be a neural network trained via Reinforcement Learning or Supervised Learning. This requires training data and time. The key innovation of ICPI is to replace this trained network with a pre-trained LLM.
Instead of training weights, the researchers construct a text prompt containing a dataset \(\mathcal{D}\) of past experiences. Each experience contains the parameters used, the error observed, and the correction that should have been made:

\[
\mathcal{D} = \big\{ \big(\theta^j,\; e^j,\; \Delta\theta^j\big) \big\}_{j=1}^{N}
\]
This turns the physics problem into a sequence completion problem. The prompt looks something like this:
- “When parameters were X and error was Y, the correction was Z.”
- “When parameters were A and error was B, the correction was C.”
- “Current parameters are P and error is Q. The correction is…” -> [LLM Completes Here]
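To make this concrete, here is a minimal sketch of how such a prompt could be assembled from past (parameters, error, correction) tuples. The field names, separator, and rounding are illustrative assumptions, not the authors' exact template.

```python
def format_example(theta, error, delta_theta=None, decimals=2):
    """Render one (parameters, error, correction) tuple as one prompt line.

    If delta_theta is None, the correction is left blank, so this becomes the
    query row the LLM is asked to complete.
    """
    def fmt(vec):
        return " ".join(f"{float(x):.{decimals}f}" for x in vec)

    line = f"parameters: {fmt(theta)} | error: {fmt(error)} | correction:"
    if delta_theta is not None:
        line += f" {fmt(delta_theta)}"
    return line


def build_prompt(examples, query_theta, query_error):
    """Stack the retrieved examples, then the open-ended query, into one prompt."""
    lines = [format_example(t, e, d) for (t, e, d) in examples]
    lines.append(format_example(query_theta, query_error))  # no correction: LLM fills it in
    return "\n".join(lines)
```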

Figure 2 visualizes this loop.
- Execution: The robot tries a task with parameters \(\theta^i\).
- Observation: The system records the error \(e^i\) (e.g., missed the target by 10cm).
- Retrieval: The system looks up similar past examples from a small dataset.
- Prompting: These examples, plus the current error, are sent to the LLM (GPT-4o).
- Update: The LLM predicts \(\Delta \theta^i\), which is added to the current parameters to get \(\theta^{i+1}\).
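Putting the loop together, a hedged sketch of the outer iteration might look like the following. `build_prompt` is the helper sketched above; `execute_policy`, `retrieve_knn`, and `parse_correction` are hypothetical stand-ins for the robot (or simulator) interface, the example lookup, and parsing the LLM's text back into numbers. The `query_llm` function assumes the standard OpenAI chat-completions client.

```python
import numpy as np
from openai import OpenAI  # assumes the standard OpenAI Python client


def query_llm(prompt, model="gpt-4o"):
    """Send the prompt to the LLM and return its raw text completion."""
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content


def icpi_loop(theta0, goal, dataset, execute_policy, retrieve_knn,
              parse_correction, n_iters=20, k=20, tol=1e-2):
    """Outer ICPI loop (sketch). The callables are hypothetical stand-ins for
    the robot/simulator, the KNN lookup, and output parsing."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        final_state = execute_policy(theta)                   # run one attempt with theta^i
        error = np.asarray(final_state) - np.asarray(goal)    # e^i = s_T^i - tau_g
        if np.linalg.norm(error) < tol:                       # close enough: stop early
            break
        examples = retrieve_knn(dataset, theta, error, k=k)   # k most similar past tuples
        prompt = build_prompt(examples, theta, error)         # see the prompt sketch above
        delta = parse_correction(query_llm(prompt))           # LLM predicts Delta theta^i
        theta = theta + np.asarray(delta)                     # theta^{i+1} = theta^i + Delta theta^i
    return theta
```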
Tokenization: Translating Physics to Text
To make this work, physical data must be converted into tokens the LLM can process.
- Parameters (\(\theta\)): Converted directly to text numbers (e.g., velocity “0.5”).
- State Trajectory (\(s_{1:T}\)): Feeding in the raw coordinates at every millisecond would be too much data and too noisy. Instead, the researchers compute the relative error to the goal (\(e^i = s_T^i - \tau_g\)).
For example, if the robot is sliding a puck, the “state” fed to the LLM isn’t the puck’s full path, but rather a vector representing how far it stopped from the target. This captures the essential information: “I pushed with force 5, and it stopped 2 meters short.”
Selecting the Right Examples
You cannot fit thousands of past trials into a single LLM prompt. The context window is limited. Therefore, the system uses K-Nearest Neighbors (KNN) to find the most relevant examples.
When the robot generates a query (current parameters + current error), the system searches its dataset for the \(k\) examples that are most similar to this query. These \(k\) examples (set to 20 in experiments) are formatted into the prompt. This ensures the LLM is reasoning based on history that is relevant to the specific situation the robot is currently facing.
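A minimal sketch of this retrieval step, assuming the dataset is a list of (parameters, error, correction) arrays and that similarity is plain Euclidean distance over the concatenated parameter and error vectors (the actual distance metric and any feature scaling are assumptions):

```python
import numpy as np


def retrieve_knn(dataset, query_theta, query_error, k=20):
    """Return the k stored (theta, error, delta_theta) tuples closest to the query."""
    query = np.concatenate([query_theta, query_error])
    keys = np.stack([np.concatenate([t, e]) for (t, e, _) in dataset])
    dists = np.linalg.norm(keys - query, axis=1)   # Euclidean distance to every stored example
    nearest = np.argsort(dists)[:k]                # indices of the k closest examples
    return [dataset[i] for i in nearest]
```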
Data Collection: Algorithm Distillation
Where does the dataset \(\mathcal{D}\) come from? The researchers use a process called algorithm distillation. They use a brute-force search algorithm (which is slow and expensive) to solve the task for a variety of conditions offline. These successful runs provide the “ground truth” for how parameters should be adjusted. The LLM then learns to mimic this expensive search process using only a few examples, drastically speeding up the process during actual execution.
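A hedged sketch of how such a dataset could be generated offline. `sample_condition` and `rollout_error` are hypothetical stand-ins for sampling task conditions (friction, sizes, goals) and for running one attempt and returning its error; the random-shooting search and the choice of "correction toward the best parameters found" are assumptions about the distillation step, not the authors' exact procedure.

```python
import numpy as np


def distill_dataset(sample_condition, rollout_error, n_conditions=30,
                    n_candidates=50, param_dim=2, seed=0):
    """Build a small (theta, error, delta_theta) dataset via offline brute-force search."""
    rng = np.random.default_rng(seed)
    dataset = []
    for _ in range(n_conditions):
        cond = sample_condition(rng)                      # e.g. friction, puck size, goal
        thetas = rng.uniform(-1.0, 1.0, size=(n_candidates, param_dim))
        errors = [np.asarray(rollout_error(cond, th)) for th in thetas]  # e = s_T - tau_g
        costs = [np.linalg.norm(e) for e in errors]       # task cost of each candidate
        best = thetas[int(np.argmin(costs))]              # best parameters the search found
        for th, e in zip(thetas, errors):
            dataset.append((th, e, best - th))            # correction toward that best solution
    return dataset
```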
Experimental Setup
The team tested ICPI on five task variations involving both simulated and real robots.
- Slide (Sim): A robot strikes a puck to slide it to a target. The friction and puck size change.
- Slide-GC (Sim): Same as above, but the target location changes (Goal-Conditioned).
- The cost function is simply the distance between the final puck position and the goal:

\[
C_\tau = \big\| s_T - \tau_g \big\|
\]
- Rope-Swing (Sim): A robot swings a flexible rope to hit a target. The rod length and rope length change.
- This is harder because the rope is deformable. The cost is the minimum distance the rope tip gets to the target during the swing:

\[
C_\tau = \min_{t \in \{1, \dots, T\}} \big\| s_t^{\text{tip}} - \tau_g \big\|
\]
- Rope-Swing-GC (Sim): Rope swinging with changing target locations.
- Roll-GC-Real (Real Robot): A real physical robot hits a billiard ball to roll it to a target pixel location.
Results and Analysis
The researchers compared ICPI against several baselines:
- Random Shooting: Randomly guessing parameter changes.
- Bayesian Optimization: A standard mathematical approach for tuning parameters.
- In-Weights Reasoning: Asking the LLM to solve the problem directly by describing the physics in the prompt, relying on its pre-trained knowledge rather than example-based pattern matching.
- Linear KNN: A simple linear regression model fitted to the retrieved examples.
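For reference, the Linear KNN baseline can be sketched as an ordinary least-squares fit over only the retrieved neighbors; the exact feature construction (concatenated parameters, error, and a bias term) is an assumption.

```python
import numpy as np


def linear_knn_correction(examples, query_theta, query_error):
    """Fit a linear map on the k retrieved examples and predict the correction."""
    X = np.stack([np.concatenate([t, e, [1.0]]) for (t, e, _) in examples])  # features + bias
    Y = np.stack([d for (_, _, d) in examples])                              # target corrections
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)        # least-squares fit on neighbors only
    q = np.concatenate([query_theta, query_error, [1.0]])
    return q @ W
```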
Performance Comparison
The results, summarized in Table 1, show that ICPI (bottom row) consistently achieves the lowest task cost across almost all environments.

Key Observations:
- ICPI outperforms Random Shooting and Bayes Opt: It is much more sample-efficient, converging to good solutions faster.
- ICPI vs. In-Weights: Interestingly, the “In-Weights” baseline performed poorly. Simply describing the physics to GPT-4 (“You are pushing a puck…”) is not enough; the model cannot simulate detailed dynamics in its “head.” However, showing it the pattern of errors via ICPI clearly works. This confirms the “General Pattern Machine” hypothesis.
- ICPI vs. Linear Models: While a simple linear model (Linear KNN-20) performed surprisingly well, the LLM-based ICPI generally surpassed it, especially in complex tasks like the real-world ball roll. This suggests the LLM captures non-linear relationships that simple regression misses.
Convergence Speed
In dynamic manipulation, you want to learn quickly to minimize wear and tear on the robot. Figure 3 shows the learning curves.

- Slide (Graph a): ICPI (red line) drops to near-zero error within 5-8 steps.
- Real Robot Roll (Graph c): In the real world, ICPI rapidly outperforms random shooting and linear baselines, finding a successful policy in under 10 trials. This is crucial for real-world deployment where data collection is expensive.
Qualitative Success
Figure 4 visualizes the iterative improvement.
- Top Row (Slide): At \(t=1\), the puck stops far short. By \(t=5\), the robot has adjusted its strike angle and force to get closer. By \(t=18\), it hits the target.
- Bottom Row (Rope Swing): The robot adjusts the swing velocity and joint angles. You can see the arc of the rope changing until it intersects with the target ‘X’.

Ablation Studies: Does Model Choice Matter?
The researchers also tested different LLMs. They found that GPT-4o significantly outperformed GPT-3.5-turbo and GPT-4o-mini. This indicates that the “reasoning” or “pattern matching” quality of the larger models is essential for parsing the complex relationships in dynamic physics data.
Additionally, they found that explicitly tokenizing the error (distance to goal) was more effective than giving the LLM the raw state coordinates and the goal coordinates separately. This suggests that while LLMs are smart, doing some “pre-processing” (calculating the error for them) helps them focus on the correction logic.
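To make the distinction concrete, here is a hedged illustration of the two query formats; the field names and numbers are invented for illustration.

```python
# Variant used by ICPI: the error is pre-computed, so the LLM only has to
# map (parameters, error) -> correction.
query_with_error = "parameters: 0.50 0.30 | error: -0.12 0.04 | correction:"

# Ablated variant: the raw final state and the goal are given separately,
# leaving the subtraction to the LLM as well.
query_with_raw_state = "parameters: 0.50 0.30 | final state: 1.38 0.54 | goal: 1.50 0.50 | correction:"
```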
Conclusion and Implications
This paper presents a compelling argument for using Large Language Models not just as chatbots, but as numeric reasoning engines for robotics. The In-Context Iterative Policy Improvement (ICPI) method demonstrates that we don’t always need to train massive, task-specific neural networks to handle complex physics.
By treating physical interaction as a sequence of “try, fail, adjust” data points, we can leverage the massive pre-trained pattern-matching capabilities of models like GPT-4o. The key takeaways are:
- LLMs are General Pattern Machines: They can generalize to physical dynamics tasks simply by reading input-output examples in the prompt.
- In-Context beats In-Weights: For physics, showing the LLM examples of behavior is far more effective than asking it to reason about physical laws theoretically.
- Sample Efficiency: This method works with very small datasets (\(\leq 300\) examples) and improves policies within just a few steps.
This work opens the door for more flexible robots that can adapt to new tools and environments “on the fly,” utilizing the same intelligence that powers our search engines and chat applications.