Imagine a robot searching for an apple in a cluttered kitchen. It scans the room but doesn’t see the fruit. A human would instinctively check the table or the counter, knowing that apples don’t hover in mid-air or hide inside the toaster. The robot, however, faces a massive challenge: decision-making under uncertainty. It doesn’t know where the apple is (partial observability), and it needs a model of how the world works to search efficiently.
In robotics and AI, we mathematically formalize this problem using Partially Observable Markov Decision Processes (POMDPs). While POMDPs are theoretically powerful, applying them in the real world is notoriously difficult. You usually have to hand-engineer the model of the world (physics, sensor noise, probabilities) or try to learn it from massive amounts of data, which is often infeasible.
But what if we could use the “common sense” reasoning of Large Language Models (LLMs) to bridge this gap?
In this post, we take a deep dive into POMDP Coder, a research framework from MIT and Cornell University. The paper proposes a novel method: instead of asking an LLM to control the robot directly, we ask the LLM to write probabilistic code that models the world. This learned model is then plugged into a rigorous mathematical solver to make optimal decisions.
Let’s explore how this hybrid approach combines the creativity of LLMs with the reliability of classical planning.
The Problem: Why is Uncertainty So Hard?
To understand why this paper is significant, we first need to understand the headache of POMDPs.
A POMDP is a framework used to model scenarios where an agent acts in an environment but cannot directly observe the state. It consists of the following components (sketched in code after the list):
- States (\(S\)): The true status of the world (e.g., “The apple is in the drawer”).
- Observations (\(O\)): What the robot sensors see (e.g., “I see a wooden handle”).
- Actions (\(A\)): What the robot can do (e.g., “Open drawer”).
- Transition Function (\(T\)): How the state changes after an action.
- Observation Function (\(Z\)): The probability of seeing an observation given a state.
- Reward Function (\(R\)): What the agent wants to achieve.
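Concretely, these components can be thought of as a handful of sampling functions bundled together. Here is a minimal Python sketch of such an interface (the names, type aliases, and discount value are illustrative assumptions, not the paper's actual API):

```python
from dataclasses import dataclass
from typing import Callable, Tuple

State = Tuple[int, int]        # e.g. the grid cell the apple is in
Action = str                   # e.g. "open_drawer"
Observation = str              # e.g. "wooden_handle"

@dataclass
class POMDPModel:
    initial_state: Callable[[], State]                    # sample s0 from the initial belief
    transition: Callable[[State, Action], State]          # sample s' ~ T(s, a)
    observation: Callable[[State, Action], Observation]   # sample o ~ Z(s', a)
    reward: Callable[[State, Action], float]              # R(s, a)
    discount: float = 0.95
```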
The goal is to find a policy \(\pi\) that maximizes the expected discounted reward over time. Mathematically, this looks like:
\[
\pi^* = \arg\max_{\pi} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t)\right]
\]
where \(\gamma \in [0, 1)\) is the discount factor that down-weights rewards far in the future.
The agent maintains a belief state (\(b_t\)), which is a probability distribution over all possible states. As it takes actions and sees new things, it updates this belief.
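One common way to represent this belief is with a set of sampled state hypotheses ("particles"). The sketch below is a crude rejection-style update that assumes the `transition` and `observation` functions from the interface above and matches observations exactly; a real planner would weight particles by observation likelihood instead:

```python
import random

def update_belief(particles, action, obs, transition, observation):
    """Rejection-style belief update over state particles (a rough sketch,
    not a full weighted particle filter)."""
    survivors = []
    for s in particles:
        s_next = transition(s, action)              # imagine what would happen
        if observation(s_next, action) == obs:      # keep hypotheses consistent with reality
            survivors.append(s_next)
    if not survivors:                               # everything ruled out: fall back to predictions
        survivors = [transition(s, action) for s in particles]
    return [random.choice(survivors) for _ in particles]  # resample back to the original size
```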
The Bottleneck
The math above works beautifully if you have the model components (\(T\), \(Z\), \(R\), etc.).
- Hand-coding these functions for complex environments is tedious and error-prone.
- Learning these functions from scratch (tabular learning) requires an enormous amount of data because the agent has to explore every possibility to fill in the probability tables.
This is where POMDP Coder steps in.
The Solution: POMDP Coder
The researchers propose that the components of a POMDP (transitions, observations, etc.) can be modeled as short probabilistic programs. These are snippets of code (written in Python/Pyro) that describe logic and uncertainty.
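To make "probabilistic program" concrete, here is the flavor of snippet involved, using Pyro's sampling primitive. The scenario, names, and probabilities are invented for illustration:

```python
import pyro
import pyro.distributions as dist
import torch

def observe_apple(robot_cell, apple_cell):
    """Noisy observation model: the detector usually fires when the robot is looking
    at the apple's cell, and occasionally false-positives elsewhere."""
    p_detect = 0.8 if robot_cell == apple_cell else 0.05
    detected = pyro.sample("detect", dist.Bernoulli(torch.tensor(p_detect)))
    return bool(detected)
```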
Because LLMs have seen millions of lines of code and vast amounts of text about how the world works, they are excellent at generating these programs.
The Architecture
The core idea is an iterative loop. The LLM acts as a generator of hypotheses (code), and the real world acts as a validator.
Figure 1: The POMDP Coder architecture. Notice the cycle: Experiences feed into Coverage Analysis, which prompts the LLM (GPT-4) to generate code components. These components form the POMDP model used by the Solver to take actions.
The process works in two main phases: Model Learning and Online Planning.
1. Learning the Model (The “Coder”)
Instead of learning a giant table of numbers, the system tries to learn the source code of the simulation. This is called Probabilistic Program Induction.
The agent starts with a set of demonstrations (experiences). It then follows this loop:
- Proposal: The LLM is given a description of the task (e.g., “You are in a grid world with lava”) and the API structure. It generates Python code for the Initial State, Transition, Observation, and Reward functions.
- Evaluation (Coverage): The system checks the generated code against the real data. It doesn’t just check for syntax errors; it checks for logical contradictions.
The paper introduces a Coverage Metric. It asks: Does the generated code assign a non-zero probability to events that actually happened?

If the robot moved North and hit a wall, but the LLM’s code says moving North always succeeds, the coverage is zero for that data point. This is a critical error.
- Refinement: If the code fails the coverage test, the system takes the specific examples where the model failed (e.g., “Model said moving North works, but Data shows position didn’t change”) and feeds them back to the LLM. The LLM then “repairs” the code.
This cycle continues, effectively using the LLM to “debug” its understanding of the world until the code matches reality.
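The idea behind the coverage check can be approximated with simple Monte-Carlo sampling. The sketch below covers the transition function only and is a simplification, not the paper's exact formulation:

```python
def transition_coverage(model_transition, experiences, n_samples=100):
    """Fraction of observed (s, a, s') steps that the learned transition code can reproduce.
    Zero coverage on a step means the model assigns no probability to what actually happened."""
    covered = 0
    for s, a, s_next in experiences:
        predictions = [model_transition(s, a) for _ in range(n_samples)]  # sample the model
        if s_next in predictions:
            covered += 1
    return covered / len(experiences)
```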
2. Planning with the Model (The “Solver”)
Once the LLM has written a valid POMDP model, we don’t need the LLM to make decisions anymore. We can use a Belief-Space Planner.
The agent uses the learned code to simulate possible futures. Because the state is hidden, the planner maintains a cloud of “particles” (guesses of the current state). It runs simulations to find the action that minimizes a specific cost function:
\[
c \;=\; -\,\hat{r} \;-\; \lambda \log \hat{p} \;+\; \alpha \hat{h}
\]
Here, the agent balances three things:
- \(-\hat{r}\): Maximizing expected reward.
- \(-\lambda \log \hat{p}\): Minimizing “surprise” (risk sensitivity).
- \(\alpha \hat{h}\): Reducing expected belief entropy (information gain/curiosity).
This allows the robot to act deliberately: “I’m 80% sure the apple is in the drawer, so I will open it.”
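As a rough illustration of the planning side, here is a heavily simplified one-step lookahead that scores candidate actions by expected reward under the current belief particles; the actual solver also folds in the risk and information terms above and searches multiple steps ahead:

```python
import random

def choose_action(particles, actions, reward, n_rollouts=50):
    """Pick the action with the highest expected immediate reward over the belief particles.
    (The real solver also accounts for risk, information gain, and multi-step outcomes.)"""
    def expected_reward(a):
        samples = [reward(random.choice(particles), a) for _ in range(n_rollouts)]
        return sum(samples) / n_rollouts
    return max(actions, key=expected_reward)
```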
Why Code? The Power of Abstraction
You might wonder: why generate Python code instead of just predicting the next state with a neural network?
The answer is generalization.
Consider a “Lava” grid world. A tabular method needs to visit every square to learn that lava kills you. A neural network needs many examples to approximate the danger zone. An LLM, however, can simply write a rule.
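A sketch of what such a generated transition rule might look like (the map, names, and structure here are invented for illustration; the paper's generated code may differ):

```python
from dataclasses import dataclass

GRID = {(2, 3): "lava", (2, 4): "lava"}   # hypothetical map: cell -> tile type
MOVES = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}

@dataclass
class AgentState:
    pos: tuple
    alive: bool = True

def transition(state: AgentState, action: str) -> AgentState:
    dx, dy = MOVES[action]
    new_pos = (state.pos[0] + dx, state.pos[1] + dy)
    # The learned rule: stepping onto any lava tile ends the episode.
    if GRID.get(new_pos) == "lava":
        return AgentState(pos=new_pos, alive=False)
    return AgentState(pos=new_pos, alive=True)
```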
This single logical statement covers the entire map. Even if the agent encounters a lava tile it has never seen before, the logic holds. The code captures the causal mechanism of the environment, not just statistical correlations.
Experimental Results
The researchers tested POMDP Coder against several baselines:
- Tabular: Traditional counting-based learning.
- Behavior Cloning (BC): Copying the expert demonstrations directly.
- Direct LLM: Asking the LLM “What action should I take next?” (The standard “LLM Agent” approach).
- Oracle: Using the true, perfect model (the upper bound).
Simulated Domains: MiniGrid
They utilized MiniGrid, a suite of grid-world tasks requiring memory and exploration.
Figure 2: The MiniGrid environments. The green square is the goal, the red triangle is the agent. The blue/grey areas represent the agent’s belief—initially, it doesn’t know the map layout.
The results were striking. In complex environments like “Unlock” (where you must find a key to open a door) or “Lava,” traditional methods failed spectacularly.
Figure 3: Performance comparison. The Blue bars (Ours) consistently reach near-Oracle (Gray) performance. Notice how “Direct LLM” (Pink) struggles in logic-heavy tasks like RockSample or Unlock, often getting stuck in loops.
Key Takeaway: Direct LLM planning is unreliable. It often hallucinates or forgets constraints (like walking through walls). By contrast, POMDP Coder uses the LLM to write the rules, but uses a mathematical solver to execute them, resulting in much higher reliability.
Real-World Robotics: The Spot Robot
The team took the method out of the simulator and onto a Boston Dynamics Spot robot. The task: Find an apple in a room.
The environments were challenging: a small room with cabinets and a large lobby with tables. The robot had to rely on its vision system and move its camera to find the object.
Figure 4: Real-world experiments. The top row shows a small room; the bottom a large lobby. Look at the rightmost column (“Learned Initial Belief”). The model learned that apples are likely on tables/cabinets, drastically narrowing down the search area (blue dots) compared to the “Uniform” baseline where it thought the apple could be anywhere.
The results showed that POMDP Coder was significantly more efficient than baselines.
Table 1: Real-world results. POMDP Coder (Ours) achieved much higher success rates (10/10) compared to Direct LLM or Behavior Cloning.
The tabular method and behavior cloning failed because the state space was too big—they couldn’t generalize from the few training demonstrations. The Direct LLM failed because it struggled with spatial reasoning over long horizons. POMDP Coder succeeded because it learned a structured rule: “Objects are usually supported by furniture,” and used that rule to plan an efficient search.
Conclusion & Implications
This paper presents a compelling step forward in model-based reinforcement learning. By treating code as the universal representation for world models, the authors leverage the structured reasoning of LLMs without suffering from their hallucinations during execution.
Summary of Advantages:
- Data Efficiency: It learns generalizable rules from very few examples (unlike tabular methods).
- Interpretability: The output is Python code. If the robot makes a mistake, a human can read the code and see exactly why (e.g., a bug in the transition function).
- Reliability: By offloading the planning to a dedicated solver (rather than the LLM), the system avoids the “stochastic parrot” problem during critical decision-making.
Limitations: The approach currently assumes post-hoc observability—meaning the agent gets to see what really happened after the episode ends to train the model. It also currently operates in discrete state spaces.
However, the direction is clear: the future of robotic reasoning might not be end-to-end neural networks, but rather neuro-symbolic systems where AI writes the programs that control the robot. POMDP Coder demonstrates that LLMs are not just chatbots; they are capable architects of the mathematical models robots need to understand our world.