Imagine teaching a robot to play chess. You could show it millions of games and hope it learns the patterns, as many deep learning models do. Or, you could give it the rulebook. Armed with the rules, the robot wouldn’t just mimic past games — it could reason about any possible board position, predict outcomes, and plan its moves strategically.

This rulebook is what AI researchers call a world model — an internal simulation of how the world works.

For an AI agent, having an accurate world model is a superpower. It allows the agent to plan ahead, understand the consequences of its actions, and adapt to new tasks with remarkable efficiency. But building these models is challenging. Traditional approaches demand enormous amounts of data, and while Large Language Models (LLMs) bring broad knowledge and reasoning ability, using them directly as simulators is often slow, expensive, and unreliable.

What if we could get the best of both worlds?

That’s the central idea behind a fascinating new paper: Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search. Instead of asking an LLM to be the world model, the researchers ask it to write a program that acts as the world model. This program — a Code World Model (CWM) — is precise, interpretable, and lightning-fast to run.

To build CWMs, they introduce a powerful code generation method called GIF-MCTS (Generate, Improve, and Fix with Monte Carlo Tree Search), guiding an LLM through iterative cycles of writing, debugging, and refining code until it precisely simulates the environment.

In this article, we’ll break down how Code World Models work, explore how GIF-MCTS crafts them, and discuss why this approach could reshape the creation of sample-efficient, intelligent agents.


The Problem with World Models

In model-based Reinforcement Learning (RL), an agent learns to predict what will happen next in its environment. Given its current state \(s\) and a chosen action \(a\), the world model outputs:

  • the next state \(s'\)
  • the reward \(r\)
  • whether the task is finished (done flag)

With this predictive power, the agent can mentally simulate thousands of futures to select the best plan — without the cost or risk of real-world trial-and-error.
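To make this concrete, here is a minimal sketch of how a planner could exploit such a model, using simple random shooting over action sequences. This is an illustration of the general idea, not the paper's planner; `model`, `actions`, `horizon`, and `n_candidates` are hypothetical names.

```python
import random

def plan(model, state, actions, horizon=20, n_candidates=1000):
    """Sample random action sequences, roll each one out inside the
    world model, and return the first action of the best sequence."""
    best_return, best_action = float("-inf"), None
    for _ in range(n_candidates):
        sequence = [random.choice(actions) for _ in range(horizon)]
        s, total = state, 0.0
        for a in sequence:
            s, r, done = model.step(s, a)  # the model predicts (s', r, done)
            total += r
            if done:
                break
        if total > best_return:
            best_return, best_action = total, sequence[0]
    return best_action
```

Every call to `model.step` here is a prediction, not a real interaction, which is exactly why the model's speed and accuracy matter so much.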

Researchers have tried various paths to building such models:

  1. Specialized Neural Networks — Fast and trainable, but limited in understanding complex textual instructions or generalizing to new situations. They can’t easily leverage the broad, unstructured knowledge LLMs contain.

  2. Multimodal Models (e.g., text-to-video) — Highly realistic predictions but far too slow for real-time planning. A single prediction might take seconds — unusable for reactive agents.

  3. LLMs as Direct Simulators — Prompt the LLM with the current state and ask it for the next state in text form. This often yields unreliable numbers, is slow to query, and expensive to run repeatedly in planning loops.

The authors saw a better medium: code. Code is exact, incredibly fast to execute, and human-interpretable. But getting an LLM to write complex, correct simulation code is not trivial.


Introducing Code World Models (CWMs)

The key idea is to recast the world modeling task into program synthesis.

Given:

  • a natural language description of an RL environment (like a rulebook),
  • a small set of example transitions (trajectories),

…the goal is to generate a Python class that perfectly simulates the environment.

This class must implement a step(state, action) method:

\[ (\hat{s}', \hat{r}, \hat{d}) = \text{code\_environment.step}(s, a) \]
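For intuition, a generated CWM might look like the toy class below. This is a hand-written illustration of the target format for a trivial 1-D grid task, not actual output from the paper's pipeline:

```python
class CodeEnvironment:
    """Illustrative CWM for a toy 1-D grid: the agent starts at
    position 0 and earns a reward of 1 for reaching position 4.
    Actions: 0 = move left, 1 = move right."""

    GOAL = 4

    def step(self, state, action):
        position = state + (1 if action == 1 else -1)
        position = max(0, min(self.GOAL, position))  # clamp to the grid
        done = position == self.GOAL
        reward = 1.0 if done else 0.0
        return position, reward, done
```

Because the model is ordinary Python, each step costs microseconds, and every rule it encodes can be read and audited directly.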

The generated code is validated against the provided trajectories. If its predictions match the real outcomes, we have a successful CWM. If not, the errors are fed back into the writing process for refinement.
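The validation loop itself is conceptually simple. A minimal sketch, assuming each trajectory is flattened into (state, action, next_state, reward, done) tuples:

```python
def validate(cwm, transitions):
    """Replay recorded transitions through a candidate CWM. Returns the
    fraction predicted exactly right, plus the first mismatch so it can
    be fed back to the LLM as a debugging hint."""
    correct, first_failure = 0, None
    for s, a, s_next, r, d in transitions:
        prediction = cwm.step(s, a)
        if prediction == (s_next, r, d):
            correct += 1
        elif first_failure is None:
            first_failure = (s, a, prediction, (s_next, r, d))
    return correct / len(transitions), first_failure
```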


Figure 1: The iterative CWM workflow: generate candidate code, validate against known trajectories, and refine until the code fully matches the environment.

To rigorously evaluate this idea, the authors created the Code World Models Benchmark (CWMB) — 18 diverse RL environments (from CartPole to complex MuJoCo physics tasks), each paired with a clear textual description and small trajectory datasets.

But how do you get an LLM to write, debug, and refine such complex code? A naive one-shot prompt would never be enough.

Enter GIF-MCTS.


GIF-MCTS: A Smarter Way to Write Code

Building correct code is an iterative dance of writing, testing, and fixing. GIF-MCTS uses Monte Carlo Tree Search (MCTS) to orchestrate this process.

MCTS, made famous by AlphaGo, searches over possible decision paths, balancing exploration of new options against exploitation of known good ones to converge on strong strategies.

For GIF-MCTS:

  • Nodes = versions of a program (partial or complete).
  • Actions = specific prompts asking the LLM to modify the code.
  • Node value = accuracy: the fraction of recorded transitions the program predicts correctly.

The search tree grows as the algorithm tries ways to improve accuracy by generating new code snippets, fixing bugs, and refining logic.
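In code, the selection phase looks much like any other MCTS. Below is a simplified sketch of the tree bookkeeping, with the standard UCT rule standing in for the paper's exact exploration formula:

```python
import math

class Node:
    """One candidate program in the GIF-MCTS search tree."""

    def __init__(self, code, parent=None):
        self.code = code           # the program this node represents
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_value = 0.0     # accumulated accuracy scores

    def uct(self, c=1.0):
        """Balance the mean accuracy seen below this node (exploitation)
        against how rarely it has been visited (exploration)."""
        if self.visits == 0:
            return float("inf")
        mean = self.total_value / self.visits
        bonus = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return mean + bonus

def select(root):
    """Descend from the root, always following the highest-UCT child."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: child.uct())
    return node
```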


Figure 2: In GIF-MCTS, each node is a program scored by accuracy. Actions (Generate, Improve, Fix) expand the tree, guided by feedback until a high-performing CWM is found.

GIF-MCTS employs three specialized editing actions:

1. Generate new lines

The primary exploration move. The LLM is given the node's current code (which may be empty) and asked to extend it. Repeated Generate actions from the same node explore diverse completions.

2. Improve predictions

For code that runs but returns incorrect results. The LLM gets:

  • the full program,
  • the input on which it failed,
  • the wrong output it produced,
  • and the correct expected output.

It explains the error and rewrites the code with improved logic — as if debugging unit tests.

3. Fix bugs

Some outputs crash (syntax/runtime errors). Instead of discarding them, Fix prompts feed the error message and code back to the LLM to repair it. This salvages promising logic from small mistakes.

By balancing these three actions within MCTS, GIF-MCTS systematically searches the code space, incrementally improving toward accurate CWMs.
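One way to picture the expansion step is as a prompt dispatcher. The sketch below is a simplification: the prompts are paraphrases, and in the real algorithm MCTS weighs these actions against each other rather than applying a fixed if/else order.

```python
def propose_edit(llm, code, error=None, failure=None):
    """Build a Generate, Improve, or Fix prompt for one candidate
    program and ask the LLM for a revised version.

    llm:     any callable mapping a prompt string to a code string
    error:   traceback text if the code crashed, else None
    failure: (input, wrong_output, expected_output) if the code ran
             but predicted incorrectly, else None
    """
    if error is not None:        # Fix bugs: salvage crashing code
        prompt = f"This code raises an error:\n{error}\nRepair it:\n{code}"
    elif failure is not None:    # Improve: correct faulty logic
        inp, wrong, expected = failure
        prompt = (f"For input {inp} the code returned {wrong}, but the "
                  f"correct output is {expected}. Explain the mistake "
                  f"and rewrite the code:\n{code}")
    else:                        # Generate: extend the current draft
        prompt = f"Continue this (possibly empty) program:\n{code}"
    return llm(prompt)
```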


Experiments and Results

The authors tested GIF-MCTS for both general-purpose code generation and the specialized CWM creation task.


APPS: Competitive Programming Challenge

To validate GIF-MCTS as a code generator, they used the APPS benchmark — a set of challenging programming problems — focusing on the hardest “Competition” subset.


Table 1: On APPS Competition problems, GIF-MCTS achieves the highest strict accuracy (28.3%) using the same LLM backbone.

GIF-MCTS, with Llama 3 70B, solved 28.3% of problems, outperforming both Chain-of-Thought prompting and the similar WorldCoder algorithm. This established it as a leading method for hard code synthesis tasks.


CWMB: Building World Models

The main test: generating CWMs for the 18 environments in the CWMB benchmark.

Two metrics were evaluated:

  1. Accuracy \(A\):

    \[ A = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{3}\,\mathbf{1}[\hat{s}'_i = s'_i] + \frac{1}{3}\,\mathbf{1}[\hat{r}_i = r_i] + \frac{1}{3}\,\mathbf{1}[\hat{d}_i = d_i] \right) \]

    Each of the \(N\) recorded transitions contributes one third of a point each for correctly predicting the next state, the reward, and the done flag.
  2. Normalized Return \(\mathcal{R}\):

    \[ \mathcal{R}(\text{CWM}) = \frac{R(\pi_{\text{CWM}}) - R(\pi_{\text{rand}})}{R(\pi_{\text{true}}) - R(\pi_{\text{rand}})} \]

    This measures how well a planner performs using the CWM compared to a random policy and a perfect oracle planner.
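Expressed in code, the two metrics are straightforward. A sketch assuming the same flattened transition tuples as before (exact equality is used for brevity; continuous states would need a tolerance such as numpy.allclose):

```python
def accuracy(cwm, transitions):
    """Average prediction score: each transition awards one third each
    for matching the next state, the reward, and the done flag."""
    total = 0.0
    for s, a, s_next, r, d in transitions:
        pred_s, pred_r, pred_d = cwm.step(s, a)
        total += (int(pred_s == s_next) + int(pred_r == r) + int(pred_d == d)) / 3
    return total / len(transitions)

def normalized_return(r_cwm, r_random, r_true):
    """Scale planner performance so a random policy scores 0
    and the oracle planner scores 1."""
    return (r_cwm - r_random) / (r_true - r_random)
```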


Table 2: Across discrete and continuous environments, GIF-MCTS generates higher-accuracy CWMs and achieves stronger planning returns than WorldCoder.

GIF-MCTS outperformed WorldCoder across the board, generating models that led to better planning results in both discrete and continuous action spaces. GPT-4 Turbo proved especially strong for complex continuous tasks, including successfully modeling the challenging Humanoid environments.


RTFM: Language Meets Logic

The RTFM grid-world requires reading a “manual” describing monsters’ weaknesses and goals, then acting accordingly. It’s a demanding test of translating rules into executable logic.


Table 3: In RTFM, GIF-MCTS guides GPT-4 to perfect models when given more refinement budget.

GIF-MCTS decisively beat WorldCoder on accuracy here. With a budget of 50 LLM calls, GPT-4 produced a CWM with 100% prediction accuracy, enabling a planner to achieve a normalized return of 1.0, matching oracle performance.


Key Takeaways & Future Directions

This study isn’t just a new algorithm — it’s a shift in how we think about using LLMs for reasoning in RL.

  • Sample & Inference Efficiency: CWMs require few environment interactions (10 short trajectories in tests) and run orders of magnitude faster than querying LLMs at each step.
    For Humanoid-v4, a CWM step took 0.0001s, versus 146.7s for GPT-4 — a million-fold speedup.


Table 8: CWMs are four to six orders of magnitude faster than querying GPT-4 directly, replacing repeated LLM calls with near-instant code execution.

  • Power of Search: GIF-MCTS’s guided exploration delivered better results than naive refinement — combining generate, improve, and fix made it far more robust.

  • Accuracy-Return Gap: High accuracy doesn’t always translate to high return. Missing rare but critical states (e.g., a terminal success) can cripple the planner. Future work should emphasize modeling these edge cases.

Limitations: CWMs currently assume deterministic, fully observable environments and rely on clean textual descriptions. Extending to stochastic, partially observable, or image-defined tasks will be important next steps.


Conclusion

Generating Code World Models presents a compelling vision: transform LLMs from slow simulators into program synthesizers. This lets us create world models that are fast, precise, interpretable, and sample-efficient.

GIF-MCTS shows that a structured, search-based approach can guide LLMs to tackle complex reasoning and coding beyond the reach of simple prompts. As LLM coding capabilities grow, methods like GIF-MCTS could generate sophisticated world models for tasks once thought intractable.

It’s a promising step toward AI agents that truly understand, predict, and plan in our complex world — powered by code they’ve written themselves.