Imagine trying to learn a complex skill—like mastering a new video game or operating a robot—using only recorded videos of other people doing it. You can’t interact with the environment yourself; you can only watch and deduce the rules. How would you discover the right strategy to succeed?

This is one of the great challenges of artificial intelligence: teaching agents to act effectively in complex environments using only offline data, without trial-and-error interaction. This approach is vital for real-world applications such as robotics, where each experiment can be slow, costly, or unsafe.

A powerful solution is the world model—a neural network that learns the rules and physics of an environment directly from video data. Once trained, this model becomes a kind of dream engine, enabling agents to practice inside their own imagination. They can simulate experiences and learn strategies entirely offline.

Until recently, however, building world models that were both accurate and fast enough for imagination training proved difficult. Earlier models either misrepresented object interactions or required massive computing resources for each step of simulation.

In the new paper “Training Agents Inside of Scalable World Models,” researchers from Google DeepMind introduce Dreamer 4, an agent that finally overcomes these hurdles. Dreamer 4 trains entirely inside its own imagined world, learning complex long-horizon behaviors with no direct environment interaction. In the process, it achieved something unprecedented: it became the first AI to obtain diamonds in Minecraft, one of the game’s hardest goals, purely from offline gameplay data.


A collage of Minecraft screenshots showing an agent completing tasks like gathering wood, mining stone, crafting a pickaxe, and finally mining diamonds. The agent learns to perform these long sequences of actions by practicing inside its world model.

Figure 1: Dreamer 4 learns to perform long-horizon tasks in Minecraft by training entirely within its own internal simulation of the game world.


Background: The Building Blocks of Generative Models

Before unpacking how Dreamer 4 works, it helps to understand the foundations it builds upon. Dreamer 4’s world model is a type of diffusion-based generative model, a modern framework for producing complex data like images and videos.

Flow Matching

In a standard diffusion model, noise is gradually added to clean data until the data becomes indistinguishable from random noise. A neural network is trained to reverse that process—to iteratively denoise, step by step.

Flow matching simplifies this procedure. Instead of adding and removing noise over thousands of iterations, the model learns a direct path between pure noise \(x_0\) and clean data \(x_1\). Each intermediate state is defined by the signal level \(\tau \in [0,1]\):

\[ x_{\tau} = (1 - \tau)x_0 + \tau x_1 \]

The model \(f_{\theta}\) predicts a velocity vector \(v = x_1 - x_0\), pointing straight from noise to data. It minimizes the squared difference between its prediction and the true velocity:

\[ \mathcal{L}(\theta) = \| f_{\theta}(x_{\tau}, \tau) - (x_1 - x_0) \|^2 \]

At inference, starting from noise, the model takes iterative steps along this learned direction to generate realistic outputs.
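
To make this concrete, here is a minimal sketch of a flow matching training step in PyTorch. The network interface `f_theta(x_tau, tau)` and the tensor shapes are illustrative assumptions, not the paper’s implementation.

```python
import torch

def flow_matching_loss(f_theta, x1):
    """One flow matching training step (illustrative sketch).

    f_theta: network taking (x_tau, tau) and returning a velocity prediction
             with the same shape as x_tau.
    x1:      a batch of clean data, shape (B, ...).
    """
    x0 = torch.randn_like(x1)                                   # pure-noise endpoint
    # One signal level tau in [0, 1] per example, broadcastable over the data dims.
    tau = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    x_tau = (1 - tau) * x0 + tau * x1                           # linear interpolation
    v_target = x1 - x0                                          # true velocity from noise to data
    v_pred = f_theta(x_tau, tau)                                # model's velocity prediction
    return ((v_pred - v_target) ** 2).mean()                    # squared-error objective
```

At sampling time, one would integrate the learned velocity field, for example with a few Euler updates of the form x ← x + (1/N)·f_theta(x, τ), starting from noise.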

Shortcut Models: Speeding Up Generation

Traditional diffusion models require hundreds or thousands of tiny steps to generate one image—a huge computational burden. Shortcut Models solve this by teaching the model to leap instead of crawl.

During training, shortcut models learn to take large, accurate jumps by conditioning on the desired step size \(d\). The model is guided by a bootstrap loss that compares the result of a large step with the outcome of two smaller intermediate steps.

The bootstrap loss for shortcut models, where the target for a large step is derived from two smaller steps.
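
While the paper’s exact equation is not reproduced here, the general shortcut-model formulation takes roughly this form (a sketch; notation may differ). Two consecutive steps of size \(d\) are taken, and the model’s prediction for a single step of size \(2d\) is trained to match their average:

\[ x'_{\tau+d} = x_{\tau} + d\, f_{\theta}(x_{\tau}, \tau, d), \qquad v_{\text{target}} = \tfrac{1}{2}\big( f_{\theta}(x_{\tau}, \tau, d) + f_{\theta}(x'_{\tau+d}, \tau + d, d) \big) \]

\[ \mathcal{L}_{\text{bootstrap}}(\theta) = \big\| f_{\theta}(x_{\tau}, \tau, 2d) - \operatorname{sg}(v_{\text{target}}) \big\|^2 \]

Here \(\operatorname{sg}(\cdot)\) denotes a stop-gradient, so the two-step result acts as a fixed target for the larger step.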

Shortcut training allows models to take larger leaps during inference, reducing hundreds of small steps to just a handful of high-quality ones.

This innovation enables the generation of high-quality samples in as few as 2–4 steps, instead of hundreds—critical for achieving real-time simulation speed in Dreamer 4.
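
As a rough illustration of why this is fast, a shortcut sampler can cover the entire path from noise to data in a few equal jumps. The function interface and the four-step schedule below are assumptions for the sketch, not the paper’s sampler:

```python
import torch

@torch.no_grad()
def shortcut_sample(f_theta, shape, num_steps=4, device="cpu"):
    """Generate a sample in a handful of large steps (illustrative sketch).

    f_theta: step-size-conditioned network (x_tau, tau, d) -> velocity.
    """
    d = 1.0 / num_steps
    x = torch.randn(shape, device=device)            # start from pure noise (tau = 0)
    for i in range(num_steps):
        tau = torch.full((shape[0],), i * d, device=device)
        x = x + d * f_theta(x, tau, d)               # one large Euler step along the learned flow
    return x                                         # approximately clean data (tau = 1)
```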

Diffusion Forcing: Generating Video Sequences

Video generation adds a temporal dimension to the challenge. Each frame depends on those before it. Diffusion Forcing assigns a distinct noise level to every frame, allowing each to serve both as a denoising target and as context for subsequent frames. This mechanism helps the model learn coherent temporal dynamics—essential for world models that predict how actions change a scene over time.
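
A minimal sketch of the core idea, assuming video latents of shape (batch, time, channels, height, width); the independent per-frame noise levels are the point, everything else is illustrative:

```python
import torch

def add_per_frame_noise(z_clean):
    """Corrupt each frame with its own independently sampled signal level.

    z_clean: clean video latents, shape (B, T, C, H, W).
    Returns noisy latents and the per-frame signal levels tau, shape (B, T).
    """
    B, T = z_clean.shape[:2]
    tau = torch.rand(B, T, device=z_clean.device)        # one signal level per frame
    tau_b = tau.view(B, T, 1, 1, 1)                      # broadcast over C, H, W
    noise = torch.randn_like(z_clean)
    z_noisy = (1 - tau_b) * noise + tau_b * z_clean      # same interpolation as flow matching
    return z_noisy, tau
```

Frames that happen to receive a high \(\tau\) (nearly clean) serve as reliable context for denoising frames with a low \(\tau\), which is what allows the trained model to roll out video frame by frame at inference time.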


The Architecture of Dreamer 4

Dreamer 4 is an integrated agent system centered around its scalable world model. It consists of two main components:

  1. Causal Tokenizer – compresses visual frames into compact, continuous representations.
  2. Interactive Dynamics Model – simulates how the world evolves over time in response to actions.

Both components share an efficient transformer architecture designed for high-throughput sequential data.

The overall design of the Dreamer 4 world model, showing the Causal Tokenizer and the Interactive Dynamics model.

Figure 2: Dreamer 4’s architecture combines a causal tokenizer that encodes video frames with an interactive dynamics model that predicts future frames given actions.

Training proceeds in three distinct phases:

  1. World Model Pretraining – Learn the physics of the environment from offline videos.
  2. Agent Finetuning – Add behavior and reward prediction capabilities.
  3. Imagination Training – Reinforce the policy entirely inside the simulated world.

Causal Tokenizer: Seeing the World Efficiently

Each video frame is too large for direct processing in reinforcement learning. The causal tokenizer converts high-dimensional images into smaller sets of latent “tokens.” It uses an encoder-decoder architecture, where the encoder compresses visuals and the decoder reconstructs them when needed.

To ensure robustness, the tokenizer uses Masked Autoencoding (MAE): during training, random patches of each image are hidden, forcing the model to infer missing details. This encourages a deeper semantic understanding of scenes rather than superficial pixel-matching. The tokenizer operates causally in time, meaning only past frames influence current encoding—vital for streaming, interactive applications.
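
A hedged sketch of the masking step, assuming the frame has already been split into a sequence of patch embeddings; the 50% mask ratio and the zeroing of hidden patches are illustrative choices, not the paper’s settings:

```python
import torch

def mask_patches(patches, mask_ratio=0.5):
    """Randomly hide a fraction of patch embeddings for masked autoencoding.

    patches: patch embeddings, shape (B, N, D).
    Returns the masked patches and a boolean mask (True = hidden).
    """
    B, N, _ = patches.shape
    mask = torch.rand(B, N, device=patches.device) < mask_ratio   # which patches to hide
    masked = patches.clone()
    masked[mask] = 0.0          # zero out hidden patches (a learned mask token is another common choice)
    return masked, mask
```

The decoder is then trained to reconstruct the original frame from the surviving patches, which pushes the encoder toward scene-level structure rather than pixel memorization.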


Interactive Dynamics: The Dream Engine

At the core of Dreamer 4 lies its interactive dynamics model, effectively the “physics simulator” of this learned dream world.

The dynamics model ingests the encoded image tokens and action sequences (e.g., mouse movements, keyboard presses), predicting how these actions transform the world in subsequent frames. Its efficiency and accuracy stem from a novel Shortcut Forcing training objective.

  1. X-Prediction (Clean Representation Targeting): Instead of predicting velocity differences between states (v-prediction), the model directly predicts the clean latent representation \(z_1\) (x-prediction). This prevents the accumulation of small high-frequency errors typical in long rollouts and stabilizes video generation over thousands of frames.

  2. Bootstrap Loss in X-Space: The shortcut model’s bootstrap training is adapted for direct x-prediction. It ensures that one large simulation step yields the same result as multiple smaller ones, maintaining accuracy while enabling accelerated inference.

The Shortcut Forcing loss function, showing the flow matching term for the smallest step size and the bootstrap term for larger steps.

Shortcut Forcing combines flow matching and bootstrap objectives in x-space, allowing Dreamer 4 to simulate sequences quickly while remaining stable.

  3. Ramp Loss Weight: Since very noisy frames carry little learning signal, Dreamer 4 scales its loss linearly with the signal level \(\tau\):
\[ w(\tau) = 0.9\tau + 0.1 \]

This focuses model capacity where information is richest.
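
A hedged sketch of how the x-prediction term and the ramp weight fit together for a single frame; the model interface `g_theta(z_tau, tau, d)` returning a predicted clean latent is an assumption for illustration, not the paper’s code, and the bootstrap term from the previous point would be applied analogously in x-space:

```python
import torch

def ramp_weight(tau):
    """Linear loss weight w(tau) = 0.9 * tau + 0.1, favoring frames with more signal."""
    return 0.9 * tau + 0.1

def shortcut_forcing_xloss(g_theta, z1, tau, d_min=1 / 64):
    """X-prediction flow matching term with ramp weighting (illustrative sketch).

    g_theta: network mapping (z_tau, tau, d) -> predicted clean latent.
    z1:      clean latents, shape (B, ...); tau: per-example signal levels in [0, 1].
    """
    z0 = torch.randn_like(z1)
    tau_b = tau.view(-1, *([1] * (z1.dim() - 1)))
    z_tau = (1 - tau_b) * z0 + tau_b * z1                    # noisy latents
    z_pred = g_theta(z_tau, tau, d_min)                      # predict the clean latent directly
    per_example = ((z_pred - z1) ** 2).flatten(1).mean(dim=1)
    return (ramp_weight(tau) * per_example).mean()           # down-weight nearly pure-noise inputs
```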

Together, these innovations make it possible for Dreamer 4’s dynamics model to generate the next frame with just four forward passes, achieving real-time speed on a single GPU.


Imagination Training: Learning to Act in the Dream

After building its internal simulator, Dreamer 4 becomes an agent that learns entirely inside its own imagination.

In the finetuning phase, new heads are added to the transformer: one predicting actions (policy head) and another predicting rewards. These heads are first trained on the offline dataset with behavior cloning, imitating recorded player behavior.

But imitation alone can only reproduce behaviors already present in the dataset. To discover strategies beyond it, Dreamer 4 uses Imagination Training, a reinforcement learning process performed entirely inside its simulated world.

The PMPO loss function used to update the policy during imagination training.

In imagination training, the PMPO objective updates the policy using only simulated trajectories from the world model.

Dreamer 4 rolls out imagined trajectories: actions are sampled from its policy, consequences simulated by the world model, and rewards produced by the learned reward head. A value head predicts cumulative rewards so that the policy can optimize long-term success.

The training objective, called PMPO (Preference Optimization as Probabilistic Inference), updates policies by considering only whether outcomes were better or worse than expected—not by magnitude. This approach prevents instability caused by varying reward scales and ensures balanced learning across tasks.
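
As a rough, hedged illustration of the sign-based idea rather than the paper’s exact PMPO objective, the loss below raises the log-probability of imagined actions whose estimated return beat the value baseline and lowers it for those that fell short; `log_probs`, `returns`, and `values` are assumed outputs of the imagined rollout:

```python
import torch

def sign_based_policy_loss(log_probs, returns, values):
    """Update the policy from the sign of the advantage only (illustrative sketch).

    log_probs: log pi(a_t | s_t) for imagined actions, shape (B, T).
    returns:   return estimates from the learned reward and value heads, shape (B, T).
    values:    value-head baselines, shape (B, T).
    """
    advantage = returns - values
    sign = torch.sign(advantage).detach()      # +1 better than expected, -1 worse; magnitude discarded
    return -(sign * log_probs).mean()          # reinforce good outcomes, suppress bad ones
```

Because only the sign of the advantage enters the loss, a task with large rewards cannot drown out one with small rewards, which is exactly the stability property described above.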


Experiments: Putting Dreamer 4 to the Test

The researchers evaluated Dreamer 4 extensively in the open-world sandbox of Minecraft, one of the most complex visual control benchmarks available.

The Offline Diamond Challenge

The ultimate test: could Dreamer 4 learn to obtain diamonds? This milestone requires mastering long sequences of actions—chopping wood, crafting tools, mining deep underground—equivalent to roughly 24,000 mouse and keyboard events per episode.

Training used the VPT Contractor Dataset (2,500 hours of recorded human gameplay). Crucially, Dreamer 4 learned purely offline, with no interaction with the live environment.

Bar chart showing the success rates of different agents at obtaining key items in Minecraft. Dreamer 4 (blue) consistently outperforms all other methods.

Figure 3: Dreamer 4 dramatically surpasses prior offline agents, becoming the first to collect diamonds purely from recorded data.

The results speak volumes. Dreamer 4 achieved over 90% success on intermediate tasks (crafting stone tools), 29% success on iron pickaxe creation, and successfully obtained diamonds in 0.7% of episodes—a milestone never reached before without online data.

Further ablation studies revealed that imagination training was key:

Ablation study comparing success rates and completion times for key Minecraft items. Dreamer 4 (blue) is both more successful and faster than behavior cloning agents.

Figure 4: Imagination training improves both success rates and task efficiency compared to pure behavior cloning.

Dreamer 4 significantly reduced the time needed to reach milestones and consistently outperformed all baselines, including OpenAI’s VPT and vision-language models like Gemma 3.


Human Interaction: Playing Inside the Dream

To test the fidelity of the world model’s physics, researchers invited human players to interact inside Dreamer 4’s simulation. Using mouse and keyboard, players attempted tasks such as building a 3×3 wall, and repeated the same tasks inside prior world models such as Oasis and Lucid for comparison.

A human player successfully builds a 3x3 wall inside Dreamer 4’s world model, while previous models like Oasis and Lucid fail to simulate the interaction correctly.

Figure 5: Dreamer 4’s world model is the first to precisely simulate complex interactions, letting humans play and build structures within the imagined world.

Dreamer 4 accurately predicted the effect of each action, enabling genuine gameplay-like interaction. Competing models often blurred visuals, swapped held items, or hallucinated nonexistent structures.

Table comparing Minecraft world models on parameters, resolution, context length, speed, and task success. Dreamer 4 excels in context length and task success while maintaining real-time FPS.

Table 1: Comparison of Minecraft world models. Dreamer 4 achieves the longest temporal context and real-time performance.

When trained on real-world robot data, Dreamer 4’s world model also learned to simulate pick-and-place manipulation with accurate physics.

Generated video from the world model showing a robot arm performing various pick-and-place tasks based on counterfactual human controls.

Figure 6: Dreamer 4 applied to robotics. Its world model can simulate realistic physical interactions in real time.


Learning from Unlabeled Videos

A crucial benefit of world models is their ability to learn general structure from video alone. The researchers tested how much labeled action data Dreamer 4 actually needed.

Graphs showing that Dreamer 4 learns effective action conditioning with only a small amount of action-labeled data, and that this learning generalizes to unseen environments.

Figure 7: With only 100 hours of action data (≈4% of the total), Dreamer 4 achieves near-perfect action conditioning, even generalizing to unexplored environments like the Nether.

Training on all 2,500 hours of video but only 100 hours of labeled actions, Dreamer 4 reached nearly the same generation quality as fully supervised training. Even more remarkably, when trained with actions only from the Minecraft Overworld, it successfully generalized to unseen dimensions like the Nether and the End, suggesting that action grounding learned from limited data can transfer broadly.


Why It’s Fast and Accurate

The team conducted a thorough breakdown of Dreamer 4’s design, starting from a slow baseline and progressively adding efficiency improvements.

Table showing a cascade of improvements, starting from a baseline diffusion transformer and adding each of Dreamer 4’s components, tracking improvements in speed (FPS) and quality (FVD).

Table 2: Stepwise analysis of Dreamer 4’s architecture. Each design modification improves generation quality (lower FVD) or inference speed (higher FPS).

At first, naive diffusion forcing achieved only 0.8 FPS. Adding shortcut modeling boosted speed to 9.1 FPS while keeping quality steady. Further refinements—x-prediction, x-loss computation, ramp weighting, and architectural optimizations—pushed interactive inference past 20 FPS, matching real-time gameplay.

A graph showing that Shortcut Forcing achieves high quality (low FVD) with far fewer sampling steps than standard Diffusion Forcing.

Figure 8: Shortcut Forcing achieves near-equivalent quality to diffusion forcing with only one-sixteenth the sampling steps.

This combination of architectural and training insights produces a world model that is both high-fidelity and real-time capable, a crucial milestone for scalable imagination training.


Conclusion and Future Directions

Dreamer 4 marks a major advance in world modeling and autonomous agent learning. Its innovations—shortcut forcing, causal tokenization, efficient transformer design—collectively enable real-time, scalable imagination-based training.

Key takeaways:

  1. Imagination training works: Agents can surpass human demonstrations purely by practicing inside learned world models.
  2. Speed and accuracy in one system: Shortcut forcing and efficient architectures permit real-time, precise simulations from raw pixels.
  3. Learning from passive video is viable: Only small fractions of labeled actions are needed; the rest can come from observing the world.

These achievements open remarkable horizons: pretraining on diverse web videos, applying world models to robotics, incorporating language understanding, and scaling to agents that reason over indefinite timelines.

Dreamer 4 doesn’t just imagine; it dreams realistically. And in doing so, it demonstrates how intelligent agents may one day learn the complexities of the world not through trial and error, but through imagination itself.