For years, large language models (LLMs) have relied on a single fundamental idea: autoregression. Models such as GPT-4, LLaMA, and Qwen generate text one word at a time, moving from left to right—much like how a person might write a sentence. This approach has driven remarkable progress, but it also carries inherent limitations. When a model can only see the past, it struggles with tasks requiring global consistency, long-term planning, or satisfying complex constraints.

Imagine solving a Sudoku puzzle strictly from the top-left corner, filling cells one by one, without ever revisiting earlier steps. You’d almost certainly make irreversible mistakes. That’s the challenge faced by autoregressive (AR) models.

But what if a model could start with a rough draft of the entire sequence and refine it iteratively, seeing the full context from the beginning? This idea defines diffusion models, a paradigm that has transformed image generation—and is now poised to rewrite the way we build language models.

In their 2025 paper, researchers from The University of Hong Kong and Huawei Noah’s Ark Lab introduce Dream 7B, a diffusion-based large language model with 7 billion parameters. It bridges the gap with top-tier autoregressive systems while unlocking new abilities in reasoning, planning, and flexible inference. Dream 7B isn’t just another LLM; it is evidence that the future of text generation might not need to be written one word at a time.


Figure 1: Dream 7B achieves competitive scores on general benchmarks and exceptional results on planning tasks like Sudoku and Countdown.

As shown in Figure 1, Dream 7B matches autoregressive models on general tasks and clearly surpasses them in planning and reasoning—key areas where holistic context matters. Let’s dive deeper into how it works.


Background: Autoregression vs. Diffusion

Autoregressive (AR) Modeling — The Familiar Path

Autoregressive models rely on a simple principle: the probability of a sequence can be decomposed into the probability of each token given the previous tokens.

\[ p_{\theta}(\mathbf{x}) = p_{\theta}(\mathbf{x}^{1}) \prod_{n=2}^{N} p_{\theta}(\mathbf{x}^{n} \mid \mathbf{x}^{1:n-1}) \]

To predict the next token, the model conditions on every token it has generated so far. This left-to-right, token-by-token process works well for fluency, but it prevents the model from using future context when reasoning about the sentence as a whole.
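
To make the factorization concrete, here is a toy sketch of greedy left-to-right decoding. The function `next_token_probs` is a hard-coded stand-in for a trained model \( p_{\theta}(\mathbf{x}^{n} \mid \mathbf{x}^{1:n-1}) \), not anything from the paper, so the loop runs end to end:

```python
# A toy sketch of greedy autoregressive decoding; `next_token_probs` stands in
# for a trained model p_theta(x^n | x^{1:n-1}).
VOCAB = ["Dream", "is", "a", "text", "diffusion", "model", "<eos>"]

def next_token_probs(prefix):
    # A real LLM would score the whole vocabulary given the prefix;
    # this stand-in simply favors a fixed continuation.
    target = VOCAB[min(len(prefix), len(VOCAB) - 1)]
    return {tok: (0.9 if tok == target else 0.1 / (len(VOCAB) - 1)) for tok in VOCAB}

def generate(max_len=10):
    tokens = []
    for _ in range(max_len):
        probs = next_token_probs(tokens)        # p(x^n | x^{1:n-1}): only the past is visible
        nxt = max(probs, key=probs.get)         # greedy choice; earlier tokens are never revisited
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return " ".join(tokens)

print(generate())   # -> "Dream is a text diffusion model"
```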

Diffusion Modeling — A New Dream

Diffusion models flip this approach. Rather than generating text sequentially, they start with a noisy sequence and denoise it over multiple iterations to reveal coherent text.

  1. Forward (Noising) Process: Begin with a clean sequence—say, “Dream is a text diffusion model.” Then progressively corrupt it by replacing words with [MASK] tokens. After enough steps, everything becomes masked.
  2. Reverse (Denoising) Process: Train the model to reconstruct the clean sentence step by step, using all available context to predict each masked token.

This diffusion formulation models the complete context, not just what’s on the left:

\[ p_{\theta}(\mathbf{x}_0) = \sum_{\mathbf{x}_{1:T}} p(\mathbf{x}_T) \prod_{t=1}^{T} p_{\theta}(\mathbf{x}_{t-1} \mid \mathbf{x}_{t}) \]

The training goal is a weighted cross-entropy loss emphasizing timesteps near the clean data:

\[ L(\theta) = -\mathbb{E}_{\mathbf{x}_0, t, \mathbf{x}_t}\left[ w(t) \sum_{n=1}^{N} \mathbf{1}_{[\mathbf{x}_t^n = \mathsf{MASK}]} \log p_\theta(\mathbf{x}_0^n \mid \mathbf{x}_t) \right] \]

Here, \( w(t) \) assigns greater weight when the model reconstructs less corrupted sequences, improving denoising precision.
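
To ground both the forward masking described above and this objective, here is a minimal PyTorch-style sketch of one training step. It assumes a per-example noise level \( t \) sampled uniformly and \( w(t) = 1/t \), a common choice for masked-diffusion language models; Dream's actual schedule and weighting may differ.

```python
# A minimal sketch of one masked-diffusion training step (assumptions: uniform t,
# w(t) = 1/t). `model` maps token ids (B, N) to logits (B, N, vocab).
import torch
import torch.nn.functional as F

def diffusion_training_loss(model, x0, mask_id):
    B, N = x0.shape
    t = torch.rand(B, 1).clamp(min=1e-3)          # per-example noise level in (0, 1]
    mask = torch.rand(B, N) < t                   # forward process: mask each token with probability t
    xt = torch.where(mask, torch.full_like(x0, mask_id), x0)

    logits = model(xt)                            # full (bidirectional) attention over the corrupted input
    loss_per_token = F.cross_entropy(
        logits.reshape(B * N, -1), x0.reshape(B * N), reduction="none"
    ).reshape(B, N)

    w = 1.0 / t                                   # w(t): more weight when the sequence is less corrupted
    return (w * loss_per_token * mask.float()).sum() / mask.float().sum().clamp(min=1)
```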

This bidirectional setup allows diffusion models to use full-context prediction, which turns out to be crucial for solving tasks with global constraints—like planning or mathematical reasoning.


The Dream 7B Approach: Two Key Innovations

Building a diffusion LLM at scale can be computationally expensive. Dream 7B introduces two innovations that make training faster and more effective.

1. Smart Start — AR-based LLM Initialization

Instead of starting from random weights, Dream 7B begins with the weights of a powerful autoregressive model, Qwen2.5 7B. This is akin to teaching a new skill to a student who is already well educated.

Autoregressive models encode sequence order by using the hidden state at position i to predict the token at i+1. Dream preserves this relationship using a “shift operation” during diffusion training. Rather than disrupting positional knowledge, the model continues to use hidden states in a shifted manner, aligning diffusion behavior with existing pre-trained structure.
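
The sketch below shows one way to read the shift operation: even with full attention over the corrupted input, the output at position i is still trained to predict the clean token at position i + 1, so the alignment learned during AR pre-training carries over. The function names and loss details are illustrative, not the paper's code.

```python
# A minimal sketch of shifted prediction under full attention (my reading of the
# shift trick, assuming a masked-diffusion loss as above).
import torch.nn.functional as F

def shifted_masked_loss(model, x0, xt, mask):
    """x0: clean ids, xt: corrupted ids, mask: True where xt is [MASK]; all (B, N)."""
    logits = model(xt)                         # (B, N, vocab); every position attends to the full sequence
    shifted_logits = logits[:, :-1, :]         # the output at position i ...
    shifted_targets = x0[:, 1:]                # ... is trained to predict the clean token at i + 1
    shifted_mask = mask[:, 1:].float()         # only score positions that were actually masked

    loss = F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_targets.reshape(-1),
        reduction="none",
    ).reshape(shifted_targets.shape)
    return (loss * shifted_mask).sum() / shifted_mask.sum().clamp(min=1)
```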


Figure 2: Dream aligns diffusion modeling with AR architectures. Causal (yellow) attention in AR becomes full (blue) attention in Dream, maintaining positional prediction alignment.

This AR initialization dramatically accelerates training and allows the diffusion model to inherit rich linguistic understanding from established LLMs. With stronger AR foundations, future diffusion models can improve without expensive retraining from scratch.

2. Fine-Grained Noise — Context-Adaptive Token-Level Rescheduling (CART)

Conventional diffusion models apply a single noise level to the entire sequence, ignoring token-specific difficulty. Some masked tokens are easier to recover than others depending on context.

Dream introduces CART (Context-Adaptive noise Rescheduling at Token-level), which dynamically adjusts the noise level for each masked token based on its contextual informativeness.


Figure 3: Tokens with richer context are treated as having lower noise. Dream re-estimates noise per token, improving learning efficiency and stability.

The training objective becomes:

\[ L(\theta) = -\mathbb{E}_{\mathbf{x}_0, t, \mathbf{x}_t}\left[ \sum_{n=1}^{N} \mathbf{1}_{[\mathbf{x}_t^n = \mathsf{MASK}]}\, w(t, \mathbf{x}_t, n) \log p_\theta(\mathbf{x}_0^n \mid \mathbf{x}_t) \right] \]

where

\[ w(t, \mathbf{x}_t, n) = \frac{1}{2} \sum_{i=1}^{N} \mathbf{1}_{[\mathbf{x}_t^i \neq \mathsf{MASK}]} \operatorname{Geo}\!\left(p, |n - i| - 1\right) \]

This adaptive scheme helps Dream focus learning on informative positions first, leading to faster convergence and better understanding of complex contextual dependencies.
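
As a small illustration of the weight above for a single masked position, the sketch below assumes Geo(p, k) is the geometric pmf (1 - p)^k * p over distances k = 0, 1, 2, ...; the helper is illustrative rather than the paper's implementation.

```python
# A minimal sketch of the token-level weight in the formula above, assuming
# Geo(p, k) = (1 - p)^k * p is the geometric pmf over distance to unmasked tokens.
def cart_weight(xt, mask_id, n, p=0.5):
    """Weight for masked position n, given the corrupted sequence xt (list of token ids)."""
    weight = 0.0
    for i, tok in enumerate(xt):
        if tok != mask_id:
            k = abs(n - i) - 1                 # distance to the observed token, offset as in the formula
            weight += (1.0 - p) ** k * p       # closer unmasked context contributes more
    return 0.5 * weight                        # the 1/2 balances the two directions

# Example: position 2 is masked; positions 0, 1, 4 are observed (0 plays the role of [MASK]).
xt = [11, 12, 0, 0, 15]
print(cart_weight(xt, mask_id=0, n=2))         # the adjacent observed token (i=1) dominates the weight
```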


Experiments: Putting Dream 7B to the Test

The authors evaluated Dream 7B on a suite of benchmarks across general language understanding, mathematics, programming, reasoning, and planning. Comparisons included diffusion models (LLaDA 8B) and autoregressive models (Qwen2.5 7B, LLaMA3 8B).


Figure 4: Dream 7B versus LLaDA 8B (Diffusion) and leading AR models. Dream shows clear advantages on planning and reasoning benchmarks.

Key observations:

  1. Competitive with AR Models: Dream 7B rivals Qwen2.5 7B in general and reasoning benchmarks, despite training on far fewer tokens (0.6T vs. 18T).
  2. Superior Planning Ability: On tasks demanding global consistency—Countdown, Sudoku, Trip planning—Dream excels, scoring 81.0 on Sudoku compared to Qwen2.5’s 21.0.
  3. Diffusion Advancements: Dream outperforms the prior diffusion state-of-the-art (LLaDA 8B), confirming the benefits of AR initialization and CART.

Instruction-Tuned Dream-Instruct

To explore alignment and instruction following, the team fine-tuned Dream using 1.8 million prompt-response pairs, creating Dream-Instruct.


Figure 5: Dream-Instruct demonstrates that diffusion models can be effectively fine-tuned for instruction following.

Even without reinforcement learning post-training, Dream-Instruct reaches strong performance across language, math, and code benchmarks, showing that diffusion models extend naturally to dialogue and alignment use cases.


Evaluating the Impact of AR Initialization

To measure how much AR pre-training helps, researchers compared two versions of Dream with 1 billion parameters—one trained from scratch and one initialized from LLaMA3.2-1B weights.


Figure 6: AR initialization consistently leads to lower validation loss throughout training, highlighting its efficiency advantage.

The results are striking: training from AR initialization maintains substantially lower loss throughout. This confirms that reusing autoregressive knowledge accelerates diffusion model optimization and reduces compute cost, providing a practical path to scale.


The Unique Superpowers of Diffusion LLMs

Beyond benchmarks, Dream 7B reveals several qualities that set diffusion models apart.

1. Unmatched Planning and Reasoning Ability

Diffusion models can reason about entire sequences instead of a single upcoming token. When solving Sudoku or planning trips, this full-context capability enables consistent problem-solving rather than piecemeal guesses.


Figure 7: Dream 7B maintains strong planning accuracy as task difficulty increases, outperforming AR models by wide margins.

As tasks grow more complex, Dream 7B’s advantage widens. It degrades gracefully where AR models often fail catastrophically.


Figure 8: Dream 7B finds valid solutions under complex constraints where AR models fail due to their sequential nature.

2. Flexible Quality–Speed Trade-Offs

AR models must spend one forward pass per generated token, but diffusion models can trade compute for quality dynamically by adjusting the number of denoising timesteps.


Figure 9: Dream enables tunable generation. With roughly 5–20 denoising steps, Dream achieves both higher speed and higher accuracy than Qwen2.5 7B.

This flexibility lets users decide—faster responses or more refined outputs—without retraining, offering a new dimension to inference-time optimization.
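
The sketch below shows how the number of denoising steps becomes an inference-time knob: each pass predicts every position from full context and commits the most confident masked tokens, so fewer steps mean faster but coarser generation. This is an illustrative confidence-based decoder, not Dream's actual sampler.

```python
# A minimal sketch of iterative denoising with a configurable step count
# (illustrative; not Dream's decoding algorithm).
import torch

@torch.no_grad()
def denoise(model, prompt_ids, gen_len, mask_id, num_steps=10):
    """prompt_ids: 1-D LongTensor; returns a (1, len) tensor with the masks filled in."""
    x = torch.cat([prompt_ids, torch.full((gen_len,), mask_id, dtype=torch.long)]).unsqueeze(0)
    for step in range(num_steps):
        still_masked = x == mask_id
        if still_masked.sum() == 0:
            break
        logits = model(x)                               # predict every position from full context
        conf, pred = logits.softmax(-1).max(-1)         # best guess and its confidence per position
        conf = conf.masked_fill(~still_masked, -1.0)    # never re-commit already filled positions
        k = max(1, int(still_masked.sum().item() // (num_steps - step)))
        idx = conf.view(-1).topk(k).indices             # commit the k most confident masked tokens
        x.view(-1)[idx] = pred.view(-1)[idx]
    return x
```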

3. Arbitrary Order Generation and Infilling

Autoregressive models are bound by sequential dependence. Diffusion models break that barrier, generating tokens in any order. This unlocks powerful applications:

  • Infilling: Fill missing segments given beginnings and endings.
  • Completion: Extend text freely, akin to AR models.
  • Configurable Decoding: Adjust generation order, from structured left-to-right to fully random synthesis.

This adaptability enables creative writing, document revision, and seamless interactive editing—all without specialized training.
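
As a concrete illustration of infilling, the sketch below (with hypothetical helper names) builds an input in which only the middle span is masked, so a denoising loop like the one sketched earlier conditions on both the prefix and the suffix while filling the gap.

```python
# A minimal infilling sketch (hypothetical helper): only the middle span is masked,
# so the model sees both the beginning and the ending at once.
import torch

def make_infill_input(prefix_ids, suffix_ids, span_len, mask_id):
    middle = torch.full((span_len,), mask_id, dtype=torch.long)
    x = torch.cat([prefix_ids, middle, suffix_ids])
    fill_positions = torch.arange(len(prefix_ids), len(prefix_ids) + span_len)
    return x, fill_positions   # a denoising loop then fills only these positions

# Example: "Dream is [____] [____] model." with both sides fixed.
prefix = torch.tensor([101, 102])          # toy ids for the fixed beginning
suffix = torch.tensor([103, 104])          # toy ids for the fixed ending
x, positions = make_infill_input(prefix, suffix, span_len=2, mask_id=0)
print(x.tolist(), positions.tolist())      # [101, 102, 0, 0, 103, 104] [2, 3]
```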


Conclusion: A New Path Forward

Dream 7B marks a turning point in large language model design. By combining practical techniques—AR initialization and context-aware noise scheduling—with the inherently holistic nature of diffusion, Dream achieves performance competitive with leading autoregressive models while surpassing them in reasoning, planning, and flexibility.

Diffusion-based language models are no longer academic curiosities. They are versatile systems capable of generating, infilling, and optimizing text quality and speed dynamically—all while reasoning globally across context.

The next generation of language models may not write strictly from left to right. Instead, they may dream the text into existence—refining meaning iteratively across the full canvas of possibilities.