Introduction

In the world of Generative AI, Autoregressive (AR) models are the heavy lifters. They are the architecture behind the Large Language Models (LLMs) that power ChatGPT and Claude. Their premise is simple but powerful: predict the next piece of data based on everything that came before it. When applied to text, they write one word at a time. When applied to computer vision, they paint an image one “token” (a compressed patch of an image) at a time.

While AR models have shown incredible promise in visual generation—offering unified modeling and great scalability—they suffer from a significant bottleneck: speed.

Imagine trying to load a webpage where every pixel has to load sequentially, one by one, from the top left to the bottom right. That is essentially how standard autoregressive visual models work. They follow a “raster scan” order, predicting hundreds or thousands of tokens in a strict sequence. This makes them significantly slower than their counterparts, like GANs or some diffusion distillation techniques.

But what if we didn’t need to wait? What if we could predict multiple parts of an image simultaneously without breaking the logic that makes AR models so good?

In this post, we dive deep into a new research paper that proposes Parallelized Autoregressive Visual Generation (PAR). The researchers introduce a simple yet effective strategy that achieves speedups of up to 9.5x over standard autoregressive decoding while maintaining high image quality.

The Problem: The Sequential Bottleneck

To understand the innovation of PAR, we first need to understand the limitation of current methods. State-of-the-art autoregressive visual models usually follow a two-stage process:

  1. Tokenization: An image is compressed into a grid of discrete tokens (e.g., using a VQGAN). A \(256 \times 256\) image might become a \(16 \times 16\) or \(24 \times 24\) grid of tokens.
  2. Sequential Prediction: A Transformer predicts these tokens one by one. To generate a \(24 \times 24\) grid, the model must run 576 separate inference steps.

This sequential nature guarantees consistency. By generating token #100 based on tokens #1 through #99, the model ensures the image makes sense. However, this serialization is computationally expensive and slow.
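To make that cost concrete, here is a minimal sketch of raster-scan decoding. It assumes a placeholder `model` that maps a token sequence (plus a condition token such as a class label) to next-token logits; the function and argument names are ours, not the paper's code:

```python
import torch

@torch.no_grad()
def raster_scan_generate(model, cond_token, grid_h=24, grid_w=24, top_k=600):
    """Standard raster-scan AR decoding: one forward pass per token,
    i.e. 576 strictly sequential steps for a 24x24 token grid."""
    seq = [cond_token]
    for _ in range(grid_h * grid_w):
        logits = model(torch.tensor([seq]))[0, -1]      # logits for the next position only
        vals, idx = torch.topk(logits, top_k)           # top-k sampling introduces randomness
        probs = torch.softmax(vals, dim=-1)
        seq.append(int(idx[torch.multinomial(probs, 1)]))
    return torch.tensor(seq[1:]).view(grid_h, grid_w)   # token grid, ready for the detokenizer
```

Every token waits for all the tokens before it, so latency grows linearly with the number of tokens in the grid.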

Attempts to parallelize this (guessing multiple tokens at once) usually result in a drop in quality. If you try to guess two adjacent tokens at the same time, each effectively “blind” to the other, you can end up with incoherent local patterns.

The Core Insight: Dependency and Distance

The researchers behind PAR started with a fundamental question: Which tokens actually depend on each other?

In language, dependencies can be complex and long-range. In images, however, there is a strong spatial correlation.

  • Strong Dependency: Adjacent tokens (neighbors) are highly correlated. If one token represents the stripe of a zebra, the token immediately to its right must continue that stripe.
  • Weak Dependency: Distant tokens are loosely correlated. The texture of the grass in the top-left corner of an image doesn’t dictate the specific texture of the dirt in the bottom-right corner, provided they both fit the global context of “an animal in nature.”

The Failure of Naive Parallelization

If we naively try to speed up generation by predicting adjacent tokens in parallel, the results are disastrous. Because standard sampling (like top-k) introduces randomness, predicting neighbors independently leads to conflicts.

Comparison of different parallel generation strategies showing the failure of naive local parallelization.

As shown in Figure 1 above:

  • Panel (b) Naive Method: When the model tries to generate strongly dependent tokens (neighbors) simultaneously, the local patterns break. Notice the distorted tiger face and the fragmented zebra stripes. The tokens don’t “agree” on the local texture because they are being sampled independently.
  • Panel (a) PAR Method: The researchers’ approach generates weakly dependent tokens (distant regions) in parallel. The result is a coherent, high-quality image.

Visualizing Entropy and Dependency

To scientifically validate this intuition, the authors analyzed the conditional entropy of visual tokens. Entropy here essentially measures uncertainty or how much “new” information a token provides given what we already know. Lower entropy implies a stronger dependency (the token is easier to predict if you know its neighbor).

Visualization of token conditional entropy maps showing strong local dependencies.

Figure 11 illustrates this concept beautifully. The blue square represents a reference token. The red areas show low conditional entropy (high dependency). As you can see, a token is heavily dependent on its immediate neighbors, but that dependency fades rapidly with distance. This heat map provides the mathematical justification for the PAR strategy: Parallelize the distant, serialize the local.
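As a rough illustration of the quantity being plotted, conditional entropy between two token positions can be estimated from co-occurrence statistics over a batch of tokenized images. The paper's exact estimation procedure may differ; the NumPy sketch below, with `token_grids` standing in for tokenizer output, simply makes \(H(X \mid Y) = H(X, Y) - H(Y)\) concrete:

```python
import numpy as np

def conditional_entropy(token_grids, dy=0, dx=1, eps=1e-12):
    """Estimate H(X | Y) = H(X, Y) - H(Y) from empirical co-occurrence counts.

    token_grids: int array of shape (num_images, H, W) of discrete tokens.
    X is the token at (i, j), Y the token at (i + dy, j + dx); offsets are non-negative.
    Lower conditional entropy = stronger dependency (X is easy to predict given Y).
    """
    n, H, W = token_grids.shape
    x = token_grids[:, :H - dy, :W - dx].ravel()
    y = token_grids[:, dy:, dx:].ravel()
    # Joint distribution over (X, Y) pairs, estimated by counting unique pairs.
    _, pair_counts = np.unique(np.stack([x, y]), axis=1, return_counts=True)
    p_xy = pair_counts / pair_counts.sum()
    h_xy = -np.sum(p_xy * np.log2(p_xy + eps))
    _, y_counts = np.unique(y, return_counts=True)
    p_y = y_counts / y_counts.sum()
    h_y = -np.sum(p_y * np.log2(p_y + eps))
    return h_xy - h_y   # repeat over many offsets to build a heat map like Figure 11
```

Sweeping `(dy, dx)` over increasing offsets reproduces the qualitative picture in the figure: low conditional entropy near the reference token, rising quickly with distance.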

The PAR Method: How It Works

The PAR approach modifies the generation order without changing the fundamental model architecture or the tokenizer. It can be broken down into three logical steps.

1. Cross-Region Grouping

First, the token grid (the compressed image) is partitioned into \(M \times M\) regions. For example, a \(24 \times 24\) grid might be split into four \(12 \times 12\) regions (\(M=2\)).

Instead of processing the image row-by-row across the whole width, the model groups tokens that share the same relative position in their respective regions.

\[ \Big\{ \big[ v_1^{(1)}, \ldots, v_1^{(M^2)} \big],\ \big[ v_2^{(1)}, \ldots, v_2^{(M^2)} \big],\ \ldots,\ \big[ v_k^{(1)}, \ldots, v_k^{(M^2)} \big] \Big\}. \]

This equation simply denotes that we are grouping the \(k\)-th token from every region together.
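A minimal sketch of this reordering, assuming a square token grid split into \(M \times M\) equal regions scanned in raster order (function and variable names are ours, not the paper's):

```python
def non_local_order(grid_size=24, M=2):
    """Return the PAR-style generation order as a list of groups.

    Each group holds M*M (row, col) positions: the k-th position of every region.
    Tokens inside a group are spatially distant (one per region), so they are
    only weakly dependent and can be sampled in parallel.
    """
    region = grid_size // M                       # e.g. 24 // 2 = 12
    groups = []
    for i in range(region):                       # relative row inside a region
        for j in range(region):                   # relative column inside a region
            group = []
            for ri in range(M):                   # region index (vertical)
                for rj in range(M):               # region index (horizontal)
                    group.append((ri * region + i, rj * region + j))
            groups.append(group)
    return groups

groups = non_local_order(24, M=2)
print(len(groups), len(groups[0]))  # 144 groups of 4 positions each (576 tokens total)
```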

2. Stage 1: Sequential Initialization (The Skeleton)

We can’t just start generating parallel tokens immediately. If we generated the top-left corner and the bottom-right corner simultaneously from scratch, they might disagree on the global subject (e.g., the top might think “Dog” while the bottom thinks “Cat”).

To solve this, PAR generates the initial token of each region sequentially.

Illustration of the Non-local parallel generation process.

Looking at Figure 3 (Stage 1): The model generates tokens 1, 2, 3, and 4 one by one. These “anchor” tokens establish the global context. Because there are very few regions (e.g., 4 or 16), this step is fast but crucial for global coherence.

3. Stage 2: Parallel Cross-Region Generation

Once the anchors are set, the model switches to parallel mode. It identifies the next position in Region 1, Region 2, Region 3, and Region 4, and predicts them all simultaneously.

In Figure 3 (Stage 2), you can see the model generating the group labeled 5a, 5b, 5c, 5d at the same time. These tokens are spatially distant from each other, so their weak dependency allows for independent sampling without breaking the image structure. It then moves to 6a-6d, and so on.

By predicting 4 tokens at once, the model cuts the number of inference steps by a factor of roughly 4.
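Putting the two stages together, a hedged sketch of the decoding loop over the groups produced by a helper like `non_local_order` above might look like this. `model` is again a placeholder that returns one set of logits per input position, and the sketch omits the learnable transition tokens described in the next section as well as any KV caching; treating the last \(M^2\) output positions as the next group's predictions is a simplification:

```python
import torch

@torch.no_grad()
def par_generate(model, cond_token, groups, top_k=600):
    """Two-stage PAR-style decoding.

    Stage 1: the first token of each region is sampled one by one (M*M steps).
    Stage 2: each remaining group of M*M distant tokens is sampled in one step.
    For a 24x24 grid with M=2 this is 4 + 143 = 147 forward passes instead of 576.
    """
    def sample(logits):
        vals, idx = torch.topk(logits, top_k)
        return idx[torch.multinomial(torch.softmax(vals, dim=-1), 1)]

    grid = {}                                      # (row, col) -> token id
    seq = [cond_token]
    for pos in groups[0]:                          # Stage 1: sequential anchor tokens
        logits = model(torch.tensor([seq]))[0, -1]
        tok = int(sample(logits))
        grid[pos] = tok
        seq.append(tok)
    for group in groups[1:]:                       # Stage 2: one step per group
        # Simplification: read the group's predictions off the last M*M positions.
        logits = model(torch.tensor([seq]))[0, -len(group):]
        toks = [int(sample(l)) for l in logits]
        for pos, tok in zip(group, toks):
            grid[pos] = tok
        seq.extend(toks)
    return grid
```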

Model Architecture Implementation

How does a standard Transformer handle this? The researchers implement this using a cleverly designed sequence input and attention masking.

Overview of the PAR model implementation and attention masking.

Figure 4 details the implementation:

  1. Input Sequence: The sequence starts with the sequentially generated initial tokens.
  2. Learnable Transitions: Special “M” tokens (M1, M2…) are inserted to help the model transition its internal state from sequential mode to parallel mode.
  3. Parallel Groups: The rest of the tokens are fed in groups (e.g., [5a, 5b, 5c, 5d]).

The Attention Trick: In standard Autoregressive models, a token can only attend to previous tokens (Causal Attention). In PAR, the researchers use Group-wise Bi-directional Attention.

  • Between Groups: Standard causal attention applies (Group 6 can see Group 5, but Group 5 cannot see Group 6).
  • Within Groups: Tokens in the same group attend to each other bi-directionally. When group 5 serves as input for predicting group 6, each position can see all of 5a–5d rather than only the tokens that happen to precede it in the flattened sequence. This ensures that every parallel prediction has the maximum possible context from the previous step.
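A hedged sketch of building such a mask (True = may attend), following the description above rather than the paper's released code. The `n_init` prefix covers the sequentially generated tokens (and any transition tokens), each treated as its own group of size 1:

```python
import torch

def groupwise_attention_mask(n_init, n_groups, group_size):
    """Causal attention between groups, bi-directional attention within a group.

    Returns a boolean (L, L) mask where mask[q, k] = True means query position q
    may attend to key position k.
    """
    # Assign every position a group index: prefix tokens each get their own group.
    group_id = list(range(n_init))
    for g in range(n_groups):
        group_id += [n_init + g] * group_size
    group_id = torch.tensor(group_id)
    # q may attend to k iff k's group is not later than q's group.
    return group_id[None, :] <= group_id[:, None]

mask = groupwise_attention_mask(n_init=4, n_groups=3, group_size=4)
print(mask.int())   # block-triangular: causal across groups, full within each group
```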

Experiments and Results

Does this theory hold up in practice? The researchers tested PAR on ImageNet (images) and UCF-101 (video).

Speed vs. Quality

The results show a massive improvement in efficiency with negligible loss in quality.

Visualization comparison of PAR vs LlamaGen showing speedup.

Figure 2 compares the baseline (LlamaGen) against PAR.

  • LlamaGen: Takes 12.41 seconds to generate an image (576 steps).
  • PAR-4x: Takes 3.46 seconds (147 steps). The quality is visually identical.
  • PAR-16x: Takes 1.31 seconds (51 steps). This is nearly a 10x speedup.
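These step counts are consistent with the grouping arithmetic: for \(n = 576\) tokens split into \(M^2\) regions, PAR needs \(M^2\) sequential initialization steps plus \(n/M^2 - 1\) parallel steps:

\[ M^2 = 4:\quad 4 + \tfrac{576}{4} - 1 = 147 \text{ steps}, \qquad M^2 = 16:\quad 16 + \tfrac{576}{16} - 1 = 51 \text{ steps}. \]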

The quantitative metrics back this up. In the table below (Figure 5/Table 2), we see the FID (Fréchet Inception Distance) scores. Lower FID is better.

Qualitative comparison and Table 2 showing FID scores.

The PAR-XXL-4x model achieves an FID of 2.35, which is almost identical to the baseline LlamaGen-XXL’s 2.34, but with a quarter of the steps. Even the aggressive PAR-16x keeps the FID at a respectable 3.02.

Visual Consistency

One might worry that generating regions in parallel would create “seams” or disjointed borders in the image. However, because the model maintains autoregressive conditioning on previous steps (and uses sequential initialization), the images remain globally coherent.

Additional image generation results of PAR-16x.

Figure 9 displays samples from the ultra-fast PAR-16x model. From the texture of the wolf’s fur to the structure of the lighthouse, the images are consistent and high-fidelity.

Video Generation

The method isn’t limited to static images. By treating video as a 3D grid of tokens (Time \(\times\) Height \(\times\) Width), PAR can speed up video generation as well.

Video generation results on UCF-101.

As shown in Figure 10, the motion in videos generated by PAR-4x and PAR-16x remains smooth. The parallelization is applied spatially (across the frame), preserving the temporal dependencies necessary for fluid motion.
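Under that description (parallel groups formed within each frame, frames traversed in temporal order), the ordering helper could be extended to video roughly as follows; this is our illustrative reading, not the paper's code, and the grid size and region count are placeholders:

```python
def video_non_local_order(n_frames, grid_size=16, M=2):
    """Illustrative extension of the non-local order to a (T, H, W) token volume.

    Parallel groups are formed from spatially distant tokens within a frame,
    while frames stay strictly sequential to preserve temporal dependencies.
    """
    region = grid_size // M
    groups = []
    for t in range(n_frames):                     # time stays sequential
        for i in range(region):                   # relative row within a region
            for j in range(region):               # relative column within a region
                groups.append([(t, ri * region + i, rj * region + j)
                               for ri in range(M) for rj in range(M)])
    return groups
```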

Why Design Choices Matter: Ablation Studies

The researchers performed ablation studies to prove that their specific design choices—specifically the sequential initialization and the non-local ordering—were necessary.

The Importance of Sequential Initialization

What happens if we skip the “Stage 1” sequential generation and try to generate the first tokens of all regions in parallel?

Ablation study on initial sequential token generation.

As the table shows, removing the sequential initialization worsens the FID score from 2.61 to 3.67. Without that initial “handshake” between regions, the global structure suffers.

Entropy Analysis of Ordering

The researchers also compared their “Non-local” ordering against a “Raster” ordering (predicting adjacent tokens in parallel).

Conditional entropy differences between parallel and sequential generation.

Figure 12 provides a fascinating look at the “difficulty” of prediction (measured by entropy increase).

  • Panel (c) - PAR Ordering: The increase in entropy (redness) when switching to parallel mode is relatively low.
  • Panel (f) - Raster Ordering: The increase in entropy is high. This confirms that predicting neighbors in parallel is mathematically much harder for the model than predicting distant tokens.

Conclusion

The “Parallelized Autoregressive Visual Generation” (PAR) paper presents a compelling solution to the latency issues plaguing autoregressive vision models. By recognizing that spatially distant tokens are only weakly dependent, the authors unlocked a way to parallelize generation without changing the model architecture or tokenizer, and without sacrificing the unified modeling capabilities of Transformers.

With speedups ranging from 3.6x to 9.5x, PAR makes high-quality autoregressive image and video generation practical for real-world applications, bridging the gap between the flexibility of AR models and the speed of non-autoregressive approaches.

This work highlights a broader lesson in AI research: sometimes the biggest gains come not from larger models or more data, but from a smarter organization of the generation process itself.