Introduction: The Great Divide in AI Art

If you have been following the explosion of AI-generated imagery over the last few years, you likely know the big names: DALL-E, Midjourney, Stable Diffusion. What you might not know is that under the hood, there is a fundamental split in how these models work.

On one side, we have Diffusion Models (like Stable Diffusion and DALL-E 2/3). These work by removing noise from a chaotic image to reveal a clear picture. On the other side, we have Auto-Regressive Models (like the original DALL-E and Google’s Parti). These treat images like language: they break an image into a sequence of “tokens” and predict them one by one, just like ChatGPT predicts the next word in a sentence.

Here is the puzzle: diffusion models have seen massive performance gains by integrating pre-trained language models. When researchers plugged powerful text encoders (like T5) into diffusion models, the models understood prompts better and generated superior images.

Naturally, you would assume the same logic applies to Auto-Regressive models. After all, if an Auto-Regressive image generator works exactly like an LLM—predicting the next token in a sequence—shouldn’t starting with a “smart” pre-trained LLM be better than starting from scratch? A pre-trained LLM already understands the world, grammar, and logic. It seems intuitive that this knowledge would transfer to generating images.

A fascinating research paper, Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation, puts this intuition to the test. The authors explored whether adapting a massive pre-trained language model could boost the performance of text-to-image generation.

The answer was a resounding, and surprising, no.

In this post, we will tear down this paper to understand why pre-trained “brains” don’t help AI “draw,” and what this tells us about the fundamental difference between the language of words and the language of pixels.

Background: The State of Text-to-Image

To understand why this negative result is so significant, we first need to look at the competitive landscape of image generation.

For a long time, the superiority of one architecture over the other—Diffusion vs. Auto-Regressive (AR)—was unclear. The original DALL-E proved AR models could work. Then DALL-E 2 switched to Diffusion and raised the bar. Then Google released Parti (AR) and Imagen (Diffusion) almost simultaneously, showing comparable high-quality results.

Figure 1: A scatter plot comparing zero-shot FID scores on COCO across various AI image generation models over time. Blue dots represent Auto-Regressive models, while gray dots represent Diffusion models.

As shown in Figure 1, both families of models have been neck-and-neck in terms of FID (Fréchet Inception Distance, a metric where lower is better, measuring how similar generated images are to real images).

However, a key distinction emerged in how they achieved these results. Diffusion models aggressively utilized pre-trained text encoders. The smarter the text model, the better the image model. Conversely, AR models like Parti generally trained their image generation components from scratch. While Parti used a text encoder (BERT) to initialize part of the model, it didn’t fully leverage the decoder-only architecture of modern GPT-style LLMs.

This left a gaping hole in the research: Can we take a decoder-only LLM (like a 1-billion parameter GPT), which is already a master of auto-regressive generation, and teach it to speak “image”?

How Auto-Regressive Image Generation Works

Before we look at the experiment, we must understand how an image can be treated like text. You cannot simply feed a grid of pixels into a Transformer; the data is too dense.

To solve this, researchers use Image Tokenizers (such as VQ-VAE or VQ-GAN). These tools act like translators. They take a square image (e.g., 256x256 pixels) and compress it into a smaller grid of discrete numbers (tokens).

Imagine breaking a mosaic into tiles. Each tile is assigned a number based on its visual pattern. Suddenly, an image isn’t a matrix of pixels; it is a sequence of integers, like [113, 154, 1334...].

Once an image is a sequence of numbers, it looks mathematically identical to a sentence of text. This allows us to use the standard Transformer architecture—the same one used for everything from translation to chatbots—to generate images.
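To make that concrete, here is a toy PyTorch sketch of the idea. It is not the real VQ-GAN/MoVQGAN pipeline (which uses a learned CNN encoder and a trained codebook); it just shows how a grid of patches can be turned into a flat sequence of integers by nearest-neighbor lookup against a codebook.

```python
import torch

# Toy illustration of VQ-style image tokenization (not the real VQ-GAN/MoVQGAN).
# A "codebook" holds K prototype vectors; each image patch is replaced by the
# index of its nearest codebook entry, turning the image into integers.

K, D = 16384, 64                      # codebook size, embedding dim (assumed values)
codebook = torch.randn(K, D)          # in a real tokenizer these are learned weights

image = torch.randn(3, 256, 256)      # a fake 256x256 RGB image
patches = image.unfold(1, 8, 8).unfold(2, 8, 8)                 # 8x8 patches -> (3, 32, 32, 8, 8)
patches = patches.permute(1, 2, 0, 3, 4).reshape(32 * 32, -1)   # (1024, 192)

# A real tokenizer uses a CNN encoder; here we fake the projection to D dims.
patch_embeddings = patches @ torch.randn(patches.shape[1], D)

# Nearest-neighbor lookup: each patch becomes one integer token.
distances = torch.cdist(patch_embeddings, codebook)    # (1024, K)
image_tokens = distances.argmin(dim=-1)                # (1024,) ints in [0, K)

print(image_tokens[:8])   # e.g. tensor([ 113,  154, 1334, ...]) -- a "sentence" of 1,024 tokens
```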

The Core Method: Adapting the LLM

The researchers set out to test the “transfer learning” hypothesis. They constructed an experiment to compare two models that were identical in architecture but different in initialization:

  1. The Pre-trained Model: Initialized with weights from a powerful 1-billion-parameter language model (trained on 1.6 trillion text tokens).
  2. The Baseline: A model with the exact same architecture, but initialized with random weights (learning from scratch).

The Architecture Setup

The adaptation process is illustrated in Figure 2 below.

Figure 2: Diagram illustrating the adaptation process. Left: An image is tokenized into a grid. Right: The Language Model architecture showing the embedding and output layers.

Here is the step-by-step workflow shown in the figure:

  1. Input: An image (e.g., a cactus in a desert) is fed into the Image Tokenizer (specifically SBER-MoVQGAN).
  2. Tokenization: The tokenizer converts the image into a sequence of 1,024 discrete tokens.
  3. Language Model: This sequence, combined with the text caption, is fed into the Transformer.
  4. Output: The model predicts the sequence, and a “De-tokenizer” reconstructs the final image.
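Here is a minimal sketch of what that training sequence looks like, with placeholder token IDs and a drastically shrunken stand-in for the Transformer: the caption tokens and the 1,024 image tokens are concatenated into one flat sequence, and the network is trained with ordinary next-token prediction over all of it.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch: caption tokens and image tokens become one flat sequence,
# and the model is trained with ordinary next-token prediction on the whole thing.
# Token IDs and the tiny "model" below are placeholders, not the paper's setup.

text_tokens  = torch.tensor([31, 7, 912, 44, 5])          # "a cactus in a desert" (fake IDs)
image_tokens = torch.randint(0, 16384, (1024,)) + 50_000  # image IDs live above the text vocab

sequence = torch.cat([text_tokens, image_tokens])          # (1029,)
inputs, targets = sequence[:-1], sequence[1:]              # shift by one for next-token loss

vocab_size, d_model = 66_384, 128
embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)

hidden = embed(inputs)                 # a real model would run Transformer blocks here
logits = lm_head(hidden)               # (1028, vocab_size)
loss = F.cross_entropy(logits, targets)
print(loss.item())
```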

The Engineering Challenge

You can’t just shove image tokens into an LLM immediately. The LLM’s “dictionary” (vocabulary) only contains words. It doesn’t know what “Image Token #405” is.

To fix this, the authors expanded the Embedding Layer (input) and the Output Layer (prediction).

  • Original LLM: Vocabulary size ~50,000 (text words).
  • New Model: Vocabulary size ~66,000 (text words + 16,384 image tokens).

The weights for the text part were copied from the pre-trained model. The weights for the new image tokens were initialized randomly (or via a contrastive alignment technique, which we will discuss later). The massive “middle” of the model—the Transformer blocks, attention heads, and feed-forward networks—retained all their pre-trained knowledge.
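Here is a rough sketch of that surgery, assuming the vocabulary sizes quoted above and a placeholder hidden dimension; the exact copy mechanics in the paper may differ.

```python
import torch

# Sketch of the vocabulary expansion: keep the pre-trained rows for text tokens,
# append freshly initialized rows for the 16,384 new image tokens.

old_vocab, new_image_tokens, d_model = 50_000, 16_384, 2_048   # d_model is an assumed placeholder
new_vocab = old_vocab + new_image_tokens                        # ~66k entries

pretrained_embedding = torch.nn.Embedding(old_vocab, d_model)   # stands in for the LLM's weights

expanded_embedding = torch.nn.Embedding(new_vocab, d_model)
with torch.no_grad():
    # Text rows: copied verbatim from the pre-trained model.
    expanded_embedding.weight[:old_vocab] = pretrained_embedding.weight
    # Image rows: random init (the paper also tries a contrastive alignment variant).
    torch.nn.init.normal_(expanded_embedding.weight[old_vocab:], std=0.02)

# The output (prediction) layer is expanded the same way; the Transformer blocks
# in between keep all of their pre-trained weights unchanged.
```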

The models were then fine-tuned on the HQITP dataset, a massive collection of 134 million high-quality image-caption pairs.

Experiments & Results: The Great Disappointment

If the intuition about pre-training holds true, the pre-trained model should learn faster, reach a lower loss (better accuracy), and generate better images than the random model.

Let’s look at the training curves.

1. The Loss Curves

The primary metric used here is Perplexity (the exponential of the loss). In simple terms, perplexity measures how “surprised” the model is by the next token. A lower number means the model is better at predicting the image.
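For reference, perplexity is simply the exponentiated average per-token loss:

```python
import math

# Perplexity is the exponentiated average next-token cross-entropy (in nats).
mean_loss = 2.3               # example value of the average loss per token
perplexity = math.exp(mean_loss)
print(round(perplexity, 2))   # ~9.97: the model is roughly "choosing among ~10 tokens"
```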

Figure 3: Line graph displaying perplexity vs. training tokens. The curves for "Pretrained Initialized" and "Randomly Initialized" overlap almost perfectly.

Figure 3 tells the whole story. The blue lines (random initialization) and the red/orange lines (pre-trained initialization) track each other almost perfectly.

  • Does pre-training help convergence? No. The pre-trained model doesn’t learn any faster.
  • Does pre-training lower the final loss? No. After training on 100 billion tokens, both models end up at the same performance level.

This is a shocking result in the world of Deep Learning, where pre-training is usually the “secret sauce” for high performance.

2. Catastrophic Forgetting

It actually gets worse. Not only did pre-training fail to help image generation, but the fine-tuning process also destroyed the model’s original language capabilities.

The researchers tested the pre-trained model on standard text tasks (like translation or answering questions) after it had been trained on images for a while.

Table 1: Examples of forgetting. The model fails to complete simple sentences or translate words after training on 5B image tokens.

As Table 1 shows, after training on just 5 billion tokens (a fraction of the total training), the model becomes incoherent.

  • Original: Properly explains the theory of relativity.
  • After Training: “Simplify puts, the theory of relativity states that iles must be able to see the invisible.”
  • Translation Task: It forgets how to translate “cheese” into French, instead outputting “I love cheese.”

This phenomenon is known as Catastrophic Forgetting. The model is overwhelmed by the new image data and overwrites its previous knowledge.

3. Analyzing the Failure: Why?

To understand why the pre-trained weights were useless, the authors broke down the loss into two components: Text Loss (predicting the caption) and Image Loss (predicting the image tokens).
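A sketch of how such a breakdown can be computed, assuming the vocabulary split described earlier (image IDs sit above the ~50k text IDs); this is the generic recipe, not necessarily the paper's exact code.

```python
import torch
import torch.nn.functional as F

# Sketch: compute one cross-entropy per position, then average text and image
# positions separately to get the two perplexities plotted in Figure 4.

vocab_size, text_vocab = 66_384, 50_000
logits  = torch.randn(1028, vocab_size)                 # fake model outputs for one sequence
targets = torch.cat([torch.randint(0, text_vocab, (4,)),               # caption tokens
                     torch.randint(text_vocab, vocab_size, (1024,))])  # image tokens

per_token_loss = F.cross_entropy(logits, targets, reduction="none")
is_image = targets >= text_vocab                         # image IDs sit above the text vocab

text_ppl  = per_token_loss[~is_image].mean().exp()
image_ppl = per_token_loss[is_image].mean().exp()
print(text_ppl.item(), image_ppl.item())
```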

Figure 4: Two line charts comparing perplexity on image tokens vs. text tokens. Image-token loss is identical for both models; text-token loss starts lower for the pre-trained model but converges.

Figure 4 provides the diagnostic:

  1. Image Tokens (Left Graph): There is zero difference between the pre-trained and random models at any point in training. The pre-trained LLM is no better at predicting “pixels” than a random neural network.
  2. Text Tokens (Right Graph): The pre-trained model starts with a huge advantage (lower perplexity), which makes sense—it already knows English. However, this advantage vanishes very quickly (within 10 billion tokens).

Why does the text advantage disappear? Because the text in image-caption datasets is incredibly simple. Captions like “A dog sitting on a bench” are far less complex than the literature, math, and code the LLM was originally trained on. The model “downgrades” its complexity to match the simple captions.

4. Unconditional Image Generation

Skeptics might argue: “Maybe the text-to-image connection is the problem. What if we just ask the model to generate images without text?”

The researchers tested this by training the model purely on image tokens (Unconditional Generation).

Figure 6: Graph showing perplexity on image tokens for unconditional generation. Random initialization performs slightly better than or equal to the pre-trained variants.

Figure 6 shows the results of training only on image tokens.

  • The Red line (Random) actually achieves a lower final perplexity than the Blue line (Pre-trained).
  • The other lines show what happens if you “freeze” parts of the pre-trained model (like the layers or the feed-forward networks) to try and preserve the pre-trained knowledge. The performance gets significantly worse.

This confirms a major hypothesis: The optimal weights for modeling language are fundamentally different from the optimal weights for modeling images. By forcing the model to keep its “language brain,” you actually hinder its ability to learn “image logic.”
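For intuition, “freezing” here means something like the sketch below: the pre-trained Transformer blocks are excluded from optimization, so only the embeddings and output head adapt to image tokens. The module layout is a stand-in, not the paper's actual 1B-parameter model.

```python
import torch

# Sketch of the "freeze" ablation: keep the pre-trained Transformer blocks fixed
# (requires_grad=False) and train only the embeddings / output layer on image tokens.

d_model, n_heads, n_layers, vocab = 512, 8, 6, 66_384   # placeholder sizes

model = torch.nn.ModuleDict({
    "embed":  torch.nn.Embedding(vocab, d_model),
    "blocks": torch.nn.ModuleList(
        torch.nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        for _ in range(n_layers)
    ),
    "head":   torch.nn.Linear(d_model, vocab),
})

for p in model["blocks"].parameters():   # preserve the "language brain"...
    p.requires_grad = False              # ...which, per Figure 6, hurts image modeling

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```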

The Root Cause: Apples and Oranges

The paper offers a profound explanation for these results, centering on the nature of Tokens.

In an LLM, a token (like the word “Apple”) has a rich semantic meaning. It connects to concepts of fruit, technology, gravity, pie, and the color red.

In an Image Tokenizer (VQ-GAN), a token is just a visual patch—perhaps a curve, a texture, or a specific shade of blue. It has no standalone semantic meaning. Token #405 doesn’t mean “eye”; it just means “a small dark circle.”

The Alignment Failure

The authors tried to force the image tokens to act like text tokens using Contrastive Alignment: the new image-token embeddings were pulled toward text embeddings with related meanings (e.g., nudging the token for a “furry texture” toward the word embedding for “fur”).
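For intuition, a contrastive alignment step generally looks something like the InfoNCE-style sketch below; the pairing strategy and temperature handling are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

# Generic contrastive (InfoNCE-style) alignment step: pull each image-token
# embedding toward the text embedding it is paired with, push it away from the rest.

d_model, batch = 512, 256
image_token_emb = torch.randn(batch, d_model, requires_grad=True)  # new, trainable rows
text_emb        = torch.randn(batch, d_model)                      # frozen pre-trained rows
log_temperature = torch.zeros((), requires_grad=True)              # learnable temperature

img = F.normalize(image_token_emb, dim=-1)
txt = F.normalize(text_emb, dim=-1)
logits = img @ txt.T * log_temperature.exp()        # (batch, batch) similarity matrix

labels = torch.arange(batch)                        # i-th image token pairs with i-th text
loss = F.cross_entropy(logits, labels)
loss.backward()
print(loss.item())   # in the paper this plateaus quickly (Figure 7): alignment fails
```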

Figure 7: Graphs showing contrastive loss and temperature. The loss plateaus quickly, indicating a failure to align.

Figure 7 shows that this failed. The loss plateaued almost immediately. When they analyzed the results, they found that image tokens aligned with “noisy, semantically void text tokens.”

Because image tokens don’t carry high-level meaning on their own, a language model (which operates on high-level meanings) cannot leverage its prior knowledge to process them. The grammar of language is about the flow of ideas; the “grammar” of images is about the spatial relationship of textures and edges. They are, quite literally, different languages.

Conclusion: Implications for Future Research

Despite the negative result regarding pre-training, the study confirmed that auto-regressive models can generate high-quality images—they just have to learn to do it from scratch.

Figure 5: Examples of generated images showing a kitten, a bridge, a squirrel, and a jockey.

As seen in Figure 5, the final model (trained from scratch or pre-trained—it doesn’t matter) is capable of producing photorealistic results with a decent FID score of 12.21 on MS-COCO.

Key Takeaways

  1. Transfer Learning isn’t Magic: We cannot assume that a model smart in one domain (text) will automatically be smart in another (images), even if we format the data to look the same.
  2. The Tokenizer is the Bottleneck: The current generation of image tokenizers (VQ-GAN) creates “visual words” that lack semantic meaning. This prevents LLMs from using their reasoning capabilities.
  3. Data Ratio Matters: In image-caption training, image tokens outnumber text tokens 30 to 1. This imbalance forces the model to dedicate almost all its capacity to visual structure, leading to the catastrophic forgetting of text knowledge.
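A quick back-of-the-envelope check of that ratio, assuming an average caption length of roughly 34 tokens (an assumed ballpark figure, not a number from the paper):

```python
# Rough sanity check of the 30:1 claim.
image_tokens_per_example = 1024     # 1,024 tokens per 256x256 image, as described above
text_tokens_per_example  = 34       # assumed average caption length
print(image_tokens_per_example / text_tokens_per_example)   # ~30.1
```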

The Path Forward

The authors suggest that if we want to unlock the power of LLMs for image generation, we need better Image Tokenizers. We need tokenizers that map images to semantically meaningful units (e.g., a token that actually represents “eye” or “sky”) rather than just pixel patches.

Until then, if you are building an auto-regressive image generator, you might as well initialize your weights randomly. Your fancy LLM isn’t helping.