Introduction
If you have experimented with text-to-image diffusion models like Stable Diffusion or Midjourney, you have likely encountered the “gibberish text” phenomenon. You ask for a sign that says “Welcome Home,” and the model generates a beautiful living room with a sign that reads “Wleom Hmeo.”
While diffusion models have mastered lighting, texture, and composition, they notoriously struggle with visual text generation. The letters are often distorted, words are misspelled, or the text is ignored entirely. While commercial models like DALL-E 3 are improving, open-source backbone models still lag behind, particularly when it comes to languages other than English, such as Chinese.
Why is this task so hard for AI? Is it a problem with how the model “sees” the text, or how it “draws” it?
In the paper “Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training,” researchers investigate these root causes and propose a comprehensive solution. They introduce a method that significantly improves the spelling and aesthetic quality of visual text without compromising the artistic quality of the image.

As shown in Figure 1, the difference is striking. Where standard backbone models produce garbled characters, the improved model renders crisp, legible text—even for complex artistic styles and Chinese characters.
The Root of the Problem: A Preliminary Study
Before fixing the problem, the researchers first had to diagnose it. They identified two primary culprits restricting the performance of current backbone models (specifically analyzing SD-XL): Tokenization and Cross-Attention.
1. The Tokenization Trap
Most Large Language Models (LLMs) and text encoders use Byte Pair Encoding (BPE). This method breaks words down into subword units (tokens) to manage vocabulary size.
For example, the word “diffusion” might be split into “dif” and “fusion.” While this works for semantic understanding in NLP, it is detrimental for visual generation. When a model tries to draw the word “diffusion,” it needs to know the visual structure of the whole word. If the input is chopped into “dif” and “fusion,” the model struggles to combine these disjointed concepts into a single, cohesive visual word.
The researchers tested this by comparing the generation accuracy of words that BPE splits versus words it keeps whole. They found that BPE tokenization significantly increases the difficulty of generating visual texts.
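To see this splitting in practice, here is a minimal sketch using the Hugging Face CLIP tokenizer (the text-encoder family used by SD-XL). Which words stay whole depends entirely on the vocabulary, so treat the printed result as illustrative rather than as the paper's exact word lists:

```python
from transformers import CLIPTokenizer

# Tokenizer of the CLIP text encoder family used by Stable Diffusion models.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

for word in ["welcome", "university", "littering", "hummingbird"]:
    pieces = tokenizer.tokenize(word)
    status = "kept whole" if len(pieces) == 1 else f"split into {pieces}"
    print(f"{word!r}: {status}")
```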
2. The Wandering Eye (Cross-Attention)
In diffusion models, cross-attention maps determine which part of the image correlates with which word in the prompt. If you type “a red apple,” the model should attend to the apple’s location when processing the token “apple.”
The researchers visualized the cross-attention maps for text that needs to be written (glyph text).

As Figure 3 illustrates:
- Case (a): When “University” is spelled correctly, the model’s attention (the heat map) is tightly focused on the signboard area.
- Case (b): When “University” is misspelled, the attention is scattered or weak.
- Case (c): Here, “Heart” is rendered correctly because the attention is focused. However, “Flower” is ignored because the attention map highlights an irrelevant region.
Conclusion: To generate accurate text, the model must effectively bind the text tokens to the specific pixels where that text should appear. Current models often fail to make this connection.
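These heat maps come from the UNet's cross-attention layers. As a self-contained sketch of what such a map is (the tensor names and shapes here are illustrative, not the paper's code), a single-head map over flattened image positions can be computed as follows:

```python
import torch.nn.functional as F

def cross_attention_map(image_feats, token_embeds, w_q, w_k):
    """Toy single-head cross-attention map.

    image_feats:  (num_pixels, d_img)  flattened spatial features from the UNet
    token_embeds: (num_tokens, d_txt)  embeddings of the prompt tokens
    w_q, w_k:     projections mapping both inputs to a shared dimension d
    Returns a (num_pixels, num_tokens) map: how strongly each pixel attends to each token.
    """
    q = image_feats @ w_q                     # (num_pixels, d)
    k = token_embeds @ w_k                    # (num_tokens, d)
    scores = q @ k.T / (q.shape[-1] ** 0.5)   # scaled dot-product scores
    return F.softmax(scores, dim=-1)          # column j, reshaped to (H, W), is token j's heat map
```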
The Solution: A Two-Pronged Approach
Based on these findings, the authors propose a framework that addresses both the input representation and the training process.

Part 1: Mixed Granularity Input
Since BPE slicing harms visual text generation, the researchers introduced a Mixed Granularity Input strategy. The core idea is simple but effective: treat the words that need to be drawn (glyph words) as whole units, rather than breaking them down.

As shown in Figure 5, instead of being fed the tokens “dif” and “fusion” separately, the model sees “diffusion” as a single entity.
But how do we get an embedding for a whole word that isn’t in the standard vocabulary? The authors utilize an OCR (Optical Character Recognition) model. They render the glyph word into a simple image and feed it into an OCR encoder to extract a feature vector. This vector inherently contains rich information about the shape and structure of the word.
The refined text embedding \(c\) is calculated by combining the standard text encoder output with these new OCR features:

Here, \(T\) is the CLIP text encoder, while the second term represents the features extracted from the rendered glyph image \(I_g\). This gives the diffusion model a “visual” understanding of the text before it even starts generating the image.
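A minimal sketch of this pipeline is shown below. The `ocr_encoder` and `proj` modules are hypothetical stand-ins for the pretrained OCR encoder and a projection that aligns its features with the text-embedding space, and appending the glyph features to the prompt embeddings is an assumption about how the two are combined:

```python
import torch
from PIL import Image, ImageDraw, ImageFont

def render_glyph(word, size=(256, 64)):
    """Render a glyph word as a plain black-on-white image."""
    canvas = Image.new("RGB", size, "white")
    ImageDraw.Draw(canvas).text((8, 8), word, fill="black", font=ImageFont.load_default())
    return canvas

def refined_text_embedding(prompt, glyph_words, clip_encoder, ocr_encoder, proj):
    """Combine CLIP token embeddings with whole-word glyph features.

    clip_encoder: maps the prompt to token embeddings of shape (seq_len, d_text)
    ocr_encoder:  hypothetical module mapping a rendered glyph image to a feature vector
    proj:         hypothetical linear layer aligning OCR features with d_text
    """
    text_embeds = clip_encoder(prompt)                                  # (seq_len, d_text)
    glyph_feats = [proj(ocr_encoder(render_glyph(w))) for w in glyph_words]
    return torch.cat([text_embeds, torch.stack(glyph_feats)], dim=0)    # append whole-word features
```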
Part 2: Glyph-Aware Training
Improving the input is not enough; the model also needs to be taught how to use this information. The authors augment the standard diffusion training objective with three specific “Glyph-Aware” loss functions.
The total loss function is defined as:

\[
\mathcal{L} = \mathcal{L}_{mse} + \lambda_{attn}\,\mathcal{L}_{attn} + \lambda_{loc}\,\mathcal{L}_{loc} + \lambda_{ocr}\,\mathcal{L}_{ocr}
\]

where the \(\lambda\) coefficients balance the glyph-aware terms against the standard objective.
Let’s break down the three specific components added to the standard MSE loss (\(\mathcal{L}_{mse}\)).
1. Attention Alignment Loss (\(\mathcal{L}_{attn}\))
Recall that poor cross-attention leads to misspelled words. This loss function forces the model to focus its attention on the correct regions.
The model computes the cross-attention map between the noisy latent image (\(z_t\)) and the glyph token (\(c_g\)):

\[
CA = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right), \qquad Q = W_Q\, z_t, \quad K = W_K\, c_g
\]

where \(W_Q\) and \(W_K\) are the layer’s query and key projections and \(d\) is their dimension.
The system effectively tells the model: “When you are thinking about the word ‘University’, your attention map must match the actual mask of where the text is located.”

By minimizing the difference between the attention map (\(CA\)) and the ground truth segmentation mask (\(M_k\)), the model learns to bind visual text to the correct image coordinates.
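One simple way to realize this objective, assuming a squared-error penalty (the paper's exact formulation, e.g. its normalization, may differ), is:

```python
def attention_alignment_loss(attn_maps, masks):
    """attn_maps: (K, H, W) cross-attention maps, one per glyph token
    masks:     (K, H, W) ground-truth masks of where each text should appear (M_k)
    Penalizes attention that falls outside the annotated text region.
    """
    return ((attn_maps - masks) ** 2).mean()
```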
2. Local MSE Loss (\(\mathcal{L}_{loc}\))
Standard diffusion models use a global Mean Squared Error (MSE) loss, which treats every pixel in the image equally. However, text usually occupies a small portion of the image. The model might generate a perfect background but a messy sign, and the global loss wouldn’t penalize it heavily enough.
To fix this, the authors introduce a Local MSE Loss that applies a higher weight to the specific regions where text is located.

Here, \(M_k\) acts as a filter, ensuring this loss only counts errors within the text bounding box. This forces the model to prioritize the fine-grained details of the letters.
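The sketch below shows one way to implement such a masked objective; applying the mask to the noise-prediction error, and the particular normalization, are assumptions rather than the paper's exact formula:

```python
def local_mse_loss(noise_pred, noise_true, text_mask, eps=1e-6):
    """noise_pred, noise_true: (B, C, H, W) predicted vs. actual noise
    text_mask:              (B, 1, H, W) binary mask over text regions (M_k)
    Counts errors only inside the text regions, so small text areas are not
    drowned out by the much larger background.
    """
    sq_err = (noise_pred - noise_true) ** 2
    masked = sq_err * text_mask                      # zero out non-text pixels
    return masked.sum() / (text_mask.sum() * noise_pred.shape[1] + eps)
```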
3. OCR Recognition Loss (\(\mathcal{L}_{ocr}\))
Finally, to ensure the generated text is actually readable, the training process includes an OCR check.
During training, the model predicts a denoised image (\(x'_0\)). This prediction is fed into a frozen OCR model (separate from the input OCR encoder). The OCR model tries to read the text in the generated image. If the text is illegible or incorrect, the loss increases.

This acts as a high-level semantic check, encouraging the model to generate text that is not just visually sharp (Local MSE) but also linguistically correct.
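A rough sketch of this check, with a hypothetical frozen `ocr_model` that returns per-character logits (the real recognizer and its exact loss formulation are not reproduced here), might be:

```python
import torch.nn.functional as F

def ocr_recognition_loss(x0_pred, text_boxes, target_char_ids, ocr_model):
    """x0_pred:         (C, H, W) denoised image predicted at this training step
    text_boxes:      list of (x1, y1, x2, y2) crops, one per glyph word
    target_char_ids: list of 1-D LongTensors with ground-truth character indices
    ocr_model:       frozen recognizer mapping an image crop to (L, vocab) logits;
                     its weights do not update, but gradients still flow through
                     it back into x0_pred.
    """
    loss = x0_pred.new_zeros(())
    for (x1, y1, x2, y2), targets in zip(text_boxes, target_char_ids):
        crop = x0_pred[:, y1:y2, x1:x2].unsqueeze(0)    # (1, C, h, w) text region
        logits = ocr_model(crop).squeeze(0)             # (L, vocab) per-character logits
        loss = loss + F.cross_entropy(logits, targets)  # penalize unreadable or wrong text
    return loss / max(len(text_boxes), 1)
```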
Experiments and Results
To validate these methods, the authors constructed a high-quality dataset of 240,000 English image-caption pairs and a synthetic dataset for Chinese text. They compared their method against leading backbone models like SD-XL, DeepFloyd, and SDXL-Turbo.
Quantitative Analysis
The results, summarized in Table 1 below, show that the proposed method (Ours) significantly outperforms baselines in OCR metrics (Precision, Recall, F1 Score).

- CLIP Score: Measures how well the image matches the prompt.
- OCR Metrics: Measure spelling accuracy via precision, recall, and F1 over the recognized text (see the sketch after this list).
- User Study: Human raters overwhelmingly preferred the proposed model for text accuracy and aesthetics.
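As a rough illustration of what these OCR metrics measure (the paper's exact matching protocol, e.g. case handling, is an assumption here), word-level precision, recall, and F1 can be computed like this:

```python
from collections import Counter

def ocr_prf(detected_words, target_words):
    """Word-level precision/recall/F1 between OCR output and the prompt's glyph text."""
    detected = Counter(w.lower() for w in detected_words)
    target = Counter(w.lower() for w in target_words)
    matched = sum((detected & target).values())      # words found in both, with multiplicity
    precision = matched / max(sum(detected.values()), 1)
    recall = matched / max(sum(target.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1
```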
Visual Comparison
The qualitative results are perhaps the most compelling. In Figure 6, we see a comparison across various challenging prompts.

- Row 1 (Car): The prompt asks for “Speed” written on a car. While SD-XL puts the text in the background or distorts it, the proposed model places it correctly on the vehicle.
- Row 3 (Bottles): “Do not litter” on bottles. DeepFloyd and SD-Cascade struggle with the bottle geometry or spelling. The proposed model wraps the text naturally around the object.
Addressing Common Failures
Even powerful models like SDXL-Turbo suffer from specific issues like word repetition.

In Figure 7 (bottom), SDXL-Turbo generates “Safeety Fircort Safey Fist” instead of “Safety First” and “No Littic Literng” instead of “No Littering.” The proposed model (top) solves these repetition and misspelling issues, thanks largely to the specific input granularity control that prevents the model from seeing “Safety” as multiple sub-tokens.
Chinese Text Generation
One of the most impressive achievements of this work is its transferability to Chinese, a language with complex glyph structures that most diffusion models fail to render.

As shown in Figure 9, baseline models (trained on the same data) often produce pseudo-Chinese characters that look like gibberish strokes. The proposed model generates accurate, legible Chinese characters (“中国”, “地图”) integrated into complex scenes.
Note: For Chinese, the authors found that a mixture of character-level and BPE tokenization worked better than whole-word tokenization, due to the complexity and quantity of Chinese characters.
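A minimal sketch of that idea (purely illustrative; the paper's exact rule for mixing character-level units with BPE is not reproduced here) could treat a Chinese glyph word as a sequence of characters while keeping an English glyph word whole:

```python
def split_glyph_text(glyph_text):
    """Character-level units for CJK text, whole word otherwise (illustrative rule only)."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in glyph_text):   # CJK Unified Ideographs
        return list(glyph_text)        # e.g. "中国" -> ["中", "国"]
    return [glyph_text]                # English glyph word kept as one unit
```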
Preserving Image Quality
A common fear when fine-tuning models for specific tasks is “catastrophic forgetting”—losing the original capabilities of the model. Does learning to spell make the model worse at drawing sunsets?

Figure 8 confirms that the aesthetic quality remains high. Whether it’s a watercolor Mario or a sunset beach, the model retains its artistic capabilities. The FID scores (a metric for image quality) remained comparable to the base models.
Conclusion and Implications
The paper “Empowering Backbone Models for Visual Text Generation” provides a clear roadmap for fixing one of generative AI’s most persistent flaws.
The key takeaways are:
- Granularity Matters: Breaking words into BPE sub-tokens confuses the visual generator. Feeding whole-word (glyph) information via OCR embeddings helps the model “see” the word structure.
- Attention is Key: You cannot spell what you cannot focus on. Forcing the cross-attention maps to align with text regions is critical.
- Specialized Losses: Standard pixel-level loss isn’t enough. We need local focus on text regions and high-level OCR verification during training.
This work not only improves English text generation but opens the door for reliable multi-lingual visual text generation, a capability that will be essential for the next generation of AI design tools. By teaching the model to “read” what it draws, we are moving one step closer to truly comprehensive image synthesis.