The world of text-to-image generation has moved beyond simply typing “a cat” and getting a generic cat. Today, users want their cat—specifically, the fluffy tabby sitting on their couch right now. This is known as customized generation.
While teaching an AI to recognize a single specific subject (like your pet or a specific toy) has become standard practice with tools like DreamBooth or LoRA, a significant bottleneck remains: scalability. What if you want to generate an image of your specific dog, playing with your specific cat, sitting on your specific sofa, in front of your specific house?
Most existing models collapse under this pressure. They either blend the concepts together (a dog with cat ears), forget one of the subjects entirely, or take an eternity to compute.
Enter LATEXBLEND, a new framework proposed by researchers at Nanjing University of Science and Technology. This paper introduces a method to efficiently “blend” multiple customized concepts into a single generation without retraining the model for every new combination.

The Problem: Why is Multi-Concept Generation Hard?
To understand the solution, we must first understand why current methods struggle.
1. The Computational Cost
Existing approaches often require “joint training.” If you want to generate a scene with Alice and Bob, you have to gather images of both and fine-tune the model on both simultaneously. If you want to add Charlie later, you have to start over, which leads to a combinatorial explosion in training cost. Other methods try to merge separately trained models (like merging two LoRAs), but this often degrades the quality of both.
2. Denoising Deviation and Layout Collapse
A more subtle but critical issue identified by the researchers is denoising deviation.
When you fine-tune a model on a specific subject (e.g., a dog), you typically use only 3–5 images. These reference images usually have the subject front-and-center. The model “overfits” not just to the dog’s face, but to the layout of those reference images.
When you try to generate that customized dog in a new context, the model fights against the new prompt because it “remembers” the original layout too strongly. As shown below, this deviation gets worse as you add more concepts. The model drifts away from the natural structure the pre-trained model would have generated, resulting in rigid, “memorized” compositions.

The Solution: The Latent Textual Space
The core innovation of LATEXBLEND is where it injects the customization.
In a standard Latent Diffusion Model (LDM), a text prompt is processed in two stages before it controls the image generation:
- Text Encoder: Converts words into embeddings.
- Linear Projection: Projects those embeddings into “Keys” and “Values” (K and V) for the Cross-Attention layers.
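To make these two stages concrete, here is a minimal PyTorch-style sketch of how a single cross-attention layer consumes the prompt. The class name and dimensions are illustrative, not taken from any particular codebase.

```python
import torch
import torch.nn as nn

class PromptConditioning(nn.Module):
    """Minimal sketch: text embeddings -> Keys/Values for one cross-attention layer."""

    def __init__(self, text_dim: int = 768, attn_dim: int = 320):
        super().__init__()
        self.to_k = nn.Linear(text_dim, attn_dim, bias=False)  # Key projection
        self.to_v = nn.Linear(text_dim, attn_dim, bias=False)  # Value projection

    def forward(self, text_embeddings: torch.Tensor):
        # text_embeddings: (batch, num_tokens, text_dim) from the text encoder.
        # Each token is projected independently; these per-token K/V features
        # are the space the paper calls the Latent Textual Space.
        keys = self.to_k(text_embeddings)    # (batch, num_tokens, attn_dim)
        values = self.to_v(text_embeddings)  # (batch, num_tokens, attn_dim)
        return keys, values
```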
Most previous methods modify the input words (Textual Inversion) or the deep weights of the diffusion model (DreamBooth/LoRA). LATEXBLEND targets the space after the text encoder and after the linear projection. This is what the authors call the Latent Textual Space.

The hypothesis is simple but powerful: The output of the text encoder is “entangled”—changing one word changes the context of the whole sentence. However, the latent textual features (the Keys and Values entering the attention mechanism) are more disentangled and robust. By blending concepts here, we can mix and match subjects without them interfering with each other’s processing.
Step 1: Single-Concept Fine-tuning (The Concept Bank)
First, the system creates a “Concept Bank.” Each specific subject (e.g., your dog) is tuned individually.
The challenge is to pack all the visual identity of the dog into just the latent features corresponding to the word “dog,” without leaking information into the surrounding words. To do this, the authors use a dual-flow training strategy:
- Base Flow (\(\mathscr{F}_b\)): This uses the frozen, pre-trained weights. It processes a prompt like “A photo of a [noun]” using generic tokens.
- Concept Flow (\(\mathscr{F}_c\)): This uses learnable weights. It processes “A photo of V* [noun],” where V* is a special identifier.
During training, the system forces the model to reconstruct the reference image. Crucially, it blends the Concept Flow’s target feature (the dog) into the Base Flow’s sentence structure.
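Below is a heavily simplified sketch of what one such training step might look like, assuming both flows produce per-token latent textual features and that blending is a positional replacement. Names such as `base_flow`, `concept_flow`, and `concept_positions` are hypothetical, not the paper's code.

```python
import torch
import torch.nn.functional as F

def dual_flow_step(base_flow, concept_flow, denoiser, ref_latents,
                   base_prompt, concept_prompt, concept_positions,
                   alpha_bar_t, t):
    """One simplified step: reconstruct the reference image from blended features."""
    noise = torch.randn_like(ref_latents)
    # Standard DDPM forward process: x_t = sqrt(a)*x_0 + sqrt(1-a)*noise.
    noisy_latents = alpha_bar_t.sqrt() * ref_latents + (1 - alpha_bar_t).sqrt() * noise

    with torch.no_grad():
        h_b = base_flow(base_prompt)        # frozen: "A photo of a [noun]"
    h_c = concept_flow(concept_prompt)      # learnable: "A photo of V* [noun]"

    # Blend: swap the generic noun's features for the concept's features.
    h_blend = h_b.clone()
    h_blend[:, concept_positions] = h_c[:, concept_positions]

    # The reconstruction loss forces the subject's identity into h_c alone,
    # since the rest of the sentence comes from the frozen base flow.
    pred_noise = denoiser(noisy_latents, t, h_blend)  # assumed denoiser signature
    return F.mse_loss(pred_noise, noise)
```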

The blending operation replaces the latent features of the generic noun with the learned features of the specific concept:
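In symbols, one plausible way to write this replacement (the notation here is illustrative and may differ from the paper's):

\[
\mathbf{h}[i] =
\begin{cases}
\mathbf{h}_c[i], & \text{if token } i \text{ belongs to the concept's noun,} \\
\mathbf{h}_b[i], & \text{otherwise,}
\end{cases}
\]

where \(\mathbf{h}_b\) denotes the base flow's latent textual features and \(\mathbf{h}_c\) the concept flow's.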

This ensures that the concept representation \(\mathbf{h}_c\) is self-contained. It doesn’t rely on the rest of the sentence to define the dog; it carries the “dog-ness” entirely within itself.
Step 2: Multi-Concept Inference (The Blending)
Once the concepts are stored in the bank, generating a multi-subject image is surprisingly efficient. There is no extra training.
When a user provides a prompt like “A dog playing with a robot toy”, the system:
- Runs the prompt through the standard pre-trained model to get a “Base” latent representation (\(\mathbf{h}_b\)).
- Retrieves the specific “dog” features (\(\mathbf{h}_{c1}\)) and “robot” features (\(\mathbf{h}_{c2}\)) from the bank.
- Swaps the generic features in \(\mathbf{h}_b\) with the specific ones from the bank.
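A minimal sketch of this inference-time swap, assuming the concept bank stores per-concept latent textual features and we know which token positions each concept occupies in the new prompt (all names and indices below are illustrative):

```python
import torch

def blend_latent_textual_features(h_base, concept_bank, placements):
    """Swap generic noun features in the base prompt for stored concept features.

    h_base:       (num_tokens, dim) latent textual features of the plain prompt.
    concept_bank: dict of concept name -> (num_concept_tokens, dim) tensor.
    placements:   dict of concept name -> list of token indices in the prompt.
    """
    h_blend = h_base.clone()
    for name, positions in placements.items():
        h_blend[positions] = concept_bank[name]  # plug-and-play replacement
    return h_blend

# Usage for "A dog playing with a robot toy" (token indices are illustrative):
# h_blend = blend_latent_textual_features(
#     h_base,
#     concept_bank={"dog": h_dog, "robot toy": h_robot},
#     placements={"dog": [2], "robot toy": [6, 7]},
# )
```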

Because the concepts were trained to be self-contained in the Latent Textual Space, they slot into the new sentence perfectly. This preserves the layout capabilities of the original model (the “Base”) while injecting the specific identities.

Position Invariance
A major advantage of this approach is Position Invariance. Because the concept is distilled into a compact feature vector, it can be plugged into any part of a sentence. Whether the prompt is “A cat next to a dog” or “A dog sitting far behind a cat”, the same feature vector works.
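In code terms this is just indexing: the same stored vector can be written into whichever slot the concept occupies. A toy illustration with random tensors:

```python
import torch

dim = 320
h_dog = torch.randn(dim)   # stored "dog" concept feature (toy values)

# "A cat next to a dog": suppose "dog" lands at token index 5.
h_prompt_a = torch.randn(8, dim)
h_prompt_a[5] = h_dog

# "A dog sitting far behind a cat": suppose "dog" lands at token index 2.
h_prompt_b = torch.randn(9, dim)
h_prompt_b[2] = h_dog
```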

Refining the Result: Blending Guidance
While the blending method works well, complex prompts can sometimes confuse the diffusion model’s attention mechanism (e.g., the “furry” texture of the dog might accidentally bleed onto the “robot”).
To fix this, the authors introduce Blending Guidance. This is an inference-time optimization that adjusts the noise predictions. It encourages two things:
- Alignment: The attention map of the identifier token (V*) should overlap with that of its coarse class token (e.g., “dog”).
- Separation: The attention maps of the dog should not overlap with those of the robot or other background elements.
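One way to picture these two objectives as a differentiable penalty over cross-attention maps (an illustrative formulation, not the paper's exact loss; the maps are assumed to be extracted from the denoiser at each step):

```python
import torch

def blending_guidance(attn_maps, identifier_idx, class_idx, other_indices):
    """Toy guidance term: reward V*/class overlap, penalize overlap with other subjects.

    attn_maps: (num_tokens, H, W) cross-attention maps for one denoising step,
               assumed normalized so each map sums to 1.
    """
    a_id = attn_maps[identifier_idx]   # map for V*
    a_cls = attn_maps[class_idx]       # map for the class noun, e.g. "dog"

    alignment = (a_id * a_cls).sum()                                      # want high
    separation = sum((a_id * attn_maps[j]).sum() for j in other_indices)  # want low

    # Lower is better, so its gradient can steer the denoising step.
    return separation - alignment
```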

The guidance term \(g\) modifies the standard score estimate \(\hat{\epsilon}_t\) during the denoising steps:
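A generic form of such attention-based guidance (the scale \(s\) and the exact gradient target are illustrative, not the paper's definition):

\[
\tilde{\epsilon}_t = \hat{\epsilon}_t + s \,\nabla_{\mathbf{z}_t}\, g,
\]

where \(\mathbf{z}_t\) is the noisy latent at step \(t\) and \(g\) is computed from the cross-attention maps, as sketched above.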

Experimental Results
The researchers evaluated LATEXBLEND against several competitive baselines, including Custom Diffusion, Cones 2, Mix-of-Show, and OMG.
1. Visual Quality and Fidelity
Qualitatively, LATEXBLEND shows superior ability to maintain the identity of multiple subjects simultaneously without destroying the scene’s logical structure. In the comparison below, note how baselines often lose one of the subjects or merge them into a blob. LATEXBLEND keeps the guitar, the flowers, and the lighthouse distinct and recognizable.

2. Quantitative Metrics
The authors used CLIP scores (for text alignment) and DINO scores (for subject identity fidelity). In the chart below, the x-axis represents concept alignment (fidelity) and the y-axis represents text alignment. LATEXBLEND (purple stars) consistently achieves higher fidelity (further right) while maintaining strong text alignment.
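For reference, CLIP text alignment is typically computed as a cosine similarity between image and prompt embeddings. A minimal sketch with Hugging Face `transformers` (the file name is a placeholder, and DINO scoring is only indicated in a comment):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")        # placeholder file name
prompt = "A dog playing with a robot toy"

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
clip_score = (img * txt).sum(dim=-1).item()  # cosine similarity

# DINO fidelity works analogously: compare DINO ViT features of the generated
# image against features of the reference images of the subject.
```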

3. Efficiency
Perhaps the most impactful result for students and developers is the computational cost. Methods like “Mix-of-Show” or joint training see costs skyrocket as you add more concepts (N=2, N=3, etc.).
LATEXBLEND scales linearly during fine-tuning (each concept is trained once) and adds essentially no cost at inference time compared to standard generation; only the optional blending guidance adds overhead.

Conclusion and Implications
The LATEXBLEND paper presents a smart architectural shift. Instead of brute-forcing multi-concept generation by retraining massive weights or merging separately trained models, it identifies the Latent Textual Space as the optimal “mixing desk” for generative AI.
Key Takeaways:
- Disentanglement: By isolating concept features in the latent space, we prevent them from corrupting the layout of the image.
- Modularity: The “Concept Bank” approach allows for true plug-and-play generation. You can train a “dog” today, a “hat” tomorrow, and combine them next week without retraining.
- Efficiency: This method makes personalized generation accessible without requiring a supercomputer for every new combination of items.
For students interested in diffusion models, this paper highlights the importance of understanding the internal representations of these networks (specifically the Cross-Attention inputs) rather than just treating the model as a black box to be fine-tuned.