Introduction

If you have ever played with text-to-image (T2I) models like Stable Diffusion, you are likely familiar with the frustration of “prompt engineering.” You type a beautiful description, only to get an image with distorted faces, extra fingers, or a gloomy color palette. To fix this, the community developed a workaround: Negative Prompts.

By typing words like “ugly, bad anatomy, low quality, blurry” into a negative prompt box, users tell the model what not to generate. While effective, this process is fundamentally a guessing game. It relies on trial and error, intuition, and copying “magic words” from other users. Why should we manually guess the right negative words when we can use AI to learn the perfect negative representation mathematically?

Enter ReNeg (Reward-guided Negative embedding), a new method proposed by researchers from AMD, Dalian University of Technology, and Tsinghua University. ReNeg automates the search for the optimal negative signal. Instead of relying on discrete words, it learns a continuous “negative embedding” guided by a reward model that mimics human preference.

Figure 1. We develop ReNeg, a versatile negative embedding seamlessly adaptable to text-to-image and even text-to-video models. Strikingly simple yet highly effective, ReNeg amplifies the visual appeal of outputs from base Stable Diffusion (SD) models.

As shown in Figure 1, ReNeg significantly enhances visual appeal and detail compared to standard Stable Diffusion (SD) and even SD equipped with handcrafted negative prompts. In this post, we will dive deep into how ReNeg works, the mathematics behind it, and why it represents a smart, lightweight way to improve generative AI.

Background: Diffusion and Guidance

To understand ReNeg, we first need to quickly revisit how modern diffusion models generate images and how they are controlled.

The Diffusion Process

Diffusion models operate on a simple principle: destroy data, then learn to reverse the destruction.

  1. Forward Process: We take an image \(x_0\) and gradually add Gaussian noise over \(T\) steps until it becomes pure noise \(x_T\).
  2. Reverse Process: We train a neural network to predict the noise \(\epsilon\) at each step, allowing us to subtract it and recover the image.

The forward process is described mathematically as:

\[
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \bar{\beta}_t}\, x_0,\ \bar{\beta}_t \mathbf{I}\right)
\]

Here, \(\bar{\beta}_t\) controls the noise schedule. The reverse process (generation) involves predicting the mean and variance of the previous step:

\[
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
\]
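
To make the recap concrete, here is a tiny sketch of the forward (noising) step, assuming a precomputed cumulative schedule `beta_bar` (a hypothetical name); the reverse process is the network learning to undo exactly this kind of draw.

```python
import torch

def forward_noise(x0: torch.Tensor, t: int, beta_bar: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) at the cumulative noise level beta_bar[t]."""
    noise = torch.randn_like(x0)
    return torch.sqrt(1.0 - beta_bar[t]) * x0 + torch.sqrt(beta_bar[t]) * noise
```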

Classifier-Free Guidance (CFG)

The real magic of text-to-image generation comes from Classifier-Free Guidance (CFG). This is the mechanism that forces the image to actually look like your prompt.

During training, the model sees both conditional inputs (your text prompt \(c\)) and unconditional inputs (an empty or “null” prompt \(\phi\)). During inference (generation), the model predicts the noise twice: once with the text prompt and once without it. It then extrapolates the difference to “push” the image toward the text prompt.

The formula for the modified noise prediction \(\tilde{\epsilon}_\theta\) is:

\[
\tilde{\epsilon}_\theta(x_t, c, t) = \epsilon_\theta(x_t, \phi, t) + \gamma\,\big(\epsilon_\theta(x_t, c, t) - \epsilon_\theta(x_t, \phi, t)\big)
\]

Here, \(\gamma\) is the guidance scale.

  • \(\epsilon_\theta(x_t, c, t)\): The noise predicted using your positive prompt.
  • \(\epsilon_\theta(x_t, \phi, t)\): The noise predicted using the “null” prompt.
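
To make the formula concrete, here is a minimal sketch of the CFG combination (the `unet` call in the comment is an illustrative interface, not a specific library’s API):

```python
import torch

def apply_cfg(eps_cond: torch.Tensor,
              eps_uncond: torch.Tensor,
              guidance_scale: float = 7.5) -> torch.Tensor:
    """Classifier-Free Guidance: extrapolate away from the unconditional prediction."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# At each denoising step (illustrative names only):
# eps_cond   = unet(x_t, t, prompt_embedding)
# eps_uncond = unet(x_t, t, null_or_negative_embedding)
# eps_guided = apply_cfg(eps_cond, eps_uncond, guidance_scale=7.5)
```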

The Insight: Typically, \(\phi\) is just a fixed embedding corresponding to an empty string. However, users found that replacing this “null” prompt with a “negative prompt” (e.g., “ugly, blurry”) drastically improved quality.

ReNeg takes this a step further: What if we treat \(\phi\) not as a fixed empty string or a manual list of words, but as a learnable parameter \(n\) that we optimize directly?

The Core Method: ReNeg

The authors of ReNeg propose a framework to learn this negative embedding \(n\) through gradient descent. The goal is to find a vector \(n\) such that when it is used in the CFG equation, the resulting image maximizes a “reward” (quality score).

1. Feasibility: Why Tune the Embedding?

Before building the system, the researchers conducted a pilot study to see if tuning the negative embedding was actually efficient. They compared the “parameter efficiency”—how much the generated image changes relative to a small change in parameters—for:

  1. The full model weights (\(\theta_0\)).
  2. LoRA parameters (a popular fine-tuning method).
  3. The negative embedding (\(n\)).

Table 1. Comparison of parameter efficiency for full weights, LoRA, and the negative embedding.

The results (Table 1) were striking. The efficiency \(E(n)\) for the negative embedding was orders of magnitude higher than tuning the model weights. This means that tweaking the negative embedding is a highly effective “lever” to control the generation process, offering maximum impact for minimal computational cost.
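
For a rough sense of scale (approximate figures for intuition, not the paper’s measurements): SD1.5’s UNet has on the order of 860M parameters, a typical LoRA adds a few million, while a single negative embedding in CLIP’s 77×768 text space is about 59K values.

```python
# Rough parameter counts, for intuition only:
unet_params      = 860_000_000       # SD1.5 UNet, full fine-tuning (approximate)
lora_params      = 3_000_000         # a typical low-rank adapter, order of magnitude
neg_embed_params = 77 * 768          # one CLIP-sized negative embedding = 59,136 values

print(f"UNet / negative embedding: ~{unet_params // neg_embed_params:,}x more parameters")
```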

2. The Training Pipeline

ReNeg uses a Reward Feedback Learning (ReFL) framework. The process is a loop:

  1. Generate: The model predicts a denoised image using the current negative embedding.
  2. Evaluate: A pre-trained Image Reward Model (like HPSv2) scores the image based on human preference.
  3. Update: The gradients from the reward score are backpropagated to update only the negative embedding \(n\).

Figure 2. Overview of the training pipeline of our ReNeg. We learn the negative embedding by integrating Classifier-Free Guidance into the training process.

The crucial innovation here is integrating CFG into the training loop. Typically, CFG is only used during inference. ReNeg uses it during training to ensure the negative embedding is optimized specifically for how it will be used in the final generation.
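
A minimal PyTorch sketch of this loop is below. Every callable it takes (`unet`, `text_encoder`, `vae_decode`, `reward_model`, `sample_xt`, `predict_x0`) is a stand-in interface assumed for illustration, not the authors’ code; the point is that CFG is applied during training and only the negative embedding is updated by the optimizer.

```python
import torch
from typing import Callable, Iterable

def train_global_negative_embedding(
    unet: Callable,            # eps = unet(x_t, t, text_embeds)
    text_encoder: Callable,    # text_embeds = text_encoder(list_of_prompts)
    vae_decode: Callable,      # images = vae_decode(latents)
    reward_model: Callable,    # scores = reward_model(images, prompts)
    sample_xt: Callable,       # x_t, t = sample_xt(text_embeds), run without gradients
    predict_x0: Callable,      # x0 = predict_x0(x_t, eps, t), one-step prediction
    prompts_loader: Iterable,
    guidance_scale: float = 7.5,
    lr: float = 1e-3,
) -> torch.Tensor:
    """Sketch of the ReNeg training loop: only the negative embedding n is optimized."""
    neg = torch.nn.Parameter(torch.zeros(1, 77, 768))   # learnable negative embedding n
    opt = torch.optim.AdamW([neg], lr=lr)               # optimizer sees only n

    for prompts in prompts_loader:
        with torch.no_grad():
            cond = text_encoder(prompts)                 # positive prompt embeddings
            x_t, t = sample_xt(cond)                     # noisy latent at a random timestep

        # CFG inside the training loop: n plays the role of the null prompt
        eps_c = unet(x_t, t, cond)
        eps_n = unet(x_t, t, neg.expand(len(prompts), -1, -1))
        eps_g = eps_n + guidance_scale * (eps_c - eps_n)

        x0_hat = predict_x0(x_t, eps_g, t)               # one-step clean-image estimate
        loss = -reward_model(vae_decode(x0_hat), prompts).mean()  # maximize preference reward

        opt.zero_grad()
        loss.backward()
        opt.step()                                       # model weights stay untouched

    return neg.detach()
```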

3. The “One-Step” Approximation

Calculating the gradient through the entire diffusion process (which can take 50+ steps) is computationally prohibitive. To solve this, ReNeg uses a “one-step prediction” technique.

At a random timestep \(t\), the model predicts the final clean image \(\hat{x}_0\) directly from the noisy latent \(x_t\):

\[
\hat{x}_0 = \frac{x_t - \sqrt{\bar{\beta}_t}\,\tilde{\epsilon}_\theta(x_t, c, t)}{\sqrt{1 - \bar{\beta}_t}}
\]

The Reward Model then evaluates this predicted \(\hat{x}_0\). The objective function maximizes the expected reward:

\[
n^{*} = \arg\max_{n}\ \mathbb{E}_{c \sim \mathcal{D}}\Big[\, r\big(\hat{x}_0(c, n),\ c\big) \Big]
\]

where \(r(\cdot,\cdot)\) is the reward model’s preference score for the predicted image given the prompt \(c\).

Improving Precision with DDIM: The researchers found that using a deterministic ODE solver (DDIM) for the sampling step provided a more accurate prediction of \(\hat{x}_0\) compared to the stochastic DDPM sampler. This accuracy is vital because if the predicted image \(\hat{x}_0\) is blurry or inaccurate, the reward model cannot provide useful feedback.
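
Here is a minimal sketch of the one-step prediction as a standalone helper, written in the cumulative-noise notation used above (`beta_bar` is an assumed precomputed schedule, and `eps` would be the CFG-guided noise prediction):

```python
import torch

def predict_x0(x_t: torch.Tensor, eps: torch.Tensor, t: int,
               beta_bar: torch.Tensor) -> torch.Tensor:
    """One-step estimate of the clean latent from x_t.

    Inverts x_t = sqrt(1 - beta_bar_t) * x_0 + sqrt(beta_bar_t) * eps.
    """
    b = beta_bar[t]
    return (x_t - torch.sqrt(b) * eps) / torch.sqrt(1.0 - b)
```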

Figure 3. Deterministic ODE sampler (DDIM) improves x0 prediction. Comparisons show higher similarity scores for DDIM.

4. Global vs. Per-Sample Embeddings

ReNeg proposes two strategies for deploying these embeddings:

Strategy A: Global Negative Embedding The model learns a single, universal negative embedding over a large dataset of prompts. This embedding captures general “bad” qualities (blurriness, distortion) and can be used as a plug-and-play improvement for any prompt.

Strategy B: Per-Sample Negative Embedding Different prompts may require different negative guidance. For a photorealistic prompt, “cartoon” is a negative attribute. For a cartoon prompt, “photorealistic” is negative.

The authors devised an algorithm to fine-tune the negative embedding for a specific user prompt on the fly. It starts with the Global embedding and optimizes it further for the specific input.

Algorithm 1. Pseudo-code for learning the per-sample negative embedding.
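
The idea boils down to a warm-started optimization. Below is a minimal sketch of it, assuming a helper `reward_of(prompt, neg)` that wraps the CFG, one-step prediction, and reward scoring from the training loop earlier; it is an illustration, not the authors’ implementation.

```python
import torch
from typing import Callable

def refine_negative_embedding(prompt: str,
                              global_neg: torch.Tensor,
                              reward_of: Callable,   # assumed: reward_of(prompt, neg) -> scalar tensor
                              steps: int = 10,
                              lr: float = 1e-3) -> torch.Tensor:
    """Per-sample refinement: warm-start from the global negative embedding."""
    neg = torch.nn.Parameter(global_neg.clone())     # start from the global embedding
    opt = torch.optim.AdamW([neg], lr=lr)
    for _ in range(steps):
        loss = -reward_of(prompt, neg)               # maximize reward for this specific prompt
        opt.zero_grad()
        loss.backward()
        opt.step()
    return neg.detach()
```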

As seen in Figure 4 below, the per-sample approach (right) often recovers finer details that the global embedding (left) might miss, such as the texture of the cat’s fur or the details of the girl’s hand.

Figure 4. Comparison of results using global negative embedding and per-sample negative embedding. Red boxes highlight improvements.

Experiments and Results

The researchers evaluated ReNeg on standard benchmarks like HPSv2 and Parti-Prompts. They compared it against null text (standard SD), handcrafted negative prompts (e.g., “bad hands, text, error”), and other optimization methods.

Quantitative Analysis

The results show that ReNeg outperforms manual engineering. In the table below, observe the HPSv2 scores. ReNeg (Global and Per-sample) consistently beats SD1.5 equipped with handcrafted prompts (\(+N^*\)) and even competitive methods like TextCraftor that require fine-tuning the UNet.

Table 2. Quantitative results on HPSv2 and Parti-Prompts benchmarks. ReNeg achieves the highest scores.

Qualitative Analysis

The visual difference is clear. In the figure below, compare the columns.

  • SD1.5: Often lacks detail or aesthetic punch.
  • SD1.5 + N* (Handcrafted): Better, but can look generic.
  • ReNeg (Ours): Sharp, highly detailed, and aesthetically pleasing.

Look at the “figurine of Walter White” example (second row). ReNeg captures the texture and lighting significantly better than the competitors.

Figure 5. Qualitative comparisons showing ReNeg outperforms other methods in detail and prompt adherence.

What does the Negative Embedding “Look” Like?

This is one of the most fascinating parts of the paper. If we take the learned negative embedding and force the model to generate an image using it as a positive prompt, what do we see?

  • Null-text: Generates a generic, average image (e.g., a person standing).
  • Handcrafted: Generates a chaotic mix of “bad” concepts (cluttered streets, weird objects).
  • ReNeg: Generates abstract, nonsensical patterns.

Figure 6. Visualization of negative embeddings. ReNeg’s embedding (bottom right) looks abstract compared to handcrafted prompts.

This suggests that ReNeg has found a “pathological” direction in the latent space that doesn’t correspond to a human concept like “ugly,” but purely represents the mathematical opposite of high-reward images.

Generalization and Transferability

A major advantage of ReNeg is that the learned embedding is just a vector in the text encoder’s space. It is not tied to the specific weights of the UNet used during training. This means it can be transferred to any model that uses the same text encoder (like CLIP).
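
In practice, reusing the embedding can be as simple as handing it to a pipeline wherever a negative prompt is expected. A hedged sketch using Hugging Face diffusers, where recent versions of `StableDiffusionPipeline` accept precomputed `negative_prompt_embeds` (the checkpoint id, file name, and tensor shape are illustrative):

```python
import torch
from diffusers import StableDiffusionPipeline

# SD1.4 shares SD1.5's CLIP text encoder, so the 77x768 embedding slots in directly.
neg_embed = torch.load("reneg_negative_embedding.pt")   # shape (1, 77, 768), learned on SD1.5

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
image = pipe(
    prompt="a figurine of Walter White, studio lighting",
    negative_prompt_embeds=neg_embed,                    # replaces the null/handcrafted negative prompt
    guidance_scale=7.5,
).images[0]
```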

Across Model Versions

The authors took the embedding learned on SD1.5 and plugged it into SD1.4 and SD2.1. It worked out of the box, achieving high win rates against handcrafted prompts.

Figure 7. Comparison of win rates. ReNeg generalizes well to SD1.4 and SD2.1.

Transfer to Video and ControlNet

Perhaps most impressively, the embedding enhances Text-to-Video models (like VideoCrafter2 and ZeroScope) and ControlNet. These models were never part of the ReNeg training process, yet the negative embedding cleans up their output, improving motion smoothness and aesthetic quality.

Video Generation Results: Table 4. Performance comparison on video generation models. ReNeg improves aesthetic quality and motion smoothness.

ControlNet Results: Table 5. Performance comparison on ControlNet.

Conclusion

ReNeg represents a shift in how we control generative AI. Instead of treating “negative prompts” as a linguistic interface where humans must guess the right words, ReNeg treats them as a mathematical optimization problem.

By integrating reward guidance directly into the training loop via CFG, ReNeg extracts a “pure” negative signal that aligns models with human aesthetic preferences. The key takeaways are:

  1. Efficiency: Tuning the negative embedding is far more efficient than fine-tuning the model.
  2. Automation: No more trial-and-error with “bad hands, blurry, crop” prompts.
  3. Versatility: One learned embedding can improve images, videos, and controlled generation across multiple model versions.

As generative models continue to evolve, methods like ReNeg highlight the importance of the “unconditional” space—the dark matter of diffusion models—and how shaping it can illuminate the results we actually want.