Imagine you are writing a performance review for a colleague. You want the feedback to be professional (formal), but you also want to be encouraging (positive). Now, imagine you are texting a close friend about a terrible movie you just saw. You want to be casual (informal) and critical (negative).
As humans, we blend these stylistic dimensions effortlessly. We switch our “voice” based on the context, mixing sentiment, formality, humor, and politeness to suit the situation. Large Language Models (LLMs), however, often struggle with this nuance. While they are great at generating generic text, getting them to adhere to multiple specific stylistic constraints simultaneously—like “be formal AND negative AND ironic”—is a complex engineering challenge.
In the paper “Dynamic Multi-Reward Weighting for Multi-Style Controllable Generation,” researchers from the University of Minnesota tackle this exact problem. They propose a novel Reinforcement Learning (RL) approach that dynamically adjusts how different stylistic goals are prioritized during training.
In this post, we will break down why multi-style generation is hard, explore the mathematics behind their dynamic weighting solution, and analyze the results that show why this method outperforms traditional approaches.

The Problem: The Tug-of-War in Text Generation
Controlling a single style in an LLM is a relatively solved problem. If you want a model to be “positive,” you can fine-tune it on positive reviews or use a classifier to guide it. But what happens when you have conflicting or orthogonal goals?
Consider the “Formal + Negative” combination. These two styles often pull in different directions. A “negative” signal might encourage words like “terrible” or “sucks,” while a “formal” signal pulls the vocabulary toward “unsatisfactory” or “suboptimal.”
When engineers try to train a model using Reinforcement Learning to satisfy both, they typically use a Multi-Objective Reward Function. The model generates text, and several “discriminators” (classifiers trained to spot specific styles) judge the output.
The standard approach is to simply add these scores together. However, this leads to a “tug-of-war” in which the easier objective dominates. If it is easier for the model to be formal than to be negative, the model may learn to maximize the “formal” score and ignore the “negative” one, because that is the path of least resistance to a higher total reward.
Background: The RL Fine-Tuning Pipeline
To understand the solution, we first need to visualize the training pipeline. The authors use Proximal Policy Optimization (PPO), a popular RL algorithm used in training models like ChatGPT (via RLHF).
Here is the setup:
- The Policy (LLM): This is a Llama-2 7B model. It receives a prompt and generates a response.
- The Discriminators: These are separate, smaller models trained to detect specific styles (e.g., Sentiment, Formality, Irony).
- The Reward Function: This is the crucial component. It takes the outputs from the discriminators and combines them into a single scalar number that tells the LLM “Good job” or “Try again.”
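To make the pipeline concrete, here is a minimal, hypothetical sketch of how these pieces fit together. The names and the toy classifiers are illustrative assumptions, not the authors’ code; the point is only the shape of the interface: text goes in, per-style scores come out, and a combiner collapses them into one scalar.

```python
from typing import Callable, Dict

# A discriminator maps generated text to a style score, e.g. the classifier's
# probability that the text is "formal". (Hypothetical interface.)
Discriminator = Callable[[str], float]

def reward(text: str,
           discriminators: Dict[str, Discriminator],
           combine: Callable[[Dict[str, float]], float]) -> float:
    """Score the generation with every style discriminator, then collapse the
    per-style scores into the single scalar that PPO optimizes."""
    scores = {style: d(text) for style, d in discriminators.items()}
    return combine(scores)

# Toy usage: two stand-in classifiers and the simplest possible combiner (a sum).
toy = {
    "formal":   lambda t: 0.9 if "unsatisfactory" in t else 0.2,
    "negative": lambda t: 0.7 if "unsatisfactory" in t else 0.1,
}
print(reward("The service was unsatisfactory.", toy, combine=lambda s: sum(s.values())))
```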

The discriminators used in this study were trained on standard style-classification datasets to detect attributes such as sentiment, formality, and irony.

The core research question is: How do we mathematically combine the outputs of Discriminator A and Discriminator B to ensure the LLM learns both equally well?
The Core Method: Shaping the Reward
The researchers experimented with several “shapes” for the reward function before arriving at their novel contribution. Let’s look at the evolution of these ideas.
1. The Naive Approaches (Logits and Softmax)
The most obvious way to combine rewards is to take the raw outputs (logits) or the probabilities (softmax) from the discriminators and sum them up, as sketched in the code after this list.
- Softmax: Summing the probabilities (0 to 1). The problem? Classifiers are often “overconfident”: one might output 0.99 confidence for a style even when the text only vaguely matches it, giving the RL agent a noisy, saturated signal.
- Logits: Summing the raw scores before probability conversion. This provides a stronger signal but can be too aggressive, causing the model to output gibberish just to hack the reward function (resulting in high perplexity and low fluency).
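Here is a minimal sketch of the two naive combiners, assuming each discriminator is a binary classifier that exposes a logit for the target class and one for its opposite (the helper names are hypothetical):

```python
import math
from typing import List

def softmax_style_reward(target_logit: float, other_logit: float) -> float:
    """Probability-style reward: squash the target-class logit into (0, 1).
    Overconfident classifiers saturate near 1.0, flattening the signal."""
    return math.exp(target_logit) / (math.exp(target_logit) + math.exp(other_logit))

def logit_style_reward(target_logit: float) -> float:
    """Raw-logit reward: unbounded, so the policy can keep pushing this number
    up even at the cost of fluency (classic reward hacking)."""
    return target_logit

def summed_reward(per_style_rewards: List[float]) -> float:
    """Both naive schemes end the same way: a plain sum over styles,
    which lets the easiest style dominate the total."""
    return sum(per_style_rewards)

# Example: a generation that is confidently "formal" but barely "negative".
print(summed_reward([softmax_style_reward(4.0, -1.0), softmax_style_reward(0.2, 0.0)]))
print(summed_reward([logit_style_reward(4.0), logit_style_reward(0.2)]))
```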
2. The Binarized Approach
To reduce noise, one can simplify the signal: pass or fail. If the discriminator is more than 50% sure the text is “Formal,” the model gets a +1. If not, it gets a -1.
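Written out as a formula (a reconstruction from the description here; the paper’s exact notation may differ), the per-style reward and the combined reward look roughly like this:

$$
r_i(x) \;=\; \begin{cases} +1 & \text{if } p_i(\text{style}_i \mid x) > 0.5 \\ -1 & \text{otherwise} \end{cases}
\qquad\qquad
R(x) \;=\; \sum_{i} r_i(x)
$$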

This pass-or-fail approach removes the noise of the model’s confidence levels. It tells the LLM, “Just get over the line.” While effective, it lacks nuance: a generation that is barely formal earns the same reward as one that is extremely formal.
3. The Solution: Dynamic Weighting
The authors propose a method called Dynamic Weighting. The intuition is simple but powerful: Don’t just look at the score; look at how much the model is learning.
If the model is struggling to learn “Negativity” but finding “Formality” easy, the reward function should prioritize Negativity to balance the scales. To measure this, the authors look at the Gradient Norm.
In deep learning, the gradient represents the direction and magnitude of change required to reduce error. A large gradient implies the model has a lot to learn or is actively changing its understanding of that feature.
Here is the dynamic weighting formulation:
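Reconstructed from the plain-English description that follows (the paper’s exact notation may differ), it takes roughly this shape:

$$
R(x) \;=\; \sum_{i=1}^{N} w_i \, r_i(x), \qquad \sum_{i=1}^{N} w_i = 1
$$

where \(r_i(x)\) is the score from the discriminator for style \(i\) and \(w_i\) is its dynamically computed weight.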

And here is how that weight, which the authors call \(grad\_norm\), is calculated:
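Again in reconstructed form, the weight for style \(i\) is the norm of that style’s cross-entropy gradient, normalized over all target styles:

$$
w_i \;=\; \frac{\left\lVert \nabla_{\theta}\, \mathcal{L}_{CE}^{(i)} \right\rVert}{\sum_{j=1}^{N} \left\lVert \nabla_{\theta}\, \mathcal{L}_{CE}^{(j)} \right\rVert}
$$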

What this means in plain English: The weight \(w\) for a specific style isn’t fixed. It is calculated by looking at the gradient of the Cross Entropy loss (\(\mathcal{L}_{CE}\)) for that style.
- We measure the “steepness” (magnitude) of the gradients for all target styles.
- We normalize them so they sum to 1.
- If a style has a high gradient norm (meaning it’s currently a significant source of the model’s loss), it gets a higher weight in the reward function.
This acts as an automatic balancing mechanism. If the model starts ignoring one style, the loss for that style increases, the gradient magnitude grows, and the reward function automatically places more value on that style in the next step.
The final reward passed to PPO is this linear combination of the dynamically scaled style signals.
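To make the mechanism concrete, here is a minimal, self-contained PyTorch sketch. The toy features, linear heads, and batch size are illustrative assumptions standing in for the real per-style losses computed during PPO fine-tuning of Llama-2 7B; only the weighting logic mirrors the idea described above.

```python
# Minimal sketch of gradient-norm-based dynamic reward weighting (illustrative only).
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins: 8 "generations" represented by 16-dim features, with one
# binary label per target style (1 = text exhibits the style).
features = torch.randn(8, 16)
style_labels = {"negative": torch.randint(0, 2, (8,)),
                "informal": torch.randint(0, 2, (8,))}

# One toy linear head per style, standing in for whatever produces each L_CE.
heads = {s: torch.nn.Linear(16, 2) for s in style_labels}

def dynamic_weights() -> dict:
    """Weight each style by the norm of its cross-entropy gradient,
    normalized so the weights sum to 1."""
    norms = {}
    for style, head in heads.items():
        loss = F.cross_entropy(head(features), style_labels[style])
        grads = torch.autograd.grad(loss, list(head.parameters()))
        norms[style] = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    total = sum(norms.values())
    return {style: (n / total).item() for style, n in norms.items()}

def dynamic_reward(per_style_rewards: dict) -> float:
    """Linear combination of per-style rewards under the dynamic weights."""
    w = dynamic_weights()
    return sum(w[s] * r for s, r in per_style_rewards.items())

print(dynamic_weights())                                    # weights sum to 1
print(dynamic_reward({"negative": -1.0, "informal": +1.0}))
```

In a real run, the gradient norms would be recomputed at each training step, so the weights track whichever style the policy is currently struggling with.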

Comparison of Methods
To see how these methods stack up against each other, the authors summarize the reward-shaping techniques in a comparison table.

The Dynamic Weighting method (bottom row) is unique because it adapts during training, up-weighting whichever attribute the model is currently struggling with rather than following a fixed recipe.
Experiments and Results
The researchers tested these methods using Llama-2 7B. They aimed to control two styles simultaneously (e.g., Negative + Informal) and assessed the results based on two criteria:
- Style Accuracy: Did the text actually match the target styles?
- Generation Quality: Is the text fluent English (measured by Perplexity) and not repetitive (measured by Bigram Duplicates)?
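As a rough illustration of the repetition metric, a bigram-duplicate rate can be computed along these lines (the paper’s exact tokenization and definition may differ):

```python
def bigram_duplicate_rate(text: str) -> float:
    """Fraction of bigrams that repeat an earlier bigram in the same text.
    0.0 means no repeated bigrams; values near 1.0 indicate heavy looping."""
    tokens = text.split()
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return 1.0 - len(set(bigrams)) / len(bigrams)

print(bigram_duplicate_rate("the movie was bad bad bad bad"))             # repeated "bad bad"
print(bigram_duplicate_rate("the plot was thin and the pacing dragged"))  # no repeats
```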
Performance on Two-Style Control
The results for the “Negative + Informal” combination highlight the superiority of Dynamic Weighting.

Key Takeaways from the Data:
- Softmax (standard approach): Failed significantly. It only achieved the target combination 38.5% of the time.
- Logits: High accuracy (52.65%), but look at the Perplexity (PPL). A PPL of 98.86 suggests the model is generating garbage or near-gibberish to satisfy the discriminators.
- Binarized: A strong contender with 56.8% accuracy and good fluency (low PPL).
- Dynamic Weighting (the paper’s method): The clear winner. It achieved 60.25% accuracy with the lowest perplexity (31.46) and very low repetition.
These results demonstrate that dynamically adjusting weights prevents the model from “gaming” the system and encourages it to find a solution that satisfies both constraints fluently.
Scaling to Three Styles
The authors didn’t stop at two styles. They pushed the model to control three simultaneous attributes, such as “Positive + Formal + Irony.”

The results show that the model successfully learned to incorporate the third dimension. For example, in the first row, the model achieved 66.55% accuracy on Irony while maintaining high scores on Sentiment and Formality.
Visualizing this on a radar chart shows the “shape” of the model’s capability:

Notice the green line (Positive+Formal+Irony). It stretches far out on the Irony, Positive, and Formal axes, showing distinct control over all three compared to the baseline or 2-style models.
Comparison with PPLM (Plug and Play Language Models)
The researchers also compared their fine-tuning method against PPLM, a popular method that steers generation without updating the model weights (inference-time control).

As shown in Figure 4, the Dynamic Weighting RL approach (green bars) drastically outperforms PPLM (red bars) and the base model (blue bars). For “Positive + Formal,” the dynamic weighting approach nearly doubles the success rate of the base model.
The “Alignment Tax” and Limitations
While the results are impressive, the paper candidly discusses the trade-offs of this aggressive fine-tuning, often called the “Alignment Tax.”
When you force a model to adhere strictly to a specific style (like “Negative and Informal”), you might inadvertently degrade its factual accuracy. The researchers found that on Wikipedia-based prompts, the fine-tuned models sometimes hallucinated facts to fit the style.

In the table above, look at the entry for “Dwight.” The original Llama 2 correctly identifies it as a city in Illinois. The “Negative + Formal” model, trying to fit a specific tone, hallucinates that it is a city in Michigan founded by Alvin Lasher. This suggests that while we can control how the model speaks, we must be careful that this control doesn’t corrupt what it knows.
Conclusion
The paper “Dynamic Multi-Reward Weighting for Multi-Style Controllable Generation” offers a significant step forward in making LLMs more versatile. By moving away from static reward summation and adopting a dynamic, gradient-aware weighting scheme, we can teach models to balance conflicting stylistic goals without sacrificing fluency.
Key Takeaways:
- Multi-style generation is a balancing act: Static rewards lead to one style dominating the others.
- Gradients tell the story: Using the gradient magnitude lets the reward function “know” which style the model is struggling with and adjust priorities in real time.
- RL works better than steering: For complex multi-style requirements, fine-tuning with dynamic RL beats inference-time techniques like PPLM.
As we move toward more personalized AI assistants, the ability to dial in a specific, multi-faceted persona (e.g., “Professional but Empathetic” or “Witty and Concise”) will be essential. Dynamic Multi-Reward Weighting provides the mathematical blueprint for achieving that balance.