Escaping the Mode Collapse: How GDPO Brings Diversity to LLM Alignment
If you have used modern Large Language Models (LLMs) like ChatGPT or Claude extensively, you might have noticed a pattern. While they are incredibly helpful and safe, they can also be somewhat repetitive. Ask the same question five times, and you will often get five variations of the exact same answer, each written in the same “safe,” neutral tone.
This phenomenon is largely a byproduct of alignment. To make models safe and helpful, we train them using human preferences. The industry standard, Reinforcement Learning from Human Feedback (RLHF), and its more efficient cousin, Direct Preference Optimization (DPO), are excellent at forcing models to output high-quality answers. However, they suffer from a theoretical limitation: they are mode-seeking. They aggressively optimize for the single “best” answer, often stripping away the diversity and creativity inherent in the pre-trained model.
But what if we want a model that is both aligned and creative? What if we want a chatbot that can offer different perspectives or diverse writing styles while still adhering to human values?
In this post, we will dive deep into a paper titled “GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets” by Kwon et al. The researchers propose a new method that marries the efficiency of offline alignment with the diversity-seeking nature of GFlowNets.
The Problem: Alignment vs. Diversity
Before we understand the solution, we need to understand the tension between alignment and diversity.
The Standard Pipeline: RLHF and DPO
The standard recipe for creating a helpful assistant involves three steps:
- SFT (Supervised Fine-Tuning): Train the model on high-quality instruction-response pairs.
- Reward Modeling: Train a separate model to predict which of two responses a human prefers.
- RL (Reinforcement Learning): Use PPO (Proximal Policy Optimization) to maximize the reward model’s score.
Recent advancements, specifically DPO (Direct Preference Optimization), simplified this by removing the explicit reward model. DPO derives the reward signal directly from the preference data.
In preference modeling, we assume the probability that a human prefers response \(\pmb{y}\) over \(\pmb{y}'\) given a prompt \(\pmb{x}\) follows the Bradley-Terry model:
\[ p(\pmb{y} \succ \pmb{y}' \mid \pmb{x}) = \frac{\exp\big(r(\pmb{x}, \pmb{y})\big)}{\exp\big(r(\pmb{x}, \pmb{y})\big) + \exp\big(r(\pmb{x}, \pmb{y}')\big)} = \sigma\big(r(\pmb{x}, \pmb{y}) - r(\pmb{x}, \pmb{y}')\big) \]
Here, \(r(x,y)\) is the reward. DPO uses this relationship to optimize the policy directly.
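As a point of reference for what follows, here is a minimal sketch of the DPO objective implied by this relationship. It assumes the summed log-probabilities of the winning and losing responses under the policy and the frozen reference model have already been computed; the function and argument names are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of the winning (w) or losing (l)
    response under the trainable policy or the frozen reference model.
    """
    # Implicit rewards are beta-scaled log-ratios against the reference model.
    chosen_reward = beta * (policy_logp_w - ref_logp_w)
    rejected_reward = beta * (policy_logp_l - ref_logp_l)
    # Maximize the Bradley-Terry probability that the winner beats the loser.
    return -F.logsigmoid(chosen_reward - rejected_reward)
```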
The “Mode Collapse” Issue
While DPO is computationally efficient, theoretical analysis suggests it tends to overfit the reward signal. It learns to reject “bad” responses much faster than it learns to accept “good” ones. In probability terms, these algorithms concentrate all the probability mass on the single highest-reward output (the mode).
This is fine for math problems where there is only one right answer. It is terrible for creative writing, brainstorming, or empathetic dialogue, where there are many valid ways to respond.
The Solution: Enter GFlowNets
The authors propose GDPO (GFlowNet-Direct Preference Optimization). To understand GDPO, we need a quick primer on Generative Flow Networks (GFlowNets).
Standard Reinforcement Learning (RL) tries to find the action sequence that maximizes the expected reward: \(\max \mathbb{E}[R]\).
GFlowNets, introduced by Yoshua Bengio’s group, have a different objective. They aim to sample a sequence of actions such that the probability of generating an object \(x\) is proportional to its reward \(R(x)\).
\[ P(x) \propto R(x) \]
This difference is profound. If there are two good answers, and Answer A gets a reward of 10 and Answer B gets a reward of 8:
- Standard RL will eventually output Answer A almost 100% of the time.
- GFlowNet will output Answer A roughly 55% of the time and Answer B 45% of the time.
GFlowNets preserve the multimodal nature of the solution space. By applying this to LLM alignment, the goal is to create a model that samples diverse, high-quality responses rather than just the single “safest” one.
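A toy numerical illustration of that difference, using the same two-answer example (this is not from the paper):

```python
import random

rewards = {"Answer A": 10.0, "Answer B": 8.0}

# Mode-seeking RL (in the limit): always pick the single highest-reward answer.
best = max(rewards, key=rewards.get)                          # "Answer A" essentially 100% of the time

# GFlowNet-style sampling: pick answers with probability proportional to reward.
total = sum(rewards.values())
probs = {answer: r / total for answer, r in rewards.items()}  # A: ~0.56, B: ~0.44
sample = random.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(best, probs, sample)
```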
The Core Method: GDPO
The researchers formulated the alignment problem as a GFlowNet inference task. This allows them to train the model offline (using a static dataset of preferences) without the instability of online RL.
1. The Language Model as a Flow Network
We can view text generation as a Directed Acyclic Graph (DAG).
- State (\(s\)): The sequence of tokens generated so far.
- Action: Choosing the next token from the vocabulary.
- Trajectory (\(\tau\)): The complete sequence of states from the start to the End-of-Sequence (EOS) token.
In GFlowNet theory, we imagine “flow” (like water) moving through this graph. The amount of flow entering a state must equal the amount of flow leaving it. This leads us to the Detailed Balance (DB) condition.
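Concretely, a state is just the prefix of tokens generated so far, and a trajectory is the chain of prefixes ending at the EOS token. A tiny illustration (the tokens are made up):

```python
# Each state is the prefix generated so far; each action appends one token.
tokens = ["The", "cat", "sat", "<EOS>"]

state = ()            # s_0: the empty prefix (the prompt provides the context)
trajectory = [state]  # s_0 -> s_1 -> ... -> s_T
for token in tokens:
    state = state + (token,)   # action: choose the next token from the vocabulary
    trajectory.append(state)

# Every prefix has exactly one parent (drop its last token), so the graph is a tree.
print(trajectory)
```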
2. The Detailed Balance Objective
The authors use a specific loss function derived from the Detailed Balance condition. For a language model, this condition simplifies because the generation graph is a tree: every prefix has exactly one parent, so you can’t go back and change a previous token, and the backward policy becomes trivial.
The objective ensures that the transition probabilities (the policy \(\pi\)) are consistent with the rewards at the end of the generation. In its forward-looking form, where every prefix state can also terminate by emitting \(\top\), the balance condition reads:
\[ r(s_t)\, \hat{\pi}(s_{t+1} \mid s_t)\, \hat{\pi}(\top \mid s_{t+1}) = r(s_{t+1})\, \hat{\pi}(\top \mid s_t) \]
The training loss \(\mathcal{L}_{\text{DB}}\) is the squared difference between the logarithms of the two sides.
In this equation:
- \(\hat{\pi}\) is the language model we are training (the policy).
- \(r\) is the reward associated with each (partial) sequence.
- The equation balances the probability of moving from state \(s_t\) to \(s_{t+1}\) against the ratio of their rewards; a code sketch follows below.
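To make this concrete, here is a minimal PyTorch sketch of a detailed-balance loss in the squared log-ratio form above. It assumes the per-prefix log-rewards, the policy's next-token log-probabilities, and its end-of-sequence log-probabilities have already been gathered into tensors; the function name and tensor layout are illustrative, not the authors' code.

```python
import torch

def detailed_balance_loss(log_r, log_pi_next, log_pi_eos):
    """Squared log-ratio form of the detailed balance condition on a
    token-by-token generation tree.

    Args:
        log_r:       (T+1,) log-reward of each prefix state s_0 ... s_T
        log_pi_next: (T,)   log pi(s_{t+1} | s_t), the chosen next token's log-prob
        log_pi_eos:  (T+1,) log pi(EOS | s_t) for each prefix state
    """
    # log [ r(s_t) * pi(s_{t+1}|s_t) * pi(EOS|s_{t+1}) ]
    lhs = log_r[:-1] + log_pi_next + log_pi_eos[1:]
    # log [ r(s_{t+1}) * pi(EOS|s_t) ]
    rhs = log_r[1:] + log_pi_eos[:-1]
    return ((lhs - rhs) ** 2).mean()
```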
3. Defining the Reward without a Reward Model
This is the clever part. GFlowNets usually require a reward function \(R(x)\) that you can query. But in offline alignment, we don’t have a ground-truth reward function; we only have pairs of “winning” (\(y_w\)) and “losing” (\(y_l\)) responses.
The authors construct a reward signal by combining the reference model (the SFT model) and the preference probability. Written in log space, the reward for a specific token step \(k\) combines these pieces as follows:
\[ \log r(\pmb{y}_{1:k} \mid \pmb{x}) = \underbrace{\log \pi_{\text{ref}}(\pmb{y}_{1:k} \mid \pmb{x}) + \log \pi_{\text{ref}}(\top \mid \pmb{y}_{1:k}, \pmb{x})}_{r_{\text{ref}}} + \alpha \log p(\pmb{y} \succ \pmb{y}' \mid \pmb{x}) \]
Let’s break this down:
- Reference Log Probability (\(r_{\text{ref}}\)): This keeps the model anchored to the original supervised training, preventing it from spewing gibberish. It includes a term for the probability of finishing the sentence (\(\top\)).
- Preference Term (\(p(y > y' | x)\)): This injects the human preference. If a response is the “winner” in the dataset, it gets a flow boost.
- Preference weight (\(\alpha\)): A hyperparameter that controls how strongly we rely on the preference data (see the sketch below for how the pieces fit together).
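Here is a hypothetical sketch of how such a per-step reward could be assembled, following the additive log-space reading above. The helper name, its arguments, and the exact way the pieces are combined are assumptions for illustration, not the paper's verbatim formula.

```python
import torch

def step_log_reward(ref_logprob_prefix, ref_logprob_eos, pref_prob, alpha=1.0):
    """Illustrative per-step log-reward for a prefix y_{1:k} (an assumption
    following the breakdown in the text, not the paper's exact formula).

    ref_logprob_prefix: log pi_ref(y_{1:k} | x)        (scalar tensor)
    ref_logprob_eos:    log pi_ref(EOS | y_{1:k}, x)   (scalar tensor)
    pref_prob:          p(y > y' | x), e.g. close to 1 for the winning response
    alpha:              weight on the preference signal
    """
    r_ref = ref_logprob_prefix + ref_logprob_eos       # anchor to the SFT model
    return r_ref + alpha * torch.log(torch.as_tensor(pref_prob))
```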
4. The Training Algorithm
The training loop looks very similar to DPO but uses the GFlowNet loss.

The algorithm iterates through batches of preference data. For each pair of responses, it calculates the rewards using the formula above and then updates the model parameters \(\theta\) to minimize the Detailed Balance loss (\(\mathcal{L}_{\text{DB}}\)).
Crucially, GDPO is an offline method. It does not require generating new text during training (which is slow) or querying an external reward model (which is expensive). It simply learns to align its internal flow with the preferences found in the dataset.
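To tie the pieces together, here is a toy single training step that reuses the two hypothetical helpers sketched above (`detailed_balance_loss` and `step_log_reward`). Dummy tensors stand in for the log-probabilities a real implementation would compute from the trainable policy and the frozen reference model; this illustrates the offline loop described in the text, not the authors' reference implementation.

```python
import torch

T = 4                                                   # response length in tokens
# Stand-ins for the trainable policy's log-probs (leaf tensors so gradients flow).
log_pi_next = torch.randn(T, requires_grad=True)        # log pi_theta(y_{t+1} | y_{1:t}, x)
log_pi_eos = torch.randn(T + 1, requires_grad=True)     # log pi_theta(EOS | y_{1:t}, x)

# Per-prefix log-rewards for the winning response (preference prob close to 1),
# using made-up reference-model log-probabilities.
log_r = torch.stack([
    step_log_reward(ref_logprob_prefix=torch.tensor(-1.0 * (t + 1)),
                    ref_logprob_eos=torch.tensor(-2.0),
                    pref_prob=0.9)
    for t in range(T + 1)
])

loss = detailed_balance_loss(log_r, log_pi_next, log_pi_eos)
loss.backward()                 # gradients reach the policy's log-probabilities
print(float(loss), log_pi_next.grad)
```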
Experiments & Results
To test if GDPO actually produces more diverse and aligned models, the authors compared it against SFT, PPO, DPO, and several other baselines (IPO, CPO, ORPO). They used two datasets:
- Anthropic HH: A dialogue dataset (open-ended, creative).
- TLDR: A summarization dataset (constrained, precision-focused).
Win Rate vs. Diversity
This is the most critical result of the paper. They measured Win Rate (judged by GPT-4 against reference answers) versus Diversity (semantic distance between generated samples).

Look at the Grey Dot (GDPO) in the plot above.
- Vertical Axis (Diversity): GDPO is significantly higher than all other methods. It is generating much more varied responses.
- Horizontal Axis (Win Rate): While it doesn’t achieve the absolute highest win rate (DPO and IPO are further right), it remains competitive.
The trade-off is visible: DPO (Green) and IPO (Red) optimize heavily for win rate but crash in diversity. They found a “winning strategy” and stuck to it. GDPO explores the space of valid answers much more broadly.
Does Temperature Fix DPO?
A common counter-argument is: “If you want diversity, why not just increase the sampling temperature of DPO?”
The authors tested this. Increasing temperature adds randomness, but does it add useful diversity?

As shown in Table 1, when you crank up the temperature on DPO (to 1.5), the diversity increases (to 50.8), but the win rate collapses (to 20.6%). The model just starts making mistakes.
In contrast, GDPO achieves a massive diversity score of 69.0 while maintaining a healthy win rate of 43.7% at standard temperature. GDPO’s diversity comes from learning the shape of the reward distribution, not just adding random noise.
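Mechanically, raising the sampling temperature just rescales the logits before the softmax, flattening the entire distribution uniformly (including toward low-quality tokens) rather than reshaping it toward alternative high-reward answers. A quick sketch of the effect, with made-up logits:

```python
import torch

logits = torch.tensor([4.0, 2.0, 0.5, -1.0])    # toy next-token logits

for temperature in (0.7, 1.0, 1.5):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(temperature, probs)                   # higher temperature -> flatter, noisier distribution
```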
Qualitative Analysis: Reading the Responses
Numbers are great, but what does this diversity look like in practice? Let’s look at an example from the Anthropic HH dataset, specifically a sensitive query about alcohol.

- DPO (Green row equivalent): The response is standard, safe, and somewhat formulaic (“I’m sorry to hear that… important to remember…”).
- GDPO (Last row): The response is distinct. It adopts a more empathetic, inquisitive tone: “Could you tell me more about that… I’m just thinking about what we might be able to do…”
In the full text analysis, the authors note that DPO and IPO often converge to either purely factual or overly protective responses. GDPO generates responses that explore different angles—sometimes emotional support, sometimes factual analysis—while staying relevant.
The Summarization Challenge
Interestingly, the results were different for the summarization task (TLDR).

In summarization (Figure 2), consistency is often preferred over creativity. There aren’t that many different ways to summarize a text correctly. Here, GDPO’s diversity advantage was less pronounced, and the win rate suffered slightly compared to DPO. This highlights that GDPO is best suited for open-ended tasks (dialogue, story generation, brainstorming) rather than convergent tasks (translation, summarization).
Conclusion and Implications
The paper “GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets” presents a compelling alternative to the current dominance of DPO.
By viewing alignment through the lens of Bayesian inference and GFlowNets, GDPO offers a way to:
- Prevent Mode Collapse: It stops models from sounding like broken records.
- Maintain Alignment: It still optimizes for human preferences, just not at the expense of variety.
- Stay Efficient: It retains the offline, reward-model-free benefits of DPO.
Why This Matters
As AI becomes integrated into creative tools—writing assistants, role-playing games, and ideation software—diversity becomes a key metric of quality. A writing assistant that always suggests the same edit is useless. A role-playing NPC that always reacts with the same “safe” personality breaks immersion.
GDPO suggests that we don’t have to choose between a safe model and a creative one. By changing the objective from “maximize reward” to “sample proportionally to reward,” we can align models that reflect the full richness of human language and preferences.
Key Takeaways for Students
- RLHF/DPO are mode-seeking; they find the peak of the mountain and stay there.
- GFlowNets are distribution-matching; they map the whole mountain range.
- Detailed Balance is a flow-conservation condition borrowed from physics; here it ensures that the probability flow through the generation tree is consistent with the rewards, yielding a stable generative policy.
- GDPO is currently one of the strongest methods for keeping LLMs diverse without re-introducing the complexity of online PPO training.