Introduction

In the current era of Generative AI, training a Large Language Model (LLM) to speak fluent English is effectively a solved problem. The frontier has shifted from capability to alignment. We don’t just want models that can write; we want models that write in accordance with human values—being helpful, harmless, and honest.

The industry standard for achieving this is Reinforcement Learning from Human Feedback (RLHF). This technique fine-tunes models using a “Reward Model” that acts as a proxy for human judgment. Think of the Reward Model as a judge: if the judge has a keen eye and clear values, the AI learns to behave well. If the judge is confused, inconsistent, or easily tricked, the AI learns the wrong lessons.

A recent paper from Fudan University highlights a critical flaw in current Reward Models: they struggle to distinguish subtle differences between good and bad responses. They often rely on “easy features” (shortcuts) rather than deeply understanding human intent. To fix this, the researchers propose integrating Contrastive Learning into the training process.

In this post, we will deconstruct their method. We will explore how adding unsupervised contrastive losses helps the Reward Model “pull apart” good and bad answers in its internal representation, leading to AI agents that are safer, more helpful, and surprisingly, more stable during training.

Background: The RLHF Pipeline

To understand the contribution of this paper, we first need to look at the standard RLHF pipeline, specifically the role of the Reward Model (RM). RLHF typically consists of three stages:

  1. Supervised Fine-Tuning (SFT): The model is trained on high-quality instruction-response pairs to learn the basic format of interaction.
  2. Reward Modeling: A separate model is trained to score responses based on human preference data.
  3. Reinforcement Learning (PPO): The language model uses the Reward Model’s scores to optimize its policy.

The paper focuses entirely on improving Stage 2: Reward Modeling.

The Standard Reward Model

In the standard setup, human annotators are given a prompt \(x\) and two model-generated responses, \(y_1\) and \(y_2\). The annotator picks the winner. Let’s call the chosen response \(y_c\) and the rejected response \(y_r\).

The Reward Model (usually initialized from the SFT model) is trained to output a scalar score \(r(x, y)\) for any given input. The goal is simple: ensure the score for the chosen response is higher than the score for the rejected response.

Mathematically, this uses the Bradley-Terry model to estimate the probability that \(y_c\) is better than \(y_r\):

\[ P(y_c \succ y_r \mid x) = \sigma\big(r_{\psi}(x, y_c) - r_{\psi}(x, y_r)\big) \]

Here, \(\sigma\) is the logistic (sigmoid) function. If the reward for the chosen response, \(r_{\psi}(x, y_c)\), is much higher than the reward for the rejected response, the probability approaches 1.

To train this model, we minimize the negative log-likelihood loss:

\[ \mathcal{L}_{rm} = -\mathbb{E}_{(x, y_c, y_r) \sim \mathcal{D}}\Big[\log \sigma\big(r_{\psi}(x, y_c) - r_{\psi}(x, y_r)\big)\Big] \quad (1) \]

This loss function essentially tells the model: “Maximize the gap between the chosen score and the rejected score.”
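To make this concrete, here is a minimal PyTorch sketch of that pairwise loss. It is not the authors' code; the random scores simply stand in for real reward-model outputs.

```python
import torch
import torch.nn.functional as F

def ranking_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry preference model.

    chosen_scores, rejected_scores: shape (batch,), the scalar rewards
    r(x, y_c) and r(x, y_r) for each preference pair.
    """
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage: random scores standing in for reward-model outputs
chosen = torch.randn(8)
rejected = torch.randn(8)
print(ranking_loss(chosen, rejected))
```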

The Problem: The “Mushy” Middle

While the math above is sound, the reality of high-dimensional neural networks is messier. The researchers identified a limitation of the standard ranking loss (Equation 1): it only cares about relative ranking. It doesn't necessarily force the model to learn distinct, robust features that separate "good" from "bad" semantically.

The authors visualized the internal feature representations of a standard Reward Model using t-SNE (a technique for visualizing high-dimensional data in 2D).

Figure 1: t-SNE visualization of the reward model's feature space, comparing the baseline distribution (left) with the SimCSE distribution (right).

Look at the plot on the left (Baseline) in Figure 1. The blue dots (chosen responses) and red dots (rejected responses) are mashed together. There is significant overlap. This means the model’s internal “understanding” of what makes a response good vs. bad is blurry. It struggles to discriminate between the two classes in the feature space.

When the features overlap this much, the model is likely relying on noise or superficial correlations to make its decisions. This leads to poor generalization when the model faces new, unseen prompts.
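For readers who want to reproduce this kind of plot on their own model, a rough sketch with scikit-learn and matplotlib is below; the random arrays are placeholders for the reward model's actual hidden states.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder features: in practice, extract the reward model's final
# hidden state for each chosen and rejected response.
chosen_feats = np.random.randn(200, 768)
rejected_feats = np.random.randn(200, 768)

feats = np.vstack([chosen_feats, rejected_feats])
labels = np.array([0] * len(chosen_feats) + [1] * len(rejected_feats))

# Project the high-dimensional features down to 2D
coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(feats)

plt.scatter(coords[labels == 0, 0], coords[labels == 0, 1], s=5, label="chosen")
plt.scatter(coords[labels == 1, 0], coords[labels == 1, 1], s=5, label="rejected")
plt.legend()
plt.title("Reward model features (t-SNE)")
plt.show()
```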

Core Method: Contrastive Learning

The researchers propose a solution imported from the world of computer vision and unsupervised learning: Contrastive Learning.

The core intuition of contrastive learning is geometric: we want to structure the vector space such that “similar” things are pulled close together, and “dissimilar” things are pushed far apart.

By adding a contrastive loss to the standard ranking loss, the researchers aim to force the Reward Model to learn embeddings where “chosen” and “rejected” responses are clearly separated—as seen in the right-hand plot of Figure 1 above.

The paper introduces two distinct flavors of contrastive learning for this task: SimCSE (Instance-based) and SwAV (Cluster-based).

1. Constructing Contrastive Data

Before applying the algorithms, we need to decide what we are contrasting. The authors propose two strategies for representing the data.

Single Representation Pairs

This is the standard approach. We take the hidden states \(h\) of the responses directly. We treat the representation of a specific input as the “anchor,” and an augmented version of it as the “positive.”

Difference Representation Pairs (Novel Contribution)

The authors introduce a clever twist. Instead of just embedding the response \(y\), they compute the difference between the chosen and rejected features:

\[h_{diff} = f(x, y_c) - f(x, y_r)\]

This vector explicitly represents the “delta” of quality—the specific features that made the human prefer one over the other. By applying contrastive learning to these difference vectors, the model focuses specifically on the factors of preference.
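A small sketch of the two pairing strategies, assuming a hypothetical `encode` function that stands in for the reward model's feature extractor \(f(x, y)\):

```python
import torch

def encode(prompt: str, response: str) -> torch.Tensor:
    """Hypothetical stand-in for f(x, y): in practice this would return the
    reward model's final hidden state for the (prompt, response) pair."""
    return torch.randn(768)

prompt = "Explain photosynthesis to a child."
chosen = "Plants use sunlight, water, and air to make their own food..."
rejected = "idk, just google it"

# Single representation: contrast the response embeddings themselves
h_chosen = encode(prompt, chosen)
h_rejected = encode(prompt, rejected)

# Difference representation: contrast the "delta of quality" between the pair
h_diff = h_chosen - h_rejected
```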

2. Approach A: SimCSE (Instance-Based)

SimCSE (Simple Contrastive Learning of Sentence Embeddings) is an unsupervised method that requires no external labels. It uses “Dropout” as a data augmentation technique.

In a Transformer model, “Dropout” randomly turns off neurons during training to prevent overfitting. If you pass the exact same sentence through the model twice with Dropout enabled, you get two slightly different embedding vectors.

SimCSE treats these two vectors as a positive pair (they should be close). It treats any other sentence in the batch as a negative pair (they should be far apart).

The loss function for SimCSE is:

\[ \mathcal{L}_{cl} = -\log \frac{\exp\big(\mathrm{sim}(h_s^{i}, h_t^{i}) / \tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(h_s^{i}, h_t^{j}) / \tau\big)} \]

Let’s break this down:

  • The numerator calculates the cosine similarity (\(\text{sim}\)) between the two augmented versions of the same instance (\(h_s\) and \(h_t\)). We want this to be high.
  • The denominator sums the similarity between our instance and all other instances in the batch. We want this to be low.
  • \(\tau\) is a temperature parameter that controls how sharp the probability distribution is.

By minimizing this loss, the Reward Model learns to be robust to small perturbations (dropout noise) while distinguishing distinct instances from each other.
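A minimal PyTorch sketch of this in-batch objective is shown below. It assumes the same batch has already been encoded twice with dropout active (giving `h_s` and `h_t`); it is a simplified illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def simcse_loss(h_s: torch.Tensor, h_t: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss over two dropout-augmented views.

    h_s, h_t: shape (batch, dim), the same inputs encoded twice while
    dropout is active, so the two views differ slightly.
    """
    h_s = F.normalize(h_s, dim=-1)
    h_t = F.normalize(h_t, dim=-1)
    # sim[i, j] = cosine similarity between view s of instance i and view t of instance j
    sim = h_s @ h_t.T / tau
    # Diagonal entries are the positive pairs; every other entry is a negative
    targets = torch.arange(h_s.size(0))
    return F.cross_entropy(sim, targets)

# Toy usage: in practice h_s and h_t come from two stochastic forward passes
h_s, h_t = torch.randn(16, 768), torch.randn(16, 768)
print(simcse_loss(h_s, h_t))
```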

3. Approach B: SwAV (Cluster-Based)

The second method explored is SwAV (Swapping Assignments between Views). Unlike SimCSE, which compares individual instances, SwAV compares clusters.

The idea is to cluster the data into \(K\) prototypes (or centroids). If we have two augmented views of the same image (or text), they should belong to the same cluster.

SwAV computes a “code” \(q\) (the cluster assignment) for one view and tries to predict that code using the feature representation of the other view. It creates a “swapped” prediction task.

The total loss combines the prediction errors from both directions:

\[ \mathcal{L}(z_s, z_t) = \ell(z_s, q_t) + \ell(z_t, q_s) \]

The specific loss term uses the cross-entropy between the cluster assignment \(q\) and the predicted probability \(p\):

\[ \ell(z_s, q_t) = -\sum_{k} q_t^{(k)} \log p_s^{(k)}, \qquad p_s^{(k)} = \frac{\exp\big(z_s^{\top} c_k / \tau\big)}{\sum_{k'} \exp\big(z_s^{\top} c_{k'} / \tau\big)} \]

Here, \(c_k\) represents the prototype vectors (the centers of the clusters). The model learns these prototypes during training. This method encourages the model to find semantic groupings in the data (e.g., grouping all “polite refusals” together and all “toxic outbursts” together) and map new inputs to these concepts consistently.
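The sketch below illustrates the swapped-prediction idea. Note one deliberate simplification: full SwAV computes the codes \(q\) with the Sinkhorn-Knopp algorithm to keep cluster assignments balanced, whereas this version uses a detached softmax purely to keep the example short.

```python
import torch
import torch.nn.functional as F

def swav_loss(z_s: torch.Tensor, z_t: torch.Tensor,
              prototypes: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Simplified swapped-prediction loss over K learnable prototypes.

    z_s, z_t: (batch, dim) features from two views; prototypes: (K, dim).
    Full SwAV derives the codes q via Sinkhorn-Knopp; a detached softmax
    is used here only for brevity.
    """
    z_s, z_t = F.normalize(z_s, dim=-1), F.normalize(z_t, dim=-1)
    c = F.normalize(prototypes, dim=-1)
    scores_s, scores_t = z_s @ c.T, z_t @ c.T                 # (batch, K)
    q_s = F.softmax(scores_s / tau, dim=-1).detach()          # codes (targets)
    q_t = F.softmax(scores_t / tau, dim=-1).detach()
    p_s = F.log_softmax(scores_s / tau, dim=-1)               # predictions
    p_t = F.log_softmax(scores_t / tau, dim=-1)
    # Swapped prediction: view s predicts the code of view t, and vice versa
    return -(q_t * p_s).sum(-1).mean() - (q_s * p_t).sum(-1).mean()

# Toy usage with K = 32 prototypes
prototypes = torch.nn.Parameter(torch.randn(32, 768))
z_s, z_t = torch.randn(16, 768), torch.randn(16, 768)
print(swav_loss(z_s, z_t, prototypes))
```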

The Total Optimization Objective

The final training objective is a hybrid. We don’t discard the original goal (ranking); we simply add the contrastive task as a regularizer.

\[ \mathcal{L}_{total} = \mathcal{L}_{rm} + \beta \mathcal{L}_{cl} \]
  • \(\mathcal{L}_{rm}\): The standard supervised ranking loss (Equation 1).
  • \(\mathcal{L}_{cl}\): The contrastive loss (either SimCSE or SwAV).
  • \(\beta\): A hyperparameter controlling the weight of the contrastive loss.

This ensures the model still learns to rank preferences, but does so while maintaining a well-structured feature space.
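Wired together, a training step might look like the sketch below. It reuses `ranking_loss` and `simcse_loss` from the earlier snippets, and the `reward_model.score` / `reward_model.features` methods are hypothetical placeholders for however your reward model exposes scalar rewards and hidden states.

```python
def training_step(reward_model, batch, beta: float = 0.1):
    """Hypothetical joint objective: ranking loss plus a contrastive regularizer."""
    # Scalar rewards for the supervised ranking loss
    r_chosen = reward_model.score(batch["prompt"], batch["chosen"])
    r_rejected = reward_model.score(batch["prompt"], batch["rejected"])
    loss_rm = ranking_loss(r_chosen, r_rejected)

    # Two stochastic forward passes (dropout active in train mode) over the
    # difference representations give the two views for the contrastive loss
    h_s = (reward_model.features(batch["prompt"], batch["chosen"])
           - reward_model.features(batch["prompt"], batch["rejected"]))
    h_t = (reward_model.features(batch["prompt"], batch["chosen"])
           - reward_model.features(batch["prompt"], batch["rejected"]))
    loss_cl = simcse_loss(h_s, h_t)

    return loss_rm + beta * loss_cl
```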

Experiments and Results

The researchers tested their method using the LLaMA-2-7B model. They evaluated performance across three distinct domains: Helpfulness, Harmlessness, and Summarization.

They compared their Contrastive Reward Models against three baselines:

  1. SFT: Supervised Fine-Tuning only (no RL).
  2. Vanilla PPO: Standard RLHF.
  3. DPO (Direct Preference Optimization): A popular recent alternative to PPO.

1. Head-to-Head Win Rates

The primary metric in RLHF is the “Win Rate”—how often humans (or GPT-4 as a proxy) prefer the model’s output over a baseline.

Table 1: Win rates of the contrastive reward models against SFT, Vanilla PPO, and DPO.

Key Takeaways from Table 1:

  • Harmlessness: The SimCSE method shines here. It achieves a significantly higher win rate against Vanilla PPO (66 wins vs 6 losses). This suggests that contrastive learning helps the model sharply distinguish between safe and unsafe content.
  • Summarization: The SwAV-diff (SwAV with Difference Representations) performs exceptionally well, dominating the baseline PPO and DPO.
  • Helpfulness: The gains are modest but positive. This reflects a known tension in RLHF: optimizing for safety (harmlessness) often makes the model slightly more evasive or less helpful (“alignment tax”). However, the contrastive methods still outperform the baselines.

2. Standard Benchmarks and Generalization

To ensure the model isn’t just overfitting to the training data, the researchers tested on MT-Bench and Arena-Hard, two widely accepted benchmarks for chat capability.

Table 2: MT-Bench and Arena-Hard results.

As shown in Table 2, SimCSE-diff achieves the highest average score (6.53 vs Vanilla PPO’s 5.82). This confirms that the improvements in the reward model translate to general-purpose chat capabilities.

Furthermore, they tested “Out-of-Distribution” (OOD) accuracy. They trained the reward model on the Anthropic HH dataset but tested its accuracy on completely different datasets like OpenAI WebGPT and Stanford SHP.

Table 3: In-distribution and out-of-distribution reward-model accuracy.

Table 3 reveals an interesting pattern. While the accuracy gain on the training dataset (ID) is small, the gain on OOD datasets (WebGPT, SHP) is much more consistent. This validates the hypothesis: Contrastive learning improves generalization. By forcing the model to learn structural features rather than shortcuts, it performs better on data it hasn’t seen before.
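As a point of reference, reward-model accuracy on these preference datasets is typically just the fraction of pairs for which the chosen response outscores the rejected one; a sketch with a hypothetical `reward_model.score` interface is below.

```python
import torch

@torch.no_grad()
def preference_accuracy(reward_model, pairs) -> float:
    """Fraction of preference pairs where the chosen response outscores the rejected one.

    `pairs` is an iterable of (prompt, chosen, rejected) triples, e.g. drawn from
    WebGPT or SHP after converting them to pairwise comparisons.
    """
    correct, total = 0, 0
    for prompt, chosen, rejected in pairs:
        if reward_model.score(prompt, chosen) > reward_model.score(prompt, rejected):
            correct += 1
        total += 1
    return correct / max(total, 1)
```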

3. Stability in Training

One of the biggest headaches in RLHF is stability. PPO (Proximal Policy Optimization) is notoriously sensitive to hyperparameters. If the Reward Model gives inconsistent scores, the policy (the language model) can collapse, outputting gibberish.

The standard PPO objective includes a KL-divergence term to prevent the model from drifting too far:

\[ r_{total}(x, y) = r_{\psi}(x, y) - \eta \, \mathrm{KL}\big(\pi_{\phi}(y \mid x) \,\|\, \pi_{SFT}(y \mid x)\big) \]

where \(\pi_{\phi}\) is the policy being trained, \(\pi_{SFT}\) is the frozen SFT model, and \(\eta\) controls the strength of the penalty.
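In code, this KL-penalized reward can be sketched as follows; the coefficient name `eta` and the per-token log-probability inputs are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def kl_penalized_reward(reward: torch.Tensor,
                        logprobs_policy: torch.Tensor,
                        logprobs_sft: torch.Tensor,
                        eta: float = 0.05) -> torch.Tensor:
    """Reward-model score minus a penalty for drifting away from the SFT model.

    logprobs_*: per-token log-probabilities of the sampled response, shape (seq_len,).
    The summed log-ratio is a per-sample estimate of the KL divergence.
    """
    kl_estimate = (logprobs_policy - logprobs_sft).sum()
    return reward - eta * kl_estimate
```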

However, the authors found that their Contrastive Reward Model provided such stable signals that the training process was smoother even without heavy tuning.

Figure 2: Training curves showing the stability of the contrastive methods versus the baseline.

In Figure 2, look at the train/returns (bottom right) and eval/rewards (top left). The contrastive methods (blue, orange, green) show a steady, monotonic increase in rewards. The baseline (red), however, struggles to improve after a certain point. This stability is a massive practical benefit for engineers training these models.

4. Qualitative Analysis: What do the models actually say?

Numbers are great, but how does this change the text the AI generates?

Harmlessness Example: When asked a leading question about negative labeling of a child, the Vanilla PPO model gave a generic “I cannot answer” refusal.

Table showing harmlessness query examples.

In contrast, the SimCSE-diff model provided a nuanced, empathetic response explaining why negative labeling is bad, rather than just refusing to answer. This indicates a deeper semantic understanding of “harmlessness.”

Helpfulness Example: When asked for rhymes for “pig,” the Vanilla PPO model hallucinated “Sty” (related, but doesn’t rhyme).

Table showing helpfulness query examples.

The SwAV and SwAV-diff models correctly identified rhymes like “Dig,” “Fog,” and “Log.” This shows that the Reward Model was better able to penalize incorrect rhymes during training, leading to a smarter policy.

Conclusion

Reinforcement Learning from Human Feedback (RLHF) is the engine powering today’s top AI models. However, the engine is only as good as its fuel filter. If the Reward Model allows noise and “easy features” to pass through, the AI’s alignment will be brittle.

This research demonstrates that Contrastive Learning is a highly effective upgrade for Reward Models. By adding an unsupervised loss—whether instance-based (SimCSE) or cluster-based (SwAV)—we can force the model to organize its internal knowledge more effectively.

The results are three-fold:

  1. Better Separation: Chosen and rejected responses are distinct in the feature space.
  2. Better Generalization: The model performs better on new, unseen tasks.
  3. Better Stability: The subsequent RL training phase is smoother and more reliable.

As we push for AIs that handle increasingly complex and subtle human values, techniques like this—which sharpen the “discriminative capability” of the AI judge—will be essential components of the alignment toolkit.