Reinforcement Learning from Human Feedback (RLHF) has become the standard for turning raw Large Language Models (LLMs) into helpful assistants. If you have played with ChatGPT or Llama, you are interacting with a model that has likely undergone this process.
Traditionally, this involves a two-step dance: training a reward model to mimic human preferences, and then using Reinforcement Learning (usually PPO) to optimize the LLM against that reward. However, PPO is computationally expensive and notoriously unstable because it requires “on-policy” sampling—generating new text from the model constantly during training.
Recently, Direct Alignment methods like DPO (Direct Preference Optimization) and IPO (Identity Preference Optimization) have surged in popularity. They skip the reward model and the sampling, learning directly from a static dataset of preferences (Offline RL). They are stable and fast. But there is a catch. They generally only work with preference data (A is better than B).
What if you want to optimize your LLM for something that isn’t a preference? What if you want to optimize for code compilation rates, factual consistency scores, or a specific business metric? DPO struggles here.
Enter Contrastive Policy Gradient (CoPG).
In this post, we will deconstruct a new method that combines the best of both worlds. CoPG allows you to optimize any arbitrary reward function using offline data, without the instability of PPO or the restrictions of DPO. We will explore the mathematics behind it, how it generalizes existing methods, and look at experimental results proving its efficacy.
The Background: The Alignment Problem
To understand why CoPG is necessary, we first need to look at the mathematical foundation of aligning LLMs.
We denote a prompt as \(x\) and a generated response as \(y\). Our goal is to train a policy \(\pi\) (the LLM) that maximizes a specific Reward \(R(x, y)\). However, we can’t let the model drift too far from its original training (the reference model \(\pi_{\text{ref}}\)), or it will start outputting nonsense just to hack the reward.
This leads us to the standard regularized RL objective:

\[
J(\pi) = \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ R(x, y) \big] - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big)
\]
Here, \(\beta\) controls the strength of the KL-divergence penalty (keeping the model close to the reference). We can simplify this expression by defining a “regularized reward” \(R^\pi_\beta\), which subtracts the KL penalty directly from the reward (from here on we suppress the prompt \(x\) to keep the notation light):

\[
R^\pi_\beta(y) = R(y) - \beta \ln \frac{\pi(y)}{\pi_{\text{ref}}(y)},
\qquad
J(\pi) = \mathbb{E}_{y \sim \pi}\big[ R^\pi_\beta(y) \big]
\]
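To make this concrete, here is a minimal sketch (in PyTorch, with assumed tensor names, not code from the paper) of how the regularized reward can be computed from sequence-level log-probabilities:

```python
import torch

def regularized_reward(reward, logp_pi, logp_ref, beta):
    """Compute R^pi_beta(y) = R(y) - beta * log(pi(y) / pi_ref(y)).

    reward:   (batch,) scalar rewards for each full response y
    logp_pi:  (batch,) sum of token log-probs of y under the current policy
    logp_ref: (batch,) sum of token log-probs of y under the frozen reference
    """
    return reward - beta * (logp_pi - logp_ref)
```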
The Problem with Standard Policy Gradient
The classic way to solve this is using Policy Gradient (PG). The gradient of the objective looks like this:

\[
\nabla J(\pi) = \mathbb{E}_{y \sim \pi}\Big[ \big( R^\pi_\beta(y) - b \big)\, \nabla \ln \pi(y) \Big]
\]
This equation tells us to increase the probability of generations that have a high reward (relative to a baseline \(b\)).
The problem lies in the expectation \(\mathbb{E}_{y \sim \pi}\). It requires sampling \(y\) from the current policy \(\pi\): as \(\pi\) changes during training, you have to generate fresh samples constantly. If you instead try to use a static dataset (off-policy data from a distribution \(\mu\)), you have to resort to Importance Sampling:

\[
\nabla J(\pi) = \mathbb{E}_{y \sim \mu}\!\left[ \frac{\pi(y)}{\mu(y)} \big( R^\pi_\beta(y) - b \big)\, \nabla \ln \pi(y) \right]
\]
See that ratio \(\frac{\pi(y)}{\mu(y)}\)? As your model \(\pi\) learns and drifts away from the dataset distribution \(\mu\), that ratio can explode, causing the variance of your gradients to skyrocket. This makes offline training with standard Policy Gradient mathematically valid but practically impossible.
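A back-of-the-envelope illustration (the numbers below are invented purely to show the scale of the problem): the sequence-level ratio is the exponential of a sum of per-token log-probability gaps, so even modest per-token drift compounds catastrophically over a long generation.

```python
import math

# Suppose pi has drifted so that each token is, on average, 0.1 nats more
# likely under pi than under the data distribution mu, over a 500-token response.
per_token_logp_gap = 0.1       # average log pi(token | ...) - log mu(token | ...)
sequence_length = 500

is_ratio = math.exp(per_token_logp_gap * sequence_length)   # pi(y) / mu(y)
print(is_ratio)                # ~5.2e21: a single sample can dominate the gradient estimate
```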
The Core Method: Contrastive Policy Gradient
The researchers behind CoPG propose a clever workaround. Instead of looking at single samples, they look at pairs of samples.
They introduce a new loss function based on the principle of contrast. The idea is to weigh the likelihood of a generation not just by its own reward, but by comparing it to another generation.
Let’s look at the CoPG objective for a pair of responses \(y\) and \(y'\) (the loss is simply its negative):

\[
J_{\text{CoPG}}(\pi) = \mathbb{E}_{y,\, y' \sim \mu}\Big[ \big( R^\pi_\beta(y) - R^\pi_\beta(y') \big) \ln \pi(y) + \big( R^\pi_\beta(y') - R^\pi_\beta(y) \big) \ln \pi(y') \Big]
\]
This looks dense, so let’s break it down. It is essentially a weighted log-likelihood.
- First Term: We look at generation \(y\). We weigh its gradient by the difference between its reward and the reward of \(y'\).
- Second Term: We do the same for \(y'\), weighing it by the difference between its reward and the reward of \(y\).
If \(y\) is much better than \(y'\), the first term becomes positive and large, pushing the model to increase the probability of \(y\). Simultaneously, the second term encourages decreasing the probability of \(y'\).
Why does this solve the Offline problem?
The magic of CoPG is that it allows us to perform Off-Policy Policy Gradient without the exploding importance sampling ratios.
The authors prove (Theorem 1 in the paper) that the unique maximizer of the CoPG objective is exactly the optimal policy \(\pi_*\) that solves the original regularized RL problem:

\[
\arg\max_\pi J_{\text{CoPG}}(\pi) = \pi_*, \qquad \text{with} \quad \pi_*(y) \propto \pi_{\text{ref}}(y)\, \exp\!\left( \frac{R(y)}{\beta} \right)
\]
When we take the gradient of the CoPG objective over a dataset, something fascinating happens. The gradient calculation simplifies into two terms that look like standard Policy Gradient, but with a specific, perfect baseline.
The gradient for a single pair \((y, y')\) is:

\[
\big( R^\pi_\beta(y) - R^\pi_\beta(y') \big)\, \nabla \ln \pi(y) + \big( R^\pi_\beta(y') - R^\pi_\beta(y) \big)\, \nabla \ln \pi(y')
\]
Notice that there are no importance sampling ratios (\(\frac{\pi}{\mu}\)) here. The term \((R^\pi_\beta(y') - R^\pi_\beta(y))\) acts as a contrastive baseline. Because this baseline depends on the reward of the other sample, it stabilizes the gradient naturally.
The “Implementation Trick”
For students and practitioners looking to implement this, you don’t actually need to code the log-likelihood gradients manually. The authors point out a beautiful simplification: the gradient of the CoPG loss is proportional to the negative gradient of the squared difference of regularized rewards:

\[
\nabla J_{\text{CoPG}}(y, y') \;\propto\; -\,\nabla \big( R^\pi_\beta(y) - R^\pi_\beta(y') \big)^2
\]
This means you can implement CoPG by simply minimizing the squared difference between the regularized rewards of the two samples. This is incredibly “supervised-friendly” and works with standard auto-differentiation tools in PyTorch or TensorFlow.
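Here is a minimal PyTorch sketch of that supervised-friendly formulation; the function name, tensor shapes, and batching are assumptions for illustration, not the authors’ reference implementation:

```python
import torch

def copg_pair_loss(reward, reward_prime, logp, logp_prime,
                   logp_ref, logp_ref_prime, beta):
    """CoPG training loss for a batch of response pairs (y, y') to the same prompts.

    All arguments are (batch,) tensors:
      reward, reward_prime        scalar rewards R(y), R(y')
      logp, logp_prime            summed token log-probs under the trainable policy pi
      logp_ref, logp_ref_prime    summed token log-probs under the frozen reference (no grad)

    Minimizing the squared difference of regularized rewards yields, up to a
    constant factor, the same gradient as maximizing the CoPG objective.
    """
    r_beta = reward - beta * (logp - logp_ref)                          # R^pi_beta(y)
    r_beta_prime = reward_prime - beta * (logp_prime - logp_ref_prime)  # R^pi_beta(y')
    return ((r_beta - r_beta_prime) ** 2).mean()
```

A training step then looks like any supervised objective: score both responses of each pair under the trainable policy and the frozen reference, evaluate this loss, and backpropagate.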
Theoretical Unification: Connecting the Dots
One of the strongest aspects of this paper is that it doesn’t just invent a new method; it explains how CoPG relates to existing methods. It turns out CoPG is a generalization of several other algorithms.
1. Generalizing Policy Gradient
If you take the expectation of the CoPG gradient with both samples drawn from the current policy, you recover the standard Policy Gradient (up to a factor of 2):

\[
\mathbb{E}_{y,\, y' \sim \pi}\Big[ \big( R^\pi_\beta(y) - R^\pi_\beta(y') \big) \nabla \ln \pi(y) + \big( R^\pi_\beta(y') - R^\pi_\beta(y) \big) \nabla \ln \pi(y') \Big] = 2\, \nabla J(\pi)
\]
This confirms that CoPG is a mathematically sound extension of classical RL methods into the off-policy (dataset-based) regime.
2. Generalizing RLOO (Reinforce Leave-One-Out)
RLOO is a popular method that uses multiple samples to estimate a baseline. The paper proves that CoPG is effectively an off-policy generalization of RLOO. RLOO usually requires fresh samples; CoPG shows that the same mathematical structure holds for static datasets if you formulate the baseline correctly.
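For reference, the standard on-policy RLOO estimator with \(k\) samples uses the average reward of the other \(k-1\) samples as each sample’s baseline (written here in this post’s notation, leaving out the KL-regularization term):

\[
\widehat{\nabla J}_{\text{RLOO}} = \frac{1}{k} \sum_{i=1}^{k} \Big( R(y_i) - \frac{1}{k-1} \sum_{j \neq i} R(y_j) \Big) \nabla \ln \pi(y_i), \qquad y_1, \dots, y_k \sim \pi
\]

For \(k = 2\) this is, up to a constant factor, exactly the pairwise contrastive gradient above; CoPG swaps in the regularized reward \(R^\pi_\beta\) and lets the pair come from a static dataset \(\mu\) instead of from \(\pi\).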
3. Generalizing IPO (Identity Preference Optimization)
IPO is a direct alignment method used when you have binary preferences (e.g., “win/loss” or “better/worse”).
The authors show that if you take the CoPG loss and replace the arbitrary continuous rewards with binary values (e.g., \(+1/4\) for the preferred response \(y_w\) and \(-1/4\) for the rejected one \(y_l\)), the gradient becomes identical to IPO’s. Plugging those values into the squared-difference form makes this concrete:

\[
\big( R^\pi_\beta(y_w) - R^\pi_\beta(y_l) \big)^2
= \left( \frac{1}{2} - \beta \ln \frac{\pi(y_w)\, \pi_{\text{ref}}(y_l)}{\pi(y_l)\, \pi_{\text{ref}}(y_w)} \right)^{\!2}
= \beta^2 \left( \ln \frac{\pi(y_w)\, \pi_{\text{ref}}(y_l)}{\pi(y_l)\, \pi_{\text{ref}}(y_w)} - \frac{1}{2\beta} \right)^{\!2},
\]

which is the IPO loss up to the constant factor \(\beta^2\).
This is a crucial insight. It implies that IPO is just a special case of CoPG in which the reward is collapsed to a binary win/loss label. CoPG, however, can handle the rich, continuous signals from a reward model (e.g., “This summary is 0.8 accurate, that one is 0.2 accurate”), whereas IPO throws that nuance away.
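As a quick sanity check, the reduction can be verified numerically; the log-probabilities below are arbitrary numbers picked for illustration, and IPO’s temperature is identified with \(\beta\):

```python
# With rewards fixed to +1/4 (chosen) and -1/4 (rejected), the squared difference
# of regularized rewards equals beta^2 times the IPO loss (h - 1/(2*beta))^2,
# where h is IPO's log-ratio margin.
beta = 0.1
logp_w, logp_l = -42.0, -57.0            # log pi(y_w), log pi(y_l)
logp_ref_w, logp_ref_l = -44.0, -55.0    # same responses under pi_ref

r_w = 0.25 - beta * (logp_w - logp_ref_w)           # R^pi_beta(y_w)
r_l = -0.25 - beta * (logp_l - logp_ref_l)          # R^pi_beta(y_l)
copg = (r_w - r_l) ** 2

h = (logp_w - logp_ref_w) - (logp_l - logp_ref_l)   # IPO's log-ratio margin
ipo = (h - 1.0 / (2.0 * beta)) ** 2

print(copg, beta ** 2 * ipo)   # both equal 0.01 up to floating-point rounding
```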
Experiments and Results
Does this theory hold up in practice? The authors tested CoPG in two scenarios: a controlled bandit problem and a large-scale LLM summarization task.
The Toy Bandit Experiment
In this bandit problem, the “agent” simply picks one of three arms, each with a different fixed reward. The goal is to converge to the optimal probability distribution over the arms.
Since this is a simple problem, we can calculate the exact optimal solution. The authors compared CoPG against IPO and Policy Gradient (PG) with different baselines.

- Orange (PG no baseline): Fails completely due to variance.
- Green (PG value baseline): Converges, but to a biased (incorrect) solution because the baseline isn’t perfect for off-policy data.
- Red (IPO): Converges, but also to a biased solution. IPO optimizes for preferences, not the exact underlying reward values.
- Blue (CoPG): Converges to zero regret. It finds the mathematically optimal policy.
This visually demonstrates that if you have access to scalar rewards, treating them as simple preferences (like IPO does) leaves performance on the table.
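If you want to see this for yourself, a toy reproduction takes only a few lines; the arm rewards, logging distribution, and hyperparameters below are made up rather than taken from the paper:

```python
import torch

torch.manual_seed(0)
rewards = torch.tensor([1.0, 0.5, 0.2])          # fixed reward R(arm)
mu = torch.tensor([0.2, 0.5, 0.3])               # logging policy (never updated)
pi_ref = torch.ones(3) / 3                       # uniform reference policy
beta = 0.5

logits = torch.zeros(3, requires_grad=True)      # the trainable policy pi
opt = torch.optim.Adam([logits], lr=0.05)

for step in range(2000):
    # Sample an off-policy pair of arms (y, y') from mu.
    y, y_prime = torch.multinomial(mu, 2, replacement=True)
    logp = torch.log_softmax(logits, dim=0)
    # Regularized rewards R^pi_beta for each arm in the pair.
    r = rewards[y] - beta * (logp[y] - torch.log(pi_ref[y]))
    r_p = rewards[y_prime] - beta * (logp[y_prime] - torch.log(pi_ref[y_prime]))
    loss = (r - r_p) ** 2                         # supervised-friendly CoPG loss
    opt.zero_grad()
    loss.backward()
    opt.step()

# The regularized-RL optimum is pi_ref * exp(R / beta), renormalized.
pi_star = pi_ref * torch.exp(rewards / beta)
pi_star = pi_star / pi_star.sum()
print(torch.softmax(logits, dim=0), pi_star)     # the learned policy should approach pi_star
```

Because the loss only forces the regularized rewards of logged pairs to match, the policy should drift toward \(\pi_{\text{ref}}(y)\exp(R(y)/\beta)\) (renormalized) even though no sample is ever drawn from \(\pi\) during training.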
LLM Summarization
The authors then fine-tuned a Llama2-7B model on the Reddit TL;DR summarization dataset. They trained a separate Reward Model to act as the “ground truth” evaluator.
Can CoPG optimize the reward? First, they simply checked if running CoPG on the offline dataset actually increased the reward scores of the model’s outputs.

The graph above shows that CoPG successfully climbs the reward landscape. The performance depends on \(\beta\) (the KL penalty).
- If \(\beta\) is too low (0.01), the model becomes unstable (reward drops).
- If \(\beta\) is too high (0.3), the model is too constrained by the reference and learns slowly.
- At the best-performing \(\beta\) (0.06), it achieves high, stable rewards.
Comparisons with DPO and IPO
The critical question is: does CoPG do better than the popular DPO and IPO methods?
To test this fairly, they used the same reward model to generate preference labels for DPO and IPO. This gives DPO/IPO the best possible chance to succeed.

The results are clear. CoPG (Green) consistently achieves a higher reward than DPO (Blue) and IPO (Orange).
This validates the hypothesis: when you have a continuous reward signal (like a score from a classifier), converting it into a binary “A > B” preference for DPO/IPO loses information. CoPG uses the full scalar value of the reward to drive learning, resulting in better alignment.
The KL-Reward Trade-off
Finally, there is the question of cost. In RL you usually pay for higher rewards with higher KL divergence (drifting further from the base model). An efficient method should settle on the “Pareto frontier”: getting the maximum reward for a given “budget” of KL divergence.

In this plot, “up and to the left” is better (higher reward, lower KL). The CoPG points (Orange) form a frontier that is superior to both DPO and IPO. DPO, in particular, tends to drift significantly in KL (move right) without gaining proportional reward.
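For readers who want to build this kind of plot themselves, a common recipe (a sketch of one reasonable protocol, not necessarily the paper’s exact measurement) is to sample responses from each checkpoint, average their reward-model scores for the y-axis, and estimate the sequence-level KL for the x-axis as the mean log-probability ratio:

```python
import torch

def estimate_kl_to_ref(logp_pi, logp_ref):
    """Monte-Carlo estimate of KL(pi || pi_ref) from responses sampled from pi.

    logp_pi, logp_ref: (num_samples,) summed token log-probs of each sampled
    response under the trained policy and the frozen reference, respectively.
    """
    return (logp_pi - logp_ref).mean()
```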
Conclusion
Contrastive Policy Gradient fills a major gap in the current LLM alignment toolkit.
For a long time, engineers had to choose between:
- RLHF (PPO): Optimizes arbitrary rewards, but is slow, expensive, and unstable.
- Direct Alignment (DPO/IPO): Fast and stable, but restricted to preference data.
CoPG offers a third path: Offline optimization of arbitrary rewards.
By using a contrastive baseline, CoPG eliminates the variance issues of off-policy learning. It allows you to take a dataset of prompts and scored responses (e.g., “Prompt X -> Code Y [Score: 0.9]”, “Prompt X -> Code Z [Score: 0.4]”) and train a model to maximize those scores without ever running the model during training.
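Concretely, the offline data CoPG consumes could be as simple as the following (field names are hypothetical, chosen only to illustrate the structure):

```python
# Hypothetical offline dataset for CoPG-style training: each record pairs
# two scored responses to the same prompt.
dataset = [
    {
        "prompt": "Write a function that parses a date string.",
        "response_a": "def parse_date(s): ...",   # truncated for brevity
        "score_a": 0.9,
        "response_b": "def parse_date(s): return s",
        "score_b": 0.4,
    },
    # ... more prompts
]
```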
As LLMs move beyond simple chat and into complex domains like coding, biology, and law—where success is measured by specific metrics rather than human “vibes”—methods like CoPG will likely become essential for precise, efficient alignment.