Introduction
If you have followed the explosion of Large Language Models (LLMs) like GPT-4 or Llama 2, you are likely familiar with the concept of Reinforcement Learning from Human Feedback (RLHF). It is the secret sauce that turns a raw, unruly text predictor into a helpful assistant. By using reinforcement learning (RL), we can align models with complex human preferences that are difficult to write down as simple code.
However, there is a fundamental inefficiency at the heart of this process. In a typical RLHF setup, the model generates an entire paragraph or response, and only then does it receive a reward signal (a score indicating how good the response was).
Imagine trying to learn to play the piano, but your teacher only gives you a single “pass” or “fail” grade after you finish a whole concerto, without telling you which specific notes were wrong. This is the sparse reward problem. It creates a massive challenge for the model known as temporal credit assignment: the model has to figure out which specific tokens (words) among the hundreds it generated were responsible for the good or bad score.
In this post, we are doing a deep dive into a research paper titled “Enhancing Reinforcement Learning with Dense Rewards from Language Model Critic”. The researchers propose a novel framework called RELC (Rewards from Language model Critique). The idea is elegant: instead of relying solely on a single score at the end of generation, why not use another LLM to act as a “Critic” that marks up the draft with red ink, giving specific, dense feedback on intermediate steps?
We will explore how this method works, the architecture behind it, and why it drastically improves sample efficiency and generation quality across tasks like sentiment control, detoxification, and summarization.
Background: The Sparse Reward Bottleneck
To understand the solution, we must first understand the problem in the context of Markov Decision Processes (MDP), which is the mathematical framework for RL.
In text generation:
- State (\(s_t\)): The prompt plus all tokens generated so far.
- Action (\(a_t\)): The selection of the next token from the vocabulary.
- Policy (\(\pi_\theta\)): The LLM itself, parameterized by \(\theta\), determining the probability of the next token.
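To make this mapping concrete, here is a minimal PyTorch sketch of generation viewed as an MDP. The `ToyPolicy` class and its sizes are illustrative placeholders (a real setup would use the actual LLM), but the loop shows exactly what "state", "action", and "policy" mean in this context.

```python
import torch

class ToyPolicy(torch.nn.Module):
    """A tiny stand-in for the LLM policy pi_theta: it maps a state
    (the token prefix) to a distribution over the next token."""
    def __init__(self, vocab_size: int = 50, hidden: int = 32):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden)
        self.head = torch.nn.Linear(hidden, vocab_size)

    def forward(self, state_tokens: torch.Tensor):
        # State s_t = prompt + all tokens generated so far (shape: [t])
        h = self.embed(state_tokens).mean(dim=0)   # crude summary of the prefix
        return torch.distributions.Categorical(logits=self.head(h))

policy = ToyPolicy()
state = torch.tensor([3, 17, 8])        # s_0: the (already tokenized) prompt
log_probs = []

for t in range(10):                     # one episode = one full generation
    dist = policy(state)                # pi_theta(. | s_t)
    action = dist.sample()              # a_t: choose the next token
    log_probs.append(dist.log_prob(action))
    state = torch.cat([state, action.unsqueeze(0)])   # s_{t+1} = s_t + a_t
```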
The goal of the agent is to maximize the expected return. In standard policy-gradient methods (on which Proximal Policy Optimization, or PPO, builds), the gradient update looks like this:

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=0}^{T} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right] \]
Here, \(G_t\) represents the return. In text generation tasks, the environment (or a reward model trained on human preferences) usually provides a reward \(r\) only at the final time step \(T\). For every step \(t < T\), the reward is zero.
This sparsity leads to high variance in training. If the model writes a beautiful 100-word story but uses one toxic word at the end, it might get a terrible score. The RL algorithm might erroneously punish the 99 beautiful words because it doesn’t know specifically that the last word was the culprit.
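Here is a tiny numerical illustration of that failure mode (not code from the paper): with a terminal-only reward, every token ends up with the same return and therefore the same credit.

```python
import torch

# Sparse-reward setup: only the final token gets a nonzero reward.
T = 100
rewards = torch.zeros(T)
rewards[-1] = -2.0            # e.g. a classifier penalizes one toxic final word

# Undiscounted returns G_t = sum of future rewards (gamma = 1)
returns = torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])

# returns == tensor([-2., -2., ..., -2.]): all 100 tokens, including the
# 99 good ones, receive exactly the same -2 credit.
```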
Previous attempts to fix this involved “reward shaping” or hiring humans to annotate every sentence step-by-step, which is prohibitively expensive. This paper asks: Can we automate this fine-grained feedback using LLMs themselves?
The RELC Framework
The core contribution of this paper is RELC. It introduces a mechanism to generate dense intrinsic rewards—feedback given at the token or span level—automatically.
The Architecture
The framework divides the RL agent into two distinct modules:
- The Policy Model: This is the LLM we are trying to train (e.g., GPT-2 or Llama 2).
- The Critic Model: This is an LLM (potentially the same model or a stronger one like GPT-4) tasked with evaluating the policy’s output.
As shown in the architectural diagram below, the Critic creates an internal feedback loop. While the environment still provides the final “extrinsic” reward, the Critic analyzes the output and injects “intrinsic” rewards for specific parts of the generation.

How Dense Rewards Are Generated
The process works by treating the Critic as a sophisticated annotator. Here is the step-by-step workflow:
- Generation: The Policy model receives a prompt and generates a full response.
- Extrinsic Scoring: The environment (or an external classifier) gives the response a single scalar score (e.g., “Positive” or “Toxic”).
- Critique: The Critic model is fed a prompt containing:
- The task description.
- Few-shot examples of how to critique.
- The Policy’s generated text.
- (Optionally) The extrinsic reward received.
- Annotation: The Critic generates a textual explanation identifying specific spans of text that are good (contributing to a high reward) or bad (contributing to a low reward).
- Reward Mapping: These textual critiques are mapped back to specific tokens.
- Tokens inside a “positive” span get an intrinsic reward of +1.
- Tokens inside a “negative” span get an intrinsic reward of -1.
This process transforms a single final score into a sequence of feedback signals.
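To make the reward-mapping step above concrete, here is a minimal sketch of how critic-identified spans could be projected onto tokens. The character-offset span format, the crude whitespace tokenizer, and the function name are assumptions for illustration; the paper only specifies that tokens in positive spans get +1 and tokens in negative spans get -1.

```python
import re

def map_critique_to_token_rewards(text, positive_spans, negative_spans):
    """Assign +1/-1 intrinsic rewards to tokens that overlap a
    critic-identified span; all other tokens get 0.

    positive_spans / negative_spans: lists of (start, end) character offsets.
    """
    def overlaps(span_list, start, end):
        return any(start < e and end > s for s, e in span_list)

    tokens, rewards = [], []
    for m in re.finditer(r"\S+", text):          # crude whitespace tokenizer
        tokens.append(m.group())
        if overlaps(positive_spans, m.start(), m.end()):
            rewards.append(+1.0)
        elif overlaps(negative_spans, m.start(), m.end()):
            rewards.append(-1.0)
        else:
            rewards.append(0.0)
    return tokens, rewards

text = "The service was excellent, but the food was disappointing."
good = "service was excellent"
bad = "food was disappointing"
positive_spans = [(text.find(good), text.find(good) + len(good))]
negative_spans = [(text.find(bad), text.find(bad) + len(bad))]

print(map_critique_to_token_rewards(text, positive_spans, negative_spans))
# (['The', 'service', 'was', 'excellent,', 'but', 'the', 'food', 'was',
#   'disappointing.'], [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, -1.0, -1.0, -1.0])
```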
A Concrete Example: Sentiment Control
Let’s say we want the model to generate positive movie reviews. The model generates: “The movie boasts breathtaking visuals, but the story falls flat.”
In a standard setting, a sentiment classifier might see the negative ending and give the whole sentence a score of -2. The model is confused: were the visuals bad? Was “movie” the wrong word?
With RELC, the Critic analyzes the sentence. It identifies “breathtaking visuals” as positive and “story falls flat” as negative.

As visualized above, the system assigns a +1 intrinsic reward to the tokens in “breathtaking visuals” and a -1 to “story falls flat.” The final reward \(r_t\) for each token is a weighted sum of the extrinsic (\(r^{ex}\)) and intrinsic (\(r^{in}\)) rewards:
\[ r(s, a) = \alpha_1 \, r^{ex}(s, a) + \alpha_2 \, r^{in}(s, a) \]

where \(\alpha_1\) and \(\alpha_2\) are hyperparameters balancing the two signals. This allows the RL algorithm to reinforce the good parts (“breathtaking visuals”) even if the overall sentence failed the task.
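A quick sketch of this combination for the movie-review example, with placeholder \(\alpha\) values and a crude whitespace token split (neither is from the paper):

```python
def combine_rewards(intrinsic, extrinsic_final, alpha_1=1.0, alpha_2=1.0):
    """Per-token reward r_t = alpha_1 * r_ex + alpha_2 * r_in, where the
    extrinsic score only arrives on the final token of the generation."""
    rewards = [alpha_2 * r_in for r_in in intrinsic]
    rewards[-1] += alpha_1 * extrinsic_final
    return rewards

# "The movie boasts breathtaking visuals, but the story falls flat."
# Critic: +1 on "breathtaking visuals", -1 on "the story falls flat";
# sentence-level sentiment classifier: -2 (extrinsic).
intrinsic = [0, 0, 0, +1, +1, 0, -1, -1, -1, -1]
print(combine_rewards(intrinsic, extrinsic_final=-2.0))
# [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, -1.0, -1.0, -1.0, -3.0]
```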
Integration with PPO
The beauty of RELC is that it doesn’t require reinventing the RL algorithm. It simply modifies the reward signal fed into standard algorithms like PPO. The advantage function \(\hat{A}_t\), which estimates how good an action was compared to the average, is calculated using these new dense rewards:
\[ \hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \]
And the policy is updated using the standard clipped surrogate objective to ensure stability:
\[ L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[ \min\!\Big( \rho_t(\theta)\, \hat{A}_t,\; \operatorname{clip}\big(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \Big) \right], \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \]
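For reference, here is what that machinery looks like in code: generic GAE and the clipped PPO loss under the usual definitions, operating on the now-dense per-token rewards. This is standard PPO plumbing, not an implementation released with the paper.

```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """rewards: [T] per-token rewards (now dense thanks to the Critic)
    values:  [T+1] value estimates V(s_0)..V(s_T), with V(s_T)=0 at episode end."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)            # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped)) # maximize -> negate for a loss
```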
Experimental Results
The researchers tested RELC on three distinct text generation tasks: Sentiment Control, Detoxification, and Summarization. They explored two configurations:
- Teacher-Student: A smaller Policy model (GPT-2 Large) guided by a powerful Critic (GPT-3.5).
- Self-Critique: A single model (Llama 2 7B) acting as both Policy and Critic.
1. Sentiment Control
The Task: Generate text with positive sentiment starting from neutral or negative prompts (using the IMDB dataset).
The Result: RELC demonstrated significantly better sample efficiency. “Sample efficiency” refers to how much training data (or how many steps) the model needs to reach a certain performance level.

In the graph above (Figure 3a), look at the red line (RELC) versus the blue line (standard Extrinsic rewards only). The RELC agent learns much faster, reaching a high reward almost immediately. This confirms that telling the model which words are good helps it learn drastically faster than just telling it “good job” at the end.
Table 1 below details the quantitative metrics. RELC achieves the highest percentage of positive continuations (59.06%) while maintaining fluency (low perplexity).

2. Language Model Detoxification
The Task: Given a prompt that might trigger toxic output (from the RealToxicityPrompts dataset), generate a non-toxic continuation.
The Result: This is a harder task because “toxicity” can be subtle. The researchers found that RELC reduced toxicity more effectively than baselines.

Figure 4 shows the learning curves. Interestingly, Figure 4b (Self-Critique using Llama 2) shows that a model can improve itself. Llama 2 is capable of identifying toxicity in its own previous outputs and using that critique to update its own policy.
The results in Table 2 are striking. RELC reduces the probability of generating toxic content to 0.8%, compared to 4.4% for standard PPO.

The authors also compared RELC against a method called “Fine-Grained RLHF,” which uses costly API queries to get partial rewards. RELC outperformed this method as well (Table 3), achieving lower toxicity scores.

3. Abstractive Summarization
The Task: Summarize Reddit posts (TL;DR dataset). This is a complex task requiring content understanding, coherence, and factuality.
The Setup: A reward model was trained on human preferences to act as the extrinsic signal. The Critic (GPT-3.5) provided intrinsic feedback on coverage and accuracy.
The Result: Once again, RELC proved superior in sample efficiency.

In Figure 5, we see the preference score (how much the reward model likes the summary) skyrocketing for the RELC method (blue line in 5b) compared to the extrinsic-only baseline (orange).
But do humans actually prefer the output? Automatic metrics like ROUGE are notoriously unreliable for summarization. The authors conducted a human evaluation (Figure 6) assessing Coverage, Factuality, and Coherence.

The red bars (Ours/RELC) consistently beat the PPO baseline (orange) and Supervised Fine-Tuning (green). The gain in Factuality is particularly notable—suggesting that the Critic helps the model hallucinate less by pinpointing unsupported statements during training.
Why Does This Work? (Analysis)
You might wonder: Is the improvement just because we are injecting more numbers into the reward function? Does the Critic actually need to be smart?
The Necessity of an Accurate Critic
To test this, the researchers ran an ablation study where they assigned intrinsic rewards randomly (Random Intrinsic Rewards).

As shown in Figure 7, the random rewards (orange line) performed worse than the standard baseline. This proves that the accuracy of the critique matters. It is not just about signal density; it is about correct credit assignment. The Critic must accurately identify the good and bad spans.
Computational Efficiency
Adding a Critic model (especially a large one like GPT-3.5 or Llama 2) increases the inference cost during training because you have to run a second forward pass to generate the critique. Is it worth it?

Figure 8 plots performance against FLOPs (floating-point operations), a measure of total compute. Even though RELC costs more compute per step, it needs far fewer steps to reach a given score, so it is more efficient overall: the red line (RELC) achieves higher scores with fewer total FLOPs than the baseline (blue line). You spend a little more compute on the Critic to save a massive amount of compute on training steps.
Conclusion
The RELC framework addresses one of the most persistent headaches in Reinforcement Learning for NLP: the sparsity of rewards. By introducing a Critic LLM into the loop, the researchers successfully automated the process of dense reward shaping.
Key Takeaways:
- Dense Rewards work: Providing feedback on specific tokens/spans solves the credit assignment problem better than a single sentence-level score.
- Critics can be automated: We don’t need humans to annotate every step. Strong LLMs (or even the policy model itself) can act as effective critics.
- Efficiency and Quality: The method yields models that learn faster (sample efficiency) and produce better outputs (higher human preference scores).
This approach opens up exciting avenues for “Self-Correcting” AI. If models can effectively critique their own outputs and learn from those critiques via RL, we move a step closer to autonomous improvement without needing infinite amounts of human-labeled data.
References & Hyperparameters
For those interested in replicating these results, the paper provides detailed hyperparameters for the experiments.
Separate hyperparameter tables are given for each task: sentiment control, detoxification, and summarization.