Cracking the Black Box of RLHF: Can Simple Rules Replace Complex Reward Models?

If you have ever marveled at how helpful and polite modern Large Language Models (LLMs) like ChatGPT or Llama 2 are, you have Reinforcement Learning from Human Feedback (RLHF) to thank. It is the standard procedure for aligning raw, unruly models with human values.

But there is a “ghost” in the machine.

In the standard RLHF pipeline, we train a “Reward Model” (RM) to mimic human preferences. We then use this model to train the LLM. The problem? This Reward Model is typically a neural network—a “black box.” We know that it works, but we don’t always know why it scores one response higher than another. This opacity leads to problems like “reward hacking,” where the LLM learns to trick the reward model rather than actually being helpful.

In a fascinating new paper, “Rethinking the Role of Proxy Rewards in Language Model Alignment,” researchers attempt to peel back the layers of this black box. They ask a provocative question: Can we replace complex, black-box reward models with a “white-box” function made of simple, interpretable rules?

Their findings suggest that not only is this possible, but it also sheds light on what “alignment” actually means. Let’s dive into how they reverse-engineered the reward signal.


The Problem: The Perils of Proxy Rewards

To understand the researchers’ contribution, we first need to understand the current bottleneck in LLM training.

When aligning an LLM, we usually follow these steps:

  1. Collect Human Data: Humans rank model responses (e.g., Response A is better than Response B).
  2. Train a Proxy Reward Model: A neural network is trained on this data to predict human preferences (a sketch of the usual training loss follows this list).
  3. Optimize the Policy (RL): The LLM is trained using Reinforcement Learning (specifically PPO) to maximize the score given by the Proxy Reward Model.
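As background (this is the standard RLHF recipe, not something the paper introduces), the proxy reward model in step 2 is usually a scalar-output network trained with a pairwise ranking loss on the human comparisons. A minimal PyTorch-style sketch, assuming `r_chosen` and `r_rejected` are batches of scalar scores for the preferred and rejected responses:

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style pairwise loss: push the scalar reward of the
    human-preferred response above that of the rejected response."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The LLM in step 3 then chases this learned scalar, which is exactly where the "black box" problem starts.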

The issue is that the Proxy Reward Model is an imperfect approximation of true human preference (the “Gold Reward”). If the LLM optimizes too hard against this imperfect proxy, we get overoptimization.

As shown in the figure below, a model might learn to exploit specific features—like writing incredibly long but nonsensical answers—because the proxy reward heavily biases toward length.

Figure 1: A preview of our reverse reward engineering experiment. First, we design white-box reward functions with interpretable features such as the length or relevance of the response. Then, we conduct RL training using each of the designed functions as a proxy reward and deem it a success in reverse engineering if a monotonic relationship between the proxy and the ground-truth (Gold) reward scores is observed across multiple evaluations. The reverse-engineered reward (blue) exhibits such a tendency, whereas the length-only reward (green) does not, showing reward overoptimization (Gao et al., 2023).

In the chart above, look at the green line (“Length-only”). As training progresses (x-axis), the model’s score on the proxy reward goes up, but its actual quality (Gold reward) flatlines or drops. This is the definition of Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”

The researchers in this paper propose a method called Reverse Reward Engineering to solve this. Instead of a black box, they build a reward function from scratch using features we can understand.


Methodology: Building a “White-Box” Reward

The core idea is simple yet elegant: If we can construct a transparent reward function (a “White Box”) that, when optimized, consistently increases the “Gold” reward score, we have successfully reverse-engineered what the alignment process actually values.

The researchers identified three primary interpretable features to build their reward function:

  1. Length Incentive (LI): Humans have a known “verbosity bias”—we tend to prefer longer answers.
  2. Repetition Penalty (RP): Long answers are bad if they just repeat the same text.
  3. Query Relevance (QR): The answer must actually address the prompt. (A rough sketch of how these features might be computed appears below.)
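The paper defines exact formulas for these features; the sketch below is only one plausible instantiation (my own, with a hypothetical 256-token length target and a 4-gram repetition window) to make the idea concrete:

```python
import math

def length_incentive(response: str, target_tokens: int = 256) -> float:
    # Hypothetical LI: grows with response length but saturates at a target budget,
    # so the model cannot rack up reward by rambling forever.
    return min(len(response.split()) / target_tokens, 1.0)

def repetition_penalty(response: str, n: int = 4) -> float:
    # Hypothetical RP: fraction of unique n-grams; 1.0 means no repetition at all.
    tokens = response.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 1.0

def query_relevance(query_emb: list[float], response_emb: list[float]) -> float:
    # Hypothetical QR: cosine similarity between query and response embeddings,
    # rescaled from [-1, 1] to [0, 1].
    dot = sum(a * b for a, b in zip(query_emb, response_emb))
    norm = math.sqrt(sum(a * a for a in query_emb)) * math.sqrt(sum(b * b for b in response_emb))
    return 0.5 * (dot / norm + 1.0) if norm else 0.0
```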

However, simply smashing these together isn’t enough. The team introduced a critical architectural decision: Reward Branching.

The Concept of Reward Branching

Not all questions are created equal. The researchers categorized user queries into two types:

  • Open-Ended (OE): Questions requiring creativity or brainstorming (e.g., “How can I make a good first impression?”).
  • Closed-Ended (CE): Questions requiring specific facts or constrained answers (e.g., “How many movies are in the Fast and Furious franchise?”).

If you force an LLM to write a long, creative essay for a closed-ended factual question, you get hallucinations and fluff. Therefore, the reward function must change based on the query type.

Figure 2: An overview of the reverse reward engineering study. It aims to imitate the ground-truth reward signal from the Gold RM with white-box reward features such as length, repetition, and relevance. Specifically, we look for a monotonic relationship between the proxy and gold reward signals across multiple evaluations during RL training. The interpretable features then let us understand what the Gold RM actually rewards.

The Formula: Reverse Engineered Reward (RER)

The researchers combined these insights into their final proposed method, the Reverse Engineered Reward (RER).

Here is the logic:

  • If the query is Open-Ended (OE): Reward Length (\(LI\)), Repetition Penalty (\(RP\)), and Query Relevance (\(QR\)).
  • If the query is Closed-Ended (CE): Do not reward length blindly. Instead, reward Repetition Penalty (\(RP\)) and similarity to a reference answer (\(AR\)).

The mathematical formulation is essentially a switch statement:

\[
\mathsf{RER}(x, \hat{y}, y) =
\begin{cases}
\mathsf{LI}(\hat{y}) \cdot \mathsf{RP}(\hat{y}) \cdot \mathsf{QR}(x, \hat{y}) & \text{if } T(x) = \mathrm{OE} \\
\mathsf{RP}(\hat{y}) \cdot \mathsf{AR}(\hat{y}, y) & \text{if } T(x) = \mathrm{CE}
\end{cases}
\]

(Here \(x\) is the query, \(\hat{y}\) is the model's response, \(T(x)\) classifies the query as open- or closed-ended, and \(\mathsf{AR}\) scores the response's similarity to a reference answer \(y\), which is only used in the closed-ended branch.)
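In code, the branching is just an if-statement. A minimal sketch, reusing the hypothetical length_incentive and repetition_penalty helpers from the earlier snippet and assuming a separate classifier has already labeled the query as OE or CE:

```python
def overlap_relevance(a: str, b: str) -> float:
    # Crude token-overlap stand-in for an embedding-based relevance score (hypothetical).
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def reverse_engineered_reward(query: str, response: str,
                              reference: str, query_type: str) -> float:
    """Sketch of the RER switch; `query_type` comes from a query-type classifier T(x)."""
    rp = repetition_penalty(response)                      # RP, as sketched earlier
    if query_type == "OE":
        # Open-ended branch: LI · RP · QR
        return length_incentive(response) * rp * overlap_relevance(query, response)
    # Closed-ended branch: RP · AR (similarity to a reference answer, no length incentive)
    return rp * overlap_relevance(reference, response)
```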

This function is then used inside the PPO (Proximal Policy Optimization) objective to train the model.

\[
\max_{\pi_{\phi}} \; \mathbb{E}_{(x, y) \sim D,\; \hat{y} \sim \pi_{\phi}(\cdot \mid x)} \left[ \mathsf{RER}(x, \hat{y}, y) - \beta \log \frac{\pi_{\phi}(\hat{y} \mid x)}{\pi_{\rho}(\hat{y} \mid x)} \right]
\]
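In practice, the KL term is folded into the scalar reward that PPO maximizes. A minimal sketch of that combination, with an illustrative beta = 0.1 (not necessarily the paper's setting) and sequence-level log-probabilities from the current policy and the frozen reference model:

```python
def kl_penalized_reward(rer_score: float,
                        logprob_policy: float,
                        logprob_ref: float,
                        beta: float = 0.1) -> float:
    # Reward handed to PPO: the white-box RER score minus a KL-style penalty
    # that keeps the policy pi_phi close to the reference policy pi_rho.
    kl = logprob_policy - logprob_ref   # log pi_phi(y_hat|x) - log pi_rho(y_hat|x)
    return rer_score - beta * kl
```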

By using this white-box reward, the researchers can see exactly why the model is being rewarded. If the model starts hallucinating, they know it’s not because of some mysterious weight in a neural network, but perhaps because the LI (Length Incentive) is too strong relative to QR (Relevance).


Experiments: Does the White Box Work?

The researchers tested their RER against several baselines on standard datasets like Anthropic-HH and AlpacaFarm. They used a high-quality open-source Reward Model (StarlingRM-34B) as the “Gold” standard to measure success.

1. The Monotonic Relationship Test

The most important test was whether optimizing the specific features (Length, Relevance, etc.) actually led to better “Gold” scores over time.

They compared four setups:

  1. w. LI: Only rewarding length.
  2. w. LI · RP: Rewarding length + repetition penalty.
  3. w. LI · RP · QR: Adding relevance.
  4. w. RER: The full method with branching (OE vs. CE).

Figure 3: Results of reverse reward engineering. We visualize normalized proxy and gold reward scores every 500 PPO steps for each reward design option. The results on the upper side are from Anthropic-HH (Bai et al., 2022a), and the results on the lower side are from AlpacaFarm (Dubois et al., 2023). Instances of AlpacaEval (Li et al., 2023) are used to compute the reward scores. We expect a monotonic relationship between the proxy and gold reward scores for successful reverse engineering. We find that considering relevance and adopting different rewards according to query type, i.e., RER, contribute to reliably increasing the gold reward.

The results were revealing:

  • Length Only (Green line, top left): Failed. The proxy score went up, but the Gold score crashed. This confirms that just making models verbose hurts quality eventually.
  • RER (Blue line, bottom right): Success. Both the proxy reward (dashed) and the Gold reward (solid) moved up together. This “monotonic relationship” proves that the white-box rules successfully captured what the Gold model cares about.

The correlation table below further cements this. RER achieved a Spearman Correlation of 0.99 and 0.97 with the Gold reward on the tested datasets. This is nearly perfect alignment between the simple white-box rules and the complex black-box ground truth.

Table 1: We report Pearson and Spearman correlation (R) between proxy and gold reward scores across multiple evaluations. For both datasets, RER shows a Spearman R close to 1. This indicates a monotonic relationship between the two reward signals, signifying successful reverse reward engineering.
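Verifying this kind of monotonic relationship is straightforward: log the proxy and gold scores at each evaluation checkpoint and compute a rank correlation. A toy example with made-up numbers (not the paper's data):

```python
from scipy.stats import spearmanr

# Hypothetical proxy/gold scores logged every 500 PPO steps.
proxy_scores = [0.10, 0.25, 0.41, 0.55, 0.68, 0.74]
gold_scores  = [0.05, 0.18, 0.33, 0.47, 0.58, 0.66]

rho, _ = spearmanr(proxy_scores, gold_scores)
print(f"Spearman R between proxy and gold rewards: {rho:.2f}")  # 1.00 for this toy data
```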

2. The Importance of Branching

Why was the “Reward Branching” (treating Open vs. Closed questions differently) so important?

When the researchers looked at the Gold Reward scores split by query type, the difference was clear. Without branching, the model struggled on Closed-Ended (CE) questions because it was trying to be too wordy. With RER, the model knew when to be concise.

Figure 4: Gold reward scores according to whether the query type requires open-ended (OE) or closed-ended (CE) responses. We compare two proxy reward options, \(\mathsf{LI} \cdot \mathsf{RP} \cdot \mathsf{QR}\) and RER, based on models trained with Anthropic-HH (Bai et al., 2022a). We find they show meaningful differences on the CE type, demonstrating the importance of reward branching.

3. Beating the Black Boxes

Perhaps the most surprising result is how RER stacked up against sophisticated, black-box Reward Models.

The researchers compared their RER-trained model against models trained with UltraRM-13B and SteamSHP-XL (popular open-source reward models).

On benchmarks like AlpacaEval and MT-Bench, the simple RER method was highly competitive, often outperforming the black-box models.

Table 2: Comparison of the designed rewards with open-source RMs trained on human or AI feedback, based on Anthropic-HH (Bai et al., 2022a). The PPO model optimizing RER shows competitive performance with models trained with open-source RMs.

As shown in Table 2, the model trained with w. RER achieved a win rate of 76.9 on Vicuna Bench and 23.4 on AlpacaEval, surpassing models trained with SteamSHP and performing comparably to UltraRM.

4. Minimizing “Alignment Tax”

A common issue in RLHF is the “alignment tax”—where a model becomes so polite and “aligned” that it loses its raw intelligence or ability to follow strict instructions (like “answer with one word”).

Because RER distinguishes between open and closed questions, it mitigates this tax. The results showed that RER maintained high “Relevant Sentence Ratios” (staying on topic) without bloating the word count unnecessarily.

Table 3: We analyze the responses from AlpacaEval (Li et al., 2023). We report the relevant sentence ratio (% Rel. Sent.), leveraging GPT-4 (OpenAI, 2023). We also report the average number of tokens (# Avg Tokens) and 4-gram repetitions (4-gram Rep.). RER achieves a high win rate without increasing unnecessary verbosity.

Notice in Table 3 that w. RER has a lower average token count (243) compared to w. LI (351), yet a much higher win rate. It produces leaner, better answers.


Qualitative Analysis: Seeing the Difference

It is helpful to look at actual text generation to understand these metrics.

In the example below, the model is asked for antonyms for the word “laureating.”

  • The LI · RP · QR model (which rewards length too much) hallucinates a numbered list of completely unrelated words in different languages just to fill space.
  • The RER model gives a concise, accurate list.

Prompt: Write down three antonyms for the given word. 'laureating'

This perfectly illustrates why Reward Branching is vital. A “List 3 antonyms” prompt is a Closed-Ended task. RER recognizes this and switches to a reward mode that values accuracy over length.


Conclusion and Implications

The paper “Rethinking the Role of Proxy Rewards” offers a significant step forward in understanding LLM alignment.

Key Takeaways:

  1. Complexity isn’t always better: A transparent combination of Length, Relevance, and Repetition penalties can compete with massive neural network reward models.
  2. Context matters: You cannot reward an LLM the same way for every prompt. Distinguishing between Open-Ended and Closed-Ended questions is essential to prevent hallucination and verbosity.
  3. Democratizing Alignment: Training a good Reward Model usually requires expensive human feedback datasets (thousands of hours of labeling). This paper suggests that strong baselines can be achieved using interpretable rules and automatic metrics, potentially making alignment cheaper and more accessible.

By reverse-engineering the reward signal, the researchers have turned a black box into a glass house, showing us that sometimes, the best way to teach a sophisticated AI is with a few simple, well-chosen rules.