DogeRM: How to Teach Reward Models New Tricks Without New Data

In the rapidly evolving world of Large Language Models (LLMs), we have witnessed titans like GPT-4 and Gemini perform incredible feats, from writing poetry to solving complex coding problems. But raw intelligence isn’t enough; these models need to be aligned with human intent. We want them to be helpful, harmless, and honest.

The standard way to achieve this is through Reinforcement Learning from Human Feedback (RLHF). Central to this process is a component called the Reward Model (RM)—a digital judge that scores the AI’s responses.

But there is a bottleneck. Training a good Reward Model requires massive amounts of “preference data” (humans looking at two answers and picking the better one). If you are building a general chatbot, this is manageable. But what if you want a Reward Model that is an expert in advanced calculus or C++ programming? You need domain experts to label that data, which is incredibly expensive and slow.

Enter DogeRM (Domain knowledge merged Reward Model).

In a recent paper from National Taiwan University, researchers proposed a clever “hack” to bypass the need for expensive domain-specific preference data. Their method? Merging a general-purpose Reward Model with a domain-specific expert model.

In this post, we will break down the DogeRM paper. We will explore how it works, the mathematics behind the merging, and why this might be the future of specialized AI alignment.


The Background: The Bottleneck of RLHF

To understand why DogeRM is necessary, we first need to look at how Reward Models are built.

In a standard RLHF pipeline, we train a model to act as a judge. We feed it a user instruction (\(x\)) and two possible responses: a “chosen” response (\(y_c\)) and a “rejected” response (\(y_r\)). The model needs to assign a scalar score to these responses such that the chosen one gets a higher score than the rejected one.

The loss function used to train this model looks like this:

\[
\mathcal{L}_{\text{RM}} = -\,\mathbb{E}_{(x,\, y_c,\, y_r) \sim \mathcal{D}}\Big[\log \sigma\big(r(x, y_c) - r(x, y_r)\big)\Big]
\]

Here, \(r(x, y)\) is the reward score and \(\sigma\) is the sigmoid function. The goal is to maximize the gap between the score of the good answer and the score of the bad answer.
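To make this concrete, here is a minimal PyTorch sketch of the pairwise objective, assuming the Reward Model already returns one scalar score per response (this is an illustration, not the paper's training code):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: -log sigmoid(r(x, y_c) - r(x, y_r)), averaged over the batch."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy batch of three preference pairs
chosen = torch.tensor([1.2, 0.4, 2.0])    # r(x, y_c) for the chosen responses
rejected = torch.tensor([0.3, 0.9, 1.1])  # r(x, y_r) for the rejected responses
loss = reward_model_loss(chosen, rejected)  # shrinks as chosen scores pull ahead of rejected ones
```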

The Problem: The “Expert” Gap

Collecting generic data (e.g., “Write a recipe for cake”) is easy. Crowd workers can tell you which recipe looks better.

However, collecting domain-specific data is hard.

  • Math: Who decides which complex proof is more elegant? A mathematician.
  • Code: Who decides which Python function is more efficient and bug-free? A senior developer.

Expert time is expensive. As a result, most open-source Reward Models are “generalists.” They are good at tone and safety but often fail to catch subtle bugs in code or errors in logic.

The DogeRM Hypothesis

The researchers asked a simple question: We already have open-source models that are experts in math and code (Supervised Fine-Tuned, or SFT models). Can we just “inject” their brains into a general Reward Model?

Instead of training a new RM from scratch with expensive data, DogeRM merges the weights of a General Reward Model with a Domain-Specific SFT Model.


The Methodology: Merging Brains

The core of the DogeRM paper is Model Merging. This is a technique where you take the weights (parameters) of two different neural networks and combine them mathematically to create a single model that (hopefully) retains the skills of both.

The Architecture

The framework is surprisingly intuitive. Imagine two LLaMA-based models (the architecture used in the paper):

  1. The Judge (General RM): Knows what humans generally prefer (politeness, structure).
  2. The Expert (Domain SFT): Knows how to solve equations or write loops.

DogeRM blends them to create a Domain RM.

Figure 1: The framework of DogeRM, illustrating the merging of a general RM with a domain-specific LM to create a domain-specific RM. All icons used in this figure are sourced from https://www.flaticon.com/.

As shown in Figure 1 above, the process takes the “General RM” and the “Domain SFT” and merges their parameters. The result is a model that can look at a math problem and judge it not just on tone, but on correctness.

Step-by-Step Mathematical Merging

Let’s break down the merging implementation. We start with two models initialized from the same base (e.g., LLaMA-2-7B).

1. Defining the Parameters

First, we identify the parameters of the Domain Expert (SFT) model:

\[
\theta_{\text{SFT}} = \{\theta_{emb}^{\text{SFT}},\ \theta_{trans}^{\text{SFT}},\ \theta_{dec}^{\text{SFT}}\}
\]

This set includes the embeddings (\(\theta_{emb}\)), the transformer layers (\(\theta_{trans}\)), and the decoding head (\(\theta_{dec}\)).

Next, we look at the General Reward Model (RM):

\[
\theta_{\text{RM}} = \{\theta_{emb}^{\text{RM}},\ \theta_{trans}^{\text{RM}},\ \theta_{reg}^{\text{RM}}\}
\]

Notice a key difference here: The RM has a regression head (\(\theta_{reg}\)) instead of a decoding head. This head is what outputs the numerical score.
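To make that difference concrete, here is a tiny PyTorch sketch of the two heads, using LLaMA-2-7B's dimensions (hidden size 4096, vocabulary size 32,000); the variable names are purely illustrative:

```python
import torch.nn as nn

hidden_size, vocab_size = 4096, 32000  # LLaMA-2-7B dimensions

# The SFT model's decoding head maps each hidden state to next-token logits.
decoding_head = nn.Linear(hidden_size, vocab_size, bias=False)

# The RM's regression head maps a hidden state to a single scalar reward.
regression_head = nn.Linear(hidden_size, 1, bias=False)
```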

2. Merging the Embeddings

The first step is to merge the token embeddings. Because the two models were fine-tuned on different data, and their vocabularies may not even line up exactly, the embeddings are merged token by token.

For tokens that exist in both models (\(t_i\)), DogeRM uses a weighted average controlled by a hyperparameter \(\lambda\) (lambda):

\[
\theta_{emb}^{\text{merged}}[t_i] = \lambda\, \theta_{emb}^{\text{SFT}}[t_i] + (1 - \lambda)\, \theta_{emb}^{\text{RM}}[t_i]
\]

Here, \(\lambda\) represents how much we trust the Domain Expert (SFT).

  • If \(\lambda = 1\), we become the SFT model.
  • If \(\lambda = 0\), we stay as the General RM.

For tokens that are unique to one model (perhaps the math model learned a new symbol), the method simply keeps them as they are:

\[
\theta_{emb}^{\text{merged}}[t_i] =
\begin{cases}
\theta_{emb}^{\text{SFT}}[t_i] & \text{if } t_i \text{ appears only in the SFT vocabulary} \\
\theta_{emb}^{\text{RM}}[t_i] & \text{if } t_i \text{ appears only in the RM vocabulary}
\end{cases}
\]
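A minimal sketch of this embedding merge, assuming each vocabulary is available as a token-to-row-index dict and each embedding matrix as a tensor (the names are placeholders, not the paper's code):

```python
import torch

def merge_embeddings(emb_sft: torch.Tensor, emb_rm: torch.Tensor,
                     vocab_sft: dict, vocab_rm: dict, lam: float = 0.3) -> dict:
    """Return a token -> merged-embedding mapping.

    Shared tokens get the lambda-weighted average; tokens unique to one model
    keep that model's embedding unchanged.
    """
    merged = {}
    for token in set(vocab_sft) | set(vocab_rm):
        if token in vocab_sft and token in vocab_rm:
            merged[token] = lam * emb_sft[vocab_sft[token]] + (1 - lam) * emb_rm[vocab_rm[token]]
        elif token in vocab_sft:
            merged[token] = emb_sft[vocab_sft[token]]
        else:
            merged[token] = emb_rm[vocab_rm[token]]
    return merged
```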

3. Merging the Transformer Layers

The “brain” of the model lies in the transformer layers. This is where the reasoning happens. DogeRM applies the same linear interpolation (weighted averaging) to these layers:

\[
\theta_{trans}^{\text{merged}} = \lambda\, \theta_{trans}^{\text{SFT}} + (1 - \lambda)\, \theta_{trans}^{\text{RM}}
\]

This is the most critical step. By averaging the weights, the researchers are essentially overlaying the “neural pathways” of the math expert onto the judge.
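The same interpolation can be written directly over the models' state dicts. A sketch, assuming both checkpoints share an architecture so their parameter names line up (in practice you would restrict the loop to the transformer blocks and handle the embeddings as in the previous step):

```python
def merge_transformer_layers(sd_sft: dict, sd_rm: dict, lam: float = 0.3) -> dict:
    """Linearly interpolate every parameter the two state dicts share."""
    merged = dict(sd_rm)  # start from the RM so RM-only parameters (its head) survive untouched
    for name, w_sft in sd_sft.items():
        if name in sd_rm and sd_rm[name].shape == w_sft.shape:
            merged[name] = lam * w_sft + (1 - lam) * sd_rm[name]
    return merged
```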

4. Assembling the Final DogeRM

Finally, we assemble the merged model. We take the merged embeddings, the merged transformer layers, and—crucially—we keep the regression head from the original Reward Model.

\[
\theta_{\text{DogeRM}} = \{\theta_{emb}^{\text{merged}},\ \theta_{trans}^{\text{merged}},\ \theta_{reg}^{\text{RM}}\}
\]

We must use the RM’s regression head because the SFT model doesn’t have one (it was trained to generate text, not scores). This regression head acts as the interface that translates the merged model’s “thoughts” into a reward score.
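Once the merged weights are saved and reloaded, the regression head is what turns a prompt/response pair into a number. Here is a sketch of that scoring step, assuming the merged checkpoint lives at a placeholder path and is exposed as a single-label sequence classifier (a common way open-source RMs implement the regression head):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MERGED_PATH = "path/to/merged-dogerm"  # placeholder for a checkpoint saved after the merge

tokenizer = AutoTokenizer.from_pretrained(MERGED_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MERGED_PATH, num_labels=1)
model.eval()

def reward(prompt: str, response: str) -> float:
    """Score a prompt/response pair with the merged model's regression head."""
    # Real RMs usually apply the chat template they were trained with;
    # plain concatenation keeps the sketch simple.
    inputs = tokenizer(prompt + "\n" + response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()
```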


Experimental Setup

To prove this works, the authors conducted extensive experiments.

  • Base Models: They used LLaMA-2-7B and Mistral-7B.
  • General RM: Trained on the UltraFeedback dataset.
  • Domain Experts:
      • Math: MetaMath-7B, MAmmoTH-7B.
      • Code: A custom fine-tuned LLaMA model (using OSS-Instruct and Magicoder).
  • Evaluation: They tested the models on benchmarks like RewardBench and Auto-J Eval. They also performed “Best-of-N” sampling on GSM8K (Math) and MBPP (Code).

Results: Does it Work?

The results were impressive. By simply merging weights—without any additional training on preference data—the models became significantly better at judging domain-specific tasks.

Benchmark Performance

Let’s look at Table 1. This compares the base LLaMA-2 Reward Model against the DogeRM versions merged with MetaMath, MAmmoTH, and a Code Model.

Table 1: Performance comparison across various benchmarks.

Key Takeaways from the Data:

  1. Reasoning Boost: Look at the “Reasoning” column under RewardBench. The base model (row a) scores 78.9. The merged models (rows d, e, f) jump to 85.7, 84.1, and 84.3. That is a massive improvement.
  2. Domain Specificity: Merging with a Math model (MetaMath) gave the biggest boost in Math tasks. Merging with a Code model gave the best results in Code tasks. This confirms the hypothesis that domain knowledge is being successfully transferred.
  3. No Catastrophic Forgetting: Importantly, the performance on “Chat” (general conversation) remained very high (around 95-96%). The model didn’t become “stupid” at general conversation just because it learned math.

Best-of-N Sampling

Benchmarks are nice, but does the Reward Model actually help an AI generate better answers? To test this, the researchers used Best-of-N sampling.

  • How it works: The AI generates \(N\) different answers to a problem. The Reward Model scores all of them and picks the winner. If the Reward Model is smart, it will pick the correct answer, effectively boosting the AI’s success rate.
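In code, the selection step is just an argmax over reward scores. A sketch, where `generate` stands in for the policy model's sampling call and `reward` for the merged RM's scoring function (both placeholders):

```python
def best_of_n(prompt: str, generate, reward, n: int = 16) -> str:
    """Sample n candidate answers and keep the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(prompt, answer))
```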

Figure 2: Best-of-N results showing accuracy improvements.

In Figure 2 (a) (left chart), the red and orange lines represent DogeRM. As we generate more responses (moving right on the x-axis), the accuracy on the GSM8K math dataset climbs significantly higher than the baseline (green line).

This means DogeRM is much better at recognizing a correct math answer when it sees one.

The “Sweet Spot” (\(\lambda\))

You might be wondering: “What is the perfect mix ratio?” The researchers analyzed the impact of \(\lambda\) (the weight given to the Domain Expert).

Figure 3: The impact of different values of lambda on RewardBench.

Looking at the charts above:

  • The Peak: The accuracy usually peaks when \(\lambda\) is between 0.2 and 0.5.
  • The Drop-off: If \(\lambda\) gets too high (approaching 1.0), performance crashes. This makes sense—if \(\lambda=1\), the model becomes the SFT Expert entirely. The SFT Expert has no idea how to be a Reward Model (it doesn’t know how to use the regression head), so it produces garbage scores.

You need enough Domain Expert weight to gain knowledge, but enough General RM weight to maintain the “judging” capability.
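Since the best ratio depends on the models and the domain, \(\lambda\) is something you would sweep on a held-out preference set rather than fix in advance. A hedged sketch of such a sweep, where `merge` and `evaluate` are placeholders for a merging routine like the ones above and a validation accuracy check:

```python
def sweep_lambda(merge, evaluate, lambdas=(0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9)):
    """Build a merged RM for each mixing ratio and keep the best on held-out accuracy.

    `merge(lam)` returns a merged model; `evaluate(model)` returns, e.g., the
    fraction of validation pairs where the chosen response outscores the rejected one.
    """
    results = {lam: evaluate(merge(lam)) for lam in lambdas}
    best = max(results, key=results.get)
    return best, results
```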

Generalization to Other Architectures

The team didn’t just stop at LLaMA-2. They also tested the method on Mistral models.

Figure 11: Full results of Mistral RM + MAmmoTH2-Plus on RewardBench.

As seen in Figure 11, the Mistral-based DogeRM (merged with MAmmoTH2-Plus) showed similar improvements, particularly in the Math domain (center chart), proving the technique is architecture-agnostic.


Why This Matters

The DogeRM paper presents a compelling narrative for the future of AI alignment.

  1. Cost Efficiency: We can leverage the thousands of open-source, fine-tuned models on HuggingFace to build better Reward Models without spending a dime on new data annotation.
  2. Modularity: Need a Reward Model for Medical advice? Just merge your General RM with a Medical LLM. Need one for Law? Merge with a Legal LLM.
  3. Simplicity: The method requires no complex training pipelines—just simple arithmetic operations on model weights.

Conclusion

DogeRM demonstrates that we don’t always need to start from scratch. By strategically merging the “reasoning” capabilities of domain-specific models with the “judging” structure of reward models, we can create AI systems that are both aligned and knowledgeable.

As we move toward more specialized AI agents, techniques like DogeRM will likely become standard practice for equipping generalist models with specialist eyes.

For more details, check out the full paper: “DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging”.