The explosion of Large Language Models (LLMs) has created a peculiar bottleneck in artificial intelligence. We have models that can write poetry, code, and legal briefs, but we are running out of reliable ways to grade them.
Historically, humans were the judges. But humans are slow, expensive, and often inconsistent. To solve this, the industry shifted toward the “LLM-as-a-Judge” paradigm, where powerful proprietary models like GPT-4 evaluate the outputs of smaller models. It works well, but it introduces new problems: high costs, lack of transparency (closed source), and data privacy concerns.
Ideally, we would use an open-source model to grade our AI. However, until recently, open-source “evaluator models” have been lackluster. They tend to disagree with human judges and, crucially, lack flexibility. They are usually trained to do only one thing: either assign a score (1-5 stars) or rank two responses (A vs. B).
Enter PROMETHEUS 2. In a new paper from KAIST, LG AI Research, CMU, and others, researchers introduce a model that doesn’t just close the gap with GPT-4—it challenges the way we train evaluator models altogether. By utilizing a novel “weight merging” technique, they created a unified model that excels at both scoring and ranking.
In this post, we will dissect how PROMETHEUS 2 works, the data that powers it, and why merging model weights might be better than traditional training methods.
The Problem with Current Judges
To understand why PROMETHEUS 2 is significant, we first need to look at the landscape of AI evaluation.
There are generally two classes of automated judges:
- Weak Evaluators: These are typically smaller, open-source models (like Llama-2-70B or early versions of Prometheus). Their scores often fail to correlate with human judgment.
- Strong Evaluators: These are proprietary giants like GPT-4 or Claude-3-Opus. Their scores correlate highly with humans and with each other.

As shown in Figure 1 above, strong evaluators (blue) form a tight cluster of agreement. Weak evaluators (red) are scattered; they don’t agree with the strong models, and often, they don’t even agree with each other. The goal of PROMETHEUS 2 is to move an open-source model from the red group into the blue group.
Two Ways to Grade an AI
To build a better judge, we need to master the two dominant formats of evaluation: Direct Assessment and Pairwise Ranking.
1. Direct Assessment (Absolute Grading)
This method mimics a teacher grading an essay. The model receives an instruction, a response, and a rubric (criteria). It must output a scalar score (e.g., 1 to 5) and usually a verbal explanation.
The mathematical formulation for this is:

$$
(v_r, s) = f_\theta(i, r, a, e), \quad s \in \{1, 2, 3, 4, 5\}
$$
Here, the function takes an instruction (\(i\)), a response (\(r\)), a reference answer (\(a\)), and evaluation criteria (\(e\)), and produces feedback (\(v_r\)) and a score (\(s\)).
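To make the abstraction concrete, here is a minimal Python sketch of such a call. The `generate` function, the `###`-style prompt template, and the `[RESULT]` parsing are illustrative stand-ins loosely modeled on Prometheus-style prompts, not the paper's exact format.

```python
import re

def direct_assess(generate, instruction, response, reference, rubric):
    """generate: any text-completion function, e.g. a wrapped evaluator LLM."""
    prompt = (
        f"###Instruction: {instruction}\n"
        f"###Response to evaluate: {response}\n"
        f"###Reference answer: {reference}\n"
        f"###Score rubric: {rubric}\n"
        "Write feedback, then end with: [RESULT] <an integer from 1 to 5>"
    )
    output = generate(prompt)
    feedback, _, tail = output.rpartition("[RESULT]")
    match = re.search(r"[1-5]", tail)
    score = int(match.group()) if match else None
    return feedback.strip(), score  # (v_r, s)
```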
2. Pairwise Ranking (Relative Grading)
This method mimics an eye exam (“Is number 1 better, or number 2?”). The model is given an instruction and two responses. It must decide which one is better based on the criteria:

$$
(v_{r_m, r_n}, w) = f_\theta(i, r_m, r_n, a, e), \quad w \in \{m, n\}
$$

Here, \(r_m\) and \(r_n\) are the two candidate responses, \(v_{r_m, r_n}\) is verbal feedback comparing them, and \(w\) indicates the winner.
This format is often considered easier for models because relative comparison is simpler than absolute scoring. However, existing open-source models are usually trained on just one of these formats. A model trained to rank A vs. B often fails spectacularly when asked to give a score of 1-5, and vice versa.
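For comparison, here is the pairwise counterpart, under the same illustrative assumptions as the sketch above:

```python
def pairwise_rank(generate, instruction, response_a, response_b, rubric):
    """Return comparative feedback and the winning response ('A' or 'B')."""
    prompt = (
        f"###Instruction: {instruction}\n"
        f"###Response A: {response_a}\n"
        f"###Response B: {response_b}\n"
        f"###Score rubric: {rubric}\n"
        "Compare the two responses, then end with: [RESULT] A or [RESULT] B"
    )
    output = generate(prompt)
    feedback, _, verdict = output.rpartition("[RESULT]")
    winner = "A" if "A" in verdict else "B" if "B" in verdict else None
    return feedback.strip(), winner  # (v, w)
```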

Figure 2 illustrates the difference. In Direct Assessment, the model must justify why a response deserves a specific score (e.g., 4/5). In Pairwise Ranking, it must compare the nuance between two potentially good answers to pick a winner. PROMETHEUS 2 aims to unify these capabilities.
The Ingredients: Custom Data
You cannot train a great evaluator without great data. The researchers utilized two massive datasets to train their model.
- The Feedback Collection (for Direct Assessment): This existing dataset contains inputs, responses, and—crucially—detailed score rubrics and feedback. It teaches the model how to assign scores based on specific criteria (like “helpfulness” or “creativity”).
- The Preference Collection (for Pairwise Ranking): This is a new contribution of the paper. The researchers realized that most ranking datasets (like those used for RLHF) only tell the model “A is better than B.” They don’t explain why based on a specific rubric.
To create the Preference Collection, the authors took the Feedback Collection and synthesized pairs. They then used GPT-4 to generate “verbal feedback” that explicitly discusses the commonalities and differences between the two responses before declaring a winner. This resulted in over 200,000 training instances covering 1,000 different evaluation criteria.
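Here is a rough sketch of how one such instance could be synthesized, assuming the OpenAI Python client; the paper's actual prompts, pairing logic, and quality filtering are more involved than this.

```python
from openai import OpenAI

client = OpenAI()

def make_preference_instance(instruction, response_a, response_b, rubric):
    """Ask GPT-4 for rubric-grounded comparative feedback on one pair."""
    prompt = (
        f"Instruction: {instruction}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        f"Criteria: {rubric}\n"
        "Discuss the commonalities and differences between the two responses "
        "with respect to the criteria, then end with 'Winner: A' or 'Winner: B'."
    )
    completion = client.chat.completions.create(
        model="gpt-4",  # the paper uses GPT-4 as its feedback generator
        messages=[{"role": "user", "content": prompt}],
    )
    feedback = completion.choices[0].message.content
    return {"instruction": instruction, "response_a": response_a,
            "response_b": response_b, "rubric": rubric, "feedback": feedback}
```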
The Recipe: Weight Merging
This is the most innovative part of the PROMETHEUS 2 methodology.
The standard way to train a model to do two tasks (scoring and ranking) is Joint Training. You simply mix the two datasets together and train the model on the combined pile. Ideally, the model learns both.
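In sketch form, assuming the two collections are plain lists of training examples and `fine_tune` is a placeholder for whatever trainer you use:

```python
import random

# Toy stand-ins for the two datasets, each a list of training examples.
feedback_collection = [{"format": "direct", "text": "..."}] * 3
preference_collection = [{"format": "pairwise", "text": "..."}] * 3

joint_data = feedback_collection + preference_collection
random.shuffle(joint_data)  # interleave the two formats
# joint_model = fine_tune(base_model, joint_data)  # placeholder trainer
```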
However, the researchers found that Joint Training resulted in “negative transfer.” The model actually got worse at each individual task compared to models trained on just one.
Their solution? Weight Merging.
Instead of training one model on both datasets, they trained two separate “expert” models:
- Model A: Trained only on Direct Assessment.
- Model B: Trained only on Pairwise Ranking.
Then, they merged the weights of these two models into a single final model:

$$
\theta_{\text{final}} = \alpha \, \theta_d + (1 - \alpha) \, \theta_p
$$

This is a simple Linear Merge: \(\theta_{\text{final}}\) is the weighted average of the direct-assessment expert's parameters (\(\theta_d\)) and the pairwise-ranking expert's parameters (\(\theta_p\)), with \(\alpha\) controlling the mix. The researchers experimented with several more advanced merging techniques (such as Task Arithmetic and TIES), eventually settling on a method called DARE-Linear for the final 8x7B model.
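For models with identical architectures, a linear merge is just an element-wise weighted average of the two state dicts. Below is a minimal PyTorch sketch; the checkpoint paths and the `final_model` variable are hypothetical, and the paper's DARE-Linear variant additionally drops and rescales parameter deltas before merging, which this sketch omits.

```python
import torch

def linear_merge(state_dict_d, state_dict_p, alpha=0.5):
    """theta_final = alpha * theta_d + (1 - alpha) * theta_p,
    applied element-wise to every parameter tensor."""
    merged = {}
    for name, theta_d in state_dict_d.items():
        theta_p = state_dict_p[name]  # architectures must match exactly
        merged[name] = alpha * theta_d + (1 - alpha) * theta_p
    return merged

# Usage (hypothetical paths): combine the two single-format experts.
# expert_d = torch.load("direct_assessment_expert.pt")
# expert_p = torch.load("pairwise_ranking_expert.pt")
# final_model.load_state_dict(linear_merge(expert_d, expert_p, alpha=0.5))
```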
Why does this work better? The hypothesis is that merging preserves the specialized “features” learned by each model without the interference that happens during simultaneous training (where gradients from one task might overwrite progress on the other).
Experimental Results
The researchers tested PROMETHEUS 2 against a wide array of open-source models (Llama-2, Mistral, previous Prometheus versions) and proprietary models (GPT-3.5, GPT-4).
They used a diverse set of benchmarks to ensure the model wasn’t just memorizing its training data.

1. Performance on Direct Assessment
In absolute scoring tasks, PROMETHEUS 2 (both the 7B and 8x7B versions) showed the highest correlation with human and GPT-4 judges among all open-source models.

As seen in Table 3, PROMETHEUS 2-8x7B achieves a Pearson correlation of 0.665 with GPT-4 on MT Bench and 0.555 with humans on FLASK. This effectively halves the performance gap that previously existed between open-source evaluators and GPT-4.
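For context, a correlation figure like 0.665 comes from pairing the two judges' scores on the same set of responses and computing Pearson's r; the toy scores below are made up purely to show the mechanics.

```python
from scipy.stats import pearsonr

# Illustrative scores only: each position is the same response graded twice.
prometheus_scores = [4, 2, 5, 3, 4, 1, 5, 3]
gpt4_scores       = [5, 2, 4, 3, 4, 2, 5, 2]

r, _ = pearsonr(prometheus_scores, gpt4_scores)
print(f"Pearson r = {r:.3f}")
```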
2. Performance on Pairwise Ranking
For ranking tasks, the results were equally impressive. The model needs to agree with human decisions on which response is better.

While the full accuracy tables show PROMETHEUS 2 topping the charts, Table 13 (above) highlights something perhaps more interesting: Consistency.
The researchers tested “Cross-Format Consistency.” If the model says “Response A is better than B” in a pairwise test, does it also give Response A a higher score than B when grading them individually? The small \(\Delta\) (Delta) values for PROMETHEUS 2 indicate that it is highly internally consistent. It understands quality in a robust way, regardless of whether you ask it to grade (1-5) or rank (A vs B).
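A sketch of what such a consistency check might look like (the paper's exact metric definition may differ):

```python
def consistency_rate(pairs):
    """pairs: list of (score_a, score_b, pairwise_winner) tuples, where the
    scores come from direct assessment and pairwise_winner is 'A' or 'B'."""
    consistent = 0
    for score_a, score_b, winner in pairs:
        direct_winner = ("A" if score_a > score_b
                         else "B" if score_b > score_a else None)
        consistent += (direct_winner == winner)  # ties count as inconsistent
    return consistent / len(pairs)

# e.g. consistency_rate([(4, 2, "A"), (3, 5, "B"), (4, 4, "A")]) -> 0.67
```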
Why Merging Wins: The Analysis
The paper dives deep into why weight merging outperformed joint training. The comparison is stark.

Table 5 reveals the “Negative Transfer” phenomenon. Look at the “Joint Training” row: its performance is often lower than that of the “Direct Assessment Only” model when tested on direct assessment benchmarks.
Now look at “Weight Merging.” It scores higher than both the single-format models and the joint training models. This indicates Positive Transfer. The skills learned from pairwise ranking actually helped the model become a better absolute grader, but only when combined via merging.
The researchers also analyzed how the merge ratio (\(\alpha\)) affects performance.

Figure 3 visualizes the balance. The x-axis represents the ratio of Direct Assessment weights to Pairwise weights.
- Green Line (Direct Assessment Performance): Peaks around 0.5 (an equal mix).
- Blue Line (Pairwise Performance): Peaks around 0.3 (favoring pairwise weights).
This asymmetry is fascinating. It suggests that learning to compare (ranking) is a foundational skill that boosts the ability to score, whereas learning to score provides a smaller boost to ranking.
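Reproducing such a sweep is straightforward given the `linear_merge` helper sketched earlier; `evaluate_direct` and `evaluate_pairwise` are placeholders for whatever benchmark harness you run.

```python
def sweep_alpha(expert_d, expert_p, evaluate_direct, evaluate_pairwise):
    """Merge at several mixing ratios and score each result on both formats."""
    results = []
    for alpha in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
        merged = linear_merge(expert_d, expert_p, alpha=alpha)
        results.append({
            "alpha": alpha,
            "direct": evaluate_direct(merged),     # paper: peaks near 0.5
            "pairwise": evaluate_pairwise(merged), # paper: peaks near 0.3
        })
    return results
```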
Conclusion
PROMETHEUS 2 represents a significant step forward for open-source AI. It provides a free, transparent, and highly capable alternative to using GPT-4 as a judge.
For students and researchers, the key takeaways are:
- Better Data: The introduction of the Preference Collection enables models to learn the reasoning behind rankings, not just the result.
- Better Training: “Weight Merging” is a powerful technique. If you have a model that needs to do two distinct tasks well, training separate experts and merging them might yield better results than multitask training.
- Flexibility: We no longer have to choose between a scoring model and a ranking model. We can have one model that does both with high human agreement.
As the ecosystem of open-source LLMs continues to grow, having a reliable, open-source “teacher” to grade them is essential. PROMETHEUS 2 fills that role, proving that sometimes, two models merged together are indeed better than one.