In the world of Machine Translation (MT), we have reached a fascinating tipping point. For decades, the goal of a translation system was to match human performance. Today, with the advent of Large Language Models (LLMs) like GPT-4, machine-generated translations often exceed the quality of human-written references.

This creates a paradox in evaluation. Traditional metrics work by comparing the machine’s output (the “candidate”) to a human translation (the “reference”). If the reference is treated as the “Gold Standard,” how can a metric possibly reward a machine for writing something better?

In this post, we break down a research paper that tackles this exact problem. The authors introduce RESUME (Residual score Metric), a novel approach that moves beyond simple reference matching to evaluate relative quality, effectively allowing machines to score higher than their human teachers.

The Problem: Reference Bias

To understand the innovation of RESUME, we first need to look at how we currently grade translations.

The “Gold Standard” Trap

Most MT metrics, whether they are based on lexical overlap (like BLEU) or neural embeddings (like COMET or BERTScore), operate on a simple assumption: the human reference sentence is perfect.

The evaluation function looks something like this:

\[ f(s, r, c) \]

Where \(s\) is the source, \(r\) is the reference, and \(c\) is the candidate.

The metric calculates the similarity between \(c\) and \(r\). Consequently, the maximum possible score is achieved when the candidate is identical to the reference. If a candidate differs from the reference, it is penalized, even when the difference makes it better, more concise, or more grammatically correct.

This is known as the Reference Bias problem.
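To see this ceiling in action, here is a deliberately simplistic toy scorer in Python. It is not any published metric, just a stand-in for lexical-overlap approaches like BLEU, and the example sentences are invented for illustration:

```python
def overlap_score(candidate: str, reference: str) -> float:
    """Toy reference-matching metric: fraction of candidate words that appear in the reference."""
    cand_words = candidate.lower().split()
    ref_words = set(reference.lower().split())
    if not cand_words:
        return 0.0
    return sum(w in ref_words for w in cand_words) / len(cand_words)


reference = "he do not like the weather of today"    # flawed human reference
worse = "he hates weather now"                       # degraded candidate
better = "he does not like today's weather"          # improved, more fluent candidate

print(overlap_score(reference, reference))  # 1.0   -- identical to the reference: the ceiling
print(overlap_score(worse, reference))      # 0.5   -- penalized, correctly
print(overlap_score(better, reference))     # ~0.67 -- also penalized, even though it is better
```

No matter how good the improved candidate is, it can never score above the flawed reference itself.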

Why This Matters Now

This wasn’t a huge issue when machine translation was clumsy. But consider two factors in the modern landscape:

  1. Human Error: Reference sentences are written by humans, who are prone to inconsistency, mistranslation, or grammatical slips.
  2. LLM Superiority: Models like GPT-4 are increasingly producing translations that are stylistically superior to average human references.

As shown in the authors’ research, current metrics fail to assign higher scores to candidates that outperform references. We need a metric that can say, “This translation isn’t just like the reference; it is better than the reference.”

The Solution: RESUME

The researchers propose a method called RESUME. Instead of just calculating an absolute quality score, RESUME calculates a residual score—a measure of the relative quality difference between the candidate and the reference.

The final score for a translation is calculated by taking a standard metric’s score and adding this residual adjustment:

\[ \text{Final Score} = f(s, r, c) + \lambda \cdot \text{RESUME}(s, r, c) \]

Here, \(\lambda\) is a weight parameter. The key innovation lies in what RESUME outputs:

  • A positive value if the Candidate > Reference.
  • A negative value if the Candidate < Reference.

By adding this positive residual, the total score can theoretically exceed the score of the reference itself, breaking the “Gold Standard” ceiling.
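A rough sketch of how the two components combine in practice; `base_metric`, `resume_model`, and the default weight are placeholders, not the authors' actual implementation:

```python
def final_score(base_metric, resume_model, source: str, reference: str,
                candidate: str, lam: float = 0.5) -> float:
    """Combine an absolute metric with the residual adjustment:
    f(s, r, c) + lambda * RESUME(s, r, c)."""
    base = base_metric(source, reference, candidate)        # absolute score, capped at the reference
    residual = resume_model(source, reference, candidate)   # > 0 if candidate beats the reference, < 0 otherwise
    return base + lam * residual
```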

Core Method: Training for Relativity

The biggest challenge in developing RESUME was training data. Creating a massive dataset where experts explicitly label “Sentence A is better than Sentence B by exactly 0.5 points” is incredibly expensive and time-consuming.

The authors devised a clever strategy to train their model using existing datasets (like WMT Direct Assessment data) that only provide absolute quality scores.

The Training Strategy

Standard metrics minimize the error between the predicted score and a human rating \(y\), which typically ranges from 0 to 1.

Standard Loss Function
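Assuming a standard squared-error regression objective (the exact form in the paper's figure may differ), this looks like:

\[ \mathcal{L}_{\text{std}} = \big( f(s, r, c) - y \big)^2 \]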

For RESUME, the goal is to predict the difference (\(\Delta y\)) between the candidate and the reference.

Residual Loss Function
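Under the same squared-error assumption, the residual objective swaps the absolute target for the difference:

\[ \mathcal{L}_{\text{res}} = \big( \text{RESUME}(s, r, c) - \Delta y \big)^2 \]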

But existing datasets don’t have \(\Delta y\). They only have the score for the candidate (\(y\)). To solve this, the authors made a simplifying assumption for the sake of training: Assume the Reference sentence is perfect (Score = 1.0).

Using this assumption, they created a training loop that teaches the model to recognize both inferior and superior translations using a two-step process, illustrated below:

Figure 1: The training process of RESUME with absolute scores.

Step 1: Learning Negative Residuals

In the standard setup, the model compares a Candidate (\(c\)) against a Reference (\(r\)). Since the dataset provides the candidate’s score \(y\) (where \(y \le 1\)) and we assume the reference is \(1\), the residual target is:

\[ \Delta y = \text{score}(c) - \text{score}(r) = y - 1 \]

Since \(y\) is usually less than 1, this value is negative. The model learns to penalize candidates that are worse than the reference using this loss function:

Loss function for Reference > Candidate
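Assuming the same squared-error form as above, the loss for this step targets \(y - 1\):

\[ \mathcal{L}_{r > c} = \big( \text{RESUME}(s, r, c) - (y - 1) \big)^2 \]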

Step 2: Learning Positive Residuals (The “Swap” Trick)

If we only trained on the equation above, the model would only ever learn to output negative numbers. It would never learn to recognize when a candidate is better.

To fix this, the researchers swap the inputs: they feed the Reference sentence into the “Candidate” slot and the Candidate sentence into the “Reference” slot.

  • Now, the “Candidate” (actually the reference) has a score of 1.
  • The “Reference” (actually the candidate) has a score of \(y\).

The target residual becomes:

\[ \Delta y = 1 - y \]

This results in a positive value. By training on this swapped configuration, the model learns what it looks like when the input in the “Candidate” slot is superior to the input in the “Reference” slot.

Loss function for Reference < Candidate
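With the inputs swapped (the reference \(r\) in the candidate slot and the candidate \(c\) in the reference slot, keeping the same argument order of source, reference, candidate), the assumed squared-error loss targets \(1 - y\):

\[ \mathcal{L}_{r < c} = \big( \text{RESUME}(s, c, r) - (1 - y) \big)^2 \]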

This ingenious use of data augmentation allows RESUME to learn a full range of relative quality (-1 to +1) without requiring any new manual labeling.
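A minimal sketch of how such training pairs could be constructed from absolute scores; the field names and bookkeeping are assumptions for illustration, not the paper's code:

```python
def build_residual_training_pairs(examples):
    """Expand (source, reference, candidate, score) rows into residual-regression targets,
    under the simplifying assumption that every reference is a perfect translation (score 1.0)."""
    pairs = []
    for ex in examples:
        s, r, c, y = ex["source"], ex["reference"], ex["candidate"], ex["score"]
        # Step 1: normal orientation -- candidate in the candidate slot, target is negative (y - 1).
        pairs.append({"source": s, "reference": r, "candidate": c, "target": y - 1.0})
        # Step 2: swapped orientation -- the reference sits in the candidate slot,
        # giving the model a "candidate better than reference" example (target 1 - y).
        pairs.append({"source": s, "reference": c, "candidate": r, "target": 1.0 - y})
    return pairs
```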

Experiments and Results

Does adding this residual score actually help? The authors tested RESUME on the WMT22 MQM dataset, which uses high-quality expert ratings.

1. Correlation with Human Judgments

The primary measure of success is whether the metric agrees with human experts. The table below compares standard metrics (like COMET, BLEURT, and UniTE) against their performance when augmented with RESUME.

Table 1: Results on the WMT22 MQM dataset at both the segment level and the system level.

Key Takeaway: As indicated by the bold values, RESUME consistently improves performance across both segment-level (judging individual sentences) and system-level (ranking translation models) evaluations. It turns unsupervised metrics like BERTScore into competitors against supervised giants like COMET.

2. Identifying Superior Translations

To prove that RESUME addresses reference bias, the authors used a post-editing dataset. This dataset contains:

  1. Pre-edited: A machine translation with errors.
  2. Post-edited: The same translation fixed by a human expert.

If we use the Pre-edited version as the Reference, a good metric should score the Post-edited version higher than the reference.
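One plausible way to compute such a statistic, sketched under the assumption that “the reference’s score” means the score the metric assigns when candidate and reference are identical (the paper’s exact protocol may differ):

```python
def superior_detection_rate(metric, examples) -> float:
    """Fraction of cases where `metric` scores the post-edited translation above the
    pre-edited reference. `metric(source, reference, candidate)` is a placeholder
    for any scorer, e.g. f(s, r, c) + lambda * RESUME(s, r, c)."""
    wins = 0
    for ex in examples:
        s, pre, post = ex["source"], ex["pre_edited"], ex["post_edited"]
        score_post = metric(s, pre, post)  # the human fix, judged against the flawed reference
        score_ref = metric(s, pre, pre)    # the reference judged against itself: the usual ceiling
        wins += score_post > score_ref
    return wins / len(examples)
```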

Table 2: The percentage of instances where the MT metric assigns a higher score to the post-edited translation.

The results in Table 2 are stark. Standard metrics like COMET and BLEURT almost never (1-2%) score the better translation higher than the reference because they are too focused on similarity. RESUME, however, correctly identifies the superior quality 59% of the time.

3. Case Study Analysis

Let’s look at a concrete example of this in action. In the table below, Example #1 shows a pre-edited reference that contains a mistranslation. The post-edited candidate fixes it.

Table 3: Two examples from the post-editing translation dataset.

  • BERTScore looks at the post-edited candidate, sees it doesn’t match the (flawed) reference, and gives it a 0.965, lower than the reference’s score.
  • BERTScore + RESUME recognizes the quality improvement and adds a residual boost of +0.134, pushing the score to 1.099. This correctly signals that the candidate is better than the reference.

4. Ranking LLMs (GPT-4)

One of the most significant findings is how RESUME affects the ranking of modern LLMs. In the WMT22 English-Chinese task, standard BLEU scores rank the “Online-W” system higher than GPT-3 and GPT-4.

Figure 2: Scoring results on GPTs and top-performing system (Online-W) in WMT22 en-zh pair.

However, human evaluations generally prefer GPT-4. As seen in the right-hand chart of Figure 2, when RESUME is applied, the scores shift. GPT-4 (green bar) takes the lead, aligning the automatic metric with the reality of human preference for LLM translations.

Sensitivity Analysis

Finally, the authors explored how much weight (\(\lambda\)) should be given to the residual score.

Figure 4: Changes in the average correlation of MT metrics according to RESUME score ratios.

The chart shows that while the optimal \(\lambda\) varies by metric, adding some residual score (moving right from 0 on the x-axis) almost always improves correlation, peaking around \(\lambda=0.2\) to \(1.0\) for most metrics before dropping off.
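If you wanted to tune this weight for your own metric, a simple sweep over \(\lambda\) against human judgments would do. This is a sketch, not the authors' setup, and it assumes per-segment base scores, residual scores, and human scores have already been computed:

```python
from scipy.stats import kendalltau

def sweep_lambda(base_scores, resume_scores, human_scores,
                 lambdas=(0.0, 0.2, 0.5, 1.0, 2.0)):
    """Kendall correlation with human judgments for each candidate lambda."""
    results = {}
    for lam in lambdas:
        combined = [b + lam * r for b, r in zip(base_scores, resume_scores)]
        tau, _ = kendalltau(combined, human_scores)
        results[lam] = tau
    return results
```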

Conclusion

The era of treating human references as the absolute ceiling of translation quality is ending. As Large Language Models continue to improve, our evaluation tools must evolve to recognize when a machine has surpassed its training data.

The RESUME metric offers a practical, clever solution to the Reference Bias problem. By training a model to assess relative quality using a swapped-input training strategy, the authors have created a tool that can:

  1. Boost the accuracy of existing metrics like COMET and BLEURT.
  2. Correctly identify when a translation is better than its reference.
  3. Accurately rank high-performing LLMs like GPT-4.

This research ensures that as translation systems get smarter, our ability to grade them doesn’t get left behind.