Introduction

In the world of Machine Translation (MT), we have witnessed a massive shift from heuristic, overlap-based metrics (like BLEU) to neural metrics (like COMET and MetricX). These newer models align significantly better with human judgments. However, they come with a “black box” problem.

When a neural metric hands you a score—say, 0.86—what does that actually mean? Is it a perfect translation? Is it just “okay”? If another metric gives the same sentence a -1.49, how do you compare them?

Historically, researchers have evaluated these metrics by checking how well they correlate with human ratings. While correlation tells us if a metric is generally trending in the right direction, it doesn’t help a practitioner make specific, hard decisions. This ambiguity becomes critical as metrics are increasingly used for downstream tasks like data filtering (deciding which training data to keep) or re-ranking (choosing the best translation from a list of options).

In the paper “Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics,” researchers from Sapienza University of Rome introduce a new framework to close this interpretability gap. Instead of relying solely on correlation, they evaluate metrics based on their ability to make binary decisions—Pass/Fail—and measure their success using Precision, Recall, and F-score.

Figure 1: Quality assessments returned by COMET (Rei et al., 2020), MetricX-23-QE-XL (Juraska et al., 2023), and GEMBA-MQM (Kocmi and Federmann, 2023) for the provided machine-translated text.

As illustrated in Figure 1, different metrics provide vastly different scalar outputs for the same translation. Without a framework to ground these numbers, making informed design choices for MT systems is nearly impossible.

Background: The Interpretability Gap

To understand why this new framework is necessary, we must look at how MT metrics are currently used. Originally, metrics were used primarily to track incremental progress—checking if Model A was slightly better than Model B.

Today, however, metrics serve as active utility functions in complex pipelines:

  1. Data Filtering: Filtering out low-quality translations from massive web-scraped datasets to train better models.
  2. Minimum Bayes Risk (MBR) Decoding: Generating multiple translation hypotheses and using a metric to select the one that minimizes expected error.
  3. Reinforcement Learning: Using metrics as reward models to fine-tune MT systems.

The Problem with Scalars

Most state-of-the-art metrics are trained to minimize Mean Squared Error (MSE) against human judgments. They output a single number (a scalar). The authors identify three major interpretability issues with this:

  1. Range Consistency: Does a 0.1 increase in score mean the same quality jump at the bottom of the scale as it does at the top? (Likely not).
  2. Error Attribution: A single number doesn’t tell you what went wrong (e.g., a critical mistranslation vs. a punctuation error).
  3. Performance Opacity: Knowing a metric has a “0.9 correlation” with humans doesn’t tell you how often it will mistakenly label a bad translation as good.

To address the third point—Performance—the authors propose treating metrics as classifiers rather than just regression models.

Core Method: An Interpretable Framework

The researchers designed two evaluation scenarios that act as proxies for real-world use cases: Data Filtering and Translation Re-ranking.

Scenario 1: Metrics as Binary Classifiers (Data Filtering)

Imagine you are filtering a massive dataset. You want to keep “GOOD” translations and throw away “BAD” ones. You can use a metric \(\mathcal{M}\) and a threshold \(\tau\). If the metric score \(\mathcal{M}(t) \geq \tau\), the translation is kept.

To evaluate how well a metric performs this task, the authors compare the metric’s decisions against an oracle (human expert annotations). They break this down into standard classification metrics:

Precision: If the metric says a translation is GOOD, what is the probability that it actually is?

Equation 1: Precision formula

\[
\text{Precision} = \frac{TP}{TP + FP}
\]

Here, TP (true positives) counts translations labeled GOOD by both the metric and the human oracle, while FP (false positives) counts translations the metric labels GOOD but the oracle labels BAD.

Recall: Out of all the translations that humans actually rated as GOOD, what percentage did the metric manage to find?

Equation 2: Recall formula

\[
\text{Recall} = \frac{TP}{TP + FN}
\]

FN (false negatives) counts translations the oracle labels GOOD but the metric labels BAD.

F-score: This combines Precision and Recall. The authors explicitly use \(F_{\beta}\) (with a specific weight favoring Precision). Why? Because in data filtering, a False Positive (keeping a bad translation) is usually more damaging to model training than a False Negative (accidentally discarding a good translation).

Equation 3: F-score formula

\[
F_{\beta} = (1 + \beta^{2}) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^{2} \cdot \text{Precision} + \text{Recall}}
\]

with \(\beta < 1\), so that Precision is weighted more heavily than Recall.

The authors use high-quality human labels derived from MQM (Multidimensional Quality Metrics). They define:

  • GOOD: no major errors and only a small number of minor ones.
  • PERFECT: a stricter label for translations with essentially no errors at all.
  • BAD: anything that does not qualify as GOOD.
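To make the classification scenario concrete, here is a minimal Python sketch (not the authors’ code) that thresholds metric scores and computes Precision, Recall, and \(F_{\beta}\) against binary human labels. The scores, labels, threshold, and the \(\beta = 0.5\) default are purely illustrative assumptions.

```python
def evaluate_as_classifier(scores, human_is_good, tau, beta=0.5):
    """Treat a metric as a GOOD/BAD classifier at threshold tau.

    scores        : list of metric scores, one per translation
    human_is_good : list of bools from the human (MQM) oracle
    tau           : keep a translation if its score >= tau
    beta          : F-beta weight; beta < 1 favors Precision
    """
    tp = fp = fn = 0
    for score, is_good in zip(scores, human_is_good):
        predicted_good = score >= tau
        if predicted_good and is_good:
            tp += 1
        elif predicted_good and not is_good:
            fp += 1   # a BAD translation leaks through the filter
        elif not predicted_good and is_good:
            fn += 1   # a GOOD translation is discarded
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_beta = (
        (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        if precision + recall else 0.0
    )
    return precision, recall, f_beta


# Illustrative numbers only: scores from a hypothetical metric and
# GOOD/BAD labels derived from expert MQM annotations.
scores = [0.91, 0.55, 0.86, 0.40, 0.78]
human_is_good = [True, False, True, False, False]
print(evaluate_as_classifier(scores, human_is_good, tau=0.75))
```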

Scenario 2: Translation Re-ranking

In this scenario, an MT system generates multiple potential translations for a single source sentence. The metric’s job is to rank them and pick the winner.

The evaluation measure here is Re-Ranking Precision (RRP). It calculates the overlap between the set of translations the metric thinks are best (\(T^{\mathcal{M}}\)) and the set of translations humans think are best (\(T^{\mathcal{H}}\)).

Equation 5: Re-Ranking Precision formula

\[
RRP = \frac{|T^{\mathcal{M}} \cap T^{\mathcal{H}}|}{|T^{\mathcal{M}}|}
\]

i.e., the fraction of the metric’s top-ranked translations that humans also rank among the best.
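As a rough illustration, the overlap for a single source sentence could be computed as below; the sets of “best” hypotheses, the tie-handling, and the averaging over source sentences are assumptions on my part and may differ from the paper’s exact formulation.

```python
def reranking_precision(metric_best, human_best):
    """Overlap between the metric's top picks and the humans' top picks.

    metric_best : set of hypothesis ids the metric ranks highest (T^M)
    human_best  : set of hypothesis ids humans rank highest (T^H)
    """
    if not metric_best:
        return 0.0
    return len(metric_best & human_best) / len(metric_best)


# Illustrative example: hypotheses "a".."e" for one source sentence.
print(reranking_precision({"a", "c"}, {"c", "d"}))  # 0.5
```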

This setup allows us to move away from abstract correlations and ask concrete questions: “If I use this metric to filter my data, how much garbage will leak through?”

Experimental Setup

The evaluation relied on the WMT23 MQM dataset, which provides expert-level human annotations for translation quality.

  • Language directions: Chinese\(\to\)English (ZH\(\to\)EN), English\(\to\)German (EN\(\to\)DE), and Hebrew\(\to\)English (HE\(\to\)EN).

The researchers tested a wide variety of metrics, including:

  • Reference-based: COMET, MetricX-23, MaTESe.
  • Reference-free (QE): COMET-QE, MetricX-23-QE, CometKiwi.
  • LLM-based: GEMBA-MQM (using GPT-4).

They optimized the thresholds (\(\tau\)) on the test set to find the theoretical maximum performance of each metric (the “ceiling” of its ability).
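The threshold search itself is straightforward: sweep candidate values of \(\tau\) and keep the one that maximizes \(F_{\beta}\) on the evaluation data. Below is a hedged sketch that reuses the evaluate_as_classifier helper from the earlier snippet; the exhaustive sweep over observed scores and the \(\beta\) value are assumptions, not necessarily the authors’ exact search procedure.

```python
def find_best_threshold(scores, human_is_good, beta=0.5):
    """Pick the tau that maximizes F-beta on the given data.

    Sweeping over the observed scores is enough: the classifier's
    decisions only change when tau crosses one of them.
    """
    best_tau, best_f = None, -1.0
    for tau in sorted(set(scores)):
        _, _, f_beta = evaluate_as_classifier(scores, human_is_good, tau, beta)
        if f_beta > best_f:
            best_tau, best_f = tau, f_beta
    return best_tau, best_f
```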

Experiments & Results

1. Can Metrics Distinguish Good from Bad?

The results for the binary classification task (Data Filtering) were illuminating.

Table 1: Metrics’ Precision, Recall, and F-score in binary classification, distinguishing GOOD from BAD, and PERFECT from OTHER translations.

Table 1 (above) highlights several key findings:

  • Good vs. Bad: Most metrics perform decently at separating decent translations from terrible ones. Top performers like GEMBA-MQM and xCOMET-QE-ENSEMBLE achieved F-scores over 81.
  • The Precision Problem: Notice that for almost all metrics, Precision is lower than Recall. This means metrics are generally “optimistic”—they are eager to label translations as GOOD, leading to False Positives.
  • Perfect vs. Other: The task becomes much harder when trying to identify “PERFECT” translations. The F-scores drop significantly (into the 60s). Current metrics lack the sensitivity to distinguish between a “good” translation and a flawless one.
  • The Winner: For open-source, reference-free applications (the most common scenario for data filtering), MetricX-23-QE-XL consistently performed at the top level.

2. The Instability of Thresholds

One of the major arguments for interpretability is understanding what a score means. If a score of 0.8 is “Good,” it should ideally be “Good” regardless of the language pair.

However, the experiments showed that optimal thresholds are highly unstable.

Figure 3: Tested metrics’ optimal threshold values across different language directions for GOOD vs BAD classification.

As shown in Figure 3, the optimal threshold (\(\tau\)) to separate Good from Bad varies wildly between language pairs (ZH\(\to\)EN vs EN\(\to\)DE vs HE\(\to\)EN). This confirms that raw metric scores are not universally consistent. A 0.8 might be a safe threshold for one language but too lax (or strict) for another.

The situation is similar when trying to identify PERFECT translations, as seen below.

Figure 4: Tested metrics’ optimal threshold values across different language directions for PERFECT vs OTHER classification.
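In practice, this instability means the threshold search has to be repeated for every language direction. Here is a small sketch building on the find_best_threshold helper above, under the assumption that the data arrives as (language_pair, metric_score, human_label) records; the record format is mine, not the paper’s.

```python
from collections import defaultdict


def per_language_thresholds(records, beta=0.5):
    """Tune a separate tau for each language direction.

    records : iterable of (lang_pair, metric_score, human_is_good) tuples,
              e.g. ("zh-en", 0.83, True)
    Returns {lang_pair: (best_tau, best_f_beta)}.
    """
    by_pair = defaultdict(lambda: ([], []))
    for lang_pair, score, is_good in records:
        by_pair[lang_pair][0].append(score)
        by_pair[lang_pair][1].append(is_good)
    return {
        pair: find_best_threshold(scores, labels, beta)
        for pair, (scores, labels) in by_pair.items()
    }
```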

3. How “Bad” are the False Positives?

Since metrics struggle with precision, it is vital to know how wrong they are when they make a mistake. If a metric labels a “BAD” translation as “GOOD,” is it a catastrophe (a hallucination) or a minor nuisance (a typo)?

The authors plotted the distribution of the “MQM Score \(\Delta\)” for false positives.

Figure 2: Distribution of the MQM score delta between the openly available metrics’ false positive MQM scores and human thresholds.

In Figure 2, the y-axis lists the metrics, and the x-axis shows how far off the False Positives were from the true threshold.

  • Violin Shape: A distribution with a long tail stretching to the left means the metric is letting through some truly terrible translations.
  • Key Insight: The best metrics (top rows) have distributions skewed toward the right. This means when they fail, they usually fail on borderline cases—translations that are almost good but just missed the mark.
  • The Outlier: Look at the bottom row, DA+SQM. This represents human non-expert annotation. Its distribution is extremely wide, indicating it is much less reliable than the automated neural metrics.
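The quantity on the x-axis is simple arithmetic once the false positives are known. A hedged sketch, assuming each translation carries an expert MQM score and that the human GOOD/BAD boundary is itself expressed as an MQM threshold (the sign convention may differ from the paper’s plots):

```python
def false_positive_deltas(predicted_good, mqm_scores, human_threshold):
    """MQM-score gap for each false positive.

    predicted_good  : list of bools, the metric's GOOD/BAD decisions
    mqm_scores      : list of floats, expert MQM score per translation
    human_threshold : MQM score at or above which humans call it GOOD
    Returns one (negative) delta per false positive; values far below
    zero correspond to truly terrible translations leaking through.
    """
    return [
        score - human_threshold
        for keep, score in zip(predicted_good, mqm_scores)
        if keep and score < human_threshold
    ]
```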

4. Re-Ranking and MBR

When selecting the single best translation (Re-ranking), Reference-based metrics generally outperformed Reference-free ones.

Table 3: Re-Ranking Precision of reference-based metrics when used as the utility function for MBR decoding.

Table 3 compares metrics in a standard re-ranking setup vs. Minimum Bayes Risk (MBR) decoding. MBR is a technique in which the system generates many hypotheses and selects the one most similar to all the others, using the metric as the similarity (utility) function. Because the competing hypotheses act as pseudo-references, MBR lets reference-based metrics operate without a human reference, and this setup often beats standard Quality Estimation (QE) re-ranking.
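Below is a minimal sketch of MBR selection with a metric as the utility function; metric_score stands in for any callable that scores a hypothesis against a (pseudo-)reference, and the pairwise averaging shown here is the textbook formulation rather than the paper’s exact implementation.

```python
def mbr_select(hypotheses, metric_score):
    """Minimum Bayes Risk selection over a pool of candidate translations.

    hypotheses   : list of candidate translations for one source sentence
    metric_score : callable(hypothesis, pseudo_reference) -> float,
                   higher meaning more similar / better
    """
    if len(hypotheses) == 1:
        return hypotheses[0]

    def expected_utility(hyp):
        # Score the hypothesis against every other candidate, treating
        # them as pseudo-references, and average the results.
        others = [ref for ref in hypotheses if ref is not hyp]
        return sum(metric_score(hyp, ref) for ref in others) / len(others)

    return max(hypotheses, key=expected_utility)
```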

5. The “Human” Baseline Surprise

Perhaps the most controversial finding involves DA+SQM (Direct Assessment + Scalar Quality Metrics). This is a common method for gathering human evaluation data, where annotators use a slider to rate quality.

The authors found that DA+SQM annotations agreed with the expert MQM labels less closely than the automatic metrics did.

  • DA+SQM had low precision and high recall (see the bottom row of Table 1).
  • In the False Positive analysis (Figure 2), DA+SQM allowed significantly worse translations to pass as “Good” compared to metrics like MetricX or COMET.

This raises concerns about the reliability of datasets curated using non-expert human evaluation. The authors suggest that automated metrics might actually be more reliable than these specific human annotations for fine-grained quality filtering.

Conclusion & Implications

This paper moves the field of Machine Translation evaluation from “does it correlate?” to “does it work?”. By rigorously testing metrics as classifiers, the authors provide actionable advice for practitioners:

  1. Use MetricX-23-QE-XL if you need an open-source, reference-free metric for filtering data.
  2. Be Careful with Thresholds: You cannot pick a single score (e.g., 0.8) and apply it across all languages. Thresholds must be tuned per language pair, ideally using a development set.
  3. Trust Neural Metrics over DA+SQM: Expert-based MQM annotations are the gold standard. If those aren’t available, top-tier neural metrics may actually be more consistent than non-expert human crowdsourcing.
  4. Expect False Positives: Even the best metrics struggle with Precision and will let some bad translations through; reliably identifying “PERFECT” translations remains an unsolved challenge.

This framework empowers researchers to make design choices—like setting filtering thresholds—based on concrete Precision/Recall trade-offs rather than intuition, ultimately leading to cleaner datasets and better translation models.