Imagine you are a carpenter building tables. You have a ruler to measure the length of your work. But this ruler has a strange property: when you measure a table made of oak, an inch is exactly 2.54 centimeters. But when you measure a table made of pine, the ruler magically shrinks, and an “inch” becomes only 2 centimeters. As a result, your pine tables receive inflated measurements, while your oak tables are penalized.

This sounds absurd for carpentry, but recent research suggests this is exactly what is happening in the evaluation of Machine Translation (MT) systems.

In the paper “A Measure of the System Dependence of Automated Metrics,” researchers Pius von Däniken, Jan Milan Deriu, and Mark Cieliebak argue that our current automated metrics are not neutral measuring sticks. Instead, they exhibit “System Dependence”—they treat translations differently depending on which AI system produced them. This bias can lead to unfair rankings, where a worse translation system is declared the winner simply because the metric “prefers” its specific style of errors or output.

In this post, we will unpack this research, explore the mathematics behind “fair” evaluation, and look at the proposed method to quantify just how biased our measuring sticks really are.

The Problem with Correlation

Before we dive into the solution, we need to understand the status quo. Evaluating Machine Translation is hard. The gold standard is human evaluation (specifically frameworks like Multidimensional Quality Metrics, or MQM), where experts painstakingly rate translations. However, this is slow and incredibly expensive.

To speed things up, the field relies on automated metrics (like BLEU, COMET, or BERTScore). These are algorithms that look at a translation and spit out a quality score. We validate these metrics by checking their correlation with human judgments. If the metric gives a high score to a sentence that humans also liked, we say the metric is good.

Typically, we look at two levels of agreement (a short code sketch follows the list):

  1. Segment-level correlation: Does the metric rank individual sentences correctly?
  2. System-level correlation: Does the metric rank the systems (e.g., Google Translate vs. GPT-4) in the same order as humans?
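To make these two checks concrete, here is a minimal sketch with made-up toy numbers (not WMT data) of how segment-level and system-level agreement are typically computed. The choice of Pearson and Kendall correlations here is illustrative; metrics shared tasks use a variety of correlation statistics.

```python
# Minimal sketch of the two standard validation checks, using hypothetical
# toy scores for three systems with three segments each.
import numpy as np
from scipy.stats import pearsonr, kendalltau

human = {   # human quality scores (MQM-style: closer to 0 is better)
    "sysA": np.array([-1.0, -2.0, -0.5]),
    "sysB": np.array([-4.0, -3.5, -5.0]),
    "sysC": np.array([-2.5, -3.0, -2.0]),
}
metric = {  # automated metric scores (higher is better)
    "sysA": np.array([0.85, 0.70, 0.90]),
    "sysB": np.array([0.60, 0.65, 0.55]),
    "sysC": np.array([0.75, 0.72, 0.80]),
}

# Segment-level: correlate metric and human scores over individual segments.
h_seg = np.concatenate([human[k] for k in human])
m_seg = np.concatenate([metric[k] for k in metric])
print("segment-level Pearson r:", pearsonr(m_seg, h_seg)[0])

# System-level: correlate the per-system averages, i.e. the system ranking.
h_sys = [human[k].mean() for k in human]
m_sys = [metric[k].mean() for k in metric]
print("system-level Kendall tau:", kendalltau(m_sys, h_sys)[0])
```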

The authors of this paper argue that correlation is insufficient. A metric can have a high correlation but still be unfair if it applies different standards to different systems.

The “Measuring Stick” Visualized

To understand this, look at the graph below (Figure 1 from the paper). The X-axis represents scores from XCOMET (a popular, high-performing metric). The Y-axis represents the expected Human MQM score (where closer to 0 is better).

Figure 1: Average Human Ratings associated with XCOMET scores on Chinese to English data. We show scores for all systems in aggregate (global) and two individual systems.

Here is what we see:

  • The Blue Line (Global) represents the average relationship between the metric and human scores across all systems.
  • The Orange Line (Lan-BridgeMT) represents the relationship for one specific high-quality system.
  • The Green Line (NLLB-Greedy) represents the relationship for a lower-quality system.

Notice the gap. At an XCOMET score of 0.8, segments from the Orange system (Lan-BridgeMT) correspond to a human score of roughly -3, while segments from the Green system (NLLB-Greedy) correspond to a human score of roughly -7.

This means an XCOMET score of 0.8 “costs” more for the Orange system to achieve than for the Green system. The metric is inflating the quality of the Green system while underestimating the Orange one. The measuring stick is changing length.

The Core Method: Formalizing Unfairness

The researchers propose a mathematical framework to measure this discrepancy. The goal is to determine if the relationship between human scores (\(h\)) and metric scores (\(m\)) is consistent across all systems.

1. The Conditional Expectation

First, we need to define the relationship between the metric and human judgment. We can express the expected human rating for a system \(k\) (\(\mathbb{E}[h_k]\)) in terms of its metric scores.

\[
\mathbb{E}[h_k] = \int \mathbb{E}[h \mid m]\; p_k(m)\, \mathrm{d}m
\]

Equation 1: The expected human rating expressed as a conditional expectation.

In simple terms, this equation says: To find the true quality of a system, we look at the distribution of metric scores it gets (\(p_k(m)\)) and map those to human scores using a conversion function \(\mathbb{E}[h|m]\).

The critical insight is that this conversion function \(\mathbb{E}[h|m]\) acts as the “calibration curve.”

  • If the metric is fair (system-independent), there is one Global Function (\(f_G\)) that works for everyone.
  • In reality, each system has its own System-Specific Function (\(f_k\)).

2. Expected Deviation (ED)

To quantify the unfairness for a specific system, the authors introduce Expected Deviation (ED). This measures the gap between the “Global” assumption and the “System-Specific” reality.

\[
\mathrm{ED}(k) = \frac{1}{N} \sum_{j=1}^{N} f_G\!\left(m_k^{(j)}\right) - \frac{1}{N} \sum_{j=1}^{N} f_k\!\left(m_k^{(j)}\right)
\]

Equation 2: Formula for Expected Deviation (ED), where \(m_k^{(j)}\) is the metric score of system \(k\) on segment \(j\) and \(N\) is the number of segments.

Here is the breakdown of this equation:

  • \(\frac{1}{N} \sum f_G(m_k^{(j)})\): This is the average score the system would get if we used the global conversion curve (the Blue line in Figure 1).
  • \(\frac{1}{N} \sum f_k(m_k^{(j)})\): This is the actual average human score for that system (the specific Orange or Green line).
  • ED(k) is the difference between the two (a minimal code sketch follows this list).
  • A negative ED means the system is underestimated by the metric (it is actually better than the metric says).
  • A positive ED means the system is overestimated (it is actually worse than the metric says).
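Here is a minimal sketch of that computation, assuming the calibration functions \(f_G\) and \(f_k\) have already been fitted. The toy curves and numbers are hypothetical, chosen only to make the sign convention visible.

```python
# Minimal sketch of Expected Deviation for one system. The calibration
# functions below are hypothetical stand-ins for fitted curves.
import numpy as np

def expected_deviation(metric_scores, f_global, f_system):
    """ED(k) = mean(f_G(m)) - mean(f_k(m)) over system k's metric scores."""
    m = np.asarray(metric_scores)
    return f_global(m).mean() - f_system(m).mean()

# Toy curves: the global curve maps a score of 0.8 to a human score of -5,
# but this system's own curve maps 0.8 to -3 (its 0.8s are genuinely better).
f_global = lambda m: -25.0 * (1.0 - m)
f_system = lambda m: -15.0 * (1.0 - m)

print(expected_deviation([0.80, 0.85, 0.90], f_global, f_system))
# -> -1.5: negative ED, i.e. the metric underestimates this system.
```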

3. The System Dependence Score (SysDep)

Finally, to judge the metric itself (e.g., “How fair is XCOMET?”), we calculate the SysDep score. This is simply the range between the most overestimated system and the most underestimated system.

\[
\mathrm{SysDep} = \max_{k} \mathrm{ED}(k) - \min_{k} \mathrm{ED}(k)
\]

Equation 3: Formula for SysDep score.

A perfect metric would have a SysDep of 0, meaning it treats every system exactly the same. A high SysDep means the metric plays favorites.

To estimate these functions (\(f_G\) and \(f_k\)) from real data, the authors use a technique called Isotonic Regression. This fits a curve that is constrained to be monotonic (never decreasing), which matches our intuition that a higher metric score should never correspond to a lower expected human score.
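Putting the pieces together, here is a minimal end-to-end sketch of the pipeline described above, using scikit-learn's IsotonicRegression as the monotonic curve fitter. The data is simulated rather than taken from WMT23, and the variable names are mine; the point is only to show the mechanics of fitting \(f_G\) and \(f_k\), then reading off ED and SysDep.

```python
# Minimal sketch: fit a global and a per-system calibration curve with
# isotonic regression, then compute ED per system and SysDep for the metric.
# All data below is simulated purely for illustration.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

def simulate_system(true_curve, low, high, n=200):
    """Simulate (metric score, human score) pairs for one system."""
    m = rng.uniform(low, high, n)
    h = true_curve(m) + rng.normal(0.0, 0.5, n)   # noisy human ratings
    return m, h

# Two hypothetical systems whose "true" calibration curves differ,
# i.e. the simulated metric is system-dependent by construction.
systems = {
    "sysA": simulate_system(lambda m: -15.0 * (1.0 - m), 0.6, 1.0),
    "sysB": simulate_system(lambda m: -25.0 * (1.0 - m), 0.4, 0.9),
}

# Global calibration curve f_G: fit on all (m, h) pairs pooled together.
m_all = np.concatenate([m for m, _ in systems.values()])
h_all = np.concatenate([h for _, h in systems.values()])
f_G = IsotonicRegression(out_of_bounds="clip").fit(m_all, h_all)

# System-specific curves f_k and Expected Deviation ED(k).
ed = {}
for name, (m, h) in systems.items():
    f_k = IsotonicRegression(out_of_bounds="clip").fit(m, h)
    ed[name] = f_G.predict(m).mean() - f_k.predict(m).mean()

sysdep = max(ed.values()) - min(ed.values())
print("ED per system:", ed)
print("SysDep:", sysdep)
```

On data where the per-system curves genuinely differ, as simulated here, the ED values split in sign and SysDep sits clearly above zero; for a fair metric, both would hover near zero.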

Experiments & Results

The researchers tested their method using data from the WMT23 Metrics shared task, specifically looking at Chinese-to-English (zh-en) translations. They analyzed how XCOMET ranked 15 different translation systems.

Ranking Inversions caused by Bias

The table below shows the results. This is the “smoking gun” of the paper.

Table 1: System rankings and average rating of WMT 23 zh-en systems according to XCOMET.

Let’s break down the columns:

  • Human (\(\hat{\mu}_k^H\)): The true quality. Lan-BridgeMT is the winner (Rank 1).
  • Metric (\(\hat{\mu}_k^M\)): The raw XCOMET score. It ranks GPT4-5shot as the winner (Rank 1) and Lan-BridgeMT as second.
  • Exp. Deviation (ED): The bias measure we defined earlier.

The Analysis: Lan-BridgeMT has an ED of -0.820. This is a massive underestimation. The metric is effectively “taxing” this system. Conversely, other systems lower down the list have positive EDs, meaning they are being subsidized by the metric.

Because Lan-BridgeMT was penalized so heavily by the metric's system dependence, it lost the top spot to GPT-4. This shows why high correlation isn't enough: a metric that holds Lan-BridgeMT to a stricter standard than GPT-4 can still get the final ranking wrong.

We also see huge deviations at the bottom of the table. NLLB-Greedy has an ED of 1.996. This means the metric thinks it is vastly better than humans think it is. In fact, looking at the ranks, the metric places NLLB-Greedy at rank 12, while humans place it dead last at rank 15. The metric “boosted” it by 3 full ranks purely due to system dependence.

Comparing Different Metrics

Are all metrics equally biased? The authors expanded their analysis to other metrics and language pairs to compare their SysDep scores.

Table 8: SysDep for each metric and language pair.

This table reveals significant differences:

  • GEMBA-MQM (an LLM-based metric using GPT-4) has the lowest SysDep for Chinese-English (1.58), suggesting it is “fairer” for this language pair than XCOMET (SysDep ~2.8 from the previous section).
  • Reference-free metrics (like prismSrc) often perform worse, showing very high system dependence (SysDep 4.61 for zh-en).
  • MetricX-23 generally performs well across pairs.

The variability implies that “fairness” is a property we can optimize for. Some metrics are inherently better at generalizing their judgment scales across different types of model architectures than others.

Is it just noise?

A skeptic might ask: “Is this really systematic bias, or just statistical noise?” To check this, the authors simulated “Intra-System” variability (splitting one system’s data in half).

Table 9: Maximum intra-system SysDep score for all metrics and language pairs.

The Intra-System scores (Table 9) are generally much lower than the Inter-System scores (Table 8). This confirms that the deviations we are seeing are not random—they are driven by the specific characteristics of the different translation systems.
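To see what this check looks like mechanically, here is a minimal sketch of my reading of the procedure: split one system's segments into two random halves, treat the halves as if they were separate systems, and compute SysDep on them. If a metric were merely noisy rather than system-dependent, this intra-system SysDep would be comparable to the inter-system one; the paper finds it is much smaller.

```python
# Minimal sketch of the intra-system sanity check: one system's data is split
# into two random halves, which are then treated as two "systems".
# Data is simulated; the sysdep() helper mirrors the previous sketch.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def sysdep(systems):
    """SysDep = max_k ED(k) - min_k ED(k), with isotonic calibration curves."""
    m_all = np.concatenate([m for m, _ in systems.values()])
    h_all = np.concatenate([h for _, h in systems.values()])
    f_G = IsotonicRegression(out_of_bounds="clip").fit(m_all, h_all)
    ed = {}
    for name, (m, h) in systems.items():
        f_k = IsotonicRegression(out_of_bounds="clip").fit(m, h)
        ed[name] = f_G.predict(m).mean() - f_k.predict(m).mean()
    return max(ed.values()) - min(ed.values())

rng = np.random.default_rng(0)
m = rng.uniform(0.4, 1.0, 400)
h = -20.0 * (1.0 - m) + rng.normal(0.0, 0.5, 400)   # one simulated system

# Random split of the same system into two halves.
idx = rng.permutation(m.size)
halves = {"half_1": (m[idx[:200]], h[idx[:200]]),
          "half_2": (m[idx[200:]], h[idx[200:]])}
print("intra-system SysDep:", sysdep(halves))   # should stay close to 0 here
```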

Conclusion and Implications

This paper makes a compelling case for a paradigm shift in how we evaluate AI. We cannot simply ask, “Does this metric correlate with humans?” We must also ask, “Is this metric dependent on the system it evaluates?”

The implications are significant for the development of Large Language Models and Machine Translation:

  1. Hidden Penalties: You might be developing a superior architecture that is being discarded simply because your evaluation metric has a negative bias against it (like Lan-BridgeMT).
  2. Inflation of Baselines: Older or simpler baselines (like Greedy decoding) might appear competitive only because metrics overestimate them.
  3. Better Metrics: Future metrics competitions should include SysDep (or similar fairness measures) as a primary criterion for success.

As the authors colloquially put it: “A measuring stick should not change length depending on the measured object.” By adopting the SysDep score, the NLP community can ensure that our measuring sticks are as rigid and reliable as possible, paving the way for fairer and more accurate comparisons between AI systems.