Lost in Evaluation: Why We Can’t Trust English Metrics for Multilingual AI

The field of Natural Language Processing (NLP) is currently witnessing a massive expansion in accessibility. We are no longer just building models for English speakers; with the release of multilingual Large Language Models (LLMs) like BLOOM, Llama 2, and Aya-23, the goal is to create AI that speaks the world’s languages.

However, building these models is only half the battle. The other half is determining whether they actually work.

For years, researchers have relied on a standard toolkit of metrics to grade AI summaries. If you are a student in NLP, you likely know them well: ROUGE (which checks for word overlap) and BERTScore (which checks for semantic similarity). More recently, researchers have started using LLMs (like GPT-4) to grade other LLMs.

There is a hidden assumption underlying this entire ecosystem: If these metrics work for English, they must work for everyone else.

A recent paper titled “Re-Evaluating Evaluation for Multilingual Summarization” challenges this assumption head-on. The researchers conducted a pilot study involving English, Chinese, and Indonesian to see if our standard evaluations hold up across language barriers. The results are a wake-up call for the industry: the tools we use to grade English AI are unreliable, and sometimes entirely misleading, when applied to other languages.

The Problem with Current Metrics

Before diving into the paper’s methodology, we need to understand the status quo. How do we currently know if a computer-generated summary is “good”?

The Old Guard: ROUGE and BERTScore

Historically, we treated evaluation as a matching game. ROUGE counts how many n-grams (sequences of words) in the machine-generated summary match a human-written reference summary. It is simple and computationally cheap. BERTScore is slightly smarter; instead of matching exact words, it uses embeddings (vector representations of words) to see if the summary and the reference have similar meanings, even if they use different vocabulary.
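
To make the “matching game” concrete, here is a minimal sketch of ROUGE-1 recall (unigram overlap) in Python. It is deliberately simplified: real ROUGE implementations also handle stemming, precision and F-measure, and longer-span variants like ROUGE-2 and ROUGE-L.

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in ref_counts)
    return overlap / max(sum(ref_counts.values()), 1)

# Different wording, same meaning: exact matching only sees 4 of the 6 reference words.
print(rouge1_recall("the cat sat on the mat", "a cat sat on a mat"))  # ~0.67
```

BERTScore swaps the exact-match step for similarity between contextual embeddings, which is why it tolerates paraphrases that an overlap metric like this one penalizes.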

The New Guard: LLM-as-a-Judge

Recently, the trend has shifted toward using powerful models like GPT-4 to act as judges (an approach often called G-Eval). The logic is that GPT-4 reads like a human, so it should grade like a human. The judge assigns a score (usually 1 to 5) based on criteria like fluency and coherence.
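
In practice, LLM-as-a-judge boils down to a prompt template plus a score parser. The sketch below is illustrative only: `call_llm` is a placeholder for whatever chat-completion client you use, and the prompt wording is not the exact G-Eval prompt from the paper.

```python
# Hypothetical sketch of the LLM-as-a-judge pattern. `call_llm` is a placeholder,
# not a real API, and the prompt is illustrative rather than the paper's exact one.

JUDGE_PROMPT = """You will be given a source document and a summary of it.
Rate the summary's fluency on a scale from 1 (worst) to 5 (best).
Respond with the number only.

Source document:
{document}

Summary:
{summary}
"""

def judge_fluency(document: str, summary: str, call_llm) -> int:
    """Ask an LLM judge for a 1-5 fluency rating of `summary`."""
    reply = call_llm(JUDGE_PROMPT.format(document=document, summary=summary))
    return int(reply.strip())
```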

The Blind Spot

The researchers point out a critical flaw: nearly all validation of these metrics has been done on English text. Languages like Chinese and Indonesian differ drastically from English in script, tokenization (how text is split into words or subwords), and grammar.

Does a high ROUGE score in Chinese mean the summary is fluent? Does GPT-4 understand Indonesian cultural nuances well enough to judge a summary? These are the questions this research aimed to answer.

Methodology: A More Human Approach

To test these metrics, the authors couldn’t just run code; they needed high-quality human data. They constructed a pilot dataset focused on three distinct languages from different language families:

  1. English (EN): Indo-European, high-resource.
  2. Chinese (ZH): Sino-Tibetan, high-resource, distinct script.
  3. Indonesian (ID): Austronesian, medium-resource.

They collected source documents ranging from news to fiction and recipes. Then, they gathered summaries for these documents from two sources:

  1. Humans: Native speakers.
  2. Models: A variety of LLMs including GPT-4, Llama-2, Falcon, and PaLM-2.

Moving Beyond the 5-Point Scale

One of the most interesting aspects of this paper is how they collected human feedback.

Standard evaluation datasets (like SummEval) use a Likert scale, asking annotators to rate a summary from 1 to 5. The authors argue this is flawed. Social science research suggests that people have response biases—some annotators never give a 5, while others are lenient. Furthermore, a “4” from one person might mean something different than a “4” from another.

Instead, this study used Pairwise Comparisons.

Annotators were shown two summaries side-by-side and asked to pick the winner (or declare a tie) based on specific criteria. This is similar to how chess players are ranked. If Summary A beats Summary B, Summary A gains points and B loses points. This system produces an Elo score, which provides a continuous, comparative ranking of quality rather than an arbitrary absolute number.
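
For intuition, here is a minimal sketch of the standard chess-style Elo update applied to a single pairwise judgment. The constants (starting rating 1500, K-factor 32) are the conventional defaults, not necessarily the exact settings used in the paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, outcome: float, k: float = 32.0):
    """outcome is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome - exp_a)
    new_b = rating_b + k * ((1.0 - outcome) - (1.0 - exp_a))
    return new_a, new_b

# Summary A beats Summary B, both starting at 1500: A gains exactly what B loses.
print(update_elo(1500.0, 1500.0, outcome=1.0))  # (1516.0, 1484.0)
```

Running this update over every annotated comparison gives each summary (or each system) a rating, and those ratings are the continuous quality scores used in the figures that follow.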

Figure 1: The pairwise ranking interface shown to annotators, with two summaries side-by-side. Instead of assigning an absolute score, evaluators simply choose which summary is better.

The annotators judged the summaries on four distinct dimensions:

  1. Self-Contained: Can you understand the summary without reading the original text?
  2. Fluency: Is the grammar and flow natural?
  3. Accuracy: Does it avoid hallucinations and contradictions?
  4. Subjective Preference: Which one would you rather read?

Experiment Results: The Cracks in the Foundation

The results of this study offer a sobering look at the state of multilingual evaluation.

1. The Likert Scale Illusion

First, the authors analyzed “G-Eval,” the method where GPT-4 rates summaries on a 1-5 scale. They compared GPT-4’s scores against human ratings from previous datasets.

Figure 2: Scatter plots of G-Eval scores vs. human ratings. Note the vertical banding: even when humans give a variety of scores (y-axis), G-Eval tends to cluster around a few specific values, showing poor alignment.

As seen in the figure above, the correlation is messy, with significant variance. Look at the “Fluency” chart, for example: the points are scattered everywhere, and GPT-4 rarely gives a perfect “5” for coherence or fluency, suggesting it has internal biases that don’t match human intuition. If we can’t trust this scale for English, trusting it for Indonesian is even riskier.

2. Humans vs. Machines: Who Writes Better?

Using the Elo rating system derived from the pairwise comparisons, the researchers plotted the quality distribution of human-written summaries versus LLM-generated summaries.

Figure 3: Box plots of Elo scores for human-written (white) vs. LLM-generated (blue/grey) summaries. In the top row (human evaluators), humans generally outperform models; in the bottom row (GPT-4 as the evaluator), the model prefers LLM outputs.

The top row of Figure 3 (labeled “Human Evaluator”) reveals a crucial insight: across all three languages, humans generally write better summaries than AI (the white boxes are usually higher than the blue ones). This is especially true for Chinese and Indonesian.

However, look at the bottom row, where GPT-4 acts as the evaluator. Suddenly, the gap closes, or even reverses: GPT-4 rates LLM-generated summaries more favorably than human evaluators do. This confirms a “self-preference bias”: LLMs prefer text that sounds like it was written by an LLM.

3. The Failure of Standard Metrics

This is the most critical finding of the paper. The researchers calculated the correlation (using \(R^2\) values) between human judgments and standard metrics (ROUGE, BERTScore, and GPT-4).

An \(R^2\) value closer to 1.0 means the metric perfectly predicts human preference. A value near 0 means there is no relationship.
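
As a rough sketch, the \(R^2\) for a single metric can be computed as the squared Pearson correlation between the metric’s scores and the human (Elo) scores; the paper’s exact fitting procedure may differ, so treat this as illustrative.

```python
import numpy as np

def r_squared(metric_scores, human_scores) -> float:
    """Squared Pearson correlation, i.e. R^2 of a simple one-predictor linear fit."""
    metric = np.asarray(metric_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    r = np.corrcoef(metric, human)[0, 1]
    return float(r ** 2)

# Toy data where the metric tracks human judgments almost perfectly.
print(r_squared([0.2, 0.4, 0.5, 0.7], [1.1, 2.0, 2.4, 3.3]))  # ~1.0
```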

Figure 4: Heatmap of \(R^2\) correlation values. Darker blue indicates stronger correlation. Notice how much lighter the map becomes for Chinese (ZH) and Indonesian (ID).

The heatmap above tells a clear story of degradation:

  • English (EN): The metrics work reasonably well. ROUGE and BERTScore have moderate correlations with human ratings.
  • Chinese (ZH) & Indonesian (ID): The correlations plummet. Look at the “Fluency” column for Chinese. The \(R^2\) value for ROUGE-1 is 0.15 (from Table 1 in the paper). For GPT-4 grading Chinese fluency, the correlation is practically zero (\(R^2\) = 0.03).

This implies that for Chinese summarization, standard metrics are practically random number generators regarding fluency. A high ROUGE score does not guarantee a fluent summary, and a low score doesn’t mean it’s bad.

The authors attribute this to the complexity of different scripts and tokenization methods. Chinese, for instance, does not use spaces to separate words, which confuses n-gram metrics like ROUGE that rely on clear word boundaries.
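
A toy example shows how brittle this is. With whitespace tokenization, two near-identical Chinese sentences share zero unigrams, because each sentence is one unbroken “word”; character-level (or properly segmented) tokenization recovers the overlap. The sentences below are my own illustration, not from the paper.

```python
from collections import Counter

def unigram_overlap(candidate_tokens, reference_tokens) -> float:
    """Fraction of reference tokens matched by the candidate (ROUGE-1-style recall)."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    return sum(min(cand[t], ref[t]) for t in ref) / max(sum(ref.values()), 1)

candidate = "我喜欢看书"   # "I like reading books"
reference = "我喜欢读书"   # "I like to read" (nearly the same meaning)

# Whitespace tokenization sees each sentence as a single token: zero overlap.
print(unigram_overlap(candidate.split(), reference.split()))  # 0.0

# Character-level tokenization recovers the shared content.
print(unigram_overlap(list(candidate), list(reference)))      # 0.8
```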

4. GPT-4 is Not a Universal Translator-Judge

There is a growing belief that we can just use GPT-4 to evaluate any language. The data suggests otherwise.

In English, GPT-4’s ratings correlate somewhat with human ratings. But in Indonesian, the Mean Absolute Error (MAE), the average gap between human ratings and GPT-4’s ratings, is massive. For “Fluency” in Indonesian, the error is significantly higher than in English.
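
For reference, MAE is just the average absolute gap between the two sets of ratings; the numbers below are made up for illustration, not taken from the paper.

```python
import numpy as np

def mean_absolute_error(human_ratings, model_ratings) -> float:
    """Average absolute difference between human and model ratings."""
    human = np.asarray(human_ratings, dtype=float)
    model = np.asarray(model_ratings, dtype=float)
    return float(np.mean(np.abs(human - model)))

# Toy example: the model rates most summaries one point higher than humans do.
print(mean_absolute_error([3, 4, 2, 5], [4, 5, 3, 5]))  # 0.75
```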

Furthermore, GPT-4 exhibited strange artifacts. In some comparisons, it claimed LLM summaries were better because they were “more detailed,” even when human annotators found them bloated or slightly inaccurate. The model seems to confuse “verbosity” with “quality,” a bias that hurts its ability to judge concise, effective summaries in non-English languages.

Conclusion: We Need a New Yardstick

The paper “Re-Evaluating Evaluation for Multilingual Summarization” serves as a crucial check on the rapid progress of AI. We are building models that can generate text in hundreds of languages, but we are measuring their success with rulers made for English.

The key takeaways are:

  1. Standard metrics (ROUGE/BERTScore) are unreliable for non-English languages. They correlate poorly with human judgments in Chinese and Indonesian.
  2. Likert scales (1-5) are noisy. Pairwise comparisons (A vs. B) combined with Elo scores provide a much clearer picture of model performance.
  3. GPT-4 is a biased judge. It prefers LLM-generated text and struggles to align with human preferences in multilingual contexts.

What Does This Mean for Students and Researchers?

If you are working on a project involving multilingual NLP, do not blindly trust ROUGE scores. If your model gets a “state-of-the-art” score on a Chinese benchmark, it might still produce gibberish that humans hate.

The authors advocate for a more nuanced approach. We need to design evaluation frameworks that explicitly account for the linguistic properties of the target language. We also need to accept that, for now, automated metrics cannot fully replace human evaluation.

As we strive to make AI truly global, we must ensure that our definition of “quality” isn’t lost in translation.