Imagine you are traveling through Ethiopia. You want to read a local news article, translate a street sign, or communicate with a local vendor in Amharic. You pull out your phone and type the sentence into a translation app. The app churns for a second and spits out a translation. You assume it’s correct.
But what if the app just translated the name of the Prime Minister into “Al-Qaeda”? What if it translated a request for a soft drink into a statement about drug smuggling?
These aren’t hypothetical scenarios. They are real errors discovered by researchers Hizkiel Mitiku Alemyehu, Hamada M. Zahera, and Axel-Cyrille Ngonga Ngomo from Paderborn University. In their recent paper, they conducted a rigorous “health check” on how modern Artificial Intelligence handles Amharic—a Semitic language spoken by over 20 million people, yet considered “low-resource” in the world of Natural Language Processing (NLP).
While Multilingual Large Language Models (mLLMs) like Meta’s “No Language Left Behind” (NLLB) claim to bridge the gap for hundreds of languages, this study peels back the layers to see what is actually happening under the hood.
In this deep dive, we will explore their methodology, dissect the architecture of these translation models, and look at the sometimes hilarious, sometimes dangerous errors that occur when AI tries to learn a complex language with limited data.
The Context: The “Low-Resource” Challenge
Machine Translation (MT) has experienced a golden age recently. If you translate between English, French, and German, the quality is often near-human. This is because these languages are “high-resource”—there are billions of sentence pairs available on the internet for models to learn from.
Amharic, however, faces a different reality. Despite being the official working language of Ethiopia and having a rich history, it is “low-resource” digitally. There simply isn’t enough clean, parallel text (Amharic-English pairs) available to train massive models effectively using traditional methods.
Previous attempts to handle Amharic focused on statistical methods or fine-tuning smaller models. However, the current trend in AI is Multilingual Large Language Models (mLLMs). These are massive neural networks trained on hundreds of languages simultaneously. The theory is that the model can learn universal grammatical structures from high-resource languages and “transfer” that knowledge to low-resource ones like Amharic.
But does this actually work? That is the core question of this research.
The Methodology: How Do We Grade an AI?
To answer this, the researchers set up a comprehensive evaluation pipeline. They didn’t just want to know if the models worked; they wanted to know how they failed.
They selected two major families of mLLMs developed by Meta:
- NLLB-200 (No Language Left Behind): A cutting-edge model designed specifically to improve performance on low-resource languages.
- M2M-100 (Many-to-Many): An earlier multilingual model.
They tested various sizes of these models, ranging from 418 million parameters to 3.3 billion parameters. In the world of AI, “parameters” are roughly equivalent to brain synapses—generally, the more parameters, the smarter the model.

As shown in Table 1 above, the researchers looked at NLLB models ranging from 600M to 3.3B parameters, and M2M models at 418M and 1.2B.
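For readers who want to try these checkpoints themselves, here is a minimal sketch (not the authors' code) of translating Amharic to English with an NLLB model through the Hugging Face transformers library. The checkpoint name, language codes, and generation settings are illustrative choices rather than the paper's exact configuration.

```python
# Minimal sketch: Amharic-to-English translation with an NLLB checkpoint.
# The model name and settings are illustrative, not the paper's exact setup.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "facebook/nllb-200-distilled-600M"  # smallest NLLB-200 variant

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang="amh_Ethi")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def translate(text: str, target_lang: str = "eng_Latn") -> str:
    """Translate one sentence; NLLB uses FLORES-200 language codes."""
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        # Force the decoder to start generating in the target language.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
        max_new_tokens=128,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

print(translate("ሰላም እንዴት ነህ?"))  # roughly: "Hello, how are you?"
```

The reverse direction (English to Amharic) follows the same pattern with src_lang="eng_Latn" and a target of "amh_Ethi"; the M2M-100 checkpoints use a similar tokenizer-plus-forced-token workflow but with ISO codes such as "am" and "en".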
The Evaluation Pipeline
The researchers used a dataset called Lesan, which contains diverse text sources like news, Wikipedia, and Twitter conversations. They ran these sentences through the models in two directions: Amharic-to-English and English-to-Amharic.
But here is where the study stands out: they didn’t trust the computers to grade themselves. They employed a hybrid approach.

Figure 1 illustrates their workflow. It splits into two distinct paths:
- Automatic Evaluation: Using standard algorithms to calculate a score.
- Human Evaluation: Recruiting native Amharic speakers (university students) to manually read, grade, and annotate errors in the translations.
1. Automatic Metrics: The Math of Similarity
For the automatic path, the researchers used four industry-standard metrics (a short code sketch follows this list):
- BLEU: Counts how many word sequences (n-grams) in the translation match the reference text.
- METEOR: Checks for exact word matches but also synonyms and stems (e.g., “running” matches “run”).
- ChrF++: Looks at character-level matches, which is very important for morphologically rich languages like Amharic where one word can have many prefixes and suffixes.
- TER (Translation Edit Rate): Counts how many edits (deletions, insertions) a human would need to make to fix the AI’s sentence. (Lower is better here).
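As a rough illustration of how the automatic path can be wired up, here is a sketch using the sacrebleu package (an assumed tool choice; the authors' exact scripts are not shown here). METEOR is omitted because it typically comes from NLTK or the evaluate library rather than sacrebleu.

```python
# Sketch of corpus-level BLEU, chrF++, and TER with sacrebleu.
# The hypothesis/reference pair below is a placeholder, not data from the paper.
from sacrebleu.metrics import BLEU, CHRF, TER

hypotheses = ["The Prime Minister visited Bahir Dar today."]
references = [["The Prime Minister visited Bahir Dar today."]]  # one reference stream

bleu = BLEU().corpus_score(hypotheses, references)
chrf = CHRF(word_order=2).corpus_score(hypotheses, references)  # word_order=2 -> chrF++
ter = TER().corpus_score(hypotheses, references)

print(f"BLEU:   {bleu.score:.2f}")
print(f"chrF++: {chrf.score:.2f}")
print(f"TER:    {ter.score:.2f}  (lower is better)")
```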
2. Human Evaluation: The MQM Framework
Algorithms like BLEU are fast, but they can’t tell if a translation is offensive or dangerous. For that, the researchers used the Multidimensional Quality Metrics (MQM) framework.
They hired native speakers and asked them to look for specific error types:
- Mistranslation: The meaning is wrong.
- Addition/Omission: The AI added words that weren’t there or deleted necessary ones.
- Untranslated: The AI gave up and left the text in the original language.
- Grammar/Punctuation/Spelling: Structural errors.
The annotators assigned a “Severity” to each error (Minor, Major, or Critical).
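To make the annotation scheme concrete, a single annotated sentence might be represented as follows; the field names and example spans are illustrative, not the paper's actual data format.

```python
# Hypothetical representation of one MQM annotation (illustrative only).
from dataclasses import dataclass
from typing import List

@dataclass
class MQMError:
    """One error span flagged by a native-speaker annotator."""
    category: str   # e.g. "Mistranslation", "Omission", "Untranslated", "Grammar"
    severity: str   # "Minor", "Major", or "Critical"
    span: str       # the offending text in the model output

# A sentence whose output mistranslated a name would get a record like this:
annotation: List[MQMError] = [
    MQMError(category="Mistranslation", severity="Critical", span="Al-Qaeda"),
    MQMError(category="Punctuation", severity="Minor", span="."),
]
```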
Calculating the Overall Quality Score (OQS)
To turn these human observations into a comparable number, the researchers used a specific set of equations. First, they calculated the Overall Quality Score (OQS).
OQS = (1 − PWPT / PS) × MSV
In this equation:
- PWPT is the “Per-Word Penalty Total.”
- PS is the Penalty Scaler (set to 1).
- MSV is the Maximum Score Value (set to 100).
To get the PWPT, they look at the total penalties divided by the word count:
PWPT = APT / Total Word Count
And finally, the APT (Absolute Penalty Total) is the sum of all error penalties. A “Minor” error might cost 1 point, while a “Critical” error costs 10.
APT = (number of Minor errors × minor penalty) + (number of Major errors × major penalty) + (number of Critical errors × critical penalty)
This rigorous mathematical approach allowed the researchers to quantify exactly how “bad” a translation was based on human judgment.
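Putting the three equations together, a minimal sketch of the scoring arithmetic might look like this. The Minor = 1 and Critical = 10 penalties follow the description above; the Major = 5 weight is a common MQM default and is an assumption here, not a figure from the paper.

```python
# Sketch of the MQM scoring arithmetic (OQS, PWPT, APT) described above.
PENALTIES = {"Minor": 1, "Major": 5, "Critical": 10}  # Major weight is assumed

def overall_quality_score(error_counts: dict, word_count: int,
                          ps: float = 1.0, msv: float = 100.0) -> float:
    """OQS = (1 - PWPT / PS) * MSV, where PWPT = APT / word_count."""
    apt = sum(PENALTIES[sev] * n for sev, n in error_counts.items())  # Absolute Penalty Total
    pwpt = apt / word_count                                           # Per-Word Penalty Total
    return (1 - pwpt / ps) * msv

# Example: 3 Minor and 1 Critical error in a 100-word translation.
print(overall_quality_score({"Minor": 3, "Critical": 1}, word_count=100))  # 87.0
```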
Experiments & Results: The Good, The Bad, and The Gibberish
So, how did the models perform?
Automatic Evaluation Results
When relying on algorithms, the difference between the models was stark. The NLLB (No Language Left Behind) models significantly outperformed the older M2M models.

In Table 2 (Amharic to English), look at the BLEU column. The NLLB3.3B model achieved a score of 26.7. In the world of machine translation, scores above 20 are usually considered “understandable,” and scores above 30 are “good.”
Now look at M2M418M. It scored a 3.07. This is essentially random noise. It indicates that the model failed almost completely to produce coherent English from Amharic input.
The results were similar in the reverse direction (English to Amharic):

As seen in Table 3, NLLB models consistently hover around the 20-22 BLEU mark, while M2M lags behind. Interestingly, the smaller, “distilled” versions of NLLB (NLLB1.3BD) performed quite competitively, suggesting that efficient, smaller models can work for low-resource languages if trained correctly.
Human Evaluation Results
While automatic metrics give us a general ranking, human evaluation reveals the truth. The annotators rated sentences on a 0 to 5 scale (0 being unrelated, 5 being perfect).

Table 4 shows accuracy percentages based on human grading. The NLLB3.3B model achieved 76.1% accuracy for English-to-Amharic translation. This is a promising result: for roughly three-quarters of the sentences, the model produced a high-quality output.
However, note the drop-off for smaller models. The 1.3B model dropped to 42.67%. This suggests that for low-resource languages, model size matters immensely. The model needs that extra capacity to memorize and generalize the complex rules of Amharic.
To ensure these human scores were reliable, the researchers calculated the Fleiss’ Kappa, a statistical measure of agreement between different annotators.

Table 5 shows coefficients roughly between 0.2 and 0.4. While this might seem low to a statistician, in the subjective world of translation evaluation, this represents “Fair Agreement.” It confirms that the annotators generally agreed on which translations were good and which were bad.
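For reference, Fleiss' Kappa can be computed from a ratings matrix with the statsmodels package; the toy ratings below are invented for illustration and are not the study's data.

```python
# Sketch: Fleiss' Kappa for inter-annotator agreement (toy data, not the study's).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = translated sentences, columns = annotators, values = 0-5 quality ratings.
ratings = np.array([
    [5, 4, 5],
    [2, 3, 2],
    [0, 1, 1],
    [4, 4, 3],
])

# aggregate_raters converts raw ratings into per-sentence category counts.
counts, _ = aggregate_raters(ratings)
print(f"Fleiss' Kappa: {fleiss_kappa(counts):.3f}")
```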
Finally, looking at the calculated Overall Quality Score (OQS) derived from the error penalties:

Table 6 confirms the dominance of NLLB3.3B. It achieved a score of 84.77 in English-to-Amharic. This indicates that while errors exist, they are often minor, and the text remains largely usable.
Error Analysis: When AI Hallucinates
This is the most critical part of the study. The numbers tell us that the models make mistakes, but the qualitative analysis tells us what those mistakes are.
The researchers categorized errors into specific types. Let’s look at the distribution of these errors.

In Figure 2 (Amharic to English), the blue and red segments represent the larger NLLB models. The dominant error types are Mistranslation and Omission. This means the model isn’t just making grammar mistakes; it’s getting the core meaning wrong or skipping parts of the sentence entirely.

Figure 3 (English to Amharic) shows a similar pattern, though the total volume of errors is slightly different.
The “Hallucinations”: Specific Examples
The specific examples provided by the researchers are fascinating and highlight the dangers of relying on AI for translation without human oversight.
1. The Named Entity Problem
The models struggled severely with names of people and places.
- The Prime Minister: The models consistently failed to translate “Dr. Abiy” (Ethiopia’s PM). The M2M model translated his name as “Al-Qaeda.” This is a “Critical” severity error. Imagine a news organization using this model to translate a government speech; the geopolitical consequences could be disastrous.
- Geography: The city “Bahir Dar” (capital of Amhara region) was translated as “Gujarat” (a state in India). The model likely associated “region” and “capital” in its training data with India more often than Ethiopia, leading to a hallucination.
2. Literal vs. Contextual Translation
Amharic, like all languages, has idioms and names that shouldn’t be translated literally.

Table 7 highlights these issues.
- Mr. Wendimkun: “Wendimkun” is a proper name. However, the model broke it down literally into “Wendim” (Brother) and “Kun” (Be), translating “Mr. Wendimkun” as “Mr. Brotherkun.”
- Coca-Cola: The famous soda brand was translated into “KOKO KOLLA,” a phonetic gibberish that misses the brand recognition.
3. The “Cocaine” Incident
Perhaps the most shocking error occurred in English-to-Amharic translation.

As shown in Table 8 (third row), the source text discussed “Heineken” (the beer). The model translated this into “crack cocaine.” Similarly, in another instance, a sentence about “Coca-Cola” was mistranslated to imply someone was a “cocaine smuggler.”
These errors (Severity: Critical) occurred because the models likely encountered “coca” in training data related to drugs and failed to distinguish the context of the soft drink.
4. Cultural Misunderstandings
The movie The Sound of Music was translated literally as “The Noise of the Music” (YeMuziqa Dimts), rather than keeping the title or using the culturally accepted translation. The company name “Friendly” was translated as “Friendship” (Wedajinet), losing its status as a proper noun.
Conclusion: The Road Ahead for Amharic AI
This research provides a vital reality check for the AI community. While models like NLLB-200 are a massive leap forward compared to their predecessors (like M2M), they are not yet fluent speakers of Amharic.
Key Takeaways:
- Size Wins: The 3.3 Billion parameter model (NLLB3.3B) was the only one that consistently produced usable translations. Smaller models struggled to grasp the language’s complexity.
- Context is King: The models struggle with “Named Entities” (people, places, brands). They often default to hallucinations or literal translations when they don’t recognize a name.
- Human-in-the-Loop is Mandatory: A BLEU score of 26 might look decent on paper, but it doesn’t capture the fact that the model just called a Prime Minister a terrorist or a soft drink a hard drug. Automatic metrics cannot be the only yardstick for low-resource languages.
The authors conclude that while mLLMs have great potential, future work must focus on better handling of proper names and increasing the size and quality of Amharic datasets. For students and developers interested in NLP, this paper serves as a reminder: Don’t trust the score; read the translation.
By exposing these errors, Alemyehu, Zahera, and Ngonga Ngomo have laid the groundwork for the next generation of translation models—ones that hopefully know the difference between a refreshing beverage and a felony.