If you have ever played around with ChatGPT or similar Large Language Models (LLMs), you have likely encountered a “hallucination”—a moment where the model confidently asserts a fact that is completely wrong. For English speakers, these errors are becoming less frequent as models improve. However, for the billions of people who speak “low-resource” languages (languages with less representation in the training data, like Tamil, Telugu, or Marathi), the reliability gap is massive.

When an LLM doesn’t know the answer, the safest move is to abstain—to simply say, “I don’t know.” But teaching a model when to answer and when to abstain is a complex challenge, especially in a multilingual context.

In this post, we will dive into a fascinating research paper titled “Teaching LLMs to Abstain across Languages via Multilingual Feedback.” The researchers propose a novel method to help LLMs self-reflect by leveraging the relationships between different languages, ultimately making AI more reliable and equitable for everyone.

The Problem: The Multilingual Reliability Gap

We generally assume that if an AI is smart in English, it must be smart in other languages. Unfortunately, that is not the case. Multilingual LLMs often display severe knowledge disparities. They might be encyclopedic experts in English or French but struggle with basic facts in Nepali or Malayalam.

The standard solution to hallucinations is to teach the model to measure its own confidence. If confidence is low, the model abstains. Researchers have developed various “abstain strategies” involving calibration (checking token probabilities), prompting (asking the model “are you sure?”), and training.

The problem? These strategies were almost exclusively developed and tested in English. When researchers applied these standard English-centric abstain strategies to low-resource languages, the results were discouraging.

Figure 1: Average accuracy of abstention baselines in low- and high-resource languages.

As seen in Figure 1 above, there is a significant drop in “Abstain Accuracy” (the ability to correctly decide when to answer vs. when to stay silent) when moving from high-resource languages (red bars) to low-resource languages (blue bars). On the MMLU dataset, existing strategies dropped by 12.8% in effectiveness.

Essentially, in low-resource languages, LLMs are not only more likely to be wrong, but they are also worse at knowing they are wrong.

The Solution: Multilingual Feedback

The core hypothesis of this paper is brilliant in its simplicity: Perspective.

If you are unsure about an answer, you might ask a friend for a second opinion. If you ask a friend who thinks exactly like you, they might just confirm your bias. But if you ask friends from slightly different backgrounds or perspectives, their feedback might help you realize your mistake.

The researchers apply this logic to LLMs. Instead of asking the model to reflect on its answer in English (which might be disconnected from the local context) or solely in the target language (where the model might be weak), they ask the model to generate feedback in related languages.

How It Works

The proposed method, called Abstaining via Multilingual Feedback, follows a three-step process:

  1. Proposed Answer: The LLM generates an answer to a question in the target language (e.g., Ukrainian).
  2. Feedback Generation: The LLM is prompted to critique its own answer. Crucially, it generates this feedback in several different languages.
  3. Abstain Decision: Based on this diverse feedback, the model decides whether the original answer is True (keep it) or False (abstain).

Figure 2: Overview of abstaining via multilingual feedback.

Figure 2 illustrates this workflow. Look at the bottom right quadrant labeled “Multilingual, Related.” When the model is asked a question in Ukrainian, it seeks feedback in related Slavic languages like Slovak, Russian, and Polish. This “family of languages” approach provides a richer context for the model to evaluate its own work.
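To make the three steps concrete, here is a minimal Python sketch of the loop. It is an illustration of the idea rather than the authors' implementation: `generate(prompt)` is a hypothetical wrapper around whatever LLM API you use, and the prompts and related-language list are placeholder examples.

```python
# Minimal sketch of abstaining via multilingual feedback.
# generate(), the prompts, and RELATED are illustrative placeholders.

RELATED = {"ukrainian": ["slovak", "russian", "polish"]}  # example pairing from Figure 2

def generate(prompt: str) -> str:
    """Hypothetical wrapper around an LLM API call."""
    raise NotImplementedError("plug in your model client here")

def abstain_with_multilingual_feedback(question: str, lang: str) -> tuple[str, bool]:
    # Step 1: propose an answer in the target language.
    answer = generate(f"Answer in {lang}: {question}")

    # Step 2: critique the answer in several related languages.
    feedback = [
        generate(
            f"In {fb_lang}, give feedback on this answer.\n"
            f"Question: {question}\nProposed answer: {answer}"
        )
        for fb_lang in RELATED.get(lang, ["english"])
    ]

    # Step 3: keep the answer if the feedback supports it, abstain otherwise.
    verdict = generate(
        "Given the feedback below, is the proposed answer True or False?\n"
        + "\n".join(feedback)
        + f"\nProposed answer: {answer}"
    )
    return answer, "false" in verdict.lower()  # second value: whether to abstain
```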

Why not just use random languages? Or just English?

The researchers tested four specific settings to see what worked best:

  1. Mono-Native: Feedback is generated only in the original language of the question.
  2. Mono-English: All feedback is generated in English (the highest-resource language).
  3. Multi-Random: Feedback is generated in random languages.
  4. Multi-Related: Feedback is generated in languages linguistically or culturally close to the question language.

To mathematically determine which languages are “related,” the researchers used Lang2vec, a tool that represents languages as vectors based on linguistic attributes (syntax, phonology, geography, etc.). They calculated the distance between languages using the following equation:

Equation for calculating language distance.
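One plausible instantiation, shown here purely for illustration, averages a cosine distance over the lang2vec feature vectors of the two languages; the paper's exact formula may combine or weight the feature sets differently.

```latex
% Assumed form: distance between languages \ell_1 and \ell_2 as the average
% cosine distance of their lang2vec vectors across feature sets F
% (syntax, phonology, geography, ...). Not necessarily the paper's exact formula.
d(\ell_1, \ell_2) \;=\; \frac{1}{|F|} \sum_{f \in F}
  \left( 1 - \frac{\mathbf{v}_f(\ell_1) \cdot \mathbf{v}_f(\ell_2)}
                  {\lVert \mathbf{v}_f(\ell_1) \rVert \, \lVert \mathbf{v}_f(\ell_2) \rVert} \right)
```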

By minimizing this distance, they could select the most relevant neighbors for any given language—asking for feedback in Telugu when the question is in Kannada, or Spanish when the question is in Catalan.
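As a sketch of how that selection works in the Multi-Related setting, the snippet below ranks candidate languages by distance over precomputed feature vectors. The vectors are made-up toy values standing in for lang2vec representations; in practice they would come from the actual tool, whose API and feature-set names are not reproduced here.

```python
import numpy as np

# Toy feature vectors standing in for lang2vec representations
# (syntax, phonology, geography, culture, ...). Real vectors are much longer.
LANG_VECS = {
    "kannada": np.array([0.9, 0.2, 0.8, 0.7]),
    "telugu":  np.array([0.9, 0.3, 0.8, 0.7]),
    "tamil":   np.array([0.8, 0.3, 0.7, 0.6]),
    "english": np.array([0.1, 0.9, 0.1, 0.2]),
}

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def related_languages(target: str, k: int = 2) -> list[str]:
    """Return the k languages closest to `target` (the Multi-Related setting)."""
    candidates = [l for l in LANG_VECS if l != target]
    return sorted(candidates, key=lambda l: cosine_distance(LANG_VECS[target], LANG_VECS[l]))[:k]

print(related_languages("kannada"))  # ['telugu', 'tamil'] with these toy vectors
```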

Why This Method Works: The Power of Conflict

You might think that using English for everything (Mono-English) would work best because LLMs are strongest in English. However, the results showed otherwise.

The researchers analyzed the type of feedback generated in these different settings. They categorized feedback into four roles:

  • Similar: Just repeats the answer.
  • Unrelated: Hallucinates something else entirely.
  • Complementary: Adds new, supporting info.
  • Conflicting: Disagrees with the proposed answer.

Conflicting feedback is actually good. It forces the model to stop and think, “Wait, if my feedback says this is wrong, maybe I should abstain.”

Figure 3: Analysis of feedback roles showing conflict levels.

Figure 3 shows that the Multi-Related approach (bottom right) generates the highest percentage of conflicting feedback (blue slice) and complementary feedback (orange slice). Monolingual approaches tend to produce “Similar” feedback (green slice), creating an echo chamber where the model just pats itself on the back for a wrong answer.

Furthermore, when GPT-4 was asked to judge the quality of the feedback, it found the Multi-Related feedback to be more relevant and informative than the alternatives.

Figure 4: GPT-4 evaluation of feedback relevance and informativeness.

As shown in Figure 4, Multi-Related feedback wins head-to-head comparisons against native and English-only strategies, particularly in being more relevant to the specific cultural or linguistic nuance of the question.

Experimental Results

The researchers evaluated their method using three models (Aya-13B, ChatGPT, and GPT-4) across varied datasets involving commonsense reasoning and general knowledge.

The Headline Result: The Multi-Related strategy consistently outperformed strong baselines. It achieved up to a 9.2% improvement in abstain accuracy for low-resource languages.

This improvement didn’t come at the cost of high-resource languages; the method remained competitive there too. But the real victory was closing the gap—making the model safer for speakers of languages like Kannada and Marathi.

Fairness and Equity

Accuracy averages can hide inequality. If a model is 99% accurate in English and 10% accurate in Bengali, the average might look okay, but the system is fundamentally unfair.

The researchers measured Equity using the Gini coefficient—a metric typically used in economics to measure wealth inequality. In this context, a lower Gini coefficient means the model’s performance is more equal across different languages.

Equation for Gini coefficient to measure equity.

Equation for Utility measurement.
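For reference, the standard Gini coefficient applied to per-language abstain accuracies $a_1, \dots, a_n$ over the $n$ evaluation languages takes the form below; this is an illustrative instantiation (with utility taken as the plain average), and the paper's exact definitions may differ.

```latex
% Standard Gini coefficient over per-language abstain accuracies a_1..a_n,
% with utility assumed to be their mean. Illustrative instantiation only.
G \;=\; \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \lvert a_i - a_j \rvert}{2 n^2 \bar{a}},
\qquad
\bar{a} \;=\; \frac{1}{n} \sum_{i=1}^{n} a_i
```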

Using these metrics, the Multi-Related approach was found to be the most equitable. It creates a “rising tide that lifts all boats,” rather than just optimizing for the dominant languages.

The Role of Culture

One of the most profound findings of this paper is that language is not just code; it is culture.

The researchers broke down the “relatedness” of languages by different factors: Syntactic (grammar), Geographic (location), Phonological (sound), and Cultural (values).

Table 3: Performance averages for various language relatedness settings.

Table 3 reveals a fascinating insight: Culture (based on the World Values Survey) was the most effective metric for selecting feedback languages to improve equity. This suggests that when an LLM tries to answer a question, sharing a cultural framework is just as important as sharing grammatical roots.

This is further supported by looking at where the model fails.

Figure 6: Abstain accuracy gaps across different domains.

Figure 6 compares the abstain accuracy gap between high- and low-resource languages across different topics.

  • Right Side (Large Gaps): Topics like “US Foreign Policy,” “High School European History,” and “Moral Disputes.” These are heavily culturally laden and West-centric.
  • Left Side (Small Gaps): Topics like “High School Math,” “Anatomy,” and “Physics.” These are objective, universal truths.

The massive gaps in social and cultural topics highlight that hallucinations in low-resource languages often stem from a lack of cultural knowledge, not just vocabulary.

Future Implications

The paper concludes with several forward-looking experiments that suggest where this field is going.

1. Cross-Lingual Retrieval

In many real-world scenarios, we use “Retrieval Augmented Generation” (RAG)—feeding the LLM external documents. If we translate a low-resource query to English to find documents, can we still use multilingual feedback?

Figure 7: Abstain accuracy in cross-lingual retrieval.

Figure 7 confirms that yes, even when using English documents for retrieval, the Multi-Related feedback method (green bars) generally helps the model make better abstain decisions than probability-based or reflective baselines.
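As a rough sketch of how retrieval and multilingual feedback could be combined, where `generate`, `translate`, and `search` are hypothetical placeholders for a model client, a translation step, and a document index:

```python
# Sketch: retrieve evidence via an English translation of the query,
# then still critique the answer in languages related to the original one.
# generate/translate/search are hypothetical placeholders, not a specific API.

def generate(prompt: str) -> str:
    raise NotImplementedError

def translate(text: str, to_lang: str) -> str:
    raise NotImplementedError

def search(query: str) -> list[str]:
    raise NotImplementedError

def rag_abstain(question: str, lang: str, related: list[str]) -> tuple[str, bool]:
    docs = search(translate(question, to_lang="english"))  # cross-lingual retrieval
    answer = generate(
        f"Answer in {lang} using these documents:\n" + "\n".join(docs)
        + f"\nQuestion: {question}"
    )
    feedback = [generate(f"In {fb}, critique this answer: {answer}") for fb in related]
    verdict = generate("Given this feedback, is the answer True or False?\n" + "\n".join(feedback))
    return answer, "false" in verdict.lower()
```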

2. Model Collaboration

Can a massive, general-purpose model like GPT-4 work together with a smaller, specialized multilingual model like Aya-13B?

The researchers tried using GPT-4 to answer the question, but using Aya-13B to generate the multilingual feedback.
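A minimal sketch of that division of labor, with `gpt4_generate` and `aya_generate` standing in for whichever client calls you use for the two models:

```python
# Sketch of model collaboration: the large general model proposes and judges,
# the smaller multilingual model supplies the feedback.
# Both generate functions are hypothetical client wrappers.

def gpt4_generate(prompt: str) -> str:
    raise NotImplementedError

def aya_generate(prompt: str) -> str:
    raise NotImplementedError

def collaborative_abstain(question: str, lang: str, related: list[str]) -> bool:
    answer = gpt4_generate(f"Answer in {lang}: {question}")
    feedback = [aya_generate(f"In {fb}, critique this answer: {answer}") for fb in related]
    verdict = gpt4_generate(
        "Given this feedback, is the answer True or False?\n" + "\n".join(feedback)
    )
    return "false" in verdict.lower()  # True means the model should abstain
```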

Table 4: Collaboration between GPT-4 and Aya-13B.

Table 4 shows that this collaboration actually benefits low-resource languages. The smaller, culturally diverse model acts as a “check” on the massive, Western-centric model.

3. Why we can’t just transfer decisions

Finally, you might wonder: “Can’t we just translate the question to English, see if the model abstains in English, and copy that decision?”

Figure 5: Overlap of abstain decisions across languages.

Figure 5 suggests that the answer is “No.” The Venn diagrams show the overlap of abstain decisions. In the bottom-right (Low Resource), the overlap is small. A model might confidently answer a question in English but hallucinate on the same question in Tamil. Abstaining is a language-specific problem.

Conclusion

The phrase “lost in translation” is usually about meaning, but for AI, it’s about reliability. As we push for global AI adoption, we cannot treat low-resource languages as mere translations of English.

This research demonstrates that Multilingual Feedback—using the collective wisdom of related languages—is a powerful tool. By identifying knowledge gaps through the lens of related cultures and linguistic families, we can teach LLMs to be humble.

Improving the “Abstain Accuracy” isn’t just about reducing error rates; it’s about building trust. And as this paper shows, the path to trustworthy AI is not monolingual—it is deeply, structurally multilingual.