Introduction

We often think of Large Language Models (LLMs) like ChatGPT as universal translators. If you ask a modern LLM to translate English into French or Spanish, the results are often fluent and accurate. However, this performance is not distributed equally. When we step away from high-resource languages and attempt to translate into “low-resource” languages—those with significantly less training data available on the internet—the models often stumble. They hallucinate, miss key terms, or fail to generate coherent text entirely.

For researchers and students in Natural Language Processing (NLP), this presents a significant equity gap. How do we make these powerful models useful for the thousands of languages that don’t have the massive web corpus of English or Chinese?

A recent paper titled “Chain-of-Dictionary Prompting Elicits Translation in Large Language Models” proposes a fascinating, training-free solution. The researchers introduce a framework called Chain-of-Dictionary (CoD). Instead of fine-tuning the model or relying on hard-to-find example sentences, CoD injects a “chain” of multilingual dictionary definitions directly into the prompt.

The results are staggering. By simply providing these lexical chains, the researchers achieved performance gains of up to 13x for certain language pairs, sometimes even surpassing state-of-the-art dedicated translation models like NLLB. In this post, we will deconstruct how CoD works, why “chaining” languages is more effective than simple translation, and look at the empirical evidence of its success.

The Problem: The Lexical Gap

To understand why CoD is necessary, we first need to look at why LLMs fail at translation. The primary culprit in low-resource settings is the “lexical-level” problem. The model simply hasn’t seen enough examples of rare words or specific grammatical structures in the target language to map them correctly from the source.

Standard approaches to fixing this involve In-Context Learning (ICL), specifically few-shot prompting. This is where you give the model a few examples of full sentences translated correctly (e.g., “Here is an English sentence, here is the Tamil translation”) before asking it to translate a new one.
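For illustration, a generic few-shot translation prompt might look like the following sketch; the layout and placeholder sentences are our own, not taken from the paper:

```
Translate English into Tamil.

English: <example sentence 1>
Tamil: <its Tamil translation>

English: <example sentence 2>
Tamil: <its Tamil translation>

English: <the sentence to translate>
Tamil:
```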

However, for low-resource languages, finding relevant, high-quality sentence pairs to use as few-shot examples is difficult. Furthermore, the paper suggests that unrelated few-shot examples don’t actually help the model much with the specific vocabulary of the input sentence.

This is where dictionaries come in. Dictionaries are easier to acquire than aligned parallel corpora. But simply pasting a dictionary entry into a prompt isn’t always enough. The researchers drew inspiration from Chain-of-Thought (CoT) reasoning—the idea that showing the model intermediate reasoning steps improves performance. They applied this logic to translation, creating a “chain” of meaning across multiple languages.

The Core Method: Chain-of-Dictionary (CoD)

The CoD framework operates on a simple but powerful premise: give the LLM a cheat sheet, but make it multilingual.

When the system receives a source sentence, it performs the following steps before sending the final prompt to the LLM (a minimal code sketch follows the list):

  1. Keyword Extraction: It identifies specific words in the input sentence that might be difficult to translate.
  2. Dictionary Lookup: It looks up these words in a dictionary.
  3. Chaining: Instead of just finding the translation in the target language, it retrieves translations in the target language plus several “auxiliary” high-resource languages (like French, German, or Portuguese).
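To make these steps concrete, here is a minimal Python sketch of how CoD prompt construction might look. Everything here is an illustrative assumption rather than the paper's actual implementation: the keyword extractor is a stub, the dictionary holds a single toy entry, and the German and French translations come from the Figure 1 example discussed below (the Tamil entry is left as a placeholder).

```python
# Minimal sketch of CoD prompt construction (illustrative, not the
# paper's code). The dictionary maps a source word to its translation
# in the target language plus auxiliary high-resource languages.
TOY_DICT = {
    "diabetic": {
        "Tamil": "<Tamil translation>",  # placeholder: real entry elided
        "German": "Diabetiker",          # from the paper's Figure 1 example
        "French": "diabétique",
    },
}

def extract_keywords(sentence: str) -> list[str]:
    """Stub extractor: keep only words that have a dictionary entry."""
    tokens = [w.strip(".,!?").lower() for w in sentence.split()]
    return [w for w in tokens if w in TOY_DICT]

def build_chain(word: str, entries: dict[str, str]) -> str:
    """Link one source word through the target and auxiliary languages."""
    links = [f'"{word}"'] + [f'"{t}"' for t in entries.values()]
    return " means ".join(links) + "."

def build_cod_prompt(sentence: str, src: str, tgt: str) -> str:
    """Translation request (upper section) followed by chained dictionaries."""
    request = f"Translate the following text from {src} into {tgt}:\n{sentence}"
    chains = [build_chain(w, TOY_DICT[w]) for w in extract_keywords(sentence)]
    return request + "\n\n" + "\n".join(chains)
```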

The Prompt Structure

The resulting prompt consists of two sections: the standard translation request and the chained multilingual dictionaries.

The format looks like this: "<word in source> means <word in target> means <word in auxiliary 1> means <word in auxiliary 2>."

By explicitly linking the source word to the target word, and then linking that target word to other languages the LLM understands very well (like German or French), the prompt creates a semantic “bridge.”
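Fed a made-up sentence containing "diabetic," the sketch above would assemble a prompt roughly like this (the input sentence and the Tamil placeholder are ours, not the paper's):

```
Translate the following text from English into Tamil:
Doctors studied the diabetic mice.

"diabetic" means "<Tamil translation>" means "Diabetiker" means "diabétique".
```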

Figure 1: An illustration of CoD for English-to-Tamil translation. CoD consists of two sections: the standard translation prompt (the upper box) and the chained multilingual dictionaries. The chained dictionary portion is highlighted by language, showing each word alongside its translations in the different languages. CoD outperforms standard prompting in this example.

As shown in Figure 1, the difference in output is dramatic.

  • Left (Standard Prompting): The model attempts to translate an English sentence about diabetic mice into Tamil. Lacking context, it produces a low-quality translation that, when back-translated, talks about “soaps” and “two month old soaps,” completely missing the biological context.
  • Right (CoD Prompting): The prompt includes a chain. For the English word “diabetic,” it provides the Tamil translation, but also the German (Diabetiker) and French (diabétique) equivalents. This extra context stabilizes the model’s understanding. The resulting Tamil translation is accurate.

Why Chaining Languages Matters

You might wonder: Why not just give the English-to-Tamil translation? Why add French and German?

The authors hypothesize that high-resource languages act as “cross-lingual cues.” An LLM like ChatGPT has seen a massive amount of French and German text. If the connection between English and a low-resource language is weak in the model’s parameters, the connection between the low-resource language and French might be slightly stronger, or the semantic cluster formed by seeing the word in three major languages reinforces the correct meaning.

The researchers tested this hypothesis through ablation studies (experiments where you remove parts of the system to see what breaks).

Table 2: Evaluations of CoD and various baselines on GPT-3.5, averaged over 200 languages, translating from English into the other languages.

Table 2 highlights the importance of the chain:

  • Bilingual Dictionary (Row 3): Using just Source → Target entries boosts performance to a score of 36.37.
  • Decomposed Dictionary (Row 4): If you provide the translations as separate sentences (breaking the chain), performance drops significantly to 31.20.
  • CoD (Row 12/13): The full chain pushes performance up to 38.27.

This confirms that the structure of the prompt—specifically linking the meanings in a continuous chain—allows the model to leverage its multilingual prior knowledge more effectively.
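To see exactly what the "Decomposed Dictionary" ablation changes, compare the two formats for a single word. The wording below follows the paper's description as summarized above; the exact phrasing of the decomposed variant is our assumption:

```python
entry = ("diabetic", "<Tamil translation>", "Diabetiker", "diabétique")

# Chained (CoD): one continuous "means" chain (38.27 chrF++ in Table 2).
chained = '"{0}" means "{1}" means "{2}" means "{3}".'.format(*entry)

# Decomposed: the same facts as separate sentences, breaking the chain
# (drops to 31.20 chrF++ in Table 2).
decomposed = " ".join(f'"{entry[0]}" means "{t}".' for t in entry[1:])
```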

Experiments and Results

The researchers evaluated CoD on the FLORES-200 benchmark, a massive dataset covering roughly 200 languages. They tested using ChatGPT (GPT-3.5-Turbo), InstructGPT, and BLOOM.

Massive Gains in Low-Resource Languages

The most striking visual representation of the results comes from the comparison of CoD against standard ChatGPT prompting across all 200 languages.

Figure 2: A comparison, across 200 languages translated from English, between baseline ChatGPT (GPT-3.5-Turbo) and CoD. Languages are sorted in descending order of ChatGPT's chrF++ score. CoD is effective for many languages, especially low-resource ones.

In Figure 2, the blue bars represent the baseline ChatGPT performance, and the red bars represent CoD.

  • Top Chart (High/Mid Resource): For languages where ChatGPT is already good (left side), CoD offers modest improvements.
  • Bottom Chart (Low Resource): Look at the right side of the bottom chart. There are languages where the blue bar is almost non-existent (meaning the model failed completely), but the red bar shoots up.

For example, in translating English to Serbian (Cyrillic script), the score jumped from 3.08 to 42.63—a 13x improvement.

Statistical Overview

How consistent are these improvements? The authors analyzed the win/loss ratio across the language pairs.

Table 1: Statistics of the changes in chrF++ with CoD on GPT-3.5-Turbo across 200 languages. 83.75% of the directions (335 out of 400) improve. The advantages of CoD clearly outweigh the disadvantages.

As shown in Table 1:

  • X-En (Translating into English): CoD improved performance for every single language tested (200/200).
  • En-X (Translating from English): CoD improved 135 out of 200 languages. Crucially, 71 of those improved by more than 5 points, which is a significant margin in translation metrics (chrF++).
  • Degradation: In the few cases where performance dropped, the decrease was usually minor compared to the massive gains in the successful cases.
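These counts also square with the Table 1 caption: 200 improved X-En directions plus 135 improved En-X directions gives 335 out of 400, which is exactly 83.75%.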

Beating the Specialist Models

Perhaps the most impressive claim is how this general-purpose LLM (equipped with CoD) compares to models specifically trained for translation, such as NLLB (No Language Left Behind) by Meta.

Table 5: Results of CoD (based on GPT-3.5-Turbo) compared to the SOTA translator NLLB, measured in chrF++ on 200 languages from the full FLORES-200 devtest set.

Table 5 reveals that while NLLB still holds an edge in translating out of English (En-X), CoD allows ChatGPT to actually surpass NLLB when translating into English (X-En). The CoD score of 66.12 beats NLLB’s 54.77. This suggests that for tasks involving understanding low-resource languages and bringing them into English, prompted LLMs are becoming a superior option to dedicated translation systems.

Case Study: Qualitative Analysis

Numbers tell one story, but actual text samples show us how the model improves. The paper provides several case studies where the standard model hallucinates or fails to grasp the topic, while CoD maintains fidelity.

Figure 3: A case study on translating from English into Kikongo (Latin script) using GPT-4 throughout. Words translated incorrectly by the baselines but correctly by CoD are highlighted in green.

In Figure 3, the task is to translate a sentence about Olympic medals into Kikongo.

  • Standard Prompt: The output (when back-translated) talks about “eight trees” and “congregations.” It has completely lost the plot.
  • Bilingual Prompt: Better, but it hallucinates “bubble to transport cargo.”
  • CoD Prompt: The translation accurately conveys “18 medals” and “medal podium.”

An interesting observation made by the authors is that CoD seems to “elicit” translation abilities even for words not explicitly in the dictionary chain. By setting the correct context and topic through the chain, the model activates the correct subset of its parameters for that language, improving the translation of the entire sentence, not just the keywords.

Conclusion and Implications

The “Chain-of-Dictionary” paper provides a compelling argument for the use of retrieval-augmented prompting in machine translation. Rather than accepting that LLMs are simply “bad” at low-resource languages, this research shows that the capability often exists within the model—it just needs the right key to unlock it.

Key Takeaways:

  1. Prior Knowledge is Key: Injecting dictionary definitions works better than few-shot examples for low-resource languages because dictionaries are easier to obtain and provide precise semantic grounding.
  2. Multilingualism helps Multilingualism: Chaining translations through high-resource languages (French, German) stabilizes the model’s understanding of low-resource languages.
  3. Training-Free Improvement: This method requires no fine-tuning or model updates. It is purely a prompting strategy that can be applied to any existing LLM.

As LLMs continue to grow, techniques like CoD will be essential for ensuring these technologies are accessible to speakers of all languages, not just the dominant ones. CoD bridges the gap between the static knowledge of a dictionary and the generative fluency of a Large Language Model.