Introduction
Large Language Models (LLMs) have achieved remarkable proficiency in translating between languages. You can ask a model to translate a sentence from English to Tibetan, and it will often do a passable job. But language is more than just grammar and vocabulary; it is the vessel for culture.
A critical question facing the AI research community is whether LLMs actually “understand” the culture associated with the languages they speak, or if they are merely mapping words. More specifically, how does cultural knowledge move between languages? If an LLM learns about a Korean festival while training on English text, does it automatically know about that festival when prompted in Korean? Conversely, does learning a low-resource language like Mongolian teach the model about Mongolian culture in English?
New research suggests that this process—cross-lingual transfer—is not as simple as we might hope. In the paper Cross-Lingual Transfer of Cultural Knowledge: An Asymmetric Phenomenon, researchers from Peking University uncover a fascinating disparity. While knowledge flows freely between high-resource languages (like English and Chinese), the bridge becomes a one-way street for low-resource languages.

As illustrated in Figure 1, there is a distinct difference in transfer patterns. For high-resource languages, the transfer is symmetric. However, for low-resource languages, while the model can transfer cultural knowledge from the native language to English, it struggles to do the reverse.
In this post, we will deconstruct the experimental framework used to discover this asymmetry, explain the “frequency hypothesis” that justifies it, and explore what this means for the future of multilingual AI.
Background: The Black Box Problem
To understand why this research is necessary, we must first address a major hurdle in LLM research: opacity.
Most state-of-the-art models (like GPT-4 or Claude) are closed-source. We do not know the exact composition of their training data. If a model answers a question about Tibetan culture correctly, we cannot be sure whether:
- It learned the answer from English documents.
- It learned the answer from Tibetan documents.
- It transferred the knowledge from one language to another.
To study the mechanism of transfer, the researchers could not rely on existing massive models. They had to build a controlled environment. This involved training a model from scratch where every single document it saw was accounted for.
Continual Pretraining (CPT)
The study focuses on a phase called Continual Pretraining. This is a common technique where a base model (usually trained on English) is further trained on a new target language. The goal is to adapt the model to the new language. The researchers wanted to see if cultural knowledge “teleports” across the language barrier during this phase.
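To make this concrete, here is a minimal sketch of what a CPT run can look like in practice, assuming a Hugging Face causal-LM checkpoint and a plain-text target-language corpus. The checkpoint name and file path are placeholders, and the paper's actual recipe (data mixing, schedules, hyperparameters) is not reproduced here.

```python
# A minimal continual-pretraining (CPT) sketch using Hugging Face Transformers.
# "path/to/base-model" and "target_corpus.txt" are placeholders; the paper's
# exact hyperparameters are not reproduced here.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("path/to/base-model")  # English-only base
model = AutoModelForCausalLM.from_pretrained("path/to/base-model")

# Plain-text corpus in the new target language (e.g., Tibetan).
raw = load_dataset("text", data_files={"train": "target_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_set = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cpt-out", num_train_epochs=1),
    train_dataset=train_set,
    # mlm=False -> standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```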
The Core Method: An Interpretable Framework
The authors designed a rigorous framework to isolate the effects of cross-lingual transfer from simple language learning. This methodology relies on three pillars: transparent data, decoupled transfer effects, and bilingual evaluation.
1. Transparent Pretraining
Instead of using a pre-baked model, the researchers trained a 0.5 billion parameter model (based on the Qwen architecture) from scratch using only English Wikipedia data. Crucially, they filtered out all non-Latin characters. This ensured that the base model had absolutely zero exposure to the target languages (Korean, Chinese, Tibetan, or Mongolian) before the experiment began.
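The paper does not publish its exact filtering rule, but the idea is simple: reject any document containing characters outside the Latin script. A rough sketch, where the specific Unicode ranges are our assumption:

```python
import re

# Reject any document containing characters outside Latin letter blocks and
# general punctuation. The exact ranges are an assumption, but the point is
# that Korean, Chinese, Tibetan, and Mongolian scripts all fall outside them,
# so the base corpus stays purely Latin-script.
NON_LATIN = re.compile(r"[^\u0000-\u024F\u2000-\u206F]")

def is_latin_only(doc: str) -> bool:
    return NON_LATIN.search(doc) is None

docs = [
    "The yak is a long-haired bovine found in the Himalayas.",
    "གཡག་ is the Tibetan word for yak.",  # contains Tibetan script -> dropped
]
print([d for d in docs if is_latin_only(d)])  # keeps only the first document
```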
2. Decoupling Transfer Effects (The “Bridge” Experiment)
This is the most innovative part of the methodology. To measure “transfer,” you have to prove that the model didn’t just learn the fact independently in the new language.
The researchers set up two distinct training settings for the continual pretraining phase:
- With Cross-Lingual Bridges: The model is trained on the new language and parallel sentence pairs (e.g., an English sentence concatenated with its translation). This explicitly helps the model align the two languages.
- Without Cross-Lingual Bridges: The model sees the exact same data—same English sentences, same target language sentences—but the parallel pairs are shuffled and separated. They never appear together in the same context window.
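Here is a toy illustration of how the two settings can be built from the same parallel data; the function names and placeholder translations are ours, not the paper's:

```python
import random

# Toy parallel data: (English sentence, its target-language translation).
# The "<Tibetan translation>" strings are placeholders, not real data.
pairs = [
    ("The yak is a long-haired bovine.", "<Tibetan translation 1>"),
    ("Losar is the Tibetan New Year.", "<Tibetan translation 2>"),
]

def with_bridges(pairs):
    # Each training document holds a sentence AND its translation, giving the
    # model an explicit alignment signal inside one context window.
    return [en + "\n" + tgt for en, tgt in pairs]

def without_bridges(pairs, seed=0):
    # Identical sentences, but split into separate documents and shuffled,
    # so the two halves of a pair never share a context window.
    docs = [en for en, _ in pairs] + [tgt for _, tgt in pairs]
    random.Random(seed).shuffle(docs)
    return docs

print(with_bridges(pairs)[0])  # one aligned bilingual document
print(without_bridges(pairs))  # shuffled monolingual documents
```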

Figure 2 visualizes this setup using English and Tibetan as an example.
- Left Side: The base model is trained purely on English.
- Center (The Split):
  - In the “w/ Bridges” path, the model sees aligned documents. It learns that “Yak” in English relates to the specific Tibetan term.
  - In the “w/o Bridges” path, the links are broken: the same sentences appear, but never together in one context window.
- Right Side: We evaluate the model in both English and the target language.
The Logic of the Gap: If the model performs significantly better in the “With Bridge” setting than the “Without Bridge” setting, that performance gap represents cross-lingual transfer. It means the model used the “bridge” (the parallel data) to access knowledge it already had in one language to answer questions in the other.
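Put as a formula, the transfer effect at any training step is just the accuracy difference between the two settings:

$$
\text{Transfer effect} = \text{Acc}_{\text{w/ bridges}} - \text{Acc}_{\text{w/o bridges}}
$$

Because both settings see exactly the same sentences, any positive gap can only come from the alignment the bridges provide.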
3. Bilingual Parallel Evaluation
To test cultural knowledge, the researchers gathered questions about four specific cultures. They ensured these questions existed in both English and the native language.
- English-to-Target Transfer: Can the model use knowledge learned in English (during the base training) to answer questions phrased in the target language?
- Target-to-English Transfer: Can the model use knowledge learned in the target language (during continual pretraining) to answer questions phrased in English?
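One plausible shape for such an evaluation, with a hypothetical question item and a direction-agnostic accuracy helper (the paper's actual data format and scoring rule may differ):

```python
# A hypothetical shape for one bilingual evaluation item; the paper's actual
# data format and scoring rule may differ.
item = {
    "culture": "Tibetan",
    "question_en": "What is the Tibetan New Year festival called?",
    "question_bo": "<the same question, written in Tibetan>",
    "answer": "Losar",
}

def accuracy(answer_fn, items, lang):
    """Score the same question set asked in one language ('en' or 'bo')."""
    key = f"question_{lang}"
    hits = sum(answer_fn(it[key]).strip() == it["answer"] for it in items)
    return hits / len(items)

# English-to-Target: the base model learned the fact in English;
# does it answer when asked in Tibetan?   accuracy(model_fn, items, "bo")
# Target-to-English: CPT taught the fact in Tibetan;
# does it answer when asked in English?   accuracy(model_fn, items, "en")
```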
Experiments and Results
The study focused on four cultural communities, selected to represent a mix of high-resource and low-resource languages.

As shown in Table 2, the researchers looked at:
- Koreans (High-resource language): Data is abundant.
- Han Chinese (High-resource language): Data is very abundant.
- Tibetans (Low-resource language): Data is scarce; distinct script.
- Mongols in China (Low-resource language): Data is scarce; distinct script.
They collected hundreds of culture-specific questions (e.g., regarding holidays, history, and customs) for each group. Table 3 below details the dataset statistics; it also notes the difficulty of collecting data for low-resource languages, where question lengths vary significantly because of the scripts themselves.

The Results: A Tale of Two Transfers
The experimental results revealed the “Asymmetric Phenomenon” highlighted in the paper’s title. Let’s analyze the performance graphs.

Figure 3 displays accuracy over training steps. The Blue Line represents the setting “With Bridges” (Transfer enabled), and the Orange Line represents “Without Bridges” (No Transfer). The gap between the lines is the transfer effect.
1. High-Resource Languages (Rows 1 & 2: Korean, Chinese)
Look at graphs 1a and 2a (Target Language Evaluation). There is a consistent gap between the blue and orange lines. This means that English knowledge is successfully helping the model answer questions in Korean and Chinese.
Now look at 1b and 2b (English Evaluation). The gap is even wider. As the model learns Chinese or Korean, it transfers that new cultural knowledge back into its English capabilities.
- Conclusion: Transfer is Bidirectional. The bridge supports traffic in both directions.
2. Low-Resource Languages (Rows 3 & 4: Tibetan, Mongolian)
This is where it gets interesting.
- Target-to-English (Graphs 3b & 4b): Look at the (b) panels of rows 3 and 4. There is a distinct gap: the blue line is higher. This means that as the model reads Tibetan or Mongolian text, it successfully transfers that cultural knowledge to English. It learns about a Tibetan custom in Tibetan, and via the bridge, can answer questions about it in English.
- English-to-Target (Graphs 3a & 4a): Now look at the (a) panels of the same rows. The lines are almost overlapping, or the gap is negligible and inconsistent. The “Bridge” isn’t helping much.
- Conclusion: Transfer is Asymmetric. Knowledge flows out of the low-resource language, but English knowledge does not flow in to help with the low-resource language tasks.
The Frequency Hypothesis: Why the Asymmetry?
Why does the bridge work one way but not the other for Tibetan and Mongolian?
The researchers propose the Frequency-Based Hypothesis: Cultural knowledge transfers only if it appears frequently enough in the source training data.
To prove this, they calculated the Cultural Density—how often cultural keywords appear in the corpora of the different languages.
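Conceptually, the measurement is keyword frequency normalized by corpus size. A minimal sketch, assuming a tokenized corpus and a curated keyword list (the paper's exact keyword sets and normalization are not reproduced here):

```python
# A minimal sketch of the cultural-density measurement: keyword occurrences
# per token. The keyword list and whitespace tokenization are assumptions.
def cultural_density(corpus_tokens, keywords):
    keyword_set = {k.lower() for k in keywords}
    hits = sum(tok.lower() in keyword_set for tok in corpus_tokens)
    return hits / len(corpus_tokens)

corpus = "Losar is the Tibetan New Year festival".split()
print(cultural_density(corpus, ["Losar", "thangka"]))  # 1 hit / 7 tokens ≈ 0.14
```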

Table 1 provides the smoking gun:
- The High-Resource Case: For Korean and Chinese culture, the density in the English corpus and the Native corpus is roughly comparable (same order of magnitude). English Wikipedia talks about China and Korea quite a lot. Therefore, there is enough “source material” in English to transfer over to the native language.
- The Low-Resource Case: Look at the numbers for Tibetans and Mongols.
  - In English Corpus: The density is very low (~1.5e-7).
  - In Native Corpus: The density is significantly higher (roughly 60x higher for Tibetan).
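Taking those figures at face value, the imbalance works out to roughly:

$$
\underbrace{1.5 \times 10^{-7}}_{\text{English corpus}} \times 60 \approx \underbrace{9 \times 10^{-6}}_{\text{Tibetan corpus}}
$$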
The Explanation: The model cannot transfer what it does not know. The English base model barely knows anything about Tibetan culture because those concepts rarely appear in English Wikipedia. Therefore, building a “bridge” from English to Tibetan doesn’t bring any new cultural insights to the Tibetan side.
However, the reverse works beautifully. The Tibetan corpus is rich with Tibetan culture. When the model reads this, the “bridge” allows it to export that rich knowledge into the English conceptual space.
Conclusion and Implications
This paper provides a sobering look at the limitations of current multilingual LLMs. It challenges the assumption that simply training a massive English model and “teaching it languages” will automatically result in a culturally intelligent system.
Key Takeaways:
- Transfer is not guaranteed: Just because a model knows two languages doesn’t mean it shares knowledge between them perfectly.
- Data sparsity is the bottleneck: For low-resource cultures, English-centric models often lack the “source knowledge” to transfer. You can’t bridge a void.
- The “Export” Value of Low-Resource Languages: Interestingly, training on low-resource languages is highly effective at teaching the model about those cultures in English. This suggests that preserving and using low-resource language data is vital not just for those communities, but for enriching the global knowledge base of the AI in English as well.
This research highlights the importance of data transparency and careful curriculum design. If we want AI to truly represent global diversity, we cannot rely solely on the massive gravity of English data. We must ensure that the “bridges” we build have strong foundations on both sides of the river.