Wikipedia is often viewed as a singular, universal repository of human knowledge. We tend to assume that switching the language setting from English to French or Russian simply translates the text. However, the reality is far more complex. Wikipedia is a federation of distinct communities, each with its own editors, cultural norms, and biases. This leads to distinct narratives where facts present in one language are completely omitted in another.

For computational social scientists, these discrepancies offer a window into cross-cultural differences and systemic biases. But how do you measure these differences at scale? Previous methods relied on coarse statistics, such as article counts or simple word-level analyses. These approaches miss the nuance of what, specifically, is missing.

In this post, we dive into a research paper that introduces INFOGAP, a novel, automated pipeline designed to locate fine-grained information gaps between languages. The authors apply this method to a case study of LGBT biographies, revealing striking differences in how public figures are portrayed across English, Russian, and French Wikipedia.

Figure 1: We propose a method, INFOGAP, to locate fact (mis)alignments in Wikipedia biographies in different language versions. INFOGAP identifies facts that are common to a pair of articles (“Griner was born on October 18, 1990”), and facts unique to one language version (“Griner had recorded the sixth triple-double”; En only), enabling further analysis of information gaps, editors’ selective preferences within articles, and analyses at scale across languages, cultures, and demographics.

As shown in Figure 1, the differences can be subtle but significant. In the biography of basketball player Brittney Griner, the English version highlights her athletic achievements (her triple-double record), while the Russian version focuses on controversies. INFOGAP provides the tooling to detect these divergences automatically.

The Challenge of Cross-Lingual Comparison

Comparing articles across languages is notoriously difficult. A sentence in English rarely maps 1-to-1 to a sentence in Russian. Sentences are often complex, containing multiple clauses and distinct facts.

Furthermore, manual analysis is unscalable. If a researcher wants to understand how LGBT figures are portrayed globally, they cannot manually read and cross-reference thousands of articles in multiple languages. They need a system that can:

  1. Understand the semantic meaning of text (not just keyword matching).
  2. Operate at the level of atomic facts (not just sentences).
  3. Scale to thousands of biographies efficiently.

The INFOGAP Method

The researchers propose a three-stage pipeline to solve this: Fact Decomposition, Cross-Lingual Fact Alignment (X-FACTALIGN), and Cross-Lingual Fact Matching (X-FACTMATCH).

Figure 2: Schematic of the INFOGAP procedure. We describe the Fact Decomposition and Multilingual Alignment steps in §2.1, and the Alignment Verification step in §2.2.

1. Fact Decomposition

Sentences in Wikipedia are information-dense. A single sentence might say, “Born in 1990, she played for Baylor University and was the first NCAA player to score 2,000 points and block 500 shots.” This contains birth information, college affiliation, and specific statistical records.

To compare this against another language, the system first breaks these complex sentences down into atomic facts. The authors utilized GPT-4 for this task, processing entire paragraphs to resolve co-references (e.g., understanding that “she” refers to Griner) and outputting a list of standalone factual statements.
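
To make this concrete, here is a minimal sketch of what the decomposition call might look like with the OpenAI API. The prompt wording and the `decompose` helper are our illustration, not the authors’ exact setup.

```python
# A minimal sketch of LLM-based fact decomposition (prompt wording is
# illustrative, not the authors' exact prompt).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DECOMPOSE_PROMPT = """Decompose the paragraph below into a list of atomic, \
standalone facts. Resolve all pronouns to the subject's full name. \
Return one fact per line.

Paragraph: {paragraph}"""

def decompose(paragraph: str) -> list[str]:
    """Break an information-dense paragraph into self-contained facts."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": DECOMPOSE_PROMPT.format(paragraph=paragraph)}],
        temperature=0,
    )
    # One fact per output line; strip any list markers the model adds.
    return [line.lstrip("-• ").strip()
            for line in response.choices[0].message.content.splitlines()
            if line.strip()]
```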

2. X-FACTALIGN: Finding the Haystack

Once article \(E\) (English) and article \(F\) (e.g., French) are decomposed into lists of facts, the system needs to check if a specific fact from \(E\) exists in \(F\). Brute-forcing this by comparing every English fact against every French fact is computationally expensive and prone to error.

Instead, the authors use a retrieval approach. They embed the facts using LaBSE (Language-Agnostic BERT Sentence Embedding), which maps sentences from different languages into a shared vector space. However, simply using Cosine Similarity isn’t enough due to the Hubness Problem.

The Hubness Problem: In high-dimensional vector spaces, certain vectors (hubs) tend to appear as nearest neighbors to many other vectors, even if they aren’t semantically related. To fix this, the authors used a density-normalized distance metric. They also constructed a bipartite graph between paragraphs in both languages to narrow the search space. If a fact comes from Paragraph A in English, the system prioritizes searching in the French paragraphs that are semantically “adjacent” to Paragraph A.
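
The sketch below illustrates the retrieval idea using LaBSE via sentence-transformers with a CSLS-style local scaling, one standard density normalization for countering hubness. Treat it as an approximation under stated assumptions: the paper’s exact metric and its bipartite paragraph-graph pruning are not reproduced here.

```python
# Sketch of hubness-corrected cross-lingual retrieval: LaBSE embeddings
# plus a CSLS-style local scaling (a standard density normalization).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def csls_scores(src_facts: list[str], tgt_facts: list[str],
                k: int = 5) -> np.ndarray:
    """Cosine similarity, penalized by each vector's neighborhood density."""
    src = model.encode(src_facts, normalize_embeddings=True)
    tgt = model.encode(tgt_facts, normalize_embeddings=True)
    sim = src @ tgt.T  # cosine similarity (embeddings are unit length)
    # Mean similarity to each vector's k nearest cross-lingual neighbors.
    r_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    r_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2 * sim - r_src - r_tgt  # "hubs" close to everything get penalized

scores = csls_scores(["Griner was born on October 18, 1990."],
                     ["Гринер родилась 18 октября 1990 года."])
best_candidate = int(scores[0].argmax())  # index of top-matching target fact
```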

3. X-FACTMATCH: The Verdict

The final step is verification. The system retrieves the top candidate facts from the target language that might match the source fact. But “similarity” doesn’t mean “equivalence.”

To determine if the fact is actually present, the authors employ an entailment check. They prompt an LLM (GPT-4) with the source fact (hypothesis) and the candidate facts (premise). The model is asked: Can the source fact be inferred from the candidate facts?

This step converts a fuzzy similarity score into a hard decision: Entailed (Present) or Not Entailed (Missing).
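
A minimal sketch of this verdict step is shown below; the prompt wording and the string parsing of the answer are illustrative assumptions, not the authors’ exact setup.

```python
# Sketch of the X-FACTMATCH entailment check (prompt and parsing are
# illustrative assumptions).
from openai import OpenAI

client = OpenAI()

ENTAIL_PROMPT = """Premise: {premise}
Hypothesis: {hypothesis}

Can the hypothesis be inferred from the premise? \
Answer "Entailed" or "Not Entailed"."""

def is_entailed(source_fact: str, candidate_facts: list[str]) -> bool:
    """True if the source fact is supported by the retrieved candidates."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": ENTAIL_PROMPT.format(
                       premise=" ".join(candidate_facts),
                       hypothesis=source_fact)}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer.startswith("entailed")
```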

Validating the Pipeline

Before trusting the automated results, the authors validated INFOGAP against human annotations. They manually labeled thousands of facts across English, French, and Russian biographies.

Table 1: Number of facts labeled using INFOGAP for each language pair and direction, and number of manually annotated facts.

As shown in Table 1, the scale of analysis possible with automation dwarfs what is possible manually. But is it accurate?

Table 2: Performance of INFOGAP with respect to the manual annotations (\(n = 80\) for each language pair), in terms of \(F_1\) score.

Table 2 shows the results. INFOGAP achieved F1 scores between 0.78 and 0.90, significantly outperforming a standard NLI (Natural Language Inference) baseline and random guessing. This level of agreement with human annotators suggests the pipeline is reliable enough for large-scale sociological analysis.

Case Study: LGBT Biographies

The authors applied INFOGAP to the LGBTBioCorpus, a dataset of Wikipedia biographies of LGBT figures. The goal was to investigate how sexual orientation affects information coverage across English (En), French (Fr), and Russian (Ru) Wikipedia.

This domain is particularly sensitive. Narrative framing, omission of positive events, or emphasis on controversy can significantly alter the reader’s perception of a public figure.

RQ1: The Information Asymmetry

First, the authors looked at the sheer volume of shared information. Do different languages tell the same story?

Figure 3: Distribution of information overlaps for the LGBTBioCorpus. Top: distribution over the percentage of facts in En biographies also found in their Fr and Ru counterparts. Bottom: distribution over the percentage of facts in Fr and Ru biographies also found in their English counterparts. \(N = 2{,}700\) biographies. In general, En biographies contain more facts that are exclusive to En.

Figure 3 illustrates the information overlap.

  • En \(\rightarrow\) Fr/Ru (Top Row): The overlap is low. For example, the median En \(\rightarrow\) Ru overlap is only 0.23. This means that for a typical biography, 77% of the facts found in the English version represent unique information missing from the Russian version.
  • Fr/Ru \(\rightarrow\) En (Bottom Row): The overlap is higher. A significant portion of the information in French and Russian articles (roughly 55-66%) can be found in the English version.

This confirms a “super-set” dynamic: English Wikipedia acts as a global hub with comprehensive coverage, while other languages often contain a subset of that information, plus some unique local context.

RQ2: Sentiment and Bias

The deeper question is: what kind of information is being shared or omitted? Does the sentiment of a fact affect its likelihood of crossing language barriers?

To test this, the authors categorized facts by their implied sentiment (Positive, Negative, or Neutral). They then ran a Bayesian binomial regression to see which factors predicted whether a fact would be included in the target language.
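
As a rough, frequentist stand-in for the paper’s Bayesian binomial regression, the sketch below fits a plain logistic regression with the same is_lgbt × conn_neg interaction on synthetic data. The column names and the toy generative story are our illustration, not the authors’ data.

```python
# Illustrative stand-in for the Bayesian binomial regression: a logistic
# regression with the same interaction term, fit on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "is_lgbt": rng.integers(0, 2, n),   # biography subject is LGBT
    "conn_neg": rng.integers(0, 2, n),  # fact has negative connotation
})
# Toy generative story: negative facts about LGBT figures are *extra*
# likely to be shared across the language barrier (positive interaction).
logit_p = -0.5 + 0.8 * df.is_lgbt * df.conn_neg
df["shared"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# "is_lgbt * conn_neg" expands to both main effects plus their interaction.
fit = smf.logit("shared ~ is_lgbt * conn_neg", data=df).fit()
print(fit.params)  # the interaction coefficient should recover roughly 0.8
```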

Table 3: Mean of the posterior distribution of regression coefficients. ** indicates that the 95% posterior credible interval for the coefficient does not contain zero.

Table 3 reveals a disturbing trend regarding Russian Wikipedia:

  • Neutrality Preference: Generally, facts with neutral sentiment are more likely to be shared across languages.
  • The Russian Negative Bias: In the Ru \(\rightarrow\) En and En \(\rightarrow\) Ru directions, the interaction between is_lgbt and conn_neg (negative connotation) is positive and significant.

Specifically, Russian biographies of LGBT people are disproportionately likely to share negative facts with English biographies compared to positive ones. The regression model estimates that for LGBT figures, roughly 51% of negative facts in Russian bios are shared with English, whereas for non-LGBT figures, that number drops to 38%. This suggests a narrative framing in Russian Wikipedia that preserves or emphasizes negative aspects of LGBT lives.

For context, Table 4 shows the baseline distribution of sentiment across the languages.

Table 4: Distribution of implied sentiment about biography subjects for En, Fr, and Ru articles.

RQ3: Missing Positive Events

Finally, the researchers used INFOGAP to identify specific “events”—clusters of facts—that carry positive sentiment but are missing from one language. This is not just about counting stats; it is about finding narrative gaps.

The authors formally defined a “missing event” using the set notation \(M\):

\[
M = \{\, \mathcal{V} \in E \mid \mathrm{all}(\{\, F \nvDash e_i \mid i \in [N_{\mathcal{V}}] \,\}) \,\}
\]

Here, a paragraph \(\mathcal{V}\) in the source article \(E\) is considered missing if none of its constituent facts \(e_i\) are entailed by the target article \(F\).
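
The definition translates almost line-for-line into code. In the sketch below, `entailed` stands in for the X-FACTMATCH verdicts from the previous step.

```python
# Literal translation of the definition of M: a paragraph is "missing" iff
# none of its constituent facts is entailed by the target article.
def missing_events(paragraphs: list[list[str]],
                   entailed: dict[str, bool]) -> list[int]:
    """Indices of source paragraphs with no fact entailed by the target."""
    return [i for i, facts in enumerate(paragraphs)
            if all(not entailed[fact] for fact in facts)]

# Usage: paragraph 1 is missing because neither of its facts is entailed.
paras = [["fact A", "fact B"], ["fact C", "fact D"]]
verdicts = {"fact A": True, "fact B": False, "fact C": False, "fact D": False}
assert missing_events(paras, verdicts) == [1]
```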

One might worry that the model could simply be making a mistake—claiming an event is missing when it’s actually there. However, the authors provide a probabilistic bound on this error.

\[
P(\text{event of } k \text{ facts wrongly marked missing}) \leq \exp\left( -2(1-\epsilon)^2 k^2 / k \right) = \exp\left( -2(1-\epsilon)^2 k \right).
\]

This inequality shows that the probability of the model incorrectly marking an entire event (consisting of \(k\) facts) as missing decreases exponentially with the number of facts. If INFOGAP says a 5-fact paragraph is missing, it is statistically very likely to be missing.
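
Plugging hypothetical numbers into the bound shows how quickly the error probability vanishes; the value \(\epsilon = 0.2\) below is chosen purely for illustration.

```python
# Numeric sanity check of the bound, with a hypothetical epsilon = 0.2.
import math

def miss_bound(eps: float, k: int) -> float:
    """Upper bound on wrongly flagging a k-fact event as missing."""
    return math.exp(-2 * (1 - eps) ** 2 * k)

for k in (1, 3, 5, 10):
    print(k, miss_bound(0.2, k))
# k=1 -> 0.278, k=3 -> 0.021, k=5 -> 0.0017, k=10 -> 2.8e-06
```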

Real-World Examples of Narrative Gaps

The automated analysis surfaced fascinating examples of selective storytelling.

Table 5: Examples of events from biographies that contain a large number of positive facts that are only contained in one language version of the article relative to another. We provide translations (Google Translate) for the first two rows, rather than the original French and Russian content.

Table 5 highlights distinct narrative choices:

  1. Tim Cook (En vs. Ru): The Russian article explicitly details Cook’s fundraising for Ukraine and Apple’s suspension of sales in Russia. This entire event is missing from the English article. While it makes sense that Russian editors focus on Russia-related news, the omission in English (despite coverage in US media) might reflect an attempt to maintain a “neutral” stance on controversial geopolitical topics.
  2. Chelsea Manning (En vs. Fr): The French article includes a section describing praise Manning received from Ron Paul regarding her whistleblowing. This positive framing is absent from the English article, perhaps reflecting the more polarized American view of her actions compared to a potentially more sympathetic European view.
  3. Ada Colau (En vs. Ru): The English article details her climate activism as Mayor of Barcelona. The Russian article omits these positive environmental policy achievements entirely.

Implications and Conclusion

This research moves beyond simple translation to uncover the “editorial soul” of different language communities. It shows that Wikipedia is not a monolithic encyclopedia but a collection of diverse, sometimes conflicting, narratives.

The INFOGAP method is a significant step forward for computational social science. By distilling complex articles into atomic facts and using LLMs for rigorous entailment checking, researchers can now:

  • Audit Bias: Systematically check if minority groups are portrayed more negatively in specific languages.
  • Assist Editors: Automatically flag missing positive achievements (like Ada Colau’s climate work) to help editors in other languages enrich their articles.
  • Study Information Flow: Understand how news and narratives propagate (or fail to propagate) across cultural borders.

While the study focused on LGBT biographies, the underlying technology is language- and topic-agnostic. It could be applied to analyze coverage of geopolitical conflicts, scientific debates, or historical events, making it a powerful tool for fighting misinformation and promoting a more complete, global understanding of the truth.

Technical Appendix: Efficient Implementation

For students interested in the technical implementation, it is worth noting that while the authors used GPT-4 for the initial experiments, they found it cost-prohibitive for the full dataset (parsing Alan Turing’s bio alone cost >100k tokens).

To solve this, they “distilled” the knowledge from GPT-4 into smaller, open-source models (Flan-T5 and mT5). They created a seed dataset of labeled facts (Tables 6 and 7) using GPT-4 and trained the smaller models to replicate the decomposition and entailment tasks.
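
A minimal sketch of what such a distillation run might look like with HuggingFace Transformers appears below. The dataset fields, hyperparameters, and tiny inline seed set are illustrative stand-ins for the GPT-4-labeled seed data, not the authors’ training recipe.

```python
# Sketch of distilling GPT-4's entailment verdicts into flan-t5-large.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

# Hypothetical seed set: (prompt, GPT-4 verdict) pairs from Tables 6 and 7.
seed = Dataset.from_dict({
    "prompt": ["Premise: ... Hypothesis: ... Entailed or Not Entailed?"],
    "verdict": ["Entailed"],
})

def preprocess(example):
    # Input mirrors the GPT-4 entailment prompt; target is GPT-4's verdict.
    enc = tokenizer(example["prompt"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(example["verdict"], truncation=True,
                              max_length=8)["input_ids"]
    return enc

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="infogap-distilled",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3),
    train_dataset=seed.map(preprocess, remove_columns=seed.column_names),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```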

Table 6: Initial seed set of people for obtaining INFOGAP labels with GPT-4 for the En \(\rightarrow\) Fr and Fr \(\rightarrow\) En directions; see §2.3. We used the INFOGAP labels on this seed set to finetune a flan-t5-large model; see §3.1.

Table 7: Initial seed set of people for obtaining INFOGAP labels with GPT-4 for the Ru \(\rightarrow\) En and En \(\rightarrow\) Ru directions; see §2.3. We used the INFOGAP labels on this seed set to finetune an mt5-large model; see §3.1.

This approach makes INFOGAP not just accurate, but also computationally efficient and reproducible for future research.