Introduction
Imagine trying to learn a language that has no dictionary, no textbook, and no Google Translate support. Now, imagine trying to teach a computer to translate that language. This is the reality for thousands of “low-resource” and endangered languages around the world.
Modern Machine Translation (MT) systems, like the ones behind the translation tools we use daily, are data-hungry beasts. They learn by looking at millions of examples of translated sentences—known as parallel data. For example, to learn English-to-Spanish translation, the model analyzes huge datasets containing English sentences paired with their Spanish equivalents.
But what happens when those parallel datasets don’t exist?
This creates a “chicken and egg” problem. To build a translation system, you need parallel data. To find parallel data efficiently (a process called sentence mining), you usually need a strong multilingual model… which requires parallel data to train.
In this post, we are doing a deep dive into a research paper titled “Improving Parallel Sentence Mining for Low-Resource and Endangered Languages.” The researchers tackle this circular problem head-on. They propose a method to mine parallel sentences for endangered languages without relying on existing parallel data, using only monolingual texts (which are much easier to find).
We will explore how they constructed a new benchmark for three distinct language pairs and improved the mining process using clever techniques like Isotropy Enhancement and Unsupervised Alignment.
Background: The Challenge of Sentence Mining
Parallel sentence mining is the task of searching through two large, separate collections of monolingual text (e.g., all of Wikipedia in Occitan and all of Wikipedia in Spanish) to find sentences that mean the same thing.
Think of it as finding a needle in a haystack, but you have two haystacks, and you have to match specific needles from one to the other.
The Status Quo
For major languages like English, French, or Chinese, this problem is largely solved. Researchers use powerful sentence encoders like LaBSE (Language-agnostic BERT Sentence Embedding). These models convert sentences into mathematical vectors (lists of numbers). If the model is good, the vector for “Hello” in English will be very close to the vector for “Bonjour” in French in the vector space. You simply calculate the distance between vectors to find matches.
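To make the "vectors and distances" idea concrete, here is a minimal sketch using the sentence-transformers library and its public LaBSE checkpoint (the library and checkpoint name are assumptions of this sketch, not details from the paper):

```python
# Minimal sketch: encode sentences and compare them by cosine similarity.
# Assumes the `sentence-transformers` package and its hosted LaBSE checkpoint.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

sentences = ["Hello", "Bonjour", "The weather is terrible today."]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity between every pair of sentences.
scores = util.cos_sim(embeddings, embeddings)
print(scores)  # "Hello" vs. "Bonjour" should score far higher than unrelated pairs
```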
However, models like LaBSE are trained on massive amounts of parallel text. If you try to use them on a language they haven’t seen (like Chuvash), or a language with very little data, their performance collapses.
The Alternative: Monolingual Models
The researchers in this paper argue that for endangered languages, we cannot rely on models that require parallel data for training. Instead, we should look at monolingual language models. These are models trained only on raw text in a single language, or many languages independently, without explicit translation pairs.
The specific model chosen for this study is Glot500, a multilingual model trained on over 500 languages. While Glot500 has seen these low-resource languages, it hasn’t necessarily been taught how they translate to others. The challenge, therefore, is to take this raw monolingual understanding and sharpen it into a tool capable of finding translations.
BELOPSEM: A New Benchmark
To test their methods, the researchers realized existing benchmarks weren’t realistic enough for endangered languages. They created BELOPSEM (Benchmark of low-resource languages for parallel sentence mining).
They focused on three specific language pairs, chosen to represent different levels of difficulty. In all cases, the source is a low-resource language, and the target is a high-resource language.
- Occitan-Spanish (OCI-ES): Occitan is a Romance language spoken in Southern Europe. It is closely related to Spanish and French. This is considered the “easiest” pair because the languages are linguistically similar.
- Upper Sorbian-German (HSB-DE): Upper Sorbian is a Slavic language spoken in eastern Germany. Although it is spoken in the same region as German, the two belong to different language families (Slavic vs. Germanic). However, Upper Sorbian is related to Czech and Polish, which helps.
- Chuvash-Russian (CHV-RU): Chuvash is a Turkic language spoken in Russia. It is very different from Russian (a Slavic language), and it is linguistically distant from many other Turkic languages. This represents the “hard mode” of mining.
The researchers constructed datasets where true parallel sentences were hidden inside large monolingual corpora, mimicking a real-world mining scenario.

As shown in Table 1 above, the datasets are split into training and testing sets. Crucially, the “parallel” rows show how few true matches exist compared to the total number of sentences (roughly 6%). The goal is to retrieve these few matches without getting tricked by the thousands of non-matching sentences.
The Core Method: PASEMILL
The researchers developed a pipeline called PASEMILL. Let’s break down the architecture of this mining system step-by-step.
Step 1: Sentence Representation
First, we need to turn sentences into math. The system feeds sentences from both the source (e.g., Chuvash) and target (e.g., Russian) languages into the Glot500 model.
Since Glot500 is a standard Transformer model (like BERT), it outputs embeddings for every word. To get a single vector representing the whole sentence, the researchers use mean-pooling on the 8th layer of the network. This simply averages the word vectors to create a sentence vector.
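As a rough sketch of this step, the snippet below pulls mean-pooled layer-8 embeddings out of a Hugging Face checkpoint. The checkpoint name cis-lmu/glot500-base and the exact pooling details are assumptions of this sketch, not the paper's code:

```python
# Sketch of Step 1: mean-pooled sentence vectors from a hidden layer of Glot500.
# The checkpoint name below is an assumption; substitute the model you actually use.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cis-lmu/glot500-base")
model = AutoModel.from_pretrained("cis-lmu/glot500-base", output_hidden_states=True)
model.eval()

def embed(sentences, layer=8):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).hidden_states[layer]        # (batch, tokens, dim)
    # Mean-pool over real tokens only, ignoring padding.
    mask = batch["attention_mask"].unsqueeze(-1).float()    # (batch, tokens, 1)
    return (hidden * mask).sum(1) / mask.sum(1)

vecs = embed(["This is a test sentence.", "Another sentence."])
print(vecs.shape)  # (2, hidden_dim)
```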
Step 2: Similarity Search with CSLS
Once every sentence is a vector, we need to find which Source Vector is closest to which Target Vector.
A naive approach uses Cosine Similarity, which measures the angle between two vectors. However, high-dimensional vector spaces often suffer from the Hubness Problem. Some vectors act as “hubs”—they appear close to everything, even sentences that aren’t translations. This creates false positives.
To fix this, the researchers use CSLS (Cross-Domain Similarity Local Scaling).
\[
\mathrm{CSLS}(x, y) = 2\cos(x, y) \;-\; \frac{1}{k}\sum_{y' \in \mathcal{N}_Y(x)} \cos(x, y') \;-\; \frac{1}{k}\sum_{x' \in \mathcal{N}_X(y)} \cos(y, x')
\]
As defined in the equation above, CSLS calculates the cosine similarity between sentences \(x\) and \(y\), but it subtracts a penalty based on the “denseness” of the neighborhood around them (the \(k\) nearest neighbors).
- If a sentence \(x\) is in a crowded region (a hub), the penalty is high, lowering the score.
- This ensures that we only match sentences that are uniquely similar to each other, not just generically similar to everything.
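To make this concrete, here is a minimal NumPy sketch of CSLS scoring, assuming the sentence vectors are L2-normalised (so dot products equal cosine similarities):

```python
# Minimal CSLS sketch over two sets of L2-normalised sentence vectors.
import numpy as np

def csls_scores(src, tgt, k=4):
    """src: (n, d) source vectors, tgt: (m, d) target vectors, both L2-normalised."""
    cos = src @ tgt.T                                   # (n, m) cosine similarities
    # Average similarity of each source vector to its k nearest target neighbours ...
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)   # (n,)
    # ... and of each target vector to its k nearest source neighbours.
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)   # (m,)
    # Penalise "hub" sentences that are generically close to everything.
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

# Mining: for each source sentence, take the target with the highest CSLS score.
# matches = csls_scores(src_vecs, tgt_vecs).argmax(axis=1)
```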
Step 3: Improvements
This is where the paper makes its most significant contributions. Standard mean-pooling from a monolingual model isn’t usually accurate enough for high-quality mining. The researchers introduce two post-processing steps to boost performance.
Improvement A: Unsupervised Alignment Post-processing
Even if two sentences have similar vector representations, they might not be exact translations. A stronger check is to look at the word-level alignment.
If “The cat sat” matches “Le chat s’est assis,” we should be able to draw lines connecting “cat” to “chat” and “sat” to “assis.”
The researchers use a tool called SimAlign, which uses the language model to align words without needing a bilingual dictionary. They calculate an alignment score: the percentage of words in the sentence pair that have a strong match.
They filter the mined pairs using a dynamic threshold \(\theta\):
\[
\theta = \mu + \lambda \, \sigma
\]
Here, the threshold is determined by the average similarity score of the dataset plus a margin (\(\lambda\)) times the standard deviation. If a sentence pair’s alignment score doesn’t pass this threshold, it is discarded. This acts as a rigorous double-check to remove false positives.
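The sketch below illustrates this filter, assuming the simalign package's documented interface and the score and threshold definitions described above; the paper's exact configuration (underlying encoder, matching method, value of \(\lambda\)) may differ:

```python
# Sketch of the alignment filter: score word coverage with SimAlign, then keep only
# pairs above a dynamic threshold (mean + lambda * std of the scores).
import numpy as np
from simalign import SentenceAligner

# SimAlign's default "bert" encoder is used here for illustration; the paper
# presumably runs alignment with its own multilingual model.
aligner = SentenceAligner(model="bert", token_type="bpe", matching_methods="mai")

def alignment_score(src_tokens, tgt_tokens):
    links = aligner.get_word_aligns(src_tokens, tgt_tokens)["itermax"]
    aligned_src = {i for i, _ in links}
    aligned_tgt = {j for _, j in links}
    # Fraction of tokens (on both sides) that take part in at least one link.
    return (len(aligned_src) + len(aligned_tgt)) / (len(src_tokens) + len(tgt_tokens))

def filter_pairs(candidate_pairs, lam=1.0):
    scores = np.array([alignment_score(s, t) for s, t in candidate_pairs])
    theta = scores.mean() + lam * scores.std()   # dynamic threshold
    return [pair for pair, score in zip(candidate_pairs, scores) if score >= theta]
```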
Improvement B: Cluster-Based Isotropy Enhancement (CBIE)
This concept is complex but fascinating. It addresses a geometric flaw in language models known as Anisotropy.
The Problem: In many language models, sentence embeddings aren’t spread out evenly in the vector space (like a sphere). Instead, they tend to cluster in a narrow cone. This is bad for mining because when all vectors are squashed into a narrow cone, the distances between them become meaningless. Even unrelated sentences appear to be close neighbors simply because everything is jammed into the same corner of the room.
The Solution (CBIE): The researchers apply Cluster-Based Isotropy Enhancement.
- They cluster the sentence vectors.
- For each cluster, they calculate the “dominant directions” (using Principal Component Analysis).
- They mathematically remove these dominant directions.
Think of it like taking a squashed football and inflating it back into a proper sphere. This forces the vectors to spread out, making the meaningful distances between true translation pairs much clearer.
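A minimal sketch of the idea, assuming scikit-learn's KMeans and PCA: cluster the vectors, then subtract each cluster's mean and project out its dominant principal components. The cluster count and number of removed components here are illustrative choices, not the paper's reported settings.

```python
# Sketch of cluster-based isotropy enhancement on a (n, d) numpy array of embeddings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cbie(embeddings, n_clusters=10, n_components=5):
    emb = np.asarray(embeddings, dtype=np.float64).copy()
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(emb)
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) < 2:
            continue
        cluster = emb[idx] - emb[idx].mean(axis=0)          # centre the cluster
        pca = PCA(n_components=min(n_components, len(idx), cluster.shape[1]))
        pca.fit(cluster)
        # Remove the dominant directions ("deflate the cone").
        dominant = cluster @ pca.components_.T @ pca.components_
        emb[idx] = cluster - dominant
    return emb
```

A transformation of this kind is applied to the pooled sentence vectors before the similarity search.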
Let’s look at the visual evidence of this transformation.

In Figure 1 above (Occitan-Spanish), look at the plot on the left (a). The data points form distinct, stringy clusters. This is anisotropy. Now look at the right (b). After CBIE, the points form a uniform cloud. This “sphering” of the data makes similarity search significantly more accurate.
We see the same effect even more dramatically for the hardest language pair, Chuvash-Russian:

In Figure 3(a), the data is heavily structured and clustered. In 3(b), the CBIE transformation successfully normalizes the distribution, preparing the space for effective mining.
Experiments & Results
So, does this actually work? The researchers compared three main setups:
- XLM-R: A standard multilingual baseline.
- LaBSE: The state-of-the-art sentence encoder (trained on massive parallel data).
- Glot500: The proposed monolingual-only approach, tested with and without the improvements (Alignment and CBIE).
The performance is measured using F-score (a balance of Precision and Recall).
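For reference, precision, recall, and F-score over the mined pairs follow the standard definitions; a minimal helper might look like this:

```python
# Standard precision / recall / F-score of mined pairs against the gold parallel pairs.
def f_score(mined_pairs, gold_pairs):
    mined, gold = set(mined_pairs), set(gold_pairs)
    true_positives = len(mined & gold)
    precision = true_positives / len(mined) if mined else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```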

Table 2 reveals several critical insights:
1. Language Distance Matters: Look at the baseline scores (top rows). Occitan-Spanish (OCI-ES) scores high because the languages are similar. Chuvash-Russian (CHV-RU) scores significantly lower. This confirms that mining gets harder as linguistic distance increases.
2. The LaBSE Trap: LaBSE performs incredibly well for Occitan and Upper Sorbian. Why? Because it likely saw related languages (Spanish, French, Czech, Polish) during its massive training process. However, look at the Chuvash (CHV-RU) column. LaBSE scores 28.24%. The Glot500 model (with improvements) scores 43.62%. This is a massive finding. For a language that is truly distinct and unseen by the big models (like Chuvash), a smaller model trained on monolingual data performs better than the state-of-the-art, provided you enhance it correctly.
3. The Power of Improvements: Comparing “Glot500 (NO/NO)” to “Glot500 (YES/YES)”:
- OCI-ES: Improves from 72.6% to 84.5%.
- HSB-DE: Jumps from 20.9% to 50.8% (More than double!).
- CHV-RU: Improves from 37.8% to 43.6%.
The combination of Alignment (filtering out bad matches) and CBIE (fixing the vector space) consistently unlocks better performance.
A Qualitative Look
Numbers are great, but let’s look at an actual example to see why the improvements work.

In Table 3, we see an Occitan sentence about a professor at Nagoya University.
- Before CBIE: The model is confused. It thinks the closest match is a random Spanish sentence about a man named Pablo living in Parácuaro. The similarity score is negative (-0.004).
- After CBIE: The vector space has been corrected. The model now correctly identifies the Spanish translation (“Actualmente, trabaja en la Universidad de Nagoya…”). The similarity score jumps to positive 0.118.
This demonstrates that the “anisotropy” problem was effectively blinding the model to the correct translation, and fixing the geometry revealed the match.
Conclusion & Implications
This research offers a promising roadmap for the digital preservation of endangered languages.
The key takeaway is that we do not need to wait for massive parallel datasets to exist before we can build translation technologies. By leveraging monolingual data—which is far easier to collect from the web, books, and documents—and applying smart mathematical corrections like CBIE and unsupervised alignment, we can effectively mine the parallel data we need.
Why does this matter?
- Breaking the Cycle: It breaks the “chicken and egg” cycle, allowing us to bootstrap translation systems for languages like Chuvash or Upper Sorbian from scratch.
- Efficiency: It shows that we don’t always need the biggest, most expensive model (LaBSE). A targeted model (Glot500) with the right post-processing can win in low-resource scenarios.
- Scalability: The techniques used (SimAlign and CBIE) are unsupervised. They don’t require human labels, making them scalable to hundreds of other languages.
By refining how machines “see” the relationship between languages, we move one step closer to a world where no language is left behind in the digital age.