Introduction: The Challenge of Meaning
Imagine you are building a search engine or a chatbot. A user types in: “The bird is bathing in the sink.”
Later, another user types: “Birdie is washing itself in the water basin.”
To a human, these sentences are virtually identical in meaning. To a computer, they are distinct strings of characters. The ability of a machine to understand that these two sentences convey the same semantic information is called Semantic Textual Similarity (STS).
Over the last decade, Natural Language Processing (NLP) has become remarkably good at this, at least if you speak English. But what happens when we step outside the “high-resource” bubble of English, French, or Spanish? What happens when we try to measure similarity in Sinhala, Tamil, or low-resource variants of Arabic?
This is the core problem addressed in the research paper “MUSTS: MUltilingual Semantic Textual Similarity Benchmark.” The researchers identified a critical gap in how we evaluate AI models: our current benchmarks are biased towards wealthy languages, often rely on poor machine translations, and confuse “similarity” with “relatedness.”
In this post, we will break down the new MUSTS benchmark, explore how modern architectures (from Transformers to Large Language Models) tackle sentence similarity, and reveal a surprising finding: when it comes to low-resource languages, the biggest, smartest LLMs aren’t always the best tool for the job.
Background: Similarity vs. Relatedness
Before diving into the architecture, we must define what we are actually measuring. In the rush to build massive datasets, the NLP community has often conflated two distinct concepts: Similarity and Relatedness.
- Similarity implies equivalence. “The car is fast” and “The automobile is quick” are similar.
- Relatedness implies a topical connection. “The car is fast” and “The driver uses gasoline” are related, but they do not mean the same thing.
Many existing benchmarks, such as the Massive Text Embedding Benchmark (MTEB), include datasets that score for relatedness rather than strict similarity. This muddies the water when trying to train a model for tasks like paraphrase detection or semantic search.
MUSTS strictly adheres to the STS annotation guidelines. To understand exactly how granular this gets, look at the scoring criteria used in the benchmark:
| Score | Interpretation |
| --- | --- |
| 5 | The two sentences are completely equivalent; they mean the same thing. |
| 4 | Mostly equivalent; some unimportant details differ. |
| 3 | Roughly equivalent, but some important information differs or is missing. |
| 2 | Not equivalent, but the sentences share some details. |
| 1 | Not equivalent, but the sentences are on the same topic. |
| 0 | Completely dissimilar. |
As shown in the table above, the scale ranges from 0 (completely dissimilar) to 5 (completely equivalent). This nuance is vital. A score of 3 means “roughly equivalent but important details differ.” A model must be sensitive enough to catch those missing details, not just spot that both sentences are about “birds.”
The Gap in Current Benchmarks
Current multilingual benchmarks often fail in three ways:
- Language Coverage: They focus on “Winner” languages (high-resource) and ignore “Underdogs” (low-resource).
- Annotation Quality: They often rely on machine-translated data. If you translate an English dataset into Russian using Google Translate, and then use that to test a model, you are testing the model’s ability to match translation errors, not its understanding of natural Russian.
- Task Confusion: As mentioned, mixing relatedness and similarity.
MUSTS addresses this by curating datasets across 13 languages, covering diverse language families and resource levels.

As you can see in the table above, MUSTS includes languages like Sinhala and Tamil, which are often excluded from major benchmarks like MTEB. It ensures that every dataset included has been rigorously vetted for true semantic similarity.
Core Methods: How Machines Calculate Meaning
The researchers evaluated over 25 different methods to solve the STS problem on this new benchmark. These methods generally fall into two categories: Unsupervised (using pre-existing embeddings without specific training on these datasets) and Supervised (training models specifically for this task).
Let’s break down the complex architectures used.
1. Unsupervised Approaches
Unsupervised methods are attractive because they don’t require expensive labeled training data for every new language.
Vector Averaging and SIF
The simplest way to compare sentences is to turn every word into a vector (a list of numbers representing meaning) and average them. However, this is noisy. Words like “the” and “and” appear frequently but carry little semantic weight.
The researchers used a technique called Smooth Inverse Frequency (SIF). SIF improves upon simple averaging by:
- Weighting: Giving lower weight to frequent words (similar to TF-IDF).
- Common Component Removal: Mathematically removing the “common direction” that all sentence vectors share. This removes the “background noise” of the language, leaving behind the unique semantic content of the sentence.
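To make this concrete, here is a minimal NumPy sketch of SIF-style sentence embeddings. It assumes you already have per-language word vectors and unigram probabilities; the helper names and the smoothing constant `a = 1e-3` are illustrative defaults, not values taken from the paper.

```python
import numpy as np

def sif_embeddings(sentences, word_vectors, word_prob, a=1e-3):
    """Hedged sketch of SIF-style sentence embeddings.

    sentences    : list of token lists
    word_vectors : dict mapping token -> np.ndarray of shape (d,)
    word_prob    : dict mapping token -> unigram probability p(w)
    a            : smoothing constant (1e-3 is a common default)
    """
    dim = len(next(iter(word_vectors.values())))
    rows = []
    for tokens in sentences:
        # Weighted average: frequent words get weight a / (a + p(w)), i.e. less.
        vecs = [a / (a + word_prob.get(t, 0.0)) * word_vectors[t]
                for t in tokens if t in word_vectors]
        rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    X = np.vstack(rows)

    # Common component removal: subtract the projection onto the first
    # singular vector shared by all sentence embeddings.
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - X @ np.outer(u, u)
```

Cosine similarity between the resulting row vectors then serves as the unsupervised similarity score.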
LLM Prompting
With the rise of Large Language Models (LLMs) like Llama-3 and Mistral, a new unsupervised method has emerged: just ask the model.
The researchers tested several prompting strategies:
- Zero-shot (ZS): Just giving the instructions.
- Few-shot (FS): Giving the model 5 examples.
- Chain of Thought (CoT): Asking the model to explain its reasoning before giving a score.
The specific prompts used are crucial for reproducibility. The researchers used the templates below:

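As a rough illustration only, here is a hedged sketch of what a zero-shot STS prompt of this kind looks like; the wording is ours, not the paper’s exact template.

```python
# Illustrative zero-shot STS prompt; the paper's actual templates
# may be worded differently.
def zero_shot_prompt(sentence1: str, sentence2: str) -> str:
    return (
        "Rate the semantic similarity of the following two sentences on a "
        "scale from 0 (completely dissimilar) to 5 (completely equivalent). "
        "Answer with a single number.\n"
        f"Sentence 1: {sentence1}\n"
        f"Sentence 2: {sentence2}\n"
        "Score:"
    )

print(zero_shot_prompt("The bird is bathing in the sink.",
                       "Birdie is washing itself in the water basin."))
```

A few-shot variant prepends several worked sentence pairs with their gold scores, and a Chain-of-Thought variant asks the model to explain its comparison before committing to a number.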
LLM-Encoders
These are LLMs that have been specifically tweaked to output high-quality embeddings (vector representations) of text, such as NV-Embed-v2 or gte-Qwen2.
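In practice, such encoders can be loaded through the sentence-transformers library. The sketch below assumes the gte-Qwen2 checkpoint is published on the Hugging Face Hub under the ID shown; verify the exact name before running.

```python
from sentence_transformers import SentenceTransformer, util

# Example model ID (check the exact name on the Hugging Face Hub).
model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct",
                            trust_remote_code=True)

sentences = [
    "The bird is bathing in the sink.",
    "Birdie is washing itself in the water basin.",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity between the two sentence embeddings.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```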
2. Supervised Approaches
If you have training data (which MUSTS provides), you can train models to be specialists.
The Cross-Encoder (Transformers)
This is generally considered the “Gold Standard” for accuracy, though it is computationally expensive. In this architecture, you feed both sentences into the model simultaneously.

As illustrated in Figure 1, the Transformer processes Sentence 1 and Sentence 2 together. This allows the self-attention mechanism to compare words in Sentence 1 directly against words in Sentence 2, layer by layer. The model then outputs a final representation (often taken from the [CLS] token) that is fed into a regressor to predict the similarity score (0-5).
Why is this powerful? Because the model can see the interaction between words before making a decision. It knows that “bank” in Sentence 1 refers to a riverbank because it sees “water” in Sentence 2.
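As a hedged illustration of the architecture (not one of the models trained in the paper), a publicly available English STS cross-encoder can be queried like this. Note that this particular checkpoint outputs scores scaled to 0-1 rather than 0-5.

```python
from sentence_transformers import CrossEncoder

# Example public checkpoint: both sentences go through one Transformer
# and a regression head predicts a similarity score.
model = CrossEncoder("cross-encoder/stsb-roberta-base")

score = model.predict([("The bird is bathing in the sink.",
                        "Birdie is washing itself in the water basin.")])
print(score)  # one regression score per input pair
```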
The Bi-Encoder (Sentence Transformers)
The Cross-Encoder is accurate but slow. You can’t pre-calculate embeddings; you have to run the model every time you have a new pair of sentences.
The alternative is the Bi-Encoder (or Siamese Network) architecture.

In this setup (Figure 2), Sentence 1 and Sentence 2 pass through the Transformer independently. We take the output (usually via Mean Pooling) to create a fixed embedding vector (\(U\) and \(V\)) for each sentence. We then calculate the Cosine Similarity between these two vectors.
Why is this useful? You can pre-compute the vector for Sentence 1 and store it. When Sentence 2 arrives, comparison is instant. The researchers utilized this architecture to fine-tune smaller LLMs (like gte-Qwen2-1.5B).
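Here is a minimal sketch of the bi-encoder recipe using plain Hugging Face transformers. The base checkpoint xlm-roberta-base is just an example and would need STS fine-tuning before its cosine scores become meaningful.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Example multilingual checkpoint; swap in a fine-tuned encoder in practice.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

def embed(sentence: str) -> torch.Tensor:
    """Encode one sentence independently and mean-pool its token embeddings."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state  # (1, seq, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)              # (1, seq, 1)
    return (token_embeddings * mask).sum(1) / mask.sum(1)      # mean pooling -> (1, dim)

u = embed("The bird is bathing in the sink.")
v = embed("Birdie is washing itself in the water basin.")
print(torch.nn.functional.cosine_similarity(u, v).item())
```

Because each sentence is encoded on its own, the vector for one side can be computed once, stored, and reused against any number of future queries.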
Experiments & Results
The researchers ran these models across all 13 languages. The performance metric used was Spearman Correlation, which measures how well the model’s ranking of similarity matches the human ranking. A score of 1.0 is perfect; 0.0 is random.
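For reference, the metric is straightforward to compute with SciPy; the gold scores and predictions below are toy values, not data from the paper.

```python
from scipy.stats import spearmanr

gold = [0.0, 1.5, 2.0, 3.5, 5.0]   # human similarity scores
pred = [0.2, 1.0, 2.5, 3.0, 4.8]   # model predictions

rho, p_value = spearmanr(gold, pred)
print(rho)  # 1.0 here, because the predicted ranking matches the gold ranking
```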
Here is the comprehensive results table. Take a moment to examine the columns for Low-resource languages (like Sinhala - Si) versus High-resource (like English - En).

Key Finding 1: The Low-Resource Gap
Look at the LLM Prompting section in the table. For English (En), Llama-3.1-8B achieves a score of 0.801. This is excellent.
Now look at Sinhala (Si). The same model scores 0.396.
This is a massive degradation. While LLMs are hailed as universal reasoners, their ability to determine semantic similarity in languages they weren’t heavily trained on is poor.
Key Finding 2: Old School vs. New School
Surprisingly, for low-resource languages, older and simpler methods often won.
- LaBSE, a BERT-based sentence encoder released years ago, achieved 0.499 on Sinhala, significantly beating the massive Llama-3 model.
- Even SIF (Smooth Inverse Frequency), the simple mathematical weighting method, was competitive with LLMs in low-resource settings.
This suggests that for “Underdog” languages, massive parameter counts do not automatically translate to better semantic understanding. Specialized multilingual encoders like LaBSE still hold the crown because they were explicitly designed to align languages in a shared vector space.
Key Finding 3: Supervised Training Reigns Supreme
The bottom section of the table (“Training Transformers”) shows the results when models are actually trained on the MUSTS data. The scores jump significantly. InfoXLM Large achieved an average score of 0.88, dominating every unsupervised method. This confirms that while “Zero-shot” LLM capabilities are hype-worthy, fine-tuning a Transformer is still the best way to get state-of-the-art results in production.
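The paper describes its own training setup; as a hedged sketch of the general supervised recipe, the classic sentence-transformers training API can fine-tune an InfoXLM checkpoint with a one-dimensional regression head on gold scores rescaled to 0-1. The hyperparameters below are placeholders, not the paper’s values.

```python
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder
from torch.utils.data import DataLoader

# Cross-encoder with a fresh 1-dimensional regression head.
model = CrossEncoder("microsoft/infoxlm-large", num_labels=1)

# Toy training pairs; real training would use the MUSTS training splits.
train_examples = [
    InputExample(texts=["The bird is bathing in the sink.",
                        "Birdie is washing itself in the water basin."],
                 label=5.0 / 5.0),
    InputExample(texts=["The car is fast.",
                        "The driver uses gasoline."],
                 label=1.0 / 5.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

model.fit(train_dataloader=train_dataloader, epochs=1, warmup_steps=10)
```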
A Look at the Data Distributions
Why is the task so hard for some languages? Part of the answer lies in the data itself. Let’s compare the datasets for a high-resource language (English) and a low-resource language (Sinhala).
English Data Structure

In the English dataset (top rows of the figure above), we see a healthy distribution of sentence pairs. The “Word Share” (how many words overlap between the two sentences) acts as a decent proxy for similarity. As the similarity score (x-axis) goes up, the word share (y-axis) generally goes up. The model can rely somewhat on lexical overlap (matching keywords) to help it guess.
Sinhala Data Structure

Now look at Sinhala. The training set (Chart 3) shows a very different shape in the violin plots. Even in high-similarity bins (3-4 and 4-5), the “Word Share” is spread out. Two sentences in Sinhala can be semantically identical but share very few words due to the language’s complex morphology and rich vocabulary.
This makes the task much harder for a model. It cannot rely on simple keyword matching; it must genuinely “understand” the underlying concept. This morphological complexity helps explain why simple vector averaging fails and why models need specific training to succeed here.
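As a rough illustration, “word share” can be approximated with a simple Jaccard overlap over whitespace tokens; the paper’s exact formula may differ, and morphologically rich languages would normally need proper tokenisation.

```python
def word_share(sentence1: str, sentence2: str) -> float:
    """Fraction of unique tokens the two sentences share (Jaccard overlap),
    assuming naive whitespace tokenisation."""
    tokens1 = set(sentence1.lower().split())
    tokens2 = set(sentence2.lower().split())
    if not tokens1 or not tokens2:
        return 0.0
    return len(tokens1 & tokens2) / len(tokens1 | tokens2)

print(word_share("The bird is bathing in the sink.",
                 "Birdie is washing itself in the water basin."))
```

A pair like this can be near-identical in meaning while sharing almost no surface tokens, which is exactly the situation the Sinhala data exhibits.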
Conclusion and Implications
The MUSTS paper provides a reality check for the NLP community. While we often hear that “language barriers are broken” by the latest GPT or Llama model, the data suggests otherwise.
Key Takeaways for Students and Practitioners:
- Don’t trust the leaderboard blindly: A model that ranks #1 on MTEB (dominated by English/high-resource data) might be mediocre for your specific language needs. The rankings in MUSTS were significantly different from MTEB.
- LLMs aren’t magic: For low-resource semantic similarity, a smaller, specialized model like LaBSE or a fine-tuned XLM-R often outperforms a generic Large Language Model.
- Data quality matters: The success of MUSTS proves the value of strictly annotated, clean data over massive, noisy, machine-translated datasets.
As we move forward, benchmarks like MUSTS are essential. They force us to look beyond the “Winner” languages and ensure that the AI revolution includes the billions of people who speak “Underdog” languages. If you are building multilingual applications, assessing your model on MUSTS rather than just translating English benchmarks is a necessary step toward true reliability.