Language Models (LMs) like GPT-4 or LLaMA are impressive. They can write poetry, code in Python, and summarize history. But how do we know if they actually understand the grammatical rules of a language, or if they are simply parroting statistical patterns they memorized during training?

This question becomes even harder when we step away from English. Russian, for example, is a morphologically rich language with flexible word order and complex agreement rules. Evaluating a model’s grasp of Russian grammar requires more than just checking if the output “looks okay.”

Enter RuBLiMP (Russian Benchmark of Linguistic Minimal Pairs). Created by a diverse team of researchers, this new benchmark introduces a rigorous method for testing the grammatical competence of language models. It moves beyond simple templates and addresses the “elephant in the room” of AI evaluation: data contamination.

In this deep dive, we will explore how RuBLiMP was built, why “minimal pairs” are the weapon of choice for linguists, and what this benchmark reveals about the current state of AI’s ability to understand the Russian language.

The Problem with Evaluating Grammar

To test a human’s linguistic competence, linguists often use acceptability judgments. You give a person two sentences and ask which one sounds “correct” to a native speaker.

For example:

  1. The cat is on the mat. (Grammatical)
  2. *The cat are on the mat. (Ungrammatical)

This constitutes a minimal pair. The two sentences are identical except for one specific feature—in this case, the number agreement between the subject (“cat”) and the verb (“is/are”). If a listener consistently picks sentence #1, they understand subject-verb agreement.
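For the curious, a single pair can be represented very simply in code. The sketch below is purely illustrative; the field names are my own and not the exact schema used by BLiMP or RuBLiMP.

```python
# A minimal, illustrative representation of one minimal pair.
from dataclasses import dataclass

@dataclass
class MinimalPair:
    sentence_good: str   # grammatical variant
    sentence_bad: str    # ungrammatical variant, differing in exactly one feature
    phenomenon: str      # the rule being isolated

pair = MinimalPair(
    sentence_good="The cat is on the mat.",
    sentence_bad="The cat are on the mat.",
    phenomenon="subject-verb agreement",
)
```

An evaluation then reduces to a forced choice: does the model prefer sentence_good over sentence_bad?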

The Landscape of Benchmarks

For years, the gold standard for this type of evaluation in English has been BLiMP (Benchmark of Linguistic Minimal Pairs). However, for languages other than English, resources have been scarce, small, or artificially constructed.

As shown in the comparison table below, existing benchmarks for languages like Chinese (CLiMP), Japanese (JBLiMP), or Swedish (DaLAJ) vary significantly in size and methodology. Most rely on manual templates or translations, which can result in unnatural-sounding sentences.

Table 1: Comparison of benchmarks of linguistic minimal pairs for different languages.

Prior to this research, Russian evaluation relied largely on translation-based templates (like CLAMS), which often fail to capture the nuances of the language. RuBLiMP changes the game by introducing 45,000 minimal pairs derived from real, open-text corpora, covering a massive range of linguistic phenomena.

The RuBLiMP Method: From Raw Text to Minimal Pairs

The researchers didn’t just write 90,000 sentences by hand. Instead, they developed a sophisticated, four-stage pipeline to generate high-quality data that reflects how Russian is actually used in the wild.

The process is visualized in the diagram below:

Figure 1: Overview of RuBLiMP’s minimal pair generation approach.

1. Sentence Extraction and Annotation

The process begins with Sentence Extraction (a). The team scraped sentences from diverse sources: Wikipedia, Wikinews, and Librusec (a collection of books). This ensures the benchmark covers different domains, from encyclopedic descriptions to literary narrative.

Next comes Sentence Annotation (b). They used a state-of-the-art morphosyntactic parser to analyze the structure of every sentence, creating a “dependency tree.” For the running example introduced below (“For the first time an astronaut slept in zero gravity”), this map tells the system that “astronaut” is the subject, “slept” is the verb, and “in zero gravity” is a prepositional phrase modifying the verb.
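To make this step concrete, here is a minimal sketch using Stanza’s Russian Universal Dependencies pipeline. This is just one way to obtain such annotations; the paper’s exact parser and settings may differ.

```python
# A minimal sketch of morphosyntactic annotation, assuming Stanza's Russian UD models.
import stanza

stanza.download("ru")  # one-time model download
nlp = stanza.Pipeline("ru", processors="tokenize,pos,lemma,depparse")

doc = nlp("Впервые космонавт спал в невесомости")  # "For the first time an astronaut slept in zero gravity"
for word in doc.sentences[0].words:
    # surface form, part of speech, morphological features, head index, dependency relation
    print(word.text, word.upos, word.feats, word.head, word.deprel)
```

The resulting dependency tree (subject, verb, prepositional modifier, plus case and agreement features) is exactly the information the perturbation rules need in the next step.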

2. Perturbation (The Art of Breaking Sentences)

Once the system understands the sentence structure, it applies expert-written rules to generate the Minimal Pairs (c). This is where the linguistic magic happens. The system deliberately “breaks” the sentence to isolate a specific grammatical rule.

For example, if the valid sentence is:

Vpervye kosmonavt spal v nevesomosti (“For the first time an astronaut slept in zero gravity”)

The system might generate an ungrammatical version by changing the case ending of the noun “zero gravity” to one that is grammatically illegal after the preposition “in.”
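A toy version of such a rule is sketched below. It uses the pymorphy2 morphological analyzer to re-inflect the noun into the instrumental case, which can never follow the preposition v (“in”); this illustrates the idea, not the authors’ actual rule set or tooling.

```python
# A minimal sketch of a rule-based case perturbation, assuming pymorphy2 for Russian morphology.
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def corrupt_case(tokens, target_index, new_case="ablt"):
    """Re-inflect one noun into a case that is illegal after its preposition ('ablt' = instrumental)."""
    tokens = list(tokens)
    analysis = morph.parse(tokens[target_index])[0]   # best morphological analysis
    inflected = analysis.inflect({new_case})
    if inflected is None:                             # the requested form may not exist
        return None
    tokens[target_index] = inflected.word
    return tokens

grammatical = ["Впервые", "космонавт", "спал", "в", "невесомости"]
ungrammatical = corrupt_case(grammatical, target_index=4)
print(" ".join(grammatical))     # Впервые космонавт спал в невесомости
print(" ".join(ungrammatical))   # *Впервые космонавт спал в невесомостью
```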

The researchers categorized these perturbations into three main linguistic pillars:

  1. Morphology: Violating word formation rules (e.g., incorrect prefix order).
  2. Syntax: Violating structural rules (e.g., subject-predicate agreement, negation).
  3. Semantics: Violating meaning-based constraints (e.g., using a tense that contradicts a time adverb).

The distribution of these phenomena is comprehensive, covering 45 distinct paradigms grouped into 12 high-level categories.

Figure 2: Distribution of phenomena in RuBLiMP.

3. The “Anti-Cheating” Layer: Decontamination

This is arguably the most critical contribution of the paper. Large Language Models are trained on massive chunks of the internet. There is a high probability that a model “knows” a sentence is correct simply because it memorized that exact sentence during training, not because it understands the grammar.

To solve this, the researchers implemented Minimal Pair Curation (d) using a technique called MIN-K% PROB.

The logic is fascinating: If a model has memorized a training example, it will assign a high probability to every token in that sentence. If it hasn’t seen the sentence before, there will likely be some “outlier” tokens with lower probability. By filtering out sentences that models seem to have memorized (where the “surprise” factor is too low), RuBLiMP ensures that the evaluation tests generalization, not memory.
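The scoring itself is simple to sketch. The snippet below computes a Min-K% Prob score for one sentence with a HuggingFace causal LM; the model name and the 20% cutoff are illustrative choices, not the paper’s exact configuration.

```python
# A minimal sketch of Min-K% Prob scoring, assuming a HuggingFace causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ai-forever/rugpt3small_based_on_gpt2"  # illustrative choice of Russian LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def min_k_percent_prob(sentence: str, k: float = 0.2) -> float:
    """Average log-probability of the k% least likely tokens; higher values suggest memorization."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)              # predictions for tokens 2..N
    token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)  # log-prob of each actual token
    n_lowest = max(1, int(len(token_lp) * k))
    return token_lp.topk(n_lowest, largest=False).values.mean().item()
```

Sentences whose score exceeds a chosen threshold are treated as likely training data and removed, so what remains tests generalization rather than recall.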

The chart below demonstrates the impact of this filtering. The \(\Delta\)-scores measure how much accuracy drops once the decontamination filter is applied; a positive value indicates that the models were partly relying on memorization on the unfiltered data.

Figure 3: Delta-scores for each LM and K%.

What Does RuBLiMP Look Like?

To give you a concrete idea of what the models are tested on, here are examples of the specific paradigms used. You can see how subtle the differences are—a single letter change in a suffix or a slight movement of a word can render a Russian sentence ungrammatical.

Table 6: Examples of all 45 paradigms in RuBLiMP.

For instance, under Argument Structure, the benchmark tests if a model knows that an animate subject (like a person) can be swapped with an inanimate object in certain contexts, but not others. Under Aspect, it tests if the model understands that you cannot use a “perfective” verb (implying a completed action) with a word meaning “for a long time” (implying duration).

Experiments: Humans vs. Machines

The researchers evaluated 25 different Language Models, ranging from monolingual Russian models (like ruBERT, ruGPT) to massive multilingual models (like XLM-R, BLOOM, and LLaMA). They also collected a human baseline using native Russian speakers.

The Metric

The models were evaluated using perplexity (for decoder models) or pseudo-perplexity (for encoder models). Essentially, the model is shown both sentences of a pair; if it assigns a higher probability (i.e., a lower perplexity) to the grammatical sentence than to the ungrammatical one, it scores a point.

The formulas used for these calculations are standard in NLP:

For a sentence \(W = (w_1, \dots, w_N)\):

Equation 1: Perplexity (decoder models)

\[ \mathrm{PPL}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_{<i})\right) \]

Equation 2: Pseudo-perplexity (encoder models, masking each token in turn)

\[ \mathrm{PPPL}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid W_{\setminus i})\right) \]
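In code, the forced-choice judgment for a decoder model boils down to comparing two perplexities. The sketch below assumes a HuggingFace causal LM with an illustrative model name; encoder models would instead use pseudo-perplexity, masking one token at a time.

```python
# A minimal sketch of the perplexity-based forced choice, assuming a HuggingFace causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ai-forever/rugpt3small_based_on_gpt2"  # illustrative choice of Russian LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def perplexity(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean negative log-likelihood per token
    return torch.exp(loss).item()

def prefers_grammatical(good: str, bad: str) -> bool:
    # Lower perplexity means the model assigns the sentence a higher probability.
    return perplexity(good) < perplexity(bad)
```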

The Results

The results paint a humbling picture for AI. While models have advanced significantly, they still lag behind human intuition.

1. Humans are undefeated. Humans achieved near-perfect scores (>95%) across almost all paradigms. The sentences created by the automated pipeline were validated by linguists (Table 2 below) and found to be highly reliable, with plausible minimal pairs constituting over 94% of the dataset.

Table 2: The ratio of plausible minimal pairs by phenomenon.

2. Bigger isn’t always better. Surprisingly, massive models didn’t always outperform smaller, specialized ones. For example, ruGPT-medium performed similarly to ruGPT-large. Specialized Russian models generally outperformed massive multilingual models like mGPT or BLOOM on specific Russian grammatical nuances.

3. What is easy and what is hard?

  • Easy: Models are generally good at Morphology and Agreement. If a noun is plural, the model knows the verb should be plural, most likely because these patterns appear constantly in training data and can usually be resolved from nearby words.
  • Hard: Models struggle with Semantics and Structural Relations:
      • Negation: Models fail to distinguish between correct and incorrect uses of negative pronouns (e.g., “He never goes” vs. “He ever goes”).
      • Tense: Models struggle to identify when a verb tense contradicts a time adverb (e.g., “Yesterday he will go”).
      • Long-distance dependencies: When the subject and verb are separated by many other words, models lose track of the relationship.

4. The Length Effect. The researchers analyzed how sentence length affects accuracy. Interestingly, model performance generally improves as sentences get longer. This might seem counterintuitive (longer sentences are more complex), but the shorter sentences in this benchmark often isolate the most difficult semantic phenomena.

Figure 4: Results on RuBLiMP for the monolingual LMs per domain, grouped into seven sentence-length quantiles.
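Reproducing this kind of breakdown is straightforward. The sketch below assumes a pandas DataFrame of per-pair results with illustrative columns length (in tokens) and correct (a boolean); it is not the authors’ actual analysis code.

```python
# A minimal sketch of accuracy grouped by sentence-length quantiles, assuming pandas.
import pandas as pd

def accuracy_by_length(results: pd.DataFrame, n_bins: int = 7) -> pd.Series:
    bins = pd.qcut(results["length"], q=n_bins, duplicates="drop")  # equal-sized length bins
    return results.groupby(bins)["correct"].mean()                  # accuracy within each bin
```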

Multilingual Analysis

The team also compared how multilingual models perform across different languages using RuBLiMP alongside benchmarks for English (BLiMP), Chinese (CLiMP/SLING), and others.

The finding? No single model rules them all. A model might be excellent at English syntax but perform at random-guessing levels for Russian or Japanese. This highlights the danger of assuming a “multilingual” model is equally competent in all its supported languages.

Table 11: Accuracy scores for the multilingual experiments.

Conclusion

RuBLiMP represents a significant step forward in how we evaluate Language Models. By moving away from simple templates and confronting the issue of data contamination head-on, the researchers have provided a tool that tells us not just if a model works, but how it understands language.

The takeaways for students and researchers are clear:

  1. Don’t trust the loss curve: A model with low training loss might just be memorizing data. Decontaminated benchmarks are essential.
  2. Grammar is not solved: While LLMs are fluent, they still struggle with the structural and semantic logic that humans grasp intuitively—especially in morphologically complex languages like Russian.
  3. Context matters: The method of extracting sentences from real books and articles creates a much harder, more realistic test than the synthetic sentences used in the past.

As AI continues to evolve, benchmarks like RuBLiMP act as the necessary “report card,” ensuring that our machines aren’t just faking fluency, but actually acquiring linguistic competence.