If you read the sentence “The dog bit the man,” you know exactly who is in trouble. If you swap the words to “The man bit the dog,” the meaning flips entirely. This is because English relies heavily on word order to convey meaning. To understand the sentence, you need to know not just what words are present, but where they are sitting.

But what if you spoke a language where “The dog bit the man” and “Man dog bit the” meant the exact same thing?

In the world of Natural Language Processing (NLP), we have largely designed our most powerful models—like Transformers and BERT—based on the intuition of languages like English. We assume that knowing the position of a word is crucial. This is handled by a mechanism called Positional Encoding (PE). However, a fascinating research paper titled “A Morphology-Based Investigation of Positional Encodings” challenges this assumption.

The researchers pose a fundamental question: Is the machinery we use to track word order actually necessary for all languages?

In this post, we will dive deep into this study, exploring the relationship between linguistics and deep learning architecture. We will see how the structural complexity of a language (morphology) might make standard Transformer components redundant, and what this tells us about the future of multilingual AI.

The “English Bias” in Deep Learning

Before we dissect the paper, we need to understand the architectural status quo.

Modern Large Language Models (LLMs) are built on the Transformer architecture. One of the quirks of the Transformer’s core mechanism—Self-Attention—is that it is “permutation invariant.” This is a fancy way of saying that if you feed the Transformer a bag of jumbled words, it treats them exactly the same as an ordered sentence. To the raw attention mechanism, “I love AI” looks identical to “AI love I.”

To fix this, the original creators of the Transformer introduced Positional Encodings (PE). These are mathematical vectors added to word embeddings that act like timestamps or coordinate markers. They tell the model, “This word is first,” “This word is second,” and so on.
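To make the “permutation invariant” point concrete, here is a tiny PyTorch sketch (my own illustration, not code from the paper): without positional vectors, self-attention produces the same outputs for a shuffled sentence, just reordered; once positional vectors are added, the shuffled and original sentences stop looking identical.

```python
# Illustrative sketch: plain self-attention cannot tell a shuffled sentence
# from the original, but adding positional vectors breaks that tie.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 3, 8                      # e.g. the 3 tokens of "I love AI"
x = torch.randn(seq_len, d_model)            # toy word embeddings

def self_attention(h):
    """Single-head self-attention with identity projections, for illustration."""
    scores = h @ h.T / d_model ** 0.5
    return F.softmax(scores, dim=-1) @ h

perm = torch.tensor([2, 1, 0])               # "AI love I"

# Without positional encodings: permuting the input merely permutes the
# output rows; the model effectively sees the same bag of words.
out = self_attention(x)
out_perm = self_attention(x[perm])
print(torch.allclose(out[perm], out_perm))   # True

# With (random, purely illustrative) positional vectors added, the two
# orderings are no longer interchangeable.
pos = torch.randn(seq_len, d_model)
out_pe = self_attention(x + pos)
out_pe_perm = self_attention(x[perm] + pos)
print(torch.allclose(out_pe[perm], out_pe_perm))  # False (in general)
```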

For English, this is non-negotiable. But the authors of this paper argue that this design choice ignores a massive field of study: Linguistic Typology.

Morphology vs. Syntax

Languages generally use two main strategies to convey who did what to whom:

  1. Word Order (Syntax): You place the subject before the verb and the object after (like English or Chinese).
  2. Word Form (Morphology): You change the structure of the word itself (adding suffixes, prefixes, or modifying the root) to indicate its role.

Languages that rely on word order are often called Analytic or Morphologically Poor. Languages that rely on modifying words are called Synthetic or Morphologically Rich.

In a morphologically rich language like Sanskrit, Finnish, or Turkish, the word for “dog” might change its ending depending on whether the dog is biting, being bitten, or being given a bone. Because the word itself carries the grammatical information, the order in which the words appear matters much less.

This leads to the paper’s core hypothesis: If a language encodes grammatical information inside the word itself (high morphology), the Deep Learning model should rely less on Positional Encodings.

The Hypothesis Visualized

To truly understand this, let’s look at a comparison provided by the researchers between English (a morphologically poor language) and Sanskrit (a morphologically rich language).

Figure 1: Effect of word order on semantics for English (left) and Sanskrit (right). English is a morphologically poor language with SVO word order, whereas Sanskrit is a morphologically rich language with no dominant word order. Distorting the word order completely alters the meaning in English; for Sanskrit, the meaning remains intact.

Look at the English side of Figure 1, on the left. If you swap the positions of Father, Child, and King, the meaning changes drastically. The structure is rigid.

Now look at the Sanskrit side on the right. The words have specific endings (markers) like acc (accusative case).

  • Pita (Father)
  • Rajne (to the King)
  • Balkam (Child - object)
  • Datvan (Gave)

Regardless of whether you say “Father to the King Child Gave” or “Child to the King Father Gave,” the suffixes tell you exactly who gave whom to whom. The meaning remains intact despite the scrambling.

The researchers suggest that a BERT model trained on Sanskrit shouldn’t panic if we take away its ability to see word order. A BERT model trained on English, however, should fail catastrophically.

Methodology: How to Test the Hypothesis?

To test this relationship, the authors conducted a massive empirical study covering 22 languages across 9 language families.

1. Quantifying Complexity (The TTR Metric)

First, they needed a way to measure how “morphologically complex” a language is. They used a metric called the Type-Token Ratio (TTR).

  • Token: The total number of words in a text.
  • Type: The number of unique words in that text.

In English, “walk,” “walks,” “walked,” and “walking” count as four distinct types, but English produces relatively few such variants. In a language like Finnish, a single noun can have hundreds of forms. Since TTR is simply the number of types divided by the number of tokens, morphologically rich languages score higher: they generate far more unique word forms for the same amount of text.
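The computation itself is simple. Here is a toy sketch of the idea; whitespace tokenization and lowercasing are my own simplifications, not necessarily the paper’s exact preprocessing.

```python
# Toy Type-Token Ratio: unique word forms (types) divided by total word
# occurrences (tokens). Whitespace tokenization is a simplification.
def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()   # total word occurrences
    types = set(tokens)             # unique word forms
    return len(types) / len(tokens)

sentence = "the dog bit the man and the man bit the dog"
print(type_token_ratio(sentence))   # 5 types / 11 tokens ≈ 0.455
```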

The authors calculated TTR for all 22 languages using the FLORES-200 benchmark to ensure consistency.

  • Low TTR (Analytic): Vietnamese (0.077), Chinese (0.17), English (0.194).
  • High TTR (Synthetic): Turkish (0.376), Finnish (0.428), Korean (0.465).

2. The “Lobotomy” Experiment

The experimental setup was clever but straightforward. They took pre-trained BERT models for each of the 22 languages and fine-tuned them on various downstream tasks (such as named entity recognition and dependency parsing).

They ran two versions of the training:

  1. Baseline: Standard BERT with Positional Encodings intact.
  2. Perturbed: BERT with Positional Encodings set to zero, effectively removing the model’s ability to see word order (see the sketch below).
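Here is one way the perturbed condition can be implemented; a minimal sketch assuming a HuggingFace BERT-style checkpoint (the checkpoint name is only an example), not the authors’ released code.

```python
# Sketch of the "perturbed" setup: zero out the learned position embeddings
# and freeze them so the model can no longer see word order.
# "bert-base-multilingual-cased" is just an example checkpoint.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-multilingual-cased")

with torch.no_grad():
    model.embeddings.position_embeddings.weight.zero_()              # positions contribute nothing
model.embeddings.position_embeddings.weight.requires_grad = False    # keep them at zero during fine-tuning

# Fine-tuning on the downstream task (NER, POS tagging, parsing, ...) then
# proceeds exactly as in the baseline; comparing the two runs gives the
# relative decrease described below.
```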

They then measured the Relative Decrease in performance.

\[ \text{Relative Decrease} = \frac{\text{Score}_{\text{baseline}} - \text{Score}_{\text{perturbed}}}{\text{Score}_{\text{perturbed}}} \]

If the model crashes and burns without PE, the relative decrease is high (meaning PE was essential). If the model barely shrugs, the relative decrease is low (meaning PE wasn’t doing much work).
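In code, the metric is a one-liner; the scores below are invented for illustration, not results from the paper.

```python
# Direct translation of the relative-decrease formula above.
def relative_decrease(score_baseline: float, score_perturbed: float) -> float:
    return (score_baseline - score_perturbed) / score_perturbed

print(relative_decrease(0.90, 0.55))  # ≈ 0.64: PE was doing a lot of work
print(relative_decrease(0.90, 0.85))  # ≈ 0.06: PE barely mattered
```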

3. The Scope of Languages

To ensure the results weren’t a fluke, the study used a diverse set of languages, as detailed in Table 2 below. This prevents the study from being biased toward just European languages.

Table 2: Overview of different languages

Results: Syntactic Tasks

The most direct test of grammar is a syntactic task. The researchers looked at Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. These tasks require the model to understand the grammatical structure of the sentence to identify nouns, verbs, people, and locations.

Let’s look at the results for Named Entity Recognition.

Figure 2: Effect of Positional Encoding on NER task.

Figure 2 tells a compelling story. The y-axis shows how much performance dropped when Positional Encodings were removed. The languages on the x-axis are roughly ordered by morphological complexity.

  • The Left Side (Analytic): Look at Vietnamese (vi) and Chinese (zh). The drop is massive—over 60% for Vietnamese! These languages have very little morphology; they rely almost entirely on word order. Without PE, the model is lost.
  • The Middle (Moderate): English (en) and French (fr) sit in the middle. They suffer a significant drop, but not as bad as the purely analytic languages.
  • The Right Side (Synthetic): Look at Turkish (tr) and Finnish (fi). The curve flattens out. The performance drop is minimal. The model basically says, “I don’t need to know the position; the word endings tell me everything I need to know.”

This trend is repeated in the Part-of-Speech (POS) tagging task, shown in Figure 3.

Figure 3: Effect of Positional Encoding on POS task.

Here again, Vietnamese (vi) and Chinese (zh) see the highest relative decrease in performance. As we move right toward the morphologically rich languages like Russian (ru) and Finnish (fi), the reliance on positional encoding diminishes.

Dependency Parsing

Perhaps the hardest syntactic task is Dependency Parsing—mapping out the tree of relationships between words (e.g., determining which adjective modifies which noun). This usually requires strict attention to structure.

Figure 4: Effect of Positional Encoding on Dependency Parsing.

In Figure 4, the trend is stark.

  • Chinese (zh): A massive drop (around 60%).
  • Finnish (fi): A much smaller drop (around 20%).

This confirms that for morphologically rich languages, the syntax is encoded inside the words. The “Bag of Words” (a sentence with no order) still contains enough information to reconstruct the grammatical tree because the “lego blocks” of the words only fit together in specific ways.

Results: Semantic Tasks

Does this trend hold for tasks that are more about meaning than strict grammar? The researchers looked at XNLI (Natural Language Inference) and PAWS-X (Paraphrasing).

In these tasks, the model has to determine if two sentences contradict each other or mean the same thing.

Figure 5: Effect of Positional Encoding on XNLI.

Figure 5 shows the results for Natural Language Inference. The curve is flatter compared to the syntactic tasks, but the pattern remains. Vietnamese (vi) still relies most on position, while Swahili (sw) and Urdu (ur) rely less on it.

Why is the curve flatter? The authors suggest that semantic tasks can often be solved using “keyword matching.” If Sentence A mentions “cat” and “eating” and Sentence B mentions “feline” and “food,” you can guess they are related without knowing the exact grammar. Thus, Positional Encoding is less critical for all languages in these tasks, though strictly ordered languages still need it more.

The Statistical Verdict

To formalize these observations, the authors calculated the Spearman Correlation Coefficient, a rank-based measure of how strongly two variables move together. In this case, they correlated:

  1. Morphological Complexity (TTR)
  2. Relative Decrease in Performance (Importance of PE)

A score of -1.0 would mean a perfect negative correlation: as complexity goes up, the need for PE goes down.
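For readers who want to run this kind of analysis on their own scores, the correlation step is a few lines of scipy. In the sketch below, the TTR values are the ones quoted earlier, but the relative-decrease numbers are made-up placeholders chosen only to show the mechanics, not the paper’s results.

```python
# Correlating morphological complexity (TTR) with reliance on PE.
from scipy.stats import spearmanr

ttr          = [0.077, 0.17, 0.194, 0.376, 0.428, 0.465]  # vi, zh, en, tr, fi, ko
rel_decrease = [0.62,  0.58, 0.40,  0.22,  0.18,  0.15]   # invented placeholder values

rho, p_value = spearmanr(ttr, rel_decrease)
print(rho)  # -1.0 for this perfectly monotone toy data
```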

Table 1: Spearman correlation coefficient between morphological complexity of a language and relative decrease in performance across different tasks

As Table 1 shows, the correlations are strongly negative.

  • Dependency Parsing (UAS): -0.882. This is a remarkably strong correlation for linguistic data, and it directly supports the hypothesis: the more complex the words, the less the order matters.
  • NER: -0.742.
  • XNLI: -0.773.

The data provides strong evidence that the utility of Positional Encodings is not universal: it is tied to the typological nature of the language.

Conclusion: Rethinking the “One Architecture Fits All” Approach

This research paper provides a crucial reality check for the field of NLP. For years, we have treated the Transformer architecture (and its Positional Encodings) as a universally optimal solution. This study reveals that this design is actually optimized for analytic languages like English.

The authors summarize their contribution succinctly:

“Our findings reveal that the importance of positional encoding diminishes with increasing morphological complexity in languages.”

Why does this matter?

  1. Efficiency: Training LLMs is expensive. If we are building a model specifically for Finnish, Turkish, or Sanskrit, we might be wasting parameters and computation on Positional Encodings that the model barely uses. We could potentially design more efficient, morphology-aware architectures for these languages.
  2. Performance: Conversely, simply slapping English-centric architectures onto morphologically rich languages might be suboptimal. If the model is forced to pay attention to position when it should be paying attention to word structure, it might learn inefficiently.
  3. Better Tokenization: The results suggest that for rich languages, how we break words into pieces (tokenization) is likely more important than how we order them.

As AI becomes truly global, studies like this remind us that language is diverse. Our models should reflect the beautiful complexity of human communication, rather than forcing every language to fit into an English-shaped box.


This blog post explains the research presented in “A Morphology-Based Investigation of Positional Encodings” by Ghosh et al., covering their experiments on how linguistic morphology interacts with deep learning architectures.