In the world of Natural Language Processing (NLP), we often take word order for granted. If you speak English, “The dog chased the cat” and “The cat chased the dog” mean two very different things. The syntax—the structure of the sentence—is rigidly defined by the sequence of the words.
But what if the order didn’t matter? What if you could say “Chased cat dog the” and, due to the way the words are modified or “tagged,” the meaning remained exactly the same?
This is the reality for Morphologically-Rich Languages (MRLs) like Sanskrit, Turkish, and Lithuanian. In these languages, grammar is handled by complex word endings (morphology) rather than position. While this gives speakers poetic freedom, it creates a massive headache for modern AI models. Most state-of-the-art parsers, specifically those built on Transformer architectures, rely heavily on positional encoding. They expect patterns in word order. When they encounter a language where order is arbitrary, they struggle to generalize, often overfitting to the specific word orders seen in their training data.
In this post, we will deep-dive into a fascinating research paper titled “CSSL: Contrastive Self-Supervised Learning for Dependency Parsing…”. The researchers propose a clever solution: a Contrastive Self-Supervised Learning (CSSL) framework. By teaching the model that a sentence and its scrambled version are actually the same thing, they achieve significant performance gains in dependency parsing for low-resource languages.
Let’s unpack how they did it.
The Problem: When “Position” Misleads the Model
To understand the solution, we first need to understand the task: Dependency Parsing.
Dependency parsing is the process of analyzing the grammatical structure of a sentence. The goal is to establish relationships between “head” words and “dependent” words. For example, identifying that “dog” is the subject (dependent) of the verb “chased” (head).
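To make this concrete, a dependency parse can be stored as a simple list of (dependent, head, relation) triples. The sentence, indices, and relation names below are just an illustration (Universal Dependencies-style labels), not an example from the paper:

```python
# A dependency parse of "The dog chased the cat", stored as
# (dependent_index, head_index, relation) triples. Index 0 is a
# virtual ROOT token; the real words are numbered from 1.
sentence = ["ROOT", "The", "dog", "chased", "the", "cat"]

parse = [
    (1, 2, "det"),    # "The"    modifies "dog"
    (2, 3, "nsubj"),  # "dog"    is the subject of "chased"
    (3, 0, "root"),   # "chased" attaches to the virtual ROOT
    (4, 5, "det"),    # "the"    modifies "cat"
    (5, 3, "obj"),    # "cat"    is the object of "chased"
]

for dep, head, label in parse:
    print(f"{sentence[dep]:>8} --{label}--> {sentence[head]}")
```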
The Challenge of Free Word Order
In English, the parser learns that the noun before the verb is usually the subject, and the noun after is the object.
In Sanskrit, however, relationships are defined by case markers (vibhakti). As long as the markers are correct, the words can appear in almost any order. This is known as “Relatively Free Word Order.”

As shown in Figure 1 above, the dependency trees for the original Sanskrit sentence (top) and a permuted version (bottom) are identical. The arrows (dependencies) still point to the same words with the same labels (like kartā for agent/subject), even though the linear position of the words has changed completely.
The Conflict with Pre-training
Here lies the conflict. Modern pre-trained models (like BERT or mBERT) are trained with Positional Encodings. They learn that position \(1\) relates to position \(2\) in specific ways.
When you fine-tune these models on a Sanskrit dataset, the model tries to learn syntactic rules based on position. But because the position is arbitrary, the model gets confused. It might learn a rule from the training data that doesn’t apply to the test data simply because the writer chose a different word order.
You might ask: Why not just remove the position encoding? The researchers highlight a counter-intuitive finding from previous work: simply stripping out position information actually hurts performance. The model needs some sense of structure. The goal, therefore, isn’t to blind the model to position, but to make the model robust to variations in it.
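As a rough illustration of why position leaks in (a generic sinusoidal Transformer input, not the actual parser used in the paper), positional encodings are added to the token embeddings, so a scrambled sentence is no longer just a reshuffling of the same vectors; the “NoPos” ablation discussed later simply skips this addition:

```python
import math
import torch

def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal positional encodings (Vaswani et al., 2017)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

tokens = torch.randn(5, 64)             # toy embeddings for a 5-word sentence
pe = sinusoidal_pe(5, 64)
perm = torch.tensor([2, 0, 4, 1, 3])    # a scrambled word order

orig_input      = tokens + pe           # what the encoder normally sees
scrambled_input = tokens[perm] + pe     # same words, new order, new positions

# With positional encodings, the scrambled input is NOT just a reshuffling of
# the original input: every word now carries a different position vector,
# so the model can latch onto order even when order carries no meaning.
print(torch.allclose(scrambled_input, orig_input[perm]))  # False
```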
The Solution: Contrastive Self-Supervised Learning (CSSL)
The researchers propose a method to force the model to learn “Permutation Invariance.” They want the model’s internal representation of a sentence to be the same, regardless of how the words are ordered.
To do this, they employ Contrastive Learning.
What is Contrastive Learning?
Contrastive learning is a technique for learning meaningful representations of data by comparing similar and dissimilar pairs. The intuition is simple:
- Take an Anchor input (the original data).
- Take a Positive input (a version of the anchor that is slightly different but means the same thing).
- Take a Negative input (completely different data).
- Train the model to pull the Anchor and Positive closer together in vector space, while pushing the Anchor and Negative far apart.

Figure 2 illustrates this geometry. The model adjusts its weights so that the blue dot (Anchor) moves toward the green dot (Positive) and away from the red dot (Negative).
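A toy check of what “pulling” and “pushing” mean in terms of similarity scores (the vectors below are made up purely for illustration):

```python
import torch
import torch.nn.functional as F

anchor   = torch.tensor([0.9, 0.1, 0.0])   # embedding of the original sentence
positive = torch.tensor([0.8, 0.2, 0.1])   # embedding of its scrambled version
negative = torch.tensor([0.1, 0.3, 0.9])   # embedding of an unrelated sentence

# Training pushes sim(anchor, positive) up and sim(anchor, negative) down;
# cosine similarity is one common way to measure "closeness" in vector space.
print(F.cosine_similarity(anchor, positive, dim=0).item())  # high (~0.98 here)
print(F.cosine_similarity(anchor, negative, dim=0).item())  # low  (~0.14 here)
```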
Applying CSSL to Dependency Parsing
In Computer Vision, creating a “Positive” pair is easy—you just crop or rotate the image. In NLP, it’s harder; if you change words, you usually change the meaning.
However, the researchers leverage the unique feature of Morphologically-Rich Languages: Permutation.
Because word order is free in Sanskrit, a scrambled sentence is a perfect Positive example. It contains the exact same semantic and syntactic information as the original.
- Anchor (\(X_i\)): The original sentence from the training set.
- Positive (\(X_i^+\)): The same sentence with words randomly reordered.
- Negative (\(X_i^-\)): A different, random sentence from the batch.

Figure 3 shows the pipeline. The phrase “gacchāmi aham vanam” (I am going to the forest) is the input. The model learns to associate it strongly with the permuted version “aham vanam gacchāmi”, while differentiating it from an unrelated sentence.
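Here is a minimal sketch of how such a positive pair can be generated (an assumed helper, not the authors' code). For the contrastive objective only the scrambled tokens are needed, but if the permuted copy is also reused as augmented training data, the gold head indices must be remapped so the tree stays intact:

```python
import random

def permute_sentence(tokens, heads):
    """Shuffle a sentence and remap its gold heads to the new positions.
    `heads[i]` is the 1-based index of token i's head, with 0 meaning ROOT."""
    order = list(range(len(tokens)))
    random.shuffle(order)                                          # new linear order of the words
    new_pos = {old + 1: new + 1 for new, old in enumerate(order)}  # old index -> new index
    new_tokens = [tokens[old] for old in order]
    new_heads = [new_pos.get(heads[old], 0) for old in order]      # ROOT (0) stays 0
    return new_tokens, new_heads

tokens = ["gacchāmi", "aham", "vanam"]   # "I am going to the forest"
heads  = [0, 1, 1]                       # the verb is the root; both other words depend on it
print(permute_sentence(tokens, heads))
```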
Deep Dive: The Mathematical Framework
The beauty of this approach is that it is modular. It doesn’t require reinventing the Transformer architecture. It simply adds an auxiliary “loss function” during the training process.
The training involves optimizing two objectives simultaneously:
- The Classification Loss (\(\mathcal{L}_{CE}\)): The standard task of predicting the correct dependency tree.
- The Contrastive Loss (\(\mathcal{L}_{CSSL}\)): The task of aligning the vector representations of permuted sentences.
The Contrastive Loss Function
The researchers use a specific formula to calculate how well the model is distinguishing between positive and negative pairs:

\[
\mathcal{L}_{CSSL} = -\log \frac{\exp(z_i \cdot z_{i^+} / \tau)}{\sum_{j=1}^{N} \exp(z_i \cdot z_j / \tau)}
\]
Let’s break this equation down:
- \(z_i\) and \(z_{i^+}\): These are the vector representations (embeddings) of the Anchor and the Positive (permuted) sentence.
- Similarity (\(z_i \cdot z_{i^+}\)): The dot product calculates how similar these two vectors are. We want this number to be high.
- The Denominator (\(\sum \exp(...)\)): This sums the exponentiated similarities between the Anchor and the other examples in the batch (the negatives). We want the anchor to be dissimilar to these, keeping the denominator small relative to the numerator.
- \(\tau\) (Tau): A “temperature” parameter that controls how sharp the probability distribution is.
Essentially, this loss is the negative log-probability of picking out the correct positive pair from a batch of random negatives. Minimizing it forces the model to encode the “essence” of the sentence regardless of word order.
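A compact PyTorch sketch of a loss in this family (an InfoNCE/NT-Xent-style formulation consistent with the description above; the normalization, temperature value, and in-batch negatives are implementation assumptions, not details taken from the paper):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_anchor, z_positive, tau: float = 0.1):
    """z_anchor[i] and z_positive[i] embed an original sentence and its permuted
    version; every other row in the batch serves as an in-batch negative for row i."""
    z_anchor = F.normalize(z_anchor, dim=-1)     # unit-length embeddings
    z_positive = F.normalize(z_positive, dim=-1)
    logits = z_anchor @ z_positive.T / tau       # pairwise similarities, scaled by temperature
    targets = torch.arange(z_anchor.size(0))     # the true positive sits on the diagonal
    return F.cross_entropy(logits, targets)      # -log softmax of the correct pair, averaged

# Toy usage: a batch of 4 sentence embeddings of dimension 256.
loss = contrastive_loss(torch.randn(4, 256), torch.randn(4, 256))
print(loss.item())
```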
The Total Loss
The final objective function for the neural network is a simple addition of the two tasks:

\[
\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_{CSSL}
\]
Here, \(\mathcal{L}_{CE}\) ensures the model still learns how to parse grammar (finding heads and dependents), while \(\mathcal{L}_{CSSL}\) ensures the model’s understanding is robust to word scrambling.
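Putting the pieces together, a schematic batch step might look as follows (reusing the `contrastive_loss` sketch above; the dummy tensors and the simplified head-prediction loss are stand-ins for a real graph-based parser, not the paper's architecture):

```python
# Dummy stand-ins: 4 sentences, 7 words each, 8 candidate heads per word.
head_scores = torch.randn(4, 7, 8, requires_grad=True)   # the parser's arc scores
gold_heads  = torch.randint(0, 8, (4, 7))                 # gold head index for every word
ce_loss = F.cross_entropy(head_scores.reshape(-1, 8), gold_heads.reshape(-1))

# Sentence embeddings of the originals and of their permuted copies.
z_anchor   = torch.randn(4, 256, requires_grad=True)
z_positive = torch.randn(4, 256, requires_grad=True)
cssl_loss = contrastive_loss(z_anchor, z_positive)

total_loss = ce_loss + cssl_loss   # L = L_CE + L_CSSL
total_loss.backward()              # in a real loop: optimizer.zero_grad() / optimizer.step()
print(total_loss.item())
```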
Experiments and Results
Does this theory hold up in practice? The researchers tested their CSSL framework on Sanskrit (using the Sanskrit Treebank Corpus) and six other low-resource Morphologically-Rich Languages (Turkish, Telugu, Gothic, Hungarian, Ancient Hebrew, and Lithuanian).
They used a strong baseline model called RNGTr (Recursive Non-Autoregressive Graph-to-Graph Transformer), which is currently one of the best architectures for this task.
Results on Sanskrit
The results for Sanskrit were compelling. They compared several variations:
- RNGTr: The standard baseline.
- RNGTr (NoPos): The baseline with position encoding removed.
- RNGTr (DA): The baseline using “Data Augmentation” (training on more scrambled data) but without the contrastive loss.
- Prop. System (CSSL): The proposed method.

Table 1 (above) reveals key insights:
- Position Encoding Matters: Notice the “RNGTr (NoPos)” row. The score drops drastically (from 89.62 to 80.78 UAS). This confirms that simply “blinding” the model to position is a bad idea. It needs position data, but it needs to learn not to over-rely on it.
- CSSL Beats Standard Augmentation: While Data Augmentation (DA) helped (90.38 UAS), the CSSL method achieved 91.86 UAS.
- Best of Both Worlds: Combining Data Augmentation and CSSL (the last row) yielded the highest results (92.43 UAS), suggesting the two techniques are complementary.
Multilingual Robustness
The researchers didn’t stop at Sanskrit. They ran the same experiments on six other languages, and the trend held firm. On average, the CSSL approach improved parsing accuracy (UAS/LAS, the unlabeled and labeled attachment scores) by about 3 points over the baseline.
Interestingly, they also tested it on English. Since English has a fixed word order, you would expect this method (scrambling words) to hurt performance or at least not help.
- Result: The method didn’t hurt English performance much, but it didn’t help significantly either, and standard Data Augmentation worked better for English. This serves as a “sanity check,” confirming that CSSL is specifically beneficial for languages where word order is truly free.
Why This Matters
This paper provides a blueprint for handling one of the trickiest aspects of linguistic diversity: Word Order.
For years, NLP has been dominated by English-centric assumptions. Models like BERT assume that if “Man” comes before “Bites,” the Man is the agent. But for a vast number of the world’s languages—including many classical and low-resource languages—this assumption fails.
The CSSL framework proposed here is elegant because:
- It is unsupervised regarding order: You don’t need manual rules telling the model “this language is free word order.” The contrastive pairs teach it automatically.
- It works with existing models: You can plug this loss function into almost any graph-based parser.
- It solves the “Low-Resource” bottleneck: These languages often lack massive datasets. By effectively using permutations as “free” positive examples, the model squeezes more learning out of limited data.
By teaching models that “Order is noise, but Morphology is signal,” we move one step closer to truly universal language understanding.
References
- Ray, P., Sandhan, J., Krishna, A., & Goyal, P. CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free-Word-Ordered and Morphologically-Rich Low-Resource Languages. arXiv:2410.06944.
- Note: All figures and tables referenced are adapted from the original paper’s provided image deck.