Introduction
In the world of Natural Language Processing (NLP), we have become accustomed to the “bigger is better” paradigm. Massive models like BERT or GPT are trained on effectively the entire internet, learning the statistical patterns of language before they are ever shown a specific task. But what happens when we zoom in from the level of sentences and paragraphs to the level of individual characters? And more importantly, what happens when we don’t have an internet’s worth of data for a specific language?
This is the challenge of morphological inflection in low-resource languages. Morphology is the study of the internal structure of words. It is the system that tells us that the past tense of “walk” is “walked,” but the past tense of “run” is “ran.” For machines to master a language, they must understand these rules.
In a recent paper, researchers explored a fascinating question: Can we improve a model’s ability to conjugate and inflect words by training it on unsupervised tasks using only the data we already have?
The results are surprising. They suggest that simple tasks like “autoencoding” (predicting the input word itself) can significantly boost performance, sometimes even more than complex “denoising” strategies that power models like BERT. This blog post breaks down their methodology, the architecture, and the implications for training models when data is scarce.
The Problem: Character-Level Transduction
Before diving into the solution, let’s formalize the problem. Character-level sequence-to-sequence tasks (or character transduction) involve mapping a source string to a target string. Unlike machine translation, where we map English words to French words, here we map a “lemma” (the dictionary form of a word) to its “inflected” form based on specific grammatical features.
Mathematically, if we have a set of source strings \(S\) and target strings \(Y\), along with a set of morphological tags \(\tau\), we are trying to learn a function:
\[ f : S \times \tau \rightarrow Y \]
Here, \(f\) is typically a neural network. A classic example of this function in action looks like this:
cry + PST → cried
The model must learn that adding the tag PST (Past Tense) to cry transforms the y into i and adds ed. This is trivial for high-resource languages like English where we have abundant examples. But for low-resource languages—which make up the majority of the world’s languages—we might only have a few hundred training pairs.
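To make the setup concrete, here is a minimal sketch of how such a training instance might be represented. The class name, field names, and tag format are my own illustration, not the paper’s exact data schema.

```python
from dataclasses import dataclass

@dataclass
class InflectionExample:
    lemma: str        # dictionary form, e.g. "cry"
    tags: list[str]   # morphological features, e.g. ["V", "PST"]
    target: str       # inflected form, e.g. "cried"

example = InflectionExample(lemma="cry", tags=["V", "PST"], target="cried")

# The model consumes the lemma as a character sequence plus the tag symbols,
# and must emit the target form one character at a time.
source_sequence = list(example.lemma) + example.tags  # ['c', 'r', 'y', 'V', 'PST']
target_sequence = list(example.target)                # ['c', 'r', 'i', 'e', 'd']
```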
The researchers investigated 27 typologically diverse languages to test their hypotheses. As shown below, this includes languages ranging from Amharic and Belarusian to Navajo and Sanskrit.

The Method: Unsupervised Secondary Tasks
The core innovation of this paper is not a new massive architecture, but a clever training strategy. The authors explore Transfer Learning using unsupervised secondary tasks.
Transfer learning usually involves training a model on a secondary task (like language modeling) and then fine-tuning it on the target task. The authors compare two specific ways to arrange this training, sketched after the list below:
- Pretraining (PT): Train on the secondary task first, then fine-tune on the inflection task.
- Multi-Task Learning (MTL): Train on both the secondary task and the inflection task simultaneously.
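The difference between the two arrangements is easiest to see as a batch schedule. The toy sketch below is my own illustration with placeholder batch labels, not real training code.

```python
secondary = ["AE-1", "AE-2", "AE-3"]         # unsupervised batches
inflection = ["INFL-1", "INFL-2", "INFL-3"]  # supervised inflection batches

# Pretraining (PT): finish the secondary task first, then fine-tune on inflection.
pt_schedule = secondary + inflection

# Multi-Task Learning (MTL): interleave both tasks within the same training run.
mtl_schedule = [batch for pair in zip(secondary, inflection) for batch in pair]

print(pt_schedule)   # ['AE-1', 'AE-2', 'AE-3', 'INFL-1', 'INFL-2', 'INFL-3']
print(mtl_schedule)  # ['AE-1', 'INFL-1', 'AE-2', 'INFL-2', 'AE-3', 'INFL-3']
```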
But what are these “secondary tasks”? The authors specifically look at tasks that don’t require human annotation. They extract this data directly from the existing training set.
Task 1: Autoencoding (AE)
Autoencoding is conceptually simple. The model takes a word as input and is asked to predict the exact same word as output.
tried → tried
In the context of morphological inflection, this teaches the model the valid character sequences of the language. It forces the model to learn a representation of the word tried without worrying about the grammatical transformation yet.
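Because the unsupervised examples come straight from the existing training set, generating them is nearly a one-liner. In this sketch I copy the inflected target forms; whether the authors copy lemmas, targets, or both is a detail I am assuming.

```python
train_pairs = [
    ("cry", "cried"),
    ("try", "tried"),
    ("walk", "walked"),
]

# Autoencoding: every word is simply mapped to itself.
ae_examples = [(form, form) for _, form in train_pairs]
# [('cried', 'cried'), ('tried', 'tried'), ('walked', 'walked')]
```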
Task 2: Denoising (CMLM)
Denoising is the mechanism behind famous models like BERT and RoBERTa. The authors use Character-level Masked Language Modeling (CMLM). They take a word, corrupt it by masking or changing certain characters, and ask the model to recover the original word.
tr@e@ → tried
In this example, the @ symbol acts as a noise token. The model must infer that tr@e@ is actually tried. This is generally considered a harder task than autoencoding because it requires the model to understand the context of the surrounding characters to fill in the blanks.
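A corruption step along these lines can be sketched as follows. The masking rate and the policy of replacing characters with a single noise token are assumptions made for illustration; the paper’s exact noising procedure may differ.

```python
import random

def corrupt(word: str, noise_token: str = "@", rate: float = 0.4) -> str:
    """Randomly replace characters with a noise token."""
    return "".join(
        noise_token if random.random() < rate else ch
        for ch in word
    )

noisy = corrupt("tried")
print(noisy)  # something like "tr@e@"; the model is trained to recover "tried"
```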
The Objective Function
For the Multi-Task Learning (MTL) setup, the model has to balance learning the inflection task (turning “cry” into “cried”) and the unsupervised task (turning “tr@e@” into “tried”) at the same time.
The loss function used to train the model is a weighted sum of the losses from both tasks (a small numerical sketch follows the definitions below):

\[ \mathcal{L}(\theta) = \alpha \, l_1(\theta) + \beta \, l_2(\theta) \]
In this equation:
- \(\theta\) represents the model parameters.
- \(l_1\) is the loss for the unsupervised task (AE or CMLM).
- \(l_2\) is the loss for the main inflection task.
- \(\alpha\) and \(\beta\) are weights (set to 1 in this paper) that determine how important each task is.
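With both weights set to 1, the combined objective is simply the sum of the two losses. Here is a tiny sketch with made-up per-batch loss values, just to make the weighted sum concrete.

```python
alpha, beta = 1.0, 1.0  # the weights used in the paper

def combined_loss(l1: float, l2: float) -> float:
    """Weighted sum of the unsupervised loss (l1) and the inflection loss (l2)."""
    return alpha * l1 + beta * l2

# Illustrative values only: an AE batch loss of 0.8 and an inflection batch loss of 1.5.
print(combined_loss(0.8, 1.5))  # 2.3
```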
Experiments and Results
The authors set out to answer three main questions. Let’s look at what the data says.
RQ1: Does this help even without new data?
Usually, unsupervised tasks are useful because they allow us to use massive amounts of unlabelled external data. However, in this experiment, the authors generated the unsupervised data only from the existing training set.
The verdict: Yes.
Looking at the results table below, compare the “Baseline” column against “PT-AE” (Pretrained Autoencoding) or “MTL-AE” (Multi-Task Autoencoding).

Almost all methods outperformed the baseline. This is a crucial finding: You can improve performance in low-resource settings simply by asking your model to memorize or denoise the training data, without needing a single extra word of external data.
RQ2: Is Denoising (CMLM) better than Autoencoding (AE)?
Given the success of BERT, one might assume that Denoising (CMLM) is superior to simple Autoencoding.
The verdict: No, Autoencoding wins.
In the table above, MTL-AE (Multi-Task Autoencoding) achieves the highest average test accuracy (73.66), significantly beating MTL-CMLM (62.76). Even in the pretraining setup (PT-AE vs PT-CMLM), Autoencoding holds a slight edge.
This suggests that for character-level tasks with small datasets, the simpler objective of just copying the input might provide a better inductive bias than the complex task of reconstructing noisy inputs.
RQ3: Pretraining vs. Multi-Task Learning?
The verdict: MTL wins, but it’s risky.
Multi-Task Learning with Autoencoding (MTL-AE) was the best overall performer. However, Multi-Task Learning with Denoising (MTL-CMLM) was actually worse than the baseline.
This result is puzzling. Why would training on a denoising task hurt the model when done simultaneously with the main task?
The Deep Dive: Why Did Denoising Hurt MTL?
The authors hypothesized that the denoising task (tr@e@ -> tried) might be too different from the inflection task (cry + PST -> cried) when optimized on the exact same small vocabulary. The two objectives might be fighting each other, pulling the model’s parameters in conflicting directions.
To test this, they introduced External Data.
They sourced data from Universal Dependencies (UD)—a large collection of treebanks. They used this external text only for the secondary unsupervised task.
The hypothesis: If we use external data, the distribution of the unsupervised task will be different enough that it won’t conflict with the inflection task, or perhaps the added diversity will simply overpower the conflict.
Results with External Data

As shown in Table 3 above, adding external data ("-UD" columns) drastically changed the picture.
- MTL-CMLM-UD (using external data for denoising) jumped to 72.22 accuracy, a massive improvement over the 62.76 achieved without external data.
- It now comfortably beats the baseline.
This confirms that denoising is a powerful objective, but in a multi-task setup restricted to small datasets, it requires a broader data distribution to be effective.
Analyzing the Gradients
To understand why Autoencoding (AE) is stable while Denoising (CMLM) is volatile in the multi-task setup, the researchers analyzed the gradients during training.
Gradients tell us how much the model needs to update its weights to reduce error. Ideally, in a multi-task setup, we want the secondary task to provide stable, helpful updates while the main task is learning.

Figure 1 helps visualize this. These “violin plots” show the distribution of gradients for the secondary task during the early stages of training.
- MTL-AE (Left blue/purple): The distribution is tight and centered near zero. This implies stability. The autoencoding task doesn’t demand wild updates to the model weights.
- MTL-CMLM (Middle orange): The distribution is wider (longer tails). This indicates high variance. The denoising task is “noisy” in its optimization signal, which likely disrupts the learning of the main inflection task.
This analysis suggests that AE is “unreasonably effective” because it stabilizes the representation learning without interfering with the delicate process of learning morphological rules.
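As a rough idea of how such an analysis can be carried out (this is my own PyTorch sketch, not the authors’ code), one can compute the gradients contributed by the secondary-task loss alone and inspect their spread:

```python
import torch
import torch.nn as nn

# Stand-in model and an autoencoding-style reconstruction loss; the real model
# is a character-level sequence-to-sequence network.
model = nn.Linear(16, 16)
inputs = torch.randn(8, 16)
secondary_loss = nn.functional.mse_loss(model(inputs), inputs)

# Gradients of the secondary task only, without touching the main-task loss.
grads = torch.autograd.grad(secondary_loss, model.parameters())
flat = torch.cat([g.flatten() for g in grads])

# A tight, near-zero distribution suggests a stable auxiliary signal (as with AE);
# long tails suggest a noisier one (as observed for CMLM).
print(f"mean={flat.mean():.4f}, std={flat.std():.4f}")
```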
Conclusion and Implications
This research paper provides a practical playbook for students and practitioners working on low-resource languages or character-level tasks.
Here are the key takeaways:
- Don’t waste your data: Even if you can’t find more data, you can “reuse” your training data for unsupervised tasks like Autoencoding to boost performance.
- Keep it simple: Complex objectives like Masked Language Modeling (Denoising) aren’t always better. For character-level transduction, simple Autoencoding often yields better results.
- MTL is powerful but sensitive: Multi-Task Learning generally outperforms Pretraining, but you must choose your auxiliary task carefully. If the tasks conflict (like Denoising and Inflection on the same small data), performance can degrade.
- External data fixes conflicts: If you want to use a complex objective like Denoising, you likely need external data (like Universal Dependencies) to make it work in an MTL setup.
By understanding the interplay between training objectives and data availability, we can build robust models that preserve the linguistic richness of the world’s diverse languages, one character at a time.