There are approximately 7,000 languages spoken in the world today. Tragically, nearly half of them are considered endangered. While communities and linguists are working tirelessly to preserve and revitalize these languages, the process of language documentation is notoriously slow and labor-intensive.

Imagine you are a field linguist recording a story from an elder in an endangered language. You have the audio and the transcription. But to make that data useful for dictionaries, grammars, or teaching materials, you need to perform a task called Interlinear Glossing: analyzing the text morpheme by morpheme (morphemes are the smallest units of meaning) and assigning a grammatical label to each one. It is a task that requires deep expertise and an enormous amount of time.

In this post, we are doing a deep dive into GlossLM, a research paper that aims to accelerate this process using Natural Language Processing (NLP). The researchers have compiled the largest-ever corpus of Interlinear Glossed Text (IGT) and developed a massively multilingual model that can automate gloss generation, even for languages with very little data.

The Bottleneck: What is Interlinear Glossed Text?

Before we look at the solution, we must understand the data format. Interlinear Glossed Text (IGT) is the standard format used in linguistics to explain the morphosyntax of a language.

As shown in the image below, IGT typically consists of three lines:

  1. Transcription: The sentence in the source language.
  2. Gloss: The morphological analysis. This includes lexical glosses (a translation of each word's stem or root) and grammatical glosses (labels like PAST for past tense or 3PL for third person plural).
  3. Translation: A free translation in a major language like English.

Figure 1: Components of interlinear gloss with an Arapaho sentence and English translation. Blue boxes show transcriptions that are unsegmented (top) or segmented (bottom). Segmented text is split into morphemes which are aligned with the gloss labels shown in the green box.

The image above highlights a specific challenge: Segmentation. In the top blue box, the text is “unsegmented”—it’s just the natural sentence. In the bottom blue box, a linguist has manually broken the words down into morphemes (e.g., separating prefixes and suffixes with hyphens).
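To make the format concrete, here is a tiny illustration in Python using a simple Turkish word rather than the Arapaho example from the figure. The data structure and field names are my own, not the corpus schema:

```python
# A minimal illustration of the IGT format using a simple Turkish example
# (not from the paper's corpus): "evlerde" = "in the houses".
igt_example = {
    "transcription": "evlerde",      # unsegmented surface form
    "segmentation": "ev-ler-de",     # morphemes separated by hyphens
    "gloss": "house-PL-LOC",         # lexical gloss + grammatical labels
    "translation": "in the houses",  # free translation
}

# Each hyphen-separated morpheme aligns one-to-one with a gloss label.
morphemes = igt_example["segmentation"].split("-")
glosses = igt_example["gloss"].split("-")
for m, g in zip(morphemes, glosses):
    print(f"{m:>4s}  ->  {g}")
```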

Historically, automated systems have struggled because:

  • They often require the input text to be already segmented (which takes human time).
  • Data is scarce. Most endangered languages don’t have the millions of sentences needed to train standard Large Language Models (LLMs).
  • Glossing conventions vary wildly between different researchers.

Part 1: Building the GlossLM Corpus

The first major contribution of this research is not a model, but a dataset. To train a model that understands the general structure of glossing, the researchers needed data—lots of it.

Existing IGT data is scattered across PDF research papers, textbooks, and various small digital repositories. It is rarely standardized. The researchers undertook a massive effort to aggregate and clean this data, resulting in the GlossLM Corpus.

The Scale of the Data

The researchers combined data from six major sources, including ODIN (the Online Database of Interlinear Text) and several glossing shared-task datasets.

Table 1: Number of unique examples and languages in each source corpus for the GLOSSLM dataset.

As you can see in the table above, the final corpus contains over 450,000 examples covering 1,800 languages. This is the largest digitized collection of IGT to date.

The “Long Tail” Problem

However, sheer quantity does not mean balanced coverage. In NLP, we often deal with a “long tail” distribution: a few languages have a lot of data, while the vast majority have very little.

Figure 7: Counts per language. We only show languages with at least 2k samples present in the dataset. Arapaho is by far the most represented language.

The graph above illustrates this disparity. Arapaho (a language spoken by the Arapaho people of Wyoming and Oklahoma) dominates the dataset with nearly 100,000 examples. In contrast, 50% of the languages in the corpus have fewer than 10 examples. This imbalance makes it crucial to design a model that can transfer knowledge from high-resource languages to low-resource ones.
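For readers who want to inspect this kind of distribution themselves, a minimal sketch of counting examples per language follows. The ISO 639-3 codes are real, but the toy records and the "language" field name are assumptions, not the corpus's actual schema:

```python
# Toy sketch of inspecting the long tail: count examples per language code.
# The records and the "language" field name are illustrative only.
from collections import Counter

examples = [
    {"language": "arp", "transcription": "..."},  # Arapaho
    {"language": "git", "transcription": "..."},  # Gitksan
    {"language": "arp", "transcription": "..."},
]

counts = Counter(ex["language"] for ex in examples)
for lang, n in counts.most_common():
    print(lang, n)  # arp 2, git 1
```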

The Challenge of Standardization

One of the biggest headaches in processing linguistic data is inconsistency. One linguist might label a singular noun as SG, another as S, and another might use SING.

The researchers analyzed the frequency of unique gloss labels across the corpus and found a classic Zipfian distribution:

Figure 2: Distribution of unique glosses across all languages.

There are over 11,000 unique gloss labels, but the most common 200 labels account for over 82% of the data. To address this, the researchers attempted to normalize these top 200 labels to the UniMorph schema, a standardized set of morphological feature labels. For example, PAST, PST, and pret might all be mapped to a single standard tag. This normalization allows the model to recognize that a “past tense” marker in one language serves the same function as one in another language, facilitating cross-lingual learning.
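A minimal sketch of what such a normalization step might look like is below. The UniMorph tags PST, PL, and SG are real, but the specific variant-to-tag mapping and the helper function are illustrative assumptions, not the authors' actual mapping table:

```python
# Hypothetical fragment of a gloss-normalization table. The UniMorph tags
# (PST, PL, SG) are real; which variants map to them here is illustrative.
GLOSS_TO_UNIMORPH = {
    "PAST": "PST", "PST": "PST", "pret": "PST",  # past tense variants
    "PL": "PL", "PLUR": "PL",                    # plural variants
    "SG": "SG", "S": "SG", "SING": "SG",         # singular variants
}

def normalize_gloss_line(gloss_line: str) -> str:
    """Replace each grammatical label with its UniMorph equivalent,
    leaving lexical glosses (lowercase stems) untouched."""
    out = []
    for word in gloss_line.split():
        out.append("-".join(GLOSS_TO_UNIMORPH.get(m, m) for m in word.split("-")))
    return " ".join(out)

print(normalize_gloss_line("house-PLUR go-PAST"))  # -> "house-PL go-PST"
```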

Part 2: The GlossLM Model

With the data prepared, the researchers moved to the modeling phase. Their goal was to create a single system that could take an unsegmented sentence in a target language (plus a translation) and output the gloss line.

Architecture: Why ByT5?

The researchers chose ByT5, a byte-level variant of the T5 (Text-to-Text Transfer Transformer) architecture, as their base model.

Standard LLMs (like BERT or GPT) use tokenizers that break text into subwords, and these tokenizers are usually trained on English and other major languages. Applied to an Indigenous language with complex morphology (such as a polysynthetic language, where a single word can express a whole sentence), they break down: words get shattered into long runs of uninformative fragments or fall back to unknown “out of vocabulary” tokens.

ByT5, however, operates directly on bytes, the raw underlying digital representation of text. It needs no learned vocabulary: text is treated as a stream of UTF-8 bytes, which makes it robust across multilingual settings and diverse scripts.
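A quick way to see what “operating on bytes” means in practice: any string, in any script, reduces to a sequence of UTF-8 byte values, so there is nothing to be “out of vocabulary.” (In the Hugging Face ByT5 implementation, the actual token ids are these byte values shifted by a small offset reserved for special tokens; the sketch below just shows the raw bytes.)

```python
# No learned vocabulary: every string reduces to UTF-8 byte values, so rare
# morphemes, diacritics, and non-Latin scripts are all handled uniformly.
def to_bytes(text: str) -> list[int]:
    return list(text.encode("utf-8"))

print(to_bytes("houses"))   # [104, 111, 117, 115, 101, 115]
print(to_bytes("evlerde"))  # the Turkish example from earlier
print(to_bytes("ŋa"))       # non-ASCII characters expand to multiple bytes
```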

The Training Pipeline

The researchers employed a two-stage training process:

  1. Continual Pretraining: They took a standard ByT5 model (already trained on general text) and continued training it on the massive GlossLM corpus. This taught the model the general structure of IGT: how to segment words, how to align labels, and what glossing looks like across 1,800 languages.
  2. Finetuning: They then took this “gloss-aware” model and finetuned it on specific target languages to maximize performance.
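To make this text-to-text framing concrete, here is a hedged sketch of how such a model is queried. The prompt template is my own invention (the paper's exact input format may differ), and google/byt5-small stands in for the actual GlossLM checkpoints, so the output will not be a meaningful gloss until the model has been trained on IGT:

```python
# Sketch of the text-to-text glossing setup. The input template is an
# assumption; google/byt5-small is a stand-in for the GlossLM checkpoints.
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "google/byt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Hypothetical input: unsegmented transcription plus a free translation.
source = (
    "Gloss the following sentence.\n"
    "Transcription: evlerde\n"
    "Translation: in the houses"
)
inputs = tokenizer(source, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```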

Experiments and Results

The researchers evaluated their model on seven diverse languages. Crucially, they focused on the unsegmented setting (the “closed track”). This is the hardest version of the task, where the model must figure out where the morpheme boundaries are and label them simultaneously.

Does Multilingual Pretraining Work?

First, they checked if the massively multilingual model (before finetuning) could perform well “out of the box” compared to existing state-of-the-art (SOTA) systems.

Figure 3: Comparison of our pretrained model and the SOTA (Girrbach, 2023a) for in-domain languages on unsegmented data. Our model outperforms on all three languages.

The results in Figure 3 are promising. On languages that were present in the pretraining data (Arapaho, Tsez, Uspanteko), the pretrained GlossLM model (blue bars) outperformed the previous SOTA (orange bars) without any language-specific finetuning. This suggests the model successfully internalized the glossing logic during the pretraining phase.

Finetuning for State-of-the-Art Performance

Next, they finetuned the model on specific language datasets and compared it against several strong baselines, including Tü-CL (a specialized model using latent segmentation) and TOKEN-CLASS (a RoBERTa-based model).

Figure 4: Morpheme accuracy for various systems.

As shown in Figure 4, the finetuned GlossLM (blue bars) achieves the highest morpheme accuracy in 5 out of 7 languages.

  • Success Cases: It dominates in high-resource scenarios (Arapaho, arp) and performs very well in mid-resource scenarios.
  • The Challenge: For languages with extremely small training sets like Gitksan (git, roughly 70 examples), specialized models like Tü-CL that are architected specifically for segmentation still hold an edge. However, GlossLM remains highly competitive.
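Morpheme accuracy, the metric behind these comparisons, is essentially the fraction of predicted gloss labels that match the gold labels position by position. The alignment and edge-case handling in the sketch below are my assumptions, not the official evaluation script:

```python
# Rough sketch of morpheme-level accuracy: split gloss lines into words on
# whitespace and into morpheme labels on "-", then compare position by
# position. Length mismatches are penalized by dividing by the longer
# sequence; the official evaluation may handle this differently.
def morpheme_accuracy(predicted: str, gold: str) -> float:
    def labels(line: str) -> list[str]:
        return [m for word in line.split() for m in word.split("-")]
    pred, ref = labels(predicted), labels(gold)
    correct = sum(p == r for p, r in zip(pred, ref))
    return correct / max(len(pred), len(ref), 1)

print(morpheme_accuracy("house-PL go-PST", "house-PL go-FUT"))  # 0.75
```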

The “Low-Resource” Sweet Spot

The most significant finding of this paper is revealed when we look at how the model performs relative to the amount of available data.

The researchers compared their GlossLM (Pretrained + Finetuned) model against a standard ByT5 (Finetuned only) model. The difference? The GlossLM model had seen the 450k multilingual IGT examples first.

Figure 5: Performance after monolingual finetuning, comparing a standard pretrained ByT5 with a continually pretrained GlossLM model.

This graph tells the most important story:

  • Left Side (Low Data): Look at the languages on the left side of the x-axis (like Lezgi and Gitksan). The gap between the Blue line (GlossLM) and the Grey line (Standard ByT5) is massive. For Lezgi, pretraining boosted accuracy by over 15 percentage points.
  • Right Side (High Data): As we move to the right (Arapaho, arp), the lines converge. If you have tens of thousands of examples, the pretraining matters less because the model can learn enough from the specific data alone.

The Conclusion: Multilingual pretraining is a game-changer for low-resource languages. It allows the model to transfer its general knowledge of “how glossing works” to a new language where it might only see a few hundred examples.

Does Normalization Help?

Finally, the researchers asked: Did all that work normalizing labels to the UniMorph schema actually help?

Figure 6: Change in morpheme accuracy after normalizing glosses to the UniMorph schema and finetuning GlossLM.

The results, shown in Figure 6, were mixed.

  • Green Bars (Improvement): Normalization helped significantly for Lezgi (lez) and Nyangbo (nyb). These are unseen or low-resource languages. Standardizing labels likely helped the model bridge the gap between languages.
  • Bars Below Zero (Decline): For high-resource languages like Uspanteko (usp), normalization actually hurt performance. This is likely because normalization is “lossy”: it collapses language-specific nuances that a model with plenty of data could otherwise learn to predict accurately.

Conclusion and Implications

The GlossLM project represents a significant step forward in computational linguistics. By compiling the largest-ever IGT corpus and demonstrating the power of multilingual pretraining, the researchers have created a tool that can genuinely assist language documentation.

Key Takeaways:

  1. Data is King: Aggregating 450k examples enables models to learn the structure of linguistic annotation, which transfers across languages.
  2. Pretraining helps the “Have-Nots”: The biggest performance gains were seen in languages with the least amount of data, which describes the vast majority of endangered languages.
  3. No “Curse of Multilinguality”: Often, training on too many languages degrades performance on individual ones. GlossLM showed no signs of this, maintaining high accuracy across the board.

A Note on Ethics: The authors conclude with an essential reminder: these tools are designed to look over the shoulder of a linguist, not replace them. Language documentation is a deeply human endeavor involving culture, history, and community. GlossLM is a “copilot” that can handle the repetitive work of morphological tagging, freeing up linguists and community members to focus on the revitalization and usage of their languages.

The GlossLM model and dataset are publicly available on Hugging Face, opening the door for future research into automated translation and language preservation.