Introduction
Imagine being a linguist trying to document a language that only a few dozen people on Earth still speak. The clock is ticking: estimates suggest that up to 90% of the world’s languages are at risk of disappearing within the next century. Preserving them isn’t just about recording audio; it involves producing a painstaking annotation known as Interlinear Glossed Text (IGT), which requires transcribing speech, translating it, segmenting words into their smallest meaning-bearing units (morphemes), and tagging each one grammatically.
It is labor-intensive, slow, and demands considerable expertise. While machine learning has revolutionized translation for major languages like French or Chinese, it struggles with endangered languages because we simply don’t have enough data to train data-hungry neural networks.
But what if we could make our models work smarter, not harder?
In the paper “Multiple Sources are Better Than One,” researchers from the University of British Columbia propose a novel approach to the data scarcity problem. Instead of relying solely on the source text, they mimic how human linguists work: they look at translations, consult dictionaries, and leverage general linguistic knowledge. By integrating these external sources and tapping into the power of modern Large Language Models (LLMs), they achieved a massive leap in accuracy for low-resource language glossing.

As shown in Figure 1, the core idea is simple but powerful: don’t just ask the model to predict the gloss from the raw text. Give it the translation (“The dog barks”), give it a dictionary, and let an LLM help refine the answer.
Background: The Challenge of Automatic Glossing
To understand the contribution of this paper, we first need to understand the task. Glossing is the process of annotating a sentence morpheme-by-morpheme.
Take this example from the Gitksan language:
- Original: Ii hahla’lsdi’y goohl IBM
- Segmentation: ii hahla’lst-’y goo-hl IBM
- Gloss: CCNJ work-1SG.II LOC-CN IBM
- Translation: And I worked for IBM.
The goal of an automatic system is to take the Original line and produce the Gloss line. This is incredibly difficult because the model has to figure out where one morpheme ends and the next begins (segmentation) and then assign the correct meaning or grammatical tag to each part.
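To make the input and output concrete, here is a minimal Python sketch of how one IGT entry could be represented (the class and field names are illustrative, not from the paper); at test time the system only sees the transcription, plus the translation as auxiliary input, and must produce the gloss line:

```python
from dataclasses import dataclass

@dataclass
class IGTEntry:
    """One interlinear-glossed example (field names are illustrative)."""
    transcription: str   # raw source line the model receives
    segmentation: str    # morpheme-segmented form (not given at test time)
    gloss: str           # target line the model must produce
    translation: str     # free English translation, available as an extra hint

example = IGTEntry(
    transcription="Ii hahla'lsdi'y goohl IBM",
    segmentation="ii hahla'lst-'y goo-hl IBM",
    gloss="CCNJ work-1SG.II LOC-CN IBM",
    translation="And I worked for IBM.",
)

# The glossing task: map example.transcription (plus example.translation as a hint)
# to example.gloss, predicting one gloss per morpheme.
```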
The Data Bottleneck
Previous neural models for this task were trained almost exclusively on the source text. For a language like Lezgi, a researcher might only have 3,000 sentences of training data. In the world of Deep Learning, where models like GPT-4 are trained on trillions of tokens, 3,000 sentences is microscopic.
Standard models hit a ceiling because they cannot memorize enough patterns. If the model encounters a word stem it hasn’t seen before, it fails. The researchers realized that while glossed data is rare, translations are almost always available because documentation projects inherently involve translating the language into English or Spanish.
The Core Method: A Multi-Source Pipeline
The authors propose a pipeline that significantly upgrades the traditional neural glossing architecture. They build upon a baseline model (Girrbach, 2023) and introduce three key enhancements:
- Translation Encoders (feeding the English translation into the model).
- Character-Based Decoders (to handle unknown words).
- LLM Post-Correction (using GPT-4 or LLaMA as a final editor).
Let’s break these down.
1. The Baseline Architecture
The foundation of this work is the model by Girrbach (2023). It treats glossing as a two-step process: segmentation and classification.

As visualized in Figure 2 (left side), the baseline takes the transcript (e.g., “Les chiens”) and passes it through an LSTM Encoder. It then uses a mathematical method called the Forward-Backward algorithm to perform unsupervised segmentation—essentially calculating the probability that a specific character marks the end of a morpheme.
The model calculates marginal probabilities for morpheme boundaries with forward-backward recursions.
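In generic notation (the paper follows Girrbach, 2023, whose exact symbols may differ), let $s(i, j)$ be the model’s score for a morpheme spanning characters $i{+}1$ through $j$ of a word with $n$ characters. The recursions then take roughly this form:

$$
\alpha(0) = 1, \quad \alpha(j) = \sum_{i < j} \alpha(i)\, s(i, j), \qquad
\beta(n) = 1, \quad \beta(i) = \sum_{j > i} s(i, j)\, \beta(j),
$$

$$
P(\text{morpheme boundary after position } i) = \frac{\alpha(i)\,\beta(i)}{\alpha(n)}.
$$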

These equations allow the model to score every possible way to chop up the word and find the most likely segmentation without needing explicit segmentation labels during training. Once segmented, an MLP classifier predicts the gloss for each chunk.
2. Incorporating Translations (The “Encoder-Decoder” Upgrade)
The baseline works well for grammar tags (like “PL” for plural) because they repeat often. It fails on “lexical morphemes”—the stems of words (like “dog” or “run”)—because if the model hasn’t seen “dog” in the training data, it can’t predict it.
The researchers introduced a Translation Encoder. They take the English translation (e.g., “The dog barks”) and encode it using powerful pre-trained models like BERT or T5.

In Figure 3, you can see how this works. The model now has two inputs: the source transcript and the translation. The system uses an Attention Mechanism to connect them. When the model tries to gloss the Gitksan word for “work,” the attention mechanism “looks” at the word “worked” in the English translation vector. This provides a massive hint to the model, effectively allowing it to “cheat” by looking at the answer key provided in the translation.
Additionally, they replaced the simple classifier with a Character-Based Decoder. Instead of selecting from a fixed list of labels, the model generates the gloss character-by-character (e.g., generating “d”, “o”, “g”). This allows it to construct words it has never explicitly learned, provided it can copy relevant information from the translation.
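A minimal PyTorch-style sketch of this idea follows; module names, dimensions, and tensor shapes are assumptions for illustration, not the authors’ code. Morpheme representations from the source encoder act as attention queries over the (frozen) BERT/T5 embeddings of the English translation, and a character-level decoder then spells out each gloss conditioned on the attended context:

```python
import torch
import torch.nn as nn

class TranslationAwareGlossDecoder(nn.Module):
    """Illustrative sketch (not the authors' code): morpheme encodings attend over
    translation embeddings, and a character-level decoder generates each gloss
    instead of choosing from a closed label set."""

    def __init__(self, n_gloss_chars, d_model=256, d_translation=768):
        super().__init__()
        # Project frozen BERT/T5 translation embeddings into the model space.
        self.translation_proj = nn.Linear(d_translation, d_model)
        # Cross-attention: morpheme queries, translation keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # Character-level gloss decoder.
        self.gloss_char_embed = nn.Embedding(n_gloss_chars, d_model)
        self.decoder = nn.LSTM(2 * d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, n_gloss_chars)

    def forward(self, morpheme_states, translation_states, gloss_chars):
        # morpheme_states:    (B, M, d_model)       pooled morpheme encodings from the baseline
        # translation_states: (B, T, d_translation) from a pretrained BERT/T5 encoder
        # gloss_chars:        (B, M, L)             teacher-forced gloss characters
        trans = self.translation_proj(translation_states)                # (B, T, d_model)
        context, attn = self.cross_attn(morpheme_states, trans, trans)   # (B, M, d_model)

        B, M, L = gloss_chars.shape
        char_in = self.gloss_char_embed(gloss_chars)                     # (B, M, L, d_model)
        # Condition every decoding step on the attended morpheme representation.
        cond = context.unsqueeze(2).expand(-1, -1, L, -1)
        dec_in = torch.cat([char_in, cond], dim=-1).reshape(B * M, L, -1)
        dec_out, _ = self.decoder(dec_in)
        logits = self.out(dec_out).reshape(B, M, L, -1)  # character logits per morpheme
        # attn (B, M, T) is the gloss-to-translation attention, as visualized in Figure 6.
        return logits, attn
```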
3. LLM Post-Correction
Even with the translation encoder, neural networks sometimes hallucinate or make typos (e.g., outputting stoply instead of story). To fix this, the authors introduce a final step: In-Context Learning with LLMs.
They treat the output of their trained model as a “Silver Gloss”—a rough draft that is mostly correct but needs polishing. They then feed this draft to an LLM (like GPT-4) along with a prompt.

Figure 4 illustrates this prompting pipeline. The prompt includes:
- The source sentence.
- Word-for-word dictionary lookups (if available).
- The English sentence translation.
- The “Silver Gloss” (the draft).
- Instruction: “Correct the gloss.”
Crucially, they use Few-Shot Learning. They retrieve similar examples from the training set to show the LLM how to correct glosses. They tested several strategies for picking these examples:
- Random: Just pick any two training sentences.
- BERT-Similarity: Pick sentences that are semantically similar.
- Overlap: Pick sentences that share the most words.
This step acts as a highly intelligent spell-checker that understands the context of the language.
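A hedged sketch of this prompting step is shown below; the prompt wording, dictionary field, and the `call_llm` helper are illustrative stand-ins, not the paper’s exact template. Few-shot examples are retrieved by word overlap with the target sentence (the best-performing strategy) and prepended to the correction request:

```python
def word_overlap(a: str, b: str) -> int:
    """Count surface words shared by two sentences."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def select_examples(source: str, train_set: list[dict], k: int = 2) -> list[dict]:
    """Pick the k training sentences sharing the most words with the input
    (the 'Overlap' strategy; 'Random' or 'BERT-similarity' would swap this ranking)."""
    return sorted(train_set, key=lambda ex: word_overlap(source, ex["source"]),
                  reverse=True)[:k]

def build_prompt(source, translation, silver_gloss, dictionary_hits, examples):
    """Assemble the correction prompt from all available sources (wording illustrative)."""
    shots = "\n\n".join(
        f"Source: {ex['source']}\nTranslation: {ex['translation']}\n"
        f"Draft gloss: {ex['silver_gloss']}\nCorrected gloss: {ex['gold_gloss']}"
        for ex in examples
    )
    return (
        f"{shots}\n\n"
        f"Source: {source}\n"
        f"Dictionary entries: {dictionary_hits}\n"
        f"Translation: {translation}\n"
        f"Draft gloss: {silver_gloss}\n"
        f"Correct the gloss. Corrected gloss:"
    )

# corrected = call_llm(build_prompt(...))  # call_llm is a placeholder for GPT-4 / LLaMA
```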
Experiments and Results
The researchers tested their method on six languages from the 2023 SIGMORPHON Shared Task: Arapaho, Gitksan, Lezgi, Natügu, Tsez, and Uspanteko. These languages are typologically diverse and truly low-resource (some have fewer than 1,000 training sentences).
They also simulated an “Ultra-Low Resource” setting by restricting the training data to just 100 sentences, mimicking the very early stages of a language documentation project.
Quantitative Improvements
The results were compelling. The proposed method (incorporating BERT/T5 and the character decoder) outperformed the baseline across the board.

Looking at Table 2:
- Standard Setting (Left): The combined model (T5+attn+chr) achieved an average accuracy of 82.56%, nearly 4 percentage points higher than the baseline.
- Ultra-Low Setting (Right): The gap widens significantly. With only 100 sentences, the BERT+attn+chr model reached 42.04% accuracy compared to the baseline’s 32.26%. An improvement of roughly 10 percentage points is a game-changer for linguists just starting to document a language.
The Power of Prompting
The addition of the LLM post-correction provided a further boost. The authors found that using Overlapping Words to select examples for the prompt worked best.

As Tables 3 and 4 show, adding the prompting step (T5/BERT… + Prmpt) pushed accuracy even higher. For Gitksan, the lowest-resource language in the set (only ~30 training sentences!), the accuracy jumped from 21.09% (baseline) to over 30% with the full pipeline.
When they added external dictionaries to the prompt (Table 5 below), accuracy increased further still, proving that “more sources are indeed better.”

Learning Curves
One of the most striking visualizations in the paper is the learning curve for Arapaho.

Figure 5 reveals a critical insight: The prompt-based correction (blue/light blue lines) is vastly more effective when data is scarce. Look at the left side of the chart (100 sentences). The gap between the model with prompting and without is massive. As you move to the right (100% data), the gap closes. This confirms that LLMs act as a crucial safety net when the specialized model hasn’t seen enough data to generalize well.
Visualizing “Attention”
Does the model actually use the translations, or is it just a black box? The researchers visualized the attention weights to find out.

In this heatmap (Figure 6) for a Natügu sentence, the Y-axis lists the predicted glosses, and the X-axis lists the English translation words. The red squares indicate strong attention.
- Notice the intersection of “kill” (gloss) and “kills” (translation).
- Notice “people” (gloss) and “people” (translation).
The model is explicitly focusing on the relevant English word to generate the correct gloss. This confirms that the Translation Encoder is doing exactly what it was designed to do: bridging the semantic gap between the two languages.
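Reproducing this kind of diagnostic is straightforward once the cross-attention weights are exposed; here is a minimal matplotlib sketch (function and variable names are assumed, not from the paper) that renders the same gloss-by-translation heatmap:

```python
import matplotlib.pyplot as plt

def plot_gloss_attention(attn_weights, predicted_glosses, translation_tokens):
    """Heatmap of cross-attention: rows = predicted glosses, columns = translation words.
    attn_weights is an (n_glosses x n_translation_tokens) array of attention scores."""
    fig, ax = plt.subplots()
    im = ax.imshow(attn_weights, cmap="Reds")
    ax.set_xticks(range(len(translation_tokens)))
    ax.set_xticklabels(translation_tokens, rotation=90)
    ax.set_yticks(range(len(predicted_glosses)))
    ax.set_yticklabels(predicted_glosses)
    fig.colorbar(im, ax=ax, label="attention weight")
    fig.tight_layout()
    plt.show()
```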
Conclusion & Implications
The paper “Multiple Sources are Better Than One” demonstrates a pragmatic path forward for computational linguistics in low-resource settings. By moving away from a “source-text only” view and integrating translations, dictionaries, and LLMs, the authors achieved state-of-the-art results.
Key takeaways for students and researchers:
- Don’t ignore auxiliary data: If you have translations or dictionaries, build architectures that can consume them.
- Hybrid systems win: Combining a specialized, trained neural network (for segmentation and grammar) with a general-purpose LLM (for semantic correction) offers the best of both worlds.
- Hope for endangered languages: The massive improvements in the ultra-low resource setting (100 sentences) suggest that AI can be a genuine aid to linguists in the early, critical stages of language preservation.
This work highlights that in the era of massive AI models, the solution to specific, small-data problems often lies in creatively connecting large general knowledge bases with specialized local data.