Introduction
Imagine typing the following sentence into a translation engine: “The doctor asked the nurse to help her in the procedure.”
If you translate this into a language with grammatical gender—like Spanish, German, or Hebrew—the model has to make a choice. Is the doctor male or female? Is the nurse male or female? Historically, Natural Language Processing (NLP) models have relied heavily on stereotypes found in their training data. As a result, they frequently translate “doctor” as male and “nurse” as female, even when the sentence explicitly uses the pronoun “her” to refer to the doctor.
This phenomenon is a form of extrinsic bias—bias that manifests in the final output of a downstream task. To fix this, researchers have developed various methods to “clean” the internal mathematics of the model, known as intrinsic debiasing. The logic seems sound: if we remove gender information from the model’s internal word representations (embeddings), the model should stop making gendered assumptions in its output.
But does it actually work that way?
In the paper Applying Intrinsic Debiasing on Downstream Tasks: Challenges and Considerations for Machine Translation, researchers from the Hebrew University of Jerusalem and the Allen Institute for AI investigate this exact connection. They explore whether cleaning up the internal geometry of a model actually leads to fairer translations. Their findings reveal that the relationship is far from simple—it depends heavily on where you debias, which words you target, and what language you are translating into.
Background: The Two Types of Bias
To understand the researchers’ approach, we first need to distinguish between the two ways bias is measured in NLP.
- Intrinsic Bias: This looks at the model’s internal representations (word embeddings). In a biased vector space, the mathematical distance between “doctor” and “man” is much smaller than between “doctor” and “woman.” Intrinsic debiasing methods try to force these distances to be equal.
- Extrinsic Bias: This looks at the model’s behavior in a real-world task, such as Machine Translation (MT). If the model consistently misgenders professionals based on stereotypes, it exhibits extrinsic bias.
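To make the intrinsic notion concrete, here is a quick probe you can run on off-the-shelf word vectors. It is a minimal sketch assuming gensim and its downloadable GloVe vectors; these are not the embeddings studied in the paper, but they show the kind of distance comparison that intrinsic bias measures rely on.

```python
# Probe intrinsic bias in off-the-shelf word embeddings.
# Assumes gensim and its downloader; the GloVe vectors are purely
# illustrative and are NOT the NMT embeddings studied in the paper.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # ~130 MB download on first use

for profession in ["doctor", "nurse", "engineer", "receptionist"]:
    sim_man = vectors.similarity(profession, "man")
    sim_woman = vectors.similarity(profession, "woman")
    print(f"{profession:>14}: sim(man)={sim_man:.3f}  sim(woman)={sim_woman:.3f}")

# On typical pretrained embeddings, stereotypically male professions tend to sit
# closer to "man" and stereotypically female ones closer to "woman" --
# the geometric gap that intrinsic debiasing tries to close.
```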
There is a growing gap in the literature: many new methods are proposed to fix intrinsic bias, but they are rarely rigorously tested on complex downstream tasks like translation.
The Toolkit: Intrinsic Debiasing Methods
The researchers tested three prominent methods for removing gender information from vector spaces:
- Hard-Debiasing: A method that uses Principal Component Analysis (PCA) to identify a “gender direction” in the vector space and projects it out of each word vector (a toy version of this step is sketched after this list). In the paper’s framing it is a non-linear method, and it is not exhaustive: some gender information can remain after the projection.
- INLP (Iterative Null-space Projection): A method that trains a classifier to predict gender from the vectors, then removes the information used by that classifier. It repeats this process iteratively.
- LEACE: A newer, closed-form method that mathematically prevents any linear classifier from detecting the protected attribute (in this case, gender).
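To ground the shared intuition behind these methods, here is a toy, self-contained sketch of the direction-removal step at the heart of Hard-Debiasing, using random stand-in vectors and scikit-learn’s PCA. It only illustrates the core projection; the full method (and the paper’s setup) works on real embedding tables and includes additional neutralize/equalize machinery.

```python
# Toy sketch of the Hard-Debiasing idea: estimate a "gender direction"
# from definitional pairs via PCA, then project it out of target vectors.
# Vectors here are random stand-ins, NOT real NMT embeddings; the actual
# method (Bolukbasi et al., 2016) also equalizes explicitly gendered pairs.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
dim = 50
emb = {w: rng.normal(size=dim) for w in
       ["he", "she", "man", "woman", "doctor", "nurse"]}

# 1. Center each definitional pair and take the top principal component
#    of the centered vectors as the gender direction.
pairs = [("he", "she"), ("man", "woman")]
centered = []
for a, b in pairs:
    mu = (emb[a] + emb[b]) / 2
    centered += [emb[a] - mu, emb[b] - mu]
gender_dir = PCA(n_components=1).fit(np.stack(centered)).components_[0]
gender_dir /= np.linalg.norm(gender_dir)

# 2. Remove the component along that direction from "neutral" words.
def debias(v, direction):
    return v - np.dot(v, direction) * direction

for word in ["doctor", "nurse"]:
    emb[word] = debias(emb[word], gender_dir)
    # After projection, the vector is orthogonal to the gender direction.
    assert abs(np.dot(emb[word], gender_dir)) < 1e-8
```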
The Core Method: Integrating Debiasing into MT
The researchers integrated these intrinsic debiasing methods into a standard Transformer-based Neural Machine Translation (NMT) architecture. Their goal was to see how different design choices impacted the final translation quality and fairness.
They identified three major design challenges that arise when moving from simple word vectors to a complex translation system.

As illustrated in Figure 1, the researchers systematically manipulated three variables:
1. The Architecture Location (Where?)
A Transformer model isn’t just one big block of numbers; it has distinct stages. The researchers experimented with applying debiasing at three specific points:
- Encoder Input: The embeddings of the source language (English) before they are processed.
- Decoder Input: The embeddings of the target language fed into the decoder.
- Decoder Output: The final representation just before the model predicts the next word (the softmax layer).
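To visualize where these hooks sit, here is a purely schematic sketch of how a precomputed linear eraser could be applied at each of the three locations. The attribute names (`src_embed`, `encoder`, `output_proj`, and so on) are hypothetical placeholders for a generic PyTorch encoder-decoder model; this is an illustration of the design choice, not the paper’s actual implementation.

```python
# Schematic of the three insertion points, assuming a generic PyTorch
# encoder-decoder NMT model and a precomputed eraser matrix P.
# All model attributes are hypothetical; real integration depends on the codebase.
import torch

def erase(h, P):
    """Apply a debiasing projection to a batch of hidden states.
    P is (d_model, d_model), e.g. I - v v^T for a single direction;
    LEACE-style erasure would also add an affine shift (omitted here)."""
    return h @ P

def translate_step(model, src_ids, tgt_ids, P, location):
    src_emb = model.src_embed(src_ids)          # (batch, src_len, d_model)
    if location == "encoder_input":
        src_emb = erase(src_emb, P)             # debias source-side embeddings
    memory = model.encoder(src_emb)

    tgt_emb = model.tgt_embed(tgt_ids)          # (batch, tgt_len, d_model)
    if location == "decoder_input":
        tgt_emb = erase(tgt_emb, P)             # debias target-side embeddings
    dec_out = model.decoder(tgt_emb, memory)    # (batch, tgt_len, d_model)

    if location == "decoder_output":
        dec_out = erase(dec_out, P)             # debias just before the softmax
    return model.output_proj(dec_out)           # vocabulary logits
```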
2. The Tokenization Mismatch (What?)
This is a subtle but critical challenge. Intrinsic debiasing methods are usually designed for whole words (e.g., “nurse”, “doctor”). However, modern NMT models use sub-word tokenization to handle rare words. A word like “receptionist” might be broken into tokens like re, ception, and ist.
If we have a debiasing algorithm meant for the word “receptionist,” how do we apply it to ception? The researchers tested three strategies:
- All-tokens: Debias every single token in the vocabulary.
- N-token-profession: Debias the tokens that make up a list of professions, even if the word is split into multiple pieces.
- 1-token-profession: Debias only the specific professions that appear as a single, whole token in the model’s vocabulary.
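As a rough illustration of the 1-token-profession selection, the following sketch checks which profession words survive as a single subword token under an off-the-shelf Hugging Face tokenizer. The tokenizer name is an arbitrary example, not necessarily the one used in the paper.

```python
# Check which profession words remain a single subword token --
# the selection behind the "1-token-profession" strategy.
# The tokenizer below is just an example English-German NMT tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
professions = ["doctor", "nurse", "engineer", "receptionist", "clerk"]

single_token, multi_token = [], []
for word in professions:
    pieces = tokenizer.tokenize(word)
    (single_token if len(pieces) == 1 else multi_token).append((word, pieces))

print("1-token professions   :", single_token)
print("multi-token professions:", multi_token)
```

Under the N-token-profession strategy, the multi-token entries above would also be debiased, piece by piece, while the all-tokens strategy ignores the profession list entirely.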
3. Target Language (Who?)
Different languages handle gender differently. The team tested translations from English into Hebrew, German, and Russian. All three have grammatical gender, but they differ significantly in their morphology (structure of words) and alphabets.
Experiments and Results
To evaluate performance, the researchers used two primary metrics:
- Accuracy: Using the WinoMT dataset, they measured how often the model correctly identified the gender of a professional (e.g., translating “doctor” as female when the sentence implies she is female).
- BLEU: A standard metric for translation quality. We want to reduce bias without destroying the model’s ability to translate correctly.
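For orientation, here is a minimal sketch of what such an evaluation loop could look like, assuming sacrebleu for BLEU and a hypothetical `extract_gender` helper standing in for WinoMT’s morphology-based gender extraction.

```python
# Toy evaluation loop in the spirit of the paper's two metrics:
# gender accuracy over WinoMT-style examples and corpus BLEU via sacrebleu.
# The data structures and the extract_gender helper are hypothetical placeholders.
import sacrebleu

def gender_accuracy(examples, extract_gender):
    """examples: list of (translation, gold_gender) pairs.
    extract_gender: callable mapping a translation to 'male'/'female'/None,
    which in practice requires target-language morphological analysis."""
    correct = sum(extract_gender(hyp) == gold for hyp, gold in examples)
    return correct / len(examples)

def translation_quality(hypotheses, references):
    """Corpus BLEU; sacrebleu expects a list of reference streams."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```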
The datasets used for the different languages are detailed below:

Finding 1: The “Whole Word” Matters
One of the most interesting findings concerned the tokenization problem. You might assume that debiasing more tokens is better. However, the results showed the opposite.

As shown in Table 2, the 1-token-profession strategy consistently outperformed the others. This suggests that gender information is semantically strong in complete words. When a word is shattered into sub-word tokens (like ception or ist), those fragments may not carry the “gender direction” in a way that intrinsic debiasing methods can detect or fix. Attempting to debias these fragments likely adds noise without removing bias.
Finding 2: Location Depends on the Method
There is no single “best” place to insert the debiasing block. It depends entirely on which mathematical method you are using.

Table 3 highlights a crucial divergence:
- Hard-Debiasing worked best at the very beginning (Encoder Input). This method is non-linear.
- INLP and LEACE worked best at the very end (Decoder Output). These methods are linear.
This makes architectural sense. INLP and LEACE are designed to remove linear dependencies, while the Transformer architecture is highly non-linear. If you apply a linear “fix” at the beginning, the subsequent layers can re-introduce or scramble that information. Applied at the Decoder Output, just before the final prediction, these linear methods can effectively “guard” the output. Conversely, Hard-Debiasing removes a PCA-derived subspace up front, which appears to give the encoder a cleaner representation to build on from the start.
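A tiny synthetic demo (not from the paper) illustrates why the position of a linear eraser matters: an attribute that no linear probe can detect in one representation can become trivially recoverable after a single non-linear transformation, which is exactly the kind of processing a stack of Transformer layers applies on top of the encoder input.

```python
# Purely synthetic illustration: a protected attribute invisible to any linear
# probe can be recovered after a simple non-linear transformation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(4000, 2))
gender = (x[:, 0] * x[:, 1] > 0).astype(int)     # encoded non-linearly

linear_probe = LogisticRegression().fit(x, gender)
print("linear probe accuracy    :", linear_probe.score(x, gender))  # ~0.5 (chance)

nonlinear_feats = (x[:, 0] * x[:, 1]).reshape(-1, 1)                # one "layer" later
probe_after = LogisticRegression().fit(nonlinear_feats, gender)
print("probe after non-linearity:", probe_after.score(nonlinear_feats, gender))  # ~1.0
```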
Finding 3: The Accuracy vs. Quality Trade-off
Does making a model fairer make it a worse translator? The researchers plotted the improvement in gender accuracy against the change in BLEU score (translation quality).

Figure 2 reveals the trade-offs:
- INLP (center graph) shows a significant drop in BLEU scores (the blue bars go deep into the negative). It removes bias, but at a cost to the overall translation quality.
- LEACE (right graph) and Hard-Debiasing (left graph) are much “safer.” They improve gender accuracy (orange bars) while maintaining BLEU scores that are close to the baseline.
This suggests that LEACE and Hard-Debiasing are more precise—they surgically remove gender information without deleting other semantic information necessary for translation.
Finding 4: Language Morphology is a Barrier
Referring back to Table 2, we see that debiasing was successful for German (DE) and Hebrew (HE), but barely moved the needle for Russian (RU).
Why? The authors attribute this to Russian’s rich morphology. Russian words change form (inflect) heavily based on their grammatical case. This increases the vocabulary size significantly, meaning that fewer professions appear as “single tokens” in the tokenizer. Since the researchers found that debiasing is most effective on single tokens (Finding 1), languages that split words up more frequently are harder to debias using these methods.
Conclusion and Implications
This paper serves as a reality check for the field of AI fairness. It demonstrates that we cannot simply “plug and play” intrinsic debiasing methods into complex systems and expect them to work.
The key takeaways for students and practitioners are:
- Don’t ignore the tokenizer: Debiasing works best on whole words. If your model chops words into pieces, standard debiasing might fail.
- Match method to architecture: Linear debiasing methods (like LEACE) belong at the end of the network; geometric projections (like Hard-Debiasing) often work better at the start.
- One size does not fit all languages: A method that fixes bias in German might fail in Russian due to morphological differences.
Ultimately, intrinsic debiasing is a promising tool, but it requires careful tuning and system-aware integration to successfully mitigate bias in the real world. Future work must look beyond the embeddings and consider the entire pipeline—from the tokenizer to the final output layer.