Imagine you are using a real-time translation app. You speak into the microphone: “I was born in London.” You are a woman. The app translates your sentence into French.
In English, the sentence is neutral. But in French, grammar demands a choice. If the speaker is female, it should be “Je suis née à Londres.” If the speaker is male, it is “Je suis né à Londres.”
How does the AI decide? In text-to-text translation, the system has no clue; it usually guesses (often defaulting to the masculine form). But in Speech Translation (ST), the model has access to your voice. Ideally, the AI should “hear” the acoustic features associated with your voice, encode that information, and use it to select the correct grammatical gender.
A fascinating new paper, *Different Speech Translation Models Encode and Translate Speaker Gender Differently*, dives deep into this exact mechanism. The researchers uncover a surprising trend: while older, traditional models are quite good at “hearing” gender, the newest, state-of-the-art architectures are essentially “deaf” to these cues, leading to a significant masculine bias in translation.
In this post, we will break down their methodology, their innovative use of “probing” to peek inside neural networks, and why this matters for the future of fair AI.
The Problem: Notional vs. Grammatical Gender
Before we inspect the neural networks, we need to understand the linguistic challenge. The researchers focused on translating from English (a language with “notional” gender) into French, Italian, and Spanish (languages with “grammatical” gender).
In English, nouns and adjectives referring to the speaker rarely change based on gender. “I am happy” is the same for everyone. In Romance languages, however, agreement is mandatory.
- Italian: Sono felice (“I am happy,” identical for all speakers) vs. Sono stanco (Masculine) / Sono stanca (Feminine) for “I am tired.”
- French: Je suis prêt (M) / Je suis prête (F).
If the context doesn’t explicitly state the gender (e.g., “I am a woman who…”), the translator must rely on the audio signal. The core question of this paper is: Do current Speech Translation models actually use the audio signal to determine gender, or are they just guessing?
The Suspects: Two Types of Architectures
The researchers compared two distinct families of Speech Translation models.
1. The Traditional Encoder-Decoder (Enc-Dec)
This is the classic “End-to-End” architecture. It consists of a Speech Encoder (which processes the audio) and a Text Decoder (which generates the translation). These components are trained together from scratch specifically for the task of speech translation. The model used in the paper is a standard Transformer-based architecture.
2. The Modern “Speech + MT” (Adapter-based)
This is the new wave of high-performance models, such as SeamlessM4T and ZeroSwot. These architectures are like Frankenstein’s monster, but in a good way. They take a very powerful, pre-trained Speech Encoder (like wav2vec 2.0 or w2v-BERT) and stitch it to a very powerful, pre-trained Machine Translation (MT) model (like NLLB).
To make these two giant, pre-trained brains talk to each other, they use a component called an Adapter. The Adapter’s job is to compress the long, frame-level speech representations and map them into the embedding space the MT model expects, so the decoder sees something that looks like text embeddings.
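As a rough sketch (not the actual SeamlessM4T or ZeroSwot code), such an adapter often combines length compression with a projection into the MT model’s embedding space. The module layout, names, and dimensions below are hypothetical illustrations of the idea.

```python
import torch
import torch.nn as nn

class LengthAdapter(nn.Module):
    """Illustrative adapter: compresses a long speech-encoder sequence and
    projects it into an MT model's embedding space.
    (Hypothetical sketch, not the SeamlessM4T/ZeroSwot implementation.)"""

    def __init__(self, speech_dim=1024, text_dim=1024, stride=4):
        super().__init__()
        # Strided 1D convolution shortens the sequence (speech frames greatly outnumber text tokens)
        self.compress = nn.Conv1d(speech_dim, speech_dim, kernel_size=stride, stride=stride)
        # Linear map into the MT model's embedding space
        self.project = nn.Linear(speech_dim, text_dim)

    def forward(self, speech_states):           # (batch, time, speech_dim)
        x = speech_states.transpose(1, 2)       # (batch, speech_dim, time)
        x = self.compress(x).transpose(1, 2)    # (batch, time // stride, speech_dim)
        return self.project(x)                  # (batch, time // stride, text_dim)
```

Anything that does not help the MT decoder produce text, such as acoustic cues about the speaker, is at risk of being squeezed out at exactly this point.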
The Hypothesis: The researchers suspected that while the pre-trained speech encoders are powerful, the Adapter might be acting as a filter, stripping away “non-textual” information—like the speaker’s voice characteristics—before the decoder ever sees it.
Methodology: Probing the Neural Brain
How do we know what a neural network “knows”? We can’t ask it. We have to perform neurosurgery. This technique is called Probing.
A probe is a small, simple classifier (usually a logistic regression or a small neural net) that sits on top of the frozen hidden states of the main model. If the probe can look at the hidden states and accurately predict the speaker’s gender, we know that gender information is encoded in those states.
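In code, a basic probe can be as small as a logistic regression fit on frozen, mean-pooled hidden states. The sketch below is a minimal illustration, assuming the hidden states and gender labels have already been extracted; the array names and shapes are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# hidden_states: (n_utterances, time, dim) extracted from the frozen ST model
# genders:       (n_utterances,) speaker gender labels (0 = "He", 1 = "She")
def mean_pooling_probe(hidden_states, genders, train_idx, test_idx):
    pooled = hidden_states.mean(axis=1)               # collapse time: (n_utterances, dim)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(pooled[train_idx], genders[train_idx])  # train only the probe
    preds = probe.predict(pooled[test_idx])
    # A high F1 means gender is linearly recoverable from the frozen states
    return f1_score(genders[test_idx], preds, average="macro")
```

Only the probe is trained; the ST model stays frozen, so whatever accuracy the probe reaches reflects information that was already sitting in the representations.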
The Innovation: Attention-Based Probing
Previous research usually took the sequence of hidden states (which represents the audio over time) and averaged them (Mean Pooling) to get a single vector. The authors of this paper argue that averaging dilutes the signal. Instead, they designed an Attention-Based Probe.
Inspired by architectures like the Q-Former, this probe uses a learnable “query” vector. It scans the entire sequence of hidden states and learns to pay attention only to the specific moments where gender information is strongest.

Figure 2 above visualizes how this probe works. The graphs show the “attention weights”—essentially, where the model is looking.
- (a) Test-Generic: Looking at general speech.
- (b) Test-Speaker: Looking at sentences where the speaker refers to themselves.
Notice the spikes on the left side of the graphs. This indicates that gender information is heavily concentrated at the very beginning of the sentence. The model decides the gender almost immediately. The probe learns to focus on these early timestamps and ignore the rest. This makes the probing much more accurate than simple averaging.
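Below is a minimal PyTorch sketch of such a probe: a single learnable query attends over the frozen hidden states, and the attention-pooled summary feeds a small classification head. The exact layer sizes and class names are assumptions for illustration, not the paper’s code.

```python
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    """Illustrative attention-based probe: a learnable query pools the
    frozen hidden states, then a linear head predicts speaker gender."""

    def __init__(self, dim, n_classes=2):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))    # learnable query vector
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, hidden_states):                         # (batch, time, dim)
        q = self.query.expand(hidden_states.size(0), -1, -1)  # one query per utterance
        pooled, weights = self.attn(q, hidden_states, hidden_states)
        # `weights` shows where the probe looks (e.g., the start of the utterance)
        return self.head(pooled.squeeze(1)), weights
```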
Results: Who Encodes Gender?
The researchers tested the models on English-to-Spanish (en-es), French (en-fr), and Italian (en-it). They probed different parts of the models:
- Enc-Dec: The encoder output.
- Seamless / ZeroSwot (Pre-Adapter): The raw output of the speech encoder.
- Seamless / ZeroSwot (Post-Adapter): The output after it has been processed to look like text.
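In practice, representations like these can be captured with PyTorch forward hooks during a normal forward pass. The sketch below is generic; the attribute names `speech_encoder` and `adapter` are hypothetical placeholders, not the real module paths inside SeamlessM4T or ZeroSwot.

```python
import torch

def capture_activations(model, modules):
    """Register forward hooks on the given sub-modules and return a dict
    that fills with their outputs on the next forward pass."""
    captured, handles = {}, []
    for name, module in modules.items():
        def hook(_mod, _inputs, output, key=name):
            captured[key] = output.detach()
        handles.append(module.register_forward_hook(hook))
    return captured, handles

# Hypothetical usage -- attribute names depend on the actual model class:
# captured, handles = capture_activations(st_model, {
#     "pre_adapter":  st_model.speech_encoder,   # raw speech-encoder output
#     "post_adapter": st_model.adapter,          # after length/embedding adaptation
# })
# st_model(**batch)                 # normal forward pass fills `captured`
# for h in handles: h.remove()
```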
Here is what they found:

Table 1 Analysis: Look at the Enc-Dec row at the bottom. The scores are incredibly high (over 90% F1). The traditional model retains a massive amount of gender information in its representations.
Now look at Seamless and ZeroSwot.
- Pre-Adapter: They encode gender reasonably well (ZeroSwot is better here), though not as intensely as the Enc-Dec.
- Post-Adapter: This is the critical finding. Look at the drop in scores. For Seamless, the F1 score plummets to roughly 54-59%. The recall for “She” drops significantly.
The verdict: The Adapter acts as a bottleneck. In the effort to compress speech into text-like embeddings, the adapter discards the “voice” of the speaker, effectively scrubbing gender information from the signal before the translation even begins.
The Consequence: Translation Accuracy
Does “scrubbing” gender information matter? Some might argue that removing biometric data is good for privacy. However, in translation, it leads to errors.
The researchers evaluated the actual translation output using the MuST-SHE dataset, which is specifically designed to test gender agreement (e.g., checking if “I was born” is translated with the correct feminine or masculine ending).

Table 2 Analysis:
- COMET: This measures general translation quality. ZeroSwot and Seamless actually have higher general quality scores than the traditional Enc-Dec. They are better translators overall.
- Acc. (Accuracy): This measures gender correctness. Here, the tables turn. The Enc-Dec model, which had lower general quality, achieves the highest gender accuracy (avg 85.57%). Seamless, which scrubbed the gender info, performs abysmally on gender accuracy (avg 53.35%—barely better than a coin toss).
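To make the accuracy column concrete, here is a deliberately simplified sketch of MuST-SHE-style scoring: for each annotated gender-marked word, check whether the output contains the correct form or its wrong-gender counterpart. The data structure and function are illustrative only, not the official evaluation script.

```python
def gender_accuracy(examples):
    """examples: list of dicts with a translation hypothesis plus the
    correct and wrong-gender forms annotated for that sentence
    (a simplified stand-in for MuST-SHE-style annotations)."""
    correct = evaluated = 0
    for ex in examples:
        hyp = ex["hypothesis"].lower().split()
        for right, wrong in ex["gender_terms"]:     # e.g. ("née", "né")
            if right in hyp or wrong in hyp:        # only score terms the model produced
                evaluated += 1
                correct += right in hyp
    return correct / evaluated if evaluated else 0.0

example = {
    "hypothesis": "je suis née à Londres",
    "gender_terms": [("née", "né")],
}
print(gender_accuracy([example]))   # 1.0 -> feminine form generated correctly
```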
The Link Between Encoding and Translation
The paper provides compelling evidence that a model’s ability to translate gender is directly tied to its ability to encode gender internally.

Figure 1 Analysis: This scatter plot is striking.
- The X-axis is the Probing F1 score (how well the model “knows” the gender internally).
- The Y-axis is the Translation Accuracy (how correctly it translates the sentence).
- The correlation is almost perfect (\(R^2 = 0.99\)).
The logic is undeniable: If the model deletes the gender information in the adapter layers (low X-axis), it fails to generate the correct grammatical gender in the output (low Y-axis).
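For readers who want to reproduce this kind of check on their own numbers, \(R^2\) for a simple linear fit takes only a few lines of numpy. The data points below are made-up placeholders, not the paper’s results.

```python
import numpy as np

# Hypothetical placeholder points: (probing F1, gender translation accuracy)
probe_f1   = np.array([0.55, 0.62, 0.78, 0.91, 0.95])
gender_acc = np.array([0.52, 0.58, 0.71, 0.83, 0.87])

slope, intercept = np.polyfit(probe_f1, gender_acc, deg=1)   # least-squares line
predicted = slope * probe_f1 + intercept
ss_res = np.sum((gender_acc - predicted) ** 2)               # residual sum of squares
ss_tot = np.sum((gender_acc - gender_acc.mean()) ** 2)       # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))   # close to 1.0 -> encoding strongly predicts translation
```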
The Masculine Default
When the model loses the acoustic information, what does it do? It falls back on the biases present in its training text data. In most text corpora, the masculine form is the default or “neutral” form.
Consequently, the models that “scrub” gender info (Seamless and ZeroSwot) exhibit a massive Masculine Default Bias. They almost always render first-person references to the speaker with masculine agreement.
However, the researchers found that even when the model does encode the gender correctly, it can still fail. They analyzed “Confusion Matrices” to see where the errors happen.

Figure 4 & Table 6 Analysis: Figure 4 shows cases where the Probe was correct (the model knew the gender), but the translation was wrong. Notice the bottom-left cells in the matrices (e.g., 33, 33, 38). These represent cases where the Probe said “She” (Correct), but the Translation said “He” (Wrong).
Why does this happen? Table 6 gives us a qualitative clue. Look at example (d):
- Source: “My main sport was soccer, and I was a goalkeeper…”
- Output: “…j’étais un gardien [Masculine]…”
Even though the speaker is female, and the model likely detected a female voice (encoded in the states), the strong semantic association between “soccer/goalkeeper” and “men” in the training data overrode the acoustic signal. The linguistic bias overpowered the acoustic reality.
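This kind of analysis boils down to cross-tabulating what the probe predicted against the gender actually realized in the translation, restricted to sentences where the speaker is known to be female. A minimal sketch, assuming per-sentence labels are already available:

```python
from collections import Counter

def probe_vs_translation(probe_preds, translation_genders):
    """Cross-tabulate what the probe 'heard' against what the decoder wrote.
    Both inputs are lists of "She"/"He" labels for the same sentences."""
    cells = Counter(zip(probe_preds, translation_genders))
    for probe_label in ("She", "He"):
        for trans_label in ("She", "He"):
            n = cells[(probe_label, trans_label)]
            print(f"probe={probe_label:>3}  translation={trans_label:>3}  count={n}")

# The interesting cell is probe="She", translation="He": the model encoded the
# speaker's gender correctly but still defaulted to masculine agreement.
```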
Why This Matters
This paper highlights a critical tension in modern AI development. We are moving toward “Foundation Models”—massive, general-purpose systems like SeamlessM4T that can do everything. These models rely on adapters to bridge different modalities (speech, text, image).
However, this architecture seems to introduce a regression in specific nuances. By compressing speech to look like text, we lose the richness of the voice. For a female speaker translating “I am tired” into Italian, the difference between “Sono stanca” and “Sono stanco” is not just a grammatical error; it’s a misgendering of the user.
Key Takeaways:
- Architecture Matters: Newer isn’t always better for every sub-task. Old-school Encoder-Decoder models are better at preserving speaker identity than Adapter-based models.
- The Adapter Bottleneck: The current method of adapting speech to text embeddings acts as a filter that removes gender cues.
- Fairness Requires Information: “Scrubbing” gender, whether intentional or accidental, leads to unfairness. To produce unbiased translations for female speakers, the model needs access to the information that the speaker is female.
- Probing Works: The attention-based probing method proved to be a reliable proxy for predicting how the model will behave on the actual translation task.
As we build the next generation of universal translators, we must ensure that in our quest for efficiency and modularity, we don’t silence the user’s actual voice.