Language is a living, breathing entity. It shifts with every generation, adapting to reflect the values and identities of the people who speak it. In recent years, one of the most significant linguistic evolutions—particularly in grammatically gendered languages like German—has been the rise of Gender-Fair Language (GFL).
In German, the traditional “generic masculine” (using masculine plural forms to refer to mixed groups) is increasingly being replaced by forms that explicitly include women and non-binary individuals. You might see Studenten (students, masculine) become Studentinnen und Studenten (students, feminine and masculine) or Student*innen (using a “gender star” to include everyone).
While this shift promotes societal inclusion, it poses a fascinating technical question: How do our current AI models handle this change?
Most language models (LMs) are trained on massive datasets scraped from the internet over the last decade or more. These datasets are dominated by older, traditional language patterns. So, when a modern German speaker uses a gender-inclusive form, does the AI stumble? Does it misinterpret the sentiment? Does it flag a neutral sentence as toxic simply because it uses a new grammatical structure?
To answer these questions, researchers from TU Darmstadt and partnering institutions created Lou, a groundbreaking dataset designed to stress-test German text classification models against gender-fair language.
The Challenge of German Gender
To understand the problem, we first need to look at the specific linguistic hurdles in German. Unlike English, where nouns are largely gender-neutral (a “baker” is a baker regardless of gender), German is heavily gendered.
- Der Bäcker (The male baker)
- Die Bäckerin (The female baker)
For a long time, the plural die Bäcker was used to refer to all bakers. Gender-fair language seeks to break this default. The researchers identified six distinct strategies currently in use, ranging from explicit binary inclusion to complete neutralization.

As shown in Figure 1, a simple sentence like “The consumers must be well supported” can be rewritten in drastically different ways:
- Doppelnennung (Binary Inclusion): Explicitly naming both (Konsumentinnen und Konsumenten).
- GenderStern / Gap / Doppelpunkt: Using characters like *, _, or : inside the word (Konsument*innen). This is intended to visually represent the spectrum of gender identities.
- Neutralization: Using a participle or abstract noun (konsumierende Zielgruppe, “consuming target group”).
- De-e (Neosystem): A proposed system introducing entirely new pronouns and suffixes so that gender markers disappear altogether (Konsumenterne). A short sketch applying all six strategies follows below.
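To make the strategies concrete, here is a minimal sketch in Python that applies each of the six strategies to the running example Konsumenten (“consumers”). The underscore and colon forms for Gap and Doppelpunkt follow the standard conventions; treat the mapping as illustrative rather than exhaustive.

```python
# Minimal sketch: the six strategies applied to the running example
# "Konsumenten" (consumers). The Gap and Doppelpunkt forms follow the
# standard underscore/colon conventions; the rest mirror the examples above.
GFL_STRATEGIES = {
    "Doppelnennung": "Konsumentinnen und Konsumenten",  # explicit binary inclusion
    "GenderStern":   "Konsument*innen",                 # gender star
    "Gap":           "Konsument_innen",                 # underscore gap
    "Doppelpunkt":   "Konsument:innen",                 # colon
    "Neutral":       "konsumierende Zielgruppe",        # neutralization via participle
    "De-e":          "Konsumenterne",                   # proposed neosystem suffix
}

original = "Konsumenten"
for strategy, form in GFL_STRATEGIES.items():
    print(f"{strategy:>13}: {original} -> {form}")
```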
The researchers wanted to know: If we take a sentence labeled as “Negative Sentiment” or “Toxic,” and we rewrite it using these strategies, does the AI still recognize the label?
Building Lou: Humans vs. Machines
One might assume that creating a dataset like this is easy—just ask ChatGPT to rewrite the sentences, right? The authors found that it wasn’t so simple.
They aimed to create high-quality parallel data. They took existing German datasets for tasks like Stance Detection (is the author for or against a topic?), Toxicity Detection, and Hate Speech Detection. They then set out to reformulate specific instances that contained gendered terms.
The “Amateur” Problem
The researchers conducted a study comparing “amateur” annotators (native speakers with moderate experience in GFL) against professional linguists. The results were telling. Amateurs struggled significantly to apply these linguistic rules consistently.

As Figure 9 illustrates, the error rates for amateurs were high, particularly when dealing with complex strategies like the GenderStern (star). The errors weren’t just typos; they included:
- Numerus Errors: Messing up singular vs. plural agreement.
- Pronoun Errors: Forgetting to adapt pronouns (e.g., changing “his” to a neutral form).
- Over-complication: Creating convoluted sentences when trying to be neutral.
This finding is crucial for the field: Crowdsourcing gender-fair data is unreliable. To build Lou, the researchers had to rely on a multi-stage process involving professional proofreaders to ensure the data was linguistically sound. The result is the Lou Dataset: 3,600 reformulated instances across seven classification tasks.
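To picture what “parallel data” means here, the sketch below defines a hypothetical record layout for one instance: the original sentence, its gold label (which never changes), and one reformulation per strategy. The field names and the example sentence are mine, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ParallelInstance:
    """Hypothetical layout for one Lou-style parallel instance.
    Field names are illustrative, not the dataset's actual columns."""
    task: str                   # e.g. "stance", "toxicity", "hate speech"
    label: str                  # gold label, kept fixed across all rewrites
    original: str               # sentence using the generic masculine
    reformulations: dict[str, str] = field(default_factory=dict)  # strategy -> rewritten text

example = ParallelInstance(
    task="toxicity",
    label="non-toxic",
    original="Die Konsumenten müssen gut unterstützt werden.",
    reformulations={
        "GenderStern": "Die Konsument*innen müssen gut unterstützt werden.",
        "Neutral": "Die konsumierende Zielgruppe muss gut unterstützt werden.",
    },
)
```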

The Experiment: Stress-Testing the Models
With the dataset in hand, the researchers evaluated 16 different Language Models. These included:
- German Specialists: Models like GBERT and GELECTRA, trained specifically on German text.
- Multi-Lingual Giants: Models like XLM-R and mBERT.
- English Models: Models like RoBERTa (used as a baseline to see if models just look at surface-level tokens).
- Instruction-Tuned LLMs: Modern generative models like Llama-3 and GPT-4.
They tested the models in two ways: Fine-Tuning (training the model on the task) and In-Context Learning (asking the model to solve the task via a prompt without training).
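As a rough illustration of the two setups, the sketch below contrasts a fine-tuned German encoder classifier with a prompt for in-context learning. It assumes a Hugging Face-style pipeline; the model name, prompt wording, and example sentences are placeholders, not the paper's exact configuration.

```python
# A rough sketch of the two evaluation modes. Model name, prompt wording,
# and example sentences are placeholders; in practice the classifier would
# be a checkpoint already fine-tuned on the task.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# --- Mode 1: fine-tuning (shown here as inference with a fine-tuned classifier) ---
clf_name = "deepset/gbert-base"  # German-specific encoder (GBERT)
tokenizer = AutoTokenizer.from_pretrained(clf_name)
classifier = AutoModelForSequenceClassification.from_pretrained(clf_name, num_labels=2)

def predict(text: str) -> int:
    """Return the predicted class index for one sentence."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = classifier(**inputs).logits
    return int(logits.argmax(dim=-1).item())

# --- Mode 2: in-context learning (prompting a generative LLM; the API call is elided) ---
def build_prompt(text: str) -> str:
    """Illustrative zero-shot prompt for a toxicity decision in German."""
    return (
        "Entscheide, ob der folgende Kommentar toxisch ist. "
        "Antworte nur mit 'toxisch' oder 'nicht toxisch'.\n"
        f"Kommentar: {text}\nAntwort:"
    )

original = "Die Studenten beschweren sich ständig."
reformulated = "Die Student*innen beschweren sich ständig."
print(predict(original), predict(reformulated))  # a robust model should agree on both
print(build_prompt(reformulated))
```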
Key Findings: Does GFL Break the Models?
The results revealed a complex landscape. It’s not as simple as “GFL makes performance worse.” In fact, the impact varies wildly depending on the strategy used and the difficulty of the task.
1. The Performance Paradox
Surprisingly, using gender-fair language doesn’t always hurt performance. In some cases, it actually improved the macro F1 score (the F1 score, which balances precision and recall, averaged equally across classes).

Looking at Figure 3, the blue bubbles represent performance improvements, while red bubbles represent degradation.
- The “Star” Effect: Strategies like GenderStern (using the *) often led to slight improvements (blue bubbles).
- The Cost of Neutrality: Strategies that required heavy rewriting, like Neutral or the De-e system, often hurt performance (red bubbles).
Why? The researchers hypothesize that inclusive strategies (like the star) usually keep the root of the word intact. The model recognizes the word “Student” inside “Student*innen.” However, neutral strategies often replace the word entirely (e.g., changing “Doctors” to “Medical Professionals”), which shifts the semantic context enough to confuse the model.
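This root-preservation hypothesis is easy to probe with a subword tokenizer. The snippet below is a sketch assuming a German BERT-style tokenizer; the exact subword splits depend on the model's vocabulary, so the comments only describe the expected pattern.

```python
from transformers import AutoTokenizer

# Assumed German encoder; the exact subword splits depend on its vocabulary.
tokenizer = AutoTokenizer.from_pretrained("deepset/gbert-base")

examples = {
    "generic masculine": "Ärzte",                       # "doctors"
    "gender star":       "Ärzt*innen",                  # root "Ärzt-" kept intact
    "neutralization":    "medizinisches Fachpersonal",  # "medical professionals"
}
for strategy, text in examples.items():
    print(f"{strategy:>17}: {tokenizer.tokenize(text)}")

# Expected pattern (not guaranteed for every vocabulary): the star form still
# yields a subword close to the original root, while the neutral rewrite
# shares no subwords with it, shifting the semantic context the model sees.
```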
2. The “Label Flip” Phenomenon
While the average scores might look stable, looking at individual predictions tells a scarier story. The researchers measured Label Flips—how often a model changes its answer just because the gender formulation changed.
Imagine an AI reviewing a comment for toxicity.
- Original: “Die Ärzte sind inkompetent.” (“The doctors are incompetent.”) -> Model says: Toxic.
- Reformulated: “Die Ärzt*innen sind inkompetent.” (the same sentence, now with a gender star) -> Model says: Not Toxic.
This is a “flip,” and it happened surprisingly often.

Figure 4 highlights that up to 10.9% of predictions flipped in some tasks. This is significant. It implies that a content moderation system might let hate speech slip through, or flag innocent comments, purely based on whether the writer used a gender-inclusive ending.
Crucially, Detox tasks (detecting toxicity and hate speech) were the most volatile. These tasks are already difficult for models, and the added complexity of GFL seems to push them over the edge.
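Measuring a label flip requires nothing more than paired predictions on the original and reformulated versions of each instance. A minimal sketch (function and variable names are my own, not the paper's code):

```python
def flip_rate(preds_original: list[int], preds_reformulated: list[int]) -> float:
    """Share of paired instances whose predicted label changes after the
    gender-fair rewrite. Names are illustrative, not the paper's code."""
    assert len(preds_original) == len(preds_reformulated)
    flips = sum(o != r for o, r in zip(preds_original, preds_reformulated))
    return flips / len(preds_original)

# Toy example: 2 of 8 paired predictions change, i.e. a 25% flip rate.
print(flip_rate([1, 0, 1, 1, 0, 0, 1, 0],
                [1, 0, 0, 1, 0, 1, 1, 0]))
```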
3. Why Does the Model Flip? (Uncertainty and Attention)
To understand why these flips occur, the researchers looked “under the hood” at the model’s internal states. They found that GFL changes how the model attends to words and lowers its confidence.
When a model encounters a reformulated sentence, two things happen:
- Certainty Drops: The model is less sure of its prediction.
- Attention Spikes: The model’s “attention mechanism” (how it focuses on different words) becomes erratic. It stares harder at the unfamiliar tokens (like *innen) and loses track of the broader context.

Figure 7 visualizes this relationship. The Prediction Certainty chart (top right) shows that instances with lower certainty are much more likely to flip (the higher bars on the right). Essentially, if a model was already borderline on a decision, the introduction of gender-fair syntax is the “straw that breaks the camel’s back,” causing it to panic and switch its label.
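Prediction certainty here can be read as the probability the classifier assigns to its own top class. A minimal sketch of that calculation, assuming access to the classifier's logits:

```python
import torch
import torch.nn.functional as F

def prediction_certainty(logits: torch.Tensor) -> tuple[int, float]:
    """Return the predicted class and the softmax probability assigned to it.
    The lower this certainty on the original sentence, the more likely the
    label flips once the sentence is rewritten in gender-fair form."""
    probs = F.softmax(logits, dim=-1)
    certainty, label = probs.max(dim=-1)
    return int(label), float(certainty)

# Toy logits: a confident prediction vs. a borderline, flip-prone one.
print(prediction_certainty(torch.tensor([4.0, -1.0])))  # high certainty
print(prediction_certainty(torch.tensor([0.2, 0.1])))   # near 50/50
```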
4. The Syntactic Barrier
The researchers also analyzed the “embeddings”—the mathematical representation of words inside the model. They found that the impact of GFL is most visible in the lower layers of the model.
In deep learning, lower layers usually handle syntax and grammar, while higher layers handle meaning and semantics. The fact that GFL disrupts the lower layers suggests that models view this primarily as a syntactic violation. They are getting stuck on the grammar of the * or the : rather than understanding the inclusive meaning behind it.

Figure 10 visualizes how distinguishable the different strategies are across the model’s layers. In the early layers (left side of the x-axis), the lines are spread out—the model sees “Studenten” and “Student*innen” as very different mathematical objects. As the data moves to higher layers (right side), the lines converge. The model eventually figures out they mean roughly the same thing, but the initial syntactic confusion leaves a mark on the final prediction.
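One way to reproduce this layer-wise picture is to compare the hidden states of an original and a reformulated sentence at every layer. The sketch below assumes a Hugging Face encoder and uses mean-pooled cosine similarity as a simple proxy; it is a simplification of the paper's embedding analysis, not a reimplementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed German encoder; the analysis itself is model-agnostic.
name = "deepset/gbert-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

def layerwise_similarity(a: str, b: str) -> list[float]:
    """Cosine similarity of mean-pooled hidden states at every layer,
    a rough proxy for how 'different' two formulations look to the model."""
    with torch.no_grad():
        hidden_a = model(**tokenizer(a, return_tensors="pt")).hidden_states
        hidden_b = model(**tokenizer(b, return_tensors="pt")).hidden_states
    sims = []
    for layer_a, layer_b in zip(hidden_a, hidden_b):
        va, vb = layer_a.mean(dim=1), layer_b.mean(dim=1)  # average over tokens
        sims.append(float(torch.cosine_similarity(va, vb)))
    return sims

# Expectation following Figure 10: similarity is lowest in the early
# (syntactic) layers and converges in the later (semantic) layers.
print(layerwise_similarity("Die Studenten lernen viel.",
                           "Die Student*innen lernen viel."))
```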
Implications: Is Evaluation Broken?
With all these flips and fluctuations, you might wonder if our current benchmarks for German AI are invalid. If Model A beats Model B on standard German, does it still win on Gender-Fair German?
The good news is yes. The researchers found that the ranking of models remains consistent. A model that is smart enough to handle standard German well is usually also robust enough to handle GFL better than a weaker model. We don’t need to throw away all previous leaderboards, but we do need to be aware of the “noise” GFL introduces.
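Checking whether a leaderboard survives the shift boils down to a rank correlation between the two evaluation conditions. A minimal sketch using Spearman's rho, with made-up scores purely for illustration:

```python
from scipy.stats import spearmanr

# Made-up F1 scores purely for illustration: the same five models scored
# on the original test sets and on their gender-fair reformulations.
models      = ["GBERT", "GELECTRA", "XLM-R", "mBERT", "RoBERTa"]
f1_standard = [0.82, 0.80, 0.76, 0.71, 0.55]
f1_gfl      = [0.80, 0.79, 0.74, 0.69, 0.52]

rho, p = spearmanr(f1_standard, f1_gfl)
print(f"Spearman rho = {rho:.2f}")  # rho close to 1.0 means the ranking is preserved
```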
Conclusion
The Lou Dataset provides the first systematic look at how the shift toward gender-inclusive language impacts German AI. The takeaways are a mix of caution and optimism:
- Language is moving faster than AI: Models trained on historical data struggle with the syntactic novelty of gender-fair forms.
- The “Butterfly Effect” is real: A tiny change, like adding a colon inside a word, can flip a toxicity prediction. This is a safety risk for deployment in the real world.
- Specialization matters: German-specific models generally handled these nuances better than broad multilingual models, but even they are not immune.
As society continues to adopt more inclusive language, our NLP pipelines must adapt. We cannot treat Gender-Fair Language as “noise” or “errors.” It is the new norm. Datasets like Lou are the first step toward teaching our machines to understand not just the grammar of the past, but the values of the present.