Introduction
Language is rarely neutral. When we write or speak about different social groups—whether defined by nationality, race, or gender—we often rely on subtle associations that frame how those groups are perceived. These associations are what we call social stereotypes.
For years, natural language processing (NLP) researchers have tried to quantify these biases. Early attempts were groundbreaking but somewhat blunt, often relying on static word embeddings to show, for example, that “man” is to “computer programmer” as “woman” is to “homemaker.” While useful for identifying broad societal biases, these methods struggle with nuance. A social group like “Canadians” or “Chinese” isn’t stereotyped in the exact same way across every context. The stereotypes applied in a political discussion differ vastly from those in a discussion about sports or economics.
Furthermore, traditional methods suffer from an “identity” problem. If you try to analyze the context around a word like “Russia,” the semantic meaning of the word “Russia” itself is so strong that it often overpowers the subtle framing of the surrounding sentence.
In this deep dive, we will explore the research paper “ADAPTIVE AXES: A Pipeline for In-domain Social Stereotype Analysis.” This paper proposes a sophisticated new pipeline that combines state-of-the-art text embedding models with Large Language Models (LLMs) to capture these slippery, domain-specific stereotypes. By masking the target identity and focusing on the context, the researchers have found a way to measure not just what we say about social groups, but how the very structure of our sentences frames them.
Background: The Evolution of Bias Detection
To understand why “Adaptive Axes” is a significant step forward, we first need to look at how researchers have traditionally measured bias.
The Word Embedding Era
The most common approach has been to use Semantic Axes within word embedding spaces (like Word2Vec or GloVe). Imagine a 3D space where every word is a point. Words with similar meanings are close together.
To find bias, researchers define a “semantic axis” by picking two opposing poles, such as Good vs. Bad or Peaceful vs. Violent. They then project a social group’s vector (e.g., the point for “Country X”) onto this line. If the point lands closer to “Violent,” the model has captured a stereotype.
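To make this concrete, here is a minimal sketch of such a projection with static vectors, using gensim's downloadable GloVe model; the pole word lists are illustrative choices, not the exact seed lexicons used in the literature.

```python
import numpy as np
import gensim.downloader as api

# Download and load pretrained GloVe vectors (an illustrative choice of static embedding).
vectors = api.load("glove-wiki-gigaword-100")

def axis_vector(positive_pole, negative_pole):
    """Define a semantic axis as the difference between the mean vectors of the two poles."""
    pos = np.mean([vectors[w] for w in positive_pole], axis=0)
    neg = np.mean([vectors[w] for w in negative_pole], axis=0)
    return pos - neg

def project(word, axis):
    """Cosine similarity between a word vector and the axis direction:
    positive values lean toward the positive pole, negative toward the negative pole."""
    v = vectors[word]
    return float(np.dot(v, axis) / (np.linalg.norm(v) * np.linalg.norm(axis)))

peaceful_vs_violent = axis_vector(["peaceful", "calm"], ["violent", "aggressive"])
print(project("war", peaceful_vs_violent))      # expected: negative (violent side)
print(project("harmony", peaceful_vs_violent))  # expected: positive (peaceful side)
```

The sign of the projection indicates which pole a word leans toward, and its magnitude indicates how strongly.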
The Problem: Context and Identity
While elegant, this approach has two major flaws:
- Static Representations: It assumes the stereotype for a group is constant. It doesn’t account for domains. A group might be framed positively in Art but negatively in Politics.
- The Identity Trap: When using modern contextual models (like BERT), the embedding for a token (like “French”) is heavily influenced by the word “French” itself. The researchers note that the identity of the word dominates its contextual representation. This makes it hard to see how the surrounding text is actually doing the framing.
The “Adaptive Axes” pipeline addresses these issues by shifting the focus from the word to the masked context and by using LLMs to create cleaner, more specific semantic axes.
The Core Method: ADAPTIVE AXES
The proposed pipeline is a multi-step process designed to refine how we measure stereotypes. It moves away from analyzing single words and instead analyzes the “shape” of the conversation surrounding an entity.
As shown in the architecture diagram below, the pipeline consists of three main phases: refining semantic axes, augmenting them with LLMs, and finally embedding the masked contexts.

Let’s break down these distinct components.
1. Enhancing Semantic Axes
A semantic axis is only as good as the words used to define it. If you define a “Beauty” axis using the words “beautiful” vs. “ugly,” that’s a start. But if you rely on older databases like WordNet, you might find that the list of synonyms for “ugly” includes irrelevant or rare words like “psychogenic” or “noetic” (which appear in WordNet’s antonyms for physical beauty). These “dirty” lists add noise to the measurement.
Pruning with LLMs: The researchers employ LLMs (like GPT-4) to “prune” these axes. They feed the LLM a raw list of synonyms and antonyms and ask it to remove words that don’t fit the core semantic contrast.

By using the prompt above, the system ensures that the poles of the axis (e.g., Left Pole vs. Right Pole) are semantically tight and relevant.
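The paper's exact prompt isn't reproduced here; the snippet below is a minimal sketch of such a pruning call using the OpenAI Python client, where the prompt wording and the `gpt-4o` model name are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def prune_pole(axis_name, pole_label, candidate_words):
    """Ask an LLM to drop candidate words that do not fit the pole's core meaning.
    The prompt wording here is illustrative, not the paper's exact prompt."""
    prompt = (
        f"The semantic axis contrasts two poles of '{axis_name}'. "
        f"Candidate words for the '{pole_label}' pole: {', '.join(candidate_words)}. "
        "Return only the words that clearly express this pole, as a comma-separated list."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return [w.strip() for w in response.choices[0].message.content.split(",")]

clean_ugly_pole = prune_pole("physical beauty", "ugly",
                             ["hideous", "unsightly", "noetic", "psychogenic"])
```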
Augmenting with Domain-Specific Axes: Standard axes like Good-Bad are too generic for specific domains like Economics or Politics. To capture in-domain stereotypes, the pipeline generates new axes. For example, in a “Military” domain, relevant axes might be Peaceful Protests vs. Military Intervention. In “Politics,” it might be Transparency vs. Opaqueness.

This generation step allows the analysis to scale to any topic, creating bespoke measuring sticks for any conversation.
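A hedged sketch of this generation step might look as follows; the prompt, the JSON output format, and the model name are again assumptions for illustration rather than the paper's exact setup.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_domain_axes(domain, n_axes=5):
    """Ask an LLM to propose opposing pole pairs tailored to a domain.
    Prompt wording, output format, and model name are illustrative assumptions."""
    prompt = (
        f"Propose {n_axes} pairs of opposing concepts that are useful for analyzing how "
        f"social groups are framed in news about {domain}. "
        'Respond with JSON only: a list of objects like {"left": "...", "right": "..."}.'
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

economics_axes = generate_domain_axes("Economics")
# e.g. [{"left": "Free Trade", "right": "Trade Barriers"}, ...]
```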
2. The Power of Text Embedding Models
In the past, researchers projected word embeddings onto axes. This paper argues for using Text Embedding Models (sentence encoders like UAE-large-v1 or Mistral-7B-based embeddings). These models are trained to understand the meaning of full sentences or phrases, making them much better at capturing the “vibe” of a context than a single word vector.
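For illustration, here is how a sentence encoder can compare a full context against multi-word pole descriptions; `all-MiniLM-L6-v2` is a small, readily available stand-in for the larger UAE-large-v1 or Mistral-based encoders named in the paper.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Small encoder as an illustrative stand-in for the larger models used in the paper.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Text embedding models can encode multi-word pole descriptions, not just single tokens.
poles = ["peaceful protests", "military intervention"]
context = "Troops were deployed to disperse the crowd by force."

pole_vecs = model.encode(poles)
ctx_vec = model.encode(context)

print(cos_sim(ctx_vec, pole_vecs))  # the context should sit closer to "military intervention"
```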
3. Masking: The “Ghost” in the Sentence
This is arguably the most innovative part of the pipeline. To solve the “Identity Trap” mentioned earlier, the researchers mask the target entity.
If the sentence is:
“The Russian military initiated a blockade.”
The model sees:
“The [MASK] military initiated a blockade.”
The pipeline embeds this masked sentence and compares the resulting “context embedding” against the semantic axes. If the context embedding aligns closely with the Aggression axis, we know the sentence frames its subject as aggressive, regardless of which country fills the blank. By aggregating thousands of masked contexts where “Russia” used to be, the model reveals the cumulative stereotype associated with that group purely through the contexts it appears in.
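A minimal sketch of this masking-and-projection step, assuming axes are built by averaging pole word embeddings (a simplification of the paper's axis construction) and using a small encoder as a stand-in:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative stand-in encoder

def axis_embedding(pole_a_words, pole_b_words):
    """Axis direction: mean embedding of one pole minus mean embedding of the other."""
    return model.encode(pole_a_words).mean(axis=0) - model.encode(pole_b_words).mean(axis=0)

def score_context(sentence, entity, axis, mask_token="[MASK]"):
    """Mask the target entity, embed the remaining context, and project it onto the axis."""
    masked = sentence.replace(entity, mask_token)
    return float(cos_sim(model.encode(masked), axis))

aggression_axis = axis_embedding(["aggressive", "hostile"], ["peaceful", "cooperative"])
print(score_context("The Russian military initiated a blockade.", "Russian", aggression_axis))
```

Averaging these per-sentence scores over all masked contexts for a group yields its aggregate position on the axis.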
Experiments and Validation
Does this actually work better than previous methods? The researchers conducted extensive validation to prove that text embedding models handle semantic axes better and that the resulting stereotypes align with human intuition.
Validating the Axes
First, they checked if modern text embedding models could distinguish between the two poles of a semantic axis better than older models (like GloVe or standard BERT).
They used a “consistency metric” (\(C\)). Briefly, they hold out one word from a semantic pole and check whether the model places it closer to its original pole or to the opposing pole.

As shown in the table above, modern models like SFR-Embedding-Mistral and UAE-large-v1 achieve higher consistency scores and a higher number of consistent axes compared to GloVe or standard BERT. This confirms that these models effectively understand the semantic contrast between concepts like “Democracy” and “Authoritarianism.”
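To make the idea concrete, here is a rough leave-one-out approximation of such a consistency check; it is not the paper's exact metric \(C\), and the encoder and word lists are illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative stand-in encoder

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def consistency(pole_a_words, pole_b_words):
    """Leave-one-out check: for each held-out word, is it closer to the centroid of its
    own pole (computed without it) than to the centroid of the opposite pole?"""
    emb_a, emb_b = model.encode(pole_a_words), model.encode(pole_b_words)
    hits, total = 0, 0
    for own, other in ((emb_a, emb_b), (emb_b, emb_a)):
        other_centroid = other.mean(axis=0)
        for i in range(len(own)):
            own_centroid = np.delete(own, i, axis=0).mean(axis=0)
            hits += cosine(own[i], own_centroid) > cosine(own[i], other_centroid)
            total += 1
    return hits / total

print(consistency(["democracy", "freedom", "elections"],
                  ["authoritarianism", "dictatorship", "repression"]))
```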
Validating In-Domain Specificity
One of the main claims of the paper is that domain-specific axes are necessary because generic axes miss the point. To prove this, they looked at the variance of the embeddings along different axes.
The logic is simple: If an axis is meaningful for a specific domain, the data points should spread out along it (high variance). If the axis is irrelevant, all the data points will clump in the middle (low variance).
They measured the alignment of the domain's context embeddings (\(\mathbf{E}_{\mathrm{domain}}\)) with each axis (\(\mathbf{A}_{\mathrm{s/g}}\)) using cosine similarity:
\[ \cos(\theta) = \frac{\mathbf{E}_{\mathrm{domain}} \cdot \mathbf{A}_{\mathrm{s/g}}}{\|\mathbf{E}_{\mathrm{domain}}\| \, \|\mathbf{A}_{\mathrm{s/g}}\|} \]
They then calculated the variance of these similarity scores along each axis:
\[ \operatorname{Var}(X) = E\left[(X - \mu)^2\right] = E[X^2] - (E[X])^2 \]
The results were compelling:

The table shows that the generated domain-specific axes (like Trade Barriers in economics) consistently ranked in the top 10% for variance. This means these axes are capturing real, significant differences in how entities are described in those domains.
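In code, the variance check behind this ranking can be sketched as follows; the random vectors merely stand in for real masked-context embeddings, and the dimensionality is arbitrary.

```python
import numpy as np

def axis_variance(context_embeddings, axis):
    """Cosine-project each context embedding onto an axis, then compute the variance of
    the resulting scores. High variance suggests the axis is meaningful for the domain."""
    E = np.asarray(context_embeddings)
    sims = (E @ axis) / (np.linalg.norm(E, axis=1) * np.linalg.norm(axis))
    return sims.var()

# Toy example with random vectors standing in for masked-context embeddings.
rng = np.random.default_rng(0)
contexts = rng.normal(size=(1000, 384))
axis = rng.normal(size=384)
print(axis_variance(contexts, axis))
```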
Human Evaluation
Finally, the ultimate test: do humans agree with the model? The researchers set up a study where human annotators ranked the output of three methods:
- Random Baseline
- Token-based Embedding (The old BERT method)
- ADAPTIVE AXES (The new masked context method)
Annotators read sentences about countries (like China or Canada) and judged which set of keywords best described the “social impression” of the sentence.

Lower numbers are better (ranking 1st is best). Adaptive Axes achieved an average ranking of 1.675, significantly outperforming the token-based approach (1.925) and the random baseline. This indicates that looking at the masked context aligns much better with how humans perceive framing than looking at the word tokens alone.
Case Study: Country Stereotypes in US News
The researchers applied their pipeline to a massive dataset: the “News on the Web” (NOW) corpus, specifically looking at US news articles from 2010 to 2023. They analyzed how four countries—China, Russia, Germany, and Canada—were framed across three domains: Politics, Economics, and Culture.
The amount of data available for each country and domain was substantial, helping ensure the results were statistically robust:

General Stereotypes
The results, summarized in the table below, reveal distinct in-domain profiles for each country.

Notice the nuance here:
- China: In Economics, the context is associated with “overseas,” “industrious,” and “factory-made.” However, in Politics, the associations shift sharply to “authoritarian” and “socialized.”
- Germany: In Economics, it is associated with “antimonopoly” and “market economy,” reflecting its status as a regulated European economic powerhouse.
- Canada: In Culture, it is associated with “multiculturalism,” contrasting with “nationalistic” frames found elsewhere.
These descriptors aren’t random; they align with geopolitical realities and media narratives. The pipeline successfully disentangled the economic “China” (the trade partner) from the political “China” (the rival), which a single vector representation would have mashed together.
Tracking Stereotypes Over Time: The US-China Trade War
Because the pipeline relies on text contexts, it can track how these contexts change over time. The authors performed a temporal analysis of US news regarding China, specifically looking at the period surrounding the onset of the US-China trade war (around 2018).
They utilized two LLM-generated axes relevant to the conflict:
- Trade Barriers vs. Free Trade
- Market Economy vs. Planned Economy

The graphs above tell a clear story.
- Left Chart: The red line represents the association with “Trade Barriers.” Notice the sharp spike starting in 2018, which coincides with the Section 301 tariffs announced by the US.
- Right Chart: The green line represents the association with “Market Economy.” It dips significantly around 2018/2019, implying that news coverage began framing China less as a market participant and more as a state-controlled (Planned Economy) entity during the height of the tensions.
This demonstrates that ADAPTIVE AXES isn’t just a static snapshot tool; it acts as a social barometer, capable of detecting shifts in media framing as real-world events unfold.
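For readers curious how such a yearly trace could be assembled, here is a rough sketch assuming each masked context has already been scored against an axis and carries a publication year; the numbers below are placeholders, not the paper's measurements.

```python
import pandas as pd

# Toy frame: one row per masked context mentioning China, already scored against the
# "Trade Barriers vs. Free Trade" axis. Values are placeholders, not real measurements.
scores = pd.DataFrame({
    "year": [2016, 2017, 2018, 2018, 2019, 2020],
    "trade_barriers_alignment": [0.02, 0.03, 0.18, 0.21, 0.15, 0.12],
})

# Averaging per-context axis scores by year yields a temporal trace of the kind plotted above.
yearly_trace = scores.groupby("year")["trade_barriers_alignment"].mean()
print(yearly_trace)
```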
Conclusion and Implications
The “ADAPTIVE AXES” paper presents a compelling argument for rethinking how we measure bias in NLP. By moving away from static word tokens and embracing the complexity of context, the researchers have provided a tool that mirrors the complexity of human language.
Here are the key takeaways:
- Context is King: Masking the target entity allows us to measure the frame rather than the identity.
- LLMs as Assistants: Using LLMs to prune and generate semantic axes cleans up the data and allows for highly specific, in-domain analysis that generic databases like WordNet cannot support.
- Sentence Encoders: Modern text embedding models are superior to word-level models for capturing semantic contrast.
Limitations to Keep in Mind
While the pipeline is powerful, the authors note that it measures co-occurrence, not necessarily causation. For example, Canada is strongly associated with the axis “North.” This isn’t a stereotype; it’s a geographical fact. Distinguishing between a descriptive association (like geography) and a biased stereotype (like “warlike”) remains a challenge that requires careful interpretation of the axes.
Ultimately, this work opens the door for sociologists, political scientists, and media analysts to quantify the “mood” of millions of documents instantly. It moves us beyond asking “Is this text biased?” to asking “How is the framing of this specific group evolving in this specific domain?"—a much richer question for a complex world.