Beyond Manual Word Lists: Debiasing AI with Continuous Prompts

Pre-trained Language Models (PLMs) like BERT and RoBERTa have revolutionized Natural Language Processing (NLP). They act as the backbone for everything from sentiment analysis to hate speech detection. However, these models have a significant skeleton in the closet: they inherit human biases present in their massive training datasets.

When we deploy these models, they often exhibit “extrinsic social bias”—unfair behavior in specific downstream tasks. For instance, a model might be more likely to classify a tweet as “offensive” simply because it contains African American English (AAE), or associate certain professions more strongly with one gender.

The standard approach to fixing this has historically relied on manual word lists. Researchers would create lists of sensitive words (e.g., “he”, “she”, “man”, “woman”) and try to scrub the model’s reliance on them. But this approach is brittle. It’s limited by the length of the list and human intuition. What about biases that are subtle, contextual, or related to attributes like race where specific “trigger words” are harder to define?

In this post, we’ll dive into a research paper that proposes a more robust solution: Continuous Prompts Adjustment Debiasing (CPAD). This method moves away from manual lists and instead uses “continuous prompts” to mathematically isolate and neutralize bias during the inference stage.

Background: The Problem with Discrete Fixes

To understand CPAD, we first need to understand the limitations of current methods.

Intrinsic vs. Extrinsic Bias:

  • Intrinsic Bias refers to bias inside the pre-trained model itself (e.g., embedding associations).
  • Extrinsic Bias is what we actually care about in applications: the bias the model shows when performing a specific task (e.g., a resume scanner rejecting female candidates).

Previous attempts to fix extrinsic bias often used discrete prompts or data augmentation based on word lists. For example, swapping “he” with “she” to create counterfactual data. But language is complex. Bias regarding race, for example, isn’t just about a few words; it’s often woven into names, dialects, and sentence structures. Manual lists simply cannot capture the entire vocabulary space of a sensitive group.
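To make the limitation concrete, here is a toy sketch of word-list counterfactual augmentation (the swap table and code are purely illustrative, not from the paper). Even this tiny example hints at the brittleness: “her” can be possessive or objective, and no list covers names, dialects, or sentence structure.

```python
# Toy word-list counterfactual augmentation (illustrative only).
SWAPS = {"he": "she", "she": "he", "him": "her", "her": "him",
         "man": "woman", "woman": "man"}

def counterfactual(sentence: str) -> str:
    # Swap every token found in the list; everything else is left untouched.
    return " ".join(SWAPS.get(tok, tok) for tok in sentence.lower().split())

print(counterfactual("She said he is a great man"))
# -> "he said she is a great woman"
```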

The researchers behind CPAD argue that we need a method that generates “continuous token lists” from the entire vocabulary space to bridge the gap between model outputs and fairness targets.

The CPAD Framework: An Overview

CPAD treats debiasing as a prompt-tuning problem.

In prompt-tuning, instead of fine-tuning the entire massive language model (which is expensive), we freeze the model and train a small sequence of vectors (the “prompt”) that we prepend to the input. These prompts guide the model to perform a specific task.
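As a point of reference, here is a minimal prompt-tuning sketch using Hugging Face Transformers (the model choice, prompt length, and initialization are illustrative assumptions, not the paper’s exact setup): the backbone is frozen, and only the prepended soft-prompt vectors receive gradients.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "roberta-base"                       # illustrative backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
plm = AutoModelForMaskedLM.from_pretrained(model_name)
for p in plm.parameters():                        # freeze the PLM
    p.requires_grad = False

prompt_len = 10
hidden = plm.config.hidden_size
soft_prompt = nn.Parameter(torch.randn(prompt_len, hidden) * 0.02)  # trainable

def forward_with_prompt(input_ids, attention_mask):
    tok_emb = plm.get_input_embeddings()(input_ids)          # (B, T, H)
    batch = tok_emb.size(0)
    prompt = soft_prompt.unsqueeze(0).expand(batch, -1, -1)  # (B, P, H)
    inputs_embeds = torch.cat([prompt, tok_emb], dim=1)      # prepend the prompt
    prompt_mask = torch.ones(batch, prompt_len,
                             dtype=attention_mask.dtype,
                             device=attention_mask.device)
    mask = torch.cat([prompt_mask, attention_mask], dim=1)
    return plm(inputs_embeds=inputs_embeds, attention_mask=mask).logits
```

During training, only `soft_prompt` (and, in CPAD, the prompt encoders introduced below) would be passed to the optimizer.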

CPAD takes this a step further by training two types of prompts:

  1. Task-Specific Prompts: Guide the model to solve the problem (e.g., “Is this hate speech?”).
  2. Debiasing Prompts: Capture the demographic information (bias) we want to avoid.

By learning these separately, the system can mathematically “subtract” the bias from the prediction at runtime.

Figure 1: Illustration of CPAD: The color orange indicates the trainable parameters in each phase, while the color blue shows the frozen ones.

As shown in Figure 1, the architecture operates in three distinct phases:

  1. Task-specific Learning Phase: Training the model to be good at the job.
  2. Debiasing Learning Phase: Training the model to recognize sensitive attributes (like race or age) so we know what to remove.
  3. Debiasing Phase (Inference): Combining the above to make a fair prediction.

Let’s break these down in detail.

Phase 1: Task-Specific Learning

First, we need a baseline. We want a model that performs well on the downstream task (like Sentiment Analysis).

The authors use a template-based approach. They wrap the input text \(x_i\) with pseudo tokens (learnable vectors) \([P^t]\) and discrete tokens \([D^t]\) ending with a [MASK] token.

Equation 2

The pseudo tokens are passed through a Task-specific Prompt Encoder (TPE) to create continuous prompts \(h^t\).

Equation 7
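The post does not spell out the TPE’s internals, so here is a P-tuning-style encoder as a plausible stand-in (the LSTM-plus-MLP architecture is my assumption): it maps a fixed set of pseudo-token embeddings to the continuous prompts \(h^t\).

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """P-tuning-style prompt encoder (an assumed architecture, not CPAD's exact TPE).

    Maps learnable pseudo-token embeddings [P] to continuous prompts h.
    """

    def __init__(self, prompt_len: int, hidden: int):
        super().__init__()
        self.pseudo_tokens = nn.Parameter(torch.randn(prompt_len, hidden) * 0.02)
        self.lstm = nn.LSTM(hidden, hidden // 2, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))

    def forward(self) -> torch.Tensor:
        x = self.pseudo_tokens.unsqueeze(0)   # (1, P, H)
        out, _ = self.lstm(x)                 # (1, P, H) after the bi-LSTM
        return self.mlp(out).squeeze(0)       # (P, H): the continuous prompts h
```

CPAD trains one such encoder for the task (producing \(h^t\)) and, later, one per protected attribute (producing \(h^0, h^1, \ldots\)).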

The input to the model becomes a combination of the original text embeddings and these learned prompt embeddings:

Equation 4

The model is then trained to predict the correct label \(y_i\) at the [MASK] position, exactly as in standard prompt-tuning. The objective function \(\mathcal{L}_{task}\) is a cross-entropy loss, ensuring the model is accurate.

Equation 8

At this stage, we have a model that is accurate, but likely biased.

Phase 2: Debiasing Learning

This is where CPAD innovates. To remove bias, we first need to capture it. The goal here is to train Debiasing Prompts (\(h^0, h^1, ...\)) that specifically encode information about protected attributes (like Age or Race).

Instead of using a manual list of words to define “Race,” the authors generate Prototypes.

Prototype Generation

The model runs through the training data. For a specific protected attribute (say, Race), the system collects the model’s output distributions for all examples belonging to “Group A” (e.g., White-aligned English) and “Group B” (e.g., AAE).

It then aggregates these outputs to create a “center” or prototype for each group. This effectively creates a continuous representation of that group’s linguistic patterns across the entire vocabulary, rather than just a few keywords.

Equation 9

Here, \(C_j^k\) is the prototype for sensitive group \(k\) of attribute \(j\). It is the average of the outputs \(O_i\) for that group.
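In symbols, writing \(S_j^k\) for the set of training examples in group \(k\) of attribute \(j\) (my notation for that set), the prototype is simply the mean output:

\[
C_j^k \;=\; \frac{1}{|S_j^k|} \sum_{x_i \in S_j^k} O_i
\]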

The Contrastive Objective

Now, we train the Debiasing Prompt Encoders. We want these prompts to help the model distinguish between sensitive groups. If the prompt successfully captures the “essence” of a specific demographic, the model’s output for an input text should be very close to that demographic’s prototype.

The training uses a contrastive loss function:

Equation 10

This equation might look intimidating, but the intuition is simple:

  • Minimize the distance between the input’s output and its correct demographic prototype (\(c_+\), positive prototype).
  • Maximize the distance between the input’s output and the incorrect demographic prototype (\(c_-\), negative prototype).

By doing this, the debiasing prompts become experts at extracting demographic signals—exactly the signals we want to neutralize later.
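One standard way to write such a prototype-contrastive objective (a sketch of the intuition, not necessarily the paper’s exact Equation 10) is a softmax over negated distances \(d(\cdot,\cdot)\) to the prototypes:

\[
\mathcal{L}_{deb} \;=\; -\log \frac{\exp\!\big(-d(O_i, c_+)\big)}{\exp\!\big(-d(O_i, c_+)\big) \;+\; \exp\!\big(-d(O_i, c_-)\big)}
\]

Minimizing this pulls each output \(O_i\) toward its positive prototype and pushes it away from the negative one.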

Phase 3: The Debiasing Phase (Inference)

We now have a Task Prompt (which knows how to do the job) and Debiasing Prompts (which know about demographics). To make a fair prediction, we adjust the input by combining these prompts.

The authors propose two ways to do this: Fine-grained and Coarse-grained adjustment.

Fine-grained Adjustment

In this method, the adjustment happens deep inside the prompt encoder. We define hyper-parameters \(\alpha\) and \(\beta\) that control how much we want to penalize the bias.

We adjust the pseudo tokens (\(P_u\)) and the encoder parameters (\(\tau\)) by subtracting the influence of the sensitive groups.

Equation 11

Notice the logic: the task parameters are mixed with the bias parameters, weighted by \(\alpha\) and \(\beta\). At first glance it may seem odd that the formula combines them rather than literally subtracting the bias, but viewed in the context of the overall prediction rule (Equation 1, shown below) the formulation amounts to a linear interpolation: we construct a new prompt that balances task knowledge against the bias knowledge we explicitly modeled, and the hyper-parameters control how strongly that demographic signal is counteracted.

Coarse-grained Adjustment

This approach is simpler and often more effective. It operates directly on the output embeddings of the prompt encoders (\(h_u\)).

Equation 12

We combine the task prompt embedding \(h^t\) with the demographic prompt embeddings \(h^0\) and \(h^1\).
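A minimal sketch of the idea, assuming a simple weighted subtraction (the exact sign and normalization in Equation 12 may differ; the tensors here are placeholders standing in for the prompt-encoder outputs):

```python
import torch

prompt_len, hidden = 10, 768
h_t = torch.randn(prompt_len, hidden)   # task prompt h^t (placeholder values)
h_0 = torch.randn(prompt_len, hidden)   # debiasing prompt h^0 (e.g., race)
h_1 = torch.randn(prompt_len, hidden)   # debiasing prompt h^1 (e.g., age)

alpha, beta = 0.5, 0.5                  # debiasing strengths (hyper-parameters)
# One plausible reading of the adjustment: down-weight the demographic signal
# captured by the debiasing prompts before feeding the prompt to the frozen PLM.
h_adj = h_t - alpha * h_0 - beta * h_1
```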

Finally, the prediction is made using this adjusted input. The general formula for the classifier \(\mathcal{F}\) becomes:

Equation 1

Here, \(\Omega\) represents the adjustment operation (combining the prompts) and \(\mathbf{H}\) represents the set of debiasing prompts.
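Putting the pieces together, the prediction rule described above can be paraphrased as (my rendering of the general form, with \(\hat{y}_i\) the predicted label):

\[
\hat{y}_i \;=\; \mathcal{F}\!\left(x_i,\; \Omega\!\left(h^t, \mathbf{H}\right)\right)
\]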

Experimental Results

The researchers tested CPAD on three NLU tasks:

  1. Hate Speech Detection (Race bias)
  2. Sentiment Analysis (Race bias - AAE vs. White-aligned)
  3. Psychometric Dimension Prediction (Race and Age bias)

They compared CPAD against several baselines, including standard Fine-tuning, Adapter, P-tuning, and other debiasing methods like Auto-Debias and Causal-Debias.

Metric 1: Group Fairness

Group fairness checks if the model treats different groups equally. Ideally, the “Gap” in True Positive Rates (TPR) and True Negative Rates (TNR) between groups should be zero.

The overall fairness score is calculated as:

Equation 15
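A small sketch of how such gaps can be computed (the aggregation into a single “Overall” score is my simplification; the paper’s Equation 15 may weight the terms differently):

```python
import numpy as np

def rate(y_true, y_pred, positive=True):
    """TPR if positive=True, TNR if positive=False."""
    mask = (y_true == 1) if positive else (y_true == 0)
    return (y_pred[mask] == y_true[mask]).mean() if mask.any() else np.nan

def fairness_gaps(y_true, y_pred, group):
    """Absolute TPR/TNR gaps between two demographic groups (group ids 0 and 1)."""
    g0, g1 = (group == 0), (group == 1)
    tpr_gap = abs(rate(y_true[g0], y_pred[g0], True)  - rate(y_true[g1], y_pred[g1], True))
    tnr_gap = abs(rate(y_true[g0], y_pred[g0], False) - rate(y_true[g1], y_pred[g1], False))
    return {"TPR_gap": tpr_gap, "TNR_gap": tnr_gap, "overall": (tpr_gap + tnr_gap) / 2}

# toy usage
y_true = np.array([1, 0, 1, 0, 1, 0]); y_pred = np.array([1, 0, 0, 0, 1, 1])
group  = np.array([0, 0, 0, 1, 1, 1])
print(fairness_gaps(y_true, y_pred, group))  # {'TPR_gap': 0.5, 'TNR_gap': 0.5, 'overall': 0.5}
```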

The Results:

Table 1: Group fairness results in hate speech detection and sentiment analysis…

Table 2: Group fairness results in psychometric dimension prediction…

Looking at Table 1:

  • Hate Speech: CPAD achieves the lowest “Overall” score (10.23%), meaning it has the smallest gap between groups. It significantly outperforms standard Finetuning (14.50%).
  • Sentiment Analysis: CPAD achieves an Overall score of 41.29%, drastically better than P-tuning (56.03%).

In Table 2, we see results for Psychometric prediction involving both Race and Age. CPAD consistently achieves lower gaps (better fairness) compared to baselines.

Metric 2: Fairness through Unawareness

Group fairness measures outcomes, but “Fairness through Unawareness” measures information leakage. Can we guess the protected attribute (like race) just by looking at the model’s output probability distribution? If the model is truly fair/unaware, we shouldn’t be able to.

The researchers propose a metric where they try to predict the sensitive group based on the distance between the output and the prototypes.

Equation 16

An ideal “Leakage” score is 50% (for binary groups)—meaning the output is as good as a random coin flip at revealing the demographic.
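A sketch of this probe, assuming a nearest-prototype guess with Euclidean distance (the paper’s Equation 16 may use a different distance):

```python
import numpy as np

def leakage(outputs, groups, prototypes):
    """Fraction of examples whose sensitive group is guessed correctly
    by assigning each output distribution to its nearest prototype.

    outputs:    (N, V) model output distributions
    groups:     (N,)   true sensitive-group ids (0..K-1)
    prototypes: (K, V) one prototype per group
    """
    dists = np.linalg.norm(outputs[:, None, :] - prototypes[None, :, :], axis=-1)
    guessed = dists.argmin(axis=1)
    return (guessed == groups).mean()  # 0.5 = chance level for binary groups
```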

Table 3: Fairness through unawareness results…

In Table 3, CPAD brings the leakage down closer to 50% compared to the biased model. For example, in Sentiment Analysis, leakage drops from 82.78% to 73.89%.

Table 4: Fairness through unawareness results in psychometric dimension prediction…

Table 4 confirms this trend for Psychometrics, reducing race leakage from 63.05% to 61.21%.

The Trade-off: Accuracy vs. Fairness

There is always a catch. Usually, making a model fairer reduces its accuracy because the model can no longer rely on “cheap” stereotypes to make predictions.

The authors analyzed this trade-off by varying the hyper-parameter \(\alpha\) (which controls the strength of the debiasing).

Figure 3: Trade-off factor alpha evaluation results in hate speech detection…

In Figure 3 (Hate Speech):

  • Red Bar (Accuracy): Remains relatively stable until \(\alpha\) gets very high (0.9).
  • Orange Line (Leakage): Drops significantly as \(\alpha\) increases.
  • Purple/Green Bars (Gaps): Generally decrease.

This shows CPAD can improve fairness significantly with only a minor hit to accuracy, provided \(\alpha\) isn’t set too high.

Figure 4: Trade-off factor alpha evaluation results in sentiment analysis…

Figure 4 (Sentiment Analysis) shows a steeper trade-off. As we push for lower leakage (orange line), the accuracy (red bars) takes a hit. This highlights the difficulty of the task—sentiment analysis on Twitter is heavily correlated with dialect, making debiasing harder without losing semantic meaning.

Intersectional Bias

One of the strengths of CPAD is its ability to handle multiple attributes at once (e.g., a Black woman vs. a White man). This is “Intersectional Bias.”

Figure 6: Intersectional bias mitigation results in psychometric dimension prediction…

The heatmaps in Figure 6 visualize the overall fairness score (lower is better, represented by lighter colors/yellow) as we adjust weights for both Race (\(\alpha\)) and Age (\(\beta\)).

  • The bottom-left corner (0,0) represents the biased model (dark/red).
  • As we move towards the center (increasing \(\alpha\) and \(\beta\)), the colors shift to yellow/light green, indicating improved fairness across both dimensions simultaneously.

Conclusion

The CPAD method represents a significant step forward in making NLP models fairer. By moving away from manual word lists and embracing continuous prompts, the authors created a system that is:

  1. Comprehensive: It uses the entire vocabulary space to define bias, not just a few keywords.
  2. Flexible: It allows for a fine-tuned trade-off between fairness and accuracy using \(\alpha\) and \(\beta\) parameters.
  3. Extensible: It can handle multiple protected attributes (Race, Age) simultaneously.

For students and practitioners, CPAD illustrates the power of Prompt Tuning not just for adapting models to new tasks, but for fundamentally altering how models process information to align with ethical guidelines. As we continue to deploy Large Language Models in sensitive areas like hiring and moderation, techniques like CPAD will be essential to ensure these powerful tools serve everyone equitably.