Can We Fix AI Bias Just by Changing the Examples? Strategic Prompting for Fairer LLMs
Large Language Models (LLMs) like GPT-4 and Claude-3 have revolutionized how we interact with text. But recently, these models have started moving into a domain traditionally dominated by statistical models: tabular data. Imagine an LLM deciding whether someone qualifies for a loan or predicting their income bracket based on a spreadsheet of personal information.
While the capability is impressive, it puts a massive elephant in the room: fairness.
We know LLMs can inherit biases from their training data. When we use them for high-stakes decisions like credit scoring, how do we ensure they treat minority groups fairly? A recent paper, Strategic Demonstration Selection for Improved Fairness in LLM In-Context Learning, explores a fascinating solution. It suggests that the secret to a fairer model might lie in the “few-shot” examples—or demonstrations—we provide in the prompt.
In this deep dive, we will explore how strategic prompt engineering can mitigate bias, uncover the mechanics behind it, and introduce a novel algorithm called FCG (Fairness via Clustering-Genetic) designed to automate this process.
The Setup: In-Context Learning and Tabular Data
To understand the solution, we first need to understand the mechanism: In-Context Learning (ICL).
Unlike traditional machine learning, where you retrain a model’s weights on a new dataset, ICL works by simply giving the LLM a few examples in the prompt. You say, “Here are three examples of loan applications and the decisions made. Now, decide on this new application.”
The researchers investigated this using a specific prompt structure.

As shown in Figure 1, the process has three stages (a minimal prompt-building sketch follows the list):
- Task Description: Telling the LLM what to do (e.g., “Predict loan approval”).
- Demonstrations (The Core): This is where the magic happens. We select \(K\) examples from the training set to show the model how it’s done.
- The Question: The actual test sample we want the LLM to classify.
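To make that structure concrete, here is a minimal sketch of how such a prompt could be assembled. The task wording, feature formatting, and labels are illustrative assumptions, not the paper's exact template.

```python
# Minimal sketch: assembling an ICL prompt from a task description,
# K demonstrations, and the test question (template wording is assumed).

def build_icl_prompt(task_description, demonstrations, test_sample):
    """demonstrations: list of (features_text, label_text) pairs."""
    lines = [task_description, ""]
    for i, (features, label) in enumerate(demonstrations, start=1):
        lines.append(f"Example {i}: {features}")
        lines.append(f"Answer: {label}")
        lines.append("")
    lines.append(f"Question: {test_sample}")
    lines.append("Answer:")
    return "\n".join(lines)

prompt = build_icl_prompt(
    "Predict whether the applicant's income exceeds 50K. Answer 'high' or 'low'.",
    [("age=39, sex=Female, education=Bachelors, hours-per-week=40", "low"),
     ("age=52, sex=Female, education=Masters, hours-per-week=45", "high")],
    "age=31, sex=Female, education=HS-grad, hours-per-week=38",
)
print(prompt)
```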
The research question is simple but profound: Does the choice of those \(K\) examples in Step 2 change how fair the model is?
Defining Fairness in Math
Before we look at the results, we need to define “fairness” mathematically. The researchers focused on binary classification (e.g., approve/deny) and looked at sensitive attributes like gender (Male/Female).
They utilized the notation \(Z\) for the sensitive feature (where \(Z=0\) is the minority group, e.g., females in high-income datasets, and \(Z=1\) is the majority) and \(Y\) for the label (e.g., high income vs. low income).
To measure the distribution of these groups in the examples provided to the LLM, they used the following ratios:
\[
r_z = \frac{\#\{\, i : z_i = 0 \,\}}{K}, \qquad r_y = \frac{\#\{\, i : y_i = 1 \,\}}{K}
\]
Here, \(r_z\) is the proportion of minority samples among the \(K\) demonstrations, and \(r_y\) is the proportion of demonstrations carrying a given label (written above for the positive class).
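As a quick illustration, here is a small sketch for computing these two ratios from a candidate demonstration set (the dictionary keys `z` and `y` are assumed names):

```python
# Sketch: prompt-composition ratios r_z and r_y for K demonstrations,
# where z = 0 marks the minority group and y = 1 the positive label.

def demo_ratios(demos):
    """demos: list of dicts with keys 'z' (sensitive attribute) and 'y' (label)."""
    k = len(demos)
    r_z = sum(d["z"] == 0 for d in demos) / k   # share of minority examples
    r_y = sum(d["y"] == 1 for d in demos) / k   # share of positive-label examples
    return r_z, r_y

demos = [{"z": 0, "y": 1}, {"z": 0, "y": 0}, {"z": 1, "y": 1}, {"z": 0, "y": 1}]
print(demo_ratios(demos))  # (0.75, 0.75)
```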
To evaluate if the model is actually being fair, the paper uses two standard metrics:
Demographic Parity (DP): The probability of a positive prediction should be similar regardless of the group.
\[
P(\hat{Y}=1 \mid Z=0) \;\approx\; P(\hat{Y}=1 \mid Z=1)
\]
Equalized Odds (EO): The model should have similar True Positive Rates (TPR) and False Positive Rates (FPR) for both groups.
\[
P(\hat{Y}=1 \mid Y=y, Z=0) \;\approx\; P(\hat{Y}=1 \mid Y=y, Z=1), \qquad y \in \{0, 1\}
\]
Ideally, the difference (\(\Delta\)) between groups should be zero, and the ratio (\(R\)) should be close to 1.
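To ground these definitions, here is a sketch of the difference (\(\Delta\)) and ratio (\(R\)) versions of both metrics computed from model outputs. The exact way the paper aggregates the TPR and FPR gaps into a single \(\Delta_{eo}\) and \(R_{eo}\) may differ; averaging them here is an assumption.

```python
import numpy as np

# Sketch: difference (Delta) and ratio (R) forms of demographic parity and
# equalized odds, from predictions y_hat, true labels y, and group membership z.

def demographic_parity(y_hat, z):
    p0 = y_hat[z == 0].mean()              # P(Y_hat=1 | Z=0)
    p1 = y_hat[z == 1].mean()              # P(Y_hat=1 | Z=1)
    return abs(p0 - p1), min(p0, p1) / max(p0, p1)

def equalized_odds(y_hat, y, z):
    def rate(group, label):                # P(Y_hat=1 | Z=group, Y=label)
        mask = (z == group) & (y == label)
        return y_hat[mask].mean()
    tpr0, tpr1 = rate(0, 1), rate(1, 1)
    fpr0, fpr1 = rate(0, 0), rate(1, 0)
    delta = 0.5 * (abs(tpr0 - tpr1) + abs(fpr0 - fpr1))
    ratio = 0.5 * (min(tpr0, tpr1) / max(tpr0, tpr1)
                   + min(fpr0, fpr1) / max(fpr0, fpr1))
    return delta, ratio

y_hat = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y     = np.array([1, 0, 1, 0, 0, 1, 1, 0])
z     = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(demographic_parity(y_hat, z))   # (0.5, 0.333...)
print(equalized_odds(y_hat, y, z))    # (0.5, 0.25)
```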
Key Discovery: Prioritize the Minority
The researchers tested three “Demonstration Strategies” (S1, S2, S3; sketched in code after the list) on models like GPT-3.5, GPT-4, and Claude-3, using datasets such as Credit (predicting overdue payments) and Adult (predicting income >50K).
- S1 (Balanced): The examples in the prompt are 50% minority, 50% majority.
- S2 (Prioritize Minority): The examples are 100% from the minority group (\(r_z = 1\)).
- S3 (Prioritize Minority + Unbalanced Labels): All minority samples, with skewed outcome labels.
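The sketch below shows one way to draw \(K\) demonstrations under each strategy. The candidate `pool` format and the 75/25 label skew used for S3 are assumptions for illustration; the paper's exact sampling details may differ.

```python
import random

# Sketch: drawing K demonstrations under the three strategies.
# 'pool' is a list of dicts with 'z' (0 = minority) and 'y' (binary label) fields.

def sample_demos(pool, k, strategy, seed=0):
    rng = random.Random(seed)
    minority = [d for d in pool if d["z"] == 0]
    majority = [d for d in pool if d["z"] == 1]
    if strategy == "S1":                        # balanced groups: r_z = 0.5
        return rng.sample(minority, k // 2) + rng.sample(majority, k - k // 2)
    if strategy == "S2":                        # minority only: r_z = 1
        return rng.sample(minority, k)
    if strategy == "S3":                        # minority only + skewed labels
        pos = [d for d in minority if d["y"] == 1]
        neg = [d for d in minority if d["y"] == 0]
        n_pos = min(len(pos), int(0.75 * k))    # assumed 75/25 label skew
        return rng.sample(pos, n_pos) + rng.sample(neg, k - n_pos)
    raise ValueError(f"unknown strategy: {strategy}")
```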
The results were counter-intuitive to standard statistical wisdom, which usually suggests balanced training data is best.

Figure 2 illustrates the performance on the Credit dataset. Look at the yellow (\(R_{eo}\)) and grey (\(R_{dp}\)) bars—these represent fairness ratios (higher is better). You can see a distinct improvement in fairness metrics for strategies S2 and S3 compared to the Zero-shot baseline or S1.
The takeaway: Deliberately filling the prompt with examples from the underrepresented minority group significantly boosts fairness without tanking accuracy.
We can see this in detail with GPT-3.5 on the Adult Income dataset:

In Table 1, look at the Fairness section. The \(R_{dp}\) (Demographic Parity Ratio) jumps from 0.40 in Zero-shot to 0.67 in S3. The difference in demographic parity (\(\Delta_{dp}\)) drops significantly. This suggests that LLMs pay close attention to the demographics shown in the context window.
The “Why”: A Perturbation Analysis
Why does showing the LLM only minority examples make it fairer? Is it learning the actual pattern, or just memorizing the labels? To find out, the researchers performed a Perturbation Analysis.
They took the demonstrations and intentionally “broke” them by flipping labels (changing High Income to Low Income) or flipping genders, then fed these corrupted prompts to the LLM.
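A sketch of that perturbation step, with assumed field names (`y` for the outcome label, `z` for the sensitive attribute) and binary encodings:

```python
import random

# Sketch: corrupt a demonstration set by flipping a fraction of outcome labels
# or sensitive attributes, then re-prompt the LLM with the corrupted examples.

def perturb(demos, field, flip_fraction, seed=0):
    """field: 'y' to flip outcome labels, 'z' to flip the sensitive attribute."""
    rng = random.Random(seed)
    corrupted = [dict(d) for d in demos]                   # copy each example
    n_flip = int(flip_fraction * len(corrupted))
    for idx in rng.sample(range(len(corrupted)), n_flip):
        corrupted[idx][field] = 1 - corrupted[idx][field]  # flip the binary value
    return corrupted

demos = [{"z": 0, "y": 1}, {"z": 0, "y": 0}, {"z": 1, "y": 1}, {"z": 1, "y": 0}]
print(perturb(demos, "y", 0.5))   # half the labels flipped
print(perturb(demos, "z", 0.5))   # half the sensitive attributes flipped
```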

What did they find?
- Prediction vs. Fairness Trade-off: When they perturbed the sensitive feature (e.g., relabeling male examples as female), they found a trade-off: raising the proportion of minority examples enhanced fairness but sometimes hurt predictive accuracy.
- Ground Truth Matters (Sort of): Perturbing the outcome labels (\(Y\)) caused a drastic drop in fairness. This means the model isn’t just looking at the demographics; it is actively looking at the correlation between the demographic and the outcome in the prompt.
This confirmed that strategic selection isn’t just noise—it steers the model’s internal reasoning.
The Solution: The FCG Algorithm
Once we know that demonstration selection matters, relying on random sampling or manual curation becomes inefficient. We need an automated way to find the set of examples that maximizes both accuracy and fairness.
The researchers proposed the Fairness via Clustering-Genetic (FCG) algorithm. It is a two-step process (a rough code sketch follows the list):
- Clustering (Diversity): Instead of picking random points, the algorithm groups the training data into clusters (based on features). This ensures the pool of candidates covers the entire data distribution, not just the most common types of people.
- Genetic Evolution (Optimization): It uses a genetic algorithm to “evolve” the best set of prompts. It picks a set, tests it, scores it based on a mix of Accuracy and Fairness, and then “breeds” the best sets to find an optimal combination.
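The sketch below captures this high-level idea under several assumptions: scikit-learn's `KMeans` stands in for the clustering step, the genetic loop uses simple selection, crossover, and mutation, and the `fitness` callable is a placeholder for scoring a candidate demonstration set (e.g., prompting the LLM on a validation split and combining accuracy with the fairness ratios above). It is not the authors' implementation.

```python
import random
import numpy as np
from sklearn.cluster import KMeans

def cluster_candidates(X, groups, n_clusters=4, seed=0):
    """Diversity step: within each (Z, Y) subgroup (encoded in the 1-D NumPy
    array `groups`), keep the sample closest to every cluster center."""
    candidates = []
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        k = min(n_clusters, len(idx))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx])
        for center in km.cluster_centers_:
            closest = idx[np.argmin(np.linalg.norm(X[idx] - center, axis=1))]
            candidates.append(int(closest))
    return sorted(set(candidates))

def genetic_select(candidates, k, fitness, generations=20, pop_size=12, seed=0):
    """Optimization step: evolve k-sized index sets maximizing fitness(subset)."""
    rng = random.Random(seed)
    pop = [rng.sample(candidates, k) for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]  # selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            gene_pool = list(set(a) | set(b))                  # crossover
            child = rng.sample(gene_pool, min(k, len(gene_pool)))
            while len(child) < k:                              # repair short children
                extra = rng.choice(candidates)
                if extra not in child:
                    child.append(extra)
            if rng.random() < 0.3:                             # mutation
                child[rng.randrange(k)] = rng.choice(candidates)
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```

Here `groups` would encode each sample's \((Z, Y)\) subgroup (e.g., `2 * z + y`), and the genetic step could additionally be constrained to respect a target \(r_z\), such as the all-minority strategy S2.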

As visualized in Figure 4, the process starts by splitting the data into subgroups (e.g., Female/High-Income, Male/Low-Income). It clusters these subgroups to filter out redundancy. Then, the genetic algorithm iterates through combinations to find the “super-prompt.”
FCG vs. LENS
There is an existing method called LENS (Filter-then-Search) for selecting prompts. However, LENS is computationally expensive because it uses the LLM itself to filter data.

Figure 5 shows the difference. FCG uses Unsupervised Filtering (clustering) in the first step, which is much faster and doesn’t require querying an expensive LLM API thousands of times. It only brings in the LLM for the final scoring.
Does FCG Actually Work?
The results indicate a resounding yes. When applied to the Adult dataset, FCG consistently produced better fairness scores than random selection strategies.

In Table 4, looking at the “K-Shot (K=8)” columns, we see high Fairness scores (\(R_{dp}\) and \(R_{eo}\)) across the board. Notably, using FCG with the strategy of prioritizing minority samples (\(r_z = 1\)) yielded some of the best results, effectively harmonizing the paper’s two main contributions: strategy (prioritize minorities) and method (use FCG to pick the specific examples).
Conclusion
This research highlights a critical aspect of the AI era: How we ask the question matters as much as the model we ask.
For students and practitioners working with LLMs on structured data, the takeaways are clear:
- Don’t just randomly sample: The examples you put in your prompt frame the model’s worldview for that specific task.
- Representation is key: For fairness, showing the model examples of underrepresented groups (minority classes) is often more effective than showing a “balanced” view of the world, likely because it forces the model to attend to features it might otherwise ignore.
- Automate the process: Algorithms like FCG show that we can mathematically optimize prompts to meet ethical guidelines (fairness) without sacrificing utility (accuracy).
As LLMs continue to integrate into decision-making pipelines, techniques like FCG will be an essential part of the toolkit for ensuring these powerful systems serve everyone equitably.