Introduction: The “Waterbird” Problem
Imagine you are training an AI to classify birds. You feed it thousands of images of waterbirds (like ducks) and landbirds (like sparrows). The model achieves 99% accuracy on your validation set. You are ready to deploy.
But then, disaster strikes. You show the model a duck standing on a grassy field, and it confidently shouts “Landbird!” You show it a sparrow flying over a lake, and it predicts “Waterbird!”
What went wrong?
This is a classic case of spurious correlation. In the training data, 95% of waterbirds appeared against a water background. The model, being a lazy learner, didn’t learn to recognize the beak or the feathers; it simply learned that blue background = waterbird. It relied on a “shortcut” feature.
For modern foundation models like CLIP, which are trained on massive web-scale data, this problem is pervasive. While these models have high average accuracy, their worst-group accuracy (e.g., classifying waterbirds that are on land) is often abysmal.
Fixing this usually requires one of two expensive things:
- Retraining the whole model to unlearn the bias (computationally expensive).
- Manually labeling groups (telling the model “this is a duck on land”) so it can learn explicitly (annotation expensive).
In this post, we are doing a deep dive into a fascinating paper titled “Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness.” The authors propose a clever, three-step method called PPA that fixes these biases without retraining the whole model and without needing expensive group labels. It achieves state-of-the-art results by tuning less than 0.01% of the parameters.
Background: Robustness and The Failure-Based Strategy
Before turning to the solution, we must formalize the problem. In standard machine learning, we typically minimize the average error over the data distribution \(\mathbb{P}\):

\[
\min_{\theta}\ \mathbb{E}_{(x,y)\sim\mathbb{P}}\big[\ell\big(f_\theta(x),\, y\big)\big]
\]
However, when spurious correlations exist, minimizing average error allows the model to ignore minority groups (like landbirds on water). To measure true robustness, we look at Worst-Group Accuracy (WGA). We divide the data into groups \(g \in \mathcal{G}\) (e.g., Waterbird+Water, Waterbird+Land, Landbird+Water, Landbird+Land) and measure the error on the hardest group:

\[
\max_{g\in\mathcal{G}}\ \mathbb{E}_{(x,y)\sim\mathbb{P}_g}\big[\ell\big(f_\theta(x),\, y\big)\big]
\]
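As a concrete reference, here is a minimal sketch of how worst-group accuracy can be computed from per-example predictions and group ids; the function name and array layout are illustrative, not from the paper.

```python
import numpy as np

def worst_group_accuracy(preds, labels, groups):
    """Return the lowest per-group accuracy.

    preds, labels, groups: 1-D integer arrays of equal length;
    `groups` holds a group id per example (e.g., 0..3 for the four Waterbirds groups).
    """
    per_group = [
        (preds[groups == g] == labels[groups == g]).mean()
        for g in np.unique(groups)
    ]
    return min(per_group)
```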
The Unsupervised Challenge
The hardest version of this problem is unsupervised group robustness. This means we do not have labels telling us which image belongs to which group (\(g\)). We only have the class labels \(y\) (bird type) and the images \(x\).
A common strategy to solve this is the Failure-Based Debiasing Scheme. The logic is simple but powerful:
- Train a standard, “lazy” model (often called Empirical Risk Minimization or ERM).
- Let it overfit to the spurious features.
- Identify the examples the model gets wrong. These are likely the minority groups (the “counter-bias” examples).
- Train a second model that pays extra attention to these hard examples (a minimal sketch of this scheme is shown below).
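Here is a minimal sketch of this failure-based scheme, in the spirit of the JTT baseline cited later in this post; the function name and the upweighting factor are illustrative assumptions.

```python
import numpy as np

def failure_based_weights(erm_preds, labels, upweight=20.0):
    """Upweight examples a standard ERM model got wrong.

    The error set is a noisy proxy for the minority ("counter-bias") groups;
    the returned per-example weights are used when training the second model.
    """
    error_set = erm_preds != labels
    return np.where(error_set, upweight, 1.0)
```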
The paper we are analyzing improves drastically on this framework. The authors argue that standard models aren’t “biased enough” to cleanly identify the minority groups, and standard re-weighting isn’t optimal.
The PPA Method
The authors propose Project-Probe-Aggregate (PPA). It is a parameter-efficient fine-tuning method, meaning the massive pre-trained backbone (like CLIP) is frozen, and we only train a small linear layer.
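To make that setup concrete, here is a minimal sketch of linear probing on a frozen CLIP backbone, using the OpenAI `clip` package; the class count, helper names, and normalization are assumptions for illustration, not the authors' training code.

```python
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

# Freeze the entire backbone; PPA never updates it.
for p in model.parameters():
    p.requires_grad_(False)

# The only trainable parameters: a small linear head on top of the
# 1024-dimensional image embeddings of CLIP ResNet-50.
num_classes = 2  # e.g., waterbird vs. landbird
linear_head = torch.nn.Linear(1024, num_classes).to(device)

@torch.no_grad()
def embed_images(image_batch):
    feats = model.encode_image(image_batch).float()
    return feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize, as is standard for CLIP
```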
The method consists of three distinct steps. Let’s break them down.
Step 1: Project (Creating the “Super-Biased” Model)
To find the minority groups (e.g., ducks on land), we first need a model that relies heavily on the background (water vs. land). If we can build a model that only looks at the background, it will definitely misclassify the ducks on land.
Standard training tries to learn the object and the background. The authors propose a way to force the model to ignore the object (the class) and focus only on the spurious features.
They utilize the text encoder of the foundation model (CLIP). They take the text embeddings of the class names (e.g., “A photo of a waterbird”) to create a matrix \(Z\). These embeddings represent the “core” features of the class.
To force the model to look at everything except the class concept, they mathematically project the image features onto the nullspace of these class proxies.
The projection matrix \(\Pi\) is calculated as:

\[
\Pi = I - Z\big(Z^{\top}Z\big)^{-1}Z^{\top}
\]
Here, \(Z\) represents the class features. Multiplying by \(\Pi\) effectively removes the directions in the feature space that correspond to the class definitions.
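In code, the projection is a few lines of linear algebra. A sketch, assuming `Z` stacks the d-dimensional CLIP text embeddings of the class prompts as columns:

```python
import torch

@torch.no_grad()
def nullspace_projector(Z):
    """Build the projector that removes the class-proxy directions.

    Z: (d, num_classes) text embeddings of the class prompts.
    Returns Pi = I - Z (Z^T Z)^{-1} Z^T, a (d, d) matrix.
    """
    d = Z.shape[0]
    eye = torch.eye(d, device=Z.device, dtype=Z.dtype)
    return eye - Z @ torch.linalg.inv(Z.T @ Z) @ Z.T

# Usage: project the frozen image features before training the biased linear probe.
# features: (N, d); Pi is symmetric, so right-multiplication is enough.
# projected = features @ nullspace_projector(Z)
```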
We then train a linear classifier \(f_b\) (the biased model) on these “projected” features:

\[
\min_{w_b}\ \mathbb{E}_{(x,y)}\Big[\ell\big(w_b^{\top}\,\Pi\,\phi(x),\ y\big)\Big],
\]

where \(\phi(x)\) denotes the frozen image embedding of \(x\).
Why does this work? By removing the signal that describes the actual object (the bird), the classifier has no choice but to rely on whatever signal is left to predict the label. In datasets with spurious correlations, the remaining signal is the strong spurious feature (the background).
The authors prove mathematically (we will touch on this later) that this projection amplifies the model’s reliance on spurious correlations.
To verify this, look at the precision and recall of identifying minority groups below. The PPA method (green) significantly outperforms standard ERM (orange) and other methods in finding the “worst-group” examples.

Step 2: Probe (Scoring the Groups)
Now that we have a “super-biased” model \(f_b\), we use it to create pseudo-group labels.
If the biased model predicts the class correctly, the image likely follows the spurious correlation (Majority Group). If it predicts incorrectly, the image likely violates the correlation (Minority Group).
We define a pseudo-attribute \(\hat{a}\) based on whether the biased model was wrong:

\[
\hat{a} = \mathbb{1}\big[\, f_b(x) \neq y \,\big]
\]
Now, every image has a pseudo-group label \(\hat{g} = (y, \hat{a})\).
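A sketch of this labeling step; `f_b` is the biased probe trained on projected features, and the way \((y, \hat{a})\) is flattened into a single group index is an implementation choice, not prescribed by the paper.

```python
import torch

@torch.no_grad()
def pseudo_group_labels(f_b, projected_feats, labels):
    """Assign pseudo-groups (y, a_hat), with a_hat = 1 iff the biased probe is wrong."""
    preds = f_b(projected_feats).argmax(dim=1)
    a_hat = (preds != labels).long()   # 0: follows the spurious correlation, 1: violates it
    return labels * 2 + a_hat          # e.g., 4 groups for a 2-class problem
```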
Next, we train a Probe—a new linear classifier \(h_d\)—to predict these pseudo-group labels. But we don’t just use standard cross-entropy. We need to account for the fact that these groups are heavily imbalanced.
The authors introduce Group Logit Adjustment (GLA). This loss function adds a margin to the logits based on the estimated group priors \(\hat{\beta}\) (how frequent each group is):

\[
\mathcal{L}_{\mathrm{GLA}}(x, \hat{g}) = -\log \frac{\exp\!\big(h_{\hat{g}}(x) + \tau \log \hat{\beta}_{\hat{g}}\big)}{\sum_{g' \in \mathcal{G}} \exp\!\big(h_{g'}(x) + \tau \log \hat{\beta}_{g'}\big)},
\]

where \(h_{g}(x)\) is the probe’s logit for group \(g\).
Here, \(\tau\) is a hyperparameter that controls how much we correct for the imbalance. This step effectively trains a model that is excellent at distinguishing between the four scenarios (Waterbird on Water, Waterbird on Land, etc.), even though it never saw ground-truth group labels.
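Here is a minimal sketch of such a group-level logit-adjusted loss, following the standard logit-adjustment recipe; the exact form used in the paper may differ in details.

```python
import torch
import torch.nn.functional as F

def group_logit_adjusted_loss(logits, group_labels, group_priors, tau=1.0):
    """Cross-entropy over pseudo-groups with a prior-dependent margin.

    logits:       (B, num_groups) raw outputs of the linear probe h_d
    group_labels: (B,) pseudo-group indices
    group_priors: (num_groups,) estimated group frequencies (summing to 1)
    """
    adjusted = logits + tau * torch.log(group_priors).unsqueeze(0)
    return F.cross_entropy(adjusted, group_labels)
```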
Step 3: Aggregate (The Final Classifier)
We now have a probe \(h_d\) that predicts group labels. But for our final task, we don’t want to predict “Waterbird on Land”; we just want to predict “Waterbird.”
During inference, computing group probabilities and then summing them adds extra computational overhead. The authors propose a clever simplification called Weight-Space Aggregation.
Since the probe is a linear model (\(W_d\)), the weights for a specific class \(y\) can simply be the sum of the weights for all groups belonging to that class:

\[
w_y = \sum_{g \in \mathcal{G}_y} w_{d,g},
\]

where \(\mathcal{G}_y\) is the set of groups whose class label is \(y\) and \(w_{d,g}\) is the probe’s weight vector for group \(g\).
This results in a final debiased classifier \(f_d\) that is just a standard linear layer, incurring zero extra inference cost compared to a standard model.
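In code, the aggregation is a single reshape-and-sum over the probe’s weight matrix. A sketch that assumes the groups of each class are stored contiguously, matching the flattening used in the earlier pseudo-labeling sketch:

```python
import torch

def aggregate_probe_weights(W_probe, num_classes):
    """Collapse group weights into per-class weights: w_y = sum of w_g over groups of class y.

    W_probe: (num_groups, d) weight matrix of the probe h_d, with each class's
             groups stored contiguously (class 0's groups first, then class 1's, ...).
    Returns a (num_classes, d) weight matrix for the final debiased classifier f_d.
    """
    groups_per_class = W_probe.shape[0] // num_classes
    return W_probe.reshape(num_classes, groups_per_class, -1).sum(dim=1)
```

Any bias terms of the probe can be aggregated the same way, so the final classifier really is just a standard linear layer.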
Theoretical Analysis: Why This Works
The paper provides rigorous theoretical backing for the intuition used in Step 1 and Step 2.
Why Projection Amplifies Bias (Proposition 1)
The authors analyze a linear regression setting where the target \(y\) depends on core features \(c\) and spurious features \(s\).

They mathematically demonstrate that when you project out the core features (using \(\Pi\)), the weight assigned to the spurious features (\(\gamma'\)) in the new model increases compared to the original weight (\(\gamma\)).

Because the denominator is positive and the correlation term \(\mathbf{r}_s^\top \mathbf{r}_{y_o}\) is usually positive in spurious datasets (background correlates with label), \(\gamma' > \gamma\). This proves that the projected model is mathematically forced to be more biased.
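The effect is easy to reproduce in a toy simulation; this is an illustrative sketch with made-up coefficients, not the paper’s exact setting. We generate a spurious feature correlated with the core feature, then compare the spurious weight with and without the core direction projected out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
c = rng.normal(size=n)                   # core feature (the bird itself)
s = 0.9 * c + 0.3 * rng.normal(size=n)   # spurious feature, correlated with the core
y = 1.0 * c + 0.2 * s + 0.1 * rng.normal(size=n)

X_full = np.column_stack([c, s])
gamma_full = np.linalg.lstsq(X_full, y, rcond=None)[0][1]   # spurious weight, roughly 0.2

X_proj = np.column_stack([np.zeros(n), s])                  # core direction projected out
gamma_proj = np.linalg.lstsq(X_proj, y, rcond=None)[0][1]   # much larger, roughly 1.2

print(f"spurious weight (full features):      {gamma_full:.2f}")
print(f"spurious weight (core projected out): {gamma_proj:.2f}")
```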
Bayes Optimality (Proposition 2)
The authors also seek to minimize the Balanced Group Error (BGE), which treats all groups equally regardless of their size in the training set:

\[
\mathrm{BGE}(f) = \frac{1}{|\mathcal{G}|} \sum_{g \in \mathcal{G}} \mathbb{E}_{(x,y)\sim\mathbb{P}_g}\big[\mathbb{1}[f(x) \neq y]\big]
\]
They prove that their aggregation strategy (summing the group logits minus the log group priors) is the Bayes Optimal classifier for minimizing BGE.

This theoretical result explains why the Group Logit Adjustment in Step 2 is crucial—it aligns the training objective with the ultimate goal of balanced performance.
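For intuition, here is a sketch of the standard argument (not the paper’s exact proof). Minimizing the BGE pointwise amounts to picking, for each input \(x\), the class whose groups carry the most group-balanced evidence:

\[
f^\star(x) \;=\; \arg\max_{y}\ \sum_{g \in \mathcal{G}_y} p(x \mid g)
\;=\; \arg\max_{y}\ \sum_{g \in \mathcal{G}_y} \frac{p(g \mid x)}{\beta_g},
\]

using Bayes’ rule \(p(x \mid g) = p(g \mid x)\, p(x) / \beta_g\), where \(\beta_g\) is the prior probability of group \(g\) and each group contributes to the BGE with equal weight \(1/|\mathcal{G}|\) rather than with \(\beta_g\). Writing \(p(g \mid x)/\beta_g = \exp\big(\log p(g \mid x) - \log \beta_g\big)\) recovers the “group logits minus log group priors” aggregation rule described above.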
Experiments and Results
The authors evaluated PPA on standard benchmarks known for spurious correlations: Waterbirds, CelebA (classifying hair color, biased by gender), MetaShift, Living-17, and BAR.
Performance Comparison
The results are impressive. As shown in Table 2 below, PPA (bottom row) consistently outperforms other unsupervised methods (like JTT, CnC, and various prompting strategies).

- Waterbirds: PPA achieves 84.3% worst-group accuracy using CLIP ResNet-50, beating the previous state of the art (CFR, at 76.9%) by a significant margin.
- CelebA: The improvement is even more stark, jumping from ~77% (previous methods) to 91.1%.
- Efficiency: Crucially, PPA achieves this by tuning <0.01% of the parameters, whereas many baselines require heavier tuning.
Similar dominance is seen on the Living-17 and BAR datasets:

Ablation: Do we really need Projection and Logit Adjustment?
You might wonder whether Step 1 (Projection) and Step 2 (GLA) are actually necessary. The authors performed an ablation study to verify this.

- Row (a) vs (b): Just adding Group Logit Adjustment (GLA) improves performance, but not enough.
- Row (b) vs (d): Adding the Projection step (Step 1) provides a massive jump in accuracy (e.g., on Waterbirds, from 54.4% to 84.3%). This confirms that standard models are not “biased enough” to identify the minority groups accurately; the projection is essential.
- Row (d) vs (e): Row (e) uses Ground Truth group labels. PPA (Row d) comes remarkably close to the performance of using ground truth labels, validating the quality of the pseudo-labels.
Sensitivity to Hyperparameters
The method introduces \(\tau\) (tau) in the Logit Adjustment loss. Does the method break if \(\tau\) isn’t perfect?

The graph above shows that performance is relatively stable around \(\tau=1.0\), which aligns with the theoretical prediction (Prop 2) that \(\tau=1\) is optimal for minimizing balanced group error.
Visualizing the Data
To make these results concrete, let’s look at what the “groups” actually look like.
Waterbirds: The distinction is clear—Waterbirds on water vs. land, and Landbirds on water vs. land.

CelebA: The task is detecting “Blond Hair.” The spurious feature is gender: most blond examples in the dataset are female, so the minority group is “Male with Blond Hair,” which is rare and drives the worst-group accuracy.

Conclusion and Implications
The Project-Probe-Aggregate (PPA) paper offers a masterclass in how to handle bias in modern AI systems. Instead of fighting the model’s tendency to learn shortcuts with brute force (more data, more training), it uses a judo-like approach:
- Leans into the bias: Use projection to make the bias worse so it’s easier to detect.
- Mathematically corrects: Use logit adjustment to theoretically balance the groups.
- Simplifies: Collapse everything back into a simple linear classifier.
For students and practitioners, the key takeaways are:
- Foundation models carry the biases of the web. You cannot assume CLIP “knows” what a bird is just because it has high accuracy.
- Unsupervised debiasing is possible. We don’t always need expensive annotations to fix fairness issues.
- Linear probes are powerful. You can fix deep structural issues in a model just by intelligently training the final classification layer.
As AI models continue to grow, efficient, mathematically grounded techniques like PPA will be essential for ensuring these systems are robust, fair, and reliable in the real world.