Introduction: The “Waterbird” Problem
Imagine you are training an AI to classify birds. You feed it thousands of images of waterbirds (like ducks) and landbirds (like sparrows). The model achieves 99% accuracy on your validation set. You are ready to deploy.
But then, disaster strikes. You show the model a duck standing on a grassy field, and it confidently shouts “Landbird!” You show it a sparrow flying over a lake, and it predicts “Waterbird!”
What went wrong?
This is a classic case of spurious correlation. In the training data, 95% of waterbirds appeared against a water background. The model, being a lazy learner, didn’t learn to recognize the beak or the feathers; it simply learned that blue background = waterbird. It relied on a “shortcut” feature.
For modern foundation models like CLIP, which are trained on massive web-scale data, this problem is pervasive. While these models have high average accuracy, their worst-group accuracy (e.g., classifying waterbirds that are on land) is often abysmal.
Fixing this usually requires one of two expensive things:
- Retraining the whole model to unlearn the bias (computationally expensive).
- Manually labeling groups (telling the model “this is a duck on land”) so it can learn explicitly (annotation expensive).
In this post, we are doing a deep dive into a fascinating paper titled “Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness.” The authors propose a clever, three-step method called PPA that fixes these biases without retraining the whole model and without needing expensive group labels. It achieves state-of-the-art results by tuning less than 0.01% of the parameters.
Background: Robustness and The Failure-Based Strategy
Before turning to the solution, we must formalize the problem. In standard machine learning, we typically minimize the average error over the data distribution \(\mathbb{P}\):

\[
\min_{\theta}\ \mathbb{E}_{(x,y)\sim\mathbb{P}}\big[\ell\big(f_\theta(x),\, y\big)\big]
\]
However, when spurious correlations exist, minimizing average error allows the model to ignore minority groups (like landbirds on water). To measure true robustness, we look at Worst-Group Accuracy (WGA). We divide the data into groups \(g \in \mathcal{G}\) (e.g., Waterbird+Water, Waterbird+Land, Landbird+Water, Landbird+Land) and measure the error on the hardest group:

\[
\max_{g\in\mathcal{G}}\ \mathbb{E}_{(x,y)\sim\mathbb{P}_g}\big[\ell\big(f_\theta(x),\, y\big)\big]
\]
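As a concrete reference, here is a minimal sketch of how worst-group accuracy can be computed from per-example predictions and group ids; the function name and array layout are illustrative, not from the paper.

```python
import numpy as np

def worst_group_accuracy(preds, labels, groups):
    """Return the lowest per-group accuracy.

    preds, labels, groups: 1-D integer arrays of equal length;
    `groups` holds a group id per example (e.g., 0..3 for the four Waterbirds groups).
    """
    per_group = [
        (preds[groups == g] == labels[groups == g]).mean()
        for g in np.unique(groups)
    ]
    return min(per_group)
```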
The Unsupervised Challenge
The hardest version of this problem is unsupervised group robustness. This means we do not have labels telling us which image belongs to which group (\(g\)). We only have the class labels \(y\) (bird type) and the images \(x\).
A common strategy to solve this is the Failure-Based Debiasing Scheme. The logic is simple but powerful:
- Train a standard, “lazy” model (often called Empirical Risk Minimization or ERM).
- Let it overfit to the spurious features.
- Identify the examples the model gets wrong. These are likely the minority groups (the “counter-bias” examples).
- Train a second model that pays extra attention to these hard examples (a minimal sketch of this scheme is shown below).
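Here is a minimal sketch of this failure-based scheme, in the spirit of the JTT baseline cited later in this post; the function name and the upweighting factor are illustrative assumptions.

```python
import numpy as np

def failure_based_weights(erm_preds, labels, upweight=20.0):
    """Upweight examples a standard ERM model got wrong.

    The error set is a noisy proxy for the minority ("counter-bias") groups;
    the returned per-example weights are used when training the second model.
    """
    error_set = erm_preds != labels
    return np.where(error_set, upweight, 1.0)
```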
The paper we are analyzing improves drastically on this framework. The authors argue that standard models aren’t “biased enough” to cleanly identify the minority groups, and standard re-weighting isn’t optimal.
The PPA Method
The authors propose Project-Probe-Aggregate (PPA). It is a parameter-efficient fine-tuning method, meaning the massive pre-trained backbone (like CLIP) is frozen, and we only train a small linear layer.
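To make that setup concrete, here is a minimal sketch of linear probing on a frozen CLIP backbone, using the OpenAI `clip` package; the class count, helper names, and normalization are assumptions for illustration, not the authors' training code.

```python
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

# Freeze the entire backbone; PPA never updates it.
for p in model.parameters():
    p.requires_grad_(False)

# The only trainable parameters: a small linear head on top of the
# 1024-dimensional image embeddings of CLIP ResNet-50.
num_classes = 2  # e.g., waterbird vs. landbird
linear_head = torch.nn.Linear(1024, num_classes).to(device)

@torch.no_grad()
def embed_images(image_batch):
    feats = model.encode_image(image_batch).float()
    return feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize, as is standard for CLIP
```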
The method consists of three distinct steps. Let’s break them down.
Step 1: Project (Creating the “Super-Biased” Model)
To find the minority groups (e.g., ducks on land), we first need a model that relies heavily on the background (water vs. land). If we can build a model that only looks at the background, it will definitely misclassify the ducks on land.
Standard training tries to learn the object and the background. The authors propose a way to force the model to ignore the object (the class) and focus only on the spurious features.
They utilize the text encoder of the foundation model (CLIP). They take the text embeddings of the class names (e.g., “A photo of a waterbird”) to create a matrix \(Z\). These embeddings represent the “core” features of the class.
To force the model to look at everything except the class concept, they mathematically project the image features onto the nullspace of these class proxies.
The projection matrix \(\Pi\) is calculated as:

\[
\Pi = I - Z\big(Z^{\top}Z\big)^{-1}Z^{\top}
\]
Here, \(Z\) represents the class features. Multiplying by \(\Pi\) effectively removes the directions in the feature space that correspond to the class definitions.
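In code, the projection is a few lines of linear algebra. A sketch, assuming `Z` stacks the d-dimensional CLIP text embeddings of the class prompts as columns:

```python
import torch

@torch.no_grad()
def nullspace_projector(Z):
    """Build the projector that removes the class-proxy directions.

    Z: (d, num_classes) text embeddings of the class prompts.
    Returns Pi = I - Z (Z^T Z)^{-1} Z^T, a (d, d) matrix.
    """
    d = Z.shape[0]
    eye = torch.eye(d, device=Z.device, dtype=Z.dtype)
    return eye - Z @ torch.linalg.inv(Z.T @ Z) @ Z.T

# Usage: project the frozen image features before training the biased linear probe.
# features: (N, d); Pi is symmetric, so right-multiplication is enough.
# projected = features @ nullspace_projector(Z)
```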
We then train a linear classifier \(f_b\) (the biased model) on these “projected” features:

\[
\min_{w_b}\ \mathbb{E}_{(x,y)}\Big[\ell\big(w_b^{\top}\,\Pi\,\phi(x),\ y\big)\Big],
\]

where \(\phi(x)\) denotes the frozen image embedding of \(x\).
Why does this work? By removing the signal that describes the actual object (the bird), the classifier has no choice but to rely on whatever signal is left to predict the label. In datasets with spurious correlations, the remaining signal is the strong spurious feature (the background).
The authors prove mathematically (we will touch on this later) that this projection amplifies the model’s reliance on spurious correlations.
To verify this, look at the precision and recall of identifying minority groups below. The PPA method (green) significantly outperforms standard ERM (orange) and other methods in finding the “worst-group” examples.

Step 2: Probe (Scoring the Groups)
Now that we have a “super-biased” model \(f_b\), we use it to create pseudo-group labels.
If the biased model predicts the class correctly, the image likely follows the spurious correlation (Majority Group). If it predicts incorrectly, the image likely violates the correlation (Minority Group).
We define a pseudo-attribute \(\hat{a}\) based on whether the biased model was wrong:

\[
\hat{a} = \mathbb{1}\big[\, f_b(x) \neq y \,\big]
\]
Now, every image has a pseudo-group label \(\hat{g} = (y, \hat{a})\).
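A sketch of this labeling step; `f_b` is the biased probe trained on projected features, and the way \((y, \hat{a})\) is flattened into a single group index is an implementation choice, not prescribed by the paper.

```python
import torch

@torch.no_grad()
def pseudo_group_labels(f_b, projected_feats, labels):
    """Assign pseudo-groups (y, a_hat), with a_hat = 1 iff the biased probe is wrong."""
    preds = f_b(projected_feats).argmax(dim=1)
    a_hat = (preds != labels).long()   # 0: follows the spurious correlation, 1: violates it
    return labels * 2 + a_hat          # e.g., 4 groups for a 2-class problem
```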
Next, we train a Probe—a new linear classifier \(h_d\)—to predict these pseudo-group labels. But we don’t just use standard cross-entropy. We need to account for the fact that these groups are heavily imbalanced.
The authors introduce Group Logit Adjustment (GLA). This loss function adds a margin to the logits based on the estimated group priors \(\hat{\beta}\) (how frequent each group is):

\[
\mathcal{L}_{\mathrm{GLA}}(x, \hat{g}) = -\log \frac{\exp\!\big(h_{\hat{g}}(x) + \tau \log \hat{\beta}_{\hat{g}}\big)}{\sum_{g' \in \mathcal{G}} \exp\!\big(h_{g'}(x) + \tau \log \hat{\beta}_{g'}\big)},
\]

where \(h_{g}(x)\) is the probe’s logit for group \(g\).
Here, \(\tau\) is a hyperparameter that controls how much we correct for the imbalance. This step effectively trains a model that is excellent at distinguishing between the four scenarios (Waterbird on Water, Waterbird on Land, etc.), even though it never saw ground-truth group labels.
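Here is a minimal sketch of such a group-level logit-adjusted loss, following the standard logit-adjustment recipe; the exact form used in the paper may differ in details.

```python
import torch
import torch.nn.functional as F

def group_logit_adjusted_loss(logits, group_labels, group_priors, tau=1.0):
    """Cross-entropy over pseudo-groups with a prior-dependent margin.

    logits:       (B, num_groups) raw outputs of the linear probe h_d
    group_labels: (B,) pseudo-group indices
    group_priors: (num_groups,) estimated group frequencies (summing to 1)
    """
    adjusted = logits + tau * torch.log(group_priors).unsqueeze(0)
    return F.cross_entropy(adjusted, group_labels)
```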
Step 3: Aggregate (The Final Classifier)
We now have a probe \(h_d\) that predicts group labels. But for our final task, we don’t want to predict “Waterbird on Land”; we just want to predict “Waterbird.”
During inference, computing group probabilities and then summing them adds extra computational overhead. The authors propose a clever simplification called Weight-Space Aggregation.
Since the probe is a linear model (\(W_d\)), the weights for a specific class \(y\) can simply be the sum of the weights for all groups belonging to that class:

\[
w_y = \sum_{g \in \mathcal{G}_y} w_{d,g},
\]

where \(\mathcal{G}_y\) is the set of groups whose class label is \(y\) and \(w_{d,g}\) is the probe’s weight vector for group \(g\).
This results in a final debiased classifier \(f_d\) that is just a standard linear layer, incurring zero extra inference cost compared to a standard model.
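In code, the aggregation is a single reshape-and-sum over the probe’s weight matrix. A sketch that assumes the groups of each class are stored contiguously, matching the flattening used in the earlier pseudo-labeling sketch:

```python
import torch

def aggregate_probe_weights(W_probe, num_classes):
    """Collapse group weights into per-class weights: w_y = sum of w_g over groups of class y.

    W_probe: (num_groups, d) weight matrix of the probe h_d, with each class's
             groups stored contiguously (class 0's groups first, then class 1's, ...).
    Returns a (num_classes, d) weight matrix for the final debiased classifier f_d.
    """
    groups_per_class = W_probe.shape[0] // num_classes
    return W_probe.reshape(num_classes, groups_per_class, -1).sum(dim=1)
```

Any bias terms of the probe can be aggregated the same way, so the final classifier really is just a standard linear layer.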
Theoretical Analysis: Why This Works
The paper provides rigorous theoretical backing for the intuition used in Step 1 and Step 2.
Why Projection Amplifies Bias (Proposition 1)
The authors analyze a linear regression setting where the target \(y\) depends on core features \(c\) and spurious features \(s\).

They mathematically demonstrate that when you project out the core features (using \(\Pi\)), the weight assigned to the spurious features (\(\gamma'\)) in the new model increases compared to the original weight (\(\gamma\)).

Because the denominator is positive and the correlation term \(\mathbf{r}_s^\top \mathbf{r}_{y_o}\) is usually positive in spurious datasets (background correlates with label), \(\gamma' > \gamma\). This proves that the projected model is mathematically forced to be more biased.
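The effect is easy to reproduce in a toy simulation; this is an illustrative sketch with made-up coefficients, not the paper’s exact setting. We generate a spurious feature correlated with the core feature, then compare the spurious weight with and without the core direction projected out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
c = rng.normal(size=n)                   # core feature (the bird itself)
s = 0.9 * c + 0.3 * rng.normal(size=n)   # spurious feature, correlated with the core
y = 1.0 * c + 0.2 * s + 0.1 * rng.normal(size=n)

X_full = np.column_stack([c, s])
gamma_full = np.linalg.lstsq(X_full, y, rcond=None)[0][1]   # spurious weight, roughly 0.2

X_proj = np.column_stack([np.zeros(n), s])                  # core direction projected out
gamma_proj = np.linalg.lstsq(X_proj, y, rcond=None)[0][1]   # much larger, roughly 1.2

print(f"spurious weight (full features):      {gamma_full:.2f}")
print(f"spurious weight (core projected out): {gamma_proj:.2f}")
```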
Bayes Optimality (Proposition 2)
The authors also seek to minimize the Balanced Group Error (BGE), which treats all groups equally regardless of their size in the training set:

\[
\mathrm{BGE}(f) = \frac{1}{|\mathcal{G}|} \sum_{g \in \mathcal{G}} \mathbb{E}_{(x,y)\sim\mathbb{P}_g}\big[\mathbb{1}[f(x) \neq y]\big]
\]
They prove that their aggregation strategy (summing the group logits minus the log group priors) is the Bayes Optimal classifier for minimizing BGE.

This theoretical result explains why the Group Logit Adjustment in Step 2 is crucial—it aligns the training objective with the ultimate goal of balanced performance.
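For intuition, here is a sketch of the standard argument (not the paper’s exact proof). Minimizing the BGE pointwise amounts to picking, for each input \(x\), the class whose groups carry the most group-balanced evidence:

\[
f^\star(x) \;=\; \arg\max_{y}\ \sum_{g \in \mathcal{G}_y} p(x \mid g)
\;=\; \arg\max_{y}\ \sum_{g \in \mathcal{G}_y} \frac{p(g \mid x)}{\beta_g},
\]

using Bayes’ rule \(p(x \mid g) = p(g \mid x)\, p(x) / \beta_g\), where \(\beta_g\) is the prior probability of group \(g\) and each group contributes to the BGE with equal weight \(1/|\mathcal{G}|\) rather than with \(\beta_g\). Writing \(p(g \mid x)/\beta_g = \exp\big(\log p(g \mid x) - \log \beta_g\big)\) recovers the “group logits minus log group priors” aggregation rule described above.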
Experiments and Results
The authors evaluated PPA on standard benchmarks known for spurious correlations: Waterbirds, CelebA (classifying hair color, biased by gender), MetaShift, Living-17, and BAR.
Performance Comparison
The results are impressive. As shown in Table 2 below, PPA (bottom row) consistently outperforms other unsupervised methods (like JTT, CnC, and various prompting strategies).

- Waterbirds: PPA achieves 84.3% worst-group accuracy using CLIP ResNet-50, beating the previous state of the art (CFR, at 76.9%) by a significant margin.
- CelebA: The improvement is even more stark, jumping from ~77% (previous methods) to 91.1%.
- Efficiency: Crucially, PPA achieves this by tuning <0.01% of the parameters, whereas many baselines require heavier tuning.
Similar dominance is seen on the Living-17 and BAR datasets:

Ablation: Do we really need Projection and Logit Adjustment?
You might wonder whether Step 1 (Projection) and Step 2 (GLA) are actually necessary. The authors performed an ablation study to verify this.

- Row (a) vs (b): Just adding Group Logit Adjustment (GLA) improves performance, but not enough.
- Row (b) vs (d): Adding the Projection step (Step 1) provides a massive jump in accuracy (e.g., on Waterbirds, from 54.4% to 84.3%). This confirms that standard models are not “biased enough” to identify the minority groups accurately; the projection is essential.
- Row (d) vs (e): Row (e) uses Ground Truth group labels. PPA (Row d) comes remarkably close to the performance of using ground truth labels, validating the quality of the pseudo-labels.
Sensitivity to Hyperparameters
The method introduces \(\tau\) (tau) in the Logit Adjustment loss. Does the method break if \(\tau\) isn’t perfect?

The graph above shows that performance is relatively stable around \(\tau=1.0\), which aligns with the theoretical prediction (Prop 2) that \(\tau=1\) is optimal for minimizing balanced group error.
Visualizing the Data
To make these results concrete, let’s look at what the “groups” actually look like.
Waterbirds: The distinction is clear—Waterbirds on water vs. land, and Landbirds on water vs. land.

CelebA: The task is detecting “Blond Hair.” The spurious feature is gender: most blond examples in the dataset are female, so the minority group is “Male with Blond Hair,” which is rare and drives the worst-group accuracy.

Conclusion and Implications
The Project-Probe-Aggregate (PPA) paper offers a masterclass in how to handle bias in modern AI systems. Instead of fighting the model’s tendency to learn shortcuts with brute force (more data, more training), it uses a judo-like approach:
- Leans into the bias: Use projection to make the bias worse so it’s easier to detect.
- Mathematically corrects: Use logit adjustment to theoretically balance the groups.
- Simplifies: Collapse everything back into a simple linear classifier.
For students and practitioners, the key takeaways are:
- Foundation models carry the biases of the web. You cannot assume CLIP “knows” what a bird is just because it has high accuracy.
- Unsupervised debiasing is possible. We don’t always need expensive annotations to fix fairness issues.
- Linear probes are powerful. You can fix deep structural issues in a model just by intelligently training the final classification layer.
As AI models continue to grow, efficient, mathematically grounded techniques like PPA will be essential for ensuring these systems are robust, fair, and reliable in the real world.