The rise of Vision-Language Models (VLMs) like CLIP has fundamentally changed how we approach computer vision. Instead of training massive networks from scratch, we now have the luxury of “prompting” pre-trained models to recognize concepts they have already learned. By feeding a model an image and a text description like “A photo of a dog,” we can achieve zero-shot classification with impressive accuracy.

However, crafting these text prompts by hand is tedious and suboptimal. This led to the rise of Prompt Learning (or Prompt Tuning), where we treat the text prompt as a set of learnable vectors and optimize them using back-propagation. It is a lightweight, efficient way to adapt massive models to specific downstream tasks.
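To make “learnable vectors” concrete, here is a minimal PyTorch sketch of the idea behind CoOp-style prompt tuning. The shapes and the way class-name embeddings are obtained are illustrative stand-ins, not the actual CLIP API: the only trainable parameters are a handful of context vectors prepended to each class-name embedding, while the VLM backbone stays frozen.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """CoOp-style prompt: a few learnable context vectors shared across classes."""
    def __init__(self, n_ctx=16, ctx_dim=512):
        super().__init__()
        # The only trainable parameters: n_ctx context vectors.
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)

    def forward(self, class_name_embeddings):
        # class_name_embeddings: (num_classes, n_tokens, ctx_dim), precomputed and frozen.
        num_classes = class_name_embeddings.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(num_classes, -1, -1)
        # Prepend the shared context to every class-name token sequence.
        return torch.cat([ctx, class_name_embeddings], dim=1)

prompt = LearnablePrompt()
optimizer = torch.optim.SGD(prompt.parameters(), lr=2e-3)  # only ~16x512 values are updated
```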

But there is a catch. Like all deep learning methods, prompt learning relies on labeled data. In the real world, datasets are messy. Annotators make mistakes, and web-scraped data is often mislabeled. This phenomenon, known as Label Noise, can be catastrophic. When we force a model to learn from wrong answers, it overfits to errors, degrading its performance on clean data.

In this post, we will dive deep into NLPrompt, a novel approach presented at a major computer vision conference that solves this problem. We will explore how a surprisingly simple loss function—Mean Absolute Error (MAE)—can stabilize training, and how Optimal Transport can be used to “purify” dirty datasets automatically.

The Paradox of Loss Functions

To understand NLPrompt, we first need to look at how we typically train these models. The standard objective function for classification is Cross-Entropy (CE) Loss.

CE loss is fantastic for clean data because it aggressively penalizes incorrect predictions, forcing the model to converge quickly. However, this aggression is its downfall when noise is present. If an image of a “Cat” is mislabeled as a “Dog,” CE loss will scream at the model to update its weights until it classifies the cat as a dog. The model essentially memorizes the noise.

In contrast, Mean Absolute Error (MAE) is a loss function often used in regression, but rarely in classification. In traditional deep learning (training ResNets from scratch), MAE is notorious for slow convergence and getting stuck in local optima.

However, the researchers behind NLPrompt discovered something counter-intuitive: In the specific context of Prompt Learning, MAE is a superhero.
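Before looking at why, a toy numerical example (not from the paper) makes the contrast concrete. Suppose the model is very confident an image is class 0 (“cat”), but the noisy label says class 1 (“dog”):

```python
import torch
import torch.nn.functional as F

# Confident prediction for class 0, but the noisy label says class 1.
logits = torch.tensor([[6.0, -2.0, -2.0]])
noisy_label = torch.tensor([1])

probs = F.softmax(logits, dim=-1)
one_hot = F.one_hot(noisy_label, num_classes=3).float()

ce = F.cross_entropy(logits, noisy_label)          # ~8.0: huge, unbounded penalty
mae = (one_hot - probs).abs().sum(dim=-1).mean()   # ~2.0: bounded above by 2

print(f"CE loss:  {ce.item():.2f}")
print(f"MAE loss: {mae.item():.2f}")
```

CE hands the optimizer a loss of about 8, and it grows without bound as the model’s (correct) confidence increases, while MAE can never exceed 2 no matter how wrong the labeled class looks.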

Because prompt learning only optimizes a tiny number of parameters (the prompt vectors) while keeping the massive VLM backbone frozen, the optimization landscape is different. As shown in the figure below, the researchers found that while CE loss performance collapses as noise increases, MAE remains rock-steady.

Figure 1: The performance of training with MAE loss and CE loss in prompt learning on the Caltech101 dataset.

In Figure 1, look at the bottom right graph (75% noise). The red line (CE Loss) plummets to near-uselessness. The blue line (MAE Loss) barely flinches. This observation is the foundation of NLPrompt.

The Theory: Why is MAE Robust?

Why does MAE survive where CE fails? To answer this, the authors utilize Feature Learning Theory.

In a simplified view, we can decompose the features a model learns into two categories:

  1. Task-Relevant Features: Signals that actually help classify the image (e.g., whiskers for a cat).
  2. Task-Irrelevant Features: Noise or background details that don’t matter.

When training with noisy labels, the gradient updates from CE loss are unbounded. If the label is wrong, the gradient can be huge, forcing the model to pick up on task-irrelevant features to “explain away” the wrong label.

MAE, mathematically, is bounded. The gradient of the absolute error \(|y - s|\) is effectively constant (either 1 or -1). This means that a massive error (a noisy label) contributes roughly the same amount to the gradient update as a moderate error. It prevents the noisy samples from dominating the learning process.
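To make the boundedness explicit, write both losses in terms of the softmax output \(\mathbf{s}\) and the (possibly wrong) one-hot label \(\mathbf{y}\), and compare their derivatives:

\[ \frac{\partial}{\partial s_j} \lvert y_j - s_j \rvert = -\operatorname{sign}(y_j - s_j) \in \{-1, +1\}, \qquad \frac{\partial}{\partial s_y} \bigl( -\log s_y \bigr) = -\frac{1}{s_y}. \]

For a mislabeled sample, the probability \(s_y\) assigned to the wrong labeled class is typically tiny, so the CE gradient \(-1/s_y\) can be enormous, while every component of the MAE gradient has magnitude at most 1.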

The researchers proved theoretically that the ratio of task-relevant to task-irrelevant feature learning is superior under MAE when noise is present.

\[ \frac{\Delta \beta_{\mathrm{MAE}}^{(t)}}{\Delta \beta_{\mathrm{CE}}^{(t)}} = \frac{1}{2\,\mathbb{E}[s_y]} \cdot \frac{1 - \frac{p}{1 - \mathbb{E}[s_y]}}{1 - 2p} > \frac{1}{2\,\mathbb{E}[s_y]}, \]

Inequality showing the ratio of update coefficients for MAE vs CE.

While the math above might look dense, the inequality essentially states that the update step (\(\Delta\)) for the useful features (\(\beta\)) is consistently stronger for MAE relative to CE in noisy conditions.
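As a purely illustrative sanity check (the numbers below are hypothetical, not from the paper), we can plug values such as a noise rate \(p = 0.2\) and an expected confidence \(\mathbb{E}[s_y] = 0.3\) into the inequality:

```python
# Hypothetical values, chosen only to illustrate the inequality.
p = 0.2       # label-noise rate
E_sy = 0.3    # expected softmax probability on the labeled class

lhs = (1 / (2 * E_sy)) * (1 - p / (1 - E_sy)) / (1 - 2 * p)
rhs = 1 / (2 * E_sy)

print(f"ratio of MAE/CE update coefficients: {lhs:.3f}")  # ~1.984
print(f"lower bound 1 / (2 E[s_y]):          {rhs:.3f}")  # ~1.667
assert lhs > rhs
```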

The Solution: NLPrompt

While MAE is robust, it isn’t perfect. On perfectly clean data, CE loss still yields higher accuracy because it converges faster and more precisely.

The ideal solution would be a hybrid: Use Cross-Entropy for the clean data and MAE for the noisy data.

But how do we know which data is clean and which is noisy without human checking? This is where the NLPrompt (Noise-Label Prompt) framework comes in. It uses a clever technique called Prompt-based Optimal Transport (PromptOT) to purify the dataset.

The Architecture

Let’s look at the overall framework of NLPrompt:

Figure 2: The framework of NLPrompt, using Optimal Transport to separate clean and noisy samples.

The process works in three steps (a minimal code sketch of one training iteration follows the list):

  1. Feature Extraction: Images are passed through the Image Encoder, and class names are passed through the Text Encoder (using the current prompts).
  2. PromptOT Purification: An Optimal Transport solver categorizes samples into “Clean” (green) or “Noisy” (yellow).
  3. Harmonized Training: Clean samples update the prompt using CE Loss; noisy samples update the prompt using MAE Loss.
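The sketch below outlines what one iteration under this scheme might look like. The encoder, purification, and loss callables are hypothetical placeholders for the components detailed in Steps 1–4 below, not the authors’ code; only the prompt parameters receive gradients.

```python
import torch

def nlprompt_iteration(images, labels, prompt, optimizer,
                       image_encoder, text_encoder, purify, harmonized_loss):
    """One training step: extract features, purify with PromptOT, apply the hybrid loss."""
    with torch.no_grad():
        image_feats = image_encoder(images)            # frozen image encoder
    text_feats = text_encoder(prompt)                  # frozen text encoder, learnable prompt

    clean_mask = purify(text_feats, image_feats, labels)  # Steps 1-3: clean vs. noisy split
    logits = image_feats @ text_feats.T                   # image-to-class similarity scores

    loss = harmonized_loss(logits, labels, clean_mask)    # Step 4: CE on clean, MAE on noisy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```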

Step 1: Prompt-based Optimal Transport (PromptOT)

Standard methods for separating clean/noisy data often use clustering (like K-Means). However, K-Means requires random initialization, which can be unstable.

VLMs give us a massive advantage: Latent Alignment. The text encoder and image encoder map concepts to the same space. The text embedding for “A photo of a dog” is effectively the perfect “centroid” or prototype for dog images.

NLPrompt uses these text features as anchors. It calculates the cost matrix between all image features in a batch and these text prototypes. The cost is defined as the negative log similarity:

\[ \min_{\mathbf{Q} \in \mathbb{R}_{+}^{C \times N}} \ \langle -\log(\mathbf{T} \cdot \mathbf{I}^{\top}), \mathbf{Q} \rangle \quad \text{s.t.} \quad \mathbf{Q} \mathbb{1}_N = \frac{1}{C} \mathbb{1}_C, \quad \mathbf{Q}^{\top} \mathbb{1}_C = \frac{1}{N} \mathbb{1}_N. \]

The Optimal Transport optimization problem.

Here, \(\mathbf{T}\) is the text feature matrix and \(\mathbf{I}\) is the image feature matrix. The goal is to find a mapping matrix \(\mathbf{Q}\) that minimizes the transport cost. This effectively asks: “What is the most efficient way to assign these images to these class prototypes, assuming the classes are roughly balanced?”

By solving this (using the fast Sinkhorn-Knopp algorithm), we get a matrix \(\mathbf{Q}\) that acts as a “soft” assignment of images to labels.
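Below is a minimal sketch of this purification step in PyTorch, assuming `text_feats` (C×d) and `image_feats` (N×d) are L2-normalized outputs of the frozen encoders. For numerical stability the sketch feeds the raw similarities into the entropic-OT kernel rather than taking the literal \(-\log\) of the similarity matrix; the Sinkhorn loop then alternately rescales rows and columns to satisfy the two marginal constraints above.

```python
import torch

def prompt_ot(text_feats, image_feats, n_iters=20, eps=0.05):
    """Soft assignment Q (C x N) of images to class prototypes via Sinkhorn-Knopp."""
    C, N = text_feats.shape[0], image_feats.shape[0]
    sim = text_feats @ image_feats.T              # (C, N) cosine similarities
    K = torch.exp(sim / eps)                      # entropic-OT kernel (similarity as negative cost)
    r = torch.full((C,), 1.0 / C)                 # row marginal:    Q 1_N   = (1/C) 1_C
    c = torch.full((N,), 1.0 / N)                 # column marginal: Q^T 1_C = (1/N) 1_N
    u = torch.ones(C)
    for _ in range(n_iters):
        u = r / (K @ (c / (K.T @ u)))             # alternating row/column rescaling
    v = c / (K.T @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)    # Q satisfies both marginals above
```

The `eps` temperature controls how “soft” the assignment is; smaller values push Q toward a hard assignment.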

Step 2: Generating Pseudo-Labels

Once we have the optimal transport matrix \(\mathbf{Q}\), we can determine the “true” class of an image according to the model’s current understanding. The pseudo-label \(\tilde{y}_i\) is simply the class with the highest probability in the transport matrix:

\[ \tilde{y}_i = \arg\max_j \mathbf{Q}_{ij}. \]

Pseudo-label assignment equation.
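Continuing the sketch from Step 1, this is a single argmax over the class dimension of \(\mathbf{Q}\) (shape C×N):

```python
Q = prompt_ot(text_feats, image_feats)   # (C, N) soft assignment from the Step 1 sketch
pseudo_labels = Q.argmax(dim=0)          # (N,) most likely class index for each image
```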

Step 3: Data Splitting

Now we compare the generated pseudo-label \(\tilde{y}_i\) with the actual dataset label \(\hat{y}_i\).

  • If they match, the data is likely Clean.
  • If they disagree, the data is likely Noisy.

\[ \mathcal{D}_{\mathrm{clean}} = \{ i \mid \hat{y}_i = \tilde{y}_i \}, \quad \mathcal{D}_{\mathrm{noisy}} = \{ j \mid \hat{y}_j \neq \tilde{y}_j \}. \]

Data partitioning equation.
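In code, continuing the same sketch, the split is a boolean comparison between the dataset’s (possibly noisy) labels, held here in a hypothetical `dataset_labels` tensor, and the pseudo-labels from the previous step:

```python
clean_mask = dataset_labels == pseudo_labels   # agree  -> treated as clean
noisy_mask = ~clean_mask                       # differ -> treated as noisy
```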

Step 4: The Harmonized Loss

Finally, the prompts are updated using a combined loss function. The clean subset is trained with Cross-Entropy to refine the decision boundaries sharply. The noisy subset is trained with Mean Absolute Error to learn robust features without overfitting to the specific wrong labels.

\[ \ell_{\mathrm{NLPrompt}} = \sum_{i \in \mathcal{D}_{\mathrm{clean}}} -\mathbf{y}_i^{\top} \log \mathbf{s}_i + \sum_{j \in \mathcal{D}_{\mathrm{noisy}}} \| \mathbf{y}_j - \mathbf{s}_j \|_1. \]

The final NLPrompt loss function combining CE and MAE.
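A minimal PyTorch version of this combined objective might look as follows, assuming `logits` holds the image-to-class similarity scores, `labels` the (possibly noisy) dataset labels, and `clean_mask` the boolean split from the previous step; temperature scaling and other implementation details of the actual method are omitted.

```python
import torch
import torch.nn.functional as F

def harmonized_loss(logits, labels, clean_mask):
    """CE on the clean split, MAE (L1 distance on probabilities) on the noisy split.
    Assumes both splits are non-empty; a real implementation should guard for that."""
    probs = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(labels, num_classes=logits.shape[-1]).float()

    ce = F.cross_entropy(logits[clean_mask], labels[clean_mask], reduction="sum")
    mae = (one_hot[~clean_mask] - probs[~clean_mask]).abs().sum()
    return ce + mae
```

The sums mirror the equation above; in practice you may prefer per-split means to keep the two terms on comparable scales.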

Experimental Results

The researchers evaluated NLPrompt across several datasets (Caltech101, Flowers102, UCF101, etc.) under varying noise levels. They tested Symmetric Noise (random flipping) and Asymmetric Noise (flipping to visually similar classes, which is much harder).
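For readers who want to reproduce this kind of stress test, here is one common recipe for injecting symmetric label noise (a generic sketch, not necessarily the paper’s exact protocol):

```python
import torch

def add_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Randomly replace a fraction of labels with a different, uniformly chosen class."""
    g = torch.Generator().manual_seed(seed)
    labels = labels.clone()
    flip = torch.rand(len(labels), generator=g) < noise_rate
    # Draw a random offset in [1, num_classes-1] so the new label always differs.
    offsets = torch.randint(1, num_classes, (int(flip.sum()),), generator=g)
    labels[flip] = (labels[flip] + offsets) % num_classes
    return labels
```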

Synthetic Noise Performance

Table 1 below summarizes the results. The comparison includes standard CoOp, “GCE” (Generalized Cross Entropy, a common robust loss), and JoAPR (a previous state-of-the-art method).

Table 1: Performance across various datasets and noise levels.

Key Takeaways:

  1. Dominance at High Noise: Look at the 50% and 75% noise columns. NLPrompt (bottom row of each block) frequently beats CoOp by massive margins (sometimes +30-40% accuracy).
  2. Consistency: Even at low noise levels (12.5%), NLPrompt remains competitive or superior, showing that the “Purification” step doesn’t accidentally discard good data.

Real-World Noise: Food101N

Synthetic noise is a good stress test, but real-world noise is unpredictable. The authors tested on Food101N, a dataset of food images scraped from the web with naturally occurring label errors (about 20% noise).

Table 2: Test accuracy on Food101N.

As shown in Table 2, NLPrompt achieves 76.46% accuracy, outperforming the baseline CoOp (69.50%) and the previous best method JoAPR (72.57%). This confirms that the PromptOT mechanism works on real-world data distributions, not just synthetic setups.

Ablation Study: Does OT Matter?

You might wonder: “Is the Optimal Transport part really necessary? Can’t we just use MAE on everything?”

The ablation study below answers this clearly.

Ablation studies on Flowers102.

  • Row (b): Using MAE on all data (PromptMAE) gets 85.90% average accuracy. This is good!
  • NLPrompt: Using the full purification strategy gets 92.00%.
  • Row (d) vs (e): Interestingly, training only on the clean data (d) is worse than the full method. This implies that even the noisy data contains useful visual signals that MAE can extract safely.

Conclusion

NLPrompt offers a compelling solution to the noisy label problem in the era of Foundation Models. Rather than reinventing the wheel with complex correction networks, it leverages the inherent strengths of Vision-Language Models:

  1. Robustness via MAE: Taking advantage of the prompt learning landscape to use a loss function whose bounded gradients keep noisy labels from dominating training.
  2. Alignment via OT: Using the pre-aligned text-image space to identify and separate noisy labels without external supervision.

For students and practitioners, this paper highlights an important lesson: Standard practices (like always using Cross-Entropy) should be questioned when the data or the training paradigm (like prompt tuning) changes. Sometimes, the simplest tools—like Mean Absolute Error—just need the right context to shine.