When Data Lies: Robust Multi-View Clustering in a Noisy World

In the ideal world of machine learning research, data is clean, labels are accurate, and every input matches its description perfectly. In the real world, however, data is messy. Sensors glitch, annotators make mistakes, and datasets are full of noise.

Imagine you are training an AI to understand scenes using two “views”: an image from a camera and a caption from a text file. In a perfect dataset, a picture of a sheep is always paired with the text “A sheep on the grass.” But what happens if the data pipeline gets crossed? What if the picture of the sheep is paired with the text “A yellow boat on the beach”?

This phenomenon is known as Noisy Correspondence, and it can destroy the performance of unsupervised learning models. Furthermore, when these models try to teach themselves using Pseudo-labels (guessing the category of an image to learn from it), they often guess wrong, reinforcing their own errors.

Today, we are diving deep into a CVPR paper titled “ROLL: Robust Noisy Pseudo-label Learning for Multi-View Clustering with Noisy Correspondence.” This research tackles these two critical problems head-on, proposing a robust framework that allows AI to learn effectively even when the data is lying to it.

The Problem: The Double Whammy of Noise

To understand why this paper is significant, we first need to look at the specific domain: Multi-View Clustering (MVC).

MVC aims to group similar data points together (clustering) by leveraging information from multiple sources (views). For example, if you are clustering news stories, one view might be the headline, and another view might be the image accompanying the article. By looking at both, the algorithm should theoretically get a better understanding of the topic than by looking at one alone.

However, existing MVC methods rely on two dangerous assumptions:

  1. The Flawless Prediction Assumption: They assume that the pseudo-labels generated during the self-supervised training process are correct.
  2. The Perfect Alignment Assumption: They assume that View A and View B for the “same” data point actually correspond to each other.

When these assumptions fail, we face the Noisy Pseudo-label Problem (NPP) and the Noisy Correspondence Problem (NCP).

Figure 1. The motivation of our ROLL. Cross-view sample pairs are contaminated, which could cause some unaligned sample pairs to be mistaken for positive pairs, i.e., noisy correspondence problem.

As illustrated in Figure 1 above, these problems create a chaotic learning environment.

  • Noisy Correspondence (NCP): Look at the connections between the images and the text. The image of the sheep is correctly linked to the sheep text (the check mark), but the bus image might be incorrectly linked to the boat text (the cross). If the model tries to pull the “Bus” image and “Boat” text together in the feature space, it learns nonsense.
  • Noisy Pseudo-labels (NPP): On the right side, we see the clustering assignments. If the model incorrectly guesses that a “Car” belongs to the “Boat” cluster, and then uses that guess as ground truth for training, it creates a feedback loop of error.

The researchers behind ROLL (Robust nOisy pseudo-Label Learning) realized that to build a truly robust system, they had to solve both problems simultaneously.

The Solution: The ROLL Framework

The proposed method, ROLL, is designed to prevent the overfitting that usually occurs when deep learning models memorize noisy data. The framework operates in two stages: a Warm-up Stage, followed by a Robust Learning Stage that combines two components, noise-tolerance pseudo-label learning (NPL) and robust contrastive learning (RCL).

Figure 2. The framework of our ROLL. During the warm-up stage, we first obtain pseudo-labels of each view by contrastive learning (CL). Then, we conduct learning from noisy pseudo-labels through noise-tolerance pseudo-label learning (NPL).

Stage 1: The Warm-Up

Before the model can handle noise, it needs a basic understanding of the data. The authors employ a standard autoencoder architecture for this.

For every view \(v\) (e.g., image or text), there is an encoder \(E\) that compresses the input \(X\) into a latent representation \(Z\).

\[
Z^v = E^v(X^v) \tag{1}
\]

To ensure these representations retain meaningful information, the model tries to reconstruct the original input from \(Z\) using a decoder. The reconstruction loss is calculated as:

\[
\mathcal{L}_{rec} = \sum_{v=1}^{V} \left\| X^v - D^v\!\left(Z^v\right) \right\|_F^2 \tag{2}
\]

During this phase, the model also uses standard contrastive learning to align the different views. This essentially tells the model: “Make the representation of Image A look similar to the representation of Text A.” Once this warm-up is done, the model performs K-means clustering on the learned features to generate the initial—likely noisy—pseudo-labels.
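To make the warm-up concrete, here is a minimal NumPy sketch of the idea: one toy linear encoder/decoder per view stands in for the paper's networks, the reconstruction loss follows Equation 2, and a plain k-means pass over the learned features produces the initial pseudo-labels. All sizes, the linear layers, and the k-means implementation are illustrative, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-view data: 100 samples, 20 dimensions per view (hypothetical sizes).
X = {"v1": rng.normal(size=(100, 20)), "v2": rng.normal(size=(100, 20))}

# Linear encoder/decoder weights per view (stand-ins for the learned networks).
W_enc = {v: rng.normal(size=(20, 8)) * 0.1 for v in X}
W_dec = {v: rng.normal(size=(8, 20)) * 0.1 for v in X}

def encode(v):
    # Z^v = E^v(X^v): compress each view into a latent representation.
    return X[v] @ W_enc[v]

def reconstruction_loss():
    # Sum over views of the squared reconstruction error ||X^v - D^v(Z^v)||^2.
    return sum(np.sum((X[v] - encode(v) @ W_dec[v]) ** 2) for v in X)

def kmeans_labels(Z, k=3, iters=10):
    # Plain Lloyd's k-means: turns warm-up features into (noisy) pseudo-labels.
    C = Z[rng.choice(len(Z), k, replace=False)]
    for _ in range(iters):
        d = ((Z[:, None, :] - C[None]) ** 2).sum(-1)   # sample-to-center distances
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                C[j] = Z[labels == j].mean(0)
    return labels, C

labels, centers = kmeans_labels(encode("v1"))
# `labels` now plays the role of the initial -- likely noisy -- pseudo-labels.
```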

Stage 2: Learning from Noisy Pseudo-Labels (NPL)

This is where the paper innovates. Once we have initial pseudo-labels, we want to use them to supervise the training. This is usually done by calculating the probability that a sample \(i\) belongs to the cluster with center \(c_j\).

The predicted probability distribution is calculated using the softmax function over the similarities between the sample representation and the cluster centers:

\[
L_{ij}^v = \frac{\exp\!\left(\mathrm{sim}(z_i^v, c_j^v)/\tau\right)}{\sum_{k=1}^{K} \exp\!\left(\mathrm{sim}(z_i^v, c_k^v)/\tau\right)} \tag{3}
\]

where \(\tau\) is a temperature parameter.
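As a quick illustration of this soft assignment, the snippet below computes a softmax over similarities between sample representations and cluster centers. Using cosine similarity and a temperature of 0.5 are assumptions of this sketch, not values from the paper.

```python
import numpy as np

def soft_assignment(Z, C, tau=0.5):
    """Softmax over sample-to-center similarities: each row is a distribution
    over clusters (the L^v used as the predicted probability)."""
    # Cosine similarity between every representation z_i and every center c_j.
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    logits = (Zn @ Cn.T) / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
L = soft_assignment(rng.normal(size=(5, 8)), rng.normal(size=(3, 8)))
# Every row of L sums to 1: a valid probability distribution over clusters.
```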

In a standard, naive approach, you would simply minimize the Cross-Entropy (CE) loss between the predicted probability and the pseudo-label \(Y\):

\[
\mathcal{L}_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} Y_{ij} \log L_{ij}^v \tag{4}
\]

The Trap: If \(Y\) (the pseudo-label) is wrong, minimizing this loss forces the model to learn the mistake.

The Fix: The authors introduce Noise-tolerance Pseudo-label Learning (NPL). The core idea is to weight the loss based on cross-view consistency. If View A (image) and View B (text) both strongly agree on the cluster assignment, it is likely a correct label. If they disagree, the label is likely noise, and the model should learn less from it.

They introduce a weighting mechanism into the loss function:

\[
\mathcal{L}_{npl}^{(v,u)} = -\frac{1}{N} \sum_{i=1}^{N} \frac{(L_i^v)^{\top} L_i^u}{\|L_i^v\| \, \|L_i^u\|} \sum_{j=1}^{K} Y_{ij} \log L_{ij}^v \tag{7}
\]

Notice the fraction term in the equation above: its numerator is the dot product between the probability distributions of two different views (\(L^v\) and \(L^u\)), and the denominator normalizes it by their magnitudes.

  • High Agreement: If both views predict similar distributions, the dot product is large, the weight is high, and the model learns from this sample.
  • Low Agreement: If the views predict different clusters, the weight is low, effectively “silencing” this sample during the update so it doesn’t corrupt the model.

The total NPL loss sums this up across all views:

\[
\mathcal{L}_{npl} = \sum_{v=1}^{V} \sum_{u \neq v} \mathcal{L}_{npl}^{(v,u)} \tag{8}
\]
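Here is a compact NumPy sketch of this agreement-weighted cross-entropy. The normalized dot product as the weight follows the description above; the exact normalization and averaging are assumptions of the sketch.

```python
import numpy as np

def npl_loss(Lv, Lu, Y, eps=1e-12):
    """Cross-entropy weighted by cross-view agreement.

    Lv, Lu: (N, K) predicted cluster distributions from two views.
    Y:      (N, K) one-hot pseudo-labels.
    The normalized dot product between Lv[i] and Lu[i] is the agreement
    weight: near 1 when the views agree, near 0 when they point at
    different clusters, so disagreeing (likely noisy) samples are silenced.
    """
    agree = np.sum(Lv * Lu, axis=1) / (
        np.linalg.norm(Lv, axis=1) * np.linalg.norm(Lu, axis=1) + eps)
    ce = -np.sum(Y * np.log(Lv + eps), axis=1)
    return np.mean(agree * ce)

# Sample 0: both views confidently agree. Sample 1: the views disagree.
Lv = np.array([[0.9, 0.05, 0.05], [0.9, 0.05, 0.05]])
Lu = np.array([[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]])
Y  = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
# The disagreeing pair contributes far less to the loss than the agreeing one.
```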

Stage 3: Robust Multi-view Contrastive Learning (RCL)

The second major contribution tackles the Noisy Correspondence Problem (NCP).

Standard Contrastive Learning uses the InfoNCE loss. It tries to maximize the similarity between positive pairs (an image and its matching text) and minimize similarity with negative pairs.

First, similarity is measured using cosine distance:

\[
S_{ij} = \frac{(z_i^v)^{\top} z_j^u}{\|z_i^v\| \, \|z_j^u\|} \tag{9}
\]

Then, probabilities for positive (\(P^+\)) and negative (\(P^-\)) pairs are computed:

\[
P_i^{+} = \frac{\exp(S_{ii}/\tau)}{\sum_{j=1}^{N} \exp(S_{ij}/\tau)}, \qquad P_{ij}^{-} = \frac{\exp(S_{ij}/\tau)}{\sum_{k=1}^{N} \exp(S_{ik}/\tau)}, \quad j \neq i \tag{10}
\]

And the standard InfoNCE loss is applied:

\[
\mathcal{L}_{nce} = -\frac{1}{N} \sum_{i=1}^{N} \log P_i^{+} \tag{11}
\]

The Problem with InfoNCE: InfoNCE is notorious for focusing heavily on “hard samples”—pairs that are difficult to align. In a clean dataset, this is good; it forces the model to learn fine-grained details. In a noisy dataset, a “hard sample” is often just a mismatch (e.g., the Sheep image and Boat text). Forcing the model to align them causes overfitting to noise.

The Fix: The authors propose a Robust Multi-view Contrastive Loss (RCL). They introduce a control factor \(r\) (where \(0 < r < 1\)) to modulate how much attention the model pays to hard samples.

The new loss function looks like this:

\[
\mathcal{L}_{rcl} = \frac{1}{N} \sum_{i=1}^{N} \frac{1 - (P_i^{+})^{r}}{r}, \qquad 0 < r < 1 \tag{13}
\]

This equation might look intimidating, but it is mathematically elegant because it bridges two extremes.

  1. If \(r \to 0\): since \(\lim_{r \to 0} \frac{1 - (P_i^{+})^{r}}{r} = -\log P_i^{+}\) (Equation 18), the loss becomes asymptotically equivalent to the standard InfoNCE loss. This is highly discriminative but not robust to noise.

  2. If \(r \to 1\): the per-sample term reduces to \(1 - P_i^{+}\) (Equation 17), which is the Mean Absolute Error (MAE) loss. MAE treats all samples equally. It is very robust to noise, but it is not very discriminative: treating every pair the same leads to underfitting.

By setting \(r\) somewhere in the middle (e.g., 0.1), ROLL gets the best of both worlds: it learns discriminative features like InfoNCE but ignores the impossible outliers like MAE.
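One loss family with exactly these two limits is the generalized cross-entropy form \((1 - p^r)/r\). The sketch below uses it to illustrate the behavior the paper relies on; it is not claimed to be the authors' exact parameterization.

```python
import numpy as np

def robust_contrastive(p_pos, r):
    """Robust loss on positive-pair probabilities: (1 - p^r) / r.

    As r -> 0 this approaches -log(p) (InfoNCE-like, heavy penalty on
    hard pairs); at r = 1 it is exactly 1 - p (MAE-like, bounded penalty).
    """
    return np.mean((1.0 - p_pos ** r) / r)

# Positive-pair probabilities; the 0.01 entry mimics a mismatched "hard" pair.
p = np.array([0.9, 0.5, 0.01])

# Near r = 0: the mismatched pair dominates, as under -log(p).
nce_like = robust_contrastive(p, 1e-6)
# At r = 1: the mismatch contributes at most 1, never dominating training.
mae_like = robust_contrastive(p, 1.0)
```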

The total optimization objective combines the reconstruction, the noise-tolerant pseudo-labels, and the robust contrastive learning:

\[
\mathcal{L} = \mathcal{L}_{rec} + \lambda_1 \mathcal{L}_{npl} + \lambda_2 \mathcal{L}_{rcl} \tag{16}
\]

where \(\lambda_1\) and \(\lambda_2\) are trade-off weights balancing the three terms.

Experimental Results

The researchers tested ROLL against 11 state-of-the-art methods on five datasets, including Scene15, CUB (birds), and Reuters (news). They simulated noisy correspondence by randomly shuffling a percentage of the samples in one view (from 20% up to 80% noise).
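This corruption protocol is easy to reproduce. A minimal sketch (the function name and toy data are hypothetical): a random subset of one view's rows is permuted among themselves, so those samples no longer correspond to their partners in the other view.

```python
import numpy as np

def inject_noisy_correspondence(X2, noise_rate, rng):
    """Shuffle a fraction of view-2 samples to break their pairing with view 1."""
    n = len(X2)
    idx = rng.choice(n, int(noise_rate * n), replace=False)  # rows to corrupt
    X2 = X2.copy()
    X2[idx] = X2[rng.permutation(idx)]  # permute the chosen rows among themselves
    return X2

rng = np.random.default_rng(0)
X2 = np.arange(10, dtype=float).reshape(-1, 1)   # toy second view: 10 samples
noisy = inject_noisy_correspondence(X2, 0.5, rng)
# Up to half the rows are now paired with the wrong view-1 sample,
# while the marginal content of the view is unchanged.
```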

1. Performance Tables

The results, shown in Table 1, are compelling.

Table 1. The multi-view clustering performance (%) on five widely-used datasets with different noise rates.

Take a look at the CUB dataset (second major column) with a 50% noise rate:

  • SURE (a competitor) achieves an Accuracy (ACC) of 20.30%.
  • DealMVC achieves 8.32%.
  • ROLL (Ours) achieves 77.63%.

The difference is massive. While other methods collapse when half the data is mismatched, ROLL maintains high performance. Even at extreme noise levels, it manages to find the underlying structure of the data.

2. Robustness Analysis

One of the most interesting visualizations in the paper is how performance changes as noise increases.

Figure 3. The clustering performance of the CUB dataset under different noise rates.

In Figure 3, the red line represents ROLL. You can see that as the noise rate (x-axis) goes from 0.1 (10%) to 0.8 (80%), the performance of competing methods (blue, green, yellow lines) plummets toward zero. ROLL, however, stays almost flat. It is incredibly stubborn—in a good way. It refuses to let the noise corrupt its feature space.

3. Visualizing the Feature Space

To prove that the numbers aren’t lying, the authors used t-SNE to visualize the latent representations of the data.

Figure 4. The t-SNE visualizations on CUB with 50% noise rate.

  • Plots (a) and (c): These show competitor methods (ICMVC and RMCNC). Notice how the colors (clusters) are smeared together; there is no clear separation between classes.
  • Plot (d): This is ROLL. The clusters are tight, compact, and well-separated. Even with 50% of the text descriptions being wrong, ROLL successfully grouped the images by their actual content.

4. Parameter Sensitivity

Finally, the authors analyzed how the hyperparameters affect the model.

Figure 5. Parameter analysis on Scene15 with 20% noise rate.

Figure 6. Robustness analysis on CUB with 20% noise rate.

Figure 6 (Right side) specifically looks at the parameter \(r\) in the robust contrastive loss.

  • Blue line (\(r=0.1\)): High accuracy. The model is discriminative but robust.
  • Red line (\(r=0.9\)): Performance drops. This is closer to MAE loss; the model is too “relaxed” and underfits the data, failing to learn sharp distinctions between clusters.

This confirms the theory that a balance between InfoNCE (hard-mining) and MAE (equal-weighting) is necessary for noisy multi-view learning.

Ablation Studies

How do we know which part of ROLL is doing the heavy lifting? Is it the Noise-tolerant Pseudo-labeling (NPL) or the Robust Contrastive Learning (RCL)?

The authors performed ablation studies, systematically removing parts of the model to see what breaks.

Table 2. Ablation studies (%) of our ROLL on two datasets.

  • Removing RCL: Performance drops significantly (ACC drops from ~45% to ~35% on Scene-15). This shows that handling noisy correspondence is crucial.
  • Removing NPL: Performance drops even more (ACC drops to ~33% on Scene-15). This shows that blindly trusting pseudo-labels is dangerous.
  • Combined: The full model (bottom row) consistently outperforms any partial variation.

Conclusion and Implications

The “ROLL” paper provides a sobering reminder that in the era of big data, “more data” isn’t always “better data”—especially if that data is noisy.

By identifying the twin challenges of Noisy Pseudo-labels and Noisy Correspondence, the authors have highlighted a major bottleneck in unsupervised multi-view learning. Their solution is elegant in its duality:

  1. NPL acts as a filter, using cross-view consensus to ignore unreliable pseudo-labels.
  2. RCL acts as a shock absorber, modifying the contrastive loss to prevent the model from overfitting to mismatched pairs.

For students and practitioners, this paper teaches a valuable lesson: standard loss functions like Cross-Entropy and InfoNCE are powerful, but they are brittle. When moving from curated academic datasets to the messy real world, designing loss functions that are robust to noise is just as important as the network architecture itself.