Introduction
In the world of medical diagnostics, there is rarely a single, indisputable truth. When three different radiologists look at a CT scan of a lung nodule or an MRI of a tumor, they will likely draw three slightly different boundaries around the lesion. This isn’t an error; it is the inherent ambiguity of medical imaging caused by blurred edges, low contrast, and complex anatomy.
However, traditional Deep Learning models treat segmentation as a deterministic task. They are trained to output a single “correct” mask. This creates a disconnect between AI outputs and clinical reality. Furthermore, training these models requires massive datasets with pixel-perfect annotations, which are incredibly expensive and time-consuming to obtain.
We are currently facing a dichotomy in research:
- Semi-Supervised Learning (SSL): Great at using limited labeled data and abundant unlabeled data, but usually forces a single output.
- Ambiguity-Aware Learning: Great at generating multiple plausible segmentations (mimicking diverse human experts), but usually requires fully labeled datasets with multiple annotations per image.
What if we could combine these strengths?
This article breaks down AmbiSSL (Annotation Ambiguity Aware Semi-Supervised Medical Image Segmentation), a novel framework proposed by researchers from the Indian Institute of Technology Roorkee. AmbiSSL is designed to learn from limited data while still capturing the natural uncertainty of medical diagnoses.

As illustrated in Figure 1, AmbiSSL bridges the gap. Unlike standard semi-supervised methods (top) that produce one map, or existing ambiguity methods (middle) that ignore unlabeled data, AmbiSSL (bottom) leverages the best of both worlds to generate diverse, probabilistic segmentations.
Background: The Challenge of Uncertainty
To understand AmbiSSL, we need to grasp two core concepts: Inter-rater Variability and Latent Distribution Learning.
Inter-rater Variability
In medical datasets, “Ground Truth” is often a collection of annotations from multiple experts (\(Y_{set}\)). For a single image \(x\), we might have annotations \(\{y_1, y_2, y_3\}\). A good model shouldn’t just average these; it should be able to produce any valid variation from this set.
Latent Spaces
How do we teach a neural network to produce different outputs for the same input? We use a Latent Space. Instead of mapping an image directly to a label, we map the image to a probability distribution (usually a Gaussian). We then sample a “code” (\(z\)) from this distribution.
- Prior Distribution: The network guesses the distribution based only on the input image.
- Posterior Distribution: The network shapes the distribution using both the input image and the ground truth labels.
During training, we try to make the Prior look like the Posterior (using KL-Divergence loss). During testing, we only have the image, so we sample from the Prior to get diverse results.
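To make this concrete, here is a minimal PyTorch sketch of the prior/posterior mechanism. The linear head, the 128-dimensional features, and the 6-dimensional latent space are illustrative choices of ours, not the paper's architecture:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class DistributionHead(nn.Module):
    """Maps a feature vector to the (mu, sigma) of a diagonal Gaussian."""
    def __init__(self, in_ch: int, latent_dim: int):
        super().__init__()
        self.head = nn.Linear(in_ch, 2 * latent_dim)

    def forward(self, feats: torch.Tensor) -> Normal:
        mu, log_sigma = self.head(feats).chunk(2, dim=-1)
        return Normal(mu, log_sigma.exp())

# Hypothetical encoder outputs: image-only vs. image + annotations.
img_feats = torch.randn(4, 128)    # stands in for features of x alone
joint_feats = torch.randn(4, 128)  # stands in for features of (x, y)

prior_net = DistributionHead(128, 6)      # sees only the image
posterior_net = DistributionHead(128, 6)  # also sees the ground truth

prior, posterior = prior_net(img_feats), posterior_net(joint_feats)

# Training: pull the prior toward the posterior.
kl_loss = kl_divergence(posterior, prior).sum(-1).mean()

# Testing: no labels available, so sample diverse latent codes from the prior.
z = prior.rsample()                # shape (4, 6)
```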
The AmbiSSL Method
The AmbiSSL framework is sophisticated, involving multiple modules working in harmony. It aims to answer a difficult question: How do we learn uncertainty from unlabeled data where we have no human annotations to tell us what the variations look like?
The solution involves three key components:
- Diverse Pseudo-Label Generation (DPG)
- Semi-Supervised Latent Distribution Learning (SSLDL)
- Cross-Decoder Supervision (CDS)
Let’s break these down step-by-step.
1. Diverse Pseudo-Label Generation (DPG)
For labeled data, the model learns diversity from the multiple expert annotations. For unlabeled data, however, the model is blind. To fix this, AmbiSSL creates its own “artificial experts” using a technique called Randomized Pruning.
The architecture uses a backbone encoder-decoder (\(E^b_\theta, D^b_\theta\)). To create diversity, the researchers introduce two additional decoders, \(\phi\) and \(\xi\). These aren’t just copies; they are transformed via pruning.

As shown in Figure 2 above, the system takes an unlabeled image and passes it through the backbone. Simultaneously, it uses pruned decoders to generate different “views” of the segmentation.
The Pruning Mechanism: Pruning involves turning off specific weights in the neural network layers. By removing different weights in the final layers of the decoders, the network is forced to rely on different features, effectively simulating different human annotators.
The transformation of the weights \(\tilde{W}_k\) of layer \(k\) is defined as:

\[
\tilde{W}_k = M_k \odot W_k, \qquad M_k = \lambda(W_k)
\]

Here, \(M_k\) is a binary mask. The function \(\lambda(W_k)\) decides which weights to keep based on a probability \(q_k\); in essence, it keeps the "Top \(a\%\)" of weights and zeros out the rest:

\[
M_{k,ij} =
\begin{cases}
1, & \text{if } |W_{k,ij}| \text{ is among the top } a\% \text{ of } |W_k| \\
0, & \text{otherwise}
\end{cases}
\]

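As a rough illustration of the idea, here is what top-\(a\%\) magnitude pruning of a decoder layer could look like in PyTorch; the helper and the toy decoder below are our own sketch, not the authors' code:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def magnitude_prune_(layer: nn.Module, a: float = 0.5) -> None:
    """Keep the top a-fraction of a layer's weights by magnitude; zero the rest."""
    w = layer.weight
    flat = w.abs().flatten()
    k = max(1, int(a * flat.numel()))
    thresh = flat.topk(k).values.min()      # smallest surviving magnitude
    mask = (w.abs() >= thresh).to(w.dtype)  # the binary mask M_k
    w.mul_(mask)                            # W~_k = M_k (elementwise) W_k

# Illustrative usage: prune the last two conv layers of a decoder copy.
decoder_phi = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1),
                            nn.Conv2d(64, 2, 1))
for layer in decoder_phi:                   # here, "the last L=2 layers"
    magnitude_prune_(layer, a=0.5)
```

Because the two extra decoders have different weights removed (and Table 4's ablation varies the depth \(L\) and rate \(a\%\)), each copy is forced onto a different feature subset, behaving like a different annotator.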
Generating the Labels: Once the decoders are pruned, the model samples a latent code \(z\) from the Prior distribution and concatenates it with the features from the pruned decoders. This generates a set of diverse pseudo-labels (\(\hat{P}\)) for the unlabeled images.

To ensure these pseudo-labels are robust, the researchers use an ensemble approach, summing the predictions of the backbone decoder (\(\theta\)) and the pruned decoders (\(\phi\) and \(\xi\)):

\[
\hat{P} = \hat{P}_{\theta} + \hat{P}_{\phi} + \hat{P}_{\xi}
\]

This process creates a synthetic “ground truth set” for the unlabeled data, allowing the model to learn ambiguity even when no humans are involved.
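A toy sketch of the ensembling step, assuming each decoder outputs per-pixel class probabilities and the summed prediction is hardened by an argmax (names and shapes are ours):

```python
import torch

def ensemble_pseudo_label(p_theta: torch.Tensor,
                          p_phi: torch.Tensor,
                          p_xi: torch.Tensor) -> torch.Tensor:
    """Sum per-pixel class probabilities from the three decoders and
    harden the result into a pseudo-label via argmax."""
    return (p_theta + p_phi + p_xi).argmax(dim=1)   # (B, H, W)

# Toy usage: two classes, a 4x4 "image", random probabilities.
probs = [torch.softmax(torch.randn(1, 2, 4, 4), dim=1) for _ in range(3)]
pseudo = ensemble_pseudo_label(*probs)
```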
2. Semi-Supervised Latent Distribution Learning (SSLDL)
Now that we have labels (real ones for labeled data, pseudo ones for unlabeled data), we need to train the probabilistic mechanism. This is handled by the SSLDL module.

The top half of Figure 3 illustrates the SSLDL. The goal is to learn a shared latent space.
For Labeled Data: The model uses a standard Probabilistic U-Net approach.
- Prior Network (\(E^{prior}\)): Looks at the image \(x^l\) and predicts a Normal distribution (\(\mu_{prior}, \sigma_{prior}\)).
- Posterior Network (\(E^{post}\)): Looks at the image \(x^l\) AND the expert annotations \(Y_{set}\) to predict a more accurate distribution (\(\mu_{post}, \sigma_{post}\)).

For Unlabeled Data: Here is where AmbiSSL innovates. Instead of a Normal distribution (which can be overly confident and sensitive to outliers), the researchers model the unlabeled distributions as a Laplace distribution. This is more robust to the potential noise in the pseudo-labels generated in the previous step.
The posterior for unlabeled data is calculated using the image \(x^u\) and the pseudo-label sets (\(\hat{P}\)) we generated earlier:

\[
(\mu^{u}_{post},\, b^{u}_{post}) = E^{post}(x^u, \hat{P}), \qquad Q^{u}_{post} = \mathrm{Laplace}\big(\mu^{u}_{post},\, b^{u}_{post}\big)
\]

The Loss Function: To train the network, we minimize the Kullback–Leibler (KL) divergence between the Prior and the Posterior. This forces the Prior (which we use at test time) to be as informative as the Posterior (which saw the answers):

\[
\mathcal{L}_{KL} = D_{KL}\big(Q_{post} \,\|\, P_{prior}\big)
\]

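Both branches stay in closed form with torch.distributions. The sketch below assumes the unlabeled branch uses Laplace distributions on both the prior and posterior sides; the parameter tensors are random placeholders standing in for the encoder outputs:

```python
import torch
from torch.distributions import Normal, Laplace, kl_divergence

B, Z = 4, 6  # batch size and latent dimension, illustrative

# Placeholder parameters for the prior and posterior encoders' outputs.
mu_pr, s_pr = torch.zeros(B, Z), torch.ones(B, Z)
mu_po, s_po = torch.randn(B, Z), torch.rand(B, Z) + 0.5

# Labeled branch: Normal prior and posterior, as in the Probabilistic U-Net.
kl_labeled = kl_divergence(Normal(mu_po, s_po), Normal(mu_pr, s_pr))

# Unlabeled branch: heavier-tailed Laplace distributions, more forgiving
# of noise in the pseudo-labels (assuming Laplace on both sides keeps the
# divergence closed-form in torch.distributions).
kl_unlabeled = kl_divergence(Laplace(mu_po, s_po), Laplace(mu_pr, s_pr))

loss_kl = kl_labeled.sum(-1).mean() + kl_unlabeled.sum(-1).mean()
```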
3. Cross-Decoder Supervision (CDS)
The final piece of the puzzle is ensuring the pruned decoders learn useful features. If we just let them run wild, they might produce garbage. We need supervision.
AmbiSSL uses a Cross-Decoder strategy (illustrated in the bottom half of Figure 3). The idea is simple but powerful: Decoder \(\phi\) should help train Decoder \(\xi\), and vice versa.
- Augmentation: We take the unlabeled image and apply weak augmentation (\(x^u\)) and strong augmentation (\(x^{\hat{u}}\)).
- Prediction: We generate segmentations using sampled latent codes.

- Cross-Training: The output of Decoder \(\phi\) is compared against a randomly selected pseudo-label from Decoder \(\xi\)’s set, and vice versa. This forces consistency across different views and augmentations, preventing the models from collapsing into a single mode.

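A minimal sketch of the cross-supervision idea, assuming hardened (argmax) pseudo-labels and a plain cross-entropy term; the paper's exact segmentation loss may differ:

```python
import random
import torch
import torch.nn.functional as F

def cds_loss(logits_phi, logits_xi, pseudo_set_phi, pseudo_set_xi):
    """Each decoder is trained against a randomly drawn pseudo-label
    from the *other* decoder's set (cross supervision)."""
    target_for_phi = random.choice(pseudo_set_xi)   # (B, H, W) class indices
    target_for_xi = random.choice(pseudo_set_phi)
    return (F.cross_entropy(logits_phi, target_for_phi)
            + F.cross_entropy(logits_xi, target_for_xi))

# Toy usage: two classes, sets of three pseudo-labels per decoder.
lp = torch.randn(2, 2, 8, 8)                        # decoder phi logits
lx = torch.randn(2, 2, 8, 8)                        # decoder xi logits
set_phi = [torch.randint(0, 2, (2, 8, 8)) for _ in range(3)]
set_xi = [torch.randint(0, 2, (2, 8, 8)) for _ in range(3)]
loss = cds_loss(lp, lx, set_phi, set_xi)
```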
The Complete Training Objective
The model is trained end-to-end. The Supervised Loss handles the labeled data, combining a standard Dice loss with the KL divergence for the labeled samples:

\[
\mathcal{L}_{sup} = \mathcal{L}_{Dice} + \mathcal{L}^{l}_{KL}
\]

The Unsupervised Loss combines the KL divergence for the unlabeled data with the cross-decoder segmentation loss:

\[
\mathcal{L}_{unsup} = \mathcal{L}^{u}_{KL} + \mathcal{L}_{CDS}
\]

Finally, these are combined into the total loss, where \(\mu\) is a ramp-up factor that gradually increases the importance of the unsupervised loss as training progresses:

\[
\mathcal{L}_{total} = \mathcal{L}_{sup} + \mu\, \mathcal{L}_{unsup}
\]

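A common choice for such a ramp-up factor in semi-supervised training is the Gaussian schedule of Laine & Aila; here is a sketch (the paper's exact schedule may differ):

```python
import math

def mu_rampup(step: int, rampup_steps: int = 1000) -> float:
    """Gaussian ramp-up from ~0 to 1 over rampup_steps (Laine & Aila style)."""
    t = min(step, rampup_steps) / rampup_steps
    return math.exp(-5.0 * (1.0 - t) ** 2)

# total_loss = loss_sup + mu_rampup(step) * loss_unsup
```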
Experiments and Results
The researchers evaluated AmbiSSL on two challenging medical datasets:
- LIDC-IDRI: Thoracic CT scans for lung nodule segmentation (4 expert annotators).
- ISIC 2018: Dermoscopic images for skin lesion segmentation (3 expert annotators).
Evaluation Metrics
To measure success, accuracy alone isn't enough; we also need to measure diversity.
- Generalized Energy Distance (GED): Measures the distance between the distribution of predictions and the distribution of ground truths; a lower GED means the model captures the diversity better. With \(d(a, b) = 1 - \mathrm{IoU}(a, b)\) as the distance between two masks, it is defined as:

\[
D^{2}_{GED} = 2\,\mathbb{E}\big[d(s, y)\big] - \mathbb{E}\big[d(s, s')\big] - \mathbb{E}\big[d(y, y')\big]
\]

where \(s, s'\) are sampled predictions and \(y, y'\) are expert annotations.

- Dice Soft: Measures the overlap accuracy between predictions and annotations; higher is better. For a soft prediction \(p\) and a mask \(y\):

\[
\mathrm{Dice}(p, y) = \frac{2\sum_i p_i\, y_i}{\sum_i p_i + \sum_i y_i}
\]

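Both metrics are straightforward to compute from sets of binary masks. Here is a NumPy sketch of GED using the usual \(d = 1 - \mathrm{IoU}\) convention (the empty-mask handling and the shapes are our choices):

```python
import numpy as np

def iou_distance(a: np.ndarray, b: np.ndarray) -> float:
    """d(a, b) = 1 - IoU for binary masks (0 if both masks are empty)."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 0.0 if union == 0 else 1.0 - inter / union

def ged_squared(samples, annotations) -> float:
    """Squared Generalized Energy Distance between a set of sampled
    predictions and a set of expert annotations."""
    e_sy = np.mean([iou_distance(s, y) for s in samples for y in annotations])
    e_ss = np.mean([iou_distance(s, t) for s in samples for t in samples])
    e_yy = np.mean([iou_distance(y, t) for y in annotations for t in annotations])
    return 2 * e_sy - e_ss - e_yy

# Toy usage: 4 sampled predictions vs. 4 annotators (as in LIDC-IDRI).
rng = np.random.default_rng(0)
preds = [rng.integers(0, 2, (8, 8)).astype(bool) for _ in range(4)]
gts = [rng.integers(0, 2, (8, 8)).astype(bool) for _ in range(4)]
print(ged_squared(preds, gts))
```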
Quantitative Results
The results, presented in Table 1 below, compare AmbiSSL against state-of-the-art methods like the Probabilistic U-Net and various semi-supervised baselines.

Key Takeaways:
- Superior Diversity: On the LIDC-IDRI dataset (using only 10% labeled data), AmbiSSL achieves a GED of 0.1620. Compare this to the “Probabilistic U-Net” (which requires full supervision) at 0.2679. This indicates AmbiSSL produces variations much closer to the ground truth distribution.
- High Accuracy: The Soft Dice score is 89.86%, outperforming the baselines.
- Efficiency: The model approaches the “Upper Bound” (trained on 100% labeled data) using significantly less labeled data, proving the efficacy of the unlabeled learning component.
Similar trends were observed on the ISIC skin lesion dataset:

Here, with 20% labeled data, AmbiSSL achieves a GED of 0.2444, significantly lower than the baseline methods, demonstrating robustness across different medical modalities.
Qualitative Analysis
Numbers are great, but in medical imaging, visual confirmation is vital.

Figure 4 (left side of the image above) shows the segmentation results. The top rows show the “Human Annotators,” displaying the natural variation in how experts define the nodule boundaries. The bottom rows show “AmbiSSL Predictions.”
Notice how AmbiSSL doesn’t just output the same shape three times. It generates distinct, plausible variations that mimic the subtle differences seen in the human rows. This confirms that the randomized pruning and latent space learning successfully captured the “ambiguity” of the task.
Ablation Studies
The researchers also investigated how sensitive the model is to hyperparameters.
- Weighting (\(\alpha_u\)): Table 3 (below) shows that a weight of 0.5 for the unsupervised loss strikes the best balance.

- Pruning Parameters: Table 4 examines the depth (\(L\)) and percentage (\(a\%\)) of pruning. Pruning the deeper layers (\(L=2\)) at a 50% rate provided the best diversity (lowest GED). This suggests that perturbing high-level semantic features is more effective for generating diverse views than perturbing low-level features.
Conclusion and Implications
The AmbiSSL paper presents a significant step forward for medical AI. By acknowledging that there is no single “correct” segmentation in medical imaging, and by unlocking the potential of vast amounts of unlabeled data, the authors have created a more practical and realistic tool for clinicians.
Key Innovations Summarized:
- Pruned Decoders: A clever way to generate diverse pseudo-labels from unlabeled data.
- Laplace Distributions: A robust statistical choice for handling noisy pseudo-labels in the latent space.
- Cross-Decoder Supervision: A mechanism to ensure self-correction and stability during training.
Why this matters: In a clinical setting, a model that says “Here is the tumor” is useful. But a model that says “Here is the tumor, and here are the likely variations in its boundary” is a powerful aid for surgical planning and radiation therapy. AmbiSSL moves us closer to AI that thinks like a team of doctors, rather than a single, over-confident machine.