Introduction
In the world of computer vision, data is the new oil, but refining that oil—specifically, annotating images—is incredibly expensive. This is particularly true for Instance Segmentation, the task of identifying and outlining every distinct object in an image at the pixel level. Unlike simple bounding boxes or image tags, creating a precise mask for every pedestrian, car, or cup in a dataset requires significant human effort and time.
To solve this, researchers have turned to Weakly Supervised Instance Segmentation (WSIS). The goal of WSIS is to train models that can predict pixel-perfect masks while only using “cheap” labels during training. These cheap labels typically fall into three categories:
- Image-level Tags: Just knowing “there is a dog in this image.”
- Points: A single click on an object.
- Bounding Boxes: A rectangle drawn around the object.
Historically, research papers have picked one of these weak label types and optimized a model specifically for it. This is called a homogeneous setting. But in the real world, why limit ourselves? What if we have a budget where we can afford tags for 10,000 images, bounding boxes for 500 images, and points for 1,000 images?
Enter WISH (Weakly supervised Instance Segmentation using Heterogeneous labels).

As illustrated in Figure 1, WISH is a novel framework designed to handle heterogeneous labels. It unifies tags, points, and boxes into a single training pipeline. Even more impressively, it leverages the power of the Segment Anything Model (SAM) not just as a post-processing tool, but as a core component of the training supervision.
In this deep dive, we will explore how WISH manages to outperform specialized models by treating weak labels as “prompts” and learning from the latent space of a foundation model.
Background: The Hierarchy of Supervision
Before dissecting the WISH architecture, we need to formalize the problem. In a fully supervised setting, we are spoiled. For a dataset \(\mathbf{D}\), every image \(\mathbf{I}\) comes with a perfect set of masks and class labels.
However, in WSIS, our ground truth \(\mathbf{Y}\) changes depending on what we can afford.
Defining the Weak Labels
The authors formulate the dataset as:
\[ \mathbf{D} = \{ (\mathbf{I}_i, \mathbf{Y}_i) \}_{i=1}^{N} \]
Where \(\mathbf{Y}_i\) represents the labels for image \(i\). In a fully supervised world, \(\mathbf{Y}_i\) looks like this:

\[ \mathbf{Y}_i = \{ (\mathbf{M}_k, c_k) \}_{k=1}^{K_i} \]

Here, \(\mathbf{M}\) represents the dense pixel masks we typically desire, with \(c_k\) the class of each instance. But in WISH, we deal with three cheaper alternatives:
Tags (\(\mathbf{Y}^t\)): The set of classes present in the image. We know what is there, but not where or how many.

Points (\(\mathbf{Y}^p\)): A set of coordinate points \(\mathbf{X}\), one for each instance. This gives us location, but no shape or size.

Bounding Boxes (\(\mathbf{Y}^b\)): The familiar rectangles used in object detection. These provide location and size, but not shape.

The Heterogeneous Challenge
The core innovation of this paper is the move from a homogeneous setting (using only one of the above) to a heterogeneous one. The goal is to train a single model where the label for any given image could be any of the three:
\[ \mathbf{Y}_i \in \{ \mathbf{Y}_i^{t},\ \mathbf{Y}_i^{p},\ \mathbf{Y}_i^{b} \} \]
This flexibility allows for “Budget-Aware” annotation strategies, which we will discuss in the experiments section.
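To make this concrete, here is a minimal sketch of how a heterogeneous annotation record could be represented so that a single training pipeline can consume all three label types. The field names and shapes are illustrative assumptions, not the paper's actual data format.

```python
# A sketch of a heterogeneous annotation record (assumed format, not the paper's actual one).
from dataclasses import dataclass
from typing import List, Literal, Optional, Tuple

@dataclass
class WeakLabel:
    kind: Literal["tag", "point", "box"]  # which weak-label type this image carries
    classes: List[int]                    # class IDs present in the image (always known)
    points: Optional[List[Tuple[float, float]]] = None               # one (x, y) per instance
    boxes: Optional[List[Tuple[float, float, float, float]]] = None  # (x1, y1, x2, y2) per instance

# The same training pipeline can consume any of the three:
samples = [
    WeakLabel(kind="tag", classes=[7]),                                      # "there is a dog"
    WeakLabel(kind="point", classes=[7], points=[(124.0, 56.5)]),            # one click on the dog
    WeakLabel(kind="box", classes=[7], boxes=[(90.0, 30.0, 180.0, 120.0)]),  # a rectangle around it
]
```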
The Foundation: Segment Anything Model (SAM)
To bridge the gap between these weak labels (points/boxes) and the desired output (masks), the authors employ SAM. SAM is a vision foundation model trained on 1 billion masks. Its “superpower” is promptable segmentation.
Mathematically, SAM takes an image \(\mathbf{I}\) and a prompt \(\mathbf{P}\) (which can be a point or a box) and outputs a mask \(\mathbf{M}\):
\[ \mathbf{M} = \mathrm{SAM}(\mathbf{I}, \mathbf{P}) \]
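To get a feel for this interface, here is a small usage sketch with the official `segment_anything` package. The checkpoint path and image file are placeholders; any SAM variant works.

```python
# A quick sketch of SAM's promptable interface (M = SAM(I, P)) using the official
# segment_anything package; checkpoint and image paths are placeholders.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

image = np.array(Image.open("example.jpg").convert("RGB"))
predictor.set_image(image)  # runs the heavy image encoder once per image

# Point prompt: a single foreground click.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # (x, y)
    point_labels=np.array([1]),           # 1 = foreground click
    multimask_output=True,                # SAM returns 3 candidate masks
)

# Box prompt: an XYXY rectangle.
masks_b, scores_b, _ = predictor.predict(
    box=np.array([100, 80, 400, 360]),
    multimask_output=True,
)
```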
Most prior works use SAM as a “teacher” that generates pseudo-labels offline. WISH takes a different approach: it integrates SAM’s Prompt Encoder directly into the learning process to teach the model how to understand objects.
The WISH Framework
The WISH framework is built upon Mask2Former, a state-of-the-art architecture for segmentation. Mask2Former uses a Transformer decoder to attend to image features and generate “object queries.” Each query represents a potential object instance.
In a standard fully supervised Mask2Former:
- Queries (\(\mathbf{Q}\)) are processed by a Transformer Decoder.
- A Classification Head (\(\mathcal{H}_{cls}\)) predicts the object class.
- A Mask Head (\(\mathcal{H}_{mask}\)) generates the binary mask.
The authors of WISH realized that since they don’t have ground truth masks, they need a different way to guide the model. They introduce a third head: the Prompt Head.
Architecture Overview
Let’s look at the complete architecture:

The workflow consists of several interconnected stages:
- Image Encoding: The image is processed to extract features \(\mathbf{F}\).
- Transformer Decoder: Generates object queries \(\mathbf{Q}\).
- Prediction Heads:
  - \(\hat{\mathbf{y}}_{cls}\): Class predictions.
  - \(\hat{\mathbf{M}}\): Mask predictions (via dot product with pixel embeddings).
  - \(\hat{\mathbf{Z}}\): Prompt latent predictions (the new contribution).
The standard predictions are calculated as:
\[ \hat{\mathbf{y}}_{cls} = \mathcal{H}_{cls}(\mathbf{Q}), \qquad \hat{\mathbf{M}} = \mathcal{H}_{mask}(\mathbf{Q}) \cdot \mathbf{F}_{pixel}^{\top} \]

where \(\mathbf{F}_{pixel}\) denotes the per-pixel embeddings.
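For intuition, here is a toy PyTorch sketch of these two standard heads; the dimensions and head designs are illustrative, not Mask2Former's exact implementation.

```python
# Toy sketch of the standard heads: classification from queries, and masks via a dot
# product between per-query mask embeddings and per-pixel embeddings. Sizes are illustrative.
import torch
import torch.nn as nn

N_QUERIES, D, N_CLASSES, H, W = 100, 256, 20, 64, 64

cls_head  = nn.Linear(D, N_CLASSES + 1)                                 # +1 for "no object"
mask_head = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))  # small MLP

Q       = torch.randn(N_QUERIES, D)  # object queries from the Transformer decoder
F_pixel = torch.randn(D, H, W)       # per-pixel embeddings

y_cls  = cls_head(Q)                                   # (N, C+1) class logits
mask_e = mask_head(Q)                                  # (N, D)   per-query mask embeddings
M_hat  = torch.einsum("nd,dhw->nhw", mask_e, F_pixel)  # (N, H, W) mask logits
```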
But what is this “Prompt Latent Prediction” \(\hat{\mathbf{Z}}\)?
Learning from the Latent Space
This is the most conceptually interesting part of the paper. Since the weak labels (points and boxes) are essentially “prompts,” the authors use SAM’s pre-trained Prompt Encoder (\(\mathcal{E}_{SAM}^{prompt}\)) to convert the ground truth weak labels into a latent embedding vector \(\mathbf{Z}\).
\[ \mathbf{Z} = \mathcal{E}_{SAM}^{prompt}(\mathbf{P}) \]

where \(\mathbf{P}\) is the ground-truth point or box.
Here, \(\mathbf{Z}\) represents how SAM “sees” the weak label. It is a rich vector representation of the condition “there is an object at this point/box.”
The WISH model then attempts to predict this vector directly from its object queries using the new Prompt Head (\(\mathcal{H}_{prompt}\)):
\[ \hat{\mathbf{Z}} = \mathcal{H}_{prompt}(\mathbf{Q}) \]
By forcing the model to predict \(\hat{\mathbf{Z}}\) that matches the ground truth \(\mathbf{Z}\), WISH ensures that the object queries capture the same localization and instance information that SAM expects. This acts as a powerful constraint, effectively transferring SAM’s understanding of “what defines an instance” into the WISH model without needing dense masks.
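Here is a hedged sketch of that idea: SAM's frozen prompt encoder embeds a ground-truth box into \(\mathbf{Z}\), and a small prompt head maps an object query to \(\hat{\mathbf{Z}}\). The head architecture, the pooling of SAM's corner tokens, and the MSE stand-in for the paper's KL-based objective are all assumptions for illustration.

```python
# A sketch of the prompt-latent supervision; sizes, token pooling, and the MSE loss are
# illustrative assumptions. The SAM checkpoint path is a placeholder.
import torch
import torch.nn as nn
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
prompt_encoder = sam.prompt_encoder.eval()             # frozen teacher: E_SAM^prompt
for p in prompt_encoder.parameters():
    p.requires_grad_(False)

D = 256                                                # SAM prompt embeddings are 256-d
prompt_head = nn.Sequential(                           # H_prompt: object query -> predicted latent
    nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D)
)

# Ground-truth weak label: one box, expressed in SAM's resized input coordinate frame.
gt_box = torch.tensor([[120.0, 60.0, 480.0, 400.0]])   # (1, 4), XYXY
with torch.no_grad():
    sparse_emb, _ = prompt_encoder(points=None, boxes=gt_box, masks=None)  # (1, 2, 256)
    Z = sparse_emb.mean(dim=1)                          # pool the two corner tokens (an assumption)

query = torch.randn(1, D)                               # one object query from the WISH decoder
Z_hat = prompt_head(query)                              # predicted prompt latent

# Align the model's latent with SAM's view of the prompt. The paper uses a KL-based cost;
# plain MSE is used here purely for illustration.
loss_prompt = nn.functional.mse_loss(Z_hat, Z)
```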
Multi-Stage Matching
In Transformer-based detection (like DETR or Mask2Former), the model outputs a fixed set of \(N\) predictions, but an image has \(K\) actual objects. We must figure out which prediction pairs with which ground truth object. This is called Bipartite Matching.
In fully supervised learning, we match based on Class and Mask similarity. In WISH, we lack ground truth masks, so the matching cost is redesigned to include three components:
1. Classification Cost:
Does the predicted class match the ground truth label?

2. Prompt Cost (The Novelty):
Does the predicted prompt latent vector \(\hat{\mathbf{Z}}\) match the ground truth prompt embedding \(\mathbf{Z}\)? They use Kullback-Leibler Divergence (KLD) to measure this similarity.

3. Mask Cost (Using SAM):
Even without ground truth masks, we can generate a “proxy” mask. We feed the weak label (point/box) into SAM to generate a mask \(\mathbf{M}_{SAM}\).
SAM usually outputs three candidate masks to handle ambiguity (e.g., the whole person vs. just the face). WISH adaptively selects the one that best matches the current prediction:

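A minimal sketch of this adaptive selection, assuming a simple IoU criterion between each candidate and the current prediction:

```python
# Pick the SAM candidate mask with the highest IoU against the model's current prediction.
import torch

def select_sam_mask(sam_masks: torch.Tensor, pred_mask: torch.Tensor) -> torch.Tensor:
    """sam_masks: (3, H, W) boolean candidates; pred_mask: (H, W) boolean prediction."""
    inter = (sam_masks & pred_mask).flatten(1).sum(dim=1).float()
    union = (sam_masks | pred_mask).flatten(1).sum(dim=1).float()
    ious = inter / union.clamp(min=1.0)
    return sam_masks[ious.argmax()]

# Example with dummy masks:
cands = torch.rand(3, 64, 64) > 0.5
pred  = torch.rand(64, 64) > 0.5
best  = select_sam_mask(cands, pred)
```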
Total Matching Cost:
The final cost function combines these three distinct signals—semantic class, latent prompt representation, and spatial mask consistency.
\[ \mathcal{C}_{match} = \lambda_{cls}\,\mathcal{C}_{cls} + \lambda_{prompt}\,\mathcal{C}_{prompt} + \lambda_{mask}\,\mathcal{C}_{mask} \]
Once the optimal matching between predictions and ground truth is found (using the Hungarian algorithm), the final segmentation loss is calculated to train the network:

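To make the matching step concrete, here is a simplified sketch that builds a cost matrix from classification, prompt, and mask terms and solves it with SciPy's Hungarian solver. The individual cost formulas and weights are stand-ins (e.g., an L2 distance in place of the paper's KL divergence), not the paper's exact definitions.

```python
# A simplified matching sketch: combine class, prompt, and mask costs into one matrix and
# run the Hungarian algorithm. Cost terms and weights are assumptions, not the paper's own.
import torch
from scipy.optimize import linear_sum_assignment

def match(pred_probs, pred_Z, pred_masks, gt_classes, gt_Z, sam_masks,
          w_cls=2.0, w_prompt=1.0, w_mask=5.0):
    """pred_probs: (N, C) class scores; pred_Z: (N, D); pred_masks: (N, H, W) in [0, 1];
    gt_classes: (K,) int; gt_Z: (K, D); sam_masks: (K, H, W) binary SAM pseudo-masks."""
    # 1. Classification cost: negative score of the ground-truth class (DETR-style stand-in).
    cost_cls = -pred_probs[:, gt_classes]                       # (N, K)

    # 2. Prompt cost: distance between predicted and SAM prompt latents
    #    (L2 here for simplicity; the paper measures this with a KL divergence).
    cost_prompt = torch.cdist(pred_Z, gt_Z, p=2)                # (N, K)

    # 3. Mask cost: soft-IoU dissimilarity against SAM's pseudo-masks.
    p, g = pred_masks.flatten(1), sam_masks.flatten(1).float()  # (N, HW), (K, HW)
    inter = p @ g.T
    union = p.sum(1, keepdim=True) + g.sum(1) - inter
    cost_mask = 1.0 - inter / union.clamp(min=1e-6)             # (N, K)

    cost = w_cls * cost_cls + w_prompt * cost_prompt + w_mask * cost_mask
    rows, cols = linear_sum_assignment(cost.detach().numpy())
    return rows, cols  # prediction rows[j] is matched to ground-truth object cols[j]

# Dummy usage:
N, K, C, D, H, W = 10, 3, 21, 256, 32, 32
rows, cols = match(torch.softmax(torch.randn(N, C), -1), torch.randn(N, D),
                   torch.rand(N, H, W), torch.tensor([1, 5, 7]), torch.randn(K, D),
                   torch.rand(K, H, W) > 0.5)
```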
Handling the “Tag” Problem
The framework described above works beautifully for Points and Boxes because they have spatial coordinates that SAM can encode. But what about Image-level Tags?
A tag tells us “Cat” is present but provides no \((x, y)\) coordinates. SAM cannot accept “Cat” as a prompt to locate an object, so the authors need to bridge the gap between semantic tags and spatial prompts.
Step 1: Generating CAMs
To find the object, WISH uses Class Activation Maps (CAMs). They add a small auxiliary branch to the image encoder:
\[ \mathbf{A} = \mathcal{H}_{cam}(\mathbf{F}) \]

where \(\mathcal{H}_{cam}\) denotes the auxiliary CAM branch.
This generates a coarse heatmap \(\mathbf{A}\) for every class. This branch is trained using a simple classification loss (\(\mathcal{L}_{cam}\)) based on the image tags.

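A sketch of what such a branch typically looks like, assuming the common CAM recipe (a 1x1 convolution to per-class heatmaps, global pooling, and a multi-label BCE against the tags); the paper's exact head may differ.

```python
# Assumed CAM branch: 1x1 conv -> per-class heatmaps -> pooled image-level scores -> BCE vs. tags.
import torch
import torch.nn as nn
import torch.nn.functional as F

C_CLASSES, D_FEAT, H, W = 20, 256, 64, 64
cam_head = nn.Conv2d(D_FEAT, C_CLASSES, kernel_size=1)  # small auxiliary branch on the encoder

feats = torch.randn(1, D_FEAT, H, W)                    # image features F from the encoder
cams = cam_head(feats)                                  # (1, C, H, W) class activation maps A

scores = cams.flatten(2).mean(dim=2)                    # global average pooling -> (1, C) logits
tags = torch.zeros(1, C_CLASSES); tags[0, 7] = 1.0      # image-level tag: class 7 is present

loss_cam = F.binary_cross_entropy_with_logits(scores, tags)
```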
Step 2: From Heatmaps to Points
Once the model learns to highlight regions associated with “Cat,” the authors extract local peaks from the heatmap.
- Find the highest activation points in the CAM.
- Filter out low-confidence peaks.
- Treat these peaks as pseudo-point labels.
Effectively, they convert the abstract Tag label into a Point label:
\[ \mathbf{Y}^{t} \;\xrightarrow{\ \text{CAM peaks}\ }\; \mathbf{X}_{pseudo} \]
Now that the Tag has been converted into a Point (\(\mathbf{X}\)), it can be fed into the standard WISH pipeline (SAM Prompt Encoder -> Latent Space) just like a manual point annotation!
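Here is a minimal sketch of the peak-extraction step, assuming a max-pooling-based local-maximum search; the window size and confidence threshold are illustrative.

```python
# Turn a CAM into pseudo-points: find local maxima via max pooling, drop low-confidence
# peaks, and keep the surviving coordinates as point prompts.
import torch
import torch.nn.functional as F

def cam_to_points(cam: torch.Tensor, thresh: float = 0.7, window: int = 11) -> torch.Tensor:
    """cam: (H, W) activation map for one present class, assumed normalized to [0, 1]."""
    pooled = F.max_pool2d(cam[None, None], window, stride=1, padding=window // 2)[0, 0]
    peaks = (cam == pooled) & (cam >= thresh)   # local maxima above the confidence cut-off
    ys, xs = torch.nonzero(peaks, as_tuple=True)
    return torch.stack([xs, ys], dim=1)         # (num_peaks, 2) pseudo-points as (x, y)

cam = torch.rand(64, 64)
cam = cam / cam.max()                           # normalize for the example
points = cam_to_points(cam)
```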
Step 3: Self-Enhancement
CAMs are notoriously noisy and “blobby.” To refine them, the authors introduce a feedback loop. As the main WISH model (the Mask2Former part) gets better at predicting crisp masks (\(\hat{\mathbf{M}}\)), these predictions are used to supervise and sharpen the CAMs.

This Self-Enhancement Loss ensures that as training progresses, the CAMs become more accurate, which leads to better pseudo-points, which leads to better training of the main model—a virtuous cycle.
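The paper's exact self-enhancement loss is not reproduced here, but the feedback direction can be sketched as follows: the model's own mask predictions are aggregated per class into a detached target that supervises the CAM branch.

```python
# A rough sketch of the feedback loss; the paper's exact formulation is not reproduced here.
import torch
import torch.nn.functional as F

def self_enhancement_loss(cam_logits, pred_masks, pred_classes):
    """cam_logits: (C, H, W) raw CAM scores; pred_masks: (N, H, W) sigmoid mask predictions;
    pred_classes: (N,) predicted class index for each matched query."""
    target = torch.zeros_like(cam_logits)
    for m, c in zip(pred_masks, pred_classes):
        target[c] = torch.maximum(target[c], m)  # per-class union of the instance masks
    # The (detached) mask predictions act as sharper targets for the blobby CAMs.
    return F.binary_cross_entropy_with_logits(cam_logits, target.detach())

# Dummy usage:
loss_se = self_enhancement_loss(torch.randn(20, 64, 64),
                                torch.rand(3, 64, 64),
                                torch.tensor([2, 2, 7]))
```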
The Total Loss
The final objective function for WISH is a sum of the segmentation loss (driven by matching), the CAM classification loss, and the self-enhancement loss:
\[ \mathcal{L}_{total} = \mathcal{L}_{seg} + \mathcal{L}_{cam} + \mathcal{L}_{se} \]

where \(\mathcal{L}_{se}\) denotes the self-enhancement loss.
Experimental Results
The authors evaluated WISH on two major benchmarks: PASCAL VOC and MS-COCO. They compared WISH against state-of-the-art (SoTA) methods in both homogeneous and heterogeneous settings.
1. Homogeneous Performance
First, they checked if WISH works well when used traditionally (only tags, only points, or only boxes).
PASCAL VOC Results:

As shown in Table 1, WISH achieves new SoTA results across all categories.
- Tags (T): WISH achieves 46.0 AP, beating BESTIE, the previous best, which reported 51.0 AP50 (note that AP, averaged over stricter IoU thresholds, is a much harder metric than AP50, so the two figures are not directly comparable).
- Points (P): WISH jumps to 52.4 AP, significantly higher than previous methods.
- Boxes (B): WISH hits 54.6 AP, approaching the performance of fully supervised Mask2Former (54.8 AP). This suggests that bounding boxes, when used with SAM-guided training, carry almost as much signal as pixel-wise masks.
COCO Results:

Table 2 confirms the trend on the much harder COCO dataset. WISH consistently outperforms specialized methods like BoxInst and DiscoBox.
2. Heterogeneous Performance (The Budget Study)
This is the most critical experiment. The authors pose the question: “Given a fixed budget, what is the best annotation strategy?”
They define a budget \(\zeta\) and assign costs to each label type:
- Tag cost (\(\beta_t\)): 1
- Point cost (\(\beta_p\)): 6
- Box cost (\(\beta_b\)): 12
The constraint is:
\[ \beta_t N_t + \beta_p N_p + \beta_b N_b \;\leq\; \zeta \]

where \(N_t\), \(N_p\), and \(N_b\) are the numbers of images annotated with tags, points, and boxes, respectively.
They tested different combinations of labels that sum up to the same total cost.

Key Findings from Table 3:
- Tag + Box (Bottom Row):
  - Left column: Spend all money on Tags (10,582 images) \(\rightarrow\) 46.0 AP.
  - Right column: Spend all money on Boxes (882 images) \(\rightarrow\) 45.1 AP.
  - Middle column: A mix! 5,290 Tags + 441 Boxes \(\rightarrow\) 48.3 AP.
The heterogeneous mix outperforms the homogeneous extremes. This is a crucial insight for industry applications. It implies that rather than annotating a small dataset perfectly (boxes/masks) or a massive dataset poorly (tags), a hybrid strategy yields the best model performance. The rich spatial information from the few boxes helps the model learn to segment, while the large volume of tags provides diversity and generalization.
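A quick arithmetic check using the costs above (\(\beta_t = 1\), \(\beta_b = 12\)) confirms that all three settings consume essentially the same budget: all-tags costs 10,582 × 1 = 10,582 units, all-boxes costs 882 × 12 = 10,584, and the mix costs 5,290 × 1 + 441 × 12 = 10,582. The AP differences therefore come purely from how the budget is allocated, not from how much annotation effort was spent.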
3. Ablation: Why direct use of SAM decoder fails
One might ask: “If you have SAM, why not just train the Prompt Head and then feed that into the frozen SAM mask decoder during inference? Why use the Mask2Former decoder?”
The authors tested this (Figure 3).

They found that approach (b), directly using SAM for inference, dropped performance by more than 3 AP. This shows that distilling SAM's knowledge into the WISH weights (approach (a)) is more effective than relying on SAM as a runtime module: WISH learns to resolve ambiguities and class-specific features that the generic SAM decoder might miss.
Conclusion
The WISH paper presents a compelling step forward for Weakly Supervised Instance Segmentation. By breaking the silo of homogeneous labels, it opens the door to flexible, budget-aware annotation strategies.
Key Takeaways:
- Unification: WISH is the first framework to seamlessly integrate tags, points, and boxes into one model.
- Prompt Latent Space: Instead of just mimicking SAM’s output masks, WISH mimics SAM’s internal representation of prompts, providing a richer supervisory signal.
- Heterogeneity Wins: Mixing cheap tags with expensive boxes yields better results than using either exclusively for the same price.
As foundation models like SAM continue to evolve, frameworks like WISH demonstrate the best way to utilize them: not just as black-box tools, but as teachers that guide the training of specialized, efficient models. For students and researchers in computer vision, WISH offers a blueprint for how to handle the trade-off between data quantity, data quality, and annotation cost.