Introduction

In the world of computer vision, data is the new oil, but refining that oil—specifically, annotating images—is incredibly expensive. This is particularly true for Instance Segmentation, the task of identifying and outlining every distinct object in an image at the pixel level. Unlike simple bounding boxes or image tags, creating a precise mask for every pedestrian, car, or cup in a dataset requires significant human effort and time.

To solve this, researchers have turned to Weakly Supervised Instance Segmentation (WSIS). The goal of WSIS is to train models that can predict pixel-perfect masks while only using “cheap” labels during training. These cheap labels typically fall into three categories:

  1. Image-level Tags: Just knowing “there is a dog in this image.”
  2. Points: A single click on an object.
  3. Bounding Boxes: A rectangle drawn around the object.

Historically, research papers have picked one of these weak label types and optimized a model specifically for it. This is called a homogeneous setting. But in the real world, why limit ourselves? What if our budget lets us afford tags for 10,000 images, bounding boxes for 500 images, and points for 1,000 images?

Enter WISH (Weakly supervised Instance Segmentation using Heterogeneous labels).

Figure 1. Comparison of homogeneous and heterogeneous settings. Left: the homogeneous setting, where every image in the dataset is annotated with a single weak label type. Right: the proposed heterogeneous WSIS, a generalized setting that allows different weak label types across samples within a single dataset.

As illustrated in Figure 1, WISH is a novel framework designed to handle heterogeneous labels. It unifies tags, points, and boxes into a single training pipeline. Even more impressively, it leverages the power of the Segment Anything Model (SAM) not just as a post-processing tool, but as a core component of the training supervision.

In this deep dive, we will explore how WISH manages to outperform specialized models by treating weak labels as “prompts” and learning from the latent space of a foundation model.


Background: The Hierarchy of Supervision

Before dissecting the WISH architecture, we need to formalize the problem. In a fully supervised setting, we are spoiled. For a dataset \(\mathbf{D}\), every image \(\mathbf{I}\) comes with a perfect set of masks and class labels.

However, in WSIS, our ground truth \(\mathbf{Y}\) changes depending on what we can afford.

Defining the Weak Labels

The authors formulate the dataset as:

\[ \mathbf{D} = \{(\mathbf{I}_i, \mathbf{Y}_i)\}_{i=1}^{N} \]

where \(\mathbf{Y}_i\) represents the labels for image \(i\). In a fully supervised world, \(\mathbf{Y}_i\) looks like this:

\[ \mathbf{Y}_i = \{(\mathbf{M}_k, c_k)\}_{k=1}^{K_i} \]

Here, \(\mathbf{M}_k\) represents the dense pixel masks we typically desire, and \(c_k\) is the class of each of the image's \(K_i\) instances. But in WISH, we deal with three cheaper alternatives:

  1. Tags (\(\mathbf{Y}^t\)): The set of classes present in the image. We know what is there, but not where or how many: \(\mathbf{Y}^t_i = \{c_k\}\).

  2. Points (\(\mathbf{Y}^p\)): A set of coordinate points \(\mathbf{X}\), one for each instance. This gives us location, but no shape or size: \(\mathbf{Y}^p_i = \{(\mathbf{X}_k, c_k)\}_{k=1}^{K_i}\), with \(\mathbf{X}_k = (x_k, y_k)\).

  3. Bounding Boxes (\(\mathbf{Y}^b\)): The familiar rectangles used in object detection. These provide location and size, but not shape: \(\mathbf{Y}^b_i = \{(\mathbf{B}_k, c_k)\}_{k=1}^{K_i}\), with \(\mathbf{B}_k = (x_k^{\min}, y_k^{\min}, x_k^{\max}, y_k^{\max})\).

The Heterogeneous Challenge

The core innovation of this paper is the move from a homogeneous setting (using only one of the above) to a heterogeneous one. The goal is to train a single model where the label for any given image could be any of the three:

\[ \mathbf{Y}_i \in \{\mathbf{Y}^t_i,\; \mathbf{Y}^p_i,\; \mathbf{Y}^b_i\} \]

This flexibility allows for “Budget-Aware” annotation strategies, which we will discuss in the experiments section.
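To make the heterogeneous setting concrete, here is a minimal sketch of how such a dataset could be represented in code. The class and field names (`WeakLabel`, `Sample`) are illustrative assumptions, not structures from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class WeakLabel:
    """One annotated instance (or image-level tag) in the heterogeneous setting.

    class_id is always present; at most one of the spatial fields is populated,
    depending on which label type this image was annotated with.
    """
    class_id: int
    point: Optional[Tuple[float, float]] = None               # (x, y) click for label type "point"
    box: Optional[Tuple[float, float, float, float]] = None   # (x1, y1, x2, y2) for label type "box"

@dataclass
class Sample:
    image_path: str
    label_type: str          # "tag" | "point" | "box"
    labels: List[WeakLabel]  # for "tag", only class_id is filled in

# A toy heterogeneous dataset: one image per label type.
dataset = [
    Sample("img_001.jpg", "tag",   [WeakLabel(class_id=7)]),                       # "there is a dog"
    Sample("img_002.jpg", "point", [WeakLabel(class_id=7, point=(120.0, 88.0))]),  # one click per instance
    Sample("img_003.jpg", "box",   [WeakLabel(class_id=2, box=(30.0, 40.0, 210.0, 180.0))]),
]
```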

The Foundation: Segment Anything Model (SAM)

To bridge the gap between these weak labels (points/boxes) and the desired output (masks), the authors employ SAM. SAM is a vision foundation model trained on 1 billion masks. Its “superpower” is promptable segmentation.

Mathematically, SAM takes an image \(\mathbf{I}\) and a prompt \(\mathbf{P}\) (which can be a point or a box) and outputs a mask \(\mathbf{M}\):

\[ \mathbf{M} = \mathrm{SAM}(\mathbf{I}, \mathbf{P}) \]
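For readers who have not used promptable segmentation before, here is a minimal sketch with the official `segment_anything` package. This is just the vanilla SAM API, not part of WISH itself; the checkpoint and image paths are placeholders.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (model type and path are placeholders; weights come from the SAM repo).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                # computes the image embedding once

# Point prompt: a single foreground click at (x, y).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320.0, 240.0]]),
    point_labels=np.array([1]),           # 1 = foreground point
    multimask_output=True,                # SAM returns 3 candidate masks to handle ambiguity
)
print(masks.shape, scores)                # (3, H, W) boolean masks with confidence scores

# Box prompt: (x1, y1, x2, y2).
box_masks, box_scores, _ = predictor.predict(
    box=np.array([100.0, 80.0, 400.0, 360.0]),
    multimask_output=False,               # a box is less ambiguous; take the single best mask
)
```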

Most prior works use SAM as a “teacher” that generates pseudo-labels offline. WISH takes a different approach: it integrates SAM’s Prompt Encoder directly into the learning process to teach the model how to understand objects.


The WISH Framework

The WISH framework is built upon Mask2Former, a state-of-the-art architecture for segmentation. Mask2Former uses a Transformer decoder to attend to image features and generate “object queries.” Each query represents a potential object instance.

In a standard fully supervised Mask2Former:

  1. Queries (\(\mathbf{Q}\)) are processed by a Transformer Decoder.
  2. A Classification Head (\(\mathcal{H}_{cls}\)) predicts the object class.
  3. A Mask Head (\(\mathcal{H}_{mask}\)) generates the binary mask.

The authors of WISH realized that since they don’t have ground truth masks, they need a different way to guide the model. They introduce a third head: the Prompt Head.

Architecture Overview

Let’s look at the complete architecture:

Overview of the proposed WISH framework.

The workflow consists of several interconnected stages:

  1. Image Encoding: The image is processed to extract features \(\mathbf{F}\).
  2. Transformer Decoder: Generates object queries \(\mathbf{Q}\).
  3. Prediction Heads:
  • \(\hat{\mathbf{y}}_{cls}\): Class predictions.
  • \(\hat{\mathbf{M}}\): Mask predictions (via dot product with pixel embeddings).
  • \(\hat{\mathbf{Z}}\): Prompt Latent Predictions (The new contribution).

The standard predictions are calculated as:

\[ \hat{\mathbf{y}}_{cls} = \mathcal{H}_{cls}(\mathbf{Q}), \qquad \hat{\mathbf{M}} = \mathcal{H}_{mask}(\mathbf{Q}) \cdot \mathbf{F}_{pixel} \]
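As a rough mental model, the three heads can be sketched in PyTorch as below. The dimensions, head designs, and names are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class WishHeads(nn.Module):
    """Illustrative prediction heads on top of N object queries (not the paper's exact code)."""
    def __init__(self, d_model=256, num_classes=80, prompt_dim=256):
        super().__init__()
        self.cls_head = nn.Linear(d_model, num_classes + 1)        # +1 for the "no object" class
        self.mask_head = nn.Sequential(                            # maps queries to mask embeddings
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))
        self.prompt_head = nn.Sequential(                          # new: predicts SAM prompt latents
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, prompt_dim))

    def forward(self, queries, pixel_embed):
        # queries: (B, N, d_model); pixel_embed: (B, d_model, H, W) from the pixel decoder
        y_cls = self.cls_head(queries)                                   # (B, N, C+1)
        mask_embed = self.mask_head(queries)                             # (B, N, d_model)
        masks = torch.einsum("bnd,bdhw->bnhw", mask_embed, pixel_embed)  # dot product -> (B, N, H, W)
        z_hat = self.prompt_head(queries)                                # (B, N, prompt_dim)
        return y_cls, masks, z_hat

heads = WishHeads()
q = torch.randn(2, 100, 256)          # 100 object queries per image
f = torch.randn(2, 256, 64, 64)       # per-pixel embeddings
y_cls, masks, z_hat = heads(q, f)
```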

But what is this “Prompt Latent Prediction” \(\hat{\mathbf{Z}}\)?

Learning from the Latent Space

This is the most conceptually interesting part of the paper. Since the weak labels (points and boxes) are essentially “prompts,” the authors use SAM’s pre-trained Prompt Encoder (\(\mathcal{E}_{SAM}^{prompt}\)) to convert the ground truth weak labels into a latent embedding vector \(\mathbf{Z}\).

\[ \mathbf{Z} = \mathcal{E}_{SAM}^{prompt}(\mathbf{P}), \qquad \mathbf{P} \in \{\mathbf{X}, \mathbf{B}\} \]

Here, \(\mathbf{Z}\) represents how SAM “sees” the weak label. It is a rich vector representation of the condition “there is an object at this point/box.”

The WISH model then attempts to predict this vector directly from its object queries using the new Prompt Head (\(\mathcal{H}_{prompt}\)):

\[ \hat{\mathbf{Z}} = \mathcal{H}_{prompt}(\mathbf{Q}) \]

By forcing the model to predict \(\hat{\mathbf{Z}}\) that matches the ground truth \(\mathbf{Z}\), WISH ensures that the object queries capture the same localization and instance information that SAM expects. This acts as a powerful constraint, effectively transferring SAM’s understanding of “what defines an instance” into the WISH model without needing dense masks.
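Here is a hedged sketch of the idea: a ground-truth point or box is passed through SAM's frozen prompt encoder to obtain the target latent \(\mathbf{Z}\), and the prompt head's prediction is pulled toward it. The coordinate handling, flattening, and KL formulation below are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F
from segment_anything import sam_model_registry

# Frozen, pre-trained SAM prompt encoder (checkpoint path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
prompt_encoder = sam.prompt_encoder
for p in prompt_encoder.parameters():
    p.requires_grad_(False)

# Ground-truth point prompt for one instance: (x, y) coordinates and label 1 (foreground).
# Coordinates are assumed to already be in SAM's input frame.
coords = torch.tensor([[[320.0, 240.0]]])    # (B=1, num_points=1, 2)
labels = torch.tensor([[1]])                 # (B=1, num_points=1)
sparse_z, _ = prompt_encoder(points=(coords, labels), boxes=None, masks=None)
z_gt = sparse_z.flatten(1)                   # target latent Z for this instance

# A ground-truth box prompt works the same way:
box = torch.tensor([[100.0, 80.0, 400.0, 360.0]])   # (B=1, 4) in (x1, y1, x2, y2)
sparse_zb, _ = prompt_encoder(points=None, boxes=box, masks=None)

# The prompt head's output for the matched query is pulled toward z_gt; here a KL divergence
# over softmax-normalized latents stands in for the paper's loss.
z_hat = torch.randn_like(z_gt)               # stand-in for H_prompt(Q)
loss_prompt = F.kl_div(F.log_softmax(z_hat, dim=-1),
                       F.softmax(z_gt, dim=-1), reduction="batchmean")
```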

Multi-Stage Matching

In Transformer-based detection (like DETR or Mask2Former), the model outputs a fixed set of \(N\) predictions, but an image has \(K\) actual objects. We must figure out which prediction pairs with which ground truth object. This is called Bipartite Matching.

In fully supervised learning, we match based on Class and Mask similarity. In WISH, we lack ground truth masks, so the matching cost is redesigned to include three components:

1. Classification Cost: Does the predicted class match the ground truth label? \(\mathcal{C}_{cls} = -\hat{\mathbf{y}}_{cls}(c)\), the negative predicted probability of the ground-truth class \(c\).

2. Prompt Cost (The Novelty): Does the predicted prompt latent vector \(\hat{\mathbf{Z}}\) match the ground truth prompt embedding \(\mathbf{Z}\)? They use Kullback-Leibler Divergence (KLD) to measure this similarity: \(\mathcal{C}_{prompt} = \mathrm{KL}\big(\mathbf{Z} \,\|\, \hat{\mathbf{Z}}\big)\).

3. Mask Cost (Using SAM): Even without ground truth masks, we can generate a “proxy” mask. We feed the weak label (point/box) into SAM to generate a mask \(\mathbf{M}_{SAM}\). SAM usually outputs three candidate masks to handle ambiguity (e.g., the whole person vs. just the face). WISH adaptively selects the one that best matches the current prediction: \(\mathcal{C}_{mask} = \min_{j \in \{1,2,3\}} \mathcal{L}_{mask}\big(\hat{\mathbf{M}}, \mathbf{M}_{SAM}^{(j)}\big)\).

Total Matching Cost: The final cost function combines these three distinct signals (semantic class, latent prompt representation, and spatial mask consistency): \(\mathcal{C} = \lambda_{cls}\,\mathcal{C}_{cls} + \lambda_{prompt}\,\mathcal{C}_{prompt} + \lambda_{mask}\,\mathcal{C}_{mask}\).

Once the optimal matching between predictions and ground truth is found (using the Hungarian algorithm), the final segmentation loss is calculated to train the network:

\[ \mathcal{L}_{seg} = \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{prompt}\,\mathcal{L}_{prompt} + \lambda_{mask}\,\mathcal{L}_{mask} \]
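Below is a sketch of how such a matching step might look, assuming soft dice as the mask cost and illustrative weights; the paper's exact cost terms and hyperparameters may differ.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def match(pred_cls, pred_z, pred_masks, gt_cls, gt_z, sam_masks,
          w_cls=2.0, w_prompt=1.0, w_mask=5.0):
    """Hungarian matching between N predictions and K weakly labeled instances.

    pred_cls:   (N, C+1) class logits          gt_cls:    (K,) class indices
    pred_z:     (N, D)   prompt latents        gt_z:      (K, D) SAM prompt embeddings
    pred_masks: (N, H, W) mask logits          sam_masks: (K, 3, H, W) SAM proxy masks (3 candidates)
    Weights are illustrative, not the paper's values.
    """
    prob = pred_cls.softmax(-1)
    cost_cls = -prob[:, gt_cls]                                        # (N, K)

    # Prompt cost: KL-style divergence between predicted and ground-truth prompt latents.
    log_p = F.log_softmax(pred_z, dim=-1)                              # (N, D)
    q = F.softmax(gt_z, dim=-1)                                        # (K, D)
    cost_prompt = -(q.unsqueeze(0) * log_p.unsqueeze(1)).sum(-1)       # (N, K)

    # Mask cost: soft dice against SAM's proxy masks, adaptively taking the best of 3 candidates.
    pm = pred_masks.sigmoid().flatten(1)                               # (N, HW)
    sm = sam_masks.flatten(2)                                          # (K, 3, HW)
    inter = torch.einsum("np,ksp->nks", pm, sm)
    dice = 1 - (2 * inter + 1) / (pm.sum(-1)[:, None, None] + sm.sum(-1)[None] + 1)
    cost_mask = dice.min(dim=-1).values                                # (N, K)

    cost = w_cls * cost_cls + w_prompt * cost_prompt + w_mask * cost_mask
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return rows, cols   # prediction index i matched to ground-truth instance j
```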


Handling the “Tag” Problem

The framework described above works beautifully for Points and Boxes because they have spatial coordinates that SAM can encode. But what about Image-level Tags?

A tag says “Cat,” but gives zero coordinates \((x, y)\). SAM cannot accept “Cat” as a prompt to locate an object. The authors need to bridge the gap between semantic tags and spatial prompts.

Step 1: Generating CAMs

To find the object, WISH uses Class Activation Maps (CAMs). They add a small auxiliary branch to the image encoder:

\[ \mathbf{A} = \mathcal{H}_{cam}(\mathbf{F}) \]

This generates a coarse heatmap \(\mathbf{A}_c\) for every class \(c\). This branch is trained using a simple classification loss (\(\mathcal{L}_{cam}\)) based on the image tags: \(\mathcal{L}_{cam} = \mathrm{BCE}\big(\mathrm{GAP}(\mathbf{A}), \mathbf{Y}^t\big)\), where GAP denotes global average pooling.
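A minimal sketch of such an auxiliary CAM branch, assuming the standard 1×1-convolution-plus-global-average-pooling recipe with a multi-label BCE loss (the paper's exact branch design may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CamBranch(nn.Module):
    """Auxiliary branch: per-class activation maps from encoder features (illustrative)."""
    def __init__(self, in_channels=256, num_classes=20):
        super().__init__()
        self.classifier = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, features):
        cams = self.classifier(features)          # (B, C, H, W): one heatmap A_c per class
        logits = cams.mean(dim=(2, 3))            # global average pooling -> image-level scores
        return cams, logits

branch = CamBranch()
feats = torch.randn(2, 256, 32, 32)                            # encoder features F
tags = torch.zeros(2, 20); tags[0, 7] = 1; tags[1, 2] = 1      # image-level tags Y^t (multi-hot)
cams, logits = branch(feats)
loss_cam = F.binary_cross_entropy_with_logits(logits, tags)    # trained from tags only
```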

Step 2: From Heatmaps to Points

Once the model learns to highlight regions associated with “Cat,” the authors extract local peaks from the heatmap.

  1. Find the highest activation points in the CAM.
  2. Filter out low-confidence peaks.
  3. Treat these peaks as pseudo-point labels.

Effectively, they convert the abstract Tag label into a Point label:

\[ \mathbf{Y}^t \;\longrightarrow\; \{(\mathbf{X}_k, c_k)\}, \qquad \mathbf{X}_k = \text{local peaks of } \mathbf{A}_{c_k} \]

Now that the Tag has been converted into a Point (\(\mathbf{X}\)), it can be fed into the standard WISH pipeline (SAM Prompt Encoder -> Latent Space) just like a manual point annotation!
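A minimal sketch of the peak-extraction step, assuming max-pooling-based local-maximum detection with an illustrative confidence threshold (the paper's exact filtering rule may differ):

```python
import torch
import torch.nn.functional as F

def cams_to_pseudo_points(cams, present_classes, threshold=0.7, kernel=7):
    """Turn class activation maps into pseudo-point labels (illustrative peak extraction).

    cams: (C, H, W) activation maps; present_classes: class ids taken from the image tag.
    Returns a list of (class_id, x, y) pseudo-points.
    """
    points = []
    for c in present_classes:
        a = cams[c]
        a = (a - a.min()) / (a.max() - a.min() + 1e-6)          # normalize to [0, 1]
        pooled = F.max_pool2d(a[None, None], kernel, stride=1, padding=kernel // 2)[0, 0]
        peaks = (a == pooled) & (a > threshold)                 # local maxima above the threshold
        ys, xs = torch.nonzero(peaks, as_tuple=True)
        points += [(c, int(x), int(y)) for x, y in zip(xs, ys)]
    return points

cams = torch.rand(20, 64, 64)
print(cams_to_pseudo_points(cams, present_classes=[7]))         # pseudo-points for class 7
```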

Step 3: Self-Enhancement

CAMs are notoriously noisy and “blobby.” To refine them, the authors introduce a feedback loop. As the main WISH model (the Mask2Former part) gets better at predicting crisp masks (\(\hat{\mathbf{M}}\)), these predictions are used to supervise and sharpen the CAMs.

\[ \mathcal{L}_{se} = \mathcal{L}_{mask}\big(\mathbf{A},\, \operatorname{sg}(\hat{\mathbf{M}})\big) \]

where \(\operatorname{sg}(\cdot)\) denotes a stop-gradient, so the mask predictions supervise the CAMs rather than the other way around.

This Self-Enhancement Loss ensures that as training progresses, the CAMs become more accurate, which leads to better pseudo-points, which leads to better training of the main model—a virtuous cycle.
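One hedged way to phrase such a feedback loss in code, assuming the predicted masks are binarized and detached so that gradients flow only into the CAM branch (an illustrative form, not necessarily the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def self_enhancement_loss(cams, pred_masks, matched_classes):
    """Sharpen CAMs with the model's own mask predictions (illustrative form).

    cams:            (C, H, W) class activation logits from the auxiliary branch
    pred_masks:      (K, H, W) mask logits for the K matched queries
    matched_classes: (K,) class id of each matched query
    """
    targets = (pred_masks.sigmoid() > 0.5).float().detach()   # stop-gradient: masks supervise CAMs only
    cam_for_instance = cams[matched_classes]                  # (K, H, W)
    return F.binary_cross_entropy_with_logits(cam_for_instance, targets)

cams = torch.randn(20, 64, 64)
pred_masks = torch.randn(3, 64, 64)
classes = torch.tensor([7, 7, 2])
loss_se = self_enhancement_loss(cams, pred_masks, classes)
```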

The Total Loss

The final objective function for WISH is a sum of the segmentation loss (driven by matching), the CAM classification loss, and the self-enhancement loss:

\[ \mathcal{L}_{total} = \mathcal{L}_{seg} + \mathcal{L}_{cam} + \mathcal{L}_{se} \]


Experimental Results

The authors evaluated WISH on two major benchmarks: PASCAL VOC and MS-COCO. They compared WISH against state-of-the-art (SoTA) methods in both homogeneous and heterogeneous settings.

1. Homogeneous Performance

First, they checked if WISH works well when used traditionally (only tags, only points, or only boxes).

PASCAL VOC Results: Table 1. Quantitative comparison between WISH and WSIS works under homogeneous settings on VOC 2012 val set.

As shown in Table 1, WISH achieves new SoTA results across all categories.

  • Tags (T): WISH achieves 46.0 AP, beating BESTIE, the previous best, which had 51.0 AP50. (Note: AP averages over stricter IoU thresholds than AP50, so the two raw numbers are not directly comparable.)
  • Points (P): WISH jumps to 52.4 AP, significantly higher than previous methods.
  • Boxes (B): WISH hits 54.6 AP, approaching the performance of fully supervised Mask2Former (54.8 AP). This suggests that bounding boxes, when used with SAM-guided training, carry almost as much signal as pixel-wise masks.

COCO Results: Table 2. Quantitative comparison between WISH and conventional WSIS methods under homogeneous settings on COCO 2017.

Table 2 confirms the trend on the much harder COCO dataset. WISH consistently outperforms specialized methods like BoxInst and DiscoBox.

2. Heterogeneous Performance (The Budget Study)

This is the most critical experiment. The authors pose the question: “Given a fixed budget, what is the best annotation strategy?”

They define a budget \(\zeta\) and assign costs to each label type:

  • Tag cost (\(\beta_t\)): 1
  • Point cost (\(\beta_p\)): 6
  • Box cost (\(\beta_b\)): 12

The constraint is:

\[ \beta_t N_t + \beta_p N_p + \beta_b N_b \le \zeta \]

where \(N_t\), \(N_p\), and \(N_b\) are the numbers of tag-, point-, and box-annotated images.

They tested different combinations of labels that sum up to the same total cost.
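To make the budget arithmetic concrete, here is a quick check that the Tag + Box allocations discussed below land on roughly the same total cost:

```python
# Costs per label type as defined above.
beta = {"tag": 1, "point": 6, "box": 12}

def total_cost(counts):
    return sum(beta[k] * n for k, n in counts.items())

# Allocations from the Tag + Box row of Table 3 (see the findings below):
all_tags  = {"tag": 10_582, "box": 0}      # every image tagged
all_boxes = {"tag": 0,      "box": 882}    # far fewer images, richer labels
mixed     = {"tag": 5_290,  "box": 441}    # half the budget on each

for name, counts in [("all tags", all_tags), ("all boxes", all_boxes), ("mixed", mixed)]:
    print(f"{name:9s}: cost = {total_cost(counts):,}")
# all tags : cost = 10,582
# all boxes: cost = 10,584
# mixed    : cost = 10,582   -> roughly the same budget zeta for all three strategies
```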

Table 3. Results of WISH framework under heterogeneous settings on PASCAL VOC val set.

Key Findings from Table 3:

  • Tag + Box (bottom row of Table 3):
    • Left column: spend the entire budget on Tags (10,582 images) \(\rightarrow\) 46.0 AP.
    • Right column: spend the entire budget on Boxes (882 images) \(\rightarrow\) 45.1 AP.
    • Middle column: a mix of 5,290 Tags + 441 Boxes \(\rightarrow\) 48.3 AP.

The heterogeneous mix outperforms the homogeneous extremes. This is a crucial insight for industry applications. It implies that rather than annotating a small dataset perfectly (boxes/masks) or a massive dataset poorly (tags), a hybrid strategy yields the best model performance. The rich spatial information from the few boxes helps the model learn to segment, while the large volume of tags provides diversity and generalization.

3. Ablation: Why direct use of SAM decoder fails

One might ask: “If you have SAM, why not just train the Prompt Head and then feed that into the frozen SAM mask decoder during inference? Why use the Mask2Former decoder?”

The authors tested this alternative, illustrated in Figure 3.

Figure 3. (a): The original WISH framework… (b): An alternative leveraging the prompt head followed by the SAM mask decoder.

They found that approach (b), which relies on the frozen SAM decoder at inference time, dropped performance by more than 3 AP. This shows that distilling SAM's knowledge into the WISH weights (approach (a)) is more effective than relying on SAM as a runtime module. WISH learns to resolve ambiguities and class-specific features that the generic SAM decoder might miss.


Conclusion

The WISH paper presents a compelling step forward for Weakly Supervised Instance Segmentation. By breaking the silo of homogeneous labels, it opens the door to flexible, budget-aware annotation strategies.

Key Takeaways:

  1. Unification: WISH is the first framework to seamlessly integrate tags, points, and boxes into one model.
  2. Prompt Latent Space: Instead of just mimicking SAM’s output masks, WISH mimics SAM’s internal representation of prompts, providing a richer supervisory signal.
  3. Heterogeneity Wins: Mixing cheap tags with expensive boxes yields better results than using either exclusively for the same price.

As foundation models like SAM continue to evolve, frameworks like WISH demonstrate the best way to utilize them: not just as black-box tools, but as teachers that guide the training of specialized, efficient models. For students and researchers in computer vision, WISH offers a blueprint for how to handle the trade-off between data quantity, data quality, and annotation cost.