Introduction
The release of the Segment Anything Model (SAM) marked a turning point in computer vision. Trained on over 1 billion masks, SAM demonstrated an incredible ability to perform “zero-shot” segmentation—identifying objects it had never seen before without specific training. It seemed like the “Jack of all trades” for image analysis.
However, as many researchers and students soon discovered, being a Jack of all trades often means being a master of none. When applied to highly specialized domains—such as identifying polyps in medical imaging, detecting camouflaged animals, or spotting specific crop diseases—SAM’s performance often drops. It struggles to grasp the nuanced, domain-specific features that these tasks demand.
The standard solution is Parameter-Efficient Fine-Tuning (PEFT). Instead of retraining the massive model from scratch (which is computationally prohibitive), we freeze most of the model and tweak only a small set of parameters (using methods like Adapters or LoRA).
But here lies the conflict: standard fine-tuning methods often treat the model’s components in isolation. They adjust the weights but fail to preserve the sophisticated relationships between the image encoder and the mask decoder that SAM learned during its massive pre-training. By focusing too much on new data, we risk overwriting the “universal visual logic” that made SAM great in the first place.
In this post, we will dive deep into InfoSAM, a novel approach presented at ICML 2025. This method uses Information Theory to bridge the gap between the pre-trained foundation model (the teacher) and the fine-tuned model (the student). InfoSAM doesn’t just copy features; it distills and preserves the domain-invariant relationships—the structural “essence”—ensuring that the fine-tuned model adapts to new tasks without forgetting how to see.

As shown in Figure 1 above, unlike traditional methods that tune modules separately (a) or align simple features (b), InfoSAM (c) focuses on the “Relation Transfer” using information-theoretic principles.
Background: The Challenges of Adaptation
Before dissecting InfoSAM, let’s establish the context of the problem.
The Encoder-Decoder Disconnect
SAM follows a standard encoder-decoder architecture:
- Image Encoder: A heavy Vision Transformer (ViT) that processes the raw image into embeddings.
- Mask Decoder: A lighter module that takes those embeddings (and prompts) to generate the final segmentation mask.
When we use PEFT methods like LoRA (Low-Rank Adaptation) or Adapters, we usually inject small, trainable layers into the encoder or decoder. The problem is that these modifications can disrupt the implicit harmony between the encoder and decoder. The extensive pre-training of SAM established a delicate distribution of features that links these two modules. Naive fine-tuning tends to suppress these universal visual features in favor of overfitting to the specific textures or colors of the new dataset.
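To make this concrete, here is a minimal sketch of how a LoRA-style adapter can be wrapped around a frozen linear layer in SAM's image encoder. The module path in the usage comment, the rank, and the scaling factor are illustrative assumptions, not the exact setup used in the paper:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update (W + scale * B @ A)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pre-trained weights
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)   # start as a no-op so SAM is unchanged at step 0
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))

# Hypothetical usage: wrap the qkv projection of one encoder block.
# sam.image_encoder.blocks[0].attn.qkv = LoRALinear(sam.image_encoder.blocks[0].attn.qkv)
```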
The Solution: Knowledge Distillation?
A common way to keep a model “on track” is Knowledge Distillation (KD). You treat the original, frozen SAM as a “Teacher” and the new, fine-tuning model as a “Student.” You force the Student to mimic the Teacher.
However, standard KD has a flaw in this context. It typically aligns features (e.g., “Make layer X of the student look like layer X of the teacher”). But we don’t necessarily want the student to copy the teacher’s features exactly—after all, the teacher performs poorly on this specific medical or agricultural task!
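For contrast, a naive feature-alignment distillation loss might look like the sketch below (a generic illustration, not any particular method's implementation); it pulls the student toward the teacher indiscriminately, domain biases included:

```python
import torch.nn.functional as F

def naive_feature_kd(student_feats, teacher_feats):
    # Pull the student's embeddings element-wise toward the frozen teacher's.
    # This copies the teacher's natural-image biases along with its useful structure.
    return F.mse_loss(student_feats, teacher_feats)
```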
We need the student to learn the relationships—the structural understanding of objects (edges, shapes, geometry)—while ignoring the teacher’s bias toward natural images (textures, colors). This is where Domain-Invariant Information comes in. We want to transfer the knowledge that holds true across all domains (structure) while filtering out the noise.
Core Method: InfoSAM
The researchers propose InfoSAM to solve this by formulating the fine-tuning process as an information-theoretic problem. The goal is to maximize the transfer of helpful relational knowledge while minimizing the transfer of useless or harmful information.
The Architecture Overview
Let’s look at the overall flow of InfoSAM.

The framework operates on two parallel tracks:
- The Teacher (Blue): The frozen, pre-trained SAM.
- The Student (Orange): The SAM undergoing fine-tuning (with Adapters or LoRA).
The magic happens in the Relation Module (\(f^T\) and \(f^S\)). Instead of looking at the image embeddings (\(z_i\)) or mask tokens (\(z_m\)) in isolation, InfoSAM extracts the interaction between them.
Step 1: Extracting Relations
The first challenge is quantifying the relationship between the encoder and decoder. The authors introduce an Attention-Based Relation Module.

As visualized in Figure 3, this module takes the image features (\(z_i\)) and the mask tokens (\(z_m\)) as input. It uses an attention mechanism to compute a Relation Representation (\(r\)).
- It projects the mask token into a Query (\(Q\)).
- It projects the image feature into a Key (\(K\)).
- It computes an attention map that represents how the mask decoder is “looking at” the image encoder’s features.
Mathematically, the attention score \(S_\alpha\) combines the dot product of \(Q\) and \(K\) with the residuals of the original inputs:

This results in a compressed tensor \(r^T\) (for the teacher) and \(r^S\) (for the student) that encapsulates the structural dependencies of the model.
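As a mental model, an attention-based relation module might look roughly like the following PyTorch sketch. The exact projections, the residual combination, and the tensor shapes are assumptions for illustration; the paper's module may differ in detail:

```python
import math
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    """Sketch: compute a relation representation r from mask tokens and image features."""
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # mask tokens  -> queries Q
        self.k_proj = nn.Linear(dim, dim)   # image features -> keys  K

    def forward(self, z_m, z_i):
        # z_m: (B, num_mask_tokens, dim), z_i: (B, num_patches, dim)
        q = self.q_proj(z_m)
        k = self.k_proj(z_i)
        attn = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])   # (B, tokens, patches)
        # Mix the attention output with a residual of the original input,
        # echoing the idea that the score combines the QK dot product with residuals.
        r = attn.softmax(dim=-1) @ z_i + z_m                       # (B, tokens, dim)
        return r
```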
Step 2: The Information Bottleneck (Compression)
Here is the critical insight: Not all relationships in the Teacher are worth keeping. Some are “pseudo-invariant”—for example, relying on color distributions that might work for a dog in a park but fail for a tumor in an X-ray.
To filter this, InfoSAM applies the Information Bottleneck Principle. The goal is to compress the relation representation \(r^T\) so that it retains only the essential domain-invariant information.
The authors use Rényi’s \(\alpha\)-entropy to measure information. Unlike standard Shannon entropy, which requires estimating full probability distributions (very hard for high-dimensional images), Rényi’s entropy can be estimated directly from data samples using matrix operations.
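Concretely, the matrix-based formulation works roughly as follows (a sketch of the standard estimator; the paper's kernel choices may differ). Given \(n\) samples, build a kernel Gram matrix \(K_{ij} = \kappa(x_i, x_j)\), normalize it to unit trace, \(A = K / \operatorname{tr}(K)\), and compute

\[
S_\alpha(A) = \frac{1}{1-\alpha} \log_2\!\big(\operatorname{tr}(A^\alpha)\big) = \frac{1}{1-\alpha} \log_2\!\Big(\sum_{j=1}^{n} \lambda_j(A)^\alpha\Big),
\]

where \(\lambda_j(A)\) are the eigenvalues of \(A\). For \(\alpha = 2\), this collapses to \(S_2(A) = -\log_2 \operatorname{tr}(A^2) = -\log_2 \lVert A \rVert_F^2\), which is the Frobenius-norm shortcut mentioned below.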
The objective is to minimize the mutual information between the raw inputs (\(z_i^T, z_m^T\)) and the extracted relation (\(r^T\)). This forces the Relation Module to throw away redundancy and noise, keeping only the most salient structural links.
The loss function for this compression (\(\mathcal{L}_r\)) is derived using matrix-based entropy. To make this computationally efficient, the authors set \(\alpha=2\). This allows them to calculate entropy using the Frobenius norm of the Gram matrices, avoiding expensive eigenvalue decompositions.

In this equation:
- The first term (\(-\log...\)) maximizes the entropy of the relation \(r\), encouraging rich features.
- The second term (\(\log...\)) minimizes the joint entropy, filtering out spurious correlations between the encoder/decoder and the relation module.
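Below is a rough sketch of how the \(\alpha = 2\) matrix-based entropy can be computed in practice, and how a compression-style loss could be assembled from it. The Gaussian kernel, the bandwidth, and the exact combination of terms are assumptions for illustration, not the paper's implementation:

```python
import torch

def gram_matrix(x, sigma: float = 1.0):
    # x: (n, d) flattened representations; Gaussian-kernel Gram matrix, normalized to unit trace.
    d2 = torch.cdist(x, x).pow(2)
    K = torch.exp(-d2 / (2 * sigma ** 2))
    return K / K.trace()

def renyi2_entropy(A):
    # For alpha = 2: S_2(A) = -log tr(A^2) = -log ||A||_F^2 (no eigendecomposition needed).
    return -torch.log((A * A).sum())

def joint_entropy(A, B):
    # Matrix-based joint entropy via the normalized Hadamard product of Gram matrices.
    H = A * B
    return renyi2_entropy(H / H.trace())

def compression_loss(z_i, z_m, r):
    # Two-term objective mirroring the description above: encourage a rich relation r,
    # while suppressing the joint entropy it shares with the raw encoder/decoder
    # features (spurious, domain-specific correlations).
    A_i, A_m, A_r = gram_matrix(z_i), gram_matrix(z_m), gram_matrix(r)
    return -renyi2_entropy(A_r) + joint_entropy(A_i, A_r) + joint_entropy(A_m, A_r)
```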
Step 3: Relational Distillation (Transfer)
Once we have a clean, compressed representation of the Teacher’s structural knowledge (\(r^T\)), we want to transfer it to the Student.
We do this by maximizing the Mutual Information between the Teacher’s relation (\(r^T\)) and the Student’s relation (\(r^S\)).

This distillation loss (\(\mathcal{L}_d\)) aligns the Student’s understanding of “structure” with the Teacher’s, ensuring the fine-tuned model doesn’t “forget” how to define object boundaries.
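Reusing `gram_matrix`, `renyi2_entropy`, and `joint_entropy` from the sketch above, a hedged sketch of this transfer term could look as follows (again, the exact estimator and weighting in the paper may differ):

```python
def distillation_loss(r_teacher, r_student):
    # Mutual information I(r_T; r_S) = S(r_T) + S(r_S) - S(r_T, r_S) with the
    # alpha = 2 matrix-based entropies; negate it so that minimizing the loss
    # maximizes the relational information shared between teacher and student.
    A_t, A_s = gram_matrix(r_teacher), gram_matrix(r_student)
    mi = renyi2_entropy(A_t) + renyi2_entropy(A_s) - joint_entropy(A_t, A_s)
    return -mi
```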
The Final Objective
The total loss function combines the standard segmentation loss (Cross-Entropy + IoU) with the new information-theoretic losses.


By tuning \(\lambda_1\) and \(\lambda_2\), the model balances between compressing the teacher’s knowledge (cleaning it up) and transferring it to the student.
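In symbols, the combination described above is of the form

\[
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{seg}} + \lambda_1 \mathcal{L}_r + \lambda_2 \mathcal{L}_d,
\]

where \(\mathcal{L}_{\text{seg}}\) stands for the Cross-Entropy and IoU segmentation terms, and \(\mathcal{L}_r\), \(\mathcal{L}_d\) are the compression and distillation losses from Steps 2 and 3.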
Experiments & Results
The researchers tested InfoSAM across four highly diverse domains: Natural Images (Camouflage), Medical Imaging (Polyps, Skin Lesions), Agriculture (Leaf Disease), and Remote Sensing (Roads). They applied InfoSAM to both the original SAM and the newer SAM2.
Quantitative Performance
The results show that InfoSAM consistently outperforms other state-of-the-art PEFT methods.
1. Comparison with PEFT Methods: In Table 1 below, you can see InfoSAM (bottom rows) compared against methods like Adapter, LoRA, and specialized SAM tuners like SAM-Adapter. Note the significant jump in the “Remote Sensing” column (Road IoU), where the teacher model initially failed completely (7.2% IoU), but InfoSAM boosted the student to over 61%.

2. Comparison with Distillation Methods: Table 2 compares InfoSAM against other distillation techniques like MobileSAM and TinySAM. Standard distillation methods often degrade performance because they force the student to mimic the teacher too closely—even when the teacher is wrong! InfoSAM, by focusing on relations rather than raw features, avoids this trap.

Scaling the Teacher
An interesting question is: Does a bigger Teacher help? The authors tested using ViT-Large and ViT-Huge as teachers for a ViT-Base student.

As shown in Figure 4, InfoSAM scales beautifully. As the teacher gets smarter (moving from ViT-B to ViT-H), the student’s performance improves, outperforming other distillation methods like MobileSAM in complex scenarios.
Qualitative Visualization
Numbers are great, but in computer vision, seeing is believing.
Camouflaged Object Detection: In Figure 10, observe the “GT” (Ground Truth) column versus the model outputs. Standard SAM (3rd column) produces blobs that barely resemble the bird. InfoSAM (last column) produces a sharp, accurate mask that closely hugs the bird’s silhouette.

Leaf Disease Segmentation: In agriculture, spotting the exact extent of a disease is vital. InfoSAM captures the fragmented, irregular shapes of leaf blight much better than the baseline Adapter methods.

Road Segmentation (Remote Sensing): This is arguably the hardest task for SAM because roads are thin, continuous structures that look nothing like “objects” in the traditional sense. InfoSAM maintains the connectivity of the road network far better than competitors.

What makes the Relation Module special?
The authors performed an ablation study (Figure 9 below) to see what the Relation Module is actually learning.
- Top Row (Without Regularization): The relation maps are noisy and scattered.
- Bottom Row (With InfoSAM Regularization): The maps become focused and structured over time.
This visualizes the Information Bottleneck in action. The model is actively suppressing noise (the white, scattered areas) and focusing on the domain-invariant structure (the dark, focused areas).

Furthermore, the authors found that a Relation Module trained on leaves could be transferred to a model fine-tuning on medical images and still provide a benefit. This strongly suggests that the module has successfully captured universal, domain-invariant concepts of segmentation (like “what is an edge”) rather than dataset-specific memorization.
Conclusion & Implications
InfoSAM represents a sophisticated step forward in the world of Foundation Models. It moves beyond the brute-force approach of “just add more trainable parameters” and introduces a principled, mathematical framework for knowledge transfer.
Key Takeaways:
- Relationships Matter: Preserving the interaction between the encoder and decoder is more valuable than preserving the features of individual modules.
- Less is More: By using an Information Bottleneck, InfoSAM filters out the “noise” of the pre-trained model, ensuring the student learns only what is necessary for structure and generalization.
- Efficiency via Math: The use of Rényi’s \(\alpha\)-entropy with \(\alpha=2\) allows for complex information-theoretic optimization without the heavy computational cost usually associated with these methods.
For students and researchers working with Large Vision Models, InfoSAM offers a blueprint for how to adapt these giants to specialized tasks. It proves that with the right theoretical perspective, we can build fine-tuned students that don’t just mimic their teachers—they effectively learn from them to become masters of their own new domains.