Introduction
In the rapidly evolving landscape of computer vision, the Vision Transformer (ViT) has emerged as a powerhouse. From self-driving cars to medical imaging, ViTs are achieving remarkable performance, often outperforming traditional Convolutional Neural Networks (CNNs). However, like many deep learning models, they suffer from a significant drawback: they act as “black boxes.” We feed an image in, and a classification comes out, but we often have little insight into why the model made that decision.
This lack of transparency is a critical bottleneck for deploying AI in safety-critical fields. If a model classifies a tumor as malignant, a radiologist needs to know which pixels influenced that decision. This is where Feature Attribution comes in—a set of techniques designed to generate “heatmaps” or attribution maps that highlight the input regions most relevant to the model’s prediction.
While there are existing methods to interpret these models, many struggle with the unique architecture of Transformers. In this post, we will dive deep into a paper titled “Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers.” We will explore a novel method called CoIBA (Comprehensive Information Bottleneck for Attribution), which proposes a smarter, mathematically grounded way to “squeeze” the relevant information out of a network to create highly accurate explanations.
The Problem: Disagreement Across Layers
To understand why we need a new method, we first need to look at how current methods fail. A popular approach for interpretation is the Information Bottleneck (IB) principle. The idea is simple: inject noise into a specific layer of the network to filter out irrelevant information, leaving only the signal necessary for the prediction.
However, existing IB-based methods typically focus on a single target layer. The assumption is that analyzing one layer is enough to understand the decision. But deep networks process information hierarchically, and what one layer treats as “important” can differ from what another layer does.

As shown in Figure 1 above, relying on a single layer can be misleading. Look at the top row (the “Cannon” image).
- The attribution map from Layer 2 highlights the barrel.
- Layer 5 highlights the top of the wheels.
- Layer 8 focuses on the bottom of the wheels.
Which one is the “true” explanation? The truth is, the decision-making process is distributed across all these layers. By isolating just one, we get a fragmented and inconsistent view of the model’s reasoning. Furthermore, the chart on the right (b) shows that for different images, different layers provide the “best” explanation. There is no single “golden layer” that works for every image.
Background: The Information Bottleneck
Before we get to the solution, let’s establish the foundation: the Information Bottleneck (IB) principle.
In the context of attribution, the goal is to find a “bottleneck” variable \(Z\) that compresses the input representation \(R\) as much as possible while preserving the information about the target class \(Y\).
Imagine you are looking at a photo of a dog in a park. To identify it as a “dog,” you don’t need to know the exact shade of green of the grass or the shape of the clouds. That is “irrelevant information.” The IB principle tries to mathematically “dampen” that irrelevant signal.
The standard method, known as IBA (Information Bottleneck for Attribution), inserts a bottleneck into a specific layer \(l\). It computes a bottleneck representation \(Z_l\) using a “mask” or damping ratio \(\lambda_l\):
\[
Z_l = \lambda_l \odot R_l + (1 - \lambda_l) \odot \epsilon_l
\]
Here, \(R_l\) is the original feature map, and \(\epsilon_l\) is Gaussian noise. The parameter \(\lambda_l\) controls the signal-to-noise ratio.
- If \(\lambda_l \approx 1\), the signal passes through clearly.
- If \(\lambda_l \approx 0\), the signal is replaced by noise (blocked).
The goal is to learn the optimal \(\lambda_l\) that minimizes the information between the original features and the bottleneck (compression) while maximizing the information between the bottleneck and the label (prediction).
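Written as a single objective (in the usual Lagrangian form, with a per-layer trade-off weight \(\beta_l\); the paper’s exact notation may differ slightly), that trade-off reads:

\[
\max_{\lambda_l}\; I(Z_l; Y) \;-\; \beta_l\, I(R_l; Z_l)
\]

The first term rewards keeping whatever the classifier needs; the second penalizes every bit of information that survives the bottleneck.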

While IBA is powerful, its limitation is its layer-specificity. It restricts information flow in just one place, ignoring the distributed evidence across the rest of the deep network.
The Core Method: Comprehensive Information Bottleneck (CoIBA)
The authors of this paper propose CoIBA, which shifts the paradigm from analyzing a specific layer to analyzing the comprehensive flow of information across multiple target layers.
1. Universal Damping Ratio
The fundamental difference between the traditional IBA and the new CoIBA approach is how they handle the bottleneck.

As illustrated in Figure 2:
- IBA (Left): Inserts a bottleneck in a specific layer (e.g., the \(L\)-th layer) with a specific damping ratio. To get a full picture, you would have to run this process iteratively for every layer, which is computationally expensive and inconsistent.
- CoIBA (Right): Inserts bottlenecks into multiple layers simultaneously. However, instead of learning a separate damping parameter for every single layer (which would be chaotic and hard to optimize), it shares a Universal Damping Ratio (\(\lambda\)) across all targeted layers.
This shared ratio acts as a global “volume control” for relevance. By sharing this parameter, the model is forced to find features that are universally important across the network’s depth, compensating for information that might be over-compressed or omitted in a single layer.
The bottleneck equation for CoIBA looks slightly different. It applies the universal \(\lambda\) to the intermediate bottleneck representations:
\[
Z_l = \lambda \odot R_l + (1 - \lambda) \odot \epsilon_l, \qquad l = 1, \dots, L,
\]
where each \(R_l\) with \(l > 1\) is computed from the dampened output \(Z_{l-1}\) of the previous targeted layer.
Here, \(\lambda\) is derived from a learnable parameter passed through a sigmoid function (ensuring it stays between 0 and 1). Crucially, CoIBA uses uniform perturbation across channels. This means that within a single token (patch of an image), all feature channels are dampened by the same amount. This prevents the model from cheating by picking specific neurons and instead forces it to focus on the spatial “tokens” (parts of the image) that matter.
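As a rough illustration, here is a minimal PyTorch-style sketch of a shared, token-wise damping module. The class and variable names are mine, and the matched-Gaussian noise statistics follow the convention of earlier IBA-style methods rather than anything stated in the paper itself.

```python
import torch
import torch.nn as nn

class UniversalBottleneck(nn.Module):
    """One learnable damping logit per token, shared by every targeted layer."""

    def __init__(self, num_tokens: int):
        super().__init__()
        # A single parameter vector: one logit per token (patch),
        # with no per-layer and no per-channel parameters.
        self.logits = nn.Parameter(torch.full((num_tokens, 1), 5.0))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_tokens, channels), i.e. one ViT block's output.
        lam = torch.sigmoid(self.logits)          # universal damping ratio in (0, 1)
        # Gaussian noise matched to this layer's feature statistics.
        mu = features.mean(dim=(0, 1), keepdim=True)
        std = features.std(dim=(0, 1), keepdim=True)
        eps = mu + std * torch.randn_like(features)
        # Uniform perturbation across channels: lam broadcasts over the channel
        # dimension, so every channel of a given token is dampened equally.
        return lam * features + (1.0 - lam) * eps
```

Because the exact same `lam` is applied in every targeted layer, the gradients from all of those layers accumulate on a single set of parameters, which is what pushes the explanation to be consistent across depth.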
2. The Architecture
How does this fit into the actual Transformer?

Figure 5 provides a schematic overview:
- Input: The image is split into patches and fed into the ViT.
- Bottleneck Layers: As features pass through the targeted layers (e.g., Layer 1, Layer 2… Layer \(L\)), CoIBA injects noise defined by the universal damping ratio.
- Relevant Information: The system estimates the relevant information at each bottleneck from the feature statistics (mean and variance).
- Optimization: The model optimizes the damping ratio \(\lambda\) to highlight the most important pixels in the final attribution map.
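One straightforward way to wire this up is with forward hooks, as in the sketch below. It assumes a timm-style ViT whose transformer blocks live in `model.blocks`; the choice of target layers and the hook-based implementation are illustrative, not taken from the paper.

```python
import torch
import timm

# A frozen ViT backbone (timm naming assumed; any ViT that exposes its blocks works).
model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
for p in model.parameters():
    p.requires_grad_(False)

num_tokens = model.patch_embed.num_patches + 1   # image patches + the [CLS] token
bottleneck = UniversalBottleneck(num_tokens)     # shared-lambda module from the sketch above

feats = {}                                       # will hold R_1 for the compression term

def make_hook(layer_idx, first_idx):
    def hook(module, inputs, output):
        if layer_idx == first_idx:
            feats["r1"] = output                 # feature map entering the first bottleneck
        return bottleneck(output)                # returning a tensor replaces the block output
    return hook

target_layers = [3, 6, 9]                        # which blocks to dampen (an arbitrary choice)
hooks = [model.blocks[i].register_forward_hook(make_hook(i, target_layers[0]))
         for i in target_layers]
```

Every forward pass now routes the features of each targeted block through the same bottleneck, while the rest of the network stays untouched.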
3. The Objective Function and Variational Upper Bound
This is the mathematical heart of the paper. We want to maximize the predictive information while minimizing the information flow through all the bottlenecks. The initial objective function looks like this:
\[
\max_{\lambda}\; I(Z_L; Y) \;-\; \beta \sum_{l=1}^{L} I(Z_{l-1}; Z_l), \qquad \text{with } Z_0 \equiv R_1
\]
This equation says: “Maximize the information the final bottleneck \(Z_L\) carries about the label \(Y\), but subtract the mutual information accumulated between consecutive bottleneck layers.”
The problem is that the mutual information term (the summation) is mathematically intractable: it is too hard to compute exactly, so we need a way to estimate it. Usually, researchers use a variational lower bound (ELBO), but applying that here would require heuristically balancing a separate \(\beta_l\) weight for every layer.
To solve this, the authors propose a Variational Upper Bound. They leverage the property that information cannot be created out of thin air (Data Processing Inequality). The information flowing through the later layers cannot exceed the information flowing through the first bottleneck layer.
Therefore, instead of summing up the compression terms for every layer, we can bound the entire process by looking at the input to the first targeted layer (\(R_1\)) and its bottleneck (\(Z_1\)):
\[
I(Z_{l-1}; Z_l) \;\le\; I(R_1; Z_1) \qquad \text{for every targeted layer } l
\]
This leads to a radically simplified objective function:
\[
\max_{\lambda}\; I(Z_L; Y) \;-\; \beta\, I(R_1; Z_1)
\]
Why is this brilliant?
- Simplicity: We only need to compute mutual information for the first bottleneck layer.
- No Heuristics: We don’t need to manually tune weights (\(\beta_l\)) for every layer to balance their contribution. The math guarantees that restricting the first layer effectively bounds the information flow for the whole sequence.
- Efficiency: It makes the optimization process faster and more stable.
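In code, the resulting optimization loop is short. The sketch below continues from the previous one (reusing `model`, `bottleneck`, and `feats`) and uses the standard variational surrogates of IBA-style methods: cross-entropy against the target class stands in for maximizing \(I(Z_L; Y)\), and a Gaussian KL term at the first bottleneck stands in for \(I(R_1; Z_1)\). The estimator, hyperparameters, and step count are illustrative and may differ from the paper’s.

```python
import torch
import torch.nn.functional as F

def compression_bits(r1, lam):
    """Variational surrogate for I(R_1; Z_1): KL between the dampened distribution
    N(lam*r1 + (1-lam)*mu, (1-lam)^2 * var) and the matched prior N(mu, var)."""
    mu = r1.mean(dim=(0, 1), keepdim=True)
    var = r1.var(dim=(0, 1), keepdim=True) + 1e-8
    z_mu = lam * r1 + (1 - lam) * mu
    z_var = (1 - lam) ** 2 * var + 1e-8
    return 0.5 * (torch.log(var / z_var) + (z_var + (z_mu - mu) ** 2) / var - 1).mean()

image = torch.randn(1, 3, 224, 224)   # stand-in input; use a real preprocessed image
target = torch.tensor([207])          # stand-in class index
beta = 10.0                           # a single trade-off weight, not one per layer
optimizer = torch.optim.Adam(bottleneck.parameters(), lr=1.0)

for _ in range(30):                   # a few dozen steps per image is typical for this family of methods
    optimizer.zero_grad()
    logits = model(image)             # runs through every bottleneck and stores R_1 in feats["r1"]
    lam = torch.sigmoid(bottleneck.logits)
    loss = F.cross_entropy(logits, target) + beta * compression_bits(feats["r1"], lam)
    loss.backward()
    optimizer.step()

# One relevance score per image patch (the [CLS] token is dropped).
attribution = torch.sigmoid(bottleneck.logits)[1:].squeeze(-1).detach()
```

Note that only the first bottleneck contributes a compression term, exactly as the upper bound prescribes; the later bottlenecks influence the loss only through the cross-entropy term.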
Experiments and Results
Does this rigorous mathematical formulation translate to better explanations? The authors conducted extensive experiments using Vision Transformers (ViT, DeiT, Swin) on datasets like ImageNet.
1. Faithfulness (Insertion and Deletion)
A good explanation should be faithful. If the heatmap says “this dog’s ear is important,” then removing the ear from the image should drastically drop the model’s confidence in identifying the “dog.”
- Deletion Metric: We progressively delete pixels starting from the most “important.” The faster the accuracy drops, the better the explanation. (Lower is better).
- Insertion Metric: We start with a blurred image and add pixels back based on importance. The faster accuracy rises, the better. (Higher is better).
The authors combined these into a table (Insertion \(\uparrow\) / Deletion \(\downarrow\)):

In Table 1, CoIBA (last column) consistently outperforms existing methods like Chefer-LRP, Generic, and standard IBA across various models (ViT-B, ViT-L, DeiT). For example, on ViT-B, CoIBA achieves a Deletion score of 13.01 (lower is better) compared to 17.23 for IBA.
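For reference, a deletion curve can be computed with a short loop like the sketch below. Zeroing out 16×16 patches on a 14×14 grid and the number of steps are my choices, not the paper’s exact protocol, and the bottleneck hooks should be removed first (`for h in hooks: h.remove()`) so that the probabilities come from the unmodified model.

```python
import torch

def deletion_curve(model, image, attribution, target, steps=20):
    """Zero out patches from most to least relevant and track the softmax probability
    of the target class; a faster drop (smaller area under the curve) is better."""
    order = attribution.flatten().argsort(descending=True)    # most important patches first
    patches_per_step = max(1, len(order) // steps)
    masked = image.clone()
    probs = []
    with torch.no_grad():
        for start in range(0, len(order), patches_per_step):
            for idx in order[start:start + patches_per_step]:
                row, col = divmod(idx.item(), 14)             # 14x14 patch grid for ViT-B/16 at 224px
                masked[:, :, row * 16:(row + 1) * 16, col * 16:(col + 1) * 16] = 0.0
            probs.append(model(masked).softmax(dim=-1)[0, target].item())
    return probs
```

The insertion metric is the mirror image: start from a heavily blurred copy of the image and reveal patches in the same order, tracking how quickly the probability recovers.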
2. Robustness on “Hard” Samples
One of the most interesting findings is how CoIBA handles uncertainty. Most interpretation methods work well when the model is very confident (e.g., a clear picture of a cat). But when the model is unsure (low confidence), explanations often fall apart.

Figure 7 (also shown in Figure 4 of the paper) visualizes this difficulty-aware performance. The plots show the difference between Insertion and Deletion scores (\(\Delta\)InsDel) across different confidence levels.
- The Blue/Purple lines (CoIBA) are consistently higher than the others, even on the left side of the graphs where confidence is low (0-20%).
- This implies that CoIBA provides reliable explanations even when the AI is struggling with “hard” or ambiguous images.
3. Visual Quality
Numbers are great, but in Explainable AI, we need to see the results.

In Figure E, compare CoIBA (far right) with other methods.
- Vacuum (2nd row): Notice how CoIBA highlights the entire body of the vacuum cleaner and the hose. Other methods (like Generic or ViT-CX) result in scattered noise or only highlight small edges.
- School Bus (4th row): CoIBA cleanly highlights the bus structure, whereas others focus on background noise or disjointed patches.
4. FunnyBirds: A Controlled Test
To further validate their method, the authors used “FunnyBirds,” a synthetic dataset designed specifically to test AI explainability. Since these images are computer-generated, we know the ground truth of which parts are important.

Figure 6 shows radar charts that score each method along several axes:
- Acc (Accuracy): How well the attribution matches ground truth.
- Con (Contrastivity): Can the explanation distinguish between similar classes?
- Com (Completeness): Does it cover all relevant features?
CoIBA achieves the largest area on the radar chart, indicating it is the most well-rounded method, particularly excelling in Contrastivity (Con), meaning it’s very good at showing exactly what makes a bird a “FunnyBird” vs. a regular bird.
Discussion
Why the Universal Ratio Matters
You might wonder, “Why force every layer to use the same damping ratio? Wouldn’t giving each layer its own, more flexible ratio be better?”
The ablation study in the paper (shown below) reveals why the universal ratio is superior.

Figure 9(a) shows the mutual information. By sharing the ratio, CoIBA compensates for over-compression in early layers. Essentially, later layers “tell” the earlier layers what is important via the backpropagation of the shared parameter. If we optimized layers individually, early layers might discard information that a later layer actually needs, leading to a disconnect.
Handling Out-of-Distribution Data
The robustness of CoIBA on “hard” samples (low confidence) suggests it is better suited for out-of-distribution data. In real-world scenarios—like a self-driving car seeing a weirdly shaped truck—the model’s confidence might drop. An interpretation method that fails in these moments is useless. CoIBA’s stability here is a significant contribution to AI safety.
Conclusion
The “black box” nature of Vision Transformers is a major hurdle for their adoption in sensitive fields. The paper “Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers” introduces a compelling solution with CoIBA.
By moving beyond the limitations of analyzing single layers and instead adopting a multi-layer, comprehensive approach with a Universal Damping Ratio, CoIBA manages to:
- Resolve conflicts between different network layers.
- Provide a theoretical guarantee via a variational upper bound that discarded information truly isn’t needed.
- Deliver superior performance on both quantitative metrics (Insertion/Deletion) and qualitative visuals.
For students and researchers in Computer Vision, CoIBA represents a sophisticated step forward in Explainable AI (XAI). It reminds us that to understand deep networks, we shouldn’t just look at a single slice of the brain—we need to look at the whole thought process.
This post is based on the research paper “Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers” by Jung-Ho Hong et al.