Introduction: The “Black Box” Problem
Imagine you are a doctor using an AI system to read a chest X-ray. The AI predicts “Pneumonia” with 95% confidence. As a responsible practitioner, your immediate question isn’t just “Is it correct?” but rather “Why?”
If the AI points to a specific shadow on the lung (the “Where”) but doesn’t tell you what it sees, you might be left guessing. Conversely, if the AI says it detects “fluid accumulation” (the “What”) but doesn’t tell you where it is, you can’t verify if it’s looking at the lung or an artifact in the background.
Humans explain things by combining these two elements. We say, “That is a dog because I see ears here and fur there.” We combine semantic concepts with spatial localization.
Most Deep Neural Networks (DNNs) cannot do this. They are “black boxes” that ingest pixels and output probabilities. While we have made strides in Explainable AI (XAI), most methods force us to choose between visual heatmaps (which can be ambiguous) and abstract concepts (which lack location).
In the paper “Show and Tell: Visually Explainable Deep Neural Nets via Spatially-Aware Concept Bottleneck Models,” researchers Itay Benou and Tammy Riklin Raviv propose a unified framework that solves this dilemma. They introduce the SALF-CBM, a model that can both “show” you where it’s looking and “tell” you what concepts it found there, all without requiring expensive human annotations.

As shown above, this method can take an image of a dog and decompose it into understandable parts—highlighting the hat, the face, or the ball—just as a human would break down the scene.
The Landscape of Explainability
To understand why this paper is significant, we need to look at the two prevailing approaches to XAI that existed prior to this work.
1. Attribution Methods (The “Where”)
These methods, like GradCAM, generate heatmaps: they examine the network’s gradients to identify which image regions most influenced the final decision (a minimal sketch follows this list).
- Pros: Great for localization.
- Cons: They don’t tell you what the model saw. A red blob on a car tire could mean “tire,” “rubber,” “black circle,” or “dirt.” Without semantic labels, heatmaps are open to interpretation.
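To make the “where-only” nature of attribution concrete, here is a minimal Grad-CAM-style sketch in plain PyTorch. The choice of `model.layer4` as the target layer and the class index 281 are illustrative assumptions, not a fixed recipe:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

# Minimal Grad-CAM: weight the last conv block's feature maps by the gradient
# of the target class score, then sum over channels to get a heatmap.
model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()

feats, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)      # stand-in for a preprocessed image
score = model(x)[0, 281]             # class index 281 is arbitrary here
score.backward()

weights = grads["a"].mean(dim=(2, 3), keepdim=True)       # pooled gradients
cam = F.relu((weights * feats["a"]).sum(dim=1))           # weighted channel sum
cam = F.interpolate(cam[None], size=(224, 224), mode="bilinear")[0, 0]
# `cam` tells you *where* the model looked, but nothing about *what* it saw there.
```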
2. Concept Bottleneck Models (The “What”)
Concept Bottleneck Models (CBMs) force the neural network to squeeze its information through a “bottleneck” layer before making a final prediction. This layer consists of specific neurons representing concepts (e.g., “wing,” “beak,” “feathers”).
- Pros: Highly interpretable decision logic (e.g., “It has a wing and a beak, therefore it is a bird”).
- Cons: Traditional CBMs pool information globally, so they lose spatial data. They can tell you the image contains “grass,” but they can’t tell you where the grass is. Furthermore, forcing information through this bottleneck often hurts the model’s accuracy compared to a standard “black box” model.
The Gap
The researchers identified a critical gap: there was no unified method that provided both spatial awareness and concept clarity without sacrificing performance or requiring humans to manually label thousands of concepts.
The Solution: SALF-CBM
The authors introduce the Spatially-Aware and Label-Free Concept Bottleneck Model (SALF-CBM). Let’s break down that name:
- Spatially-Aware: It preserves the spatial structure (height and width) of the image features, rather than squashing them into a single vector.
- Label-Free: It doesn’t need humans to label concepts (like “this is a wing”). It uses large language models and Vision-Language Models (like CLIP) to figure out relevant concepts automatically.
The Architecture
The process of converting a standard black-box model into a transparent SALF-CBM involves four clever steps.

Step 1: Generating the Concept List
First, the model needs a vocabulary. Instead of hiring experts, the researchers use GPT. They prompt GPT with questions like “List the most important features for recognizing a {class}” or “What is commonly seen around a {class}?”.
For a dataset of birds, this might generate a list including “beak,” “wing,” “red feathers,” etc. The system filters these to remove synonyms, resulting in a clean list of \(M\) concepts.
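Here is a hedged sketch of that concept-generation step using the `openai` Python client; the prompt wording, the model name, and the simple de-duplication are illustrative assumptions rather than the authors’ exact pipeline:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPTS = [
    "List the most important features for recognizing a {cls}.",
    "What is commonly seen around a {cls}?",
]

def concepts_for_class(cls: str) -> list[str]:
    """Ask an LLM for visual concepts associated with one class name."""
    concepts = []
    for template in PROMPTS:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model, not necessarily the one used in the paper
            messages=[{"role": "user", "content": template.format(cls=cls)}],
        )
        # Assume the reply is a newline- or comma-separated list of short phrases.
        text = resp.choices[0].message.content
        concepts += [c.strip(" -•.").lower() for c in text.replace(",", "\n").splitlines()]
    return [c for c in concepts if c]

# Crude de-duplication across classes; the paper also filters out near-synonyms.
all_concepts = sorted({c for cls in ["sparrow", "woodpecker"] for c in concepts_for_class(cls)})
```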
Step 2: Local Image-Concept Similarities (The “Red Circle” Trick)
This is perhaps the most innovative part of the training setup (shown in section (b) of the image above). The researchers need “ground truth” to teach their model where concepts are located, but they don’t have segmentation masks.
Their solution? Visual Prompting.
They use CLIP, a powerful pre-trained model that understands both images and text. Research has shown that if you draw a red circle on an image, CLIP focuses its attention on the content inside that circle.
The researchers take their training images and systematically draw red circles across a grid. For every circle location, they ask CLIP: “How similar is the content inside this red circle to the text concept ‘beak’?”
By doing this across the whole image for every concept, they build a Spatial Concept Similarity Matrix (\(P\)). This matrix acts as a teacher, telling the new model where concepts likely exist in the training data.
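The spirit of this visual-prompting step can be sketched with OpenAI’s open-source `clip` package. The grid size, circle radius, and normalization below are assumptions; the paper’s exact prompting details may differ:

```python
import torch
import clip                      # OpenAI's reference CLIP package
from PIL import Image, ImageDraw

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

concepts = ["beak", "wing", "red feathers"]
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(concepts).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def spatial_concept_similarity(img: Image.Image, grid: int = 7, radius: int = 40):
    """Return an (M, grid, grid) matrix P of CLIP image-text similarities,
    one entry per (concept, circled location)."""
    W, H = img.size
    P = torch.zeros(len(concepts), grid, grid)
    for i in range(grid):
        for j in range(grid):
            cx, cy = (j + 0.5) * W / grid, (i + 0.5) * H / grid
            prompted = img.copy()
            ImageDraw.Draw(prompted).ellipse(
                [cx - radius, cy - radius, cx + radius, cy + radius],
                outline="red", width=5,
            )  # the red circle steers CLIP's attention to this location
            with torch.no_grad():
                v = model.encode_image(preprocess(prompted).unsqueeze(0).to(device))
                v = v / v.norm(dim=-1, keepdim=True)
            P[:, i, j] = (v @ text_feat.T).squeeze(0).float().cpu()
    return P
```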
Step 3: Training the Spatially-Aware Bottleneck
Now, they take a standard backbone (like ResNet or a Vision Transformer) and process the image to get features. Instead of pooling these features immediately, they project them into a Concept Bottleneck Layer (CBL).
This layer outputs a set of maps—one map for every concept. The goal is for these learned maps to match the “teacher” matrix created by CLIP in Step 2.
The loss function used to train this layer tries to maximize the similarity between the learned maps (\(q\)) and the CLIP-generated target maps (\(p\)):

Here, the similarity is calculated using a cubic cosine similarity, which emphasizes strong matches and suppresses weak background noise.
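A minimal sketch of the bottleneck projection and its alignment loss, assuming the cubic cosine similarity is applied per concept over the flattened spatial maps (class and function names here are mine, not the paper’s):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptBottleneck(nn.Module):
    """Project backbone features (B, C, H, W) into M concept maps (B, M, H, W)."""
    def __init__(self, in_channels: int, num_concepts: int):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, num_concepts, kernel_size=1)

    def forward(self, feats):
        return self.proj(feats)

def cubic_cosine_alignment_loss(q, p):
    """q: learned concept maps (B, M, H, W); p: CLIP-derived targets (B, M, H, W).

    Flatten each concept map, take the cosine similarity per (image, concept),
    cube it to emphasize strong matches, and minimize the negative mean.
    """
    cos = F.cosine_similarity(q.flatten(2), p.flatten(2), dim=-1)   # (B, M)
    return -(cos ** 3).mean()

# Toy usage: features from a frozen ResNet-50 stage would be roughly (B, 2048, 7, 7).
cbl = ConceptBottleneck(in_channels=2048, num_concepts=500)
q = cbl(torch.randn(2, 2048, 7, 7))
loss = cubic_cosine_alignment_loss(q, torch.rand(2, 500, 7, 7))
```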
Step 4: The Final Classification
Once the model has generated these concept maps (e.g., a map showing where the “wings” are), it performs a global pooling operation to get a summary score for each concept.
Finally, a sparse linear layer makes the prediction. This layer learns weights (\(W\)) connecting concepts to classes. For example, the “Tiger” class will learn a strong positive weight for the “stripes” concept.

Because the final decision is just a weighted sum of concepts, we can look at the weights to understand exactly why the model made a decision.
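The final head can be sketched as global pooling over each concept map followed by a sparse linear classifier; the L1 penalty below stands in for whatever sparsity mechanism the authors actually use:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseConceptClassifier(nn.Module):
    """Pool concept maps into per-concept scores, then apply a linear layer W."""
    def __init__(self, num_concepts: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(num_concepts, num_classes)

    def forward(self, concept_maps):                  # (B, M, H, W)
        scores = concept_maps.mean(dim=(2, 3))        # global average pooling -> (B, M)
        return self.fc(scores), scores

head = SparseConceptClassifier(num_concepts=500, num_classes=200)
logits, concept_scores = head(torch.rand(2, 500, 7, 7))

# Sparsity: an L1 penalty on W keeps each class tied to a handful of concepts,
# so each decision stays a short, readable weighted sum.
loss = F.cross_entropy(logits, torch.randint(0, 200, (2,))) + 1e-4 * head.fc.weight.abs().sum()

# Reading the logic: top concepts contributing to the predicted class of image 0.
cls = logits[0].argmax()
contributions = head.fc.weight[cls] * concept_scores[0]   # per-concept contribution
top_concept_indices = contributions.topk(3).indices       # indices into the concept list
```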
Visualizing the Logic
One of the most powerful aspects of SALF-CBM is how transparent the decision-making process becomes. At test time, you don’t need CLIP or red circles anymore. The model simply ingests an image and produces concept maps.
We can visualize the flow of information using Sankey diagrams (like the one below).

In the example above, seeing a “stuffed animal” pushes the probability toward “Toy Store,” while seeing “wooden board” and “pallets” pushes the probability toward “Crate.” The model is literally showing its work.
Experimental Results
It’s often assumed that making a model explainable makes it dumber (the accuracy-interpretability tradeoff). Does SALF-CBM suffer from this?
Classification Accuracy
The researchers tested their model on three major datasets: CUB-200 (birds), Places365 (scenes), and ImageNet (objects). They compared it against standard black-box models and other Concept Bottleneck approaches (P-CBM, LF-CBM).

Surprisingly, SALF-CBM outperforms other explainable models. Even more impressive is that on complex datasets like ImageNet and Places365, it actually outperforms the original “black box” backbone (ResNet-50).
Why? By structuring the latent space into meaningful concepts, the model might be learning more robust features that generalize better than raw, abstract vectors.
Heatmap Quality (Zero-Shot Segmentation)
How good are the spatial maps? To test this, the authors used a segmentation task. They took the generated heatmaps and checked if they accurately covered the object of interest, comparing against famous methods like GradCAM and LRP.

As seen in the visual comparison above, standard methods often produce noisy “blobs” that spill over into the background. SALF-CBM (labeled “Ours”) produces tight, object-specific highlights.
The quantitative data backs this up:

SALF-CBM achieved the highest Pixel Accuracy and mean Intersection-over-Union (mIoU), showing that its spatial awareness is not just a gimmick: it actually locates objects more precisely than gradient-based methods.
Do the Neurons Mean What They Say?
A major criticism of CBMs is whether a neuron labeled “Wing” actually looks for a wing, or if it’s just detecting “bird stuff.”
The researchers conducted a user study where humans rated the consistency of the images activating specific concept neurons.

The results show that SALF-CBM neurons are significantly more semantically consistent than baseline neurons. If the model says it sees a “beak,” it really is looking at a beak.
Interactive Explanations: “Explain Anything”
Because SALF-CBM preserves spatial information throughout the network, it allows for cool interactive features that standard models can’t support.
The authors introduce an “Explain Anything” mode. Users can mask a specific region of an image (Region of Interest or ROI) and ask the model, “What concepts do you see right here?”

In the example above, the model correctly identifies “textile” and “stitched design” on the girl’s dress, and “lawn” on the green scribbles. This turns the model into an exploratory tool.
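Because the concept maps are spatial, an “Explain Anything”-style query can be sketched as masked pooling over a user-drawn region; the masking and top-k readout below are my interpretation of the feature, not the authors’ code:

```python
import torch

def explain_region(concept_maps, roi_mask, concept_names, k=5):
    """concept_maps: (M, H, W) activations; roi_mask: (H, W) boolean user-drawn mask."""
    m = roi_mask.float()
    # Average each concept's activation only inside the region of interest.
    roi_scores = (concept_maps * m).sum(dim=(1, 2)) / m.sum().clamp(min=1.0)
    top = roi_scores.topk(k)
    return [(concept_names[i], roi_scores[i].item()) for i in top.indices.tolist()]

# Toy usage: a 3-concept model and a mask covering the top-left quadrant.
names = ["textile", "stitched design", "lawn"]
maps = torch.rand(3, 14, 14)
mask = torch.zeros(14, 14, dtype=torch.bool)
mask[:7, :7] = True
print(explain_region(maps, mask, names, k=2))
```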
Debugging and Fixing the Model
This spatial transparency allows for actionable debugging. The authors present a case where the model incorrectly classified a Traffic Light as a Parking Meter.
In a standard black box, you’d just see “Parking Meter: 90%.” You wouldn’t know why.
With SALF-CBM, the researchers could probe the specific region of the traffic light. They found that the model was detecting generic “sign” concepts but missing the crucial “flashing light” concept in that area.

By manually intervening—locally editing the concept map to increase the activation of “flashing light” and “ability to change color” on the pole—the model flipped its prediction to the correct class: Traffic Light.
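A concept-level intervention of this kind could look roughly like the following sketch, where the edit region, the concept index, and the boosted value are user choices (the helper is hypothetical, not the paper’s API):

```python
import torch

def intervene(concept_maps, concept_idx, region_mask, new_value=1.0):
    """Locally overwrite one concept's activation inside a chosen region."""
    edited = concept_maps.clone()
    edited[concept_idx][region_mask] = new_value
    return edited

# Toy flow: boost concept 7 (say, "flashing light") along a pole-shaped region,
# then re-pool and re-classify with a sparse linear head like the earlier sketch.
maps = torch.rand(500, 7, 7)
pole = torch.zeros(7, 7, dtype=torch.bool)
pole[:, 3] = True
edited = intervene(maps, concept_idx=7, region_mask=pole, new_value=2.0)
```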
This suggests a future where humans can “debug” neural networks not by retraining them entirely, but by correcting their logic spatially, similar to how a teacher corrects a student.
Video Tracking Capabilities
Although the model was trained on static images, the spatial consistency is robust enough to work on video. By applying the model frame-by-frame, it effectively acts as an object tracker for specific concepts.

In the top row of this figure, the concept “soccer ball” (yellow) tracks the ball perfectly across frames, while “trees” (red) stays on the background. This capability emerges naturally without any video-specific training.
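Frame-by-frame tracking of a single concept can be sketched as taking the peak of that concept’s upsampled map in every frame; the peak readout and the assumption that `model` returns concept maps of shape (1, M, h, w) are my simplifications:

```python
import torch
import torch.nn.functional as F

def track_concept(frames, model, concept_idx, frame_size=(224, 224)):
    """frames: iterable of (3, H, W) tensors; returns one (row, col) peak per frame.

    Assumes `model` maps an image batch to concept maps of shape (1, M, h, w).
    """
    points = []
    for frame in frames:
        with torch.no_grad():
            maps = model(frame.unsqueeze(0))
        heat = F.interpolate(maps[:, concept_idx:concept_idx + 1],
                             size=frame_size, mode="bilinear")[0, 0]
        flat = heat.argmax()
        points.append((int(flat // frame_size[1]), int(flat % frame_size[1])))
    return points
```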
Conclusion
The “Show and Tell” paper represents a significant step forward in Explainable AI. It moves us away from the dichotomy of either having good heatmaps or having good semantic labels.
By leveraging the power of Vision-Language Models (CLIP) to supervise the training, SALF-CBM achieves:
- Label-Free Training: No need for expensive human bounding boxes.
- Spatial Precision: Heatmaps that actually outline objects, outperforming GradCAM.
- Semantic Clarity: Explanations based on human-understandable concepts.
- High Performance: Accuracy that matches or beats “black box” models.
For students and researchers entering the field, this paper illustrates a vital lesson: Interpretability doesn’t have to be a post-hoc analysis layer slapped onto a finished model. By designing the architecture to be interpretable from the ground up—specifically by preserving spatial dimensions—we can build AI systems that are not only powerful but also transparent and trustworthy.