Introduction
In the rapidly evolving landscape of Artificial Intelligence, Deep Neural Networks (DNNs) have achieved superhuman performance in tasks ranging from medical diagnosis to autonomous driving. However, these models suffer from a notorious flaw: they act as “black boxes.” We feed them data, and they give us an answer, but they rarely tell us why they reached that conclusion.
In critical domains like healthcare and finance, “because the computer said so” is not an acceptable justification. This has given rise to the field of Explainable AI (XAI). One of the most popular tools in the XAI toolkit is the Saliency Map—a heatmap that highlights which parts of an image the model focused on to make its decision.
While existing methods like GradCAM (Gradient-weighted Class Activation Mapping) are widely used, they possess a subtle but significant limitation: they rely heavily on the model’s decision boundary. They tell us what changes would flip the model’s decision, rather than what features make the image distinctively “it.”
In this post, we will take a deep dive into the CVPR paper titled “DiffCAM: Data-Driven Saliency Maps by Capturing Feature Differences.” The researchers, from Carnegie Mellon University, propose a novel approach that steps away from gradients and decision boundaries. Instead, DiffCAM looks at the actual data distribution, explaining a prediction by asking: how is this specific image different from a reference group?
By the end of this article, you will understand the mathematical intuition behind DiffCAM, how it outperforms traditional methods in specific scenarios, and why it might be the future of interpreting complex foundation models.
Background: The Problem with Gradients
To understand why DiffCAM is necessary, we first need to understand how current methods work—and where they fail.
Most modern saliency map techniques (like the CAM family) operate on the principle of feature attribution. They look at the activation maps in the final convolutional layer of a network and combine them to create a heatmap. The big question is: how do we weigh these maps? Which feature channel is more important?
The Decision Boundary Trap
Methods like GradCAM use the gradient of the predicted class score with respect to the feature maps. In simple terms, they ask: “If I change this feature slightly, how much does the prediction score change?” This measures the sensitivity of the prediction.
However, sensitivity (gradients) does not always equal importance.
The authors of DiffCAM illustrate this with a brilliant analogy involving a student taking a test. Imagine a model predicts whether a student Passes or Fails based on Reading and Writing scores.

Let’s look at Figure 1 above.
- The Scenario: Writing is generally a harder skill, so the “Decision Boundary” (the diagonal line) is tilted. The model is more sensitive to changes in Writing scores.
- The Student (\(x_t\)): This student failed. Why?
- GradCAM’s Explanation (Red Arrow): GradCAM looks for the fastest way to cross the decision boundary. Because the boundary is steep relative to the Writing axis, GradCAM concludes the student failed because of their Writing score.
- The Reality: Look at the data points. The student’s Writing score is actually quite high—higher than most students who failed! Their Reading score, however, is significantly lower than the “Pass” group.
- DiffCAM’s Explanation (Purple Arrow): DiffCAM compares the student to the “Pass” group (the green dots). It sees that the biggest difference lies in the Reading score. Therefore, it correctly identifies “Low Reading” as the reason for failure.
This highlights the core philosophy of DiffCAM: To explain a decision, we shouldn’t just look at the boundary; we must look at the data distribution relative to a reference group.
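To make the contrast concrete, here is a tiny numerical sketch of the same idea (the numbers and the linear model are made up purely for illustration): a “Pass/Fail” model that is more sensitive to Writing, and a failed student whose real deficit is Reading.

```python
import numpy as np

# A toy linear classifier: score = w . [reading, writing] + b, pass if score > 0.
# The Writing weight is larger, so the *model* is more sensitive to Writing.
w_model = np.array([1.0, 3.0])                  # [reading_weight, writing_weight]
b = -255.0

student = np.array([40.0, 70.0])                # low Reading, decent Writing
pass_group = np.array([[80.0, 60.0],
                       [85.0, 65.0],
                       [75.0, 72.0]])           # students who passed

print("student score:", w_model @ student + b)  # -5.0 -> Fail

# Gradient-based explanation (GradCAM-style): importance ~ sensitivity.
print("gradient direction:", w_model)           # Writing dominates -> "blame Writing"

# Data-driven explanation (DiffCAM-style): compare against the reference group.
print("difference from pass group:", pass_group.mean(axis=0) - student)
# -> roughly [+40, -4.3]: the Reading gap dwarfs the Writing gap -> "blame Reading"
```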
The Core Method: DiffCAM
DiffCAM stands for Difference Class Activation Map. It is designed to capture the essential feature differences between a target example (the image you want to explain) and a reference group (a set of images you are comparing it to).
The General Framework
DiffCAM fits into the standard CAM framework. For a deep convolutional neural network, the saliency map \(M_c(x)\) for a class \(c\) is a weighted combination of the activation maps \(A_k(x)\) from the last convolutional layer:

\[
M_c(x) = \mathrm{ReLU}\!\left(\sum_k w_k^c \, A_k(x)\right)
\]
What distinguishes one CAM-family method from another is how it calculates the weights \(w_k^c\).
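Every method in the family therefore shares the same final step. Here is a minimal sketch of that shared skeleton (the function name `cam_from_weights` is mine, and it assumes the activation maps have already been extracted from the last convolutional layer):

```python
import numpy as np

def cam_from_weights(activations: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Combine activation maps into a saliency map.

    activations: array of shape (K, H, W) from the last conv layer.
    weights:     array of shape (K,), one scalar per feature channel.
    """
    heatmap = np.tensordot(weights, activations, axes=1)  # weighted sum over channels
    heatmap = np.maximum(heatmap, 0.0)                    # keep positive evidence (ReLU)
    if heatmap.max() > 0:
        heatmap = heatmap / heatmap.max()                 # normalize to [0, 1] for display
    return heatmap
```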
For context, GradCAM calculates weights by averaging the gradients of the class score \(y^c\) over the spatial locations of each activation map:

\[
w_k^c = \frac{1}{Z} \sum_{i,j} \frac{\partial y^c}{\partial A_k^{ij}(x)}
\]
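For the curious, here is a hedged PyTorch sketch of that gradient-averaging step; the hook-based plumbing is a common implementation pattern rather than any official code, and `model`, `target_layer`, and `image` are placeholders:

```python
import torch

def gradcam_weights(model, target_layer, image, class_idx):
    """Average the gradients of the class score over each activation map's spatial grid."""
    model.eval()
    feats = {}

    def hook(_, __, output):
        output.retain_grad()                   # keep gradients on the activation maps
        feats["A"] = output

    handle = target_layer.register_forward_hook(hook)
    score = model(image.unsqueeze(0))[0, class_idx]   # y^c for a single image
    score.backward()
    handle.remove()

    grads = feats["A"].grad                    # dy^c / dA_k, shape (1, K, H, W)
    return grads.mean(dim=(2, 3)).squeeze(0)   # w_k^c: globally averaged gradients, shape (K,)
```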
And ScoreCAM calculates weights by measuring how much the class score increases when a (normalized, upsampled) activation map is used to mask the input:

\[
w_k^c = f\big(x \odot H_k\big)_c - f(x_b)_c, \qquad H_k = \mathrm{up}\big(\mathrm{norm}(A_k(x))\big)
\]

where \(x_b\) is a baseline input.
DiffCAM takes a completely different route. It generates explanations without relying on the class prediction score \(y^c\) or its gradients.
The Intuition: Signal-to-Noise Ratio
DiffCAM frames the explanation problem as a search for an optimal direction in the feature space.
Imagine the high-dimensional space where the deep features of images live. We want to find a vector \(w\) (a direction) that creates the maximum separation between our target image (\(z\)) and the average of our reference group (\(\hat{\mu}\)).
However, simply maximizing the distance isn’t enough. We also want to ensure that the direction we choose represents features that are common among the reference group (low variance). If the reference group varies wildly along that direction, it’s not a reliable feature for comparison.
The researchers formulated this as an optimization problem: maximize the objective function \(J(w)\)

\[
J(w) = \frac{\big(w^\top (z - \hat{\mu})\big)^2}{w^\top S_v \, w}
\]

where \(S_v\) is the covariance matrix of the reference group.
- Numerator (Signal): The squared distance between the target and the reference mean along direction \(w\).
- Denominator (Noise): The variance (spread) of the reference examples along direction \(w\).
This is classically known as finding Fisher's linear discriminant, the criterion at the heart of Linear Discriminant Analysis (LDA).
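In code, \(J(w)\) is just a ratio of two quadratic forms. A small NumPy sketch (with `z`, `reference`, and `w` as stand-in arrays) makes the signal-over-noise reading explicit:

```python
import numpy as np

def objective_J(w, z, reference):
    """Signal-to-noise ratio of a candidate direction w.

    z:         target feature vector, shape (K,)
    reference: reference feature vectors, shape (N, K)
    """
    mu = reference.mean(axis=0)
    S_v = np.cov(reference, rowvar=False)    # covariance of the reference group
    signal = (w @ (z - mu)) ** 2             # squared projected distance to the reference mean
    noise = w @ S_v @ w                      # projected variance of the reference group
    return signal / noise
```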
The Solution
Using convex optimization and linear algebra, the authors derive a closed-form solution for the optimal weights \(w^*\):

\[
w^* \propto S_v^{-1}\,(z - \hat{\mu})
\]
Here, \(S_v^{-1}\) is the inverse of the covariance matrix of the reference group (representing internal variation), and \((z - \hat{\mu})\) is the difference between the target and the reference mean.
Once \(w^*\) is calculated, the final DiffCAM saliency map is generated just like any other CAM method, by using its entries as channel weights:

\[
M_{\text{Diff}}(x) = \mathrm{ReLU}\!\left(\sum_k w_k^* \, A_k(x)\right)
\]
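Putting the pieces together, a minimal NumPy sketch of the whole pipeline could look like the following, reusing the `cam_from_weights` helper from the framework sketch above. The spatial pooling assumed for the vector `z` and the small ridge term added for numerical stability are my own simplifications, not details taken from the paper:

```python
import numpy as np

def diffcam_weights(z, reference, ridge=1e-4):
    """Closed-form DiffCAM weights: w* proportional to S_v^{-1} (z - mu_hat).

    z:         pooled feature vector of the target image, shape (K,)
    reference: pooled feature vectors of the reference group, shape (N, K)
    ridge:     small diagonal term so the covariance is safely invertible
    """
    mu_hat = reference.mean(axis=0)
    S_v = np.cov(reference, rowvar=False) + ridge * np.eye(len(z))
    w = np.linalg.solve(S_v, z - mu_hat)   # S_v^{-1} (z - mu_hat) without an explicit inverse
    return w / np.linalg.norm(w)           # the overall scale does not affect the heatmap

def diffcam_map(activations, z, reference):
    """Weighted sum of the target's activation maps, exactly as in other CAM methods."""
    return cam_from_weights(activations, diffcam_weights(z, reference))
```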
A Flexible Framework
One of the most powerful aspects of DiffCAM is its flexibility. By changing the Reference Group, you can ask the model different questions.

As shown in Figure 2, the choice of reference determines the explanation:
- “Why is this a 6?” (Standard Explanation): Set the reference group to all other classes. DiffCAM finds features that make this image different from everything else.
- “Why is this not a 4?” (Counterfactual Explanation): Set the reference group to class 4. DiffCAM highlights features that distinguish the image specifically from a 4.
- “What makes this specific 6 unique?” (Intra-class Variance): Set the reference group to other 6s. DiffCAM highlights outliers or anomalies in this specific drawing of a 6.
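Reusing the `diffcam_map` sketch from above, switching questions is literally just switching which rows you pass as the reference (`feats`, `labels`, `z`, and `A` are hypothetical pre-extracted arrays):

```python
# Hypothetical pre-extracted arrays:
#   feats:  (N, K) pooled features for a labeled dataset; labels: (N,) class ids
#   z, A:   pooled feature vector (K,) and activation maps (K, H, W) of the target "6"
questions = {
    "why is this a 6":       feats[labels != 6],   # reference = everything else
    "why is this not a 4":   feats[labels == 4],   # reference = class 4 only
    "what makes this 6 odd": feats[labels == 6],   # reference = other 6s
}

heatmaps = {q: diffcam_map(A, z, ref) for q, ref in questions.items()}
```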
Experiments and Results
The authors validated DiffCAM across several datasets, including MNIST (handwritten digits), ImageNet (general objects), and medical datasets.
Qualitative Analysis: “Why Is” and “Why Not”
Let’s look at how DiffCAM visualizes features compared to other methods using MNIST digits.

In Figure 3:
- Columns 2-5 (Standard Explanation): Notice that while GradCAM often highlights just a specific edge or curve, DiffCAM (Column 5) tends to highlight the comprehensive object pattern. It captures the “whole” digit because the whole shape contributes to it being different from other digits.
- Last 3 Columns (Counterfactuals): This is where DiffCAM shines.
- “DiffCAM (3 against 5)”: It highlights the left side of the ‘3’, because the lack of a closing loop on the left is what makes it not a ‘5’.
- “DiffCAM (7 against 1)”: It highlights the top horizontal stroke, which is the primary feature distinguishing a 7 from a 1.
Intra-Class Variance: Finding the Odd Ones Out
Most XAI methods can’t explain differences within the same class. DiffCAM can. By setting the reference group to the same class as the target, DiffCAM highlights what makes a specific image “weird” or “abnormal.”

In Figure 4, the model looked at badly written “0”s (some of which were misclassified). DiffCAM highlighted the specific breaks in the loops or extra squiggles that made these zeros deviate from the “average perfect zero.” This feature is incredibly useful for debugging models and finding data anomalies.
Quantitative Evaluation: Faithfulness and Localization
To prove DiffCAM isn’t just generating pretty pictures, the authors performed standard quantitative tests on the ImageNet dataset.
1. Object Localization: Can the method find the object in the image? Using the Intersection over Union (IoU) metric, DiffCAM matches or outperforms state-of-the-art methods like LayerCAM and GradCAM.

2. Faithfulness (The ROAD Test): Faithfulness measures if the highlighted pixels are actually important to the model. The ROAD (Remove And Debias) test involves removing the most salient pixels (replacing them with neighbor averages) and checking if the model’s prediction confidence drops. A steeper drop means the method correctly identified the critical pixels.
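As a rough illustration of the idea (not the official ROAD implementation, which uses a more careful noisy linear imputation), the test boils down to: blur out the most salient pixels, re-run the model, and watch the confidence:

```python
import numpy as np
import torch
import torch.nn.functional as F

def remove_top_pixels(image, heatmap, fraction=0.2):
    """Replace the most salient pixels with a local average (a simplified imputation)."""
    img = image.clone()                                        # image: (C, H, W) tensor
    thresh = np.quantile(heatmap, 1.0 - fraction)
    mask = torch.from_numpy(heatmap >= thresh)                 # (H, W) bool: most salient pixels
    blurred = F.avg_pool2d(image.unsqueeze(0), kernel_size=7,
                           stride=1, padding=3).squeeze(0)     # neighbor averages
    img[:, mask] = blurred[:, mask]
    return img

def confidence_drop(model, image, heatmap, class_idx, fraction=0.2):
    """Faithfulness proxy: how far does the class probability fall after removal?"""
    with torch.no_grad():
        p_before = model(image.unsqueeze(0)).softmax(dim=1)[0, class_idx]
        perturbed = remove_top_pixels(image, heatmap, fraction)
        p_after = model(perturbed.unsqueeze(0)).softmax(dim=1)[0, class_idx]
    return (p_before - p_after).item()
```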

As seen in Figure 6, DiffCAM (the purple line) causes a rapid decrease in prediction probability, demonstrating competitive faithfulness compared to other methods.
Figure 5 below visualizes this process. The last column shows the image after the “ROAD” operation based on DiffCAM’s heatmap—the key object is effectively blurred out, destroying the model’s ability to classify it.

Application to Self-Supervised Learning (SSL)
A major limitation of GradCAM is that it requires class labels to calculate gradients. But what about Foundation Models or Self-Supervised Learning (SSL) models like DINO or MoCo, which learn features without labels?
Because DiffCAM operates on feature embeddings and data distributions, it doesn’t need labels. It works “out of the box” on these advanced models.
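A sketch of that label-free workflow, reusing `diffcam_map` from earlier: here `conv_features`, `target_image`, and `reference_images` are placeholders for whatever feature-extraction function and data you supply, for instance the last feature map of a DINO or MoCo-v3 backbone.

```python
import torch

def pooled(feature_map):
    """Spatially pool a (K, H, W) feature map into a (K,) embedding."""
    return feature_map.mean(dim=(1, 2))

with torch.no_grad():
    A_target = conv_features(target_image)                     # placeholder extractor, (K, H, W)
    z = pooled(A_target).numpy()
    reference = torch.stack([pooled(conv_features(img))
                             for img in reference_images]).numpy()

# No labels, no class scores, no gradients: just features versus a reference set.
heatmap = diffcam_map(A_target.numpy(), z, reference)
```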

Figure 7 shows DiffCAM explaining features from MoCo-v3 and DINO. It successfully identifies the soccer ball and the wall clock, proving it can be used to interpret the latent space of foundation models.
Case Study: Medical Imaging
Perhaps the most impactful application of DiffCAM is in medical diagnostics. In medical imaging, the “target” is often a patient with a disease, and the “reference group” is healthy patients.
The goal is to find the abnormality—the feature that makes the patient different from the healthy population. Gradient-based methods often struggle here because they look for classification sensitivity, which might latch onto artifacts (like image tags or bone structures) rather than the pathology itself.
The authors tested DiffCAM on the RSNA Pneumonia Detection dataset.

Look at the comparison in Figure 8:
- Original Image: The yellow box indicates the ground truth area of Pneumonia (lung opacity).
- GradCAM & ScoreCAM: They highlight large, diffuse areas, sometimes missing the lung entirely or focusing on the edges of the ribcage.
- DiffCAM: It produces a tight, accurate highlight directly over the infected area.
By mathematically asking “How does this lung differ from a set of healthy lungs?”, DiffCAM isolates the disease much more effectively than asking “Which pixels maximize the ‘Pneumonia’ class score?”
The quantitative results confirm this. In the table below, on the RSNA dataset, DiffCAM achieves an AUPRC (Area Under the Precision-Recall Curve) of 0.492, higher than GradCAM (0.467) and substantially higher than ScoreCAM (0.328).

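For reference, computing such a pixel-level AUPRC is straightforward. A plausible version of the protocol (my reconstruction, not the paper's code) treats the ground-truth lesion mask as pixel labels and the heatmap as pixel scores:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def heatmap_auprc(heatmap: np.ndarray, gt_mask: np.ndarray) -> float:
    """AUPRC of a saliency map against a binary pathology mask.

    heatmap: (H, W) saliency values, assumed upsampled to image resolution.
    gt_mask: (H, W) binary mask, 1 inside the annotated lesion region.
    """
    return average_precision_score(gt_mask.reshape(-1), heatmap.reshape(-1))
```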
Conclusion
DiffCAM represents a shift in thinking for Explainable AI: it moves away from the model’s internal calculus (gradients and weights) toward a data-centric view. By comparing the feature representation of a specific input against a reference distribution, it provides explanations that are arguably more aligned with human intuition.
Key Takeaways:
- Context Matters: Explanations depend on what you compare against. DiffCAM makes this comparison explicit (Target vs. Reference).
- Data over Boundaries: Relying solely on decision boundaries can lead to misleading explanations (the “Writing vs. Reading” student example).
- Versatility: DiffCAM handles “Why,” “Why not,” and “What is weird” questions within a single mathematical framework.
- Foundation Ready: It works seamlessly with modern, self-supervised architectures where gradients might not be accessible or meaningful.
As AI systems become more integrated into high-stakes environments, tools like DiffCAM will be essential for building trust, debugging errors, and ensuring that our “black boxes” are making decisions for the right reasons.