Auditing the Auditors: How to Rigorously Measure AI Concept Explanations
In the rapidly evolving world of Large Language Models (LLMs), we face a “black box” problem. We know that these models process vast amounts of text and develop internal representations of the world, but understanding how they do it remains a significant challenge. When an LLM outputs a sentence about “computer security,” which specific neurons fired? Did the model understand the abstract concept of “security,” or was it just matching patterns?
To answer these questions, researchers have turned to Concept-based Explanations. Unlike simple heatmaps that highlight which input words matter, concept-based explanations attempt to identify high-level ideas—like “gender,” “coding,” or “finance”—encoded within the model’s high-dimensional hidden space.
But this solution births a new problem: How do we know if these explanations are any good?
If an explanation tool claims to have found a “happiness” neuron, how do we verify that? Is the explanation faithful (does the model actually use this neuron for happiness)? Is it readable (does the explanation make sense to a human)?
Today, we are diving deep into a research paper titled “Evaluating Readability and Faithfulness of Concept-based Explanations.” This work proposes a unified framework to audit these explanations. It moves beyond ad-hoc scoring and applies rigorous measurement theory—the same statistics used in psychology—to determine which metrics we can actually trust.
The Wild West of Explainability
Before we can measure explanations, we need to understand the landscape. The field of Explainable AI (XAI) for concepts has become somewhat fragmented. Some methods look at individual neurons, while others look at directions in vector space (like the famous King - Man + Woman = Queen example).
As illustrated below, the taxonomy of evaluation metrics is vast and confusing. Researchers have proposed metrics for sparsity, meaningfulness, causality, and robustness, often using different definitions for similar ideas.

The authors of this paper argue that to make progress, we need to consolidate these into a unified framework. We need to formalize what a “concept” is, and then focus on the two most critical dimensions of quality:
- Faithfulness: Does the explanation accurately reflect the model’s internal mechanism?
- Readability: Can a human understand the concept being presented?
Part 1: A Unified Framework for Concepts
The first step is standardization. Whether you are using a supervised method like TCAV (Testing with Concept Activation Vectors) or an unsupervised method like Sparse Autoencoders (SAE), the authors propose that all concept-based explanations can be viewed through the lens of Virtual Neurons.
The framework is visualized in Figure 1 below.

The Virtual Neuron
In panel (a) of the figure above, we see the formalization. A concept is defined by an activation function \(a(h)\). This function takes a hidden representation \(h\) (the model’s internal state at a specific layer) and maps it to a real value. A positive output means the concept is “active.”
- For individual neurons: The function selects a specific dimension of the hidden state.
- For directions (TCAV): The function is a linear projection (dot product) onto a specific vector.
- For Sparse Autoencoders: The function might involve a ReLU activation on a learned feature.
This unification allows the researchers to apply the same evaluation metrics to totally different explanation methods.
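To make the "virtual neuron" abstraction concrete, here is a minimal sketch of the three activation functions as plain functions over a hidden vector. This is illustrative code, not the paper's implementation; names like `W_enc` and `b_enc` are assumptions.

```python
import numpy as np

def neuron_activation(h: np.ndarray, index: int) -> float:
    """Individual neuron: read off a single dimension of the hidden state."""
    return float(h[index])

def direction_activation(h: np.ndarray, v: np.ndarray, bias: float = 0.0) -> float:
    """Linear direction (TCAV-style): project h onto a learned concept vector v."""
    return float(np.dot(h, v) + bias)

def sae_feature_activation(h: np.ndarray, W_enc: np.ndarray, b_enc: np.ndarray, j: int) -> float:
    """Sparse autoencoder feature: ReLU of one unit of a learned encoder."""
    pre_activation = W_enc @ h + b_enc          # shapes: (k, d) @ (d,) + (k,)
    return float(np.maximum(pre_activation[j], 0.0))
```

Each function maps a hidden state \(h\) to a real number, and a positive value means the concept is active, which is exactly the interface the evaluation metrics below rely on.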
Part 2: Measuring Faithfulness via Perturbation
Faithfulness is the bedrock of interpretability. An explanation is unfaithful if it tells a convincing story that has nothing to do with how the model actually works.
To measure this, the authors use Perturbation. The logic is simple: if a concept (e.g., “politeness”) is truly important for the model’s output, then manually tampering with that concept in the model’s brain should drastically change the output.
The faithfulness score, \(\gamma\), is calculated by aggregating the difference in output \(\delta\) over a dataset when the concept is perturbed by a function \(\xi\):

Here, \(y\) is the original output, and \(y'\) is the output after perturbation.
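Spelled out in the notation above (with \(f\) standing for the part of the model downstream of the explained layer and \(\mathcal{D}\) for the evaluation set; both symbols, and the choice of an expectation as the aggregator, are introduced here for clarity and may differ from the paper's exact formulation):

\[
\gamma \;=\; \mathbb{E}_{h \in \mathcal{D}}\big[\, \delta(y,\, y') \,\big], \qquad y = f(h), \quad y' = f\big(\xi(h)\big).
\]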
The Optimization Problem
The challenge is defining how to perturb the concept. You can’t just zero out the whole layer; that destroys everything. You need to surgically alter only the concept in question. The authors formulate this as an optimization problem:

- \(\xi_e\) (Concept Addition): We want to find a new hidden state \(h'\) that maximizes the concept’s activation, but stays within a tiny distance \(\epsilon\) of the original state. This tests if “turning up the volume” on the concept changes the output.
- \(\xi_a\) (Concept Ablation): We want to find the nearest hidden state \(h'\) where the concept activation is effectively zero. This tests if “deleting” the concept removes the associated behavior.
For linear concepts (which are very common), these optimization problems have elegant closed-form solutions:
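Writing a linear concept as \(a(h) = v^\top h + b\), the standard geometry gives the following sketch of those solutions (the paper's exact formulation may add constraints or normalization):

\[
\xi_e(h) \;=\; \operatorname*{arg\,max}_{\lVert h' - h \rVert \le \epsilon} a(h') \;=\; h + \epsilon\, \frac{v}{\lVert v \rVert},
\qquad
\xi_a(h) \;=\; \operatorname*{arg\,min}_{a(h') = 0} \lVert h' - h \rVert \;=\; h - \frac{a(h)}{\lVert v \rVert^{2}}\, v.
\]

In words: concept addition takes a small step of size \(\epsilon\) along the concept direction, while concept ablation projects the hidden state onto the hyperplane where the concept's activation is zero.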

Defining the “Difference”
Once the perturbation is applied, how do we measure the change in the model’s output? The paper proposes three specific metrics:
- Loss: The change in the model’s loss function (does it become more “confused” regarding the ground truth?).
- Div (Divergence): The KL-Divergence between the original probability distribution and the new one.
- Class: The drop in probability for the specific predicted token or the true token.

By combining the perturbation strategy (Ablation vs. Addition, plus gradient-based variants) with the difference metric (Loss vs. Div vs. Class), the researchers created a suite of faithfulness measures (e.g., ABL-Loss, GRAD-TClass).
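As a rough illustration (not the paper's implementation), here is how the three difference measures could be computed for a single position, assuming we have the next-token logits before and after the perturbation and the index of the target token:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def output_differences(logits_before: np.ndarray,
                       logits_after: np.ndarray,
                       target_id: int) -> dict:
    """Compare next-token distributions before and after perturbing a concept."""
    p, q = softmax(logits_before), softmax(logits_after)

    # Loss: change in cross-entropy (negative log-likelihood) of the target token.
    delta_loss = -np.log(q[target_id]) + np.log(p[target_id])

    # Div: KL divergence between the original distribution and the perturbed one.
    delta_div = float(np.sum(p * (np.log(p) - np.log(q))))

    # Class: drop in probability assigned to the target (or originally predicted) token.
    delta_class = float(p[target_id] - q[target_id])

    return {"Loss": float(delta_loss), "Div": delta_div, "Class": delta_class}
```

A faithful concept should produce large values here when it is ablated on inputs that genuinely involve that concept.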
Part 3: Measuring Readability via Coherence
Even if a concept is faithful, it is useless if it looks like noise to a human. A “readable” concept should correspond to a coherent semantic idea, like “past tense verbs” or “medical terminology.”
Traditionally, measuring readability is expensive. You either need to pay humans to rate explanations, or use LLMs (like GPT-4) to grade them. However, the authors argue that LLM-based evaluation has limits—it’s costly and constrained by context windows.
Instead, they propose automating readability by measuring Coherence.
If a concept is readable, the words that trigger it should be semantically similar to each other. For example, if a concept activates on “north,” “south,” “east,” and “west,” those words are highly coherent. If it activates on “apple,” “run,” “blue,” and “concept,” it is likely incoherent.
The authors tested four automated metrics for this, divided into N-gram based and Embedding based:

- UCI and UMass: These are classic topic modeling metrics based on word co-occurrence probability \(P(x^i, x^j)\) in a large corpus.
- EmbDist and EmbCos: These use modern word embeddings (like those from BERT). EmbDist looks at the distance between the embeddings of the top-activating words, while EmbCos (Embedding Cosine Similarity) measures the angle between them: if the angle is small, the words are semantically close (a sketch of this computation follows the list).
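Here is a minimal sketch of the EmbCos idea, assuming we already have an embedding vector for each of a concept's top-activating words (the exact aggregation over word pairs is an assumption on our part):

```python
import numpy as np

def emb_cos_coherence(word_embeddings: np.ndarray) -> float:
    """Average pairwise cosine similarity of the top-activating words' embeddings.

    word_embeddings: array of shape (n_words, dim). Higher values mean the
    words live close together in embedding space, i.e. a more coherent concept.
    """
    normed = word_embeddings / np.linalg.norm(word_embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                      # all pairwise cosine similarities
    n = len(word_embeddings)
    off_diagonal = sims[~np.eye(n, dtype=bool)]   # drop the trivial self-similarities
    return float(off_diagonal.mean())
```

On a set like "north", "south", "east", "west" this score would be high; on a grab bag of unrelated words it would drop sharply.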
Visualizing Coherence
To verify this, the authors looked at “topics” extracted from high-activation tokens.

In Figure 6 above, we can see why this works.
- Case 1 clearly groups directional words.
- Case 2 groups data-science terms.
- Case 3 groups LaTeX math symbols.
The embedding-based metrics are particularly good at catching these relationships because pre-trained embeddings already “know” that “north” and “south” are related, even if they don’t appear right next to each other in a sentence.
Part 4: Meta-Evaluation – Auditing the Metrics
This is the most scientifically rigorous part of the paper. We have defined new metrics for Faithfulness (e.g., ABL-Loss) and Readability (e.g., IN-EmbCos). But are these metrics actually measuring what they claim to measure?
To answer this, the authors apply Measurement Theory (Psychometrics). They treat the AI metrics just like a psychologist would treat a personality test. They test for Reliability and Validity.
Reliability: Consistency is Key
A reliable metric should yield consistent results. If you run the test twice, or on a different subset of data, you should get the same score.
The authors assessed two types of reliability (a quick code sketch of both checks follows the list):
- Test-Retest Reliability: Do we get the same score on repeated runs?
- Subset Consistency: Does the score hold up across different chunks of the dataset?
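Both checks reduce to correlating score vectors across repeated runs or data splits. A minimal sketch, assuming Pearson correlation as the consistency measure (the paper may use a different reliability coefficient):

```python
import numpy as np

def test_retest_reliability(scores_run1: np.ndarray, scores_run2: np.ndarray) -> float:
    """Correlate per-concept scores from two repeated runs of the same metric."""
    return float(np.corrcoef(scores_run1, scores_run2)[0, 1])

def subset_consistency(per_example_scores: np.ndarray, seed: int = 0) -> float:
    """per_example_scores: shape (n_concepts, n_examples).

    Split the examples into two random halves, aggregate each half into a
    per-concept score, and check how well the two halves agree.
    """
    rng = np.random.default_rng(seed)
    n_examples = per_example_scores.shape[1]
    perm = rng.permutation(n_examples)
    half_a, half_b = perm[: n_examples // 2], perm[n_examples // 2:]
    scores_a = per_example_scores[:, half_a].mean(axis=1)
    scores_b = per_example_scores[:, half_b].mean(axis=1)
    return float(np.corrcoef(scores_a, scores_b)[0, 1])
```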

Figure 2 reveals some hard truths.
- The Failures: LLM-Score (asking GPT-4 to rate the concept) has poor reliability. It fluctuates too much. Similarly, GRAD-Loss and the N-gram based metrics (IN-UCI, IN-UMass) fell below the acceptable standard (the red dashed line).
- The Winners: The embedding-based metrics (IN-EmbCos) and the ablation-based faithfulness metrics (ABL-TClass) proved to be highly reliable.
Validity: Measuring the Right Thing
Validity asks: Does the metric actually correlate with the real-world quality we care about?
To test Readability Validity, the authors conducted a user study. They had human experts rate the readability of hundreds of concepts and compared those human scores to the automated scores.
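The comparison itself is a rank-correlation exercise. A minimal sketch, assuming Spearman's rank correlation as the agreement measure (the paper may report a different coefficient):

```python
import numpy as np
from scipy.stats import spearmanr

def readability_validity(automated_scores: np.ndarray, human_ratings: np.ndarray) -> float:
    """Rank-correlate an automated readability metric with human ratings of the same concepts."""
    rho, _p_value = spearmanr(automated_scores, human_ratings)
    return float(rho)
```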

Table 4 presents a standout result. EmbCos (Embedding Cosine Similarity) has the highest correlation with human judgment on both the input and output sides.
- This suggests that EmbCos is a valid, cheap, and deterministic proxy for human evaluation.
- Surprisingly, the LLM-Score (asking an LLM to explain the neuron) correlated worse with human judgment than the simple embedding math did.
Why did the LLM fail? The authors provide a case study (Figure 7 below) showing that LLMs sometimes hallucinate or miss the pattern because of limited context windows. In Case 3, the LLM fails to grasp the pattern of “LaTeX symbols,” which the simple embedding metric caught easily.

Construct Validity: The MTMM Matrix
Finally, the authors checked Construct Validity using a Multi-Trait Multi-Method (MTMM) matrix. This complex visualization checks if metrics that should be related are related (Convergent Validity) and if metrics that measure different things are actually uncorrelated (Divergent Validity).

In Figure 3, the “B. Readability” block (bottom right) shows decent correlation among readability metrics. Crucially, the bottom-left block (comparing Faithfulness to Readability) shows very low correlation. This is good news. It confirms that “Faithfulness” and “Readability” are indeed distinct constructs. A concept can be very faithful (the model uses it) but totally unreadable (looks like noise to us), and vice versa.
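As a rough sketch of the MTMM logic (not the paper's exact analysis): stack every metric's per-concept scores, correlate them, and compare the average correlation within a trait against the average correlation across traits.

```python
import numpy as np

def mtmm_summary(metric_scores: dict[str, np.ndarray], traits: dict[str, str]) -> dict:
    """metric_scores: metric name -> per-concept scores (all vectors the same length).
    traits: metric name -> "faithfulness" or "readability".
    """
    names = list(metric_scores)
    corr = np.corrcoef(np.stack([metric_scores[n] for n in names]))
    within, across = [], []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            same_trait = traits[names[i]] == traits[names[j]]
            (within if same_trait else across).append(corr[i, j])
    # Convergent validity wants the first number high; divergent validity wants the second low.
    return {"within_trait_mean": float(np.mean(within)),
            "across_trait_mean": float(np.mean(across))}
```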
Part 5: Comparing Explanation Methods
Armed with these validated metrics—specifically EmbCos for readability and ABL variants for faithfulness—the authors compared three popular explanation methods:
- TCAV (Supervised): Training a vector to find a specific concept (e.g., “harmful content”).
- Sparse Autoencoder (SAE): Unsupervised learning to decompose the hidden space.
- Neurons: Looking at raw neuron activations.

Figure 4 highlights the results:
- TCAV (Blue bars) performs the best on readability. This makes sense: because it is supervised, we are explicitly looking for human-defined concepts.
- Sparse Autoencoders (Orange bars) consistently outperform raw Neurons (Green bars). This validates the recent trend in AI research toward using SAEs; they really do find more meaningful “features” than single neurons do.
Conclusion
This research paper provides a crucial “audit” for the field of Explainable AI. It moves us away from vibes-based evaluation (“this neuron looks cool!”) toward rigorous measurement.
Key Takeaways for Students:
- Don’t trust metrics blindly. Just because a number is labeled “interpretability score” doesn’t mean it’s reliable. The authors showed that expensive LLM-based scores can be less reliable than simple cosine similarity.
- Use the right tools. If you need to evaluate concept readability, Embedding Cosine Similarity (EmbCos) is currently the best fast, automated, and valid proxy for human judgment.
- Perturbation is powerful. To measure if a model uses a concept, don’t just look at gradients. Use Ablation (ABL) strategies to surgically remove the concept and observe the causal effect on the output.
- Concepts are distinct. Remember that a concept can be highly faithful (important to the model) but completely unreadable to humans. We must optimize for both if we want safe, transparent AI.
By standardizing how we define and measure concepts, this work lays the foundation for a future where we can not only inspect LLMs but actually trust what we see.