Large Language Models (LLMs) are impressive, but they are also black boxes. When an LLM outputs a statement, does it “believe” that statement is true, or is it merely simulating a persona that would say that statement? As we fine-tune models with human preferences, we risk teaching them to be sycophants—telling us what we want to hear rather than what is true.

To build safer and more reliable AI, we need to look inside the black box. We need to extract the model’s internal “knowledge” directly from its activations, bypassing its text output. This field is known as knowledge elicitation.

One of the most promising approaches in this field is unsupervised probing—finding a “truth direction” in the model’s neural activity without needing labeled data. However, recent research has shown that these methods are easily distracted. If a dataset has a prominent feature (like a specific writing style or a recurring random word), unsupervised probes often lock onto that feature instead of the truth.

In this post, we will dive deep into a recent paper, “Cluster-Norm for Unsupervised Probing of Knowledge,” which proposes a clever statistical fix for this problem. We will explore how “salient” but irrelevant features confuse current methods and how a technique called Cluster Normalization can isolate the true knowledge signal amidst the noise.

The Problem: Distracting Features

To understand the solution, we first need to understand the current state of the art: Contrast-Consistent Search (CCS). Introduced by Burns et al. (2022), CCS is a clever way to find truth without labels.

How CCS Works

The intuition behind CCS is simple: logical consistency. If you ask a model “Is the sky blue?” and “Is the sky NOT blue?”, the probabilities of the two answers should add up to 1. If the model assigns 90% probability to the sky being blue, it should assign roughly 10% to the sky not being blue.

CCS trains a small “probe” (a linear classifier) on the model’s internal activations to satisfy two conditions:

  1. Consistency: The probability of a statement and its negation must sum to 1.
  2. Confidence: The probe should not be wishy-washy (i.e., it should avoid the degenerate 50/50 answer).

Mathematically, the loss function looks like this:

\[
\mathcal{L}_{\text{CCS}} = \underbrace{\bigl(p(x^+) - (1 - p(x^-))\bigr)^2}_{\mathcal{L}_{\text{consistency}}} \;+\; \underbrace{\min\bigl(p(x^+),\, p(x^-)\bigr)^2}_{\mathcal{L}_{\text{confidence}}}
\]

averaged over all contrast pairs \(x^+, x^-\) in the dataset.

Here, \(\mathcal{L}_{\text{consistency}}\) ensures the probabilities add up, and \(\mathcal{L}_{\text{confidence}}\) pushes the probe to be decisive.
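To make this concrete, here is a minimal PyTorch sketch of the loss, assuming `p_pos` and `p_neg` are the probe’s sigmoid outputs for a batch of statements and their negations (the function and variable names are mine, not the paper’s):

```python
import torch

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """CCS loss from Burns et al. (2022): consistency + confidence.

    p_pos, p_neg: probe outputs in (0, 1) for each statement x+ and its negation x-.
    """
    # Consistency: the two probabilities should behave like complements (sum to 1).
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate "50/50 on both sides" solution.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# The probe itself can be as simple as a linear map followed by a sigmoid:
# probe = torch.nn.Sequential(torch.nn.Linear(hidden_dim, 1), torch.nn.Sigmoid())
```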

The “Salience” Trap

While CCS is brilliant, it has a flaw. It assumes that the “truth” is the most prominent (salient) structure in the data that satisfies logical consistency. But what if there is something more salient?

Farquhar et al. (2023) demonstrated that CCS can be easily tricked. Imagine a dataset where every prompt has a random word appended to it—either “banana” or “shed.”

  • Prompt A: “The Eiffel Tower is in Paris. Banana.”
  • Prompt B: “The Eiffel Tower is in Rome. Shed.”

To the LLM, the difference between the concepts “Banana” and “Shed” might be much “louder” in the activation space than the difference between “True” and “False.” Because CCS looks for the direction of maximum variance (confidence) that splits the data, it might accidentally learn to classify “Banana vs. Shed” instead of “True vs. False.”

In technical terms, the probe latches onto the most salient feature, which is not always knowledge.

The Solution: Cluster Normalization

The authors of this paper propose a solution called Cluster-Norm. The core idea is intuitive: if a distracting feature (like “Banana/Shed”) creates distinct clusters in the data, we should normalize the data within those clusters to remove that distraction.

The Standard Approach vs. Cluster-Norm

In the standard CCS pipeline (referred to as Burns-Norm), you normalize the activations globally: a single mean and standard deviation computed over the whole dataset (separately for the \(x^+\) and \(x^-\) halves of the contrast pairs). This keeps the global structure intact. If the “Banana” activations sit far away from the “Shed” activations, that large between-cluster variance survives normalization. Since CCS seeks variance (confidence), it dives right into that trap.
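As a rough sketch (the function name is mine), Burns-Norm boils down to a single global standardization of each set of activations:

```python
import numpy as np

def burns_norm(acts: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Globally standardize activations: one mean/std over the whole dataset.

    acts: shape (n_examples, hidden_dim), e.g. all x+ activations
    (applied separately to the x+ set and the x- set).
    """
    mu = acts.mean(axis=0, keepdims=True)
    sigma = acts.std(axis=0, keepdims=True)
    return (acts - mu) / (sigma + eps)
```

Because the mean and variance are computed once for the entire dataset, any large between-cluster offset (Banana vs. Shed) survives normalization and remains the dominant source of variance.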

Cluster-Norm adds a step:

  1. Harvest Activations: Get the model’s internal states for contrast pairs (Statements \(x^+\) and \(x^-\)).
  2. Cluster: Use a clustering algorithm (like HDBSCAN) on the activations. This automatically groups the data by its most salient features (e.g., Cluster 1 is “Banana” prompts, Cluster 2 is “Shed” prompts).
  3. Normalize Per Cluster: Normalize the activations independently for each cluster.

By normalizing inside the clusters, you effectively delete the information that defined the cluster. If Cluster 1 was defined by “Banana-ness,” normalizing it sets the mean “Banana-ness” to zero. What remains is the variation within each cluster, which, ideally, is the “True/False” knowledge signal.
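Here is a minimal sketch of that pipeline using scikit-learn’s HDBSCAN. The paper does not publish this exact code, so details such as which representation to cluster on and how to handle points HDBSCAN marks as noise are my simplifications:

```python
import numpy as np
from sklearn.cluster import HDBSCAN  # scikit-learn >= 1.3; the standalone hdbscan package also works

def cluster_norm(acts_pos: np.ndarray, acts_neg: np.ndarray, eps: float = 1e-8):
    """Per-cluster standardization of contrast-pair activations (rough sketch).

    acts_pos, acts_neg: (n_examples, hidden_dim) activations for x+ and x-.
    """
    # Cluster on a per-pair representation so both halves of a pair share a cluster.
    labels = HDBSCAN(min_cluster_size=5).fit_predict((acts_pos + acts_neg) / 2)

    out_pos, out_neg = acts_pos.copy(), acts_neg.copy()
    for c in np.unique(labels):  # label -1 ("noise") is treated as its own group here
        idx = labels == c
        for acts, out in ((acts_pos, out_pos), (acts_neg, out_neg)):
            mu = acts[idx].mean(axis=0, keepdims=True)
            sigma = acts[idx].std(axis=0, keepdims=True)
            out[idx] = (acts[idx] - mu) / (sigma + eps)
    return out_pos, out_neg, labels
```

The normalized activations are then fed into CCS exactly as before; only the preprocessing step changes.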

Visualizing the Fix

The impact of this method is best understood visually. The authors performed an experiment where they appended random words to prompts to distract the model.

Look at the PCA (Principal Component Analysis) plots below. These visualizations show the geometry of the model’s activations.

PCA visualizations comparing Burns-Norm (left) and Cluster-Norm (right). On the left, clusters are separated by random words (light vs dark). On the right, Cluster-Norm aligns the data by knowledge (orange vs blue).

  • Left (Standard Normalization): The data is split into two clear groups based on the random words (Light dots vs. Dark dots). The “True” (Orange) and “False” (Blue) points are mixed together within those groups. A probe looking for the biggest difference will learn “Light vs. Dark” (the random words).
  • Right (Cluster-Norm): After clustering and normalizing, the “random word” distinction is collapsed. Now, the primary direction of variance separates Orange from Blue. The probe is forced to learn the knowledge feature.
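If you want to reproduce this kind of plot for your own activations, a short scikit-learn/matplotlib sketch like the following is enough (my own code, not the paper’s plotting script):

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_contrast_pca(acts_pos, acts_neg, truth_labels, word_labels, title):
    """Scatter the top-2 principal components of x+ - x-, colored two ways."""
    pcs = PCA(n_components=2).fit_transform(acts_pos - acts_neg)
    fig, axes = plt.subplots(1, 2, figsize=(9, 4))
    for ax, colors, name in ((axes[0], word_labels, "distractor word"),
                             (axes[1], truth_labels, "true/false")):
        ax.scatter(pcs[:, 0], pcs[:, 1], c=colors, s=8, cmap="coolwarm")
        ax.set_title(f"{title}: colored by {name}")
    fig.tight_layout()
    plt.show()
```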

Why This Works: The Math of Contrastive Features

To understand why Cluster-Norm works theoretically, we have to look at how variance relates to the CCS loss. The paper outlines that the variance of the difference between positive and negative pairs captures both confidence and consistency.

Equation showing the decomposition of variance into confidence and consistency terms.

As shown above, high variance implies high confidence and consistency. Therefore, CCS naturally seeks the direction with the highest variance.
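This variance-seeking behavior is easiest to see in the simplest baseline from this line of work, which just takes the top principal component of the contrast differences (a sketch, not the paper’s implementation):

```python
import numpy as np

def top_pc_direction(acts_pos: np.ndarray, acts_neg: np.ndarray) -> np.ndarray:
    """Direction of maximum variance of the differences x+ - x-.

    Projecting onto this direction gives an unsupervised "probe" that latches
    onto whatever contrastive feature has the largest variance -- ideally
    truth, but possibly a distractor like Banana vs. Shed.
    """
    diffs = acts_pos - acts_neg
    diffs = diffs - diffs.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]  # first right-singular vector == top principal component
```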

The authors argue that distracting features induce “contrastive features” (directions where \(x^+\) and \(x^-\) differ). Even if a feature like “Banana” is constant for a specific prompt pair, interactions in the neural network (like XOR functions) can mix this feature with the “True/False” direction.

By clustering and normalizing, we attempt to isolate the knowledge direction. If the clustering works, the normalized difference looks like this:

Equation showing the normalized difference effectively isolating the knowledge feature.

Here, \(\vec{F}_{\top/\perp}\) represents the knowledge feature (True/False). By removing the cluster-specific noise, this term becomes the dominant signal for the probe to discover.

Experimental Results

The researchers tested Cluster-Norm on several datasets designed to trick unsupervised probes. They used models like Mistral-7B, Llama-3, and others.

Experiment 1: Random Words

As mentioned earlier, this experiment appends “Banana” or “Shed” to prompts. This is a “syntactical bias” designed to create a high-variance distraction.
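A toy version of this construction (the wording is mine; the paper uses its own templates) looks like this:

```python
import random

DISTRACTORS = ("Banana.", "Shed.")

def add_distractor(prompt: str, rng: random.Random) -> str:
    """Append one of two irrelevant words, creating a high-variance syntactic split."""
    return f"{prompt} {rng.choice(DISTRACTORS)}"

rng = random.Random(0)
print(add_distractor("The Eiffel Tower is in Paris.", rng))
# => "The Eiffel Tower is in Paris. Banana." or "... Shed.", depending on the draw
```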

The results were stark. Standard CCS fell for the trap, achieving accuracy near 50% (random guessing) on the actual knowledge task because it was classifying the random words instead. Cluster-Norm restored the accuracy significantly.

Violin plots showing accuracy distributions. Left: Standard CCS fails on Random GT (red). Right: Cluster-Norm restores accuracy for Random GT.

In the figure above:

  • Left (Burns-Norm): Look at the red violin (“Random:GT”). It’s centered around 0.5. The probe has completely failed to find the ground truth.
  • Right (Cluster-Norm): The red violin shifts up significantly, centering closer to 0.8. The probe has successfully ignored the random words and found the knowledge.

The improvement holds across different models and across layers, as seen below:

Line charts showing Cluster-Norm consistently outperforming standard normalization across layers, especially for modified prompts.

The bottom row (biased/modified prompts) shows the red and dark green lines (Cluster-Norm methods) consistently beating the orange and light green lines (Standard methods).

Summary of Accuracy (Mistral-7B):

Table showing Cluster-Norm improving CCS accuracy from 0.53 to 0.77.

Standard CCS accuracy dropped to 0.53 (basically a coin flip). Cluster-Norm brought it back up to 0.77.

Experiment 2: Explicit Opinion (The “Alice” Effect)

In this setup, prompts include a fictional character named Alice who gives an opinion.

  • Prompt: “Alice thinks this movie is great. What do you think?”
  • Alice’s opinion is a distracting feature. She might be wrong.

If the probe learns to predict “What Alice thinks” instead of “What is true,” it has failed.
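A toy construction of such prompts might look like the following, where Alice’s stated opinion is sampled independently of the ground truth so that a probe tracking “what Alice thinks” can be told apart from one tracking “what is true” (my wording, not the paper’s exact template):

```python
import random

def alice_prompt(statement: str, alice_says_true: bool) -> str:
    """Prepend an explicit, possibly wrong, opinion from a fictional character."""
    verdict = "true" if alice_says_true else "false"
    return f"Alice thinks the following statement is {verdict}. {statement} Is this statement true?"

rng = random.Random(0)
data = [("The Eiffel Tower is in Paris.", True),
        ("The Eiffel Tower is in Rome.", False)]
# Alice's opinion is drawn independently of the truth.
prompts = [alice_prompt(statement, rng.random() < 0.5) for statement, _ in data]
```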

Violin plots for the explicit opinion experiment. Cluster-Norm (right) shows tighter and higher accuracy distributions for the Alice:GT case (red).

Again, Cluster-Norm helps. In Figure 4 above, the red distribution (Alice prompts evaluated against Ground Truth) moves higher and becomes tighter when using Cluster-Norm.

Table showing accuracy results for the explicit opinion experiment. CCS with Cluster-Norm reaches 0.77 accuracy.

The table confirms this: Standard CCS gets 0.56, while CCS with Cluster-Norm jumps to 0.77.

Experiment 3: Implicit Opinion

The researchers also tried to replicate a more subtle experiment from previous literature where “Alice” has an implicit bias (e.g., she hates capitalism and always answers questions about companies incorrectly).

Surprisingly, the authors found that even standard CCS worked reasonably well here (unlike previous reports), perhaps because different models were used (Mistral-7B vs. Chinchilla). Indeed, the PCA visualization confirms that the “knowledge” clusters are distinct.

PCA visualizations for implicit opinion. The first principal component splits the data by correct choice relatively easily.

Because the “knowledge” feature was already the most salient thing here (as seen in the clear separation in the PCA above), both methods worked well. This highlights that Cluster-Norm is most essential when there is a competing salient feature.

Limitations: The “Simulated Knowledge” Problem

It is important to note what Cluster-Norm does not solve.

There is a major open problem in AI alignment called Simulated Knowledge. If you prompt an LLM with “I am a very gullible person who believes urban legends. Is it true that if you swallow gum it stays in your stomach for 7 years?”, the model might say “Yes” because it is simulating a gullible person, even if the model “knows” that is false.

The authors tested this using the “CommonClaim” dataset with prompts like “Professor Smith says…” vs. a standard prompt.

Violin plots showing prompt sensitivity. The distributions overlap significantly, indicating Cluster-Norm does not resolve the differences between prompt templates.

As shown in Figure 5, the accuracy varies depending on the prompt template (Default vs. Literal vs. Professor). Cluster-Norm (right) looks very similar to Standard Norm (left).

Why didn’t it work? Cluster-Norm removes distracting features in the dataset (like half the prompts having the word “Banana”). But in the “Professor Smith” case, the prompt template changes the context of the knowledge itself. It’s not a distraction to be normalized away; it’s a fundamental shift in how the model is processing the question. Differentiating between “Model Knowledge” and “Simulated Persona Knowledge” remains an unsolved challenge.

Conclusion

The paper “Cluster-Norm for Unsupervised Probing of Knowledge” provides a significant step forward in our ability to trust LLMs. It highlights a critical flaw in previous unsupervised probing methods: they are easily mesmerized by shiny, high-variance features that have nothing to do with the truth.

By applying Cluster Normalization, we can statistically strip away these distractions. We cluster the data by its most salient noise, normalize that noise to zero, and let the true knowledge signal shine through.

Key Takeaways:

  1. Unsupervised probing (CCS) is great but fragile; it confuses “salience” (variance) with “truth.”
  2. Cluster-Norm fixes this by grouping distracting inputs and normalizing them locally.
  3. This method massively improves performance on datasets with syntactical distractions (random words) and explicit opinions.
  4. It does not solve the problem of simulated knowledge (personas), which remains a frontier for future research.

As we continue to deploy powerful models, techniques like Cluster-Norm will be essential tools in the interpretability toolbox, helping us ensure that when an AI speaks, it’s drawing from facts, not just statistical noise.