Artificial intelligence, especially the large language models (LLMs) we interact with daily, often feels like a black box. We see the impressive outputs—coherent text, stunning images, insightful analysis—but the inner workings remain shrouded in mystery. How does a network of artificial neurons actually represent concepts like “elephant,” “justice,” or the color “pink”? This isn’t just an academic puzzle. As AI becomes more integrated into critical fields like medicine, finance, and policy, understanding how it represents information is central to ensuring trust, safety, and reliability.

A recent perspective paper, “From superposition to sparse codes: interpretable representations in neural networks,” offers a compelling framework for prying open this black box. The authors argue that despite their nonlinear architectures, neural networks encode features in a surprisingly simple, almost linear way. They represent information through superposition—a process where multiple concepts are linearly overlaid within the same representational space.

This blog unpacks the paper’s three central ideas, which together form a roadmap for transforming entangled, superposed neural activations into human-interpretable features. We’ll explore:

  1. Identifiability Theory: Why neural networks, even when nonlinear, recover the world’s latent features in linear form.
  2. Sparse Coding: How compressed sensing can “unmix” those features into understandable components.
  3. Quantitative Interpretability: How we can measure whether the decoded features truly correspond to meaningful, human-relevant concepts.

By the end, you’ll understand how today’s most advanced models may already be learning in ways that mirror deep principles of cognition and neuroscience—and how we can begin to read what they mean.


The Superposition Surprise: Adding Elephants and Pink Balls

At first glance, it sounds paradoxical: how could a nonlinear deep network represent complex concepts linearly? Yet the evidence is striking. The authors demonstrate this through a surprisingly simple experiment, shown below.

Figure 1: Neural representations add together, for both generated and real images. Panels A–D show that the representation of an “elephant” added to that of a “pink ball” approximates the representation of “an elephant and a pink ball”; similar results hold for real images of a cat and a dog.

Here’s the setup: using a vision transformer (ViT), they compute the internal representations for an image of an elephant and another of a pink ball. When they add these two vectors together and compare the result to the representation of an image containing both objects, the similarity is almost perfect (cosine similarity = 0.96). The same phenomenon occurs for natural photos of a dog and a cat.

This is the superposition principle—formally introduced decades ago by Paul Smolensky (1990)—which states that the representation of a conjunction of concepts equals the sum of their individual representations:

\[ \Psi\left(\bigwedge_i p_i\right) = \sum_i \Psi(p_i) \]

That is,

\[ \Psi(\text{elephant} \wedge \text{pink ball}) = \Psi(\text{elephant}) + \Psi(\text{pink ball}) \]

The fact that nonlinear models exhibit this linear additivity reveals an unexpected regularity. It also introduces a challenge: if thousands of features are mixed linearly inside high-dimensional activations, how can we isolate any single interpretable component? The authors outline a clear three-step process for doing precisely that.
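
To make the experiment concrete, here is a minimal sketch of how such a superposition check might be run, assuming a pretrained ViT from the `timm` library; the model name, pooling behavior, and image paths are illustrative stand-ins, not the authors' exact protocol.

```python
# A rough sketch of the superposition check: embed three images, add two of
# the representations, and compare against the third. Model choice and file
# paths are placeholders.
import timm
import torch
from PIL import Image

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
model.eval()

# Preprocessing that matches the pretrained model's expected input format.
config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**config, is_training=False)

def embed(path: str) -> torch.Tensor:
    """Return the pooled ViT representation of a single image."""
    img = Image.open(path).convert("RGB")
    with torch.no_grad():
        return model(transform(img).unsqueeze(0)).squeeze(0)

y_elephant = embed("elephant.jpg")            # placeholder path
y_ball = embed("pink_ball.jpg")               # placeholder path
y_both = embed("elephant_and_pink_ball.jpg")  # placeholder path

# Superposition: the sum of the parts should point in nearly the same
# direction as the representation of the combined scene.
sim = torch.nn.functional.cosine_similarity(y_elephant + y_ball, y_both, dim=0)
print(f"cosine similarity: {sim.item():.2f}")  # the paper reports ~0.96
```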


A Pipeline for Finding Meaning

The theory and method unify insights from neuroscience, representation learning, and information theory. The overall workflow—shown in the next figure—starts with data generation, proceeds through neural representation, and ends with interpretable sparse decoding.

Figure 2: The theoretical and practical pipeline for extracting interpretable representations from superposed neural activations, connecting generative models, neural representations, sparse coding, and quantitative evaluation. The theory predicts recovery of interpretable features up to permutation.

Let’s walk through it step by step.


Step 1 – Identifiability Theory: Why Representations Become Linear

Imagine that the world is composed of hidden causes or latent variables, denoted z. For images, these could be objects (“dog”), attributes (“furry”), or configurations (“is sitting”). Only a small subset is active in any scene, making them sparse.

A nonlinear generative function g combines these latent variables into observable data x = g(z). Another nonlinear function f—the trained neural network—maps the data into a representation y = f(x).

Figure 3: A mathematical world model showing how two nonlinear functions can compose into a linear mapping \(h = f ∘ g\). Under certain conditions the overall mapping from latent variables z through data x to neural representation y is linear, enabling analogies like “pink + elephant ≈ pink elephant.”

Surprisingly, theoretical work by Reizinger et al. (2024) proves that under standard supervised learning—optimizing cross‑entropy with a linear classifier—the composed mapping h = f ∘ g becomes linear. In other words, the neural network effectively inverts the nonlinear generative process up to a linear transformation.

This linearity shows up empirically in “neural analogy‑making,” reminiscent of the famous word‑embedding relation \( \text{king} - \text{man} + \text{woman} \approx \text{queen} \).

Analogously, consider color and object combinations:

\[ f ∘ g(\text{pink}) \approx f ∘ g(\text{pink ball}) - f ∘ g(\text{ball}) \approx f ∘ g(\text{pink elephant}) - f ∘ g(\text{elephant}) \]

and

\[ f ∘ g(\text{elephant}) + f ∘ g(\text{pink}) \approx f ∘ g(\text{pink elephant}) \]
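
Here is a toy check of these analogy equations, using off-the-shelf sentence embeddings in place of the paper's vision pipeline; the model name is purely illustrative, and sentence embeddings are only a rough proxy for \(f ∘ g\).

```python
# A toy analogy check with text embeddings: the "pink" direction extracted
# from two different contexts should roughly agree, and adding it to
# "elephant" should land near "pink elephant". Model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
phrases = ["pink ball", "ball", "pink elephant", "elephant"]
y = dict(zip(phrases, model.encode(phrases, convert_to_tensor=True)))

pink_from_ball = y["pink ball"] - y["ball"]
pink_from_elephant = y["pink elephant"] - y["elephant"]

# Do the two "pink" directions agree?
print(util.cos_sim(pink_from_ball, pink_from_elephant).item())

# Additivity: elephant + "pink" direction vs. the "pink elephant" embedding.
print(util.cos_sim(y["elephant"] + pink_from_ball, y["pink elephant"]).item())
```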

These additive properties demonstrate that neural representations behave linearly in high‑dimensional space. The upshot: beneath all their nonlinear layers, neural networks encode the world’s latent features through linear superposition. Next, we need a way to unmix these overlapping features.


Step 2 – Sparse Coding: Pulling Features Out of the Mix

If representations are linear mixtures of sparse features, they can be expressed as

\[ y = \Phi z \]

where \(\Phi\) is a linear projection. Typically, the representational dimension M is smaller than the number of latent features N, forcing multiple features to share directions—a phenomenon we call superposition.

Compressed sensing tells us that we can still recover the original sparse signal as long as the number of measurements satisfies

\[ M > \mathcal{O}\bigl(K \log(N/K)\bigr), \]

where K is the number of active components. For instance, with N = 10,000 candidate features of which only K = 20 are active at a time, the bound (up to its hidden constant) calls for on the order of 20 · ln(10,000/20) ≈ 124 dimensions. To achieve this recovery in practice, we use sparse coding, alternating between estimating sparse codes \(\hat{z}\) and learning a dictionary \(\Theta\):

\[ \min_{\hat{z}} \sum_i \|y^{(i)} - \Theta \hat{z}^{(i)}\|_2^2 + \lambda \|\hat{z}^{(i)}\|_1 \]

\[ \min_{\Theta} \sum_i \|y^{(i)} - \Theta \hat{z}^{(i)}\|_2^2 \;\text{s.t.}\; \|\Theta_{:,j}\|_2 = 1 \]
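
To make the alternating scheme concrete, here is a minimal numpy sketch on synthetic data: ISTA (iterative soft-thresholding) for the sparse-code step and a least-squares update with column normalization for the dictionary step. The dimensions, penalty, and iteration counts are illustrative, not tuned.

```python
# Minimal sketch of alternating sparse coding on synthetic data:
# (1) ISTA estimates sparse codes for a fixed dictionary,
# (2) a least-squares update refits the dictionary, with unit-norm columns.
import numpy as np

rng = np.random.default_rng(0)
M, N, K, n_samples = 64, 256, 5, 2000        # illustrative sizes (M < N)

# Ground truth: dictionary Phi and K-sparse latents Z, mixed into Y = Phi Z.
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
Z = np.zeros((N, n_samples))
for j in range(n_samples):
    active = rng.choice(N, size=K, replace=False)
    Z[active, j] = rng.standard_normal(K)
Y = Phi @ Z

def ista(Y, Theta, lam=0.1, n_steps=100):
    """Sparse-code step: minimize ||Y - Theta Z||^2 + lam * ||Z||_1 via ISTA."""
    step = 1.0 / np.linalg.norm(Theta, 2) ** 2   # 1 / Lipschitz constant
    Z_hat = np.zeros((Theta.shape[1], Y.shape[1]))
    for _ in range(n_steps):
        Z_hat = Z_hat - step * (Theta.T @ (Theta @ Z_hat - Y))
        Z_hat = np.sign(Z_hat) * np.maximum(np.abs(Z_hat) - step * lam, 0.0)
    return Z_hat

# Alternate between the code step and the dictionary step.
Theta = rng.standard_normal((M, N))
Theta /= np.linalg.norm(Theta, axis=0)
for _ in range(30):
    Z_hat = ista(Y, Theta)
    Theta = Y @ np.linalg.pinv(Z_hat)            # least-squares dictionary update
    Theta /= np.linalg.norm(Theta, axis=0) + 1e-12

print("relative reconstruction error:",
      np.linalg.norm(Y - Theta @ Z_hat) / np.linalg.norm(Y))
```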

Because this optimization is expensive for large datasets, recent research replaces it with amortized inference—learning a small neural network that directly predicts \(\hat{z}\). The resulting architecture is a Sparse Autoencoder (SAE).
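
A bare-bones PyTorch sketch of such a sparse autoencoder follows; the layer sizes, ReLU activation, and L1 coefficient are placeholders rather than the settings of any particular published SAE.

```python
# A minimal sparse autoencoder (SAE): the encoder amortizes sparse inference,
# the linear decoder plays the role of the dictionary Theta, and an L1
# penalty keeps the codes sparse. All hyperparameters are placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)               # amortized inference
        self.decoder = nn.Linear(d_dict, d_model, bias=False)   # dictionary Theta

    def forward(self, y: torch.Tensor):
        z_hat = torch.relu(self.encoder(y))   # nonnegative, (hopefully) sparse codes
        y_hat = self.decoder(z_hat)           # linear reconstruction y ≈ Theta z_hat
        return y_hat, z_hat

sae = SparseAutoencoder(d_model=768, d_dict=16384)  # overcomplete: many more codes than dims
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coef = 1e-3

y = torch.randn(32, 768)                      # stand-in for a batch of model activations
y_hat, z_hat = sae(y)
loss = ((y - y_hat) ** 2).mean() + l1_coef * z_hat.abs().mean()
loss.backward()
opt.step()
```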

Figure 4: Compressed sensing recovery bounds for different embedding sizes \(M\). Blue regions denote invertible (“recoverable”) regimes; larger models occupy regions where recovery of interpretable codes is theoretically possible.

While SAEs scale well, new analyses show they may not reach theoretical optimality: their simple encoder structures limit exact recovery in very high‑dimensional models. Improving the efficiency and fidelity of sparse inference for modern LLMs is therefore a key open problem.


Step 3 – Quantitative Interpretability: Measuring Success

After performing sparse inference, we must ask a crucial question: Are the recovered features actually meaningful?

A truly interpretable feature activates for one coherent concept (say, “cats”), whereas a mixed feature that responds partly to cats and partly to cars is far less useful. Quantitatively, interpretability should be permutation‑invariant—it doesn’t matter in which order we identify “cat” and “dog.”
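
On synthetic benchmarks where the ground-truth latents are known, permutation-invariant scoring can be made concrete by matching recovered features to true features before comparing them, for example with the Hungarian algorithm. A small sketch (the array shapes and the correlation-based score are assumptions, not the paper's metric):

```python
# Permutation-invariant scoring on synthetic data: match each recovered
# feature to a ground-truth latent with the Hungarian algorithm, then report
# the mean matched correlation. Z_true and Z_hat are (n_features, n_samples).
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_correlation(Z_true: np.ndarray, Z_hat: np.ndarray) -> float:
    n = Z_true.shape[0]
    # Absolute correlation between every (true, recovered) feature pair.
    C = np.abs(np.corrcoef(Z_true, Z_hat)[:n, n:])
    C = np.nan_to_num(C)                       # guard against constant features
    rows, cols = linear_sum_assignment(-C)     # assignment maximizing total correlation
    return float(C[rows, cols].mean())
```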

Several human‑based tasks have served as gold standards:

  • Word Intrusion Task (WIT): Humans identify which word doesn’t fit within a cluster associated with a feature—e.g., {river, boat, water, keyboard, stream}.
  • Police Lineup Task: Tests whether a feature isolates distinct meanings of polysemous words such as “bank” (river vs. financial).
  • Visual Intrusion Task: Participants select outliers from image sets that strongly activate a visual feature.

These experiments quantify how well features correspond to discrete, human‑interpretable concepts. However, scaling interpretability assessment across millions of neurons requires automated metrics that correlate reliably with human judgments—a rapidly developing research frontier.
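
As a toy illustration of such an automated metric, one can replace the human judge in the Word Intrusion Task with an embedding model that simply picks the least-related word in the set; the embedding model and word list below are placeholders.

```python
# Automated stand-in for the Word Intrusion Task: the "judge" flags the word
# that is least similar, on average, to the others. The embedding model and
# the example word set are placeholders for a feature's top-activating words.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def pick_intruder(words: list[str]) -> str:
    emb = model.encode(words, convert_to_tensor=True)
    sims = util.cos_sim(emb, emb)                            # pairwise similarities
    mean_sim = (sims.sum(dim=1) - sims.diagonal()) / (len(words) - 1)
    return words[int(mean_sim.argmin())]                     # the odd one out

print(pick_intruder(["river", "boat", "water", "keyboard", "stream"]))  # expect "keyboard"
```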


A Brief History of Un‑Mixing Representations

The path to interpretable representations spans four decades of research. Early work in linguistic modeling used SVD, ICA, and LDA to uncover semantic factors in text. Later methods like Non‑Negative Sparse Embedding (NNSE) and Sparse Coding substantially improved human task performance. Modern studies continue this trajectory with Sparse Autoencoders applied to transformer models.

Table 1: Overview of prior research applying sparse coding and related methods across model generations, from SVD-era word embeddings to modern LLMs. Across this span, sparser embeddings consistently yield higher interpretability scores.

For instance, Faruqui et al. (2015) found that human accuracy on the Word Intrusion Task jumped from 57 % for raw GloVe vectors to 71 % for their sparse versions. Recent works using SAEs on GPT‑4 and Claude 3 have identified thousands of interpretable neurons tied to coherent themes—ranging from “computer security” to “moral reasoning.” Collectively, the findings establish sparse decomposition as a reliable route toward interpretability.


Conclusion: From Black Boxes to Glass Boxes

The framework proposed in From Superposition to Sparse Codes links three previously separate domains into a unified theory:

  1. Representation Learning Theory — shows that supervised networks learn linear mappings of the world’s latent variables.
  2. Compressed Sensing and Sparse Coding — provides mathematical tools to recover those variables from mixed activations.
  3. Psychophysics and Quantitative Interpretability — gives empirical measures to evaluate whether decoded features align with human‑understandable concepts.

These connections have profound implications. For AI transparency, they offer a principled path from opaque “black boxes” to interpretable “glass boxes.” For neuroscience, they reconcile two long‑standing views: the single‑neuron doctrine (one cell, one concept) and population coding (distributed concepts). Superposition may be the mechanism bridging these extremes—efficiently encoding many sparse features within collective activity.

Many challenges remain. The classic binding problem—how networks represent relationships like “a blue triangle beside a red square” without losing structure—persists. Implementing scalable, precise sparse inference for ever‑larger models is still a major technical hurdle.

Yet the direction is clear. By treating neural representations as linear superpositions of sparse latent features, researchers are beginning to translate the language of deep networks into concepts we can understand. Each advance brings us closer to transforming AI from mysterious black boxes into transparent systems whose reasoning we can see, scrutinize, and trust.