In the high-stakes world of healthcare, Artificial Intelligence is rapidly becoming an indispensable tool. One of the most critical back-office tasks in medicine is medical coding—the translation of unstructured clinical text (like doctor’s notes) into standardized International Classification of Diseases (ICD) codes. These codes are vital for billing, epidemiology, and treatment tracking.

While Large Language Models (LLMs) have shown incredible prowess in automating this task, they suffer from a significant “black box” problem. When an AI assigns a code for “postoperative wound infection,” a doctor needs to know why. If the model cannot explain its reasoning, it cannot be trusted in a clinical setting.

Existing interpretability methods often fall short, highlighting irrelevant words like “and” or “the” as the justification for a diagnosis. This blog post explores a fascinating new approach called AutoCodeDL, presented in the paper Beyond Label Attention, which uses Dictionary Learning to peel back the layers of neural networks. We will explore how this method decomposes messy, dense signals into clear, human-readable medical concepts.

The Problem with Current Interpretability

To understand why we need a new method, we first need to understand how we currently explain AI decisions in medical coding. The industry standard relies on the Label Attention (LAAT) mechanism.

In simple terms, LAAT looks at the clinical text and assigns an “attention score” to each word (token) relative to a specific ICD code. It tells us which words the model focused on to make its decision.

Figure 5: The Label Attention pipeline. Label Attention identifies the most relevant tokens for each ICD code through a label attention matrix.

As shown in the pipeline above, clinical notes are broken into segments, processed by a Pre-trained Language Model (PLM), and then passed through Label Attention to produce ICD codes.

However, researchers have noticed a disturbing trend. LAAT often highlights tokens that seem completely irrelevant to a human. For example, a model might predict “postoperative wound infection” and claim the most important word was the conjunction “and.”

Figure 1: Motivation: LAAT identifies the most relevant tokens for each ICD code.

Look at the figure above. In panel (a), a human would identify “wound,” “breakdown,” and “dehiscence” as the key terms. In panel (b), the standard LAAT mechanism inexplicably highlights “and.”

Is the model broken? Not necessarily. The authors of this paper argue that this is a result of neuron polysemanticity and superposition.

The Concept of Superposition

In the high-dimensional vector spaces where LLMs operate, models often represent more features than they have neurons, overlapping several concepts in the same directions. This phenomenon is called superposition. A single neuron might help encode “wound infection,” “prepositions,” and “billing dates” all at once.

When we look at the attention score for “and,” we are seeing a compressed, “dense” representation. The word “and” in that specific context might actually be carrying a hidden signal about wound failure. The challenge is: how do we disentangle these overlapping signals?

The Solution: Dictionary Learning

To solve this, the researchers turned to Dictionary Learning (DL) with Sparse Autoencoders (SAEs).

The core idea is to take the “dense” embedding of a word (which is a mess of overlapping concepts) and map it to a “sparse” representation where each element represents a distinct, singular concept (a “dictionary feature”).

Think of the dense embedding as a smoothie. It tastes like a mix of fruits. Dictionary learning is the machine that un-mixes the smoothie back into separate piles of strawberries, bananas, and kale.

The Sparse Autoencoder Architecture

The researchers employed a Sparse Autoencoder to perform this decomposition. An autoencoder is a neural network designed to copy its input to its output, but with a constraint in the middle that forces it to learn meaningful structure.

Here is the mathematical framework they used:

\[
f = \mathrm{ReLU}(W_e x + b_e), \qquad \hat{x} = W_d f
\]
\[
\mathcal{L} = \| x - \hat{x} \|_2^2 + \lambda \| f \|_1
\]

Let’s break this down:

  1. Encoder: The input embedding \(x\) is projected into a higher-dimensional space using weights \(W_e\) and a bias \(b_e\). A ReLU activation ensures values are non-negative.
  2. Decoder: The resulting feature vector \(f\) is projected back down to the original size using weights \(W_d\).
  3. Loss Function: The training minimizes the difference between the original embedding and its reconstruction (Mean Squared Error). Crucially, it adds an \(L_1\) penalty (\(\lambda \|f\|_1\)) on the feature vector. This penalty forces the feature activations to be sparse—mostly zeros.

This sparsity constraint is the magic ingredient. It forces the model to use only a few active features to describe any given word, effectively isolating specific concepts.
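
To make this concrete, here is a minimal PyTorch sketch of a sparse autoencoder along these lines. The dimensions, hyperparameters, and variable names are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder: dense embedding -> sparse features -> reconstruction."""

    def __init__(self, embed_dim: int, dict_size: int):
        super().__init__()
        # Encoder projects into a larger, overcomplete feature space.
        self.W_e = nn.Linear(embed_dim, dict_size)
        # Decoder (the "dictionary") maps sparse features back to the embedding space.
        self.W_d = nn.Linear(dict_size, embed_dim, bias=False)

    def forward(self, x: torch.Tensor):
        f = F.relu(self.W_e(x))   # non-negative, mostly-zero feature activations
        x_hat = self.W_d(f)       # reconstruction from the dictionary
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes most activations to zero."""
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()

# Toy usage: 768-dimensional token embeddings, a 16,384-feature dictionary.
sae = SparseAutoencoder(embed_dim=768, dict_size=16_384)
x = torch.randn(32, 768)          # a batch of dense token embeddings
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
```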

The dictionary matrix \(W_d\) can be viewed as a collection of feature directions:

\[
W_d = \begin{bmatrix} h_1 & h_2 & \cdots & h_n \end{bmatrix}
\]

where each column \(h_i\) is one feature direction in the embedding space.

Consequently, any dense token embedding \(x\) can be approximated as a linear combination of these sparse features:

\[
x \approx \sum_i f_i \, h_i
\]
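
Continuing the sketch above, the decoder’s weight columns play the role of the \(h_i\): reconstructing a token embedding is literally summing those columns scaled by the sparse activations.

```python
# Columns of the decoder weight matrix are the dictionary directions h_i.
H = sae.W_d.weight                      # shape (embed_dim, dict_size); column i is h_i

f_tok = f[0]                            # sparse activations for a single token
active = torch.nonzero(f_tok).squeeze(-1)

# x_hat ≈ sum over active features of f_i * h_i
x_rebuilt = sum(f_tok[i] * H[:, i] for i in active)
print(torch.allclose(x_rebuilt, x_hat[0], atol=1e-4))   # True, up to float error
```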

Building a Medically Relevant Dictionary

Simply training an autoencoder isn’t enough. We need to know what these sparse features actually mean in a medical context. The researchers developed a method to map these features to specific ICD codes.

The process involves two main steps: Sampling/Encoding and Ablation/Identification.

Figure 2: Building a dictionary involves several steps: A sparse autoencoder decomposes each token embedding into a sparse latent space.

Step 1: Sampling and Encoding

The model processes millions of tokens from clinical notes. The Sparse Autoencoder breaks these into feature activations (\(f\)). If a feature activates strongly for tokens like “dehiscence,” “breakdown,” and “rupture,” we start to get a clue about its meaning.
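
A rough sketch of what this sampling pass could look like, reusing the hypothetical `sae` from the earlier code; the token list here is a toy stand-in for the millions of clinical-note tokens the authors actually process.

```python
import heapq
from collections import defaultdict

# Toy stand-ins for a full pass over clinical notes.
tokens = ["dehiscence", "breakdown", "and", "rupture"]   # token strings (illustrative)
embeddings = torch.randn(len(tokens), 768)               # their dense contextual embeddings

K = 20                                                   # example tokens to keep per feature
top_tokens = defaultdict(list)                           # feature_id -> [(activation, token), ...]

_, feats = sae(embeddings)                               # sparse activations, shape (num_tokens, dict_size)
for tok, f_tok in zip(tokens, feats):
    for feat_id in torch.nonzero(f_tok).squeeze(-1).tolist():
        heap = top_tokens[feat_id]
        heapq.heappush(heap, (f_tok[feat_id].item(), tok))
        if len(heap) > K:
            heapq.heappop(heap)                          # keep only the K strongest activations

# A feature whose strongest activations come from "dehiscence", "breakdown", and "rupture"
# is a good candidate for a "failure of wound healing" concept.
```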

Step 2: Ablation and Identification

To confirm a feature’s meaning, the researchers use ablation. This means they mathematically remove a specific feature from the embedding and measure the impact.

The ablation formula is:

\[
\tilde{x} = x - f_i \, h_i
\]

By removing feature \(f_i\), we get a modified embedding \(\tilde{x}\). We then look at the model’s prediction output (softmax probabilities) for the original input versus the ablated input:

\[
\Delta_i = p_c(x) - p_c(\tilde{x})
\]

where \(p_c(\cdot)\) denotes the model’s softmax probability for ICD code \(c\).

If removing a specific feature causes the probability of the “Wound Infection” ICD code to drop significantly, we know that feature is responsible for that diagnosis.
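
Here is a hedged sketch of that ablation test, again reusing the hypothetical `sae`. The `classifier_head` is a stand-in for the downstream ICD prediction layer applied to the (possibly ablated) embedding; in the real pipeline the ablated embedding would be substituted back into the full model.

```python
def ablate_feature(x_vec: torch.Tensor, sae, feat_id: int) -> torch.Tensor:
    """x_tilde = x - f_i * h_i: remove one dictionary feature's contribution."""
    _, f_vec = sae(x_vec.unsqueeze(0))
    h_i = sae.W_d.weight[:, feat_id]                 # dictionary direction for feature i
    return x_vec - f_vec[0, feat_id] * h_i

def probability_drop(x_vec, sae, feat_id, classifier_head, code_idx) -> float:
    """Delta_i: drop in the softmax probability of one ICD code after ablation."""
    x_tilde = ablate_feature(x_vec, sae, feat_id)
    p_orig = torch.softmax(classifier_head(x_vec.unsqueeze(0)), dim=-1)[0, code_idx]
    p_ablt = torch.softmax(classifier_head(x_tilde.unsqueeze(0)), dim=-1)[0, code_idx]
    return (p_orig - p_ablt).item()

# Toy usage with a made-up linear head over 50 ICD codes:
classifier_head = nn.Linear(768, 50)
drop = probability_drop(torch.randn(768), sae, feat_id=3728,
                        classifier_head=classifier_head, code_idx=7)
# A large positive drop suggests that feature drives the code's prediction for this token.
```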

AutoCodeDL: The New Pipeline

The researchers combined the standard Label Attention mechanism with their new Dictionary Learning approach to create AutoCodeDL.

Instead of stopping at “the model looked at the word ‘and’,” AutoCodeDL digs deeper:

  1. LAAT identifies important words (even seemingly irrelevant ones like “and”).
  2. The Sparse Autoencoder decomposes “and” into its active features.
  3. The system identifies that Feature ID 3728 is active.
  4. The dictionary lookup reveals Feature ID 3728 corresponds to “failure of wound healing.”

Figure 3: Proposed method for automated ICD interpretability pipeline: AutoCodeDL.

This completely changes the narrative. The explanation goes from “Attention on ‘and’” (confusing) to “Attention on ‘and’ because it contains the hidden concept of ‘wound failure’” (useful and trustworthy).
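
A rough sketch of how that final lookup could be wired together. The `feature_to_concept` mapping is hypothetical and would be built offline from the sampling and ablation steps above; none of these names come from the authors’ code.

```python
def explain_token(token_embedding: torch.Tensor, sae, feature_to_concept, top_k: int = 3):
    """Translate a LAAT-highlighted token into human-readable medical concepts."""
    _, f_tok = sae(token_embedding.unsqueeze(0))
    strongest = torch.topk(f_tok.squeeze(0), k=top_k)    # strongest dictionary features for this token
    explanations = []
    for act, feat_id in zip(strongest.values.tolist(), strongest.indices.tolist()):
        if act > 0:                                       # ignore inactive features
            concept = feature_to_concept.get(feat_id, f"feature {feat_id} (unlabeled)")
            explanations.append((concept, round(act, 3)))
    return explanations

# e.g. explain_token(embedding_of_and, sae, {3728: "failure of wound healing"})
# might return [("failure of wound healing", 4.212), ("feature 91 (unlabeled)", 0.507), ...]
```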

Experimental Results

The researchers validated their method on the MIMIC-III dataset, a massive collection of real-world clinical notes. They compared their method against several baselines, including PCA, ICA, and random encoders.

1. Explainability (Faithfulness)

The first test was to ensure that the dictionary features actually explain the model’s behavior. If we ablate the features identified by AutoCodeDL, does the model change its mind?

The results showed that ablating dictionary features of highlighted tokens had a precise and significant impact on the predicted ICD codes.

Table 9: Softmax probability changes in downstream ICD predictions resulting from ablation experiments.

In the table above, look at the Ratio. A higher ratio means the method successfully lowered the probability of the target code (Top) without destroying the probabilities of other unrelated codes (NGT). AutoCodeDL (L1 and SPINE variants) significantly outperformed baselines like PCA and ICA, showing that it provides a more surgical and accurate explanation.
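
As a reading aid only, the ratio can be thought of as the impact on the target code divided by the collateral impact on unrelated codes; this is a paraphrase of the table’s columns, not the paper’s formal definition.

```python
def ablation_ratio(drop_target: float, drop_unrelated: float) -> float:
    """Informal reading of the Ratio column: target-code impact over collateral impact.

    Higher values mean the ablation was more "surgical": it hit the intended code
    while leaving unrelated codes largely untouched. Not the paper's exact formula.
    """
    return drop_target / max(drop_unrelated, 1e-8)
```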

2. Solving the “Stop Word” Mystery

Remember the example where “and” was highlighted? The researchers systematically tested this. They took irrelevant stop words that were highlighted by LAAT and used AutoCodeDL to analyze them.

They found that upwards of 90% of these medically irrelevant tokens actually contained relevant medical concepts in superposition.

Table 2: Proportion of stop word embedding labels correctly identified by our AutoCodeDL framework.

This is a breakthrough for trust in medical AI. It confirms that the model isn’t hallucinating importance on stop words; it is efficiently packing information into them.

3. Model Steering

If these features are truly causal, we should be able to force the model to make a prediction by artificially activating a feature. This is called model steering.

The researchers “clamped” (manually increased) specific feature activations and observed the results.
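
A minimal sketch of what clamping could look like, reusing the hypothetical `sae`; the clamp value is arbitrary, and the steered embedding would be fed back into the coding model in place of the original.

```python
def steer_with_feature(x_vec: torch.Tensor, sae, feat_id: int, clamp_value: float = 10.0) -> torch.Tensor:
    """Clamp one dictionary feature to a high activation and rebuild the embedding."""
    _, f_vec = sae(x_vec.unsqueeze(0))
    f_vec = f_vec.clone()
    f_vec[0, feat_id] = clamp_value        # force the feature "on", regardless of its natural activation
    return sae.W_d(f_vec).squeeze(0)       # steered embedding, decoded back to the model's space

# If clamping a "renal failure" feature makes the model start predicting renal ICD codes,
# the feature is causally involved in the prediction, not merely correlated with it.
```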

Table 3: Model steering experiment results comparing AutoCodeDL.

The results were impressive. By manipulating dictionary features, they could flip the decisions for thousands of codes. This confirms that the dictionary features are not just correlated with medical concepts—they are the mechanism the model uses to reason.

The visualization below (UMAP) maps these features, showing clusters of related concepts like “Heart Conditions” or “Renal Conditions” that can steer the model’s behavior.

Figure 4: UMAP of SPINE Embeddings: Dictionary features are interpretable in steering model behavior.

Human Understandability

Finally, a medical explanation system is useless if humans can’t understand it. The researchers evaluated the coherence and distinctiveness of the learned features.

Coherence asks: Do the words that activate a specific feature share a semantic theme?

Here is an example of a highly coherent feature related to Atrial Fibrillation (AFib). Notice how the activated tokens (in red) are all contextually related to heart rhythm issues.

Figure 6: Example of highly interpretable SPINE feature with high cosine similarity.
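
The figure’s caption mentions cosine similarity; one simple way to score coherence along those lines is the mean pairwise cosine similarity among a feature’s top-activating token embeddings. This is a plausible proxy, not necessarily the paper’s exact metric.

```python
import torch
import torch.nn.functional as F

def feature_coherence(token_embeddings: torch.Tensor) -> float:
    """Mean pairwise cosine similarity among a feature's top-activating token embeddings."""
    normed = F.normalize(token_embeddings, dim=-1)
    sims = normed @ normed.T                             # all pairwise cosine similarities
    n = sims.shape[0]
    off_diag = sims[~torch.eye(n, dtype=torch.bool)]     # drop the self-similarity diagonal
    return off_diag.mean().item()

# Embeddings of "afib", "atrial fibrillation", and "rvr" should score high;
# a grab-bag of unrelated tokens should score low.
```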

Conversely, not all features were perfect. Some features grouped visually similar but semantically diverse concepts, or concepts that were too abstract for immediate interpretation.

Figure 8: Example of an uninterpretable SPINE feature with low cosine similarity.

To rigorously test this, they performed a “word intrusion” test (Distinctiveness). They showed human experts (including a licensed physician) a set of words activated by a feature, plus one random “intruder” word. If the feature is distinct, the human should easily spot the intruder.

Figure 9: Example of an interpretable L1 feature.

In the example above, the feature clearly relates to heart valve procedures (“porcine avr,” “bovine avr”). The intruder “nephrectomy” (kidney removal) sticks out like a sore thumb.
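
As a rough illustration, a word-intrusion trial like the one above could be assembled as follows; the sampling details and names are made up for the sketch.

```python
import random

def build_intrusion_trial(feature_top_tokens, distractor_vocab, n_shown=5, seed=0):
    """Mix a feature's top-activating tokens with one random 'intruder' word.

    If annotators reliably spot the intruder, the feature is distinct and interpretable.
    """
    rng = random.Random(seed)
    shown = list(feature_top_tokens[:n_shown])
    intruder = rng.choice([w for w in distractor_vocab if w not in feature_top_tokens])
    trial = shown + [intruder]
    rng.shuffle(trial)
    return trial, intruder

words, answer = build_intrusion_trial(
    ["porcine avr", "bovine avr", "valve replacement", "aortic valve", "bioprosthetic"],
    distractor_vocab=["nephrectomy", "appendectomy", "cholecystectomy"],
)
# Annotators see `words`; the feature counts as distinct if they pick out `answer`.
```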

However, some cases were harder. In the example below, both annotators correctly identified the intruder, but they noted the feature was less obvious—likely capturing a more abstract concept regarding specific anatomical locations or conditions.

Figure 10: Example of a less interpretable L1 feature differentiated by both annotators.

This human evaluation confirmed that while not every single feature is perfect, the Sparse Autoencoder (especially the L1 variant) produces features that are significantly more interpretable and distinct than baseline methods.

Conclusion

The paper “Beyond Label Attention” presents a significant step forward for AI in healthcare. By acknowledging that language models compress information in complex ways (superposition), the researchers moved beyond simple highlighting tools.

AutoCodeDL offers a transparent window into the “brain” of a medical AI. It allows us to:

  1. Validate Predictions: Confirm that a diagnosis is based on medical evidence, not statistical noise.
  2. Translate “Computer Speak”: Decode why a model might focus on a word like “and.”
  3. Steer Models: Potentially correct model behavior by adjusting specific feature activations.

As medical coding models become larger and more complex, tools like AutoCodeDL will be essential. They provide the necessary bridge between the raw computational power of AI and the trust required for patient care. Instead of asking doctors to trust a black box, we can finally hand them the dictionary to decode it.