Untangling the Black Box: Why Monosemanticity is Key to Better LLM Alignment
Introduction

Imagine trying to understand how a complex alien brain works. You probe a single neuron, hoping it corresponds to a specific thought like “happiness” or “the color red.” Instead, that single neuron fires for a chaotic mix of concepts: a specific preposition, the mention of the French Revolution, and the closing bracket of a Python function. This is the reality of polysemanticity in Large Language Models (LLMs).

For years, researchers in mechanistic interpretability have struggled with the fact that neural networks are “black boxes.” A major hurdle is that individual neurons often represent multiple, unrelated concepts simultaneously. The “holy grail” of interpretability is achieving monosemanticity: a state where one neuron (or feature) corresponds to exactly one understandable concept. ...