![Cover image](https://deep-paper.org/en/paper/2503.14615/images/cover.png)
Left vs. Right: How a Trivial Tiebreaking Choice Defines Transformer Expressivity
If you have been following the explosion of theoretical research into Transformers, you know that understanding what these models can actually compute is just as important as watching their loss curves go down. To study the architecture mathematically, we often idealize it. One common simplification is Unique Hard Attention (UHA). In standard "soft" attention (as in GPT-4), the model attends to all previous tokens with varying weights; in UHA, the model attends to exactly one token: the one with the highest attention score. ...
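To make the distinction concrete, here is a minimal NumPy sketch (the function names and example scores are illustrative, not taken from the paper): soft attention spreads weight over every position via a softmax, while unique hard attention selects a single position and therefore needs a tiebreaking rule, leftmost or rightmost, when several positions share the maximum score.

```python
import numpy as np

def soft_attention(scores):
    """Standard soft attention: a softmax spreads weight over all positions."""
    scores = np.asarray(scores, dtype=float)
    e = np.exp(scores - scores.max())
    return e / e.sum()

def unique_hard_attention(scores, tiebreak="left"):
    """Unique Hard Attention: attend to exactly one position, the one with
    the highest score; ties are broken toward the leftmost or rightmost
    maximal position."""
    scores = np.asarray(scores, dtype=float)
    tied = np.flatnonzero(scores == scores.max())
    return int(tied[0] if tiebreak == "left" else tied[-1])

scores = [0.7, 1.0, 0.3, 1.0]                 # positions 1 and 3 tie for the maximum
print(soft_attention(scores))                 # some weight on every position
print(unique_hard_attention(scores, "left"))  # -> 1 (leftmost maximum)
print(unique_hard_attention(scores, "right")) # -> 3 (rightmost maximum)
```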