If you have been following the explosion of theoretical research into Transformers, you know that understanding what these models can actually compute is just as important as watching their loss curves go down. We often idealize the Transformer architecture to study it mathematically. One common simplification is Unique Hard Attention (UHA).
In standard “soft” attention (like in GPT-4), the model attends to all previous tokens with varying weights. In UHA, the model attends to exactly one token—the one with the highest attention score.
It sounds simple enough. But what happens if two tokens have the exact same maximum score? You have to pick one. You need a tiebreaking rule. You might pick the leftmost one (the earliest occurrence) or the rightmost one (the most recent occurrence).
For a long time, researchers treated this choice as a minor implementation detail. A new paper, Unique Hard Attention: A Tale of Two Sides, reveals that this “trivial” choice is actually a fundamental pivot point in computational power.
In this deep dive, we will explore why Rightmost tiebreaking makes a Transformer significantly more powerful than Leftmost tiebreaking, and how these models connect to the world of formal logic and automata.
1. The Setup: Idealizing the Transformer
To prove mathematical bounds on expressivity, we need to strip the Transformer down to its logical core. The authors of this paper operate under a specific set of realistic constraints that together define the Finite-Precision UHA Transformer:
- Finite Precision: Real hardware uses fixed bits (16-bit, 32-bit). We don’t assume infinite precision floating points.
- No Positional Encodings: The model has to rely on the sequence itself and the masking to understand order.
- Strict Future Masking: A token at position \(t\) can only attend to positions \(t' < t\). It cannot see itself or the future.
- Unique Hard Attention (UHA): The attention mechanism selects a single past position that maximizes a scoring function.
How Information Flows
In this framework, a Transformer layer calculates a new representation for a token by “pulling” information from one specific past token.

As shown in Figure 1 of the paper, the representation \(x_t\) at layer \(\ell\) attends to a selected position \(t^*\) and adds that vector to itself via a residual connection. This means the Transformer is essentially “collecting” symbols from the past.
The mathematical definition of this attention mechanism is straightforward: a scoring function compares the query at position \(t\) with the keys at all strictly earlier positions, and a normalization function turns those scores into a selection. Here, the normalization is not softmax but hardmax, and this is where our story begins: hardmax must return a single index \(t^*\).
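Before we get to tiebreaking, it helps to see what a single UHA update looks like in code. Below is a minimal NumPy sketch under the assumptions above; the weight names and the single-head, no-MLP simplification are mine, not the paper's.

```python
import numpy as np

def uha_layer(X, W_q, W_k, W_v):
    """One unique-hard-attention layer with strict future masking.
    Position 0 has no strict past, so it is left unchanged."""
    n, d = X.shape
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    out = X.copy()
    for t in range(1, n):
        scores = K[:t] @ Q[t]            # scores over strictly earlier positions t' < t
        # np.argmax breaks ties toward the smallest index, i.e. Leftmost
        t_star = int(np.argmax(scores))
        out[t] = X[t] + V[t_star]        # residual connection: pull one past value
    return out

rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(6, d))
W_q = W_k = W_v = np.eye(d)   # toy weights; W_v must map back to dimension d
print(uha_layer(X, W_q, W_k, W_v).shape)  # (6, 4)
```

Note that `np.argmax` already commits to a tiebreaking policy: when several positions share the maximum score, it returns the first one.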
The Fork in the Road: Tiebreaking
When multiple positions \(t'\) yield the same maximum score, we must choose one. The paper defines two distinct operators:
- Leftmost (\(\blacktriangleleft\)): Selects the minimum index (earliest in time).
- Rightmost (\(\blacktriangleright\)): Selects the maximum index (most recent in time).

The central thesis of this work is that \(\blacktriangleleft\) (Leftmost) is strictly weaker than \(\blacktriangleright\) (Rightmost).
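The difference between the two operators is literally one line of code. A minimal sketch (the helper name `hardmax_index` is mine, not the paper's notation):

```python
import numpy as np

def hardmax_index(scores, tiebreak="left"):
    """Return the index of a maximal score, breaking ties to one side."""
    scores = np.asarray(scores)
    winners = np.flatnonzero(scores == scores.max())  # indices of all tied maxima
    return int(winners[0] if tiebreak == "left" else winners[-1])

scores = [0.7, 1.0, 0.2, 1.0, 0.5]
print(hardmax_index(scores, "left"))   # 1  (earliest maximum)
print(hardmax_index(scores, "right"))  # 3  (most recent maximum)
```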
2. Intuition: Why is “Left” Weaker?
To understand why the direction matters, we can look at a programming-language analogy called B-RASP: a restricted language over Boolean sequences that describes what these Transformers can compute.
Imagine you are processing a sentence and you want to find specific information from the past.
The Leftmost Operation
The Leftmost operation scans the past and snaps to the first token that matches your search criteria (has a score of 1).

If you have the sequence A ... A ... A and you search for A, the Leftmost mechanism returns the value of the very first A. It is “sticky.” Once it finds a match, it ignores all subsequent matches.
The Rightmost Operation
The Rightmost operation finds the last token that matches your criteria.

Why is this stronger? Strict future masking restricts attention to the strict past (\(t' < t\)), and with Rightmost tiebreaking that restriction lets us pinpoint the immediate predecessor (\(t-1\)).
The Logic:
- Assign a score of 1 to all past tokens.
- The “Rightmost” tiebreaker will naturally pick the largest index less than \(t\), which is exactly \(t-1\).
If we use Leftmost attention, we cannot easily target \(t-1\). If we assign a score of 1 to all past tokens, the attention snaps to position 0 (the beginning of the string) and gets stuck there. It effectively loses the ability to distinguish “how recent” something is.
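A tiny sketch of the predecessor trick just described: score every past position identically and let the tiebreaker do the work (illustrative code, not the paper's construction):

```python
import numpy as np

def attend_constant_scores(t, tiebreak):
    """All positions t' < t get the same score; the tiebreaker decides the winner."""
    scores = np.ones(t)                  # uniform scores over the strict past
    winners = np.flatnonzero(scores == scores.max())
    return int(winners[0] if tiebreak == "left" else winners[-1])

t = 7
print(attend_constant_scores(t, "right"))  # 6 == t-1: the immediate predecessor
print(attend_constant_scores(t, "left"))   # 0: stuck at the start of the string
```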
This seemingly small mechanical difference means that Rightmost UHA can perform logic equivalent to “Since” (e.g., “Has X happened since the last time Y happened?”), while Leftmost UHA is limited to simpler “Eventually” logic (e.g., “Has X ever happened?”).
3. The Logic Connection: Linear Temporal Logic (LTL)
To make these claims rigorous, the authors connect these Transformer models to Linear Temporal Logic (LTL). LTL is a way of writing formulas about sequences of events using time-based operators.

The key operators are:
- \(\Diamond\) (Eventually Past): “At some point in the past…”
- \(\mathbf{S}\) (Since): “Condition A has held true since Condition B happened.”
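Both operators are easy to evaluate over a finite trace, which makes the gap between them concrete. Here is a small sketch assuming strict-past semantics; the paper's exact conventions may differ:

```python
def eventually_past(trace, p, t):
    """Diamond (Eventually Past): p held at some strictly earlier position s < t."""
    return any(p(trace[s]) for s in range(t))

def since(trace, a, b, t):
    """a S b: b held at some earlier position s, and a has held at every
    position strictly between s and t."""
    return any(b(trace[s]) and all(a(trace[r]) for r in range(s + 1, t))
               for s in range(t))

trace = "xyxxax"
is_a = lambda c: c == "a"
is_x = lambda c: c == "x"
print(eventually_past(trace, is_a, 5))  # True: an 'a' appears before position 5
print(since(trace, is_x, is_a, 5))      # True: nothing but 'x' since the last 'a'
```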
The Hierarchy of Expressivity
The paper establishes a strict hierarchy, summarized in its Table 1. This is the most important takeaway for understanding the landscape of Transformer theory.

Let’s break down Table 1:
- Rightmost UHA (\(\mathcal{T}^F_{\blacktriangleright}\)) is equivalent to full First-Order Logic (FO[<]) and the Star-Free languages. It can handle the complex Since and Until operators.
- Leftmost UHA (\(\mathcal{T}^F_{\blacktriangleleft}\)) is equivalent only to LTL[\(\Diamond\)]. It lacks the Since operator and is strictly weaker.
The Surprising Link to Soft Attention
Look closely at the second row of Table 1. Future-masked soft attention is also equivalent to LTL[\(\Diamond\)].
This is a profound insight. In finite-precision regimes, standard Softmax Transformers (which average values) share the same expressivity class as Leftmost Hard Attention Transformers. Neither has the “counting” or “immediate predecessor” power of Rightmost Hard Attention.
This suggests that Leftmost UHA is actually a much better theoretical proxy for studying real-world Transformers (like GPT) than Rightmost UHA, even though Rightmost is more powerful.
4. Proving the Equivalence
How do we prove a neural network is equivalent to a logic formula? The authors use a constructive proof strategy involving B-RASP.
From Logic to Transformers
They show that for every logic formula, there is a B-RASP program (and thus a Transformer) that computes it.
If we want to compute a formula involving “Since” (which requires Rightmost attention), the construction uses the \(\mathbf{S}\) (Since) operator explicitly: Rightmost attention locates the most recent position where the second argument of \(\mathbf{S}\) held, and the program then checks that the first argument has held at every position after it.
However, for Leftmost attention, only the “Eventually” (\(\Diamond\)) operator is available, so the construction is simpler but less expressive. Equation 25 in the paper shows that Leftmost attention can verify whether a condition \(\psi_S\) occurred in the past (\(\Diamond \psi_S\)) and single out the earliest position where it held: a subformula of the form \(\psi_S \wedge \neg\Diamond \psi_S\) is true only at the first occurrence, which is exactly where Leftmost attention lands. It cannot express the dependency required for “Since.”
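The “earliest instance” trick is a general fact about the past-\(\Diamond\) operator: \(\psi \wedge \neg\Diamond\psi\) holds exactly at the first position where \(\psi\) holds. A quick check of that claim (my own illustration, not the paper's Equation 25):

```python
def eventually_past(trace, p, t):
    """Diamond (Eventually Past): p held at some strictly earlier position s < t."""
    return any(p(trace[s]) for s in range(t))

def first_occurrence(trace, p, t):
    """psi AND NOT diamond-past psi: true only at the earliest psi-position."""
    return p(trace[t]) and not eventually_past(trace, p, t)

trace = "xxaxax"
is_a = lambda c: c == "a"
print([t for t in range(len(trace)) if first_occurrence(trace, is_a, t)])  # [2]
```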
Automata Theory: Partially Ordered Finite Automata (POFA)
The paper also bridges these Transformers to Automata Theory.
- Rightmost UHA connects to Counter-Free Automata (the broad class of non-counting languages).
- Leftmost UHA connects to Partially Ordered Finite Automata (POFA).
A POFA is a specific type of machine where the flow of states is strictly directed. You can move from state A to state B, but you can never go back to A. There are no cycles (except self-loops).

Figure 2 in the paper illustrates a “half-reset,” which captures the “latching” behavior of Leftmost attention. Once the attention mechanism finds a token (transitions to \(q_1\)), it “sticks” there. It cannot reset or cycle based on complex patterns, which limits its ability to recognize languages like “Parity” (checking whether the number of 1s is even or odd).
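As code, a POFA-recognizable property looks like a one-way latch. Here is a hypothetical sketch of an “eventually saw an a” recognizer in that style (my own example, not a transcription of Figure 2):

```python
def eventually_a(word):
    """POFA-style recognizer: states are ordered q0 < q1 and the machine never
    moves backwards -- once in q1 it self-loops forever (it 'latches')."""
    state = "q0"
    for c in word:
        if state == "q0" and c == "a":
            state = "q1"      # the only transition that is not a self-loop
        # every other case is a self-loop; there is no way back to q0
    return state == "q1"

print(eventually_a("xxaxx"))  # True
print(eventually_a("xxxxx"))  # False
```

Parity, by contrast, needs two states that swap on every 1, and that back-and-forth cycle is exactly what a POFA forbids.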
5. Direct Translation: Transformers as Formulas
One of the paper’s contributions is a direct “normal form” translation. They show that any Leftmost UHA Transformer can be converted directly into a giant LTL[\(\Diamond\)] formula.
They start by defining the possible states of the Transformer. Since precision is finite, the number of possible context vectors is finite (though large).
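Concretely, if each of the \(d\) coordinates of a context vector is stored in \(p\) bits, each coordinate can take at most \(2^p\) values, which gives a crude bound on the number of distinct representations (a back-of-the-envelope estimate, not a quantity from the paper):

\[ |\mathcal{V}| \;\le\; \bigl(2^{p}\bigr)^{d} \;=\; 2^{pd}, \]

finite, though astronomically large.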

Because the states are finite, we can enumerate every possible “accepted” state of the Transformer. The final logical formula that represents the entire Neural Network is essentially a massive “OR” statement: “The input is valid IF the final state is State A OR State B OR State C…”
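Schematically, the resulting normal form is a finite disjunction over accepting final states, where \(\varphi_q\) is the LTL[\(\Diamond\)] formula asserting “the run ends in state \(q\)” and \(F\) is the finite set of accepting states (my paraphrase of the construction's shape, not the paper's exact formula):

\[ \varphi_{\mathcal{T}} \;\equiv\; \bigvee_{q \in F} \varphi_{q}. \]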

This confirms that the neural network cannot “magically” compute something outside the bounds of LTL[\(\Diamond\)].
6. Why This Matters for AI
This might feel like abstract math, but it explains empirical behaviors we see in Large Language Models (LLMs).
1. The “Flip-Flop” Problem: Recent empirical studies (Liu et al., 2023) showed that Transformers struggle with “flip-flop” languages—tasks where the model must remember the most recent instruction (e.g., “Write A… (later) Write B…”). This paper explains why. Standard Transformers (which are theoretically close to Leftmost UHA) lack the “Rightmost” capability to easily latch onto the most recent relevant token using hard attention mechanisms. They naturally drift toward aggregating information rather than pinpointing the latest change.
2. Positional Encodings (ALiBi, RoPE): The standard Transformer architecture studied here has no positional encodings. The fact that Rightmost UHA is stronger explains why modern positional encodings (like ALiBi) are so effective. They effectively bias the attention scores based on distance, allowing the model to simulate “Rightmost” behavior (attending to recent tokens) much more easily. They bridge the gap between the weak “Leftmost” theory and the strong “Rightmost” needs.
3. Understanding Limitations: We now know that without specific enhancements, these models cannot solve problems requiring:
- Modulo counting (Parity), illustrated in the sketch after this list.
- Nested parentheses of arbitrary depth (Dyck languages).
- Strictly Local dependency chains (finding the immediate neighbor \(t-1\) without positional help).
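For contrast with the latch automaton earlier, here is why Parity is out of reach: its minimal automaton cycles between two states, which is precisely the backward transition that partially ordered automata (and hence Leftmost UHA) cannot simulate. A hypothetical two-state sketch:

```python
def parity_of_ones(word):
    """Classic two-state DFA: 'even' and 'odd' swap on every 1.
    The even <-> odd cycle violates the POFA 'no going back' restriction."""
    state = "even"
    for c in word:
        if c == "1":
            state = "odd" if state == "even" else "even"
    return state

print(parity_of_ones("10110"))  # odd  (three 1s)
print(parity_of_ones("1001"))   # even (two 1s)
```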
Conclusion
The paper Unique Hard Attention: A Tale of Two Sides teaches us that in the theoretical analysis of neural networks, the devil is in the details.
A simple implementation choice—min vs max index on a tie—changes the model from one that is equivalent to full Star-Free logic to one that is strictly weaker, equivalent only to simple “Eventually” logic.
By mapping Leftmost UHA to Soft Attention, the authors provide a rigorous baseline for understanding what standard Transformers can and cannot do. They are powerful aggregators of past information, but without explicit positional biases, they struggle to reason about the immediate sequence of events.
So, the next time you look at an attention map, ask yourself: is it looking Left, or is it looking Right? The answer defines the limits of its logic.