If you have been following the explosion of theoretical research into Transformers, you know that understanding what these models can actually compute is just as important as watching their loss curves go down. We often idealize the Transformer architecture to study it mathematically. One common simplification is Unique Hard Attention (UHA).

In standard “soft” attention (like in GPT-4), the model attends to all previous tokens with varying weights. In UHA, the model attends to exactly one token—the one with the highest attention score.

It sounds simple enough. But what happens if two tokens have the exact same maximum score? You have to pick one. You need a tiebreaking rule. You might pick the leftmost one (the earliest occurrence) or the rightmost one (the most recent occurrence).

For a long time, researchers treated this choice as a minor implementation detail. A new paper, Unique Hard Attention: A Tale of Two Sides, reveals that this “trivial” choice is actually a fundamental pivot point in computational power.

In this deep dive, we will explore why Rightmost tiebreaking makes a Transformer significantly more powerful than Leftmost tiebreaking, and how these models connect to the world of formal logic and automata.

1. The Setup: Idealizing the Transformer

To prove mathematical bounds on expressivity, we need to strip the Transformer down to its logical core. The authors of this paper work with a specific idealization, the Finite-Precision UHA Transformer, defined by the following constraints:

  1. Finite Precision: Real hardware uses a fixed number of bits (e.g., 16 or 32). We do not assume infinite-precision floating-point numbers.
  2. No Positional Encodings: The model has to rely on the sequence itself and the masking to understand order.
  3. Strict Future Masking: A token at position \(t\) can only attend to positions \(t' < t\). It cannot see itself or the future.
  4. Unique Hard Attention (UHA): The attention mechanism selects a single past position that maximizes a scoring function.

How Information Flows

In this framework, a Transformer layer calculates a new representation for a token by “pulling” information from one specific past token.

Figure 1: Unique hard attention. The model selects a single past vector to combine with the current vector.

As shown in Figure 1 above, the representation \(x_t\) at layer \(\ell\) attends to a selected position \(t^*\) and adds that vector to itself via a residual connection. This means the Transformer is essentially “collecting” symbols from the past.
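
Here is a minimal NumPy sketch of a single UHA update under these assumptions. The scoring function, the shapes of the \(W\) matrices, and the tiebreak argument are illustrative choices, not the paper's exact parameterization.

```python
import numpy as np

def uha_layer_step(X, t, W_q, W_k, W_v, tiebreak="leftmost"):
    """One unique-hard-attention update for position t under strict future masking.

    X is a (seq_len, d) array of current-layer representations; W_q, W_k, W_v
    play the role of query/key/value maps (illustrative, not the paper's exact form).
    """
    if t == 0:
        return X[t]                       # nothing strictly in the past to attend to

    q = W_q @ X[t]                        # query for the current position
    scores = X[:t] @ W_k.T @ q            # one score per strictly-past position 0..t-1

    tied = np.flatnonzero(scores == scores.max())
    t_star = tied[0] if tiebreak == "leftmost" else tied[-1]

    return X[t] + W_v @ X[t_star]         # residual: "pull" one past vector and add it
```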

The mathematical definition of this attention mechanism is straightforward:

Equation 35: The mathematical definition of attention with normalization.

Here, norm is the hardmax function. This is where our story begins. The hardmax function must return a single index \(t^*\).

The Fork in the Road: Tiebreaking

When multiple positions \(t'\) yield the same maximum score, we must choose one. The paper defines two distinct operators:

  • Leftmost (\(\blacktriangleleft\)): Selects the minimum index (earliest in time).
  • Rightmost (\(\blacktriangleright\)): Selects the maximum index (most recent in time).

Equation 12: Defining the tiebreaking mechanism. Min for Leftmost, Max for Rightmost.
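
As a quick illustration, consider a hypothetical score vector with a tie at the maximum. NumPy's argmax already behaves like the Leftmost operator (it returns the first maximal index); the Rightmost operator can be recovered by scanning the reversed array.

```python
import numpy as np

scores = np.array([0.7, 0.9, 0.2, 0.9])      # positions 1 and 3 tie at the maximum

leftmost = int(np.argmax(scores))                           # -> 1 (earliest maximum)
rightmost = len(scores) - 1 - int(np.argmax(scores[::-1]))  # -> 3 (latest maximum)

print(leftmost, rightmost)                   # 1 3
```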

The central thesis of this work is that \(\blacktriangleleft\) (Leftmost) is strictly weaker than \(\blacktriangleright\) (Rightmost).

2. Intuition: Why is “Left” Weaker?

To understand why the direction matters, we can look at a programming-language analogy called B-RASP, a Boolean variant of the RASP language used to describe, one attention operation at a time, what hard-attention Transformers compute.

Imagine you are processing a sentence and you want to find specific information from the past.

The Leftmost Operation

The Leftmost operation scans the past and snaps to the first token that matches your search criteria (has a score of 1).

Equation 1: The Leftmost operation in B-RASP.

If you have the sequence A ... A ... A and you search for A, the Leftmost mechanism returns the value of the very first A. It is “sticky.” Once it finds a match, it ignores all subsequent matches.

The Rightmost Operation

The Rightmost operation finds the last token that matches your criteria.

Equation 5: The Rightmost operation.
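
The toy sketch below mimics the flavor of these two operations (it is not actual B-RASP syntax): for a given position, it searches the strict past for tokens satisfying a predicate and returns the leftmost or rightmost match.

```python
def retrieve(tokens, t, predicate, side="leftmost", default=None):
    """Return (token, index) of the left- or rightmost strictly-past position
    whose token satisfies `predicate` -- a toy stand-in for a B-RASP attention op."""
    matches = [j for j in range(t) if predicate(tokens[j])]
    if not matches:
        return default
    j = matches[0] if side == "leftmost" else matches[-1]
    return tokens[j], j

tokens = list("AxAyAz")
is_a = lambda c: c == "A"           # our search criterion

print(retrieve(tokens, 5, is_a, "leftmost"))   # ('A', 0): sticks to the very first A
print(retrieve(tokens, 5, is_a, "rightmost"))  # ('A', 4): finds the most recent A
```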

Why is this stronger? Strict future masking restricts attention to the strict past (\(t' < t\)). Combined with Rightmost tiebreaking, that restriction becomes an asset: the model can easily inspect the immediate predecessor (\(t-1\)).

The Logic:

  1. Assign a score of 1 to all past tokens.
  2. The “Rightmost” tiebreaker will naturally pick the largest index less than \(t\), which is exactly \(t-1\).

If we use Leftmost attention, we cannot easily target \(t-1\). If we assign a score of 1 to all past tokens, the attention snaps to position 0 (the beginning of the string) and gets stuck there. It effectively loses the ability to distinguish “how recent” something is.
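
Spelled out as a small worked example (with positions numbered from 0): if every strictly-past position receives the same maximal score, the set of maximizers is the entire past, and the tiebreaker alone decides where attention lands.

\[
\operatorname*{arg\,max}_{t' < t} \; 1 \;=\; \{0, 1, \dots, t-1\}
\qquad\Longrightarrow\qquad
t^*_{\blacktriangleright} = t-1,
\qquad
t^*_{\blacktriangleleft} = 0 .
\]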

This seemingly small mechanical difference means that Rightmost UHA can perform logic equivalent to “Since” (e.g., “Has X happened since the last time Y happened?”), while Leftmost UHA is limited to simpler “Eventually” logic (e.g., “Has X ever happened?”).
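
To make that contrast concrete before the formal definitions in the next section, here is a small Python sketch of the two temporal queries over Boolean event sequences. The strict-past convention used here is an assumption for illustration; the paper's exact semantics may differ in such details.

```python
def eventually_past(X, t):
    """LTL "Eventually (past)": did X hold at some strictly earlier position?"""
    return any(X[j] for j in range(t))

def since(A, B, t):
    """LTL "Since" (one common strict-past reading): B held at some earlier
    position j, and A has held at every position strictly between j and t."""
    return any(B[j] and all(A[k] for k in range(j + 1, t)) for j in range(t))

# Events at positions 0..4: B fires at position 1, A holds from position 2 onward.
A = [False, False, True, True, True]
B = [False, True, False, False, False]

print(eventually_past(A, 5))  # True: A happened at some point in the past
print(since(A, B, 5))         # True: A has held ever since B fired at position 1
```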

3. The Logic Connection: Linear Temporal Logic (LTL)

To make these claims rigorous, the authors connect these Transformer models to Linear Temporal Logic (LTL). LTL is a way of writing formulas about sequences of events using time-based operators.

Equation 7: The semantics of Linear Temporal Logic (LTL).

The key operators are:

  • \(\Diamond\) (Eventually Past): “At some point in the past…”
  • \(\mathbf{S}\) (Since): “Condition A has held true since Condition B happened.”

The Hierarchy of Expressivity

The paper establishes a strict hierarchy summarized in the table below. This is the most important takeaway for understanding the landscape of Transformer theory.

Table 1: Known equivalences of finite-precision transformers. Note the difference between Leftmost (Top/Bottom Left) and Rightmost (Top/Bottom Right).

Let’s break down Table 1:

  1. Rightmost UHA (\(\mathcal{T}^F_{\blacktriangleright}\)): Is equivalent to full First-Order Logic (FO[<]) and Star-Free languages. It can handle the complex Since and Until operators.
  2. Leftmost UHA (\(\mathcal{T}^F_{\blacktriangleleft}\)): Is equivalent only to LTL[\(\Diamond\)]. It lacks the Since operator. It is strictly weaker.

Look closely at the second row of Table 1. Future-masked soft attention is also equivalent to LTL[\(\Diamond\)].

This is a profound insight. In finite-precision regimes, standard Softmax Transformers (which average values) share the same expressivity class as Leftmost Hard Attention Transformers. Neither has the “Since” or “immediate predecessor” power of Rightmost Hard Attention.

This suggests that Leftmost UHA is actually a much better theoretical proxy for studying real-world Transformers (like GPT) than Rightmost UHA, even though Rightmost is more powerful.

4. Proving the Equivalence

How do we prove a neural network is equivalent to a logic formula? The authors use a constructive proof strategy involving B-RASP.

From Logic to Transformers

They show that for every logic formula, there is a B-RASP program (and thus a Transformer) that computes it.

If we want to compute a formula involving “Since” (which requires Rightmost attention), the formula looks like this:

Equation 27: The complex logic formula required for Rightmost attention operations.

This formula uses the S (Since) operator explicitly.

However, for Leftmost attention, we can only support the “Eventually” (\(\Diamond\)) operator. The formula construction is simpler but less expressive:

Equation 25: The logic formula for Leftmost attention, using only the ‘Eventually’ operator.

Equation 25 shows that Leftmost attention can check whether a condition \(\psi_S\) occurred in the past (\(\Diamond \psi_S\)) and single out the first time it happened (the \(\neg \Diamond \psi_S\) conjunct rules out any earlier occurrence). What it cannot express is the ordering dependency that “Since” requires.
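
To see why the negated conjunct singles out the earliest occurrence, note that under a strict-past reading of \(\Diamond\) (an assumption here; the paper's Equation 25 may be arranged differently) the following holds at exactly one position, namely the first one where \(\psi_S\) is true:

\[
\psi_S \wedge \neg \Diamond \psi_S \;\text{ holds at position } t
\quad\iff\quad
\psi_S \text{ holds at } t \text{ and at no position } t' < t .
\]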

Automata Theory: Partially Ordered Finite Automata (POFA)

The paper also bridges these Transformers to Automata Theory.

  • Rightmost UHA connects to Counter-Free Automata (the broad class of non-counting languages).
  • Leftmost UHA connects to Partially Ordered Finite Automata (POFA).

A POFA is a machine whose states are partially ordered, so the flow of states is strictly directed: you can move from state A to state B, but you can never go back to A. There are no cycles (except self-loops).

Figure 2: A 1-way fork (Half-reset). The automaton can transition forward or stay put, but never cycle back.

Figure 2 illustrates a “Half-reset.” This represents the “latching” behavior of Leftmost attention. Once the attention mechanism finds a token (transitions to \(q_1\)), it “sticks” there. It cannot easily reset or cycle based on complex patterns, limiting its ability to recognize languages like “Parity” (checking if the number of 1s is even or odd).
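
As a toy illustration of this latching behavior, here is a hypothetical two-state POFA that recognizes “the string contains at least one a”: once it leaves \(q_0\) for \(q_1\) it can never return, so it can remember that something happened, but not how many times it happened (which is exactly what Parity would require).

```python
# A toy partially ordered automaton: states are ordered q0 < q1, and transitions
# never move "backwards" in that order -- only forward steps or self-loops.
TRANSITIONS = {
    ("q0", "a"): "q1",  # forward step: latch onto "an 'a' has occurred"
    ("q0", "b"): "q0",  # self-loop
    ("q1", "a"): "q1",  # self-loop: there is no way back to q0
    ("q1", "b"): "q1",  # self-loop
}

def accepts(word, start="q0", accepting=("q1",)):
    state = start
    for symbol in word:
        state = TRANSITIONS[(state, symbol)]
    return state in accepting

print(accepts("bbab"))  # True: an 'a' occurred somewhere in the string
print(accepts("bbbb"))  # False: no 'a' ever occurred
```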

5. Direct Translation: Transformers as Formulas

One of the paper’s contributions is a direct “normal form” translation. They show that any Leftmost UHA Transformer can be converted directly into a giant LTL[\(\Diamond\)] formula.

They start by defining the possible states of the Transformer. Since precision is finite, the number of possible context vectors is finite (though large).

Equation 43: Bounding the number of unique contextual representations.

Because the states are finite, we can enumerate every possible accepting final state of the Transformer. The final logical formula that represents the entire neural network is essentially a massive “OR” statement: “The input is valid IF the final state is State A OR State B OR State C…”

Equation 53: The final formula representing the Transformer is a disjunction of all accepting final states.
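
In schematic form (this sketches the shape of the construction, not the paper's exact Equation 53), the formula enumerates the finitely many accepting final representations \(v\) and asserts that the Transformer ends in one of them:

\[
\varphi_{\text{accept}} \;=\; \bigvee_{v \in \text{accepting representations}} \varphi_v ,
\]

where each \(\varphi_v\) is itself an LTL[\(\Diamond\)] formula stating that the final representation equals \(v\).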

This confirms that the neural network cannot “magically” compute something outside the bounds of LTL[\(\Diamond\)].

6. Why This Matters for AI

This might feel like abstract math, but it explains empirical behaviors we see in Large Language Models (LLMs).

1. The “Flip-Flop” Problem: Recent empirical studies (Liu et al., 2023) showed that Transformers struggle with “flip-flop” languages—tasks where the model must remember the most recent instruction (e.g., “Write A… (later) Write B…”). This paper explains why. Standard Transformers (which are theoretically close to Leftmost UHA) lack the “Rightmost” capability to easily latch onto the most recent relevant token using hard attention mechanisms. They naturally drift toward aggregating information rather than pinpointing the latest change.
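
A toy rendering of the flip-flop task (not Liu et al.'s exact setup): the answer to a read is the most recently written value, which is precisely a “Rightmost” retrieval over the instruction history.

```python
def flip_flop_answer(instructions):
    """instructions: a list of ("write", value) and ("read",) operations.
    The correct answer to the final read is the most recently written value,
    i.e. a Rightmost retrieval over the past; a Leftmost retrieval would
    return the *first* write instead."""
    writes = [op[1] for op in instructions if op[0] == "write"]
    return writes[-1] if writes else None    # rightmost match wins
    # writes[0] would be the "leftmost" (and wrong) answer

history = [("write", "A"), ("read",), ("write", "B"), ("read",)]
print(flip_flop_answer(history))             # 'B' -- the most recent instruction
```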

2. Positional Encodings (ALiBi, RoPE): The standard Transformer architecture studied here has no positional encodings. The fact that Rightmost UHA is stronger explains why modern positional encodings (like ALiBi) are so effective. They effectively bias the attention scores based on distance, allowing the model to simulate “Rightmost” behavior (attending to recent tokens) much more easily. They bridge the gap between the weak “Leftmost” theory and the strong “Rightmost” needs.
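
A minimal sketch of that intuition (an illustration, not ALiBi's actual implementation): subtracting a small penalty proportional to distance means that, among positions with equal content scores, the most recent one now wins the argmax, so even a plain “leftmost-style” argmax behaves like Rightmost attention.

```python
import numpy as np

content_scores = np.array([0.9, 0.3, 0.9, 0.9])   # several past positions tie
t = 4                                             # current (query) position
distance = t - np.arange(len(content_scores))     # distances 4, 3, 2, 1
biased = content_scores - 0.01 * distance         # small linear distance penalty

print(int(np.argmax(content_scores)))  # 0 -> plain argmax "snaps left"
print(int(np.argmax(biased)))          # 3 -> the most recent tied position wins
```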

3. Understanding Limitations: We now know that without specific enhancements, these models cannot solve problems requiring:

  • Modulo counting (Parity).
  • Nested parentheses of arbitrary depth (Dyck languages).
  • Strictly Local dependency chains (finding the immediate neighbor \(t-1\) without positional help).

Conclusion

The paper Unique Hard Attention: A Tale of Two Sides teaches us that in the theoretical analysis of neural networks, the devil is in the details.

A simple implementation choice—min vs max index on a tie—changes the model from one that is equivalent to full Star-Free logic to one that is strictly weaker, equivalent only to simple “Eventually” logic.

By showing that Leftmost UHA matches the expressive power of finite-precision Soft Attention, the authors provide a rigorous baseline for understanding what standard Transformers can and cannot do. They are powerful aggregators of past information, but without explicit positional biases, they struggle to reason about the immediate sequence of events.

So, the next time you look at an attention map, ask yourself: is it looking Left, or is it looking Right? The answer defines the limits of its logic.