If you have ever played with Large Language Models (LLMs) or standard sequence-to-sequence (seq2seq) models like T5, you know they are incredibly powerful. They can translate languages, summarize texts, and even write code. However, despite their versatility, these models often have a hidden weakness: they lack a strong structural inductive bias.

Put simply, standard Transformers are statistical powerhouses that learn correlations between words, but they don’t inherently “understand” the hierarchical tree structures of language (syntax) the way a linguist—or arguably a human brain—does. This leads to problems when the model faces data that requires strict structural logic, such as converting complex active sentences to passive voice or performing semantic parsing on sentence structures deeper than what it saw during training.

In a fascinating paper titled Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations, researchers Matthias Lindemann, Alexander Koller, and Ivan Titov propose a novel solution called STEP (Syntactic Transformation Enhanced Pre-training). Instead of building a new, complex architecture, they teach a standard Transformer to perform “syntactic gymnastics” before asking it to do real work.

In this post, we will dive deep into how STEP works, why syntactic transformations are the key to better generalization, and what happens inside the model when it learns these structures.

The Problem: Structural Generalization

Standard seq2seq models excel at in-distribution data—data that looks similar to what they were trained on. However, they struggle with structural generalization.

Structural generalization refers to the ability to handle:

  1. Unseen combinations of known phrases.
  2. Longer inputs than seen during training.
  3. Deeper recursion (e.g., “The cat that the dog that the mouse chased…”).

Previous attempts to fix this involved pre-training on massive amounts of text or designing specialized, rigid architectures. While massive pre-training helps, it doesn’t solve the core issue: the model doesn’t necessarily know how to use syntactic information to manipulate a sentence. It might know that “The cat” is a noun phrase, but it struggles to systematically move that noun phrase to a different position in a logical form if the sentence structure gets too complex.

The researchers hypothesized that the lack of structural bias is due to the model’s limited experience in performing transformations based on syntax.

The Solution: STEP

The core idea of STEP is intermediate pre-training. Before fine-tuning a model (like T5) on a specific downstream task (like semantic parsing), the researchers force it to practice transforming sentences based on their syntax trees.

Crucially, the model is not given the syntax tree. It is only given the input sentence and a description of how the tree should be transformed. This forces the model to:

  1. Internally infer the syntax of the sentence.
  2. Learn reusable “dynamics” of how to move parts of the tree around.

Figure 1: Left: Intermediate pre-training of a Transformer to perform syntactic transformations specified in the prefix; the syntax tree forms the basis of the transformation but is not given to the model. Right: Fine-tuning the Transformer and the prefix on a downstream task. Tunable parameters are represented in orange.

As shown in Figure 1 above, the process has two stages:

  1. Pre-training (Left): The model receives a prefix describing a transformation (e.g., “Subject becomes bracketed, Object is reversed”) and the sentence “Mary saw a cat.” It must predict the transformed string.
  2. Fine-tuning (Right): For a real task (like Active-to-Passive conversion), the rigid description is replaced by tunable embeddings (soft prompts). These allow the model to “activate” the transformation skills it learned during pre-training.
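To make the pre-training stage concrete, here is a rough sketch of what a single training instance might look like. The representation is illustrative: in the paper the transformation is encoded as prefix vectors (see "The Training Setup" below), not as a literal string, and the operation names here are just for readability.

```python
# A hypothetical STEP pre-training instance (illustrative format, not the paper's).
pretraining_instance = {
    # "bracket the subject, reverse the object, leave determiners alone"
    "transformation": {"nsubj": "BRACKET", "obj": "REV", "det": "CONCAT"},
    "sentence": "Mary saw a cat",
    # the transformed string the model must predict, without ever seeing the tree
    "target": "[ Mary ] a cat saw",
}
```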

How to Generate Syntactic Transformations

To pre-train the model, the researchers needed a massive dataset of “syntactic puzzles.” They couldn’t rely on human annotation for millions of sentences, so they automated the process using dependency trees from the C4 corpus.

1. The Dependency Tree

They start with a standard dependency tree (identifying subjects, objects, determiners, etc.). However, dependency trees are not binary (a head can have any number of children), which makes applying systematic, recursive operations difficult.

2. Unfolding

To solve this, they “unfold” the dependency tree into a binary structure.

Figure 3: Unfolding a head h and its children.

As seen in Figure 3, a head node (\(h\)) and its children (\(c_1\) through \(c_4\)) are reorganized into a binary tree. The labels of the dependency edges (like nsubj or obj) become the labels of the internal nodes in this new tree.
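Here is a minimal sketch of what unfolding could look like in code, assuming a toy dependency-tree representation. The data structures and the left/right attachment conventions are my own simplifications, not the paper's implementation:

```python
# Sketch of the "unfolding" step: each dependency edge becomes one internal
# node of a binary tree, labelled with that edge's relation.

class DepNode:
    def __init__(self, word, index, children=None):
        self.word = word
        self.index = index              # position in the sentence, used to decide left/right
        self.children = children or []  # list of (dependency_label, DepNode) pairs

def unfold(node):
    """Recursively unfold a head and its dependents into a binary tree.
    Internal nodes are (label, left, right) tuples; leaves are plain words."""
    tree = node.word
    for label, child in node.children:   # attachment order decides the nesting
        if child.index < node.index:     # left dependent: keep surface order
            tree = (label, unfold(child), tree)
        else:                            # right dependent
            tree = (label, tree, unfold(child))
    return tree

# "Mary saw a cat": 'saw' heads 'Mary' (nsubj) and 'cat' (obj); 'cat' heads 'a' (det).
# Attaching 'obj' before 'nsubj' puts the subject node at the top, as in Figure 2.
root = DepNode("saw", 1, [
    ("obj", DepNode("cat", 3, [("det", DepNode("a", 2))])),
    ("nsubj", DepNode("Mary", 0)),
])
print(unfold(root))   # ('nsubj', 'Mary', ('obj', 'saw', ('det', 'a', 'cat')))
```

Each dependency edge contributes exactly one internal node, so the binary tree carries the same information as the original dependency tree.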

3. Annotating and Evaluating

Once the tree is binary, the researchers assign a specific operation to each dependency relation. For example, they might decide that every NSUBJ (nominal subject) relation should be wrapped in brackets, while every OBJ (object) relation should have its word order reversed.

Figure 2: Our procedure for applying a syntactic transformation specified as edgewise transformations (grey box): (1) recursively unfolding a dependency tree into a binary tree in which dependency labels serve as the labels of internal nodes, (2) annotating dependency relations with edgewise transformations, (3) recursively evaluating the edgewise transformations, with partial results shown.

Figure 2 illustrates this entire pipeline:

  1. Unfold: The sentence “Mary saw a cat” is unfolded into a binary tree.
  2. Annotate: Rules are applied. NSUBJ gets the BRACKET operation; OBJ gets the REV (reverse) operation.
  3. Evaluate: The tree is collapsed bottom-up. “a cat” (under DET) is concatenated. “saw a cat” (under OBJ) is reversed to “a cat saw”. Finally, the subject “Mary” is bracketed.

The final output is a transformed string that looks nothing like natural language but requires deep syntactic understanding to generate.
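Continuing the sketch above, the annotate-and-evaluate steps reduce to a small recursive function: look up the operation assigned to each internal node's dependency label and apply it to the already-evaluated substrings. The operation definitions below are simplified guesses meant only to show the mechanics, not the paper's exact semantics:

```python
# Edgewise operations: each takes the already-evaluated left/right strings.
OPS = {
    "CONCAT":  lambda left, right: f"{left} {right}",       # plain concatenation
    "REV":     lambda left, right: f"{right} {left}",       # swap the two sides
    "BRACKET": lambda left, right: f"[ {left} ] {right}",   # bracket the left-hand side
}

def evaluate(tree, annotation):
    """Bottom-up evaluation: leaves are words; each internal node applies the
    operation that the annotation assigns to its dependency label."""
    if isinstance(tree, str):
        return tree
    label, left, right = tree
    return OPS[annotation[label]](evaluate(left, annotation), evaluate(right, annotation))

# Unfolded binary tree for "Mary saw a cat" (as in the previous sketch)
tree = ("nsubj", "Mary", ("obj", "saw", ("det", "a", "cat")))
annotation = {"nsubj": "BRACKET", "obj": "REV", "det": "CONCAT"}
print(evaluate(tree, annotation))   # -> "[ Mary ] a cat saw"
```

Running this reproduces the Figure 2 walkthrough: the determiner is concatenated, the object subtree is reversed, and the subject is bracketed.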

The Operations Inventory

The researchers defined 14 different operations to cover a wide range of potential movements (see Table A.2 for the full inventory).

Table A.2: Full list of operations we use. We show an example transformation for the sentence "Mary saw a cat", where HEAD = "Mary saw" and DEP = "a cat". HEAD.LEMMA (DEP.LEMMA) refers to the lemma of the head (dependent).

By mixing and matching these operations (e.g., CONCAT, REV, BRACKET, TRIPLE), they created approximately 4.2 million unique training instances.
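The scale comes from combinatorics rather than annotation effort: every sentence can be paired with many different operation assignments, one choice per distinct dependency relation it contains. A quick back-of-envelope (my numbers, not the paper's) shows how fast this space grows:

```python
# With 14 operations to choose from and one choice per distinct dependency
# relation, the number of possible transformations per sentence grows
# exponentially (back-of-envelope illustration, not the paper's accounting).
num_operations = 14
for num_relations in (2, 3, 5):
    print(f"{num_relations} relations -> {num_operations ** num_relations:,} possible assignments")
# 2 relations -> 196, 3 relations -> 2,744, 5 relations -> 537,824
```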

The Training Setup

Pre-training Input

The input to the Transformer is a sequence of vectors. It starts with the Transformation Encoding (the instructions) followed by the Sentence Embeddings.

Schematically, if \(\mathbf{h}_1, \ldots, \mathbf{h}_k\) are the transformation encodings (one per dependency label) and \(\mathbf{e}(w_1), \ldots, \mathbf{e}(w_n)\) are the embeddings of the sentence tokens, the input sequence is

\[ \mathbf{h}_1, \ldots, \mathbf{h}_k, \; \mathbf{e}(w_1), \ldots, \mathbf{e}(w_n) \]

The encoding of the transformation is straightforward. For each dependency label (like NSUBJ), they add the embedding of the label to the embedding of the operation (like BRACKET).

\[ \mathbf{h}_i = \mathrm{emb}(\ell_i) + \mathrm{emb}(o_i) \]

where \(\ell_i\) is the \(i\)-th dependency label (e.g., NSUBJ) and \(o_i\) is the operation assigned to it (e.g., BRACKET).

This tells the model: “For this sentence, if you see a Subject, bracket it. If you see an Object, reverse it.”
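In PyTorch-flavored pseudocode, the construction might look roughly like this; the embedding tables, dimensions, and token ids are stand-ins for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

# Illustrative sketch of the pre-training input construction (not the paper's code).
d_model = 512
LABELS = ["nsubj", "obj", "det"]           # dependency labels that can appear in the prefix
OPERATIONS = ["CONCAT", "REV", "BRACKET"]  # a few of the 14 edgewise operations

label_emb = nn.Embedding(len(LABELS), d_model)
op_emb = nn.Embedding(len(OPERATIONS), d_model)
word_emb = nn.Embedding(30000, d_model)    # stand-in for the model's token embeddings

def transformation_prefix(assignment):
    """One prefix vector per dependency label: embedding(label) + embedding(operation)."""
    rows = []
    for label, op in assignment.items():
        rows.append(label_emb(torch.tensor(LABELS.index(label)))
                    + op_emb(torch.tensor(OPERATIONS.index(op))))
    return torch.stack(rows)                          # (num_labels, d_model)

assignment = {"nsubj": "BRACKET", "obj": "REV", "det": "CONCAT"}
token_ids = torch.tensor([101, 2387, 1037, 4937])     # toy ids for "Mary saw a cat"
prefix = transformation_prefix(assignment)            # (3, d_model)
inputs = torch.cat([prefix, word_emb(token_ids)], dim=0)   # prefix first, then the sentence
```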

Fine-tuning Input

When moving to a downstream task like translating active voice to passive voice, we don’t have explicit labels like NSUBJ -> BRACKET. We just want the model to figure it out.

To bridge this gap, the researchers replace the explicit transformation encoding with tunable embeddings (essentially a learned prefix or “soft prompt”).

The input then becomes

\[ \mathbf{h}'_1, \ldots, \mathbf{h}'_k, \; \mathbf{e}(w_1), \ldots, \mathbf{e}(w_n) \]

The vectors \(\mathbf{h}'\) are parameters that are learned during fine-tuning. They act as switches, potentially “activating” the specific attention heads and transformation logic the model learned during the pre-training phase.
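Here is a minimal sketch of that fine-tuning setup with a HuggingFace T5 model, prepending learned prefix vectors via `inputs_embeds`; the prefix length, initialization, and training-loop details are my assumptions:

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Sketch only: replace the explicit transformation encoding with a tunable
# prefix (soft prompt) that is learned together with the model during fine-tuning.
model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")

prefix_len = 20                              # assumed prefix length
d_model = model.config.d_model
soft_prompt = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)   # the tunable h' vectors

def forward_with_prefix(input_text, target_text):
    enc = tokenizer(input_text, return_tensors="pt")
    labels = tokenizer(target_text, return_tensors="pt").input_ids
    token_embeds = model.get_input_embeddings()(enc.input_ids)        # (1, seq_len, d_model)
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), token_embeds], dim=1)
    attention_mask = torch.cat(
        [torch.ones(1, prefix_len, dtype=enc.attention_mask.dtype), enc.attention_mask], dim=1
    )
    return model(inputs_embeds=inputs_embeds, attention_mask=attention_mask, labels=labels)

loss = forward_with_prefix("the boy hit the ball", "the ball was hit by the boy").loss
loss.backward()   # gradients reach both soft_prompt and the model's own weights
```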

Does it Work? Experiments and Results

The researchers compared STEP against a standard T5 model and a T5 model pre-trained to simply output dependency parses (T5+Dep Parse).

1. Syntactic Tasks (Few-Shot)

They tested the models on tasks requiring structural manipulation with very little training data (only 100 examples).

  • Passivization: “The boy hit the ball” \(\to\) “The ball was hit by the boy.”
  • Adjective Emphasis: “The French analysis” \(\to\) “The analysis that is French.”

Table 2: Evaluation on 100-shot syntactic transformation tasks. We report averages of 10 draws of 100 training examples each.

As shown in Table 2, STEP significantly outperforms the baseline T5 and the T5+Dep Parse models. For passivization, accuracy jumps from 40.2% (T5) to 57.9% (STEP). This suggests that practicing synthetic tree transformations helps the model learn realistic grammatical transformations much faster.

2. Semantic Parsing & Generalization (SLOG)

The ultimate test is SLOG, a benchmark designed to break models by testing structural generalization. It asks models to parse sentences that have deeper recursion or different structures than the training set.

The results were impressive. STEP outperformed standard T5 in almost every category, particularly in recursion.

Figure B.1: Frequency of recursion depths in our parsed corpus (Section 3.2) according to the dependency trees produced by trankit. Note that the y-axis is in log scale.

Figure B.1 shows the recursion depth in the pre-training data. Despite rarely seeing extremely deep recursion during training, STEP generalized significantly better to deeper structures in the SLOG benchmark compared to the baselines.

Table B.1: Full SLOG results.

Looking at the detailed breakdown in Table B.1, notice the “Deeper center embedding” row. Standard T5 scores a dismal 8.9%. STEP achieves 17.3%, and a simplified version of STEP (Simple STEP) reaches 22.7%. While still low (center embedding is notoriously hard for neural networks), it is a distinct improvement derived purely from the inductive bias of the pre-training.

Why Does It Work? Inside the Black Box

The most exciting part of this paper is the analysis of why STEP works. The researchers performed “brain surgery” on the attention heads of the model.

The “Look-Up” Heads

They discovered that the model develops specialized Transformation Look-Up Heads. When the model is processing a word (e.g., “cat”), specific attention heads look back at the prefix (the instructions) to see what operation should be applied to the dependency relation the word belongs to (e.g., “Object”).

They confirmed this by masking (turning off) these specific heads.

Figure 4: Change in accuracy of predicting the output of edgewise transformations when masking different attention heads. We show accuracy relative to no masking.

Figure 4 shows the results of this intervention.

  • Blue dots: When the specific “look-up” heads are masked, accuracy drops drastically (up to -50%).
  • Orange/Green dots: Masking random heads has almost zero effect.

This confirms the hypothesis: the model isn’t just memorizing patterns; it has built a mechanism to check the “rulebook” (the prefix) and apply the correct operation to the correct word.
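The same kind of intervention can be approximated with HuggingFace's `head_mask` argument. The snippet below is a sketch of the probing setup rather than the paper's exact procedure: the masked layer/head indices are hypothetical, and in practice you would first have to identify the look-up heads yourself (e.g., by inspecting which heads attend from sentence tokens back into the prefix).

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Sketch of a head-masking intervention on the encoder (illustrative only).
model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model.eval()

# 1.0 keeps a head, 0.0 silences it; shape is (num_layers, num_heads).
head_mask = torch.ones(model.config.num_layers, model.config.num_heads)
for layer, head in [(3, 5), (7, 1)]:          # hypothetical "look-up" heads
    head_mask[layer, head] = 0.0

inputs = tokenizer("Mary saw a cat", return_tensors="pt")
labels = tokenizer("[ Mary ] a cat saw", return_tensors="pt").input_ids

# Compare the loss on the expected output with and without the heads silenced.
with torch.no_grad():
    full_loss = model(**inputs, labels=labels).loss
    masked_loss = model(**inputs, head_mask=head_mask, labels=labels).loss
print(f"intact: {full_loss.item():.3f}   look-up heads masked: {masked_loss.item():.3f}")
```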

Reusing Heads for Fine-Tuning

But does this mechanism survive fine-tuning? When the model is retrained to do Passive Voice conversion (where there is no explicit rulebook, just tunable embeddings), does it still use these look-up heads?

Figure 5: Effect of masking look-up heads of models that have been fine-tuned on downstream syntactic tasks. For each task, we show the distribution for the 10 fine-tuned models from Section 4.2.

Figure 5 suggests yes. Even in the fine-tuned downstream tasks (like Passivization), masking the original look-up heads (Blue box) causes a much higher increase in error (TER) than masking random heads (Orange box).

Similarly, on synthetic downstream tasks, the effect is even more pronounced:

Figure B.3: Effect of masking look-up heads of models fine-tuned on synthetic tasks. The boxplot shows the distribution for 10 synthetic downstream tasks.

Figure B.3 shows that masking the corresponding look-up heads causes a massive drop in accuracy (median around -30%), whereas masking random heads does almost nothing.

This implies that the tunable embeddings (the learned soft prompts) are essentially learning to mimic the explicit instruction codes from pre-training, “activating” the same machinery the model built to handle syntax trees.

Conclusion and Implications

The STEP method provides compelling evidence that we don’t always need new model architectures to solve complex structural problems. Sometimes, we just need to give our existing models better practice drills.

By pre-training a Transformer to perform explicit operations on syntax trees (without seeing the trees themselves), the model:

  1. Strengthens its internal representation of syntax.
  2. Learns a mechanism (via attention heads) to map specific tokens to specific transformation rules.
  3. Successfully transfers this mechanism to real-world tasks via soft prompts.

This approach offers a promising path forward for low-resource languages or specialized domains where training data is scarce. If we can teach models the “grammar of transformations” synthetically, they can learn to speak the language of the downstream task with far fewer examples.

For students and researchers in NLP, STEP highlights the importance of inductive bias. A model is only as good as the assumptions it can make about the data. By baking in structural assumptions through clever pre-training, we can make our “black box” models a little more structured, and a lot more capable.