Beyond Words: Uncovering the Syntactic Templates Hidden in LLM Outputs
If you have spent enough time interacting with Large Language Models (LLMs) like GPT-4 or Llama, you might have noticed a specific “vibe” to the text they produce. Even when the content is factually new, or the specific vocabulary is varied, there is often a sense of familiarity—a robotic cadence or a structural repetitiveness that distinguishes model output from human writing.
We often evaluate the diversity of an AI’s language by looking at the words it chooses (lexical diversity). We ask: “Is it repeating the same n-grams (sequences of words)?” But what if the repetition isn’t in the words themselves, but in the grammatical scaffolding holding them together?
In the paper “Detection and Measurement of Syntactic Templates in Generated Text,” researchers Chantal Shaib, Yanai Elazar, Junyi Jessy Li, and Byron C. Wallace dive deep into this phenomenon. They propose that LLMs rely heavily on “Syntactic Templates”—abstract sequences of grammatical categories—and that measuring these templates reveals fascinating insights about how models learn, memorize, and generate text.
The Problem: The Illusion of Diversity
When we train LLMs, we want them to be diverse. We use sampling strategies (like high temperature or nucleus sampling) to prevent the model from saying the same thing twice. However, existing metrics for diversity usually focus on tokens (words).
The researchers argue that models can be lexically diverse while being syntactically repetitive. For example, consider these two sentences:
- “The quick brown fox jumped.”
- “A lazy red dog slept.”
Lexically, these sentences share no content words at all. Syntactically, however, they are identical: Determiner -> Adjective -> Adjective -> Noun -> Verb.
If an LLM produces thousands of sentences that all follow this exact structure, it is exhibiting a form of repetition that standard metrics miss. This blog post explores how the authors quantify this structural repetition and what it tells us about the “style” memorized by LLMs.
Core Method: Defining Syntactic Templates
To measure this phenomenon, the authors introduce the concept of Syntactic Templates.
From Words to Tags
The first step is abstraction. Instead of looking at the raw text, the method converts every word into its Part-of-Speech (POS) tag using a standard tagger (such as spaCy).
- Text: “The Last Black Man in San Francisco is a poignant, beautifully shot film…”
- POS Tags:
DT JJ JJ NN IN NNP NNP VBZ DT JJ , RB VBN NN...
By stripping away the vocabulary, we are left with the syntactic skeleton of the sentence.
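To make this step concrete, here is a minimal sketch of the tagging stage, assuming spaCy with its `en_core_web_sm` model installed; the paper’s exact tagging pipeline may differ in detail.

```python
# Minimal sketch of the POS-abstraction step (assumes spaCy and its
# en_core_web_sm model: `python -m spacy download en_core_web_sm`).
import spacy

nlp = spacy.load("en_core_web_sm")

def to_pos_sequence(text: str) -> list[str]:
    """Map each token to its Penn Treebank POS tag (spaCy's .tag_ attribute)."""
    return [token.tag_ for token in nlp(text)]

print(to_pos_sequence(
    "The Last Black Man in San Francisco is a poignant, beautifully shot film"
))
# Expect something close to the tag sequence shown above
# (exact tags can vary slightly between taggers and model versions).
```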
Identifying the Templates
A “template” is defined as a specific sequence of these POS tags (e.g., a sequence of length \(n=4\) to \(8\)) that appears frequently within a corpus. The authors define “frequent” based on the dataset size, but generally, they look for the top 100 most common patterns.
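As a rough illustration of this definition, the sketch below slides a window of \(n\) POS tags over a corpus and keeps the most frequent sequences as templates. Here `corpus_pos` is assumed to be a list of POS sequences produced as in the tagging sketch above; the paper’s exact frequency cutoff may differ.

```python
from collections import Counter

def pos_ngrams(pos_seq: list[str], n: int) -> list[tuple[str, ...]]:
    """All contiguous length-n windows of a POS-tag sequence."""
    return [tuple(pos_seq[i:i + n]) for i in range(len(pos_seq) - n + 1)]

def extract_templates(corpus_pos: list[list[str]], n: int = 6, top_k: int = 100):
    """Return the top_k most frequent POS n-grams across the corpus."""
    counts = Counter()
    for seq in corpus_pos:
        counts.update(pos_ngrams(seq, n))
    return [template for template, _ in counts.most_common(top_k)]
```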
Figure 1: An example of generated movie reviews. The highlighted sections show different texts that map to the exact same sequence of part-of-speech tags. Even though the words differ, the underlying syntax is identical.
As shown in Figure 1, different models (OLMo-7B vs. Mistral-7B) prefer different templates, but both rely on them heavily.
Measuring Syntactic Repetition
To quantify how “templated” a text is, the authors propose three key metrics; a minimal code sketch of all three follows the list below.
1. Compression Ratio (CR-POS) This metric is inspired by text compression algorithms like gzip. If a sequence of POS tags is very repetitive, it can be compressed efficiently. A higher Compression Ratio indicates lower diversity (more repetition).

2. Template Rate (TR) This measures the percentage of generated texts in a corpus that contain at least one identified template. A high template rate suggests the model is relying on formulaic structures to generate its output.

3. Templates-per-Token (TPT) Since longer texts are statistically more likely to contain a template, the authors normalize the count by the text length. This allows for fair comparison between models that generate outputs of different lengths.
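The sketch below gives deliberately simplified versions of all three metrics, reusing `pos_ngrams` and the template list from the extraction sketch above; the exact normalizations used in the paper may differ.

```python
import gzip

# 1. CR-POS: how well does the concatenated POS-tag stream compress?
#    Higher ratio (original bytes / compressed bytes) = more repetition.
def compression_ratio_pos(corpus_pos: list[list[str]]) -> float:
    raw = " ".join(" ".join(seq) for seq in corpus_pos).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))

# 2. Template Rate: fraction of texts containing at least one template.
def template_rate(corpus_pos, templates, n: int = 6) -> float:
    template_set = set(templates)
    hits = sum(any(ng in template_set for ng in pos_ngrams(seq, n))
               for seq in corpus_pos)
    return hits / len(corpus_pos)

# 3. Templates-per-Token: template occurrences normalized by text length,
#    so long and short outputs can be compared fairly.
def templates_per_token(pos_seq, templates, n: int = 6) -> float:
    template_set = set(templates)
    matches = sum(ng in template_set for ng in pos_ngrams(pos_seq, n))
    return matches / max(len(pos_seq), 1)
```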

Experimental Setup
The researchers tested a variety of models, including:
- Open Models: OLMo-7B (where training data is available), Mistral-7B, Llama-2, and Llama-3.
- Closed Models: GPT-4o.
They evaluated these models on tasks ranging from Open-Ended Generation (generating text from scratch) to Summarization (news, movie reviews, and biomedical reviews).
Key Findings
1. Models are “Syntactic Parrots”
The first major finding is that models produce templated text at a significantly higher rate than humans.
When analyzing the Rotten Tomatoes dataset (movie reviews), the researchers found that 95% of model-generated outputs contained templates of length 6. In contrast, human-written references contained these templates only 38% of the time.
Figure 6: The percentage of texts containing at least one template. The dashed line represents human-written text. Notice how almost every model (colored bars) significantly exceeds the human baseline, especially for templates of length 4, 5, and 6.
This trend holds true even when users try to increase diversity. You might think that increasing the “temperature” (a setting that makes models more random) would break these templates. Surprisingly, it does not.
Table 1: Even when increasing temperature from 0.8 to 0.95, the percentage of texts containing templates (right column) remains stubbornly high (around 96-97%). While lexical diversity might increase, the syntactic structure remains rigid.
2. Templates are Learned in Pre-training
Where do these templates come from? Are they an artifact of “Instruction Tuning” (where models are taught to follow commands)?
Using the OLMo-7B model, for which the full training data and checkpoints are public, the authors traced the origin of these templates.
They are learned early. By measuring the perplexity (a measure of how “surprised” a model is by a sequence) of templates across training checkpoints, the researchers found that models learn these syntactic patterns almost immediately.
Figure 3: The green line shows the perplexity of template tokens. Note how it drops precipitously within the first few checkpoints and stays low. The model learns the “grammar” of its training data long before it finishes training.
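A minimal sketch of the underlying measurement using Hugging Face transformers is shown below. The model name is an illustrative assumption (not the exact checkpoints used in the paper), and the paper scores template tokens specifically across intermediate training checkpoints; this sketch just shows how perplexity of a text span can be computed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text: str, model, tokenizer) -> float:
    """exp(mean next-token cross-entropy) of `text` under a causal LM."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # next-token cross-entropy loss over the sequence.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

# Illustrative checkpoint; to trace learning over time, repeat this over
# intermediate training checkpoints where they are published.
name = "allenai/OLMo-7B-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
print(perplexity("a poignant, beautifully shot film", model, tokenizer))
```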
They come from the pre-training data. The study found that 76% of the templates produced by OLMo could be found directly in its pre-training dataset (Dolma). This suggests that the model is not inventing these structures or learning them solely from fine-tuning; it is regurgitating common syntactic patterns it read during its initial training.
Figure 4: A massive 75.4% of OLMo’s generated templates are present in the pre-training data. Compare this to “Non-Templates” (green bar), where random sequences are found much less frequently.
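A toy sketch of how such an overlap number could be computed, reusing the helpers above. Here `pretrain_pos` stands in for a POS-tagged sample of the pre-training corpus; it is an assumption for illustration, not the paper’s actual pipeline, which searches Dolma at scale.

```python
def template_overlap(generated_templates, pretrain_pos, n: int = 6) -> float:
    """Fraction of generated templates that also occur in the tagged corpus."""
    pretrain_ngrams = set()
    for seq in pretrain_pos:
        pretrain_ngrams.update(pos_ngrams(seq, n))
    found = sum(template in pretrain_ngrams for template in generated_templates)
    return found / len(generated_templates)
```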
Furthermore, the templates the model generates are not arbitrary patterns; they tend to be among the most frequent patterns in the training data.
Figure 5: The blue bars represent the templates OLMo generates. They cluster heavily to the left, meaning they are among the highest-ranked (most frequent) patterns in the pre-training data.
3. Style Memorization vs. Exact Memorization
This leads to the most intriguing implication of the paper: Memorization.
We usually say a model has “memorized” training data if it outputs the exact same text verbatim. However, this definition is too narrow. Models often hallucinate specific numbers or swap synonyms while keeping the exact sentence structure intact.
The authors define this as Style (POS) Memorization.
Figure 8: An example of Style Memorization. The left text is the original training data; the right text is the model output. The model changes “lucky” to “some” and “shy” to “timid,” and changes the dollar amounts entirely, but the syntactic structure (POS sequence) is identical.
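A minimal sketch of the distinction, reusing `to_pos_sequence` from the tagging sketch above. The paper matches generations against training documents at scale; this only illustrates the core check.

```python
def is_exact_memorized(generated: str, reference: str) -> bool:
    # Exact memorization: the surface strings match verbatim.
    return generated.strip() == reference.strip()

def is_style_memorized(generated: str, reference: str) -> bool:
    # Style (POS) memorization: only the POS sequences must match;
    # the words themselves may differ.
    return to_pos_sequence(generated) == to_pos_sequence(reference)

gen = "A lazy red dog slept."
ref = "The quick brown fox jumped."
print(is_exact_memorized(gen, ref))  # False
print(is_style_memorized(gen, ref))  # True, if both tag as DT JJ JJ NN VBD .
```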
This “soft” memorization is far more prevalent than exact text memorization.
Figure 7: The green bars (POS Memorized) are consistently higher than the blue bars (Exact Memorized). This indicates that we are underestimating how much training data models retain because we are only looking for exact matches.
Conclusion
The work by Shaib et al. provides a new lens through which to view Large Language Models. We often marvel at the creativity of AI, but this research suggests that a significant portion of that “creativity” is poured into rigid, pre-learned molds.
Key Takeaways:
- Syntactic rigidity: LLMs are far more repetitive in their sentence structure than humans, even when their vocabulary is diverse.
- Deep roots: These structural habits are formed very early in pre-training and persist through fine-tuning and alignment (RLHF).
- Hidden memorization: By looking only for exact text matches, we miss the “style memorization” where models reproduce the syntax of training data while swapping out specific words.
This research implies that if we want truly diverse AI generation, we need to look beyond just the words on the screen and consider the structures underneath. It also suggests that syntactic templates could be a powerful tool for detecting data leakage and understanding the provenance of model behaviors.