Large Language Models (LLMs) like GPT-4 and Llama have revolutionized artificial intelligence, demonstrating capabilities that often feel like genuine reasoning. However, beneath the surface of these impressive systems lies a surprisingly simple training paradigm: next-token prediction. These models are trained to predict the immediate next word based on the words that came before it.
While effective, this “greedy” approach has a fundamental flaw. It forces models to think in a linear, left-to-right fashion without stopping to plan. Imagine trying to write a complex essay or solve a navigation puzzle by only thinking one word at a time, never looking ahead to where your sentence is going.
This limitation leads to a phenomenon known as “shortcut learning” or the “Clever Hans” effect, where models rely on superficial patterns rather than true understanding.
In this deep dive, we will explore a fascinating solution proposed in the paper “Semformer: Transformer Language Models with Semantic Planning.” The researchers introduce a novel architecture that forces the model to construct a high-level “semantic plan” before generating a response. We will walk through the problem of myopic training, break down the Semformer architecture, and analyze how “thinking before speaking” allows models to solve problems that standard Transformers simply cannot.
The Problem: The Myopia of Teacher Forcing
To understand why we need a new architecture, we first need to understand how standard models fail.
Current LLMs are trained using Teacher Forcing. During training, the model is fed a sequence of ground-truth tokens (the “teacher’s” answer) and is asked to predict the next one. Mathematically, for a sequence \(x\), the model tries to maximize the probability of each token \(x_t\) given the previous tokens \(x_{<t}\), i.e., it maximizes \(\sum_{t} \log P(x_t \mid x_{<t})\).

The issue is that this creates a shortcut. If the model can guess the next token based solely on the last few words (local context), it will do so, ignoring the broader problem structure.

The Graph Path-Finding Test

The authors illustrate this failure using a graph path-finding task. This is a “minimal lookahead” task: given a start node, a target node, and a list of connections (the graph), find the path from start to target. Figure 1 in the paper shows an example graph. The catch is that under teacher forcing, once the first node of the answer is revealed, every later node can be read off by simply following edges in the adjacency list; the only step that actually requires planning is the very first one. This is the Clever Hans cheat. The model memorizes local transitions rather than learning the algorithm to look ahead and find the connected path. As we will see in the experiments section, standard GPT-2 models fail spectacularly at this task because they cannot “stop and think” about the destination before taking the first step.

The Solution: Semformer

The core insight of the Semformer (Semantic Transformer) is simple yet profound: Humans do not rely solely on historical context. We formulate an abstract plan based on the problem, and that plan guides our response. The Semformer incorporates this by introducing a two-stage generation process into a standard decoder-only Transformer: first predict an abstract, latent plan of the upcoming response, then generate the actual tokens conditioned on both the prefix and that plan.

The Architecture

The Semformer consists of two main components: the Language Model (the student) and an Autoencoder (the teacher/guide). The language model reads the input prefix followed by a short sequence of special planning tokens; the autoencoder compresses the target response into a short sequence of latent vectors \(Z\) that serves as the “gold” plan those tokens should predict. Figure 2 in the paper shows how the two components fit together. In essence, the model is trained to hallucinate the “gist” of the future answer before it actually generates the specific words.
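To make this input layout concrete, here is a minimal PyTorch-style sketch of how a training sequence might be assembled. The token ID, the number of planning tokens, and the helper name `build_input` are illustrative assumptions, not details taken from the paper.

```python
import torch

# Assumed constants for illustration; the paper's actual values may differ.
NUM_PLAN_TOKENS = 4     # k special planning tokens inserted after the prefix
PLAN_TOKEN_ID = 50257   # hypothetical ID reserved for the planning token

def build_input(prefix_ids: list[int], response_ids: list[int]):
    """Lay out one training example as [prefix ; k planning tokens ; response].

    Reading left to right, the decoder has seen only the prefix when it
    reaches the planning positions, and it can attend to both the prefix
    and the plan when it generates the response tokens.
    """
    plan_ids = [PLAN_TOKEN_ID] * NUM_PLAN_TOKENS
    input_ids = torch.tensor(prefix_ids + plan_ids + response_ids)
    # Record where the planning positions sit so the losses described below
    # can treat them differently from ordinary text positions.
    plan_slice = slice(len(prefix_ids), len(prefix_ids) + NUM_PLAN_TOKENS)
    return input_ids, plan_slice
```

For example, `build_input([10, 11, 12], [20, 21])` returns a 9-token sequence whose positions 3–6 are planning tokens.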
The Mathematical Framework

The training involves three distinct loss functions combined into one.

1. The Language Modeling Loss (\(\mathcal{L}_{\mathrm{LM}}\))

This is the standard objective. The model must predict the next token in the sequence. Note that the planning tokens (\(d\)) are part of the input context, but the model is not penalized for “predicting” the planning tokens themselves under the standard cross-entropy loss; it simply uses them as context to predict the text that follows.
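A sketch of what that looks like in code, reusing the hypothetical `plan_slice` from the snippet above; the masking convention (ignore index −100) is a common PyTorch idiom, assumed here rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def lm_loss(logits: torch.Tensor, input_ids: torch.Tensor, plan_slice: slice):
    """Next-token cross-entropy over one sequence, skipping planning positions.

    logits:    [seq_len, vocab_size] decoder outputs.
    input_ids: [seq_len] the sequence built above.
    """
    targets = input_ids[1:].clone()   # position t predicts the token at t+1
    # The planning tokens are context only: never ask the model to emit them.
    targets[plan_slice.start - 1 : plan_slice.stop - 1] = -100
    return F.cross_entropy(logits[:-1], targets, ignore_index=-100)
```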
2. The Autoencoder Loss (\(\mathcal{L}_{\mathrm{AE}}\))

To ensure the planning tokens mean something, we need a “Gold Standard” for what a plan should look like. The Autoencoder provides this. It encodes the target response (\(x_{n+1:T}\)) into latent vectors \(Z\) and tries to reconstruct the response from \(Z\). If the Autoencoder can reconstruct the sentence from \(Z\), then \(Z\) must contain all the necessary semantic information.
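The following is a minimal sketch of this idea, not the paper’s architecture: it uses a small GRU encoder and decoder where the paper uses Transformer layers, and the class and parameter names (`ResponseAutoencoder`, `k`, `d_model`) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResponseAutoencoder(nn.Module):
    """Toy autoencoder: compress a response into k latent vectors Z,
    then reconstruct the response from Z."""

    def __init__(self, vocab_size: int, d_model: int, k: int):
        super().__init__()
        self.k, self.d_model = k, d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for Transformer blocks
        self.to_latent = nn.Linear(d_model, d_model * k)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, response_ids: torch.Tensor):
        # Encode the response and squeeze it into k latent vectors Z.
        h, _ = self.encoder(self.embed(response_ids))                 # [B, T, d]
        z = self.to_latent(h[:, -1]).view(-1, self.k, self.d_model)  # [B, k, d]
        # Reconstruct the response conditioned on Z (teacher-forced).
        dec_in = torch.cat([z, self.embed(response_ids[:, :-1])], dim=1)
        dec_out, _ = self.decoder(dec_in)
        logits = self.out(dec_out[:, self.k - 1 :])                   # predicts all T tokens
        ae_loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), response_ids.reshape(-1)
        )
        return z, ae_loss
```

If reconstruction succeeds, the k vectors in `z` are a compact semantic summary of the response, which is exactly what the planning tokens will be trained to predict.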
3. The Representation Prediction Loss (\(\mathcal{L}_{\mathrm{RP}}\))

This is the bridge. We force the Language Model’s planning-token representations to look like the Autoencoder’s latent vectors \(Z\), using a Mean Squared Error (\(L_2\)) loss to minimize the distance between the predicted plan and the actual semantic summary of the future.
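A sketch of that objective, reusing the hypothetical `plan_slice` and the autoencoder latents `z` from the previous snippets; detaching `z` (so only the language model is pushed toward the plan) is an assumption of this sketch, not something stated above.

```python
import torch
import torch.nn.functional as F

def rp_loss(lm_hidden: torch.Tensor, plan_slice: slice, z: torch.Tensor):
    """Pull the decoder's hidden states at the planning positions toward Z.

    lm_hidden: [batch, seq_len, d_model] final hidden states of the language model.
    z:         [batch, k, d_model] latent plan produced by the autoencoder.
    """
    predicted_plan = lm_hidden[:, plan_slice]   # [batch, k, d_model]
    return F.mse_loss(predicted_plan, z.detach())
```

In practice the three losses are summed into a single training objective, as described next.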
Total Training Objective

The final loss function combines all three elements, \(\mathcal{L} = \mathcal{L}_{\mathrm{LM}} + \mathcal{L}_{\mathrm{AE}} + \alpha \, \mathcal{L}_{\mathrm{RP}}\), where \(\alpha\) is a hyperparameter that weights how important the planning task is compared to the generation task.

Inference: How it works in practice

During training, the Autoencoder “cheats” by looking at the answer to create the target plan \(Z\). But during inference (testing), we don’t have the answer. This is where the magic happens. Because the Language Model was trained to minimize \(\mathcal{L}_{\mathrm{RP}}\), it has learned to generate valid planning vectors \(Z\) based only on the input prefix. It effectively predicts the future abstractly before generating it concretely.

Experimental Validation

Does this extra complexity actually help? The researchers tested Semformer on both the specific graph path-finding task and general language modeling.

Crushing the Graph Path-Finding Task

The primary testbed was the graph problem described in the introduction. The researchers compared Semformer against several baselines: a standard GPT-2-style Transformer, a Pause-token variant, a Bag-of-Words (BoW) objective, and Teacher-less training. The results, shown in Table 1 of the paper, are staggering.

Key Takeaways: Semformer learns the task, while the standard model fails spectacularly (its G(20,5) accuracy is 4.8%). It cannot look ahead.

Convergence Speed

Not only does Semformer learn the task, but it also learns it incredibly fast. In Figure 3, the red line (Semformer) shoots up to high accuracy within 50,000 steps. The Bag-of-Words (BoW) baseline (green), which tries to predict the set of future nodes without order, is much slower and less accurate. The Teacher-less method (orange) completely fails to learn the pattern in these settings.

Why does it work? Visualizing Attention

To prove that the model is actually “planning” and not just getting lucky, the researchers visualized the attention weights—what the model is looking at when it makes a decision. Figure 6 in the paper shows these attention maps.

Beyond Toy Tasks: General Language Modeling

Critically, the authors wanted to ensure this wasn’t just a trick for graph problems. They trained a 125M parameter model on OpenWebText (the same data used for GPT-2) to see if semantic planning helps with writing English.
Perplexity Scores

Perplexity is a measure of how confused a model is (lower is better). As shown in Table 4, Semformer achieves lower perplexity on Wikitext and LAMBADA compared to both the standard Transformer (TF) and the Pause baseline. This suggests that even for general text, predicting the “meaning” of the sentence before writing it leads to better predictions.
In-Context Learning

The researchers also tested how well the model learns from examples (few-shot learning) on tasks like sentiment analysis (SST-2) and paraphrase detection (MRPC). Figure 7 shows that Semformer (specifically the green bars, \(\alpha=0.1\)) consistently outperforms the standard Transformer (blue bars), particularly on the MRPC dataset. This indicates that the planning capability helps the model grasp the task structure from the prompt more effectively.
Summarization

Finally, they fine-tuned the model on text summarization tasks. Summarization inherently requires understanding the whole content before writing, making it an ideal candidate for semantic planning. Table 5 confirms the hypothesis: Semformer achieves higher ROUGE scores (a metric for text overlap) across three different datasets, outperforming models that dive straight into writing without the planning objective.

Implications and Conclusion

The “Clever Hans” effect is a major bottleneck for the reliability of Large Language Models. When models merely associate adjacent tokens without understanding the broader trajectory of the generation, they become prone to hallucination and logical errors. Semformer offers a compelling architectural fix. By explicitly separating planning (predicting the latent future) from execution (generating the tokens), it forces the model to learn a global understanding of the output. As we strive for AGI and models that can perform complex reasoning, architectures like Semformer highlight the importance of internal state and lookahead. Future language models may well be judged not just by what they say, but by the quality of the silent thoughts they have before they speak.