1. Introduction
One of the most spirited debates in the field of Artificial Intelligence today revolves around the true nature of Large Language Models (LLMs). When a model like GPT-4 solves a complex logic puzzle, is it genuinely reasoning—applying logical rules to derive an answer? Or is it merely acting as a “stochastic parrot,” retrieving memorized patterns from its vast training data?
While much research focuses on data quality or model size to improve reasoning, a fascinating new study titled “An Analysis for Reasoning Bias of Language Models with Small Initialization” looks at a fundamental, often overlooked design choice: parameter initialization.
The researchers report a striking finding: the scale at which we initialize a model’s weights can fundamentally alter its learning personality. Specifically, smaller initialization scales bias the model toward reasoning, whereas larger (standard) scales push the model toward memorization.
To visualize this immediately, look at the training dynamics of a GPT-2 model trained on a mix of two datasets: PrOntoQA (a logical reasoning dataset) and TinyStories (a simple narrative dataset).

As shown in Figure 1, the model (initialized with small weights) learns the reasoning task (PrOntoQA, red line) significantly faster than the narrative memorization task (TinyStories, blue line). This suggests a “reasoning bias” inherent in the training dynamics when parameters start small.
In this post, we will deconstruct this paper to understand why this happens. We will explore the geometric structure of embeddings, the mathematics of gradient flow, and how the distribution of labels shapes the “mind” of the model before it even finishes training.
2. Background: The Setup
To rigorously test reasoning versus memorization, we cannot rely solely on natural language, which is messy. We need a controlled environment. The authors utilize a synthetic task framework based on “Anchor Functions.”
2.1 The Synthetic Composition Task
The core idea is to create sequences of tokens that act as arithmetic problems. The model sees a sequence and must predict a label. The sequence contains:
- Keys (\(z\)): Variables or starting points.
- Anchors (\(a\)): Modifiers or operations.
- Noise: Irrelevant tokens.
The general structure of an input sequence \(X\) looks like this:
\[ X = (x_1, \dots, z, a_1, \dots, a_k, \dots, x_n) \]
Here, \(z\) represents keys and \(a\) represents anchors. The researchers define two distinct mappings (tasks) derived from these sequences:
1. The Reasoning Mapping (\(\mathcal{F}_{rsn}\)): This task requires the model to learn a rule. Specifically, the label is the sum of the key and the anchors. If the model learns the addition rule, it can generalize to unseen numbers.
\[ \mathcal{F}_{rsn}(X) = z + \sum_i a_i \]
2. The Memory Mapping (\(\mathcal{F}_{mem}\)): This task forces the model to memorize. For a specific pair of key and anchor, the label is a randomly assigned number \(y\). There is no logical rule connecting the input to the output; it is a pure lookup table.
\[ \mathcal{F}_{mem}(z, a) = y_{(z, a)}, \quad \text{where } y_{(z, a)} \text{ is drawn at random and then fixed} \]
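To make the contrast concrete, here is a minimal sketch (my own illustration, not the authors' code) of how such a mixed dataset could be generated, assuming a small integer vocabulary, one key and one anchor per sequence for simplicity, and the remaining positions filled with noise tokens:

```python
import numpy as np

rng = np.random.default_rng(0)

KEYS = np.arange(20, 100)     # key tokens z (illustrative range)
ANCHORS = np.arange(1, 16)    # anchor tokens a
SEQ_LEN = 8                   # total sequence length, key and anchor included

# Memory mapping: a fixed random lookup table, one arbitrary label per (key, anchor) pair.
memory_table = {(z, a): int(rng.integers(0, 200)) for z in KEYS for a in ANCHORS}

def make_example(task):
    """Build one input sequence containing a key, an anchor, and noise tokens."""
    z, a = rng.choice(KEYS), rng.choice(ANCHORS)
    noise = rng.integers(200, 300, size=SEQ_LEN - 2)       # tokens irrelevant to the label
    seq = np.concatenate([noise[:3], [z, a], noise[3:]])   # key/anchor embedded in noise
    label = z + a if task == "reasoning" else memory_table[(z, a)]  # F_rsn vs. F_mem
    return seq, label

print(make_example("reasoning"))   # label is always key + anchor
print(make_example("memory"))      # label is an arbitrary memorized value
```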
2.2 Visualizing the Data
The dataset is constructed so that the model sees sequences that look identical in structure but require different cognitive processes (addition vs. lookup) to solve.

In Figure 2, the left side shows the Memory Mapping: the specific combination of orange (key) and green (anchor) tokens maps to an arbitrary target \(y\). On the right, the Reasoning Mapping shows that the target is derived mathematically (\(78+15=93\)).
The crucial experimental setup is training a model on a dataset containing both types of tasks and observing which one it prefers to learn, and how well it generalizes.
3. The Core Phenomenon: Initialization Scales
Deep learning models are initialized with random weights, typically drawn from a normal distribution \(\mathcal{N}(0, \sigma^2)\). The standard deviation \(\sigma\) is usually defined as \(d^{-\gamma}\), where \(d\) is the layer width and \(\gamma\) (gamma) determines the scale.
- Small Initialization: Large \(\gamma\) (e.g., \(\gamma=0.8\)). The weights are very close to zero.
- Large/Standard Initialization: Small \(\gamma\) (e.g., \(\gamma=0.3\) or \(0.5\)). The weights are more spread out.
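To see what these two regimes mean numerically, here is a small sketch (illustrative width and values of \(\gamma\), not the paper's exact configuration) of sampling a weight matrix at different scales:

```python
import numpy as np

d = 512                          # layer width (illustrative)
rng = np.random.default_rng(0)

for gamma in (0.3, 0.5, 0.8):
    sigma = d ** (-gamma)        # sigma = d^(-gamma)
    W = rng.normal(0.0, sigma, size=(d, d))
    print(f"gamma = {gamma}: sigma = {sigma:.4f}, empirical std = {W.std():.4f}")
# Larger gamma -> much smaller sigma: at gamma = 0.8 the weights start almost at zero.
```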
3.1 The Trade-off
The researchers trained Transformers on these synthetic tasks using different initialization scales. The results, shown in Figure 3, are striking.

Let’s break down Figure 3 (Panel A):
- \(\gamma = 0.3\) (Large Init, Left Column): The blue line (Memory) and purple line (Reasoning Train) drop quickly. However, the orange line (Reasoning Test) stays high. This is classic overfitting: the model memorizes the training data for both tasks but fails to learn the reasoning rule.
- \(\gamma = 0.8\) (Small Init, Right Column): The purple (Reasoning Train) and orange (Reasoning Test) lines drop together, and they drop faster than the blue (Memory) line.
Conclusion: Small initialization suppresses memorization and promotes the learning of generalizable rules. When the weights start small, the model “prefers” to find the underlying logic (\(z+a\)) rather than memorizing individual outcomes.
4. The Mechanism: Why Small Init Favors Reasoning
To understand why this happens, we have to look under the hood. The authors simplify the Transformer into a model called Emb-MLP (Embedding layer + Multi-Layer Perceptron) to mathematically analyze the gradient flow.
4.1 The Embedding Space
The “brain” of a language model is its embedding space—where tokens are converted into vectors. The geometry of this space determines what the model finds easy to learn.
When initialization is small, the embedding vectors \(w^{emb}\) are tiny. During backpropagation, the way these vectors grow depends heavily on the distribution of the labels associated with each token.
The gradient flow (how the weights change) for a specific token \(s\) can be approximated by this equation:
\[ \frac{d\, w^{emb}_s}{dt} \;\propto\; P^s \]
This equation says that the change in a token’s embedding is proportional to the average label (\(P^s\)) associated with that token.
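A toy numerical check makes this concrete. The sketch below is my own simplification (a bag-of-embeddings model with a single token per sample and a softmax readout), not the paper's Emb-MLP, but it shows the same effect: when the weights start near zero, the accumulated gradient on a token's embedding is determined by the empirical label distribution \(P^s\) of the samples containing that token.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, C, N = 50, 32, 10, 2000            # vocab size, embedding dim, classes, samples

E = rng.normal(0, 1e-3, size=(V, d))     # token embeddings, near-zero initialization
W = rng.normal(0, 1e-3, size=(C, d))     # softmax readout weights

tokens = rng.integers(0, V, size=N)      # one token per toy sample
labels = rng.integers(0, C, size=N)

# Gradient of the cross-entropy loss w.r.t. the embeddings at initialization.
logits = E[tokens] @ W.T                                       # ~0 because weights are tiny
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)  # ~uniform (1/C)
err = probs.copy()
err[np.arange(N), labels] -= 1.0                               # dL/dlogits = softmax - one-hot
grad_E = np.zeros_like(E)
np.add.at(grad_E, tokens, err @ W)                             # accumulate per-token gradients

# Prediction from the "average label" picture: grad_E[s] ≈ n_s * (1/C - P^s) @ W,
# where P^s is the empirical label distribution of token s and n_s its count.
counts = np.bincount(tokens, minlength=V)
P = np.zeros((V, C))
np.add.at(P, tokens, np.eye(C)[labels])
P /= np.maximum(counts, 1)[:, None]
pred = counts[:, None] * ((1.0 / C - P) @ W)

print(np.abs(grad_E - pred).max(), np.abs(grad_E).max())  # deviation is negligible vs. the gradient itself
```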
4.2 Label Distribution: The Key Differentiator
This is the pivotal insight of the paper.
- Memory Tasks: The labels are random. For any given memory anchor, the associated labels are uniformly distributed across the possible outputs.
- Reasoning Tasks: The labels are structured. For a reasoning anchor (like the number “5” in an addition task), the associated labels are shifted by exactly 5 compared to the key.
Because Memory labels are random and uniform, the “average label” term in the gradient equation tends toward a uniform constant for all memory tokens. They all look the same to the gradient.
However, Reasoning labels have a shifted mean. The average label for the token “5” is different from the average label for “10”.
The result? Reasoning tokens develop distinct, structured embeddings very early in training, while memory tokens remain clumped together and indistinguishable.
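The following sketch (again my own illustration with made-up token ranges, approximating the memory labels as i.i.d. random draws) shows the statistic that matters: conditioned on a reasoning anchor, the mean label shifts with the anchor's value, while conditioned on a memory anchor it is essentially the same constant.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5000
keys = rng.integers(20, 100, size=N)            # key tokens z
anchors = rng.integers(1, 16, size=N)           # anchor tokens a

reasoning_labels = keys + anchors               # structured: labels are shifted copies of the keys
memory_labels = rng.integers(20, 120, size=N)   # arbitrary lookup values (approximated as i.i.d.)

for a in (1, 5, 10):
    mask = anchors == a
    print(f"anchor {a:2d}: mean reasoning label = {reasoning_labels[mask].mean():6.1f}, "
          f"mean memory label = {memory_labels[mask].mean():6.1f}")
# Reasoning means track the anchor (≈ mean(key) + a);
# memory means all hover around the same constant (the middle of the label range).
```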
4.3 Visualizing the Embeddings
We can see this distinction clearly in the cosine similarity matrices of the embeddings during training.

In Figure 4 (Panel A):
- Top Row (Memory Anchors): The heatmap is mostly red/yellow, indicating high similarity (near 1.0) between different memory anchors. The model cannot tell them apart easily.
- Bottom Row (Reasoning Anchors): We see a beautiful diagonal pattern. The similarity drops off as you move away from the diagonal. This means the number “11” is similar to “12” but distinct from “20”.
This structured geometry emerges purely from the gradient flow on the structured labels. It creates a “number line” representation in the high-dimensional space. Because the reasoning tokens are distinct, the model can easily use them to compute the output; the memory tokens, being nearly indistinguishable, slow the learning of the memory task.
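As a toy illustration of the two patterns (a synthetic construction, not embeddings from the paper's model), compare the cosine similarity matrix of a clump of undifferentiated vectors with that of vectors arranged along a smooth “number line”:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 64

# "Memory-like" embeddings: never differentiated, all close to one shared direction.
base = rng.normal(size=d)
memory_like = base + 0.05 * rng.normal(size=(n, d))

# "Reasoning-like" embeddings: a smooth drift along the token index (a toy number line).
reasoning_like = np.cumsum(rng.normal(size=(n, d)), axis=0)

def cosine_similarity(E):
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return E @ E.T

print(np.round(cosine_similarity(memory_like)[:4, :4], 2))     # all entries near 1.0
print(np.round(cosine_similarity(reasoning_like)[:4, :4], 2))  # decays away from the diagonal
```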
5. Scaling to Transformers
Does this logic hold for the complex architecture of a Transformer? Yes.
The researchers analyzed the embedding space of a full Transformer model trained with small initialization.

Figure 5 confirms the theory. Panel A (bottom) shows that Reasoning Anchors in a Transformer naturally organize themselves hierarchically. Panel B shows a PCA projection where the Reasoning Anchors (green) and Keys (orange) form distinct, ordered structures, while Memory tokens are clustered.
5.1 The Attention Mechanism as an Aggregator
With small initialization, the attention matrix in the first layer behaves in a specific way: it acts as an average operator.

Because the weights are small, the softmax function doesn’t peak sharply; it spreads attention roughly evenly. This allows the model to aggregate information from the entire context.
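A quick sketch (toy dimensions, random inputs; not the paper's model) shows why: with small \(\gamma\) the query/key projections are tiny, the attention logits sit near zero, and each row of the attention matrix stays close to the uniform value \(1/L\), so the head effectively averages its context.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 10, 512                              # sequence length, model width (illustrative)
X = rng.normal(size=(L, d))                 # token representations entering the first layer

for gamma in (0.8, 0.3):                    # small vs. large initialization
    sigma = d ** (-gamma)
    Wq = rng.normal(0, sigma, size=(d, d))
    Wk = rng.normal(0, sigma, size=(d, d))
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    print(f"gamma = {gamma}: max |attention - 1/L| = {np.abs(attn - 1 / L).max():.3f}")
# gamma = 0.8: every row is close to uniform, so the head averages the value vectors.
# gamma = 0.3: the softmax sharpens and attention is far from a uniform average.
```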
Crucially, the Value projection matrix (\(W^V\)) aligns itself with the reasoning anchors.

Figure 6 (Panel D) shows that the singular vectors of \(W^V\) have extremely high cosine similarity with the reasoning anchors (red line), but not the memory anchors (blue line).
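To reproduce this kind of measurement on a trained checkpoint, one could compare the singular directions of the value matrix with the anchor embeddings. The sketch below is a hypothetical helper, assuming you have already extracted `W_v` (applied to a column embedding \(e\) as \(W^V e\)) and anchor embedding matrices (here called `reasoning_anchor_embs` and `memory_anchor_embs`) from the model.

```python
import numpy as np

def anchor_alignment(W_v, anchor_embs, k=5):
    """Cosine similarity between the top-k right singular vectors of W_v
    (its input directions, assuming values are computed as W_v @ e) and a set
    of anchor embeddings; returns the best match per singular direction."""
    _, _, Vt = np.linalg.svd(W_v)                                   # rows of Vt: right singular vectors
    A = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    return np.abs(Vt[:k] @ A.T).max(axis=1)                         # one score in [0, 1] per direction

# e.g. compare anchor_alignment(W_v, reasoning_anchor_embs) with anchor_alignment(W_v, memory_anchor_embs)
```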
The Mechanism Summary:
- Embeddings: Small init + structured labels = Distinct embeddings for reasoning tokens.
- Attention 1: Averages the context. Because reasoning embeddings are distinct and aligned with projection matrices, their values are preserved and propagated.
- Attention 2: Identifies the “Key” and combines it with the aggregated “Anchor” info to compute the result (\(z + \sum_i a_i\)).
Memory tasks fail this process because their embeddings never separate sufficiently in the early stages to be useful.
6. Contrast: The “Lazy” Large Initialization
What happens if we use standard (large) initialization?
In high-dimensional spaces, randomly drawn vectors are nearly orthogonal (perpendicular) to each other, and with a large initialization scale the embeddings start out this way and, in this “lazy” regime, largely stay there.

Figure 14 (Top Row, \(\gamma=0.3\)) illustrates the large initialization scenario. The embedding space is defined by orthogonality—every token is distinct from every other token simply by chance.
- Pros: This is great for Memorization. If every input is unique and orthogonal, it’s easy to map Input A \(\to\) Output A without interfering with Input B.
- Cons: It is terrible for Reasoning. The model treats “10” and “11” as completely unrelated entities. It doesn’t learn the relationship between them.
This explains why the large init models in Figure 3 memorized the data but failed the reasoning test set. They learned to map specific inputs to outputs using the orthogonality of the weights, rather than learning the addition rule.
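A quick numerical check (a toy sketch, unrelated to any specific checkpoint) confirms the near-orthogonality claim at a GPT-2-like embedding width:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 768, 1000                          # GPT-2-sized embedding dim, number of tokens
E = rng.normal(size=(n, d))               # freshly initialized "embeddings"

E /= np.linalg.norm(E, axis=1, keepdims=True)
sims = E @ E.T
off_diag = sims[~np.eye(n, dtype=bool)]
print(off_diag.mean(), np.abs(off_diag).max())
# The mean cosine similarity is ~0 and even the largest |cos| stays small:
# by chance alone, every token starts out nearly orthogonal to every other token.
```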
7. Real-World Implications
The researchers validated these findings on GPT-2 using the PrOntoQA (reasoning) and TinyStories (narrative) datasets.

Figure 7 shows that with small initialization (\(\gamma=0.8\)), the embeddings for the PrOntoQA dataset (Left Heatmap in B) develop a rich structure with correlations, while the TinyStories embeddings (Right Heatmap) remain largely uncorrelated.
Panel A shows the “Reasoning Bias” metric (\(\Delta L\)). As \(\gamma\) increases (initialization becomes smaller), the model increasingly favors minimizing the reasoning loss over the memory loss.
8. Conclusion
This paper provides a theoretical and empirical foundation for a powerful idea: we can control the “cognitive style” of an LLM through initialization.
- Small Initialization allows the data’s label distribution to shape the embedding space. If the task has structure (logic, math, grammar), that structure is imprinted onto the model, facilitating reasoning and generalization.
- Large Initialization forces the model into an orthogonal regime. This facilitates the memorization of arbitrary mappings but hinders the discovery of underlying rules.
For students and practitioners, this suggests that if your goal is to train models that reason and generalize rather than hallucinate or memorize training data, paying attention to the initialization scale is not just a technical detail—it is a fundamental architectural choice.
Key Takeaways
- Initialization is a bias knob: Small weights = Reasoning bias; Large weights = Memorization bias.
- Labels shape embeddings: In the small weight regime, the statistical distribution of token labels drives the geometry of the embedding space.
- Early Dynamics Matter: The “personality” of the model (reasoner vs. memorizer) is determined very early in training based on how easily embeddings can differentiate themselves.
This post is based on “An Analysis for Reasoning Bias of Language Models with Small Initialization” (2025).