1. Introduction

One of the most spirited debates in the field of Artificial Intelligence today revolves around the true nature of Large Language Models (LLMs). When a model like GPT-4 solves a complex logic puzzle, is it genuinely reasoning—applying logical rules to derive an answer? Or is it merely acting as a “stochastic parrot,” retrieving memorized patterns from its vast training data?

While much research focuses on data quality or model size to improve reasoning, a fascinating new study titled “An Analysis for Reasoning Bias of Language Models with Small Initialization” looks at a fundamental, often overlooked design choice: parameter initialization.

The researchers propose a startling discovery: the scale at which we initialize a model’s weights can fundamentally alter its learning personality. Specifically, smaller initialization scales bias the model toward reasoning, whereas larger (standard) scales push the model toward memorization.

To visualize this immediately, look at the training dynamics of a GPT-2 model trained on a mix of two datasets: PrOntoQA (a logical reasoning dataset) and TinyStories (a simple narrative dataset).

Figure 1. Comparison of training loss on PrOntoQA and TinyStories during a single next-token-prediction training run on the mixed dataset. The red line shows the training loss on the PrOntoQA dataset; the blue line shows the training loss on the TinyStories dataset.

As shown in Figure 1, the model (initialized with small weights) learns the reasoning task (PrOntoQA, red line) significantly faster than the narrative memorization task (TinyStories, blue line). This suggests a “reasoning bias” inherent in the training dynamics when parameters start small.

In this post, we will deconstruct this paper to understand why this happens. We will explore the geometric structure of embeddings, the mathematics of gradient flow, and how the distribution of labels shapes the “mind” of the model before it even finishes training.

2. Background: The Setup

To rigorously test reasoning versus memorization, we cannot rely solely on natural language, which is messy. We need a controlled environment. The authors utilize a synthetic task framework based on “Anchor Functions.”

2.1 The Synthetic Composition Task

The core idea is to create sequences of tokens that act as arithmetic problems. The model sees a sequence and must predict a label. The sequence contains:

  1. Keys (\(z\)): Variables or starting points.
  2. Anchors (\(a\)): Modifiers or operations.
  3. Noise: Irrelevant tokens.

The general structure of an input sequence \(X\) looks like this:

\[
X = (z_1, \dots, z_p, a_1, \dots, a_n, \dots, z_L)
\]

Here, \(z\) represents keys and \(a\) represents anchors. The researchers define two distinct mappings (tasks) derived from these sequences:

1. The Reasoning Mapping (\(\mathcal{F}_{rsn}\)). This task requires the model to learn a rule: the label is the sum of the key and the anchors. If the model learns the addition rule, it can generalize to key-anchor combinations it has never seen.

\[
\mathcal{F}_{rsn}(X) = z_p + \sum_{i=1}^{n} a_i
\]

2. The Memory Mapping (\(\mathcal{F}_{mem}\)). This task forces the model to memorize: for a specific combination of key and anchors, the label is a randomly assigned number \(y\). There is no logical rule connecting the input to the output; it is a pure lookup table.

\[
\mathcal{F}_{mem}(X) = y, \qquad y \text{ a random label fixed once for the combination } (z_p, a_1, \dots, a_n)
\]
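
To make the setup concrete, here is a minimal Python sketch of how such a mixed dataset could be generated. The vocabulary ranges, sequence length, and helper names are illustrative assumptions of this post, not the authors' code.

```python
import random

# Illustrative vocabulary split (assumed for this sketch, not taken from the paper).
KEYS = list(range(20, 100))          # key tokens z
RSN_ANCHORS = list(range(1, 11))     # reasoning anchors a: label = key + sum(anchors)
MEM_ANCHORS = list(range(101, 111))  # memory anchors: label = fixed random lookup
NOISE = list(range(111, 151))        # irrelevant tokens
SEQ_LEN = 8

memory_table = {}                    # the arbitrary lookup table behind F_mem

def make_sequence(key, anchors):
    """Scatter the key and anchors among noise tokens to form the input X."""
    seq = [key, *anchors] + random.choices(NOISE, k=SEQ_LEN - 1 - len(anchors))
    random.shuffle(seq)
    return seq

def reasoning_example(n_anchors=2):
    key = random.choice(KEYS)
    anchors = random.choices(RSN_ANCHORS, k=n_anchors)
    return make_sequence(key, anchors), key + sum(anchors)   # F_rsn: rule-based label

def memory_example(n_anchors=2):
    key = random.choice(KEYS)
    anchors = tuple(sorted(random.choices(MEM_ANCHORS, k=n_anchors)))
    label = memory_table.setdefault((key, anchors), random.choice(KEYS))  # F_mem: lookup
    return make_sequence(key, list(anchors)), label

print(reasoning_example())   # a shuffled token sequence and its rule-based label
print(memory_example())      # the same structure, but an arbitrary memorized label
```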

2.2 Visualizing the Data

The dataset is constructed so that the model sees sequences that look identical in structure but require different cognitive processes (addition vs. lookup) to solve.

Figure 2. Schematic diagram of the synthetic composition task.

In Figure 2, the left side shows Memory Mapping. The specific combination of orange (key) and green (anchor) tokens maps to an arbitrary target \(y\). On the right, the Reasoning Mapping shows that the target is derived mathematically (\(78+15=93\)).

The crucial experimental setup is training a model on a dataset containing both types of tasks and observing which one it prefers to learn, and how well it generalizes.

3. The Core Phenomenon: Initialization Scales

Deep learning models are initialized with random weights, typically drawn from a normal distribution \(\mathcal{N}(0, \sigma^2)\). The standard deviation \(\sigma\) is usually defined as \(d^{-\gamma}\), where \(d\) is the layer width and \(\gamma\) (gamma) determines the scale.

  • Small Initialization: Large \(\gamma\) (e.g., \(\gamma=0.8\)). The weights are very close to zero.
  • Large/Standard Initialization: Small \(\gamma\) (e.g., \(\gamma=0.3\) or \(0.5\)). The weights are more spread out.
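
In code, the difference between the two regimes is just the exponent used when drawing the initial weights. Here is a minimal PyTorch sketch; applying it to every Linear and Embedding layer is an assumption of this post about where the rescaling matters.

```python
import torch.nn as nn

def init_with_gamma(model: nn.Module, gamma: float) -> None:
    """Re-draw weights from N(0, sigma^2) with sigma = d^(-gamma), d = layer width."""
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Embedding)):
            d = module.weight.shape[-1]              # input features / embedding dim
            nn.init.normal_(module.weight, mean=0.0, std=d ** (-gamma))
            if isinstance(module, nn.Linear) and module.bias is not None:
                nn.init.zeros_(module.bias)

# gamma = 0.8 -> small initialization (reasoning bias)
# gamma = 0.3 -> large / standard-like initialization (memorization bias)
```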

3.1 The Trade-off

The researchers trained Transformers on these synthetic tasks using different initialization rates. The results, shown in Figure 3, are striking.

Figure 3. Loss and prediction accuracy of the models on different datasets under varying initialization scales.

Let’s break down Figure 3 (Panel A):

  • \(\gamma = 0.3\) (Large Init, Left Column): The Blue line (Memory) and Purple line (Reasoning Train) drop quickly. However, the Orange line (Reasoning Test) stays high. This is classic overfitting. The model is memorizing the training data for both tasks but fails to learn the reasoning rule.
  • \(\gamma = 0.8\) (Small Init, Right Column): The Purple (Reasoning Train) and Orange (Reasoning Test) lines drop together, and they drop faster than the Blue (Memory) line.

Conclusion: Small initialization suppresses memorization and promotes the learning of generalizable rules. When the weights start small, the model “prefers” to find the underlying logic (\(z+a\)) rather than memorizing individual outcomes.

4. The Mechanism: Why Small Init Favors Reasoning

To understand why this happens, we have to look under the hood. The authors simplify the Transformer into a model called Emb-MLP (Embedding layer + Multi-Layer Perceptron) to mathematically analyze the gradient flow.

4.1 The Embedding Space

The “brain” of a language model is its embedding space—where tokens are converted into vectors. The geometry of this space determines what the model finds easy to learn.

When initialization is small, the embedding vectors \(w^{emb}\) are tiny. During backpropagation, the way these vectors grow depends heavily on the distribution of the labels associated with each token.

The gradient flow (how the weights change) for a specific token \(s\) can be approximated by this equation:

\[
\frac{\mathrm{d}\, w^{emb}_s}{\mathrm{d} t} \;\propto\; P^s
\]

This equation says that the change in a token’s embedding is proportional to the average label (\(P^s\)) associated with that token.
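
To see where that average comes from, consider a toy version of the Emb-MLP argument: a mean-pooled embedding feeding a linear softmax readout, everything at a tiny scale. The dimensions and the plain linear head are simplifying assumptions for this sketch, not the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, C, L = 30, 16, 30, 4          # vocab size, embedding dim, classes, sequence length
E = rng.normal(0.0, 1e-3, (V, d))   # tiny embeddings (small initialization)
W = rng.normal(0.0, 1e-3, (C, d))   # tiny linear readout

def grad_embedding(token, seqs, labels):
    """dLoss/dE[token] for a mean-pooled Emb -> linear -> softmax cross-entropy model."""
    g = np.zeros(d)
    for x, y in zip(seqs, labels):
        h = E[x].mean(axis=0)                    # mean-pooled sequence embedding
        p = np.exp(W @ h); p /= p.sum()          # softmax; ~uniform when weights are tiny
        dlogits = p.copy(); dlogits[y] -= 1.0    # softmax cross-entropy gradient
        g += (x.count(token) / L) * (W.T @ dlogits)
    return g / len(seqs)

# Because p is ~uniform at small initialization, the gradient for token s collapses to
#   grad ~ freq(s) * W.T @ (uniform - P_s),
# where P_s is the average one-hot label over sequences containing s. The token's
# embedding therefore moves in a direction dictated entirely by its label statistics.
```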

4.2 Label Distribution: The Key Differentiator

This is the pivotal insight of the paper.

  1. Memory Tasks: The labels are random. For any given memory anchor, the associated labels are uniformly distributed across the possible outputs.
  2. Reasoning Tasks: The labels are structured. For a reasoning anchor (like the number “5” in an addition task), the associated labels are shifted by exactly 5 compared to the key.

Because Memory labels are random and uniform, the "average label" term \(P^s\) in the gradient equation tends toward the same (uniform) distribution for every memory token. They all look the same to the gradient.

However, Reasoning labels have a shifted mean. The average label for the token “5” is different from the average label for “10”.

The result? Reasoning tokens develop distinct, structured embeddings very early in training, while memory tokens remain clumped together and indistinguishable.
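
A quick simulation makes the difference visible. Below, the reasoning labels are `key + anchor` while the memory labels come from an arbitrary lookup; the token ranges are again assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
keys = rng.integers(20, 100, size=50_000)
rsn_anchors = rng.integers(1, 11, size=50_000)
mem_anchors = rng.integers(1, 11, size=50_000)
mem_labels = rng.integers(20, 100, size=50_000)   # arbitrary lookup values

def mean_label_per_anchor(anchors, labels):
    return {int(a): round(float(labels[anchors == a].mean()), 1) for a in np.unique(anchors)}

print("reasoning:", mean_label_per_anchor(rsn_anchors, keys + rsn_anchors))
print("memory:   ", mean_label_per_anchor(mem_anchors, mem_labels))
# Reasoning anchors 1..10 see mean labels of roughly 60.5, 61.5, ..., 69.5 (shifted by
# the anchor's value), while every memory anchor sees a mean label near 59.5. The
# gradient signal can therefore tell reasoning anchors apart, but not memory anchors.
```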

4.3 Visualizing the Embeddings

We can see this distinction clearly in the cosine similarity matrices of the embeddings during training.

Figure 4. Cosine similarity matrices for memory and reasoning anchors.

In Figure 4 (Panel A):

  • Top Row (Memory Anchors): The heatmap is mostly red/yellow, indicating high similarity (near 1.0) between different memory anchors. The model cannot tell them apart easily.
  • Bottom Row (Reasoning Anchors): We see a beautiful diagonal pattern. The similarity drops off as you move away from the diagonal. This means the number “11” is similar to “12” but distinct from “20”.

This structured geometry emerges purely from the gradient flow on the structured labels. It creates a "number line" representation in the high-dimensional space. Because the reasoning tokens are distinct, the model can easily use them to compute the output, while the memory tokens, being indistinguishable, slow down the learning of the memory task.
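
The heatmaps in Figure 4 are easy to reproduce for any embedding table. Here is a short sketch, assuming a PyTorch `nn.Embedding` and a list of anchor token ids:

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(embedding: torch.nn.Embedding, token_ids: list[int]) -> torch.Tensor:
    """Pairwise cosine similarities between the embeddings of the given tokens."""
    vecs = embedding.weight[token_ids]       # (n_tokens, d)
    vecs = F.normalize(vecs, dim=-1)         # unit-normalize each embedding
    return vecs @ vecs.T                     # (n_tokens, n_tokens) similarity matrix

# Plotted as a heatmap, reasoning anchors give the banded diagonal of Figure 4,
# while memory anchors give a near-uniform, high-similarity block.
```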

5. Scaling to Transformers

Does this logic hold for the complex architecture of a Transformer? Yes.

The researchers analyzed the embedding space of a full Transformer model trained with small initialization.

Figure 5. Embedding structure of a Transformer model with small initialization scale.

Figure 5 confirms the theory. Panel A (bottom) shows that Reasoning Anchors in a Transformer naturally organize themselves hierarchically. Panel B shows a PCA projection where the Reasoning Anchors (green) and Keys (orange) form distinct, ordered structures, while Memory tokens are clustered.

5.1 The Attention Mechanism as an Aggregator

With small initialization, the attention matrix in the first layer behaves in a specific way: it acts as an average operator.

\[
\mathrm{Attn}(X) \;\approx\; \frac{1}{L} \sum_{j=1}^{L} x_j W^V
\]

Because the weights are small, the query-key logits are all close to zero, so the softmax doesn't peak sharply; it spreads attention roughly evenly across positions. This allows the model to aggregate information from the entire context.
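
This is easy to verify numerically: with query and key weights drawn at a small scale, every entry of the attention matrix sits close to the uniform value \(1/L\). The dimensions below are arbitrary.

```python
import torch

torch.manual_seed(0)
L, d = 12, 64                               # sequence length, model width
sigma = d ** (-0.8)                         # small-initialization scale (gamma = 0.8)

X = torch.randn(L, d)                       # token representations entering the layer
W_Q = torch.randn(d, d) * sigma
W_K = torch.randn(d, d) * sigma

scores = (X @ W_Q) @ (X @ W_K).T / d**0.5   # attention logits, all close to zero
A = torch.softmax(scores, dim=-1)

print(A.min().item(), A.max().item())       # both near 1/L ~ 0.083: no entry dominates a row
# So softmax(QK^T / sqrt(d)) V is essentially (1/L) * sum_j V_j: an average over the context.
```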

Crucially, the Value projection matrix (\(W^V\)) aligns itself with the reasoning anchors.

Figure 6. Characteristics of the first attention module under small initialization.

Figure 6 (Panel D) shows that the singular vectors of \(W^V\) have extremely high cosine similarity with the reasoning anchors (red line), but not the memory anchors (blue line).

The Mechanism Summary:

  1. Embeddings: Small init + structured labels = Distinct embeddings for reasoning tokens.
  2. Attention 1: Averages the context. Because reasoning embeddings are distinct and aligned with projection matrices, their values are preserved and propagated.
  3. Attention 2: Identifies the “Key” and combines it with the aggregated “Anchor” info to compute the result (\(Key + \sum Anchors\)).

Memory tasks fail this process because their embeddings never separate sufficiently in the early stages to be useful.

6. Contrast: The “Lazy” Large Initialization

What happens if we use standard (large) initialization?

In high-dimensional spaces, random vectors drawn from a large-scale distribution tend to be nearly orthogonal to each other: their pairwise cosine similarities concentrate around zero.

Figure 14. Characteristics of the embedding space for PrOntoQA and TinyStories with initialization rates 0.3 vs. 0.5.

Figure 14 (Top Row, \(\gamma=0.3\)) illustrates the large initialization scenario. The embedding space is defined by orthogonality—every token is distinct from every other token simply by chance.

  • Pros: This is great for Memorization. If every input is unique and orthogonal, it’s easy to map Input A \(\to\) Output A without interfering with Input B.
  • Cons: It is terrible for Reasoning. The model treats “10” and “11” as completely unrelated entities. It doesn’t learn the relationship between them.
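
The near-orthogonality claim is easy to check numerically; the dimensions below are arbitrary, and the snippet only illustrates the geometry at initialization, not the training dynamics.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_tokens, d = 1_000, 768                            # e.g. a GPT-2-sized embedding width
E = F.normalize(torch.randn(n_tokens, d), dim=-1)   # random directions, as at standard init

cos = E @ E.T                                       # pairwise cosine similarities
off_diag = cos[~torch.eye(n_tokens, dtype=torch.bool)]

print(off_diag.abs().mean().item())   # ~0.03: random high-dimensional vectors are nearly
                                      # orthogonal (typical |cos| ~ 1/sqrt(d))
# Great for a lookup table (inputs barely interfere), useless for a number line:
# the embeddings of "10" and "11" start out as unrelated as those of "10" and "90".
```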

This explains why the large init models in Figure 3 memorized the data but failed the reasoning test set. They learned to map specific inputs to outputs using the orthogonality of the weights, rather than learning the addition rule.

7. Real-World Implications

The researchers validated these findings on GPT-2 using the PrOntoQA (reasoning) and TinyStories (narrative) datasets.

Figure 7. Reasoning bias of GPT-2 in real language tasks.

Figure 7 shows that with small initialization (\(\gamma=0.8\)), the embeddings for the PrOntoQA dataset (Left Heatmap in B) develop a rich structure with correlations, while the TinyStories embeddings (Right Heatmap) remain largely uncorrelated.

Panel A shows the “Reasoning Bias” metric (\(\Delta L\)). As \(\gamma\) increases (initialization becomes smaller), the model increasingly favors minimizing the reasoning loss over the memory loss.

8. Conclusion

This paper provides a theoretical and empirical foundation for a powerful idea: we can control the “cognitive style” of an LLM through initialization.

  • Small Initialization allows the data’s label distribution to shape the embedding space. If the task has structure (logic, math, grammar), that structure is imprinted onto the model, facilitating reasoning and generalization.
  • Large Initialization forces the model into an orthogonal regime. This facilitates the memorization of arbitrary mappings but hinders the discovery of underlying rules.

For students and practitioners, this suggests that if your goal is to train models that reason and generalize rather than hallucinate or memorize training data, paying attention to the initialization scale is not just a technical detail—it is a fundamental architectural choice.

Key Takeaways

  1. Initialization is a bias knob: Small weights = Reasoning bias; Large weights = Memorization bias.
  2. Labels shape embeddings: In the small weight regime, the statistical distribution of token labels drives the geometry of the embedding space.
  3. Early Dynamics Matter: The “personality” of the model (reasoner vs. memorizer) is determined very early in training based on how easily embeddings can differentiate themselves.

This post is based on “An Analysis for Reasoning Bias of Language Models with Small Initialization” (2025).