Large Language Models (LLMs) like GPT-4 or LLaMA are often described as having “emergent abilities”—capabilities that appear as the models scale in size. One of the most fascinating, and controversial, of these behaviors is memorization.
Memorization occurs when an LLM generates content verbatim from its training data. On one hand, this allows models to act as vast knowledge bases, recalling historical facts or coding syntax. On the other hand, it poses significant privacy and copyright risks. If a model was trained on sensitive personal data or copyrighted books, eliciting that exact text from the model is a serious vulnerability.
While we know that LLMs memorize data, we know surprisingly little about the mechanism behind it. What happens inside the neural network when it switches from “creating” text to “reciting” it?
In a recent paper titled “A Multi-Perspective Analysis of Memorization in Large Language Models,” researchers from the University of Tokyo take a microscope to this phenomenon. Rather than just looking at how much data is memorized, they analyze the input and output dynamics—the frequency of words, the entropy of generation, and the behavior of embeddings—to understand the physics of machine memory.

As illustrated in Figure 1, this research breaks the problem down into distinct perspectives: input dynamics (what comes in), decoding dynamics (what happens inside), and the potential to predict whether a specific sequence is memorized.
What is Memorization?
Before diving into the mechanics, we need a precise definition. In this study, the researchers use a metric called \(k\)-extractability. A sequence is considered memorized if, given a specific context (a prompt) of length \(k\), the model can reproduce the continuation of that sequence exactly as it appeared in the training data.
The researchers calculate a memorization score using the following equation:

\[
M(X, Y) = \frac{1}{l} \sum_{i=1}^{l} \mathbb{1}(x_i = y_i)
\]

Here, \(X = (x_1, \dots, x_l)\) is the sequence predicted by the model and \(Y = (y_1, \dots, y_l)\) is the true continuation from the training data, compared token by token over the \(l\) generated positions. If \(M(X, Y) = 1\), the sequence is fully memorized. If it is 0, the model has generated something entirely different, and the sequence is unmemorized.
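To make the definition concrete, here is a minimal sketch of that score in Python; the function name and the token-list representation are ours, not code from the paper.

```python
def memorization_score(predicted_tokens, true_tokens):
    """Fraction of continuation positions where the model's output matches
    the training-data continuation exactly (1.0 = fully memorized)."""
    assert len(predicted_tokens) == len(true_tokens)
    matches = sum(p == t for p, t in zip(predicted_tokens, true_tokens))
    return matches / len(true_tokens)

# Three of four continuation tokens match, so the score is 0.75.
print(memorization_score(["the", "cat", "sat", "down"],
                         ["the", "cat", "sat", "there"]))
```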
To study this, the researchers utilized Pythia, a suite of open-source LLMs ranging from 70 million to 12 billion parameters. Pythia is ideal for scientific study because the training data and order are public and consistent across model sizes, allowing for direct comparisons.
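In practice, checking \(k\)-extractability amounts to prompting a model with the first \(k\) tokens of a training sequence and greedily decoding the continuation. Below is a hedged sketch using a Pythia checkpoint via the Hugging Face transformers library; the checkpoint name, context length, and continuation length are illustrative choices, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup: the smallest Pythia checkpoint; any size would work.
model_name = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def is_k_extractable(text, k, continuation_len=32):
    """Prompt the model with the first k tokens of `text` and check whether
    greedy decoding reproduces the next `continuation_len` tokens verbatim."""
    ids = tokenizer(text)["input_ids"]
    prompt = torch.tensor([ids[:k]])
    target = ids[k:k + continuation_len]
    with torch.no_grad():
        output = model.generate(prompt,
                                max_new_tokens=len(target),
                                do_sample=False)  # greedy decoding
    return output[0, k:].tolist() == target
```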
The Macro View: Scaling and Capacity
The first question the researchers tackled is one of scale. Does a bigger model simply memorize more?
The answer is yes, but with nuance. The relationship between model size and memorization is not linear. As shown in Figure 2 below, the number of memorized sentences increases as models scale from 70M to 12B parameters. However, notice the logarithmic scale.

Key Observations from the Data:
- Capacity Limits: The growth of memorized sentences slows down at larger sizes. This suggests a “maximum capacity” for memorization; simply making a model larger doesn’t mean it will eventually memorize the entire internet.
- Context Matters: Look at chart (c) in Figure 2. The number of memorized sentences grows almost exponentially as the context size (the length of the prompt) increases. This suggests that large models have “latent” memories that are only unlocked when provided with a sufficiently long or specific prompt.
- Partial Memorization: Most training data remains unmemorized. The vast majority of sentences have low memorization scores, meaning the model “understands” the general idea but doesn’t recite the text verbatim.
The Dynamics of Forgetting and Learning
Do models “learn” to memorize specific sentences as they get bigger? The researchers tracked specific sentences across different model sizes to see how their status changed.

Figure 3 visualizes these transitions. The diagonal line is the most prominent feature, which tells us something important: Status is sticky. If a sentence is unmemorized in a small model (410M), it is highly likely to remain unmemorized in a larger model (2.8B).
However, there is a drift toward memorization. The transition from “low” to “medium” or “high” memorization is more common than the reverse. Interestingly, there is also a randomness factor—some sentences that were memorized by smaller models are “forgotten” by larger ones, suggesting that memorization isn’t just about data importance, but also involves stochastic (random) training dynamics.
The Micro View: Input Dynamics and the “Boundary Effect”
This is where the research breaks new ground. The authors asked: Is there a signal in the input text that warns us when the model is about to start reciting memorized content?
They analyzed the n-gram frequency of the text. This measures how common a sequence of words is within the pre-training corpus.
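A rough way to reproduce this kind of analysis is to count n-grams over a tokenized reference corpus and then look up the n-gram ending at each position of a sentence. The sketch below is illustrative only: the helper names are ours, and a real pre-training corpus would require streaming and far more efficient counting.

```python
from collections import Counter

def ngram_counts(corpus_tokens, n):
    """Count every n-gram in a tokenized corpus; stands in for statistics
    gathered over the real pre-training data."""
    return Counter(tuple(corpus_tokens[i:i + n])
                   for i in range(len(corpus_tokens) - n + 1))

def per_position_frequency(sentence_tokens, counts, n):
    """Frequency of the n-gram ending at each position of a sentence.
    A sharp drop near the decoding start point corresponds to the
    'negative boundary effect' described below."""
    freqs = []
    for i in range(n - 1, len(sentence_tokens)):
        gram = tuple(sentence_tokens[i - n + 1:i + 1])
        freqs.append(counts.get(gram, 0))
    return freqs
```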
They discovered a phenomenon called the Boundary Effect.

Figure 4 reveals this distinct behavior. The X-axis represents the position in a sentence, with the dotted blue line marking the “Decoding Start Point”—the moment the prompt ends and the model begins generating text.
- For Unmemorized Sentences (Negative Boundary Effect): Look at the green and magenta lines. Right before the model starts generating unmemorized text, the frequency of the input tokens drops sharply. The model encounters a rare or unique sequence in the prompt, loses its “grounding” in the training data, and is forced to generate novel text.
- For Memorized Sentences (Positive Boundary Effect): Conversely, for memorized content (the blue and red lines), the frequency often ticks upward or stays relatively stable. High-frequency inputs act as a trigger, leading the model down a well-trodden path of memorization.
This suggests that the “rarity” of the prompt is a strong predictor of whether the LLM will hallucinate/create (unmemorized) or recite (memorized).
The Micro View: Output Dynamics
Once the model starts generating, what happens to its internal state? The researchers looked at two factors: Embeddings and Entropy.
Embedding Clusters
Embeddings are the vector representations of text inside the model. The researchers visualized how these embeddings evolve as the model generates a sentence.
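A simple way to approximate this kind of analysis is to pull hidden states out of the model and project them with PCA. The sketch below is an assumption-laden illustration: it mean-pools the final layer of a Pythia checkpoint, which is our choice of sentence embedding, not necessarily the paper's.

```python
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def sentence_embedding(text):
    """Mean-pool the final-layer hidden states into one vector per sentence
    (the pooling choice is ours, for illustration)."""
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states[-1]
    return hidden.mean(dim=1).squeeze(0).numpy()

# Project a handful of generated sentences into 2D for plotting.
sentences = ["The quick brown fox jumps over the lazy dog.",
             "Insulin regulates blood glucose levels in mammals."]
points = PCA(n_components=2).fit_transform(
    [sentence_embedding(s) for s in sentences])
```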

Figure 5 shows a PCA visualization of these embeddings. The key finding here is clustering. Sentences with high memorization scores cluster closely together in the embedding space.
Even more interestingly, sentences that are partially memorized or paraphrased often sit very close to the verbatim memorized sentences. This implies the existence of paraphrased memorization—the model remembers the semantic “gist” of a training document so strongly that even if it changes the words, the internal representation is nearly identical to the original data.
Entropy and Confidence
Entropy in LLMs is essentially a measure of uncertainty. High entropy means the model is considering many different possible next words; low entropy means it is very sure of what comes next.
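Concretely, this per-token entropy can be computed directly from the model's output logits, as in the short sketch below (the helper name is ours).

```python
import torch.nn.functional as F

def token_entropies(logits):
    """Per-position entropy of the next-token distribution.
    `logits` has shape (seq_len, vocab_size); low values mean the model
    is highly confident about what comes next."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

# Illustrative usage with a causal LM from the earlier snippets:
# logits = model(input_ids).logits[0]   # (seq_len, vocab_size)
# entropies = token_entropies(logits)
```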

Figure 6 shows an “Inverse Boundary Effect” for entropy compared to the frequency analysis:
- Unmemorized (High Entropy): When the model generates unmemorized text, entropy spikes. The model is “creative” or “guessing,” choosing from a wider pool of possibilities.
- Memorized (Low Entropy): When generating memorized text, entropy drops significantly. The model is highly confident. It knows exactly what word follows the previous one because it is reciting a stored pattern.
This confirms the link between statistical frequency and model confidence. High-frequency inputs lead to low-entropy, high-confidence memorization.
Predicting Memorization
Given these clear signals—frequency dips and entropy spikes—can we build a system to predict if an LLM is currently memorizing data?
The researchers trained a small Transformer model to act as a “watchdog.” It takes the LLM’s internal states and statistics as input and tries to predict, token by token, whether the output is memorized.
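The paper's exact predictor architecture and feature set are not reproduced here, but a minimal sketch of such a token-level "watchdog" classifier built on PyTorch's TransformerEncoder might look like the following; the feature and layer sizes are invented for illustration. During training, a model like this would be fit with a per-token cross-entropy loss against the gold memorized/unmemorized labels.

```python
import torch.nn as nn

class MemorizationWatchdog(nn.Module):
    """Small Transformer that maps per-token features (e.g. hidden states,
    n-gram frequency, entropy) to a memorized/unmemorized label per token.
    Sizes are illustrative, not the paper's configuration."""

    def __init__(self, feature_dim=512, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.project = nn.Linear(feature_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classify = nn.Linear(d_model, 2)  # memorized vs. unmemorized

    def forward(self, features):        # features: (batch, seq_len, feature_dim)
        x = self.project(features)
        x = self.encoder(x)
        return self.classify(x)          # per-token logits
```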

The results were promising. As shown in Table 2, the predictor achieved token-level accuracy of around 80%.

However, the difficulty of prediction varies based on the content.

Figure 7 highlights a crucial nuance: It is much easier to predict unmemorized content than memorized content. The bars on the left (low memorization scores) are higher, especially for large models (the pink bars).
Why? Because of the Boundary Effect. The signal for “I don’t know this” (a sharp drop in input frequency and a spike in entropy) is very strong and distinct. The signal for memorization is more subtle.
A Case Study in Prediction
To see this in action, we can look at specific examples from the study.

In Figure 8, the third example (about diabetic rats) is telling. The text is unmemorized (Gold label: U). The predictor correctly identifies every single token as Unmemorized (Pred: U) with extremely high confidence (0.99). The “negative boundary effect” provided a loud signal that the model was off-script.
In contrast, the second example shows the difficulty of partial memorization. The predictor labels the tokens as memorized (M), but the model's actual generation diverged from the training data, leading to a prediction error.
Conclusion: The Nature of Machine Memory
This multi-perspective analysis moves us beyond treating LLMs as mysterious black boxes. It reveals that memorization is not a binary switch but a dynamic process governed by statistical laws.
The key takeaways for understanding LLM behavior are:
- Memory has a signature: Memorization is characterized by high-frequency inputs, low entropy, and stable embedding trajectories.
- The Boundary Effect determines the path: The rarity of the first few words in a sequence often dictates whether the model will create or recite.
- Paraphrasing is a form of memory: Even when the model doesn’t output verbatim text, its internal state might still be “remembering” the training data, posing subtle privacy risks.
As models continue to scale, understanding these dynamics becomes critical. It opens the door to better “unlearning” techniques, where we might disrupt the boundary effect to prevent models from regurgitating sensitive training data, turning a verbatim recitation back into a creative generation.