In the rapidly evolving landscape of Large Language Models (LLMs), there is a massive push for longer context windows. We’ve gone from models that could handle a few paragraphs to beasts claiming to process 128k, 200k, or even 1 million tokens. But here is the critical question: just because a model accepts a million tokens, does it actually remember them?

For students and researchers entering this field, answering that question is surprisingly tricky. We traditionally rely on metrics like Perplexity (PPL) or tasks like “Needle in a Haystack” to evaluate models. However, a new research paper titled “Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-context Models” argues that these existing methods are fundamentally flawed when it comes to long-range memory.

The authors propose a novel, robust metric called the Forgetting Curve. In this post, we will break down why our current measurements fail, how the Forgetting Curve works, and what it reveals about the leading models in the industry today.

The Problem with Current Measurements

Before diving into the solution, we need to understand why measuring long-context memory is so difficult. If you feed a novel into an LLM and ask a question about the first chapter, the model’s ability to answer depends on two things:

  1. Memory: Can it recall the information?
  2. Understanding/Instruction Following: Can it understand your prompt and formulate an answer?

Most current benchmarks conflate these two abilities.

The Limitations of Perplexity and “Needles”

Perplexity is the standard metric for training language models. Roughly speaking, it measures how surprised a model is by the next token. While low perplexity is good, recent research suggests that a model can have low perplexity on long sequences without actually improving its performance on downstream long-context tasks.
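
Concretely, for a sequence of \(N\) tokens \(x_1, \dots, x_N\), perplexity is the exponentiated average negative log-likelihood the model assigns to each token given everything before it:

\[
\mathrm{PPL}(x_{1:N}) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\!\left(x_i \mid x_{<i}\right)\right)
\]

Because most tokens are easy to predict from their immediate neighbors, a model can drive this average down using purely local context, which is why a low score on long sequences says little about long-range memory.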

On the other hand, the popular “Needle in a Haystack” test hides a specific fact (the needle) in a large amount of text (the haystack) and asks the model to find it. The issue here is Prompt Dependence. If a model is not “aligned” or trained to follow instructions well, it might fail the test even if it perfectly remembers the needle.
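
To make that prompt dependence concrete, here is a minimal sketch of how a needle-in-a-haystack test is typically constructed; the filler text, needle, and question are illustrative placeholders, not the exact setup of any particular benchmark:

```python
def build_needle_prompt(haystack_paragraphs, needle, question, depth=0.5):
    """Hide `needle` at a relative `depth` inside the haystack, then ask about it."""
    insert_at = int(len(haystack_paragraphs) * depth)
    paragraphs = haystack_paragraphs[:insert_at] + [needle] + haystack_paragraphs[insert_at:]
    context = "\n\n".join(paragraphs)
    return f"{context}\n\nQuestion: {question}\nAnswer:"

# The model has to do two things at once: remember the needle AND follow the instruction.
prompt = build_needle_prompt(
    haystack_paragraphs=["Some filler paragraph about an unrelated topic."] * 200,
    needle="The access code for the archive room is 4812.",
    question="What is the access code for the archive room?",
)
```

A base model that was never instruction-tuned can fail this prompt while still “remembering” the needle perfectly, which is exactly the confound the authors want to remove.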

The authors of this paper categorize these limitations into specific buckets, such as Limited Memory Usage (LMU), for metrics that don’t actually require long memory, and Prompt Required (PR), for tests that only work if the model can follow a prompt.

Table 1: Overview of long-context measurements and limitations.

As shown in Table 1 above, almost every existing benchmark suffers from these limitations. They either rely too much on prompts, use toy datasets that models can overfit to, or can’t distinguish between a model being “smart” and a model actually “remembering.”

The Core Method: The Forgetting Curve

To solve this, the researchers introduced the Forgetting Curve. The genius of this method is that it treats memory as an “emergent copy capability.” It doesn’t ask the model to answer a question; it simply tests if the model can predict a sequence it has just seen.

How It Works

The method involves plotting two distinct curves to visualize the gap between what the model knows generally and what it remembers specifically.

  1. Copy Accuracy Curve (The Memory Test): The model is fed a sequence of text (the context). Then, it is fed the beginning of that same sequence again and must predict the rest. This is essentially checking: “You just saw this text; can you replicate it?” This uses a technique called Teacher Forcing, where the model predicts the next token based on the correct previous tokens.

  2. LM Accuracy Curve (The Baseline): We need to control for the fact that the model might guess the text simply because it knows English (or whatever language is used). To measure this, the model is fed an irrelevant, random text prefix, and then asked to predict the target sequence. This measures the model’s natural ability to predict the text without having seen it in the context.

Figure 2: The forgetting curve task measures prediction accuracy under copy and language modeling settings.

Figure 2 illustrates this setup. The top sequence shows the copy task (the model sees the text, then predicts it). The bottom shows the LM task (irrelevant prefix, then predict the text).
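
Both curves boil down to the same teacher-forced scoring pass over a prefix-plus-target sequence; only the prefix changes. The sketch below shows the idea with a Hugging Face causal LM. The model name, helper functions, and the single-forward-pass simplification are illustrative assumptions, not the authors’ implementation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any causal LM with a long enough context window
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto").to(device)
model.eval()

@torch.no_grad()
def teacher_forced_accuracy(prefix_ids: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Feed prefix + target in one pass and score next-token predictions on the target span."""
    input_ids = torch.cat([prefix_ids, target_ids]).unsqueeze(0).to(device)
    logits = model(input_ids).logits[0]
    start = prefix_ids.shape[-1]
    # Logits at position t predict token t + 1, so this slice covers exactly the target tokens.
    preds = logits[start - 1 : -1].argmax(dim=-1)
    return (preds.cpu() == target_ids).float().mean().item()

def forgetting_curve_point(text_ids: torch.Tensor, irrelevant_ids: torch.Tensor):
    """Copy accuracy vs. LM accuracy for one text length."""
    # Copy setting: the model has just seen the text and must reproduce it token by token.
    copy_acc = teacher_forced_accuracy(prefix_ids=text_ids, target_ids=text_ids)
    # LM setting: an irrelevant prefix of the same length, then the same target text.
    lm_acc = teacher_forced_accuracy(prefix_ids=irrelevant_ids[: text_ids.shape[-1]],
                                     target_ids=text_ids)
    return copy_acc, lm_acc
```

Sweeping the length of `text_ids` and plotting both accuracies against that length yields the two curves.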

Reading the Curve

By subtracting the baseline (LM Accuracy) from the Copy Accuracy, we get a clear picture of the model’s memory. The authors identify three distinct phases of memory, which can be seen in the graph below:

Figure 1: The forgetting curve of Llama-2-base-32k.

  • Fine-grained Memory (Green Zone): The model has near-perfect recall (99% copy accuracy). It knows exactly what tokens appeared.
  • Coarse-grained Memory (Blue Zone): The model isn’t perfect, but its accuracy is significantly higher than its baseline language modeling ability. It remembers the “gist” or specific parts, even if it makes errors.
  • Amnesia (Red Zone): The Copy curve drops down to meet the LM curve. At this point, the context is useless. The model is guessing based on general language rules, effectively having forgotten the specific input.
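
As a rough way to label a single point on the curve, the three zones can be expressed in a few lines. The 99% cut-off follows the paper’s description of fine-grained memory; the coarse-grained margin below is an arbitrary choice for this sketch:

```python
def memory_phase(copy_acc: float, lm_acc: float,
                 fine_threshold: float = 0.99, margin: float = 0.05) -> str:
    """Label one context length given its copy accuracy and LM baseline accuracy."""
    if copy_acc >= fine_threshold:
        return "fine-grained memory"    # near-perfect recall of the exact tokens
    if copy_acc > lm_acc + margin:
        return "coarse-grained memory"  # clearly better than guessing without the context
    return "amnesia"                    # the context no longer helps at all
```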

Why This Method is Robust

A major advantage of the Forgetting Curve is its robustness. The researchers verified this by swapping the irrelevant prefix (random noise versus text from other books) and by changing the source of the text to be copied.

Figure 3: Llama-2-7b forgetting curve robustness tests.

As shown in Figure 3, the curves remain consistent regardless of the text source. This means the Forgetting Curve measures an intrinsic property of the model architecture, not just a quirk of a specific dataset. Furthermore, because it doesn’t require prompts (like “Please summarize this”), it can be used on base models (not just chat-aligned ones) and models of any size.
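
In code terms, the robustness check amounts to re-running the same measurement with different irrelevant-prefix sources and confirming the curves barely move. The snippet below reuses the hypothetical `tokenizer` and `forgetting_curve_point` from the earlier sketch, with placeholder strings standing in for the real texts:

```python
# Placeholder texts stand in for the actual sources used in the paper.
text_ids = tokenizer("...the passage being memorized...", return_tensors="pt").input_ids[0]
prefix_sources = {
    "random_tokens": torch.randint(0, tokenizer.vocab_size, (text_ids.shape[-1],)),
    "another_book": tokenizer("...text from an unrelated book...", return_tensors="pt").input_ids[0],
}
for name, irrelevant_ids in prefix_sources.items():
    copy_acc, lm_acc = forgetting_curve_point(text_ids, irrelevant_ids)
    print(f"{name}: copy accuracy = {copy_acc:.3f}, LM accuracy = {lm_acc:.3f}")
```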

Experiments & Results

The researchers applied this methodology to 14 different open-source models, ranging from standard Transformer-based Llama models to newer architectures built on RNNs and State Space Models (SSMs).

Transformers vs. RNN/SSMs

The results provided a fascinating look into the state of current LLMs.

Table 3: Performance of open-source models in Forgetting Curve and Ruler.

Table 3 summarizes the findings. Here are the key takeaways:

  1. Llama-3 is a significant upgrade: Comparing Llama-2-base (7B) to Llama-3-base (8B), the fine-grained memory jumped from 0 tokens (meaning it could not perfectly reproduce even text it had just seen) to 4k tokens. This suggests that the massive increase in training data for Llama-3 improved its fundamental memorization capabilities.
  2. Extension Techniques Work: Techniques used to extend context windows, such as modifying RoPE (Rotary Positional Embeddings), are validated by this method. Models like Llama-2-base-32k and Yarn-Llama-2 showed coarse memory extending well into their claimed context lengths (see the RoPE-scaling sketch after this list).
  3. The RNN/SSM Struggle: This is perhaps the most critical finding. New architectures like Mamba and RWKV (which are RNN-based or hybrid designs) claim to handle very long or even infinite contexts efficiently. However, the Forgetting Curve reveals a weakness.
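
For context on point 2: Yarn-style methods rescale RoPE so that positions beyond the training length still map to rotation angles the model has seen. Yarn itself rescales frequencies non-uniformly, but the simplest variant, plain position interpolation, captures the flavor. The function below is an illustrative sketch, not any model’s actual implementation:

```python
import torch

def rope_angles(positions: torch.Tensor, head_dim: int,
                base: float = 10000.0, scale: float = 1.0) -> torch.Tensor:
    """Rotary-embedding angles; scale > 1 squeezes long sequences into the trained position range."""
    inv_freq = base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    return torch.outer(positions.float() / scale, inv_freq)  # shape: (seq_len, head_dim // 2)

# A model trained on 4k positions but evaluated at 32k could use scale = 32768 / 4096 = 8,
# so every rotation angle stays within the range seen during training.
angles = rope_angles(torch.arange(32768), head_dim=128, scale=8.0)
```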

Let’s look at the curves for these non-Transformer models:

Figure 17: Yarn-Llama-2 vs Figure 19: Mamba.

Look specifically at the Mamba chart (Figure 19 in the bottom right of the image above). Notice that the “Fine Length” (perfect copy) is essentially zero. The model struggles to retain exact token-level memory over distance. While these models might be good at capturing general semantics, they lack the high-fidelity retrieval capability of attention-based Transformers.
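
One intuition for this gap, stated as a general architectural property rather than a claim from the paper: attention keeps every past token’s key and value in its cache, so the exact tokens remain physically retrievable, whereas an RNN or SSM compresses the entire history into a fixed-size state that gets partially overwritten at every step. A toy contrast:

```python
import torch

d_model, seq_len, state_size = 64, 10_000, 64

# Attention: the key/value cache grows with the sequence (O(L) memory),
# so information about any specific past token is still present at retrieval time.
kv_cache = torch.zeros(seq_len, 2, d_model)

# RNN/SSM: one fixed-size state (O(1) memory) summarizes everything seen so far;
# each update overwrites part of what was stored, so exact recall fades with distance.
W = 0.1 * torch.randn(state_size, state_size)
U = 0.1 * torch.randn(state_size, d_model)
state = torch.zeros(state_size)
for x in torch.randn(seq_len, d_model):
    state = torch.tanh(W @ state + U @ x)
```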

The Perplexity Trap

One of the most educational aspects of this paper is the debunking of Perplexity as a memory metric. The authors trained a custom model, Llama-XL (based on Transformer-XL), which is designed to have lower perplexity on long sequences.

Figure 5: The forgetting curve vs perplexity for Llama-XL.

In Figure 5, look at the bottom graph. The Perplexity (PPL) drops significantly and stays low as the sequence gets longer. In traditional analysis, we would say, “This model uses long context well!”

But look at the top graph (the Forgetting Curve). The Copy Accuracy (orange) drops to meet the LM Accuracy (blue) almost immediately. Despite the great perplexity score, the model has no memorization capability for the long context. This confirms a suspicion held by many in the field: Perplexity measures short-term predictability, not long-term memory.

Conclusion and Implications

The “Forgetting Curve” provides a much-needed reality check for the LLM industry. It separates the hype of “claimed context length” from the reality of what a model can actually retain.

Key Takeaways for Students:

  1. Memory \(\neq\) Understanding: A model can be smart (high reasoning) but have poor long-term memory. Conversely, a model can have great memory but fail at following instructions. The Forgetting Curve isolates the memory component.
  2. Architecture Matters: The attention mechanism in Transformers seems far better at “exact recall” than current RNN/SSM approaches, which may be lossy over long distances.
  3. Beware of Metrics: Never trust a single metric like Perplexity to tell the whole story. A model might “cheat” PPL without actually learning to use the long context.

As we move toward agents and models that need to read entire books or codebases, metrics like the Forgetting Curve will be essential for identifying which models can actually hold onto that information, and which ones are just hallucinating based on the vibe of the text.