The era of the “million-token context window” is here. With models like Claude 3 and Gemini 1.5, we are moving away from short prompts into the territory of processing entire books, codebases, and legal archives in a single go.

But there is a catch: hardware hasn’t scaled as fast as our ambitions. Processing a sequence of 1 million tokens requires massive computational resources and GPU memory. If you try to feed a text that long into a standard GPU, you will almost certainly hit an “Out of Memory” (OOM) error before the model even generates the first word.

This bottleneck happens primarily during the prefilling stage, the phase where the model reads and processes your input prompt. To address it, researchers from Fudan University and Huawei Noah’s Ark Lab have proposed a clever, intuitive method in their paper “Memorize Step by Step.” Their approach, built on Incremental Memory (IM) and its extension, Incremental Memory with Decremental Chunk (IMDC), shows that we don’t need to treat every part of a sequence equally. By dynamically adjusting memory and chunk sizes during prefilling, we can process extremely long contexts faster and with far less GPU memory.

In this post, we will deconstruct this paper, explaining why traditional methods are wasteful and how this new “step-by-step” strategy optimizes Large Language Model (LLM) inference.

The Problem: The Heavy Cost of Prefilling

LLM inference consists of two distinct stages:

  1. Prefilling: The model processes the input prompt (the context) to compute the Key-Value (KV) cache.
  2. Decoding: The model generates the output token by token.

While much research focuses on optimizing the decoding stage, prefilling is often the bottleneck for long contexts. The attention mechanism in Transformers has a quadratic time complexity, \(O(L^2)\), meaning that doubling the input length quadruples the computation time. Furthermore, storing the KV cache for massive inputs demands more VRAM than most GPUs possess.
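To make the asymmetry between the two stages concrete, here is a small back-of-the-envelope script (my own illustration, not from the paper) that counts attention score entries; the function names and numbers are purely illustrative.

```python
# Back-of-the-envelope cost model for self-attention (illustrative only).
def prefill_attention_entries(length: int) -> int:
    """Prefilling scores every query token against every key token: O(L^2)."""
    return length * length

def decode_step_attention_entries(length: int) -> int:
    """One decoding step scores a single new query against L cached keys: O(L)."""
    return length

for L in (4_000, 8_000, 16_000):
    print(f"L={L:>6}: prefill={prefill_attention_entries(L):>13,}"
          f"  one decode step={decode_step_attention_entries(L):>7,}")
# Doubling the input length roughly quadruples prefill work,
# but only doubles the cost of a single decoding step.
```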

The Standard Solution: Chunking with Fixed-Size Memory

When an input is too long to fit in GPU memory, the standard workaround is Iterative Compression. The sequence is divided into smaller segments, or “chunks.” The model processes one chunk at a time, compressing the information into a “memory” buffer (the KV cache) that is passed to the next chunk.

Traditionally, this memory buffer has a fixed size: whether the model is reading the 1st chunk or the 100th, it allocates the same amount of memory to store historical context.
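In code, the loop looks roughly like the sketch below. `model` and `compress` are hypothetical stand-ins: `model(chunk, memory)` is assumed to return the KV cache for the memory plus the current chunk, and `compress(kv, budget)` (think SnapKV or StreamingLLM) trims that cache down to a fixed budget.

```python
def chunked_prefill_fixed(tokens, chunk_len, mem_budget, model, compress):
    """Fixed-size-memory prefill: every chunk gets the same memory budget."""
    memory = None  # compressed KV cache carried over from previous chunks
    for start in range(0, len(tokens), chunk_len):
        chunk = tokens[start:start + chunk_len]
        kv = model(chunk, memory)           # attend over memory + current chunk
        memory = compress(kv, mem_budget)   # always keep mem_budget entries
    return memory  # final compressed context, ready for decoding
```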

Figure 3: The difference between Fixed-Size Memory, Incremental Memory, and IMDC architectures.

As shown in the top row of Figure 3 (FM), the Fixed-Size Memory approach allocates a static block of memory (the yellow box) right from the start.

But is this efficient? The researchers asked a fundamental question: Do we really need a large memory buffer at the very beginning of the sequence?

The Insight: Early Memory Matters Less

To answer that question, the authors analyzed how much “attention” the model actually pays to its memory at different stages of processing. They used two popular compression methods, SnapKV and StreamingLLM, to visualize attention scores.

Figure 1: (a) Average attention scores of memory over time. (b) Distribution of memory content across chunks.

The results in Figure 1(a) are revealing. The red and cyan lines show the average attention scores of the memory. Notice that at the beginning (low Chunk ID), the attention scores are very low. They gradually increase as processing continues.

This suggests that early-stage memory has minimal influence on the next step of computation.

Furthermore, Figure 1(b) shows the distribution of memory content at the end of the process. The vast majority of the retained information comes from the most recent chunks (the tall bars on the right). Most of the early-stage memory is discarded or ignored by the time the model reaches the end.

The Conclusion: Maintaining a large, fixed-size memory buffer during the early stages of prefilling is a waste of both GPU memory and computational power.

The Solution: Incremental Memory (IM)

Based on this insight, the authors propose Incremental Memory (IM). Instead of allocating the maximum memory size immediately, the memory buffer starts small and gradually grows as the model processes more chunks.

Referencing the middle row of Figure 3 again, you can see the “Incremental Memory” strategy. The memory (orange blocks) starts tiny and expands step by step.

Mathematically, if \(n\) is the number of chunks, \(m_{max}\) is the maximum memory limit, and \(m_0\) is the initial memory, the memory size at step \(i\) increases linearly:

Equation for linear memory growth.
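One plausible form consistent with these definitions (the paper’s exact schedule may differ in bookkeeping details) is

\[
m_i = m_0 + \frac{i}{n}\left(m_{max} - m_0\right), \qquad i = 1, \dots, n,
\]

so the buffer starts near \(m_0\) and only reaches \(m_{max}\) for the final chunks.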

Why does this help?

By keeping the memory small in the early steps, the attention calculation—which depends on the memory size—becomes significantly faster. The time complexity drops because the average memory size over the whole process is roughly half of the maximum size (\(m_{max}/2\)).
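A quick sanity check with illustrative numbers (not taken from the paper) shows why: summing the per-chunk attention work, which scales with the chunk length times the chunk-plus-memory length, under a linearly growing memory versus a fixed one roughly halves the memory-dependent term.

```python
# Illustrative comparison: total attention work, fixed vs. incremental memory.
# c = chunk length, m_max = memory cap, m0 = initial memory, n = number of chunks.
c, m_max, m0, n = 4096, 8192, 512, 64

fixed = sum(c * (c + m_max) for _ in range(n))
incremental = sum(c * (c + m0 + (i / n) * (m_max - m0)) for i in range(1, n + 1))

print(f"fixed-size memory : {fixed:,}")
print(f"incremental memory: {incremental:,.0f} ({incremental / fixed:.1%} of fixed)")
# The average memory a chunk attends to is about (m0 + m_max) / 2, i.e. ~m_max / 2.
```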

Adaptive Growth

The researchers took this a step further. By analyzing different layers of the LLM, they found that higher layers (closer to the output) tend to retain more memory than lower layers.

Figure 6: Heatmap showing memory distribution across layers and chunks.

Figure 6 illustrates this phenomenon. In LLaMA2-7b (left), the higher layers (top rows) show a more uniform distribution of memory usage, while lower layers focus heavily on recent tokens. This led to an Adaptive Function, where memory size is customized per layer based on how much information that layer actually retains.
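As a rough sketch of the idea (the paper’s adaptive function is more principled than this), one could split a total memory budget across layers in proportion to how much memory each layer has been observed to retain. The `retention` scores below are hypothetical profiling statistics.

```python
def per_layer_memory_budgets(retention, total_budget):
    """Hypothetical adaptive allocation: give layers that retain more of their
    historical memory a proportionally larger slice of the total budget."""
    total = sum(retention)
    return [max(1, round(total_budget * r / total)) for r in retention]

# Example: higher layers (larger scores) end up with bigger memory caps.
print(per_layer_memory_budgets(retention=[1, 1, 2, 4, 8], total_budget=1600))
# -> [100, 100, 200, 400, 800]
```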

Optimization: Decremental Chunk (IMDC)

Incremental Memory makes the process faster, but it has a flaw when it comes to peak memory usage.

Even if you start with small memory, you eventually reach the maximum memory size (\(m_{max}\)) at the final step. Since your GPU memory limit is determined by the peak usage, IM doesn’t necessarily allow you to fit a larger model or batch size than the fixed method; it just finishes faster.

To solve this, the authors combine Incremental Memory with a Decremental Chunk, yielding IMDC.

The idea is a balancing act:

  • As the Memory grows larger…
  • The input Chunk should get smaller.

This ensures that the total number of tokens being processed at any single step (Chunk Length + Memory Length) remains roughly constant.

Figure 2: Visual comparison of memory and chunk sizes for Fixed, Incremental, and IMDC strategies.

Figure 2 visualizes this perfectly:

  • Fixed-Size Memory (Left): The total height (Memory + Chunk) is high right from the start.
  • Incremental Memory (Center): The chunk size is constant, but memory grows like a staircase. Peak usage is high at the end.
  • Decremental Chunk (Right): As the yellow memory staircase goes up, the blue chunk block shrinks. The total height remains stable and lower than the others.

The Math Behind IMDC

The goal is to keep the computational load stable. The chunk size \(c_i\) at step \(i\) is dynamically calculated so that the sum of the current chunk and the previous memory equals the average load:

Equation determining the chunk size for IMDC.
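Written out with the definitions above (the paper’s exact expression may differ in details), this amounts to

\[
c_i = \left(c + \hat{m}\right) - m_{i-1},
\]

where \(c\) is the base chunk length and \(m_{i-1}\) is the memory carried over from the previous step.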

Here, \(\hat{m}\) represents the average memory size. Because the chunk size decreases as memory increases, the peak GPU memory requirement is significantly reduced. This allows the model to process long contexts using less VRAM, potentially fitting on smaller consumer GPUs (like an RTX 3090) where fixed methods would crash.
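Putting the two schedules together, here is a minimal sketch (my own illustration, not the authors’ code) of an IMDC-style schedule. It shows the key property: as the carried memory grows, the chunk shrinks, so the per-step token load stays flat and well below the fixed-memory peak of chunk length plus \(m_{max}\).

```python
def imdc_schedule(n_chunks, base_chunk, m0, m_max):
    """Illustrative IMDC schedule: memory grows linearly while the chunk size
    shrinks so that (current chunk + carried memory) stays roughly constant."""
    avg_mem = (m0 + m_max) / 2
    memory = [m0 + (i / n_chunks) * (m_max - m0) for i in range(1, n_chunks + 1)]
    chunks, prev = [], 0.0
    for m in memory:
        chunks.append(max(1, round(base_chunk + avg_mem - prev)))
        prev = m
    return chunks, memory

chunks, memory = imdc_schedule(n_chunks=8, base_chunk=4096, m0=512, m_max=8192)
prev = 0.0
for step, (c_i, m_i) in enumerate(zip(chunks, memory), start=1):
    print(f"step {step}: chunk={c_i:>5}  carried_mem={prev:>6.0f}  "
          f"per-step tokens={c_i + prev:>7.0f}")
    prev = m_i
# Every step processes ~base_chunk + avg_mem tokens, versus a fixed-memory peak
# of base_chunk + m_max; in practice the schedule would be derived from the
# actual input length so that the chunks still cover the whole sequence.
```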

Experimental Results

The theory sounds solid, but how does it perform in practice? The researchers tested IM and IMDC against Fixed-Size Memory (FM) using LLaMA-2 and InternLM2 on NVIDIA A800 and RTX 3090 GPUs.

1. Speed and Memory Efficiency

Figure 4: Charts comparing Time To First Token (TTFT) and GPU Memory Usage.

Figure 4 presents the efficiency gains.

  • Top Row (Time): The green (IM) and blue (IMDC) bars are consistently lower than the red (FM) bars. On the A800 GPU using SnapKV, IMDC was up to 1.45x faster than the baseline.
  • Bottom Row (Memory): Look at the blue bars for IMDC. They show a significant drop in GPU memory usage—up to 23.3% reduction.

This confirms that IMDC is not only faster but also much lighter on hardware.

2. Model Performance (Perplexity)

Did this optimization hurt the model’s intelligence? Aggressively compressing memory or shrinking chunks could theoretically cause the model to “forget” or get confused.

Figure 5: Perplexity comparison on GitHub and Arxiv datasets.

Figure 5 shows the Perplexity (PPL), a measure of how well the model predicts text (lower is better). The performance of IM (green) and IMDC (blue) is virtually identical to, and sometimes slightly better than, Fixed-Size Memory (red). The “step-by-step” approach does not degrade the quality of the language modeling.

3. Long-Context Benchmarks

The researchers also ran the models on LongBench, a suite of tasks designed to test long-context understanding (like summarization and Q&A).

Table 2: Performance comparison on LongBench tasks.

As seen in Table 2, the differences between the methods are negligible (often less than 0.5 points). IMDC achieves comparable results to the resource-heavy Fixed-Size Memory, proving it is a “free lunch”—better efficiency without sacrificing accuracy.

4. Scaling to 1 Million Tokens

Finally, the ultimate test: Can it handle 1 million tokens?

Table 4: Evaluation on 1 million tokens.

Table 4 shows the results on a 1M token sequence. Standard “Full Attention” causes an Out-Of-Memory (OOM) error immediately. However, IMDC successfully processes the sequence with low perplexity (around 2.11) and drastically lower latency (328 seconds vs. 505 seconds for FM).

Conclusion

The “Memorize Step by Step” paper offers a compelling shift in how we think about processing long documents. By recognizing that early-stage memory is less critical and that computation loads should be balanced, the authors created a method that is:

  1. Faster: By starting with small memory.
  2. Lighter: By shrinking chunks as memory grows.
  3. Equally Capable: Maintaining high accuracy on benchmarks.

For students and practitioners, this highlights an important lesson in system design: static allocations (like fixed buffers) are rarely optimal. Dynamic, adaptive systems that react to the state of processing (like Incremental Memory) often yield significant efficiency gains. As we push toward infinite-context LLMs, strategies like IMDC will be essential for running powerful models on accessible hardware.