The promise of Large Language Models (LLMs) often feels boundless, but in practice, it is strictly limited by memory. Whether you are summarizing a massive legal contract, analyzing a full-length novel, or maintaining a chat history that spans weeks, you eventually hit a wall: the context window.
For cloud-based giants like GPT-4 or Claude 3, simply throwing more GPUs at the problem can extend this window to 100K or even 1M tokens. But what happens when we want to bring this intelligence to the “edge”—to our laptops and mobile phones? In these memory-constrained environments, we cannot simply add more RAM. When the input sequence grows too long, the application crashes or slows to a crawl.
This creates a paradox: we want our personal devices to understand “infinite” context, but they only have finite memory.
In this post, we will dive deep into InfiniPot, a groundbreaking framework presented by researchers from Hanyang University and Qualcomm AI Research. InfiniPot allows pre-trained LLMs to process indefinitely long sequences using a fixed, limited amount of memory—without any additional training. We will explore how they use a clever “boiling pot” analogy to distill information on the fly, ensuring that even a smartphone can find a needle in a haystack of a million tokens.
The Memory Bottleneck
To understand why InfiniPot is necessary, we first need to look at how LLMs “remember” what they have just read.
The KV Cache Constraint
When an LLM generates text, it doesn’t just look at the current word; it looks back at everything that came before it to calculate “attention.” To avoid redoing that math for every previous token at each generation step, the model stores the intermediate results in what is called a Key-Value (KV) Cache.
In a standard setup, the KV cache grows linearly with the input length. If you feed the model a 100-page document, the KV cache grows massive. On a server with 80GB of VRAM, this is manageable. On a mobile device with 8GB of shared RAM, this is a showstopper.
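To make that concrete, here is a back-of-envelope estimate of KV-cache size. The shapes assume a LLaMA-3-8B-style model (32 layers, 8 KV heads, head dimension 128, 16-bit values); the numbers are illustrative, not taken from the paper:

```python
# Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_value
def kv_cache_bytes(seq_len, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

for tokens in (4_096, 32_768, 1_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>9,} tokens -> {gib:6.1f} GiB of KV cache")
# ~0.5 GiB at 4K, ~4 GiB at 32K, and ~122 GiB at 1M tokens -- far beyond a phone's RAM.
```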
The Failure of Existing Solutions
Researchers have tried to solve this before, but most solutions fall into two buckets, neither of which is ideal for edge devices:
- Sliding Window Attention (SWA): The model simply forgets everything older than a certain point. It’s like having a conversation with someone who forgets what you said 30 seconds ago. You save memory, but you lose long-term coherence.
- Post-Hoc Compression (e.g., SnapKV): These methods act like a highlighter, scanning the whole document and keeping only the important sentences. The problem? You need to load the entire document into memory first to decide what to keep. If your device doesn’t have the memory to hold the document in the first place, this approach fails.
We need a method that can compress information as it arrives, never letting the memory usage spike above a fixed limit.
Enter InfiniPot: The “Perpetual Stew” of AI
The researchers propose a method called Continual Context Distillation (CCD). They liken their approach to a cooking pot. You can keep adding ingredients (tokens) indefinitely. When the pot nears the brim, you don’t buy a bigger pot. Instead, you “boil down” the contents, keeping the concentrated flavor (essential information) and discarding the bulk (unnecessary tokens) to make room for new ingredients.

As shown in Figure 1, standard methods (Panel a) let the memory grow indefinitely. InfiniPot (Panel b) maintains a fixed memory budget—the “Memory Pot.”
How Continual Context Distillation Works
The workflow of InfiniPot is a cycle of Consume \(\rightarrow\) Compress \(\rightarrow\) Repeat, sketched in code after the list below:
- Fill: The model processes incoming text until the KV cache (the pot) is full.
- Assess: It calculates which tokens in the current pot are important.
- Evict: It keeps only the most important tokens (the distilled essence) and discards the rest.
- Refill: The freed-up space is used for the next chunk of input text.
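Here is a minimal, self-contained sketch of that loop. The integer tokens and the `score_tokens` hook are placeholders for the CaP/NuC scoring described below, not the authors’ implementation:

```python
# A toy Continual Context Distillation loop: fill a fixed-size "pot",
# score its contents when it is full, evict all but the top-scoring tokens,
# and keep streaming.

POT_SIZE = 8   # fixed cache budget (tiny, for demonstration)
KEEP = 4       # how many tokens survive each "boil-down"

def distill_stream(stream, score_tokens):
    pot = []                                    # the fixed-size "memory pot"
    for token in stream:
        pot.append(token)                       # Fill
        if len(pot) >= POT_SIZE:                # the pot is full
            scores = score_tokens(pot)          # Assess: importance per token
            keep_idx = sorted(range(len(pot)), key=lambda i: scores[i])[-KEEP:]
            pot = [pot[i] for i in sorted(keep_idx)]   # Evict, preserving order
        # Refill happens implicitly as the loop continues
    return pot

# Example: treat larger values as "more important" just to show the mechanics.
print(distill_stream(range(20), score_tokens=lambda pot: pot))  # -> [16, 17, 18, 19]
```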
The critical challenge here is the Assess step: how do you decide what is important?
In a standard Transformer, “importance” is usually determined by how much future tokens pay attention to past tokens. But in a streaming scenario, we haven’t seen the future yet. How can the model know that a specific name mentioned on page 1 will be crucial for answering a question on page 50?
InfiniPot solves this with two novel metrics: the Catalyst Prompt (CaP) and Novelty under Compression (NuC).
The Core Method: Determining Importance
InfiniPot combines two different perspectives to decide which tokens to keep: one looking forward (Representativeness) and one looking backward (Novelty).
1. The Catalyst Prompt (CaP): Faking the Future
The researchers realized that while they don’t know the exact future tokens, they generally know what kind of task the model is performing (e.g., summarization or question answering).
They introduce a Catalyst Prompt (CaP)—a temporary, “volatile” prompt injected at the end of the current context buffer just before compression. For example, if the buffer is full, they might append a prompt like “Summarize the critical points of this section.”
The model then calculates attention scores as if it were generating a response to that prompt. Tokens that receive high attention from this Catalyst Prompt are deemed “representative” of the text and are preserved.
Ideally, we want to calculate the importance (\(u_t\)) based on all future tokens to infinity:
\[
u_t = \sum_{j=t+1}^{\infty} A_{j,t}
\]

where \(A_{j,t}\) is the attention weight that a future token at position \(j\) places on token \(t\).
Since we cannot see to infinity, the Catalyst Prompt acts as a proxy for that future context. We calculate the approximated importance (\(\tilde{u}_t\)) by looking at how the Catalyst Prompt (\(P\)) attends to the current tokens:
\[
\tilde{u}_t = \sum_{p \in P} A_{p,t}
\]
This cleverly tricks the model into revealing which parts of the current text contain the “load-bearing” information necessary for future tasks.
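As a rough sketch of how such a score could be computed: rank context tokens by the attention mass the appended prompt tokens place on them. The attention tensor below is random; a real implementation would read the weights from the model’s forward pass over the context plus the Catalyst Prompt:

```python
import torch

n_ctx, n_prompt, n_heads = 12, 4, 8        # context tokens, catalyst-prompt tokens, heads
seq = n_ctx + n_prompt

# Stand-in attention weights for [context + catalyst prompt], shape (heads, seq, seq)
attn = torch.softmax(torch.randn(n_heads, seq, seq), dim=-1)

# Approximated importance u~_t: attention from every catalyst-prompt position p
# onto each context position t, summed over prompt tokens and averaged over heads.
u_tilde = attn[:, n_ctx:, :n_ctx].sum(dim=1).mean(dim=0)   # shape (n_ctx,)

keep = torch.topk(u_tilde, k=6).indices.sort().values      # keep the top-6 "representative" tokens
print(keep)
```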

As seen in Figure 2, using a Catalyst Prompt (CaP-G or CaP-Q) dramatically improves the “Hit Rate”—the overlap between what InfiniPot keeps and what a model with infinite memory would have attended to. Without it (the green line), the model performs poorly, similar to a basic sliding window.
2. Novelty under Compression (NuC): Valuing the Unique
The Catalyst Prompt finds the “popular” information. However, sometimes important information isn’t popular—it’s just unique. If the context repeats the same concept five times, the Catalyst Prompt might highlight all five instances. We only need to store it once.
To capture this, InfiniPot uses Novelty under Compression (NuC). This metric asks: “How surprised is the model by this token, given what we have already compressed?”
Mathematically, this uses the negative log-likelihood (entropy). A high value means the token is unexpected (high information density); a low value means it is predictable (redundant).
The ideal calculation looks at the probability of a token (\(x_t\)) given all previous history:
\[
n_t = -\log P(x_t \mid x_{<t})
\]
In the InfiniPot framework, this is adapted to consider the compressed context (\(c\)) currently in the pot:
\[
\tilde{n}_t = -\log P(x_t \mid c,\, x_{<t})
\]

where \(x_{<t}\) now ranges only over the tokens of the current, not-yet-compressed chunk.
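A minimal sketch of computing such a score, assuming we already have next-token logits from a forward pass over the compressed cache plus the current chunk (the random logits below are placeholders):

```python
import torch
import torch.nn.functional as F

vocab, chunk_len = 32000, 10
logits = torch.randn(chunk_len, vocab)          # next-token logits at each position
tokens = torch.randint(0, vocab, (chunk_len,))  # the tokens that actually arrived

# n~_t = -log P(x_t | c, x_<t): high value = surprising/novel, low value = redundant
nuc = F.cross_entropy(logits, tokens, reduction="none")  # shape (chunk_len,)
print(nuc)
```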
By combining these two scores, InfiniPot ensures a diverse cache. It first fills a reserved number of slots with the most “Novel” tokens (NuC), ensuring unique facts are kept. Then, it fills the remaining space with the most “Representative” tokens (CaP), ensuring the main themes are preserved.
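A small sketch of that two-stage selection; the 25% novelty reservation is an illustrative assumption, not the paper’s setting:

```python
import torch

def select_tokens(nuc_scores, cap_scores, budget, novelty_ratio=0.25):
    # Stage 1: reserve some slots for the most novel tokens (NuC).
    n_novel = int(budget * novelty_ratio)
    novel_idx = torch.topk(nuc_scores, k=n_novel).indices
    # Stage 2: fill the remaining slots with the most representative tokens (CaP),
    # masking positions already chosen so they are not counted twice.
    cap = cap_scores.clone()
    cap[novel_idx] = float("-inf")
    rep_idx = torch.topk(cap, k=budget - n_novel).indices
    return torch.cat([novel_idx, rep_idx]).sort().values   # keep original order

kept = select_tokens(torch.rand(100), torch.rand(100), budget=32)
print(kept.shape)   # torch.Size([32])
```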

Figure 3 illustrates the synergy. The bottom chart shows “Token Entropy” (a measure of information richness). The combined method (CaP + NuC, purple line) consistently retains more information-rich tokens than using the Catalyst Prompt alone (green line), closely hugging the theoretical maximum (black dashed line).
3. Context-Reset Rotary Positional Embedding (CR-RoPE)
There is one final technical hurdle. LLMs use Positional Embeddings (specifically RoPE) to understand the order of words. If you delete tokens from the middle of a sequence (e.g., removing tokens 5 through 50), the relative distances between the remaining tokens become distorted, confusing the model.
InfiniPot introduces Context-Reset RoPE. After every compression cycle, it re-assigns the position IDs of the retained tokens to be continuous (0, 1, 2, 3…). This “de-fragments” the memory, ensuring the model sees a coherent stream of information rather than a disjointed mess. This step effectively prevents the “Out-of-Distribution” (OOD) errors that plague other compression methods.
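A tiny sketch of the position reset; the surviving positions below are made up for illustration:

```python
import torch

kept_positions = torch.tensor([0, 3, 7, 42, 918, 2047])  # original positions that survived eviction
reset_positions = torch.arange(kept_positions.numel())    # re-indexed: 0, 1, 2, 3, 4, 5
print(reset_positions)
# In practice the retained keys are re-rotated with these fresh, contiguous IDs
# before the next chunk arrives, so relative distances stay in-distribution.
```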
Experiments & Results
The researchers put InfiniPot to the test on LongBench (a comprehensive suite of long-context tasks) and Needle in a Haystack (finding a specific fact hidden in a massive text).
Beating the Heavyweights on LongBench
The results on LongBench are striking. InfiniPot was restricted to a tiny memory budget (e.g., 2K or 4K tokens), yet it was compared against models allowed to use their full 32K or 128K context windows.

In Table 1, look at the L3-InfiniPot-4K row (LLaMA-3 with a 4K InfiniPot budget). It achieves an average score of 41.50.
- It is effectively on par with L3-SnapKV-4K (41.94), even though SnapKV must load and process the entire context before compressing, which InfiniPot avoids.
- It significantly outperforms StreamingLLM and H2O, which struggle to keep the right context.
- Amazingly, it is competitive with L3-PT-8K (Full 8K context), which scores 42.91.
InfiniPot effectively turns a 4K memory slot into a window capable of processing much larger documents without losing significant performance.
The “Needle in a Haystack” Test
This test is the ultimate proof of long-term memory. A specific “passkey” is hidden somewhere in a long text, and the model is asked to retrieve it.

Figure 4 is perhaps the most impressive chart in the paper.
- The X-axis shows context length on a log scale, going up to 1 Million tokens.
- Standard models (like LongChat or LongAlpaca) crash or fail as the length increases (see the lines dropping off).
- InfiniPot (Ours, blue and orange lines) maintains nearly 100% accuracy even as the context length scales to 1M.
This proves that the “boil down” method successfully retains the needle (the critical passkey) even while discarding 99.9% of the haystack (the rest of the text).
Efficiency: Speed and Memory
It’s not enough to be accurate; on a mobile device, you must be fast and light.

Figure 7 compares the computational cost:
- Memory (Top Left): The red line (AllKV, standard caching) keeps climbing as the input grows longer. InfiniPot (blue) stays flat: it uses a constant amount of RAM regardless of whether the document is 10 pages or 1,000 pages.
- Latency (Top Right & Bottom Right): InfiniPot is consistently faster. Because it keeps the cache size small, the time it takes to generate the next token remains low.
Qualitative Success: Why it Works
To make this concrete, let’s look at a specific example from the HotpotQA dataset. The model needs to answer: “Which utility holding company did Alfred A. Marcus work for as a consultant?”
This requires connecting two facts found far apart in the text:
- Marcus worked for “Xcel Energy”.
- “Xcel Energy” is a utility holding company.

As shown in Table 8, a baseline compression method (SirLLM) decided the paragraph about Xcel Energy wasn’t important and deleted it to save space. The model failed to answer.
InfiniPot, utilizing the Novelty (NuC) and Catalyst (CaP) scores, recognized that the specific entity names and their definitions were unique and representative. It kept those specific sentences in the “pot,” allowing the model to correctly answer “Xcel Energy.”
Conclusion and Implications
InfiniPot represents a significant shift in how we think about LLM memory. Instead of trying to build hardware that can hold infinite context, InfiniPot adapts the software to fit the hardware we already have.
By intelligently distilling context on the fly using Catalyst Prompts and Novelty scores, InfiniPot allows:
- Infinite Context on Edge Devices: A smartphone with limited RAM can now process entire books or long chat histories.
- Privacy: Processing can happen locally on the device rather than sending data to a cloud server with massive GPUs.
- Efficiency: Constant memory usage means predictable battery drain and performance.
The “Memory Pot” analogy serves as a powerful reminder: you don’t need to remember every single word to understand a story; you just need to remember the parts that matter. With InfiniPot, LLMs are finally learning that lesson.