Solving Information Neglect in Long-Context LLMs with CItruS
Large Language Models (LLMs) like Llama 2 and Mistral have revolutionized how we interact with text. However, they possess a significant limitation: the context window. While models are getting better at handling longer sequences, processing an entire book or a massive legal document remains computationally expensive and memory-intensive.
To manage this, researchers have developed “State Eviction” methods. These techniques compress the model’s memory (the Key-Value cache) by discarding “unimportant” information as the text gets longer. It sounds like a perfect solution: keep the memory footprint low while processing infinite text.
But there is a catch. Current methods decide what to delete based on what helps the model predict the next word (fluency). They do not consider what information is actually needed to answer a specific user query. This leads to a phenomenon the authors of a recent paper call Information Neglect.
In this post, we will dive deep into the paper “CItruS: Chunked Instruction-aware State Eviction”. We will explore why current compression methods fail at downstream tasks, how CItruS solves this by separating “reading” from “solving,” and how it achieves state-of-the-art performance in retrieval without requiring any model retraining.
The Problem: Fluency vs. Utility
To understand the core innovation of CItruS, we first need to understand how LLMs usually handle memory and where the current efficiency methods go wrong.
The Key-Value Cache and Eviction
When a Transformer model generates text, it stores the representations of previous tokens in a Key-Value (KV) Cache. This prevents the model from having to re-compute the entire history for every new word it generates. However, as the input length grows, this cache grows linearly, eventually consuming all available GPU memory.
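As a concrete picture of what the cache is doing, here is a minimal single-head sketch of incremental decoding with a growing KV cache. The head dimension `d`, the random tensors, and the toy loop are illustrative assumptions, not any particular model's code:

```python
import torch

d, cache_k, cache_v = 64, [], []

def attend(q, k_new, v_new):
    """Append this step's key/value to the cache, then attend over the full history."""
    cache_k.append(k_new)
    cache_v.append(v_new)
    K = torch.stack(cache_k)                      # [steps_so_far, d]
    V = torch.stack(cache_v)
    weights = torch.softmax(q @ K.T / d ** 0.5, dim=-1)
    return weights @ V                            # attention output for the new token

# Each decoding step reuses all previously cached keys/values instead of recomputing them,
# so the cache (and its memory footprint) grows linearly with sequence length.
for _ in range(5):
    out = attend(torch.randn(d), torch.randn(d), torch.randn(d))
print(len(cache_k))  # 5 cached entries after 5 steps
```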
“State Eviction” is a strategy to prevent this memory overflow. It works by monitoring the attention weights—essentially, how much the model “looks at” specific past tokens. If a token in the cache isn’t receiving much attention, the eviction policy assumes it is unimportant and deletes it.
Existing methods like StreamingLLM, H2O, and TOVA use this approach. They effectively keep the model fluent (maintaining low perplexity) even over millions of tokens.
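To make the eviction idea concrete, here is a minimal sketch in the spirit of these methods (not the implementation of any specific paper): each cached position is scored by the attention it recently received, and the lowest-scoring entries are dropped. The tensor shapes and the `keep` budget are assumptions for illustration.

```python
import torch

def evict_low_attention(keys, values, attn_weights, keep: int):
    """Drop cached KV entries that receive the least attention.

    keys, values:  [num_cached, head_dim]    cached states for one head
    attn_weights:  [num_queries, num_cached] recent attention over the cache
    keep:          cache budget (number of states to retain)
    """
    # Average attention each cached position received from recent queries.
    scores = attn_weights.mean(dim=0)                 # [num_cached]
    # Keep the `keep` highest-scoring positions, preserving their original order.
    top = torch.topk(scores, k=min(keep, scores.numel())).indices.sort().values
    return keys[top], values[top]

# Toy usage: 10 cached states, 4 recent queries, budget of 6.
k, v = torch.randn(10, 64), torch.randn(10, 64)
attn = torch.softmax(torch.randn(4, 10), dim=-1)
k_small, v_small = evict_low_attention(k, v, attn, keep=6)
print(k_small.shape)  # torch.Size([6, 64])
```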
Defining Information Neglect
The researchers behind CItruS identified a critical flaw in these methods. They argue that fluency does not equal utility.
When an LLM decides which tokens are “important” during the reading process (Language Modeling), it prioritizes tokens that help it maintain grammatical and local coherence. However, the information required to solve a downstream task—like answering a specific question about a detail in Chapter 3—might look “unimportant” to the model when it is just trying to predict the next word in Chapter 4.
Consequently, standard eviction methods often throw away the exact “needle” the user is looking for in the “haystack.” The authors term this Information Neglect.
To prove this, the authors conducted an experiment comparing the attention patterns of a model when it is just reading context versus when it is trying to follow an instruction.

As seen in Figure 1, the attention distributions are drastically different. The purple line shows attention derived from the document context (Language Modeling), while the pink line shows attention derived from the instruction (Task Solving). The context focuses on one set of tokens, but the instruction cares about a completely different set. If we evict states based solely on the purple line, we lose the information represented by the pink line.
Quantifying the Gap
The researchers formalized this observation by calculating the “intersection” of top-k important states. They took a long document and checked which hidden states were selected as important by the document context versus which were selected by the instruction text.

Figure 2 illustrates the setup: two parallel paths process the document. One path selects states based on context attention, and the other selects based on instruction attention. They then measure the overlap.

The results in Figure 3 are stark. In the middle layers of the model (layers 10-25), the intersection ratio drops below 0.2. This means that less than 20% of the information the instruction needs is being preserved by standard context-based eviction. This is a massive loss of utility and explains why compressed models often fail at reading comprehension tasks.
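A rough sketch of how such an overlap could be measured, assuming we already have per-state importance scores derived from the context and from the instruction (the random score tensors below are stand-ins, not the paper's data):

```python
import torch

def topk_overlap(context_scores, instruction_scores, k: int) -> float:
    """Fraction of the instruction's top-k states that the context-based selection also keeps."""
    ctx_top = set(torch.topk(context_scores, k).indices.tolist())
    ins_top = set(torch.topk(instruction_scores, k).indices.tolist())
    return len(ctx_top & ins_top) / k

# Toy usage with random scores over 1,000 cached states.
ctx = torch.rand(1000)
ins = torch.rand(1000)
print(topk_overlap(ctx, ins, k=100))  # ~0.1 when the two rankings are unrelated
```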
The Solution: CItruS
To solve Information Neglect, the authors propose CItruS (Chunked Instruction-aware State Eviction). The core philosophy of CItruS is the disentanglement of two cognitive processes: Language Modeling and Task Solving.
Instead of trying to force one cache to serve both purposes, CItruS acknowledges that the “reading” process (encoding the document) and the “thinking” process (answering the question) have different attention preferences.

As shown in Figure 4, the system is split. The Language Modeling track focuses on maintaining fluency and encoding the document structure. The Task Solving track uses the specific instruction (question) to hunt for information.
Method Breakdown
CItruS introduces two major technical components: Standard Chunked State Eviction (CSE) and Instruction-Aware Eviction.
1. Standard Chunked State Eviction (CSE)
Processing tokens one by one is inefficient. CItruS processes the document in large chunks (e.g., 256 or 1024 tokens at a time).
For each chunk, the model calculates the importance of the states currently in the cache. The importance score is derived from the attention scores averaged across all tokens in the current chunk. The formula for the importance of a state \(c\) relative to a chunk \(s\) is:

\[
\text{score}(c \mid s) \;=\; \frac{1}{|s|} \sum_{t \in s} \operatorname{softmax}\!\left(\frac{Q_t K^{\top}}{\sqrt{d_k}}\right)_{c}
\]

Here, \(Q_t\) is the query vector of a token in the current chunk, \(K\) stacks the key vectors of the stored states (so \(K_c\) is the key vector of state \(c\)), \(d_k\) is the key dimension, and the softmax runs over all states currently in the cache. This formula essentially asks: “On average, how much do the tokens in the current chunk attend to this specific cached state?”
Based on this score, the model selects the top-\(k\) states to keep:

\[
C_{\text{top}} \;=\; \underset{c \,\in\, C}{\operatorname{top\text{-}}k}\ \text{score}(c \mid s)
\]
And updates the cache \(\hat{C}\) by combining the retained states with the key-value states of the current chunk:

\[
\hat{C} \;=\; C_{\text{top}} \cup S_s
\]

where \(S_s\) denotes the states produced while processing chunk \(s\).
This creates a baseline efficient model (Standard CSE) that behaves similarly to previous state eviction methods, focusing on fluency.
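Putting the pieces above together, here is a hedged sketch of one Standard CSE step for a single attention head. It assumes we can read out the chunk-over-cache attention weights; the shapes and the `budget` parameter are illustrative, not the authors' implementation.

```python
import torch

def chunked_eviction_step(cache_k, cache_v, chunk_k, chunk_v, attn, budget: int):
    """One Standard CSE step for a single attention head.

    cache_k/v: [num_cached, d]            states kept so far
    chunk_k/v: [chunk_len, d]             states of the chunk just processed
    attn:      [chunk_len, num_cached]    attention of chunk tokens over the cache
    budget:    number of old states to retain
    """
    # score(c | s): attention to each cached state, averaged over the chunk's tokens.
    scores = attn.mean(dim=0)                                   # [num_cached]
    keep = torch.topk(scores, k=min(budget, scores.numel())).indices.sort().values
    # New cache = retained old states followed by the current chunk's states.
    new_k = torch.cat([cache_k[keep], chunk_k], dim=0)
    new_v = torch.cat([cache_v[keep], chunk_v], dim=0)
    return new_k, new_v
```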
2. Instruction-Aware Eviction
This is where CItruS diverges from the status quo. To address Information Neglect, CItruS incorporates the Instruction Text (the user’s question) directly into the eviction process. The authors propose two architectures for this, as visualized in Figure 5.

Figure 5(a) shows the Standard CSE described above. It only considers the context (Chunk 2) when deciding what to keep from Chunk 1.
Figure 5(b): Individual Cache. In this design, the model maintains two separate caches:
- Cache \(C\): The standard language modeling cache, managed by the chunk attention.
- Instruction Cache \(C^I\): A specialized cache dedicated to the task.
When a chunk is processed, the model uses the Instruction Text as a query to calculate a separate set of attention scores. It then selects the top states that are relevant specifically to the instruction and stores them in \(C^I\). Mirroring the chunk-level score above, the instruction tokens \(I\) act as the queries:

\[
\text{score}(c \mid I) \;=\; \frac{1}{|I|} \sum_{t \in I} \operatorname{softmax}\!\left(\frac{Q_t K^{\top}}{\sqrt{d_k}}\right)_{c},
\qquad
C^{I} \;=\; \underset{c}{\operatorname{top\text{-}}k}\ \text{score}(c \mid I)
\]
This ensures that even if a piece of information seems irrelevant to the general flow of the document (and is dropped from \(C\)), it is preserved in \(C^I\) if the question asks about it.
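A sketch of how the instruction cache \(C^I\) might be populated, under the same assumptions as the CSE sketch above: the instruction tokens act as extra queries against the cache, and their averaged attention decides what to store. The tensor names are placeholders, not the authors' code.

```python
import torch

def update_instruction_cache(cache_k, cache_v, instr_attn, budget: int):
    """Select the cached states most attended to by the instruction tokens.

    cache_k/v:  [num_cached, d]            current language-modeling cache
    instr_attn: [instr_len, num_cached]    attention of instruction tokens over the cache
    budget:     size of the instruction cache C^I
    """
    scores = instr_attn.mean(dim=0)                              # relevance to the task
    keep = torch.topk(scores, k=min(budget, scores.numel())).indices.sort().values
    return cache_k[keep], cache_v[keep]                          # C^I for this step
```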
Figure 5(c): Shared Cache. Maintaining two caches increases memory usage. The authors hypothesized that there might be enough overlap to combine them. In the Shared Cache design, there is only one cache. However, the decision of what to keep is influenced by the instruction. The top-\(k\) states are selected based on the attention from the instruction text. These states are then used for both task solving and language modeling.
Surprisingly, as we will see in the results, the Shared Cache approach works exceptionally well, suggesting that the “intersection” states (useful for both context and instruction) are often sufficient to maintain fluency.
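In code terms, the Shared Cache variant can be viewed as a small change to the CSE sketch above: the instruction-derived scores, rather than the chunk-derived ones, decide which states survive in the single cache. Again, this is only a sketch under the same assumed shapes.

```python
import torch

def shared_cache_step(cache_k, cache_v, chunk_k, chunk_v, instr_attn, budget: int):
    """Same shapes as the CSE sketch, but ranked by instruction attention instead."""
    scores = instr_attn.mean(dim=0)               # relevance to the instruction, not the chunk
    keep = torch.topk(scores, k=min(budget, scores.numel())).indices.sort().values
    new_k = torch.cat([cache_k[keep], chunk_k], dim=0)
    new_v = torch.cat([cache_v[keep], chunk_v], dim=0)
    return new_k, new_v                           # a single cache, reused for reading and answering
```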
Experiments and Results
The authors evaluated CItruS on three main categories: Long Document Reading Comprehension, Knowledge Retrieval, and Language Modeling Fluency. They compared it against strong baselines like StreamingLLM, H2O, and TOVA.
1. Knowledge Retrieval (Finding the Needle)
The ultimate test for long-context models is the “Needle-in-a-Haystack” or “Passkey Retrieval” task. Can the model find a specific random number hidden inside a massive document?
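As a concrete illustration of the task setup, here is a toy construction of a passkey test (a hypothetical helper, not the paper's exact prompt or data):

```python
import random

def make_passkey_prompt(num_filler_sentences: int = 10_000) -> tuple[str, str]:
    """Build a long distractor document with a passkey hidden at a random position."""
    passkey = str(random.randint(10_000, 99_999))
    filler = "The grass is green. The sky is blue. The sun is bright. " * num_filler_sentences
    insert_at = random.randint(0, len(filler))
    document = (
        filler[:insert_at]
        + f" The pass key is {passkey}. Remember it. "
        + filler[insert_at:]
    )
    question = "What is the pass key mentioned in the document?"
    return document + "\n" + question, passkey
```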
Figure 6 displays the accuracy of Passkey Retrieval across different context lengths (log scale).

- Blue Line (Standard): This represents standard state eviction. As the document gets longer (moving right on the x-axis), performance collapses. By the time the document is very long, accuracy is near zero.
- Red Line (CItruS - Individual & Shared): The performance stays at a perfect 100%.
Whether using Llama 2 or Mistral, CItruS retrieves the information perfectly, even as the sequence length grows exponentially. This empirically proves that the instruction-aware mechanism successfully prevents Information Neglect.
The authors also tested on the “Needle-in-a-Haystack” task using ROUGE scores (measuring text overlap).

Table 2 confirms the dominance of CItruS. Whether using a cache budget of 768 or 1024, CItruS (both Individual and Shared variants) significantly outperforms the Standard CSE. For example, with Mistral 7B, the ROUGE-1 score jumps from roughly 15.17 (Standard) to 63.47 (Shared Cache).
2. Reading Comprehension (Downstream Tasks)
Synthetic retrieval is one thing, but what about real questions? The authors tested on datasets like Qasper, HotpotQA, and TriviaQA.

Table 5 shows aggregated results. The “Avg Rank” (where 8 is best) shows CItruS consistently at the top.
- Standard CSE performs comparably to baselines like H2O and StreamingLLM.
- CItruS (Shared Cache) consistently achieves the highest ranks (e.g., 7.33/8 for Llama 2 13B on 4k-8k lengths).
This demonstrates that CItruS isn’t just finding simple passkeys; it is preserving the complex semantic information required for answering questions.
3. Language Modeling Fluency
A major concern with altering the cache based on “Instructions” rather than “Context” is that the model might lose its ability to generate fluent text, as measured by perplexity. If we optimize too hard for the answer, we might break the grammar.

Figure 7 plots perplexity (lower is better) over token length.
- Blue (Standard CSE) and Orange (Streaming LLM) represent the baselines optimized for fluency.
- Green (Shared Cache CSE) represents CItruS.
The green line tracks almost identically with the baselines. This is a significant finding: using the instruction to select cache states does not degrade the model’s general language modeling capability. Evidently, the states useful for answering the question are also sufficient to keep the processing of the document fluent and coherent.
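For reference, perplexity here is simply the exponentiated average negative log-likelihood of the next token. A minimal sketch of the computation, with random logits standing in for a real model's outputs:

```python
import math
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """logits: [seq_len, vocab_size]; targets: [seq_len] ids of the next tokens."""
    nll = F.cross_entropy(logits, targets, reduction="mean")  # average negative log-likelihood
    return math.exp(nll.item())

# Toy usage; a real evaluation would feed the model's logits over the long document.
print(perplexity(torch.randn(8, 32000), torch.randint(0, 32000, (8,))))
```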
Analysis: Position Bias
Does it matter where the “needle” is located? Previous research has shown LLMs suffer from “Lost in the Middle” syndrome.

Figure 8 serves as a heatmap of performance. The x-axis is the needle position, and the y-axis is document length. Darker blue indicates better performance. While CItruS performs very well (mostly dark blue), there is a slight fading in the middle columns. This indicates that while CItruS vastly improves retrieval, the inherent bias of LLMs to focus on the beginning and end of documents persists to some degree. However, compared to baselines that would be completely white (zero retrieval) at these lengths, this is a massive improvement.
Why CItruS Matters
The implications of CItruS are substantial for students and researchers working with LLMs:
- Training-Free: CItruS is an inference-time technique. You do not need to fine-tune Llama or Mistral. You simply change how the KV cache is managed during generation. This makes it accessible and cheap to implement.
- Memory Efficiency: By proving that a Shared Cache works, CItruS shows we don’t need to double our memory requirements to get good retrieval. We just need to be smarter about what we keep.
- Solving Information Neglect: The paper provides a clear theoretical framework (Information Neglect) and a practical solution. It highlights that “importance” in neural networks is relative to the goal (fluency vs. task), not an absolute property of a token.
Conclusion
The “CItruS” paper tackles the critical bottleneck of long-context modeling: the trade-off between memory efficiency and information retention. By identifying “Information Neglect”—the tendency of standard models to discard task-relevant details in favor of local fluency—the authors propose a novel, instruction-aware eviction strategy.
Through the use of Chunked State Eviction and an Instruction-Aware Cache, CItruS allows standard open-source models to process documents containing up to a million tokens and retrieve specific details with near-perfect accuracy. It bridges the gap between the “reading” mind and the “solving” mind of the LLM, ensuring that when the model reads a book, it remembers exactly what you asked it to look for.