Introduction
In the rapidly evolving world of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) has become the gold standard for grounding AI responses in reality. By fetching relevant documents from an external database, we can prevent hallucinations and give models access to up-to-date information.
However, there is a catch: the “context window.” While modern models boast context lengths of 100k or even 1 million tokens, filling that context comes with significant downsides. It is expensive, increases latency, and, paradoxically, often confuses the model: in what is known as the “Lost in the Middle” phenomenon, LLMs struggle to find specific needles in massive haystacks of retrieved text.
So, how do we solve this? The intuitive answer is compression. If we can shrink the retrieved documents down to just the essential facts before feeding them to the LLM, we save money and improve focus. But traditional compression methods are “passive”—they simply shorten text without understanding the broader picture.
Enter COMPACT, a new framework proposed by researchers from Korea University, Upstage AI, and AIGEN Sciences. COMPACT changes the game by treating compression as an active, iterative process. It doesn’t just summarize; it hunts for clues, updates its memory, and knows exactly when to stop reading.
In this deep dive, we will explore how COMPACT works, why “active” compression outperforms traditional methods, and how it achieves a staggering 47x compression rate while actually improving question-answering performance.
Background: The Multi-Hop Problem
To understand why COMPACT is necessary, we first need to look at the complexity of modern Question Answering (QA).
In a simple scenario, a question like “When was Barack Obama born?” requires finding a single document containing his birth date. This is Single-Hop QA.
However, real-world queries are often Multi-Hop. Consider the question: “What ‘Virtual Choir’-noted conductor has created works for the Austin-based ensemble Conspirare?”
To answer this, an AI needs to:
- Identify conductors noted for a “Virtual Choir.”
- Find which of those conductors worked with “Conspirare.”
- Synthesize this information.
Standard RAG systems retrieve top-ranked documents based on their similarity to the query. Often, the crucial connecting piece of information is buried in the 20th or 30th document, not the top 5. If you feed all 30 documents to an LLM, it might get overwhelmed by the noise. If you simply cut off after the top 5, you miss the answer entirely.
Existing solutions try to compress these documents, but they usually treat each document in isolation or select sentences based on rough similarity. They lack the reasoning capability to say, “I found part A of the answer, now I specifically need to look for part B.”
The Core Method: COMPACT
COMPACT (Compressing Retrieved Documents Actively) operates as a plug-in module that sits between your Retriever (the search engine) and your Reader (the LLM answering the user).
Instead of processing all documents at once or summarizing them individually, COMPACT reads through them in batches (segments), actively updating a “compressed context” as it goes.
The Architecture
The process is best visualized as a loop.

As shown in Figure 2, the workflow is:
- Retrieval: The system fetches a long list of documents (e.g., 30 docs).
- Segmentation: These documents are split into smaller groups or “segments” (e.g., 5 docs per group).
- Iterative Compression: The model looks at the first segment AND the question. It generates a summary.
- Active Update: For the next segment, the model looks at the new documents, the question, and the summary from the previous step.
- Early Termination: At each step, the model asks, “Do I have enough info to answer the question?” If yes, it stops.
This architecture ensures that the compressed text evolves. It acts as a rolling memory that accumulates evidence.
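To make the loop concrete, here is a minimal Python sketch of the iteration, assuming a hypothetical `compressor` callable that wraps the fine-tuned model and returns a summary plus a completeness flag (the names and interface are illustrative, not the authors’ released code):

```python
from dataclasses import dataclass

@dataclass
class CompressionStep:
    context: str      # rolling compressed context C_t
    complete: bool    # evaluation signal E_t

def compact_loop(question, documents, compressor, segment_size=5):
    """Iteratively compress retrieved documents into a rolling context.

    `compressor` is assumed to take (question, segment, previous_context)
    and return a CompressionStep; it stands in for the fine-tuned
    compressor model described in the paper.
    """
    context = ""  # C_0: start with an empty compressed context
    for start in range(0, len(documents), segment_size):
        segment = documents[start:start + segment_size]   # S_t
        step = compressor(question, segment, context)     # (C_t, E_t)
        context = step.context
        if step.complete:                                  # early termination
            break
    return context  # fed to the Reader LLM alongside the question
```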
Mathematical Formulation
Let’s break down the math to understand exactly what is happening under the hood.
First, the retrieved documents are grouped into segments \(S_t\). If we decide to process \(j\) documents at a time, the \(t\)-th segment looks like this:

\[ S_t = \{ d_{(t-1) \times j + 1}, \dots, d_{(t-1) \times j + j} \} \]

Here, \(S_t\) contains documents \(d\) from index \((t-1) \times j + 1\) to \((t-1) \times j + j\).
The core magic happens in the update function. The model \(\pi\) takes three inputs:
- The Question (\(q\))
- The current Segment of documents (\(S_t\))
- The Previous Compressed Context (\(C_{t-1}\))
It produces two outputs:
- The New Compressed Context (\(C_t\))
- An Evaluation signal (\(E_t\))

\[ (C_t, E_t) = \pi(q, S_t, C_{t-1}) \]
The inclusion of \(C_{t-1}\) is what makes this “active.” When processing segment 2, the model already knows what it found in segment 1. If segment 1 established that “Eric Whitacre created the Virtual Choir,” the model knows that for segment 2, it only needs to check if “Eric Whitacre worked with Conspirare.” It filters out everything else.
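In practice, each update is a single model call whose prompt packs these three inputs together. Here is a rough sketch of what one step might look like, assuming a generic `generate(prompt)` helper around the compressor model; the prompt wording is an illustrative guess, not the paper’s exact template:

```python
def compress_step(generate, question, segment, prev_context):
    """One active update: pi(q, S_t, C_{t-1}) -> raw text containing C_t and E_t.

    `generate` is assumed to be a text-in/text-out wrapper around the
    fine-tuned compressor; the prompt below is an illustrative template.
    """
    docs = "\n\n".join(f"[Doc {i+1}] {d}" for i, d in enumerate(segment))
    prompt = (
        f"Question: {question}\n\n"
        f"Previously compressed context:\n{prev_context or '(empty)'}\n\n"
        f"New documents:\n{docs}\n\n"
        "Update the compressed context with any evidence relevant to the "
        "question, then state [COMPLETE] if it is now sufficient to answer, "
        "otherwise [INCOMPLETE]."
    )
    return generate(prompt)
```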
Early Termination: Knowing When to Stop
One of the most efficient features of COMPACT is Early Termination.
In standard RAG, you pay the computational cost for processing all retrieved documents, even if the answer was in the very first paragraph. COMPACT generates an evaluation token (\(E_t\)) alongside its summary: either [COMPLETE] or [INCOMPLETE].
If the model outputs [COMPLETE], the loop breaks immediately. This saves massive amounts of compute and reduces the chance of distracting the final Reader LLM with irrelevant details found in later documents.
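Since the evaluation signal is emitted as literal text alongside the summary, the surrounding pipeline just has to parse it out before deciding whether to continue. A small sketch, assuming the model appends a [COMPLETE] or [INCOMPLETE] tag to its output (the exact output format here is an assumption):

```python
import re

def parse_compressor_output(generation: str):
    """Split a raw compressor generation into (summary, is_complete).

    Assumes the model appends either [COMPLETE] or [INCOMPLETE]
    after the compressed context, mirroring the E_t signal.
    """
    match = re.search(r"\[(COMPLETE|INCOMPLETE)\]", generation)
    is_complete = bool(match) and match.group(1) == "COMPLETE"
    summary = re.sub(r"\[(COMPLETE|INCOMPLETE)\]", "", generation).strip()
    return summary, is_complete
```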

Figure 3 illustrates this behavior. The green bars show COMPACT’s decision-making. Notice how for many queries, it determines completeness after just 1 or 2 iterations. This dynamic adjustment means simple questions are cheap to answer, while complex ones get the resources they need.
Training the Compressor
How do you train a model to do this? There isn’t a standard dataset for “iterative compression.”
The authors constructed a synthetic dataset using GPT-4. They used a sophisticated prompting strategy involving:
- Sentence Selection: Asking GPT-4 to pick relevant sentences.
- Query-Focused Compression: Summarizing those sentences without answering the question directly.
- Self-Evaluation: Asking GPT-4 to judge if the summary is sufficient ([COMPLETE]).
They created 28,000 training instances covering both scenarios where the answer is found early and scenarios where it requires deep digging (distractor scenarios). They then fine-tuned a Mistral-7B model on this data to act as the COMPACT compressor.
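To picture what the compressor actually learns from, one training record can be imagined as the three inputs on one side and the GPT-4-written compression plus evaluation tag as the target. The sketch below uses invented field names; the released dataset’s actual schema may differ:

```python
# One illustrative training record for the compressor (field names are
# hypothetical; the dataset's real schema may differ).
training_instance = {
    "question": "What 'Virtual Choir'-noted conductor has created works "
                "for the Austin-based ensemble Conspirare?",
    "prev_context": "Eric Whitacre is a conductor known for the Virtual Choir.",
    "segment": [
        "Conspirare is a choral ensemble based in Austin, Texas...",
        "Eric Whitacre has created several works for Conspirare...",
    ],
    # Target output: the updated compression plus the evaluation token.
    "target": (
        "Eric Whitacre, noted for the Virtual Choir, has created works "
        "for the Austin-based ensemble Conspirare. [COMPLETE]"
    ),
}
```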
Experiments and Results
The researchers evaluated COMPACT on several challenging datasets: HotpotQA, MuSiQue, and 2WikiMultiHopQA (for multi-hop reasoning), as well as Natural Questions (NQ) and TriviaQA (for single-hop).
They compared COMPACT against:
- Raw Documents: Simply feeding the top-k docs to the LLM.
- Long-Context LLMs: Models like GPT-3.5-Turbo.
- Other Compressors: LongLLMLingua, AutoCompressors, and RECOMP.
The “Reader” model (the one answering the question based on the compression) used in the main experiments was LLaMA-3-8B.
Main Performance
The results were impressive. Let’s look at Table 2:

Key Takeaways from the Data:
- Compression Rate: COMPACT achieved a massive 47.6x compression rate on HotpotQA, reducing the input fed to the Reader by nearly 98% and distilling thousands of tokens of noisy retrieved text into a concise summary.
- Performance (F1 Score): Despite this massive reduction, it achieved an F1 score of 46.9 on HotpotQA, significantly beating the “Raw Document” baseline (40.3) and other compressors like RECOMP (39.9).
- Beating Long-Context Models: It even outperformed standard usage of massive proprietary models like GPT-3.5-Turbo in the multi-hop settings.
This suggests that less is more. By actively filtering noise, COMPACT gives the Reader LLM a cleaner signal, leading to better answers than if the LLM had read the raw text itself.
Robustness to Noise
A major issue in RAG is that the “Gold” document (the one with the actual answer) isn’t always at the top of the search results. It might be ranked #25.

Figure 1 demonstrates COMPACT’s robustness.
- The Brown line (Raw docs) stays flat. As you add more documents (increasing Top-k), the noise confuses the model, cancelling out the benefit of finding the answer.
- The Red line (COMPACT) keeps climbing. Because COMPACT actively hunts for the answer and discards the noise, it benefits from looking at more documents (up to Top-40) without suffering from the “Lost in the Middle” problem. It almost matches the performance of the Gold docs (Green line), which is the theoretical upper bound.
Flexibility as a Plug-in
A critical question for any new framework is: “Does it work with my specific setup?” The researchers tested COMPACT with different Retrievers and Readers.
Different Retrievers
They swapped the neural retriever (Contriever) for a classic keyword-based search (BM25).

As seen in Figure 5, the trend holds. Even with BM25, COMPACT (Red Star) maintains high performance and stability as Top-k increases, whereas other methods struggle or plateau.
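Because COMPACT only consumes a ranked list of documents, swapping the retriever is a small change in a typical pipeline. Here is a minimal sketch using the rank_bm25 package as a stand-in keyword retriever (the paper does not prescribe a specific BM25 implementation):

```python
from rank_bm25 import BM25Okapi  # pip install rank_bm25

def bm25_retrieve(query: str, corpus: list[str], top_k: int = 30) -> list[str]:
    """Return the top_k documents for a query using classic BM25 scoring."""
    tokenized_corpus = [doc.lower().split() for doc in corpus]
    bm25 = BM25Okapi(tokenized_corpus)
    return bm25.get_top_n(query.lower().split(), corpus, n=top_k)

# The retrieved list then feeds the same compact_loop sketched earlier:
# compressed = compact_loop(query, bm25_retrieve(query, corpus), compressor)
```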
Different Readers
They also tested if the summaries generated by COMPACT (which is based on Mistral-7B) are readable by other LLMs.

Figure 6 confirms that COMPACT is model-agnostic. Whether the final reader is LLaMA-2 or LLaMA-3, the compressed text provides a better basis for answering questions than raw documents or RECOMP.
Cost and Efficiency
For students and developers, the “cool factor” often takes a backseat to the “cost factor.” API calls are expensive.
The researchers analyzed the cost of answering questions using proprietary models (like GPT-4 and Claude 3.5) as readers.

Table 4 reveals the economic impact.
- Using GPT-4o with Raw documents cost $10.75.
- Using GPT-4o with COMPACT cost $0.28.
That is a 97% cost reduction. Because COMPACT shrinks the input context so dramatically, you pay significantly less per API call. Furthermore, COMPACT achieved a higher F1 score (56.0) than the raw GPT-4o setup (55.8), proving you don’t have to sacrifice quality for savings.
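The arithmetic behind that saving is simple: with input tokens dominating the bill, the reader’s cost scales roughly linearly with context length. A quick back-of-the-envelope check using the Table 4 figures quoted above:

```python
def relative_cost_saving(raw_cost: float, compressed_cost: float) -> float:
    """Fraction of API spend saved by compressing the reader's input."""
    return 1 - compressed_cost / raw_cost

# Table 4 figures quoted above: $10.75 raw vs. $0.28 with COMPACT.
print(f"{relative_cost_saving(10.75, 0.28):.1%} cheaper")  # -> 97.4% cheaper
```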
Conclusion and Implications
The COMPACT paper introduces a shift in how we think about handling long contexts in Large Language Models. Rather than building models with infinitely larger context windows, COMPACT suggests we should build smarter mechanisms to curate that context.
By treating compression as an active, iterative reasoning task, COMPACT achieves three major wins:
- Precision: It links information across documents to solve multi-hop questions.
- Efficiency: It stops processing as soon as it finds the answer (Early Termination).
- Economy: It reduces token usage by nearly 50x, slashing deployment costs.
For developers building RAG systems today, COMPACT offers a compelling blueprint: don’t just dump data into your prompt. Process it, reason over it, and compress it first. As the researchers note, while the iterative process introduces some latency compared to a single-pass summary, the gains in accuracy and the flexibility to handle complex reasoning tasks make it a powerful tool for the next generation of AI applications.