Large Language Models (LLMs) have transformed how we interact with information, but they have a well-known Achilles’ heel: their appetite for computational resources. This becomes especially apparent in Retrieval-Augmented Generation (RAG) systems, where large amounts of external text are injected into the model to help it answer questions. The more context we provide, the better the potential answer—but the slower and more expensive the process becomes. This creates a frustrating trade-off between knowledge and efficiency.
Imagine asking a question in a RAG-powered chatbot. Behind the scenes, the system might retrieve ten different documents, stitch them together into a massive prompt, and send it to an LLM. The model must read and process every single word of that context before it can generate the first token of its answer. This initial processing delay—known as Time-to-First-Token (TTFT)—is a major bottleneck in making RAG systems feel truly interactive.
A recent paper from researchers at Meta, “REFRAG: Rethinking RAG based Decoding”, tackles this problem head-on. They argue that we’ve been treating RAG inference like any other LLM task, which is wasteful. RAG context isn’t just one long narrative—it’s a collection of often unrelated documents with highly sparse relationships. By exploiting this unique structure, they developed REFRAG, a decoding framework that accelerates TTFT in RAG systems by up to 30.85× without sacrificing accuracy.
In this deep dive, we’ll unpack the core ideas behind REFRAG—how it compresses, senses, and expands context—and explore the training strategies and results that make it so effective.
The Problem with “Dumb” Context Processing in RAG
The inefficiency begins with how standard LLMs process long prompts. In self-attention, every token interacts with every other token, and the computational cost grows quadratically with input length. Doubling the context can quadruple the TTFT.
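A rough back-of-the-envelope sketch (illustrative model dimensions, not figures from the paper) makes the scaling concrete: the attention score matrix alone has seq_len × seq_len entries in every layer, so doubling the prompt quadruples that part of the prefill work.

```python
# Illustrative only: quadratic growth of the attention score computation
# during prefill. Model dimensions here are placeholders, not REFRAG's settings.
def attention_score_ops(seq_len: int, d_model: int = 4096, n_layers: int = 32) -> int:
    # Q·K^T produces a seq_len x seq_len matrix in every layer.
    return n_layers * seq_len * seq_len * d_model

base = attention_score_ops(4_096)
for n in (4_096, 8_192, 16_384):
    print(f"{n:>6} tokens -> {attention_score_ops(n) / base:.0f}x the 4k-token attention cost")
```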
For RAG, this is especially wasteful for three reasons:
- Sparse Information: A RAG prompt contains the user’s query followed by multiple retrieved passages. Many of these passages are redundant or only marginally relevant—but still consume equal computation time.
- Block-Diagonal Attention: Retrieval favors diverse, deduplicated passages, so tokens within the same passage attend strongly to one another but attend very little across passages. The result is a block-diagonal attention pattern in which many off-diagonal computations are near-zero.
- Discarded Knowledge: When retrieving passages, embeddings are already computed by the retriever. Typically, this pre-computed information is discarded, forcing the LLM to re-encode the text.
Figure 1: Attention values for different retrieved passages. Diagonal heatmap patterns (P0–P4) show high intra-passage attention but low inter-passage attention—indicating wasted computation.
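The pattern in Figure 1 can be mimicked with a toy block-diagonal mask (a sketch for intuition only; the passage lengths are made up):

```python
import torch

# Toy block-diagonal attention pattern: tokens attend within their own
# retrieved passage, with near-zero attention across passages.
passage_lengths = [4, 3, 5]          # three hypothetical passages P0-P2
total = sum(passage_lengths)
mask = torch.zeros(total, total)
start = 0
for length in passage_lengths:
    mask[start:start + length, start:start + length] = 1.0   # intra-passage block
    start += length
print(mask.int())                    # 1s on the diagonal blocks, 0s elsewhere
```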
REFRAG’s insight is radical but simple: if most of the computation over RAG context is unnecessary, eliminate it.
The Core Idea: Compress, Sense, and Expand
REFRAG changes the game by feeding compressed representations of retrieved passages to the LLM instead of the raw tokens. It uses a lightweight encoder model to condense each chunk of text into a chunk embedding.
Pipeline overview:
- Chunking: The retrieved context is split into fixed-size chunks (e.g., k=16 tokens).
- Encoding & Compression: A lightweight encoder (e.g., RoBERTa) processes each chunk into a single embedding. With k=16, the context length is reduced by a factor of 16.
- Decoding: The compressed chunk embeddings are projected into the decoder’s token space and fed in alongside the user query’s token embeddings.
Figure 2: REFRAG architecture. Context is chunked, encoded into pre-computable embeddings, and selectively expanded by an RL policy before decoding.
The compression drastically shortens the decoder’s input, slashing TTFT. Chunk embeddings can also be pre-computed and cached, so passages that recur across queries are reused instead of being re-encoded.
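To make the flow concrete, here is a minimal sketch of the compress-and-project step. It assumes a Hugging Face RoBERTa encoder, mean-pooling over each chunk, and a placeholder decoder dimension; the function and variable names are illustrative, not the paper’s code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

k = 16                      # tokens per chunk (the compression rate)
d_decoder = 4096            # hidden size of the target decoder (assumed)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")
# Projection into the decoder's embedding space; trained during alignment (Stage 1 below).
proj = torch.nn.Linear(encoder.config.hidden_size, d_decoder)

def compress_passage(text: str) -> torch.Tensor:
    """Return one embedding per k-token chunk, projected to decoder space."""
    ids = tokenizer(text, return_tensors="pt", truncation=True)["input_ids"][0]
    chunks = [ids[i:i + k] for i in range(0, len(ids), k)]
    chunk_embs = []
    with torch.no_grad():
        for chunk in chunks:
            out = encoder(chunk.unsqueeze(0))
            # Pool the chunk's token states into one vector (pooling choice is an assumption).
            chunk_embs.append(out.last_hidden_state.mean(dim=1))
    return proj(torch.cat(chunk_embs))   # shape: (num_chunks, d_decoder)

# The returned vectors are concatenated with the query's token embeddings and fed
# to the decoder, shrinking its input length by roughly a factor of k.
```

Because the encoder only ever sees individual chunks, these vectors can be computed offline at indexing time and stored next to the retriever’s index, which is what makes the cached speedups reported below possible.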
The Payoff: Speed Gains Without Accuracy Loss
Benchmarks show REFRAG outperforming a standard LLaMA and the previous state-of-the-art CEPE. With k=16, REFRAG achieves a 16.53× TTFT acceleration (cached) on 16k-token contexts, and with k=32 the speedup reaches 30.85×.
Figure 3: Inference acceleration with k=16. REFRAG (blue/green) far exceeds CEPE (red) in TTFT acceleration.
Critically, these gains come without significant drops in perplexity or task accuracy—thanks to a specialized training recipe.
The Training Recipe: Teaching an LLM to Understand Compression
Simply feeding compressed embeddings into an unmodified LLM fails. REFRAG’s authors devised a two-stage alignment strategy:
Stage 1: Reconstruction Task
- Freeze the decoder.
- Train only the encoder and projection layer.
- Task: Encode a chunk into an embedding, then have the frozen decoder reconstruct the original tokens from that embedding.
- Goal: Align encoder outputs with decoder expectations.
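A minimal sketch of what that setup could look like in PyTorch is shown below, with a small GPT-2 standing in for the frozen decoder; the pooling, learning rate, and shared-tokenizer simplification are assumptions, not details from the paper.

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM

encoder = AutoModel.from_pretrained("roberta-base")
decoder = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for the frozen LLM
proj = torch.nn.Linear(encoder.config.hidden_size, decoder.config.hidden_size)

for p in decoder.parameters():                           # Stage 1: decoder stays frozen
    p.requires_grad = False
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(proj.parameters()), lr=1e-4
)

def reconstruction_loss(chunk_ids: torch.Tensor) -> torch.Tensor:
    """chunk_ids: (1, k) token ids for one chunk (a shared tokenizer is assumed here)."""
    # Compress the chunk into a single vector in the decoder's embedding space.
    chunk_vec = proj(encoder(chunk_ids).last_hidden_state.mean(dim=1))       # (1, d)
    # Teacher forcing: [chunk_vec, tok_0, ..., tok_{k-2}] should predict tok_0 ... tok_{k-1}.
    tok_embs = decoder.get_input_embeddings()(chunk_ids)                     # (1, k, d)
    inputs = torch.cat([chunk_vec.unsqueeze(1), tok_embs[:, :-1]], dim=1)
    logits = decoder(inputs_embeds=inputs).logits
    return torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), chunk_ids.reshape(-1)
    )
```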
Stage 2: Curriculum Learning for Continual Pre-training
- Unfreeze the decoder.
- Train encoder + decoder on next-paragraph prediction from compressed chunks.
- Start simple—predict from one chunk—and progress to longer, multi-chunk contexts.
Figure 4: Curriculum learning shifts from short/easy contexts to longer and more difficult contexts over training stages.
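One plausible shape for such a schedule (an assumption for illustration, not the paper’s exact recipe) is to grow the number of compressed context chunks stage by stage:

```python
# Hypothetical curriculum: start from a single compressed chunk of context and
# roughly double the context length at each stage.
def chunks_for_stage(stage: int, max_chunks: int = 64) -> int:
    return min(2 ** stage, max_chunks)

for stage in range(7):
    print(f"stage {stage}: next-paragraph prediction with {chunks_for_stage(stage)} compressed chunk(s)")
```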
This staged approach enables the decoder to effectively interpret compressed inputs, maintaining accuracy at high compression rates.
Going Further: RL-based Selective Compression
Uniform compression can miss critical details. To mitigate this, REFRAG incorporates selective compression—keeping important chunks in their full token form.
A lightweight RL policy decides which chunks to expand. It maximizes prediction accuracy by minimizing perplexity, learning to identify “VIP” chunks that are most relevant to the task.
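A minimal sketch of the selection step might look like the following: a tiny scoring head picks which chunks to expand under a fixed budget. The architecture and the top-k selection rule are assumptions; in REFRAG the policy is trained with reinforcement learning, using the decoder’s perplexity as the reward signal as described above.

```python
import torch

class ExpansionPolicy(torch.nn.Module):
    """Scores chunk embeddings; the top-scoring chunks are kept as raw tokens."""
    def __init__(self, d_model: int):
        super().__init__()
        self.score = torch.nn.Linear(d_model, 1)

    def forward(self, chunk_embs: torch.Tensor, expand_budget: int) -> torch.Tensor:
        # chunk_embs: (num_chunks, d_model) -> indices of chunks to expand
        logits = self.score(chunk_embs).squeeze(-1)
        return torch.topk(logits, k=expand_budget).indices

policy = ExpansionPolicy(d_model=4096)
chunk_embs = torch.randn(10, 4096)                 # ten toy compressed chunks
expand_idx = policy(chunk_embs, expand_budget=2)   # e.g. keep 2 chunks uncompressed
print("chunks kept as raw tokens:", expand_idx.tolist())
```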
Figure 5: Selective token compression. The RL policy expands crucial chunks and leaves the rest compressed.
Figure 6: RL-based selection outperforms heuristic and random approaches in perplexity across datasets and compression rates.
Experimental Results
Perplexity and Long-Context Handling
REFRAG models consistently outperform CEPE, REPLUG, and truncated-context LLaMA baselines. They handle contexts up to 16,384 tokens using a base LLaMA with a 4,096-token limit—effectively extending its window by 4×.
Table 1: Perplexity results for varying context lengths. REFRAG_8 and REFRAG_16 outperform other efficient baselines.
RAG Application Performance
REFRAG shines on 16 diverse QA datasets. Under an equal latency budget, compression lets REFRAG process far more passages. For example, where LLaMA handles a single passage, REFRAG_8 handles 8 passages in the same time, boosting accuracy.
Figure 7: RAG performance. Equal passage counts yield similar results, but at equal latency, REFRAG’s ability to consume more context leads to higher accuracy.
With a strong retriever, REFRAG improves average accuracy by 1.22% at equal latency over LLaMA; in a weak retriever setting, the gain grows to 1.93%.
Multi-Turn Conversation
In multi-turn RAG, conversations accumulate context history. Standard fixed-window LLMs must truncate earlier turns—losing important information. REFRAG retains full history, maintaining accuracy across many turns.
Table 2: Multi-turn conversation results. REFRAG avoids the truncation penalty, outperforming fine-tuned LLaMA especially at ≥6 turns.
Conclusion: A Specialized Solution for a Specialized Problem
REFRAG demonstrates that rethinking the problem unlocks massive gains. By recognizing RAG’s sparse, block-diagonal attention structure, REFRAG couples chunk compression, a specialized training pipeline, and intelligent selective expansion to deliver:
- Up to 30.85× faster TTFT
- Extended effective context windows
- Comparable or better accuracy versus full-context baselines
For latency-sensitive, knowledge-intensive systems—like RAG chatbots, multi-turn QA, or long-document summarization—REFRAG shows you can have both depth and speed. It’s a compelling example of tailoring efficiency techniques to the real structure of data, paving the way for future targeted optimizations in LLM inference.