EfficientRAG: Solving Multi-Hop QA Without Breaking the Bank
In the rapidly evolving landscape of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) has become the gold standard for grounding AI responses in reality. By fetching relevant data from external sources, RAG reduces hallucinations and enables models to answer questions about specific, private, or up-to-the-minute data.
However, there is a class of questions that continues to stump standard RAG systems: multi-hop questions. These are complex queries that require multiple steps of reasoning to answer, such as “Who is the director of the movie that starred the lead actor of ‘Titanic’?”
To solve this, researchers have developed iterative methods that “hop” from one piece of information to the next. But there is a catch: these methods are notoriously slow and expensive, often requiring multiple calls to massive LLMs just to figure out what to search for next.
In this post, we are doing a deep dive into EfficientRAG, a novel framework presented by researchers from Nanjing University and Microsoft. This paper proposes a way to perform iterative retrieval using lightweight models instead of heavy LLMs, drastically cutting down latency and cost while improving accuracy.
The Problem: Noise and Complexity
Standard RAG operates on a “retrieve-then-read” basis. You ask a question, the system searches a knowledge base for relevant passages, and an LLM reads the results to generate an answer. This works well for simple questions like “What is the capital of France?”
But for multi-hop questions, a single search often fails. The system might find the lead actor of ‘Titanic’ (Leonardo DiCaprio) but fail to retrieve anything about who directed the other films he has starred in.
To address this, the field moved toward Iterative RAG. In this setup, the model retrieves some info, reads it, realizes it’s incomplete, generates a new query, retrieves more info, and repeats. While effective, current approaches rely on calling an LLM (like GPT-4) at every single step to reason about what to do next. This introduces two major problems:
- Latency: Chaining multiple LLM calls makes the system slow.
- Noise Sensitivity: Even powerful LLMs struggle when the retrieved documents contain a mix of relevant facts and irrelevant “noise.”
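To see where that cost comes from, here is a minimal sketch of the conventional LLM-in-the-loop pattern described above. The retriever and LLM are passed in as placeholder callables; this illustrates the general pattern, not any specific baseline’s implementation:

```python
from typing import Callable

def iterative_rag(
    question: str,
    retrieve: Callable[[str], list[str]],   # vector / keyword search
    llm: Callable[[str], str],              # expensive chat-LLM call
    max_hops: int = 5,
) -> str:
    """Conventional iterative RAG: one LLM call per hop just to plan the next search."""
    context: list[str] = []
    query = question
    for _ in range(max_hops):
        context.extend(retrieve(query))
        # Expensive step: the LLM reads everything gathered so far and either
        # declares the context sufficient or writes the next search query.
        reply = llm(
            f"Question: {question}\nContext: {context}\n"
            "Reply DONE if answerable, otherwise give the next search query."
        )
        if reply.strip() == "DONE":
            break
        query = reply.strip()
    # One more LLM call to actually produce the answer.
    return llm(f"Question: {question}\nContext: {context}\nAnswer:")
```

Every pass through the loop is a full LLM round trip, which is exactly where the latency piles up.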
The researchers conducted an empirical study to demonstrate this sensitivity to noise. As shown below, they tested GPT-3.5, GPT-4, and Llama-3-8B on how well they could answer questions when provided with different types of context chunks.

Figure 1 Analysis: The chart above reveals a critical insight. The blue bars (Direct) show that without retrieval, models fail. The orange bars (Oracle Chunks) show that when fed perfect information, models excel. However, look at the green bars (Mixed Chunks). When relevant information is mixed with irrelevant “noise,” performance drops significantly for GPT-3.5 and Llama-3, and even noticeably for GPT-4.
This tells us that simply retrieving more data isn’t the answer. We need to retrieve better data and filter out the noise.
Enter EfficientRAG
The core philosophy of EfficientRAG is simple: We do not need a trillion-parameter model to decide what to search for next.
The researchers posit that identifying relations and entities for the next retrieval hop is a task that small, specialized models can handle effectively. By offloading the “reasoning” about what to search next to lightweight models, EfficientRAG eliminates the intermediate LLM calls entirely.
The Architecture
EfficientRAG replaces the heavy LLM in the retrieval loop with two lightweight components based on a smaller encoder model (DeBERTa-v3-large):
- Labeler & Tagger: This component looks at the retrieved documents. It “Labels” the specific tokens (words) that are useful and “Tags” the document to decide if we have enough information or if we need to keep searching.
- Filter: This component takes the useful tokens identified by the Labeler and constructs a new, targeted query for the next search.
Let’s look at how this flow works in practice:

Walkthrough of the Framework (Figure 3):
- Initial Query: The user asks, “How large is the shopping mall where KGOT radio station has its studios?”
- Retrieval: The system retrieves chunks of text.
- Labeler & Tagger:
  - Chunk 1 (Top path): Discusses the KGLK and KHPT stations. The Tagger marks this as <Terminate> (useless), or it is simply filtered out because it contains no labeled tokens regarding KGOT.
  - Chunk 2 (Bottom path): Contains the sentence “KGOT broadcasts from studios at the Dimond Center.” The Labeler identifies “KGOT” and “Dimond Center” as critical information. The Tagger marks this chunk as <Continue>: we have found the location, but we still don’t know how large it is.
- Filter: The Filter takes the original question and the newly found fact (“KGOT, in the Dimond Center”). It combines them to generate the next query: “How large is Dimond Center?”
- Iteration: This new query is sent back to the retriever.
Crucially, this entire loop happens without prompting a chat-based LLM. The final answer is generated by the LLM only once the retrieval process is complete and verified.
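To make the control flow concrete, here is a minimal sketch of that loop. The callables stand in for the retriever, the fine-tuned DeBERTa Labeler & Tagger and Filter, and the final generator; it mirrors the walkthrough above rather than the authors’ released code, and the termination logic is simplified:

```python
from typing import Callable

def efficient_rag(
    question: str,
    retrieve: Callable[[str], list[str]],                          # retriever
    labeler_tagger: Callable[[str, str], tuple[str, list[str]]],   # DeBERTa Labeler & Tagger
    filter_model: Callable[[str, list[str]], str],                 # DeBERTa Filter
    answer_llm: Callable[[str, list[str]], str],                   # final generator
    max_hops: int = 5,
) -> str:
    query, kept_chunks = question, []
    for _ in range(max_hops):
        next_tokens: list[str] = []
        for chunk in retrieve(query):
            tag, useful_tokens = labeler_tagger(question, chunk)
            if tag == "<Terminate>" or not useful_tokens:
                continue                      # noisy chunk: drop it entirely
            kept_chunks.append(chunk)
            next_tokens.extend(useful_tokens)
        if not next_tokens:                   # nothing new to chase: stop hopping
            break
        # Non-generative step: splice the labeled tokens into the next query.
        query = filter_model(query, next_tokens)
    # The single LLM call in the whole pipeline happens here.
    return answer_llm(question, kept_chunks)
```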
The Power of “Token Labeling”
A unique aspect of this method is Token Labeling. Instead of asking a model to “rewrite the query” (which requires generating new text—a slow process), EfficientRAG treats the problem as a classification task.
The Labeler looks at the retrieved text and classifies each token as True (useful) or False (irrelevant). The Filter then essentially performs a “fill-in-the-blank” operation, replacing the unknown parts of the original query with the labeled tokens found in the text. This is computationally much cheaper than text generation.
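Because the Labeler is just a token classifier, it maps directly onto a standard encoder head. The snippet below illustrates the formulation with Hugging Face’s generic token-classification API on a DeBERTa-v3-large backbone; it is not the authors’ trained model, and the freshly initialized head will produce meaningless labels until it is fine-tuned:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Sketch: token labeling as binary token classification on a DeBERTa encoder.
# This only illustrates the formulation; it is not the authors' trained Labeler.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-v3-large", num_labels=2  # 0 = irrelevant, 1 = useful
)

query = "How large is the shopping mall where KGOT radio station has its studios?"
chunk = "KGOT broadcasts from studios at the Dimond Center."

# Encode the (query, chunk) pair, just like an extractive-QA input.
inputs = tokenizer(query, chunk, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, 2)

# Tokens predicted as class 1 are the "useful" tokens handed to the Filter.
labels = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
useful_tokens = [tok for tok, lab in zip(tokens, labels) if lab == 1]
print(useful_tokens)  # meaningless until the classification head is fine-tuned
```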
Synthetic Data Construction
You might wonder: How do you train a small model to know which tokens are important?
Since there aren’t massive datasets of “questions mapped to specific useful tokens in paragraphs,” the researchers synthesized their own data using a teacher LLM (Llama-3-70B).
They broke the process down into four steps for the teacher LLM:
- Decomposition: Break a multi-hop question into single-hop sub-questions.
- Token Labeling: Ask the LLM to identify exactly which words in a document answer the sub-question.
- Next-hop Filtering: Ask the LLM to formulate what the next search query should be.
- Negative Sampling: Find tricky but irrelevant documents to train the model on what not to use.
This synthetic data was then used to fine-tune the smaller DeBERTa model, effectively distilling the reasoning capabilities of the large model into a faster, specialized architecture.
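A rough way to picture the synthesis pipeline is as a chain of prompts to the teacher model. The sketch below uses a hypothetical `teacher_llm` callable and illustrative prompt wording, not the paper’s actual templates:

```python
from typing import Callable

def synthesize_training_example(
    question: str,
    supporting_docs: list[str],
    teacher_llm: Callable[[str], str],   # e.g. a wrapper around Llama-3-70B-Instruct
) -> dict:
    # 1. Decomposition: split the multi-hop question into single-hop sub-questions.
    sub_questions = teacher_llm(
        f"Decompose into single-hop sub-questions, one per line:\n{question}"
    ).splitlines()

    hops = []
    for sub_q, doc in zip(sub_questions, supporting_docs):
        # 2. Token labeling: which exact words in the passage answer the sub-question?
        useful_tokens = teacher_llm(
            f"Copy the exact words from the passage that answer '{sub_q}':\n{doc}")
        # 3. Next-hop filtering: what should the follow-up search query be?
        next_query = teacher_llm(
            f"Given the fact '{useful_tokens}', rewrite '{question}' "
            "as the next single-hop search query.")
        hops.append({"sub_question": sub_q, "chunk": doc,
                     "useful_tokens": useful_tokens, "next_query": next_query})

    # 4. Negative sampling would add hard-but-irrelevant chunks here, so the
    #    Tagger learns what to discard (omitted in this sketch).
    return {"question": question, "hops": hops}
```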
Experimental Results
The researchers tested EfficientRAG against several baselines, including:
- Direct: Standard QA without retrieval.
- Direct-R: Standard one-hop RAG.
- Iter-RetGen & SelfAsk: Advanced iterative methods that use LLMs for query rewriting.
The evaluation covered three challenging multi-hop datasets: HotpotQA, MuSiQue, and 2WikiMQA.
1. Retrieval Efficiency
The first question to answer is: Does EfficientRAG actually find the right documents?

As shown in Figure 2, EfficientRAG (blue line) demonstrates superior Recall compared to Direct retrieval and even LLM-based decomposition.
- The X-axis represents the number of retrieved chunks (log-scaled).
- The Y-axis represents Recall (how much of the necessary information was found).
Notice how the blue line shoots up faster? This means EfficientRAG finds the relevant information earlier, requiring fewer chunks to get the full picture. Standard decomposition methods (green line) eventually catch up, but they require retrieving significantly more data to do so.
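For reference, the recall on the Y-axis can be computed as the fraction of gold supporting passages that appear among the top-k retrieved chunks. A minimal sketch (not the authors’ evaluation script):

```python
def recall_at_k(retrieved: list[list[str]], gold: list[set[str]], k: int) -> float:
    """Fraction of gold supporting passages found within the top-k retrieved chunks."""
    hits, total = 0, 0
    for ranked_ids, gold_ids in zip(retrieved, gold):
        top_k = set(ranked_ids[:k])
        hits += len(gold_ids & top_k)
        total += len(gold_ids)
    return hits / total if total else 0.0
```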
2. Computational Efficiency
This is arguably the most impactful result of the paper. Since EfficientRAG uses small models for the intermediate steps, it should be faster. But how much faster?

Table 4 presents a stark comparison:
- LLM Calls: EfficientRAG requires only 1.00 call (for the final answer). Compare this to Iter-RetGen (3.00) and SelfAsk (7.18).
- Latency: EfficientRAG takes about 3.62 seconds. SelfAsk takes a staggering 27.47 seconds.
- Speedup: EfficientRAG is roughly 3x faster than Iter-RetGen and nearly 8x faster than SelfAsk, while maintaining similar GPU utilization.
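The SelfAsk speedup follows directly from the latencies quoted above (the Iter-RetGen latency is not listed here, so only the SelfAsk ratio is checked):

```python
# Speedup relative to SelfAsk, computed from the latencies reported in Table 4.
selfask_latency_s = 27.47
efficientrag_latency_s = 3.62
print(f"{selfask_latency_s / efficientrag_latency_s:.1f}x")  # -> 7.6x, i.e. "nearly 8x"
```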
This drastic reduction in latency makes iterative RAG feasible for real-time applications where a user cannot wait 30 seconds for an answer.
3. End-to-End Accuracy
Speed is useless if the answers are wrong. Fortunately, EfficientRAG excels here as well.

In Table 5, using GPT-3.5 as the final generator, EfficientRAG achieves an accuracy of 53.41, beating out the advanced Iter-RetGen method (46.59) and the standard Direct retrieval (32.70).
By filtering out irrelevant chunks before they ever reach the final LLM, EfficientRAG provides a “cleaner” context, allowing the generator to answer more accurately.
4. Generalization
Finally, the researchers tested if the model could handle datasets it wasn’t trained on (Out-Of-Domain adaptation).

Table 6 shows that a model trained on HotpotQA performs surprisingly well when tested on 2WikiMQA, and vice versa. This suggests that the “skill” of identifying useful tokens and filtering noise is a generalizable capability, not just memorization of specific dataset patterns.
Conclusion and Implications
EfficientRAG represents a significant shift in how we design RAG systems. It challenges the assumption that we need to throw more LLM power at every sub-problem.
Key Takeaways:
- Specialization over Scale: Small, fine-tuned models (like the Labeler/Filter) can outperform general-purpose LLMs on specific structural tasks like query formulation.
- Noise is the Enemy: Improving RAG isn’t just about finding more documents; it’s about aggressively filtering out the wrong ones.
- Efficiency Unlocks Usability: By reducing LLM calls from 7+ to 1, multi-hop QA becomes viable for production environments sensitive to cost and latency.
For students and practitioners, this paper serves as a reminder: before chaining multiple expensive API calls together, consider if a smaller, trained component can do the job faster and better. As AI systems become more complex, efficiency optimizations like EfficientRAG will be the key to scaling them up.