Introduction
Large Language Models (LLMs) like GPT-4 and LLaMA have revolutionized how we interact with information. However, they suffer from a well-known flaw: hallucination. Even the most advanced models can confidently produce outdated information or fabricate facts entirely.
To combat this, the industry has widely adopted Retrieval-Augmented Generation (RAG). By connecting an LLM to an external knowledge base, we can ground the model’s responses in factual, up-to-date documents. RAG combines the reasoning power gained in pre-training with the accuracy of a search engine.
But here lies the new problem: Complexity.
A RAG system isn’t just a single algorithm; it is a pipeline composed of multiple distinct modules. Should you rewrite the user’s query? How small should you chunk your documents? Which vector database should you use? Should you summarize the retrieved text before feeding it to the LLM?
With so many moving parts, developers often fall into “analysis paralysis.” In the paper “Searching for Best Practices in Retrieval-Augmented Generation,” researchers from Fudan University conducted a comprehensive systematic study to cut through the noise. They didn’t just test one model; they tested every component of the RAG workflow to identify the optimal configuration.
In this post, we will tear down their framework, analyze each module, and reveal the “recipes” for building the most effective and efficient RAG systems available today.
The Anatomy of a RAG System
To understand how to optimize RAG, we first need to visualize the entire workflow. RAG is not a straight line; it is a series of decisions.
As illustrated in Figure 1, the researchers established a comprehensive framework that includes evaluation, fine-tuning, and a multi-step retrieval process.

The workflow operates like a funnel:
- Query Classification: Deciding if retrieval is even necessary.
- Retrieval: Searching a vector database for relevant content.
- Reranking: Refining the search results to find the highest quality matches.
- Repacking & Summarization: Organizing and compressing the data to fit the LLM’s context window.
- Generation: Producing the final answer.
The authors adopted a “modular” research approach. They optimized one module at a time, fixing the best method for that step before moving to the next. Let’s walk through these steps to see what they found.
Step 1: Query Classification
The first question a RAG system should ask isn’t “What documents do I need?” but rather “Do I need documents at all?”
Retrieval takes time. If a user asks “What is the capital of France?”, an LLM can answer from its internal parametric memory instantly. Triggering a database search for such a simple fact is a waste of resources and latency. Conversely, if a user asks about a specific internal company policy from 2024, the system must retrieve external data.
The researchers propose a Query Classification module. As shown in Figure 2, tasks can be categorized based on whether the information is “Sufficient” (contained in the prompt or model) or “Insufficient” (requiring external knowledge).

They trained a BERT-based classifier to automatically distinguish between the two task types. The results were compelling:

The Takeaway: Implementing a classification step is a “quick win.” It reduces the average response time by bypassing retrieval for simple queries, without sacrificing accuracy for complex ones.
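As a minimal sketch, a retrieval gate built around a fine-tuned binary classifier might look like the following (the model name `my-org/query-router` and its label set are hypothetical placeholders, not artifacts from the paper):

```python
from transformers import pipeline

# Hypothetical BERT-based classifier fine-tuned to label queries as
# "sufficient" (answerable from the model's own knowledge) or
# "insufficient" (needs external retrieval).
classifier = pipeline("text-classification", model="my-org/query-router")

def needs_retrieval(query: str) -> bool:
    """Return True if the query should trigger the retrieval pipeline."""
    return classifier(query)[0]["label"] == "insufficient"

# Simple facts can skip retrieval; niche or time-sensitive questions cannot.
print(needs_retrieval("What is the capital of France?"))          # expected: False
print(needs_retrieval("Summarize our 2024 remote-work policy."))  # expected: True
```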
Step 2: Chunking and Embeddings
Before we can retrieve data, we must index it. This involves Chunking (breaking text into pieces) and Embedding (converting text into numbers).
The Chunking Dilemma
If your chunks are too small, you lose context (e.g., a chunk that just says “He said yes” is useless without knowing who “He” is). If chunks are too large, you fill the LLM’s context window with irrelevant noise.
The researchers tested various sizes and found a “Goldilocks” zone. As seen in Table 3, very small chunks (128 tokens) had high relevancy but lower faithfulness. Large chunks (2048 tokens) had the opposite problem.

Best Practice: A chunk size between 256 and 512 tokens offers the best balance. The authors also noted that adding “metadata” (like titles or keywords) to chunks significantly helps, though they stuck to sentence-level chunking for their primary experiments.
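For illustration, here is one way to produce token-based chunks in that 256–512 range, with the document title prepended as lightweight metadata; the tokenizer choice and overlap value are my assumptions, not settings from the paper:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any tokenizer will do

def chunk_text(text: str, title: str = "",
               chunk_tokens: int = 512, overlap: int = 20) -> list[str]:
    """Split text into ~chunk_tokens pieces, optionally prefixing each with the title."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    for start in range(0, len(ids), chunk_tokens - overlap):
        piece = tokenizer.decode(ids[start:start + chunk_tokens])
        # Attaching metadata such as the title gives small chunks back some context.
        chunks.append(f"{title}\n{piece}" if title else piece)
    return chunks
```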
Choosing an Embedder
Once chunked, data must be embedded. The study compared general-purpose embedders (like bge-large and e5) against specialized ones. They found that LLM-Embedder, a model specifically fine-tuned for retrieval tasks, offered the best balance of performance and model size.
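A bare-bones way to embed those chunks with the recommended model might look like this; the CLS pooling and the omission of LLM-Embedder’s task-specific instruction prefixes are simplifications on my part:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# LLM-Embedder is the retriever the paper recommends; pooling details simplified here.
tok = AutoTokenizer.from_pretrained("BAAI/llm-embedder")
model = AutoModel.from_pretrained("BAAI/llm-embedder")

def embed(texts: list[str]) -> torch.Tensor:
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch).last_hidden_state[:, 0]    # take the [CLS] vector
    return torch.nn.functional.normalize(out, dim=-1)   # unit length: dot product = cosine

doc_vecs = embed(["chunk one ...", "chunk two ..."])
query_vec = embed(["What is the refund policy?"])
scores = query_vec @ doc_vecs.T  # higher = more semantically similar
```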

Step 3: The Retrieval Strategy
This is the heart of the system. How do we find the right documents? The researchers explored three main strategies:
- Sparse Retrieval (BM25): Traditional keyword matching. Fast, but misses semantic meaning.
- Dense Retrieval: Vector-based search. Great for understanding meaning, but can miss exact keywords.
- Hybrid Search: Combining both.
The Hybrid Formula
The researchers found that relying on just one method is insufficient. They used a Hybrid Search that combines the two scores, weighted by a parameter \(\alpha\):

\[ S_h = \alpha \cdot S_s + S_d \]
Here, \(S_s\) is the sparse (keyword) score, and \(S_d\) is the dense (vector) score. But how much weight should be given to keywords versus vectors? By testing various alpha values (see Table 8), they discovered that a value of 0.3 yields the best results. This implies that while exact keywords matter, semantic meaning (the vector score) should carry the majority of the weight.
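In code, that weighted combination is only a few lines; the min-max normalization step is an assumption I have added so the BM25 and cosine scores live on a comparable scale:

```python
import numpy as np

def normalize(scores: np.ndarray) -> np.ndarray:
    """Min-max normalize so sparse and dense scores share a [0, 1] scale."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-9)

def hybrid_scores(sparse: np.ndarray, dense: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """S_h = alpha * S_s + S_d, with alpha = 0.3 as reported in the paper."""
    return alpha * normalize(sparse) + normalize(dense)

# Example: rank four candidate documents for one query.
bm25 = np.array([12.1, 3.4, 8.7, 0.5])       # sparse (keyword) scores
cosine = np.array([0.82, 0.91, 0.40, 0.13])  # dense (vector) scores
ranking = np.argsort(-hybrid_scores(bm25, cosine))
```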

Query Transformation: The Power of HyDE
Users often write messy queries. To fix this, the researchers tested Query Rewriting and Query Decomposition, but the standout winner was HyDE (Hypothetical Document Embeddings).
HyDE doesn’t search using your question. Instead, it asks an LLM to hallucinate a hypothetical answer to your question. It then embeds that fake answer and searches for real documents that look like the fake one. This bridges the semantic gap between a question (e.g., “Why is the sky blue?”) and an answer (e.g., “The atmosphere scatters sunlight…”).
The Verdict: The combination of Hybrid Search + HyDE provided the highest retrieval accuracy, though it comes with a latency cost due to the extra generation step.
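A stripped-down HyDE loop could look like the sketch below, reusing the `embed` function from earlier; `generate_answer` stands in for whatever LLM call you have available, and the FAISS-style `index.search` is likewise an assumption about your vector store:

```python
def hyde_search(question: str, index, k: int = 10):
    """Retrieve documents by embedding a hypothetical answer rather than the question."""
    # 1. Ask an LLM to write a plausible (possibly hallucinated) answer.
    hypothetical = generate_answer(
        f"Write a short passage that answers this question:\n{question}"
    )
    # 2. Embed the fake answer and look for real chunks that resemble it.
    vec = embed([hypothetical]).numpy()
    distances, ids = index.search(vec, k)  # e.g., a FAISS index over chunk embeddings
    return ids[0]
```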
Step 4: Reranking
Retrieval engines prioritize speed, which means they rely on Approximate Nearest Neighbor (ANN) search: fast, but not perfectly precise. To fix this, a Reranking step is added. The retriever fetches a large pool of candidates (e.g., 100 documents), and a specialized “Reranker” model scores them more carefully, keeping only the top few.
The paper compared two types of rerankers:
- DLM Reranking (monoT5): Uses a deep language model (here, a sequence-to-sequence T5) to judge each query-document pair directly. Highly accurate, but slower.
- TILDE / TILDEv2: A lightweight method that precomputes term-level scores for each document, making query-time reranking far cheaper.
The Verdict: If you want the absolute best quality, monoT5 is the winner. If speed is critical, TILDEv2 offers a massive speedup with only a minor drop in performance.
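Here is a hedged sketch of DLM-style reranking with a public monoT5 checkpoint; the prompt follows the usual monoT5 convention, and the single-decoder-step scoring below is a simplification of how such rerankers are typically run:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# monoT5 checkpoint released by the Castorini group.
tok = T5Tokenizer.from_pretrained("castorini/monot5-base-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco")
TRUE_ID, FALSE_ID = tok.encode("true")[0], tok.encode("false")[0]

def rerank(query: str, docs: list[str], top_k: int = 5) -> list[str]:
    """Score each document with one decoder step and keep the top_k."""
    scores = []
    start = torch.tensor([[model.config.decoder_start_token_id]])
    for doc in docs:
        inputs = tok(f"Query: {query} Document: {doc} Relevant:",
                     return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs, decoder_input_ids=start).logits[0, -1]
        # Relevance = probability of "true" versus "false" at the first decoded token.
        scores.append(torch.softmax(logits[[TRUE_ID, FALSE_ID]], dim=-1)[0].item())
    order = sorted(range(len(docs)), key=lambda i: -scores[i])
    return [docs[i] for i in order[:top_k]]
```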
Step 5: Repacking and Summarization
Once we have the top documents, we can’t just throw them at the LLM randomly. The order matters.
Repacking: The “Reverse” Strategy
Research into the “Lost in the Middle” phenomenon shows that LLMs pay the most attention to the beginning and the end of their prompt.
The authors tested “Forward” (best docs first), “Reverse” (best docs last), and “Sides” (best docs at start and end). They found that the Reverse strategy—placing the most relevant documents at the very bottom of the context, right next to the user’s question—yielded the best performance.
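Reverse repacking is almost trivial to implement once the reranker has produced a best-first ordering; the prompt template below is purely illustrative:

```python
def build_prompt(question: str, ranked_docs: list[str]) -> str:
    """'Reverse' repacking: the most relevant documents go last, nearest the question."""
    context = "\n\n".join(reversed(ranked_docs))  # best document ends up at the bottom
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```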
Summarization
Finally, if the retrieved documents are long, they might exceed the context window or confuse the model. The authors tested methods to compress this text. They found that Recomp, which uses both extractive and abstractive compression, effectively condensed information without losing the core facts needed to answer the query.
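Recomp itself is a trained compressor, so the snippet below is only a crude extractive stand-in for the idea: keep the sentences closest to the query and drop the rest (it reuses the `embed` function from the embedding section):

```python
def extractive_compress(query: str, docs: list[str], keep: int = 5) -> str:
    """Keep the `keep` sentences whose embeddings are most similar to the query."""
    sentences = [s.strip() for d in docs for s in d.split(".") if s.strip()]
    sims = (embed([query]) @ embed(sentences).T)[0]
    top = sims.topk(min(keep, len(sentences))).indices.tolist()
    return ". ".join(sentences[i] for i in sorted(top))  # keep the original order
```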
Generator Fine-Tuning
Beyond the pipeline architecture, the researchers also looked at the LLM itself. Can we fine-tune the generator to be better at RAG?
They experimented with different training data mixtures, feeding the model queries paired with:
- Only relevant (“golden”) documents (\(D_g\)).
- Only random/irrelevant documents (\(D_r\)).
- A mixture of both (\(D_{gr}\)).

As shown in Figure 3, the model trained with a mixture (\(M_{gr}\)) performed the best. This teaches the model a critical skill: discrimination. By seeing both relevant and irrelevant data during training, the model learns to ignore noise and focus on the signal.
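A sketch of how such a training mixture might be assembled; the field names and the one-relevant-plus-one-random pairing are my reading of the setup, not code from the paper:

```python
import random

def build_dgr_example(query: str, answer: str,
                      relevant_docs: list[str], corpus: list[str]) -> dict:
    """Pair the query with one relevant and one random document (the D_gr mixture)."""
    context = [random.choice(relevant_docs), random.choice(corpus)]
    random.shuffle(context)  # position should not reveal which document is relevant
    return {
        "prompt": "\n\n".join(context) + f"\n\nQuestion: {query}\nAnswer:",
        "completion": answer,
    }
```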
The Verdict: Optimal RAG Recipes
After testing every single module, the researchers combined their findings into a massive benchmark evaluation. Table 11 summarizes the search for the optimal practice.

Based on this data, the paper proposes two distinct “recipes” for developers:
Recipe 1: Best Performance (The “All-Out” Approach)
Use this when accuracy is paramount, and you can tolerate higher latency (e.g., offline analysis, medical advice).
- Classification: Yes
- Retrieval: Hybrid Search + HyDE
- Reranking: monoT5
- Repacking: Reverse
- Summarization: Recomp
Recipe 2: Balanced Efficiency (The “Production” Approach)
Use this for real-time applications (e.g., chatbots) where speed matters.
- Classification: Yes
- Retrieval: Hybrid Search (Drop HyDE to save time)
- Reranking: TILDEv2 (Faster than monoT5)
- Repacking: Reverse
- Summarization: Recomp (or skip if context window allows)
The “Balanced” recipe drastically cuts latency (from ~11 seconds to ~1.5 seconds in some setups) while maintaining competitive accuracy.
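Expressed as configuration, the two recipes differ in just two switches; the dictionary keys below are simply one plausible way to wire such a pipeline, not an interface from the paper:

```python
BEST_PERFORMANCE = {
    "query_classification": True,
    "retrieval": "hybrid+hyde",
    "reranker": "monot5",
    "repacking": "reverse",
    "summarizer": "recomp",
}

BALANCED_EFFICIENCY = {
    **BEST_PERFORMANCE,
    "retrieval": "hybrid",   # drop HyDE to skip the extra LLM call
    "reranker": "tildev2",   # much faster, with only a small quality trade-off
}
```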
Beyond Text: Multimodal RAG
The paper concludes with an exciting extension: “Retrieval as Generation” for images.
Generating an image from scratch (using tools like Stable Diffusion) is computationally expensive and hard to control. The researchers propose a multimodal RAG system.

As illustrated in Figure 4:
- Text-to-Image: When a user asks for an image (e.g., “A dog sleeping”), the system first checks a database of existing images. If a high-similarity match exists, it retrieves it instantly. It only generates a new image if no good match is found.
- Image-to-Text: Similarly, for image captioning, it retrieves existing captions for similar images before attempting to generate a new one.
This approach ensures Groundedness (using verified media rather than potentially hallucinated visuals) and significantly improves Efficiency.
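A retrieval-as-generation gate can be as simple as a similarity threshold over an indexed image store; the CLIP-style text encoder, the FAISS-style index, the threshold value, and `generate_image` are all placeholders here:

```python
def get_image(prompt: str, clip_text_encoder, image_index, image_paths,
              threshold: float = 0.3):
    """Return a stored image if one matches the prompt closely enough, else generate one."""
    text_vec = clip_text_encoder(prompt)                  # unit-normalized text embedding
    sims, ids = image_index.search(text_vec[None, :], 1)  # nearest stored image
    if sims[0, 0] >= threshold:
        return image_paths[ids[0, 0]]   # grounded: reuse a verified, existing image
    return generate_image(prompt)       # fall back to a diffusion model only when needed
```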
Conclusion
The “best” RAG system is not a single model, but a carefully orchestrated pipeline. This research highlights that while the Large Language Model often gets all the hype, the engineering around it—how you classify, chunk, retrieve, and rank data—is what actually determines success.
By implementing Query Classification to save resources, utilizing Hybrid Search to capture both keywords and meaning, and employing Reverse Repacking to optimize LLM attention, developers can build RAG systems that are not just accurate, but also efficient enough for the real world.