Retrieval-Augmented Generation (RAG) has become the de facto standard for building reliable AI applications. By connecting Large Language Models (LLMs) to external knowledge bases, we can ground their answers in reality and reduce hallucinations.

However, standard RAG pipelines have a significant inefficiency problem. When we retrieve documents, we typically grab the “top-k” passages and shove them all into the LLM’s context window. This approach relies on the hope that the model will figure out which sentences matter and which are just noise.

This strategy leads to input redundancy. We feed the model irrelevant content, which drives up computational costs (more tokens) and, counter-intuitively, increases the risk of hallucination: the irrelevant passages act as distractors that pull the model away from the correct answer.

What if we could place a smart filter in the middle? An “evidence extractor” that reads the retrieved documents and passes only the exact evidence required to answer the user’s question?

In this post, we are diving deep into SEER (Self-Aligned Evidence Extraction), a framework proposed by researchers from the Harbin Institute of Technology and Huawei. SEER introduces a novel way to train an evidence extractor without needing expensive human-labeled datasets. It uses a self-alignment process to teach a model to be faithful, helpful, and concise.

The Problem: Why Simple Filtering Fails

Ideally, an LLM should receive supporting content that is helpful enough to answer the query but concise enough to save processing time.

Previous attempts to build these “filters” have relied on heuristics (rule-based methods). For example, a system might split documents into sentences and check if they contain keywords from the query. While simple, these methods suffer from three major flaws:

  1. Poor Generalization: Hand-crafted rules don’t adapt well to new domains.
  2. Semantic Deficiency: Arbitrarily chopping text into chunks often breaks the meaning of the context.
  3. Skewed Length: Rule-based filtering tends to produce training data with a skewed length distribution, which biases the trained extractor.

The authors of SEER posed a critical question: Can we use the model itself to figure out what good evidence looks like?

To validate this intuition, they compared heuristic-based augmentation (using string inclusion rules) against “Model-based Augmentation” (letting a model pick the evidence).

Figure 2: Comparison between model-based and heuristic-based augmentation w.r.t. context relevance.

As shown in Figure 2, the potential ceiling for model-based approaches (Upper Model-based Aug) is significantly higher than heuristic approaches (StrInc). This suggests that models effectively “know” what good evidence looks like better than our hard-coded rules do; we just need to unlock that capability.

The SEER Framework: A High-Level Overview

SEER stands for Self-Aligned Evidence Extraction for Retrieval-Augmented Generation. The goal is to train a small, efficient “Extractor” model (\(\mathcal{E}\)) that sits between the retriever and the final generator.

The Extractor takes a query and a set of retrieved passages, and it outputs a concise piece of evidence \(e\). This evidence is then passed to the Generator (\(\mathcal{G}\)) to produce the final answer.
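
To make the data flow concrete, here is a minimal sketch of where the extractor sits in the pipeline. The function and method names (`search`, `extract`, `answer`) are illustrative placeholders, not the paper’s actual API:

```python
def answer_with_seer(query: str, retriever, extractor, generator, k: int = 5) -> str:
    """Illustrative RAG pipeline with an evidence-extraction middle layer."""
    passages = retriever.search(query, top_k=k)    # P: top-k retrieved passages
    evidence = extractor.extract(query, passages)  # e: concise evidence from E(. | q ⊕ P)
    return generator.answer(query, evidence)       # final answer from G(. | q ⊕ e)
```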

Figure 1: The RAG pipeline with the evidence extractor, in which the supporting content and the distracting content are marked in green and yellow, respectively.

As illustrated in Figure 1, effective evidence must satisfy three distinct criteria:

  1. Faithfulness: The evidence must actually exist in the source text (no hallucinations).
  2. Helpfulness: The evidence must contain the information needed to answer the question.
  3. Conciseness: The evidence should be as short as possible to save costs and reduce noise.

Figure 1 shows how balancing these is tricky. Option 1 is faithful but misses the full answer. Option 2 is helpful but redundant. Option 3 is the “Goldilocks” zone—perfectly faithful, helpful, and concise.

The Three-Stage Pipeline

How does SEER train a model to find this “Goldilocks” evidence without human labels? It uses a three-stage pipeline shown in Figure 3.

Figure 3: The overall framework of SEER, which consists of three modeling stages.

  1. Evidence Extraction (Sampling): The model generates its own candidate evidence.
  2. Expert Assessment: “Expert” models score these candidates on Faithfulness, Helpfulness, and Conciseness.
  3. Self-Alignment: The model uses these scores to update its weights, learning to prefer better evidence.

Let’s break down each stage in detail.

Stage 1: Evidence Extraction via Sampling

The first step relies on the concept of Response Sampling. Given a query \(q\) and retrieved passages \(P\), the unaligned base extractor \(\tilde{\mathcal{E}}\) samples multiple candidate evidence outputs \(e\), each of which can be passed to the generator \(\mathcal{G}\) to produce an output \(o\):

\[ e \sim \tilde{\mathcal{E}}(\cdot \mid q \oplus P), \quad o \sim \mathcal{G}(\cdot \mid q \oplus e) \]

Because sampled LLM outputs tend to follow a power-law distribution (the same “head” responses appear over and over), SEER removes duplicates to create a uniform set of unique candidates. This diversity is crucial: it gives the alignment algorithm a variety of options to rank, ranging from “terrible evidence” to “perfect evidence.”
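
Here is a rough sketch of this sampling-and-deduplication step, assuming a Hugging Face causal LM as the base extractor; the prompt template and model name are illustrative, not the paper’s exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # assumed base extractor; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

def sample_candidates(query: str, passages: str, n: int = 8) -> list[str]:
    """Sample n evidence candidates, then deduplicate to a set of unique ones."""
    prompt = (f"Extract the evidence needed to answer the question.\n"
              f"Passages: {passages}\nQuestion: {query}\nEvidence:")
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.95,
                         num_return_sequences=n, max_new_tokens=128)
    prompt_len = inputs["input_ids"].shape[1]
    texts = [tok.decode(seq[prompt_len:], skip_special_tokens=True).strip() for seq in out]
    return list(dict.fromkeys(texts))  # drop duplicates, keep first-seen order
```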

Stage 2: Expert Assessment

Once the model generates a set of candidates, how do we know which ones are good? SEER employs three automated “Experts” to grade the evidence. This forms a quadruple dataset called QuadQARE (Query, Answer, Retrieved Passage, Evidence).
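
As a data structure, each training instance might look like the following; this is a hypothetical sketch, and the field names are ours:

```python
from dataclasses import dataclass

@dataclass
class QuadQARE:
    query: str     # q: the user's question
    answer: str    # a: the golden answer
    passage: str   # P: the retrieved passage(s)
    evidence: str  # e: one candidate evidence extraction
```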

1. The Faithfulness Expert

This expert checks if the extracted evidence \(e\) is supported by the original retrieved passage \(P\). It uses an NLI (Natural Language Inference) model called ALIGNSCORE. If the evidence is hallucinated or modified too much, this score drops.

\[ s^{f} = \mathrm{ALIGNSCORE}(P, e) \]
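
ALIGNSCORE ships as its own package with its own checkpoints, so as a stand-in, here is a sketch of the same idea using an off-the-shelf NLI model: treat the passage as the premise and the evidence as the hypothesis, and use the entailment probability as \(s^f\):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_tok = AutoTokenizer.from_pretrained("microsoft/deberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-large-mnli")

def faithfulness_score(passage: str, evidence: str) -> float:
    """Approximate s^f as P(passage entails evidence) under an NLI model."""
    inputs = nli_tok(passage, evidence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    # Label order for this checkpoint: 0=contradiction, 1=neutral, 2=entailment
    return probs[2].item()
```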

2. The Helpfulness Expert

This expert measures utility. It asks: does adding this evidence make the correct answer more likely? It compares the probability of the Generator producing the golden answer \(a\) when conditioned on the query plus the evidence versus on the query alone.

\[ s^{h} = \mathrm{Sig}\left( \log \frac{\prod f(a \mid q \oplus e)}{\prod f(a \mid q)} \right) \]

If the evidence \(e\) boosts the probability of the correct answer, \(s^h\) is close to 1. If it’s irrelevant, the score is low.
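
A sketch of this score with a Hugging Face causal LM, reusing the `model` and `tok` objects from the sampling sketch above; the prompt format is again an assumption:

```python
import torch

def answer_logprob(prompt: str, answer: str) -> float:
    """Sum of log-probabilities of the answer tokens, conditioned on the prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = logits[:, :-1].log_softmax(dim=-1)  # position i predicts token i+1
    targets = full_ids[0, prompt_ids.shape[1]:]     # the answer tokens
    positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[0, p, t].item() for p, t in zip(positions, targets))

def helpfulness_score(query: str, evidence: str, answer: str) -> float:
    """s^h = Sig(log P(a | q ⊕ e) - log P(a | q))."""
    delta = (answer_logprob(f"{query}\n{evidence}\nAnswer: ", answer)
             - answer_logprob(f"{query}\nAnswer: ", answer))
    return torch.sigmoid(torch.tensor(delta)).item()
```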

3. The Conciseness Expert

To prevent the model from just copying the whole passage (which is faithful and helpful, but inefficient), SEER measures conciseness. It compares the extracted evidence \(e\) to a “Full-Length Answer” \(t\) (a minimal declarative sentence containing the answer).

\[ s^{c} = \mathrm{SBERT}_{\mathrm{cosine}}(t, e) \]

This uses semantic similarity (SBERT). The closer the evidence is to the minimal necessary answer, the better the score.
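
This one is a few lines with the sentence-transformers library; the embedding model below is a common default, not necessarily the paper’s choice:

```python
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")

def conciseness_score(full_length_answer: str, evidence: str) -> float:
    """s^c: cosine similarity between the minimal answer sentence t and the evidence e."""
    emb_t, emb_e = sbert.encode([full_length_answer, evidence], convert_to_tensor=True)
    return util.cos_sim(emb_t, emb_e).item()
```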

The “Smoothing CoV-Weighting” Trick

Now we have three scores: \(s^f, s^h, s^c\). You might think we should just average them. However, SEER introduces a clever weighting mechanism based on the Coefficient of Variation (CoV).

The insight is that learning difficulty varies across criteria. If the model is already very good at faithfulness (low variance in scores), we shouldn’t waste training cycles optimizing it further. We should focus on the criteria where the model is inconsistent (high variance).

First, they calculate the CoV for each score type (standard deviation divided by mean):

\[ c^{f} = \sigma^{f}/\mu^{f}, \quad c^{h} = \sigma^{h}/\mu^{h}, \quad c^{c} = \sigma^{c}/\mu^{c} \]

Then, they determine the weights (\(\alpha\)) using a softmax over these coefficients, where \(\tau\) is a temperature hyperparameter:

\[ \alpha^{f} = \frac{\exp(c^{f}/\tau)}{\sum_{* \in \{f, h, c\}} \exp(c^{*}/\tau)} \]

Finally, the total score \(s\) is the weighted sum:

\[ s = \alpha^{f} s^{f} + \alpha^{h} s^{h} + \alpha^{c} s^{c} \]

This ensures that the optimization process dynamically focuses on the property that needs the most improvement.
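
The whole weighting scheme fits in a few lines of NumPy. The sketch below computes the weights from a batch of candidate scores; the smoothing of the CoV statistics across training steps that gives the trick its name is omitted for brevity:

```python
import numpy as np

def cov_weighted_scores(s_f, s_h, s_c, tau: float = 1.0) -> np.ndarray:
    """Combine per-candidate expert scores using CoV-based softmax weights."""
    scores = np.stack([s_f, s_h, s_c])               # shape: (3, n_candidates)
    cov = scores.std(axis=1) / scores.mean(axis=1)   # c^* = sigma^* / mu^*
    weights = np.exp(cov / tau)
    weights /= weights.sum()                         # alpha^* via softmax
    return weights @ scores                          # s = sum over * of alpha^* s^*

# Example: three candidates scored by each expert
s = cov_weighted_scores(np.array([0.9, 0.8, 0.95]),  # faithfulness
                        np.array([0.2, 0.9, 0.5]),   # helpfulness
                        np.array([0.6, 0.4, 0.7]))   # conciseness
```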

Stage 3: Self-Alignment (LPO)

We now have a list of candidate evidence and a quality score for each. The usual tools for aligning a model to such preferences are PPO (Proximal Policy Optimization) and DPO (Direct Preference Optimization).

However, the authors argue these are insufficient. PPO is unstable and hard to tune. DPO is pairwise (A vs B) and doesn’t account for the magnitude of the difference or the overall list ranking.

SEER introduces Listwise-aware Preference Optimization (LPO). It modifies the DPO loss function to include a weighting factor \(\lambda_{w,l}\) based on the ranking position.

The loss function looks like this:

\[ \mathcal{L}_{\mathrm{LPO}}(\pi_{\theta}; \pi_{\mathrm{ref}}, \lambda_{w,l}) = -\mathbb{E}_{(x, y_{w}, y_{l}) \sim \mathcal{D}} \left[ \lambda_{w,l} \log \mathrm{Sig}\left( \beta \log \frac{\pi_{\theta}(y_{w} \mid x)}{\pi_{\mathrm{ref}}(y_{w} \mid x)} - \beta \log \frac{\pi_{\theta}(y_{l} \mid x)}{\pi_{\mathrm{ref}}(y_{l} \mid x)} \right) \right] \]

The critical innovation is the \(\lambda_{w,l}\) term. It scales the loss based on the Mean Reciprocal Rank (MRR) gain. Essentially, the model is penalized more heavily for swapping a high-ranked candidate with a low-ranked one than for swapping two mediocre candidates.

\[ \lambda_{w,l} = s_{w}\, \Delta\mathrm{MRR}_{w,l} + s_{l}\, \Delta\mathrm{MRR}_{l,w} \]

This “Listwise-aware” approach allows SEER to optimize the entire ranking of evidence candidates, pushing the model toward the ideal “Goldilocks” evidence.
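
To see how small the step from DPO is, here is a sketch of the LPO objective in PyTorch. The sequence log-probabilities would come from scoring each candidate under the policy and reference models, and the \(\Delta\mathrm{MRR}\) values are assumed to be precomputed from the candidate ranking:

```python
import torch
import torch.nn.functional as F

def lambda_listwise(s_w, s_l, d_mrr_wl, d_mrr_lw):
    """lambda_{w,l} = s_w * dMRR_{w,l} + s_l * dMRR_{l,w} (scores and MRR gains precomputed)."""
    return s_w * d_mrr_wl + s_l * d_mrr_lw

def lpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, lam_wl, beta=0.1):
    """DPO's pairwise logistic loss, scaled by the listwise weight lambda_{w,l}.
    All arguments are tensors over a batch of (winner, loser) candidate pairs."""
    margin = (beta * (policy_logp_w - ref_logp_w)
              - beta * (policy_logp_l - ref_logp_l))
    return -(lam_wl * F.logsigmoid(margin)).mean()
```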

Experimental Results

Does this complex pipeline actually result in better RAG performance? The authors tested SEER on three benchmarks: NaturalQuestions (NQ), TriviaQA (TQA), and HotpotQA.

1. Accuracy vs. Cost

The most impressive result is the trade-off between performance and token usage.

Table 1: QA performance comparison.

In Table 1, compare CGE Full (feeding the full retrieved passage) with FGE SEER:

  • Performance: SEER consistently achieves higher Exact Match (EM) and F1 scores. For example, on NQ with Llama2, SEER reaches 0.4549 EM compared to Full’s 0.4382.
  • Cost: Look at the “Tok” (average tokens) row. The Full method uses ~800 tokens. SEER uses fewer than 100.
  • Efficiency: SEER reduces the evidence length by a factor of 9.25 while improving accuracy. This is a massive cost saving for API-based RAG systems.

2. The Impact of Alignment Methods

The authors validated their choice of LPO over PPO and DPO.

Figure 4: Alignment performance on faithfulness, helpfulness, and conciseness.

Figure 4 shows the breakdown. In almost every metric (Faithfulness, Helpfulness, Conciseness), LPO (the rightmost bar in each group) outperforms the Base model, PPO, and standard DPO. This confirms that incorporating listwise ranking signals into the preference optimization creates a more robust extractor.

3. Robustness to Noise

Real-world retrieval is messy. Often, the retriever returns 4 bad documents and 1 good one.

Figure 5: Model performance w.r.t. the Noise-to-Signal Ratio (NSR).

Figure 5 tests the model’s robustness by increasing the “Noise-to-Signal Ratio” (NSR).

  • Blue bars/lines (Aligned): The SEER model.
  • Grey bars/lines (Base): The unaligned model.

As the noise increases (moving right on the x-axis), the performance drops for both. However, the Aligned model (SEER) maintains significantly higher Silver Faithfulness scores and generally suffers smaller performance drops than the base model. This indicates that SEER effectively learns to ignore distracting content.

Conclusion and Implications

SEER represents a significant step forward in making RAG systems more efficient and reliable. By moving away from heuristic filters and towards model-based self-alignment, we can create systems that act as intelligent curators of information.

The key takeaways from this research are:

  1. Less is More: You don’t need to feed LLMs thousands of tokens of context. High-quality, extracted evidence often yields better answers than full documents.
  2. Self-Correction Works: LLMs have the latent ability to identify good evidence; they just need to be aligned using their own generated data.
  3. Ranking Matters in Training: When tuning these models, treating training data as pairs (A is better than B) is good, but treating them as a ranked list (A is the best, B is okay, C is bad) via LPO is significantly better.

For students and practitioners working on RAG, SEER highlights the importance of the “middle layer” in the pipeline. Optimizing the retriever and the generator is standard practice, but optimizing the interface between them—the evidence extractor—might be the lowest-hanging fruit for performance gains.