Introduction

We often think of Large Language Models (LLMs) as vast repositories of knowledge, but they have a significant weakness: they cannot memorize everything, especially real-time events or niche domain knowledge. To address this, the AI community has widely adopted Retrieval-Augmented Generation (RAG). The idea is simple: when an LLM is asked a question, it first searches an external corpus (such as Wikipedia) for relevant documents, then uses those documents to generate an answer.

Ideally, this gives the LLM an “open book” exam capability. However, there is a catch. What happens if the “book” is wrong, irrelevant, or misleading?

Standard RAG systems operate on blind trust. They retrieve documents and feed them to the LLM, assuming the information is useful. But as recent research highlights, LLMs are easily distracted. If the retrieval system pulls in noise, the LLM often hallucinates or provides incorrect answers based on that noise.

Consider the scenarios illustrated below:

Figure 1: LLMs may be misled by irrelevant documents, and struggle to determine the relevance of a document.

As shown in Figure 1, an LLM might know the answer based on its internal training (Row 1). However, if you provide it with a document containing irrelevant or slightly misleading information, the model can be tricked into giving a wrong answer (Row 2). Even worse, standard LLMs struggle to self-assess whether a document is actually relevant (Row 3).

To bridge this gap, researchers have introduced REAR (RElevance-Aware Retrieval-augmented framework). This new approach aims to give LLMs a sense of “self-awareness” regarding external data. Instead of blindly accepting retrieved documents, REAR explicitly assesses their relevance and adaptively decides whether to trust the external source or rely on its own internal knowledge.

In this deep dive, we will explore how the REAR architecture works, how it fuses internal and external knowledge, and why its training methodology offers a robust solution to the noise problem in Open-Domain Question Answering (QA).


Background: The Context of Open-Domain QA

Before dissecting REAR, it is helpful to understand the standard workflow of Open-Domain Question Answering.

The task typically involves a Retriever-Reader pipeline. Given a query \(q\) (e.g., “Who won the first Nobel Prize in Physics?”), the Retriever scans a massive document collection \(\mathcal{D}\) to find the top-\(k\) most similar documents.

The Reader (the LLM) then takes these documents and attempts to generate an answer. Mathematically, for a set of retrieved documents, the LLM generates a set of candidate answers:

Equation 1: Standard RAG formulation.
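The equation itself is not reproduced here, but based on the description above, a plausible reconstruction is one candidate answer per retrieved document:

\[
a_i = \mathrm{LLM}(q, d_i), \quad i = 1, \dots, k
\]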

In traditional setups, the LLM treats these documents as ground truth. However, retrieval is imperfect. It frequently returns documents that share keywords with the query but do not actually answer it, or worse, that contain contradictory information.

Recent attempts to fix this, such as Self-RAG or RobustLM, have tried to make models “introspect” by generating special tokens (like [Relevant] or [Irrelevant]) before answering. While a step in the right direction, these methods often rely on binary classifications (Relevant vs. Not Relevant) which are too sparse to capture the nuance of how useful a document actually is. They also treat the relevance assessment and the answer generation as loosely coupled steps, rather than a deeply integrated process.

This is where REAR changes the game.


The REAR Architecture

The core philosophy of REAR is that an LLM should not just generate text; it should simultaneously act as a judge of its own inputs. The authors propose a novel architecture that integrates an explicit Assessment Module directly into the generation process.

The framework operates in three distinct steps, as visualized in the architecture overview below:

Figure 2: The overview of the proposed REAR framework.

Let’s break down these three phases: Relevance Assessment, Relevance-Guided Generation, and Knowledge Reliability Verification.

1. Relevance Assessment

When the REAR model receives a query \(q\) and a retrieved document \(d\), it doesn’t immediately try to answer the question. First, it encodes the pair to understand their relationship.

The model uses the LLM backbone to create a representation of the query and document. Specifically, it looks at the hidden states of the last token to extract a “relevance embedding,” denoted as \(v_{rel}\).

Equation 2: Calculating the relevance embedding.

This vector \(v_{rel}\) contains the semantic information regarding how well the document matches the query. But an embedding vector is abstract. To make it actionable, REAR passes this vector through a specially designed Assessment Module (a linear projection layer) to output a scalar score, \(s_{rel}\).

Equation 3: Calculating the relevance score.

This score \(s_{rel}\) represents the model’s confidence in the document. A high score means the document is highly relevant; a low score suggests it is noise. Unlike previous methods that just ask the LLM to output the word “Yes” or “No,” this method utilizes the model’s high-dimensional internal states to produce a precise, continuous relevance score.
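To make this concrete, here is a minimal PyTorch sketch of such an assessment module. It illustrates the idea (take the last-token hidden state, project it to a scalar), but it is not the paper's released code; the class and variable names are my own.

```python
import torch
import torch.nn as nn

class RelevanceHead(nn.Module):
    """Sketch of an assessment module: a linear projection that maps the
    last-token hidden state of the LLM backbone to a scalar relevance score.
    Names and wiring are illustrative, not the paper's exact implementation."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # last_hidden_state: (batch, seq_len, hidden_size) from the LLM backbone
        v_rel = last_hidden_state[:, -1, :]   # hidden state of the final token
        s_rel = self.proj(v_rel).squeeze(-1)  # one scalar relevance score per example
        return s_rel

# Usage sketch (assumes a Hugging Face-style causal LM that can return hidden states):
# outputs = llm(input_ids, output_hidden_states=True)
# s_rel = relevance_head(outputs.hidden_states[-1])
```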

2. Relevance-Guided Generation

This is the most innovative part of the REAR framework. In standard approaches, once relevance is judged, it is often just used to filter documents. REAR, however, uses the relevance score to guide the generation of the answer itself.

Because the relevance score \(s_{rel}\) is just a number (a scalar), it’s hard for a massive neural network to pay attention to it directly. To fix this, the researchers map this score back into a high-dimensional vector, called the guidance vector (\(v_{guide}\)).

Equation 4: Creating the guidance vector.

This \(v_{guide}\) is essentially a signal flare. It is fed into the LLM alongside the query and document. It tells the model: “This document has a relevance score of 0.9, so trust it,” or “This document scores 0.1, so ignore it and use your internal parametric knowledge.”

With this guidance, the LLM generates the answer \(a\):

Equation 5: Generating the answer with guidance.

This creates a dynamic system where the model adapts its behavior on the fly based on the quality of the retrieved information.
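One plausible way to implement this guidance signal, sketched below under the assumption that the guidance vector is injected as an extra "soft token" embedding (the paper may wire it in differently): the scalar score is expanded by a learned linear layer and prepended to the input embeddings.

```python
import torch
import torch.nn as nn

class RelevanceGuidance(nn.Module):
    """Sketch of turning the scalar relevance score into a guidance vector the
    LLM can attend to. The exact injection mechanism in REAR may differ; here
    the vector is prepended as an extra "soft token" embedding."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.expand = nn.Linear(1, hidden_size)

    def forward(self, s_rel: torch.Tensor, input_embeds: torch.Tensor) -> torch.Tensor:
        # s_rel: (batch,) scalar scores; input_embeds: (batch, seq_len, hidden_size)
        v_guide = self.expand(s_rel.unsqueeze(-1))        # (batch, hidden_size)
        v_guide = v_guide.unsqueeze(1)                    # (batch, 1, hidden_size)
        return torch.cat([v_guide, input_embeds], dim=1)  # guidance token first

# The modified embeddings can then be fed to the LLM via `inputs_embeds`,
# so every generated token is conditioned on the relevance signal.
```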

3. Knowledge Reliability Verification

Even with the best assessment, models can make mistakes. The researchers introduced a post-generation verification step to double-check the answer’s validity. They propose two strategies, but the most interesting one is Knowledge Consistency.

The logic is inspired by human reasoning: If you are confident in a fact (e.g., “The sky is blue”), reading a document that says “The sky is green” shouldn’t change your mind easily. Conversely, if you don’t know the answer, you rely entirely on the document.

To test this mathematically, REAR calculates a “consistency score.” It effectively asks: How confused would the model be if we forced it to believe the document is irrelevant?

They do this by setting the relevance signal to zero (\(\hat{s}_{rel} = 0\)) and measuring the Perplexity (PPL) of the generated answer. Perplexity measures how “surprised” a model is by a sequence of words.

Equation 6: Calculating the Knowledge Consistency score.

  • Scenario A: The model utilized the document to answer because it didn’t know the answer itself. If we set the relevance to 0 (telling the model “ignore the doc”), the model’s perplexity for that answer will skyrocket (it becomes very “surprised” by the answer).
  • Scenario B: The model used its own internal knowledge. Setting the relevance to 0 won’t change the perplexity much, because the model didn’t need the doc anyway.

By combining the original relevance score (\(s_{rel}\)) with this consistency score (\(c\)), REAR selects the final answer that balances external evidence with internal confidence.
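As a rough sketch of how such a consistency check could be computed, the snippet below scores a fixed answer's perplexity under the model; running it once with the original relevance signal and once with the signal zeroed gives the comparison described above. The function name and embedding-level interface are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def answer_perplexity(llm, prompt_embeds, answer_ids):
    """Perplexity of a fixed answer under the model, given already-built prompt
    embeddings (encoding the query, document, and guidance vector).
    This is a sketch; the paper's exact scoring may differ."""
    answer_embeds = llm.get_input_embeddings()(answer_ids)
    inputs = torch.cat([prompt_embeds, answer_embeds], dim=1)
    logits = llm(inputs_embeds=inputs).logits
    # The logits predicting each answer token sit one position to its left.
    start = prompt_embeds.size(1) - 1
    pred = logits[:, start:start + answer_ids.size(1), :]
    nll = F.cross_entropy(pred.transpose(1, 2), answer_ids, reduction="mean")
    return torch.exp(nll)

# Consistency check (sketch): re-score the same answer with the relevance
# signal forced to zero. A large jump in perplexity suggests the answer
# leaned on the document rather than on parametric knowledge.
# ppl_guided = answer_perplexity(llm, embeds_with_s_rel, answer_ids)
# ppl_zeroed = answer_perplexity(llm, embeds_with_s_rel_zero, answer_ids)
```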


Training: Teaching the Model to Doubt

Architecture is only half the battle. To make REAR effective, the researchers had to devise a sophisticated training regimen. Standard RAG models are often trained on “gold” (perfect) query-document pairs. They rarely see noise during training, so they never learn to identify it.

REAR introduces two major training innovations: Bi-granularity Relevance Fusion and Noise-Resistant Training.

Bi-granularity Relevance Fusion

Binary labels (Relevant / Not Relevant) are often insufficient for complex questions. As shown in Figure 4, documents exist on a spectrum. Some contain the direct answer, some contain clues that require reasoning, and some are completely irrelevant.

Figure 4: The illustration of different retrieved documents and different labeling metrics.

If we only use binary labels (the Check/Cross marks), we lose the nuance between a “perfect match” and a “helpful clue.”

REAR uses Bi-granularity training, which combines:

  1. Coarse-grained (Binary): Is it relevant or not?
  2. Fine-grained (Ranking): How relevant is it compared to other documents?

To achieve fine-grained supervision, the authors use a preference-based loss function. They force the model to assign a higher score to a more relevant document (\(s_i\)) than a less relevant one (\(s_j\)).

Equation 7: Fine-grained preference loss.
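A standard pairwise preference loss matching this description (a reconstruction, not copied from the paper) would be:

\[
\mathcal{L}_{\text{fine}} = -\log \sigma\left(s_i - s_j\right),
\]

where \(\sigma\) is the sigmoid function and document \(i\) is labeled as more relevant than document \(j\).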

The total loss combines both the coarse and fine signals:

Equation 8: Bi-granularity loss function.

This ensures the model can distinguish between “totally wrong,” “kinda helpful,” and “exactly right.”
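A compact sketch of what combining the two granularities could look like in PyTorch (the actual weighting and batching in REAR may differ):

```python
import torch
import torch.nn.functional as F

def bi_granularity_loss(scores, binary_labels, pos_idx, neg_idx):
    """Sketch of combining coarse (binary) and fine (preference) relevance
    supervision. `scores` holds the scalar s_rel values for a batch of
    documents; the exact weighting in REAR may differ."""
    # Coarse-grained: can the model separate relevant from irrelevant at all?
    coarse = F.binary_cross_entropy_with_logits(scores, binary_labels.float())
    # Fine-grained: a more relevant document (pos) should outscore a less
    # relevant one (neg), expressed as a pairwise logistic preference loss.
    fine = -F.logsigmoid(scores[pos_idx] - scores[neg_idx]).mean()
    return coarse + fine
```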

Noise-Resistant Training

To prevent the model from hallucinating when given bad data, the researchers explicitly include negative samples (irrelevant documents) in the training set.

They use a technique called Hard Negative Sampling. Instead of picking random irrelevant documents (which are too easy to spot), they pick documents that are similar to the query but don’t contain the answer. This forces the model to read carefully.

The sampling probability is carefully calculated to ensure negatives are difficult but not “false negatives” (which are actually correct documents labeled wrongly).

Equation 9: Sampling probability for negatives.
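The exact sampling formula is not reproduced here, but the selection logic can be sketched with a simple heuristic: keep highly ranked documents that do not contain any gold answer string. This is an illustrative approximation, not the paper's probability-based scheme.

```python
def pick_hard_negatives(retrieved_docs, gold_answers, max_negatives=3):
    """Sketch of hard-negative selection: keep highly ranked documents that do
    NOT contain any gold answer string. The answer-string filter is a crude
    guard against false negatives; the paper uses a calibrated sampling
    probability rather than this simple heuristic."""
    negatives = []
    for doc in retrieved_docs:  # assumed sorted by retriever score, best first
        text = doc.lower()
        if any(ans.lower() in text for ans in gold_answers):
            continue  # likely relevant: skip to avoid false negatives
        negatives.append(doc)
        if len(negatives) == max_negatives:
            break
    return negatives
```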

The model is then trained to maximize the probability of the correct answer even when the provided document is explicitly irrelevant, effectively training it to fall back on its internal knowledge.

Equation 10: Noise-resistant loss.

By combining these objectives, the final training loss for REAR prepares the model for the messy reality of open-domain search:

Equation 11: Total REAR loss function.
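Given the two objectives described above, a natural reconstruction of the total loss is simply their sum (the paper may weight or combine the terms differently):

\[
\mathcal{L} = \mathcal{L}_{\text{rel}} + \mathcal{L}_{\text{gen}},
\]

where \(\mathcal{L}_{\text{rel}}\) is the bi-granularity relevance loss (Equation 8) and \(\mathcal{L}_{\text{gen}}\) is the noise-resistant generation loss (Equation 10).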


Experiments and Results

Does adding an assessment module and training with noise actually help? The researchers tested REAR against several competitive baselines, including Self-RAG and RobustLM, across four major datasets: Natural Questions (NQ), TriviaQA, WebQuestions (WebQ), and SQuAD.

Main Performance

The results, summarized in Table 3, show that REAR consistently outperforms the baselines.

Table 3: Comparison between REAR and baselines.

On the Natural Questions (NQ) dataset, REAR achieves an Exact Match (EM) score of 51.41%, significantly higher than Self-RAG (41.02%) and RobustLM (44.40%). This pattern holds true across TriviaQA and SQuAD as well.

It is worth noting the two bottom rows of the table:

  • w/ Source Rel: Using only the relevance score.
  • w/ Knowledge Con: Using the “Knowledge Consistency” verification (perplexity check).

The addition of Knowledge Consistency consistently yields the highest scores, proving that verifying the model’s reliance on external data is a crucial final step.

Relevance Discrimination

One of the paper’s key claims is that generative LLMs are bad at judging relevance. Table 4 validates this claim.

Table 4: Relevance discrimination capabilities.

  • JAcc (Judgmental Accuracy): Measures how often the model correctly labels a document as relevant or irrelevant.
  • Hit@1: Measures if the document chosen for the final answer was actually relevant.

Standard LLaMA-2 with prompting achieves a JAcc of only 25.04% on Natural Questions. It is essentially guessing. REAR, thanks to its specialized Assessment Module, jumps to 74.04%. This massive gap illustrates why the architectural change—incorporating the explicit scoring module—is necessary.

Robustness to Document Quantity and Quality

In real-world RAG, we often retrieve multiple documents (5, 10, or more). A robust system should get better with more documents, not more confused.

Figure 3 (left) shows the performance as the number of documents increases.

Figure 3: Results of RAG performance vary in overall document count and quality.

While the baseline Llama2-13B (pink line) barely improves or even degrades with more documents (likely due to being overwhelmed by noise), REAR (blue bar) shows a steady improvement.

Similarly, Figure 3 (right) shows performance based on retriever quality (R1 is a weak retriever, R3 is a strong one). REAR maintains a significant lead even with weaker retrievers, suggesting it is capable of filtering out the noise that weaker retrievers introduce.

Efficiency

One might worry that adding an “Assessment Module” and calculating relevance scores slows down the system. However, as Table 2 shows, REAR is surprisingly efficient.

Table 2: Efficiency analysis.

Compared to methods like Chain-of-Note (CoN), which generate long textual explanations for relevance, REAR is roughly twice as fast at inference (0.45s vs. 0.82s). It is even faster than Self-RAG. This is because REAR uses a lightweight linear layer for assessment rather than generating extra tokens, which would require expensive autoregressive decoding.


Conclusion and Implications

The REAR framework addresses one of the most persistent “trust issues” in modern AI: the tendency of LLMs to be sycophantic, agreeing with whatever document is placed in front of them, even if it’s wrong.

By mechanically separating Relevance Assessment from Answer Generation and then fusing them back together through embedding guidance, REAR allows the model to “think” about the validity of its sources before speaking.

The key takeaways from this research are:

  1. Explicit Assessment: LLMs need specific modules to judge relevance; they don’t do it well implicitly.
  2. Guidance Vectors: Passing relevance scores as embeddings allows the model to dynamically adjust its attention between external documents and internal knowledge.
  3. Noise Training: You must train models on “bad” data so they learn to reject it.

As RAG systems become the standard for enterprise and academic AI applications, frameworks like REAR will be essential. They move us away from “blind retrieval” toward a more robust, self-aware, and reliable generation process.