Teaching Retrievers Logic: How Entailment Tuning is Solving the Relevance Gap in RAG
If you have ever built a Retrieval-Augmented Generation (RAG) system or an open-domain Question Answering (QA) bot, you have likely encountered a frustrating phenomenon: the “keyword trap.”
You ask your system a specific question, like “Who was the first person to step on the moon?” The retriever goes into your vector database and pulls out a passage. But instead of an article about Neil Armstrong’s historic step, it retrieves a biography that says, “Neil Armstrong loved looking at the moon as a child.”
Technically, the passage is “relevant.” It shares the subject (Neil Armstrong) and the object (moon). But practically, it is useless. It doesn’t answer the question.
This highlights a fundamental flaw in how many dense retrieval models define “relevance.” They often conflate semantic similarity (sharing the same topic) with entailment (containing the logical proof of an answer).
In this deep dive, we are exploring a fascinating paper titled “Improve Dense Passage Retrieval with Entailment Tuning.” The researchers propose a novel way to train retrievers not just to match topics, but to understand logical deduction. By borrowing concepts from Natural Language Inference (NLI), they have developed a method called Entailment Tuning that significantly boosts performance in QA and RAG tasks.
The Problem: The Ambiguity of Relevance
To understand the solution, we first need to diagnose the problem with current Dense Passage Retrieval (DPR) methods.
Modern retrievers generally use a dual-encoder architecture. You have a query encoder and a passage encoder (often built on BERT-like models). They map text into a vector space where “similar” texts are close together. The retrieval process usually looks like this:

\[ \mathcal{P} = \underset{p \in \mathcal{C}}{\operatorname{top-}k} \; f(q, p) \]

Here, the system selects the \(k\) passages (\(p\)) from the corpus \(\mathcal{C}\) that maximize a similarity score \(f\) with the query (\(q\)).
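To make this concrete, here is a minimal sketch of the scoring step, assuming the query and passages have already been embedded by a dual encoder (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def top_k_passages(query_vec: np.ndarray, passage_vecs: np.ndarray, k: int = 5):
    """Return the indices of the k passages maximizing the dot-product score f(q, p).

    query_vec:    shape (d,)   -- embedding of the query q
    passage_vecs: shape (N, d) -- embeddings of the candidate passages
    """
    scores = passage_vecs @ query_vec       # f(q, p) as an inner product
    return np.argsort(-scores)[:k], scores  # highest-scoring passages first

# Toy usage with random vectors standing in for real encoder outputs:
rng = np.random.default_rng(0)
q = rng.normal(size=768)
P = rng.normal(size=(1000, 768))
indices, scores = top_k_passages(q, P, k=3)
print(indices, scores[indices])
```

In production systems the full argsort over the corpus is usually replaced by an approximate nearest-neighbor index (e.g. FAISS), but the scoring logic is the same.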
The issue lies in how these models are pre-trained. Models like BERT are trained on “Masked Language Modeling” (predicting missing words based on context). This teaches the model that words appearing in the same context window are related. Consequently, “Armstrong,” “step,” and “moon” end up with similar vector representations regardless of the sentence’s logical structure.
For general web search, this is fine. But for Question Answering, we need a stricter relationship. We need the passage to entail the answer.
Visualizing the Logic Gap
Let’s look at a concrete example provided by the researchers to illustrate this distinction.

In Figure 1, we see a query: “Who first step on the moon?”
- Passage A (Neutral): “Armstrong loves the moon since he learns to walk.”
- Passage B (Entailment): “Armstrong became the first man to walk on the moon in 1969.”
A standard dense retriever might score both of these highly because both overlap heavily with the query’s wording. However, only Passage B logically supports the answer (“Armstrong is the first human step on the moon”).
The researchers argue that a helpful passage must satisfy a condition of entailment. If the passage is the premise, the answer must be a valid hypothesis derived from that premise.
The Hypothesis: Connecting Retrieval to NLI
The core insight of this paper is that we should stop treating retrieval purely as a similarity search and start treating it as a Natural Language Inference (NLI) task.
In NLI, we classify pairs of sentences into three categories (a short scoring example with an off-the-shelf NLI model follows this list):
- Entailment: If the premise is true, the hypothesis must be true.
- Contradiction: If the premise is true, the hypothesis must be false.
- Neutral: The truth of the hypothesis cannot be determined from the premise.
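To see what such a classifier outputs, here is a quick sketch using a public MNLI-trained cross-encoder from Hugging Face; the checkpoint name and its label order are properties of that particular model, not something taken from the paper:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# roberta-large-mnli is a public NLI checkpoint; any MNLI-style model works similarly.
name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

premise = "Armstrong became the first man to walk on the moon in 1969."
hypothesis = "There exists a known person who first stepped on the moon."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

# Label order assumed for this checkpoint: contradiction, neutral, entailment.
for label, p in zip(["contradiction", "neutral", "entailment"], probs.tolist()):
    print(f"{label}: {p:.3f}")
```

A neutral passage like “Armstrong loved looking at the moon as a child” should shift most of the probability mass from “entailment” to “neutral”.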
The researchers discovered that standard retrievers are great at distinguishing “Irrelevant” from “Relevant,” but they are terrible at distinguishing “Neutral” (on-topic but useless) from “Entailment” (contains the answer).
To prove this, they ran an experiment comparing how a dedicated NLI model scores passages versus how a standard dense retriever scores them.

As shown in Figure 2, an NLI model (specifically trained for logic) assigns much higher entailment scores to positive passages (blue bars) than negative ones (orange bars). This confirms that logical entailment is a strong signal for relevance.
However, when we look at standard dense retrievers (like Contriever or E5), the picture gets messy.

In Figure 3, look at the graph on the right (Contriever). The red distribution (Entailment) and the blue distribution (Neutral) overlap significantly. This overlap is the “danger zone” where your RAG system retrieves a passage that looks right but fails to answer the user’s question.
The Solution: Entailment Tuning
So, how do we teach a retriever to respect logic? The authors propose Entailment Tuning, a training stage that sits between the general pre-training (like BERT) and the specific fine-tuning (like DPR).
The process involves three clever steps: transforming questions into claims, unifying the data format, and a specialized masking task.
Step 1: From Question to Existence Claim
NLI models work on pairs of declarative sentences (Premise + Hypothesis). But in retrieval, we have a Question + Passage. A question is not a statement, so we cannot directly check if a passage “entails” a question.
The researchers solved this by converting questions into Existence Claims.
The logic is simple: Every valid question implies that an answer exists.
- Question (\(q\)): “When was the movie Titanic released?”
- Existence Claim (\(c\)): “There exists a known time when the movie Titanic was released.”
Mathematically, they define this transformation as:

\[ c = \mathrm{Claim}(q) \]
And the validity of the question implies the claim:

\[ q \text{ is valid} \;\Rightarrow\; c \]
Why do this? Because of the Chain Rule of Logic:

\[ (p \Rightarrow a) \;\wedge\; (a \Rightarrow c) \;\Longrightarrow\; (p \Rightarrow c) \]
This equation states: If the Passage (\(p\)) implies the Answer (\(a\)), and the Answer (\(a\)) implies the Claim (\(c\)), then the Passage (\(p\)) must imply the Claim (\(c\)).
By training the model to verify if the Passage entails the Existence Claim, we are using the Claim as a proxy for the actual Answer (which the model doesn’t know yet).
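The paper does not prescribe a particular rewriting procedure at this point in the exposition, so the snippet below is only an illustrative, template-based sketch of the “question \(\to\) existence claim” idea; a real system could equally use a seq2seq rewriter:

```python
# Illustrative heuristic only: map common wh-words to an existence-claim template.
WH_TEMPLATES = {
    "who":   "There exists a known person who {rest}.",
    "when":  "There exists a known time when {rest}.",
    "where": "There exists a known place where {rest}.",
    "what":  "There exists a known thing that {rest}.",
}

def question_to_existence_claim(question: str) -> str:
    """Rewrite a question q into an existence claim c (rule-based sketch)."""
    words = question.rstrip("?").split()
    wh, rest = words[0].lower(), words[1:]
    # Crude auxiliary inversion for adjunct questions:
    # "was the movie Titanic released" -> "the movie Titanic was released"
    if wh in {"when", "where", "why", "how"} and rest and rest[0].lower() in {
        "is", "was", "are", "were", "did", "does", "do",
    }:
        aux, body = rest[0].lower(), rest[1:]
        rest = body[:-1] + [aux] + body[-1:]
    template = WH_TEMPLATES.get(wh, "There exists a known answer to: {rest}.")
    return template.format(rest=" ".join(rest))

print(question_to_existence_claim("When was the movie Titanic released?"))
# -> "There exists a known time when the movie Titanic was released."
```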
Step 2: Unified Prompting
Now that questions are converted into claims, the researchers can combine NLI data (pairs of sentences) with Retrieval data (Query-Passage pairs).
They use a unified prompt structure:
[Passage] entails that [Claim]
This allows them to train the model on massive amounts of labeled NLI data (like the SNLI dataset) alongside standard retrieval datasets (like MSMARCO), reinforcing the concept of logical deduction.
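As a rough sketch of the data unification, both an SNLI-style sentence pair and a (passage, rewritten question) pair can be rendered with the same template; the helper below is illustrative, not the paper’s preprocessing code:

```python
def build_entailment_prompt(premise: str, hypothesis: str) -> str:
    """Unified prompt shared by NLI pairs and retrieval pairs."""
    return f"{premise} entails that {hypothesis}"

# NLI-style pair: premise and hypothesis are both plain sentences.
nli_prompt = build_entailment_prompt(
    "A soccer game with multiple males playing.",
    "Some men are playing a sport.",
)

# Retrieval-style pair: the passage is the premise, the existence claim is the hypothesis.
retrieval_prompt = build_entailment_prompt(
    "Titanic was released in 1997 to critical acclaim.",
    "There exists a known time when the movie Titanic was released.",
)

print(nli_prompt)
print(retrieval_prompt)
```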
Step 3: Masked Hypothesis Prediction
This is the most technical and innovative part of the method. Standard pre-training uses Masked Language Modeling (MLM), where random words throughout a sentence are hidden, and the model tries to guess them.
In Entailment Tuning, the researchers don’t mask randomly. They specifically mask the Hypothesis (the Claim).
Let’s say:
- Premise (Passage): “Titanic was released in 1997 to critical acclaim.”
- Hypothesis (Claim): “There exists a time when Titanic was released.”
The model input becomes:
Titanic was released in 1997... entails that There exists a time when [MASK] was [MASK].
The masking probability \(\beta\) is set very high for the hypothesis part:

\[ \beta = 0.8 \quad \text{(applied to the hypothesis tokens only)} \]
The model is then tasked with predicting these masked tokens based only on the Premise (the Passage).

The loss function is the standard negative log-likelihood for the masked tokens:

\[ \mathcal{L} = -\sum_{i \in \mathcal{M}} \log P_\theta\!\left(x_i \mid \tilde{x}\right) \]

where \(\mathcal{M}\) is the set of masked positions, \(x_i\) is the original token at position \(i\), and \(\tilde{x}\) is the prompt with those positions replaced by [MASK].
Why this works: By masking the claim and forcing the model to fill it in using the passage, the model learns that the information in the claim must be derivable from the passage. It forces the embedding of the passage to contain the logical seeds needed to reconstruct the claim. If the passage were merely on-topic but did not contain the answer (Neutral), the model would not be able to predict the masked claim accurately.
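Here is a compact sketch of the masking step, assuming a BERT-style tokenizer from the transformers library; the 0.8 ratio and the “mask only the hypothesis” rule follow the paper’s description, but the helper itself (and how it locates the hypothesis span) is an illustrative simplification:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def mask_hypothesis(premise: str, hypothesis: str, beta: float = 0.8):
    """Mask hypothesis tokens with probability beta; premise tokens stay intact.

    Returns input_ids containing [MASK]s and labels in which only masked positions
    are scored (the usual -100 convention tells the MLM loss to ignore the rest).
    """
    prompt = f"{premise} entails that {hypothesis}"
    enc = tokenizer(prompt, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    labels = torch.full_like(input_ids, -100)

    # The hypothesis occupies the last len(hyp_ids) tokens before the final [SEP].
    hyp_ids = tokenizer(hypothesis, add_special_tokens=False)["input_ids"]
    hyp_start = input_ids.shape[1] - 1 - len(hyp_ids)

    for i in range(hyp_start, input_ids.shape[1] - 1):
        if torch.rand(1).item() < beta:
            labels[0, i] = input_ids[0, i]             # target: the original token
            input_ids[0, i] = tokenizer.mask_token_id  # input: replace it with [MASK]

    return input_ids, labels

ids, labels = mask_hypothesis(
    "Titanic was released in 1997 to critical acclaim.",
    "There exists a time when Titanic was released.",
)
print(tokenizer.decode(ids[0]))
```

Feeding `ids` and `labels` into any masked-LM head (e.g. `AutoModelForMaskedLM`) then yields the negative log-likelihood loss above, computed only over the masked hypothesis tokens.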
Experimental Results
Does adding this “logic layer” actually improve search? The authors tested Entailment Tuning across several major benchmarks, primarily Natural Questions (NQ) and MSMARCO.
Retrieval Performance
They applied Entailment Tuning to various base models (BERT, RoBERTa, RetroMAE) and compared them to their standard versions.

Table 2 shows the results. The rows with + Ent. T. (Entailment Tuning) consistently outperform the baselines.
- Recall@1 (R@1): This metric checks whether the very first retrieved passage is a correct one (see the short sketch after this list). On the Natural Questions (NQ) dataset, BERT + Entailment Tuning scored 48.53%, compared to 45.21% for standard BERT, a gain of more than 3 points, which is substantial for retrieval benchmarks.
- Robustness: The improvement is consistent across different architectures, whether it’s an older model like BERT or a modern retrieval-oriented model like RetroMAE.
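For reference, Recall@k is easy to compute yourself. The generic sketch below scores against gold passage ids; DPR-style evaluations often instead check whether the answer string appears in any top-\(k\) passage, but the structure is the same:

```python
def recall_at_k(ranked_passage_ids, gold_passage_ids, k: int = 1) -> float:
    """Fraction of queries whose top-k retrieved passages include a gold passage.

    ranked_passage_ids: one ranked list of passage ids per query (best first)
    gold_passage_ids:   one set of gold (answer-bearing) passage ids per query
    """
    hits = sum(
        bool(set(ranked[:k]) & gold)
        for ranked, gold in zip(ranked_passage_ids, gold_passage_ids)
    )
    return hits / len(ranked_passage_ids)

# Toy usage: the first query is answered by its top-1 result, the second is not.
print(recall_at_k([[3, 7, 9], [4, 2, 8]], [{3}, {8}], k=1))  # -> 0.5
```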
Impact on RAG Systems
Retrieval is rarely the end goal; usually, it’s a step toward answering a question. The researchers plugged their entailment-tuned retriever into a RAG pipeline (using a T5 model as the reader) to see if it helped generate better answers.

Table 3 shows the Exact Match (EM) scores. The model successfully improved the downstream QA performance. When the retriever understands logic, it feeds better documents to the reader, which results in more accurate answers.
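A skeleton of such a retrieve-then-read pipeline is sketched below, with a placeholder `retrieve` function standing in for whichever (entailment-tuned) dense retriever you use and a small public T5-family checkpoint as the reader; the checkpoint, prompt format, and function names are illustrative stand-ins rather than the authors’ exact setup:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

reader_name = "google/flan-t5-base"  # any T5-style reader works; this choice is an assumption
tokenizer = AutoTokenizer.from_pretrained(reader_name)
reader = AutoModelForSeq2SeqLM.from_pretrained(reader_name)

def retrieve(question: str, k: int = 3) -> list[str]:
    """Placeholder for the dense retriever; should return the top-k passages."""
    raise NotImplementedError("plug in your (entailment-tuned) retriever here")

def answer(question: str, passages: list[str]) -> str:
    """Concatenate retrieved passages with the question and let the reader generate an answer."""
    context = " ".join(passages)
    prompt = f"question: {question} context: {context}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        output_ids = reader.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Usage with a canned passage (swap in retrieve(question) for a real pipeline):
print(answer(
    "Who was the first person to step on the moon?",
    ["Neil Armstrong became the first man to walk on the moon in 1969."],
))
```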
Human Evaluation of Correctness
Perhaps the most important metric for RAG is trust. Does the system hallucinate? Does it answer the prompt?
The authors used GPT-4 to act as a judge (a common practice in evaluating LLMs) to score the “Correctness” and “Relevancy” of the answers generated by LLaMA-2 using retrieved documents.

In Table 4, we see that Entailment Tuning (✓) leads to higher Correctness scores (on a 1-5 scale) across both the ELI5 and ASQA datasets.
To visualize this “head-to-head” performance, Figure 4 shows a pairwise comparison.

The green bars represent “Wins” (where Entailment Tuning was better), and blue bars are “Draws.” The method rarely loses. It either performs as well as the baseline or, in a significant number of cases, provides a superior answer.
What is the Best Masking Strategy?
Finally, the authors asked: “Is masking the hypothesis really that important? Can’t we just mask everything?”
They performed an ablation study (Table 5) to test this.

- \(\beta=0.2/H\): Masking only 20% of the hypothesis (too easy).
- \(\beta=0.8/F\): Masking 80% of the Full prompt (too chaotic).
- \(\beta=0.8/H\): Masking 80% of the Hypothesis only (Goldilocks zone).
The results confirm that the specific strategy of heavily masking the hypothesis is crucial. It forces the model to treat the Passage as the source of truth (Premise) and the Claim as the derivative (Hypothesis).
Conclusion and Implications
The paper “Improve Dense Passage Retrieval with Entailment Tuning” offers a compelling step forward for semantic search. It highlights that “similarity” is a vague concept that often fails in complex QA scenarios.
By formalizing relevance as logical entailment, the researchers have given us a way to build smarter retrievers. These models don’t just look for matching keywords; they look for passages that logically support the existence of an answer.
For students and practitioners working on RAG, the takeaways are clear:
- Don’t trust dot-product similarity blindly. High scores might just mean high keyword overlap.
- Logic matters. Incorporating NLI data or objectives into your training pipeline can help separate “on-topic” noise from “answer-bearing” signal.
- Prompt engineering isn’t just for LLMs. The clever transformation of “Question \(\to\) Existence Claim” shows that how we format data for the retriever is just as important as the model architecture itself.
As we move toward more autonomous AI agents, the ability to retrieve information that is not just relevant, but logically sound, will be the differentiator between a chatty bot and a reliable assistant.