Imagine you are chatting with a friend about movies. You ask, “Who directed Inception?” Your friend replies, “Christopher Nolan.” You then ask, “What else did he direct?”
Your friend instantly knows “he” refers to Nolan. But type “What else did he direct?” into a standard search engine and it fails miserably, because it lacks the conversational context. This is the fundamental challenge of Conversational Question Answering (CQA). To bridge the gap between human conversation and search engines, we use Query Rewriting (QR). A QR model translates “What else did he direct?” into “What movies did Christopher Nolan direct?”
Traditionally, training these rewriting models has been expensive. It requires massive datasets where humans manually rewrite thousands of questions, or “gold labels” where we know exactly which document contains the answer. But what if we could train a model to rewrite queries perfectly without needing to know which document is the “correct” one?
In this post, we’re diving into AdaQR (Adaptive Query Rewriting), a framework proposed by researchers from CUHK and MIT. They have developed a clever way to align query rewriters using the conversational answer itself as a guide, eliminating the need for expensive passage annotations.
The Problem: The High Cost of Context
In a Retrieval-Augmented Generation (RAG) system, the pipeline usually looks like this:
- User Input: A query in a conversation (often ambiguous or incomplete).
- Rewriter: Converts the conversational query into a standalone search query.
- Retriever: Searches a massive database for relevant passages.
- Reader/Generator: Uses those passages to formulate an answer.
The weak link is often step 2. If the rewriter fails to disambiguate “he” or “it,” the retriever finds garbage, and the generator hallucinates.
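To make the four-step pipeline concrete, here is a minimal sketch in Python. The `rewrite`, `retrieve`, and `generate` functions are hypothetical placeholders standing in for the real components (an LLM rewriter, a BM25 or dense index, and a reader LLM); they are not any actual API.

```python
from typing import List

def rewrite(history: List[str], question: str) -> str:
    """Hypothetical rewriter: a real system would call a fine-tuned LLM here."""
    return question  # identity rewrite as a stand-in

def retrieve(query: str, k: int = 5) -> List[str]:
    """Hypothetical retriever: a real system would query BM25 or a dense index."""
    return [f"passage about: {query}" for _ in range(k)]

def generate(question: str, passages: List[str]) -> str:
    """Hypothetical reader: a real system would condition an LLM on the passages."""
    return f"answer grounded in {len(passages)} passages"

history = ["Q: Who directed Inception?", "A: Christopher Nolan."]
question = "What else did he direct?"
standalone = rewrite(history, question)   # step 2: the weak link AdaQR targets
passages = retrieve(standalone)           # step 3
answer = generate(question, passages)     # step 4
```

Because this placeholder rewriter returns the question unchanged, the retriever never sees “Christopher Nolan”; that is exactly the failure mode the rest of this post is about.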
Existing solutions try to “align” the rewriter with the retriever. They train the rewriter to generate queries that result in the specific “gold passage” (the correct document) appearing at the top of the search results. But obtaining these gold passage labels is incredibly labor-intensive. Furthermore, a rewriter trained on one topic (e.g., movies) often fails when moved to another (e.g., medical advice).
AdaQR solves this by asking a different question: Instead of forcing the model to find a specific document, why don’t we just reward it if it finds ANY information that makes the correct answer probable?
The AdaQR Framework
The core innovation of AdaQR is how it provides feedback to the model. It uses a concept called Marginal Probability of Conversational Answers.
Let’s break down the architecture.

As shown in Figure 1, the process is a cycle:
- History & Question: We take the conversation history and the current question.
- SFT Rewriter: A “Supervised Fine-Tuned” model generates several candidate rewrites.
- Retrieval: We treat these rewrites as search queries to fetch passages.
- Reward Calculation: We check if the retrieved passages actually help a separate model generate the known answer.
- DPO: We update the rewriter to prefer the variations that worked best.
Let’s dig into the three main phases of this method.
Phase 1: Supervised Fine-Tuning (SFT) with Limited Data
Before the model can be clever, it needs to be competent. The researchers start by taking a standard Large Language Model (LLM)—specifically Mistral-7B—and fine-tuning it on a very small amount of labeled data (about 10% of a seed dataset).
The objective is standard: given the history \(H\) and the current question \(q\), maximize the probability of the human-written rewrite \(r\).
\[
\mathcal{L}_{SFT} = -\log \mathcal{P}_{\mathcal{M}}(r \mid H, q)
\]
This gives the model the basic ability to rewrite, but it doesn’t teach the model what the retriever likes. It just mimics human grammar. This serves as the “Warm-start” model, denoted as \(\mathcal{M}_{SFT}\).
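As a rough illustration of this objective, here is a minimal sketch of the SFT loss using a Hugging Face causal LM. The checkpoint name and prompt template are illustrative assumptions, not the paper’s exact setup; the key point is that the loss is computed only on the rewrite tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def sft_loss(history: str, question: str, rewrite: str) -> torch.Tensor:
    """Causal-LM loss on the rewrite tokens only, i.e. the average of -log P(r | H, q)."""
    prompt = f"Conversation:\n{history}\nQuestion: {question}\nStandalone rewrite:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + rewrite, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore the prompt tokens in the loss
    return model(input_ids=full_ids, labels=labels).loss
```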
Phase 2: Reward Collection via Marginal Probability
This is the most technically interesting part of the paper. We want to improve our SFT model, but we don’t have labeled passages. However, in any training dataset for Q&A, we do have the Answer (\(a\)).
For every question, the SFT model generates multiple rewrite candidates (e.g., \(\hat{r}^1, \hat{r}^2, \dots\)). We send these to an off-the-shelf retriever (like BM25 or a dense retriever) to get a set of passages for each candidate.
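Generating those candidates can be as simple as sampling from the SFT model several times. A minimal sketch, reusing the `model` and `tokenizer` loaded in the SFT snippet above; the prompt format and decoding hyperparameters are assumptions, and the retrieval call itself (BM25 or a dense index) is omitted since it depends on your setup.

```python
history = ["Q: Who directed Inception?", "A: Christopher Nolan."]
question = "What else did he direct?"
prompt = "Conversation:\n" + "\n".join(history) + f"\nQuestion: {question}\nStandalone rewrite:"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,           # sampling, so the candidates differ from each other
    temperature=0.8,
    num_return_sequences=4,   # r_hat^1 ... r_hat^4
    max_new_tokens=64,
)
candidates = [
    tokenizer.decode(seq[inputs.input_ids.shape[1]:], skip_special_tokens=True)
    for seq in outputs
]
```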
Now, how do we know which rewrite was “good”? A good rewrite retrieves passages that contain the answer.
Instead of checking for string matching (which fails if the answer is “Nolan” but the text says “Director C. Nolan”), AdaQR uses a pre-trained LLM to calculate the probability of the answer given the passage.
First, they calculate a score \(S_k^i\) for \(p_k^i\), the \(k\)-th passage retrieved for rewrite candidate \(\hat{r}^i\):
\[
S_k^i = \log \mathcal{P}(a \mid H, q, p_k^i)
\]
This equation asks: What is the log-probability of the Answer (\(a\)) given the History (\(H\)), Question (\(q\)), and this specific Passage (\(p_k^i\))?
However, a retrieval system returns multiple passages (Top-K). We don’t want to rely on just the first one. We want to know if the set of retrieved passages is useful. AdaQR calculates the Marginal Probability by summing up the contributions of the top-K passages:
\[
e^i = \sum_{k=1}^{K} \mathcal{P}_{\mathcal{R}}(p_k^i \mid \hat{r}^i)\, \mathcal{P}(a \mid H, q, p_k^i)
\]
Here, \(e^i\) is the reward score for rewrite \(i\). It marginalizes (sums) the answer probability over the top-K passages, weighted by how confident the retriever was in each passage (\(\mathcal{P}_{\mathcal{R}}\)).
Why is this brilliant? It creates a reward signal that says, “This rewrite is good because it led to a cluster of information that makes the correct answer highly probable,” without ever needing a human to point at a specific document.
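Here is a minimal sketch of that reward in log space, assuming the retriever’s top-K scores have already been normalized into probabilities \(\mathcal{P}_{\mathcal{R}}\) and reusing a Hugging Face causal LM (such as the one loaded earlier) as the answer scorer. It returns the log of the marginal probability, which is monotone in \(e^i\) and therefore yields the same preference ordering.

```python
import torch

@torch.no_grad()
def answer_logprob(history, question, passage, answer, model, tokenizer):
    """log P(answer | history, question, passage), summed over the answer tokens.
    `history` is assumed to be an already-formatted string."""
    prompt = f"Passage: {passage}\nConversation:\n{history}\nQuestion: {question}\nAnswer:"
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + answer, return_tensors="pt").input_ids
    logits = model(input_ids=full_ids).logits[:, :-1, :]   # prediction for each next token
    targets = full_ids[:, 1:]
    token_logprobs = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logprobs[:, prompt_len - 1 :].sum()        # keep only the answer tokens

def rewrite_reward(history, question, answer, passages, retriever_probs, model, tokenizer):
    """log e^i = logsumexp_k [ log P_R(p_k | rewrite) + log P(a | H, q, p_k) ]."""
    terms = [
        torch.log(torch.tensor(float(p_r))) + answer_logprob(history, question, p, answer, model, tokenizer)
        for p, p_r in zip(passages, retriever_probs)
    ]
    return torch.logsumexp(torch.stack(terms), dim=0)
```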
Phase 3: Direct Preference Optimization (DPO)
Now that we have a score \(e^i\) for every rewrite candidate, we can compare them. For a single question, if rewrite A gets a score of 0.9 and rewrite B gets a score of 0.4, we know A is better.
We construct pairs of (Winner, Loser) and use Direct Preference Optimization (DPO). DPO is a stable and efficient way to align LLMs to human (or in this case, retriever) preferences without complex Reinforcement Learning setups.
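Here is a minimal sketch of one way to turn those scores into preference pairs. The paper’s exact pairing rule (for example, best-vs-worst only, or a minimum score gap) may differ, so treat this all-pairs version as an assumption.

```python
from itertools import combinations

def build_preference_pairs(scored_rewrites):
    """scored_rewrites: list of (rewrite_text, reward_score) for one question.
    Returns (winner, loser) tuples for every pair where one score beats the other."""
    pairs = []
    for (r_a, s_a), (r_b, s_b) in combinations(scored_rewrites, 2):
        if s_a > s_b:
            pairs.append((r_a, r_b))
        elif s_b > s_a:
            pairs.append((r_b, r_a))
    return pairs

# Example: rewrite A (score 0.9) beats rewrite B (score 0.4)
pairs = build_preference_pairs([("What movies did Christopher Nolan direct?", 0.9),
                                ("What else did he direct?", 0.4)])
```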
The loss function looks like this:
\[
\mathcal{L}_{DPO} = -\,\mathbb{E}\left[\log \sigma\!\left(\beta \log \frac{\mathcal{M}_{\theta}(\hat{r}^w \mid H, q)}{\mathcal{M}_{SFT}(\hat{r}^w \mid H, q)} - \beta \log \frac{\mathcal{M}_{\theta}(\hat{r}^l \mid H, q)}{\mathcal{M}_{SFT}(\hat{r}^l \mid H, q)}\right)\right]
\]
The model \(\mathcal{M}_{\theta}\) is trained to increase the likelihood of the winning rewrite (\(\hat{r}^w\)) and decrease the likelihood of the losing rewrite (\(\hat{r}^l\)), effectively aligning the rewriter’s output with what the retriever needs to find the answer.
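As a rough sketch of this loss in PyTorch: given the sequence log-probabilities of the winning and losing rewrites under the trained policy and the frozen SFT reference, the per-pair loss is only a few lines (the \(\beta\) value here is an illustrative assumption).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """-log sigmoid(beta * [(log pi(w) - log ref(w)) - (log pi(l) - log ref(l))])."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin)

# Toy example: the policy already prefers the winner slightly more than the reference does,
# so the loss comes out below log(2).
loss = dpo_loss(torch.tensor(-12.0), torch.tensor(-14.0),
                torch.tensor(-12.5), torch.tensor(-13.5))
```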
Experimental Results
The researchers tested AdaQR on four major conversational QA benchmarks: QReCC, TopiOCQA, Doc2Dial, and MultiDoc2Dial. They compared their method against strong baselines, including GPT-4 prompting and methods that use gold-passage labels (the “cheating” upper bound).
Does it work without passage labels?
The most critical test is comparing AdaQR (which sees no passage labels) against methods that do.

In Table 1, look at the “TopiOCQA-SFT” section.
- TopiOCQA-SFT (Baseline): 40.8 MRR (Mean Reciprocal Rank).
- + Gold-Label (Upper Bound): 48.5 MRR. This method effectively “cheats” by using ground-truth passage data.
- + Ours (AdaQR): 50.6 MRR.
Result: AdaQR not only outperforms the baseline SFT model, but it also matches or exceeds the performance of models trained with gold labels. This confirms that the answer acts as a sufficient proxy for the document during training.
Adapting to New Domains
One of the biggest pain points in NLP is “Out-of-Domain” generalization. A model trained on Wikipedia (QReCC) might struggle with technical manuals (Doc2Dial).
The researchers took a model fine-tuned only on QReCC (the seed) and applied AdaQR using the questions/answers from Doc2Dial (the target), without using any human rewrites from the target domain.

Table 2 shows the results. On Doc2Dial (Sparse retrieval), the QReCC-SFT model scores 56.0. After applying AdaQR (+Ours), performance jumps to 59.9, approaching the Gold-Label performance of 60.6. This proves AdaQR is a powerful tool for domain adaptation.
Comparison with Pseudo-Labeling
A common “weak supervision” technique is Pseudo-Labeling: assuming that if a passage shares many words with the answer, it must be the right passage.
The researchers compared AdaQR against this word-overlap method.

Figure 2 is revealing. The X-axis represents the “F1-score” (word overlap) between answers and passages.
- As the overlap decreases (moving left), Pseudo-Labeling (dashed lines) falls apart. If the answer is “Nolan” but the passage says “The director of Tenet,” word overlap fails.
- AdaQR (Solid lines) remains stable and high-performing. Because it uses an LLM to judge probability rather than counting matching words, it understands semantic meaning.
Handling Topic Shifts
Conversations often pivot abruptly. User: “Tell me about apples.” System: “Apples are a fruit…” User: “How about the tech company?”
These “Topic Shifts” are nightmares for rewriters.

Figure 5 shows the performance ratio on turns with topic shifts. A ratio \(>1.0\) means the method handles shifts better than the average query. AdaQR (“Ours”, light blue) consistently achieves ratios near or above 1.0, significantly outperforming the SFT baseline and Pseudo-Labeling methods. This suggests AdaQR learns to become more sensitive to context changes because failing to do so results in very low rewards during the DPO phase.
The Mechanics of Success: Why it Works
The Importance of “K”
In the reward calculation, we sum probabilities over the top-K passages. Does K matter?

Figure 3 shows the impact of K.
- SFT (Baseline): The dashed line at the bottom.
- Top-1: Using only the single best retrieved passage improves performance, but it’s risky.
- Top-5/Top-9: As we increase K (considering more passages), performance generally stabilizes or improves.
The researchers settled on \(K=5\) as a sweet spot. By looking at 5 passages, the model “hedges its bets”—it rewards a rewrite if it lands the retrieval generally in the right area, even if the absolute #1 result isn’t perfect.
Data Efficiency
How much data do you actually need to run this preference optimization?

Figure 4(a) shows that even with just 20% of the preference pairs, AdaQR provides a significant boost over the baseline SFT. This makes the framework extremely computationally efficient compared to training on massive datasets.
Furthermore, Figure 4(b) compares AdaQR to Gold-Label training. AdaQR (Red line) scales better. It can leverage training examples that don’t have gold passage labels (unlabeled data), allowing it to utilize larger portions of the dataset that the Gold-Label method has to discard.
Conclusion and Implications
The AdaQR framework represents a significant step forward in making Conversational Search systems more robust and cheaper to build.
Key Takeaways:
- No Gold Labels Needed: You can align a rewriter using only the conversation’s answers. The “Marginal Probability” acts as a powerful, semantic reward signal.
- Semantic over Lexical: Unlike keyword matching (pseudo-labeling), AdaQR understands meaning, making it robust to diverse vocabulary and low word overlap.
- Domain Adaptation: You can take a rewriter trained on general data and “adapt” it to a specific niche (like legal or medical docs) simply by running AdaQR on the target questions/answers—no human rewriting required.
By treating the retriever’s output as a latent variable and optimizing for the final answer probability, AdaQR aligns the system end-to-end. It stops treating “Rewriting,” “Retrieving,” and “Answering” as separate tasks and starts treating them as a cohesive team working toward a single goal: giving the user the right information.
For students and practitioners in NLP, this highlights the power of weak supervision and preference optimization. Sometimes, you don’t need perfect labels; you just need a clever way to tell the model what “good” looks like.