Imagine you are chatting with a friend about movies. You ask, “Who directed Inception?” They answer, “Christopher Nolan.” Then you ask, “What other movies did he make?”
To a human, “he” clearly refers to Christopher Nolan. To a standard search engine, however, “he” is ambiguous. This is the fundamental challenge of conversational search. Users naturally use pronouns, ellipses, and context-dependent phrasing, assuming the system remembers the history of the conversation.
Traditionally, search systems handle this by using a “Rewriter”—a separate model that translates “What other movies did he make?” into “What movies did Christopher Nolan direct?” before searching. While effective, this two-step process is slow and computationally expensive.
In a recent paper from Renmin University of China and collaborators, researchers proposed ChatRetriever. This model skips the rewriting step entirely. Instead, it adapts a Large Language Model (LLM) to act directly as a dense retriever, understanding complex conversational context natively.
In this post, we will deconstruct how ChatRetriever works, the novel training method called CSIT that powers it, and why this might be the future of search.
The Problem: Accuracy vs. Efficiency
In conversational search, there are generally two camps:
- Conversational Query Rewriting (CQR): You use an LLM to rewrite the user’s question into a standalone query.
- Pro: Very accurate because LLMs understand context well.
- Con: High latency. You have to generate text before you can even start searching.
- Conversational Dense Retrieval (CDR): You use a specialized encoder to turn the conversation history directly into a vector (a list of numbers) and search for matching documents; a minimal sketch of this pipeline follows the list.
- Pro: Very fast (end-to-end retrieval).
- Con: Historically, these models struggle with complex, long-tail conversations compared to LLMs.
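To make the CDR side concrete, here is a toy sketch of the pipeline: the whole conversation is encoded once into a single vector and scored against pre-computed passage vectors with a dot product. The encoder below is a random stand-in rather than a trained model, and every name in the snippet is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend passage index: five passages embedded offline (random stand-ins).
passage_vecs = rng.normal(size=(5, 768))
passage_vecs /= np.linalg.norm(passage_vecs, axis=1, keepdims=True)

def toy_encode(session: str) -> np.ndarray:
    """Stand-in for a trained conversational encoder."""
    vec = np.random.default_rng(abs(hash(session)) % (2 ** 32)).normal(size=768)
    return vec / np.linalg.norm(vec)

session = (
    "User: Who directed Inception?\n"
    "Assistant: Christopher Nolan.\n"
    "User: What other movies did he make?"
)
query_vec = toy_encode(session)      # one forward pass over the whole session
scores = passage_vecs @ query_vec    # one dot product per indexed passage
print("passages ranked by score:", np.argsort(-scores))
```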
The researchers asked a pivotal question: Can we take the raw power of an LLM and force it to act as a dense retriever, combining the accuracy of the first camp with the speed of the second?

As shown in Figure 1, the goal is to move from the top approach (Prompting an LLM to rewrite) to the bottom approach (Adapting the LLM to output embeddings directly).
The Solution: ChatRetriever
To create ChatRetriever, the authors didn’t just fine-tune an LLM on search data. They introduced a dual-learning approach called Contrastive Session-Masked Instruction Tuning (CSIT).
This method tackles a specific problem: existing LLM-based retrievers are usually trained on simple, single-turn instructions. They lack the “muscle memory” to compress a long, messy conversation into a precise search vector.
ChatRetriever modifies an LLM (specifically Qwen-7B-Chat in this paper) to output a vector representation of a conversation session. To do this, they append special tokens—[EMB]—to the end of the input. The internal state of these tokens becomes the “search query.”
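As a rough sketch (not the authors' code), the query side might look like the snippet below: register the special tokens, append them to the session text, and pool their final hidden states into one vector. The tiny "gpt2" backbone, the number of [EMB] tokens, and the mean-pooling are placeholder assumptions; the paper itself builds on Qwen-7B-Chat.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

N_EMB = 4                                              # assumed number of [EMB] tokens
tokenizer = AutoTokenizer.from_pretrained("gpt2")      # small stand-in backbone
model = AutoModel.from_pretrained("gpt2")

emb_tokens = [f"[EMB{i}]" for i in range(N_EMB)]
tokenizer.add_special_tokens({"additional_special_tokens": emb_tokens})
model.resize_token_embeddings(len(tokenizer))          # make room for the new tokens

def encode_session(session: str) -> torch.Tensor:
    """Return a unit-length vector representing the whole conversational session."""
    inputs = tokenizer(session + " " + " ".join(emb_tokens), return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state     # (1, seq_len, hidden_dim)
    session_vec = hidden[0, -N_EMB:].mean(dim=0)       # pool the [EMB] token states
    return F.normalize(session_vec, dim=-1)

vec = encode_session("Who directed Inception? Christopher Nolan. What other movies did he make?")
print(vec.shape)                                       # torch.Size([768]) for gpt2
```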
The training consists of two simultaneous objectives.
1. Contrastive Instruction Tuning
The first objective is standard in the world of dense retrieval. The model looks at a conversational session (\(x\)) and a relevant passage (\(y^+\)). It tries to maximize the similarity between their vector representations while minimizing the similarity with irrelevant passages (\(y^-\)).
The loss function takes the standard contrastive form:

\[
\mathcal{L}_{\mathrm{CIT}} = -\log \frac{\exp\big(\phi(x, y^{+})\big)}{\exp\big(\phi(x, y^{+})\big) + \sum_{y^{-}} \exp\big(\phi(x, y^{-})\big)}
\]
Here, \(\phi(x,y)\) calculates the similarity score between the session and the passage. This teaches the model to distinguish between good and bad search results.
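A minimal PyTorch sketch of this objective, assuming in-batch negatives (row i of each tensor is a matched session–passage pair, and every other passage in the batch serves as a negative); the temperature value is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(session_vecs: torch.Tensor,
                     passage_vecs: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    session_vecs = F.normalize(session_vecs, dim=-1)
    passage_vecs = F.normalize(passage_vecs, dim=-1)
    scores = session_vecs @ passage_vecs.T / temperature  # phi(x, y) for every pair
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)                # -log softmax of the positives

loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```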
However, contrastive learning alone isn’t enough. LLMs are generative models by nature; simply forcing them to rank passages often leads to overfitting on simple keywords rather than deeply understanding the conversation flow. This brings us to the second, more innovative objective.
2. Session-Masked Instruction Tuning (The “Secret Sauce”)
The authors propose a technique to force the LLM to fully digest the conversation history into its embedding tokens.
Usually, an LLM predicts the next word based on all previous words. In Session-Masked Instruction Tuning, the researchers present the model with the session history, the special [EMB] tokens, and the target response.
The Twist: When the model tries to predict the response tokens (\(y^+\)), the researchers mask the original session text (\(x\)).

Look at the “Session-Masked Attention Matrix” on the right of Figure 2.
- The Session tokens can see each other (standard).
- The Response tokens (blue squares at the bottom) cannot look back at the Session tokens (white squares). They can only attend to the Special Tokens (green squares) and, causally, to earlier response tokens.
This forces a bottleneck. Because the model cannot “cheat” by looking back at the specific words in the conversation history while generating the response, it must compress all the necessary semantic information into those special [EMB] tokens. If the [EMB] tokens don’t perfectly represent the user’s intent, the model won’t be able to reconstruct the response.
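A sketch of that attention pattern, assuming the input layout [session | [EMB] tokens | response]: start from a standard causal mask, then block response positions from attending to session positions, so only the [EMB] states can carry the conversation forward.

```python
import torch

def session_masked_attention(n_session: int, n_emb: int, n_resp: int) -> torch.Tensor:
    """Boolean matrix where entry [i, j] == True means position i may attend to j."""
    n = n_session + n_emb + n_resp
    allowed = torch.tril(torch.ones(n, n, dtype=torch.bool))  # standard causal mask
    resp_start = n_session + n_emb
    allowed[resp_start:, :n_session] = False                  # block response -> session
    return allowed

print(session_masked_attention(n_session=4, n_emb=2, n_resp=3).int())
```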
The input sequence is structured as follows (the session first, then the special embedding tokens, then the target response):

\[
x \;\; [\mathrm{EMB}_{1}]\,[\mathrm{EMB}_{2}]\cdots[\mathrm{EMB}_{k}] \;\; y^{+}
\]
And the loss function for this objective ensures the probability of the response is maximized given only the special tokens:

\[
\mathcal{L}_{\mathrm{SIT}} = -\sum_{t=1}^{|y^{+}|} \log P\big(y^{+}_{t} \mid [\mathrm{EMB}],\, y^{+}_{<t}\big)
\]
Combined Objective
The final training recipe combines both concepts. The model learns to rank passages correctly (Contrastive Loss) and learns to compress conversation history into dense vectors (Session-Masked Loss) simultaneously:

\[
\mathcal{L} = \mathcal{L}_{\mathrm{CIT}} + \alpha \, \mathcal{L}_{\mathrm{SIT}}
\]
The hyperparameter \(\alpha\) balances the two objectives. This dual approach transforms the LLM from a text generator into a highly capable “ChatRetriever.”
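In code, the combination is just a weighted sum. The helper below is a sketch: the default \(\alpha\) is purely illustrative, the contrastive term was sketched earlier, and a real session-masked term would apply the masked attention pattern while scoring the response.

```python
import torch

def csit_loss(l_contrastive: torch.Tensor,
              l_session_masked: torch.Tensor,
              alpha: float = 0.2) -> torch.Tensor:
    """Weighted sum of the two objectives; alpha = 0.2 is only an example value."""
    return l_contrastive + alpha * l_session_masked

print(csit_loss(torch.tensor(1.30), torch.tensor(2.10)))  # tensor(1.7200)
```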
Experimental Results
Does this complex training actually work? The researchers tested ChatRetriever against a wide range of baselines on five standard benchmarks (CAsT-19, CAsT-20, CAsT-21, QReCC, and TopiOCQA).
The baselines included:
- Conversational Query Rewriting (CQR): T5QR, ConvGQR, and LLM-based rewriters.
- Conversational Dense Retrieval (CDR): ConvDR, LeCoRE.
- LLM-based Retrievers: RepLLaMA, E5-mistral.
Main Performance

Table 1 reveals several key findings:
- State-of-the-Art for CDR: ChatRetriever (bottom row) significantly outperforms all other Conversational Dense Retrieval methods (like ConvDR and LeCoRE).
- Matching LLM Rewriters: Historically, dense retrievers lagged behind LLM rewriting methods (like LLM4CS). ChatRetriever closes this gap, achieving performance comparable to, and in some cases better than, the best rewriting techniques.
- General LLM Retrievers Fail: Notice that general-purpose LLM retrievers like RepLLaMA and E5-mistral (which perform well on standard search) struggle with conversational datasets. This underscores that conversation-specific tuning is required.
How much data is needed?
You might think training a 7B parameter model requires massive datasets. Surprisingly, ChatRetriever becomes effective very quickly.

As shown in Figure 3, the model performance (measured in NDCG@3) ramps up effectively within just 500 to 1000 training steps. This suggests that the “knowledge” is already inside the LLM; the CSIT method just aligns it efficiently for retrieval.
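For reference, NDCG@3 rewards placing highly relevant passages at the very top of the ranking. Below is a small sketch using the common linear-gain formulation; the inputs are relevance grades in rank order plus the grades of all judged passages for the query (an assumption about how the judgments are stored).

```python
import math

def ndcg_at_k(ranked_gains, all_gains, k: int = 3) -> float:
    """NDCG@k with linear gains: DCG of the ranking divided by the ideal DCG."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = dcg(sorted(all_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([2, 0, 1], [2, 2, 1, 0]))  # toy example
```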
Ablation Studies: Do we need the masking?
To prove that the “Session-Masked” component was necessary, the authors ran ablation studies.

- w/o SIT: Removing the Session-Masked Instruction Tuning causes a significant drop in accuracy across all datasets.
- w/o R-CoT: Removing the "Representational Chain of Thought" (using multiple [EMB] tokens instead of just one) also hurts performance.
- Vanilla IT: Replacing the masking strategy with standard instruction tuning is better than nothing, but still inferior to the masked approach.
This confirms that forcing the bottleneck—making the model rely only on the embedding tokens to generate responses—is crucial for high-quality representations.
Robustness Evaluation
Real-world conversations are messy. Users change topics, give vague feedback, or modify their constraints. The authors didn’t just test on static datasets; they simulated “noisy” conversations to test robustness.
They modified context in two ways:
- Partial Response Modification: Generating new responses turn-by-turn to simulate a drifting conversation.
- Full Context Modification: Using ChatGPT to completely rewrite the conversation history while keeping the search intent the same.
In both scenarios, ChatRetriever showed the highest stability (lowest standard deviation in performance) compared to other dense retrievers. It behaves much like an LLM rewriter in its ability to handle varied phrasing, but retains the architecture of a retriever.
Conclusion
ChatRetriever represents a significant step forward in conversational search. By adapting LLMs directly for retrieval, we bypass the latency of query rewriting without sacrificing understanding.
The core innovation, Contrastive Session-Masked Instruction Tuning, offers a blueprint for how to convert generative models into representation models. It forces the model to compress its reasoning into vector form, resulting in a system that “thinks” like a chatbot but “searches” like a dense retriever.
For students and researchers in Information Retrieval, this paper highlights a clear trend: the line between “generation” and “retrieval” is blurring. Future search engines likely won’t just look up keywords; they will use the latent reasoning capabilities of LLMs to understand the entirety of a user’s session in one go.