Imagine you are chatting with a friend about movies. You ask, “Who directed Inception?” They answer, “Christopher Nolan.” Then you ask, “What other movies did he make?”
To a human, “he” clearly refers to Christopher Nolan. To a standard search engine, however, “he” is ambiguous. This is the fundamental challenge of conversational search. Users naturally use pronouns, ellipses, and context-dependent phrasing, assuming the system remembers the history of the conversation.
Traditionally, search systems handle this by using a “Rewriter”—a separate model that translates “What other movies did he make?” into “What movies did Christopher Nolan direct?” before searching. While effective, this two-step process is slow and computationally expensive.
In a recent paper from Renmin University of China and collaborators, researchers proposed ChatRetriever. This model skips the rewriting step entirely. Instead, it adapts a Large Language Model (LLM) to act directly as a dense retriever, understanding complex conversational context natively.
In this post, we will deconstruct how ChatRetriever works, the novel training method called CSIT that powers it, and why this might be the future of search.
The Problem: Accuracy vs. Efficiency
In conversational search, there are generally two camps:
- Conversational Query Rewriting (CQR): You use an LLM to rewrite the user’s question into a standalone query.
- Pro: Very accurate because LLMs understand context well.
- Con: High latency. You have to generate text before you can even start searching.
- Conversational Dense Retrieval (CDR): You use a specialized encoder to turn the conversation history directly into a vector (a list of numbers) and search for matching documents; a minimal sketch of this pipeline follows the list.
- Pro: Very fast (end-to-end retrieval).
- Con: Historically, these models struggle with complex, long-tail conversations compared to LLMs.
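To make the CDR side concrete, here is a toy sketch of the pipeline: the whole conversation is encoded once into a single vector and scored against pre-computed passage vectors with a dot product. The encoder below is a random stand-in rather than a trained model, and every name in the snippet is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend passage index: five passages embedded offline (random stand-ins).
passage_vecs = rng.normal(size=(5, 768))
passage_vecs /= np.linalg.norm(passage_vecs, axis=1, keepdims=True)

def toy_encode(session: str) -> np.ndarray:
    """Stand-in for a trained conversational encoder."""
    vec = np.random.default_rng(abs(hash(session)) % (2 ** 32)).normal(size=768)
    return vec / np.linalg.norm(vec)

session = (
    "User: Who directed Inception?\n"
    "Assistant: Christopher Nolan.\n"
    "User: What other movies did he make?"
)
query_vec = toy_encode(session)      # one forward pass over the whole session
scores = passage_vecs @ query_vec    # one dot product per indexed passage
print("passages ranked by score:", np.argsort(-scores))
```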
The researchers asked a pivotal question: Can we take the raw power of an LLM and force it to act as a dense retriever, combining the accuracy of the first camp with the speed of the second?

As shown in Figure 1, the goal is to move from the top approach (Prompting an LLM to rewrite) to the bottom approach (Adapting the LLM to output embeddings directly).
The Solution: ChatRetriever
To create ChatRetriever, the authors didn’t just fine-tune an LLM on search data. They introduced a dual-learning approach called Contrastive Session-Masked Instruction Tuning (CSIT).
This method tackles a specific problem: existing LLM-based retrievers are usually trained on simple, single-turn instructions. They lack the “muscle memory” to compress a long, messy conversation into a precise search vector.
ChatRetriever modifies an LLM (specifically Qwen-7B-Chat in this paper) to output a vector representation of a conversation session. To do this, they append special tokens—[EMB]—to the end of the input. The internal state of these tokens becomes the “search query.”
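As a rough sketch (not the authors' code), the query side might look like the snippet below: register the special tokens, append them to the session text, and pool their final hidden states into one vector. The tiny "gpt2" backbone, the number of [EMB] tokens, and the mean-pooling are placeholder assumptions; the paper itself builds on Qwen-7B-Chat.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

N_EMB = 4                                              # assumed number of [EMB] tokens
tokenizer = AutoTokenizer.from_pretrained("gpt2")      # small stand-in backbone
model = AutoModel.from_pretrained("gpt2")

emb_tokens = [f"[EMB{i}]" for i in range(N_EMB)]
tokenizer.add_special_tokens({"additional_special_tokens": emb_tokens})
model.resize_token_embeddings(len(tokenizer))          # make room for the new tokens

def encode_session(session: str) -> torch.Tensor:
    """Return a unit-length vector representing the whole conversational session."""
    inputs = tokenizer(session + " " + " ".join(emb_tokens), return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state     # (1, seq_len, hidden_dim)
    session_vec = hidden[0, -N_EMB:].mean(dim=0)       # pool the [EMB] token states
    return F.normalize(session_vec, dim=-1)

vec = encode_session("Who directed Inception? Christopher Nolan. What other movies did he make?")
print(vec.shape)                                       # torch.Size([768]) for gpt2
```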
The training consists of two simultaneous objectives.
1. Contrastive Instruction Tuning
The first objective is standard in the world of dense retrieval. The model looks at a conversational session (\(x\)) and a relevant passage (\(y^+\)). It tries to maximize the similarity between their vector representations while minimizing the similarity with irrelevant passages (\(y^-\)).
The loss function takes the standard contrastive form:

\[
\mathcal{L}_{\mathrm{CIT}} = -\log \frac{\exp\big(\phi(x, y^{+})\big)}{\exp\big(\phi(x, y^{+})\big) + \sum_{y^{-}} \exp\big(\phi(x, y^{-})\big)}
\]
Here, \(\phi(x,y)\) calculates the similarity score between the session and the passage. This teaches the model to distinguish between good and bad search results.
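A minimal PyTorch sketch of this objective, assuming in-batch negatives (row i of each tensor is a matched session–passage pair, and every other passage in the batch serves as a negative); the temperature value is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(session_vecs: torch.Tensor,
                     passage_vecs: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    session_vecs = F.normalize(session_vecs, dim=-1)
    passage_vecs = F.normalize(passage_vecs, dim=-1)
    scores = session_vecs @ passage_vecs.T / temperature  # phi(x, y) for every pair
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)                # -log softmax of the positives

loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```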
However, contrastive learning alone isn’t enough. LLMs are generative models by nature; simply forcing them to rank passages often leads to overfitting on simple keywords rather than deeply understanding the conversation flow. This brings us to the second, more innovative objective.
2. Session-Masked Instruction Tuning (The “Secret Sauce”)
The authors propose a technique to force the LLM to fully digest the conversation history into its embedding tokens.
Usually, an LLM predicts the next word based on all previous words. In Session-Masked Instruction Tuning, the researchers present the model with the session history, the special [EMB] tokens, and the target response.
The Twist: When the model tries to predict the response tokens (\(y^+\)), the researchers mask the original session text (\(x\)).

Look at the “Session-Masked Attention Matrix” on the right of Figure 2.
- The Session tokens can see each other (standard).
- The Response tokens (blue squares at the bottom) cannot look back at the Session tokens (white squares). They can only attend to the Special Tokens (green squares) and, causally, to earlier response tokens.
This forces a bottleneck. Because the model cannot “cheat” by looking back at the specific words in the conversation history while generating the response, it must compress all the necessary semantic information into those special [EMB] tokens. If the [EMB] tokens don’t perfectly represent the user’s intent, the model won’t be able to reconstruct the response.
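A sketch of that attention pattern, assuming the input layout [session | [EMB] tokens | response]: start from a standard causal mask, then block response positions from attending to session positions, so only the [EMB] states can carry the conversation forward.

```python
import torch

def session_masked_attention(n_session: int, n_emb: int, n_resp: int) -> torch.Tensor:
    """Boolean matrix where entry [i, j] == True means position i may attend to j."""
    n = n_session + n_emb + n_resp
    allowed = torch.tril(torch.ones(n, n, dtype=torch.bool))  # standard causal mask
    resp_start = n_session + n_emb
    allowed[resp_start:, :n_session] = False                  # block response -> session
    return allowed

print(session_masked_attention(n_session=4, n_emb=2, n_resp=3).int())
```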
The input sequence is structured as follows (the session first, then the special embedding tokens, then the target response):

\[
x \;\; [\mathrm{EMB}_{1}]\,[\mathrm{EMB}_{2}]\cdots[\mathrm{EMB}_{k}] \;\; y^{+}
\]
And the loss function for this objective ensures the probability of the response is maximized given only the special tokens:

\[
\mathcal{L}_{\mathrm{SIT}} = -\sum_{t=1}^{|y^{+}|} \log P\big(y^{+}_{t} \mid [\mathrm{EMB}],\, y^{+}_{<t}\big)
\]
Combined Objective
The final training recipe combines both concepts. The model learns to rank passages correctly (Contrastive Loss) and learns to compress conversation history into dense vectors (Session-Masked Loss) simultaneously:

\[
\mathcal{L} = \mathcal{L}_{\mathrm{CIT}} + \alpha \, \mathcal{L}_{\mathrm{SIT}}
\]
The hyperparameter \(\alpha\) balances the two objectives. This dual approach transforms the LLM from a text generator into a highly capable “ChatRetriever.”
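In code, the combination is just a weighted sum. The helper below is a sketch: the default \(\alpha\) is purely illustrative, the contrastive term was sketched earlier, and a real session-masked term would apply the masked attention pattern while scoring the response.

```python
import torch

def csit_loss(l_contrastive: torch.Tensor,
              l_session_masked: torch.Tensor,
              alpha: float = 0.2) -> torch.Tensor:
    """Weighted sum of the two objectives; alpha = 0.2 is only an example value."""
    return l_contrastive + alpha * l_session_masked

print(csit_loss(torch.tensor(1.30), torch.tensor(2.10)))  # tensor(1.7200)
```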
Experimental Results
Does this complex training actually work? The researchers tested ChatRetriever against a wide range of baselines on five standard benchmarks (CAsT-19, CAsT-20, CAsT-21, QReCC, and TopiOCQA).
The baselines included:
- Conversational Query Rewriting (CQR): T5QR, ConvGQR, and LLM-based rewriters.
- Conversational Dense Retrieval (CDR): ConvDR, LeCoRE.
- LLM-based Retrievers: RepLLaMA, E5-mistral.
Main Performance

Table 1 reveals several key findings:
- State-of-the-Art for CDR: ChatRetriever (bottom row) significantly outperforms all other Conversational Dense Retrieval methods (like ConvDR and LeCoRE).
- Matching LLM Rewriters: Historically, dense retrievers lagged behind LLM rewriting methods (like LLM4CS). ChatRetriever closes this gap, achieving performance comparable to, and in some cases better than, the best rewriting techniques.
- General LLM Retrievers Fail: Notice that general-purpose LLM retrievers like RepLLaMA and E5-mistral (which perform well on standard search) struggle with conversational datasets. This underscores that conversation-specific tuning is required.
How much data is needed?
You might think training a 7B parameter model requires massive datasets. Surprisingly, ChatRetriever becomes effective very quickly.

As shown in Figure 3, the model performance (measured in NDCG@3) ramps up effectively within just 500 to 1000 training steps. This suggests that the “knowledge” is already inside the LLM; the CSIT method just aligns it efficiently for retrieval.
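For reference, NDCG@3 rewards placing highly relevant passages at the very top of the ranking. Below is a small sketch using the common linear-gain formulation; the inputs are relevance grades in rank order plus the grades of all judged passages for the query (an assumption about how the judgments are stored).

```python
import math

def ndcg_at_k(ranked_gains, all_gains, k: int = 3) -> float:
    """NDCG@k with linear gains: DCG of the ranking divided by the ideal DCG."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = dcg(sorted(all_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([2, 0, 1], [2, 2, 1, 0]))  # toy example
```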
Ablation Studies: Do we need the masking?
To prove that the “Session-Masked” component was necessary, the authors ran ablation studies.

- w/o SIT: Removing the Session-Masked Instruction Tuning causes a significant drop in accuracy across all datasets.
- w/o R-CoT: Removing the "Representational Chain of Thought" (using multiple [EMB] tokens instead of just one) also hurts performance.
- Vanilla IT: Replacing the masking strategy with standard instruction tuning is better than nothing, but still inferior to the masked approach.
This confirms that forcing the bottleneck—making the model rely only on the embedding tokens to generate responses—is crucial for high-quality representations.
Robustness Evaluation
Real-world conversations are messy. Users change topics, give vague feedback, or modify their constraints. The authors didn’t just test on static datasets; they simulated “noisy” conversations to test robustness.
They modified context in two ways:
- Partial Response Modification: Generating new responses turn-by-turn to simulate a drifting conversation.
- Full Context Modification: Using ChatGPT to completely rewrite the conversation history while keeping the search intent the same.
In both scenarios, ChatRetriever showed the highest stability (lowest standard deviation in performance) compared to other dense retrievers. It behaves much like an LLM rewriter in its ability to handle varied phrasing, but retains the architecture of a retriever.
Conclusion
ChatRetriever represents a significant step forward in conversational search. By adapting LLMs directly for retrieval, we bypass the latency of query rewriting without sacrificing understanding.
The core innovation, Contrastive Session-Masked Instruction Tuning, offers a blueprint for how to convert generative models into representation models. It forces the model to compress its reasoning into vector form, resulting in a system that “thinks” like a chatbot but “searches” like a dense retriever.
For students and researchers in Information Retrieval, this paper highlights a clear trend: the line between “generation” and “retrieval” is blurring. Future search engines likely won’t just look up keywords; they will use the latent reasoning capabilities of LLMs to understand the entirety of a user’s session in one go.