When you type a query into a search engine, you expect relevant results instantly. Behind the scenes, however, there is a constant tug-of-war between speed and accuracy. Modern Information Retrieval (IR) systems often rely on a two-step process to manage this trade-off: a fast “retriever” that finds a broad set of candidate documents, followed by a slower, more precise “reranker” that sorts them.

For years, researchers have tried to make the fast retriever smarter so that it relies less on the heavy computational cost of the reranker. The standard approach is Knowledge Distillation: teaching the fast model to mimic the scores of the smart model.

But there is a flaw in how we usually do this. Most systems teach the retriever to mimic absolute scores (e.g., “This document is an 8.5/10”). This is called pointwise ranking. The problem? Absolute scores are notoriously difficult to calibrate. Is an 8.5 in one context really better than an 8.4 in another?

In this post, we dive into PAIRDISTILL, a method from a recent research paper that proposes a shift in perspective. Instead of asking “How good is this document?”, the authors ask “Is Document A better than Document B?” By distilling knowledge through pairwise comparisons, this new method achieves state-of-the-art results, effectively bridging the gap between speed and precision in dense retrieval.

The Landscape: Retrievers vs. Rerankers

To understand PAIRDISTILL, we first need to understand the architecture it aims to improve.

1. The Dense Retriever (The Student)

Dense retrieval models, like the Dense Passage Retriever (DPR), use a Dual-Encoder architecture. They convert the user’s query and the documents into vector embeddings (lists of numbers). The relevance of a document is determined by how close its vector is to the query’s vector in a shared embedding space.

The similarity function, denoted as \(s(q, d_i)\), is typically a dot product or cosine similarity:

\[
s(q, d_i) = E_Q(q)^\top E_D(d_i)
\]

where \(E_Q\) and \(E_D\) are the query and document encoders.

Because document vectors can be pre-calculated, this process is incredibly fast, allowing systems to search millions of documents in milliseconds.
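
To make this concrete, here is a minimal sketch of dual-encoder scoring. The `encode` function is just a stand-in for a real learned encoder (it produces random normalized vectors), and the names are illustrative rather than taken from any particular library:

```python
import numpy as np

def encode(texts):
    """Stand-in for a learned encoder (e.g., a BERT-based dual encoder).
    Returns one L2-normalized vector per input text."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 768))            # placeholder embeddings
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Document vectors can be pre-computed and indexed offline.
docs = ["ColBERT is a late-interaction retriever.",
        "BM25 ranks documents by term frequency."]
doc_vecs = encode(docs)

# At query time, only the query needs to be encoded.
query_vec = encode(["what is ColBERT?"])[0]

# s(q, d_i): dot product between query and document embeddings.
scores = doc_vecs @ query_vec
ranking = np.argsort(-scores)                            # best document first
print(ranking, scores[ranking])
```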

These models are typically trained using Contrastive Learning. The goal is to maximize the similarity score of a relevant document (\(d^+\)) while minimizing the scores of irrelevant ones (\(d \in D'\)). The loss function usually looks like this:

\[
\mathcal{L}_{CL} = -\log \frac{\exp\!\big(s(q, d^+)\big)}{\exp\!\big(s(q, d^+)\big) + \sum_{d \in D'} \exp\!\big(s(q, d)\big)}
\]
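
Here is a minimal PyTorch sketch of this contrastive objective, assuming we already have a score for the positive document and scores for a batch of negatives (the tensor names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pos_score, neg_scores):
    """InfoNCE-style loss: -log softmax of the positive over positive + negatives.
    pos_score: shape (batch,); neg_scores: shape (batch, n_negatives)."""
    logits = torch.cat([pos_score.unsqueeze(1), neg_scores], dim=1)  # (batch, 1 + n_neg)
    labels = torch.zeros(logits.size(0), dtype=torch.long)           # positive is index 0
    return F.cross_entropy(logits, labels)

# Toy example: 2 queries, 3 negatives each.
pos = torch.tensor([5.0, 4.0])
neg = torch.tensor([[1.0, 0.5, 2.0], [3.5, 0.0, 1.0]])
print(contrastive_loss(pos, neg))
```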

2. The Reranker (The Teacher)

Rerankers, specifically Cross-Encoders, are different. They take the query and the document together and process them simultaneously (often using models like BERT or T5). This allows the model to “pay attention” to the specific interactions between words in the query and the document.

While much more accurate, Cross-Encoders are computationally expensive. You cannot pre-compute the scores because the score depends on the specific query. Therefore, they are usually only used to re-sort the top 100 or so results found by the retriever.
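
For illustration, here is a sketch of cross-encoder scoring with Hugging Face Transformers. The checkpoint below is a widely used public MS MARCO cross-encoder, not necessarily the teacher used in the paper:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A commonly used pointwise cross-encoder checkpoint (illustrative choice).
name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

query = "what is knowledge distillation?"
candidates = ["Knowledge distillation transfers knowledge from a large model to a small one.",
              "The Eiffel Tower is located in Paris."]

# The query and each candidate are encoded *together*, so scores cannot be pre-computed.
inputs = tokenizer([query] * len(candidates), candidates,
                   padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = model(**inputs).logits.squeeze(-1)   # one relevance score per (query, doc) pair
print(scores)
```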

3. Knowledge Distillation

To get the best of both worlds, researchers use knowledge distillation. The idea is to take the accurate scores from the Reranker (Teacher) and force the Dense Retriever (Student) to reproduce them.

Ideally, if the student learns well enough, it might not need the teacher anymore, or it will at least provide a much better initial list of candidates.

The Problem with Pointwise Scores

Most existing distillation methods use Pointwise Rerankers. These models assign an absolute relevance score to every document independently. The distillation process tries to align the probability distribution of the student with the teacher.

The probability of a document \(d_i\) being selected by the student model is calculated using a softmax function over the scores:

\[
p_S(d_i \mid q) = \frac{\exp\!\big(s(q, d_i)\big)}{\sum_{d \in D} \exp\!\big(s(q, d)\big)}
\]

where \(D\) is the set of retrieved candidate documents.

Similarly, the teacher (pointwise reranker) produces its own distribution. To make the student learn, we minimize the difference between these two distributions using KL Divergence (a measure of how different two probability distributions are).
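
Here is a small sketch of this pointwise distillation step, assuming student and teacher scores over the same candidate list are already available (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def pointwise_kd_loss(student_scores, teacher_scores):
    """KL divergence between the teacher's and student's softmax distributions
    over the same list of candidate documents. Shapes: (batch, n_candidates)."""
    student_log_probs = F.log_softmax(student_scores, dim=-1)
    teacher_probs = F.softmax(teacher_scores, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

student = torch.tensor([[2.0, 1.0, 0.1]])
teacher = torch.tensor([[3.0, 0.5, -1.0]])
print(pointwise_kd_loss(student, teacher))
```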

The flaw here is inconsistent baselines. A pointwise reranker might give a “relevant” document a score of 3.0 for Query A, but a “relevant” document for Query B might only get a 1.5. Trying to teach a student model to predict these shifting absolute numbers is difficult and noisy. It misses the forest for the trees.

The Solution: PAIRDISTILL

The authors of PAIRDISTILL argue that Pairwise Reranking provides a much cleaner signal. Instead of predicting a raw score, a pairwise reranker takes two documents (\(d_i\) and \(d_j\)) and estimates the probability that \(d_i\) is more relevant than \(d_j\).

\[
P_T(d_i \succ d_j \mid q) = f_{\text{pair}}(q, d_i, d_j)
\]

where \(f_{\text{pair}}\) denotes the pairwise reranker and \(d_i \succ d_j\) means “\(d_i\) is more relevant than \(d_j\).”

This relative comparison is often easier for models to learn and inherently handles the calibration issue. If Document A is better than Document B, it doesn’t matter if their absolute scores are 10 vs 9 or 2 vs 1; the relationship \(A > B\) remains true.

The Architecture

The PAIRDISTILL framework is an elegant combination of traditional pointwise distillation and this new pairwise approach. Let’s look at the overall architecture:

Diagram of the PAIRDISTILL architecture showing the teacher (top) and student (bottom) processes.

As shown in Figure 2, the process works in two parallel tracks:

  1. Top (The Teacher): The system retrieves top documents. It runs a standard Pointwise Reranker to get initial scores. Then, it samples pairs of documents and runs a Pairwise Reranker to see which one is actually better.
  2. Bottom (The Student): The student (the dense retriever) tries to match both the pointwise scores and the pairwise outcomes.

How the Student “Thinks” in Pairs

You might wonder: The student model only outputs a single score for a document. How can it predict pairwise probabilities?

This is a clever part of the mathematics. Even though the student is a dual-encoder producing single scores (\(s(q, d_i)\)), we can interpret those scores in a pairwise context using the softmax function. The student’s estimated probability that Document \(i\) is better than Document \(j\) is calculated as:

\[
P_S(d_i \succ d_j \mid q) = \frac{\exp\!\big(s(q, d_i)\big)}{\exp\!\big(s(q, d_i)\big) + \exp\!\big(s(q, d_j)\big)}
\]

If the student assigns a much higher score to \(d_i\) than \(d_j\), this probability approaches 1. If the scores are equal, it’s 0.5.
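
In code, this reduces to a one-liner, since a softmax over two scores is the same as a sigmoid of their difference (the function name here is illustrative):

```python
import torch

def student_pairwise_prob(score_i, score_j):
    """P_S(d_i > d_j | q): softmax over the two retriever scores,
    equivalent to sigmoid(score_i - score_j)."""
    return torch.sigmoid(score_i - score_j)

print(student_pairwise_prob(torch.tensor(4.0), torch.tensor(1.0)))  # close to 1
print(student_pairwise_prob(torch.tensor(2.0), torch.tensor(2.0)))  # exactly 0.5
```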

The Training Signal

The teacher provides the “ground truth” for these pairs. This ground truth can come from two sources:

  1. Classification: A model trained to output 1 if \(d_i > d_j\) and 0 otherwise.
  2. LLMs: Large Language Models (like GPT or FLAN-T5) can be prompted to judge which document is better (“Answer A or B”).

\[
P_T(d_i \succ d_j \mid q) = \frac{\exp(z_A)}{\exp(z_A) + \exp(z_B)}
\]

where \(z_A\) and \(z_B\) are the LLM’s scores for answering “A” (\(d_i\) is better) and “B” (\(d_j\) is better).
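
Below is a rough sketch of how such an LLM judgment could be obtained with an off-the-shelf FLAN-T5 model. The prompt wording and the scoring procedure are assumptions made for illustration and may differ from what the paper actually uses:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Any instruction-tuned seq2seq model works for this sketch; the paper may use a larger one.
name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
model.eval()

def answer_logprob(prompt, answer):
    """Total log-probability the model assigns to `answer` given `prompt`."""
    enc = tokenizer(prompt, return_tensors="pt", truncation=True)
    labels = tokenizer(answer, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)
    return -out.loss.item() * labels.size(1)   # loss is mean NLL per label token

def llm_pairwise_prob(query, doc_a, doc_b):
    """P(d_i > d_j | q) from the relative likelihood of answering 'A' vs. 'B'.
    Hypothetical prompt; not necessarily the one used in the paper."""
    prompt = (f"Question: {query}\nPassage A: {doc_a}\nPassage B: {doc_b}\n"
              "Which passage answers the question better? Answer A or B:")
    z_a, z_b = answer_logprob(prompt, "A"), answer_logprob(prompt, "B")
    return torch.softmax(torch.tensor([z_a, z_b]), dim=0)[0].item()
```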

The Pairwise Distillation Loss (\(\mathcal{L}_{pair}\)) measures the distance between the teacher’s pairwise confidence and the student’s implied pairwise probability:

\[
\mathcal{L}_{pair} = \frac{1}{|\mathcal{P}|} \sum_{(d_i, d_j) \in \mathcal{P}} \mathrm{KL}\Big(P_T(d_i \succ d_j \mid q)\,\Big\|\,P_S(d_i \succ d_j \mid q)\Big)
\]

where \(\mathcal{P}\) is the set of sampled document pairs.

This looks complex, but it simply means: For every sampled pair of documents, make sure the student ranks them in the same order and with the same confidence as the teacher.
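
Here is a sketch of this loss for a single query, assuming the teacher’s pairwise probabilities have already been computed for a set of sampled index pairs (all names here are illustrative):

```python
import torch

def pairwise_distill_loss(student_scores, pairs, teacher_pair_probs):
    """KL divergence between the teacher's pairwise confidence and the student's
    implied pairwise probability, averaged over the sampled pairs for one query.
    student_scores: (n_candidates,) retriever scores s(q, d_i).
    pairs: list of (i, j) index tuples.
    teacher_pair_probs: teacher's P_T(d_i > d_j | q) for each pair."""
    losses = []
    for (i, j), p_t in zip(pairs, teacher_pair_probs):
        p_s = torch.sigmoid(student_scores[i] - student_scores[j])   # P_S(d_i > d_j | q)
        p_t = torch.tensor(p_t).clamp(1e-6, 1 - 1e-6)                # avoid log(0)
        # KL between two Bernoulli distributions: teacher || student.
        kl = p_t * torch.log(p_t / p_s) + (1 - p_t) * torch.log((1 - p_t) / (1 - p_s))
        losses.append(kl)
    return torch.stack(losses).mean()

scores = torch.tensor([2.0, 1.5, -0.3], requires_grad=True)
print(pairwise_distill_loss(scores, [(0, 1), (1, 2)], [0.9, 0.6]))
```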

The final training objective combines three losses:

  1. Contrastive Loss (\(\mathcal{L}_{CL}\)): The standard training method using labeled data.
  2. Pointwise Distillation (\(\mathcal{L}_{KD}\)): Mimicking the teacher’s absolute scores.
  3. Pairwise Distillation (\(\mathcal{L}_{pair}\)): Mimicking the teacher’s relative comparisons.

\[
\mathcal{L} = \mathcal{L}_{CL} + \mathcal{L}_{KD} + \mathcal{L}_{pair}
\]
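
In training code this is just a (possibly weighted) sum; the values and coefficients below are placeholders, not numbers from the paper:

```python
import torch

# Dummy stand-ins for the three loss terms (illustrative values only).
loss_cl   = torch.tensor(0.82, requires_grad=True)   # contrastive loss
loss_kd   = torch.tensor(0.31, requires_grad=True)   # pointwise distillation
loss_pair = torch.tensor(0.12, requires_grad=True)   # pairwise distillation

# Hypothetical weighting; the paper's exact combination may differ.
alpha, beta = 1.0, 1.0
total_loss = loss_cl + alpha * loss_kd + beta * loss_pair
total_loss.backward()
```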

Experimental Results

Does adding this pairwise complexity actually help? The answer is a resounding yes.

The researchers evaluated PAIRDISTILL on the massive MS MARCO dataset (a standard benchmark for search) and the BEIR benchmark (which tests how well models generalize to new topics like bio-medicine or finance).

Comparison to State-of-the-Art

Let’s look at the summary of results in Table 1.

Table 1 comparing retrieval performance across benchmarks.

There are several key takeaways from this table:

  • In-Domain Dominance: On the MS MARCO Dev set (the “home turf” for these models), PAIRDISTILL achieves an MRR@10 of 40.7, beating well-established heavyweights like SPLADE++ and ColBERTv2.
  • Out-of-Domain Generalization: On the BEIR benchmark (average over the BEIR-13 datasets), PAIRDISTILL scores 51.2, again outperforming the baselines. This suggests that learning relationships between documents helps the model judge relevance even on topics it hasn’t seen before.

To visualize this, the authors plotted the performance of various models, with the X-axis representing in-domain performance and the Y-axis representing out-of-domain generalization.

Scatter plot comparing MS MARCO vs BEIR performance.

As seen in Figure 1, PAIRDISTILL (the red dot) sits at the top right, indicating it is superior in both categories compared to competitors like GTR-XXL and Dragon+.

Why is Pairwise Better?

To justify the extra effort, the authors analyzed the rerankers themselves. If the teacher isn’t smarter, the student can’t learn.

Table 5 showing reranking performance.

Table 5 shows that the Pairwise Reranker (duoT5) achieves a significantly higher MRR@10 (41.5) compared to the Pointwise Reranker (MiniLM at 40.5). This confirms that pairwise models are indeed “smarter” teachers, providing a higher ceiling for the student to aim for.

Ablation Studies: Do we need all the parts?

You might ask if we can just use the pairwise loss and ignore the rest. The authors tested this by removing components one by one.

Table 3 showing ablation study results.

Table 3 reveals that removing the pairwise loss (\(\mathcal{L}_{pair}\)) drops the score from 40.7 to 39.7. Interestingly, relying only on pairwise loss (without pointwise) also degrades performance slightly. This suggests that the two signals are complementary: pointwise gives a rough global estimate, while pairwise refines the difficult distinctions between similar documents.

Zero-Shot Domain Adaptation

One of the most exciting implications of PAIRDISTILL is its ability to work without labeled training data (Zero-Shot). By using an LLM to generate pairwise labels for queries in a new domain (like medical or legal), we can fine-tune a retriever specifically for that field.

In this scenario, the contrastive loss (which requires human labels) is dropped, leaving only the distillation losses:

\[
\mathcal{L}_{ZS} = \mathcal{L}_{KD} + \mathcal{L}_{pair}
\]

The results on specific domain datasets (FiQA, BioASQ, Climate-FEVER) show that this method works surprisingly well.

Table 4 showing Zero-Shot domain adaptation results.

As shown in Table 4, using PAIRDISTILL for domain adaptation consistently outperforms using just standard knowledge distillation (\(\mathcal{L}_{KD}\) only) or the base ColBERTv2 model.

Conclusion

PAIRDISTILL represents a logical evolution in how we train search AI. We started by counting keywords (BM25), moved to embedding meanings (Dense Retrieval), and then refined those results with heavy AI (Reranking).

Now, with PAIRDISTILL, we are closing the loop by effectively compressing the intelligence of those heavy rerankers back into the fast retrievers. By moving from pointwise scores (“This is an 8”) to pairwise decisions (“A is better than B”), the model learns the nuanced distinctions that define true relevance.

For students and practitioners in NLP and Information Retrieval, the key takeaway is clear: when absolute calibration is difficult, look for relative signals. Sometimes, knowing what is better is more valuable than knowing what is good.