Introduction

Large Language Models (LLMs) have revolutionized how we process text, and naturally, they are reshaping Information Retrieval (IR). When you search for something, you want the best results at the top (Ranking) and you want to know how relevant those results actually are (Relevance Prediction).

In the current research landscape, there is a dichotomy in how LLMs are used for these tasks. You can ask an LLM to rate a document directly (e.g., “Rate this from 1 to 5”), which gives you a meaningful score but often a mediocre ranking order. Alternatively, you can ask the LLM to compare documents (e.g., “Is Document A better than Document B?”). This “Pairwise” approach produces excellent ranking orders but results in arbitrary scores that don’t tell you much about the actual relevance of the content.

The paper Consolidating Ranking and Relevance Predictions of Large Language Models through Post-Processing tackles this exact problem. The researchers propose a clever post-processing pipeline that combines the accurate “labels” of the rating approach with the superior “ordering” of the pairwise approach.

In this post, we will break down the problem of LLM-based ranking, explain the proposed Constrained Regression method, and look at the experimental results showing how we can finally have our cake and eat it too.

Background: The Two Modes of LLM Ranking

To understand the solution, we first need to understand the two distinct ways LLMs are currently deployed in search systems.

1. The Pseudo-Rater (Pointwise)

In this mode, the LLM acts like a human judge. It looks at a query and a single document and asks: “Does this passage answer the query?”

The model outputs a probability for “Yes” or “No.” We can interpret the probability of the token “Yes” as a relevance score, denoted as \(\hat{y}\).

Equation for normalized relevance prediction.
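
Concretely, one common way to write this normalization (a sketch of the idea described above, not necessarily the paper’s exact notation) is to renormalize the “Yes” and “No” token probabilities:

\[
\hat{y}_i \;=\; \frac{P(\text{Yes} \mid q, d_i)}{P(\text{Yes} \mid q, d_i) + P(\text{No} \mid q, d_i)}
\]

where \(q\) is the query and \(d_i\) is the candidate document.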

This method is efficient (\(O(n)\) complexity for \(n\) documents) and produces calibrated scores. If the model says a document is 0.8 relevant, it generally means it’s highly relevant. However, when you sort documents based on these scores, the final ranking is often suboptimal compared to more complex methods.
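
To make this concrete, here is a rough sketch of how such a pointwise score could be computed with an open-source FLAN-T5 checkpoint via Hugging Face Transformers. The checkpoint name and prompt wording are illustrative assumptions, not the paper’s exact setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative checkpoint; the paper's experiments use larger FLAN-family models.
tok = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

def pointwise_score(query: str, passage: str) -> float:
    """Return the normalized P("Yes") as the relevance score y_hat (hypothetical prompt)."""
    prompt = (f"Passage: {passage}\nQuery: {query}\n"
              "Does the passage answer the query? Answer Yes or No.")
    inputs = tok(prompt, return_tensors="pt")
    # Score only the first decoder step and compare the "Yes" vs. "No" token logits.
    start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=start).logits[0, -1]
    yes_id = tok("Yes", add_special_tokens=False).input_ids[0]
    no_id = tok("No", add_special_tokens=False).input_ids[0]
    yes_prob, _ = torch.softmax(torch.stack([logits[yes_id], logits[no_id]]), dim=0)
    return yes_prob.item()

print(pointwise_score("what causes tides",
                      "Tides are caused mainly by the gravitational pull of the moon."))
```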

2. Pairwise Ranking Prompting (PRP)

Here, the LLM is given two documents and asked which one is more relevant. This leverages the LLM’s strong reasoning capabilities to compare content directly.

To get a score for a specific document, the system counts how many “duels” that document won against other candidates.

Equation for calculating ranking score based on pairwise wins.
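
One simple way to formalize this win-counting (again a sketch; the exact aggregation in the paper may also credit ties) is:

\[
s_i \;=\; \sum_{j \neq i} \mathbb{1}\!\left[\, d_i \succ d_j \,\right]
\]

where \(d_i \succ d_j\) means the LLM preferred document \(i\) over document \(j\) when that pair was compared.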

While this produces state-of-the-art ranking (high NDCG scores), the resulting scores are uncalibrated. A score of “5 wins” in one query might mean something totally different than “5 wins” in another query. The absolute numbers are meaningless; only the relative order matters.

The Core Problem: Calibration vs. Ranking

The conflict is visualized below. On the left, you see the issue with Pairwise Ranking (PRP). The scores (y-axis) vary wildly across different queries (the colored lines). A document with a score of -5 in one query might be the best result, while a score of 5 in another might be mediocre. This makes it impossible to set a global threshold for “relevance.”

Left: PRP scores are uncalibrated. Right: The proposed Ranking-aware Pseudo-Rater pipeline.

As shown in the right side of Figure 1, the researchers propose a Ranking-Aware Pseudo-Rater. The idea is to take the calibrated scores from the Pseudo-Rater (which have good absolute values) and the pairwise preferences from the PRP (which have good relative order) and fuse them together.

To measure success, we need to look at two metrics simultaneously:

  1. NDCG (Normalized Discounted Cumulative Gain): Measures how good the ranking order is. Higher is better.
  2. ECE (Expected Calibration Error): Measures how accurate the probability scores are. Lower is better.

Equation for Expected Calibration Error (ECE).
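
In its standard binned form (sketched here; the number of bins is a design choice), ECE groups documents into \(B\) bins by predicted score and averages the gap between predicted and observed relevance:

\[
\mathrm{ECE} \;=\; \sum_{b=1}^{B} \frac{|S_b|}{n} \left|\, \bar{y}(S_b) - \bar{\hat{y}}(S_b) \,\right|
\]

where \(S_b\) is the set of documents whose predicted score falls into bin \(b\), \(\bar{\hat{y}}(S_b)\) is their mean predicted score, and \(\bar{y}(S_b)\) is their mean ground-truth relevance.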

The Solution: Constrained Regression

The researchers introduce a post-processing step that takes the “Pointwise Ratings” and adjusts them minimally so that they satisfy the “Pairwise Constraints.”

The Math

Let \(\hat{y}_i\) be the initial relevance score from the Pseudo-Rater for document \(i\). We want to find a small adjustment, \(\delta_i\), for each document. The goal is to minimize the total change made to the original scores (keeping them calibrated) while enforcing the rule that if the Pairwise method prefers Document A over Document B, the final score of A must be higher than that of B.

This is formulated as a constrained optimization problem:

Equation for the Constrained Regression optimization problem.
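
A hedged reconstruction of that formulation, based on the description in this post (treating \(\Delta_{ij}\) as the required score gap for a constrained pair), looks like:

\[
\min_{\delta} \;\sum_{i} \delta_i^2
\qquad \text{s.t.} \qquad
(\hat{y}_i + \delta_i) - (\hat{y}_j + \delta_j) \;\geq\; \Delta_{ij}
\quad \text{for every pair } (i, j) \text{ where PRP prefers } d_i \text{ over } d_j.
\]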

Here, we are minimizing the sum of squared perturbations \(\sum_i \delta_i^2\), subject to the constraint that whenever the pairwise method prefers document \(i\) over document \(j\), the adjusted score of \(i\) must exceed that of \(j\) by at least the gap \(\Delta_{ij}\).
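
Here is a minimal sketch of how such a constrained least-squares adjustment could be solved in practice with SciPy’s SLSQP optimizer. The fixed margin and the choice of solver are illustrative assumptions, not the paper’s implementation:

```python
import numpy as np
from scipy.optimize import minimize

def constrained_regression(y_hat, prefs, margin=1e-3):
    """Minimally perturb pointwise scores so they respect pairwise preferences.

    y_hat  : (n,) array of calibrated scores from the Pseudo-Rater
    prefs  : list of (i, j) pairs meaning "document i should outrank document j"
    margin : required gap between constrained pairs (illustrative choice)
    """
    n = len(y_hat)

    # Objective: keep the perturbations delta as small as possible.
    def objective(delta):
        return np.sum(delta ** 2)

    # One inequality constraint per preference: (y_i + d_i) - (y_j + d_j) - margin >= 0.
    constraints = [
        {"type": "ineq",
         "fun": lambda d, i=i, j=j: (y_hat[i] + d[i]) - (y_hat[j] + d[j]) - margin}
        for i, j in prefs
    ]

    result = minimize(objective, x0=np.zeros(n), constraints=constraints, method="SLSQP")
    return y_hat + result.x

# Toy example: docs 0 and 1 are nearly tied, but PRP says doc 1 should outrank doc 0.
scores = np.array([0.62, 0.60, 0.30])
print(constrained_regression(scores, prefs=[(1, 0)]))
```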

Improving Efficiency

Running a full pairwise comparison for every document against every other document requires \(O(n^2)\) LLM calls, which is slow and expensive. The paper therefore introduces two efficient variants that cut down both the number of LLM calls and the number of constraints the regression needs to satisfy.

1. SlideWin (Sliding Window): Instead of comparing everything, we only compare documents that are close to each other in an initial ranking (like BM25). We slide a window over the list and generate constraints only for neighbors. This reduces complexity to \(O(kn)\).

2. TopAll (Top-k vs. All): We assume the top results are the most important. We select the top-k documents (based on the initial pointwise score) and compare them against everyone else. This ensures the best documents are correctly pushed to the top, without wasting resources sorting the garbage at the bottom.

Illustration of SlideWin (top) and TopAll (bottom) constraint selection methods.
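
To make the two strategies above concrete, here is a small sketch of how the comparison pairs (and hence the regression constraints) could be selected from an initial ranking. The function names, window size, and value of k are hypothetical choices for illustration:

```python
def slidewin_pairs(ranked_ids, window=2):
    """SlideWin: only compare documents within `window` positions of each other
    in an initial ranking (e.g., from BM25 or the pointwise scores)."""
    pairs = []
    n = len(ranked_ids)
    for i in range(n):
        for j in range(i + 1, min(i + window + 1, n)):
            pairs.append((ranked_ids[i], ranked_ids[j]))
    return pairs

def topall_pairs(ranked_ids, k=2):
    """TopAll: compare each of the top-k documents against every other document."""
    top, rest = ranked_ids[:k], ranked_ids[k:]
    return [(t, r) for t in top for r in rest]

order = [3, 0, 4, 1, 2]               # doc ids sorted by an initial ranker
print(slidewin_pairs(order))          # neighbor comparisons only: about window * n pairs
print(topall_pairs(order, k=2))       # top-2 versus the rest: about k * n pairs
```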

Table 1 summarizes the complexity of these methods. Notice how the proposed efficient methods (SlideWin, TopAll) keep the cost linear in \(n\) (for a fixed \(k\)), similar to the basic Pseudo-Rater, unlike the quadratic cost of full PRP.

Summary of methods and their complexities.

Experiments and Results

The researchers tested these methods on standard datasets like TREC-DL (2019, 2020), TREC-Covid, and others.

Main Performance

The table below shows the core results.

  • PRater: Good ECE (calibration), mediocre NDCG (ranking).
  • PRP: Excellent NDCG, terrible ECE.
  • Allpair / SlideWin / TopAll: The proposed methods achieve NDCG scores comparable to PRP (the best ranker) while maintaining low ECE scores comparable to PRater.

Table 3: Detailed experimental results comparing ranking and relevance metrics.

Look at the TREC-DL 2020 row (second large block). The Allpair method achieves an NDCG@10 of 0.7054 (statistically tied with PRP’s 0.7069) and an ECE of 0.0865 (better than PRater’s 0.0991). This confirms the method successfully consolidates the strengths of both approaches.

The Trade-off Landscape

To visualize this “best of both worlds” achievement, we can look at the Pareto frontier of Ranking vs. Calibration.

In the charts below, the x-axis is ranking quality (NDCG, higher is better) and the y-axis is calibration error (ECE, lower is better). Ideally, you want to be in the bottom-right corner.

Figure 3: Trade-off plots of ECE versus NDCG.

The blue line represents a simple weighted ensemble (a weighted sum of the two scores). The triangle markers represent the proposed constrained regression methods. Notice how the triangles consistently sit closer to the bottom-right corner than the baselines or the ensemble line, which indicates that constrained regression combines these signals more effectively than simple averaging.

Does model size matter?

One might wonder whether this only works at a particular model scale. The researchers compared the FLAN-T5-XXL model against the UL2 model.

Table 4: Model size effects on performance.

As shown in Table 4, the method scales well. Even with different model sizes, the consolidation methods (Allpair, SlideWin, TopAll) consistently bridge the gap between ranking and relevance prediction.

Conclusion

This research highlights a crucial nuance in applying LLMs to search: asking a model to “rank” and asking it to “rate” yield fundamentally different types of signals.

  • Ranking (Pairwise) gives you the correct order but arbitrary numbers.
  • Rating (Pointwise) gives you meaningful numbers but a sloppy order.

By using Constrained Regression, we can post-process the output of these models to satisfy pairwise constraints without destroying the calibration of the pointwise scores.

For students and practitioners, this implies that you don’t have to choose between a system that ranks well and a system that reports its confidence accurately. With the right mathematical framework, you can align the scale of the relevance scores with the order implied by the ranking preferences.