Introduction

If you have ever taken a computer science algorithms course, you know the drill. When asked “What is the most efficient sorting algorithm?”, the answer is almost reflexively “Merge Sort,” “Heap Sort,” or “Quick Sort.” Why? Because of Big O notation. We are taught that \(O(n \log n)\) is the gold standard for comparison-based sorting, while algorithms like Bubble Sort (\(O(n^2)\)) are relegated to the “never use in production” bin.

But what if the fundamental assumption behind Big O notation—that every comparison costs roughly the same and is very cheap—were suddenly wrong?

This is exactly the scenario we face in modern Information Retrieval (IR) using Large Language Models (LLMs). When we use an LLM to re-rank search results, we aren’t just comparing two integers in a CPU register (a nanosecond operation). We are feeding two documents into a massive neural network and asking it to generate a preference (an operation that takes milliseconds or even seconds).

In this blog post, we will dive deep into a fascinating research paper titled “Are Optimal Algorithms Still Optimal? Rethinking Sorting in LLM-Based Pairwise Ranking with Batching and Caching.” The researchers challenge the status quo of sorting theory in the context of LLMs. They reveal that when the cost of comparison becomes expensive, “inefficient” algorithms can become surprisingly effective, and “optimal” algorithms might actually be slowing you down.

Background: The Rise of LLM Re-Ranking

Before we get into sorting algorithms, let’s establish the context. In modern search engines, the process usually happens in two stages:

  1. Retrieval: A fast, cheap algorithm (like BM25) finds the top 100 or 1000 documents relevant to a query.
  2. Re-ranking: A more powerful model looks at those top candidates and re-orders them to put the absolute best ones at the top.

Recently, Pairwise Ranking Prompting (PRP) has emerged as a powerful technique for this second stage. The idea is simple: instead of training a specific ranking model, you just ask a standard LLM (like GPT-4, Llama 3, or Flan-T5) to compare two documents.

“Given Query Q, and Documents A and B, which one is more relevant?”

The LLM outputs “A” or “B.”

This is elegant because it requires no training data (zero-shot). However, it introduces a massive computational bottleneck. If you have 100 documents to sort, a naive approach might require comparing every document against every other document, which is nearly 5,000 pairs. That is highly expensive when every single comparison requires a GPU inference call.
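
To make the cost concrete, here is a minimal sketch of what a single pairwise comparison could look like in code. The prompt wording and the llm_generate callable are hypothetical placeholders for whatever inference backend you use, not the exact setup from the paper.

```python
def compare(query: str, doc_a: str, doc_b: str, llm_generate) -> bool:
    """Ask an LLM which of two documents is more relevant to the query.

    Returns True if doc_a is preferred. `llm_generate` stands in for
    whatever maps a prompt string to generated text (an API request or
    a local forward pass); it is the expensive part of this function.
    """
    prompt = (
        f"Query: {query}\n"
        f"Document A: {doc_a}\n"
        f"Document B: {doc_b}\n"
        "Which document is more relevant to the query? Answer 'A' or 'B'."
    )
    answer = llm_generate(prompt)  # one GPU inference: milliseconds to seconds
    return answer.strip().upper().startswith("A")
```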

To solve this, practitioners naturally turned to classical sorting algorithms to minimize the number of comparisons. If you use Heapsort, for example, you can sort \(n\) items with \(O(n \log n)\) comparisons rather than \(O(n^2)\). Until now, the field assumed that minimizing the count of comparisons was the only metric that mattered.

The Core Problem: Inferences vs. Comparisons

The researchers argue that classical analysis makes a fatal mistake in this context. It treats comparisons as atomic, uniform-cost operations.

In a CPU, comparing 5 < 10 is atomic. But in an LLM system, a “comparison” is an API call or a forward pass through a GPU. This difference unlocks two powerful optimizations that standard Big O notation ignores:

  1. Batching: In a GPU-based system, checking one pair (A vs B) takes almost the same amount of time as checking ten pairs (A vs B, A vs C, A vs D...) simultaneously, provided they fit in memory.
  2. Caching: If your sorting algorithm redundantly compares A vs B multiple times, you can store the result and skip the expensive inference the second time.

The researchers propose a new framework: Re-center the cost model around LLM Inferences, not Comparison Counts.

When you view sorting through this lens, the hierarchy of algorithms changes drastically. An algorithm that performs more comparisons but allows for fewer inference calls (via batching) becomes the superior choice.
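
One way to picture this framework is a thin wrapper that tracks the two metrics separately: a batched call answers many comparisons but counts as a single inference. The sketch below is illustrative; the class and the llm_compare_batch callable are made-up names, not an API from the paper.

```python
class InferenceCostTracker:
    """Counts comparisons and LLM inference calls separately.

    Classical analysis charges one unit per comparison; the LLM-centric
    view charges one unit per inference, however many comparisons it packs.
    """

    def __init__(self, llm_compare_batch):
        # llm_compare_batch(pairs) -> list[bool], one verdict per (a, b) pair,
        # produced by a single batched forward pass (placeholder backend).
        self.llm_compare_batch = llm_compare_batch
        self.comparisons = 0
        self.inferences = 0

    def compare(self, a, b):
        """One pair per call: one comparison, one inference."""
        return self.compare_batch([(a, b)])[0]

    def compare_batch(self, pairs):
        """Many pairs per call: many comparisons, still one inference."""
        self.comparisons += len(pairs)
        self.inferences += 1
        return self.llm_compare_batch(pairs)
```

Run any sorting algorithm through a wrapper like this and the gap between the two counters is exactly the saving the paper is interested in.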

A New Look at Old Algorithms

Let’s analyze three classic sorting algorithms—Heapsort, Bubble Sort, and Quick Sort—under this new “LLM-Centric” cost model.

Table 1: Optimizations applicable to each algorithm. Heapsort supports neither batching nor caching; Bubble Sort supports caching; Quick Sort supports batching.

As shown in the table above, the structural properties of these algorithms determine whether they can leverage modern hardware optimizations.

1. Heapsort: The Theoretically “Optimal” Trap

Heapsort has long been favored in PRP research. It guarantees \(O(n \log n)\) complexity and is excellent at extracting the “Top-K” items without sorting the whole list.

However, Heapsort relies on a binary tree structure. To know which comparison to make next, you strictly need the result of the previous comparison. It is inherently sequential. You cannot bundle 10 comparisons into one batch because you don’t know what those 10 comparisons will be until you traverse the tree step-by-step. Furthermore, Heapsort rarely compares the same pair twice, meaning caching offers no benefit.
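
The dependency is easiest to see in the sift-down step of a max-heap. The sketch below is a generic heapify routine with an LLM-backed compare function (an assumption for illustration, not the paper's implementation); note how each answer determines which comparison comes next.

```python
def sift_down(heap, i, compare):
    """Restore the max-heap property below index i.

    `compare(a, b)` is assumed to be an LLM call returning True when
    a ranks above b. Each step depends on the previous answer, so the
    calls cannot be bundled into one batch.
    """
    n = len(heap)
    while True:
        left, right, largest = 2 * i + 1, 2 * i + 2, i
        if left < n and compare(heap[left], heap[largest]):
            largest = left                      # first inference decides this
        if right < n and compare(heap[right], heap[largest]):
            largest = right                     # second inference depends on the first
        if largest == i:
            break                               # heap property restored
        heap[i], heap[largest] = heap[largest], heap[i]
        i = largest                             # which subtree we descend into also depends on the answers
```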

In the world of LLMs, Heapsort forces the GPU to act like a CPU: processing one tiny task at a time, leaving massive potential throughput on the table.

2. Bubble Sort: The Redemption of \(O(n^2)\)

Bubble Sort is famously inefficient. It scans through the list, swapping adjacent elements if they are in the wrong order. In a list of 100 items, a naive Bubble Sort can burn through nearly 5,000 comparisons.

However, the researchers note a unique property: Redundancy. Bubble Sort compares adjacent elements repeatedly across multiple passes. In a CPU context, this is wasted work. In an LLM context, if we employ a simple Cache (a dictionary storing key="DocA_DocB", value="DocA"), those repeated comparisons become “free” instantaneous lookups.
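
A minimal sketch of that idea, assuming a compare(a, b) function that costs one LLM inference per call: results are memoized per document pair, so later passes hit the dictionary instead of the GPU.

```python
def cached_bubble_sort(docs, compare):
    """Bubble Sort into descending relevance with a comparison cache.

    `compare(a, b)` is assumed to cost one LLM inference and to return
    True when a ranks above b. Repeated adjacent comparisons across
    passes become dictionary lookups instead of new inferences.
    """
    cache = {}  # (id of a, id of b) -> bool

    def prefer(a, b):
        key = (id(a), id(b))
        if key not in cache:
            verdict = compare(a, b)              # expensive: one inference
            cache[key] = verdict
            cache[(id(b), id(a))] = not verdict  # assumes the LLM answers consistently both ways
        return cache[key]                        # cheap: cached lookup

    docs = list(docs)
    n = len(docs)
    for i in range(n):
        for j in range(n - 1 - i):
            if prefer(docs[j + 1], docs[j]):     # often answered from the cache after the first pass
                docs[j], docs[j + 1] = docs[j + 1], docs[j]
    return docs
```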

Figure 1: Bubble Sort with caching. Solid red arrows represent expensive non-cached inferences; dotted arrows represent cheap cached lookups.

As visualized in Figure 1, the second pass of Bubble Sort relies heavily on cached results (dotted arrows). While Bubble Sort doesn’t support batching (because swaps are dependent on the immediate previous result), the caching mechanism drastically reduces the number of actual LLM calls required.

3. Quick Sort: The Batching Powerhouse

This is where the paper makes its strongest contribution. Quick Sort works by selecting a “pivot” element and partitioning the remaining items into two groups: those “smaller” (less relevant) than the pivot and those “larger” (more relevant).

Classically, you would compare the pivot against Item 1, then the pivot against Item 2, and so on. But why wait?

Since the pivot is fixed for the duration of the partition step, the comparisons are independent. We can construct a single prompt:

“Here is a Pivot Document P. Here is a list of Candidate Documents {A, B, C, D}. For each candidate, tell me if it is more relevant than P.”

This allows us to leverage Batching.
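
Here is a rough sketch of a batched Quick Sort partition. It assumes a compare_batch(pivot, candidates) function that answers every candidate in a single inference call; that interface is illustrative rather than the paper's exact API, and a real system would also chunk the candidate list to respect the batch-size limit.

```python
def batched_quicksort(docs, compare_batch):
    """Quick Sort into descending relevance, one inference per partition.

    `compare_batch(pivot, candidates)` is assumed to return a list of
    booleans (True where the candidate ranks above the pivot) computed
    in a single batched LLM inference.
    """
    if len(docs) <= 1:
        return list(docs)
    pivot, rest = docs[0], docs[1:]
    verdicts = compare_batch(pivot, rest)  # one inference, len(rest) comparisons
    above = [d for d, better in zip(rest, verdicts) if better]
    below = [d for d, better in zip(rest, verdicts) if not better]
    return batched_quicksort(above, compare_batch) + [pivot] + batched_quicksort(below, compare_batch)
```

Each recursion level still performs the same comparisons as classic Quick Sort; it simply packs an entire partition's worth of them into one call.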

Figure 2: Sequential inference (one comparison per call) versus batched inference (multiple comparisons against a pivot per call).

Figure 2 illustrates this shift. On the left (classic analysis), we execute one inference per pair. On the right (batching), we gain information on multiple documents in a single inference step. Even though the total number of comparisons (mathematical operations) remains the same, the number of inference calls (expensive GPU operations) plummets.

Experiments and Results

The researchers validated their theory using the standard TREC and BEIR datasets, utilizing models like Flan-T5, Mistral, and Llama 3. They measured “Efficiency” not by time complexity, but by the raw count of LLM inference calls.

The Impact of Batching on Quick Sort

The results for Quick Sort are dramatic. When the batch size is 1 (standard behavior), Heapsort is indeed more efficient than Quick Sort. This aligns with traditional theory.

However, as soon as the batch size increases to just 2, the script flips.

Figure 3: Inference counts for Quick Sort drop as batch size increases, with efficiency improving linearly, while Heapsort sees minimal to no gain.

In Figure 3, look at the steep drop for Quick Sort (the blue bars). With a batch size of 2, Quick Sort generates 44% fewer inference calls than Heapsort. At batch size 128, the reduction is massive—over 90%.

Because Heapsort (orange bars) cannot effectively batch comparisons due to its tree dependency, its efficiency remains stagnant regardless of the hardware’s capability.

The Impact of Caching on Bubble Sort

Bubble Sort also sees a revival. By implementing a simple cache, the algorithm avoids re-computing known relationships.

Figure 4: Bubble Sort with and without a cache. Caching reduces inferences by roughly 46% across datasets.

Figure 4 shows that caching reduces the total inferences for Bubble Sort by an average of 46%. While it doesn’t beat Batched Quick Sort, it makes Bubble Sort a much more viable candidate than previously thought, especially in scenarios where batching might be difficult (e.g., memory constraints) but memory for caching is available.

Real-World Latency

Theoretical inference counts are great, but does this translate to wall-clock speed? The researchers ran these algorithms on NVIDIA A100, RTX 3090, and RTX 2080 Ti GPUs.

Figure 5: Speed-up versus batch size. The A100 shows near-linear speed-up up to batch size 8 and continues to improve up to 128.

As shown in Figure 5, the “Speed-up” is real. On an A100 (blue line), increasing the batch size yields significant performance gains. The scaling isn’t perfectly linear forever—eventually, you hit the GPU’s compute limits—but the gains in the 2-32 batch size range are substantial.

In a full end-to-end test on the BEIR benchmark, Quick Sort (batch 128) was 5.52x faster than Heapsort while achieving similar ranking quality.

Does Efficiency Hurt Ranking Quality?

A critical question remains: by switching algorithms or optimizing for speed, do we degrade the quality of the search results?

The short answer is no.

Figure 6: nDCG@10 scores across datasets. The lines for Bubble Sort, Heapsort, and Quick Sort are nearly identical, indicating stable ranking quality.

Figure 6 demonstrates the nDCG@10 (a standard metric for ranking quality). The performance lines for the different algorithms are tightly clustered. This means you can choose your algorithm based on your available hardware and latency requirements without worrying about hurting the user experience.

For a detailed breakdown of the exact numbers, we can look at the specific results on datasets like DBPedia and FiQA:

Table 2: nDCG scores, comparison counts, and inference counts for various methods. Quick Sort with batching shows drastically lower inference counts.

In Table 2, notice the #Inf (Number of Inferences) column.

  • Heapsort on DBPedia: ~225 inferences.
  • Quick Sort (Batch=128) on DBPedia: ~13 inferences.

The difference is an order of magnitude. The ranking score (nDCG@10) remains roughly 0.41 for both. You get the same result at less than 10% of the cost.

Conclusion and Implications

This research highlights a crucial lesson for modern software engineering: Context changes complexity.

Algorithms that have been taught as “optimal” for decades were designed for a world where comparisons were cheap atomic operations. In the era of Large Language Models, where a single comparison is a computationally heavy inference, we have to rewrite the rulebook.

The key takeaways from this paper are:

  1. Stop counting comparisons; start counting inferences. This is the true bottleneck in AI systems.
  2. Heapsort is obsolete for LLM ranking. Its sequential nature prevents it from using the parallel processing power of GPUs.
  3. Quick Sort is King. Its partitioning phase is perfectly suited for batching, allowing us to compare one document against dozens of others in a single prompt.
  4. Don’t ignore the “bad” algorithms. Even Bubble Sort can be redeemed with simple caching strategies because LLM costs expose different bottlenecks than CPU cycles.

For students and developers working with RAG (Retrieval-Augmented Generation) or search systems, this suggests an immediate practical change. If you are implementing a re-ranker, move away from sequential pairwise comparisons. Structure your prompts to handle batches, and leverage algorithms like Quick Sort that align with the parallel nature of your hardware. The theoretical “best” is not always the practical winner.
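
As a starting point, a batched comparison prompt could be assembled along the following lines. The wording, the one-answer-per-line output format, and the llm_generate callable are all hypothetical, so adapt them to your model and add more robust parsing.

```python
def compare_batch(query, pivot, candidates, llm_generate):
    """Compare several candidates against one pivot in a single inference.

    Returns a list of booleans, True where a candidate is judged more
    relevant than the pivot. `llm_generate` is a placeholder for your
    inference backend; the prompt and output format are illustrative.
    """
    numbered = "\n".join(f"{i + 1}. {doc}" for i, doc in enumerate(candidates))
    prompt = (
        f"Query: {query}\n"
        f"Pivot document: {pivot}\n"
        f"Candidate documents:\n{numbered}\n"
        "For each candidate, answer 'yes' if it is more relevant to the query "
        "than the pivot, otherwise 'no'. Reply with one answer per line."
    )
    lines = llm_generate(prompt).strip().splitlines()  # one inference for the whole batch
    return [line.strip().lower().startswith("y") for line in lines[:len(candidates)]]
```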