Introduction

Imagine you are looking for a specific piece of information in a library with millions of books. You approach the librarian with a vague request. A standard librarian might give you a list of books based on keywords. A better librarian might first ask clarifying questions to understand your intent, then curate a list, check the books personally to ensure they are relevant, and finally cross-reference them to give you the ultimate reading list.

This sophisticated process mirrors the challenge of Zero-Shot Document Retrieval in computer science. We want our systems to find relevant documents for queries they have never seen before, without specific training on that topic. While Large Language Models (LLMs) like GPT-4 or Llama have revolutionized this field, they are not perfect. They can “hallucinate” information, get confused by the order of documents, or simply lose focus when given too much text.

In this post, we will take a deep dive into GENRA (Generative Retrieval with Rank Aggregation), a novel approach presented by researchers Georgios Katsimpras and Georgios Paliouras. Their method moves beyond simple search by creating a multi-step pipeline that generates context, verifies relevance, and—crucially—uses rank aggregation to combine multiple perspectives into a single, highly accurate result.

The Background: Zero-Shot Retrieval and LLMs

Before dissecting GENRA, we need to understand the landscape of modern information retrieval.

The Standard Approach

Traditionally, retrieval systems rely on two main paradigms:

  1. Sparse Retrieval: Think of BM25. It matches keywords in your query to keywords in documents. It is fast but struggles with synonyms or semantic meaning.
  2. Dense Retrieval: This uses encoders (like BERT or Contriever) to turn text into mathematical vectors. Similar concepts end up close to each other in vector space.

In a zero-shot setting, these models must perform well on datasets they were never trained on. To improve accuracy, researchers often add a Re-ranking step. The system retrieves 100 documents, and a more powerful model sorts them to put the best ones at the top.
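To make the distinction between the two paradigms concrete, here is a deliberately tiny sketch. The corpus, query, and `encode` stub are placeholders of my own (a hash-seeded random vector stands in for a real encoder such as Contriever), so treat this as intuition rather than a working search engine.

```python
import numpy as np

# Toy illustration of the two paradigms (not any specific library's API).
corpus = [
    "Apple released a new smartphone this fall.",
    "Apples are a good source of dietary fiber.",
    "The apple orchard opens for picking in September.",
]
query = "health benefits of apples"

def sparse_score(q, doc):
    """BM25-style intuition, greatly simplified: count keyword overlap."""
    return len(set(q.lower().split()) & set(doc.lower().split()))

def encode(text, dim=64):
    """Placeholder dense encoder; a real system would use BERT or Contriever."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)

# Sparse ranking: keyword overlap. Dense ranking: similarity in vector space.
by_keywords = sorted(corpus, key=lambda d: sparse_score(query, d), reverse=True)
by_vectors = sorted(corpus, key=lambda d: float(encode(query) @ encode(d)), reverse=True)
```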

The Role of LLMs

Recently, LLMs have been used to enhance this process. For example, a method called HyDE (Hypothetical Document Embeddings) asks an LLM to write a fake document that answers the query, then searches for real documents that look like the fake one.
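A minimal sketch of the HyDE idea, continuing the toy setup above (it reuses the `corpus`, `query`, and `encode` placeholders); `write_hypothetical_doc` is a hypothetical stand-in for an actual LLM call, not HyDE's real implementation.

```python
def write_hypothetical_doc(query):
    """Hypothetical LLM call: ask the model to draft a passage answering the query."""
    return "Apples are rich in fiber and vitamin C, which supports digestion and immunity."

# Search with the embedding of the generated (possibly imperfect) passage
# instead of the short raw query.
hypothetical = write_hypothetical_doc(query)
scores = [float(encode(hypothetical) @ encode(doc)) for doc in corpus]
best_match = corpus[int(np.argmax(scores))]
```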

However, relying on a single pass of generation or retrieval has risks. If the LLM generates a hallucinated concept, the search goes off-track. If the re-ranker is biased, valuable documents get buried.

Visualizing the Difference

To understand how GENRA differs from these typical setups, let’s look at the system architecture.

Figure 1: Comparison between Typical Retrieval and GENRA architectures.

As shown in Figure 1 above:

  • Typical Retrieval (a): A linear path: Query \(\rightarrow\) Retriever \(\rightarrow\) Re-ranker. An LLM may support the individual modules, but it becomes a single point of failure if it errs.
  • GENRA Retrieval (b): It is a branching and converging path. It includes generation, assessment (checking if documents are actually good), and a final aggregation stage.

The Core Method: How GENRA Works

GENRA stands out because it doesn’t just trust the initial search. It creates a feedback loop of generation, verification, and aggregation. The method operates in three distinct steps.

Figure 2: The three key steps of the GENRA pipeline: Passage Generation, Relevance Assessment, and Rank Aggregation.

Let’s break down the workflow illustrated in Figure 2.

Step 1: Passage Generation (Expansion)

Queries are often short and ambiguous. If you search for “apple,” do you mean the fruit or the tech company?

GENRA starts by prompting an LLM to generate informative passages that capture the user’s intent. Instead of searching for the raw query, the system uses these generated passages to guide the search.

  • The LLM creates a set of passages \(P\).
  • These passages are encoded into vectors.
  • The system averages these vectors to create a rich “Query Representation.”
  • This enhanced query retrieves the top \(k\) candidate documents from the dataset.

By expanding the query, GENRA bridges the vocabulary gap between a user’s short question and the detailed language found in documents.
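Here is a minimal sketch of Step 1. The `generate_passages` prompt and the `encode` stub are placeholders I am inventing for illustration, not the authors' code; a real system would call an LLM and a trained dense encoder, and search a proper index rather than a matrix of vectors.

```python
import numpy as np

def generate_passages(query, n=10):
    """Hypothetical LLM call: prompt the model for n passages that elaborate
    on the query's intent (the paper's sweet spot is around 10)."""
    return [f"Generated passage {i} expanding on: {query}" for i in range(n)]

def encode(text, dim=64):
    """Placeholder dense encoder standing in for Contriever/BERT."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)

def build_query_representation(query, n_passages=10):
    # Encode every generated passage and average the vectors into one
    # enriched query representation.
    passages = generate_passages(query, n_passages)
    return np.stack([encode(p) for p in passages]).mean(axis=0)

def retrieve_top_k(query_vec, doc_vectors, k=100):
    # doc_vectors: one (normalized) row per document in the collection.
    scores = doc_vectors @ query_vec
    return np.argsort(-scores)[:k]  # indices of the k most similar documents
```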

Step 2: Relevance Assessment (The Filter)

Now we have a list of candidate documents. But are they actually relevant? Standard methods might just re-rank them, but GENRA imposes a strict validity check.

The researchers propose two ways to do this:

  1. LLM-based Filtering (Yes/No): The system feeds the query and one document at a time to an LLM and asks, “Does this document answer the query? Yes or No?”
  • Why one at a time? Feeding a long list of documents to an LLM often leads to the “Lost in the Middle” phenomenon, where the model forgets items in the middle of the list. Processing them independently ensures a fair evaluation.
  2. Re-ranking: Alternatively, they use a specialized re-ranking model (like RankVicuna) to sort the documents and keep only the top \(m\).

Only the documents that pass this “sanity check” proceed to the final stage. This acts as a noise filter, removing irrelevant content that might confuse the final ranking.
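A sketch of the LLM-based yes/no filter. The `ask_llm` function below is a dummy placeholder (it always answers "Yes"); in practice it would wrap a real model client, and the exact prompt wording here is my assumption, not the paper's.

```python
def ask_llm(prompt):
    """Dummy stand-in for a real LLM client; always answers 'Yes' here."""
    return "Yes"

def is_relevant(query, document):
    # Judge one (query, document) pair in isolation.
    prompt = (
        f"Query: {query}\n"
        f"Document: {document}\n"
        "Does this document answer the query? Answer Yes or No."
    )
    return ask_llm(prompt).strip().lower().startswith("yes")

def filter_relevant(query, candidates):
    # Judging each candidate independently avoids the "Lost in the Middle"
    # effect that long, concatenated contexts can trigger.
    return [doc for doc in candidates if is_relevant(query, doc)]
```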

Step 3: Rank Aggregation (The “Wisdom of Crowds”)

This is the most innovative part of the paper.

In standard retrieval, you produce one ranked list. If that list is flawed, you have no backup. GENRA assumes that no single ranking is perfect. Instead, it generates multiple rankings and combines them.

Here is the process:

  1. Take the set of verified documents from Step 2.
  2. Treat each verified document as a new query.
  3. For every verified document, perform a search to find other similar documents in the database.
  4. This results in multiple lists of documents (multiple rankings).

Finally, GENRA uses Rank Aggregation to merge these lists. If a document appears at the top of many different lists, it is highly likely to be the correct answer. The researchers found that a Linear Aggregation method (summing up the normalized scores of a document across all lists) worked best.
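A sketch of what the aggregation step could look like in code. The min-max normalization and the `search` callback are my assumptions for illustration; the paper only specifies that a document's normalized scores are summed across lists.

```python
from collections import defaultdict

def linear_aggregate(rankings):
    """Fuse several ranked lists of (doc_id, score) pairs by summing each
    document's min-max normalized score over all lists."""
    totals = defaultdict(float)
    for ranking in rankings:
        scores = [score for _, score in ranking]
        lo, hi = min(scores), max(scores)
        for doc_id, score in ranking:
            norm = (score - lo) / (hi - lo) if hi > lo else 1.0
            totals[doc_id] += norm
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

def aggregate_from_verified(verified_docs, search, k=100):
    # Treat every verified document as a fresh query, collect one ranking
    # per document, then merge them into a single consensus list.
    rankings = [search(doc, k) for doc in verified_docs]
    return linear_aggregate(rankings)
```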

This approach mitigates bias. Even if one generated passage was slightly off-topic, the aggregate signal from all other verified documents corrects the final result.

Experiments and Results

To prove that this complex pipeline is worth the effort, the authors tested GENRA on several benchmarks, including the TREC Deep Learning tracks (DL19, DL20) and the BEIR benchmark.

Ablation Studies: Fine-Tuning the System

Before looking at the final scores, it is interesting to see how the different components affect performance.

How many passages should we generate?

In Step 1, the LLM generates passages to expand the query. Is more always better?

Figure 3: Line charts showing the impact of the number of generated passages on performance.

Figure 3 shows that performance improves as you generate more passages (moving from 1 to 10), but it plateaus or even drops after that. Too much generated text introduces noise and redundancy. The “sweet spot” appears to be around 10 passages.

How many relevant documents should we keep?

In Step 2, the system filters documents. If we keep too many, we let noise back in. If we keep too few, we lose information.

Figure 4: Line charts showing the impact of the number of relevant documents kept.

Figure 4 reveals that keeping the top 5 to 10 verified documents yields the best results. Beyond that, the performance gains diminish. This validates the importance of a strict relevance filter.

Which Aggregation Method is best?

The researchers tested several mathematical ways to combine the rankings (Borda count, Outrank, DIBRA).

Table 1: Comparison of different rank aggregation methods.

Table 1 demonstrates that simple Linear aggregation (summing scores) consistently outperforms more complex voting mechanisms like Borda or Outrank. Sometimes, the simplest mathematical approach is the most robust.
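For contrast, here is a minimal Borda count over position-only rankings (plain lists of document IDs rather than scored pairs). This is the generic textbook formulation, not necessarily the exact variant evaluated in the paper.

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Borda count: each list awards points by position (the top item gets the
    most), and documents are ordered by their total points across lists."""
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, doc_id in enumerate(ranking):
            points[doc_id] += n - position
    return sorted(points.items(), key=lambda item: item[1], reverse=True)

# Three toy rankings produced by three verified-document queries.
rankings = [["d1", "d2", "d3"], ["d2", "d1", "d4"], ["d1", "d4", "d2"]]
print(borda_aggregate(rankings))  # [('d1', 8), ('d2', 6), ('d4', 3), ('d3', 1)]
```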

Performance vs. Efficiency

One valid concern with a multi-step pipeline is speed. If you have to generate text, check relevance, and aggregate rankings, isn’t it slow?

Figure 5: Bar chart comparing execution time of different methods.

Figure 5 compares the processing time.

  • HyDE (Blue) is the fastest of the generative methods.
  • GENRA + Judgements (Light Brown) is slower than HyDE but significantly faster than full re-ranking models.
  • RankVicuna (Red/Grey) is the most computationally expensive.

This suggests that GENRA offers a “middle ground” in efficiency while aiming for top-tier accuracy.

Benchmark Results

TREC Deep Learning (DL19 & DL20)

These are the gold-standard datasets for retrieval.

Table 2: Results on the DL19 and DL20 datasets.

As seen in Table 2, GENRA (specifically combined with RankVicuna and Contriever) achieves the highest scores (bolded). It outperforms HyDE by a significant margin (up to 7.4 percentage points in some metrics). This confirms that the extra steps of verification and aggregation provide a substantial quality boost over simple query expansion.

BEIR Benchmark

BEIR is a collection of diverse datasets (bio-medical literature, news, tweets) designed as a strict test of zero-shot capabilities.

Table 3: Results on the BEIR datasets.

Table 3 reinforces the trend. GENRA consistently achieves the best or second-best results across different domains. Whether searching for Covid-19 research or financial news, the aggregation strategy proves robust.

Real-World Application: CrisisFACTS

The authors went a step further and tested GENRA on CrisisFACTS, a task that involves summarizing streams of social media and news during disasters. This requires high factual accuracy.

Table 4: Results on the CrisisFACTS 2022 dataset.

Table 4 shows that GENRA produces summaries that are comparable to or better than the top systems from the 2022 challenge. Notably, GENRA achieves this using open-source models without needing fine-tuning on crisis data, whereas competing methods often relied on proprietary APIs (like OpenAI) or specific training.

Conclusion and Implications

GENRA represents a significant step forward in zero-shot retrieval. By acknowledging that LLMs are fallible, the authors designed a system that doesn’t just ask an LLM for the answer—it collaborates with the LLM to generate, verify, and cross-reference information.

Key Takeaways:

  1. Verification Matters: Simply generating context isn’t enough. You must verify if the retrieved documents are actually relevant using a “sanity check” step.
  2. Aggregation Reduces Noise: Relying on a single ranked list is risky. Generating multiple rankings derived from verified documents and combining them smooths out errors and highlights the true signal.
  3. Open Source Power: The success of GENRA using open-source models (like Mistral and Solar) proves that we can build state-of-the-art search systems without relying solely on closed, proprietary APIs.

For students and researchers, GENRA illustrates a powerful design pattern: Pipeline reliability. Instead of trying to train one perfect model, we can often achieve better results by chaining weaker signals together and aggregating them intelligently. As LLMs continue to evolve, methods like GENRA that structure the model’s reasoning process will likely become the standard for high-stakes information retrieval.