Introduction

In the rapidly evolving landscape of Natural Language Processing (NLP), Large Language Models (LLMs) like GPT-4 and Llama-3 have become the de facto standard for generating text, writing code, and answering questions. Their ability to understand context is unparalleled. However, a significant challenge remains: how do we use these generative giants to effectively find information within massive datasets without breaking the bank?

Traditionally, utilizing LLMs for Information Retrieval (IR) has fallen into two distinct but imperfect camps. First, there are prompt-based re-ranking methods. In this scenario, you retrieve a candidate set of documents using a simple keyword search and then ask the LLM, “Is this document relevant to the user’s query?” While this yields high accuracy, it is computationally punishing: running a massive model like GPT-4 hundreds of times for every single search query is far too slow and costly for real-time applications.

The second camp involves dense retrieval methods. Here, models are trained to convert text into numerical vectors (embeddings). Similar texts end up close to each other in vector space. While efficient, training these models requires massive amounts of paired data (queries and relevant documents) and significant computational resources for contrastive training. For example, training the state-of-the-art “E5-mistral” model took approximately 18 hours on 32 high-end GPUs—a cost that is prohibitive for many researchers and startups.

But what if there was a third way? What if we could convince a standard, off-the-shelf LLM to act as a retrieval system without any training, fine-tuning, or expensive data collection?

This is the question addressed in the paper “PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval.” The researchers propose a novel method that leverages the internal mechanics of LLMs—specifically their hidden states and next-token predictions—to generate robust search representations solely through prompting.

Background: The Retrieval Landscape

To understand why PromptReps is significant, we must first establish the context of modern information retrieval.

Dense vs. Sparse Retrieval

At a high level, search technologies rely on representing text as numbers.

  • Sparse Retrieval (Bag-of-Words): Classic algorithms like BM25 represent text based on exact keyword matches. If your query contains “apple,” it looks for documents containing “apple.” The vectors are “sparse” because most dimensions (words in the vocabulary) are zero. These are fast and precise but fail at understanding synonyms or context (e.g., matching “apple” to “fruit”).
  • Dense Retrieval (Embeddings): Neural networks (like BERT) encode text into fixed-length, dense vectors of numbers that capture semantic meaning. A dense retriever knows that “canine” and “dog” are related. However, these models require heavy training to align the vector space correctly. The short sketch after this list contrasts the two approaches.
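
To make the contrast concrete, here is a toy sketch. It is not from the paper, and the vectors are hand-picked purely for illustration: it shows why exact-match sparse vectors miss synonyms while dense vectors can catch them.

```python
# Toy contrast between sparse (exact-match) and dense (semantic) matching.
# All numbers here are invented for illustration.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Sparse bag-of-words: one dimension per vocabulary term, mostly zeros.
vocab = ["dog", "canine", "park", "quantum"]
query_sparse = [1, 0, 0, 0]   # query: "dog"
doc_sparse   = [0, 1, 1, 0]   # document mentions "canine" and "park"
print(dot(query_sparse, doc_sparse))             # 0 -> the synonym is missed entirely

# Dense embeddings: a few real-valued dimensions encoding meaning,
# where related concepts ("dog", "canine") end up close together.
query_dense = [0.8, 0.6]
doc_dense   = [0.7, 0.7]
print(round(cosine(query_dense, doc_dense), 2))  # ~0.99 -> semantic match
```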

The Zero-Shot Challenge

“Zero-shot” refers to the ability of a model to perform a task without being explicitly trained for it. LLMs are famous for their zero-shot capabilities in generation. You can ask an LLM to write a poem about quantum physics, and it will do so without having been trained specifically on a “quantum physics poem” dataset.

However, LLMs are trained to predict the next word, not to map sentences to vector spaces. Previous attempts to use LLMs for retrieval usually involved generating synthetic data to train other smaller models, or using the LLM to hallucinate a hypothetical document and searching for that. PromptReps attempts to bypass these intermediate steps and extract retrieval representations directly from the LLM’s inference pass.

Core Method: PromptReps

The genius of PromptReps lies in its simplicity. Instead of modifying the model weights, the authors modify the input to force the model into a specific state.

The Prompting Strategy

The core idea is to ask the LLM to compress the semantic meaning of a document into a single word. The prompt used is:

<System> You are an AI assistant that can understand human language. <User> Passage: "[text]". Use one word to represent the passage in a retrieval task. Make sure your word is in lowercase. <Assistant> The word is: "

When the LLM processes this prompt, it reads the document and attempts to predict the single best word that summarizes it. At the exact moment before it generates that word (at the last token of the prompt), the model’s internal state is incredibly rich. It contains all the information necessary to decide what the document is about.
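
As a rough sketch of how such a prompt might be assembled with the Hugging Face transformers chat-template API (the helper name build_prompt and the model choice are illustrative assumptions, not the paper’s released code):

```python
# Hedged sketch: building a PromptReps-style prompt with a chat template.
# `build_prompt` is a hypothetical helper; the wording mirrors the prompt above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def build_prompt(passage: str) -> str:
    messages = [
        {"role": "system",
         "content": "You are an AI assistant that can understand human language."},
        {"role": "user",
         "content": f'Passage: "{passage}". Use one word to represent the passage '
                    "in a retrieval task. Make sure your word is in lowercase."},
    ]
    # add_generation_prompt=True appends the assistant header; we then add the cue
    # 'The word is: "' so the very next token is the model's one-word summary.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt + 'The word is: "'
```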

PromptReps extracts two types of data at this specific moment to build a hybrid retrieval system.

Figure 1: Overview of PromptReps. The diagram shows the flow from input text to the LLM, extracting both the last hidden state for dense retrieval and the next-token logits for sparse retrieval.

As shown in Figure 1, the system performs a single forward pass through the LLM and branches into two paths:

  1. Dense Representation (The “Vibe”): The system extracts the last hidden state of the last token. This is a vector (a list of numbers) that represents the model’s “thought process” right before speaking. This vector captures the deep semantic meaning of the text and can be used for dense retrieval (Approximate Nearest Neighbor search).
  2. Sparse Representation (The “Keywords”): The system extracts the next token logits. Logits represent the probability of every word in the vocabulary being the next word.
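
In code, both representations can be pulled from a single forward pass. The sketch below is a minimal illustration, assuming a Hugging Face causal LM and the build_prompt helper from the earlier snippet; pooling and precision details are assumptions rather than the official implementation.

```python
# Hedged sketch: one forward pass yields both the dense and the sparse signal.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"   # assumed model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

prompt = build_prompt("The quick brown fox jumps over the lazy dog.")
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Dense representation: last-layer hidden state of the final prompt token,
# i.e. the model's internal state right before it emits its one-word summary.
dense_rep = out.hidden_states[-1][0, -1, :]   # shape: (hidden_size,)

# Sparse representation: next-token logits over the entire vocabulary
# at that same position (to be sparsified as described in the next section).
sparse_logits = out.logits[0, -1, :]          # shape: (vocab_size,)
```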

Deep Dive: Constructing the Sparse Vector

The extraction of the sparse representation is particularly clever. The “logits” are essentially a list of scores for every word the model knows (often 30,000+ words). If the document is about “dogs,” the logits for words like “puppy,” “canine,” “bark,” and “pet” will all be high, even if those specific words don’t appear in the text.

This creates a “Bag-of-Words” representation that includes document expansion. The model implicitly adds relevant synonyms and related terms to the index.

To make this practical for a search engine (an Inverted Index), the authors apply several processing steps to sparsify these logits:

  1. Filtering: They keep only the values corresponding to words actually present in the document (to ensure exact matching precision) or high-probability expansion terms.
  2. Rectification: They apply a ReLU function (removing negative values) and log-saturation (dampening extremely high values so they don’t dominate).
  3. Quantization: The floating-point values are converted into integers to save space and speed up search.
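
A minimal sketch of these three steps, reusing the logits tensor from the previous snippet; the expansion cut-off and quantization scale are illustrative choices, not the paper’s exact settings.

```python
# Hedged sketch of sparsifying next-token logits into inverted-index weights.
import torch

def sparsify(logits: torch.Tensor, doc_token_ids: set,
             top_k_expansion: int = 128, scale: int = 100) -> dict:
    """Turn full-vocabulary logits into a small {token_id: integer_weight} map."""
    # 1. Filtering: keep tokens that actually occur in the document, plus the
    #    top-k highest-scoring tokens as implicit expansion terms.
    expansion_ids = set(torch.topk(logits, top_k_expansion).indices.tolist())
    keep_ids = doc_token_ids | expansion_ids

    # 2. Rectification: ReLU drops negative scores, log(1 + x) saturates large ones.
    weights = torch.log1p(torch.relu(logits.float()))

    # 3. Quantization: round to integers so a standard inverted index can store them.
    sparse = {}
    for tok_id in keep_ids:
        w = int(round(weights[tok_id].item() * scale))
        if w > 0:
            sparse[tok_id] = w
    return sparse

# Token ids of the passage text itself (not the prompt wrapper) define the filter set.
passage_ids = set(tokenizer("The quick brown fox jumps over the lazy dog.",
                            add_special_tokens=False)["input_ids"])
doc_sparse = sparsify(sparse_logits, passage_ids)
```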

The Hybrid Approach

By combining these two representations, PromptReps gets the best of both worlds:

  • The Dense vector provides semantic understanding (concept matching).
  • The Sparse vector provides lexical precision (keyword matching) and expansion.

During retrieval, the query is processed using the exact same prompt (swapping “Passage” for “Query”). The system calculates a score using the dense vector (dot product) and the sparse vector (term matching), then combines them using a weighted sum.
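
Here is a hedged sketch of that scoring step, reusing the dense vector and sparse map from the earlier snippets. The fusion weight alpha and the absence of score normalization are simplifying assumptions; in practice, signals on different scales are usually normalized before fusing.

```python
# Hedged sketch of hybrid scoring: weighted sum of dense and sparse similarities.
import torch

def sparse_match(query_sparse: dict, doc_sparse: dict) -> float:
    """Sum of weight products over shared terms, as an inverted index computes it."""
    return float(sum(w * doc_sparse.get(t, 0) for t, w in query_sparse.items()))

def hybrid_score(query_dense: torch.Tensor, doc_dense: torch.Tensor,
                 query_sparse: dict, doc_sparse: dict, alpha: float = 0.5) -> float:
    dense = torch.dot(query_dense.float(), doc_dense.float()).item()  # semantic signal
    sparse = sparse_match(query_sparse, doc_sparse)                   # lexical signal
    return alpha * dense + (1 - alpha) * sparse
```

The weight alpha here plays the role of the fusion weight analyzed in Figure 6 later in the article, where values around 0.4 to 0.5 work best.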

Experiments & Results

To validate this approach, the researchers tested PromptReps on 13 publicly available datasets from the BEIR benchmark, a rigorous collection covering topics from biomedical queries (NFCorpus) to financial questions (FiQA).

The baselines for comparison included:

  • BM25: The industry standard for keyword search.
  • E5-PT: A state-of-the-art dense retriever trained on 1.3 billion text pairs.
  • LLM2Vec: A method that adapts Llama-3 for retrieval by enabling bidirectional attention and further training it with a masked next-token prediction objective.

Zero-Shot Effectiveness

The results, presented in Table 1 below, are striking.

Table 1: nDCG@10 scores on the 13 publicly available BEIR datasets, comparing PromptReps against baselines such as BM25 and E5-PT.

The key takeaways from the BEIR evaluation are:

  1. BM25 is hard to beat: Notice that the standard BM25 (first column) often outperforms complex trained models like LLM2Vec. This illustrates how difficult zero-shot retrieval is.
  2. PromptReps (Hybrid) shines: Look at the “Llama3-70B-I Hybrid” column. It achieves an average score of 45.97, significantly outperforming BM25 (43.70) and coming incredibly close to the heavily trained E5-PT (46.06).
  3. No Training Required: It is crucial to remember that E5-PT was pre-trained on roughly 1.3 billion text pairs. PromptReps achieved similar performance purely through prompting a standard Llama-3 model.

The Power of Hybridization

The data reveals that neither dense nor sparse representations are perfect on their own. For example, with Llama3-8B-I on the ArguAna dataset, the Dense score is 29.70 and the Sparse score is 22.85; combined (Hybrid), the score jumps to 33.32.

This synergy is further analyzed in Figure 6, which shows the impact of weighting the dense vs. sparse components.

Figure 6: A line graph showing how MRR scores change based on the fusion weight between sparse and dense representations.

The graph shows a clear “sweet spot” (around 0.4 to 0.5) where the combination of both signals yields significantly higher accuracy (MRR@10) than using either method in isolation. This confirms that the semantic “vibe” and the keyword “precision” are complementary.

Scaling Laws: Bigger is Better

One of the most promising aspects of PromptReps is that it benefits directly from the “scaling laws” of LLMs. As the base language model gets smarter, the retrieval performance improves.

Figure 2: Bar chart comparing MRR scores across different LLM sizes, showing that larger models like Llama-3-70B yield better retrieval performance.

Figure 2 demonstrates that shifting from a 7B parameter model (Mistral) to Llama-3-8B, and finally to Llama-3-70B, results in consistent performance gains. The “Hybrid” bars (on the right) show that the Llama-3-70B model (purple bar) pushes the performance well past the BM25 baseline (dotted line). This suggests that as future LLMs (like GPT-5 or Llama-4) are released, PromptReps will naturally become more effective without any changes to the algorithm.

Sensitivity to Prompts

Is the prompt “Use one word…” really the best choice? The authors conducted an ablation study to test different instructions.

Table 4: Retrieval effectiveness of different prompts. Prompt 6, which includes specific instructions like ’lowercase’, performs best.

Table 4 reveals that phrasing matters. Prompt #6 (the one used in the main method) performs best. Interestingly, removing the instruction “Make sure your word is in lowercase” (Prompt #1) slightly hurts performance. This is likely because the sparse retrieval mechanism relies on exact token matching, and ensuring lowercase output aligns better with the pre-processing steps. Crucially, Prompt #4, which removes the cue “The word is:”, fails completely (scores of 0.00). This confirms that forcing the model into the generation state is essential for priming the hidden states and logits correctly.

Alternative Architectures

The researchers also questioned if generating just “one word” was enough. They experimented with prompts asking for multiple words or using multi-vector representations (similar to ColBERT).

Figure 3: Diagram showing alternative strategies like First-word single-representation and Multi-token single-representation.

As shown in Figure 3, they tried letting the model generate more tokens and pooling the results. However, their results (shown in Figure 5 of the paper) indicated that the simplest approach—using the representation of the very first token—was generally as effective or better than more complex multi-token strategies. This is a win for efficiency, as generating fewer tokens means lower latency.

PromptReps as a Training Initialization

While PromptReps is designed as a zero-shot method, the authors also explored if it could serve as a “jump start” for supervised training.

Standard dense retrievers are usually initialized with a base language model that has no concept of retrieval. The authors hypothesized that initializing the model with PromptReps (using the prompt to guide the initial state) would make fine-tuning faster and more effective.

Table 5: Supervised fine-tuning results showing that PromptReps serves as a strong initialization for training, especially with limited data.

Table 5 confirms this hypothesis. With only 1,000 training examples (a tiny fraction of standard datasets), a conventionally initialized Llama-3 retriever (RepLlama3) reaches a score of 27.88, while the PromptReps-initialized model (Dense only) already reaches 28.48. When trained on the full dataset, PromptReps remains competitive. This indicates that prompting places the model in a better “feature space” for retrieval tasks right from the start.

Conclusion and Implications

The PromptReps paper presents a compelling argument: Large Language Models are already secretly powerful retrieval systems; they just need to be asked the right question.

By prompting an LLM to summarize a document into a single word, we can harvest its internal neural states to create high-quality dense and sparse vector representations. This method, PromptReps, offers several distinct advantages:

  1. Zero Training Cost: It requires no expensive contrastive learning or GPU farms for pre-training.
  2. Full Corpus Retrieval: Unlike re-ranking methods, which only re-order a small candidate set returned by a first-stage retriever, PromptReps generates vectors that can be indexed, allowing for search across millions of documents.
  3. Future-Proofing: It scales with model size. As open-source models improve, PromptReps improves with them.

While it currently has higher query latency than smaller, dedicated BERT-based models (due to the size of LLMs like Llama-3), the rapid advancement in model compression and inference acceleration could mitigate this. PromptReps bridges the gap between the generative capabilities of LLMs and the structural needs of search engines, proving that sometimes, the best way to find what you’re looking for is to simply ask the model to describe it.