Introduction

In the last two years, the “context window” (the amount of text an AI model can process at once) has exploded. We have moved rapidly from models that could barely read a long article (4k tokens) to behemoths capable of ingesting entire libraries (1 million+ tokens). Models like Gemini 1.5 Pro and Claude 3 Opus promise to process vast amounts of information in a single pass. Meanwhile, Retrieval Augmented Generation (RAG) systems promise to sift through millions of documents to find exactly what we need.

But a massive context window poses a massive problem: How do we know if the model actually understands what it is reading?

Until now, the standard test has been the “Needle-in-a-Haystack” challenge. You hide a random fact (the needle) inside a massive amount of text (the haystack) and ask the model to find it. If the model retrieves the needle, it passes. While useful, this task is rudimentary. It tests retrieval, not reasoning. It is the AI equivalent of “Ctrl+F.”

Real-world tasks are rarely about finding a single sentence. They are about reading dozens of documents, synthesizing the information, identifying repeating trends, and—crucially—citing sources.

In a recent paper titled “Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems,” researchers from Salesforce AI Research introduce a new, far more rigorous benchmark. They argue that to truly test the next generation of AI, we need to move beyond finding needles. We need to ask models to summarize the haystack.

This post explores the “Summary of a Haystack” (SummHay) framework, a novel method for evaluating long-context reasoning that reveals uncomfortable truths: despite the hype, today’s best models still lag significantly behind human performance when tasked with complex synthesis and citation.

Background: The Evaluation Crisis

To understand why SummHay is necessary, we must first look at the current state of Long-Context Large Language Models (LLMs) and RAG systems.

The Contenders: Long-Context vs. RAG

There are currently two ways to answer questions based on large datasets:

  1. Long-Context LLMs: You feed the entire corpus (e.g., 100 documents) directly into the model’s prompt. The model “reads” everything at once.
  2. RAG (Retrieval Augmented Generation): A retriever algorithm scans the corpus, selects the top few relevant chunks, and feeds only those chunks to the LLM.

Both approaches aim to solve the same problem, but comparing them is difficult. RAG is efficient but might miss context. Long-context models see everything but can get overwhelmed (“lost in the middle”).
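
To make the contrast concrete, here is a minimal sketch (not code from the paper) of the two pipelines. The function names are hypothetical, call_llm is a placeholder for any chat-completion API, and the word-overlap retriever stands in for a real embedding or reranking model.

```python
from typing import List


def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API; plug in your own client here."""
    raise NotImplementedError


def answer_long_context(docs: List[str], query: str) -> str:
    # Long-context approach: concatenate the entire corpus into a single prompt.
    context = "\n\n".join(f"[Doc {i + 1}]\n{doc}" for i, doc in enumerate(docs))
    return call_llm(f"{context}\n\nQuery: {query}")


def answer_rag(docs: List[str], query: str, k: int = 5) -> str:
    # RAG approach: keep only the top-k documents by a (here, naive) relevance score.
    query_words = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(query_words & set(d.lower().split())), reverse=True)
    context = "\n\n".join(ranked[:k])
    return call_llm(f"{context}\n\nQuery: {query}")
```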

The Problem with Current Benchmarks

Existing benchmarks fall into two traps:

  1. Too Simple: Tasks like “Needle-in-a-Haystack” only require retrieving a specific string of text. They do not test the ability to aggregate information across multiple documents.
  2. Too Subjective: Traditional summarization benchmarks rely on human-written “gold standard” summaries. However, comparing an AI’s summary to a human’s summary is notoriously difficult. Metrics like ROUGE (which counts word overlap) correlate poorly with actual quality. Furthermore, for massive datasets (100k+ tokens), creating human reference summaries is prohibitively expensive and slow.

The researchers realized that to evaluate complex reasoning at scale, they needed a way to generate “Haystacks” where they knew exactly what the summary should look like, without relying on subjective human writing.

The SummHay Method: Synthesizing the Truth

The core innovation of this paper is a synthetic data generation pipeline. Instead of taking existing real-world documents and hoping to evaluate a summary of them, the researchers create the documents from scratch based on a structured set of facts.

Because they generate the documents, they know exactly which facts appear in which document. This effectively turns the subjective task of “summarization” into an objective task of “coverage and citation.”

Step 1: Subtopics and Insights

The process begins with a high-level topic, such as “students discussing exam strategies.” An LLM breaks this down into distinct subtopics (e.g., “Study Techniques,” “Managing Stress”).

For each subtopic, the system generates specific atomic facts called Insights. An insight is a standalone piece of information, often containing a specific entity or number.

  • Subtopic: Managing Stress.
  • Insight: “A student recommends using the ‘Calm’ meditation app for 15 minutes each morning.”
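
A hedged sketch of the kind of structure Step 1 produces is shown below; the class and field names are illustrative, not the authors’ actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Insight:
    insight_id: str
    text: str  # an atomic, standalone fact, ideally containing a concrete entity or number


@dataclass
class Subtopic:
    name: str
    insights: List[Insight] = field(default_factory=list)


managing_stress = Subtopic(
    name="Managing Stress",
    insights=[
        Insight("S1", "A student recommends using the 'Calm' meditation app for 15 minutes each morning."),
        Insight("S2", "A student finds that a short walk between study blocks reduces stress."),
    ],
)
```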

Step 2: Document Generation (Building the Haystack)

Once the insights are defined, the system generates the actual documents (the Haystack). A Haystack consists of roughly 100 documents, totaling about 100,000 tokens.

Here is the clever part: The system programmatically decides which insights go into which document.

  • Document 1 might contain the “Calm app” insight and a “Pomodoro technique” insight.
  • Document 2 might contain the “Calm app” insight and a “Flashcards” insight.
  • Document 3 might contain no relevant insights at all.

This creates a precise “Gold Mapping.” The researchers know that if a model is asked about “Managing Stress,” it must mention the Calm app, and it must cite Document 1 and Document 2.
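
The mapping logic itself is simple to sketch. The snippet below is illustrative rather than the paper’s implementation: it assigns each insight to a random handful of documents and records both directions of the mapping, so the insight-to-document table can later serve as the gold citation set.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def build_gold_mapping(
    insight_ids: List[str],
    num_docs: int = 100,
    min_docs: int = 2,
    max_docs: int = 10,
    seed: int = 0,
) -> Tuple[Dict[int, List[str]], Dict[str, List[int]]]:
    """Decide which insights appear in which documents and record the gold mapping."""
    rng = random.Random(seed)
    insights_per_doc: Dict[int, List[str]] = defaultdict(list)  # doc_id -> insights to weave in
    docs_per_insight: Dict[str, List[int]] = {}                 # insight_id -> gold citation list
    for insight_id in insight_ids:
        chosen = rng.sample(range(1, num_docs + 1), rng.randint(min_docs, max_docs))
        docs_per_insight[insight_id] = sorted(chosen)
        for doc_id in chosen:
            insights_per_doc[doc_id].append(insight_id)
    return insights_per_doc, docs_per_insight
```

An LLM then writes each document so that it naturally expresses its assigned insights, while docs_per_insight is kept aside as the ground truth used at evaluation time.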

Figure 1: Diagram illustrating the steps to synthesize a Haystack of documents given an input scenario: subtopic and insight creation followed by document generation. Once a Haystack is synthesized, it can be used to benchmark LLMs / RAG systems on query-focused summarization tasks.

As shown in Figure 1, the pipeline flows from abstract concepts to concrete documents. This synthetic approach ensures that “insights repeat across documents,” mimicking real-world research where multiple sources might corroborate a single fact.

Step 3: The Task

The model is given the Haystack (or access to it via RAG) and a query like: “Summarize the insights regarding stress management.”

Crucially, the prompt instructs the model to:

  1. Produce a bulleted list of insights.
  2. Cite every source document that contributes to each insight, using a specific format (e.g., [Doc 1, Doc 2]).
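
Put together, the query prompt looks roughly like the template below; the wording is paraphrased, not the paper’s exact prompt.

```python
SUMMHAY_PROMPT = """You are given {num_docs} documents, each labeled [Doc N].

{haystack}

Query: {query}

Write a bulleted list of the distinct insights that answer the query.
After each bullet, cite every source document that supports it,
using the format [Doc 1, Doc 4].
"""
```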

The Evaluation Protocol: Coverage, Citation, and Joint Score

Because the researchers know the ground truth (the mapping of insights to documents), they don’t need to ask a human “Does this summary read well?” Instead, they can mathematically calculate accuracy.

They introduce three key metrics:

1. Coverage Score

This measures recall. Did the model find all the insights relevant to the query?

  • If there were 5 expected insights (e.g., the Calm app, deep breathing, walking) and the model lists only 3 of them, it gets a partial score.
  • Scoring is granular: An insight can be “Fully Covered,” “Partially Covered,” or “Not Covered.”

2. Citation Score

This measures attribution. For every insight the model successfully found, did it cite the correct documents?

  • This is calculated as an F1 score of Precision and Recall.
  • Precision: Did the model cite documents that actually contain the insight? (Avoiding hallucinations).
  • Recall: Did the model cite all the documents that contain the insight? (Thoroughness).
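
Written out, for a single insight with cited document set \(C\) and gold document set \(G\) from the mapping:

\[
\text{Precision} = \frac{|C \cap G|}{|C|}, \qquad
\text{Recall} = \frac{|C \cap G|}{|G|}, \qquad
\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\]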

3. Joint Score

This is the ultimate metric, combining Coverage and Citation. To get a high Joint Score, a model must both find the information and attribute it correctly. An insight that does not appear in the ground truth earns nothing, and an insight that is found but cited to the wrong documents loses most of its value.

Figure 2: Example evaluation of a candidate summary (right) for its coverage of reference insights (left). Each reference insight is assigned a Coverage Score by mapping it to a single candidate bullet. A mapped bullet’s citations are used to calculate the Citation Score. The total score is the average across reference insights. See Appendix A.7 for four additional examples.

Figure 2 visualizes this scoring. On the left, we see the “Reference Insights”—the ground truth. On the right, the “Candidate Summary” generated by an AI. The evaluation maps the AI’s bullet points to the reference insights to check for accuracy.
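
Here is a minimal sketch of the scoring logic just described, assuming the joint score is the per-insight product of the coverage value and the citation F1, averaged over all reference insights; the paper’s exact weighting and edge cases may differ, and the 0.5 partial-credit value is an assumption.

```python
from typing import Dict, List, Set, Tuple

COVERAGE_VALUE = {"FULL": 1.0, "PARTIAL": 0.5, "NONE": 0.0}  # partial credit of 0.5 is assumed


def citation_f1(cited: Set[int], gold: Set[int]) -> float:
    overlap = len(cited & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cited)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def summhay_scores(judgments: List[Tuple[str, Set[int], Set[int]]]) -> Dict[str, float]:
    """Each judgment is (coverage_label, cited_docs, gold_docs) for one reference insight."""
    coverage, citation, joint = [], [], []
    for label, cited, gold in judgments:
        cov = COVERAGE_VALUE[label]
        f1 = citation_f1(cited, gold) if cov > 0 else 0.0
        coverage.append(cov)
        citation.append(f1)
        joint.append(cov * f1)
    n = len(judgments)
    return {"coverage": sum(coverage) / n, "citation": sum(citation) / n, "joint": sum(joint) / n}
```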

Automating the Judge

Evaluating these outputs manually would be slow. The researchers validated that GPT-4o can act as an automated judge. They compared GPT-4o’s grading against human annotators and found a high correlation.

Table 1: Reproducibility and cost of manual and automated evaluation for SummHay. We compute coverage correlation, linking accuracy, and evaluation cost.

Table 1 shows that GPT-4o achieves a 0.716 correlation with human annotators, which is reliable enough for benchmarking, while reducing the cost from $325 (manual) to roughly $7 (automated) per batch.
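
For readers who want to reproduce this style of automated grading, a hedged sketch using the openai Python client is shown below; the judging prompt is paraphrased and simplified, not the paper’s actual rubric.

```python
from typing import List

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """Reference insight:
{insight}

Candidate summary bullets:
{bullets}

Does any single bullet express the reference insight?
Answer with exactly one of FULL, PARTIAL, or NONE,
followed by the number of the best-matching bullet (or NONE).
"""


def judge_coverage(insight: str, bullets: List[str]) -> str:
    numbered = "\n".join(f"{i + 1}. {b}" for i, b in enumerate(bullets))
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(insight=insight, bullets=numbered)}],
    )
    return response.choices[0].message.content.strip()
```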

Experiments & Results: The Reality Check

The researchers tested 10 state-of-the-art LLMs (including GPT-4o, Claude 3 Opus, Gemini 1.5 Pro) and 50 different RAG configurations across two domains: News and Conversation.

The results, summarized below, reveal that SummHay is a brutal test for current AI.

1. Humans Are Still Superior

The most striking finding is the gap between the best AI systems and human performance.

Figure 3: Estimates of human performance on the SummHay task, plotted over time as participants complete the task in the Oracle setting during two-hour sessions.

Figure 3 tracks human annotators performing the task. Over a 2-hour session, humans (the orange and green lines) steadily improve their citation and joint scores, eventually reaching a Joint Score of roughly 56%.

In contrast, look at Table 2 below. The best “Joint Score” achieved by any AI system in a realistic setting (using Rerank3 retrieval) is only 36.0% (Gemini 1.5 Pro).

Table 2: Summary of a Haystack results of human performance, RAG systems, and Long-Context LLMs. Results are reported using three metrics: Coverage (left), Citation (center), and Joint (right) scores. Full corresponds to model performance when inputting the entire Haystack, whereas Rand, Vect, LongE, KWs, RR3, and Orac correspond to the retrieval components of RAG systems. Models are ranked by Oracle Joint Score. For each model, #W_b reports the average number of words per bullet point.

(Note: “Orac” in the table stands for Oracle—a cheat mode where the model is given only the perfect documents. Even with this unfair advantage, models barely match human performance, suggesting a reasoning deficit, not just a retrieval deficit.)

2. The RAG vs. Long-Context Trade-off

The data reveals a fascinating dilemma for engineers choosing between RAG and Long-Context windows.

  • Long-Context LLMs (The “Full” column): These models generally struggle with Citation. When reading 100k tokens at once, models like GPT-4o and Claude 3 Opus have trouble pinpointing exactly which document contributed to a specific bullet point. Their Joint Scores in the “Full” setting are often poor (16.2 for GPT-4o).
  • Exception: Gemini 1.5 Pro performed significantly better than the others in the full-context setting (Joint Score 51.0), suggesting its long-context architecture is currently superior for attribution tasks.
  • RAG Systems: Using a retriever (like “RR3” - Cohere Rerank 3) significantly boosts Citation scores. By filtering the noise and giving the LLM only the relevant 15k tokens, the models make fewer attribution errors. However, strict retrieval often lowers Coverage because the retriever might accidentally filter out a relevant document before the LLM ever sees it.

The Takeaway: If you need your AI to know everything (Coverage), use a massive context window. If you need your AI to be accountable and cite sources accurately (Citation), a high-quality RAG pipeline is currently the safer bet.

3. The “Lost in the Middle” Phenomenon

Does the order of documents matter? Yes. The researchers re-sorted the Haystack so that the relevant documents sat at the Top or Bottom of the context window, or were left in Random order.

Table 3: Joint Scores of LLMs in the Full Context Setting, based on how documents are sorted. Documents can be in Random order or sorted such that relevant ones are at the Top or Bottom of the context window.

Table 3 shows significant position sensitivity.

  • GPT-4o and Claude 3 Opus performed nearly 10 points better when relevant information was at the Bottom (near the end of the prompt) compared to the Top.
  • Gemini 1.5 Pro had the opposite bias, preferring information at the Top.

This “positional bias” confirms that models are not processing the 100k+ token window uniformly. They have attention spikes at the beginning or end, often skimming over information buried in the middle.
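
As a rough illustration (not the paper’s code), a helper like the one below can be used to probe this bias in your own evaluations; all names are hypothetical, and each document is assumed to be a dict with an "id" field.

```python
import random
from typing import Dict, List, Set


def arrange_haystack(
    docs: List[Dict], relevant_ids: Set[int], position: str = "bottom", seed: int = 0
) -> List[Dict]:
    """Place documents containing relevant insights at the top, the bottom, or in random order."""
    rng = random.Random(seed)
    relevant = [d for d in docs if d["id"] in relevant_ids]
    fillers = [d for d in docs if d["id"] not in relevant_ids]
    rng.shuffle(fillers)
    if position == "top":
        return relevant + fillers
    if position == "bottom":
        return fillers + relevant
    ordered = docs[:]  # "random": shuffle everything uniformly
    rng.shuffle(ordered)
    return ordered
```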

Discussion and Implications

The “Summary of a Haystack” paper serves as a reality check for the AI industry. While marketing materials boast of “10 million token windows” and “perfect recall,” the ability to perform useful work over that context is a different story.

Why This Matters for Students and Researchers

  1. Recall \(\neq\) Reasoning: Just because a model can repeat a password hidden in a book doesn’t mean it can write a book report. We need to stop being impressed by simple retrieval.
  2. Citation is Hard: Hallucination remains a major issue. In the SummHay tests, models frequently hallucinated citations or failed to cite documents that clearly contained the information. For academic or enterprise applications, this is a critical failure point.
  3. Holistic RAG Evaluation: SummHay offers a way to test the entire pipeline. Usually, developers test the Retriever and the Generator separately. SummHay tests the end result: Did the user get the right answer with the right proof?

Future Directions

The authors note that even their “Human Performance” estimate (56%) is not a hard ceiling—it was done under time constraints. However, the fact that state-of-the-art models are 20 points behind suggests we have a long way to go.

Future systems will need to improve in two specific areas:

  1. Active Reasoning: Models need to better identify when information is repeated across documents and synthesize it, rather than just treating documents as a bag of unconnected words.
  2. Attribution Mechanisms: We may need architectural changes or specific fine-tuning that forces models to track where a piece of information came from while they are generating text.

Conclusion

As we move into the era of “infinite context,” our benchmarks must evolve. “Summary of a Haystack” (SummHay) provides a blueprint for this evolution. By using synthetic data to create objective ground truth for complex summarization tasks, it exposes the limitations of current LLMs and RAG systems.

The challenge is set: Can the next generation of models read a library, understand the narrative, and cite its sources with the precision of a human researcher? Currently, the answer is “not quite.” But thanks to benchmarks like SummHay, we now have a measuring stick to track our progress.