The race for longer context windows in Large Language Models (LLMs) has been one of the defining trends of the last year. We have moved rapidly from models that could read a few pages to models like Gemini-1.5-Pro and GPT-4o, which boast context windows of 128k, 200k, or even 1 million tokens. Theoretically, this allows an AI to ingest hundreds of financial reports, legal contracts, or academic papers simultaneously and answer complex questions about them.

But there is a discrepancy between marketing claims and technical reality. Can these models actually reason across that much data, or are they just really good at finding specific keywords?

A recent paper, “Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA,” introduces a new benchmark called Loong to answer this question. The researchers argue that previous benchmarks are too simple, often relying on “needle-in-a-haystack” tests that don’t reflect real-world usage. By testing models on complex, multi-document tasks, they reveal that even the most advanced LLMs struggle significantly when required to synthesize information rather than just retrieve it.

This post breaks down the Loong benchmark, the methodology behind it, and the surprising results regarding the limitations of current Long-Context LLMs (LCLMs) and Retrieval-Augmented Generation (RAG).

The Problem: The “Needle” Illusion

To understand why we need a new benchmark, we first have to look at how long-context models are currently evaluated. The industry standard has largely been the “Needle-in-a-Haystack” (NIAH) test. In this setup, a specific piece of information (the needle) is inserted into a large amount of irrelevant text (the haystack). The model is asked to find that information.
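
To make the setup concrete, here is a minimal sketch of how a NIAH test case might be constructed. The filler text, the needle, and the commented-out `ask_model` call are hypothetical placeholders, not the procedure of any specific benchmark.

```python
def build_niah_case(needle: str, filler_paragraphs: list[str], depth: float) -> str:
    """Insert a 'needle' sentence at a relative depth (0.0 = start, 1.0 = end)
    of a haystack built from irrelevant filler paragraphs."""
    position = int(depth * len(filler_paragraphs))
    haystack = filler_paragraphs[:position] + [needle] + filler_paragraphs[position:]
    return "\n\n".join(haystack)

# Hypothetical usage: the model only has to retrieve one sentence verbatim.
needle = "The access code mentioned in the memo is 7421."
filler = [f"Filler paragraph {i} about unrelated topics." for i in range(5000)]
prompt = (build_niah_case(needle, filler, depth=0.5)
          + "\n\nQuestion: What is the access code mentioned in the memo?")
# answer = ask_model(prompt)  # placeholder for whichever LLM API is under test
```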

While NIAH is a useful sanity check, it is fundamentally a retrieval task, not a reasoning task. In the real world, a financial analyst doesn’t just want to know “What was the revenue in Q3?” (retrieval). They want to know “Which company had the highest debt-to-equity ratio across these 20 reports, and how does that compare to the industry trend?” (reasoning).

The authors of the Loong paper identify a critical flaw in existing benchmarks: Evidence Centralization. In datasets like LongBench or simple QA tasks, the answer is often contained entirely within a single segment of the input. The model can simply locate that segment and ignore the rest.

Figure 1: Comparison of previous benchmarks versus Loong. The top image shows a single document containing the answer, while the bottom image shows the Loong approach, where evidence is scattered across multiple documents.

As illustrated in Figure 1, previous benchmarks allow models to take shortcuts. Loong, however, enforces a “Leave No Document Behind” philosophy. The evidence required to answer the prompt is scattered across multiple documents. If the model ignores any part of the input—due to attention loss or context limits—it will fail the task.

How Loong Compares to Existing Benchmarks

The researchers compared Loong against other popular benchmarks like L-Eval, LongBench, and RULER. The key differentiators are the focus on multi-document tasks and “High Evidence Dispersion”—meaning the answer isn’t just in one place.

Table comparing the characteristics of Loong against benchmarks like L-Eval, LongBench, and RULER.

The Loong Benchmark: Methodology

To build a benchmark that reflects reality, the researchers moved away from synthetic data. They collected documents from three complex, real-world domains:

  1. Financial Reports: 2024 quarterly and annual reports.
  2. Legal Cases: Judgments from higher/intermediate courts.
  3. Academic Papers: Recent arXiv papers.

The benchmark consists of 1,600 newly annotated test instances across varying lengths, from 10k tokens all the way up to 250k+ tokens.

Four Types of Reasoning Tasks

The core contribution of this paper is the categorization of long-context tasks. The researchers designed four distinct task types to test different cognitive capabilities of LLMs; a sketch of how such task instances might be represented follows the four descriptions below.

Diagram illustrating the four evaluation tasks: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning.

1. Spotlight Locating

This is the baseline task, similar to traditional retrieval. The model must find a specific piece of information located in one document among many. It tests the model’s ability to filter out noise.

  • Example: “What is the Basic Earnings Per Share for Dominari Holdings?”

2. Comparison

This task forces the model to locate evidence in multiple documents and compare the values.

  • Example: “Which of the listed companies has the highest non-current assets?” To answer this, the model must extract the asset value for every single company in the context window and then mathematically compare them.

3. Clustering

Here, the model must extract relevant data and group it based on specific criteria.

  • Example: “Categorize these companies into ‘high payable’, ‘medium payable’, and ‘low payable’ groups.” This requires a global understanding of the entire context window to establish the grouping logic and apply it to every entity.

4. Chain of Reasoning

This is the most complex task. It requires multi-hop logic where one piece of evidence leads to another.

  • Example: “Analyze the cash flow trend for Company X from 2022 to 2024.” The model must chronologically order documents, extract data points from each year, and describe the trajectory.
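
To make the taxonomy concrete, here is a minimal sketch of how instances of these four task types might be represented. The schema, field names, and document placeholders are hypothetical illustrations, not the paper's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class MultiDocInstance:
    """Hypothetical schema for a Loong-style multi-document QA instance."""
    task_type: str                 # "spotlight", "comparison", "clustering", or "chain"
    question: str
    documents: list[str]           # full text of every document placed in the context
    evidence_doc_ids: list[int] = field(default_factory=list)  # docs holding required evidence

# Spotlight Locating: the evidence sits in exactly one document.
spotlight = MultiDocInstance(
    task_type="spotlight",
    question="What is the Basic Earnings Per Share for Dominari Holdings?",
    documents=["<report A>", "<report B>", "<report C>"],
    evidence_doc_ids=[1],
)

# Comparison, Clustering, and Chain of Reasoning disperse the evidence,
# so skipping any document makes the question unanswerable.
comparison = MultiDocInstance(
    task_type="comparison",
    question="Which of the listed companies has the highest non-current assets?",
    documents=["<report A>", "<report B>", "<report C>"],
    evidence_doc_ids=[0, 1, 2],
)
```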

Data Statistics

The length distribution of these tasks is demanding. As shown in Table 2, the average token count per task hovers around 100k-119k tokens, which ensures the benchmark is genuinely testing the “long” in Long-Context LLMs.

Table 2: Data statistics of the Loong benchmark, showing average token counts and number of instances per task.

Experiments and Results

The researchers evaluated a suite of state-of-the-art models, including GPT-4o (128k), Gemini-1.5-Pro (1M), Claude 3.5 Sonnet (200k), and open-source models like Qwen2-72B.

The evaluation metric used was GPT-4-as-a-Judge, scoring answers on Accuracy, Hallucination, and Completeness. They reported two metrics (a small scoring sketch follows the list):

  1. Avg Score: A score from 0-100.
  2. Perfect Rate: The percentage of answers that received a perfect 100 score.
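
Assuming the judge assigns each answer a score from 0 to 100 (my reading of the setup, not the paper's exact implementation), the two reported metrics could be computed roughly like this:

```python
def summarize_judge_scores(scores: list[float]) -> dict:
    """Aggregate per-answer judge scores (0-100) into the two reported metrics."""
    avg_score = sum(scores) / len(scores)
    perfect_rate = 100 * sum(1 for s in scores if s == 100) / len(scores)
    return {"avg_score": avg_score, "perfect_rate_percent": perfect_rate}

# Hypothetical example: five judged answers, one of them perfect.
print(summarize_judge_scores([100, 80, 55, 0, 70]))
# {'avg_score': 61.0, 'perfect_rate_percent': 20.0}
```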

Main Findings

The results were sobering. Even the most powerful models struggle with holistic multi-document reasoning.

Table 4: Performance of various LLMs on the Loong tasks.

Table 4 breaks down performance by context length sets (Set 1 being the shortest, Set 4 the longest). Several key trends emerge:

  1. Gemini-1.5-Pro Takes the Lead: Thanks to its massive 1M-token context window training, Gemini showed the most consistent performance, particularly in the ultra-long (200k-250k) set.
  2. The “Effective” Window is Smaller than Advertised: Consider GPT-4o and Qwen2. Although both claim 128k windows, their performance degrades significantly once the input exceeds roughly 50k tokens (Set 2). This suggests an “effective zone” that is much smaller than the technical limit.
  3. Complexity Kills Performance: Models performed decently on “Spotlight Locating” (the simple search task). However, scores plummeted for “Clustering” and “Chain of Reasoning.” For example, in the longest setting (Set 4), Kimi-Chat scored a 0.00 on the Chain of Reasoning task.
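
One way to locate the “effective zone” mentioned in point 2 is to bucket per-instance results by input length and watch where the average score drops. A minimal sketch, assuming hypothetical (input_tokens, score) pairs rather than the paper's data:

```python
from collections import defaultdict

def score_by_length_bucket(results, bucket_size=50_000):
    """Group (input_tokens, score) pairs into length buckets and average each bucket.
    A sharp drop between adjacent buckets marks where the effective window ends."""
    buckets = defaultdict(list)
    for tokens, score in results:
        buckets[tokens // bucket_size].append(score)
    return {
        f"{b * bucket_size // 1000}k-{(b + 1) * bucket_size // 1000}k": sum(v) / len(v)
        for b, v in sorted(buckets.items())
    }

# Hypothetical results: scores fall off past ~50k tokens despite a 128k window.
results = [(20_000, 72), (45_000, 68), (80_000, 41), (120_000, 33)]
print(score_by_length_bucket(results))
# {'0k-50k': 70.0, '50k-100k': 41.0, '100k-150k': 33.0}
```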

The Failure of RAG

Perhaps the most interesting experiment in the paper involves Retrieval-Augmented Generation (RAG). RAG is the standard industry solution for handling long context: chunk the data, store it in a database, retrieve the top-k relevant chunks, and feed them to the LLM.
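
As a rough sketch of that pipeline, using cosine similarity over placeholder embeddings rather than any particular vector database or embedding API:

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k=5):
    """Rank chunks by cosine similarity to the query and keep only the top k.
    Anything outside the top k never reaches the LLM, which is exactly what
    hurts tasks whose evidence is spread across many documents."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    best = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in best]

# Hypothetical usage: embed_text stands in for whatever embedding model the system uses.
# context = "\n\n".join(top_k_chunks(embed_text(question), chunk_embeddings, chunks, k=5))
# answer = long_context_llm(question, context)
```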

The researchers tested GPT-4o and Qwen2 equipped with RAG (using OpenAI and BGE embeddings) against the standard long-context input.

Figure 3: Bar charts showing that adding RAG actually decreases the average score compared to the baseline model.

Figure 3 shows a counter-intuitive result: RAG made the models worse.

Why? Because Loong tasks require global context.

  • In a Spotlight task, RAG works fine because the answer is in one specific chunk.
  • In a Clustering or Comparison task, the evidence is distributed evenly across the whole text. If RAG retrieves only the “Top-5” chunks, it inevitably misses necessary documents.

The researchers quantified this “Recall Rate” of RAG.

Table 6: Recall rate of documents using RAG. Even with Top-50 retrieval, recall never reaches 100%.

As shown in Table 6, even when retrieving the Top-50 chunks, the system only recalled about 60-64% of the necessary documents. If a question asks you to compare 10 companies, and RAG only retrieves 6 of them, it is impossible for the LLM to answer correctly.
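
The recall being measured here is straightforward to compute. A minimal sketch, with a hypothetical example mirroring the “10 companies needed, 6 retrieved” scenario above:

```python
def evidence_recall(required_docs: set, retrieved_docs: set) -> float:
    """Fraction of the documents needed for the answer that the retriever actually returned."""
    return len(required_docs & retrieved_docs) / len(required_docs)

required = {f"company_{i}_report" for i in range(10)}   # the question needs all 10 reports
retrieved = {f"company_{i}_report" for i in range(6)}   # the retriever surfaces only 6 of them
print(evidence_recall(required, retrieved))             # 0.6 -> the comparison cannot be completed
```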

This proves that for complex, multi-document analysis, relying solely on RAG is insufficient. The model effectively needs to “read” the whole text.

The Scaling Law of Context

The paper also touches on a “scaling law” for context windows. The performance degradation seen in models like GPT-4o and Qwen2 as the input length grows (even while staying within their advertised window) suggests that simply stretching a window with positional-encoding tricks (such as RoPE-based extensions) is not enough.

To have a truly effective 128k window, a model likely needs to be trained on sequences longer than 128k. Gemini-1.5-Pro’s relative stability across lengths supports this; because it was trained for extreme lengths (up to 1M or 10M), the 200k range is well within its comfort zone.
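
For readers unfamiliar with these “stretching” tricks, here is a minimal NumPy sketch of RoPE with linear position interpolation, where a scale factor below 1 compresses positions so a longer sequence reuses the angle range seen during training. This illustrates the general technique only; it is not the paper's method or any specific model's implementation.

```python
import numpy as np

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0,
               scale: float = 1.0) -> np.ndarray:
    """Rotary position embedding on x of shape (seq_len, dim), with dim even.
    scale < 1 implements linear position interpolation: positions are compressed
    so a context longer than the training length maps back into the angle range
    the model saw during training."""
    dim = x.shape[-1]
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # theta_i = base^(-2i/d)
    angles = np.outer(positions * scale, inv_freq)     # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

# Stretching a model trained on 4k positions to a 16k input via a 0.25 scale factor:
x = np.random.randn(16_384, 64)
rotated = apply_rope(x, np.arange(16_384), scale=4_096 / 16_384)
```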

Figure 4: Length distribution of test cases in Loong.

As Figure 4 indicates, Loong provides a balanced distribution of test lengths, allowing us to pinpoint exactly where a model’s “attention” begins to fail.

Conclusion and Implications

The “Loong” benchmark serves as a reality check for the AI industry. It moves the goalposts from simple retrieval to complex, integrated reasoning over long contexts.

The key takeaways for students and researchers are:

  1. Don’t Trust the Context Window Label: Just because a model accepts 128k tokens doesn’t mean it pays attention to all of them. Performance often drops sharply after 50k.
  2. RAG has Limits: RAG is excellent for finding a needle in a haystack. It is poor at summarizing the haystack or comparing every straw in the haystack.
  3. Future Architectures: The results imply that we need better training methodologies for long-context modeling. We cannot rely on “extending” short models; we must train models to hold massive amounts of data in active memory for reasoning.

For real-world applications like financial auditing, legal discovery, or meta-analysis of scientific literature, “Leave No Document Behind” is not just a catchy title—it’s a strict requirement. Until LLMs can pass benchmarks like Loong with high marks, their utility in these high-stakes, multi-document environments remains limited.