Introduction

For years, the “holy grail” of natural language processing has been true reading comprehension. We have moved from simple keyword matching to semantic search, and now to Large Language Models (LLMs) that can process massive amounts of information. But there is a distinct difference between processing text and understanding literature.

Consider the act of analyzing a novel like The Great Gatsby or Frankenstein. When a literary scholar makes a claim—for example, arguing that a character’s clothing symbolizes their moral decay—they must support that claim with specific textual evidence. They don’t just search for the word “coat”; they recall the narrative arc, understand the subtext, and pinpoint the exact moment where the description supports their theory.

This process is known as Literary Evidence Retrieval.

In a recent paper, researchers Katherine Thai and Mohit Iyyer explore whether modern “long-context” LLMs—models capable of ingesting entire books in a single prompt—can perform this high-level task. Their work repurposes the RELiC dataset to test whether models like Gemini Pro 2.5 and GPT-4o can act as literary detectives. The results are surprising: the best model now outperforms a human expert on this benchmark, yet it still struggles with the subtle nuances that define great literature.

In this post, we will tear down their methodology, examine the transition from “retrieval” to “reasoning,” and analyze why even the smartest AI still has trouble reading between the lines.

From Retrieval to Reasoning

To understand the innovation in this paper, we first need to look at how this problem was previously solved.

The Old Way: RAG and Embeddings

Traditionally, if you wanted an AI to find a quote in a book, you would use a Retrieval-Augmented Generation (RAG) approach. You would chop the book into small chunks, turn those chunks into mathematical vectors (embeddings), and then search for the chunk that is mathematically closest to your query.

The problem with this approach in literature is that the “query” (a literary critique) often doesn’t share keywords with the “answer” (the quote). A critic might talk about “isolation” and “melancholy,” while the supporting quote describes “a cold, empty room.” A standard retriever might miss the connection entirely because the vocabulary doesn’t overlap.
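
To make the contrast concrete, here is a minimal sketch of that embedding pipeline. Everything in it is illustrative rather than the paper's baseline setup: the chunk size, the MiniLM embedding model, and the file name are assumptions.

```python
# A minimal sketch of the embedding-based approach. The chunk size, the
# MiniLM model, and the file name are illustrative assumptions, not the
# paper's baseline setup.
from sentence_transformers import SentenceTransformer, util

def chunk_text(book: str, chunk_size: int = 120) -> list[str]:
    """Split the book into fixed-size word chunks."""
    words = book.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

model = SentenceTransformer("all-MiniLM-L6-v2")

book_chunks = chunk_text(open("frankenstein.txt", encoding="utf-8").read())
critique = "The creature's isolation and melancholy deepen after each rejection."

chunk_vecs = model.encode(book_chunks, convert_to_tensor=True)
query_vec = model.encode(critique, convert_to_tensor=True)

# Return the chunk whose embedding is closest to the critique.
best_idx = util.cos_sim(query_vec, chunk_vecs).argmax().item()
print(book_chunks[best_idx])
```

If the critique and the supporting quote share little vocabulary, the nearest chunk is often unrelated, which is exactly the failure mode described above.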

The New Way: Long-Context Reading

With the advent of models that can handle context windows of 128k, 200k, or even 1 million tokens, we no longer need to chop the book up. We can feed the entire novel into the model’s working memory alongside the critique.

The researchers define the task as follows:

  1. Input: The full text of a primary source (e.g., The Scarlet Letter).
  2. Input: An excerpt of literary criticism about that book, where the supporting quote has been replaced by a <MASK>.
  3. Goal: The model must generate the exact missing quotation from the book that supports the critic’s argument.

This mirrors the human process of literary analysis: holding the global context of the narrative in mind while performing a “close reading” to find specific evidence.
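
To make this concrete, here is a rough sketch of how a single instance could be assembled into a prompt. The instruction wording and the file name are assumptions for illustration; the paper specifies the task, not this exact template.

```python
# Illustrative assembly of one long-context instance: full book + masked
# critique. The prompt wording is an assumption, not the authors' template.
MASK = "<MASK>"

def build_prompt(book_text: str, masked_critique: str) -> str:
    return (
        "You are given the full text of a novel and an excerpt of literary "
        f"criticism in which a quotation has been replaced by {MASK}.\n"
        "Reproduce the exact missing quotation (no more than five "
        "consecutive sentences) from the novel.\n\n"
        f"NOVEL:\n{book_text}\n\n"
        f"CRITICISM:\n{masked_critique}\n\n"
        "MISSING QUOTATION:"
    )

critique = (
    "Hawthorne's portrait of Chillingworth verges on melodrama: "
    f"{MASK} The narrator lingers on this transformation."
)
prompt = build_prompt(open("scarlet_letter.txt", encoding="utf-8").read(), critique)
```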

Dataset Curation: Cleaning the Library

The researchers utilized the RELiC dataset, which contains thousands of literary claims. However, raw data is rarely perfect. To create a rigorous benchmark, they had to clean the data extensively.

The original dataset contained issues like:

  • OCR Artifacts: Garbled text from scanning old books.
  • Quote Leakage: Sometimes the critique would accidentally include parts of the quote, giving the answer away.
  • Location Spoilers: Phrases like “In the final chapter…” which make the retrieval task too easy.

After filtering for these issues, the researchers curated a high-quality subset of 292 examples spanning classic works of fiction.
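
The paper's exact filtering rules are not reproduced here, but heuristics in this spirit are easy to sketch. The regex, the garbling threshold, and the n-gram leak check below are hypothetical illustrations of the three issues above, not the authors' pipeline.

```python
import re

# Hypothetical cleanup heuristics, one per issue listed above.
LOCATION_SPOILER = re.compile(r"\b(in the (first|final|last) chapter|chapter \d+)\b", re.I)

def looks_garbled(text: str, max_junk: float = 0.05) -> bool:
    """Crude OCR check: too high a share of unexpected characters."""
    junk = sum(1 for c in text if not (c.isalnum() or c.isspace() or c in ".,;:'\"!?()-"))
    return junk / max(len(text), 1) > max_junk

def leaks_quote(critique: str, quote: str, n: int = 6) -> bool:
    """Flag critiques that already contain a long n-gram of the answer."""
    words = quote.lower().split()
    return any(" ".join(words[i:i + n]) in critique.lower()
               for i in range(len(words) - n + 1))

def keep(critique: str, quote: str) -> bool:
    return not (looks_garbled(critique)
                or leaks_quote(critique, quote)
                or LOCATION_SPOILER.search(critique))
```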

Table 4: Books included in the dataset. Token counts were computed with tiktoken.

As shown in Table 4 above, the primary sources are substantial. Novels like What Maisie Knew by Henry James exceed 124,000 tokens. This confirms that the task is genuinely “long-context”: the model cannot simply guess; it must navigate a massive search space to find a specific needle in the haystack.

To further illustrate the scale of the challenge, we can look at the summary statistics of the dataset.

Table 1: Summary statistics for long-context RELiC. Token counts were computed with the o200k_base encoding via tiktoken (https://github.com/openai/tiktoken) and word counts were computed by splitting on whitespace.

Table 1 reveals that for the 292 curated examples, the models must process an average of roughly 85,000 tokens per book. This requires a model architecture that doesn’t just “see” the text but can attend to specific details across a vast distance.
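
Reproducing these counts takes only a few lines with tiktoken, using the o200k_base encoding named in the Table 1 caption; the file path below is a placeholder.

```python
import tiktoken

# Token and word counts as described in the captions: tiktoken's o200k_base
# encoding for tokens, whitespace splitting for words.
enc = tiktoken.get_encoding("o200k_base")

book = open("what_maisie_knew.txt", encoding="utf-8").read()  # placeholder path
print("tokens:", len(enc.encode(book)))
print("words: ", len(book.split()))
```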

Experimental Setup

The researchers tested a variety of models, ranging from closed-source giants to open-weight contenders.

The Models:

  • Closed-Source: Gemini Pro 1.5 & 2.5, GPT-4o, o1, o3, Claude 3.7 Sonnet.
  • Open-Weight: Llama 3.1 & 3.3, Qwen 2.5, DeepSeek-R1.
  • Baseline: A standard embedding-based retriever (GTE-Qwen2-7B) to represent the “old way” of doing things.
  • Human Expert: One of the authors, with a degree in English literature, manually attempted a subset of the tasks to establish a human baseline.

Table 5: The upper rows display the evaluated LLMs, while the bottom row displays the text embedding model used for the baseline. Context lengths marked with * were extended with YaRN.

Table 5 lists the technical specifications. Note the context windows: most models tested support at least 128k context, with Gemini Pro pushing up to 1 million tokens.

The Prompts: They tested two prompting strategies:

  1. Simple: Just ask the model to fill in the mask.
  2. Explanation: Ask the model to first explain its reasoning for why a quote fits, and then provide the quote. This tests if “Chain of Thought” reasoning helps with literary interpretation.
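
For concreteness, the two strategies might look something like the instruction strings below; these paraphrase the idea and are not the authors' verbatim prompts.

```python
# Paraphrased prompting strategies (not the paper's verbatim templates).
SIMPLE = (
    "Fill in the <MASK> with the exact quotation from the novel, "
    "using no more than five consecutive sentences."
)

EXPLANATION = (
    "First, briefly explain what kind of evidence the critic's argument "
    "requires and where in the novel it appears. Then provide the exact "
    "quotation (no more than five consecutive sentences) that fills the <MASK>."
)
```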

Results: The Rise of the Machine Critic

The results of this study mark a significant moment in NLP history. For the first time on this specific task, an AI model has outperformed a human expert.

Table 2: Percentage of test set examples where the model generated the correct ground truth quotation for different folds of the test set. The \(\alpha\) and \(\beta\) columns contain the accuracy of each model on the human-evaluated and close reading subsets of the data, respectively.

Table 2 provides the core results. Let’s break down the key takeaways:

1. The Human Baseline is Broken

Look at the bottom row. The human expert achieved 55.0% accuracy on the human-evaluated subset (\(\alpha\)). This highlights how difficult the task is. Even for a human with a literature degree, correctly identifying the exact quote a critic had in mind is challenging because literary interpretation is subjective.

2. Gemini Pro 2.5 Takes the Crown

The top-performing model, Gemini Pro 2.5, achieved 62.5% on the human-evaluated subset, surpassing the human expert. On the full dataset, it reached 64.7% accuracy using the Explanation prompt. This suggests that the model’s ability to scan the entire text and cross-reference it with the semantic meaning of the critique is superior to human memory and search strategies.

3. The Embedding Baseline Fails

The traditional retrieval method (GTE-Qwen2-7B) scored a dismal 4.5%. This shows that literary evidence retrieval cannot be reduced to surface-level similarity search. It requires deep semantic understanding that simple vector similarity cannot capture.

4. The Open-Weight Gap

There is a stark contrast between closed-source and open-weight models. The best open model, DeepSeek-R1, achieved only 29.1% accuracy. This indicates that while open models are catching up in coding and math, they still lag significantly in the nuanced “interpretive reasoning” required for literature.

5. Close Reading (\(\beta\) fold)

The \(\beta\) column in Table 2 represents “Close Reading” examples. These are easier tasks where the critique actually quotes snippets of the text. You would expect high performance here.

  • Gemini Pro 2.5 dominates with 79.5%.
  • Llama 3.1 (8B) scores a tiny 2.6%. This shows that smaller models struggle to even utilize direct lexical overlap when the context (the whole book) is overwhelming. They essentially get “lost” in the long context.

Why Do Models Fail?

Despite the high scores of top models, they are far from perfect. The paper identifies two major failure modes: Overgeneration and Nuance Blindness.

The Problem of Overgeneration

The prompt explicitly asked models to provide a quote of “no more than five consecutive sentences.” However, models frequently ignored this, providing entire paragraphs or pages.

Table 3: The average length ratio for each model, defined as the ratio of the length of the model generation to that of the ground truth (measured in characters). All models have an average ratio greater than that of the human annotator (2.1), indicating overgeneration.

Table 3 shows the “Length Ratio.” A ratio of 1.0 would mean the model generated a quote exactly the same length as the ground truth.

  • The human annotator had a ratio of 2.1, meaning they naturally provided a bit more context than strictly necessary.
  • GPT-4.1 had a ratio of 4.8, providing nearly five times as much text as needed.
  • Llama 3.1 spiked to 5.9.

The researchers hypothesize that weaker models “compensate for uncertainty by producing longer outputs.” Essentially, they prefer to spray-and-pray, hoping the correct answer is somewhere inside the massive block of text they return.
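
The length ratio reported in Table 3 is straightforward to compute. A minimal sketch, assuming parallel lists of model generations and ground-truth quotes:

```python
# Average length ratio as defined in Table 3: characters in the model's
# generation divided by characters in the ground-truth quotation.
def average_length_ratio(generations: list[str], ground_truths: list[str]) -> float:
    ratios = [len(gen) / len(gt) for gen, gt in zip(generations, ground_truths) if gt]
    return sum(ratios) / len(ratios)

# A model that answers with a whole paragraph when one sentence suffices
# will push this average well above 1.0.
```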

Struggling with Nuance

The most fascinating failures occur when the model finds evidence, but not the right evidence.

In one example from The Scarlet Letter, the critic discusses “melodrama” in the description of the character Chillingworth.

  • The Human correctly identified a passage describing “a writhing horror” on Chillingworth’s face—a description that clearly fits the definition of melodrama.
  • The Models (Gemini, o3, GPT-4) all selected a different description of Chillingworth that appeared nearby. That passage described his physical appearance, specifically his uneven shoulders.

While the models found a description of the character, they failed to connect the specific concept of “melodrama” (exaggerated emotion) to the text. They understood the who (Chillingworth) and the where (the scene), but missed the why (the thematic connection).

This highlights that while models have massive processing power, they still lack the “literary taste” or the deep semantic alignment that a trained human reader possesses.

Conclusion

The work by Thai and Iyyer demonstrates a massive leap forward for Long-Context LLMs. We have moved from models that can simply summarize a book to models that can actively search through it to support complex arguments.

The fact that Gemini Pro 2.5 can outperform a human expert on this benchmark is a testament to the power of modern context windows and reasoning capabilities. However, the results also serve as a reality check. The massive gap between finding relevant text and finding the perfect textual evidence remains.

The models’ tendency to overgenerate and their struggle with thematic nuance suggest that we haven’t solved literary analysis just yet. We have built a very fast, very well-read librarian, but we haven’t yet built a literary critic.

For students and researchers, this paper opens up exciting avenues. It suggests that “reasoning” isn’t just about math or logic puzzles; it’s also about the interpretive, subjective reasoning required to understand art. As models improve, we may soon see AI tools that can assist scholars in navigating the vast ocean of world literature, uncovering connections that no single human could ever find.