Summary generation is often cited as one of the “solved” problems in Natural Language Processing (NLP). We feed a news article into ChatGPT or Claude, ask for a paragraph summary, and usually get a coherent result. But when we move away from dry, factual news reports into the world of narrative text—stories, novels, and creative writing—the cracks begin to show.

In the news, facts are explicit. In stories, meaning is often hidden in subtext, dialogue, and irony. If an LLM misses the subtext, it writes a summary that looks right but is fundamentally wrong.

This blog post explores STORYSUMM, a recent research paper that exposes just how difficult it is to evaluate faithfulness in narrative summarization. The researchers introduce a new dataset, show that individual human annotators often miss subtle errors, and demonstrate that our current automatic evaluation metrics are struggling to keep up.

The Problem: Faithfulness in Fiction

In NLP, faithfulness refers to factual consistency. A faithful summary contains only information that can be supported by the source text. It doesn’t hallucinate new details, and it doesn’t twist the original meaning.

Evaluating faithfulness in news is relatively straightforward. If the article says “inflation rose by 2%,” and the summary says “inflation dropped,” it’s an error. But stories rely on interpretation.
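For explicit claims like that, even a lightweight entailment check can flag the problem. The sketch below runs an off-the-shelf NLI model (roberta-large-mnli from Hugging Face) on a premise from the article and a claim from the summary; this is just to make the idea concrete, and is not one of the metrics benchmarked in the paper.

```python
# Minimal sketch: check an explicit summary claim against the source with an
# off-the-shelf NLI model. Not one of the paper's metrics, just an illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

source_sentence = "Inflation rose by 2%."   # premise, from the article
summary_claim = "Inflation dropped."        # hypothesis, from the summary

inputs = tokenizer(source_sentence, summary_claim, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1).squeeze()

# roberta-large-mnli label order: contradiction, neutral, entailment
for label, p in zip(["contradiction", "neutral", "entailment"], probs.tolist()):
    print(f"{label}: {p:.2f}")
# A high contradiction score is a strong signal that the claim is unfaithful.
```

Checks like this only work when the claim is explicit. The whole point of STORYSUMM is that in fiction it usually isn’t.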

Consider the example below from the STORYSUMM dataset. A protagonist visits their mother’s grave. The text contains lines like “You brat!” and “You didn’t know how to wash in the tub, so I showed you how!” The protagonist smiles and says, “At least you won’t hurt me… anymore.”

Table 3 showing LLM errors regarding a mother’s ghost.

As shown in Table 3, three different state-of-the-art models (Davinci-3, ChatGPT, and Claude-2) completely misinterpreted this scene. They summarized it as a “comforting” reunion where the protagonist “reminisces.” They missed the subtext of abuse entirely. This isn’t just a minor detail; it changes the entire plot.

This is the core challenge: How do we detect these errors when the summaries sound so convincing?

Introducing STORYSUMM

To answer this, the researchers created STORYSUMM, a new benchmark dataset designed to stress-test evaluation methods.

The Dataset

The dataset consists of 96 story-summary pairs.

  1. Stories: Sourced from Reddit communities like r/shortstories. These are amateur-written, short (under a page), and—crucially—unlikely to be in the training data of older LLMs. They are also dense with narrative elements like plot twists and dialogue.
  2. Summaries: Generated by a variety of models, ranging from older models like Davinci-3 to newer powerhouses like GPT-4 and Claude-3.
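Conceptually, each record pairs one story with one model-generated summary, plus the faithfulness labels collected later. The sketch below is illustrative only; the field names are my own, not the released dataset’s schema.

```python
# Illustrative record structure for a STORYSUMM-style pair. Field names are
# assumptions for exposition, not the dataset's actual schema.
from dataclasses import dataclass, field

@dataclass
class StorySummaryPair:
    story_id: str                    # e.g. an anonymized Reddit post ID
    story: str                       # short, amateur-written story (under a page)
    summary: str                     # model-generated summary
    model: str                       # e.g. "davinci-3", "gpt-4", "claude-3"
    faithful: bool | None = None     # overall label, filled in by annotation
    errors: list[str] = field(default_factory=list)  # validated inconsistencies
```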

The researchers found that while newer models are improving, unfaithfulness remains a massive issue.

Table 2 showing the percentage of faithful summaries by model.

As seen in Table 2, the “faithfulness” rate varies wildly. While Claude-3 performed exceptionally well (90.5% faithful), other highly capable models like GPT-4 and ChatGPT hovered around 50-60%. This means nearly half of the summaries generated by these models contained at least one factual error.

The Human Factor: “All That Glitters is Not Gold”

Here is where the paper makes its most significant contribution. Usually, in NLP research, human evaluation is considered the “Gold Standard.” If a human annotator says a summary is good, we accept it as truth.

However, the STORYSUMM team hypothesized that for complex stories, one human opinion isn’t enough. They tested three different human annotation protocols to see which one actually caught the errors.

1. The Standard Annotators (Upwork)

They hired workers from Upwork (finding that Mechanical Turk workers produced unusable data) and paid them to perform fine-grained, sentence-by-sentence evaluation. This is the standard “high-quality” human eval used in most research.

2. The Experts (The Authors)

Three authors of the paper manually reviewed every summary, discussing disagreements until they reached a consensus. This was a time-consuming, rigorous process.

3. The Hybrid Method (AI-Assisted)

They used GPT-4 to generate a list of potential inconsistencies and asked humans to verify them. The idea was that the AI might spot subtle details the humans missed.
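A minimal sketch of the automated half of this protocol is shown below; the prompt wording and the use of the OpenAI Python client are my assumptions, not the paper’s exact setup. The key design choice is that the model’s output is treated as a list of candidates for humans to verify, never as a verdict.

```python
# Sketch of the automated step in a hybrid protocol: ask an LLM to list
# candidate inconsistencies, which human annotators then verify one by one.
# Prompt wording is an assumption, not the paper's exact prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def candidate_inconsistencies(story: str, summary: str, model: str = "gpt-4") -> list[str]:
    prompt = (
        f"Story:\n{story}\n\n"
        f"Summary:\n{summary}\n\n"
        "List every claim in the summary that the story does not support, "
        "one per line. If there are none, reply with exactly: NONE"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content.strip()
    if text.upper() == "NONE":
        return []
    return [line.lstrip("-• ").strip() for line in text.splitlines() if line.strip()]
```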

Figure 6 showing the interface for expert labeling.

The Results: Humans Miss Things Too

The researchers found that no single method caught every error.

  • Annotators missed subtle plot points (called “Hard” errors).
  • Experts missed some errors that the Annotators caught.
  • The Hybrid method found the most errors but also generated many false positives (where the AI hallucinated an error that wasn’t there).

Confusion matrices showing disagreement between human methods.

Figure 3 illustrates the confusion matrices between methods. The numbers in the bottom-left and top-right quadrants represent disagreement. For example, there were 19 summaries that the Standard Annotators judged faithful but the Experts identified as unfaithful. Conversely, there were 11 summaries that the Annotators flagged as unfaithful but the Experts had judged faithful.

Defining “Easy” vs. “Hard” Errors

The researchers categorized errors based on how difficult they were to detect:

  • Easy Errors: Detected by all three annotators. These are usually obvious contradictions (e.g., the summary says the character died, but the story says they lived).
  • Hard Errors: Detected by only one annotator, or only under a particular protocol. These often involve misreading pronouns, chronology, or a character’s motivations.
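In other words, difficulty is defined by agreement: the more independent judgments it takes to surface an error, the harder it is. A toy version of that split, following the description above rather than the paper’s exact bookkeeping, looks like this:

```python
# Toy version of the Easy/Hard split described above: an error is "easy" if
# every judge caught it, "hard" if only some did. The paper's actual
# categorization involves more bookkeeping; this just illustrates the idea.
def error_difficulty(flagged_by: set[str], all_judges: set[str]) -> str:
    return "easy" if flagged_by == all_judges else "hard"

judges = {"annotator_1", "annotator_2", "annotator_3"}
print(error_difficulty(judges, judges))            # easy: everyone caught it
print(error_difficulty({"annotator_2"}, judges))   # hard: only one did
```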

Table 4 showing examples of Easy vs Hard errors.

Table 4 provides fascinating examples of this distinction. Look at the “Hard” error in the middle: the story mentions a character named Jane throwing AirTags out a window. The summary claims Jane forced Margot to throw them. This is a subtle attribution error that is easy to skim past if you aren’t reading carefully.

The Hybrid Hallucination Problem

While the Hybrid method (asking GPT-4 to find errors for humans to check) had high coverage, it wasn’t perfect.

Figure 2 showing incorrect inconsistencies generated by the Hybrid method.

As shown in Figure 2, the AI sometimes hallucinated inconsistencies. In this example, GPT-4 claimed the summary was wrong about “Hope” being a victim, and the human annotators agreed with it. Upon closer inspection, however, the story confirms that Hope is a victim: the AI, and the humans who trusted it, were wrong. This highlights the danger of over-relying on AI for evaluation without rigorous checks.

The “Expanded Gold” Standard

Because no single method was perfect, the researchers merged the validated labels from all three methods to create an Expanded Gold set of labels. This combined set serves as a much more rigorous ground truth than the labels used in previous studies in this domain.
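The merge itself is conceptually just a union: a summary counts as faithful only if none of the protocols contributed a validated error. A rough sketch, assuming each protocol hands over a set of validated error descriptions per summary:

```python
# Sketch of the "Expanded Gold" merge: union the validated errors from every
# protocol and call a summary faithful only if that union is empty.
# (The paper's validation and merging steps are more involved than this.)
def expanded_gold(errors_by_protocol: dict[str, set[str]]) -> dict:
    all_errors = set().union(*errors_by_protocol.values())
    return {"faithful": not all_errors, "errors": sorted(all_errors)}

print(expanded_gold({
    "annotators": {"calls the graveside visit a comforting reunion"},
    "experts":    {"calls the graveside visit a comforting reunion",
                   "misses the subtext of abuse"},
    "hybrid":     set(),
}))
```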

Figure 4 showing the overlap of errors found by different methods.

Figure 4 visualizes this overlap. You can see that the “Hybrid” method (Right) caught 15 errors that the other methods missed, while the “Expert” method (Center) surfaced errors of its own that neither of the others found. This confirms the paper’s recommendation: to truly evaluate faithfulness in narrative summaries, you need diverse annotation protocols.

Benchmarking Automatic Metrics

With a high-quality “Expanded Gold” dataset in hand, the researchers turned their attention to automatic metrics. If humans struggle this much, can automated systems do any better?

They tested several methods:

  • Binary / Chain-of-Thought (CoT): Prompting LLMs (GPT-4, Claude-3) to judge faithfulness directly, either with a single yes/no verdict (Binary) or after step-by-step reasoning (CoT).
  • FABLES: An approach that breaks summaries into atomic claims and verifies each claim against the source.
  • MiniCheck: A specialized fact-checking model.
  • UniEval / AlignScore: Trained evaluation metrics.
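To make the simplest of these concrete, here is a minimal sketch of a Binary-style judge; the prompt and answer parsing are my own simplification, and the CoT variant differs only in asking the model to reason step by step before giving its verdict.

```python
# Minimal sketch of a Binary prompting judge: one yes/no verdict per summary.
# Prompt wording and parsing are simplifications, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()

def binary_faithfulness(story: str, summary: str, model: str = "gpt-4") -> bool:
    prompt = (
        f"Story:\n{story}\n\n"
        f"Summary:\n{summary}\n\n"
        "Is every claim in the summary consistent with the story? Answer YES or NO."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```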

The results were sobering.

Table 7 showing model scores against annotator labels.

Table 7 shows the performance of these metrics against the standard Annotator labels.

  • Pure Prompting (Binary/CoT): These models (like Claude-3 and GPT-4) tend to be overly optimistic, labeling 90-95% of summaries as faithful. They miss almost all the “Hard” errors (detecting only ~9-20%).
  • MiniCheck: This model went the other way, labeling almost everything as unfaithful (only 18% faithful). It caught the hard errors but had terrible precision.
  • FABLES (GPT-4): This was the strongest performer, achieving the best balance. However, even the best method only reached a balanced accuracy of about 67%.
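For reference, balanced accuracy is the average of the recall on faithful summaries and the recall on unfaithful ones, so a judge that blindly labels everything “faithful” scores only 0.5 no matter how skewed the labels are. A quick check with scikit-learn:

```python
# Balanced accuracy = mean of per-class recall. An "always faithful" judge
# gets 0.5 regardless of how imbalanced the labels are.
from sklearn.metrics import balanced_accuracy_score

gold = [1, 1, 1, 1, 0, 0]   # 1 = faithful, 0 = unfaithful
pred = [1, 1, 1, 1, 1, 1]   # a judge that always says "faithful"
print(balanced_accuracy_score(gold, pred))  # 0.5
```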

When tested against the stricter Expanded Gold labels (the ones that include the tricky errors humans missed), the performance of most metrics dropped even further. FABLES, for example, saw a significant drop in its ability to detect hard errors.

Conclusion: The Path Forward

The STORYSUMM paper serves as a reality check for the NLP community. As LLMs become part of our daily reading and writing workflows, we implicitly trust them to summarize information accurately.

However, this research highlights two critical gaps:

  1. The Narrative Gap: LLMs still struggle to understand the “soul” of a story—the subtext, the character motivations, and the subtle twists. They can summarize the plot points while missing the point of the story.
  2. The Evaluation Gap: We cannot rely on simple “thumbs up/thumbs down” human annotation or current automatic metrics to catch these errors.

The authors recommend that future evaluation campaigns for narrative summarization should not rely on a single protocol. Instead, we must use a combination of expert review, fine-grained annotation, and AI-assisted checks to establish ground truth.

Until we can build evaluation metrics that understand irony as well as a human reader (or better), we should remain skeptical of the summaries we read—especially when the story gets complicated.