If you were asked to define a “story,” you might instinctively reach for concepts like “characters,” “beginning, middle, and end,” or “conflict.” It feels intuitive. We tell stories every day—complaining about a boss, recounting a date, or explaining how we fixed a leaky faucet.

But when you try to teach a computer to identify a story, that intuition falls apart. Is a description of your morning routine a story? Is a rant about politics a story? What about a dream you had?

For years, Natural Language Processing (NLP) researchers have tried to solve this by creating strict rulebooks—“prescriptive” definitions of stories. They hand these rules to annotators and say, “Label this text based on these rules.” But a new research paper, The Empirical Variability of Narrative Perceptions of Social Media Texts, suggests this approach misses the messy, subjective reality of how human beings actually perceive narratives.

In this post, we will tear down the traditional “gold standard” of narrative detection. We will explore STORYPERCEPTIONS, a new dataset that captures the chaotic, opinionated, and fascinating way crowd workers (regular people) identify stories. We will also see how their perceptions clash with academic researchers and state-of-the-art Large Language Models (LLMs) like GPT-4.

The Problem with “Gold Labels”

In supervised machine learning, we usually rely on a “gold label”—the correct answer. If we are training a model to spot a cat, we show it thousands of pictures of cats. If a human says it’s a cat, it’s a cat.

But identifying a story isn’t like identifying a cat. It is a subjective experience.

Historically, NLP researchers have handled this by being prescriptive. They define a story (e.g., “A specific series of causally related events involving a character”) and train students to apply that definition rigidly. This creates clean data, but it might not reflect reality. It ignores the “reader response”—the feeling of being transported, the sense of suspense, or the emotional connection that makes something feel like a story to a layperson.

The researchers behind this paper took a different approach. They employed a descriptive paradigm. Instead of telling annotators what a story is, they asked them: “Do you think this is a story? And why?”

They collected 2,496 responses from 255 crowd workers on 502 Reddit posts (from the StorySeeker dataset). The result is a map of human disagreement, revealing that “storyness” is not a binary switch, but a complex interplay of text, intent, and reader emotion.
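
Before digging into the findings, it helps to see what this kind of descriptive data looks like once it is aggregated. Below is a minimal sketch, using hypothetical post IDs and column names rather than the actual STORYPERCEPTIONS schema, of turning per-worker votes into a majority label plus a simple disagreement score.

```python
# Minimal sketch: aggregating descriptive crowd annotations per post.
# Column names ("post_id", "is_story") are illustrative assumptions,
# not the actual STORYPERCEPTIONS schema.
import pandas as pd

annotations = pd.DataFrame({
    "post_id":  ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"],
    "is_story": [1, 1, 1, 0, 1, 1, 0, 0, 1, 0],  # one row per worker judgment
})

per_post = annotations.groupby("post_id")["is_story"].agg(
    n_votes="count",
    story_rate="mean",          # fraction of workers who said "story"
)
per_post["majority_label"] = (per_post["story_rate"] >= 0.5).astype(int)
# Disagreement: distance from a unanimous vote (0 = everyone agrees, 0.5 = even split).
per_post["disagreement"] = 0.5 - (per_post["story_rate"] - 0.5).abs()

print(per_post)
```

Keeping the per-post vote distribution (rather than collapsing it to a single label) is what lets the paper treat disagreement as signal instead of noise.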

The diagram illustrates how three distinct audiences—Crowd, Researchers, and LLM—interpret whether a given piece of text constitutes ‘a story.’

As shown in Figure 1 above, different observers see the same text differently. A crowd worker might see a conflict and say “Yes.” A researcher might see a description of a habit and say “No.” An LLM might get confused by the hypothetical nature of the text (like a dream). This variability is not noise; it is the signal we need to understand.

Deconstructing the “Why”: A Taxonomy of Storytelling

To understand the crowd’s perspective, the researchers didn’t just look at the Yes/No labels. They analyzed the rationales—the free-text explanations workers wrote to justify their choices. Through a process called open coding, they distilled thousands of messy explanations into a structured Taxonomy of Narrative Perceptions.

This taxonomy revealed that when people talk about stories, they look at two main categories: Discourse and Features.

1. Discourse: The Author’s Goal

Discourse refers to the mode of communication. Is the author trying to argue? Explain? Entertain?

Table 1 shows co-occurring feature pairs, while Figure 3 shows how the discourse categories (the perceived goal of the text) relate to the “Story” label.

Figure 3 (bottom half of the image above) shows how different communicative goals correlate with the “Story” label.

  • Narrative/Story: Unsurprisingly, when the perceived goal is to narrate, the text is almost always labeled a story.
  • Description/Expression: These are also highly correlated with stories.
  • Argument/Suggestion/Rant: These are negatively correlated. If a reader feels you are trying to sell them something or yell at them, they are less likely to perceive your text as a story, even if you use anecdotes.

2. Features: Textual vs. Extra-Textual

This is where the findings get profound. The researchers found that crowd workers don’t just look for textual ingredients (things you can point to in the sentence). They also look for extra-textual ingredients (things that happen in the reader’s mind).

  • Textual Features: Characters, events, dialogue, time markers.
  • Extra-Textual (Aesthetic) Features: Suspense, emotional sensation, “feels like a story,” cohesion.

The relative prevalence of feature codes in story (vs. non-story) rationales.

Figure 2 visualizes which features predict a “Story” label.

  • Look at the top bars: Event Experience, Character Person, and Plot Sequence. These are the “Big Three.” If a text has specific events involving people in a sequence, it’s likely a story.
  • But look slightly lower: Cohesive/Interpretable and Problem/Conflict. These are aesthetic judgments. A text needs to “hang together” (cohesion) to be a story.
  • Now look at the bottom (orange bars): These are the absence of features. The strongest predictor of a “Not Story” label is the explicit absence of a plot sequence (NOT_plot_sequence).

This confirms that for a general audience, a story isn’t just a bag of words. It’s a structure that creates a specific feeling of coherence.
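
To make “relative prevalence” concrete, here is a minimal sketch of the kind of comparison behind Figure 2, using invented rationale codes and labels rather than the paper’s data: count how often each code appears in story rationales versus non-story rationales and take the difference.

```python
# Minimal sketch: relative prevalence of rationale codes in "story" vs. "not story"
# rationales, in the spirit of Figure 2. Codes and columns are illustrative assumptions.
import pandas as pd

rationales = pd.DataFrame({
    "label": ["story", "story", "story", "not_story", "not_story"],
    "codes": [
        ["event_experience", "character_person", "plot_sequence"],
        ["plot_sequence", "cohesive_interpretable"],
        ["character_person", "problem_conflict"],
        ["NOT_plot_sequence", "concept_definition"],
        ["NOT_plot_sequence", "behavior_strategy"],
    ],
})

def code_rates(df):
    """Fraction of rationales in df that mention each code."""
    exploded = df["codes"].explode()
    return exploded.value_counts() / len(df)

story_rates = code_rates(rationales[rationales["label"] == "story"])
non_story_rates = code_rates(rationales[rationales["label"] == "not_story"])

# Positive values: the code is relatively more common in story rationales.
relative_prevalence = story_rates.subtract(non_story_rates, fill_value=0.0)
print(relative_prevalence.sort_values(ascending=False))
```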

The Interplay of Features

One of the most interesting contributions of this paper is the analysis of how these features interact. A story rarely relies on a single element.

If you look back at Table 1 (in the image shared earlier with Figure 3), you can see the Normalized Pointwise Mutual Information (NPMI) between features. This measures how much more often two concepts appear together in a rationale than chance would predict, scaled to the range -1 to 1.

  • Cohesive Interpretable & Plot Sequence: This pairing has a very high score (0.4). This suggests that readers perceive a text as “cohesive” specifically because it has a “plot sequence.” The structure provides the glue.
  • Feels Like Story & Plot Sequence: This confirms that the intuitive “vibe” of a story is heavily dependent on identifying a sequence of events.

This implies that you cannot easily separate the “objective” text from the “subjective” reader experience. The plot provides the cohesion; the cohesion creates the “story feeling.”
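
For readers who want to compute this themselves, here is a minimal sketch of NPMI over rationale codes. The rationale sets below are invented for illustration, and the paper’s exact preprocessing may differ.

```python
# Minimal sketch: NPMI between two rationale codes, where each rationale is
# represented as the set of codes assigned to it. Illustrative data, not the paper's.
import math

rationales = [
    {"plot_sequence", "cohesive_interpretable", "event_experience"},
    {"plot_sequence", "feels_like_story"},
    {"character_person", "event_experience"},
    {"NOT_plot_sequence", "concept_definition"},
]

def npmi(code_a, code_b, docs):
    n = len(docs)
    p_a = sum(code_a in d for d in docs) / n
    p_b = sum(code_b in d for d in docs) / n
    p_ab = sum(code_a in d and code_b in d for d in docs) / n
    if p_ab == 0:
        return -1.0  # the codes never co-occur
    if p_ab == 1.0:
        return 1.0   # degenerate case: both codes appear in every rationale
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)  # normalize to [-1, 1]

print(npmi("plot_sequence", "cohesive_interpretable", rationales))
```

A score near 0.4, like the Cohesive Interpretable & Plot Sequence pairing above, means the two codes co-occur in rationales much more often than their individual frequencies would suggest.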

Why We Disagree: The “Plot” Problem

If everyone agrees that characters and events make a story, why is there so much disagreement? The dataset showed that crowd workers often split on the same text.

The disagreement usually isn’t about whether a person is present (that’s easy to spot). It is about the complexity and nature of the action.

Relative feature code prevalence in majority story (vs. minority non-story) rationales.

Figure 4 analyzes cases where the majority said “Yes, it’s a story,” but a minority said “No.”

  • The dissenters (the minority “No” voters) were most likely to cite NOT_plot_sequence or NOT_cohesive_interpretable.
  • This implies that the minority had a higher threshold for what counts as a plot. To the majority, “I went to the store and bought milk” might be enough of a sequence. To the minority, that’s just a statement—it lacks the twist or complication required for a plot.

The “Habit” vs. “Event” Distinction

One of the hardest challenges for both humans and AI is distinguishing between a specific Event (something that happened once at a specific time) and a Behavior/Strategy (something that happens generally).

  • Event: “I walked my dog yesterday and we saw a coyote.” (Specific time, specific occurrence).
  • Behavior: “I walk my dog every day to stay healthy.” (General habit, abstract).

Relative feature prevalence with unanimously-voted stories versus substantially divided-vote stories.

Figure 5 shows the features that appear in Unanimous Stories (everyone agrees) vs. Divided Stories (people disagree).

  • Unifying Features: Plot_sequence and Event_experience. When distinct events happen, everyone agrees it’s a story.
  • Contentious Features: Behavior_strategy and Concept_definition. When a text discusses generalized behaviors (“How I usually handle stress”), the crowd gets confused. Some think the description of the behavior is a story; others think it’s just an explanation.

This distinction is critical for data scientists. If your dataset is full of “habitual” descriptions labeled as stories, your model might learn to flag any verb-heavy sentence as a narrative, missing the crucial element of specificity.

Humans vs. Researchers vs. AI

Finally, the paper asks the big question: How do these crowd perceptions compare to the “experts” (researchers) and the “machines” (LLMs)?

The authors fine-tuned RoBERTa models on the different label sets (prescriptive researcher labels and descriptive crowd majority votes) and also prompted GPT-4 to act as a descriptive annotator.
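
The fine-tuning itself is a standard binary classification setup. Below is a minimal sketch using Hugging Face transformers; the model name, column names, example texts, and hyperparameters are illustrative assumptions, not the paper’s exact configuration. The same pipeline would simply be run twice, once per label source.

```python
# Minimal sketch: fine-tuning RoBERTa as a binary "story vs. not-story" classifier.
# All data, column names, and hyperparameters below are illustrative assumptions.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Hypothetical training examples; in practice, "label" comes either from the
# researchers' prescriptive codebook or from the crowd majority vote.
train_data = Dataset.from_dict({
    "text": [
        "Yesterday my dog slipped his leash and chased a coyote across the park.",
        "I walk my dog every day to stay healthy.",
    ],
    "label": [1, 0],  # 1 = story, 0 = not a story
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="storyness-roberta",
        num_train_epochs=3,
        per_device_train_batch_size=8,
    ),
    train_dataset=train_data,
)
trainer.train()
```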

1. The Strictness of the Crowd

Interestingly, the Crowd was often stricter than the Researchers. Researchers, following a technical codebook, might label a text as a story if it contains a causal sequence of events, even if it’s boring or trivial. Crowd workers, relying on their intuition, often withheld the “Story” label if the text lacked a point, a moral, or an interesting plot hook. They needed the text to be tellable.

2. The GPT-4 Factor

How did GPT-4 do? It generally aligned well with humans, but with specific blind spots.

  • Agreement: GPT-4 (gpt-4-0613) achieved a Cohen’s Kappa of ~0.6 with the Crowd Majority. This is “moderate” agreement: good, but not a drop-in replacement for human annotators. (A minimal sketch of how this agreement is computed follows the list.)
  • The “Habit” Blind Spot: Just like the divided crowd workers, GPT-4 struggled with Behavior_strategy. However, GPT-4 tended to be less likely to call a behavior a story compared to the average human. It often strictly categorized abstract plans or habits as non-stories, whereas humans might be more lenient if the habit was described vividly.
  • Llama-3: The smaller model (Llama-3 8B) performed significantly worse, largely because it had a heavy bias toward the positive label: it wanted to label almost everything as a story.
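
Measuring that agreement is straightforward. Here is a minimal sketch using scikit-learn’s cohen_kappa_score on hypothetical label lists; the real evaluation compares GPT-4’s labels against the crowd majority on the STORYPERCEPTIONS posts.

```python
# Minimal sketch: measuring LLM-vs-crowd agreement with Cohen's kappa.
# The label lists below are hypothetical, not the paper's actual annotations.
from sklearn.metrics import cohen_kappa_score

# 1 = "story", 0 = "not a story", one entry per Reddit post
crowd_majority = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
gpt4_labels    = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]

kappa = cohen_kappa_score(crowd_majority, gpt4_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # ~0.6 is conventionally read as moderate agreement
```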

Comparison of story prediction rates between RoBERTa models fine-tuned on prescriptive labels from researchers vs. descriptive crowd majority vote labels.

Figure 10 compares prediction rates across different Reddit topics (subreddits); a minimal sketch of this per-subreddit comparison follows the list below.

  • The X-axis is the prediction rate based on a model trained on Researcher labels.
  • The Y-axis is the prediction rate based on a model trained on Crowd labels.
  • There is a strong correlation, but notice the Pink dots (Tech). The Researcher-trained model finds way more stories in Tech subreddits than the Crowd-trained model does.
  • Why? Tech posts often contain “troubleshooting steps” (First I did X, then I clicked Y). To a researcher looking for causal sequences, that’s a story. To a crowd worker, that’s a manual. The crowd requires an aesthetic narrative arc that technical troubleshooting often lacks.
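
As noted above, here is a minimal sketch of the per-subreddit comparison behind Figure 10, with invented predictions and subreddit names: compute each model’s “story” prediction rate per subreddit and look at the gap between the two.

```python
# Minimal sketch: comparing per-subreddit "story" prediction rates for two
# classifiers, one trained on researcher labels and one on crowd labels.
# Subreddit names and predictions are illustrative assumptions.
import pandas as pd

preds = pd.DataFrame({
    "subreddit": ["r/techsupport", "r/techsupport", "r/relationships",
                  "r/relationships", "r/AskReddit", "r/AskReddit"],
    "researcher_model": [1, 1, 1, 0, 1, 0],  # 1 = predicted "story"
    "crowd_model":      [0, 0, 1, 0, 1, 0],
})

# Prediction rate per subreddit for each model (the two axes of Figure 10).
rates = preds.groupby("subreddit")[["researcher_model", "crowd_model"]].mean()
rates["gap"] = rates["researcher_model"] - rates["crowd_model"]
# In the paper, Tech subreddits showed the largest researcher-vs-crowd gap.
print(rates.sort_values("gap", ascending=False))
```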

Conclusion: The Future of Narrative AI

The STORYPERCEPTIONS paper teaches us that “story” is not a static property of a text. It is a relationship between a writer and a reader.

If we want to build AI that truly understands narrative—for creative writing, mental health analysis, or toxicity detection—we cannot rely solely on rigid, prescriptive definitions. We need to incorporate the messy, aesthetic judgments of real people.

Key takeaways for students and practitioners:

  1. Context Matters: A sequence of events isn’t always a story. It might be a recipe, a troubleshooting guide, or a habit.
  2. Reader Response: The “feeling” of cohesion and suspense is just as measurable and important as the presence of characters.
  3. Disagreement is Data: When annotators disagree, it often reveals a boundary condition in the concept (like the Event vs. Behavior split). Don’t just throw that data away.

As LLMs become more integrated into our lives, understanding how they perceive human concepts like storytelling becomes essential. This research pushes us toward models that don’t just read the text, but understand the human intent behind it.