Introduction
We are living in the golden age of automated text generation. With the rise of Large Language Models (LLMs) like GPT-4 and Claude, generating a fluent, grammatically perfect story takes seconds. Yet, if you have ever asked an AI to write a screenplay or a novel, you have likely noticed something missing. The text is readable, but the story often feels hollow: the plot wanders, the emotional stakes stay low, and the ending arrives rushed and unearned.
While humans have mastered the art of storytelling over millennia—incorporating complex structures, suspense, and emotional payoffs—machines seem to struggle with the “big picture.” A recent paper titled “Are Large Language Models Capable of Generating Human-Level Narratives?” dives deep into this discrepancy.
The researchers propose that the gap between human and machine storytelling isn’t about vocabulary or grammar; it is about discourse structure. They introduce a novel computational framework to analyze stories on three levels: the macro (story arcs), the meso (turning points), and the micro (emotional arousal). By comparing thousands of human-written movie synopses against those generated by LLMs, they uncover exactly why AI stories often fall flat—and offer a roadmap for how to fix it.
The Architecture of a Story
To understand why LLMs struggle, we first need to define what makes a story “human.” The researchers broke down narrative structure into a specific framework.
1. Macro-Level: The Story Arc
At the highest level, a story is defined by the transformation of the protagonist’s fortune. Kurt Vonnegut famously lectured on the “shapes of stories,” proposing that most narratives fit into specific graphical curves. The researchers adopted a schema of seven story arcs, ranging from the classic “Rags to Riches” to the tragic “Oedipus.”
*(Table: the seven story arc types and the fortune trajectories they trace.)*
As shown in the table above, these arcs map the protagonist’s journey on a spectrum of fortune. A “Man in a Hole” story involves a fall followed by a rise, while an “Icarus” arc is a rise followed by a sharp, tragic fall.
2. Meso-Level: Turning Points
If the arc is the shape of the story, turning points are the structural beams holding it up. The researchers identified five crucial events that dictate narrative pacing (see the sketch after this list):
- Opportunity (TP1): The introductory event starting the plot.
- Change of Plans (TP2): The goal is defined or altered.
- Point of No Return (TP3): The protagonist is fully committed; there is no going back.
- Major Setback (TP4): The “all is lost” moment.
- Climax (TP5): The peak of conflict and resolution.
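Taken together, these levels suggest a natural data model. Here is a minimal sketch in Python; the class and field names are my own illustration, not the paper's code (the valence/arousal fields anticipate the micro level described next):

```python
from dataclasses import dataclass, field
from enum import Enum

class TurningPoint(Enum):
    """The five turning points of the paper's meso-level schema."""
    OPPORTUNITY = 1         # TP1: introductory event starting the plot
    CHANGE_OF_PLANS = 2     # TP2: the goal is defined or altered
    POINT_OF_NO_RETURN = 3  # TP3: the protagonist is fully committed
    MAJOR_SETBACK = 4       # TP4: the "all is lost" moment
    CLIMAX = 5              # TP5: the peak of conflict and resolution

@dataclass
class Sentence:
    text: str
    valence: float  # micro level: how positive or negative the emotion is
    arousal: float  # micro level: how intense or suspenseful it is

@dataclass
class Story:
    arc: str  # macro level, e.g. "Man in a Hole"
    sentences: list[Sentence] = field(default_factory=list)
    # meso level: sentence index at which each turning point occurs
    turning_points: dict[TurningPoint, int] = field(default_factory=dict)
```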
3. Micro-Level: Affective Dimensions
Finally, on a sentence-by-sentence basis, stories are driven by emotion. This is measured in two dimensions: Arousal (the intensity of emotion/suspense) and Valence (the positivity or negativity of the emotion).
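The paper measures these dimensions with trained classifiers. As a rough stand-in, the sketch below scores per-sentence valence with NLTK's off-the-shelf VADER analyzer and, purely as a labeled assumption, uses the magnitude of that score as a crude arousal proxy:

```python
# pip install nltk
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def affect_curve(sentences):
    """Return a (valence, arousal) pair per sentence.

    Valence: VADER's compound score in [-1, 1].
    Arousal: |compound| here, a crude proxy only; the paper uses
    dedicated arousal classifiers, not sentiment magnitude.
    """
    curve = []
    for s in sentences:
        v = sia.polarity_scores(s)["compound"]  # valence in [-1, 1]
        curve.append((v, abs(v)))  # |v| stands in for arousal
    return curve

print(affect_curve([
    "She finally lands the job of her dreams.",
    "Then the company collapses overnight.",
]))
```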
The Comparison: Human vs. Machine
The core contribution of this paper is a rigorous comparison between human-written movie synopses and those generated by GPT-4. The researchers collected a dataset of movie plots, stripped them of identifying proper nouns, and asked GPT-4 to write stories based on the same premises.
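Stripping proper nouns is a standard named-entity pass; the paper does not detail its exact pipeline here, so the sketch below uses spaCy, with entity labels and placeholder format chosen by me:

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def anonymize(text: str) -> str:
    """Replace person, organization, and place names with placeholders."""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ in {"PERSON", "ORG", "GPE", "LOC"}:
            out.append(text[last:ent.start_char])
            out.append(f"[{ent.label_}]")
            last = ent.end_char
    out.append(text[last:])
    return "".join(out)

print(anonymize("Rick owns a cafe in Casablanca during World War II."))
```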
The results revealed a stark contrast in narrative capability.
*(Figure 1: protagonist fortune over the course of the story, human-written (orange) vs. GPT-4 (blue).)*
As illustrated in Figure 1, human stories (the orange line) are jagged and complex. They feature deep lows (setbacks) followed by sharp highs. The machine story (the blue line), by contrast, is a smooth, monotonic ride upward. It lacks tension. It is “unexcitingly positive.”
Let’s break down the specific failures the researchers discovered.
Failure 1: Poor Pacing and Rushed Endings
One of the most significant findings was that LLMs have terrible narrative pacing. In a well-structured story, the major setback and the climax arrive late, giving the narrative room to build suspense before release.
*(Violin plots: where each turning point (TP1–TP5) falls within human vs. GPT-4 stories.)*
The violin plots above show where in the story these turning points occur.
- TP1, TP2, TP3: Humans and AI place these roughly in the same spots (the bottom half of the graph).
- TP4 (Major Setback) and TP5 (Climax): Look at the blue shapes for TP4 and TP5. They are shifted significantly “earlier” (lower on the Y-axis) compared to the orange human shapes.
This indicates that LLMs tend to introduce the climax and resolution too early, rushing the ending. They fail to build adequate suspense, resolving conflicts almost as soon as they are introduced.
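The statistic behind this finding is simple: a turning point's sentence index divided by the story's length, compared across the two corpora. A sketch of the computation (the data layout is illustrative):

```python
import statistics

def normalized_positions(stories, tp):
    """Fraction of the way through the story at which `tp` occurs.

    `stories` is a list of dicts like
    {"n_sentences": 40, "turning_points": {"TP5": 22}}.
    """
    return [s["turning_points"][tp] / s["n_sentences"] for s in stories]

human = [{"n_sentences": 40, "turning_points": {"TP5": 36}}]
gpt4  = [{"n_sentences": 40, "turning_points": {"TP5": 24}}]

# A lower mean for the model indicates a climax that arrives too early.
print(statistics.mean(normalized_positions(human, "TP5")))  # 0.9
print(statistics.mean(normalized_positions(gpt4, "TP5")))   # 0.6
```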
Failure 2: The “Toxic Positivity” Bias
A good story needs conflict. It needs moments where the audience genuinely fears for the protagonist. LLMs, perhaps due to the Reinforcement Learning from Human Feedback (RLHF) processes designed to make them “helpful and harmless,” seem allergic to negative outcomes.
*(Figure 3: arousal across the narrative, human (orange) vs. GPT-4 (blue).)*
Figure 3 tracks Arousal (suspense/intensity). Notice how human stories (orange) maintain high tension through the second half of the narrative. The AI (blue) drops off significantly. The machine simply cannot sustain the tension required for a compelling third act.
*(Figure 5: valence across the narrative, human (orange) vs. GPT-4 (blue).)*
This issue is even more visible in Figure 5, which tracks Valence (positivity). Human stories dip into negative emotions (setbacks, tragedies, fears) deeply and frequently. The AI stories hover in a zone of persistent positivity. The researchers note that while human stories are diverse in their emotional trajectory, LLM stories are homogeneously happy.
Failure 3: Lack of Structural Diversity
When you ask an AI to write a story, it defaults to specific “safe” structures. The researchers analyzed the distribution of Story Arc types across the two datasets.
*(Distribution of story arc types in human-written vs. GPT-4 synopses.)*
The difference is striking.
- Humans: A wide distribution. While “Man in a Hole” (30%) is popular, humans also write “Riches to Rags” (14.6%) and “Oedipus” (9.3%) tragedies.
- GPT-4: It relies heavily on “Man in a Hole” (51.3%) and almost never writes a tragedy. The “Oedipus” arc appears in only 1.3% of AI stories.
The AI has a distinct bias toward positive outcomes and redemption arcs, effectively ignoring whole genres of storytelling that involve tragedy or unmitigated failure.
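To make the arc labels concrete, one can reduce a protagonist's fortune curve to its sequence of rises and falls and map that pattern to an arc name. The toy heuristic below covers the six classic Vonnegut shapes; the paper classifies arcs with models and human annotation, not this shortcut:

```python
def rise_fall_pattern(curve, tol=0.05):
    """Reduce a fortune curve to its sequence of rises (+) and falls (-)."""
    pattern = []
    for a, b in zip(curve, curve[1:]):
        step = "+" if b - a > tol else "-" if a - b > tol else None
        if step and (not pattern or pattern[-1] != step):
            pattern.append(step)
    return "".join(pattern)

ARCS = {
    "+":   "Rags to Riches",
    "-":   "Riches to Rags",
    "-+":  "Man in a Hole",
    "+-":  "Icarus",
    "+-+": "Cinderella",
    "-+-": "Oedipus",
}

curve = [0.0, -0.6, -0.7, 0.2, 0.8]  # a fall followed by a rise
print(ARCS.get(rise_fall_pattern(curve), "Other"))  # Man in a Hole
```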
Can LLMs Even Understand Narrative Structure?
Given these generation failures, the researchers asked a follow-up question: Is this a writing problem or a reading problem? Do LLMs generate bad structures because they don’t understand concepts like “Climax” or “Rags to Riches”?
They benchmarked several models (GPT-4, Gemini, Claude, Llama 3) on their ability to identify these structures in existing texts.
*(Figure 6: model accuracy at identifying narrative structure, with and without discourse hints, vs. human performance.)*
The results (Figure 6) show that models generally perform poorly at identifying Story Arcs compared to humans (the light blue bars vs. the dashed line). However, an interesting phenomenon emerged: Discourse Interdependence.
When the models were given “hints” (e.g., providing the specific Turning Points to help identify the Story Arc), their performance improved drastically (the dark blue bars). This suggests that while LLMs struggle to reason about narrative structure in a vacuum, they can leverage structural relationships if explicitly guided.
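That interdependence maps directly onto prompt design: supply the turning points as context when asking for the arc. A sketch with the OpenAI Python client follows; the model name, arc list, and prompt wording are illustrative, not the paper's exact setup:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify_arc(synopsis, turning_points=None):
    """Ask a model to label the story arc, optionally with TP hints."""
    prompt = (
        "Classify this story's arc as one of: Rags to Riches, "
        "Riches to Rags, Man in a Hole, Icarus, Cinderella, Oedipus.\n\n"
        f"Story: {synopsis}\n"
    )
    if turning_points:  # the "hint" condition that boosted accuracy
        hints = "; ".join(f"{k}: {v}" for k, v in turning_points.items())
        prompt += f"\nTurning points, as hints: {hints}\n"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```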
Engineering Better Stories
The final and perhaps most exciting part of the paper asks: Can we use this framework to make AI writers better?
The authors hypothesized that if they explicitly prompted the LLM with discourse constraints—forcing it to plan the story arc or define turning points before writing—the output would improve.
They tested three methods (sketched in code below):
- Outline-Only: Standard prompting.
- + Self-Generated TPs: Asking the model to plan its own turning points first.
- + Human TPs: Forcing the model to use human-created turning points (specifically the Major Setback and Climax).
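Concretely, the three conditions might be assembled like this; the templates are my paraphrase of the setup described above, not the paper's exact prompts:

```python
def build_prompt(premise, method, human_tps=None):
    """Assemble a story-writing prompt under one of the three conditions."""
    base = f"Write a short story based on this premise:\n{premise}\n"
    if method == "outline_only":  # standard prompting
        return base
    if method == "self_generated_tps":  # the model plans its own structure
        return base + (
            "First, list the five turning points (Opportunity, Change of "
            "Plans, Point of No Return, Major Setback, Climax). "
            "Then write the story so that it follows your plan."
        )
    if method == "human_tps":  # structure imposed from human annotations
        return base + (
            f"The Major Setback must be: {human_tps['TP4']}\n"
            f"The Climax must be: {human_tps['TP5']}\n"
            "Write the story so that it builds toward these events."
        )
    raise ValueError(f"unknown method: {method}")
```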
They then asked human annotators to read the stories and judge them on suspense, emotion, and diversity.
*(Human evaluation scores and sample qualitative feedback for the three prompting methods.)*
The results were statistically significant. Explicitly integrating turning point awareness into the prompt improved narrative suspense and engagement by over 40%.
As seen in the qualitative feedback above, human readers noticed the difference immediately. One annotator noted that the standard AI stories were “very straightforward and extremely positive,” while the structurally enhanced versions included “unexpected twists” and felt “more real and authentic.”
However, there was a catch. While using human turning points (+ Human TP) created the highest emotional provocation, it sometimes disrupted the logical flow because the AI struggled to connect the dots between its own setup and the human’s forced climax. The most balanced approach was + Self-Generated TPs, where the AI was forced to plan its own structure explicitly before writing.
Conclusion
This research highlights a critical limitation in current Generative AI: Fluency does not equal storytelling. An LLM can write perfect sentences while telling a structurally broken story.
The models exhibit a clear bias toward positivity, repetitive structures, and rushed pacing. They avoid the narrative valleys—the tragedies and major setbacks—that give human stories their resonance.
However, the paper also offers a solution. By moving beyond simple prompts and treating storytelling as a structured discourse task—focusing on Arcs and Turning Points—we can unlock significantly better performance. For students and researchers in NLP, this underscores that the future of creative AI isn’t just about bigger models, but about better structural understanding of what makes a story human.