Introduction

“All stories teach.” This quote by literary critic Wayne Booth encapsulates a fundamental truth about human communication. Whether it’s a bedtime fairy tale, a gripping novel, or a snippet of news on social media, narratives rarely exist solely to entertain. They are vehicles for values, encoding lessons about how the world works, how we should behave, and what we should believe.

For computer scientists and digital humanists, understanding the “plot” of a story (the sequence of events) has become increasingly tractable with modern Natural Language Processing (NLP). We can automatically summarize what happened: “The King died, and then the Queen died of grief.”

But understanding why a story was told—surfacing the underlying moral or message—remains a significant challenge. A moral isn’t explicitly written in the text; it is a high-level abstraction derived from the events. It requires reading between the lines.

In the research paper “Story Morals: Surfacing value-driven narrative schemas using large language models,” researchers from McGill University propose a novel framework for automating this complex task. They demonstrate that Large Language Models (LLMs), specifically GPT-4, can be coaxed into extracting value-driven lessons from diverse narratives. By doing so, they open a new window into large-scale cultural analysis, allowing us to map the “moral landscape” of human storytelling across different cultures and genres.

The Problem: From Plot to Purpose

To understand why this research is significant, we must distinguish between narrative content and narrative schema.

  • Content asks: “What happened?” (e.g., A tortoise raced a hare. The hare slept. The tortoise won.)
  • Schema (specifically Story Morals) asks: “What lesson is this story teaching?” (e.g., Persistence outweighs natural talent, or “Slow and steady wins the race.”)

Traditional computational methods struggle with the latter because morals are subjective and interpretative. Different readers might extract slightly different lessons from the same text, yet those lessons usually share a semantic core.

The researchers argue that extracting these morals is crucial for Computational Narrative Understanding. If we can automate moral extraction, we can analyze thousands of stories to see what values different cultures prioritize, how narrative lessons shift over time, and how different genres (like news vs. fiction) encode their messages.

The Method: Prompting for Wisdom

The core contribution of this paper is a methodology for using LLMs to extract these morals. The researchers did not simply ask the model, “What is the moral?” and call it a day. Instead, they recognized that deriving a lesson requires a cognitive process of synthesis.

They developed a multi-step prompting sequence designed to mimic a human’s “chain-of-thought” reasoning. To extract a high-quality moral, the model is first asked to identify the building blocks of the narrative.

Table 2: Prompts used for narrative comprehension and story moral labelling.

As shown in Table 2 above, the pipeline functions as follows:

  1. Summary: First, the model summarizes the story. This ensures the LLM holds the key events in its “context” before making higher-level judgments.
  2. Agents: The model identifies the protagonist and antagonist, and classifies the protagonist’s role (hero, villain, or victim). This focuses attention on the characters whose actions drive the moral.
  3. Topic: The model identifies the central issue (e.g., “betrayal,” “ambition”).
  4. The Moral: Finally, the model is asked for the moral in three different formats:
  • Free-text Moral: A single sentence summarizing the lesson.
  • Positive Moral (Moral+): A keyword phrase ending in “…is good behavior.”
  • Negative Moral (Moral-): A keyword phrase ending in “…is bad behavior.”

This structured approach allows the model to “reason” through the text, leading to more robust and context-aware outputs.
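
To make the pipeline concrete, here is a minimal sketch of how such a sequential prompting chain might be implemented, assuming the `openai` Python client. The helper names (`ask`, `extract_moral`) are hypothetical, and the prompt wordings are illustrative paraphrases of the paper’s Table 2, not the authors’ exact text.

```python
# A minimal sketch of the sequential prompting pipeline
# (summary -> agents -> topic -> moral), assuming the `openai` client.
# Prompt wordings are illustrative paraphrases, not the paper's text.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(messages: list[dict]) -> str:
    """Send the running conversation to the model and return its reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content

def extract_moral(story: str) -> dict:
    # The conversation accumulates, so each step conditions on the
    # previous answers (summary, agents, topic) -- a chain-of-thought
    # decomposition rather than a single "what is the moral?" query.
    messages = [{"role": "user", "content": f"Read this story:\n\n{story}"}]
    results = {}
    steps = {
        "summary": "Summarize the story in a few sentences.",
        "agents": "Who is the protagonist and who is the antagonist? "
                  "Is the protagonist a hero, villain, or victim?",
        "topic": "What is the central topic of the story, in a few words?",
        "moral": "What is the moral of the story, in one sentence?",
        "moral_pos": "Give a keyword phrase ending in '...is good behavior.'",
        "moral_neg": "Give a keyword phrase ending in '...is bad behavior.'",
    }
    for key, question in steps.items():
        messages.append({"role": "user", "content": question})
        answer = ask(messages)
        messages.append({"role": "assistant", "content": answer})
        results[key] = answer
    return results
```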

Validation: Do LLMs Understand Morals?

A major hurdle in this field is that there is no single “ground truth.” If you ask five people for the moral of The Great Gatsby, you might get five different answers. However, those answers will likely be relevant and semantically related.

To validate their method, the authors constructed a diverse dataset of 144 narratives, ranging from folktales and book summaries to Reddit personal stories and political news. They then compared the LLM’s outputs against human annotations.

The Subjectivity of Morals

The table below illustrates the complexity of the task. It shows human-generated morals for a folktale and a news article.

Table 1: Sample human morals from a folktale (left) and a news article (right). We provide a brief summary of the article here, while the full story is used in our prompting scenario. Morals in bold are more similar; however, all morals can be seen as correct. Note that the goal of the training here was not to achieve agreement among the annotators (as differing interpretations were desirable) but to ensure consistent understanding of the definitions.

In Table 1, notice how the morals for “The Lost Camel” vary. One person writes “Good benevolent leadership pays off,” while another writes “Intelligence will be rewarded.” These are different concepts, yet both are valid interpretations of the same text. The goal of the AI is not to match one human perfectly, but to produce a moral that fits within this distribution of valid human interpretations.

Automated and Human Evaluation

The researchers used automated metrics (like BERTScore and GloVe embeddings) to measure the semantic similarity between human-written morals and GPT-4’s morals.

Table 3: Median ROUGE and similarity scores (out of 100) of pairwise morals between the different groups of annotators in the validation dataset. P-values for a Mann-Whitney U-test (rank-sum test) were all less than \(10^{-5}\) under a null hypothesis that the human-human and human-GPT distributions were the same.

Table 3 reveals a fascinating insight: The similarity between Human and GPT responses (Center column) is often higher than the similarity between different Humans (Left column).

What does this mean? It suggests that GPT-4 is acting as a sort of “average reader.” Its interpretations land centrally within the cloud of human meaning, rather than being an outlier.
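
To illustrate the comparison, here is a rough sketch of computing pairwise moral similarities and running the Mann-Whitney U-test reported in Table 3. It uses a `sentence-transformers` embedding model as a stand-in for the paper’s BERTScore and GloVe metrics, and the example morals below are illustrative, not drawn from the dataset.

```python
# A sketch of the evaluation logic: compare human-human moral
# similarities against human-GPT similarities, then test whether
# the two distributions differ. The embedding model is a stand-in
# for the paper's BERTScore/GloVe metrics.
from itertools import combinations
from statistics import median

from scipy.stats import mannwhitneyu
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a: str, b: str) -> float:
    """Cosine similarity between the embeddings of two morals."""
    emb_a, emb_b = model.encode([a, b], convert_to_tensor=True)
    return util.cos_sim(emb_a, emb_b).item()

human_morals = ["Intelligence will be rewarded.",
                "Good benevolent leadership pays off.",
                "Observation and honesty solve problems."]
gpt_morals = ["Careful observation and fairness are ultimately rewarded."]

hh = [cosine(a, b) for a, b in combinations(human_morals, 2)]  # human-human
hg = [cosine(h, g) for h in human_morals for g in gpt_morals]  # human-GPT

# Null hypothesis: the two similarity distributions are the same.
_, p = mannwhitneyu(hh, hg)
print(f"median human-human={median(hh):.3f}, "
      f"median human-GPT={median(hg):.3f}, p={p:.3g}")
```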

The Crowd Prefers the Machine

To push the validation further, the researchers conducted a “Turing test” of sorts using Amazon Mechanical Turk. Crowd workers were shown a story and three potential morals (two written by humans, one by GPT-4) and asked to vote for the “Most Applicable” one. They were blind to which one was generated by AI.

Table 10: Percent of passages by genre where the GPT response was selected by a majority of AMT workers.

The results in Table 10 are striking. Across almost every genre—whether Folktales, News, or Book summaries—the GPT-generated moral was selected as the most applicable answer the majority of the time.

Qualitative analysis suggested that humans preferred the AI morals because they were often more articulate and explicit. While human annotators sometimes wrote short, imperative commands (“Stay true to convictions”), the LLM tended to explain cause and effect (“The pursuit of vengeance can lead to complex alliances…”), which readers found more comprehensive.

Application: The Moral Map of the World

Having validated that the tool works, the authors applied it to a massive dataset: 1,760 folktales from 54 distinct cultures.

This is where the power of “Story Morals” becomes evident. In traditional text analysis, if you cluster stories based on their raw text, the computer groups them by vocabulary. Stories about “wolves” go in one pile; stories about “kings” go in another.

But a story about a wolf and a story about a king might teach the exact same lesson (e.g., “Don’t be greedy”). By clustering the morals extracted by the LLM, the researchers could map stories based on their values rather than their surface features.
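
A minimal sketch of this contrast, assuming `sentence-transformers` and scikit-learn (the paper’s actual clustering pipeline may differ in its details):

```python
# A sketch of the clustering idea: embed either the full stories or
# the extracted morals, then cluster the embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("all-MiniLM-L6-v2")

def cluster_texts(texts: list[str], n_clusters: int) -> list[int]:
    """Embed the texts and assign each one a cluster label."""
    embeddings = model.encode(texts)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)
    return labels.tolist()

# Clustering the morals groups a wolf fable and a king fable together
# if both warn against greed; clustering the raw text would separate
# them by vocabulary.
morals = ["Greed leads to ruin.",
          "Don't be greedy, or you will lose everything.",
          "True love overcomes all obstacles."]
print(cluster_texts(morals, n_clusters=2))  # e.g. [0, 0, 1]
```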

Surface Text vs. Deep Values

The table below compares the clusters generated by analyzing full story text versus analyzing the extracted morals.

Table 5: Top 10 largest clusters of the folktales using the embeddings of the full sentence morals compared to the full stories. The given words are the top 3 most representative words for each cluster as measured by c-TF-IDF. Only words longer than three letters are included.

Table 5 shows a stark difference:

  • Full Text (Right Column): The clusters are based on nouns and characters: “daughter, king, princess,” “boy, farmer, old,” “hen, woman, little.” This tells us who is in the stories.
  • Full Sentence Moral (Left Column): The clusters are based on abstract concepts: “challenges, ingenuity, overcome,” “love, obstacles, true,” “cunning, deceit, trickery.” This tells us what the stories mean.

This method allows researchers to find “Thematic Neighbors”—stories from completely different parts of the world that share the same soul.
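
The representative words in Table 5 come from c-TF-IDF, a class-based variant of TF-IDF popularized by the BERTopic library. Below is a small sketch of that computation, using made-up cluster texts; the paper’s exact variant may differ.

```python
# A sketch of c-TF-IDF keyword extraction: treat each cluster's
# concatenated documents as one "class document", compute per-cluster
# term frequencies, and weight by a class-based IDF.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def c_tf_idf(docs_per_cluster: list[str], top_k: int = 3):
    """docs_per_cluster: one string per cluster (its documents joined)."""
    # Only keep words longer than three letters, as in the paper's tables.
    vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w{4,}\b",
                                 stop_words="english")
    counts = vectorizer.fit_transform(docs_per_cluster).toarray()
    tf = counts / counts.sum(axis=1, keepdims=True)    # term freq per cluster
    avg_words = counts.sum() / len(docs_per_cluster)   # mean words per cluster
    idf = np.log(1 + avg_words / counts.sum(axis=0))   # class-based IDF
    scores = tf * idf
    words = vectorizer.get_feature_names_out()
    return [[words[i] for i in np.argsort(row)[::-1][:top_k]]
            for row in scores]

clusters = ["greed ruin greedy lose gold greed punished",
            "love obstacles true love overcomes devotion"]
print(c_tf_idf(clusters))  # e.g. [['greed', 'ruin', ...], ['love', ...]]
```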

Visualizing the Moral Landscape

The researchers projected these moral clusters into a 2D map to visualize the distribution of lessons in the folktale dataset.

Figure 1: 2D representation of the cluster centroids for the full-sentence morals when reduced using UMAP. Each circle corresponds to a cluster. Only a few of the largest clusters in each island are colored for readability. The circle sizes are related to the sizes of the clusters.

In Figure 1, each circle represents a cluster of stories sharing a specific moral theme.

  • Red: Themes of overcoming challenges through ingenuity.
  • Blue: Themes of love overcoming obstacles.
  • Pink: Themes of justice prevailing.

This visualization suggests that specific moral archetypes are pervasive. While the characters change (a shark in Oceania, a wolf in Europe), the underlying structure of the lesson—the “moral”—remains remarkably consistent across human cultures.
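
A sketch of how such a map can be produced, assuming the `umap-learn` and `matplotlib` packages; the centroid and size values here are placeholders, not the paper’s data.

```python
# A sketch of producing the 2D map: reduce high-dimensional cluster
# centroids with UMAP and plot circles sized by cluster membership.
import matplotlib.pyplot as plt
import numpy as np
import umap

rng = np.random.default_rng(0)
centroids = rng.normal(size=(60, 384))  # one embedding centroid per cluster
sizes = rng.integers(5, 120, size=60)   # number of stories per cluster

# Project the centroids down to two dimensions.
coords = umap.UMAP(n_components=2, random_state=0).fit_transform(centroids)

plt.scatter(coords[:, 0], coords[:, 1], s=sizes)  # circle size ~ cluster size
plt.title("Moral clusters (UMAP projection)")
plt.show()
```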

Nuance in Positivity and Negativity

The researchers also analyzed the “Positive” and “Negative” moral keywords separately. This helped disentangle complex lessons.

Figure 3: 2D representation of the cluster centroids for the positive and negative morals when reduced using UMAP. Each circle corresponds to a cluster. Only the largest cluster in each island is colored for better readability. The circle sizes are related to the sizes of the clusters. The legend names give the cluster index followed by the top 3 most representative words for each cluster as determined by c-TF-IDF.

Figure 3 displays these separated value clusters.

  • Positive (a): Focuses on virtues like Perseverance, Generosity, Cleverness, and Wisdom.
  • Negative (b): Focuses on vices like Deception, Greed, Pride, and Disobedience.

This separation allows for granular cultural analysis. For example, the researchers noted statistically significant regional differences. Folktales from North America in their dataset had a higher-than-chance frequency of morals regarding contentment and appreciation. Tales from Africa showed a higher frequency of morals regarding cleverness.
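
One way to check such “higher-than-chance” claims is an over-representation test on a 2x2 contingency table. The sketch below uses Fisher’s exact test from `scipy` with made-up counts; the paper’s exact statistical procedure may differ.

```python
# A sketch of testing whether a region is over-represented in a moral
# cluster. All counts below are illustrative placeholders.
from scipy.stats import fisher_exact

region_in_cluster = 18  # e.g. North American tales about contentment
region_total = 120      # all North American tales
all_in_cluster = 60     # all tales in the contentment cluster
all_total = 1760        # full folktale dataset

# Rows: region vs. rest of dataset; columns: in cluster vs. not.
table = [[region_in_cluster, region_total - region_in_cluster],
         [all_in_cluster - region_in_cluster,
          (all_total - region_total) - (all_in_cluster - region_in_cluster)]]
odds, p = fisher_exact(table, alternative="greater")
print(f"odds ratio {odds:.2f}, p = {p:.4f}")
```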

While the authors are careful to note that their dataset might not be perfectly representative of all cultures (it relies on English translations), the method shows that we can now quantitatively measure these cultural “vibes.”

A Closer Look at the Clusters

Finally, we can look at the specific content of these clusters to appreciate the depth of the AI’s understanding.

Table 16: All clusters for the full-sentence morals

Table 16 provides a detailed list of the moral themes found. We see clusters as specific as:

  • Cluster 14: “Breaking a promise, especially one made to someone you love, can lead to irreversible loss and regret.”
  • Cluster 33: “Be careful what you wish for, as the pursuit of desire can lead to unintended consequences.”
  • Cluster 52: “True leadership comes from humility and service to a higher cause.”

This level of semantic grouping is virtually impossible to achieve with standard keyword searching. It requires the LLM to “understand” the story’s intent.

Conclusion and Implications

The paper “Story Morals” demonstrates a significant leap forward in Digital Humanities and NLP. By shifting focus from “what happened” to “why it was told,” the authors have provided a framework for surfacing the value systems encoded in our narratives.

Key Takeaways

  1. LLMs are effective moralists: GPT-4 can extract story morals that are not only consistent with human interpretations but often preferred by human readers for their clarity.
  2. Chain-of-Thought is key: A structured prompting sequence (Summary -> Agents -> Topic -> Moral) helps the model synthesize the narrative effectively.
  3. Values over Vocabulary: Clustering stories by their morals reveals cross-cultural connections that are invisible when clustering by raw text.

Future Directions & Limitations

The authors candidly acknowledge limitations. Their analysis relied on GPT-4, which may harbor Western cultural biases from its training data. Additionally, using English translations of folktales filters the original cultural nuances.

However, the implications of this work extend far beyond folktales. This same framework could be applied to:

  • News Analysis: Understanding how different media outlets frame the “moral” of a political event.
  • Social Media: Analyzing personal narratives on Reddit or Twitter to understand shifting societal values.
  • Screenwriting: Helping writers analyze the thematic consistency of their scripts.

As we continue to integrate AI into the study of human culture, tools like this offer a way to read the collective “mind” of humanity, one story at a time.