Why does one story leave you in tears while another, describing a similar tragedy, leaves you feeling indifferent?

For psychologists and computer scientists alike, empathy is a fascinating, complex mechanism. It is the cornerstone of prosocial behavior—the engine that drives us to help others and build community. Traditionally, we assume empathy is triggered by content: the tragic loss, the triumphant win, or the relatable struggle. But intuitively, we know that the way a story is told—its narrative style—plays a massive role in how it lands.

Until recently, analyzing narrative style at scale was incredibly difficult. You could count words (lexica), but how do you quantify “character vulnerability” or “plot volume” without having a human read every single text?

In a new paper titled “HEART-felt Narratives: Tracing Empathy and Narrative Style in Personal Stories with LLMs,” researchers from MIT and Carnegie Mellon University tackle this exact problem. They propose a new framework for understanding the mechanics of empathy and demonstrate how Large Language Models (LLMs) like GPT-4 can act as expert literary critics, unlocking deep insights into how we connect with one another.

The Problem with Content-Only Analysis

In the field of Natural Language Processing (NLP), understanding narratives has often been limited to “bag-of-words” approaches. Researchers might count how many negative emotion words appear in a text to predict whether it’s sad. While useful, this misses the forest for the trees. A story can be packed with sad words but feel “flat” or “distant” because of its style.

Conversely, a story might use neutral language but employ a narrative structure that pulls the reader into the protagonist’s shoes, triggering a profound emotional response. This is the domain of narrative style.

Figure 1 illustrates the concept of narrative empathy. A story about a childbirth experience is broken down into narrative elements like Flatness/Roundness and Plot volume, which point towards Narrative empathy.

As shown in Figure 1, the researchers argue that empathy isn’t just a product of the story’s events. It is synthesized through specific narrative elements—like how a character perceives the world, the volume of the plot, and emotional shifts. To study this scientifically, the authors needed a map.

Introducing the HEART Taxonomy

The core theoretical contribution of this work is HEART (Human Empathy and Narrative Taxonomy). Drawing from narratology (the study of narrative structure) and psychology, the authors constructed a comprehensive taxonomy of stylistic features that theoretically drive empathy.

This isn’t just a random list of features; it is a structured hierarchy designed to capture the nuance of storytelling.

Figure 2 displays the full HEART taxonomy tree. The central node branches into Character Identification, Plot, Point of view, and Setting, each with detailed sub-categories.

As visualized in Figure 2, HEART breaks narrative style down into four primary categories:

  1. Character Identification: This is the largest category, focusing on how the story draws the reader into the narrator’s perspective. It includes:
  • Flatness/Roundness: Is the character complex? Do they show development or vulnerability?
  • Emotional Subject: How vivid are the emotions? Is the tone optimistic or pessimistic?
  • Cognitive Subject: Does the story express the character’s internal thinking and planning?
  • Temporal References: Is the narrator looking back with nostalgia or forward with anticipation?
  2. Plot: This defines the structure of events.
  • Plot Volume: The frequency and significance of events (e.g., a life-altering day vs. a boring afternoon).
  • Emotion Shifts: How the emotional trajectory fluctuates (e.g., from hope to despair).
  • Resolution: Does the story offer closure?
  3. Point of View: The perspective from which the story is told (e.g., the use of first-person “I”).
  4. Setting: The vividness of the environment and context, which aids in world-building.
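Because HEART is a hierarchy of categories and leaf features, it maps naturally onto a nested data structure. The sketch below encodes the four categories described above as a Python dict; the subcategory names are paraphrased from this summary, not the authors’ released schema.

```python
# Minimal sketch of the HEART taxonomy as a nested structure.
# Subcategory names are paraphrased from the four categories above;
# this is illustrative, not the authors' actual schema.
HEART = {
    "Character Identification": [
        "Flatness/Roundness",
        "Emotional Subject",
        "Cognitive Subject",
        "Temporal References",
    ],
    "Plot": ["Plot Volume", "Emotion Shifts", "Resolution"],
    "Point of View": ["First-person perspective"],
    "Setting": ["Vividness of environment"],
}

def all_features(taxonomy):
    """Flatten the taxonomy into (category, feature) pairs for annotation."""
    return [(cat, feat) for cat, feats in taxonomy.items() for feat in feats]

print(len(all_features(HEART)))  # number of leaf features in this sketch
```

A flattened list like this is what an annotation pipeline would iterate over, scoring each story on each leaf feature.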

Can LLMs Act as Literary Critics?

Defining a taxonomy is one thing; detecting it in thousands of stories is another. The researchers wanted to know if LLMs could replace human experts in annotating these complex features.

To test this, they annotated a dataset of personal stories using both expert human raters and two LLMs: GPT-4 and Llama 3 (8B Instruct). They provided the models with a codebook—a set of instructions explaining concepts like “character vulnerability” or “plot volume”—and asked them to rate stories on a scale.
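Codebook-style annotation boils down to pairing a feature definition with the story and asking for a scalar rating. A hedged sketch of that prompt construction is below; the codebook text is paraphrased (not the authors’ actual instructions), and sending the prompt to GPT-4 or Llama 3 would go through whatever chat API wrapper you use.

```python
# Sketch of codebook-based annotation prompting. The codebook entries
# here are paraphrased illustrations, not the paper's actual instructions.
CODEBOOK = {
    "character_vulnerability": (
        "Rate 1-5 how much the narrator reveals weakness, "
        "uncertainty, or emotional exposure."
    ),
    "plot_volume": (
        "Rate 1-5 the frequency and significance of events in the story."
    ),
}

def build_prompt(story: str, feature: str) -> str:
    """Combine one codebook definition with the story into a rating prompt."""
    return (
        f"Instructions: {CODEBOOK[feature]}\n\n"
        f"Story:\n{story}\n\n"
        "Answer with a single integer from 1 to 5."
    )

print(build_prompt("I felt lost after the move.", "plot_volume"))
```

Constraining the answer to a single integer keeps the model’s output easy to parse and compare against human ratings on the same scale.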

The results were promising, particularly for GPT-4.

Table 2 compares the agreement scores between human annotators and the LLMs. GPT-4 shows high correlation with human ratings for features like Character Vulnerability and Optimistic Tone.

As seen in Table 2, GPT-4 achieved “reasonable, human-level annotations” for many features.

  • High Performance: It excelled at identifying Character Vulnerability (\(\rho = 80.15\)), Optimistic Tone, and Vivid Setting.
  • Challenges: It struggled with Evaluations (understanding the narrator’s opinions/beliefs) and Cognition. The error analysis showed that GPT-4 sometimes conflates emotional reactions with cognitive processes (thinking/planning).
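Agreement figures like the \(\rho\) values above come from rank correlation between the LLM’s ratings and human ratings of the same stories. A minimal, dependency-free Spearman implementation (real analyses would just use `scipy.stats.spearmanr`) looks like this:

```python
# Minimal Spearman rank correlation between two raters' scores.
# Illustrative only; in practice use scipy.stats.spearmanr.
def ranks(xs):
    """Assign average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

human = [5, 3, 4, 1, 2]  # hypothetical ratings for five stories
llm   = [4, 3, 5, 1, 2]
print(round(spearman(human, llm), 2))  # → 0.9
```

A near-perfect rank agreement like this toy 0.9 is the kind of signal that justifies treating the LLM as a stand-in annotator for that feature.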

The authors also compared the LLMs against LIWC (Linguistic Inquiry and Word Count), the industry-standard tool for text analysis that relies on keyword dictionaries.

Table 3 shows that GPT-4 generally outperforms LIWC in correlation with human ratings, particularly for Vivid emotions and Character vulnerability.

Table 3 highlights a crucial finding: LLMs are better at capturing nuance. For features like “Vivid emotions” and “Character vulnerability,” GPT-4 significantly outperformed the lexicon-based approach. While a lexicon counts “sad” words, an LLM understands that a description of “a hollow ache in my chest” represents vivid vulnerability, even if the word “vulnerable” isn’t used.

The Experiment: Narratives in the Wild

With the HEART taxonomy established and GPT-4 validated as a reliable annotator, the researchers launched a large-scale empirical study. They recruited 2,624 participants to read personal stories and rate their empathy.

The goal was to map the “pathways” of empathy. How do specific narrative styles interplay with the reader’s own personality to create an emotional connection?

Figure 3 presents a conceptual model. Narrative Style Elements and Reader Characteristics feed into Narrative-reader interaction (like transportation and similarity), which ultimately leads to Narrative empathy.

Figure 3 illustrates the complex system they modeled. It’s not a straight line from Story \(\rightarrow\) Empathy. It involves:

  1. Narrative Style: The HEART features (e.g., vividness, plot).
  2. Reader Characteristics: The reader’s age, gender, and trait empathy (their baseline tendency to feel for others).
  3. Interaction Effects: Variables like Narrative Transportation (how “lost” the reader gets in the story) and Perceived Similarity (does the reader feel like the narrator?).

Key Findings: What Makes a Story “Work”?

The study produced several fascinating insights into human psychology and storytelling mechanics.

1. Plot and Character Drive Empathy

When the researchers aggregated empathy ratings, they found that specific stylistic choices reliably boosted engagement.

Figure 5 shows box plots comparing high vs. low presence of narrative features. Stories with high Character development and high Plot volume show statistically higher average state empathy.

As shown in Figure 5, stories with High Character Development (where the narrator undergoes a change) and High Plot Volume (significant, impactful events) resulted in significantly higher empathy scores. This validates literary intuition: we care more about people who grow, and we are more engaged by stories where “things happen.”

2. Empathy is Not “One Size Fits All”

Perhaps the most critical finding for future AI systems is that empathy is highly subjective. You cannot simply predict a single “empathy score” for a story and expect it to apply to everyone.

Figure 6 is a histogram showing the standard deviation in empathy ratings for the same story. The wide distribution indicates that different readers have vastly different emotional reactions to the same text.

Figure 6 reveals the high variance in responses. The standard deviation for empathy ratings on the same story was significantly greater than zero. This means that while a story might be a tear-jerker for one demographic, it might leave another cold. The authors found that incorporating reader demographics (age, sex, trait empathy) significantly improved the statistical models. Empathy is personalized.
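The variance finding is easy to operationalize: for each story, compute the standard deviation of empathy ratings across its readers. The toy data below is invented for illustration, but it shows the contrast between a broadly moving story and a polarizing one.

```python
# Illustrative dispersion check: how much do readers disagree
# about the same story? (Ratings below are invented toy data.)
from statistics import mean, stdev

ratings_by_story = {
    "story_a": [5, 5, 4, 5, 4],  # broadly moving: low variance
    "story_b": [5, 1, 4, 2, 5],  # polarizing: high variance
}

for story, ratings in ratings_by_story.items():
    print(story, "mean:", round(mean(ratings), 2),
          "sd:", round(stdev(ratings), 2))
```

Two stories with similar mean empathy can hide very different reader experiences, which is exactly why per-reader modeling (demographics, trait empathy) improves prediction.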

3. The Pathway: Vividness Leads to Transportation

Using Structural Equation Modeling (SEM)—a statistical technique for testing hypothesized causal pathways—the authors mapped the psychological route that leads to empathy.

Figure 4 details the Structural Equation Model. Vividness of emotions strongly correlates with Narrative transportation, which is the strongest predictor of State Empathy.

Figure 4 is the “circuit board” of narrative empathy. Here is how to read it:

  • Vividness of Emotions is a major input.
  • It points strongly to Narrative Transportation (\(r=0.33\)). This confirms that when a writer describes emotions vividly (using imagery, metaphor, or strong language), it helps the reader become “absorbed” or transported into the story world.
  • Narrative Transportation, in turn, is the strongest driver of State Empathy.

Essentially, style (Vividness) enables the mechanism (Transportation), which produces the result (Empathy).

The model also highlights the role of the reader. Similar Experience (having been through the same thing as the narrator) and Trait Empathy (being a naturally empathetic person) are independent predictors of whether a reader will care.
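The mediation idea behind this pathway can be illustrated with a toy check on simulated data: if vividness works through transportation, both path slopes should be positive. This is only a sketch of the mediation concept using two least-squares slopes; the paper’s SEM is far richer (latent variables, reader covariates, fit statistics).

```python
# Toy mediation sketch: vividness -> transportation -> empathy.
# Simulated data and simple least-squares slopes; purely illustrative,
# not the authors' SEM.
import random

random.seed(0)
n = 500
vividness = [random.gauss(0, 1) for _ in range(n)]
# transportation depends on vividness plus noise
transportation = [0.5 * v + random.gauss(0, 1) for v in vividness]
# empathy depends mainly on transportation
empathy = [0.8 * t + random.gauss(0, 1) for t in transportation]

def slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

a = slope(vividness, transportation)  # path: vividness -> transportation
b = slope(transportation, empathy)    # path: transportation -> empathy
print(round(a, 2), round(b, 2))       # both recover positive paths
```

With both slopes positive, the indirect effect (their product) is positive too, which is the signature of mediation that the SEM formalizes.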

Conclusion and Future Implications

This research bridges the gap between literary theory and computational social science. By introducing the HEART taxonomy, the authors have provided a vocabulary for AI to understand how stories are told, not just what they are about.

The implications are broad:

  • For Writers: The data confirms that focusing on character growth, vivid emotional descriptions, and significant plot movements creates a measurable increase in reader empathy.
  • For AI & Mental Health: Understanding that empathy is personalized is crucial. A chatbot designed to offer support needs to adapt its narrative style to the specific user, rather than using a generic “empathetic” tone.
  • For Social Science: We can now use LLMs to analyze millions of stories online to understand the pulse of human connection.

The study serves as a reminder that while AI models like GPT-4 are often viewed as cold calculation engines, they possess a surprising ability to decode the warmest and most complex of human traits: our ability to feel for one another. Through frameworks like HEART, we are beginning to understand the algorithms of the soul.