Why does one story leave you in tears while another, describing a similar tragedy, leaves you feeling indifferent?
For psychologists and computer scientists alike, empathy is a fascinating, complex mechanism. It is the cornerstone of prosocial behavior—the engine that drives us to help others and build community. Traditionally, we assume empathy is triggered by content: the tragic loss, the triumphant win, or the relatable struggle. But intuitively, we know that the way a story is told—its narrative style—plays a massive role in how it lands.
Until recently, analyzing narrative style at scale was incredibly difficult. You could count words with lexica (keyword dictionaries), but how do you quantify “character vulnerability” or “plot volume” without having a human read every single text?
In a new paper titled “HEART-felt Narratives: Tracing Empathy and Narrative Style in Personal Stories with LLMs,” researchers from MIT and Carnegie Mellon University tackle this exact problem. They propose a new framework for understanding the mechanics of empathy and demonstrate how Large Language Models (LLMs) like GPT-4 can act as expert literary critics, unlocking deep insights into how we connect with one another.
The Problem with Content-Only Analysis
In the field of Natural Language Processing (NLP), understanding narratives has often been limited to “bag-of-words” approaches. Researchers might count how many negative emotion words appear in a text to predict if it’s sad. While useful, this misses the forest for the trees. A story can be packed with sad words but still feel “flat” or “distant” because of its style.
Conversely, a story might use neutral language but employ a narrative structure that pulls the reader into the protagonist’s shoes, triggering a profound emotional response. This is the domain of narrative style.

As shown in Figure 1, the researchers argue that empathy isn’t just a product of the story’s events. It is synthesized through specific narrative elements—like how a character perceives the world, the volume of the plot, and emotional shifts. To study this scientifically, the authors needed a map.
Introducing the HEART Taxonomy
The core theoretical contribution of this work is HEART (Human Empathy and Narrative Taxonomy). Drawing from narratology (the study of narrative structure) and psychology, the authors constructed a comprehensive taxonomy of stylistic features that theoretically drive empathy.
This isn’t just a random list of features; it is a structured hierarchy designed to capture the nuance of storytelling.

As visualized in Figure 2, HEART breaks narrative style down into four primary categories:
- Character Identification: This is the largest category, focusing on how the story draws the reader into the narrator’s perspective. It includes:
  - Flatness/Roundness: Is the character complex? Do they show development or vulnerability?
  - Emotional Subject: How vivid are the emotions? Is the tone optimistic or pessimistic?
  - Cognitive Subject: Does the story express the character’s internal thinking and planning?
  - Temporal References: Is the narrator looking back with nostalgia or forward with anticipation?
- Plot: This defines the structure of events.
  - Plot Volume: The frequency and significance of events (e.g., a life-altering day vs. a boring afternoon).
  - Emotion Shifts: How the emotional trajectory fluctuates (e.g., from hope to despair).
  - Resolution: Does the story offer closure?
- Point of View: The perspective from which the story is told (e.g., the use of first-person “I”).
- Setting: The vividness of the environment and context, which aids in world-building.
Can LLMs Act as Literary Critics?
Defining a taxonomy is one thing; detecting it in thousands of stories is another. The researchers wanted to know if LLMs could replace human experts in annotating these complex features.
To test this, they annotated a dataset of personal stories using both expert human raters and two LLMs: GPT-4 and Llama 3 (8B Instruct). They provided the models with a codebook—a set of instructions explaining concepts like “character vulnerability” or “plot volume”—and asked them to rate stories on a scale.
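To make that procedure concrete, below is a minimal sketch of what codebook-driven annotation could look like with the OpenAI Python client. The codebook excerpt, rating scale, and helper function are illustrative assumptions, not the authors’ exact prompts.

```python
# Minimal sketch of codebook-driven annotation with an LLM (illustrative, not the paper's exact setup).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CODEBOOK_EXCERPT = """
Character vulnerability: the degree to which the narrator exposes weakness,
uncertainty, or emotional risk. Rate from 1 (not at all) to 5 (very strongly).
"""

def rate_story(story: str, feature_definition: str = CODEBOOK_EXCERPT) -> str:
    """Ask the model to rate one HEART feature for one story and return its raw answer."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an expert annotator of narrative style."},
            {"role": "user", "content": f"{feature_definition}\n\nStory:\n{story}\n\n"
                                        "Reply with a single integer from 1 to 5."},
        ],
        temperature=0,  # deterministic ratings make agreement analysis easier
    )
    return response.choices[0].message.content.strip()

print(rate_story("I finally told my parents I was dropping out, and my hands would not stop shaking."))
```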
The results were promising, particularly for GPT-4.

As seen in Table 2, GPT-4 achieved “reasonable, human-level annotations” for many features.
- High Performance: It excelled at identifying Character Vulnerability (\(\rho = 80.15\)), Optimistic Tone, and Vivid Setting.
- Challenges: It struggled with Evaluations (understanding the narrator’s opinions/beliefs) and Cognition. The error analysis showed that GPT-4 sometimes conflates emotional reactions with cognitive processes (thinking/planning).
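Agreement numbers like the Spearman correlations above are straightforward to compute once you have paired ratings of the same stories; the toy data below is invented purely to show the calculation.

```python
# Toy example: Spearman rank correlation between human and LLM ratings of the same stories.
from scipy.stats import spearmanr

human_ratings = [5, 3, 4, 2, 5, 1, 4, 3]   # expert annotations (illustrative)
llm_ratings   = [4, 3, 5, 2, 5, 2, 4, 3]   # GPT-4 annotations (illustrative)

rho, p_value = spearmanr(human_ratings, llm_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```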
The authors also compared the LLMs against LIWC (Linguistic Inquiry and Word Count), the industry-standard tool for text analysis that relies on keyword dictionaries.

Table 3 highlights a crucial finding: LLMs are better at capturing nuance. For features like “Vivid emotions” and “Character vulnerability,” GPT-4 significantly outperformed the lexicon-based approach. While a lexicon counts “sad” words, an LLM understands that a description of “a hollow ache in my chest” represents vivid vulnerability, even if the word “vulnerable” isn’t used.
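The gap between the two approaches is easy to see in code. A crude lexicon-style counter (a stand-in for LIWC, whose real dictionaries are far richer) scores a sentence only by the keywords it contains:

```python
# A crude lexicon-style scorer: counts dictionary hits, nothing more.
SADNESS_LEXICON = {"sad", "cry", "grief", "tears", "miserable"}

def lexicon_sadness_score(text: str) -> int:
    tokens = text.lower().replace(",", " ").split()
    return sum(token in SADNESS_LEXICON for token in tokens)

print(lexicon_sadness_score("I felt a hollow ache in my chest"))              # 0 hits
print(lexicon_sadness_score("I was so sad that the tears would not stop"))    # 2 hits ("sad", "tears")
```

The “hollow ache” sentence scores zero despite being the more vulnerable of the two, which is exactly the kind of nuance the LLM annotations were able to capture.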
The Experiment: Narratives in the Wild
With the HEART taxonomy established and GPT-4 validated as a reliable annotator, the researchers launched a large-scale empirical study. They recruited 2,624 participants to read personal stories and rate their empathy.
The goal was to map the “pathways” of empathy. How do specific narrative styles interplay with the reader’s own personality to create an emotional connection?

Figure 3 illustrates the complex system they modeled. It’s not a straight line from Story \(\rightarrow\) Empathy. It involves:
- Narrative Style: The HEART features (e.g., vividness, plot).
- Reader Characteristics: The reader’s age, gender, and trait empathy (their baseline tendency to feel for others).
- Interaction Effects: Variables like Narrative Transportation (how “lost” the reader gets in the story) and Perceived Similarity (whether the reader feels similar to the narrator).
Key Findings: What Makes a Story “Work”?
The study produced several fascinating insights into human psychology and storytelling mechanics.
1. Plot and Character Drive Empathy
When the researchers aggregated empathy ratings, they found that specific stylistic choices reliably boosted engagement.

As shown in Figure 5, stories with High Character Development (where the narrator undergoes a change) and High Plot Volume (significant, impactful events) resulted in significantly higher empathy scores. This validates literary intuition: we care more about people who grow, and we are more engaged by stories where “things happen.”
2. Empathy is Not “One Size Fits All”
Perhaps the most critical finding for future AI systems is that empathy is highly subjective. You cannot simply predict a single “empathy score” for a story and expect it to apply to everyone.

Figure 6 reveals the high variance in responses. The standard deviation for empathy ratings on the same story was significantly greater than zero. This means that while a story might be a tear-jerker for one demographic, it might leave another cold. The authors found that incorporating reader demographics (age, sex, trait empathy) significantly improved the statistical models. Empathy is personalized.
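One way to picture what “incorporating reader demographics” means statistically is a mixed-effects regression with a random intercept per story. The data file and column names below are assumptions for illustration, not the paper’s exact model.

```python
# Sketch: does adding reader demographics improve prediction of per-reader empathy?
import pandas as pd
import statsmodels.formula.api as smf

# ratings.csv is assumed to hold one row per (reader, story) pair with these columns.
df = pd.read_csv("ratings.csv")  # columns: empathy, story_id, age, sex, trait_empathy

# Baseline: story-level effects only (random intercept per story).
baseline = smf.mixedlm("empathy ~ 1", df, groups=df["story_id"]).fit()

# Reader-aware model: add age, sex, and trait empathy as fixed effects.
reader_model = smf.mixedlm("empathy ~ age + C(sex) + trait_empathy",
                           df, groups=df["story_id"]).fit()

print(baseline.llf, reader_model.llf)  # a higher log-likelihood suggests the demographics help
```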
3. The Pathway: Vividness Leads to Transportation
Using Structural Equation Modeling (SEM), a statistical technique for testing hypothesized pathways among many variables at once, the authors mapped the psychological pathway that leads to empathy.

Figure 4 is the “circuit board” of narrative empathy. Here is how to read it:
- Vividness of Emotions is a major input.
- It points strongly to Narrative Transportation (\(r=0.33\)). This confirms that when a writer describes emotions vividly (using imagery, metaphor, or strong language), it helps the reader become “absorbed” or transported into the story world.
- Narrative Transportation, in turn, is the strongest driver of State Empathy.
Essentially, style (Vividness) enables the mechanism (Transportation), which produces the result (Empathy).
The model also highlights the role of the reader. Similar Experience (having been through the same thing as the narrator) and Trait Empathy (being a naturally empathetic person) are independent predictors of whether a reader will care.
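For readers who want to see what such a model looks like in practice, here is a simplified sketch using the semopy package. The two-path structure and variable names follow the description above but are assumptions, not the authors’ full specification.

```python
# Sketch of a simplified structural equation model:
#   vividness -> transportation -> state empathy,
# with trait empathy and similar experience as additional predictors of empathy.
import pandas as pd
from semopy import Model

MODEL_SPEC = """
transportation ~ vividness
state_empathy ~ transportation + trait_empathy + similar_experience
"""

df = pd.read_csv("reader_story_data.csv")  # assumed columns matching the variable names above
model = Model(MODEL_SPEC)
model.fit(df)
print(model.inspect())  # path coefficients, standard errors, p-values
```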
Conclusion and Future Implications
This research bridges the gap between literary theory and computational social science. By introducing the HEART taxonomy, the authors have provided a vocabulary for AI to understand how stories are told, not just what they are about.
The implications are broad:
- For Writers: The data confirms that focusing on character growth, vivid emotional descriptions, and significant plot movements creates a measurable increase in reader empathy.
- For AI & Mental Health: Understanding that empathy is personalized is crucial. A chatbot designed to offer support needs to adapt its narrative style to the specific user, rather than using a generic “empathetic” tone.
- For Social Science: We can now use LLMs to analyze millions of stories online to understand the pulse of human connection.
The study serves as a reminder that while AI models like GPT-4 are often viewed as cold calculation engines, they possess a surprising ability to decode the warmest and most complex of human traits: our ability to feel for one another. Through frameworks like HEART, we are beginning to understand the algorithms of the soul.