Introduction

We have reached a point in the evolution of Artificial Intelligence where machines can generate text that is grammatically perfect, stylistically consistent, and undeniably coherent. If you ask GPT-4 to write a sonnet about a toaster, it will do so with impressive rhyme and meter. But there is a frontier that has remained elusive, a quality that separates a technical manual from a heartbreaking novel: Psychological Depth.

Evaluations of Large Language Models (LLMs) have traditionally focused on objective metrics. We measure perplexity, distinct-n (n-gram diversity), and discourse coherence. We check for toxicity and bias. While indispensable, these metrics treat text as data. They do not account for the reader. They cannot tell us if a story evokes empathy, if it makes your heart race, or if the characters feel like genuine human beings rather than cardboard cutouts.
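To make this concrete, here is a minimal sketch of the distinct-n statistic mentioned above (the function is my own illustration, not from the paper): it counts unique n-grams and nothing else, so it is completely blind to whether a story moves anyone.

```python
def distinct_n(text: str, n: int = 2) -> float:
    """Fraction of n-grams in `text` that are unique (distinct-n).

    A purely surface-level diversity metric: it says nothing about
    empathy, engagement, or any other reader-centric quality.
    """
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# Two stories with the same distinct-2 score can differ wildly in how
# deeply they affect a reader; the metric cannot tell them apart.
print(distinct_n("the toaster dreamed of warmth and the sea"))  # 1.0
```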

In the research paper “Measuring Psychological Depth in Language Models,” researchers from UCLA introduce a groundbreaking framework designed to bridge this gap. They propose the Psychological Depth Scale (PDS), a method rooted in literary theory rather than pure statistics, to quantify an LLM’s ability to produce narratively complex and emotionally resonant stories.

This post will take you through their journey of defining “depth,” the novel prompting strategies they used to turn LLMs into creative writers, and the surprising results that suggest AI might be closer to the “human touch” than we previously thought.

In this GPT-4 story, the psychological depth scale highlights strengths and weaknesses contributing to the overall reader experience.

The Problem: When “Good” Text Isn’t Enough

Imagine reading a story that is perfectly spelled and has no grammatical errors, yet leaves you feeling absolutely nothing. In the world of Natural Language Generation (NLG), this is a common issue. A model might maximize a “BLEU” score (a metric for text similarity) but fail to capture the nuances of human experience.

The authors of this paper argue that current evaluations are too focused on the text itself. To truly measure storytelling quality, we must shift our focus to the reader. This approach draws from Reader-Response Criticism, a literary theory that argues a story’s meaning isn’t contained entirely in the words on the page, but is created through the interaction between the text and the reader’s mind.

Complementing this is Text World Theory, which suggests that when we read, we construct a “text world”—a complex mental model of the characters, settings, and emotions. A “deep” story is one that allows the reader to construct a rich, vivid text world.

The challenge, therefore, was to translate these abstract literary concepts into a concrete, measurable framework for Computer Science.

The Solution: The Psychological Depth Scale (PDS)

To build the PDS, the researchers didn’t just guess what makes a story good. They conducted an extensive literature review of 95 peer-reviewed articles across cognitive psychology, media studies, and narrative analysis. They distilled 143 different evaluation criteria down to five core dimensions that define psychological depth.

Overview of our approach to developing and validating the Psychological Depth Scale.

As shown in the overview above, the process moves from theory to generation, and finally to annotation. Let’s break down the five pillars of the PDS:

1. Empathy (EMP)

This measures the narrative’s ability to make the reader step out of their own shoes and into the characters’. It’s not just about liking a character; it’s about perspective-taking. Does the story trigger the cognitive shift required to feel what the character is feeling? High empathy scores indicate that the reader feels a “deep resonance” with the character’s plight.

2. Emotion Provocation (PROV)

While empathy is about sharing a character’s feelings, emotion provocation is about the intensity of the reader’s own emotional response. Does the story elicit joy, fear, sadness, or anger? Importantly, the research notes that “congruent” emotions (where the text’s tone matches the reader’s reaction) are more cognitively effective. A deep story doesn’t just mention sad things; it makes you feel sad.

3. Engagement (ENG)

This metric assesses “transportation.” Did you lose track of time? Did the world around you fade away while you were reading? Engagement is the engine of storytelling; without it, the reader abandons the text world. It is reciprocal: high engagement often leads to higher empathy and emotional response.

4. Authenticity (AUTH)

This is perhaps the most challenging metric for an AI. Authenticity captures whether the narrative expressions feel like genuine human experiences. Even in a sci-fi story about aliens, the emotions must feel real. It relies on the concept of “Einfühlung” (feeling one’s way in). A high score here means the story captures the essence of existence in a way that resonates with the reader’s understanding of reality.

5. Narrative Complexity (NCOM)

This does not mean using big words or confusing plotlines. It refers to the richness of the storyline and character development. Are the characters multifaceted? Does the plot invite the reader to solve a puzzle or restructure their understanding of events (like a plot twist)? Complexity requires the reader to exert cognitive effort, which, paradoxically, makes the reading experience more enjoyable.
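To make the five pillars concrete, here is one way a single rater's judgment could be recorded in code. This is a hypothetical data structure of my own, not the paper's implementation, and the 1-5 Likert range is an assumption.

```python
from dataclasses import dataclass, asdict

# Hypothetical record of one rater's judgment of one story.
# The five fields mirror the PDS components; a 1-5 Likert range is assumed.
@dataclass
class PDSRating:
    empathy: int               # EMP
    emotion_provocation: int   # PROV
    engagement: int            # ENG
    authenticity: int          # AUTH
    narrative_complexity: int  # NCOM

    def __post_init__(self):
        for name, score in asdict(self).items():
            if not 1 <= score <= 5:
                raise ValueError(f"{name} must be on a 1-5 scale, got {score}")

rating = PDSRating(empathy=4, emotion_provocation=3, engagement=5,
                   authenticity=4, narrative_complexity=2)
```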

Generating Deep Stories: Prompt Engineering

Defining the scale is only half the battle. To test it, the researchers needed stories. They sourced human-written stories from Reddit’s r/WritingPrompts community, categorizing them by upvote count into Novice, Intermediate, and Advanced tiers.

For the AI stories, they didn’t just ask ChatGPT to “write a story.” They recognized that standard prompting often leads to generic, flat outputs. To give the models a fighting chance at psychological depth, they developed two specific prompting strategies.

Strategy A: The Writer Profile (WP)

This technique leverages in-context impersonation. Previous research has shown that LLMs perform better when told they are experts. The researchers crafted a prompt that framed the AI as a “seasoned writer known for psychologically deep, engaging stories.”

Illustration of WRITERPROFILE’s template.

By establishing this persona before delivering the prompt premise, the model is “primed” to access the latent space associated with high-quality literature, focusing on human psyche and emotional landscapes rather than just plot completion.
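Here is a minimal sketch of what such persona priming might look like in practice. The wording below is illustrative rather than the paper's exact WRITERPROFILE template, and chat() is a stand-in for whatever chat-completion client you use.

```python
# Illustrative Writer Profile (WP) prompting sketch -- not the paper's exact
# template. `chat(messages)` is a placeholder for any chat-completion client.
def writer_profile_messages(premise: str) -> list[dict]:
    persona = (
        "You are a seasoned writer known for psychologically deep, engaging "
        "stories. You focus on the inner lives of your characters: their "
        "emotions, contradictions, and quiet fears."
    )
    task = (
        f"Write a short story based on the following premise:\n{premise}\n"
        "Prioritize emotional resonance and character interiority over plot mechanics."
    )
    return [
        {"role": "system", "content": persona},
        {"role": "user", "content": task},
    ]

# story = chat(writer_profile_messages("A lighthouse keeper receives a letter from her younger self."))
```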

Strategy B: Plan + Write (P+W)

Writing a coherent, deep story in one shot is hard for LLMs because they predict one token at a time. They can’t “look ahead” to the ending. The Plan + Write strategy mitigates this by breaking the process into two phases.

Illustration of PLAN + WRITE’s workflow.

  1. Phase 1 (Character Portraits): The model is asked to generate detailed profiles for the characters, focusing specifically on their emotional states and inner thoughts.
  2. Phase 2 (Story Composition): The model then writes the story using the premise and the character profiles it just created.

This method effectively gives the model a “memory” of the characters’ internal motivations, leading to more consistent and psychologically grounded behavior throughout the narrative.
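The two phases could be chained roughly as follows. Again, this is a hedged sketch rather than the paper's exact PLAN + WRITE prompts; chat(messages) is assumed to be a generic chat-completion helper that returns the model's reply as a string.

```python
# Illustrative Plan + Write (P+W) sketch. `chat(messages) -> str` is assumed
# to be a generic chat-completion helper, not a specific library call.
def plan_and_write(premise: str, chat) -> str:
    # Phase 1: character portraits focused on emotional states and inner thoughts.
    plan_msgs = [{
        "role": "user",
        "content": (
            f"Premise: {premise}\n"
            "Before writing, create detailed character portraits. For each "
            "character, describe their emotional state, inner thoughts, and motivations."
        ),
    }]
    portraits = chat(plan_msgs)

    # Phase 2: compose the story conditioned on the premise AND the portraits,
    # giving the model a persistent 'memory' of each character's inner life.
    write_msgs = plan_msgs + [
        {"role": "assistant", "content": portraits},
        {"role": "user", "content": "Now write the full story, staying consistent with these portraits."},
    ]
    return chat(write_msgs)
```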

The Experiments: Human vs. Machine

With the PDS defined and the stories generated (using models like GPT-4, Llama-2, and Vicuna), the researchers launched a comprehensive human study. They recruited undergraduate students with backgrounds in English and Psychology—“informed laypeople”—to rate the stories.

They sought to answer three critical questions.

RQ1: Can Humans Agree on “Depth”?

Subjectivity is the enemy of scientific measurement. If one person thinks a story is “deep” and another thinks it’s “shallow,” the scale is useless.

The study used Krippendorff’s alpha, a statistical measure of inter-rater agreement that accounts for agreement expected by chance.

Table 1: Rater agreement on each PDS component.

The results were highly encouraging. The human raters achieved an average alpha of 0.72, which indicates substantial agreement. This validates the PDS framework: despite the subjective nature of art, people generally agree on what constitutes empathy, authenticity, and complexity when given clear criteria.
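For readers who want to run this kind of reliability check on their own annotations, the open-source krippendorff Python package computes the statistic directly. The toy ratings below are invented purely for illustration.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Toy example: 3 raters scoring 5 stories on one PDS component (1-5 Likert).
# Rows = raters, columns = stories; np.nan marks a missing rating.
ratings = np.array([
    [4, 3, 5, 2, 4],
    [4, 3, 4, 2, 5],
    [5, 3, 4, np.nan, 4],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.2f}")  # values near 1.0 = strong agreement
```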

RQ2: Can AI Act as the Critic?

Human annotation is slow and expensive. Could an LLM (specifically GPT-4 or Llama-3) replace the human judges?

The researchers tested this by asking LLMs to rate the stories. Interestingly, a standard zero-shot prompt (just asking “rate this story”) had mixed results. To improve this, they introduced a Mixture-of-Personas (MoP) prompting strategy.

Instead of asking one AI to rate the story, they asked the AI to adopt specific personas relevant to the metrics (e.g., “You are an AI who specializes in evaluating genuineness…” or “You are an expert in narrative structure…”).

The Result: The MoP strategy significantly improved the correlation between AI ratings and human ratings. For example, Llama-3-70B achieved a correlation of 0.68 with humans on the Empathy metric. This suggests that while humans are still the gold standard, we can create automated pipelines that approximate human literary judgment with surprising accuracy.
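A rough sketch of how persona-specific judging and the human-correlation check might be wired together. The persona wording beyond the quoted examples and the chat() helper are my own illustrations, and Spearman's rank correlation (via SciPy) is used here as a generic choice; the paper's exact correlation statistic may differ.

```python
from scipy.stats import spearmanr

# Hypothetical persona prompts, one per PDS component (illustrative wording).
PERSONAS = {
    "authenticity": "You are an AI who specializes in evaluating the genuineness of narrative expressions.",
    "narrative_complexity": "You are an expert in narrative structure and character development.",
    # ...one persona for each remaining component
}

def rate_with_personas(story: str, chat) -> dict[str, int]:
    """Score a story once per persona (Mixture-of-Personas judging)."""
    scores = {}
    for component, persona in PERSONAS.items():
        messages = [
            {"role": "system", "content": persona},
            {"role": "user", "content": (
                f"Rate the following story on {component} from 1 to 5. "
                f"Reply with a single integer.\n\n{story}")},
        ]
        scores[component] = int(chat(messages).strip())
    return scores

# Validation idea: correlate model and human scores per component over the same stories.
# rho, p = spearmanr(human_scores, model_scores)
```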

RQ3: The Showdown – Human vs. AI Authors

This is the most provocative part of the study. How did the AI models stack up against human writers from Reddit?

The researchers compared the PDS scores of stories written by:

  1. Humans (Novice, Intermediate, Advanced)
  2. Open Source Models (Llama-2, Vicuna)
  3. Proprietary Models (GPT-4)

The results defy the common wisdom that AI lacks “soul.”

Spider plot comparing the psychological depth scores of five popular LLMs against a spectrum of human writers.

As seen in the spider plot above, GPT-4 (the purple line) performs exceptionally well: its profile largely encloses those of the human writers across the five dimensions.

When we look at the statistical breakdown, the results are even more stark.

Heatmap comparing whether differences in author scores are statistically significant.

Here is what the data tells us:

  • Narrative Complexity & Empathy: GPT-4 stories were rated statistically significantly higher than even the “Advanced” human stories from Reddit.
  • Authenticity & Engagement: GPT-4 was statistically indistinguishable from advanced human writers.
  • Consistency: The Cumulative Distribution Function (CDF) below shows that GPT-4 (the brown line) consistently pushes toward the right (higher scores), behaving very similarly to the best human writers.

Cumulative Distribution Function (CDF) plots for each component of psychological depth.
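For pairwise comparisons like the ones above, one common nonparametric approach is a Mann-Whitney U test over the two groups’ scores. I am using it here only as an illustration of how such significance checks and CDF curves can be produced, not as the paper’s exact procedure, and the numbers are invented.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Toy per-story Empathy scores for two author groups (invented numbers).
gpt4_scores     = np.array([4, 5, 4, 4, 5, 3, 4, 5])
advanced_humans = np.array([3, 4, 4, 3, 5, 3, 4, 4])

# Two-sided nonparametric test of whether one group tends to score higher.
stat, p_value = mannwhitneyu(gpt4_scores, advanced_humans, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3f}")

# An empirical CDF like the one in the paper comes from sorted scores: the
# further right a curve sits, the more often that author receives high ratings.
def ecdf(values):
    xs = np.sort(values)
    ys = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ys
```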

Perhaps most surprisingly, the study included an authorship identification task. Readers were asked to guess if a story was written by a human or an AI. They were only accurate 56% of the time—barely better than a coin flip. For GPT-4 specifically, accuracy dropped to 27%, meaning readers frequently mistook GPT-4’s work for human writing.

Why This Matters

This research represents a significant shift in how we think about Generative AI. We are moving past the era of checking if an AI can write a sentence that makes sense. We are entering an era where we must ask if an AI can write a sentence that matters.

The success of the Psychological Depth Scale demonstrates two key things:

  1. We can measure the intangible. By grounding evaluation in literary theory (Reader-Response and Text World Theory), we can quantify the subjective “feel” of a story.
  2. AI is crossing the “Empathy Gap.” The fact that GPT-4 outperformed amateur human writers on empathy and complexity challenges the notion that machines cannot replicate human emotional depth.

However, the authors note limitations. Reddit stories, while a good proxy, may not represent the pinnacle of human literature (like a Pulitzer Prize winner). Furthermore, the fact that humans perceive depth in AI text doesn’t mean the AI has depth—it means the AI is excellent at manipulating the symbols of human emotion to trigger a response in us.

Conclusion

The “soul” of a story has always been considered the unique province of the human spirit. But as the Psychological Depth Scale shows, that soul is becoming something we can prompt, measure, and optimize.

Whether you view this as a triumph of engineering or a concerning encroachment on human creativity, the reality is clear: LLMs are no longer just text generators. With strategies like Writer Profiles and Plan + Write, they are becoming engaging storytellers capable of weaving narratives that feel authentic, complex, and emotionally real. As these models evolve, the line between the “ghost in the machine” and the human author continues to blur.