If you have ever played around with a “Role-Playing Agent” (RPA)—an AI chatbot designed to act like Harry Potter, Sherlock Holmes, or a character from your favorite anime—you might have been impressed by its ability to mimic their speech style. But have you ever wondered: does the AI actually understand the character? Or is it merely parroting catchphrases and surface-level traits?
As Large Language Models (LLMs) like GPT-4 and Claude 3 continue to evolve, the demand for sophisticated RPAs is skyrocketing. However, ensuring these agents truly grasp the depth of a character—their complex relationships, evolving personalities, and hidden motivations—remains a massive challenge.
In a fascinating paper titled “Evaluating Character Understanding of Large Language Models via Character Profiling from Fictional Works,” researchers from Fudan University propose a new way to test this “character literacy.” Instead of just asking an AI to “act like Batman,” they ask it to build a comprehensive psychological and biographical profile of the character directly from the novel, and then use that profile to explain why the character makes certain decisions.
In this deep dive, we will explore how this “Character Profiling” framework works, the unique dataset the researchers built, and what the results tell us about the current limits of AI literary comprehension.
The Problem: Beyond Mimicry
Prior to this study, evaluating how well an LLM understood a fictional character was often done through basic classification tasks or imitation.
- Classification: Can the model guess who is speaking based on a line of dialogue?
- Imitation: Can the model generate text that sounds like the character?
While useful, these methods are superficial. A model might know that Yoda speaks in inverted syntax without understanding why he warns Luke Skywalker about the Dark Side. True understanding requires grasping the nuance of a character—their history, their shifting relationships, and the psychological drivers behind their actions.
The researchers argue that if an LLM truly understands a character, it should be able to perform Character Profiling: the act of summarizing a character’s life and psyche from the raw text of a book. This profile then becomes the foundation for understanding the character’s decisions.
The Solution: The Character Profiling Framework
The core contribution of this paper is a new evaluation framework that mimics how a literary scholar might analyze a text. The process is twofold: generating a profile and then testing that profile’s utility.
1. Constructing the Profile
The researchers define a high-quality character profile not as a simple biography, but as a structured summary covering four distinct dimensions (a minimal schema sketch follows the list):
- Attributes: The basics—gender, skills, talents, objectives, and background.
- Relationships: How the character interacts with others (friends, enemies, family).
- Events: A chronological summary of the key experiences the character goes through.
- Personality: The internal traits and behaviors that define who they are.
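To make the structure concrete, here is a minimal sketch of what such a profile might look like as a data structure. The field names and example values are illustrative assumptions; the paper specifies the four dimensions, not a schema.

```python
from dataclasses import dataclass

@dataclass
class CharacterProfile:
    """One profile per character, covering the paper's four dimensions."""
    character: str
    attributes: str     # gender, skills, talents, objectives, background
    relationships: str  # ties to friends, enemies, and family
    events: str         # chronological summary of key experiences
    personality: str    # internal traits and behaviors

profile = CharacterProfile(
    character="Harry Potter",
    attributes="Wizard; Gryffindor Seeker; goal: defeat Voldemort.",
    relationships="Best friends with Ron and Hermione; nemesis of Voldemort.",
    events="Orphaned as an infant; enters Hogwarts at eleven; ...",
    personality="Brave, loyal, occasionally impulsive.",
)
```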

As shown in Figure 1, the process begins by feeding the raw text of a novel (like Harry Potter) into an LLM. The LLM must synthesize the text into the four profile dimensions. Once the profile is generated, it is subjected to two types of rigorous testing (both sketched in code after the list):
- Factual Consistency Examination (FCE): Is the generated profile actually true to the book?
- Motivation Recognition (MR): Can the AI use this profile to correctly identify the motivation behind a specific plot decision? (e.g., Why did Harry decide to keep the Elder Wand’s allegiance secret?)
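Put together, the whole framework fits in a few lines of orchestration. The sketch below assumes a generic `call_llm(prompt)` completion function (a placeholder, not any specific API); the prompts paraphrase the paper's tasks rather than quote them.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call."""
    raise NotImplementedError("plug in your model of choice here")

def generate_profile(book_text: str, character: str) -> str:
    # Step 1: synthesize the novel into the four profile dimensions.
    return call_llm(
        f"Read the novel below and profile {character} along four "
        f"dimensions: attributes, relationships, events, personality.\n\n"
        f"{book_text}"
    )

def evaluate(profile: str, expert_profile: str, mr_questions: list) -> None:
    # Step 2a (FCE): check the profile against an expert-written reference.
    # Step 2b (MR): answer motivation questions using only the profile.
    ...  # concrete sketches of both checks appear in the sections below
```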
2. The CROSS Dataset
To evaluate these profiles, you need a “ground truth”—a correct answer key. The researchers created the CROSS dataset (Character Profiles from SuperSummary).
They selected 126 novels published in 2022 and 2023 (to minimize the chance that older LLMs had memorized the books during pre-training). They then used SuperSummary—a platform where literature experts provide detailed book summaries and character analyses—as the gold standard. By comparing the AI’s generated profiles against these expert-written analyses, they could objectively measure performance.
Handling Long Novels: The Context Window Challenge
One of the biggest technical hurdles in this research is the sheer length of novels. A typical novel can easily exceed 100,000 tokens (the subword units LLMs actually process). While some modern models have massive context windows, many do not. How do you fit an entire book into a model to generate a profile?
The researchers tested three different summarization strategies, illustrated in Figure 2 (and sketched in code after the list):

- (a) Hierarchical Merging: The book is split into chunks. The model summarizes each chunk (Level 1), then summarizes those summaries (Level 2), and so on, until a final “Master Summary” is created. This is efficient but risks losing details in the upper layers.
- (b) Incremental Updating: This mimics human reading. The model reads the first chunk and makes a summary. It then reads the second chunk and updates the previous summary with new information. This helps maintain narrative flow but can be slow and prone to “forgetting” earlier details if the summary gets too compressed.
- (c) Summarizing in One Go: For models with massive context windows (like GPT-4-Turbo), the entire book is fed in at once.
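Here is one way the three strategies could be implemented, again assuming a placeholder `call_llm(prompt)` function and a naive fixed-size character chunker (the paper does not publish this exact code).

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model of choice here")

def chunk(text: str, size: int = 30_000) -> list[str]:
    # Naive fixed-size splitting; real systems would split on chapter breaks.
    return [text[i:i + size] for i in range(0, len(text), size)]

def hierarchical_merging(book: str) -> str:
    # (a) Summarize chunks, then summarize the summaries, until one remains.
    summaries = [call_llm(f"Summarize this passage:\n{c}") for c in chunk(book)]
    while len(summaries) > 1:
        merged = "\n\n".join(summaries)
        summaries = [call_llm(f"Merge these summaries:\n{c}") for c in chunk(merged)]
    return summaries[0]

def incremental_updating(book: str) -> str:
    # (b) Maintain one running summary, revising it after each new chunk.
    summary = ""
    for c in chunk(book):
        summary = call_llm(
            f"Current summary:\n{summary}\n\nRevise it to include:\n{c}"
        )
    return summary

def summarize_in_one_go(book: str) -> str:
    # (c) Only viable when the context window fits the entire novel.
    return call_llm(f"Summarize this entire novel:\n{book}")
```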
Evaluation Phase 1: Is the Profile Accurate? (Intrinsic Evaluation)
Once the LLMs generated their profiles using the methods above, the researchers had to check them for accuracy. They used an “LLM-as-a-Judge” approach, employing Llama-3-70B to compare the generated profiles against the expert summaries from the CROSS dataset.
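In code, the judging step might look like the sketch below. The 1-to-5 rubric wording and the digit-parsing logic are my assumptions; the paper reports using Llama-3-70B as the judge, but this sketch works with any model behind `call_judge(prompt)`.

```python
import re

def call_judge(prompt: str) -> str:
    raise NotImplementedError("plug in Llama-3-70B or another judge model")

def consistency_score(candidate: str, reference: str, dimension: str) -> int:
    """Rate one profile dimension against the expert reference, 1-5."""
    reply = call_judge(
        f"Rate from 1 (contradicts the reference) to 5 (fully consistent) "
        f"how well the candidate's '{dimension}' section matches the expert "
        f"reference. Answer with a single digit.\n\n"
        f"Reference:\n{reference}\n\nCandidate:\n{candidate}"
    )
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 1  # conservative on parse failure
```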
The results, measured by a “Consistency Score” (1 to 5), revealed some interesting trends.

Key Takeaways from the Data:
- Bigger is Better: Unsurprisingly, GPT-4-Turbo dominated the field, achieving the highest consistency scores across almost all dimensions.
- One Go Wins: For books that fit within the context window, the “Summarizing in One Go” method (shown in the bottom section of Table 2) generally outperformed the chunking methods. This suggests that having global access to the full text allows the model to connect dots that appear chapters apart.
- The “Events” Struggle: Look closely at the scores in Table 2. Across almost all models, the scores for the “Even” (Events) dimension are consistently lower than for “Attr” (Attributes) or “Pers” (Personality). In other words, LLMs are good at describing who someone is (personality), but they struggle to accurately summarize the chronological cause-and-effect chain of what happened (events).
Evaluation Phase 2: Do Profiles Help Us Understand Motivation? (Extrinsic Evaluation)
The second, perhaps more innovative part of this study is the Motivation Recognition (MR) task.
In narrative psychology, understanding a character isn’t just about listing facts; it’s about “Theory of Mind”—understanding why they do what they do. The researchers created multiple-choice questions about specific decisions characters made in the books.
- The Task: The LLM is given a scenario (e.g., “Nora decides to break up with Charlie”) and the profile it generated. It must then select the correct reason for this decision from four options.
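Scoring this task reduces to four-way multiple-choice accuracy. A minimal sketch, assuming the same placeholder `call_llm` and a simple letter-answer format (the paper's exact prompt may differ):

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model of choice here")

def answer_motivation(profile: str, scenario: str, options: list[str]) -> str:
    choices = "\n".join(f"{l}. {o}" for l, o in zip("ABCD", options))
    reply = call_llm(
        f"Character profile:\n{profile}\n\n"
        f"Decision: {scenario}\n"
        f"Which option best explains the motivation? Answer with one letter.\n"
        f"{choices}"
    )
    return reply.strip()[:1].upper()

def mr_accuracy(questions, profile: str) -> float:
    # `questions` holds (scenario, options, gold_letter) tuples.
    correct = sum(
        answer_motivation(profile, scen, opts) == gold
        for scen, opts, gold in questions
    )
    return correct / len(questions)
```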
The results validated the importance of profiling. Models equipped with high-quality generated profiles performed significantly better at identifying motivations. There was a strong correlation: the more factually consistent the profile (Phase 1), the better the model understood the character’s motivation (Phase 2).
The Ablation Study: What Information Matters Most?
To understand exactly which parts of a character profile drive understanding, the researchers performed an “ablation study.” They selectively removed dimensions (like deleting the “Events” section or the “Personality” section) and checked how the model’s performance on Motivation Recognition dropped.
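Mechanically, the ablation is simple: re-render the profile with one dimension deleted and measure how far MR accuracy falls. A sketch, assuming a dict-based profile and an `evaluate(profile_text)` function that returns MR accuracy (both illustrative):

```python
DIMENSIONS = ["attributes", "relationships", "events", "personality"]

def render(profile: dict, drop: str | None = None) -> str:
    # Flatten the profile to text, optionally omitting one dimension.
    return "\n".join(
        f"{dim.title()}: {profile[dim]}" for dim in DIMENSIONS if dim != drop
    )

def ablation_drops(profile: dict, evaluate) -> dict[str, float]:
    baseline = evaluate(render(profile))
    # A larger accuracy drop means that dimension mattered more.
    return {dim: baseline - evaluate(render(profile, drop=dim))
            for dim in DIMENSIONS}
```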

The Critical Finding: As seen in Table 3, removing the Events dimension caused the biggest drop in accuracy (from 57.75% down to 48.54%).
This is a crucial insight for developers of Role-Playing Agents. We often focus on giving AI agents a “personality” prompt (e.g., “You are grumpy and cynical”). However, this research suggests that plot history (Events) is actually the most important factor for an AI to correctly reason about a character’s decisions. Without knowing what the character has been through, the AI cannot understand their motivations, regardless of how well defined their personality traits are.
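For RPA builders, the practical upshot might look like the sketch below: give the agent a chronological event history, not just adjectives. The prompt structure here is my own illustration, not something from the paper.

```python
def rpa_system_prompt(name: str, personality: str, events: str) -> str:
    return (
        f"You are {name}.\n"
        f"Personality: {personality}\n"
        # Per the ablation result, this section does the heavy lifting:
        f"Your life so far, in chronological order:\n{events}\n"
        f"Ground every decision in what you have lived through above."
    )

prompt = rpa_system_prompt(
    name="Harry Potter",
    personality="brave, loyal, occasionally impulsive",
    events=(
        "Orphaned as an infant; raised unloved by the Dursleys; learned at "
        "eleven he is a wizard; watched Cedric die; lost Sirius; ..."
    ),
)
```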
Where Do LLMs Fail? A Look at Hallucinations
Despite the successes of models like GPT-4, they are not perfect. The researchers conducted a manual error analysis to see where the profiling process broke down.

Table 4 highlights the common pitfalls:
- Character Misidentification: In complex books with “stories within stories” or shifting perspectives, LLMs sometimes confuse characters. For example, in the book Trust, the model confused a character from a fictional manuscript inside the novel with the “real” character in the novel’s main timeline.
- Event Misinterpretation: LLMs struggle with plot twists. If a character is revealed to be a traitor in the final chapter, the “Incremental Updating” method might fail to correct the earlier summaries that described them as a loyal friend.
- Relationship Errors: The nuance of relationships (e.g., “neighbor” vs. “grandson”) is frequently lost, leading to factual hallucinations.
Did the Models Just Memorize the Books?
A common criticism of LLM research is “data contamination”—the idea that the model already knows the answer because it read the book during its training phase.
To mitigate this, the researchers used books published in 2022 and 2023. They also ran a sanity check, comparing the performance on these recent books against performance on highly famous books from the 20th century (like 1984 or The Little Prince).

As expected (shown in Table 7), the models scored remarkably high (a consistency score of 4.7) on 20th-century classics, likely because they have seen endless discussions and summaries of 1984 online.
However, when looking at the main dataset of recent books, the performance was lower and, crucially, stable across the years.

Figure 3 shows that there is no significant correlation between the publication year (over the last decade) and the model’s performance. This suggests the models are genuinely summarizing the text provided to them, rather than relying on memorized Wikipedia pages.
Conclusion and Future Implications
The paper “Evaluating Character Understanding of Large Language Models via Character Profiling from Fictional Works” provides a significant step forward in how we build and test AI characters.
Key Takeaways:
- Profiling Works: Generating a structured profile is a valid way to distill a whole novel into a usable format for AI.
- Context is King: Being able to process a whole book at once (“Summarizing in One Go”) yields better results than breaking it into chunks.
- History Matters: To make an AI understand a character’s motivation, feeding it a list of personality traits isn’t enough. It needs a chronological summary of events.
- Complexity is Hard: LLMs still struggle with the complex, non-linear narratives that human readers enjoy (flashbacks, unreliable narrators, nested stories).
For students and developers interested in Role-Playing Agents, the message is clear: if you want your AI to truly embody a character, don’t just tell it how to act. Feed it the character’s life story. The “Events” dimension—the history of what the character has suffered and achieved—is the key to unlocking true character understanding.