Decoding the Plot: How ‘StoryEmb’ Teaches AI to Understand Narrative Structure

Imagine you are reading a summary of The Lion King. Now, imagine reading a summary of Shakespeare’s Hamlet. On the surface, these texts look nothing alike. One mentions Simba, Mufasa, and the African savanna; the other mentions Prince Hamlet, King Claudius, and Denmark. A traditional computer program analyzing word frequencies or surface-level semantics would tell you these stories are completely different.

But as a human reader, you know better. You recognize the underlying structure: a jealous uncle kills the king, the prince is exiled, the prince returns to reclaim his rightful place. The narrative is strikingly similar, even if the surface text is not.

This distinction—between the “what” (the plot events) and the “how” (the specific names, setting, and writing style)—is a massive challenge in Natural Language Processing (NLP). Most embedding models are distracted by keywords: they latch onto character names and other specific entities.

In this post, we are going to dive deep into a paper titled “Story Embeddings – Narrative-Focused Representations of Fictional Stories”. We will explore a new model called StoryEmb that learns to look past the surface details to capture the “soul” of a story.

The Problem: Fabula vs. Syuzhet

To understand why this is hard for AI, we have to borrow some terms from narratology (the study of narrative structure). There is a classic distinction between:

  1. Fabula (Story): The chronological sequence of events that actually happen in the fictional world.
  2. Syuzhet (Discourse): How those events are presented to the reader (the order, the words chosen, the style).

Standard text embeddings (like BERT or standard E5) represent the Syuzhet very well. If you change “The King died” to “The Monarch passed away,” they know it’s similar. But if you change “The King died” to “The CEO was fired” (in a modern corporate retelling of the same plot), standard models struggle to see the narrative equivalence because the vocabulary is too distinct.
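You can see this keyword sensitivity for yourself with an off-the-shelf embedding model. The sketch below uses the sentence-transformers library with a small general-purpose model chosen purely for illustration; it is not the model from the paper.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Any small general-purpose embedding model works for this illustration;
# it is NOT the model used in the paper.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The King died.",
    "The Monarch passed away.",
    "The CEO was fired.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the first sentence and the other two.
print(util.cos_sim(embeddings[0], embeddings[1]).item())  # paraphrase: high similarity
print(util.cos_sim(embeddings[0], embeddings[2]).item())  # same plot role, different vocabulary: lower
```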

The researchers behind StoryEmb wanted to build a representation that prioritizes the Fabula—the sequence of events—while ignoring the specific “dressing” of the story, like character names.

The Core Method: Building StoryEmb

How do you teach a model to ignore names? You force it to learn that they don’t matter.

1. The Architecture

The researchers didn’t start from scratch. They used a foundation model called Mistral-7B, specifically a version fine-tuned for embeddings called E5. They used a technique called LoRA (Low-Rank Adaptation), which allows for efficient fine-tuning of large models without retraining every single parameter.
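As a rough sketch of what this setup might look like in code, here is a LoRA configuration with the Hugging Face peft library. The checkpoint name, rank, and target modules are illustrative assumptions, not the paper's reported values.

```python
# pip install transformers peft
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Illustrative checkpoint; the paper builds on an E5-tuned Mistral-7B.
base_name = "intfloat/e5-mistral-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(base_name)
base_model = AutoModel.from_pretrained(base_name)

# LoRA: train small low-rank adapters instead of all 7B parameters.
# Rank, alpha, and target modules here are assumptions, not the paper's values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="FEATURE_EXTRACTION",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the full model is trainable
```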

2. The Dataset: “Tell-Me-Again”

To train a model to recognize the same story in different words, you need a dataset of… exactly that. The authors utilized the Tell-Me-Again dataset. This corpus contains thousands of stories where each story has multiple summaries associated with it.

These summaries were sourced from different language versions of Wikipedia (e.g., the English summary of Harry Potter and the German summary of Harry Potter, translated back to English). Because different cultures and translators describe plots differently, these provide natural variations of the same narrative.
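Conceptually, the data boils down to a mapping from each story to several independently written summaries; any two summaries of the same story can serve as a positive pair. A toy sketch (the texts are invented placeholders, not real dataset entries):

```python
# Toy illustration of the data shape: one story, several independent summaries.
# These strings are invented placeholders, not actual Tell-Me-Again entries.
story_to_summaries = {
    "story_001": [
        "An orphaned boy discovers he is a wizard and confronts a dark lord.",
        "A young wizard attends a magic school and faces the man who killed his parents.",
    ],
    "story_002": [
        "A prince avenges his father's murder by his uncle.",
    ],
}

# Any two summaries of the same story form a positive training pair.
positive_pairs = [
    (a, b)
    for summaries in story_to_summaries.values()
    for i, a in enumerate(summaries)
    for b in summaries[i + 1:]
]
print(len(positive_pairs))  # 1 pair from story_001, none from story_002
```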

3. The “Secret Sauce”: Pseudonymization

This is the most critical part of their methodology. If the model sees “Harry Potter” in summary A and “Harry Potter” in summary B, it will take a shortcut: it will just match the name “Harry Potter” and ignore the plot.

To prevent this, the researchers applied a Pseudonymization strategy. They took the summaries and systematically replaced entities. In one version, “Harry” might become “Alice”; in another, he might become “Bob.”

By training the model on these pseudonymized texts, the AI is forced to look at the relationships and actions. It can no longer rely on the token “Harry” to make a match. It has to learn that “Protagonist A defeats Antagonist B” is the core feature, regardless of whether they are named Harry/Voldemort or Alice/Bob.
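A simplified sketch of the idea is shown below. The real pipeline would first detect entities with an NER model; here the character names are assumed to be known, and the pseudonym list is an arbitrary choice.

```python
import re

# Toy stand-in for the pseudonymization step: in a real pipeline, named
# entities would first be detected automatically; here the character
# names are assumed to be known in advance.
PSEUDONYMS = ["Alice", "Bob", "Carol", "Dave"]

def pseudonymize(text: str, character_names: list[str]) -> str:
    """Replace each known character name with a generic pseudonym."""
    for pseudonym, name in zip(PSEUDONYMS, character_names):
        text = re.sub(rf"\b{re.escape(name)}\b", pseudonym, text)
    return text

summary_a = "Harry defeats Voldemort after years of conflict."
summary_b = "Voldemort is finally destroyed by Harry in the last battle."

# In training, different summaries may even receive different pseudonym
# assignments; here both use the same mapping for simplicity.
print(pseudonymize(summary_a, ["Harry", "Voldemort"]))
print(pseudonymize(summary_b, ["Harry", "Voldemort"]))
```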

4. Training Objective

They used Contrastive Learning. In simple terms, the model is given a “query” story and a batch of “candidate” stories.

  • Positive Pair: The query and a different summary of the same story.
  • Negative Pair: The query and a summary of a different story.

The model adjusts its internal weights to pull the positive pairs closer together in vector space and push the negative pairs apart.
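A generic in-batch contrastive (InfoNCE-style) loss captures this idea in a few lines; the sketch below is illustrative and does not claim to match the paper's exact loss formulation or temperature.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor,
                     candidate_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: row i of `query_emb` should match row i of
    `candidate_emb`; every other row in the batch acts as a negative."""
    query_emb = F.normalize(query_emb, dim=-1)
    candidate_emb = F.normalize(candidate_emb, dim=-1)

    # Similarity matrix: entry (i, j) = cosine similarity of query i and candidate j.
    logits = query_emb @ candidate_emb.T / temperature

    # The "correct" candidate for query i sits on the diagonal.
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random vectors standing in for embeddings of summaries.
queries = torch.randn(8, 4096)     # 8 query summaries
positives = torch.randn(8, 4096)   # matching summaries of the same stories
print(contrastive_loss(queries, positives))
```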

Experiments: Does it Work?

The researchers put StoryEmb through a gauntlet of tests to see if it really understands narrative better than existing state-of-the-art models.

Experiment 1: The “Tell-Me-Again” Retrieval Task

The first test was straightforward: unique summaries from the “Tell-Me-Again” dataset were held back for testing. The model was given a summary and asked to find its partner (the other summary of the same story) from a massive list.

They tested two scenarios:

  1. Non-Pseudonymized: The real names remain (e.g., identifying Harry Potter).
  2. Pseudonymized: The names are scrubbed and replaced (e.g., identifying the plot of Harry Potter where he is named “User_1”).

Below are the results. P@1 means “Precision at 1”—did the model put the correct story at the very top of the list? P@N allows for a bit more leeway, checking if the correct answer was in the top N results.

Table 1: Retrieval performance on the Tell-Me-Again test set by Hatzel and Biemann (2024), with and without their anonymization strategy.

The Takeaway: Look at the Pseudonymized columns on the left.

  • E5 (a standard strong model) crashes. Its P@1 is only 33.20%. It relies heavily on names.
  • Sentence-T5 (another giant model) hits 67.28%.
  • StoryEmb + aug (the authors’ model trained with pseudonymization) achieves 82.60%.

This is a massive improvement. It proves that when you take away the crutch of character names, StoryEmb is vastly superior at recognizing the plot structure. Even on non-pseudonymized text (right side), StoryEmb remains highly competitive, showing it hasn’t lost the ability to read normal text.
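To make the retrieval setup concrete, here is a generic sketch of how precision@k can be computed once every summary has been embedded; it is an evaluation sketch under the usual cosine-similarity assumptions, not the paper's code.

```python
import numpy as np

def precision_at_k(query_embs: np.ndarray,
                   corpus_embs: np.ndarray,
                   gold_indices: np.ndarray,
                   k: int = 1) -> float:
    """Fraction of queries whose gold partner appears among the top-k neighbours."""
    # Normalise so that the dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = q @ c.T                                  # (num_queries, corpus_size)
    top_k = np.argsort(-sims, axis=1)[:, :k]        # indices of the k nearest summaries
    hits = (top_k == gold_indices[:, None]).any(axis=1)
    return float(hits.mean())

# Toy usage: 5 queries, a corpus of 100 summaries, random gold assignments.
rng = np.random.default_rng(0)
queries = rng.normal(size=(5, 64))
corpus = rng.normal(size=(100, 64))
gold = rng.integers(0, 100, size=5)
print(precision_at_k(queries, corpus, gold, k=1))
```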

Experiment 2: Movie Remakes

Training on Wikipedia is one thing, but does this transfer to other domains? The researchers tested the model on a dataset of Movie Remakes. These are summaries of original movies and their remakes (e.g., Ocean’s Eleven 1960 vs. 2001). Remakes often change dialogue, setting, and pacing, but keep the narrative arc.

Table 2: Test set retrieval performance on the dataset by Chaturvedi et al. (2018), with and without the anonymization strategy by Hatzel and Biemann (2024) applied to the dataset.

The Takeaway: StoryEmb sets a new State-of-the-Art (SOTA).

  • The previous best (Chaturvedi) had a P@1 of 63.7%.
  • StoryEmb (augmented) reaches 83.26%.
  • Interestingly, Sentence-T5 performed worse here than on the previous task, suggesting it struggles to generalize to this specific movie domain, whereas StoryEmb adapted well.

Experiment 3: Retellings (Hard Mode)

Retellings are different from remakes. A retelling might transport Pride and Prejudice to modern-day Los Angeles. The themes match, but the scenes might be vastly different. The researchers curated a small, difficult dataset of literary retellings.

Table 7: Retrieval performance on the retelling dataset introduced in Section 4.1.3 of the paper.

The Takeaway: Here, we see the limits. On this specific “Retellings” dataset, the massive Sentence-T5XXL model actually outperformed StoryEmb (70% vs 56.67% P@1).

Why? The researchers hypothesize that in very small datasets (only 30 summaries), specific keywords and remaining entity hints become very powerful discriminators. The “bag-of-words” approach (matching specific terms) helps T5 here. However, StoryEmb still significantly outperformed the base E5 model (36.67%), showing it is learning narrative, just perhaps not enough to beat a massive model on this specific, tiny dataset.

Experiment 4: Narrative Understanding (ROCStories)

This is perhaps the most fascinating experiment because it tests “common sense.” The ROCStories Cloze Task gives a computer a 4-sentence story and two possible endings.

  • Ending 1: Logically follows the plot.
  • Ending 2: Uses similar words but makes no narrative sense.

StoryEmb was not trained to classify endings. It is just an embedding model. To solve this, the researchers used a clever distance-based trick, defined by this equation:

\[
\mathrm{answer}(a, s_1, s_2) =
\begin{cases}
s_1 & \text{if } d(a, s_1) < d(a, s_2) \\
s_2 & \text{otherwise}
\end{cases}
\]

Translation: They calculate the “distance” (\(d\)) between the story so far (\(a\)) and the two possible full stories (\(s_1\) and \(s_2\)).

  • If the distance to \(s_1\) is smaller, pick \(s_1\).
  • If the distance to \(s_2\) is smaller, pick \(s_2\).

The logic is that the embedding of the full story with the coherent ending should sit closer to the embedding of the story so far than the version with the incoherent ending does.
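A minimal sketch of this decision rule, assuming some embed function that maps text to a vector and using cosine distance as \(d\) (the exact distance measure is an assumption here):

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pick_ending(embed, story_so_far: str, ending_1: str, ending_2: str) -> str:
    """Choose the ending whose full story is closest to the story so far."""
    a = embed(story_so_far)
    s1 = embed(story_so_far + " " + ending_1)   # full story with ending 1
    s2 = embed(story_so_far + " " + ending_2)   # full story with ending 2
    return ending_1 if cosine_distance(a, s1) < cosine_distance(a, s2) else ending_2

# Toy usage with a stand-in embedder (hash-seeded random vectors),
# just to make the function runnable without a real model.
def fake_embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=16)

print(pick_ending(fake_embed, "Tom slipped on the ice.", "He hurt his knee.", "He flew away."))
```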

Table 5: Accuracy at picking the correct story ending on ROCStories.

The Takeaway: StoryEmb achieves 89.4% accuracy (Zero-shot).

  • It beats the base E5 model (78.5%).
  • It beats GPT-3 in a zero-shot setting (83.2%).
  • It performs nearly as well as models that are explicitly “few-shot” prompted.

This confirms that StoryEmb isn’t just matching keywords; it has acquired a genuine sense of narrative flow. It knows that “falling down” usually leads to “getting hurt,” not “flying away,” and encodes that likelihood into the embedding.

Looking Under the Hood: Why does it work?

We know StoryEmb works, but how? The researchers used an attribution analysis. This technique looks at which words in a sentence contribute most to the similarity score.

If the model is working as intended, it should care less about names and more about actions (verbs) and roles.

The researchers visualized the “Delta” (difference) between StoryEmb and the original E5 model. In the heatmap below, Red means StoryEmb cares about this word less than E5 did. Blue means it cares more.

Figure 1: Attribution scores on individual tokens in the final layer of our StoryEmb model are shown as a delta from the E5 model.

Look at the top left. The name “Alice” is dark red. StoryEmb has learned to downweight the specific name “Alice.”

Now, look at the aggregate data across many sentences:

Table 6: The average contribution to sentence similarity of selected named-entity and part-of-speech tags.

The Takeaway:

  • PROPN (Proper Nouns): Negative delta. The model ignores them more.
  • VERB: Positive delta. The model pays more attention to what is happening.
  • PERSON / ORG: Negative delta. Specific entities matter less.

This data provides concrete proof that the pseudonymization training strategy succeeded. The model has essentially learned to shift its focus toward the plot (verbs, common nouns) and away from the cast list (proper nouns, named people).
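As a rough illustration of how per-tag aggregates like those in Table 6 could be produced, assuming token-level attribution scores for both models are already available (how those scores are computed, e.g. via a gradient-based attribution method, is outside this sketch):

```python
# pip install spacy && python -m spacy download en_core_web_sm
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")

def delta_by_pos(text: str,
                 story_emb_scores: list[float],
                 e5_scores: list[float]) -> dict:
    """Average (StoryEmb - E5) attribution difference per part-of-speech tag.

    Assumes one attribution score per spaCy token for each model; how those
    scores are obtained is outside the scope of this sketch."""
    doc = nlp(text)
    sums, counts = defaultdict(float), defaultdict(int)
    for token, s_new, s_old in zip(doc, story_emb_scores, e5_scores):
        sums[token.pos_] += s_new - s_old
        counts[token.pos_] += 1
    return {pos: sums[pos] / counts[pos] for pos in sums}

# Toy usage: invented scores, one per token of the example sentence.
text = "Alice defeats the dark wizard."
print(delta_by_pos(text, [0.05, 0.40, 0.10, 0.20, 0.30, 0.0],
                         [0.35, 0.20, 0.10, 0.25, 0.25, 0.0]))
```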

Conclusion

StoryEmb represents a significant step forward in how computers process fiction. By moving away from surface-level keyword matching and forcing the model to grapple with pseudonymized texts, the researchers created an embedding space where Hamlet and The Lion King (hypothetically) sit closer together than Hamlet and a biography of a real Danish Prince.

Key Takeaways:

  1. Narrative over Names: NLP can be trained to recognize plot structure by stripping away entity names during training.
  2. Data Augmentation Works: The pseudonymization strategy of swapping names (Alice/Bob) across Tell-Me-Again summaries is a powerful way to bias models toward narrative structure rather than surface vocabulary.
  3. Generalization: A model trained on Wikipedia summaries can successfully identify movie remakes and predict logical story endings, showing robust narrative understanding.

As we move toward AI that can write, critique, or recommend literature, tools like StoryEmb will be essential. They bridge the gap between simply reading the words and actually understanding the story.