Imagine you are analyzing the novel Aladdin. You want to track every time the text refers to the protagonist, whether by his name (“Aladdin”), a nominal phrase (“the boy”), or a pronoun (“he”).

In Natural Language Processing (NLP), this is classically handled by Coreference Resolution (CR). The goal of CR is to find all mentions in a text and cluster them by the entity they refer to. It sounds straightforward, but in practice, CR is notoriously difficult. Models trained on news articles often fail spectacularly when applied to literature or medical texts. Why? Because they get bogged down trying to resolve everything, including insignificant background characters and abstract concepts, and because different datasets disagree on what even counts as a “mention.”

In this post, we’ll dive into a fascinating paper titled “Major Entity Identification: A Generalizable Alternative to Coreference Resolution”. The researchers propose a shift in perspective: instead of trying to resolve every single entity from scratch, what if we tell the model who the Major Entities are beforehand?

The result is a task that is not only easier to evaluate and faster to run but also generalizes far better across different domains.

The Problem with Traditional Coreference Resolution

To see why a new approach is needed, let’s first look at where standard CR struggles.

At its heart, coreference resolution is an open-ended clustering problem: the number of entities is not known in advance. The model reads the text and groups spans of text (mentions) into clusters, one per entity, in two steps:

  1. Mention Detection: “Is this phrase referring to something?”
  2. Clustering: “Do ‘the poor tailor’ and ‘he’ refer to the same person?”
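
To make the clustering framing concrete, here is what the output of a CR system might look like on a toy sentence (illustrative only; real systems return token or character spans rather than strings):

```python
# Illustrative only: the kind of output a CR system produces.
# Every mention is detected and grouped into a cluster, one per entity,
# and the number of clusters is not known in advance.
text = "Mustapha, a poor tailor, had a son named Aladdin. He loved the boy."

cr_output = [
    ["Mustapha", "a poor tailor", "He"],  # cluster for the tailor
    ["Aladdin", "the boy"],               # cluster for his son
]
```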

The bottleneck lies in generalization. Different datasets have different rules for annotation. For example, the famous OntoNotes dataset (used to train many models) has strict rules that differ from literary datasets like LitBank. A model trained on OntoNotes learns specific idiosyncrasies about what a “mention” is. When you test that model on a novel, it fails—not necessarily because it doesn’t understand language, but because it’s looking for the wrong types of mentions.

Enter Major Entity Identification (MEI)

The authors propose Major Entity Identification (MEI). The core idea is simple: In most real-world applications, we don’t care about every single entity. If you are analyzing a movie script, you care about the main characters. If you are analyzing financial news, you care about specific companies.

In MEI, the target entities are provided as input.

Figure 1: CR vs. MEI. The CR task aims to detect and cluster all mentions into different entities, shown in various colors. MEI takes major entities as additional input and aims to detect and classify the mentions that refer only to these entities.

As shown in Figure 1, the difference is structural:

  • CR (Left): The model must detect that “Mustapha,” “poor tailor,” “father,” and “friends” are all mentions, and then cluster them by the entity they refer to.
  • MEI (Right): We provide the input set \(E = \{\text{Mustapha}, \text{Aladdin}\}\). The model’s only job is to scan the text and tag mentions that refer to these specific inputs. Everything else is ignored.

This shift transforms the problem from clustering (grouping unknown items) to classification (assigning items to known classes).
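
Continuing the toy sentence from above, the MEI contract might look like this (the layout is ours, for illustration, not the paper’s data format):

```python
# Illustrative only: MEI receives the major entities as input and returns,
# for each detected mention, the entity it refers to: a classification
# task rather than a clustering one.
major_entities = ["Mustapha", "Aladdin"]

mei_output = [
    ("Mustapha",      "Mustapha"),
    ("a poor tailor", "Mustapha"),
    ("Aladdin",       "Aladdin"),
    ("He",            "Mustapha"),
    ("the boy",       "Aladdin"),
    # mentions of minor entities or non-mentions get no label (the null class)
]
```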

Why Focus on Major Entities?

Statistically, a few major entities dominate the discourse. In literary datasets, a small percentage of characters account for the vast majority of mentions. By ignoring “singletons” (entities mentioned only once) and background noise, the model can focus on what matters.

Table 1: Comparing CR and MEI. MEI has fewer but larger clusters, and a smaller mean antecedent distance (Mean ant. dist.). Our formulation’s frequency-based criterion for deciding major entities means that singleton mentions are typically not a part of MEI.

Table 1 highlights this distinction. Notice that while the total number of mentions drops (because minor characters are ignored), the average cluster size skyrockets: in the FantasyCoref dataset, it jumps from roughly 9.7 in CR to 38.1 in MEI. This indicates that MEI focuses on the entities that drive the narrative.

The Solution: MEIRa

The researchers introduce a supervised model called MEIRa (Major Entity Identification via Ranking).

Unlike traditional CR models that build clusters dynamically, MEIRa works by ranking. It takes a document and a list of “designative phrases” (e.g., “Aladdin”, “Mustapha”) and learns to link mentions in the text to these phrases.

The Architecture

The process consists of three main steps:

  1. Document Encoding: A Longformer model reads the document to understand the context.
  2. Mention Proposal: The model identifies potential spans of text that could be mentions.
  3. Identification Module: This is the core engine that links mentions to major entities.
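
Before we look at the identification module in detail, here is a rough sketch of how these three steps could be wired together, using the Hugging Face Longformer checkpoint for step 1; the mention-proposal and identification parts are placeholders, not the paper’s actual code:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Step 1: document encoding with a Longformer (built for long contexts).
tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
encoder = AutoModel.from_pretrained("allenai/longformer-base-4096")

document = "Mustapha, a poor tailor, had a son named Aladdin. He loved the boy."
inputs = tokenizer(document, return_tensors="pt", truncation=True)
with torch.no_grad():
    token_embs = encoder(**inputs).last_hidden_state  # shape: (1, seq_len, hidden)

# Step 2: mention proposal. In the real model this is a learned span scorer;
# the token spans below are hard-coded placeholders.
candidate_spans = [(1, 1), (3, 6), (11, 12)]  # (start, end) token indices, inclusive
mention_embs = [token_embs[0, s:e + 1].mean(dim=0) for s, e in candidate_spans]

# Step 3: the identification module links each mention embedding to one of
# the provided major entities, or to none (see the scoring sketch below).
```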

Figure 2: Identification module of MEIRa. A mention encoding \(\mathbf{m}_i\) is concatenated with each entity’s embedding in \(E^W\) and the metadata. Network \(f\) scores the likelihood of assigning \(\mathbf{m}_i\) to each major entity.

Figure 2 illustrates the architecture. Let’s break it down:

  • Inputs: We have a mention \(m\) (from the text) and a set of entity representations \(E\) (derived from the provided major entity names).
  • Scoring (\(f\)): A neural network scores the likelihood that mention \(m\) belongs to entity \(e\).
  • Thresholding (\(\tau\)): If the score is high enough, it’s a match. If not, the mention is assigned to a “null” class (meaning it refers to a minor entity or isn’t a mention at all).

The scoring function is defined mathematically as:

\[ s_{ij} = f(\mathbf{m}_i, \mathbf{e}_j, \chi) \]

Here, the function \(f\) takes the vector for the mention (\(\mathbf{m}_i\)), the vector for the entity (\(\mathbf{e}_j\)), and some metadata \(\chi\) (like the distance between them). It outputs a score.

The decision logic is a simple threshold check:

\[ s_i^* = \max_j s_{ij}, \qquad \hat{e}_i = \begin{cases} \arg\max_j s_{ij} & \text{if } s_i^* > \tau \\ \emptyset & \text{otherwise} \end{cases} \]

If the best score \(s_i^*\) exceeds the threshold \(\tau\), the mention is assigned to that entity. Otherwise, it returns \(\emptyset\).
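
Put together, the scoring and thresholding logic can be sketched in PyTorch as follows; the feature set and the architecture of \(f\) are simplified stand-ins, not the paper’s exact design:

```python
import torch
import torch.nn as nn

class IdentificationModule(nn.Module):
    """Scores a mention against every major entity and applies a threshold."""

    def __init__(self, hidden_dim: int, meta_dim: int, tau: float = 0.0):
        super().__init__()
        # f: concatenated [mention; entity; metadata] -> scalar score
        self.f = nn.Sequential(
            nn.Linear(2 * hidden_dim + meta_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        self.tau = tau

    def forward(self, m_i, entities, meta):
        # m_i: (hidden,), entities: (K, hidden), meta: (K, meta_dim)
        K = entities.size(0)
        pairs = torch.cat([m_i.expand(K, -1), entities, meta], dim=-1)
        scores = self.f(pairs).squeeze(-1)       # s_ij for each major entity j
        best_score, best_j = scores.max(dim=-1)  # s_i* and its argmax
        if best_score.item() > self.tau:
            return int(best_j)                   # assign the mention to entity j
        return None                              # null: not a major-entity mention
```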

Two Flavors of MEIRa: Static vs. Hybrid

The authors propose two variations of this model, visible in the pathways of Figure 2:

  1. MEIRa-Static (MEIRa-S): The representation of the Major Entity (e.g., “Aladdin”) is computed once at the beginning. It never changes. This makes the model incredibly fast because every mention can be processed in parallel.
  2. MEIRa-Hybrid (MEIRa-H): This works more like traditional coreference. As the model finds mentions of Aladdin (e.g., “the boy”, “he”), it updates the representation of Aladdin to include this new context. This “dynamic memory” usually improves accuracy but requires processing the text sequentially.
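
Conceptually, the two variants differ only in whether the entity representations are updated as mentions are linked. Here is a minimal sketch, assuming an `identify` callable like the module above and a simple running-average update (the real update rule is learned; this is only a stand-in):

```python
def link_static(mentions, entity_embs, identify):
    """MEIRa-S: entity embeddings are fixed, so every mention can be
    scored independently (and therefore in parallel)."""
    return [identify(m, entity_embs) for m in mentions]


def link_hybrid(mentions, entity_embs, identify):
    """MEIRa-H: each linked mention refines its entity's embedding,
    so mentions must be processed left to right."""
    entity_embs = list(entity_embs)  # work on a copy
    assignments = []
    for m in mentions:
        j = identify(m, entity_embs)
        if j is not None:
            # Stand-in update: fold the new mention into the entity embedding.
            entity_embs[j] = 0.5 * entity_embs[j] + 0.5 * m
        assignments.append(j)
    return assignments
```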

Evaluation: Does it Generalize?

The primary claim of the paper is that MEI generalizes better than CR. To test this, the researchers trained models on one dataset (OntoNotes) and tested them on completely different literary datasets (LitBank and FantasyCoref).

The results were stark.

Table 3: Results for models trained on OntoNotes.

Table 3 shows the generalization performance. Look at the drop-off for the baseline CR models (Coref-ID, Coref-CM, Coref-FM). When trained on OntoNotes and tested on LitBank, their performance collapses (e.g., Coref-ID drops to 57.7 Micro-F1).

In contrast, MEIRa-H maintains high performance (78.6 Micro-F1 on LitBank). Because MEI offloads the definition of “who matters” to the input, the model doesn’t get confused by different annotation styles regarding singleton mentions or minor characters. It simply looks for the entities you asked for.

Speed Comparison

We mentioned earlier that MEIRa-S (Static) allows for parallel processing. How much of a difference does that make?

Figure 3: Linking speed comparison between MEIRa-S and longdoc for the combined LitBank and FantasyCoref test set.

As shown in Figure 3, MEIRa-S provides a massive speedup—roughly 25x faster than the baseline longdoc model. For long books or large archives, this efficiency is a game-changer.

MEI with Large Language Models

The paper also explores how Large Language Models (LLMs) like GPT-4 handle this task. You might assume LLMs would be perfect for this, but they have a specific weakness: Mention Boundaries.

If you ask an LLM to “Find all mentions of Alice,” it understands the assignment semantically. However, it struggles to pin down the exact span of text (e.g., “the curious little girl” versus just “girl”). This leads to poor performance on strict metrics.

The Two-Stage Prompting Solution

To fix this, the authors devised a clever two-stage prompting strategy:

  1. Word-Level Detection: First, ask the LLM to tag only the head word of the mention (e.g., in “the big red dog,” just tag “dog”). LLMs are much better at this precise task.
  2. Head-to-Span Retrieval (H2S): Use a second prompt (or a deterministic tool like spaCy) to expand that head word into the full grammatical span (“the big red dog”).
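
The deterministic head-to-span step can be done with an off-the-shelf dependency parser. Here is one way to implement H2S with spaCy (a sketch of the idea, not necessarily the authors’ exact implementation):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def head_to_span(doc, head_idx):
    """Expand a head token into the full syntactic span it governs (H2S)."""
    token = doc[head_idx]
    # left_edge / right_edge are the leftmost and rightmost tokens of the
    # head's subtree in the dependency parse.
    return doc[token.left_edge.i : token.right_edge.i + 1]

doc = nlp("The curious little girl followed the white rabbit.")
# Suppose the LLM tagged the head word "girl" (token index 3).
print(head_to_span(doc, 3).text)  # -> "The curious little girl"
```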

Table 6: Results on LLMs with different mention detection and linking strategies.

Table 6 demonstrates the impact of this strategy.

  • Single Prompt: GPT-4 achieves a Micro-F1 of 70.7 on LitBank.
  • Two-Stage Prompt (Word-level + H2S): GPT-4 jumps to 85.5, actually outperforming the supervised MEIRa model on that specific dataset.

This shows that while supervised models like MEIRa are efficient and robust, LLMs can be adapted to MEI with the right prompt engineering, bypassing their weakness in span detection.

Conclusion

The shift from Coreference Resolution to Major Entity Identification represents a practical evolution in NLP. By acknowledging that we usually know who we are interested in, we can simplify the problem, making models that are:

  1. More Robust: They don’t break when moving between domains (e.g., News \(\to\) Literature).
  2. Easier to Evaluate: We can use standard classification metrics (Precision/Recall) instead of complex clustering metrics.
  3. Faster: The MEIRa-Static architecture allows for massive parallelization.

Whether you are building a tool to analyze character arcs in novels or tracking entities in legal contracts, MEI offers a streamlined, “user-centric” alternative to the heavy lifting of traditional coreference resolution.

For students and researchers, this paper serves as a great example of how reframing a task (from “cluster everything” to “classify the important things”) can solve persistent bottlenecks in model performance.