In the vast ocean of unstructured text on the internet—Wikipedia pages, news articles, financial reports—lies a treasure trove of data waiting to be organized. For years, the field of Information Extraction (IE) has been the miner of this digital age, digging through paragraphs to find relationships between things.
Traditionally, this has been done by hunting for “triplets”: a Subject, a Relation, and an Object (e.g., Bill Gates, Co-founder, Microsoft). While effective, this approach has limits. It treats information as a bag of disconnected facts rather than a cohesive profile of an entity.
A recent research paper, “Learning to Extract Structured Entities Using Language Models,” proposes a paradigm shift. Instead of just hunting for isolated links, the researchers argue we should be extracting Structured Entities—comprehensive, object-oriented representations of people, organizations, and places.
In this deep dive, we will explore why the old “triplet” methods fall short, how this new “entity-centric” approach works, the novel AESOP metric designed to evaluate it, and MuSEE, a multi-stage language model architecture that achieves state-of-the-art results with impressive efficiency.
The Problem with Triplets
To understand the innovation, we must first look at the limitation of the status quo. Information Extraction has predominantly focused on the tuple (subject, relation, object). Models are trained to read a sentence and output a list of these tuples.
Standard evaluation metrics, like Precision and Recall, are calculated based on these triplets. This leads to a phenomenon the authors describe as a “holistic blind spot.”
Consider a paragraph mentioning ten different people.
- Person A is the main character and has 10 associated facts (relations).
- Persons B through J (9 people) are minor characters, each with only 1 associated fact.
- Total facts: 19 triplets.
Now, imagine a model that gets obsessive about Person A. It correctly identifies all 10 triplets for Person A but completely ignores the other 9 people.
- Recall: 10/19 (approximately 53%).
- Precision: 100%.
Statistically, this model looks decent. But functionally? It failed to recognize 90% of the entities in the text. A human evaluator would call this a failure, but triplet-centric metrics give it a pass. This disconnect drives the need for a new formulation: Structured Entity Extraction (SEE).
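To make the blind spot concrete, here is a minimal Python sketch that reproduces the arithmetic above with made-up triplets: the standard set-based precision and recall look respectable even though nine of the ten entities are missed entirely.

```python
# Hypothetical triplets reproducing the "Person A" scenario above.
gold = {("Person A", f"relation_{i}", f"object_{i}") for i in range(10)}
gold |= {(f"Person {c}", "knows", "Person A") for c in "BCDEFGHIJ"}   # 9 minor entities, 1 fact each

predicted = {("Person A", f"relation_{i}", f"object_{i}") for i in range(10)}  # obsessed with Person A

true_positives = len(predicted & gold)
precision = true_positives / len(predicted)   # 10 / 10  = 1.00
recall = true_positives / len(gold)           # 10 / 19 ≈ 0.53

found_entities = {s for s, _, _ in predicted}
gold_entities = {s for s, _, _ in gold}
entity_recall = len(found_entities & gold_entities) / len(gold_entities)  # 1 / 10 = 0.10

print(f"triplet precision={precision:.2f}, triplet recall={recall:.2f}, "
      f"entity recall={entity_recall:.2f}")
```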
What is Structured Entity Extraction?
Structured Entity Extraction moves the goalposts from “finding links” to “building profiles.” The task is to define an entity (like “Bill Gates”) and then populate a schema of properties associated with that specific entity.

As shown in Figure 1, the process takes unstructured text and a predefined schema (a list of possible entity types and property keys) as input. The output is not a flat list of relations, but a structured JSON object.
Notice the hierarchy in the “Expected output json”:
- Identify the Entity: “Bill Gates”.
- Classify the Entity: ent_type_human.
- Populate Properties: pk_country: America, pk_occupation: Businessman.
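To make that structure explicit, here is a sketch of what the output object could look like. The nesting mirrors the example; the exact field layout is illustrative rather than the paper's schema.

```python
import json

# Illustrative structured-entity output for the running example.
# The type and property keys (ent_type_*, pk_*) follow the naming used in the
# article; the surrounding field layout is a sketch.
expected_output = {
    "Bill Gates": {
        "type": "ent_type_human",
        "pk_country": "America",
        "pk_occupation": "Businessman",
    }
}

print(json.dumps(expected_output, indent=2))
```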
This approach handles coreference implicitly. If the text mentions “Bill Gates” in sentence one and “he” in sentence two, the model must understand they belong to the same entity object. This requires a deeper level of text comprehension than simple pattern matching.
The AESOP Metric: A New Way to Grade
If we change the task from extracting triplets to extracting complex objects, we can no longer use simple triplet-based Precision and Recall. We need a metric that can compare two sets of complex objects—the “Predicted Set” from the model and the “Ground Truth Set” from the dataset.
The researchers introduce AESOP (Approximate Entity Set OverlaP).
Evaluating sets of entities is difficult because they are unordered. If the model outputs “Microsoft” then “Bill Gates,” but the ground truth lists “Bill Gates” then “Microsoft,” the metric must first figure out which prediction matches which truth before it can grade the accuracy.
AESOP solves this in two phases:
Phase 1: Optimal Entity Assignment
First, the metric plays “matchmaker.” It looks at the pool of predicted entities and the pool of ground truth entities and calculates a similarity matrix.
\[ \mathbf{F} = \underset{\mathbf{F}}{\arg\max} \sum_{i=1}^{m} \sum_{j=1}^{n} \mathbf{F}_{i,j} \cdot \mathbf{S}_{i,j} \]

In this equation (Equation 2 from the paper), \(\mathbf{F}\) is the assignment matrix. The algorithm tries to maximize the total similarity (\(\mathbf{S}\)) across all matches. It ensures a one-to-one mapping—each predicted entity is matched to the best possible ground truth entity (or none, if it’s a complete hallucination).
The similarity \(\mathbf{S}\) is usually weighted. The authors assign a high weight (0.9) to the Entity Name and a lower weight (0.1) to the overlapping properties. This mimics human logic: if you get the name wrong, the properties don’t matter much.
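As a concrete illustration, this kind of one-to-one matching can be solved with the Hungarian algorithm. The sketch below uses SciPy's linear_sum_assignment on a made-up similarity matrix; it is a stand-in for the paper's assignment step, not its implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy similarity matrix S: rows = predicted entities, columns = ground-truth entities.
# In AESOP each cell would blend name similarity (weight 0.9) with property
# similarity (weight 0.1); the numbers here are made up.
S = np.array([
    [0.95, 0.10, 0.05],   # prediction 0 clearly resembles ground truth 0
    [0.20, 0.88, 0.15],   # prediction 1 clearly resembles ground truth 1
])

# linear_sum_assignment minimizes total cost, so negate S to maximize total similarity.
rows, cols = linear_sum_assignment(-S)
for i, j in zip(rows, cols):
    print(f"predicted entity {i} -> ground-truth entity {j} (similarity {S[i, j]:.2f})")
```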
Phase 2: Pairwise Entity Comparison
Once the entities are paired up, AESOP grades the quality of the match.
\[ \psi_{\mathrm{ent}}(e', e) = \bigotimes_{p \in \mathcal{P}} \psi_{\mathrm{prop}}\left( v_{e',p},\, v_{e,p} \right) \]

This equation calculates the score (\(\psi_{\mathrm{ent}}\)) for a specific pair of entities. It looks at every property \(p\) in the schema (like “occupation” or “age”) and compares the predicted value \(v_{e',p}\) against the true value \(v_{e,p}\). The comparison is typically done using the Jaccard index (intersection over union of tokens), allowing for partial credit if the text is slightly different but semantically close.
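Here is a minimal sketch of that pairwise scoring, assuming a token-level Jaccard index per property and a plain mean as the aggregation operator (the paper's ⊗ is more general; missing values simply compare as empty strings here).

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard index: |intersection| / |union|."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def entity_pair_score(pred: dict, gold: dict, schema: list[str]) -> float:
    """Mean property-level similarity over the schema keys (a simplifying
    choice for the aggregation operator)."""
    scores = [jaccard(pred.get(p, ""), gold.get(p, "")) for p in schema]
    return sum(scores) / len(scores)

pred = {"pk_occupation": "businessman and philanthropist", "pk_country": "America"}
gold = {"pk_occupation": "businessman", "pk_country": "United States of America"}
print(entity_pair_score(pred, gold, ["pk_occupation", "pk_country"]))  # partial credit
```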
The Final Score
Finally, the scores are aggregated into a single number:
\[ \Psi(\mathcal{E}', \mathcal{E}) = \frac{1}{\mu} \bigoplus_{i,j}^{m,n} \mathbf{F}_{i,j} \cdot \psi_{\mathrm{ent}}\left( e'_i, e_j \right) \]

Here, \(\mu\) represents the normalization factor (usually the maximum size of either the predicted set or the target set). This ensures that if a model predicts 5 entities perfectly but misses 5 others, it is penalized appropriately.
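Continuing the sketch, the aggregation step fits in a few lines, assuming ⊕ is a plain sum and μ = max(m, n) as described above.

```python
import numpy as np

def aesop_like_score(F: np.ndarray, pair_scores: np.ndarray) -> float:
    """Sum the pairwise scores of matched entities and normalize by
    mu = max(m, n), so hallucinated predictions and missed entities both cost points."""
    mu = max(F.shape)
    return float((F * pair_scores).sum() / mu)

# Toy example: 2 predicted entities matched against 3 ground-truth entities.
F = np.array([[1, 0, 0],
              [0, 1, 0]])                     # assignment matrix from Phase 1
pair_scores = np.array([[0.90, 0.0, 0.0],
                        [0.0, 0.75, 0.0]])    # psi_ent for the matched pairs
print(aesop_like_score(F, pair_scores))       # (0.90 + 0.75) / 3 = 0.55
```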
The Solution: MuSEE Architecture
With the task defined and the grading rubric set, the authors propose a novel model: MuSEE (Multi-stage Structured Entity Extraction).
Using Large Language Models (LLMs) like T5 for information extraction is powerful, but simply asking an LLM to “output all JSON data” in one go leads to issues. The sequences get too long, causing the model to lose focus, hallucinate, or crash due to memory constraints.
MuSEE addresses this by decomposing the task into three distinct, parallelizable stages.

As illustrated in Figure 2, MuSEE utilizes an Encoder-Decoder architecture (specifically based on T5). Here is the breakdown of the pipeline:
1. Encode Once, Predict Many
The input text is fed into the Encoder only once. The resulting rich representation of the text is then reused across all three stages. This is a significant efficiency win compared to pipelines that require re-reading the text for every sub-task.
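The snippet below is a hedged sketch of this encode-once idea using Hugging Face's T5. It is not the authors' released code, and the stage-specific prompting that MuSEE adds on the decoder side is omitted for brevity; it only shows how the encoder's hidden states can be computed once and reused across decoding passes.

```python
# Sketch of "encode once, decode many" with Hugging Face's T5 (not MuSEE itself).
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

text = "Bill Gates co-founded Microsoft and works as a businessman."
inputs = tokenizer(text, return_tensors="pt")

# Run the (comparatively expensive) encoder exactly once and cache its hidden states.
encoder_outputs = model.get_encoder()(**inputs)

# Reuse the cached representation for several decoding passes
# (in MuSEE, each pass would also be conditioned on a stage-specific prompt).
for _ in range(3):
    out = model.generate(
        encoder_outputs=encoder_outputs,
        attention_mask=inputs.attention_mask,
        max_new_tokens=32,
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```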
Stage 1: Entity Identification
The model is first prompted to output only the unique names of the entities present in the text.
- Input: [Text Representation] + [Prompt: Find Entities]
- Output: "Bill Gates", "Microsoft"
This forces the model to perform coreference resolution immediately. It collapses mentions like “Bill”, “Gates”, and “he” into a single identifier: “Bill Gates.”
Stage 2: Type and Property Key Prediction
Now that the model knows who is in the text, it needs to determine what they are and what info is available.
- Input: [Text Representation] + [Prompt: {"Bill Gates"}]
- Output: ent_type_human, pk_country, pk_occupation
The Tokenization Innovation: Notice the output isn’t natural language like “Human” or “Country.” The authors introduce special tokens (e.g., pk_country). Instead of generating the word “Occupation” (which might be 2-3 tokens), the model generates a single token. This drastically reduces the sequence length, making the model faster and less prone to generation errors.
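Here is a small sketch of that idea using Hugging Face tokenizer utilities; it shows the mechanism (one added token per schema item) rather than the authors' exact implementation.

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Register one token per entity type and property key in the schema.
schema_tokens = ["ent_type_human", "ent_type_organization", "pk_country", "pk_occupation"]
tokenizer.add_tokens(schema_tokens)
model.resize_token_embeddings(len(tokenizer))   # allocate embeddings for the new tokens

print(tokenizer.tokenize("pk_occupation"))      # ['pk_occupation'] -- now a single token
```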
Stage 3: Property Value Prediction
Finally, the model extracts the actual values.
- Input: [Text Representation] + [Prompt: {"Bill Gates"} {ent_type_human} {pk_occupation}]
- Output: "Businessman"
Why Multi-Stage?
You might think running three stages is slower than one. However, MuSEE allows for batch processing.
- In Stage 2, the model can predict the schemas for “Bill Gates” and “Microsoft” simultaneously in parallel.
- In Stage 3, it can predict the values for “Country” and “Occupation” simultaneously.
By breaking a massive, complex generation task into small, focused sub-tasks, the model maintains high accuracy without sacrificing speed.
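A rough control-flow sketch of the pipeline is shown below. The decode helper is a hypothetical stand-in for a batched decoder call on top of the shared encoder output; it is stubbed with canned answers so the example runs end to end.

```python
def decode(prompts: list[str]) -> list[str]:
    """Hypothetical stand-in for one batched decoding pass over the shared
    encoder representation; stubbed with canned outputs for illustration."""
    canned = {
        "find entities": "Bill Gates | Microsoft",
        "Bill Gates": "ent_type_human, pk_country, pk_occupation",
        "Microsoft": "ent_type_organization, pk_country",
        "Bill Gates ent_type_human pk_country": "America",
        "Bill Gates ent_type_human pk_occupation": "Businessman",
        "Microsoft ent_type_organization pk_country": "America",
    }
    return [canned.get(p, "") for p in prompts]

# Stage 1: one pass yields the de-duplicated entity names.
entities = decode(["find entities"])[0].split(" | ")

# Stage 2: one *batched* pass predicts type + property keys for every entity at once.
schemas = {e: out.split(", ") for e, out in zip(entities, decode(entities))}

# Stage 3: one batched pass fills in every (entity, property) value in parallel.
queries = [f"{e} {schemas[e][0]} {pk}" for e in entities for pk in schemas[e][1:]]
values = decode(queries)
print(dict(zip(queries, values)))
```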
Experiments and Results
The authors tested MuSEE against several strong baselines, including:
- LM-JSON: A T5 model simply fine-tuned to output the raw JSON string (the “brute force” method).
- GenIE: A state-of-the-art generative model that extracts triplets using constrained decoding.
- GEN2OIE: A two-stage relation extraction model.
They utilized datasets like NYT, CoNLL04, REBEL, and a custom “Wikidata-based” dataset.
Effectiveness (AESOP Scores)
The results were decisive.

Looking at Table 1, MuSEE (both T5-Base and T5-Large versions) consistently achieves the highest AESOP scores.
- On the NYT dataset, MuSEE (T5-L) hit an AESOP score of 82.67, compared to GenIE’s 79.64.
- On the Wikidata-based dataset, the gap was even larger, with MuSEE scoring 50.94 versus GenIE’s 43.50.
This indicates that MuSEE is significantly better at the holistic capture of entity profiles, rather than just grabbing scattered facts.
Efficiency: The Speed Advantage
Perhaps the most practical finding for real-world application is the efficiency gain.

Figure 3 plots Effectiveness (y-axis) against Efficiency (x-axis, samples per second).
- The ideal model resides in the top-right corner (Fast and Accurate).
- MuSEE (T5-B) is the star on the far right. It processes over 50 samples per second, making it roughly 5x faster than GenIE and 10x faster than IMoJIE.
This speed is a direct result of the “Short Token” strategy and the parallel generation capability of the multi-stage pipeline.
Grounding and Hallucination
A common fear with Large Language Models is hallucination—generating facts that sound plausible but aren’t in the text. Since T5 is pre-trained on large web corpora that include plenty of encyclopedic text, there is a risk it might “remember” facts about Bill Gates rather than extracting them from the provided text.
To test this, the authors created a “Perturbed” dataset where they swapped facts (e.g., changing Bill Gates’ occupation in the text). If the model ignores the text and relies on memory, it will get the wrong answer.

Figure 4 shows the drop in performance when using perturbed data.
- Purple bars are original data.
- Pink bars are perturbed data.
Models like LM-JSON saw a massive drop in F1 score (Right chart), implying they were relying heavily on memorization. MuSEE, however, showed the most resilience (least performance drop), proving it is actually “reading” the text and extracting grounded information.
Conclusion
The shift from Triplet Extraction to Structured Entity Extraction represents a maturing of the Information Extraction field. By focusing on entities as complex objects, we align model outputs closer to the way humans understand the world—and the way databases store it.
The contributions of this paper are threefold:
- Formulation: Redefining the task to be entity-centric.
- Evaluation: The AESOP metric, which provides a flexible, rigorous way to grade complex entity profiles.
- Architecture: The MuSEE model, which proves that decomposing a hard task into parallel, schema-guided stages yields better accuracy and vastly superior speed.
For students and practitioners, MuSEE offers a blueprint for how to engineer LLM pipelines: don’t just ask the model to “do it all.” Structure the prompt, shorten the outputs, and parallelize the work. As we move toward automated Knowledge Base construction, these structured approaches will be the foundation of the next generation of intelligent data systems.