Language is a living, breathing thing. It changes constantly, often faster than our digital systems can keep up. Consider the baseball superstar Shohei Ohtani. A few years ago, calling him “The Angels’ Ace” was accurate. Today, referencing him requires new language like “The Dodgers’ number 17.”

For humans, this mental update is automatic. For Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems, it is a significant point of failure. If a user asks about “The Dodgers’ number 17,” but the knowledge base only recognizes Ohtani as an Angels player, the retrieval system fails to find the relevant documents. The result? The LLM hallucinates or provides outdated information.

In this post, we dive into DynamicER, a fascinating research paper from Seoul National University. The researchers identify a critical gap in current AI systems: the inability to link emerging mentions (new nicknames or descriptions) to dynamic entities (people or things that change over time). We will explore their new benchmark, DYNAMICER, and their proposed solution, TempCCA, which allows models to adapt to the shifting sands of language without needing full retraining.

The Problem: When Language Outpaces Knowledge

In the world of Natural Language Processing (NLP), Entity Linking (EL) is the task of connecting a mention in text (e.g., “The Tesla CEO”) to a unique entry in a Knowledge Base (e.g., the Wikipedia page for Elon Musk).
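To make the task concrete, here is a deliberately naive sketch of the static approach, using a toy, made-up knowledge base: an alias table mapping known surface forms to entity IDs. This is exactly the kind of fixed lookup that breaks once new mentions appear.

```python
# A toy knowledge base: entity ID -> (canonical name, known aliases).
# The ID and aliases here are illustrative, not from a real KB dump.
KB = {
    "E1": ("Elon Musk", {"The Tesla CEO", "The PayPal Co-Founder"}),
}

def link_by_alias(mention: str) -> str | None:
    """Naive static entity linking: match the mention against known aliases."""
    for entity_id, (_, aliases) in KB.items():
        if mention in aliases:
            return entity_id
    return None

print(link_by_alias("The Tesla CEO"))  # "E1"
print(link_by_alias("X Owner"))        # None: an emerging mention the KB never indexed
```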

Traditional EL assumes a static world. It assumes that the way we refer to entities is relatively constant. But in reality, attributes change. Elon Musk was once “The PayPal Co-Founder,” then “The Tesla CEO,” and more recently “The Twitter Owner” or “X Owner.”

This dynamic nature creates two specific hurdles for RAG systems:

  1. Lexical Variation: New mentions often look nothing like the entity’s name (e.g., “The Bronx Bombers” vs. “New York Yankees”).
  2. Temporal Ambiguity: A phrase like “The British Prime Minister” refers to different people depending on the year the text was written.

As illustrated below, the motivation is straightforward: as time progresses, new mentions emerge that previously trained models simply cannot resolve.

Figure 1: Motivation of our DYNAMICER benchmark. New mentions referring to the same entity are constantly created over time.

In the example above, a RAG system trying to answer “Where is the Dodgers’ number 17 from?” might fail if it hasn’t linked that specific new phrase to Shohei Ohtani.

Introducing DYNAMICER: A Benchmark for Evolving Entities

To solve this, the researchers first needed a way to measure it. They introduced DYNAMICER (Dynamic Entity Resolution for Emerging Mentions), a dataset specifically designed to test how well models handle new expressions over time.

They chose the sports domain (Soccer and Baseball) for data collection. Why sports? Because it is inherently volatile. Players transfer teams, new nicknames are coined weekly based on performance, and roles change (e.g., a player becoming a coach).

The dataset was constructed by scraping social media posts (Tumblr) over a period of time, using GPT-4 to identify candidate mentions, and then rigorously verifying them with human annotators. The result is a benchmark that is significantly different from previous Entity Linking datasets.
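The paper describes this pipeline at a high level; a rough sketch of its shape might look like the following, where `propose_mentions` stands in for the GPT-4 prompting step and `verify` for the human annotation pass (both are hypothetical stubs, not the authors’ code):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CandidateMention:
    text: str            # surface form found in a post, e.g., "The $700M Man"
    context: str         # surrounding post text
    timestamp: datetime  # when the post was published

def propose_mentions(post: str, posted_at: datetime) -> list[CandidateMention]:
    """Stand-in for the LLM step that proposes candidate entity mentions."""
    return []  # in the paper this is done by prompting GPT-4

def verify(candidate: CandidateMention) -> bool:
    """Stand-in for the human annotation step that accepts or rejects a candidate."""
    return False  # annotators confirm the mention really refers to the target entity

def collect(posts: list[tuple[str, datetime]]) -> list[CandidateMention]:
    """Scrape -> propose -> verify, keeping only human-approved mentions."""
    return [
        cand
        for text, ts in posts
        for cand in propose_mentions(text, ts)
        if verify(cand)
    ]
```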

Table 1: Comparison of DYNAMICER with existing entity linking benchmarks.

As shown in Table 1, while previous datasets like MedMentions or Reddit EL cover lexical variation in mentions, they lack Temporal Dynamics: none of them track entities whose attributes and nicknames evolve over time. DYNAMICER is unique because it tracks the emergence of mentions across sequential time segments.

The Core Method: TempCCA

The heart of the paper is the proposed method: Temporal Segmented Clustering with Continual Adaptation (TempCCA).

Standard approaches might try to link a mention directly to a static entity embedding. However, if the entity has changed (e.g., Ohtani is now a Dodger), the static embedding might be too dissimilar to the new mention.

TempCCA takes a different approach. It posits that we shouldn’t just look at the entity’s original definition. Instead, we should look at the cluster of mentions that have recently referred to that entity. If we know that last month people started calling Ohtani “The $700M Man,” we can use that information to help identify “The Dodgers’ New Star” this month.

1. The Architecture

The method uses a dual-encoder architecture. It treats the problem as a clustering task where mentions and entities are nodes in a graph.
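As a rough illustration of the dual-encoder idea, here is a minimal sketch using an off-the-shelf sentence encoder as a stand-in for the paper’s trained mention and entity encoders; the entity descriptions and mentions are made up for the example:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Off-the-shelf encoder as a stand-in for the trained mention/entity encoders.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

entities = {
    "Shohei Ohtani": "Shohei Ohtani. Japanese baseball player, pitcher and designated hitter.",
    "Declan Rice":   "Declan Rice. English footballer, defensive midfielder.",
}
mentions = ["The Dodgers' number 17", "Arsenal's new defensive midfielder"]

# Encode entity descriptions and mention strings (the full system also uses context).
entity_vecs = encoder.encode(list(entities.values()), normalize_embeddings=True)
mention_vecs = encoder.encode(mentions, normalize_embeddings=True)

# Dot products between the two sides give the affinities used for clustering.
scores = mention_vecs @ entity_vecs.T
for mention, row in zip(mentions, scores):
    best = list(entities)[int(np.argmax(row))]
    print(f"{mention!r} -> {best} (score {row.max():.2f})")
```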

Figure 2: An illustrative example of TempCCA.

As visualized in Figure 2 above:

  • Left (segment 05-06): In the first time step, we have clusters for players like Declan Rice (West Ham) and Mason Mount (Chelsea).
  • Right (segment 07-08): In the next time step, the entities have evolved. Declan Rice is now associated with Arsenal. TempCCA uses the resolved mentions from the previous step to update the entity’s representation for the current step.

2. Measuring Affinity

To cluster mentions with entities, the model needs to calculate how “similar” they are. The researchers define two affinity functions using dot products of embeddings.

Equations for affinity functions between entities and mentions, and between mentions and mentions.

  • \(\phi(e, m_i)\): Measures the similarity between an entity cluster \(e\) and a mention \(m_i\).
  • \(\psi(m_i, m_j)\): Measures the similarity between two mentions.

This allows the model to say, “This new mention is similar to this entity,” OR “This new mention is similar to this other mention we already linked to this entity.”
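The exact equations are rendered in the paper; under the dot-product reading described above, they plausibly take a form like the following, where \(\mathbf{Enc}_M\) is the mention encoder and \(\mathbf{u}_C(e)\) is the entity cluster representation defined in the next step (a reconstruction from the surrounding text, not the paper’s exact notation):

\[
\phi(e, m_i) \;=\; \mathbf{u}_C(e)^{\top}\,\mathbf{Enc}_M(m_i),
\qquad
\psi(m_i, m_j) \;=\; \mathbf{Enc}_M(m_i)^{\top}\,\mathbf{Enc}_M(m_j)
\]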

3. Continual Adaptation (The Update Rule)

This is the most critical innovation. At every time step, the representation of the entity is updated. It isn’t just the static Wikipedia embedding anymore; it becomes a blend of the original entity definition and the aggregate of all recent mentions linked to it.

Equation for updating the entity cluster representation.

In this equation:

  • \(\mathbf{Enc}_E(e)\) is the static encoding of the entity (e.g., from its name and description).
  • The summation averages the encodings of all mentions \(m_i\) in \(\mathcal{C}(e)\), the set of mentions that were linked to the entity in the previous time step.
  • \(\alpha\) is a hyperparameter that balances how much we trust the static definition versus the recent trends.
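Putting those pieces together, a plausible form of the update (reconstructed from the description above rather than copied from the paper, so the exact weighting may differ) is:

\[
\mathbf{u}_C(e) \;=\; \alpha\,\mathbf{Enc}_E(e) \;+\; (1-\alpha)\,\frac{1}{|\mathcal{C}(e)|}\sum_{m_i \in \mathcal{C}(e)} \mathbf{Enc}_M(m_i)
\]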

By continuously updating \(\mathbf{u}_C(e)\), the model “drifts” along with the entity. When Ohtani joins the Dodgers, the cluster absorbs “Dodgers”-related mentions, shifting the embedding space so that future Dodgers-related mentions are easier to link.
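A compressed sketch of that per-segment loop in plain NumPy, with `encode_mention` standing in for the trained mention encoder, `ALPHA` as an arbitrary illustrative value, and a simplified assignment rule (the actual method also uses mention-mention affinities when clustering):

```python
import numpy as np

ALPHA = 0.5  # illustrative value; in the paper this is a tuned hyperparameter

def adapt_entities(entity_static, entity_reps, segment_mentions, encode_mention):
    """One time segment of continual adaptation.

    entity_static:    {entity: static encoding Enc_E(e)}, never changes
    entity_reps:      {entity: current cluster representation u_C(e)}
    segment_mentions: mention strings observed in this time segment
    encode_mention:   callable mapping a mention string to a vector
    """
    names = list(entity_reps)
    rep_matrix = np.stack([entity_reps[n] for n in names])

    # 1) Resolve each new mention to the entity cluster with the highest affinity.
    linked = {n: [] for n in names}
    for mention in segment_mentions:
        m_vec = encode_mention(mention)
        scores = rep_matrix @ m_vec  # phi(e, m) for every entity
        linked[names[int(np.argmax(scores))]].append(m_vec)

    # 2) Blend each entity's static encoding with the average of the mention
    #    encodings just linked to it, so the representation drifts with usage.
    for name, vecs in linked.items():
        if vecs:
            entity_reps[name] = ALPHA * entity_static[name] + (1 - ALPHA) * np.mean(vecs, axis=0)
    return entity_reps
```

Calling this once per time segment, feeding each segment’s output back in as the next segment’s `entity_reps`, reproduces the drift described above.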

Experiments and Results

The researchers tested TempCCA against several state-of-the-art baselines, including ArboEL (a strong static entity linker) and SpEL. They split the dataset into time segments to simulate a real-world streaming scenario.

Entity Linking Performance

The results show that modeling temporal dynamics outperforms static entity linking.

Table 3: Results of the entity linking task by lexical similarity and time segment.

In Table 3, TempCCA (Ours) consistently achieves the highest accuracy across the different time segments.

A key finding is related to lexical similarity. The researchers broke down performance based on how much the mention text resembled the entity name. As expected, all models struggle when the mention looks nothing like the name (low Jaccard similarity). However, TempCCA showed significant gains in these difficult cases because it could leverage context from recent similar mentions rather than relying solely on the name match.
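For reference, Jaccard similarity here is just set overlap between the mention and the entity name (presumably over tokens; the paper may compute it slightly differently). A quick illustration of why the low-similarity bucket is hard:

```python
def jaccard(mention: str, entity_name: str) -> float:
    """Token-level Jaccard similarity between a mention and an entity name."""
    a, b = set(mention.lower().split()), set(entity_name.lower().split())
    return len(a & b) / len(a | b)

print(jaccard("The Bronx Bombers", "New York Yankees"))          # 0.0 -> hard case
print(jaccard("Shohei Ohtani of the Dodgers", "Shohei Ohtani"))  # 0.4 -> easier case
```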

Impact on Retrieval-Augmented Generation (RAG)

The ultimate goal of this work isn’t just to link entities, but to improve downstream tasks like Question Answering (QA). The researchers set up an Entity-Centric QA task where the questions used the tricky emerging mentions (e.g., “Who did The Dodgers’ number 17 play for previously?”).

They compared several setups:

  • LLM: Standard Llama-3 with no retrieval.
  • LLM-ER: The base LLM given the entity resolved by TempCCA, still without retrieval.
  • RaLM: Retrieval-Augmented Language Model (standard RAG).
  • RaLM-ER: RAG enhanced with the Entity Resolution from TempCCA (a minimal pipeline sketch follows below).
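Here is a minimal sketch of what that RaLM-ER flow could look like in code. The `resolve_entity`, `retrieve`, and `generate` callables are hypothetical stand-ins, not the paper’s implementation; the point is simply that resolution happens before retrieval and is also surfaced to the generator:

```python
def answer_with_ralm_er(question, resolve_entity, retrieve, generate, k=5):
    """Resolve emerging mentions first, then run standard retrieve-and-generate.

    resolve_entity(question) -> (mention, canonical_entity) or None
    retrieve(query, k)       -> list of document strings
    generate(prompt)         -> model answer string
    """
    query = question
    resolution_note = ""
    resolved = resolve_entity(question)
    if resolved is not None:
        mention, entity = resolved
        # Rewrite the query so the retriever searches for the canonical name,
        # e.g., "The Dodgers' number 17" -> "Shohei Ohtani".
        query = question.replace(mention, entity)
        resolution_note = f'Note: "{mention}" refers to {entity}.\n'

    docs = retrieve(query, k=k)
    prompt = (
        resolution_note
        + "Context:\n" + "\n".join(docs)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```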

Table 5: Results of entity-centric QA for each time segment in F1 scores.

Table 5 demonstrates a clear hierarchy:

  1. RaLM-ER (Ours) performs the best. By resolving the mention before retrieval, the system can search for “Shohei Ohtani” instead of the ambiguous “Number 17,” leading to better documents being found.
  2. Standard RaLM helps significantly over a base LLM but lags behind RaLM-ER because it often misses documents when the query contains a new slang term or nickname.
  3. LLM-ER (resolving the entity but not using retrieval) improves over the base LLM, proving that simply knowing who is being talked about helps the model hallucinate less.

Reducing Hallucination

One of the most dangerous behaviors of LLMs is confident hallucination. The study showed that resolving entities correctly serves as a guardrail.

Table 6: Comparison of RaLM and RaLM-ER performance in retrieval hits and misses.

Table 6 reveals an interesting nuance. When retrieval fails (Retrieval Miss), RaLM-ER still outperforms standard RaLM. Why? Because explicitly telling the LLM “This mention refers to Entity X” allows the model to rely on its internal parametric knowledge about Entity X, even if the external documents weren’t found. It grounds the generation.
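Concretely, on a retrieval miss the only difference between the two prompts is that explicit note, which is what lets the model fall back on what it already knows about the entity (illustrative prompt strings, not the paper’s exact templates):

```python
question = "Where is the Dodgers' number 17 from?"

# Standard RaLM on a retrieval miss: the model only sees the ambiguous mention.
ralm_prompt = f"Context:\n(no relevant documents found)\n\nQuestion: {question}\nAnswer:"

# RaLM-ER on the same miss: the resolved entity is stated explicitly, so the
# model can answer from its parametric knowledge of Shohei Ohtani.
ralm_er_prompt = (
    'Note: "the Dodgers\' number 17" refers to Shohei Ohtani.\n' + ralm_prompt
)
```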

Case Study: Seeing the Difference

To make this concrete, let’s look at a specific example where the models diverge.

Table 19: Case study showing RaLM-ER correctly identifying Trent Alexander-Arnold.

In this example (Table 19), the question asks about a record set by “Trentnation” (a nickname for Trent Alexander-Arnold).

  • Standard RaLM fails to retrieve relevant info because it likely searches for “Trentnation,” finding nothing or irrelevant noise. It answers “None.”
  • RaLM-ER successfully links “Trentnation” to “Trent Alexander-Arnold.” It then retrieves the correct documents regarding his FA Cup victory and answers correctly that he became the youngest player to lift the trophy.

Conclusion and Implications

The DynamicER paper highlights a fundamental truth about AI and language: models cannot remain static in a dynamic world. As culture moves, language shifts, and “common knowledge” changes.

By introducing TempCCA, the authors provide a robust method for keeping up with these changes. The technique of updating entity embeddings based on the “cluster” of recent mentions mimics how humans learn—we update our mental models of people based on what we hear about them today, not just what we knew about them a year ago.

For students and practitioners working on RAG pipelines, the takeaway is clear: Retrieval is only as good as your query understanding. If a user queries with a new term that your vector database hasn’t indexed, retrieval will fail. Dynamic Entity Resolution is a promising layer to add to your stack to bridge that gap.

Key Takeaways

  • Emerging Mentions cause RAG failure because standard retrievers miss documents that don’t match new nicknames.
  • DYNAMICER is a new benchmark for testing how models handle time-sensitive entity changes in sports.
  • TempCCA uses continuous clustering to update entity representations, allowing the model to adapt to new vocabulary without full retraining.
  • Entity Resolution acts as a crucial pre-processing step in RAG, improving retrieval accuracy and reducing hallucinations.