Imagine you are a historian analyzing the political climate of the 1960s. You have digitized millions of newspaper pages from that era. You want to track the media coverage of “John Kennedy.”

It sounds simple, but computationally, it is a nightmare.

In one article, “John Kennedy” refers to the 35th U.S. President. In another, published in a local Louisiana paper, “John Kennedy” refers to a state politician. In a third, he is referred to simply as “Jack.” To a human reader, context clues usually make the distinction clear. To a machine, these are just strings of characters.

These problems are known as Entity Disambiguation (ED) and Coreference Resolution. While modern Natural Language Processing (NLP) has made great strides in both areas, it often fails when applied to historical texts. Why? Because history is full of people who don’t have Wikipedia pages today (Out-of-Knowledge-Base entities), and historical documents are filled with OCR (Optical Character Recognition) noise and outdated phrasing.

In this post, we will take a deep dive into a fascinating paper, Contrastive Entity Coreference and Disambiguation for Historical Texts, which proposes a scalable, high-accuracy solution to this specific problem. We will explore how the researchers used contrastive learning, massive-scale “hard negative” mining, and a clever bi-encoder architecture to build a system that understands history better than standard state-of-the-art models.

The Problem: Why Historical Text is Hard

Before we look at the solution, we need to understand why existing tools fall short.

Most modern Entity Disambiguation models, like BLINK or GENRE, are trained on modern internet data (like Wikipedia or Common Crawl). They operate under the assumption that the entity mentioned likely exists in their knowledge base.

Historical archives present three specific challenges:

  1. The “Forgotten” People: Historical newspapers are full of individuals who were famous locally or temporarily but are not recorded in modern knowledge bases like Wikipedia. If a model tries to force a link between a local 1950s mayor and a modern Wikipedia entry, it hallucinates a connection.
  2. Hard Negatives: History is repetitive. Names like “John Smith” or “Robert Kennedy” (the politician vs. the Scottish footballer) appear constantly. Distinguishing between two people with the same name requires subtle contextual understanding.
  3. Scale: Historical archives are massive. The U.S. National Archives alone holds billions of documents. Any solution needs to be computationally efficient; we cannot afford to run heavy, slow models on billions of tokens.

The Foundation: WikiConfusables

The first major contribution of this research is not the model itself, but the data used to teach it. To train a model to distinguish between “John F. Kennedy” and “John Kennedy (Louisiana politician),” the model needs to see examples of these confusing pairs.

The researchers created a massive dataset called WikiConfusables.

Standard training usually involves “easy negatives.” For example, pairing “John F. Kennedy” with “The Eiffel Tower.” It is easy for a model to learn that these are different. It is much harder—and more educational for the AI—to distinguish “John F. Kennedy” from “John F. Kennedy Jr.”

The researchers mined Wikipedia disambiguation pages (lists of people with similar names) and Wikidata family trees to generate hard negatives. They over-sampled family members (fathers and sons with the same name) because this is a frequent source of confusion in historical texts.
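
As a rough illustration of the idea (not the authors’ actual pipeline; the page contents below are invented), mining confusable pairs from disambiguation pages can be as simple as pairing up every entity that shares a page:

```python
from itertools import combinations

# Hypothetical parsed disambiguation pages: each shared surface form maps to
# the distinct Wikipedia entities listed on that page.
disambiguation_pages = {
    "John Kennedy": [
        "John F. Kennedy",
        "John F. Kennedy Jr.",
        "John Kennedy (Louisiana politician)",
    ],
}

def mine_hard_negatives(pages):
    """Yield (surface_form, entity_a, entity_b) hard-negative pairs."""
    for surface_form, entities in pages.items():
        for a, b in combinations(entities, 2):
            yield surface_form, a, b

for surface, a, b in mine_hard_negatives(disambiguation_pages):
    print(f"{surface}: {a}  vs.  {b}")
```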

Table 1: Statistics on dataset size.

As shown in Table 1, the scale of this dataset is immense. The coreference training set alone includes over 179 million pairs. This massive volume of “confusable” data forces the model to look closely at the context surrounding a name, rather than just matching the string of text.

The Core Method: Contrastive Bi-Encoders

Now, let’s look at the architecture. The researchers aimed for a solution that was accurate, scalable, and capable of handling entities not found in Wikipedia. They settled on a contrastively trained bi-encoder architecture.

What is a Bi-Encoder?

A bi-encoder consists of two independent neural networks (encoders).

  1. Mention Encoder: Takes the sentence from the newspaper containing the name (e.g., “Kennedy spoke at the rally…”).
  2. Entity Encoder: Takes the description from the knowledge base (e.g., the first paragraph of JFK’s Wikipedia page).

Each encoder outputs a vector (a list of numbers) representing the semantic meaning of the text. If the mention and the entity refer to the same person, their vectors should be very close to each other in mathematical space. If they are different people, the vectors should be far apart.
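
Here is a minimal sketch of that setup using the sentence-transformers library. The paper trains dedicated mention and entity encoders on WikiConfusables; this sketch simply reuses an off-the-shelf MPNet model on both sides, and the texts are invented examples.

```python
from sentence_transformers import SentenceTransformer, util

# Off-the-shelf MPNet model standing in for the paper's trained encoders.
model = SentenceTransformer("all-mpnet-base-v2")

# Mention side: the newspaper sentence containing the name.
mention = "Kennedy spoke at the rally in Baton Rouge about the state budget."

# Entity side: short knowledge-base descriptions (e.g., the first Wikipedia paragraph).
entities = [
    "John F. Kennedy was the 35th president of the United States.",
    "John Kennedy is an American politician from Louisiana.",
]

mention_vec = model.encode(mention, convert_to_tensor=True)
entity_vecs = model.encode(entities, convert_to_tensor=True)

# Cosine similarity: the closer a mention vector is to an entity vector,
# the more likely they refer to the same person.
print(util.cos_sim(mention_vec, entity_vecs))
```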

This is where Contrastive Learning comes in. During training, the model is fed the massive WikiConfusables dataset. It is penalized if it places the vector for “John F. Kennedy” too close to “John Kennedy (Louisiana)” and rewarded if it places it close to the correct Wikipedia entry.
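
In code, a contrastive objective of this kind often looks like an in-batch cross-entropy over similarities, where each mention’s correct entity sits on the diagonal and every other entity in the batch (including WikiConfusables hard negatives) acts as a negative. The exact loss and temperature used in the paper may differ; this is a generic sketch.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(mention_embs: torch.Tensor,
                     entity_embs: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """Row i of mention_embs should match row i of entity_embs;
    all other entities in the batch act as negatives."""
    mention_embs = F.normalize(mention_embs, dim=-1)
    entity_embs = F.normalize(entity_embs, dim=-1)
    logits = mention_embs @ entity_embs.T / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))               # correct entity on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random vectors standing in for encoder outputs.
print(contrastive_loss(torch.randn(8, 768), torch.randn(8, 768)).item())
```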

The Pipeline: Coreference First, Disambiguation Second

The researchers introduce a two-step approach: LinkMentions (for coreference) and LinkWikipedia (for disambiguation).

Figure 1: Model architecture of LinkWikipedia.

Let’s walk through the architecture diagram in Figure 1 above, as this visualizes the entire process.

1. Input (Top Left): We start with a historical news snippet mentioning “Kennedy” four times. Note that mentions 1, 2, and 3 might refer to different people, or the same person.

2. LinkMentions Coreference Model (Middle Left): First, the system does not look at Wikipedia at all; it looks only at the document. It uses the LinkMentions model to embed all four mentions and then performs Agglomerative Clustering, grouping mentions that appear to refer to the same person based on context (a minimal sketch of this step follows the cluster list below).

  • Cluster {1, 3}: The model determines these refer to the same “Kennedy.”
  • Cluster {2}: This is a different “Kennedy.”
  • Cluster {4}: Yet another unique individual.
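
A minimal sketch of this clustering step with scikit-learn (version 1.2 or later for the metric argument); the embeddings are random stand-ins and the distance threshold is an illustrative value, not the paper’s setting:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Stand-in embeddings for the four "Kennedy" mentions, as produced by LinkMentions.
mention_embeddings = np.random.randn(4, 768)
mention_embeddings /= np.linalg.norm(mention_embeddings, axis=1, keepdims=True)

# Merge mentions whose embeddings are close in cosine distance.
clustering = AgglomerativeClustering(
    n_clusters=None,
    metric="cosine",
    linkage="average",
    distance_threshold=0.3,  # illustrative, not the paper's threshold
)
labels = clustering.fit_predict(mention_embeddings)
print(labels)  # e.g. [0, 1, 0, 2] would correspond to clusters {1, 3}, {2}, {4}
```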

3. Prototype Generation: For each cluster, the system creates a “Query Prototype” by averaging the embeddings of the mentions in that cluster. This creates a cleaner, stronger signal than a single mention would provide.
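
Concretely, building a query prototype can be as simple as averaging the cluster’s mention embeddings and re-normalising (again with random vectors as stand-ins):

```python
import numpy as np

mention_embeddings = np.random.randn(4, 768)          # stand-in mention embeddings
clusters = {"A": [0, 2], "B": [1], "C": [3]}          # clusters from the previous step

prototypes = {}
for name, members in clusters.items():
    proto = mention_embeddings[members].mean(axis=0)  # average the cluster's members
    prototypes[name] = proto / np.linalg.norm(proto)  # re-normalise to unit length
```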

4. LinkWikipedia Disambiguation (Right Side): Now, the system searches the Knowledge Base (Wikipedia). The LinkWikipedia model compares the Query Prototypes against millions of pre-computed embeddings of Wikipedia entities (labeled a, b, c, d in the diagram); a sketch of this retrieval step and the acceptance threshold follows the output list below.

  • It retrieves the nearest neighbors.
  • It calculates a “Rank Score.”

5. The Output (Bottom):

  • Mention #1 is linked to Entity ‘a’ (Robert F. Kennedy) with high confidence.
  • Mention #2 is linked to Entity ‘b’ (JFK).
  • Crucially, Mention #4 is marked as OOKB (Out-of-Knowledge-Base). Its score (0.4) fell below the acceptance threshold, meaning the model correctly recognized that this particular “Kennedy” (perhaps the Scottish footballer) either is not the person discussed in the political context or simply is not in the knowledge base.
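
A rough sketch of this retrieval-and-threshold logic, using FAISS as the nearest-neighbour index (whether the authors use FAISS, what the rank score actually is, and the 0.5 threshold are all assumptions here; plain cosine similarity stands in for the paper’s rank score):

```python
import faiss
import numpy as np

dim = 768

# Entity embeddings are encoded once and reused for every document.
# A toy index of 10,000 entities; in practice this would be millions.
entity_embeddings = np.random.randn(10_000, dim).astype("float32")
faiss.normalize_L2(entity_embeddings)

index = faiss.IndexFlatIP(dim)  # inner product on unit vectors = cosine similarity
index.add(entity_embeddings)

# Query prototypes from the coreference step, also unit-normalised.
prototypes = np.random.randn(3, dim).astype("float32")
faiss.normalize_L2(prototypes)

scores, ids = index.search(prototypes, 1)  # nearest Wikipedia entity per prototype

THRESHOLD = 0.5  # illustrative acceptance threshold
for cluster, (score, entity_id) in enumerate(zip(scores[:, 0], ids[:, 0])):
    if score >= THRESHOLD:
        print(f"cluster {cluster} -> entity {entity_id} (score {score:.2f})")
    else:
        print(f"cluster {cluster} -> OOKB (score {score:.2f} below threshold)")
```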

This architecture allows the system to process billions of documents efficiently because the Wikipedia entities only need to be encoded once.

The Benchmark: Entities of the Union

To prove their model works on history, the researchers couldn’t rely on standard modern datasets. They created a new high-quality benchmark called Entities of the Union (EotU).

They took 157 newswire articles from the 1950s and 60s, specifically from the days surrounding the State of the Union addresses. These articles were hand-labeled to create a “Gold Standard” ground truth.

Table 2: Entity and people mentions across different benchmarks.

Table 2 compares this historical benchmark to other common datasets. While EotU might look smaller in total mentions, notice the People Mentions column. It is highly focused on people, which is the hardest category to disambiguate. Furthermore, unlike standard datasets where every entity exists in the knowledge base, EotU contains many people who do not have Wikipedia pages, making it a much more realistic test of historical analysis.

Experiments & Results

So, how did the model perform?

The researchers compared their approach (LinkWikipedia and a version fine-tuned on news called LinkNewsWikipedia) against the heavyweights of the field: BLINK, GENRE, and ReFinED.

Beating the State-of-the-Art in History

Table 3: Benchmark performance comparison across different methods. The first row evaluates on all entities in Entities of the Union, whereas the second row only considers in-knowledge-base entities.

Table 3 shows the results. The numbers are striking:

  • Overall Accuracy: The fine-tuned model (LinkNewsWikipedia) achieved 78.3% accuracy on the historical dataset. The closest competitor, ReFinED, only reached 65.4%.
  • In-Knowledge-Base Accuracy: Even when we ignore the difficult “forgotten” people and only look at those who do have Wikipedia pages, the new model still wins (89.0% vs 80.9% for GENRE).

This proves that training on “hard negatives” (WikiConfusables) significantly boosts the model’s ability to understand subtle context in historical writing.

Analyzing the Components (Ablation)

The researchers also performed an ablation study to see which parts of their pipeline mattered most.

Table 4: Ablation study results.

(On modern benchmarks the picture is more mixed: ACE2004 contains very few people mentions, and on some other modern benchmarks certain models perform better, though this approach remains in the range of the other models.)

Looking at Table 4, we can see the impact of each step:

  1. Base MPNet: Using a standard off-the-shelf model yields terrible results (26.5%).
  2. Disambiguation Only: Training the model helps massively, jumping to 69.1%.
  3. Add Coref: This is the game changer. Adding the coreference step (clustering mentions before linking) boosts accuracy to 77.9%.

This validates the hypothesis that resolving coreferences within the document before trying to link to Wikipedia provides a much cleaner signal for the AI.

Applying the Model: A Century of News

Finally, to demonstrate the power of this tool, the researchers applied it to a massive corpus of 2.7 million historical newswire articles spanning from 1878 to 1977.

They were able to disambiguate over 15 million person mentions. This allows for quantitative analysis of history that was previously impossible.

Figure 2: Mentions over time of entities that appeared most commonly in newswire articles.

Figure 2 visualizes the rise and fall of historical figures. We can see the clear “heartbeat” of American democracy in the spikes of mentions for Presidents Eisenhower, Truman, and Nixon around election cycles. We can also see how heavily mentions of Hitler dominated coverage during the 1940s.

But the model allows for more than just counting famous names. It allows us to analyze prominence.

Figure 3: Mentions against Wikipedia Qrank.

Figure 3 shows the correlation between how often a person was mentioned in historical newspapers (x-axis) and their popularity on Wikipedia today (Qrank, y-axis); a small sketch of this kind of analysis follows the bullets below.

  • The Trend: Generally, people who were famous then are famous now (the upward slope).
  • The Outliers: The scatter also hints at people who were heavily mentioned in the past but have lower prominence today—the “forgotten” figures of history that this model is uniquely suited to rediscover.
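
For readers who want to reproduce this kind of analysis on their own disambiguated corpus, a sketch might look like the following (the column names and numbers are entirely hypothetical):

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical per-person aggregates: historical mention counts and present-day Qrank.
df = pd.DataFrame({
    "person": ["Dwight D. Eisenhower", "Harry S. Truman", "A once-famous local mayor"],
    "newswire_mentions": [120_000, 95_000, 40_000],
    "qrank": [9_500_000, 7_800_000, 12_000],
})

# Rank correlation between past coverage and present-day prominence.
rho, _ = spearmanr(df["newswire_mentions"], df["qrank"])
print(f"Spearman rho = {rho:.2f}")

# Candidate "forgotten" figures: lots of historical coverage relative to today's Qrank.
df["mentions_per_qrank"] = df["newswire_mentions"] / df["qrank"]
print(df.sort_values("mentions_per_qrank", ascending=False).head())
```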

Conclusion and Implications

This research bridges a crucial gap between modern AI capabilities and the needs of historians and social scientists. By creating a massive dataset focused on “confusable” entities and utilizing a contrastive bi-encoder architecture, the authors have built a tool that can navigate the messy, noisy, incomplete world of historical text.

Key takeaways for students and researchers:

  1. Data Quality Matters: The creation of WikiConfusables (with hard negatives) was just as important as the model architecture. Random negatives are often insufficient for fine-grained tasks.
  2. Coreference is Key: Grouping mentions within a document before external retrieval significantly improves accuracy.
  3. OOKB Handling: For real-world applications, your model must know when to say “I don’t know” or “This person isn’t in the database.”

This work opens the door for massive-scale quantitative analysis of history, allowing us to trace the lives, careers, and media coverage of millions of individuals who have shaped our world, whether they have a Wikipedia page or not.