If you read the sentence, “Michael Jordan published a new paper on machine learning,” who do you picture?
If you are like most people—and more importantly, like most Machine Learning models—you probably immediately thought of the basketball legend, #23 of the Chicago Bulls. But you would be wrong. The sentence refers to Michael I. Jordan, a renowned computer science professor at UC Berkeley.
This specific problem is known as overshadowing in Natural Language Processing (NLP). When an ambiguous name (like “Michael Jordan”) is shared by a very popular entity and a less common one, models almost exclusively predict the popular one, ignoring the context clues that suggest otherwise.
In this post, we are going to do a deep dive into a fascinating research paper: Efficient Overshadowed Entity Disambiguation by Mitigating Shortcut Learning. We will explore why modern AI models are “lazy learners,” how they take shortcuts that hurt accuracy, and how a new method called Counterfactual Training (CFT) forces them to actually read the context—achieving state-of-the-art results without sacrificing speed.
The Problem: Entity Disambiguation and Shortcuts
Entity Disambiguation (ED) is the task of linking a mention in text (e.g., “Jaguar”) to the correct entry in a Knowledge Base (e.g., Jaguar Cars vs. Jaguar the animal).
For an ED model to be robust, it needs two key properties:
- Context-awareness: It must look at the surrounding words (“speeding down the highway” vs. “hunting in the jungle”) to decide the identity.
- Scalability: It needs to do this quickly, processing thousands of queries per second.
Current models are generally good at this, but they have a fatal flaw: Shortcut Learning.
The Lazy Model Hypothesis
Deep learning models are optimization machines. Their goal is to minimize loss (error) by any means necessary. Often, the easiest way to minimize error is to memorize correlations between the surface form of a word (the name itself) and the most likely label, completely ignoring the context.
If 99% of the “Michael Jordan” examples in the training data refer to the basketball player, the model learns a simple heuristic: If I see “Michael Jordan,” predict the athlete. This is a “spurious correlation.” The mention text is a spurious feature, while the surrounding sentence is the intended feature.

Figure 1 above illustrates this causal graph.
- \(X_m\) (Mention Surface): The name “Michael Jordan.” This is the spurious feature.
- \(X_c\) (Mention Context): The rest of the sentence (“published a new paper…”). This is the intended feature.
- \(E\) (Predicted Entity): The output.
As the diagram shows, models often bypass the context (\(X_c\)) entirely, relying solely on the strong link between the name (\(X_m\)) and the popular entity (Gold Label Q41421). This results in the “overshadowed” entity (the professor) being ignored, leading to 0.00% prediction probability for the correct answer.
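To make the shortcut concrete, here is a toy Python sketch (invented for illustration, not from the paper) of what such a "lazy" predictor amounts to: it memorizes mention-to-entity frequencies from training data and never reads the context it is given.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: the popular entity dominates the training examples.
train_examples = [
    ("Michael Jordan", "basketball_player"),
    ("Michael Jordan", "basketball_player"),
    ("Michael Jordan", "basketball_player"),
    ("Michael Jordan", "professor"),
]

# Count how often each mention string links to each entity.
prior = defaultdict(Counter)
for mention, entity in train_examples:
    prior[mention][entity] += 1

def shortcut_predict(mention, context):
    """A 'lazy' predictor: returns the most frequent entity for the mention
    string and never looks at the context argument at all."""
    return prior[mention].most_common(1)[0][0]

# The context clearly suggests the professor, but the shortcut still
# predicts the athlete.
print(shortcut_predict("Michael Jordan", "published a new paper on machine learning"))
# -> basketball_player
```

A context-aware model should do better than this baseline; the problem is that, on skewed training data, a neural model can quietly converge to almost exactly this behavior.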
Previous Solutions vs. The New Approach
The problem of overshadowing isn’t new. Previous state-of-the-art methods, such as KBED (Knowledge Base Entity Disambiguation), tried to fix this by “reasoning.” They would extract all entities in a document and check a Knowledge Base to see how they relate to each other.
While KBED works well, it is computationally heavy. It turns a fast retrieval task into a complex logic puzzle during inference (prediction time). The authors of this paper argue that we don’t need complex reasoning during inference; we just need to stop the model from cheating during training.
The Solution: Counterfactual Training (CFT)
The researchers propose Counterfactual Training (CFT). The core idea is simple but brilliant: if the model is cheating by looking at the name (\(X_m\)), let’s hide the name during training.
By masking the entity mention, we create a “counterfactual” scenario. We ask the model: “If you couldn’t see the name ‘Michael Jordan’, but only saw ‘…published a new paper on machine learning’, who would this be?”
This forces the model to learn the connection between the context and the entity, breaking the spurious correlation with the name.
System Overview
Let’s look at how this fits into the model architecture.

As shown in Figure 2, the standard training loop (left) allows the model to see the Spurious Feature (\(X_m\)). The CFT loop (right) applies an intervention called do_mask_mention. It replaces the name with [MASK] tokens.
The model is then trained to predict the correct entity using only the context (\(X_c\)). This effectively “de-biases” the model.
The Mathematics of CFT
Let’s break down the math to see exactly how this is implemented.
1. The Intervention
First, we create a counterfactual example \(\hat{X}\) by masking the mention surface.
![Equation 1 showing the masking operation where mention tokens are replaced with [MASK].](/en/paper/file-3007/images/003.jpg#center)
If a word \(w_i\) is part of the mention (\(X_m\)), it becomes a [MASK] token. If it is part of the context (\(X_c\)), it remains unchanged.
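As a rough Python sketch of this intervention (the function name echoes the do_mask_mention operation in Figure 2, but the whitespace tokenization and span handling here are simplifying assumptions, not the paper's implementation):

```python
MASK_TOKEN = "[MASK]"  # assuming a BERT-style vocabulary

def do_mask_mention(tokens, mention_span):
    """Builds the counterfactual input: tokens inside the mention span (X_m)
    are replaced by [MASK], while context tokens (X_c) are left unchanged.

    tokens:       list of tokens for the sentence
    mention_span: (start, end) indices of the mention, end exclusive
    """
    start, end = mention_span
    return [
        MASK_TOKEN if start <= i < end else tok
        for i, tok in enumerate(tokens)
    ]

tokens = ["Michael", "Jordan", "published", "a", "new", "paper",
          "on", "machine", "learning"]
print(do_mask_mention(tokens, (0, 2)))
# -> ['[MASK]', '[MASK]', 'published', 'a', 'new', 'paper', 'on', 'machine', 'learning']
```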
2. The Standard Objective (\(\mathcal{L}_{ED}\))
In a typical training setup, we want to minimize the difference between the prediction and the ground truth. This is the standard Entity Disambiguation loss, which takes the familiar cross-entropy form over the gold entity:

\[\mathcal{L}_{ED} = -\log P_f(E \mid X_m, X_c)\]
Here, the model \(f\) uses both the name \(X_m\) and context \(X_c\) to make a prediction \(E\).
3. The Counterfactual Objective (\(\mathcal{L}_{CFT}\))
To this standard loss, the researchers add a regularization term. It forces the model to predict the correct entity \(\hat{E}\) from the masked input \(\hat{X}_m\) and the context \(X_c\):

\[\mathcal{L}_{CFT} = -\log P_f(\hat{E} \mid \hat{X}_m, X_c)\]
4. The Final Training Objective
Finally, these two objectives are combined into a single training loss:

\[\mathcal{L} = \mathcal{L}_{ED} + \mu \, \mathcal{L}_{CFT}\]
The parameter \(\mu\) (mu) controls how much weight is given to the counterfactual loss.
- If \(\mu\) is 0, we have a standard model (prone to shortcuts).
- If \(\mu\) is too high, the model might learn to ignore the mention entirely (which is also bad, because the name itself still carries useful information, narrowing the candidates to entities called “Jordan”).
The researchers found that a small \(\mu\) is sufficient to regularize the model, teaching it to trust context without forgetting the name entirely.
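Putting the pieces together, the combined objective could be computed roughly as follows. This is a minimal PyTorch-style sketch under the assumption that the model produces logits over candidate entities for both the original input and its masked counterpart; it is illustrative, not the actual ReFinED/CFT training code.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits_full, logits_masked, gold_entity_id, mu=0.1):
    """L = L_ED + mu * L_CFT, given the model's candidate-entity logits for
    the original input (mention visible) and for the masked counterfactual
    input (mention replaced with [MASK])."""
    gold = torch.tensor([gold_entity_id])
    loss_ed = F.cross_entropy(logits_full, gold)     # standard ED loss
    loss_cft = F.cross_entropy(logits_masked, gold)  # counterfactual loss
    return loss_ed + mu * loss_cft

# Toy usage with made-up logits over 3 candidate entities:
logits_full = torch.tensor([[2.0, 0.5, -1.0]])
logits_masked = torch.tensor([[1.2, 0.9, -0.5]])
print(combined_loss(logits_full, logits_masked, gold_entity_id=0))
```

Note that the counterfactual term only changes training; at inference time the model sees the normal, unmasked input and runs exactly as before.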
Experiments and Results
To prove that CFT works, the authors compared it against several baselines, including:
- ReFinED: A strong, efficient baseline model.
- KBED: The previous state-of-the-art for overshadowed entities (which uses slow reasoning).
- Focal Loss, Entity Masking (EM), Counterfactual Inference (CFI): Other debiasing techniques from computer vision and relation extraction.
They tested on Standard Datasets (long news articles with rich context) and Challenge Datasets (short text like Tweets or questions).
1. Performance on Standard Datasets
The results on standard datasets (AIDA, MSNBC, etc.) are summarized below. The entities are split into Sha (Shadow/Overshadowed) and Top (Common/Popular).

Key Takeaways from Table 1:
- CFT vs. ReFinED: CFT significantly improves performance on Overshadowed entities (Sha column) compared to the base ReFinED model (83.8 vs 79.4 on AIDA).
- CFT vs. KBED: CFT outperforms the complex KBED method on overshadowed entities (83.8 vs 82.2) while maintaining high accuracy on common entities.
- Inference Rate (Q/s): Look at the last column. ReFinED and CFT both run at 3.3 queries per second, while KBED drops drastically to 0.6 Q/s. This confirms that CFT improves accuracy without slowing the model down at inference time.
2. Performance on Challenge Datasets (Limited Context)
What happens when the text is very short, like a tweet?

Table 2 shows that CFT is incredibly robust. On datasets like TWEEKI and MINTAKA, where context is scarce, CFT outperforms all other methods. This is surprising and highlights that even in short sentences, maximizing the utility of the few available context words is crucial.
3. Scalability and Speed
One of the paper’s strongest arguments is efficiency. Complex reasoning methods (like KBED) get slower as the number of mentions in a document increases because they have to check relationships between all of them.

Figure 3 organizes queries into “octiles” based on how many mentions they contain (Octile 8 has the most mentions).
- Dark Green (CFT): The time per query increases slightly but remains low (under 1 second even for complex queries).
- Grey (KBED): The time explodes as complexity increases (over 3 seconds).
This proves that CFT is scalable. It shifts the “hard work” to the training phase. Once trained, the model is just as fast as a standard, “lazy” model, but much smarter.
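To make the scaling argument concrete, here is a small back-of-the-envelope illustration. It assumes the reasoning-style approach checks Knowledge Base relations between every pair of mentions, which is a simplification of how such methods work, not a measurement from the paper.

```python
# Rough, illustrative arithmetic: units of work per document when scoring
# each mention independently (CFT-style) versus checking relations between
# every pair of mentions (reasoning-style).
for n_mentions in [2, 8, 32]:
    per_mention = n_mentions                        # one scoring pass each: O(n)
    pairwise = n_mentions * (n_mentions - 1) // 2   # one KB check per pair: O(n^2)
    print(f"{n_mentions:>2} mentions: {per_mention:>2} scoring passes "
          f"vs {pairwise:>3} pairwise checks")
```

The quadratic growth of the pairwise term is what shows up as the steep latency curve in the higher octiles of Figure 3.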
4. Tuning the Hyperparameter (\(\mu\))
How much should we mask? The researchers tuned the \(\mu\) parameter (the weight of the counterfactual loss).

As Figure 4 shows, performance peaks around \(\mu = 0.1\).
- Left Graph (Overshadowed): There is a clear improvement over the baseline (dashed line) when CFT is introduced.
- Right Graph (Overall): Interestingly, overall performance also improves. This suggests that relying on context helps with common entities too, not just overshadowed ones.
Why does this matter? Qualitative Analysis
To understand the real-world impact, the authors provide examples where CFT succeeds where others fail.
Success Case:
Text: “…An Air Afrique Boeing-727 jet was the third passenger liner… Lagos Guardian newspaper reported…”
- Goal: Identify “Guardian” (The Nigerian newspaper).
- Prior/Shortcut: The British “The Guardian” (common entity).
- KBED: Failed. It looked for relations but got confused.
- CFT: Succeeded. It focused on the context “Lagos” and “Air Afrique” to correctly identify the Nigerian paper.
Success Case (Science):
Text: “…The smoke is vaporized wax…”
- Goal: Identify “Vaporization” (Phase transition).
- Prior/Shortcut: “Evaporation” (Similar concept, often confused).
- CFT: Succeeded by paying close attention to the scientific context, which distinguishes boiling/vaporizing from evaporating.
These examples illustrate that by removing the crutch of the entity name, the model becomes a more attentive reader.
Conclusion
The paper Efficient Overshadowed Entity Disambiguation by Mitigating Shortcut Learning offers a compelling lesson for machine learning practitioners: Harder inference isn’t always the answer; better training is.
By diagnosing the root cause of the error—shortcut learning—the researchers designed a training intervention (Counterfactual Training) that forces the model to learn robust features.
The Highlights:
- Solves Overshadowing: Successfully identifies rare entities that share names with famous ones.
- Zero Inference Cost: Unlike reasoning-based methods that slow down production systems, CFT is free at inference time.
- Simple Implementation: It requires modifying the loss function and data loading, not the model architecture itself.
As models become larger and more complex, techniques like CFT remind us that how we teach the model is just as important as the model itself. By occasionally “blinding” our models during training, we teach them to see more clearly.