Imagine you are editing a Wikipedia article about a 1950s actress. You want to add a link to the page for “Private School” because it is relevant to her early life. You scan the text. The words “Private School” do not appear anywhere in the article.

What do you do? You don’t just give up. You write a new sentence—perhaps, “She also worked at a private school”—and insert it into the biography.

This specific action represents a massive, often overlooked challenge in Natural Language Processing (NLP). For years, researchers have focused on Entity Linking (connecting an existing name in a text to a database entry). But real-world knowledge construction often requires Entity Insertion: finding the perfect spot in a text to introduce a new concept that isn’t mentioned yet.

In this deep dive, we explore a fascinating paper titled “Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia.” The researchers introduce a novel framework called LOCEI (and its multilingual cousin xLOCEI) that outperforms even powerful Large Language Models (LLMs) like GPT-4 at this specific task.

The Problem: Knowledge Islands

The internet is not just a collection of documents; it is a network. Hyperlinks are the synapses that turn isolated pages into a cohesive brain. In Wikipedia, links are vital for navigation and context. However, adding these links is surprisingly difficult.

There are millions of “orphan articles” on Wikipedia—pages with few or no incoming links. To fix this, human editors must manually read through potential source articles to find a place to insert a link.

Entity Linking vs. Entity Insertion

Most automated tools today rely on Entity Linking. This assumes the text already mentions the entity. For example, if the text says “She starred in The Archers,” an algorithm can easily link “The Archers” to its Wikipedia page.

Entity Insertion is harder. It assumes the text does not currently mention the entity, but the concept fits contextually. The algorithm must understand the meaning of the text well enough to say, “This exact spot between sentence A and sentence B is where we should mention this new topic.”

Figure 1: Entity linking vs. Entity insertion. On the left (linking), the text ‘Margaret Peggy Woolley’ already exists. On the right (insertion), the concept ‘Private school’ is missing entirely and must be added via a new sentence.

As shown in Figure 1 above, the difference is structural. In the “After” panel on the right, the editor didn’t just hyperlink a word; they modified the text structure to accommodate new knowledge.

The Scale of the Challenge

You might wonder, how often does this actually happen? Is it an edge case?

The researchers analyzed millions of edits and discovered something startling. In 60-70% of cases where a link was added to Wikipedia, the text for that link did not previously exist.

Figure 2: Challenges of entity insertion. The left chart shows that for most languages, the ‘Absent’ category (orange) dominates, meaning the text wasn’t there before the link was added. The right chart shows the cognitive load: editors must choose from hundreds of candidate sentences.

Figure 2 illustrates this reality. The orange bars (Absent) represent cases where the mention was missing. This means that if we only build tools for Entity Linking (the blue bars), we are ignoring the majority of the problem. Furthermore, the graph on the right shows the cognitive load: an editor has to select the correct insertion point from hundreds of candidate sentences in a long article.

The Data: How to Train an Editor

To solve this, the researchers first needed a dataset. There is no standard benchmark for “inserting knowledge into text.” So, they built one using the edit history of Wikipedia itself.

The team looked at Wikipedia snapshots from consecutive months (e.g., September vs. October 2023). By calculating the “set difference” of links, they identified exactly which links were added by human editors during that month.

Figure 3: Data processing pipeline. The system compares links between two months. If a link appears in the second month but not the first, it traces back the revision history to find the exact edit and categorizes the type of insertion.

As outlined in Figure 3, the pipeline is rigorous (a simplified code sketch follows the list):

  1. Identify New Links: Find links present in the new version (\(v_M\)) but absent in the old one (\(v_0\)).
  2. Trace Revision History: Look at the exact HTML difference to see how the text changed.
  3. Categorize: Did the editor just link an existing word? Did they add a word? Did they add a whole sentence?
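To make the set-difference idea concrete, here is a minimal Python sketch of the comparison step and a (much simplified) categorization heuristic. The data structures and rules are my own illustration, not the authors' pipeline, which operates on parsed Wikipedia HTML and full revision histories.

```python
# Assumes each monthly snapshot is a dict: page title -> set of (anchor_text, target_title) links.

def find_added_links(old_snapshot: dict, new_snapshot: dict) -> dict:
    """Return, per page, the links present in the new month but absent in the old one."""
    added = {}
    for page, new_links in new_snapshot.items():
        old_links = old_snapshot.get(page, set())
        diff = new_links - old_links          # the "set difference" of links
        if diff:
            added[page] = diff
    return added

def categorize_insertion(old_text: str, anchor_text: str, new_sentence: str) -> str:
    """Very rough categorization of how the editor introduced the link."""
    if anchor_text in old_text:
        return "text_present"                 # the words were already there, only the link is new
    stripped = new_sentence.replace(anchor_text, "").strip()
    if stripped and stripped in old_text:
        return "missing_mention"              # only a few words (the mention) were added
    return "missing_sentence_or_span"         # a whole new sentence (or more) was written

old = {"Margaret Woolley": {("The Archers", "The Archers")}}
new = {"Margaret Woolley": {("The Archers", "The Archers"),
                            ("private school", "Private school")}}
print(find_added_links(old, new))

old_text = "She was born in 1932. She later became an actress."
print(categorize_insertion(old_text, "private school", "She also worked at a private school."))
# -> missing_sentence_or_span: the editor had to write a brand-new sentence
```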

This process resulted in a massive dataset covering 105 languages, from English and French to low-resource languages like Xhosa and Guarani.

The Hierarchy of Missing Information

Not all insertions are created equal. The researchers categorized the difficulty of the task based on how much text the editor had to write:

  • Text Present: The easiest case. The word was already there (e.g., “cheese”), just not yet linked.
  • Missing Mention: The editor added a few words (e.g., adding “Soviet space dog” to a sentence about Laika).
  • Missing Sentence: The editor wrote a full new sentence to bridge the gap.
  • Missing Span: The editor added multiple sentences.

Figure 5: The distribution of entity insertion categories across 20 languages. Notice that ‘Text Present’ (blue) is often the minority compared to the combined categories where text is missing.

Figure 5 visualizes the distribution. In languages like English (en) or French (fr), the “Text Present” slice is surprisingly small. The problem of missing text is universal.

The Solution: LOCEI

The researchers proposed LOCEI (Localized Entity Insertion). Unlike generative LLMs, which try to produce the new text directly, LOCEI frames the task as a ranking problem.

Given a Target Entity (the thing we want to link to) and a Source Article (where we want to put the link), the model breaks the source article down into candidate text spans (sentences or paragraphs). It then scores every candidate span on how relevant it is to the target entity.
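Below is a minimal sketch of this ranking formulation. The `overlap_score` function is a throwaway lexical stand-in for the learned Transformer scorer described in the next section, and all names are illustrative.

```python
from typing import Callable, List, Tuple

def rank_insertion_points(
    candidate_spans: List[str],
    target_title: str,
    target_lead: str,
    score: Callable[[str, str, str], float],
) -> List[Tuple[float, str]]:
    """Score every candidate span of the source article against the target entity
    and return them best-first; the top span is the suggested insertion point."""
    scored = [(score(target_title, target_lead, span), span) for span in candidate_spans]
    return sorted(scored, reverse=True)

# Toy usage with a trivial lexical-overlap scorer (a stand-in for the Transformer).
def overlap_score(title: str, lead: str, span: str) -> float:
    query = set((title + " " + lead).lower().split())
    return len(query & set(span.lower().split())) / (len(query) or 1)

spans = ["She was born in 1932.", "She attended a school in Kent before acting."]
print(rank_insertion_points(spans, "Private school", "A private school is ...", overlap_score)[0])
```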

The Architecture

The architecture relies on a Transformer encoder (specifically XLM-RoBERTa for the multilingual version).

Figure 4: Architectural overview of LOCEI. The model takes the Target Entity and a Candidate Span, concatenates them, and feeds them into a Transformer. An MLP then predicts a relevance score.

As shown in Figure 4, the input combines:

  1. Target Entity (\(E_{tgt}\)): Represented by its title and its lead paragraph.
  2. Candidate Span (\(x\)): A sentence from the source article.

These are concatenated into a single sequence. The Transformer processes this sequence to capture the semantic relationship between the two. The final hidden state of the [CLS] token (a special token that represents the whole sequence) is then fed into a Multi-Layer Perceptron (MLP), which outputs a single scalar: the Relevance Score.

The mathematical formulation for the input representation is:

\[
\phi(E_{tgt}, x) = \operatorname{tokenize}\big(\, \text{title}(E_{tgt}) \oplus \text{lead}(E_{tgt}) \oplus x \,\big)
\]

Input representation: \(\phi\) is the tokenized concatenation of the target entity’s title, its lead paragraph, and the candidate text span \(x\).
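The following sketch shows how such a scorer could be assembled with PyTorch and the Hugging Face `transformers` library, assuming the `xlm-roberta-base` checkpoint as the encoder. The class name, head size, and input template are illustrative; this is not the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EntityInsertionScorer(nn.Module):
    def __init__(self, model_name: str = "xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # MLP head mapping the sequence-level representation to a scalar relevance score
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # first token (<s>, analogous to [CLS])
        return self.head(cls).squeeze(-1)      # one relevance score per (entity, span) pair

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = EntityInsertionScorer()

# phi: target title + lead in the first segment, the candidate span in the second
enc = tokenizer(
    "Private school. A private school is a school not administered by the state.",
    "She attended a school in Kent before moving to London.",
    truncation=True, padding=True, return_tensors="pt",
)
score = model(enc["input_ids"], enc["attention_mask"])
print(score)
```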

Optimization: Learning to Rank

The model isn’t just asked “Is this a good spot? Yes/No.” It is trained with a List-wise Ranking Objective.

For every correct insertion spot (positive sample), the system selects \(N\) negative samples (spots in the article where the link was not added). The model must learn to score the positive spot higher than all the negative spots.

\[
\mathcal{L} = -\log \frac{\exp\big(\mathrm{score}(x^{+})\big)}{\exp\big(\mathrm{score}(x^{+})\big) + \sum_{i=1}^{N} \exp\big(\mathrm{score}(x^{-}_{i})\big)}
\]

Ranking loss: the softmax maximizes the probability of the correct candidate \(x^{+}\) relative to the \(N\) sampled negative candidates \(x^{-}_{i}\).

This Softmax-based loss function forces the model to distinguish between “somewhat relevant” and “perfect insertion point.”
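In code, this objective amounts to a cross-entropy over the positive candidate and its \(N\) sampled negatives. The sketch below uses my own tensor shapes and names, not the paper's code.

```python
import torch
import torch.nn.functional as F

def listwise_ranking_loss(pos_score: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """pos_score: (batch,) score of the true insertion span.
    neg_scores: (batch, N) scores of N sampled non-insertion spans.
    Softmax over the N+1 candidates; maximize the probability of the positive."""
    logits = torch.cat([pos_score.unsqueeze(1), neg_scores], dim=1)  # (batch, N+1)
    targets = torch.zeros(logits.size(0), dtype=torch.long)          # positive sits at index 0
    return F.cross_entropy(logits, targets)

# Toy check: the loss drops as the positive score pulls ahead of the negatives.
neg = torch.zeros(2, 5)
print(listwise_ranking_loss(torch.tensor([0.1, 0.1]), neg))
print(listwise_ranking_loss(torch.tensor([3.0, 3.0]), neg))
```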

The Secret Sauce: Two-Stage Training

Here lies the cleverness of the LOCEI framework. Real-world data of “added links” is high quality but relatively scarce (editors don’t add millions of links every day in every language). However, existing links are abundant.

The researchers used a Two-Stage Training Pipeline (a schematic sketch follows the list):

  1. Stage 1 (Warm Start): Train on the millions of links that already exist.
  2. Stage 2 (Expansion): Fine-tune on the smaller dataset of actual “added links” (the edits).
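A schematic of that schedule is below, with a dummy model, dummy data, and made-up hyperparameters chosen purely to make the two stages concrete; the real training uses the Transformer scorer and ranking loss described above.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)                      # stand-in for the Transformer scorer
loss_fn = nn.MSELoss()                        # stand-in for the ranking loss above

def run_stage(loader, lr, epochs):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for features, target in loader:
            opt.zero_grad()
            loss = loss_fn(model(features).squeeze(-1), target)
            loss.backward()
            opt.step()

# Stage 1 (warm start): abundant existing links, with dynamic context removal applied on the fly
existing_links = [(torch.randn(8, 16), torch.randn(8)) for _ in range(100)]
run_stage(existing_links, lr=2e-5, epochs=1)

# Stage 2 (expansion): the smaller, real set of added-link edits, at a lower learning rate
added_links = [(torch.randn(8, 16), torch.randn(8)) for _ in range(10)]
run_stage(added_links, lr=1e-5, epochs=3)
```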

Dynamic Context Removal

There is a problem with Stage 1. Existing links always fall into the “Text Present” category. If the model only trains on existing links, it will learn to just look for the entity’s name (String Matching) and fail when the name is missing.

To solve this, the researchers invented Dynamic Context Removal. During training, they take an existing link and artificially “damage” the text to simulate the harder scenarios.

Table 10: Examples of dynamic context removal strategies. ‘rm_mention’ deletes the specific name. ‘rm_sent’ deletes the whole sentence containing the link.

As seen in Table 10, they apply different strategies:

  • rm_mention: Delete the words “Perthes-lès-Brienne” but keep the sentence. The model must learn that the remaining context implies that specific commune.
  • rm_sent: Delete the whole sentence. The model must learn that the surrounding sentences create a context hole that this entity fills.

This forces the model to learn deep semantic context rather than simple keyword matching.
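Here is a toy illustration of the two strategies using plain string and sentence operations; the real pipeline works on Wikipedia's parsed markup, so treat the regex-based splitting as an approximation.

```python
import re

def rm_mention(text: str, mention: str) -> str:
    """Delete the mention itself but keep the rest of the sentence."""
    return re.sub(re.escape(mention), "", text).replace("  ", " ").strip()

def rm_sent(text: str, mention: str) -> str:
    """Delete the entire sentence that contains the mention."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(s for s in sentences if mention not in s)

paragraph = ("The church was built in the 12th century. "
             "It serves the commune of Perthes-lès-Brienne. "
             "It was restored in 1890.")

print(rm_mention(paragraph, "Perthes-lès-Brienne"))
print(rm_sent(paragraph, "Perthes-lès-Brienne"))
```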

Knowledge Injection

To further boost performance, the researchers injected extra metadata. They included the Section Title (\(s\)) of the candidate span and a list of known aliases/mentions (\(M_{tgt}\)) for the target entity.

\[
\phi(E_{tgt}, x) = \operatorname{tokenize}\big(\, \text{title}(E_{tgt}) \oplus \text{lead}(E_{tgt}) \oplus M_{tgt} \oplus s \oplus x \,\big)
\]

Enhanced input representation, now including the target entity’s known mentions \(M_{tgt}\) and the section title \(s\) of the candidate span.

This helps the model understand that a link to “1984” (the book) belongs in the “Literature” section, while a link to “1984” (the year) belongs in “History.”
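One way to picture the enriched input is simply a longer concatenation before tokenization. The template and separator below are my own choices, not the paper's exact format.

```python
from typing import List

def build_input(
    target_title: str,
    target_lead: str,
    mentions: List[str],     # known aliases of the target entity (M_tgt)
    section_title: str,      # section of the source article containing the span (s)
    span: str,               # candidate text span (x)
    sep: str = " </s> ",
) -> str:
    """Concatenate title, lead, aliases, section title, and candidate span into one string."""
    return sep.join([target_title, target_lead, "; ".join(mentions), section_title, span])

print(build_input(
    "Nineteen Eighty-Four",
    "Nineteen Eighty-Four is a dystopian novel by George Orwell.",
    ["1984", "Orwell's 1984"],
    "Literature",
    "Her favourite reading included dystopian fiction.",
))
```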

Multilinguality: xLOCEI

Wikipedia exists in over 300 languages. Training a separate model for each is inefficient and hurts low-resource languages that don’t have enough training data.

The team developed xLOCEI (Cross-Lingual LOCEI). By training a single XLM-RoBERTa model on a mix of 20 languages simultaneously, the model learns universal patterns of logic and context.

Does it work? Let’s look at the results.

Experimental Results

The researchers compared xLOCEI against several baselines (a code sketch of the BM25 baseline follows the list):

  • String Match: Simple keyword search.
  • BM25: The classic information retrieval algorithm (similar to TF-IDF).
  • GPT-3.5 and GPT-4: Using state-of-the-art LLMs to rank the sentences (zero-shot).
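For reference, a BM25 baseline can be reproduced in a few lines, assuming the third-party `rank_bm25` package (pip install rank-bm25) and naive whitespace tokenization; this is an illustration of the baseline idea, not the authors' exact setup.

```python
from rank_bm25 import BM25Okapi

candidate_spans = [
    "She was born in Surrey in 1932.",
    "She attended a school in Kent before moving to London.",
    "Her film debut came in 1953.",
]
query = "Private school. A private school is a school not administered by the state."

bm25 = BM25Okapi([s.lower().split() for s in candidate_spans])
scores = bm25.get_scores(query.lower().split())

best = max(range(len(candidate_spans)), key=lambda i: scores[i])
print(candidate_spans[best], scores[best])
```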

Overall Performance

Figure 6: Hits@1 performance across 20 languages. The xLOCEI model (brown) consistently outperforms baselines like BM25 (orange) and String Match (green), especially in difficult languages.

Figure 6 shows the “Hits@1” metric (did the model pick the exact right sentence as its top choice?). The brown dots (xLOCEI) are consistently at the top.
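For clarity, Hits@1 is simply the fraction of test cases in which the model's top-ranked candidate equals the span the editor actually chose. A tiny sketch, with my own argument names:

```python
from typing import List, Sequence

def hits_at_1(ranked_spans: Sequence[List[int]], gold_spans: Sequence[int]) -> float:
    """ranked_spans[i] is the model's ranking (span indices, best first) for example i;
    gold_spans[i] is the index of the true insertion span."""
    hits = sum(1 for ranking, gold in zip(ranked_spans, gold_spans) if ranking[0] == gold)
    return hits / len(gold_spans)

print(hits_at_1([[2, 0, 1], [1, 2, 0], [0, 1, 2]], [2, 0, 0]))  # -> 0.666...
```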

The numerical breakdown is even more revealing:

Table 2: Aggregated results over 20 languages. xLOCEI achieves 72.6% accuracy overall, compared to 50.8% for BM25. In ‘Missing’ scenarios, xLOCEI scores 57.9% while String Match fails at 27.0%.

In Table 2, look at the “Missing” column. This is the hardest task (where the text isn’t there).

  • String Match gets 27.0%.
  • BM25 gets 28.0%.
  • xLOCEI achieves 57.9%.

This is a massive leap. xLOCEI more than doubles the performance of traditional retrieval methods by effectively “reading between the lines.”

Beating GPT-4

You might expect GPT-4 to dominate this task. Interestingly, xLOCEI holds its own and often wins, particularly because it is fine-tuned for this specific structural understanding of Wikipedia.

Table 3: Performance on English. xLOCEI (0.677) significantly outperforms GPT-3.5 (0.160) and GPT-4 (0.370) in Overall Hits@1.

Table 3 focuses on English. While GPT-4 is good at the “Present” category (0.833), it struggles with the full ranking task compared to the specialized xLOCEI model (0.677 vs 0.370 overall). This proves that massive scale isn’t the only answer; specialized training pipelines matter.

The Power of Zero-Shot Transfer

Perhaps the most impressive finding is how xLOCEI performs on languages it has never seen before.

The researchers trained a version called xLOCEI\(_{11}\) on just 11 languages. They then tested it on 9 other languages that were completely held out during training.

Figure 8: Zero-shot performance. The orange triangles (xLOCEI_11) represent performance on unseen languages. It nearly matches the blue squares (model trained on all languages).

Figure 8 shows that the performance drop is minimal. The model learned the concept of “insertion” so well in English, French, and Japanese that it could apply it to Portuguese or Slovak without specific fine-tuning.

Table 4: Zero-shot results. xLOCEI_11 retains over 90% of the performance of the fully trained model and still beats GPT-4 in zero-shot settings.

Table 4 confirms this quantitatively. The zero-shot model (\(xLOCEI_{11}\)) achieves 0.690 Hits@1, vastly outperforming GPT-4 (0.571) and coming very close to the fully supervised model (0.709). This is a game-changer for low-resource languages on Wikipedia that desperately need better connectivity but lack the data to train custom models.

Conclusion: Bridging the Gaps

The “Entity Insertion” paper shifts the perspective from simply linking words to connecting ideas. By recognizing that 65% of link edits require adding new text, the authors identified a major blind spot in current NLP tools.

Their solution, LOCEI, combines a smart ranking architecture with a creative training strategy (Dynamic Context Removal) to solve this. The resulting model is not only accurate but incredibly robust across languages, offering a way to “de-orphan” millions of articles across the 300+ language versions of Wikipedia.

For students of AI and NLP, this work highlights a crucial lesson: Data preparation is often as important as model architecture. By synthetically generating “missing text” scenarios from existing links, the researchers taught a machine to visualize where information should be, rather than just seeing what is.

This technology paves the way for AI assistants that don’t just proofread our writing, but actively help us weave our isolated thoughts into the broader web of human knowledge.