The biomedical field is notoriously difficult when it comes to text processing. Consider the term “diabetes.” In a general conversation, we know what this means. But in a medical paper, does it refer to Diabetes Mellitus? Diabetes Insipidus? Nephrogenic Diabetes Insipidus? Or perhaps a specific experimental induction of the disease in a lab rat?

This is the challenge of Biomedical Entity Linking (EL). It isn’t enough to just find the word (Named Entity Recognition); we have to map that word to a specific, unique identifier, such as a Concept Unique Identifier (CUI), in a massive knowledge base like UMLS or MeSH.

For years, this problem has been tackled by specialized BERT-based models. They are good, but they struggle with the nuance of context and the sheer variety of aliases used in medical literature.

In this post, we are breaking down a fascinating research paper titled “LLM as Entity Disambiguator for Biomedical Entity-Linking.” The researchers propose a novel, two-step approach: keep the traditional models to find the candidates, but bring in a Large Language Model (LLM) to make the final decision. The result? A massive jump in accuracy—up to 16 points—without a single step of fine-tuning.

Let’s dive into how it works.

The Core Problem: Ambiguity and Aliases

Most Entity Linking systems follow a two-stage process:

  1. Candidate Generation: The model looks at a mention (e.g., “depression”) and retrieves a list of potential matches from the database (e.g., “Depressive disorder,” “Depression, bipolar,” “Depression, unipolar”).
  2. Disambiguation: The model re-ranks these candidates to pick the best one.
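
To make this two-stage shape concrete before getting to the paper’s method, here is a minimal, self-contained sketch: difflib stands in for a real alias matcher, the knowledge-base entries and CUIs are placeholders, and the disambiguator is a stub rather than anything from the paper.

```python
from difflib import SequenceMatcher

# Toy knowledge base: CUI -> preferred name (placeholder IDs, not real UMLS CUIs).
KB = {
    "C0000001": "Depressive disorder",
    "C0000002": "Bipolar disorder",
    "C0000003": "Unipolar depression",
}

def generate_candidates(mention: str, k: int = 3):
    """Stage 1: retrieve the top-k entries by crude string similarity."""
    scored = [
        (cui, name, SequenceMatcher(None, mention.lower(), name.lower()).ratio())
        for cui, name in KB.items()
    ]
    return sorted(scored, key=lambda t: t[2], reverse=True)[:k]

def disambiguate(mention: str, context: str, candidates):
    """Stage 2: re-rank the shortlist; here a trivial placeholder that keeps the top match."""
    return candidates[0][0]

shortlist = generate_candidates("depression")
print(shortlist)
print(disambiguate("depression", "Patients with depression were screened ...", shortlist))
```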

There are two main “schools of thought” for existing biomedical EL models, and understanding them is crucial to understanding why the LLM approach is so effective.

1. Alias-Matching Models (e.g., SapBERT)

These models rely heavily on lexical properties—fancy talk for “string similarity.” If the mention is “heart attack,” SapBERT looks for database entries listed as synonyms for “heart attack” (like “Myocardial Infarction”).

  • Pros: High recall. They rarely miss the correct answer if it’s in the list of aliases.
  • Cons: Terrible at context. If “CAT” appears in the text, SapBERT might rank the animal and the enzyme (Catalase) similarly, ignoring the fact that the sentence is about biochemistry.

2. Contextualized Models (e.g., ArboEL)

These models are smarter. They learn vector representations based on the surrounding sentence.

  • Pros: Very precise.
  • Cons: They tend to over-filter. If the correct entity looks “contextually different” in the vector space, the model might push it so far down the list that it gets cut off.

The New Method: LLM as the Judge

The researchers realized that LLMs (like GPT-4, Llama 3, or Mistral) possess the general reasoning capabilities that Alias-Matching models lack. However, LLMs are bad at generating database IDs directly (they hallucinate).

So, the authors proposed a hybrid pipeline. They use a standard model (SapBERT) to generate a “shortlist” of candidates, and then they feed that list into an LLM to pick the winner.

Figure 1: LLM as Entity Disambiguator.

As shown in Figure 1, the workflow is elegant in its simplicity:

  1. Candidate Generator: The Entity Linking Model (SapBERT) finds the top \(k\) candidates.
  2. Entity Disambiguator: The LLM receives the mention, the context, and the list of candidates.
  3. Result: The LLM outputs the best candidate.
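
As a rough illustration of step 2, here is what the disambiguation call could look like, with the OpenAI chat-completions client standing in for the LLM; the candidate dictionary keys, the model name, and the prompt wording are assumptions for this sketch, not the paper’s exact setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_disambiguate(mention: str, context: str, candidates: list[dict], model: str = "gpt-4o") -> str:
    """Ask the LLM to pick exactly one CUI from the shortlist produced by the candidate generator."""
    candidate_block = "\n".join(
        f"- CUI: {c['cui']} | Name: {c['name']} | Aliases: {', '.join(c['aliases'])}"
        for c in candidates
    )
    prompt = (
        f"Mention: {mention}\n"
        f"Context: {context}\n"
        f"Candidates:\n{candidate_block}\n\n"
        "Output only the CUI of the best candidate."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```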

The Secret Sauce: In-Context Learning

You can’t just ask an LLM “Which one is it?” and expect SOTA results. The authors used in-context learning via a Retrieval-Augmented Generation (RAG)-style setup.

Instead of zero-shot prompting, the system dynamically retrieves relevant examples from the training set. If the model is trying to link a mention of a specific gene, the system finds other solved examples of gene linking from the training data and includes them in the prompt.

The probability of the correct answer is modeled as:

\[
p_{\mathrm{LLM}}\left(\omega \mid x_1, y_1, \ldots, x_k, y_k, x\right)
\]

Here, the LLM predicts the output \(\omega\) based on the query \(x\) and a set of \(k\) retrieved examples \((x_1, y_1), \ldots, (x_k, y_k)\).

To make this efficient, they used Faiss (a library for efficient similarity search) to index all mentions in the training set. During inference, they retrieve the most contextually similar training examples to guide the LLM.
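
Here is a sketch of that retrieval step with Faiss; the bag-of-characters encoder and the training examples below are throwaway placeholders (in practice you would index real mention/context embeddings, e.g., from a SapBERT-style encoder).

```python
import faiss  # pip install faiss-cpu
import numpy as np

DIM = 256  # embedding dimension for the toy encoder below

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder encoder: hashes characters into a bag-of-chars vector.
    Swap in real mention/context embeddings in practice."""
    vecs = np.zeros((len(texts), DIM), dtype="float32")
    for i, text in enumerate(texts):
        for ch in text.lower():
            vecs[i, ord(ch) % DIM] += 1.0
    return vecs

# Solved training examples: (mention in context, gold CUI). Dummy values for illustration.
train_examples = [
    ("... suffered from [ENTITY] anemia [ENTITY] after chemotherapy ...", "C0000001"),
    ("... overexpression of the [ENTITY] BRCA1 [ENTITY] gene ...", "C0000002"),
]

xb = embed([mention for mention, _ in train_examples])
faiss.normalize_L2(xb)              # normalize so inner product = cosine similarity
index = faiss.IndexFlatIP(DIM)
index.add(xb)

def retrieve_examples(query: str, k: int = 2):
    """Return the k training examples most similar to the query mention-in-context."""
    xq = embed([query])
    faiss.normalize_L2(xq)
    _, ids = index.search(xq, k)
    return [train_examples[i] for i in ids[0]]

print(retrieve_examples("... mild [ENTITY] anemia [ENTITY] was observed ..."))
```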

Constructing the Prompt

The prompt engineering here is critical. The researchers provide the LLM with:

  1. The Mention: Specifically marked (e.g., ... suffered from [ENTITY] anemia [ENTITY] ...).
  2. The Context: The surrounding text.
  3. The Candidates: A structured dictionary including the Name, Definition, Aliases, and Type for each candidate.
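
Below is a sketch of how such a prompt could be assembled; the field layout and instruction wording paraphrase the description above rather than reproducing the paper’s exact template from Figure 3.

```python
def build_prompt(context_with_markers: str, candidates: list[dict], examples: list[tuple[str, str]]) -> str:
    """Assemble the disambiguation prompt: retrieved few-shot examples, the
    [ENTITY]-marked mention in context, and a structured candidate list."""
    example_block = "\n\n".join(
        f"Context: {ex_context}\nAnswer: {ex_cui}" for ex_context, ex_cui in examples
    )
    candidate_block = "\n".join(
        f"- CUI: {c['cui']}\n  Name: {c['name']}\n  Definition: {c['definition']}\n"
        f"  Aliases: {', '.join(c['aliases'])}\n  Type: {c['type']}"
        for c in candidates
    )
    return (
        "Link the marked mention to exactly one of the candidates.\n\n"
        f"Solved examples:\n{example_block}\n\n"
        f"Context: {context_with_markers}\n\n"
        f"Candidates:\n{candidate_block}\n\n"
        "Output only the best candidate CUI."
    )
```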

Figure 3: Prompt for the accuracy task, outputting only the best candidate CUI.

Figure 3 shows the prompt used for the accuracy task. Notice that it is a direct instruction: “Output only the best candidate CUI.” We will discuss later why this direct approach beat out complex “Chain of Thought” reasoning.

Experimental Setup

To prove this works, the authors tested the method on five prevalent biomedical datasets, covering diseases, chemicals, and genes.

Table 1: Datasets used for evaluation.

They evaluated using several LLMs, including proprietary ones (GPT-4o) and open-source ones (Llama-3, Mistral, Qwen2.5).

Key Results: A Massive Leap for Alias-Matching

The results were transformative for Alias-Matching models (SapBERT).

When SapBERT is used alone, it often fails on datasets like GNormPlus (Gene Normalization) because gene names are incredibly ambiguous (e.g., distinct genes sharing the same alias). SapBERT groups them all together because they look the same.

Enter the LLM. Because the LLM reads the context, it can look at that list of identical-looking gene names and use the definitions/context to pick the right one.

Table 2: Comparison of Accuracy (Recall@1) and Recall@5 between the initial base model (SapBERT) and the same model after applying different LLM disambiguators.

Take a look at the GNormPlus row in Table 2 above:

  • Base SapBERT Accuracy: 19.1% (Abysmal)
  • SapBERT + GPT-4o: 74.8% (State-of-the-Art)
  • SapBERT + Llama-3: 44.4%

This is a 16-point improvement over the previous State-of-the-Art (SOTA). Even smaller open-source models provided significant boosts.

The “Contextualized” Paradox

Interestingly, the method did not help Contextualized models (ArboEL) nearly as much. In fact, in some cases, it hurt performance.

Table 3: Comparison of Accuracy (Recall@1) and Recall@5 between the initial base model ArboEL…

Why? The authors explain this through the “Performance Disparity.”

Contextualized models like ArboEL are already “smart” filters. They cluster candidates based on context. If ArboEL is confused, it usually means the correct candidate has been pushed far down the list or removed entirely because it didn’t fit the vector cluster. If the correct candidate isn’t in the top \(k\) list passed to the LLM, the LLM cannot pick it.

SapBERT, being “dumber,” casts a wider net. It retrieves a messy, diverse list of candidates. This is actually perfect for the LLM, which acts as a filter.

Figure 2: Embedding Space of Candidates from Alias Matching (left) and Contextualized EL model (right)

Figure 2 illustrates this beautifully.

  • Left (Alias Matching): The candidates (red/colored dots) are scattered all over the embedding space. The LLM has a diverse menu to choose from.
  • Right (Contextualized): The candidates are tightly clustered. If the answer (blue dot) is distinct from the cluster, the model might exclude it, leaving the LLM with nothing to salvage.

Optimization: How to prompt the LLM?

The paper offers several practical insights for anyone looking to implement this.

1. Direct Answer vs. Reasoning

You might expect that asking the LLM to “think step-by-step” (Chain of Thought) would improve accuracy. Surprisingly, the authors found the opposite.

Figure 11: Accuracy vs. Runtime. Comparison of three prompting strategies across four datasets

As seen in Figure 11, Prompt 1 (Direct Output) was not only significantly faster (x-axis) but also generally more accurate than Prompt 2 (Reasoning). The authors hypothesize that for this specific task, if the information is present in the context and candidate definition, the model doesn’t need complex reasoning steps—and asking for them might introduce noise or error propagation.

2. How many candidates?

How many candidates should you feed the LLM? 5? 50?

Figure 6: Accuracy vs. running time for a varying number of candidates in the prompt. Dataset: GNormPlus.

Figure 6 shows the sweet spot is usually between 10 and 20 candidates.

  • Too few (5): You risk excluding the correct answer.
  • Too many (50): The prompt becomes noisy (“Lost in the Middle” phenomenon), and accuracy drops while costs/runtime skyrocket.
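
If you want to find that sweet spot for your own pipeline, a minimal sweep harness might look like the following; the linking call, latency model, and gold labels are all stubs you would replace with the real candidate generator, LLM client, and evaluation data.

```python
import time

def link_mention(mention: str, context: str, k: int) -> str:
    """Stub for the end-to-end call: generate k candidates, prompt the LLM, return a CUI.
    The sleep is a stand-in for prompt-length-dependent LLM latency."""
    time.sleep(0.01 * k)
    return "C0000001"  # placeholder prediction

# (mention, context, gold CUI) triples; placeholder data for illustration.
eval_set = [
    ("anemia", "... suffered from [ENTITY] anemia [ENTITY] ...", "C0000001"),
    ("CAT", "... [ENTITY] CAT [ENTITY] activity in liver tissue ...", "C0000002"),
]

for k in (5, 10, 20, 50):
    start = time.time()
    preds = [link_mention(m, ctx, k) for m, ctx, _ in eval_set]
    accuracy = sum(p == gold for p, (_, _, gold) in zip(preds, eval_set)) / len(eval_set)
    print(f"k={k:>2}  accuracy={accuracy:.2f}  runtime={time.time() - start:.2f}s")
```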

3. Does Data Leakage Matter?

Since the retrieval system finds similar examples, does it help if the exact same mention appears in the prompt examples?

Table 5: Performance difference and frequency…

Table 5 confirms that yes, if the exact mention is in the few-shot examples (Checkmark column), performance is higher. However, even without exact matches (X column), the models still perform respectably. The method relies on the structure of the task, not just memorization.

Trade-offs: The Cost of Accuracy

While the accuracy gains are impressive, they come at a cost: Runtime.

Inference with LLMs is computationally heavy compared to a BERT-based bi-encoder.

Figure 12: Runtime vs. number of mentions for the accuracy task across different LLMs…

Figure 12 demonstrates the scalability challenge. Running the full MM-ST21PV dataset (over 30k mentions) on a model like Mistral took nearly 24 hours. While faster than training a model like ArboEL (which can take 20 days), it is significantly slower than standard inference.

Conclusion

This paper presents a compelling argument for the “Generative Judge” pattern in NLP. Rather than trying to force a single model to be good at retrieval and reasoning, we can decouple them:

  1. Use a fast, high-recall model (like SapBERT) to gather evidence.
  2. Use a smart, reasoning-capable model (LLM) to make the final verdict.

Key Takeaways:

  • Integration is Easy: This method requires no fine-tuning. You can plug it into existing pipelines immediately.
  • Complementary Strengths: LLMs fix the context-blindness of Alias-matching models.
  • SOTA Results: Surpassed previous benchmarks by up to 16 accuracy points on difficult datasets.

For students and researchers, this suggests that the future of information extraction might not lie in building bigger specific models, but in orchestrating the right interaction between specialized retrievers and general-purpose reasoners.