Natural Language Processing (NLP) has seen a meteoric rise in capabilities thanks to Large Language Models (LLMs) like ChatGPT and Llama. We often see these models generating poetry, writing code, or summarizing emails with ease. However, when we apply them to rigorous Information Extraction (IE) tasks, the cracks begin to show.

One of the most deceptive challenges in NLP is Named Entity Recognition (NER). While identifying a person or a location sounds simple, language is rarely flat. It is hierarchical. It is recursive. It is nested.

Today, we are diving deep into a fascinating research paper titled “Exploring Nested Named Entity Recognition with Large Language Models.” The researchers investigated whether modern LLMs can handle the complexity of entities hidden inside other entities. They explored various prompting strategies and fine-tuning methods, and ultimately delivered a reality check on where LLMs stand compared to traditional supervised models.

The Problem: Entities Within Entities

To understand the challenge, we first need to distinguish between Flat NER and Nested NER.

In Flat NER, we might look at the sentence: “Elon Musk works at Tesla.”

  • Elon Musk \(\rightarrow\) Person
  • Tesla \(\rightarrow\) Organization

This is straightforward because the entities do not overlap. But consider the phrase:

“University of California student”

Here, we have a layering problem:

  1. “University of California” is an Organization.
  2. “University of California student” might be classified as a Person (or a specific role).

In biomedical texts, this is even more common and critical. A protein name might contain a DNA sequence name, which contains a cell type. Traditional models struggle here because they often treat text as a linear sequence where a token can only belong to one entity.
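
To make the layering concrete, here is a minimal Python sketch (not from the paper; the character offsets and label names are illustrative) showing why overlapping spans cannot be squeezed into a single flat tag sequence:

```python
# Illustrative nested annotation for the example phrase.
text = "University of California student"

entities = [
    {"text": "University of California", "start": 0, "end": 24, "label": "ORG"},
    {"text": "University of California student", "start": 0, "end": 32, "label": "PER"},
]

# Flat BIO tagging assigns exactly one label per token, so the token "University"
# cannot open both an ORG span and a PER span at the same time.
# Nested NER instead has to predict a set of possibly overlapping spans.
for outer in entities:
    for inner in entities:
        if inner is not outer and outer["start"] <= inner["start"] <= inner["end"] <= outer["end"]:
            print(f'"{inner["text"]}" ({inner["label"]}) is nested inside '
                  f'"{outer["text"]}" ({outer["label"]})')
```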

The authors of this paper set out to answer a crucial question: Can we prompt or tune LLMs to “see” these layers, or are they stuck looking at the surface?

The Methodology: Teaching LLMs to Reason

The researchers didn’t just ask ChatGPT to “find the entities.” They systematically deconstructed how an LLM approaches the problem. They tested several reasoning frameworks, ranging from simple extraction to complex, recursive logic.

1. The Importance of Output Format

Before even touching on reasoning, the authors discovered that how you ask the model to answer matters immensely. They tested four output formats:

  1. Span Extraction: Asking for indices (e.g., Start: 0, End: 4).
  2. Distinguishable Tokens: Inserting tags into the text (e.g., <Entity>).
  3. Nested Level BIO: A complex tagging scheme marking the hierarchy depth.
  4. Entity-Category Dictionary: A JSON-like structure (e.g., {'Entity': 'Label'}).

Examples of different output formats tested by the researchers.

As shown in Figure 4, the formats vary in complexity. Surprisingly, the models failed miserably at Span Extraction: producing exact character indices requires precise counting over the input, which LLMs are notoriously unreliable at. The clear winner was the Entity-Category Dictionary. Generating a natural, text-based dictionary aligns better with the pre-training of generative models than calculating numerical spans or managing complex tag schemas.
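
For intuition, here is a hypothetical side-by-side of two of these formats for the earlier example phrase (the exact serialization in the paper may differ):

```python
# Span Extraction: the model must emit exact character indices, which means
# counting characters correctly in the text it just read.
span_output = [
    {"start": 0, "end": 24, "label": "organization"},
    {"start": 0, "end": 32, "label": "person"},
]

# Entity-Category Dictionary: the model only reproduces surface strings and
# attaches a label, which is much closer to ordinary text generation.
dict_output = {
    "University of California": "organization",
    "University of California student": "person",
}
```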

2. Prompting Strategies: How to Ask?

Once the output format was settled, the researchers explored reasoning techniques. This is where the paper offers a fantastic critique of existing “Chain of Thought” methods when applied to nested structures.

They compared several techniques against their own proposed methods.

Comparison of reasoning techniques: Recursive, Extract-then-Classify, Structure-based Decomposed-QA, and Nested NER-Tailored Instruction.

The Failures:

  • Decomposed-QA: This involves asking the model a separate question for each entity type (e.g., “Find all locations,” then “Find all people”; see the sketch after this list). While this works well for Flat NER, it degraded performance for Nested NER. Why? Because nested entities often share context that is lost when you isolate the questions.
  • Recursive: This method (asking the model to find the outermost entities, then the inner ones) also struggled to match the performance of simpler methods.
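
Here is a hypothetical sketch of the two questioning styles (the prompt wording and entity types are illustrative, not the paper's exact prompts):

```python
ENTITY_TYPES = ["person", "organization", "location"]

def decomposed_qa_prompts(sentence: str) -> list[str]:
    """Decomposed-QA: one question per entity type; nested entities that share
    context end up split across separate prompts."""
    return [
        f"Extract all '{etype}' entities from the sentence below and answer "
        f"with a {{'Entity': 'Label'}} dictionary.\nSentence: {sentence}"
        for etype in ENTITY_TYPES
    ]

def single_pass_prompt(sentence: str) -> str:
    """Single pass: ask for every type at once, keeping the shared context."""
    return (
        f"Extract all entities of the types {ENTITY_TYPES} from the sentence "
        f"below and answer with a {{'Entity': 'Label'}} dictionary.\n"
        f"Sentence: {sentence}"
    )
```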

The Solution: Nested NER-Tailored Instructions

The authors proposed a new method focusing on Instruction Tuning. Instead of relying on complex algorithmic prompting (like recursion), they focused on semantics.

They provided the LLM with:

  1. Label Information: Explicit definitions of what “Protein” or “DNA” actually means in this context.
  2. Nested Case Descriptions: They explicitly explained to the LLM how nesting happens (e.g., “Consider that one entity might contain another…”).

This semantic approach—teaching the model the definitions rather than just the task—proved to be the most effective zero-shot and few-shot strategy.
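
A hypothetical skeleton of such a tailored instruction might look like the following (the label definitions and wording are illustrative; the paper's exact phrasing differs):

```python
LABEL_INFO = {
    "protein": "Names of proteins or protein families.",
    "DNA": "Names of genes, promoters, or other DNA regions.",
    "cell_type": "Names of naturally occurring cell types.",
}

NESTED_CASE_DESCRIPTION = (
    "Consider that one entity might contain another: for example, a DNA mention "
    "can appear inside a longer protein mention. Return every entity, including "
    "those nested inside other entities."
)

def build_instruction(sentence: str) -> str:
    """Combine label definitions and the nested-case description into one prompt."""
    definitions = "\n".join(f"- {label}: {desc}" for label, desc in LABEL_INFO.items())
    return (
        "Extract all named entities from the sentence and answer with a "
        "{'Entity': 'Label'} dictionary.\n"
        f"Label definitions:\n{definitions}\n"
        f"{NESTED_CASE_DESCRIPTION}\n"
        f"Sentence: {sentence}"
    )
```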

Instruction Tuning: Moving Beyond Prompting

Prompting can only take you so far. To truly specialize the models, the researchers employed Instruction Tuning using PEFT (Parameter-Efficient Fine-Tuning), specifically QLoRA.

This allows them to adapt massive models (like Llama-2 13B) without retraining all parameters.

The Mathematical Foundation

The goal during tuning is to maximize the probability of generating the correct answer sequence \(Y\) given the instruction prompt \(P\) and the input \(X\). The probability is calculated as:

\[ p(Y \mid [P; X]) = \prod_{i=1}^{|Y|} p_{\Phi}\left(y_i \mid [P; X],\, y_{<i}\right) \]

Here, the model predicts the next token \(y_i\) based on all previously generated tokens \(y_{<i}\) and the concatenated instruction and input \([P; X]\).

To optimize this without melting their GPUs, they used QLoRA. This method keeps the original Large Language Model weights (\(\Phi_0\)) frozen and only trains a small set of adapter weights (\(\Delta \Phi\)). The loss function looks like this:

\[ \mathcal{L}(\Delta \Phi) = - \sum_{(X, Y) \in \mathcal{Z}} \sum_{i=1}^{|Y|} \log p_{\Phi_0 + \Delta \Phi}\left(y_i \mid [P; X],\, y_{<i}\right) \]

where \(\mathcal{Z}\) is the instruction-tuning dataset of input-output pairs.

Specifically, QLoRA approximates the weight updates using low-rank matrices \(A\) and \(B\). Instead of updating a massive weight matrix \(W\), they update two smaller matrices that multiply together to represent the change. The forward pass for a layer becomes:

\[ h = W_0 x + \Delta W x = W_0 x + B A x, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k) \]
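
To make the parameter savings concrete, here is a small NumPy sketch of that forward pass (the layer size and rank are chosen purely for illustration):

```python
import numpy as np

d, k, r = 4096, 4096, 16          # illustrative layer dimensions and LoRA rank
x = np.random.randn(k)

W0 = np.random.randn(d, k)        # frozen pretrained weight (never updated)
A = np.random.randn(r, k) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))              # starts at zero so the initial update is zero

h = W0 @ x + B @ (A @ x)          # forward pass: h = W0 x + B A x

full_update = d * k               # parameters in a full weight update
lora_update = d * r + r * k       # parameters in the low-rank update
print(f"LoRA trains {lora_update:,} values instead of {full_update:,} for this layer")
```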

This allowed the researchers to take open-source models (Llama-2, Llama-3, Mistral, Qwen) and specialize them for the complex task of Nested NER using their custom “Tailored Instructions.”
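
The paper's exact training code is not reproduced here, but with the Hugging Face transformers and peft libraries a QLoRA setup generally looks like this (the model name, rank, and target modules below are illustrative assumptions, not the authors' configuration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-13b-hf"  # any of the open models discussed

# Load the frozen base model in 4-bit precision (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Attach small trainable LoRA adapters on top of the frozen weights.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

From there, the tailored instructions and gold entity dictionaries are serialized into instruction-output pairs and trained with the causal language-modeling objective shown above.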

Experimental Results: The Reality Check

The researchers tested their methods on three datasets: ACE 2004, ACE 2005 (general domain), and GENIA (biomedical, highly nested).

Prompting Performance (Closed Models)

First, let’s look at how ChatGPT (GPT-4) performed with different prompting strategies.

Table showing ChatGPT 4.0 results. Note the improvement with Label Information and Nested Case Description.

Key Takeaways from Table 1:

  • Standard methods failed: The “Decomposed-QA” method, which is a superstar in Flat NER, collapsed in the Nested setting (see the low “Nesting” scores).
  • Definitions matter: The rows for “Label Information (ChatGPT)” show significant improvements.
  • Context is King: Adding “Nested Case Description” (explaining that entities can be inside each other) yielded the highest F1 scores, particularly in the One-Shot and Five-Shot settings.

Fine-Tuning Performance (Open Models)

Next, they looked at the open-source models tuned with QLoRA.

Table showing results of instruction tuning on various models.

Key Takeaways from Table 2:

  • Bigger isn’t always better: While Llama-2 13B was a strong contender, smaller models like Mistral-7B often punched above their weight, particularly on the general-domain ACE datasets.
  • The “SOTA” Gap: Look at the bottom row, “BERT fine-tuned.” Despite all the fancy prompting and QLoRA tuning, BERT-based models still outperformed LLMs. On GENIA, the SOTA BERT model achieved an F1 of 81.20 on Flat entities, whereas the best LLM setup hovered around 76-78.

This is a crucial insight. LLMs are generalists. BERT-based models, when fully supervised and fine-tuned with architectures designed specifically for NER, are specialists. The massive parameter count of an LLM does not automatically grant it the precision required for complex structural extraction.

Detailed Analysis: Where do LLMs Fail?

To understand why LLMs lag behind, the authors broke down the errors.

Inner vs. Outer Entities

Nested NER requires finding the “Outer” entity (the whole phrase) and the “Inner” entity (the sub-phrase).

Bar chart analyzing error rates and recall for Inner vs Outer entities.

Figure 2 reveals a discrepancy. On the GENIA dataset (blue bars), the models struggled significantly more with Inner entities (low recall) compared to Outer entities. The “Wrong Type Error” (classifying a Protein as DNA, for example) was relatively low.

This suggests LLMs are good at semantic classification (knowing what something is) but struggle with structural boundary detection (knowing where the inner entity stops).

The Depth of Nesting

The authors also analyzed performance based on how deep the nesting goes (Level 1 = innermost, Level 2 = containing Level 1, etc.).
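
As a rough sketch of how such levels can be assigned (assuming the convention from the text, where level 1 is the innermost entity), one can walk the containment relation between gold spans:

```python
def nesting_level(span, spans):
    """Level 1 = the span contains no other entity; otherwise 1 + the deepest
    level among the entities it contains. Spans are (start, end, label) tuples."""
    contained = [s for s in spans
                 if s != span and span[0] <= s[0] and s[1] <= span[1]]
    if not contained:
        return 1
    return 1 + max(nesting_level(s, spans) for s in contained)

# Reusing the earlier example: the ORG span is innermost (level 1),
# and the PER span that contains it sits at level 2.
spans = [(0, 24, "ORG"), (0, 32, "PER")]
for s in spans:
    print(s, "-> level", nesting_level(s, spans))
```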

Line graph showing performance degradation as nesting level increases.

Figure 3 shows the “Performance changes according to the nested level.”

  • GENIA (Dark Blue): Performance drops as the nesting gets deeper and more complex.
  • SOTA (Orange/Light Blue): The specialized BERT models (SOTA) maintain much higher consistency across levels compared to the LLMs.

The LLMs show a “U-shaped” or inconsistent performance curve. They sometimes catch the deep nesting but miss the shallow ones, or vice versa, lacking the systematic stability of the BERT architectures designed specifically for this recursive task.

Conclusion and Implications

This research provides a sober, evidence-based look at the hype surrounding LLMs. While powerful, they are not magic wands for every NLP task.

Key Takeaways for Students:

  1. Prompt Engineering is Engineering: You cannot just ask “Find entities.” You must define the schema, provide definitions, and explain the structural constraints (nesting) to the model.
  2. Output Format is a Hyperparameter: Asking for JSON vs. Indices vs. Tags changes performance dramatically.
  3. Specialization beats Generalization (for now): For specific, high-precision tasks like Biomedical Nested NER, a smaller, architecturally specialized model (like a BERT-CRF) still beats a massive general-purpose LLM.

The future of Nested NER likely lies in hybrid approaches—using LLMs to generate labeled data or definitions to help train smaller, sharper specialized models. As LLMs continue to evolve, their ability to handle recursive logic may improve, but for now, the “onion” of Nested NER remains a tough one to peel.