When we interact with modern Large Language Models (LLMs) like GPT-4 or Llama, we usually experience them in a “streaming” format. Words appear one by one, creating the illusion of a conversation. But for developers and researchers building complex applications—like automated fact-checkers or knowledge graph builders—this streaming text presents a challenge.

How do you extract structured data, such as names, locations, or dates (Named Entity Recognition, or NER), from a stream of text that hasn’t finished generating yet?

The conventional approach is clunky: you wait for the sentence to finish, then feed it into a separate, heavy-duty NER model. This introduces lag and doubles the computational cost. Alternatively, you could fine-tune the LLM itself to output tags, but that risks “lobotomizing” the model’s general capabilities (catastrophic forgetting).

In the paper “Embedded Named Entity Recognition using Probing Classifiers,” researchers propose a novel solution called EMBER. Instead of treating the LLM as a black box, EMBER peeks inside the model’s “brain” while it generates text. By analyzing the internal signals the model is already producing, EMBER can identify entities in real-time with almost zero latency and without changing a single weight in the LLM.

In this post, we will break down how EMBER works, the architecture behind it, and why it might change how we think about information extraction in generative AI.

The Problem: Latency vs. Utility

To understand why EMBER is significant, we first need to look at the “Baseline” approach used today.

In a typical pipeline (the top half of Figure 1 below), you have a two-step process. First, the Large Language Model generates a full sequence of text. Only after the generation is complete does a separate NER model process that text to find entities like “Miami” (GPE - Geo-Political Entity). This is slow. The latency is high because the second model can’t start until the first one finishes.

EMBER (the bottom half of Figure 1) merges these steps. It attaches lightweight “probes” to the LLM. As the LLM generates the token “Miami,” the probe immediately recognizes it as a city.

Figure 1: EMBER enables simultaneous text generation and entity annotation by using a language model’s internal representations as the feature space for classification. Compared to using state-of-the-art NER models, this results in a substantially more efficient pipeline, allowing for streaming named entity recognition. Parameter and latency comparisons stated in this figure are based on the experiments conducted using GPT-2 XL, presented in Section 6.

The claims made by the authors are bold: 80x faster and 50x fewer parameters than the baseline approach.

Background: What is a Probing Classifier?

Before diving into the architecture, we need to understand the core tool used here: the Probing Classifier.

When an LLM processes a word, it transforms that word into a vector of numbers (a hidden state) and passes it through many layers. Research in “mechanistic interpretability” has shown that these internal vectors contain rich semantic information. Even though the model is trained to predict the next word, its internal states essentially “know” if the current word is a verb, a noun, or a proper name.

A probing classifier is a very simple, small neural network (often just a couple of layers) trained to look at these frozen internal vectors and predict a property, such as “Is this a Person?” It is a diagnostic tool—like a stethoscope for a neural network.

The innovation of EMBER is moving probing classifiers from a diagnostic tool to a production tool.
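
To make this concrete, here is a minimal sketch of a probing classifier in PyTorch on top of a frozen Hugging Face GPT-2. The layer index, label set, and example sentence are illustrative assumptions rather than the paper’s exact configuration, and the probe is shown untrained.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

# A minimal probe: one linear layer reading frozen GPT-2 hidden states.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()                                   # the LLM itself is never updated

NUM_LABELS = 5                                 # e.g. O, PER, LOC, ORG, MISC (illustrative)
probe = nn.Linear(model.config.hidden_size, NUM_LABELS)

inputs = tokenizer("Angela Merkel visited Miami.", return_tensors="pt")
with torch.no_grad():                          # no gradients flow into the LLM
    out = model(**inputs, output_hidden_states=True)

hidden = out.hidden_states[8]                  # hidden states at one middle layer (arbitrary choice)
logits = probe(hidden)                         # shape: (1, seq_len, NUM_LABELS)
predicted = logits.argmax(dim=-1)              # per-token tag guesses (probe is untrained here)
```

Only the linear layer’s weights would ever be trained; the LLM remains frozen throughout, which is exactly what makes the approach non-destructive.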

The EMBER Architecture

The authors designed EMBER to solve two distinct sub-problems of Named Entity Recognition:

  1. Entity Typing: What is this word? (e.g., Person, Location, Organization).
  2. Span Detection: Which words belong together? (e.g., “New York” is one entity, not two separate ones).

Because decoder-only models (like GPT) generate text one token at a time (autoregressively), they cannot “see” future words, which makes it difficult to know where an entity ends. To solve this, EMBER uses a dual-probe architecture.

Figure 2: Illustration of the proposed approach for named entity recognition using probing classifiers. Black squares symbolize individual transformer layers at individual timesteps, while dotted lines symbolize information flow throughout the transformer. Probing classifiers are shown in red, with circles symbolizing where representations are accessed. One classifier performs token-level entity typing using hidden states at a single layer, while a second classifier detects spans based on attention weights. Both predictions are aggregated into span-level entity predictions.

As illustrated in Figure 2, the system extracts information from two different parts of the Transformer block: the Hidden States and the Attention Weights.

1. Tokenwise Classification (The “What”)

To determine the type of an entity, EMBER looks at the hidden state (\(h_i^l\)) of a specific layer (\(l\)) for the current token (\(i\)). It passes this vector through a small classifier to predict a label (like B-PER for “Beginning of Person”).

The mathematical formulation for this typing function is:

\[ f_{\mathrm{type}}(h_i^l) = \hat{y}_i \]

However, token-level classification has a flaw. If you only see the word “New,” you might guess it’s an adjective. You need the context “New York” to know it’s a city. Because the model generates “New” before “York,” the classification for “New” might be wrong initially. This is where the second component comes in.

2. Span Detection (The “Where”)

To figure out which words form a multi-word entity, the authors utilize the Attention Mechanism. In a Transformer, attention weights (\(A\)) represent how strongly one token relates to previous tokens.

The hypothesis is simple: if the model understands “New York Film Festival” is a single concept, the attention weights between these tokens should be distinct.

The researchers tested three ways to detect spans using attention weights, as shown in Figure 3:

  • Neighbour Classification (a): Simply checking if adjacent tokens are related.
  • Span Classification (b & c): Predicting if token \(i\) is the start and token \(j\) is the end of an entity.

Figure 3: Illustration of the different span detection methods. Red colors indicate which attention weights to classify as positive for the example span “New York Film Festival”. Attention weights are only shown for a single layer, but are generally used at all layers.

The authors found that Span Classification worked best. Specifically, they check the attention weights between the current token and previous tokens to see if they form a coherent group.

The equation for checking neighbor connectivity looks like this:

\[ f_{\mathrm{adj}}(A_{j,i}) = \hat{a}_{i,j} \]

And the more effective span classification is defined as:

\[ f_{\mathrm{span}}(A_{j,i}) = \hat{s}_{i,j} \]
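
As a rough sketch of what such a span probe could look like, the snippet below builds a feature vector from the attention weights between a candidate start token \(j\) and the current token \(i\), stacked across all layers and heads, and feeds it to a small classifier. The model choice, the way layers and heads are combined, and the untrained `span_probe` are assumptions for illustration, not the paper’s exact setup.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The New York Film Festival opens tonight.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: tuple of num_layers tensors, each (batch, heads, seq, seq)
attn = torch.stack(out.attentions, dim=1)      # (batch, layers, heads, seq, seq)
_, layers, heads, seq, _ = attn.shape

span_probe = nn.Linear(layers * heads, 2)      # untrained placeholder: span vs. not-span

def span_score(i: int, j: int) -> torch.Tensor:
    """Score whether tokens j..i form a single entity span (j <= i)."""
    feats = attn[0, :, :, i, j].reshape(-1)    # attention from position i back to position j
    return span_probe(feats).softmax(dim=-1)

# Example: do tokens 1 ("New") through 4 ("Festival") hold together as one span?
print(span_score(i=4, j=1))
```
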

3. Label Propagation

Once the system has (1) a list of potential types for each token and (2) a list of which tokens are grouped together, it needs to combine them.

The authors introduce Label Propagation. The idea is to trust the prediction of the last token in a span. Why the last one? Because in an autoregressive model, the last token has access to the context of all previous tokens.

For “New York Film Festival,” the token “Festival” has “read” the words “New York Film.” Therefore, the hidden state for “Festival” is most likely to correctly encode the concept “Event.” EMBER takes the label predicted for “Festival” and propagates it backward to “New,” “York,” and “Film.”
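
The propagation step itself is simple bookkeeping. Here is a minimal sketch under the assumption that token tags and detected spans are already available; the tag names and BIO prefixes are simplified for brevity.

```python
def propagate_labels(token_tags: list[str], spans: list[tuple[int, int]]) -> list[str]:
    """Overwrite per-token tags so every token in a span inherits the last token's tag."""
    tags = list(token_tags)
    for start, end in spans:          # inclusive indices, start <= end
        final_tag = tags[end]         # the last token has seen the whole span
        for pos in range(start, end + 1):
            tags[pos] = final_tag
    return tags

tokens = ["New", "York", "Film", "Festival", "opens", "tonight"]
token_tags = ["GPE", "GPE", "O", "EVENT", "O", "O"]   # per-token guesses from the typing probe
spans = [(0, 3)]                                      # span probe: tokens 0..3 belong together

print(propagate_labels(token_tags, spans))
# ['EVENT', 'EVENT', 'EVENT', 'EVENT', 'O', 'O']
```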

Experiments and Key Findings

The researchers evaluated EMBER on standard datasets (CoNLL2003 and Ontonotes5) using models like GPT-2, GPT-J, and Pythia.

Finding 1: Span Propagation is Superior

The first major finding was that combining the two probes (Span Propagation) vastly outperformed trying to classify every token individually.

Table 1: NER scores for GPT-2 XL using hidden states and attention weights in different ways. The column “MD” indicates the feature space used for mention detection in the approach, where “H” stands for hidden state and “A” stands for attention. All scores are micro F1 scores measured on the validation sets of CoNLL2003 and Ontonotes5.

As seen in Table 1, simple tokenwise typing achieved an F1 score of roughly 71%. However, using Span propagation (combining Hidden states for typing and Attention for span detection) jumped the performance to 90.47% on the CoNLL2003 dataset.

Finding 2: Competitive with In-Context Learning

How does EMBER compare to just asking the model to “find the entities” via a prompt (Few-Shot In-Context Learning)?

Table 4: Few-shot F1 scores for NER on CoNLL2003. All scores are micro F1 scores. *Results as reported by Chen et al. (2023b).

Table 4 reveals an interesting trade-off. For extremely low-data scenarios (1-shot or 5-shot), prompting (ICL) is better. However, as soon as you have a bit more data (10-shot, 50-shot), EMBER begins to offer a robust alternative, eventually achieving higher stability and efficiency than prompting, which requires processing long examples every time you run the model.

While EMBER generally scores lower (roughly 80-85% F1) than fine-tuning approaches that modify the model’s weights (and reach >90%), it offers an efficiency gain that those approaches cannot match.

Finding 3: Architecture Matters (Heads vs. Hidden Dims)

One of the most educational parts of the paper is the analysis of why some models perform better than others with EMBER.

You might assume that a model with a larger hidden dimension (more neurons) would be better at Named Entity Recognition. Figure 4 shows that Entity Typing (predicting “Person” vs “City”) indeed correlates strongly with hidden dimension size.

Figure 4: Entity typing F1 scores (validation set) for models with respect to hidden state dimension.

However, Span Detection (finding the boundaries of the entity) tells a different story.

Figure 5: Mention detection F1 scores (validation set) for models with respect to the total number of attention heads.

Figure 5 shows that Mention Detection correlates strongly with the number of attention heads, not necessarily the model size. This suggests that having more “perspectives” (attention heads) allows the model to better track the relationships between words, which is crucial for identifying multi-word entities.

This is why the older GPT-2 XL (which has many heads) actually outperformed some newer, theoretically “better” models in this specific task.

The “Killer Feature”: Streaming Efficiency

The primary motivation for EMBER was speed. The results here are undeniable.

Table 5: Impact of streaming NER during generation on inference speed for GPT-2 XL. The results show clearly how much more efficient EMBER is compared to the baseline approach, incurring a performance penalty on token generation rates of only 1% (compared to more than 40%).

Table 5 compares the time it takes to generate text while extracting entities.

  • Baseline (External Model): Slows down generation by 43.64%.
  • EMBER: Slows down generation by only 1.01%.

Because EMBER taps into calculations the GPU has already done (the internal states), the computational overhead is negligible. This makes it feasible to run complex analytics on live chat streams without affecting the user experience.
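
To illustrate why the overhead is so small, here is a rough sketch of a decoding loop where a probe piggybacks on the hidden state the model just computed for each new token. The model, layer index, and untrained probe are assumptions for illustration; this is not the authors’ implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

NUM_LABELS = 5
probe = nn.Linear(model.config.hidden_size, NUM_LABELS)   # untrained placeholder probe

input_ids = tokenizer("The concert takes place in", return_tensors="pt").input_ids
past_key_values = None

with torch.no_grad():
    for _ in range(15):
        out = model(
            input_ids if past_key_values is None else input_ids[:, -1:],
            past_key_values=past_key_values,
            use_cache=True,
            output_hidden_states=True,
        )
        past_key_values = out.past_key_values

        # The hidden state for the newest position was computed as a by-product
        # of decoding; tagging it costs only one small matrix multiply.
        h_new = out.hidden_states[8][:, -1, :]
        tag = probe(h_new).argmax(dim=-1).item()
        print(tokenizer.decode(input_ids[0, -1:]), "-> tag", tag)

        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
```

Because no extra forward pass through a second model is needed, the per-token overhead stays negligible relative to the generation step itself.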

A Note on Generated Text

The researchers found one quirk: Probes trained on static, human-written text performed worse when applied to the model’s own generated text. The distribution of attention weights changes during generation compared to processing a prompt.

Table 6: NER F1 scores for 3 approaches on our evaluation dataset. “Original” indicates the scores for the non-generated text or prompt. “Generated” indicates scores for annotations on the 100 generated tokens following the prompt.

As shown in Table 6, when the probes were trained on generated data (the bottom section), the performance gap between original and generated text almost vanished. This highlights the importance of training these probes on the specific type of data (generated vs. static) they will encounter in production.
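
One plausible way to gather probe training features from generated text is sketched below: let the model generate continuations, re-encode them, and keep the internal representations for those tokens. How the labels for the generated tokens are obtained (human annotation, a silver-labeling model, etc.) is deliberately left abstract here and is not taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt_ids = tokenizer("Breaking news from", return_tensors="pt").input_ids

with torch.no_grad():
    gen = model.generate(prompt_ids, max_new_tokens=30, do_sample=True,
                         pad_token_id=tokenizer.eos_token_id)
    out = model(gen, output_hidden_states=True)       # re-encode the full generated sequence

features = out.hidden_states[8][0]                    # (seq_len, hidden_dim) training features
generated_tokens = tokenizer.convert_ids_to_tokens(gen[0])

# Labels for `generated_tokens` must come from outside this sketch; pairing them
# with `features` yields probe training data drawn from the generated-text distribution.
```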

Putting it to Work: The Toolkit

The authors didn’t just publish a paper; they released a toolkit called STOKE (Streaming TOKen Extraction). It allows developers to train their own probing classifiers on top of Hugging Face models.

Figure 10: Screenshot of the model playground.

The playground (Figure 10) visualizes the power of this approach. As the model types “James,” the system tags it as a PERSON. As it continues to generate “…who has been writing…”, the system updates and refines its understanding of the sentence structure live.

Conclusion

The EMBER paper presents a compelling shift in how we think about “using” Large Language Models. Rather than treating them solely as text generators or black boxes to be fine-tuned, we can view them as rich repositories of semantic information that can be mined in real-time.

Key Takeaways:

  1. Non-Destructive: You can add NER capabilities to a model without altering its weights or risking catastrophic forgetting.
  2. Attention is Key: The attention mechanism holds the structural secrets of the text (spans), while hidden states hold the semantic secrets (types).
  3. Speed: For streaming applications, probing classifiers are vastly superior to external models, incurring almost no latency penalty.

As LLMs become the engine for more real-time applications, techniques like EMBER that allow us to “read the mind” of the model while it speaks will likely become standard tools in the AI engineering toolkit.