Introduction
In the world of Natural Language Processing (NLP), Named-Entity Recognition (NER) is a cornerstone task. We typically ask models to read a sentence and highlight specific items—persons, organizations, locations, or medical symptoms. For years, the standard approach has been to treat these entities as solid blocks of text. If you see “New York City,” you draw a box around three consecutive words.
But human language is rarely so tidy. Especially in specialized fields like biomedicine, entities often break apart, wrap around other words, or share components. Consider the phrase: “The patient suffered pain in the left and right knees.”
A standard NER model sees a confusing sequence. A human, however, sees two distinct entities: "pain in the left knees" and "pain in the right knees." The entities are discontinuous: they skip over words, and they share words.
Handling these discontinuous structures has historically been a headache for NLP researchers. Previous solutions involved complex transition systems or graph-based models that were computationally expensive (slow) and often structurally ambiguous.
In the paper “A Fast and Sound Tagging Method for Discontinuous Named-Entity Recognition,” researchers propose a refreshing solution. They demonstrate how to convert this complex, discontinuous problem into a streamlined tagging task—similar to the standard methods we use for simple text—while guaranteeing mathematical “soundness.” The result? A model that matches state-of-the-art accuracy but runs up to 50 times faster.
In this post, we will dissect their method, understanding how they decompose complex sentences, how they use automata to enforce logic, and how they train a model when the data doesn’t explicitly tell them how to break things down.
The Problem: When “Flat” Tagging Fails
To understand the innovation, we first need to look at the limitation of the standard approach.
In “flat” or standard NER, we usually use BIO tagging:
- B: Beginning of an entity.
- I: Inside an entity.
- O: Outside an entity.
This works perfectly for "New York": "New" is tagged B and "York" is tagged I. But how do you tag "pain in left and right knees" to extract "pain in left knees" and "pain in right knees"? If you tag "pain" as a Beginning, there is no easy way to indicate that it belongs to two separate entities at once without making the tag set incredibly complex.
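To make the limitation concrete, here is a tiny hand-written illustration (my own, not from the paper) of why a single flat BIO sequence cannot encode two mentions that share words:

```python
# Hand-written illustration (not from the paper): flat BIO works for
# contiguous mentions but cannot encode mentions that share tokens.
flat = list(zip(["New", "York"], ["B-LOC", "I-LOC"]))  # fine: one contiguous span

tokens = ["pain", "in", "the", "left", "and", "right", "knees"]
gold_mentions = [
    ["pain", "in", "the", "left", "knees"],   # skips "and right"
    ["pain", "in", "the", "right", "knees"],  # shares "pain in the" and "knees"
]
# No single BIO tag for "pain" can mark it as the beginning of *both*
# mentions at once, so a flat tag sequence cannot recover gold_mentions.
```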
The researchers tackle this by focusing on Discontinuous NER. This is crucial for biomedical texts where adverse drug reactions often appear as split phrases.
The Linguistic Challenge
Discontinuities usually happen for two reasons:
- Word Order Rules: “Toes are painful” (The entity is “painful toes,” but the verb splits it).
- Coordination Reduction: “Pain in arms and shoulders” (The speaker omits the second “pain in” for efficiency).
Existing methods tried to solve this with "hypergraphs" or "transition actions" (like shift-reduce parsers). While effective, these methods often have quadratic time complexity (\(O(n^2)\)) or worse, making them slow on long documents. Furthermore, some previous tagging schemes suffered from "structural ambiguity": a single tag sequence could correspond to several different sets of entities, or a single set of entities could be encoded by several different tag sequences.
The Core Method: Reduction to Word Tagging
The researchers’ primary contribution is a new way to represent these broken chains of text so they can be solved with linear-time tagging algorithms.
1. The Two-Layer Representation
Instead of trying to capture the discontinuous entity all at once, the authors propose breaking it down. They view a discontinuous mention as a set of Typed Components.
Imagine the phrase: “swollen and stiff knees”
This contains two entities: “swollen knees” and “stiff knees.” The researchers decompose this into a two-layer structure:
- Upper Layer: Identifies the set of mentions.
- Lower Layer: Identifies the components (the contiguous text fragments).
They restrict the components to two types, arbitrarily labeled X and Y (or conceptually, “Event” and “Body Part”).

Look at the bottom half of Figure 1 above.
- The phrase is: “Chronic fatigue together with swollen and stiff knees and left elbows.”
- Notice the Discontinuous (Disc) section.
- The word “swollen” is tagged. The word “stiff” is tagged. The word “knees” is tagged.
- By assigning “swollen” and “stiff” to one type (let’s say Type X, or EVENT) and “knees” to another type (Type Y, or PART), the model can reconstruct the full entities by taking the Cartesian product.
If Set 1 contains Type X={“swollen”, “stiff”} and Type Y={“knees”}, the reconstruction logic says: combine every X with every Y. Result: {“swollen knees”, “stiff knees”}.
This clever decomposition turns a complex overlapping problem into a simpler component-identification problem.
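As a quick illustration, the reconstruction step boils down to a Cartesian product over the two component types. The snippet below is a minimal sketch of that idea, not code from the paper:

```python
from itertools import product

# Minimal sketch of the reconstruction step: within one discontinuous set,
# every type-X component is combined with every type-Y component.
type_x = ["swollen", "stiff"]  # e.g. the "event" components
type_y = ["knees"]             # e.g. the "body part" component

mentions = [f"{x} {y}" for x, y in product(type_x, type_y)]
print(mentions)  # ['swollen knees', 'stiff knees']
```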
2. The Tagging Scheme
To implement this decomposition, the authors introduce a specific set of 10 tags.
For Continuous Mentions:
- CB: Continuous Begin
- CI: Continuous Inside
- O: Outside
For Discontinuous Sets:
They use tags with a prefix and a suffix, like DB-Bx.
- Prefix (The Span):
  - DB: Discontinuous Begin (start of the whole set's span).
  - DI: Discontinuous Inside (inside the span).
- Suffix (The Component):
  - Bx / By: Begin a component of type X or Y.
  - Ix / Iy: Inside a component of type X or Y.
  - O: Not part of any component (but still inside the span).
Looking back at Figure 1, notice the tag sequence below the words. "Swollen" gets DB-Bx (start of the span and start of a type-X component). "And" gets DI-O (inside the span, but not part of any component). "Stiff" gets DI-Bx (inside the span, start of another type-X component).
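To see how the scheme plays out end to end, here is a hedged Python sketch of how a tag sequence of this kind could be decoded back into mentions. The tag names follow the list above; the `decode` helper, its assumption of a well-formed input sequence, and the X-before-Y ordering of the output strings are simplifications of mine, not the authors' implementation.

```python
from itertools import product

def decode(tokens, tags):
    """Turn a tag sequence (as described above) back into mentions.
    Assumes a well-formed sequence; output joins X before Y for display."""
    mentions, i, n = [], 0, len(tokens)
    while i < n:
        if tags[i] == "CB":                  # continuous mention: CB CI*
            j = i + 1
            while j < n and tags[j] == "CI":
                j += 1
            mentions.append(" ".join(tokens[i:j]))
            i = j
        elif tags[i].startswith("DB-"):      # discontinuous set: DB-* then DI-*
            comps = {"x": [], "y": []}
            while i < n and (tags[i].startswith("DB-") or tags[i].startswith("DI-")):
                suffix = tags[i].split("-")[1]        # Bx, By, Ix, Iy or O
                if suffix in ("Bx", "By"):            # start a new component
                    comps[suffix[-1].lower()].append([tokens[i]])
                elif suffix in ("Ix", "Iy"):          # extend the current component
                    comps[suffix[-1].lower()][-1].append(tokens[i])
                i += 1
            xs = [" ".join(c) for c in comps["x"]]
            ys = [" ".join(c) for c in comps["y"]]
            mentions += [f"{x} {y}" for x, y in product(xs, ys)]
        else:
            i += 1
    return mentions

tokens = ["swollen", "and", "stiff", "knees"]
tags   = ["DB-Bx", "DI-O", "DI-Bx", "DI-By"]
print(decode(tokens, tags))  # ['swollen knees', 'stiff knees']
```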
3. Ensuring Soundness with Automata
Here is where the approach becomes “sound.” If a neural network simply predicts tags for each word independently, it might output nonsense. It could predict an “Inside component” tag without a “Begin component” tag. Or it might predict a sequence that doesn’t map back to any valid entity.
To prevent this, the researchers use a Weighted Finite State Automaton (WFSA).
A WFSA is a directed graph where transitions between states emit symbols (tags). The researchers designed a specific “Grammar Automaton” that encodes all the rules of their tagging scheme.

Figure 2 visualizes this grammar.
- State 1 is the hub.
- If the model predicts a continuous mention (CB), it moves to State 2. From State 2, it can only emit CI (continue the mention) or return to State 1 (end it). It cannot jump to a discontinuous tag.
- If the model starts a discontinuous set (DB-Bx), it moves into the more complex structure on the right (States 7, 8, and so on).
- The paths in this automaton enforce constraints. For example, you cannot exit the discontinuous structure until you have found at least one component of Type X and one of Type Y.
This automaton acts as a filter. The model is effectively forbidden from predicting an invalid sequence.
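To give a flavor of what "filtering" means in practice, here is a toy transition mask over a subset of the tags, written in Python. This is only illustrative: the real grammar automaton has additional states (for example, to remember whether both component types have already been seen), which a simple tag-to-tag mask cannot fully express.

```python
import numpy as np

# Toy pairwise transition mask (my illustration, not the paper's automaton).
TAGS = ["O", "CB", "CI", "DB-Bx", "DI-Bx", "DI-Ix", "DI-By", "DI-Iy", "DI-O"]
IDX = {t: i for i, t in enumerate(TAGS)}
allowed = np.zeros((len(TAGS), len(TAGS)), dtype=bool)

def allow(src, dsts):
    for dst in dsts:
        allowed[IDX[src], IDX[dst]] = True

allow("O",     ["O", "CB", "DB-Bx"])                  # outside: start something new
allow("CB",    ["CI", "O", "CB", "DB-Bx"])            # continuous mention continues or ends
allow("CI",    ["CI", "O", "CB", "DB-Bx"])
allow("DB-Bx", ["DI-Ix", "DI-Bx", "DI-By", "DI-O"])   # inside the set: cannot exit yet
allow("DI-Bx", ["DI-Ix", "DI-Bx", "DI-By", "DI-O"])
allow("DI-Ix", ["DI-Ix", "DI-Bx", "DI-By", "DI-O"])
allow("DI-O",  ["DI-Bx", "DI-By", "DI-O"])            # a gap cannot continue a component
allow("DI-By", ["DI-Iy", "DI-By", "DI-O", "O", "CB"]) # once a Y exists, the set may end
allow("DI-Iy", ["DI-Iy", "DI-By", "DI-O", "O", "CB"])

assert not allowed[IDX["O"], IDX["CI"]]   # e.g. CI can never follow O
```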
The Decoding Algorithm
How does the model actually select the best tags? It uses the intersection of two automata.
- The Sentence Automaton (\(S\)): This represents the neural network’s raw scores. It allows any tag at any position, weighted by the network’s output.
- The Grammar Automaton (\(G\)): This represents the valid sequences (from Figure 2).
By computing the intersection \(G \cap S\), the researchers create a new graph that contains only valid paths, weighted by the neural network’s confidence.
To find the best tag sequence, they perform Maximum a Posteriori (MAP) inference.
\[
\widehat{\pmb{y}}_{\theta}(\pmb{x}) = \underset{\pmb{y} \in Y}{\arg\max} \; \langle \pmb{y}, f_{\theta}(\pmb{x}) \rangle ,
\]

This equation effectively asks: "Out of all valid sequences (\(y \in Y\)), which one has the highest total score?" Because the intersection graph is well-structured, this can be solved using the Viterbi algorithm.
Critically, the time complexity of this decoding step is linear with respect to the input length (\(O(n)\)). Previous graph-based approaches often scaled quadratically or worse.
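Below is a minimal sketch of such constrained MAP decoding with the Viterbi algorithm, reusing the TAGS list and allowed mask from the toy example above. It is an illustration of the idea rather than the authors' implementation; the start_ok and end_ok vectors are assumptions of mine.

```python
import numpy as np

def constrained_viterbi(emissions, allowed, start_ok, end_ok):
    """Highest-scoring tag sequence whose transitions respect `allowed`.
    Runs in O(n * T^2) time: linear in the sentence length n."""
    NEG = -1e9
    n, T = emissions.shape
    trans = np.where(allowed, 0.0, NEG)
    score = np.where(start_ok, emissions[0], NEG)
    back = np.zeros((n, T), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + trans + emissions[i][None, :]  # cand[s, t]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    score = np.where(end_ok, score, NEG)
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]

# Usage with random emission scores over the toy tag set from the mask sketch.
rng = np.random.default_rng(0)
emissions = rng.normal(size=(6, len(TAGS)))  # 6 tokens
start_ok = np.array([t in ("O", "CB", "DB-Bx") for t in TAGS])
end_ok = np.array([t in ("O", "CB", "CI", "DI-By", "DI-Iy") for t in TAGS])
print([TAGS[t] for t in constrained_viterbi(emissions, allowed, start_ok, end_ok)])
```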
Handling Missing Labels: Weak Supervision
There is a catch. Real-world datasets annotate “pain in arms” as an entity, but they do not annotate “pain in” as Component Type X and “arms” as Component Type Y. The internal structure we discussed earlier is a theoretical construct of this paper, not something present in the training data.
This means we have Partial Labels. We know the final entities, but not the tag sequence that generates them.
To solve this, the researchers use a technique often called Soft Expectation-Maximization (EM). They treat the component types as latent (hidden) variables.
During training, instead of forcing the model to match one specific tag sequence, they minimize the negative log-likelihood over all possible valid tag sequences that could produce the gold-standard entities.
\[
\widetilde{\ell}\big(\theta; \widetilde{Y}\big) = -\log p_{\theta}\big(\widetilde{Y} \mid \pmb{x}\big) = -\log \sum_{\pmb{y} \in \widetilde{Y}} p_{\theta}\big(\pmb{y} \mid \pmb{x}\big) = A_{Y}\big(f_{\theta}(\pmb{x})\big) - \underbrace{\log \sum_{\pmb{y} \in \widetilde{Y}} \exp \langle \pmb{y}, f_{\theta}(\pmb{x}) \rangle}_{= A_{\widetilde{Y}}\left(f_{\theta}(\pmb{x})\right)} \qquad (3)
\]

In Equation 3, \(\widetilde{Y}\) is the set of all tag sequences compatible with the gold-standard entities, and \(A_{Y}\) is the log-partition function over all grammatically valid sequences. The loss sums the probability of every compatible sequence, which allows the model to learn the best internal structure (e.g., distinguishing body parts from events) on its own, without explicit supervision.
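Conceptually, this loss is the difference of two log-partition functions: one over all grammatically valid sequences (\(A_{Y}\)) and one over the subset compatible with the gold entities (\(A_{\widetilde{Y}}\)). The sketch below, which takes the same kind of allowed transition mask as earlier, approximates the compatible set with a per-position tag mask for brevity; in the paper, this set is itself encoded as an automaton and handled by intersection.

```python
import numpy as np

def log_partition(emissions, allowed, tag_mask=None):
    """Forward algorithm in log space: log-sum-exp over all tag sequences
    whose transitions respect `allowed`. An optional per-position boolean
    `tag_mask` keeps only tags compatible with the gold entities."""
    NEG = -1e9
    n, T = emissions.shape
    trans = np.where(allowed, 0.0, NEG)
    alpha = emissions[0].copy()
    if tag_mask is not None:
        alpha = np.where(tag_mask[0], alpha, NEG)
    for i in range(1, n):
        alpha = np.logaddexp.reduce(alpha[:, None] + trans + emissions[i][None, :], axis=0)
        if tag_mask is not None:
            alpha = np.where(tag_mask[i], alpha, NEG)
    return np.logaddexp.reduce(alpha)

def weakly_supervised_loss(emissions, allowed, gold_tag_mask):
    # Equation 3: A_Y(f(x)) minus A_Ỹ(f(x))
    return log_partition(emissions, allowed) - log_partition(emissions, allowed, gold_tag_mask)
```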
The researchers also experimented with a “Hard EM” approach (picking the single best path during training) and incorporating external knowledge (using a medical dictionary to identify body parts), but they found the standard Soft EM worked very well.
Experiments and Results
The team evaluated their model on three biomedical datasets: CADEC, SHARE2013, and SHARE2014. These datasets are notoriously difficult due to the high frequency of disjointed medical symptoms.
Accuracy Comparison
Despite the simplicity of the architecture (a standard DeBERTa-V3 model + the automata layer), the performance was impressive.

Table 1 shows the F1 scores.
- SOTA Comparison: The proposed method (bottom rows) achieves F1 scores that are highly competitive with, and often better than, complex previous models like transition-based systems (Dai et al.) or maximal clique discovery (Wang et al.).
- Consistency: The performance holds up across different variations of their training (Soft EM vs Hard EM).
The Speed Advantage
Where this paper truly shines is efficiency. Complex neural architectures often require heavy computation. By reducing the problem to sequence tagging (linear complexity), the researchers achieved massive speedups.

Table 3 illustrates the sentences processed per second.
- Previous SOTA (Wang et al., 2021): ~200 sentences/sec.
- This Work: ~10,000 sentences/sec.
This is a 40x to 50x speed increase. For real-world applications where millions of medical documents need processing, this difference is transformative. It turns a job that takes days into one that takes hours.
Limitations
No method is perfect. The authors provide a candid “Discontinuity Analysis.” Their tagging scheme assumes a specific structure (sets of components). There are rare linguistic cases—about 26 instances in the CADEC dataset—where the structure is too complex for their tags.
- Example: “muscle and joint aches in arms and elbows.” This involves a 3-way split/combination that their X/Y type system doesn’t fully capture. However, these cases are statistically rare enough that they don’t significantly harm the overall F1 score.
Conclusion and Implications
The research paper “A Fast and Sound Tagging Method for Discontinuous Named-Entity Recognition” offers a masterclass in algorithm design. Rather than throwing more layers of neural networks at a problem, the authors stepped back and redesigned the representation of the data.
By mapping discontinuous entities to a rigorous tagging scheme and enforcing logic through Finite State Automata, they achieved the “holy grail” of engineering: Soundness, Accuracy, and Speed.
Key Takeaways:
- Representation Matters: Decomposing complex entities into simple components allows us to use faster, standard algorithms.
- Constraints are Powerful: Using automata (WFSA) ensures the model never predicts “impossible” sequences, improving reliability.
- Linear Speed: Moving away from quadratic graph algorithms to linear tagging makes the model usable at massive scales.
- Drop-in Replacement: Perhaps most importantly, the decoding algorithm and loss functions can be dropped into any standard BIO tagger. This opens the door for widespread adoption of discontinuous NER in standard NLP pipelines.
For students and practitioners, this work serves as a reminder that sometimes the best way forward isn’t a bigger GPU, but a smarter algorithm.