In the world of Natural Language Processing (NLP), there is often a tension between speed and precision. On one side, we have statistical models and Large Language Models (LLMs) that process text incredibly fast but sometimes lack deep structural understanding. On the other side, we have “precision grammars”—systems based on complex linguistic theories that provide exact, semantically rich analyses of sentences, but often at the cost of high computational resources and slower processing times.
Head-Driven Phrase Structure Grammar (HPSG) falls into the latter category. It is a highly detailed framework used to generate deep semantic representations of text. While invaluable for tasks requiring high precision (like grammar coaching or semantic parsing), HPSG parsers can be notoriously slow.
In the research paper “Revisiting Supertagging for Faster HPSG Parsing,” authors Olga Zamaraeva and Carlos Gómez-Rodríguez tackle this bottleneck. They investigate whether modern machine learning architectures—specifically BERT and Neural CRFs—can revitalize a technique called supertagging to make these heavy-duty parsers significantly faster without sacrificing their accuracy.
This post will walk you through their methodology, the architecture of their new supertaggers, and the results of their experiments on the English Resource Grammar.
The Problem: The High Cost of Precision
To understand why HPSG parsing is slow, we first need to understand what the parser is actually doing. Unlike a standard Part-of-Speech (POS) tagger that might label the word “bark” simply as a Noun or a Verb, an HPSG parser assigns a Lexical Type.
Lexical Types vs. POS Tags
In HPSG, the lexicon is vast. A lexical type doesn’t just tell you the category of the word; it encodes detailed syntactic and semantic constraints. For example, a verb type might specify exactly what kind of subject it needs, whether it takes an object, and how it interacts with other clauses.

As shown in Figure 1, the hierarchy is deep. The word bark isn’t just a leaf node; it has a rich ancestry of types (e.g., main verb, mass-count noun). The English Resource Grammar (ERG), which is the specific grammar used in this paper, contains over 43,000 potential lexical types, though “only” about 1,300 appear in the training data.
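To make the hierarchy idea concrete, here is a minimal Python sketch of how a type hierarchy can be stored and queried. The type names and parent links are invented for illustration; the real ERG hierarchy is far larger and encodes much richer constraints.

```python
# Minimal sketch of a lexical type hierarchy. Type names and parent
# links are invented for illustration, not taken from the ERG.
HIERARCHY = {
    "v_np_le": "main-verb",          # a transitive-verb-like type
    "main-verb": "verb",
    "verb": "lex-item",
    "n_-_mc_le": "mass-count-noun",  # a mass/count-noun-like type
    "mass-count-noun": "noun",
    "noun": "lex-item",
    "lex-item": None,                # root of the hierarchy
}

def ancestry(lexical_type: str) -> list[str]:
    """Walk from a leaf type up to the root, collecting ancestors."""
    chain = []
    current = lexical_type
    while current is not None:
        chain.append(current)
        current = HIERARCHY.get(current)
    return chain

print(ancestry("v_np_le"))
# ['v_np_le', 'main-verb', 'verb', 'lex-item']
```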
The Ambiguity Explosion
When an HPSG parser receives a sentence, its first job is Lexical Analysis. It looks at every word and retrieves every possible lexical type that word could be.
Consider the sentence: “The dog barks.”
- “Barks” could be a verb (the dog makes a sound).
- “Barks” could be a noun (plural of tree bark).
If the parser considers every possibility, it builds a massive “parse chart.” It tries to combine every potential noun version of “barks” with the rest of the sentence, eventually realizing it doesn’t make sense. This trial-and-error process consumes massive amounts of RAM and time.
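A toy sketch, with an invented lexicon, shows why the number of combinations blows up:

```python
# A toy lexicon: each surface form maps to its candidate lexical
# types. Entries are illustrative, not actual ERG types.
LEXICON = {
    "the":   ["d_-_the_le"],
    "dog":   ["n_-_c_le"],
    "barks": ["v_-_3s_le",   # verb: the dog makes a sound
              "n_-_mc_le"],  # noun: plural of tree bark
}

def candidate_combinations(sentence: list[str]) -> int:
    """Count the lexical-type combinations the parser must entertain."""
    total = 1
    for word in sentence:
        total *= len(LEXICON[word])
    return total

print(candidate_combinations(["the", "dog", "barks"]))  # 2
# With k candidate types per word, an n-word sentence yields up to
# k**n combinations -- the source of the exponential blow-up.
```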

Figure 2 illustrates this. The tree on the left is the correct interpretation. The tree on the right treats “barks” as a noun (as though the sentence were about pieces of tree bark). This second tree is pragmatically unlikely, but the parser doesn’t know that yet. It has to build the structure to find out.
In complex sentences, this ambiguity grows exponentially. Some long sentences require gigabytes of RAM and minutes of processing time just to rule out all the incorrect combinations. This is where supertagging comes in.
Core Method: Supertagging as a Filter
Supertagging is often called “almost parsing.” The idea is to use a statistical model to predict the correct HPSG lexical type for each word before the full parser starts its work.
If a supertagger can look at “The dog barks” and tell the parser with 99% certainty, “Hey, in this context, ‘barks’ is definitely a verb (type v_3s-fin_orl),” the parser can immediately discard the noun interpretation. This drastically shrinks the search space, saving memory and time.
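Here is a minimal sketch of the pruning step, assuming a `tagger` callable that returns one predicted lexical type per token (the names are illustrative, not the paper's implementation):

```python
def prune_chart(sentence, tagger):
    """Keep only the supertagger's prediction for each token.

    `tagger` is any callable that maps a token sequence to one
    predicted lexical type per token (e.g., a fine-tuned BERT model).
    """
    pruned = []
    for word, predicted in zip(sentence, tagger(sentence)):
        # The parser now sees a single candidate instead of every
        # type the lexicon licenses for this word.
        pruned.append((word, [predicted]))
    return pruned
```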
The Evolution of HPSG Supertaggers
Previous attempts at HPSG supertagging relied on Maximum Entropy (MaxEnt) models. While effective, they were trained on older hardware with less data. Zamaraeva and Gómez-Rodríguez revisited this task using modern architectures. They compared four distinct approaches:
- MaxEnt (Baseline): A logistic regression model similar to previous state-of-the-art attempts.
- SVM (Support Vector Machine): A linear classifier known for speed.
- NCRF++ (Neural CRF): A model using Long Short-Term Memory (LSTM) networks combined with a Conditional Random Field (CRF) layer. This architecture excels at sequence labeling because it considers the “flow” of tags across a sentence.
- Fine-tuned BERT: A transformer-based model. BERT is pre-trained on massive amounts of text and understands deep contextual relationships between words.
The authors trained these models on the ERG 2023 treebanks, which provide gold-standard annotations for diverse text genres (news, emails, technical essays).
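As a rough illustration, a BERT supertagger can be framed as standard token classification, for example with the Hugging Face transformers library. The checkpoint name, label inventory, and setup below are placeholders, not the paper's exact configuration:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder label inventory: in the paper's setting this would be
# the ~1,300 ERG lexical types observed in the treebanks.
labels = ["v_np_le", "n_-_mc_le", "d_-_the_le"]  # illustrative only

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)
)

# One sentence, pre-split into words; BERT may split words further
# into subword pieces.
encoding = tokenizer(["The", "dog", "barks"],
                     is_split_into_words=True,
                     return_tensors="pt")
outputs = model(**encoding)

# One label id per subword position; encoding.word_ids() maps these
# positions back to the original words.
predicted_ids = outputs.logits.argmax(dim=-1)
```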
The “Exceptions List” Strategy
There is a danger in supertagging. If the tagger is too aggressive and predicts the wrong tag, the parser might fail to build any tree at all because the correct building block was thrown away.
To mitigate this, the researchers employed an Exceptions List. They identified specific lexical types that the models frequently got wrong (often high-frequency function words like “do,” “my,” or specific prepositions).
The strategy works like this:
- Run the BERT supertagger on the sentence.
- If a word is tagged with a “safe” type, prune all other possibilities from the parser’s chart.
- If a word is tagged with a type on the “Exceptions List,” do not prune. Let the parser consider all possibilities for that specific word.
This hybrid approach aims to balance the raw speed of pruning with the safety of keeping ambiguous options open.
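A sketch of that decision logic, with an invented exceptions set (the paper derives its actual list from error analysis of the trained models):

```python
# Illustrative only: the real exceptions list comes from analyzing
# which lexical types the trained models frequently mistag.
EXCEPTIONS = {"d_-_the_le", "p_np_i_le"}  # invented "unsafe" types

def prune_with_exceptions(sentence, lexicon, tagger):
    """Prune per-word candidates unless the predicted type is unsafe."""
    pruned = []
    for word, predicted in zip(sentence, tagger(sentence)):
        if predicted in EXCEPTIONS:
            # Unsafe prediction: keep every type the lexicon licenses.
            pruned.append((word, lexicon[word]))
        else:
            # Safe prediction: commit to the supertagger's choice.
            pruned.append((word, [predicted]))
    return pruned
```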
Experiments and Results
The authors evaluated their new supertaggers on two fronts: Tagging Accuracy (how often the model picks the right tag) and Parsing Impact (how the tags affect the parser’s speed and final output quality).
1. Tagger Accuracy: BERT Reigns Supreme
The first step was to see if modern neural networks actually outperformed the older MaxEnt baselines.

As shown in Table 2, the results were clear:
- MaxEnt (Baseline): ~91-94% accuracy.
- SVM: Slightly better than MaxEnt.
- NCRF++: Significant jump in accuracy (~95-96%).
- BERT: The clear winner, achieving 97.26% accuracy on the Wall Street Journal (WSJ23) dataset and strong performance on out-of-domain technical essays (cb).
It is important to note that while BERT is the most accurate, it is slower at the tagging phase than the SVM. However, since the subsequent HPSG parsing phase is so computationally expensive, a few milliseconds lost during tagging are negligible if they save seconds during parsing.
2. The Trade-off: Speed vs. F-Score
Next, the researchers integrated these supertaggers into the ACE Parser (the standard efficient parser for the ERG). They compared their BERT-based system against:
- No Tagging: The parser runs with full ambiguity (slowest, theoretically highest recall).
- Ubertagger: An existing, highly optimized supertagger already built into ACE (fast, but based on older MaxEnt models).
They measured performance using a Pareto frontier plot, which visualizes the trade-off between parsing time per sentence (x-axis) and accuracy/F-score (y-axis).

Figure 3 tells the paper’s story in a single plot:
- No Tagging (Bottom Right): Very slow (far right on the x-axis).
- BERT Supertagger with Exceptions (Top Left-Center): This point represents the “sweet spot.” It provides the highest F-score (vertical axis) while being significantly faster than “No Tagging.”
- Ubertagger (Far Left): This is the fastest system. Because it is implemented natively in the parser’s C code, it is incredibly efficient. However, notice that its F-score is lower than the BERT model’s.
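To make the frontier notion concrete, here is a small sketch that identifies non-dominated systems from (time, F-score) pairs. The F-score values are illustrative placeholders, not the paper's exact numbers:

```python
def pareto_frontier(systems):
    """Return the systems not dominated on (time, f_score).

    One system dominates another if it is at least as fast and at
    least as accurate, and strictly better on one of the two.
    """
    frontier = []
    for name, time, f in systems:
        dominated = any(
            t <= time and g >= f and (t < time or g > f)
            for _, t, g in systems
        )
        if not dominated:
            frontier.append(name)
    return frontier

systems = [  # (name, sec/sen, F-score); F-scores are invented
    ("no-tagging", 6.3, 0.80),
    ("bert+exceptions", 4.0, 0.85),
    ("ubertagger", 0.6, 0.82),
]
print(pareto_frontier(systems))  # ['bert+exceptions', 'ubertagger']
```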
3. Detailed Parsing Speed
How much faster is “faster”?

Table 7 breaks down the speed in seconds per sentence (sec/sen) across different datasets:
- No Tagging: On the difficult WSJ23 dataset, it takes 6.27 seconds to parse a single sentence.
- BERT-based Supertags: Reduces this to 4.04 seconds (roughly 35% less time per sentence on this hard dataset, and up to a 3x speedup on others).
- Ubertagger: Blazes through at 0.55 seconds.
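Working the arithmetic from the sec/sen figures above:

```python
# Speedups on WSJ23, computed from the reported sec/sen figures.
no_tag, bert, uber = 6.27, 4.04, 0.55

print(f"BERT vs. no tagging: {no_tag / bert:.2f}x "
      f"({1 - bert / no_tag:.0%} less time per sentence)")
# BERT vs. no tagging: 1.55x (36% less time per sentence)

print(f"Ubertagger vs. no tagging: {no_tag / uber:.1f}x")
# Ubertagger vs. no tagging: 11.4x
```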
While the BERT system is slower than the Ubertagger, it is a massive improvement over the raw parser. But speed isn’t everything—precision matters.
4. Parsing Accuracy
The main contribution of the BERT supertagger is that it makes the parser better, not just faster.

Looking at Table 5 (and the related Table 8 in the paper, which uses the exceptions list), we see the Elementary Dependency Match (EDM) metric:
- Recall Gains: On the WSJ23 dataset, the BERT-based system (default configuration) achieved a recall of 0.44, compared to 0.38 for “No Tagging.”
- Why is “No Tagging” worse? This seems counter-intuitive. Theoretically, keeping all options open should result in the best accuracy. However, in practice, the “No Tagging” system runs out of RAM on complex sentences and fails to produce any output. By pruning the chart efficiently, BERT allows the parser to actually finish difficult sentences.
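As a simplified view of how EDM behaves, an analysis can be scored as a set of dependency triples; a parser that produces no output for a sentence contributes zero recall, which is exactly how “No Tagging” loses ground:

```python
def edm_scores(gold: set, predicted: set):
    """Precision/recall/F1 over elementary dependency triples.

    A simplified view of EDM: each analysis is reduced to a set of
    (head, relation, dependent) triples and the two sets are compared.
    An empty prediction (e.g., from a parser that ran out of memory)
    yields zero recall for that sentence.
    """
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```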
When using the Exceptions List (Table 8 in the full paper, summarized in the results discussion), the BERT system outperformed the Ubertagger in F-score on almost all datasets.
Conclusion and Implications
The research presented in this paper confirms that the “Pre-train and Fine-tune” paradigm (using models like BERT) offers significant advantages even for highly specialized tasks like HPSG parsing.
Key Takeaways:
- BERT is Superior for Supertagging: It reduces the error rate significantly compared to SVM and MaxEnt models, handling the complex contextual cues required to distinguish between subtle lexical types.
- The “Sweet Spot” exists: While not as computationally optimized as the production-grade Ubertagger, the BERT-based system with an exceptions list offers the best accuracy while still providing a 3x speedup over the baseline parser.
- Modernizing Grammar Engineering: This work ensures that precision grammars remain relevant. By alleviating the speed bottleneck, HPSG can be applied to larger corpora and more real-world NLP tasks where deep semantic understanding is required.
For students and researchers, this paper serves as an excellent case study in how hybrid systems—combining deep linguistic theory with modern neural networks—can outperform either approach used in isolation.