Introduction
In the world of Natural Language Processing (NLP), simplicity is often a virtue, but reality is rarely simple. For years, researchers have strived to make syntactic parsing—the task of mapping the grammatical structure of a sentence—as efficient as possible. One of the most successful approaches has been sequence labeling (or tagging). If you can reduce a complex tree structure into a simple sequence of tags (one per word), you can use standard, highly optimized hardware and algorithms to parse sentences at lightning speeds.
This approach works beautifully for dependency trees. A tree has strict rules: every word has exactly one head (parent), and there are no loops. But language is messy. When we move to semantic dependencies or Enhanced Universal Dependencies, we leave the safe harbor of trees and enter the storm of graphs.
In a graph, a word might have multiple parents (reentrancy), or a child might be an ancestor of its own parent (cycles). Traditional sequence labeling methods break down here because they assume a one-to-one relationship between words and heads.
The paper “Dependency Graph Parsing as Sequence Labeling” by Ana Ezquerro, David Vilares, and Carlos Gómez-Rodríguez bridges this gap. The researchers propose a novel set of linearizations—ways to flatten a complex graph into a simple string of labels—allowing us to parse complex semantic graphs using the same efficient machinery we use for simple tagging.
In this post, we will deconstruct how they achieved this, exploring the shift from tree encodings to graph encodings, the mathematics of bit-vector labels, and how these models perform against state-of-the-art parsers.
Background: The Labeling Paradox
To understand the innovation of this paper, we first need to understand the limitation of previous methods.
Parsing as Tagging
Imagine the sentence: “I saw her.” In a Part-of-Speech (PoS) tagging task, we assign a label to each word:
- I \(\to\) PRON
- saw \(\to\) VERB
- her \(\to\) PRON
The input length (\(n=3\)) matches the output length (\(n=3\)). This is sequence labeling. It is fast, efficient, and parallelizable.
The Tree Limitation
Previous work successfully cast dependency tree parsing as sequence labeling. Since every word in a tree has exactly one head, we can construct a label that says, for example, “My head is the word at relative position -1.”
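To make this concrete, here is a minimal sketch of a relative-position tree encoding, where each word's label is simply the signed offset from the word to its head. It illustrates the general idea rather than the exact scheme used in prior work.

```python
# Minimal sketch of relative-positional tree encoding: one label per word,
# the signed offset to its head (illustrative; not the exact prior-work scheme).
def encode_tree(heads):
    """heads[i] is the 1-based position of word i+1's head (0 = artificial root)."""
    return [heads[i] - (i + 1) for i in range(len(heads))]

def decode_tree(labels):
    """Invert the encoding back into a heads list."""
    return [i + 1 + labels[i] for i in range(len(labels))]

# "I saw her": 'saw' is the root; 'I' and 'her' both depend on 'saw' (word 2).
heads = [2, 0, 2]
labels = encode_tree(heads)        # [1, -2, -1]: exactly one label per word
assert decode_tree(labels) == heads
```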
However, consider a semantic graph structure where “John” is both the agent of “wanted” and the agent of “buy” in “John wanted to buy…”
- “John” has two heads.
- Standard tagging allows only one label per word.
- Therefore, standard tree encodings cannot capture this structure.
This paper introduces encodings that can compress multiple edges into a single label per word, effectively enabling graph parsing within the strict constraints of sequence labeling.
Core Method: Flattening the Graph
The authors propose two main families of graph encodings: Unbounded (where the label vocabulary grows with sentence length) and Bounded (where the label size is fixed).
1. Unbounded Encodings: Brackets and Planes
The most intuitive way to handle graphs is to extend the “bracketing” method used in trees. Imagine encoding the arcs of a graph using string characters like <, >, /, and \.
- <: Incoming arc from the left.
- >: Incoming arc from the right.
- /: Outgoing arc to the right.
- \: Outgoing arc to the left.
In a standard tree, a word gets a label based on its single relationship. In a graph, a word might have an incoming arc from the left and an outgoing arc to the right. The authors simply concatenate these symbols. A label might look like <//, meaning “one head to the left, two dependents to the right.”
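As a concrete illustration, here is a small sketch that builds these concatenated labels for a single plane, assuming the graph is given as (head, dependent) pairs and following the symbol convention above. The paper's actual bracketing uses a more careful indexing, so treat this as a simplification.

```python
# Illustrative single-plane graph bracketing, following the convention above:
#   '<' incoming arc from the left   '>' incoming arc from the right
#   '/' outgoing arc to the right    '\' outgoing arc to the left
def bracket_labels(n, arcs):
    """n = sentence length; arcs = set of (head, dependent) pairs, 1-based."""
    labels = ["" for _ in range(n)]
    for head, dep in sorted(arcs):
        if head < dep:                  # head is left of the dependent
            labels[head - 1] += "/"     # head: outgoing arc to the right
            labels[dep - 1] += "<"      # dependent: incoming arc from the left
        else:                           # head is right of the dependent
            labels[head - 1] += "\\"    # head: outgoing arc to the left
            labels[dep - 1] += ">"      # dependent: incoming arc from the right
    return labels

# A word with one head on its left and two dependents on its right gets "<//".
print(bracket_labels(4, {(1, 2), (2, 3), (2, 4)}))   # ['/', '<//', '<', '<']
```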
The Problem of Crossing Arcs
A single sequence of brackets works for simple structures, but graphs often have crossing arcs (where one dependency arrow crosses over another). A single stack-based decoding system cannot handle crossings, because crossing arcs violate the Last-In-First-Out order that a stack relies on.
To solve this, the authors utilize Multiplanarity. They split the graph into \(k\) different “planes” (or subgraphs).
- Plane 1: Handles a subset of non-crossing arcs.
- Plane 2: Handles the remaining arcs that would have crossed the first set.
The encoding adds variations to the brackets for each plane. Plane 1 uses standard brackets (<, >), while Plane 2 uses starred brackets (<*, >*).
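The sketch below shows the underlying idea with a simple greedy split: each arc goes to the first plane where it crosses nothing already placed there. The paper defines its own, more principled plane-assignment strategies; this is only meant to show what "splitting into planes" means.

```python
# Illustrative greedy split of arcs into "planes" so that arcs sharing a plane
# never cross (the paper uses its own, more principled assignment strategies).
def arcs_cross(a, b):
    """Two arcs cross if their spans strictly interleave."""
    (l1, r1), (l2, r2) = sorted(map(sorted, (a, b)))
    return l1 < l2 < r1 < r2

def split_into_planes(arcs):
    planes = []
    for arc in sorted(arcs, key=lambda a: (min(a), -max(a))):
        for plane in planes:
            if not any(arcs_cross(arc, other) for other in plane):
                plane.append(arc)       # fits this plane without crossing anything
                break
        else:
            planes.append([arc])        # every existing plane clashes: open a new one
    return planes

# (2, 5) and (3, 6) cross, so they end up on different planes.
print(split_into_planes({(1, 2), (2, 5), (3, 6)}))   # [[(1, 2), (2, 5)], [(3, 6)]]
```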

Figure 1 above illustrates this. Look at word \(w_3\). It has incoming arcs from both \(w_2\) and \(w_6\).
- In the Naive positional encoding (top row), \(w_3\) is assigned the tuple (2,6).
- In the Bracketing encoding (bottom row), \(w_3\) is assigned <> (incoming from left and right).
- Crucially, looking at \(w_5\), notice the >* symbol. This indicates an incoming arc from the right located on the second plane (the green arc from \(w_6\)), allowing it to cross the black arc without breaking the decoding logic.
2. Bounded Encodings: The Bit-Vector Revolution
While unbounded encodings work, they suffer from a large vocabulary size. The authors introduce Bounded Encodings, where the label is a fixed-size vector of bits. This is the paper’s most significant technical contribution.
They propose two variations: the 4k-bit encoding and the 6k-bit encoding.
The 4k-bit Encoding
This method assumes we can split the graph into \(k\) subgraphs. In each subgraph, a node can have at most one parent.
For every word \(w_i\), and for every subgraph \(j\) (from 1 to \(k\)), we assign 4 bits:
- Direction Bit: Is the parent to the left or right?
- Outermost Bit: Is \(w_i\) the farthest dependent of its parent?
- Left-Children Bit: Does \(w_i\) have dependents to its left?
- Right-Children Bit: Does \(w_i\) have dependents to its right?
If we use \(k=2\) planes, the label is simply 8 bits long (\(4 \times 2\)).
The Dummy Node Issue: Since this encoding requires every node in a subgraph to have a parent (or be a root), nodes that are effectively “orphans” in a specific subgraph are linked to a dummy root node (\(w_0\)). This ensures the bit logic holds but requires adding “null arcs” (dotted lines in diagrams) that are discarded later.
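Here is a rough sketch of how the four bits for one subgraph could be computed, assuming the subgraph is already given as a heads list in which orphaned words point to the dummy root \(w_0\). The bit order follows the list above, and the exact definitions (for example, of "outermost") may differ from the paper's.

```python
# Rough sketch of the four bits for ONE subgraph/plane. Bit order follows the
# list above; the paper's exact definitions may differ in detail.
def four_bits(heads):
    """heads[i-1] = head of word i in this subgraph (0 = dummy root w_0)."""
    n = len(heads)
    labels = []
    for i in range(1, n + 1):
        h = heads[i - 1]
        same_side = [d for d in range(1, n + 1)
                     if heads[d - 1] == h and (d > h) == (i > h)]   # co-dependents on i's side of the parent
        deps = [d for d in range(1, n + 1) if heads[d - 1] == i]
        direction = int(h > i)                                      # 1 = parent is to the right
        outermost = int(all(abs(d - h) <= abs(i - h) for d in same_side))
        has_left = int(any(d < i for d in deps))                    # dependents to w_i's left?
        has_right = int(any(d > i for d in deps))                   # dependents to w_i's right?
        labels.append(f"{direction}{outermost}{has_left}{has_right}")
    return labels

# Toy plane: w_2 hangs from the dummy root, w_1 and w_3 both depend on w_2.
print(four_bits([2, 0, 2]))   # ['1100', '0111', '0100']
```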
The 6k-bit Encoding
The 6k-bit encoding is more robust. It splits the graph into \(k\) pairs of subgraphs. Each pair consists of one subgraph containing only leftward arcs and one containing only rightward arcs.
This removes the need for dummy nodes. For each pair \(j\), we allocate 6 bits:
- Bits 1-3 (Rightward Subgraph): Has parent? Is farthest dependent? Has dependents?
- Bits 4-6 (Leftward Subgraph): Same logic, but for the leftward graph.
This encoding natively supports parentless nodes and tends to cover more complex graphs than the 4k-bit version.
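A comparable sketch for one 6-bit pair, assuming each half of the pair is given as a heads list that only contains arcs in one direction. The bit order and details are again illustrative rather than the paper's exact definition; the point is that a parentless word simply gets a 0 in its "has parent" bit, so no dummy arcs are needed.

```python
# Rough sketch of the 6 bits for ONE pair of subgraphs: 3 bits for the
# rightward-arc half, 3 for the leftward-arc half (illustrative bit order).
def three_bits(heads):
    """heads[i-1] = head of word i in one single-direction subgraph (0 = no parent)."""
    n = len(heads)
    out = []
    for i in range(1, n + 1):
        h = heads[i - 1]
        co_deps = [d for d in range(1, n + 1) if heads[d - 1] == h]
        has_parent = int(h != 0)
        farthest = int(h != 0 and all(abs(d - h) <= abs(i - h) for d in co_deps))
        has_deps = int(any(heads[d - 1] == i for d in range(1, n + 1)))
        out.append(f"{has_parent}{farthest}{has_deps}")
    return out

def six_bits(rightward_heads, leftward_heads):
    """One 6-bit label per word, concatenating the two single-direction halves."""
    return [r + l for r, l in zip(three_bits(rightward_heads), three_bits(leftward_heads))]

# "John wanted to buy": wanted->buy is a rightward arc; wanted->John and
# buy->to are leftward arcs. Parentless words just keep a leading 0 bit.
print(six_bits(rightward_heads=[0, 0, 0, 2], leftward_heads=[2, 0, 4, 0]))
# ['000110', '001001', '000110', '110001']
```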
Visualizing the Bits
Let’s look at how these bit vectors represent a graph.

Figure 2 visualizes the bounded encodings for the same sentence as Figure 1.
- Top Row (\(4k\)): The graph is split into 3 planes (\(4k^1, 4k^2, 4k^3\)).
- Look at the red dotted lines. These are the null arcs required by the 4k-bit encoding to ensure every node has a parent in that subgraph.
- Under each word, you see the bit sequences (e.g., 0100). These fixed-length codes tell the parser exactly how to reconstruct the local structure.
- Bottom Row (\(6k\)): The graph is split into pairs.
- Notice the lack of dotted lines; the 6k encoding handles the structure more naturally.
- The colors distinguish the subgraph pairs.
The Full Architecture
How does the neural network actually predict these labels? The architecture is elegant in its simplicity.

Figure 4 outlines the pipeline:
- Encoder (\(E_\theta\)): A sentence is fed into a standard encoder (like BERT or RoBERTa) to get contextualized word vectors.
- Decoder (\(D_\phi\)): A simple feed-forward network predicts the label string (the brackets or bits) for each word.
- Decoding: A deterministic algorithm reads the labels and reconstructs the graph arcs (unlabeled).
- Labeling (\(D_\varphi\)): A second decoder looks at the predicted arcs and assigns the semantic labels (e.g., “ARG1”, “Compound”).
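To tie the stages together, here is a schematic PyTorch-style sketch of such a pipeline. The module names, sizes, and two-head layout are placeholders chosen for illustration, not the authors' implementation.

```python
import torch.nn as nn

# Schematic sketch of the two-stage tagging pipeline; module names, sizes and
# the head layout are placeholders, not the authors' implementation.
class GraphTagger(nn.Module):
    def __init__(self, encoder, hidden_size, n_structural_labels, n_relations):
        super().__init__()
        self.encoder = encoder                                               # e.g. BERT, XLM-R, or a BiLSTM
        self.structural_head = nn.Linear(hidden_size, n_structural_labels)   # predicts brackets / bit vectors
        self.relation_head = nn.Linear(2 * hidden_size, n_relations)         # names each predicted arc

    def forward(self, batch):
        h = self.encoder(batch)                   # one contextual vector per word
        return self.structural_head(h), h         # per-word structural logits + states

# Conceptually, inference then proceeds in three steps:
#   1. argmax over the structural logits gives one label (brackets or bits) per word;
#   2. a deterministic decoder turns the label sequence back into unlabeled arcs;
#   3. relation_head scores the concatenated (head, dependent) vectors to label each arc.
```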
Experiments and Results
The authors tested their approach on two major benchmarks:
- Semantic Dependency Parsing (DAGs): Using datasets like DM, PAS, and PSD. These are Directed Acyclic Graphs.
- Enhanced Universal Dependencies (IWPT): These graphs can contain cycles.
Graph Density and Coverage
A major challenge in graph parsing is coverage. Can the encoding actually represent the graph? If a graph is too dense (too many crossing arcs), a bounded encoding with low \(k\) might fail to capture all edges.

Table 4 shows the characteristics of the datasets.
- % planes: Notice that for datasets like PSDen (Prague Semantic Dependencies), a significant chunk of graphs requires 3 or 4 planes to be fully represented. This suggests that setting \(k=1\) or \(k=2\) would result in lost information.
- #cycs: The IWPT datasets contain cycles, testing the limits of tree-based logic.
Performance on DAGs (Semantic Dependencies)
How did the models perform?

Table 1 presents the results on Semantic Dependency Parsing.
- A / R (Positional): The naive positional encodings perform the worst. They suffer from sparsity because there are too many unique combinations of head positions.
- Bounded (\(B4\), \(B6\)): These perform excellently. Specifically, looking at the PASen column, the B6 (6k-bit) encodings achieve F-scores (\(93.87\)) that rival the Biaffine baseline (\(94.12\)), which is a much more complex, specialized graph parser.
- The Power of \(k\): Notice the jump in “OF” (Oracle F-score—the theoretical maximum coverage) when moving from \(B4_2\) to \(B4_4\). Increasing the number of planes (\(k\)) is essential for dense graphs.
Speed vs. Accuracy Trade-off
One of the main selling points of sequence labeling is efficiency.

Figure 3 plots the Pareto front.
- The X-axis is speed (tokens per second).
- The Y-axis is Unlabeled F-score (accuracy).
- The Sweet Spot: The authors’ models (represented by triangles and circles) often sit to the right of the Biaffine baseline (red diamonds), indicating they are faster.
- BiLSTM vs. Transformers: The curves show that while Transformer-based models (like XLM-R) are more accurate, using a BiLSTM encoder with these linearizations yields incredibly fast parsers that still perform respectably.
Conclusion & Implications
The paper “Dependency Graph Parsing as Sequence Labeling” proves that we don’t need fundamentally different architectures to parse trees and graphs. By cleverly encoding graph structures—using multi-planar brackets or multi-bit vectors—we can treat semantic parsing as a tagging task.
Key Takeaways:
- Graphs can be flattened: With enough “planes” (\(k\)), even complex, crossing, and cyclic graphs can be represented as a sequence of labels.
- Bounded is better: Fixed-size bit vectors (especially the 6k-bit encoding) generally outperform unbounded bracketing or positional lists, likely because they present a consistent classification target for the neural network.
- Simplicity wins: This approach allows researchers and engineers to use standard sequence labeling libraries to solve complex parsing tasks, lowering the barrier to entry for Semantic Dependency Parsing.
This work expands the “toolbox” of sequence labeling, demonstrating that with the right linearization strategy, even the most tangled linguistic webs can be untangled one tag at a time.