Introduction

In the world of Natural Language Processing (NLP), understanding who did what to whom is the holy grail. This process, known as Information Extraction (IE), turns unstructured text—like a news article or a medical report—into structured data tables.

For years, the standard approach has been to train massive language models on raw text. While models like BERT and RoBERTa are remarkably good at predicting masked words from their surrounding context, they still treat sentences as linear sequences. They miss the hidden “skeleton” of language: the structural relationships between concepts. To fix this, researchers typically rely on heavily annotated datasets where humans manually label entities and relations. But this is expensive, slow, and hard to scale.

What if a model could teach itself the structure of language without human supervision?

Enter SKIE (Structural semantic Knowledge for IE), a novel pre-training framework that does exactly that. By leveraging Abstract Meaning Representation (AMR), SKIE represents text as a graph of concepts and the relations between them. These automatically parsed graphs inject structure into training, allowing the model to learn deep semantic connections without requiring thousands of hours of human labeling.

In this post, we will deconstruct how SKIE works, from generating cohesive semantic graphs to aligning them with text using contrastive learning, and look at how it outperforms state-of-the-art models in zero-shot and few-shot scenarios.

The Background: Why Structure Matters

To understand SKIE, we first need to understand the limitation of current methods. Most “universal” IE frameworks (like UIE or USM) are pre-trained on text-only tasks or rely on limited supervised data. They treat the sentence “The driver came out of the house” primarily as a sequence of tokens.

However, distinct semantic structures exist within that sentence. “Driver” is the agent of “come out,” and “house” is the source.

Abstract Meaning Representation (AMR)

The researchers utilized AMR to capture this structure. AMR is a semantic formalism that represents the meaning of a sentence as a rooted, directed graph.

Figure 1: An example from the WikiEvents dataset showing an AMR graph.

As shown in Figure 1, the AMR graph on the left strips away the syntactic “fluff” and leaves the core logic. The nodes represent concepts (like person, house, come-out), and the edges represent specific relations (like :source or :ARG1).
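
To make this concrete, here is a hand-written sketch of what such a graph might look like for the earlier sentence, “The driver came out of the house,” in the PENMAN notation that AMR is usually written in. The concept and role labels are illustrative rather than taken from the paper, and the snippet uses the penman Python library only to show that the graph is machine-readable:

```python
import penman  # pip install penman

# A hand-written, illustrative AMR for "The driver came out of the house".
# Concept names and role labels are approximations, not copied from the paper.
amr_string = """
(c / come-out
   :ARG0 (p / person
            :ARG0-of (d / drive-01))
   :source (h / house))
"""

graph = penman.decode(amr_string)
print(graph.top)      # 'c' -- the root concept
print(graph.triples)  # (source, role, target) triples: the nodes and labelled edges
```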

The genius of SKIE is that it uses an automatic parser to generate these graphs from massive amounts of unlabeled text. This creates a “free” source of structural supervision.
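
The paper does not prescribe a particular parser here, but off-the-shelf AMR parsers are easy to try. A minimal sketch with the amrlib library (assuming its pretrained sentence-to-graph model has been downloaded, as described in its documentation) looks like this:

```python
import amrlib  # pip install amrlib, plus a downloaded sentence-to-graph model

# Load the default sentence-to-graph (StoG) model and parse raw sentences
# into PENMAN-format AMR strings -- no human annotation involved.
stog = amrlib.load_stog_model()
graphs = stog.parse_sents(["The driver came out of the house."])
print(graphs[0])
```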

The SKIE Framework

SKIE is designed to bridge the gap between linear text and structural graphs. The framework consists of three main modules:

  1. Topology Enhancement: Refining the raw AMR graphs to find the most important “cohesive” subgraphs.
  2. Encoding Cohesion: A specialized graph encoder (T-GSN) that preserves the specific types of relationships between nodes.
  3. Contrastive Learning: Teaching the model that a specific text and its corresponding semantic graph mean the same thing.

Figure 2: The overall framework of SKIE.

Let’s break these down step-by-step.

1. Topology Enhancement Module

Raw AMR graphs can be noisy or overly complex. To make them useful for training, the researchers introduce the concept of Cohesive Subgraphs. These are dense, interconnected parts of the graph that represent the core meaning.

The team uses the \(k\)-core algorithm, which iteratively peels away less connected nodes until only the dense center of the graph remains: the \(k\)-core is the maximal subgraph in which every node has at least \(k\) neighbors (a minimal code sketch follows). To enhance this further, they apply two strategies.
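
A minimal sketch of the \(k\)-core step with networkx, on a toy graph whose nodes and edges are made up for illustration (the paper works on parsed AMR graphs, which are directed and labelled; the undirected toy below only shows the \(k\)-core idea):

```python
import networkx as nx

# Toy undirected graph standing in for (a simplified view of) an AMR graph.
G = nx.Graph()
G.add_edges_from([
    ("come-out", "person"), ("come-out", "house"),
    ("person", "drive-01"), ("person", "house"),
])

core_numbers = nx.core_number(G)  # largest k for which each node survives the peeling
two_core = nx.k_core(G, k=2)      # the dense center: every remaining node keeps degree >= 2
print(core_numbers)
print(list(two_core.nodes))
```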

A. Deterministic Strategy (Graph Diffusion)

This strategy uses mathematical rules to identify the most critical nodes and edges. First, they calculate the importance weight of a node \(v_i\) based on how often it appears in different \(k\)-core subgraphs:

Equation 1: Node weight calculation.

Using these node weights, they update the edge weights. If two important nodes are connected, the edge between them becomes stronger:

Equation 2: Edge weight update.

Finally, they apply a diffusion process (similar to PageRank) to smooth these weights across the graph, ensuring that the “cohesiveness” spreads to neighbors:

Equation 3: Graph diffusion using PageRank.

This results in a graph where the most semantically relevant parts are mathematically highlighted.
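
The exact formulas (Equations 1-3) are not reproduced here, but a hedged sketch of the overall recipe, following the prose above, might look like this in networkx; the specific weighting choices (membership counts, products, PageRank damping) are illustrative assumptions:

```python
import networkx as nx

def cohesive_node_weights(G, k_max=3, alpha=0.85):
    """Sketch of the deterministic strategy: weight, reinforce, then diffuse."""
    # Step 1 (cf. Eq. 1): a node gets heavier the more k-core subgraphs it belongs to.
    node_w = {v: 0 for v in G}
    for k in range(1, k_max + 1):
        for v in nx.k_core(G, k=k):
            node_w[v] += 1

    # Step 2 (cf. Eq. 2): edges between two heavy nodes become stronger.
    for u, v in G.edges:
        G[u][v]["weight"] = node_w[u] * node_w[v]

    # Step 3 (cf. Eq. 3): PageRank-style diffusion spreads cohesiveness to neighbours.
    return nx.pagerank(G, alpha=alpha, weight="weight")
```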

B. Probabilistic Strategy (Randomness for Robustness)

To prevent the model from memorizing fixed patterns, SKIE also introduces a probabilistic element. It randomly drops edges or nodes, similar to “Dropout” in neural networks. However, the probability of dropping a node isn’t uniform; it’s inversely proportional to its importance. Important nodes (high weight \(w'_v\)) are less likely to be dropped.

The probability \(P'\) of dropping a node \(v_i\) is calculated as:

Equation 4: Probabilistic node dropping.

And the probability for edges follows suit:

Equation 5: Probabilistic edge dropping.

This dual approach ensures the model sees diverse variations of the same underlying structure, making it more robust.
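
A hedged sketch of the idea (the exact probability formulas in Equations 4-5 are paraphrased, not reproduced): given the diffused node weights from the previous step, a node’s drop probability shrinks as its weight grows.

```python
import random

def drop_nodes(G, node_w, p_max=0.3):
    """Sketch: randomly drop nodes, but protect the important ones (high weight)."""
    max_w = max(node_w.values()) or 1.0  # guard against an all-zero weighting
    kept = []
    for v in G.nodes:
        # Drop probability is inversely related to importance: heavy nodes survive.
        p_drop = p_max * (1.0 - node_w[v] / max_w)
        if random.random() >= p_drop:
            kept.append(v)
    return G.subgraph(kept).copy()
```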

2. Encoding Cohesion Module

Once we have these high-quality subgraphs, we need to convert them into mathematical vectors (embeddings). Standard Graph Neural Networks (GNNs) often fail here because they aggregate information from neighbors without paying enough attention to how they are connected (the edge labels).

SKIE introduces the Topology-aware Graph Substructure Network (T-GSN).

Unlike a basic Graph Convolutional Network (GCN), T-GSN applies specific transformations based on the relation type. The update rule for a node’s feature \(h\) at layer \(l+1\) is:

Equation 6: T-GSN update rule.

In simpler terms: When updating a node, the model looks at its neighbors. If a neighbor is connected via an “Agent” relation, it uses weights specific to “Agent.” If it’s connected via “Location,” it uses weights for “Location.”

Finally, the system aggregates these features to get a representation of the whole cohesive subgraph:

Equation 7: T-GSN aggregation function.

This ensures that the final vector contains rich, relation-aware structural information.
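
The paper’s exact layer definition lives in Equations 6-7; the PyTorch sketch below only illustrates the core idea of relation-specific transformations followed by pooling, in the spirit of relational GNNs such as R-GCN. The class name, edge format, and pooling choice are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class RelationAwareLayer(nn.Module):
    """Sketch of a T-GSN-style layer: one weight matrix per relation type."""
    def __init__(self, dim, num_relations):
        super().__init__()
        self.rel_weights = nn.ModuleList(
            [nn.Linear(dim, dim, bias=False) for _ in range(num_relations)]
        )
        self.self_weight = nn.Linear(dim, dim, bias=False)

    def forward(self, h, edges):
        # h: (num_nodes, dim); edges: list of (src_index, relation_id, dst_index)
        out = self.self_weight(h)  # self-loop term
        for src, rel, dst in edges:
            # The message from src to dst is transformed by weights specific
            # to the relation on that edge (":ARG0", ":source", ...).
            msg = self.rel_weights[rel](h[src]).unsqueeze(0)
            out = out.index_add(0, torch.tensor([dst]), msg)
        return torch.relu(out)

def subgraph_representation(h):
    # Sketch of Eq. 7: pool node features into one vector for the whole subgraph.
    return h.mean(dim=0)
```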

3. Contrastive Learning Module

At this stage, SKIE has two representations for a single piece of data:

  1. Text Representation: Encoded by a standard Language Model (RoBERTa).
  2. Graph Representation: Encoded by the T-GSN described above.

The goal of pre-training is to align these two. SKIE uses Contrastive Learning with a triplet loss function.

For a given sentence (anchor \(s\)), the corresponding AMR graph is a “positive” sample (\(g_+\)), and a graph from a different sentence is a “negative” sample (\(g_-\)). The model tries to minimize the distance to the positive graph and maximize the distance to the negative one:

Equation 8: Triplet loss function.

By minimizing this loss, the text encoder learns to “think” like a graph. Even when it sees plain text later, it implicitly understands the structural connections it learned from the AMR graphs.
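
A minimal sketch of such a triplet objective in PyTorch (the distance function and margin are assumptions; the paper’s Equation 8 may use different choices):

```python
import torch
import torch.nn.functional as F

def triplet_alignment_loss(text_emb, pos_graph_emb, neg_graph_emb, margin=1.0):
    """Pull the sentence towards its own AMR subgraph, push it from a mismatched one."""
    d_pos = F.pairwise_distance(text_emb, pos_graph_emb)  # anchor vs. matching graph
    d_neg = F.pairwise_distance(text_emb, neg_graph_emb)  # anchor vs. mismatched graph
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```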

Task-Specific Fine-Tuning

After pre-training on large unsupervised datasets, the model is fine-tuned for specific tasks like Named Entity Recognition (NER) or Relation Extraction (RE).

The researchers treat IE as a unified task. They input the text and a schema instruction (e.g., “Extract person and location”). They then use Biaffine Attention to predict relationships between tokens. This creates a matrix representing which words connect to which:

Equation 9: Biaffine attention for connection probability.
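
A hedged sketch of such a biaffine scorer (the dimensions and the sigmoid output are illustrative assumptions, not the paper’s exact Equation 9):

```python
import torch
import torch.nn as nn

class BiaffineScorer(nn.Module):
    """Sketch: score every token pair (i, j) with a bilinear plus a linear term."""
    def __init__(self, dim):
        super().__init__()
        self.U = nn.Parameter(torch.randn(dim, dim))  # bilinear interaction
        self.W = nn.Linear(2 * dim, 1)                # linear term over the pair

    def forward(self, h):                    # h: (seq_len, dim) token representations
        n = h.size(0)
        bilinear = h @ self.U @ h.t()        # (seq_len, seq_len) pairwise scores
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)], dim=-1
        )
        linear = self.W(pairs).squeeze(-1)   # (seq_len, seq_len)
        return torch.sigmoid(bilinear + linear)  # probability token i connects to token j
```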

To optimize this during fine-tuning, they utilize Circle Loss, which effectively handles the class imbalance between positive samples (actual entities) and negative samples (everything else):

Equation 10: Circle Loss for fine-tuning.
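
The paper’s Equation 10 is not reproduced here, but a widely used circle-style multi-label formulation for span scoring conveys the idea: positive pairs are pushed to score higher than negative pairs as a group, rather than each being compared against a fixed threshold. With \(s_{ij}\) the biaffine score for the token pair \((i, j)\), one such loss (a sketch, possibly differing from the paper’s exact form) is:

\[
\mathcal{L} = \log\Bigl(1 + \sum_{(i,j)\,\in\,\Omega_{\text{neg}}} e^{\,s_{ij}}\Bigr) + \log\Bigl(1 + \sum_{(i,j)\,\in\,\Omega_{\text{pos}}} e^{\,-s_{ij}}\Bigr)
\]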

Experiments and Results

The researchers evaluated SKIE on 8 standard benchmarks covering NER, Relation Extraction, and Event Extraction.

Few-Shot Learning

One of the most impressive results is SKIE’s performance when data is scarce. In “Few-Shot” settings (where the model sees only 1, 5, or 10 examples), SKIE significantly outperforms competitors like UIE and MetaRetriever.

Table 2: Few-shot results on IE tasks.

As seen in Table 2, in a 1-shot setting for NER (CoNLL03), SKIE achieves an F1 score of 77.50, compared to UIE’s 57.53. This suggests that the structural knowledge learned during pre-training acts as a powerful prior, allowing the model to grasp tasks quickly without needing thousands of examples.

Zero-Shot Learning

What if the model sees a dataset it has never trained on? The researchers tested SKIE on 5 NER datasets (like Literature, Music, and Politics) that were excluded from training.

Table 3: Zero-shot results on 5 NER datasets.
Table 12: Supplementary zero-shot results compared to other models.

SKIE consistently beats the baselines. In Table 3, SKIE achieves an average F1 of 58.03, significantly higher than USM’s 41.98. This indicates that the semantic knowledge SKIE acquires is generalizable—it learns the concept of “entities” and “relations” broadly, not just for specific domains.

Language Adaptation

Perhaps most surprisingly, SKIE shows strong cross-lingual capabilities. Even though the AMR parser and pre-training data were primarily English, the structural logic of “Agents” and “Actions” is universal.

Table 6: Language adaptation results on MultiCoNER.

Table 6 shows that SKIE outperforms ChatGPT and GLiNER on several languages, particularly German and English, demonstrating that structural pre-training enhances the model’s fundamental understanding of language mechanics, which transfers across linguistic barriers.

Ablation Studies

Does every part of the engine matter? The researchers performed ablation studies to find out.

  • Removing Cohesive Subgraphs: Performance dropped significantly. The raw AMR graphs are too noisy; finding the dense “core” is essential.
  • Replacing T-GSN with GCN: Switching to a standard Graph Convolutional Network caused a massive drop in performance (e.g., RE score dropped from 72.36 to 47.75). This confirms that preserving edge relations via T-GSN is critical.

Figure 4: Loss trends during pre-training with different graph encoder layers.

They also analyzed hyper-parameters. Figure 4 shows that using 3 layers for the graph encoder (the red line) provides the best balance of learning efficiency and loss reduction compared to 2 or 4 layers.

Conclusion

SKIE represents a significant step forward in Information Extraction. By stepping “beyond plain text” and integrating Structural Semantic Knowledge via AMR graphs, the model learns to see the hidden connections in language.

The key takeaways are:

  1. Unsupervised Structure: We don’t need expensive human labels to teach structure; automatic AMR parsing can generate massive training signals.
  2. Topology Matters: Extracting cohesive subgraphs (the “meat” of the graph) is better than using the whole noisy graph.
  3. Relation-Aware Encoding: You cannot treat all edges the same. T-GSN allows the model to respect the specific semantic roles of different connections.

SKIE proves that machines learn better when they don’t just read the words, but also understand the web of relationships that connects them. As this approach matures, we can expect IE models to become far more data-efficient and adaptable to complex, real-world tasks.