In the world of Artificial Intelligence, Knowledge Graphs (KGs) act as the structured memory for machines. They store vast amounts of data in the form of triples—(Head Entity, Relation, Tail Entity)—such as (Paris, is_capital_of, France). These graphs power everything from search engine sidebars to recommendation systems and question-answering bots.
However, there is a fundamental problem: Knowledge Graphs are rarely complete. Real-world data is messy, and relationships are often missing. This has given rise to the field of Knowledge Graph Completion (KGC), which uses algorithms to predict missing links, such as inferring that the missing head in (?, operates_system, iOS) should be Apple.
Historically, researchers have had to choose between two paths:
- Fast but shallow: Structural models that treat entities as points in vector space (efficient, but they ignore the rich text descriptions of entities).
- Smart but slow: Description-based models (like KG-BERT) that read the text descriptions of entities (highly accurate, but painfully slow to train and run).
In this post, we are doing a deep dive into a research paper that proposes a “best of both worlds” solution. The paper, Joint Pre-Encoding Representation and Structure Embedding for Efficient and Low-Resource Knowledge Graph Completion, introduces a model called PEMLM. By cleverly separating text encoding from the training loop and fusing it with structural data, the authors achieve state-of-the-art results while increasing inference speed by 30x and reducing memory usage by 60%.
Let’s explore how they accomplished this.
The Bottleneck of Modern KGC
To understand why PEMLM is necessary, we first need to look at the limitations of existing approaches.
The Two Camps
Embedding-based models (like TransE or RotatE) learn by looking at the geometry of the graph. If King - Man + Woman = Queen, the model understands the structure. However, these models treat entities as abstract IDs. They don’t know that “Apple” the fruit and “Apple” the company are semantically different just by looking at their names; they only know them by their connections.
Description-based models (like KG-BERT) use Pre-trained Language Models (PLMs) to read textual descriptions. For example, reading “Steve Jobs co-founded Apple” helps the model predict links even if the graph structure is sparse. The downside? They are incredibly heavy. To predict a link, these models often have to feed long text sequences into BERT for every single candidate entity. With thousands of entities, the computational cost explodes, making training on standard GPUs nearly impossible for large graphs.
The PEMLM Solution
The authors of this paper propose PEMLM (Pre-Encoded Masked Language Model). Their core insight is simple yet powerful: Don’t re-read the text every time.
Instead of feeding raw text into the model during the training loop, PEMLM processes all text descriptions once beforehand, converting them into rich semantic vectors. During training, the model works with these pre-computed vectors. This shift dramatically reduces the computational load while preserving the semantic understanding of language models.
Furthermore, they noticed that text alone isn’t always enough. Sometimes, the graph structure holds clues that text misses. To address this, they introduced PEMLM-F, a fusion framework that combines their pre-encoded text representations with structural embeddings.
The Architecture of PEMLM
The architecture is divided into two distinct stages: the Pre-Encoding Phase and the Training Phase. This separation is the key to the model’s efficiency.

As shown in Figure 1, the Description Encoder (left) processes the raw text to create embeddings. These embeddings are then passed to the Triplet Encoder (right), which is the component that actually learns to predict links.
1. The Description Encoder (Pre-Encoding)
Imagine we have an entity \(e\) with a description \(des_e\). For example, the entity might be Gary Rydstrom and the description is “Gary Roger Rydstrom is an American sound designer…”.
The authors use a standard BERT model to tokenize this sentence. In BERT, the special tokens [CLS] (start) and [SEP] (end) are added, and the sequence is fed into the encoder. Instead of using the output of every single word, the authors apply Mean Pooling to the final hidden layer to get a single, compact vector representation for that entity.
The equation for generating the semantic representation \(u\) is:
\[ u = \mathrm{MeanPooling}\big(\mathrm{BERT}([\mathrm{CLS}],\ des_e,\ [\mathrm{SEP}])\big) \]
This process is repeated for every entity and relation in the graph. Once this is done, the BERT model can be turned off or discarded from memory. We are left with static matrices representing the “semantic soul” of every item in our graph:
\[ E = [u_{e_1},\ u_{e_2},\ \ldots,\ u_{e_{|\mathcal{E}|}}], \qquad R = [u_{r_1},\ u_{r_2},\ \ldots,\ u_{r_{|\mathcal{R}|}}] \]
Here, \(E\) is the matrix of all entity vectors, and \(R\) is the matrix of all relation vectors. This pre-encoding step takes only minutes, yet it saves days of computation later on.
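To make this concrete, here is a minimal sketch of the pre-encoding phase using the Hugging Face `transformers` library. The model name, the toy descriptions, and the pooling details are illustrative assumptions, not the authors' exact code.

```python
# A minimal sketch of the pre-encoding phase, assuming Hugging Face `transformers`
# and a toy dictionary of entity descriptions (names and texts are illustrative).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()  # the description encoder runs once and is never fine-tuned here

descriptions = {
    "gary_rydstrom": "Gary Roger Rydstrom is an American sound designer ...",
    "apple_inc": "Apple Inc. is an American multinational technology company ...",
}

entity_vectors = {}
with torch.no_grad():
    for entity, text in descriptions.items():
        batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
        hidden = encoder(**batch).last_hidden_state      # (1, seq_len, 768)
        mask = batch["attention_mask"].unsqueeze(-1)     # zero out padding positions
        u = (hidden * mask).sum(dim=1) / mask.sum(dim=1) # mean pooling -> (1, 768)
        entity_vectors[entity] = u.squeeze(0)

# Stack the vectors into the static matrix E; relations are treated the same way
# to build R. After this point the BERT encoder can be dropped from memory.
E = torch.stack(list(entity_vectors.values()))           # (num_entities, 768)
```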
2. The Triplet Encoder (Training)
Now that we have our pre-encoded vectors, how do we train a model to predict links? The authors treat this as a Masked Language Modeling (MLM) task, similar to how BERT is pre-trained, but adapted for graph triples.
Constructing the Input
A knowledge graph triple consists of a Head (\(h\)), a Relation (\(r\)), and a Tail (\(t\)). To predict a missing tail, the input sequence looks like: [CLS], Head Representation, Relation Representation, [MASK], [SEP].
However, there is a catch. Since we are feeding vectors (from the pre-encoding step) rather than raw text tokens into this second encoder, the model loses the concept of order. It doesn’t inherently know that the Head comes before the Relation.
To fix this, the authors add Position Embeddings to the vectors. This explicitly tells the model which vector represents the head, which is the relation, and which is the mask.

Mathematically, the input sequence \(u^{input}\) is constructed as:
\[ u^{input} = [u_{[\mathrm{CLS}]},\ u_h,\ u_r,\ u_{[\mathrm{MASK}]},\ u_{[\mathrm{SEP}]}] + u^{pos} \]
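A minimal sketch of this input construction in PyTorch is shown below; the special-token vectors, dimensions, and 5-slot layout are illustrative assumptions rather than the paper's exact implementation.

```python
# A minimal sketch of the input construction, assuming 768-dimensional pre-encoded
# vectors; the special-token embeddings and the 5-slot layout are illustrative.
import torch
import torch.nn as nn

dim, seq_len = 768, 5  # [CLS], head, relation, [MASK], [SEP]

# Learnable stand-ins for the special tokens.
cls_vec = nn.Parameter(torch.randn(dim))
mask_vec = nn.Parameter(torch.randn(dim))
sep_vec = nn.Parameter(torch.randn(dim))

# Position embeddings restore the notion of order that raw vectors lack.
position_emb = nn.Embedding(seq_len, dim)

def build_input(u_head: torch.Tensor, u_rel: torch.Tensor) -> torch.Tensor:
    """Stack [CLS], head, relation, [MASK], [SEP] and add position embeddings."""
    seq = torch.stack([cls_vec, u_head, u_rel, mask_vec, sep_vec])  # (5, dim)
    return seq + position_emb(torch.arange(seq_len))                # (5, dim)

u_input = build_input(torch.randn(dim), torch.randn(dim))
print(u_input.shape)  # torch.Size([5, 768])
```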
The Prediction
The model feeds this sequence through the Triplet Encoder (another Transformer-based architecture). It looks at the output vector corresponding to the [MASK] token position. This vector represents what the model thinks the missing entity should be.

Finally, this output is passed through a dense classification layer (a standard neural network layer) to predict the probability of the missing entity across all possible entities in the graph.

This setup effectively turns the link prediction problem into a multi-class classification problem. The complexity drops from \(O(N)\) (scanning every candidate one by one) to \(O(1)\) (one forward pass to classify against all candidates).
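The following sketch illustrates that single-pass classification idea with a small PyTorch Transformer encoder; the layer count, head count, and entity count are illustrative guesses, not the paper's configuration.

```python
# A minimal sketch of the prediction step, assuming the 5-vector input from the
# previous sketch; the encoder depth, head count, and entity count are guesses.
import torch
import torch.nn as nn

dim, num_entities = 768, 14541  # e.g. FB15k-237 has roughly 14.5k entities
MASK_POS = 3                    # index of [MASK] in the 5-slot sequence

layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
triplet_encoder = nn.TransformerEncoder(layer, num_layers=2)
classifier = nn.Linear(dim, num_entities)   # one logit per candidate entity

u_input = torch.randn(1, 5, dim)            # (batch, seq_len, dim)
hidden = triplet_encoder(u_input)           # (1, 5, dim)
logits = classifier(hidden[:, MASK_POS])    # (1, num_entities) in a single pass

# Training reduces to cross-entropy against the index of the true tail entity.
loss = nn.CrossEntropyLoss()(logits, torch.tensor([42]))
```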
Integrating Structure: PEMLM-F
While the text-based approach is powerful, the authors found that it sometimes struggles with “1-to-N” relations (e.g., one parent having many children). Graph structure models handle this well.
To bridge this gap, they introduced PEMLM-F (Fusion). This variant runs a structural embedding model (based on the famous TransE algorithm) alongside the text model.
The Structural Component
The TransE model treats relations as translations in space: \(h + r \approx t\). The model tries to minimize the distance between the head-plus-relation and the tail.

The scoring function uses cosine similarity to see how close the prediction is to the target:
\[ f(h, r, t) = \cos\big(v_h + v_r,\ v_t\big) \]
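Here is a minimal sketch of such a structural scorer in PyTorch; the embedding table sizes and index values are illustrative, not the paper's settings.

```python
# A minimal sketch of a TransE-style scorer with a cosine-similarity objective;
# the embedding table sizes and indices below are illustrative.
import torch
import torch.nn.functional as F

num_entities, num_relations, dim = 14541, 237, 200
ent_emb = torch.nn.Embedding(num_entities, dim)
rel_emb = torch.nn.Embedding(num_relations, dim)

def score(h_idx: torch.Tensor, r_idx: torch.Tensor, t_idx: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between the translated head (h + r) and the tail t."""
    h, r, t = ent_emb(h_idx), rel_emb(r_idx), ent_emb(t_idx)
    return F.cosine_similarity(h + r, t, dim=-1)  # close to 1 for plausible triples

s = score(torch.tensor([0]), torch.tensor([5]), torch.tensor([7]))
```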
The Fusion Module
The innovation here is how they combine the two. They don’t just average the results. Instead, they take the semantic text vector (\(u\)) and the structural vector (\(v\)) and concatenate them.

As shown in Figure 3, the fusion module concatenates the two representations, and a learnable MLP (Multi-Layer Perceptron) then weights these features to create a new, fused representation \(s\):

\[ s = \mathrm{MLP}\big([\,u\ ;\ v\,]\big) \]
This new fused vector \(s\) is what gets fed into the Triplet Encoder. This allows the model to dynamically learn when to rely on text semantics and when to rely on graph structure.
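A minimal sketch of this fusion step might look like the following; the MLP depth, layer sizes, and activation are illustrative assumptions, since the paper's exact module is not reproduced here.

```python
# A minimal sketch of the fusion module, assuming a 768-d text vector u and a
# 200-d structural vector v; the MLP depth and sizes are illustrative.
import torch
import torch.nn as nn

text_dim, struct_dim, fused_dim = 768, 200, 768

fusion_mlp = nn.Sequential(
    nn.Linear(text_dim + struct_dim, fused_dim),
    nn.ReLU(),
    nn.Linear(fused_dim, fused_dim),
)

u = torch.randn(1, text_dim)    # pre-encoded semantic representation
v = torch.randn(1, struct_dim)  # structural (TransE-style) representation
s = fusion_mlp(torch.cat([u, v], dim=-1))  # fused vector fed to the Triplet Encoder
```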
The final loss function combines the classification loss (from the masked language model) and the contrastive loss (from the structural model), balanced by a hyperparameter \(\alpha\):
\[ \mathcal{L} = \mathcal{L}_{cls} + \alpha\, \mathcal{L}_{con} \]
Experiments and Results
The researchers tested PEMLM on three standard datasets: FB15k-237 (general knowledge from Freebase), WN18RR (lexical relations from WordNet), and UMLS (medical/biomedical data).
Accuracy Performance
The results were impressive. As seen in Table 1 below, PEMLM-F achieves state-of-the-art results, particularly on WN18RR and UMLS.

On the WN18RR dataset, PEMLM-F achieved a Hits@1 score of 50.9%, significantly outperforming previous description-based models like KG-BERT (which scored only 9.5%) and even strong joint models like Pretrain-KGE. This suggests that the fusion strategy is highly effective at pinpointing the exact correct entity (Hits@1), rather than just getting it “close” (Hits@10).
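As a side note on the metric itself, here is a minimal sketch of how Hits@1 and Hits@10 are typically computed from a model's scores; the tensors and entity count are illustrative, not the paper's evaluation code.

```python
# A minimal sketch of Hits@k over a batch of test triples, assuming one row of
# scores per triple and the gold tail's index; all values here are illustrative.
import torch

num_entities = 40943                        # WN18RR has roughly 41k entities
scores = torch.randn(4, num_entities)       # model scores for 4 test triples
true_tails = torch.tensor([12, 7, 301, 9])  # gold entity index per triple

# Rank of the gold entity = number of entities scored at least as high as it.
gold = scores.gather(1, true_tails.unsqueeze(1))
ranks = (scores >= gold).sum(dim=1)

hits_at_1 = (ranks <= 1).float().mean()     # exact top prediction
hits_at_10 = (ranks <= 10).float().mean()   # gold entity within the top 10
print(hits_at_1.item(), hits_at_10.item())
```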
Efficiency: The Game Changer
The most dramatic result is the efficiency gain. High accuracy is usually associated with high resource cost, but PEMLM breaks this trend.

Table 4 highlights the massive difference in resource consumption:
- Inference Time: KG-BERT takes 4 days to run inference on the test set. PEMLM takes just 1 minute. That is a staggering improvement.
- Training Memory: PEMLM requires roughly 60% less memory (3.6GB vs 8.5GB), making it feasible to run on consumer-grade GPUs (like an RTX 3080 or even smaller cards).
- Training Time: It trains in minutes rather than hours.
This efficiency comes from the pre-encoding strategy. By not forcing the heavy BERT model to process text during every single training step, the computational bottleneck is removed.
Why Fusion Matters
The authors also analyzed where the fusion model helps. They broke down performance by relation type: 1-to-1, 1-to-Many, Many-to-1, and Many-to-Many.

Table 5 shows that the Fusion model (PEMLM-F) provides the biggest boost on 1-N (One-to-Many) relations. This makes sense: the textual descriptions of siblings, or of multiple parts of a whole, can be very similar. The structural embedding helps pull these entities apart in vector space, allowing the model to rank them more accurately.
The Role of Alpha (\(\alpha\))
The parameter \(\alpha\) controls how much weight the model gives to the structural loss versus the text classification loss.

Figure 4 shows that performance peaks around \(\alpha = 2\) for the WN18RR dataset. This indicates that on this specific dataset, structural information is highly valuable and should be weighted heavily alongside the text predictions.
Conclusion and Implications
The PEMLM paper presents a compelling argument for decoupling text encoding from graph training. By treating entity descriptions as pre-computed features rather than live inputs, the researchers unlocked massive efficiency gains without sacrificing accuracy.
Key Takeaways:
- Pre-encoding works: You don’t need to fine-tune a language model end-to-end to get great KGC results. Frozen, pooled embeddings are sufficient and much faster.
- Fusion is vital: Text doesn’t capture everything. Integrating geometric graph structure (TransE) helps specifically with complex, one-to-many relationships.
- Efficiency enables scale: Reducing inference time from days to minutes opens the door for using these models on much larger, real-world knowledge graphs.
While the model relies on the quality of the initial text descriptions (garbage in, garbage out), it represents a significant step forward for “Green AI”—getting better results with fewer resources. For students and practitioners in the field, PEMLM offers a blueprint for building high-performance graph models that don’t require a supercomputer to run.