The pharmaceutical and materials science industries are currently undergoing a massive shift from traditional “wet lab” experiments to computational “dry lab” predictions. Deep Neural Networks (DNNs) are at the forefront of this revolution, promising to reduce the cost and time required to discover new drugs.
A popular approach in this field is Chemical Language Representation Learning. Just as Large Language Models (LLMs) like GPT learn to understand English by reading billions of sentences, chemical models learn to understand molecules by reading billions of SMILES (Simplified Molecular-Input Line Entry System) strings. SMILES represents a 3D molecule as a 1D string of text (e.g., CCO for ethanol).
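As a quick illustration (not part of the paper), you can parse and canonicalize a SMILES string with the open-source RDKit library:

```python
# A minimal sketch: parsing SMILES with RDKit (pip install rdkit).
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")  # ethanol: two carbons and an oxygen
print(mol.GetNumAtoms())                               # 3 heavy atoms
print([atom.GetSymbol() for atom in mol.GetAtoms()])   # ['C', 'C', 'O']
print(Chem.MolToSmiles(mol))                           # canonical form: 'CCO'
```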
However, treating chemistry exactly like natural language has drawbacks. In this post, we will explore MolTRES (Molecular TRansformer with Enhanced Self-supervised learning), a new framework presented by researchers from Korea University. This paper identifies critical flaws in current pre-training methods—specifically overfitting and “lazy” learning—and proposes a sophisticated dual-model architecture to solve them.
The Problem with Current Chemical Language Models
Most state-of-the-art chemical models, such as ChemBERTa or MolFormer, rely on Masked Language Modeling (MLM). In MLM, the model hides a percentage of tokens in a sequence and tries to guess them based on the context.
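To make this concrete, here is a toy, character-level illustration of MLM-style masking on a SMILES string. This is a simplified sketch for intuition, not the tokenizer or masking code any of these models actually use:

```python
import random

# Toy illustration of Masked Language Modeling on a SMILES string
# (character-level tokens for simplicity; real models use richer tokenizers).
def mask_tokens(smiles: str, mask_ratio: float = 0.15, mask_token: str = "[M]"):
    tokens = list(smiles)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    masked_positions = random.sample(range(len(tokens)), n_mask)
    labels = {i: tokens[i] for i in masked_positions}  # what the model must predict
    for i in masked_positions:
        tokens[i] = mask_token
    return tokens, labels

tokens, labels = mask_tokens("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(tokens)   # e.g. ['C', 'C', '(', '[M]', 'O', ')', ...]
print(labels)   # e.g. {3: '='}
```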
While this works wonders for English, it has proven less effective for SMILES. The authors of MolTRES identified two main reasons for this:
- Unbalanced Distribution: In massive datasets like ZINC, atoms like Carbon (C), Nitrogen (N), and Oxygen (O) make up 95% of the tokens. A model can achieve high accuracy simply by guessing “Carbon” most of the time.
- Surface Patterns: SMILES grammar is rigid. For example, a number indicating a ring structure always appears twice. The model can learn these superficial grammatical rules without understanding the underlying chemical properties.
The result? Models learn “lazy” heuristics. They converge too quickly during training and fail to scale effectively with more data.
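A back-of-the-envelope check makes the "lazy" shortcut obvious. The tiny corpus below is made up for illustration (it is not ZINC), but the skew it shows is the same in spirit: a predictor that always guesses the most common atom already scores well on masked atom positions.

```python
from collections import Counter

# Toy SMILES corpus (illustrative only); ZINC-scale data is even more skewed.
corpus = ["CCO", "CCN", "c1ccccc1O", "CC(=O)NC", "CCCCO"]
atom_tokens = [ch for s in corpus for ch in s if ch.isalpha()]

counts = Counter(t.upper() for t in atom_tokens)
most_common_atom, freq = counts.most_common(1)[0]
print(counts)
# Accuracy of the trivial "always predict carbon" baseline on atom tokens:
print(most_common_atom, freq / len(atom_tokens))
```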

As shown in Figure 1 above, the existing state-of-the-art model, MolFormer-XL (orange line), shoots up to nearly 100% pre-training accuracy almost immediately. This suggests the task is too easy. In contrast, MolTRES (blue line) maintains a lower, more realistic accuracy during pre-training, indicating it is grappling with a much harder, more informative task. Consequently, MolTRES achieves higher performance on downstream tasks (bottom chart).
The Solution: MolTRES Framework
To force the model to learn meaningful chemical representations, the authors introduce MolTRES. This framework fundamentally changes how the model learns through two key innovations:
- DynaMol: A generator-discriminator training scheme (similar to ELECTRA in NLP) combined with substructure masking.
- Knowledge Transfer: Integrating external knowledge from scientific literature via mat2vec embeddings.
1. DynaMol: Generator-Discriminator Training
Instead of a single model trying to fill in the blanks, MolTRES uses two models competing and cooperating: a Generator and a Discriminator.

Figure 2 illustrates this workflow. Here is the step-by-step process:
- Masking: The input SMILES sequence is masked. However, rather than masking random atoms, the authors use Substructure Masking. They mask meaningful chemical groups (like functional groups or benzene rings). This prevents the model from guessing based on simple neighboring atoms.
- Generator (\(E_G\)): This model acts like a standard BERT model. It tries to predict the original tokens for the masked positions.
- Discriminator (\(E_D\)): This is the crucial addition. The Discriminator receives a “corrupted” sequence where the masked tokens have been replaced by the Generator’s predictions. The Discriminator’s job is not to guess the word, but to classify every token as Original or Replaced.
This setup is significantly harder than standard MLM. As the Generator gets better at creating realistic chemical tokens, the Discriminator must look for subtle chemical inconsistencies to spot the fakes.
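To make the workflow concrete, here is a heavily simplified PyTorch-style sketch of one DynaMol step for a single unbatched sequence. The `generator` and `discriminator` interfaces are assumptions for illustration, not the released MolTRES code:

```python
import torch

# Assumed interfaces (hypothetical): `generator` returns per-token logits over the
# vocabulary, `discriminator` returns one logit per token for "replaced vs. original".
def dynamol_step(generator, discriminator, input_ids, mask_positions, mask_id):
    # 1. Mask the (precomputed) substructure positions.
    masked_ids = input_ids.clone()
    masked_ids[mask_positions] = mask_id

    # 2. Generator predicts the original tokens at the masked positions.
    gen_logits = generator(masked_ids)                       # (seq_len, vocab)

    # 3. Build the corrupted sequence by sampling the generator's predictions.
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = input_ids.clone()
    corrupted[mask_positions] = sampled[mask_positions]

    # 4. Discriminator labels every token as original (0) or replaced (1).
    replaced = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted)                   # (seq_len,)
    return gen_logits, disc_logits, replaced
```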
The Mathematical Foundation
The loss function for the Generator (\(\mathcal{L}_G\)) is standard Maximum Likelihood Estimation, trying to predict the correct token \(x_i\) given the masked input \(\tilde{\mathbf{X}}\):

\[
\mathcal{L}_G = -\sum_{i \in \mathcal{M}} \log p_G\left(x_i \mid \tilde{\mathbf{X}}\right)
\]
The input to the Discriminator (\(\tilde{\mathbf{X}}_D\)) is constructed by sampling from the Generator's probability distribution. If a token was masked (\(i \in \mathcal{M}\)), it is replaced by the generator's guess (\(\tilde{x}_i\)); otherwise, the original token (\(x_i\)) is kept:

\[
\tilde{x}_i^{D} =
\begin{cases}
\tilde{x}_i \sim p_G\left(x_i \mid \tilde{\mathbf{X}}\right) & \text{if } i \in \mathcal{M} \\
x_i & \text{otherwise}
\end{cases}
\]
Finally, the Discriminator is trained to predict a binary label \(z_i\) (original vs. replaced) for every token in the sequence:

\[
\mathcal{L}_D = -\sum_{i=1}^{n} \log p_D\left(z_i \mid \tilde{\mathbf{X}}_D, i\right)
\]
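In code, a hedged sketch of the combined objective might look like the following. The reduction and weighting details are my assumptions; the value of \(\lambda = 10\) comes from the ablation discussed later in the post:

```python
import torch
import torch.nn.functional as F

def dynamol_loss(gen_logits, disc_logits, input_ids, replaced, mask_positions, lam=10.0):
    # Generator: maximum-likelihood on the masked positions only.
    loss_g = F.cross_entropy(gen_logits[mask_positions], input_ids[mask_positions])

    # Discriminator: binary "original vs. replaced" prediction on every token.
    loss_d = F.binary_cross_entropy_with_logits(disc_logits, replaced)

    # Combined objective; lam balances the two terms (lambda = 10 per the paper).
    return loss_g + lam * loss_d
```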
2. Injecting Scientific Knowledge (mat2vec)
A major limitation of SMILES is that it only describes structure. It does not explicitly contain information about boiling points, toxicity, or reactivity. That information resides in scientific literature.
To bridge this gap, MolTRES integrates mat2vec embeddings. mat2vec is a Word2Vec model trained on millions of materials science abstracts. It captures the semantic relationships between chemical terms (e.g., relating “lithium” to “battery”).
The authors created a mapping system (a thesaurus) to link SMILES tokens to mat2vec words. For example, the token [cH+] maps to “cation” or “methylidyne” in the literature embedding space.
These pre-trained literature embeddings (\(e^m\)) are fused with the Transformer’s learned token embeddings (\(e^t\)) using a projection layer \(F_1\). This results in a composite embedding \(V_G\) that informs the Generator:

This fusion ensures that when the model processes a molecule, it isn’t just seeing a string of characters; it is accessing a latent database of chemical properties derived from human scientific knowledge.
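To illustrate the idea, here is a simplified sketch of the lookup-and-fuse step. The thesaurus entries and the concatenate-then-project fusion below are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

# Hypothetical thesaurus: SMILES tokens -> words in the mat2vec vocabulary
# (only the "[cH+]" -> "cation" example comes from the post).
TOKEN_TO_WORD = {"[cH+]": "cation", "O": "oxygen", "N": "nitrogen"}

class KnowledgeFusedEmbedding(nn.Module):
    """Fuses learned token embeddings e^t with frozen mat2vec embeddings e^m."""
    def __init__(self, vocab_size, d_model, mat2vec_matrix):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)          # e^t (learned)
        self.mat2vec = nn.Embedding.from_pretrained(mat2vec_matrix,  # e^m (frozen,
                                                    freeze=True)     #  from literature)
        # Projection layer F_1: concatenate, then project back to d_model.
        self.proj = nn.Linear(d_model + mat2vec_matrix.size(1), d_model)

    def forward(self, token_ids):
        e_t = self.token_emb(token_ids)
        e_m = self.mat2vec(token_ids)  # token ids pre-aligned to mat2vec rows
        return self.proj(torch.cat([e_t, e_m], dim=-1))  # composite embedding V_G
```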
Efficiency: Linear Attention
Modeling long molecules (like polymers) using standard Transformers is computationally expensive because the attention mechanism scales quadratically—\(O(N^2)\). To handle large datasets efficiently, MolTRES employs Linear Attention with Rotary Embeddings, reducing complexity to \(O(N)\).
Standard attention looks like this:

\[
\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V})_i = \frac{\sum_{j=1}^{N} \exp\!\left(\mathbf{q}_i^\top \mathbf{k}_j / \sqrt{d}\right) \mathbf{v}_j}{\sum_{j=1}^{N} \exp\!\left(\mathbf{q}_i^\top \mathbf{k}_j / \sqrt{d}\right)}
\]
MolTRES replaces the exponential similarity function with a kernel function \(\phi(\cdot)\), allowing for linear scaling:

\[
\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V})_i = \frac{\phi(\mathbf{q}_i)^\top \sum_{j=1}^{N} \phi(\mathbf{k}_j)\, \mathbf{v}_j^\top}{\phi(\mathbf{q}_i)^\top \sum_{j=1}^{N} \phi(\mathbf{k}_j)}
\]
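Below is a compact sketch of generic kernelized (linear) attention with \(\phi(x) = \mathrm{elu}(x) + 1\), in the spirit of the formula above. It is not the authors' implementation, and rotary position embeddings are omitted for brevity:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention with phi(x) = elu(x) + 1, O(N) in sequence length.

    q, k: (batch, seq_len, d); v: (batch, seq_len, d_v)
    """
    phi_q = F.elu(q) + 1  # feature map keeps similarities positive
    phi_k = F.elu(k) + 1

    # Sum over keys first, so the cost is linear in sequence length.
    kv = torch.einsum("bnd,bne->bde", phi_k, v)       # (batch, d, d_v)
    z = phi_k.sum(dim=1)                              # (batch, d)

    numer = torch.einsum("bnd,bde->bne", phi_q, kv)   # (batch, seq_len, d_v)
    denom = torch.einsum("bnd,bd->bn", phi_q, z)      # (batch, seq_len)
    return numer / (denom.unsqueeze(-1) + eps)
```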
Experimental Results
The researchers pre-trained MolTRES on 1.9 billion molecules from the ZINC dataset and fine-tuned it on the MoleculeNet benchmark.
Regression Tasks
In regression tasks (predicting numerical properties like solubility or hydration energy), MolTRES demonstrated superior performance compared to both 3D graph models and other SMILES-based models.

As seen in Table 2, MolTRES achieves the lowest error (RMSE) across ESOL, FreeSolv, and Lipophilicity datasets. It significantly outperforms MolFormer-XL, despite using a similar parameter scale. This validates the hypothesis that the “harder” training task (DynaMol) leads to better generalized representations.
Classification Tasks
The trend continues in classification tasks (e.g., predicting toxicity or HIV inhibition). Across the extensive classification tables, MolTRES achieved state-of-the-art results on 7 out of 8 benchmark tasks, outperforming models that rely on expensive 3D conformer data.
Why Does It Work? An Analysis
The paper offers insightful ablation studies to explain where the performance gains come from.
The Impact of External Knowledge
Does reading scientific literature actually help the model understand chemistry? The training curves suggest yes.

Figure 3 shows two critical trends. The solid line (with mat2vec) shows a lower pre-training loss (Left) and higher downstream ROC-AUC (Right) compared to the dashed line (without mat2vec). This confirms that the external knowledge acts as a high-quality initialization and regularizer, helping the model converge to a better solution faster.
The “Sweet Spot” for Masking
In Natural Language Processing (BERT), the standard masking ratio is 15%. Because SMILES is redundant and “easy” to guess, MolTRES requires a much more aggressive strategy.

Figure 4 reveals that MolTRES performs best with a massive 65% masking ratio. This confirms the authors’ hypothesis: to force a model to learn deep chemical insights rather than surface grammar, you must hide the majority of the molecule, forcing the model to reconstruct it based on faint structural clues.
Balancing the Generator and Discriminator
The training involves a hyperparameter \(\lambda\) that balances the Generator loss and Discriminator loss.
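Concretely, the combined objective presumably takes the ELECTRA-style form (my notation, consistent with the losses defined earlier):

\[
\mathcal{L} = \mathcal{L}_G + \lambda \, \mathcal{L}_D
\]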

Figure 5 shows that a \(\lambda\) value of 10 is optimal. Interestingly, this differs from NLP implementations (which typically use 50), suggesting that chemical language modeling requires a distinct balance between generation and discrimination compared to natural language.
Conclusion
MolTRES represents a significant step forward in AI for drug discovery. By recognizing that chemical language is fundamentally different from natural language, the authors moved beyond standard Masked Language Modeling.
Through the DynaMol generator-discriminator framework, they created a training task challenging enough to prevent overfitting. Through mat2vec integration, they enriched the model with scientific knowledge that cannot be inferred from structure alone.
The result is a model that is not only scalable but also highly accurate across a wide range of molecular properties, proving that in the world of chemical AI, making the learning process “harder” often makes the final model smarter.