If you have ever fine-tuned a large language model (LLM) like BERT on a small dataset, you likely encountered a familiar frustration: overfitting. The model memorizes the training data perfectly but falls apart the moment it sees something slightly different.
Even worse, these models are notoriously fragile to adversarial attacks. A malicious actor can change a single word in a sentence—a “perturbation”—and cause the model to flip its prediction entirely, even if the sentence looks identical to a human.
Traditionally, researchers try to patch this hole with adversarial training, purposely feeding the model perturbed examples during training so it learns to ignore the perturbations. But this is computationally expensive and hard to tune.
But what if the fragility isn’t just about the data? What if it’s about the math inside the neural network’s architecture?
In a fascinating paper titled “IM-BERT: Enhancing Robustness of BERT through the Implicit Euler Method,” researchers Mihyeon Kim, Juhyoung Park, and Youngbin Kim propose a solution rooted not in more data or clever training tricks, but in dynamical systems and calculus. By treating BERT’s stack of layers as the numerical solution of an Ordinary Differential Equation (ODE), they fundamentally change how information flows through the network, making it inherently robust to attacks without needing to see them during training.
In this post, we will tear down the IM-BERT architecture, dive into the math of Explicit vs. Implicit Euler methods, and see how a little bit of calculus can make AI significantly stronger.
The Problem: The Fragility of Fine-Tuning
Pre-trained Language Models (PLMs) operate in a two-stage paradigm:
- Pre-training: Read the entire internet to learn general language patterns (High resource).
- Fine-tuning: Train on a specific dataset (e.g., sentiment analysis) to specialize (Low resource).
The problem arises in stage two. When you take a massive model like BERT and aggressively fine-tune it on a small dataset, it becomes “brittle.” It creates a decision boundary that is too complex, making it vulnerable to adversarial perturbations.
Imagine a model that correctly classifies:
“The movie was good.” \(\rightarrow\) Positive
An adversarial attack might change it to:
“The movie was adequate.” \(\rightarrow\) Negative (Incorrect)
To a human, “adequate” and “good” are similar enough in this context, but the model’s internal representation shifts wildly due to the perturbation. The authors of IM-BERT argue that this sensitivity is due to the mathematical nature of the connections between the model’s layers.
Neural Networks as Dynamic Systems
To understand IM-BERT, we first have to look at neural networks through the lens of Ordinary Differential Equations (ODEs).
In a residual network (like ResNet or BERT), the output of a layer \(h_t\) is the sum of the input \(h_{t-1}\) and some transformation \(\phi\) (like attention or feed-forward networks).
\[h_t = h_{t-1} + \phi(h_{t-1})\]

As the number of layers approaches infinity, we can view this discrete process as a continuous flow of information over time \(t\). The hidden state \(h(t)\) evolves according to a differential equation:

\[\frac{dh(t)}{dt} = \phi(h(t))\]
Here, \(\phi\) is the function describing how the state changes. A neural network is essentially an ODE solver: it tries to calculate the final state \(h(T)\) (the output) given the initial state \(h(0) = x\) (the input).
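
To make the analogy concrete, here is a minimal NumPy sketch (my own illustration, not from the paper) that reads a stack of residual layers as an ODE solver: each layer nudges the hidden state along \(\phi\), stepping it forward in “time” from \(h(0)=x\) to \(h(T)\).

```python
import numpy as np

def phi(h, W):
    """A toy layer transformation (stand-in for attention / feed-forward)."""
    return np.tanh(W @ h)

def residual_forward(x, weights):
    """A stack of residual layers, read as explicit time-stepping of an ODE.

    Each update h <- h + phi(h) advances the hidden state one unit of "time",
    so the whole network approximates the flow from h(0) = x to h(T).
    """
    h = x
    for W in weights:          # one step per layer
        h = h + phi(h, W)      # h_t = h_{t-1} + phi(h_{t-1})
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=8)
weights = [0.1 * rng.normal(size=(8, 8)) for _ in range(12)]  # 12 "layers"
print(residual_forward(x, weights))
```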
The Explicit Euler Method (Standard BERT)
Since computers can’t handle infinite continuous steps, we have to approximate. The standard way to solve an ODE numerically is the Explicit Euler Method. It estimates the next step based entirely on the current location and the slope at that location.
Mathematically, a standard Residual Connection in BERT is actually an implementation of the Explicit Euler method with a step size of \(\gamma=1\):

\[h_t = h_{t-1} + \gamma\,\phi(h_{t-1})\]
The Flaw: The Explicit Euler method is simple and fast, but it is conditionally stable. If the input \(x\) is perturbed (i.e., an adversarial attack adds noise), the error can accumulate and explode as it propagates through the layers. The “path” the data takes diverges from the correct path, leading to a wrong prediction.
The Implicit Euler Method (The Solution)
There is a more robust way to solve ODEs: the Implicit Euler Method. Instead of calculating the next step based on the current slope, it calculates the next step based on the future slope.

\[h_t = h_{t-1} + \gamma\,\phi(h_t)\]
Look closely at the difference. In the Explicit method, \(\phi\) takes \(h_{t-1}\) as input. In the Implicit method above, \(\phi\) takes \(h_t\) as input.
This seems paradoxical—how can you use \(h_t\) to calculate \(h_t\)? We will get to the “how” in a moment, but first, let’s understand the “why.”
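
As a tiny preview, consider the simplest possible setting: the linear test problem \(\phi(h) = \lambda h\) used in the next section’s stability analysis. There, “solving for \(h_t\)” is just algebra. The sketch below is my own illustration, not code from the paper; a real BERT layer is non-linear, which is why IM-BERT needs the iterative procedure described later.

```python
def explicit_step(h_prev, lam, gamma):
    # h_t = h_{t-1} + gamma * phi(h_{t-1}), with phi(h) = lam * h
    return h_prev + gamma * lam * h_prev

def implicit_step(h_prev, lam, gamma):
    # h_t = h_{t-1} + gamma * lam * h_t  =>  h_t = h_{t-1} / (1 - gamma * lam)
    return h_prev / (1.0 - gamma * lam)

lam, gamma, h0 = -2.0, 1.0, 1.0        # a stable system (lam < 0) with a big step size
print(explicit_step(h0, lam, gamma))   # -1.0   -> overshoots straight past the equilibrium at 0
print(implicit_step(h0, lam, gamma))   #  0.333 -> moves calmly toward 0, as the true solution does
```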
Why Implicit is Better: Stability Analysis
The authors perform a stability analysis to compare how these two methods handle perturbations (noise).
They use the standard test equation \(\frac{dh(t)}{dt} = \lambda h(t)\), where \(\lambda\) represents the eigenvalues of the system (which must be negative for stability) and \(\gamma\) is the step size.
When the input \(x\) is perturbed by a small amount \(\eta\), the Explicit method multiplies that perturbation by \(1+\gamma\lambda\) at every step, so it only remains stable if \(|1+\gamma\lambda| \le 1\), i.e., if \(\gamma\lambda\) stays inside the unit circle centered at \(-1\). Outside this region, the error grows exponentially as it propagates through the layers.
However, the Implicit Method has a magical property called A-stability (Absolute Stability). On the same test equation, one implicit step scales the state, and therefore any perturbation riding on it, by

\[\frac{1}{1-\gamma\lambda}, \qquad \left|\frac{1}{1-\gamma\lambda}\right| < 1 \quad \text{for every } \lambda < 0 \text{ and } \gamma > 0.\]
The authors prove (Proposition 2 in the paper) that for the Implicit Euler method, the error between the perturbed path and the clean path converges to zero regardless of the step size.
In plain English: No matter how much you kick the input (adversarial attack), the Implicit method tends to pull the hidden states back toward the correct trajectory.
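
To see this in action without any neural network at all, here is a small numerical experiment of my own on the same linear test equation: each method multiplies the state by a fixed factor per step, so we can track how a perturbation \(\eta\) in the input propagates under a deliberately large step.

```python
def run(h0, amplify, n_steps=10):
    """Apply a fixed per-step amplification factor, as both Euler methods do
    on the linear test equation dh/dt = lambda * h."""
    traj = [h0]
    for _ in range(n_steps):
        traj.append(amplify * traj[-1])
    return traj

lam, gamma, eta = -3.0, 1.0, 0.1           # stable ODE, but a large step: gamma*lam = -3
explicit_factor = 1 + gamma * lam          # = -2   -> |factor| = 2 > 1
implicit_factor = 1 / (1 - gamma * lam)    # = 0.25 -> |factor| < 1 for any gamma > 0

for name, factor in [("explicit", explicit_factor), ("implicit", implicit_factor)]:
    clean = run(1.0, factor)
    noisy = run(1.0 + eta, factor)
    gap = [abs(c - n) for c, n in zip(clean, noisy)]
    print(name, [round(g, 6) for g in gap])
# explicit: the 0.1 gap doubles every step and ends up above 100
# implicit: the 0.1 gap shrinks by 4x every step and ends up around 1e-7
```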
The Architecture: Building IM-BERT
We established that the Implicit Euler method is theoretically superior for robustness. But there is a catch: the implicit update (Eq. 6 in the paper) is a non-linear equation where \(h_t\) appears on both sides. You cannot just “calculate” it; you have to solve for it.
To implement this in a neural network, the authors treat finding the next hidden state \(h_t\) as an optimization problem. They want to find an \(h_t\) that minimizes the difference between the left and right sides of the equation.
They define the target state \(h_t^*\) as:

\[h_t^* = \arg\min_{h_t}\ \left\lVert\, h_t - h_{t-1} - \gamma\,\phi(h_t) \,\right\rVert^2\]
The IM-Connection
Instead of a simple addition operation between layers, IM-BERT introduces the IM-Connection.
Inside every layer (or specific layers), the model runs a mini-loop. It starts with a guess for \(h_t\) (usually the result of the explicit method) and then uses Gradient Descent to iteratively update \(h_t\) until it satisfies the Implicit equation.
This process essentially “denoises” the representation at every single layer before passing it to the next.
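
Here is a minimal PyTorch-style sketch of that inner loop, reconstructed from the description above rather than taken from the authors’ code; the number of inner iterations, the inner learning rate, and the stand-in layer are all illustrative choices.

```python
import torch

def im_connection(h_prev, phi, n_iter=5, lr=0.1, gamma=1.0):
    """One IM-Connection update: approximately solve h = h_prev + gamma * phi(h).

    phi is the layer's transformation (e.g. attention + feed-forward). We minimize
    ||h - h_prev - gamma * phi(h)||^2 over h with gradient descent, warm-starting
    from the explicit Euler guess. (Sketch only; the paper's inner-loop details
    and training-time handling may differ.)
    """
    h = (h_prev + gamma * phi(h_prev)).detach().requires_grad_(True)  # explicit warm start
    for _ in range(n_iter):
        residual = h - h_prev - gamma * phi(h)
        loss = residual.pow(2).sum()
        (grad,) = torch.autograd.grad(loss, h)
        h = (h - lr * grad).detach().requires_grad_(True)             # inner gradient step
    return h.detach()

# Toy usage with a stand-in "layer": a small linear map + tanh
layer = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Tanh())
h_prev = torch.randn(2, 16)                       # (batch, hidden)
h_next = im_connection(h_prev, lambda h: layer(h))
print(h_next.shape)                               # torch.Size([2, 16])
```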
Let’s look at the architectural comparison:

- (a) BERT: Standard connection. Information flows straight through.
- (b) EX-BERT: A variant using explicit residual connections (for fair comparison).
- (c) IM-BERT: Notice the red loops. The hidden state loops back on itself. The output of layer \(l\) isn’t just passed to \(l+1\); it is refined iteratively to ensure it sits on a stable trajectory.
Experimental Results
Does this mathematical rigor translate to better performance? The researchers tested IM-BERT on the AdvGLUE benchmark, a dataset specifically designed to break language models with varying types of attacks (word-level, sentence-level, and human-crafted tricky phrases).
1. Robustness Against Attacks
The results were impressive. IM-BERT significantly outperformed the standard BERT baseline and even competitive adversarial training methods.

Key takeaways from Table 1:
- Standard Training: Just by changing the architecture (IM-BERT), the model achieves 41.2% average accuracy on AdvGLUE, compared to standard BERT’s 35.1%. That is a massive 6.1-point jump without seeing a single adversarial example during training.
- Adversarial Training: IM-BERT even outperforms methods like FreeLB and SMART, which are computationally expensive training strategies designed specifically for this purpose.
2. The Low-Resource Scenario
The benefits of IM-BERT are most visible when data is scarce. This makes sense: with little training data, models tend to memorize noise, and the stability of the Implicit method damps exactly that kind of overfitting.

In Table 2, we see results when training on only 1,000 or 500 instances.
- With only 500 samples, IM-BERT (with 10 iterations) scores 42.4%, completely crushing the standard BERT score of 36.5%.
- This suggests that if you are a startup or student with a small dataset, using an ODE-inspired architecture is safer than using a standard Transformer.
3. Where should we put the connections?
The Implicit method is slower because it requires an iterative loop inside the forward pass. To mitigate this, the authors investigated if they could apply the IM-Connection to only some layers.

Table 3 reveals an interesting insight into how BERT processes information:
- Lower Layers (1-3): Applying IM-connections here helps against Word-level attacks.
- Middle Layers (4-6): This seems to be the sweet spot. It offers high accuracy with fewer FLOPs (computational operations).
- Upper Layers (10-12): Applying the correction too late doesn’t help much. If the trajectory has already diverged in the early layers, fixing it at the end is too difficult.
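
If you want to experiment with this trade-off yourself, a hypothetical forward pass might look like the sketch below, reusing the illustrative `im_connection` helper from earlier and hardening only the middle of the stack; none of the names or defaults here come from the paper.

```python
def encoder_forward(h, layers, im_layers=frozenset({3, 4, 5}), n_iter=5):
    """Run a stack of encoder layers, using the iterative IM-Connection only on
    the indices in `im_layers` (0-based, so {3, 4, 5} = layers 4-6) and the cheap
    explicit residual everywhere else."""
    for idx, layer in enumerate(layers):
        if idx in im_layers:
            h = im_connection(h, layer, n_iter=n_iter)  # iterative: more robust, more FLOPs
        else:
            h = h + layer(h)                            # standard explicit residual
    return h

# Hypothetical usage: harden only the "sweet spot" of a 12-layer stack.
# hidden = encoder_forward(embeddings, twelve_encoder_layers)
```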
Conclusion and Implications
The IM-BERT paper provides a compelling bridge between the abstract mathematics of differential equations and the practical engineering of Natural Language Processing.
By recognizing that a residual connection is just a sloppy approximation of a continuous flow (Explicit Euler), the authors identified a fundamental source of instability in modern AI. By upgrading that approximation to the Implicit Euler Method, they endowed the model with “Absolute Stability.”
Why does this matter for you?
- No Free Lunch, but a Cheap Lunch: IM-BERT improves robustness without requiring you to generate thousands of adversarial examples or double your training time with complex regularization schemes. The cost is paid during inference (loops inside layers), not data preparation.
- Safety by Design: It shifts the focus from “patching” models (training) to “hardening” them (architecture).
- Low-Resource Hero: It is a viable strategy for making models useful in scenarios where data is expensive or rare.
As we continue to deploy LLMs in critical areas, robustness isn’t just a metric; it’s a requirement. IM-BERT shows that sometimes, the best way forward is to look back at the calculus textbooks.