The Physics of AI: Why Test Accuracy Isn’t Enough for Material Simulation

In the world of computational chemistry and materials science, we are witnessing a revolution. For decades, Density Functional Theory (DFT) has been the gold standard for modeling how atoms interact. It provides the quantum mechanical foundation for discovering new drugs, designing better batteries, and understanding the thermal properties of semiconductors. But DFT has a major bottleneck: it is notoriously slow. Its computational cost scales cubically with the number of electrons (\(O(n^3)\)), meaning that simulating large systems or long timescales is often impossible.

Enter Machine Learning Interatomic Potentials (MLIPs). These neural networks promise to approximate the accuracy of DFT at a fraction of the computational cost. In recent years, we have seen MLIPs achieve vanishingly small errors on test datasets. If you look at the leaderboards, it seems the problem is solved.

However, a new paper from FAIR at Meta, “Learning Smooth and Expressive Interatomic Potentials for Physical Property Prediction,” highlights a critical disconnect. The researchers demonstrate that a model with the lowest error on a static test set doesn’t necessarily respect the laws of physics when put into motion. Specifically, many state-of-the-art models fail to conserve energy during molecular dynamics (MD) simulations.

In this deep dive, we will explore why high accuracy doesn’t always equal good physics, how the researchers diagnose this problem, and the architectural solution they propose: the eSEN (equivariant Smooth Energy Network).


1. The Disconnect: Test Sets vs. Physical Reality

To understand the core problem, we first need to understand how these AI models are typically benchmarked. Usually, a dataset is split into a training set and a test set. The model predicts energies and forces for the atoms in the test set, and we calculate the Mean Absolute Error (MAE). If the MAE is low, we assume the model understands the physics of the system.

But predicting the force on a static snapshot of atoms is very different from running a simulation. In a Molecular Dynamics (MD) simulation, we use the predicted forces to move atoms over thousands or millions of time steps. If the model makes tiny, systematic errors—or if the energy landscape it predicts isn’t “smooth”—the simulation can spiral out of control.

The Drift Problem

The researchers propose a litmus test for MLIPs: Energy Conservation.

In a microcanonical ensemble (NVE) simulation, an isolated system must conserve total energy (kinetic + potential). If a machine learning model is driving the simulation and the total energy starts to drift significantly, the simulation is physically invalid.
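In practice, this litmus test is easy to script: run an NVE trajectory, record the kinetic and potential energy at each step, and fit a line to their sum. A minimal sketch (the trajectory arrays and units here are placeholders, not the paper’s actual evaluation code):

```python
import numpy as np

def energy_drift(kinetic, potential, dt):
    """Litmus test for an MLIP driving NVE dynamics.

    kinetic, potential: per-step energies from an MD trajectory.
    Returns the slope of a linear fit to the total energy; a
    well-behaved model should give a slope near zero.
    """
    total = np.asarray(kinetic) + np.asarray(potential)
    time = np.arange(len(total)) * dt
    slope, _ = np.polyfit(time, total - total[0], 1)
    return slope  # e.g., eV/ps for energies in eV and dt in ps
```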

Figure 1 (a) Energy conservation in MD simulations. Direct-force models (Orb, eqV2) and CHGNet fail to conserve. (b) A higher F1 score on the Matbench-Discovery benchmark strongly correlates with a lower test-set energy MAE. (c) Test-set energy MAE and \(\kappa_{SRME}\) on the Matbench-Discovery benchmark. (d) Test-set energy MAE and vibrational entropy MAE on the MDR Phonon benchmark. Our model (eSEN) achieves the best performance on all benchmarks.

As shown in Figure 1(a) above, this is exactly what happens with many leading models. The graph plots energy drift over time.

  • The blue line represents eSEN (the model proposed in this paper). Notice how flat it is; the energy drift is negligible.
  • The other lines (CHGNet, eqV2, Orb) show massive energy drifts. After just 40 picoseconds, the simulations using these models become unphysical.

Why does this happen? The authors identify two main culprits: non-conservative force predictions and lack of smoothness in the learned potential energy surface (PES).


2. Preliminaries: The Physics of Potentials

Before analyzing the solution, let’s establish the physical constraints a good model must satisfy.

Conservative Forces

In physics, a force is “conservative” if the work done by moving a particle around a closed loop is zero. This essentially means you can’t create free energy just by moving atoms in a circle and returning them to their start.

Mathematically, this is expressed as:

\[ \oint \boldsymbol{F} \cdot d\boldsymbol{r} = 0 \]

For this condition to hold, the force vector field \(\boldsymbol{F}\) must be the negative gradient of a scalar potential energy field \(E\):

\[ \boldsymbol{F} = -\nabla_{\boldsymbol{r}} E \]

If an AI model predicts energy \(E\) and forces \(\boldsymbol{F}\) separately (which many do to save computational time), there is no mathematical guarantee that \(\boldsymbol{F}\) is the derivative of \(E\). These are called Direct-Force Models. They are computationally efficient but physically “broken” because they violate energy conservation.
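With automatic differentiation, the conservative route is only a few lines. Here is a minimal PyTorch sketch, assuming a hypothetical `energy_model` that maps atomic positions to a scalar total energy (illustrative, not the paper’s implementation):

```python
import torch

def conservative_forces(energy_model, positions):
    """Forces as F = -∇E, computed by autograd.

    Because the forces are the exact gradient of one scalar energy,
    the resulting force field is conservative by construction.
    """
    positions = positions.detach().requires_grad_(True)
    energy = energy_model(positions).sum()  # scalar total energy
    (grad,) = torch.autograd.grad(energy, positions)
    return -grad
```

A direct-force model instead adds a second output head that regresses \(\boldsymbol{F}\) directly; nothing constrains that head to be the gradient of the energy head.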

Bounded Derivatives and Smoothness

Even if a model calculates forces as the gradient of energy, it can still fail in a simulation if the energy landscape is “jagged.”

MD simulations typically use the Verlet algorithm to integrate equations of motion. This algorithm relies on the assumption that the energy function is smooth—specifically, that its higher-order derivatives exist and are bounded.
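For reference, one velocity-Verlet step looks like this (a generic textbook sketch, independent of any particular potential):

```python
import numpy as np

def velocity_verlet_step(pos, vel, masses, forces_fn, dt):
    """One velocity-Verlet step for positions (N, 3) and velocities (N, 3).

    The integrator's error analysis assumes forces_fn is the gradient of
    a smooth energy whose higher-order derivatives are bounded.
    """
    f0 = forces_fn(pos)
    vel_half = vel + 0.5 * dt * f0 / masses[:, None]
    pos_new = pos + dt * vel_half
    f1 = forces_fn(pos_new)
    vel_new = vel_half + 0.5 * dt * f1 / masses[:, None]
    return pos_new, vel_new
```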

The paper highlights the error bound for the Verlet integrator:

\[ \left| E(t_n) - E(t_0) \right| \;\leq\; C_N \, \Delta t^2 \]

The implication is simple: energy conservation depends on the smoothness of your energy function. Here \(\Delta t\) is the integration time step, and the constant \(C_N\) depends on the bounds of the higher-order derivatives of the energy \(E\).

If your neural network uses functions that introduce sharp kinks or discontinuities (for example, hard cutoffs for neighbor lists or discretized grids), the derivatives explode, the error term shoots up, and your simulation crashes (or drifts into nonsense).


3. The Solution: eSEN (equivariant Smooth Energy Network)

The researchers introduce eSEN, a model explicitly designed to prioritize “smoothness” and “conservation” over raw computational shortcuts.

Architecture Overview

eSEN is a Message-Passing Neural Network (MPNN). It represents atoms as nodes in a graph and interactions between atoms as edges.

Figure 2 (a) The eSEN architecture. The high-level architecture is similar to Transformer/Equiformer, while the edgewise/nodewise layers are simplified/enhanced. The final-layer L=0 features are used to predict nodewise energy, which is summed to get the total potential energy E. Forces and stress are obtained through back-propagation. (b) The Edgewise Convolution layer in eSEN.

As seen in Figure 2, the architecture generally follows the structure of modern Transformers or Equiformers, but with specific modifications for physics:

  1. Strict Energy-Force Relationship: Unlike direct-force models, eSEN predicts a single scalar energy value (\(E\)). The forces are obtained via backpropagation (calculating the gradient \(-\nabla E\)). This guarantees the forces are conservative by definition.
  2. Equivariant Representations: The model uses Spherical Harmonics (mathematical functions defined on the surface of a sphere) to represent atomic environments. This ensures that if you rotate a molecule, the predicted energy doesn’t change, and the force vectors rotate accordingly.

The Design Choices for Smoothness

The “secret sauce” of eSEN isn’t just the high-level architecture; it’s the specific design choices made to ensure the Potential Energy Surface (PES) is continuously differentiable. The authors performed rigorous ablation studies to prove which components matter.

1. Avoiding Discretization

Many equivariant networks (like eSCN or EquiformerV2) project spherical harmonics onto a grid to perform non-linear operations efficiently.

  • The Problem: Going back and forth between continuous spherical harmonics and a discrete grid introduces “aliasing” or sampling errors. These errors act like tiny jagged edges on the energy surface.
  • The eSEN Solution: eSEN avoids grid projection entirely for its node-wise operations. It uses a specialized “Gated non-linearity” that operates directly in the spherical harmonic space.
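Gates of this kind are a standard equivariant building block (as in e3nn-style networks). The sketch below conveys the idea, though eSEN’s exact layer differs in its details: scalar (L=0) features pass through an ordinary non-linearity, while higher-order features are only rescaled by a smooth, rotation-invariant gate:

```python
import torch
import torch.nn.functional as F

def gated_nonlinearity(scalars, higher_order):
    """Equivariant gate over spherical-harmonic features.

    scalars:      (N, C) L=0 features; pointwise non-linearities are safe here.
    higher_order: dict {l: (N, C, 2l+1)} L>0 features; rescaling them by an
                  invariant gate preserves equivariance, and a smooth gate
                  (sigmoid) avoids introducing kinks into the energy surface.
    """
    gates = torch.sigmoid(scalars)
    out_scalars = F.silu(scalars)
    out_higher = {l: feats * gates.unsqueeze(-1) for l, feats in higher_order.items()}
    return out_scalars, out_higher
```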

2. Smooth Cutoffs (Envelope Functions)

In graph neural networks, an atom interacts with neighbors within a certain radius (e.g., 6 Å).

  • The Problem: If an atom moves from 5.99 Å to 6.01 Å, it suddenly “disappears” from the neighbor list. This causes a discontinuity (a jump) in the energy function.
  • The eSEN Solution: They use a polynomial Envelope Function. As an atom approaches the cutoff distance, its interaction strength smoothly decays to exactly zero (sketched below).
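A common choice is a DimeNet-style polynomial envelope, which reaches exactly zero at the cutoff with vanishing first and second derivatives. A sketch (the specific polynomial is illustrative; the paper’s exact envelope may differ):

```python
import torch

def polynomial_envelope(r, r_cut, p=5):
    """Smooth cutoff: decays to exactly zero at r_cut, with vanishing
    first and second derivatives there (DimeNet-style polynomial).
    """
    x = r / r_cut
    env = (1.0
           - (p + 1) * (p + 2) / 2 * x**p
           + p * (p + 2) * x**(p + 1)
           - p * (p + 1) / 2 * x**(p + 2))
    return torch.where(x < 1.0, env, torch.zeros_like(env))
```

Multiplying every edge message by this factor means an atom’s contribution fades to nothing before it leaves the neighbor list.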

3. No Neighbor Limits

To speed up training, many models limit the number of neighbors to a fixed number (e.g., “closest 20 atoms”).

  • The Problem: If the 21st atom moves slightly closer than the 20th atom, the neighbor list abruptly swaps. This creates a discontinuity.
  • The eSEN Solution: eSEN includes all neighbors within the cutoff radius, regardless of how many there are.
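The corresponding graph construction is then trivial: keep every pair within the cutoff. A naive sketch (ignoring periodic boundary conditions for brevity):

```python
import torch

def radius_graph(positions, r_cut):
    """All pairs within r_cut, with no max-neighbor truncation.

    Combined with an envelope function, the edge set only changes at
    distances where the interaction is already zero, so the total
    energy stays continuous as atoms move.
    """
    dist = torch.cdist(positions, positions)  # (N, N) pairwise distances
    mask = (dist < r_cut) & ~torch.eye(len(positions), dtype=torch.bool)
    src, dst = torch.nonzero(mask, as_tuple=True)
    return src, dst
```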

Proving the Design Choices

The researchers didn’t just guess these choices; they tested them by running simulations and measuring energy drift.

Figure 4 Conservation error on the TM23 task (top row) and MD22 task (bottom row) for ablating design choices of eSEN. Models that conserve energy are bolded in the legends.

Figure 4 is a striking piece of evidence. Look at the logarithmic scale on the y-axis (Energy Drift):

  • Plot (a1): The orange line (Direct-Force) drifts significantly. The blue line (Conservative) is stable.
  • Plot (c1): The orange line (“Neighbor limit”) shows massive drift, and the green line (“No envelope”) also fails to conserve energy. The blue line (eSEN default) is stable.

This confirms that even if you have a conservative architecture, small implementation details like neighbor limits or lack of envelopes can destroy the physical validity of the model.


4. A Clever Training Trick: Pre-training with Direct Forces

There is one downside to conservative models: they are slower to train. Because forces are obtained by differentiating the energy, every training step requires an extra backward pass through the network (and second-order gradients for the force loss), which takes more memory and time than direct-force prediction.

The authors propose a “best of both worlds” strategy:

  1. Pre-train the model using Direct-Force prediction. This is fast and gets the model “in the ballpark.”
  2. Fine-tune the model using the Conservative (gradient-based) method.
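In pseudocode, the two stages differ only in how forces enter the loss. The sketch below uses hypothetical `forward_direct` and `forward_energy` methods, not the paper’s actual API:

```python
import torch
import torch.nn.functional as F

def loss_stage1_direct(model, batch):
    """Pre-training: a direct force head; one forward pass, no force gradients."""
    energy, forces = model.forward_direct(batch.pos)       # hypothetical API
    return F.l1_loss(energy, batch.energy) + F.l1_loss(forces, batch.forces)

def loss_stage2_conservative(model, batch):
    """Fine-tuning: forces from -∇E, so training needs second-order gradients."""
    pos = batch.pos.detach().requires_grad_(True)
    energy = model.forward_energy(pos)                     # hypothetical API
    (grad,) = torch.autograd.grad(energy.sum(), pos, create_graph=True)
    forces = -grad
    return F.l1_loss(energy, batch.energy) + F.l1_loss(forces, batch.forces)
```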

Figure 3 Validation loss curves for epoch and wallclock time.

Figure 3 shows the validation loss over time. The green line represents this hybrid strategy (“Finetuning”). Notice how it drops rapidly (thanks to fast pre-training) and then settles at the same low error as the fully conservative model (orange line), but in roughly 40% less wall-clock time.


5. Experimental Results: SOTA across the Board

eSEN isn’t just theoretically pure; it performs exceptionally well on difficult benchmarks.

Materials Stability (Matbench Discovery)

Matbench Discovery is a benchmark that simulates the discovery of new inorganic crystals. The goal is to predict which crystal structures are stable (will exist in nature) and which will decompose.

This task requires Geometry Optimization (Relaxation). You start with a crystal structure and let the model move the atoms to find the lowest energy state. If the energy surface is rough (jagged), the optimization gets stuck in local minima, leading to wrong stability predictions.
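With the potential wrapped as an ASE calculator, a relaxation takes only a few lines. Here is a sketch using ASE’s toy EMT calculator as a stand-in for an MLIP:

```python
from ase.build import bulk
from ase.calculators.emt import EMT   # stand-in; any MLIP ASE calculator works here
from ase.optimize import BFGS

atoms = bulk("Cu", cubic=True)
atoms.rattle(stdev=0.05)              # perturb atoms off their ideal sites
atoms.calc = EMT()
BFGS(atoms, logfile=None).run(fmax=0.02)   # relax until max force < 0.02 eV/Å
print(atoms.get_potential_energy())
```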

eSEN achieves an F1 score of 0.831, the highest among compliant models.

Phonons and Thermal Conductivity

Perhaps the most impressive results come from Phonon calculations.

Phonons are quasiparticles that represent the collective vibrations of atoms in a crystal. Accurate phonon prediction is the “final boss” of potential smoothness because it requires accurate second derivatives (the Hessian matrix) of the energy. If the energy surface has even minor kinks, the second derivatives (curvature) will be completely wrong.
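To see why, recall that phonon frequencies come from the eigenvalues of the (mass-weighted) Hessian, which is typically built from finite differences of the model’s outputs. A toy central-difference sketch (illustrative only; production phonon codes such as phonopy use symmetrized displacements):

```python
import numpy as np

def hessian_fd(energy_fn, x0, eps=1e-3):
    """Central-difference Hessian of a scalar energy function.

    Any kink or aliasing artifact in energy_fn at the scale of eps shows
    up directly as noise in these second derivatives, and hence in the
    predicted phonon spectrum.
    """
    n = x0.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (energy_fn(x0 + ei + ej) - energy_fn(x0 + ei - ej)
                       - energy_fn(x0 - ei + ej) + energy_fn(x0 - ei - ej)) / (4 * eps**2)
    return H
```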

The Visual Proof

Let’s compare eSEN against a direct-force model (eqV2) on predicting phonon band structures.

Below is the result from eSEN (Figure 5):

Figure 5 Predicted phonon band structure and density of states (DOS) of Si, CsCl, AlN using eSEN at different displacement values.

Notice the clean lines. The colored lines (model predictions at different displacement values) overlap almost perfectly with the green dashed line (the DFT ground truth). The bands go to zero at the Gamma point (the left side of each panel), which is physically required for acoustic modes.

Now, compare this to a Direct-Force Model (Figure C.11):

Figure C.11 Predicted phonon band structure… using eqV2-S-DeNS (direct-force prediction).

The difference is night and day.

  1. Imaginary Frequencies: In the first plot (Si), notice the lines dipping below zero (negative values, conventionally used to plot imaginary frequencies). This implies the crystal is unstable, which is incorrect for silicon.
  2. Missing Acoustic Modes: The bands do not correctly converge to zero at the Gamma point (\(\Gamma\)).
  3. Noise: The bands are wiggly and deviate significantly from the green dashed line (DFT).

This visually demonstrates that while direct-force models might get the energy roughly right, they fail to capture the curvature of the energy landscape, making them useless for advanced property prediction like thermal conductivity.

Efficiency

One might worry that enforcing these constraints makes the model too slow. However, the authors benchmarked eSEN against MACE (a popular lightweight model).

Figure B.10 Inference efficiency of MACE-OFF-L and eSENs of a similar scale.

As shown in Figure B.10, the lightweight version of eSEN (3.2M parameters) matches the inference speed of MACE-OFF-L while achieving lower errors.


6. Conclusion: The “Test Set” Trap

The most significant contribution of this paper might be a change in perspective. For a long time, the field has chased lower MAE on static test sets.

The authors provide compelling evidence that Energy Conservation in MD is a better proxy for downstream performance than simple test error.

Figure 6 Test error correlation across several property prediction tasks for eSEN variants. Conservative models are shown as boxes and those found to not conserve as crosses.

Figure 6 drives this point home. The models marked with boxes are conservative (eSEN variants); the models marked with crosses are not.

  • The conservative models show a tight correlation: lower test error leads to better property prediction (\(\kappa_{SRME}\)).
  • The non-conservative models are scattered. You can have a very low energy error but a terrible physical prediction score.

Summary of Key Takeaways

  1. Conservation Matters: Models must conserve energy to be useful for MD simulations.
  2. Smoothness is Key: High accuracy on static structures is useless for vibrational properties (phonons) if the energy surface is jagged.
  3. Architecture > Data: You cannot just train a model on more data to fix non-conservative behavior; it requires architectural choices (e.g., envelope functions, avoiding grids, gradient-based forces).
  4. eSEN: The proposed model achieves State-of-the-Art results by rigorously adhering to these physical principles.

This paper serves as a wake-up call to the AI4Science community: We aren’t just fitting curves; we are modeling physical reality. Our loss functions and benchmarks need to reflect that.