Introduction: The Illusion of Discovery
Imagine you are a physics professor. You ask a student to write down the formula for Einstein’s mass-energy equivalence. The student immediately writes \(E=mc^2\). Impressive? Not really—they simply memorized a famous string of characters. Now, imagine you give that same student a table of raw experimental data concerning the oscillation of a spring and ask them to derive the governing law from scratch, without telling them what physical phenomenon they are looking at. If they can produce the correct differential equation, that is no longer memorization; that is discovery.
This distinction is the central crisis facing Artificial Intelligence in the sciences today. Large Language Models (LLMs) have shown incredible aptitude for coding and mathematics, leading researchers to wonder: Can we use these models to discover new scientific laws? Can we automate the work of Newton or Kepler?
Early attempts have been promising, but they face a critical flaw. Most benchmarks used to test these AI “scientists” are based on famous equations from textbooks (like the Feynman Lectures). Because LLMs are trained on the internet, they have likely seen these equations thousands of times. When an AI solves these problems, is it performing data-driven reasoning, or is it just reciting what it read in its training data?
To answer this, a team of researchers has introduced LLM-SRBench, a rigorous new benchmark designed to strip away the crutch of memorization. By forcing LLMs to solve transformed versions of known laws and entirely new “synthetic” scientific problems, this paper exposes the true capabilities—and current limitations—of AI in scientific discovery.

As shown in Figure 1, the difference is stark. When tested on standard Feynman problems (the red line), error rates drop precipitously, suggesting the model “knows” the answer instantly. But when tested on the new LLM-SRBench datasets (blue and green lines), the models struggle significantly, revealing that genuine discovery is a much harder hill to climb.
Background: From Genetic Algorithms to LLMs
What is Equation Discovery?
Equation discovery, formally known as Symbolic Regression (SR), is the task of finding a symbolic mathematical expression that best describes a dataset. Unlike a standard neural network, which is a “black box” of weights and biases, symbolic regression aims to output a human-readable formula (e.g., \(F=ma\)).
Traditionally, this field was dominated by Genetic Programming (GP). These algorithms would generate random mathematical trees, “breed” the best-fitting ones, and mutate them over generations to find a solution. While effective, GP is computationally expensive and “blind”—it doesn’t understand physics; it just crunches numbers.
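To make this concrete, here is a toy sketch of a GP-style search loop. The tuple-based expression encoding, the operator set, and the hyperparameters are invented for illustration; production symbolic-regression libraries are far more sophisticated.

```python
import random
import numpy as np

# Toy genetic-programming loop (illustrative only).
# Expressions are nested tuples, e.g. ("mul", ("var",), ("var",)) means x * x.
OPS = {"add": np.add, "sub": np.subtract, "mul": np.multiply}

def random_expr(depth=2):
    if depth == 0 or random.random() < 0.3:
        return ("var",) if random.random() < 0.7 else ("const", random.uniform(-2, 2))
    return (random.choice(list(OPS)), random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, x):
    if expr[0] == "var":
        return x
    if expr[0] == "const":
        return np.full_like(x, expr[1])
    return OPS[expr[0]](evaluate(expr[1], x), evaluate(expr[2], x))

def fitness(expr, x, y):
    return float(np.mean((evaluate(expr, x) - y) ** 2))

def mutate(expr):
    # Occasionally replace the whole subtree; otherwise recurse into the left child.
    if expr[0] in ("var", "const") or random.random() < 0.3:
        return random_expr(depth=2)
    return (expr[0], mutate(expr[1]), expr[2])

# The algorithm is "blind": it only sees numbers sampled from y = x^2 + x.
x = np.linspace(-1, 1, 50)
y = x**2 + x

population = [random_expr() for _ in range(200)]
for _ in range(30):
    population.sort(key=lambda e: fitness(e, x, y))   # selection: keep the best fits
    survivors = population[:50]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(150)]

best = min(population, key=lambda e: fitness(e, x, y))
print("best expression:", best, "  mse:", fitness(best, x, y))
```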
The LLM Paradigm Shift
LLMs offer a new paradigm. Because they have read millions of scientific papers, they possess “embedded scientific knowledge.” Theoretically, an LLM should know that if variables involve “mass” and “acceleration,” the resulting equation might involve force.

Figure 2 illustrates the modern workflow for LLM-based equation discovery; a code sketch of the loop follows the list below.
- Input: The model receives the goal, scientific context (variable names), and a small set of numerical data.
- Discovery Process: The LLM acts as the hypothesis generator. It uses its scientific priors to suggest a formula (e.g., “Maybe this is a harmonic oscillator?”).
- Refinement: The system checks how well the formula fits the data. It might use a standard optimizer to tune parameters (finding the exact values for constants like \(k\) or \(g\)).
- Feedback: The results are fed back into the prompt, asking the LLM to refine its guess.
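Here is a minimal sketch of that loop, assuming a placeholder `ask_llm` function in place of a real chat-model call; SciPy's `curve_fit` plays the role of the standard optimizer that tunes the constants.

```python
import numpy as np
from scipy.optimize import curve_fit

def ask_llm(prompt: str) -> str:
    # Placeholder for a chat-model call: returns a candidate equation skeleton
    # as a Python lambda string (a real system would query an actual LLM here).
    return "lambda t, a, b: a * np.exp(-b * t)"

def fit_and_score(skeleton: str, t, y):
    f = eval(skeleton)                                           # skeleton -> callable
    params, _ = curve_fit(f, t, y, p0=np.ones(2), maxfev=5000)   # refinement: tune constants
    nmse = float(np.mean((f(t, *params) - y) ** 2) / np.var(y))
    return params, nmse

# Toy data for the loop to explain.
t = np.linspace(0, 5, 100)
y = 3.0 * np.exp(-0.7 * t) + 0.01 * np.random.randn(100)

prompt = "Variables: time t, concentration y. Propose an equation skeleton."
for i in range(3):
    skeleton = ask_llm(prompt)                    # discovery: the LLM hypothesizes a form
    params, nmse = fit_and_score(skeleton, t, y)  # refinement: fit constants to the data
    print(f"iter {i}: {skeleton}  params={params.round(2)}  NMSE={nmse:.4f}")
    # Feedback: the scored hypothesis goes back into the next prompt.
    prompt += f"\nPrevious hypothesis: {skeleton} (NMSE={nmse:.4f}). Please refine it."
```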
However, as promising as this loop looks, we run back into the memorization problem. If the “scientific context” triggers the LLM to recall a specific page from a textbook, the “Discovery Process” becomes a “Retrieval Process.”
The Core Method: Designing LLM-SRBench
To rigorously evaluate whether an AI is truly reasoning, the authors of LLM-SRBench created a dataset of 239 problems split into two novel categories. These categories are designed to be impossible to solve via simple memorization.

Figure 3 outlines the generation pipeline. Let’s break down the two main strategies used: LSR-Transform and LSR-Synth.
1. LSR-Transform: The “Twist” on Classics
The first category, LSR-Transform, plays a clever trick on the models. It takes the 100 well-known physics equations from the Feynman benchmark but mathematically rearranges them.
In a standard textbook problem, you might be given mass (\(m\)), frequency (\(\omega\)), and displacement (\(x\)) and asked to find Energy (\(E\)). The equation is \(E = \frac{1}{4}m(\omega^2 + \omega_0^2)x^2\). An LLM knows this standard form by heart.
But what if we flip the script? What if we provide the Energy, mass, and frequency, and ask the model to find the displacement (\(x\))? Or the mass (\(m\))?

Figure 7 shows this transformation in action.
- Top Row: The standard Harmonic Oscillator equation is solved for mass (\(m\)). The resulting equation is mathematically valid but rarely seen in textbooks in that specific form.
- Middle Row: The Electric Dipole Potential is solved for the dipole moment (\(p_d\)) or radius (\(r\)).
- Bottom Row: The Semiconductor Diode equation is rearranged to solve for temperature (\(T\)).
This forces the LLM to perform algebraic reasoning. It cannot simply autocomplete the text. It must understand the relationship between variables and derive the inverse form based on the data provided.
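For intuition, here is a small SymPy sketch of the kind of rearrangement LSR-Transform applies to the oscillator-energy equation above; it illustrates the idea and is not the benchmark's actual generation code.

```python
import sympy as sp

# Textbook form from the Feynman set: E = 1/4 * m * (omega**2 + omega_0**2) * x**2
E, m, omega, omega0, x = sp.symbols("E m omega omega_0 x", positive=True)
textbook = sp.Eq(E, sp.Rational(1, 4) * m * (omega**2 + omega0**2) * x**2)

# LSR-Transform-style flip: make a different variable the prediction target.
solved_for_m = sp.solve(textbook, m)[0]   # e.g. 4*E/(x**2*(omega**2 + omega_0**2))
solved_for_x = sp.solve(textbook, x)[0]   # the positive root for displacement

print("m =", solved_for_m)
print("x =", solved_for_x)
```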
Crucially, the researchers ensured that these transformed equations weren’t just arbitrarily harder. They filtered the dataset to ensure the complexity (number of nodes in the equation tree) remained similar to the original Feynman problems.

As seen in Figure 8, the distribution of complexity between the original Feynman benchmark (red) and the new LSR-Transform (blue) is very similar. This controls for a key confound: if the LLM fails, it is not because the math is "longer"; it is because the model can no longer rely on rote memory.
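A rough way to reproduce that complexity measure is to count the nodes of a SymPy expression tree, as in the sketch below; the benchmark's exact counting convention may differ.

```python
import sympy as sp

def tree_size(expr) -> int:
    """Count nodes in a SymPy expression tree (a simple proxy for equation complexity)."""
    return 1 + sum(tree_size(arg) for arg in expr.args)

E, m, omega, omega0, x = sp.symbols("E m omega omega_0 x", positive=True)
original    = sp.Rational(1, 4) * m * (omega**2 + omega0**2) * x**2   # textbook right-hand side
transformed = 4 * E / (x**2 * (omega**2 + omega0**2))                 # same law, solved for m

# Comparable node counts: the transformed problem is unfamiliar, not longer.
print(tree_size(original), tree_size(transformed))
```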
2. LSR-Synth: Synthetic Scientific Worlds
The second category, LSR-Synth, goes a step further. Instead of modifying existing laws, the researchers generated entirely new, scientifically plausible problems across four domains: Chemistry, Biology, Physics, and Material Science.
The goal here is to simulate the process of discovering a new scientific phenomenon. To do this, they used a “Known + Synthetic” approach.
- Known Terms: They identify standard mathematical terms relevant to a field (e.g., logistic growth in biology or Arrhenius equations in chemistry).
- Synthetic Terms: They introduce novel but plausible interaction terms (e.g., a saturation term or a specific oscillation) that represent a hypothetical new mechanism.

Figure 9 provides examples of these “Frankenstein” equations. Look at the chemistry example (top left): it combines a standard second-order reaction term (\(-C_0 A(t)^2\)) with a novel synthetic term involving a sine wave of the square root of concentration.
These equations are physically solvable and numerically stable, but they do not exist in any textbook. To solve them, an AI cannot rely on training data. It must look at the numerical data points and realize, “Hey, there’s a periodic oscillation here that standard kinetics can’t explain,” and then hypothesize the correct mathematical structure.
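As an illustration, the sketch below numerically integrates a hypothetical known-plus-synthetic rate law in the spirit of the Figure 9 chemistry example; the coefficients and the exact form of the synthetic term are invented here.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical LSR-Synth-style rate law (not the benchmark's exact equation):
#   dA/dt = -c0 * A**2            <- known term: standard second-order kinetics
#           + c1 * sin(sqrt(A))   <- synthetic term: a plausible but novel mechanism
c0, c1 = 0.8, 0.05

def rhs(t, A):
    return -c0 * A**2 + c1 * np.sin(np.sqrt(np.abs(A)))

sol = solve_ivp(rhs, t_span=(0.0, 20.0), y0=[1.0], t_eval=np.linspace(0.0, 20.0, 200))

# A discovery method only ever sees (t, A) samples like these; the symbolic
# structure that generated them has to be inferred from the numbers alone.
print(sol.t[:5], sol.y[0][:5])
```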

Figure 10 shows the complexity spread of these synthetic problems. They cover a wide range, ensuring that the benchmark tests everything from simple relationships to complex, multi-term interactions.
How to Grade an AI Scientist
Evaluating equation discovery is notoriously difficult. If the ground truth is \(y = x + x\), and the AI predicts \(y = 2x\), standard string matching would mark it wrong, even though it is mathematically identical.
The researchers employed a multi-faceted evaluation strategy:
1. Data Fidelity (The Numerical Check)
The most basic check is whether the discovered equation actually predicts the data.

As shown above, they use:
- \(\text{Acc}_\tau\) (Accuracy to Tolerance): The percentage of test points where the predicted value is within a small tolerance (\(\tau\)) of the actual value.
- NMSE (Normalized Mean Squared Error): A standard measure of how “far off” the curve is from the data points.
They evaluate this on both In-Domain (ID) data (data points within the range used for training) and Out-Of-Domain (OOD) data (extrapolating to future time steps or different conditions). OOD performance is the gold standard for science; a physical law must hold true even in unobserved scenarios.
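A minimal NumPy version of these two metrics, together with an ID/OOD split, might look like the sketch below; the benchmark's exact tolerance and normalization details may differ.

```python
import numpy as np

def acc_tau(y_true, y_pred, tau=0.05):
    """Fraction of points whose relative error is within tolerance tau."""
    rel_err = np.abs(y_pred - y_true) / (np.abs(y_true) + 1e-12)
    return float(np.mean(rel_err <= tau))

def nmse(y_true, y_pred):
    """Mean squared error normalized by the variance of the targets."""
    return float(np.mean((y_true - y_pred) ** 2) / np.var(y_true))

# Toy setup: the "true law" is exponential decay; the hypothesis is a polynomial curve fit.
t_id  = np.linspace(0, 5, 100)    # in-domain: the range the hypothesis was fit on
t_ood = np.linspace(5, 10, 100)   # out-of-domain: extrapolation beyond that range
true_law = lambda t: 3.0 * np.exp(-0.7 * t)

coeffs = np.polyfit(t_id, true_law(t_id), deg=3)   # a curve fit, not the governing law
hypothesis = np.poly1d(coeffs)

# NMSE stays small in-domain but blows up out-of-domain for a mere curve fit.
for name, t in [("ID", t_id), ("OOD", t_ood)]:
    y, y_hat = true_law(t), hypothesis(t)
    print(name, "Acc_tau:", round(acc_tau(y, y_hat), 3), "NMSE:", round(nmse(y, y_hat), 3))
```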
2. Symbolic Accuracy (The Logic Check)
Numerical fit isn’t enough. A high-degree polynomial can “fit” almost any curve but explains nothing about the physics (this is called overfitting). We need to know if the AI found the correct symbolic structure.
To handle the "\(x+x\) vs \(2x\)" problem, the researchers used a novel approach: LLM-based Evaluation. They tasked GPT-4o with acting as a mathematical judge.

Figure 11 demonstrates this process. The evaluator is given the Ground Truth equation and the Hypothesis. It analyzes the mathematical properties of both.
- In Case 1 (Left), the evaluator recognizes that the program code provided is mathematically equivalent to the ground truth expression.
- In Case 2 (Right), the evaluator correctly identifies that the hypothesis includes terms that cannot be simplified to match the ground truth (specifically regarding the quadratic term inside the square root).
This semantic evaluation is significantly more robust than previous methods, providing a truer measure of discovery.
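For concreteness, here is a sketch of what such an LLM-as-judge check could look like using the OpenAI Python client; the prompt wording is purely illustrative and is not the benchmark's actual evaluation rubric.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_equivalence(ground_truth: str, hypothesis: str) -> str:
    # Illustrative judge prompt; the real benchmark's rubric is more detailed.
    prompt = (
        "You are a careful mathematician. Decide whether the two expressions below are "
        "mathematically equivalent up to simplification and constant renaming.\n"
        f"Ground truth: {ground_truth}\n"
        f"Hypothesis: {hypothesis}\n"
        "Answer EQUIVALENT or NOT EQUIVALENT, then give a one-sentence justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# The classic near-miss cases from the text:
print(judge_equivalence("y = x + x", "y = 2*x"))                       # should be judged equivalent
print(judge_equivalence("y = sqrt(a + b*x)", "y = sqrt(a + b*x**2)"))  # structurally different
```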
Experiments and Results
The researchers benchmarked several state-of-the-art methods, including:
- Direct Prompting: Just asking the LLM to find the equation (DataBlind).
- SGA: A method combining LLMs with PyTorch optimization.
- LaSR: Uses “concept learning” to evolve equations.
- LLM-SR: Uses LLMs to write Python programs representing equations, refined by evolutionary search.
They tested these using Llama-3.1, GPT-3.5, and GPT-4o-mini backbones.
The Main Takeaway: It’s Hard.
The results, summarized in Table 1, are humbling.

The best-performing configuration (LLM-SR with a GPT-4o-mini backbone) achieved a symbolic accuracy of only 31.5% on the LSR-Transform dataset. On the synthetic datasets, performance was even lower in many categories.
Compare this to the “Direct Prompting” rows at the top. When the model blindly guesses without a feedback loop, accuracy is near zero (e.g., 3.61% or 0%). This confirms that LLMs cannot simply “intuit” new physics; they need a rigorous, iterative search process.
The Gap Between Memory and Reasoning
The most damning evidence of memorization comes from comparing performance on Feynman problems vs. their transformed counterparts.

Figure 4 splits the performance by equation complexity.
- Red Bars (Feynman): High accuracy. Even simple LLMs score 60-80% on simple Feynman equations.
- Blue Bars (LSR-Transform): The accuracy collapses to 10-20%.
Remember, these equations have the same complexity. The only difference is that the Red bars represent forms the LLM has seen in textbooks, and the Blue bars represent algebraic rearrangements of those same laws. This gap quantifies the “memorization tax.”
Generalization: The True Test of Science
A major contribution of this paper is the focus on Out-Of-Domain (OOD) generalization.

Figure 5 breaks down the error rates (NMSE) by scientific domain.
- Solid bars (In-Domain) are lower, meaning models fit the training data well.
- Hatched bars (Out-Of-Domain) are consistently higher.
Note that Direct Prompting (purple) has massive error rates, often extending off the chart. In contrast, iterative methods like LLM-SR (blue) and LaSR (green) maintain much better stability, particularly in Physics and Material Science. This suggests that while LLMs struggle to find the exact symbolic law, the iterative search processes effectively find models that generalize decently well.
Symbolic Accuracy Predicts Generalization
Is it worth obsessing over the exact symbolic formula? Yes.

Figure 6 shows a clear correlation.
- Left (a): Higher Symbolic Accuracy correlates with higher OOD Accuracy (up and to the right is good).
- Right (b): Higher Symbolic Accuracy correlates with lower OOD Error (down and to the right is good).
This validates the premise of Symbolic Regression: if you find the true governing law, your predictions will hold up anywhere. If you just curve-fit, you might fail when extrapolating.
Qualitative Analysis: What do the models actually output?
Let’s look at a Biology example: Population Growth.

Figure 14 displays the outputs for a population growth problem.
- Ground Truth: A specific logistic growth model.
- Direct Prompting: Guesses a generic polynomial structure. It’s vague.
- LaSR: Produces a highly complex, convoluted expression with sine waves and logs that looks like “math salad.”
- LLM-SR: Successfully identifies the logistic growth structure (\(P(1-P)\)) and attempts to add a periodic factor, getting much closer to the truth.
This visualizes the difficulty: LLMs often “hallucinate” complex mathematical terms to try and fit the data, rather than finding the elegant, simple underlying law.
Similar struggles are seen in Chemistry (Figure 15) and Physics (Figure 17), where models sometimes mix correct physical intuitions with incorrect mathematical operators.


Conclusion: The Road Ahead
LLM-SRBench serves as a reality check for the field of AI for Science. It demonstrates that while LLMs are powerful tools, they are not yet autonomous scientists. Their high performance on previous benchmarks was largely an illusion of memorization.
When faced with transformed equations (LSR-Transform) or novel synthetic scenarios (LSR-Synth), current state-of-the-art methods struggle to exceed roughly 30% symbolic accuracy.
However, the paper also highlights the path forward. Methods that combine the semantic knowledge of LLMs with rigorous, data-driven evolutionary search (like LLM-SR and LaSR) significantly outperform raw LLMs. The future of scientific discovery likely lies in these neuro-symbolic hybrids—systems that can “read” the literature to generate hypotheses, but use hard mathematical optimization to verify and refine them against reality.
LLM-SRBench provides the rigorous testing ground needed to build these next-generation AI scientists. It ensures that when an AI finally claims to have discovered a new law of physics, it’s because it actually understood the data, not because it remembered the answer key.