In the world of neuroscience and psychology, there is a constant tension between prediction and understanding.
If you want to simply predict what a human or animal will do next, you might train a massive Recurrent Neural Network (RNN) on their behavioral data. The RNN will likely achieve high accuracy, but it acts as a “black box.” It gives you the answer, but it doesn’t tell you how the brain solved the problem. It doesn’t offer a scientific theory.
On the other hand, if you want to understand the brain, you build symbolic cognitive models. These are compact, interpretable equations (like simple code snippets) that describe mechanisms like “learning rates” or “forgetting curves.” These models are easy for scientists to read, but they are notoriously difficult to design. They require human intuition, trial-and-error, and often fail to capture the messy nuances of real-world behavior.
But what if we could have the best of both worlds? What if we could use Artificial Intelligence to automatically write the scientific theories for us?
In a fascinating new paper, Discovering Symbolic Cognitive Models from Human and Animal Behavior, researchers from Google DeepMind and various academic institutions introduce CogFunSearch. This system uses Large Language Models (LLMs) to function as an automated scientist, writing, testing, and evolving Python code that explains how humans, rats, and fruit flies learn.
This post dives into how CogFunSearch works, why it outperforms human-designed theories, and what it reveals about the hidden algorithms of the mind.
The Scientist’s Dilemma: Theory-First vs. Data-First
To understand why this research is significant, we need to look at how cognitive modeling is usually done.
For decades, the standard approach has been theory-first. A scientist reads the literature, has a “Eureka!” moment, and writes down an equation (e.g., “I think the brain updates value based on a prediction error multiplied by a learning rate”). They then test this model against data. The limitation here is human imagination; the space of possible mathematical models is infinite, and we tend to stick to the ones we already know, like Q-learning.
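That textbook hypothesis can be written down in a couple of lines. The snippet below is a generic sketch of the classic delta-rule (Rescorla-Wagner / Q-learning style) update, not a model taken from the paper:

```python
def delta_rule_update(value, reward, learning_rate=0.1):
    """Textbook hypothesis: nudge the value toward the reward by a fixed fraction."""
    prediction_error = reward - value      # how surprising was this outcome?
    return value + learning_rate * prediction_error
```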
Recently, the field has dabbled in data-driven approaches using neural networks. While these networks are flexible and fit data beautifully, they are composed of millions of uninterpretable weights. Knowing that an RNN fits the data doesn’t tell you why the subject made a choice.
CogFunSearch attempts to bridge this gap by searching the space of interpretable computer programs.
The Core Method: Evolution Meets LLMs
CogFunSearch builds upon a technique called FunSearch (short for "searching in the function space"), which pairs a pre-trained LLM with an evolutionary algorithm. The goal is not just to output text, but to output executable Python code that solves a specific scientific problem.
The Architecture of Discovery
The process is a loop of creation and evaluation. It works like an automated laboratory:
- The LLM as a Generator: The system starts with a prompt containing a “seed program”—a basic skeleton of a cognitive model. The LLM acts as a mutation engine. It takes the code and suggests changes, additions, or rewrites, effectively proposing a new hypothesis about how the brain works.
- The Evaluator (The Critic): The generated code is then run against real behavioral datasets. If the code crashes or produces nonsense, it is discarded. If it runs, it is scored based on how well it predicts the animal’s behavior.
- Evolutionary Selection: The best-performing programs are saved to a database and fed back into the LLM as prompts for the next generation. Over thousands of iterations, the programs "evolve" into better and better theories of the behavior (a minimal code sketch of this loop follows below).
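Conceptually, the loop looks something like the sketch below. Every name here (llm_mutate, fit_and_score, the prompt size, the iteration count) is an illustrative placeholder rather than the paper's actual implementation:

```python
def evolve_programs(seed_program, data, llm_mutate, fit_and_score, n_iterations=1000):
    """Minimal sketch of a FunSearch-style generate-evaluate-select loop.

    llm_mutate(programs) -> new candidate program (a string of Python code)
    fit_and_score(program, data) -> predictive score (raises on broken programs)
    """
    database = [(fit_and_score(seed_program, data), seed_program)]

    for _ in range(n_iterations):
        # Show the LLM a few of the best programs found so far as its prompt.
        top_programs = [prog for _, prog in sorted(database, key=lambda x: -x[0])[:2]]

        # The LLM proposes a mutated or rewritten program: a new hypothesis.
        candidate = llm_mutate(top_programs)

        try:
            score = fit_and_score(candidate, data)
        except Exception:
            continue  # Programs that crash or produce nonsense are discarded.

        database.append((score, candidate))  # Survivors seed the next generation.

    return max(database, key=lambda x: x[0])  # Best (score, program) pair found.
```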

The Bilevel Optimization Problem
What makes this specific application tricky is that a cognitive model isn’t just code logic; it also contains parameters.
For example, a model might say: “Update the value by alpha times the reward.”
- The Logic (the code structure) is what the LLM writes.
- The Parameter (\(\alpha\)) is a number that varies from individual to individual (e.g., one rat might learn faster than another).
CogFunSearch solves this via Bilevel Optimization:
- Outer Loop (LLM): Searches for the best structure (the Python code logic).
- Inner Loop (Gradient Descent): For every program the LLM writes, the system uses gradient descent to find the best specific parameters (\(\theta\)) for the specific animal being tested.
This ensures the LLM doesn’t have to guess the exact learning rate for Rat #42. It just needs to write the formula for learning, and the inner loop calculates the optimal rate for that specific rat.
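The inner loop can be as simple as gradient descent on the negative log-likelihood of a subject's choices. The sketch below fits a learning rate and a softmax inverse temperature for one subject, with a plain delta rule standing in for whatever update rule the LLM-written program defines (PyTorch is an assumption here; the paper's fitting machinery may differ):

```python
import torch
import torch.nn.functional as F

def fit_subject(choices, rewards, n_actions=2, n_steps=300, lr=0.05):
    """Inner loop: fit one subject's parameters (learning rate, inverse temperature)
    by gradient descent on the negative log-likelihood of their observed choices."""
    raw_alpha = torch.zeros(1, requires_grad=True)   # unconstrained parameters...
    raw_beta = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.Adam([raw_alpha, raw_beta], lr=lr)

    for _ in range(n_steps):
        alpha = torch.sigmoid(raw_alpha)             # ...mapped into (0, 1)
        beta = F.softplus(raw_beta)                  # ...and into (0, inf)
        values = torch.zeros(n_actions)
        nll = torch.tensor(0.0)
        for choice, reward in zip(choices, rewards):
            log_probs = F.log_softmax(beta * values, dim=0)
            nll = nll - log_probs[choice]
            # The structure of this update is what the outer (LLM) loop searches over;
            # a plain delta rule stands in for it here.
            one_hot = F.one_hot(torch.tensor(choice), n_actions).float()
            values = values + one_hot * alpha * (reward - values[choice])
        optimizer.zero_grad()
        nll.backward()
        optimizer.step()

    return torch.sigmoid(raw_alpha).item(), F.softplus(raw_beta).item()
```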
Rigorous Evaluation
To ensure the AI wasn’t just memorizing the data (overfitting), the researchers used a strict cross-validation scheme.

As shown in Figure 8 (above), the data is split multiple ways. First, subjects are split into Training and Test groups. Within a subject, their sessions are split into “Even” and “Odd.” The model parameters are fitted on the Even sessions and scored on the Odd sessions (and vice versa). This guarantees that a high score represents a genuine predictive understanding of the behavior.
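In code, the scheme amounts to a nested two-fold split. In the sketch below, fit_params and score are placeholder callables standing in for the inner parameter-fitting loop and the log-likelihood scoring described above:

```python
def evaluate_on_test_subjects(program, test_subjects, fit_params, score):
    """For each held-out subject, fit parameters on even-indexed sessions and
    score on odd-indexed sessions, then swap the roles and average."""
    subject_scores = []
    for sessions in test_subjects:
        even, odd = sessions[0::2], sessions[1::2]
        folds = [(even, odd), (odd, even)]
        fold_scores = [score(program, fit_params(program, fit_set), eval_set)
                       for fit_set, eval_set in folds]
        subject_scores.append(sum(fold_scores) / len(fold_scores))
    return sum(subject_scores) / len(subject_scores)
```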
The Experiments: From Flies to Humans
The team tested CogFunSearch on three distinct datasets, covering a spectrum of biological complexity:
- Humans: 862 participants playing a 4-armed bandit task (choosing between 4 options to find the best reward).
- Rats: 20 rats performing a 2-armed bandit task with binary rewards.
- Fruit Flies: 347 flies navigating a Y-maze with odor-based rewards.

These datasets were chosen because they are large and, crucially, they have already been heavily studied. Top scientists have already spent years hand-crafting models for these specific datasets. This provided a “Human Expert Baseline” to beat.
Results: Beating the Human Experts
The primary question was: Can the AI write a better scientific theory than a human expert?
The answer was a resounding yes.
Across all three species, the best programs discovered by CogFunSearch outperformed the state-of-the-art models from neuroscience literature.
- Humans: The AI beat the “Perseverative Forgetting Q-Learning” model (Eckstein et al., 2024).
- Rats: The AI beat the “Reward-Seeking/Habit/Gambler-Fallacy” model (Miller et al., 2021).
- Flies: The AI beat the “Differential Forgetting Q-Learning” model (Ito & Doya, 2009).

In the figure above, any point above the zero line represents an individual subject that was better predicted by the AI model than the human-designed model. For the Rat dataset, the AI model provided a better fit for 100% of the test subjects.
Closing the Gap with Neural Networks
The researchers also compared their interpretable symbolic models against "black box" RNNs (GRU models). RNNs usually predict behavior far more accurately than symbolic models because of their flexibility.
However, as shown below, the CogFunSearch programs (blue crosses) managed to close the majority of the gap between the baseline models and the RNNs. While the RNNs (orange crosses) still hold a slight edge in raw prediction power, the AI-discovered programs are competitive while remaining readable Python code.

Interpreting the “Alien” Science
The true value of this method isn’t just getting a higher score; it’s reading the code to see how the AI solved the problem. The discovered programs offered several intriguing insights into cognition.
1. Novel Mechanisms
The best program for the Human dataset introduced a mechanism where learned values decay toward their mean (average), rather than decaying to zero or remaining static. It also tracked complex choice histories—variables like “trials since last switch”—suggesting that humans rely heavily on patterns of repetition and alternation, distinct from simple reward tracking.
For the Rat dataset, the AI rediscovered the “Habit” and “Gambler’s Fallacy” terms found by human scientists but refined the learning rules. It implemented a “cross-learning” mechanism where updating the value of the chosen option also nudges the value of the unchosen option—a subtle dependency often missed in standard models.
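In code, both mechanisms are tiny. The snippet below is an illustrative paraphrase under assumed parameter names (decay, alpha, kappa), not the discovered programs themselves, and the sign and size of the cross-learning coupling are placeholders:

```python
import numpy as np

def decay_toward_mean(values, decay=0.1):
    """Human dataset: values drift toward their own mean, not toward zero."""
    return values + decay * (values.mean() - values)

def coupled_update(values, chosen, reward, alpha=0.3, kappa=0.1):
    """Rat dataset (two-armed case): updating the chosen arm also nudges the
    unchosen arm via the same prediction error."""
    values = values.copy()
    error = reward - values[chosen]
    values[chosen] += alpha * error
    values[1 - chosen] += kappa * error   # placeholder sign/size of the coupling
    return values
```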
2. Generative Validity
A good model should simulate behavior that looks like real life. The researchers used the AI-discovered programs to generate “synthetic subjects” and analyzed their behavioral patterns (like how likely they are to repeat a choice after a reward).

The match between the real data (Red) and the AI’s synthetic data (Blue) in Figure 6 confirms that the programs aren’t just fitting statistical noise—they are capturing the actual dynamics of decision-making.
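Producing those synthetic subjects is conceptually simple: run a fitted program forward and let it sample its own choices. Below is a generic sketch, with update_fn and params standing in for the discovered program and its fitted parameters:

```python
import numpy as np

def simulate_subject(update_fn, params, reward_probs, n_trials=500, seed=0):
    """Roll a fitted cognitive program forward on a bandit task, sampling choices
    from its own predicted probabilities, to generate synthetic behavior."""
    rng = np.random.default_rng(seed)
    n_actions = len(reward_probs)
    values = np.zeros(n_actions)
    choices, rewards = [], []
    for _ in range(n_trials):
        logits = values - values.max()               # numerically stable softmax
        probs = np.exp(logits) / np.exp(logits).sum()
        choice = int(rng.choice(n_actions, p=probs))
        reward = float(rng.random() < reward_probs[choice])
        values = update_fn(values, choice, reward, params)
        choices.append(choice)
        rewards.append(reward)
    return np.array(choices), np.array(rewards)
```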
3. The Complexity Trade-off
One of the most powerful features of CogFunSearch is that it generates a library of programs, not just one. This allows scientists to analyze the trade-off between complexity and accuracy.
We can plot the discovered programs to find the “Pareto frontier”—the set of programs that offer the best accuracy for a given level of code complexity.

In the plot above, researchers can choose the model that suits their purpose. Do they want the "highest scoring program" (Blue X), which may be harder to read? Or do they prefer "Example Program 3" (Orange Cross), which is much simpler but still outperforms the baseline? This choice empowers scientists to select the level of interpretability they need.
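Extracting that frontier from the library of scored programs takes only a few lines; "complexity" here could be any proxy the analyst chooses, such as line count or AST node count:

```python
def pareto_frontier(programs):
    """programs: iterable of (complexity, accuracy, program) tuples.
    Keeps every program that no other program beats on both axes at once."""
    frontier = []
    # Sort by increasing complexity, breaking ties by decreasing accuracy.
    for complexity, accuracy, program in sorted(programs, key=lambda t: (t[0], -t[1])):
        if not frontier or accuracy > frontier[-1][1]:
            frontier.append((complexity, accuracy, program))
    return frontier
```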
Conclusion: A New Tool for Discovery
CogFunSearch represents a shift in how we approach scientific discovery. Rather than relying solely on human creativity to generate hypotheses, we can use LLMs to perform a broad, unbiased search of the “hypothesis space.”
The programs discovered here are not black boxes. They are readable, variable-tracking, logic-based algorithms. They suggest that human and animal learning involves subtle mechanisms—like decay-to-mean and complex exploration strategies—that standard textbook models overlook.
While there is still a small gap between these symbolic models and the raw power of neural networks, the gap is narrowing. We are moving toward a future where AI doesn’t just predict the world for us, but helps us explain it, writing the equations of the mind one line of Python at a time.