The Limits of Scale: Why Bigger AI Models Aren’t Necessarily Better Brains
In the current era of Artificial Intelligence, there is a pervasive mantra: scale is all you need. From Large Language Models (LLMs) like GPT-4 to massive vision transformers, the recipe for success has largely been to increase the number of parameters, feed the model more data, and throw more compute at the training process. This “brute force” approach has yielded unprecedented performance on tasks ranging from coding to generating photorealistic images.
For computational neuroscientists, this raises a fascinating question. We have long used Artificial Neural Networks (ANNs) as proxy models for the primate visual system. If scaling up models makes them better at computer vision tasks, does it also make them better models of the biological brain?
A recent paper titled “Scaling Laws for Task-Optimized Models of the Primate Visual Ventral Stream” by Abdulkadir Gokce and Martin Schrimpf takes a rigorous, data-driven look at this question. By systematically training over 600 models, they explore whether the “scaling laws” that drive AI progress also apply to brain alignment.
The results are surprising. While scaling helps models mimic human behavior, it seems we are hitting a hard ceiling on how well these massive models actually replicate the internal neural mechanisms of the brain.
The Problem: Do “Smarter” Models Look More Like Brains?
The visual ventral stream is the pathway in the primate brain responsible for object recognition—often called the “What” pathway. It processes information hierarchically, starting from the Primary Visual Cortex (V1), moving through V2 and V4, and finally to the Inferior Temporal (IT) cortex, where complex object representations are formed.
For the past decade, Deep Convolutional Neural Networks (CNNs) trained on object classification (like ImageNet) have been our best predictive models for this biological system. The assumption has been that as we make these networks better at seeing (by making them deeper or larger), they naturally converge towards the biological solution found by evolution.
However, the relationship between engineering performance (accuracy on a dataset) and biological fidelity (brain alignment) is murky. This paper attempts to clarify that relationship by establishing formal “scaling laws.”

As shown in Figure 1, the researchers set out to determine how alignment scores change as we increase the compute budget (\(C\)). The summary in panel (b) hints at the paper’s main twist: neural alignment and behavioral alignment follow very different trajectories.
Background: Scaling Laws and Brain-Score
Before diving into the experiments, we need to understand two key concepts: Scaling Laws and Brain-Score.
1. The Power of Scaling Laws
In machine learning, “scaling laws” refer to the observation that model performance (usually loss) improves predictably as a power-law function of compute, data size, or parameter count. If you plot the test loss against the training compute on a log-log scale, you get a straight line. This predictability allows researchers to estimate how much “smarter” a model will get if they double their budget.
The authors of this paper fit power-law functions to brain alignment scores (\(S\)), or equivalently to misalignment scores (\(L = 1 - S\)). The general form of the equation is:

\[ L(X) = E + A \cdot X^{-\alpha} \]

Here, \(L\) is the misalignment, \(E\) is the irreducible error (the lowest misalignment any model can reach), \(A\) is a constant, \(X\) is the scaling factor (such as data size), and \(\alpha\) is the scaling exponent. A higher \(\alpha\) means the model improves faster as you scale up.
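To make the fitting procedure concrete, here is a minimal sketch of fitting this saturating power law with SciPy. The compute values and misalignment scores below are made up for illustration; they are not the paper’s measurements.

```python
# A minimal sketch of fitting the saturating power law L(X) = E + A * X**(-alpha).
# The data points are invented illustrative numbers, not the paper's results.
import numpy as np
from scipy.optimize import curve_fit

def misalignment(X, E, A, alpha):
    """Saturating power law: irreducible error E plus a decaying term."""
    return E + A * X ** (-alpha)

# Hypothetical compute budgets and misalignment scores L = 1 - S.
X = np.array([1e0, 1e1, 1e2, 1e3, 1e4])
L = np.array([0.60, 0.45, 0.36, 0.31, 0.29])

(E, A, alpha), _ = curve_fit(misalignment, X, L, p0=[0.2, 0.5, 0.3])
print(f"E={E:.3f}, A={A:.3f}, alpha={alpha:.3f}")
```

The fitted \(E\) is the key quantity later in the paper: it tells you where a curve will flatten out no matter how much more compute you spend.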
2. Measuring Success with Brain-Score
To measure “brain alignment,” the authors used the Brain-Score benchmark. This is a standardized suite of metrics that compares ANNs to biological data (a toy sketch of both metrics follows this list):
- Neural Alignment: Compares the internal activations of the artificial network layers to single-unit recordings from macaques in regions V1, V2, V4, and IT.
- Behavioral Alignment: Compares the model’s “confusion matrix” to human behavior. For example, if humans frequently mistake a dog for a cat but rarely for a car, a good model should make similar mistakes.
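For intuition, here is a toy, self-contained sketch of what these two flavors of alignment look like in code. It is not the official Brain-Score implementation; the synthetic data, the array shapes, and the choice of ridge regression plus Pearson correlation are simplifying assumptions.

```python
# Toy illustration of "neural" vs. "behavioral" alignment, NOT the official
# Brain-Score implementation. All data here is synthetic.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# --- Neural alignment: predict (synthetic) neural recordings from model features ---
model_feats = rng.normal(size=(200, 512))                   # 200 images x 512 model units
mixing = rng.normal(size=(512, 50))
neural_resp = model_feats @ mixing + 5.0 * rng.normal(size=(200, 50))  # 50 "neurons"
train, test = slice(0, 150), slice(150, 200)

reg = Ridge(alpha=1.0).fit(model_feats[train], neural_resp[train])
pred = reg.predict(model_feats[test])
neural_score = np.mean([pearsonr(pred[:, i], neural_resp[test, i])[0]
                        for i in range(neural_resp.shape[1])])

# --- Behavioral alignment: compare error patterns (flattened confusion matrices) ---
human_confusions = rng.random((10, 10))                     # toy 10-way confusion matrices
model_confusions = human_confusions + 0.3 * rng.random((10, 10))
behav_score = pearsonr(human_confusions.ravel(), model_confusions.ravel())[0]

print(f"toy neural alignment: {neural_score:.2f}, toy behavioral alignment: {behav_score:.2f}")
```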
The Experiment: A Systematic “Model Factory”
Previous studies on this topic often grabbed off-the-shelf models (like a pretrained ResNet50) and compared them. The problem with that approach is that those models were trained with different recipes, augmentations, and datasets, making it impossible to isolate the variable of “scale.”
Gokce and Schrimpf took a different approach: Controlled Training.
They trained over 600 models from scratch. This included:
- Architectures: ResNets, EfficientNets, Vision Transformers (ViTs), ConvNeXts, and CORnet-S.
- Datasets: ImageNet (standard object recognition) and EcoSet (ecologically valid categories).
- Data Regimes: They varied the number of samples per category from just 1 image all the way up to the full dataset (thousands of images).
This massive undertaking allowed them to disentangle the effects of model size (parameters) from data size (samples).
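To give a sense of what “controlled training” means in practice, the sketch below enumerates a hypothetical sweep over architectures, datasets, and samples per category. The names and sample counts are placeholders, not the authors’ exact configurations.

```python
# A hypothetical training grid in the spirit of the paper's controlled sweep.
# Architecture names and sample counts are illustrative placeholders.
from itertools import product

architectures = ["resnet18", "resnet50", "efficientnet_b0", "vit_s", "convnext_t", "cornet_s"]
datasets = ["imagenet", "ecoset"]
samples_per_class = [1, 10, 100, 1000, "full"]

grid = [
    {"arch": a, "dataset": d, "samples_per_class": n}
    for a, d, n in product(architectures, datasets, samples_per_class)
]
print(f"{len(grid)} training runs in this toy grid")  # 6 * 2 * 5 = 60 here
```

Because every run shares the same training recipe, any difference in alignment can be attributed to the swept variables rather than to incidental choices like augmentation or learning-rate schedules.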
Core Results
1. The Great Dissociation: Behavior vs. Neurons
The most striking finding is the divergence between behavioral and neural alignment.
When the researchers scaled up compute, the models’ behavioral alignment scores (how well they mimic human error patterns) continued to rise, following a promising power law. The curve suggests that with enough compute, we could approach near-perfect behavioral alignment.
However, neural alignment saturates.

As seen in Figure 2, regardless of the architecture (ResNet, ViT, etc.), the neural alignment score flattens out. We are hitting a wall where making the model larger or training it longer yields virtually no improvement in how well its internal representations match biological neural activity.
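The qualitative difference is easy to see by extrapolating two curves of the fitted form with different irreducible errors. All numbers below are invented purely to show the shape of the dissociation: a small \(E\) keeps climbing, a large \(E\) flattens early.

```python
# Illustrative extrapolation of two curves S(C) = 1 - (E + A * C**(-alpha)).
# Parameter values are made up to show the qualitative pattern, not fitted values.
import numpy as np

def alignment(C, E, A, alpha):
    return 1.0 - (E + A * C ** (-alpha))

C = np.logspace(0, 8, 5)                                  # compute budgets on a log grid
behavioral = alignment(C, E=0.05, A=0.6, alpha=0.25)      # keeps climbing toward ~0.95
neural     = alignment(C, E=0.40, A=0.3, alpha=0.25)      # flattens near ~0.60

for c, b, n in zip(C, behavioral, neural):
    print(f"C={c:9.0f}  behavioral={b:.2f}  neural={n:.2f}")
```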
2. The Battle of Architectures: Inductive Bias Matters
Not all neural networks are created equal. The study compared Convolutional Neural Networks (CNNs) like ResNet and EfficientNet against Transformers (ViTs) and modern hybrid architectures (ConvNeXt).
- Strong Inductive Bias: CNNs have “priors” built into their design. They assume translation invariance (a cat in the top left is processed the same way as a cat in the bottom right), which mimics the local, repeated receptive fields found in the early visual cortex.
- Weak Inductive Bias: Vision Transformers treat images more like sequences of patches. They have to “learn” how to process spatial relationships from scratch.
The results in Figure 2a and Figure 3b reveal a fascinating dynamic:
- Low Data Regime: In the early stages (low compute/data), CNNs dominate. Their built-in structure gives them a head start in brain alignment.
- High Data Regime: As data and compute increase, the “weaker” models (ViTs and ConvNeXts) catch up. With enough data, the architecture matters less; the data shapes the model to align with the brain.
This suggests that biological structure (convolutions) is highly efficient, but generalized learning machines can reach the same destination if given enough experience.
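A toy pair of misalignment curves illustrates this crossover: the strong-prior curve starts lower but decays slowly, while the weak-prior curve starts higher but has a steeper exponent. The parameters are invented for illustration only.

```python
# Toy picture of the inductive-bias crossover. All constants are invented.
import numpy as np

D = np.logspace(3, 8, 6)                        # dataset sizes
cnn_like = 0.35 + 0.20 * D ** (-0.10)           # head start, shallow exponent
vit_like = 0.35 + 0.90 * D ** (-0.22)           # worse start, steeper exponent

for d, c, v in zip(D, cnn_like, vit_like):
    leader = "CNN-like" if c < v else "ViT-like"
    print(f"D={d:10.0f}  CNN-like L={c:.3f}  ViT-like L={v:.3f}  lower: {leader}")
```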
3. Data is King (and Queen)
If you have a limited compute budget, should you build a bigger model, or should you collect more data?
The paper answers this definitively: Focus on data.

The scaling laws derived in the paper show that increasing the dataset size (\(D\)) has a much higher exponent (better return on investment) than increasing the model size (\(N\)).
In fact, the authors calculated the optimal allocation of resources. To maximize brain alignment, you should scale your dataset much faster than your model parameters.

Figure 4 visualizes this trade-off. The equation for optimal allocation derived from their data suggests that for every unit of compute increase, roughly 70% should go to data scaling and only 30% to model size scaling. This challenges the current trend in some AI sectors that focus heavily on trillion-parameter models.
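As a back-of-the-envelope sketch, assuming compute scales roughly as the product of parameters and samples seen, the 30/70 split described above can be written as \(N \propto C^{0.3}\) and \(D \propto C^{0.7}\). The helper below is a hypothetical illustration of that rule, not the authors’ exact fit.

```python
# Back-of-the-envelope allocation rule: grow data faster than parameters.
# The exponents approximate the roughly 30/70 split described in the post.
def optimal_allocation(compute, n_exp=0.3, d_exp=0.7, ref_compute=1.0,
                       ref_params=1.0, ref_samples=1.0):
    """Scale parameters and dataset size from a reference point as compute grows."""
    growth = compute / ref_compute
    params = ref_params * growth ** n_exp
    samples = ref_samples * growth ** d_exp
    return params, samples

# Example: a 100x compute increase buys ~4x more parameters but ~25x more data.
params_mult, samples_mult = optimal_allocation(100.0)
print(f"params x{params_mult:.1f}, data x{samples_mult:.1f}")
```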
4. The Hierarchy Effect
The brain is not a monolith; it is a hierarchy. The study found that scaling benefits differ depending on which brain region you are looking at.

- V1 (Primary Visual Cortex): Scaling helps very little. Even small models align reasonably well with V1, and massive models don’t offer much improvement. This suggests V1 features are “cheap” and easy to learn.
- IT (Inferior Temporal Cortex) & Behavior: These high-level areas see the biggest boost from scaling. The complex, semantic representations needed here require the “depth” and data volume of large-scale learning.
5. Task Performance \(\neq\) Neural Alignment
Finally, the researchers tackled the assumption that “better performance equals better brain model.”

Figure 6 is perhaps the most damning for current modeling approaches. While behavioral alignment (Panel b) moves in lockstep with validation accuracy, neural alignment (Panel a) decouples. You can have a model that is superhuman at classifying ImageNet, yet its internal neural representations are no more “brain-like” than a much weaker model.
Why Does Neural Alignment Saturate?
The authors discuss several reasons why we might be hitting this ceiling:
- The Limits of Supervised Learning: Most models were trained to classify objects (Supervised Learning). The brain, however, learns largely through self-supervision (predicting the future, associating inputs).
- Note: The authors tested Self-Supervised Learning (SSL) models (SimCLR, DINO) and found they also suffered from saturation, though they were sometimes more data-efficient.
- Missing Biology: Current ANNs lack recurrence (feedback loops), spikes, and specific biological constraints. While the authors tested CORnet (a recurrent model), even it eventually saturated.
- Data Quality: ImageNet is a collection of static, curated photos. The primate visual stream evolved to process continuous, dynamic video streams of the natural world.
Conclusion and Implications
This paper serves as a vital “reality check” for the field of NeuroAI. It establishes that we cannot simply scale our way to a perfect model of the brain using current architectures and datasets.
While scaling is excellent for reproducing human behavior (the output), it yields diminishing returns for replicating neural mechanisms (the internal state).
Key Takeaways for Students:
- Behavior \(\neq\) Mechanism: A model can act like a brain without working like a brain.
- Efficiency of Priors: Architectures that look like the brain (CNNs) are efficient learners, but massive data allows unstructured models (Transformers) to catch up.
- Data over Parameters: If you want a better brain model, get more/better data, don’t just add layers.
- The Ceiling is Real: To break past the current saturation in neural alignment, we likely need a paradigm shift—perhaps toward more ecologically valid video data, embodied learning, or biologically plausible training objectives beyond simple classification.
The era of “just add more compute” might be ending for brain modeling. The next breakthrough will likely come from smarter designs, not just bigger ones.