Modern AI models for computer vision have become astonishingly good at recognizing objects, segmenting scenes, and even generating photorealistic images. What’s truly fascinating is that their internal workings—the complex patterns of artificial neuron activations—often bear a striking resemblance to neural activity in the human brain when viewing the same stimuli. This is not just coincidence; it’s a clue about the deep principles of information processing.
For years, scientists have observed this brain–AI similarity, but the reason why has remained elusive. Is the resemblance driven by the model’s architecture, the sheer amount of training data, or the type of data it sees? Previous studies often examined pre-trained models where all these factors varied together, making it impossible to isolate their effects.
A recent study from researchers at Meta AI and ENS-PSL tackles this problem head-on. By systematically controlling model size, training duration, and image type in a family of vision transformers, they reveal the causal ingredients behind an AI’s ability to “see” the world like a human.
Comparing AI to Brains: The Encoding Approach
Before exploring their experiments, we need to understand how you can even compare a silicon-based neural network to a biological brain.
The researchers used a well-established method called encoding analysis. The central question: Is there a reliable mapping from the AI’s internal representations to brain activity patterns?
Imagine showing an image of a cat to both a vision transformer and a person. The model produces a high-dimensional activation vector \(X\), and the person’s brain produces a complex pattern of neural activity \(Y\), measured with fMRI or MEG.
The encoding model seeks a simple linear transformation \(W\) that predicts \(Y\) from \(X\): \(Y \approx XW\), typically fit with regularized (ridge) regression.
If such a transformation can predict brain activity well, it means the AI’s internal representations contain similar information to the brain’s, albeit in a different “format.” The quality of this prediction—measured with Pearson correlation \(R\)—is our brain-similarity score.
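To make this concrete, here is a minimal encoding-analysis sketch in Python. The variable names, ridge penalties, and the single train/test split are illustrative assumptions rather than the study’s exact pipeline; `X` stands for a stimuli-by-features matrix of model activations and `Y` for a stimuli-by-voxels (or sensors) matrix of brain responses.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def encoding_score(X, Y, alphas=(0.1, 1.0, 10.0, 100.0), seed=0):
    """Fit a linear map from model features X to brain responses Y and
    return the Pearson correlation per voxel/sensor on held-out stimuli."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        X, Y, test_size=0.2, random_state=seed)

    # Ridge regression handles the high dimensionality of model features;
    # one linear readout is fit jointly for all brain voxels.
    reg = RidgeCV(alphas=alphas)
    reg.fit(X_tr, Y_tr)
    Y_hat = reg.predict(X_te)

    # Pearson R between measured and predicted activity, voxel by voxel.
    Yc = Y_te - Y_te.mean(axis=0)
    Pc = Y_hat - Y_hat.mean(axis=0)
    r = (Yc * Pc).sum(axis=0) / (
        np.linalg.norm(Yc, axis=0) * np.linalg.norm(Pc, axis=0) + 1e-12)
    return r  # shape: (n_voxels,)

# Synthetic stand-in data: 200 "images", 384 model features, 50 voxels.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 384))
Y = X @ rng.standard_normal((384, 50)) + 0.5 * rng.standard_normal((200, 50))
print(f"mean encoding score: {encoding_score(X, Y).mean():.2f}")
```

In practice such analyses use cross-validation across stimuli and per-voxel regularization; the sketch keeps only the core idea of a linear readout from features to brain activity, scored with Pearson \(R\).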
To gain both spatial and temporal insight, the researchers combined:
- Functional Magnetic Resonance Imaging (fMRI) — high spatial resolution: where activity happens.
- Magnetoencephalography (MEG) — high temporal resolution: when activity happens.
This dual approach lets them ask not just whether AI and brain representations are similar, but also whether their hierarchical organization in space and time is aligned.
A Systematic Experiment Design
Their experimental backbone was the DINOv3 family—a state-of-the-art self-supervised vision transformer—trained in systematically varied configurations:
Factors they manipulated:
- Model Size: From Small (21M parameters) to Giant (1.1B parameters), all trained on the same human-centric dataset to isolate scale effects.
- Training Amount: By saving checkpoints throughout training, they assessed how brain similarity evolved from an untrained network to a fully trained one.
- Image Type: Three Large-model variants trained on 10M images each:
  - Human-centric: Everyday photos of people, places, and objects.
  - Satellite: High-resolution aerial imagery.
  - Cellular: Microscopy images of cells.
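Comparing any of these models to the brain starts with turning images into per-layer activation vectors. Below is a minimal sketch of that step using PyTorch forward hooks; it uses torchvision’s `vit_b_16` as a stand-in backbone (an assumption, since the exact DINOv3 checkpoint loading is not covered here), but the same hook pattern applies to any ViT-style model.

```python
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Stand-in ViT backbone; for the actual study you would load DINOv3
# checkpoints instead, but the hook pattern is identical.
weights = ViT_B_16_Weights.DEFAULT
model = vit_b_16(weights=weights).eval()
preprocess = weights.transforms()  # apply this to real PIL images

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        # Average over tokens: one feature vector per image for this layer.
        activations[name] = output.mean(dim=1).detach()
    return hook

# One hook per transformer block, so every depth can be compared to the brain.
for name, block in model.encoder.layers.named_children():
    block.register_forward_hook(save_activation(name))

batch = torch.randn(4, 3, 224, 224)  # dummy batch standing in for real images
with torch.no_grad():
    model(batch)

for name, feats in activations.items():
    print(name, tuple(feats.shape))  # e.g. encoder_layer_0 (4, 768)
```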
Three similarity metrics:
- Encoding Score: Overall representational similarity across the brain.
- Spatial Score: Alignment between model’s layer hierarchy and brain’s spatial hierarchy (e.g., early layers match early visual cortex).
- Temporal Score: Alignment between model’s layer hierarchy and brain’s temporal processing (e.g., early layers match early MEG responses).
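The paper defines the spatial and temporal scores precisely; as a simplified proxy (my own shorthand, not the authors’ exact formula), the idea can be captured by rank-correlating the depth of each region’s best-predicting layer with that region’s position in the cortical (or temporal) hierarchy:

```python
import numpy as np
from scipy.stats import spearmanr

def hierarchy_alignment(scores, region_ranks):
    """Simplified spatial/temporal alignment score.

    scores: array (n_layers, n_regions) of encoding R values,
            one row per model layer, one column per brain region
            (or per MEG time window for the temporal variant).
    region_ranks: array (n_regions,) giving each region's position in the
            hierarchy (e.g. V1 = 0 ... prefrontal = K), or the time-window
            index for the temporal score.
    """
    best_layer = scores.argmax(axis=0)   # preferred model layer per region
    rho, _ = spearmanr(best_layer, region_ranks)
    return rho  # ~ +1: early layers -> early regions, deep layers -> late regions

# Toy example: 12 layers, 6 regions ordered from V1 to prefrontal cortex.
rng = np.random.default_rng(0)
scores = rng.random((12, 6))
scores[np.array([0, 2, 4, 7, 9, 11]), np.arange(6)] += 1.0  # plant a hierarchy
print(hierarchy_alignment(scores, np.arange(6)))  # close to 1.0
```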
Finding 1: A Well-Trained AI Learns a Brain-like Hierarchy
A large, fully trained DINOv3 showed strong overall brain similarity.
fMRI results: The model’s features could predict activity across the visual pathway—from early visual cortex to higher-order regions in the prefrontal cortex.
MEG results: Similarity emerged within ~70 ms of seeing an image and persisted for seconds.
The organization of similarity was hierarchical:
- Spatial Score: Early layers predicted early visual regions (V1), deeper layers predicted associative & prefrontal cortices.
- Temporal Score: Early layers matched rapid MEG responses; deeper layers matched later, sustained responses.
This shows that modern vision transformers don’t just learn a jumble of features—they learn processing hierarchies that mirror the brain’s flow of visual information.
Finding 2: Brain-Likeness Emerges in a Developmental Sequence
Checkpoint analysis revealed that brain-like organization emerges progressively—not all at once.
Untrained models showed minimal similarity. As training proceeded:
- Representations matching early visual cortex appeared first.
- Representations matching high-level prefrontal regions appeared much later.
The “half-time” metric—training needed to reach 50% of final similarity—made this even clearer:
- Early regions like V1 had short half-times.
- Prefrontal regions had the longest half-times.
- Early MEG time windows matched quickly; later time windows took far longer.
This suggests the model learns low-level sensory statistics early, and only with massive training does it acquire high-level abstract representations.
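For readers who want to reproduce the idea, here is a back-of-the-envelope half-time computation, assuming a per-region curve of encoding score versus amount of training; the interpolation details are my own, not taken from the paper.

```python
import numpy as np

def half_time(steps, similarity):
    """Training step at which similarity first reaches 50% of its final value.

    steps: 1-D array of training checkpoints (e.g. images seen), increasing.
    similarity: encoding score at each checkpoint for one brain region.
    """
    target = 0.5 * similarity[-1]
    above = np.nonzero(similarity >= target)[0]
    if len(above) == 0:
        return np.nan
    i = above[0]
    if i == 0:
        return steps[0]
    # Linear interpolation between the two checkpoints bracketing the target.
    s0, s1 = similarity[i - 1], similarity[i]
    t0, t1 = steps[i - 1], steps[i]
    return t0 + (target - s0) / (s1 - s0) * (t1 - t0)

# Toy curves: an "early visual" region saturates quickly, a "prefrontal" one slowly.
steps = np.logspace(3, 8, 30)                    # 1e3 .. 1e8 images seen
early = 0.3 * (1 - np.exp(-steps / 1e4))
late  = 0.3 * (1 - np.exp(-steps / 1e7))
print(half_time(steps, early), half_time(steps, late))
```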
Finding 3: Size, Experience, and Data Type All Matter
Model Size
Bigger models achieved higher scores across all metrics.
The largest gains occurred in predicting high-level brain regions.
Image Type
Models trained on human-centric images achieved significantly higher scores than those trained on satellite or cellular images—across all regions.
This supports an empiricist view: to build systems that see like humans, you must feed them a visual diet similar to human experience.
Finding 4: AI’s Learning Mirrors Brain Physiology
The researchers then correlated a brain region’s half-time with its physical & developmental properties:
- Cortical Expansion: Regions growing most from infancy to adulthood were learned last by the AI.
- Cortical Thickness: Thicker regions had longer half-times.
- Intrinsic Timescales: Regions integrating information over longer periods were learned later.
- Myelination: Less-myelinated (slower) regions were learned later.
The AI’s developmental sequence—from fast, simple sensory maps to slow, complex associative maps—mirrors the biological hierarchy shaped by both evolution and individual development.
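Concretely, each of these comparisons reduces to an across-region correlation: one half-time per region on one side, one anatomical value per region on the other. A minimal sketch with synthetic placeholder values (the real analysis uses published cortical maps, not simulated ones):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_regions = 180  # e.g. one value per cortical parcel

# Synthetic placeholders; in the real analysis these would be the measured
# half-times and an anatomical map (expansion, thickness, timescale, myelin).
half_times = rng.lognormal(mean=10.0, sigma=1.0, size=n_regions)
cortical_property = np.log(half_times) + rng.standard_normal(n_regions)

rho, p = spearmanr(half_times, cortical_property)
print(f"Spearman rho = {rho:.2f}, p = {p:.1e}")
```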
Conclusion: Toward AI as a Tool for Neuroscience
Key takeaways:
- All factors matter: Architecture (bigger models), training duration, and ecologically relevant data all contribute to brain similarity.
- Staged development: Models learn early sensory maps first, high-level abstract maps much later—with immense data needs.
- Biological mirroring: AI’s training mirrors human cortical development—regions hardest for AI to master are those that develop slowest in humans.
By building and probing AI under controlled conditions, we can move from observing similarities to understanding their causes. This opens the possibility of using AI models as computational proxies to study how biological brains develop—and perhaps, how to shape that development under different conditions.
In showing how machines can come to see like us, this work offers insight into how we come to see the world.