Modern AI models for computer vision have become astonishingly good at recognizing objects, segmenting scenes, and even generating photorealistic images. What’s truly fascinating is that their internal workings—the complex patterns of artificial neuron activations—often bear a striking resemblance to neural activity in the human brain when viewing the same stimuli. This is not just coincidence; it’s a clue about the deep principles of information processing.
For years, scientists have observed this brain–AI similarity, but the reason why has remained elusive. Is the resemblance driven by the model’s architecture, the sheer amount of training data, or the type of data it sees? Previous studies often examined pre-trained models where all these factors varied together, making it impossible to isolate their effects.
A recent study from researchers at Meta AI and ENS-PSL tackles this problem head-on. By systematically controlling model size, training duration, and image type in a family of vision transformers, they reveal the causal ingredients behind an AI’s ability to “see” the world like a human.
Comparing AI to Brains: The Encoding Approach
Before exploring their experiments, we need to understand how you can even compare a silicon-based neural network to a biological brain.
The researchers used a well-established method called encoding analysis. The central question: Is there a reliable mapping from the AI’s internal representations to brain activity patterns?
Imagine showing an image of a cat to both a vision transformer and a person. The model produces a high-dimensional activation vector \(X\), and the person’s brain produces a complex pattern of neural activity \(Y\), measured with fMRI or MEG.
The encoding model seeks a simple linear transformation \(W\) that predicts \(Y\) from \(X\): \(Y \approx XW\), typically fit with regularized (ridge) regression.
If such a transformation can predict brain activity well, it means the AI’s internal representations contain similar information to the brain’s, albeit in a different “format.” The quality of this prediction—measured with Pearson correlation \(R\)—is our brain-similarity score.
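To make this concrete, here is a minimal encoding-analysis sketch in Python. The variable names, ridge penalties, and the single train/test split are illustrative assumptions rather than the study’s exact pipeline; `X` stands for a stimuli-by-features matrix of model activations and `Y` for a stimuli-by-voxels (or sensors) matrix of brain responses.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def encoding_score(X, Y, alphas=(0.1, 1.0, 10.0, 100.0), seed=0):
    """Fit a linear map from model features X to brain responses Y and
    return the Pearson correlation per voxel/sensor on held-out stimuli."""
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        X, Y, test_size=0.2, random_state=seed)

    # Ridge regression handles the high dimensionality of model features;
    # one linear readout is fit jointly for all brain voxels.
    reg = RidgeCV(alphas=alphas)
    reg.fit(X_tr, Y_tr)
    Y_hat = reg.predict(X_te)

    # Pearson R between measured and predicted activity, voxel by voxel.
    Yc = Y_te - Y_te.mean(axis=0)
    Pc = Y_hat - Y_hat.mean(axis=0)
    r = (Yc * Pc).sum(axis=0) / (
        np.linalg.norm(Yc, axis=0) * np.linalg.norm(Pc, axis=0) + 1e-12)
    return r  # shape: (n_voxels,)

# Synthetic stand-in data: 200 "images", 384 model features, 50 voxels.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 384))
Y = X @ rng.standard_normal((384, 50)) + 0.5 * rng.standard_normal((200, 50))
print(f"mean encoding score: {encoding_score(X, Y).mean():.2f}")
```

In practice such analyses use cross-validation across stimuli and per-voxel regularization; the sketch keeps only the core idea of a linear readout from features to brain activity, scored with Pearson \(R\).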
To gain both spatial and temporal insight, the researchers combined:
- Functional Magnetic Resonance Imaging (fMRI) — high spatial resolution: where activity happens.
- Magnetoencephalography (MEG) — high temporal resolution: when activity happens.
This dual approach lets them ask not just whether AI and brain representations are similar, but also whether their hierarchical organization in space and time is aligned.
A Systematic Experiment Design
Their experimental backbone was the DINOv3 family—a state-of-the-art self-supervised vision transformer—trained in systematically varied configurations:
Factors they manipulated:
- Model Size: From Small (21M parameters) to Giant (1.1B parameters), all trained on the same human-centric dataset to isolate scale effects.
- Training Amount: By saving checkpoints throughout training, they assessed how brain similarity evolved from an untrained network to a fully trained one.
- Image Type: Three Large-model variants trained on 10M images each:
  - Human-centric: Everyday photos of people, places, and objects.
  - Satellite: High-resolution aerial imagery.
  - Cellular: Microscopy images of cells.
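Comparing any of these models to the brain starts with turning images into per-layer activation vectors. Below is a minimal sketch of that step using PyTorch forward hooks; it uses torchvision’s `vit_b_16` as a stand-in backbone (an assumption, since the exact DINOv3 checkpoint loading is not covered here), but the same hook pattern applies to any ViT-style model.

```python
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Stand-in ViT backbone; for the actual study you would load DINOv3
# checkpoints instead, but the hook pattern is identical.
weights = ViT_B_16_Weights.DEFAULT
model = vit_b_16(weights=weights).eval()
preprocess = weights.transforms()  # apply this to real PIL images

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        # Average over tokens: one feature vector per image for this layer.
        activations[name] = output.mean(dim=1).detach()
    return hook

# One hook per transformer block, so every depth can be compared to the brain.
for name, block in model.encoder.layers.named_children():
    block.register_forward_hook(save_activation(name))

batch = torch.randn(4, 3, 224, 224)  # dummy batch standing in for real images
with torch.no_grad():
    model(batch)

for name, feats in activations.items():
    print(name, tuple(feats.shape))  # e.g. encoder_layer_0 (4, 768)
```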
Three similarity metrics:
- Encoding Score: Overall representational similarity across the brain.
- Spatial Score: Alignment between model’s layer hierarchy and brain’s spatial hierarchy (e.g., early layers match early visual cortex).
- Temporal Score: Alignment between model’s layer hierarchy and brain’s temporal processing (e.g., early layers match early MEG responses).
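The paper defines the spatial and temporal scores precisely; as a simplified proxy (my own shorthand, not the authors’ exact formula), the idea can be captured by rank-correlating the depth of each region’s best-predicting layer with that region’s position in the cortical (or temporal) hierarchy:

```python
import numpy as np
from scipy.stats import spearmanr

def hierarchy_alignment(scores, region_ranks):
    """Simplified spatial/temporal alignment score.

    scores: array (n_layers, n_regions) of encoding R values,
            one row per model layer, one column per brain region
            (or per MEG time window for the temporal variant).
    region_ranks: array (n_regions,) giving each region's position in the
            hierarchy (e.g. V1 = 0 ... prefrontal = K), or the time-window
            index for the temporal score.
    """
    best_layer = scores.argmax(axis=0)   # preferred model layer per region
    rho, _ = spearmanr(best_layer, region_ranks)
    return rho  # ~ +1: early layers -> early regions, deep layers -> late regions

# Toy example: 12 layers, 6 regions ordered from V1 to prefrontal cortex.
rng = np.random.default_rng(0)
scores = rng.random((12, 6))
scores[np.array([0, 2, 4, 7, 9, 11]), np.arange(6)] += 1.0  # plant a hierarchy
print(hierarchy_alignment(scores, np.arange(6)))  # close to 1.0
```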
Finding 1: A Well-Trained AI Learns a Brain-like Hierarchy
A large, fully trained DINOv3 showed strong overall brain similarity.
fMRI results: The model’s features could predict activity across the visual pathway—from early visual cortex to higher-order regions in the prefrontal cortex.
MEG results: Similarity emerged within ~70 ms of seeing an image and persisted for seconds.
The organization of similarity was hierarchical:
- Spatial Score: Early layers predicted early visual regions (V1), deeper layers predicted associative & prefrontal cortices.
- Temporal Score: Early layers matched rapid MEG responses; deeper layers matched later, sustained responses.
This shows that modern vision transformers don’t just learn a jumble of features—they learn processing hierarchies that mirror the brain’s flow of visual information.
Finding 2: Brain-Likeness Emerges in a Developmental Sequence
Checkpoint analysis revealed that brain-like organization emerges progressively—not all at once.
Untrained models showed minimal similarity. As training proceeded:
- Representations matching early visual cortex appeared first.
- Representations matching high-level prefrontal regions appeared much later.
The “half-time” metric—training needed to reach 50% of final similarity—made this even clearer:
- Early regions like V1 had short half-times.
- Prefrontal regions had the longest half-times.
- Early MEG time windows matched quickly; later time windows took far longer.
This suggests the model learns low-level sensory statistics early, and only with massive training does it acquire high-level abstract representations.
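For readers who want to reproduce the idea, here is a back-of-the-envelope half-time computation, assuming a per-region curve of encoding score versus amount of training; the interpolation details are my own, not taken from the paper.

```python
import numpy as np

def half_time(steps, similarity):
    """Training step at which similarity first reaches 50% of its final value.

    steps: 1-D array of training checkpoints (e.g. images seen), increasing.
    similarity: encoding score at each checkpoint for one brain region.
    """
    target = 0.5 * similarity[-1]
    above = np.nonzero(similarity >= target)[0]
    if len(above) == 0:
        return np.nan
    i = above[0]
    if i == 0:
        return steps[0]
    # Linear interpolation between the two checkpoints bracketing the target.
    s0, s1 = similarity[i - 1], similarity[i]
    t0, t1 = steps[i - 1], steps[i]
    return t0 + (target - s0) / (s1 - s0) * (t1 - t0)

# Toy curves: an "early visual" region saturates quickly, a "prefrontal" one slowly.
steps = np.logspace(3, 8, 30)                    # 1e3 .. 1e8 images seen
early = 0.3 * (1 - np.exp(-steps / 1e4))
late  = 0.3 * (1 - np.exp(-steps / 1e7))
print(half_time(steps, early), half_time(steps, late))
```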
Finding 3: Size, Experience, and Data Type All Matter
Model Size
Bigger models achieved higher scores across all metrics.
The largest gains occurred in predicting high-level brain regions.
Image Type
Models trained on human-centric images achieved significantly higher scores than those trained on satellite or cellular images—across all regions.
This supports an empiricist view: to build systems that see like humans, you must feed them a visual diet similar to human experience.
Finding 4: AI’s Learning Mirrors Brain Physiology
The researchers then correlated a brain region’s half-time with its physical & developmental properties:
- Cortical Expansion: Regions growing most from infancy to adulthood were learned last by the AI.
- Cortical Thickness: Thicker regions had longer half-times.
- Intrinsic Timescales: Regions integrating information over longer periods were learned later.
- Myelination: Less-myelinated (slower) regions were learned later.
The AI’s developmental sequence—from fast, simple sensory maps to slow, complex associative maps—mirrors the biological hierarchy shaped by both evolution and individual development.
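Concretely, each of these comparisons reduces to an across-region correlation: one half-time per region on one side, one anatomical value per region on the other. A minimal sketch with synthetic placeholder values (the real analysis uses published cortical maps, not simulated ones):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_regions = 180  # e.g. one value per cortical parcel

# Synthetic placeholders; in the real analysis these would be the measured
# half-times and an anatomical map (expansion, thickness, timescale, myelin).
half_times = rng.lognormal(mean=10.0, sigma=1.0, size=n_regions)
cortical_property = np.log(half_times) + rng.standard_normal(n_regions)

rho, p = spearmanr(half_times, cortical_property)
print(f"Spearman rho = {rho:.2f}, p = {p:.1e}")
```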
Conclusion: Toward AI as a Tool for Neuroscience
Key takeaways:
- All factors matter: Architecture (bigger models), training duration, and ecologically relevant data all contribute to brain similarity.
- Staged development: Models learn early sensory maps first, high-level abstract maps much later—with immense data needs.
- Biological mirroring: AI’s training mirrors human cortical development—regions hardest for AI to master are those that develop slowest in humans.
By building and probing AI under controlled conditions, we can move from observing similarities to understanding their causes. This opens the possibility of using AI models as computational proxies to study how biological brains develop—and perhaps, how to shape that development under different conditions.
In showing how machines can come to see like us, this work offers insight into how we come to see the world.