Artificial Eyes vs. Human Eyes: Do Foundation Models See Like Us?

In the rapidly evolving world of computer vision, we have witnessed a massive shift toward “Foundation Models.” Giants like DINOv2, OpenCLIP, and Segment Anything (SAM) are trained on vast quantities of natural images, learning to recognize objects, segment scenes, and understand visual concepts with uncanny accuracy. Most of these models are trained with self-supervision or weak, web-scale supervision rather than exhaustive manual labelling; they learn by looking at the world, much like a human infant does during development.

This parallel raises a profound scientific question: If neural networks and the human visual system (HVS) are both trained by observing the statistics of the natural world, do they evolve to “see” in the same way?

We know they share high-level similarities—both can identify a cat or a car. But what about the low-level machinery of vision? Does a foundation model perceive contrast, textures, and faint patterns the way our eyes and primary visual cortex do? Or have they found a completely alien way to process light?

A recent paper from the University of Cambridge, titled “Do computer vision foundation models learn the low-level characteristics of the human visual system?”, tackles this question head-on. By subjecting 45 distinct AI models to the same psychophysical tests used on humans, the researchers have created a fascinating map of the similarities and divergences between biological and artificial vision.

Comparison of HVS and Image Encoder testing protocols.

The Premise: Biological vs. Statistical Constraints

To understand why this matters, we have to look at why humans see the way we do. Our vision is shaped by two main forces:

  1. Biological Constraints: The optics of our eyes, the density of cones in our retina, and the neural wiring of our brain create specific “bottlenecks.” For example, we are bad at seeing details in the dark, and we can’t see contrast if a pattern is too fine or too coarse.
  2. Natural Statistics: Our brains are tuned to the environment. We are good at separating objects from backgrounds and recognizing things regardless of lighting or distance.

If computer vision models share these traits, it suggests these characteristics are necessary for efficient vision in our universe. If they differ, it suggests that human vision is largely a product of our biological limitations (like messy optics), which computers don’t have to worry about.

The Framework: Testing AI Like a Human

The researchers treated the AI models as “black boxes” or, more accurately, as digital subjects in a psychology experiment. In human psychophysics, we measure vision by showing people specific patterns—usually gratings (stripes) or noise—and asking, “Can you see this?” or “Does this look different from that?”

To apply this to AI, the authors developed a pipeline to measure the “perceived difference” between two images according to a model’s feature encoder.

The Pipeline

The process, illustrated below, mimics a standard human vision test (2-Alternative Forced Choice):

Pipeline for computing perception alignment.

  1. Stimuli Generation: They generate precise visual patterns (Test Image) and a reference background (Reference Image). These are defined in physical light units (\(cd/m^2\)) to match human experiments.
  2. Display Model: Since these models are trained on internet images (which are usually sRGB), the physical light values are converted into the sRGB color space.
  3. Feature Extraction: Both images are fed into the foundation model (e.g., DINOv2). The model outputs a feature vector—a long list of numbers representing the “meaning” of the image.
  4. Similarity Metric (\(S_{ac}\)): The researchers calculate the difference between the two feature vectors.
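
To make steps 2 and 3 concrete, here is a minimal sketch in Python, assuming a frozen PyTorch encoder (e.g. a DINOv2 backbone). The gamma-2.2 display model, the 200 \(cd/m^2\) peak luminance, and the helper names `luminance_to_srgb` and `encode_features` are illustrative choices, not the paper's exact implementation, and encoder-specific preprocessing (resizing, ImageNet normalization) is omitted:

```python
import numpy as np
import torch

def luminance_to_srgb(L, L_peak=200.0):
    """Map physical luminance (cd/m^2) to sRGB values in [0, 1].

    A crude gamma-2.2 display model with peak luminance L_peak; the
    paper's display model is likely more careful than this.
    """
    return np.clip(L / L_peak, 0.0, 1.0) ** (1.0 / 2.2)

def encode_features(model, luminance_image):
    """Render a luminance stimulus for the display and push it through a
    frozen encoder, returning a single flat feature vector."""
    srgb = luminance_to_srgb(luminance_image)
    rgb = np.repeat(srgb[..., None], 3, axis=-1)            # grey -> 3 channels
    x = torch.from_numpy(rgb).float().permute(2, 0, 1).unsqueeze(0)
    with torch.no_grad():
        features = model(x)                                  # e.g. DINOv2 CLS features
    return features.flatten().cpu().numpy()
```

The last step, the similarity metric itself, is what the next section unpacks.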

The Metric: Angular Cosine Similarity

How do you measure if a neural network thinks two images are different? You might think of using simple Euclidean distance (how far apart the numbers are). However, the researchers found that the angle between the feature vectors was a better proxy for perception.

They defined the metric \(S_{ac}\) (Angular Cosine distance) as:

Equation for Angular Cosine Similarity.

Here, \(F_T\) and \(F_R\) are the feature vectors of the test and reference images.

  • If \(S_{ac} = 0\), the vectors point in the exact same direction; the model sees the images as identical.
  • If \(S_{ac}\) is high, the model perceives a significant difference.

By varying the contrast of the test image and checking when \(S_{ac}\) starts to rise, the researchers could determine the model’s “detection threshold”—the computer equivalent of saying, “I see something!”
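
In code, that procedure might look like the sketch below. The \(2/\pi\) arccos normalization matches the interpretation above (0 for identical feature directions, larger values for bigger differences), and the fixed decision criterion is an assumption made for illustration; the paper's exact formulation and threshold rule may differ:

```python
import numpy as np

def angular_cosine_distance(f_t, f_r):
    """Angle between two feature vectors, normalised to [0, 1].

    0 -> the vectors point in the same direction (images look identical);
    larger values -> the model 'sees' a bigger difference.
    """
    cos = np.dot(f_t, f_r) / (np.linalg.norm(f_t) * np.linalg.norm(f_r))
    return (2.0 / np.pi) * np.arccos(np.clip(cos, -1.0, 1.0))

def detection_threshold(encode, make_stimulus, reference, criterion=0.01):
    """Lowest contrast at which S_ac exceeds a fixed criterion.

    `encode` maps an image to a feature vector and `make_stimulus(c)`
    renders the test pattern at contrast c; the criterion value here is
    purely illustrative.
    """
    f_ref = encode(reference)
    for c in np.logspace(-3, 0, 50):             # sweep contrast from 0.001 to 1
        if angular_cosine_distance(encode(make_stimulus(c)), f_ref) > criterion:
            return c
    return None                                   # pattern never detected
```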

The Stimuli: What Are the Models Looking At?

To probe low-level vision, you can’t use pictures of dogs and cats. You need fundamental building blocks of vision. The study used Gabor patches—sinusoidal gratings windowed by a Gaussian envelope. These are the standard tool in vision science because they are localized in both space and spatial frequency, so each one probes a single frequency at a single location.

The mathematical definition of the stimuli used is:

Equation for Gabor patch generation.

Visually, they look like fuzzy sets of stripes. The researchers varied the Spatial Frequency (how closely packed the stripes are) and the Contrast (the difference between the light and dark stripes).
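
A Gabor is straightforward to generate: a cosine grating multiplied by a Gaussian window, riding on top of a mean luminance. The sketch below assumes a luminance formulation \(L(x,y) = L_0\,(1 + c\,\cos(2\pi f x)\,G(x,y))\); the paper's exact parameterisation (orientation, phase, envelope size) may differ:

```python
import numpy as np

def gabor_patch(size_px=256, ppd=60.0, frequency_cpd=4.0, contrast=0.5,
                sigma_deg=0.5, L_mean=100.0):
    """Achromatic Gabor: a sinusoidal grating under a Gaussian envelope.

    size_px        image size in pixels
    ppd            pixels per degree of visual angle (display geometry)
    frequency_cpd  spatial frequency in cycles per degree
    contrast       Michelson contrast of the grating
    sigma_deg      std. dev. of the Gaussian envelope, in degrees
    L_mean         mean luminance in cd/m^2
    """
    deg = (np.arange(size_px) - size_px / 2) / ppd       # pixel positions in degrees
    x, y = np.meshgrid(deg, deg)
    grating = np.cos(2 * np.pi * frequency_cpd * x)      # vertical stripes
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma_deg**2))
    return L_mean * (1.0 + contrast * grating * envelope)  # luminance in cd/m^2
```

The matching reference image is just the zero-contrast case: a uniform field at the mean luminance.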

Achromatic Gabors with different spatial frequencies and contrast.

In the image above, humans can easily see the patterns at the bottom (high contrast). As you move up, the contrast fades. Interestingly, humans have a “sweet spot” in the middle frequencies (2-4 cycles per degree) where we are most sensitive. We struggle to see very low frequencies (broad, gradual gradients) or very high frequencies (tiny details).

The researchers also tested:

  • Chromatic Gabors: Red-Green and Yellow-Violet patterns to test color vision.
  • Noise: Random static patterns.
  • Masking Patterns: A target pattern hidden inside a noisy or striped background.

Key Experiment 1: Contrast Sensitivity (The CSF)

The Contrast Sensitivity Function (CSF) is essentially the transfer function of the human visual system. It tells us how much contrast is needed to detect a pattern of a given spatial frequency (how fine or coarse its stripes are).

The Human Benchmark: Humans are “band-pass.” We ignore very slow gradients (low frequency) to be invariant to lighting changes, and we lose high-frequency details due to optical blurring.
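
Since sensitivity is simply the reciprocal of the threshold contrast, a model’s CSF can be traced by repeating the threshold search at each spatial frequency. The sketch below reuses `gabor_patch`, `detection_threshold` and `encode_features` from the earlier snippets and assumes `model` is the frozen encoder under test; it shows the shape of the procedure rather than the paper's exact protocol:

```python
from functools import partial

# `model`, gabor_patch(), detection_threshold() and encode_features() come
# from the earlier sketches; this just sweeps the frequency axis.
encode = partial(encode_features, model)

frequencies_cpd = [0.5, 1, 2, 4, 8, 16]
reference = gabor_patch(contrast=0.0)        # uniform field at the mean luminance

sensitivity = {}
for f in frequencies_cpd:
    c_thr = detection_threshold(
        encode,
        make_stimulus=lambda c, f=f: gabor_patch(frequency_cpd=f, contrast=c),
        reference=reference,
    )
    # Sensitivity is 1 / threshold contrast; a band-pass system shows a peak
    # at mid frequencies, the way human observers do.
    sensitivity[f] = 1.0 / c_thr if c_thr else float("nan")
```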

The Model Results: When the researchers tested models like DINO, SAM, and OpenCLIP, they found a distinct lack of alignment.

Contour plots of contrast detection for various models.

Look at row (a) in the figure above.

  • The dashed line represents human performance (castleCSF). It curves like an upside-down “U”.
  • The contour lines show the model’s sensitivity.

Observation: Most foundation models do not follow the human dashed line.

  • OpenCLIP (far right) is messy and irregular.
  • DINOv2 (second column) shows some band-pass characteristics (a drop in sensitivity at low frequencies), suggesting it has learned to ignore illumination gradients, but it doesn’t match the human curve precisely.
  • SD-VAE (the encoder for Stable Diffusion) drops sensitivity at low frequencies aggressively, likely because it compresses images and discards “boring” low-frequency data.

Takeaway: Foundation models do not share the biological bottlenecks of the human eye. They don’t struggle with high frequencies (until they hit the pixel resolution limit) and they process low frequencies differently. They have forged their own path to detection.

Key Experiment 2: Contrast Masking

Visual masking is a phenomenon where one stimulus (the mask) makes it harder to see another stimulus (the target). For example, it’s hard to spot a zebra in a field of tall, striped grass.

In the “Phase-Incoherent Masking” test, the researchers hid a Gabor patch inside random noise.

Images from Phase-Incoherent Masking experiment.

To generate these masks mathematically, they used filtered noise:

Equation for Noise Mask generation.
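
A common recipe for this kind of mask, and a reasonable guess at what is happening here, is to band-pass filter white noise in the Fourier domain and rescale it to a target RMS contrast; the band limits and contrast below are placeholders, not the paper's values:

```python
import numpy as np

def noise_mask(size_px=256, ppd=60.0, band_cpd=(2.0, 6.0), rms_contrast=0.2,
               L_mean=100.0, seed=0):
    """Phase-incoherent mask: white noise band-pass filtered in frequency."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((size_px, size_px))

    # Radial spatial frequency of every Fourier coefficient, in cycles/degree.
    f = np.fft.fftfreq(size_px, d=1.0 / ppd)
    fx, fy = np.meshgrid(f, f)
    radius = np.hypot(fx, fy)

    keep = (radius >= band_cpd[0]) & (radius <= band_cpd[1])
    filtered = np.real(np.fft.ifft2(np.fft.fft2(noise) * keep))

    # Rescale so the mask has the requested RMS contrast around the mean luminance.
    filtered *= rms_contrast / filtered.std()
    return L_mean * (1.0 + filtered)
```

The target Gabor is then superimposed on this noise field, and the same threshold search as before reveals how much the mask raises the detection threshold.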

The Results: Here, the story changes. While models were bad at mimicking human detection thresholds, they were surprisingly good at mimicking human masking.

Refer back to Figure 5, row (h) (the Masking plots).

  • DINOv2 and OpenCLIP show contour slopes that align well with human data.
  • This suggests that “masking” isn’t just a biological glitch; it falls out of the statistics of natural images. Any system that learns to recognize objects in cluttered scenes (a core task for these models) ends up being hindered by clutter in much the same way biological vision is.

Key Experiment 3: Contrast Constancy

The final major test was Contrast Constancy.

In the real world, if you walk away from an object, its details get smaller (higher spatial frequency), but the object doesn’t look “faded.” Its perceived contrast remains constant. This is crucial for recognizing objects at different distances.

The researchers tested this by asking the models to “match” the contrast of a test grating to a reference grating. They minimized the difference in their feature space:

Equation for Contrast Matching minimization.
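
In practice the matching step can be approximated by a one-dimensional search: render the test grating at many contrasts and keep the one whose features land closest to the reference’s. The grid search below (reusing `gabor_patch`, `angular_cosine_distance` and `encode` from the earlier sketches) stands in for whatever optimiser the authors actually used, and the reference frequency and contrast are placeholders:

```python
import numpy as np

# Reuses gabor_patch(), angular_cosine_distance() and the `encode` callable
# from the CSF sketch above.
def match_contrast(test_freq_cpd, ref_freq_cpd=5.0, ref_contrast=0.5):
    """Contrast at which the model 'sees' the test grating as matching the reference."""
    f_ref = encode(gabor_patch(frequency_cpd=ref_freq_cpd, contrast=ref_contrast))
    candidates = np.logspace(-3, 0, 100)                  # candidate test contrasts
    distances = [
        angular_cosine_distance(
            encode(gabor_patch(frequency_cpd=test_freq_cpd, contrast=c)), f_ref)
        for c in candidates
    ]
    return candidates[int(np.argmin(distances))]
```

A matched-contrast curve that stays flat across test frequencies is the signature of contrast constancy.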

The Results (Figure 5, row i):

  • The dashed lines (Human data) are flat at high contrasts. This is the hallmark of human “Contrast Constancy”—we see strong contrast equally well across frequencies.
  • DINOv2 (green line) and OpenCLIP (orange/red lines) follow this trend quite well. They flatten out.
  • SAM (Segment Anything) struggles here, showing instability.

This indicates that the best vision models have learned scale invariance. They represent a high-contrast edge as “high contrast” whether it is close up (thick) or far away (thin). This is a functional necessity for robust computer vision, just as it is for human survival.

Comparing the Contenders

The study tested 45 different models. While we can’t look at every single one, the researchers quantified the alignment using Spearman correlation (\(r_s\)) for detection/masking and RMSE for matching.
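
Given a set of human measurements and the corresponding model measurements over the same conditions, both alignment scores are one-liners; the numbers below are placeholder values purely to show the computation:

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder values -- substitute measured human thresholds and the
# thresholds (or matched contrasts) recovered from the S_ac pipeline.
human = np.array([0.010, 0.004, 0.003, 0.006, 0.020])
model = np.array([0.030, 0.012, 0.011, 0.015, 0.050])

r_s, _ = spearmanr(human, model)                        # detection & masking alignment
rmse = float(np.sqrt(np.mean((human - model) ** 2)))    # matching alignment
print(f"Spearman r_s = {r_s:.2f}, RMSE = {rmse:.3f}")
```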

Bar chart quantifying similarity between models and HVS.

The Winners:

  1. DINOv2: This model consistently showed the closest resemblance to human vision, particularly in masking and area summation (how sensitivity increases with object size).
  2. OpenCLIP: Performed very well on masking and constancy, though it struggled with basic detection consistency.
  3. Supervised Models (ResNet): Older, supervised models (trained with labels) generally showed less alignment than the modern self-supervised giants.

The table below breaks down the scores (higher \(r_s\) is better, lower RMSE is better):

Detailed table of model alignment scores.

Note that DINOv2 ViT-B/14 achieves remarkably high correlations in masking tasks (\(>0.95\)), suggesting a convergence with human perception in complex, noisy environments.

Why Does Alignment Matter?

You might ask: Who cares if a robot sees like a human, as long as it works?

It turns out that “seeing like a human” might be a predictor of “working well.” The researchers plotted the alignment scores against the models’ performance on standard computer vision benchmarks (like ImageNet classification).

Correlation between alignment scores and classification performance.

The scatter plots reveal a positive correlation. Models that align better with human masking and matching characteristics (like DINOv2) tend to perform better on classification tasks.

This suggests that the “human way” of handling visual clutter (masking) and scale (constancy) isn’t arbitrary. It is likely the optimal way to process the visual statistics of our world. Evolution figured it out over millions of years; neural networks figured it out over millions of GPU hours.

Conclusion

The paper “Do computer vision foundation models learn the low-level characteristics of the human visual system?” provides a nuanced answer to our opening question.

No, foundation models do not blindly mimic the human eye. They lack the biological hardware limitations that create our specific Contrast Sensitivity Function. They don’t suffer from the same drop-offs in low light or low frequency.

However, they do converge with human vision on functional capabilities. Both systems have learned to:

  1. Handle visual clutter similarly (Masking).
  2. Perceive contrast consistently across distances (Constancy).

This implies that while our “hardware” differs, our “software” (the processing rules learned from natural data) is surprisingly similar. As we continue to build more powerful AI, analyzing them through the lens of human psychophysics offers a powerful tool to understand not just what they learn, but why they learn it. The convergence of biological and artificial vision suggests that there are universal laws of seeing—and our machines are beginning to obey them.