Introduction
Imagine you are driving down a highway. Your eyes are constantly scanning the environment, tracking the speed of the car in front of you, the trees rushing past in your peripheral vision, and the slight drift of your own vehicle. You are performing a complex calculation known in computer vision as Optical Flow estimation—determining how pixels move from one moment to the next.
For decades, computer vision researchers have been training AI models to master this task. They use “Ground Truth” data—mathematically perfect calculations of where every pixel actually moved. And modern AI is incredible at this; in many cases, it is far more precise than the human eye.
But here is the catch: Humans don’t see the world with mathematical perfection. Our visual cortex takes shortcuts. We experience optical illusions. We group objects together conceptually rather than tracking every pixel. We ignore the motion of falling snow to focus on the road.
If an AI assistant in a car sees a “perfect” physical flow that implies a collision, but the human driver perceives the scene differently due to a visual illusion, the system might intervene unexpectedly, causing confusion or panic. To build AI that truly interacts with us—whether in autonomous driving, animation tools, or video generation—we need models that understand not just how the world moves, but how we perceive it moving.
In this post, we are diving deep into HuPerFlow, a fascinating paper that introduces the first large-scale benchmark comparing Human Perception against Machine Vision and Physical Ground Truth. The researchers didn’t just run code; they conducted a massive psychological experiment to map out exactly where human vision drifts away from reality—and how current AI models fail to capture those human quirks.
The Gap Between Physics and Perception
In the world of Computer Vision (CV), Optical Flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene.
Traditionally, to train a model to estimate optical flow, you feed it two consecutive video frames and a “Ground Truth” (GT) map. The GT map is the physics-based reality: “Pixel A at coordinates (x, y) moved to (x+1, y+2).”
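To make that representation concrete, here is a minimal NumPy sketch (assuming the common convention of storing flow as an H × W × 2 array of per-pixel (dx, dy) displacements; actual file formats vary by dataset):

```python
import numpy as np

# A ground-truth optical flow map stores, for every pixel, its (dx, dy)
# displacement from frame t to frame t+1: an array of shape (H, W, 2).
H, W = 4, 4
flow = np.zeros((H, W, 2))
flow[1, 2] = [1.0, 2.0]  # "Pixel A at (x=2, y=1) moved to (x+1, y+2)"

# Where does every pixel land in the next frame?
ys, xs = np.mgrid[0:H, 0:W]
x_next = xs + flow[..., 0]
y_next = ys + flow[..., 1]
print(x_next[1, 2], y_next[1, 2])  # 3.0 3.0
```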
However, human vision is not a physics engine. It is a biological process evolved for survival, not pixel-perfect accuracy.
The Aperture Problem and Grouping
Humans face constraints like the aperture problem: when a moving object is viewed through a small window (such as the limited receptive field of a motion-sensitive neuron), its true direction cannot be determined without more context. Furthermore, humans tend to “group” motion. If a person is walking, we perceive the person moving as a whole unit, often ignoring the small, distinct motions of their clothing folds or hair strands.
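In the standard textbook formulation (not something specific to this paper), the aperture problem follows directly from the brightness constancy constraint:

\[
I_x u + I_y v + I_t = 0
\]

A single local measurement yields one equation in two unknowns \((u, v)\), so only the flow component parallel to the image gradient is determined; the component along an edge is invisible without wider spatial context.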
Until now, there hasn’t been a large-scale dataset to quantify these differences. Previous studies were limited to small sets of artificial stimuli (like moving dots) or very few natural scenes.
Introducing HuPerFlow
The researchers introduced HuPerFlow (Human-Perceived Flow), a massive dataset designed to bridge the gap between CV models and human vision.
The Scale of the Project
This isn’t a small lab test. The benchmark includes:
- 38,400 human response vectors.
- 2,400 specific locations probed across different videos.
- 10 different optical flow datasets, ranging from realistic driving to fantastical cartoons.
- 480 participant sessions.
The goal was to create a “Human Ground Truth”—a dataset where the “correct” answer isn’t where the pixel actually went, but where a human thought it went.

As shown in Figure 1, the dataset visualizes three distinct layers of information:
- Red Arrows: The direction the human observer thought the object moved.
- Cyan/Blue Arrows: The Ground Truth (where the object actually moved).
- Green Circles: The “Endpoint Error” (EPE), representing the magnitude of the difference between the two.
Notice in the figure how sometimes the red and blue arrows align perfectly (the human was accurate), while in other scenes, they diverge significantly.
How to Capture Human Perception
Measuring optical flow perception is notoriously difficult. You can’t just ask a participant, “What is the vector coordinate of that tree?” The researchers had to design a novel psychophysical experiment that could be run online but maintained strict scientific controls.
The Method of Adjustment
The researchers utilized a “Method of Adjustment.” Here is how the experiment worked for the participants:
- The Stimulus: A participant watches a short video clip (about 500ms).
- The Probe: A brief flash (a green dot) appears at a specific location in the video, telling the user, “Focus on the motion right here.”
- The Matching Task: After the video, a patch of “Brownian noise” (a static-like pattern) appears. Using the mouse, the participant adjusts the speed and direction of the noise until its motion appears to match the motion they saw at the probed location.

Figure 2 illustrates this workflow. The participant can toggle back and forth between the video and the noise patch as many times as they like until they are satisfied that the motions match. This allows for a precise quantitative measurement (speed and angle) of subjective perception.
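As a rough illustration of that matching stimulus (the noise parameters and rendering details here are assumptions for illustration, not taken from the paper), the sketch below generates a Brownian-noise patch and drifts it at an adjustable speed and direction; in the experiment, the participant tunes those two numbers with the mouse.

```python
import numpy as np

def brownian_noise_patch(size=128, seed=0):
    """A static Brownian (1/f^2 power) noise image, a stand-in for the matching patch."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal((size, size))
    fx = np.fft.fftfreq(size)[:, None]
    fy = np.fft.fftfreq(size)[None, :]
    f = np.sqrt(fx**2 + fy**2)
    f[0, 0] = 1.0  # avoid division by zero at the DC component
    patch = np.real(np.fft.ifft2(np.fft.fft2(white) / f))  # 1/f amplitude -> 1/f^2 power
    return (patch - patch.mean()) / patch.std()

def drifting_frames(patch, speed, direction_deg, n_frames=30):
    """Translate the patch by `speed` pixels/frame along `direction_deg`."""
    dx = speed * np.cos(np.deg2rad(direction_deg))
    dy = speed * np.sin(np.deg2rad(direction_deg))
    frames = [np.roll(patch,
                      (int(round(t * dy)), int(round(t * dx))),
                      axis=(0, 1))
              for t in range(n_frames)]
    return np.stack(frames)

# The participant's final (speed, direction) setting is their response vector.
movie = drifting_frames(brownian_noise_patch(), speed=2.0, direction_deg=45.0)
```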
Diverse Datasets
To ensure the findings weren’t limited to just one type of video, the team sourced clips from 10 prominent computer vision datasets.

As listed in Table 1, these datasets cover a wide spectrum:
- Driving and Navigation: KITTI, VIPER, TartanAir.
- Complex Animation: MPI Sintel (dragons, non-rigid motion), Monkaa (fuzzy monsters).
- Human Movement: MHOF (multi-human optical flow).
This diversity is crucial because the human eye processes the rigid motion of a car differently than the fluid motion of a cartoon character or the articulated motion of a walking human.
Analyzing the Human Data: When Do We Get It Wrong?
Once the data was collected, the researchers analyzed the Endpoint Error (EPE). This metric calculates the Euclidean distance between the human response and the physical ground truth.

Equation 1 spells this out as \(\mathrm{EPE} = \sqrt{(u_{\mathrm{human}} - u_{\mathrm{GT}})^2 + (v_{\mathrm{human}} - v_{\mathrm{GT}})^2}\), where \((u, v)\) are the vector components of motion. With it, the researchers could quantify exactly how “wrong” (or rather, how “human”) the participants were.
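A minimal NumPy version of that metric (not the authors’ code) looks like this:

```python
import numpy as np

def endpoint_error(human_flow, gt_flow):
    """Endpoint Error: Euclidean distance between two (u, v) flow vectors.
    Works elementwise over arrays whose last axis holds (u, v)."""
    diff = np.asarray(human_flow, dtype=float) - np.asarray(gt_flow, dtype=float)
    return np.sqrt((diff ** 2).sum(axis=-1))

# A human reports (3, 1) px/frame where the ground truth is (4, -1) px/frame:
print(endpoint_error([3.0, 1.0], [4.0, -1.0]))  # ~2.24 px
```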
The “Flow Illusions”
The results revealed that human errors are not random; they are systematic “Flow Illusions.”

Figure 3 provides a fascinating look at these biases across different datasets:
- Smooth Driving (Top Left - Driving): In scenes from KITTI or the Driving dataset, where motion is smooth and predictable, humans are highly accurate. The red arrows (human) and blue arrows (truth) mostly overlap.
- Global vs. Local (Middle Right - MHOF): When looking at walking humans, observers tended to perceive the global body movement rather than the specific motion of a limb. If a person is walking forward but swinging their arm back, a computer sees the arm going back; a human often just sees the person going forward.
- Contextual Influence (Bottom Left - Monkaa): In scenes with heavy camera rotation around a stationary object, humans often experienced induced motion—thinking an object was moving when it was actually the camera (and background) moving.
What Factors Confuse Humans?
The researchers broke down the errors based on visual properties of the scene.

Figure 4 highlights the relationship between scene properties and error rates:
- Speed (Top Left): As the Ground Truth speed increases, human error increases. We struggle to track very fast objects accurately.
- Image Gradient (Top Right): We perform better when there are strong edges (high gradient). Blurry, featureless areas are harder for humans to track.
- Self-Motion (Bottom Middle): When the camera itself is moving fast (simulating the observer moving), our error rates spike. This suggests that while our brains attempt to compensate for our own movement to isolate object motion, we aren’t mathematically perfect at it.
Benchmarking the Machines
The ultimate question of the paper is: Do current AI models see like humans?
The researchers tested a variety of optical flow algorithms, ranging from standard Deep Learning models (like RAFT and FlowFormer) to biologically inspired models (like FFV1MT).
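For readers who want a concrete starting point, here is a minimal sketch of pulling flow predictions out of a pretrained RAFT model via torchvision; this is generic usage, not the checkpoints or evaluation pipeline used in the paper.

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()

# Two consecutive frames as (N, 3, H, W) tensors in [0, 1];
# RAFT expects H and W to be divisible by 8.
frame1 = torch.rand(1, 3, 368, 496)
frame2 = torch.rand(1, 3, 368, 496)
frame1, frame2 = weights.transforms()(frame1, frame2)

with torch.no_grad():
    flow_iters = model(frame1, frame2)  # list of iteratively refined flow fields
flow = flow_iters[-1]                   # final estimate, shape (N, 2, H, W)
```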
To measure “human-likeness,” they couldn’t just use standard accuracy: a model that matches the physics perfectly will, by definition, disagree with humans in every case where humans experience an illusion. Instead, they used a Partial Correlation metric.

This equation (Equation 2) calculates the correlation between the Model’s Prediction and the Human Response, while controlling for the Ground Truth. Essentially, it asks: “When the human makes a mistake (deviates from GT), does the model make the same mistake?”
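If we call the three variables M (model), H (human), and G (ground truth), the standard first-order partial correlation, which is what this description implies, reads \(\rho_{MH\cdot G} = \frac{\rho_{MH} - \rho_{MG}\,\rho_{HG}}{\sqrt{(1-\rho_{MG}^{2})(1-\rho_{HG}^{2})}}\). A minimal NumPy sketch (again, not the authors’ code):

```python
import numpy as np

def partial_correlation(model, human, gt):
    """Correlation between model and human responses, controlling for ground truth."""
    r_mh = np.corrcoef(model, human)[0, 1]  # model vs. human
    r_mg = np.corrcoef(model, gt)[0, 1]     # model vs. ground truth
    r_hg = np.corrcoef(human, gt)[0, 1]     # human vs. ground truth
    return (r_mh - r_mg * r_hg) / np.sqrt((1 - r_mg**2) * (1 - r_hg**2))
```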
The Results
The comparison results, summarized in Table 2, tell a story of a trade-off between physical accuracy and perceptual alignment.

- The Physicists: Models like VideoFlow and RAFT (SOTA deep learning models) have very high correlation with Ground Truth (Right side of the table). However, their partial correlation with human perception (Left side, \(\rho\)) is very low. They are “too good.” They see the pixels that humans ignore.
- The Human-Mimetics: The biologically inspired model, FFV1MT, had much higher partial correlation with humans. It makes similar errors to us. However, its overall accuracy on the physical ground truth is lower.
Visualizing the Disconnect
Let’s look at what this divergence looks like in practice.

Figure 5 compares these models side-by-side.
- Column A (Top): Look at the red arrow (Human). It points generally down/right. The FFV1MT model (yellow arrow, left side) points in a similar direction. The VideoFlow model (yellow arrow, right side) points much more sharply downward (the physical truth). The bio-inspired model captured the “sluggishness” or general sense of human motion, while the SOTA model was mathematically precise but perceptually “wrong.”
- Column C (Bottom): This shows a motorcycle scene. The strong camera motion creates confusing signals. Humans (Red) largely ignore the background noise to focus on the object. The V1Attention model (Left) mimics this better than the RAFT model (Right), which tries to resolve every distinct motion vector, leading to a result that feels disjointed from the human experience.
Conclusion and Implications
The HuPerFlow benchmark reveals a significant misalignment in current computer vision. We have spent years optimizing AI to be perfect physicists, but in doing so, we have neglected to teach them to be human observers.
Why does this matter?
- Human-AI Interaction: If a semi-autonomous car sees a hazard that the human driver is perceptually blind to (due to an illusion), it needs to know that the driver doesn’t see it to warn them effectively.
- Animation and Art: Tools that automatically generate video or fill in frames (interpolation) need to create motion that looks “right” to us. A mathematically perfect interpolation might look jittery or uncanny if it violates our perceptual groupings.
- Neuroscience: By building models that can replicate the errors found in HuPerFlow, we can better understand the algorithmic nature of the human visual cortex.
HuPerFlow is the first step toward “Human-Aligned Computer Vision”—teaching machines not just to see the world as it is, but to see it as we do.