Introduction

Imagine you are driving down a highway. Your eyes are constantly scanning the environment, tracking the speed of the car in front of you, the trees rushing past in your peripheral vision, and the slight drift of your own vehicle. You are performing a complex calculation known in computer vision as Optical Flow estimation—determining how pixels move from one moment to the next.

For decades, computer vision researchers have been training AI models to master this task. They use “Ground Truth” data—mathematically perfect calculations of where every pixel actually moved. And modern AI is incredible at this; in many cases, it is far more precise than the human eye.

But here is the catch: Humans don’t see the world with mathematical perfection. Our visual cortex takes shortcuts. We experience optical illusions. We group objects together conceptually rather than tracking every pixel. We ignore the motion of falling snow to focus on the road.

If an AI assistant in a car sees a “perfect” physical flow that implies a collision, but the human driver perceives the scene differently due to a visual illusion, the system might intervene unexpectedly, causing confusion or panic. To build AI that truly interacts with us—whether in autonomous driving, animation tools, or video generation—we need models that understand not just how the world moves, but how we perceive it moving.

In this post, we are diving deep into HuPerFlow, a fascinating paper that introduces the first large-scale benchmark comparing Human Perception against Machine Vision and Physical Ground Truth. The researchers didn’t just run code; they conducted a massive psychological experiment to map out exactly where human vision drifts away from reality—and how current AI models fail to capture those human quirks.

The Gap Between Physics and Perception

In the world of Computer Vision (CV), Optical Flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene.

Traditionally, to train a model to estimate optical flow, you feed it two consecutive video frames and a “Ground Truth” (GT) map. The GT map is the physics-based reality: “Pixel A at coordinates (x, y) moved to (x+1, y+2).”
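For intuition, a dense ground-truth flow map is typically stored as a per-pixel array of 2D displacements. Here is a minimal sketch of that data structure (the shape and indexing convention are illustrative, not the file format of any particular dataset):

```python
import numpy as np

# A dense ground-truth flow map for a 480x640 frame:
# flow_gt[y, x] = (dx, dy), how far the pixel at (x, y) moves
# between frame t and frame t+1, in pixels.
H, W = 480, 640
flow_gt = np.zeros((H, W, 2), dtype=np.float32)

# Example: the pixel at (x=100, y=200) moved one pixel right and two pixels down.
flow_gt[200, 100] = (1.0, 2.0)

dx, dy = flow_gt[200, 100]
print(f"Pixel (100, 200) ended up at ({100 + dx:.0f}, {200 + dy:.0f})")
```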

However, human vision is not a physics engine. It is a biological process evolved for survival, not pixel-perfect accuracy.

The Aperture Problem and Grouping

Humans face constraints like the aperture problem, where viewing a moving object through a small window (like a receptive field in the eye) makes it impossible to determine its true direction without more context. Furthermore, humans tend to “group” motion. If a person is walking, we perceive the person moving as a whole unit, often ignoring the many small, distinct motions of their clothing folds or hair strands.

Until now, there hasn’t been a large-scale dataset to quantify these differences. Previous studies were limited to small sets of artificial stimuli (like moving dots) or very few natural scenes.

Introducing HuPerFlow

The researchers introduced HuPerFlow (Human-Perceived Flow), a massive dataset designed to bridge the gap between CV models and human vision.

The Scale of the Project

This isn’t a small lab test. The benchmark includes:

  • 38,400 human response vectors.
  • 2,400 specific locations probed across different videos.
  • 10 different optical flow datasets, ranging from realistic driving to fantastical cartoons.
  • 480 participant sessions.

The goal was to create a “Human Ground Truth”—a dataset where the “correct” answer isn’t where the pixel actually went, but where a human thought it went.

Figure 1. Example of HuPerFlow. The red arrows indicate human response to perceived flow, and the green arrows indicate ground truth motion vectors. The circles indicate the magnitudes of end-point errors, the difference between perceived flow and ground truth.

As shown in Figure 1, the dataset visualizes three distinct layers of information:

  1. Red Arrows: The direction the human observer thought the object moved.
  2. Green Arrows: The Ground Truth (where the object actually moved).
  3. Circles: The “Endpoint Error” (EPE), whose size represents the magnitude of the difference between the two.

Notice in the figure how sometimes the red and green arrows align perfectly (the human was accurate), while in other scenes, they diverge significantly.

How to Capture Human Perception

Measuring optical flow perception is notoriously difficult. You can’t just ask a participant, “What is the vector coordinate of that tree?” The researchers had to design a novel psychophysical experiment that could be run online but maintained strict scientific controls.

The Method of Adjustment

The researchers utilized a “Method of Adjustment.” Here is how the experiment worked for the participants:

  1. The Stimulus: A participant watches a short video clip (about 500ms).
  2. The Probe: A brief flash (a green dot) appears at a specific location in the video, telling the user, “Focus on the motion right here.”
  3. The Matching Task: After the video, a patch of “Brownian noise” (a static-like pattern) appears. The user controls this patch with their mouse. They have to adjust the speed and direction of the noise until it looks like it is moving exactly the same way the target spot in the video was moving.

Figure 2. Experimental procedure: At the start of each trial, a green circle appeared to mark the selected area. Next, a motion sequence and a matching stimulus were presented alternately until a response was made.

Figure 2 illustrates this workflow. The participant can toggle back and forth between the video and the noise patch as many times as they like until they are satisfied that the motions match. This allows for a precise quantitative measurement (speed and angle) of subjective perception.
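Each completed trial therefore yields a perceived speed and direction at the probed location, which can be converted into a 2D flow vector for comparison with the ground truth. Below is a minimal sketch of that conversion; the angle convention and frame interval are assumptions for illustration, not the paper’s exact units:

```python
import math

def response_to_vector(speed_px_per_s: float, direction_deg: float,
                       frame_interval_s: float = 1 / 60) -> tuple[float, float]:
    """Convert a matched speed/direction response into a per-frame flow vector.

    speed_px_per_s: perceived speed of the noise patch, in pixels per second.
    direction_deg:  perceived direction, counter-clockwise from the +x axis (assumed).
    frame_interval_s: assumed time between frames (60 fps here).
    """
    magnitude = speed_px_per_s * frame_interval_s          # pixels per frame
    angle = math.radians(direction_deg)
    return magnitude * math.cos(angle), magnitude * math.sin(angle)

# Example: a participant matched the probe with 120 px/s motion heading straight up.
u, v = response_to_vector(120.0, 90.0)
print(f"Perceived flow vector: ({u:.2f}, {v:.2f}) px/frame")
```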

Diverse Datasets

To ensure the findings weren’t limited to just one type of video, the team sourced clips from 10 prominent computer vision datasets.

Table 1. Summary of the selected optical flow datasets.

As listed in Table 1, these datasets cover a wide spectrum:

  • Realistic scenes and driving: KITTI, TartanAir, VIPER.
  • Complex Animation: MPI Sintel (dragons, non-rigid motion), Monkaa (fuzzy monsters).
  • Human Movement: MHOF (multi-human optical flow).

This diversity is crucial because the human eye processes the rigid motion of a car differently than the fluid motion of a cartoon character or the articulated motion of a walking human.

Analyzing the Human Data: When Do We Get It Wrong?

Once the data was collected, the researchers analyzed the Endpoint Error (EPE). This metric calculates the Euclidean distance between the human response and the physical ground truth.

Equation for EPE

Using Equation 1, where \((u, v)\) represents the vector components of motion, they could quantify exactly how “wrong” (or rather, how “human”) the participants were.
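Written out from that description (the subscripts are ours, not the paper’s notation), with \((u_h, v_h)\) the human response and \((u_{gt}, v_{gt})\) the physical ground truth, the endpoint error takes the standard Euclidean form:

\[
\mathrm{EPE} = \sqrt{(u_h - u_{gt})^2 + (v_h - v_{gt})^2}
\]

An EPE of zero means the participant matched the physical motion exactly; larger values mean larger perceptual deviations.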

The “Flow Illusions”

The results revealed that human errors are not random; they are systematic “Flow Illusions.”

Figure 3. Demonstration of human perceived motion vectors.

Figure 3 provides a fascinating look at these biases across different datasets:

  1. Smooth Driving (Top Left - Driving): In scenarios like KITTI or Driving, where motion is smooth and predictable, humans are highly accurate. The human-response arrows and the ground-truth arrows mostly overlap.
  2. Global vs. Local (Middle Right - MHOF): When looking at walking humans, observers tended to perceive the global body movement rather than the specific motion of a limb. If a person is walking forward but swinging their arm back, a computer sees the arm going back; a human often just sees the person going forward.
  3. Contextual Influence (Bottom Left - Monkaa): In scenes with heavy camera rotation around a stationary object, humans often experienced induced motion—thinking an object was moving when it was actually the camera (and background) moving.

What Factors Confuse Humans?

The researchers broke down the errors based on visual properties of the scene.

Figure 4. Endpoint errors (EPEs) of human responses and two machine vision models as functions of optical flow properties.

Figure 4 highlights the relationship between scene properties and error rates:

  • Speed (Top Left): As the Ground Truth speed increases, human error increases. We struggle to track very fast objects accurately.
  • Image Gradient (Top Right): We perform better when there are strong edges (high gradient). Blurry, featureless areas are harder for humans to track.
  • Self-Motion (Bottom Middle): When the camera itself is moving fast (simulating the observer moving), our error rates spike. This suggests that while our brains attempt to compensate for our own movement to isolate object motion, we aren’t mathematically perfect at it.

Benchmarking the Machines

The ultimate question of the paper is: Do current AI models see like humans?

The researchers tested a variety of optical flow algorithms, ranging from standard Deep Learning models (like RAFT and FlowFormer) to biologically inspired models (like FFV1MT).

To measure “human-likeness,” they couldn’t just use standard accuracy: a model that perfectly matches the physics will, by definition, diverge from human perception wherever humans see an illusion. Instead, they used a Partial Correlation metric.

Equation for Partial Correlation

This equation (Equation 2) calculates the correlation between the Model’s Prediction and the Human Response, while controlling for the Ground Truth. Essentially, it asks: “When the human makes a mistake (deviates from GT), does the model make the same mistake?”
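In its standard form (our notation), with \(M\) the model’s prediction, \(H\) the human response, and \(G\) the ground truth, the partial correlation is:

\[
\rho_{MH \cdot G} = \frac{\rho_{MH} - \rho_{MG}\,\rho_{HG}}{\sqrt{\left(1 - \rho_{MG}^2\right)\left(1 - \rho_{HG}^2\right)}}
\]

A value near 1 means the model’s deviations from the ground truth track human deviations; a value near 0 means its errors are unrelated to ours.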

The Results

The comparison results, summarized in Table 2, tell a story of a trade-off between physical accuracy and perceptual alignment.

Table 2. Predictions of optical flow algorithms versus Human response or ground truth.

  • The Physicists: Models like VideoFlow and RAFT (SOTA deep learning models) have very high correlation with Ground Truth (Right side of the table). However, their partial correlation with human perception (Left side, \(\rho\)) is very low. They are “too good.” They see the pixels that humans ignore.
  • The Human-Mimetics: The biologically inspired model, FFV1MT, had much higher partial correlation with humans. It makes similar errors to us. However, its overall accuracy on the physical ground truth is lower.
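To make this trade-off concrete, here is a toy sketch of how such a comparison could be computed from per-location speed estimates. The helper functions, arrays, and numbers below are illustrative assumptions, not the paper’s evaluation pipeline:

```python
import numpy as np

def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Plain Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def partial_corr(model: np.ndarray, human: np.ndarray, gt: np.ndarray) -> float:
    """Model-human correlation while controlling for the ground truth."""
    r_mh, r_mg, r_hg = pearson(model, human), pearson(model, gt), pearson(human, gt)
    return float((r_mh - r_mg * r_hg) / np.sqrt((1 - r_mg**2) * (1 - r_hg**2)))

# Toy data: physical speeds at probed locations, plus simulated responses.
rng = np.random.default_rng(0)
gt = rng.uniform(0, 10, 500)                   # ground-truth speeds
human = 0.8 * gt + rng.normal(0, 1.0, 500)     # humans compress speed and add their own bias
physicist = gt + rng.normal(0, 0.2, 500)       # a model that tracks physics closely
mimic = 0.8 * gt + 0.8 * (human - 0.8 * gt) + rng.normal(0, 0.5, 500)  # shares human deviations

for name, model in [("physicist", physicist), ("mimic", mimic)]:
    print(f"{name}: corr with GT = {pearson(model, gt):.2f}, "
          f"partial corr with humans = {partial_corr(model, human, gt):.2f}")
```

The “physicist” model scores highest against the ground truth but near zero on partial correlation with humans, while the “mimic” gives up some physical accuracy to share the human deviations, mirroring the pattern the paper reports.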

Visualizing the Disconnect

Let’s look at what this divergence looks like in practice.

Figure 5. Predicted vectors of optical flow algorithms.

Figure 5 compares these models side-by-side.

  • Panel A (Top): Look at the red arrow (Human). It points generally down and to the right. The FFV1MT model (yellow arrow, left side) points in a similar direction, while the VideoFlow model (yellow arrow, right side) points much more sharply downward (the physical truth). The bio-inspired model captured the general, somewhat “sluggish” sense of motion that humans report, while the SOTA model was mathematically precise but perceptually “wrong.”
  • Panel C (Bottom): This shows a motorcycle scene where strong camera motion creates confusing signals. Humans (Red) largely ignore the background motion and focus on the object. The V1Attention model (Left) mimics this better than the RAFT model (Right), which tries to resolve every distinct motion vector, leading to a result that feels disjointed from the human experience.

Conclusion and Implications

The HuPerFlow benchmark reveals a significant misalignment in current computer vision. We have spent years optimizing AI to be perfect physicists, but in doing so, we have neglected to teach them to be human observers.

Why does this matter?

  1. Human-AI Interaction: If a semi-autonomous car sees a hazard that the human driver is perceptually blind to (due to an illusion), it needs to know that the driver doesn’t see it to warn them effectively.
  2. Animation and Art: Tools that automatically generate video or fill in frames (interpolation) need to create motion that looks “right” to us. A mathematically perfect interpolation might look jittery or uncanny if it violates our perceptual groupings.
  3. Neuroscience: By building models that can replicate the errors found in HuPerFlow, we can better understand the algorithmic nature of the human visual cortex.

HuPerFlow is the first step toward “Human-Aligned Computer Vision”—teaching machines not just to see the world as it is, but to see it as we do.