Introduction
In the world of computer vision, understanding human movement is a cornerstone task. Whether it’s for healthcare rehabilitation systems, security surveillance, or generating realistic video animations, the computer needs to know not just where a person is, but how they are moving.
For years, researchers have relied on two primary tools: Optical Flow (tracking every pixel’s movement) and Pose Estimation (tracking skeleton joints). While both are useful, they have significant flaws. Optical flow is noisy—it tracks blowing leaves and passing cars just as attentively as the human subject. Pose estimation is precise but overly abstract—it reduces a complex human body to a stick figure, losing crucial shape information.
What if we could combine the best of both worlds? What if we could teach a computer to see motion the way we do—focusing on the human, ignoring the background, and understanding both the overall trajectory and the subtle movements of limbs?
Enter H-MoRe, a novel Human-centric Motion Representation pipeline presented by researchers from Michigan State University. This paper proposes a method that dynamically filters out background noise to capture precise human motion and shape, all while learning from real-world video without needing expensive human labeling.
The Problem: Noise vs. Abstraction
To appreciate H-MoRe, we first need to look at the limitations of current technology.
Optical Flow calculates the motion offset of pixels between two consecutive frames. It creates a dense map of movement. The problem? It is “human-blind.” If a person walks past a swaying tree, the optical flow map captures the tree’s motion just as vividly as the person’s. Furthermore, these models are often trained on synthetic data (computer-generated movies), which means they sometimes struggle with the biological nuances of real human movement.
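To see the “human-blind” problem in code, here is a minimal sketch using OpenCV’s classic Farnebäck method (a stand-in for the learned flow models discussed here; the frame paths are placeholders). Every pixel, background included, gets a motion vector:

```python
import cv2

# Read two consecutive frames (placeholder paths) as grayscale.
prev = cv2.imread("frame_0.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_1.png", cv2.IMREAD_GRAYSCALE)

# Dense optical flow: one (dx, dy) offset per pixel.
flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
)

# Every pixel gets a vector -- swaying trees and passing cars included.
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print(flow.shape)        # (H, W, 2)
print(magnitude.mean())  # average motion over the WHOLE frame, not the person
```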
Human Pose Estimation, on the other hand, detects joints (elbows, knees, shoulders). It is strictly human-centric. However, it discards body shape. If you are trying to identify a person by their gait (walking style), the shape of their body and the way their clothes move both matter. A stick figure doesn’t capture that.

As shown in Figure 1 above, H-MoRe (highlighted in the red box) offers a sharp, clean representation. Unlike standard Optical Flow, which is blurry and noisy, H-MoRe outlines the human figure precisely. Unlike Pose representations, it retains the dense motion information of the body’s shape.
The Solution: H-MoRe
The researchers propose a pipeline that learns to estimate this “Human-centric” motion directly from raw videos. The core innovation lies in two main areas:
- A Joint Constraint Learning Framework that uses physics and shape to supervise the learning process.
- The concept of World-Local Flows, which separates motion into absolute and relative components.
1. The Joint Constraint Learning Framework
One of the biggest hurdles in training motion models is the lack of “ground truth” data. You can’t easily obtain a pixel-perfect motion map for a real YouTube video.
H-MoRe solves this by using a self-supervised approach. The model learns by ensuring its predictions obey two logical constraints derived from the video itself: the Skeleton Constraint and the Boundary Constraint.

The Skeleton Constraint (\(\mathcal{F}\))
This constraint is based on kinematics. The assumption is simple: the “flesh” (pixels) near a bone shouldn’t move in a completely different direction than the bone itself.
The system extracts the human skeleton using a standard pose estimator. Then, it checks the estimated motion flow. If a pixel belonging to the arm is moving left, but the skeleton arm is moving right, the model is penalized. This aligns the pixel-wise motion with the kinematic structure of the body.
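As a rough illustration of what such a penalty might look like (my own NumPy simplification, not the authors’ formulation), consider a loss that compares each body pixel’s flow vector with the motion of its nearest skeleton segment, in both direction and magnitude:

```python
import numpy as np

def skeleton_constraint_loss(flow, bone_flow, eps=1e-8):
    """Toy version of the skeleton constraint.

    flow:      (N, 2) predicted motion vectors for N body pixels.
    bone_flow: (N, 2) motion of each pixel's nearest skeleton segment,
               derived from pose keypoints in two consecutive frames.
    Penalizes disagreement in direction (cosine term) and magnitude.
    """
    flow_norm = np.linalg.norm(flow, axis=1)
    bone_norm = np.linalg.norm(bone_flow, axis=1)
    cosine = (flow * bone_flow).sum(axis=1) / (flow_norm * bone_norm + eps)
    angle_term = 1.0 - cosine                     # 0 when perfectly aligned
    magnitude_term = np.abs(flow_norm - bone_norm)
    return (angle_term + magnitude_term).mean()

# An arm pixel moving left while its bone moves right gets a large penalty.
print(skeleton_constraint_loss(np.array([[-1.0, 0.0]]),
                               np.array([[ 1.0, 0.0]])))  # ~2.0
```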
The Boundary Constraint (\(\mathcal{G}\))
This constraint focuses on shape. It ensures that the edges of the predicted motion flow align with the actual edges of the human body in the image.
The researchers use edge detection to find the human silhouette. They then force the “edges” of the flow map to match this silhouette. This prevents the “bleeding” effect seen in standard optical flow, where the motion of a person blurs into the background.
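A minimal sketch of the idea (assuming Sobel edges as a silhouette proxy and a simple L1 mismatch; the paper’s actual constraint is more involved):

```python
import cv2
import numpy as np

def edge_strength(img):
    """Normalized Sobel edge magnitude of a single-channel float image."""
    gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)
    edges = np.sqrt(gx ** 2 + gy ** 2)
    return edges / (edges.max() + 1e-8)

def boundary_constraint_loss(flow, person_mask):
    """Toy boundary constraint: edges of the predicted flow field should
    coincide with the human silhouette rather than bleed past it.

    flow:        (H, W, 2) predicted motion field.
    person_mask: (H, W) binary human silhouette (e.g. from a segmenter).
    """
    flow_edges = edge_strength(np.linalg.norm(flow, axis=2).astype(np.float32))
    body_edges = edge_strength(person_mask.astype(np.float32))
    # L1 mismatch: motion edges away from the body boundary are penalized.
    return np.abs(flow_edges - body_edges).mean()
```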

Figure 5 (above) breaks this down visually.
- Panels (b) and (c) show how the model checks if the flow vector (\(u_p\)) aligns with the skeleton vector (\(k_q\)) in both angle and intensity.
- Panel (d) illustrates the boundary constraint, ensuring the flow edge (\(s\)) snaps to the human boundary (\(e\)).
2. World-Local Flows
The second major contribution of this paper is how it represents motion. Standard optical flow usually gives you absolute movement relative to the camera. But human motion is complex.
Imagine a person waving while walking forward.
- World Flow (\(M_w\)): This is the movement of the hand relative to the environment. It combines the walking speed and the waving speed.
- Local Flow (\(M_l\)): This is the movement of the hand relative to the person’s body. It isolates the “waving” action.
H-MoRe captures both.
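As a toy example: if the whole body translates at \((5, 0)\) pixels per frame while the hand’s World Flow is \((3, 4)\), the hand’s Local Flow is \((3, 4) - (5, 0) = (-2, 4)\). Relative to the torso, the hand swings backward and upward, and the walking component disappears.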

Why does this matter? For some tasks, like tracking a person across a room, World Flow is essential. For other tasks, like recognizing a specific gesture (e.g., “checking a watch”), Local Flow is far more informative because it ignores how fast the person is walking.
Efficiency in Calculation
Calculating two different flow maps usually requires two heavy neural networks, which doubles the computational cost. The authors use a clever kinematic trick inspired by Galilean transformations.
They calculate the World Flow (\(M_w\)) first. Then, instead of running a massive network for Local Flow, they use a lightweight network to estimate the subject’s overall velocity (\(v_s\)). By simply subtracting the overall velocity from the World Flow, they get the Local Flow mathematically:
\[
M_l = M_w - v_s
\]

This allows H-MoRe to provide rich, multi-layered motion information without sacrificing real-time performance.
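Here is a minimal sketch of this two-stage computation, reusing the toy numbers from the waving example above (the array shapes and mask handling are my assumptions, not the paper’s interface):

```python
import numpy as np

def world_to_local_flow(world_flow, subject_velocity, person_mask):
    """Derive Local Flow from World Flow via M_l = M_w - v_s,
    instead of running a second heavy flow network.

    world_flow:       (H, W, 2) World Flow M_w.
    subject_velocity: (2,) overall body velocity v_s (in H-MoRe this comes
                      from a lightweight network; here it is just given).
    person_mask:      (H, W) binary human silhouette.
    """
    local_flow = world_flow - subject_velocity[None, None, :]
    # Motion is only meaningful on the person; zero out the background.
    return local_flow * person_mask[..., None]

# Toy frame: every body pixel walks at (5, 0); one "hand" pixel also waves.
H, W = 4, 4
world = np.tile(np.array([5.0, 0.0]), (H, W, 1))
world[0, 3] = [3.0, 4.0]                     # hand pixel: walking + waving
mask = np.ones((H, W))

local = world_to_local_flow(world, np.array([5.0, 0.0]), mask)
print(local[0, 3])  # [-2.  4.] -> pure waving, walking removed
print(local[2, 2])  # [ 0.  0.] -> torso pixel has no local motion
```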
Experimental Results
The researchers tested H-MoRe on three distinct tasks: Gait Recognition, Action Recognition, and Video Generation.
Gait Recognition
Gait recognition identifies people by their walking style. This is notoriously difficult when people change clothes or carry bags.

As shown in Table 1, H-MoRe significantly outperforms state-of-the-art optical flow methods (like RAFT and FlowFormer++).
- CL (Clothing Change): Look at the “CL” column. This is the hardest test. H-MoRe achieves 87.66% accuracy, while the popular RAFT model only reaches 80.52%. The Boundary Constraint likely plays a huge role here by ensuring the body shape is preserved despite clothing changes.
Action Recognition
This task involves classifying what a person is doing (e.g., diving, running). The researchers used the Diving48 dataset, which features fast-paced motion and blur.

In Table 2, H-MoRe again leads the pack. It achieves a top-1 accuracy (Acc@1) of 72.99%, beating the nearest competitor by over one percentage point. While this might seem small, in the competitive world of action recognition it is a solid margin, especially given that H-MoRe is far more efficient than the heavy “VideoFlow” or “FlowFormer++” models.
Video Generation
Perhaps the most visually intuitive proof of H-MoRe’s quality is in video generation. The researchers fed the motion representations into a generative model to see if it could reconstruct video frames.

Look closely at Figure 6.
- GT: The Ground Truth (what the video should look like).
- RAFT: The video generated using standard optical flow. Notice the blurry hands and the “ghosting” effects. The model is confused about where the hand ends and the background begins.
- H-MoRe: The video is much sharper. Because H-MoRe enforces strict boundaries, the generative model knows exactly where to render the human pixels.
Visualizing the Flows
Finally, let’s look at what the computer actually “sees.”

In Figure 7, compare H-MoRe’s World Flow (bottom right) with RAFT or GMA.
- Other methods: The flow spills over the edges. The feet are blurry blobs.
- H-MoRe: The flow looks like a perfect cutout of the person. You can clearly see the legs and the briefcase. The Local Flow visualization (next to it) isolates the relative limb movements, providing a unique “X-ray” view of the action mechanics.
Conclusion
H-MoRe represents a significant step forward in human motion analysis. By acknowledging that “pixels don’t move randomly”—they move according to skeletons and boundaries—the researchers have created a system that is both physically plausible and visually precise.
The introduction of World-Local Flows provides a richer vocabulary for computers to understand action, distinguishing between “moving through space” and “moving the body.” Whether for identifying a gait in security footage or generating lifelike avatars in the metaverse, H-MoRe proves that when it comes to analyzing humans, it pays to be human-centric.
The code and models have been made available by the authors, paving the way for future applications in real-time sports analysis, healthcare monitoring, and beyond.