Introduction
The field of generative AI has moved at a breakneck pace. We have gone from blurry, postage-stamp-sized GIFs to high-definition, cinematic video generation in a matter of years. Models like Sora, Kling, and Gen-3 can render lighting, textures, and compositions that are nearly indistinguishable from reality.
However, there is a catch. While these models have mastered appearance (what things look like), they often fail spectacularly at motion (how things move).
If you have experimented with text-to-video models, you have likely seen the artifacts: a runner’s legs merging into one, a person walking backward while facing forward, or objects passing through solid walls like ghosts. These aren’t just small glitches; they are fundamental failures to model the physics and dynamics of the real world.
Figure 1. VideoJAM results. By explicitly modeling motion alongside appearance, the framework generates complex movements like twirling, jumping, and elastic deformation.
In this post, we will dive deep into VideoJAM, a research paper presented at ICML 2025. This paper identifies exactly why current models struggle with motion and proposes a surprisingly elegant solution: forcing the model to learn a Joint Appearance-Motion (JAM) representation.
We will explore how the authors modified the training objective to make models “motion-aware” and how they introduced a novel inference technique called Inner-Guidance to steer generation toward physical plausibility.
The Problem: Why Do Video Models Struggle with Physics?
To fix a problem, you first have to understand its root cause. The prevailing architecture for video generation is the Diffusion Transformer (DiT). These models are typically trained with a pixel-reconstruction objective. Put simply, the model takes a noisy video and tries to predict the clean video (or the noise added to it).
The researchers hypothesized that this standard objective is the culprit. In a video, the vast majority of information is static appearance—colors, textures, and backgrounds. The actual motion (the pixels changing over time) constitutes a much smaller part of the signal. As a result, models optimize for looking good rather than moving correctly.
The Motivation Experiments
To prove this hypothesis, the authors conducted a fascinating experiment. They took a set of training videos and created two versions:
- Original: The normal video.
- Permuted: The same video, but with the frames shuffled randomly in time.
If a model understands motion, it should be very confused by the shuffled video. The “loss” (the prediction error) should skyrocket because the temporal sequence makes no sense.
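As a rough sketch of this probe (not the paper's actual code), assuming a generic noise-prediction objective and a placeholder `model` callable, we can compare the loss on a clip against the loss on a temporally shuffled copy of it:

```python
import torch

def denoising_loss(model, video, t):
    """Standard pixel-reconstruction objective: noise the clip,
    ask the model to predict the added noise, measure the error."""
    noise = torch.randn_like(video)
    noisy = (1 - t) * video + t * noise          # simple linear noising
    return torch.mean((model(noisy, t) - noise) ** 2)

def temporal_probe(model, video, t=0.5):
    """Compare the loss on the original clip vs. a frame-shuffled copy.
    If the two values stay close, the model is ignoring temporal order."""
    perm = torch.randperm(video.shape[1])        # video: (B, T, C, H, W)
    return denoising_loss(model, video, t), denoising_loss(model, video[:, perm], t)

# Toy usage with a stand-in "model" that ignores its inputs entirely.
model = lambda x, t: torch.zeros_like(x)
video = torch.randn(1, 16, 3, 32, 32)
print(temporal_probe(model, video))
```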
Figure 2. The Motivation Experiment. The orange lines show a standard DiT model. Notice how the solid line (original video) and dashed line (shuffled video) are nearly identical until step 60. This implies the model cannot distinguish between a coherent video and a jumbled mess.
As shown in Figure 2, the standard DiT model (orange lines) is nearly invariant to temporal perturbations. Until about 60% of the generation process is complete, the model doesn’t care if the frames are in the wrong order. This confirms that standard video models are biased toward spatial appearance and largely ignore temporal dynamics during the critical early stages of generation.
Furthermore, the researchers identified when motion is determined during the diffusion process. By adding noise to a video and regenerating it from different timesteps (SDEdit), they found that the coarse motion and structure are locked in between steps 20 and 60.
Figure 3. By step 60, the video’s coarse motion and structure are mostly determined. If the model isn’t paying attention to motion during these steps, the result will be incoherent.
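The probe itself is conceptually simple. A minimal sketch, assuming a velocity-predicting model and a plain Euler sampler (both placeholders, not the paper's implementation): re-noise a clean clip up to a chosen step, regenerate from there, and check how much of the original motion survives.

```python
import torch

def sdedit_probe(model, video, start_t, n_steps=50):
    """Re-noise a clean clip up to `start_t`, then denoise it back to t=0.
    Regenerating from later steps preserves more of the clip's coarse
    structure and motion; sweeping `start_t` shows when motion is locked in."""
    x = (1 - start_t) * video + start_t * torch.randn_like(video)
    ts = torch.linspace(start_t, 0.0, n_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        velocity = model(x, t_cur)               # model predicts a velocity field
        x = x + (t_next - t_cur) * velocity      # Euler step back toward t = 0
    return x
```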
Because standard models ignore temporal order during these specific steps, they fail to generate coherent motion, leading to the “uncanny valley” effects seen below.
Figure 4. Standard models struggle with basic mechanics (jogging), complex gymnastics, object solidity, and rotation.
The Solution: VideoJAM
The core philosophy of VideoJAM is that a model shouldn’t just look at pixels; it should explicitly learn how those pixels move. The framework consists of two main components:
- Training: Learning a Joint Appearance-Motion representation.
- Inference: Using Inner-Guidance to steer the generation.
Figure 5. The VideoJAM Framework. (a) During training, the model predicts both video pixels and optical flow. (b) During inference, the predicted flow guides the generation.
1. Joint Appearance-Motion Representations (Training)
The researchers needed a way to represent motion explicitly. They chose Optical Flow—a representation that calculates the displacement of pixels between consecutive frames. Optical flow is ideal because it describes dynamics without caring about texture or lighting.
To make this compatible with standard video models, they convert the optical flow vectors into an RGB image format (using color to represent direction and intensity to represent speed).
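The exact color convention in the paper may differ, but the standard encoding looks roughly like this sketch: hue encodes the flow direction, brightness encodes the speed.

```python
import numpy as np
from matplotlib.colors import hsv_to_rgb

def flow_to_rgb(flow, max_magnitude=None):
    """Encode a (H, W, 2) optical-flow field as an RGB image:
    hue <- flow direction, value (brightness) <- flow magnitude (speed)."""
    dx, dy = flow[..., 0], flow[..., 1]
    magnitude = np.sqrt(dx**2 + dy**2)
    angle = np.arctan2(dy, dx)                      # direction in [-pi, pi]
    if max_magnitude is None:
        max_magnitude = magnitude.max() + 1e-8
    hsv = np.stack([
        (angle + np.pi) / (2 * np.pi),              # hue: direction
        np.ones_like(magnitude),                    # full saturation
        np.clip(magnitude / max_magnitude, 0, 1),   # value: speed
    ], axis=-1)
    return hsv_to_rgb(hsv)                          # (H, W, 3) in [0, 1]

# Toy flow: everything drifting to the right at unit speed.
rgb = flow_to_rgb(np.dstack([np.ones((64, 64)), np.zeros((64, 64))]))
```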
The Architecture Modification
Standard DiT models take a noisy video latent \(x_t\) as input and output a prediction \(u\). VideoJAM modifies the architecture to accept two inputs and produce two outputs:
- Input: Noisy Video (\(x_t\)) + Noisy Optical Flow (\(d_t\)).
- Output: Predicted Video + Predicted Optical Flow.
Remarkably, this doesn’t require redesigning the massive Transformer backbone. They simply modify the linear projection layers at the very beginning and end of the network:
- \(\mathbf{W}_{in}^+\): A linear layer that projects the concatenated video and motion latents into the transformer’s embedding space.
- \(\mathbf{W}_{out}^+\): A linear layer that projects the transformer’s output back into separate video and motion predictions.
The equation for the forward pass becomes:

\[
\hat{\mathbf{u}}^+ = [\hat{\mathbf{u}}_x, \hat{\mathbf{u}}_d] = \mathbf{W}_{out}^+ \, f_\theta\!\left(\mathbf{W}_{in}^+ [x_t, d_t];\, y, t\right)
\]

where \(f_\theta\) is the unchanged Transformer backbone, \(y\) is the text prompt, and \(\hat{\mathbf{u}}_x, \hat{\mathbf{u}}_d\) are the video and motion predictions.
Here, \([x_t, d_t]\) represents the concatenation of appearance and motion. By forcing the model to process both simultaneously, the internal Transformer layers must learn a unified latent representation that understands how appearance and motion correlate.
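A minimal PyTorch sketch of this idea, with a hypothetical `backbone` standing in for the DiT trunk (names and shapes are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class JointInOut(nn.Module):
    """Sketch of the VideoJAM-style input/output change: only two linear
    projections are added around an otherwise unchanged backbone."""

    def __init__(self, backbone, latent_dim, embed_dim):
        super().__init__()
        self.backbone = backbone
        # W_in^+: concatenated [video, flow] latents -> embedding space
        self.w_in_plus = nn.Linear(2 * latent_dim, embed_dim)
        # W_out^+: embeddings -> separate video and flow predictions
        self.w_out_plus = nn.Linear(embed_dim, 2 * latent_dim)

    def forward(self, x_t, d_t, t):
        tokens = self.w_in_plus(torch.cat([x_t, d_t], dim=-1))
        hidden = self.backbone(tokens, t)
        pred_video, pred_flow = self.w_out_plus(hidden).chunk(2, dim=-1)
        return pred_video, pred_flow

# Toy usage with a stand-in backbone that ignores the timestep.
backbone = lambda tokens, t: torch.relu(tokens)
model = JointInOut(backbone, latent_dim=16, embed_dim=64)
x_t = torch.randn(1, 128, 16)   # (batch, tokens, latent_dim) video latents
d_t = torch.randn(1, 128, 16)   # matching flow latents
video_pred, flow_pred = model(x_t, d_t, t=torch.tensor(0.5))
```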
The New Objective
The loss function is updated to minimize the error for both the video pixels and the motion map:

\[
\mathcal{L}^+ = \mathbb{E}\left[ \left\| \hat{\mathbf{u}}_x - \mathbf{u}_x \right\|_2^2 + \left\| \hat{\mathbf{u}}_d - \mathbf{u}_d \right\|_2^2 \right]
\]

where \(\mathbf{u}_x\) and \(\mathbf{u}_d\) are the targets for the video and the optical flow, respectively.
By training on this joint objective, the model can no longer ignore temporal dynamics. If it generates a video with scattered pixels, the optical flow prediction will be incorrect, and the loss will be high.
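A sketch of what such a joint objective could look like, reusing the `JointInOut`-style interface from the previous snippet and assuming a flow-matching-style velocity target (an assumption, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def joint_jam_loss(model, x0_video, x0_flow, t):
    """Joint objective sketch: noise video and flow latents the same way,
    predict both, and sum the two reconstruction errors."""
    noise_x = torch.randn_like(x0_video)
    noise_d = torch.randn_like(x0_flow)
    x_t = (1 - t) * x0_video + t * noise_x
    d_t = (1 - t) * x0_flow + t * noise_d
    pred_video, pred_flow = model(x_t, d_t, t)
    # Flow-matching-style velocity targets (one common convention).
    target_x = noise_x - x0_video
    target_d = noise_d - x0_flow
    return F.mse_loss(pred_video, target_x) + F.mse_loss(pred_flow, target_d)
```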
2. Inner-Guidance (Inference)
Training a model to predict motion is great, but how do we use that during generation? At inference time, we start with pure noise. We don’t have a ground-truth motion signal (optical flow) to feed the model.
The authors introduce Inner-Guidance. The idea is to use the model’s own motion prediction as a guide. As the model starts generating a video, it also generates a “draft” of the motion (optical flow). Even if this draft is noisy, it contains valuable priors about where things should be moving.
The Derivation
In diffusion models, we often use Classifier-Free Guidance (CFG) to align the generated sample with a text prompt. VideoJAM extends this to align the video with the generated motion.
The goal is to sample from a distribution that respects both the text prompt (\(y\)) and the motion (\(d_t\)). The sampling distribution looks like this:

\[
\tilde{p}(x_t \mid y, d_t) \;\propto\; p(x_t)\, p(y \mid x_t)^{w_1}\, p(d_t \mid x_t, y)^{w_2}
\]
Here, \(w_1\) and \(w_2\) are guidance scales (strength parameters).
- \(w_1\) pushes the model toward satisfying the text prompt.
- \(w_2\) pushes the model toward the motion prior.
However, standard guidance assumes the conditions are independent or external. In VideoJAM, the motion \(d_t\) is generated by the model itself alongside the video \(x_t\). Through Bayesian expansion of the distribution above, the authors derive the score update rule:

\[
\nabla_{x_t} \log \tilde{p}(x_t \mid y, d_t) = \nabla_{x_t} \log p(x_t) + w_1 \nabla_{x_t} \log p(y \mid x_t) + w_2 \nabla_{x_t} \log p(d_t \mid x_t, y)
\]
Translated into the actual update step during generation, the final prediction \(\tilde{\mathbf{u}}^+\) is a weighted combination of three terms:

\[
\tilde{\mathbf{u}}^+ = \mathbf{u}^+_\theta(x_t, d_t, y) + w_1 \left( \mathbf{u}^+_\theta(x_t, d_t, y) - \mathbf{u}^+_\theta(x_t, d_t, \varnothing) \right) + w_2 \left( \mathbf{u}^+_\theta(x_t, d_t, y) - \mathbf{u}^+_\theta(x_t, \varnothing, y) \right)
\]
This formula tells the model:
- Start with the joint prediction (Text + Video + Motion).
- Move away from the prediction that ignores the text (Unconditional).
- Move away from the prediction that ignores the motion (Appearance-only).
This effectively “steers” the generation trajectory. If the model starts generating a person walking, the “motion branch” of the network predicts the optical flow of a walk. Inner-Guidance then forces the pixel generation to adhere to that flow, preventing the legs from merging or the person from sliding unnaturally.
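Concretely, one generation step could combine three forward passes as sketched below, assuming a hypothetical model callable that takes video latents, flow latents, a timestep, and a text embedding; the guidance scales and the `null_text` / `null_flow` placeholders (e.g., learned null embeddings or zeroed latents) are illustrative, not the paper's exact settings.

```python
import torch

def inner_guidance_step(model, x_t, d_t, text_emb, null_text, null_flow,
                        t, w1=7.5, w2=3.0):
    """Inner-Guidance sketch: three forward passes (joint, text-dropped,
    motion-dropped); the final prediction pushes toward the joint term
    and away from the other two."""
    u_joint, _ = model(x_t, d_t, t, text_emb)        # text + video + motion
    u_no_text, _ = model(x_t, d_t, t, null_text)     # prediction ignoring the text
    u_no_motion, _ = model(x_t, null_flow, t, text_emb)  # appearance-only
    return (u_joint
            + w1 * (u_joint - u_no_text)
            + w2 * (u_joint - u_no_motion))
```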
Experimental Results
The researchers fine-tuned two versions of a DiT model (4 billion and 30 billion parameters) using VideoJAM. They compared these against the base models and leading proprietary models like Sora, Kling, and Gen-3.
Qualitative Comparison
The visual results are striking. In the figure below, compare the VideoJAM results (right column) with the baselines.
Figure 6. Qualitative Comparison. Row 1: The prompt “pull-ups” causes DiT to fail (person facing wrong way) and Kling to hallucinate structures. VideoJAM gets the physics right. Row 2: “Giraffe running.” Sora generates backward motion; VideoJAM generates a natural gait. Row 3: “Headstand.” Baselines struggle with limb placement and gravity.
VideoJAM consistently produces videos where limbs are connected correctly, gravity is respected, and repetitive motions (like a roulette wheel) remain stable.
Quantitative Benchmarks
The team created VideoJAM-bench, a dataset specifically designed to stress-test motion generation (gymnastics, physics interactions, rotations). They evaluated models using VBench (automatic metrics) and human evaluation.
Table 1: 4B Parameter Model Comparison

Even the smaller VideoJAM-4B model achieves a 93.7 motion score, significantly outperforming CogVideo-5B (a larger model).
Table 2: 30B Parameter Model Comparison

The results for the 30B model are even more impressive. VideoJAM-30B achieves a 92.4 motion score, beating Sora (91.7) and Kling 1.5 (87.1). In human evaluations (preference votes), VideoJAM was preferred for motion quality over every competitor.
Ablation Studies
Does the Inner-Guidance really matter? Or is the training enough? The authors tested this by turning off components.
Table 3. Ablation Study.
Removing Inner-Guidance (setting \(w_2=0\)) drops the motion score significantly. Interestingly, removing the optical flow signal during inference (“w/o optical flow”) hurts performance the most, proving that the joint inference process is crucial.
Limitations
VideoJAM is a major step forward, but it is not magic. The authors note two main limitations:
- Zoomed-out Motion: Because the model relies on optical flow, if an object is very small (like a skydiver seen from far away), the flow magnitude is tiny. The model struggles to extract a meaningful signal, leading to incoherent motion.
- Complex Interaction Physics: While the model understands motion better, it doesn’t have an explicit physics engine. Complex interactions, like a foot striking a soccer ball, can still have timing or contact errors.
Figure 7. Limitations. (a) Zoom-out scenarios limit motion signal. (b) Fine-grained object interactions remain challenging.
Conclusion
VideoJAM highlights a critical insight in generative AI: Data scale isn’t everything. Simply throwing more videos at a transformer doesn’t guarantee it will learn physics.
By identifying the bias in the standard training objective—the dominance of appearance over motion—the researchers were able to design a targeted solution. VideoJAM teaches models to “see” motion through a joint representation and uses that knowledge to self-correct during generation.
The result is a framework that can be applied to any video model to significantly boost temporal coherence, producing videos that don’t just look real, but move plausibly too. As we look toward the future of world simulators and AI video, explicit priors like those in VideoJAM will likely play a central role in crossing the uncanny valley.