Image-to-Video (I2V) generation is one of the most exciting frontiers in computer vision. The premise is magical: take a single still photograph—a car on a road, a dog in the grass, a castle on a hill—and breathe life into it. You want the car to drive, the dog to roll over, and the camera to zoom out from the castle.

However, if you have played with current I2V diffusion models, you might have encountered a frustrating reality. Often, the “video” is just the input image with a slight wobble, or a “zooming” effect that looks more like a 2D scale than 3D camera movement. Conversely, if the model does generate movement, it often ignores your text prompts completely, creating chaotic motion that has nothing to do with what you asked for.

Why is this so hard? The challenge lies in a tug-of-war between appearance preservation (keeping the car looking like that specific car) and motion generation (making pixels change significantly over time).

In a fascinating new paper, Extrapolating and Decoupling Image-to-Video Generation Models, researchers propose a solution that doesn’t rely on training massive new models from scratch. Instead, they introduce a clever framework using Model Merging—a technique borrowed from Natural Language Processing (NLP)—to surgically separate and amplify motion capabilities.

Let’s dive into how they achieved this and why it might change how we build video generators.

The Problem: The “Static” Video Trap

Current state-of-the-art models often fail in two ways:

  1. Limited Motion Degree: The video is barely a video. It looks like a static GIF.
  2. Poor Controllability: You type “pan left,” but the camera zooms in or stays still.

This happens because models are often over-conditioned on the first frame. They are so afraid of hallucinating new details that they cling tightly to the input image, resulting in minimal movement.

Comparative visualization of I2V generation. Top row shows a car; the baseline ‘Prior’ produces static frames, while ‘Ours’ shows the car moving away. Bottom row shows a castle; ‘Prior’ is static, ‘Ours’ successfully zooms out.

As shown in Figure 1 above, previous methods (labeled “Prior”) often output nearly identical frames regardless of the prompt. In the top row, the car barely moves. In the bottom row, the prompt asks for a “zoom out,” but the frame remains fixed. The proposed method (labeled “Ours”), however, successfully generates the dust cloud behind the moving car and creates a genuine camera zoom effect.

The Solution: A Three-Stage Framework

To fix this, the researchers didn’t just train a bigger model. They engineered a three-stage pipeline designed to decouple “motion” from “appearance.”

The framework is called Extrapolating and Decoupling. It operates on the insight that different parts of the neural network handle different jobs, and by manipulating model weights algebraically, we can boost specific skills.

Overview of the framework. (a) Adaptation injects text into temporal attention. (b) Extrapolation amplifies motion degree. (c) Decoupling separates parameters for selective injection.

As illustrated in Figure 2, the process involves:

  1. Adaptation: Teaching the model to listen to text prompts for motion.
  2. Extrapolation: A mathematical trick to aggressively boost motion magnitude.
  3. Decoupling: Separating these capabilities and injecting them at the right time during generation.

Let’s break these stages down.

Stage 1: Motion-Controllability Adaptation

Most Video Diffusion Models (VDMs) use a U-Net architecture. They have “Spatial Attention” (understanding what is in a frame) and “Temporal Attention” (understanding how frames relate over time).

A flaw in many base models (like DynamiCrafter, which this paper builds upon) is that text prompts are often only fed into the Spatial layers. This means the model knows a “dog” is in the picture, but it doesn’t necessarily use the text to decide how the dog moves.

To fix this, the authors introduce a lightweight Adapter.

Equation showing the Q-Former processing text embeddings.

They use a Q-Former (a component often used to bridge image and text) to compress text embeddings and explicitly inject them into the Temporal Attention modules. By fine-tuning only this adapter and the temporal weights, they teach the model to use the text prompt to guide movement.
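To make the mechanism concrete, here is a minimal PyTorch-style sketch of what such an adapter could look like. The class name, dimensions, and the exact way the compressed tokens are consumed by temporal attention are illustrative assumptions, not the paper’s implementation:

```python
import torch
import torch.nn as nn

class TemporalTextAdapter(nn.Module):
    """Hypothetical sketch: compress text embeddings with a small set of
    learnable queries (Q-Former style) so they can be injected as extra
    context into the temporal attention layers."""

    def __init__(self, text_dim=1024, hidden_dim=320, num_queries=16, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, text_emb):                          # text_emb: (B, L, text_dim)
        ctx = self.text_proj(text_emb)                    # (B, L, hidden_dim)
        q = self.queries.unsqueeze(0).expand(text_emb.size(0), -1, -1)
        motion_tokens, _ = self.cross_attn(q, ctx, ctx)   # (B, num_queries, hidden_dim)
        # These tokens would then be fed to the temporal attention modules as
        # additional context, giving the text prompt a say in how frames evolve.
        return motion_tokens
```

Only this adapter and the temporal attention weights are fine-tuned; the spatial layers stay frozen, which is what keeps the adaptation lightweight.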

The Catch: While this improves control (the model now understands “pan left”), the researchers found that fine-tuning actually reduced the overall amount of motion. The model became “careful” and static. This is a phenomenon known as “degree vanishment.”

Stage 2: Motion-Degree Extrapolation

This is perhaps the most innovative part of the paper. We have a Pre-trained model (\(\theta_{pre}\)) that has decent motion but bad control. We have a Fine-tuned model (\(\theta_{sft}\)) from Stage 1 that has good control but tiny motion.

How do we get big motion? The authors use a technique called Task Vector Extrapolation.

If moving from Pre-trained \(\to\) Fine-tuned reduces motion, then logically, moving in the opposite direction should increase it. But we don’t want to go back to the start; we want to overshoot it.

They define a new model state, \(\theta_{dyn}\), using vector arithmetic:

\[
\theta_{dyn} \;=\; \theta_{pre} \;+\; \alpha\,\bigl(\theta_{pre} - \theta_{sft}\bigr)
\]

Here, \(\alpha\) is a hyperparameter. By subtracting the fine-tuned weights from the pre-trained weights (essentially capturing the “loss of motion”), and adding that back into the pre-trained model with a multiplier, they aggressively “unlearn” the static behavior.

This is a training-free operation. It’s just math on the model weights.
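Because it is just weight arithmetic, the whole operation fits in a few lines. A minimal sketch over PyTorch state dicts (the checkpoint names and the choice of \(\alpha\) are placeholders):

```python
import torch

def extrapolate_motion(theta_pre: dict, theta_sft: dict, alpha: float = 1.0) -> dict:
    """theta_dyn = theta_pre + alpha * (theta_pre - theta_sft), tensor by tensor.

    Parameters that were never fine-tuned are identical in both checkpoints,
    so their difference is zero and they pass through unchanged."""
    theta_dyn = {}
    for name, w_pre in theta_pre.items():
        w_sft = theta_sft.get(name, w_pre)   # keys missing from the SFT dict fall back to w_pre
        theta_dyn[name] = w_pre + alpha * (w_pre - w_sft)
    return theta_dyn

# Usage (hypothetical checkpoint files):
# theta_pre = torch.load("vdm_pretrained.ckpt", map_location="cpu")
# theta_sft = torch.load("vdm_finetuned.ckpt", map_location="cpu")
# theta_dyn = extrapolate_motion(theta_pre, theta_sft, alpha=1.0)
```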

Why does this work? The authors provide a theoretical argument based on a Taylor expansion. Without getting too deep into the calculus, they show that, as long as the fine-tuning update is aligned with the gradient of the motion degree, this extrapolation is mathematically guaranteed not to decrease the motion score.

Derivation showing that the change in motion degree is proportional to the squared norm of the gradient, ensuring a non-negative increase.
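In rough strokes, the argument runs as follows (our simplified notation, first order only): treat the motion degree as a scalar function \(d(\theta)\). If fine-tuning lowered it, the update \(\theta_{sft} - \theta_{pre}\) points roughly along \(-\nabla d(\theta_{pre})\), so

\[
d(\theta_{dyn}) - d(\theta_{pre})
\;\approx\; \alpha\,\nabla d(\theta_{pre})^{\top}\bigl(\theta_{pre} - \theta_{sft}\bigr)
\;\propto\; \alpha\,\bigl\|\nabla d(\theta_{pre})\bigr\|^{2} \;\ge\; 0,
\]

which is exactly the non-negative change in motion degree the derivation above guarantees.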

The result is a model that produces wild, high-magnitude motion. However, it might be too wild, sometimes sacrificing the consistency of the subject.

Stage 3: Decoupling and Dynamic Injection

Now we have pieces of the puzzle, but they are scattered across different model versions:

  • Motion Control is in the Adapter parameters.
  • Motion Degree (dynamics) is in the extrapolated model.
  • Subject Consistency is best preserved in the fine-tuned model.

We need to combine these best parts. The authors use a method called DARE (Drop And REscale) Pruning. This technique randomly drops a percentage of the weight differences (setting them to zero) and rescales the rest. This helps to isolate the essential parameters for a specific task without carrying over “noise” or conflicting information.
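A minimal sketch of DARE applied to a task vector (a dictionary of weight differences); the drop rate shown is an illustrative value, not the paper’s setting:

```python
import torch

def dare_prune(task_vector: dict, drop_rate: float = 0.9) -> dict:
    """Drop And REscale: zero out a random fraction of each delta tensor and
    rescale the survivors by 1 / (1 - drop_rate) so the expected magnitude
    of the task vector is preserved."""
    pruned = {}
    for name, delta in task_vector.items():
        keep_mask = (torch.rand_like(delta) >= drop_rate).to(delta.dtype)
        pruned[name] = delta * keep_mask / (1.0 - drop_rate)
    return pruned

# A task vector is simply "fine-tuned minus base", e.g.:
# delta_con = {k: theta_sft[k] - theta_pre[k] for k in theta_sft if k in theta_pre}
# sparse_delta_con = dare_prune(delta_con, drop_rate=0.9)
```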

They create three parameter sets:

  1. \(\theta_{adt}\): Parameters for Control (from the Adapter).
  2. \(\theta_{deg}\): Parameters for Degree (extracted from the Extrapolated model).
  3. \(\theta_{con}\): Parameters for Consistency (from the Fine-tuned model).

Equations showing the isolation of parameter sets using DARE pruning and masks.

Using Task Arithmetic, they merge these parameters into two specialized models: one optimized for Dynamics (\(\theta_{dyn}^*\)) and one for Consistency (\(\theta_{con}^*\)).

Equations showing the creation of dynamic-enhanced and consistency-enhanced models using Task Arithmetic.
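Conceptually, the merge is again plain weight arithmetic: each specialized model is a base checkpoint plus a weighted sum of pruned task vectors. A hedged sketch (the base checkpoint, the merge coefficients, and exactly which vectors feed which model are assumptions for illustration):

```python
def merge_task_vectors(theta_base: dict, task_vectors: list, coeffs: list) -> dict:
    """Task Arithmetic: theta_star = theta_base + sum_i coeff_i * delta_i."""
    merged = {name: w.clone() for name, w in theta_base.items()}
    for delta, coeff in zip(task_vectors, coeffs):
        for name, d in delta.items():
            merged[name] = merged[name] + coeff * d
    return merged

# Dynamics-enhanced model: control + motion-degree vectors.
# theta_dyn_star = merge_task_vectors(theta_pre, [delta_adt, delta_deg], [1.0, 1.0])
# Consistency-enhanced model: control + consistency vectors.
# theta_con_star = merge_task_vectors(theta_pre, [delta_adt, delta_con], [1.0, 1.0])
```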

The Sampling Strategy: Timing is Everything

The final piece of magic happens when generating the video. Diffusion models produce their output over a series of denoising steps (e.g., 50), moving from pure noise to a clear video.

The researchers observed that the early steps of diffusion determine the high-level structure and motion trajectories (long-term planning), while the later steps refine the details and appearance (subject consistency).

Therefore, they employ a time-dependent switching strategy:

\[
\theta(t) \;=\;
\begin{cases}
\theta_{dyn}^{*}, & T \ge t > T-K \\
\theta_{con}^{*}, & T-K \ge t \ge 0
\end{cases}
\]

  • Steps \(T\) to \(T-K\) (Early): Use the Dynamic Model. Establish big, bold movements and follow the text prompt.
  • Steps \(T-K\) to \(0\) (Late): Switch to the Consistency Model. Refine the pixels to make sure the car looks like a car and the background stays coherent.
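In code, the switch amounts to swapping checkpoints partway through the sampling loop. A minimal sketch with a hypothetical sampler interface (`init_from_image`, `load_weights`, `denoise_step`, and `decode` are placeholder names, and `switch_after` stands in for \(K\)):

```python
def sample_with_switching(sampler, theta_dyn_star, theta_con_star,
                          image, prompt, num_steps=50, switch_after=10):
    """Early denoising steps use the dynamics-enhanced weights to lay down
    motion; the remaining steps use the consistency-enhanced weights to
    lock in appearance."""
    latents = sampler.init_from_image(image)               # hypothetical helper
    for step in range(num_steps):
        weights = theta_dyn_star if step < switch_after else theta_con_star
        sampler.load_weights(weights)                       # hypothetical helper
        latents = sampler.denoise_step(latents, prompt, step)
    return sampler.decode(latents)
```

In practice you would load each checkpoint once and swap only at the switch point rather than on every step; the sketch just keeps the logic explicit.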

Experimental Results

Does this complex merging and switching actually pay off? The results are quite stark.

Quantitative Benchmarks

The authors tested their method on VBench, a comprehensive benchmark for video generation.

Table 1: Quantitative results on VBench. The proposed method (bottom rows) shows a massive increase in Motion Degree compared to baselines.

In Table 1, look at the “Motion Degree” column.

  • The base model (DynamiCrafter) scores 68.54.
  • Standard Fine-Tuning (Naive FT) drops this score to 11.67 (the “static” problem).
  • The proposed method (Ours) rockets the score up to 87.64.

Crucially, this increase in motion doesn’t destroy the video quality. The “Video Quality” and “Subject Consistency” scores remain competitive with or better than state-of-the-art models.

Visual Analysis

The numbers are backed up by the visuals. In Figure 4, the authors compare the generated frames and their optical flow (the visualization of movement).

Figure 4: Visual comparison. The bottom row (Optical Flow) shows the proposed method has significantly more vibrant colors, indicating greater movement than competitors.

  • ConsistI2V and DynamiCrafter show very faint optical flow (the darker images in the bottom rows), indicating little movement.
  • Ours shows bright, colorful optical flow maps, proving that the pixels are actually travelling across the screen to create genuine motion.

We can see the variety of scenarios the model handles in Figure 7, ranging from natural fluids (waves) to rigid bodies (cars).

Figure 7: Examples of generated videos including people on a beach, crashing waves, a dog, and cars driving.

User Preferences

Subjective metrics matter in generative AI. The researchers conducted a user study comparing their output against major competitors like SVD (Stable Video Diffusion) and VideoCrafter.

Figure 3: Bar charts showing human evaluation. The proposed method wins significantly in Motion Degree and Motion Control.

As shown in Figure 3, users overwhelmingly preferred the proposed method (Ours) for Motion Degree and Motion Control. While “Video Quality” is often a toss-up between top models, the ability to actually move things is where this framework shines.

Ablation: Do we need all three stages?

You might wonder whether you could skip the complicated decoupling and simply use the extrapolated model.

Table 3: Ablation study. Using extrapolation alone increases motion degree but harms consistency. The full pipeline balances all metrics.

Table 3 answers this.

  • Adaptation only: Great consistency, terrible motion.
  • Extrapolation only: Huge motion (98.21), but “Motion Control” drops significantly (17.63). The video moves, but not in the way you asked.
  • Full Pipeline: Balances high motion (87.64) with high control (43.87).

Conclusion and Future Implications

The paper “Extrapolating and Decoupling Image-to-Video Generation Models” offers a compelling lesson for AI practitioners: better performance doesn’t always require more training data.

By understanding the internal mechanics of the diffusion process—specifically the trade-off between semantic control and motion magnitude—the researchers were able to use Model Merging to engineer a better result. They treated the neural network weights like building blocks, separating the “motion” blocks from the “appearance” blocks and reassembling them dynamically.

This approach not only solves the specific problem of static AI videos but also opens the door for more modular generative models. Imagine a future where we can plug in a “Cinematography Merge” to a generic video model to get better camera angles, without ever having to retrain the backbone.

For students and researchers, this framework highlights the power of post-training manipulations. Sometimes, the capabilities you are looking for are already inside the model—you just have to perform the right arithmetic to let them out.