InstructMove: How Watching Videos Teaches AI to Perform Complex Image Edits

The field of text-to-image generation has exploded in recent years. We can now conjure hyper-realistic scenes from a simple sentence. However, a significant challenge remains: editing. Once an image is generated (or if you have a real photo), how do you change specific elements—like making a person smile or rotating a car—without destroying the rest of the image’s identity?

Current state-of-the-art methods often rely on synthetic datasets. They train models on AI-generated images paired with AI-generated instructions. While this works for style transfer or adding new objects, it fails significantly when asked to perform “non-rigid” edits—changes that involve complex physical movements, such as a dog turning its head or a person changing their pose.

In this post, we dive deep into InstructMove, a research paper that proposes a refreshing solution: instead of learning from synthetic data, AI should learn by watching real videos. By observing how things naturally move and change perspective in video clips, the model learns to perform complex, realistic manipulations that previous models simply cannot handle.

Figure 1: Instruction-based Image Manipulation examples showing pose changes, expression changes, and camera movement.

As shown in Figure 1, the proposed model excels at tasks that require understanding 3D geometry and semantics, such as “Lower the horse’s head” or “Change the view to the side,” while keeping the subject’s identity perfectly intact.

The Problem with Synthetic Training Data

To train an AI to edit images, you need a massive dataset of “triplets”:

  1. Source Image: The original photo.
  2. Instruction: Text telling the AI what to do (e.g., “Make the cat sleep”).
  3. Target Image: The result after the edit.
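
To make the data format concrete, here is a minimal sketch of how such a triplet might be represented in code. The field names are illustrative, not the authors' schema.

```python
from dataclasses import dataclass

@dataclass
class EditTriplet:
    """One training example for instruction-based editing (illustrative fields)."""
    source_image_path: str  # the original photo
    instruction: str        # e.g., "Make the cat sleep"
    target_image_path: str  # the photo after the edit
```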

Obtaining these triplets at scale is difficult. You cannot simply go out and photograph a cat in the exact same lighting conditions twice, once awake and once asleep, without the background changing or the cat moving away.

Because collecting real data is hard, researchers turned to synthetic data. Models like InstructPix2Pix bootstrapped their training data using other AI models (like GPT-3 and Stable Diffusion) to generate these triplets. While this was a breakthrough, it introduced a “synthetic ceiling.”

Figure 2: Comparison of editing failures in existing methods like InstructPix2Pix and MagicBrush.

As illustrated in Figure 2, models trained on synthetic data struggle with realism. When asked to “Put the toy’s legs together,” existing methods often hallucinate artifacts, fail to move the object correctly, or simply blur the details. This is because the training data itself lacked the natural dynamics of the real world. Synthetic data is often static or stylistically consistent, lacking the complex physical transformations found in reality.

The Core Insight: Videos as Natural Supervision

The authors of InstructMove identified a rich, untapped resource for learning image manipulation: Internet videos.

Video frames inherently solve the identity preservation problem. If you take two frames from a video of a person walking:

  1. Content Consistency: It is the same person, wearing the same clothes, in the same environment.
  2. Natural Dynamics: The differences between Frame A and Frame B represent realistic physical changes (pose, camera angle, expression).

If an AI can look at Frame A (Source) and Frame B (Target), and understand the difference, it can learn to replicate those changes. The missing piece of the puzzle is the Instruction. A video doesn’t come with text saying “The person turned left.”

The Data Construction Pipeline

To bridge this gap, the researchers developed a novel pipeline leveraging Multimodal Large Language Models (MLLMs), such as GPT-4o or LLaVA. These advanced language models can “see” images and describe them.

Figure 3: The data construction pipeline (frame selection, instruction generation via MLLM, and triplet formation).

The pipeline, visualized in Figure 3, operates in three distinct steps:

  1. Frame Selection: The system samples pairs of frames \((I^s, I^e)\) from videos and filters them carefully. If the frames are too similar (no movement), they are useless; if they are too different (a cut to a different scene), they are also useless. The system uses optical flow (motion tracking) to ensure there is “moderate” movement: enough to represent a meaningful edit, but not so much that the context is lost (a code sketch of this filtering follows the list).

  2. Instruction Generation: The selected pair is fed into an MLLM. The prompt asks the MLLM to analyze the differences and generate a precise editing instruction. For example, the MLLM sees a woman reading in Frame A and looking up in Frame B, and generates the text: “Adjust the woman’s gaze from the book to looking directly at the camera.”

  3. Triplet Creation: The result is a high-quality, real-world dataset containing the source frame, the target frame, and the generated instruction.
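
Here is a minimal sketch of steps 1 and 2 under stated assumptions: optical flow is estimated with OpenCV’s Farneback method and the mean flow magnitude is thresholded to keep “moderate” motion, and the instruction is generated by calling GPT-4o through the OpenAI chat API. The thresholds and prompt text are illustrative guesses, not the paper’s actual filtering criteria or prompts.

```python
import cv2
import numpy as np
from openai import OpenAI

def mean_flow_magnitude(frame_a: np.ndarray, frame_b: np.ndarray) -> float:
    """Average optical-flow magnitude between two video frames (Farneback flow)."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return float(np.linalg.norm(flow, axis=2).mean())

def keep_pair(frame_a: np.ndarray, frame_b: np.ndarray,
              low: float = 1.0, high: float = 20.0) -> bool:
    """Keep only pairs with 'moderate' motion; thresholds are illustrative."""
    motion = mean_flow_magnitude(frame_a, frame_b)
    return low < motion < high

def generate_instruction(client: OpenAI, source_b64: str, target_b64: str) -> str:
    """Ask an MLLM to phrase the source-to-target change as an editing instruction."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe, as one imperative editing instruction, how the "
                         "first image should be changed to match the second."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{source_b64}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{target_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```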

This method allows the creation of a massive dataset (6 million pairs) that captures non-rigid transformations (bending, smiling, moving) and viewpoint changes (camera panning), which were largely absent from previous datasets.

Table 1: Comparison of the InstructMove dataset with InstructPix2Pix, MagicBrush, and others.

Table 1 highlights the disparity between this new approach and previous attempts. InstructMove is the only large-scale dataset that utilizes Real Target images while supporting complex non-rigid and viewpoint edits.

The Architecture: Spatial Conditioning

Having a great dataset is only half the battle. The authors also introduced a clever architectural change to the diffusion model to better utilize this data.

Most instruction-based editing models use Channel Conditioning. In this standard setup, the reference image (the source) is stacked on top of the noise input like layers in a sandwich. This forces the model to align the source and target spatially pixel-by-pixel. While good for color correction, it is terrible for structural changes. If you want to move a dog from the left to the right, channel conditioning “anchors” the dog to the left side because the reference pixels are there.

The Solution: Spatial Concatenation

InstructMove introduces Spatial Conditioning. Instead of stacking the images in the depth (channel) dimension, they concatenate the source image and the noisy target latent side-by-side (along the width dimension).

Figure 4: Overview of the model architecture showing spatial concatenation of latents.

As shown in Figure 4, the process works as follows:

  1. The source image \(I^s\) and target image \(I^e\) are encoded into latents \(z^s\) and \(z^e\).
  2. The target latent is noised to create \(z^e_t\).
  3. The model input is formed by placing \(z^s\) and \(z^e_t\) next to each other.
  4. This wide input is fed into the U-Net.
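
To make the two conditioning schemes concrete, here is a minimal PyTorch sketch contrasting channel concatenation with the spatial (width-wise) concatenation described above. The tensor shapes and variable names are assumptions for illustration, not the authors’ implementation.

```python
import torch

# Assumed Stable Diffusion-style latent shape: (batch, channels, height, width).
z_s   = torch.randn(1, 4, 64, 64)  # encoded source latent z^s
z_e_t = torch.randn(1, 4, 64, 64)  # noised target latent z^e_t at timestep t

# Channel conditioning (InstructPix2Pix-style): stack along the channel axis.
# Source and target pixels are forced to share the same x,y positions.
channel_input = torch.cat([z_e_t, z_s], dim=1)   # shape (1, 8, 64, 64)

# Spatial conditioning (InstructMove): place the latents side by side along width.
# The U-Net sees one wide latent; attention can relate any source location to
# any target location without a pixel-aligned correspondence.
spatial_input = torch.cat([z_s, z_e_t], dim=3)   # shape (1, 4, 64, 128)

# During training, the denoising loss is computed only on the target (right) half:
target_half = spatial_input[..., 64:]            # corresponds to z^e_t
```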

Why does this work better? By placing the images side by side, the model relates the source and the target through self-attention over the widened latent. The network can “look” at the source image on the left to understand the identity of the object, but it is not forced to align pixels to the same \(x,y\) coordinates on the right. This grants the model the flexibility to move objects, rotate heads, or shift camera angles while still having full access to the source appearance.

The training objective is a standard denoising loss, but calculated only on the target half of the output:

The loss function equation.
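
The equation image is not reproduced here; based on the surrounding description, it should take the standard denoising form, restricted to the target half of the concatenated latent (a sketch, so the paper’s exact notation may differ):

\[
\mathcal{L} = \mathbb{E}_{z^e,\, \epsilon \sim \mathcal{N}(0, I),\, t}\Big[ \big\| \epsilon - \epsilon_\theta\big(z^e_t;\ z^s, C, t\big) \big\|_2^2 \Big]
\]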

Here, the model learns to predict the noise \(\epsilon\) added to the target, conditioned on the source image and the text instruction \(C\).

Going Beyond Text: Precise Control

While text instructions are powerful, they can be ambiguous. “Make the man look to the side” doesn’t specify which side or exactly how far. To address this, InstructMove integrates seamlessly with additional control mechanisms.

Masking

The model supports mask-based editing for localization. If you only want to edit a specific face in a crowd, you can provide a mask \(m\). The latent update blends the preserved background with the edited foreground:

The masking equation for localized editing.
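
The equation image is not reproduced here; a common formulation of this latent blending (as used in blended/masked diffusion editing, and plausibly what is intended here) is:

\[
z_t \leftarrow m \odot z_t^{\mathrm{edit}} + (1 - m) \odot z_t^{\mathrm{source}}
\]

where \(z_t^{\mathrm{edit}}\) is the model’s denoised latent at step \(t\), \(z_t^{\mathrm{source}}\) is the source latent noised to the same step, and \(\odot\) is element-wise multiplication, so everything outside the mask is kept from the source.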

ControlNet Integration

Because InstructMove maintains the underlying structure of standard diffusion models (like Stable Diffusion), it is compatible with ControlNet. This allows users to provide “spatial guides”—like a skeleton pose or a scribbled sketch—alongside the text instruction.
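
InstructMove itself is not distributed as an off-the-shelf diffusers pipeline (an assumption here), but the general pattern of attaching a ControlNet to a Stable Diffusion backbone looks like the sketch below; the checkpoints, pose image, and prompt are placeholders.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Attach an openpose-conditioned ControlNet to a Stable Diffusion 1.5 backbone.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# A skeleton/pose image acts as the spatial guide alongside the text prompt.
pose_image = load_image("pose_guide.png")  # placeholder path
result = pipe(
    prompt="a person raising their right arm",
    image=pose_image,
    num_inference_steps=30,
).images[0]
result.save("controlled_edit.png")
```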

Figure 6: Qualitative results showing mask usage and ControlNet integration.

In Figure 6, we see two powerful examples:

  • (a) Localized Editing: Using a mask to “Have the boy give a thumbs-up” ensures only his hand is modified, leaving his face and clothes untouched.
  • (b) ControlNet: A sketch of a rotated banana guides the AI to “Rotate the banana 45 degrees,” achieving a precise geometric transformation that text alone might struggle to describe.

Experimental Results

The researchers compared InstructMove against state-of-the-art baselines, including InstructPix2Pix, MagicBrush, and Zero-Shot methods like MasaCtrl. Because previous benchmarks focused on style transfer, the authors created a new benchmark specifically for non-rigid, structural edits.

Quantitative Analysis

The evaluation relied on two main metrics:

  1. CLIP-Inst: How well does the image change match the text instruction?
  2. CLIP-I: How well is the source identity preserved?
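
The paper’s exact metric definitions are not reproduced here, but as a rough sketch of how CLIP-based scores of this kind are typically computed (using the Hugging Face CLIP model; this is an assumption, not the authors’ evaluation code, and CLIP-Inst in particular may be defined differently, e.g. as a directional similarity):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(source: Image.Image, edited: Image.Image, instruction: str):
    """Rough CLIP-I / CLIP-Inst style scores via cosine similarity of CLIP embeddings."""
    inputs = processor(text=[instruction], images=[source, edited],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    clip_i = (img_emb[0] @ img_emb[1]).item()     # source vs. edited image similarity
    clip_inst = (img_emb[1] @ txt_emb[0]).item()  # edited image vs. instruction text
    return clip_i, clip_inst
```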

Table 2: Quantitative comparison showing InstructMove outperforming baselines.

Table 2 shows that InstructMove achieves the best balance. Some methods (like InstructPix2Pix) score high on identity preservation (CLIP-I), but often only because they fail to perform the edit at all, leaving the image essentially unchanged. InstructMove achieves the highest instruction-alignment scores while maintaining high image fidelity.

Human evaluation tells an even clearer story. Users were asked to pick the best edit from various models.

Table 3: Human preference results showing an 87.62% preference for InstructMove.

As Table 3 indicates, human evaluators preferred InstructMove a staggering 87.62% of the time.

Qualitative Comparison

The visual differences are striking.

Figure 5: Qualitative comparison grid; InstructMove succeeds at complex edits where others fail.

In Figure 5, look at the “Raise arm on bear doll” row (Top).

  • InstructPix2Pix and MagicBrush fail to move the arm significantly or introduce blur.
  • MasaCtrl changes the image style entirely or fails to isolate the arm.
  • InstructMove (Ours) cleanly raises the bear’s arm while keeping the texture and background consistent.

Similarly, in the “Make dog look at camera” row, InstructMove is the only model that convincingly rotates the dog’s head in 3D space without distorting its features.

Ablation Studies: Does the Data Matter?

The authors performed ablation studies to prove that both the video dataset and the spatial conditioning were necessary.

Figure 7: Ablation study visual comparison.

Figure 7 (top section) compares three versions:

  1. SC + IP2P data: Spatial Conditioning trained on the synthetic InstructPix2Pix dataset. It creates horrific artifacts (look at the woman’s face).
  2. CC + Our data: Channel Conditioning trained on the new video dataset. It works better but fails to fully realize the smile.
  3. SC + Our data: The proposed method. It produces a natural, high-quality smile.

This confirms that real-world video data provides the necessary realism, while spatial conditioning provides the architectural flexibility to implement those changes.

Table 4: Ablation numerical results.

Table 4 reinforces this numerically, showing that removing either the dataset or the spatial architecture leads to a drop in performance.

Limitations and Conclusion

Despite its success, InstructMove is not perfect. The model relies on the quality of the MLLM instructions. If the MLLM hallucinates or misses a detail in the video pair, the model learns incorrect associations.

Additionally, because the model is trained on realistic videos, it sometimes struggles with purely artistic or abstract edits that don’t occur in the physical world (like turning a dog into a cyborg). There are also occasional issues with unintended viewpoint shifts, where the model might move the camera slightly even when not asked to.

Limitations showing unintended viewpoint changes and isolation issues.

Summary: InstructMove represents a significant step forward in generative AI. By shifting the training paradigm from synthetic data to naturally occurring video dynamics, the authors have unlocked the ability for AI to understand and manipulate the physical world more effectively. It demonstrates that for an AI to learn how to edit a static image, it helps to watch how the world moves.