Introduction
In the rapidly evolving world of Generative AI, we have moved quickly from creating static images to generating full-motion video. Tools like Sora, Runway, and Stable Video Diffusion have shown us that AI can dream up dynamic scenes. However, for these tools to be useful in professional workflows—like filmmaking, game design, or VR—random generation isn’t enough. We need control.
Specifically, we need to tell the AI exactly where an object should move. This concept, known as “drag-based interaction,” allows users to click a point on an image and drag it to a new location, signaling the model to animate that movement.
But there is a major flaw in current methods: they treat the world as a flat 2D surface.
If you draw a line moving “up” on a photo of a street, does that mean the car flies into the sky, or does it drive further down the road into the distance? Existing models struggle with this ambiguity. They often fail to handle depth (objects getting smaller as they move away) and occlusion (objects passing behind others).
Enter LeviTor.

As shown in Figure 1, LeviTor is a new approach that introduces a third dimension to trajectory control. By integrating depth information into the generation process, it allows users to orchestrate complex 3D movements—like a hot air balloon drifting behind a skyscraper or a car driving into the horizon—using simple interactions.
In this post, we will deconstruct how LeviTor works, moving from the high-level concepts to the mathematical engine under the hood.
Background: The Limits of 2D Control
Before diving into LeviTor, it helps to understand the baseline. Most “Image-to-Video” (I2V) models that offer trajectory control rely on 2D coordinates. You give the model a starting point \((x_1, y_1)\) and an ending point \((x_2, y_2)\).
The model tries to hallucinate frames that bridge this gap. However, without an understanding of 3D space, the model creates “physically impossible” videos. An object might shrink when it should stay the same size, or it might slide over an obstacle when it should go behind it.
The researchers behind LeviTor identified that to solve this, we don’t necessarily need a massive dataset of 3D scans. Instead, we can augment existing video data with depth maps and segmentation masks to teach the AI how objects behave in 3D space.
The Core Method: LeviTor
The LeviTor architecture is built upon Stable Video Diffusion (SVD), but it introduces a novel way of representing control signals. The method addresses two main challenges:
- Training: How do we teach a model 3D motion when most video datasets are just 2D pixels?
- Inference: How do we make it easy for a human user to input complex 3D paths without needing 3D modeling software?
1. The Control Signal: K-Means and Depth
The first innovation is how the researchers represent an object. A simple bounding box or a single center point is too crude to capture complex motion like rotation or deformation. Conversely, a dense pixel-wise mask is too heavy to compute and hard for users to manipulate.
LeviTor finds a middle ground using K-means clustering.

As illustrated in Figure 2 above, the system takes the segmentation mask of an object (like a car or motorbike) and clusters its pixels into \(N\) representative points.
- Why is this brilliant? Notice the motorbike in the image. As it moves closer to the camera, the points spread out. This “spread” implicitly encodes the object’s size and distance.
- When the car is occluded (partially hidden), the distribution of the cluster points shifts, implicitly signaling the occlusion.
The algorithm extracts these points mathematically as:
\[
\{(x^i_t,\ y^i_t)\}_{i=1}^{N} = \mathrm{KMeans}(M_t,\ N)
\]
Where \(M_t\) is the object mask at frame \(t\), and \((x^i_t, y^i_t)\) are the coordinates of the \(i\)-th control point.
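To make this concrete, here is a minimal sketch (not the authors’ exact implementation) of clustering a binary mask into \(N\) control points with scikit-learn; the function name and arguments are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def mask_to_control_points(mask: np.ndarray, n_points: int) -> np.ndarray:
    """Cluster the pixels of a binary object mask into n_points (x, y) control points."""
    ys, xs = np.nonzero(mask)                      # coordinates of pixels inside the mask
    coords = np.stack([xs, ys], axis=1).astype(np.float32)
    km = KMeans(n_clusters=n_points, n_init=10, random_state=0).fit(coords)
    return km.cluster_centers_                     # (n_points, 2) array of (x, y) centers
```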
Adding the Z-Axis
To make these points 3D-aware, the researchers use a pre-trained depth estimation model (DepthAnythingV2). They sample the depth value at each of these cluster points.

This results in a rich trajectory representation \(\mathcal{T}\) that includes the X, Y, and Depth (Z) for every key point on the object across time:
\[
\mathcal{T} = \big\{\, (x^i_t,\ y^i_t,\ d^i_t) \ \big|\ i = 1, \dots, N;\ t = 1, \dots, L \,\big\}
\]

where \(d^i_t\) is the depth sampled under the \(i\)-th point at frame \(t\), and \(L\) is the number of frames.
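Continuing the sketch above, attaching depth to each cluster point is a simple lookup into the per-frame depth map (here `depth_map` stands in for a DepthAnythingV2 prediction; stacking the resulting \((x, y, d)\) triples over frames gives \(\mathcal{T}\)):

```python
import numpy as np

def sample_depth_at_points(points_2d: np.ndarray, depth_map: np.ndarray) -> np.ndarray:
    """Attach the estimated depth under each (x, y) control point, giving (x, y, d) triples."""
    xs = np.clip(points_2d[:, 0].round().astype(int), 0, depth_map.shape[1] - 1)
    ys = np.clip(points_2d[:, 1].round().astype(int), 0, depth_map.shape[0] - 1)
    d = depth_map[ys, xs]
    return np.concatenate([points_2d, d[:, None]], axis=1)   # shape (n_points, 3)
```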
2. The Training Pipeline
Now that we have a way to describe 3D motion, how is the model trained? The researchers used the SA-V (Segment Anything Video) dataset, which provides high-quality masks for objects in videos.
The training process is visualized below:

Here is the step-by-step flow:
- Input Video: Take a raw video clip.
- Segmentation: Use ground truth masks to identify objects.
- Clustering: Apply the K-means logic to turn masks into sparse “Instance Points.”
- Note: The number of points \(k\) is dynamic. Larger objects get more points, and if an object changes size drastically (moves fast in Z-space), the system ensures enough points remain to track it.
- The dynamic \(k\) is governed by the object’s mask area \(S\) and then refined to stay stable during large zooms (a rough heuristic sketch follows this list).
- Depth Estimation: Generate relative depth maps for the video.
- Signal Fusion: Combine the 2D tracks and depth values into a “control signal” fed into the diffusion model via a ControlNet-like adapter.
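The paper defines exact formulas for this dynamic point count; the sketch below only captures the qualitative idea (“more points for bigger masks, with a floor so shrinking objects keep enough points”), and the constants are placeholder values of mine, not the authors’:

```python
import numpy as np

def dynamic_point_count(mask: np.ndarray,
                        points_per_kpixel: float = 0.5,
                        k_min: int = 2,
                        k_max: int = 16) -> int:
    """Illustrative heuristic: scale the number of control points with mask area S,
    clamped so small or fast-shrinking objects still keep a few points."""
    area = int(mask.sum())                           # S: number of foreground pixels
    k = round(points_per_kpixel * area / 1000.0)     # grow k roughly with area
    return int(np.clip(k, k_min, k_max))             # enforce a floor and a ceiling
```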
The model is then trained with the standard diffusion objective: it learns to predict (and remove) the noise added to the video latents, conditioned on this new 3D trajectory signal.
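In the generic \(\epsilon\)-prediction form (SVD itself uses an EDM-style denoising objective, so treat this as schematic rather than the paper’s exact loss), the conditional objective is:

\[
\mathcal{L} = \mathbb{E}_{z_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}
\Big[ \big\| \epsilon - \epsilon_\theta(z_t,\ t,\ c_{\text{img}},\ c_{\mathcal{T}}) \big\|^2 \Big]
\]

where \(z_t\) is the noised video latent, \(c_{\text{img}}\) is the conditioning image, and \(c_{\mathcal{T}}\) is the 3D trajectory control signal injected through the ControlNet-like adapter.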

3. The Inference Pipeline: Making it Usable
This is where LeviTor shines for the end-user. A user doesn’t want to manually plot the paths of 8 different cluster points. They want to draw one line.
The authors designed an interactive pipeline that translates simple user inputs into the complex signals the model expects.

The User Workflow:
- Select Object: The user clicks an object in the first frame (automatically segmented by SAM).
- Draw Trajectory: The user draws a 2D path on the screen.
- Adjust Depth: The user indicates relative depth (e.g., “start at depth 1.0, end at depth 1.4”).
The System’s “Imagination” Step: The system needs to convert that single user-drawn line into the multi-point cluster signal used in training. To do this, it performs a 3D Projection and Rendering simulation.

As shown in Figure 5, the system takes the pixels of the selected object and “lifts” them into 3D space using the estimated depth map.
- Project to 3D: Convert 2D pixels \((x, y)\) and depth \(d\) into 3D camera coordinates \((X, Y, Z)\).

- Apply Motion: Apply the user’s trajectory movement \(\mathbf{T}\) to these 3D points.
- Render Back to 2D: Project the moved 3D points back onto a 2D canvas to create a “virtual mask” for the next frame.

Once the system has these “virtual masks” for the future frames, it runs K-means on them. This generates the dense, multi-point control signals that tell the diffusion model exactly how the object should deform, scale, and move to look realistic.
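A minimal sketch of that lift-move-reproject step, assuming a simple pinhole camera with focal length \(f\) and the principal point at the image center (a simplification on my part, not necessarily the paper’s exact camera model), might look like:

```python
import numpy as np

def lift_move_reproject(mask: np.ndarray, depth_map: np.ndarray,
                        motion_xyz: tuple, f: float = 500.0) -> np.ndarray:
    """Lift masked pixels into 3D using their depth, apply a 3D displacement derived
    from the user's trajectory, and project them back into a 'virtual mask'."""
    h, w = depth_map.shape
    cx, cy = w / 2.0, h / 2.0
    ys, xs = np.nonzero(mask)
    d = depth_map[ys, xs]

    # Back-project: pixel (x, y) with depth d -> camera coordinates (X, Y, Z)
    X = (xs - cx) * d / f
    Y = (ys - cy) * d / f
    Z = d

    # Apply the user's 3D displacement for this frame
    dx, dy, dz = motion_xyz
    X, Y, Z = X + dx, Y + dy, np.maximum(Z + dz, 1e-6)   # keep Z positive

    # Reproject to the image plane and rasterize the virtual mask
    u = np.clip((f * X / Z + cx).round().astype(int), 0, w - 1)
    v = np.clip((f * Y / Z + cy).round().astype(int), 0, h - 1)
    virtual_mask = np.zeros((h, w), dtype=bool)
    virtual_mask[v, u] = True
    return virtual_mask
```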
Experiments & Results
Does adding depth actually make a difference? The researchers compared LeviTor against state-of-the-art baselines like DragAnything and DragNUWA.
Qualitative Comparison
The results, shown in Figure 6, highlight the limitations of pure 2D methods.

- Row 1 (Tornado): The user wants the tornado to sweep across the landscape. LeviTor keeps the tornado distinct from the background. Other methods often blur the tornado or mistakenly move the entire camera, confusing object motion with camera panning.
- Row 2 (Planets): This is a classic perspective test. As a planet moves “forward,” it should get larger. LeviTor (left) handles this scale change naturally because of the depth signal. DragNUWA (right) moves the planet but fails to scale it effectively, breaking the illusion of 3D.
- Bottom Right (Orbit): The most complex case is “orbiting,” where an object passes in front of and then behind another. Because LeviTor models the trajectory in 3D space, it correctly handles the occlusion—the object disappears when it goes behind the vase and reappears when it loops around. 2D methods simply slide the object over the top like a sticker.
Quantitative Metrics
The team measured success using FID (Fréchet Inception Distance, per-frame image quality), FVD (Fréchet Video Distance, temporal coherence), and ObjMC (Object Motion Consistency, how closely the generated object follows the requested path).

LeviTor (Ours) significantly outperforms the baselines in FID and FVD. Lower scores are better here.
- FVD (Single-Point): LeviTor scores 226.45 vs. DragNUWA’s 330.17. This is a massive jump in video quality and temporal consistency.
- FVD (Multi-Point): When using multiple control points, the score drops further to 190.44, proving that the cluster-based control signal is highly effective.
Ablation Studies: Why the Components Matter
The researchers performed ablation studies to prove that every part of their complex pipeline was necessary.
The Role of Instance and Depth Info
What happens if you feed the model the points but not the depth values? Or if you forget to tell the model which object the points belong to (Instance ID)?

- w/o Instance: The model gets confused about the boundaries of the object, leading to blurry edges (Middle row).
- w/o Depth: The model loses the ability to render crisp edges during movement, as it lacks the geometric context of where the object sits in space (Bottom row).
The Importance of Point Density
How many points do you need?
- Too few points: The object moves, but it might wobble or deform unnaturally because the control is too loose.
- Too many points: The object becomes rigid. It simply translates across the screen without the subtle natural deformations (like a puppy’s legs moving while running).

Figure 8 shows this trade-off. With the right scale of points (Scale = 1.0 or 2.0), the puppy runs naturally. If the control is too dense, the puppy essentially “slides” across the grass as a static image.
Quantitative data backs this up:

The full LeviTor model (Checkmarks on Depth and Instance) yields the lowest (best) scores across the board.
Conclusion
LeviTor represents a significant step forward in controllable video generation. By acknowledging that video is a 2D projection of a 3D world, the authors have built a system that respects the physics of depth and perspective.
Key Takeaways:
- Depth is data: You don’t need 3D models to get 3D effects; estimated depth maps + segmentation masks are powerful proxies.
- Clusters > Points: Representing an object as a cluster of points allows for tracking size changes and deformation much better than a single centroid.
- Simulation before Generation: The inference pipeline—simulating the 3D movement of the mask before asking the AI to generate pixels—bridges the gap between user intent and model requirements.
While LeviTor still relies on the quality of the underlying segmentation (SAM) and depth (DepthAnything) models, it paves the way for tools that will allow creators to direct AI video scenes with the precision of a 3D animator, all from a standard 2D interface.
The images and data presented in this article are derived from the research paper “LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis”.