Introduction
In the rapidly evolving world of Generative AI, we have moved quickly from creating static images to generating full-motion video. Tools like Sora, Runway, and Stable Video Diffusion have shown us that AI can dream up dynamic scenes. However, for these tools to be useful in professional workflows—like filmmaking, game design, or VR—random generation isn’t enough. We need control.
Specifically, we need to tell the AI exactly where an object should move. This concept, known as “drag-based interaction,” allows users to click a point on an image and drag it to a new location, signaling the model to animate that movement.
But there is a major flaw in current methods: they treat the world as a flat 2D surface.
If you draw a line moving “up” on a photo of a street, does that mean the car flies into the sky, or does it drive further down the road into the distance? Existing models struggle with this ambiguity. They often fail to handle depth (objects getting smaller as they move away) and occlusion (objects passing behind others).
Enter LeviTor.

As shown in Figure 1, LeviTor is a new approach that introduces a third dimension to trajectory control. By integrating depth information into the generation process, it allows users to orchestrate complex 3D movements—like a hot air balloon drifting behind a skyscraper or a car driving into the horizon—using simple interactions.
In this post, we will deconstruct how LeviTor works, moving from the high-level concepts to the mathematical engine under the hood.
Background: The Limits of 2D Control
Before diving into LeviTor, it helps to understand the baseline. Most “Image-to-Video” (I2V) models that offer trajectory control rely on 2D coordinates. You give the model a starting point \((x_1, y_1)\) and an ending point \((x_2, y_2)\).
The model tries to hallucinate frames that bridge this gap. However, without an understanding of 3D space, the model creates “physically impossible” videos. An object might shrink when it should stay the same size, or it might slide over an obstacle when it should go behind it.
The researchers behind LeviTor identified that to solve this, we don’t necessarily need a massive dataset of 3D scans. Instead, we can augment existing video data with depth maps and segmentation masks to teach the AI how objects behave in 3D space.
The Core Method: LeviTor
The LeviTor architecture is built upon Stable Video Diffusion (SVD), but it introduces a novel way of representing control signals. The method addresses two main challenges:
- Training: How do we teach a model 3D motion when most video datasets are just 2D pixels?
- Inference: How do we make it easy for a human user to input complex 3D paths without needing 3D modeling software?
1. The Control Signal: K-Means and Depth
The first innovation is how the researchers represent an object. A simple bounding box or a single center point is too crude to capture complex motion like rotation or deformation. Conversely, a dense pixel-wise mask is too heavy to compute and hard for users to manipulate.
LeviTor finds a middle ground using K-means clustering.

As illustrated in Figure 2 above, the system takes the segmentation mask of an object (like a car or motorbike) and clusters its pixels into \(N\) representative points.
- Why is this brilliant? Notice the motorbike in the image. As it moves closer to the camera, the points spread out. This “spread” implicitly encodes the object’s size and distance.
- When the car is occluded (partially hidden), the distribution of the cluster points shifts, implicitly signaling the occlusion.
The algorithm extracts these points mathematically as:
\[
\{(x^i_t,\ y^i_t)\}_{i=1}^{N} = \mathrm{KMeans}(M_t,\ N)
\]
Where \(M_t\) is the object mask at frame \(t\), and \((x^i_t, y^i_t)\) are the coordinates of the \(i\)-th control point.
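To make this concrete, here is a minimal sketch (not the authors’ exact implementation) of clustering a binary mask into \(N\) control points with scikit-learn; the function name and arguments are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def mask_to_control_points(mask: np.ndarray, n_points: int) -> np.ndarray:
    """Cluster the pixels of a binary object mask into n_points (x, y) control points."""
    ys, xs = np.nonzero(mask)                      # coordinates of pixels inside the mask
    coords = np.stack([xs, ys], axis=1).astype(np.float32)
    km = KMeans(n_clusters=n_points, n_init=10, random_state=0).fit(coords)
    return km.cluster_centers_                     # (n_points, 2) array of (x, y) centers
```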
Adding the Z-Axis
To make these points 3D-aware, the researchers use a pre-trained depth estimation model (DepthAnythingV2). They sample the depth value at each of these cluster points.

This results in a rich trajectory representation \(\mathcal{T}\) that includes the X, Y, and Depth (Z) for every key point on the object across time:
\[
\mathcal{T} = \big\{\, (x^i_t,\ y^i_t,\ d^i_t) \ \big|\ i = 1, \dots, N;\ t = 1, \dots, L \,\big\}
\]

where \(d^i_t\) is the depth sampled under the \(i\)-th point at frame \(t\), and \(L\) is the number of frames.
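Continuing the sketch above, attaching depth to each cluster point is a simple lookup into the per-frame depth map (here `depth_map` stands in for a DepthAnythingV2 prediction; stacking the resulting \((x, y, d)\) triples over frames gives \(\mathcal{T}\)):

```python
import numpy as np

def sample_depth_at_points(points_2d: np.ndarray, depth_map: np.ndarray) -> np.ndarray:
    """Attach the estimated depth under each (x, y) control point, giving (x, y, d) triples."""
    xs = np.clip(points_2d[:, 0].round().astype(int), 0, depth_map.shape[1] - 1)
    ys = np.clip(points_2d[:, 1].round().astype(int), 0, depth_map.shape[0] - 1)
    d = depth_map[ys, xs]
    return np.concatenate([points_2d, d[:, None]], axis=1)   # shape (n_points, 3)
```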
2. The Training Pipeline
Now that we have a way to describe 3D motion, how is the model trained? The researchers used the SA-V (Segment Anything Video) dataset, which provides high-quality masks for objects in videos.
The training process is visualized below:

Here is the step-by-step flow:
- Input Video: Take a raw video clip.
- Segmentation: Use ground truth masks to identify objects.
- Clustering: Apply the K-means logic to turn masks into sparse “Instance Points.”
- Note: The number of points \(k\) is dynamic. Larger objects get more points, and if an object changes size drastically (moves fast in Z-space), the system ensures enough points remain to track it.
- The dynamic \(k\) is governed by the object’s mask area \(S\) and then refined to stay stable during large zooms (a rough heuristic sketch follows this list).
- Depth Estimation: Generate relative depth maps for the video.
- Signal Fusion: Combine the 2D tracks and depth values into a “control signal” fed into the diffusion model via a ControlNet-like adapter.
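The paper defines exact formulas for this dynamic point count; the sketch below only captures the qualitative idea (“more points for bigger masks, with a floor so shrinking objects keep enough points”), and the constants are placeholder values of mine, not the authors’:

```python
import numpy as np

def dynamic_point_count(mask: np.ndarray,
                        points_per_kpixel: float = 0.5,
                        k_min: int = 2,
                        k_max: int = 16) -> int:
    """Illustrative heuristic: scale the number of control points with mask area S,
    clamped so small or fast-shrinking objects still keep a few points."""
    area = int(mask.sum())                           # S: number of foreground pixels
    k = round(points_per_kpixel * area / 1000.0)     # grow k roughly with area
    return int(np.clip(k, k_min, k_max))             # enforce a floor and a ceiling
```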
The model is then trained with the standard diffusion objective: it learns to predict (and remove) the noise added to the video latents, conditioned on this new 3D trajectory signal.
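In the generic \(\epsilon\)-prediction form (SVD itself uses an EDM-style denoising objective, so treat this as schematic rather than the paper’s exact loss), the conditional objective is:

\[
\mathcal{L} = \mathbb{E}_{z_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}
\Big[ \big\| \epsilon - \epsilon_\theta(z_t,\ t,\ c_{\text{img}},\ c_{\mathcal{T}}) \big\|^2 \Big]
\]

where \(z_t\) is the noised video latent, \(c_{\text{img}}\) is the conditioning image, and \(c_{\mathcal{T}}\) is the 3D trajectory control signal injected through the ControlNet-like adapter.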

3. The Inference Pipeline: Making it Usable
This is where LeviTor shines for the end-user. A user doesn’t want to manually plot the paths of 8 different cluster points. They want to draw one line.
The authors designed an interactive pipeline that translates simple user inputs into the complex signals the model expects.

The User Workflow:
- Select Object: The user clicks an object in the first frame (automatically segmented by SAM).
- Draw Trajectory: The user draws a 2D path on the screen.
- Adjust Depth: The user indicates relative depth (e.g., “start at depth 1.0, end at depth 1.4”).
The System’s “Imagination” Step: The system needs to convert that single user-drawn line into the multi-point cluster signal used in training. To do this, it performs a 3D Projection and Rendering simulation.

As shown in Figure 5, the system takes the pixels of the selected object and “lifts” them into 3D space using the estimated depth map.
- Project to 3D: Convert 2D pixels \((x, y)\) and depth \(d\) into 3D camera coordinates \((X, Y, Z)\).

- Apply Motion: Apply the user’s trajectory movement \(\mathbf{T}\) to these 3D points.
- Render Back to 2D: Project the moved 3D points back onto a 2D canvas to create a “virtual mask” for the next frame.

Once the system has these “virtual masks” for the future frames, it runs K-means on them. This generates the dense, multi-point control signals that tell the diffusion model exactly how the object should deform, scale, and move to look realistic.
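A minimal sketch of that lift-move-reproject step, assuming a simple pinhole camera with focal length \(f\) and the principal point at the image center (a simplification on my part, not necessarily the paper’s exact camera model), might look like:

```python
import numpy as np

def lift_move_reproject(mask: np.ndarray, depth_map: np.ndarray,
                        motion_xyz: tuple, f: float = 500.0) -> np.ndarray:
    """Lift masked pixels into 3D using their depth, apply a 3D displacement derived
    from the user's trajectory, and project them back into a 'virtual mask'."""
    h, w = depth_map.shape
    cx, cy = w / 2.0, h / 2.0
    ys, xs = np.nonzero(mask)
    d = depth_map[ys, xs]

    # Back-project: pixel (x, y) with depth d -> camera coordinates (X, Y, Z)
    X = (xs - cx) * d / f
    Y = (ys - cy) * d / f
    Z = d

    # Apply the user's 3D displacement for this frame
    dx, dy, dz = motion_xyz
    X, Y, Z = X + dx, Y + dy, np.maximum(Z + dz, 1e-6)   # keep Z positive

    # Reproject to the image plane and rasterize the virtual mask
    u = np.clip((f * X / Z + cx).round().astype(int), 0, w - 1)
    v = np.clip((f * Y / Z + cy).round().astype(int), 0, h - 1)
    virtual_mask = np.zeros((h, w), dtype=bool)
    virtual_mask[v, u] = True
    return virtual_mask
```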
Experiments & Results
Does adding depth actually make a difference? The researchers compared LeviTor against state-of-the-art baselines like DragAnything and DragNUWA.
Qualitative Comparison
The results, shown in Figure 6, highlight the limitations of pure 2D methods.

- Row 1 (Tornado): The user wants the tornado to sweep across the landscape. LeviTor keeps the tornado distinct from the background. Other methods often blur the tornado or mistakenly move the entire camera, confusing object motion with camera panning.
- Row 2 (Planets): This is a classic perspective test. As a planet moves “forward,” it should get larger. LeviTor (left) handles this scale change naturally because of the depth signal. DragNUWA (right) moves the planet but fails to scale it effectively, breaking the illusion of 3D.
- Bottom Right (Orbit): The most complex case is “orbiting,” where an object passes in front of and then behind another. Because LeviTor models the trajectory in 3D space, it correctly handles the occlusion—the object disappears when it goes behind the vase and reappears when it loops around. 2D methods simply slide the object over the top like a sticker.
Quantitative Metrics
The team measured success using FID (Fréchet Inception Distance, per-frame image quality), FVD (Fréchet Video Distance, temporal coherence), and ObjMC (Object Motion Consistency, how closely the generated object follows the requested path).

LeviTor (Ours) significantly outperforms the baselines in FID and FVD. Lower scores are better here.
- FVD (Single-Point): LeviTor scores 226.45 vs. DragNUWA’s 330.17. This is a massive jump in video quality and temporal consistency.
- FVD (Multi-Point): When using multiple control points, the score drops further to 190.44, proving that the cluster-based control signal is highly effective.
Ablation Studies: Why the Components Matter
The researchers performed ablation studies to prove that every part of their complex pipeline was necessary.
The Role of Instance and Depth Info
What happens if you feed the model the points but not the depth values? Or if you forget to tell the model which object the points belong to (Instance ID)?

- w/o Instance: The model gets confused about the boundaries of the object, leading to blurry edges (Middle row).
- w/o Depth: The model loses the ability to render crisp edges during movement, as it lacks the geometric context of where the object sits in space (Bottom row).
The Importance of Point Density
How many points do you need?
- Too few points: The object moves, but it might wobble or deform unnaturally because the control is too loose.
- Too many points: The object becomes rigid. It simply translates across the screen without the subtle natural deformations (like a puppy’s legs moving while running).

Figure 8 shows this trade-off. With the right scale of points (Scale = 1.0 or 2.0), the puppy runs naturally. If the control is too dense, the puppy essentially “slides” across the grass as a static image.
Quantitative data backs this up:

The full LeviTor model (Checkmarks on Depth and Instance) yields the lowest (best) scores across the board.
Conclusion
LeviTor represents a significant step forward in controllable video generation. By acknowledging that video is a 2D projection of a 3D world, the authors have built a system that respects the physics of depth and perspective.
Key Takeaways:
- Depth is data: You don’t need 3D models to get 3D effects; estimated depth maps + segmentation masks are powerful proxies.
- Clusters > Points: Representing an object as a cluster of points allows for tracking size changes and deformation much better than a single centroid.
- Simulation before Generation: The inference pipeline—simulating the 3D movement of the mask before asking the AI to generate pixels—bridges the gap between user intent and model requirements.
While LeviTor still relies on the quality of the underlying segmentation (SAM) and depth (DepthAnything) models, it paves the way for tools that will allow creators to direct AI video scenes with the precision of a 3D animator, all from a standard 2D interface.
The images and data presented in this article are derived from the research paper “LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis”.