Introduction
How do you know how far away an object is? If you close one eye and sit perfectly still, the world flattens. Depth perception becomes a guessing game based on shadows and familiar object sizes. But the moment you move your head, the world pops back into 3D. Nearby objects rush past your vision, while distant mountains barely budge. This phenomenon, known as motion parallax, is a fundamental way biological systems perceive geometry.
In computer vision, extracting 3D depth from a single video source (monocular video) has been a longstanding hurdle. Traditional methods often treat video as a bag of individual images, predicting depth frame-by-frame. The result? “Flickering.” As the video plays, the estimated depth of a wall or a car jitters uncontrollably because the model doesn’t understand that it’s the same object moving through time.
Enter Seurat, a recent paper from KAIST AI and Adobe Research. The authors propose a method that bypasses the need for complex stereo setups or massive labeled datasets. Instead, they strip a video down to its barest geometric essence: the movement of points.

As shown in Figure 1, Seurat takes 2D point tracks—simple lines tracing how pixels move across the screen—and lifts them into 3D space. By analyzing how these points converge, diverge, and swirl over time, Seurat infers depth with remarkable stability.
In this post, we will deconstruct Seurat. We will explore why point trajectories are secret carriers of 3D information, how the authors built a Transformer-based architecture to decode this information, and why this method generalizes surprisingly well to real-world footage despite being trained entirely on synthetic data.
Background: The Geometry of Motion
To understand Seurat, we first need to understand the limitations of current technology and the intuition that powers this new approach.
The Problem with Monocular Depth
Monocular Depth Estimation (MDE) is the task of predicting the distance of every pixel from the camera using a single image. Deep learning has gotten very good at this. Models like MiDaS or ZoeDepth can look at a photo of a room and tell you the chair is closer than the window.
However, these models struggle with temporal consistency. When you run a single-image model on a video, it processes frame 1 independently of frame 2. Small changes in lighting or noise can cause the depth prediction to jump drastically. Furthermore, single-view geometry suffers from scale ambiguity—a large car far away looks identical to a small toy car close up.
The Intuition: Structured Light from Motion
The authors of Seurat draw a fascinating parallel to Structured Light 3D Scanning. In a structured light setup (like the Face ID sensor on an iPhone), a known pattern of dots is projected onto an object. The camera looks at how that pattern deforms. If the dots appear closer together than expected, the surface is angled away; if they curve, the surface is curved. The deformation of the pattern reveals the 3D shape.
Seurat posits that we don’t need to project artificial light. The natural texture of the world acts as our pattern.

Consider Figure 2 above.
- (a) Even without seeing the car itself, if you just saw the dots moving, you could tell the object is receding into the distance.
- (b) Imagine a sphere moving away from the camera. In the first frame (left), the red dots on the sphere are spread out. As the sphere moves further away (right), perspective causes those dots to bunch up near the center of the image.
This is the core insight: The changing density of 2D trajectories encodes 3D depth changes. If we can track how points move relative to each other, we can mathematically deduce how their depth is changing.
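Before the formal derivation, here is a tiny, self-contained NumPy toy (illustrative only, not the paper's code) that makes the insight concrete: project the same fronto-parallel patch of tracked points at two depths and compare the projected point density.

```python
import numpy as np

def project(points_3d, focal=500.0):
    """Pinhole projection of 3D points (N, 3) onto the image plane (N, 2)."""
    return focal * points_3d[:, :2] / points_3d[:, 2:3]

# A small fronto-parallel patch of tracked points, observed at depth d and at 2d.
grid = np.stack(np.meshgrid(np.linspace(-0.5, 0.5, 5),
                            np.linspace(-0.5, 0.5, 5)), axis=-1).reshape(-1, 2)

for depth in (2.0, 4.0):
    pts = np.concatenate([grid, np.full((len(grid), 1), depth)], axis=1)
    uv = project(pts)
    # The projected extent of the patch shrinks as 1/d^2, so the density of
    # tracked points (points per squared pixel) grows as d^2.
    area = np.ptp(uv[:, 0]) * np.ptp(uv[:, 1])
    print(f"depth={depth:.1f}  projected area={area:8.1f}  density={len(grid)/area:.4f}")
```

Doubling the depth shrinks the projected area by a factor of four, so the measured density quadruples, which is exactly the \(d^2\) relationship derived in the next section.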
Core Method
The Seurat framework is designed to ingest 2D trajectories and output 3D depth ratios. It doesn’t rely on the pixel colors (RGB) of the video, only the geometry of the motion.
1. Theoretical Analysis
Before diving into the neural network, let’s look at the math that justifies why this works. The authors derive a relationship between the density of points on the 2D image plane and their depth in 3D space.
Assume we are observing a small surface patch. The area of this patch as it appears on your image sensor (\(A^{image}\)) depends on the focal length (\(f\)), the true surface area (\(A^{surface}\)), the viewing angle (\(\theta\)), and most importantly, the depth (\(d\)).
\[ A^{image} \;=\; \frac{f^{2}\, A^{surface} \cos\theta}{d^{2}} \]
This equation tells us that the projected area is inversely proportional to the square of the depth (\(d^2\)). Consequently, the density of points (\(\rho\)), which is the inverse of the area, is proportional to the square of the depth:
\[ \rho \;=\; \frac{1}{A^{image}} \;=\; \frac{d^{2}}{f^{2}\, A^{surface} \cos\theta} \]
If we compare the density of points at a starting time (\(t_0\)) versus a later time (\(t\)), we get a ratio. This allows us to cancel out the unknown focal length and true surface area (assuming the object is rigid locally):
\[ \frac{\rho_{t}}{\rho_{t_0}} \;=\; \frac{d_{t}^{2}\, \cos\theta_{t_0}}{d_{t_0}^{2}\, \cos\theta_{t}} \]
Finally, by rearranging the terms, we can solve for the Depth Ratio (\(r_t\)), which represents how much the depth has changed relative to the start (\(d_t / d_{t_0}\)):
\[ r_{t} \;=\; \frac{d_{t}}{d_{t_0}} \;=\; \sqrt{\frac{\rho_{t}\, \cos\theta_{t}}{\rho_{t_0}\, \cos\theta_{t_0}}} \]
This derivation proves that if we can measure the change in point density (\(\rho\)) and account for rotation (\(\cos\theta\)), we can calculate depth changes.
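As a quick sanity check with toy numbers (not taken from the paper): suppose a rigid patch does not rotate between frames, so \(\cos\theta_{t} = \cos\theta_{t_0}\), and its tracked points become four times denser, \(\rho_{t} = 4\rho_{t_0}\). Plugging into the depth-ratio formula gives

\[ r_{t} \;=\; \sqrt{\frac{\rho_{t}\, \cos\theta_{t}}{\rho_{t_0}\, \cos\theta_{t_0}}} \;=\; \sqrt{4} \;=\; 2, \]

i.e., the patch is now twice as far from the camera, matching the receding-sphere intuition from Figure 2.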
Why use a Neural Network? You might ask, “If we have the formula, why train a model?” The problem is that the formula assumes we know the surface rotation (\(\theta\)) perfectly and that the object is perfectly rigid. In the real world, measuring local density is noisy, and estimating rotation is hard. The authors found that a hand-crafted mathematical solver was too brittle. A neural network, however, can learn to approximate this relationship robustly, handling noise and non-rigid deformations implicitly.
2. Architecture Overview
The Seurat architecture treats depth estimation as a sequence-to-sequence translation problem. Input sequences of 2D coordinates are translated into sequences of depth values.

As illustrated in Figure 3, the process has three main stages:
- Input Processing: The system uses an off-the-shelf tracker (like CoTracker or LocoTrack) to generate trajectories.
- Dual-Branch Transformer: The model processes these tracks using two separate branches.
- Depth Prediction: The model outputs depth ratios, which are then refined.
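To fix the interface in mind, here is a shape-level sketch of the three stages above (tensor names and sizes are illustrative assumptions, not the paper's code): within one window, the model consumes only 2D coordinates plus visibility flags and emits one depth ratio per query point per frame, relative to the window's first frame.

```python
import torch

T, N_s, N_q = 8, 32 * 32, 64            # frames per window, supporting grid points, query points

supp_tracks  = torch.randn(T, N_s, 2)    # 2D (x, y) of every supporting point in every frame
supp_vis     = torch.ones(T, N_s)        # 1.0 where the tracker marks the point visible
query_tracks = torch.randn(T, N_q, 2)    # 2D coordinates of the points we want depth for
query_vis    = torch.ones(T, N_q)

# A Seurat-style model sees only these coordinates (no RGB) and returns, for each
# query point and frame, its depth ratio relative to the window's first frame:
#   depth_ratio = model(supp_tracks, supp_vis, query_tracks, query_vis)
#   depth_ratio.shape == (T, N_q)
```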
The Two-Branch Design
A key innovation here is the separation of Supporting Trajectories and Query Trajectories.
If you only tracked the points you wanted to predict depth for (the Query Points), you might miss the context. For example, if you only track points on a moving car, the model might not realize the camera itself is stationary. To solve this, Seurat places a dense grid of points over the entire image—the Supporting Grid.
- Supporting Branch: Processes the dense grid to capture global scene motion and camera movement.
- Query Branch: Processes the specific points of interest.
Crucially, information flows from the Supporting branch to the Query branch via Cross-Attention, but not the other way around. This prevents the specific distribution of query points from biasing the global understanding of the scene.
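A minimal PyTorch-style sketch of that flow, assuming standard multi-head attention blocks (layer order, dimensions, and residual placement are illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    """Supporting branch self-attends; query branch additionally cross-attends
    to the supporting features. No information flows from query to supporting."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.supp_self  = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross      = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, h_s, h_q):
        # Supporting branch: global scene / camera-motion context.
        h_s = h_s + self.supp_self(h_s, h_s, h_s)[0]
        # Query branch: first its own self-attention ...
        h_q = h_q + self.query_self(h_q, h_q, h_q)[0]
        # ... then one-way cross-attention: queries read from supporting features,
        # but h_s is never updated from h_q.
        h_q = h_q + self.cross(h_q, h_s, h_s)[0]
        return h_s, h_q

h_s = torch.randn(1, 1024, 256)   # dense supporting grid tokens
h_q = torch.randn(1, 64, 256)     # query point tokens
h_s, h_q = DualBranchBlock()(h_s, h_q)
```

The important property is the last attention call: `h_q` reads from `h_s`, while `h_s` is never updated from `h_q`, so the global motion context cannot be biased by where the query points happen to sit.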
Transformer Blocks
Both branches use a mix of Temporal and Spatial attention mechanisms.
Supporting Encoder: The supporting trajectories (\(\mathcal{T}_s\)) and their visibility flags (\(\mathcal{V}_s\)) are embedded and passed through \(L\) layers.
\[ \mathbf{h}_{s}^{(0)} = \mathrm{Embed}(\mathcal{T}_{s}, \mathcal{V}_{s}), \qquad \mathbf{h}_{s}^{(l)} = \mathrm{TransformerLayer}\big(\mathbf{h}_{s}^{(l-1)}\big), \quad l = 1, \dots, L \]
The TransformerLayer here alternates between looking at the same point across different times (temporal attention) and looking at different points in the same frame (spatial attention). This allows the model to understand both the motion path and the geometric structure of the scene.
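The alternation can be sketched with two attention modules and some reshaping. The code below is a schematic of the idea (shapes and module choices are assumptions), with track features laid out as a (batch, frames, points, channels) tensor.

```python
import torch
import torch.nn as nn

def factorized_attention(x, temporal_attn, spatial_attn):
    """x: (B, T, N, C) track features. Attend over time for each point, then
    over points within each frame. A sketch of the alternation, not the
    paper's exact layer."""
    B, T, N, C = x.shape

    # Temporal attention: each trajectory looks at itself across frames.
    xt = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
    xt = xt + temporal_attn(xt, xt, xt)[0]
    x = xt.reshape(B, N, T, C).permute(0, 2, 1, 3)

    # Spatial attention: each frame looks across all tracked points.
    xs = x.reshape(B * T, N, C)
    xs = xs + spatial_attn(xs, xs, xs)[0]
    return xs.reshape(B, T, N, C)

attn = lambda: nn.MultiheadAttention(128, 4, batch_first=True)
x = torch.randn(2, 8, 256, 128)                    # (batch, frames, points, channels)
y = factorized_attention(x, attn(), attn())
print(y.shape)                                     # torch.Size([2, 8, 256, 128])
```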
Query Decoder: The query branch is similar but includes an extra step. In every layer, it uses cross-attention to “look at” the supporting features (\(\mathbf{h}_s\)).
\[ \mathbf{h}_{q}^{(l)} = \mathrm{TransformerLayer}\Big(\mathbf{h}_{q}^{(l-1)} + \mathrm{CrossAttn}\big(\mathbf{h}_{q}^{(l-1)},\, \mathbf{h}_{s}\big)\Big), \quad l = 1, \dots, L \]
This injection of global context is what allows Seurat to accurately predict depth for a specific point by understanding how it moves relative to the rest of the scene.
3. Sliding Window & Log-Ratio Loss
Processing a long video all at once is computationally heavy and difficult because points frequently go out of view. Seurat uses a Sliding Window approach. It breaks the video into overlapping chunks (e.g., windows of 8 frames).
Inside each window, the model predicts the Log Depth Ratio. Why Log? Depth changes are multiplicative (an object moves from 10m to 20m, doubling its depth). In logarithmic space, this becomes additive, which is easier for networks to learn.
\[ \ell_{t} \;=\; \log r_{t} \;=\; \log \frac{d_{t}}{d_{t_0}} \]
The training loss minimizes the difference between the predicted log ratio and the ground truth log ratio:
\[ \mathcal{L} \;=\; \sum_{t} \big\lvert \hat{\ell}_{t} - \ell_{t} \big\rvert, \]
where \(\hat{\ell}_{t}\) is the model's prediction and \(\ell_{t}\) the ground-truth log ratio.
During inference (testing), these windowed predictions are stitched together. The model accumulates the changes over time to build a full trajectory history:
\[ \log \hat{r}_{t} \;=\; \log \hat{r}_{t_k} \;+\; \hat{\ell}^{(k)}_{t}, \]
where \(t_k\) is the first frame of the window containing \(t\). Chaining these additive updates window by window recovers the log depth ratio of every frame relative to the start of the video.
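Here is a small NumPy sketch of the stitching idea (window length, stride, and overlap handling are illustrative assumptions): each window predicts log ratios relative to its own first frame, and the accumulated value at that frame is added in to chain the windows together.

```python
import numpy as np

def stitch_windows(log_ratios, window=8, stride=7):
    """Chain per-window log depth ratios into one curve relative to frame 0.
    log_ratios[w][t] is the predicted log(d_t / d_{t_start}) for frame t inside
    window w. A simplified stitching sketch; the paper's overlap handling may differ."""
    total = stride * (len(log_ratios) - 1) + window
    curve = np.zeros(total)
    offset = 0.0                      # accumulated log ratio at each window's start frame
    for w, rel in enumerate(log_ratios):
        start = w * stride
        curve[start:start + window] = offset + rel
        if start + stride < total:
            offset = curve[start + stride]
    return curve                      # log(d_t / d_0) for every frame

# Two overlapping 8-frame windows: the object doubles in depth within each window.
win = np.linspace(0.0, np.log(2.0), 8)
full = stitch_windows([win, win])
print(np.exp(full[-1]))               # ~4.0: depth quadrupled over the clip
```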
4. From Relative to Metric Depth
Seurat is fantastic at telling you how depth changes (e.g., “this point is now 2x further away”). However, it doesn’t know if the point started at 1 meter or 100 meters. This is the difference between relative depth and metric depth.
To solve this, the authors combine Seurat with a single-image Monocular Depth Estimator (MDE) like ZoeDepth or DepthPro.
They use a technique called Piecewise Scale Matching. For a specific point, the MDE provides a noisy but roughly correct absolute depth for each frame. Seurat provides a smooth, accurate relative change curve.
The system calculates a scale factor (\(s_{i,t}\)) that aligns Seurat’s smooth curve with the median value of the MDE’s absolute predictions:
\[ s_{i,t} \;=\; \operatorname{median}_{t'} \left( \frac{d^{\mathrm{MDE}}_{i,t'}}{\hat{r}_{i,t'}} \right), \]
with the median taken over the frames of the local window around \(t\).
This scale factor is then applied to the Seurat prediction:
\[ \hat{d}_{i,t} \;=\; s_{i,t}\, \hat{r}_{i,t} \]
The result is the best of both worlds: the absolute scale from the image-based model and the temporal smoothness and geometric consistency from the trajectory-based model.
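A toy NumPy sketch of the alignment idea (variable names, chunk size, and the exact robust statistic are assumptions): the smooth relative curve supplies the shape of the motion, and a per-chunk median against the MDE supplies the metric scale.

```python
import numpy as np

def align_scale(rel_curve, mde_depth, window=8):
    """rel_curve:  smooth relative depth ratios d_t / d_{t0} (Seurat-style).
       mde_depth:  noisy per-frame absolute depths from a single-image MDE.
       Returns metric depth: a per-chunk median scale applied to the smooth curve."""
    T = len(rel_curve)
    metric = np.empty(T)
    for start in range(0, T, window):
        sl = slice(start, min(start + window, T))
        # One robust scale per chunk: median of (absolute MDE depth / relative ratio).
        scale = np.median(mde_depth[sl] / rel_curve[sl])
        metric[sl] = scale * rel_curve[sl]
    return metric

# Toy example: true depth goes 5 m -> 10 m; the MDE is noisy, the ratio curve is smooth.
t = np.linspace(0, 1, 16)
true_depth = 5.0 * (1.0 + t)
rel_curve = true_depth / true_depth[0]                 # what a Seurat-like model provides
mde_depth = true_depth + np.random.normal(0, 0.5, 16)  # what a per-frame MDE provides
print(align_scale(rel_curve, mde_depth))               # smooth and roughly metric
```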
Experiments & Results
The authors evaluated Seurat on the TAPVid-3D benchmark, which spans diverse data sources: Aria (egocentric footage), DriveTrack (autonomous driving), and Panoptic Studio (multi-view indoor motion capture).
Quantitative Performance
The results show that Seurat significantly outperforms baselines that simply rely on per-frame depth estimation.
In the table below (Table 2 from the paper), we see the comparison when aligning predictions to ground truth using a global scale.

Key takeaways from the data:
- High Accuracy: When combined with DepthPro, Seurat achieves the highest 3D-AJ (3D Average Jaccard) and APD (average position accuracy) scores.
- Temporal Coherence (TC): Look at the TC column (lower is better). Seurat’s scores are consistently an order of magnitude better than the baselines (e.g., 0.012 vs 0.172). This mathematically confirms that Seurat eliminates the “jitter” or flickering common in video depth estimation.
- Generalization: Remember, Seurat was trained only on synthetic data (Kubric). Yet, it dominates on real-world datasets (Aria, DriveTrack). This suggests the geometric cues it learned are universal.
Qualitative Comparison
Numbers are great, but visual results tell the story of stability. Figure 4 visualizes the 3D trajectories generated by different methods.

In the visualizations:
- CoTracker + ZoeDepth (Left): The trajectories are wavy and jagged. The depth estimation fluctuates frame-to-frame, resulting in “squiggly” lines that don’t represent real physical motion.
- Seurat (Middle): The trajectories are smooth and straight, closely matching the Ground Truth (Right). The spatial structure of the room (top row) and the street (middle row) is preserved.
Ablation Studies: What Matters?
The authors performed rigorous testing to see which parts of their design were essential.

- Two-branch design: Merging the supporting and query branches (Row II) hurt performance significantly (3D-AJ dropped from 18.0 to 13.7). This confirms that separating the global scene context from specific query points is vital.
- Sliding Window: Removing the sliding window (Row III) and trying to process the whole video at once caused a massive performance drop (3D-AJ 8.8). Long-term motion is too complex to predict in one shot; breaking it down simplifies the problem.
The Surprise: Texture Hurts
One of the most interesting experiments involved feeding the actual RGB image pixels (texture) into the model alongside the point coordinates.

As shown in Table 5, adding texture information actually reduced performance. Why? The authors suspect that when the model sees RGB data, it starts overfitting to the specific look of the synthetic training data. By restricting the input to only coordinate trajectories, the model is forced to learn pure geometry, which transfers perfectly from synthetic cartoons to photorealistic video.
Is it just smoothing?
A skeptic might ask: “Is Seurat just a fancy Gaussian smoothing filter applied to ZoeDepth?”

Table 6 answers this. Simply applying Gaussian smoothing to the output of a standard depth estimator improves results slightly but pales in comparison to Seurat. Seurat isn’t just averaging numbers; it’s using the physics of perspective convergence to calculate the correct depth.
Conclusion
Seurat represents a refreshing “back to basics” approach in computer vision. In an era dominated by massive foundation models trained on billions of images, Seurat demonstrates that there is still immense power in understanding the fundamental geometry of vision.
By treating moving points as a dynamic signal, the authors unlocked a robust method for estimating depth that:
- Generalizes Zero-Shot: Works on real video despite synthetic training.
- Ensures Consistency: Eliminates the flickering that plagues frame-by-frame methods.
- Remains Efficient: Uses lightweight Transformers on coordinate data rather than heavy video processing.
While Seurat relies on external models for absolute scale, its ability to weave 2D chaos into coherent 3D structure is a significant step forward. It reminds us that in video, motion isn’t just a byproduct of time—it’s a blueprint of the 3D world.