Introduction

How do you know how far away an object is? If you close one eye and sit perfectly still, the world flattens. Depth perception becomes a guessing game based on shadows and familiar object sizes. But the moment you move your head, the world pops back into 3D. Nearby objects sweep quickly across your field of view, while distant mountains barely budge. This phenomenon, known as motion parallax, is a fundamental way biological systems perceive geometry.

In computer vision, extracting 3D depth from a single video source (monocular video) has been a longstanding hurdle. Traditional methods often treat video as a bag of individual images, predicting depth frame-by-frame. The result? “Flickering.” As the video plays, the estimated depth of a wall or a car jitters uncontrollably because the model doesn’t understand that it’s the same object moving through time.

Enter Seurat, a novel research paper from KAIST AI and Adobe Research. The authors propose a method that bypasses the need for complex stereo setups or massive labeled datasets. Instead, they strip a video down to its barest geometric essence: the movement of points.

Figure 1. Seurat predicts precise and smooth depth changes for dynamic objects over time by only looking at the 2D point trajectories.

As shown in Figure 1, Seurat takes 2D point tracks—simple lines tracing how pixels move across the screen—and lifts them into 3D space. By analyzing how these points converge, diverge, and swirl over time, Seurat infers depth with remarkable stability.

In this post, we will deconstruct Seurat. We will explore why point trajectories are secret carriers of 3D information, how the authors built a Transformer-based architecture to decode this information, and why this method generalizes surprisingly well to real-world footage despite being trained entirely on synthetic data.

Background: The Geometry of Motion

To understand Seurat, we first need to understand the limitations of current technology and the intuition that powers this new approach.

The Problem with Monocular Depth

Monocular Depth Estimation (MDE) is the task of predicting the distance of every pixel from the camera using a single image. Deep learning has gotten very good at this. Models like MiDaS or ZoeDepth can look at a photo of a room and tell you the chair is closer than the window.

However, these models struggle with temporal consistency. When you run a single-image model on a video, it processes frame 1 independently of frame 2. Small changes in lighting or noise can cause the depth prediction to jump drastically. Furthermore, single-view geometry suffers from scale ambiguity—a large car far away looks identical to a small toy car close up.

The Intuition: Structured Light from Motion

The authors of Seurat draw a fascinating parallel to Structured Light 3D Scanning. In a structured light setup (like the Face ID sensor on an iPhone), a known pattern of dots is projected onto an object. The camera looks at how that pattern deforms. If the dots appear closer together than expected, the surface is angled away; if they curve, the surface is curved. The deformation of the pattern reveals the 3D shape.

Seurat posits that we don’t need to project artificial light. The natural texture of the world acts as our pattern.

Figure 2. Motivation of our work. (a) By looking at tracked points, we can perceive motion. (b) As a sphere moves away, projected points converge, providing depth cues.

Consider Figure 2 above.

  • (a) Even without seeing the car itself, if you just saw the dots moving, you could tell the object is receding into the distance.
  • (b) Imagine a sphere moving away from the camera. In the first frame (left), the red dots on the sphere are spread out. As the sphere moves further away (right), perspective causes those dots to bunch closer together as the sphere’s projection shrinks.

This is the core insight: The changing density of 2D trajectories encodes 3D depth changes. If we can track how points move relative to each other, we can mathematically deduce how their depth is changing.

Core Method

The Seurat framework is designed to ingest 2D trajectories and output 3D depth ratios. It doesn’t rely on the pixel colors (RGB) of the video, only the geometry of the motion.

1. Theoretical Analysis

Before diving into the neural network, let’s look at the math that justifies why this works. The authors derive a relationship between the density of points on the 2D image plane and their depth in 3D space.

Assume we are observing a small surface patch. The area of this patch as it appears on your image sensor (\(A^{image}\)) depends on the focal length (\(f\)), the true surface area (\(A^{surface}\)), the viewing angle (\(\theta\)), and most importantly, the depth (\(d\)).

Equation 1

This equation tells us that the projected area is inversely proportional to the square of the depth (\(d^2\)). Consequently, the density of points (\(\rho\)), which scales inversely with the projected area, is proportional to the square of the depth:

Equation 2

If we compare the density of points at a starting time (\(t_0\)) versus a later time (\(t\)), we get a ratio. This allows us to cancel out the unknown focal length and true surface area (assuming the object is rigid locally):

Equation 3

Finally, by rearranging the terms, we can solve for the Depth Ratio (\(r_t\)), which represents how much the depth has changed relative to the start (\(d_t / d_{t_0}\)):

Equation 4

This derivation proves that if we can measure the change in point density (\(\rho\)) and account for rotation (\(\cos\theta\)), we can calculate depth changes.
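
To see the whole chain at a glance, here is a compact reconstruction of the derivation from the prose above, with the four lines corresponding to Equations 1 through 4 in order. The paper’s exact notation may differ; \(\theta\) is taken to be the angle between the patch’s surface normal and the viewing direction.

\[
\begin{aligned}
A^{image} &= \frac{f^{2}\, A^{surface} \cos\theta}{d^{2}}, \\
\rho \;\propto\; \frac{1}{A^{image}} &= \frac{d^{2}}{f^{2}\, A^{surface} \cos\theta}, \\
\frac{\rho_{t}}{\rho_{t_0}} &= \frac{d_{t}^{2}\,\cos\theta_{t_0}}{d_{t_0}^{2}\,\cos\theta_{t}}, \\
r_{t} \;=\; \frac{d_{t}}{d_{t_0}} &= \sqrt{\frac{\rho_{t}\cos\theta_{t}}{\rho_{t_0}\cos\theta_{t_0}}}.
\end{aligned}
\]

A quick sanity check: if a patch’s point density quadruples while \(\theta\) stays fixed, the last line says the patch is now twice as far away.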

Why use a Neural Network? You might ask, “If we have the formula, why train a model?” The problem is that the formula assumes we know the surface rotation (\(\theta\)) perfectly and that the object is perfectly rigid. In the real world, measuring local density is noisy, and estimating rotation is hard. The authors found that a hand-crafted mathematical solver was too brittle. A neural network, however, can learn to approximate this relationship robustly, handling noise and non-rigid deformations implicitly.

2. Architecture Overview

The Seurat architecture treats depth estimation as a sequence-to-sequence translation problem. Input sequences of 2D coordinates are translated into sequences of depth ratios.

Figure 3. Overall architecture.

As illustrated in Figure 3, the process has three main stages:

  1. Input Processing: The system uses an off-the-shelf tracker (like CoTracker or LocoTrack) to generate 2D point trajectories (see the input sketch after this list).
  2. Dual-Branch Transformer: The model processes these tracks using two separate branches.
  3. Depth Prediction: The model outputs depth ratios, which are then refined.
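
Before looking at the two branches, it helps to pin down what stage 1 hands to stage 2. Below is a minimal sketch of the inputs as tensors, with shapes and names chosen purely for illustration (trackers such as CoTracker return per-frame 2D coordinates together with a visibility flag for each point):

```python
import torch

T, N_support, N_query = 8, 256, 16   # frames, dense supporting grid points, query points

# 2D trajectories: the pixel coordinates of every tracked point in every frame.
support_tracks = torch.rand(T, N_support, 2)   # (x, y) for the dense supporting grid
query_tracks = torch.rand(T, N_query, 2)       # (x, y) for the points we want depth for

# Visibility flags: True while a point is visible, False once it leaves the frame or is occluded.
support_vis = torch.ones(T, N_support, dtype=torch.bool)
query_vis = torch.ones(T, N_query, dtype=torch.bool)

# The model's target: one depth ratio per query point per frame, r_t = d_t / d_t0.
depth_ratios = torch.ones(T, N_query)
```

Notice that no RGB values appear anywhere; as stated above, only the geometry of the motion is used.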

The Two-Branch Design

A key innovation here is the separation of Supporting Trajectories and Query Trajectories.

If you only tracked the points you wanted to predict depth for (the Query Points), you might miss the context. For example, if you only track points on a moving car, the model might not realize the camera itself is stationary. To solve this, Seurat places a dense grid of points over the entire image—the Supporting Grid.

  • Supporting Branch: Processes the dense grid to capture global scene motion and camera movement.
  • Query Branch: Processes the specific points of interest.

Crucially, information flows from the Supporting branch to the Query branch via Cross-Attention, but not the other way around. This prevents the specific distribution of query points from biasing the global understanding of the scene.

Transformer Blocks

Both branches use a mix of Temporal and Spatial attention mechanisms.

Supporting Encoder: The supporting trajectories (\(\mathcal{T}_s\)) and their visibility flags (\(\mathcal{V}_s\)) are embedded and passed through \(L\) layers.

Equation 5

The TransformerLayer here alternates between looking at the same point across different times (temporal attention) and looking at different points in the same frame (spatial attention). This allows the model to understand both the motion path and the geometric structure of the scene.
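
The alternation is easiest to see in code. Below is a minimal PyTorch sketch of one factorized block operating on track embeddings of shape (frames, points, channels). Class and variable names are illustrative rather than the authors’ implementation, and details such as positional encodings are omitted:

```python
import torch
import torch.nn as nn

class FactorizedTrackBlock(nn.Module):
    """One block that attends over time, then over points (illustrative sketch only)."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (T, N, C) = frames x tracked points x channels.
        # Temporal attention: each point attends to its own positions across all frames.
        x = h.permute(1, 0, 2)                    # (N, T, C): one sequence per point
        xn = self.norm1(x)
        x = x + self.temporal_attn(xn, xn, xn)[0]
        h = x.permute(1, 0, 2)                    # back to (T, N, C)

        # Spatial attention: within each frame, all points attend to each other.
        hn = self.norm2(h)
        h = h + self.spatial_attn(hn, hn, hn)[0]  # (T, N, C): one sequence per frame

        return h + self.mlp(h)                    # small feed-forward on top


# Example: 8 frames, 256 supporting points, 128-dim embeddings.
h_s = torch.randn(8, 256, 128)
print(FactorizedTrackBlock()(h_s).shape)          # torch.Size([8, 256, 128])
```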

Query Decoder: The query branch is similar but includes an extra step. In every layer, it uses cross-attention to “look at” the supporting features (\(\mathbf{h}_s\)).

Equation 6

This injection of global context is what allows Seurat to accurately predict depth for a specific point by understanding how it moves relative to the rest of the scene.
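
The one-way flow of context can be sketched in the same style: the query features attend to the supporting features as keys and values, and nothing flows back. Again, the names are illustrative rather than the authors’ code:

```python
import torch
import torch.nn as nn

class QueryContextInjection(nn.Module):
    """Cross-attention from query tracks to supporting tracks, one direction only."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, h_q: torch.Tensor, h_s: torch.Tensor) -> torch.Tensor:
        # h_q: (T, N_q, C) query features; h_s: (T, N_s, C) supporting features.
        # Frame by frame, queries read from the supporting tracks; h_s is never
        # updated from h_q, so the query points cannot bias the global scene features.
        ctx, _ = self.cross_attn(self.norm(h_q), h_s, h_s)
        return h_q + ctx


h_q = torch.randn(8, 16, 128)    # 16 query points over 8 frames
h_s = torch.randn(8, 256, 128)   # 256 supporting grid points over the same frames
print(QueryContextInjection()(h_q, h_s).shape)    # torch.Size([8, 16, 128])
```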

3. Sliding Window & Log-Ratio Loss

Processing a long video all at once is computationally heavy and difficult because points frequently go out of view. Seurat uses a Sliding Window approach. It breaks the video into overlapping chunks (e.g., windows of 8 frames).

Inside each window, the model predicts the Log Depth Ratio. Why Log? Depth changes are multiplicative (an object moves from 10m to 20m, doubling its depth). In logarithmic space, this becomes additive, which is easier for networks to learn.

Equation 7
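
Written out with the definitions above (my own arrangement, not necessarily the paper’s exact formula), the quantity each window predicts for a point is

\[
\ell_t \;=\; \log r_t \;=\; \log\frac{d_t}{d_{t_0}} \;=\; \log d_t - \log d_{t_0},
\]

so a doubling of depth always adds the same amount, \(\log 2\), whether it happens at 1 meter or at 100 meters.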

The training loss minimizes the difference between the predicted log ratio and the ground truth log ratio:

Equation 8

During inference (testing), these windowed predictions are stitched together. The model accumulates the changes over time to build a full trajectory history:

Equation 9
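
To make the stitching concrete, here is a small NumPy sketch of one way the accumulation could work, assuming each window predicts log ratios relative to its own first frame and consecutive windows overlap by at least one frame; the paper’s exact scheme may differ:

```python
import numpy as np

def stitch_log_ratios(window_log_ratios, window_starts):
    """Chain per-window log depth ratios into one global log-depth curve.

    window_log_ratios[k][j] is assumed to be log(d_{start_k + j} / d_{start_k}),
    i.e. each window is relative to its own first frame, so every window starts at 0.
    Windows are given in temporal order and consecutive windows overlap by >= 1 frame.
    Illustrative sketch, not the paper's exact stitching scheme.
    """
    num_frames = window_starts[-1] + len(window_log_ratios[-1])
    global_log = np.full(num_frames, np.nan)
    global_log[0] = 0.0                          # log(d_0 / d_0) = 0 by definition

    for start, ratios in zip(window_starts, window_log_ratios):
        offset = global_log[start]               # known from the previous window's overlap
        for j, r in enumerate(ratios):
            t = start + j
            if np.isnan(global_log[t]):
                # log(d_t / d_0) = log(d_start / d_0) + log(d_t / d_start)
                global_log[t] = offset + r
    return global_log


# Toy check: depth doubles every 4 frames over 13 frames; windows of 5 frames, stride 4.
true_log = np.log(2.0) * np.arange(13) / 4.0
starts = [0, 4, 8]
windows = [true_log[s:s + 5] - true_log[s] for s in starts]
print(np.allclose(stitch_log_ratios(windows, starts), true_log))  # True
```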

4. From Relative to Metric Depth

Seurat is fantastic at telling you how depth changes (e.g., “this point is now 2x further away”). However, it doesn’t know if the point started at 1 meter or 100 meters. This is the difference between relative depth and metric depth.

To solve this, the authors combine Seurat with a single-image Monocular Depth Estimator (MDE) like ZoeDepth or DepthPro.

They use a technique called Piecewise Scale Matching. For a specific point, the MDE provides a noisy but roughly correct absolute depth for each frame. Seurat provides a smooth, accurate relative change curve.

The system calculates a scale factor (\(s_{i,t}\)) that aligns Seurat’s smooth curve with the median value of the MDE’s absolute predictions:

Equation 10

This scale factor is then applied to the Seurat prediction:

Equation 11

The result is the best of both worlds: the absolute scale from the image-based model and the temporal smoothness and geometric consistency from the trajectory-based model.
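
Here is a small sketch of the scale-matching idea under one simple reading of it; the function and variable names are hypothetical, and the paper’s piecewise estimator may be more elaborate:

```python
import numpy as np

def match_scale(relative_depth, mde_depth):
    """Align a relative depth-ratio curve with absolute per-frame MDE depths.

    relative_depth: (T,) depth ratios d_t / d_t0 for one query point (smooth, unit-free).
    mde_depth:      (T,) absolute depths for the same point from a single-image model
                    such as ZoeDepth or DepthPro (metric, but jittery frame to frame).
    One simple reading of scale matching: pick the scale that makes the two curves agree
    in the median, which is robust to MDE outliers. The paper's estimator may differ.
    """
    scale = np.median(mde_depth / relative_depth)   # robust per-point scale factor
    return scale * relative_depth                   # Seurat's shape, metric units


# Toy example: the true depth goes from 2 m to 4 m over 8 frames.
rng = np.random.default_rng(0)
relative = np.linspace(1.0, 2.0, 8)                             # clean ratio curve from trajectories
mde = 2.0 * np.linspace(1.0, 2.0, 8) + rng.normal(0.0, 0.3, 8)  # noisy absolute depths
print(match_scale(relative, mde))   # smooth curve running from roughly 2 m to 4 m
```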

Experiments & Results

The authors evaluated Seurat on the TAPVid-3D benchmark, which includes diverse datasets like Aria (egocentric/VR), DriveTrack (autonomous driving), and Panoptic Studio (indoor motion).

Quantitative Performance

The results show that Seurat significantly outperforms baselines that simply rely on per-frame depth estimation.

In the table below (Table 2 from the paper), we see the comparison when aligning predictions to ground truth using a global scale.

Table 2. Quantitative results of affine-invariant depth on TAPVid-3D.

Key takeaways from the data:

  1. High Accuracy: When combined with DepthPro, Seurat achieves the highest 3D-AJ (Average Jaccard) and APD scores (both are accuracy metrics, so higher is better).
  2. Temporal Coherence (TC): Look at the TC column (lower is better). Seurat’s scores are consistently an order of magnitude better than the baselines (e.g., 0.012 vs 0.172). This mathematically confirms that Seurat eliminates the “jitter” or flickering common in video depth estimation.
  3. Generalization: Remember, Seurat was trained only on synthetic data (Kubric). Yet, it dominates on real-world datasets (Aria, DriveTrack). This suggests the geometric cues it learned are universal.

Qualitative Comparison

Numbers are great, but visual results tell the story of stability. Figure 4 visualizes the 3D trajectories generated by different methods.

Figure 4. Qualitative comparisons to baselines.

In the visualizations:

  • CoTracker + ZoeDepth (Left): The trajectories are wavy and jagged. The depth estimation fluctuates frame-to-frame, resulting in “squiggly” lines that don’t represent real physical motion.
  • Seurat (Middle): The trajectories are smooth and straight, closely matching the Ground Truth (Right). The spatial structure of the room (top row) and the street (middle row) is preserved.

Ablation Studies: What Matters?

The authors performed rigorous testing to see which parts of their design were essential.

Table 3. Ablation studies.

  • Two-branch design: Merging the supporting and query branches (Row II) hurt performance significantly (3D-AJ dropped from 18.0 to 13.7). This confirms that separating the global scene context from specific query points is vital.
  • Sliding Window: Removing the sliding window (Row III) and trying to process the whole video at once caused a massive performance drop (3D-AJ 8.8). Long-term motion is too complex to predict in one shot; breaking it down simplifies the problem.

The Surprise: Texture Hurts

One of the most interesting experiments involved feeding the actual RGB image pixels (texture) into the model alongside the point coordinates.

Table 5. Texture patch ablation.

As shown in Table 5, adding texture information actually reduced performance. Why? The authors suspect that when the model sees RGB data, it starts overfitting to the specific look of the synthetic training data. By restricting the input to only coordinate trajectories, the model is forced to learn pure geometry, which transfers cleanly from synthetic renders to photorealistic video.

Is it just smoothing?

A skeptic might ask: “Is Seurat just a fancy Gaussian smoothing filter applied to ZoeDepth?”

Table 6. Comparison to simple Gaussian smoothing.

Table 6 answers this. Simply applying Gaussian smoothing to the output of a standard depth estimator improves results slightly but pales in comparison to Seurat. Seurat isn’t just averaging numbers; it’s using the physics of perspective convergence to calculate the correct depth.

Conclusion

Seurat represents a refreshing “back to basics” approach in computer vision. In an era dominated by massive foundation models trained on billions of images, Seurat demonstrates that there is still immense power in understanding the fundamental geometry of vision.

By treating moving points as a dynamic signal, the authors unlocked a robust method for estimating depth that:

  1. Generalizes Zero-Shot: Works on real video despite synthetic training.
  2. Ensures Consistency: Eliminates the flickering that plagues frame-by-frame methods.
  3. Remains Efficient: Uses lightweight Transformers on coordinate data rather than heavy video processing.

While Seurat relies on external models for absolute scale, its ability to weave 2D chaos into coherent 3D structure is a significant step forward. It reminds us that in video, motion isn’t just a byproduct of time—it’s a blueprint of the 3D world.