If you have ever tried to use a single-image AI model to process a video frame-by-frame, you are likely familiar with the “flicker” problem. Whether it is depth estimation, style transfer, or colorization, applying an image model to a video usually results in a jittery, inconsistent mess. The ground shakes, colors shift randomly, and objects change shape from one second to the next.

This happens because standard image models have no concept of time. They don’t know that the chair in frame 10 is the same object as the chair in frame 11.

To solve this, researchers usually employ 3D convolutions or complex temporal attention mechanisms. These are computationally heavy and often struggle with large motions. But what if there were a better way? What if we could simply tell the network exactly how pixels move?

In this post, we are diving into Tracktention, a fascinating paper from the Visual Geometry Group at the University of Oxford. The researchers propose a plug-and-play layer that transforms standard image models into state-of-the-art video models by leveraging the recent explosion in high-quality point tracking.

The Tracktention concept. On the left, an image network is converted to a video network. On the right, performance charts show it is both faster and more accurate than competitors.

The Core Problem: Temporal Consistency

Video analysis is significantly harder than image analysis because of the temporal dimension. The “Holy Grail” of video processing is temporal consistency—ensuring that predictions remain coherent over time.

Traditional approaches to this problem generally fall into two buckets:

  1. Implicit Motion (3D CNNs & Temporal Attention): These methods try to learn motion features implicitly. However, 3D convolutions only look at local neighbors (pixels right next to each other in time), making them bad at tracking fast-moving objects. Temporal attention tries to look at everything at once, which is computationally expensive (\(O(T^2)\)) and inefficient.
  2. Explicit Motion (Optical Flow): Some networks ingest optical flow fields to understand motion. While better, flow fields can struggle with occlusions (when an object goes behind another) and often fail to capture long-range dependencies.

The Rise of Point Tracking

Recently, computer vision has seen a breakthrough in Point Tracking. Models like CoTracker, PIPs, and TAPIR can track thousands of specific points across a video with incredible accuracy, even when those points are occluded or the camera moves wildly.

The authors of Tracktention asked a simple question: Why are we trying to force video networks to implicitly learn motion when we have expert point trackers that can explicitly tell us where everything is going?

Introducing Tracktention

Tracktention is a novel attention layer designed to be inserted into existing Vision Transformers (ViTs) or CNNs. It upgrades a static image model into a motion-aware video model.

Overview of the Tracktention pipeline. 1. Sample image tokens at track locations. 2. Update via Track Transformer. 3. Splat back to image tokens.

As shown in the figure above, the process is intuitive. Instead of making every pixel pay attention to every other pixel in every other frame (which is slow and noisy), Tracktention uses point tracks as “guide rails” for information flow.
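
In one line, the layer computes a residual update to the video features, with the tracks \(P\) steering where information is gathered from and written back to (my shorthand, not the paper’s notation):

\[
F' \;=\; F \;+\; \mathrm{Splat}\!\big(\mathrm{TrackTransformer}\big(\mathrm{Sample}(F, P)\big),\, P\big).
\]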

The architecture follows a three-step process: Attentional Sampling, Track Transformer, and Attentional Splatting. Let’s break down the mathematics and mechanics of each step.

1. Attentional Sampling

The first step is to transfer information from the dense image feature maps into the sparse point tracks.

We start with input video features \(F\) (the output of some layer in your backbone network) and a set of point tracks \(P\) generated by an off-the-shelf tracker (like CoTracker3).
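
For the rest of this walkthrough, assume the following shapes (consistent with the description here, though the paper’s exact notation may differ):

\[
F \in \mathbb{R}^{T \times N \times C}, \qquad P \in \mathbb{R}^{T \times M \times 2},
\]

i.e. \(N\) feature tokens of dimension \(C\) in each of the \(T\) frames, and the 2D positions of \(M\) tracked points in every frame.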

We need to convert the image features into “track tokens.” We project the track points into queries (\(Q\)) and the image features into keys (\(K\)) and values (\(V\)):

Equation for Query, Key, and Value projection.

Here, \(\mathcal{T}\) represents the positional embeddings of the track points.
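
The equation image is not reproduced here, but a plausible reconstruction from the surrounding text (per frame \(t\); treat the exact form as an approximation) is

\[
Q_t = \mathcal{T}_t W^Q, \qquad K_t = F_t W^K, \qquad V_t = F_t W^V,
\]

with \(\mathcal{T}_t \in \mathbb{R}^{M \times C}\) the positional embeddings of the track points \(P_t\), and \(W^Q, W^K, W^V\) learned projection matrices.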

Next, we compute the attention weights. This isn’t standard global attention; it is highly localized. We want the track token to only gather information from the image features near the track’s location at that specific timestamp.

To enforce this, the authors introduce a distance-based bias term \(B_t\) into the softmax calculation:

Softmax attention equation with bias term B.

The bias term \(B_t\) is critical. It acts like a Gaussian window, forcing the attention mechanism to focus spatially around the track point:

Equation for the bias term B, utilizing Euclidean distance between track and feature positions.

In this equation, \(P_{ti}\) is the location of the track, and \(\mathrm{pos}_{F_t}(j)\) is the location of the pixel in the feature map. If a pixel is far away from the track point, \(B_{tij}\) becomes a large negative number, effectively zeroing out the attention probability.
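
Putting the two pieces together, the sampling step plausibly takes the following form (the structure follows the description above; the exact scaling of the bias is an assumption):

\[
S_t = \operatorname{softmax}\!\left(\frac{Q_t K_t^\top}{\sqrt{C}} + B_t\right) V_t,
\qquad
B_{tij} = -\frac{\lVert P_{ti} - \mathrm{pos}_{F_t}(j)\rVert^2}{2\sigma^2}.
\]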

Visualization of Attentional Sampling. Heatmaps show how the module attends to features corresponding to specific tracks.

As seen in the visualization above, this results in attention maps that tightly follow the object of interest (like the rider’s helmet) across frames, ignoring the background clutter.
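
Here is a minimal PyTorch-style sketch of the sampling step for a single frame. The tensor shapes follow the notation above; the function signature, the Gaussian scale `sigma`, and the use of plain weight matrices are illustrative choices, not the authors’ implementation.

```python
import torch

def attentional_sampling(track_emb_t, feats_t, feat_pos, track_pos_t,
                         W_q, W_k, W_v, sigma=0.05):
    """Pull image features onto track tokens for a single frame t (sketch).

    track_emb_t: (M, C) positional embeddings of the track points (the queries)
    feats_t:     (N, C) image feature tokens of frame t (keys and values)
    feat_pos:    (N, 2) normalized (x, y) position of each image token
    track_pos_t: (M, 2) normalized (x, y) position of each track at frame t
    W_q, W_k, W_v: (C, C) learned projections (plain tensors here for brevity)
    """
    q = track_emb_t @ W_q                                  # (M, C)
    k = feats_t @ W_k                                      # (N, C)
    v = feats_t @ W_v                                      # (N, C)

    # Distance-based bias B_t: a Gaussian-style penalty that grows with the
    # squared distance between each track point and each image token.
    bias = -torch.cdist(track_pos_t, feat_pos) ** 2 / (2 * sigma ** 2)   # (M, N)

    # Localized cross-attention: distant tokens receive a large negative bias,
    # so their softmax weight is effectively zero.
    scores = q @ k.T / (k.shape[-1] ** 0.5) + bias         # (M, N)
    attn = torch.softmax(scores, dim=-1)
    return attn @ v                                        # (M, C) sampled track tokens
```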

2. The Track Transformer

Once we have “Sampled” the image features into track tokens (\(S\)), we essentially have a bundle of timelines. We have \(M\) tracks, and each track has features for \(T\) frames.

Now, we perform the actual temporal processing. The authors transpose the data to process it along the time dimension. They use a standard Transformer Encoder (Self-Attention + Feed Forward Network).

This step allows the model to:

  1. Smooth out feature inconsistencies over time.
  2. Propagate information from confident frames (e.g., fully visible object) to ambiguous frames (e.g., occluded object).
  3. Learn temporal dynamics.

Crucially, this processing happens independently per track. Information does not flow between Track A and Track B here; it only flows across time steps within a single track. This keeps the operation highly efficient.
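
A minimal sketch of this step, assuming track tokens of shape (T, M, C) and a stock transformer encoder (the hyperparameters are placeholders, not the paper’s):

```python
import torch
import torch.nn as nn

class TrackTransformerSketch(nn.Module):
    """Temporal encoder applied to each track independently (illustrative)."""

    def __init__(self, dim=256, heads=4, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, track_tokens):
        # track_tokens: (T, M, C) -- sampled features for M tracks over T frames
        x = track_tokens.permute(1, 0, 2)   # (M, T, C): each track is one sequence
        x = self.encoder(x)                 # self-attention runs only along time
        return x.permute(1, 0, 2)           # back to (T, M, C)
```

Because self-attention runs over \(T\) time steps separately for each of the \(M\) tracks, the cost scales like \(M \cdot T^2\) rather than \((M \cdot T)^2\).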

3. Attentional Splatting

After the Track Transformer updates the tokens, we have temporally consistent, smoothed features. Now we must put them back into the image feature map so the rest of the neural network can use them.

This process is the symmetric inverse of the sampling step. We “splat” the track information back onto the image grid. The Queries now come from the image pixels, while the Keys and Values come from the updated track tokens.

Equation for the final Tracktention output calculation.

Here, \(A'_t\) is the attention matrix (computed similarly to step 1), and \(W_{out}\) is the final projection layer.

A Note on Integration: To ensure this layer can be dropped into pre-trained models without breaking them, the authors use a residual connection (\(F_{new} = F + \text{Tracktention}(F)\)) and initialize the final projection weights \(W_{out}\) to zero. This means at the start of training, the layer does nothing, allowing the model to gradually learn to utilize the temporal cues.
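
A sketch of the splatting step together with the zero-initialized output projection and residual connection described above (single frame, single head, and illustrative names for brevity):

```python
import torch
import torch.nn as nn

class AttentionalSplattingSketch(nn.Module):
    """Push updated track tokens back onto the image tokens of one frame."""

    def __init__(self, dim=256, sigma=0.05):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)    # queries come from image tokens
        self.to_k = nn.Linear(dim, dim)    # keys come from track tokens
        self.to_v = nn.Linear(dim, dim)    # values come from track tokens
        self.out = nn.Linear(dim, dim)
        nn.init.zeros_(self.out.weight)    # zero-init W_out, so at the start of
        nn.init.zeros_(self.out.bias)      # training the layer contributes nothing
        self.sigma = sigma

    def forward(self, feats_t, feat_pos, track_tokens_t, track_pos_t):
        # feats_t: (N, C), feat_pos: (N, 2), track_tokens_t: (M, C), track_pos_t: (M, 2)
        q = self.to_q(feats_t)                                           # (N, C)
        k = self.to_k(track_tokens_t)                                    # (M, C)
        v = self.to_v(track_tokens_t)                                    # (M, C)
        bias = -torch.cdist(feat_pos, track_pos_t) ** 2 / (2 * self.sigma ** 2)  # (N, M)
        attn = torch.softmax(q @ k.T / (k.shape[-1] ** 0.5) + bias, dim=-1)
        update = self.out(attn @ v)                                      # (N, C)
        return feats_t + update            # residual: F_new = F + Tracktention(F)
```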

Relative Spatial Information (RoPE)

One implementation detail worth noting is how the model handles position. Since the tracks move around continuously, standard fixed positional embeddings won’t work well.

The authors use Rotary Position Embeddings (RoPE). This modern technique allows the attention mechanism to understand the relative distance between the track point and the image pixels more effectively than absolute position embeddings.

RoPE equation showing the rotation of feature vectors based on position.
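
For reference, standard RoPE rotates each pair of feature channels \((x_1, x_2)\) by an angle proportional to the position \(p\) (here a continuous track or pixel coordinate), with a per-pair frequency \(\theta\):

\[
\mathrm{RoPE}\big((x_1, x_2),\, p\big) =
\begin{pmatrix} \cos(p\theta) & -\sin(p\theta) \\ \sin(p\theta) & \phantom{-}\cos(p\theta) \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}.
\]

The dot product between a rotated query and a rotated key then depends only on their relative offset, which is exactly the property needed when the track points move continuously.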

Why This Beats Standard Attention

To appreciate why Tracktention is clever, we have to compare it to how video transformers usually work.

Comparison of attention mechanisms. Red border is the query token, blue blocks are attended tokens.

  • Spatial-Temporal Attention (Top): Every pixel attends to every other pixel in space and time. This is extremely slow and memory-hungry.
  • Spatial / Temporal Attention (Middle): Factorized attention looks at space, then looks at time separately. This is faster, but the “Temporal” part usually only looks at the same pixel location in the previous frame. If an object moves (like a ball rolling), the pixel at \((x, y)\) in frame \(t\) is attending to \((x, y)\) in frame \(t-1\), which is now the background!
  • Tracktention (Bottom): The attention follows the track. The query at frame \(t\) attends to the correct location of the object at frame \(t-1\), regardless of how far it moved.
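
A back-of-the-envelope comparison makes the efficiency argument concrete. With \(T\) frames, \(N\) image tokens per frame, and \(M\) tracks (typically \(M \ll N\)), the attention costs scale roughly as follows (orders derived from the mechanisms described above, not figures reported in the paper):

\[
\underbrace{O\big((TN)^2\big)}_{\text{full spatio-temporal}}
\;>\;
\underbrace{O\big(T N^2 + N T^2\big)}_{\text{factorized spatial / temporal}}
\;>\;
\underbrace{O\big(T N M + M T^2\big)}_{\text{Tracktention (sampling/splatting + track transformer)}}.
\]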

Experimental Results

The researchers tested Tracktention on two challenging tasks: Video Depth Estimation and Video Colorization. In both cases, they took a standard single-image model and “upgraded” it with Tracktention layers.

Task 1: Video Depth Estimation

They took Depth Anything, a powerful single-image depth estimator, and added Tracktention layers. They compared it against DepthCrafter, a specialized video diffusion model, and DUSt3R.

Quantitative Results:

Table comparing Tracktention against DepthCrafter and others. Tracktention (Ours) achieves lower error (AbsRel) with significantly fewer parameters.

The results are stark. The Tracktention-augmented model (labelled “Ours”) outperforms standard Depth Anything, DepthCrafter, and NVDS on almost every metric.

  • Parameter Efficiency: Their model has only 140M parameters. Compare that to DepthCrafter’s 1.521 billion parameters. Tracktention is roughly 10x smaller yet more accurate.

Qualitative Results:

Qualitative comparison of depth prediction. The red column visualizes a slice of pixels over time.

Look at the image above. The “pixel column” visualization (the vertical strips) is a great way to see temporal stability.

  • DepthCrafter shows fluctuations (blue box).
  • DUSt3R struggles with dynamic content (green box).
  • Ours (Tracktention) remains smooth and consistent.

Here is another view comparing the error maps directly. Notice how the “Ours” column has significantly less red/yellow (high error) than the competitors, particularly in featureless areas like walls where consistency is key.

Detailed error map comparison. DepthCrafter and DUSt3R show large red error zones. Tracktention maintains low error across the frame.

Task 2: Video Colorization

Colorization is notorious for flickering because there are often multiple valid colors for an object (a shirt could be red or blue). If the model changes its mind halfway through the video, the result is jarring.

The authors upgraded DDColor with Tracktention.

Quantitative table for video colorization. Significant improvements in CDC (Color Distribution Consistency).

The metric to watch here is CDC (Color Distribution Consistency)—lower is better. Tracktention improves the consistency of the base model by 46.5%.

Visual comparison of colorization. Top: Input. Middle: Base Model. Bottom: Tracktention.

In the parrot example above (left side of the image), the base model flickers between green and pink hues. The Tracktention model locks onto a color representation and holds it steady throughout the clip.

The Importance of Tracks

One of the most interesting ablation studies in the paper analyzes how the choice of tracks affects the output.

The researchers found that the consistency of the output is directly tied to the consistency of the tracks. If you track the background, the background becomes stable. If you track an object, the object becomes stable.

Ablation showing selective tracks. Tracking specific objects stabilizes their colorization specifically.

In the figure above (Top Row), when tracks are placed on the left bird, the left bird’s color becomes stable (green box), but the right bird (untracked) might still flicker (red box). This confirms that the network is indeed relying on the explicit track information to propagate features.

Implementation Details: The Pre-Processor

To make this work, you need tracks. The authors use CoTracker3, and they emphasize that how the query points are initialized matters.

If you just initialize points on a grid in Frame 1, you lose tracks as objects move off-screen or become occluded. Instead, they sample points randomly across the entire spatio-temporal volume (random points in space and time) and track them forward and backward.

Comparison of grid vs. random query initialization. Random initialization covers the video much better over time.

As shown above, grid initialization (top row) leaves massive gaps by frame 50. Random initialization (bottom row) maintains dense coverage, ensuring Tracktention always has data to work with.
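
A small sketch of this initialization strategy (the function below is purely illustrative and is not CoTracker3’s API; it only shows the sampling scheme):

```python
import torch

def sample_spacetime_queries(num_frames, height, width, num_queries=1024):
    """Sample tracker query points uniformly over the whole (t, x, y) volume.

    Returns a (num_queries, 3) tensor of (frame_index, x, y) queries. A
    bidirectional tracker can then follow each query forward and backward in
    time, so coverage stays dense even as objects enter, leave, or get occluded.
    """
    t = torch.randint(0, num_frames, (num_queries, 1)).float()
    x = torch.rand(num_queries, 1) * (width - 1)
    y = torch.rand(num_queries, 1) * (height - 1)
    return torch.cat([t, x, y], dim=1)

# Grid initialization, by contrast, places every query in frame 0, so tracks
# thin out as that content moves off-screen later in the clip.
queries = sample_spacetime_queries(num_frames=50, height=480, width=854)
```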

Conclusion

Tracktention represents a shift in how we think about video architectures. Rather than building massive, parameter-heavy 3D networks that try to “learn” physics and motion from scratch, it argues for a modular approach. We already have excellent motion estimators (point trackers). By explicitly injecting that motion knowledge into the network via an attention mechanism, we get the best of both worlds:

  1. High-fidelity features from image foundation models.
  2. Robust temporal consistency from point trackers.

The result is a model that is faster, smaller, and more accurate than native video models. It turns the “flicker” problem from a black-box mystery into a solved geometry problem.

For students and researchers, this highlights a valuable lesson: sometimes the best way to improve a neural network isn’t to make it deeper, but to give it the right hints.


All images and equations presented in this post are sourced directly from the research paper “Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better” by Lai & Vedaldi.