Seeing the Unseen: How Event Cameras are Revolutionizing Point Tracking
Imagine trying to track a specific point on the blade of a rapidly spinning fan. Or perhaps you are trying to follow a bird diving into a dark shadow. If you use a standard video camera, you will likely run into two major walls: motion blur and dynamic range limitations. The fan blade becomes a smear, and the bird disappears into the darkness.
For decades, computer vision has relied on frames—snapshots of the world taken at fixed intervals. But biology works differently. Your retina doesn’t take 30 snapshots a second; it reacts to changes in light continuously. This bio-inspired principle has given rise to Event Cameras.
While event cameras solve the blur and lighting issues, they introduce a new headache: their data is incredibly hard for traditional algorithms to interpret. In this post, we take a deep dive into a groundbreaking paper, “ETAP: Event-based Tracking of Any Point,” which presents the first method to robustly track arbitrary physical points using only event data.
We will explore how the authors overcame the “motion dependence” paradox, designed a transformer-based architecture for sparse data, and created a new synthetic dataset to train their model.
The Problem with Frames (and the Promise of Events)
To understand why ETAP is necessary, we first need to define the task: Tracking Any Point (TAP). The goal is to take a query point on an object and follow that specific physical point throughout a video, determining where it is and whether it is visible or occluded at every moment.
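To make the task concrete, here is a minimal sketch (in NumPy, with hypothetical names and shapes) of what a TAP-style tracker consumes and produces: a set of query points, and for every timestep an estimated location plus a visibility flag for each of them.

```python
import numpy as np

# Hypothetical shapes for a TAP-style tracker's inputs and outputs.
T, N = 100, 16                                      # timesteps, query points

# Input: which physical points to follow, and from which timestep.
query_points = np.zeros((N, 3), dtype=np.float32)   # rows of (t_start, x, y)

# Output: where each point is at every timestep, and whether it is visible.
tracks = np.zeros((T, N, 2), dtype=np.float32)      # estimated (x, y) per step
visibility = np.zeros((T, N), dtype=bool)           # False while the point is occluded
```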
Frame-based methods have become quite powerful (e.g., CoTracker or TAPIR), but they are bound by the sensor’s limitations.
- Motion Blur: Fast motion smears features, making precise tracking impossible.
- Low Dynamic Range: Standard cameras get blinded by bright lights or lose details in shadows.
- Bandwidth: Capturing at high frame rates produces far more data than can be transmitted and processed in real time.
Event cameras address this by having pixels that operate independently. They don’t output an image; they output a stream of asynchronous “events” whenever a pixel detects a change in brightness. This offers microsecond temporal resolution and high dynamic range.
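As a toy illustration (not the sensor’s actual circuitry, and not ETAP code), a single event-camera pixel can be modeled as firing an event whenever the log-brightness it observes drifts past a contrast threshold since its last event:

```python
import numpy as np

def pixel_events(log_intensity, times, contrast_threshold=0.2):
    """Toy model of one event-camera pixel: emit (t, polarity) whenever the
    log-brightness moves a full threshold away from the last event's level."""
    events, reference = [], log_intensity[0]
    for L, t in zip(log_intensity[1:], times[1:]):
        while abs(L - reference) >= contrast_threshold:
            polarity = 1 if L > reference else -1    # ON event if brighter, OFF if darker
            reference += polarity * contrast_threshold
            events.append((t, polarity))
    return events

# A pixel watching a brightening-then-dimming signal fires ON events, then OFF events.
t = np.linspace(0.0, 1.0, 1000)
print(pixel_events(np.log1p(np.abs(np.sin(2 * np.pi * t))), t)[:5])
```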
The “Motion Dependence” Challenge
If event cameras are so superior, why haven’t we switched over completely? The answer lies in how events are generated.
In a standard image, a coffee cup looks like a coffee cup whether the camera is stationary, moving left, or moving right. The appearance is invariant to motion.
In event vision, an event is triggered by a change in brightness. This change depends on the gradient of the scene and the motion of the camera. If you move a camera horizontally across a vertical edge, you get a lot of events. Move it vertically along that same edge? You might get zero events.

As shown in Figure 2, the data generated by the sensor is fundamentally coupled to the motion itself. This makes learning stable features extremely difficult because the “fingerprint” of an object changes depending on how it moves. ETAP is designed specifically to solve this.
ETAP: The First Event-Only TAP Method
The researchers propose ETAP, a model that tracks semi-dense trajectories over long ranges using only event data. It is robust to high-speed motion and challenging lighting, bridging the gap where frame-based methods fail.

1. From Asynchronous Events to Structured Input
Deep learning models, particularly Transformers and CNNs, usually require structured, grid-like input. Events, however, are a sparse list of tuples: \(e_k = (x, y, t, p)\), representing position, time, and polarity (brightness increase or decrease).
The tracker operates on sliding temporal windows: at time \(\tau_t\), it considers all events from the preceding interval of length \(\Delta\tau_t\):
\[ E_t = \left\{ e_k \;\middle|\; \tau_k \in (\tau_t - \Delta\tau_t,\, \tau_t) \right\} \subset E \]
To make this digestible for the network, ETAP converts raw events into Event Stacks.
- They group events into temporal windows.
- They bin these events into a multi-channel grid (tensor).
- This results in a representation \(I_t\) with dimensions \(H \times W \times B\) (where \(B\) is the number of time bins).

As visualized above, this process turns the continuous stream of data into discrete “snapshots” of activity that maintain the temporal richness of the events while being compatible with convolutional encoders.
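A minimal sketch of this binning step is shown below (NumPy). The specific accumulation scheme (signed polarity counts per temporal bin) is an assumption for illustration and may differ from ETAP’s exact variant:

```python
import numpy as np

def events_to_stack(xs, ys, ts, ps, H, W, B, t_start, t_end):
    """Bin a window of events into an H x W x B 'event stack' (illustrative sketch).

    Each event's polarity is accumulated into the spatial cell (y, x) of the
    temporal bin its timestamp falls into."""
    stack = np.zeros((H, W, B), dtype=np.float32)
    # Map timestamps in [t_start, t_end) to bin indices 0..B-1.
    bins = ((ts - t_start) / (t_end - t_start) * B).astype(int)
    bins = np.clip(bins, 0, B - 1)
    np.add.at(stack, (ys, xs, bins), ps)          # accumulate signed polarities
    return stack

# Example: 5 synthetic events on a 4x4 sensor, binned into 3 temporal channels.
xs = np.array([0, 1, 1, 2, 3]); ys = np.array([0, 0, 2, 3, 3])
ts = np.array([0.01, 0.02, 0.05, 0.07, 0.09]); ps = np.array([1, -1, 1, 1, -1])
print(events_to_stack(xs, ys, ts, ps, H=4, W=4, B=3, t_start=0.0, t_end=0.1).shape)
```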
2. The Architecture
The core of ETAP is a Transformer-based tracker. It takes the event stacks and a set of query points (where the user clicks to start tracking) and outputs the trajectory of those points.

The process, defined by the function \(\Psi\), works iteratively:
\[ \mathcal{P}_t = \Psi(\mathcal{P}_{t - T_s}, \mathcal{E}_t). \]
Here is the step-by-step breakdown of the architecture shown in Figure 3(b):
- Feature Extraction: A CNN extracts multi-scale features from the event stacks.
- Tokenization: For each point being tracked, the model creates a “token” that contains its current estimated position, its visual descriptor (what it looks like), and correlation features (how it matches the surroundings).
- Transformer Refinement: The model uses a Transformer to mix information across time (temporal attention) and across different points (spatial attention). This allows points to “talk” to each other—if one point moves right, its neighbor likely moved right too.
- Update Loop: The model outputs a delta (change) for the position and the descriptor, refining the estimate over several iterations.
\[
\begin{aligned}
(\mathrm{d}\tilde{\mathbf{x}}_s^{i,m},\, \mathrm{d}\tilde{Q}_s^{i,m}) &= \gamma(\mathcal{P}^m, \mathcal{D}) \\
\tilde{\mathbf{x}}_s^{i,m+1} &= \tilde{\mathbf{x}}_s^{i,m} + \mathrm{d}\tilde{\mathbf{x}}_s^{i,m} \\
\tilde{Q}_s^{i,m+1} &= \tilde{Q}_s^{i,m} + \mathrm{d}\tilde{Q}_s^{i,m}
\end{aligned}
\]
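Conceptually, this update loop looks like the toy sketch below (PyTorch). The `ToyRefiner` stands in for the \(\gamma\) network; the real model builds richer tokens with correlation features and applies temporal and spatial attention, which this sketch omits:

```python
import torch
import torch.nn as nn

class ToyRefiner(nn.Module):
    """Toy stand-in for the gamma network in the update equations above
    (illustrative only, not the ETAP transformer)."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 + feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 2 + feat_dim))

    def forward(self, x, Q):
        token = torch.cat([x, Q], dim=-1)          # position + descriptor token
        delta = self.net(token)
        return delta[..., :2], delta[..., 2:]      # (dx, dQ)

def refine(x, Q, refiner, num_iters=4):
    # Iteratively apply x <- x + dx, Q <- Q + dQ, as in the equations above.
    for _ in range(num_iters):
        dx, dQ = refiner(x, Q)
        x, Q = x + dx, Q + dQ
    return x, Q

x = torch.zeros(8, 2)        # 8 tracked points, (x, y) estimates
Q = torch.randn(8, 32)       # their feature descriptors
x, Q = refine(x, Q, ToyRefiner())
```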

To help the model locate the point at the next timestep, it looks at a local neighborhood of pixels. It computes Correlation Features: essentially checking how similar the tracked point’s descriptor is to the features in a grid around the predicted location.
\[ C_s^{i,m} = \bigoplus_{\lambda=1}^{S} \bigoplus_{\delta \in B_\Delta} \left\langle \tilde{Q}_s^{i,m},\, D\!\left(\tilde{\mathbf{x}}_s^{i,m} / k\lambda + \delta\right) \right\rangle \]
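A single-scale version of this lookup can be sketched as follows (PyTorch). The neighborhood size and the normalized-coordinate offsets here are illustrative assumptions, and the real model repeats this over a feature pyramid:

```python
import torch
import torch.nn.functional as F

def local_correlation(desc, feat_map, center, radius=3):
    """Compare a point's descriptor with features in a (2r+1)^2 grid around it.

    desc: (C,) descriptor of the tracked point.
    feat_map: (C, H, W) dense feature map (one pyramid level).
    center: (x, y) predicted location in normalized [-1, 1] coordinates.
    """
    r = radius
    # Offsets delta over a square neighborhood B_Delta (in normalized coords).
    dy, dx = torch.meshgrid(torch.linspace(-0.1, 0.1, 2 * r + 1),
                            torch.linspace(-0.1, 0.1, 2 * r + 1), indexing="ij")
    grid = torch.stack([center[0] + dx, center[1] + dy], dim=-1)[None]  # (1, 2r+1, 2r+1, 2)
    patch = F.grid_sample(feat_map[None], grid, align_corners=True)     # (1, C, 2r+1, 2r+1)
    return torch.einsum("c,bchw->bhw", desc, patch).flatten()           # inner products

feat_map = torch.randn(64, 48, 64)
print(local_correlation(torch.randn(64), feat_map, torch.tensor([0.0, 0.0])).shape)
```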
3. Solving Motion Dependence: The Feature Alignment Loss
This is the most innovative part of the paper. As discussed, event data changes if the motion direction changes. However, for a tracker to be robust, it needs to recognize that “Point A” is still “Point A,” regardless of whether the camera moved up or down.
To teach the network this invariance, the authors introduce a Contrastive Feature Alignment Loss (\(L_{fa}\)).
The Logic: If you take a video sequence and play it backward (time inversion), the motion vectors invert. In standard video, each individual frame looks exactly the same; only their order is reversed. In event vision, the event polarities and spatio-temporal distribution change fundamentally.
\[ e_k \in E_t \iff -p_k\, \nabla L(\mathbf{x}_k, \tau_k) \cdot \boldsymbol{v}(\mathbf{x}_k, \tau_k)\, \delta\tau_k \approx C \]

The relation above makes this precise: because event generation depends on the motion field \(\boldsymbol{v}\), the time-inverted representation \(\tilde{E}\) is fundamentally different from the original \(E\).
The Solution: During training, the researchers feed the network two versions of the same sample:
- The original sequence.
- A time-inverted (and optionally rotated) version.
They then force the network to produce identical feature descriptors for the corresponding points in both versions. By minimizing the difference between these descriptors, the network learns to ignore the motion-specific artifacts and focus on the underlying structure of the object.
\[ \mathcal{L}_{\mathrm{fa}} = \sum_{t} \frac{1}{|\mathcal{P}_t|} \sum_{i,s} \left( 1 - \left\langle \mathrm{u}\!\left(d_t^{s,i}\right),\, \mathrm{u}\!\left(\tilde{d}_t^{s,i}\right) \right\rangle \right)^2 \]
This loss function (\(L_{fa}\)) explicitly penalizes the model if the features of the forward motion don’t match the features of the backward motion (measured by cosine similarity).
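A minimal sketch of both ingredients, the time-inversion augmentation and the alignment loss, is given below (PyTorch). The exact augmentation details (such as the optional rotation) and the simple averaging are simplifications of the paper’s formulation:

```python
import torch
import torch.nn.functional as F

def time_invert(xs, ys, ts, ps, t_end):
    """Time-inversion augmentation: reverse timestamps and flip polarities.
    One plausible way to build the inverted sample; ETAP's exact recipe may differ."""
    return xs, ys, t_end - ts, -ps

def feature_alignment_loss(d, d_tilde):
    """Feature-alignment loss sketch.

    d, d_tilde: (N, C) descriptors of the same points extracted from the
    original and time-inverted sequences. Penalizes (1 - cosine similarity)^2."""
    cos = F.cosine_similarity(d, d_tilde, dim=-1)   # <u(d), u(d~)> per point
    return ((1.0 - cos) ** 2).mean()

d = torch.randn(16, 128)
d_tilde = d + 0.05 * torch.randn(16, 128)           # slightly perturbed "inverted" features
print(feature_alignment_loss(d, d_tilde))           # close to 0 when features agree
```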
EventKubric: A New Benchmark
Deep learning needs data. While there are synthetic datasets for event cameras, they often lack realism or the specific annotations needed for point tracking (visibility flags, long trajectories).
The authors created EventKubric, a large-scale synthetic dataset generated using a complex pipeline:
- Kubric: Renders 3D scenes with physics (gravity, collisions) and realistic textures.
- FILM: Upsamples the video to high framerates to simulate continuous time.
- ESIM: An event simulator that converts the high-speed video into event streams.

This dataset contains over 10,000 samples with varying camera motions and complex object interactions. It provides perfect ground truth for training, which is impossible to get in the real world at this scale.

Experiments and Results
Does it work? The authors tested ETAP against state-of-the-art baselines on both real and synthetic data.
Task 1: Tracking Any Point (TAP)
The primary evaluation was on the EVIMO2 dataset (real-world data) and the new EventKubric dataset.
Quantitative Results: In the synthetic EventKubric benchmark, ETAP achieved a 136% improvement over the baseline (which used E2VID to reconstruct images from events and then ran a standard tracker). This proves that processing events natively is far superior to trying to turn them back into images.
On the real-world EVIMO2 dataset, ETAP showed robust tracking of objects moving in 3D space.

The Ultimate Stress Test: The Fidget Spinner
To highlight the advantage over standard cameras, the authors used the E2D2 dataset, which features a fidget spinner rotating at increasing speeds in low light.
- Frame-based methods (CoTracker): Failed completely. At the camera’s low frame rate (10 Hz), motion blur reduced the spinner to a featureless smear.
- ETAP: Successfully tracked the points on the spinner even as it reached high angular velocities, leveraging the microsecond resolution of the event sensor.

The difference is stark. In the bottom row of Figure 6, you can see the frame-based tracker losing the points almost immediately. ETAP (second row) holds onto them tightly.
Qualitative Win: The method also shines in unconstrained environments, such as tracking a bird in an aviary (Figure 7). This is a “nightmare scenario” for computer vision: a small, deformable object, moving fast, against a complex background with difficult lighting (HDR).

Task 2: Feature Tracking
The authors also compared ETAP against specialized feature tracking algorithms on the EDS and EC datasets.
ETAP outperformed the best event-only method by 20% on the EDS benchmark. Perhaps even more impressively, it outperformed the best method that uses both events and frames (FE-TAP) by 4.1%.


As shown in Figure 8, the tracker is persistent. Even if a point leaves the camera’s field of view and comes back, the model can often re-identify and continue tracking it.
Why the Loss Matters
The researchers conducted an ablation study to verify if their Feature Alignment (FA) loss actually helped. They set up an experiment where they tracked the same points on a pattern moving horizontally and then vertically.
Ideally, the feature descriptors should be similar regardless of the motion direction.

The Result: Without the FA-loss, the similarity between the horizontal and vertical features was low (0.399). With the FA-loss, it jumped to 0.887, almost matching the consistency of frame-based features. This confirms that the network successfully learned to “ignore” the motion direction in its internal representation.

Conclusion and Implications
ETAP marks a significant milestone in neuromorphic vision. By building the first Event-only Tracking Any Point method, the authors have unlocked the ability to track arbitrary scene points in conditions that were previously impossible—extreme speed and challenging dynamic range.
Key takeaways:
- Events > Frames for Speed: Native event processing beats reconstructing images from events.
- Solvable Motion Dependence: The novel Contrastive Feature Alignment Loss effectively teaches the network to learn motion-invariant features.
- Data is Key: The EventKubric dataset provides the necessary scale and complexity to train these high-capacity models.
This technology has massive implications for robotics, autonomous drones, and high-speed industrial inspection, where machines need to “see” faster than the blink of an eye.