1. Introduction
We live in a world dominated by video content, yet we are often limited by the hardware that captures it. Most videos are archived at fixed resolutions (like 1080p) and fixed frame rates (usually 30 or 60 fps). But what if you want to zoom in on a distant detail without it becoming a pixelated mess? Or what if you want to slow down a fast-moving action shot without it looking like a slideshow?
This is the domain of Space-Time Video Super-Resolution (STVSR)—the ability to increase both the spatial resolution (making images sharper) and the temporal resolution (adding frames for smooth motion) of a video.
Traditionally, this is done at fixed scales (e.g., upscaling spatially by exactly 4x and interpolating exactly 2 intermediate frames). However, the “Holy Grail” of this field is Continuous STVSR (C-STVSR): the ability to upscale a video to any arbitrary resolution and frame rate, including non-integer factors, on the fly.
While recent methods based on Implicit Neural Representations (INRs) have made strides in this area, they suffer from a major bottleneck: they rely solely on standard RGB frames. Frames miss the motion that happens in the intervals between exposures, especially in fast scenes, which leads to blurry or hallucinated results when the model is pushed outside its training distribution.
Enter EvEnhancer, a novel approach presented in a recent paper by researchers from Beijing Jiaotong University and Hefei University of Technology. EvEnhancer integrates Event Cameras—bio-inspired sensors that capture changes in brightness asynchronously—to bridge the gap between frames.

As shown in Figure 1, EvEnhancer significantly outperforms existing state-of-the-art methods (like VideoINR and MoTIF) across different datasets and upsampling scales, offering sharper details and fewer artifacts.
In this post, we will deconstruct how EvEnhancer works, focusing on its two core innovations: the Event-Adapted Synthesis Module (EASM) and the Local Implicit Video Transformer (LIVT).
2. Background
Before diving into the architecture, let’s establish a few foundational concepts.
The Limits of Frame-Based Vision
Standard cameras capture the world as a sequence of still images (frames) at fixed intervals. If an object moves quickly between frame \(t\) and frame \(t+1\), that information is lost forever. When algorithms try to interpolate these frames, they have to “guess” the motion, which often leads to ghosting artifacts.
Event Cameras
Event cameras are different. Instead of capturing full frames, each pixel operates independently and reports changes in brightness (logarithmic intensity) asynchronously. This produces a stream of “events” with microsecond-level temporal resolution. This stream provides a continuous record of motion, making it the perfect companion for filling in the gaps between standard video frames.
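To make this asynchronous stream digestible by a convolutional network, a common preprocessing step is to bin events into a fixed-size voxel grid. The sketch below illustrates that general idea; the function name, bilinear time weighting, and shapes are assumptions, not necessarily the exact representation EvEnhancer uses.

```python
import numpy as np

def events_to_voxel_grid(xs, ys, ts, ps, num_bins, height, width):
    """Accumulate an asynchronous event stream (x, y, timestamp, polarity)
    into a spatio-temporal voxel grid so a CNN can consume it.
    Generic representation for illustration only."""
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t0, t1 = ts.min(), ts.max()
    # Normalize timestamps to [0, num_bins - 1] and spread each event over
    # its two nearest temporal bins (bilinear weighting in time).
    t_norm = (ts - t0) / max(t1 - t0, 1e-9) * (num_bins - 1)
    left = np.floor(t_norm).astype(int)
    right = np.clip(left + 1, 0, num_bins - 1)
    w_right = t_norm - left
    np.add.at(grid, (left, ys, xs), ps * (1.0 - w_right))
    np.add.at(grid, (right, ys, xs), ps * w_right)
    return grid

# Toy usage: 1000 random events on a 64x64 sensor, binned into 5 temporal channels.
rng = np.random.default_rng(0)
xs = rng.integers(0, 64, 1000); ys = rng.integers(0, 64, 1000)
ts = np.sort(rng.random(1000)); ps = rng.choice([-1.0, 1.0], 1000)
voxels = events_to_voxel_grid(xs, ys, ts, ps, num_bins=5, height=64, width=64)
print(voxels.shape)  # (5, 64, 64)
```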
Implicit Neural Representations (INR)
To achieve continuous super-resolution (e.g., zooming by 3.4x or 5.9x), we cannot use standard upsampling layers like deconvolution. Instead, researchers use INRs. The idea is to treat the video as a continuous function: you feed a neural network a spatiotemporal coordinate \((x, y, t)\), and it outputs the RGB value at that specific point. In theory this allows arbitrary resolution, but training such networks to capture high-frequency details (textures) and complex temporal dynamics is incredibly difficult.
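As a minimal sketch of the INR idea, consider a plain coordinate MLP that maps \((x, y, t)\) to an RGB value. This is purely illustrative; real C-STVSR decoders (including EvEnhancer's) also condition on features extracted from the input frames.

```python
import torch
import torch.nn as nn

class VideoINR(nn.Module):
    """Minimal implicit video representation: map an (x, y, t) coordinate to RGB.
    Illustrative of the INR idea only; not the paper's decoder."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, coords):  # coords: (N, 3) with x, y, t normalized to [0, 1]
        return self.net(coords)

# Query the "video" at any continuous coordinate, e.g. a full frame at t = 0.54.
model = VideoINR()
x, y = torch.meshgrid(torch.linspace(0, 1, 128), torch.linspace(0, 1, 128), indexing="ij")
coords = torch.stack([x.flatten(), y.flatten(), torch.full((128 * 128,), 0.54)], dim=-1)
rgb = model(coords).reshape(128, 128, 3)
```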
3. The EvEnhancer Architecture
The researchers propose a unified framework that combines the motion precision of events with the visual richness of frames. The architecture, illustrated below, is split into two main stages.

The workflow begins with EASM, which fuses events and frames to create a high-quality feature sequence. These features are then passed to LIVT, which handles the continuous upsampling.
3.1. Event-Adapted Synthesis Module (EASM)
The goal of the EASM is to extract “latent” (hidden) inter-frame features. Since events provide a continuous stream of motion data, the model can use them to figure out exactly what happened between two RGB frames.
This module is further divided into two sub-steps:
A. Event-Modulated Alignment (EMA)
Alignment is crucial in video processing. The model needs to align features from neighboring frames to the current timestamp being processed.
The authors use a pyramid structure (processing features at different scales). However, instead of just estimating optical flow from images, they use Event Modulation.
The modulation works by modifying the motion vectors using the event features. Since events capture the exact trajectory of motion, they can guide the alignment process much more accurately than image-based flow alone.
The modulated motion vector is calculated as:

Here, \(\mathcal{M}\) represents the modulation function, and \(F^E\) represents the event features. This process is performed in both forward and backward directions to capture motion context from both the past and the future.
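Conceptually, this amounts to refining an image-based motion estimate with a correction predicted from the event features. The sketch below is an illustrative stand-in for \(\mathcal{M}\), not the authors' exact EMA block; the convolutional modulation function, channel counts, and shapes are assumptions.

```python
import torch
import torch.nn as nn

class EventModulation(nn.Module):
    """Conceptual sketch of event-modulated alignment.
    A coarse image-based motion estimate is refined using event features F^E,
    which carry the intermediate motion between frames. Layers are illustrative."""
    def __init__(self, channels):
        super().__init__()
        # Stand-in for M(.): predict a correction to the motion vectors
        # from the concatenation of coarse motion and event features.
        self.modulate = nn.Sequential(
            nn.Conv2d(channels + 2, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 2, 3, padding=1),
        )

    def forward(self, coarse_flow, event_feat):
        # coarse_flow: (B, 2, H, W) image-based motion; event_feat: (B, C, H, W)
        correction = self.modulate(torch.cat([coarse_flow, event_feat], dim=1))
        return coarse_flow + correction  # modulated motion vectors

ema = EventModulation(channels=16)
flow = ema(torch.randn(1, 2, 32, 32), torch.randn(1, 16, 32, 32))
print(flow.shape)  # torch.Size([1, 2, 32, 32])
```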
B. Bidirectional Recurrent Compensation (BRC)
Alignment isn’t enough. We need to fuse these features over time to build a robust representation of the video. The authors employ a Bidirectional Recurrent Neural Network (RNN) approach.
In this step, the model propagates information across time. It looks at the aligned frame features and the event stream, fusing them iteratively.

As described in the equations above, the forward hidden state (\(h^f\)) and backward hidden state (\(h^b\)) are updated by combining the current frame features (\(F^{LR}\)) and event features (\(F^E\)). This ensures that the high temporal resolution of the events is fully utilized to “fill in the blanks” of the video sequence, creating a feature set that is rich in temporal detail.
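A toy sketch of this bidirectional propagation is shown below, with a plain convolutional fusion cell standing in for the paper's actual compensation unit; the cell design and channel sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class BidirectionalCompensation(nn.Module):
    """Toy sketch of bidirectional recurrent fusion: forward and backward hidden
    states are updated from aligned frame features F^LR and event features F^E."""
    def __init__(self, channels):
        super().__init__()
        self.fwd_cell = nn.Conv2d(3 * channels, channels, 3, padding=1)
        self.bwd_cell = nn.Conv2d(3 * channels, channels, 3, padding=1)

    def forward(self, frame_feats, event_feats):
        # frame_feats, event_feats: lists of T tensors, each (B, C, H, W)
        T = len(frame_feats)
        h_f = torch.zeros_like(frame_feats[0])
        h_b = torch.zeros_like(frame_feats[0])
        fwd, bwd = [], [None] * T
        for t in range(T):            # forward pass: past -> future
            h_f = torch.relu(self.fwd_cell(
                torch.cat([h_f, frame_feats[t], event_feats[t]], dim=1)))
            fwd.append(h_f)
        for t in reversed(range(T)):  # backward pass: future -> past
            h_b = torch.relu(self.bwd_cell(
                torch.cat([h_b, frame_feats[t], event_feats[t]], dim=1)))
            bwd[t] = h_b
        # Concatenate both directions for each timestep.
        return [torch.cat([f, b], dim=1) for f, b in zip(fwd, bwd)]

feats = [torch.randn(1, 16, 32, 32) for _ in range(4)]
evts = [torch.randn(1, 16, 32, 32) for _ in range(4)]
fused = BidirectionalCompensation(16)(feats, evts)
print(fused[0].shape)  # torch.Size([1, 32, 32, 32])
```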
3.2. Local Implicit Video Transformer (LIVT)
This is the heart of the “continuous” capability. Previous methods often decoupled space and time—learning one function for spatial upscaling and another for temporal interpolation. This is suboptimal because space and time are correlated (e.g., a moving object changes position over time).
EvEnhancer introduces the Local Implicit Video Transformer (LIVT), which learns a unified video representation.

Instead of processing the entire video volume (which would be prohibitively expensive), LIVT uses a local attention mechanism. Here is the step-by-step process:
Step 1: Temporal Selection
Given a target timestamp \(\mathcal{T}\) (e.g., we want to generate a frame at \(t=0.54\)), the model first identifies the nearest available feature grids in the sequence generated by the EASM. It selects the local window of \(T^G\) feature frames closest to the target time, as in the sketch below.
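In code, this selection can be as simple as a nearest-timestamp lookup; the helper below is an illustrative stand-in, not the paper's implementation.

```python
import torch

def select_local_window(feat_times, target_t, window_size):
    """Return the indices of the `window_size` feature frames whose timestamps
    are closest to the target time (a stand-in for the temporal selection step)."""
    dist = (feat_times - target_t).abs()
    idx = torch.argsort(dist)[:window_size]
    return torch.sort(idx).values  # keep temporal order

feat_times = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
print(select_local_window(feat_times, target_t=0.54, window_size=2))  # tensor([2, 3])
```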

Step 2: 3D Local Attention with Positional Encoding
Once the local grid is selected, the model needs to determine the RGB value for a specific coordinate. It treats the target coordinate as a Query (\(q\)) and the surrounding local features as Keys (\(k\)) and Values (\(v\)).
To help the transformer understand where and when things are happening relative to the query point, the authors use a Cosine Positional Encoding. This encodes the spatiotemporal distance \((\delta \tau, \delta x, \delta y)\) between the query point and the local grid pixels.
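The paper's exact encoding formula is not reproduced here; the sketch below uses a generic sin/cos (Fourier-style) encoding of the offsets, with an arbitrary number of frequencies, simply to convey how relative spatiotemporal distances become attention-friendly features.

```python
import torch

def cosine_positional_encoding(deltas, num_freqs=8):
    """Encode spatiotemporal offsets (dt, dx, dy) with sin/cos at several frequencies
    so the attention can reason about relative position. Frequencies are illustrative."""
    # deltas: (N, 3) relative offsets between the query point and local grid pixels
    freqs = 2.0 ** torch.arange(num_freqs) * torch.pi        # (num_freqs,)
    angles = deltas.unsqueeze(-1) * freqs                    # (N, 3, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(1)                                    # (N, 3 * 2 * num_freqs)

offsets = torch.tensor([[0.04, -0.3, 0.1]])  # one query's (dt, dx, dy)
print(cosine_positional_encoding(offsets).shape)  # torch.Size([1, 48])
```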

Step 3: Cross-Scale Attention
The magic happens here. The model computes attention between the query (coordinate) and the local features. This allows the network to dynamically aggregate information from the most relevant spatiotemporal neighbors.

By computing the dot product of the query \(q\) and keys \(k\), adding the positional bias \(b\), and using the resulting attention weights to aggregate the values \(v\), the model produces a feature vector \(\tilde{z}\). This vector is then decoded by a simple Multi-Layer Perceptron (MLP) into the final RGB pixel value.
Because this happens via coordinates, you can request any spatial coordinate and any temporal timestamp, achieving true continuous Space-Time Video Super-Resolution.
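Putting Steps 2 and 3 together, the attention-and-decode computation can be sketched as follows; the shapes, softmax scaling, and MLP widths are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def local_implicit_attention(q, k, v, bias, mlp):
    """Sketch of the attention-then-decode step: attention weights come from q.k plus
    a positional bias, the weighted values give the feature z~, and an MLP maps it to RGB."""
    # q: (B, 1, D) one query per target coordinate; k, v: (B, L, D) local grid features
    # bias: (B, 1, L) positional bias derived from the encoded offsets
    attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5 + bias, dim=-1)  # (B, 1, L)
    z = attn @ v                              # (B, 1, D) aggregated feature z~
    return mlp(z.squeeze(1))                  # (B, 3) RGB value at the queried coordinate

mlp = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
q, k, v = torch.randn(2, 1, 64), torch.randn(2, 27, 64), torch.randn(2, 27, 64)
rgb = local_implicit_attention(q, k, v, torch.randn(2, 1, 27), mlp)
print(rgb.shape)  # torch.Size([2, 3])
```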
4. Experiments and Results
The researchers evaluated EvEnhancer on standard datasets like Adobe240 and GoPro (synthetic events) and BS-ERGB (real-world events). They compared it against top-tier methods including TimeLens, VideoINR, and MoTIF.
Quantitative Superiority
Let’s look at the numbers. Table 1 shows the performance on “In-Distribution” scales (upsampling scales the model saw during training).

EvEnhancer (and its lighter version, EvEnhancer-light) consistently achieves the highest PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index). For example, on the GoPro dataset, EvEnhancer achieves 33.52 dB, significantly higher than MoTIF (31.04 dB) or VideoINR (30.26 dB). Importantly, it does this with roughly half the parameters of MoTIF.
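As a quick refresher on what these numbers mean, PSNR is a logarithmic measure of pixel-wise reconstruction error, so gaps of a couple of dB are substantial; SSIM additionally compares local structure. A minimal PSNR computation (assuming images normalized to [0, 1]) looks like this:

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB; higher means closer to the ground truth."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```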
Generalization to Out-of-Distribution (OOD) Scales
The real test of a continuous model is how it handles scales it hasn’t seen before. Table 2 shows the results when pushing the model to extreme settings (e.g., Temporal scale \(t=16\), Spatial scale \(s=12\)).

Even at these extreme scales, EvEnhancer maintains robust performance, whereas other methods degrade more sharply. This proves that the LIVT module has successfully learned a continuous representation of the video, rather than just memorizing fixed upsampling patterns.
Efficiency
You might think that adding event processing and transformers makes the model heavy. However, the TFLOPs (tera floating-point operations) comparison in Table 5 shows otherwise.

While VideoINR is computationally cheaper, its performance is much lower. Compared to MoTIF, EvEnhancer offers a better trade-off, providing state-of-the-art quality with manageable computational costs, especially at higher temporal scales.
Visual Quality
The quantitative metrics are backed up by visual evidence. In Figure 4, we can see the reconstruction quality on the GoPro dataset.

Notice the text and fine details. EvEnhancer recovers sharp edges that are blurred or lost in VideoINR and MoTIF.
Furthermore, temporal consistency is vital for video. If frames are upsampled individually without context, the video will flicker. Figure 5 visualizes the “temporal profile” (a slice of the video over time).

The smooth lines in the EvEnhancer column indicate stable, consistent motion. In contrast, the jagged or blurry lines in the other columns indicate flickering and temporal instability.
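A temporal profile can be produced by fixing one row (or column) of pixels and stacking that slice across frames; the snippet below shows the general recipe, not the paper's exact visualization code.

```python
import numpy as np

def temporal_profile(video, row):
    """Slice a fixed row from every frame and stack the slices over time.
    video: (T, H, W, 3) array. The result is a (T, W, 3) image whose vertical axis
    is time; smooth streaks indicate consistent motion, jagged ones indicate flicker."""
    return np.stack([frame[row] for frame in video], axis=0)

video = np.random.rand(60, 128, 128, 3)  # toy 60-frame clip
profile = temporal_profile(video, row=64)
print(profile.shape)                      # (60, 128, 3)
```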
Ablation Studies
To prove that every component is necessary, the authors conducted ablation studies. For instance, they tested the LIVT design by switching between 2D decoupling (the old way) and their unified 3D approach.

Table 9 clearly shows that the “3D unification” strategy yields better results, particularly for out-of-distribution (OOD) scales, confirming that modeling space and time together is the correct approach.
5. Conclusion
EvEnhancer represents a significant step forward in video processing. By marrying the high temporal resolution of event cameras with the continuous representation capabilities of Implicit Neural Representations, the authors have created a system that is both flexible and powerful.
The Event-Adapted Synthesis Module ensures that the model understands complex motion trajectories that regular frames miss, while the Local Implicit Video Transformer allows the model to render this information at any scale humans might need.
For students and researchers in computer vision, EvEnhancer highlights two important trends:
- Multi-modality is key: Combining standard frames with novel sensors like event cameras can solve fundamental limitations in data capture.
- Continuous representations are the future: Moving away from fixed discrete upsampling allows for more versatile and generalized AI models.
As event cameras become more accessible, we can expect to see more technologies like EvEnhancer making their way into consumer devices, potentially allowing us to turn standard footage into high-speed, 4K slow-motion video at the click of a button.