Introduction

Imagine you are playing a high-stakes match of virtual reality table tennis against a friend halfway across the world. You swing your controller, expecting your avatar to mirror the movement instantly. But there’s a catch: the internet connection fluctuates. Your swing data travels through a Wide Area Network (WAN), encountering unpredictable delays before reaching the game server or your opponent’s display.

In the world of computer vision and robotics, this is known as the latency problem. Whether it is a surrogate robot replicating a human’s movements or a metaverse avatar interacting with a virtual environment, time delays caused by network transmission and algorithm execution are inevitable.

For years, research in human motion prediction has operated under an idealized assumption: zero latency. Existing models assume that as soon as a movement is observed, the prediction for the immediate future can be generated and applied instantly.

But in the real world, latency is not only present; it is arbitrary. It varies from tens to hundreds of milliseconds based on network stability. If a prediction model expects zero delay but encounters a 200ms lag, the continuity of the motion is broken, and the system fails to anticipate the human’s actual position.

In this post, we will dive deep into a CVPR paper that proposes a novel solution to this problem: ALIEN (Arbitrary Latency-aware Implicit nEural represeNtation). This framework moves away from traditional sequence-based prediction and instead treats human motion as a continuous function, allowing it to generate accurate poses regardless of how much time has passed during the “lag.”

Figure 1. Comparison of existing methods vs. ALIEN. Part (a) shows traditional methods failing to account for delay. Part (b) shows the ALIEN approach, which accounts for the latency period caused by network transmission and algorithm execution.

As shown in Figure 1, the core difference is the acknowledgement of the “Latency Period.” ALIEN ensures that intelligent systems can accurately anticipate human movements even when the data arrives late.

Background: The Latency Challenge

To understand why this paper is significant, we first need to look at how human motion prediction is typically handled.

The Standard Approach: Sequence-to-Sequence

Most state-of-the-art methods use Recurrent Neural Networks (RNNs), Graph Convolutional Networks (GCNs), or Transformers. They treat motion as a discrete sequence of poses. You feed the model a sequence of past frames (e.g., 10 frames of a person walking), and it outputs the next 10 frames.

Mathematically, this looks like learning a mapping function \(f\) such that:

\[ \text{Future Sequence} = f(\text{Past Sequence}) \]
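In code, a typical sequence-to-sequence predictor is simply a model that maps a fixed-length tensor of past poses to a fixed-length tensor of future poses. The sketch below is a minimal, hypothetical illustration of that interface; the layer sizes and the use of a plain MLP in place of an RNN/GCN/Transformer backbone are my own simplifications:

```python
import torch
import torch.nn as nn

class Seq2SeqPredictor(nn.Module):
    """Maps a fixed-length past sequence to a fixed-length future sequence."""
    def __init__(self, n_joints=22, t_past=10, t_future=10, hidden=256):
        super().__init__()
        self.t_future, self.n_joints = t_future, n_joints
        # Stand-in for an RNN/GCN/Transformer backbone.
        self.net = nn.Sequential(
            nn.Linear(t_past * n_joints * 3, hidden),
            nn.ReLU(),
            nn.Linear(hidden, t_future * n_joints * 3),
        )

    def forward(self, past):            # past: (batch, t_past, n_joints, 3)
        flat = past.flatten(start_dim=1)
        out = self.net(flat)
        return out.view(-1, self.t_future, self.n_joints, 3)

model = Seq2SeqPredictor()
past = torch.randn(1, 10, 22, 3)        # 10 observed frames
future = model(past)                     # exactly the next 10 frames, no more
print(future.shape)                      # torch.Size([1, 10, 22, 3])
```

Note how the output horizon is baked into the architecture: the model can only ever produce exactly `t_future` frames, starting immediately after the input.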

The Problem with Arbitrary Latency

When you introduce arbitrary latency, the “Future Sequence” doesn’t start immediately after the “Past Sequence.” There is a gap of unknown duration, which we will call \(T_l\) (the latency duration).

If \(T_l\) is variable (effectively random), the observed past and the target future become disconnected in time. This creates three problems:

  1. Disrupted Continuity: The smooth trajectory of joints is broken by the gap.
  2. Explicit Modeling Failure: Traditional models struggle to define a fixed architecture that can handle a gap that might be 40ms one time and 200ms the next.
  3. Wasted Computation: One naive solution is to predict a very long sequence that covers the maximum possible latency. However, this forces the model to waste computation on the “lagged” frames, which have already elapsed by the time the prediction arrives, degrading the accuracy of the actual future frames we care about. The short sketch after this list makes the arithmetic concrete.
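To see how much is wasted, here is a small back-of-the-envelope sketch. Assuming a 25 fps motion stream (40 ms per frame, consistent with the 40 ms figure mentioned above), a variable latency translates into a variable number of “dead” frames that a fixed-horizon model must burn through before its output becomes useful:

```python
FRAME_MS = 40  # assumed frame interval: 25 fps

def wasted_frames(latency_ms, horizon_frames=10):
    """How many frames of a fixed-horizon prediction fall inside the lag."""
    lag_frames = latency_ms // FRAME_MS   # frames consumed by the delay
    useful = max(horizon_frames - lag_frames, 0)
    return lag_frames, useful

for latency in (40, 120, 200):
    lag, useful = wasted_frames(latency)
    print(f"{latency:>3} ms latency -> {lag} wasted frames, {useful} useful frames")
#  40 ms latency -> 1 wasted frames, 9 useful frames
# 120 ms latency -> 3 wasted frames, 7 useful frames
# 200 ms latency -> 5 wasted frames, 5 useful frames
```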

The Core Method: ALIEN

The researchers behind ALIEN propose a paradigm shift. Instead of treating motion prediction as a “motion-to-motion translation” task (Sequence A \(\rightarrow\) Sequence B), they formulate it as a continuous function learning task using Implicit Neural Representations (INRs).

1. The Concept: Motion as a Neural Network

An Implicit Neural Representation (INR) encodes a signal (in this case, human motion) directly into the weights of a neural network. Instead of storing a list of joint coordinates, the network learns a function \(f(t)\) that maps a specific time coordinate \(t\) to a specific pose.

If you can successfully encode a motion into a network, the latency problem disappears. Why? Because the network doesn’t care about the sequence gap. If you want to know the pose after a 135ms delay, you simply query the network with the timestamp \(t = \text{current\_time} + 135\text{ms}\).
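A toy version of the idea makes this concrete: once a motion is encoded as a function of time, querying an arbitrary continuous timestamp is trivial. The MLP below is a deliberately minimal, hypothetical INR (real INRs typically feed \(t\) through a positional or Fourier encoding first):

```python
import torch
import torch.nn as nn

class MotionINR(nn.Module):
    """Maps a continuous time coordinate t to a full-body pose."""
    def __init__(self, n_joints=22, hidden=256):
        super().__init__()
        self.n_joints = n_joints
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_joints * 3),
        )

    def forward(self, t):               # t: (batch, 1), in seconds
        return self.net(t).view(-1, self.n_joints, 3)

inr = MotionINR()
# No sequence gap to bridge: just ask for the pose 135 ms from now.
t_query = torch.tensor([[0.135]])
pose = inr(t_query)                     # (1, 22, 3)
```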

However, a standard INR is typically trained on a single object or scene (like in NeRFs). To predict future human motion for unseen people, the system needs to be generalizable.

2. Architecture Overview

ALIEN achieves this using a Hyper-Network.

Figure 2. Overview of the ALIEN architecture. It consists of a Hyper-Network (MLLA-based) that generates weights, and a shared INR Decoder that predicts poses based on time coordinates.

Here is the step-by-step flow illustrated in Figure 2:

  1. Input: The system receives a sequence of past observed motions.
  2. Tokenization: These motions are transformed into tokens (using Discrete Cosine Transform for joint trajectories) and fed into the Hyper-Network.
  3. The Hyper-Network (Meta-Learner): This network analyzes the past motion to understand the specific movement characteristics (or “instance-specific information”) of the person. It outputs a set of weights.
  4. Weight Modulation: These weights are used to parameterize the INR Decoder.
  5. The INR Decoder: This is a Multi-Layer Perceptron (MLP). It takes a time coordinate \(t\) as input and, using the weights provided by the Hyper-Network, outputs the 3D pose for that specific time.

This separation is crucial (a minimal sketch follows the bullets below):

  • The Hyper-Network handles spatial dependencies (how joints relate to each other in the observed sequence).
  • The INR Decoder handles temporal modeling (mapping time to pose).
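Here is a minimal, hypothetical sketch of that split. A plain linear encoder stands in for the paper’s DCT tokenization and MLLA meta-learner, and the decoder is a single weight-generated layer with a sinusoidal time embedding rather than the full MLP:

```python
import torch
import torch.nn as nn

N_JOINTS, T_PAST, HIDDEN = 22, 10, 256

class HyperNetwork(nn.Module):
    """Reads the observed motion and emits per-instance decoder weights."""
    def __init__(self):
        super().__init__()
        # Stand-in for DCT tokenization + the MLLA meta-learner.
        self.encoder = nn.Sequential(
            nn.Linear(T_PAST * N_JOINTS * 3, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN * (N_JOINTS * 3)),
        )

    def forward(self, past):                       # (batch, T_PAST, N_JOINTS, 3)
        w = self.encoder(past.flatten(start_dim=1))
        return w.view(-1, N_JOINTS * 3, HIDDEN)    # per-sample weight matrix

def inr_decoder(t, weights):
    """Shared decoder: embed t, then apply the instance-specific weights."""
    freqs = torch.arange(1, HIDDEN + 1, dtype=torch.float32)
    feat = torch.sin(t * freqs)                    # (batch, HIDDEN) time embedding
    pose = torch.einsum('bh,boh->bo', feat, weights)
    return pose.view(-1, N_JOINTS, 3)

hyper = HyperNetwork()
past = torch.randn(1, T_PAST, N_JOINTS, 3)
weights = hyper(past)                              # spatial reasoning happens here
pose = inr_decoder(torch.tensor([[0.2]]), weights) # temporal query happens here
```

The key property survives even in this toy version: all spatial reasoning happens once in the Hyper-Network, after which the decoder can be queried at any continuous time without re-encoding the past.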

3. The Efficient Hyper-Network: MLLA

One of the biggest bottlenecks in using Hyper-Networks is efficiency. Previous methods (like TransINR) used Transformers as the Hyper-Network. While powerful, Transformers have quadratic complexity in the number of tokens. Since human motion data involves many joints over time, this becomes computationally heavy, increasing the very latency the system is trying to mitigate.

The ALIEN authors draw inspiration from the Mamba architecture, adopting Mamba-like Linear Attention (MLLA).

Figure 3. Comparison of Hyper-Network architectures. (a) Gradient-based meta-learning, (b) TransINR (Transformer-based), and (c) The MLLA Meta-Learner used in ALIEN, which offers linear complexity.

As shown in Figure 3(c), the MLLA module replaces the heavy softmax attention of Transformers. It allows the model to process long sequences of tokens with linear complexity, making the system significantly faster and more suitable for real-time applications.
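The paper’s exact MLLA block has more machinery than plain linear attention, but the core efficiency trick can be shown in a few lines: replacing softmax with a kernel feature map \(\phi\) lets us compute \(\phi(K)^\top V\) first, so the \(O(N^2)\) attention matrix is never materialized. The sketch below uses the common \(\phi(x) = \text{elu}(x) + 1\) feature map as an assumption, not the paper’s exact kernel:

```python
import torch
import torch.nn.functional as F

def softmax_attention(Q, K, V):
    """Standard attention: materializes an (N x N) matrix -> O(N^2)."""
    scores = Q @ K.transpose(-2, -1) / Q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ V

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized attention: associativity gives O(N) in sequence length."""
    phi = lambda x: F.elu(x) + 1                  # positive feature map
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.transpose(-2, -1) @ V                 # (d x d), independent of N
    z = Qp @ Kp.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps
    return (Qp @ kv) / z                          # normalized output, (N, d)

N, d = 1024, 64                                   # many tokens, small head dim
Q, K, V = (torch.randn(N, d) for _ in range(3))
out = linear_attention(Q, K, V)                   # no N x N matrix ever built
```

Because `kv` has shape \(d \times d\) regardless of sequence length, doubling the number of tokens roughly doubles the cost instead of quadrupling it.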

4. Low-Rank Modulation

Even with a faster architecture, generating the full weights for a deep neural network is expensive. If the INR Decoder has \(256 \times 256\) weight matrices, the Hyper-Network needs to output \(65{,}536\) values for each one.

To solve this, ALIEN employs Low-Rank Modulation. Instead of generating the full matrix \(W\), the Hyper-Network generates two smaller matrices, \(U\) and \(V\), and a base parameter \(Z\).

\[ \theta_{W_l} = Z \odot \sigma\!\left(U V^T\right) \]

Equation 8. The formula for Low-Rank Modulation, where \(\sigma\) is the sigmoid function and \(\odot\) denotes element-wise multiplication.

The weight matrix \(\theta_{W_l}\) is computed as the product of \(U\) and \(V^T\) (modulated by a sigmoid function) and element-wise multiplied by a shared base parameter \(Z\). This drastically reduces the number of parameters the Hyper-Network needs to predict, further boosting efficiency.
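A small sketch shows why this matters for parameter counts. Following the equation above, the Hyper-Network only has to emit the two rank-\(r\) factors, while the base \(Z\) is a shared learned parameter; the rank value below is an assumption for illustration:

```python
import torch

d, r = 256, 16                       # layer width and (assumed) rank

Z = torch.randn(d, d)                # shared base weight, learned once
U = torch.randn(d, r)                # emitted by the Hyper-Network per sample
V = torch.randn(d, r)                # emitted by the Hyper-Network per sample

# Eq. 8: theta = Z (element-wise *) sigmoid(U @ V^T)
theta = Z * torch.sigmoid(U @ V.T)   # full (256 x 256) modulated weight

full_params = d * d                  # 65,536 values for a dense prediction
lowrank_params = 2 * d * r           # 8,192 values at rank 16
print(full_params, lowrank_params)   # 65536 8192
```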

5. Multi-Task Learning Strategy

Perhaps the most clever aspect of ALIEN is how it trains.

In a real-time scenario, the “latency period” is a black hole: we never observe what happens during the lag. However, during training, we have the full ground-truth data.

The authors realized that the poses occurring during the latency period contain valuable information about the continuity of motion. They designed a Multi-Task Learning Framework with two objectives:

  1. Primary Task (Prediction): Predict the future poses after the latency period, i.e., at times \(t > T_h + T_l\), where \(T_h\) marks the end of the observed history.
  2. Auxiliary Task (Reconstruction): Reconstruct the poses during the latency period using the variable delay information.

Equation 10. The Multi-Task Learning loss function. \(\mathcal{L}_{pred}\) focuses on the future prediction, while \(\mathcal{L}_{rec}\) focuses on reconstructing the latency-period poses.

By forcing the model to reconstruct the missing latency frames (\(\mathcal{L}_{rec}\)), the Hyper-Network learns a much more robust representation of the motion dynamics. It effectively teaches the INR to “fill in the blanks” correctly, which naturally improves the accuracy of the future predictions.
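In training code, this amounts to querying the same INR at two disjoint sets of timestamps and summing two losses. The sketch below is a minimal version under my own assumptions: MSE for both terms and a weighting factor `lam`, which the paper may balance differently:

```python
import torch
import torch.nn.functional as F

def multitask_loss(inr, t_future, gt_future, t_latency, gt_latency, lam=1.0):
    """L = L_pred + lam * L_rec, both queried from the same INR."""
    l_pred = F.mse_loss(inr(t_future), gt_future)     # primary: future frames
    l_rec = F.mse_loss(inr(t_latency), gt_latency)    # auxiliary: lag frames
    return l_pred + lam * l_rec

# Toy usage with a stand-in INR (any callable mapping times -> poses works).
inr = torch.nn.Sequential(torch.nn.Linear(1, 66))
t_lat, t_fut = torch.rand(2, 1), torch.rand(10, 1)
loss = multitask_loss(inr, t_fut, torch.randn(10, 66), t_lat, torch.randn(2, 66))
loss.backward()                                       # gradients flow to the INR
```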

Experiments & Results

The researchers evaluated ALIEN on three major datasets: Human3.6M, CMU-MoCap, and 3DPW. They adapted state-of-the-art baselines (like LTD, SPGSN, and NeRMo) to the “arbitrary latency” setting for a fair comparison.

Performance on Human3.6M

The results on the Human3.6M dataset (a standard benchmark) are telling.

Table 1. Comparison of prediction errors (MPJPE) on Human3.6M under arbitrary latency. Lower numbers indicate better performance.

Table 1 reports the Mean Per Joint Position Error (MPJPE) for each action class and prediction horizon.

  • Observations: ALIEN achieves the best (bold) or second-best (underlined) results in almost every category.
  • Comparison: Note the comparison with NeRMo. NeRMo is another recent method using Implicit Neural Representations. However, ALIEN consistently outperforms it (e.g., in the “Walking” category at 600ms, ALIEN scores 52.7 vs. NeRMo’s 55.1). This validates the superiority of the MLLA Hyper-Network and the Multi-Task training strategy over NeRMo’s optimization approach.

Visualizing the “Lag” Recovery

Qualitative results often tell a clearer story than numbers. Figure 4 visualizes how different models handle a walking sequence with a latency of 2 frames (\(T_l = 2\)).

Figure 4. Visualization of predictions. Notice the red boxes in LTD and NeRMo rows—these indicate invalid or distorted poses generated due to the latency gap. ALIEN (Ours) produces smooth, valid poses that closely match the Ground Truth (G.T.).

  • The Baseline Struggle: Look at the LTD and NeRMo rows (highlighted with red boxes). Because of the disconnection caused by latency, these models often generate distorted or “broken” poses where limbs are in impossible positions.
  • The ALIEN Advantage: The Ours row shows smooth continuity. Even after the latency gap (yellow shaded area), the model picks up the stride perfectly, matching the Ground Truth (G.T.).

Efficiency and Scalability

For a system designed to handle latency, the model itself cannot be slow.

Table 4. Comparison of running time and model size. ALIEN runs significantly faster than GCN and Transformer-based methods.

Table 4 highlights that ALIEN runs in 29.10ms, which is within the 30ms budget typically allocated for algorithm execution in real-time systems. It is faster than SPGSN (35.22ms) and MSR-GCN (48.62ms) while maintaining lower error rates.

Furthermore, the authors analyzed how performance holds up as the latency length increases.

Figure 5. (Left) Performance on CMU-MoCap as latency length increases. ALIEN (Blue) consistently beats NeRMo (Orange). (Right) Performance on standard zero-latency tasks.

Figure 5 (Left) demonstrates that even as the latency grows (from 1 to 5 frames), ALIEN’s error rate remains lower than the competitor NeRMo.

Why MLLA Works: Attention Visualization

Why did the authors choose the Mamba-like Linear Attention over standard Transformers? Aside from speed, does it learn effectively?

Figure 6. Attention maps from the MLLA module showing interactions between body joint tokens.

Figure 6 visualizes the attention interactions between body joint tokens inside the Hyper-Network.

  • Early Layers (a): The diagonal line is prominent, meaning joints primarily pay attention to themselves.
  • Deeper Layers (c): The attention spreads out. This indicates that the model is learning global context—understanding how the movement of one joint (e.g., a foot) influences another (e.g., a hand) to generate the correct INR weights.

Conclusion

The paper “ALIEN: Implicit Neural Representations for Human Motion Prediction under Arbitrary Latency” identifies a critical gap in current computer vision research. By moving away from the assumption of zero latency, the authors tackle a real-world problem that affects everything from cloud gaming to tele-robotics.

Key Takeaways:

  1. New Task Definition: The paper formally defines the task of motion prediction under arbitrary, variable latency.
  2. Implicit Modeling: Using INRs allows the model to treat motion as a continuous function, decoupling the prediction from fixed time steps.
  3. Efficiency via Mamba: The use of Mamba-like Linear Attention allows the Hyper-Network to process complex spatial data without the computational penalty of Transformers.
  4. Data Efficiency: The Multi-Task learning framework cleverly repurposes the “lost” latency period as a training signal to improve model robustness.

ALIEN represents a significant step toward more responsive and immersive intelligent systems. As we move toward a future dominated by remote interaction and the metaverse, handling the invisible delay of the network will be just as important as the visual fidelity of the avatars themselves.