TempoFormer: Teaching Transformers the Concept of Time
Imagine reading a single text message from a friend: “I’m fine.”
On its own, it’s a neutral statement. But what if you knew that ten minutes ago, they posted about a terrible breakup? Or what if that text came three months after a long silence? The meaning changes entirely based on the context and the time that has passed.
In Natural Language Processing (NLP), standard models like BERT are incredible at understanding the context of words within a sentence. However, they struggle with the context of time across a sequence of events. They treat a sequence of posts as a simple ordered list (\(1, 2, 3 \dots\)), ignoring whether the gap between post 1 and post 2 was five seconds or five days.
Today, we are diving into TempoFormer, a research paper that proposes a novel architecture to solve this exact problem. It introduces a way to bake “time awareness” directly into the Transformer architecture, allowing models to better detect changes in mood, stance, or conversation topics.
The Problem: Missing the “When”
Dynamic representation learning focuses on understanding how linguistic content evolves. This is crucial for tasks like:
- Mental Health Monitoring: Detecting a sudden shift from positive to negative mood.
- Stance Detection: Seeing when a user changes their opinion on a political rumor.
- Conversation Derailment: Noticing when a chat goes off-topic.
In these scenarios, the most recent post doesn’t tell the whole story. You need the history.

As shown in Figure 1 above, looking at the final post (“And I’m back to being single”) in isolation might suggest a simple relationship status update. But viewed against the timeline of previous optimistic posts, it represents a sharp “Switch” in behavior.
The Limitation of Current Approaches
To tackle this, researchers typically take a two-step approach:
- Use a Transformer (like BERT) to get a vector representation of each individual post.
- Feed those vectors into a Recurrent Neural Network (like an LSTM) to model the sequence.
While this works, it’s not ideal. RNNs are slow, difficult to parallelize, and prone to overfitting on small datasets. More importantly, this approach treats time as a secondary feature or just an index sequence, rather than an integral part of the language understanding process.
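To make the contrast concrete, here is a minimal sketch of that two-step pipeline in PyTorch, assuming `bert-base-uncased`, a BiLSTM, and a 3-class head (all illustrative choices, not the paper’s exact baselines). Notice that the posts’ timestamps never enter the model: only their order does.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

# Step-1 encoder (per post) and step-2 sequence model (over posts).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
post_encoder = AutoModel.from_pretrained("bert-base-uncased")
sequence_model = nn.LSTM(input_size=768, hidden_size=256,
                         batch_first=True, bidirectional=True)
classifier = nn.Linear(2 * 256, 3)  # illustrative 3-class change-detection head

posts = ["Feeling great today!", "Had a lovely walk.", "And I'm back to being single"]

with torch.no_grad():
    # Step 1: one [CLS] vector per post (local, word-level context only).
    enc = tokenizer(posts, padding=True, truncation=True, return_tensors="pt")
    cls_vectors = post_encoder(**enc).last_hidden_state[:, 0, :]   # (num_posts, 768)

    # Step 2: run the RNN over the post vectors (global, order-only context).
    # Timestamps have nowhere to go -- time is reduced to list position.
    hidden, _ = sequence_model(cls_vectors.unsqueeze(0))           # (1, num_posts, 512)
    logits = classifier(hidden[:, -1, :])                          # predict for the latest post
```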
Enter TempoFormer
The researchers from Queen Mary University of London and The Alan Turing Institute introduce TempoFormer, a model that modifies the Transformer architecture to be natively “temporally aware.”
Instead of relying on RNNs, TempoFormer modifies the attention mechanism itself. It allows the model to “attend” to historical posts based on how far away they are in time, not just how far away they are in the list.
The Architecture
TempoFormer is a hierarchical model. It doesn’t process the whole timeline at once; it builds understanding in layers. Let’s break down the architecture shown below.

As illustrated in Figure 2, the process works in three main stages:
- Post-level Encoding (Local): The bottom of the diagram shows individual posts (\(U_{i-4}\) to \(U_i\)). These pass through the first 10 layers of a standard BERT model. At this stage, the model only looks at words within a post, creating a strong local representation (\(H^{10}\)).
- Stream-level Encoding (Global): This is where TempoFormer diverges from standard BERT. The model takes the representations from the 10th layer and adds Stream-level Position Embeddings, which tell the model the order of the posts. It then uses a specialized Temporal Rotary Multi-Head Attention (MHA) mechanism (more on this in the next section) to model the relationships between posts. This allows the representation of the current post to be influenced by previous posts, weighted by their temporal distance.
- Context-Enhanced Encoding: Finally, the global “stream-aware” information is fused back into the local “word-level” information using a Gate & Norm mechanism. This ensures the final classification is based on both the specific words used and the historical context (a simplified sketch of the whole hierarchy follows below).
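To make the hierarchy concrete, here is a shape-level PyTorch sketch of the three stages. The layer counts, dimensions, and stand-in modules (a generic encoder instead of BERT’s lower layers, plain multi-head attention instead of the temporal rotary version) are illustrative assumptions, not the authors’ implementation.

```python
import torch
from torch import nn

# Toy shapes: a window of 5 posts, 32 tokens each, hidden size 768.
num_posts, seq_len, dim = 5, 32, 768

# Stage 1 -- post-level encoding: each post is encoded independently at the word
# level (the paper uses BERT's first 10 layers; a small TransformerEncoder stands in).
post_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True), num_layers=2
)
tokens = torch.randn(num_posts, seq_len, dim)   # already-embedded tokens of each post
H10 = post_encoder(tokens)                      # (num_posts, seq_len, dim)
post_cls = H10[:, 0, :]                         # one [CLS]-style vector per post

# Stage 2 -- stream-level encoding: add stream position embeddings, then let the
# post vectors attend to one another. In TempoFormer this attention is the
# Temporal Rotary MHA described in the next section; plain MHA stands in here.
stream_pos = nn.Embedding(num_posts, dim)
stream_in = (post_cls + stream_pos(torch.arange(num_posts))).unsqueeze(0)
stream_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=12, batch_first=True)
H_stream, _ = stream_attn(stream_in, stream_in, stream_in)   # (1, num_posts, dim)

# Stage 3 -- context-enhanced encoding: the stream-aware vector of the current post
# is fused back into its word-level representation via a gate & norm step
# (sketched under "Fusing Contexts" below) before the classification head.
```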
The Secret Sauce: Temporal Rotary Attention
The core innovation of this paper is how it handles attention. Standard Transformers use Positional Embeddings to understand that word A comes before word B.
Recently, Rotary Position Embeddings (RoPE) have become popular (used in models like LLaMA). RoPE encodes position by mathematically “rotating” the vector representation of a token in space. The angle of rotation depends on the token’s position index (\(1, 2, 3 \dots\)).
The dot product between a query \(\mathbf{q}\) at position \(m\) and a key \(\mathbf{k}\) at position \(n\) in RoPE looks like this:

\[
\mathbf{q}_m^\top \mathbf{k}_n = \left(R_{\theta, m} W_q \mathbf{x}_m\right)^\top \left(R_{\theta, n} W_k \mathbf{x}_n\right) = \mathbf{x}_m^\top W_q^\top R_{\theta,\, n-m} W_k \mathbf{x}_n
\]

Here, \(R\) is a rotation matrix. The crucial part is \(R_{\theta,\, n-m}\): because the two rotations compose into a single rotation by the offset, the attention score depends only on the relative distance (\(n-m\)) between the two items.
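To see why only the relative distance survives, here is a tiny NumPy check on a single 2-D frequency slice of RoPE (toy vectors and angles, purely illustrative):

```python
import numpy as np

def rot(angle):
    """2-D rotation matrix -- one frequency slice of a rotary embedding."""
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])

theta = 0.1
q, k = np.array([1.0, 2.0]), np.array([0.5, -1.0])   # toy query/key slice

m, n = 3, 7                                            # absolute positions
score = (rot(m * theta) @ q) @ (rot(n * theta) @ k)    # attention-style dot product

# The same score is obtained by rotating only by the relative offset n - m.
same_score = q @ (rot((n - m) * theta) @ k)
assert np.isclose(score, same_score)
```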
The TempoFormer Twist: The authors realized that for change detection, the index distance (\(n-m\)) matters less than the time distance.
TempoFormer replaces the index difference with the time difference (\(\mathbf{t}_n - \mathbf{t}_m\)):

\[
\mathbf{q}_m^\top \mathbf{k}_n = \mathbf{x}_m^\top W_q^\top R_{\theta,\, \mathbf{t}_n - \mathbf{t}_m} W_k \mathbf{x}_n
\]

In this rotation matrix, the model uses the actual timestamp difference (log-transformed to handle large gaps) to determine the rotation. This means:
- Two posts sent 1 minute apart will have a “small” rotation difference, resulting in high attention compatibility.
- Two posts sent 1 month apart will have a “large” rotation difference, naturally decaying the attention the model pays to the older post.
This is a mathematically elegant way to make the Transformer “feel” the passage of time without needing complex external features.
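Here is one way to picture the temporal twist in code, again on a single 2-D slice. The hypothetical `time_angle` helper and its `log1p` scaling are an illustrative reading of “log-transformed timestamps”; the paper’s exact normalization may differ.

```python
import numpy as np

def rot(angle):
    """2-D rotation matrix -- one frequency slice of the rotary embedding."""
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])

def time_angle(seconds_since_start, theta=0.1):
    # Hypothetical helper: map a post's timestamp (seconds since the first post
    # in the window) to a rotation angle via a log transform, so that gaps of
    # minutes and months stay on a comparable scale.
    return theta * np.log1p(seconds_since_start)

q, k = np.array([1.0, 2.0]), np.array([0.5, -1.0])   # toy query/key slice

def temporal_score(t_query, t_key):
    # Rotate query and key by their *time-based* angles instead of index-based
    # ones; the resulting dot product varies with the temporal gap between posts.
    return (rot(time_angle(t_query)) @ q) @ (rot(time_angle(t_key)) @ k)

print(temporal_score(t_query=60.0, t_key=0.0))              # posts one minute apart
print(temporal_score(t_query=30 * 24 * 3600.0, t_key=0.0))  # posts roughly one month apart
```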
Fusing Contexts
Once the model has computed these temporally-aware representations, it needs to combine them with the original word meanings. The paper adapts a Gated Context Fusion mechanism.


In these equations, \(\mathbf{g}\) acts as a learned gate. It decides how much of the global, time-aware history (\(H'_{CLS}\)) should be mixed with the local, word-level information (\(H_{CLS}\)). This allows the model to dynamically balance context: sometimes the history matters most, other times the specific words in the current post are all that is needed.
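As a rough sketch, a gate of this kind can be implemented as a sigmoid over the concatenated local and global [CLS] vectors. The parameterization below is an assumption for illustration; the paper’s exact formulation may differ.

```python
import torch
from torch import nn

class GatedContextFusion(nn.Module):
    """Illustrative gate & norm fusion of the local (word-level) and global
    (stream-level, time-aware) [CLS] vectors. Assumed form, not the paper's code."""

    def __init__(self, dim=768):
        super().__init__()
        self.gate_proj = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, h_local, h_global):
        # g lies in (0, 1) per dimension: how much time-aware history to let in.
        g = torch.sigmoid(self.gate_proj(torch.cat([h_local, h_global], dim=-1)))
        # Mix the two views and re-normalize before the classification head.
        return self.norm((1 - g) * h_local + g * h_global)

fusion = GatedContextFusion(dim=768)
h_cls, h_cls_stream = torch.randn(1, 768), torch.randn(1, 768)
fused = fusion(h_cls, h_cls_stream)   # (1, 768), fed to the classifier
```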
Experimental Setup
The researchers tested TempoFormer on three diverse datasets representing different granularities of time:
- LRS (Longitudinal Rumour Stance): Tracking how users support or deny rumors over time.
- TalkLife (Moments of Change): Detecting mood shifts in social media timelines.
- Topic Shift MI: Identifying when a conversation derails from its main topic.

As shown in Table 1, these datasets vary significantly. “TalkLife” has huge gaps between posts (hours), while “Topic Shift” happens in real-time conversation.
Results: Does it Work?
The short answer is yes. TempoFormer achieves State-of-the-Art (SOTA) performance across the board.

Table 2 highlights several key findings:
- TempoFormer vs. Baselines: It outperforms both standard post-level BERT and more complex stream-level models built on RNNs, such as BiLSTM and Seq-Sig-Net.
- TempoFormer vs. LLMs: Surprisingly, large language models like Llama-2 (7B) and Mistral (7B) performed poorly on these tasks, even with few-shot prompting. This suggests that while LLMs are great at generating text, they struggle with the specific temporal reasoning required to detect subtle shifts in longitudinal data.
- Minority Classes: TempoFormer was particularly good at detecting the “rare” events—the actual switches and derailments—which are usually the hardest to catch.
The Importance of Window Size
Since the model looks at a history of posts, how far back should it look? The authors analyzed different window sizes (\(w=5, 10, 20\) posts).

Figure 3 shows that “more” isn’t always “better.”
- LRS benefited from a longer window (\(w=20\)), likely because rumor stances evolve slowly.
- TalkLife peaked at a medium window (\(w=10\)).
- This highlights that TempoFormer is flexible; the temporal scope can be tuned to fit the specific rhythm of the dataset.
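For intuition, here is a hypothetical way to slice a timeline into rolling windows of \(w\) posts while keeping the timestamps the temporal attention needs. The stride and window bookkeeping here are illustrative choices, not taken from the paper.

```python
from datetime import datetime

def rolling_windows(posts, timestamps, w=10):
    """Build one window per post: the post itself plus up to w-1 predecessors,
    with real timestamps retained rather than reduced to list indices."""
    windows = []
    for i in range(len(posts)):
        start = max(0, i - w + 1)
        windows.append({
            "texts": posts[start:i + 1],
            "times": timestamps[start:i + 1],
            "label_index": i,  # the post being classified for change
        })
    return windows

posts = ["Feeling hopeful", "Great day out", "And I'm back to being single"]
times = [datetime(2024, 5, 1), datetime(2024, 5, 3), datetime(2024, 5, 20)]
for win in rolling_windows(posts, times, w=2):
    print(len(win["texts"]), win["times"][-1])
```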
Does the “Time Rotation” Actually Matter?
To prove that their new attention mechanism was responsible for the gains, the authors performed an ablation study (removing parts of the model to see what breaks).

Table 3 reveals that:
- Removing Temporal RoPE (replacing it with standard sequential RoPE) caused a drop in performance. This confirms that knowing the time gap is more valuable than just knowing the sequence order.
- Removing the Gate & Norm fusion also hurt performance significantly, proving that you need to carefully blend the global history with the local content.
Conclusion
TempoFormer represents a significant step forward in Dynamic Representation Learning. By modifying the rotary embeddings to encode time instead of position, the authors created a model that natively understands temporal dynamics.
Key Takeaways:
- RNNs are not the only way: You don’t need LSTMs to model history. Transformers can do it better if you adjust the attention mechanism.
- Time is a feature: In human behavior, when something is said is often as important as what is said.
- Flexibility: The approach is not tied to one task or one backbone; it can be applied to BERT, RoBERTa, or potentially any Transformer-based model to give it a sense of time.
For students and researchers in NLP, TempoFormer demonstrates that we shouldn’t just treat inputs as static lists. Modifying the core architecture to reflect the real-world properties of data (like time) can yield better results than simply making models larger.