The stock market is a chaotic, noisy environment. To make sense of it, a human trader doesn’t just look at a single number. They look at price charts (visual), read news and social media (textual), and analyze numeric indicators (quantitative). Crucially, they don’t just look at the current moment; they look at the trend over the last few days or weeks. This combination of different data types over time is what researchers call temporal multimodal data.
While humans naturally synthesize this information, teaching a machine to do so is incredibly difficult. Most existing financial models focus on just one modality—analyzing price history as a time series or performing sentiment analysis on news headlines. Few successfully combine vision, language, and price action while respecting the passage of time.
In this post, we will deep-dive into MEANT (Multimodal Encoder for Antecedent Information), a research paper that proposes a novel architecture to solve this problem. We will explore how the authors combine computer vision, natural language processing (NLP), and a unique temporal attention mechanism to predict stock momentum. We will also examine TempStock, a massive new dataset created to benchmark this task.
The Problem: Data Across Modes and Time
Before dissecting the solution, we must understand the complexity of the input data. Financial prediction is not just about what is happening now, but about what happened leading up to now.
- Multimodality: Information exists in different formats. A tweet saying “Buying the dip!” is text. A chart showing a “Golden Cross” is an image. The closing price of $150.00 is a number.
- Temporality: These data points are sequential. A tweet from five days ago has a different relevance than a tweet from five minutes ago.
Existing architectures often concatenate these features or process them separately without fully understanding their temporal relationship. The MEANT model aims to unify these by treating the lag period (the days leading up to a prediction) as a structured sequence.
The Foundation: The MACD Indicator
To create a supervised learning problem for stock prediction, the authors rely on a classic technical indicator: the Moving Average Convergence-Divergence (MACD).
The MACD is a trend-following momentum indicator. It is calculated using Exponential Moving Averages (EMA).
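For reference, the textbook EMA recurrence with span \(N\) is:

\[
\mathrm{EMA}_t = \alpha\, p_t + (1 - \alpha)\, \mathrm{EMA}_{t-1}, \qquad \alpha = \frac{2}{N + 1}
\]

where \(p_t\) is the closing price on day \(t\).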

The MACD consists of three parts:
- MACD Line: Difference between the 12-day and 26-day EMA.
- Signal Line: A 9-day EMA of the MACD line.
- Histogram: The difference between the MACD and the Signal line.
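To make the arithmetic concrete, here is a minimal pandas sketch of the indicator. This is a standard implementation of the formulas above; the paper's exact preprocessing may differ.

```python
import pandas as pd

def macd(close: pd.Series,
         fast: int = 12, slow: int = 26, signal: int = 9) -> pd.DataFrame:
    """Compute the three MACD components from a closing-price series."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()    # 12-day EMA
    ema_slow = close.ewm(span=slow, adjust=False).mean()    # 26-day EMA
    macd_line = ema_fast - ema_slow                         # MACD line
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()  # Signal line
    histogram = macd_line - signal_line                     # Histogram
    return pd.DataFrame({"macd": macd_line,
                         "signal": signal_line,
                         "histogram": histogram})
```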
The authors use this indicator not just as numbers, but as visual graphs fed into the model.

The interaction between the MACD line and the Signal line indicates momentum. When the MACD line crosses above the Signal line, it is often considered a “buy” signal. This visual representation allows the model to “see” trends just as a technical analyst would.
The MEANT Architecture
The core of this research is the MEANT model itself. It is a Transformer-based architecture designed to ingest three distinct streams of data over a lag period (specifically, 5 days):
- Images: Graphs of the MACD indicator.
- Language: Tweets regarding the specific stock ticker.
- Price: Numeric vectors containing EMA, Signal, and MACD values.
The architecture is split into two main phases: the Modality Encoders (processing text and images separately) and the Temporal Encoder (combining everything over time).

Let’s break down each component shown in the schematic above.
1. The Language and Vision Pipelines
MEANT is an encoder-only model, similar to BERT. It avoids recurrence (like LSTMs) in favor of pure attention mechanisms.
The Language Pipeline: Tweets are tokenized using a FinBERT tokenizer (specialized for financial text). The encoder uses an interleaved structure inspired by the Magneto model, utilizing sub-layer normalization to improve stability. For positional embeddings—which tell the model the order of words—MEANT uses xPos embeddings, a variant of rotary embeddings that helps the model extrapolate to different sequence lengths.
The Vision Pipeline: This is where the model gets creative. Instead of using a standard Convolutional Neural Network (CNN) to look at a static image, the authors use a TimeSFormer. This architecture was originally designed for video processing.
Why use a video processor for stock charts? Because the input isn’t just one graph; it’s a sequence of graphs over the lag period (5 days). The TimeSFormer treats these graphs like frames in a video, allowing it to extract “spatiotemporal” features—changes in the visual chart over time.
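Concretely, the five daily charts can be stacked like frames of a short clip. A shape-only sketch (the 224×224 resolution is an assumption, not taken from the paper):

```python
import torch

# Five daily MACD chart images over the lag period.
lag, channels, height, width = 5, 3, 224, 224
daily_charts = [torch.randn(channels, height, width) for _ in range(lag)]

# Stack the lag window into one clip: (frames, C, H, W) -- the layout a
# TimeSFormer-style encoder expects, attending over patches (space) and days (time).
chart_video = torch.stack(daily_charts, dim=0)      # (5, 3, 224, 224)
clip_batch = chart_video.unsqueeze(0)               # (1, 5, 3, 224, 224)
```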
2. From Day-Encoding to Sequence-Encoding
The output of the language encoder for a single day is a tensor representing features of all the tweets from that day. However, the temporal attention mechanism needs a single concise vector for each day in the lag sequence.
The authors propose two methods to compress the language output (\(L_{out}\)) into a sequence vector (\(L_{seq}\)):
Method A: Mean Pooling. This simply takes the average of the features.

Method B: Sequence Projection. This uses a learned projection matrix and a non-linear activation (GELU) to condense the information, effectively learning a “latent representation” for the day.
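A minimal PyTorch sketch of both options. Shapes are illustrative, and the exact wiring of the learned projection is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# L_out: the language encoder's output for one day, shape (batch, n_tokens, dim).
batch, n_tokens, dim = 8, 128, 64
L_out = torch.randn(batch, n_tokens, dim)

# Method A: mean pooling -- average over the token axis.
L_seq_mean = L_out.mean(dim=1)                                  # (batch, dim)

# Method B: sequence projection -- a learned matrix over the token axis
# followed by GELU, condensing the day into one vector.
proj = nn.Linear(n_tokens, 1)
L_seq_proj = F.gelu(proj(L_out.transpose(1, 2)).squeeze(-1))    # (batch, dim)
```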

As we will see in the results, the choice between these two methods depends heavily on the specific dataset being used.
3. The Temporal Attention Mechanism
Once the text and images are encoded into vectors for each day in the lag period, they are concatenated with the raw numeric price data (\(M\)). This creates a unified tensor \(T\) representing the multimodal state of the stock over the last 5 days.
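Schematically, for each day the per-modality vectors are joined along the feature dimension (notation ours, following the description above):

\[
T = \left[\, L_{seq} \;\|\; I_{seq} \;\|\; M \,\right]
\]

where \(\|\) denotes concatenation.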

Now comes the most distinctive innovation of MEANT: Query-Targeting.
Standard self-attention allows every token to look at every other token. However, in stock prediction, we are specifically interested in how the history (days \(t-5\) to \(t-1\)) predicts the target (day \(t\)).
The authors force the attention mechanism to focus on the transition to the target day. They calculate a special Query matrix (\(Q_t\)) derived only from the day preceding the target (\(T_{t-1}\)) and a learned parameter \(q\).
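The paper gives the precise formulation; based on the description, it takes a form along the lines of

\[
Q_t = T_{t-1}\, q,
\]

with \(q\) learned and \(T_{t-1}\) the representation of the final lag day.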

The Attention calculation then proceeds using this targeted query against the Keys (\(K\)) and Values (\(V\)) of the entire sequence:
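With the query pinned to the final lag day, this is standard scaled dot-product attention:

\[
\mathrm{Attention}(Q_t, K, V) = \mathrm{softmax}\!\left(\frac{Q_t K^{\top}}{\sqrt{d_k}}\right) V
\]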

This results in a temporal output that has specifically weighed the historical tweets and prices in relation to the moment right before the prediction is required.
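Here is a minimal PyTorch sketch of query-targeting. It is illustrative rather than the authors' code; the key/value projections and the shape of \(q\) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetedTemporalAttention(nn.Module):
    """Attention whose query comes only from the final lag day."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Parameter(torch.randn(dim, dim) / dim**0.5)  # learned q
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, T: torch.Tensor) -> torch.Tensor:
        # T: (batch, lag, dim) -- one multimodal vector per lag day.
        last_day = T[:, -1:, :]                  # T_{t-1}: (batch, 1, dim)
        Q_t = last_day @ self.q                  # targeted query
        K, V = self.k_proj(T), self.v_proj(T)    # keys/values over all days
        scores = Q_t @ K.transpose(-2, -1) / K.shape[-1] ** 0.5
        weights = F.softmax(scores, dim=-1)      # how much each day matters
        return weights @ V                       # (batch, 1, dim) summary
```

Calling this on a tensor of shape (batch, 5, dim) yields one vector per example, summarizing the lag window from the vantage point of the day before the prediction.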
4. Classification
Finally, the temporal outputs from the language stream (\(T_{lang}\)) and the image stream (\(T_{img}\)) are concatenated into a final vector.

This vector is passed through a Multilayer Perceptron (MLP) head to produce the final binary classification: Buy or Sell.
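A minimal sketch of this head, with illustrative dimensions:

```python
import torch
import torch.nn as nn

dim = 64                                 # illustrative hidden size
T_lang = torch.randn(8, dim)             # temporal output, language stream
T_img = torch.randn(8, dim)              # temporal output, vision stream

# Concatenate the two streams and classify buy (1) vs. sell (0).
head = nn.Sequential(
    nn.Linear(2 * dim, dim),
    nn.GELU(),
    nn.Linear(dim, 2),
)
logits = head(torch.cat([T_lang, T_img], dim=-1))   # (8, 2)
```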
The TempStock Dataset
To train and test this sophisticated architecture, the authors needed data that was aligned across modalities. Existing datasets often lacked the visual graph component or didn’t structure data in the specific lag format required.
They introduced TempStock, a dataset covering all companies in the S&P 500 from April 2022 to April 2023.
Data Structure
For every target day \(t\), the input consists of a 5-day history:
- 5 Numeric Vectors (\(M\)): EMA, Signal, Histogram, MACD values.
- 5 Tweet Sets (\(X\)): All tweets mentioning the ticker for those days.
- 5 Graphs (\(G\)): Images of the MACD chart.
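For intuition, a single training example might look like the following. Field names and values are hypothetical, not the dataset's actual schema:

```python
# Hypothetical sketch of one TempStock example for a target day t.
example = {
    "ticker": "AAPL",                    # illustrative ticker
    "target_day": "2022-11-04",
    "prices": [                          # M: one numeric vector per lag day
        {"ema": 150.2, "signal": 1.1, "histogram": -0.3, "macd": 0.8},
        # ... four more days
    ],
    "tweets": [                          # X: the day's tweets about the ticker
        ["Buying the dip!", "$AAPL looks ready to run"],
        # ... four more days
    ],
    "graphs": [                          # G: MACD chart image per lag day
        "charts/AAPL/2022-10-31.png",
        # ... four more days
    ],
    "label": 1,                          # 1 = buy crossover, 0 = sell crossover
}
```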

Labels: The Strategy
The dataset defines a binary classification task. The labels are determined by the MACD signal-line crossover strategy, a common technical analysis signal.
Positive (Buy) Signal: The MACD line crosses above the Signal line. This indicates the start of bullish momentum.

Negative (Sell) Signal: The MACD line crosses below the Signal line. This indicates the start of bearish momentum.
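As a sketch of how such labels can be derived (not the authors' exact pipeline), using the frame returned by the macd() function earlier in this post:

```python
import pandas as pd

def crossover_labels(ind: pd.DataFrame) -> pd.Series:
    """Label days where the MACD line crosses the Signal line."""
    above = ind["macd"] > ind["signal"]
    prev = above.shift(1, fill_value=bool(above.iloc[0]))  # avoid a day-0 signal
    labels = pd.Series(pd.NA, index=ind.index, dtype="object")
    labels[above & ~prev] = 1    # crossed above: positive (buy)
    labels[~above & prev] = 0    # crossed below: negative (sell)
    return labels                # non-crossover days stay NA and are filtered out
```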

The dataset is surprisingly balanced between positive and negative signals, which is rare in financial datasets and removes the need for artificial oversampling.

It is worth noting that the authors filtered out days where no crossover occurred or where tweet volume was insufficient, resulting in a dataset focused purely on these momentum shift events.
Experimental Results
The authors trained three versions of their model: MEANT-base, MEANT-large, and MEANT-XL. They compared these against strong baselines, including:
- TEANet: The previous state-of-the-art (SOTA) for this type of task.
- LSTM: A standard recurrent neural network.
- ViLT & VL-BERT: General purpose vision-language models (fine-tuned).
Performance on TempStock
The results were compelling. As the model size increased, performance improved significantly.

Key Takeaways from the Results:
- Size Matters: MEANT-XL achieved an F1 score of 0.8440, the highest among all models.
- Multimodality Wins: MEANT outperformed single-modality baselines (like FinBERT for text only or TimeSFormer for vision only).
- TEANet is Strong: Interestingly, TEANet (0.7898) outperformed the base version of MEANT (0.7815). TEANet uses an LSTM backbone, which is naturally suited for time-series data. However, as MEANT scaled up to “Large” and “XL,” its attention-based architecture allowed it to surpass the recurrent baselines.
- General Models Struggle: General multimodal models like ViLT performed poorly (F1 0.5483). This highlights that financial prediction requires specialized architectures that handle “lag periods” explicitly, rather than just looking at a single image-text pair.
Confusion Matrices
To visualize the performance, we can look at the confusion matrices. A perfect model would only have values on the diagonal (top-left to bottom-right).
Here is MEANT-XL:

Compare this to TEANet:

While TEANet is competent, MEANT-XL shows a tighter clustering of correct predictions, particularly in identifying buy signals (the bottom-right quadrant).
Ablation Study: What Matters Most?
One of the most interesting parts of any deep learning paper is the ablation study, where researchers remove parts of the model to see what breaks.
The authors found a stark difference in the value of the modalities:
- Text is King: Removing the image modality caused a small drop in performance. However, removing the Tweet modality caused a massive performance collapse.
- Price is Crucial: Unsurprisingly, removing the price data (which is strictly correlated to the labels) also degraded performance heavily.
This suggests that while the graphs provide useful context, the short-term sentiment found in social media (Tweets) contains features that are highly indicative of immediate momentum shifts. The long-range visual information in the graphs was less predictive than the immediate textual reaction of the market.

Testing on StockNet
To ensure their model wasn’t just overfitting to their own dataset, the authors also tested MEANT on StockNet, an existing external benchmark dataset. StockNet is harder because it relies on raw price movement (up/down) rather than clean MACD crossovers, and it doesn’t include images.
The authors adapted MEANT to run without the vision component (MEANT-Tweet-price).

Here, MEANT-XL achieved an accuracy of 82.15%, shattering the previous SOTA (TEANet) which sat at roughly 67%. This proves that the Temporal Attention mechanism—specifically the Query-Targeting strategy—is highly effective at extracting dependencies from sequential financial data, even without the image component.
Sequence Projection vs. Mean Pooling
Recall the two methods for compressing the daily language features: Sequence Projection (learning a vector) vs. Mean Pooling (averaging).
The experiments revealed a nuance:
- TempStock: Sequence Projection worked better.
- StockNet: Mean Pooling worked better.

The authors hypothesize that because StockNet relies heavily on noisy tweets for binary price prediction, the parameterized projection might be “over-thinking” or discarding crucial spatial information. Mean pooling acts as a safer, noise-dampening summary. Conversely, TempStock’s labels are derived from a cleaner mathematical indicator (MACD), allowing the model to learn a more complex, useful projection without overfitting to noise.
Conclusion and Future Implications
The MEANT paper represents a significant step forward in financial machine learning. It moves beyond the limitations of analyzing a single data type or a single point in time. By combining computer vision (to read charts), NLP (to read sentiment), and a novel temporal attention mechanism (to understand history), it achieves state-of-the-art results.
Key Takeaways:
- Temporal Attention Works: Forcing the model to query historical data based on the pre-target day is a powerful inductive bias for time-series prediction.
- Scale Improves Financial Models: Larger Transformer-based models (MEANT-XL) significantly outperform smaller recurrent models (LSTM/TEANet).
- Language Drives Momentum: In the context of this study, what people said (Tweets) was more predictive of momentum shifts than what the charts looked like.
While the authors warn strictly against using this for financial decision-making (as all responsible researchers should), the architecture lays the groundwork for more sophisticated “AI Traders” that can process information much like a human does—but with the ability to read millions of tweets and charts in seconds.