In the rapidly evolving landscape of Natural Language Processing (NLP), we currently face a “two-model problem.” If you are building a Retrieval-Augmented Generation (RAG) system, you typically need two distinct architectures: a retriever (usually a bidirectional encoder like BERT) to handle embeddings and search, and a generator (a decoder-only LLM like GPT or Llama) to synthesize the answer.
This structural separation creates inefficiencies. It doubles deployment costs and prevents knowledge sharing between tasks. What if a single Large Language Model (LLM) could handle both high-quality text generation and high-quality sentence representation simultaneously?
This is the promise of UniMAE, a novel unsupervised training method presented in the paper “Decoder-Only LLMs can be Masked Auto-Encoders.” In this post, we will break down how UniMAE forces standard decoder-only LLMs to learn powerful sentence embeddings without losing their ability to generate fluent text.
The Problem: Why LLMs Struggle with Embeddings
To understand UniMAE, we first need to understand the architectural gap between encoders and decoders.
Encoder models (e.g., BERT) use bidirectional attention. Every token can “see” every other token in the sentence simultaneously. They often use a special token (like [CLS]) to aggregate the meaning of the entire sentence, making them excellent for creating vector embeddings used in search and clustering.
Decoder-only models (e.g., Llama, GPT) use unidirectional (causal) attention. A token can only attend to the tokens that came before it. They are trained to predict the next token in a sequence. Consequently, they lack a dedicated mechanism to represent the entire sentence. If you simply take the average of all token embeddings from an LLM, the quality is often poor compared to dedicated encoders.
Recent attempts to fix this involve “hacking” the LLM—either by averaging outputs or forcing bidirectional attention during fine-tuning. However, these methods often degrade the model’s primary strength: text generation.
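To make this concrete, here is a minimal sketch (using Hugging Face `transformers`, not code from the paper) that contrasts the two usual ways of pulling a sentence vector out of a decoder-only model: averaging all token states versus taking the state of the last token. The `gpt2` checkpoint is just a small stand-in for a Llama-style LLM, and the last real token plays the role of the [EOS] position we will care about later.

```python
# A minimal comparison of two pooling strategies for a decoder-only model.
# "gpt2" is a small, open stand-in for a Llama-style LLM; any causal LM works.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

if tokenizer.pad_token is None:              # GPT/Llama tokenizers often lack a pad token
    tokenizer.pad_token = tokenizer.eos_token

sentences = [
    "UniMAE compresses a whole sentence into one vector.",
    "Retrieval and generation from a single model.",
]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state          # (batch, seq_len, dim)

mask = batch["attention_mask"].unsqueeze(-1).float()   # zero out padding positions

# (a) Mean pooling: average every non-padding token state.
mean_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# (b) Last-token pooling: the state of the final real token, the only position
#     that has attended to the entire sentence (UniMAE appends an [EOS] token here).
last_idx = batch["attention_mask"].sum(dim=1) - 1
last_emb = hidden[torch.arange(hidden.size(0)), last_idx]

print(mean_emb.shape, last_emb.shape)                  # both: (2, hidden_dim)
```

Because of causal attention, only that last position has seen the whole sentence, which is exactly the property UniMAE builds on.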
The Solution: UniMAE
The researchers propose UniMAE (Uni-Directional Masked Auto-Encoder). The core idea is ingenious yet straightforward: force the model to compress all the semantic information of a sentence into the final [EOS] (End-Of-Sequence) token.
How do they achieve this? By training the model to reconstruct the original sentence using only that [EOS] embedding.
![Figure 1: Overview of UniMAE training. The left part shows the Masked Auto-Regressive process, while the right part shows the reconstruction process using a tiny decoder with [EOS] embeddings to recover the input.](/en/paper/file-2328/images/001.jpg#center)
As shown in Figure 1 above, the framework consists of two parallel training objectives:
- Masked Auto-Regressive (MAR): The LLM generates text but with a twist—some inputs are masked.
- Masked Re-Construct (MRC): A tiny, separate decoder tries to rebuild the sentence using the representation from the LLM’s [EOS] token.
Let’s dive deep into these two components.
1. Masked Auto-Regressive (MAR)
Standard LLMs are trained via Auto-Regressive (AR) learning: given a sequence of words, predict the next one. UniMAE adopts a slightly harder version of this called Masked Auto-Regressive (MAR).
In MAR, the input sentence \(X\) is corrupted by randomly masking a certain percentage of tokens (denoted as \(\tilde{X}\)). The LLM must still predict the correct sequence.
The loss function for MAR is calculated as follows:

\[
\mathcal{L}_{\mathrm{MAR}} = -\sum_{t=1}^{N} \log P\!\left(x_t \mid \tilde{x}_{<t}\right)
\]
Here, the model attempts to predict each original token \(x_t\) based on the corrupted previous tokens \(\tilde{x}_{<t}\), summed over all \(N\) positions in the sentence.
Why do this? If we fed the clean sentence to the model, it might simply memorize sequences. By introducing noise (masks), we force the model to rely on semantic context rather than just short-range correlations. This ensures the model is actually “thinking” about the content, which is crucial for building robust representations.
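As a rough illustration of MAR (a sketch, not the authors' implementation), the snippet below corrupts a sentence at a 50% mask ratio while keeping the clean tokens as the prediction targets. Using `gpt2` as the stand-in model and recycling the EOS id as the mask token are assumptions made purely so the example runs.

```python
# Sketch of the MAR objective: randomly mask the input, but keep the clean
# sentence as the prediction target. Not the paper's code; "gpt2" and the use
# of the EOS id as a stand-in mask token are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

MASK_TOKEN_ID = tokenizer.eos_token_id   # assumed mask token for this sketch
MASK_RATIO = 0.5                         # the paper finds 40-60% works best

def mar_loss(text: str) -> torch.Tensor:
    ids = tokenizer(text, return_tensors="pt")["input_ids"]   # clean sequence X
    corrupted = ids.clone()                                    # corrupted sequence X~
    is_masked = torch.rand(ids.shape) < MASK_RATIO
    corrupted[is_masked] = MASK_TOKEN_ID

    # The model reads the corrupted prefix but is graded on the clean tokens:
    # Hugging Face shifts `labels` internally, so position t predicts x_{t+1}
    # from the (possibly masked) tokens up to t.
    return model(input_ids=corrupted, labels=ids).loss

print(mar_loss("UniMAE turns decoder-only LLMs into masked auto-encoders.").item())
```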
2. Masked Re-Construct (MRC)
This is the heart of the UniMAE innovation. The goal is to make the embedding of the [EOS] token, denoted as \(h_{eos}\), a container for the entire sentence’s meaning.
First, the LLM processes the masked input \(\tilde{X}\) to produce the latent representation at the end of the sequence:
\[
h_{eos} = \mathrm{LLM}(\tilde{X})_{[\mathrm{EOS}]}
\]
Because of the causal attention mask in decoder models, the [EOS] token is the only position that has attended to all previous tokens. However, without incentive, the [EOS] token has no reason to “remember” the start of the sentence—it only cares about what comes next (usually nothing).
To fix this, the researchers initialize a Tiny Decoder. This is a small, temporary neural network used only during training. Its job is to take the \(h_{eos}\) vector and recreate the original sentence.
The input to this Tiny Decoder consists of two parts, \(H_1\) and \(H_2\), which combine the [EOS] embedding with trainable positional embeddings (\(p\)):

\[
\begin{aligned}
H_1 &= [\,h_{eos} + p_0,\ \ldots,\ h_{eos} + p_N\,], \\
H_2 &= [\,h_{eos},\ e_{x_1} + p_1,\ \ldots,\ e_{x_N} + p_N\,].
\end{aligned}
\]
The Tiny Decoder uses an attention mechanism to reconstruct the data. However, there is a catch: the authors apply a mask (\(M\)) to the attention matrix. This forces the reconstruction to rely heavily on \(h_{eos}\) rather than cheating by looking at adjacent token embeddings. The attention computation is formalized as:
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M\right)V,
\]
where the queries, keys, and values are derived from \(H_1\) and \(H_2\), and \(M\) blocks the attention paths that would let a position simply copy its neighbours.
The objective function for this reconstruction step is to minimize the difference between the reconstructed tokens and the original input tokens:
\[
\mathcal{L}_{\mathrm{MRC}} = -\sum_{t=1}^{N} \log P\!\left(x_t \mid h_{eos}, x_{<t}\right)
\]
By minimizing this loss, the gradients flow back into the main LLM, updating the weights specifically to ensure that \(h_{eos}\) captures a high-fidelity summary of the input text.
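The toy module below sketches this reconstruction step. It is not the paper's Tiny Decoder: the sizes, the choice of \(H_1\) as queries and \(H_2\) as keys/values, and the exact shape of the mask \(M\) (here, each position may look only at \(h_{eos}\) and strictly earlier token embeddings) are assumptions for illustration.

```python
# Toy sketch of the Masked Re-Construct (MRC) step: a tiny decoder attends from
# H1 (EOS + positions) over H2 (EOS + token embeddings) under a mask M and must
# predict the original tokens. Sizes, the causal-style mask, and the query/key
# split are assumptions for illustration, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim, N = 1000, 64, 8          # toy sizes; N = sentence length

class TinyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)      # e_x
        self.pos_emb = nn.Embedding(N + 1, dim)           # trainable p_0 .. p_N
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.to_vocab = nn.Linear(dim, vocab_size)

    def forward(self, h_eos, x):
        # h_eos: (batch, dim) from the LLM's [EOS] position; x: (batch, N) original ids
        pos = self.pos_emb(torch.arange(N + 1, device=x.device))          # (N+1, dim)
        h1 = h_eos[:, None, :] + pos[None, :N, :]                          # queries: h_eos + p_t
        h2 = torch.cat([h_eos[:, None, :],                                 # keys/values:
                        self.tok_emb(x) + pos[None, 1:, :]], dim=1)        # [h_eos, e_x + p]
        # Mask M: query t may see h_eos (column 0) and tokens strictly before t,
        # so the reconstruction must lean on h_eos rather than copy neighbours.
        M = torch.ones(N, N + 1, device=x.device).triu(1).bool()
        out, _ = self.attn(h1, h2, h2, attn_mask=M)
        logits = self.to_vocab(out)                                        # (batch, N, vocab)
        return F.cross_entropy(logits.reshape(-1, vocab_size), x.reshape(-1))

# Usage with random stand-ins for the LLM outputs:
decoder = TinyDecoder()
h_eos = torch.randn(2, dim)                       # pretend [EOS] embeddings from the LLM
x = torch.randint(0, vocab_size, (2, N))          # original token ids to reconstruct
print(decoder(h_eos, x).item())                   # L_MRC on this toy batch
```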


Joint Optimization
The final training recipe combines both objectives. The model is trained to be a good generator (MAR) and a good encoder (MRC) simultaneously:
\[
\mathcal{L} = \mathcal{L}_{\mathrm{MAR}} + \alpha\, \mathcal{L}_{\mathrm{MRC}}
\]
Here, \(\alpha\) is a weight parameter balancing the two tasks.
Interestingly, the Tiny Decoder is discarded after training. During inference, you simply feed text into the LLM, extract the [EOS] vector, and use it as your sentence embedding.
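A deliberately tiny, self-contained sketch of the joint update looks like this. Two linear layers stand in for the LLM and the Tiny Decoder, and the “losses” are placeholders; the only point is the \(\alpha\)-weighted sum and the fact that the reconstruction gradient reaches the LLM's weights.

```python
# Toy sketch of one joint UniMAE update. The modules and "losses" are stand-ins,
# not the paper's code: what matters is the alpha-weighted sum and that L_MRC's
# gradient flows back into the LLM parameters.
import torch
import torch.nn as nn

llm = nn.Linear(4, 4)            # stand-in for the decoder-only LLM
tiny_decoder = nn.Linear(4, 4)   # stand-in for the temporary reconstruction decoder
alpha = 1.0                      # balancing weight; the actual value is a tuning choice

optimizer = torch.optim.AdamW(
    list(llm.parameters()) + list(tiny_decoder.parameters()), lr=1e-3
)

x = torch.randn(8, 4)
h = llm(x)                               # pretend hidden states (h_eos would live here)
l_mar = h.pow(2).mean()                  # placeholder for the next-token loss on masked input
l_mrc = tiny_decoder(h).pow(2).mean()    # placeholder for the reconstruction loss from h_eos

loss = l_mar + alpha * l_mrc             # joint UniMAE objective
optimizer.zero_grad()
loss.backward()                          # reconstruction gradients also update the LLM weights
optimizer.step()
```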

Experiments and Results
1. Embedding Performance
The researchers evaluated UniMAE using the Massive Text Embeddings Benchmark (MTEB), a rigorous suite of 56 datasets covering retrieval, clustering, classification, and more. They tested the method on Llama-3 models of varying sizes (1B, 3B, and 8B parameters).
The results were impressive. UniMAE significantly outperformed standard baselines, including “Mean” (average pooling) and “Echo” (repeating sentences). As seen in Table 1, UniMAE achieves state-of-the-art results for unsupervised methods.
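If you want to run this kind of measurement yourself, the `mteb` Python package drives the benchmark. The wrapper below scores a plain last-token pooler (not a released UniMAE checkpoint) on a single example task; the model name, the task choice, and the output folder are illustrative, and the task-selection API may differ slightly across `mteb` versions.

```python
# Sketch: scoring an EOS/last-token-pooled LLM on an MTEB task with the `mteb`
# package. The wrapper is a plain last-token pooler, not a UniMAE checkpoint;
# model name, task name, and output folder are examples.
import numpy as np
import torch
from mteb import MTEB
from transformers import AutoModel, AutoTokenizer

class EosEmbedder:
    def __init__(self, name="gpt2"):                    # small stand-in model
        self.tok = AutoTokenizer.from_pretrained(name)
        self.tok.pad_token = self.tok.pad_token or self.tok.eos_token
        self.model = AutoModel.from_pretrained(name).eval()

    def encode(self, sentences, batch_size=32, **kwargs):
        """MTEB calls this; return one vector per sentence."""
        embs = []
        for i in range(0, len(sentences), batch_size):
            batch = self.tok(sentences[i:i + batch_size], padding=True,
                             truncation=True, return_tensors="pt")
            with torch.no_grad():
                hidden = self.model(**batch).last_hidden_state
            last = batch["attention_mask"].sum(dim=1) - 1   # last real token per sentence
            embs.append(hidden[torch.arange(hidden.size(0)), last].cpu().numpy())
        return np.concatenate(embs)

evaluation = MTEB(tasks=["Banking77Classification"])        # one of the 56 MTEB datasets
evaluation.run(EosEmbedder(), output_folder="results/eos-pooling")
```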

2. Visualizing the Vector Space
To prove that the embeddings are semantically meaningful, the authors visualized the vector space using t-SNE (a technique for visualizing high-dimensional data). Figure 2 compares the base model (a) with the UniMAE-trained model (b).

3. Preserving Generation Capabilities
This is the critical test. Many methods improve embeddings but destroy the LLM’s ability to write coherent text (catastrophic forgetting). Figure 3(b) (the bar chart on the right) shows that generation quality holds up after UniMAE training.
This confirms that UniMAE successfully turns the LLM into a hybrid model: a high-performance encoder and a high-performance generator.

4. The Importance of Masking
Figure 3(a) (the line graph) explores the “Mask Ratio” for the Auto-Regressive part. The results show that a mask ratio between 40% and 60% is optimal.
Conclusion and Future Implications
UniMAE represents a significant step toward unifying NLP architectures. By treating the [EOS] token as a latent variable that must reconstruct the input, the authors enable decoder-only LLMs to perform representation tasks (like retrieval and clustering) competitively with dedicated encoders.
Key Takeaways:
- A single decoder-only LLM can serve as both a high-quality generator and a high-quality sentence encoder.
- The trick is to compress the sentence’s meaning into the [EOS] embedding by training the model to reconstruct the input from that vector alone.
- Trained this way, Llama-3 models reach state-of-the-art unsupervised results on MTEB without losing their generation ability.
The authors note one limitation: the “Tiny Decoder” used for training is discarded, which technically represents a waste of parameters during the training phase. However, given the inference-time benefits—using a single model for everything—this cost seems negligible.
UniMAE suggests a future where we stop distinguishing between “embedding models” and “generative models,” moving instead toward truly universal language models.