In the rapidly evolving landscape of Natural Language Processing (NLP), we currently face a “two-model problem.” If you are building a Retrieval-Augmented Generation (RAG) system, you typically need two distinct architectures: a retriever (usually a bidirectional encoder like BERT) to handle embeddings and search, and a generator (a decoder-only LLM like GPT or Llama) to synthesize the answer.

This structural separation creates inefficiencies. It doubles deployment costs and prevents knowledge sharing between tasks. What if a single Large Language Model (LLM) could handle both high-quality text generation and high-quality sentence representation simultaneously?

This is the promise of UniMAE, a novel unsupervised training method presented in the paper “Decoder-Only LLMs can be Masked Auto-Encoders.” In this post, we will break down how UniMAE forces standard decoder-only LLMs to learn powerful sentence embeddings without losing their ability to generate fluent text.

The Problem: Why LLMs Struggle with Embeddings

To understand UniMAE, we first need to understand the architectural gap between encoders and decoders.

Encoder models (e.g., BERT) use bidirectional attention. Every token can “see” every other token in the sentence simultaneously. They often use a special token (like [CLS]) to aggregate the meaning of the entire sentence, making them excellent for creating vector embeddings used in search and clustering.

Decoder-only models (e.g., Llama, GPT) use unidirectional (causal) attention. A token can only attend to the tokens that came before it. They are trained to predict the next token in a sequence. Consequently, they lack a dedicated mechanism to represent the entire sentence. If you simply take the average of all token embeddings from an LLM, the quality is often poor compared to dedicated encoders.
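To make that baseline concrete, here is a minimal sketch of mean pooling over a decoder-only LLM's hidden states using the Hugging Face transformers API. The checkpoint name is an illustrative assumption; any decoder-only model works the same way, and this is not the paper's evaluation code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# The checkpoint name is illustrative; any decoder-only LLM works the same way.
MODEL_NAME = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token   # decoder-only tokenizers often lack a pad token
model = AutoModel.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def mean_pool_embedding(sentences):
    """Naive baseline: average the last hidden states over real (non-padding) tokens."""
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state            # (batch, seq_len, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)          # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                   # ignore padding positions
    return summed / mask.sum(dim=1).clamp(min=1)          # (batch, hidden)

print(mean_pool_embedding(["dense retrieval", "text generation"]).shape)
```

This is the "Mean" baseline the paper later compares against; it requires no training at all, which is exactly why its quality lags behind dedicated encoders.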

Recent attempts to fix this involve “hacking” the LLM—either by averaging outputs or forcing bidirectional attention during fine-tuning. However, these methods often degrade the model’s primary strength: text generation.

The Solution: UniMAE

The researchers propose UniMAE (Uni-Directional Masked Auto-Encoder). The core idea is ingenious yet straightforward: force the model to compress all the semantic information of a sentence into the final [EOS] (End-Of-Sequence) token.

How do they achieve this? By training the model to reconstruct the original sentence using only that [EOS] embedding.

Figure 1: Overview of UniMAE training. The left part shows the Masked Auto-Regressive process, while the right part shows the reconstruction process using a tiny decoder with [EOS] embeddings to recover the input.

As shown in Figure 1 above, the framework consists of two parallel training objectives:

  1. Masked Auto-Regressive (MAR): The LLM generates text but with a twist—some inputs are masked.
  2. Masked Re-Construct (MRC): A tiny, separate decoder tries to rebuild the sentence using the representation from the LLM’s [EOS] token.

Let’s dive deep into these two components.

1. Masked Auto-Regressive (MAR)

Standard LLMs are trained via Auto-Regressive (AR) learning: given a sequence of words, predict the next one. UniMAE adopts a slightly harder version of this called Masked Auto-Regressive (MAR).

In MAR, the input sentence \(X\) is corrupted by randomly masking a certain percentage of tokens, producing the corrupted sequence \(\tilde{X}\). The LLM must still predict the correct sequence.

The loss function for MAR is calculated as follows:

\[
\mathcal{L}_{MAR} = \sum_{t} CE\!\left( x_{t} \mid \tilde{x}_{i<t};\ \Phi_{llm} \right),
\]

Here, the model attempts to predict the token \(x_t\) based on the corrupted preceding tokens \(\tilde{x}_{i<t}\), where \(\Phi_{llm}\) denotes the parameters of the LLM and \(CE\) is the cross-entropy loss.

Why do this? If we fed the clean sentence to the model, it might simply memorize sequences. By introducing noise (masks), we force the model to rely on semantic context rather than just short-range correlations. This ensures the model is actually “thinking” about the content, which is crucial for building robust representations.
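As a rough sketch of how such a loss can be computed in practice (the mask token choice, the default 40% ratio, and the lack of special-token handling are my simplifications, not the authors' code):

```python
import torch
import torch.nn.functional as F

def mar_loss(llm, input_ids, mask_token_id, mask_ratio=0.4):
    """Masked Auto-Regressive loss: predict the *clean* next token from a corrupted prefix.

    `llm` is assumed to be a causal LM whose forward pass returns logits of shape
    (batch, seq_len, vocab_size), e.g. a Hugging Face *ForCausalLM model.
    """
    corrupted = input_ids.clone()
    noise = torch.rand(input_ids.shape, device=input_ids.device)
    corrupted[noise < mask_ratio] = mask_token_id         # corrupt ~mask_ratio of positions

    logits = llm(input_ids=corrupted).logits              # (batch, seq_len, vocab)
    # Standard AR shift: position t predicts token t+1, but targets come from the clean input.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = input_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```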

2. Masked Re-Construct (MRC)

This is the heart of the UniMAE innovation. The goal is to make the embedding of the [EOS] token, denoted as \(h_{eos}\), a container for the entire sentence’s meaning.

First, the LLM processes the masked input \(\tilde{X}\) to produce the latent representation at the end of the sequence:

\[
h_{eos} \gets \Phi_{llm}\left( \tilde{X} \right)
\]

Because of the causal attention mask in decoder models, the [EOS] token is the only position that has attended to all previous tokens. However, without incentive, the [EOS] token has no reason to “remember” the start of the sentence—it only cares about what comes next (usually nothing).
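You can see this directly from the causal mask itself: in a lower-triangular attention mask, only the last row (the final position) can see every column. A tiny illustration:

```python
import torch

# Causal attention mask for a 5-token sequence: row i may attend to columns j <= i.
causal_mask = torch.tril(torch.ones(5, 5, dtype=torch.bool))
print(causal_mask.int())
# Only the last row is all ones: the final ([EOS]) position attends to every token.
print(causal_mask[-1].all())   # tensor(True)
```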

To fix this, the researchers initialize a Tiny Decoder. This is a small, temporary neural network used only during training. Its job is to take the \(h_{eos}\) vector and recreate the original sentence.

The input to this Tiny Decoder consists of two parts, \(H_1\) and \(H_2\), which combine the [EOS] embedding with trainable positional embeddings (\(p\)):

\[
\begin{aligned}
H_{1} &= [\, h_{eos} + p_{0},\ \ldots,\ h_{eos} + p_{N} \,], \\
H_{2} &= [\, h_{eos},\ e_{x_{1}} + p_{1},\ \ldots,\ e_{x_{N}} + p_{N} \,],
\end{aligned}
\]

The Tiny Decoder uses an attention mechanism to reconstruct the data. However, there is a catch: the authors apply a mask (\(M\)) to the attention matrix. This forces the reconstruction to rely heavily on \(h_{eos}\) rather than cheating by looking at adjacent token embeddings.

The attention computation is formalized as:

\[
\begin{aligned}
Q &= H_{1} W^{Q}, \quad K = H_{2} W^{K}, \quad V = H_{2} W^{V}, \\
M_{ij} &= \begin{cases} 0, & \text{if } i \neq j \land B_{ij} = 1 \\ -\infty, & \text{else} \end{cases} \\
A &= \mathrm{softmax}\!\left( \frac{Q^{T} K}{\sqrt{d}} + M \right) V.
\end{aligned}
\]

The objective function for this reconstruction step is to minimize the difference between the reconstructed tokens and the original input tokens:

\[
\mathcal{L}_{MRC} = \sum_{t} CE\!\left( x_{t} \mid A, H_{1}, H_{2};\ \Phi_{dec} \right).
\]

By minimizing this loss, the gradients flow back into the main LLM, updating the weights specifically to ensure that \(h_{eos}\) captures a high-fidelity summary of the input text.
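Below is a minimal, single-head sketch of the MRC step as I read it from the equations above (per-example, unbatched). The layer sizes, the handling of the binary matrix \(B\), and which query positions carry the loss are assumptions on my part, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoder(nn.Module):
    """Training-only module that reconstructs the input tokens from h_eos."""

    def __init__(self, d_model, vocab_size, max_len):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(max_len + 1, d_model))  # trainable p_0 ... p_N
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, h_eos, token_embeds, B):
        """h_eos: (d,); token_embeds: (N, d) embeddings e_x of the original tokens;
        B: (N+1, N+1) binary visibility matrix from the paper (its construction is
        not restated here; an all-ones B simply forbids the diagonal)."""
        N, d = token_embeds.shape
        # H1: every query is the [EOS] vector plus a positional embedding p_0 ... p_N.
        H1 = h_eos.unsqueeze(0) + self.pos[: N + 1]                       # (N+1, d)
        # H2: [EOS] itself, then the original token embeddings shifted by p_1 ... p_N.
        H2 = torch.cat([h_eos.unsqueeze(0),
                        token_embeds + self.pos[1 : N + 1]], dim=0)       # (N+1, d)

        Q, K, V = self.w_q(H1), self.w_k(H2), self.w_v(H2)
        scores = Q @ K.T / d ** 0.5        # row convention; Q^T K in the paper's notation
        # Mask M: attention is allowed only where i != j and B_ij = 1.
        allowed = B.bool() & ~torch.eye(N + 1, dtype=torch.bool, device=B.device)
        M = torch.full_like(scores, float("-inf"))
        M[allowed] = 0.0
        A = F.softmax(scores + M, dim=-1) @ V                             # (N+1, d)
        return self.lm_head(A)                                            # (N+1, vocab)

def mrc_loss(tiny_decoder, h_eos, token_embeds, input_ids, B):
    """Cross-entropy between reconstructed logits and the original tokens x_1 ... x_N.
    Gradients flow through h_eos back into the main LLM."""
    logits = tiny_decoder(h_eos, token_embeds, B)
    return F.cross_entropy(logits[1:], input_ids)   # queries 1..N reconstruct x_1..x_N
```

In a training step, this loss would be added to the MAR loss and backpropagated into \(\Phi_{llm}\); the TinyDecoder parameters are discarded once training ends.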

Joint Optimization

The final training recipe combines both objectives. The model is trained to be a good generator (MAR) and a good encoder (MRC) simultaneously.

\[
\mathcal{L}_{UniMAE} = \alpha\, \mathcal{L}_{MAR} + \mathcal{L}_{MRC},
\]

Here, \(\alpha\) is a weight parameter balancing the two tasks. Interestingly, the Tiny Decoder is discarded after training. During inference, you simply feed text into the LLM, extract the [EOS] vector, and use it as your sentence embedding.
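Extraction at inference time then reduces to one forward pass and a slice. A minimal sketch (the checkpoint name is illustrative, and explicitly appending the tokenizer's EOS token is my assumption about the practical setup):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative name; in practice this would be a UniMAE-trained checkpoint.
MODEL_NAME = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def eos_embedding(sentence: str) -> torch.Tensor:
    """Return the hidden state at the final [EOS] position as the sentence embedding."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    eos = torch.tensor([[tokenizer.eos_token_id]])
    ids = torch.cat([ids, eos], dim=1)                   # ensure the sequence ends with [EOS]
    hidden = model(input_ids=ids).last_hidden_state      # (1, seq_len, hidden)
    return hidden[0, -1]                                  # h_eos, shape (hidden,)

query = eos_embedding("What is retrieval-augmented generation?")
doc = eos_embedding("RAG pairs a retriever with a generator.")
print(float(torch.cosine_similarity(query, doc, dim=0)))
```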

Experiments and Results

The researchers evaluated UniMAE on the Massive Text Embedding Benchmark (MTEB), a rigorous suite of 56 datasets covering retrieval, clustering, classification, and more. They tested the method on Llama-3 models of varying sizes (1B, 3B, and 8B parameters).

1. Embedding Performance

The results were impressive. UniMAE significantly outperformed standard baselines, including “Mean” (average pooling) and “Echo” (repeating sentences).

Table 1: Results on 56 MTEB datasets with best results highlighted in bold, and the second-best results underlined.

As seen in Table 1, UniMAE achieves state-of-the-art results for unsupervised methods.

  • Significance: On the LLaMA-3.2-1B model, the average score jumps from 39.32 (mean pooling) to 52.81 with UniMAE.
  • Scaling: The performance gains hold true across 1B, 3B, and 8B models.
  • Comparison: It even outperforms MNTP, a popular method that tries to make LLMs bidirectional, which is computationally expensive and alters the architecture fundamentally.

2. Visualizing the Vector Space

To prove that the embeddings are semantically meaningful, the authors visualized the vector space using t-SNE (a technique for visualizing high-dimensional data).
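If you want to produce this kind of plot for your own embeddings, a small sketch with scikit-learn's TSNE is enough. The random data below is a stand-in; in practice you would feed it sentence vectors (e.g. from the eos_embedding helper above) and their class labels. The authors' exact plotting setup is not specified here.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# embeddings: (n_sentences, d) sentence vectors; labels: one class id per sentence.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 768))     # stand-in for real sentence embeddings
labels = rng.integers(0, 6, size=300)        # stand-in for the 6 cluster classes

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=8, cmap="tab10")
plt.title("t-SNE of sentence embeddings")
plt.show()
```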

Figure 2: t-SNE visualization of sentence embeddings from top 6 classes on BiorxivClusteringS2S testset.

Figure 2 compares the base model (a) with the UniMAE-trained model (b).

  • Left (Base Model): The dots (representing scientific papers from different fields) are scattered and mixed. The model struggles to distinguish between “neuroscience” and “bioinformatics.”
  • Right (UniMAE): The clusters are distinct and tight. The model has learned to group semantically similar texts together effectively.

3. Preserving Generation Capabilities

This is the critical test. Many methods improve embeddings but destroy the LLM’s ability to write coherent text (catastrophic forgetting).

Figure 3: (a) Results under different MAR mask ratio on MTEB-15. (b) Performance on language model tasks.

Look at Figure 3(b) (the bar chart on the right):

  • Base (Blue): The original performance of the LLM on generation tasks.
  • MNTP (Green): A competing method. Notice the massive drop in performance (e.g., from 72 to 41 on LLaMA-8B). This model can no longer generate text effectively.
  • UniMAE (Orange): The performance is nearly identical to the Base model.

This confirms that UniMAE successfully turns the LLM into a hybrid model: a high-performance encoder and a high-performance generator.

4. The Importance of Masking

Figure 3(a) (the line graph) explores the “Mask Ratio” for the Auto-Regressive part. The results show that a mask ratio between 40% and 60% is optimal.

  • Too low (0%): The model doesn’t learn robust representations because the task is too easy (simple memorization).
  • Too high (>60%): The signal is lost; the model can’t reconstruct the meaning.

Conclusion and Future Implications

UniMAE represents a significant step toward unifying NLP architectures. By treating the [EOS] token as a latent variable that must reconstruct the input, the authors enable decoder-only LLMs to perform representation tasks (like retrieval and clustering) competitively with dedicated encoders.

Key Takeaways:

  1. Architecture Agnostic: It works on standard decoder-only LLMs without permanent structural changes.
  2. Efficiency: It requires only ~100 training steps to achieve SOTA unsupervised results.
  3. Versatility: The resulting model can power RAG systems (handling both the search embedding and the answer generation; see the sketch below) or serve as a domain-adapted base model.
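As a rough illustration of that single-model RAG setup (entirely hypothetical wiring with an illustrative checkpoint name; a real system would add batching, an ANN index, and proper prompt templating):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-1B"   # illustrative; a UniMAE-trained model would go here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Sentence embedding = hidden state at the final [EOS] position."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    ids = torch.cat([ids, torch.tensor([[tokenizer.eos_token_id]])], dim=1)
    hidden = model(input_ids=ids, output_hidden_states=True).hidden_states[-1]
    return hidden[0, -1]

docs = ["UniMAE compresses a sentence into the [EOS] token.",
        "BERT uses bidirectional attention over the whole input."]
doc_vecs = torch.stack([embed(d) for d in docs])

# Retrieval: score documents with the same model that will generate the answer.
query = "How does UniMAE represent a sentence?"
scores = torch.cosine_similarity(embed(query).unsqueeze(0), doc_vecs, dim=1)
context = docs[int(scores.argmax())]

# Generation: reuse the very same weights to answer from the retrieved context.
prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```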

The authors note one limitation: the “Tiny Decoder” used for training is discarded, which technically represents a waste of parameters during the training phase. However, given the inference-time benefits—using a single model for everything—this cost seems negligible.

UniMAE suggests a future where we stop distinguishing between “embedding models” and “generative models,” moving instead toward truly universal language models.