Introduction: The Speed vs. Quality Dilemma
In the world of Natural Language Processing (NLP), the Transformer architecture is king. Specifically, for tasks like machine translation, autoregressive (AR) Transformers have set the gold standard for quality. They generate translations one word at a time, using the previously generated words as context for the next. This sequential nature ensures high coherence but creates a significant bottleneck: latency. Generating a long sentence takes a long time because you cannot compute the 10th word until you have computed the 9th.
Enter Non-Autoregressive Transformers (NATs). These models promise a revolution in speed by generating the entire target sequence in parallel. Imagine translating a whole sentence in a single computational step. The speedup is massive, but it comes at a cost.
The core issue NATs face is the multi-modality problem. In translation, a single source sentence often has multiple valid translations. For example, the English “I have to go” could be translated into German as “Ich muss gehen” or “Ich muss weg.” An autoregressive model commits to one path step-by-step. A parallel NAT model, however, might try to generate both simultaneously, resulting in a garbled mix like “Ich muss gehen weg.” This failure stems from the model’s conditional independence assumption: each output position is predicted without knowing what the other positions chose.
Researchers have been hunting for ways to mitigate this. One of the most promising architectures is the Directed Acyclic Transformer (DAT), which structures outputs as a graph rather than a line. However, DATs heavily rely on a training technique called Glancing Training (GLAT) to work well. While effective, GLAT creates a mismatch between how the model is trained (with access to target information) and how it runs during inference (without it).
In this post, we are diving deep into Diff-DAT (Diffusion Directed Acyclic Transformer), a novel approach that replaces the GLAT mechanism with Diffusion Models. By treating translation as a denoising process within a graph structure, Diff-DAT not only fixes the training-inference mismatch but also introduces a flexible trade-off between speed and quality.
Background: The Foundation of DAT
To understand Diff-DAT, we first need to understand the architecture it is built upon: the Directed Acyclic Transformer (DAT).
The Directed Acyclic Graph (DAG) Decoder
Traditional transformers output a linear sequence of tokens. DAT changes the game by outputting a Directed Acyclic Graph (DAG). Instead of predicting a single word at position \(i\), the decoder generates a lattice of nodes and edges.
Given a source sentence \(X\) and a target \(Y\), DAT sets a decoder length \(L\) (usually significantly longer than the expected translation) and models the probability of the translation by summing over all valid paths through this graph.
Mathematically, the probability of generating target \(Y\) given source \(X\) is calculated by marginalizing over all possible paths \(A\):
$$
P_\theta(Y \mid X) = \sum_{A} P_\theta(Y \mid A, X)\, P_\theta(A \mid X)
$$
Here, \(A = \{a^1, ..., a^M\}\) represents a path of vertex indices. The model needs to calculate two things:
- Transition Probability: The likelihood of moving from one node to the next.
- Emission Probability: The likelihood of a node generating a specific word.
The path probability is factorized based on the Markov assumption. It looks at the transition matrix \(\mathbf{E}\), which tells us how likely it is to jump from node \(a^i\) to node \(a^{i+1}\):
$$
P_\theta(A \mid X) = \prod_{i=1}^{M-1} \mathbf{E}_{a^i,\, a^{i+1}}
$$
Once a path is chosen, the tokens are generated based on the hidden states at those specific path indices:
$$
P_\theta(Y \mid A, X) = \prod_{i=1}^{M} \mathbf{P}_{a^i,\, y^i}
$$
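
To make the marginalization concrete, here is a minimal Python/NumPy sketch (not the authors’ code) of how the sum over exponentially many paths collapses into a forward-style dynamic program. The matrices `E` and `P` mirror the symbols above; the toy sizes and the convention that paths start at vertex 0 and end at the last vertex are assumptions of this sketch.

```python
import numpy as np

def dat_marginal_likelihood(E, P, target):
    """Marginal likelihood P(Y|X) = sum over all paths A of P(A|X) * P(Y|A,X).

    E[u, v] -- transition probability from vertex u to vertex v
               (upper-triangular: edges in the DAG only point forward).
    P[v, w] -- emission probability of word w at vertex v.
    target  -- the M token ids of the reference translation Y.

    f[i, v] holds the total probability of all partial paths that have
    emitted the first i+1 target tokens and currently sit on vertex v,
    so summing over exponentially many paths costs only O(M * L^2).
    """
    L, M = E.shape[0], len(target)
    f = np.zeros((M, L))
    f[0, 0] = P[0, target[0]]                    # paths are anchored at vertex 0
    for i in range(1, M):
        f[i] = (f[i - 1] @ E) * P[:, target[i]]  # extend by one edge, emit token i
    return f[M - 1, L - 1]                       # paths must end at the last vertex

# Toy example: a DAG with L=4 vertices, a 5-word vocabulary, a 3-word target.
rng = np.random.default_rng(0)
E = np.triu(rng.random((4, 4)), k=1)
E /= np.clip(E.sum(axis=1, keepdims=True), 1e-9, None)  # normalize outgoing edges
P = rng.dirichlet(np.ones(5), size=4)                   # per-vertex word distributions
print(dat_marginal_likelihood(E, P, target=[2, 0, 3]))
```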
The Problem with GLAT
The original DAT used Glancing Training (GLAT) to improve performance. GLAT injects a “glimpse” of the ground truth target into the decoder during training to help the model learn.
$$
\mathcal{L}_{\mathrm{GLAT}} = -\,\mathbb{E}_{Z}\left[\log P_\theta(Y \mid Z, X)\right]
$$
Here, \(Z\) is a latent variable representing a masked version of the target. The problem? \(Z\) exists only during training. During inference, the model has no “glimpse” of the target. This creates a discrepancy: the model learns to rely on information that simply isn’t there when it’s time to actually translate. This motivated the researchers to look for a mathematically sounder alternative: Diffusion.
Core Method: Diff-DAT
Diff-DAT replaces the ad-hoc latent variable mechanism of GLAT with a formal Diffusion Process. If you are familiar with image generation models like Stable Diffusion, the concept is similar: you destroy data with noise and learn to reconstruct it. However, because text is discrete (categorical), we cannot just add Gaussian noise. Instead, we use Absorbing State Discrete Diffusion.
1. The Forward Process (Adding Noise)
In this context, “noise” means masking tokens. We introduce an absorbing state token, denoted as [M].
The forward process gradually corrupts the target sentence \(Y_0\) over \(T\) time steps. At each step, some tokens are replaced by [M]. By the final step \(T\), the sequence \(Y_T\) is entirely composed of mask tokens.
The transition rules for the forward process (Eq 7 in the paper) are straightforward:
- If a token is already masked ([M]), it stays masked.
- If a token is a real word, it stays the same with probability \(\beta_t\) and becomes [M] with probability \(1-\beta_t\).
We can mathematically describe the state at any time step \(t\) directly from the start \(Y_0\) using a cumulative probability \(\alpha_t\):
$$
q(y_t^i \mid y_0^i) =
\begin{cases}
\alpha_t, & y_t^i = y_0^i \\
1 - \alpha_t, & y_t^i = [\mathrm{M}]
\end{cases}
\qquad \text{where } \alpha_t = \prod_{s=1}^{t} \beta_s
$$
This defines exactly how the training data is corrupted.
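A minimal sketch of this corruption step, assuming an illustrative linear \(\beta_t\) schedule and reserving token id 0 for [M] (both conventions of this sketch, not the paper):

```python
import numpy as np

MASK_ID = 0  # id reserved for the absorbing state [M] (a convention of this sketch)

def q_sample(y0, alpha_t, rng):
    """Forward process q(Y_t | Y_0): each real token survives with probability
    alpha_t and is absorbed into [M] otherwise, independently per position."""
    keep = rng.random(len(y0)) < alpha_t
    return np.where(keep, y0, MASK_ID)

# Illustrative linear schedule over T=10 steps: alpha_t = prod_{s<=t} beta_s.
T = 10
betas = np.linspace(0.95, 0.5, T)
alphas = np.cumprod(betas)

rng = np.random.default_rng(0)
y0 = np.array([7, 3, 9, 4, 2])              # token ids of a 5-word target
for t in (0, T // 2, T - 1):
    print(t, q_sample(y0, alphas[t], rng))  # later steps mask more tokens
```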
2. The Backward Process (Denoising)
The goal of the neural network is to reverse this process. Given a partially masked sequence \(Y_t\), the model tries to predict \(Y_{t-1}\) (a slightly less masked sequence).
However, Diff-DAT doesn’t just predict the previous step directly. Following the best practices of diffusion models (like DDPM), the network tries to predict the original noiseless token \(Y_0\). This prediction is then combined with the known diffusion dynamics to estimate the previous step.
The backward transition probability is parameterized as follows:
$$
p_\theta(y_{t-1}^i \mid Y_t, X) =
\begin{cases}
\gamma_t\, \mathbf{P}_{a^i,\, y_{t-1}^i} + (1 - \gamma_t)\,\mathbb{1}\!\left[y_{t-1}^i = [\mathrm{M}]\right], & \text{if } y_t^i = [\mathrm{M}] \\
\mathbb{1}\!\left[y_{t-1}^i = y_t^i\right], & \text{otherwise}
\end{cases}
$$
This equation essentially says:
- If the current token is [M], the model attempts to predict the original word, with probability derived from the diffusion schedule (\(\gamma_t\)) and the model’s output distribution (\(\mathbf{P}\)).
- If the current token is already unmasked, it stays unmasked (probability 1).
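
Here is a small sketch of one such backward step. `P_model` stands in for the model’s per-position output distribution along the chosen path and `gamma_t` for the schedule-derived unmasking probability; both names are inventions of this sketch, not the paper’s API.

```python
import numpy as np

MASK_ID = 0  # same [M] convention as in the forward-process sketch

def backward_step(y_t, P_model, gamma_t, rng):
    """One reverse step p(Y_{t-1} | Y_t) along a fixed path through the DAG.

    y_t     -- current (partially masked) token ids.
    P_model -- P_model[i] is the model's word distribution at position i.
    gamma_t -- schedule-derived probability of unmasking a position this step.
    """
    y_prev = y_t.copy()
    for i, tok in enumerate(y_t):
        if tok == MASK_ID and rng.random() < gamma_t:
            # unmask: sample the word from the model's output distribution
            y_prev[i] = rng.choice(len(P_model[i]), p=P_model[i])
        # tokens that are already unmasked stay fixed (probability 1)
    return y_prev
```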
3. The Training Objective (VLB)
Diffusion models are trained by maximizing the Variational Lower Bound (VLB) of the log-likelihood. Diff-DAT combines the standard diffusion objective with the path-marginalization of DAT.
The overall objective sums the per-step losses across all \(T\) time steps:
$$
\mathcal{L}_{\mathrm{VLB}} = \sum_{t=1}^{T} \mathcal{L}_t
$$
The crucial part is \(\mathcal{L}_t\), the loss at a specific time step. This measures how well the model reconstructs the data while navigating the graph structure.
$$
\mathcal{L}_t = -\gamma_t\,\mathbb{E}_{q(Y_t \mid Y_0)}\left[\log \sum_{A} \left(\prod_{i=1}^{M-1} \mathbf{E}_{a^i,\, a^{i+1}}\right) \prod_{i=1}^{M} \left(\mathbf{P}_{a^i,\, y_0^i}\right)^{b_t^i}\right]
$$
Let’s break down this equation (Eq. 8):
- \(\gamma_t\): A weight derived from the diffusion schedule.
- \(b_t^i\): A binary mask that is 1 if the token is masked (needs prediction) and 0 otherwise.
- \(\log \mathbf{P}_{a^i, y_0^i}\): The log-probability the model assigns to the correct word \(y_0^i\) at node \(a^i\).
- \(\mathbf{E}\): The transition matrix, whose entries accumulate the likelihood of the path itself.
This looks computationally expensive because it sums over all paths \(A\). However, the authors use a smart simplification. Instead of summing over all paths for the diffusion calculation, they condition on the most probable path \(\hat{A}\) found by the model. This allows for efficient training using dynamic programming.
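A Viterbi-style dynamic program is enough to recover such a path. The sketch below scores length-\(M\) paths by transition and emission log-probabilities of the target tokens, reusing the conventions of the earlier marginal-likelihood sketch; the paper’s actual implementation details may differ.

```python
import numpy as np

def most_probable_path(E, P, target):
    """Viterbi-style DP for the most likely path that emits `target`.

    score[i, v] -- best log-probability of any partial path that has emitted
                   the first i+1 target tokens and sits on vertex v.
    Returns the vertex indices of the best path (anchored at vertex 0 and
    ending at the last vertex, as in the earlier sketch).
    """
    L, M = E.shape[0], len(target)
    logE = np.log(np.clip(E, 1e-12, None))
    emit = np.log(np.clip(P[:, target].T, 1e-12, None))  # emit[i, v] = log P[v, target[i]]
    score = np.full((M, L), -np.inf)
    back = np.zeros((M, L), dtype=int)
    score[0, 0] = emit[0, 0]
    for i in range(1, M):
        cand = score[i - 1][:, None] + logE              # cand[u, v]: extend path u -> v
        back[i] = cand.argmax(axis=0)
        score[i] = cand.max(axis=0) + emit[i]
    path = [L - 1]                                       # backtrack from the final vertex
    for i in range(M - 1, 0, -1):
        path.append(back[i, path[-1]])
    return path[::-1]
```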
4. Why is this better?
- Alignment: Unlike GLAT, the diffusion process defines a rigorous mathematical framework where the “latent variable” (the noisy sequence) has a defined relationship with the input. During inference, we can simulate this process.
- Iterative Decoding: Standard NATs shoot once. Diff-DAT allows for iterative refinement. You can run the backward process for multiple steps. Step 1 gives you a rough translation; Step 2 refines the tokens that the model is less confident about (the ones that effectively remained “masked”).
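
Putting the pieces together, here is a hypothetical two-step decode that reuses `backward_step` from the sketch above: start from an all-[M] sequence along the chosen path, unmask most positions in the first pass, then finish the rest.

```python
import numpy as np
# Assumes MASK_ID and backward_step from the backward-process sketch above.

rng = np.random.default_rng(0)
path_len, vocab = 5, 8
P_model = rng.dirichlet(np.ones(vocab), size=path_len)  # toy per-position distributions
P_model[:, 0] = 0.0                                     # id 0 is reserved for [M]
P_model /= P_model.sum(axis=1, keepdims=True)

y = np.zeros(path_len, dtype=int)       # Y_T: every position starts as [M]
for gamma_t in (0.7, 1.0):              # unmask ~70% first, then everything left
    y = backward_step(y, P_model, gamma_t, rng)
print(y)                                # a fully unmasked toy output
```

Ending the schedule at \(\gamma_t = 1\) guarantees nothing stays masked, and collapsing the schedule to a single \(\gamma = 1\) step recovers one-shot NAT decoding—exactly the speed-vs-quality knob discussed below.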
Experiments & Results
The researchers tested Diff-DAT on standard machine translation benchmarks: IWSLT14 (German-English) and WMT datasets (English-German, English-Romanian, Chinese-English).
Performance vs. Baselines
The results are summarized in Table 1 below. The key metric is the BLEU score (higher is better), a standard measure of translation quality.
*(Table 1 from the paper: BLEU scores and relative speedups for Diff-DAT and baselines across the benchmarks.)*
Key Observations:
- State-of-the-Art: Diff-DAT achieves competitive or superior performance compared to strong baselines like CMLM and the original DAT.
- Iterative Improvement: Look at the rows for Diff-DAT with “1 iter” vs “2 iters.” On WMT16 En-Ro, the score jumps from 33.65 to 34.00. This confirms that the diffusion mechanism allows the model to refine its output effectively.
- Speed: Even with 2 iterations, the speedup is 9.2x compared to an autoregressive Transformer. With 1 iteration, it maintains the 14.0x speedup of standard NATs.
The Effect of Graph Size (\(\lambda\))
The parameter \(\lambda\) controls the size of the DAG relative to the source sentence length (\(L = \lambda \cdot N\)). A larger graph offers more paths (modalities) but makes the search space more complex.
*(Figure from the paper: BLEU as a function of the graph-size factor \(\lambda\) for DAT and for Diff-DAT with 1 and 2 iterations.)*
The chart above shows that performance generally peaks around \(\lambda=4\) to \(\lambda=8\). Crucially, Diff-DAT (green and orange lines) consistently outperforms the original DAT (blue line) across different graph sizes. The gap is particularly noticeable when using 2 iterations (green line), underscoring the robustness of the diffusion approach.
Does More Iteration Always Help?
Since diffusion models in image generation often use hundreds of steps, one might assume that running Diff-DAT for more steps would keep improving quality.
*(Figure from the paper: BLEU by sentence-length bucket for different numbers of decoding iterations.)*
The analysis reveals a nuance:
- For short sentences (length < 40), increasing iterations (e.g., to 4 or 8) improves quality.
- For long sentences (length > 40), performance actually degrades with too many iterations.
The authors hypothesize that for long sequences, the graph becomes massive, making transition predictions difficult. Errors in early steps propagate, and the iterative process might “hallucinate” or drift away from the optimal path rather than refining it. However, 2-step decoding remains a “sweet spot” across the board.
Beyond BLEU: Fuzzy Alignment
The paper also explores combining Diff-DAT with Fuzzy Alignment (FA), a technique that relaxes the strict monotonic alignment constraints.
*(Table 5 from the paper: BLEU for FA-DAT versus FA-Diff-DAT.)*
As shown in Table 5, combining the two (FA-Diff-DAT) pushes performance even higher, surpassing the standalone FA-DAT model. This demonstrates that the diffusion objective is orthogonal to alignment improvements; they can work together to boost results further.
Conclusion and Implications
Diff-DAT represents a maturation of Non-Autoregressive Translation. By moving away from ad-hoc training tricks like GLAT and adopting the principled framework of Discrete Diffusion, the authors have created a model that is both theoretically sound and practically effective.
The key takeaways for students and practitioners are:
- Bridging the Gap: Diffusion models effectively close the gap between training with latent variables and inference without them.
- Flexible Latency: Unlike rigid NATs, Diff-DAT offers a knob to turn. Need maximum speed? Run 1 iteration. Need higher quality? Run 2.
- Graph + Diffusion: The combination of structural search (DAG) and iterative refinement (Diffusion) is a powerful paradigm that could extend beyond just machine translation to other sequence generation tasks.
While challenges remain—particularly regarding iterative decoding on very long sequences—Diff-DAT proves that we don’t always have to choose between the high quality of autoregressive models and the high speed of parallel generation. We can, increasingly, have both.