The capabilities of Large Language Models (LLMs) like GPT-4 and LLaMA have revolutionized artificial intelligence. However, if you have ever watched an LLM generate a response, you have likely noticed a fundamental bottleneck: the text appears one word at a time, like a slow typist.

This sluggishness is due to the Auto-Regressive (AR) decoding paradigm. To generate the 100th token, the model strictly needs the previous 99 tokens. This sequential dependency prevents parallel processing during generation, leaving powerful GPUs idling while they wait for the next token to be decided.

Researchers have been racing to solve this latency problem using Speculative Decoding—a technique where “draft” tokens are guessed and then verified in parallel. However, most current methods come with a heavy price tag: they require training extra “draft models,” adding complex heads (like Medusa), or running separate, resource-intensive post-training stages.

What if we could teach the main model to speed itself up without adding a single extra parameter or requiring a separate training pipeline?

In this post, we dive into a fascinating paper titled “Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training.” The researchers propose a framework called MSN (Make Some Noise) that seamlessly integrates into the standard fine-tuning process. By training the model to handle “noisy” inputs, they unlock powerful parallel decoding capabilities, achieving speeds 2.3x to 2.7x faster than standard models—effectively offering a “free lunch” for LLM inference acceleration.

The Problem: The Cost of Acceleration

To understand the beauty of the MSN framework, we first need to look at how we currently try to speed up LLMs.

In standard Speculative Decoding, we usually need a “drafter” and a “verifier.”

  1. The Drafter: Quickly guesses the next few tokens (e.g., “The cat sat on the…”).
  2. The Verifier (The LLM): Checks those guesses in a single parallel forward pass.
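To make the draft-and-verify loop concrete, here is a minimal, framework-agnostic sketch of greedy speculative verification (my own illustration, not any particular system's code); `draft_next` and `target_next_parallel` are hypothetical callables standing in for the small and large models.

```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_next_parallel: Callable[[List[int], List[int]], List[int]],
                     k: int = 4) -> List[int]:
    """One draft-and-verify step of greedy speculative decoding (sketch)."""
    # 1) The drafter guesses k tokens sequentially (cheap calls).
    draft, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # 2) The verifier scores all k guesses in a single parallel forward pass:
    #    target_preds[i] is the big model's greedy choice given prefix + draft[:i].
    target_preds = target_next_parallel(prefix, draft)

    # 3) Accept the matching prefix of the draft, plus the verifier's own token
    #    at the first mismatch, so at least one exact token is produced per step.
    accepted = []
    for guess, truth in zip(draft, target_preds):
        accepted.append(truth)
        if guess != truth:
            break
    return prefix + accepted
```

Because every accepted token is the verifier's own greedy choice, the output matches plain greedy decoding; the speedup comes entirely from how often the cheap drafts are right.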

The bottleneck usually lies in the drafter. Previous solutions often involve:

  • Small Draft Models: Running a tiny LLaMA alongside a big LLaMA. This consumes extra memory.
  • Extra Heads (e.g., Medusa): Adding new architectural layers to predict future tokens. This changes the model structure and adds parameters.
  • Separate Post-Training: After you finish fine-tuning your model for its actual job (like coding or chatting), you have to train it again just to learn how to be fast.

The researchers behind MSN realized that these methods treat “task capability” (being smart) and “inference speed” (being fast) as two separate problems.

Figure 1: An illustration of the differences between the proposed MSN framework and existing model-based speculative decoding methods. The book icon represents task-specific capabilities and the rocket icon represents parallel decoding capabilities.

As shown in Figure 1 above, existing methods (top) require a complex pipeline. First, you get a Base Model, then you do Supervised Fine-Tuning (SFT) to get a smart model, and then you do specific training or add structures to get a fast model.

The MSN Framework (bottom) collapses this. It proposes that by slightly tweaking the standard SFT stage—specifically by adding noise—we can produce a model that is both smart and fast immediately. No extra steps, no extra weights.

Background: Jacobi Decoding

To understand how MSN works, we need to understand the mechanism it uses for speed: Jacobi Decoding.

Standard decoding solves for one token (\(y\)) at a time. Jacobi decoding treats text generation like a system of non-linear equations. It attempts to solve for multiple tokens (\(y_1, y_2, ... y_m\)) simultaneously.

\[ \left\{ \begin{aligned} y_1 &= \arg\max\, P_\theta(y_1 \mid x) \\ y_2 &= \arg\max\, P_\theta(y_2 \mid y_1, x) \\ &\ \ \vdots \\ y_m &= \arg\max\, P_\theta(y_m \mid y_{1:m-1}, x) \end{aligned} \right. \]

The equation above shows the goal: finding the best token at every position given the previous ones. In Jacobi decoding, you start with a random guess (or “noise”) for the future tokens. You feed this sequence into the model. The model looks at the sequence and updates its predictions. You repeat this iteratively.
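As a rough sketch (under my own assumptions, not the paper's code), the iteration looks like this, where `greedy_parallel` is a hypothetical helper that runs one forward pass and returns the model's greedy next-token prediction at every input position:

```python
import random
from typing import Callable, List

def jacobi_decode(prefix: List[int],
                  greedy_parallel: Callable[[List[int]], List[int]],
                  m: int, vocab_size: int, max_iters: int = 50) -> List[int]:
    """Solve for m future tokens at once by fixed-point (Jacobi) iteration."""
    # Start from a random guess ("noise") for the m future positions.
    guess = [random.randrange(vocab_size) for _ in range(m)]

    for _ in range(max_iters):
        # One parallel forward pass over the whole current sequence.
        preds = greedy_parallel(prefix + guess)
        # preds[j] = argmax P(next token | input[: j + 1]), so the updated value
        # of guess[i] is the prediction made at position len(prefix) + i - 1.
        new_guess = preds[len(prefix) - 1 : len(prefix) - 1 + m]
        if new_guess == guess:      # fixed point reached: the sequence converged
            break
        guess = new_guess
    return prefix + guess
```

Each loop iteration costs one forward pass, so the method only pays off if the sequence converges in far fewer than \(m\) iterations.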

Ideally, the sequence “converges” to the correct text quickly. However, standard LLMs are terrible at this. They are trained on perfect text (Teacher Forcing). When they see the random noise used as placeholders in Jacobi decoding, they get confused and hallucinate, failing to converge.

This is where Make Some Noise (MSN) comes in.

The Core Method: Training for Robustness

The authors propose that parallel decoding is essentially a denoising task. If a model can look at a sentence with some garbled junk in the middle and correctly predict what should be there, it can successfully perform Jacobi decoding.

The MSN Training Framework

The standard way to train an LLM uses a loss function based on “Teacher Forcing,” where the model always sees the correct history:

\[ \mathrm{Loss}_{\mathrm{AR}} = \sum_{i=0}^{n} -\log P(X_i \mid X_{<i}; \theta) \]

MSN changes this slightly. During the Supervised Fine-Tuning (SFT) stage—where the model learns to follow instructions or write code—the researchers randomly replace a segment of the input tokens with “noise.”

\[ \mathrm{Loss}_{\mathrm{MSN}} = \sum_{i=0}^{n} -\log P(X_i \mid \hat{X}_{<i}; \theta) \]

Here, \(\hat{X}\) represents the input sequence containing noise. Crucially, the target the model tries to predict (\(X_i\)) remains the correct token. The model is forced to ignore the noise and reconstruct the correct next token based on context.
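Before walking through the concrete example in Figure 2, here is a minimal PyTorch-style sketch of what this objective could look like (an illustration under my own assumptions, not the authors' code; `model` is assumed to return a Hugging Face-style output with `.logits`):

```python
import torch
import torch.nn.functional as F

def msn_loss(model, input_ids: torch.Tensor,
             noise_len: int = 4, vocab_size: int = 32000) -> torch.Tensor:
    """Sketch of the MSN objective: corrupt a span of the inputs,
    keep the targets clean, and train with the usual next-token loss."""
    batch, seq_len = input_ids.shape
    noisy_inputs = input_ids.clone()

    # Overwrite one random span with noise tokens. Plain random tokens are a
    # placeholder here; the paper's "ahead noise" is sketched further below.
    start = int(torch.randint(1, seq_len - noise_len, (1,)))
    noisy_inputs[:, start:start + noise_len] = torch.randint(
        0, vocab_size, (batch, noise_len), device=input_ids.device)

    # Standard shifted cross-entropy, but conditioned on the noisy history:
    # predict the original token X_i given the corrupted prefix (X-hat)_{<i}.
    logits = model(noisy_inputs).logits        # assumes an HF-style output object
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),          # clean targets, not the noisy ones
    )
    return loss
```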

Figure 2: Illustration of the Make Some Noise training framework and the Jacobi decoding strategy. The training phase in the figure uses a noise segment of length 2, and the inference phase is shown with the noise segment length set to 3.

Let’s look at the “MSN Training Stage” in Figure 2 (top half).

  • Original Sentence: “I am always on the lookout for exciting…”
  • Noisy Input: The phrase “the look” is replaced by noise tokens “always I”.
  • The Task: The model reads “…always on always I…” and must still predict “the”, “look”, and “out”.

This forces the model to stop over-relying on the immediate previous token (which might be wrong) and look at the broader context. This is exactly the skill needed for parallel decoding.

What Kind of Noise?

You can’t just use random static. The authors found that using Ahead Noise works best:

\[ \hat{x}_i = \mathrm{random\_sample}(X_{<i}) \]

They sample tokens that appeared earlier in the sentence (“Prefix tokens”). This is clever because:

  1. It mimics the “copy mechanism” behavior often seen in language generation.
  2. It is harder to denoise than pure random noise because the tokens are semantically relevant to the context, forcing the model to be more discerning.
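Following the formula above, ahead noise is simple to construct; here is a toy sketch (my own illustrative helper, not the paper's implementation):

```python
import random
from typing import List

def ahead_noise(tokens: List[int], start: int, length: int) -> List[int]:
    """Replace tokens[start:start+length] with tokens sampled from the prefix,
    i.e. each noise token is random_sample(X_{<i}) rather than uniform vocab noise."""
    prefix = tokens[:start]
    noise = [random.choice(prefix) for _ in range(length)]
    return tokens[:start] + noise + tokens[start + length:]

# Toy usage: corrupt two positions of a token sequence with earlier (prefix) tokens.
sentence = [101, 7, 42, 9, 13, 99, 5, 8]
print(ahead_noise(sentence, start=4, length=2))  # positions 4-5 now hold prefix tokens
```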

Inference: TR-Jacobi Decoding

Once the model is trained with MSN, it is “robust.” It doesn’t panic when it sees incorrect draft tokens. Now, how do we use this for speed?

The authors propose TR-Jacobi (Tree-based Retrieval-augmented Jacobi) decoding.

Step 1: Jacobi Decoding with Noise

Refer back to the bottom half of Figure 2 (“Inference Stage”). Instead of generating one token, the model generates a “draft” of several tokens (filled with noise or guesses). Because the model was trained to denoise, it looks at the draft _on _always _I and corrects it to _on _the _look. It iterates this process, refining the sequence in parallel steps.

Step 2: Retrieval Augmentation (The “R” in TR-Jacobi)

There is a catch with Jacobi decoding: the “Cold Start.” At the very beginning of a new sentence, random noise is a terrible guess. It takes too many iterations to fix it.

To solve this, the authors add a Retrieval path. They use “Prompt Lookup Decoding.” Basically, the model looks back at its own prompt or previously generated text to find patterns (n-grams) that match the current context.

For example, if the prompt talks about “Newton’s Laws,” and the model generates “Newton’s,” it’s highly probable the next word is “Laws.” The retrieval mechanism grabs “Laws” and puts it into the draft candidate list.
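Prompt lookup itself is cheap to sketch: match the last few tokens of the current context against earlier text and propose whatever followed the match as draft tokens. A toy version (illustrative only, with made-up token IDs):

```python
from typing import List, Optional

def prompt_lookup_draft(context: List[int], ngram: int = 2,
                        draft_len: int = 4) -> Optional[List[int]]:
    """Propose draft tokens by matching the last `ngram` tokens of the context
    against an earlier occurrence in the prompt / generated text."""
    if len(context) < ngram + 1:
        return None
    key = context[-ngram:]
    # Scan backwards for the most recent earlier occurrence of the n-gram.
    for start in range(len(context) - ngram - 1, -1, -1):
        if context[start:start + ngram] == key:
            continuation = context[start + ngram:start + ngram + draft_len]
            return continuation or None
    return None

# Toy usage: the last two tokens (30, 5) already appeared earlier, so the
# tokens that followed them back then, (77, 21), are proposed as the draft.
ctx = [11, 9, 30, 5, 77, 21, 30, 5]
print(prompt_lookup_draft(ctx, ngram=2, draft_len=2))  # -> [77, 21]
```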

Step 3: Tree Verification

Instead of verifying just one sequence, the model verifies a Token Tree.

Figure 7: Illustration of token tree verification. The model achieves simultaneous verification of multiple candidate paths through a specially constructed sparse attention matrix.

As seen in Figure 7, the system organizes different guesses (from Jacobi iteration and Retrieval) into a tree.

  • Path A: The result of the Jacobi denoising.
  • Path B: The result of the Retrieval lookup.
  • Path C: Extensions of these paths.

The model checks all these branches in a single forward pass using a specific attention mask. The longest valid branch is kept. This maximizes the chance of finding a long, correct sequence in one go.
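The attention mask boils down to letting each node in the tree attend only to itself and its ancestors (plus the shared prefix), so all branches can be scored in one pass without interfering with each other. A minimal sketch of how such a mask could be built (my own construction; the node labels in the comments are just example branches):

```python
from typing import List
import torch

def tree_attention_mask(parents: List[int]) -> torch.Tensor:
    """Boolean attention mask for a token tree: parents[i] is the parent index
    of node i (-1 for nodes attached directly to the verified prefix).
    mask[i, j] is True iff node i may attend to node j (itself or an ancestor)."""
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:              # walk up to the root, marking every ancestor
            mask[i, j] = True
            j = parents[j]
    return mask

# Two example branches sharing their first node:
#   node 0 -> node 1 -> node 3   (e.g. "the" -> "look" -> "out")
#   node 0 -> node 2             (an alternative continuation)
parents = [-1, 0, 0, 1]
print(tree_attention_mask(parents).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 0, 1, 0],
#         [1, 1, 0, 1]])
```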

Figure 3: The main flowchart of TR-Jacobi decoding. Note that candidate generation and tree verification are performed in the same step; for clarity, the figure shows candidate generation at step \(T\) and tree verification at step \(T+1\).

Figure 3 illustrates this flow. At step \(T\), candidates are generated (some from noise, some from retrieval). At step \(T+1\), the tree is verified, and the best path (e.g., accepting “out” and “_for”) is selected to move forward.

Experiments and Results

The researchers tested MSN on two major base models: LLaMA3-8B (for general tasks) and DeepSeekCoder-6.7B (for coding tasks).

1. Does Noisy Training Hurt Intelligence?

This is the biggest fear with modifying the SFT stage. If we feed the model garbage (noise) during training, will it become stupid?

Table 1: Results of task performance experiments in the general and code domains. The general-domain metrics are the MT-Bench score and the weighted MMLU score; the code domain reports pass@1 under greedy decoding, with the average of HumanEval and MBPP as a composite metric. ‘(+)’: results after running the additional EvalPlus tests. ‘↑’: percentage improvement over models without MSN.

Table 1 provides a clear answer: No. In fact, looking at the MSN (Ours) rows, the model performance on benchmarks like MT-Bench and HumanEval is comparable to, and sometimes slightly better than, the standard SFT baseline. For example, in the code domain, MSN slightly outperformed standard SFT on the composite metric (77.0 vs 76.6). This suggests that learning to denoise might actually help the model understand context better, rather than confusing it.

2. How Fast is it?

The primary goal was speed. The authors tested the models on Spec-Bench, a benchmark designed for speculative decoding.

Table 2: Experimental results of acceleration ratios across the Spec-Bench task areas (Multi-turn Conversation, Translation, Summarization, Question Answering, Mathematical Reasoning, Retrieval-augmented Generation). Rows below the dashed line are Jacobi-like decoding methods. ‘AS’: additional structure. ‘#MAT’: mean accepted tokens. ‘↑’: percentage improvement over models without MSN.

Table 2 compares MSN against other methods, including those that use extra structures (AS) like Medusa and Eagle.

  • Standard AR: 42.13 tokens/s.
  • MSN + TR-Jacobi: 91.63 tokens/s.

This is a 2.17x speedup overall, with some tasks (like Summarization) reaching nearly 2.8x. Notably, MSN achieves speedups comparable to Medusa2 and EAGLE, but without requiring the 1.6B or 0.3B extra parameters those methods need. It effectively turns a standard 8B model into a high-speed inference engine without bloating its memory footprint.

3. Ablation Studies: Noise and Retrieval

The authors performed several deeper analyses to understand why it works.

Does the length of the training noise matter? They trained models with noise segments of length 1, 4, and 8.

Table 3: The effect of the training noise segment length on acceleration and task capability. ‘L’ denotes the length of the noise segment.

Table 3 shows a trade-off. Length 1 (L=1) didn’t provide much speedup. Length 8 (L=8) hurt the model’s accuracy (HumanEval score dropped). Length 4 was the sweet spot—high speedup with preserved accuracy.

Does Retrieval actually help? They compared pure Jacobi decoding against their TR-Jacobi (with retrieval).

Figure 5: Results of ablation experiments on the retrieval part of TR-Jacobi decoding.

Figure 6: Acceleration experimental results of MSN training for StarCoder2 models of different sizes.

In Figure 5, the blue bars (TR-Jacobi with retrieval) consistently outperform the orange bars (Jacobi without retrieval) across different tasks. This confirms that adding the prompt-lookup retrieval path significantly helps the model recover from the cold-start problem, leading to more accepted tokens per step.

Conclusion

The “Make Some Noise” paper presents a compelling argument for rethinking how we train LLMs for inference. Instead of treating training and acceleration as separate phases, MSN merges them.

By simply introducing noise during the Supervised Fine-Tuning stage, the model learns a dual capability:

  1. Task Solving: It learns to code, chat, or summarize.
  2. Self-Correction: It learns to look at a noisy, draft sequence and “snap” it into the correct shape.

This self-correction capability is the key to unlocking Jacobi Decoding, allowing the model to generate text in parallel bursts rather than single steps. With the addition of TR-Jacobi, which uses retrieval to stabilize the process, MSN offers a 2-3x speedup.

For students and practitioners, the takeaway is clear: acceleration doesn’t always require complex new architectures. Sometimes, it just requires training the model a little differently—by making some noise.