If you have ever stared at a blinking cursor waiting for a Large Language Model (LLM) to finish a paragraph, you have experienced the inherent bottleneck of autoregressive generation. These models generate text one token at a time, and for every single token, the model must stream its entire set of weights from memory to the compute units.

In the world of GPU acceleration, this is known as being “memory bandwidth bound.” The speed of generation isn’t determined by how fast the GPU can do math, but by how fast it can move data.
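To put a rough number on this (a back-of-the-envelope illustration, not a figure from the paper): a 13-billion-parameter model stored in 16-bit precision occupies about 26 GB of weights. At batch size 1, generating a single token requires streaming essentially all of those weights through the memory system, so on a GPU with roughly 2 TB/s of memory bandwidth (the ballpark for an NVIDIA A100) the per-token latency cannot drop much below

\[
\frac{26\ \text{GB}}{2\ \text{TB/s}} \approx 13\ \text{ms},
\]

i.e. at most around 75 tokens per second, no matter how many teraflops the chip can deliver.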

A popular solution to this problem is Speculative Sampling, a technique that uses a smaller “draft” model to guess future tokens, which are then verified in parallel by the larger “target” model. While this reduces the number of forward passes required, the verification step itself involves complex probability calculations that can still throttle performance.

In this deep dive, we will explore a recent research paper, “Optimized Speculative Sampling for GPU Hardware Accelerators,” which proposes two novel methods to optimize this verification phase. We will look at how redistributing GPU workloads can yield a faster—yet mathematically identical—result, and how approximating probability distributions with Sigmoid functions can unlock massive speed gains at the cost of minor accuracy drops.

The Context: Why is Sampling Slow?

To understand the solution, we first need to understand the hardware constraint. Modern GPUs, like the NVIDIA A100, have a hierarchy of memory:

  1. HBM (High-Bandwidth Memory): Massive capacity (e.g., 80GB), but relatively slow to access. This is where the model weights live.
  2. SRAM (Static Random-Access Memory): Tiny capacity (a few hundred KB per streaming multiprocessor), but incredibly fast. This is on-chip memory, often used as a programmable cache or “Shared Memory.”

In standard autoregressive decoding, the GPU spends a significant amount of time reading and writing intermediate matrices between HBM and SRAM. Speculative sampling attempts to mitigate this by predicting \(K\) tokens ahead. Instead of running the large model \(K\) times sequentially, we run it once to verify all \(K\) tokens in parallel.

However, the authors of this paper noticed that the verification logic itself—the math used to accept or reject those drafted tokens—was suboptimal in standard implementations. It involved too many read/write operations to HBM.

Background: The Mathematics of Speculative Sampling

Before optimizing the hardware, let’s look at the math we are trying to compute.

In speculative sampling, we have a target distribution \(p(x)\) (the big model) and a draft distribution \(q(x)\) (the small model). When the draft model guesses a token \(x_{i+c}\), we determine if we should keep it based on an acceptance probability \(\tau_c\).

The standard rejection criterion is defined as:

Equation 1: \( \tau_c = \min\!\left(1,\ \dfrac{p(x_{i+c})}{q(x_{i+c})}\right) \)

Here, \(r_c\) is a random number drawn from a uniform distribution \(U(0,1)\). If \(r_c \le \tau_c\), the token is accepted.

If the token is rejected, we must resample a new token from a corrected distribution to ensure the final output strictly follows the target model’s distribution. This corrected distribution is defined as:

Equation 2: \( p'(x) = \text{max\_norm}\big(p(x) - q(x)\big) \)

The function max_norm normalizes the difference between the distributions, ensuring non-negative probabilities that sum to 1:

Equation 3: \( \text{max\_norm}\big(f(x)\big) = \dfrac{\max\big(0,\, f(x)\big)}{\sum_{x'} \max\big(0,\, f(x')\big)} \)

Calculating these values requires element-wise operations across the entire vocabulary (often 30,000+ tokens). The standard approach loads the probabilities \(p\) and \(q\) from HBM, computes the difference, writes the result back to HBM, reads it again for normalization, and so on. This redundant memory traffic is what the researchers set out to eliminate.
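As a toy illustration of Equations 1-3 (numbers invented for this example, not taken from the paper), suppose the vocabulary has three tokens with \(p = (0.5, 0.3, 0.2)\) and \(q = (0.2, 0.5, 0.3)\), and the draft model proposes token 2. The acceptance probability is \(\tau = \min(1,\ 0.3 / 0.5) = 0.6\), so the draft token is kept 60% of the time. If it is rejected, \(\max(0,\ p(x) - q(x)) = (0.3, 0, 0)\), which max_norm rescales to \((1, 0, 0)\): the replacement token is drawn only from positions where the target model assigns more mass than the draft model.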

Method 1: Exact Optimization via Kernel Fusion

The first contribution of the paper is an Exact Optimization. “Exact” means the output matches the baseline up to floating-point rounding: it produces the same text generation results, just faster.

The Strategy: Tiling and SRAM

The core insight is that the calculations for \(\tau\) (the acceptance probability) and for the resampling distribution can largely be carried out independently on chunks of the vocabulary. Instead of processing the whole vocabulary globally, the researchers broke the workload into “tiles.”

They wrote a custom GPU kernel that redistributes the workload across GPU threads and blocks. Here is the process:

  1. Load to SRAM: A thread block loads a chunk of probabilities \(p(x)\) and \(q(x)\) from the slow HBM into the fast SRAM.
  2. Compute in Parallel: The threads compute the acceptance probability \(\tau\) and the numerator for the resampling equation (Eq. 3) entirely within SRAM.
  3. Parallel Reduction: The denominator in Equation 3 (the sum of \(\max(0,\, p(x) - q(x))\) over the whole vocabulary) is the only tricky part because it requires a global sum. The kernel performs a partial sum within each block in SRAM.
  4. Write Back: Only the necessary partial results are written back to HBM.

By keeping the intermediate values in SRAM, the GPU avoids multiple expensive trips to HBM.
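To make the data flow concrete, here is a minimal CUDA-style sketch of the tiling-and-reduction pattern described above. It is an illustration under assumptions rather than the authors' actual kernel: the kernel name, buffer layout, and launch configuration are invented, and the acceptance check and final normalization are left out for brevity.

```cuda
// Illustrative sketch (not the paper's code) of a fused, tiled verification kernel.
// Each thread block handles one tile of the vocabulary, keeps its slice of p and q
// in shared memory (SRAM), and writes only a single partial sum back to HBM.
// Assumes blockDim.x is a power of two, e.g. launched as:
//   fused_resample_tile<<<num_blocks, 256, 256 * sizeof(float)>>>(p, q, residual, partial_sums, vocab_size);
__global__ void fused_resample_tile(const float* p, const float* q,
                                    float* residual, float* partial_sums,
                                    int vocab_size) {
    extern __shared__ float smem[];           // one float of scratch per thread
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;  // global vocabulary index

    // 1. Load to SRAM: compute the un-normalized resampling numerator max(0, p - q).
    float diff = 0.0f;
    if (idx < vocab_size) {
        diff = fmaxf(0.0f, p[idx] - q[idx]);
        residual[idx] = diff;                 // numerator of max_norm, written once
    }
    smem[tid] = diff;
    __syncthreads();

    // 2. Parallel reduction entirely in SRAM: this block's share of the
    //    max_norm denominator.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) smem[tid] += smem[tid + stride];
        __syncthreads();
    }

    // 3. Write back: one partial sum per block goes to HBM.
    if (tid == 0) partial_sums[blockIdx.x] = smem[0];
}
```

A second, much smaller reduction over partial_sums (or an atomic add) then yields the global denominator of Equation 3, after which residual can be normalized without ever re-reading \(p\) and \(q\) from HBM.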


Figure 1: This diagram visualizes the Exact Optimization pipeline. Notice how the data moves from HBM (left) into SRAM (center) where the heavy lifting happens. The probability distributions \(p\) and \(q\) are sliced into sub-vocabularies (\(V_k\)). The calculations for acceptance (\(\tau\)) and the resampling numerator (\(a_k\)) happen concurrently. Only the final aggregated results are sent back to HBM.

This method maximizes parallelism. By tiling the kernel, the GPU can saturate its compute cores without waiting for memory fetches. Because it implements the standard equations (1, 2, and 3) faithfully, there is zero loss in accuracy.

Method 2: Approximate Optimization via Sigmoid

The researchers didn’t stop at “exact.” They asked: What if we trade a tiny bit of accuracy for massive speed?

The bottleneck in the exact method—and indeed in almost all Transformer operations—is the Softmax function.

Equation 6: \( p(x_i) = \dfrac{e^{z_i}}{\sum_{j=1}^{|V|} e^{z_j}} \), where \(z_i\) is the logit for token \(i\) and \(|V|\) is the vocabulary size.

Look at the denominator in the Softmax equation above. To calculate the probability of even a single token, you need to sum the exponentials of every logit in the vocabulary. This creates a dependency that spans all thread blocks, making the operation difficult to fully parallelize.
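Concretely (an illustrative configuration, not one reported in the paper): with a 32,000-token vocabulary processed by thread blocks of 256 threads, the logits are spread across 125 blocks. Every one of those blocks must finish its share of \(\sum_j e^{z_j}\), and those partial sums must be combined, before any block can emit even one normalized probability. An element-wise function has no such cross-block dependency.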

The Sigmoid Switch

The authors propose replacing Softmax with an element-wise Sigmoid approximation. Unlike Softmax, Sigmoid processes each logit independently. It doesn’t care what the other logits are doing.

To make this work, the logits (raw model outputs, \(z\)) are first scaled using constants \(\alpha\) and \(\beta\) to fit the active range of the Sigmoid function. The approximate probabilities \(\hat{p}\) and \(\hat{q}\) are calculated as:

Equation 7: \( \hat{p}(x_i) = \sigma\big(\alpha\, z_{p,i} + \beta\big), \quad \hat{q}(x_i) = \sigma\big(\alpha\, z_{q,i} + \beta\big) \), where \( \sigma(z) = \dfrac{1}{1 + e^{-z}} \).

Because \(\hat{p}\) and \(\hat{q}\) are calculated via Sigmoid, they are technically not valid probability distributions (they do not sum to 1). However, because the Sigmoid is monotonically increasing, they preserve the relative ranking of tokens, which is often enough for speculative sampling.

Using these approximate probabilities, the acceptance criteria and resampling logic are updated:

Equation 8: \( \hat{\tau}_c = \min\!\left(1,\ \dfrac{\hat{p}(x_{i+c})}{\hat{q}(x_{i+c})}\right) \)

Equation 9: \( \hat{p}'(x) = \text{max\_norm}\big(\hat{p}(x) - \hat{q}(x)\big) \)

Fusing the Kernel

Because Sigmoid is element-wise, it can be fused directly into the sampling kernel. There is no need to wait for a global sum to normalize values before checking if a token is accepted.
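A minimal CUDA-style sketch of such a fused acceptance check is shown below. Again, this is illustrative rather than the authors' implementation: the kernel name and buffer layout are invented, the placement of the scaling constants \(\alpha\) and \(\beta\) follows the description above, and the resampling path on rejection is omitted.

```cuda
// Illustrative sketch (not the paper's kernel): Sigmoid-based acceptance.
// Each thread handles one drafted token independently; it reads only the two
// logits it needs and performs no reduction over the vocabulary.
__device__ __forceinline__ float sigmoid(float z) {
    return 1.0f / (1.0f + expf(-z));
}

__global__ void sigmoid_accept(const float* z_p,     // target logits, [num_draft, vocab_size]
                               const float* z_q,     // draft logits,  [num_draft, vocab_size]
                               const int* draft_tok, // drafted token ids, [num_draft]
                               const float* rand_u,  // pre-drawn U(0,1) samples, [num_draft]
                               int* accepted,        // output accept/reject flags
                               int num_draft, int vocab_size,
                               float alpha, float beta) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per drafted token
    if (c >= num_draft) return;

    int tok = draft_tok[c];
    // Approximate "probabilities" straight from the raw logits of the drafted token.
    float p_hat = sigmoid(alpha * z_p[c * vocab_size + tok] + beta);
    float q_hat = sigmoid(alpha * z_q[c * vocab_size + tok] + beta);

    // Same acceptance rule as Equation 1, but with the Sigmoid approximations.
    float tau = fminf(1.0f, p_hat / q_hat);
    accepted[c] = (rand_u[c] <= tau) ? 1 : 0;
}
```

Because no thread ever needs a sum over the vocabulary to decide acceptance, the whole check collapses into a single, embarrassingly parallel pass.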


Figure 2: The workflow for the Sigmoid Approximation. Highlighted in red is the key difference: instead of receiving pre-calculated probabilities, the kernel takes raw logits (\(z_p, z_q\)), scales them, and applies Sigmoid directly in SRAM. This creates a highly parallel, non-blocking pipeline.

This approach is significantly faster because it bypasses the expensive global synchronization required by Softmax. However, it is “non-exact,” meaning the generated text might differ from the standard model.

Experiments and Results

To validate these methods, the researchers ran extensive benchmarks on two tasks:

  1. Automatic Speech Recognition (ASR): Using Whisper (Target) and Distil-Whisper (Draft).
  2. Text Summarization: Using Llama 2, Qwen, and Gemma models.

They measured performance using Profiling Time (raw GPU execution time of the sampling function) and Accuracy (Word Error Rate for ASR, ROUGE scores for summarization).

Performance Gains

The results show a clear hierarchy of speed. The Baseline is slowest, Exact Optimization provides a solid boost, and Sigmoid Approximation is drastically faster.


Table 1: The main results. Look at the “\(\Delta\%\) Profiling Time” columns. The Exact method (middle right) consistently delivers ~6-13% speedups with identical accuracy (WER/ROUGE scores remain largely the same). The Sigmoid method (far right) delivers massive speedups of 37-94%, though sometimes with a slight degradation in accuracy scores.

For example, on the CNN/DM summarization task using Llama 2 13B, the Exact method improved profiling time by 10.1%, while the Sigmoid method improved it by 37.2%.

Is the Sigmoid Method Accurate Enough?

The trade-off for the Sigmoid method is visible in the accuracy metrics.

  • ASR: The Word Error Rate (WER) increased slightly (e.g., from 0.08 to 0.09 on LibriSpeech Clean). This is a minor degradation.
  • Summarization: The ROUGE-1 scores dropped by small amounts (e.g., from 0.31 to 0.29).

While “non-exact,” the models generally remained coherent. The researchers found that the acceptance rate (how often the draft token is accepted) remained relatively high, indicating the Sigmoid approximation closely mirrored the real distribution’s decisions.

Stability and Memory

An important question for system architects is whether these optimizations introduce instability or memory overhead.

The researchers plotted the execution time per decoding step against the number of draft tokens (\(\gamma\)).


Figure 3: Execution time vs. number of draft tokens. The Green line (Baseline) is consistently the highest (slowest). The Orange line (Exact) is lower, and the Blue line (Sigmoid) is the lowest (fastest). Crucially, the lines are flat or linear, showing that the optimization scales well as we speculatively generate more tokens.

Regarding memory, the optimizations are highly efficient. Because they rely on SRAM for intermediate storage, they do not bloat the HBM usage.


Figure 5: Peak HBM memory usage for Whisper models. The Exact (circles) and Baseline (squares) lines overlap almost perfectly. The Sigmoid method (triangles) actually uses slightly less memory in some cases because it avoids storing full probability matrices in favor of on-the-fly computation.

Conclusion and Implications

This paper highlights a critical lesson in modern AI systems: algorithms must be hardware-aware.

By simply treating the mathematical operations of speculative sampling as a “black box,” standard implementations leave performance on the table. By understanding the GPU’s memory hierarchy (HBM vs. SRAM) and the properties of the functions (Softmax vs. Sigmoid), we can squeeze out significant gains.

  • The “Exact” method is a no-brainer for production systems requiring guaranteed fidelity. It offers a free ~10% speedup on the sampling step with zero downsides.
  • The “Sigmoid” method offers a more aggressive option. For applications where a 1-2% drop in metric performance is acceptable (e.g., real-time chatbots where latency is king), the nearly 2x speedup in the sampling kernel is transformative.

As Large Language Models continue to grow, the “Memory Wall” will only get taller. Techniques like these—which optimize data movement rather than just compute—will be essential for making next-generation AI accessible and responsive.