Breaking the Vocabulary Barrier: How to Accelerate LLMs with Any Drafter Model

The inference speed of Large Language Models (LLMs) remains one of the primary bottlenecks in deploying generative AI. Whether you are running a chatbot, a code assistant, or a summarization tool, the cost and latency of generating text token-by-token can be prohibitive.

To solve this, the community has largely adopted Speculative Decoding (SD). This technique uses a smaller, faster “drafter” model to guess upcoming tokens, which are then verified in parallel by the larger “target” model. When it works, it’s like magic: you get the exact same quality output but significantly faster.
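To make the mechanics concrete, here is a toy sketch of the token-level verification loop, assuming the standard accept-with-probability min(1, p/q) rule; p_target and q_draft are hypothetical stand-ins for real model calls, not an actual library API:

```python
import random

def verify_draft(draft_tokens, p_target, q_draft):
    """Toy token-level verification (standard speculative decoding).

    p_target(x) and q_draft(x) are hypothetical callables returning the target's
    and drafter's probabilities for the same token ID x. Note that both models
    must score the same token IDs, which is the shared-vocabulary assumption
    this post is about.
    """
    accepted = []
    for x in draft_tokens:
        # Accept the drafted token with probability min(1, p/q); otherwise stop
        # and let the target resample from a corrected distribution.
        if random.random() <= min(1.0, p_target(x) / q_draft(x)):
            accepted.append(x)
        else:
            break
    return accepted
```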

However, there has always been a major “catch” to Speculative Decoding: The Vocabulary Constraint.

Traditionally, the drafter and the target model had to speak the exact same language—literally. They needed to share the same tokenizer and vocabulary. This meant you couldn’t use a highly efficient model from the Qwen family to accelerate a Llama model, or a Phi model to accelerate a Mixtral model. If the vocabularies didn’t match, standard SD algorithms broke down.

In the paper Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies, researchers from the Weizmann Institute of Science, Intel Labs, and d-Matrix present a breakthrough set of algorithms that remove this constraint. Their work allows any off-the-shelf model to serve as a drafter for any target model, regardless of their vocabularies.

This post will deconstruct how these algorithms work, why “heterogeneous” vocabularies are so difficult to handle, and how these new methods achieve speedups of up to 2.8x.

The Background: Why Vocabularies Matter

To understand the innovation, we first need to understand the problem with standard Speculative Decoding.

LLMs do not process raw text; they process numbers called Token IDs. A tokenizer converts text into a sequence of these IDs based on a specific vocabulary.

  • Model A (e.g., Llama-3) might map the word “apple” to ID 5043.
  • Model B (e.g., Qwen-2) might map the word “apple” to ID 8912.

Standard Speculative Decoding works by passing Token IDs directly from the drafter to the target. If you try to use Model B to draft for Model A, Model B sends ID 8912. Model A receives 8912, looks it up in its own dictionary, and realizes it corresponds to something completely different (or nothing at all). The verification fails immediately, and the speculation is wasted.
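You can see the mismatch directly with two off-the-shelf tokenizers. This is a minimal sketch using Hugging Face's AutoTokenizer; the checkpoint names are only examples, and the exact IDs you get will differ from the illustrative ones above:

```python
from transformers import AutoTokenizer

# Example checkpoints; any two models with different tokenizers show the same effect.
target_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
drafter_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

text = " apple"
target_ids = target_tok.encode(text, add_special_tokens=False)
drafter_ids = drafter_tok.encode(text, add_special_tokens=False)
print(target_ids, drafter_ids)  # different ID sequences for the same text

# Feeding the drafter's IDs straight to the target decodes to unrelated text:
print(target_tok.decode(drafter_ids))
```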

Until now, the only solution was to train a specific drafter from scratch to match the target’s vocabulary—a costly and time-consuming process. The researchers propose three algorithms to bridge this gap, enabling heterogeneous speculative decoding.

We will explore the three proposed methods:

  1. SLEM: A robust, text-based alignment method.
  2. TLI: A mathematical adjustment of the token distributions.
  3. SLRS: A theoretical approach to string-level probability.

Method 1: String-Level Exact Match (SLEM)

The first and most robust algorithm introduced is Algorithm 2: String-Level Exact Match (SLEM). This method is designed to be the “universal adapter” for LLM inference.

The Logic of Plain Text

Since the Token IDs between two models are incompatible, SLEM uses the one medium they both understand: Plain Text.

Instead of passing Token IDs, the process looks like this:

  1. Drafting: The drafter generates a sequence of tokens in its own vocabulary.
  2. Decoding: These draft tokens are decoded immediately into a raw text string.
  3. Re-tokenizing: This raw text string is then encoded using the target model’s tokenizer.
  4. Verification: The target model verifies these new tokens against its own predictions.

This sounds simple, but it solves the fundamental incompatibility issue. The drafter can “think” in its own tokens, and the target receives input in its own tokens.
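A minimal sketch of one such iteration might look like the following. The drafter_generate and target_verify callables are hypothetical stand-ins for real model calls, and the sketch glosses over details such as KV-cache handling:

```python
def heterogeneous_draft_step(prompt, drafter_tok, target_tok,
                             drafter_generate, target_verify):
    """One draft-decode-retokenize-verify iteration (a sketch, not the HF implementation)."""
    # 1. Draft: the drafter proposes token IDs in its *own* vocabulary.
    draft_ids = drafter_generate(drafter_tok.encode(prompt))

    # 2. Decode: turn those IDs back into plain text, the shared medium.
    draft_text = drafter_tok.decode(draft_ids)

    # 3. Re-tokenize: encode the drafted text with the *target's* tokenizer.
    candidate_ids = target_tok.encode(draft_text, add_special_tokens=False)

    # 4. Verify: the target scores the candidates in one parallel pass and
    #    returns the accepted prefix (plus its own correction token).
    accepted_ids = target_verify(prompt, candidate_ids)
    return target_tok.decode(accepted_ids)
```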

The Challenge of Non-Injective Tokenizers

The researchers encountered a subtle but critical engineering hurdle: Non-Injective Tokenizers.

Ideally, tokenization should be reversible. If you encode text into tokens and decode it back, you should get the exact same text. However, many modern tokenizers apply “normalization” rules. For example:

  • Replacing multiple spaces with a single space.
  • Lowercasing text.
  • Normalizing accented characters (e.g., ‘é’ becomes ‘e’).

In a heterogeneous setup, this creates a mismatch. The drafter generates text, but when that text is re-tokenized for the target, normalization rules might alter it slightly. If the algorithms simply compared the resulting Token IDs blindly, they would reject valid drafts because of minor formatting discrepancies caused by the tokenizer itself, not the model’s logic.

The Solution: SLEM implements a sophisticated matching mechanism. It translates tokens bidirectionally and searches for the longest stretch of matched text between the accepted target tokens and the proposed draft tokens. It effectively aligns the sequences in the text space, mitigating the noise introduced by different tokenization rules.
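As a toy illustration of text-space alignment (a simplification of what SLEM actually does), you can count how many leading draft pieces survive by comparing decoded text rather than token IDs:

```python
def count_matching_draft_tokens(draft_piece_texts, target_text):
    """Count the leading draft pieces whose concatenated text matches the target's text.

    Comparing in text space means differences caused purely by tokenizer
    normalization do not trigger spurious rejections.
    """
    rebuilt, matched = "", 0
    for piece in draft_piece_texts:
        if target_text.startswith(rebuilt + piece):
            rebuilt += piece
            matched += 1
        else:
            break
    return matched

# Hypothetical example: five drafted pieces vs. the text the target accepted.
print(count_matching_draft_tokens(["Hel", "lo", ",", " wor", "ld!"], "Hello, world"))  # 4
```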

This robustness makes SLEM highly effective in practice, serving as the default algorithm for heterogeneous SD in the Hugging Face Transformers library.

Method 2: Token-Level Intersection (TLI)

While SLEM relies on converting tokens to text and back, Algorithm 4: Token-Level Intersection (TLI) stays strictly within the realm of probability distributions.

The Intersection of Vocabularies

Even if two models have different vocabularies, they usually share a significant chunk of tokens (like common words, punctuation, and alphabet characters).

Look at the table below (Table 8 from the paper), which shows the vocabulary sizes of popular models. They vary wildly, but they are rarely disjoint.

Table 8: Vocabulary sizes of widely used target and drafter models.

Standard SD algorithms fail because the drafter might propose a token that simply does not exist in the target’s vocabulary.
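If you want to measure the overlap for a particular pair yourself, a quick check with Hugging Face tokenizers looks like this (the checkpoint names are only examples):

```python
from transformers import AutoTokenizer

tok_target = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tok_drafter = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

# get_vocab() returns a token-string -> ID mapping; we only need the keys here.
vocab_target = set(tok_target.get_vocab())
vocab_drafter = set(tok_drafter.get_vocab())

shared = vocab_target & vocab_drafter
print(len(vocab_target), len(vocab_drafter), len(shared))
```

Note that the “same” token can still be rendered differently by different tokenizers (byte-level prefixes, special space markers), so this raw count is only a rough proxy for the usable intersection.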

The TLI Strategy

TLI addresses this mismatch by modifying the drafter’s probability distribution. It forces the drafter to “stay in its lane” with respect to the target’s vocabulary.

  1. Identify the Intersection: The set of tokens that exist in both vocabularies.
  2. Renormalize: The algorithm takes the drafter’s probability distribution and sets the probability of any token outside the intersection to zero. It then scales up the probabilities of the remaining tokens so they sum to 1.

By doing this, the drafter is mathematically prevented from ever suggesting a token that the target model simply cannot produce.
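Here is a toy sketch of this intersect-and-renormalize step over a dictionary of string-valued tokens (real implementations operate on logit tensors over the full drafter vocabulary, not Python dicts):

```python
def tli_adjust(drafter_probs, target_vocab):
    """Zero out drafter tokens outside the shared vocabulary, then renormalize."""
    kept = {tok: p for tok, p in drafter_probs.items() if tok in target_vocab}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("drafter places no mass on the shared vocabulary")
    return {tok: p / total for tok, p in kept.items()}

# Hypothetical distribution: 20% of the drafter's mass sits on a token the target lacks.
drafter_probs = {"hello": 0.5, "hi": 0.3, "<drafter_only>": 0.2}
target_vocab = {"hello", "hi", "greetings"}
print(tli_adjust(drafter_probs, target_vocab))  # surviving tokens rescaled to sum to 1
```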

The researchers proved (Theorem 4.1) that TLI guarantees an expected acceptance rate at least as high as a naive “union” strategy (where you simply treat the two vocabularies as one combined set). By focusing the drafter’s probability mass only on valid candidates, TLI reduces wasted drafts and increases speed.

Method 3: String-Level Rejection Sampling (SLRS)

The third method, Algorithm 3 (SLRS), is the most theoretically ambitious. While standard SD performs verification token-by-token, SLRS attempts to perform Rejection Sampling at the String Level.

The Concept

In standard token-level verification, a drafted token \(x\) is accepted with probability \(\min\left(1, \frac{P_{target}(x)}{P_{draft}(x)}\right)\). SLRS lifts this rejection-sampling test from individual tokens to whole strings, comparing \(P_{target}(string)\) against \(P_{draft}(string)\).

This is a powerful idea because a “string” (like a word) might be represented as one token in the target model but three tokens in the drafter model. Token-by-token comparison is impossible here, but comparing the probability of the entire string occurring should theoretically work.

The Computational Bottleneck

While mathematically sound and “lossless” (preserving the exact target distribution), SLRS faces a massive hurdle: calculating the probability of a string is expensive.

To know the probability of the string “hello” according to the drafter, you must sum the probabilities of every possible way the drafter could construct “hello.”
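Schematically, writing \(P_{draft}\) for the drafter’s distribution and glossing over the conditioning on the surrounding context, the string-level probability is a sum over all tokenizations:

\[
P_{draft}(\text{"hello"}) \;=\; \sum_{t_1 \cdots t_k \,=\, \text{"hello"}} \; \prod_{i=1}^{k} P_{draft}(t_i \mid t_1, \dots, t_{i-1})
\]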

The researchers illustrated this complexity using the Qwen2-7B-Instruct vocabulary. As shown in Figure 1 below, even a simple word like “hello” can be formed by 14 different combinations of tokens (e.g., “h”+“ello”, “hel”+“lo”, “hello”).

Figure 1: Left: All the 14 valid combinations of tokens from the Qwen2-7B-Instruct vocabulary that can be concatenated to form the string ‘hello’. Right: Tree visualization of all these combinations. Each of the 14 checkmarks indicates a valid combination, which is a leaf in the visualized tree. In this example, calculating \(\psi(t)\) from Algorithm 3 requires 16 forward passes of the drafter model, which is the number of non-leaf nodes in the tree plus one.

To accurately calculate the rejection sampling criterion (\(\psi(t)\)), the algorithm must compute the probability for every branch of this tree. For “hello,” that requires 16 forward passes of the drafter model—just to verify one word!
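To get a feel for where these numbers come from, here is a small enumeration sketch over a made-up vocabulary (the real Qwen2-7B-Instruct vocabulary yields the 14 combinations shown in Figure 1):

```python
def segmentations(s, vocab):
    """Enumerate every way to split the string s into pieces from vocab."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        piece = s[:i]
        if piece in vocab:
            for rest in segmentations(s[i:], vocab):
                results.append([piece] + rest)
    return results

# Toy vocabulary, purely illustrative.
toy_vocab = {"h", "e", "l", "lo", "hel", "ello", "hello"}
for combo in segmentations("hello", toy_vocab):
    print(" + ".join(combo))
# h + e + l + lo
# h + ello
# hel + lo
# hello
```

Each of these token sequences contributes its own probability to the sum, which is why the number of drafter forward passes grows with the size of the tokenization tree.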

This complexity explodes exponentially as the tokens get longer. The graph below (Figure 2) shows the relationship between token length and the number of combinations.

Figure 2: The number of combinations for different token lengths for the 150,000 selected tokens from the Qwen2-7B-Instruct vocabulary. We can see that the number of combinations grows exponentially with the token length.

Because of this “combinatorial explosion,” the paper concludes that while SLRS is a significant theoretical contribution, it is likely only practical for drafters with very small vocabularies (like Byte-level models) where the number of combinations remains manageable.

Comparing the Algorithms

The paper provides a helpful comparison of expected acceptance rates across these methods.

  • Alg 5 (Standard SD): Undefined for heterogeneous vocabularies (fails).
  • Alg 1 (Naive Union): Works, but suboptimal.
  • Alg 2 (SLEM): Uses exact matching. High compatibility, moderate acceptance rate.
  • Alg 4 (TLI): Mathematical intersection. Higher acceptance rate than Alg 1.
  • Alg 3 (SLRS): The theoretical upper bound for string verification, but computationally heavy.

Table 3: Expected acceptance rates given heterogeneous vocabularies for all speculation methods. The expected acceptance rate of Algorithm 1 is always less than or equal to the expected acceptance rate of Algorithm 4, as Theorem 4.1 proves.

Empirical Results: Does it Work?

The researchers tested these algorithms on high-end hardware (NVIDIA H100s) and consumer hardware (RTX 6000) using popular datasets like CNN/DailyMail (summarization), HumanEval (coding), and SCROLLS (long context).

SLEM Performance

SLEM (Algorithm 2) showed the most robust performance across the board. In scenarios where no in-family drafter existed, SLEM unlocked significant speedups.

For example, using CodeLlama-13b as a target and a Tiny Starcoder (completely different family) as a drafter, they achieved a 2.79x speedup on coding tasks.

Table 1: Benchmark comparing Algorithm 2 (SLEM) and autoregressive decoding (AR) for widely used models, tasks, and hardware setups. The results demonstrate that SLEM increases throughput by up to 2.8x over AR.

This table highlights the flexibility of the method. You can pair a Mixtral model with a Qwen drafter or a Vicuna drafter. This mix-and-match capability allows engineers to find the absolute fastest drafter for their specific hardware, rather than being stuck with the “little sibling” of their target model.

TLI Performance

TLI (Algorithm 4) also demonstrated solid improvements. Although its peak speedups are generally slightly lower than SLEM’s, it remains a very efficient token-level approach.

Table 2: Benchmark comparing Algorithm 4 (TLI) and autoregressive decoding (AR) for widely used models, tasks, and hardware setups. The results demonstrate that TLI increases throughput by up to 1.7x over AR.

Crucially, TLI achieved up to 1.7x speedup on standard benchmarks. It serves as a lightweight alternative when string-level decoding overhead might be undesirable, provided the vocabulary overlap is sufficient.

Implications for the Future of AI Inference

The contributions of this paper extend beyond just faster benchmarks. By solving the heterogeneous vocabulary problem, the authors have fundamentally changed the economics of LLM inference.

1. Democratization of Speed: Previously, only organizations with the resources to train custom drafter models could fully optimize their inference stacks. Now, anyone can download a state-of-the-art target model (like Llama-3) and pair it with any lightweight model available on the Hugging Face Hub to get immediate speedups.

2. Off-the-Shelf Efficiency: The integration of SLEM and TLI into the Hugging Face transformers library means these gains are available to the public now. The paper notes that Hugging Face maintainers independently evaluated these methods and made SLEM the default for heterogeneous speculation due to its effectiveness.

3. Model Architecture Independence: We are no longer tied to “model families.” A reasoning model could be drafted by a coding model. A multilingual model could be drafted by a specialized English model. This opens up a new area of research into finding the “perfect pair” of models for specific tasks.

Conclusion

Standard Speculative Decoding was a leap forward for LLM latency, but it was shackled by the requirement of vocabulary matching. This work by Timor et al. breaks those shackles.

Through SLEM, we have a robust, text-based method that handles the messy reality of tokenizers. Through TLI, we have a statistically sound method for exploiting vocabulary overlap. And through SLRS, we have a theoretical framework for string-level verification, even if current compute limits constrain it.

For students and practitioners, the takeaway is clear: the constraints on how we deploy and accelerate Large Language Models are falling away. We can now treat models as modular components, mixing and matching them to achieve the best balance of speed and accuracy.

The full benchmarks and code for these methods are available in the open-source Hugging Face Transformers library.