If you have ever stared at a blinking cursor while ChatGPT or LLaMA writes a response word by word, you have experienced the inherent bottleneck of Large Language Models (LLMs). This sluggishness stems from autoregressive generation: the model must generate token A before it can generate token B, and token B before C. It is a strictly serial process that leaves modern GPUs—which thrive on parallel computation—largely underutilized.

To solve this, researchers developed Speculative Sampling, a technique that allows models to “draft” several future tokens cheaply and verify them in parallel. However, most current speculative methods rely on rigid, static structures. They assume that predicting the next word is always equally difficult, regardless of the context.

Enter EAGLE-2, a new research paper that challenges this assumption. By making the drafting process dynamic and context-aware, EAGLE-2 achieves inference speeds 3x-4x faster than standard decoding, outperforming previous state-of-the-art methods while remaining mathematically lossless.

In this post, we will deconstruct how EAGLE-2 works, why static draft trees are inefficient, and how dynamic trees allow us to squeeze significantly more performance out of LLMs without retraining the base model.

The Bottleneck: Why is LLM Inference Slow?

Before diving into EAGLE-2, we need to understand the problem it solves. LLM inference is memory-bound. Generating a single token requires moving billions of parameters from the GPU’s high-bandwidth memory (HBM) to its compute units. Because LLMs generate one token at a time, we pay this memory access “tax” for every single word generated.

The Solution: Speculative Sampling

Speculative sampling (or speculative decoding) reduces this tax using a “Draft and Verify” approach:

  1. Draft: A smaller, faster model (the “Draft Model”) quickly guesses the next \(K\) tokens (e.g., the next 5 words).
  2. Verify: The massive “Target Model” (the original LLM) processes all drafted tokens in a single forward pass.

Because the Target Model can verify 5 tokens in parallel almost as quickly as it can generate 1, we save time whenever the draft is correct.
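To make the loop concrete, here is a minimal, self-contained sketch of one draft-and-verify cycle. The two "models" are toy probability distributions invented for illustration (nothing here is the paper's code), but the acceptance rule is the standard lossless one: accept a drafted token with probability min(1, p/q), and on rejection resample from the residual distribution.

```python
import numpy as np

VOCAB, rng = 50, np.random.default_rng(0)

def toy_dist(ctx, temperature):
    """Toy stand-in for a language model: a deterministic next-token distribution."""
    local = np.random.default_rng(abs(hash(tuple(ctx))) % (2**32))
    logits = local.normal(size=VOCAB) / temperature
    p = np.exp(logits - logits.max())
    return p / p.sum()

def target(ctx): return toy_dist(ctx, 1.0)   # "big" target model
def draft(ctx):  return toy_dist(ctx, 1.3)   # "small" draft model: a blurrier copy of the target

def speculative_step(ctx, k=5):
    """One draft-and-verify cycle using the standard lossless acceptance rule."""
    drafted, c = [], list(ctx)
    for _ in range(k):                                  # Draft: k cheap autoregressive guesses
        q = draft(c)
        t = int(rng.choice(VOCAB, p=q))
        drafted.append((t, q))
        c.append(t)
    accepted, c = [], list(ctx)
    for t, q in drafted:                                # Verify: one target forward pass in practice
        p = target(c)
        if rng.random() < min(1.0, p[t] / q[t]):        # accept drafted token with prob min(1, p/q)
            accepted.append(t)
            c.append(t)
        else:                                           # reject: resample from the residual (p - q)+
            r = np.maximum(p - q, 0)
            accepted.append(int(rng.choice(VOCAB, p=r / r.sum())))
            break
    return accepted

print(speculative_step([1, 2, 3]))
```

(A real implementation also samples one bonus token from the target when the entire draft is accepted; that detail is omitted here for brevity.)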

The Evolution: From Chains to Trees

Classic speculative sampling drafts a single chain of tokens (Token A \(\rightarrow\) Token B \(\rightarrow\) Token C). If Token A is wrong, the whole chain is discarded.

Newer methods, such as EAGLE (the predecessor to EAGLE-2), use a Tree Structure. Instead of guessing just one sequence, the draft model explores multiple branches. If the draft model isn’t sure if the next word is “is” or “was,” it can draft branches for both. During verification, the Target Model checks the whole tree at once using a specific attention mask (Tree Attention).

Comparison of standard speculative sampling and EAGLE.

As shown in Figure 3, standard speculative sampling (left) works linearly. EAGLE (right) operates at the feature level—passing feature vectors rather than just tokens—and verifies a tree structure. This allows for higher acceptance rates because if one branch fails, another might succeed.

The Problem with Static Trees

While EAGLE represented a massive leap forward, it had a hidden inefficiency: it treated every sentence the same way.

EAGLE, along with other methods like Medusa, uses a Static Draft Tree. This means the shape of the prediction tree is fixed. For example, it might always guess 2 options for the first token, 2 for the second, and 1 for the third. This structure assumes that the difficulty of guessing the next token depends only on its position (i.e., the immediate next token is easier to guess than the one 3 steps away).
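To make "fixed shape" concrete, here is one way such a static tree could be written down as a hard-coded list of ranked-choice paths. The exact shape below is made up for illustration; it is not the configuration EAGLE or Medusa actually ship with.

```python
# Each path is a sequence of ranks into the draft model's top candidates:
# (0,) = "top-1 next token", (0, 1) = "top-1, then its 2nd-best child", and so on.
# Every prompt gets drafted with exactly this shape, easy or hard.
STATIC_TREE_PATHS = [
    (0,), (1,),        # depth 1: always two candidates for the next token
    (0, 0), (0, 1),    # depth 2: always two children under the best branch
    (0, 0, 0),         # depth 3: always a single deepest guess
]
```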

The Insight: Context Matters

The researchers behind EAGLE-2 discovered that this assumption is flawed. The “acceptability” of a drafted token depends heavily on the context, not just the position.

Consider two scenarios:

  1. Context A: “The capital of France is…” (Next token is almost certainly “Paris”).
  2. Context B: “The best strategy for this game is…” (Next token could be anything).

In Context A, we should commit to a deep, narrow tree because we are confident. In Context B, we should use a wide, shallow tree to cover our bases. A static tree forces a compromise that is optimal for neither.

Differences between EAGLE and EAGLE-2.

Figure 4 illustrates this intuition perfectly.

  • Left (EAGLE): The tree shape is fixed. Even when the query is 10 + 2 =, where the next token is almost certainly 1 (the first digit of 12), the static tree still spends part of its budget drafting a highly unlikely alternative (3).
  • Right (EAGLE-2): The model adapts to the context. After 10 + 2 (before the =), it expands wide because the continuation is genuinely uncertain. After 10 + 2 =, it recognizes the certainty and drafts a single deep branch for the tokens 1 and 2 (spelling out 12).

Evidence: Context-Dependent Acceptance

To prove this mathematically, the authors analyzed the acceptance rates of tokens.

Acceptance rates of tokens at different positions.

Figure 5 (left) shows a standard static tree structure. Figure 5 (right) plots the acceptance rates. While it is true that on average earlier tokens (P1) are accepted more often than later ones (P6), the vertical spread of the dots tells a different story. The wide variance at each position means that for some queries, position 4 is very easy to guess, while for others, position 1 is very hard.

This variance confirms that a fixed structure leaves performance on the table.

The Method: EAGLE-2’s Dynamic Architecture

So, how do we build a tree that changes shape on the fly? We need a signal to tell us when to go “wide” and when to go “deep.”

Crucially, we cannot ask the Target LLM for this signal, because running the Target LLM is exactly what we are trying to avoid. We need the Draft Model to self-assess.

Step 1: Calibration (The Confidence Signal)

The researchers found that the Draft Model in EAGLE is remarkably well-calibrated. This means that the model’s confidence score (the probability it assigns to a token) is a very accurate predictor of whether that token will actually be accepted by the Target LLM.

Average acceptance rates for different confidence score intervals.

As shown in Figure 6, there is a near-linear relationship between the Draft Model’s confidence (x-axis) and the actual acceptance rate (y-axis). If the draft model says “I am 90% sure this token is correct,” it is accepted roughly 90% of the time. This allows EAGLE-2 to use confidence scores as a proxy for acceptance rates.
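If you want to sanity-check calibration on your own setup, the recipe is simple: log each drafted token's confidence together with whether the target accepted it, then bucket by confidence. The snippet below sketches that analysis on fabricated data (a perfectly calibrated drafter), purely for illustration.

```python
import numpy as np

# Hypothetical log: for each drafted token, the draft model's confidence and
# whether the target ultimately accepted it (simulated here as perfectly calibrated).
rng = np.random.default_rng(0)
conf = rng.random(10_000)
accepted = rng.random(10_000) < conf

# Bucket by confidence and compare against the empirical acceptance rate.
bins = np.linspace(0, 1, 11)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (conf >= lo) & (conf < hi)
    if mask.any():
        print(f"confidence {lo:.1f}-{hi:.1f}: acceptance rate {accepted[mask].mean():.2f}")
```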

Step 2: The Expansion Phase

EAGLE-2 builds the draft tree layer by layer. In standard methods, the expansion is fixed (e.g., always expand the top 2 nodes). In EAGLE-2, we select which nodes to expand based on a metric called Value (\(V_i\)).

The global acceptance probability of a token isn’t just about the token itself; it relies on the entire path leading up to it being correct. If the parent node is wrong, the child node is irrelevant. Therefore, the Value \(V_i\) of a node is the product of all confidence scores along its path:

\[
V_i = \prod_{j \in \mathrm{Path}(\mathrm{root},\, i)} c_j
\]

Here, \(c_j\) is the draft model's confidence score for node \(j\), and \(\mathrm{Path}(\mathrm{root}, i)\) is the set of nodes on the path from the root to node \(i\) (including \(i\) itself).

During the Expansion Phase, EAGLE-2 calculates this Value for all nodes in the current layer and selects the top-k nodes with the highest values to expand further. This ensures that computational resources are spent extending the most promising branches, rather than adhering to a rigid geometric shape.
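Here is a minimal sketch of that expansion step. The Node class and the propose_children callback are hypothetical stand-ins for EAGLE-2's internals; the point is only that each node's Value is its parent's Value times its own confidence, and that only the top-k Values in the current layer get expanded.

```python
from dataclasses import dataclass, field
import heapq

@dataclass
class Node:
    token: int
    conf: float                      # draft model's confidence c_j for this token
    parent: "Node | None" = None
    value: float = field(init=False)

    def __post_init__(self):
        # V_i = product of confidences along the path from the root to this node.
        parent_value = self.parent.value if self.parent else 1.0
        self.value = parent_value * self.conf

def expand_layer(frontier, propose_children, top_k=8):
    """Grow the tree one layer: only the top_k highest-Value nodes get children."""
    chosen = heapq.nlargest(top_k, frontier, key=lambda n: n.value)
    next_frontier = []
    for node in chosen:
        for tok, conf in propose_children(node):   # draft model's top candidates for this node
            next_frontier.append(Node(tok, conf, parent=node))
    return next_frontier

# Toy usage: the callback returns (token, confidence) pairs from the draft model.
root = Node(token=0, conf=1.0)
layer1 = expand_layer([root], lambda n: [(11, 0.6), (12, 0.3)], top_k=2)
print([(n.token, round(n.value, 2)) for n in layer1])   # [(11, 0.6), (12, 0.3)]
```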

Step 3: The Reranking Phase

Once the tree is fully expanded, we have a problem. We need to send a fixed number of candidate tokens to the Target LLM for verification.

Simply taking the nodes from the final expanded layer is risky. A node deep in the tree (at depth 5) might have a lower cumulative Value than a node at depth 2 that we chose not to expand.

  • Node A (Depth 2): cumulative Value of 0.80.
  • Node B (Depth 5): cumulative Value of 0.40.

To maximize the number of accepted tokens, EAGLE-2 performs a Reranking Phase. It looks at every node generated in the tree (not just the leaves), sorts them by their Value (\(V_i\)), and picks the top \(m\) candidates. Because each confidence score is at most 1, a node's Value can never exceed its parent's, so the top-\(m\) selection automatically brings along every selected node's ancestors and the draft remains a valid tree. The result is a flexible draft that might include a few deep chains and several shallow alternatives.
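In code, the reranking step is little more than a global sort. The sketch below reuses the hypothetical Node structure from the expansion snippet; a natural tie-break is to prefer shallower nodes, so an ancestor is never dropped while its descendant is kept.

```python
def depth(node):
    """Number of edges from the root to this node."""
    d = 0
    while node.parent is not None:
        node, d = node.parent, d + 1
    return d

def rerank(all_nodes, m=60):
    """Globally pick the m draft tokens most likely to be accepted by the target."""
    # Sort by cumulative Value, highest first; prefer shallower nodes on ties so
    # every selected node's ancestors are selected too (the draft stays a tree).
    ranked = sorted(all_nodes, key=lambda n: (-n.value, depth(n)))
    return ranked[:m]
```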

Putting It All Together

Let’s visualize the full pipeline using Figure 7. This is the core of the EAGLE-2 algorithm.

Illustration of EAGLE-2 pipeline.

  1. Drafting (Top): The model starts with “It”.
  • It predicts “is” (high confidence) and “has” (lower confidence).
  • It calculates the Value for the next layer. “is \(\rightarrow\) a” has a value of \(0.6 \times 0.8 = 0.48\).
  • It expands the nodes with the highest values. Notice how the tree grows asymmetrically. The “is” branch goes deeper because the confidence is higher.
  2. Reranking (Middle): We have generated many nodes. We now sort all of them (blue blocks) by their Value.
  3. Flattening (Bottom): The selected nodes are flattened into a 1D sequence (e.g., [It, is, has, a, the, to...]) to be fed into the GPU.
  4. Attention Mask: Because this is a tree flattened into a list, we must ensure that “has” does not attend to “is”—they belong to different possible realities (branches). The attention mask ensures each token only “sees” itself and its ancestors; a minimal sketch of building such a mask follows below.
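Here is a hedged sketch of building that mask from the flattened node list, reusing the hypothetical Node class from the expansion snippet. In a real system every drafted token also attends to the whole prompt prefix; the sketch only covers the tree part.

```python
import numpy as np

def tree_attention_mask(flat_nodes):
    """mask[i, j] is True iff drafted token i may attend to drafted token j."""
    index = {id(node): i for i, node in enumerate(flat_nodes)}
    mask = np.zeros((len(flat_nodes), len(flat_nodes)), dtype=bool)
    for i, node in enumerate(flat_nodes):
        cur = node
        while cur is not None:        # a token sees itself and its ancestors, nothing else
            j = index.get(id(cur))
            if j is not None:         # the root/prompt is not part of the flattened draft
                mask[i, j] = True
            cur = cur.parent
    return mask
```

Applied to the flattened example above, the row for “has” would be True only for “has” itself (plus the shared prefix handled separately), and never for “is” or anything on the “is” branch.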

Experiments and Results

Does this dynamic flexibility translate to real-world speed? The authors tested EAGLE-2 on several datasets (MT-bench, HumanEval, GSM8K) using Vicuna and LLaMA models.

Speedup Analysis

The results show that EAGLE-2 consistently outperforms standard Speculative Sampling (SpS) and the original EAGLE (EAGLE-1).

Speedup ratios of different methods at temperature 1.

Figure 1 shows the speedup ratios at Temperature=1 (a common setting for creative generation).

  • Vicuna 13B: EAGLE-2 achieves a 3.80x speedup, compared to 2.32x for EAGLE and 1.62x for standard speculative sampling.
  • LLaMA2-Chat 13B: EAGLE-2 reaches nearly 4x speedup (3.92x).

The gap widens further at Temperature=0 (Greedy decoding), as seen in Figure 2.

Speedup ratios of different methods at temperature 0.

Here, EAGLE-2 consistently leads the pack. On Vicuna 13B, it hits a 4.26x speedup. This is a massive improvement over the baseline autoregressive decoding (1.0x). Notably, it is roughly 20% to 40% faster than EAGLE-1.

Why is it faster?

The speedup is driven by the Average Acceptance Length (\(\tau\)). This metric tracks how many drafted tokens are accepted per verification step on average.

As shown in Table 1 (below), EAGLE-2 consistently has the highest \(\tau\). For Vicuna 13B on MT-bench, EAGLE-2 has an acceptance length of 4.83, compared to 3.98 for EAGLE. This means for every single forward pass of the massive Target LLM, EAGLE-2 effectively generates nearly 5 useful tokens.

Table 1: Speedup ratios and average acceptance lengths.
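As a rough, back-of-envelope way to see why a higher \(\tau\) drives the speedup (this little cost model is ours for illustration, not from the paper): each draft-verify cycle costs one target forward pass plus a few much cheaper draft passes, and yields \(\tau\) tokens on average.

```python
def rough_speedup(tau, draft_steps=6, draft_cost_ratio=0.05):
    """Toy estimate: tokens gained per cycle / cycle cost, in target-forward units.
    draft_cost_ratio is an assumed per-step cost of the draft model vs. the target."""
    cycle_cost = 1.0 + draft_steps * draft_cost_ratio   # 1 target pass + k draft passes
    return tau / cycle_cost

print(round(rough_speedup(4.83), 2))   # ~3.72 with these assumed overheads
print(round(rough_speedup(3.98), 2))   # ~3.06
```

Real measured speedups differ because of tree size, batching, and implementation overheads, but the trend is the same: more accepted tokens per expensive forward pass means fewer expensive forward passes overall.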

Ablation Study: Do Value and Reranking Matter?

The authors performed an ablation study (removing features to see their impact) to verify their design choices.

Table 3: Ablation experiment results.

Table 3 reveals:

  • w/o value: If you expand based only on local confidence (not cumulative Value), performance drops from 3.62x to 3.21x.
  • w/o reranking: If you skip the global reranking step, performance drops to 3.48x.
  • w/o both: If you remove both, you effectively revert closer to the baseline performance.

This confirms that both the context-aware expansion (using Value) and the global selection (Reranking) are critical for maximizing efficiency.

Conclusion and Implications

EAGLE-2 represents a significant refinement in the field of LLM acceleration. By abandoning the rigid constraints of static draft trees, it aligns the inference process with the linguistic reality that not all predictions are created equal.

Here are the key takeaways:

  1. Lossless Acceleration: EAGLE-2 does not approximate. The output distribution is guaranteed to be identical to that of the original LLM.
  2. Context-Aware: It dynamically allocates compute resources (tree depth/width) based on how confident the model is in the current context.
  3. Out-of-the-Box Usability: If you already have a draft model trained for EAGLE, you can switch to EAGLE-2 without any training. It is purely an algorithmic change to how the draft tree is constructed and verified.

As LLMs continue to grow in size, inference latency remains a critical hurdle. Techniques like EAGLE-2, which optimize the process of generation rather than just compressing the model, will be essential for making advanced AI accessible and responsive. By treating inference as a dynamic decision-making process rather than a static loop, we can finally break the speed limit of autoregressive generation.