Introduction
We are currently living in the “Golden Age” of Large Language Models (LLMs). From LLaMA to GPT-4, these models have demonstrated capabilities in reasoning, coding, and creative writing that were unimaginable a decade ago. However, this intelligence comes with a massive price tag: computational cost.
Running inference on a model like GPT-175B requires hundreds of gigabytes of GPU memory and massive parallel processing power. The primary bottleneck is the autoregressive decoding process. Because LLMs generate text token-by-token—where the generation of the \(n\)-th token depends on all previous \(n-1\) tokens—the computational load scales linearly with the sequence length. Every single token requires a full pass through billions of parameters.
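To make that cost concrete, here is a minimal greedy-decoding loop. It is only a sketch, assuming a hypothetical `model` callable that returns logits of shape `(batch, seq_len, vocab)`; the point is that every new token triggers yet another full forward pass through all of the model's parameters.

```python
import torch

def greedy_decode(model, input_ids, max_new_tokens=50, eos_id=2):
    """Minimal greedy decoding loop: every new token needs a full
    forward pass through every layer of the model."""
    tokens = input_ids
    for _ in range(max_new_tokens):
        logits = model(tokens)                 # full pass over all parameters
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=-1)
        if next_token.item() == eos_id:        # stop at end-of-sequence (batch size 1 assumed)
            break
    return tokens
```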
To solve this, researchers have long proposed “early-exit” or “layer-skipping” strategies. The logic is simple: not every token requires the full brainpower of the model. If the model is predicting the word “the” or a punctuation mark, surely it can skip a few layers, right?
While this sounds intuitive, it introduces a critical problem: Key-Value (KV) Cache mismanagement. When you skip layers to save time, you disrupt the memory mechanism of the Transformer, often leading to hallucinations or “token collapse” (where the model starts repeating itself endlessly).
Today, we are diving into a research paper that proposes a clever workaround. Instead of skipping entire layers, the authors introduce FFN-SkipLLM. This method strategically skips only the Feed-Forward Network (FFN) blocks—the most computationally expensive part of the layer—while keeping the Attention mechanism intact. The result is a faster model that retains its intelligence and, crucially, avoids the hallucinations that plague other compression methods.
The Problem with Traditional Layer Skipping
To understand why FFN-SkipLLM is necessary, we first need to understand why previous attempts have failed.
In a standard Transformer architecture, every layer processes information and passes it to the next. To speed up inference, methods like SkipDecode or ShortGPT attempt to skip entire layers. If a token is “easy” to predict, these methods might exit the network at Layer 10 instead of Layer 32.
However, LLMs rely on a KV Cache to store the context of previous tokens. If the current token skips Layer 15, it produces no Key or Value states for that layer. When the next token tries to attend to the previous token at Layer 15, it finds a “hole” in the memory.
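A toy illustration of the bookkeeping problem (my own sketch, not the paper's code): if we record one (key, value) entry per layer for every generated token, skipping an entire layer leaves a `None` hole that later tokens stumble over when they try to attend back.

```python
# Toy KV-cache bookkeeping: one (K, V) entry per layer per token.
# Skipping a whole layer for a token leaves a hole in that layer's row.
NUM_LAYERS = 4
kv_cache = {layer: [] for layer in range(NUM_LAYERS)}

def write_token(token_id, skipped_layers=frozenset()):
    """Pretend to run one token through the model, recording dummy K/V states."""
    for layer in range(NUM_LAYERS):
        if layer in skipped_layers:
            kv_cache[layer].append(None)       # no K/V produced -> hole in memory
        else:
            kv_cache[layer].append((f"K{layer}_{token_id}", f"V{layer}_{token_id}"))

write_token(0)                       # full pass: every layer gets K/V
write_token(1, skipped_layers={2})   # layer 2 skipped: token 1 leaves a hole there

missing = [(layer, t) for layer, row in kv_cache.items()
           for t, kv in enumerate(row) if kv is None]
print(missing)   # [(2, 1)] -> the next token cannot attend to token 1 at layer 2
```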
Previous works tried to patch this by copying states from other layers or recomputing missing values, but these are band-aid solutions. As the authors of this paper discovered, these methods often result in severe performance degradation on knowledge-intensive tasks.

As shown in Figure 1, when asked “Who is the prime minister of India?”, traditional skipping methods (SkipDecode and ShortGPT) fail spectacularly. They either hallucinate that the position has been abolished or devolve into gibberish. In contrast, FFN-SkipLLM (the green bubble) retrieves the correct information despite skipping roughly 25% of the computation.
Background: Anatomy of a Transformer Layer
Why does FFN-SkipLLM work where others fail? The answer lies in the architecture of the Transformer layer itself.
A single Transformer layer consists of two primary blocks:
- Multi-Head Attention (MHA): This allows the token to look at other tokens in the sequence. This is where the KV Cache lives.
- Feed-Forward Network (FFN): This processes the information gathered by the attention mechanism. It is often described as the “memory” or “knowledge” storage of the network.
Crucially, these two blocks are not equal in terms of size.

As Table 1 illustrates using LLaMA-7B as an example, the Feed-Forward Network accounts for approximately two-thirds of the parameters in a layer. The Attention mechanism is relatively lightweight in comparison.
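A quick back-of-the-envelope count reproduces that ratio. Assuming the standard LLaMA-7B dimensions (hidden size 4096, SwiGLU FFN with intermediate size 11008) and ignoring biases and layer norms:

```python
# Rough parameter count for one LLaMA-7B layer (biases and norms ignored).
d_model, d_ffn = 4096, 11008

attn_params = 4 * d_model * d_model      # Q, K, V and output projections
ffn_params  = 3 * d_model * d_ffn        # gate, up and down projections (SwiGLU)

total = attn_params + ffn_params
print(f"Attention: {attn_params/1e6:.1f}M ({attn_params/total:.0%})")
print(f"FFN:       {ffn_params/1e6:.1f}M ({ffn_params/total:.0%})")
# Attention: 67.1M (33%)
# FFN:       135.3M (67%)
```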
This leads to the core hypothesis of the paper: If we only skip the heavy FFN blocks but keep the lightweight Attention blocks running, we can save massive amounts of compute without breaking the KV Cache.
The Core Method: FFN-SkipLLM
The researchers propose a fine-grained strategy: Input-Adaptive Feed-Forward Skipping. Instead of dropping a whole layer, they execute the Attention block (updating the KV cache correctly) but conditionally skip the FFN block if it is deemed redundant.
1. The Redundancy of FFN Blocks
How do we know if an FFN block is redundant? The authors analyzed the cosine similarity between the input tensor entering an FFN block and the output tensor leaving it.
If the cosine similarity is high (close to 1.0), it means the FFN block didn’t change the representation much—it effectively did nothing. If the similarity is low, the FFN block performed a significant transformation.
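Here is a minimal sketch of that redundancy metric (my own illustration, not the authors' code), using PyTorch's built-in cosine similarity:

```python
import torch
import torch.nn.functional as F

def ffn_redundancy(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Cosine similarity between the hidden state entering an FFN block and
    the one leaving it; values near 1.0 suggest the block barely changed
    the representation."""
    sim = F.cosine_similarity(hidden_in.flatten(), hidden_out.flatten(), dim=0)
    return sim.item()

# An FFN block that is close to the identity yields similarity near 1.0.
x = torch.randn(1, 4096)
print(ffn_redundancy(x, x + 0.01 * torch.randn_like(x)))   # ~0.99+
print(ffn_redundancy(x, torch.randn(1, 4096)))             # ~0.0
```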

Figure 2 reveals three critical insights about LLM behavior:
- High Similarity: There is generally a high similarity between inputs and outputs of FFNs, suggesting inherent redundancy.
- The “Cold” Regions: The first few layers and the last few layers (marked in red) have lower similarity. These are the “Cold Regions.” The FFNs here are doing heavy lifting and should never be skipped.
- The Redundant Middle: The middle layers (yellow region) show a monotonically increasing similarity. This is the sweet spot for skipping.
2. The Algorithm
Based on these observations, FFN-SkipLLM employs a simple but effective algorithm during token generation:
- Warm-Up Phase: Due to the “Attention Sink” phenomenon (where the first few tokens act as anchors for the attention mechanism), the first few tokens of a sequence (e.g., the first 25-30 tokens) are processed by the full model. No skipping allowed. This stabilizes the generation.
- Identify Regions: The model identifies the “Cold Start” (first few layers) and “Cold End” (last few layers). FFNs in these layers are always executed.
- Adaptive Skipping: For the layers in the middle, the model monitors the cosine similarity between the tensor entering an FFN block and the tensor leaving it. If the similarity exceeds a chosen threshold, that FFN is deemed redundant.
- Greedy Skip: Because redundancy tends to increase monotonically in the middle layers, if the model decides to skip an FFN, it can greedily skip the next \(k\) FFN blocks as well, further saving compute.
The Beauty of this Approach: Because the Attention blocks are never skipped, the Key-Value Cache is always perfectly maintained. The model always has a complete “memory” of the past, even if it skipped the “processing” (FFN) of certain steps.
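Putting the pieces together, here is an illustrative sketch of the decode-time policy. It assumes hypothetical layer objects with `.attention()` and `.ffn()` methods that each return an updated hidden state (with the KV cache updated inside `.attention()`); the authors' actual implementation may differ in details such as how the threshold and \(k\) are chosen.

```python
import torch.nn.functional as F

def decode_layer_stack(layers, hidden, token_idx, *,
                       warmup_tokens=30, cold_start=3, cold_end=3,
                       threshold=0.98, greedy_k=2):
    """Illustrative FFN-skipping policy (a sketch, not the authors' code)."""
    skip_remaining = 0
    for i, layer in enumerate(layers):
        hidden = layer.attention(hidden)          # never skipped: KV cache stays intact

        in_warmup = token_idx < warmup_tokens     # attention-sink warm-up
        in_cold = i < cold_start or i >= len(layers) - cold_end
        if in_warmup or in_cold:
            hidden = layer.ffn(hidden)            # cold regions always run the FFN
            continue

        if skip_remaining > 0:                    # greedy skip of the next k FFNs
            skip_remaining -= 1
            continue

        ffn_out = layer.ffn(hidden)
        sim = F.cosine_similarity(hidden.flatten(), ffn_out.flatten(), dim=0)
        if sim > threshold:                       # this FFN barely changed the state,
            skip_remaining = greedy_k             # so assume the next k are redundant too
        hidden = ffn_out
    return hidden
```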
Experiments and Results
The authors subjected FFN-SkipLLM to a rigorous “knowledge-intensive” evaluation. Unlike standard benchmarks that just check for fluency, these tests determine if the model actually remembers facts and follows instructions.
1. Factoid-Based Question Answering
This task tests the model’s ability to recall specific entities and attributes (e.g., “Who wrote ‘1984’?”).

Table 2 compares FFN-SkipLLM against the full model and other skipping methods.
- Full Model: 79.02% accuracy.
- SkipDecode: Drops to 73.33%.
- ShortGPT: Drops to 70.49%.
- FFN-SkipLLM: Maintains 78.89% accuracy.
Even when skipping ~25% of the FFN blocks, the performance drop is negligible (about 0.13 percentage points). This confirms that the skipped FFN blocks were indeed redundant for recalling facts.
2. In-Context Summarization
Summarization tests the model’s ability to process long contexts and generate coherent text. The researchers used GPT-4 to judge the quality of summaries generated by the compressed models.

Figure 3 shows the results across Coherence, Consistency, Fluency, and Relevance.
- The Red Line (Ours) consistently hovers near the top, even as the percentage of pruned FFNs increases (moving right on the x-axis).
- Interestingly, at a ~10-12% skip ratio, FFN-SkipLLM sometimes achieves better scores than the full model. The authors hypothesize that removing redundant processing might reduce “over-thinking” or noise in the signal.
3. Multi-Turn Conversation
Real-world usage often involves back-and-forth dialogue. The authors used the MT-Bench dataset to evaluate performance across categories like Coding, Math, and Roleplay.

Figure 4 highlights the robustness of the method.
- In complex categories like Coding and Math (Fermi), FFN-SkipLLM (Red Line) stays very close to the Full Model (Blue Line) up to a 25% skip ratio.
- The Random Skip baseline (Orange Line) crashes almost immediately, proving that you cannot just skip FFNs arbitrarily—you must target the redundant ones.
4. Efficiency and Speedup
The ultimate goal of this research is speed. Does skipping FFNs actually translate to real-world gains?

Table 4 shows the FLOPs (Floating Point Operations) reduction and the increase in throughput (tokens per second).
- At a 50% skip ratio, the model achieves a throughput of 12.45 tokens/second, compared to 9.08 for the dense model.
- The FLOPs reduction is substantial, saving up to ~4.3 billion operations per token in the 50% scenario.
While there is a small overhead for calculating the cosine similarity, the savings from skipping the massive FFN blocks far outweigh the cost, especially as the skipping ratio increases.
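As a quick sanity check, the throughput figures reported above translate into roughly a 1.37x end-to-end speedup:

```python
dense_tps, skip_tps = 9.08, 12.45          # tokens/second from Table 4
speedup = skip_tps / dense_tps
print(f"~{speedup:.2f}x throughput at the 50% skip ratio")   # ~1.37x
```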
Conclusion & Implications
FFN-SkipLLM represents a shift in how we think about model compression. Rather than treating a Transformer layer as an atomic unit that must be kept or discarded as a whole, this work encourages us to look inside the layer.
By recognizing that:
- FFNs are heavy (roughly two-thirds of a layer's parameters),
- FFNs are often redundant in middle layers, and
- Attention is critical for memory (KV Cache),
the authors crafted a method that circumvents the catastrophic failures of previous “layer-dropping” techniques.
This “Hidden Gem” suggests that future LLM architectures might be designed with modularity in mind, perhaps dynamically allocating FFN capacity only when the input complexity demands it. For students and researchers, FFN-SkipLLM is a perfect example of how analyzing the fundamental properties of a network (like cosine similarity and parameter distribution) can lead to simple, elegant, and highly effective optimization strategies.