Large Language Models (LLMs) are transforming software development. From autocompleting entire functions to answering complex, repository-level questions, these AI assistants are quickly becoming indispensable.
But they have an Achilles’ heel: context length.
When you ask an LLM to work with a large project, you often end up feeding it tens of thousands of lines of code. This long context creates a perfect storm of problems:
- Lost in the middle: Models can struggle to identify relevant pieces as important tokens get buried.
- Slow generation: Attention mechanisms scale quadratically, so long inputs cause latency to skyrocket.
- High costs: Commercial APIs charge by the token. Long contexts quickly run up the bill.
For source code, this is particularly problematic. Unlike prose, code has intricate interdependencies: a function in one file may be essential to logic spread across dozens of others. Naively truncating text to fit a budget risks breaking syntactic structure and dropping critical constraints.
Existing solutions like Retrieval-Augmented Generation (RAG) try to fetch only relevant snippets. But RAG often relies on embedding-based text similarity, which is fine for finding functions with similar names but fails to catch subtle, non-obvious dependencies.
So, what if we could compress the code intelligently—keeping exactly the crucial parts while discarding the rest?
A new research paper—“LongCodeZip: Compress Long Context for Code Language Models”—introduces such a framework. It’s training-free, model-agnostic, and designed specifically for code. It leverages the structure and semantics of source code to achieve up to 5.6× compression without hurting (and sometimes improving) performance.
Let’s explore how it works.
The Trouble with Similarity: Why RAG Falls Short for Code
Retrieval-Augmented Generation is popular for taming long contexts. It uses a model to embed and compare snippets, then retrieves those “closest” to the target query or code to be completed.
But in code, “similarity” comes in flavors:
- Lexical: Shared variable names, keywords.
- Structural: Matching function signatures.
- Semantic / Dependency-based: Connections that only appear when you understand program flow.
RAG excels at lexical and superficial structural matches but often misses deeper, dependency-based links—especially if these are implicit.
Figure 1: RAG succeeds in finding lexically similar code (left) but fails to discover crucial, non-obvious dependencies (right), like a configuration class needed to set up an optimizer.
Consider the examples above. If you need to complete `get_email_by_id`, RAG will readily find `get_account_by_id`, a near-perfect lexical match. This is similarity relevance.
But in a different task, completing `train_model`, the essential `Config` class lives elsewhere and shares no identifiers with it. Without understanding the dependency, RAG misses it and ranks irrelevant code higher. The result: incomplete or incorrect completions.
We need a relevance measure that understands functionality and dependencies, not just surface similarity.
Introducing LongCodeZip: A Coarse-to-Fine Compression Framework
LongCodeZip tackles this by measuring information gain instead of simple similarity. It asks: Which snippets reduce the model’s uncertainty the most?
Perplexity as a Relevance Signal
Perplexity measures how “surprised” a model is by a sequence. If adding a context snippet makes the model less surprised by the instruction—i.e., reduces perplexity—that snippet is likely essential.
The authors define Approximated Mutual Information (AMI) as:
\[ AMI(c, q) = PPL(q) - PPL(q \mid c) \]
Where:
- \(PPL(q)\): perplexity of the instruction on its own.
- \(PPL(q|c)\): perplexity of the instruction given context \(c\).
A higher AMI means \(c\) makes the model more confident—capturing both similarity and dependency relevance.
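To make the signal concrete, here is a minimal sketch of computing AMI with an off-the-shelf causal LM. The model choice, prompt layout, and helper names are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch: scoring a context chunk by Approximated Mutual Information (AMI).
# The model choice and prompt layout below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-Coder-0.5B-Instruct"  # assumed: any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity_of(target: str, condition: str = "") -> float:
    """Perplexity of `target`, optionally conditioned on preceding `condition` text."""
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    if condition:
        cond_ids = tokenizer(condition, return_tensors="pt").input_ids
        input_ids = torch.cat([cond_ids, target_ids], dim=1)
        n_cond = cond_ids.shape[1]
    else:
        input_ids, n_cond = target_ids, 0
    labels = input_ids.clone()
    labels[:, :n_cond] = -100  # ignore the condition tokens; score only the target
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over scored tokens
    return torch.exp(loss).item()

def ami(chunk: str, instruction: str) -> float:
    """AMI(c, q) = PPL(q) - PPL(q | c): how much this chunk reduces surprise at q."""
    return perplexity_of(instruction) - perplexity_of(instruction, condition=chunk)
```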
Stage 1: Coarse-Grained Compression (Selecting Relevant Functions)
The first stage filters out irrelevant whole functions or classes; a minimal sketch of the selection loop follows the steps below.
- Function-Level Chunking: Split the source code at function/class boundaries. This maintains syntactic validity and keeps logic in self-contained blocks.
- Instruction-Aware Ranking: Score each chunk using AMI with respect to the task instruction.
- Budget-Constrained Selection: Greedily select top-ranked chunks until the coarse budget (\(B_{\mathrm{coarse}}\)) is exhausted. Non-selected chunks are replaced with placeholders to preserve the overall file structure.
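Here is that selection loop as a rough sketch, assuming the code has already been split at function boundaries (for example with a parser such as tree-sitter, omitted here) and reusing the `ami` helper and tokenizer from the previous sketch; `count_tokens` and the placeholder text are assumptions.

```python
# Minimal sketch of Stage 1: rank function-level chunks by AMI, then greedily keep the
# top-ranked ones until the coarse token budget is exhausted.

def count_tokens(text: str) -> int:
    return len(tokenizer(text).input_ids)

def coarse_compress(chunks: list[str], instruction: str, coarse_budget: int) -> list[str]:
    """Keep the highest-AMI chunks that fit within `coarse_budget` tokens."""
    ranked = sorted(range(len(chunks)), key=lambda i: ami(chunks[i], instruction), reverse=True)
    kept, used = set(), 0
    for i in ranked:
        cost = count_tokens(chunks[i])
        if used + cost <= coarse_budget:
            kept.add(i)
            used += cost
    # Preserve the original file order; unselected chunks become short placeholders.
    return [chunks[i] if i in kept else "# ... omitted ..." for i in range(len(chunks))]
```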
Stage 2: Fine-Grained Compression (Pruning Within Functions)
With the most relevant functions in hand, LongCodeZip trims them further; minimal sketches of each step follow the list below.
Figure 2: Stage 1 selects relevant functions; Stage 2 prunes blocks within them.
- Perplexity-Based Block Chunking: Segment each function into semantic blocks. Within a coherent block, perplexity stays steady; a spike often marks a new logical unit, so boundaries are placed where perplexity rises sharply compared to its neighbors.
- Adaptive Budget Allocation: Not all functions are equally important.
  - Very short functions (<5 lines) are kept whole.
  - Larger ones get a baseline retention ratio adjusted by normalized AMI and an importance parameter \(\beta\): \[ R_{\mathrm{biased}, i} = R_{\mathrm{base}} \cdot \left(1 + \beta \cdot (2 \times AMI_{\text{norm}, i} - 1) \right) \]
  - Ratios are rescaled so the total token count matches the final budget.
- Dynamic Block Selection (Knapsack Optimization): Treat each block as an item:
  - Value = AMI score
  - Weight = token length
  Choose items to maximize total value under the budget, ensuring the most relevant code remains.
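For the block-chunking step, here is a rough sketch in the same spirit: score each line's perplexity given the lines before it and start a new block at sharp rises. Comparing a line only to its predecessor, and the 1.5× spike threshold, are simplifying assumptions rather than the paper's exact rule.

```python
# Rough sketch of perplexity-based block chunking: compute each line's perplexity given
# the lines before it, and start a new block wherever perplexity rises sharply.
# Reuses `perplexity_of` from the AMI sketch; the 1.5x spike threshold is an assumption.

def chunk_by_perplexity(function_code: str, spike_ratio: float = 1.5) -> list[str]:
    lines = function_code.splitlines()
    line_ppl = [
        perplexity_of(line + "\n", condition="\n".join(lines[:i]))
        for i, line in enumerate(lines)
    ]
    blocks, current = [], [lines[0]]
    for i in range(1, len(lines)):
        if line_ppl[i] > spike_ratio * line_ppl[i - 1]:  # sharp rise => new logical unit
            blocks.append("\n".join(current))
            current = [lines[i]]
        else:
            current.append(lines[i])
    blocks.append("\n".join(current))
    return blocks
```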
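The adaptive budget allocation can be sketched directly from the formula above. The default values for \(R_{\mathrm{base}}\) and \(\beta\), the min-max normalization of AMI, and the rescaling details are assumptions.

```python
# Rough sketch of adaptive budget allocation: bias each kept function's retention ratio
# by its normalized AMI, then rescale so the total matches the fine-grained budget.
# Default r_base/beta values and the min-max normalization are assumptions.

def allocate_budgets(functions, ami_scores, fine_budget, r_base=0.5, beta=0.5):
    """`functions` is a list of (code, token_count); returns per-function token budgets."""
    lo, hi = min(ami_scores), max(ami_scores)
    ami_norm = [(a - lo) / (hi - lo) if hi > lo else 0.5 for a in ami_scores]

    budgets = []
    for (code, n_tokens), a in zip(functions, ami_norm):
        if code.count("\n") + 1 < 5:                    # very short functions are kept whole
            budgets.append(float(n_tokens))
        else:
            r_biased = r_base * (1 + beta * (2 * a - 1))  # the formula above
            budgets.append(r_biased * n_tokens)

    # Rescale so the allocations sum to the final fine-grained budget.
    scale = fine_budget / max(sum(budgets), 1.0)
    return [int(b * scale) for b in budgets]
```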
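Finally, block selection as a textbook 0/1 knapsack: value is the block's AMI score, weight its token count, and the capacity is the function's allocated budget. The paper frames the problem this way; the DP solver below is a standard implementation, not necessarily the authors' own.

```python
# Block selection as a 0/1 knapsack: value = AMI score, weight = token count,
# capacity = the function's allocated token budget. Standard DP with backtracking.

def select_blocks(values: list[float], weights: list[int], budget: int) -> list[int]:
    """Return indices of blocks maximizing total AMI under the token budget."""
    n = len(values)
    dp = [0.0] * (budget + 1)                  # dp[w] = best value at capacity w
    take = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        for w in range(budget, weights[i] - 1, -1):
            if dp[w - weights[i]] + values[i] > dp[w]:
                dp[w] = dp[w - weights[i]] + values[i]
                take[i][w] = True
    # Backtrack to recover which blocks were chosen, then restore source order.
    chosen, w = [], budget
    for i in range(n - 1, -1, -1):
        if take[i][w]:
            chosen.append(i)
            w -= weights[i]
    return sorted(chosen)
```

Each kept function would run this selection against its own allocation from `allocate_budgets`, and the surviving blocks are re-joined in source order to form the compressed context.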
Experimental Evaluation
The researchers tested LongCodeZip on three long-context benchmarks.
Table 1: Benchmarks have average contexts around 10K tokens.
Tasks:
- Long Code Completion
- Long Module Summarization
- RepoQA (repository-level QA)
Baselines: RAG (Sliding/Function), text compressors (LLMLingua, LongLLMLingua), code compressors (DietCode, SlimCode), and upper/lower bounds (No Compression, No Context).
RQ1: Compression & Performance
Table 2: LongCodeZip consistently achieves top ES/EM, sometimes exceeding No Compression—with far fewer tokens.
Highlights:
- On Long Code Completion, it sometimes beats the No Compression baseline. For example, Seed-Coder-8B drops just 0.93 ES points while using only ~18% of the tokens (5.6× compression).
- On RepoQA, removing irrelevant context made retrieval easier, sometimes surpassing No Compression with GPT-4o and Claude-3.7-Sonnet.
Table 3: Works equally well on top closed-source models.
RQ2: Ablation—What Matters Most?
Table 4: Coarse-grained perplexity ranking is the most critical component.
Swapping the AMI ranking for similarity-based retrieval caused a drop of roughly 8 ES points. Removing the fine-grained optimizations (adaptive budgeting, knapsack selection) hurt performance further, showing that each component adds value.
RQ3 & RQ4: Generalization & Efficiency
- Cross-Model: A tiny 0.5B model can handle compression, then pass the optimized context to a larger generator with near-identical results.
- Efficiency: A small compression overhead (~2.6s) leads to large downstream savings:
  - Generation time cut from 15.7s to 6.6s
  - 77% fewer input tokens, which translates to roughly 77% lower API cost
Figure 3: LongCodeZip maintains high ES even under extreme compression.
Case Study: Completing `execute_blind`
Figure 4: Perplexity spikes define semantic blocks; knapsack keeps the most important ones.
The function `evaluate_blind` is identified as relevant. Perplexity analysis segments it into blocks, and AMI scores prioritize the ones initializing `action` and `call_name`. Including these ensures the model correctly completes `execute_blind`.
Conclusion: Smarter Long Context for Code LLMs
LongCodeZip is a leap forward in handling large codebases with LLMs. Key contributions:
- Hierarchical Compression: Function filtering plus block pruning balances compression and semantic integrity.
- Perplexity-Driven Relevance: AMI captures deep dependencies missed by similarity methods.
- Practical & Plug-and-Play: Training-free, model-agnostic, quick to integrate.
By cutting context size, generation time, and API costs without sacrificing performance, LongCodeZip makes LLMs more viable for real-world engineering—especially when “the whole codebase” won’t fit.
Sometimes, less context really is more.