Large Language Models (LLMs) are transforming software development. From autocompleting entire functions to answering complex, repository-level questions, these AI assistants are quickly becoming indispensable.
But they have an Achilles’ heel: context length.
When you ask an LLM to work with a large project, you often end up feeding it tens of thousands of lines of code. This long context creates a perfect storm of problems:
- Lost in the middle: Models can struggle to identify relevant pieces as important tokens get buried.
- Slow generation: Attention mechanisms scale quadratically, so long inputs cause latency to skyrocket.
- High costs: Commercial APIs charge by the token. Long contexts quickly run up the bill.
For source code, this is particularly problematic. Unlike prose, code has intricate interdependencies: a function in one file may be essential to logic spread across dozens of others. Naively truncating text to fit a budget risks breaking syntactic structure and dropping critical constraints.
Existing solutions like Retrieval-Augmented Generation (RAG) try to fetch only relevant snippets. But RAG often relies on embedding-based text similarity, which is fine for finding functions with similar names but fails to catch subtle, non-obvious dependencies.
So, what if we could compress the code intelligently—keeping exactly the crucial parts while discarding the rest?
A new research paper—“LongCodeZip: Compress Long Context for Code Language Models”—introduces such a framework. It’s training-free, model-agnostic, and designed specifically for code. It leverages the structure and semantics of source code to achieve up to 5.6× compression without hurting (and sometimes improving) performance.
Let’s explore how it works.
The Trouble with Similarity: Why RAG Falls Short for Code
Retrieval-Augmented Generation is popular for taming long contexts. It uses a model to embed and compare snippets, then retrieves those “closest” to the target query or code to be completed.
But in code, “similarity” comes in flavors:
- Lexical: Shared variable names, keywords.
- Structural: Matching function signatures.
- Semantic / Dependency-based: Connections that only appear when you understand program flow.
RAG excels at lexical and superficial structural matches but often misses deeper, dependency-based links—especially if these are implicit.
Figure 1: RAG succeeds in finding lexically similar code (left) but fails to discover crucial, non-obvious dependencies (right), like a configuration class needed to set up an optimizer.
Consider the examples above. If you need to complete `get_email_by_id`, RAG will readily find `get_account_by_id`, a near-perfect lexical match. This is similarity relevance.
But in a different task, completing `train_model`, the essential `Config` class lives elsewhere and shares no identifiers with it. Without understanding the dependency, RAG misses it and ranks irrelevant code higher. The result: incomplete or incorrect completions.
We need a relevance measure that understands functionality and dependencies, not just surface similarity.
Introducing LongCodeZip: A Coarse-to-Fine Compression Framework
LongCodeZip tackles this by measuring information gain instead of simple similarity. It asks: Which snippets reduce the model’s uncertainty the most?
Perplexity as a Relevance Signal
Perplexity measures how “surprised” a model is by a sequence. If adding a context snippet makes the model less surprised by the instruction—i.e., reduces perplexity—that snippet is likely essential.
The authors define Approximated Mutual Information (AMI) as:
\[ AMI(c, q) = PPL(q) - PPL(q \mid c) \]
Where:
- \(PPL(q)\): perplexity of the instruction on its own.
- \(PPL(q|c)\): perplexity of the instruction given context \(c\).
A higher AMI means \(c\) makes the model more confident—capturing both similarity and dependency relevance.
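To make the signal concrete, here is a minimal sketch of computing AMI with an off-the-shelf causal LM. The model choice, prompt layout, and helper names are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch: scoring a context chunk by Approximated Mutual Information (AMI).
# The model choice and prompt layout below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-Coder-0.5B-Instruct"  # assumed: any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity_of(target: str, condition: str = "") -> float:
    """Perplexity of `target`, optionally conditioned on preceding `condition` text."""
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    if condition:
        cond_ids = tokenizer(condition, return_tensors="pt").input_ids
        input_ids = torch.cat([cond_ids, target_ids], dim=1)
        n_cond = cond_ids.shape[1]
    else:
        input_ids, n_cond = target_ids, 0
    labels = input_ids.clone()
    labels[:, :n_cond] = -100  # ignore the condition tokens; score only the target
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over scored tokens
    return torch.exp(loss).item()

def ami(chunk: str, instruction: str) -> float:
    """AMI(c, q) = PPL(q) - PPL(q | c): how much this chunk reduces surprise at q."""
    return perplexity_of(instruction) - perplexity_of(instruction, condition=chunk)
```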
Stage 1: Coarse-Grained Compression (Selecting Relevant Functions)
The first stage filters out irrelevant whole functions or classes; a minimal sketch of the selection loop follows the steps below.
- Function-Level Chunking: Split the source code at function/class boundaries. This maintains syntactic validity and keeps logic in self-contained blocks.
- Instruction-Aware Ranking: Score each chunk using AMI with respect to the task instruction.
- Budget-Constrained Selection: Greedily select top-ranked chunks until the coarse budget (\(B_{\mathrm{coarse}}\)) is exhausted. Non-selected chunks are replaced with placeholders to preserve the overall file structure.
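Here is that selection loop as a rough sketch, assuming the code has already been split at function boundaries (for example with a parser such as tree-sitter, omitted here) and reusing the `ami` helper and tokenizer from the previous sketch; `count_tokens` and the placeholder text are assumptions.

```python
# Minimal sketch of Stage 1: rank function-level chunks by AMI, then greedily keep the
# top-ranked ones until the coarse token budget is exhausted.

def count_tokens(text: str) -> int:
    return len(tokenizer(text).input_ids)

def coarse_compress(chunks: list[str], instruction: str, coarse_budget: int) -> list[str]:
    """Keep the highest-AMI chunks that fit within `coarse_budget` tokens."""
    ranked = sorted(range(len(chunks)), key=lambda i: ami(chunks[i], instruction), reverse=True)
    kept, used = set(), 0
    for i in ranked:
        cost = count_tokens(chunks[i])
        if used + cost <= coarse_budget:
            kept.add(i)
            used += cost
    # Preserve the original file order; unselected chunks become short placeholders.
    return [chunks[i] if i in kept else "# ... omitted ..." for i in range(len(chunks))]
```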
Stage 2: Fine-Grained Compression (Pruning Within Functions)
With the most relevant functions in hand, LongCodeZip trims them further; minimal sketches of each step follow the list below.
Figure 2: Stage 1 selects relevant functions; Stage 2 prunes blocks within them.
- Perplexity-Based Block Chunking: Segment each function into semantic blocks. Within a coherent block, perplexity stays steady; a spike often marks a new logical unit, so boundaries are placed where perplexity rises sharply compared to its neighbors.
- Adaptive Budget Allocation: Not all functions are equally important.
  - Very short functions (<5 lines) are kept whole.
  - Larger ones get a baseline retention ratio adjusted by normalized AMI and an importance parameter \(\beta\): \[ R_{\mathrm{biased}, i} = R_{\mathrm{base}} \cdot \left(1 + \beta \cdot (2 \times AMI_{\text{norm}, i} - 1) \right) \]
  - Ratios are rescaled so the total token count matches the final budget.
- Dynamic Block Selection (Knapsack Optimization): Treat each block as an item:
  - Value = AMI score
  - Weight = token length
  Choose items to maximize total value under the budget, ensuring the most relevant code remains.
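For the block-chunking step, here is a rough sketch in the same spirit: score each line's perplexity given the lines before it and start a new block at sharp rises. Comparing a line only to its predecessor, and the 1.5× spike threshold, are simplifying assumptions rather than the paper's exact rule.

```python
# Rough sketch of perplexity-based block chunking: compute each line's perplexity given
# the lines before it, and start a new block wherever perplexity rises sharply.
# Reuses `perplexity_of` from the AMI sketch; the 1.5x spike threshold is an assumption.

def chunk_by_perplexity(function_code: str, spike_ratio: float = 1.5) -> list[str]:
    lines = function_code.splitlines()
    line_ppl = [
        perplexity_of(line + "\n", condition="\n".join(lines[:i]))
        for i, line in enumerate(lines)
    ]
    blocks, current = [], [lines[0]]
    for i in range(1, len(lines)):
        if line_ppl[i] > spike_ratio * line_ppl[i - 1]:  # sharp rise => new logical unit
            blocks.append("\n".join(current))
            current = [lines[i]]
        else:
            current.append(lines[i])
    blocks.append("\n".join(current))
    return blocks
```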
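The adaptive budget allocation can be sketched directly from the formula above. The default values for \(R_{\mathrm{base}}\) and \(\beta\), the min-max normalization of AMI, and the rescaling details are assumptions.

```python
# Rough sketch of adaptive budget allocation: bias each kept function's retention ratio
# by its normalized AMI, then rescale so the total matches the fine-grained budget.
# Default r_base/beta values and the min-max normalization are assumptions.

def allocate_budgets(functions, ami_scores, fine_budget, r_base=0.5, beta=0.5):
    """`functions` is a list of (code, token_count); returns per-function token budgets."""
    lo, hi = min(ami_scores), max(ami_scores)
    ami_norm = [(a - lo) / (hi - lo) if hi > lo else 0.5 for a in ami_scores]

    budgets = []
    for (code, n_tokens), a in zip(functions, ami_norm):
        if code.count("\n") + 1 < 5:                    # very short functions are kept whole
            budgets.append(float(n_tokens))
        else:
            r_biased = r_base * (1 + beta * (2 * a - 1))  # the formula above
            budgets.append(r_biased * n_tokens)

    # Rescale so the allocations sum to the final fine-grained budget.
    scale = fine_budget / max(sum(budgets), 1.0)
    return [int(b * scale) for b in budgets]
```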
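Finally, block selection as a textbook 0/1 knapsack: value is the block's AMI score, weight its token count, and the capacity is the function's allocated budget. The paper frames the problem this way; the DP solver below is a standard implementation, not necessarily the authors' own.

```python
# Block selection as a 0/1 knapsack: value = AMI score, weight = token count,
# capacity = the function's allocated token budget. Standard DP with backtracking.

def select_blocks(values: list[float], weights: list[int], budget: int) -> list[int]:
    """Return indices of blocks maximizing total AMI under the token budget."""
    n = len(values)
    dp = [0.0] * (budget + 1)                  # dp[w] = best value at capacity w
    take = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        for w in range(budget, weights[i] - 1, -1):
            if dp[w - weights[i]] + values[i] > dp[w]:
                dp[w] = dp[w - weights[i]] + values[i]
                take[i][w] = True
    # Backtrack to recover which blocks were chosen, then restore source order.
    chosen, w = [], budget
    for i in range(n - 1, -1, -1):
        if take[i][w]:
            chosen.append(i)
            w -= weights[i]
    return sorted(chosen)
```

Each kept function would run this selection against its own allocation from `allocate_budgets`, and the surviving blocks are re-joined in source order to form the compressed context.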
Experimental Evaluation
The researchers tested LongCodeZip on three long-context benchmarks.
Table 1: Benchmarks have average contexts around 10K tokens.
Tasks:
- Long Code Completion
- Long Module Summarization
- RepoQA (repository-level QA)
Baselines: RAG (Sliding/Function), text compressors (LLMLingua, LongLLMLingua), code compressors (DietCode, SlimCode), and upper/lower bounds (No Compression, No Context).
RQ1: Compression & Performance
Table 2: LongCodeZip consistently achieves top ES/EM, sometimes exceeding No Compression—with far fewer tokens.
Highlights:
- On Long Code Completion, it sometimes beats the No Compression baseline. For example, Seed-Coder-8B drops just 0.93 ES points while using only ~18% of the tokens (5.6× compression).
- On RepoQA, removing irrelevant context made retrieval easier, sometimes surpassing No Compression with GPT-4o and Claude-3.7-Sonnet.
Table 3: Works equally well on top closed-source models.
RQ2: Ablation—What Matters Most?
Table 4: Coarse-grained perplexity ranking is the most critical component.
Swapping the AMI ranking for similarity-based retrieval caused a drop of roughly 8 ES points. Removing the fine-grained optimizations (adaptive budgeting, knapsack selection) hurt performance further, showing that each component adds value.
RQ3 & RQ4: Generalization & Efficiency
- Cross-Model: A tiny 0.5B model can handle compression, then pass the optimized context to a larger generator with near-identical results.
- Efficiency: A small compression overhead (~2.6s) leads to large downstream savings:
  - Generation time cut from 15.7s to 6.6s
  - 77% fewer input tokens, which translates to roughly 77% lower API cost
Figure 3: LongCodeZip maintains high ES even under extreme compression.
Case Study: Completing `execute_blind`
Figure 4: Perplexity spikes define semantic blocks; knapsack keeps the most important ones.
The function `evaluate_blind` is identified as relevant. Perplexity analysis segments it into blocks, and AMI scores prioritize the ones initializing `action` and `call_name`. Including these ensures the model correctly completes `execute_blind`.
Conclusion: Smarter Long Context for Code LLMs
LongCodeZip is a leap forward in handling large codebases with LLMs. Key contributions:
- Hierarchical Compression: Function filtering plus block pruning balances compression and semantic integrity.
- Perplexity-Driven Relevance: AMI captures deep dependencies missed by similarity methods.
- Practical & Plug-and-Play: Training-free, model-agnostic, quick to integrate.
By cutting context size, generation time, and API costs without sacrificing performance, LongCodeZip makes LLMs more viable for real-world engineering—especially when “the whole codebase” won’t fit.
Sometimes, less context really is more.