Imagine you are a software engineer joining a massive, legacy codebase for the first time. You are assigned a ticket: “Fix the bug where the user login times out on the dashboard.” Your first challenge isn’t fixing the code; it is finding where that code lives among thousands of files and tens of thousands of functions.

This “needle in a haystack” problem is exactly what AI coding agents face today. While Large Language Models (LLMs) are excellent at writing code, they struggle significantly with retrieval—locating the specific files and functions that need editing based on a natural language request.

In this post, we dive deep into CoRet, a dense retrieval model proposed in a recent research paper and designed specifically for code editing. By leveraging repository structure, call graph dependencies, and a specialized training objective, CoRet achieves a massive leap in performance over existing state-of-the-art models.

The Problem with Standard Retrievers

To understand why CoRet is necessary, we first need to look at how modern code retrieval usually works. Most systems use a “Dense Retrieval” approach:

  1. An encoder model (like BERT or CodeBERT) turns a user query (e.g., “fix login timeout”) into a vector.
  2. The same model turns code snippets into vectors.
  3. The system calculates the similarity (dot product) between the query vector and code vectors to find the best match.
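
To make that concrete, here is a minimal sketch of the scoring step using an off-the-shelf sentence encoder. The model name and the toy code chunks are placeholders for illustration, not anything from the paper:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any off-the-shelf encoder works for this illustration; it is NOT the paper's model.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "fix the bug where the user login times out on the dashboard"
chunks = [
    "auth/login.py\ndef process_login(user): ...",
    "tests/login_test.py\ndef test_login_timeout(): ...",
    "weather/db.py\ndef fetch_forecast(city): ...",
]

# Encode the query and the chunks into unit-normalized vectors, then rank by dot product.
q = model.encode(query, normalize_embeddings=True)
c = model.encode(chunks, normalize_embeddings=True)
scores = c @ q
for i in np.argsort(-scores):
    print(f"{scores[i]:.3f}  {chunks[i].splitlines()[0]}")
```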

The problem? Most pretrained models treat code just like flat text. They ignore the rich structural information inherent in software:

  • Repository Hierarchy: Files live in folders that provide context (auth/login.py is very different from tests/login_test.py).
  • Dependencies: Functions call other functions. Understanding process_login() requires knowing about validate_credentials() which it calls.

The researchers found that existing models, even those trained on code (like CodeSage or CodeBERT), perform poorly when tasked with repository-level retrieval for specific bug fixes. They often retrieve code that looks semantically similar but isn’t the actual location of the logic that needs changing.

The CoRet Solution: Anatomy of a Codebase

CoRet (Contextual Retriever) addresses these limitations by rethinking how we represent code and how we train the model to find it.

1. Atomic Chunking and Repository Structure

A standard file in a repository can be thousands of lines long, containing multiple classes and dozens of functions. Retrieving the whole file is too coarse; retrieving arbitrary 512-token chunks often cuts logic in half.

CoRet adopts a “semantic chunking” strategy. Instead of arbitrary lines, it parses the code into its “atoms”: functions, classes, and methods.

Figure 1: Top: the code repository before processing. Bottom: the code chunks after filtering and chunking.

As shown in Figure 1, the repository is broken down hierarchically. Notice two critical details in the bottom half of the image:

  1. Granularity: The file jedi.py is split into distinct logical units (the r2d2 function, the Jedi class).
  2. Context Preservation: Crucially, the file path is prepended to the code. The string knights/jedi.py becomes part of the embedding.

The authors found that including the file path is a massive signal for retrieval. In their experiments, removing the file path caused a significant drop in performance (which we will look at in the results section). The model effectively learns to “read” the directory structure to narrow down its search.
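
Here is a rough sketch of this kind of semantic chunking using Python's built-in ast module. The helper and the toy file contents are my own illustration, not the paper's implementation:

```python
import ast

def chunk_file(path: str, source: str) -> list[str]:
    """Split a Python file into function/class/method 'atoms', each prefixed with its file path."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            code = ast.get_source_segment(source, node)  # exact source text of this unit
            chunks.append(f"{path}\n{code}")             # prepend the repo-relative path
    return chunks

source = (
    "def r2d2():\n"
    "    return 'beep'\n"
    "\n"
    "class Jedi:\n"
    "    def lightsaber(self):\n"
    "        return lightsaber_on()\n"
)
for chunk in chunk_file("knights/jedi.py", source):
    print(chunk, end="\n---\n")
```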

2. The Power of Call Graphs

Code is rarely isolated. A function’s meaning is often defined by what it calls (downstream dependencies) or what calls it (upstream usage).

CoRet enriches the representation of a code chunk by including its Call Graph Context. Specifically, the authors focus on downstream neighbors—the functions that the current chunk invokes.

Figure 2: Given a function \(c_i\) and its downstream neighbour \(c_{out}\), the two strings are concatenated with a special separator token, and the model is fine-tuned on the result to obtain CoRet.

As illustrated in Figure 2, if the function lightsaber() (the target chunk \(c_i\)) calls lightsaber_on() (the neighbor \(c_{out}\)), the model doesn’t just see the code for lightsaber(). It sees a concatenated string:

lightsaber code + [DOWN] + lightsaber_on code

This allows the embedding to capture not just what the function is, but what it does in relation to the rest of the system.
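
Assembling that enriched representation might look like the sketch below; the [DOWN] separator follows the figure, while the call-graph dictionary and the chunk identifiers are hypothetical stand-ins for whatever static analysis produces them:

```python
# Hypothetical call graph: each chunk maps to the chunks it calls (its downstream neighbours).
CALL_GRAPH = {
    "knights/jedi.py::lightsaber": ["knights/jedi.py::lightsaber_on"],
}

# Chunks already carry their file path, as in the previous section.
CHUNKS = {
    "knights/jedi.py::lightsaber": "knights/jedi.py\ndef lightsaber(self):\n    return lightsaber_on()",
    "knights/jedi.py::lightsaber_on": "knights/jedi.py\ndef lightsaber_on():\n    return 'vrmmm'",
}

def with_downstream_context(chunk_id: str, sep: str = " [DOWN] ") -> str:
    """Concatenate a chunk with the code of the functions it calls."""
    text = CHUNKS[chunk_id]
    for callee in CALL_GRAPH.get(chunk_id, []):
        text += sep + CHUNKS[callee]
    return text

print(with_downstream_context("knights/jedi.py::lightsaber"))
```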

The Mathematical Core: A Better Loss Function

Perhaps the most significant contribution of the paper is the training objective. Standard contrastive learning (used to train most retrievers) typically uses “in-batch negatives.” This means if the model is looking at a query from Repository A, it treats code from Repository B (which happens to be in the same training batch) as the “wrong answer.”

This is far too easy: telling that a login function is unrelated to a weather app's database function requires no fine-grained understanding of the repository.

The hard part of code retrieval is distinguishing the actual login function from the login helper function, the login test function, or the logout function—all of which exist within the same repository.

CoRet uses a loss function explicitly designed for this “in-instance” retrieval.

\[
\mathcal{L} = -\log \frac{\exp\!\left(\mathbf{q}_i^{\top} \mathbf{c}^{*}\right)}{\Gamma}
\]

Equation 1: The loss function optimizing the likelihood of retrieving the correct chunk.

In this equation:

  • \(\mathbf{q}_i\) is the query embedding.
  • \(\mathbf{c}^*\) is the correct code chunk (positive sample).
  • \(\Gamma\) is the normalizing factor: the sum of exponentiated similarities between the query and every chunk in the repository.

This looks like a standard cross-entropy loss, but the magic lies in how they calculate the denominator \(\Gamma\). Since summing over an entire repository (10,000+ chunks) is computationally expensive, they approximate it using a set of hard negatives sampled from the same repository.

\[
\Gamma \approx \sum_{\mathbf{c} \in \{\mathbf{c}^{*}\} \cup \mathcal{B}} \exp\!\left(\mathbf{q}_i^{\top} \mathbf{c}\right)
\]

Equation 2: Approximating the normalizing factor using in-instance negatives.

Here, \(\mathcal{B}\) represents a random sample of “within-instance negatives.” By forcing the model to distinguish the correct chunk from other chunks in the same project, the model learns fine-grained distinctions rather than just general topic matching.
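
A minimal PyTorch sketch of that idea: score the query against the correct chunk plus a sample \(\mathcal{B}\) of negatives drawn from the same repository, then apply cross-entropy with the positive as the target. The dot-product similarity is an assumption, and temperature scaling and batching are omitted:

```python
import torch
import torch.nn.functional as F

def in_instance_loss(q: torch.Tensor, c_pos: torch.Tensor, c_negs: torch.Tensor) -> torch.Tensor:
    """q: (d,) query embedding; c_pos: (d,) correct chunk; c_negs: (|B|, d) within-instance negatives."""
    pos_sim = (q @ c_pos).unsqueeze(0)        # similarity to the correct chunk c*
    neg_sims = c_negs @ q                     # similarities to the sampled negatives B
    logits = torch.cat([pos_sim, neg_sims])   # the positive sits at index 0
    target = torch.zeros(1, dtype=torch.long) # cross-entropy == -log softmax probability of c*
    return F.cross_entropy(logits.unsqueeze(0), target)

# Toy usage: 8-dimensional embeddings and 16 in-instance negatives.
d = 8
loss = in_instance_loss(torch.randn(d), torch.randn(d), torch.randn(16, d))
print(loss.item())
```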

Experimental Results

The authors evaluated CoRet on SWE-bench, a benchmark based on real-world GitHub issues (bugs and feature requests). They compared CoRet against strong baselines like BM25 (keyword search) and pretrained models like CodeBERT and CodeSage.

The metric used is primarily Recall@k: if the model retrieves \(k\) chunks, is the correct one in the list?

\[
\text{Recall@}k = \frac{\big|\{\text{relevant chunks}\} \cap \{\text{top-}k\ \text{retrieved chunks}\}\big|}{\big|\{\text{relevant chunks}\}\big|}
\]

Definition of the Recall@k metric.

Massive Performance Gains

The results are striking. CoRet doesn’t just edge out the competition; it dominates it.

Figure 3: Recall@k for CoRet and the baseline retrievers; CoRet significantly outperforms every baseline.

In Figure 3, look at the brown dashed line (CoRet). It rises sharply, achieving over 50% recall at just \(k=5\), whereas the previous best model (CodeSage S, the red line) sits around 35%.

This is a critical improvement for AI agents. An agent has a limited context window; it can’t read 100 files. It needs the right code in the top 5 or 10 results.

The authors also introduced a strict metric called Perfect Recall, which measures how often the model retrieves all necessary chunks to fix a bug (since bugs often span multiple functions).
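
To see how the two metrics differ, here is a small helper that computes both for a single issue; this is my own formulation of the standard definitions, not code from the paper:

```python
def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of the gold (ground-truth) chunks that appear in the top-k retrieved chunks."""
    return len(set(retrieved[:k]) & gold) / len(gold)

def perfect_recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """1.0 only if *every* gold chunk appears in the top-k, else 0.0."""
    return 1.0 if gold <= set(retrieved[:k]) else 0.0

retrieved = ["auth/login.py::process_login", "auth/session.py::refresh", "auth/login.py::validate_credentials"]
gold = {"auth/login.py::process_login", "auth/login.py::validate_credentials"}
print(recall_at_k(retrieved, gold, k=2))          # 0.5 - only one of the two gold chunks found
print(perfect_recall_at_k(retrieved, gold, k=3))  # 1.0 - all gold chunks within the top 3
```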

Table 1: Perfect recall at chunk level.

Table 1 highlights that CoRet (bottom row) achieves a Perfect Recall@5 of 0.54, compared to 0.34 for the base CodeSage model. This is a 20 percentage point increase in the ability to find every piece of code needed to solve a problem.

Why Does It Work? Ablation Studies

The authors systematically removed features to see what mattered most.

1. The importance of File Paths: When the file path was removed from the input, performance dropped significantly.

Table 2: Comparing CoRet with and without file paths.

Table 2 shows that Recall@5 drops from 0.53 to 0.42 without file paths. This confirms that the directory structure provides essential semantic clues.

2. The importance of “In-Instance” Negatives: One of the most interesting findings is the impact of the negative sampling strategy during training.

Figure 5: The influence of number of negatives and their source.

Figure 5 shows that using “In-instance” negatives (purple/green lines) consistently outperforms “Across-instance” negatives (orange line). Furthermore, simply increasing the number of these negatives (from 8 to 1024) drives performance higher. The harder the training task (distinguishing between more chunks in the same repo), the smarter the retriever becomes.

Visualizing Attention

To prove the model is actually using the file paths and not just getting lucky, the authors visualized the attention maps of the Transformer model.

Figure 6: Attention map for a query containing a file path; the model focuses on the file-path tokens.

Figure 7: Attention map for a code chunk containing a file path.

These heatmaps (Figures 6 and 7) show strong activation (lighter colors) around the tokens representing file paths. The model has effectively learned to perform a “fuzzy match” between the file paths mentioned or implied in the issue description and the actual files in the repository.
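
If you want to reproduce this kind of inspection with your own encoder, the sketch below pulls attention weights out of a Hugging Face model and plots them. CodeBERT serves purely as a stand-in checkpoint here; it is not the trained CoRet model:

```python
import torch
import matplotlib.pyplot as plt
from transformers import AutoModel, AutoTokenizer

name = "microsoft/codebert-base"  # stand-in encoder, not the paper's fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

text = "knights/jedi.py\ndef lightsaber(self):\n    return lightsaber_on()"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Average the last layer's attention over all heads: shape (seq_len, seq_len).
attn = outputs.attentions[-1][0].mean(dim=0).numpy()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90, fontsize=6)
plt.yticks(range(len(tokens)), tokens, fontsize=6)
plt.title("Token-to-token attention (last layer, head-averaged)")
plt.tight_layout()
plt.show()
```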

Conclusion and Implications

CoRet represents a significant step forward in “Repository-Level” understanding. It moves us away from treating code as a bag of words and towards treating it as a structured, interconnected graph.

Key Takeaways:

  1. Structure is Signal: Including file paths and call graph neighbors transforms retrieval performance.
  2. Training Context Matters: Training a model to distinguish between files in the same repository is far more effective than standard contrastive learning methods for this task.
  3. Agents need help: For AI software engineers to be effective, they need reliable retrieval. CoRet provides the “eyes” that allow these agents to locate bugs efficiently.

As AI coding assistants evolve from simple autocomplete to autonomous agents that can navigate and fix complex repositories, specialized retrieval systems like CoRet will be the backbone of their success. The ability to find the needle in the haystack is, after all, the first step to threading it.