If you are a software developer, your browser history is likely filled with searches like “how to reverse a list in Python” or “pandas dataframe drop duplicates.” This process—finding the right code snippet based on a natural language description—is known as Code Retrieval.

For AI models to get good at this, they need massive amounts of training data consisting of pairs: a user query (Natural Language) and the corresponding code snippet (Programming Language). But where does this data come from?

Historically, we’ve had two choices:

  1. Scrape GitHub: We take code functions and treat their docstrings (the documentation strings embedded in the code) as queries. This is scalable but inaccurate, because a formal docstring looks nothing like a messy user search query.
  2. Scrape Stack Overflow: We take user questions and accepted answers. The queries are realistic, but the code quality varies wildly.

We need a middle ground: high-quality code from repositories paired with realistic, user-like queries. In the research paper “Optimizing Code Retrieval: High-Quality and Scalable Dataset Annotation through Large Language Models,” researchers from the University of Science and Technology of China propose a novel solution. They developed a method to use Large Language Models (LLMs) to generate synthetic, high-quality queries for existing code, resulting in a new dataset called Query4Code.

In this post, we will explore why LLMs struggle to read code in isolation, how the researchers solved the “missing context” problem using graph theory, and how this synthetic dataset outperforms traditional data sources.

The Mismatch: Docstrings vs. Queries

To understand the core problem, we first need to look at the data models are currently trained on. Most open-source datasets rely on docstrings.

A docstring is a formal explanation of a function, often detailing parameters and return types. A query, however, is a question or a statement of intent.

Figure 1: Example of code snippet and corresponding query and docstring.

As shown in Figure 1, there is a semantic gap. The docstring describes how the function works (“Export content from a Jupyter notebook…”), while the query asks what the user wants to achieve (“How to export the content…”). To build a search engine that understands users, we need data that looks like the “Query” section, not just the “Docstring” section.
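To make the gap concrete, here is a small, made-up example (not the snippet from Figure 1): the same function described once in docstring style and once as the query a developer would actually type.

    import pandas as pd

    def remove_duplicate_rows(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
        """Remove duplicate rows from a DataFrame.

        :param df: the DataFrame to deduplicate.
        :param columns: column names used to identify duplicates.
        :return: a new DataFrame with duplicate rows dropped.
        """
        return df.drop_duplicates(subset=columns)

    # Docstring-style text a model is typically trained on:
    #   "Remove duplicate rows from a DataFrame. :param df: ..."
    # What a real user would actually type into a search box:
    #   "pandas drop duplicate rows based on specific columns"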

The researchers asked: Can we just ask an LLM (like GPT-3.5 or CodeLlama) to look at the code and write a user query for us?

The answer is yes, but with major caveats. LLMs are powerful, but they are not omniscient. If you feed an LLM a single function in isolation, it often fails to understand what that function actually does.

The Challenge: The Context Gap

The researchers performed a preliminary analysis to understand why LLMs fail to annotate code correctly. They identified two main culprits: Intra-repository function calls and Third-party API calls.

1. Intra-Repository Function Calls

Real-world code is modular. Function A calls Function B, which calls Function C. If you ask an LLM to explain Function A, but you don’t show it the code for Function B, the LLM has to guess.
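Here is a minimal, invented illustration of why that guessing is hard: without the helper's body, even a human annotator cannot say what the top-level function really does.

    def process(records):
        # Without seeing _normalize(), an annotator (human or LLM) cannot tell
        # whether this lower-cases strings, rescales numbers, or deduplicates
        # entries; the name alone is ambiguous.
        return [_normalize(r) for r in records]

    def _normalize(record):
        # The helper reveals the actual behavior: strip whitespace and lower-case.
        return record.strip().lower()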

Table 1: Statistics on the number and proportion of calls to intra-repository and third-party library APIs.

Table 1 shows the scale of this issue. In the repositories analyzed, nearly half (46.5%) of the functions involved calls to other functions within the same repository.

The impact of this missing context is massive. The researchers tested LLMs (GPT-3.5 and CodeLlama) on code with and without the context of called functions.

Figure 2: The impact of calls within repositories of varying quantities on the quality of query annotations.

Figure 2 reveals the results. The blue and orange bars represent the quality score of the generated queries. Notice that when the context (the code of the called functions) is included (w/ Context), the quality score jumps significantly. The more complex the call chain (X-axis), the more the model relies on that extra context.

2. Third-Party API Calls

Modern coding involves gluing together libraries (like NumPy, PyTorch, or obscure utility libraries). While LLMs have seen popular libraries during pre-training, they struggle with niche or unpopular APIs.

Figure 4: The impact of third-party APIs with Different Popularity Levels on LLM Understanding.

Figure 4 illustrates the “popularity bias.” The X-axis represents the popularity of a library. The Y-axis is the LLM’s understanding score. As you can see, for low-popularity APIs (the left side of the chart), the models perform poorly. They simply don’t know what random_niche_lib.do_thing() actually does.

The Solution: A Context-Aware Annotation Pipeline

Based on these findings, the researchers proposed a sophisticated pipeline to generate the Query4Code dataset. They didn’t just throw code at ChatGPT; they engineered a system to provide the exact context required.

Figure 3: The overview of our annotation method. (a) Files in the repository. (b) Function call graph obtained from parsing. (c) API calls obtained from parsing and their corresponding popularity. (d) Construct annotated context based on call relationships and current API calls. (e) Pipeline for annotation method.

Figure 3 provides a comprehensive view of their method. Let’s break down the key innovations.

Step 1: Parsing and The Call Graph

First, they treat the repository as a network, not a list of files. They parse the code to build a Function Call Graph (Figure 3b). This graph maps out exactly which function calls which.
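The paper does not ship its parser, but a rough sketch of the idea using Python's standard ast module might look like this (calls are matched by name only, which is a simplification of what a real parser would do):

    import ast

    def build_call_graph(source: str) -> dict[str, set[str]]:
        """Map each function name to the names of the functions it calls."""
        tree = ast.parse(source)
        graph: dict[str, set[str]] = {}
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                graph[node.name] = {
                    call.func.id
                    for call in ast.walk(node)
                    if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
                }
        return graph

    example = "def load_data(path): ...\ndef train(path):\n    return load_data(path)\n"
    print(build_call_graph(example))  # {'load_data': set(), 'train': {'load_data'}}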

Step 2: Topological Sorting for Context

This is the clever part. If Function Train calls Function LoadData, you can’t understand Train until you understand LoadData.

The researchers use Topological Sorting (Algorithm 1 in the paper) to determine the order of annotation. They start with the “leaf” nodes—functions that don’t call anything else within the repo. The LLM annotates those first.

When it’s time to annotate a higher-level function (like Train), the system retrieves the summaries of the dependencies (LoadData) generated in the previous step. This turns a complex, multi-level reasoning problem into a single-level problem.
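A minimal sketch of that ordering, using Python's standard graphlib module rather than the paper's own Algorithm 1: leaf functions are summarized first, and each caller's prompt is then assembled from the summaries of its callees.

    from graphlib import TopologicalSorter

    # Call graph: each function maps to the functions it calls (its dependencies).
    call_graph = {
        "train": {"load_data", "build_model"},
        "load_data": {"read_csv_file"},
        "build_model": set(),
        "read_csv_file": set(),
    }

    summaries: dict[str, str] = {}

    # static_order() yields dependencies before dependents, so the leaf
    # functions ("read_csv_file", "build_model") are annotated first.
    for func in TopologicalSorter(call_graph).static_order():
        context = {callee: summaries[callee] for callee in call_graph[func]}
        # Placeholder for the real LLM call: the prompt would contain the
        # function's source plus the summaries of everything it calls.
        summaries[func] = f"summary of {func} (context: {sorted(context)})"

    print(summaries["train"])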

Step 3: Handling Unpopular APIs

For third-party libraries, the system checks the popularity of the API calls (Figure 3c).

  • High Popularity: The system assumes the LLM knows it (e.g., torch.mean).
  • Low Popularity: The system uses a search engine (DuckDuckGo) to scrape the official documentation for that specific API. This documentation is fed into the LLM’s context window.
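The search and scraping components are not published, so the sketch below stubs them out with a hypothetical fetch_official_docs() helper and an assumed popularity table; only the threshold logic is the point here.

    # Hypothetical popularity scores, e.g. derived from download counts or how
    # often an API appears in public code (the paper's exact metric may differ).
    API_POPULARITY = {
        "torch.mean": 0.95,
        "random_niche_lib.do_thing": 0.02,
    }

    POPULARITY_THRESHOLD = 0.5  # assumed cutoff, not a value from the paper

    def fetch_official_docs(api_name: str) -> str:
        """Hypothetical helper: search the web (e.g. via DuckDuckGo) for the
        API's official documentation and return a short excerpt."""
        return f"<documentation excerpt for {api_name}>"

    def build_api_context(api_calls: list[str]) -> str:
        """Fetch documentation only for APIs the LLM is unlikely to know."""
        snippets = []
        for api in api_calls:
            if API_POPULARITY.get(api, 0.0) < POPULARITY_THRESHOLD:
                snippets.append(f"{api}:\n{fetch_official_docs(api)}")
        return "\n\n".join(snippets)

    print(build_api_context(["torch.mean", "random_niche_lib.do_thing"]))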

Step 4: Two-Stage Generation

Instead of asking for a query directly, the researchers decompose the task into two steps to reduce cognitive load on the LLM.

  1. Summarization: The LLM first generates a code summary (\(s\)) based on the code (\(c\)).
  2. Query Generation: The LLM then generates the user query (\(q\)) based on both the code and the summary.

Equation 1: \( s = \mathrm{LLM}(c), \qquad q = \mathrm{LLM}(c, s) \)

This “Chain of Thought” approach ensures the model fully “understands” the code before attempting to write a search query for it.
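A minimal sketch of the two-stage prompting, with a hypothetical call_llm() standing in for whatever chat-completion client is used (the paper's actual prompts are more elaborate):

    def call_llm(prompt: str) -> str:
        """Hypothetical stand-in for a chat-completion API call."""
        return f"<LLM output for: {prompt[:40]}...>"

    def annotate(code: str, dependency_summaries: str, api_docs: str) -> tuple[str, str]:
        # Stage 1: summarize the code, given the context gathered earlier.
        summary = call_llm(
            "Summarize what this function does.\n"
            f"Called-function summaries:\n{dependency_summaries}\n"
            f"API documentation:\n{api_docs}\n"
            f"Code:\n{code}"
        )
        # Stage 2: turn code plus summary into a realistic user search query.
        query = call_llm(
            "Write the search query a developer would type to find this code.\n"
            f"Code:\n{code}\nSummary:\n{summary}"
        )
        return summary, query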

Step 5: Reverse Validation (The Filter)

LLMs hallucinate. To ensure quality, the system includes a self-correction mechanism. After generating a query, the system asks the LLM to reverse roles: “Given this query, does this code snippet actually answer it?”

Equation 2: \( f(q, c) = \mathrm{LLM}(q, c), \qquad f(q, c) \in \{0, 1, 2, 3\} \)

The model assigns a score (\(f(q,c)\)) from 0 to 3.

Equation 3: \( C_{filtered} = \{\, (q, c) \mid f(q, c) \in \{1, 2\} \,\} \)

Only pairs that receive a score of 1 or 2 (meaning the code satisfies the query requirements) are kept in the final dataset (\(C_{filtered}\)). This filters out noise and irrelevant code.
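In code, the filtering stage might look like the sketch below, where score_with_llm() is a hypothetical helper that wraps the reverse-validation prompt and parses the 0-to-3 score out of the model's reply.

    def score_with_llm(query: str, code: str) -> int:
        """Hypothetical helper: ask the LLM whether `code` answers `query`
        and parse its answer into a score from 0 to 3."""
        return 2  # stubbed for illustration

    def filter_pairs(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
        # Keep only pairs the LLM itself judges as satisfying the query (score 1 or 2).
        return [(q, c) for q, c in pairs if score_with_llm(q, c) in (1, 2)]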

The Result: Query4Code

Using this pipeline on high-quality Python repositories from GitHub, the authors created Query4Code, containing 237.2k query-code pairs.

To prove this dataset is better than what we had before, they ran extensive experiments. They pre-trained standard models (like CodeBERT and UniXcoder) on Query4Code and compared them to models trained on CodeSearchNet (CSN).

Zero-Shot Performance

They tested the models on real-world benchmarks without any fine-tuning (Zero-Shot). This tests how well the model generalizes.

Table 3: Comparison of the zero-shot and fine-tuning performance of code representation models pre-trained on the CodeSearchNet (CSN) and Query4Code (Q4C) datasets.

As shown in Table 3, the models trained on Query4Code (Q4C) consistently outperform those trained on CodeSearchNet (CSN).

  • CoSQA and WebQueryTest are datasets derived from actual search engine logs. The performance boost here is significant (e.g., CodeBERT jumps from 56.34 to 59.80 on CoSQA).
  • This confirms that the synthetic queries generated by the LLM are much closer to real-world user search behavior than standard docstrings.

Fine-Tuning Performance

The researchers also fine-tuned the models. Even when the models are allowed to learn specifically for the target task, starting with Query4Code pre-training provides a better foundation, leading to higher final scores across the board (refer to the “Fine-Tuning” section in Table 3).

Human Evaluation

Automated metrics are great, but do humans agree with the dataset quality? The researchers asked human experts to rate the generated pairs.

Table 5: Results of human evaluation.

Table 5 shows the correlation (\(r\) and \(\tau\)) between human scores and the model’s self-validation scores. The positive correlation indicates that the automated filtering mechanism aligns well with human judgment. The high consistency score among experts (0.858) confirms the evaluation was reliable.

Why This Matters: Cost and Scalability

You might think, “Using GPT-4 to annotate code sounds expensive.” However, the authors performed a cost analysis.

  • Crowdsourcing: Paying humans to annotate query-code pairs costs roughly $0.20 per pair and takes minutes of human time per annotation.
  • Query4Code Method: Using GPT-3.5-turbo costs between $0.001 and $0.004 per pair.

This is nearly 100x cheaper than human annotation, while offering scalability that humans cannot match.
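As a rough back-of-the-envelope check (taking $0.002 per pair as a midpoint of the quoted GPT-3.5-turbo range), annotating all 237.2k pairs would cost on the order of \(237{,}200 \times \$0.002 \approx \$475\), versus roughly \(237{,}200 \times \$0.20 \approx \$47{,}000\) for crowdsourced annotation.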

Case Study: A Real Example

Let’s look at a concrete example from the dataset to see the difference between a docstring and a generated query.

Figure 5: Example of code snippet with docstring and annotated query.

In Figure 5, the function escape_shell_arg has a docstring that describes parameters and types (“@param shell_arg”, “@type shell_arg: string”).

  • Docstring: Technical, verbose, parameter-focused.
  • Generated Query: “Python code for shell argument escaping with single quotes.”
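For illustration, such a function plausibly looks like the sketch below (the paper's actual snippet appears in Figure 5; this reconstruction is not quoted from it).

    def escape_shell_arg(shell_arg):
        """
        @param shell_arg: the string to make safe for use as a shell argument
        @type shell_arg: string
        @return: the argument wrapped in single quotes, with any embedded
            single quotes escaped
        """
        return "'" + shell_arg.replace("'", "'\\''") + "'"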

The generated query captures the intent perfectly. A user searching for this function wouldn’t type “param shell_arg string”; they would type exactly what the LLM generated. This slight shift in language style is exactly what modern code retrieval models have been missing.

Conclusion

The Query4Code project demonstrates that we don’t necessarily need to scrape more data to build better AI tools; sometimes, we just need better data. By leveraging the reasoning capabilities of LLMs—and critically, by supporting them with the right context (call graphs and API docs)—we can synthesize training datasets that are superior to organic data scraped from the web.

This method bridges the gap between the messy reality of code repositories and the natural language needs of developers. While this study focused on Python, the topological sorting and API documentation approach could easily be applied to Java, C++, or TypeScript, paving the way for smarter, more context-aware coding assistants for everyone.