The promise of Large Language Models (LLMs) in software engineering is dazzling. You type a prompt, and the model spits out working code. For simple tasks—like writing a Fibonacci sequence or a basic SQL query—current models like GPT-4 are incredibly proficient.
However, the reality of professional software development is rarely that simple. Real-world coding involves intricate libraries (like TensorFlow or Pandas), complex logic, and specific data structures. When LLMs face these “complex code generation” tasks, they often hallucinate non-existent libraries, write code that runs but produces the wrong answer, or fail to handle edge cases.
In this post, we are diving deep into CoCoST (Automatic Complex Code Generation with Online Searching and Correctness Testing), a research paper that proposes a new framework to bridge the gap between “toy” coding problems and real-world development. CoCoST doesn’t just ask the LLM to guess the code; it forces the model to act like a human developer: planning, Googling for answers, writing tests, and debugging based on actual execution results.
The Problem: Why Simple Generation Fails
Before understanding the solution, we must understand why standard LLMs struggle with complex code. The researchers identify three primary challenges that prevent models from being truly autonomous developers:
- Static Knowledge Limits: LLMs are trained on offline data. If you need a specific function from a library that was updated last week, or if the problem requires a niche combination of functions not commonly found in the training set, the LLM is flying blind.
- Lack of Test Cases: In competitive programming datasets (like HumanEval), test cases are provided. In the real world, when you ask an AI to “analyze this dataset,” there are no pre-written tests to verify the output.
- Hidden Bugs: A common method for improving AI code is “Self-Debugging,” where the AI fixes code that crashes (throws an error). But in complex data science, code often runs without crashing yet produces the wrong result (e.g., incorrect tensor shapes or wrong statistical calculations). These “silent bugs” are the hardest to catch; a small example follows this list.
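To make the “silent bug” idea concrete, here is a tiny NumPy sketch of my own (not taken from the paper). Both versions run to completion without raising anything, but only one does what the task actually asked for:

```python
import numpy as np

# Hypothetical task: subtract each COLUMN's mean from a (rows, cols) matrix.
data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Buggy version: axis=1 takes ROW means instead of column means.
# Broadcasting still makes the shapes line up, so nothing crashes --
# the script finishes "successfully" with the wrong numbers.
centered_wrong = data - data.mean(axis=1, keepdims=True)

# Correct version: axis=0 takes column means, which is what the task asked for.
centered_right = data - data.mean(axis=0, keepdims=True)

print(centered_wrong)   # runs fine, silently wrong
print(centered_right)   # the intended result
```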
Figure 1 illustrates how a human developer tackles these issues compared to a standard model. A human doesn’t just guess; they search for documentation, write a draft, test it, and refine it.

As shown above, CoCoST is designed to mimic this exact human workflow.
The CoCoST Framework
The CoCoST framework operates in two distinct phases: Retrieval and Refinement.
The intuition is straightforward. Instead of relying solely on the weights of the neural network, the system actively retrieves information from the internet (StackOverflow, documentation) and then rigorously tests the generated code against generated inputs.

Let’s break down these two steps in detail.
Phase 1: Retrieval with Strategic Planning
In many Retrieval-Augmented Generation (RAG) systems, the model simply uses the user’s prompt to search a database. However, for complex coding problems, the user’s prompt might be too vague or too multifaceted to yield good search results.
CoCoST introduces a planning stage. Before writing a single line of code, the LLM analyzes the problem (\(D\)) and generates a step-by-step plan (\(P\)) and a set of search queries (\(Q\)).

In effect, the model identifies what it doesn’t know. If the plan involves a specific Pandas operation the model isn’t sure about, it generates a query like “pandas calculate value counts normalized.”
The system then executes these searches using a search engine (like Google) to retrieve URLs and extract relevant information (\(INFO\)).

Armed with this up-to-date, external information, the model generates the initial code (\(W_0\)). This effectively solves the “hallucination” problem by grounding the generation in real documentation.
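To ground this, here is a minimal sketch of how the retrieval phase could be wired up. The `call_llm` and `web_search` functions, the prompt wording, and the `QUERY:` convention are all assumptions of mine, not the paper’s actual implementation:

```python
# A minimal sketch of CoCoST's retrieval phase, under the assumptions above.

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around a chat-completion endpoint."""
    raise NotImplementedError

def web_search(query: str, top_k: int = 3) -> list[str]:
    """Hypothetical wrapper that returns text snippets from the top search hits."""
    raise NotImplementedError

def retrieval_phase(problem_d: str) -> str:
    # Step 1: plan before coding -- ask for a step-by-step plan P
    # and the search queries Q the model thinks it needs.
    planning = call_llm(
        "Read the problem below. Write a step-by-step plan, then list the web "
        "search queries you would need to fill gaps in your knowledge, one per "
        "line, each prefixed with QUERY:.\n\n"
        f"Problem:\n{problem_d}"
    )

    # Step 2: run each query against a search engine and collect snippets (INFO).
    queries = [line for line in planning.splitlines() if line.startswith("QUERY:")]
    info = []
    for q in queries:
        info.extend(web_search(q.removeprefix("QUERY:").strip()))

    # Step 3: generate the initial code W_0, grounded in the retrieved INFO.
    w0 = call_llm(
        f"Problem:\n{problem_d}\n\nPlan and queries:\n{planning}\n\n"
        f"Retrieved information:\n{''.join(info)}\n\n"
        "Write Python code that solves the problem."
    )
    return w0
```

The important design choice is the separation of planning from coding: the queries come out of the plan, so they target exactly the gaps the model is aware of, rather than just restating the user’s prompt.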

Phase 2: Refinement via Correctness Testing
This is where CoCoST differentiates itself from previous “self-repair” mechanisms. Most existing methods only fix code if it crashes (e.g., a TypeError). CoCoST cares about correctness.
But how do you verify correctness if you don’t have the ground truth? The authors introduce Test Case Generation.
Generating Test Cases
Since real-world problems don’t come with unit tests, CoCoST asks the LLM to generate inputs (\(I\)) based on the problem description.

Critically, the model only generates the inputs, not the expected outputs. It uses these inputs to run the generated code and observes what happens.
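A minimal sketch of that step, reusing the hypothetical `call_llm` wrapper from the earlier sketch (the prompt wording is my own, not the paper’s template):

```python
def generate_test_inputs(problem_d: str, n_cases: int = 3) -> list[str]:
    """Ask the model for input values only -- no expected outputs."""
    response = call_llm(
        f"Problem:\n{problem_d}\n\n"
        f"Write {n_cases} diverse test inputs for this problem as Python "
        "literals, one per line. Include edge cases such as empty data and "
        "missing values (NaN). Do NOT write expected outputs."
    )
    return [line for line in response.splitlines() if line.strip()]
```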
The Challenge of Complex Outputs: Serialization
In simple programming tasks, outputs are usually integers or strings. In complex data science tasks, outputs might be massive NumPy arrays, Pandas DataFrames, or even Matplotlib charts. An LLM cannot simply “read” a chart or a raw binary object.
CoCoST solves this via Serialization: it converts complex objects into text-based representations that the LLM can understand (a code sketch follows the list below).
- DataFrames/Arrays: Converted to strings with type info, shape, and truncated data previews.
- Images (Charts): Converted to Scalable Vector Graphics (SVG) code, which is text-based and readable by LLMs.
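The paper describes serialization at this conceptual level; the sketch below shows one plausible realization using Pandas and Matplotlib. The exact text format is my assumption:

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def serialize_dataframe(df: pd.DataFrame, max_rows: int = 5) -> str:
    """Flatten a DataFrame into text: type, shape, dtypes, and a truncated preview."""
    return (
        f"type=pandas.DataFrame shape={df.shape}\n"
        f"dtypes:\n{df.dtypes}\n"
        f"head({max_rows}):\n{df.head(max_rows).to_string()}"
    )

def serialize_figure(fig: plt.Figure) -> str:
    """Render a Matplotlib figure to SVG markup, which an LLM can read as text."""
    buf = io.StringIO()
    fig.savefig(buf, format="svg")
    return buf.getvalue()

# Quick usage check
df = pd.DataFrame({"a": [1, 2, None], "b": [4.0, 5.0, 6.0]})
print(serialize_dataframe(df))

fig, ax = plt.subplots()
ax.plot(df["b"])
print(serialize_figure(fig)[:200])  # first 200 characters of the SVG text
```

Everything the model needs to inspect, from shapes and dtypes to the SVG markup of a chart, ends up as plain text that can be pasted back into the next prompt.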

The Refinement Loop
The system executes the code (\(W\)) with the generated inputs (\(i\)). It captures both the execution result (\(o\)) and any errors (\(e\)).

The LLM then acts as a reviewer. It looks at the problem description, the code it wrote, and—crucially—the serialized output of that code.
If the code ran without error but produced a DataFrame of all zeros when it should have had values, the LLM detects this semantic error. It then refines the code (\(\widehat{W}_{j+1}\)) based on this insight. This loop continues until the code is correct or a limit is reached.
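Putting the pieces together, the refinement loop might look roughly like this. Executing the candidate in a subprocess and the `CORRECT` stopping convention are my simplifications; `call_llm` is the same hypothetical wrapper as before:

```python
import subprocess
import sys

def run_candidate(code_w: str, test_input: str) -> tuple[str, str]:
    """Execute candidate code in a subprocess; return (stdout, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", code_w],
        input=test_input,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.stdout, proc.stderr

def refine(problem_d: str, code_w: str, test_inputs: list[str], max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        feedback = []
        for i in test_inputs:
            out, err = run_candidate(code_w, i)  # o and e in the paper's notation
            feedback.append(f"input:\n{i}\noutput:\n{out}\nerrors:\n{err}")

        verdict = call_llm(
            f"Problem:\n{problem_d}\n\nCurrent code:\n{code_w}\n\n"
            "Execution feedback:\n" + "\n---\n".join(feedback) + "\n\n"
            "Does the output look correct for the problem? If yes, reply CORRECT. "
            "If not, reply with a fixed version of the code."
        )
        if verdict.strip() == "CORRECT":
            break
        code_w = verdict  # the refined candidate W_{j+1}
    return code_w
```

Note that the model judges correctness from the serialized execution feedback alone; no ground-truth answer is involved, which is exactly what makes this step applicable to real-world tasks without pre-written tests.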
Experimental Results
The researchers validated CoCoST on two challenging datasets: DS-1000 (Data Science code generation) and ClassEval (Object-Oriented class generation).
DS-1000 Performance
DS-1000 is notoriously difficult because it involves seven different Python libraries (NumPy, Pandas, TensorFlow, etc.) and requires intricate logic.
As seen in Table 1, CoCoST achieves a Pass@1 score of 71.71%, significantly outperforming the base GPT-4 model (64.47%) and the previous state-of-the-art method, Self-Evolve (66.23%).

The “Diff-Rewrite” column is particularly interesting. This represents perturbations where the problem description is rephrased. CoCoST shows massive resilience here (53.09% vs 33.95% for Self-Evolve), suggesting that the retrieval and testing mechanism makes the model much less sensitive to how a user asks a question.
ClassEval Performance
ClassEval tests the ability to generate entire classes, not just snippets. This requires maintaining internal state and method dependencies.

Table 2 shows even more dramatic improvements. CoCoST achieves a 46.3% Pass@1 rate at the class level, nearly doubling the performance of Reflexion (24.1%). This proves that the benefits of CoCoST extend beyond data science scripts to structured software engineering.
Breakdown by Library
When we look at specific libraries (Table 5), we see that CoCoST excels in areas where data structures are complex and “invisible” to standard text processing, such as Matplotlib (plotting) and TensorFlow/PyTorch (tensors).

The serialization of images (Matplotlib) and tensors allows the LLM to “see” that its plot is empty or its tensor shape is wrong, leading to successful refinement.
Case Studies: CoCoST in Action
To truly appreciate the framework, let’s look at two specific examples from the paper.
Example 1: Fixing Logic via Online Search
In this TensorFlow example (Figure 4), the user wants to calculate accuracy for a specific class in a multi-class dataset. The initial code guess is syntactically correct but functionally wrong—it doesn’t use the specific tensor_scatter_nd_update logic required.
By planning a search query (tensorflow tensor_scatter_nd_update usage), CoCoST retrieves a tutorial on how to use that specific function. The result is a correct implementation that matches the “ground truth” logic.
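For readers unfamiliar with the retrieved function, here is a minimal standalone usage of tf.tensor_scatter_nd_update. It is only an illustration of the API, not the paper’s generated solution:

```python
import tensorflow as tf

# Write new values into specific positions of a tensor.
base = tf.zeros([5], dtype=tf.int32)
indices = tf.constant([[1], [3]])   # positions to overwrite
updates = tf.constant([7, 9])       # values to write at those positions
result = tf.tensor_scatter_nd_update(base, indices, updates)
print(result.numpy())               # [0 7 0 9 0]
```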

Example 2: Catching Silent Bugs via Correctness Testing
Figure 3 illustrates the power of the refinement stage. The task is to compare two rows in a Pandas DataFrame. The initial code uses a simple comparison: row0 != row8.
In Python, NaN != NaN evaluates to True. This is a classic “gotcha” in data science: the code runs without errors, but the output is wrong because positions where both rows contain NaN are counted as differences.
Because CoCoST generates a test case containing NaN values and serializes the output, the LLM sees the unexpected result. It realizes the mistake and refines the code to use isnull() checks, correctly handling the missing data.
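Here is a tiny reproduction of that gotcha and the fix, using made-up data rather than the paper’s example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1.0, 1.0], "b": [np.nan, np.nan], "c": [2.0, 3.0]},
    index=[0, 8],
)
row0, row8 = df.loc[0], df.loc[8]

# Buggy check: NaN != NaN evaluates to True, so the shared missing value
# in column "b" is wrongly counted as a difference. Nothing crashes.
naive_diff = row0 != row8
print(naive_diff.sum())   # 2 -- but only column "c" actually differs

# Refined check: positions where both values are NaN are not differences.
robust_diff = (row0 != row8) & ~(row0.isnull() & row8.isnull())
print(robust_diff.sum())  # 1 -- only "c"
```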

Impact of Base Models
Does CoCoST work with any model? Table 3 compares using CoCoST with GPT-4 versus WizardCoder (an open-source coding model).

While GPT-4 sees massive gains, smaller models like WizardCoder struggle, sometimes even performing worse with the framework. This highlights a crucial finding: Retrieval and Refinement require strong reasoning capabilities. If the model cannot plan a good search query or understand the serialized feedback, the extra context just becomes noise.
Conclusion
CoCoST represents a significant step forward in automated code generation. By acknowledging that LLMs cannot simply “memorize” all complex coding logic, the authors built a system that emulates the behavior of a competent human developer.
The key takeaways for students and practitioners are:
- Search is Vital: Supplementing LLMs with live internet access (via planned queries) solves the knowledge cutoff and hallucination problems.
- Output Matters: Debugging shouldn’t just happen when code crashes. Serializing outputs allows models to fix semantic logic errors that are otherwise invisible.
- Self-Testing: We don’t need to wait for human-provided test cases. LLMs are capable of generating their own inputs to verify their own code.
As models like GPT-4 become faster and cheaper, frameworks like CoCoST will likely become the standard for IDE assistants, moving us from “code completion” to true “code generation.”