Code review is the backbone of high-quality software engineering. It’s the process where developers check each other’s work to spot bugs, ensure stylistic consistency, and verify that the code actually does what the commit message says it does.

However, if you have ever worked in a software team, you know the reality: code review is labor-intensive, time-consuming, and prone to human error.

Naturally, researchers have turned to Large Language Models (LLMs) to automate this. But there is a snag. Most existing AI tools treat code review as a simple “input-output” task—you feed in code, and the AI spits out a critique. This ignores a fundamental truth: Code review is an interactive, collaborative process. It involves understanding context, checking formatting against legacy files, and ensuring security—tasks that often require different “mindsets.”

In this post, we are diving deep into CodeAgent, a new framework proposed by researchers from the University of Luxembourg and other institutions. CodeAgent doesn’t just ask an LLM to “review this.” Instead, it spawns a digital team of autonomous agents—from a CEO to a specialized Reviewer—who talk to each other to produce a comprehensive review.

We will break down how this multi-agent system works, the mathematical “QA-Checker” that keeps them on track, and the results that suggest this might be the future of automated software maintenance.

The Problem: Why Single Agents Struggle

Before we look at the solution, we need to understand why a single model, whether a prompted LLM like ChatGPT or a fine-tuned model like CodeBERT, often fails at complex code reviews.

  1. Lack of Role Specialization: A single model tries to do everything at once—check syntax, logic, security, and formatting.
  2. Prompt Drifting: In multi-step reasoning (Chain-of-Thought), LLMs have a notorious habit of “drifting” away from the original question. They might start discussing a security vulnerability and end up hallucinating a fix that breaks the code style, forgetting the original constraint.
  3. Context Isolation: Automated tools often look at the code change in isolation, missing whether the new code matches the style of the original file or if the commit message accurately reflects the changes.

Enter CodeAgent: A Multi-Agent Framework

CodeAgent mimics a real-world software company. Instead of one AI doing all the work, the framework assigns specific “personas” or roles to different agents. These agents communicate, share information, and review each other’s outputs.

The Corporate Structure

As illustrated below, the system simulates a hierarchy involving six distinct characters.

Figure 1: A schematic diagram of the role data cards of the simulated code review team and their conversations within CodeAgent.

The roles are defined as follows (a minimal code sketch of these personas appears after the list):

  • User: The human submitting the Pull Request (PR).
  • CEO (Chief Executive Officer): Handles high-level decision-making and information synchronization.
  • CTO (Chief Technology Officer): Provides high-level technical insights and modality analysis (e.g., identifying the programming language).
  • CPO (Chief Product Officer): Helps summarize the findings into a final report.
  • Reviewer: The workhorse. This agent looks for specific issues like vulnerabilities or formatting errors.
  • Coder: The technical expert that implements suggested revisions and assists the reviewer.
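
To make the role split concrete, here is a minimal sketch of how such personas could be encoded as system prompts for separate agent instances. The `Agent` wrapper and the prompt texts are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch: CodeAgent-style roles encoded as system prompts.
# The Agent wrapper and prompt texts are illustrative assumptions,
# not the paper's actual implementation.
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    system_prompt: str

    def ask(self, message: str) -> str:
        # Placeholder: a real system would send `system_prompt` as the system
        # message and `message` as the user turn to an LLM API.
        raise NotImplementedError


ROLES = {
    "CEO": Agent("CEO", "You make high-level decisions and keep the team's information in sync."),
    "CTO": Agent("CTO", "You provide technical insight, e.g. identifying the programming language of a change."),
    "CPO": Agent("CPO", "You summarize the team's findings into a final report for the user."),
    "Reviewer": Agent("Reviewer", "You inspect code changes for consistency, vulnerabilities, and formatting issues."),
    "Coder": Agent("Coder", "You implement suggested revisions and assist the Reviewer."),
}
```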

The Pipeline

How do these agents work together? The researchers designed a four-phase “Waterfall” pipeline. It ensures that information is gathered before analysis begins, and analysis is completed before documentation is written. A minimal orchestration sketch follows the phase list below.

Figure 2: CodeAgent’s pipeline: a full conversation among the different roles during the code review process.

  1. Basic Info Sync: The CEO, CTO, and Coder analyze the input files to determine the language (e.g., Python, Java) and the nature of the request.
  2. Code Review: The Reviewer and Coder engage in a back-and-forth dialogue to identify issues. This includes:
     • Consistency Analysis (CA): Does the code do what the commit message says?
     • Vulnerability Analysis (VA): Are there security risks?
     • Format Analysis (FA): Is the style consistent with the original file?
  3. Code Alignment: Based on the review, the Coder suggests revisions to fix bugs or align formatting.
  4. Document: Finally, the CPO and CEO help synthesize the conversation into a readable report for the human user.
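
A rough sketch of how those four phases could be sequenced is shown below. The `roles` argument is expected to look like the hypothetical `ROLES` mapping sketched earlier, and the prompts are placeholders rather than the authors' actual wording.

```python
# Illustrative sequencing of the four-phase pipeline (not the authors' code).
# `roles` maps role names to agents exposing an ask(message) -> str method,
# e.g. the hypothetical ROLES mapping sketched above.

def review_pull_request(roles: dict, diff: str, commit_message: str, original_file: str) -> str:
    # Phase 1: Basic Info Sync -- determine language and the nature of the request.
    info = roles["CTO"].ask(f"What language is this change in, and what kind of change is it?\n{diff}")

    # Phase 2: Code Review -- Reviewer checks consistency (CA), vulnerabilities (VA), format (FA).
    findings = roles["Reviewer"].ask(
        f"Commit message:\n{commit_message}\n\nOriginal file:\n{original_file}\n\nDiff:\n{diff}\n\n"
        "Check whether the change matches the message, look for security risks, "
        "and flag formatting that diverges from the original file."
    )

    # Phase 3: Code Alignment -- Coder proposes concrete revisions for the issues found.
    revision = roles["Coder"].ask(f"Given these findings, suggest revisions:\n{findings}")

    # Phase 4: Document -- synthesize the conversation into a report for the human user.
    return roles["CPO"].ask(f"Write a readable review report from:\n{info}\n{findings}\n{revision}")
```

In practice, the Reviewer and Coder would exchange several turns in Phase 2 rather than a single call; the sketch compresses that dialogue into one prompt for brevity.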

The Secret Sauce: The QA-Checker

The most innovative part of CodeAgent isn’t just that it uses multiple agents—it’s how it keeps them from talking nonsense.

In multi-agent conversations, “Prompt Drifting” is a major issue. Agent A asks a question, Agent B answers slightly off-topic, Agent A responds to the off-topic part, and soon the review is useless.

To solve this, the researchers introduced the QA-Checker (Question-Answer Checker). This is a supervisory module that monitors the conversation flow.

Figure 3: The architecture of the designed Chain-of-Thought (CoT) mechanism, the Question-Answer Checker (QA-Checker).

How QA-Checker Works

The QA-Checker acts like a strict moderator. When an agent generates an answer (\(A_0\)) to a question (\(Q_0\)), the QA-Checker evaluates the quality. If the answer is irrelevant or drifts from the intent, the QA-Checker intervenes.

It generates an Additional Instruction (\(aai\)) and forces the agent to try again. The new question becomes a combination of the original question plus the correction.
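
In code, that intervention loop might look something like the sketch below. The scoring and instruction-generation steps are passed in as callables because the paper's exact implementations are not reproduced here; everything in this snippet is an illustrative assumption.

```python
# Sketch of the QA-Checker intervention loop (illustrative, not the paper's code).
from typing import Callable


def qa_checked_ask(
    agent,
    question: str,
    score_answer: Callable[[str, str], float],       # e.g. the quality function Q(Q, A)
    generate_instruction: Callable[[str, str], str],  # produces an additional instruction (aai)
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> str:
    """Ask `agent` a question, re-asking with corrective instructions until the
    answer scores above `threshold` or the round budget runs out."""
    q_i = question
    answer = agent.ask(q_i)
    for _ in range(max_rounds):
        if score_answer(question, answer) >= threshold:
            break  # answer judged relevant enough; no intervention needed
        # QA-Checker intervenes: fold the additional instruction into the question,
        # i.e. the next question is the original plus the correction.
        aai = generate_instruction(question, answer)
        q_i = f"{q_i}\n{aai}"
        answer = agent.ask(q_i)
    return answer
```

Capping the number of rounds keeps the conversation from looping indefinitely when an agent simply cannot satisfy the checker.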

The Mathematical Foundation

The researchers grounded this mechanism in optimization theory. They treat the conversation quality as a function \(\mathcal{Q}(Q, A)\) that needs to be maximized.

The update rule for the conversation is modeled on the Newton-Raphson method, a mathematical technique used to find the roots of a function (or in optimization, to find local maxima/minima).

In standard (damped) Newton form, the update looks like this:

\[ Q_{i+1} \;=\; Q_i \;-\; \alpha\, H^{-1} \nabla \mathcal{Q}(Q_i, A_i) \]

Here, \(\alpha\) is a learning rate (how big of a correction to make). The term involving \(H\) (the Hessian matrix) and \(\nabla\) (the gradient) represents the direction and magnitude of the “correction” needed to steer the answer back to relevance.
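
For readers less familiar with Newton-style steps, the correction can be motivated by a standard second-order Taylor expansion of \(\mathcal{Q}\) around the current question \(Q_i\), with the answer \(A_i\) held fixed (this derivation is generic, not specific to the paper):

\[ \mathcal{Q}(Q_i + \delta, A_i) \;\approx\; \mathcal{Q}(Q_i, A_i) + \nabla \mathcal{Q}(Q_i, A_i)^{\top}\delta + \tfrac{1}{2}\,\delta^{\top} H\, \delta \]

Setting the derivative with respect to \(\delta\) to zero gives the step \(\delta^{*} = -H^{-1}\nabla \mathcal{Q}(Q_i, A_i)\), which, scaled by the rate \(\alpha\), recovers the update above.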

The QA-Checker scores each answer on three metrics and combines them into a single quality score, shown here as a weighted sum:

\[ \mathcal{Q}(Q, A) \;=\; w_{1}\,\mathrm{Relevance}(Q, A) \;+\; w_{2}\,\mathrm{Specificity}(A) \;+\; w_{3}\,\mathrm{Coherence}(A) \]

  1. Relevance: The cosine similarity between the question vector and the answer vector.
  2. Specificity: A measure of how technical and detailed the answer is (penalizing vague responses).
  3. Coherence: How well the answer flows logically and grammatically.
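
As a rough illustration, such a score could be computed over embedding vectors as in the sketch below; the weights and the specificity and coherence inputs are assumptions for the sketch, not the paper's exact definitions.

```python
# Sketch of a combined quality score for a question/answer pair.
# The weights and the specificity/coherence heuristics are illustrative
# assumptions, not the paper's exact formulation.
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def quality_score(
    q_vec: np.ndarray,   # embedding of the question
    a_vec: np.ndarray,   # embedding of the answer
    specificity: float,  # e.g. density of technical detail in the answer, in [0, 1]
    coherence: float,    # e.g. a fluency / logical-flow score, in [0, 1]
    weights: tuple = (0.5, 0.25, 0.25),
) -> float:
    relevance = cosine_similarity(q_vec, a_vec)
    w_rel, w_spec, w_coh = weights
    return w_rel * relevance + w_spec * specificity + w_coh * coherence
```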

By applying this rigorous check at every turn of the conversation, CodeAgent ensures that the “Reviewer” and “Coder” agents don’t get distracted, leading to much higher accuracy.

Experimental Results

The researchers evaluated CodeAgent against state-of-the-art models (like CodeBERT, GPT-3.5, and GPT-4) and frameworks (ReAct and standard Chain-of-Thought). They tested on nine programming languages including Python, Java, Go, and C++.

1. Vulnerability Analysis (VA)

Detecting security flaws is perhaps the most critical task. The team ran CodeAgent on over 3,500 real-world code changes.

Table 2: The number of vulnerabilities found by CodeAgent and other approaches.

The results are striking.

  • Precision (Hit Rate): Look at the row Rate_cr (Confirmed Rate). CodeAgent achieved a 92.96% confirmation rate. This means that when CodeAgent said “this is a vulnerability,” it was almost always right.
  • Comparison: GPT-4 only had a 51.42% confirmation rate. CodeBERT was even lower at ~20%.
  • CodeAgent w/o QA-Checker: The last column shows the system without the QA-Checker. The performance drops significantly (from ~93% to ~73%), proving that the supervisory agent is essential.

The Venn diagram below shows the overlap in detection. CodeAgent found 449 confirmed vulnerabilities, covering almost all issues found by other models plus many unique ones.

Figure 4: Overlap of vulnerability detection by CodeBERT, GPT-3.5, GPT-4.0, and CodeAgent.

2. Consistency Analysis (CA)

Consistency Analysis checks if the commit message (e.g., “Fixed bug in login”) actually matches the code change. If the message says one thing but the code does another, it’s a “Negative” sample.
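
Framed as a single yes/no judgment, a consistency check could be sketched like this; the prompt wording and the `ask_llm` callable are assumptions, not CodeAgent's actual multi-agent dialogue.

```python
# Sketch of a consistency-analysis (CA) check: does the commit message describe
# what the diff actually does? Illustrative only; the prompt and the ask_llm
# callable are assumptions, not the paper's implementation.
from typing import Callable


def is_consistent(commit_message: str, diff: str, ask_llm: Callable[[str], str]) -> bool:
    prompt = (
        "Commit message:\n"
        f"{commit_message}\n\n"
        "Code change (diff):\n"
        f"{diff}\n\n"
        "Does the code change do what the commit message says? Answer YES or NO."
    )
    return ask_llm(prompt).strip().upper().startswith("YES")
```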

Table 3: Comparison of CodeAgent with other methods on merged and closed commits across the 9 languages on the CA task.

CodeAgent achieved an average Recall of 88.63% and F1-Score of 93.16%, consistently beating GPT-4 and ReAct methods. This suggests CodeAgent is much better at “understanding” the semantic link between natural language descriptions and code logic.

3. Format Analysis (FA)

This task checks if the new code follows the indentation and naming conventions of the existing file.
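
To give a feel for what a formatting check inspects, here is a toy heuristic that only compares indentation characters between the original file and the added lines; it is deliberately much simpler than what the Reviewer agent actually does.

```python
# Toy format-analysis (FA) heuristic: do added lines use the same indentation
# character (tabs vs. spaces) as the original file? Purely illustrative; the
# Reviewer agent reasons about far more than indentation.

def indentation_style(lines):
    for line in lines:
        if line.startswith("\t"):
            return "tabs"
        if line.startswith("    "):
            return "spaces"
    return None  # no indented lines found


def indentation_matches(original_file: str, added_lines) -> bool:
    original = indentation_style(original_file.splitlines())
    added = indentation_style(added_lines)
    # If either side has no detectable style, treat it as a match.
    return original is None or added is None or original == added
```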

Table 4: Comparison of CodeAgent with other methods on merged and closed commits across the 9 languages on the FA task.

Here, the gap is massive. CodeAgent improved Recall by nearly 16 percentage points over GPT-4. Standard LLMs often ignore minor formatting nuances, but the specialized “Reviewer” role in CodeAgent, constrained by the QA-Checker, pays attention to these details.

4. Code Revision (CR)

Finally, can CodeAgent actually fix the code? The researchers measured “Edit Progress” (EP), which calculates how much closer the AI’s suggested fix is to the correct solution compared to the buggy code.
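
The paper defines Edit Progress precisely; a reasonable intuition is a relative edit-distance improvement, which the sketch below implements with the standard library. Treat it as an assumption about the metric's spirit, not its exact formula.

```python
# Sketch of an "edit progress"-style metric: how much closer is the suggested
# fix to the ground-truth code than the buggy original was? The relative
# edit-distance reading is an illustrative assumption, not the paper's exact
# definition.
import difflib


def distance(a: str, b: str) -> float:
    # 1 - similarity ratio as a cheap stand-in for a normalized edit distance.
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()


def edit_progress(buggy: str, suggested_fix: str, ground_truth: str) -> float:
    before = distance(buggy, ground_truth)
    after = distance(suggested_fix, ground_truth)
    if before == 0.0:
        return 0.0  # nothing needed fixing
    # Positive: the fix moved closer to the target; negative: it made things worse.
    return (before - after) / before
```

A score of 1.0 means the suggestion matches the ground truth exactly, while a negative score means the model drifted further from it, which is how tools like Trans-Review can end up with negative progress.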

Table 5: Experimental results for the Code Revision (CR) task.

CodeAgent achieved the highest Edit Progress (31.6%) across varied datasets. Notably, other tools like Trans-Review sometimes resulted in negative progress (making the code worse), while CodeAgent remained robust.

Case Study: CodeAgent in Action

To visualize how this works, let’s look at an actual output from CodeAgent on a Python project.

Figure 13: Example in a Python project.

In this example:

  1. The Change: The developer changed dictionary keys from user_id to user and client_id to client. However, they also removed a line checking for auth.user_id.
  2. Semantic Consistency: CodeAgent correctly flags that the authentication check was removed, which wasn’t mentioned in the commit message (“rename client_id…”).
  3. Security Analysis: It flags potential risks regarding input validation.
  4. Format Analysis: It notices indentation issues.
  5. Suggestion: It provides actionable advice to fix the inconsistencies.

This depth of analysis—connecting the missing logic in the code to the commit message—is difficult for single-agent systems to achieve reliably.

Cost and Performance

The obvious trade-offs of a multi-agent system are cost and time: since multiple agents are “talking” to each other, it uses more tokens and takes longer than a single API call.

Figure 6: Execution time with CodeAgent across different languages.

As shown above, the execution time hovers between 250 and 450 seconds (roughly 4 to 7.5 minutes) per review. While slower than a 10-second GPT response, this timeframe is acceptable for a CI/CD pipeline, especially considering the depth of the review.

The researchers also noted that running CodeAgent-4 (based on GPT-4) costs significantly more ($0.122 per review) than CodeAgent-3.5 ($0.017 per review), though the performance benefits are substantial.

Conclusion

CodeAgent represents a shift in how we think about AI in software engineering. We are moving away from “smart autocomplete” tools toward autonomous digital employees.

By defining specific roles (CEO, CTO, CPO, Reviewer, Coder) and enforcing strict conversational quality through the mathematical QA-Checker, CodeAgent solves the “drifting” problem that plagues many LLM applications. It proves that a team of agents, supervised correctly, is greater than the sum of its parts.

For students and researchers entering this field, CodeAgent highlights a crucial lesson: Architecture matters as much as the underlying model. You don’t always need a smarter LLM; sometimes, you just need a better way for your agents to talk to each other.
