If you are studying software security or machine learning, you have likely noticed the explosion of interest in Large Language Models (LLMs). We know LLMs can write code, explain algorithms, and even translate languages. But can they act as security auditors? Can they look at a piece of code and tell you, “Hey, there’s a dangerous SQL injection right here”?

The short answer is yes, but with major caveats. While there has been significant research into using Deep Learning for vulnerability detection in languages like C and C++, the web’s most dominant language—PHP—has been largely left behind. This is a critical gap. PHP powers nearly 80% of the top ten million websites, including giants like WordPress and Wikipedia.

Today, we are doing a deep dive into a research paper titled “RealVul: Can We Detect Vulnerabilities in Web Applications with LLM?”. This paper proposes a novel framework called RealVul that addresses the specific challenges of detecting vulnerabilities in PHP.

In this post, we will explore why traditional datasets fail for PHP, how the RealVul framework intelligently “slices” code to find bugs, and how it uses synthetic data to train models that outperform traditional tools.

The Problem with Existing Datasets

To train an AI model to find vulnerabilities, you need a dataset. You need thousands of examples of “vulnerable code” and “secure code.”

Historically, researchers have built these datasets using a method called Vulnerability Repair. They look at open-source repositories (like GitHub), find commit messages that mention “fix vulnerability,” and then compare the code before the fix (vulnerable) and after the fix (secure).

It sounds logical, but in practice, it is noisy and often inaccurate.

The “Diff” Dilemma

When a developer fixes a vulnerability, they might change a configuration file, add a library, or modify a function in a completely different file from where the vulnerability is actually triggered.

Take a look at Figure 1 below. It illustrates a scenario where a vulnerability exists in a system involving three files: page.php, lib.php, and conf.php.

Figure 1: In the case of using vulnerability repair to build a dataset, the green part will be considered secure, and the red part will be considered vulnerable.

In this example, the actual fix happens in conf.php (adding a regex filter). However, the “Vulnerability Repair” collection method might look at all files involved in the commit. It might label the old version of page.php as “vulnerable” and the new version as “fixed,” even though the code in page.php did not change at all.

This confuses the model. It is being told that two identical pieces of code have different labels. Furthermore, fixing a bug doesn’t just mean changing the vulnerable line; it often involves context that simple “diffs” miss. This noise makes it incredibly hard for an LLM to learn what an actual vulnerability looks like.

Enter RealVul: A Snippet-Level Framework

To solve this, the researchers developed RealVul. Instead of relying on messy commit diffs, RealVul uses static analysis to identify specific “trigger points” in the code and extracts only the relevant slice of code needed to verify the bug.

This approach shifts the focus from “files” or “functions” to code snippets—precise sequences of logic that lead from user input to a potential security violation.

Here is the high-level architecture of the RealVul system:

Figure 2: RealVul architecture overview.

The workflow consists of four main stages:

  1. Vulnerability Candidate Detection: Identifying potential danger zones.
  2. Data Preprocessing: Cleaning and normalizing the code.
  3. Data Synthesis: Artificially creating more training data.
  4. Model Training: Fine-tuning LLMs (like CodeLlama or StarCoder) to recognize these patterns.

Let’s break these down step-by-step.

1. Vulnerability Candidate Detection & Slicing

This is the most technically interesting part of the framework. You cannot simply feed an entire 2,000-line PHP file into an LLM and ask, “Is there a bug?” The context window is limited, and the noise is too high. You need to isolate the specific logic that matters.

RealVul uses Program Slicing based on Taint Analysis.

In security terms, a “source” is where untrusted data enters the application (e.g., $_GET['id']), and a “sink” is where that data is used dangerously (e.g., mysql_query()). A vulnerability exists if data flows from a source to a sink without being sanitized.
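To make this concrete, here is a minimal, hypothetical PHP fragment (not taken from the paper) showing an unsanitized source-to-sink flow and a sanitized one; $conn is assumed to be an open mysqli connection:

```php
<?php
// Source: untrusted input enters the application.
$id = $_GET['id'];

// Vulnerable: the tainted value is concatenated directly into the query string
// that reaches the sink.
$result = mysqli_query($conn, "SELECT * FROM users WHERE id = " . $id);

// Secure variant: sanitizing the input (here, casting to int) breaks the
// source-to-sink taint path before the data reaches the sink.
$safe_id = (int) $_GET['id'];
$result = mysqli_query($conn, "SELECT * FROM users WHERE id = " . $safe_id);
?>
```

Taint analysis asks exactly this question: is there any path from the first assignment to the query call that does not pass through a sanitizer?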

RealVul automates this discovery process:

  1. Identify Triggers: It scans the source code for potential “sinks.” For an XSS vulnerability, this might be an echo statement. For SQL injection, it looks for variable concatenation inside SQL strings.
  2. Abstract Syntax Tree (AST) Analysis: It converts the PHP code into a tree structure (AST) to understand the relationships between different lines of code.
  3. Flow Analysis: It traces the variables used in the sink backwards. Where did they come from? It extracts only the statements that affect those variables.

Figure 3 illustrates this process beautifully.

Figure 3: The process of vulnerability candidate detection from a real-world PHP project. We identify potential vulnerability triggers and analyze the data flow and control flow through the source file’s AST. The obtained code snippets are our samples.

On the left, you see a full PHP source file. It has HTML, database connections, and logic. RealVul parses this into an AST (center), identifies the data flow (red arrows), and extracts the Code Snippet (right).

Notice how the final snippet removes the database connection setup, the HTML headers, and other irrelevant lines. It keeps only the logic where $_GET data is assigned to variables and eventually echoed out or used in a query. This “distilled” code is what the LLM will analyze.
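As a rough, hypothetical illustration of that distillation (not the actual file from Figure 3), slicing keeps only the statements on the source-to-sink path:

```php
<?php
// --- Original file (abridged, hypothetical) ---
require 'config.php';                                   // irrelevant to the sink
$conn = mysqli_connect($host, $user, $pass, $dbname);   // database setup, irrelevant
echo '<html><head><title>Profile</title></head>';       // constant HTML, irrelevant
$name = $_GET['name'];                                   // source
$greeting = 'Hello, ' . $name;
echo '<div class="menu">...</div>';                      // constant HTML, irrelevant
echo $greeting;                                           // sink (potential XSS)

// --- Extracted snippet handed to the model ---
// $name = $_GET['name'];
// $greeting = 'Hello, ' . $name;
// echo $greeting;
?>
```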

2. Data Preprocessing and Normalization

Once the snippets are extracted, they are still “raw.” RealVul performs several cleaning steps to make the data even easier for the model to learn:

  • Labeling: Since the extraction follows a specific path to a “sink,” the system can label the snippet based on whether the data was sanitized (Secure) or not (Vulnerable).
  • Normalization (see the before/after sketch just after this list):
      • Removing Constant Strings: Web applications are full of HTML strings (e.g., <div class="menu">). These don’t affect the logic of a vulnerability, so RealVul removes them to reduce noise.
      • Renaming Variables: A variable named $user_id behaves the same as $uid. To prevent the model from overfitting on specific variable names, RealVul maps variables to generic names like $var1, $var2, etc. (though it keeps function names, as they carry semantic meaning).
  • Deduplication: It removes identical or highly similar snippets to prevent the model from memorizing duplicates.
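Here is the before/after sketch promised above, using a hypothetical snippet; the generic $var naming follows the paper’s description, while the exact output shape is illustrative:

```php
<?php
// Before normalization
$user_id = $_GET['id'];          // source with a meaningful name
echo '<div class="profile">';    // constant HTML string
echo 'User: ' . $user_id;        // sink

// After normalization: constant strings removed, variables renamed to
// generic placeholders; function names would be preserved as-is.
$var1 = $_GET['id'];
echo $var1;
?>
```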

3. Data Synthesis

One of the biggest hurdles in training AI for security is the lack of “vulnerable” samples. In most open-source projects, 99.9% of the code is secure. If you train on that, the model just learns to say “Secure” all the time.

To fix this imbalance, RealVul employs Data Synthesis. It takes “pure” vulnerability patterns (short, clear examples of bugs) and injects them into complex, real-world “clean” functions.

This creates a Semi-Synthetic Dataset. It has the complexity of real-world code (complex control flows, weird logic) but contains a known vulnerability. This allows the researchers to scale up their training data significantly, ensuring the model sees enough examples of what not to do.
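A rough sketch of the idea, using hypothetical code (the paper’s injection procedure is more careful about wiring the pattern into the surrounding data flow); format_row() is an assumed helper from the “clean” project:

```php
<?php
// A "clean" real-world function from an ordinary project.
function render_report($rows) {
    $out = '';
    foreach ($rows as $row) {
        $out .= format_row($row);
    }
    return $out;
}

// The same function with a short, known-vulnerable pattern spliced in,
// producing a semi-synthetic training sample labeled "vulnerable".
function render_report_synthetic($rows) {
    $out = '';
    // Injected pattern: unsanitized source flowing straight to an output sink.
    $filter = $_GET['filter'];
    echo 'Filter: ' . $filter;
    foreach ($rows as $row) {
        $out .= format_row($row);
    }
    return $out;
}
?>
```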

Experimental Results

The researchers evaluated RealVul by fine-tuning several state-of-the-art Code LLMs, including CodeLlama, StarCoder2, and CodeT5. They focused on two major vulnerability types:

  • CWE-79: Cross-Site Scripting (XSS)
  • CWE-89: SQL Injection

Effectiveness vs. Baseline

The first test was simple: Does RealVul work better than the traditional “Vulnerability Repair” dataset method?

The results were staggering.

Table 1: Evaluation results on Random Samples. ΔF1 is the difference between the F1 scores of RealVul and Baseline methods.

As shown in Table 1, RealVul (the top half of the table) consistently crushed the Baseline method (bottom half). Look at the F1 Score, which is a balanced measure of precision and recall.
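(For reference, F1 is the harmonic mean of the two: F1 = 2 × Precision × Recall / (Precision + Recall), so a model cannot score well by doing well on only one of them.)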

  • For CodeLlama-7b on XSS (CWE-79), RealVul achieved an F1 score of 83.68, compared to just 32.35 for the baseline. That is a +51.3 point improvement.
  • For SQL Injection (CWE-89), the improvement was even more dramatic, with gains of over +70 points in some cases.

This demonstrates that how you prepare the data is at least as important as which model you use. A smaller model trained on RealVul data (like CodeT5-base) outperformed much larger models trained on noisy data.

Comparison with SAST Tools

Static Application Security Testing (SAST) tools like RIPS and Fortify SCA are the industry standard. They use strict rule-based engines to find bugs. How does an LLM compare?

Table 3: Comparison of RealVul and two SAST tools. We also provide the time required for the evaluation.

Table 3 shows the comparison.

  • For XSS (CWE-79): RIPS performed slightly better in raw identification, but RealVul was competitive.
  • For SQL Injection (CWE-89): RealVul significantly outperformed the traditional tools. RIPS only found 3 True Positives (TP), while RealVul models found around 30.

Why did SAST fail on SQL injection? Traditional tools often rely on matching specific function names (like mysql_query). If a developer wraps that function in a custom helper class or uses a framework, the SAST tool might miss it. RealVul, however, analyzes the flow of string concatenation into SQL commands, allowing it to catch vulnerabilities that rule-based tools miss.
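A hypothetical example of the kind of indirection that defeats name-based rules (the Db class and its run() method are invented for illustration; $conn is assumed to be an open mysqli connection):

```php
<?php
// A rule that only matches calls to mysqli_query() at the call site never
// fires on the last line below, because the real sink is hidden in a helper.
class Db {
    private $conn;
    public function __construct($conn) { $this->conn = $conn; }
    public function run($sql) {
        return mysqli_query($this->conn, $sql);   // the actual sink
    }
}

$db  = new Db($conn);
$uid = $_GET['uid'];                                     // source
$db->run("SELECT * FROM users WHERE id = " . $uid);      // tainted concatenation into SQL
?>
```

Because the tainted value is concatenated into a SQL-shaped string before the call, a flow-based analysis of the kind described above can still flag the snippet.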

Does Normalization Help? (Ablation Study)

You might wonder if “normalizing” the code (renaming variables, removing HTML) actually helps, or if it destroys valuable context. The researchers tested this by training models with and without normalization.

Figure 4: Comparison of ablation study results with the visualization of results from the first two experiments.

Figure 4 visualizes this. The blue bars (RealVul with normalization) generally reach higher F1 scores than the orange bars (without normalization). The grey bars represent the old baseline method.

The data confirms that stripping away the “noise” (like variable names and HTML content) helps the LLM focus on the structural logic of the vulnerability, improving its predictive performance.

Case Study: Seeing the Difference

To truly appreciate the difference RealVul makes, let’s look at the actual data fed into the models.

Figure 5 below compares samples obtained through the traditional method vs. RealVul.

Figure 5: Two sets of sample Cases obtained through vulnerability repair and RealVul. We mark the data flow and potential vulnerability statements.

  • Snippet (a) & (b): These show the clean, extracted snippets from RealVul. Notice how focused they are. In snippet (b) (CWE-89 Case), you can clearly see the data flow from $var1 and $var2 into $staint, which is then used in a query. There is no clutter.
  • Contrast this mentally with a raw file that might be hundreds of lines long, containing CSS, JavaScript, and unrelated PHP logic.

Because the LLM sees these focused snippets, it can learn the pattern of a vulnerability (e.g., “concatenating user input into a SQL string”) rather than memorizing unrelated artifacts of a specific file.

Conclusion and Implications

The RealVul framework represents a significant step forward in automated security auditing. By moving away from noisy “commit-based” datasets and embracing a static analysis + LLM hybrid approach, the researchers achieved state-of-the-art results for PHP vulnerability detection.

Key Takeaways:

  1. Garbage In, Garbage Out: The quality of the dataset matters more than the size of the model. Cleaning the data via slicing and normalization yielded massive performance gains.
  2. Hybrid Approaches Win: RealVul isn’t just “asking ChatGPT.” It uses traditional program analysis (ASTs, Control Flow Graphs) to prepare the data before the AI touches it. This combination of classic CS theory and modern AI is powerful.
  3. PHP Security Matters: As the backbone of the web, improving PHP security tools has a massive real-world impact.

As LLMs continue to evolve, we can expect tools like RealVul to become integrated into IDEs and CI/CD pipelines, acting as an intelligent pair programmer that catches security flaws before they ever reach production.