Imagine you find an old executable file on a server. It’s a critical piece of legacy software for your company, but there’s one problem: the source code is gone. No GitHub repo, no backup zip files. Just the raw binary.

To update or fix this software, you need to perform decompilation—the process of reversing the compilation steps to turn machine code back into a human-readable programming language like C. For decades, this has been the realm of specialized tools like Ghidra or IDA Pro. While powerful, these tools often produce output that looks more like a mathematical riddle than clean code.

Enter LLM4Decompile. In a recent paper, researchers from the Southern University of Science and Technology and Hong Kong Polytechnic University introduced the first open-source Large Language Model (LLM) series specifically trained to decompile binary code. Ranging from 1.3 billion to 33 billion parameters, these models don’t just guess at the code—they are trained to produce re-executable source code.

In this post, we’ll break down how LLM4Decompile works, why traditional methods struggle, and how this new approach outperforms even giants like GPT-4o in this specific domain.

The Decompilation Headache

To understand why LLM4Decompile is necessary, we first need to look at why decompilation is so hard.

When you compile C code, the compiler (like GCC) runs through several lossy steps (reproduced in the sketch after this list):

  1. Preprocessing: Expanding macros, resolving #include directives, and removing comments.
  2. Compilation: Turning C into Assembly (ASM).
  3. Assembly: Turning ASM into machine code (a binary object file).
  4. Linking: Resolving references between functions and libraries to create the final executable.
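
To make these steps concrete, here is a minimal sketch that drives GCC through each stage one at a time (assuming gcc is installed and hello.c exists in the working directory):

    import subprocess

    SRC = "hello.c"  # any small C file; assumed to exist

    # 1. Preprocessing: expand macros, inline #includes, strip comments.
    subprocess.run(["gcc", "-E", SRC, "-o", "hello.i"], check=True)

    # 2. Compilation: translate the preprocessed C into assembly.
    subprocess.run(["gcc", "-S", "hello.i", "-o", "hello.s"], check=True)

    # 3. Assembly: turn the assembly into a binary object file.
    subprocess.run(["gcc", "-c", "hello.s", "-o", "hello.o"], check=True)

    # 4. Linking: resolve symbols and produce the final executable.
    subprocess.run(["gcc", "hello.o", "-o", "hello"], check=True)

Inspecting hello.s after step 2 already shows the loss: every loop has been reduced to labels and jumps.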

During this process, a massive amount of information is lost. Variable names are discarded, high-level structures (like for loops) are flattened into jmp (jump) instructions, and data types are converted into raw memory offsets.

Recovering the original C code from binary is like trying to reconstruct a whole cow from a hamburger. You might guess the general ingredients, but getting the original structure back is incredibly difficult.

The Traditional Approach vs. The Reality

Tools like Ghidra attempt to reverse this by analyzing the control flow graph of the binary. They look for patterns that resemble loops or if-statements.

Figure 1: Illustration of compiling source code to binary, disassembling binary to assembly code (ASM), and decompiling ASM to pseudo-code with Ghidra.

As shown in Figure 1, the results are often mixed.

  • Top Left: The original source code is a clean C function with a nested loop.
  • Bottom Right: Ghidra’s output captures the logic but replaces standard loops with while(true) and do-while constructs. It loses variable names (calling them param_1, local_28) and uses pointer arithmetic that is hard for humans to parse.

Crucially, Ghidra’s output is often pseudo-code. It usually cannot be re-compiled because of syntax errors, undefined types (like undefined4), and missing headers. This makes it useful for analysis, but terrible for rebuilding software.

The Solution: LLM4Decompile

The researchers propose that LLMs, which excel at translation tasks (e.g., French to English), can be treated as binary-to-code translators. They introduce two distinct methodologies for tackling this problem:

  1. End-to-End Decompilation: Translating Assembly directly to C.
  2. Refined Decompilation: Using an LLM to “fix” Ghidra’s messy output.

Let’s explore the architecture of these two approaches.

1. The End-to-End Approach (LLM4Decompile-End)

The “End2End” method treats assembly language (ASM) as the source language and C code as the target language.

Figure 2: End2end-Decompile framework.

As illustrated in Figure 2, the process is straightforward in concept but complex in execution (a minimal sketch follows the list):

  1. The system takes the Binary.
  2. It uses a tool like objdump to Disassemble the binary into Assembly instructions (ASM).
  3. The LLM4Decompile model takes this ASM text as input and predicts the original Source Code (SRC’).
  4. During training, the model compares its prediction against the original source to calculate loss and learn.
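
Conceptually, steps 2 and 3 boil down to a few lines. The sketch below uses the real objdump tool and the Hugging Face transformers API; the model ID and prompt format are illustrative assumptions, so check the project's release page for the exact ones:

    import subprocess
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Step 2: disassemble the compiled object into ASM text. The paper's
    # pipeline also cleans this output (strips addresses and comments).
    asm = subprocess.run(["objdump", "-d", "func.o"],
                         capture_output=True, text=True, check=True).stdout

    # Step 3: translate ASM -> C. The model ID is a placeholder, not verified.
    MODEL = "LLM4Binary/llm4decompile-6.7b"  # hypothetical ID
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL)

    prompt = f"# This is the assembly code:\n{asm}# What is the source code?\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=512)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))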

The Secret Sauce: Data Augmentation & Two-Stage Training

You cannot simply train a model on random GitHub code. To make the model robust, the researchers used compilation optimization as a form of data augmentation.

Developers compile code at different optimization levels (O0 for no optimization, up to O3 for aggressive optimization). The researchers compiled millions of C functions at O0, O1, O2, and O3. This taught the model that different-looking Assembly code could map back to the same C source code.
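
A sketch of how such training pairs can be generated: compile the same function at all four levels, disassemble each result, and pair every ASM variant with the single original source (gcc and objdump assumed installed):

    import subprocess

    def make_pairs(c_file: str):
        """Compile one C file at every optimization level and pair
        each disassembly with the same original source."""
        src = open(c_file).read()
        pairs = []
        for level in ["O0", "O1", "O2", "O3"]:
            subprocess.run(["gcc", f"-{level}", "-c", c_file, "-o", "tmp.o"],
                           check=True)
            asm = subprocess.run(["objdump", "-d", "tmp.o"],
                                 capture_output=True, text=True).stdout
            pairs.append((asm, src))  # four different inputs, one target
        return pairs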

They also employed a Two-Stage Training strategy:

  • Stage 1: Compilable Data: They trained on 7 billion tokens of “compilable” code (object files that aren’t fully linked). This gave the model a massive breadth of general C knowledge.
  • Stage 2: Executable Data: They fine-tuned on “executable” binaries (fully linked).

Why the distinction?

Figure 6: Compilable data and Executable data.

Figure 6 highlights the subtle but critical difference. In a “Compilable” object file (Left), jump addresses are relative offsets. In an “Executable” file (Right), the linker has replaced those offsets with specific memory addresses (e.g., jmp 11f1). By training on both, the model learns to understand the code logic regardless of memory addressing.
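
You can observe this difference yourself by disassembling the same code before and after linking. A minimal sketch (assuming func.c defines main()):

    import subprocess

    def disasm(path):
        return subprocess.run(["objdump", "-d", path],
                              capture_output=True, text=True).stdout

    # "Compilable": an unlinked object file; jump targets are still
    # unresolved relative offsets.
    subprocess.run(["gcc", "-c", "func.c", "-o", "func.o"], check=True)
    print(disasm("func.o"))

    # "Executable": after linking, the same jumps point at concrete
    # addresses such as `jmp 11f1`.
    subprocess.run(["gcc", "func.c", "-o", "func"], check=True)
    print(disasm("func"))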

2. The Refined Approach (LLM4Decompile-Ref)

The second approach acknowledges that tools like Ghidra are actually quite good at recovering the general logic, even if the syntax is ugly.

Figure 3: Refined-Decompile framework.

In the Refined-Decompile framework (Figure 3), the workflow changes slightly:

  1. The Binary is passed through Ghidra first.
  2. Ghidra produces its messy pseudo-code.
  3. The LLM4Decompile-Ref model takes that pseudo-code as input and acts as a code polisher, rewriting it into valid, readable C code.

This method leverages the best of both worlds: Ghidra’s strong structural analysis and the LLM’s ability to write clean syntax.
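
In code, the refinement step is just another translation call. The only change from the end-to-end sketch above is the input: Ghidra's pseudo-code (exported beforehand, for example with Ghidra's headless analyzer) instead of raw assembly. The file name, model ID, and prompt format are again illustrative assumptions:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    pseudo = open("ghidra_output.c").read()  # pseudo-code exported from Ghidra

    MODEL = "LLM4Binary/llm4decompile-6.7b-ref"  # hypothetical ID
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL)

    prompt = f"# This is the pseudo-code:\n{pseudo}# What is the source code?\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=512)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))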

Experiments and Results

To test these models, the researchers didn’t just check if the code “looked” right. They used a strict metric called Re-executability.

What is Re-executability?

  1. Take the decompiled code generated by the LLM.
  2. Attempt to compile it with GCC.
  3. Run the resulting executable against test cases and assertions.
  4. If it compiles AND passes the tests, it counts as a success.

This is a very high bar. A single typo or logic error results in failure.
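
A minimal sketch of such a checker, assuming each benchmark problem ships a C test harness (test_main.c here, containing assertions) that links against the decompiled function:

    import subprocess

    def is_reexecutable(decompiled_c: str) -> bool:
        """True iff the decompiled code compiles AND passes the tests."""
        with open("candidate.c", "w") as f:
            f.write(decompiled_c)
        # Step 2: compile the candidate together with the test harness.
        build = subprocess.run(["gcc", "candidate.c", "test_main.c",
                                "-o", "candidate"])
        if build.returncode != 0:
            return False  # syntax or type errors: compilation failure
        # Step 3: run the assertions; a crash or failed assert exits nonzero.
        try:
            run = subprocess.run(["./candidate"], timeout=10)
        except subprocess.TimeoutExpired:
            return False
        return run.returncode == 0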

End-to-End Performance

The researchers tested their models on HumanEval-Decompile (standard coding problems) and ExeBench (real-world functions).

Table 1: Main comparison of End2end-Decompile approaches for re-executability rates on evaluation benchmarks

Table 1 reveals some impressive findings:

  • LLM4Decompile-End-6.7B achieves a 45.37% average success rate on HumanEval-Decompile.
  • GPT-4o, generally considered the state-of-the-art in general coding, only achieves 16.01%.
  • DeepSeek-Coder, the base model before fine-tuning, gets 0%—it simply doesn’t understand how to process binary assembly code without specific training.

Note the drop in performance as optimization levels increase (O0 to O3). Heavily optimized code (O3) creates assembly that is vastly different from the original source, making decompilation much harder.

The Power of Refinement

Does adding Ghidra to the mix help? Absolutely.

Table 3: Comparison of Refined-Decompile approaches

Looking at Table 3, the refined approach dominates:

  • Ghidra alone has a re-executability rate of roughly 20%. It often recovers the correct logic, but syntax errors prevent compilation.
  • Ghidra + GPT-4o improves this to 35%.
  • Ghidra + LLM4Decompile-Ref-6.7B hits 52.74%.

This shows that a specialized small model (6.7B) can outperform a massive general model (GPT-4o) when fine-tuned on the specific task of cleaning up decompiler output.

Visualizing the Code

Numbers are great, but what does the code actually look like?

Figure 4: Decompilation results of different approaches.

Figure 4 provides a fascinating side-by-side comparison:

  • Source (Top Left): The ground truth.
  • Ghidra (Middle Left): Uses local_24, local_28, and messy while(true) loops. Hard to read.
  • GPT-4o (Bottom Left): It produces very clean code, but it hallucinates. Notice it creates a 2D array arr[outer][inner] which didn’t exist in the original code. This makes the code readable but functionally wrong.
  • LLM4Decompile (Top Right): It recovers the logic and structure almost perfectly, maintaining the 1D array semantics while using readable C syntax.

Why Does It Still Fail?

While 50% re-executability is a massive leap forward, it’s not 100%. The researchers analyzed where the models struggle.

1. Input Length

Figure 7: Re-executability rate with the growth of input length.

As shown in Figure 7, performance drops as the assembly code gets longer. The 6.7B model (bottom chart) is more robust than the 1.3B model (top chart), but even the larger model struggles when the input sequence exceeds 400 tokens. This is a common limitation in LLMs—maintaining context over long sequences is difficult.

2. Error Types

Figure 8: Types of errors identified in the two benchmarks

Figure 8 breaks down the failure causes.

  • HumanEval (Left): Most failures are “Assert” errors. The code compiles, but the logic is slightly off, causing it to fail test cases.
  • ExeBench (Right): The main issues are “Declare” and “Struct.” ExeBench uses real-world code with custom data structures (like struct MyObject). When code is compiled, struct definitions are erased. The LLM struggles to “hallucinate” the correct struct definition back into existence, causing compilation errors.

The Ethical Question: Is This Dangerous?

A tool that can perfectly reverse-engineer binaries sounds like a dream for developers recovering lost code, but a nightmare for security. Could malware authors use this to study patches? Could pirates use it to crack software?

To address this, the researchers tested LLM4Decompile against Obfuscated Code. Developers often use techniques like “Control Flow Flattening” to intentionally scramble the logic of their binaries to prevent reverse engineering.
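
To get a feel for what flattening does, here is an illustrative, hand-written Python rendering of a simple loop before and after the transformation (real obfuscators apply this at the compiler or binary level, but the structural effect is the same):

    def sum_to(n):
        total = 0
        for i in range(n):  # clear structure: one loop
            total += i
        return total

    def sum_to_flattened(n):
        # Same logic, dispatched through a state machine: the loop
        # pattern a decompiler (or LLM) would recognize is gone.
        state, total, i = 0, 0, 0
        while state != 3:
            if state == 0:
                i = 0; state = 1
            elif state == 1:
                state = 2 if i < n else 3
            elif state == 2:
                total += i; i += 1; state = 1
        return total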

The results were reassuringly poor. When tested on obfuscated binaries, the re-executability rate dropped by over 70%. The models rely on standard compiler patterns; when those patterns are intentionally broken by obfuscation, the LLM gets confused just like a human would. This suggests the tool is effective for legitimate recovery tasks but has natural limitations that prevent easy misuse on protected software.

Conclusion

LLM4Decompile represents a significant milestone in applying AI to low-level systems programming. By treating binary code as just another language to be translated, the researchers have created a tool that significantly outperforms traditional decompilers in generating executable, readable code.

Key takeaways for students and researchers:

  • Domain Specificity Wins: A 6.7B parameter model fine-tuned on decompilation beats GPT-4o.
  • Hybrid Approaches Work: The best results came from combining traditional tools (Ghidra) with AI refinement.
  • Real-World Constraints: Handling user-defined types (structs) and long contexts remains the next frontier for this technology.

The 6.7B and 33B models are open-source, paving the way for a future where losing your source code might be nothing more than a minor inconvenience.