The software development world is currently in the grip of a revolution driven by Large Language Models (LLMs). Tools like GitHub Copilot and ChatGPT have demonstrated an uncanny ability to auto-complete functions, write unit tests, and even solve complex algorithmic puzzles. It is tempting to believe that we are on the verge of fully autonomous software engineering, where an AI can take a high-level requirement and produce a deployment-ready application.
However, a critical gap exists between generating a sorting algorithm in a sandbox and building a robust, secure backend application that handles real user data. Backends are the engine rooms of modern software; they require complex logic spanning multiple files, strict adherence to API specifications, and, most importantly, bulletproof security.
To test whether AI is truly ready for the “real world,” a team of researchers from ETH Zurich, LogicStar.ai, UC Berkeley, and INSAIT introduced BAXBENCH. This new benchmark moves beyond simple code snippets to evaluate LLMs on end-to-end backend generation.
The results are a sobering reality check. As illustrated below, even the most advanced flagship models struggle to produce code that works and is safe to deploy.

In this article, we will dissect the BAXBENCH paper, exploring how the researchers built this rigorous testing ground, why current models fail, and what this means for the future of AI-assisted development.
The Gap in Current Benchmarking
Before diving into BAXBENCH, it is important to understand why existing benchmarks were insufficient.
Most standard coding benchmarks, such as HumanEval or MBPP, focus on function-level tasks. They ask models to “write a Python function to reverse a string” or “solve this dynamic programming problem.” While useful for measuring basic coding fluency, these tasks lack the context of a system. They don’t test:
- System-level coherence: Managing state and logic across multiple files.
- Framework knowledge: Using libraries like Django, Express.js, or Spring Boot correctly.
- Security: Protecting against SQL injection, Cross-Site Scripting (XSS), or unauthorized access.
Recent benchmarks like SWE-bench have started to address repository-level editing, but they often focus on patching existing bugs rather than generating functionality from scratch. Furthermore, security is frequently treated as an afterthought or tested in isolation using static analysis tools, which are prone to false positives and negatives.
The BAXBENCH Methodology
The researchers designed BAXBENCH to bridge this gap by simulating a realistic software engineering task: “Here is an API specification. Build the backend server.”
The benchmark is constructed around three core dimensions: Scenarios, Frameworks, and rigorous Evaluation.

As shown in Figure 2 above, the process begins with a scenario definition (e.g., a Calculator or Shopping Cart) and a target framework (e.g., Python-Django). The LLM generates the solution, which is then spun up in a Docker container. Finally, the system runs functional tests to check if it works and security exploits to check if it can be hacked.
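To make that pipeline concrete, here is a minimal sketch of how one such round could be wired together, assuming the generated solution ships with a Dockerfile and exposes the scenario's port. The image name, port, and `/calculate` endpoint are illustrative assumptions, not the benchmark's actual harness.

```python
# Minimal sketch of a BAXBENCH-style check (not the paper's real harness).
# Assumes the LLM's solution sits in the current directory with a Dockerfile
# and that the scenario's spec declares a /calculate endpoint on port 5000.
import subprocess
import time

import requests

IMAGE = "llm-backend"   # hypothetical image name
PORT = 5000             # hypothetical port taken from the scenario spec


def main() -> None:
    # 1. Containerize the generated solution and start the server.
    subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)
    server = subprocess.Popen(["docker", "run", "--rm", "-p", f"{PORT}:{PORT}", IMAGE])
    time.sleep(10)  # crude wait for the server to boot

    try:
        # 2. Functional test: does the endpoint honor the API contract?
        resp = requests.post(f"http://localhost:{PORT}/calculate",
                             json={"expression": "2+2"})
        functionally_correct = resp.ok and resp.json().get("result") == 4
        print("functional:", functionally_correct)
        # 3. Security exploits are fired at the same running server
        #    (see the exploit sketch later in this article).
    finally:
        server.terminate()


if __name__ == "__main__":
    main()
```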
1. The Scenarios: Contract-Driven Development
Real-world backend development is rarely vague. It is usually defined by a contract—specifically, an API specification. BAXBENCH includes 28 diverse scenarios, ranging from simple utilities to complex multi-endpoint applications.
Instead of vague natural language prompts, the models are provided with OpenAPI specifications. This mirrors industry best practices, where a frontend team and backend team agree on an API contract (endpoints, request bodies, response formats) before coding begins.
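For intuition, the contract for a simple calculator scenario might look something like the following OpenAPI fragment, shown here as a Python dict. The paths and schemas are illustrative, not copied from the benchmark; the real specifications are longer and more detailed.

```python
# Illustrative OpenAPI 3.0 fragment for a hypothetical calculator scenario.
calculator_spec = {
    "openapi": "3.0.3",
    "info": {"title": "Calculator API", "version": "1.0.0"},
    "paths": {
        "/calculate": {
            "post": {
                "summary": "Evaluate an arithmetic expression",
                "requestBody": {
                    "required": True,
                    "content": {
                        "application/json": {
                            "schema": {
                                "type": "object",
                                "properties": {"expression": {"type": "string"}},
                                "required": ["expression"],
                            }
                        }
                    },
                },
                "responses": {
                    "200": {"description": "Result of the evaluated expression"},
                    "400": {"description": "Malformed or unsupported expression"},
                },
            }
        }
    },
}
```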
The scenarios were carefully selected to cover:
- Practicality: Real use cases like user authentication, e-commerce carts, and file management.
- Complexity: Tasks requiring database interactions and file system manipulation.
- Security Risk: Applications that are naturally prone to vulnerabilities if mishandled.

2. The Frameworks: Beyond Just Python
An AI developer must be versatile. It is not enough to know Python; one must know how to build a server in Rust, Go, or Ruby if the stack demands it. BAXBENCH evaluates models across 14 distinct frameworks in 6 programming languages.
This diversity helps identify whether models truly understand backend concepts or if they have just memorized a lot of Python code from their training data.

3. Evaluation: Exploits > Static Analysis
The most significant contribution of BAXBENCH is its approach to security evaluation. Many prior studies relied on Static Application Security Testing (SAST) tools. These tools scan source code for patterns that look like vulnerabilities. However, SAST tools are notoriously noisy—they often flag safe code as dangerous (false positives) or miss complex logic errors (false negatives).
BAXBENCH takes a “black-box” approach. For every scenario, security experts wrote actual exploits.
- Functional Tests: Does the endpoint `/calculate` return `4` when I send `2+2`?
- Security Exploits: Does the server crash, leak data, or execute malicious commands when I send `2; DROP TABLE users`?
The benchmark tracks 13 specific Common Weakness Enumerations (CWEs), covering the most dangerous flaws in software development, including the OWASP Top 10.

If an exploit succeeds, the code is undeniably insecure. There is no ambiguity.
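The contrast with SAST is that nothing here inspects the source code: the checks only talk to the running server over HTTP. Below is a hedged sketch of this black-box style, assuming a hypothetical login endpoint backed by a SQL database; the URL, credentials, and payload are illustrative, not taken from the benchmark.

```python
# Black-box checks against a running server (no source code inspection).
# The base URL, endpoint, and payloads are illustrative assumptions.
import requests

BASE = "http://localhost:5000"


def functional_test() -> bool:
    """Does a legitimate login with correct credentials succeed?"""
    r = requests.post(f"{BASE}/login",
                      json={"user": "alice", "password": "correct-horse"})
    return r.status_code == 200


def sql_injection_exploit() -> bool:
    """Does a classic injection payload bypass authentication?

    If this returns True, the code is exploitable in practice -- no static
    pattern matching involved, just an attack that actually worked.
    """
    payload = {"user": "alice' OR '1'='1' --", "password": "anything"}
    r = requests.post(f"{BASE}/login", json=payload)
    return r.status_code == 200  # the login should have been rejected


if __name__ == "__main__":
    print("functionally correct:", functional_test())
    print("exploitable:", sql_injection_exploit())
```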
Analyzing the Results
The researchers evaluated 11 state-of-the-art LLMs, including OpenAI’s o1 and GPT-4o, Anthropic’s Claude 3.5 Sonnet, and open-source models like Llama 3 and DeepSeek. The metric used is sec_pass@k, which measures the probability that a model generates a solution that is both functionally correct and secure.
The “Correct but Insecure” Trap
The primary finding is alarming. As visualized in Figure 3, even the best models struggle to reach a 40% success rate for correct and secure code (sec_pass@1).

Look closely at the bar chart (Figure 3). The full height of the bar represents code that works (passes functional tests). The solid red portion represents code that is secure.
- The Delta: The shaded area represents code that works perfectly but is vulnerable to hackers.
- The Statistic: Across all models, roughly 50% of the functionally correct programs were exploitable.
This confirms a dangerous limitation of current LLMs: they prioritize functionality over safety. They will happily use eval() to solve a math problem because it’s the easiest way to make the test pass, ignoring the fact that it opens the door to Remote Code Execution (RCE).
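As a concrete illustration of that instinct, compare a one-line `eval()` "calculator" with a restricted evaluator built on Python's `ast` module. This is a sketch of the general pattern, not code from the paper.

```python
import ast
import operator


# The path of least resistance: one line, passes the functional test,
# but lets any caller execute arbitrary Python (Remote Code Execution).
def calculate_insecure(expression: str):
    return eval(expression)  # "__import__('os').getenv('SECRET_KEY')" also "works"


# A safer alternative: parse the expression and only allow arithmetic nodes.
_ALLOWED_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def calculate_safe(expression: str) -> float:
    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_OPS:
            return _ALLOWED_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _ALLOWED_OPS:
            return _ALLOWED_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


# calculate_safe("2+2") -> 4; calculate_safe("__import__('os')") raises ValueError
```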
The Reasoning Gap: Standard vs. Reasoning Models
The study highlights a distinction between standard models (like GPT-4o) and “reasoning” models (like OpenAI o1 or o3-mini). Reasoning models, which “think” before they output tokens, performed significantly better. OpenAI o3-mini achieved the highest scores, suggesting that “test-time compute” (the extra computation the model spends reasoning before emitting its final answer) allows it to consider edge cases and security constraints that standard models rush past.
Can We Just Prompt for Security?
A common counter-argument in prompt engineering is: “Did you tell the model to be secure?”
The researchers tested this hypothesis using three prompting strategies:
- No Reminder: Just the task specs.
- Generic Reminder: “Please follow security best practices.”
- Oracle Reminder: Explicitly listing the specific vulnerabilities (CWEs) relevant to that scenario (e.g., “Watch out for SQL injection in this task”).
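In code, the three conditions might be assembled roughly as follows. The wording of the reminders and the per-scenario CWE list are illustrative assumptions, not the paper's actual prompts.

```python
# Rough sketch of the three prompting conditions (wording is illustrative).
def build_prompt(spec: str, mode: str, relevant_cwes: list[str] | None = None) -> str:
    prompt = f"Implement a backend server that satisfies this OpenAPI spec:\n{spec}\n"
    if mode == "generic":
        prompt += "\nPlease follow security best practices."
    elif mode == "oracle":
        cwes = ", ".join(relevant_cwes or [])
        prompt += f"\nMake sure the implementation is not vulnerable to: {cwes}."
    return prompt


# Example: the oracle condition for a scenario prone to SQL injection.
print(build_prompt("<spec omitted>", mode="oracle",
                   relevant_cwes=["CWE-89: SQL Injection"]))
```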

The results (Figure 4) reveal a nuanced trade-off:
- Oracle Prompts Help Security: Explicitly warning models about specific bugs improves security scores significantly.
- The Trade-off: However, adding security constraints often confuses the models regarding functionality, causing the pass@1 (functional correctness) to drop.
Interestingly, reasoning models (like o1 and o3-mini) benefited the most from generic reminders. They were able to take the abstract instruction “be secure” and successfully apply it to the code without breaking functionality. Standard models (like GPT-4o and Claude 3.5 Sonnet) showed less improvement from generic prompts, indicating they lack the reasoning depth to infer specific security requirements from general advice.
The Framework Lottery
Not all code is generated equal. The researchers found a strong correlation between the popularity of a language/framework and the model’s performance.

As shown in Figure 5, OpenAI o1 performs admirably on Python-Django and JavaScript-Express—frameworks with massive representation in the training data (GitHub, StackOverflow). However, performance collapses on Rust-Actix or PHP-Lumen.
Crucially, in less popular frameworks, models don’t just fail to compile; they produce code that is more likely to be insecure. This suggests that LLMs have a “shallow” understanding of security. They don’t necessarily understand the abstract concept of “SQL Injection”; rather, they have seen millions of examples of how to sanitize inputs in Django, but very few examples in specialized Rust libraries.
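The memorized pattern in question is, for Python, the parameterized query. A generic sketch using the standard library's `sqlite3` module makes the contrast clear; the table and column names are hypothetical.

```python
import sqlite3


def find_user(conn: sqlite3.Connection, username: str):
    # Vulnerable pattern: string formatting splices untrusted input into SQL,
    # so a username like "alice' OR '1'='1" changes the meaning of the query.
    # conn.execute(f"SELECT * FROM users WHERE name = '{username}'")

    # Safe pattern: a parameterized query; the driver treats the input strictly
    # as data, never as SQL syntax.
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (username,)
    ).fetchone()
```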
Complexity Kills Performance
The researchers also analyzed the relationship between the complexity of the task (measured by the length of the OpenAPI specification) and the success rate.

The negative correlation in Figure 6 is clear: as the specification gets longer and more detailed, the model’s ability to implement it correctly decreases. This highlights the “context window” limitation not just in terms of token count, but in terms of “attention span.” Maintaining consistency across a complex API contract remains a major hurdle for autonomous generation.
Digging Deeper: The pass@k Metric
For students interested in evaluation metrics, the paper utilizes a statistical estimator known as pass@k. Since LLMs are probabilistic (they generate different code every time), running them once isn’t enough. pass@k estimates the probability that at least one solution is correct if you allow the model \(k\) attempts.
The researchers define sec_pass@k similarly, but with the added constraint that the passing solution must withstand all security exploits.
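The estimator itself is the standard unbiased form popularized alongside HumanEval (and presumably the form used here): sample \(n \ge k\) solutions per task, count the \(c\) that pass, and compute

\[
\text{pass@}k \;=\; \mathop{\mathbb{E}}_{\text{tasks}}\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right],
\]

where \(n\) is the number of samples drawn per task and \(c\) is the number of passing samples among them.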

When the researchers expanded their evaluation to \(k=5\) (giving the model 5 tries), performance improved, but the security gap remained.

Figure 9 shows that even with 5 attempts and “Oracle” security reminders (the best-case scenario), the gap between functional code (full bar) and secure code (solid color) persists.
Why Do Models Fail?
The paper identifies several failure modes:
- Boilerplate Fatigue: Models often fail on trivial tasks like setting up the correct file structure, handling imports in multi-file projects, or configuring the server to listen on the correct port (0.0.0.0 vs localhost); see the snippet after this list.
- The “Eval” Instinct: For tasks like a Calculator, models frequently resort to `eval()`, which executes a string as code. It is the easiest way to solve the math problem but a catastrophic security flaw. Implementing a proper parser is “harder” (requires more tokens/logic), so the model takes the path of least resistance.
- Framework Hallucinations: In less popular frameworks (like Rust-Actix), models often invent functions that don’t exist or use outdated syntax, reflecting a lack of grounding in the library’s actual documentation.
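The port-binding failure in particular is easy to picture: inside a Docker container, a server bound to localhost is unreachable from the test harness. A minimal Flask example (illustrative, not taken from the benchmark):

```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.post("/calculate")
def calculate():
    # Stubbed response for illustration; a real handler would parse the body.
    return jsonify(result=4)


if __name__ == "__main__":
    # Common LLM mistake: host="127.0.0.1" only accepts connections from inside
    # the container, so the benchmark's tests can never reach the server.
    # Binding to 0.0.0.0 exposes it on the container's network interface.
    app.run(host="0.0.0.0", port=5000)
```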
Implications for the Future of Software Engineering
BAXBENCH serves as a critical reality check for the industry. While LLMs are powerful assistants, they are currently unfit for autonomous backend development.
The findings suggest three major avenues for improvement:
- Test-Time Compute: The superior performance of reasoning models (o1, o3-mini) suggests that “thinking time” is crucial for security. Security is a constraint satisfaction problem; the model must check its own output against safety rules before finalizing it.
- Security Alignment: We need to train models to prefer secure implementations even when they are more verbose. The model should “know” that writing a parser is better than using `eval()`, even if it takes 50 lines instead of 1.
- Agentic Workflows: While the paper briefly tested agents (OpenHands) and found only modest improvements, the future likely lies in agents that can iteratively run tests, see the security exploit fail, and patch the code themselves, mimicking a human developer’s workflow.
Conclusion
BAXBENCH introduces a rigorous standard for AI-generated code. By shifting the focus from simple algorithms to full-stack, secure application generation, it exposes the fragility of current LLMs.
For students and developers, the takeaway is clear: Use LLMs to accelerate your work, but never trust them to secure it. The ability to generate code has outpaced the ability to secure it, and until that gap closes, the human engineer remains the most critical security feature in the loop.
The benchmark is open for the community to expand, ensuring that as models evolve, the bar for “deployment-ready” keeps rising.