The Reality Gap: Can LLMs Actually Break Real-World AI Defenses?
We are living in the era of the “AI Agent.” We have moved past simple chatbots that write poems; we now evaluate Large Language Models (LLMs) on their ability to reason, plan, and interact with software environments. Benchmarks like SWE-Bench test whether an AI can fix GitHub issues, while others test whether it can browse the web or solve capture-the-flag (CTF) security challenges. But a question lingers in the research community: do these benchmarks reflect reality? ...