If you have ever tried to reproduce the results of a machine learning paper from a GitHub repository, you know the struggle. It usually goes something like this: you clone the repo, install the requirements, and immediately get a ModuleNotFoundError. You fix that, downgrade three packages, and then get a CUDA version mismatch. Three hours later, you haven’t even started the training script.
As Large Language Models (LLMs) become more proficient at writing code, a tantalizing question arises: Can we use AI agents to automate this messy, frustrating process?
Can an LLM autonomously clone a research repository, figure out the undocumented dependencies, configure the data loaders, and reproduce the paper’s results? This is the question tackled by researchers from the Allen Institute for AI and the University of Washington in their new paper, SUPER.
In this post, we will dive deep into the challenges of “in-the-wild” research code, explore the unique architecture of the SUPER benchmark, and analyze why even the most powerful models like GPT-4 struggle to solve these tasks.
The Problem: Research Code is “Wild”
Previous benchmarks for AI coding agents (like HumanEval or ML-Bench) often focus on well-defined problems or popular, well-maintained libraries. If you ask an agent to “train a ResNet using PyTorch,” it has seen thousands of tutorials on that exact topic during its pre-training.
However, real research code is different. It is often:
- Low-profile: Repositories might have very few stars and minimal community support.
- Undocumented: requirements.txt files are often missing or outdated.
- Complex: Running an experiment requires modifying multiple files, handling data paths, and managing distinct “setup” and “execution” phases.
The researchers identified that the most difficult part of reproducing experiments isn’t just writing the code—it’s setting up the environment and executing the task.

As shown in Figure 1, a successful agent can’t just run a script. It must:
- Set up: Install dependencies and download datasets (often from Google Drive or obscure links).
- Resolve Issues: Handle version conflicts (e.g., the code relies on an old version of transformers but doesn’t say so).
- Execute & Report: Run the training, parse the logs, and report specific metrics.
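To make these three phases concrete, here is a minimal, hypothetical sketch of the kind of cell sequence an agent has to produce. The repository URL, the pinned transformers version, and the run_train.py entry point are all illustrative placeholders, not taken from the benchmark itself.

```python
import subprocess

def run(cmd: str) -> str:
    """Run a shell command and return its combined output, as an agent's notebook cell would."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

# 1. Set up: clone the repository and install its (possibly incomplete) requirements.
print(run("git clone https://github.com/example/paper-repo.git"))
print(run("pip install -r paper-repo/requirements.txt"))

# 2. Resolve issues: a traceback reveals an undeclared dependency on an old API,
#    so the agent pins the package version (illustrative fix).
print(run("pip install 'transformers==4.20.0'"))

# 3. Execute & report: run training and scan the logs for the requested metric.
log = run("cd paper-repo && python run_train.py --dataset mrpc")
for line in log.splitlines():
    if "f1" in line.lower():
        print("Candidate metric line:", line)
```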
To test this capability, the authors introduced SUPER (Setting UP and Executing tasks from Research repositories).
The SUPER Benchmark Structure
SUPER is not just a collection of coding questions; it is a simulation of the graduate student experience. The benchmark is divided into three distinct sets, each serving a different purpose in evaluating the agent.

1. The Expert Set
This is the “final exam.” It consists of 45 end-to-end problems manually curated from research papers published in 2021 or later.
- Source: Repositories found on “Papers With Code.”
- Condition: These are “in-the-wild” repositories, meaning they are not the famous, polished libraries everyone uses.
- Task: The agent is given a high-level goal (e.g., “Train this model on the MRPC dataset and report the F1 score”).
- Gold Standard: Human experts (freelance developers) actually solved these tasks first to ensure they are solvable and to provide a ground truth for evaluation.
2. The Masked Set
Evaluating end-to-end tasks is harsh. If an agent does 90% of the work correctly but fails the final execution, it gets a zero. To understand where agents fail, the researchers created the Masked Set.
They took the expert solutions and “masked” (removed) specific chunks of code to create 152 sub-problems. This is similar to a “cloze test” in linguistics (fill-in-the-blank), but for repository execution.

As illustrated in Figure 3, a masked problem provides the agent with a “pre-executed” environment (e.g., dependencies are already installed) and asks it to solve a specific remaining hurdle, such as:
- Dependency configuration: Fixing version conflicts.
- Data configuration: Updating a dataloader for a new CSV file.
- Issue resolution: Fixing a runtime bug.
This allows the researchers to isolate specific capabilities. Is the agent bad at coding, or just bad at installing libraries?
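To get a feel for the masked format, here is an invented example of a “data configuration” sub-problem. The loader code and the masked region are purely illustrative; the actual benchmark masks chunks of the expert-written solutions inside real repositories.

```python
# A hypothetical "masked" sub-problem: the environment is already set up, and the
# agent only has to fill in the removed chunk so the script reads the new CSV file.

import csv

def load_examples(path, limit=None):
    """Load (text, label) pairs from a CSV file with 'text' and 'label' columns."""
    examples = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for i, row in enumerate(reader):
            if limit is not None and i >= limit:
                break
            # >>> MASKED REGION: the agent must reconstruct this line <<<
            examples.append((row["text"], int(row["label"])))
            # >>> END MASKED REGION <<<
    return examples
```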
3. The Auto Set
Finally, because human curation is expensive, the researchers used GPT-4 to generate 604 additional tasks automatically. While these don’t have human-verified “gold” solutions, they serve as a large-scale playground for training and development.
The construction pipeline for these sets is visualized below:

Evaluation: How Do We Score an AI Researcher?
Scoring code generation is usually done with unit tests. However, you can’t easily unit test a messy research repo. Instead, SUPER uses two main metrics:
- Accuracy (Solution-based): The agent must report a specific metric (e.g., “Accuracy: 0.27”). If the reported number matches the gold solution (within a margin of error), the task is successful. This is the hardest metric.
- Landmarks: This is a partial credit system. The researchers identified specific strings in the output logs that indicate progress (e.g., “Loading data… 100%” or “*** training completed ***”). Even if the final number is wrong, hitting landmarks shows the agent is on the right track.
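The scoring logic is simple to sketch. The tolerance value and landmark strings below are illustrative assumptions; the benchmark defines its own per-task gold values and landmark patterns.

```python
def metric_accuracy(reported: float, gold: float, tolerance: float = 0.02) -> bool:
    """Solution-based accuracy: the reported number must match the gold value within a margin."""
    return abs(reported - gold) <= tolerance

def landmark_score(log_text: str, landmarks: list) -> float:
    """Partial credit: fraction of expected landmark strings that appear in the execution log."""
    hits = sum(1 for pattern in landmarks if pattern in log_text)
    return hits / len(landmarks) if landmarks else 0.0

# Illustrative usage
log = "Loading data... 100%\nEpoch 1/3 ...\n*** training completed ***\nF1: 0.268"
print(metric_accuracy(reported=0.268, gold=0.27))  # True: within the margin of error
print(landmark_score(log, ["Loading data... 100%", "*** training completed ***"]))  # 1.0
```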
The Agents
The paper evaluates several agent architectures, but the most interesting comparison is between ReAct and ReAct-SUPER (and a third external baseline, SWE-Agent).
The agents operate in a Jupyter Notebook environment. This is a crucial design choice. Unlike a simple terminal, a Notebook allows the agent to maintain state (variables) between cells and mix shell commands (!pip install) with Python code.
- ReAct: A standard agent that thinks, acts, and observes. It uses basic bash commands.
- ReAct-SUPER: The researchers improved the standard agent by giving it an edit tool. Standard agents struggle to edit files using command-line tools like sed. The edit tool allows the agent to replace specific lines of code in a file reliably, mimicking how a human uses an IDE.
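A file-editing tool along these lines is easy to imagine. The function below is a hedged sketch of the idea (replace a span of lines in a file and report what changed), not the paper's exact interface.

```python
from pathlib import Path

def edit_file(path: str, start_line: int, end_line: int, new_content: str) -> str:
    """Replace lines start_line..end_line (1-indexed, inclusive) of a file with new_content.

    Returns a short summary so the agent can observe what changed.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines(keepends=True)
    replacement = [line + "\n" for line in new_content.splitlines()]
    old = lines[start_line - 1:end_line]
    lines[start_line - 1:end_line] = replacement
    Path(path).write_text("".join(lines), encoding="utf-8")
    return f"Replaced {len(old)} line(s) with {len(replacement)} line(s) in {path}"

# Illustrative call the agent might make instead of wrestling with sed:
# edit_file("paper-repo/train.py", 42, 42, "batch_size = 16  # was hard-coded to 128")
```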
Experimental Results: A Reality Check
So, can GPT-4 reproduce your research? The short answer is: Not yet.
The results on the Expert Set (the full end-to-end tasks) are humbling.

As shown in Table 4:
- GPT-4o (State-of-the-Art): The best performing agent (SWE-Agent using GPT-4o) only solved 16.3% of the tasks.
- Open Source Models: Llama 3.1 and Mixtral struggled significantly, achieving single-digit accuracy.
- Landmarks: The landmark scores are higher (30-40%), indicating that agents often get the environment set up or start the training, but fail to cross the finish line to get the correct result.
Performance on Sub-Problems (Masked Set)
When we break the tasks down into the Masked Set sub-problems, performance improves, but it reveals specific weaknesses.

Table 6 shows that on focused sub-problems, the success rate jumps to 46.1% for the best agent. This confirms that the sheer length and complexity of end-to-end tasks are a major failure point. When the agent only has to focus on one thing (like fixing a bug), it does much better.
Key Findings from Error Analysis:
- Tools Matter: The ReAct-SUPER agent outperformed the standard ReAct agent largely because of the edit tool. Agents without a reliable way to edit files get stuck trying to use complex sed commands that break the code.
- Type of Error: Agents are surprisingly good at fixing explicit crashes (e.g., “RuntimeError: size mismatch”). They are terrible at “silent failures” or configuration tasks. For example, if asked to “load the first 10 examples,” an agent might hallucinate an n_examples=10 argument for a function that doesn’t exist, rather than digging into the code to find the correct data loader class.
- Reflection Didn’t Help Much: The researchers tried using “Reflexion,” where the agent is asked to reflect on why it failed and try again. As seen in Table 7 below, this only boosted accuracy marginally (from 41.6% to 45.4%). If the model lacks the fundamental reasoning to understand the repository structure, thinking about it longer doesn’t necessarily solve the problem.

Why is this Benchmark Important?
To understand why SUPER is so challenging, we have to look at the repositories it uses. These aren’t the sterile, optimized repositories used in other benchmarks.

Table 8 lists the repositories used. Notice the star counts (column 3). Many have fewer than 20 stars. These are “research grade” repositories—often written by a single graduate student for a specific paper, potentially never updated again.
This is the real world of scientific reproducibility.
Conclusion
The SUPER benchmark serves as a rigorous stress test for the next generation of AI agents. It demonstrates that while LLMs are excellent at writing new code from scratch, they still struggle with the forensic work required to understand, debug, and execute existing, messy codebases.
For students and researchers, this paper highlights a massive opportunity. The gap between 16.3% (current SOTA) and 100% represents the difference between a helpful chatbot and a true AI Research Assistant.
Future progress will likely require agents that are better at:
- Repository-level exploration: Reading and understanding the file structure before acting.
- Tool use: More robust file-editing and debugging tools in particular.
- Handling ambiguity: Figuring out what to do when the documentation says one thing, but the code does another.
Until then, it looks like we’re stuck fixing our own ModuleNotFoundErrors.