If you have ever tried to reproduce the results of a machine learning paper from a GitHub repository, you know the struggle. It usually goes something like this: you clone the repo, install the requirements, and immediately get a ModuleNotFoundError. You fix that, downgrade three packages, and then get a CUDA version mismatch. Three hours later, you haven’t even started the training script.

As Large Language Models (LLMs) become more proficient at writing code, a tantalizing question arises: Can we use AI agents to automate this messy, frustrating process?

Can an LLM autonomously clone a research repository, figure out the undocumented dependencies, configure the data loaders, and reproduce the paper’s results? This is the question tackled by researchers from the Allen Institute for AI and the University of Washington in their new paper, SUPER.

In this post, we will dive deep into the challenges of “in-the-wild” research code, explore the unique architecture of the SUPER benchmark, and analyze why even the most powerful models like GPT-4 struggle to solve these tasks.

The Problem: Research Code is “Wild”

Previous benchmarks for AI coding agents (like HumanEval or ML-Bench) often focus on well-defined problems or popular, well-maintained libraries. If you ask an agent to “train a ResNet using PyTorch,” it has seen thousands of tutorials on that exact topic during its pre-training.

However, real research code is different. It is often:

  • Low-profile: Repositories might have very few stars and minimal community support.
  • Undocumented: requirements.txt files are often missing or outdated.
  • Complex: Running an experiment requires modifying multiple files, handling data paths, and managing distinct “setup” and “execution” phases.

The researchers identified that the most difficult part of reproducing experiments isn’t just writing the code—it’s setting up the environment and executing the task.

Figure 1: An illustration of a research task and some of the steps an agent would need to complete it.

As shown in Figure 1, a successful agent can’t just run a script. It must do all of the following (a code sketch follows the list):

  1. Set up: Install dependencies and download datasets (often from Google Drive or obscure links).
  2. Resolve Issues: Handle version conflicts (e.g., the code relies on an old version of transformers but doesn’t say so).
  3. Execute & Report: Run the training, parse the logs, and report specific metrics.
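To make these steps concrete, here is a minimal sketch of the kind of trajectory an agent might execute, written as plain Python driving a shell. The repository URL, Google Drive ID, dependency pin, and log format are all invented for illustration; they are not taken from the paper.

```python
import re
import subprocess

def run(cmd: str) -> str:
    """Run a shell command and return its combined output, roughly what an agent's shell tool sees."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

# 1. Set up: clone the repo, install dependencies, fetch the dataset (hypothetical links).
run("git clone https://github.com/example/paper-repo")
run("pip install -r paper-repo/requirements.txt")
run("gdown --id FAKE_DRIVE_ID -O paper-repo/data/train.csv")  # data often lives on Google Drive

# 2. Resolve issues: pin an undocumented dependency version the code silently assumes.
run("pip install 'transformers==4.21.0'")

# 3. Execute & report: run training, then parse the metric out of the logs.
log = run("cd paper-repo && python train.py --dataset mrpc")
match = re.search(r"eval_f1\s*[:=]\s*([0-9.]+)", log)
print("Reported F1:", match.group(1) if match else "not found in logs")
```

In the benchmark itself the agent issues these commands interactively inside a notebook, observing the output of each step before deciding on the next one.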

To test this capability, the authors introduced SUPER (Setting UP and Executing tasks from Research repositories).

The SUPER Benchmark Structure

SUPER is not just a collection of coding questions; it is a simulation of the graduate student experience. The benchmark is divided into three distinct sets, each serving a different purpose in evaluating the agent.

Table 2: The different sets of SUPER.

1. The Expert Set

This is the “final exam.” It consists of 45 end-to-end problems manually curated from research papers published in 2021 or later.

  • Source: Repositories found on “Papers With Code.”
  • Condition: These are “in-the-wild” repositories, meaning they are not the famous, polished libraries everyone uses.
  • Task: The agent is given a high-level goal (e.g., “Train this model on the MRPC dataset and report the F1 score”); a sketch of what such a task entry might look like appears after this list.
  • Gold Standard: Human experts (freelance developers) actually solved these tasks first to ensure they are solvable and to provide a ground truth for evaluation.
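For a feel of the format, here is a rough sketch of what one Expert-set problem might contain. The field names and values are illustrative assumptions, not the benchmark’s actual schema.

```python
# Hypothetical Expert-set entry; the field names and values are invented for illustration.
expert_task = {
    "repo": "https://github.com/example/low-profile-paper-code",  # an "in-the-wild" repository
    "task": "Train the model on the MRPC dataset and report the evaluation F1 score.",
    "gold_answer": {"f1": 0.87},  # obtained from the human expert's own run of the repository
}
```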

2. The Masked Set

Evaluating end-to-end tasks is harsh. If an agent does 90% of the work correctly but fails the final execution, it gets a zero. To understand where agents fail, the researchers created the Masked Set.

They took the expert solutions and “masked” (removed) specific chunks of code to create 152 sub-problems. This is similar to a “cloze test” in linguistics (fill-in-the-blank), but for repository execution.

Figure 3: An abstract demonstration of how sub-problems are extracted.

As illustrated in Figure 3, a masked problem provides the agent with a “pre-executed” environment (e.g., dependencies are already installed) and asks it to solve a specific remaining hurdle, such as:

  • Dependency configuration: Fixing version conflicts.
  • Data configuration: Updating a dataloader for a new CSV file.
  • Issue resolution: Fixing a runtime bug.

This allows the researchers to isolate specific capabilities. Is the agent bad at coding, or just bad at installing libraries?
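To make the cloze analogy concrete, here is a minimal sketch of how a sub-problem could be carved out of a gold solution. The notebook cells, categories, and helper below are invented for illustration; the paper’s actual extraction pipeline is more involved.

```python
# Invented gold-solution cells; each string stands in for one notebook cell.
gold_solution_cells = [
    "!git clone https://github.com/example/paper-repo",        # setup
    "!pip install -r paper-repo/requirements.txt",             # setup
    "!pip install 'transformers==4.21.0'",                     # dependency configuration (version fix)
    "df = pd.read_csv('data/new_split.csv', sep='\\t')",        # data configuration
    "!python paper-repo/train.py --data data/new_split.csv",   # execution
]

def make_masked_problem(cells, mask_index):
    """Hide one gold cell: the agent starts from the pre-executed prefix and must supply the gap."""
    return {
        "pre_executed": cells[:mask_index],    # environment state the agent inherits
        "masked": cells[mask_index],           # gold code removed from the solution
        "remaining": cells[mask_index + 1:],   # steps run afterwards to check the outcome
    }

# e.g., a "data configuration" sub-problem: everything up to the dataloader fix is already done.
problem = make_masked_problem(gold_solution_cells, mask_index=3)
```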

3. The Auto Set

Finally, because human curation is expensive, the researchers used GPT-4 to generate 604 additional tasks automatically. While these don’t have human-verified “gold” solutions, they serve as a large-scale playground for training and development.

The construction pipeline for these sets is visualized below:

Figure 2: An overview of the construction pipeline for the Expert and Masked sets.

Evaluation: How Do We Score an AI Researcher?

Scoring code generation is usually done with unit tests. However, you can’t easily unit test a messy research repo. Instead, SUPER uses two main metrics (sketched in code after the list):

  1. Accuracy (Solution-based): The agent must report a specific metric (e.g., “Accuracy: 0.27”). If the reported number matches the gold solution (within a margin of error), the task is successful. This is the hardest metric.
  2. Landmarks: This is a partial credit system. The researchers identified specific strings in the output logs that indicate progress (e.g., “Loading data… 100%” or “*** training completed ***”). Even if the final number is wrong, hitting landmarks shows the agent is on the right track.
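Here is a rough sketch of these two checks, assuming a simple numeric tolerance and regex-based landmarks; the benchmark’s exact matching rules are simplified.

```python
import re

def solution_correct(reported: float, gold: float, tolerance: float = 0.01) -> bool:
    """Solution-based accuracy: the reported metric must match the gold value within a margin."""
    return abs(reported - gold) <= tolerance

def landmark_score(log: str, landmarks: list[str]) -> float:
    """Partial credit: the fraction of expected landmark patterns found in the execution log."""
    hits = sum(bool(re.search(pattern, log)) for pattern in landmarks)
    return hits / len(landmarks)

log = "Loading data... 100%\n*** training completed ***\neval_accuracy = 0.31"
print(solution_correct(reported=0.31, gold=0.27))                           # False: off by 0.04
print(landmark_score(log, [r"Loading data.*100%", r"training completed"]))  # 1.0: both landmarks hit
```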

The Agents

The paper evaluates several agent architectures, but the most interesting comparison is between ReAct and ReAct-SUPER (and a third external baseline, SWE-Agent).

The agents operate in a Jupyter Notebook environment. This is a crucial design choice. Unlike a simple terminal, a Notebook allows the agent to maintain state (variables) between cells and mix shell commands (!pip install) with Python code.

  • ReAct: A standard agent that thinks, acts, and observes. It uses basic bash commands.
  • ReAct-SUPER: The researchers improved the standard agent by giving it an edit tool. Standard agents struggle to edit files using command-line tools like sed. The edit tool allows the agent to replace specific lines of code in a file reliably, mimicking how a human uses an IDE (a rough sketch follows).
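As a rough idea of what such a tool does, here is a minimal sketch assuming an interface that swaps out a 1-indexed line range in a file; the actual tool exposed to the agents may differ.

```python
from pathlib import Path

def edit(path: str, start_line: int, end_line: int, new_text: str) -> None:
    """Replace lines start_line..end_line (1-indexed, inclusive) of a file with new_text."""
    lines = Path(path).read_text().splitlines(keepends=True)
    replacement = [line + "\n" for line in new_text.splitlines()]
    lines[start_line - 1:end_line] = replacement
    Path(path).write_text("".join(lines))

# e.g., point a hard-coded cluster path at a local data directory (hypothetical file and line):
# edit("paper-repo/config.py", 12, 12, "DATA_DIR = './data'")
```

Compared with quoting-heavy sed one-liners, a structured call like this is much harder for the model to get subtly wrong.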

Experimental Results: A Reality Check

So, can GPT-4 reproduce your research? The short answer is: Not yet.

The results on the Expert Set (the full end-to-end tasks) are humbling.

Table 4: Results on Expert, with GPT-4o numbers averaged across 3 seeds.

As shown in Table 4:

  • GPT-4o (State-of-the-Art): The best performing agent (SWE-Agent using GPT-4o) only solved 16.3% of the tasks.
  • Open Source Models: Llama 3.1 and Mixtral struggled significantly, achieving single-digit accuracy.
  • Landmarks: The landmark scores are higher (30-40%), indicating that agents often get the environment set up or start the training, but fail to cross the finish line to get the correct result.

Performance on Sub-Problems (Masked Set)

When we break the tasks down into the Masked Set sub-problems, performance improves, but it reveals specific weaknesses.

Table 6: Results of our baselines on SUPER (Masked) with different underlying LLMs.

Table 6 shows that on focused sub-problems, the success rate jumps to 46.1% for the best agent. This confirms that the sheer length and complexity of end-to-end tasks are a major failure point. When the agent only has to focus on one thing (like fixing a bug), it does much better.

Key Findings from Error Analysis:

  1. Tools Matter: The ReAct-SUPER agent outperformed the standard ReAct agent largely because of the edit tool. Agents without a reliable way to edit files get stuck trying to use complex sed commands that break the code.
  2. Type of Error: Agents are surprisingly good at fixing explicit crashes (e.g., “RuntimeError: size mismatch”). They are terrible at “silent failures” or configuration tasks. For example, if asked to “load the first 10 examples,” an agent might hallucinate an n_examples=10 argument for a function that doesn’t exist, rather than digging into the code to find the correct data loader class (a toy contrast is sketched after this list).
  3. Reflection Didn’t Help Much: The researchers tried using “Reflexion,” where the agent is asked to reflect on why it failed and try again. As seen in Table 7 below, this only boosted accuracy marginally (from 41.6% to 45.4%). If the model lacks the fundamental reasoning to understand the repository structure, thinking about it longer doesn’t necessarily solve the problem.
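A toy contrast for the second finding, with an invented dataloader class; neither snippet comes from a real repository.

```python
class PaperDataset:
    """A repo-specific loader with no built-in way to limit how many examples it reads."""
    def __init__(self, path: str):
        self.examples = [f"example_{i}" for i in range(1000)]  # stand-in for actually reading `path`

# Hallucinated fix: there is no `n_examples` argument, so this raises a TypeError.
# data = PaperDataset("data/train.jsonl", n_examples=10)

# Fix that requires reading the dataloader: load everything, then truncate the examples directly.
data = PaperDataset("data/train.jsonl")
data.examples = data.examples[:10]
```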

Table 7: Results of the ReAct-SUPER agent (using GPT-4o) with and without Reflexion on the Masked set.

Why is this Benchmark Important?

To understand why SUPER is so challenging, we have to look at the repositories it uses. These aren’t the sterile, optimized repositories used in other benchmarks.

Table 8: Details of the 45 repositories used in SUPER along with GitHub link and star information.

Table 8 lists the repositories used. Notice the star counts (column 3). Many have fewer than 20 stars. These are “research grade” repositories—often written by a single graduate student for a specific paper, potentially never updated again.

This is the real world of scientific reproducibility.

Conclusion

The SUPER benchmark serves as a rigorous stress test for the next generation of AI agents. It demonstrates that while LLMs are excellent at writing new code from scratch, they still struggle with the forensic work required to understand, debug, and execute existing, messy codebases.

For students and researchers, this paper highlights a massive opportunity. The gap between 16.3% (current SOTA) and 100% represents the difference between a helpful chatbot and a true AI Research Assistant.

Future progress will likely require agents that are better at:

  1. Repository-level exploration: Reading and understanding the file structure before acting.
  2. Tool use: Specifically, robust file editing and debugging tools.
  3. Handling ambiguity: Figuring out what to do when the documentation says one thing, but the code does another.

Until then, it looks like we’re stuck fixing our own ModuleNotFoundErrors.