The prospect of Artificial Intelligence automating its own development is one of the most transformative—and potentially risky—concepts in modern computer science. If an AI system can conduct Research and Development (R&D) to improve itself, we could enter a feedback loop of accelerating capabilities.

But how close are we to that reality? We know Large Language Models (LLMs) can write Python scripts and solve LeetCode problems. However, assessing whether AI could improve itself requires measuring something much harder: Research Engineering.

In a new paper from METR (Model Evaluation and Threat Research), researchers introduce RE-Bench, a rigorous benchmark designed to pit frontier AI agents against human experts in realistic machine learning optimization tasks.

This article breaks down the RE-Bench paper, exploring how the evaluation works, the methodology behind the “human vs. machine” face-off, and the nuanced results that suggest while AI agents are incredibly fast, they still lack the deep problem-solving durability of human experts.

The Problem: Measuring “Real” Research

Current benchmarks for AI coding capabilities, such as SWE-bench, focus on software engineering—fixing bugs or implementing features in standard repositories. While valuable, these tasks don’t capture the specific skills required for AI R&D.

Frontier AI research involves high-level experimentation: deriving scaling laws, optimizing GPU kernels, stabilizing training runs, and debugging complex convergence issues. These tasks are often open-ended, require handling heavy compute resources (like H100 GPUs), and demand a mix of theoretical knowledge and engineering grit.

The authors of RE-Bench set out to solve three specific problems with current evaluations:

  1. Feasibility: Ensuring tasks are actually solvable by humans (many benchmarks contain impossible or broken tasks).
  2. Ecological Validity: The tasks must resemble the actual work done at labs like OpenAI, Anthropic, or DeepMind.
  3. Direct Human Comparison: We can’t judge an AI’s score in a vacuum; we need to know how a human expert performs under the exact same constraints.

RE-Bench: The Setup

RE-Bench consists of seven hand-crafted environments. To ensure a fair fight, the researchers standardized the testing conditions for both AI agents and human participants.

Figure 1: RE-Bench is a collection of seven hand-crafted environments intended to test the AI R&D capabilities of current AI systems. Humans and AI agents are tested under the same conditions.

As shown in Figure 1 above, the setup is rigorous. Whether the “agent” is a human connecting via VSCode or an AI model using a scaffold, they receive:

  • A Goal: E.g., “Speed up this training script” or “Fix this corrupted model.”
  • Resources: Access to a virtual machine with significant compute (1 to 6 H100 GPUs).
  • Time Limit: A strict budget (typically 8 hours).
  • Starting Materials: A codebase that runs but performs poorly.
  • Scoring: A script they can run to check their progress (e.g., validation loss or runtime); a minimal sketch of such a script follows this list.
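To make the interface concrete, here is a minimal sketch of what a runtime-based scoring script along these lines could look like. It is a hypothetical illustration rather than METR's actual harness; the train.py name, the baseline constant, and the score definition are all assumptions.

```python
import json
import subprocess
import time

def score_runtime(baseline_seconds: float = 600.0) -> float:
    """Hypothetical scorer: time one training run and report the speed-up
    over a fixed baseline. A shorter runtime yields a higher score."""
    start = time.time()
    # Run the training script the participant is asked to speed up (assumed name).
    subprocess.run(["python", "train.py"], check=True)
    elapsed = time.time() - start

    result = {"runtime_seconds": elapsed, "score": baseline_seconds / elapsed}
    # Persist the result so both the participant and the graders can inspect it.
    with open("score.json", "w") as f:
        json.dump(result, f, indent=2)
    return result["score"]

if __name__ == "__main__":
    print(f"score: {score_runtime():.3f}")
```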

The Seven Environments

The core of this paper lies in the diversity of the tasks. They cover different stages of the AI pipeline, from low-level kernel optimization to high-level architectural decisions.

Table 2: Summary of environments. Detailed descriptions of the environments and reference solutions can be found in Appendix C.

Here is a breakdown of the challenges presented in Table 2:

  1. Optimize LLM Foundry: The agent must reduce the runtime of a training script without altering the model’s behavior. This tests system engineering and profiling skills.
  2. Optimize a Kernel: A low-level task where the agent must write a custom GPU kernel (using Triton) to compute a prefix sum. This is “hardcore” engineering.
  3. Fix Embedding: The agent is given a model with corrupted embeddings and must recover performance. This tests debugging and hypothesis generation.
  4. Scaling Law Experiment: The agent must predict the optimal hyperparameters for a massive training run by conducting smaller experiments. This mimics the “science” of deep learning (a toy sketch of this kind of extrapolation follows the list).
  5. Restricted Architecture MLM: Building a masked language model without standard primitives (like division or exponentiation). This tests mathematical creativity.
  6. Finetune GPT-2 for QA: Training GPT-2 to act as a question-answering chatbot without ground-truth data, using Reinforcement Learning (RL) or similar methods.
  7. Scaffolding for Rust Codecontests: Writing a Python script that prompts an LLM to solve Rust programming problems. This tests “AI engineering” or prompt engineering skills.
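To give a flavor of the Scaling Law Experiment, here is a toy sketch of the general workflow: fit a power law to results from small runs and extrapolate to a larger budget. The data points and functional form are invented for illustration and do not reflect the benchmark's actual interface or objective.

```python
import numpy as np

# Toy data: (compute in FLOPs, final validation loss) from small runs. Invented values.
compute = np.array([1e17, 3e17, 1e18, 3e18])
loss = np.array([3.10, 2.85, 2.62, 2.41])

# Fit loss ≈ a * compute^(-b) via linear regression in log-log space.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
a, b = np.exp(intercept), -slope

# Extrapolate to a run 100x larger than the biggest experiment.
target_compute = 3e20
predicted_loss = a * target_compute ** (-b)
print(f"fit: loss ≈ {a:.2f} * C^(-{b:.3f})")
print(f"predicted loss at C = {target_compute:.0e}: {predicted_loss:.2f}")
```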

Scoring Methodology

Because every task has a different metric (seconds, accuracy, loss), the researchers normalize the scores.

\[
y_n = \frac{y - y_s}{y_r - y_s}
\]

The normalized score (\(y_n\)) sets the Starting Solution (\(y_s\)) at 0 and the Reference Solution (\(y_r\), a strong solution written by the task authors) at 1. A score of 1.0 therefore means the agent matched the reference solution, and scores can exceed 1.0 if the agent finds a better solution than the authors did.
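A minimal sketch of the normalization in code (variable names are mine; the paper may apply additional transformations to raw metrics before this step):

```python
def normalize_score(raw: float, starting: float, reference: float) -> float:
    """Map a raw metric so the starting solution scores 0 and the reference scores 1.

    The same formula works whether the metric is higher-is-better (accuracy)
    or lower-is-better (loss, runtime), because the denominator flips sign too.
    """
    return (raw - starting) / (reference - starting)

# Example with a lower-is-better metric (validation loss):
# starting solution 4.0, reference solution 2.5, agent's best submission 2.8.
print(normalize_score(raw=2.8, starting=4.0, reference=2.5))  # 0.8
```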

The Human Baselines

One of the paper’s strongest contributions is the human data. The researchers didn’t just guess how long a task would take; they recruited 61 experts to perform 71 runs.

These weren’t random participants. They included graduate students from top universities (MIT, Stanford, Berkeley) and professionals with experience at major labs (DeepMind, OpenAI).

Table 5: Summary statistics for our human expert baseline runs.

As Table 5 indicates, the experts from professional networks performed significantly better (average score 0.98) than applicants from general hiring pipelines. This distinction is vital: to automate frontier research, an AI needs to match the top-tier experts, not just the average engineer.

Figure 4: Human expert score over time spent across the 7 AI R&D environments.

Figure 4 illustrates the trajectory of human efforts. Notice the curve: humans start slow. They spend time reading code, understanding the problem, and setting up. But once they start submitting, their scores rise reliably. Crucially, 82% of expert attempts achieved a score above zero, showing the tasks are difficult but solvable.

Experiments: The Tortoise and the Hare

The researchers tested two primary model families: Claude 3.5 Sonnet (Anthropic) and o1-preview (OpenAI). They utilized two different “scaffolds” (the code wrapping the LLM to give it access to tools):

  1. Modular: A general-purpose scaffold allowing file editing and command execution (a simplified sketch of this style of loop appears below).
  2. AIDE: A more specialized scaffold designed for data science, capable of tree-search over solutions.
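To picture what a scaffold involves, here is a heavily simplified sketch of a Modular-style agent loop: the model proposes a shell command, the scaffold executes it in the workspace, and the output is appended to the transcript. The query_model stub, the stopping convention, and the truncation limit are assumptions; real scaffolds add file-editing tools, token accounting, and error handling.

```python
import subprocess

def query_model(transcript: list[dict]) -> str:
    """Stand-in for a call to an LLM API that returns the next shell command.
    A real scaffold would send the transcript to a model provider here."""
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 50) -> None:
    transcript = [{"role": "system",
                   "content": f"You are an agent working on: {goal}. "
                              "Reply with a single shell command, or DONE to finish."}]
    for _ in range(max_steps):
        command = query_model(transcript).strip()
        if command == "DONE":
            break
        # Execute the proposed command in the task workspace and capture its output.
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=600)
        observation = (result.stdout + result.stderr)[-4000:]  # keep the tail only
        transcript.append({"role": "assistant", "content": command})
        transcript.append({"role": "user", "content": observation})

# Example usage (after wiring query_model to a real API):
# run_agent("Reduce the runtime of train.py without changing model behavior.")
```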

Results by Time Spent

The most striking finding is the difference in working styles between humans and AI.

Figure 5: Average (normalized) score achieved by time spent across the 7 environments.

Look closely at Figure 5.

  • The AI (Blue/Orange Lines): AI agents fly out of the gate. Within the first 30 minutes to 2 hours, they often have working solutions that score reasonably well.
  • The Human (Black Line): Humans are slower to start. However, look at the crossover point around the 2-hour mark.

While AI agents plateau, humans continue to optimize, debug, and improve. By the 8-hour mark, the average human expert overtakes the best AI agents. This suggests that current agents excel at “low-hanging fruit”—applying standard fixes or optimizations—but struggle with the “grind” of iterative research where initial attempts fail and deep debugging is required.

Results by Sampling (Best-of-K)

Since AI is cheaper than human labor, what if we let the AI try 10, 20, or 100 times and just take the best result? This is known as Best-of-k evaluation.
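Concretely, best-of-k takes the maximum score over k independent attempts, and the paper's comparison holds the total time budget fixed, so more samples mean shorter individual runs. Here is a minimal sketch of that bookkeeping, using made-up per-run scores; the paper's estimator may differ in detail.

```python
import numpy as np

def best_of_k(scores: np.ndarray, k: int, trials: int = 10_000,
              rng: np.random.Generator | None = None) -> float:
    """Estimate the expected best score over k independently sampled runs."""
    rng = rng or np.random.default_rng(0)
    samples = rng.choice(scores, size=(trials, k), replace=True)
    return samples.max(axis=1).mean()

# Hypothetical per-run scores for 30-minute agent attempts on one environment.
half_hour_scores = np.array([0.0, 0.0, 0.1, 0.2, 0.2, 0.3, 0.5])

# With an 8-hour total budget split into 16 half-hour attempts:
print(best_of_k(half_hour_scores, k=16))
```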

Figure 2: Best score@k results observed by total time budget across different allocations between samples and time horizon.

Figure 2 is arguably the most important chart in the paper. It compares performance given a specific total time budget (summed across parallel attempts).

  • At 2 Hours Total Budget: AI dominates. The best AI agents achieve scores 4x higher than humans. If you need a quick-and-dirty fix immediately, AI wins.
  • At 8 Hours Total Budget: Humans narrowly exceed the top AI scores.
  • At 32 Hours Total Budget: The human advantage grows. Humans achieve 2x the score of the top AI agent.

This reveals a critical limitation in current agents: throwing more time at the problem (via longer context windows or more attempts) yields diminishing returns compared to a human’s ability to learn and adapt over a long session.

The Cost Factor

However, performance isn’t the only metric. There is also economics.

Figure 11: Best score@k results by token cost for AI agents and by money paid to human experts.

Figure 11 presents the results by cost. The x-axis is logarithmic.

  • AI agents cost roughly $123 per 8-hour run.
  • Human experts cost roughly $1,855 for the same period.

For the price of one human expert day, you could run an army of agents. This suggests that even if agents are individually less capable, they may already be economically viable for “brute forcing” research directions.
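As a back-of-the-envelope check on that claim, using the figures above (actual costs vary with model choice and token usage):

```python
agent_cost_per_run = 123    # USD for roughly one 8-hour agent run (article's figure)
human_cost_per_run = 1855   # USD paid to a human expert for the same period

runs_per_human_budget = human_cost_per_run / agent_cost_per_run
print(f"~{runs_per_human_budget:.0f} agent attempts per human expert day")  # ~15
```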

Qualitative Analysis: Brilliance and Stupidity

The numbers tell us how the agents performed, but the qualitative analysis tells us why. The researchers found that agents possess broad knowledge but brittle reasoning.

The “Superhuman” Successes

In the Optimize a Kernel task, agents produced shocking results. Writing Triton kernels is notoriously difficult and requires niche knowledge that even many human experts lack.

Figure 18: Best agent solution for Optimize a Kernel. The agent implements custom CUDA kernels and creates a fast solution that beats some human experts.

As shown in Figure 18, the o1-preview agent wrote a custom kernel that ran significantly faster than the reference solution, effectively “beating” the game. It utilized complex memory optimizations that human experts in the study (who were general ML researchers, not necessarily GPU optimization specialists) failed to find. This highlights a key strength of LLMs: they have read all the documentation, including the obscure parts.
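For readers unfamiliar with Triton, here is a minimal single-block prefix-sum kernel to show what the raw material looks like. It assumes a recent Triton release that provides tl.cumsum, only handles inputs that fit in a single block, and is nowhere near the optimized, memory-aware solution the agent produced; it is purely illustrative.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def prefix_sum_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Single program instance: load the whole (small) array, scan it, store it.
    offsets = tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
    y = tl.cumsum(x, axis=0)  # inclusive prefix sum within the block
    tl.store(out_ptr + offsets, y, mask=mask)

def prefix_sum(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    block = triton.next_power_of_2(x.numel())
    prefix_sum_kernel[(1,)](x, out, x.numel(), BLOCK_SIZE=block)
    return out

x = torch.rand(1024, device="cuda")
# Compare against PyTorch's built-in cumulative sum; the difference should be ~0.
print((prefix_sum(x) - torch.cumsum(x, dim=0)).abs().max())
```

The benchmark task is harder than this: the kernel must handle large inputs, manage memory movement carefully, and beat a tuned baseline, which is where the agent's obscure-documentation knowledge paid off.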

The Frustrating Failures

Conversely, agents struggle with “long-horizon agency”—the ability to maintain a plan, notice when it’s failing, and pivot.

Figure 15: A Claude 3.5 Sonnet agent assumes that retraining the embeddings is out of scope, despite having access to an H100 for 8 hours.

In the Fix Embedding task (Figure 15), one agent correctly identified that the embeddings were broken. However, it stubbornly assumed that retraining them was “out of scope” or impossible, despite having 8 hours of H100 compute available. It spent the entire session trying workarounds that were doomed to fail.

A human expert, upon seeing a workaround fail, would likely question their assumptions. The agent, however, got stuck in a loop of false premises.

Similarly, agents struggled with resource management. If a leftover process refused to die (colloquially, a “zombie process”) and kept hold of GPU memory, agents would often fail to kill it, leading to repeated “Out of Memory” errors. Rather than simply running a kill command, they would then shrink the model to fit, degrading performance.
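For context, the recovery a human would reach for is usually a couple of commands. A hedged sketch of that cleanup, assuming nvidia-smi is on the PATH and the agent has permission to kill the offending processes:

```python
import os
import signal
import subprocess

# List the PIDs of processes currently holding GPU memory.
out = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    pid = int(line.strip())
    if pid != os.getpid():            # don't kill our own process
        os.kill(pid, signal.SIGKILL)  # free the GPU memory held by the stray process
```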

Comparison by Environment Type

Figure 9: Comparison of agent vs human performance across the 7 environments.

Figure 9 breaks down the performance gap by specific task.

  • Agents win or tie: “Optimize a Kernel” (Niche knowledge wins).
  • Humans dominate: “Restricted Architecture MLM” and “Optimize LLM Foundry.” These tasks likely required more creative engineering, or navigating complex, interdependent codebases where a single small bug breaks everything.

Implications for the Future of AI R&D

The RE-Bench paper provides a sobering yet exciting snapshot of current capabilities.

The verdict: AI agents cannot yet fully automate the role of a Machine Learning Research Engineer. They lack the durability, debugging intuition, and “common sense” to navigate the messiness of an 8-hour research struggle.

However, the gap is narrower than many might expect, particularly in shorter timeframes.

  1. Speed: Agents generate solutions 10x faster than humans.
  2. Cost: Agents are an order of magnitude cheaper.
  3. Knowledge: Agents possess encyclopedic knowledge of niche libraries (like Triton) that can surpass generalist human experts.

The authors conclude that while we haven’t reached the point of automated self-improvement, the trajectory is clear. As context windows grow and scaffolding improves (allowing agents to better “think” before they act or manage their memory), the human advantage in the 8-hour window may erode.

For students and researchers today, this suggests a shift in how we work. The future isn’t necessarily AI replacing researchers immediately, but researchers acting as “managers” for fleets of cheap, fast agents that handle the initial brute-force experimentation, leaving the deep, structural thinking to the humans—at least for now.