The role of a data scientist is often dubbed the “sexiest job of the 21st century.” It requires a unique blend of skills: statistical knowledge, coding proficiency (usually Python or SQL), business acumen, and the ability to wrangle messy, unstructured data into actionable insights.
With the meteoric rise of Large Language Models (LLMs) like GPT-4 and Claude, a burning question has emerged in the tech community: Can we automate data science?
We’ve seen LLMs generate code snippets, write SQL queries, and even debug simple scripts. But real-world data science isn’t just about writing a for loop. It involves exploring a file system, understanding a vague schema, cleaning dirty data, choosing the right machine learning model, and iteratively debugging errors until a chart looks right.
Today, we are diving deep into a research paper that tackles this exact problem. The paper introduces DA-Code, a rigorous benchmark designed to test whether LLMs can truly act as autonomous data science agents. We will explore how this benchmark works, the “DA-Agent” framework built to solve it, and the sobering results that show just how far we have left to go.
The Problem: Why Current Benchmarks Aren’t Enough
Before we unpack DA-Code, we need to understand the gap it fills. Previous code generation benchmarks, like HumanEval or DS-1000, have served us well, but they represent a “narrow” view of coding.
In a traditional benchmark, the task usually looks like this:
- Input: “Write a Python function to calculate the Fibonacci sequence.”
- Output: The model generates a clean, self-contained function.
However, a real data science task looks like this:
- Input: “Here is a folder with five different CSVs and a messy SQLite database. Find out why our sales dropped in Q3 and visualize the trend.”
- Output: The agent must explore the files, write SQL to join tables, handle missing values using Python, debug library version conflicts, and finally save a PNG chart.
The researchers behind DA-Code argue that we need to move from Code Generation to Agent Data Science.

As shown in Table 1 above, DA-Code differentiates itself significantly from predecessors like DS-1000 or Arcade. It requires planning, operates in a controllable executable environment (Docker), handles multiple files per task (averaging 5.7 files), and requires much longer solutions (averaging 85 lines of code).
DA-Code: A Benchmark for Reality
DA-Code is not just a dataset; it is a comprehensive evaluation suite. It consists of 500 hand-curated, challenging task examples derived from real-world scenarios.
The researchers categorized data science work into three main pillars:
- Data Wrangling (DW): The messy work of loading, cleaning, and transforming raw data.
- Machine Learning (ML): Building models for classification, regression, or clustering.
- Exploratory Data Analysis (EDA): Visualizing data, calculating statistics, and deriving insights.

The distribution of these tasks (shown in the pie charts above) ensures a balanced test of an agent’s capabilities. Note the diversity in file types—CSVs, Markdown documentation, YAML configs, and SQL databases. This mirrors a real workspace where data is rarely served on a silver platter.
How Was DA-Code Built?
Creating a benchmark of this complexity is a massive undertaking. The researchers couldn’t just scrape the web; they had to ensure the tasks were solvable yet difficult.

The annotation pipeline, illustrated in Figure 2, involves five rigorous steps:
- Data Selection: Identifying real, complex, and timely datasets (e.g., NYC Taxi data).
- Task Definition: Rewriting simple instructions into abstract, agent-level problems.
- Implementation: Setting up the “Sandbox” environment. This is crucial. The agent isn’t just generating text; it is interacting with a simulated file system containing databases and schemas.
- Evaluation Setup: Writing scripts to automatically check if the agent’s output (tables, charts, or text) matches the ground truth.
- Red Teaming: Having human experts try to break the tasks to ensure robustness.
The Mathematics of Agent Tasks
To understand how an AI agent solves these problems, we need to redefine what a “coding task” is.
In traditional coding benchmarks, the process is a static function mapping context (\(C\)) and instructions (\(I\)) to code:
\[ \text{code} = f(C, I) \]
However, in DA-Code, the process is interactive and iterative. The agent has a “State” (\(S\)), an “Action Space” (\(A\)), and a “Memory” or History (\(H\)).
The agent looks at its current memory (\(m_t\)) and the state of the environment (\(s_t\)) to decide on an action (\(a_{t+1}\)) and generate code:
\[ (a_{t+1}, \text{code}_{t+1}) = f_{\text{agent}}(m_t, s_t) \]
Once the agent acts (e.g., “Run this Python script”), the environment executes it. The environment returns a new observation (\(o_{t+1}\)) and updates the state (\(s_{t+1}\)):
\[ (o_{t+1}, s_{t+1}) = f_{\text{env}}(s_t, a_{t+1}) \]
This loop continues until the task is done or the agent times out. This formalization is vital because it shifts the focus from writing code to navigating a problem space.
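To make the loop concrete, here is a minimal Python sketch of a single agent-environment episode under this formalization. The names (`agent.decide`, `env.execute`, `MAX_STEPS`) are illustrative assumptions, not the actual DA-Agent code.

```python
# A minimal sketch of the interactive agent loop described above.
# Names and interfaces are illustrative, not the DA-Code implementation.

MAX_STEPS = 20

def run_episode(agent, env):
    """Run one task: the agent proposes actions, the environment executes them."""
    state = env.reset()   # initial state s_0 (task instruction, files on disk)
    memory = []           # history H of (action, code, observation) tuples

    for step in range(MAX_STEPS):
        # a_{t+1}, code_{t+1} = f_agent(m_t, s_t)
        action, code = agent.decide(memory, state)

        if action == "Terminate":
            break

        # o_{t+1}, s_{t+1} = f_env(s_t, a_{t+1})
        observation, state = env.execute(action, code)
        memory.append((action, code, observation))

    return env.collect_outputs()  # tables, charts, or text for evaluation
```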
The DA-Agent Framework
To test the benchmark, the researchers developed a baseline agent called DA-Agent.

Figure 1 demonstrates the DA-Agent in action. Notice the workflow:
- Exploration: The agent lists files (`README.md`, `E-commerce.db`) to understand what it’s working with.
- Reasoning: It decides it needs to query the database.
- Action: It writes an SQL query.
- Feedback Loop: It encounters an error (e.g., a column doesn’t exist), inspects the table structure, and tries again.
- Result: It eventually writes a Python script to generate a plot.
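To make the feedback-loop step more concrete, here is a hedged sketch of the kind of schema-inspection script an agent might run against the `E-commerce.db` file after hitting a “no such column” error. The database contents are not shown in the post, so the script simply lists whatever tables and columns exist.

```python
import sqlite3

# Illustrative recovery step: after a "no such column" error, inspect the
# schema of E-commerce.db before rewriting the failed query.
conn = sqlite3.connect("E-commerce.db")
cursor = conn.cursor()

# List all tables in the database.
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tables = [row[0] for row in cursor.fetchall()]
print("Tables:", tables)

# Print the column names of each table so the next query uses real columns.
for table in tables:
    cursor.execute(f"PRAGMA table_info({table});")
    columns = [row[1] for row in cursor.fetchall()]
    print(f"{table}: {columns}")

conn.close()
```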
The Action Space
For an agent to function, it needs a set of tools. DA-Agent is equipped with a specific set of actions that allow it to manipulate the Docker environment.

As listed in Table 5, the agent can use Bash for system operations, Python for complex logic and plotting, and SQL for database interactions. Crucially, the “Terminate” action signals that the agent believes it has finished the task.
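As a rough illustration (the paper’s exact action format is not reproduced here), this action space can be modeled as a handful of typed commands that the sandbox knows how to execute:

```python
from dataclasses import dataclass

# Illustrative action types mirroring Table 5: Bash, Python, SQL, Terminate.
# The real DA-Agent prompt and parsing format may differ.

@dataclass
class Bash:
    command: str      # e.g. "ls data/"

@dataclass
class Python:
    file_path: str    # script to write and run, e.g. "analysis.py"
    code: str

@dataclass
class SQL:
    database: str     # e.g. "E-commerce.db"
    query: str

@dataclass
class Terminate:
    output: str       # final answer or path to the produced artifact
```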
Scoring the Agent: The Evaluation Suite
How do you grade an AI on “Data Science”? It’s harder than grading a multiple-choice test. The researchers developed tailored evaluation metrics for different output types.
1. Table Matching
If the task requires producing a CSV or a database table, the evaluation checks whether the predicted table (\(M'\)) matches the ground-truth table (\(M\)) exactly, often after sorting or filtering specific columns.
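A minimal pandas sketch of this kind of check might look as follows; the function name and signature are assumptions, and the benchmark’s actual scripts may normalize values more aggressively.

```python
import pandas as pd

def tables_match(pred_csv: str, gold_csv: str, key_columns: list[str]) -> bool:
    """Illustrative table comparison: restrict to the graded columns,
    sort rows so order doesn't matter, then require exact equality."""
    pred = pd.read_csv(pred_csv)[key_columns]
    gold = pd.read_csv(gold_csv)[key_columns]

    pred = pred.sort_values(key_columns).reset_index(drop=True)
    gold = gold.sort_values(key_columns).reset_index(drop=True)

    return pred.equals(gold)
```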

2. Chart Matching
Grading a visualization is tricky. Pixel-by-pixel comparison is too fragile. Instead, the researchers use a “Plot-based Evaluation”: they extract the underlying data (\(d\)) and the metadata (\(J\)) (such as the title, axis labels, and colors) from the generated plot script and compare them to the ground truth.
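Here is a hedged sketch of what such a comparison could look like once the data and metadata have been extracted into dictionaries. The extraction step itself (parsing the plotting script) is omitted, and the dictionary structure is an assumption for illustration.

```python
import numpy as np

def chart_matches(pred: dict, gold: dict) -> bool:
    """Illustrative plot comparison. `pred` and `gold` are assumed to look like
    {"data": {"x": [...], "y": [...]}, "meta": {"title": ..., "xlabel": ...}}."""
    # 1) The plotted data d must match (with a small numeric tolerance).
    for key in gold["data"]:
        if key not in pred["data"]:
            return False
        if not np.allclose(pred["data"][key], gold["data"][key], atol=1e-6):
            return False

    # 2) The metadata J (title, axis labels, colors, ...) must match.
    return all(pred["meta"].get(k) == v for k, v in gold["meta"].items())
```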

3. Machine Learning Evaluation
For ML tasks (e.g., “Train a model to predict churn”), the agent’s model is scored against a held-out test set. However, raw scores (like accuracy or MSE) vary wildly between datasets. To make scores comparable, they are normalized between 0 and 1, where 0 corresponds to a baseline performance (like a random guess) and 1 to the state-of-the-art:
\[ s_{\text{norm}} = \frac{s - s_{\text{baseline}}}{s_{\text{best}} - s_{\text{baseline}}} \]
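A small illustrative helper (not the benchmark’s own code) that applies this normalization, including the sign flip needed for metrics where lower is better:

```python
def normalized_score(raw: float, baseline: float, best: float,
                     higher_is_better: bool = True) -> float:
    """Map a raw metric (accuracy, RMSE, ...) to [0, 1], where 0 is the
    baseline and 1 is the strongest reference result. Reference values are
    defined per task; the ones passed in here are up to the caller."""
    if not higher_is_better:            # e.g. RMSE: lower is better
        raw, baseline, best = -raw, -baseline, -best
    score = (raw - baseline) / (best - baseline)
    return max(0.0, min(1.0, score))    # clip to [0, 1]
```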
Depending on the task, they use specific metrics like F1 Score for classification or RMSE for regression:
F1 Score (Harmonic mean of precision and recall):
\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
RMSE (Root Mean Squared Error):
\[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \]
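For concreteness, here is a short illustrative snippet (using scikit-learn, with made-up predictions) showing how these two metrics are typically computed; it is not the benchmark’s own scoring script.

```python
import numpy as np
from sklearn.metrics import f1_score, mean_squared_error

# Classification example: F1 score on made-up labels.
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0]
print("F1:", f1_score(y_true_cls, y_pred_cls))

# Regression example: RMSE as the square root of the mean squared error.
y_true_reg = [3.2, 1.5, 4.8]
y_pred_reg = [3.0, 1.7, 5.1]
print("RMSE:", np.sqrt(mean_squared_error(y_true_reg, y_pred_reg)))
```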
Experiments and Results: The Reality Check
So, how smart are today’s best LLMs when placed in the shoes of a data scientist? The researchers tested models like GPT-4, Claude-3-Opus, and open-source models like Mixtral and DeepSeek-Coder.
The results, summarized in Table 3, are revealing.

Key Takeaways:
- It’s Hard: The best model, GPT-4, only achieved a 30.5% completion rate. This confirms that DA-Code is significantly more difficult than previous benchmarks where models often score 80%+.
- The Open Source Gap: There is a steep drop-off between proprietary models (GPT-4, Claude-3) and open-source models. For example, Mixtral-8x22B only achieved 15.4%.
- Difficulty Levels Matter: Models performed decently on “Easy” tasks (45.4% for GPT-4) but crumbled on “Hard” tasks (23.4%).
Performance by Category
We can also break down performance by the type of data science task.

The radar chart in Figure 3 shows that models generally perform better on Machine Learning tasks (perhaps because standard ML boilerplate code is common in training data) but struggle with Data Wrangling and Data Insight. This is intuitive: cleaning data requires visually inspecting files and making judgment calls about weird formats, which is harder for an LLM than importing scikit-learn.
Analyzing Agent Behavior
The researchers didn’t just look at the final score; they analyzed the trajectory—the step-by-step path the agents took.
The “EEEA” Pattern
Successful agents tend to follow a specific behavioral pattern: Exploration, Execution, Evaluation, Adjustment.
- Exploration: Viewing files (using `ls`, `head`).
- Execution: Writing code.
- Evaluation: Checking the output.
- Adjustment: Debugging errors.
We can see this in the action type counts over time:

Note: In the image above (bottom charts), look at the “File Viewing” (dark blue) bars.
At the beginning of a task (Step 1-5), agents spend a lot of time on File Viewing. As the steps progress, they shift toward Python coding (orange) and System Operations (light blue). Lower-performing models (like DeepSeek-Coder-33B) fail to shift effectively, often getting stuck in loops or failing to parse actions correctly (yellow bars).
Success vs. Steps
Another interesting finding is the relationship between the number of steps and the success rate.

Figure 5 shows that most successful tasks are completed within the first 10 steps. If an agent hasn’t solved the problem by step 15, the likelihood of success plateaus, while the “Incompletion Rate” (tasks not finished) drops as agents either finish or give up. This suggests that planning capability is the bottleneck; if the agent doesn’t have a good plan early on, more steps won’t save it.
Comparison with Other Frameworks
Finally, the researchers compared their DA-Agent baseline against other popular agent frameworks like AutoGen and OpenDevin.

DA-Agent outperformed them (31.5% vs 26.2% for OpenDevin). Interestingly, when the researchers provided a Reference Plan (a human-written guide on how to solve the problem), the score jumped to 39.7%. This reinforces the idea that the core struggle for LLMs right now is high-level reasoning and planning, not just writing syntax.
Conclusion and Future Outlook
The DA-Code paper serves as a reality check for the AI industry. While we often see cherry-picked demos of AI analyzing data, the rigorous testing provided by this benchmark shows that we are far from fully autonomous data scientists.
A 30% success rate on real-world tasks indicates two things:
- Potential: The fact that AI can autonomously solve one-third of complex, multi-file data problems is incredible.
- Room for Growth: The remaining 70% requires improvements in how agents explore environments, debug their own errors, and plan long-term strategies.
DA-Code provides the roadmap for that growth. By moving testing away from simple code completion and into interactive, executable environments, we are pushing LLMs to become not just coders, but true problem solvers.
As models evolve, benchmarks like DA-Code will be the yardstick we use to measure the transition from “AI Assistant” to “AI Data Scientist.”