Introduction: The Detective Work of AI Development

Imagine managing a team of software developers, researchers, and data analysts. You assign them a complex project—say, analyzing the housing market in a specific city—and wait for the results. But when the report comes back, it’s wrong. The data is hallucinated, or the code failed to execute. Now, you have to figure out who on the team dropped the ball and when exactly things went south. Was it the analyst who pulled the wrong file? The coder who wrote a buggy script? Or the manager who gave unclear instructions?

Now, replace those human workers with Large Language Model (LLM) agents. This is the reality of LLM Multi-Agent Systems. These systems are incredibly powerful, capable of collaborating to solve complex tasks in coding, science, and math. However, like human teams, they fail. And when they do, debugging them is a nightmare.

Traditionally, finding the root cause of a failure—a process called failure attribution—requires a human expert to read through hundreds of lines of interaction logs. It is slow, expensive, and unscalable.

But what if AI could debug AI?

In a recent paper titled “Which Agent Causes Task Failures and When?”, researchers propose a new framework for automated failure attribution. They introduce a massive dataset, Who&When, and evaluate different algorithms to see if LLMs can identify their own mistakes.

Figure 1: Comparison of manual versus automated failure attribution in LLM multi-agent systems, operating on failure logs.

As shown in Figure 1, the goal is to shift from costly, labor-intensive manual attribution to an efficient, automated pipeline that can pinpoint the “Decisive Error Step.”

The Multi-Agent Ecosystem

To understand how we find errors, we first need to understand the system architecture. The researchers model the multi-agent system as a turn-based environment.

\[ \mathcal{M} = \big\langle \mathcal{N}, S, A, P, \phi \big\rangle . \]

Here, \(\mathcal{N}\) represents the set of agents (the team), while \(S\), \(A\), and \(P\) are the usual states, actions, and transition dynamics. At any given time step, exactly one agent takes an action based on the current state. The system produces a “trajectory”—a history of states and actions—that eventually leads to a result.

The core problem isn’t just that the result is wrong; it’s identifying the specific point in time where the trajectory was doomed.
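
To make this concrete, here is a minimal sketch (not from the paper) of how such a turn-based trajectory might be represented; the class and field names are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    """One turn in the trajectory: which agent acted, and what it did."""
    t: int        # time step index
    agent: str    # name of the agent in N that acted at this step
    state: str    # (summary of) the state the agent observed
    action: str   # the action it took, e.g. a message or tool call

@dataclass
class Trajectory:
    """A full run of the multi-agent system, ending in success or failure."""
    steps: List[Step] = field(default_factory=list)
    failed: bool = False   # Z(tau) = 1 means the task failed

    def append(self, agent: str, state: str, action: str) -> None:
        self.steps.append(Step(t=len(self.steps), agent=agent, state=state, action=action))
```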

Defining the “Decisive Error”

The researchers introduce a formal definition for what counts as the true cause of failure. They call this the Decisive Error.

In a failed trajectory (where the final outcome \(Z(\tau) = 1\), representing failure), there might be many small mistakes. However, a decisive error is the specific action that, if fixed, would flip the outcome from failure to success.

Mathematically, they define an intervention on the trajectory:

\[ \tau^{(i,t)} = \mathcal{T}_{(i,t)}(\tau) , \]

If we intervene at step \(t\) on agent \(i\)'s action (replacing the bad action with a corrected one) and the system subsequently succeeds, then that specific agent-step pair is a culprit.

\[ \Delta_{i,t}(\tau) = \begin{cases} 1, & \text{if } Z(\tau) = 1 \text{ and } Z\big(\tau^{(i,t)}\big) = 0, \\ 0, & \text{otherwise.} \end{cases} \]

The objective of automated failure attribution is to find the pair \((i^*, t^*)\) that represents the earliest moment this decisive error occurred.

\[ \begin{aligned} \mathcal{C}(\tau) &= \{\, (i, t) \mid \Delta_{i,t}(\tau) = 1 \,\}, \\ (i^*, t^*) &= \arg\min_{(i, t) \in \mathcal{C}(\tau)} t . \end{aligned} \]
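
As a sketch of what this definition asks for, the snippet below builds on the `Trajectory` class above and assumes a hypothetical `replay_with_fix` oracle that re-runs the system with the action at a given step corrected. In practice that oracle is exactly what is expensive, which is why the paper turns to LLM judges rather than exhaustive counterfactual replay.

```python
from typing import Callable, Optional, Tuple

def earliest_decisive_error(
    traj: Trajectory,
    replay_with_fix: Callable[[Trajectory, int], bool],
) -> Optional[Tuple[str, int]]:
    """Return the earliest (agent, step) whose correction flips failure into success.

    `replay_with_fix(traj, t)` is a hypothetical oracle: it re-runs the system
    with the action at step t replaced by a corrected one and reports whether
    the task then succeeds, i.e. whether Z(tau^(i,t)) = 0.
    """
    if not traj.failed:        # decisive errors are only defined for failed runs
        return None
    for step in traj.steps:    # scanning forward makes the first hit the argmin over t
        if replay_with_fix(traj, step.t):
            return step.agent, step.t
    return None
```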

The Who&When Benchmark

You cannot train a detective without crime scenes. To make automated failure attribution possible, the researchers built the Who&When dataset.

This dataset contains failure logs from 127 different multi-agent systems, including both algorithmically generated teams and hand-crafted systems like Microsoft’s Magentic-One. These aren’t toy problems; they are failures derived from real benchmarks like GAIA and AssistantBench.

The Cost of Ground Truth

Creating this dataset was an immense effort. Human experts had to manually analyze logs, understand the logic of multiple agents, and pinpoint the exact line where things went wrong.

Figure 2: Statistics of the annotation process, showing high labor costs and initial disagreement rates.

As Figure 2 illustrates, this is hard work. It took over 30 hours per annotator to label these logs, with significant disagreement rates initially. This difficulty underlines exactly why we need automation—if it’s hard for humans, it’s unscalable for developers.

Below is an example of what these logs look like. You can see how an automated system (or a human) must parse the dialogue to find that the “Assistant” made a mistake at step 82 with an incorrect navigation action.

A task example from Who&When, where failure-responsible agents and their corresponding error steps are annotated.

The dataset covers a wide range of complexities. As shown in Figure 8 below, the logs span everything from short, single-agent interactions to massive logs involving up to five agents and over 100 steps.

Figure 8: The number of agents involved and the total length of each failure log in the Who&When dataset.

Core Methods: Three Ways to Automate Debugging

With the dataset in hand, the researchers tested three distinct strategies for using LLMs to identify failures. Each method represents a different trade-off between context, precision, and cost.

1. All-at-Once

In this method, the LLM acts like a reviewer reading a novel. You feed the entire failure log to the model in one go and ask: “Who failed and when?”

  • Pros: The model sees the full context. It knows how the story ends, which helps in understanding which agent steered the ship off course.
  • Cons: “Needle in a haystack” problem. LLMs often struggle to point to a specific line number within a massive block of text.
  • Cost: This is the cheapest method, requiring only one API call over the full log of \(n\) steps: \[ \mathrm{Cost}_{\text{all-at-once}} = C + n \cdot T_l , \] where \(C\) is the fixed per-call prompt overhead and \(T_l\) the average number of tokens per log step.
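
A minimal sketch of the all-at-once strategy follows. `call_llm` is a placeholder for whatever chat-completion client you use, and the prompt wording is illustrative rather than the paper's.

```python
import json
from typing import Callable, Dict, List

def attribute_all_at_once(call_llm: Callable[[str], str], task: str, log: List[str]) -> Dict:
    """One LLM call over the entire failure log, asking for agent and step together."""
    prompt = (
        "The following multi-agent conversation failed to solve the task.\n"
        f"Task: {task}\n\n"
        + "\n".join(f"[step {t}] {line}" for t, line in enumerate(log))
        + "\n\nWhich agent made the decisive error, and at which step? "
        'Reply with JSON: {"agent": "...", "step": 0}'
    )
    # Single call: cost is roughly C + n * T_l. Assumes the model returns valid JSON.
    return json.loads(call_llm(prompt))
```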

2. Step-by-Step

Here, the LLM acts like a code interpreter. It reads the log line by line (or step by step). At every single step, it asks: “Is this action correct? Does it lead to failure?”

  • Pros: High precision. By focusing on one step at a time, the model is less likely to hallucinate the step number.
  • Cons: Tunnel vision. It lacks the “future” context. It might flag an action as weird that actually turns out to be brilliant ten steps later.
  • Cost: Very expensive. The cost scales quadratically because the accumulated context grows with every step: \[ \mathrm{Cost}_{\text{step-by-step}} = \sum_{i=1}^{i^*} \left( C + i \cdot T_l \right) = i^* \cdot C + T_l \cdot \frac{i^* (i^* + 1)}{2} , \] where \(i^*\) is the step at which the procedure stops.
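
Here is a corresponding sketch of the step-by-step judge, again with `call_llm` as a placeholder and an assumed `"Agent: message"` log format.

```python
from typing import Callable, Dict, List, Optional

def attribute_step_by_step(call_llm: Callable[[str], str], task: str, log: List[str]) -> Optional[Dict]:
    """Walk the log one step at a time and stop at the first step judged to be the error."""
    history: List[str] = []
    for t, line in enumerate(log):
        history.append(f"[step {t}] {line}")
        prompt = (
            f"Task: {task}\n\nConversation so far:\n" + "\n".join(history)
            + f"\n\nDoes step {t} contain an error that will cause the task to fail? "
            "Answer 'yes' or 'no'."
        )
        if call_llm(prompt).strip().lower().startswith("yes"):
            agent = line.split(":", 1)[0]        # assumes each line looks like "Agent: message"
            return {"agent": agent, "step": t}
    return None  # nothing flagged; the context grows on every call, hence the quadratic cost
```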

3. Binary Search

This method applies a classic computer science algorithm to debugging. The LLM looks at the whole log and decides: “Is the error in the first half or the second half?” It then splits the log accordingly and repeats the process until it isolates the error.

  • Pros: Balances context and efficiency.
  • Cons: If the model makes one wrong turn (e.g., says “First Half” when the error is in the second), it creates a cascading failure and misses the error entirely.
  • Cost: Moderate. The number of calls grows logarithmically with the log length: \[ \mathrm{Cost}_{\text{binary search}} = \sum_{i=1}^{\lceil \log_2 n \rceil} \left( C + \frac{n \cdot T_l}{2^{i-1}} \right) = \lceil \log_2 n \rceil \cdot C + \sum_{i=1}^{\lceil \log_2 n \rceil} \frac{n \cdot T_l}{2^{i-1}} \]
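
A sketch of the binary-search variant, with the same caveats about `call_llm` and prompt wording, might look like this:

```python
from typing import Callable, List

def attribute_binary_search(call_llm: Callable[[str], str], task: str, log: List[str]) -> int:
    """Repeatedly ask the LLM which half of the remaining segment holds the decisive error."""
    lo, hi = 0, len(log) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        segment = "\n".join(f"[step {t}] {log[t]}" for t in range(lo, hi + 1))
        prompt = (
            f"Task: {task}\n\nFailure log segment (steps {lo}-{hi}):\n{segment}\n\n"
            f"Is the decisive error at step {mid} or earlier, or after step {mid}? "
            "Answer 'first half' or 'second half'."
        )
        if call_llm(prompt).strip().lower().startswith("first"):
            hi = mid        # one wrong answer here sends the search down the wrong branch
        else:
            lo = mid + 1
    return lo               # the step index the search converges on
```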

Experiments & Results: The Verdict

So, can LLMs debug themselves? The results are fascinating and reveal a “No Free Lunch” scenario.

Who vs. When

The researchers evaluated the methods on two metrics: Agent-Level Accuracy (identifying who messed up) and Step-Level Accuracy (identifying when).
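
Both metrics are simple match rates over the annotated logs. A small sketch, with illustrative field names, could look like this:

```python
from typing import Dict, List

def agent_level_accuracy(preds: List[Dict], gold: List[Dict]) -> float:
    """Fraction of failure logs where the predicted agent matches the annotation."""
    return sum(p["agent"] == g["agent"] for p, g in zip(preds, gold)) / len(gold)

def step_level_accuracy(preds: List[Dict], gold: List[Dict], tol: int = 0) -> float:
    """Fraction of logs where the predicted step is within `tol` steps of the annotation."""
    return sum(abs(p["step"] - g["step"]) <= tol for p, g in zip(preds, gold)) / len(gold)
```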

Table 1: Performance of the three failure attribution methods. All-at-once is best at agent-level accuracy; step-by-step is best at step-level accuracy.

As Table 1 shows, the results are split:

  • All-at-Once is the champion of identifying the Agent (Who). Because it sees the whole picture, it understands which agent was responsible for the strategy that failed.
  • Step-by-Step is the champion of identifying the Step (When). Its granular focus allows it to pinpoint the exact moment of failure much better than the other methods.
  • Binary Search acts as a middle ground but rarely wins.

This trend holds true even when testing different underlying models (like Llama-3 or Qwen), as seen in the radar charts below.

Radar charts comparing performance across different models. Trends remain consistent.

The Context Length Limit

A major finding is that current LLMs struggle as the conversation gets longer.

Figure 4: Performance of all three methods degrades as log length increases.

Figure 4 paints a stark picture. As the log length increases (from Level 1 to Level 5), the accuracy for all methods plummets. Step-level accuracy, in particular, drops to near zero for very long interactions. This confirms that the “Needle in a Haystack” problem is a significant barrier to automated debugging.

Is “Close Enough” Good Enough?

The researchers also realized that finding the exact step is incredibly hard. But for a human debugger, knowing the error is “around step 50” is almost as good as knowing it’s “at step 50.”

When they relaxed the criteria to allow a tolerance of ±5 steps, the accuracy improved significantly.
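
With the metric helpers sketched earlier, the relaxed criterion is just a wider matching window; `preds` and `gold` here are placeholders for a method's predictions and the human annotations.

```python
exact = step_level_accuracy(preds, gold, tol=0)  # must hit the exact annotated step
near  = step_level_accuracy(preds, gold, tol=5)  # "close enough": within ±5 steps
```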

Table showing Step-Level accuracy improves with tolerance.

Interestingly, All-at-Once benefits the most from this tolerance. While it struggles to hit the exact step, it often lands in the right neighborhood.

Statistical Reliability

Even if the models struggle with individual instances, do they get the general statistics right? Yes.

Figure 6: Histogram of actual vs. predicted failure-responsible agents; the distributions match closely.

Figure 6 shows that the predicted distribution of “guilty” agents closely matches the ground truth. This means that even if the system can’t perfectly debug every single run, it can tell a developer: “Hey, your Orchestrator agent is causing 60% of your failures.” That is a valuable insight for system optimization.
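
A tiny sketch of that aggregate view: even noisy per-instance predictions can still yield a useful blame distribution. The agent names and numbers below are purely illustrative.

```python
from collections import Counter
from typing import Dict, List

def blame_distribution(preds: List[Dict]) -> Dict[str, float]:
    """Share of analyzed failures attributed to each agent."""
    counts = Counter(p["agent"] for p in preds)
    total = sum(counts.values())
    return {agent: n / total for agent, n in counts.items()}

# e.g. {"Orchestrator": 0.60, "WebSurfer": 0.25, "Coder": 0.15} tells a developer
# where to focus first, even if individual predictions are imperfect.
```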

The Hybrid Approach and Future Directions

Given that All-at-Once is good at finding the Agent, and Step-by-Step is good at finding the Step, can we combine them?

The researchers tested a Hybrid Method (see the sketch after this list):

  1. Use All-at-Once to identify the responsible Agent.
  2. Use Step-by-Step only on the actions taken by that Agent.
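
Combining the two earlier sketches gives a rough picture of this pipeline; as before, `call_llm` and the `"Agent: message"` log format are assumptions, not the paper's implementation.

```python
from typing import Callable, Dict, List, Optional

def attribute_hybrid(call_llm: Callable[[str], str], task: str, log: List[str]) -> Optional[Dict]:
    """All-at-once picks the agent; step-by-step then checks only that agent's turns."""
    agent = attribute_all_at_once(call_llm, task, log)["agent"]   # stage 1: who
    for t, line in enumerate(log):
        if line.split(":", 1)[0] != agent:                        # skip other agents' turns
            continue
        context = "\n".join(f"[step {i}] {log[i]}" for i in range(t + 1))
        prompt = (
            f"Task: {task}\n\nConversation so far:\n{context}\n\n"
            f"Did {agent} make the decisive error at step {t}? Answer 'yes' or 'no'."
        )
        if call_llm(prompt).strip().lower().startswith("yes"):    # stage 2: when
            return {"agent": agent, "step": t}
    return None
```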

Comparison of the three methods with a hybrid approach. Hybrid wins on metrics but costs the most tokens.

The Hybrid method achieves the best overall performance, significantly boosting step-level accuracy. However, there is a catch: it is the most expensive method in terms of token cost.

Can Reasoning Models Help?

Finally, the team tested the latest “Reasoning” models (like OpenAI o1 and DeepSeek R1) to see if their advanced chain-of-thought capabilities could solve the problem.

Table showing performance with strong reasoning models. They improve results but don’t solve the problem entirely.

While reasoning models do offer improvements, they are not a silver bullet. The accuracy remains low enough that human supervision is still required.

Conclusion

The paper “Which Agent Causes Task Failures and When?” opens a new door in the field of AI agents. It transitions us from simply building agents to understanding how to maintain and debug them.

The key takeaways are:

  1. Debugging is hard: Even for SOTA models, pinpointing the exact cause of failure in a multi-agent conversation is a difficult reasoning task.
  2. Context vs. Precision: There is a trade-off. Seeing the whole history helps identify the bad actor, but checking step-by-step is needed to find the exact error.
  3. The Who&When Benchmark: This new dataset will likely become a standard for testing the diagnostic capabilities of future models.

As multi-agent systems become more integrated into our software infrastructure, automated failure attribution will move from a “nice-to-have” to a necessity. This research lays the foundation for self-healing AI systems that can not only solve problems but also understand why they sometimes fail.