The explosion of Large Language Models (LLMs) has shifted the landscape of software development. We are no longer just building chatbots; we are building agents—applications capable of planning, coding, and collaborating to solve complex problems. From solving intricate math problems to managing household chores in simulated environments, these agents are becoming increasingly autonomous.
But this rapid capability growth has created a new bottleneck: Evaluation.
How do you know if an LLM application is actually “good”? In traditional software, we have unit tests. Pass or fail. In machine learning, we have accuracy metrics. But for a generative agent helping a human, “success” is nuanced. An agent might solve a math problem correctly but explain it in a confusing, roundabout way. It might complete a household chore but break three other things in the process.
In this deep dive, we explore a research paper that tackles this exact problem. The researchers introduce AgentEval, a novel framework that uses LLMs to evaluate other LLMs. It moves beyond simple success rates to measure “Task Utility”—a multi-dimensional view of how well an application aligns with user needs.
The Problem with Binary Success
To understand why we need AgentEval, we first have to look at how we currently test agents. Most benchmarks rely on end-to-end success metrics. Did the code run? Did the agent find the answer?
While necessary, these metrics are insufficient for two main reasons:
- Success is not the only metric: A user’s experience involves clarity, efficiency, and safety. If an agent takes 50 steps to do a 5-step task, it has low utility, even if it eventually succeeds.
- Success is hard to define: For open-ended tasks (like “write a funny email”), there is no single right answer.
The researchers propose a taxonomy of tasks to better understand where current evaluation methods fall short.

As shown in Figure 2, tasks generally fall into two categories. On the left, we have assistive tasks where success is vague (e.g., creative writing). On the right, we have tasks where success is clearly defined. AgentEval focuses on this right-hand branch—specifically scenarios where, even if the outcome is binary (success/fail), the path to get there matters. Whether there is an optimal solution or multiple possible trajectories, we need a way to verify the quality of the execution, not just the result.
Introducing AgentEval
The core insight of the paper is that human evaluation is the gold standard but is too expensive and slow for rapid development. However, LLMs themselves have shown remarkable ability to act as evaluators.
AgentEval is a multi-agent framework designed to automate the assessment of task utility. It doesn’t just grade the homework; it writes the rubric, does the grading, and then checks its own work to make sure the grading was fair.
The framework is composed of three specific agents, operating in a closed loop; a code sketch of that loop follows the list:
- CriticAgent: Defines what matters (the criteria).
- QuantifierAgent: Measures how well the system performed against those criteria.
- VerifierAgent: Checks if the criteria are robust and reliable.
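
Before breaking down each role, here is a minimal orchestration sketch in Python. The three roles are injected as plain callables so the sketch stays framework-agnostic; the type aliases and signatures are our assumptions, not the paper's implementation.

```python
from typing import Callable, Dict, List

# A criteria set maps each criterion name to its accepted values (an ordered scale).
Criteria = Dict[str, List[str]]

# Each role is ultimately an LLM call; here they are injected as plain callables
# so the sketch stays framework-agnostic. Signatures are our assumptions.
Critic = Callable[[str, List[str]], Criteria]      # (task, sample logs) -> criteria
Quantifier = Callable[[str, str, List[str]], str]  # (task, solution, scale) -> score
Verifier = Callable[[Criteria], Criteria]          # filters out unreliable criteria


def agent_eval(task: str, sample_logs: List[str], solution: str,
               critic: Critic, quantifier: Quantifier, verifier: Verifier) -> Dict[str, str]:
    """One pass through the closed loop: write the rubric, grade, then audit the rubric."""
    criteria = critic(task, sample_logs)                 # 1. CriticAgent proposes criteria
    scores = {name: quantifier(task, solution, scale)    # 2. QuantifierAgent grades against them
              for name, scale in criteria.items()}
    reliable = verifier(criteria)                        # 3. VerifierAgent keeps robust criteria
    return {name: scores[name] for name in reliable}     # report only the reliable dimensions
```

In practice each callable wraps one or more LLM prompts; the sections below sketch what those prompts and checks might look like.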

Let’s break down each agent to understand how they collaborate to produce a utility score.
1. The CriticAgent: Defining the Rubric
The process begins with the CriticAgent. In a human evaluation setting, you would ask a domain expert to list what makes a solution “good.” Here, the CriticAgent takes on that role.
It receives the task description and examples of both successful and failed executions. By comparing these, it generates a list of criteria tailored to the specific application. It’s not a generic list; it’s context-aware.
For example, when evaluating an agent designed to solve math problems, the CriticAgent suggests the following criteria:

As we see in the table above, the CriticAgent identified that “Accuracy” is not enough. It proposed Clarity (is the explanation easy to follow?), Efficiency (was the method optimal?), and Completeness (did it cover all aspects?). It also defines accepted values (e.g., “Not Clear,” “Moderately Clear,” “Very Clear”) to standardize the scoring.
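
Such a rubric is straightforward to represent as structured data. The sketch below is an illustrative rendering: the criterion names follow the paper, while the descriptions and scales are placeholder wording of our own.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Criterion:
    name: str
    description: str
    accepted_values: List[str]   # the ordered scale the QuantifierAgent must choose from

# Illustrative rubric for the math-solving task; names per the paper, wording ours.
MATH_CRITERIA = [
    Criterion("accuracy", "Is the final answer correct?",
              ["incorrect", "partially correct", "correct"]),
    Criterion("clarity", "Is the explanation easy to follow?",
              ["not clear", "moderately clear", "very clear"]),
    Criterion("efficiency", "Was the chosen method optimal, without detours?",
              ["inefficient", "moderately efficient", "efficient"]),
    Criterion("completeness", "Does the solution address every part of the problem?",
              ["incomplete", "mostly complete", "complete"]),
]
```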
2. The QuantifierAgent: Assigning the Score
Once the criteria are set, the QuantifierAgent steps in. Its job is to look at a specific solution generated by the application and assign a score based on the rubric created by the CriticAgent.
This agent effectively calculates the “Utility” of a task, defined as a vector of scores across the different criteria. This allows developers to see trade-offs. Perhaps a new update to their model increased accuracy but decreased clarity. A simple “pass/fail” test would miss that regression, but the QuantifierAgent catches it.
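
A minimal quantifier can be a single constrained prompt per criterion, with utility reported as the resulting score vector. The prompt text below is an assumption, and `llm` stands for any text-in/text-out model wrapper rather than a specific API.

```python
from typing import Callable, Dict, List, Tuple

LLM = Callable[[str], str]   # any text-in/text-out chat-model wrapper


def quantify(llm: LLM, task: str, solution: str,
             name: str, description: str, scale: List[str]) -> str:
    """Grade one solution on one criterion, constrained to the rubric's scale."""
    prompt = (f"Task: {task}\n"
              f"Candidate solution:\n{solution}\n\n"
              f"Rate the solution on '{name}' ({description}).\n"
              f"Answer with exactly one of: {', '.join(scale)}.")
    return llm(prompt).strip().lower()


def utility_vector(llm: LLM, task: str, solution: str,
                   rubric: List[Tuple[str, str, List[str]]]) -> Dict[str, str]:
    """Utility as a vector of per-criterion scores rather than a single pass/fail bit."""
    return {name: quantify(llm, task, solution, name, description, scale)
            for name, description, scale in rubric}
```

Comparing `utility_vector` outputs before and after a model update is what surfaces regressions such as "accuracy up, clarity down" that a pass/fail test would hide.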
The researchers tested this on a dataset of complex math problems, comparing three different solution methods:
- ReAct: A reasoning and acting paradigm.
- Vanilla Solver: A standard GPT-4 solver.
- AutoGen: A multi-agent conversation framework.

Figure 3 illustrates the power of this multidimensional assessment. The darker bars represent successful cases, while the lighter bars represent failed cases.
Notice the nuance here. Even among “Successful” cases (the dark bars), AutoGen (dark blue) often scores higher on Completeness and Efficiency than the Vanilla Solver (green). This tells us that while both systems got the math right, AutoGen provided a higher-utility experience for the user. Conversely, looking at the “Failed” cases (light bars), we see they consistently score lower on criteria like Clarity and Completeness, confirming that the QuantifierAgent is correctly penalizing poor performance.
3. The VerifierAgent: Quality Control
The most innovative part of AgentEval is likely the VerifierAgent. A common risk with using LLMs as judges is hallucination or inconsistency. How do we know the criteria suggested by the CriticAgent are actually useful? How do we know the QuantifierAgent isn’t just guessing?
The VerifierAgent performs a “Robustness Check.” It validates the criteria through two main methods, each sketched in code after the list:
- Criteria Stability: It checks if the QuantifierAgent gives consistent scores when run multiple times on the same input. If a criterion leads to wild variance in scoring (e.g., scoring the same solution 1/5 then 5/5), the VerifierAgent flags it as unstable and removes it.
- Discriminative Power: It checks if the criteria can actually tell the difference between a good solution and a corrupted one.
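
Both checks are easy to approximate in code. In the sketch below, stability is estimated from the spread of repeated quantifier scores, and discriminative power from the gap between an intact and a corrupted solution; the thresholds and the numeric score mapping are illustrative choices, not values from the paper.

```python
import statistics
from typing import Callable, List

# quantify(solution, criterion) -> numeric score, assumed to wrap a QuantifierAgent
# call with the criterion's scale already mapped onto numbers (e.g. 1..5).
Quantify = Callable[[str, str], float]


def is_stable(quantify: Quantify, solution: str, criterion: str,
              runs: int = 5, max_std: float = 0.5) -> bool:
    """Re-score the same solution several times and flag high-variance criteria."""
    scores = [quantify(solution, criterion) for _ in range(runs)]
    return statistics.pstdev(scores) <= max_std


def is_discriminative(quantify: Quantify, solution: str, corrupted: str,
                      criterion: str, margin: float = 0.5) -> bool:
    """A useful criterion should score an intact solution above a corrupted copy of it."""
    return quantify(solution, criterion) - quantify(corrupted, criterion) >= margin


def verify_criteria(quantify: Quantify, criteria: List[str],
                    solution: str, corrupted: str) -> List[str]:
    """Keep only criteria that pass both the stability and the discriminative-power check."""
    return [c for c in criteria
            if is_stable(quantify, solution, c)
            and is_discriminative(quantify, solution, corrupted, c)]
```

A criterion like the "Error Analysis" example discussed below would fail these checks and be dropped from the final evaluation set.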
To visualize the stability check, the researchers plotted the distribution of scores across multiple runs.

In Figure 5, we see box plots representing the spread of scores for AutoGen. The separation between the dark blue (Success) and light blue (Failed) boxes is crucial. For criteria like Clarity and Completeness, there is a distinct separation—successful tasks consistently score higher. This confirms these criteria have high discriminative power.
However, look at Error Analysis. The boxes overlap significantly, and the range is wide. This indicates that “Error Analysis” might be a noisy or confusing criterion for this specific task, as the agent struggles to distinguish between successful and failed attempts based on it. The VerifierAgent would use this data to filter out “Error Analysis” from the final evaluation set, ensuring the developer only focuses on reliable metrics.
Adversarial Testing: Stress Testing the Metric
To further prove the Discriminative Power of the system, the researchers performed an adversarial attack. They took valid solutions and intentionally degraded them—specifically by randomly dropping sentences to simulate incoherence or incompleteness (“Disturbed” solutions).
If the AgentEval framework is working, it should drastically lower the scores for these disturbed solutions.
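
The corruption step itself is simple to reproduce. The sketch below drops a random fraction of sentences from an otherwise valid solution; the naive sentence splitter and the 25% drop rate are our simplifications, not the paper's exact procedure.

```python
import random
import re
from typing import Optional


def disturb(solution: str, drop_fraction: float = 0.25,
            seed: Optional[int] = None) -> str:
    """Degrade a valid solution by randomly removing sentences, simulating an
    incoherent or incomplete ("disturbed") answer."""
    rng = random.Random(seed)
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", solution.strip())
    keep = max(1, round(len(sentences) * (1.0 - drop_fraction)))
    kept = sorted(rng.sample(range(len(sentences)), keep))
    return " ".join(sentences[i] for i in kept)
```

If the framework is working, running the quantifier on `disturb(solution)` should push every criterion's score well below the original's.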

Figure 7 shows the results of this stress test. The darker bars are the original solutions, and the lighter bars are the “Disturbed” versions.
Across almost every method (AutoGen, Vanilla, ReAct) and every criterion (Clarity, Efficiency, Completeness), the scores drop significantly for the disturbed versions. This is a strong validation signal. It proves that the QuantifierAgent isn’t just hallucinating a high score; it is actively reading the content. If the content degrades, the utility score reflects that immediately.
Beyond Math: Household Tasks (ALFWorld)
One of the claims of AgentEval is flexibility. It shouldn’t just work for text-heavy math problems; it should work for embodied agents acting in a virtual world.
The researchers applied the framework to ALFWorld, a benchmark where agents must navigate a text-based simulation of a house to complete chores (e.g., “clean the apple and put it in the fridge”).
The CriticAgent generated a completely different set of criteria for this domain, including Task Understanding, Plan Making, and Response to Feedback.

The results, shown in Figure 10, demonstrate the framework’s adaptability. Once again, we see that successful executions (darker bars) generally score higher than failed ones.
Interestingly, looking at Task Understanding, we see high scores even for failed cases (the light bars are almost as high as the dark ones). This provides a fascinating insight for developers: The agents understood what they needed to do (hence the high score), but they failed in the Action Execution or Plan Making phase.
Without AgentEval, a developer would just see “Failed.” With AgentEval, they see “Understood the task, but failed to execute the plan.” This actionable insight allows them to debug the specific module responsible for execution rather than retraining the language understanding module.
Conclusion
The “AgentEval” framework represents a maturity shift in how we build with LLMs. We are moving away from the “vibes-based” evaluation—where we just chat with a bot and decide if it’s good—toward rigorous, automated, multi-dimensional quantification.
By employing a team of agents to Criticize, Quantify, and Verify, developers can:
- Scale Evaluation: Run assessments on thousands of interactions without human cost.
- Deepen Insights: Move beyond binary success to understand clarity, efficiency, and safety.
- Ensure Robustness: Use the Verifier loop to guarantee that the metrics themselves are stable and reliable.
As we entrust more complex tasks to AI agents, verifying their utility—not just their ability to output text—will become the cornerstone of reliable AI development. AgentEval offers a promising blueprint for how that verification layer should be built.