Introduction
Imagine a student who aces every written exam in history, mathematics, and computer science but struggles to hold a conversation, offer advice to a friend, or brainstorm a creative gift idea. In the world of Artificial Intelligence, this is a common paradox. We have Large Language Models (LLMs) that score near-perfect marks on standardized tests like the Bar Exam or math Olympiad questions, yet they sometimes fail to satisfy simple, messy, real-world user requests.
For years, the AI community has relied on benchmarks like MMLU (Massive Multitask Language Understanding) or GSM8k (Grade School Math) to measure progress. These are excellent for testing specific abilities—reasoning, coding, and knowledge retrieval. However, they view the model as a test-taker, not a service provider. They don’t necessarily tell us how well an LLM serves a human user who is looking for travel advice, emotional support, or a creative spark.
Today, we are diving into a research paper that proposes a paradigm shift: the User Reported Scenarios (URS) benchmark. This work moves away from static exams and toward a user-centric evaluation, analyzing how well LLMs satisfy diverse human intents across multicultural contexts.
The Shift: From Model Abilities to User Intents
To understand why this research is significant, we first need to look at the landscape of AI evaluation. Traditionally, benchmarks are “ability-focused.” They treat the LLM like a calculator or an encyclopedia. If the model outputs the correct answer to a math problem, it gets a point.
However, real-world interaction is rarely that binary. A user might ask, “Help me plan a weekend trip that feels relaxing but not boring.” There is no single “correct” answer here. Success depends on personalization, tone, and creativity.
The researchers propose a new framework that categorizes evaluation not by academic subjects (like Math or Physics), but by User Intents.

As shown in Figure 1, existing benchmarks (left) focus on exams and human-designed tasks categorized by domain. The proposed URS framework (right) focuses on real-world usage, organizing data by what the user is actually trying to achieve—whether that is solving a professional problem, seeking creativity, or just killing time with leisure activities.
Building the URS Dataset: Global and Authentic
One of the primary criticisms of current user-centric benchmarks is that they often rely on synthetic data (AI talking to AI) or logs from a single platform (such as ChatGPT). This creates bias: if we evaluate only on ChatGPT logs, we learn only how to build a better ChatGPT, potentially ignoring how users interact with other models like Claude or Ernie Bot.
To counter this, the authors conducted a massive user study to build the User Reported Scenarios (URS) dataset. They collected 1,846 authentic cases from 712 participants. Crucially, these participants weren’t just from one location; they spanned 23 countries across Asia, Europe, North America, and beyond.

Figure 2 illustrates this geographic diversity. While there is a strong concentration in China (orange) and the UK/US (green/red), the dataset captures a variety of cultural contexts. This is vital because a “Leisure” query about a local festival in China requires different cultural knowledge than a query about pop culture in the UK.
The Six Core User Intents
What exactly do users want from LLMs? Through their study, the researchers identified and validated six primary categories of user intent. Understanding these categories is essential for evaluating how versatile a model truly is; a small code sketch of the taxonomy follows the list.
- Factual QA: Quick, direct access to information (e.g., “What is Bitcoin?”).
- Solving Professional Problems: In-depth reasoning in specialized fields like engineering or math.
- Text Assistant: Tasks involving summarization, translation, or polishing text.
- Ask for Advice: Seeking opinions for personal decisions, like career planning or shopping.
- Seek Creativity: Brainstorming for inspiration (e.g., “Give my cat a name”).
- Leisure: Recreational interactions, such as asking for movie recommendations or role-playing.
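For anyone who wants to work with this taxonomy programmatically, here is a minimal sketch of the six intents as a Python enum. The class and member names are my own choices; the paper does not prescribe any particular representation.

```python
from enum import Enum

class UserIntent(Enum):
    """The six user-intent categories identified in the URS study."""
    FACTUAL_QA = "Factual QA"
    SOLVING_PROFESSIONAL_PROBLEMS = "Solving Professional Problems"
    TEXT_ASSISTANT = "Text Assistant"
    ASK_FOR_ADVICE = "Ask for Advice"
    SEEK_CREATIVITY = "Seek Creativity"
    LEISURE = "Leisure"
```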

Table 3 provides concrete examples of these intents. Notice the difference in complexity. A “Factual QA” prompt might be short and objective. In contrast, “Seek Creativity” prompts, like explaining photosynthesis to a 9-year-old, require the model to adjust its style and tone significantly.
The Core Method: Intent-Aware Evaluation
The most technically interesting part of this paper is how the authors automated the grading process. Manually grading thousands of long-form answers is impossible at scale. Instead, they utilized a “Model-as-Judge” approach, specifically using GPT-4 to evaluate the responses of other models.
However, simply asking GPT-4 “Is this answer good?” is too vague. The researchers developed an Intent-Aware Evaluation Framework.
The Evaluation Workflow
The process, illustrated in Figure 3 below, transforms a raw user query into a scored benchmark.

Here is the step-by-step breakdown of their method (a minimal code sketch follows the list):
- Input: The system takes the User Intent, the User Question, a Reference Answer (generated by a strong model like GPT-4), and the Test Model’s Output.
- Intent-Aware Criteria: This is the key innovation. The system doesn’t judge a “Leisure” prompt the same way it judges a “Math” problem. It selects specific criteria based on the intent.
- Chain-of-Thought Reasoning: The judge model is instructed to “think” before it grades. It compares the test answer to the reference, identifies shortcomings, and evaluates specific dimensions before assigning a final score.
- Scoring: A final score (1-10) is parsed from the judge’s output.
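The steps above can be wired together in a few lines. The sketch below is not the authors' code: the prompt wording, the `call_judge` callable (standing in for a GPT-4 API call), and the exact score format are assumptions chosen to illustrate the workflow of intent-aware criteria, chain-of-thought judging, and score parsing.

```python
import re

JUDGE_PROMPT_TEMPLATE = """You are evaluating an AI assistant's answer to a user.
User intent: {intent}
Evaluation dimensions: {criteria}

User question:
{question}

Reference answer:
{reference}

Answer to evaluate:
{answer}

First reason step by step: compare the answer to the reference and note any
shortcomings on each dimension. Then give a final score from 1 to 10 on its
own line in the form "Score: N"."""


def grade_response(intent, criteria, question, reference, answer, call_judge):
    """Assemble the intent-aware judge prompt, query the judge model,
    and parse the final 1-10 score from its chain-of-thought output."""
    prompt = JUDGE_PROMPT_TEMPLATE.format(
        intent=intent,
        criteria=", ".join(criteria),
        question=question,
        reference=reference,
        answer=answer,
    )
    judgement = call_judge(prompt)  # placeholder for a GPT-4 API wrapper
    match = re.search(r"Score:\s*(\d+)", judgement)
    return int(match.group(1)) if match else None
```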
Different Criteria for Different Goals
To ensure fairness, the specific criteria change depending on the user’s goal. You wouldn’t judge a creative poem based on “factuality” alone, nor would you judge a math solution based on “empathy.”

As listed in Table 6, each intent has five specific dimensions:
- Factual QA prioritizes Factuality and Clarity.
- Ask for Advice prioritizes Fairness and Responsibility—ensuring the model doesn’t give dangerous or biased life advice.
- Seek Creativity prioritizes Richness and Engagement.
This granular approach ensures that models are penalized if they are boring when they should be fun, or hallucinating when they should be factual.
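As a rough illustration, the intent-to-criteria lookup that `grade_response` above expects could look like the dictionary below. Only the dimensions explicitly named in this article are filled in; Table 6 defines five per intent, and the rest are deliberately left out rather than guessed.

```python
# Partial sketch of an intent -> evaluation-dimensions lookup.
# Only dimensions named in this article are listed; the paper's Table 6
# defines five per intent, and the remainder are intentionally omitted.
INTENT_CRITERIA = {
    "Factual QA": ["Factuality", "Clarity"],
    "Ask for Advice": ["Fairness", "Responsibility"],
    "Seek Creativity": ["Richness", "Engagement"],
    # "Solving Professional Problems", "Text Assistant", and "Leisure"
    # have their own five dimensions in Table 6 (not reproduced here).
}

# Usage with the grade_response() sketch above:
# score = grade_response("Factual QA", INTENT_CRITERIA["Factual QA"],
#                        question, reference, answer, call_judge)
```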
Experiments and Results
The researchers benchmarked 10 major LLM services, including GPT-4, Claude-3, Qwen-max, and others. The results provide a fascinating snapshot of the current state of LLM capabilities.
1. The Leaderboard
Unsurprisingly, GPT-4 (specifically GPT-4o) consistently took the top spot across almost all categories, with an overall score of 8.15. However, the gap is closing. Claude-3-Opus and Qwen-max followed closely behind, forming a distinct “first tier” of models.
Interestingly, models performed significantly better on objective tasks (Solving Problems, Factual QA) than on subjective ones (Creativity, Leisure). This suggests that while LLMs are becoming excellent encyclopedias and calculators, they still struggle to be truly engaging or creative companions.
2. Cross-Validation: Is the Judge Biased?
A common concern in “LLM-as-a-Judge” research is bias. If we use GPT-4 as the judge, does it simply prefer answers that sound like itself?
To test this, the authors compared GPT-4 and Claude-3 acting as judges for each other.

Figure 4 shows the results of this cross-validation. “GPT Eva Claude Ans” means GPT-4 is grading Claude’s answers. The results show that GPT-4 consistently scores slightly higher than Claude-3, regardless of which model is acting as the judge. This consistency validates the ranking—GPT-4 isn’t just winning because it’s the judge; it appears to be genuinely generating preferred answers in this context.
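Operationally, a bias check like this reduces to comparing average scores across the four judge/answerer pairings. Here is a minimal sketch, assuming the per-question scores have already been collected; the data layout and model names are illustrative, not the authors' exact setup.

```python
from statistics import mean

def cross_validation_summary(scores):
    """scores maps (judge, answerer) -> list of per-question scores, e.g.
    {("GPT-4", "Claude-3"): [...], ("Claude-3", "GPT-4"): [...], ...}.
    The ranking is considered robust if each answerer's relative standing
    stays the same regardless of which model acts as judge."""
    return {pair: round(mean(vals), 2) for pair, vals in scores.items()}

# Illustrative call (lists of scores are placeholders):
# summary = cross_validation_summary({
#     ("GPT-4", "GPT-4"): gpt_on_gpt,       ("GPT-4", "Claude-3"): gpt_on_claude,
#     ("Claude-3", "GPT-4"): claude_on_gpt, ("Claude-3", "Claude-3"): claude_on_claude,
# })
```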
3. Does the Benchmark Match Human Reality?
The ultimate test of any benchmark is whether it reflects reality. If the benchmark gives a model a score of 10/10, but actual users hate using it, the benchmark is useless.
The authors compared their automated scores against real user satisfaction ratings reported during the data collection phase.

The correlation, shown in Figure 5, is remarkably high (Pearson r = 0.95). You can see a clear trend: intents that received high benchmark scores (like Text Assistant and Factual QA) also had high user satisfaction. Conversely, areas where models struggled in the benchmark (like Leisure and Creativity) were the same areas where users reported lower satisfaction.
This validates the URS dataset: the automated scores are a reliable proxy for how happy a human user will be.
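Reproducing that correlation check takes only a few lines once you have the two per-intent series (average automated score and average user satisfaction). The sketch below uses SciPy; the numbers are placeholders for illustration, not the paper's figures.

```python
from scipy.stats import pearsonr

# One value per intent category, in the same order for both lists.
# Placeholder values only -- the real numbers come from the benchmark runs
# and the user-study satisfaction ratings summarized in Figure 5.
benchmark_scores = [8.3, 8.1, 7.9, 7.6, 7.2, 7.0]
user_satisfaction = [4.5, 4.4, 4.2, 4.0, 3.7, 3.6]

r, p_value = pearsonr(benchmark_scores, user_satisfaction)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```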
Furthermore, they performed a pairwise comparison (A vs. B testing) with human annotators.

Figure 6 confirms the findings. The ranking of models derived from human pairwise comparisons aligns almost perfectly with the automated benchmark ranking.
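One simple way to quantify that kind of agreement is to turn the human pairwise judgments into per-model win rates and compare the resulting order with the benchmark order, for example via Spearman's rank correlation. This is a sketch under assumed data structures, not the authors' exact protocol.

```python
from collections import Counter
from scipy.stats import spearmanr

def win_rates(pairwise_results):
    """pairwise_results: list of (model_a, model_b, winner) tuples from human
    annotators (ties ignored for simplicity). Returns each model's win share."""
    wins, games = Counter(), Counter()
    for a, b, winner in pairwise_results:
        games[a] += 1
        games[b] += 1
        if winner in (a, b):
            wins[winner] += 1
    return {m: wins[m] / games[m] for m in games}

def rank_agreement(human_win_rates, benchmark_scores):
    """Spearman correlation between human-derived and automated rankings."""
    models = sorted(human_win_rates)
    rho, _ = spearmanr([human_win_rates[m] for m in models],
                       [benchmark_scores[m] for m in models])
    return rho
```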
Conclusion and Implications
The URS Benchmark represents a maturing of the field. We are moving past the phase where “solving a math problem” is the sole indicator of intelligence. By focusing on User Intents, this research highlights that users view LLMs as multifaceted tools—sometimes they are search engines, sometimes creative partners, and sometimes casual chat companions.
The key takeaways for students and researchers are:
- Context Matters: Evaluating a model requires knowing why the user is asking the question. A factual answer to a creative prompt is a failure, not a success.
- Subjectivity is the New Frontier: Models are mastering facts and logic. The next big hurdle is mastering nuance, creativity, and personalization—areas where scores remain lower.
- Global Diversity: Incorporating data from 23 countries ensures we aren’t just building AI that works for one specific cultural demographic.
As LLMs continue to integrate into our daily lives, benchmarks like URS will become the standard. They ensure that we aren’t just building models that are smart on paper, but models that are actually useful to the people typing into the prompt box.