You’ve probably chatted with an AI assistant like ChatGPT, Claude, or Llama. You type a question, and it fires back with a polished, well-structured answer — articulate, exhaustive, and unfailingly polite. These models are trained to be ideal conversational partners.

But here’s the catch: real human users aren’t like that.

Our requests in the wild are messy. We make typos, use slang, change our minds mid-conversation, and rarely lay out our entire request in perfect order. For example:

“write a python function to sort some numbers”
“oh, and only numbers from 1 to 9”
“actually, reverse it after sorting”
“and can you change the digits to words like ‘One’, ‘Two’?”
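
Taken together, those four turns pin down one small function. Here is a minimal Python sketch of what the user ends up asking for (the function name and example call are ours, not from the paper):

    def sort_digits_as_words(numbers):
        """Keep only the digits 1-9, sort them, reverse the order,
        and spell each one out as a word ('One', 'Two', ...)."""
        words = {1: "One", 2: "Two", 3: "Three", 4: "Four", 5: "Five",
                 6: "Six", 7: "Seven", 8: "Eight", 9: "Nine"}
        kept = [n for n in numbers if n in words]   # "only numbers from 1 to 9"
        ordered = sorted(kept, reverse=True)        # sort, then "reverse it"
        return [words[n] for n in ordered]          # digits -> words

    print(sort_digits_as_words([3, 12, 7, 1, 7]))   # ['Seven', 'Seven', 'Three', 'One']

The code is trivial; the point is that no single turn contains the full specification, and the assistant only gets it right if it carries every amendment forward.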

This gradual, sometimes chaotic, multi-turn interaction is exactly where even the most advanced AI assistants can falter. The challenge: How do we evaluate assistants in environments that resemble this real, “beautiful mess” of human conversation?

The common shortcut has been to ask a powerful AI assistant to simulate a human user talking to another assistant. Sounds reasonable, right? After all, if a model is good at language, it should be good at role-playing.

But in their paper Flipping the Dialogue, researchers from Microsoft and Georgia Tech show this is fundamentally flawed. Their surprising finding: the better an AI is at being an assistant, the worse it is at pretending to be a human user.

Instead, they introduce a new approach — training dedicated User Language Models (User LMs) designed from the ground up to mirror human conversational quirks.

And it’s not just an academic tweak: when GPT-4o is evaluated in conversation with the purpose-built UserLM-8b instead of another GPT-4o prompted to play the user, its task success rate drops from 74.6% to 57.4%. That means our current evaluation methods are giving us an overly rosy picture of how assistants handle real-world interactions.


Figure 1: Two simulators, two realities. On the left, GPT-4o role-plays as a user with simple, direct turns — and the assistant breezes through. On the right, UserLM-8b mirrors real users’ indirect style, and the assistant stumbles.


Why Assistant-LMs Fall Short as Simulated Users

Using an assistant LM to mimic a user is like asking a Michelin-star chef to stand in for the average diner: technically brilliant, but trained for a completely different role.

Assistant LMs are tuned to be:

  • Helpful and cooperative — always trying to smooth the interaction.
  • Structured and exhaustive — often delivering the complete answer in one turn.
  • Reluctant to disengage — they rarely end the conversation, always waiting for more input.

Real users? We’re driven by our own goals, not by the desire to be “good” conversational partners. We’re terse because typing takes effort, we might introduce details piecemeal, and we’ll walk away without ceremony when we’re done.

When an assistant like GPT-4o is prompted to “act like a user,” it fights against its training. It may sprinkle in a typo, but its core persona — clarity, helpfulness, cooperation — shines through. That creates an unrealistically easy evaluation environment, leading to inflated performance scores for the assistant under test.


Flipping the Dialogue: Training Realistic User LMs

The researchers’ big insight is intuitive: to build a model that behaves like a user, train it on user data — not assistant outputs.

Instead of predicting the assistant’s next response, the model is trained to predict the user’s next turn. This is User Language Modeling, and it involves three key stages:


Figure 2: The training pipeline. Start with real user-assistant conversations, distill a high-level “user intent,” and train the model to produce the next user utterance based on that intent and dialogue so far.

1. Start with Real Conversations

The team used WildChat, a dataset of 478,000 real user–ChatGPT conversations from 192 countries. After removing duplicates, they had 384,336 conversations — a rich, messy source of authentic user behavior.
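
For illustration, here is a rough sketch of this step against the public WildChat release on Hugging Face. The dataset ID and field names are assumptions about that release, and the paper's exact deduplication criteria may differ:

    import hashlib
    from datasets import load_dataset

    # Assumed public release of WildChat on Hugging Face.
    wildchat = load_dataset("allenai/WildChat", split="train")

    def conversation_key(example):
        # Hash the concatenated user turns to spot near-verbatim duplicates.
        user_text = " ".join(
            turn["content"] for turn in example["conversation"] if turn["role"] == "user"
        )
        return hashlib.sha256(user_text.encode("utf-8")).hexdigest()

    seen, deduped = set(), []
    for example in wildchat:
        key = conversation_key(example)
        if key not in seen:
            seen.add(key)
            deduped.append(example)

    print(f"{len(deduped)} unique conversations kept")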

2. Generate a “User Intent”

For each conversation, GPT-4o was asked to create a high-level summary of the user’s goal — the intent. Crucially, this wasn’t a verbatim script; it captured the overall aim without dictating exact phrasing.

Example intent:
“You are a user getting information about weight loss strategies and the impact of medications on weight gain.”

This way, the User LM has a guiding goal but retains room to vary its phrasing and conversational progression.
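
A minimal sketch of this step using the standard OpenAI Python client; the instruction below is our own placeholder, not the wording the authors used:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Placeholder instruction; the paper's actual prompt may differ.
    INTENT_INSTRUCTION = (
        "Summarize the user's overall goal in this conversation as a single "
        "high-level intent, phrased as 'You are a user ...'. Capture the aim, "
        "not the exact wording."
    )

    def generate_intent(conversation_text: str) -> str:
        # Distill one high-level user intent from a full user-assistant conversation.
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": INTENT_INSTRUCTION},
                {"role": "user", "content": conversation_text},
            ],
        )
        return response.choices[0].message.content.strip()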

3. Train the Model to Be the User

Using “flipped” data — the conversation history + intent — the model (e.g., Llama3-8b-Base) learns to generate the next user turn. Over time, it picks up not just what to say, but how to say it like a human.

To emulate real conversation endings, the team introduced a special <|endconversation|> token, teaching the model natural disengagement.
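
Putting the pieces together, here is a simplified sketch of how one conversation is flipped into next-user-turn training samples; the prompt template and token handling are assumptions, not the paper's exact recipe:

    END_TOKEN = "<|endconversation|>"  # special token marking user disengagement

    def flip_conversation(intent, turns):
        # `turns` is a list of {"role": "user"/"assistant", "content": ...} dicts.
        # Each sample asks the model for the next user turn given the intent and
        # the dialogue so far; a final sample targets the end-of-conversation token.
        def render(history):
            return "\n".join(f'{t["role"]}: {t["content"]}' for t in history)

        samples = []
        for i, turn in enumerate(turns):
            if turn["role"] != "user":
                continue
            prompt = f"Intent: {intent}\n{render(turns[:i])}\nuser:"
            samples.append({"prompt": prompt, "target": turn["content"]})

        # After the last real turn, teach the model to disengage naturally.
        samples.append({"prompt": f"Intent: {intent}\n{render(turns)}\nuser:", "target": END_TOKEN})
        return samples

The fine-tuning loss then sits on the target user text (and the end token) rather than on assistant replies, which is exactly the "flip" relative to ordinary assistant training.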


Does It Sound Like a User? Measuring Statistical Fit

Before putting User LMs into simulations, the team checked their language fit using perplexity (PPL) — a measure of how well the model predicts real user text. Lower PPL means the model finds the text unsurprising, suggesting good distributional alignment.
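
Concretely, perplexity here is the standard definition, applied to the tokens of a real user turn given the intent and the dialogue so far:

    PPL(x_{1:N}) = \exp\Big( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i \mid \text{intent}, \text{history}, x_{<i}) \Big)

A model that assigns high probability to the words real users actually typed gets a low PPL; a model that keeps expecting tidy, assistant-like phrasing gets a high one.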

UserLM-8b was tested against base LMs, assistant-tuned LMs, and another simulated user model (USP-8b).


Table 1: Perplexity scores on two datasets. UserLM-8b has much lower PPL — meaning its predicted user language matches real data more closely than any assistant-based simulator.

Findings:

  • UserLM-8b had 60–70% lower PPL than all baselines.
  • Starting from a neutral base model beat starting from an instruction-tuned assistant hands-down.
  • Training with intent conditioning made the simulator more steerable at test time.


Figure 3: Left — intent conditioning makes a steerable user LM. Right — base model initialization outperforms assistant-instruct checkpoints.


Evaluating Human-Like Behavior

The team designed six fine-grained tests to measure how well simulators mirror human conversational traits.

Table 2: Across six metrics (diversity, intent decomposition, dialogue termination, naturalness, role adherence, and intent adherence), UserLMs consistently beat assistant-based simulators and align most closely with human estimates.

Multi-Turn Interaction

  • First-turn diversity: UserLM-8b scored 94.55%, matching human variety (94.01%). GPT-4o lagged at 74.42%.
  • Intent decomposition: UserLMs spread details over turns; assistants tend to front-load information.
  • Dialogue termination: UserLM-8b: 63.54 F1. GPT-4o: only 1.38 — assistants just won’t say goodbye.
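
For the termination metric, one natural reading (ours; the paper's exact protocol may differ) is to treat "end the conversation here" as a binary prediction at each point where the simulator could speak, scored with standard F1:

    def termination_f1(gold, predicted):
        # gold[i]: True if the real user ended the conversation at point i.
        # predicted[i]: True if the simulator emitted <|endconversation|> there.
        tp = sum(g and p for g, p in zip(gold, predicted))
        fp = sum(p and not g for g, p in zip(gold, predicted))
        fn = sum(g and not p for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

Under this kind of scoring, a simulator that never stops talking lands near zero no matter how good its individual turns are, which is exactly the failure mode of prompted assistants.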

Simulation Robustness

  • Naturalness: Using a state-of-the-art AI detector, UserLM text scored ~80% human-likeness; assistant outputs scored near-zero.
  • Role adherence: When pressed, UserLMs stayed in character >93% of the time; assistants reverted to answering like helpers.
  • Intent adherence: UserLMs refused to be sidetracked >94% of the time; assistants were easily diverted.

Scaling effect: making prompted assistants bigger did not make them better user simulators, but larger User LMs consistently simulated users better.


Facing the Real Test: How Do Assistants Perform Against Realistic Users?

The final challenge: simulate multi-turn conversations for math and coding tasks, using different “user” simulators with GPT-4o as the assistant.
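
The simulation loop itself is simple. Here is a stripped-down sketch in which the interfaces (user_lm.next_turn, assistant.respond) and the turn cap are our own stand-ins:

    MAX_USER_TURNS = 10  # safety cap; the paper's limit may differ

    def simulate(user_lm, assistant, intent):
        # Run one user-simulator vs. assistant conversation until the user disengages.
        history = []
        user_turn = user_lm.next_turn(intent, history)   # e.g. UserLM-8b conditioned on the intent
        turns_taken = 0
        while user_turn != "<|endconversation|>" and turns_taken < MAX_USER_TURNS:
            history.append({"role": "user", "content": user_turn})
            history.append({"role": "assistant", "content": assistant.respond(history)})  # GPT-4o under test
            user_turn = user_lm.next_turn(intent, history)
            turns_taken += 1
        return history  # the assistant's final answer is then scored against the task's reference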


Figure 7: For the same task, GPT-4o-based simulations (left) are repetitive. UserLM-8b simulations (right) introduce variety in language, pacing, and constraints.

Key differences with UserLM-8b simulations:

  • More diverse behavior — varied length, selective repetition of important info, omission of trivial details, occasional new constraints.
  • Lexical richness — rephrasing the same intent in many ways.
  • Pace variation — conversations ranged from 2 to 7 turns, reflecting different user “speeds.”

Impact on assistant success:

  • GPT-4o + GPT-4o simulator: 74.6% success rate.
  • GPT-4o + UserLM-8b simulator: 57.4% success rate.

That’s a 17.2-point drop, revealing hidden weaknesses that easy simulations gloss over.


Conclusion: Better User Simulation = Better Assistants

Prompting an assistant to “act like a user” is like asking a hammer to be a screwdriver — wrong tool, wrong job.

Flipping the Dialogue shows how to create realistic, purpose-built User LMs. And the benefits are clear:

  1. Realistic evaluation — Stress-test assistants in messy, authentic scenarios.
  2. Richer training data — Diverse, natural conversations for fine-tuning assistants.
  3. Personalization potential — Extend to simulate users from specific demographics or domains.
  4. Preserve base models — The best User LMs start from neutral base checkpoints, emphasizing the need for open-source base model releases.

Investing in better user simulators doesn’t just improve testing — it shapes the development of assistants that can handle the unpredictable rhythm and style of genuine human conversation. If we want AI partners that thrive in the real world, we need to first teach them to navigate it.