You’ve probably chatted with an AI assistant like ChatGPT, Claude, or Llama. You type a question, and it fires back with a polished, well-structured answer — articulate, exhaustive, and unfailingly polite. These models are trained to be ideal conversational partners.
But here’s the catch: real human users aren’t like that.
Our requests in the wild are messy. We make typos, use slang, change our minds mid-conversation, and rarely lay out our entire request in perfect order. For example:
“write a python function to sort some numbers”
“oh, and only numbers from 1 to 9”
“actually, reverse it after sorting”
“and can you change the digits to words like ‘One’, ‘Two’?”
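Taken together, those four turns amount to a request that might land on something like the function below (a hypothetical sketch; the user never states it this cleanly):

```python
def sort_digits_as_words(numbers):
    """Sort the digits 1-9, reverse the order, and spell them out."""
    words = {1: "One", 2: "Two", 3: "Three", 4: "Four", 5: "Five",
             6: "Six", 7: "Seven", 8: "Eight", 9: "Nine"}
    kept = sorted(n for n in numbers if 1 <= n <= 9)  # "only numbers from 1 to 9"
    kept.reverse()                                    # "reverse it after sorting"
    return [words[n] for n in kept]                   # "change the digits to words"

print(sort_digits_as_words([3, 9, 1, 12, 5]))  # ['Nine', 'Five', 'Three', 'One']
```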
This gradual, sometimes chaotic, multi-turn interaction is exactly where even the most advanced AI assistants can falter. The challenge: How do we evaluate assistants in environments that resemble this real, “beautiful mess” of human conversation?
The common shortcut has been to ask a powerful AI assistant to simulate a human user talking to another assistant. Sounds reasonable, right? After all, if a model is good at language, it should be good at role-playing.
But in their paper Flipping the Dialogue, researchers from Microsoft and Georgia Tech show this is fundamentally flawed. Their surprising finding: the better an AI is at being an assistant, the worse it is at pretending to be a human user.
Instead, they introduce a new approach — training dedicated User Language Models (User LMs) designed from the ground up to mirror human conversational quirks.
And it’s not just an academic tweak: when GPT-4o is evaluated in conversation with a realistic User LM instead of another assistant, its task success rate drops from 74.6% to 57.4%. That means our current evaluation methods are giving us an overly rosy picture of how assistants handle real-world interactions.
Figure 1: Two simulators, two realities. On the left, GPT-4o role-plays as a user with simple, direct turns — and the assistant breezes through. On the right, UserLM-8b mirrors real users’ indirect style, and the assistant stumbles.
Why Assistant-LMs Fall Short as Simulated Users
Using an assistant LM to mimic a user is like hiring a Michelin-star chef to taste-test frozen dinners — technically competent, but fundamentally misaligned with the target audience.
Assistant LMs are tuned to be:
- Helpful and cooperative — always trying to smooth the interaction.
- Structured and exhaustive — often delivering the complete answer in one turn.
- Reluctant to disengage — they rarely end the conversation, always waiting for more input.
Real users? We’re driven by our own goals, not by the desire to be “good” conversational partners. We’re terse because typing takes effort, we might introduce details piecemeal, and we’ll walk away without ceremony when we’re done.
When an assistant like GPT-4o is prompted to “act like a user,” it fights against its training. It may sprinkle in a typo, but its core persona — clarity, helpfulness, cooperation — shines through. That creates an unrealistically easy evaluation environment, leading to inflated performance scores for the assistant under test.
Flipping the Dialogue: Training Realistic User LMs
The researchers’ big insight is intuitive: to build a model that behaves like a user, train it on user data — not assistant outputs.
Instead of predicting the assistant’s next response, the model is trained to predict the user’s next turn. This is User Language Modeling, and it involves three key stages:
Figure 2: The training pipeline. Start with real user-assistant conversations, distill a high-level “user intent,” and train the model to produce the next user utterance based on that intent and dialogue so far.
1. Start with Real Conversations
The team used WildChat, a dataset of 478,000 real user–ChatGPT conversations from 192 countries. After removing duplicates, they had 384,336 conversations — a rich, messy source of authentic user behavior.
2. Generate a “User Intent”
For each conversation, GPT-4o was asked to create a high-level summary of the user’s goal — the intent. Crucially, this wasn’t a verbatim script; it captured the overall aim without dictating exact phrasing.
Example intent:
“You are a user getting information about weight loss strategies and the impact of medications on weight gain.”
This way, the User LM has a guiding goal but retains room to vary its phrasing and conversational progression.
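A rough sketch of that distillation step, assuming an OpenAI-style chat API (the prompt wording here is illustrative, not the paper's exact prompt):

```python
from openai import OpenAI

client = OpenAI()

def distill_intent(conversation_text: str) -> str:
    """Ask GPT-4o for a one-sentence, high-level summary of the user's goal."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": ("Summarize the user's overall goal in one sentence, "
                         "phrased as 'You are a user ...'. Do not quote the "
                         "user's exact wording or list every detail.")},
            {"role": "user", "content": conversation_text},
        ],
    )
    return response.choices[0].message.content
```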
3. Train the Model to Be the User
Using “flipped” data — the conversation history + intent — the model (e.g., Llama3-8b-Base) learns to generate the next user turn. Over time, it picks up not just what to say, but how to say it like a human.
To emulate real conversation endings, the team introduced a special <|endconversation|> token, teaching the model natural disengagement.
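A minimal sketch of the "flipping" step, assuming a simple role/content turn format; the field names and prompt template are placeholders, not the paper's actual preprocessing code:

```python
END_TOKEN = "<|endconversation|>"

def flip_conversation(intent: str, turns: list[dict]) -> list[dict]:
    """Turn one logged dialogue into (prompt, target) pairs whose targets are user turns."""
    examples = []
    history = f"Intent: {intent}\n"
    for turn in turns:
        if turn["role"] == "user":
            # The model learns to produce the user's next utterance from intent + history
            examples.append({"prompt": history, "target": turn["content"]})
        history += f"{turn['role']}: {turn['content']}\n"
    # One extra example teaches the model to disengage once the goal is met
    examples.append({"prompt": history, "target": END_TOKEN})
    return examples
```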
Does It Sound Like a User? Measuring Statistical Fit
Before putting User LMs into simulations, the team checked their language fit using perplexity (PPL) — a measure of how well the model predicts real user text. Lower PPL means the model finds the text unsurprising, suggesting good distributional alignment.
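For a sense of what that measurement looks like in practice, here is a rough sketch using Hugging Face transformers; the checkpoint name is a placeholder, and the paper scores only the user turns given their dialogue context rather than a raw text span:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "microsoft/UserLM-8b"  # placeholder checkpoint name
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def perplexity(text: str) -> float:
    """Mean-token perplexity of `text` under the model (lower = less surprising)."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return math.exp(loss.item())
```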
UserLM-8b was tested against base LMs, assistant-tuned LMs, and another simulated user model (USP-8b).
Table 1: Perplexity scores on two datasets. UserLM-8b has much lower PPL — meaning its predicted user language matches real data more closely than any assistant-based simulator.
Findings:
- UserLM-8b had 60–70% lower PPL than all baselines.
- Starting from a neutral base model beat starting from an instruction-tuned assistant hands-down.
- Training with intent conditioning made the simulator more steerable at test time.
Figure 3: Left — intent conditioning makes a steerable user LM. Right — base model initialization outperforms assistant-instruct checkpoints.
Evaluating Human-Like Behavior
The team designed six fine-grained tests to measure how well simulators mirror human conversational traits.
Table 2: Across metrics such as diversity, intent decomposition, dialogue termination, naturalness, role discipline, and intent discipline — UserLMs consistently beat assistant-based simulators.
Multi-Turn Interaction
- First-turn diversity: UserLM-8b scored 94.55%, matching human variety (94.01%). GPT-4o lagged at 74.42%.
- Intent decomposition: UserLMs spread details over turns; assistants tend to front-load information.
- Dialogue termination: UserLM-8b: 63.54 F1. GPT-4o: only 1.38 — assistants just won’t say goodbye.
Simulation Robustness
- Naturalness: Using a state-of-the-art AI detector, UserLM text scored ~80% human-likeness; assistant outputs scored near-zero.
- Role adherence: When pressed, UserLMs stayed in character >93% of the time; assistants reverted to answering like helpers.
- Intent adherence: UserLMs refused to be sidetracked >94% of the time; assistants were easily diverted.
Scaling effect: For assistant-based simulators, bigger models weren't better at playing the user. For User LMs, larger models consistently improved simulation quality.
Facing the Real Test: How Do Assistants Perform Against Realistic Users?
The final challenge: simulate multi-turn conversations for math and coding tasks, using different “user” simulators with GPT-4o as the assistant.
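The simulation loop itself is straightforward; here is a hypothetical sketch, with next_user_turn and respond standing in for whatever interfaces the simulator and assistant actually expose:

```python
END_TOKEN = "<|endconversation|>"

def simulate(user_lm, assistant, intent: str, max_turns: int = 10) -> list[dict]:
    """Alternate user-simulator and assistant turns until the user disengages."""
    history = []
    for _ in range(max_turns):
        user_turn = user_lm.next_user_turn(intent, history)   # assumed interface
        if END_TOKEN in user_turn:
            break
        history.append({"role": "user", "content": user_turn})
        reply = assistant.respond(history)                     # assumed interface
        history.append({"role": "assistant", "content": reply})
    return history
```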
Figure 7: For the same task, GPT-4o-based simulations (left) are repetitive. UserLM-8b simulations (right) introduce variety in language, pacing, and constraints.
Key differences with UserLM-8b simulations:
- More diverse behavior — varied length, selective repetition of important info, omission of trivial details, occasional new constraints.
- Lexical richness — rephrasing the same intent in many ways.
- Pace variation — conversations ranged from 2 to 7 turns, reflecting different user “speeds.”
Impact on assistant success:
- GPT-4o + GPT-4o simulator: 74.6% success rate.
- GPT-4o + UserLM-8b simulator: 57.4% success rate.
That’s a 17-point drop, revealing hidden weaknesses that easy simulations gloss over.
Conclusion: Better User Simulation = Better Assistants
Prompting an assistant to “act like a user” is like asking a hammer to be a screwdriver — wrong tool, wrong job.
Flipping the Dialogue shows how to create realistic, purpose-built User LMs. And the benefits are clear:
- Realistic evaluation — Stress-test assistants in messy, authentic scenarios.
- Richer training data — Diverse, natural conversations for fine-tuning assistants.
- Personalization potential — Extend to simulate users from specific demographics or domains.
- Preserve base models — The best User LMs start from neutral base checkpoints, emphasizing the need for open-source base model releases.
Investing in better user simulators doesn’t just improve testing — it shapes the development of assistants that can handle the unpredictable rhythm and style of genuine human conversation. If we want AI partners that thrive in the real world, we need to first teach them to navigate it.