Training to become a mental health counselor is a high-stakes endeavor. Novices need extensive practice to navigate sensitive conversations, spot emotional cues, and respond with empathy. Ideally, they would practice with real patients, but that raises massive privacy, ethical, and safety concerns. You can’t exactly use a vulnerable patient as a “guinea pig” for a student’s first attempt at therapy.

Historically, this gap has been filled by roleplaying with peers or hiring actors (“standardized patients”). But recently, Large Language Models (LLMs) like GPT-4 have offered a tantalizing alternative: infinite, on-demand roleplay partners.

However, there is a catch. Off-the-shelf LLMs are often too helpful. They tend to be articulate, cooperative, and emotionally stable—qualities that real patients in crisis often lack. Real patients might be resistant, colloquial, inconsistent, or hostile. When experts try to “prompt” LLMs to act this way, they often struggle to translate their clinical intuition into technical prompt engineering.

Enter Roleplay-doh, a new system developed by researchers at Stanford University. This paper introduces a human-AI collaboration tool that allows domain experts (therapists) to “sculpt” AI patients not by writing code, but by providing feedback. Furthermore, it introduces a clever technical pipeline to ensure the AI actually listens to that feedback.

The Problem: Why Can’t We Just Prompt GPT-4?

If you ask ChatGPT to “act like a depressed patient,” it will give a reasonable approximation. But for high-quality training, “reasonable” isn’t enough. It needs to be authentic.

Mental health data is scarce and highly private, making it difficult to fine-tune models on real therapy transcripts. This leaves us with prompting. The problem is that domain experts—the people who know exactly how a patient should sound—are rarely prompt engineering experts. They know the patient is “sounding too formal,” but they might not know how to edit a system prompt to fix it effectively.

Furthermore, even when you give an LLM a complex persona, it often drifts. It might forget to be resistant, or it might apply a rule incorrectly (e.g., acting angry when the therapist is being helpful). The researchers identified that enabling experts to effectively create simulations requires two things:

  1. An interface to translate expert critique into strict behavioral rules (Principles).
  2. A system ensuring the LLM actually follows those rules during the conversation (Adherence).

Figure 1: Roleplay-doh empowers an expert counselor to create a customized AI patient intended for other novice counselors to use as a practice partner. While interacting with the AI patient, the expert counselor can provide qualitative feedback which is converted by an LLM into a principle, or a custom rule governing desired roleplay behavior. The AI patient references the updated expert-defined principles to generate its subsequent responses.

As shown in Figure 1, the core workflow of Roleplay-doh involves an expert interacting with the AI. When the AI gets it wrong (e.g., accepting encouragement too easily), the expert provides feedback. The system then translates that feedback into a formal principle (e.g., “Respond with hesitancy when given encouragement”).

The Roleplay-doh Interface: From Critique to Constitution

The researchers built an interactive tool designed for non-technical users. The workflow is iterative: the expert defines a scenario, chats with the bot, and corrects it in real-time.

The tool relies on a concept called Constitutional AI, where the model’s behavior is governed by a set of natural language principles. Instead of the expert trying to write these principles from scratch (which is cognitively demanding), the tool elicits them through reaction.
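To make the "constitution" idea concrete, here is a minimal sketch of how a set of principles might be prepended to the roleplay system prompt. The function name and prompt wording are illustrative assumptions, not the paper's exact implementation:

```python
# Minimal sketch (assumed, not the authors' exact code): principles act as a
# small "constitution" prepended to the AI patient's system prompt.
def build_system_prompt(scenario: str, principles: list[str]) -> str:
    """Assemble a roleplay system prompt governed by expert principles."""
    rules = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(principles))
    return (
        f"You are roleplaying a patient in this scenario: {scenario}\n"
        "Always follow these expert-defined principles:\n"
        f"{rules}"
    )

print(build_system_prompt(
    "College student with depression, skeptical of therapy.",
    ["Respond with hesitancy when given encouragement.",
     "Keep replies short and colloquial."],
))
```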

When the AI generates a response, the expert can perform three actions:

  1. Kudos: Highlight behavior to reinforce.
  2. Critique: Explain what was wrong.
  3. Rewrite: Write out how the patient should have responded.

Behind the scenes, an LLM analyzes this feedback. If the expert rewrites a response, the system compares the original AI response with the expert’s rewrite to figure out the underlying rule. It then generates a principle automatically.
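As a rough sketch of what that conversion step could look like, assuming an OpenAI-style chat API (the prompt wording and function name are illustrative, not the authors' implementation):

```python
# Hypothetical sketch of feedback-to-principle conversion, assuming an
# OpenAI-style chat API. Prompt wording is illustrative only.
from openai import OpenAI

client = OpenAI()

def feedback_to_principle(ai_response: str, feedback: str,
                          feedback_type: str = "critique") -> str:
    """Distill one piece of expert feedback into a general roleplay rule."""
    prompt = (
        "You maintain a set of rules governing an AI patient's roleplay.\n"
        f"The AI patient said: \"{ai_response}\"\n"
        f"An expert counselor gave this {feedback_type}: \"{feedback}\"\n"
        "State one general principle (a single imperative sentence) the AI "
        "patient should follow in future turns."
    )
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content.strip()

# A critique becomes a persistent rule, e.g.:
#   feedback_to_principle(
#       "Thanks, that really helps! I feel much better now.",
#       "A patient this depressed would not accept encouragement so easily.")
#   -> "Respond with hesitancy when given encouragement."
```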

Figure 4: Roleplay-doh allows users to chat with an AI patient, provide feedback as a Kudos/Critique/Rewrite, and convert feedback into principles, which in turn shape the roleplay behavior.

Figure 4 demonstrates this interface. The user provides a critique (e.g., “mood is more agitated”), and the system converts this into a persistent rule. This allows the expert to build a complex “constitution” for the patient simply by correcting mistakes during a practice conversation.

The Core Technical Challenge: Getting the LLM to Listen

During pilot testing, the researchers discovered a significant issue. Even with a perfect set of expert-defined principles, the LLM failed to follow them about 20% of the time.

The failures fell into specific categories:

  • Contextual Misapplication: The AI would apply a rule in the wrong situation. For example, if a principle said “Show hesitation when receiving advice,” the AI might show hesitation when the therapist just said “Hello.”
  • Complexity Overload: When a principle contained multiple parts (e.g., “Be short, avoid fancy words, and sound anxious”), the AI would often miss one or more components.
  • Dialogue Awkwardness: Sometimes, in trying to follow a rule, the AI would generate text that just felt unnatural or robotic.

To solve this, the authors developed a novel Principle-Adherence Pipeline.

The Pipeline: Divide and Conquer

Standard prompting asks the LLM to “generate a response following these 10 rules.” This creates a heavy cognitive load for the model. The Roleplay-doh pipeline breaks this generation process down into a rigorous verification loop.

Figure 2: Principle-adherence prompting pipeline for mitigating errors in satisfying expert principles and dialogue conventions. In Stage 1, expert-defined principles are rewritten into several Yes/No questions, and the LLM generates additional principle questions relevant to ensuring adherence to dialogue conventions such as coherence and consistency. In Stage 2, the LLM (a) evaluates whether the questions are applicable to the context and answers the principle-adherence questions; and (b) refines the response to ideally receive a Yes on all questions.

As illustrated in Figure 2 above, the pipeline operates in two distinct stages:

Stage 1: Transforming Principles into Questions

The system doesn’t just feed the raw principles to the generator. It first passes them through a Rewriter Module (sketched in code after the list below).

  • Simplification: It turns complex, multi-part principles into simple “Yes/No” questions. A rule like “Be concise and open-ended” becomes two questions: “Is the discussion concise?” and “Does it encourage conversation via open-endedness?”
  • Automatic Principles: The system also generates “sanity check” questions relevant to general dialogue quality, such as ensuring the response directly addresses the therapist’s question.
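Here is a sketch of that rewriter step, under the same assumptions as the earlier snippets (an OpenAI-style client; the prompt text is a guess at the spirit of the module, not the paper's verbatim prompt):

```python
# Sketch of the Stage 1 Rewriter Module (assumptions as before).
from openai import OpenAI

client = OpenAI()

def principles_to_questions(principles: list[str]) -> list[str]:
    """Split each (possibly multi-part) principle into single-behavior Yes/No checks."""
    prompt = (
        "Rewrite each roleplay principle below into one or more simple Yes/No "
        "questions, one per line, each checking exactly one behavior:\n"
        + "\n".join(f"- {p}" for p in principles)
    )
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return [line.lstrip("- ").strip()
            for line in result.choices[0].message.content.splitlines()
            if line.strip()]

# "Be concise and open-ended" would ideally come back as two checks:
#   "Is the response concise?"
#   "Does the response encourage further conversation?"
```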

Stage 2: Evaluate and Self-Refine

Once the initial response is generated, it is not shown to the user yet. It goes through an Applicability and Adherence Evaluator (sketched after the list below).

  • Applicability: The system checks if a specific principle even applies to the current context. If the therapist didn’t give advice, the rule about “how to react to advice” is marked N/A. This prevents the model from forcing behaviors where they don’t belong.
  • Adherence: For all applicable questions, the system asks: “Did the response satisfy this?” If the answer is “No,” the system triggers a rewrite loop, explicitly instructing the model to fix the specific failure.
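Here is one way the Stage 2 loop could be wired up, again as an assumed sketch rather than the authors' code (`ask_llm` is a hypothetical one-shot helper over the same OpenAI-style client):

```python
# Assumed sketch of the Stage 2 evaluate-and-refine loop.
from openai import OpenAI

client = OpenAI()

def ask_llm(prompt: str) -> str:
    """Hypothetical one-shot helper."""
    result = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}])
    return result.choices[0].message.content.strip()

def refine_response(context: str, response: str, questions: list[str],
                    max_rounds: int = 2) -> str:
    """Check the draft against each applicable question; rewrite on failures."""
    for _ in range(max_rounds):
        failed = []
        for q in questions:
            # (a) Applicability: does this rule even apply to the current turn?
            applies = ask_llm(
                f"Dialogue so far:\n{context}\n\nDoes this check apply to the "
                f"patient's next turn? Answer Yes or No.\nCheck: {q}")
            if not applies.lower().startswith("yes"):
                continue  # marked N/A, e.g. an advice rule when no advice was given
            # (b) Adherence: does the draft satisfy the applicable check?
            passes = ask_llm(
                f"Dialogue so far:\n{context}\n\nDraft patient response:\n"
                f"{response}\n\nAnswer Yes or No: {q}")
            if not passes.lower().startswith("yes"):
                failed.append(q)
        if not failed:
            return response  # every applicable check passed
        # Rewrite loop: fix only the specific failures, nothing else.
        response = ask_llm(
            f"Dialogue so far:\n{context}\n\nDraft patient response:\n"
            f"{response}\n\nRewrite the draft, changing as little as possible, "
            "so the answer to each question below becomes Yes:\n"
            + "\n".join(f"- {q}" for q in failed))
    return response
```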

This “generate, then verify” approach shifts much of the burden from one-shot generation (hard) to verification (easier), resulting in significantly higher-quality outputs.
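Chaining the sketches together (same assumptions and hypothetical helpers as above), a single turn would run roughly like this:

```python
# End-to-end turn, reusing the hypothetical helpers sketched earlier.
context = "Therapist: Have you tried going for a walk when you feel low?"
principles = ["Show hesitation when receiving advice.",
              "Keep replies short and colloquial."]

questions = principles_to_questions(principles)           # Stage 1
draft = ask_llm(build_system_prompt(
    "College student with depression.", principles)
    + f"\n\nDialogue so far:\n{context}\nPatient:")       # initial draft
print(refine_response(context, draft, questions))         # Stage 2
```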

Experimental Setup

To validate Roleplay-doh, the researchers recruited 25 counseling experts. The study was designed to compare two methods of creating an AI patient:

  1. Scenario-Only: The expert writes a detailed description of the patient (backstory, symptoms), but does not use the iterative principle feedback tool.
  2. Scenario + Expert Principles: The expert uses Roleplay-doh to refine the patient via the feedback loop described above.

The experts interacted with both versions and rated them. Additionally, to ensure the creators weren’t biased toward their own work, 5 third-party expert counselors blindly reviewed the transcripts to judge authenticity.

Results: Did It Work?

The qualitative analysis of the principles created by experts was fascinating in itself. Experts didn’t just ask for “sadness”; they defined complex behavioral dynamics.

Table 2: Themes taken from qualitative analysis of principles and representative examples. We discover several novel (*) principles compared to those defined in prior work on AI patients (Chen et al., 2023; Stapleton et al., 2023). Themes are categorized into stages of conversation taken from Liu et al. (2021): exploration, comforting, and action; those relating to the overall conversation are categorized as stage-agnostic.

As seen in Table 2, experts created principles that governed specific stages of therapy. For example, in the “Exploration” stage, they created rules like “Show initial mistrust and hesitation.” They also defined conflicting principles depending on the persona—some patients were told to be disorganized and conflicted, while others were told to be concise. This highlights why a one-size-fits-all “Mental Health Bot” prompt fails; real patients are diverse.

Quantitative Success

The experts rated the Scenario + Principles patients significantly higher across almost every metric.

Table 1: Creators and third-party counselors compared the Scenario-Only vs. Scenario + Expert Principles AI patients using 7-point Likert-scale measures; third-party judges were asked identical measures when possible, with two measures modified to match the external perspective. Creator Ratings: Creators (N=25) rated both AI patients. After refining the AI patient simulation with principles, creators rated the patient significantly higher on all measures except for “stayed in role,” for which both AI patients score highly. Third-Party Ratings: Third-party counselors (N=5) provided 125 total comparisons of the two AI patient versions. The treatment effect of adding expert principles was estimated using the following linear mixed-effects model: Rating ~ Treatment + CreatorID + (1 | AnnotatorID). Third-party counselors rate AI patients with principles significantly higher on 4 of the 6 measures. (***: p < .001, **: p < .01, *: p < .05.)

Table 1 shows that adding expert principles improved Authenticity, Resemblance to Past Cases, and Readiness for Training. The Scenario-Only bots (standard prompting) were often described as “too articulate” or “too cooperative”—traits that make for a pleasant chat but terrible training for a therapist who needs to learn how to handle resistance.

Validating the Pipeline

Finally, the researchers tested whether their technical pipeline (turning principles into questions and self-correcting) was actually necessary. They compared their Full method against a No Critique baseline (standard generation) and several ablations (removing parts of the pipeline).

Figure 3: Win/Tie/Loss for the Error Test Cases along Consistency with Context (M1), Principle Adherence (M3), and Overall. Pairwise preference evaluation results with [No Critique] as a baseline. Results obtained after majority voting.

The results in Figure 3 are clear. The Full pipeline (the yellow bars representing “Win”) consistently outperformed the baseline in adhering to principles (M3) and overall quality.

Crucially, the ablation study showed that the “Naive” approach—simply asking the LLM to “fix itself” without breaking principles down into questions—resulted in a high number of ties (light blue bars). It rarely improved the output. This indicates that the specific step of rewriting principles into simple Yes/No questions is the “secret sauce” that allows the LLM to effectively self-correct.

Conclusion and Implications

Roleplay-doh demonstrates a powerful paradigm for Human-AI interaction. Rather than expecting domain experts to learn the dark arts of prompt engineering, we can build tools that speak their language (feedback, critique, praise) and translate it into machine language (structured principles).

The implications extend beyond mental health. This same architecture could be applied to training sales representatives to handle difficult clients, teaching managers to conduct performance reviews, or helping medical students practice bedside manner.

By combining an intuitive expert interface with a rigorous, self-correcting backend pipeline, Roleplay-doh bridges the gap between a generic LLM and a highly specialized, authentic training tool. It moves us from “talking to a chatbot” to “practicing with a patient,” making digital simulations a viable reality for high-stakes professional training.