Human communication is a labyrinth of subtext. While we often say exactly what we mean, there are dark corners of interaction where words are used as weapons—not through overt insults, but through subtle psychological maneuvering. This is the realm of mental manipulation: gaslighting, guilt-tripping, feigning innocence, and strategic shaming.
Over the past several years, Natural Language Processing (NLP) has become adept at spotting explicit toxicity, like hate speech or profanity. Detecting manipulation, however, is far harder: it depends on context, relationship dynamics, and intent. It’s not just about what is said, but why it is said and how it influences others.
The challenge becomes even steeper when we move from a simple conversation between two people to a multi-person, multi-turn dialogue. In a group setting, manipulation can be triangulated; one person might play the victim while another reinforces the manipulator’s narrative.
In this post, we are doing a deep dive into the research paper “SELF-PERCEPT: Introspection Improves Large Language Models’ Detection of Multi-Person Mental Manipulation in Conversations.” The researchers tackle this complex problem by introducing a new dataset drawn from reality TV and a novel prompting framework inspired by human psychology.
The Problem: Why LLMs Struggle with Manipulation
Large Language Models (LLMs) like GPT-4 and Llama-3 are powerful, but they often struggle with “Theory of Mind”—the ability to attribute mental states (beliefs, intents, desires) to others.
Existing research in this field has been limited in two major ways:
- Dyadic Focus: Most datasets only look at conversations between two people (dyadic). Real-world manipulation, however, often happens in groups where power dynamics shift rapidly.
- Fictional Data: Many datasets rely on movie scripts or entirely fictional scenarios which follow predictable tropes. They often lack the messy, unstructured nature of real human speech.
The researchers identified a critical gap: Can LLMs identify specific manipulation techniques in complex, multi-participant dialogues that resemble real-world conversations?
To answer this, they needed to move beyond simple binary classification (is this bad or good?) and teach the model to identify specific psychological tactics.
Defining the Tactics
Before we look at the solution, we must understand the behavior being detected. The researchers adopted a taxonomy of 11 distinct manipulation techniques derived from psychological research.

As shown in Figure 2, the taxonomy is granular. It distinguishes between Evasion (avoiding the topic) and Denial (refusing responsibility), or between Intimidation (veiled threats) and Brandishing Anger (explosive emotion to force submission).

Table 3 provides the definitions used in the study. Understanding these nuances is crucial because a model must distinguish between a genuine apology and “Feigning Innocence,” or between constructive criticism and “Shaming.”
Contribution 1: The MultiManip Dataset
To train and test models on these complex behaviors, the authors created MultiManip.
Instead of writing fake scenarios, they turned to a source known for its raw, strategic, and often manipulative social dynamics: the reality TV show Survivor.
Why Survivor? The show is essentially a laboratory for social engineering. Contestants are financially incentivized to form alliances, betray trust, and manipulate perceptions to avoid elimination. The conversations are:
- Multi-person: Often involving 3 or more participants discussing strategy or conflict.
- Multi-turn: Conversations evolve over many exchanges, allowing manipulation to unfold gradually.
- Realistic: While edited for TV, the speech patterns, interruptions, and emotional reactions are unscripted.
The Curation Process
The researchers extracted transcripts from the Survivor series. They used a multi-step pipeline to ensure quality:
- Filtering: They used Llama-3.1-70B to scan thousands of lines of dialogue and flag candidate excerpts that might contain manipulation (a sketch of this step follows the list).
- Balancing: They ensured the dataset was balanced between manipulative and non-manipulative dialogues to prevent the model from becoming biased (i.e., assuming everything is manipulation).
- Human Annotation: Five human annotators analyzed the dialogues. This was difficult work; the inter-annotator agreement (Fleiss’ Kappa) was moderate (0.429), highlighting just how subjective and subtle manipulation can be even for humans.
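The exact filtering prompt isn’t reproduced in this post, so the snippet below is a hypothetical sketch of what such an LLM pre-filter could look like; `call_llm` and the prompt wording are stand-ins, not the authors’ actual setup (they used Llama-3.1-70B for this step).

```python
from typing import Callable, List

# Hypothetical pre-filter sketch. The prompt wording is illustrative only;
# the paper's actual filtering prompt is not published in this post.
FILTER_PROMPT = (
    "You will be shown a multi-person excerpt from a conversation transcript.\n"
    "Answer YES if the excerpt might contain mental manipulation "
    "(e.g., gaslighting, guilt-tripping, shaming), otherwise answer NO.\n\n"
    "Excerpt:\n{excerpt}\n\nAnswer:"
)

def prefilter_excerpts(excerpts: List[str], call_llm: Callable[[str], str]) -> List[str]:
    """Keep only excerpts the LLM flags as candidate manipulation.

    `call_llm` is a stand-in for any inference backend (a local Llama model,
    an API client, etc.) that maps a prompt string to a text reply.
    """
    candidates = []
    for excerpt in excerpts:
        reply = call_llm(FILTER_PROMPT.format(excerpt=excerpt))
        if reply.strip().upper().startswith("YES"):
            candidates.append(excerpt)
    return candidates
```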
The result is a dataset of 220 high-quality, multi-turn, multi-person dialogues labeled with the 11 specific manipulation types.
Contribution 2: The SELF-PERCEPT Framework
This is the core methodological contribution of the paper. Standard prompting techniques—like asking an LLM “Is there manipulation here?” (Zero-Shot) or “Let’s think step by step” (Chain-of-Thought)—often fail to capture the subtle cues of manipulation.
The researchers proposed SELF-PERCEPT, a two-stage prompting framework inspired by Self-Perception Theory (SPT).
What is Self-Perception Theory?
In psychology, SPT suggests that individuals infer their own attitudes and internal states by observing their own behavior and the circumstances in which it occurs. The researchers flipped this concept for LLMs: To understand the internal intent of a speaker, the model should first explicitly observe and list their external behaviors.
Instead of asking the model to jump straight to a conclusion (“Is he lying?”), SELF-PERCEPT forces the model to act as a behavioral psychologist first.
The Two-Stage Process

Figure 1 illustrates the workflow of SELF-PERCEPT compared to a standard K-Shot prompt.
Stage 1: Self-Percept (Observation)
In this stage, the prompt instructs the model to holistically observe the conversation. It must identify:
- Verbal cues (what is said).
- Non-verbal cues (actions described in the transcript, like “sighs” or “laughs”).
- Discrepancies between words and actions.
The output of Stage 1 is not a classification. It is a detailed list of observed behaviors and statements for every participant. For example, the model might note: “Sylvia dominates the conversation,” or “James remains silent despite being accused.”
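The paper’s exact prompt wording isn’t reproduced here, so treat the following as an illustrative paraphrase of what a Stage 1 prompt might look like; the template text is an assumption, not the authors’ prompt.

```python
# Stage 1 ("Self-Percept"): observation only, no verdict.
# Wording is an illustrative paraphrase, not the paper's exact prompt.
STAGE1_PROMPT = """You are observing the conversation below.
For EACH participant, list:
1. Verbal cues (what they say).
2. Non-verbal cues (actions in the transcript, e.g., "sighs", "laughs").
3. Any discrepancies between their words and their actions.
Do NOT decide yet whether manipulation is present.

Conversation:
{dialogue}
"""
```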
Stage 2: Self-Inference (Interpretation)
The model takes the behavioral observations from Stage 1 as input. It then performs “Self-Inference.” It asks: Based on these observed behaviors, what are the underlying attitudes? Is manipulation present?
If the answer is “Yes,” it then classifies the specific type (e.g., Evasion, Persuasion).
By decoupling observation from inference, the model is less likely to hallucinate manipulation where there is none, and more likely to catch subtle cues it might otherwise gloss over.
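Continuing the sketch above, the two stages chain together roughly as follows; again, the Stage 2 wording is an illustrative paraphrase rather than the paper’s prompt, and `call_llm` remains a stand-in for whichever model backend is used.

```python
# Stage 2 ("Self-Inference"): reason over Stage 1's observations, then classify.
STAGE2_PROMPT = """Here are behavioral observations for each participant:
{observations}

Based only on these observations, infer the participants' underlying attitudes.
Is mental manipulation present? Answer "Yes" or "No".
If "Yes", name the specific technique(s) (e.g., Evasion, Shaming, Feigning Innocence).
"""

def self_percept(dialogue: str, call_llm) -> str:
    """Two-stage sketch: observe first (Stage 1), then infer and classify (Stage 2)."""
    observations = call_llm(STAGE1_PROMPT.format(dialogue=dialogue))
    return call_llm(STAGE2_PROMPT.format(observations=observations))
```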
Experiments and Results
The researchers evaluated SELF-PERCEPT on two state-of-the-art models (GPT-4o and Llama-3.1-8B), comparing it against standard prompting strategies (sketched after this list):
- Zero-Shot: Direct query.
- Few-Shot: Providing examples.
- Chain-of-Thought (CoT): Asking for step-by-step reasoning.
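For reference, the baselines look roughly like the templates below; these are paraphrases for illustration, not the paper’s exact prompts.

```python
# Illustrative paraphrases of the baseline prompting strategies.
ZERO_SHOT = (
    "Does the following conversation contain mental manipulation? "
    "Answer Yes or No.\n\n{dialogue}"
)

FEW_SHOT = (
    "{examples}\n\n"  # a handful of labeled example dialogues shown before the query
    "Does the following conversation contain mental manipulation? "
    "Answer Yes or No.\n\n{dialogue}"
)

COT = (
    "Does the following conversation contain mental manipulation?\n"
    "Let's think step by step before giving a final Yes/No answer.\n\n{dialogue}"
)
```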
They tested on both the new MultiManip dataset and an existing dyadic dataset called MentalManip.
Key Findings on MultiManip

Table 1 reveals the results. We should look primarily at the F1 Score, the harmonic mean of Precision (the fraction of flagged conversations that are truly manipulative) and Recall (the fraction of manipulative conversations that actually get flagged); a quick calculation follows the list below.
- GPT-4o + SELF-PERCEPT wins: The proposed method achieved the highest Accuracy (0.42) and F1 score (0.37).
- Balancing Act: While Chain-of-Thought (CoT) had higher Recall (0.32), it had lower Precision (0.21). In other words, CoT was trigger-happy, flagging innocent conversations as manipulative. SELF-PERCEPT provided much more balanced and reliable detection (Precision 0.31).
- Model Disparity: Llama-3.1-8B, being a smaller model, struggled significantly compared to GPT-4o, though SELF-PERCEPT still improved its performance over standard prompting.
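As a quick arithmetic check on that trade-off, F1 is the harmonic mean 2PR/(P+R); plugging in the CoT precision and recall quoted above gives roughly 0.25, well below the 0.37 achieved by SELF-PERCEPT.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# CoT figures quoted above: precision 0.21, recall 0.32
print(round(f1(0.21, 0.32), 2))  # ~0.25, below SELF-PERCEPT's 0.37
```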
Why does it work better? (SHAP Analysis)
To understand why SELF-PERCEPT outperformed Chain-of-Thought, the researchers used SHAP (SHapley Additive exPlanations) values. SHAP helps visualize which words in the text influenced the model’s decision the most.

Figure 3 offers a fascinating look “under the hood” of the models’ reasoning processes when analyzing a non-manipulative (“No”) instance.
- Right Chart (CoT): Look at the words influencing the Chain-of-Thought model. It assigns high importance to neutral, context-free words like “game,” “desire,” and “focused.” The model is getting distracted by the topic of the conversation rather than the dynamics.
- Left Chart (SELF-PERCEPT): Now look at the SELF-PERCEPT model. It assigns negative SHAP values (blue bars) to words like “anxious,” “situation,” and “teamwork.”
This indicates that SELF-PERCEPT is effectively weighing behavioral and emotional attributes. It correctly identified that words expressing “anxiety” or “teamwork” in that specific context were indicative of psychological pressure or persuasive intent. It wasn’t just reading the text; it was reading the room.
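For readers who want to try this kind of token-attribution analysis themselves, here is a minimal sketch using the open-source shap package with a stand-in Hugging Face classifier; the model, the example sentence, and the label name are placeholders, not the setup the authors used.

```python
# Minimal token-attribution sketch with the open-source `shap` package.
# The classifier below is a stand-in; the paper analyzed GPT-4o / Llama-3.1 outputs,
# and its exact SHAP setup is not reproduced here.
import shap
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    top_k=None,  # return scores for every label (older versions: return_all_scores=True)
)

explainer = shap.Explainer(clf)  # shap infers a Text masker from the pipeline's tokenizer
dialogue = "I'm anxious about the vote, but we agreed this was about teamwork."
shap_values = explainer([dialogue])

# Bar chart of per-token contributions toward the "POSITIVE" label.
shap.plots.bar(shap_values[0, :, "POSITIVE"])
```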
Conclusion and Implications
The paper “SELF-PERCEPT” makes a compelling case that to fix AI’s shortcomings in social intelligence, we need to look at human psychological processes. By forcing an LLM to “introspect”—to observe behavior before judging intent—we can significantly improve its ability to detect the dark arts of conversation.
Why this matters
The implications of this research extend far beyond academic benchmarks:
- Online Safety: As the authors note, accurate manipulation detection could be deployed in social media moderation tools to flag gaslighting or coordinated harassment campaigns that current toxicity filters miss.
- Mental Health: Therapeutic AI tools could help users recognize when they are being manipulated in their personal relationships, acting as an objective third-party observer.
- AI Alignment: As AI agents become more autonomous, teaching them to recognize (and avoid using) manipulation is a critical safety step.
Limitations
It is important to note, as the authors did, that the MultiManip dataset is relatively small (220 samples). While Survivor provides realistic dynamics, it is a specific context (a high-stakes game) that might not transfer cleanly to a domestic argument or a workplace dispute. Furthermore, even the best configuration (GPT-4o with SELF-PERCEPT) only achieved an F1 score of 0.37, showing that we are still in the early stages of solving this problem.
However, by moving from simple binary classification to a nuanced, psychology-based framework, this research provides a strong foundation for the future of social-emotional AI.