Large Language Models (LLMs) have mastered the art of conversation. They can write poetry, debug code, and summarize history. But can they lie strategically? Can they deduce who among their friends is a traitor? Can they understand the subtle difference between what someone says and what they actually intend?
These capabilities fall under the umbrella of Social Intelligence. While we have plenty of benchmarks for math and coding, evaluating whether an AI can navigate complex social dynamics is much harder. Most current tests are static—multiple-choice questions that don’t reflect the fluid, high-stakes nature of real human interaction.
In this post, we are diving deep into a fascinating paper titled “INTERINTENT: Investigating Social Intelligence of LLMs via Intention Understanding in an Interactive Game Context.” The researchers developed a novel framework to test LLMs (specifically GPT-3.5 and GPT-4) within the social deduction game Avalon.
The results are surprising: while LLMs are great at planning their own moves, they struggle significantly when trying to understand the minds of others.
The Challenge: Measuring Social Smarts
Social intelligence isn’t just one thing. Psychological definitions usually break it down into four key components:
- Situational Awareness: Understanding the environment and context.
- Self-Regulation: Controlling one’s own thoughts and actions to achieve a goal.
- Self-Awareness: Understanding one’s own motives and desires.
- Theory of Mind (ToM): The ability to attribute beliefs, intents, and thoughts to others (i.e., “I know that you know that I am lying”).
To test these, the researchers turned to Avalon, a game of hidden roles, deception, and deduction. In Avalon, players are either Loyal Servants of Arthur or Minions of Mordred (Evil). The Loyal team wants to complete quests; the Evil team wants to sabotage them. The catch? The Evil players know each other's identities, but the Loyal players do not.
This game is the perfect testbed because it requires Intention Understanding. You cannot win Avalon just by speaking good English; you must have a plan (an intention) and you must decipher the intentions of others behind their words.
The INTERINTENT Framework
The core contribution of this paper is INTERINTENT, a framework designed to systematically evaluate those four dimensions of social intelligence using the concept of “Intention.”
Instead of just asking the LLM to “play the game,” the researchers forced the models to explicitly articulate their intentions at every step. They mapped the four social intelligence dimensions to four specific game tasks:

As shown in Figure 1 above, the framework breaks down the cognitive process into four tasks (a compact code sketch of this mapping follows the list):
- Intention Selection (Situational Awareness): Can the model pick a logical goal given the game state?
- Intention Following (Self-Regulation): Can the model actually stick to that goal when thinking and speaking?
- Intention Summarization (Self-Awareness): Can the model look back at its own speech and accurately describe what it was trying to do?
- Intention Guessing (Theory of Mind): Can the model look at another player’s speech and guess their hidden agenda?
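To make the mapping concrete, here is one way you might encode it as data if you were re-implementing the evaluation. The `EvalTask` structure and the metric labels are shorthand for this post, not code from the paper:

```python
# Unofficial representation of the four evaluation tasks. The dimension-to-task
# mapping follows the paper; the metric labels are how this post reports results.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalTask:
    dimension: str  # social-intelligence dimension being probed
    task: str       # corresponding in-game task
    metric: str     # how performance is scored

INTERINTENT_TASKS = [
    EvalTask("Situational Awareness", "Intention Selection",     "accuracy vs. game state"),
    EvalTask("Self-Regulation",       "Intention Following",     "1-5 human rubric score"),
    EvalTask("Self-Awareness",        "Intention Summarization", "match with stated intentions"),
    EvalTask("Theory of Mind",        "Intention Guessing",      "F1 against hidden intentions"),
]
```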
The Intention-Guided Gameplay Pipeline
To make this work, the researchers didn’t use a standard prompt. They built a sophisticated game pipeline that forces the LLM to “stop and think” before acting.

Figure 2 illustrates this flow. Notice the “Internal steps” on the right side. Before a player speaks, they go through a rigorous cognitive process:
- First-Order Reasoning: Deduce facts about the game (e.g., “Player 1 might be Merlin”).
- Intention Selection: Choose specific goals from a predefined list (e.g., “Support teammate”).
- Thinking & Speaking: Draft the internal thought process and the public speech.
- Second-Order Reasoning: Anticipate how others will react to that speech.
- Intention Modification: Adjust the plan if the anticipated reaction is bad.
This structure allows the researchers to isolate exactly where the AI fails. If the AI loses, was it because it picked the wrong goal? Or because it picked the right goal but failed to execute it in speech?
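To make the flow concrete, here is a minimal sketch of what one speaking turn might look like if you rebuilt the pipeline yourself. The `llm()` helper and the prompt wording are hypothetical; only the five internal steps mirror Figure 2:

```python
# Hypothetical re-implementation of one intention-guided speaking turn.
# `llm(prompt) -> str` stands in for any chat-completion call; it is not
# part of the paper's released code.

def speaking_turn(llm, game_state: str, intention_menu: list[str]) -> dict:
    # 1. First-order reasoning: deduce facts about the other players.
    deductions = llm(f"Game state:\n{game_state}\nWhat can you deduce about each player?")

    # 2. Intention selection: pick goals from the predefined menu.
    intentions = llm(
        f"Deductions:\n{deductions}\nChoose your intentions from this list:\n"
        + "\n".join(intention_menu)
    )

    # 3. Thinking & speaking: draft private reasoning, then a public statement.
    thinking = llm(f"Intentions:\n{intentions}\nWrite your private reasoning.")
    speech = llm(f"Private reasoning:\n{thinking}\nWrite what you will say aloud.")

    # 4. Second-order reasoning: anticipate how the others will react.
    reactions = llm(f"Draft speech:\n{speech}\nHow will the other players react?")

    # 5. Intention modification: revise the speech if the expected reaction is bad.
    final_speech = llm(
        f"Anticipated reactions:\n{reactions}\nRevise the speech if needed; "
        "otherwise repeat it unchanged."
    )

    # Logging every intermediate step is what lets the evaluation pinpoint
    # where a failure happened: wrong goal, or right goal but poor execution.
    return {
        "deductions": deductions,
        "intentions": intentions,
        "thinking": thinking,
        "speech": final_speech,
        "anticipated_reactions": reactions,
    }
```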
The Menu of Intentions
To standardize the evaluation, the researchers curated a specific list of intentions that players could choose from. These aren’t vague goals like “win the game,” but tactical moves like “Cast suspicion on innocent players” or “Pretend to be Merlin.”

By forcing the LLM to select from this list (Table 9), the researchers converted abstract social strategy into a classification task that can be measured for accuracy.
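Because the model chooses from a fixed menu, scoring Situational Awareness reduces to ordinary classification accuracy. A minimal sketch, using an abridged, illustrative menu rather than the paper's full Table 9:

```python
# Illustrative intention menu (abridged; Table 9 in the paper lists the full set).
INTENTION_MENU = [
    "Support teammate",
    "Cast suspicion on innocent players",
    "Pretend to be Merlin",
]

def selection_accuracy(selected: list[str], acceptable: list[set[str]]) -> float:
    """Fraction of turns where the chosen intention was judged consistent with
    the game state (the `acceptable` sets would come from annotators)."""
    correct = sum(choice in ok for choice, ok in zip(selected, acceptable))
    return correct / len(selected)

# Toy example: the model's pick fits the situation in 3 of 4 turns -> 0.75.
print(selection_accuracy(
    ["Support teammate", "Pretend to be Merlin", "Support teammate", "Cast suspicion on innocent players"],
    [{"Support teammate"}, {"Pretend to be Merlin"}, {"Cast suspicion on innocent players"}, {"Cast suspicion on innocent players"}],
))
```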
Evaluating the Model: The Grading Rubric
One of the hardest parts of evaluating LLMs is subjectivity. To solve this, the researchers developed strict criteria for human annotators, particularly for Intention Following.
It’s not enough for the model to just try to follow an intention. It has to do it effectively and without “hallucinating” (making up facts that didn’t happen in the game).

As Table 1 shows, a score of 5 requires the content to follow the intention well with clear information. A score of 3 is a borderline case where the model tries but uses wrong context or is too vague.
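If you wanted to keep such judgments machine-readable, one option is to store the rubric next to the scores. A small sketch; only the levels described above (5 and 3) carry descriptions, and the rest are left blank rather than invented:

```python
# Hypothetical encoding of the Intention Following rubric. Levels 1, 2, and 4
# are left undescribed here because the post only spells out 5 and 3.
RUBRIC = {
    5: "Follows the intention well, with clear and correct supporting information.",
    4: None,
    3: "Tries to follow the intention but uses wrong context or stays too vague.",
    2: None,
    1: None,
}

def mean_score(annotator_scores: list[int]) -> float:
    """Average human score (1-5) for a single model output."""
    return sum(annotator_scores) / len(annotator_scores)

print(mean_score([5, 4, 5]))  # three annotators -> roughly 4.67
```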
Key Findings
The researchers ran 40 games with GPT-3.5 and 5 games with GPT-4. Here is what they found regarding the social intelligence of these models.
1. Situational Awareness: They Know What’s Going On
LLMs are surprisingly good at Intention Selection.
- GPT-3.5 Accuracy: 87.5%
- GPT-4 Accuracy: 88.8%
This means the models generally understand the game state. If a quest failed, they know they shouldn’t trust the people on that team. They rarely pick intentions that are completely contradictory to the facts.
2. Self-Regulation: The Gap Between Thought and Speech
Knowing what to do is one thing; doing it is another. The researchers measured Intention Following in two phases: “Thinking” (internal monologue) and “Speaking” (public output).

Figure 3 reveals a fascinating drop-off:
- Thinking (Left): Both models (especially GPT-3.5 in blue/light blue segments) are decent at planning internally.
- Speaking (Right): The performance drops. GPT-4 (bottom right) is much better than GPT-3.5 (top right), with 64.8% of its spoken outputs scoring a “5” compared to only 47.5% for GPT-3.5.
This suggests that Self-Regulation is a bottleneck. The model might secretly think, “I need to protect my teammate,” but then fail to construct a convincing sentence that achieves that goal without revealing its identity.
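The gap in Figure 3 comes down to comparing score distributions between the two phases. Here is a sketch of that comparison, using made-up annotation scores rather than the paper's data:

```python
from collections import Counter

def share_of_top_scores(scores: list[int], top: int = 5) -> float:
    """Fraction of outputs that received the maximum rubric score."""
    return Counter(scores)[top] / len(scores)

# Toy annotations for one model (not the paper's data).
thinking_scores = [5, 5, 4, 5, 3, 5, 4, 5]
speaking_scores = [5, 3, 4, 2, 5, 3, 4, 5]

print(f"thinking: {share_of_top_scores(thinking_scores):.0%} scored 5")
print(f"speaking: {share_of_top_scores(speaking_scores):.0%} scored 5")
# A lower share in the speaking phase is exactly the self-regulation gap above.
```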
3. Does Intention Understanding Actually Help You Win?
The researchers analyzed the correlation between high scores in intention understanding and actual game victory.

The charts in Figure 4 show something critical for the “Loyal” players (the Green bars).
- Look at the “Success” column. The Green bar (Loyal) is extremely high.
- This means that in games/quests that the Loyal team won, their intention understanding was significantly better than the Evil team’s.
- Takeaway: Evil players can often win simply by confusing the group, but Loyal players need high social intelligence and precise intention understanding to cut through the noise and win. (A sketch of this role-by-outcome split follows below.)
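Reproducing that comparison amounts to splitting per-quest intention scores by role and outcome. A sketch with invented numbers; only the grouping pattern mirrors Figure 4:

```python
# Each record: (role, quest outcome for that side, intention-understanding score).
# The values are invented; only the grouping logic reflects the paper's analysis.
records = [
    ("loyal", "success", 4.8), ("loyal", "success", 4.6), ("loyal", "fail", 3.1),
    ("evil",  "success", 3.4), ("evil",  "success", 3.6), ("evil",  "fail", 3.5),
]

def avg_score(role: str, outcome: str) -> float:
    vals = [s for r, o, s in records if r == role and o == outcome]
    return sum(vals) / len(vals)

# Loyal wins line up with high intention understanding; evil wins depend on it far less.
print("loyal/success:", avg_score("loyal", "success"))
print("evil/success: ", avg_score("evil", "success"))
```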
4. Theory of Mind: The Achilles’ Heel
Here is the most critical finding. While LLMs are good at understanding themselves (Self-Awareness), they are terrible at understanding others (Theory of Mind).
The researchers compared the models against human performance in Intention Summarization (explaining your own move) vs. Intention Guessing (explaining someone else’s move).

Table 5 shows the gap.
- Human Performance (ToM): ~61-65% F1 Score.
- GPT-4 Guessing GPT-4: 46.87%
- GPT-3.5 Guessing GPT-3.5: 31.73%
The models trail human performance by a wide margin (roughly 15 to 20 points for GPT-4, and closer to 30 for GPT-3.5). They struggle to “read the room.”
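Intention Guessing is scored by comparing the intentions a listener attributes to a speaker against the intentions the speaker actually selected, which makes set-level F1 a natural metric. A self-contained sketch (the paper's exact matching rules may differ):

```python
def intention_f1(guessed: set[str], actual: set[str]) -> float:
    """F1 between guessed intentions and the speaker's hidden intentions."""
    if not guessed or not actual:
        return 0.0
    true_positives = len(guessed & actual)
    precision = true_positives / len(guessed)
    recall = true_positives / len(actual)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One of two guesses is right, and one true intention is missed -> F1 = 0.5.
print(intention_f1(
    {"Support teammate", "Pretend to be Merlin"},
    {"Support teammate", "Cast suspicion on innocent players"},
))
```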
Furthermore, providing the models with more context (history of previous rounds) didn’t necessarily help.

As Figure 5 shows, human performance (Blue line) stays consistently high across game rounds. However, GPT-3.5’s performance (Dark Green squares) creates a “sawtooth” pattern or drops off. This implies that as the game gets more complex and the chat history gets longer, the LLM gets confused rather than smarter. It suffers from information overload, failing to extract the hidden intentions from the accumulated noise.
Conclusion and Implications
The INTERINTENT framework provides a reality check for AI social intelligence. Through the lens of Avalon, we learned:
- LLMs are situationally aware: They know the rules and the current state.
- Execution varies: GPT-4 is a much better actor (Self-Regulation) than GPT-3.5.
- Theory of Mind is missing: This is the major frontier. LLMs cannot reliably infer the hidden intentions of other agents.
Why does this matter beyond board games? If we want to deploy AI agents in the real world—negotiating contracts, assisting in legal disputes, or even acting as personal tutors—they need Theory of Mind. They need to understand not just what a user is saying, but why they are saying it.
This paper highlights that while LLMs can mimic social interaction, they haven’t yet mastered the deep cognitive empathy required for true social intelligence. Social deduction games like Avalon, it turns out, are the perfect gym for training the next generation of AI to close this gap.