We have all been there: a chaotic group chat on WhatsApp, Slack, or Discord. Multiple conversations happen simultaneously, people reply to messages from three hours ago, and users jump in and out of the discussion. Navigating this web of interactions requires more than just understanding language; it requires understanding structure. You need to know who is talking to whom to make sense of the “what.”
Large Language Models (LLMs) like GPT-4 or Llama-2 excel at one-on-one dialogues. But do they struggle when the room gets crowded?
In the paper “Do LLMs suffer from Multi-Party Hangover?”, researchers Nicolò Penzo and colleagues from Fondazione Bruno Kessler investigate whether LLMs can truly grasp the complexity of Multi-Party Conversations (MPCs). They propose a novel diagnostic pipeline to test whether these models rely on the actual text content or the structural web of interactions.
This article breaks down their methodology, their unique approach to data privacy via summarization, and their findings on how conversation structure impacts AI performance.
The Problem: Linguistics vs. Structure
In a standard dialogue system (like a customer service chatbot), the conversation flow is linear: User A speaks, System B responds. However, in an MPC, the flow is a graph. User A might speak to User B, then User C interrupts, and User A responds to User C while User B adds a comment meant for User A.
To understand an MPC, a model needs to master two distinct dimensions:
- Linguistic Information: The semantic content of the messages (what is being said).
- Structural Information: The interaction graph (who is addressing whom).
Most evaluation methods for LLMs focus heavily on the linguistic side. This paper argues that we are overlooking the structural variance. To diagnose this, the authors focus on two specific proxy tasks, illustrated below:

As shown in Figure 1, the researchers break the problem down into:
- Addressee Recognition (AR): Given the conversation history and the next speaker, can the model predict who they are speaking to? This task is inherently structural.
- Response Selection (RS): Given the history and the next speaker, can the model select the correct text of the response from a list of candidates? This task is inherently linguistic.
By testing performance on these two tasks across conversations with varying numbers of participants, the researchers aim to isolate the model’s structural reasoning capabilities.
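To make the distinction concrete, here is a minimal sketch of how the two tasks might be posed to an instruction-tuned model. The prompt wording and helper functions are illustrative assumptions, not the paper's actual templates.

```python
# Illustrative framing of the two proxy tasks (not the paper's exact prompts).

def addressee_recognition_prompt(history: str, next_speaker: str, users: list[str]) -> str:
    """AR: given the history and the next speaker, predict who they address (structural)."""
    options = ", ".join(users)
    return (
        f"Conversation so far:\n{history}\n\n"
        f"The next message is written by {next_speaker}. "
        f"Which user is it addressed to? Choose one of: {options}."
    )

def response_selection_prompt(history: str, next_speaker: str, candidates: list[str]) -> str:
    """RS: given the history and the next speaker, pick the true response among candidates (linguistic)."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return (
        f"Conversation so far:\n{history}\n\n"
        f"{next_speaker} speaks next. Which of these candidate messages did they actually send?\n"
        f"{numbered}"
    )
```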
The Method: Deconstructing the Conversation
The core of this research is a “diagnostic pipeline.” Rather than simply feeding raw chat logs into Llama2-13b-chat (the model used for this study), the authors experiment with different ways of representing the conversation. This allows them to see exactly which pieces of information the model uses to make decisions.
1. Input Representations
The authors identified four distinct ways to represent a conversation in the prompt fed to the LLM.

Referencing Figure 2, the four input types are:
- Conversation Transcript (Top Left): This is the standard log containing the speaker and the message text. It provides full linguistic context.
- Interaction Transcript (Top Right): This strips away the text entirely. It only lists “Speaker X to Addressee Y.” This tests if the model can guess the outcome based purely on the flow of interaction, without knowing what was actually said.
- Summary (Bottom Left): Instead of raw text, the LLM is given a generated summary of the topics discussed.
- User Description (Bottom Right): The LLM is given a generated description of the users’ behaviors (e.g., “User A is sarcastic”).
Why Summaries and Descriptions? The inclusion of summaries and user descriptions addresses a secondary goal of the paper: Data Minimization. In the era of GDPR and privacy concerns, sharing raw chat logs is risky. If an LLM can perform well using only anonymized structural graphs and high-level summaries, researchers could share datasets without exposing specific user messages.
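To illustrate how these representations differ, here is a small sketch of how each might be derived from a logged conversation. The message schema (speaker, addressee, text) is an assumption for illustration, not the corpus's exact layout; in the paper the summary and user descriptions are generated by Llama-2, so they are only stubbed here.

```python
# Sketch of deriving the four input representations from one logged conversation.
messages = [
    {"speaker": "A", "addressee": "B", "text": "My wifi driver broke after the update."},
    {"speaker": "B", "addressee": "A", "text": "Which kernel version are you on?"},
    {"speaker": "C", "addressee": "A", "text": "Try reinstalling the driver package."},
]

# 1. Conversation Transcript: full linguistic context (speaker + message text).
conversation_transcript = "\n".join(f"{m['speaker']}: {m['text']}" for m in messages)

# 2. Interaction Transcript: structure only, no message content.
interaction_transcript = "\n".join(f"{m['speaker']} to {m['addressee']}" for m in messages)

# 3. Summary and 4. User Description: produced by an LLM in the pipeline (stubbed here).
summary = "<LLM-generated summary of the topics discussed>"
user_descriptions = "<LLM-generated description of each user's behaviour>"
```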
2. The Evaluation Pipeline
The researchers structured their experiment using a rigorous pipeline. They utilized the Ubuntu Internet Relay Chat (IRC) corpus—a massive dataset of technical support discussions. To control for complexity, they created four “diagnostic datasets” containing conversations with exactly 3, 4, 5, and 6 users (referred to as Ubuntu3, Ubuntu4, etc.).
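A minimal sketch of how such diagnostic splits could be built, assuming each conversation is stored as a list of messages in the same format as the example above:

```python
# Keep only conversations with exactly k distinct participants (k = 3..6).
def build_diagnostic_sets(conversations):
    def participants(conversation):
        # Users involved in the conversation, whether speaking or being addressed.
        return {m["speaker"] for m in conversation} | {m["addressee"] for m in conversation}

    return {
        f"Ubuntu{k}": [c for c in conversations if len(participants(c)) == k]
        for k in (3, 4, 5, 6)
    }
```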

As outlined in Figure 4, the process works as follows:
- Extract Data: Get the raw conversation and interaction graph.
- Generate Intermediate Inputs: Use Llama-2 to generate the Summary and User Descriptions (if the specific experiment calls for it).
- Prompt Design: Construct a prompt combining specific inputs (e.g., Structure + Summary).
- Task Execution: Ask the model to perform Addressee Recognition (AR) or Response Selection (RS).
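Putting the four stages together, a rough sketch of the pipeline might look like the following. The function names and prompt strings are illustrative, not the authors' code, and `generate` stands in for a call to Llama2-13b-chat.

```python
def run_diagnostic(conversation, task, inputs, generate):
    # 1. Extract data: raw transcript and interaction graph.
    conv = "\n".join(f"{m['speaker']}: {m['text']}" for m in conversation)
    struct = "\n".join(f"{m['speaker']} to {m['addressee']}" for m in conversation)

    # 2. Generate intermediate inputs only when the configuration asks for them.
    summary = generate(f"Summarise the topics of this chat:\n{conv}") if "SUMM" in inputs else None
    descriptions = generate(f"Describe each user's behaviour:\n{conv}") if "DESC" in inputs else None

    # 3. Prompt design: combine the requested input elements with the task instruction.
    parts = {"CONV": conv, "STRUCT": struct, "SUMM": summary, "DESC": descriptions}
    context = "\n\n".join(parts[name] for name in inputs if parts[name])
    instruction = (
        "Who is the next speaker addressing?" if task == "AR"
        else "Which candidate message is the real next response?"
    )

    # 4. Task execution: a single zero-shot call to the model.
    return generate(f"{context}\n\n{instruction}")
```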
3. Prompt Engineering and Sensitivity
A major challenge with LLMs is that they are sensitive to how you ask a question. To ensure their results weren’t just a fluke of phrasing, the authors tested three levels of prompt verbosity: Verbose (very detailed instructions), Medium, and Concise.

Figure 3 shows how the system prompt evolves. The “Verbose” prompt explicitly defines what a conversation is and how users interact, while the “Concise” prompt assumes the model already understands these concepts.
The full organization of the system prompt is modular, allowing the researchers to swap out different “Input Elements” (like the Interaction Transcript or Summary) while keeping the instruction template constant.
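As a hypothetical illustration of this modularity, the sketch below pairs a verbosity tier with interchangeable input elements. The wording of the three tiers is invented for illustration and is not the paper's exact phrasing.

```python
# Invented verbosity tiers; the paper's actual system prompts differ in wording.
SYSTEM_PROMPTS = {
    "verbose": (
        "You will read a multi-party conversation. A conversation is a sequence of "
        "messages, each written by a speaker and addressed to another user. Read the "
        "inputs below carefully, then answer the question about the next message."
    ),
    "medium": "You will read a multi-party conversation and answer a question about the next message.",
    "concise": "Answer the question about the next message in the conversation.",
}

def build_prompt(verbosity: str, input_elements: list[str], instruction: str) -> str:
    # Swap input elements in and out while keeping the instruction template constant.
    return "\n\n".join([SYSTEM_PROMPTS[verbosity], *input_elements, instruction])
```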

Experiments & Key Results
The researchers ran these experiments in a zero-shot setting, meaning the model was not fine-tuned on the data; it had to rely on its pre-trained knowledge to understand the instructions and the chat logs.
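For readers who want to try a comparable zero-shot setup, a hedged sketch using the Hugging Face transformers library is shown below; the generation settings are illustrative choices, not the authors' configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-chat-hf"  # gated checkpoint; requires access approval
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    # Greedy decoding; no fine-tuning, only the pre-trained weights and the prompt.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```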
Finding 1: Task Dependencies (Text vs. Structure)
The most significant finding is the stark difference in what information is required for the two different tasks.

Looking at the results in Figure 5:
- Addressee Recognition (AR - Top Row): The model performs best when it has the Interaction Transcript (STRUCT). Surprisingly, adding the message text on top of the structure doesn’t help much, and the text-only input (CONV) sometimes performs terribly. This confirms that knowing “who talks to whom” is a structural problem, not a linguistic one.
- Response Selection (RS - Bottom Row): The results flip. The model needs the Conversation Transcript (CONV). Inputs that only have structure (STRUCT) fail completely because you cannot predict what someone will say if you don’t know the topic of conversation.
The Privacy Trade-off: Interestingly, the combinations using Summaries (SUMM) performed reasonably well, especially in larger groups (Ubuntu6). This suggests that for complex conversations, a good summary might be a viable, privacy-preserving alternative to raw text. However, User Descriptions (DESC) generally performed poorly, suggesting that knowing a user’s “personality” (e.g., that they are helpful or rude) doesn’t help the LLM predict their next move in a technical support chat.
Finding 2: The “Hangover” of Complexity
The paper’s title asks if LLMs suffer from a “Multi-Party Hangover.” The answer appears to be yes.
The researchers analyzed performance against network metrics, specifically Degree Centrality—a measure of how many different people a specific speaker interacts with.
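A minimal sketch of this metric as described here, i.e., the number of distinct users a speaker exchanges messages with (the paper may use a normalised variant):

```python
from collections import defaultdict

def degree_centrality(interactions):
    """interactions: iterable of (speaker, addressee) pairs from the interaction graph."""
    neighbours = defaultdict(set)
    for speaker, addressee in interactions:
        neighbours[speaker].add(addressee)
        neighbours[addressee].add(speaker)
    return {user: len(peers) for user, peers in neighbours.items()}

# Example: A interacts with three different people, the others with one each.
print(degree_centrality([("A", "B"), ("A", "C"), ("D", "A"), ("B", "A")]))
# {'A': 3, 'B': 1, 'C': 1, 'D': 1}
```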

Figure 6 provides the critical diagnostic insight, particularly for Addressee Recognition (AR, top row):
- Low Complexity (Low Degree Centrality): When a user is only talking to 1 or 2 people, the model performs very well (high accuracy on the left side of the graphs).
- High Complexity (High Degree Centrality): As soon as a user starts interacting with 3, 4, or 5 different people, performance plummets.
This graph reveals a weakness hidden by the average accuracy scores. The model is essentially “cheating” on the easy interactions. When the structural web gets tangled—the hallmark of a true Multi-Party Conversation—the LLM struggles to keep track of the threads.
Finding 3: Prompt Sensitivity
Finally, the researchers examined how sensitive the model was to the phrasing of the prompt.

Table 1 shows the “Relative Gap,” which measures the drop in performance if you use an average prompt versus the best possible prompt.
- AR (Addressee Recognition) is highly sensitive. The relative gaps are large (e.g., 10.9% on Ubuntu4). This suggests that because AR requires structural reasoning—a capability LLMs aren’t explicitly optimized for—they need very clear, verbose instructions to succeed.
- RS (Response Selection) is robust. The gaps are tiny (<1%). Predicting the next sentence is what LLMs are trained to do; they don’t need hand-holding to perform this task.
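As a rough reading of the metric, assuming the Relative Gap is the proportional drop from the best prompt to the average prompt (the paper's exact definition may differ), a quick calculation looks like this; the accuracies below are made up to yield a gap of the same magnitude as the one quoted for AR on Ubuntu4.

```python
def relative_gap(best_accuracy: float, average_accuracy: float) -> float:
    # Proportional loss from using an average prompt instead of the best one.
    return (best_accuracy - average_accuracy) / best_accuracy

print(f"{relative_gap(0.55, 0.49):.1%}")  # 10.9%
```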
Conclusion and Implications
This research acts as a reality check for the deployment of LLMs in social platforms and group settings. While LLMs are linguistically fluent, they exhibit a “structural blindness” when conversations become interconnected.
Key Takeaways:
- Structure Matters: You cannot evaluate an MPC model solely on text processing. You must evaluate its ability to track the interaction graph.
- Privacy Potential: Summaries show promise as a replacement for raw text in classification tasks, potentially unlocking new, safer datasets for research.
- The Complexity Cliff: Current models handle group chats well only when the sub-interactions remain simple. As soon as a user becomes a central hub of communication, the model’s performance degrades significantly.
For students and researchers entering this field, the paper highlights a clear opportunity: we need better methods for encoding conversation structure into LLMs. Simply pasting a transcript into a prompt is not enough to cure the Multi-Party Hangover. Future architectures may need to explicitly model the interaction graph alongside the text to truly understand the chaos of a group chat.