Introduction

Imagine you are chatting with a new acquaintance. You mention that you enjoy reading mystery novels. A normal response might be, “Oh, I love those too! Who is your favorite author?”

Now imagine the acquaintance responds: “I love reading too! I am a 35-year-old accountant living in Chicago. I have three cats named Mittens, Oreo, and Luna. I suffer from anxiety and I go to the gym every Tuesday at 6 PM.”

This response is unnatural, jarring, and frankly, a bit too much information (TMI). While humans rarely speak this way, Large Language Models (LLMs) instructed to adopt a specific “persona” often do. In their eagerness to follow instructions and prove they are “roleplaying” correctly, they tend to dump irrelevant personal details into the conversation or overload a single response with too many attributes.

This phenomenon is what researchers from Korea University have termed the Overuse Problem. In their paper, PANDA: Persona Attributes Navigation for Detecting and Alleviating Overuse Problem in Large Language Models, the authors propose a novel framework to quantify, detect, and mitigate this issue.

Example of Mistral overusing persona attributes.

As seen in the image above, when a user mentions a simple preference, the model (Mistral) responds by awkwardly shoehorning in its profession (“I am a teacher”), habits (“I enjoy yoga”), and dietary preferences (“drinking coffee”), completely derailing the natural flow of conversation.

In this post, we will dissect the PANDA framework, exploring how we can teach AI to hold a persona without making the conversation weird.

Background: The Evolution of Persona-Grounded Dialogue

The task at hand is known as Persona-Grounded Dialogue (PGD). The goal is to create chatbots that possess a consistent personality, history, and set of preferences, making them useful for applications ranging from mental health support to immersive gaming.

In the era of smaller Pre-trained Language Models (PLMs), the challenge was getting the model to remember it had a personality at all. Researchers focused on “grounding”—ensuring the model actually used the persona information provided. Metrics were designed to reward models that overlapped significantly with their persona descriptions.

The New Challenge with LLMs

With the advent of massive models like GPT-4 and LLaMA-3, the problem has flipped. These models are excellent at following instructions. When told, “You are a nurse who loves jazz,” they don’t forget it—they obsess over it.

Existing evaluation metrics (like BLEU, ROUGE, or simple F1 scores) focus on fluency or text overlap. They fail to penalize a model for being too grounded. If a model repeats its entire biography, an F1 score might actually rate it highly because it successfully recalled all the persona keywords.
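To see why, consider a toy word-overlap F1 (our own illustration, not the exact metric used by any benchmark): a response that dumps every persona fact outscores a natural, on-topic reply.

```python
# Toy illustration (not the paper's or any benchmark's exact metric):
# word-overlap F1 between the persona description and a candidate response.

def overlap_f1(persona: str, response: str) -> float:
    persona_words = set(persona.lower().split())
    response_words = set(response.lower().split())
    overlap = persona_words & response_words
    if not overlap:
        return 0.0
    precision = len(overlap) / len(response_words)
    recall = len(overlap) / len(persona_words)
    return 2 * precision * recall / (precision + recall)

persona = "i am a nurse i love jazz i have three cats"
natural = "i mostly listen to jazz on weekends what about you"
dumping = "i am a nurse i love jazz and i have three cats at home"

print(round(overlap_f1(persona, natural), 2))  # ~0.21: natural reply, low overlap
print(round(overlap_f1(persona, dumping), 2))  # ~0.86: persona dump, rewarded by overlap
```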

Defining Overuse

The authors categorize the Overuse problem into two specific criteria, inspired by the Gricean maxims of conversation (a small sketch after the list illustrates both):

  1. Off-topic: The model brings up persona attributes that have nothing to do with the current conversation topic (e.g., mentioning your job when discussing food).
  2. Excess of Quantity: The model brings up relevant attributes but uses too many of them at once, creating an unnatural information dump.
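
To make the two criteria concrete, here is a minimal sketch that classifies each topic in a single exchange. The topic labels and counts are invented for illustration, and the counting convention is ours, not the paper's.

```python
from collections import Counter

# Hypothetical topic counts for one exchange; labels follow the post's taxonomy.
partner_topics = Counter({"Preferences:food": 1})
response_topics = Counter({"Preferences:food": 3, "Occupations": 1, "Family": 1})

for topic, resp_count in response_topics.items():
    partner_count = partner_topics[topic]
    if partner_count == 0:
        print(f"{topic}: off-topic ({resp_count} mention(s), partner never raised it)")
    elif resp_count > partner_count:
        print(f"{topic}: excess of quantity ({resp_count} mention(s) vs {partner_count})")
    else:
        print(f"{topic}: balanced")
```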

The PANDA Framework

To solve this, the researchers introduced PANDA (Persona Attributes Navigation for Detecting and Alleviating Overuse). It is a comprehensive framework designed to measure how naturally an LLM uses its persona.

Overview of the PANDA Framework showing the pipeline from labeling to measurement.

The framework operates in a pipeline:

  1. Taxonomy: Defining what we are talking about.
  2. Dialogue Labeling: Tagging the conversation with these definitions.
  3. Persona-Topic Mapping: Connecting specific persona sentences to topics.
  4. Overuse Measurement: Calculating a score to quantify the “TMI” factor.

Step 1: A Fine-Grained Taxonomy

To judge if a response is “off-topic,” we first need to define what “topics” exist in a personal conversation. Existing datasets often lack these labels. The authors devised a detailed taxonomy of 14 dialogue topics, significantly expanding on previous work.

Table 1 showing the 14 defined dialogue topics ranging from Hobbies to Beliefs.

This granular list allows for nuance. Distinguishing between Experiences:past/present and Beliefs/Values is crucial for determining whether a response fits the context.
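
If you want to work with these labels in code, the taxonomy can be represented as a simple enum. The sketch below lists only the topic labels that appear in this post; the full set of 14 is defined in the paper's Table 1.

```python
from enum import Enum

class DialogueTopic(Enum):
    # Only the labels mentioned in this post; the paper's Table 1 defines all 14.
    PREFERENCES_HOBBY_HABIT = "Preferences:hobby/habit"
    PREFERENCES_FOOD = "Preferences:food"
    EXPERIENCES_PAST_PRESENT = "Experiences:past/present"
    BELIEFS_VALUES = "Beliefs/Values"
    POSSESSIONS = "Possessions"
    OCCUPATIONS = "Occupations"
    FAMILY = "Family"
    CHARACTERISTICS_OTHERS = "Characteristics:others"
    # ... remaining topics omitted here
```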

Step 2: Formalizing the Problem

How do we turn a conversation into math? The authors define a method to map text to topics.

First, they utilize a tagging function, denoted as \(TAG_P(x)\). This function scans a text \(x\) (either the user’s utterance or the model’s response) and identifies which persona attributes \(p_i\) are present.

Equation 1: The Persona Tagging Function.

This step effectively extracts the “personal facts” from the raw text. Once the attributes are extracted, they must be converted into topics. For example, the attribute “I own a Ferrari” needs to be converted to the topic [Possessions].

This is handled by the Topic Mapping function, \(TAG_T\):

Equation 2: Mapping Persona Attributes to Topics.

Finally, by summing these up, the framework generates a complete set of topics found in the text.

Equation 3: Aggregating the topics.
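
Putting Step 2 together, here is a minimal sketch of how \(TAG_P\), \(TAG_T\), and the aggregation could fit together. The lexical tagger, the hand-written attribute-to-topic lookup, and the example persona are all simplifying assumptions for illustration; the paper's actual tagging and mapping may be implemented quite differently.

```python
from collections import Counter

# Hypothetical persona and attribute-to-topic lookup, for illustration only.
PERSONA = ["I own a Ferrari", "I am a teacher", "I love sushi"]
ATTRIBUTE_TOPIC = {
    "I own a Ferrari": "Possessions",
    "I am a teacher": "Occupations",
    "I love sushi": "Preferences:food",
}
STOPWORDS = {"i", "a", "an", "the", "my", "to", "and", "am", "is", "of"}

def tag_persona(text: str, persona: list[str]) -> list[str]:
    """Sketch of TAG_P: persona attributes whose content words surface in `text`.

    Simple lexical overlap stands in for the paper's actual tagger.
    """
    text_words = {w.strip(".,!?") for w in text.lower().split()}
    tagged = []
    for attribute in persona:
        attr_words = {w for w in attribute.lower().split() if w not in STOPWORDS}
        if attr_words & text_words:
            tagged.append(attribute)
    return tagged

def tag_topic(attribute: str) -> str:
    """Sketch of TAG_T: map one persona attribute to its dialogue topic."""
    return ATTRIBUTE_TOPIC[attribute]

def topics_in(text: str, persona: list[str]) -> Counter:
    """Aggregation (Equation 3): the multiset of topics surfaced in `text`."""
    return Counter(tag_topic(a) for a in tag_persona(text, persona))

print(topics_in("As a teacher I always bring sushi for lunch", PERSONA))
# Counter({'Occupations': 1, 'Preferences:food': 1})
```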

Step 3: Measuring the Overuse Score (OVS)

This is the heart of the PANDA framework. The goal is to calculate a single score, OVS, that represents how much the model is over-sharing.

The logic relies on comparing the topics in the Partner’s Utterance (\(u\)) vs. the topics in the Model’s Response (\(\hat{y}\)).

In a natural conversation, the topics should largely align. If the partner talks about [Preferences:food], the model should ideally respond with [Preferences:food]. If the model responds with [Occupations] and [Family], the overuse score should rise.

The global Overuse Score is calculated as follows:

Equation 4: The Global Overuse Score Calculation.

This equation looks complex, but it is essentially an average of individual scores (\(ovs_i\)) for every topic involved in the conversation. The function \(\sigma\) (sigmoid) normalizes the result between 0 and 1.

But how is the individual score for a specific topic calculated?

Equation 5: Individual Topic Overuse Score.

Here, the score compares how often a specific topic appears in the model’s response (\(|T^{\hat{y}}|\)) against how often it appears in the partner’s utterance (\(|T^u|\)).

Crucially, the authors introduce a penalty weight, \(w_i\). This weight determines how severely to punish the model based on the type of error (Off-topic vs. Excess Quantity).

Equation 6: The Penalty Weighting Function.

Let’s break down the logic of \(w_i\) (a code sketch follows the list):

  • Off-topic: If the partner didn’t mention a topic (count = 0) but the model did (count > 0), this is a severe violation. The penalty is exponential (\(e^{x+1}\)).
  • Excess of Quantity: If the partner mentioned the topic once, but the model mentioned it five times, this is bad, but less bad than being off-topic. The penalty is linear (\((x+1) \cdot e\)).
  • Otherwise: If the counts are balanced, there is no penalty (\(e\)).
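
The sketch below turns the three cases of \(w_i\) into code and folds them into an overall score. The per-topic comparison and the final aggregation are simplified stand-ins for Equations 4 and 5, and \(x\) is assumed to be the topic's count in the response, so treat this as an illustration of the mechanics rather than the paper's exact formula.

```python
import math
from collections import Counter

def penalty_weight(partner_count: int, response_count: int) -> float:
    """Sketch of Equation 6: the penalty weight w_i for one topic.

    `x` is assumed to be the topic's count in the model's response;
    the paper's exact argument may differ.
    """
    x = response_count
    if partner_count == 0 and response_count > 0:
        return math.exp(x + 1)          # Off-topic: exponential penalty
    if response_count > partner_count:
        return (x + 1) * math.e         # Excess of quantity: linear penalty
    return math.e                        # Balanced: no extra penalty

def overuse_score(partner: Counter, response: Counter) -> float:
    """Sketch of Equations 4-5: sigmoid of the mean per-topic penalty.

    The per-topic score and the aggregation are simplifying assumptions,
    not the paper's exact formulas.
    """
    topics = set(partner) | set(response)
    if not topics:
        return 0.0
    mean_penalty = sum(penalty_weight(partner[t], response[t]) for t in topics) / len(topics)
    return 1 / (1 + math.exp(-mean_penalty))  # sigmoid squashes the result into (0, 1)

partner = Counter({"Preferences:food": 1})
balanced = Counter({"Preferences:food": 1})
tmi = Counter({"Preferences:food": 3, "Occupations": 1, "Family": 2})
print(overuse_score(partner, balanced))  # lower (balanced reply)
print(overuse_score(partner, tmi))       # higher (off-topic and excess both penalized)
```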

Building the Resource

To test this framework, the authors couldn’t rely on existing datasets, since none of them carry these detailed topic labels. They took the famous PersonaChat dataset and annotated it with the 14-topic taxonomy.

They found that the distribution of topics in standard persona datasets is heavily skewed toward hobbies and habits.

Distribution of topics in the annotated dataset.

As shown above, Preferences:hobby/habit accounts for nearly a quarter of all persona attributes. This is important context: if models are overusing personas, they are likely talking about their hobbies too much.

Experiments and Verification

The researchers tested four major LLMs: ChatGPT, LLaMA-3 (8B), Mistral (7B), and Gemma (7B).

They fed these models prompts including a persona and a dialogue history, then analyzed the responses using PANDA.

1. Which Models Overuse the Most?

The results were revealing. While all models struggled somewhat, specific models were “chattier” than others regarding their personas.

Density distribution of Overuse Scores across models.

In the density plot above, look at the “OVS” (Overuse Score) graph in the top left.

  • ChatGPT and Gemma have higher peaks near 0 (meaning low overuse).
  • Mistral and LLaMA-3 show “fatter tails” extending to the right, indicating a higher frequency of high-overuse responses.

The quantitative results confirm this:

Table 2: Experimental results showing OVS and other metrics.

Mistral had the highest (worst) Overuse Score at 0.773, followed closely by LLaMA-3. Interestingly, while these models scored well on surface-level fluency metrics (ROUGE-L, chrF++), their persona use in conversation was the least natural because of this TMI tendency.

2. What Topics are Abused?

The authors analyzed exactly what the models were over-sharing. They generated a “heatmap” to visualize correlations. The X-axis represents the topic the human partner brought up, and the Y-axis represents the topic the model responded with.

Correlation heatmap for Mistral showing topic triggers.

In an ideal world, this heatmap would show a strong diagonal line—talking about food triggers a response about food. However, look at the Mistral heatmap above. There are “hot spots” off the diagonal.

  • When the partner talks about Preferences:hobby/habit, Mistral often brings up Occupations or Characteristics:others.
  • This visualizes the “Off-topic” overuse phenomenon perfectly.
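
A table like this can be rebuilt from labeled turns by counting how often each partner topic co-occurs with each response topic. Here is a minimal sketch, using invented (partner, response) topic pairs:

```python
from collections import Counter

# Minimal sketch: build a partner-topic vs response-topic co-occurrence table
# from already-labeled turns. The (partner, response) pairs here are invented.
pairs = [
    ("Preferences:hobby/habit", "Occupations"),
    ("Preferences:hobby/habit", "Characteristics:others"),
    ("Preferences:food", "Preferences:food"),
    ("Preferences:hobby/habit", "Preferences:hobby/habit"),
]

cooccurrence = Counter(pairs)
partner_topics = sorted({p for p, _ in pairs})
response_topics = sorted({r for _, r in pairs})

for p in partner_topics:
    row = [cooccurrence[(p, r)] for r in response_topics]
    print(p, row)
# Off-diagonal mass (e.g. hobby/habit -> Occupations) is the off-topic signal
# that the Mistral heatmap makes visible.
```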

3. The Impact of Dialogue History

One hypothesis was that models might panic and dump persona information because they lack context. To test this, the authors varied the length of the “Dialogue History” provided to the model.

Graph showing Overuse Score decreasing as history length increases.

The graph on the left is fascinating. As the length of dialogue history increases (from 4 turns to 12 turns), the Overuse Score (OVS) steadily decreases for all models.

This suggests that when LLMs have more conversational context to latch onto, they lean less on their static persona attributes as a crutch. They become more reactive to the conversation flow rather than simply broadcasting their identity.

4. Can Prompt Engineering Fix It?

Finally, the researchers explored if “thinking” prompts could alleviate the problem. They tested methods like Chain-of-Thought (CoT), Task Decomposition, and Self-Refine.

Table 10: Results of reasoning-enhanced prompting.

The results were mixed:

  • ChatGPT: Advanced prompting (especially Task Decomposition) significantly reduced overuse.
  • LLaMA-3: The “Self-Refine” method worked wonders, dropping the OVS from 0.76 to 0.66.
  • Mistral/Gemma: Surprisingly, these methods often made the problem worse or had negligible effect. This suggests that smaller or different architectures might not benefit from complex reasoning prompts in the same way when it comes to persona regulation.

Conclusion

The PANDA framework highlights a critical nuance in the development of Generative AI. We have moved past the point of asking, “Can the model speak fluent English?” or “Can the model remember it’s playing a character?”

We are now at the stage of asking, “Can the model understand social cues?”

The Overuse problem—generating 500 words about being a vegan astronaut when the user just asked “How are you?”—is a significant barrier to creating lifelike, empathetic chatbots. By formalizing this problem into measurable components (Off-topic vs. Excess Quantity) and creating a robust taxonomy, the PANDA researchers have given the NLP community a compass to navigate this issue.

Key Takeaways

  1. Fluency \(\neq\) Quality: A model can write beautiful sentences but fail at social pragmatics by over-sharing.
  2. Context is King: Providing longer dialogue history helps models chill out and stop spamming their persona.
  3. One Size Doesn’t Fit All: Mitigation strategies like Chain-of-Thought work for some models (ChatGPT) but backfire on others (Mistral).

As LLMs continue to integrate into our daily lives as assistants and companions, frameworks like PANDA will be essential in ensuring they act less like robots reading a script and more like genuine conversational partners.