Imagine you are having a terrible day. You turn to a friend to vent about your stress. In response, they give you a single-word reply: “Okay.” You feel unheard and dismissed.
Now, imagine the opposite scenario. You share your problem, and that same friend responds with a ten-minute, breathless monologue, analyzing every micro-factor of your situation, citing historical precedents, and offering fifteen different solution paths simultaneously. You feel overwhelmed. Instead of feeling supported, you are now exhausted.
This is the central challenge in Emotional Support Conversation (ESC) systems. For an AI to be truly supportive, it must find the “Goldilocks” zone: it needs to say enough to be helpful, but not so much that it burdens the user.
In the paper “Be Helpful but Don’t Talk too Much - Enhancing Helpfulness in Conversations through Relevance in Multi-Turn Emotional Support,” researchers from The Hong Kong Polytechnic University propose a fascinating solution rooted in cognitive psychology. They argue that helpfulness isn’t just about the quality of advice; it is a ratio between the positive effect of the message and the processing effort required to understand it.
In this post, we will deconstruct their method, known as Optimal Relevance Learning (ORL), and their model architecture, VLESA. We will explore how they taught an AI to balance empathy with brevity, creating a system that is not only smarter but significantly more human-like.
The Problem: The Effect-Effort Trade-off
Why do current chatbots struggle with emotional support? Traditional dialogue systems are often trained to maximize a specific objective, such as “helpfulness” or “engagement.” However, when you optimize solely for helpfulness, the model may learn that “more is better,” producing long, rambling responses that increase the user’s cognitive load.
The researchers base their solution on the Cognitive Relevance Principle. This principle suggests that for a conversation to be relevant and successful, speakers must optimize a trade-off:
- Maximize Cognitive Effect: Provide information that actually changes the listener’s mental state or solves their problem.
- Minimize Processing Effort: Make that information as easy as possible to digest.

As shown in Figure 1, we can visualize this trade-off.
- Interaction 1 -> 2: The user (A) shares stress. The responder (B) asks a probing question (“What happened?”). This moves the conversation up the “Effect” axis. It requires a bit more effort from the listener, but the payoff in helpfulness is high. This is a desirable transition.
- Interaction 4 -> 5: The responder gives a long, empathetic statement (“Hey there… you’re not alone…”). This has a high effect and, crucially, keeps the effort relatively low because the language is comforting and easy to process.
- The Trap: If a system tries to maximize effect without regard to effort (moving far right on the X-axis), it becomes “preachy” or overwhelming.
The researchers’ goal was to build a system that inherently understands this graph, seeking the “Optimal Relevance” zone where effect is high and effort is low.
The Architecture: Variational Latent Emotional Support Agent (VLESA)
Before diving into the reinforcement learning aspect, we need to understand the structure of the agent itself. The authors developed the Variational Latent Emotional Support Agent (VLESA).
Unlike a standard transformer that simply predicts the next word based on previous words, VLESA is hierarchical. It mimics how humans think before they speak. When we comfort someone, we don’t just pick words randomly; we first decide on a strategy (e.g., “I should ask a question” or “I should reassure them”) and an emotion (e.g., “I should sound caring”).
Hierarchical Decision Making
VLESA models this by using Latent Variables. These are hidden states that represent the high-level decisions of the bot.

As illustrated in Figure 3, the workflow operates in a coarse-to-fine manner:
- Encoder (\(LM^{enc}\)): The system takes the dialogue history (\(D\)) and encodes it into a hidden state.
- Latent Variables (\(z\)): The model samples two specific latent variables:
  - \(z_a\) (Speech Act): Determines the strategy (e.g., Self-disclosure, Affirmation).
  - \(z_e\) (Emotion): Determines the emotional tone.
- Policy Heads: These heads (\(\pi_a\) and \(\pi_e\)) generate the probability distributions for the specific action and emotion to be used.
- Decoder (\(LM^{dec}\)): Finally, the decoder generates the actual words (\(w_t\)). Crucially, this generation is conditioned on the chosen speech act and emotion variables.
This structure ensures that the generated text aligns with a coherent high-level strategy, rather than meandering aimlessly.
The mathematical formulation for the decoder shows this dependency clearly:
\[
p(u \mid D) = \prod_{t=1}^{T} p\big(w_t \mid w_{0:t-1}, z_a, z_e, D\big)
\]
Here, the probability of the next word \(w_t\) relies not just on previous words \(w_{0:t-1}\), but specifically on the latent variables \(z_a\) (action) and \(z_e\) (emotion).
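To make the coarse-to-fine flow concrete, here is a minimal PyTorch-style sketch of how sampled speech-act and emotion latents could condition a decoder. Everything here (the `HierarchicalSupporter` class, the GRU stand-ins for \(LM^{enc}\) and \(LM^{dec}\), the prefix-embedding trick) is an illustrative assumption, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class HierarchicalSupporter(nn.Module):
    """Illustrative coarse-to-fine agent: encode the dialogue, sample a speech
    act and an emotion as latent variables, then condition generation on both."""

    def __init__(self, hidden=768, num_acts=8, num_emotions=7, vocab=50265):
        super().__init__()
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)  # stand-in for LM_enc
        self.pi_act = nn.Linear(hidden, num_acts)                # policy head pi_a
        self.pi_emo = nn.Linear(hidden, num_emotions)            # policy head pi_e
        self.act_emb = nn.Embedding(num_acts, hidden)
        self.emo_emb = nn.Embedding(num_emotions, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)  # stand-in for LM_dec
        self.lm_head = nn.Linear(hidden, vocab)

    def forward(self, dialogue_embs, response_embs):
        # 1) Encode the dialogue history D into a summary state.
        _, h = self.encoder(dialogue_embs)            # h: (1, batch, hidden)
        summary = h[-1]                               # (batch, hidden)

        # 2) Sample latent speech act z_a and emotion z_e from the policy heads.
        act_dist = torch.distributions.Categorical(logits=self.pi_act(summary))
        emo_dist = torch.distributions.Categorical(logits=self.pi_emo(summary))
        z_a, z_e = act_dist.sample(), emo_dist.sample()

        # 3) Condition word generation on both latents by prefixing their embeddings.
        prefix = torch.stack([self.act_emb(z_a), self.emo_emb(z_e)], dim=1)
        dec_in = torch.cat([prefix, response_embs], dim=1)
        out, _ = self.decoder(dec_in, h)
        return self.lm_head(out), act_dist, emo_dist  # word logits + the two policies
```

Prefixing the latent embeddings is just one way to realize the conditioning in the equation above; adding them to every token embedding, or feeding them as extra memory for cross-attention, would serve the same purpose.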
The Training Method: Optimal Relevance Learning (ORL)
The architecture provides the capacity to be strategic, but how do we teach it to be optimally relevant? This is where Optimal Relevance Learning (ORL) comes in.
ORL is a reinforcement learning (RL) framework. In standard RL for dialogue, an agent gets a reward based on whether the conversation ended successfully. Here, the researchers designed a specific reward function based on the Cognitive Relevance Principle.
The Simulated Environment
Since training on real humans is slow and expensive, the researchers set up a “User-in-the-Loop” simulation.

As shown in Figure 2, the training loop (sketched in code after this list) consists of:
- The Supporter (Agent): Generates a response.
- The Simulated User: Another LLM (like Llama-2 or DialoGPT) that acts as the person seeking help.
- The Helpfulness Scorer: A BERT-based model trained to predict how helpful a response is.
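A deliberately simplified sketch of that loop is shown below. The three functions are hypothetical stubs standing in for the real components (VLESA as the supporter, an LLM simulator as the user, and the BERT-based scorer):

```python
import random

def supporter_respond(history):     # stub for the agent being trained (VLESA)
    return "I hear you. What do you think triggered the stress today?"

def simulated_user_reply(history):  # stub for the LLM acting as the help-seeker
    return "Mostly deadlines piling up at work."

def helpfulness_score(history):     # stub for the BERT-based helpfulness scorer
    return min(1.0, 0.1 * len(history)) + 0.05 * random.random()

history = ["User: I've had a terrible day and I'm completely stressed out."]
for turn in range(3):
    before = helpfulness_score(history)
    history.append("Supporter: " + supporter_respond(history))  # the agent acts
    after = helpfulness_score(history)                          # scorer rates the new state D'
    history.append("User: " + simulated_user_reply(history))    # the simulated user reacts
    print(f"turn {turn}: effect = {after - before:.3f}")
```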
The Mathematics of Relevance
The core contribution of this paper is the definition of the Reward function (\(r_u\)). Recall that Relevance = Effect / Effort. The researchers mathematically formalized this ratio.
1. Calculating Cognitive Effect (\(Efct\))
The “Effect” is defined as the change in the helpfulness score of the conversation after the agent speaks. If the agent’s response makes the situation better, the effect is positive.
\[
Efct = \mathrm{Helpful}(D') - \mathrm{Helpful}(D)
\]
Here, Helpful(\(D'\)) is the score after the new utterance, and Helpful(\(D\)) is the score before it.
2. Calculating Processing Effort (\(Efrt\))
How do we measure “effort” for a machine? The researchers used Surprisal, the information-theoretic quantity that underlies perplexity: it measures how unexpected a word is given what came before.
If a sentence uses highly obscure words or convoluted syntax, it has high surprisal. Humans have to work harder to process high-surprisal text. Therefore, minimizing surprisal minimizes cognitive load.
\[
Efrt = \sum_{t=1}^{T} -\log p\big(w_t \mid w_{0:t-1}\big)
\]
The effort is the sum of the surprisal of all words in the response.
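As a rough illustration, here is how the summed surprisal of a response could be computed with an off-the-shelf causal language model. GPT-2 is used purely as a stand-in for the user-simulator LM, and `processing_effort` is a hypothetical helper; the paper’s exact model and normalization may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

def processing_effort(response: str) -> float:
    """Sum of per-token surprisal, -log p(w_t | w_{0:t-1}), over the response."""
    ids = tok(response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits                            # (1, T, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predictions for tokens 1..T-1
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return float(-token_lp.sum())                          # higher = harder to process

print(processing_effort("You're not alone; take it one small step at a time."))
print(processing_effort("Per my axiological assessment, recalibrate your hedonic priors."))
```

The convoluted second sentence should come out with a noticeably higher effort score than the plain first one.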

3. The Relevance Reward
Finally, the reward \(r_u\) for a specific utterance is the ratio of the Effect to the Effort.
\[
r_u = \frac{Efct}{Efrt}
\]
This simple equation changes everything.
- If the agent says something very helpful (High Effect) but uses 100 words to say it (High Effort), the Reward decreases.
- If the agent says something useless (Low Effect) in 2 words (Low Effort), the Reward is still low because the numerator is small.
- The agent is forced to find the most helpful content using the most efficient language (see the sketch below).
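A minimal sketch of the reward under these definitions, assuming the helpfulness scores and the effort have already been computed (the `eps` guard against division by zero is our own addition, not necessarily the paper’s):

```python
def relevance_reward(helpful_before: float, helpful_after: float,
                     effort: float, eps: float = 1e-6) -> float:
    """r_u = Effect / Effort: helpfulness gained per unit of processing effort."""
    effect = helpful_after - helpful_before
    return effect / (effort + eps)

# Two hypothetical responses producing the same helpfulness gain:
concise = relevance_reward(0.40, 0.65, effort=25.0)    # short, easy to process
rambling = relevance_reward(0.40, 0.65, effort=140.0)  # long-winded
print(concise, rambling)  # the concise response earns the larger reward
```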
The Complete Workflow

Figure 4 summarizes the entire learning process. The Supporter Agent generates an utterance \(u^s_T\) using its latent variables. This utterance is fed into the Relevance Model (the environment), which calculates the Effect (via the Helpfulness Scorer) and the Effort (via the User Simulator’s perplexity).
This reward signal is then passed back to update the agent. Note that the system assigns rewards not just to the whole sentence, but to individual words (\(r^t_w\)) based on their attention weights, allowing for fine-grained optimization.
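The paper gives the exact credit-assignment formula; the snippet below is only an illustrative take on the idea of spreading an utterance-level reward over tokens in proportion to their attention weights (`word_level_rewards` and the example weights are hypothetical):

```python
import torch

def word_level_rewards(utterance_reward: float, attn_weights: torch.Tensor) -> torch.Tensor:
    """Distribute one utterance-level reward across tokens, weighted by attention."""
    weights = attn_weights / attn_weights.sum()
    return utterance_reward * weights

attn = torch.tensor([0.05, 0.30, 0.10, 0.40, 0.15])  # hypothetical per-token attention
print(word_level_rewards(0.8, attn))                  # heavily attended tokens get more credit
```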

Experiments and Results
The researchers evaluated VLESA against several strong baselines, including specialized emotional support models like MISC, TransESC, and KEMI. They tested the model using the ESConv dataset, a large-scale emotional support conversation dataset.
Automatic Evaluation
The results were statistically significant.

Looking at Table 2, we can see that VLESA (built on Llama and BART) outperforms the baselines across almost all metrics.
- BLEU/METEOR/ROUGE: These metrics measure how closely the generated text matches ground-truth human responses. VLESA scores highest, indicating natural phrasing.
- HumanLike: This is a classifier score predicting if the text looks human-written. VLESA achieved 71.01, significantly higher than models like KEMI (18.09) or MISC (59.83).
- Non-Toxic: The model maintained a high safety rating.
Does “Effort” Really Matter?
One might ask: “Why not just optimize for Helpfulness? Why bother with the Effort denominator?”
The researchers performed an ablation study to answer this. They trained an ablated version of the model, “w/o Effort,” that optimizes only for Helpfulness.

Figure 5 reveals a surprising insight. The red line represents the full model (with Effort optimization), while the purple/other lines represent variations or baselines.
- Top Right (Effect/Helpful): You might expect the model trained only for helpfulness to be the most helpful. But actually, the model trained with the Effect/Effort trade-off (Red) achieves higher helpfulness scores over time.
- Bottom Left (Effort/Surprisal): The full model successfully lowers the surprisal (effort) compared to the user simulator.
This suggests that brevity and clarity actually enhance helpfulness. By constraining the model to be efficient (low effort), it learns to be more precise and impactful with its words.
Qualitative Analysis: Learning Not to Ramble
The most compelling evidence comes from watching the model learn. In the early stages of training, the model behaves like the “overwhelming friend” we described in the introduction.

In Table 5, look at the progression:
- Step 0: The model rambles. “I’ve been in that position… I have a book… I also started a new hobby… I like to write down my goals…” It’s talking too much.
- Step 39: It starts to focus. “I’m sure you’re doing great! I’ve been in a similar situation…”
- Step 78 (Converged): “I’ve found that learning new skills online can be a great way to get your mind off of things. I think you’re doing the right thing…”
The final response is concise, validating, and offers a specific perspective without overwhelming the user. This is the direct result of the “Effort” penalty in the reward function.
Conclusion and Implications
The “Be Helpful but Don’t Talk too Much” paper provides a robust framework for the next generation of conversational AI. By integrating the Cognitive Relevance Principle into the loss function of a neural network, the authors bridged the gap between linguistic theory and machine learning.
The key takeaways are:
- Helpfulness is a Ratio: It is not enough to simply provide good information; that information must be easy to process.
- Latent Variables Matter: Hierarchical modeling (Speech Act -> Word) produces more coherent support than flat generation.
- Constraints Breed Quality: Penalizing “effort” (surprisal) doesn’t make the model simpler; it makes it smarter and more human-like.
For students and researchers in NLP, this work highlights the importance of looking beyond standard metrics like accuracy or perplexity. By modeling the human experience of a conversation—specifically the cognitive load placed on the listener—we can build agents that don’t just output text, but actually provide comfort.