If you have ever been part of a busy group chat on WhatsApp or Slack, you know the chaos. Multiple conversations happen simultaneously. Someone answers a question from five minutes ago while two other people are debating lunch options. Keeping track of who is talking to whom—and more importantly, what they are talking about—is a significant cognitive task for humans.

For Artificial Intelligence, this is a nightmare.

In the field of Natural Language Processing (NLP), this problem is known as Multi-party Dialogue Generation (MDG). While standard chatbots (like early versions of Siri or simple customer service bots) only have to deal with one user (a one-on-one structure), MDG agents must navigate a web of entangled conversation threads.

Most existing research focuses on the “reply-to” structure—figuring out who the bot should answer. However, a new paper by researchers Fan, Li, and Zhu argues that getting the “who” right isn’t enough. The bot also needs to get the topic and the logic (rhetoric) right.

In this post, we will deep-dive into their proposed solution: RL-TRC (Reinforcement Learning with Topic and Rhetorical Coherence). We will explore how they use reinforcement learning not just to generate text, but to enforce strict logical and topical consistency in chaotic conversation environments.

The Problem: When AI Loses the Thread

In a two-party dialogue, the rule is simple: the bot responds to the last thing you said. In a multi-party dialogue, the “target utterance” (the specific message the bot should reply to) might be buried three or four messages back in the history.

State-of-the-art models attempt to map these relationships using graph neural networks or “reply-to” prediction algorithms. While they often succeed in identifying the correct person to address, the content of their response often drifts off-topic.

Let’s look at a concrete example from the paper to understand this failure mode.

Figure 1: An example of multi-party dialogue from the Hu (Hu et al., 2019) dataset. EMMDG (Li and Zhao, 2023) and MADNet (Gu et al., 2023b) are two SOTA baselines.

In Figure 1, we see a conversation log (History) with multiple participants (\(P_1\) to \(P_6\)).

  • The target utterance is \(U_2\): “what does opera consume at startup ?” (referring to the Opera web browser).
  • Other messages (\(U_3\) through \(U_6\)) discuss a text editor called Emacs.

The bot (\(P_1\)) needs to reply to \(U_2\).

  • The Human response (correct) answers the question about Opera.
  • The EMMDG response (a previous AI model) says: “i don’t use firefox at all.” It hallucinated a topic (Firefox) that appeared in \(U_1\) but is irrelevant to the specific question in \(U_2\).
  • The MADNet response says: “i use opera, but i don’t use it.” This is topically correct but logically incoherent (a contradiction).

The researchers identified that models struggle with two specific types of coherence:

  1. Topic Coherence: Sticking to the subject of the target utterance (e.g., Opera vs. Emacs).
  2. Rhetorical Coherence: Using the correct logical structure (e.g., if the target is a question, the response should be an answer, not a continuation or a contradiction).

The Solution: The RL-TRC Framework

To fix this, the authors propose a framework that treats dialogue generation as a Reinforcement Learning (RL) problem. Instead of just training a model to predict the next word (which is how most Language Models work), they train an “agent” that actively decides how to maintain coherence.

The architecture is sophisticated, utilizing a standard Transformer encoder-decoder (BART) wrapped inside an Actor-Critic RL framework.

Figure 2: Framework of our method. The parameters of both decoders are shared.

As shown in Figure 2, the process flows as follows (a toy code sketch of the loop appears right after the list):

  1. Encoder: The dialogue history is processed to create a state representation (\(s_i\)).
  2. Coherence Tasks: The model simultaneously analyzes the Topic and Rhetoric of the target utterance.
  3. Policy Network: An RL agent (the “Actor”) decides which coherence information is most critical for the current step.
  4. Decoder: The model generates a response (\(y\)) guided by that decision.
  5. Rewards: The generated response is graded on whether it stayed on topic, made sense logically, and addressed the right person.
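To tie these five steps together, here is a toy, self-contained sketch of the decode-with-guidance loop (steps 3 and 4). Everything here is a stand-in for illustration: random vectors play the role of the encoder state and the coherence semantics, a GRU cell stands in for the BART decoder step, and the print statement stands in for generation and reward computation. None of it is the authors’ implementation.

```python
import torch
import torch.nn as nn

hidden = 16
policy = nn.Linear(hidden, 2, bias=False)        # actor over {topic, rhetoric}
decoder_cell = nn.GRUCell(hidden, hidden)        # stand-in for a BART decoder step

state = torch.randn(1, hidden)                   # s_i from the encoder (step 1)
topic_sem = torch.randn(1, hidden)               # topic-coherence semantics (step 2)
rhet_sem = torch.randn(1, hidden)                # rhetorical-coherence semantics (step 2)

for step in range(3):                            # a few decoding steps
    probs = torch.softmax(policy(state), dim=-1) # step 3: actor picks a focus
    action = torch.multinomial(probs, 1).item()  # 0 = topic, 1 = rhetoric
    guide = topic_sem if action == 0 else rhet_sem
    state = decoder_cell(guide, state)           # step 4: decoder consumes the guidance
    print(f"step {step}: focused on {'topic' if action == 0 else 'rhetoric'}")
# Step 5 (rewards) would score the finished response; see the reward section below.
```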

Let’s break down the mathematical machinery driving this process.

1. The Topic Coherence Task

First, the model needs to understand what is being discussed. The researchers use Pointwise Mutual Information (PMI) to build a matrix of how strongly words relate to each other. For example, “Opera” and “Consume” might have a high PMI score in this context.

\[ \mathrm{PMI}(w_{i}, w_{j}) = \log \frac{p(w_{i}, w_{j})}{p(w_{i})\,p(w_{j})} \]
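To make the PMI idea concrete, here is a minimal sketch (not the authors’ code) that estimates PMI scores from raw co-occurrence counts; the toy utterances and the utterance-level co-occurrence window are assumptions made purely for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(utterances):
    """Estimate PMI(w_i, w_j) from word co-occurrence within each utterance."""
    word_counts, pair_counts, total = Counter(), Counter(), 0
    for utt in utterances:
        words = set(utt.lower().split())
        word_counts.update(words)
        pair_counts.update(combinations(sorted(words), 2))  # alphabetically ordered pairs
        total += 1
    return {
        (w_i, w_j): math.log((c / total) /
                             ((word_counts[w_i] / total) * (word_counts[w_j] / total)))
        for (w_i, w_j), c in pair_counts.items()
    }

# Toy history: "consume" and "opera" co-occur twice, so their PMI is positive.
history = ["what does opera consume at startup",
           "opera does consume a lot of memory",
           "emacs is a text editor"]
print(pmi_scores(history)[("consume", "opera")])
```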

Using these relationships, the model tries to predict the keywords of the response before it generates the full sentence. It takes the semantics of the target utterance (\(\mathbf{h}_{ut}\)) and projects them to predict the next relevant keywords.

\[ \mathbf{h}_{ut}^{a} = \mathrm{MLP}(\mathbf{h}_{ut}) \]

\[ \mathbf{h}_{ti} = \mathrm{softmax}\big(\mathbf{W}_{v}\tanh(\mathbf{W}_{q}\mathbf{h}_{ut}^{a} + \mathbf{W}_{m}\mathbf{E}_{ck} + b)\big)\,\mathbf{E}_{ck} \]

\[ P_{a} = \mathrm{softmax}(\mathbf{W}_{ti}\mathbf{h}_{ti}) \]

This prediction is supervised by a loss function (\(\mathcal{L}_{t}\)), ensuring the model learns to identify keywords that actually appeared in the ground-truth response (\(w_i\)).

\[ \mathcal{L}_{t} = -\frac{1}{f}\sum_{i=1}^{f}\log P_{a}(w_{i}) \]
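A minimal PyTorch sketch of what such a keyword-prediction head could look like; the layer names, tensor shapes, and candidate-embedding setup are my assumptions, not the paper’s implementation.

```python
import torch
import torch.nn as nn

class TopicKeywordHead(nn.Module):
    """Sketch of a topic-keyword predictor: attend over candidate keyword embeddings
    E_ck using the target-utterance semantics h_ut, then score a keyword vocabulary."""
    def __init__(self, hidden_dim, keyword_vocab_size):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh())
        self.W_q = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.W_m = nn.Linear(hidden_dim, hidden_dim)           # includes the bias b
        self.W_v = nn.Linear(hidden_dim, 1, bias=False)
        self.W_ti = nn.Linear(hidden_dim, keyword_vocab_size)

    def forward(self, h_ut, E_ck):
        # h_ut: (batch, hidden) target-utterance semantics
        # E_ck: (batch, num_candidates, hidden) candidate keyword embeddings
        h_a = self.mlp(h_ut)                                              # h_ut^a = MLP(h_ut)
        scores = self.W_v(torch.tanh(self.W_q(h_a).unsqueeze(1) + self.W_m(E_ck)))
        attn = torch.softmax(scores, dim=1)                               # weights over candidates
        h_ti = (attn * E_ck).sum(dim=1)                                   # attended keyword summary
        return torch.log_softmax(self.W_ti(h_ti), dim=-1)                 # log P_a

# L_t is then the average negative log-likelihood of the gold response keywords,
# e.g. torch.nn.functional.nll_loss(log_p_a, gold_keyword_ids).
```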

2. The Rhetorical Coherence Task

While “Topic” handles the what, “Rhetoric” handles the how. Rhetorical relations define the logical link between two sentences. Is the response an Elaboration? A Clarification Question? A Correction?

To understand these relations, the paper utilizes a discourse parsing tool trained on the Molweni dataset. The possible relations are diverse, as seen below:

Table 12: Discourse relations and their descriptions, cited from Li et al. (2020).

The model attempts to predict the correct discourse relation (\(r\)) based on the target utterance semantics (\(\mathbf{h}_{ut}\)):

\[ \mathbf{h}_{ut}^{b} = \mathrm{MLP}(\mathbf{h}_{ut}) \]

\[ P_{b} = \mathrm{softmax}(\mathbf{W}_{ub}\mathbf{h}_{ut}^{b}) \]

This is optimized via a standard cross-entropy loss function (\(\mathcal{L}_{r}\)):

\[ \mathcal{L}_{r} = -\log P_{b}(r) \]
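In code, the rhetorical head is essentially a small classifier over the set of discourse relations; a sketch, where the layer sizes and the number of relation classes are assumptions treated as hyperparameters.

```python
import torch.nn as nn
import torch.nn.functional as F

class RhetoricHead(nn.Module):
    """Sketch of the discourse-relation classifier over h_ut."""
    def __init__(self, hidden_dim, num_relations):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh())  # h_ut^b = MLP(h_ut)
        self.W_ub = nn.Linear(hidden_dim, num_relations)

    def forward(self, h_ut):
        h_b = self.mlp(h_ut)
        return F.log_softmax(self.W_ub(h_b), dim=-1)   # log P_b over discourse relations

# L_r = -log P_b(r) is cross-entropy against the parser-provided relation label r:
# loss = F.nll_loss(rhetoric_head(h_ut), relation_labels)
```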

3. The Reinforcement Learning Agent

Here is where the “RL” in RL-TRC comes into play. The system uses an Actor-Critic algorithm.

  • The State (\(s_k\)): The current history of the conversation.
  • The Action (\(a_k\)): The agent must choose between focusing on Topic Semantics or Rhetorical Semantics to guide the next generation step.

The policy network (\(\pi\)) calculates the probability of taking a specific action based on the current state.

\[ \pi_{\theta}(a_{k} \mid s_{k}) = \mathrm{softmax}(\mathbf{o}_{k}\mathbf{W}_{\theta}) \]
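Because the action space has only two options, the policy can be a single linear projection of the step representation \(\mathbf{o}_k\) followed by a softmax. A minimal sketch; the hidden size and the advantage-weighted actor update shown in the comment are my assumptions, not the paper’s exact rule.

```python
import torch
import torch.nn as nn

class CoherencePolicy(nn.Module):
    """Actor head: pi_theta(a_k | s_k) = softmax(o_k W_theta).
    Action 0 = attend to topic semantics, action 1 = attend to rhetorical semantics."""
    def __init__(self, hidden_dim, num_actions=2):
        super().__init__()
        self.W_theta = nn.Linear(hidden_dim, num_actions, bias=False)

    def forward(self, o_k):
        return torch.softmax(self.W_theta(o_k), dim=-1)

policy = CoherencePolicy(hidden_dim=768)
o_k = torch.randn(1, 768)                     # step representation from the decoder
probs = policy(o_k)                           # action probabilities over {topic, rhetoric}
action = torch.multinomial(probs, 1)          # sample a_k during training
# Standard advantage-weighted actor update (an assumption about the exact rule used):
# loss_actor = -(reward - critic_value) * torch.log(probs.gather(1, action))
```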

The goal of the agent is to maximize the expected cumulative reward. But how does the agent know if it did a good job?

4. The Three Rewards

In Reinforcement Learning, the “Reward” is the grade the student gets after taking a test. This paper designs three specific “Discourse-Aware” rewards to guide the agent.

A. Topic-Coherence Reward (\(R_{tc}\))

This reward measures how well the generated response’s keywords overlap with the topics found in the target utterance. It essentially asks: Are you still talking about the same subject?

\[ R_{tc} = f_{tc}\big([u_{t}; u_{t\_kws}],\, [y_{t}; y_{kws}]\big) \cdot e^{(n/|y_{kws}| - 1)} \]

B. Rhetorical-Coherence Reward (\(R_{rc}\))

This reward compares the rhetorical relation of the generated response against the golden (human) response. It uses Kullback-Leibler (KL) divergence to measure how close the model’s logical structure is to the human’s logic.

\[ R_{rc} = -\mathrm{KL}\big(f_{rc}(u_{t}, y^{*}) \,\|\, f_{rc}(u_{t}, y)\big) \]

C. Reply-to Reward (\(R_{rt}\))

This reward ensures the generated response is recognizable as a reply to the correct target utterance (\(u_t\)) rather than to some other message in the history.

\[ R_{rt} = -\mathrm{KL}\big(f_{rt}(C, y^{*}) \,\|\, f_{rt}(C, y)\big) \]

Total Reward

Finally, these three rewards are weighted and summed to create a final score (\(r\)) that updates the agent’s policy.

\[ r = w_{tc} R_{tc} + w_{rc} R_{rc} + w_{rt} R_{rt} \]
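As a hedged sketch of how these pieces might combine numerically: the scorer functions \(f_{tc}\), \(f_{rc}\), \(f_{rt}\) are stand-ins for the paper’s similarity and classifier models, the weights are placeholders, and the interpretation of \(n\) as the count of on-topic generated keywords is my assumption.

```python
import torch
import torch.nn.functional as F

def total_reward(sim_tc, n_on_topic, num_kws,
                 p_rc_gold, p_rc_gen, p_rt_gold, p_rt_gen,
                 w_tc=1.0, w_rc=1.0, w_rt=1.0):
    """Combine the three discourse-aware rewards (illustrative sketch only).

    sim_tc:               topic similarity f_tc between target+keywords and response+keywords
    n_on_topic / num_kws: assumed meaning of n / |y_kws| (on-topic fraction of generated keywords)
    p_*_gold / p_*_gen:   relation / reply-to distributions for gold vs. generated response
    """
    r_tc = sim_tc * torch.exp(torch.tensor(n_on_topic / max(num_kws, 1) - 1.0))
    # F.kl_div(input=log q, target=p) computes KL(p || q), so these match -KL(gold || generated).
    r_rc = -F.kl_div(p_rc_gen.log(), p_rc_gold, reduction="sum")
    r_rt = -F.kl_div(p_rt_gen.log(), p_rt_gold, reduction="sum")
    return w_tc * r_tc + w_rc * r_rc + w_rt * r_rt
```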

Experimental Results

The researchers tested their model on two major datasets derived from Ubuntu IRC chat logs (technical support channels). These are notoriously difficult datasets because they involve jargon, code snippets, and many overlapping conversations.

Table 1: Statistics of the two datasets evaluated in this paper.

The team compared RL-TRC against several baselines, including:

  • ChatGPT (GPT-4): A powerful general-purpose LLM.
  • HeterMPC & MADNet: Previous state-of-the-art models specifically designed for multi-party chat.
  • BART: The base model without the RL enhancements.

Quantitative Analysis

The results were evaluated using standard text generation metrics like BLEU (precision of n-grams), METEOR, and ROUGE.
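For readers who have not used these metrics, here is a tiny, purely illustrative NLTK example of sentence-level BLEU. The reference sentence is made up for demonstration, and the paper’s actual evaluation configuration may differ.

```python
# pip install nltk  (illustration only)
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["opera consumes very little memory at startup".split()]   # hypothetical reference
candidate = "i don't use opera , i use firefox".split()                # a generated response

smooth = SmoothingFunction().method1
print(sentence_bleu(reference, candidate, smoothing_function=smooth))  # n-gram overlap score
```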

Table 2: Automatic evaluation results on the Hu dataset…

As shown in Table 2, RL-TRC (Ours) achieved the highest scores across the board on the Hu dataset. It significantly outperformed MADNet and EMMDG. Interestingly, it also outperformed ChatGPT on these specific metrics, highlighting that general-purpose LLMs still struggle with the specific structural constraints of entangled multi-party logs compared to specialized, fine-tuned models.

The advantage held on the second dataset (Ou5) as well:

Table 3: Automatic evaluation results on the Ou5 dataset.

Ablation Study: What Mattered Most?

A critical part of any AI paper is the “Ablation Study”—systematically removing parts of the model to see if they were actually necessary.

Table 5: Ablation results on the Hu dataset.

Looking at Table 5, we can draw important conclusions:

  1. w/o TC (Without Topic Task): Performance drops significantly. This suggests that keeping the topic straight is the single most important factor for success.
  2. w/o RC (Without Rhetoric Task): Performance drops, but less so than Topic. The authors note this might be because discourse parsing (understanding logic) is inherently harder and the tools used are less accurate than topic keyword extractors.
  3. w/o TCR (Without Topic Reward): Removing the RL reward for topics hurts the model even more than removing the task itself, suggesting that the reinforcement learning rewards are doing much of the heavy lifting.

Case Study: The “Opera” Example Revisited

Remember the confusing conversation about the Opera browser from the introduction? Let’s see how RL-TRC handled it compared to the baselines.

Table 11: Responses generated by our model and two SOTA baselines. The dialogue history is shown in Figure 1.

  • Target Question: “what does opera consume at startup?”
  • EMMDG: Talked about Firefox (Wrong Topic).
  • MADNet: “i use opera, but i don’t use it” (Logical contradiction).
  • RL-TRC: “i don’t use opera, i use firefox.”

While RL-TRC’s answer is simple, it is coherent. It acknowledges the topic (Opera) and provides a logically sound response (Contrast/Elaboration). It successfully filtered out the noise about Emacs from the other participants and focused on the browser discussion.

Conclusion and Implications

The RL-TRC paper highlights a crucial evolution in dialogue systems. We are moving past the phase of simply “predicting the next word” or “guessing the addressee.” We are entering a phase where models are explicitly designed to understand the structure of conversation.

By separating the tasks of Topic Coherence (what we are talking about) and Rhetorical Coherence (how the conversation flows), and enforcing them via Reinforcement Learning, the authors have created an agent that can navigate the noisy, chaotic waters of multi-party chatrooms with much higher fidelity.

For students and researchers, this paper serves as an excellent example of how to combine distinct linguistic goals (topic vs. rhetoric) into a unified reward function, demonstrating that sometimes, the best way to generate good text is to treat it as a series of strategic decisions rather than just a sequence of probabilities.