Introduction: The “Long Conversation” Problem

Imagine you are teaching a friend how to tell a story. If you stop them after every single sentence to say “good job” or “that was boring,” the flow of the conversation is ruined. It’s unnatural. Instead, you usually listen to the whole story and, at the end, give a reaction—perhaps a laugh, a sigh, or a compliment like, “That was a great story!”

This dynamic represents a massive bottleneck in the development of Artificial Intelligence, specifically in training Large Language Models (LLMs) to be better conversationalists.

Currently, the gold standard for aligning AI with human preferences is Reinforcement Learning from Human Feedback (RLHF). This method generally relies on humans rating specific AI responses turn-by-turn. A human looks at a prompt, looks at the AI’s answer, and gives it a thumbs up or down.

But in the real world, conversations are long. A chat might last 30 minutes and involve hundreds of exchanges. Asking users to rate every single sentence is intrusive and impractical. In “wild” settings, users typically provide a single score at the very end of the interaction—a Global Explicit (GE) reward.

The challenge is a classic “credit assignment” problem: If a user gives a conversation a 5-star rating at the end, which specific sentence deserves the credit? Was it the joke in minute two? The empathetic response in minute ten? Or was it the goodbye?

In a fascinating paper titled “Global Reward to Local Rewards: Multimodal-Guided Decomposition for Improving Dialogue Agents,” researchers from MIT and Carnegie Mellon University propose a solution called GELI. Their insight is that while we only give explicit scores at the end, we give off implicit signals (like smiles, frowns, or nods) constantly throughout the chat. By using these multimodal signals to break down that final global score, we can teach AI to be a far better conversational partner.

In this post, we will tear down the GELI framework, exploring how it mathematically decomposes global feedback and uses computer vision to guide AI alignment.


The Background: Why Current RLHF Struggles with Long Talks

To understand why GELI is necessary, we first need to look at how conversational agents are currently trained.

Most modern chatbots are powered by autoregressive language models. The model behaves as an agent with a policy, taking in the dialogue history and outputting the next sentence. To make these models helpful and harmless, we use RLHF.

The standard objective function for RLHF looks like this:

\[
\max_{\phi}\;\; \mathbb{E}\Big[\, r_{\theta}(s_t, a_t) \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_{\phi}(\cdot \mid s_t)\,\|\,\pi_{\eta}(\cdot \mid s_t)\big) \Big]
\]

Here is what this equation tells us:

  1. Maximize Reward: We want to maximize the expected reward \(r_{\theta}(s_t, a_t)\). This reward usually comes from a “reward model” trained to mimic human preferences at a specific turn \(t\).
  2. Stay Grounded: The second part (the KL divergence term) penalizes the model if it drifts too far from its original pre-trained knowledge base (\(\pi_{\eta}\)). This prevents the AI from gaming the system and speaking gibberish just to get a high score.
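
To make this concrete, here is a minimal PyTorch sketch of the per-token shaped reward used in PPO-style RLHF implementations: the reward model’s turn-level score is added at the end of the response, and every token pays a penalty proportional to how far the policy’s log-probabilities drift from the reference model \(\pi_{\eta}\). The function and tensor names are illustrative, not taken from the paper.

```python
import torch

def shaped_rewards(turn_reward: float,
                   logprobs_policy: torch.Tensor,
                   logprobs_ref: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    """KL-penalized per-token reward for one generated response (sketch).

    turn_reward:     scalar reward r_theta(s_t, a_t) for the whole turn
    logprobs_policy: (seq_len,) log-probs of the generated tokens under pi_phi
    logprobs_ref:    (seq_len,) log-probs of the same tokens under pi_eta
    """
    # Per-token KL estimate: log pi_phi(a|s) - log pi_eta(a|s)
    kl = logprobs_policy - logprobs_ref
    penalty = -beta * kl
    # The turn-level reward is credited to the final token of the response.
    bonus = torch.zeros_like(penalty)
    bonus[-1] = turn_reward
    return penalty + bonus

# Toy usage with random numbers standing in for real model outputs
print(shaped_rewards(0.8, torch.randn(12), torch.randn(12)))
```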

The Missing Piece

The equation above works great when \(r_{\theta}\) (the reward for a specific turn) is easy to get. But in long-term social dialogue, we often only have \(R_{GE}(\tau)\)—a reward for the entire trajectory or conversation session.

If we simply assign that one global score to every single sentence the AI said, it creates a noisy signal. The AI might have said something terrible in the middle of a great conversation, but if we give the whole chat a high score, the AI learns that the terrible sentence was actually good.

We need a way to decompose that global score into accurate local scores.


Enter GELI: Multimodal-Guided Decomposition

The researchers propose a framework called GELI (Global Explicit, Local Implicit). The core philosophy is that we can solve the credit assignment problem by combining two sources of data:

  1. Global Explicit (GE): The final score given by the user (e.g., “I felt positive about this chat”).
  2. Local Implicit (LI): Naturally occurring multimodal signals, specifically facial expressions (e.g., the user smiled after the AI made a joke).

The method essentially says: Let’s mathematically chop up the final score so that the sum of the parts equals the whole. But, let’s use the user’s facial expressions to help us decide which parts deserve the biggest slice of the pie.

Here is the high-level overview of the GELI architecture:

Figure 2: Overview of the GELI method showing the decomposition of global rewards guided by facial affect.

As shown in the figure above, the system takes the episode-level reward (\(R_{ep}\)) and learns a decomposed reward function. This function is shaped by visual facial affect (did the user smile?) to assign specific scores to specific utterances. Finally, these derived scores are used to update the Language Model via PPO (Proximal Policy Optimization).

Let’s break down the math of how this fusion happens.

1. The Global Explicit (GE) Decomposition

The first goal is to ensure that the local rewards we generate actually add up to the global reward the user gave. This is based on the assumption of sum decomposition.

\[
R_{GE}(\tau) \;\approx\; \sum_{t=1}^{T} r_{\theta}(s_t, a_t)
\]

The researchers train a reward model \(r_{\theta}\) such that when you sum up its outputs for every turn in the conversation, it matches the human’s final rating. To achieve this, they minimize the difference (Mean Squared Error) between the actual global score and the predicted sum:

\[
\mathcal{L}_{GE}(\theta) \;=\; \mathbb{E}_{\tau}\bigg[\Big(R_{GE}(\tau) \;-\; \sum_{t=1}^{T} r_{\theta}(s_t, a_t)\Big)^{2}\bigg]
\]
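
As a concrete illustration, here is a minimal PyTorch sketch of that sum-decomposition loss. `r_theta` stands in for any model that maps a (dialogue context, response) pair to a scalar; the names and the toy reward model are illustrative, not from the paper.

```python
import torch

def global_explicit_loss(r_theta, conversation, global_score):
    """MSE between the user's final rating and the sum of per-turn rewards.

    r_theta:      callable mapping (state, action) -> scalar tensor
    conversation: list of (state, action) pairs, one per agent turn
    global_score: scalar tensor, the Global Explicit rating R_GE(tau)
    """
    predicted_sum = torch.stack(
        [r_theta(s, a) for s, a in conversation]).sum()
    return (global_score - predicted_sum) ** 2

# Toy example: a "reward model" that just scores response length
toy_r = lambda s, a: torch.tensor(float(len(a)) / 100.0)
convo = [("hi", "hello there!"), ("how are you?", "great, thanks!")]
print(global_explicit_loss(toy_r, convo, torch.tensor(0.5)))
```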

However, summing over a long conversation (e.g., 100+ turns) is computationally expensive and difficult to optimize. To solve this, the authors utilize a technique called Randomized Return Decomposition (RRD).

RRD is a clever statistical trick. Instead of calculating the sum for the entire conversation every time, it estimates the return using Monte-Carlo sampling. It takes random snippets of the conversation and ensures the average reward of those snippets aligns with the global score (adjusted for length).

\[
\mathcal{L}_{RRD}(\theta) \;=\; \mathbb{E}_{\tau}\,\mathbb{E}_{I}\bigg[\Big(R_{GE}(\tau) \;-\; \frac{T}{|I|}\sum_{t \in I} r_{\theta}(s_t, a_t)\Big)^{2}\bigg]
\]

where \(I\) is a small, uniformly sampled subset of the conversation’s \(T\) turns.

This allows the model to handle very long conversations without the computational explosion usually associated with long-horizon reinforcement learning.
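
Here is a rough PyTorch sketch of an RRD-style estimator under that description: sample a handful of turns, scale their summed reward up to the full conversation length, and regress the estimate onto the global score. The exact weighting in the paper may differ; this follows the scaled-subsample form described above.

```python
import random
import torch

def rrd_loss(r_theta, conversation, global_score, k: int = 8):
    """Randomized Return Decomposition (sketch): Monte-Carlo estimate of
    the sum-decomposition loss using a random subset of turns."""
    T = len(conversation)
    k = min(k, T)
    subset = random.sample(conversation, k)
    # Unbiased estimate of the full sum: scale the subset sum by T / k
    estimated_sum = (T / k) * torch.stack(
        [r_theta(s, a) for s, a in subset]).sum()
    return (global_score - estimated_sum) ** 2
```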

2. The Local Implicit (LI) Guidance

If we only used the Global Explicit decomposition described above, the model might still be confused. There are many ways to assign numbers that sum up to a final score. For example, if the final score is 10, the model could assign “1, 1, 1…” ten times, or “10, 0, 0…”

To guide the model toward the correct distribution, GELI uses Local Implicit (LI) feedback. In face-to-face dialogue, humans constantly signal how they feel. If the AI says something empathetic, the user might look sad but validated, or they might smile.

The researchers treat this as a Crossmodal Knowledge Distillation problem. They want the text-based reward model to learn from the visual signals.

They define a proxy reward based on multimodal signals:

\[
\hat{r}_{LI}(s_t, a_t) \;=\; f\big(s^{m}_{t}\big)
\]

where \(s^{m}_{t}\) denotes the multimodal observation of the user (here, their facial reaction) immediately following the agent’s turn \(t\).

In this specific study, they used an “affect classifier”—a computer vision model that detects emotions from the user’s face. They designed a simple indicator function:

\[
f\big(s^{m}_{t}\big) \;=\; \mathbb{1}\big[\,\text{affect}(s^{m}_{t}) = \text{positive}\,\big]
\]

If the user shows positive affect (happiness/smiling) immediately after the AI speaks, that turn gets a “1”. Otherwise, it gets a “0”.
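
A minimal sketch of that indicator in Python; `classify_affect` is a stand-in for whatever facial-affect classifier is available (the paper uses a pretrained affect model, but the function here is hypothetical):

```python
def local_implicit_proxy(listener_frames, classify_affect) -> float:
    """Return 1.0 if the listener shows positive affect right after the
    agent's turn, else 0.0.

    listener_frames: video frames of the user following the agent's turn
    classify_affect: stand-in for any facial-affect classifier
                     (frame -> label such as "happiness", "neutral", ...)
    """
    labels = [classify_affect(frame) for frame in listener_frames]
    return 1.0 if "happiness" in labels else 0.0
```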

The model is then trained to minimize the difference between its predicted reward and this visual proxy reward:

\[
\mathcal{L}_{LI}(\theta) \;=\; \mathbb{E}_{\tau}\bigg[\sum_{t=1}^{T}\Big(r_{\theta}(s_t, a_t) \;-\; \hat{r}_{LI}(s_t, a_t)\Big)^{2}\bigg]
\]
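
And a correspondingly small PyTorch sketch of that distillation step: the text-only reward model’s per-turn predictions are pushed toward the 0/1 visual proxy with a mean-squared error. Names are illustrative.

```python
import torch

def local_implicit_loss(r_theta, conversation, proxy_rewards):
    """MSE between predicted per-turn rewards and the 0/1 visual proxy.

    proxy_rewards: list of floats from the affect indicator, one per turn
    """
    preds = torch.stack([r_theta(s, a) for s, a in conversation])
    targets = torch.tensor(proxy_rewards, dtype=preds.dtype)
    return torch.mean((preds - targets) ** 2)
```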

3. The Joint Objective

The magic of GELI is that it doesn’t choose between these two methods. It combines them. The final training objective blends the Global decomposition (making sure the math adds up) with the Local implicit guidance (making sure the rewards match the user’s body language).

\[
\mathcal{L}_{GELI}(\theta) \;=\; \mathcal{L}_{RRD}(\theta) \;+\; \lambda\, \mathcal{L}_{LI}(\theta)
\]

By tuning the parameter \(\lambda\), the researchers can balance how much the model cares about the final survey score versus the moment-to-moment smiles.
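
Putting the pieces together, a compact sketch of the joint objective might look like the following. It inlines the RRD term and the local-implicit term from the previous snippets and combines them with a single weight `lam`, matching the \(\lambda\) above; this is an illustrative reconstruction, not the paper’s actual code.

```python
import random
import torch

def geli_loss(r_theta, conversation, global_score, proxy_rewards,
              lam: float = 0.5, k: int = 8):
    """Joint GELI objective (sketch): RRD-style global decomposition
    plus lambda-weighted local-implicit distillation."""
    T = len(conversation)
    preds = torch.stack([r_theta(s, a) for s, a in conversation])

    # Global Explicit term (RRD): random subset, scaled to full length
    idx = random.sample(range(T), min(k, T))
    estimated_sum = (T / len(idx)) * preds[idx].sum()
    loss_ge = (global_score - estimated_sum) ** 2

    # Local Implicit term: match the 0/1 facial-affect proxy
    targets = torch.tensor(proxy_rewards, dtype=preds.dtype)
    loss_li = torch.mean((preds - targets) ** 2)

    return loss_ge + lam * loss_li
```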

Visualization: Does it work?

Before running the full experiment, the researchers analyzed the rewards generated by their GELI model on unseen conversations.

Figure 1: Example of GELI reward score predictions for an unseen conversation.

In the figure above, you can see the “unrolled” rewards. The bar charts on the right are particularly illuminating. Notice turns 6 through 10.

  • Turn 9: The AI makes a specific comment about video games (“Okay, pray for you play Fortnite…”). The reward spikes to +0.056.
  • Turn 7: The AI asks a generic question (“Video games?”). The reward dips to -0.032.

This confirms that the decomposed reward function is successfully distinguishing between high-quality, engaging turns and low-effort filler, even though it was never explicitly told which specific sentences were good.
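
This kind of per-turn inspection is easy to reproduce once a reward model is trained: score every agent turn in a held-out conversation and see where the credit lands. Below is a toy sketch; the stand-in reward model simply prefers longer responses and is not the learned \(r_{\theta}\).

```python
import torch

def unroll_rewards(r_theta, conversation):
    """Score each (context, response) turn with the learned reward model."""
    with torch.no_grad():
        return [float(r_theta(s, a)) for s, a in conversation]

# Toy stand-in reward model: slightly prefers longer responses
toy_r = lambda s, a: torch.tensor(0.01 * len(a.split()) - 0.03)
convo = [("I mostly play games at night.", "Video games?"),
         ("Yeah, Fortnite mostly.",
          "Nice, Fortnite is a fun one. Do you play solo or with friends?")]
for t, r in enumerate(unroll_rewards(toy_r, convo)):
    print(f"turn {t}: reward {r:+.3f}")
```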


The Experiments: Testing in the Wild

To validate GELI, the researchers used the CANDOR corpus. This is a massive dataset of naturally occurring conversations where two strangers talk via video call. It contains over 850 hours of video, making it perfect for extracting both global scores (post-chat surveys) and local visual signals (video feeds).

They compared GELI against several baselines:

  1. GE Only: Using just the global score (via RRD).
  2. LI Only: Using just the visual signals (Visual Affect) or just sentiment analysis of the text (Language Sentiment).
  3. Human: Real human responses.

The reward functions were used to train a LLAMA-2 model via PPO.
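
Schematically, the decomposed reward model then drops into an ordinary PPO loop in place of a turn-level preference score. The sketch below only shows the plumbing: `generate_response` and `ppo_update` are placeholders for whatever sampling routine and PPO implementation you use, not APIs from the paper.

```python
import torch

def rlhf_step(policy, r_theta, dialogue_history, generate_response, ppo_update):
    """One PPO-style update driven by the decomposed reward (sketch).

    policy:            the dialogue LM being fine-tuned
    r_theta:           the GELI-trained reward model, (state, action) -> scalar
    dialogue_history:  list of prior utterances (the state)
    generate_response: placeholder for sampling a response from the policy
    ppo_update:        placeholder for a real PPO implementation
    """
    state = " ".join(dialogue_history)
    action = generate_response(policy, state)    # next utterance
    with torch.no_grad():
        reward = r_theta(state, action)          # local reward, not R_GE
    ppo_update(policy, state, action, reward)    # clipped PPO + KL penalty
    return action, float(reward)
```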

Quantitative Results: The Reward Function

First, let’s look at how well the reward functions learned their tasks.

Table 4: Automatic Evaluation on Reward Function Training.

This table reveals a critical insight.

  • Global Loss (\(L_{GE}\)): The GE-only methods (like RRD) are great at minimizing global loss. They are good at ensuring the numbers add up.
  • Local Difference (\(\Delta\)): The LI-only methods are great at distinguishing positive vs. negative visual frames.
  • GELI (RRD + VA): This approach achieves the “best of both worlds.” It maintains a low Global Loss (176.897) while achieving a significant Local Difference (0.063). It learned to satisfy the global score and respect the local visual cues.

Human Evaluation: The Ultimate Test

The true test of a dialogue agent is whether humans enjoy talking to it. The researchers generated conversations using the different models and asked human annotators to rate them on 8 different metrics, including emotional connection, specificity, and sensibleness.

Table 1: Human evaluation results comparing GELI against baselines.

The results in Table 1 are striking. GELI outperforms the baselines on 6 out of 8 metrics.

  • Positivity: GELI scored 44.33%, nearly double the score of the GE-only baseline (16.33%).
  • Reuse: Users were far more likely to say they would talk to the GELI chatbot again (41.67%) compared to the others.
  • Emotional Connection: GELI achieved the highest connection score (39.67%), significantly higher than the base LLAMA-2 model.

The authors also noted that training on only one signal (just GE or just LI) often hurt performance compared to the base model. This suggests that combining the signals is necessary to extract a useful reward signal for reinforcement learning.

Generalization

One might worry that GELI only works on the dataset it was trained on. To test this, the researchers applied the GELI-trained model to a completely different dataset called SODA (social commonsense dialogue).

Even on this new dataset, GELI outperformed GPT-3.5 and the base LLAMA-2 model, showing that the “social skills” learned from the multimodal decomposition transferred to new text-based contexts.


Training Dynamics

It is often helpful to look under the hood at how the Reinforcement Learning (RL) process actually looked during training. RL is notoriously unstable, and seeing the curves can tell us a lot about the health of the optimization.

The RL algorithm used here is PPO, which optimizes the same KL-regularized objective introduced earlier: maximize the learned reward while keeping the policy close to the reference model.

\[
\max_{\phi}\; \mathbb{E}\Big[\, r_{\theta}(s_t, a_t) \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_{\phi}(\cdot \mid s_t)\,\|\,\pi_{\eta}(\cdot \mid s_t)\big)\Big]
\]

Let’s look at the training curves for the different methods.

GELI (The Winner): Figure 7: GELI RL Training showing stable reward growth. Notice the Left Chart (Reward). It starts around 500 and gently climbs, stabilizing around 540. The KL divergence (Right Chart) grows but remains controlled until the very end. This indicates healthy learning—the model is finding a way to get more reward without breaking the language model immediately.

LI: Visual Affect Only: Figure 9: Visual Affect RL Training showing collapse. Compare this to the “Visual Affect Only” training: the reward crashes to near zero after about 80 steps. This is a sign of reward hacking or objective collapse; the visual signal on its own was likely too noisy and sparse to guide the language model.

GE: RRD Only: Figure 10: GE RRD RL Training. The Global-only method is extremely noisy. The reward fluctuates wildly between 100 and 250. Without the local guidance of the visual signals, the credit assignment is too difficult, and the model struggles to figure out exactly what it is doing right.


Conclusion and Implications

The GELI paper presents a significant step forward in how we think about aligning AI agents. It challenges the assumption that we need expensive, fine-grained human annotations for every sentence.

By leveraging the “Global Explicit” scores we already have (like survey ratings) and decomposing them using the “Local Implicit” signals we usually throw away (like video feeds of facial expressions), we can train agents that are:

  1. More Empathetic: They understand which turns create emotional connection.
  2. More Engaging: Users want to talk to them again.
  3. Data Efficient: We can utilize naturally occurring interaction data rather than paying for thousands of hours of manual labeling.

This approach mimics human learning. We don’t learn social skills by filling out a questionnaire after every sentence we speak. We learn by watching the faces of the people we talk to, interpreting their smiles and frowns in real-time, and correlating that with how the interaction went overall.

GELI brings AI one step closer to that natural, multimodal style of learning, paving the way for companions that don’t just process text, but actually understand the feeling of a conversation.