Have you ever had a conversation where you thought the other person understood you, only to realize ten minutes later they had absolutely no idea what you were talking about? In human communication, avoiding this requires a constant, subtle process of checking, clarifying, and confirming. This is called Conversational Grounding.
We know Large Language Models (LLMs) like GPT-4 and Llama are fluent speakers. They can generate poetry, code, and essays. But are they good listeners? Do they truly build a shared understanding with a user, or are they just predicting the next likely word based on surface-level patterns?
In the paper “Evaluating the Effectiveness of Large Language Models in Establishing Conversational Grounding,” researchers Mohapatra, Kapadnis, Romary, and Cassell take a deep dive into this question. They developed a rigorous framework to test whether LLMs can handle the messy, dynamic nature of correcting misunderstandings—and they found that size matters more than we thought.
The Problem: Talking vs. Understanding
In linguistics, “common ground” refers to the mutual knowledge and assumptions shared by participants in a conversation. Conversational Grounding is the active process of updating this common ground. It involves:
- Repairs: Correcting oneself (e.g., “The blue box… sorry, the red box”).
- Clarifications: Asking when something is ambiguous (e.g., “Do you mean the one on the left?”).
- Acknowledgements: Confirming receipt of information.
While humans do this intuitively (using nods, “uh-huhs,” and eye contact), AI models struggle. Previous research suggested that even models trained on massive dialogue datasets often fail to update their internal state when a user corrects them. However, testing this is hard. Human evaluation is slow and expensive, making it difficult to keep up with the rapid release of new LLMs.
The researchers in this paper propose a scalable, automated way to evaluate these skills across models of varying sizes, from T5 to GPT-4.
The Setup: The “Meetup” Game
To test grounding, you need a scenario where precise understanding is required. The researchers utilized the Meetup dataset, based on a 2D grid game.
In this scenario, two participants are placed in a virtual building. They can’t see each other. To win, they must navigate to the same room, which means describing their surroundings (provided to the models as textual descriptions generated from the game’s images) to figure out where they are relative to one another.
This setup is perfect for testing grounding because it forces negotiation. If Player A says, “I’m in the kitchen,” and Player B says, “I see a kitchen too, does yours have a fridge?”, they are actively building common ground.
The researchers fed the models the dialogue history, game instructions, and descriptions of the visual scenes.

As shown in Figure 1 above, the model is given a rich context including timestamps, speaker labels, and visual descriptions. The challenge is to see how the model reacts when the conversation gets tricky.
Method 1: The Perplexity Test
How do we measure if a model understands a correction without asking a human to grade it? The authors used Perplexity (PPL).
Perplexity is a measurement of how surprised a model is by a sequence of words. If a model understands the context, it should assign a high probability (and thus low perplexity) to a logical response. If a response makes no sense in context, the model should be “surprised” (high perplexity).
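Formally, for a candidate response \(y = (y_1, \dots, y_N)\) following a dialogue context \(x\), perplexity is the standard exponentiated average negative log-likelihood the model assigns to the response's tokens:

\[
\mathrm{PPL}(y \mid x) = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\big( y_i \mid x, y_{<i} \big) \right)
\]

The lower the value, the less "surprised" the model is by that continuation in that context.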

The researchers identified specific “Grounding Acts” in the dataset—moments where communication was negotiated. They then created two possible responses for the model to evaluate:
- The Correct Response: A response that properly incorporates the grounding information (e.g., acknowledging a correction).
- The Deceptive/Wrong Response: A response that fits grammatically and topically but fails to account for the grounding context (e.g., ignoring a correction).
Ideally, a smart model will have lower perplexity for the Correct Response than the Wrong one.
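As a concrete illustration, here is a minimal sketch (not the authors' code) of how such a comparison can be run with an open-weight model via Hugging Face transformers; the model name and the toy dialogue are placeholders, not examples from the dataset.

```python
# Minimal sketch (not the authors' code): compare two candidate responses by the
# perplexity an open-weight causal LM assigns them, given the dialogue context.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative choice of open model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def response_perplexity(context: str, response: str) -> float:
    """Perplexity of `response` given `context`, scoring only the response tokens."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :ctx_len] = -100  # ignore context tokens in the loss
    with torch.no_grad():
        loss = model(input_ids=full_ids, labels=labels).loss  # mean NLL over response tokens
    return torch.exp(loss).item()

context = "A: It's a yellow seat... sorry, a yellow table.\nB:"
correct = " Got it, I'll look for a yellow table."
wrong = " Okay, I'm next to the yellow seat now."

# The model "passes" this test case if the grounded response is less surprising.
print(response_perplexity(context, correct) < response_perplexity(context, wrong))
```

In practice one would also have to be careful about tokenizer boundary effects between context and response, but the core of the test is simply two perplexity calls and a comparison.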
The Test Categories
They tested several linguistic phenomena:
Repair: The speaker corrects themselves. Does the model update its knowledge?
Example: “It’s a yellow seat… sorry, yellow table.” (Model must look for a table).
See Figure 7 below for a visualization of a Repair test case.

Cancel: The speaker retracts information entirely.
Example: “I’m going North… actually forget that.”
See Figure 8 below for a Cancel test case.

Reference Ambiguity: The speaker uses a vague term. Does the model realize it needs to ask for clarification?
See Figure 2 below.

Results: The “Emergent Property” of Listening
The results revealed a stark divide between small and large models.
The researchers calculated the ratio of test cases where the model correctly preferred the logical response (assigned it lower perplexity). A score of 0.50 means the model is essentially guessing randomly.
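Made explicit, the reported score is just the fraction of test cases the model gets right; the helper below (with made-up toy numbers, not real results) shows the metric:

```python
# Toy illustration of the evaluation metric: the fraction of test cases where the
# correct response receives strictly lower perplexity than the deceptive one.
def preference_ratio(ppl_correct: list[float], ppl_wrong: list[float]) -> float:
    wins = sum(c < w for c, w in zip(ppl_correct, ppl_wrong))
    return wins / len(ppl_correct)

# Made-up numbers for three test cases: the model prefers the correct response twice.
print(preference_ratio([3.1, 4.0, 2.2], [3.4, 3.9, 5.0]))  # 0.67 -> better than chance
```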

Key Takeaways from the Data:
- Small Models Struggle: Look at the T5 and Godel models in Table 2. Their scores hover around 0.40–0.50. They often preferred the wrong response. Even when fine-tuned (CLM), they barely improved. This suggests they are relying on simple keyword matching rather than understanding the flow of conversation.
- Size Matters: The Llama-13B model significantly outperformed the 7B version.
- Data Matters: The Llama 3.1-8B model (trained on much more data than the original Llama-7B) performed comparably to the larger Llama-13B.
This suggests that conversational grounding is an emergent capability. It doesn’t seem to exist in smaller models but appears spontaneously as models get larger and consume more training data.
What about GPT-4?
Since GPT-4 is closed-source, the researchers couldn’t measure perplexity directly. Instead, they used prompt-based testing, asking the model to pick the best response.
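The snippet below is only an illustration of this setup using the OpenAI Python client; the prompt wording and the toy dialogue are placeholders, not the paper's actual prompts.

```python
# Illustrative sketch of prompt-based evaluation for a closed model (not the paper's exact prompt).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = (
    "You are evaluating a two-player navigation game. Given the dialogue so far, "
    "pick the follow-up (A or B) that correctly accounts for what was said.\n\n"
    "Dialogue:\n"
    "Player 1: It's a yellow seat... sorry, a yellow table.\n\n"
    "Candidate follow-ups for Player 2:\n"
    "A) Got it, I'll look for a yellow table.\n"
    "B) Okay, I'm standing next to the yellow seat now.\n\n"
    "Answer with a single letter."
)

reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(reply.choices[0].message.content)  # a grounding-aware model should answer "A"
```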

As Table 4 shows, GPT-4 is nearly perfect, scoring 0.95 or 1.00 in difficult categories like “Cancel” and “Request-Repair.” It demonstrates a robust understanding of pragmatic nuances that smaller open-source models lack.
Deep Dive: Why Do Smaller Models Fail?
The researchers didn’t stop at “smaller models are worse.” They wanted to know why. To find out, they designed a novel Embedding Study.
They analyzed the internal vector representations (embeddings) of the dialogue. They created four versions of specific dialogues:
- D1 (Original): Contains the grounding event (e.g., a self-correction).
- D2 (Clean): A paraphrase where the error never happened (straight to the point).
- D3 (Paraphrase): Another rephrasing of the clean version.
- D4 (Wrong): A version containing incorrect information.
The Logic: If a model truly understands that “I saw a dog… no, a cat” means the same thing as “I saw a cat,” then the embeddings for D1 and D2 should be close together in mathematical space. They should be far away from D4 (the wrong info).
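The paper inspects the models' own internal representations; as a rough stand-in, the sketch below runs the same D1–D4 comparison with a generic sentence encoder, just to make the distance logic concrete (the example dialogue is invented):

```python
# Sketch of the distance comparison behind the embedding study, using a generic
# sentence encoder as a stand-in for the models' internal representations.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

encoder = SentenceTransformer("all-MiniLM-L6-v2")

dialogues = {
    "D1_original":   "I saw a dog... no, sorry, a cat, sitting by the window.",
    "D2_clean":      "I saw a cat sitting by the window.",
    "D3_paraphrase": "There was a cat next to the window.",
    "D4_wrong":      "I saw a dog sitting by the window.",
}
emb = {name: encoder.encode(text) for name, text in dialogues.items()}

def sim(a: str, b: str) -> float:
    return float(cosine_similarity([emb[a]], [emb[b]])[0, 0])

# A grounding-aware representation should place D1 closer to D2/D3 than to D4.
print("D1 vs D2 (clean):     ", sim("D1_original", "D2_clean"))
print("D1 vs D3 (paraphrase):", sim("D1_original", "D3_paraphrase"))
print("D1 vs D4 (wrong):     ", sim("D1_original", "D4_wrong"))
```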

The researchers calculated a score (\(V\)) to represent this relationship.

The findings were illuminating: Smaller models clustered dialogues based on lexical overlap (using the same words) rather than semantic meaning. If the user said “Red box… no, blue box,” the small model focused on the word “Red” and associated it with other sentences containing “Red,” failing to process the “no” that cancelled it out.
Larger models, however, showed embedding distances that reflected the final meaning of the conversation, effectively “erasing” the corrected mistake from their internal representation.
Can We Fix It? Positive and Negative Reward Training
The team proposed a solution to help medium-sized models (like Llama-7B/13B) catch up to giants like GPT-4 without needing trillions more parameters.
They used Positive and Negative Reward Training. This is a fine-tuning technique where the model is explicitly punished for picking the deceptive “wrong” response and rewarded for picking the correct one.

In this equation, \(W_1\) and \(W_2\) are weights that balance the reward for picking the correct response against the penalty for picking the wrong one.
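As one plausible way to implement a weighted reward/penalty of this kind (a sketch only: the hinge and margin below are illustrative choices of mine, not the paper's exact equation), using the mean negative log-likelihoods of the correct and wrong candidate responses:

```python
# Sketch of a positive/negative reward fine-tuning objective (one plausible form,
# not the paper's exact equation). `nll_correct` / `nll_wrong` are the mean
# negative log-likelihoods of the two candidate responses given the dialogue context.
import torch

def grounding_loss(nll_correct: torch.Tensor, nll_wrong: torch.Tensor,
                   w1: float = 1.0, w2: float = 0.5, margin: float = 5.0) -> torch.Tensor:
    positive = w1 * nll_correct                                # reward: keep the correct response likely
    negative = w2 * torch.clamp(margin - nll_wrong, min=0.0)   # penalty: push the wrong response's
                                                               # likelihood down, capped by a margin
    return positive + negative

# In a training loop: loss = grounding_loss(nll(correct), nll(wrong)); loss.backward(); optimizer.step()
```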
The result? It worked remarkably well for the Llama models.

Table 6 shows the improvements. For Llama-7B, performance on “Reference Ambiguity” jumped from 0.80 to 0.95. The models learned to pay attention to the specific cues of grounding acts. Notably, however, the very small models (T5) did not improve, further reinforcing the idea that a minimum capacity is required to grasp these concepts.
Conclusion and Future Implications
This research highlights a critical nuance in the development of AI. We often assume that if a model speaks fluently, it understands. This paper proves that conversational grounding—the ability to negotiate meaning—is a sophisticated cognitive task that simple language modeling does not guarantee.
Key Takeaways:
- Grounding is Emergent: Only models of a certain size (or trained on massive datasets) naturally acquire the ability to handle repairs and cancellations.
- Surface vs. Depth: Smaller models process dialogue largely by keyword matching, while larger models process the pragmatic state of the conversation.
- Targeted Training Works: We don’t always need the biggest model. With specific positive/negative reward training, mid-sized models can be taught to “listen” better.
As we move toward more autonomous agents—AI that books appointments, navigates interfaces, or acts as a therapist—grounding becomes non-negotiable. An AI that can’t understand “Wait, not Tuesday, I meant Thursday” is useless, no matter how eloquent its apology is. This paper provides the benchmarks and methods needed to ensure our future AI companions aren’t just hearing us, but truly understanding.