Imagine you are sitting at a table with three objects: a red circle, a red square, and a gray circle. Someone points toward the table and says, “The red one!”

Strictly speaking, this sentence is ambiguous: there are two red objects. Yet most humans would immediately reach for the red circle. Why? Because if the speaker wanted the red square, they likely would have said "the square," since shape alone uniquely identifies that object. By choosing color instead, the speaker signals that color is the distinguishing feature of the target, which points to the red circle.

This ability to read between the lines is called pragmatic reasoning. We don’t just process the literal meaning of words; we model the speaker’s intentions and the context.

Figure 1: The speaker is asking for the red object. For a literal listener, this is ambiguous. A reasoning listener considers alternative messages about shape and color features and concludes that the speaker is asking for the red circle, as “square” would have been a more informative message for the other red object.

As shown in Figure 1, a “Literal Listener” gets stuck on the ambiguity. A “Reasoning Listener,” however, understands that efficient communication relies on a shared understanding of context.

In the research paper “Communicating with Speakers and Listeners of Different Pragmatic Levels,” researchers Kata Naszádi, Frans A. Oliehoek, and Christof Monz explore how Artificial Intelligence agents can learn this type of reasoning. They simulate interactions between speakers and listeners with varying levels of “pragmatic competence” to answer a crucial question: Does matching the reasoning levels of communication partners help them learn language faster?

The Trade-off: Clarity vs. Conciseness

In human language, there is a constant tension between being clear and being concise. If we trust our listener to “get it,” we can use short, efficient messages. If we are speaking to a child or someone learning the language, we tend to be over-explicit (verbose).

The authors of this paper investigate this dynamic using the Rational Speech Act (RSA) framework. They create a simulation where agents play “referential games”—essentially looking at a set of images and trying to identify a target based on a caption. By manipulating the agents’ ability to reason about each other, the researchers uncover fascinating insights into how AI—and perhaps humans—should approach teaching and learning language.

The Core Method: Building Recursive Reasoners

To understand the experiments, we first need to understand how the agents “think.” The authors use the RSA model, which formalizes communication as a recursive process. It’s a game of “I think that you think that I think…”

The model is built in layers, starting with the most basic understanding and building up to complex reasoning.

1. The Literal Interpretation (\(D\))

Before any reasoning happens, an agent must understand the literal connection between an image and a word. The authors use a neural network approach. They embed images using a Convolutional Neural Network (CNN) and text using a Recurrent Neural Network (RNN).

The compatibility between an image \(o_i\) and a message \(w\) is calculated as the dot product of their embeddings:

\[
D(o_i, w) = \mathrm{CNN}_{\theta}(o_i)^{T} \, \mathrm{RNN}_{\theta}(w)
\]

This function, \(D\), represents the “truth” value—how well does the description literally fit the image?
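To make this concrete, here is a minimal PyTorch sketch of such a scoring function. The encoder architectures, dimensions, and vocabulary size are illustrative placeholders, not the paper's exact configuration:

```python
# A minimal sketch of the literal compatibility score D(o_i, w).
# Encoder architectures and sizes are illustrative, not the paper's.
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    def __init__(self, embed_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(16, embed_dim)

    def forward(self, images):            # images: (B, 3, H, W)
        h = self.conv(images).flatten(1)  # (B, 16)
        return self.proj(h)               # (B, embed_dim)

class MessageEncoder(nn.Module):
    def __init__(self, vocab_size=100, embed_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, tokens):            # tokens: (B, T)
        _, h = self.rnn(self.embed(tokens))
        return h.squeeze(0)               # (B, embed_dim)

img_enc, msg_enc = ImageEncoder(), MessageEncoder()
images = torch.randn(4, 3, 32, 32)        # a context C of four images
message = torch.randint(0, 100, (1, 5))   # one five-token message w
D = img_enc(images) @ msg_enc(message).T  # (4, 1): D(o_i, w) per image
```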

2. The Literal Listener (\(L_0\))

The base of the hierarchy is the Literal Listener (\(L_0\)). This agent doesn’t think about why a speaker said something; it just looks at the literal match. It calculates the probability of an image being the target by normalizing the literal scores across all available images in the context (\(C\)).

\[
L_0(i \mid w, C) = \frac{e^{D(o_i, w)}}{\sum_{j=1}^{|C|} e^{D(o_j, w)}}
\]

If the speaker says “red” and there are two red objects, \(L_0\) assigns them equal probability. It cannot resolve the ambiguity in Figure 1.
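A small numerical example shows the ambiguity directly. The literal scores below are hand-set, illustrative values for the Figure 1 context, not learned ones:

```python
# A minimal sketch of the literal listener L0 on the Figure 1 context.
# The literal scores D(o_i, w) are hand-set, illustrative values.
import numpy as np

def literal_listener(scores):
    """L0(i | w, C): a softmax over the literal scores in context."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Images: red circle, red square, gray circle.
# Scores for the message "red": it literally fits both red objects.
D_red = np.array([5.0, 5.0, -5.0])
print(literal_listener(D_red))  # ~[0.5, 0.5, 0.0] -- ambiguous
```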

3. The Pragmatic Speaker (\(S_n\))

Now we introduce the first level of reasoning. A Pragmatic Speaker wants to be understood by a specific listener. The Speaker (\(S_n\)) chooses a message \(w\) that maximizes the probability that the Listener (\(L_{n-1}\)) will pick the correct target image \(i\).

However, the speaker is also “economical.” It subtracts a cost for the message (e.g., longer messages “cost” more).

\[
S_n(w \mid C, i) = \frac{e^{\lambda \left( \log L_{n-1}(i \mid C, w) - \operatorname{cost}(w) \right)}}{\sum_{w' \in V} e^{\lambda \left( \log L_{n-1}(i \mid C, w') - \operatorname{cost}(w') \right)}}
\]

This equation shows that the Speaker \(S_n\) simulates the Listener \(L_{n-1}\).

  • \(S_1\) Speaker: Simulates a Literal Listener (\(L_0\)). It knows \(L_0\) is easily confused, so \(S_1\) tends to be more descriptive (and verbose) to ensure clarity.
  • \(S_3\) Speaker: Simulates a Pragmatic Listener (\(L_2\)). It knows the listener is smart, so it can be more concise, trusting the listener to infer the rest (see the sketch after this list).
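Here is a minimal sketch of the speaker computation, using the same style of hand-set lexicon as above (rows are messages, columns are images); the cost and \(\lambda\) values are illustrative assumptions:

```python
# A minimal sketch of S_n on top of a listener distribution.
# The lexicon, costs, and lambda are illustrative, hand-set values.
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Messages (rows): "red", "gray", "circle", "square"
# Images (columns): red circle, red square, gray circle
lexicon = np.array([[ 5.,  5., -5.],   # "red"
                    [-5., -5.,  5.],   # "gray"
                    [ 5., -5.,  5.],   # "circle"
                    [-5.,  5., -5.]])  # "square"
costs, lam = np.full((4, 1), 0.6), 3.0

L0 = softmax(lexicon, axis=1)                     # listener: over images
S1 = softmax(lam * (np.log(L0) - costs), axis=0)  # speaker: over messages
print(S1[:, 1])  # target = red square: "square" clearly dominates
```

With the red square as the target, "square" wins because it leaves the simulated \(L_0\) no room for confusion, while "red" only earns it a coin flip.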

4. The Pragmatic Listener (\(L_n\))

Finally, we have the Pragmatic Listener. This agent doesn’t just look at the literal meaning. It asks, “Given the context, what message would the Speaker have chosen?”

\[
L_n(i \mid C, w) \propto S_{n-1}(w \mid C, i) \, P(C, i)
\]

By using Bayes’ rule, the Listener \(L_n\) reasons backward from the Speaker \(S_{n-1}\)’s behavior. An \(L_2\) listener knows that an \(S_1\) speaker is trying to be clear. If the speaker says “red” (and not “square”), the \(L_2\) listener infers that “square” wasn’t the intended target, resolving the ambiguity.
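Putting the pieces together, here is a minimal sketch of the full \(L_0 \to S_1 \to L_2\) recursion on the Figure 1 context, assuming the same illustrative lexicon and a uniform prior:

```python
# A minimal sketch of L2 resolving the Figure 1 ambiguity, assuming
# a uniform prior P(C, i) and the same illustrative lexicon as above.
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

lexicon = np.array([[ 5.,  5., -5.],   # "red"
                    [-5., -5.,  5.],   # "gray"
                    [ 5., -5.,  5.],   # "circle"
                    [-5.,  5., -5.]])  # "square"
costs, lam = np.full((4, 1), 0.6), 3.0

L0 = softmax(lexicon, axis=1)                     # literal listener
S1 = softmax(lam * (np.log(L0) - costs), axis=0)  # pragmatic speaker
L2 = S1 / S1.sum(axis=1, keepdims=True)           # Bayes, uniform prior
print(L2[0])  # hearing "red": mass shifts to the red circle
```

With these toy numbers, \(L_2\) hears "red" and assigns most of the probability mass to the red circle, exactly the inference described above.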

The Innovation: Reasoning While Learning

In most previous research, the “literal meaning” (the lexicon) is learned first, and pragmatic reasoning is added on top during the testing phase.

The authors of this paper take a different approach. They hypothesize that agents should use recursive reasoning during the learning process. By backpropagating the error through the reasoning steps (the speaker and listener equations above) down to the literal representation weights, the agent learns a lexicon that is optimized for pragmatic communication.
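A rough sketch of what this looks like mechanically, assuming a toy trainable lexicon and toy supervision rather than the paper's actual training pipeline: the loss is computed on the pragmatic listener's output, and autograd pushes gradients through the speaker and listener computations into the literal scores.

```python
# A minimal sketch of reasoning-while-learning: the loss is taken on
# L2's output, so gradients flow through S1 and L0 back into the
# trainable literal scores. Toy supervision, not the paper's pipeline.
import torch
import torch.nn.functional as F

lexicon = torch.randn(4, 3, requires_grad=True)  # trainable D(o_i, w)
costs, lam = torch.full((4, 1), 0.6), 3.0
opt = torch.optim.Adam([lexicon], lr=0.1)

for step in range(200):
    L0 = F.softmax(lexicon, dim=1)                        # over images
    S1 = F.softmax(lam * (torch.log(L0) - costs), dim=0)  # over messages
    L2 = S1 / S1.sum(dim=1, keepdim=True)                 # Bayes, uniform prior
    # Toy supervision: message 0 ("red") should pick out image 0.
    loss = -torch.log(L2[0, 0])
    opt.zero_grad(); loss.backward(); opt.step()
```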

Experiments: The Game of Shapes

The researchers used a dataset called ShapeWorld. Agents are presented with a target image and several “distractors.” The images vary by color (e.g., red, blue) and shape (e.g., square, circle).

They set up different environments:

  • Uncorrelated: Shapes and colors are random.
  • Correlated (\(Corr=1\)): Certain feature combinations appear more often, making context more important.

They tested combinations of:

  • Listeners: \(L_0\) (Literal) vs. \(L_2\) (Pragmatic).
  • Speakers: \(S_1\) (Verbose/Explicit) vs. \(S_3\) (Concise/Pragmatic).

Results: Who is the Best Teacher?

The results offer several compelling takeaways about how different reasoning levels interact.

1. Higher-Level Speakers are Lazier (More Efficient)

First, the researchers confirmed that their modeled speakers behave as expected.

Table 1: Average message length in words over 5000 samples for different numbers of distractors and speaker levels, \(Corr = 1\). Higher-level speakers send shorter messages, and more distractors result in longer messages.

As seen in Table 1, the \(S_3\) speaker uses fewer words (1.01 to 1.09 on average) than the \(S_1\) speaker. The \(S_3\) speaker assumes the listener is smart (\(L_2\)) and can infer meaning from context, so it drops redundant words. The \(S_1\) speaker assumes the listener is literal (\(L_0\)) and over-communicates to be safe.

2. Pragmatic Listeners are Robust

When the agents are just “talking” (inference) using an already learned vocabulary, who performs better?

Table 2: A listener trained as \(L_0\), upgraded to different listener levels and paired with \(S_1\) or \(S_3\) at evaluation. Both \(L_0\) and \(L_2\) perform significantly better with the more verbose \(S_1\). When receiving messages from an \(S_3\), the higher-level \(L_2\) is significantly better. Evaluation setup: \(cost = 0.6\), \(N = 5\), \(Corr = 1\).

Table 2 reveals a few key points:

  1. Everyone loves \(S_1\): Both listener types perform best with the verbose speaker (\(S_1\)). Explicit instructions are simply easier to follow.
  2. Pragmatics saves the day with concise speakers: Compare rows (a) and (b). When the speaker is concise (\(S_3\)), the literal listener (\(L_0\)) struggles (80.5%), but upgrading that same listener to pragmatic reasoning (\(L_2\)) yields a significant improvement (81.2%).

3. Explicit Teachers are Better for Learning

The most interesting finding comes from the learning phase. If you are a blank slate trying to learn what “red” means, who should you listen to? A concise poet or a verbose instructor?

Table 3: For each level of listener, learning from the lower-level \(S_1\) results in significantly better accuracy. Listener levels are kept the same during evaluation and training. Training and evaluation setup: \(cost = 0.6\), \(N = 5\), \(Corr = 1\). Evaluation: \(S_1\).

Table 3 provides a clear answer: Learn from the verbose speaker (\(S_1\)). Regardless of whether the learner is smart (\(L_2\)) or literal (\(L_0\)), they achieve much higher accuracy if they are trained by an \(S_1\) speaker. The \(S_1\) speaker provides more data (more words per image), which helps ground the literal meanings of words more effectively. The concise \(S_3\) speaker leaves too much unsaid, making it hard for a learner to map words to features.

4. Pragmatic Learners can Compensate for Difficult Teachers

But what if you don’t have a choice? What if you have to learn from a concise, difficult speaker (\(S_3\))?

Figure 2: During training, listeners are paired with speakers of different pragmatic competence. The listeners are trained in environments of increasing difficulty. \(L_0\) learners paired with \(S_1\) speakers have the same performance as \(L_2\) paired with \(S_3\).

Figure 2 illustrates the interaction between training difficulty and learner capability.

  • The dark blue line (\(L_0 - S_1\)) represents a literal learner with a verbose teacher. They do well.
  • The yellow line (\(L_2 - S_3\)) represents a smart, pragmatic learner with a concise teacher.

Remarkably, these two lines overlap. This means that being a pragmatic learner allows you to learn just as well from a difficult teacher as a literal learner does from an easy teacher. The pragmatic learner uses reasoning to fill in the gaps left by the concise speaker.

5. Pragmatic Learners Learn Faster

Finally, the authors looked at the speed of convergence.

Figure 3: Higher-level listeners learn more quickly. In this comparison, all other parameters, such as speaker level, number of distractors, and correlation between shapes, are held constant.

Figure 3 shows the learning curves. The orange line (\(L_4\)) and teal line (\(L_2\)) rise much faster than the cyan line (\(L_0\)). Listeners that integrate pragmatic reasoning into their learning process grasp the task significantly faster than those relying solely on literal interpretations.

Conclusion

This research highlights a fundamental truth about communication: it is a cooperative game.

The authors demonstrated that explicit, literal language (\(S_1\)) is the best tool for teaching. Teachers should not assume their students can perform complex pragmatic inference immediately; they should be verbose and descriptive.

However, on the flip side, students (listeners) benefit immensely from modeling the speaker. By integrating pragmatic reasoning into the learning process, agents become more robust. They can learn faster and handle concise, ambiguous messages that would stump a literal agent.

As we build AI systems that interact with humans—who are naturally pragmatic, often concise, and context-dependent—these findings are vital. An AI that assumes humans are always literal will likely fail. But an AI that models the human as a pragmatic partner can “read between the lines,” resulting in smoother, more human-like communication.