Language is rarely a private affair. When a sergeant shouts a command to a squad, or an advertiser broadcasts a commercial to millions, a single message must be understood by multiple people simultaneously. Yet, in the world of Artificial Intelligence and “Emergent Communication,” we have mostly studied language as a one-on-one game: one speaker, one listener.

If we want AI agents to develop languages that look and behave like human language, we need to replicate the pressures under which human language evolved. One of the most critical properties of human language is compositionality—the ability to combine simple units (like words) according to rules (grammar) to create complex meanings. “Blue square” is compositional; a unique, unrelated sound for every possible colored shape is not.

In this post, we take a deep dive into the paper “One-to-Many Communication and Compositionality in Emergent Communication.” The researchers explore a fascinating hypothesis: Does the pressure of speaking to a crowd force a language to become more structured?

The answer is yes, but not in the way you might think. It turns out that how the crowd listens matters just as much as the size of the crowd.

The Problem: The “Holistic” Trap

In typical AI experiments (often called signaling games or Lewis games), two neural networks play a game. The Speaker sees an object (like a red triangle) and sends a message. The Listener hears the message and tries to pick the right object from a lineup. If they succeed, they both get a reward.

The problem? AI agents often cheat the system. They develop “holistic” languages where a random noise like “glorp” means “red triangle,” and “bleep” means “blue square.” There is no structure. You can’t figure out which part of the message means “red.” This makes the language brittle and hard for new agents to learn.
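To make the contrast concrete, here is a toy illustration of a holistic codebook versus a compositional one. This is my own example, not code from the paper:

```python
# Toy illustration: two ways a speaker could name colored shapes.

# Holistic: every (color, shape) pair gets an arbitrary, unrelated code.
# Knowing "glorp" tells you nothing about any other message.
holistic = {
    ("red", "triangle"): "glorp",
    ("red", "square"): "zib",
    ("blue", "triangle"): "mawk",
    ("blue", "square"): "bleep",
}

# Compositional: one symbol per attribute value, combined by a rule.
color_word = {"red": "ka", "blue": "mo"}
shape_word = {"triangle": "ti", "square": "su"}

def compositional(color, shape):
    """Concatenate the color word and the shape word."""
    return color_word[color] + shape_word[shape]

print(holistic[("red", "triangle")])     # glorp (must be memorized)
print(compositional("red", "triangle"))  # kati  (predictable from its parts)
```

A new learner of the compositional code only needs four word meanings to name all four objects; the holistic code requires memorizing every pairing separately, and the gap widens fast as attributes multiply.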

The researchers propose that moving from a One-to-One model to a One-to-Many model introduces specific environmental pressures that might force agents to stop memorizing “glorps” and start building a grammar.

The Setup: Broadcasting to a Crowd

To test this, the authors set up a variant of the standard reconstruction game (sketched in code after the list below).

  1. The Speaker: Observes an object with attributes (like Color, Shape, Style). It produces a sequence of symbols (a message).
  2. The Listeners: A group of multiple agents who receive the same broadcasted message.
  3. The Goal: The listeners must interpret the message to identify attributes of the object.
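As a rough picture of how the broadcast step might look in code, here is a minimal PyTorch sketch. The Speaker and Listener classes below are hypothetical stand-ins for illustration, not the paper’s architecture:

```python
import torch
import torch.nn as nn

class Speaker(nn.Module):
    """Encodes a one-hot attribute vector into a discrete message."""
    def __init__(self, n_attrs, n_values, vocab, msg_len):
        super().__init__()
        self.net = nn.Linear(n_attrs * n_values, msg_len * vocab)
        self.msg_len, self.vocab = msg_len, vocab

    def forward(self, obj_onehot):
        logits = self.net(obj_onehot).view(-1, self.msg_len, self.vocab)
        return logits.argmax(-1)  # greedy symbols; training would sample

class Listener(nn.Module):
    """Decodes the broadcast message back into attribute predictions."""
    def __init__(self, vocab, msg_len, n_attrs, n_values):
        super().__init__()
        self.embed = nn.Embedding(vocab, 16)
        self.head = nn.Linear(16 * msg_len, n_attrs * n_values)

    def forward(self, message):
        h = self.embed(message).flatten(1)
        return self.head(h)  # per-attribute value logits

# One broadcast: every listener receives the *same* message.
speaker = Speaker(n_attrs=3, n_values=4, vocab=10, msg_len=5)
listeners = [Listener(10, 5, 3, 4) for _ in range(10)]

obj = torch.zeros(1, 12); obj[0, 0] = obj[0, 5] = obj[0, 9] = 1.0
message = speaker(obj)
predictions = [listener(message) for listener in listeners]
```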

The researchers didn’t just add more listeners; they manipulated the social dynamics of the crowd in two specific ways:

  1. Interests: Do the listeners care about the whole message, or just parts of it?
  2. Coordination: Do the listeners need to agree with each other to succeed?

Does Size Matter? (The Naive Approach)

First, let’s look at the baseline. What happens if we simply take the standard game and add more listeners who all want to know everything about the object?

Graph showing that naive one-to-many communication offers no significant improvement over one-to-one communication.

As shown in Figure 1, the results are underwhelming. The gray dashed line (One-to-Many) tracks closely with the blue line (One-to-One). Simply broadcasting a message to 10 people instead of 1 doesn’t magically create grammar. The speaker just learns to send holistic codes that 10 people memorize instead of one.

To get compositionality, we need to apply pressure.

Pressure 1: Listeners with Different Interests

In the real world, an advertisement for a car is seen by many people. Some care about the price; others care about the horsepower. The message must convey all that info, but different listeners will “tune out” the parts they don’t care about.

The researchers modeled this by creating groups of listeners with distinct interests.

Diagram showing a speaker broadcasting to three listeners, each interested in different attributes like color or shape.

As illustrated in Figure 7, Listener 1 might only be graded on guessing the Color, while Listener 2 is graded on the Shape. The Speaker, however, must satisfy everyone.

The “Readability” Hypothesis

The hypothesis here is Readability Pressure. If Listener 1 only cares about Color, they prefer a message where the “Color” information is easy to spot and separate from the “Shape” information. They don’t want to decode the whole complex message just to find the one bit they need.
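One simple way to implement partial interests is to mask each listener’s loss down to the attributes it is graded on. The following is my own illustrative formulation, not necessarily the paper’s exact objective:

```python
import torch
import torch.nn.functional as F

def listener_loss(logits, target, interest, n_attrs=3, n_values=4):
    """Cross-entropy over only the attributes this listener cares about.

    logits:   (batch, n_attrs * n_values) raw scores from the listener
    target:   (batch, n_attrs) ground-truth value index per attribute
    interest: set of attribute indices this listener is graded on
    """
    logits = logits.view(-1, n_attrs, n_values)
    losses = [F.cross_entropy(logits[:, a], target[:, a]) for a in interest]
    return torch.stack(losses).mean()

# Listener 1 is only graded on Color (attribute 0); Listener 2 on Shape (1).
logits = torch.randn(8, 12)
target = torch.randint(0, 4, (8, 3))
color_loss = listener_loss(logits, target, interest={0})
shape_loss = listener_loss(logits, target, interest={1})
```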

The results confirm this hypothesis brilliantly.

Charts showing that partial and mixed interests lead to higher topographic similarity and specific types of disentanglement compared to full interest.

In Figure 2(a), look at the “Topographic Similarity” scores. This metric asks how well distances between meanings correlate with distances between the corresponding messages; higher means more compositional/structured. The Partial-Interest groups (Red bars) significantly outperform the Full-Interest groups (Gray bars).
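Topographic similarity is a standard metric in this literature, usually computed as the Spearman correlation between pairwise meaning distances and pairwise message distances. Here is a minimal sketch, assuming Hamming distance for meanings and edit distance for messages (common choices, though the paper’s exact distance functions may differ):

```python
from itertools import combinations
from scipy.stats import spearmanr

def hamming(a, b):
    """Number of attribute positions where two meanings differ."""
    return sum(x != y for x, y in zip(a, b))

def edit_distance(s, t):
    """Classic Levenshtein distance between two messages."""
    dp = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, ct in enumerate(t, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (cs != ct))
    return dp[-1]

def topographic_similarity(meanings, messages):
    """Spearman correlation between meaning and message distances."""
    pairs = list(combinations(range(len(meanings)), 2))
    d_meaning = [hamming(meanings[i], meanings[j]) for i, j in pairs]
    d_message = [edit_distance(messages[i], messages[j]) for i, j in pairs]
    return spearmanr(d_meaning, d_message).correlation

# A perfectly compositional toy language scores 1.0: similar objects
# get similar messages, because words map one-to-one onto attributes.
meanings = [("red", "tri"), ("red", "sq"), ("blue", "tri"), ("blue", "sq")]
messages = ["kati", "kasu", "moti", "mosu"]
print(topographic_similarity(meanings, messages))  # 1.0
```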

The “Bag-of-Symbols” Strategy

What kind of language emerges here? You might expect a positional grammar (e.g., “Word 1 is color, Word 2 is shape”). But the agents found a different solution.

Look at Figure 2(c), labeled “Bag-of-Symbols Disent.” This metric shoots up for the Partial-Interest groups. This implies the agents developed a counting-based language.

  • Example: “A” means Red. “B” means Triangle.
  • Message: “AAB” might mean “Very Red Triangle” or just “Red Triangle.”
  • The position doesn’t matter much. If I’m the Color Listener, I just scan the message for “A”s and ignore the “B”s (see the sketch below).
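A toy decoder for such a bag-of-symbols code just counts occurrences and ignores order entirely. The symbol assignments here are hypothetical, for illustration:

```python
from collections import Counter

# Hypothetical bag-of-symbols code: each symbol "votes" for one value.
symbol_meaning = {"A": ("color", "red"), "B": ("shape", "triangle"),
                  "C": ("color", "blue"), "D": ("shape", "square")}

def read_attribute(message, attribute):
    """A specialist listener: count only the symbols for its attribute."""
    votes = Counter(symbol_meaning[s][1] for s in message
                    if symbol_meaning[s][0] == attribute)
    return votes.most_common(1)[0][0] if votes else None

# Order doesn't matter: "AAB" and "ABA" read the same.
print(read_attribute("AAB", "color"))  # red
print(read_attribute("ABA", "shape"))  # triangle
```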

This structure makes the language incredibly easy to learn for new agents.

Line graphs showing that languages formed under partial interests are learned much faster by new agents.

Figure 3 shows the training curves of new listeners trying to learn an existing language. The Red line (Partial-Interest language) shoots up to high accuracy much faster than the Gray line. By forcing the language to be “readable” for specialists, the Speaker created a language that is easier for everyone to learn.

Pressure 2: The Need for Coordination

The second major experiment introduces a different kind of pressure: Coordination.

Imagine a squad of soldiers. It’s not enough for most of them to understand the command. If one soldier turns left while the rest turn right, the mission fails.

The researchers modeled this by grouping listeners. A reward is only given if every listener in the group predicts the object correctly.

Diagram showing listeners grouped into squads, where a single failure by one listener causes the whole group to fail.

As shown in Figure 8, Group 2 fails completely because Listener 4 made a mistake, even though Listener 3 got it right. This creates a high-stakes environment. The Speaker cannot afford a message that is “mostly” clear. It must be unambiguous to everyone simultaneously.
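In reward terms the rule is all-or-nothing: the group is rewarded only if every member reconstructs the object correctly. A minimal sketch of that rule (my formulation, not the paper’s code):

```python
def group_reward(predictions, target):
    """All-or-nothing: reward 1.0 only if *every* listener in the group
    reconstructs the target object exactly, else 0.0."""
    return 1.0 if all(pred == target for pred in predictions) else 0.0

target = ("red", "triangle")
group_1 = [("red", "triangle"), ("red", "triangle")]
group_2 = [("red", "triangle"), ("blue", "triangle")]  # one listener errs

print(group_reward(group_1, target))  # 1.0 - everyone is right
print(group_reward(group_2, target))  # 0.0 - one mistake sinks the squad
```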

The “Structure” Hypothesis

The researchers argued that a holistic language (random sounds) is risky because different listeners might memorize different associations. A compositional language (structured rules) is safer because it relies on shared logic.

Graphs showing that coordination pressure significantly increases topographic similarity and positional disentanglement.

Figure 4 reveals the impact of this pressure.

  • Graph (a): As the group size increases (more coordination required), Topographic Similarity (structure) rises.
  • Graph (b): Look at Positional Disentanglement. This is the reverse of the previous experiment, where the Bag-of-Symbols metric dominated: under coordination, it is the positional metric that rises sharply.

The “Positional” Strategy

Unlike the “Different Interests” group, the “Coordination” group prefers Positional Structure. This is much closer to human syntax, where word order carries meaning (as in English).

  • Structure: The first symbol always refers to Color. The second symbol always refers to Shape.
  • Why? Because if everyone knows “Slot 1 = Color,” there is less room for misinterpretation across the group. It is a strict protocol, essential for coordinated action (see the sketch below).
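Contrast this with the bag-of-symbols scan from the previous experiment: in a positional code, decoding is just indexing into fixed slots. Again a toy illustration with hypothetical symbol assignments:

```python
# Hypothetical positional code: slot 0 is always Color, slot 1 always Shape.
color_by_symbol = {"k": "red", "m": "blue"}
shape_by_symbol = {"t": "triangle", "s": "square"}

def decode_positional(message):
    """Every listener decodes identically: meaning is fixed by position."""
    return {"color": color_by_symbol[message[0]],
            "shape": shape_by_symbol[message[1]]}

print(decode_positional("kt"))  # {'color': 'red', 'shape': 'triangle'}
print(decode_positional("ms"))  # {'color': 'blue', 'shape': 'square'}
```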

When Pressures Collide

What happens if we combine these pressures? We have listeners with different interests, and we force them to coordinate.

The researchers found that these two pressures can actually fight each other.

  • Different Interests push for a “Bag-of-Symbols” (scan for what you need).
  • Coordination pushes for “Positional” (strict ordering).

Charts showing that increasing group size can actually slightly decrease topographic similarity when listeners have mixed interests.

Figure 5 shows that when you add coordination pressure (Group Size) to agents with Mixed Interests (Blue line), the compositionality actually dips slightly. The agents are torn between a flexible scanning strategy and a rigid positional strategy.

Advanced Scenario: Iterated Learning

A classic theory in linguistics is Iterated Learning—the idea that language evolves as it is passed down through generations (like a game of Telephone). Usually, replacing the entire population at once (Simultaneous Reset) creates better languages than replacing one agent at a time (Staggered Reset), because a total wipe forces the new generation to reinvent a structured system.

However, real populations change gradually (Staggered).
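The difference between the two reset schedules is easy to picture in code. A minimal sketch, assuming a population of listener agents and a hypothetical make_fresh_listener() factory:

```python
import random

def simultaneous_reset(listeners, make_fresh_listener):
    """Replace the whole generation at once: total wipe, then relearn."""
    return [make_fresh_listener() for _ in listeners]

def staggered_reset(listeners, make_fresh_listener):
    """Replace a single random agent: the population drifts gradually,
    and the newcomer learns the language from its trained peers."""
    newcomers = list(listeners)
    newcomers[random.randrange(len(newcomers))] = make_fresh_listener()
    return newcomers
```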

Table showing that coordination pressure helps maintain compositionality even in staggered reset scenarios.

Table 1 shows a fascinating result. In a normal single-listener setup, the “Staggered” approach (gradual replacement) performs poorly (TopSim 29.52). But when Coordination is added, the Staggered approach performs almost as well as the simultaneous reset (TopSim 34.25).

Takeaway: If a society needs to coordinate, their language remains robust and structured even as the population gradually changes over time.

From Symbols to Pixels: Real-World Generalization

Finally, the researchers stepped away from simple abstract attributes and tested their theories on pixel-based images (3dshapes dataset).

A sample image from the 3dshapes dataset showing a green cube in a room.

The task is much harder: the agents must look at raw images (like Figure 9) and communicate about color, shape, and orientation.

Did the findings hold up?

Charts confirming that the trends observed in simple datasets hold true for complex visual data like 3dshapes.

Yes. Figure 6 confirms that even with complex visual inputs, listeners with partial interests (Red bars) drive the development of highly compositional languages compared to full-interest listeners (Gray bars).

They even tested this on ImageNet (a massive database of real-world photos).

Table showing results on the ImageNet discrimination game, where larger group sizes lead to better generalization.

Table 2 shows that as group size increases (coordination pressure), the agents’ ability to generalize to new, unseen images (Test OOD) improves.

Conclusion

This paper provides a compelling look into the social origins of grammar. It suggests that our language isn’t just structured because our brains are wired for it, but because of the specific ways we communicate.

  1. Readability: We speak to people who only care about parts of what we say. This drives us to use symbols that can be easily scanned (like keywords).
  2. Coordination: We speak to groups that must act in unison. This drives us to use strict word order (syntax) to ensure everyone interprets the message exactly the same way.

For AI researchers, the lesson is clear: If you want agents to develop intelligent, structured languages, don’t just put them in a private room. Make them broadcast their message to a diverse, coordinated crowd. The pressure of the audience is what shapes the speech.