Imagine you are helping a friend find their lost keys. You are standing in the doorway, and they are behind the kitchen island. You see the keys on the counter, but from their angle, the keys are hidden behind a fruit bowl. If you simply say, “They’re on the counter,” your friend might not spot them. If you instead say, “They’re to your left, behind the apples,” they find them immediately.
This everyday interaction requires a complex cognitive ability known as perspective-taking (or Theory of Mind). You aren’t just describing what you see; you are modeling what your friend sees and adjusting your language accordingly.
While humans do this intuitively, Artificial Intelligence struggles immensely with it. Most Vision-Language Models (VLMs) operate on a single, static image. They describe “what is in the image,” not “where the object is relative to you, the viewer.”
In this post, we are doing a deep dive into the paper “Grounding Language in Multi-Perspective Referential Communication” by researchers at UC Berkeley. This work introduces a fascinating new benchmark for embodied AI and proposes a method that teaches open-source models to “step into the shoes” of another agent, ultimately outperforming even proprietary giants like GPT-4o on this specific task.
The Problem: The “Egocentric” AI
Current embodied agents (robots or virtual assistants) must reason about the space they occupy. However, there is a disconnect: when a robot speaks to a human, the two often have different fields of view.
The paper formalizes this as a Referential Communication Game.
- The Speaker: Sees a target object (e.g., a specific blue ball) and must describe it.
- The Listener: Sees the scene from a different angle where color information might be missing (e.g., all balls look red) and must identify the target based only on the Speaker’s description.
If the Speaker ignores the Listener’s perspective, the communication fails.

As shown in Figure 1, the Speaker (top view) sees the target clearly. The Listener (bottom view) sees a different arrangement. The Speaker must realize that saying “the ball on the left” is ambiguous or wrong for the Listener.
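To make the setup concrete, here is a minimal sketch of one round of this game in Python. The `speaker.describe` and `listener.choose` interfaces are hypothetical stand-ins for whichever models play those roles; they are not an API from the paper’s codebase.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Episode:
    speaker_view: Any         # the Speaker's observation o_s
    listener_view: Any        # the Listener's observation o_l (a different viewpoint)
    target_id: int            # the true target t
    candidate_ids: List[int]  # candidate objects the Listener can choose from

def play_round(speaker, listener, ep: Episode) -> bool:
    """One round of the referential game: success means the Listener's pick matches the target."""
    expression = speaker.describe(ep.speaker_view, ep.target_id)             # x
    guess = listener.choose(ep.listener_view, ep.candidate_ids, expression)  # t-hat
    return guess == ep.target_id
```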
The Environment: Building a Testbed for Theory of Mind
To study this, the researchers created a platform that generates photorealistic 3D scenes using ScanNet++ and habitat-sim. These aren’t just flat images; they are physically simulated environments in which objects (spheres) are dropped into the scene and settled by the physics engine so that they come to rest naturally on surfaces.
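For readers curious what this scene construction roughly looks like in code, below is a minimal habitat-sim sketch that drops a sphere into a scanned scene and lets physics settle it. The scene path and the primitive-template lookup are placeholders, and API details vary across habitat-sim versions; this is a rough illustration, not the authors’ pipeline.

```python
import habitat_sim
import magnum as mn

# Configure a physics-enabled simulator on a scanned mesh (path is a placeholder).
backend_cfg = habitat_sim.SimulatorConfiguration()
backend_cfg.scene_id = "data/scannetpp/scene0001.glb"
backend_cfg.enable_physics = True
sim = habitat_sim.Simulator(
    habitat_sim.Configuration(backend_cfg, [habitat_sim.agent.AgentConfiguration()])
)

# Spawn a sphere primitive above a surface, then step physics so it falls and comes to rest.
template_mgr = sim.get_object_template_manager()
obj_mgr = sim.get_rigid_object_manager()
sphere_handle = template_mgr.get_template_handles("Sphere")[0]  # primitive handle names vary by version
ball = obj_mgr.add_object_by_template_handle(sphere_handle)
ball.translation = mn.Vector3(1.0, 1.5, 0.5)  # drop point above a table/counter

for _ in range(180):              # ~3 seconds of simulated time at 60 Hz
    sim.step_physics(1.0 / 60.0)
```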
Defining Communicative Success
The core metric here is binary: Communicative Success. Did the Listener pick the right object?
Mathematically, the researchers define this interaction as a probability game. The Speaker model (\(p_s\)) generates an expression \(x\) based on its observation (\(o_s\)) and the target (\(t\)). The Listener model (\(p_l\)) then picks a target \(\hat{t}\) based on its own observation (\(o_l\)) and the expression \(x\).

Success is achieved only when the listener’s choice \(\hat{t}\) equals the actual target \(t\).
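Written compactly (a light formalization of the quantities just defined; the paper’s exact notation may differ):

\[
x \sim p_s(x \mid o_s, t), \qquad \hat{t} \sim p_l(\hat{t} \mid o_l, x), \qquad \text{success} \iff \hat{t} = t.
\]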
Controlled Difficulty: The Adversarial Setup
One of the most innovative aspects of this paper is how they generate scenes. Randomly dropping objects into a room is easy. But in the real world, “hard” cases—where objects are clustered together or partially occluded—are where perspective-taking matters most.
The authors introduced two difficulty controls:
- Relative Orientation: Varying the angle between the Speaker and Listener from \(0^\circ\) (standing next to each other) to \(180^\circ\) (facing each other).
- Adversarial Placement: They trained a separate “Adversary” model to place objects in the worst possible locations to maximize communication failure.

In Figure 2, you can see the difference. The top row shows random placement—the balls are scattered. The bottom row shows adversarial placement—the balls are clustered near landmarks, forcing the Speaker to be extremely precise.
The adversarial placement policy (\(R\)) is trained to find object configurations that confuse a baseline speaker/listener pair (like GPT-4o), mathematically represented as maximizing the failure rate:

\[
R^{*} = \arg\max_{R} \; \mathbb{E}\big[\, 1 - \mathbb{1}[\hat{t} = t] \,\big],
\]

where the expectation is over scenes produced by \(R\) and over the expression \(x\) and guess \(\hat{t}\) generated by the baseline pair.
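One simple way to approximate this objective in code is rejection sampling over candidate placements, keeping whichever configuration the baseline pair handles worst. This is a stand-in sketch, not the learned placement policy the authors train; `scene.sample_random_placement` and the speaker/listener interfaces are hypothetical.

```python
def adversarial_placement(scene, speaker, listener, n_candidates: int = 32, n_objects: int = 5):
    """Search candidate object configurations and keep the one that most often
    breaks communication for a baseline speaker/listener pair (a crude proxy for R)."""
    worst_config, worst_rate = None, float("inf")
    for _ in range(n_candidates):
        config = scene.sample_random_placement(n_objects)  # hypothetical scene API
        successes = 0
        for target in config.objects:
            expression = speaker.describe(config.speaker_view, target)
            guess = listener.choose(config.listener_view, config.objects, expression)
            successes += int(guess == target)
        rate = successes / len(config.objects)
        if rate < worst_rate:  # a lower success rate means a harder scene
            worst_config, worst_rate = config, rate
    return worst_config
```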
Benchmarking: Humans vs. Machines
With the environment built, the researchers collected a dataset of 2,970 human-written referring expressions and pitted them against state-of-the-art models.
They tested:
- General Purpose Models: GPT-4o and LLaVA-1.5.
- Fine-Grained Models: Ferret and Groma (designed for referring to specific image regions).
- Modular Systems: ViperGPT (which writes code to solve visual tasks).
The results were stark.

Table 1 reveals a significant gap:
- Human-to-Human success: ~87.6%.
- GPT-4o (Speaker) to Human (Listener): 64.9%.
- LLaVA-1.5 (Speaker) to Human (Listener): 55.7%.
When models talk to each other (a model speaker paired with a model listener), performance drops even further. The “Adversarial” scenes (Adv.) consistently stump the models more than the random scenes (Ran.), confirming that the difficulty control works.
Why do models fail?
The researchers analyzed the language used by humans versus models. They categorized the strategies into:
- Object-centric: “Next to the lamp.”
- Listener-centric: “On your left.”
- Speaker-centric: “In front of me.”

As seen in the top chart above, humans (far right bar) use Listener’s View strategies significantly more often than GPT-4o or LLaVA. LLaVA almost never references the listener’s perspective.
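To make these strategy categories concrete, here is a toy keyword-based tagger. It is purely illustrative and is not how the paper categorized expressions.

```python
def tag_strategy(expression: str) -> set:
    """Toy heuristic: which frames of reference does a referring expression use?"""
    text = expression.lower()
    tags = set()
    if any(p in text for p in ("your left", "your right", "in front of you", "closest to you")):
        tags.add("listener-centric")
    if any(p in text for p in ("my left", "my right", "in front of me", "from where i stand")):
        tags.add("speaker-centric")
    if any(p in text for p in ("next to the", "behind the", "near the", "on the table")):
        tags.add("object-centric")
    return tags or {"other"}

print(tag_strategy("It's to your left, behind the apples."))  # listener-centric + object-centric
```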
Furthermore, as the overlap between the Speaker’s and Listener’s fields of view (FOV) decreases (meaning they see increasingly different parts of the scene), humans adapt their strategy.

In the bottom chart above, notice the Human bars. As overlap decreases (moving left), humans shift away from “Other Candidates” (comparing the ball to other balls) and toward “Listener’s View” or “Speaker’s View.” Models fail to adapt their strategy dynamically based on how much of the scene is shared.
Error Analysis
When models mess up, how exactly do they fail?

In Figure 5, the LLaVA speaker says “The ball is near a lamp on a table.” While factually true from the Speaker’s view (left), from the Listener’s view (right), the “lamp” might be obscured or the perspective makes the spatial relation confusing. This leads to Out-of-Context Reference errors.

The error breakdown (above) confirms this. LLaVA (left bar) has a massive chunk of errors (Pink/Red) related to Out-of-Context References. It describes things the listener simply cannot see or understand.
The Solution: Learning from Communicative Success
So, we have models that are bad at perspective-taking. How do we fix them?
Standard training involves Supervised Fine-Tuning (SFT)—showing the model a picture and a “correct” caption. But here, there isn’t one correct caption; the “correctness” depends on whether the listener understands it.
The researchers instead propose Reinforcement Learning (RL), built on a simple idea: empirical communicative success is the best teacher.
The Method: Pairwise Preference Learning (PPL)
They took the weaker open-source model, LLaVA-1.5, and fine-tuned it. They didn’t just show it human examples (Imitation Learning). Instead, they let the model generate a description, had a Listener (either a human or another model) guess the target, and used the result to update the model.
They utilized a technique called Pairwise Preference Learning (PPL). If the model generated a description \(x\) intended for target \(t\), but the listener guessed \(\hat{t}\) (where \(\hat{t} \neq t\)), this is a failure. However, this failure contains a signal: the description \(x\) likely fits the wrong object \(\hat{t}\) better than the intended object \(t\).
The reward function maximizes the probability of the description given the listener’s chosen object and minimizes it for the intended object (in failure cases). This pushes the model to stop writing descriptions that accidentally sound like the wrong object.
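Below is a minimal sketch of that pairwise objective in PyTorch, written for a text-only HuggingFace-style causal LM for brevity (LLaVA additionally conditions on the image). The helper names, the prompt handling, and the reference-free logistic form are illustrative assumptions, not the authors’ exact implementation.

```python
import torch
import torch.nn.functional as F

def completion_logprob(model, tokenizer, prompt: str, completion: str) -> torch.Tensor:
    """Sum of token log-probabilities of `completion` given `prompt` under the model."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    logits = model(full_ids).logits                     # (1, T, vocab)
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)    # position i predicts token i + 1
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_ids.shape[1] - 1:].sum()  # score only the completion tokens

def failure_preference_loss(model, tokenizer, expression: str,
                            prompt_listener_choice: str, prompt_intended_target: str,
                            beta: float = 0.1) -> torch.Tensor:
    """On a failed round, push p(expression | listener's chosen object) up and
    p(expression | intended target) down via a logistic pairwise preference loss."""
    lp_chosen = completion_logprob(model, tokenizer, prompt_listener_choice, expression)
    lp_intended = completion_logprob(model, tokenizer, prompt_intended_target, expression)
    return -F.logsigmoid(beta * (lp_chosen - lp_intended))
```

Here `beta` controls how sharply the model is pushed apart on the two conditionings; the two prompts differ only in which object they ask the model to describe.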

Does it work?
The results are impressive. They fine-tuned LLaVA-1.5 on just 200 examples—a tiny amount of data.

Looking at Table 2:
- Pre-trained LLaVA: 58.9% success.
- GPT-4o: 67.1% success.
- LLaVA + PPL (Human Feedback): 69.3% success.
By using this preference-based learning signal, the open-weight LLaVA-1.5 model overtook the proprietary GPT-4o model. It learned to be more concise (average length dropped from 61 tokens to 15.6, matching human brevity) and more effective.
The model learned to stop rambling about irrelevant details and focus on the distinct features that distinguish the target from the listener’s perspective.
Conclusion and Implications
The paper “Grounding Language in Multi-Perspective Referential Communication” highlights a critical gap in modern AI: the ability to understand that what I see is not what you see.
Through a rigorous setup involving adversarial scene generation and human benchmarking, the authors proved that current SOTA models lack this “Theory of Mind.” However, their proposed solution offers a hopeful path forward. By moving away from purely static supervised learning and embracing interaction-based learning—where the model is rewarded for being understood rather than just being correct—we can drastically improve communicative capabilities.
For students and researchers in AI, the takeaways are clear:
- Embodiment matters: Language doesn’t exist in a vacuum; it is grounded in physical space and perspective.
- Feedback loops are powerful: A listener’s confusion is a stronger training signal than a ground-truth label.
- Open models can win: With clever training objectives (like PPL), smaller, open models can punch above their weight class, beating giants like GPT-4o on specialized reasoning tasks.
This work brings us one step closer to robots that can actually find those lost keys when we tell them, “No, not that one—the one behind the apples!”