Introduction
Imagine you are standing in a crowded museum. You point to a distant exhibit and say to your friend, “Look at that!” Your friend instantly turns their head, follows the line of your finger, identifies the specific object among dozens of others, and understands exactly what you mean. This interaction, which feels instantaneous and effortless to humans, is a masterpiece of multimodal processing. It involves integrating visual data, spatial reasoning, and language into a coherent understanding of the world.
In the rapidly evolving world of Artificial Intelligence, Multimodal Large Language Models (MLLMs) like GPT-4o and Gemini have begun to demonstrate impressive capabilities. They can describe photos, answer questions about videos, and hold fluent conversations. But do they truly see the world the way we do? Specifically, can they understand the subtle, physical language of hand gestures?
A recent research paper, “Do Multimodal Large Language Models Truly See What We Point At?”, dives deep into this question. The researchers investigated whether state-of-the-art AI models can distinguish between different types of hand gestures—specifically focusing on the difference between pointing at something (indexical gestures) versus describing something with hands (iconic gestures). Their findings reveal a fascinating “blind spot” in modern AI: while models are getting better at talking, they are still struggling to ground their understanding in the physical world.
Background: The Language of Hands
To understand the challenge MLLMs face, we first need to break down how humans use hands to communicate. In linguistics and cognitive science, gestures are often categorized based on how they convey meaning. This study focuses on three primary types:
- Indexical Gestures: These are “pointing” gestures. They rely entirely on the physical environment. If I point and say “that,” the meaning depends 100% on what is located at the end of my finger. Without “grounding” the gesture in the physical world (seeing the referent), the gesture is meaningless.
- Iconic Gestures: These gestures depict imagery. For example, drawing a circle in the air to represent a round object, or moving your hand in a wave motion to describe a roller coaster. These are often understandable through context and shape, even without seeing a specific physical object.
- Symbolic Gestures: These are culturally defined signs, like a “thumbs-up” meaning approval or a wave meaning hello. Their meaning is fixed by convention.

As shown in Figure 1, the differences are distinct. The left panel shows an Indexical gesture where a man points off-camera. To understand him, you must know where he is pointing. The middle panel shows an Iconic gesture; the man is miming looking through a telescope. You can likely guess the meaning just by watching him. The right panel shows a Symbolic gesture, emphasizing a concept through a conventional hand pose.
The researchers hypothesized that MLLMs would struggle specifically with Indexical gestures. Why? Because these models are trained primarily on vast amounts of text and static images. They may lack the “embodied” experience required to understand that a finger is a vector pointing to a specific coordinate in 3D space.
The Experiment: Testing AI in the Wild
To test this hypothesis, the authors did not rely on staged photos. They used the Miraikan Science Communication (SC) Corpus, a dataset of real-world videos capturing conversations between science communicators and visitors at a Japanese science museum. This setting is well suited to the study because museum conversations are naturally full of pointing (“Look at this robot”) and descriptive gestures (“It spins like this”).
Dataset Construction
The researchers manually annotated 925 gesture instances from the corpus, labeling them as Indexical, Iconic, Symbolic, Mixed, or Others.

Table 1 highlights the distribution of these gestures. Notice that Indexical gestures are the most common (33.4%), reflecting how crucial pointing is in real-world environments like museums. Iconic gestures are also frequent (18.3%). The gestures last roughly 7.4 seconds on average, long enough that differences in model performance are unlikely to come down to some gestures simply being too brief to catch.
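To make the annotation scheme concrete, here is a minimal sketch of how a single labeled instance might be represented in code. The field names (video ID, onset/offset times, dialogue context, reference description) are illustrative assumptions, not the corpus’s actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class GestureType(Enum):
    INDEXICAL = "indexical"  # pointing; meaning depends on the referent
    ICONIC = "iconic"        # depicts shape or motion
    SYMBOLIC = "symbolic"    # conventional sign, e.g., a thumbs-up
    MIXED = "mixed"
    OTHER = "other"


@dataclass
class GestureInstance:
    """One annotated gesture from the video corpus (illustrative schema only)."""
    video_id: str
    start_sec: float            # gesture onset within the clip
    end_sec: float              # gesture offset
    gesture_type: GestureType
    dialogue_context: str       # transcript leading up to the gesture
    reference_description: str  # human-written ground-truth explanation

    @property
    def duration(self) -> float:
        return self.end_sec - self.start_sec
```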
The Task
The researchers tested several leading models, including GPT-4o, Gemini 1.5 Pro, Qwen2.5-VL, and LLaVA-NeXT-Video.
The setup was straightforward but rigorous:
- Input: The model is given a video clip and the dialogue transcript leading up to a gesture.
- Prompt: The model is asked to explain the meaning and intent of the gesture performed at the end of the scene.
- Evaluation: The model’s generated description is compared to a human-written “ground truth” description. Another LLM (GPT-4o-mini) acts as a judge, scoring the accuracy from 0.0 to 1.0.

Figure 3 provides an example of the prompt used (translated from Japanese). The model sees the conversation history and frames from the video (visualized as “Cam A” through “Cam F”) and must output an explanation.
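To picture the evaluation loop end to end, here is a hedged sketch of how such a pipeline could be wired up. It reuses the illustrative `GestureInstance` from above; `query_mllm`, `judge`, and `load_frames` are hypothetical stand-ins for the model under test, the GPT-4o-mini judge, and a frame extractor, and the prompt wording is paraphrased rather than the paper’s exact Japanese prompt.

```python
from statistics import mean


def evaluate_model(instances, query_mllm, judge, load_frames):
    """Score a model's gesture explanations against human references.

    query_mllm(frames, prompt) -> str      # hypothetical wrapper around the MLLM under test
    judge(reference, candidate) -> float   # hypothetical GPT-4o-mini judge, returns 0.0-1.0
    load_frames(video_id, start, end)      # hypothetical frame extractor
    """
    scores = []
    for inst in instances:
        prompt = (
            f"Conversation so far:\n{inst.dialogue_context}\n\n"
            "Explain the meaning and intent of the gesture performed "
            "at the end of this scene."
        )
        frames = load_frames(inst.video_id, inst.start_sec, inst.end_sec)
        candidate = query_mllm(frames, prompt)
        scores.append(judge(inst.reference_description, candidate))
    return mean(scores)  # average accuracy over all gesture instances
```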
Core Results: The “Pointing” Gap
The results of the study confirmed the researchers’ hypothesis: MLLMs have a significant weakness when it comes to indexical grounding.

As visualized in Figure 2, look at the performance of GPT-4o (the first cluster on the left): it scores 0.50 on Iconic gestures (orange bar) but drops to 0.47 on Indexical gestures (red bar). The same trend appears across most models. The raw gap may look small, but it reflects a consistent, systematic failure to resolve what is being pointed at.
The only exception appears to be LLaVA-NeXT-Video (far right), which performed poorly across the board, likely due to weaker overall capabilities compared to the commercial giants.
Why Does This Happen?
The researchers argue that MLLMs are not actually “tracking” the pointing finger to a target. Instead, they are acting as sophisticated guessing machines. When a model sees an Iconic gesture (like miming a telescope) and reads text about “looking at stars,” it can easily infer the gesture’s meaning using common sense and linguistic associations.
However, Indexical gestures break this reliance on text. If a user says “Look at that,” the text contains zero information about the object. The information is entirely visual and spatial. The model must trace the vector of the arm to the object. The lower scores suggest MLLMs fail to do this reliably, defaulting instead to guessing based on the general topic of conversation.
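Computationally, “tracing the vector of the arm” boils down to a simple geometric operation, which the toy sketch below makes explicit: cast a ray from the wrist through the fingertip and find the candidate object closest to that ray. The hand keypoints and object positions are assumed inputs; this is an illustration of what indexical grounding requires, not anything the models are claimed to implement.

```python
import numpy as np


def resolve_pointing_target(wrist, fingertip, object_centers):
    """Return the index of the object lying closest to the pointing ray.

    wrist, fingertip: hand keypoints (2D or 3D coordinates).
    object_centers:   candidate object positions, shape (N, D).
    A toy illustration of indexical grounding, not a real perception system.
    """
    wrist = np.asarray(wrist, dtype=float)
    direction = np.asarray(fingertip, dtype=float) - wrist
    direction /= np.linalg.norm(direction)  # unit vector of the pointing ray

    best_idx, best_dist = -1, float("inf")
    for i, center in enumerate(np.asarray(object_centers, dtype=float)):
        to_obj = center - wrist
        t = float(np.dot(to_obj, direction))  # distance along the ray
        if t <= 0:
            continue  # object is behind the hand, so it cannot be the referent
        perp = float(np.linalg.norm(to_obj - t * direction))  # distance off the ray
        if perp < best_dist:
            best_idx, best_dist = i, perp
    return best_idx
```

Humans do this implicitly in a fraction of a second; the scores above suggest that MLLMs are not performing the equivalent computation from raw pixels.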
Deep Dive: Vision vs. Context
To prove that Indexical gestures require visual grounding (while Iconic gestures rely on text), the researchers performed an “ablation study.” They tested the models again, but with a twist: they selectively removed information to see what would break the model’s understanding.
- Dialogue Only: The model gets the text but no video.
- Vision Only: The model gets the video but no text.
- Full Input: The model gets both.
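In code, the three conditions amount to assembling different subsets of the same inputs. The sketch below reuses the hypothetical `GestureInstance` and `load_frames` from the earlier examples and paraphrases the question; it illustrates the ablation design rather than the authors’ implementation.

```python
def build_condition(inst, condition, load_frames):
    """Assemble (frames, prompt) for one ablation condition.

    condition: "full", "dialogue_only", or "vision_only".
    Illustrative only; the paper's exact prompts differ.
    """
    question = (
        "Explain the meaning and intent of the gesture performed "
        "at the end of this scene."
    )
    frames, context = None, ""
    if condition in ("full", "vision_only"):
        frames = load_frames(inst.video_id, inst.start_sec, inst.end_sec)
    if condition in ("full", "dialogue_only"):
        context = f"Conversation so far:\n{inst.dialogue_context}\n\n"
    return frames, context + question
```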

Table 3 offers the most compelling evidence in the paper.
- Look at the Indexical column: When visual input was removed (“Dialogue Only”), accuracy dropped significantly (0.47 \(\rightarrow\) 0.38). This proves that visual cues are critical for pointing gestures. However, even with vision, the score is low, implying the models aren’t using the visual data effectively.
- Look at the Iconic column: When dialogue was removed (“Vision Only”), the score crashed (0.50 \(\rightarrow\) 0.29). This reveals that for Iconic gestures, the models lean heavily on the accompanying dialogue to interpret the hand movement.
This suggests that MLLMs are currently “text-first” learners. They excel at Iconic gestures because they can use the spoken context (“The mountain was this high”) to guess the gesture. They struggle with pointing because they cannot “cheat” using the text alone.
Can We Fix It?
If MLLMs are bad at seeing pointing gestures, can we help them by giving them more information? The researchers tried augmenting the prompts with extra data:
- Extended Context: Giving 10 seconds of dialogue instead of 5.
- Physical Description: Manually telling the model how the hand is moving (e.g., “The hand is extended forward”).
- Labeling: Explicitly telling the model “This is an indexical gesture.”

Table 2 shows the results of these interventions. Simply adding more dialogue (Extended dialogue context) did almost nothing (0.47 \(\rightarrow\) 0.48). This reinforces the idea that the answer isn’t in the text.
However, explicitly describing the hand movement (Physical-level gesture description) caused a massive jump in performance (0.47 \(\rightarrow\) 0.60). This indicates that the models can reason about the gesture if they are told exactly what the hand is doing. The failure lies in their visual perception—their ability to extract that physical motion from the video pixels themselves.
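As a rough illustration, the most effective intervention amounts to prepending a hand-written physical description (and optionally a gesture-type label) to the prompt before querying the model. The helper below is hypothetical, and its wording is paraphrased from the paper’s description of the setup.

```python
def augment_prompt(base_prompt, physical_description=None, gesture_label=None):
    """Prepend optional hints to the gesture-explanation prompt.

    physical_description: e.g. "The speaker's right arm is extended forward
        with the index finger outstretched." (manually written)
    gesture_label: e.g. "indexical"
    Illustrative sketch, not the authors' exact prompt format.
    """
    parts = []
    if gesture_label:
        parts.append(f"Gesture type: {gesture_label}.")
    if physical_description:
        parts.append(f"Physical description of the gesture: {physical_description}")
    parts.append(base_prompt)
    return "\n\n".join(parts)
```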
Conclusion and Implications
This research highlights a critical limitation in the current generation of Multimodal AI. While we often think of models like GPT-4o as “seeing” images, they process visual data differently than humans do. They rely heavily on linguistic priors and semantic associations.
The study concludes that MLLMs have not yet fully internalized the role of external reference in communication. They struggle to draw the invisible line between a pointing finger and a real-world object.
Why Does This Matter?
This isn’t just an academic curiosity. As we move toward Embodied AI—robots that work in homes or factories, and Augmented Reality (AR) assistants that see what we see—this limitation becomes a blocker.
If you tell a robot, “Put that box over there,” while pointing, the robot relies on an indexical gesture. If the AI cannot ground your pointing gesture in 3D space, it cannot perform the task. This paper suggests that simply making models “larger” or feeding them more text might not solve the problem. Future development needs to focus on better visual-spatial grounding, helping models truly “see” the world, not just read about it.