Imagine you are looking at a picture of a mountain range.
If you found this image on a travel blog, you might ask: “Where is this located?” or “How difficult is the hike?”
However, if you encountered the exact same image in a science magazine, your questions would likely shift to: “Is this a volcanic range?” or “How were these peaks formed?”
This simple thought experiment highlights a fundamental aspect of human communication: our questions are rarely generated in a vacuum. They are shaped by our goals, our environment, and the information we already possess.
Yet, for years, the field of Visual Question Answering (VQA)—where AI models are trained to answer questions about images—has largely ignored this reality. Traditional datasets present images and questions in isolation, treating the task as a test of object recognition rather than a communicative act.
In this post, we will explore CommVQA, a research paper that challenges the status quo by introducing a new dataset and benchmarking framework. This work posits that to build truly useful AI assistants, particularly for accessibility, we must situate VQA within realistic communicative contexts.
The Problem with “Void” VQA
To understand why CommVQA is necessary, we first need to look at how VQA models are typically trained. In standard datasets, annotators are shown an image and asked to write a question to “stump a smart robot.” This adversarial approach leads to questions that verify visual content (e.g., “Is the dog black or white?”) rather than questions that seek new information based on a user’s need.
This disconnect is particularly problematic for accessibility applications. Blind and Low-Vision (BLV) users rely on VQA tools to understand images they cannot see. In a real-world scenario, a BLV user isn’t trying to quiz the AI; they usually have some context (like the website they are browsing or a brief alt-text description) and a specific goal (like buying a gift or reading the news).
The researchers behind CommVQA argue that current datasets lack two central communicative drives:
- Information Needs: The goal of the user changes based on the scenario (e.g., shopping vs. social media).
- Prior Knowledge: Users often have some information (like a caption) and ask follow-up questions based on that.
Introducing CommVQA
To address these limitations, the researchers introduced CommVQA, a dataset designed to treat VQA as a communicative task. The dataset contains 1,000 images, but unlike previous collections, these images are deeply embedded in context.
The dataset includes (a sample record is sketched after this list):
- Images sourced from Wikipedia.
- Scenarios representing where the image appears (e.g., a Health website).
- Descriptions (simulating alt-text).
- Context-aware Questions and Answers.
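To make this concrete, here is a minimal sketch of what a single CommVQA-style record might look like. The field names and values are illustrative assumptions for this post, not the dataset’s actual schema.

```python
# A minimal, illustrative sketch of one CommVQA-style record.
# Field names and values are assumptions, not the dataset's actual schema.
example_record = {
    "image": "wikipedia_images/mountain_range.jpg",   # image sourced from Wikipedia
    "scenario": "Travel",                             # one of six website categories
    "description": (                                  # simulated alt-text the asker has read
        "A snow-capped mountain range under a clear blue sky, "
        "with a hiking trail visible in the foreground."
    ),
    "questions": [
        {
            "question": "How difficult does the trail look for a beginner?",
            "answers": [
                "The visible section looks gentle, but the peaks suggest "
                "steeper terrain further along."
            ],
        }
    ],
}
```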
How the Dataset Was Built
The construction of CommVQA was a multi-step pipeline designed to simulate a real-world information gap.

As shown in Figure 1 above, the process involved several distinct stages:
- Scenario Matching: First, images were paired with plausible website scenarios. The researchers identified six categories: Shopping, Travel, Science Magazines, News, Health, and Social Media.
- Description Generation: To simulate the prior knowledge a user might have (like alt-text), GPT-4V generated initial descriptions, which were then refined by human editors to ensure quality.
- Question Elicitation (The Crucial Step): This is where CommVQA departs from tradition. Human participants were given the scenario and the description, but they were not shown the image. They were asked to imagine they were browsing that specific website and to ask questions they wanted answered by someone who could see the image. This simulates the experience of a BLV user who has access to text metadata but not the visual content itself.
- Answer Elicitation: Finally, a different set of participants (who could see the image, the question, and the context) provided the answers.
What Does the Data Look Like?
The result is a rich tapestry of questions that feel significantly more natural and goal-oriented than standard VQA datasets.

In the examples above (Figure 5), notice how specific the questions are to the context. In the Shopping scenario (Scenario 3), the user knows from the description that there is a diver and a wreck, but specifically asks about the color of the flippers—likely relevant if they are looking to buy diving gear. In the Health scenario (Scenario 4), the user asks about the age of the people exercising, fitting the goal of learning about a healthy lifestyle.
Does Context Actually Change the Questions?
A skeptic might ask: “Does the website category really change the question that much?”
To verify this, the researchers fine-tuned a BERT model to classify which scenario a question came from, based solely on the text of the question. If questions were generic, the model would fail. Instead, the model achieved 56% accuracy (far above the roughly 16% random chance for six scenarios), demonstrating that the linguistic patterns of questions are tightly tied to their context.

Figure 2 illustrates this distinguishability. Some scenarios are vastly different; for example, the model could easily distinguish Science Magazines from Shopping (94% accuracy). However, Travel and Social Media were harder to tell apart (83%), likely because travel photos are frequently shared on social platforms, leading to overlapping interests in “where” and “who.”
The researchers also found that specific question words correlated with scenarios. “Who” questions dominated Social Media contexts, while “Where” questions were king in Travel contexts. This confirms that to solve VQA, a model must understand the user’s intent, not just the pixels.
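As a rough sketch of the BERT probe described above, here is how one might fine-tune a scenario classifier on question text with Hugging Face Transformers. The example questions, label handling, and hyperparameters are assumptions for illustration; the paper’s exact training setup may differ.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Six scenario labels; the two example questions are illustrative, not from the dataset.
SCENARIOS = ["Shopping", "Travel", "Science Magazines", "News", "Health", "Social Media"]
label2id = {name: i for i, name in enumerate(SCENARIOS)}

train_data = Dataset.from_dict({
    "text": ["What color are the flippers?", "Where was this photo taken?"],
    "label": [label2id["Shopping"], label2id["Travel"]],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(SCENARIOS)
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="scenario-clf", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_data,
)
trainer.train()  # a real run would use the full question set and a held-out test split
```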
Benchmarking AI Models: Can They Handle Context?
With the dataset established, the researchers put four state-of-the-art Vision-Language Models (VLMs) to the test: LLaVA, BLIP-2, mPLUG-Owl, and IDEFICS.
They devised two experimental setups, sketched as a prompt template after the list:
- Baseline: The model is given the Image and the Question. (This is how models are usually tested).
- Contextual: The model is given the Image, Question, Scenario, and Description. (This mimics the full CommVQA setup).
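Here is a minimal sketch of how the two prompt conditions might be assembled. The template wording is an assumption for illustration; the paper’s exact prompts may differ.

```python
def build_prompt(question, scenario=None, description=None):
    """Assemble the text prompt fed to a VLM alongside the image.

    Baseline: question only. Contextual: scenario + description + question.
    The template wording is an assumption, not the paper's exact prompt.
    """
    if scenario is None and description is None:
        return f"Question: {question}\nAnswer:"
    return (
        f"This image appears on a {scenario} website.\n"
        f"Image description: {description}\n"
        f"Question: {question}\nAnswer:"
    )

baseline_prompt = build_prompt("What color are the diver's flippers?")
contextual_prompt = build_prompt(
    "What color are the diver's flippers?",
    scenario="Shopping",
    description="A scuba diver swims above a sunken shipwreck.",
)
```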
The hypothesis was simple: giving the model more context should help it provide better answers.
The Results
Performance was measured using standard reference-based metrics (BLEU, METEOR, CIDEr), which compare the model’s generated text to the human reference answers.
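For reference, here is a rough sketch of how scores like BLEU and METEOR can be computed with the Hugging Face evaluate library; CIDEr typically requires a separate package such as pycocoevalcap, so it is omitted here. This mirrors the kind of evaluation used, not the paper’s exact scoring code, and the example strings are made up.

```python
import evaluate  # pip install evaluate

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")

# Illustrative prediction and reference; one list of references per prediction.
predictions = ["The flippers are bright yellow."]
references = [["The diver's flippers are yellow."]]

print("BLEU:  ", bleu.compute(predictions=predictions, references=references)["bleu"])
print("METEOR:", meteor.compute(predictions=predictions, references=references)["meteor"])
```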

Table 1 reveals a surprising split in performance. IDEFICS was the only model that successfully leveraged the context, seeing a significant boost in performance (e.g., CIDEr score rising from 0.758 to 0.839).
However, for LLaVA, mPLUG-Owl, and BLIP-2, performance actually dropped when context was added. Why would having more information make these models perform worse?
The “Parroting” Problem
The drop in scores for the other models wasn’t because they stopped understanding the image. It was because they became lazy.
When provided with a detailed text description of the image, models like LLaVA and mPLUG-Owl tended to over-rely on that text. Instead of looking at the image to answer the specific new question, they simply repeated information found in the description.

The researchers confirmed this using CLIPScore (Table 2), which measures how well the answer describes the image. Paradoxically, while the reference-based scores (like BLEU) went down, the CLIPScores went up. This indicates the models were generating highly descriptive, visual text—just not the specific text needed to answer the user’s question.
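As an illustration, CLIPScore can be approximated by embedding the image and the generated answer with CLIP and rescaling their cosine similarity (the standard definition is 2.5 × max(cos, 0)). This is a sketch of the metric, not the paper’s evaluation code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, answer: str) -> float:
    """Reference-free score: how well the answer text matches the image."""
    inputs = processor(text=[answer], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img * txt).sum(dim=-1).item()
    return 2.5 * max(cos, 0.0)  # standard CLIPScore rescaling

# Note: a highly descriptive answer can score well here even if it merely
# repeats the description instead of answering the specific question.
```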
We can visualize this behavior by looking at the semantic similarity between the model’s answer and the provided description.

Figure 4 shows that models (the green distributions) have a much higher similarity to the description than human answerers (the orange distribution). Humans know that the questioner has already read the description, so they provide new information. Models, struggling to understand the “communicative gap,” just parrot back what they were told.
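A quick way to reproduce this kind of analysis is to embed each answer and the description with a sentence encoder and compare cosine similarities. The encoder choice and the example answers below are assumptions for illustration; the paper may use a different setup.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, for illustration

description = "A scuba diver swims above a sunken shipwreck on the ocean floor."
model_answer = "A scuba diver is swimming above a sunken shipwreck."        # parrots the description
human_answer = "The diver's flippers are bright yellow with black straps."  # adds new information

desc_emb, model_emb, human_emb = encoder.encode(
    [description, model_answer, human_answer], convert_to_tensor=True
)
print("model vs. description:", util.cos_sim(model_emb, desc_emb).item())   # high: parroting
print("human vs. description:", util.cos_sim(human_emb, desc_emb).item())   # lower: new info
```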
A Concrete Example of Failure
To see this in action, look at the example below involving a walrus.

In Figure 3, the description clearly mentions a walrus and a snowy environment. The user asks, “What else is in the image?” A human would look for details not in the description. However, the model (IDEFICS Contextual), despite being the best performer overall, still falls into the trap of repeating “a walrus with large tusks.”
This highlights a major limitation in current AI: a lack of Theory of Mind. The models struggle to track what the user already knows versus what the user wants to know.
Hallucinations and Unanswerable Questions
The study revealed two other critical weaknesses in current models.
1. Hallucinations: Even the best model (IDEFICS) frequently made things up. In a human evaluation of 100 answers, 23% contained clearly erroneous information. For an accessibility tool, this is a dangerous failure rate. A user who cannot verify the image visually is forced to trust the AI.
2. Unanswerable Questions: Because the questioners in CommVQA couldn’t see the image, they sometimes asked questions that the image couldn’t answer (e.g., asking “What does the text say?” when the text is too blurry).
- IDEFICS was the best at recognizing this, successfully abstaining (saying “I can’t answer”) 21% of the time.
- BLIP-2 never abstained; it attempted an answer every single time, hallucinating one when the image could not support it.
Interestingly, when the researchers explicitly prompted IDEFICS with “If you don’t know, say ‘unanswerable’,” its success rate jumped to 87%. This suggests that models have the latent capability to judge their own uncertainty but need very specific instructions to do so.
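A minimal sketch of adding an explicit abstention instruction to the contextual prompt is shown below; the exact wording the researchers used may differ.

```python
def build_prompt_with_abstention(question, scenario, description):
    """Contextual prompt plus an explicit instruction to abstain when unsure.

    Template wording is an assumption; the paper's exact instruction may differ.
    """
    return (
        f"This image appears on a {scenario} website.\n"
        f"Image description: {description}\n"
        f"Question: {question}\n"
        "If the image does not contain enough information to answer, "
        "respond with 'unanswerable'.\n"
        "Answer:"
    )
```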
Conclusion: The Future of Situated VQA
CommVQA demonstrates that Visual Question Answering is not just a computer vision problem; it is a communication problem.
The data clearly shows that the “where” (scenario) and the “what I know” (description) fundamentally change the “what I want” (question).
- For Dataset Creators: We need more benchmarks that mimic these information gaps rather than isolated object recognition tests.
- For Model Builders: The “parroting” effect shows that current Instruction Tuning isn’t enough. Models need to be trained to provide information gain—telling the user what they don’t know, rather than summarizing what they already do.
By moving VQA closer to these real-world communicative contexts, we pave the way for AI assistants that are not just smart, but genuinely helpful, especially for those who rely on them to navigate the visual world.