Introduction
Imagine a busy emergency room. Doctors and nurses are rushing between patients, machines are beeping, and decisions must be made in a split second. Now, imagine an AI assistant in the corner, observing the scene through a camera, ready to alert the staff if a patient pulls out an IV line or if a ventilator setting looks wrong.
This sounds like the future of healthcare, doesn’t it? With a global shortage of over 6 million physicians, the promise of autonomous medical agents is alluring. But before we hand over the reins to Artificial Intelligence, we have to ask a critical question: Do these models actually understand what they are looking at?
We have seen the rise of Large Vision Language Models (LVLMs) like GPT-4V and Gemini, which can describe images with impressive fluency. But describing a cat on a couch is very different from interpreting a patient’s posture during a nasopharyngeal swab.
In this post, we are diving deep into a recent paper that challenges these models with the ERVQA (Emergency Room Visual Question Answering) dataset. The researchers didn’t just want to know if the models could talk; they wanted to know if they could reason, perceive, and act safely in a high-stakes hospital environment. The results reveal a fascinating, and somewhat concerning, gap between AI capability and medical necessity.

The Problem with Current Benchmarks
To understand why ERVQA is necessary, we first need to look at the landscape of “Medical VQA” (Visual Question Answering). There are plenty of datasets out there, but they tend to focus on specific, isolated tasks.
Most existing datasets, such as VQA-RAD or PathVQA, rely heavily on radiology (X-rays, CT scans) or pathology slides. While vital, these images are static and highly structured. They don’t capture the chaos and visual complexity of a hospital room. Furthermore, the answers in these older datasets are often very short—sometimes just “Yes” or “No,” or a single word.
Real-world medical assistance requires more. If a nurse asks, “Is the patient’s position correct for this procedure?” a simple “No” isn’t helpful. The system needs to explain why and how to fix it.

As shown in Figure 2 above, existing datasets (a-e) look very different from the ERVQA approach (f). The ERVQA dataset moves away from isolated scans and steps into the room with the patient. It focuses on:
- Patient Condition: Consciousness, symptoms, mood.
- Machines & Apparatus: Are the readings normal? Is the IV bag placed correctly?
- Environment: General anomalies in the ER.
Introducing ERVQA: The Dataset
The researchers curated 367 real-world images from hospital environments. These aren’t pristine stock photos; they were scraped from news articles and reports to reflect the noise and reality of actual hospitals.
To create the Questions and Answers (QA), they didn’t just crowdsource to random internet users. They employed medical experts—people with formal medical education and hospital experience.
The process combined manual annotation with semi-automatic generation (using GPT-4V), and every generated pair was rigorously verified and corrected by human doctors. The result is a dataset of 4,355 QA pairs.
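To make the semi-automatic step concrete, here is a rough sketch of how a vision model could be prompted to draft candidate QA pairs for later expert review. The prompt wording, the `gpt-4o` model name, and the `draft_qa_pair` helper are illustrative assumptions, not the paper’s actual generation pipeline.

```python
import base64
from openai import OpenAI  # assumes the official openai Python client

client = OpenAI()

# Illustrative drafting prompt; the paper's real prompt is not reproduced here.
DRAFT_PROMPT = (
    "You are helping build a medical VQA dataset. Given this emergency-room image, "
    "draft one open-ended question a clinician might ask and a detailed answer. "
    "Flag anything you are unsure about so a doctor can verify it."
)

def draft_qa_pair(image_path: str, model: str = "gpt-4o") -> str:
    """Draft a candidate QA pair; the output is only a draft until experts verify it."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": [
            {"type": "text", "text": DRAFT_PROMPT},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content
```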

What makes this dataset distinct is the depth of the data.
- Open-Ended: The answers are free-form text, not multiple choice.
- Reasoning-Heavy: The models must infer information from visual cues (e.g., “Is the patient critical?” requires looking at the number of staff, the equipment, and the patient’s posture).
- Diverse: As shown in the chart below, the questions vary significantly, asking about existence (“Is there…”), description (“What is…”), and capability (“Can you…”).

The Error Taxonomy: How Models Fail
In a general conversation, if an AI gets a detail slightly wrong, it’s annoying. In a hospital, it can be fatal. The researchers realized that standard accuracy metrics (like “did the model get the exact word right?”) were insufficient. They needed to categorize how the models were failing.
They developed a detailed Error Taxonomy consisting of 8 distinct error types. Let’s look at a few of them with examples from the paper.
1. Reasoning and Medical Factual Errors
A Reasoning Error occurs when the model sees the image but draws the wrong conclusion. A Medical Factual Error is when the model hallucinates medical knowledge or misinterprets a procedure.

In the example above (Figure 10), look at the left panel. The model claims it is “impossible to confirm” if vital signs are monitored because “there are no visible monitors.” This is a reasoning failure—a human doctor knows that in such a setting, the wires attached to the patient imply monitoring, even if the screen isn’t in the frame.
2. Specificity and Linguistic Errors
Sometimes the model is just vague (Specificity Error) or grammatically confused (Linguistic Error).

In Figure 12 (right panel), the model says the patient is “alert and oriented.” However, the child has nasogastric tubes and appears sedated or sleeping. The model is using a “safe,” generic medical phrase that is factually wrong for the specific patient.
3. Hallucination
Perhaps the most dangerous error is Hallucination, where the model invents objects or details that simply aren’t there.

In Figure 13 (left), the model adds, “the patient is also being examined for injuries.” There is no visual evidence of an injury exam. In a medical log, this fabricated detail could lead to confusion about what procedures were actually performed.
The “Doubling Down” Effect
One of the most profound insights from the paper is how these errors interact. The researchers analyzed the co-occurrence of errors—if a model makes one type of mistake, what other mistakes does it make?

The heatmap above reveals a concerning trend:
- Reasoning Errors (Type 1) are highly correlated with Hallucination (Type 7) and Perception Errors (Type 3).
- This suggests that when a model fails to perceive an object correctly, it doesn’t just stop; it hallucinates a detail to fill the gap and then builds a reasoning chain based on that lie.
- The models tend to “double down” on their errors rather than expressing uncertainty.
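The co-occurrence analysis itself is straightforward once each answer has binary error labels from the annotation. Below is a minimal sketch using dummy data; the eight types are abbreviated to their numbers because only some of them (Reasoning, Perception, Specificity, Hallucination) are spelled out in this post.

```python
import numpy as np
import matplotlib.pyplot as plt

# Dummy stand-in for expert annotations: one row per generated answer,
# one column per error type (1 = that error was flagged for the answer).
rng = np.random.default_rng(0)
labels = rng.binomial(1, 0.2, size=(500, 8))

# Co-occurrence matrix: entry (i, j) counts answers flagged with both errors.
cooc = labels.T @ labels

error_names = [f"Type {i}" for i in range(1, 9)]  # e.g. 1=Reasoning, 3=Perception, 7=Hallucination
plt.imshow(cooc, cmap="Reds")
plt.xticks(range(8), error_names, rotation=45, ha="right")
plt.yticks(range(8), error_names)
plt.colorbar(label="answers with both errors")
plt.tight_layout()
plt.show()
```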
Benchmarking the Models
The researchers tested a variety of state-of-the-art models, including open-source options (LLaVA, mPLUG-Owl, OpenFlamingo) and closed models (GPT-4V, Gemini Pro Vision).
Adapted Metrics
Since standard text metrics (like BLEU or ROUGE) don’t capture medical accuracy well, the authors adapted two specific metrics for this domain:
- Entailment Score (ES): This measures whether the meaning of the ground truth answer is contained within the generated answer. It uses a Natural Language Inference (NLI) model to check logical consistency.

- CLIPScore Confidence (CLIP-C): This metric checks visual consistency. It measures how well the generated answer aligns with the image compared to the ground truth answer (a minimal sketch of both metrics follows this list).

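Here is a minimal sketch of how these two metrics could be computed with off-the-shelf models. The `roberta-large-mnli` and `openai/clip-vit-base-patch32` checkpoints, and the simple ratio used for CLIP-C, are assumptions for illustration; they are not necessarily the exact models or formula the authors used.

```python
import torch
from PIL import Image
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          CLIPModel, CLIPProcessor)

# --- Entailment Score (ES): is the ground-truth answer entailed by the generation? ---
NLI_CKPT = "roberta-large-mnli"  # assumed NLI checkpoint
nli_tok = AutoTokenizer.from_pretrained(NLI_CKPT)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_CKPT)

def entailment_score(generated: str, reference: str) -> float:
    """Probability that the generated answer (premise) entails the reference (hypothesis)."""
    inputs = nli_tok(generated, reference, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli_model(**inputs).logits.softmax(dim=-1)[0]
    return probs[nli_model.config.label2id["ENTAILMENT"]].item()

# --- CLIPScore Confidence (CLIP-C): image alignment of the generation,
# relative to the image alignment of the ground-truth answer. ---
CLIP_CKPT = "openai/clip-vit-base-patch32"  # assumed CLIP checkpoint
clip_model = CLIPModel.from_pretrained(CLIP_CKPT)
clip_proc = CLIPProcessor.from_pretrained(CLIP_CKPT)

def clip_confidence(image: Image.Image, generated: str, reference: str) -> float:
    inputs = clip_proc(text=[generated, reference], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        sims = clip_model(**inputs).logits_per_image[0]  # image-text similarities
    return (sims[0] / sims[1]).item()  # >1: the generation matches the image better than the reference
```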
The Results
The quantitative results (Table 3) showed that proprietary models like GPT-4V generally outperformed open-source models in semantic understanding (Entailment Score).

However, the metric scores don’t tell the whole story. To get a true picture of reliability, the researchers trained a classifier (using a finetuned BLIP-2 model) to automatically detect the 8 error types discussed earlier. This “silver-label” analysis provided a breakdown of error rates.
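The paper’s silver-label classifier is a finetuned BLIP-2 model; the sketch below shows only the inference side of one plausible setup, where a hypothetical finetuned checkpoint is prompted to emit the error-type label for an (image, question, answer) triple. The checkpoint path and prompt format are assumptions.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

CKPT = "path/to/finetuned-blip2-error-classifier"  # hypothetical finetuned checkpoint
processor = Blip2Processor.from_pretrained(CKPT)
model = Blip2ForConditionalGeneration.from_pretrained(CKPT, torch_dtype=torch.float16).to("cuda")

def predict_error_type(image: Image.Image, question: str, answer: str) -> str:
    """Ask the finetuned model which of the 8 error types the answer exhibits."""
    prompt = (f"Question: {question}\nModel answer: {answer}\n"
              "Error type (1-8):")
    inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
    out = model.generate(**inputs, max_new_tokens=5)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()
```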
Key Trends
1. Bigger Models \(\neq\) Fewer Errors
You might assume that the massive 13B models or the closed commercial models would be significantly safer. Surprisingly, the error analysis suggests otherwise.

As shown in Figure 6, while larger models (like Gemini) perform better, the gap isn’t as wide as one might hope. Even the best models suffer from significant rates of Reasoning (Type 1) and Specificity (Type 5) errors. GPT-4V, for instance, showed a high tendency for reasoning errors, often hallucinating details to support a complex (but wrong) argument.
2. In-Context Learning: A Mixed Bag
The researchers provided the models with examples (1-shot and 3-shot learning) to see if “teaching” them the format helped.

Figure 7 illustrates a frustrating reality for Gemini Pro Vision: increasing the number of examples (shots) did not significantly reduce the error percentage. While the metrics (like BLEU) might go up because the model learns to copy the style of the answer, the underlying reasoning and perception errors remained stubbornly high. The model learned to sound more like a doctor, but it didn’t learn to see better.
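For context, a k-shot prompt for this kind of benchmark is usually just worked examples interleaved with the test item. The sketch below builds a generic interleaved prompt structure with dummy exemplars; `build_k_shot_prompt` and its part schema are assumptions for illustration, since each API expects a slightly different message format.

```python
def build_k_shot_prompt(exemplars, test_image_path, test_question, k=3):
    """Interleave k (image, question, answer) exemplars ahead of the test item."""
    parts = []
    for image_path, question, answer in exemplars[:k]:
        parts.append({"type": "image", "path": image_path})
        parts.append({"type": "text", "text": f"Question: {question}\nAnswer: {answer}"})
    parts.append({"type": "image", "path": test_image_path})
    parts.append({"type": "text", "text": f"Question: {test_question}\nAnswer:"})
    return parts

# Dummy exemplars, followed by the unanswered test question.
demo = [("er1.jpg", "Is the patient conscious?", "No, the patient appears sedated."),
        ("er2.jpg", "Is the IV line secure?", "Yes, it is taped to the forearm."),
        ("er3.jpg", "Are vital signs monitored?", "Yes, ECG leads are attached.")]
prompt_parts = build_k_shot_prompt(demo, "test.jpg", "Is the patient's position correct?", k=3)
```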
3. The Problem of Verbosity
Many models, especially open-source ones, suffered from “Specificity/Relevance” errors (Type 5). They would generate long, winding answers full of generic medical definitions rather than answering the specific question about the patient in the image.

Conclusion: Are They Ready?
The title of the paper asks if LVLMs are ready for hospital environments. The evidence provided by the ERVQA benchmark offers a clear answer: No, not yet.
While models like GPT-4V and Gemini Pro Vision demonstrate impressive general capabilities, their application in a high-stakes, safety-critical environment like an emergency room is currently fraught with risk.
- They hallucinate non-existent treatments.
- They misinterpret visual cues (like a disconnected IV).
- They are overconfident in their incorrect reasoning.
The ERVQA dataset serves as a crucial reality check. It highlights that medical AI needs more than just general training; it requires domain-specific grounding, a better grasp of visual nuance, and, crucially, the ability to say “I don’t know” rather than inventing a plausible-sounding falsehood.
For students and researchers entering this field, this paper opens a massive door for future work. How do we reduce visual hallucinations? How do we teach models to prioritize patient safety over conversational fluency? The ERVQA benchmark is the first step toward answering those questions.