Introduction

In the rapidly evolving world of Artificial Intelligence, we have reached a fascinating recursive milestone: we are increasingly relying on AI models to evaluate other AI models.

As Large Vision-Language Models (LVLMs) like GPT-4o and Claude 3.5 Sonnet become more capable, human evaluation becomes prohibitively expensive and slow. To solve this, researchers use “Generative Reward Models” (GenRMs)—essentially using a powerful LVLM as a judge to rank responses, provide feedback, and guide the training of newer models through Reinforcement Learning from Human Feedback (RLHF).

But this raises a critical question: Who watches the watchmen? If we trust an AI to grade an exam, we need to be absolutely certain that the AI grader knows what it is doing.

This brings us to a significant bottleneck. Current methods for evaluating these “AI judges” are flawed. They either rely on AI-generated labels (which introduces circular bias) or traditional, simplistic tasks that are too easy for modern state-of-the-art models.

Enter VL-RewardBench. In a recent paper, researchers introduced a rigorous, challenging benchmark designed specifically to stress-test Vision-Language GenRMs. The results were surprising: even the most advanced commercial models frequently fail at basic visual perception tasks when asked to judge them.

Figure 1. An example from VL-RewardBench asking about visual details in a restroom. Open-source VL-GenRMs (Qwen2-VL-7B and Llama-3.2-90B) and the commercial model Claude-3.5-Sonnet all fail to provide accurate judgments.

As shown in Figure 1 above, when asked to judge a response about the number of sinks and mirrors in a restroom, top-tier models like Llama-3.2-90B and Claude-3.5-Sonnet failed to identify the correct answer, despite the visual evidence being clear to a human.

In this post, we will break down the VL-RewardBench paper. We will explore how the authors constructed a dataset that stumped the giants, analyze the specific “blind spots” of modern multimodal models, and discuss what this means for the future of AI alignment.

Background: The Era of the AI Judge

Before diving into the benchmark, it is helpful to understand the concept of a VL-GenRM (Vision-Language Generative Reward Model).

In the text-only world, models like GPT-4 are often used to score the quality of summaries or translations. This “LLM-as-a-Judge” paradigm allows for scalable evaluation. Now, this concept is being applied to multimodal tasks—where models must understand both images and text.

A reliable VL-GenRM is foundational for three things:

  1. Evaluation: Automatically tracking progress of new models without waiting for humans.
  2. Data Generation: Filtering synthetic training data to keep only the best examples.
  3. RLHF: Providing the “reward” signal during reinforcement learning to align models with human preferences.

However, prior benchmarks for these judges were insufficient. Some used GPT-4V to generate the “correct” labels, which means if GPT-4V has a specific bias or hallucination habit, the benchmark reinforces it. Others used old academic datasets that simply weren’t hard enough to differentiate between a 7B parameter model and a 90B parameter model.

The Core Method: Constructing VL-RewardBench

The primary contribution of this paper is the construction of a benchmark that satisfies three criteria: it covers real-world scenarios, it is genuinely difficult, and it has objective, human-verified ground truth.

To achieve this, the authors curated 1,250 high-quality preference pairs. A “preference pair” consists of an image, a question, two possible answers (Answer A and Answer B), and a label indicating which answer is better.
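The exact schema is not spelled out in the post, but a preference pair can be represented roughly as follows (the field names are illustrative, not the official data format):

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One VL-RewardBench-style sample (illustrative schema, not the official one)."""
    image_path: str   # the image the question refers to
    question: str     # the query posed about the image
    answer_a: str     # first candidate response
    answer_b: str     # second candidate response
    preferred: str    # "A" or "B" — the human-verified label
    domain: str       # "general", "hallucination", or "reasoning"
```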

The construction process, illustrated in Figure 2 below, involves two clever pipelines designed to filter out the noise and keep only the signal.

Figure 2. Construction process overview of VL-RewardBench. Two strategies for different datasets: (1) Ensemble filtering process using small LVLMs to identify challenging samples from general and hallucination queries; (2) AI-aided preference labeling for multimodal reasoning tasks.

1. The Ensemble Filtering Strategy

For general queries and hallucination tasks, the researchers didn’t want to just pick random images. They wanted images that are hard for machines but solvable for humans.

They employed an Ensemble Filtering technique. They took a committee of smaller models (like LLaVA and Qwen-VL) and asked them to judge various samples.

  • If the small models easily identified the correct answer, the sample was discarded (too easy).
  • If the small models inconsistently guessed or all failed, the sample was flagged as “challenging.”

The hypothesis was that if an ensemble of different small models all struggle with an image, the difficulty likely stems from a fundamental visual complexity rather than a specific bug in one model. As we will see in the results, this hypothesis held true: samples that stumped small models also stumped the giants.
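A minimal sketch of the ensemble-filtering idea, assuming each small model is wrapped as a callable that returns “A” or “B” for a pair (the wrappers and the threshold below are illustrative, not the authors’ exact code):

```python
def ensemble_filter(samples, small_judges, keep_if_correct_at_most=1):
    """Keep only samples that a committee of small LVLMs struggles with.

    samples      : objects with a .preferred label ("A" or "B")
    small_judges : callables mapping a sample to "A" or "B"
    """
    challenging = []
    for sample in samples:
        votes = [judge(sample) for judge in small_judges]
        n_correct = sum(vote == sample.preferred for vote in votes)
        # Discard samples the committee finds easy; keep the ones where
        # the small models mostly fail or disagree with each other.
        if n_correct <= keep_if_correct_at_most:
            challenging.append(sample)
    return challenging
```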

2. AI-Aided Preference Labeling for Reasoning

For complex reasoning tasks (like math problems or chart analysis), existing datasets often lacked “bad” answers to compare against. The researchers needed to create preference pairs.

They used strong commercial models (GPT-4o, Claude 3.5 Sonnet) to generate candidate solutions. Then, they used GPT-4o to act as an initial judge to propose which answer was better.

Crucially, humans were the final gatekeepers. Every single chosen pair in the benchmark underwent a multi-stage human verification process. This eliminated cases where:

  • Both answers were wrong.
  • The image quality was too poor.
  • The “better” answer was only better because it was longer (verbosity bias).
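Put together, the reasoning branch of the pipeline is “AI proposes, human disposes.” A hedged sketch of that flow, with all three helper functions hypothetical stand-ins for the candidate generators, the GPT-4o judge, and the human verification stage:

```python
def build_reasoning_pairs(tasks, generate_candidates, ai_judge, human_verify):
    """Sketch of the AI-aided labeling flow (helper functions are hypothetical).

    generate_candidates(task)       -> (answer_a, answer_b)  # e.g. from strong LVLMs
    ai_judge(task, a, b)            -> "A" or "B"            # initial machine preference
    human_verify(task, a, b, label) -> bool                  # final human gatekeeping
    """
    pairs = []
    for task in tasks:
        answer_a, answer_b = generate_candidates(task)
        proposed = ai_judge(task, answer_a, answer_b)
        # Humans reject pairs where both answers are wrong, the image is too
        # poor, or the "better" answer wins only on length.
        if human_verify(task, answer_a, answer_b, proposed):
            pairs.append((task, answer_a, answer_b, proposed))
    return pairs
```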

Dataset Statistics and Quality Control

The resulting dataset covers three domains:

  1. General Multimodal Instructions: Everyday queries.
  2. Hallucination-oriented Queries: Specific checks on whether objects exist or attributes are correct.
  3. Multimodal Reasoning: Math, logic, and knowledge-intensive tasks.

Table 1 provides a breakdown of the dataset. Notice the high number of “Hallucination-oriented” queries (749), which serves as a stress test for model faithfulness.

Table 1. Statistics of VL-RewardBench showing the breakdown of samples across categories and error types.

One common issue in training reward models is “length bias”—models often prefer longer answers regardless of quality. To ensure their benchmark wasn’t just testing which model could write the most text, the authors analyzed the word count distribution.

Figure 3. Distribution of the word count difference between the chosen and the rejected response. The bell curve centered at 0% indicates no systematic length bias.

As shown in Figure 3, the difference in length between chosen and rejected responses forms a zero-centered bell curve. This confirms that the preference labels are based on content quality, not verbosity.
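This kind of sanity check is easy to reproduce on any preference dataset. A sketch, assuming each pair carries the chosen and rejected response text (the relative-difference normalization is one plausible choice, not necessarily the paper’s exact formula):

```python
def length_bias_histogram(pairs):
    """Relative word-count difference (%) between chosen and rejected responses.

    A distribution centered near 0 suggests labels are not driven by verbosity.
    pairs: iterable of (chosen_text, rejected_text) strings.
    """
    diffs = []
    for chosen, rejected in pairs:
        len_c, len_r = len(chosen.split()), len(rejected.split())
        diffs.append(100.0 * (len_c - len_r) / max(len_c, len_r))
    return diffs  # plot as a histogram, e.g. with matplotlib
```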

Experiments and Results

The authors evaluated 16 state-of-the-art models, ranging from open-source 7B models to proprietary giants like GPT-4o and Gemini 1.5 Pro. The evaluation setup followed the “LLM-as-a-Judge” protocol: the model is given the image, question, and two answers, and must output which answer is better.
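The paper’s appendix contains the exact judging prompt; a minimal, hedged version of the pairwise setup might look like the following (the wording is illustrative, and the image itself would be passed through the model’s vision input rather than the text prompt):

```python
JUDGE_TEMPLATE = """You are shown an image and a question about it, followed by two
candidate answers. Decide which answer is better.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Reply with exactly "A" or "B"."""

def build_judge_prompt(sample):
    # Text portion only; the image is attached separately via the vision API.
    return JUDGE_TEMPLATE.format(
        question=sample.question,
        answer_a=sample.answer_a,
        answer_b=sample.answer_b,
    )
```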

The Main Leaderboard

The results were sobering. Table 2 (below) reveals that even the most powerful models are far from perfect.

Table 2. Evaluation results on VL-RewardBench. The best results are shown in bold. Even the top models struggle to achieve high accuracy.

Key Takeaways from the Results:

  1. The Ceiling is Low: The best-performing model, Gemini-1.5-Pro, only achieved 62.5% macro average accuracy. GPT-4o followed closely at 62.4%. Considering a random guess yields 50%, this shows there is massive room for improvement.
  2. Open-Source Struggle: Leading open-source models like Llama-3.2-90B achieved roughly 53.9%, while many 7B models scored only in the 33%–40% range.
  3. Hallucinations are Hard: The “Hallucination” column shows significantly lower scores compared to reasoning. This implies that models are better at abstract math reasoning than they are at simply looking at a picture and verifying if a specific object is present.

Validation: Does this Benchmark Matter?

A skeptic might ask: “Maybe the benchmark is just pedantic? Does scoring well on VL-RewardBench actually translate to real-world utility?”

To answer this, the researchers checked the correlation between a model’s score on VL-RewardBench and its ability to improve downstream performance using Best-of-N (BoN) sampling.

In BoN sampling, a model generates \(N\) answers, and the “Judge” picks the best one. If VL-RewardBench accurately measures judging ability, a high score here should lead to better BoN selection on a hard task like MMMU-Pro (a massive multimodal understanding benchmark).
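For concreteness, here is one simple way to implement BoN selection with a pairwise judge, reducing the N candidates tournament-style (a sketch under that assumption; the paper may aggregate judgments differently):

```python
def best_of_n(image, question, candidates, pairwise_judge):
    """Pick one answer out of N using a pairwise VL-GenRM.

    pairwise_judge(image, question, a, b) -> "A" or "B"
    """
    best = candidates[0]
    for challenger in candidates[1:]:
        # Keep whichever answer the judge prefers in a head-to-head comparison.
        if pairwise_judge(image, question, best, challenger) == "B":
            best = challenger
    return best
```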

Figure 4. VL-GenRMs’ accuracy on VL-RewardBench correlates positively (Pearson r > 0.9) with the improvement they deliver when serving as Best-of-N selectors on MMMU-Pro.

Figure 4 confirms this with a stunningly high correlation (Pearson r > 0.9). Models that score higher on VL-RewardBench are statistically better at selecting the correct answers in complex reasoning tasks. This validates VL-RewardBench as a legitimate proxy for a model’s utility as a reward model.
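The correlation itself is straightforward to compute once you have, for each judge model, its VL-RewardBench accuracy and its BoN gain on MMMU-Pro. A sketch with placeholder numbers (not the paper’s data):

```python
from scipy.stats import pearsonr

# Hypothetical paired measurements, one entry per judge model (placeholders only).
benchmark_accuracy = [0.40, 0.48, 0.54, 0.62, 0.63]   # VL-RewardBench accuracy
bon_gain           = [-1.0, 0.5, 1.8, 3.5, 3.9]       # BoN improvement on MMMU-Pro (points)

r, p_value = pearsonr(benchmark_accuracy, bon_gain)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```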

Analysis: Why Do Models Fail?

The most insightful part of the paper is the deep dive into why these models fail. The authors categorized the errors into specific types: Attribute, Counting, Existence, Recognition, and Reasoning.

1. Perception vs. Reasoning

There is a widely held assumption that “Reasoning” is the hardest part of AI. However, Figure 5 flips this narrative.

Figure 5. Error rate analysis across different types. VL-GenRMs suffer more from perception-related errors (Existence, Recognition) than Reasoning tasks.

Look at the error rates for Existence (checking if an object is in the image) and Recognition (identifying what an object is). They are significantly higher than the error rates for Reasoning.

  • GPT-4o has a ~29.5% error rate on Reasoning but a ~40% error rate on Recognition.
  • Qwen2-VL-7B has a massive ~68% error rate on Existence.

This indicates that the current bottleneck for “AI Judges” isn’t high-level logic; it’s basic visual perception. Models are hallucinating objects that aren’t there or failing to see objects that are. They can solve the math equation if the text describes it, but they can’t reliably read the equation off the whiteboard.

2. The Limits of Inference-Time Scaling

In text LLMs, a common trick to boost performance is “majority voting”—asking the model the same question 5 or 10 times and taking the most common answer. Does this work for Vision-Language Judges?

Figure 6. Performance changes with varying K (number of votes). The increased test-time computation benefits large models like GPT-4o but degrades performance for open-source models.

Figure 6 shows a divergence. For powerful models like GPT-4o (green line), increasing \(K\) (the number of votes) improves accuracy. However, for open-source models like Qwen2-VL (purple squares), increasing \(K\) actually hurts performance.

This suggests that smaller models aren’t just making random errors; they are confidently wrong or easily confused by repeated queries. Inference-time scaling is not a silver bullet for vision-language tasks yet.
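For reference, majority voting over K judge calls is just a mode over sampled verdicts. A minimal sketch; the sampling temperature and tie-breaking policy are choices the post does not pin down:

```python
from collections import Counter

def majority_vote_judgment(sample, judge, k=5):
    """Query the same judge k times and return the most common verdict.

    judge(sample) -> "A" or "B", ideally sampled with nonzero temperature
    so that repeated calls can actually disagree.
    """
    votes = Counter(judge(sample) for _ in range(k))
    verdict, _count = votes.most_common(1)[0]
    return verdict
```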

3. Can We Train Better Judges?

Finally, the authors explored if we can specifically train a model to be a better judge (“Critic Training”). They tested LLaVA-Critic, a model fine-tuned on critique data.

Figure 7. Evaluation of LLaVA-Critic on VL-RewardBench. Critic training greatly improves judgment accuracy compared to the base model.

As shown in Figure 7, training specifically for the “Critic” role yields massive gains. The “Pointwise Critic” (which scores a single answer) improved macro accuracy from 38.2% to 52.9%—a leap that puts a 7B model within striking distance of much larger models. This suggests that the “Judge” capability is a learnable skill that can be unlocked with targeted data.
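A pointwise critic scores each answer independently rather than comparing them side by side. A hedged sketch of how such scores could be turned back into a preference (the 1–10 scale and prompt wording are assumptions, not LLaVA-Critic’s exact format):

```python
POINTWISE_TEMPLATE = """Rate the following answer to the question about the image
on a scale of 1 to 10, where 10 means fully correct and well grounded.

Question: {question}
Answer: {answer}

Reply with a single integer."""

def pointwise_preference(sample, score_fn):
    """score_fn(image_path, prompt) -> int, e.g. from a fine-tuned critic model."""
    score_a = score_fn(sample.image_path, POINTWISE_TEMPLATE.format(
        question=sample.question, answer=sample.answer_a))
    score_b = score_fn(sample.image_path, POINTWISE_TEMPLATE.format(
        question=sample.question, answer=sample.answer_b))
    return "A" if score_a >= score_b else "B"
```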

Conclusion and Implications

VL-RewardBench serves as a reality check for the multimodal AI community. It highlights that while we have made tremendous progress in generative capabilities, our ability to evaluate those generations automatically is lagging behind.

Key Takeaways:

  1. Blind Judges: Current VL-GenRMs struggle more with basic looking (perception) than thinking (reasoning). Future research needs to focus on grounding models in visual reality to fix hallucination and existence errors.
  2. Size Matters: There is a clear scaling law at play. Larger models are significantly better judges, and they are the only ones that currently benefit from inference-time scaling (voting).
  3. Training Works: We don’t have to wait for models to get larger. We can train “Critic” models specifically to be judges, yielding substantial improvements in reliability.
  4. A New Standard: With its high correlation to downstream tasks, VL-RewardBench provides a robust “North Star” for researchers trying to build the next generation of aligned multimodal models.

As we move toward autonomous AI systems that need to self-correct and self-improve, the ability to accurately judge reality is non-negotiable. VL-RewardBench provides the measuring stick we need to reach that goal.