Imagine you are designing a search-and-rescue robot. You send it into a collapsed building and ask, “Is there anyone behind that concrete slab?”

If the robot answers “No” because it used its camera, scanned the area, and saw nothing, that is a success. But what if the robot answered “No” simply because its training data suggests that statistically, people are rarely found behind concrete slabs? The latter is a disaster waiting to happen. The robot isn’t looking; it’s guessing based on prior knowledge.

This is the core problem facing Embodied Question Answering (EQA) today. Large Language Models (LLMs) have become so good at predicting likely answers that they can achieve high scores on benchmarks without actually perceiving the environment. They are “blind” models masquerading as sighted ones.

In this post, we dive into a fascinating paper from the University of Richmond titled “Grounded, or a Good Guesser?”. The researchers propose a clever new dataset methodology called Per-Question Balancing (PQB). By forcing every question to have two opposing answers in two different environments, they mathematically ensure that a blind robot cannot perform better than a coin flip.

The Problem: The “Blind” Baseline

To understand the innovation, we first need to define the task. Embodied Question Answering (EQA) requires an agent to do three things (see the sketch after this list):

  1. Understand a natural language question (e.g., “Are there cobwebs in the house?”).
  2. Perceive the environment (using vision).
  3. Act (navigate, turn, look around) to find the information.
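
It may help to picture this as an interface. The sketch below is purely illustrative (the class and method names are ours, not the paper's); it just makes the three capabilities explicit:

```python
from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class Observation:
    """What the agent currently sees, e.g. an egocentric RGB frame."""
    image: bytes

class EQAAgent(Protocol):
    """Illustrative interface for the three capabilities above (names are ours)."""

    def act(self, question: str, obs: Observation) -> str:
        """Choose the next navigation action, e.g. "forward" or "turn_left"."""

    def answer(self, question: str, history: List[Observation]) -> str:
        """Commit to an answer, grounded in everything seen while exploring."""
```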

Ideally, an agent combines all three. However, researchers have noticed a troubling trend: “blind” models—text-only systems that see no images—are performing shockingly well on EQA benchmarks.

How is this possible? Biased datasets.

If a dataset contains 100 questions about kitchens, and 90 of them ask about stoves, and the answer is almost always “Yes,” the model learns to just say “Yes” whenever it sees the word “kitchen.” It doesn’t need eyes; it just needs statistics. This phenomenon allows models to hallucinate plausible-sounding answers based on language priors rather than grounding their answers in reality.
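
To make the failure mode concrete, here is a caricature of such a "blind" baseline. It is a toy sketch (the keyword table is invented), but it captures how far pure text statistics can go on a biased benchmark:

```python
def blind_baseline(question: str) -> str:
    """Answer from language priors alone -- no perception, no environment.

    The keyword-to-answer table below is invented for illustration.
    """
    priors = {
        "stove": "Yes",    # "kitchens almost always have stoves"
        "kitchen": "Yes",
        "cobweb": "No",    # "most houses don't have cobwebs"
    }
    question = question.lower()
    for keyword, likely_answer in priors.items():
        if keyword in question:
            return likely_answer
    return "Yes"  # fall back to the majority answer of a biased dataset

print(blind_baseline("Are there cobwebs in the house?"))  # "No" -- without ever looking
```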

The Solution: Per-Question Balancing

Previous attempts to fix this involved balancing the entire dataset. For example, ensuring there are 500 “Yes” answers and 500 “No” answers in total.

But the researchers argue this isn’t enough. Imagine a dataset where questions about dogs always have the answer “Yes” and questions about cats always have the answer “No.” The dataset is balanced overall (50/50), but the model still only needs to read the text (“dog” or “cat”) to guess the answer. It still doesn’t need to look at the image.
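
The difference between the two notions of balance is easy to state in code. The sketch below uses an invented (question, environment, answer) tuple format; the point is that a dataset can pass an overall 50/50 check while every individual question is still perfectly predictable from its text:

```python
from collections import Counter, defaultdict

# Each entry: (question, environment_id, answer) -- format invented for illustration.
dataset = [
    ("Is there a dog?", "env_1", "Yes"),
    ("Is there a dog?", "env_2", "Yes"),
    ("Is there a cat?", "env_3", "No"),
    ("Is there a cat?", "env_4", "No"),
]

def overall_balanced(data) -> bool:
    counts = Counter(answer for _, _, answer in data)
    return len(set(counts.values())) == 1  # every answer equally frequent

def per_question_balanced(data) -> bool:
    answers = defaultdict(set)
    for question, _, answer in data:
        answers[question].add(answer)
    return all(len(a) >= 2 for a in answers.values())  # each question has opposing answers

print(overall_balanced(dataset))       # True  -- 50/50 "Yes"/"No" overall
print(per_question_balanced(dataset))  # False -- the text alone still predicts the answer
```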

The paper introduces PQB-EQA (Per-Question Balanced EQA). The rule is simple but strict:

Every specific question must appear twice in the dataset, paired with two different environments that yield different answers.

If the dataset contains the question “Is there a blue flower?”, there must be:

  1. Environment A: Contains a blue flower. (Correct Answer: Yes)
  2. Environment B: Does not contain a blue flower. (Correct Answer: No)

Figure 1: The agent turns and walks into the house to look for cobwebs before determining there are none.

As shown in Figure 1 above, the agent might face a house in two different scenarios. In one, there are cobwebs; in the other, there are none. Because the question is identical, a blind model (which only sees the text) is forced to guess. If it guesses “Yes,” it gets it wrong half the time. If it guesses “No,” it gets it wrong half the time.

Mathematically, this forces the performance of any blind model down to random chance. The only way to score higher is to look at the environment.
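
This is easy to verify with a toy example. Under per-question balancing, any policy that depends only on the question text is scored against both answers to every question, so it lands at exactly 50% on yes/no items no matter how it guesses (the tiny dataset and policy below are invented for illustration):

```python
# Toy PQB dataset: every question appears with both possible answers.
pqb = [
    ("Is there a blue flower?", "env_A", "Yes"),
    ("Is there a blue flower?", "env_B", "No"),
    ("Are there cobwebs in the house?", "env_C", "Yes"),
    ("Are there cobwebs in the house?", "env_D", "No"),
]

def blind_policy(question: str) -> str:
    """Any function of the question text alone -- it never sees the environment."""
    return "Yes" if "flower" in question else "No"

correct = sum(blind_policy(q) == answer for q, _, answer in pqb)
print(correct / len(pqb))  # 0.5 -- any text-only policy gets exactly one of each pair right
```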

Building the Benchmark: Minecraft as a Testbed

To create this dataset, the authors needed a simulation engine that was flexible enough to generate slight variations of the same environment (e.g., placing or removing a specific item). They chose Minecraft.

Minecraft offers diverse biomes (deserts, forests, caves), thousands of items, and a world-editing mod (WorldEdit) that allows for precise environmental control. It strikes a balance between visual complexity and the ability to programmatically manipulate the world.
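
The paper's generation tooling isn't reproduced in this post, but conceptually each balanced pair needs something like the sketch below: clone a base world, then add or remove the item the question asks about. The helper functions are hypothetical placeholders for WorldEdit-style operations, not a real API:

```python
import shutil
from dataclasses import dataclass

@dataclass
class PQBEntry:
    question: str
    world: str      # path to a saved Minecraft world folder
    answer: str

def clone_world(base_world: str, suffix: str) -> str:
    """Copy a saved world folder so it can be edited independently."""
    new_world = base_world + suffix
    shutil.copytree(base_world, new_world)
    return new_world

def place_item(world: str, item: str) -> None:
    """Hypothetical placeholder for a WorldEdit-style edit that adds `item`."""
    raise NotImplementedError

def remove_item(world: str, item: str) -> None:
    """Hypothetical placeholder for a WorldEdit-style edit that removes `item`."""
    raise NotImplementedError

def make_pqb_pair(base_world: str, question: str, item: str) -> list:
    """Build the two environments that give the same question opposite answers."""
    world_yes = clone_world(base_world, "_yes")
    place_item(world_yes, item)     # e.g. add the hay wagon
    world_no = clone_world(base_world, "_no")
    remove_item(world_no, item)     # e.g. make sure there is no hay wagon
    return [PQBEntry(question, world_yes, "Yes"),
            PQBEntry(question, world_no, "No")]
```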

Human-Curated Questions

To ensure the questions were natural and challenging, the researchers didn’t just generate them automatically. They recruited human players to play cooperative games in Minecraft, recording their dialogue to extract authentic questions.

Game 1: Can-you-do-it?

In this game, a “Questioner” has a secret task (like “dye a sheep orange”) but cannot see the world. An “Agent” can see the world but doesn’t know the task. They have to talk to figure it out.

Figure 2: Screenshots from the can-you-do-it game. The Questioner knows the task is “dye a sheep orange,” but the Agent does not.

As seen in Figure 2, the players must coordinate. The Questioner asks things like “Do you see any animals?” or “Are there red and yellow flowers?” (to make orange dye). This generates questions that naturally require exploration to answer.

Game 2: Spot-the-difference

Here, two players are placed in nearly identical environments and must ask questions to figure out what is different.

Figure 3: Example of two environments from the spot-the-difference game. The team would earn points for noting that one environment includes a hay wagon and the other does not, or that only one environment has cobwebs on the buildings, but would lose points if they said the environments were different biomes.

Figure 3 shows how subtle these differences can be. Maybe one temple has a hay wagon and the other doesn’t. This setup is perfect for PQB because it naturally generates a question (“Is there a hay wagon?”) that has a “Yes” answer for Player A and a “No” answer for Player B.

Figure 4: Chat interface for the spot-the-difference game.

The players interacted via a custom chat interface (Figure 4), allowing the researchers to capture the exact queries used to differentiate the worlds.

The Litmus Test: Blind vs. Grounded

Once the dataset was constructed (424 tuples of Question-Environment-Answer), the authors ran the ultimate test. They compared two state-of-the-art models:

  1. Blind GPT-4o: Given only the question and multiple-choice answers.
  2. Grounded GPT-4o: Given the question and a sequence of images (screenshots) from the environment showing the relevant information.

If the dataset works as intended, the Blind model should fail miserably, and the Grounded model should succeed.
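
The authors' exact evaluation harness isn't shown in the post, but a blind-versus-grounded comparison along these lines can be sketched with the OpenAI Python SDK. The prompt wording and image handling below are our assumptions, not the paper's:

```python
import base64
from openai import OpenAI

client = OpenAI()

def ask_blind(question: str, choices: list) -> str:
    """Text-only: the model sees the question and options, but no environment."""
    prompt = f"{question}\nOptions: {', '.join(choices)}\nAnswer with one option only."
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def ask_grounded(question: str, choices: list, screenshots: list) -> str:
    """Grounded: the same prompt, plus screenshots taken along the exploration path."""
    content = [{"type": "text",
                "text": f"{question}\nOptions: {', '.join(choices)}\nAnswer with one option only."}]
    for path in screenshots:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content.strip()
```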

Results

The results were definitive.

Table 1: Accuracy and p-value of the two models. The blind model does not significantly outperform chance, while the grounded model does.

As Table 1 shows, the Blind GPT-4o model achieved an accuracy of 50.7%. Since these are binary or multiple-choice questions effectively balanced 50/50, this score is statistically indistinguishable from random guessing (\(p = 0.8082\)).

In contrast, the Grounded GPT-4o model (equipped with vision) achieved 82.7% accuracy. This huge gap confirms that the questions can be answered, but only if you look.
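
As a sanity check on the statistics, treat the blind model's 50.7% on 424 items as roughly 215 correct answers and test it against a 50% chance level (a simplification, since not every item is strictly two-way). An exact binomial test lands in the same neighborhood as the reported p-value:

```python
from scipy.stats import binomtest

n = 424               # question-environment-answer tuples
k = round(0.507 * n)  # roughly 215 correct answers for the blind model

result = binomtest(k, n, p=0.5, alternative="two-sided")
print(result.pvalue)  # roughly 0.81 -- consistent with "no better than chance"
```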

Comparison to Previous Benchmarks

The significance of this result becomes clear when we compare it to previous EQA benchmarks. In older datasets, the gap between blind and grounded models was often trivial.

Table 2: Reported difference in scores between models with and without vision on previous benchmarks and on PQB-EQA.

Table 2 highlights this disparity. In the EQA v1 benchmark, adding vision only improved performance by 1.8%. In A-EQA, it was 6.3%. This implies that in those older benchmarks, 90%+ of the performance came from just guessing based on text.

In PQB-EQA, the gap is 32.0%. This is a massive shift. It indicates that this benchmark successfully isolates the variable of “perception.”

Consistency Across Question Types

The researchers further broke down the data to ensure this wasn’t just working for simple “Yes/No” questions.

Table 3: Results of each model on yes/no questions and all other types of questions. The model with vision outperforms the blind model by a wide margin on both categories of question.

Table 3 confirms that the trend holds. Whether the question is binary (Yes/No) or requires specific details (“Other”), the blind model hovers near 50%, while the grounded model excels.

The Role of Action

One final important note from the paper is the necessity of action. The researchers analyzed the human gameplay logs and found that, on average, humans took 278 actions (moving, turning, jumping) to find the answer to a question.

This emphasizes that EQA is not just about looking at a static picture (like Visual Question Answering). It is about navigation. To answer “What is behind the house?”, you cannot just stare at the front door; you must walk around it. This dataset provides a testing ground where an agent must learn to explore intelligently.

Conclusion

The “Grounded, or a Good Guesser?” paper exposes a critical weakness in how we evaluate Embodied AI. For too long, we have allowed language models to rely on statistical crutches, inflating their scores without true understanding of the physical world.

By introducing Per-Question Balancing, the authors have created a benchmark where “cheating” via text priors is mathematically impossible. The PQB-EQA dataset sets a new standard for rigor. It ensures that when a robot tells us there are no survivors behind the rubble, or no cobwebs in the corner, it is saying so because it looked—not because it guessed.