Introduction
In the rapidly evolving world of Large Vision-Language Models (LVLMs), the ability of an AI to look at an image and ask intelligent questions is just as important as its ability to answer them. We rely on massive datasets of “Visual Question Answering” (VQA) pairs to train these models. However, there is a bottleneck: creating high-quality, multiple-choice questions for images is labor-intensive for humans, and when machines try to do it, they often get stuck in a loop of redundancy.
Imagine showing a child a picture of a giraffe standing next to a tree. If you asked, “Who is in the photo?” and then immediately asked, “What animal is in the photo?”, you wouldn’t be adding much value. You are asking about the same region of the image twice. Yet, this is exactly what many state-of-the-art models, including GPT-4o, tend to do. They exhibit “tunnel vision,” focusing repeatedly on the most obvious subject while ignoring the background, the context, or the relationships between objects.
Today, we are diving deep into a research paper that proposes a clever solution to this problem. The paper, “Exploring Union and Intersection of Visual Regions for Generating Questions, Answers, and Distractors,” introduces a framework called ReBo.
ReBo forces the AI to “look around.” It generates questions, answers, and distractors (QADs) cyclically, ensuring that each new question focuses on a different part of the image. By mathematically optimizing for the Union of visual regions (seeing the whole picture) and minimizing the Intersection (avoiding redundancy), ReBo creates richer, more diverse training data.

In the figure above, GPT-4o asks three variations of the same question about the giraffe, while ReBo intelligently shifts its gaze, asking about the giraffe, the background trees, and the rocks on the ground.
In this post, we will unpack how ReBo works, the mathematics behind its “visual attention,” and why this matters for the future of AI training.
Background: The Challenge of Multiple-Choice VQA
Before we deconstruct the solution, let’s establish the context. Multiple-choice Visual Question Answering (MC-VQA) requires a model to:
- Read an image.
- Understand a natural language question.
- Select the correct answer from a list of options.
- Ignore the “distractors” (incorrect options designed to confuse the model).
For an AI to learn this, it needs training data—specifically, sets of QADs (Questions, Answers, Distractors).
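To make that format concrete, here is a minimal sketch of a single QAD record as it might appear in a training set. The field names (`question`, `answer`, `distractors`, `region`) are illustrative, not the dataset's exact schema.

```python
# A hypothetical QAD record for one image. Field names are illustrative only.
qad_example = {
    "image_id": "visual7w_000123",
    "question": "What animal is standing next to the tree?",
    "answer": "A giraffe.",
    "distractors": ["A zebra.", "An elephant.", "A horse."],
    # Bounding box (x1, y1, x2, y2) of the image region the QAD is grounded in.
    "region": (40, 10, 220, 270),
}
```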
The Problem with Current Generation Methods
Traditionally, generating these datasets has been a disjointed process. Some algorithms generate a question first, then try to find an answer. Others generate answers and then retroactively fit a question. Distractors are often generated independently, leading to options that are either too easy to guess or nonsensical.
More importantly, when machines generate multiple QADs for a single image, they lack intrinsic dependency. The model doesn’t remember what it asked two seconds ago. If the most salient object in an image is a red car, the model might generate five different questions about the red car, completely missing the pedestrian, the traffic light, or the weather.
This redundancy limits the learning potential of the LVLMs trained on this data. They become experts at recognizing the “main character” of an image but fail at comprehensive scene understanding.
The Core Method: ReBo
The researchers introduce ReBo (Recurrent Bounding box), a framework designed to unify the generation of Questions, Answers, and Distractors while enforcing visual diversity.
At a high level, ReBo works on two main principles:
- Recurrence: It remembers what it has already asked.
- Region Scoring: It explicitly calculates which parts of the image have been covered and penalizes overlapping attention.
1. The Architecture
The architecture of ReBo is built upon the standard Encoder-Decoder structure used in many language models, but with a crucial twist.

As illustrated in the architecture diagram, the system consists of three parts:
- Image Encoder (Frozen): This extracts visual features from the raw image. The researchers use a standard Vision Transformer (ViT).
- LLM Decoder (Frozen): This is the language brain (based on FlanT5-XL) that actually writes the text for the questions and answers.
- Recurrent Multimodal Encoder (Trainable): This is the “manager” and the core innovation of the paper.
The process is cyclic. To generate \(n\) groups of QADs:
- Step 1: The encoder looks at the image and a prompt (Prefix) to generate the first QAD.
- Step 2: The encoder looks at the image, the Prefix, and the QAD generated in Step 1.
- Step 3: The encoder looks at the image, the Prefix, and the QADs from Steps 1 and 2.
By feeding the previous outputs back into the input, the model is conditioned on its own history. It “knows” what has already been discussed.
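A minimal sketch of this cyclic loop is shown below. The names `image_encoder`, `rme`, and `llm_decoder` are placeholders standing in for the frozen ViT, the trainable recurrent multimodal encoder, and the frozen FlanT5-XL decoder; they are not the paper's actual API.

```python
# Sketch of ReBo's cyclic generation: each new QAD is conditioned on the image,
# the prompt prefix, and every QAD generated so far.

def generate_qads(image, prefix, n, image_encoder, rme, llm_decoder):
    """Generate n QAD groups, feeding previous outputs back into the input."""
    visual_features = image_encoder(image)   # frozen ViT features
    history = []                             # QADs produced in earlier steps
    for _ in range(n):
        # The trainable recurrent multimodal encoder fuses the image features,
        # the prefix, and the generation history into a conditioning context.
        context = rme(visual_features, prefix, history)
        qad = llm_decoder(context)           # frozen LLM writes Q, A, and distractors
        history.append(qad)
    return history
```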
2. Diversifying with Union and Intersection
Recurrence alone isn’t enough. The model effectively needs a map of the image to know which physical regions it has already inspected. The researchers achieve this by associating every QAD with a Bounding Box—a rectangular region of the image that the question is about.
The goal is to select a combination of bounding boxes that maximizes the coverage of the image (Union) while minimizing the overlap between boxes (Intersection).
Defining the Combinations
Let’s say we want to generate \(n\) QADs. Each QAD corresponds to a specific region \(R\). The model explores the set of all possible combinations of regions.

Here, \(C\) represents the set of all possible bounding box combinations. If there are many potential objects in an image, the number of combinations can be large (\(n^n\)), so efficient scoring is vital.
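To make the combination space concrete, here is a toy sketch under one plausible reading of the \(n^n\) count: each of the \(n\) QADs can be grounded in any of \(n\) candidate regions, so a combination is one region assignment per QAD. The paper's exact construction of \(C\) may differ.

```python
from itertools import product

# Toy candidate regions as (x1, y1, x2, y2) boxes; the values are made up.
candidate_regions = [(0, 0, 120, 200), (100, 50, 300, 220), (250, 180, 400, 300)]
n = len(candidate_regions)

# One combination c_k assigns a region to each of the n QADs, giving up to n^n
# combinations under this interpretation.
C = list(product(candidate_regions, repeat=n))
print(len(C))  # 27 for n = 3
```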
The Penalty: Intersection over Union (IoU)
To prevent the model from staring at the same spot (like the giraffe example), we calculate the Intersection over Union for a combination of boxes.

In this equation, we sum up the intersections between pairs of regions. A high IoU score is bad in this context—it means the boxes are piled on top of each other, providing redundant information. We want this value to be low.
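A hedged sketch of this penalty is below: each box is rasterized onto a pixel mask, pairwise overlaps are summed, and the total is normalized by the union area. The exact normalization in the paper may differ; the point is that stacked boxes drive the value up.

```python
import numpy as np

def iou_penalty(boxes, height, width):
    """Sum of pairwise box intersections, normalized by the area of their union.
    Boxes are (x1, y1, x2, y2) in pixel coordinates."""
    masks = []
    for x1, y1, x2, y2 in boxes:
        m = np.zeros((height, width), dtype=bool)
        m[y1:y2, x1:x2] = True
        masks.append(m)
    pairwise_inter = sum(
        np.logical_and(masks[i], masks[j]).sum()
        for i in range(len(masks))
        for j in range(i + 1, len(masks))
    )
    union = np.logical_or.reduce(masks).sum()
    return pairwise_inter / union if union > 0 else 0.0
```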
The Reward: Union over Total (UoT)
Conversely, we want the questions to explore the “four corners” of the image. We define the Union over Total (UoT) as the ratio of the combined area of all selected bounding boxes to the total area of the image (\(H \times W\)).

A high UoT score is good. It means that collectively, the generated questions cover a large percentage of the image’s pixels.
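A minimal sketch of UoT, assuming boxes are given as (x1, y1, x2, y2) pixel coordinates:

```python
import numpy as np

def uot(boxes, height, width):
    """Fraction of the image covered by the union of the selected boxes."""
    covered = np.zeros((height, width), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        covered[y1:y2, x1:x2] = True
    return covered.sum() / (height * width)
```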
The Scoring Function
The researchers combine these two metrics into a single score vector \(s\). This score acts as a guide or a “ground truth” for visual diversity.

The formula is elegant in its simplicity: \(s_k = \frac{UoT_k}{IoU_k}\).
- If coverage is high (High Numerator) and overlap is low (Low Denominator), the score is massive.
- If coverage is low or overlap is high, the score drops.
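Building on the two sketches above, the score for each candidate combination can be computed and normalized into a target distribution \(s\). The epsilon guard and the final normalization are assumptions made so the scores can later be compared against the predicted distribution \(p\); the paper may normalize differently.

```python
import numpy as np

def diversity_scores(combinations, height, width, eps=1e-6):
    """Compute s_k = UoT_k / IoU_k for every combination, then normalize."""
    raw = np.array([
        uot(boxes, height, width) / (iou_penalty(boxes, height, width) + eps)
        for boxes in combinations
    ])
    return raw / raw.sum()  # target distribution s over combinations
```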
3. Training the Model
How does the neural network learn to optimize this mathematical score? It treats the diversity score as a target distribution.
First, the model predicts embeddings (vector representations) for the QADs. It compares these predicted embeddings (\(e_i\)) with the ground-truth embeddings (\(e_j^*\)) using cosine similarity.

This similarity tells us how likely it is that a generated question matches a specific region-based topic. Using these similarities, the model calculates the probability \(p\) of selecting a specific combination of bounding boxes.
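A rough sketch of this step is below. How the pairwise cosine similarities are pooled into one score per combination, and the softmax that turns those scores into \(p\), are assumptions here; the paper's exact aggregation may differ.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between a predicted embedding e_i and a ground-truth e_j*."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def combination_probabilities(scores):
    """Softmax over per-combination similarity scores, yielding the distribution p."""
    logits = np.asarray(scores, dtype=float)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```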

Finally, the loss function (the metric the model tries to minimize during training) combines two objectives:
- Language Modeling Loss (\(LM\)): Is the text grammatically correct and sensible?
- Diversity Loss (\(H(s,p)\)): Does the probability distribution of the regions match the optimal diversity score calculated earlier?

By minimizing this Cross-Entropy term \(H(s,p)\), ReBo learns to prefer generating sequences of QADs that result in high Union and low Intersection.
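A compact sketch of the combined objective, assuming a simple additive combination; whether the paper weights the two terms is not stated here, so the `lam` factor is an assumption.

```python
import numpy as np

def total_loss(lm_loss, s, p, lam=1.0, eps=1e-12):
    """Language-modeling loss plus the cross-entropy H(s, p) between the
    target diversity distribution s and the predicted distribution p."""
    h_sp = -np.sum(s * np.log(p + eps))
    return lm_loss + lam * h_sp
```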
Experiments and Results
The researchers evaluated ReBo primarily on the Visual7W dataset, a standard benchmark for grounded visual question answering. They compared ReBo against a suite of heavy hitters, including LLMs (Llama-2, Llama-3, ChatGPT) and Visual-Language models (BLIP, VisualBERT, Qwen-VL).
Quantitative Performance
The results were impressive. ReBo consistently outperformed baselines across standard text generation metrics like BLEU (precision of n-grams), ROUGE (recall), and CIDEr (consensus-based image description).

As shown in Table 1, ReBo achieves a CIDEr score of 48.28, significantly higher than the closest competitor, Qwen-VL (34.45), and massive models like Llama-3 (23.09). This indicates that the QADs generated by ReBo are not only more diverse but also align much better with human-validated references.
Ablation Study: Do the Components Matter?
It is fair to ask: Is the improvement coming from the recurrent structure, or the fancy bounding box math? The researchers performed an ablation study, removing the Bounding Box Combination Scores (BBCS) and the Recurrent Multimodal Encoder (RME).

The bar chart in the figure above (Figure 3) shows a clear drop in performance when these components are removed (the black bars vs. the purple bars). This confirms that both the cyclic generation and the geometric guidance (Union/Intersection) are essential for peak performance.
Human Evaluation
Metrics like BLEU only tell part of the story. The researchers also recruited human annotators to rate the QADs on Quality, Intersection (scored so that a higher rating means less redundant overlap between questions), and Union (scored so that a higher rating means broader image coverage).
Looking at Table 4 in the image above, ReBo scored highest in:
- Quality: 4.07 (vs 3.68 for BLIP2)
- Intersection: 3.70 (indicating less redundancy)
- Union: 4.02 (indicating better image coverage)
Can ReBo Help Train Other Models? (Data Augmentation)
One of the most valuable applications of a generator like ReBo is creating synthetic data to train other models. The researchers used ReBo to generate a massive dataset of synthetic QADs based on Visual7W images. They then used this augmented data to train a standard VQA model (InstructBLIP) and tested it on a completely different dataset (A-OKVQA).

Table 3 shows that adding ReBo-generated data (“Raw+ReBo”) resulted in the highest accuracy (41.80% Average) compared to using data generated by Llama-3 or Qwen-VL. This proves that the diversity of ReBo’s questions actually helps downstream models learn better general reasoning skills.
Qualitative Case Study
Let’s look at a concrete example to see the difference in quality.

In this skiing example:
- GPT-4o asks “Who is in the image?” but provides “snowboarder” as a distractor for “skier.” Visually, the two can be very hard to distinguish, making the distractor potentially too strong to be fair, or simply confusing.
- ReBo (without optimization) makes an error, identifying the jacket color as “yellow” when it is clearly green.
- ReBo (Full Model) generates three distinct QADs:
  - Who is skiing? (Focus on the person)
  - Where is the skier? (Focus on the snow)
  - What is in the background? (Focus on the trees)
The full ReBo model successfully separates the distinct semantic layers of the image (Actor, Environment, Background) into separate, valid questions.
Conclusion and Implications
The “ReBo” framework represents a significant step forward in how we think about machine vision and automated question generation. By moving away from independent generation and towards a holistic, recurrent approach, the authors have solved a critical redundancy problem.
The key takeaways are:
- Context Matters: Generating a question is better when you know what you’ve already asked.
- Geometry Guides Semantics: Using simple geometric properties like Union and Intersection of bounding boxes is a powerful proxy for semantic diversity.
- Better Teachers make Better Students: Using ReBo to generate training data creates smarter VQA models than using data from standard Large Language Models.
For students and researchers in this field, ReBo serves as a reminder that “bigger” models (like GPT-4) aren’t always better at specific tasks if they aren’t constrained by the right logic. Sometimes, you need to explicitly program the AI to widen its gaze and look at the whole picture.