Thinking Fast and Slow in AI: How FOCUS Optimizes Visual Question Answering
Imagine you are looking at a picture of a clear blue sky. If I ask you, “What color is the sky?”, you answer instantly. You don’t need to squint, search, or think hard. It is intuitive.
Now, imagine a picture of a crowded “Where’s Waldo?” scene. If I ask, “Is Waldo holding a cane?”, your brain shifts gears. You stop scanning the whole image generally and start looking for specific features—red stripes, a hat, glasses. You deliberately ignore the distractions and focus on the target.
This distinction between instinctive and deliberate reasoning is known as Dual Process Theory. Humans do it naturally. However, Multimodal Large Language Models (MLLMs)—the AI systems that process both text and images—historically haven’t. They tend to treat every question with the same computational hammer, regardless of whether it’s a simple color query or a complex reasoning task.
In this post, we will dive deep into a paper titled “Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering.” This research introduces FOCUS, a novel method that teaches AI to mimic human cognitive modes. By dynamically switching between fast intuition and deliberate analysis, FOCUS not only improves accuracy but also significantly reduces computational costs.
The Problem: Visual Noise and Computational Waste
Visual Question Answering (VQA) is a task where an AI is given an image and a question about it. Modern MLLMs, like LLaVA or GPT-4V, are quite good at this. However, they struggle with complex scenarios requiring fine-grained perception.
To help models “see” better, researchers often use visual prompting. This usually involves overlaying bounding boxes or segmentation masks on objects in the image to help the model distinguish distinct items. The current state-of-the-art method, Set-of-Mark (SoM), takes a brute-force approach: it detects and marks every object in the image.
While this helps in some cases, it creates two major problems:
- Visual Clutter: By marking everything, the image becomes noisy. Irrelevant objects get highlighted, distracting the model from the actual answer.
- Inefficiency: It is computationally expensive to run segmentation on every object for every single image, even for simple questions that don’t require it.
Take a look at the comparison below:

In Figure 1, notice panel (c). The SoM method places colored boxes everywhere. The model gets confused by the “66” distraction and answers incorrectly. In panel (b), the FOCUS method (the subject of this post) only highlights the relevant player, leading to the correct answer “5”.
The researchers identified that existing methods fail because they indiscriminately annotate all detected objects. This leads to the core questions driving this research: Are all objects equally important? And do all questions actually require visual prompts?
The Solution: Dual Process Theory for AI
The authors propose FOCUS, a plug-and-play approach inspired by Daniel Kahneman’s Dual Process Theory. The theory defines two systems of thinking:
- System 1 (Fast Intuition): Automatic, quick, and requires little effort.
- System 2 (Deliberate Thinking): Slow, analytical, and requires attention.
FOCUS implements this by allowing the MLLM to first assess the difficulty of a question. If it’s simple, it uses Fast Intuition. If it’s complex, it switches to Deliberate Thinking.
The FOCUS Pipeline
The architecture of FOCUS is elegant in its logic. It does not require retraining the massive underlying MLLM; instead, it acts as an intelligent inference strategy.

As shown in Figure 2, the process follows a decision tree:
- Question Complexity Evaluation: The model looks at the image and question and decides, “Is this hard?”
- Branch 1: Fast Intuition: If the confidence is high, the model answers immediately (Zero-shot reasoning).
- Branch 2: Deliberate Thinking: If the confidence is low (meaning the question is complex or ambiguous), the model triggers the “Conceptualizing before Observation” strategy. This involves identifying key elements, segmenting only those elements, and re-processing the image.
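Before breaking these steps down, here is a minimal sketch of that control flow in Python. All of the callables (`answer`, `is_simple`, `extract_keywords`, `segment`, `refine`) are hypothetical placeholders for the components described in the rest of this post, not the authors’ actual code:

```python
from typing import Any, Callable, List

def focus_answer(
    image: Any,
    question: str,
    answer: Callable[[Any, str], str],             # MLLM call: (image, question) -> answer
    is_simple: Callable[[Any, str], bool],         # Step 1: question complexity check
    extract_keywords: Callable[[str], List[str]],  # Step 2A: keyword extraction
    segment: Callable[[Any, str], Any],            # Step 2B: (image, keyword) -> mask
    refine: Callable[[Any, List[Any]], Any],       # Step 2C: (image, masks) -> refined image
) -> str:
    """FOCUS control flow: fast intuition for simple questions,
    'conceptualizing before observation' for complex ones."""
    if is_simple(image, question):
        # System 1: zero-shot answer, no extra image processing.
        return answer(image, question)
    # System 2: highlight only the question-relevant regions, then re-answer.
    keywords = extract_keywords(question)            # e.g. ["pitcher", "hat"]
    masks = [segment(image, kw) for kw in keywords]  # targeted segmentation
    return answer(refine(image, masks), question)
```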
Let’s break down these components in detail.
Step 1: Evaluating Question Complexity
How does an AI know if a question is “hard”? The authors utilize the concept of model confidence.
LLMs can be prone to hallucinations or overconfidence, so simply asking “Are you sure?” isn’t enough. Instead, FOCUS employs a self-consistency check.
The system prompts the MLLM (with a high temperature setting to induce variation) to generate multiple responses—specifically, asking the model if the question is “Answerable” or “Unanswerable” based on the current visual input.
- If the model consistently says “Answerable” across multiple attempts (\(N=3\)), the question is deemed Simple.
- If the model wavers or marks it “Unanswerable,” it is deemed Complex.
If the question is Simple, the system skips all the heavy image processing and outputs the answer directly. This mimics our “Fast Intuition.”
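Here is a minimal sketch of that self-consistency vote, assuming a hypothetical `query_mllm(image, prompt, temperature)` call; the prompt wording is illustrative rather than the paper’s exact template:

```python
from typing import Any, Callable

def is_simple(
    image: Any,
    question: str,
    query_mllm: Callable[..., str],  # hypothetical call: (image, prompt, temperature) -> text
    n: int = 3,
    temperature: float = 1.0,
) -> bool:
    """Self-consistency check: sample the MLLM N times at high temperature and ask
    whether the question is answerable from the current visual input. Only a
    unanimous 'Answerable' verdict routes the question to the fast path."""
    prompt = (
        f"Question: {question}\n"
        "Based only on the image, is this question Answerable or Unanswerable? "
        "Reply with exactly one word."
    )
    votes = [
        query_mllm(image, prompt, temperature=temperature).strip().lower()
        for _ in range(n)
    ]
    # "unanswerable" does not start with "answerable", so a single wobble fails the vote.
    return all(v.startswith("answerable") for v in votes)
```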
Step 2: Conceptualizing Before Observation (Deliberate Thinking)
If the question is flagged as Complex, FOCUS engages its Deliberate Thinking mode. This mode is designed to fix the “clutter” problem we saw earlier with SoM. Instead of highlighting everything, FOCUS highlights only what matters.
This process is called Conceptualizing before Observation, and it has three sub-steps:
A. Keyword Extraction
First, a language model analyzes the text of the question \(Q\) to extract key visual elements (keywords), denoted as \(\{k_i\}\). For example, if the question is “What number is on the pitcher’s hat?”, the model extracts “Pitcher” and “Hat” as the key concepts.
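Sketched in Python, keyword extraction boils down to a single targeted prompt. The `query_llm` callable and the prompt text below are assumptions for illustration, not the paper’s template:

```python
from typing import Callable, List

def extract_keywords(question: str, query_llm: Callable[[str], str]) -> List[str]:
    """Ask a language model to name the visual concepts the question hinges on."""
    prompt = (
        "List the physical objects or visual elements that must be located in an "
        "image to answer the following question. "
        "Reply with a comma-separated list only.\n"
        f"Question: {question}"
    )
    reply = query_llm(prompt)
    # e.g. "pitcher, hat" -> ["pitcher", "hat"]
    return [kw.strip() for kw in reply.split(",") if kw.strip()]
```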
B. Targeted Segmentation
Next, the system uses a segmentation model (specifically Grounded-SAM) as an open-set object detector. Unlike previous methods that segment everything, this model searches the image \(I\) specifically for the extracted keywords \(k_i\).
Mathematically, the segmented regions \(s_i\) are generated as:

\[
s_i = \mathcal{S}(I, k_i)
\]
Here, \(\mathcal{S}\) represents the segmentation model. It takes the original image and the specific keyword to generate a precise visual mask.
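A hedged sketch of this targeted segmentation step is shown below; `grounded_sam.predict(image, text_prompt)` is a hypothetical wrapper interface, since the real Grounded-SAM repository exposes its own detection and mask-refinement utilities:

```python
from typing import Any, List

def segment_keywords(image: Any, keywords: List[str], grounded_sam: Any) -> List[Any]:
    """Targeted open-set segmentation, s_i = S(I, k_i): only the question's
    keywords are segmented, never the whole scene. `grounded_sam` stands in for a
    wrapper with a `predict(image, text_prompt)` method returning binary masks --
    a hypothetical interface, not the repo's actual API."""
    masks: List[Any] = []
    for kw in keywords:
        masks.extend(grounded_sam.predict(image, text_prompt=kw))
    return masks
```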
C. Image Refinement
Once the specific segments are found, the system creates a new, processed image \(I'\). This image highlights the segmented regions—usually by darkening the background or drawing bright contours around the target objects—while leaving the rest of the image less prominent.
Schematically, the aggregation of these segments into the refined image can be written as:

\[
I' = \mathcal{A}\Big(I, \bigcup_{i} s_i\Big)
\]

where \(\mathcal{A}\) denotes the highlighting operation that overlays the aggregated regions on the original image.
Finally, this refined image \(I'\) is fed back into the MLLM along with the original question \(Q\) to generate the final answer \(A\):

\[
A = \mathcal{M}(I', Q)
\]

where \(\mathcal{M}\) denotes the MLLM.
By processing the image this way, the MLLM’s attention is forcibly directed toward the relevant pixels, filtering out the visual noise that usually causes hallucinations.
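The refinement step itself is straightforward image manipulation. Here is one way it could look with NumPy and OpenCV; the specific dimming factor and contour color are illustrative choices, since the paper only specifies that targets are highlighted and the rest of the image is made less prominent:

```python
import numpy as np
import cv2  # opencv-python

def refine_image(image: np.ndarray, masks: list, dim_factor: float = 0.4) -> np.ndarray:
    """Build the refined image I': keep the segmented regions at full brightness,
    dim everything else, and trace bright contours around each target.
    `image` is an HxWx3 uint8 array; each mask is an HxW 0/1 array."""
    union = np.zeros(image.shape[:2], dtype=bool)
    for m in masks:
        union |= m.astype(bool)

    refined = image.copy()
    refined[~union] = (refined[~union] * dim_factor).astype(np.uint8)  # darken background

    for m in masks:
        contours, _ = cv2.findContours(
            m.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
        )
        cv2.drawContours(refined, contours, -1, color=(0, 255, 0), thickness=2)
    return refined
```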
Visualizing the Impact: Attention Calibration
Does this actually change how the model “looks” at the image? The researchers visualized the attention weights of the LLaVA-1.5 model to find out.

In Figure 3, we see a clear difference:
- Row (a) - Original: When asked “What number is on his jersey?”, the model’s attention (the green heatmap) is scattered. It even looks at the woman’s shirt (which has a ‘C’), leading to the wrong answer “66”.
- Row (b) - FOCUS: With the targeted highlighting, the model’s attention is tightly constrained to the man’s jersey. The distraction is ignored, and the model correctly answers “5”.
This confirms that FOCUS acts as an attention calibration mechanism, physically guiding the model’s focus to the right pixels.
Experiments and Results
The researchers tested FOCUS against various baselines, including standard MLLMs (like LLaVA and InstructBLIP) and the previous best method, Set-of-Mark (SoM). They used four distinct benchmarks:
- ScienceQA: Logical reasoning.
- TextVQA: Text recognition in images (OCR).
- VizWiz: Real-world visual understanding (often with poor quality images).
- MME: A comprehensive perception and cognition suite.
Performance Gains
The results were consistent and impressive.

Looking at Table 1, we can see that FOCUS (highlighted in pink) consistently outperforms the baselines.
- Vs. SoM: On the MME benchmark, FOCUS with LLaVA-1.5-13B achieves a score of 1551.0, surpassing SoM’s 1540.1. It also shows significant gains in ScienceQA and VizWiz.
- State-of-the-Art: By combining FOCUS with open-source models, the researchers achieved State-of-the-Art (SoTA) performance across all four benchmarks.
It is worth noting that FOCUS helps across different model sizes. Even the smaller 7B models saw substantial improvements when using this strategy, sometimes rivaling the performance of larger 13B models that didn’t use FOCUS.
Efficiency: Being Fast and Accurate
One of the most critical contributions of this paper is efficiency. Previous prompting methods run slow segmentation processes on every image. FOCUS only runs them when necessary (System 2).

Figure 4 illustrates the inference time relative to the SoM method (which is normalized to 100%).
- On TextVQA, FOCUS takes only 47% of the time SoM takes.
- On ScienceQA, it takes 53%.
This creates a “best of both worlds” scenario: The model is accurate because it uses deep processing for hard questions, but it is fast because it skips that processing for easy ones.
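The savings fall out of simple expected-value arithmetic. Here is a back-of-the-envelope sketch; the fractions and timings are illustrative assumptions, not the paper’s measurements:

```python
def expected_cost(p_fast: float, t_fast: float, t_slow: float) -> float:
    """Expected per-question inference cost when a fraction p_fast of questions
    take the fast path (System 1) and the rest take the deliberate path (System 2)."""
    return p_fast * t_fast + (1.0 - p_fast) * t_slow

# Illustrative numbers only: if 60% of questions are judged simple and the
# deliberate path costs twice as much as a zero-shot answer, FOCUS averages
# 0.6*1.0 + 0.4*2.0 = 1.4 time units versus 2.0 for a method that always
# segments -- roughly 70% of the always-deliberate cost.
print(expected_cost(0.6, 1.0, 2.0))  # -> 1.4
```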
Effectiveness on Black-Box Models
The researchers also applied FOCUS to closed-source “black-box” models like GPT-4V and Gemini Pro. Since we cannot access the internal weights of these models, improving them via external prompting strategies is highly valuable.

As shown in Table 3, applying FOCUS to GPT-4V improved its accuracy on a sample of ScienceQA from 79.2% to 82.4%. This proves that FOCUS is model-agnostic; it works regardless of the underlying architecture.
Why Is Deliberate Thinking Better Than “Mark Everything”?
The paper includes an ablation study to prove that the “Conceptualizing before Observation” strategy (Deliberate Thinking) is superior to the “Mark Everything” strategy (SoM) even when applied to the same images.

Table 5 compares SoM directly against FOCUS. Even if we strip away the “Fast Intuition” part and just look at the highlighting strategy, FOCUS wins. This validates the hypothesis that less is more. By removing irrelevant markers, we reduce the cognitive load on the MLLM.
Conclusion and Implications
The “FOCUS” paper presents a compelling step forward for Multimodal AI. By borrowing from human cognitive psychology, the researchers have created a system that is both more accurate and more efficient than previous brute-force methods.
Here are the key takeaways:
- Dynamic Reasoning: Not all visual questions require the same level of processing. Switching between “Fast” and “Slow” modes saves time.
- Less Noise: Highlighting only the objects relevant to the specific question is far better than highlighting everything.
- Universality: This method works on open-source models (LLaVA) and proprietary giants (GPT-4V) alike.
This research implies that the future of computer vision isn’t just about building bigger models, but about building smarter inference strategies. Just as humans don’t hyper-analyze every pixel of every scene we see, AI shouldn’t either. Sometimes, a quick glance is enough. But when it’s not, it pays to stop, think, and FOCUS.