Artificial Intelligence is revolutionizing healthcare, particularly in the realm of medical imaging. We now have Medical Large Vision Language Models (Med-LVLMs) capable of looking at an X-ray or a retinal scan and answering clinical questions. However, there is a persistent “elephant in the room” with these models: hallucinations.
Even the best models sometimes generate medical responses that sound plausible but are factually incorrect. In a clinical setting, a factual error isn’t just a glitch—it’s a safety risk.
To solve this, researchers often turn to Retrieval-Augmented Generation (RAG). The idea is simple: instead of forcing the AI to memorize everything, give it access to a library of reliable medical reports (an “open-book” exam). But as the authors of the paper “RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models” discovered, RAG isn’t a magic bullet. In fact, giving an AI extra information can sometimes make it perform worse if not managed correctly.
In this post, we will break down RULE, a new framework designed to make medical AI more factually accurate by teaching it exactly how much external help it needs and when to trust its own training over retrieved documents.
The Double-Edged Sword of Medical RAG
The standard approach to fixing AI hallucinations is RAG. When the model receives a medical image and a question, it retrieves similar historical cases or reports from a database to use as reference.
However, the researchers identified two major failure modes when applying RAG to medical imaging:
- The Context Quantity Problem: If you retrieve too few documents, the model might miss critical information. If you retrieve too many, you introduce noise and irrelevant details that confuse the model.
- The Over-Reliance Problem: This is the more subtle danger. Sometimes, a model knows the correct answer based on the image alone. But if the retrieved text contains slightly inaccurate or irrelevant information, the model might blindly trust the retrieval and hallucinate an incorrect answer.

As shown in Figure 1(c) of the paper, there are cases where the “Stronger Med-LVLM” answers correctly on its own, but once RAG is introduced, it gets confused and answers incorrectly. The researchers call this “over-reliance.”
The Solution: RULE
To address these issues, the authors propose RULE (Reliable mUltimodaL RAG). The framework consists of three main stages:
- Standard Context Retrieval: Finding relevant medical reports.
- Factuality Risk Control: A statistical method to determine how many documents (\(k\)) to retrieve so that the risk of factual errors stays below a target level.
- Knowledge Balanced Preference Tuning (KBPT): Fine-tuning the model to balance its internal knowledge with the retrieved context.

Let’s break these components down.
1. Retrieval Strategy
Before we can control risk, we need a way to find relevant information. RULE uses a dual-encoder system (similar to CLIP) to match the input medical image with similar text reports from a database.
The system uses a vision encoder for images and a text encoder for reports. It is trained using a contrastive loss function to ensure that images and their corresponding reports have similar mathematical representations (embeddings).
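For concreteness, here is the standard form of such a symmetric image–text contrastive loss (a sketch of the usual formulation; the paper's exact loss may differ in details):

\[
\mathcal{L}_{\mathrm{con}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log \frac{\exp\!\big(\mathrm{sim}(v_i, t_i)/\tau\big)}{\sum_{j=1}^{N}\exp\!\big(\mathrm{sim}(v_i, t_j)/\tau\big)} + \log \frac{\exp\!\big(\mathrm{sim}(t_i, v_i)/\tau\big)}{\sum_{j=1}^{N}\exp\!\big(\mathrm{sim}(t_i, v_j)/\tau\big)}\right]
\]

Here \(v_i\) and \(t_i\) are the embeddings of the \(i\)-th image and its paired report, \(\mathrm{sim}\) is cosine similarity, and \(\tau\) is a temperature parameter.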

Once trained, when a new patient image arrives, the system retrieves the top-\(k\) most similar medical reports to serve as a reference.
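In code, this retrieval step amounts to a nearest-neighbor lookup in embedding space. Here is a minimal sketch, assuming the encoders have already produced L2-normalized embeddings (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def retrieve_top_k(image_embedding: np.ndarray,
                   report_embeddings: np.ndarray,
                   reports: list[str],
                   k: int) -> list[str]:
    """Return the k reports most similar to the query image.

    Assumes every embedding is L2-normalized, so the dot product
    equals cosine similarity.
    """
    scores = report_embeddings @ image_embedding   # shape: (num_reports,)
    top_idx = np.argsort(-scores)[:k]              # indices of the k highest scores
    return [reports[i] for i in top_idx]
```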
2. Calibrating the Retrieval (Risk Control)
How many reports should the model look at? The standard practice in AI is to arbitrarily pick a number, like the top 3 or top 5.
RULE takes a more rigorous, statistical approach. The goal is to select a subset of retrieved contexts (\(k\)) such that the risk of a factual error is statistically guaranteed to stay below a certain threshold.
The authors use a technique inspired by conformal prediction. They calculate a “factuality risk” for different values of \(k\).
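Concretely, this risk can be written as an empirical error rate (a sketch in generic notation, not necessarily the paper's exact formula). On a calibration set of \(n\) examples, the factuality risk of retrieving \(k\) contexts is the fraction of answers that turn out to be wrong:

\[
\hat{R}(k) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left[\hat{y}_i^{(k)} \neq y_i\right]
\]

where \(\hat{y}_i^{(k)}\) is the answer the model generates for the \(i\)-th calibration example when given the top-\(k\) retrieved contexts, and \(y_i\) is the ground-truth answer.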

They then compute probabilities to determine if the risk at a specific \(k\) is acceptable.

By performing hypothesis testing, they select a set of \(k\) values that control the risk with high probability (at least \(1 - \delta\)). This removes the guesswork: instead of hoping \(k=5\) happens to work, the framework statistically certifies which quantities of context keep the chance of hallucination below the chosen threshold.
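As a sketch of how such a calibration might be implemented, the snippet below computes the empirical risk for each candidate \(k\), converts it into a Hoeffding-style p-value, and keeps the values of \(k\) that pass a Bonferroni-corrected test. These are standard ingredients in conformal-style risk control; the paper's exact test may differ.

```python
import numpy as np

def select_safe_k(errors_by_k: dict[int, np.ndarray],
                  alpha: float = 0.1,
                  delta: float = 0.05) -> list[int]:
    """Return the candidate k values whose factuality risk is certified
    to stay below `alpha` with probability at least 1 - delta.

    errors_by_k[k] is a 0/1 array over a held-out calibration set:
    1 if the model answered incorrectly when given the top-k contexts.
    """
    candidates = sorted(errors_by_k)
    safe = []
    for k in candidates:
        errs = errors_by_k[k]
        n = len(errs)
        risk_hat = errs.mean()  # empirical factuality risk for this k
        # Hoeffding-style p-value for the null hypothesis "true risk > alpha"
        p_value = float(np.exp(-2.0 * n * max(alpha - risk_hat, 0.0) ** 2))
        # Bonferroni correction across all candidate k values
        if p_value <= delta / len(candidates):
            safe.append(k)
    return safe
```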

3. Knowledge Balanced Preference Tuning (KBPT)
This is arguably the most innovative part of the paper. As mentioned earlier, models often suffer from over-reliance. They act like a student who knows the answer but changes it because they peeked at a neighbor’s incorrect test paper.
The researchers quantified this problem. They found that in many cases where the RAG model failed, it was because the retrieval “poisoned” the generation.

As shown in Table 1, roughly 47-58% of the errors made by retrieval-augmented models were due to over-reliance.
The Fix: Direct Preference Optimization (DPO)
To fix this, the authors used Direct Preference Optimization (DPO). DPO is a method usually used to align language models with human preferences (like making ChatGPT polite). Here, the authors adapted it to align the model with factuality.

How they built the dataset: They created a specific “preference dataset” to teach the model when to ignore the retrieval.
- They identified samples where the model answered correctly on its own (without RAG).
- They identified samples where the same model answered incorrectly when RAG was added.
- They treated the correct (no-RAG) answer as the preferred response (\(y_{w,o}\)) and the incorrect (RAG-influenced) answer as the dispreferred response (\(y_{l,o}\)); see the code sketch just after this list.
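A hedged sketch of that construction in code (the model interface and field names here are illustrative, not the paper's actual implementation):

```python
def build_preference_dataset(samples, model):
    """Collect (prompt, chosen, rejected) triples that capture over-reliance.

    A sample qualifies when the model is correct WITHOUT the retrieved
    context but wrong WITH it, so the no-RAG answer becomes the preferred
    response and the RAG-influenced answer the dispreferred one.
    """
    preference_data = []
    for s in samples:
        answer_no_rag = model.answer(s["image"], s["question"])
        answer_rag = model.answer(s["image"], s["question"],
                                  context=s["retrieved_reports"])
        if answer_no_rag == s["ground_truth"] and answer_rag != s["ground_truth"]:
            preference_data.append({
                "prompt": (s["image"], s["question"], s["retrieved_reports"]),
                "chosen": answer_no_rag,   # correct answer produced without RAG
                "rejected": answer_rag,    # incorrect answer induced by retrieval
            })
    return preference_data
```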
They then fine-tuned the model using a modified DPO loss function tailored to this knowledge balance.

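For reference, the standard DPO objective over such (preferred, dispreferred) pairs has the form below; the paper's knowledge-balanced variant conditions the model on the image, question, and retrieved context and may differ in its exact formulation:

\[
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\]

Here \(\pi_\theta\) is the model being fine-tuned, \(\pi_{\mathrm{ref}}\) is a frozen reference model, \(\sigma\) is the sigmoid function, \(\beta\) is a scaling hyperparameter, and \(x\) bundles the image, question, and retrieved context.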
This process effectively teaches the model: “If the retrieved text leads you astray, trust your internal training instead.”
Experimental Results
The researchers tested RULE on three major medical datasets: IU-Xray, MIMIC-CXR (Radiology), and Harvard-FairVLMed (Ophthalmology). They compared it against standard models like LLaVA-Med and other hallucination mitigation techniques (like Greedy Decoding and DoLa).
Accuracy Improvements
The results were impressive. RULE significantly outperformed the baseline LLaVA-Med-1.5 and other methods.

In Visual Question Answering (VQA) tasks, RULE achieved the highest accuracy across all datasets. For example, on the IU-Xray dataset, accuracy jumped from 75.47% (baseline) to 87.84% (RULE).
They saw similar improvements in report generation tasks (writing full medical descriptions), measured by metrics like BLEU and ROUGE-L.

Why does it work?
To prove that KBPT was actually fixing the “over-reliance” issue, the authors visualized the model’s attention—essentially, what the AI is looking at when it generates an answer.

In Figure 3, look at the attention maps on the right (b):
- w/o KBPT (Top): The model spends a lot of “energy” attending to the retrieved tokens (the red blocks on the right side of the heat map). It answers incorrectly.
- w/ KBPT (Bottom): The model shifts its attention. It focuses less on the retrieved text and more on the question and the image. It answers correctly (“No”).
This confirms that the fine-tuning successfully taught the model to weigh its internal knowledge more heavily when necessary.
Compatibility
The authors also showed that RULE isn’t just a one-off success for a specific model version. They applied it to different backbones (LLaVA-Med-1.0 and 1.5) and consistently saw improvements.

Real-World Examples
To visualize the impact, let’s look at two specific cases where RULE saved the day.

In Figure 5:
- Top Case (Lung X-ray): The question asks about “focal airspace consolidation.” The standard LLaVA-Med says “Yes.” The RAG model gets confused by an irrelevant retrieved report about “presbyopia” (an eye condition!) and says “No” for the wrong reason. RULE correctly identifies that the answer is “No” based on the image, ignoring the bad retrieval.
- Bottom Case (Eye Fundus): The standard model identifies presbyopia correctly. The RAG model looks at a retrieved document that says “no presbyopia” and blindly copies it, resulting in an error. RULE balances the two, trusts the visual evidence, and correctly answers “Yes.”
Conclusion
The “RULE” framework highlights a critical nuance in deploying AI for high-stakes fields like medicine: more data isn’t always better. Simply retrieving external documents (RAG) can fill some knowledge gaps, but it can also introduce noise and bias.
By statistically calibrating how much to read (Factuality Risk Control) and fine-tuning the model to resist peer pressure from bad retrievals (Knowledge Balanced Preference Tuning), we can build medical AI that is not only smarter but more reliable.
This research provides a robust blueprint for the future of multimodal medical AI, moving us closer to automated systems that clinicians can actually trust.