Artificial Intelligence is revolutionizing healthcare, particularly in medical imaging. We now have Medical Large Vision Language Models (Med-LVLMs) capable of looking at an X-ray or a retinal scan and answering clinical questions. However, there is a persistent “elephant in the room” with these models: hallucinations.

Even the best models sometimes generate medical responses that sound plausible but are factually incorrect. In a clinical setting, a factual error isn’t just a glitch—it’s a safety risk.

To solve this, researchers often turn to Retrieval-Augmented Generation (RAG). The idea is simple: instead of forcing the AI to memorize everything, give it access to a library of reliable medical reports (an “open-book” exam). But as the authors of the paper “RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models” discovered, RAG isn’t a magic bullet. In fact, giving an AI extra information can sometimes make it perform worse if not managed correctly.

In this post, we will break down RULE, a new framework designed to make medical AI more factually accurate by teaching it exactly how much external help it needs and when to trust its own training over retrieved documents.

The Double-Edged Sword of Medical RAG

The standard approach to fixing AI hallucinations is RAG. When the model receives a medical image and a question, it retrieves similar historical cases or reports from a database to use as reference.

However, the researchers identified two major failure modes when applying RAG to medical imaging:

  1. The Context Quantity Problem: If you retrieve too few documents, the model might miss critical information. If you retrieve too many, you introduce noise and irrelevant details that confuse the model.
  2. The Over-Reliance Problem: This is the more subtle danger. Sometimes, a model knows the correct answer based on the image alone. But, if the retrieved text contains slightly inaccurate or irrelevant information, the model might blindly trust the retrieval and hallucinate an incorrect answer.

Figure 1: Examples of factuality issues in Med-LVLM. (a) shows hallucination. (b) shows the difficulty of choosing the right number of documents. (c) shows the model answering correctly on its own, but failing once RAG is added due to over-reliance.

As shown in Figure 1 (c) above, there are cases where the “Stronger Med-LVLM” gets the answer right initially, but once RAG is introduced, it gets confused and answers incorrectly. The researchers call this “over-reliance.”

The Solution: RULE

To address these issues, the authors propose RULE (Reliable mUltimodaL RAG). The framework consists of three main stages:

  1. Standard Context Retrieval: Finding relevant medical reports.
  2. Factuality Risk Control: A statistical method to determine the exact number of documents (\(k\)) to retrieve to minimize error risk.
  3. Knowledge Balanced Preference Tuning (KBPT): Fine-tuning the model to balance its internal knowledge with the retrieved context.

Figure 2: The framework of RULE comprises two main components: Factuality Risk Control and Knowledge Balanced Preference Tuning.

Let’s break these components down.

1. Retrieval Strategy

Before we can control risk, we need a way to find relevant information. RULE uses a dual-encoder system (similar to CLIP) to match the input medical image with similar text reports from a database.

The system uses a vision encoder for images and a text encoder for reports. It is trained using a contrastive loss function to ensure that images and their corresponding reports have similar mathematical representations (embeddings).

Equation 2: The contrastive loss function used to align image and text representations.
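
For reference, a CLIP-style symmetric contrastive objective over a batch of \(N\) image–report pairs is typically written as follows (this is the standard formulation; the paper’s exact notation may differ slightly):

\[
\mathcal{L}_{\text{con}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[ \log \frac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(v_i, t_j)/\tau)} + \log \frac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(v_j, t_i)/\tau)} \right]
\]

Here \(v_i\) and \(t_i\) are the embeddings of the \(i\)-th image and its report, \(\mathrm{sim}\) is cosine similarity, and \(\tau\) is a temperature parameter. Matching image–report pairs are pulled together in embedding space while mismatched pairs are pushed apart.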

Once trained, when a new patient image arrives, the system retrieves the top-\(k\) most similar medical reports to serve as a reference.
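
To make this concrete, here is a minimal sketch of the top-\(k\) lookup over precomputed report embeddings. The function and variable names are illustrative, not taken from the authors’ code:

```python
import numpy as np

def retrieve_top_k(image_embedding: np.ndarray,
                   report_embeddings: np.ndarray,
                   reports: list[str],
                   k: int) -> list[str]:
    """Return the k reports whose embeddings are most similar to the image.

    Assumes the embeddings come from the trained dual encoder and that
    similarity is cosine similarity, as in CLIP-style retrieval.
    (Illustrative sketch, not the authors' implementation.)
    """
    # Normalize so that dot products equal cosine similarities.
    img = image_embedding / np.linalg.norm(image_embedding)
    docs = report_embeddings / np.linalg.norm(report_embeddings, axis=1, keepdims=True)

    scores = docs @ img                  # cosine similarity of each report to the image
    top_idx = np.argsort(-scores)[:k]    # indices of the k highest-scoring reports
    return [reports[i] for i in top_idx]
```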

2. Calibrating the Retrieval (Risk Control)

How many reports should the model look at? The standard practice in AI is to arbitrarily pick a number, like the top 3 or top 5.

RULE takes a more rigorous, statistical approach. The goal is to select a subset of retrieved contexts (\(k\)) such that the risk of a factual error is statistically guaranteed to stay below a certain threshold.

The authors use a technique inspired by conformal prediction. They calculate a “factuality risk” for different values of \(k\).

Definition of Factuality Risk (FR) based on the model’s accuracy.
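
One natural reading of that definition, with risk taken as the expected error rate (one minus accuracy) when the top-\(k\) reports are provided, is:

\[
\mathrm{FR}(k) = \mathbb{E}\big[\mathbb{1}\{\hat{y}_k \neq y\}\big] = 1 - \mathrm{Acc}(k)
\]

where \(\hat{y}_k\) is the model’s answer given the top-\(k\) retrieved reports and \(y\) is the ground-truth answer. The paper’s exact formulation may differ in detail.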

They then compute probabilities to determine if the risk at a specific \(k\) is acceptable.

Equation 5: Probability calculations for risk control.

By performing hypothesis testing, they select a set of \(k\) values that keeps the factuality risk below the target level with high probability (at least \(1 - \delta\)). This removes the guesswork. Instead of hoping that \(k=5\) happens to work, the framework statistically certifies which quantities of context keep the chance of hallucination acceptably low.

Equation 6: The probability guarantee for the factuality risk control.
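
The authors’ calibration procedure follows the equations above. As a rough illustration of the underlying idea (evaluate each candidate \(k\) on a held-out calibration set and keep only those values whose risk can be statistically bounded below the target level \(\alpha\)), here is a simplified sketch using a Hoeffding-style p-value. The names and the specific bound are stand-ins, not the paper’s implementation:

```python
import math

def certified_k_values(errors_by_k: dict[int, list[int]],
                       alpha: float = 0.1,
                       delta: float = 0.05) -> list[int]:
    """Return the candidate k values whose factuality risk appears controlled.

    errors_by_k[k] is a list of 0/1 outcomes on a held-out calibration set
    (1 = the model answered incorrectly when given the top-k reports).
    alpha is the target risk level and delta the allowed failure probability.
    A Hoeffding-style p-value is used here as a simple stand-in for the
    paper's statistical test.
    """
    certified = []
    for k, errors in errors_by_k.items():
        n = len(errors)
        risk_hat = sum(errors) / n           # empirical factuality risk at this k
        gap = max(alpha - risk_hat, 0.0)     # how far below the target the estimate sits
        # Hoeffding bound on observing a risk this low if the true risk exceeded alpha
        p_value = math.exp(-2 * n * gap ** 2)
        if p_value <= delta:                 # reject "risk > alpha" at level delta
            certified.append(k)
    return certified
```

A full procedure would also correct for testing many candidate \(k\) values at once (for example, with a Bonferroni-style correction), which is the role the hypothesis-testing machinery plays in the paper.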

3. Knowledge Balanced Preference Tuning (KBPT)

This is arguably the most innovative part of the paper. As mentioned earlier, models often suffer from over-reliance. They act like a student who knows the answer but changes it after peeking at a neighbor’s incorrect test paper.

The researchers quantified this problem. They found that in many cases where the RAG model failed, it was because the retrieval “poisoned” the generation.

Table 1: Over-Reliance Ratio showing that nearly half of the errors in RAG models are due to over-reliance on retrieved context.

As shown in Table 1, roughly 47-58% of the errors made by retrieval-augmented models were due to over-reliance.

The Fix: Direct Preference Optimization (DPO)

To fix this, the authors used Direct Preference Optimization (DPO). DPO is a method usually used to align language models with human preferences (like making ChatGPT polite). Here, the authors adapted it to align the model with factuality.

The standard Direct Preference Optimization (DPO) loss function.
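
For readers unfamiliar with it, the standard DPO objective is usually written as:

\[
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]

where \(\pi_\theta\) is the model being tuned, \(\pi_{\mathrm{ref}}\) is a frozen reference copy, \(y_w\) and \(y_l\) are the preferred and dispreferred responses, \(\sigma\) is the sigmoid function, and \(\beta\) controls how far the tuned model may drift from the reference.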

How they built the dataset: They created a specific “preference dataset” to teach the model when to ignore the retrieval.

  1. They identified samples where the model answered correctly on its own (without RAG).
  2. They identified samples where the same model answered incorrectly when RAG was added.
  3. They treated the correct (no-RAG) answer as the preferred response (\(y_{w,o}\)) and the incorrect (RAG-influenced) answer as the dispreferred response (\(y_{l,o}\)).

They then fine-tuned the model using a modified DPO loss function specifically for this knowledge balance:

Equation 8: The Knowledge Balanced Preference Tuning (KBPT) loss function.

This process effectively teaches the model: “If the retrieved text leads you astray, trust your internal training instead.”
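
Conceptually, the KBPT preference data can be assembled along these lines; the data format and function name below are illustrative assumptions, not the authors’ code:

```python
def build_kbpt_preference_pairs(samples):
    """Build DPO-style preference pairs that discourage over-reliance.

    Each sample is assumed to be a dict with:
      question            -- the clinical question about the image
      answer_without_rag  -- the model's answer from the image alone
      answer_with_rag     -- the model's answer once retrieved reports are added
      ground_truth        -- the reference answer
    (Illustrative format; the paper's actual pipeline may differ.)
    """
    pairs = []
    for s in samples:
        correct_alone = s["answer_without_rag"] == s["ground_truth"]
        wrong_with_rag = s["answer_with_rag"] != s["ground_truth"]
        if correct_alone and wrong_with_rag:
            pairs.append({
                # the full prompt would also carry the image and, presumably, the retrieved context
                "prompt": s["question"],
                "chosen": s["answer_without_rag"],   # preferred response (y_w)
                "rejected": s["answer_with_rag"],    # dispreferred response (y_l)
            })
    return pairs
```

Training on these pairs with the modified DPO loss above nudges the model back toward the answers it produced from the image alone whenever retrieval would have led it astray.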

Experimental Results

The researchers tested RULE on three major medical datasets: IU-Xray, MIMIC-CXR (radiology), and Harvard-FairVLMed (ophthalmology). They compared it against baseline models like LLaVA-Med and decoding-based hallucination mitigation techniques (such as greedy decoding and DoLa).

Accuracy Improvements

The results were impressive. RULE significantly outperformed the baseline LLaVA-Med-1.5 and other methods.

Table 2: Factuality performance on VQA datasets. RULE achieves state-of-the-art results across accuracy, precision, and recall.

In Visual Question Answering (VQA) tasks, RULE achieved the highest accuracy across all datasets. For example, on the IU-Xray dataset, accuracy jumped from 75.47% (baseline) to 87.84% (RULE).

They saw similar improvements in report generation tasks (writing full medical descriptions), measured by metrics like BLEU and ROUGE-L.

Table 3: Performance on report generation datasets showing significant improvements in BLEU and ROUGE-L scores.

Why does it work?

To prove that KBPT was actually fixing the “over-reliance” issue, the authors visualized the model’s attention—essentially, what the AI is looking at when it generates an answer.

Figure 3: Attention maps comparing the model without KBPT vs. with KBPT.

In Figure 3, look at the attention maps on the right (b):

  • w/o KBPT (Top): The model spends a lot of “energy” attending to the retrieved tokens (the red blocks on the right side of the heat map). It answers incorrectly.
  • w/ KBPT (Bottom): The model shifts its attention. It focuses less on the retrieved text and more on the question and the image. It answers correctly (“No”).

This confirms that the fine-tuning successfully taught the model to weigh its internal knowledge more heavily when necessary.

Compatibility

The authors also showed that RULE isn’t just a one-off success for a specific model version. They applied it to different backbones (LLaVA-Med-1.0 and 1.5) and consistently saw improvements.

Figure 4: RULE shows consistent improvements across different model backbones.

Real-World Examples

To visualize the impact, let’s look at two specific cases where RULE saved the day.

Figure 5: Case studies in radiology and ophthalmology.

In Figure 5:

  1. Top Case (Lung X-ray): The question asks about “focal airspace consolidation.” The standard LLaVA-Med says “Yes.” The RAG model gets confused by an irrelevant retrieved report about “presbyopia” (an eye condition!) and says “No” for the wrong reason. RULE correctly identifies that the answer is “No” based on the image, ignoring the bad retrieval.
  2. Bottom Case (Eye Fundus): The standard model identifies presbyopia correctly. The RAG model looks at a retrieved document that says “no presbyopia” and blindly copies it, resulting in an error. RULE balances the two, trusts the visual evidence, and correctly answers “Yes.”

Conclusion

The “RULE” framework highlights a critical nuance in deploying AI for high-stakes fields like medicine: more data isn’t always better. Simply retrieving external documents (RAG) can close some knowledge gaps, but it can also introduce noise and bias.

By statistically calibrating how much to read (Factuality Risk Control) and fine-tuning the model to resist peer pressure from bad retrievals (Knowledge Balanced Preference Tuning), we can build medical AI that is not only smarter but more reliable.

This research provides a robust blueprint for the future of multimodal medical AI, moving us closer to automated systems that clinicians can actually trust.