Imagine you are feeling unwell. You have a headache, a slight fever, and a history of asthma. You open a chat window with a powerful AI assistant and ask, “What can I take for this?” The AI confidently recommends a specific combination of pills.

Ideally, this interaction saves you a trip to the doctor. But realistically, you face a terrifying question before acting on the advice: is the AI hallucinating?

Large Language Models (LLMs) have demonstrated incredible prowess in passing medical licensing exams and summarizing patient notes. However, medication management is a high-stakes game where “mostly correct” isn’t good enough. A hallucination here doesn’t just mean a weird sentence; it means recommending a dosage that is toxic or a drug interaction that could be fatal.

In a recent paper titled “Should I Believe in What Medical AI Says?”, researchers from Xunfei Healthcare Technology introduce ChiDrug, a new systematic benchmark designed to stress-test LLMs on Chinese medication tasks. This article will break down their work, explaining how they separate “memorized facts” from “logical reasoning,” and why even the most advanced models still struggle to know their own limits.

The Problem: Confidence Without Knowledge

The core issue the researchers address is the distinction between fluency and accuracy. An LLM can write a grammatically perfect prescription for a drug that doesn’t exist or for a patient population that should never take it (e.g., pregnant women or infants).

Previous benchmarks have attempted to measure medical competency, but they often suffer from two flaws:

  1. Data Leakage: Many benchmarks scrape questions from the open web (like medical forums or Wikipedia). Since LLMs are trained on the web, they might just be reciting answers they’ve already seen, rather than demonstrating understanding.
  2. Muddled Metrics: They often conflate knowledge (Does the model know the dosage of Aspirin?) with reasoning (Can the model determine if Aspirin is safe for this specific patient?).

ChiDrug aims to solve this by constructing a dataset from scratch using authoritative drug brochures, specifically checking for Knowledge Boundaries—the ability of a model to realize when it doesn’t know the answer.

The ChiDrug Framework: Knowledge vs. Reasoning

To rigorously evaluate an AI, the researchers structured the ChiDrug benchmark into two distinct dimensions: Parametric Knowledge and Reasoning Capability.

Figure 1: Our benchmark involves four datasets that directly examine model parametric knowledge and two datasets that examine model reasoning ability.

As illustrated in Figure 1 above, the benchmark is not a monolith. It separates tasks based on the cognitive load required:

1. Parametric Knowledge (The “Fact Retrieval” Layer)

These tasks test the static information stored within the model’s parameters during training. They ask, in effect, “Do you have this specific fact memorized?”

  • Indication: What disease does this drug treat?
  • Dosage and Administration: How much should be taken, and how often?
  • Contraindicated Population: Who must not take this drug (e.g., newborns, elderly over 55)?
  • Mechanisms of Action: How does the drug biologically work?

2. Reasoning Capability (The “Logic” Layer)

These tasks require the model to take facts and apply them to a dynamic scenario.

  • Medication Recommendation: Given a patient’s symptoms and their demographic constraints, what is the best treatment?
  • Drug Interaction: If a patient takes Drug A and Drug B together, what is the risk level?

In Figure 1, notice the “Universal Knowledge” boundary. The model might know everything about “Aminophylline Injection” (inside the boundary), but when asked about a specific combination with “Aspirin Paracetamol Caffeine Tablets,” it must use reasoning to recognize that the two drugs act on the same biological pathways (aminophylline and caffeine are both methylxanthines) and flag the potential risk.
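To make this split concrete, here is a minimal sketch of how a single benchmark item might be represented in code. The field names and task labels are illustrative assumptions based on the description above, not the paper’s published data schema.

```python
from dataclasses import dataclass

# Hypothetical representation of one ChiDrug item; field names are
# illustrative, not the paper's actual schema.
@dataclass
class ChiDrugItem:
    task: str          # e.g. "indication", "dosage", "contraindicated_population",
                       # "mechanism_of_action", "medication_recommendation", "drug_interaction"
    dimension: str     # "parametric_knowledge" or "reasoning"
    question: str      # multiple-choice question text (Chinese in the real benchmark)
    options: list[str] # candidate answers, including distractors
    answer: str        # the single correct option
    source_drug: str   # the drug brochure the item was derived from

# Example: a knowledge-layer item about contraindicated populations
item = ChiDrugItem(
    task="contraindicated_population",
    dimension="parametric_knowledge",
    question="Which population must not take drug X?",
    options=["Newborns", "Adults under 40", "Patients with mild headache", "Athletes"],
    answer="Newborns",
    source_drug="Drug X brochure",
)
```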

Constructing a Clean Benchmark

To avoid the data leakage problem mentioned earlier, the team didn’t just download existing quizzes. They built ChiDrug using a semi-automated pipeline involving official drug brochures and human expert verification.

Figure 2: Overview of our benchmark construction process

Figure 2 details this three-step construction pipeline:

  1. Extraction (Left): They collected 8,000 official drug brochures. Using a helper LLM (Spark), they extracted key sections (Indications, Dosage, etc.) and converted them into multiple-choice questions.
  2. Simulation (Middle): To test recommendations, they used real doctor-patient dialogues. They masked the doctor’s actual prescription and asked the AI to identify the correct drug from a list that included “distractors”—drugs that treat the symptom but are dangerous for that specific patient (e.g., a drug that treats fever but is banned for pregnant women).
  3. Interaction Logic (Right): They identified drug pairs with known interaction risks (High, Medium, Low) and asked the model to classify the risk.
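To give a feel for the extraction step (step 1 above), the sketch below shows how brochure sections might be turned into multiple-choice questions with a helper LLM. The prompt wording, the JSON schema, and the `helper_llm` callable are assumptions; the paper uses Spark in this role but does not publish its exact prompts.

```python
import json

def extract_mcq(helper_llm, brochure: dict, section: str) -> dict:
    """Ask a helper LLM to turn one brochure section into a multiple-choice
    question. `helper_llm` is a stand-in for any text-completion callable;
    the prompt and JSON keys here are illustrative, not the paper's."""
    prompt = (
        f"Drug: {brochure['name']}\n"
        f"Section ({section}): {brochure[section]}\n"
        "Write one multiple-choice question with 4 options that tests this "
        "section. Return JSON with keys: question, options, answer."
    )
    return json.loads(helper_llm(prompt))

def build_knowledge_items(helper_llm, brochures: list[dict]) -> list[dict]:
    """Walk every brochure and every relevant section, collecting MCQs."""
    sections = ["indications", "dosage", "contraindications", "mechanism"]
    items = []
    for brochure in brochures:
        for section in sections:
            if brochure.get(section):
                mcq = extract_mcq(helper_llm, brochure, section)
                mcq["task"] = section
                items.append(mcq)
    return items
```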

The “Double-Check” Mechanism

Because an AI helped generate the questions, the researchers implemented a rigorous quality control process. Every question was reviewed by three other LLMs (GPT-4, Qwen-max, ERNIE). A question was only kept if all three models agreed it was unambiguous and had a unique correct answer. Finally, licensed human doctors reviewed the dataset to ensure medical accuracy.
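The consensus filter itself is easy to express in code. A minimal sketch, assuming each reviewer model is wrapped as a callable that returns an option label:

```python
def passes_consensus(question: dict, reviewers: list) -> bool:
    """Keep a question only if every reviewer model independently answers it
    the same way as the labeled key, i.e. the item is unambiguous and has a
    unique correct answer. Each reviewer is assumed to be a callable that
    takes a prompt string and returns an option label."""
    prompt = question["question"] + "\nOptions: " + "; ".join(question["options"])
    return all(reviewer(prompt).strip() == question["answer"] for reviewer in reviewers)

# Usage (the reviewer callables would wrap the GPT-4, Qwen-max, and ERNIE APIs):
# kept = [q for q in candidates if passes_consensus(q, [gpt4, qwen_max, ernie])]
```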

The result is a set of varied and difficult questions. You can see examples of these questions in the case study below:

Figure 6: Partial cases of ChiDrug on 6 sub datasets.

Look at the Dosage and Administration example in Figure 6. The question asks for the recommended dosage of “Compound Zinc Ibuprofen Granules.” The distractors are subtle (varying ages and packet counts). This requires precise knowledge, not just general guessing.

Experiment Results: Who Is the Best Doctor?

The researchers tested a mix of 8 closed-source models (proprietary, accessible via API) and 5 open-source models (downloadable weights). The lineup included heavy hitters like GPT-4o and Claude 3.5 Sonnet, as well as Chinese-specific models like XiaoYi, GLM4, and ERNIE Bot.

The Leaderboard

Table 1: This table presents the performance of 8 closed-source models and 5 open-source models across various medication-related tasks. Bold indicates the best performance, while underlining denotes the second-best.

Table 1 reveals several critical insights:

  1. Closed-Source Dominance: Closed-source models generally outperformed open-source ones. This is likely due to the massive scale of data and reinforcement learning used in proprietary models.
  2. The “Home Field” Advantage: The top performer was XiaoYi, a specialized Chinese medical model, followed closely by GLM4 and GPT-4o. XiaoYi dominated in specific knowledge tasks like “Mechanism of Action” and “Dosage.”
  3. The Difficulty of Safety: Look at the “Contraindicated Population” column. Scores here are significantly lower across the board compared to “Mechanism of Action.” Models are good at explaining how a drug works biologically (textbook knowledge) but struggle to identify exactly who shouldn’t take it (safety constraints).

We can visualize these strengths and weaknesses using radar charts.

Figure 4: Radar Chart Representation of Closed-Source Models’ Performance.

In Figure 4, you can see the shape of competence. XiaoYi (the purple line) stretches furthest to the outside edges, particularly in Dosage and Indication. However, almost all models dip inward on Drug Interaction and Medication Recommendation. This visualizes a critical gap: models are better at being encyclopedias (parametric knowledge) than being doctors (reasoning).

The Reasoning Paradox: Can Logic Replace Knowledge?

One of the most fascinating parts of this paper is the analysis of “Reasoning Models.” Recently, models like OpenAI’s o1 have been released, which are trained to “think” before they speak (Chain-of-Thought). The assumption is that better thinking leads to better answers.

However, the ChiDrug benchmark challenges this assumption in the medical domain.

Table 4: Zero-shot accuracy of reasoning models across medical knowledge tasks.

As shown in Table 4, OpenAI o1 actually performed worse than the standard GPT-4o in several tasks, particularly in “Drug Interaction” (45.10% vs 59.93%).

Why does the “smarter” model fail? The researchers hypothesize that reasoning is useless without the underlying facts. If a model tries to reason about a drug interaction but doesn’t have the specific chemical ingredients of the drug stored in its parametric knowledge, no amount of logic will lead to the correct answer. It is like asking a master logician to solve a math problem where the variables are written in a language they don’t speak.

To prove this, the researchers ran a second experiment where they provided the full drug brochures in the prompt (giving the model “perfect knowledge”).

Table 5: Accuracy on reasoning tasks with knowledge-complete prompts.

Table 5 confirms the hypothesis. Once the knowledge gap was closed (by giving the model the text), the reasoning models (o1 and o3-mini) immediately jumped to the top, significantly outperforming GPT-4o.

Key Takeaway: You cannot reason your way out of a lack of knowledge. For medical AI, RAG (Retrieval-Augmented Generation) or comprehensive pre-training is non-negotiable.
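In practice, the “knowledge-complete” setup behaves like a bare-bones retrieval-augmented prompt: paste the relevant brochure text into the context before the question, so the model only has to reason rather than recall. A minimal sketch, where the brochure lookup and prompt wording are assumptions rather than the paper’s exact setup:

```python
def knowledge_complete_prompt(question: str, options: list[str], brochures: list[str]) -> str:
    """Build a prompt that closes the knowledge gap by including the full
    drug brochure text in the context. `brochures` would come from a
    retrieval step or a simple drug-name lookup (assumed here)."""
    context = "\n\n".join(brochures)
    option_lines = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (
        "You are answering a medication question. Use ONLY the brochure text below.\n\n"
        f"--- Drug brochures ---\n{context}\n\n"
        f"Question: {question}\n{option_lines}\n"
        "Answer with a single option letter."
    )
```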

The Knowledge Boundary: Getting AI to Say “I Don’t Know”

Perhaps the most dangerous aspect of medical AI is overconfidence. A model that guesses a dosage is far worse than a model that admits it doesn’t know. The researchers define this as the Knowledge Boundary.

They formalized the task of “Abstention” using the following equation:

\[
r =
\begin{cases}
M(q), & \text{if } q \text{ lies within the model's parametric knowledge} \\
U, & \text{if } q \text{ lies beyond the knowledge boundary}
\end{cases}
\]

Here, \(r\) is the response. If the query \(q\) is within the model’s parametric knowledge, it should output the answer \(M(q)\). If it is beyond the boundary, it should output \(U\) (Uncertainty/Abstention).
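Operationally, this is a thin gate around the model: answer when some confidence estimate clears a threshold, otherwise emit the abstention token. A minimal sketch, where the confidence estimator and threshold are placeholders (one concrete estimator, Semantic Entropy, appears later in the paper):

```python
ABSTAIN = "U"  # the abstention token U from the equation above

def answer_or_abstain(model, query: str, confidence_fn, threshold: float = 0.8) -> str:
    """Return M(q) when the query appears to lie within the model's knowledge
    boundary, otherwise the abstention token U. `confidence_fn` is any
    estimator of the model's reliability on this query (hypothetical here)."""
    if confidence_fn(model, query) >= threshold:
        return model(query)  # r = M(q)
    return ABSTAIN           # r = U
```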

Visualizing the Boundary

The team visualized how “stable” each model’s knowledge was across 282 common drugs by querying the models repeatedly on the same questions.

Figure 5: Knowledge boundary chart for GLM4, XiaoYi, and GPT4o across 282 common drugs.

In Figure 5, the Orange area represents robust knowledge—the model got the answer right on the first try. The Yellow area represents “shaky” knowledge—the model got it wrong initially but eventually got it right within 5 attempts.

XiaoYi shows a very dense outer ring of orange, indicating high, stable confidence. GPT-4o, while powerful, has more “yellow” zones, suggesting that its knowledge on specific Chinese medications might be less stable, requiring multiple prompts or retries to access correctly.
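The stability test itself can be sketched as a small loop: ask the same question up to five times and record whether the model was right immediately, eventually, or never. The `is_correct` checker and the exact retry protocol are assumptions based on the description of Figure 5.

```python
def classify_stability(model, question: str, answer: str, is_correct, max_attempts: int = 5) -> str:
    """Classify the model's knowledge of one drug question as:
      'robust'  - correct on the first attempt (orange in Figure 5)
      'shaky'   - wrong at first, correct within max_attempts (yellow)
      'unknown' - never correct within the attempt budget
    `is_correct` compares a model response against the reference answer."""
    for attempt in range(1, max_attempts + 1):
        if is_correct(model(question), answer):
            return "robust" if attempt == 1 else "shaky"
    return "unknown"
```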

Improving Safety with Semantic Entropy

So, how do we force models to respect this boundary? The researchers tested several methods, including:

  • Prompting: Just asking the model “Are you sure?” (Post-calibration).
  • Probing: Inspecting the model’s internal hidden states to estimate how confident it really is.
  • Semantic Entropy (SE): Generating multiple answers and checking if they mean the same thing.

If a model generates five answers that use different words but mean the same thing (low Semantic Entropy), it is likely confident. If the answers differ in meaning (high Semantic Entropy), the model is uncertain and far more likely to be hallucinating.

The researchers found that Semantic Entropy was highly effective. By refusing to answer questions with high Semantic Entropy, they improved the reliability of the models significantly.
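A minimal sketch of the Semantic Entropy idea: sample several answers, cluster them by meaning using an equivalence check (here a hypothetical `means_same` function, which in practice could be an NLI model or another LLM), compute the entropy over the clusters, and refuse to answer when it is high. The threshold and clustering details are assumptions, not the paper’s exact recipe.

```python
import math

def semantic_entropy(answers: list[str], means_same) -> float:
    """Cluster sampled answers by semantic equivalence and return the entropy
    of the cluster distribution. Low entropy = consistent meaning = confident;
    high entropy = conflicting meanings = likely hallucination."""
    clusters: list[list[str]] = []
    for ans in answers:
        for cluster in clusters:
            if means_same(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])  # no existing cluster matched: start a new one
    total = len(answers)
    probs = [len(c) / total for c in clusters]
    return -sum(p * math.log(p) for p in probs)

def answer_with_se_gate(model, query: str, means_same,
                        n_samples: int = 5, threshold: float = 0.7) -> str:
    """Sample n answers; abstain when their semantic entropy exceeds a
    (hypothetical) threshold, otherwise return the first sampled answer.
    A real system might instead return an answer from the largest cluster."""
    samples = [model(query) for _ in range(n_samples)]
    if semantic_entropy(samples, means_same) > threshold:
        return "U"  # abstain: the sampled answers disagree in meaning
    return samples[0]
```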

Conclusion

The ChiDrug benchmark serves as a reality check for the deployment of AI in healthcare. While models like XiaoYi and GPT-4o show impressive capabilities, the gap between “knowing facts” and “safe reasoning” remains substantial.

The paper highlights two critical paths forward for Medical AI:

  1. Knowledge Augmentation: We cannot rely on models to just “know” everything from pre-training. They need access to external, verified databases to bridge the gap that reasoning models like o1 faced.
  2. Uncertainty Modeling: Implementing techniques like Semantic Entropy is crucial. A medical AI must be programmed to be humble—prioritizing patient safety over the appearance of competence.

As we move closer to AI-assisted medicine, benchmarks like ChiDrug act as the rigorous board exams these digital doctors must pass before they can be trusted with our health.