Large Language Models (LLMs) are rapidly entering the healthcare space. We use them to summarize patient visits, answer medical questions, and extract structured data from messy clinical notes. The promise is enormous: automated systems that can read through thousands of patient histories to identify those at risk due to Social Determinants of Health (SDOH).
However, a recent study reveals a critical flaw in how these models “reason.” It turns out that models like Llama and Qwen often don’t read clinical notes the way a human would. Instead, they rely on shortcut learning—superficial patterns that allow them to guess the answer without understanding the context.
In this deep dive, we will explore a paper that investigates a specific, dangerous shortcut: how the mere mention of alcohol or tobacco can trick an AI into falsely accusing a patient of illicit drug use. We will also uncover how patient gender exacerbates these errors and look at prompt engineering strategies to fix them.
The Problem: Social Determinants of Health and Trust
Social Determinants of Health (SDOH) are the non-medical factors that influence health outcomes. These include housing stability, employment, and substance use (alcohol, tobacco, and drugs). Extracting this information from unstructured clinical text—the free-text notes doctors write—is vital for population health analytics.
The researchers focused on a specific sub-task: Drug Status Time Classification. When a model reads a clinical note, it usually performs two steps (a minimal code sketch follows the list):
- Trigger Identification: Find a word related to drug use (e.g., “heroin,” “marijuana,” “illicit substances”).
- Argument Resolution: Determine the status—is the use Current, Past, or None?
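Here is that sketch, assuming a simple regex-based trigger step and a prompt-based status step. The trigger lexicon, prompt wording, and function names are illustrative assumptions, not the paper's implementation.

```python
import re

# Hypothetical drug-mention patterns; illustrative only.
DRUG_TRIGGERS = [r"heroin", r"marijuana", r"cocaine", r"illicit (?:drugs|substances)"]

def find_drug_triggers(note: str) -> list[str]:
    """Step 1: trigger identification via simple pattern matching."""
    return [m.group(0)
            for pat in DRUG_TRIGGERS
            for m in re.finditer(pat, note, flags=re.IGNORECASE)]

def build_status_prompt(note: str, trigger: str) -> str:
    """Step 2: ask an LLM to resolve the status argument for one trigger."""
    return (
        "Classify the patient's drug use status for the substance mentioned "
        "below as Current, Past, or None.\n"
        f"Clinical note: {note}\n"
        f"Substance mention: {trigger}\n"
        "Status:"
    )

note = "Social history: drinks wine socially. Denies any illicit drugs."
for trigger in find_drug_triggers(note):
    print(build_status_prompt(note, trigger))
```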
The study highlights a phenomenon called Spurious Correlations. This occurs when a model learns a statistical association that isn’t causally true. For example, if 90% of the training data that mentions “partying” also mentions “drug use,” the model might learn that “partying” equals “drug use,” even if the specific note says “patient enjoys partying but denies drugs.”
Why This Matters
The researchers prioritized a specific metric: the False Positive Rate (FPR). In this context, a false positive means the model predicts a patient is currently using or has used drugs when the text actually says they haven’t (or doesn’t mention it).
In healthcare, a false positive is not just a data error; it is a patient safety issue. Incorrectly labeling a patient as a drug user can lead to stigmatization, biased care from providers, and a breakdown of trust in automated systems.
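The metric itself is simple to compute. The sketch below assumes predictions are encoded as the strings "Current", "Past", and "None"; that label encoding is an assumption for illustration, not the paper's evaluation code.

```python
def false_positive_rate(y_true: list[str], y_pred: list[str]) -> float:
    """FPR for drug status: the fraction of truly 'None' notes that the
    model flags as 'Current' or 'Past' drug use."""
    negatives = [(t, p) for t, p in zip(y_true, y_pred) if t == "None"]
    if not negatives:
        return 0.0
    false_positives = sum(1 for _, p in negatives if p in ("Current", "Past"))
    return false_positives / len(negatives)

# Example: 2 of 4 drug-negative notes are falsely flagged -> FPR = 0.5
print(false_positive_rate(
    ["None", "None", "None", "None"],
    ["Current", "None", "Past", "None"],
))
```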
The Methodology: Setting the Trap
To test for these shortcuts, the authors used the MIMIC-III dataset (a massive database of de-identified health records) specifically annotated for social history (the SHAC dataset).
They set up a clever experiment to test if models were “hallucinating” drug use based on the presence of other substances. They categorized clinical notes into two types:
- Substance-positive: Notes that document alcohol or smoking use.
- Substance-negative: Notes that do not mention alcohol or smoking use.
In all cases tested for false positives, the ground truth regarding drug use was None.
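As a rough illustration of how such a partition could be scripted: the study relies on SHAC's gold substance annotations, so the keyword lists below are purely an assumption for demonstration.

```python
import re

# Illustrative keyword lists; the study uses SHAC's gold annotations rather
# than keyword matching, so treat this as a conceptual sketch only.
ALCOHOL_TERMS = re.compile(r"\b(alcohol|etoh|beers?|wine|liquor)\b", re.IGNORECASE)
SMOKING_TERMS = re.compile(r"\b(smok\w*|tobacco|cigarettes?|ppd)\b", re.IGNORECASE)

def categorize(note: str) -> str:
    """Assign a note to a substance-positive or substance-negative group."""
    has_alcohol = bool(ALCOHOL_TERMS.search(note))
    has_smoking = bool(SMOKING_TERMS.search(note))
    if has_alcohol and has_smoking:
        return "smoking+alcohol-positive"
    if has_alcohol:
        return "alcohol-positive"
    if has_smoking:
        return "smoking-positive"
    return "substance-negative"

print(categorize("Drinks 2 beers nightly, quit smoking 5 years ago, denies drugs."))
# -> smoking+alcohol-positive
```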
The hypothesis was simple: If the model understands English and clinical context, the presence of the word “alcohol” shouldn’t change its opinion on “drugs.” If the model is using shortcuts, seeing “alcohol” might trigger it to predict “drug use” because those concepts frequently co-occur in training data.
They tested several models, including Llama-3.1-70B, Llama-3.1-8B (fine-tuned), Qwen-72B, and the medical-specific Llama3-Med42.
Evidence of Shortcut Learning
The results were stark. The models exhibited massive biases when alcohol or smoking was mentioned in the text.

Take a look at Table 1 above. Focus on the first column (Llama-70B Zero-shot).
- Alcohol-negative notes: When alcohol was NOT mentioned, the False Positive Rate (FPR) for drug detection was 28.83%. This is already high (a known issue with zero-shot extraction), but it serves as the baseline.
- Alcohol-positive notes: When the note mentioned alcohol use, the FPR for drug detection skyrocketed to 66.21%.
This means that in two-thirds of the cases where a patient admitted to drinking alcohol (but not drugs), the model incorrectly claimed they used drugs. The model is essentially assuming, “If they drink, they probably use drugs too.”
The row labeled Smoking+Alcohol is even worse. If a note mentioned both smoking and alcohol, the Llama-70B model yielded a 73.26% False Positive Rate: nearly three out of four patients who both smoked and drank were falsely labeled drug users.
Is the Trigger Causal?
To prove that the words “alcohol” or “smoking” were the direct cause of these errors, the researchers performed an ablation study. They took the exact same notes and simply deleted the alcohol or smoking keywords, then ran the models again.
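Conceptually, this ablation is a counterfactual text edit. The sketch below approximates it with keyword deletion; the keyword list is an assumption, and the paper's exact removal procedure may differ.

```python
import re

# Illustrative keyword patterns; the paper's exact removal procedure may differ.
SUBSTANCE_TERMS = re.compile(
    r"\b(alcohol|etoh|beers?|wine|liquor|smok\w*|tobacco|cigarettes?|ppd)\b",
    re.IGNORECASE,
)

def remove_substance_mentions(note: str) -> str:
    """Produce a counterfactual note with alcohol/smoking terms deleted."""
    stripped = SUBSTANCE_TERMS.sub("", note)
    return re.sub(r"\s{2,}", " ", stripped).strip()  # tidy leftover whitespace

original = "Social history: drinks wine daily, smokes 1 ppd. Denies illicit drugs."
print(remove_substance_mentions(original))
# The model is then re-queried on the edited note, and the resulting FPR is
# compared against the FPR on the original text.
```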

Table 3 confirms the hypothesis. Look at the Llama-3.1-70B Zero-shot column:
- Full Text: 66.21% FPR (Alcohol-positive).
- Without Alcohol: When the alcohol triggers were removed, the FPR dropped to 55.17%.
While the error rate didn’t drop to zero (suggesting the model has other biases or struggles with the “None” class generally), the significant decrease shows that the presence of alcohol terms directly drives the model to hallucinate drug use, confirming that it relies on superficial cues rather than deep semantic understanding.
The Hidden Bias: Gender Disparities
The study uncovered a second, perhaps more disturbing layer of spurious correlation: Demographic Bias.
The researchers analyzed the performance based on the biological sex of the patients mentioned in the notes. If models are objective, the error rate should be roughly the same for male and female patients. It was not.

Table 2 illustrates a systematic bias against male patients.
- In the Alcohol-positive scenario (Llama-70B Zero-shot), the False Positive Rate for Female patients was 53.66%.
- For Male patients, it was 71.15%.
The model is roughly 17 percentage points more likely to falsely accuse a man of drug use than a woman, given the same context of alcohol consumption.
This suggests the model has learned a “gender shortcut.” During its pre-training on the vast internet, the model likely encountered more text associating men with drug use than women. It is now applying that statistical likelihood to individual clinical notes, effectively profiling patients based on their gender.
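This kind of fairness audit amounts to stratifying the false positive rate by demographic group. Here is a minimal sketch, assuming each note is stored with the patient's documented sex, the gold label, and the model's prediction; the records and values below are hypothetical, not the study's data.

```python
# Hypothetical records: made-up values for illustration only.
records = [
    {"sex": "F", "gold": "None", "pred": "None"},
    {"sex": "F", "gold": "None", "pred": "Current"},
    {"sex": "M", "gold": "None", "pred": "Current"},
    {"sex": "M", "gold": "None", "pred": "Past"},
]

def fpr_by_group(records: list[dict], group_key: str = "sex") -> dict[str, float]:
    """False positive rate among gold-'None' notes, stratified by group."""
    rates = {}
    for group in sorted({r[group_key] for r in records}):
        negatives = [r for r in records if r[group_key] == group and r["gold"] == "None"]
        fps = [r for r in negatives if r["pred"] in ("Current", "Past")]
        rates[group] = len(fps) / len(negatives) if negatives else 0.0
    return rates

print(fpr_by_group(records))  # {'F': 0.5, 'M': 1.0}
```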
Interestingly, the gap persists for Llama-8B SFT (a smaller model fine-tuned on this specific dataset). Even after domain-specific training, the model retains the bias, showing that these prejudices are deeply embedded in the pre-trained weights.
Can We Fix It? Mitigation Strategies
Identifying the problem is only half the battle. The researchers tested several prompting strategies to see if they could force the models to think more clearly and abandon these shortcuts.
They evaluated three main strategies:
- In-Context Learning (ICL): Providing the model with 3 correct examples before asking it to predict.
- Chain-of-Thought (CoT): Explicitly instructing the model to “reason step-by-step” and explain its logic before giving an answer.
- Warning-Based Prompts: Prepending the prompt with instructions like “Evaluate each factor independently” and “Never assume one behavior implies another.” Illustrative templates for all three strategies are sketched below.
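The three strategies differ mainly in how the prompt is assembled. The templates below are a minimal reconstruction based on the descriptions above; the wording is an assumption, not the paper's exact prompt text.

```python
# Illustrative prompt templates for the three mitigation strategies.
# Wording is reconstructed from the descriptions above, not copied from the paper.

TASK = ("Classify the patient's drug use status in the note below as "
        "Current, Past, or None.")

def icl_prompt(note: str, examples: list[tuple[str, str]]) -> str:
    """In-context learning: prepend a few solved examples."""
    demos = "\n\n".join(f"Note: {n}\nStatus: {s}" for n, s in examples)
    return f"{TASK}\n\n{demos}\n\nNote: {note}\nStatus:"

def cot_prompt(note: str) -> str:
    """Chain-of-thought: ask for step-by-step reasoning before the answer."""
    return (f"{TASK}\nThink step by step: quote the sentences that discuss "
            f"drug use, then state the status.\n\nNote: {note}\nReasoning:")

def warning_prompt(note: str) -> str:
    """Warning-based: explicit instructions against cross-substance inference."""
    return (f"{TASK}\nEvaluate each substance independently. Never assume that "
            f"one behavior (e.g., alcohol or tobacco use) implies another.\n\n"
            f"Note: {note}\nStatus:")
```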
Which Strategy Worked?

Table 5 displays the results of these mitigation strategies across different models (Qwen and Med42).
Chain-of-Thought (CoT) emerged as the most effective intervention.
- Look at the Alcohol-positive row for the Qwen-72B model (right side of the table).
- ICL (Baseline): 62.76% FPR.
- CoT: The rate drops dramatically to 28.97%.
By forcing the model to articulate its reasoning—“The patient mentions drinking beer. The text says ‘denies illicit drugs’. Therefore, drug status is None”—the model is prevented from jumping to the statistical conclusion (“Drinks = Drugs”).
Warning-Based prompts also helped, lowering the Qwen FPR to 34.38%, but they were generally less effective than CoT. This suggests that simply telling an AI “don’t be biased” is less effective than forcing it to show its work.
However, it is crucial to note that bias was not eliminated. Even with the best mitigation strategies, the False Positive Rates remained clinically unacceptable in many cases.
Implications for Healthcare AI
This research serves as a sobering reality check for the deployment of Generative AI in medicine.
- The “Clever Hans” Effect: Just because an LLM gives the right answer 80% of the time doesn’t mean it knows why. It may be relying on shortcuts that break as soon as the context shifts (e.g., a patient who smokes but doesn’t do drugs).
- Amplifying Human Bias: The study notes that the biases found in the models (inferring drug use from smoking, gender profiling) mirror documented biases among human healthcare providers. The models are holding a mirror up to the data they were trained on. If we aren’t careful, deploying these models could automate and scale up existing systemic discrimination.
- Documentation Standards: The results suggest that clinicians need to be hyper-aware of how they document. Ambiguous phrasing that relies on human inference might be misinterpreted by AI systems.
Conclusion
The extraction of social determinants of health is a powerful application for LLMs, potentially unlocking data that can help treat the whole patient. However, this paper demonstrates that we cannot blindly trust these models to perform “reasoning.”
The models exhibited severe shortcut learning, treating alcohol and smoking as proxies for drug use, and demographic bias, profiling male patients more harshly than female patients. While techniques like Chain-of-Thought prompting significantly reduce these errors, they do not eradicate them.
For students and researchers entering this field, the takeaway is clear: Accuracy metrics like F1-score are not enough. We must probe our models for spurious correlations, test them against counterfactuals (like removing trigger words), and audit them for demographic fairness before they are allowed near patient care. Future work must move beyond prompt engineering toward more robust training methods that can unlearn these deep-seated associations.