Introduction: The High-Stakes World of Clinical AI
Imagine a doctor dictating notes after a complex surgery. The patient has a history of heart arrhythmia, specific medication allergies, and a new prescription plan. Now, imagine an Artificial Intelligence (AI) system summarizing those notes for the patient’s discharge papers. If the AI hallucinates—fabricating a medication the patient never took or omitting a critical allergy—the consequences could be life-threatening.
Large Language Models (LLMs) like GPT-4 and Llama have revolutionized text summarization. They are fluent, coherent, and fast. However, they suffer from a persistent flaw: hallucination. In creative writing, a hallucination might be a quirky feature; in clinical Natural Language Processing (NLP), it is a dangerous bug.
To fix this, we typically rely on Reinforcement Learning from Human Feedback (RLHF). We ask human experts to review AI outputs, correct them, and feed that data back into the model. But in healthcare, this hits a wall. Expert doctors are incredibly expensive and time-poor, and patient data is protected by strict privacy laws (like HIPAA). We cannot simply crowd-source clinical data annotation to the general public.
This brings us to a fascinating research paper: SYNFAC-EDIT. The researchers propose a novel solution: if we can’t easily get human experts, why not use the most powerful LLMs (like GPT-4) to act as “synthetic experts”? By teaching these giant models to simulate the editing process of a doctor, we can generate vast amounts of training data to teach smaller, cheaper, and privacy-compliant models (like Llama-2 or GPT-2) how to stay factually accurate.
Background: Why Standard Training Isn’t Enough
Before diving into the solution, we must understand the problem with how language models are traditionally trained.
The Flaw of Supervised Fine-Tuning (SFT)
Most models undergo Supervised Fine-Tuning (SFT). In this process, the model is given a clinical note and a “ground truth” summary written by a human. The model tries to predict the summary word-by-word.
The issue is that SFT is indiscriminate. If the model misspells “the,” the loss function penalizes it. If the model changes “10mg” to “100mg,” the loss function penalizes it. SFT often treats a grammatical hiccup and a dangerous medical error with similar weight. It doesn’t inherently understand that one error is annoying while the other is fatal.
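To make this concrete, here is a minimal sketch of the standard token-level cross-entropy objective behind SFT (a toy illustration with random logits, not the paper's training code). Every token position contributes to one averaged loss, so a wrong article and a wrong dosage look identical to the optimizer.

```python
import torch
import torch.nn.functional as F

# Toy setup: the "ground truth" summary is just a sequence of token IDs.
# Whether a token is the word "the" or the dosage "10mg", SFT scores it the same way.
vocab_size = 100
target_ids = torch.tensor([[7, 42, 13, 99]])   # reference summary tokens
logits = torch.randn(1, 4, vocab_size)          # model's predicted distribution per position

# Standard SFT loss: average negative log-likelihood over ALL tokens, equally weighted.
loss = F.cross_entropy(logits.view(-1, vocab_size), target_ids.view(-1))
print(loss)  # one scalar; a wrong dosage token and a wrong article contribute identically
```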
The Alignment Gap
To fix this, researchers align models using preference data—pairs of “good” and “bad” summaries. This allows the model to learn values, not just word probabilities. Methods like DPO (Direct Preference Optimization) have shown great promise here.
However, preference data usually requires a human to look at two summaries and say, “Summary A is better than Summary B.” The researchers behind SYNFAC-EDIT argue that this isn’t enough. Real doctors don’t just rate summaries; they edit them. They cross out wrong dosage instructions and write in the correct ones. This Edit Feedback is a much richer signal for learning, but it is also the hardest data to collect at scale.
The Core Method: Synthetic Imitation Edit Feedback
The SYNFAC-EDIT pipeline is designed to solve the data scarcity problem by using massive LLMs (>100B parameters) to generate high-quality edit feedback, which is then used to train smaller, weaker models (<10B parameters).
The researchers devised a clever pipeline that operates in two distinct directions to create training data. Let’s look at the architecture.

As shown in Figure 1, the standard approach (a) is straightforward but limited. The proposed method introduces two new pathways, (b) and (c), which we will break down in detail.
Direction 1: High \(\rightarrow\) Low (Generating Hallucinations)
This approach, shown in Figure 1(c), seems counterintuitive at first. Why would we want to generate bad summaries?
The goal of alignment training is to show the model the difference between a factual summary and a hallucinated one. Since we already have the “Ground Truth” (the human-written summary), we treat that as the High-Quality (Preferred) option. We need a corresponding Low-Quality (Dispreferred) option to complete the pair.
The researchers prompt a “Synthetic Expert” (like GPT-4) to take a correct summary and intentionally insert hallucinations using specific edit operations. By controlling exactly how the summary is “corrupted,” they obtain a negative example that looks plausible but is medically wrong.
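The paper’s exact prompts live in its appendix; the snippet below is only a rough sketch of what such a corruption request could look like. The wording and the `build_corruption_request` helper are our own illustration, not the published prompt.

```python
# Hypothetical sketch of a High -> Low ("hallucination-inducing") request to a synthetic expert.
CORRUPTION_PROMPT = """You are editing a clinical discharge summary.
Using only ADD and OMIT operations, introduce factual errors:
- OMIT one medico-legally essential detail.
- ADD one plausible but incorrect or non-essential detail.
List each edit instruction, then output the edited summary.

Clinical note:
{note}

Ground-truth summary:
{summary}
"""

def build_corruption_request(note: str, summary: str) -> str:
    """Fill the template for one (note, summary) pair."""
    return CORRUPTION_PROMPT.format(note=note, summary=summary)
```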
Direction 2: Low \(\rightarrow\) High (Fixing Mistakes)
This approach, shown in Figure 1(b), mirrors the real world.
- A weaker model (like a base GPT-2 or Llama-2) generates a summary. It likely contains errors.
- The “Synthetic Expert” (GPT-4) reviews this draft against the original clinical note.
- The expert edits the draft to fix factual errors, creating a High-Quality version.
Now, the model has a pair: the original flawed draft (Dispreferred) and the synthetic expert’s fix (Preferred).
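Whichever direction is used, the output is the same kind of artifact: a preference pair anchored to the source note. Below is a minimal sketch of how one training record might be organized; the field names are assumptions for illustration, not the paper’s schema.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One alignment training example built from synthetic edit feedback."""
    clinical_note: str   # the source document both summaries are judged against
    preferred: str       # Low -> High: the expert-edited draft; High -> Low: the ground truth
    dispreferred: str    # Low -> High: the weak model's draft; High -> Low: the corrupted summary

# Low -> High example: the weak model's flawed draft is dispreferred,
# the synthetic expert's corrected version is preferred.
pair = PreferencePair(
    clinical_note="...full discharge note...",
    preferred="Patient underwent coronary artery bypass graft x 3; ...",
    dispreferred="Patient was admitted for chest pain; ...",
)
```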
The Anatomy of an Edit
The researchers didn’t just ask GPT-4 to “rewrite this.” They forced the model to generate explicit Edit Instructions. This mimics a teacher grading a paper. They restricted the edits to two specific operations to maintain control: ADD and OMIT.

As defined in Table 6:
- To Improve Factuality (Low \(\rightarrow\) High): The expert uses ADD to include medico-legally essential information that was missing, and OMIT to remove incorrect or non-essential fluff.
- To Induce Hallucination (High \(\rightarrow\) Low): The expert does the reverse—omitting essential details (creating a dangerous gap) or adding non-essential/incorrect info.
This structured approach ensures the feedback is precise and medically relevant, rather than just stylistic changes.
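To illustrate, here is a naive sketch of how structured ADD/OMIT instructions could be represented and applied to a draft summary. The dictionary format and the string-based `apply_edits` helper are assumptions; the paper defines the edit format through its prompts rather than code.

```python
# Illustrative representation of structured edit instructions.
edits = [
    {"op": "ADD",  "text": "Patient underwent a coronary artery bypass graft x 3."},
    {"op": "OMIT", "text": "Patient reports no significant cardiac history."},
]

def apply_edits(summary: str, edits: list[dict]) -> str:
    """Apply ADD/OMIT instructions to a draft summary (naive string-based sketch)."""
    for edit in edits:
        if edit["op"] == "OMIT":
            # Drop the flagged span if it appears verbatim in the draft.
            summary = summary.replace(edit["text"], "").strip()
        elif edit["op"] == "ADD":
            # Append the missing fact to the end of the draft.
            summary = summary.rstrip() + " " + edit["text"]
    return summary
```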
Alignment Algorithms: DPO and SALT
Once this synthetic dataset of “Original vs. Edited” pairs is generated, how does the weaker model learn from it? The paper employs two advanced alignment algorithms:
- DPO (Direct Preference Optimization): This effectively tells the model, “When given this clinical note, the probability of generating the Preferred Summary should go up, and the probability of the Dispreferred Summary should go down.”
- SALT (Sequence Alignment Learning): This technique is specifically designed for edit feedback. It aligns the two summaries to identify exactly which tokens changed. It rewards the model for keeping the good parts (the intersection of both summaries) and the unique parts of the preferred summary, while penalizing the unique parts of the bad summary.
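The DPO objective is easy to state in code. The sketch below is a minimal, self-contained illustration rather than the paper’s implementation: each argument is the summed log-probability of a whole summary under either the model being trained or a frozen reference model, and `beta` controls how hard the preference margin is enforced.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_preferred, policy_logp_dispreferred,
             ref_logp_preferred, ref_logp_dispreferred, beta=0.1):
    """Direct Preference Optimization loss (sketch)."""
    # How much more the policy likes the preferred summary than the reference does...
    preferred_margin = policy_logp_preferred - ref_logp_preferred
    # ...and how much more it likes the dispreferred one.
    dispreferred_margin = policy_logp_dispreferred - ref_logp_dispreferred
    # Push the preferred margin above the dispreferred margin.
    return -F.logsigmoid(beta * (preferred_margin - dispreferred_margin)).mean()

# Toy usage with made-up log-probabilities for one preference pair:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```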
Experiments and Results
The researchers tested their pipeline using two different “Synthetic Experts” (GPT-3.5 and GPT-4) to train two different “Weaker Models” (GPT-2 and Llama-2-7B). They evaluated the results on the MIMIC-III dataset, a standard benchmark for clinical text.
Validating the Synthetic Expert
Before seeing if the students learned anything, we have to check if the teacher is competent. Did GPT-4 and GPT-3.5 actually follow the instructions to ADD and OMIT facts correctly?
Human annotators (medical students and doctors) reviewed the synthetic edits. The results highlighted a significant difference between the models.

Figure 2 reveals an interesting trend. The top chart shows the volume of edits. However, the bottom chart is more telling—it shows how many of those edits actually resulted in the desired outcome (hallucination or factuality improvement).
GPT-4 proved to be a much stricter and more accurate instructor. It followed the prompt constraints better than GPT-3.5. Interestingly, the data showed that generating hallucinations (High \(\rightarrow\) Low) resulted in higher quality preference data than trying to fix errors (Low \(\rightarrow\) High). This suggests that it is easier for a model to break a summary convincingly than to fix a broken one perfectly.
Did the Weaker Models Improve?
The ultimate test is whether Llama-2 and GPT-2 got better at summarizing clinical notes after being trained on this synthetic data.
The researchers measured performance using:
- ROUGE scores: A standard metric for text overlap.
- Factuality Metrics: UMLS-F1 (measuring the accuracy of medical terms) and G-Eval (using GPT-4 to grade factual consistency).
- Human Evaluation: Asking real humans which summary they preferred.
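As a rough sketch of the first two metric families, the snippet below computes ROUGE-L with the open-source `rouge-score` package and an F1 over medical concepts. The real UMLS-F1 relies on a UMLS-based concept extractor; here the concept sets are simply assumed to be given, so the example only shows the overlap arithmetic.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

def concept_f1(reference_concepts: set[str], predicted_concepts: set[str]) -> float:
    """F1 over extracted medical concepts (concept extraction itself is assumed done upstream)."""
    if not predicted_concepts or not reference_concepts:
        return 0.0
    overlap = len(reference_concepts & predicted_concepts)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted_concepts)
    recall = overlap / len(reference_concepts)
    return 2 * precision * recall / (precision + recall)

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge = scorer.score("patient had CABG x 3", "patient underwent CABG x 3")["rougeL"].fmeasure
f1 = concept_f1({"coronary artery bypass", "atrial fibrillation"},
                {"coronary artery bypass"})
```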
Results for High \(\rightarrow\) Low Training
This dataset involved taking good summaries and creating “bad” versions to teach the model what to avoid.

Table 4 shows a clear victory for the SYNFAC-EDIT method. Look at the Human H2H (Head-to-Head) column.
- When GPT-2 was trained using SALT with GPT-4 data, humans preferred it 72% of the time over the standard SFT baseline.
- Llama-2 showed similar gains, with a 74% win rate using SALT and GPT-4 data.
The factuality metrics (UMLS-F1 and G-Eval) also saw significant bumps. This confirms that training on synthetic “negative” examples (hallucinations) effectively teaches the model to stick to the facts.
Results for Low \(\rightarrow\) High Training
This direction involved fixing the weaker model’s mistakes.

Table 5 (focused on GPT-2) reinforces the finding that GPT-4 is the superior teacher. The SALT algorithm combined with GPT-4’s edits resulted in the highest scores across the board.
However, the researchers noted a limitation here. When they tried this “Low \(\rightarrow\) High” method on Llama-2, the results were mixed (as seen in supplementary data). The reason? The “Low” quality summaries generated by small models were sometimes so bad that even the synthetic expert struggled to fix them into something highly educational. The “High \(\rightarrow\) Low” method (corrupting good data) proved more robust and model-agnostic.
Visualizing the Data
To make this concrete, let’s look at what the “Low \(\rightarrow\) High” data actually looks like.


In these examples (Table 17 and continuation), we see the pipeline in action. The “Unaligned Model” misses critical context about the patient’s surgery (Coronary artery bypass).
- GPT-4 catches this immediately. Instruction 1 is an ADD Operation: “Patient underwent a coronary artery bypass graft x 3…”
- GPT-3.5 also catches it, but its edits are often messier or less focused.
The human doctor’s comments confirm that adding the surgery details is “useful” and factually necessary. This validated pair (the bad summary vs. the GPT-4 fixed summary) becomes a single training point for the AI.
Conclusion: The Future of Medical AI Training
The SYNFAC-EDIT paper demonstrates a powerful concept: we can bootstrap intelligence.
By leveraging the capabilities of massive, general-purpose models like GPT-4, we can create synthetic training environments for smaller, specialized models. This approach solves two massive headaches in clinical NLP:
- The Privacy Problem: We don’t need to send thousands of patient records to human annotators. The synthetic expert runs within the secure compute environment.
- The Cost Problem: We don’t need to pay surgeons $300/hour to correct factual errors in AI summaries. GPT-4 can simulate that feedback for a fraction of a cent.
The results are conclusive: models trained with this synthetic edit feedback—particularly utilizing the SALT algorithm and GPT-4 as the teacher—significantly outperform models trained with standard supervised learning. They hallucinate less and capture more medical concepts.
While we are not yet at the point where AI can operate without human oversight, this research bridges the gap. It moves us toward a future where AI assistants in hospitals are not just fluent, but factually reliable, having been “schooled” by synthetic experts on what mistakes to avoid.