Imagine a doctor facing a patient with a complex set of symptoms. To prescribe the right medication, the doctor mentally traverses a flowchart: Is the condition mild or severe? If severe, is there a complication? If yes, use Drug A; otherwise, use Drug B.
This logical roadmap is a Medical Decision Rule (MDR). These rules are the backbone of Clinical Decision Support Systems (CDSS), software that helps medical professionals make safe, evidence-based choices.
Traditionally, building these systems has been a manual, labor-intensive nightmare. Experts have to read textbooks and clinical guidelines, identify the logic, and manually code it into “If-Then” rules. This process is expensive, slow, and hard to scale.
But what if Artificial Intelligence could read the medical textbooks and build these logic trees automatically?
In this post, we are doing a deep dive into the paper “Generative Models for Automatic Medical Decision Rule Extraction from Text.” The researchers propose a shift from traditional extraction methods to Generative Models, leveraging the power of Sequence-to-Sequence (Seq2Seq) architectures and Large Language Models (LLMs) to construct complex Medical Decision Trees (MDTs) directly from raw text.
The Problem: Structuring Medical Logic
The core challenge here is translation. We are translating unstructured natural language (a paragraph in a medical textbook) into a structured, executable logic tree.
A Medical Decision Tree (MDT) is a binary tree composed of two node types (sketched in code after the list):
- Condition Nodes: Internal nodes representing a patient’s status (e.g., “Patient has high blood pressure”).
- Decision Nodes: Leaf nodes representing the clinical recommendation (e.g., “Prescribe ACE inhibitors”).
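To make this concrete, here is a minimal sketch of an MDT as a recursive data structure. The class and field names are illustrative, not from the paper:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MDTNode:
    """One node of a binary Medical Decision Tree (illustrative sketch)."""
    text: str                        # condition or recommendation text
    yes: Optional["MDTNode"] = None  # branch taken when the condition holds
    no: Optional["MDTNode"] = None   # branch taken when it does not

    @property
    def is_decision(self) -> bool:
        # Leaf nodes carry the clinical recommendation
        return self.yes is None and self.no is None

# "If the condition is severe, prescribe Drug A; otherwise, Drug B."
tree = MDTNode(
    text="condition is severe",
    yes=MDTNode(text="prescribe Drug A"),
    no=MDTNode(text="prescribe Drug B"),
)
```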
Take a look at the example below. The text describes how to treat subacute thyroiditis based on severity. The diagram shows the target output: a clean, logical tree where paths diverge based on “Yes/No” answers to clinical conditions.

Standard Information Extraction (IE) techniques—like Named Entity Recognition (NER)—aren’t enough here. NER can find “thyroiditis” or “prednisone,” but it cannot understand the logic connecting them. It doesn’t know that prednisone is conditional on the disease being moderate or severe.
The authors of this paper argue that because these trees represent a logical “language,” generative models (which are designed to produce coherent sequences) are better suited for this task than discriminative classification models.
The Core Method: Linearization and Generation
To use a generative model (like GPT or a Transformer) to build a tree, we first need to solve a representation problem. Neural networks generate sequences of tokens, not tree data structures.
The researchers introduce Linearization: the process of flattening a binary tree into a string of text that a model can learn to predict. They explored three styles, compared in the code sketch after this list:
- Natural Language (NL) Style: The tree is written out as a verbose sentence using “If,” “Then,” “Else,” and “Otherwise.”
- Augmented Natural Language (AugNL) Style: Relation triples (like Subject-Relation-Object) are treated as special, single tokens. This makes the sequence shorter and easier for the model to manage.
- JSON Style: The tree is represented as a nested JSON code block (key-value pairs).
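To see how the three styles differ, here is the toy "Drug A / Drug B" tree from earlier flattened under each one. The exact token inventory comes from the paper; the strings below are a hedged approximation:

```python
import json

# Natural Language (NL) style: verbose If/Then/Else prose
nl_style = "If condition is severe Then prescribe Drug A Else Then prescribe Drug B"

# Augmented NL (AugNL) style: each pre-extracted relation triple is
# collapsed into a single placeholder token, shortening the sequence
triples = {
    "<T1>": ("condition", "status", "severe"),
    "<T2>": ("Drug A", "usage", "prescribe"),
    "<T3>": ("Drug B", "usage", "prescribe"),
}
augnl_style = "If <T1> Then <T2> Else Then <T3>"

# JSON style: the tree as nested key-value pairs
json_style = json.dumps({
    "condition": "condition is severe",
    "yes": {"decision": "prescribe Drug A"},
    "no": {"decision": "prescribe Drug B"},
})
```

Note how AugNL trades verbosity for structure: the decoder only has to arrange a handful of placeholder tokens rather than re-generate every entity span.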
The paper proposes two distinct generative architectures to tackle this: a custom Sequence-to-Sequence (Seq2Seq) model and an Autoregressive (LLM) approach.
Approach 1: The Sequence-to-Sequence Model
This is a specialized architecture designed specifically for this task. It doesn’t just blindly generate text; it first understands the entities and relations, then uses that knowledge to construct the tree.
As illustrated in Figure 2(a) below, the model works in distinct steps. First, it encodes the text. Then, a “Query-based Entity-Relation Extractor” identifies all the medical facts (triples). Finally, a decoder generates the linearized tree.

1. Query-Based Extraction
Before building the tree, the model needs to find the building blocks. The authors use a query-based system where learnable vectors (queries) probe the text to find entities (like “migraine”) and relations (like “symptom”).
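Schematically, this resembles DETR-style set prediction: a fixed set of learnable query vectors cross-attends to the encoded text, and each query is decoded into a relation type plus pointers into the text. A minimal PyTorch sketch, with all dimensions and head names assumed:

```python
import torch
import torch.nn as nn

class QueryExtractor(nn.Module):
    """Learnable queries probe the encoded text for entities and relations."""
    def __init__(self, hidden=768, num_queries=16, num_relations=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden))
        self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.rel_head = nn.Linear(hidden, num_relations)

    def forward(self, text_states):                         # (B, L, H) encoder output
        B = text_states.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)     # (B, Q, H)
        probed, _ = self.attn(q, text_states, text_states)  # queries attend to text
        rel_logits = self.rel_head(probed)                  # relation type per query
        span_scores = probed @ text_states.transpose(1, 2)  # pointer scores over tokens
        return rel_logits, span_scores
```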
2. Relational Context
This is a critical innovation. A standard text decoder only looks at the input text. But a medical decision tree is essentially a collection of logical relationships.
To help the decoder, the authors inject Relational Context. They take the relation triples found in step 1 and feed them into the decoder alongside the original text. This ensures the decoder knows exactly which medical facts are available to be placed into the tree nodes.
The math behind the decoding step incorporates this context:

\[
\hat{y}_t = \arg\max_{y} P\big(y \mid \hat{y}_{<t},\, X,\, \mathcal{C}\big)
\]

Here, \(X\) is the input text and \(\mathcal{C}\) represents the relational context (RQC, RTC, or HRC). The decoder predicts the next token \(\hat{y}_t\) based on the previously generated tokens \(\hat{y}_{<t}\) and this enriched context.
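One straightforward way to realize this in code (a sketch of the idea, not necessarily the paper's exact wiring) is to concatenate the relation representations onto the encoder memory that the decoder cross-attends to:

```python
import torch

def build_decoder_memory(text_states, relation_states):
    """Give the decoder's cross-attention access to both the source
    text and the extracted facts.

    text_states:     (B, L, H) encoded source text
    relation_states: (B, R, H) embeddings of the triples from step 1
    """
    return torch.cat([text_states, relation_states], dim=1)  # (B, L + R, H)
```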
3. Constrained Decoding
Generative models can sometimes “hallucinate” or produce syntactically invalid sequences (e.g., generating an “Else” without a preceding “If”).
To prevent this, the researchers use Constrained Decoding. They use a Trie (prefix tree) to restrict the vocabulary at each step.
- If the last word was “If,” the model is forced to generate an entity next.
- If the last word was “Else,” the next word must be “Then.”
This ensures that the output is always a valid, parseable tree structure.
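A sketch of how such a constraint can be enforced at decode time: keep a trie of all valid continuations and, at every step, only allow tokens that stay inside it. (Hugging Face's `generate` exposes a `prefix_allowed_tokens_fn` hook for exactly this pattern; the toy grammar below is illustrative.)

```python
class Trie:
    """Prefix tree over tokens; its paths are the valid output sequences."""
    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def allowed_tokens(self, prefix):
        """Tokens that may legally follow the generated prefix."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []   # prefix left the grammar: nothing is allowed
            node = node[tok]
        return list(node.keys())

# Toy word-level grammar matching the rules above: an "If" must be
# followed by an entity, an "Else" must be followed by "Then", etc.
valid = [
    ["If", "<entity>", "Then", "<entity>"],
    ["If", "<entity>", "Then", "<entity>", "Else", "Then", "<entity>"],
]
trie = Trie(valid)
print(trie.allowed_tokens(["If"]))                      # ['<entity>']
print(trie.allowed_tokens(["If", "<entity>", "Then",
                           "<entity>", "Else"]))        # ['Then']
```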
Approach 2: Autoregressive Models (LLMs)
The second approach leverages the power of Large Language Models (specifically ChatGLM and ChatGPT). Unlike the Seq2Seq model, which uses an encoder-decoder structure, these are decoder-only models.
The researchers experimented with two strategies: In-Context Learning (ICL) and Supervised Fine-Tuning (SFT).
Multi-Task Joint Fine-Tuning
Simply asking an LLM to “extract the tree” can be overwhelming. To improve performance, the authors devised a multi-task training regimen. The model is trained on three tasks simultaneously:
- Medical Decision Rule Extraction: The main goal (generating the tree).
- Relation Triple Extraction: A simpler auxiliary task to list all medical facts.
- Tree Shape Extraction: A structural auxiliary task to outline the logic skeleton (e.g., “If… or… then…”).
This works like a curriculum. By learning to identify the facts (Task 2) and the skeleton (Task 3), the model becomes better at the complex main task.
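A hedged sketch of what the joint training data might look like: each source paragraph yields one example per task, and fine-tuning batches are drawn from the mixed pool. The prompt wording here is illustrative (the paper's actual prompts are in Chinese; see Figure 4):

```python
import random

def make_examples(text, triples, tree_shape, tree):
    """Turn one annotated paragraph into three supervised examples."""
    return [
        {"task": "rule",  "prompt": f"Extract the medical decision tree:\n{text}",
         "target": tree},
        {"task": "re",    "prompt": f"List all (subject, relation, object) triples:\n{text}",
         "target": triples},
        {"task": "shape", "prompt": f"Outline the If/Then/Else skeleton:\n{text}",
         "target": tree_shape},
    ]

def sample_batch(pool, weights, batch_size=8):
    """Draw a mixed batch according to per-task sampling weights."""
    tasks = random.choices(list(weights), weights=list(weights.values()), k=batch_size)
    return [random.choice(pool[task]) for task in tasks]
```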
Figure 4 below shows the original prompts (translated) used to guide the model through these tasks. Notice how the prompts explicitly break down the logic into structural components.

Progressively-Dynamic Sampling
The authors didn’t just throw all tasks at the model at once. They used a dynamic sampling strategy.
- Early training: High focus on easy tasks (Relation Extraction).
- Late training: Focus shifts almost entirely to the main task (Tree Extraction).
This mimics human learning—master the basics (facts) before tackling the complex (logic trees).
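Plugged into the `sample_batch` sketch above, the schedule might look like the following. The exact curve the paper uses is not reproduced here; the linear interpolation and the endpoint weights are assumptions:

```python
def task_weights(step, total_steps):
    """Shift sampling mass from auxiliary tasks to the main task over training."""
    progress = step / total_steps          # 0.0 at the start, 1.0 at the end
    main = 0.2 + 0.7 * progress            # main task: 20% of samples -> 90%
    aux = (1.0 - main) / 2                 # remainder split between RE and TS
    return {"rule": main, "re": aux, "shape": aux}

print(task_weights(0, 1000))     # early: mostly relation/shape extraction
print(task_weights(1000, 1000))  # late: almost entirely tree extraction
```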
Experiments and Results
The models were tested on Text2DT, a benchmark dataset containing Chinese medical guidelines and textbooks. The results were compared against “Discriminative” models (the previous State-of-the-Art), which treat the problem as a classification task rather than a generation task.
The Generative Advantage
The results were decisive. Generative models significantly outperformed the discriminative baselines.

As shown in Table 1:
- Seq2Seq (CPT) models using Augmented Natural Language (AugNL) achieved high scores.
- Autoregressive (ChatGLM) models, when Fine-Tuned (SFT), performed remarkably well, achieving 60% Tree Accuracy.
- The Ensemble (combining both) reached a new SOTA of 67%.
This proves that modeling the structure of decision-making as a language sequence is superior to trying to classify individual nodes and edges separately.
Why Augmented Natural Language (AugNL)?
For the Seq2Seq models, the style of linearization mattered. AugNL consistently beat standard Natural Language.
In AugNL, instead of writing out "The symptom is headache," the model emits a pointer to the pre-extracted triple (migraine, symptom, headache). This reduces the sequence length and simplifies the decoding process.
The probability calculation for AugNL involves scoring both relation embeddings and structural tokens. Schematically (a simplification of the paper's formulation), the decoder's next-token distribution is computed jointly over the structural vocabulary and the candidate triples:

\[
P(\hat{y}_t \mid \hat{y}_{<t}, X, \mathcal{C}) = \operatorname{softmax}\!\left(\big[\, \mathbf{h}_t^{\top} \mathbf{E}_{\text{struct}} \;;\; \mathbf{h}_t^{\top} \mathbf{E}_{\text{rel}} \,\big]\right)
\]

where \(\mathbf{h}_t\) is the decoder state, \(\mathbf{E}_{\text{struct}}\) embeds the structural tokens ("If," "Then," "Else"), and \(\mathbf{E}_{\text{rel}}\) embeds the pre-extracted relation triples.
Ablation Studies: What Components Matter?
The researchers performed ablation studies to verify their design choices.
For Seq2Seq Models: Table 3 shows that “Harmonized Relation Embeddings” (combining text and relation queries) provided the best results for the AugNL style.

For Autoregressive Models: Table 4 demonstrates the value of the auxiliary tasks. Removing the Relation Extraction (RE) or Tree Shape (TS) tasks caused a significant drop in accuracy. The “Progressively-Dynamic Sampling” (PDS) further boosted the Tree Accuracy to 60%.

Error Analysis: Where do models fail?
Despite the success, the task is far from solved. The authors analyzed the errors to understand the remaining bottlenecks.

Figure 3 reveals an interesting trend:
- Relation Triple Errors (Green): This is the most common error type. The logic might be correct (e.g., “If X then Y”), but the model extracted the wrong drug or symptom name.
- Structure Errors (Blue): The Seq2Seq model struggled more with the tree structure than the LLM (ChatGLM).
- Entity Boundaries: Autoregressive models (LLMs) had more trouble identifying the exact start and end of entity names compared to the specialized Seq2Seq architecture.
The Difficulty of Depth
Not all medical rules are created equal. Some are simple “If A then B.” Others are deep, nested logic trees.

Figure 6 shows that performance drops as trees get deeper (Depth 4). Interestingly, the AugNL model (second bar from left) and the Fine-Tuned ChatGLM (orange bar) maintained their performance much better on deep trees compared to the ICL-based approaches (middle bars), which collapsed on complex logic.
Diversity Leads to Better Ensembles
Why did combining the Seq2Seq model and the ChatGLM model yield the best results (67%)?
The answer lies in Diversity.

Figure 7 illustrates that these two types of models generate fundamentally different types of trees (high edit distance between them). Because they make different types of mistakes, voting between them allows the ensemble to correct errors, leading to a robust “Best of Both Worlds” solution.
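A hedged sketch of one way to exploit that diversity: pool candidate trees from both model families and pick the one with the smallest total edit distance to all the others (a medoid vote). The paper's exact voting scheme may differ:

```python
def levenshtein(a, b):
    """Token-level edit distance between two linearized trees."""
    dp = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, tb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ta != tb))
    return dp[-1]

def ensemble_pick(candidates):
    """Return the candidate closest, in total edit distance, to all others."""
    return min(candidates,
               key=lambda c: sum(levenshtein(c, other) for other in candidates))

# candidates = [seq2seq_output.split(), chatglm_output.split(), ...]
# best_tree = ensemble_pick(candidates)
```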
Conclusion and Implications
The paper “Generative Models for Automatic Medical Decision Rule Extraction from Text” marks a significant step forward in medical AI.
By treating decision rules as a language generation problem, the researchers achieved a 12% improvement over previous methods. They showed that:
- Generative is better than Discriminative for capturing the flow of logic.
- Linearization matters: Augmenting natural language with structured tokens (AugNL) helps models focus on logic rather than syntax.
- Auxiliary Tasks help LLMs: Teaching an LLM to identify relations and shapes separately improves its ability to build complex trees.
Why does this matter? Scalability. Medical knowledge is expanding faster than humans can codify it. If we can reliably automate the extraction of decision rules from the thousands of new papers and guidelines published every year, we can update Clinical Decision Support Systems instantly. This leads to faster dissemination of new treatments, more consistent patient care, and ultimately, better health outcomes.
While challenges remain—particularly in extracting perfect relation triples and handling extremely deep logic—this work provides a blueprint for the future of automated medical logic extraction.
This post is based on the research paper “Generative Models for Automatic Medical Decision Rule Extraction from Text” by Yuxin He, Buzhou Tang, and Xiaoling Wang.