Right for the Right Reasons: Teaching AI to Argue Like a Human Using Informal Logic
Imagine asking a student to explain why gravity keeps the Moon in orbit. If they reply, “Because the Moon is made of cheese,” and then somehow circle “Gravity” on the multiple-choice test, they have landed on the right answer, but their reasoning is catastrophic.
In the world of Artificial Intelligence, Large Language Models (LLMs) are that student. They are incredibly good at selecting the correct answer, but when asked to show their work—to generate a chain of reasoning that leads to that answer—they often hallucinate, use irrelevant facts, or descend into circular logic.
For AI to be truly reliable, especially in science and medicine, it needs to be right for the right reasons. This brings us to a fascinating paper titled “Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic.” The researchers tackle a massive bottleneck in AI: the inability to reliably judge whether an explanation actually makes sense.
By borrowing concepts from philosophy (specifically Informal Logic) and creating a new way to train models, they have built a system that doesn’t just answer questions—it builds valid, step-by-step proofs.
The Problem: The Black Box of Reasoning
To understand the contribution of this paper, we first need to understand Entailment Trees.
In explainable AI, we don’t just want an output; we want a proof. An entailment tree is a structured explanation where a complex hypothesis (the conclusion) is broken down into simpler, atomic facts (premises). If the premises are true, and they logically lead to the conclusion, the structure is sound.
However, current AI models struggle with Decompositional Textual Entailment. This is the specific task of looking at a hypothesis and breaking it down into smaller pieces that support it.
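To make the idea concrete, here is a tiny Python sketch of an entailment tree as a data structure: each node holds a claim, leaves are atomic facts, and an internal node should be entailed by its children. The class name and fields are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class EntailmentNode:
    """One node in an entailment tree: a claim plus the simpler claims that support it."""
    claim: str
    children: list["EntailmentNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # Leaves are atomic facts; internal nodes are conclusions drawn from their children.
        return not self.children

# The Moon/gravity example as a two-premise tree (mirrors Decomposition 1 in Figure 1).
tree = EntailmentNode(
    "Gravity keeps the Moon in orbit around the Earth.",
    children=[
        EntailmentNode("The Moon is in orbit around the Earth."),
        EntailmentNode("Gravity keeps objects in orbit."),
    ],
)
```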
Take a look at the image below. It perfectly illustrates the problem.

In Figure 1 (Upper), we see two attempts to explain why gravity keeps the Moon in orbit.
- Decomposition 1 is logical: It breaks the concept down into “The moon is in orbit” and “Gravity keeps objects in orbit.” This is a valid argument.
- Decomposition 2 is a mess. It claims “Gravity causes objects to orbit” (not always true) and “A person weighs less on the moon” (true, but completely irrelevant to the question of orbit stability).
The problem is that standard AI training datasets treat entailment as a binary switch: Yes or No. As shown in Figure 1 (Lower A), two different human annotators might disagree on whether a messy decomposition is “good enough,” leading to noisy data. Without a clear definition of what constitutes a “good argument,” models cannot learn to distinguish between the logical decomposition and the irrelevant one.
The Solution: Informal Logic and the RDTE Protocol
The researchers argue that strict mathematical logic is too brittle for natural language, but “vibes-based” binary labeling is too vague. The middle ground is Informal Logic—the study of arguments in natural language.
They introduce a new protocol called RDTE (Recognizing Decompositional Textual Entailment). Instead of asking “Is this correct?”, RDTE evaluates an argument based on specific facets derived from the RAS criteria of informal logic:
- Relevance: Does the premise actually matter to the conclusion? (e.g., The Moon’s weight is relevant to gravity; the Moon’s color is not).
- Acceptability (Factuality): Is the premise true in the context of the real world?
- Sufficiency: Do the premises, combined, provide enough grounds to believe the conclusion?
- Redundancy: (Added by the authors) Are we just repeating the same information in different words?
Moving Beyond Binary Labels
By using these four criteria, the researchers moved from a binary label to a 5-point ordinal scale. This allows for nuance. A decomposition might be factual but irrelevant, or relevant but insufficient.

Figure 2 shows the distribution of scores in their new dataset. Notice how much nuance is captured. If we only accepted perfect “5/5” scores, we would have very little data. By setting a threshold (e.g., scores \(\ge\) 4 are considered valid), the model can learn to distinguish between a “mostly correct” argument and a “complete failure.”
This faceted approach allows for much higher agreement between human annotators because they are following a strict rubric rather than their gut feeling.
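To make the rubric concrete, here is a minimal Python sketch of how per-facet ordinal scores could be stored and then binarized with a threshold. The field names, the “weakest link” aggregation, and the threshold of 4 are illustrative assumptions, not the paper’s exact procedure.

```python
from dataclasses import dataclass

@dataclass
class RDTEJudgment:
    """Facet scores for one candidate decomposition, each on a 1-5 ordinal scale."""
    relevance: int       # do the premises actually matter to the conclusion?
    acceptability: int   # are the premises factually true in context?
    sufficiency: int     # do the premises jointly support the conclusion?
    redundancy: int      # 1 = a premise merely restates the conclusion

    def overall(self) -> int:
        # A simple "weakest link" aggregation: the argument is only as good as
        # its worst facet. (Illustrative; not necessarily the paper's rule.)
        return min(self.relevance, self.acceptability,
                   self.sufficiency, self.redundancy)

def is_valid(judgment: RDTEJudgment, threshold: int = 4) -> bool:
    """Binarize the ordinal score, e.g. treat scores of 4 or higher as entailed."""
    return judgment.overall() >= threshold

# The logical decomposition from Figure 1 scores well on every facet...
good = RDTEJudgment(relevance=5, acceptability=5, sufficiency=4, redundancy=5)
# ...while the irrelevant one fails on relevance despite being factual.
bad = RDTEJudgment(relevance=1, acceptability=5, sufficiency=2, redundancy=5)

print(is_valid(good))  # True
print(is_valid(bad))   # False
```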

Figure 5 gives us a peek into the actual guidelines used for the ARC (science) domain. Note the strict definitions for facets like Redundancy: if a fact merely restates the conclusion, it receives a redundancy score of 1. This level of granularity is what allows the AI to learn how to reason, rather than just memorizing answers.
The Method: Knowledge Distillation
Annotating data with this level of detail is expensive and slow for humans, and running a massive model like GPT-4 as the judge at every step is slow and costly too. To solve this, the authors used a technique called Knowledge Distillation.
Here is the pipeline:
- The Expert (Humans): The authors created a “Gold Standard” dataset (RDTE) of 1,000 highly curated examples.
- The Teacher (GPT-4): They prompted GPT-4 with the rigorous RDTE guidelines. They found that when given these specific instructions, GPT-4 is an excellent judge of reasoning. They used GPT-4 to label tens of thousands of reasoning traces (Silver Data).
- The Students (RoBERTa / ChatGPT): They took smaller, faster models and trained them on the data generated by GPT-4.
The goal? To create a small, fast model that judges arguments as well as GPT-4, which can then be used inside a complex reasoning engine without breaking the bank or taking forever to run.
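As a rough illustration of the distillation step, the sketch below fine-tunes a RoBERTa classifier on (hypothesis, decomposition) pairs whose binary labels came from the GPT-4 teacher. The dataset fields, hyperparameters, and use of the Hugging Face transformers API are assumptions made for the example, not the paper’s exact training recipe.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class SilverRDTEDataset(Dataset):
    """Silver data: each example pairs a hypothesis with a candidate decomposition
    and carries a binary 'entailed' label produced by the GPT-4 teacher."""
    def __init__(self, examples, tokenizer):
        self.examples = examples
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        enc = self.tokenizer(
            ex["hypothesis"],
            " ".join(ex["premises"]),   # concatenate the decomposition
            truncation=True, padding="max_length",
            max_length=256, return_tensors="pt",
        )
        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
            "labels": torch.tensor(ex["teacher_label"]),  # 0 or 1 from GPT-4
        }

def distill(silver_examples, epochs=3, lr=1e-5):
    """Train a small RoBERTa student to imitate the teacher's entailment judgments."""
    tokenizer = AutoTokenizer.from_pretrained("roberta-large")
    student = AutoModelForSequenceClassification.from_pretrained(
        "roberta-large", num_labels=2)
    loader = DataLoader(SilverRDTEDataset(silver_examples, tokenizer),
                        batch_size=16, shuffle=True)
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
    student.train()
    for _ in range(epochs):
        for batch in loader:
            outputs = student(**batch)  # cross-entropy loss against teacher labels
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return student
```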

Table 1 illustrates the success of this approach. Look at the bottom section under “Knowledge Distillation.” The RoBERTa student model (trained on the silver data) actually achieved a higher F-score (66) on the ARC dataset than the teacher GPT-4 (58-59).
This implies that a specialized, smaller model trained on high-quality, logic-focused data can outperform a massive generalist model at the specific task of spotting bad logic.
TREEWISE: The Reasoning Engine
Armed with their new “Reasoning Judge” (the distilled model), the authors built a new inference engine called TREEWISE (Textual Reasoning Engine with Enriched Ways to Intelligently Search for Entailment).
TREEWISE is designed to answer a question by building a proof tree rooted in a trusted corpus (like Wikipedia).
How TREEWISE Works
The engine uses a Backward Chaining search strategy: think of solving a maze by working backward from the finish line to the start. (A minimal sketch of this loop follows the list below.)
- Start with the Hypothesis: The engine looks at the potential answer (e.g., “The Moon is kept in orbit by gravity”).
- Decompose: It asks the LLM to break this down into premises.
- Filter (The Crucial Step): This is where the RDTE-trained model shines. It looks at the proposed premises. If they are irrelevant, redundant, or illogical, it throws them out immediately.
- Grounding: It checks if the remaining premises can be found in the knowledge base (Wikipedia).
- Recurse: If a premise cannot yet be grounded in Wikipedia, it becomes a new sub-hypothesis, and the process repeats.
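Putting those steps together, here is a minimal sketch of the backward-chaining loop, with the grounding check serving as the base case. The helpers (decompose_with_llm, rdte_score, retrieve_support) are hypothetical stand-ins for the LLM decomposer, the distilled RDTE judge, and corpus retrieval; the real TREEWISE search also manages scoring, budgets, and beam width.

```python
# Hypothetical stand-ins for the real components; replace with an actual LLM
# decomposer, the distilled RDTE judge, and a corpus retriever.
def decompose_with_llm(hypothesis):
    """Would prompt an LLM for candidate premise sets; returns a list of lists."""
    return []

def rdte_score(hypothesis, premises):
    """Would call the distilled RDTE judge; returns a 1-5 score."""
    return 5

def retrieve_support(hypothesis):
    """Would search the corpus (e.g. Wikipedia); returns supporting passages."""
    return []

def prove(hypothesis, depth=0, max_depth=3, threshold=4):
    """Try to build a proof tree for `hypothesis` rooted in the corpus.
    Returns a nested dict representing the tree, or None if no proof is found."""
    # Grounding: if the hypothesis is directly supported by the corpus, stop here.
    evidence = retrieve_support(hypothesis)
    if evidence:
        return {"claim": hypothesis, "support": evidence}
    if depth >= max_depth:
        return None

    # Decompose: ask the LLM for candidate sets of simpler premises.
    for premises in decompose_with_llm(hypothesis):
        # Filter: the RDTE judge scores the candidate decomposition; irrelevant,
        # redundant, or insufficient decompositions are discarded early.
        if rdte_score(hypothesis, premises) < threshold:
            continue
        # Recurse: each surviving premise becomes a new sub-hypothesis.
        children = [prove(p, depth + 1, max_depth, threshold) for p in premises]
        if all(children):
            return {"claim": hypothesis, "children": children}
    return None
```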

Figure 4 visualizes this flow.
- The NL Hypothesis is at the top.
- The system generates Candidate Decompositions (the right branch).
- Some branches fail (Red X) because the logic is bad.
- Some branches succeed (Green Check) and are grounded in Corpus Documents (the orange icons).
By strictly filtering out bad logic early using the RDTE model, TREEWISE avoids going down “rabbit holes” of hallucinated reasoning. It saves computational budget and results in a cleaner final tree.
Results: Does it actually work?
The researchers tested TREEWISE against other tree-generating baselines on difficult datasets like EntailmentBank (science questions) and HotpotQA (multi-hop reasoning).
They measured two things:
- QA Accuracy: Did it get the right answer?
- Tree Integrity: Is the explanation actually logical? (Measured by having GPT-4 grade the final tree).
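A rough sketch of how these two metrics could be computed over a dataset is shown below. The engine interface and the tree-grading function are hypothetical placeholders; the paper uses GPT-4 as the grader, but its exact prompt and scale are not reproduced here.

```python
def evaluate(engine, dataset, grade_tree):
    """Compute QA accuracy and mean tree-integrity score over a dataset.
    `engine.answer` and `grade_tree` are hypothetical stand-ins for the
    reasoning engine and the LLM-based tree grader."""
    correct, tree_scores = 0, []
    for example in dataset:
        answer, tree = engine.answer(example["question"])
        correct += int(answer == example["gold_answer"])  # QA accuracy
        tree_scores.append(grade_tree(tree))              # tree integrity
    return {
        "qa_accuracy": correct / len(dataset),
        "mean_tree_integrity": sum(tree_scores) / len(tree_scores),
    }
```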

While Table 3 focuses on the filtering performance, the broader results in the paper confirmed that TREEWISE significantly outperforms baselines.
- Accuracy: It achieves higher question-answering accuracy because it constructs better proofs.
- Quality: The trees it produces are far more coherent.
Let’s look at what a “good” tree looks like.

Figure 15 shows TREEWISE answering a science question about phase transitions (“A balloon filled with water…”).
- Diagram A: Notice the clear logical flow.
  - Premise: Water freezes into a solid.
  - Premise: The water is in a freezer.
  - Conclusion: The state of the water changes.
- The system grounds these facts in Wikipedia. It doesn’t just guess; it builds a structure that a human can verify.

Figure 16 shows the system handling a complex history question in HotpotQA. It successfully links the “New York City Fire Commissioner” to “Rhinelander Waldo” and connects the timeline of “Providenza Panno’s death” to the “Triangle Shirtwaist Factory fire.”
- This is Multi-hop Reasoning: connecting Fact A to Fact B to prove Conclusion C.
- Without the RDTE filter, the model might have hallucinated a connection or used an irrelevant fact about the fire.
Conclusion: Why This Matters
This paper marks a significant step toward maturity in how we approach AI reasoning. We are moving past the “Clever Hans” era—where models appear smart by picking the right answer based on statistical patterns—into an era of accountable reasoning.
The key takeaways are:
- Logic is Nuanced: Reasoning isn’t binary. Using Informal Logic (Relevance, Acceptability, Sufficiency) provides the vocabulary AI needs to understand arguments.
- Small Models Can Be Smart: You don’t always need the largest model at inference time. You can “distill” the reasoning capabilities of a giant model (GPT-4) into a smaller, efficient filter.
- Structure Matters: Systems like TREEWISE prove that forcing an LLM to show its work—and grading that work step-by-step—leads to better answers and, more importantly, answers we can trust.
By teaching AI not just what the answer is, but what constitutes a good argument, we pave the way for AI agents that can act as reliable assistants in law, science, and education.