Have you ever asked an AI to write an essay on a controversial topic? Often, the result looks impressive at first glance. The grammar is perfect, the vocabulary is sophisticated, and the structure seems sound. But if you look closer, cracks begin to appear. The AI might make a bold claim in the first sentence, only to provide evidence that contradicts it three sentences later. Or, it might list facts that are technically true but irrelevant to the argument at hand.
This is a classic problem in Argumentative Essay Generation (AEG). While Large Language Models (LLMs) are excellent at predicting the next word, they often struggle with the high-level architecture of logic. They know how to write, but they often forget why they are writing specific sentences, leading to “logical hallucinations.”
In this post, we will take a deep dive into a fascinating research paper titled “Prove Your Point!: Bringing Proof-Enhancement Principles to Argumentative Essay Generation.” The researchers propose a new framework called PESA (Proof-Enhancement and Self-Annotation) that teaches AI not just to generate text, but to adhere to strict principles of logic and proof, much like a human debater.
The Problem: Logical Disorganization
Argumentative writing is different from creative writing. It requires a coherent flow where a main thesis is supported by claims, and those claims are supported by specific evidence (grounds).
Current methods often use a “Plan-and-Write” paradigm. They generate a list of keywords or a knowledge graph to guide the essay. However, these plans are often too simple. They tell the model what words to use, but not how to structure the proof.
Consider the example below from the research paper. The model is asked to discuss whether public libraries should spend money on high-tech media.

In the upper example (without proof principles), the model claims that technology makes information easier to reach. But in the very next sentence, it provides “evidence” that search engines cannot find information. This is a logical self-contradiction. The lower example, generated using the principles we will discuss today, creates a claim about maintenance costs and supports it with specific data about library funds.
Background: The Toulmin Argumentation Model
To solve this, the researchers turned to a classic theory in philosophy and rhetoric: the Toulmin Argumentation Model.
Proposed by Stephen Toulmin, this model suggests that a valid argument consists of several specific components. For the purpose of AI generation, the researchers simplified this into a hierarchical tree structure consisting of two main layers:
- Claims (The Abstract): These are the high-level assertions or positions the essay takes.
- Grounds (The Concrete): These are the specific data, evidence, warrants, or reasoning that support the claims.
Most AI models try to generate the essay (Claims + Grounds + Filler) all at once. The core idea of this paper is to force the AI to plan the Claims first, then plan the Grounds based on those claims, and only then write the essay.
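To make that hierarchy concrete, here is a minimal Python sketch of the two-layer claim/ground structure; the class and field names are illustrative, not taken from the paper's code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Claim:
    """A high-level assertion the essay commits to (the abstract layer)."""
    text: str
    grounds: List[str] = field(default_factory=list)  # concrete evidence supporting this claim

@dataclass
class ArgumentPlan:
    """Simplified Toulmin-style plan: a prompt, its claims, and their grounds."""
    prompt: str
    claims: List[Claim] = field(default_factory=list)

# Example plan for the "public libraries and high-tech media" prompt
plan = ArgumentPlan(
    prompt="Should public libraries spend money on high-tech media?",
    claims=[
        Claim(
            text="High-tech media imposes ongoing maintenance costs on libraries.",
            grounds=["Library budgets already devote a large share to upkeep of existing equipment."],
        )
    ],
)
```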

As shown in Figure 4 above, a human-authored text naturally follows this structure. A prompt leads to major claims, which branch out into specific reasoning and examples. The goal of the PESA framework is to mimic this mental process.
The Core Method: PESA
The researchers developed a unified framework called PESA. It stands for Proof-Enhancement and Self-Annotation.
These two names correspond to the two biggest challenges in this field:
- Proof-Enhancement: How do we force the model to follow a logical structure?
- Self-Annotation: Where do we get the training data? (Standard datasets contain essays, but they don’t have the “Claims” and “Grounds” separately labeled).
Let’s look at the full architecture of the system:

The framework operates in a cycle. The bottom half of the image represents the data preparation (Self-Annotation), and the top half represents the actual generation process (Proof-Enhancement). Let’s break these down.
Stage 1: Self-Annotation (The Data Problem)
Training a model to generate “Claims” and “Grounds” requires a dataset where essays are already split into these components. Since such a dataset is expensive to create manually, the researchers used a technique called Self-Annotation.
They utilized a powerful LLM (GPT-4) to “reverse engineer” existing high-quality essays. It works like a text summarization task but in reverse order of the writing process:
- Extracting Grounds (\(U^g\)): The model reads a paragraph of a human-written essay (\(Y\)) and summarizes the specific evidence and reasoning.
- Extracting Claims (\(U^c\)): The model reads the extracted grounds and the essay to summarize the main point (the Major Claim).
This is formalized mathematically as:
\[
\begin{aligned}
U^{g} &= \psi(Y), \\
U^{c} &= \psi(Y, U^{g}).
\end{aligned}
\]
Here, \(\psi\) represents the LLM-based extraction function. This process turns a standard dataset of essays into a “pseudo-labeled” dataset containing the Prompt (\(X\)), the Claim Plan (\(U^c\)), the Ground Plan (\(U^g\)), and the final Essay (\(Y\)).
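Concretely, the self-annotation step could be sketched as follows. The prompts and the `extract` helper are illustrative assumptions; the paper only specifies that a strong LLM (GPT-4) plays the role of \(\psi\).

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract(instruction: str, content: str, model: str = "gpt-4") -> str:
    """Generic LLM extraction step standing in for the psi(.) function."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": content},
        ],
    )
    return response.choices[0].message.content

def self_annotate(essay: str) -> dict:
    """Reverse-engineer pseudo-labels (claims and grounds) from a human-written essay."""
    # Step 1: U^g = psi(Y) -- summarize the concrete evidence and reasoning.
    grounds = extract(
        "List the specific evidence, data, and reasoning used in this essay.",
        essay,
    )
    # Step 2: U^c = psi(Y, U^g) -- summarize the major claims given the grounds.
    claims = extract(
        "Given the essay and its extracted grounds, state the major claims it argues for.",
        f"Essay:\n{essay}\n\nGrounds:\n{grounds}",
    )
    return {"claims": claims, "grounds": grounds, "essay": essay}
```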
Stage 2: Proof-Enhancement (The Generation Process)
Once the model is trained on this structured data, it can generate new essays using a three-step Tree Planning Approach. This hierarchy ensures that the logic flows top-down and curbs the rambling tendency of standard language models.

Step 1: Claim Planning
First, the model (\(\mathcal{M}^c\)) looks at the writing prompt (\(x\)) and generates the major claims (\(U^c\)). This acts as the skeleton of the essay. It decides the stance and the main arguments without getting bogged down in details.
\[ \tilde{U}^{c} = \mathcal{M}_{\theta}^{c}(x). \]
Step 2: Ground Planning
Next, a second model (\(\mathcal{M}^g\)) takes the prompt (\(x\)) AND the just-generated claims (\(U^c\)) to generate the grounds (\(U^g\)). This fills in the skeleton with muscle—evidence, data, and reasoning that directly support the claims.
\[ \tilde{U}^{g} = \mathcal{M}_{\theta}^{g}(x, \tilde{U}^{c}). \]
Step 3: Essay Generation
Finally, the generation model (\(\mathcal{M}^e\)) takes the prompt, the claims, and the grounds to write the final essay (\(y\)). Because the logic and evidence are already planned out, the model essentially just needs to “connect the dots” with fluent language.
\[ \tilde{y} = \mathcal{M}_{\theta}^{e}(x, \tilde{U}^{c}, \tilde{U}^{g}). \]
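To see how these three steps chain together at inference time, here is a minimal Python sketch. The model objects stand in for the fine-tuned claim, ground, and essay generators, and the prompt templates are illustrative assumptions rather than the paper's actual formatting.

```python
from typing import Callable

# Each stage is a text-in/text-out generator, e.g. a fine-tuned LLaMA checkpoint
# wrapped behind a simple callable interface.
Generator = Callable[[str], str]

def generate_essay(prompt: str,
                   claim_model: Generator,
                   ground_model: Generator,
                   essay_model: Generator) -> str:
    """Three-step tree planning: claims -> grounds -> essay."""
    # Step 1: U^c = M^c(x) -- plan the major claims from the writing prompt alone.
    claims = claim_model(f"Writing prompt: {prompt}\nPlan the major claims:")

    # Step 2: U^g = M^g(x, U^c) -- plan grounds that support those specific claims.
    grounds = ground_model(
        f"Writing prompt: {prompt}\nClaims: {claims}\nPlan supporting grounds:"
    )

    # Step 3: y = M^e(x, U^c, U^g) -- write the essay, conditioned on the full plan.
    essay = essay_model(
        f"Writing prompt: {prompt}\nClaims: {claims}\nGrounds: {grounds}\nWrite the essay:"
    )
    return essay
```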
Training the Model
The system is trained using distinct loss functions, one for each stage (a minimal sketch follows the list below). This ensures that the model learns to be a good planner just as much as it learns to be a good writer.

- \(\mathcal{L}_c\): Penalizes the model if the Claims don’t match the training data.
- \(\mathcal{L}_g\): Penalizes the model if the Grounds don’t match (given the Claims).
- \(\mathcal{L}_e\): Penalizes the model if the final Essay doesn’t match (given the Claims and Grounds).
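In practice, each of these is a standard next-token cross-entropy over the corresponding target sequence. Below is a minimal PyTorch-style sketch of how the three losses could be combined; the equal weighting is an assumption for illustration, not something the paper specifies.

```python
import torch
import torch.nn.functional as F

def lm_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Token-level cross-entropy: logits are (batch, seq, vocab), targets are (batch, seq)."""
    return F.cross_entropy(logits.transpose(1, 2), targets)

def pesa_loss(claim_logits, claim_targets,
              ground_logits, ground_targets,
              essay_logits, essay_targets) -> torch.Tensor:
    # L_c: claims conditioned on the prompt
    loss_c = lm_loss(claim_logits, claim_targets)
    # L_g: grounds conditioned on the prompt and gold claims
    loss_g = lm_loss(ground_logits, ground_targets)
    # L_e: essay conditioned on the prompt, claims, and grounds
    loss_e = lm_loss(essay_logits, essay_targets)
    # Equal weighting here is an assumption for illustration.
    return loss_c + loss_g + loss_e
```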
Experiments and Results
The researchers evaluated PESA on the ArgEssay dataset, which contains over 11,000 essays on topics from exams like IELTS and TOEFL. They compared PESA against several strong baselines, including LLaMA-2 (fine-tuned) and previous state-of-the-art planning models like DD-KW (Dual Decoder with Keywords).
Automatic Evaluation
Since evaluating essay quality automatically is difficult, the researchers used GPT-4 as a judge, asking it to score essays based on Relevance, Validity of Reasoning, and Credibility of Evidence.

Note: The table above displays the automatic evaluation results.
The results were impressive. PESA (labeled “Ours”) outperformed all baselines.
- Validity of Reasoning: PESA achieved a score of 84.64, significantly higher than the standard LLaMA2-base (80.26).
- Credibility of Evidence: PESA scored 49.20, beating the closest competitor.
This confirms that explicitly planning the “Claims” and “Grounds” helps the model stick to a logical path.
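As a rough illustration of this LLM-as-judge setup, the sketch below scores an essay along the three criteria. The rubric wording and scoring scale are assumptions for illustration; the paper's exact judging prompt is not reproduced here.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are grading an argumentative essay. Score it from 0 to 100 on each of: "
    "Relevance, Validity of Reasoning, and Credibility of Evidence. "
    "Reply with the three scores separated by commas."
)

def judge_essay(writing_prompt: str, essay: str, model: str = "gpt-4") -> list:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Prompt: {writing_prompt}\n\nEssay:\n{essay}"},
        ],
    )
    raw = response.choices[0].message.content  # e.g. "85, 84, 49"
    return [float(s) for s in raw.split(",")]
```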
Human Evaluation
Automatic metrics are useful, but human judgment is the gold standard for writing. The researchers hired evaluators to compare PESA against the baselines and even ChatGPT.

As seen in the table above, PESA achieved a score of 4.76 in Overall Persuasiveness, nearly matching ChatGPT (4.82). This is a remarkable feat considering PESA is based on LLaMA-13B, which is a significantly smaller model than the massive architecture behind ChatGPT.
The researchers also conducted a “Head-to-Head” win rate analysis:

When pitted directly against other models:
- PESA won against DD-KW (the previous specific AEG method) 86% of the time.
- PESA won against LLaMA-2-base 64% of the time.
- Against ChatGPT, PESA managed to achieve a Tie or Win in about 62% of cases, proving it is highly competitive.
Does the “Self-Annotation” actually work?
A critic might ask: “Is the data generated by GPT-4 actually good enough to train on?” The researchers analyzed this by having humans rate the quality of the pseudo-labels (the extracted claims and grounds).

The data shows that GPT-4 (used for the final method) produces highly relevant and high-quality planning data (4.95/5 for relevance). This validates the strategy of using a stronger model to “teach” a smaller model how to structure its thoughts.
Conclusion & Implications
The PESA framework represents a significant step forward in Natural Language Generation. It moves us away from “black box” generation—where we hope the model gets the logic right—toward a structured, transparent process.
By integrating the Toulmin Argumentation Model, the researchers successfully taught an AI to:
- Plan its major arguments (Claims).
- Substantiate those arguments with evidence (Grounds).
- Write a coherent essay based on that plan.
The most exciting takeaway is that logic can be decoupled from size. You don’t necessarily need a trillion-parameter model to write persuasive essays. You need a model that understands the structure of persuasion. By explicitly modeling the proof process, a smaller model (LLaMA-13B) was able to punch above its weight class, delivering results comparable to ChatGPT.
For students and researchers in AI, this highlights the importance of inductive bias—designing model architectures that reflect the underlying structure of the problem (in this case, the hierarchical nature of arguments) rather than relying solely on massive amounts of unstructured data.