Introduction
Imagine reading an essay where every individual sentence is grammatically perfect, yet the paragraph feels confusing. The ideas jump around, the pronouns don’t seem to refer to anyone specific, and the arguments don’t flow logically. This is a failure of coherence.
For Second Language (L2) English learners, mastering grammar is a significant milestone, but achieving coherence is often the harder final boss. While tools like Grammarly have revolutionized how we fix surface-level errors (spelling, syntax, punctuation), they often fall short on discourse-level issues. They can tell you how to spell a word, but they rarely tell you whether that word makes sense in the context of the previous three sentences.
Currently, automated writing evaluation systems can give an essay a “coherence score,” but they rarely explain why the score is low or how to fix it. This leaves students with a number but no path to improvement.
Enter DECOR, a new research contribution from Columbia University and UC Davis. The researchers have introduced a novel benchmark designed specifically to detect incoherence, identify the underlying reasons, and—most importantly—rewrite the text to fix the issue minimally. In this post, we will break down how DECOR works, why it matters, and how it proves that teaching AI the “reason” for a mistake makes it significantly better at fixing it.
The Coherence Gap
To understand DECOR, we first need to define the problem. Coherence is what makes a text hold together. It involves:
- Cohesion: The linguistic glue (like pronouns and transition words) that connects sentences.
- Consistency: Ensuring new information doesn’t contradict previous statements.
- Relevance: Staying on topic.
L2 learners often struggle here. A student might introduce a term like “active points” when they meant “actions,” or switch topics abruptly without a transition.
The researchers identified a major gap in Natural Language Processing (NLP): there was no dataset specifically designed to help machines understand and repair these errors in L2 writing. Previous approaches used “out-of-distribution” data (often machine-generated gibberish) which doesn’t reflect the nuanced mistakes actual students make.
The DECOR Benchmark
The core contribution of this paper is the DECOR benchmark. It is a comprehensive dataset and pipeline built from the TOEFL-11 corpus (a collection of essays written by non-native English speakers).
The researchers didn’t just label essays as “good” or “bad.” They broke the problem down into three specific tasks performed on Context-Sentence pairs. In this setup, the model looks at a “Context” (the preceding sentences) and a “Current Sentence” to decide if they fit together.
As illustrated in the flowchart below, the process follows a strict logic:
- Detection: Is the current sentence coherent with the context?
- Reasoning: If not, exactly why?
- Rewriting: How can we fix it with the least invasive edit?

Figure 1: The DECOR pipeline. Note how the system identifies the conflict between “action” in the context and “active points” in the current sentence, identifies the reason as “Tangential,” and proposes a rewrite.
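Before looking at each task, it helps to picture what a single DECOR-style example carries. The sketch below is a hypothetical representation: the field names are chosen for illustration, and the example text is paraphrased from Figure 1 rather than copied from the dataset.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CoherencePair:
    """One DECOR-style example: the preceding context plus the sentence under review."""
    context: str                         # the preceding sentences from the essay
    sentence: str                        # the current sentence being judged
    is_coherent: Optional[bool] = None   # Task 1: detection label
    reason: Optional[str] = None         # Task 2: one of R1-R7, set only when incoherent
    rewrite: Optional[str] = None        # Task 3: a minimal, coherence-restoring revision

# Example text paraphrased from Figure 1, not copied from the dataset.
example = CoherencePair(
    context="Young men gain many advantages from studying. Their actions become more deliberate.",
    sentence="These active points are very important for the society.",
    is_coherent=False,
    reason="R6 Tangential Relevance",
    rewrite="These actions are very important for society.",
)
```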
1. Incoherence Detection
The first step is binary: Yes or No. The system analyzes a context-sentence pair. For example, if the context discusses the advantages of young men studying, and the next sentence suddenly talks about “active points” without defining the term, the system flags the pair as incoherent.
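Framed this way, detection is simply a yes/no question over a context-sentence pair. Here is a minimal sketch of how such a query could be posed to a language model; the prompt wording is an assumption, not the paper’s actual template.

```python
def build_detection_prompt(context: str, sentence: str) -> str:
    """Frames incoherence detection as a binary question over a context-sentence pair."""
    return (
        f"Context:\n{context}\n\n"
        f"Current sentence:\n{sentence}\n\n"
        "Is the current sentence coherent with the context? Answer 'Yes' or 'No'."
    )

def parse_detection_answer(answer: str) -> bool:
    """Maps a model's free-text reply onto the binary detection label."""
    return answer.strip().lower().startswith("yes")
```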
2. Incoherence Reasoning (The “Why”)
This is where DECOR distinguishes itself. Merely flagging an error isn’t enough for education; the student needs to know why it’s wrong. The researchers developed a taxonomy of seven specific reasons for incoherence, grouped into three categories:
- Cohesion: Issues with connecting words or references.
  - R1 Semantic Connection: The sentence doesn’t connect in meaning.
  - R2 Entity Reference: Using a pronoun (like “it” or “they”) that has no clear antecedent in the previous sentences.
  - R3 Discourse Relation: Missing transition words (e.g., “However,” “Therefore”).
- Consistency:
  - R4 Consistency: Contradicting previously stated facts.
- Relevance:
  - R5 Contextual Relevance: Completely off-topic.
  - R6 Tangential Relevance: Slightly related but unnecessary or distracting.
- Other:
  - R7: Miscellaneous logical breaks (e.g., topic-comment disagreement).
The table below provides real examples from the dataset showing how these reasons are applied and corrected.

Figure 2: A detailed breakdown of the 7 reasons for incoherence. Notice specifically R4 (Consistency), where the original sentence contradicts the context about expensive gas.
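To make the taxonomy concrete, here is one way it could be encoded programmatically. The enum names and category mapping below simply mirror the seven reasons and three groups described above; they are not taken from the paper’s code.

```python
from enum import Enum

class IncoherenceReason(str, Enum):
    """The seven DECOR reasons, with the R-label as each member's value."""
    SEMANTIC_CONNECTION  = "R1"  # Cohesion: meaning does not connect to the context
    ENTITY_REFERENCE     = "R2"  # Cohesion: pronoun with no clear antecedent
    DISCOURSE_RELATION   = "R3"  # Cohesion: missing transition word
    CONSISTENCY          = "R4"  # Consistency: contradicts an earlier statement
    CONTEXTUAL_RELEVANCE = "R5"  # Relevance: completely off-topic
    TANGENTIAL_RELEVANCE = "R6"  # Relevance: related but unnecessary or distracting
    OTHER                = "R7"  # Other: miscellaneous logical breaks

CATEGORY = {  # maps each reason to its higher-level category
    IncoherenceReason.SEMANTIC_CONNECTION: "Cohesion",
    IncoherenceReason.ENTITY_REFERENCE: "Cohesion",
    IncoherenceReason.DISCOURSE_RELATION: "Cohesion",
    IncoherenceReason.CONSISTENCY: "Consistency",
    IncoherenceReason.CONTEXTUAL_RELEVANCE: "Relevance",
    IncoherenceReason.TANGENTIAL_RELEVANCE: "Relevance",
    IncoherenceReason.OTHER: "Other",
}
```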
3. Incoherence Rewriting
The final task is correction. The goal is not just to generate a new sentence, but to perform a minimally invasive rewrite.
This is a critical distinction. Large Language Models (LLMs) like GPT-4 often behave like over-eager editors—they rewrite the whole paragraph, changing the author’s voice and style. DECOR aims to preserve the student’s original intent and phrasing, only changing what is necessary to restore coherence.
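The paper judges minimality with expert evaluation rather than a formula, but a rough automatic proxy is easy to sketch. The snippet below is an illustration, not DECOR’s metric: it uses word-level sequence similarity to show why a surgical edit preserves more of the author’s wording than a wholesale rewrite.

```python
import difflib

def preservation_ratio(original: str, rewrite: str) -> float:
    """Share of word-level content kept from the original sentence (1.0 means identical)."""
    return difflib.SequenceMatcher(None, original.split(), rewrite.split()).ratio()

# A surgical edit keeps most of the author's wording; a wholesale rewrite does not.
original  = "These active points are very important for the society."
surgical  = "These actions are very important for society."
wholesale = "In conclusion, deliberate behaviour plays a crucial role in modern communities."
print(preservation_ratio(original, surgical))   # close to 1.0
print(preservation_ratio(original, wholesale))  # much lower
```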
Data Construction and Statistics
The researchers employed expert annotators (linguistics professors) to label 1,352 context-sentence pairs. They found that real student essays have specific patterns of incoherence.
As shown in the charts below, Relevance (specifically Tangential Relevance) and Cohesion (specifically Discourse Relation) are the most common stumbling blocks for medium-proficiency learners. Interestingly, direct contradictions (Consistency) are rare—students rarely forget what they just said, but they frequently drift off-topic or forget to use transition words.

Figure 3: The distribution of errors. Orange bars represent Relevance issues, which are the most frequent, followed by Blue (Cohesion).
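Reproducing a distribution like the one in Figure 3 from labeled pairs is a one-line tally; the annotations list below is a made-up stand-in for the real labels, shown only to illustrate the computation.

```python
from collections import Counter

# Tally how often each reason appears; the annotations list is a placeholder, not real data.
annotations = ["R6", "R3", "R6", "R5", "R3", "R6", "R2"]
counts = Counter(annotations)
total = sum(counts.values())
for reason, n in counts.most_common():
    print(f"{reason}: {n} ({n / total:.0%})")
```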
Methodology: Teaching Models to Reason
Because the human-annotated dataset is relatively small (1,352 pairs), it is difficult to train robust deep learning models on this data alone. To solve this, the researchers used a technique involving synthetic data generation.
They used GPT-4 to synthesize a large training set of incoherent examples based on the 7 distinct reasons identified in DECOR. They then used this synthetic data to fine-tune smaller, more efficient models (like BERT, DeBERTa, and Llama-2).
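The paper’s exact prompts are not reproduced here, but the general recipe, asking a strong LLM to corrupt a coherent sentence in a reason-specific way, can be sketched as follows. The instruction wording and the subset of reasons shown are placeholders.

```python
REASON_INSTRUCTIONS = {  # corruption recipes; wording invented, shown for three of the seven reasons
    "R2": "Replace a noun phrase with a pronoun that has no clear antecedent in the context.",
    "R3": "Remove or misuse the transition word that links the sentence to the context.",
    "R6": "Make the sentence only tangentially related to the context, so it distracts from the main point.",
}

def build_synthesis_prompt(context: str, coherent_sentence: str, reason: str) -> str:
    """Asks a strong LLM to corrupt a coherent sentence in a controlled, reason-specific way."""
    return (
        f"Context:\n{context}\n\n"
        f"Coherent sentence:\n{coherent_sentence}\n\n"
        f"Rewrite the sentence so it becomes incoherent for this specific reason: "
        f"{REASON_INSTRUCTIONS[reason]}\n"
        "Change as little of the wording as possible and return only the new sentence."
    )
```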
This setup created two experimental tracks:
- Detection & Reasoning Models: Classifiers trained to spot the error and identify the label (R1-R7).
- Rewriting Models: Generative models (Llama-2 and Llama-3) trained to fix the sentence.
The Hypothesis: The researchers hypothesized that if you feed the reason for the incoherence into the rewriting model (e.g., “Fix this sentence knowing it is a Relevance error”), the model will produce a better correction than if it tries to fix it blindly.
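In practice, the two tracks differ only in the input given to the rewriting model. Here is a minimal sketch of that difference; the prompt wording is assumed rather than taken from the paper.

```python
from typing import Optional

def rewrite_prompt(context: str, sentence: str, reason: Optional[str] = None) -> str:
    """Builds the rewriting input; the only difference between the two tracks
    is whether the incoherence reason is appended."""
    prompt = (
        f"Context:\n{context}\n\n"
        f"Incoherent sentence:\n{sentence}\n\n"
    )
    if reason is not None:  # the "with reason" track
        prompt += f"Reason for incoherence: {reason}\n\n"
    prompt += "Rewrite the sentence with minimal edits so that it is coherent with the context."
    return prompt
```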
Experiments and Results
The results of the study were highly encouraging, validating both the dataset and the training methodology.
Can Small Models Detect Incoherence?
The researchers compared their fine-tuned smaller models against GPT-4 (zero-shot and few-shot).
The table below shows that DeBERTa-base, when trained on the DECOR synthetic data (\(D_T\)), achieves performance comparable to, and sometimes better than, GPT-4. This is significant because DeBERTa is a fraction of the size and cost of GPT-4. It proves that specialized training on high-quality coherence data allows smaller models to punch above their weight class.

Figure 4: Performance metrics. Note that models trained on \(D_T\) (Task-specific synthetic data) generally outperform those trained on out-of-domain data (\(D_C\)).
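For readers who want to try something similar, a binary detector over context-sentence pairs can be fine-tuned with the Hugging Face Trainer in a few dozen lines. The checkpoint name, hyperparameters, and the tiny synthetic_pairs list below are illustrative placeholders, not the paper’s actual configuration.

```python
# A fine-tuning sketch with Hugging Face Transformers; values below are placeholders.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModelForSequenceClassification.from_pretrained("microsoft/deberta-base", num_labels=2)

synthetic_pairs = [  # stand-in for the synthetic training set; label 1 = incoherent
    {"context": "Studying abroad teaches independence.",
     "sentence": "Therefore, students learn to manage daily life on their own.", "label": 0},
    {"context": "Studying abroad teaches independence.",
     "sentence": "It is also delicious with rice.", "label": 1},
]

class PairDataset(torch.utils.data.Dataset):
    """Encodes (context, sentence) pairs for binary incoherence detection."""
    def __init__(self, pairs):
        self.pairs = pairs
    def __len__(self):
        return len(self.pairs)
    def __getitem__(self, idx):
        item = self.pairs[idx]
        enc = tokenizer(item["context"], item["sentence"],
                        truncation=True, padding="max_length", max_length=256)
        enc = {k: torch.tensor(v) for k, v in enc.items()}
        enc["labels"] = torch.tensor(item["label"])
        return enc

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="decor-detector", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=2e-5),
    train_dataset=PairDataset(synthetic_pairs),
)
trainer.train()
```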
Does “Reasoning” Help Rewriting?
This was the most critical question. The team trained two versions of the Llama-2 model:
- Without Reason: Just given the context and incoherent sentence.
- With Reason: Given context, sentence, and the specific label (e.g., “Entity Reference”).
They evaluated the rewrites using two methods: an “Acceptance Rate” (did the rewrite actually fix the coherence?) and a “Win Rate” (comparing the model’s rewrite against human expert rewrites).

Figure 5: Automatic evaluation. The rows marked “w/ reason” consistently show higher acceptance and win rates than “w/o reason.”
The data confirms the hypothesis: Incorporating specific reasons for incoherence consistently improves the quality of the rewrites. When the model knows why a sentence is wrong, it fixes it more accurately.
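Both metrics reduce to simple ratios over judge decisions. The helper functions below are a generic sketch of how such scores are computed, not the paper’s evaluation code.

```python
def acceptance_rate(judgments: list) -> float:
    """Fraction of rewrites judged to have restored coherence (judgments are booleans)."""
    return sum(judgments) / len(judgments)

def win_rate(outcomes: list) -> float:
    """Fraction of pairwise comparisons the model wins; each outcome is 'win', 'tie', or 'lose'."""
    return outcomes.count("win") / len(outcomes)

# Example: 7 wins, 2 ties, 1 loss out of 10 comparisons -> 70% win rate.
print(win_rate(["win"] * 7 + ["tie"] * 2 + ["lose"]))
```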
Human Evaluation: The Ultimate Test
Automatic metrics are useful, but in writing, human judgment is king. The researchers asked human experts to blindly compare rewrites from three sources:
- Llama-2 (Fine-tuned without reasons)
- Llama-2 (Fine-tuned with reasons)
- Human Expert References
They also compared these against GPT-4’s native rewrites.
The results, visualized below, were striking. The model fine-tuned with reasons (the middle bar) had a significantly higher “Win” rate against GPT-4 than the model trained without reasons. Furthermore, the expert annotators noted that the fine-tuned models were often indistinguishable from human editors, producing minimal, effective edits.

Figure 6: Human evaluation results. The green segments represent “Wins” against GPT-4. The model trained “with reason” wins 72% of the time, almost matching the Human benchmark (74%).
Conclusion and Implications
The DECOR paper presents a major step forward in Automated Writing Evaluation (AWE). By moving beyond grammar and syntax into the realm of discourse coherence, it addresses one of the most difficult aspects of learning a new language.
Key takeaways for students and researchers:
- Diagnosis precedes Correction: The most successful rewrites happen when the system first identifies why a sentence is incoherent. This mirrors human pedagogy—teachers don’t just rewrite student work; they explain the error first.
- Minimalism Matters: GPT-4 and other massive models tend to over-edit. DECOR shows that we can train smaller, specialized models to make surgical edits that respect the student’s original voice.
- Data Quality: The creation of the DECOR benchmark fills a void in L2 research, providing a gold-standard dataset that maps real-world student errors rather than synthetic noise.
This research paves the way for a new generation of writing assistants. Future tools could do more than just underline a misspelled word; they could highlight a confusing sentence and say, “This sentence seems off-topic. Try adding a transition word to connect it to your previous point,” effectively acting as a personalized writing tutor.