Introduction
In democratic societies, argumentation is the bedrock of decision-making. Whether it is a politician advocating for policy change, a student writing a persuasive essay, or a user on a forum trying to change another’s view, the ability to argue effectively is a key competence.
For years, the field of Natural Language Processing (NLP) has focused heavily on Argument Mining (AM)—the task of teaching computers to simply find arguments within a text. AM algorithms can scan a document and identify premises and conclusions. But identifying an argument is only half the battle. The far more complex challenge is determining how good that argument actually is.
This is the domain of Argumentation Quality (AQ). Defining quality is difficult because it is multifaceted. Is an argument good because it is logically sound? Is it good because it is persuasive? Or is it good because it is polite and relevant?
In a comprehensive new survey titled “Let’s discuss! Quality Dimensions and Annotated Datasets for Computational Argument Quality Assessment,” researchers Rositsa V. Ivanova, Thomas Huber, and Christina Niklaus provide a representative overview of the state of the art in this field. They surveyed 211 publications and analyzed 32 annotated datasets to map out how computer science defines and measures argument quality.
This post will take you through their findings, breaking down the complex taxonomy of argument quality and examining the datasets that fuel modern AI models.
The Search for “Good” Arguments
To understand the landscape of Argumentation Quality, the researchers performed a systematic literature review. They didn’t just look for papers with “argument quality” in the title; they employed a rigorous search strategy to ensure they captured the evolution of the field.
Methodology: The Snowball Effect
The authors started with a search on DBLP (a computer science bibliography), which yielded an initial set of 80 publications. However, keyword searches alone often miss foundational or adjacent work. To counter this, they applied snowball sampling.
In this iterative process, the researchers extracted the references cited by their initial papers (backward snowballing) and searched for papers that cited them (forward snowballing). They filtered the candidates for relevance, excluding purely philosophical works to keep the focus computational, and repeated the process until no new relevant publications were found.

As shown in Figure 1, this method allowed them to expand their corpus significantly, moving from a small set of DBLP publications to a robust collection of 211 relevant papers.
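If you want a feel for how this iterative expansion works in practice, here is a minimal Python sketch of the snowball loop. The three lookup functions (`get_references`, `get_citing_papers`, `is_relevant`) are hypothetical hooks standing in for whatever bibliographic source and relevance filter you would actually plug in; the survey describes the procedure, not an implementation.

```python
def snowball(seed_papers, get_references, get_citing_papers, is_relevant):
    """Iteratively expand a seed corpus via backward and forward snowballing.

    get_references(paper)    -> papers the given paper cites (backward step)
    get_citing_papers(paper) -> papers that cite the given paper (forward step)
    is_relevant(paper)       -> True if the paper is computational AQ work
    All three are hypothetical hooks for your own bibliographic source.
    """
    corpus = set(seed_papers)
    frontier = set(seed_papers)
    while frontier:  # stop once an iteration adds no new relevant papers
        candidates = set()
        for paper in frontier:
            candidates.update(get_references(paper))      # backward snowballing
            candidates.update(get_citing_papers(paper))   # forward snowballing
        # keep only unseen, relevant papers (e.g. drop purely philosophical work)
        frontier = {p for p in candidates if p not in corpus and is_relevant(p)}
        corpus |= frontier
    return corpus
```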
The Growth of the Field
Why is this survey happening now? Because interest in AQ is exploding. As we rely more on automated systems for information retrieval, writing assistance, and content moderation, the need for machines to understand nuance has grown.

Figure 2 illustrates this trajectory. While the field was relatively quiet in the early 2000s, there was a sharp uptick in publications around 2014-2015, coinciding with the rise of deep learning and more sophisticated NLP techniques. Interestingly, while the number of papers (green line) has remained high, the creation of new datasets (red bars) happens in bursts, peaking around 2015 and 2019.
Core Method: Defining Quality Dimensions
The heart of Computational Argument Quality Assessment lies in definition. You cannot train an AI to score an argument if you cannot define what you are scoring.
Early work in the field focused on specific, isolated aspects. For example, in automated essay scoring, researchers looked at organization and thesis clarity. In online reviews, the focus was often on sentiment (is the review positive or negative?) or helpfulness.
However, as the field matured, researchers realized that argument quality is not a single metric but a hierarchy of dimensions.
The Logic, Dialectic, and Rhetoric Taxonomy
The survey adopts and extends a seminal taxonomy proposed by Wachsmuth et al. (2017), which aligns with the classical Aristotelian division of argumentation into logic, dialectic, and rhetoric. The taxonomy splits quality into three high-level dimensions: Cogency, Reasonableness, and Effectiveness.

Figure 3 provides a detailed map of these dimensions. Let’s break down the three main pillars shown in the diagram:
1. Cogency (Logic)
Highlighted in Green in Figure 3
This dimension focuses on the internal structure of the argument. It asks: Is the argument logically sound?
- Local Acceptability: Are the premises true or believable?
- Local Relevance: Do the premises actually relate to the conclusion?
- Local Sufficiency: Are the premises enough to support the conclusion, or is more evidence needed?
In computational terms, this is often measured by checking for the presence of evidence and the level of support provided.
2. Reasonableness (Dialectic)
Highlighted in Purple in Figure 3
This dimension views the argument as part of a dialogue or debate. It asks: Is this argument a constructive contribution to the discussion?
- Global Acceptability: Would the target audience accept this argument?
- Global Relevance: Does this argument help resolve the issue at hand?
- Global Sufficiency: Does the argument adequately address counter-arguments?
This dimension is crucial for applications like Argument Search, where a system needs to rank arguments based on how useful they are to a user’s query (Recommendation).
3. Effectiveness (Rhetoric)
Highlighted in Red in Figure 3
This dimension is about the impact on the audience. It asks: Does this argument work? This is the most expansive category in the survey, covering aspects like:
- Persuasiveness / Convincingness: Does the argument change minds?
- Clarity: Is it easy to understand?
- Emotional Appeal: Does it resonate emotionally?
- Arrangement: Is the argument structured well?
The authors of the survey found that recent literature has expanded this taxonomy further to include dimensions like Sentiment, Objectivity, and Impact.
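One way to keep these dimensions straight is to write the taxonomy down as a small nested data structure and look sub-dimensions up against it. This is purely an illustration, not code from the survey; the sub-dimension lists follow Wachsmuth et al. (2017) as summarized above.

```python
# A sketch of the quality taxonomy: high-level dimension -> classical
# perspective plus its sub-dimensions. Illustrative only, not the survey's code.
ARGUMENT_QUALITY_TAXONOMY = {
    "cogency": {
        "perspective": "logic",
        "sub_dimensions": ["local acceptability", "local relevance", "local sufficiency"],
    },
    "reasonableness": {
        "perspective": "dialectic",
        "sub_dimensions": ["global acceptability", "global relevance", "global sufficiency"],
    },
    "effectiveness": {
        "perspective": "rhetoric",
        "sub_dimensions": [
            "persuasiveness / convincingness", "clarity",
            "emotional appeal", "arrangement",
        ],
    },
}

def find_dimension(sub_dimension):
    """Return the high-level dimension a sub-dimension belongs to, or None."""
    for dimension, info in ARGUMENT_QUALITY_TAXONOMY.items():
        if sub_dimension in info["sub_dimensions"]:
            return dimension
    return None

print(find_dimension("clarity"))  # -> "effectiveness"
```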
Enhanced Information Utilization
The survey notes a trend toward using “Enhanced Information” to assess these dimensions. Researchers are no longer just looking at the raw text of the argument. They are incorporating:
- Syntactic Features: Sentence length, vocabulary richness, and part-of-speech tagging.
- Contextual Knowledge: External knowledge graphs to check facts or understand cultural context.
- Coherence: Analyzing how well the argument flows with the topic.
For example, a study by Sun et al. (2021) demonstrated that incorporating syntax and coherence information significantly boosts classification performance compared to models that only look at semantic content.
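To make the "syntactic features" bullet more tangible, here is a small sketch using spaCy. It computes the kind of shallow signals listed above (sentence length, vocabulary richness, part-of-speech ratios) and is only an illustration of that feature family, not a reproduction of Sun et al.'s model.

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def syntactic_features(argument: str) -> dict:
    """A few shallow syntactic features of the kind used alongside raw text."""
    doc = nlp(argument)
    tokens = [t for t in doc if not t.is_punct]
    sentences = list(doc.sents)
    return {
        "num_sentences": len(sentences),
        "avg_sentence_length": len(tokens) / max(len(sentences), 1),
        # type-token ratio as a crude measure of vocabulary richness
        "type_token_ratio": len({t.lower_ for t in tokens}) / max(len(tokens), 1),
        # share of verbs as one simple part-of-speech signal
        "verb_ratio": sum(t.pos_ == "VERB" for t in tokens) / max(len(tokens), 1),
    }

print(syntactic_features("Taxes should rise because public schools are underfunded."))
```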
Experiments & Results: The Landscape of Annotated Datasets
In Machine Learning, a model is only as good as its data. The researchers analyzed 32 datasets specifically created for Argument Quality. This analysis reveals both the strengths and significant weaknesses of the current research landscape.
The “English” Problem
One of the most striking findings in the survey is the language bias. Out of the 211 publications and 32 datasets analyzed:
- Almost 100% of the datasets are in English.
- Only one dataset (Toledo-Ronen et al., 2020) is explicitly multilingual.
This is a massive research gap. Argumentation varies wildly across cultures. The structure of a persuasive argument in German, Chinese, or Arabic might differ significantly from English norms. By relying almost exclusively on English data, the field risks creating tools that are culturally biased and globally inapplicable.
Absolute vs. Relative Quality
How do you annotate these datasets? The survey identifies two main approaches: Absolute and Relative assessment.
1. Absolute Quality: Annotators look at a single argument and assign it a score (e.g., 1 to 5 stars for “persuasiveness”).
- Pros: Provides a specific value for every argument.
- Cons: Highly subjective. What counts as a “4” to one person might be a “2” to another. This leads to low inter-annotator agreement.
2. Relative Quality: Annotators are shown two arguments and asked, “Which one is better?”
- Pros: Humans are much better at comparison than absolute scoring. It yields higher agreement and reliability.
- Cons: It doesn’t tell you if both arguments are terrible, only that one is better than the other.
The survey found that while Relative assessment yields better data consistency, over 75% of the datasets still use Absolute measures. This suggests a disconnect: researchers want specific scores (absolute), but the method for getting them is flawed.
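For concreteness, the two annotation schemes might be stored roughly as follows. The field names and values are illustrative and not drawn from any particular dataset.

```python
from dataclasses import dataclass

@dataclass
class AbsoluteJudgment:
    """One annotator scores one argument on a fixed scale (e.g. 1-5)."""
    argument_id: str
    dimension: str   # e.g. "persuasiveness"
    score: int       # e.g. 1 (very poor) .. 5 (very good)

@dataclass
class RelativeJudgment:
    """One annotator compares two arguments on the same topic."""
    argument_a: str
    argument_b: str
    dimension: str   # e.g. "convincingness"
    winner: str      # "a" or "b"; says nothing about absolute quality

# Hypothetical records:
abs_label = AbsoluteJudgment("arg_17", "persuasiveness", 4)
rel_label = RelativeJudgment("arg_17", "arg_42", "convincingness", winner="a")
```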
Dataset Overview
The following tables provide a granular look at the datasets identified by the authors, listing each dataset by name, year, size, annotation approach, and the specific quality dimensions it covers.

In the early years (shown above), you can see a focus on student essays (Persing et al.) and product reviews (TripAdvisor). The dimensions are often singular: “organization,” “thesis clarity,” or “sentiment.”

As the field progressed (shown above), we see the entry of major industrial players like IBM. The IBM-Rank and IBM-Pairs datasets are significant because they introduced large-scale annotation of “convincingness” and “recommendation.” You can also see the mix of “absolute” and “relative” approaches in the third column.

In the most recent datasets (shown above), the complexity increases. FinArgQuality (Alhamzeh, 2023) looks at financial conference calls, checking for specificity and temporal relevance. Appropriateness Corpus (Ziegenbein et al., 2023) moves into content moderation, annotating for “toxic emotions” and “missing intelligibility.”
The Difficulty of Annotation
The survey highlights that annotating argument quality is incredibly difficult. Inter-annotator agreement scores (metrics like Cohen’s kappa) are often low.
- Subjectivity: Dimensions like “persuasiveness” depend heavily on the reader’s prior beliefs. An argument for raising taxes will rarely seem persuasive to a libertarian, regardless of its logical structure.
- Complexity: Annotators struggle with sarcasm, irony, and rhetorical questions.
- Scale: Most datasets use point scales (1-5), but different datasets define these scales differently, making it hard to merge data from different sources.
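To ground the agreement discussion: Cohen's kappa measures how much two annotators agree beyond what chance alone would produce. Here is a quick sketch with scikit-learn, using made-up labels.

```python
# pip install scikit-learn
from sklearn.metrics import cohen_kappa_score

# Hypothetical persuasiveness labels from two annotators on the same 8 arguments
annotator_1 = [1, 3, 4, 2, 5, 3, 2, 4]
annotator_2 = [2, 3, 5, 2, 4, 1, 2, 4]

# Unweighted kappa treats a 1-vs-2 disagreement the same as 1-vs-5 ...
print(cohen_kappa_score(annotator_1, annotator_2))
# ... so ordinal 1-5 scales are often evaluated with a weighted variant instead.
print(cohen_kappa_score(annotator_1, annotator_2, weights="quadratic"))
```

For ordinal point scales, the weighted variant is usually the fairer choice, since it penalizes large disagreements more than near-misses.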
Implications and Future Directions
The paper concludes that while Computational Argument Quality Assessment has made massive strides, it faces significant hurdles before it can be reliably deployed in the real world.
1. The Need for Multilingualism
The NLP community must break the English monopoly. Future work needs to focus on creating high-quality annotated datasets in other languages to understand cross-cultural argumentation standards.
2. Bridging Absolute and Relative
Since relative annotation (A vs. B) is more reliable, but absolute scores (1-10) are more useful for applications, the authors suggest developing methods to mathematically translate relative comparisons into absolute scores. This would give us the best of both worlds.
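One well-known family of methods for this translation (named here as an illustration; the survey does not prescribe a specific technique) is the Bradley-Terry model: every argument gets a latent "strength", pairwise wins are used to fit those strengths, and the fitted values can then serve as absolute scores. A minimal sketch:

```python
import math
from collections import defaultdict

def bradley_terry(pairs, iterations=200, eps=1e-3):
    """Fit latent strengths from pairwise "A beats B" judgments.

    pairs: list of (winner_id, loser_id) tuples from relative annotations.
    Returns {argument_id: strength}; higher strength = judged better overall.
    Uses the classic minorize-maximize update; a sketch, not production code.
    """
    wins = defaultdict(float)
    matches = defaultdict(float)
    items = set()
    for winner, loser in pairs:
        wins[winner] += 1
        matches[(winner, loser)] += 1
        matches[(loser, winner)] += 1
        items.update((winner, loser))

    strength = {i: 1.0 for i in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denom = sum(
                matches[(i, j)] / (strength[i] + strength[j])
                for j in items if j != i
            )
            # eps keeps arguments that never won at a tiny positive strength
            updated[i] = (wins[i] + eps) / denom
        # fix the scale: normalise so the geometric mean of strengths is 1
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {i: s / math.exp(log_mean) for i, s in updated.items()}
    return strength

# Hypothetical relative judgments: each tuple means "first argument won".
comparisons = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "b")]
print(bradley_terry(comparisons))  # "a" ends up with the highest strength
```

The same idea underlies Elo-style ratings: any of these approaches turns a pile of "A beats B" judgments into a ranked, scored list of arguments.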
3. Handling Subjectivity
We need to stop treating subjectivity as “noise” to be removed. The “eye of the beholder” is central to argumentation. Future datasets should model the annotator’s background—their political leanings, education, and prior beliefs—as part of the data. Instead of asking “Is this argument persuasive?”, AI should learn to predict “Is this argument persuasive to this specific audience?”
4. Beyond Text
Argumentation doesn’t just happen in essays. It happens in debates, videos, and podcasts. The authors point to a need for multimodal analysis—using audio (tone of voice) and video (facial expressions/gestures) to assess quality alongside the text.
Conclusion
The survey by Ivanova et al. serves as a crucial roadmap for students and researchers entering the field of Argumentation Quality. It moves the conversation beyond simple identification of arguments to the nuanced evaluation of their worth.
By categorizing quality into Logic, Dialectic, and Rhetoric, and by critically evaluating the existing data, the authors have exposed the fragility of current AI models. We have built systems that can read English essays and guess if they are organized, but we are far from systems that can truly appreciate the “quality” of a complex, cross-cultural, or spoken debate.
For the aspiring data scientist or NLP engineer, this represents a frontier of opportunity. The tools for assessing truth and persuasion are being built right now, and the next breakthrough lies in teaching machines not just what we say, but how well we say it.