“I am a great leader because I make great leadership decisions.”

At first glance, that sentence might sound confident. But if you look closer, it’s empty. It’s a classic example of Circular Reasoning—the conclusion is just a restatement of the premise.

We encounter defective arguments like this every day. Whether it’s “Appeal to Emotion” in advertisements, “Ad Hominem” attacks in political debates, or “False Dilemmas” in social media comments, logical fallacies are the building blocks of misinformation and manipulation. Detecting them automatically is a crucial task for Natural Language Processing (NLP), but it has historically been very difficult.

In this post, we are doing a deep dive into the paper “Are LLMs Good Zero-Shot Fallacy Classifiers?”. The researchers explore a fascinating question: Instead of training specialized models on expensive, hard-to-find datasets, can we simply ask Large Language Models (LLMs) like GPT-4 or Llama-3 to spot these fallacies for us?

The results offer a surprising look into how LLMs “reason” and provide a playbook for how to prompt them effectively.

The Problem with Traditional Detection

Before we jump into LLMs, we need to understand the status quo. Traditionally, if you wanted to build a system to detect fallacies, you followed a full-shot supervised learning pipeline (sketched in code after the list below). You would:

  1. Hire experts to read thousands of sentences and label them (e.g., “This is a Straw Man fallacy”).
  2. Train a model (like BERT or T5) on this data.
  3. Test the model.
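
In code, that pipeline might look roughly like the sketch below, using Hugging Face’s transformers library. The CSV files, column names, model choice, and hyperparameters are illustrative assumptions rather than any paper’s actual setup.

```python
# Sketch of the traditional full-shot supervised pipeline (hypothetical data files).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Step 1: expert-labelled examples, e.g. columns "text" and "fallacy_label".
data = load_dataset("csv", data_files={"train": "fallacies_train.csv",
                                       "test": "fallacies_test.csv"})
labels = sorted(set(data["train"]["fallacy_label"]))
label2id = {name: i for i, name in enumerate(labels)}

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode(batch):
    enc = tok(batch["text"], truncation=True, padding="max_length", max_length=128)
    enc["labels"] = [label2id[name] for name in batch["fallacy_label"]]
    return enc

data = data.map(encode, batched=True)

# Step 2: fine-tune a pretrained encoder on the labelled data.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))
trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="fallacy-bert", num_train_epochs=3),
                  train_dataset=data["train"],
                  eval_dataset=data["test"])
trainer.train()

# Step 3: test the model on the held-out split.
print(trainer.evaluate())
```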

This approach has three major flaws:

  1. Data Scarcity: Fallacies are complex, and creating a labeled dataset is expensive and time-consuming.
  2. Imbalance: Some fallacies, like Ad Hominem, are common. Others, like Equivocation, are rare. Models often get good at the common ones and fail on the rare ones.
  3. The Generalization Gap: This is the biggest issue. A model trained on political debates often fails miserably when tested on Reddit comments. The “distribution” of language is just too different. This is known as the Out-of-Distribution (OOD) problem.

The authors of this paper propose a shift in strategy. Since LLMs are pre-trained on vast amounts of text containing logic, debates, and definitions, perhaps they possess the inherent knowledge to classify fallacies zero-shot—meaning without seeing a single training example.

What Does a Fallacy Look Like?

To understand the challenge, let’s look at what the models are up against. Fallacies aren’t just about grammar; they are about the logical link between premises and conclusions.

Figure 1: Examples of fallacies and their types from existing datasets.

As shown in Figure 1, fallacies appear in various contexts. In the “Reddit” example, a user creates a False Dilemma (implying you must either hate technology or live in a cave). In the “Propaganda” example, the text uses Name-calling. The diversity of these styles—from formal debates to casual internet slang—is exactly why traditional models struggle to adapt.

The Methodology: Prompting for Logic

The core contribution of this paper is not a new model architecture, but a systematic exploration of prompting schemes. How do you ask an LLM to find a fallacy? The authors tested two main categories: Single-Round and Multi-Round prompting.

1. Zero-Shot Single-Round Prompting

This is the most straightforward approach. You give the LLM the text and a list of fallacy types, and you ask: “Which fallacy is this?”

The authors tested two variations, sketched in the code example after this list:

  • Without Definitions: Relying entirely on the LLM’s internal knowledge of terms like “Red Herring” or “Straw Man.”
  • With Definitions: Providing a specific definition for each fallacy type in the prompt to guide the model.
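
The sketch below shows both variants; the prompt wording and the fallacy list are illustrative assumptions, not the paper’s exact templates (samples of those appear later in Table 9).

```python
# Sketch of a single-round zero-shot prompt, with or without definitions.
FALLACIES = ["Ad Hominem", "Appeal to Emotion", "False Dilemma",
             "Red Herring", "Circular Reasoning", "No Fallacy"]

DEFINITIONS = {  # short, illustrative definitions
    "Ad Hominem": "attacking the person instead of the argument",
    "Appeal to Emotion": "using emotion in place of evidence",
    "False Dilemma": "presenting only two options when more exist",
    "Red Herring": "introducing an irrelevant point to distract",
    "Circular Reasoning": "assuming the conclusion in the premise",
    "No Fallacy": "the argument contains no obvious fallacy",
}

def single_round_prompt(text: str, with_definitions: bool = False) -> str:
    if with_definitions:
        label_block = "\n".join(f"- {name}: {DEFINITIONS[name]}" for name in FALLACIES)
    else:
        label_block = "\n".join(f"- {name}" for name in FALLACIES)
    return ("You are given a piece of text and a list of fallacy types.\n"
            f"Fallacy types:\n{label_block}\n\n"
            f"Text: {text}\n\n"
            "Which fallacy type from the list does the text contain? "
            'Answer in JSON: {"fallacy": "<type>"}')
```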

2. Zero-Shot Multi-Round Prompting

This is where the research gets innovative. The authors hypothesized that fallacy classification requires complex reasoning steps: reading comprehension, information extraction, and pattern recognition. A single “guess the label” prompt might be too demanding.

To solve this, they designed Multi-Round strategies to guide the LLM’s “thought process.”

Figure 2: Illustration of single-round and multi-round prompting schemes.

As illustrated in Figure 2, the authors broke the task down into several distinct logical flows (a sketch of the GFA flow follows the list):

  • (b) Definition Generation: Instead of giving definitions, the authors ask the LLM to generate its own definitions for the fallacy types first. In the second round, the LLM uses its own definitions to classify the text. This aligns the classification criteria with the model’s internal representation.
  • (c) General Fallacy Analysis (GFA): The model is first asked to analyze the text and determine if it is fallacious and why, without picking a label yet. Once the analysis is generated, the model is asked to map that analysis to a specific label.
  • (d) GFA with Warm-up: Some datasets (like news snippets) lack context. Here, the model is asked to summarize or infer context (the “Warm-up”) before moving to the analysis phase.
  • (e) Premises & Conclusion: This approach attempts to use formal logic. The model extracts the premises and the conclusion, checks if the premises entail the conclusion (the formal definition of a sound argument), and then classifies the error.
  • (f) Zero-Shot CoT (Chain-of-Thought): The classic “Let’s think step by step” prompt is used to encourage intermediate reasoning within a single response.
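
The GFA flow, for example, boils down to two chained prompts. The sketch below assumes a generic chat(messages) helper that wraps whatever chat-completion API you use (OpenAI, a local Llama-3 server, etc.); the prompt wording is illustrative rather than the paper’s template.

```python
# Sketch of the two-round General Fallacy Analysis (GFA) scheme.
def classify_gfa(text: str, fallacy_types: list[str], chat) -> str:
    # Round 1: free-form analysis, no label yet.
    messages = [{"role": "user", "content":
                 "Read the following text and analyze whether its reasoning is "
                 f"fallacious, and why. Do not name a fallacy type yet.\n\nText: {text}"}]
    analysis = chat(messages)

    # Round 2: map the analysis to a specific label.
    messages += [{"role": "assistant", "content": analysis},
                 {"role": "user", "content":
                  "Based on your analysis, choose exactly one label from this list: "
                  + ", ".join(fallacy_types)
                  + '. Answer in JSON: {"fallacy": "<label>"}'}]
    return chat(messages)
```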

To make this reproducible, the authors utilized structured prompt templates, ensuring the outputs were machine-readable (JSON).

Table 9: Sample templates of our proposed single-round and multi-round prompting schemes.
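
Building on that, here is a small helper one might use to consume such structured outputs; it is an assumption about the post-processing, not the authors’ code.

```python
# Pull the predicted label out of a (hopefully) JSON-formatted model reply.
import json
import re

def parse_label(reply: str, fallacy_types: list[str], default: str = "No Fallacy") -> str:
    match = re.search(r"\{.*?\}", reply, flags=re.DOTALL)  # first JSON-looking object
    if match:
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            return default
        label = obj.get("fallacy") if isinstance(obj, dict) else None
        if label in fallacy_types:
            return label
    return default  # fall back when the reply is malformed or off-list
```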

Experimental Results

The researchers tested these prompts on 7 benchmark datasets ranging from political debates (ElecDeb) to COVID-19 news (COVID-19) and internet arguments (Argotario, Reddit). They compared several LLMs (GPT-4, GPT-3.5, Llama-3, Mistral, Qwen) against a state-of-the-art T5 model that was fine-tuned (trained) specifically on fallacy data.

1. LLMs vs. Supervised Models

The results revealed a major victory for LLMs in terms of generalization.

Table 2: Fallacy classification results of Macro-F1.

Take a close look at Table 2. The blue numbers represent the T5 model’s performance on Out-of-Distribution (OOD) data—datasets it wasn’t trained on. The red numbers represent the zero-shot LLMs.
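
A quick note on the metric: Macro-F1 is the unweighted average of the per-class F1 scores, so a rare fallacy type counts exactly as much as a common one, which is what makes it a sensible choice given the class imbalance mentioned earlier. A toy sketch with scikit-learn (labels invented for illustration):

```python
# Macro-F1 = unweighted mean of the per-class F1 scores (toy labels).
from sklearn.metrics import f1_score

y_true = ["Ad Hominem", "Ad Hominem", "Red Herring", "No Fallacy"]
y_pred = ["Ad Hominem", "No Fallacy", "Red Herring", "No Fallacy"]

print(f1_score(y_true, y_pred, average="macro"))  # averages F1 over all three classes
```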

Key Findings:

  • OOD Dominance: Zero-shot LLMs consistently outperformed the fully trained T5 baseline in OOD scenarios. For example, on the MAFALDA dataset, the T5 model scored a meager 25.13 F1 score, while GPT-4 scored 52.86. This suggests that supervised models are brittle; they learn specific dataset quirks rather than the underlying concept of fallacies.
  • Open Domain Success: On simpler, open-domain datasets like Argotario and Reddit, GPT-4 achieved results comparable to, or even better than, the fully trained models (78.94 vs 69.13 on Argotario).
  • Difficulty with Niche Domains: On highly specific domains like Logic or Propaganda, LLMs still trailed behind the specialized T5-3B model, likely because those datasets contain very specific or academic definitions of fallacies that differ from general usage.

2. Does Multi-Round Prompting Help?

Does asking the model to “think” before answering actually improve the score? The answer is yes, but it depends on the model size.

Table 4: Best two zero-shot prompting schemes for different base models and data domains based on average Macro-F1 rankings.

Table 4 summarizes the most effective strategies:

  • For Powerful Models (GPT-4): Simpler is often better. Single-round prompting (with definitions) or Zero-shot CoT worked best. GPT-4 already has strong internal reasoning, so forcing it through multiple formal steps sometimes over-complicated the task.
  • For Smaller Models (Llama-3, Qwen, Mistral): Multi-round prompting was a game-changer. Strategies like General Fallacy Analysis with Warm-up (GFA-W) significantly boosted performance. These models benefited from breaking the task into “Summarize context” → “Analyze logic” → “Classify.”

3. The Failure of “Premises & Conclusion”

One of the most interesting negative results came from the Premises & Conclusion (P&C) scheme. In theory, this is the most “correct” way to detect a fallacy: identify the premise, identify the conclusion, and check the link.
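
Concretely, the P&C flow can be pictured as three chained prompts, reusing the generic chat(messages) helper assumed in the GFA sketch above; again, the wording is illustrative rather than the paper’s template.

```python
# Sketch of the three-round Premises & Conclusion (P&C) scheme.
def classify_pc(text: str, fallacy_types: list[str], chat) -> str:
    # Round 1: extract premises and conclusion.
    messages = [{"role": "user", "content":
                 f"Extract the premises and the conclusion of this argument.\n\nText: {text}"}]
    extraction = chat(messages)

    # Round 2: check whether the premises entail the conclusion.
    messages += [{"role": "assistant", "content": extraction},
                 {"role": "user", "content":
                  "Do the premises logically entail the conclusion? Explain briefly."}]
    entailment = chat(messages)

    # Round 3: classify the flaw, if any.
    messages += [{"role": "assistant", "content": entailment},
                 {"role": "user", "content":
                  "If the argument is flawed, which fallacy type best describes the flaw? "
                  "Choose exactly one from: " + ", ".join(fallacy_types)
                  + '. Answer in JSON: {"fallacy": "<label>"}'}]
    return chat(messages)
```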

However, Table 3 (below) shows that P&C ranked as the worst performing multi-round scheme.

Table 3: Overall rankings on Macro-F1 of multi-round prompting schemes.

Why did formal logic fail? The authors analyzed the error cases and found that prompting for “entailment” (a strict logical concept) confused the models. The models started rejecting informal but reasonable arguments because they weren’t deductively watertight, or they got distracted by the mechanics of premise-conclusion extraction rather than the rhetorical flaw itself. It turns out that treating fallacy classification as a natural language analysis task (GFA) works better than treating it as a formal logic task.

Error Analysis: Where do LLMs get confused?

No model is perfect. To understand where LLMs fail, the authors visualized the confusion matrices—charts that show which fallacies are mistaken for others.
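
For readers who haven’t met the format before, a confusion matrix like the one in Figure 3 can be produced in a few lines with scikit-learn; this is a sketch with toy predictions, not the paper’s plotting code.

```python
# Rows are gold labels, columns are predictions; off-diagonal cells are confusions.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

labels = ["Appeal to Emotion", "Ad Hominem", "Red Herring", "No Fallacy"]
y_true = ["Ad Hominem", "Red Herring", "No Fallacy", "Appeal to Emotion"]
y_pred = ["Ad Hominem", "Appeal to Emotion", "No Fallacy", "Appeal to Emotion"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
ConfusionMatrixDisplay(cm, display_labels=labels).plot(xticks_rotation=45)
plt.show()
```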

Figure 3: Misclassification confusion matrix of common fallacy types given by GPT-4 and Llama3-Chat (8B).

Figure 3 highlights distinct personality differences between models:

  • GPT-4 (Left): Notice the dark blue square for “No Fallacy” (bottom right). GPT-4 is conservative. It is very good at identifying when no fallacy is present, but it tends to over-predict “No Fallacy” when it’s unsure. It also struggles with the “Hasty Generalization” bucket, often dumping other weak arguments into that category.
  • Llama-3 (Right): Llama-3 is more aggressive. It frequently predicts “Appeal to Emotion” (top left) even when it might be a different fallacy. It struggles more with distinguishing “Red Herring” from other relevance fallacies.

Conclusion and Implications

The paper “Are LLMs Good Zero-Shot Fallacy Classifiers?” provides a compelling answer: Yes, potentially.

While zero-shot LLMs haven’t completely beaten fully supervised models on every dataset, they sidestep the biggest bottleneck in the field: Data Scarcity.

  1. No Training Required: You don’t need to spend thousands of dollars annotating data to get a decent fallacy detector.
  2. Better Generalization: If you want a system that works on new, unseen topics (like a breaking news event), an LLM is a safer bet than a trained BERT/T5 model.
  3. Prompting Matters: If you are using a smaller, open-source model (like Llama-3), you shouldn’t just ask “What fallacy is this?” Instead, use a multi-round workflow: ask it to summarize the context, analyze the reasoning, and then classify.

For students and researchers, this paper highlights that we are moving away from “training models” toward “designing reasoning chains.” The future of fallacy detection likely lies in hybrid systems—using the robust general knowledge of LLMs guided by sophisticated, multi-step prompting strategies.