Introduction

Since the arrival of Transformers and Large Language Models (LLMs) like GPT-4 and LLaMA, a singular question has dominated the field of Natural Language Processing (NLP): Do these models actually reason, or are they just sophisticated parrots?

We know LLMs are incredibly proficient at language. They can write poetry, summarize emails, and even generate code. But proficiency in language does not automatically equate to proficiency in logic. When a human solves a math problem or a logic puzzle, they are (usually) applying strict deductive rules. When an LLM does it, is it doing the same? Or is it acting as a “soft reasoner”—emulating the appearance of reasoning based on statistical patterns and surface-level semantics?

This distinction matters. If we want to trust AI in critical fields like law, medicine, or programming, we need to know if they can detach form from meaning to draw valid conclusions.

In a fascinating research paper titled “A Systematic Analysis of Large Language Models as Soft Reasoners,” researchers Leonardo Bertolazzi, Albert Gatt, and Raffaella Bernardi dive deep into this question using one of the oldest tools in the history of logic: the Aristotelian syllogism.

Their work offers a comprehensive look at how LLMs handle deductive reasoning, identifying human-like biases and testing whether we can “teach” these models to ignore the meaning of words and focus purely on logic.

Figure 1 illustrates three syllogisms: one invalid, one valid but unbelievable, and one with made-up words. This highlights the core difficulties LLMs face: invalid inferences, content effects, and multi-step reasoning.

As shown in the figure above, the researchers identified three critical weaknesses in pre-trained LLMs:

  1. Difficulty with Invalid Inferences: They struggle to admit when “nothing follows.”
  2. Content Effects: They prefer conclusions that “sound” true in the real world, even if the logic is flawed.
  3. Complex Chains: They struggle when the logic requires multiple steps.

In this post, we will break down their methodology, their experiments with “nonsense” words, and their surprising findings about how LLMs actually “think.”

Background: The Logic of Syllogisms

To understand the paper, we first need a quick refresher on the syllogism. These logic puzzles have been studied extensively in cognitive psychology, which makes them an ideal benchmark for comparing AI “brains” to human ones.

A syllogism consists of two premises and a conclusion, each built around one of four quantifiers: “All,” “No,” “Some,” and “Some… not.”

The structure is defined by Moods and Figures.

  • Moods: The type of each statement. The four moods are A (Universal Affirmative, “All A are B”), E (Universal Negative, “No A are B”), I (Particular Affirmative, “Some A are B”), and O (Particular Negative, “Some A are not B”).
  • Figures: The arrangement of the terms across the two premises, i.e., which positions the shared “middle” term occupies. There are four figures (1-4).

Figure 2 displays the building blocks of syllogisms: The Moods (A, E, I, O) and Figures (1-4). It shows how combining them, such as in the AE2 schema, creates a specific logical structure.

There are 64 possible pairs of premises. According to Aristotelian logic, only 27 of these lead to a valid conclusion. The other 37 represent “invalid” schemas—meaning that based only on the premises provided, nothing follows.
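
To make the combinatorics concrete, here is a minimal Python sketch (my own illustration, not the paper's code) that enumerates the 4 moods × 4 moods × 4 figures = 64 premise-pair schemas. The figure numbering follows the traditional convention and may differ from the paper's exact encoding.

```python
from itertools import product

# The four moods: statement templates over two term slots.
MOODS = {
    "A": "All {x} are {y}",        # universal affirmative
    "E": "No {x} are {y}",         # universal negative
    "I": "Some {x} are {y}",       # particular affirmative
    "O": "Some {x} are not {y}",   # particular negative
}

# The four figures: how the middle term M is arranged across the premises
# that relate the subject S and predicate P of the conclusion.
FIGURES = {
    1: (("M", "P"), ("S", "M")),
    2: (("P", "M"), ("S", "M")),
    3: (("M", "P"), ("M", "S")),
    4: (("P", "M"), ("M", "S")),
}

def schema(mood1, mood2, figure):
    """Render one premise-pair schema, e.g. AE2, as two premise strings."""
    (x1, y1), (x2, y2) = FIGURES[figure]
    p1 = MOODS[mood1].format(x=x1, y=y1)
    p2 = MOODS[mood2].format(x=x2, y=y2)
    return f"{mood1}{mood2}{figure}", p1, p2

schemas = [schema(m1, m2, f) for m1, m2, f in product(MOODS, MOODS, FIGURES)]
print(len(schemas))            # 64 possible premise pairs
print(schema("A", "E", 2))     # ('AE2', 'All P are M', 'No S are M')
```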

The Human Element: Content Bias

Why use syllogisms to test AI? Because humans are notoriously bad at them in very specific ways. Psychology tells us that humans suffer from a content effect bias. We tend to accept a conclusion if it aligns with our world knowledge, even if it doesn’t logically follow from the premises.

For example:

  • Premise 1: All flowers need water.
  • Premise 2: Roses need water.
  • Conclusion: Roses are flowers.

This feels correct because roses are flowers in the real world. However, logically, this is invalid. The premises do not prove that roses are flowers (roses could be a separate category that also needs water).
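
One way to demonstrate the invalidity is to construct a counterexample: an interpretation of the terms under which both premises are true but the conclusion is false. Here is a minimal sketch (my own illustration; the sets and names are made up):

```python
# Treat the terms as arbitrary sets. If ANY interpretation makes both
# premises true and the conclusion false, the argument form is invalid.
needs_water = {"tulip", "daisy", "cactus"}
flowers     = {"tulip", "daisy"}
roses       = {"cactus"}   # a deliberately "unbelievable" interpretation

premise_1  = flowers.issubset(needs_water)  # "All flowers need water" -> True
premise_2  = roses.issubset(needs_water)    # "Roses need water"       -> True
conclusion = roses.issubset(flowers)        # "Roses are flowers"      -> False

print(premise_1, premise_2, conclusion)     # True True False
```

Only the set relationships matter to the logic, which is exactly what a reasoner swayed by content bias fails to respect.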

The researchers aimed to see if LLMs exhibit these same human-like biases or if they can act as pure “logic machines.”

The Core Method: Investigating Learning Strategies

The researchers didn’t just test one model; they performed a systematic analysis of different learning strategies using open-access models (Pythia and LLaMA). They wanted to see if the ability to reason could be elicited or taught.

They compared three distinct approaches:

1. Zero-Shot Chain-of-Thought (ZS-CoT)

This is the baseline. The model is given the syllogism and a prompt to “think step by step” before answering. No examples are provided. This tests the model’s innate, pre-trained capabilities.
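
Concretely, ZS-CoT boils down to appending a “think step by step” trigger to the question. The template below is a hypothetical sketch of what such a prompt could look like, not the paper's exact wording:

```python
def zero_shot_cot_prompt(premise_1: str, premise_2: str) -> str:
    """Build a hypothetical zero-shot chain-of-thought prompt."""
    return (
        f"Premise 1: {premise_1}\n"
        f"Premise 2: {premise_2}\n"
        "Question: What, if anything, follows logically from these premises? "
        "If no conclusion follows, answer 'nothing follows'.\n"
        "Let's think step by step."
    )

print(zero_shot_cot_prompt("All glorps are smeefs.", "No smeefs are drams."))
```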

2. In-Context Learning (ICL)

In this setting, the model is given a few examples (demonstrations) of solved syllogisms in the prompt before the actual test question. The researchers split this into two clever sub-categories:

  • \(ICL_{in}\) (In-Schema): The examples provided use the exact same logical structure (schema) as the test question.
  • \(ICL_{out}\) (Out-of-Schema): The examples provided use different logical structures than the test question.

This distinction is crucial because it tests whether the model is actually learning the logic from the examples or just copying the pattern of the answer.
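
A sketch of how the two demonstration regimes could be assembled (the item format and schema labels here are my own assumptions, not the paper's exact setup):

```python
import random

def build_icl_prompt(test_item, demo_pool, k=4, in_schema=True, seed=0):
    """Assemble an ICL prompt from k solved demonstrations.

    ICL_in  (in_schema=True):  demonstrations share the test item's schema.
    ICL_out (in_schema=False): demonstrations come from different schemas.

    Items are assumed to look like:
      {"schema": "AE2", "premises": ["...", "..."], "conclusion": "..."}
    """
    rng = random.Random(seed)
    same  = [d for d in demo_pool if d["schema"] == test_item["schema"]]
    other = [d for d in demo_pool if d["schema"] != test_item["schema"]]
    demos = rng.sample(same if in_schema else other, k)

    blocks = [
        "\n".join(d["premises"]) + f"\nConclusion: {d['conclusion']}"
        for d in demos
    ]
    blocks.append("\n".join(test_item["premises"]) + "\nConclusion:")
    return "\n\n".join(blocks)
```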

3. Supervised Fine-Tuning (SFT)

Here, the researchers went a step further. They took the pre-trained models and fine-tuned them (briefly continued their training) on a dedicated dataset of syllogisms.

The Twist: Pseudo-Words

To force the models to learn logic (form) rather than relying on meaning (content), the researchers created a dataset using pseudo-words—nonsense words generated by a computer.

Instead of “All cats are mammals,” the model might see “All glorps are smeefs.”

If the model can solve the puzzle with glorps and smeefs, it proves it isn’t cheating by using its knowledge of cats. It has to understand the logical relationship between the variables.
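
A minimal sketch of how such pseudo-word syllogisms could be generated (the syllable inventory, the template, and the pluralization are my own assumptions; the paper simply used computer-generated pseudo-words):

```python
import random

CONSONANTS = "bdfgklmnprstvz"
VOWELS = "aeiou"

def pseudo_word(rng, syllables=2):
    """Generate a pronounceable nonsense word, e.g. something like 'tobuk'."""
    return "".join(rng.choice(CONSONANTS) + rng.choice(VOWELS)
                   for _ in range(syllables)) + rng.choice(CONSONANTS)

def nonsense_syllogism(rng):
    """Fill a simple all/all template with three fresh pseudo-words."""
    a, b, c = (pseudo_word(rng) for _ in range(3))
    return (f"All {a}s are {b}s.",
            f"All {b}s are {c}s.",
            f"Conclusion: All {a}s are {c}s.")

rng = random.Random(0)
for line in nonsense_syllogism(rng):
    print(line)
```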

Figure 3 illustrates the two main pipelines. On the left, In-Context Learning shows the model receiving examples (same or different schema). On the right, Supervised Learning shows the model being trained directly on the task.

Experiments and Results

The researchers ran these models through a gauntlet of tests involving believable syllogisms (consistent with reality), unbelievable ones (violating reality), and complex multi-step chains.

Here is what they found.

1. Pre-trained LLMs act like Humans (in a bad way)

In the Zero-Shot setting (ZS-CoT), the models behaved remarkably like human undergraduates who haven’t studied logic.

  • Content Bias: The models were much more likely to choose a conclusion if it was “believable” in the real world, regardless of the logic.
  • The “Nothing Follows” Problem: The models struggled immensely with invalid syllogisms. They almost always tried to force a conclusion rather than admitting that the premises led nowhere.

This suggests that out-of-the-box LLMs are indeed “soft reasoners.” They aren’t running a logic engine; they are predicting the next word based on probability and “vibes.”

2. In-Context Learning (ICL) isn’t enough

Giving the models examples (ICL) helped improve accuracy on valid inferences, but it introduced a new problem: Inconsistency.

The researchers analyzed not just whether the model reached the right answer, but every conclusion it generated along the way (since free-form generation can produce several candidate conclusions), and they looked for contradictions. For example, a model might say “All A are B” and then immediately say “Some A are not B.”
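
A sketch of the kind of consistency check this involves: treat a set of generated conclusions as inconsistent if it contains a statement and its contradictory (A vs. O, or E vs. I, over the same terms). This is my own illustration of the idea, not the paper's evaluation code:

```python
import re

# Contradictory mood pairs in the traditional square of opposition:
# "All X are Y"  contradicts "Some X are not Y" (A vs. O)
# "No X are Y"   contradicts "Some X are Y"     (E vs. I)
PATTERNS = {
    "A": re.compile(r"^All (\w+) are (\w+)$"),
    "E": re.compile(r"^No (\w+) are (\w+)$"),
    "I": re.compile(r"^Some (\w+) are (\w+)$"),
    "O": re.compile(r"^Some (\w+) are not (\w+)$"),
}
CONTRADICTS = {("A", "O"), ("O", "A"), ("E", "I"), ("I", "E")}

def parse(conclusion):
    """Return (mood, terms) for a recognized conclusion, else None."""
    for mood, pat in PATTERNS.items():
        m = pat.match(conclusion.strip().rstrip("."))
        if m:
            return mood, m.groups()
    return None

def is_inconsistent(conclusions):
    """True if any two generated conclusions contradict each other."""
    parsed = [p for p in map(parse, conclusions) if p]
    return any((m1, m2) in CONTRADICTS and t1 == t2
               for i, (m1, t1) in enumerate(parsed)
               for (m2, t2) in parsed[i + 1:])

print(is_inconsistent(["All A are B", "Some A are not B"]))  # True
```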

ICL made the models “hallucinate” logic more often, generating contradictory conclusions. Even when given examples of the exact same logical schema (\(ICL_{in}\)), the models failed to consistently recognize invalid syllogisms.

3. Supervised Fine-Tuning (SFT) is the Winner

The models fine-tuned on the nonsense pseudo-words showed a massive improvement.

  • Accuracy: They reached near-perfect performance on valid syllogisms.
  • Bias Removal: Because they were trained on nonsense words, they learned to ignore the “content bias.” When tested on real words, they no longer favored “believable” conclusions over logical ones.
  • Consistency: Unlike the ICL models, the SFT models stopped contradicting themselves.

This is a powerful finding: You can teach an LLM to reason logically, but you have to train it specifically on the form of logic, detached from the meaning of words.

4. The “Atmosphere” Heuristic

Perhaps the most interesting part of the analysis was why the models were failing in the Zero-Shot settings. The researchers compared the models’ errors to theories from cognitive science.

They found that the models’ behavior strongly correlated with the Atmosphere Theory. This theory suggests that reasoners simply match the “mood” of the premises.

  • If the premises use “All” (Universal), the model guesses a conclusion with “All.”
  • If the premises use “No” (Negative), the model guesses a conclusion with “No.”
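
The rule is simple enough to write down directly. Here is a rough sketch of the heuristic's prediction (a simplified rendering of Atmosphere Theory, not the scoring procedure used in the paper):

```python
def atmosphere_prediction(mood1, mood2):
    """Predict the conclusion mood under the Atmosphere heuristic:
    the conclusion is negative if any premise is negative (E or O),
    and particular if any premise is particular (I or O)."""
    negative = {"E", "O"}
    particular = {"I", "O"}
    neg = mood1 in negative or mood2 in negative
    part = mood1 in particular or mood2 in particular
    if neg and part:
        return "O"   # "Some ... are not ..."
    if neg:
        return "E"   # "No ... are ..."
    if part:
        return "I"   # "Some ... are ..."
    return "A"       # "All ... are ..."

print(atmosphere_prediction("A", "A"))  # 'A' -> guesses "All ..."
print(atmosphere_prediction("A", "E"))  # 'E' -> guesses "No ..."
print(atmosphere_prediction("I", "E"))  # 'O' -> guesses "Some ... are not ..."
```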

The heatmap below illustrates this analysis. It shows the proportion of the models’ conclusions that were predicted by various heuristic theories.

Figure 10 presents heatmaps comparing model predictions against heuristic theories like Atmosphere, Matching, and Conversion. The high values for Atmosphere (top left) suggest LLMs rely on premise moods to guess answers.

Look at the LLaMA-3-8b ZS-CoT row (top) under the Atmosphere column. The model’s behavior is highly predictable by this simple heuristic. This confirms that without fine-tuning, the model isn’t really “thinking”—it’s matching the linguistic pattern of the quantifiers.

5. The Limits of Generalization

While Fine-Tuning (SFT) was the most effective strategy, it wasn’t magic. The researchers tested the models on “longer” inferences—chains of logic requiring 3 or 4 premises instead of the standard 2.

Table 3 compares the models’ performance on unseen numbers of premises. It shows that while SFT performs best overall, its performance drops significantly as the chain of reasoning gets longer (from 2 to 3 to 4 premises).

As the table above shows, even the Supervised Fine-Tuned models (the bottom rows for Pythia and LLaMA) saw a drop in performance when moving from 2 premises to 4. This indicates that while they learned the form of a standard syllogism, they struggled to generalize that logic to longer, unseen sequences. They haven’t fully mastered the recursive nature of logic; they’ve just become very good at the specific pattern of 2-premise syllogisms.

Consistency vs. Completeness

Finally, the researchers visualized the trade-off between Inconsistency (contradicting oneself) and Incompleteness (missing valid conclusions).

Figure 9 plots Inconsistency vs. Incompleteness. LLaMA (green dots) is generally consistent, while Pythia (purple/pink) struggles more. The chart shows how SFT improves consistency significantly compared to ICL.

This chart helps visualize the “personality” of the models.

  • Pythia (Pink/Purple): Tends to be highly inconsistent (right side of the X-axis), meaning it often contradicts itself.
  • LLaMA (Green): Is much more consistent (left side of the X-axis).
  • SFT Impact: Notice how the SFT dots (circles) are generally closer to the origin (0,0) or the left axis compared to the ICL dots (triangles/squares), especially for Pythia. SFT reins in the chaos of the model’s generation.

Conclusion and Implications

So, are LLMs reasoners? The answer, as is often the case in science, is “it depends.”

This paper demonstrates that pre-trained LLMs are soft reasoners. They rely on heuristics like the Atmosphere theory—matching the linguistic “vibe” of the prompt rather than executing a logical operation. They are easily swayed by content bias, preferring answers that “sound right” over answers that are logically valid.

However, the research also provides a roadmap for fixing this. By utilizing Supervised Fine-Tuning (SFT) on abstract, nonsense data (pseudo-words), we can effectively teach these models to prioritize form over content. This process:

  1. Mitigates the content effect bias.
  2. Helps the model recognize invalid inferences (“nothing follows”).
  3. Drastically reduces logical contradictions.

For students and future AI researchers, this highlights a critical design principle: Don’t assume an LLM knows logic just because it knows language. Reasoning is a distinct skill that appears to require specific, formal training to decouple it from the statistical probabilities of language generation.

While we haven’t reached the point of a perfect “Silicon Aristotle”—evidenced by the struggle with longer reasoning chains—this work suggests that with the right curriculum, we can teach our soft reasoners to be a little bit harder.